Date: Thu, 2 Nov 2000 16:50:33 +0100
To: lwn@lwn.net
Subject: Toronto Linux Expo, day 3: Clustering and HA.

Here are my notes for the "Clustering and High Availability" track
for day 3 of Linux Expo Toronto 2000.

Note the scoop: Google is going to index newsgroups!

Previous reports:

[Day 1: Linux 2.4 and Security]
http://lwn.net/2000/1102/a/events-le-tor.php3 

[Day 2: Keynotes]
http://lwn.net/2000/1102/a/events-le-tor-2.php3

        S. Fermigier, Toronto, November 1, 2000

Nick Carr
=========

"Clustering in the Linux Commercial Environment". Mission Critical Linux.

Gartner Group (October 2000): 45% of large enterprises are evaluating Linux;
25% of North American and 36% of European companies say they have a Linux
implementation.

Linux server deployment is driven by low initial cost, reliability and low
operating cost. Linux is still missing professional services, 24x7 support and
enterprise-level high-availability solutions.

The cluster market is now differentiating into 3 important segments: high
performance computing, load balancing and high availability.

HPC solutions (like Beowulf) are not suited for commercial applications. The
end-user is responsible for parallelizing the applications. The largest running
Beowulf is a 528-node system with PIII-800 CPUs. Network topology is critical
for performance. (Pictures on www.beowulf.org.) All (American) universities
and research institutions are running Beowulf. Software used includes PVM, MPI
and BSP, plus optional products like BPROC, Beowulf Ethernet Channel Bonding
and Virtual Memory Pre-pager.

Network load balancing is used to build large-scale Internet Websites
(usually a few tens of servers). The most common solution is LVS (Linux
Virtual Server) or one of the commercial implementations. You typically have a
load-balancing machine with several servers behind it. The problem is that
these systems don't share data, so you have to provide data synchronization
between the servers (data copying or a back-end NFS server). So it's not
suited for transactional application environments. Load balancing algorithms
include: round-robin, weighted round-robin, least-connection and weighted
least-connection. Examples include: Piranha (RH), Ultra Monkey (VA),
TurboCluster (TurboLinux), Central Dispatch (Resonate).
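
As a rough illustration of the scheduling part (this is just my own sketch in
Python, not code from LVS or any of the products above; the server addresses
and weights are invented), weighted round-robin can be as simple as cycling
through a pool where each server appears as many times as its weight:

    # Minimal weighted round-robin scheduler (conceptual sketch, not LVS code).
    class WeightedRoundRobin:
        def __init__(self, servers):
            # servers: list of (address, weight) pairs
            self.pool = []
            for address, weight in servers:
                self.pool.extend([address] * weight)  # repeat by weight
            self.index = 0

        def next_server(self):
            server = self.pool[self.index]
            self.index = (self.index + 1) % len(self.pool)
            return server

    balancer = WeightedRoundRobin([("10.0.0.1", 3), ("10.0.0.2", 1)])
    for _ in range(8):
        print(balancer.next_server())  # 10.0.0.1 is picked 3 times as often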

For commercial applications, you must be able to run unmodified applications.
Two-node systems account for >80% of the market. This enables Linux to move
into the back-end database market. These clusters are usually based on shared
storage (typically RAID, but you'd better use good-quality SCSI RAID
controllers) with a heartbeat mechanism to monitor the status of the nodes.

Clusterable applications: Apache, Sendmail, databases (MySQL, Oracle...); NFS
still hard to failover; Samba not there yet.  Cluster solutions include:
Failsafe (SGI, port from IRIX) and Kimberlite provided by Mission Critical
Linux, both Open Source, as well as several closed source offerings.

You can also combine commercial and LVS cluster solutions (load-balanced
Apache for the front-end and commercial HA database server for the back-end).
Other important related projects include: journaling filesystems, LVM, cluster
file systems and distributed lock managers.

Brian Stevens
=============

"News Advances in Clustering"

Clustering = availability through redundancy, so there is no single point of
failure.  There are many proprietary offerings (VMS, TruCluster, SGI Failsafe,
IBM HACMP, NT Wolfpack), typically requiring custom hardware.

The role of an HA system is to ensure that the application is always running
on one and only one cluster member (because that's what most off-the-shelf
applications expect). Typically, each member monitors the others over multiple
communication channels and takes over serving applications for a failed
member. Such a system must address: system failure, maintenance, network
outage and system hangs. The hard part is the weird cases.
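
To make the "one and only one member" rule concrete, here is a minimal
takeover sketch in Python (my own, not Kimberlite/Convolo code; the peer
addresses are invented): the peer is declared dead only when it stops
answering on every channel, and only then is the service started locally.

    import subprocess, time

    # Hypothetical addresses of the peer node on two separate channels.
    PEER_CHANNELS = ["192.168.1.2", "10.0.0.2"]

    def peer_alive():
        # The peer counts as alive if it answers on ANY channel; only a
        # failure on all channels may trigger a takeover.
        for address in PEER_CHANNELS:
            if subprocess.call(["ping", "-c", "1", "-W", "1", address],
                               stdout=subprocess.DEVNULL) == 0:
                return True
        return False

    def take_over():
        # Placeholder: grab the shared storage and start the application,
        # so that it keeps running on exactly one cluster member.
        print("peer unreachable on all channels: starting service here")

    while True:
        if not peer_alive():
            take_over()
            break
        time.sleep(2)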

Convolo is a closed source product based on the open source Kimberlite.
Its goal is to be distribution, platform and storage subsystem independent.

Jim Reese (Google)
==================

"Google". Jim is Chief Something at Google.

3 business models: advertisement, searches on specific websites (Cisco) and
syndicated searches.

Google was started in September 1998 by Larry Page and Sergey Brin. It is now
serving 50M queries a day. The focus is to be the world's best search engine,
not a portal. The biggest challenge is scalability. There were 140M Web users
in '99 and 500M pages.

Now there are 2B indexable Web pages. 85% of users use a search engine, and
search engines are consistently in the Media Metrix top 10.

Google interface: clean, elegant, simple and intuitive. Relevance through
smart hyperlink analysis. Speed (goal < 0.5 s per query). Scalability -> 1000
queries / s.

Features: highlighted text, grouped results, cached results, similar pages.

PageRank is a patent-pending ranking technology based on link structure
analysis to provide an objective measure of the importance of a Web page.
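
The PageRank idea itself (a page is important if important pages link to it)
is easy to demonstrate with a toy power iteration; this is my own sketch over
an invented three-page link graph, not Google's implementation:

    # Toy PageRank by power iteration over a tiny made-up link graph.
    links = {          # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }
    damping = 0.85
    rank = {page: 1.0 / len(links) for page in links}

    for _ in range(50):  # iterate until the ranks stabilize
        new_rank = {page: (1 - damping) / len(links) for page in links}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    print(rank)  # pages linked to by important pages score higher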

Google uses (redundant) hardware load balancers, a farm of Web servers, and
farms of index servers and document servers.

Answering a single query involves dozens of servers (index and document
servers). They use off-the-shelf PCs, which are cheap but unreliable. You have
to put reliability and fault tolerance on the software end.  Basically, they
replicate everything (servers, back-end clusters and datacenters). The index is
read-only so there are no consistency problems. They use TCP-based
communication between servers, with multiple channels. Each Web server has a
cluster of back-end servers that holds a copy of all the data.
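
The query path is essentially a scatter/gather over replicated, read-only
index shards; the sketch below is my own guess at the pattern (shard layout,
data and merge rule are invented, and the real thing talks TCP to index
servers rather than local dictionaries):

    from concurrent.futures import ThreadPoolExecutor

    # Each back-end cluster holds a full, read-only copy of the index,
    # split into shards; the front end queries every shard and merges.
    INDEX_SHARDS = [
        {"linux": [("lwn.net", 0.90)], "cluster": [("beowulf.org", 0.80)]},
        {"linux": [("kernel.org", 0.95)], "google": [("google.com", 1.0)]},
    ]

    def search_shard(shard, term):
        return shard.get(term, [])  # stand-in for a TCP request

    def search(term):
        with ThreadPoolExecutor() as pool:
            partial = list(pool.map(lambda s: search_shard(s, term),
                                    INDEX_SHARDS))
        hits = [hit for shard_hits in partial for hit in shard_hits]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    print(search("linux"))  # merged, ranked results from all shards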

Each cluster has 80 servers and is connected by 4 x 1 Gbps Ethernet.

Hardware = ElCheapo (http://rackable.com) PCs: 256 MB RAM, 40-80 GB IDE disks,
one per channel. Just one EEPro100 NIC per machine. One cabinet holds 80 1U
servers. Two Fast Ethernet switches per cabinet. 100% Linux. They generate a
lot of heat.  The goal is to maximize MIPS/ft^2.

Stats: 6000 servers. 500 TB storage. 1 Google day = 16.5 machine-years of
uptime.  33 machines die on average every day -> redundancy.

Everything will fail someday: the data center, fuses, power strips, AC, the
network (switches, cables, NICs, DNS, routing, loops, security, bandwidth),
hard disks, motherboards, NICs, RAM, software (OS, applications, basic
services).  It took four hours to get a patch from Alan Cox last year during a
SYN-flood attack. They use a very stripped-down version of Red Hat and rpm
everything.

Monitoring: monitor everything and when a server fails, simply restart it.
Advanced: performance monitoring and real-time stats by a complex Python
script. Email or pager alerts when critical thresholds are crossed.
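
The actual script was not shown; a minimal stand-in for the "check
everything, restart what fails, page someone when a threshold is crossed"
loop could look like this (host names, the threshold and the alert path are
all made up):

    import os, subprocess, time

    SERVERS = ["web1", "web2", "index1"]  # hypothetical host names
    LOAD_THRESHOLD = 8.0                  # made-up alert threshold

    def alive(host):
        return subprocess.call(["ping", "-c", "1", host],
                               stdout=subprocess.DEVNULL) == 0

    def alert(message):
        # Stand-in for the email/pager notification mentioned in the talk.
        print("ALERT:", message)

    while True:
        for host in SERVERS:
            if not alive(host):
                alert(host + " is down, restarting it")
        load1, _, _ = os.getloadavg()     # local performance stat
        if load1 > LOAD_THRESHOLD:
            alert("load average %.1f over threshold" % load1)
        time.sleep(60)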

Challenges: keeping system images consistent, monitoring system state,
debugging performance problems, bit errors when copying terabytes
(you must do checksums).
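
The checksum part is straightforward; a sketch (the hash choice and file
paths are mine, not Google's):

    import hashlib

    def file_checksum(path, chunk_size=1 << 20):
        # Stream the file in 1 MB chunks so huge files fit in memory.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)
        return digest.hexdigest()

    # After copying an index file, compare source and destination sums.
    if file_checksum("/data/index.src") != file_checksum("/data/index.dst"):
        raise RuntimeError("bit errors detected during copy; retransmit")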

Lessons: KISS, keep everything identical, expect hardware failures; cheap
hardware allows more computing power. Traffic grows much faster than you
think. Design everything with "factor N" scalability.

!!! They are going to index the newsgroups (like Altavista in the old days).

Simon Horman (VA Linux)
=======================

"Building Web Farms with Linux".

Information on http://ultramonkey.sourceforge.net/

HA on Linux started in May 1998. "fake", released in November 1998, implements
IP address takeover. Now there are many projects; fake has been superseded by
heartbeat. See http://www.linux-ha.org/

To provide HA, you have to eliminate single points of failure.

Web farms typically employ both HA and scalability technologies (the Web
doesn't go to sleep).

Architecture = 3 levels: IPVS, Web servers, storage (SCSI, fiber channel or
NFS).

Top layer: you usually put two IPVS machines here (one as a standby).

Middle layer = expendable pieces of hardware. They contain very little
content and state. They do complex processing, and processing power can be
increased by adding servers. You should store state on clients using cookies
or URL mangling.

Bottom layer: NFS, AFS or a database server (in the future, server-independent
storage, like the one provided by GFS, may be used).

Geographic distribution: you can use intelligent DNS solutions (Resonate,
Eddieware) that return the IP address of one of the servers, or a central HTTP
server can handle all incoming requests and redirect them using an HTTP
redirect or a rewrite rule. Or use EBGP.
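
The HTTP-redirect variant is easy to picture; here is a tiny sketch using
Python's standard HTTP server (the mirror list and the nearest() choice are
invented, and this is not any of the products named above):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    MIRRORS = {"us": "http://us.example.com", "eu": "http://eu.example.com"}

    def nearest(client_ip):
        # Placeholder for a real "closest data center" decision.
        return MIRRORS["eu"] if client_ip.startswith("192.") else MIRRORS["us"]

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            # The central server answers every request with a redirect to
            # the mirror chosen for this client.
            self.send_response(302)
            self.send_header("Location",
                             nearest(self.client_address[0]) + self.path)
            self.end_headers()

    HTTPServer(("", 8080), Redirector).serve_forever()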

IP Address Takeover: steal the IP address from a (failed) machine, but this may
cause ARP problems. Solution: gratuitous ARP (an ARP reply sent without an ARP
request). But there are also problems with this approach. You have to
configure your switches correctly.
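
A gratuitous ARP is just an unsolicited, broadcast ARP reply announcing
"this IP now lives at this MAC"; with scapy (my choice of tool, nothing the
speaker mentioned; addresses are examples) the takeover announcement is a
single packet:

    from scapy.all import ARP, Ether, sendp

    VIP = "192.168.1.100"          # service address being taken over (example)
    NEW_MAC = "00:11:22:33:44:55"  # MAC of the surviving node (example)

    # Broadcast an ARP reply nobody asked for, so hosts and switches update
    # their ARP tables / forwarding entries for the virtual IP.
    gratuitous = (Ether(src=NEW_MAC, dst="ff:ff:ff:ff:ff:ff") /
                  ARP(op=2, hwsrc=NEW_MAC, psrc=VIP,
                      hwdst="ff:ff:ff:ff:ff:ff", pdst=VIP))
    sendp(gratuitous, iface="eth0")  # needs root privileges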

Layer 4 switching: the ability to multiplex connections and redirect them.
There is special equipment for this: Alteon, Cisco LocalDirector or LVS... You
have to use the right scheduling algorithm. There are several forwarding
techniques: direct routing (similar to layer 2 switching), IP-IP encapsulation
(tunneling) or NAT.

DNS methods: problem is DNS caching (TTL too long: less control, TTL too
short: increased load on DNS server). This is not the right way to do
load-balancing. But it is a viable method for geographic distribution. 

Heartbeat: messages are sent between machines every few seconds. If a
heartbeat is not received, the node is assumed to have failed. There are
problems in case of network partitioning (a machine in each partition will
think the other has failed).
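
In its simplest form a heartbeat is a periodic UDP datagram plus a timeout on
the receiving side; this sketch (port and intervals are arbitrary; run the
sender on one node and the watcher on the other) also shows why a network
partition is indistinguishable from a dead peer:

    import socket, time

    PEER = ("192.168.1.2", 9999)  # example peer address and port
    TIMEOUT = 10                  # seconds without a heartbeat => peer "dead"

    def send_heartbeats():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(b"heartbeat", PEER)
            time.sleep(2)

    def watch_heartbeats():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", 9999))
        sock.settimeout(TIMEOUT)
        while True:
            try:
                sock.recvfrom(64)  # peer is alive
            except socket.timeout:
                # A network partition looks exactly the same from here.
                print("no heartbeat for %ds, assuming peer failed" % TIMEOUT)
                return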

Existing solutions (some): 

- Heartbeat (Alan Robertson): effects an IP address takeover in case of
a failover.

- LVS (Wensong Zhang) = IPVS. Load balancing with several load-balancing
algorithms.

- Eddieware (Ericsson): intelligent HTTP gateway that multiplexes
connections to back-end servers (can define some kind of QoS metrics). It
also has an enhanced DNS server (for geographic distribution).

- TurboCluster (TurboLinux, closed source): supports layer-4 switching.

- Resonate (Resonate, closed source): analogous to the intelligent
gateway component of Eddieware.

- Global Dispatch (Resonate, closed source): similar to the enhanced DNS part
of Eddieware.

- Piranha (Red Hat): suite of tools to configure an IPVS-based service.
HTML-based configuration tool.

- UltraMonkey (VA Linux): LVS for load balancing, heartbeat for HA between
load-balancers.

- The Global File System (GFS): designed to facilitate access to shared
Fibre Channel disks, so there is no host acting as a single point of failure.
It is currently functional, but not ready for production yet.

Future solutions: 

- Need for a generic framework (S. Tweedie) based on some kind of distributed
lock manager. This is a long term project.

- FailSafe (SGI and SuSE).

Conclusion: the support for HA under Linux continues to grow.

-- 
Stefane Fermigier, Tel: 06 63 04 12 77 (mobile).
Portalux.com: le portail Linux.
"Internet: Learn what you know. Share what you don't."