
From: Peter Badovinatz <tabmowzo@yahoo.com>
Subject: Notes from OLS Clustering Working Group

25 July 2001 - Ottawa Linux Symposium
Clustering Working Group
Discussion leaders Lars Marowsky-Bree and Alan Robertson
Notes by Peter Badovinatz

(note on typography:  I attempted to identify when Lars or Alan was 
speaking.  Most audience questions/comments are noted with "Comment:"
or "C:".  I identified a couple of audience members when I could easily
get the name, otherwise they remain anonymous.)

HA Working group BOF

Large attendance; hopefully folks will sign up on the charts.

Opening:

Alan:
Opening comments about framework
- different components
- different requirements
 - HPC
 - HA
- compare & contrast what's needed

Lars:
Many, many different clustering solutions
Which one to write to?
Too many to choose from
All essentially do the same thing

Alan:
Discussing his framework; its purpose is to provide a set of components
that can be mixed and matched.

Flip chart:
Create HPC clusters
Create HA clusters
Viable OSS 'clusters' project
Be able to vary cluster characteristics based on component choice
- able to have people understand and contribute to individual pieces
 - smaller, more understandable; contributors don't necessarily need
to know "everything" that's there in order to contribute.
 - hopefully, though, this strengthens the overall project.
Two APIs
 - internal - components that provide the clustering solution that
need to work together to provide an overall system view
  - heartbeating
  - membership services
  - monitoring
 - external
  - monitor scripts
  - control scripts
  - membership/cluster manager interactions

Rename internal API to "CPI", Component Programming Interface.
 - Comment from audience: want to normalise APIs, not start an OSS project
 - Alan: no, we are trying to do both, because one follows from the
other
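
(For illustration, a rough sketch of what a "CPI" component descriptor
might look like in C, so pieces can be mixed and matched.  Every name
here is invented; nothing like this was agreed at the BOF:)

    /* Invented sketch of a pluggable component descriptor. */
    struct cpi_component {
        const char *name;      /* e.g. "heartbeat", "membership" */
        int         version;
        int  (*init)(void);    /* bring the component up */
        void (*fini)(void);    /* tear it down */
    };

    int cpi_register(const struct cpi_component *c);  /* hypothetical */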

Lars: Customers want one clustering view
Still debating a bit with the audience member, who still seems to view
this as not quite an OSS project

Alan:
Pieces
 - M Membership
 - C Cluster manager
 - R Resource monitor
 - S Resource scripts
APIs and reference implementations
Comment: programmers not good at defining APIs
Alan: yes, but someone has to do it
Comment: why the reference implementation?
Alan: need the implementation to verify that the APIs work, are
useful, are good
 - will allow us to evolve the APIs through usage and experience
But, implementation is NOT binding
Hopefully the eventual APIs will be relatively "binding"

C: two working groups, one for APIs and one for components?
 - not really, need to know the APIs to know the components
 - recursive descent, must do each one partially to keep making
progress on both

C: if this is a standards project, it's wider than Linux; is
there a consortium or group to work with?
 - nope; that's actually a sticking point with clusters: everyone
already has their own view of clusters and HA and HPC...

C: two working groups, cross-testing API & reference implementers
 - wants to exploit all of the currently interested parties to get this
done

C: will this be a standard implementation or a reference implementation?
 - well, probably reference; not everyone will want all of it
 - no linear process to settle this, must be iterative

Alan's STONITH ("Shoot The Other Node In The Head", i.e. power
fencing) API: still iterating on it, 2 or 3 rounds so far, and he still
wants to do more.
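
(As a sketch only: a pluggable STONITH device might reduce to an
operations table like the one below.  This is not Alan's actual API;
all names are invented:)

    /* Invented sketch of a pluggable power-fencing ("STONITH") driver. */
    struct stonith_device;                 /* opaque, driver-private state */

    struct stonith_ops {
        const char *name;                  /* e.g. a power switch model */
        struct stonith_device *(*open)(const char *config);
        int  (*reset)(struct stonith_device *d, const char *host);
        int  (*status)(struct stonith_device *d);  /* device reachable? */
        void (*close)(struct stonith_device *d);
    };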

Alan:
 - Lars mostly interested in how it manages resources.
 - Compaq mostly interested in membership, system structure.

Resource - resources are 'entities' that are managed, things that the
customers perceive as providing service
 - every vendor requires that resource scripts be written differently
 - very annoying to customers, admins, etc.
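
(The conventional contract is init-script-like: the cluster manager
invokes one script with "start", "stop", or "status" arguments.  A
minimal sketch of the manager's side, with an invented helper and path:)

    /* Sketch: driving an init-script-style resource script.  Only the
       start/stop/status argument convention is the point here; the
       script path and helper name are invented. */
    #include <stdio.h>
    #include <stdlib.h>

    static int run_resource_op(const char *script, const char *op)
    {
        char cmd[512];
        snprintf(cmd, sizeof(cmd), "%s %s", script, op);
        return system(cmd);            /* 0 = success, nonzero = failure */
    }

    /* e.g. run_resource_op("/etc/ha.d/resource.d/httpd", "start"); */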

Want/need these APIs/functions to be usable within the kernel as well
as in user space.
 - where the function is implemented should be transparent
 - APIs are hopefully the same
  - well, as close to the same as they can be; there will be
differences, but these should be "insignificant"

C: need to be careful about the kernel implementation, it's a spartan
environment
 - yes, must make tradeoffs
C: are you talking about modifying the kernel?  this could be
dangerous and difficult and paint us into a corner.
 - no, we aren't viewing this as requiring the kernel to be modified
 - think of this as kernel modules

C: buy-in from current implementers?
 - Alan has mostly been talking to OSS folks
 - Lars talked to closed-source vendors; they are interested in
standardising at the resource script level (boring work for them), and
would like to have that.

C: what about HPC parties?
 - don't know yet, we haven't necessarily talked to them too much
 - input from linux-cluster has been useful but not deep enough to
know precisely what to do

C: need to ensure everyone on the working group identifies their IP
background to avoid polluting the definitions or the
implementations.
 - Peter identified himself as avoiding, and being careful about, the
membership and n-phase stuff due to patents

C: it's not a standard if Sun, e.g., can't implement a closed-source
version
 - yes, but we may need to 'work around' this
 - Alan says that IBM may release patents for OSS but not for
closed source

(wish Alan hadn't gotten us into the IP/patent area...)

Alan says that there may be 'shims' or 'impedance matchers' that
allow existing products/implementations to match only a
small number of the APIs, e.g., the resource script level.
 - acceptable for this

C: agreement to do anything?
 - Lars/Alan: yes, we need that to make progress

CPI and API are often the same thing, but not always
 - Alan: the difference matters sometimes, but not always
 - closed source probably won't provide them but may use them

C: two kinds of interfaces, programming and binary...  maybe we don't
want the split
 - some agreement here
 - must ensure no recompilation at the application level
 - but components may require recompilation

C: anyone disagree with the general idea?
 - no takers :)

Alan wants to minimise the effort for an application to become
cluster-aware

C: policy document about goals, IP, etc.
 - will do, needs to be codified
 - concern about understanding how contributions will be managed
  - licensing, etc.
 - Alan: IP must not be a barrier to implementation

C: group mgmt level, 'who is in the group'.  will we specify the
protocols? will it be protocol-free?
 - Lars/Alan: no, probably can't be completely; we'll have a group
comm service, but applications will define what they do
C: groups must be synchronised, scaling effects
 - ordered messaging, virtual synchrony
 - low-level messages, no ordering, just reliable delivery; up nodes
receive it
 - build on this to add ordering, synchrony, etc.
 - group delivery
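
(A sketch of that layering, with invented names: bare reliable
delivery underneath, ordering options stacked on top:)

    /* Invented sketch of layered group messaging. */
    #include <stddef.h>

    enum msg_order {
        ORDER_NONE,    /* reliable delivery only, no ordering guarantee */
        ORDER_FIFO,    /* per-sender ordering */
        ORDER_TOTAL    /* all recipients see all messages in the same order */
    };

    int grp_join(const char *group);
    int grp_send(const char *group, enum msg_order order,
                 const void *buf, size_t len);  /* goes to all up members */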

C: OSI-like (networking definitions), multi-layer definition for group
communication? 
 - not sure, maybe
 - Lars still thinking of the application layer

C: define some components to communicate only horizontally, while other
communication goes vertically (as does OSI)
 - Alan: some go both ways; barrier with applications
 - but it is unclear who the clients of the group components are

C: try to minimise the number of interfaces?
 - Lars wants minimal, but it can only be so minimal; need as many as
are needed

C: wants to avoid any-to-any communication requirements, scaling problems
 - makes sense, he's right

C: discussions about XML, scripts on the mailing lists
 - scripts are the 'standard' way that resources are managed, just the
way things are done
 - XML: need some way to intercommunicate among nodes/components

C: heterogeneous clusters?
 - yes, the intent is that everything will work across heterogeneous
clusters
 - however, many implementations using the framework, code, etc. may
not work in such a cluster, e.g., a cluster file system
 - some specific implementations may optimise to target specific
hardware and may not be generally useful
 - APIs and definitions should NOT be exclusive
  - you can conform to the framework without supporting all possible
hardware combinations

Coffee break discussion:
C: Drew Streib 'Free Standards Group' - important to work out policy
document issues, what is it, how licensed, etc.  Can still move
forward on technical group, but need to have this policy side going.
 - how to get document, how to comply with standard
 - royalties necessary?  compliance, how to get it?
 - certification, costs, etc.
 - licensing the spec?
 - state all of this up front; copyright the name of the document
  - prevents the certification process, etc. being done by someone else
  - the name, and the copyright on the NAME of the document, are a bit
different from what the document contains and how it is produced

Reconvene
Alan - make list of components
 - not the standard
 - but want to understand disagreements/agreements

Lars still looking at the application side, what kinds of APIs does he
want
 - membership
 - cluster communication
 - resource control methodology

Alan: what are the components that we may want
 low level: communications
 - raw
 - reliable
 - ordered
 - group
Ordered means that all recipients receive all messages in the same order
 - avoid 'how' to do it for now, although some comments started to go
that way...
Cluster membership
 - when do nodes join, when do nodes leave, evict node
Group membership
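
(Again as an invented sketch, membership changes might be delivered as
events like these:)

    /* Invented sketch of membership event delivery. */
    enum member_event { NODE_JOIN, NODE_LEAVE, NODE_EVICT };

    typedef void (*member_cb)(int node_id, enum member_event ev, void *arg);

    int membership_subscribe(member_cb cb, void *arg);  /* hypothetical */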

Tangent onto security 
Alan - cluster is a set of machines whose backplane is the internet
 - security issues
 - need single administrative domain for a cluster
  - can't cross admin domains

C: need functions for authenticate/authorise
 - Alan: cluster members need to trust each other

C: authentication on cluster joining
 - Acceptable in principle

C: more trust issues; geographically separated clusters need security
between the two halves, and this needs to be part of the framework
definitions
 - initially, a piece that always says 'yes'
 - Alan feels every message should be signed/authenticated/etc.
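
(One way to picture "every message is signed": an envelope carrying a
MAC over the payload, computed with a shared cluster key.  All of this
is an invented sketch, not anything proposed in the room:)

    /* Invented sketch of an authenticated intra-cluster message. */
    #include <stdint.h>

    #define CLM_DIGEST_LEN 20              /* e.g. an HMAC-SHA1 digest */

    struct clm_msg {
        uint32_t src_node;
        uint32_t seq;                      /* replay protection */
        uint32_t len;                      /* payload length in bytes */
        uint8_t  auth[CLM_DIGEST_LEN];     /* MAC over header + payload */
        /* payload follows */
    };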

C: with DMA across nodes, authenticating messages is absurd (HPC
clusters have a different view)
 - Lars: clusters that trust the hardware level have different needs
for security of messages

Sum: security IS an issue, but we can't solve it now

Back to components...
Barrier services
Resources
Event services
 - confusion here; one commenter and Peter understand this
 - Alan/Lars push toward unifying, treating event services as
listening to groups
 - ordered events / unordered events
User interface

C: are we trying to define too much / too closely?
 - Alan wants to throw a bunch of stuff out

C: a bunch of modules that may be available
 - Alan: the application doesn't care much about what's there

C: wants a nice extensible approach
 - Alan: my paper does that; some criticism that it's too much

C: naming should be a separate standard; let applications worry about
that.  short term, a basic standard about communications and
interconnection; let applications worry about naming at their level
 - Alan: some naming is needed, e.g., are node numbers needed?

running out of time here...

Lars: how to proceed?  suggests Linux Kongress in Enschede.  how to
come to agreement?

Alan: some issues about LSB 'holes', e.g., no specified way to
start/stop a daemon

Which mailing list?  sorta undecided...

C: discuss internal kinds of things with time left

so more on components
Alan: RPC (using the term generally) based on some sort of guaranteed
communications
 - not sockets, not quite the right model
  - not always using networks (serial, over disks, etc.)
  - delivery is to "cluster members" or "group members"
   - all or some or one
marshaling/demarshaling of data
I/O protection (fencing, etc.)
logging facility
heartbeating
initialisation
quorum mechanisms/policies (see the majority sketch after this list)
configuration (Lars: a database) "repository"
user interface interaction with config "repository"
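
(For the quorum item above, the classic majority policy is small enough
to sketch outright:)

    /* Majority quorum: strictly more than half of the configured
       nodes must be up. */
    static int have_quorum(int active_nodes, int total_nodes)
    {
        return 2 * active_nodes > total_nodes;
    }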

(battery running down rapidly here :)

Alan wants configuration of objects to be uniform, although the data
may be different

C: versioning information, which version are you talking to?
 - Alan: group membership layer for this
 - C: communication
 - C: versions of the APIs
 - migration of the individual nodes
  - must be able to upgrade the cluster on the fly (a node at a time)
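
(A sketch of how on-the-fly upgrade might fall out of version
negotiation; the scheme and all names are invented:)

    /* Invented sketch: each node advertises the API versions it can
       speak; the cluster runs at the highest version every member
       supports, so nodes can be upgraded one at a time. */
    struct api_range { int min_ver; int max_ver; };

    static int negotiate_version(const struct api_range *nodes, int n)
    {
        int ver = nodes[0].max_ver;
        for (int i = 1; i < n; i++)
            if (nodes[i].max_ver < ver)
                ver = nodes[i].max_ver;
        for (int i = 0; i < n; i++)
            if (nodes[i].min_ver > ver)
                return -1;                 /* no common version */
        return ver;
    }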

C (Albert Calahan): real-time and resource minimisation issues; avoid
hitting the cpu or network at inopportune times, e.g., don't require a
regular heartbeat
 - Alan: be able to tune components or timing, or use different methods
to determine cluster membership
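
(Tunable timing might be as simple as a parameter block like this
invented sketch, rather than a hardwired regular heartbeat:)

    /* Invented sketch of heartbeat tunables. */
    struct hb_tunables {
        unsigned interval_ms;   /* 0 = no periodic heartbeat at all */
        unsigned deadtime_ms;   /* silence before a node is declared dead */
        unsigned piggyback;     /* 1 = ride on existing traffic if possible */
    };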

C (Albert Calahan): clusterwide shared memory
 - optional component, not all clusters want this

C: capability map
 - Alan: show what functional pieces are available.
 - C: a bitmap
 - Alan: not taken with the bitmap idea
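
(The commenter's bitmap idea, sketched with invented names; recall that
Alan was not taken with it:)

    /* Invented sketch of a capability bitmap. */
    #define CAP_MEMBERSHIP   (1u << 0)
    #define CAP_ORDERED_MSG  (1u << 1)
    #define CAP_BARRIER      (1u << 2)
    #define CAP_SHARED_MEM   (1u << 3)

    unsigned int cluster_capabilities(void);  /* OR of the CAP_* bits present */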

C: real-time
 - Lars comments about tightly synchronised time across the cluster,
where subsequent time calls on different nodes always show time
incrementing.
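
(The guarantee Lars described amounts to clamping: never return a time
at or before the last value seen anywhere in the cluster.  A sketch,
with the clock source and the cluster-wide state both invented:)

    /* Invented sketch of monotonic cluster time. */
    extern unsigned long long gettime_ns(void);   /* hypothetical local clock */

    static unsigned long long last_seen;          /* would be cluster-wide state */

    unsigned long long cluster_time(void)
    {
        unsigned long long now = gettime_ns();
        if (now <= last_seen)
            now = last_seen + 1;   /* time never repeats or goes backwards */
        last_seen = now;
        return now;
    }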




