From: "Matthew O'Keefe" <okeefe@lcse.umn.edu>
Subject: GFS, DLMs, and clusters
To: linux-ha@muc.de
Date: Tue, 16 Feb 1999 14:50:20 -0600 (CST)


Hi,

I had a few comments on David's insightful observations about GFS
and High Availability in Linux.  Please note that a good part
of Saturday afternoon at the Linux/GFS storage workshop 
(to be held March 5-6 in Mountain View, CA) will be
taken up with High Availability, with talks by both David and Stephen
Tweedie on the subject.
 
> Date: Wed, 03 Feb 1999 19:22:03 -0800
> From: David Brower <dbrower@us.oracle.com>
> Organization: Oracle Corporation
> To: gfs-devel@lcse.umn.edu, linux-ha@muc.de, David <DBROWER@us.oracle.com>
> Subject: Global File System and linux-ha
> 
> "T.Pospisek's MailLists" forwarded a pointer to GFS to linux-ha and I, 
> for one, found this most interesting.
> 
> From an H/A perspective, GFS seems like a promising start to what
> seems necessary.  It does have a number of design decisions
> that seem problematic, and ought to be examined.  I know from the
> papers that some are on the radar, but it doesn't hurt to call for
> more emphasis.  Here are some things that concern me.
> 
> 1.  The use of the DLOCK mechanism is an elegant way of punting
>     a deeper problem, which is providing a real DLM.  By putting
>     the locks in the drive, a unified point of serialization is
>     certainly reached, but other things one might want to do are
>     not handled.  The in-progress DLOCK daemon doesn't sound like
>     it will really address the deeper issues.  Is it a SPOF?  If 
>     not, isn't it becoming a DLM?

Yes, that's right.  Once you create a distributed DLOCK daemon, you are
really building a DLM.  You are also right that Dlocks were used so
that we could punt on the issue of a DLM; a real DLM is highly
desirable, but it is also non-trivial to build.  We only wanted to
build one non-trivial thing at a time: first a cluster file system,
and later a DLM or something like it.  In any case, SCSI needs some
kind of synchronization primitive anyway, since it is becoming a
networking protocol with the advent of Fibre Channel and network
devices.
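
To make the distinction concrete, here is a rough user-space model in
C (purely illustrative; it does not reflect the real Dlock SCSI
command set, and the lock table, node ids, and polling interval are
all made up for the sketch).  A device-resident lock can only answer
"held or not held, try again later"; everything a DLM adds on top of
that has to live somewhere else:

    /* Hypothetical model of a device-resident lock (Dlock); illustration
     * only, not the real Dlock command interface.  The mutex stands in
     * for the device's internal serialization of lock commands. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NLOCKS   256
    #define NO_OWNER 0

    static unsigned device_locks[NLOCKS];     /* owner per lock number */
    static pthread_mutex_t device = PTHREAD_MUTEX_INITIALIZER;

    /* All a Dlock-style primitive offers: try to claim lock N, yes or no. */
    static int dlock_try(unsigned lock_num, unsigned node_id)
    {
        int got = 0;

        pthread_mutex_lock(&device);
        if (device_locks[lock_num] == NO_OWNER) {
            device_locks[lock_num] = node_id;
            got = 1;
        }
        pthread_mutex_unlock(&device);
        return got;
    }

    static void dlock_acquire(unsigned lock_num, unsigned node_id)
    {
        /* No queues, no modes, no callbacks: the client just polls. */
        while (!dlock_try(lock_num, node_id))
            usleep(1000);
    }

    static void dlock_release(unsigned lock_num, unsigned node_id)
    {
        pthread_mutex_lock(&device);
        if (device_locks[lock_num] == node_id)
            device_locks[lock_num] = NO_OWNER;
        pthread_mutex_unlock(&device);
    }

    int main(void)
    {
        /* A DLM would add named resources instead of raw lock numbers,
         * shared/exclusive (and finer) modes, convert and wait queues,
         * and recovery when a holder dies.  None of that fits in the
         * device-side primitive above, which is why a distributed
         * daemon built around Dlocks drifts toward being a DLM. */
        dlock_acquire(42, 1);
        printf("node 1 holds lock 42\n");
        dlock_release(42, 1);
        return 0;
    }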
> 
> 2.  Following on, any DLM needs to be h/a, with adequate redundancy.
>     It also, usually, needs to be built on some clear definition of
>     a lock domain.  This leads naturally into the need to have some
>     sort of cluster/domain group membership definition.  The linux-ha
>     crowd has been nibbling into this issue, so far inconclusively
>     as far as I can see.

I totally agree.  At some point you need to say: all these machines, and
only these machines, are part of a domain.  You need a hierarchy, for
performance, stability, and organizational neatness.
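
As a strawman (not GFS or pool code; the field names are invented for
the example), a membership record needs little more than the member
list plus a generation number that is bumped on every reconfiguration,
so that requests stamped with a stale generation can be rejected:

    /* Strawman cluster membership record (illustration only). */
    #include <stdio.h>

    #define MAX_NODES 16

    struct domain {
        unsigned generation;         /* bumped on every join/leave/eviction */
        unsigned nmembers;
        unsigned members[MAX_NODES]; /* node ids currently in the domain */
    };

    /* Honor a request only if it comes from a current member and carries
     * the current generation number. */
    static int is_valid_member(const struct domain *d,
                               unsigned node_id, unsigned gen)
    {
        unsigned i;

        if (gen != d->generation)
            return 0;
        for (i = 0; i < d->nmembers; i++)
            if (d->members[i] == node_id)
                return 1;
        return 0;
    }

    int main(void)
    {
        struct domain d = { 7, 2, { 1, 2 } };

        printf("node 2, gen 7: %s\n",
               is_valid_member(&d, 2, 7) ? "accepted" : "rejected");
        printf("node 3, gen 7: %s\n",
               is_valid_member(&d, 3, 7) ? "accepted" : "rejected");
        printf("node 2, gen 6: %s\n",
               is_valid_member(&d, 2, 6) ? "accepted" : "rejected");
        return 0;
    }

A hierarchy could then be little more than domains nesting inside
larger domains, each carrying its own generation counter.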
> 
> 3.  Once you have a cluster definition, and shared disk, for real
>     h/a fault tolerance, you need to build in a way of keeping 
>     non-members from scribbling your disks.  The GFS pool mechanism
>     may be a good place for that sort of thing.  And/But it MUST
>     be tightly integrated with the membership definition.  Pfister's
>     book talks about this as i/o fencing.  There are applications for
>     FibreChannel tokens here -- on group reconfig, invalidate old
>     tokens, so if an insane node starts to access, the h/w tells it to
>     go away.

Fibre Channel switches have a feature called zoning, which lets the
hardware fence off certain machines from certain storage devices.  The
zones could be mapped to domains in our case and controlled from the
pool layer.  Coupling the domain membership with the hardware zoning
should be easy to do in the pool driver.
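
In outline (a sketch only; the switch call is stubbed out because the
real zoning interface is switch specific, and none of the names below
come from the pool driver), the coupling amounts to: on every
membership change, recompute the zone from the member list and tell
the switch to drop everyone else.

    /* Sketch of driving Fibre Channel zoning from domain membership.
     * Illustration only; a real pool driver would talk to the switch
     * through whatever zoning interface the vendor provides. */
    #include <stdio.h>

    #define MAX_NODES 16

    struct domain {
        unsigned generation;
        unsigned nmembers;
        unsigned members[MAX_NODES];
    };

    /* Stub for "tell the switch which node ports belong in the zone". */
    static void switch_set_zone(const unsigned *members, unsigned n)
    {
        unsigned i;

        printf("zone now contains %u node(s):", n);
        for (i = 0; i < n; i++)
            printf(" %u", members[i]);
        printf("\n");
    }

    /* Called by the membership layer whenever the domain reconfigures.
     * Any node dropped from the member list is fenced by the hardware,
     * so an insane node's I/O is refused at the switch rather than
     * trusted to stay away on its own. */
    static void on_reconfig(struct domain *d)
    {
        d->generation++;
        switch_set_zone(d->members, d->nmembers);
    }

    int main(void)
    {
        struct domain d = { 1, 3, { 1, 2, 3 } };

        /* Node 3 is declared dead: remove it and re-zone. */
        d.nmembers = 2;
        on_reconfig(&d);
        return 0;
    }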
> 
> 
> 5.  Back to GFS, I'd encourage use of journalled meta-data ASAP; I don't
>     think you can have a highly available file system without it.  The
>     fsck problem really needs to be solved to be viable.

No question about it.  After write caching is implemented in the next
month or so, journaling becomes our top priority.

> 
> 6.  The worst case scenario for a cluster file system is one where two
>     nodes have opened a log file in append mode and are writing records
>     to it.  I suspect this ultimately needs to be solved, but I am not
>     convinced doing so is a high priority.

Multiple writers do occur in some applications I've heard about, such
as splicing together video sequences for a news program.  However,
those writes go to different parts of the file.
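
At user level the disjoint-region case looks something like the sketch
below (just two local processes and pwrite(); it says nothing about
how GFS arbitrates the underlying blocks).  Each writer owns its own
byte range, so the data never interleaves the way records from two
O_APPEND log writers would:

    /* Two writers, disjoint regions of one file (user-level illustration). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define REGION_SIZE 4096        /* each writer owns one 4 KB region */

    int main(void)
    {
        int fd = open("shared.dat", O_CREAT | O_RDWR, 0644);
        pid_t pid;
        int region;
        char buf[64];

        if (fd < 0) {
            perror("open");
            return 1;
        }

        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }

        /* Parent writes region 0, child writes region 1; the offsets
         * never overlap, so the data needs no coordination.  Two nodes
         * appending to the same log would not have that property. */
        region = (pid == 0) ? 1 : 0;
        snprintf(buf, sizeof(buf), "writer %d was here\n", region);
        pwrite(fd, buf, strlen(buf), (off_t)region * REGION_SIZE);

        if (pid > 0)
            wait(NULL);
        close(fd);
        return 0;
    }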
> 
> 7.  Back to DLM, and why I don't quite buy the DLOCK scheme, there are
>     needs for locking that I don't think DLOCK will handle.  You need
>     to support O_EXCL mode, and the various other advisory and
>     mandatory locking schemes in UNIX.  This requires other extensions
>     to Linux to make work, does it not?

Converting from a shared lock to an exclusive lock can also be tricky.
We are coming around to the need for coordination among the clients for
access to Dlocks in certain cases, so that the whole protocol includes
both Dlocks and a separate coordination scheme among the clients
(I know, I know, that separate coordination layer starts to look like
a DLM :-).
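
Here is a toy model of why conversion is tricky (a simulation, not the
Dlock protocol; the structure and functions are invented for the
example).  A shared holder can only be promoted once it is the sole
holder, so if two holders both ask to convert, neither can be granted
until one backs off, and deciding who backs off is exactly the kind of
policy the coordination layer has to supply:

    /* Toy simulation of shared-to-exclusive conversion (not GFS code). */
    #include <stdio.h>

    struct lock {
        int shared_holders;    /* how many nodes hold the lock shared */
        int exclusive_owner;   /* node id, or 0 if none */
    };

    /* A conversion is granted only if the requester is the lone shared
     * holder; otherwise it must wait (or be denied). */
    static int try_convert(struct lock *l, int node)
    {
        if (l->exclusive_owner != 0 || l->shared_holders != 1)
            return 0;
        l->shared_holders = 0;
        l->exclusive_owner = node;
        return 1;
    }

    static void drop_shared(struct lock *l)
    {
        if (l->shared_holders > 0)
            l->shared_holders--;
    }

    int main(void)
    {
        struct lock l = { 2, 0 };  /* nodes 1 and 2 both hold it shared */

        /* Both want exclusive.  Neither conversion can be granted while
         * the other still holds the lock shared; without a policy they
         * wait on each other forever. */
        printf("node 1 convert: %s\n",
               try_convert(&l, 1) ? "granted" : "blocked");
        printf("node 2 convert: %s\n",
               try_convert(&l, 2) ? "granted" : "blocked");

        /* The coordination layer picks a loser: node 2 backs off ... */
        drop_shared(&l);

        /* ... and node 1's conversion can now go through. */
        printf("node 1 convert: %s\n",
               try_convert(&l, 1) ? "granted" : "blocked");
        return 0;
    }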
> 
> 8.  For my purposes, I'd want a shared file system that was integrated
>     with a visible membership service, had journalled/consistent
>     metadata, direct i/o for blocks, and concurrent access to blocks by
>     multiple nodes.  I would not want exclusive access per-node, quite
>     the opposite.

Agreed, this is the right vision for sure.
> 
> I am hoping to make the March 5/6 GFS workshop to get 
> a better sense of the state-of-the-world.  In the meantime, I strongly
> encourage some cross-pollination between the GFS and H/A communities.
> I don't think GFS w/o HA is as useful as it ought to be, and H/A with
> no shared storage isn't as useful as it could be.

As mentioned earlier, Dave will be speaking at the workshop on Saturday
afternoon.  I expect a lively and interesting discussion on Linux HA.

Regards,
Matt O'Keefe
okeefe@ece.umn.edu
University of Minnesota

> 
> cheers,
> 
> -dB