
Date:	Sun, 07 Mar 1999 14:49:55 -0500
From:	Werner Krebs <werner.krebs@yale.edu>
To:	linux-kernel@vger.rutgers.edu
Subject: GNU Queue now with Linux kernel checkpoint API.

Hi folks.

As promised earlier here, I've added support for two different
checkpointing methods to GNU Queue: user-level (which is
cross-platform) and kernel-level (which uses Eduardo Pinheiro's
checkpointing API for the Linux kernel).

As you know, checkpoint migration allows running jobs to be moved
dynamically between machines in a cluster. GNU Queue was originally a
load-balancing system that specialized in interactive job balancing, so
the goal is to be able to fire up basically any job and have it behave,
as far as the user is concerned, as if it were running on the local
terminal.

The kernel API is the one pertinent to this list. GNU Queue actually
makes a very nice wrapper for Eduardo's API, which was originally written
for the Nomad distributed operating system (still in development). There
is a wrapper specific to Beowulf clusters, but AFAIK GNU Queue is the
only other wrapper. The API does not support sockets (or, apparently,
ptys), and GNU Queue re-creates these as part of its normal operation, so
there is a nice synergy.

With the API, GNU Queue can checkpoint migrate the interactive 'vi'
editor without recompilation --- well, sort of. You need to type
':wq[Return]' because under the Queue 1.20.1-pre2 development release
the freshly created terminal upon restart starts in cooked mode. This
command causes vi to issue the call to switch the terminal back into raw
mode. (Interestingly, SIGCONT and SIGWINCH don't work.) This probably
could be fixed either in the kernel API (by having the API restore
terminal state upon restart, which is probably the correct solution) or
at the user level (by having GNU Queue memorize terminal state and
restore it upon restart).
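
A minimal sketch of the user-level option, in Python for brevity: record
the terminal attributes at checkpoint time and reapply them to the
freshly created terminal at restart. The function names are hypothetical
and are not part of GNU Queue or of Eduardo's API; the sketch only
assumes the standard tcgetattr()/tcsetattr() calls.

```python
import termios


def save_terminal_state(fd):
    """Return the attributes of the terminal on fd, as tcgetattr() reports them.

    Hypothetical helper: would be called just before the job is checkpointed.
    """
    return termios.tcgetattr(fd)


def restore_terminal_state(fd, saved):
    """Apply attributes saved earlier to the (possibly new) terminal on fd.

    Hypothetical helper: would be called right after the pty is re-created
    on the destination host, so a raw-mode job does not wake up in cooked mode.
    """
    termios.tcsetattr(fd, termios.TCSANOW, saved)
```

If the attributes were saved while vi had the terminal in raw mode,
reapplying them to the new pty should leave it in raw mode as well, so
the manual keystrokes would no longer be needed. The saved attributes
would of course have to travel with the checkpoint to the remote host.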

More interestingly, EMACS does not work. Upon restart, the first keystroke
causes EMACS to die on signal 29, `a pollable event occurred,' which
suggests to me that the checkpointing API is not properly restoring the
state of a select() call. Obviously, more hacking is needed.... :)
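
For anyone who wants to poke at this: on Linux/x86 signal 29 is SIGIO
(also known as SIGPOLL), and its default action is to terminate the
process. The sketch below, in Python for brevity, only demonstrates that
default action in a throwaway child process; it does not reproduce the
EMACS failure and says nothing about why the signal is raised after a
restart. The function name is hypothetical.

```python
import os
import signal


def child_dies_on_default_sigio():
    """Fork a child, send it SIGIO with the default disposition, and
    report whether the child was terminated by that signal."""
    pid = os.fork()
    if pid == 0:
        # Child: make sure SIGIO has its default action, then raise it.
        signal.signal(signal.SIGIO, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGIO)
        os._exit(0)  # only reached if the signal did not kill us
    # Parent: collect the child's exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGIO
```

So whatever causes SIGIO to be delivered after the restart, a process
that is not prepared to handle it at that moment will die exactly as
described.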

Note that Eduardo's solution is different from the one I discussed here
a few months ago: Eduardo saves state for later reconstruction on the
remote host, rather than `forwarding' system calls over the network.
This has advantages and disadvantages: Eduardo can handle migrating
multi-process jobs (which is difficult when forwarding calls) but can't
handle sockets (which is easy when forwarding). Another disadvantage is
that calls such as `gethostname()' and `getpid()' will return different
values after a migration has occurred (unless we change this); if we
prevent getpid() from changing, then we must map calls to raise(),
signal(), etc., so that they continue to behave correctly after the
migration.
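
To make the getpid() point concrete, here is a minimal user-level sketch
(Python for brevity) of the kind of mapping that would be needed: the job
keeps seeing the pid it had before migration, and signal-sending calls
are translated to the real pid on the new host. The PidMap class and its
methods are hypothetical; nothing like this exists in Eduardo's API or in
GNU Queue as released.

```python
import os


class PidMap:
    """Hypothetical table mapping the pid a job saw before migration
    (its 'virtual' pid) to the real pid it has on the current host."""

    def __init__(self):
        self._virt_to_real = {}
        self._real_to_virt = {}

    def register(self, virtual_pid, real_pid):
        """Record, or update after a migration, the real pid for a virtual pid."""
        old_real = self._virt_to_real.get(virtual_pid)
        if old_real is not None:
            del self._real_to_virt[old_real]
        self._virt_to_real[virtual_pid] = real_pid
        self._real_to_virt[real_pid] = virtual_pid

    def virtual_for(self, real_pid):
        """What a wrapped getpid() would return; unknown pids pass through."""
        return self._real_to_virt.get(real_pid, real_pid)

    def to_real(self, virtual_pid):
        """Translate a pid used by the job into the pid the kernel knows."""
        return self._virt_to_real.get(virtual_pid, virtual_pid)

    def kill(self, virtual_pid, sig):
        """Wrapped kill(): deliver sig to the process behind virtual_pid."""
        os.kill(self.to_real(virtual_pid), sig)
```

After each migration the restart code would call register() again with
the same virtual pid and the new real pid, so the job never notices the
change. A real implementation would have to sit in the kernel or in an
interposed libc layer; this only shows the bookkeeping.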

In any event, if anyone wants to experiment with or develop applications
for a working checkpoint migration system that handles interactive and
multi-process jobs, GNU Queue 1.20.1-pre2, with experimental
checkpointing support, is in early development pre-release (i.e., with
massive debugging messages turned on, &c). It can be fetched off GNU
Queue's homepage, http://bioinfo.mbb.yale.edu/~wkrebs/queue.html .

As always, suggestions & comments are welcome; GNU Queue wouldn't be
what it is today without them.
