Date: Sun, 07 Mar 1999 14:49:55 -0500
From: Werner Krebs <werner.krebs@yale.edu>
To: linux-kernel@vger.rutgers.edu
Subject: GNU Queue now with Linux kernel checkpoint API.

Hi folks.

As promised earlier here, I've added support for two different
checkpointing methods to GNU Queue: user-level (which is
cross-platform) and kernel-level (which uses Eduardo Pinheiro's
checkpointing API for the Linux kernel).

As you know, checkpoint migration allows running jobs to be moved
dynamically between machines in a cluster. GNU Queue was originally a
load-balancing system that specialized in interactive job balancing,
so the goal is basically to be able to fire up any job and have it
behave, as far as the user is concerned, as if it were running on the
local terminal.

The kernel API is the one pertinent to this list. GNU Queue actually
makes a very nice wrapper for Eduardo's API (which was originally
written for the Nomad distributed operating system, still in
development; there is a wrapper specific to Beowulf clusters, but
AFAIK GNU Queue is the only other wrapper). The API does not support
sockets (or, apparently, ptys), and GNU Queue re-creates these as part
of its normal operation, so there is a nice synergy.

With the API, GNU Queue can checkpoint migrate the interactive 'vi'
editor without recompilation. Well, sort of. You need to type
':wq[Return]' because under the Queue 1.20.1-pre2 development release
the freshly created terminal starts in cooked mode upon restart. This
command causes vi to issue the call that switches the terminal back
into raw mode. (Interestingly, SIGCONT and SIGWINCH don't work for
this.) This could probably be fixed either in the kernel API (by
having the API restore terminal state upon restart, which is probably
the correct solution) or at the user level (by having GNU Queue
memorize terminal state and restore it upon restart).

More interestingly, EMACS does not work.
Upon restart, the first keystroke causes EMACS to die on signal 29,
`a pollable event occurred,' which suggests to me that the
checkpointing API is not properly restoring the state of a select()
call. Obviously, more hacking is needed.... :)

Note that Eduardo's solution is different from the one I discussed
here a few months ago: Eduardo saves state for later reconstruction on
the remote host, rather than `forwarding' system calls over the
network. This has advantages and disadvantages: Eduardo's approach can
handle migrating multi-process jobs (which is difficult when
forwarding calls) but can't handle sockets (which is easy when
forwarding calls). Another disadvantage is that calls such as
`gethostname()' and `getpid()' will return different values after a
migration has occurred (unless we change this); if we prevent getpid()
from changing, then we must also map calls to raise(), signal(), etc.,
so that they continue to behave correctly after the migration.

In any event, if anyone wants to experiment with or develop
applications for a working checkpoint migration system that handles
interactive and multi-process jobs, GNU Queue 1.20.1-pre2, with
experimental checkpointing support, is in early development
pre-release (i.e., with massive debugging messages turned on, &c). It
can be fetched off GNU Queue's homepage,
http://bioinfo.mbb.yale.edu/~wkrebs/queue.html .

As always, suggestions & comments are welcome; GNU Queue wouldn't be
what it is today without them.
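
The user-level fix for the cooked-mode restart (memorize the terminal
state, then restore it on the re-created terminal) could be sketched
roughly as follows. This is only an illustration under stated
assumptions, not GNU Queue's actual code: it is written in Python for
brevity, it relies on the POSIX termios interface, the function names
are hypothetical, and two local ptys stand in for the terminal before
the checkpoint and the fresh one created upon restart. A real
migration would also have to carry the saved attributes to the remote
host.

```python
import os
import pty
import termios
import tty


def save_terminal_state(fd):
    """Return the terminal attributes of fd (hypothetical helper).

    termios.tcgetattr() returns a list:
    [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    """
    return termios.tcgetattr(fd)


def restore_terminal_state(fd, saved):
    """Apply previously saved attributes to a (possibly new) terminal."""
    termios.tcsetattr(fd, termios.TCSANOW, saved)


def is_canonical(fd):
    """True if fd is in canonical ("cooked") input mode."""
    lflag = termios.tcgetattr(fd)[3]
    return bool(lflag & termios.ICANON)


def demo():
    # Assumption: this pty stands in for the job's terminal before the
    # checkpoint; an editor such as vi would have put it into raw mode.
    master_a, slave_a = pty.openpty()
    tty.setraw(slave_a)
    saved = save_terminal_state(slave_a)

    # Assumption: this pty stands in for the terminal re-created on
    # restart, which comes up in cooked mode.
    master_b, slave_b = pty.openpty()
    cooked_before = is_canonical(slave_b)

    # Restoring the memorized state puts the new terminal back into
    # raw mode without any help from the application.
    restore_terminal_state(slave_b, saved)
    cooked_after = is_canonical(slave_b)

    for fd in (master_a, slave_a, master_b, slave_b):
        os.close(fd)
    return cooked_before, cooked_after


if __name__ == "__main__":
    print(demo())
```

The same save/restore pair would apply equally if the kernel API did
the work instead; the sketch only shows that the state in question is
the termios attribute set, which can be captured before the checkpoint
and re-applied afterwards.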