[LWN.net]

Kernel development


The current kernel release is still 2.4.7. The 2.4.8 prepatch is currently at 2.4.8pre3; it includes the usual collection of fixes, along with the single-use patch from Daniel Phillips which was covered last week. There have been complaints that the 2.4.8pre series is much slower on systems with large amounts of memory; the VM hackers are currently hot on the trail of those problems.

Users of Adaptec adaptors (i.e. your editor, grumble grumble...) on SMP systems were unpleasantly surprised by 2.4.8pre2, which crashed on boot. The check that caused the crash has been removed, but a strange problem still appears to lurk in there somewhere.

Alan Cox's latest patch is 2.4.7ac3. It contains a great many architecture-specific changes; slowly the kernel trees for the various ports are finding their way back toward the mainline. There are also some enhancements for User-Mode Linux and many miscellaneous fixes.

A new kernel API for completion events. It is common in kernel code to set some sort of process in motion, then to go to sleep and wait until that process completes. There are several ways of implementing the "wait for completion" part; which is the proper one to use depends on the specific situation. Until 2.4.7 came out, one technique used involved semaphores. The initiating process would declare a semaphore as a local variable (i.e. on the stack), starting out in the locked state; the process would do what was needed to arrange for some work to be done, then wait on the semaphore. The code actually doing the work would simply unlock the semaphore when the task was complete.

On the surface, this technique is appealing because it avoids some obvious race conditions. If, for example, the work gets done before the kernel gets around to waiting on the semaphore, it notices that fact and simply doesn't wait. The sleep_on() and wake_up() calls can be much trickier to use correctly in this situation. But, as it turns out, there is a race condition here too, which is a result of how the semaphores themselves work.

When a semaphore is to be unlocked, the code (1) sets the semaphore itself to the unlocked state, then (2) calls wake_up() to notify any processes that might have been waiting on the semaphore. If the waiter tests the semaphore between those two steps, it will never actually wait, and may well execute the rest of its code before the wake_up() call happens. That is not normally a problem, but, if the semaphore is sitting on a kernel stack somewhere, it could cease to exist before the wake_up() call runs, and that call requires data from the semaphore. In other words, wake_up() could be working with a pointer into random memory; the technical term for this is "oops." This particular race is highly unlikely to ever actually happen, but it's still a race.
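To make that window concrete, here is a toy, single-threaded replay of the two orderings in plain C. Nothing in it is kernel code: struct toy_sem, its "alive" flag, and every function name are invented for illustration, and the disappearing stack frame is modeled by clearing a flag rather than by actually releasing memory.

```c
#include <stdbool.h>

/* Toy stand-in for a semaphore living on the waiter's stack (hypothetical). */
struct toy_sem {
	int  count;	/* 0 = locked, 1 = unlocked */
	bool alive;	/* cleared once the waiter's stack frame is gone */
};

/* Unlock, step 1: mark the semaphore as unlocked. */
static void toy_unlock_step1(struct toy_sem *s)
{
	s->count = 1;
}

/*
 * Unlock, step 2: the wake-up, which needs data from the semaphore.
 * Returns false if the semaphore no longer exists at that point.
 */
static bool toy_unlock_step2(const struct toy_sem *s)
{
	return s->alive;
}

/* Waiter's test: take the semaphore without sleeping if it is unlocked. */
static bool toy_try_take(struct toy_sem *s)
{
	if (s->count <= 0)
		return false;
	s->count--;
	return true;
}

/* The waiter: if it never has to sleep, it returns and its frame dies. */
static void toy_waiter(struct toy_sem *s)
{
	if (toy_try_take(s))
		s->alive = false;
}

/*
 * Replay one ordering of the two threads. Returns true when the
 * wake-up step ran against a semaphore that had already gone away.
 */
bool wakeup_hits_dead_semaphore(bool waiter_runs_between_steps)
{
	struct toy_sem sem = { .count = 0, .alive = true };
	bool wakeup_ok;

	toy_unlock_step1(&sem);
	if (waiter_runs_between_steps)
		toy_waiter(&sem);		/* the unlucky ordering */
	wakeup_ok = toy_unlock_step2(&sem);
	if (!waiter_runs_between_steps)
		toy_waiter(&sem);		/* the usual ordering */
	return !wakeup_ok;
}
```

Calling wakeup_hits_dead_semaphore(true) replays the unlucky ordering and reports a stale wake-up; passing false replays the usual ordering, which is harmless. A real kernel would hit this only when the scheduler happens to interleave the two threads in exactly that way.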

The performance of this approach is also suboptimal, because semaphores are optimized for the unlocked case. In this particular situation, the semaphore will almost always be locked.

Linus chose not to change the semaphore implementation (it's "painful as hell"); instead, he created a new interface for the handling of completion events. All a process need do to use this facility is to create and initialize a completion structure:

	struct completion event;
	init_completion(&event);

Then it can set things in motion, and call:

	wait_for_completion(&event);

to sleep until things are done. The task actually doing the work can perform a simple call to:

	complete(&event);

and the waiting process wakes up.
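For readers who want to experiment with the pattern outside the kernel, the following is a rough user-space analogue built on POSIX threads. It is a sketch only, not the kernel's implementation: the toy_ names, struct toy_request, and the "double the input" work are all made up for this example, and a mutex plus condition variable stand in for whatever the kernel does internally.

```c
#include <assert.h>
#include <pthread.h>

/* User-space stand-in for a completion (hypothetical, not kernel code). */
struct toy_completion {
	pthread_mutex_t lock;
	pthread_cond_t  wait;
	int             done;	/* number of completions not yet consumed */
};

static void toy_init_completion(struct toy_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->wait, NULL);
	c->done = 0;
}

/* Sleep until toy_complete() has been called, then consume that event. */
static void toy_wait_for_completion(struct toy_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->wait, &c->lock);
	c->done--;
	pthread_mutex_unlock(&c->lock);
}

/* Record the event and wake a waiter, all while holding the lock. */
static void toy_complete(struct toy_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->wait);
	pthread_mutex_unlock(&c->lock);
}

/* A unit of work handed to another thread; lives on the caller's stack. */
struct toy_request {
	struct toy_completion event;
	int input;
	int result;
};

/* The task actually doing the work. */
static void *toy_worker(void *arg)
{
	struct toy_request *req = arg;

	req->result = req->input * 2;
	toy_complete(&req->event);
	return NULL;
}

/* Set things in motion, then sleep until they are done. */
int toy_run_request(int input)
{
	struct toy_request req = { .input = input, .result = 0 };
	pthread_t tid;
	int rc;

	toy_init_completion(&req.event);
	rc = pthread_create(&tid, NULL, toy_worker, &req);
	assert(rc == 0);
	toy_wait_for_completion(&req.event);
	pthread_join(tid, NULL);	/* worker has fully exited before req goes away */
	return req.result;
}
```

Built with something like "cc -pthread", a call such as toy_run_request(21) hands the work to a second thread and returns once that thread has signaled completion. The point to notice is that toy_complete() changes the state and issues the wake-up inside one locked region, so a waiter cannot observe "done" in the gap between the two steps.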

It's a relatively straightforward solution, even if changing APIs in the middle of a stable kernel series may look a little strange. If nothing else, the whole affair makes it clear, once again, just how hard it is to avoid race conditions in kernel code.

The first initramfs patch was posted by Alexander Viro this week. This patch is the implementation of the new 2.5 boot process that was first discussed in the July 12 kernel page. In this scheme, the kernel executable image carries with it a cpio archive containing the contents of the initial root filesystem. That archive is loaded into a ramdisk at boot time, after which it can be used to continue the system initialization process.

The hope is to move much kernel initialization code out of kernel space and into this ramdisk. The result is a smaller kernel and more flexibility in how the bootstrap process is set up. For the moment, the tasks that have been moved to user space include:

  • Finding and mounting the real (permanent) root filesystem. NFS root filesystems are handled here as well.
  • Setting up any initial ramdisk (usually for the purpose of loading kernel modules needed for the boot process).
  • Running the linuxrc boot script.
  • Finding the real init process and running it.

There is more that can be moved into this filesystem, but that's a good start. The claim is that kernels running with this patch will function identically to unpatched ones; no boot setups should be broken or require changes. Mr. Viro would, of course, like to hear from anybody with evidence to the contrary.
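For those curious about what unpacking a cpio archive involves, here is a small sketch of decoding one entry header. The patch announcement as described here does not say which cpio variant is used; this example assumes the ASCII "newc" format (magic "070701" followed by thirteen 8-digit hexadecimal fields), and the struct and function names are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

#define NEWC_MAGIC      "070701"
#define NEWC_HEADER_LEN 110	/* 6-byte magic + 13 fields of 8 hex digits */

/* The few header fields this sketch cares about (hypothetical struct). */
struct newc_entry {
	unsigned long mode;	/* file type and permission bits */
	unsigned long filesize;	/* bytes of file data after the name */
	unsigned long namesize;	/* length of the name, including its NUL */
};

/* Decode one 8-digit hex field; returns 0 on success, -1 on a bad digit. */
static int newc_field(const char *p, unsigned long *out)
{
	unsigned long v = 0;
	int i;

	for (i = 0; i < 8; i++) {
		char c = p[i];
		unsigned long d;

		if (c >= '0' && c <= '9')
			d = (unsigned long)(c - '0');
		else if (c >= 'a' && c <= 'f')
			d = (unsigned long)(c - 'a') + 10;
		else if (c >= 'A' && c <= 'F')
			d = (unsigned long)(c - 'A') + 10;
		else
			return -1;
		v = (v << 4) | d;
	}
	*out = v;
	return 0;
}

/*
 * Parse the fixed-size header at hdr. Field order in newc is: ino, mode,
 * uid, gid, nlink, mtime, filesize, devmajor, devminor, rdevmajor,
 * rdevminor, namesize, check. Returns 0 on success, -1 on error.
 */
int newc_parse_header(const char *hdr, size_t len, struct newc_entry *e)
{
	if (len < NEWC_HEADER_LEN || memcmp(hdr, NEWC_MAGIC, 6) != 0)
		return -1;
	if (newc_field(hdr + 6 + 8 * 1, &e->mode) ||
	    newc_field(hdr + 6 + 8 * 6, &e->filesize) ||
	    newc_field(hdr + 6 + 8 * 11, &e->namesize))
		return -1;
	return 0;
}

/*
 * Distance from the start of this header to the start of the next one:
 * header plus name is padded to a multiple of 4, then so is the data.
 */
size_t newc_next_offset(const struct newc_entry *e)
{
	size_t name_end = (NEWC_HEADER_LEN + e->namesize + 3) & ~(size_t)3;

	return (name_end + e->filesize + 3) & ~(size_t)3;
}
```

A loader built along these lines would walk the archive by parsing a header, reading the name and data that follow it, and advancing by newc_next_offset() until it reaches the end-of-archive entry.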

Heading toward ext3 1.0. ext3 2.4-0.9.5 was released by Andrew Morton. This version continues the work toward a truly stable ext3 journaling filesystem release, fixing a number of bugs. Much work has also gone into performance improvements on a number of fronts. Among other things, synchronous operations happen more quickly; this should make people running large mail systems happy, since many mail transfer agents make heavy use of synchronous directory operations.

Another change in 0.9.5 is the ability to use an external journal. External journals live on a separate device (perhaps a non-volatile RAM device), and, in theory, can speed up the operation of the filesystem. Writes to an external journal should be very quick, and journal operations will not contend with writes to the rest of the disk. The initial performance results with external journals appear to be mixed, however.

Those interested in ext3 may also want to see an older patch announcement from Andrew which contains a detailed explanation of the three journaling modes supported.

Much slower routing performance in 2.4 has been reported by some users. The common factor in these reports is that the people involved are still using the 2.2 ipchains interface to set up their firewalling. The ipchains module in 2.4 carries full connection tracking along with it; most people setting up ipchains rules probably do not need that feature. The solution is to switch to iptables.

Other patches and updates released this week include:

  • Daniel Phillips has posted a new version of his patch for the handling of pages that are used only once.

  • Anton Altaparmakov has released version 1.2.0 of the Linux-NTFS support tools.

  • Also from Anton is this patch which adds support for Windows 2000/XP dynamic disks.

  • David Schleef has posted Comedi-0.7.60, a collection of data acquisition device drivers.

  • Alan Cox has modified the kernel Makefile to add a "make rpm" target. The result, of course, is an RPM file containing the compiled kernel. A "make deb" option will likely be added in the near future.

  • Milan Pikula has started a new mailing list for those who are interested in filesystem repair and crash recovery topics.

  • An Mwave modem driver for 2.4.7 has been released by Paul Schroeder.

  • devfsd v1.3.12 was released by Richard Gooch.

  • Richard also released a patch that, when used with devfs, enables a 2.4 kernel to support up to ("approximately") 2144 SCSI disks. He warns that it is untested and could result in filesystem corruption. There have been few problem reports, but it turns out that, for now, limitations in other parts of the system will still hold the maximum number of disks to far less than 2144.

  • The Linux Test Project has released ltp-20010801, the latest version of its kernel test suite.

  • Andreas Gruenbacher has posted an access control list patch for 2.4.7.

  • Constantin Loizides has been working with a number of journaling filesystems to determine the degree to which they experience fragmentation under long-term, sustained use. He has posted his findings; the results vary significantly between the various filesystems.

  • Keith Owens has released a 2.5 kbuild release candidate.

  • Adam Goode has started a project to write a driver for the Logitech iFeel mouse. This device is fun in that it can be used to provide tactile feedback to the user - little bumps as the pointer moves over buttons and such.

  • mdctl 0.4 was released by Neil Brown; it is a utility for controlling RAID devices, meant to replace mkraid, raidstart, etc.

  • Version 2.2.0 of the Functionally Overloaded Linux Kernel patch is now available; it has almost anything one could imagine, including several kitchen sinks. FOLK creator Jonathan Day informs us that the size of the patch is now 1/3 that of the standard kernel.

Section Editor: Jonathan Corbet


August 2, 2001

Eklektix, Inc. Linux powered! Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds