[LWN.net]



Kernel development


The current kernel release is still 2.4.1. The two usual prepatch tracks are in full swing. On the Linus side, there is 2.4.2pre1, released just after LinuxWorld. It contains a small set of fixes, and doesn't yet deal with the known 2.4.1 problems (see below). Alan Cox, meanwhile, has released 2.4.1ac5, which contains a much larger set of fixes.

On the 2.2 kernel front Alan has released 2.2.19pre8. There are still, apparently, a few things yet to go into this patch, so the real 2.2.19 release is not yet imminent.

Some difficulties with 2.4.1. While many (most) users are running 2.4.1 without trouble, there are a couple of issues that have come up which are worth knowing about. They are:

  • There is a bug in the handling of Unix datagram sockets which locks up the kernel - or at least one processor on SMP systems. Chris Evans has posted a simple test program which demonstrates the bug - don't run it on your big server. A patch for this bug exists, and will certainly be merged into the next kernel prepatches.

  • Hans Reiser has posted a message on stability problems with ReiserFS. There are currently five outstanding bugs with this filesystem, not all of which yet have fixes available (one of them looks like hardware problems, rather than a real ReiserFS bug).

Neither of these issues is all that surprising. Every major stable kernel release seems to have one denial of service bug lurking somewhere; it takes a larger testing community to flush it out. Similarly, ReiserFS is now seeing testing on a far larger scale than it ever has in the past, and a few surprises are certain to show up. This is the late stage of the free software development process in action; fixes are being made quickly, and the end result will be a more stable kernel.

ReiserFS can also cause system crashes for a reason that is not a ReiserFS bug. It seems that some people are building the 2.4.x kernel with Red Hat's "gcc-2.96" compiler that was shipped with Red Hat 7. That compiler has some, um, issues, and it miscompiles some of the ReiserFS code. If you're running a recent Red Hat system, be sure to build your kernels with "kgcc," or at least get the latest, patched gcc from Red Hat (which is said to work much better).

The great kiobuf debate. Recently, a fairly fierce debate has been filling up mailboxes on the linux-kernel and kiobuf-io-devel mailing lists. It all has to do with the kiobuf data structure, which was, until recently, seen as a generally good addition to the kernel in the 2.3 series.

The kiobuf structure was added, initially, to support raw disk I/O; kiobufs and their supporting routines make it easy for kernel code to move data directly between user space and a device, without an intervening copy into kernel space, and without having to worry about the ugly details of memory management. Their use has slowly grown; in the 2.4.1 kernel kiobufs can be found in the generic SCSI (sg) driver and in the logical volume manager code. There is also a patch floating around that uses kiobufs to implement direct, user-to-user pipes. And SGI's XFS patch not only uses kiobufs, but modifies the block I/O subsystem to make them integral to disk I/O.

One would think that kiobufs were taking over, except for the little fact that the zero-copy networking patches do not use them. Instead, a new and completely different mechanism for direct userspace access was created. In the discussion that followed, it turned out that quite a few people, including Linus, are not pleased with the kiobuf design.

In a (very) simplified way, that design is as follows: a kiobuf, in the end, consists of an array of struct page structures, along with an initial offset and a total length value. By using page structures directly, the kiobuf allows the code using it to avoid dealing with the virtual memory entirely - a struct page refers directly to a physical page. The initial offset tells where, in the first page, the data starts; all the remaining pages are filled with data starting at the beginning. A kiobuf thus describes a single, contiguous area; working with multiple areas requires using a "kiovec" - an array of kiobufs - instead.
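The layout just described can be modeled in ordinary user-space C. What follows is only a sketch of the description above, not the kernel's actual definition: the `sketch_` names, the fixed array size, and the 4096-byte page size are all assumptions made for illustration, and a plain byte array stands in for the kernel's struct page.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_MAX_PAGES 16     /* hypothetical fixed limit for the page array */

/* Stand-in for the kernel's struct page: here, simply one page of bytes. */
struct sketch_page {
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Simplified kiobuf-like descriptor: a page array, one offset, one length. */
struct sketch_kiobuf {
    struct sketch_page *pages[SKETCH_MAX_PAGES];
    size_t nr_pages;  /* number of valid entries in pages[] */
    size_t offset;    /* where the data starts within the first page */
    size_t length;    /* total number of bytes described */
};

/* Number of pages needed to hold `length` bytes starting at `offset`. */
static size_t sketch_pages_needed(size_t offset, size_t length)
{
    if (length == 0)
        return 0;
    return (offset + length + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Return a pointer to the byte at logical position `pos` in the buffer,
 * or NULL if `pos` lies outside the described area. Only the first page
 * is offset; every later page is used from its first byte.
 */
static unsigned char *sketch_kiobuf_byte(struct sketch_kiobuf *kb, size_t pos)
{
    size_t abs_pos, idx;

    assert(kb != NULL);
    if (pos >= kb->length)
        return NULL;
    abs_pos = kb->offset + pos;
    idx = abs_pos / SKETCH_PAGE_SIZE;
    if (idx >= kb->nr_pages)
        return NULL;
    return &kb->pages[idx]->data[abs_pos % SKETCH_PAGE_SIZE];
}
```

With an offset of 4000 and a length of 200, for example, the buffer spans two pages: the first 96 bytes sit at the end of the first page, and byte 96 is the first byte of the second page.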

The objections to this design include:

  • It is said to be a very heavyweight structure. Kiobufs are a bit large, mostly due to the incorporation of an array for the page structures. Ingo Molnar has characterized kiobufs as "big fat monster-trucks of IO workload."

  • Kiobufs do not handle scatter/gather operations (those which work from multiple, noncontiguous memory areas) very gracefully; such an operation requires setting up a kiovec and using several kiobufs which, as previously noted, are already criticized as being too large. Networking, in particular, makes heavy use of scatter/gather I/O, and needs to be able to set up and tear down structures very quickly.

  • One of the reasons that kiobufs are difficult for scatter/gather operations is that they assume that all data is aligned on page boundaries, with the exception of the first page. That tends to be true for disk I/O, but is rarely the case for networking. Linus, in particular, doesn't want any page alignment assumptions in this sort of code.
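The alignment assumption in the last objection can be made concrete with a small check. This is a hypothetical user-space sketch, not kernel code: `struct frag`, `fits_single_kiobuf()`, and the 4096-byte page size are invented for illustration. It asks whether a list of per-page fragments could be described by one offset and one total length, which is all a single kiobuf provides.

```c
#include <stdbool.h>
#include <stddef.h>

#define FRAG_PAGE_SIZE 4096u  /* assumed page size */

/* One piece of data lying within a single page (hypothetical type). */
struct frag {
    size_t offset;  /* where the data starts within its page */
    size_t length;  /* number of bytes in this page */
};

/*
 * Return true if the fragments could be covered by one "initial offset
 * plus total length" descriptor: only the first fragment may start past
 * the beginning of its page, and only the last may stop short of the end.
 */
static bool fits_single_kiobuf(const struct frag *f, size_t n)
{
    size_t i;

    if (n == 0)
        return false;
    for (i = 0; i < n; i++) {
        /* Each fragment must be non-empty and stay within its page. */
        if (f[i].length == 0 || f[i].offset >= FRAG_PAGE_SIZE ||
            f[i].length > FRAG_PAGE_SIZE - f[i].offset)
            return false;
        /* Every page after the first must start at offset 0. */
        if (i > 0 && f[i].offset != 0)
            return false;
        /* Every page before the last must be filled to its end. */
        if (i + 1 < n && f[i].offset + f[i].length != FRAG_PAGE_SIZE)
            return false;
    }
    return true;
}
```

A disk-style request such as {512, 3584}, {0, 4096}, {0, 1024} passes the check, while a network-style pair such as a 54-byte header at offset 100 followed by 1400 bytes at offset 7 does not; under the design described above, the latter would need one kiobuf per fragment, gathered in a kiovec.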

In the end, the fight seems to boil down to this: should a kiobuf include an array of offset/length pairs for each page within the buffer? With such an array, scatter/gather operations could be described with a single kiobuf, and the kiovec idea could go away.
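The alternative being argued for might look something like the following user-space sketch, in which every page carries its own offset and length. None of this is taken from a posted patch; the `sg_` names, the fixed segment limit, and the 4096-byte page size are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SG_PAGE_SIZE 4096u  /* assumed page size */
#define SG_MAX_SEGS  8      /* hypothetical fixed limit on segments */

/* Stand-in for the kernel's struct page: one page of bytes. */
struct sg_page {
    unsigned char data[SG_PAGE_SIZE];
};

/* One segment: a page plus its own offset and length. */
struct sg_segment {
    struct sg_page *page;
    size_t offset;
    size_t length;
};

/* A single buffer describing several noncontiguous areas. */
struct sg_buf {
    struct sg_segment segs[SG_MAX_SEGS];
    size_t nr_segs;
};

/* Append a segment; returns 0 on success, -1 if full or out of bounds. */
static int sg_buf_add(struct sg_buf *b, struct sg_page *page,
                      size_t offset, size_t length)
{
    assert(b != NULL && page != NULL);
    if (b->nr_segs >= SG_MAX_SEGS)
        return -1;
    if (offset > SG_PAGE_SIZE || length > SG_PAGE_SIZE - offset)
        return -1;
    b->segs[b->nr_segs].page = page;
    b->segs[b->nr_segs].offset = offset;
    b->segs[b->nr_segs].length = length;
    b->nr_segs++;
    return 0;
}

/* Total number of bytes described by all segments. */
static size_t sg_buf_total(const struct sg_buf *b)
{
    size_t i, total = 0;

    assert(b != NULL);
    for (i = 0; i < b->nr_segs; i++)
        total += b->segs[i].length;
    return total;
}

/* Copy the segments, in order, into `out`; returns the bytes copied. */
static size_t sg_buf_gather(const struct sg_buf *b, unsigned char *out,
                            size_t out_len)
{
    size_t i, done = 0;

    assert(b != NULL && out != NULL);
    for (i = 0; i < b->nr_segs && done < out_len; i++) {
        size_t n = b->segs[i].length;

        if (n > out_len - done)
            n = out_len - done;
        memcpy(out + done, b->segs[i].page->data + b->segs[i].offset, n);
        done += n;
    }
    return done;
}
```

Here a packet header at one arbitrary offset and a payload at another fit into a single descriptor with no alignment requirement. The cost, in this sketch, is two extra fields per page.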

Linus, certainly, takes the position that the offset and length values should be pushed down deep in the structure in this way. Kiobuf designer Stephen Tweedie, however, disagrees: in his view, putting the length and offset at that level would make it hard to get the completion status of any individual segment and would tend to split apart large requests which should really stay together.

The discussion then wandered into whether the venerable buffer head structure could be made to do what kiobufs do. A number of people seem to think that it could, especially if the block I/O API were modified to make it easy to submit large chains of buffer heads as a single operation. But no code for this use of buffer heads has, as yet, been forthcoming.

This issue, clearly, goes pretty deeply into how fundamental operations are performed in the kernel. For this reason, the design issues involved seem to touch a number of nerves. It will probably be some time before a real resolution is reached; those who are programming with kiobufs, however, should be prepared to see the interface change...

The first public Linux-NTFS release is out; see the announcement for details. This release makes it possible to mount NT filesystems in a writable mode under Linux. It's not yet perfect, however; when it writes to an NTFS partition it leaves a bit of damage behind. For the short term, it was evidently easier to provide a separate utility ("ntfsfix") which fixes things up afterwards.

Other patches and updates released this week include:

  • David Miller continues to put out frequent zero-copy networking patches; this patch also, currently, contains the fix to the Unix datagram bug.

  • Jeff Merkey has released version v1.1-7 of his driver for Dolphin Scalable Coherent Interface adapters.

  • A new kernel development mailing list has been created by Ingo Oeser; it is intended to host discussion of a wide range of operating system techniques, not just those in use in the Linux kernel.

  • devfs-v99.19 was posted by Richard Gooch; it is a backport of the latest devfs code to the 2.2.18 kernel. He has also posted devfsd-v1.3.11, the devfs daemon that is needed to use a devfs-enabled kernel.

  • Rusty Russell has released code to generate a graph of the 2.4.0 kernel. The code requires several hours to run, and, on some systems, the graph has proven a little difficult to generate.

  • Juergen Schneider has posted a patch which adds an animated boot logo to the framebuffer driver.

  • Robert H. de Vries has posted a new version of his POSIX timers patch. This time around, Linus responded that he'll not be applying the patch anytime soon, since he does not like the implementation.

  • The USAGI Project (USAGI = "UniverSAl playGround for Ipv6") has announced the second stable release of its system, which features support for both the 2.2.18 and 2.4.0 kernels.

Section Editor: Jonathan Corbet


February 8, 2001

Eklektix, Inc. Linux powered! Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds