See also: last week's Kernel page.

Kernel development


The current development kernel release is still 2.5.0. Linus's current prepatch is 2.5.1-pre5. With recent prepatches, life has gotten interesting; we have a true development kernel once again. Things that have gone into 2.5.1 so far include:
  • The new driver model implemented by Patrick Mochel. This code implements a system-wide tree of all devices which will be helpful for system configuration and power management tasks; it was covered in the October 25 LWN kernel page.

  • The beginnings of the block layer thrash-up (see below).

  • Richard Gooch's new devfs core code. The end result of this work should be a more stable devfs, but it's giving some people difficulties at the moment; approach with care.
In general, it pays to be careful with the 2.5.1 prepatches. Some of the changes are truly disruptive, and a bit of instability is to be expected for a while yet.

The current stable kernel release is 2.4.16. Marcelo ("the wonder penguin") has released 2.4.17-pre4, which contains a relatively lengthy list of fixes and updates. Here, too, the new devfs code is causing difficulties for some users.

On the 'design' of Linux. For those who haven't yet seen it elsewhere, here's Linus's widely circulated 'Linux wasn't designed' message. In another message, Linus said more about how he thinks software gets built:

It's "directed mutation" on a microscopic level, but there is very little macroscopic direction. There are lots of individuals with some generic feeling about where they want to take the system (and I'm obviously one of them), but in the end we're all a bunch of people with not very good vision.

And that is GOOD.

It does seem that quite a bit of progress can be made, even with poor vision.

Ripping up the block layer. It has long been understood that the 2.5 development series would include major changes to the block (disk) I/O layer. The block code has no end of performance problems, especially on high-end systems; it's also quite ugly in a number of places. So, the integration of Jens Axboe's new block I/O code, while highly disruptive, is a good thing.

Since 2.2, much of the block I/O subsystem has worked with a single spinlock, called io_request_lock. If the system was trying to figure out how to merge a request into a very long queue, or if a block driver was slow in figuring out what it wanted to do, all other block operations would have to stop and wait. This lock was serializing operations which had nothing to do with each other, and was an obvious scalability bottleneck.

With 2.5.1, that lock is no more; instead, each request queue (which, in well-written drivers, corresponds to a single device) has its own lock. This kind of change can be scary, since some drivers will have depended on the global serialization enforced by io_request_lock; its removal has the potential to create subtle and nasty bugs. It may be a little while before all the block drivers are known to be safe.
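
The difference between the two locking schemes can be illustrated with a small user-space sketch. None of this is kernel code: the toy_ names, the try-lock spinlock built on C11 atomic_flag, and the queue structure are all assumptions made for illustration. It shows only that, with one global lock, a holder working on one queue stalls work on every other queue, while per-queue locks keep unrelated queues independent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy try-lock spinlock; a stand-in for a kernel spinlock, not the real thing. */
struct toy_spinlock {
    atomic_flag held;
};

/* Returns true if the lock was free and is now held by the caller. */
bool toy_trylock(struct toy_spinlock *lock)
{
    return !atomic_flag_test_and_set(&lock->held);
}

void toy_unlock(struct toy_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* Old scheme: one lock shared by every queue (plays the role of io_request_lock). */
struct toy_spinlock toy_global_lock = { ATOMIC_FLAG_INIT };

/* Hypothetical request queue carrying its own lock, as in the new scheme. */
struct toy_queue {
    struct toy_spinlock lock;
    int nr_requests;
};

void toy_queue_init(struct toy_queue *q)
{
    assert(q != NULL);
    atomic_flag_clear(&q->lock.held);
    q->nr_requests = 0;
}

/* Old scheme: queue a request under the global lock.
 * Returning false means a real caller would have to spin and wait. */
bool toy_add_request_global(struct toy_queue *q)
{
    if (!toy_trylock(&toy_global_lock))
        return false;
    q->nr_requests++;
    toy_unlock(&toy_global_lock);
    return true;
}

/* New scheme: only the lock belonging to this queue is taken. */
bool toy_add_request_per_queue(struct toy_queue *q)
{
    if (!toy_trylock(&q->lock))
        return false;
    q->nr_requests++;
    toy_unlock(&q->lock);
    return true;
}
```

If some code holds toy_global_lock while working on one queue, toy_add_request_global() fails for every queue; holding the lock of one toy_queue leaves toy_add_request_per_queue() on any other queue unaffected.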

Another problem with the old block code was its use of the "buffer head" ("bh") structure as the building block of the request queue. Higher-level code would go to some lengths to create large, contiguous block I/O requests, which would then be fragmented into a large number of single-block requests, each with its own buffer head. The elevator code then had the task of trying to merge the requests back together again.

Buffer heads are now a thing of the past, at least as a visible part of the block I/O interface. Block I/O requests are now described by a new bio structure which, in turn, contains a list of bio_vec structures describing the data to be transferred. The bh structure included a virtual pointer to the data to be transferred; the new structures, instead, contain struct page pointers directly into the system memory map.
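
To make the shape of the new request description concrete, here is a simplified user-space sketch. The toy_ structures are assumptions that mirror only what is described above (a request holding a list of segments, each pointing at a page); the real bio and bio_vec definitions in the kernel headers carry more fields and should be consulted directly.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u     /* assumed page size for this sketch */

/* Stand-in for the kernel's struct page; only a frame number is kept. */
struct toy_page {
    unsigned long pfn;          /* hypothetical page frame number */
};

/* One segment of a transfer: a page plus a length and offset within it. */
struct toy_bio_vec {
    struct toy_page *page;
    unsigned int len;
    unsigned int offset;
};

/* One block I/O request, described as a list of segments. */
struct toy_bio {
    unsigned long long sector;  /* starting sector on the device */
    unsigned int vcnt;          /* number of entries in vecs */
    struct toy_bio_vec *vecs;
};

/* Total number of bytes the request will transfer. */
size_t toy_bio_bytes(const struct toy_bio *bio)
{
    size_t total = 0;

    for (unsigned int i = 0; i < bio->vcnt; i++) {
        /* A segment must not run past the end of its page. */
        assert(bio->vecs[i].offset + bio->vecs[i].len <= TOY_PAGE_SIZE);
        total += bio->vecs[i].len;
    }
    return total;
}
```

Note that nothing in the sketch holds a virtual address for the data: a segment names a page, which is what lets a single large request span pages that have no kernel mapping.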

Much of the kernel has moved toward working with page structures, often as a result of the challenges of dealing with high memory, which has no virtual mapping into kernel space. Block drivers will now have to deal with high memory directly, but support code has been provided to make that easier. The advantages of working with page structures are worth the trouble; in particular, handling large, clustered requests from the raw I/O layer (or the pending asynchronous I/O patch by Ben LaHaise) will be much easier.

Also included are the block-highmem patches, which enable DMA operations directly to and from high memory. With the 2.4 kernel, such operations require copying data via "bounce buffers" in low memory. Bounce buffers can create severe performance problems on large-memory systems, and they are (usually) entirely unnecessary.
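
The cost of bounce buffers can be pictured with a rough sketch. It assumes a deliberately simplified model in which a device can reach every physical address below a single limit; the toy_ names and that limit are hypothetical and are not taken from the patches. Any segment reaching past the limit has to be copied through low memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A contiguous piece of a transfer, identified by physical address. */
struct toy_segment {
    uint64_t phys_addr;
    size_t len;
};

/* dma_limit is the first address the device can NOT reach directly. */
bool toy_needs_bounce(const struct toy_segment *seg, uint64_t dma_limit)
{
    assert(seg != NULL);
    return seg->phys_addr + seg->len > dma_limit;
}

/* Bytes that would have to be copied through low-memory bounce buffers. */
size_t toy_bounce_bytes(const struct toy_segment *segs, size_t count,
                        uint64_t dma_limit)
{
    size_t copied = 0;

    for (size_t i = 0; i < count; i++)
        if (toy_needs_bounce(&segs[i], dma_limit))
            copied += segs[i].len;
    return copied;
}
```

Raising the limit to cover all of memory, which is roughly what DMA directly to high memory amounts to in this model, drives the copied byte count to zero.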

Finally, a whole set of support code has been added which hides much of the structure of the request queue from block drivers. Included is a nice routine for setting up DMA requests easily. The result is that all block drivers must be updated, but the resulting code should be simpler.

The block work is far from done, however; quite a bit of work is still pending. Jens has already stated his plan to break all of the block drivers again shortly. Upcoming changes include moving the building of SCSI-like commands into the generic block layer, and running ioctl() operations through the request queue so that they are automatically serialized with the I/O operations.

For more information, see Jens's writeup of the block I/O changes so far, and Suparna Bhattacharya's notes on the LSE web site.

Merging the new kbuild. Back at the Kernel Summit, it was agreed that one of the first things to happen in 2.5 would be the integration of the new kbuild code. Block I/O has jumped in first, but kbuild remains on the agenda. To push things forward, Keith Owens has proposed a schedule for the merging of kbuild. It calls for the new build code to be added in 2.5.2-pre1, and the old system to be ripped out in -pre2. The original plan called for deferring the integration of CML2 until 2.5.3, but Eric Raymond was less than thrilled with the idea. So a revised version of the timeline has CML2 going in simultaneously with kbuild. There are just a couple of obstacles to overcome, like the fact that the two do not currently work together. One assumes these little details can be dealt with.

There has been little comment on the plan to integrate the new kbuild; it does not appear to be a controversial change (though there is a little grumbling about the new kbuild being slower).

Most speakers, when giving a talk, try to be well tuned to signals from the audience. So, when your editor was addressing folks at Linux Kongress about 2.5 changes, the sound of vomiting from the seats got his attention. The subject at hand was, of course, CML2. This development remains controversial, and the talk of integrating it with kbuild started up the same old flame wars.

Said wars have been covered in this space in the past, and there is very little to add. In theory, Linus has said he will merge CML2 and the topic should be moot. Eric Raymond did not help things, however, with his statement that he plans to try to get Marcelo to integrate CML2 into the 2.4 tree as well. This idea, at least, is not controversial - almost nobody seems to think it's a good idea. The 2.4 kernel just does not need that sort of change.

With regard to 2.5, the main stumbling point still appears to be the use of Python 2 as the implementation language. One would think people could just install Python and be done with it, but it's apparently not so simple. Most of the dissenters are just grumbling, but there are a couple of other efforts out there. Greg Banks has a CML2 in C project going, though progress has pretty well stopped in recent months. Jan Harkes, instead, has put together a patch which ports the CML2 code to Python 1.5. Since the older Python is already installed on many older systems, one would hope this patch might help reduce the complaining somewhat.

But, then, as devfs shows, some developments never seem to reach a point of being accepted by everybody. (Current versions of these patches are kbuild 1.1.0 and CML2 1.9.4.)

Eliminating sleep_on. For years, the standard way to put a process to sleep within the kernel has been the sleep_on() function or its variants. sleep_on() simply blocks the calling process until somebody explicitly wakes it (or, in some cases, until a signal arrives or a timeout expires). On SMP systems, however, sleep_on() has a serious problem. Consider a typical usage:

    if (!something_ready)           /* "something" is not ready yet */
        sleep_on(&my_wait_queue);   /* a wakeup sent before this call is lost */

If the "something" becomes ready between the two lines of code, the wakeup event will be missed and the process may sleep for much longer than intended.

Workarounds for this problem have existed for a long time. The wait_event() macros handle this case without races; often semaphores or the newish "completion event" interface can be used. If all else fails, a relatively complicated "manual sleep" can be coded. All of these techniques are used in the kernel, but code that calls sleep_on() still exists.
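
The race, and the ordering that avoids it, can be replayed deterministically in user space. The sketch below is a toy model, not kernel code: the toy_ names and the two-flag sleeper are assumptions, and the "waker" is simply called at the exact point where the window opens. It contrasts the test-then-sleep ordering of sleep_on() with the register-first, test-second ordering that race-free waits depend on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a single sleeping task. */
struct toy_sleeper {
    bool queued;    /* on the wait queue, so wakers can see it */
    bool runnable;  /* false once the task has been put to sleep */
};

/* Waker: make the condition true, then wake any queued sleeper. */
void toy_wake_up(bool *ready, struct toy_sleeper *t)
{
    assert(ready != NULL && t != NULL);
    *ready = true;
    if (t->queued)
        t->runnable = true;
}

/* sleep_on() style: test first, join the queue afterwards.
 * Returns true if the task ends up asleep although the condition is ready. */
bool toy_racy_wait_misses_wakeup(void)
{
    bool ready = false;
    struct toy_sleeper t = { false, true };

    if (!ready) {                   /* sleeper: not ready yet */
        toy_wake_up(&ready, &t);    /* waker runs in the window: nobody queued */
        t.queued = true;            /* sleeper: now joins the queue... */
        t.runnable = false;         /* ...and goes to sleep */
    }
    return ready && !t.runnable;
}

/* Race-free ordering: join the queue first, then test, then block. */
bool toy_safe_wait_misses_wakeup(void)
{
    bool ready = false;
    struct toy_sleeper t = { false, true };

    t.queued = true;                /* sleeper: visible to wakers first */
    t.runnable = false;             /* sleeper: marked as about to sleep */
    if (!ready) {
        toy_wake_up(&ready, &t);    /* waker runs in the same window */
        /* Blocking here would return at once: the task is runnable again. */
    }
    t.queued = false;               /* sleeper: leave the queue */
    return ready && !t.runnable;
}
```

In the racy variant the wakeup arrives while nobody is on the queue and is lost; in the safe variant the same wakeup finds the sleeper already queued and makes it runnable before it can block.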

The plan for some time has been to remove sleep_on() in the 2.5 series, on the theory that there is no safe way to call it. Now that patches are going in, people have begun to ask when this removal might take place. The answer, for now, is a patch from David Woodhouse. It does not yet go so far as completely removing the function; instead it adds some checks which detect (and complain about) unsafe calls. It is a gradual approach, but the intent remains the same: eventually sleep_on() and friends will go away, and any code that still calls them will have to be updated.

Incremental prepatches. H. Peter Anvin has announced a much-requested feature for the kernel.org archives: incremental prepatches. Posted prepatches are relative to the last official kernel release; users wishing to go from one prepatch to another have to restart with a clean kernel, or explicitly back out the previous prepatch. With the new scheme, it is necessary only to download the (usually smaller) incremental patch and apply that. The incremental patches will also make it easier to see exactly what has changed between prepatches.

Integrating ALSA. The Advanced Linux Sound Architecture project has been working since early 1998 to build a better sound subsystem for the Linux kernel. Some people were surprised that ALSA was not integrated into 2.4, but the fact is that the project never proposed its code for that release. The ALSA hackers have been taking their time and trying to get it right.

Now, however, it appears that the time has come. ALSA maintainer Jaroslav Kysela has indicated that he and the code are ready, and Alan Cox has encouraged him to submit it. The last call belongs to Linus, of course, but chances are good that ALSA will find its way into a 2.5 kernel before too long. It will probably live alongside the OSS drivers for a while but, in the long term, it seems certain that OSS will be removed.

Other patches and updates released this week include:

  • Peter Braam has released version 1.0.6-test1 of the InterMezzo filesystem. There is also an InterMezzo roadmap available for those interested in where this distributed filesystem is going.

  • Larry McVoy has posted a partial description of his long-standing "ccCluster" idea. Worth a read for a different approach to multiprocessor systems.

  • Christoph Rohland has posted a document for the tmpfs filesystem, intended for the kernel documentation directory.

  • IBM has released version 1.0.10 of the JFS journaling filesystem.

  • Richard Gooch has released a pile of devfs updates, including devfsd-v1.3.20, devfs-v99.21 (for 2.2 kernels), devfs-v199.3 (for 2.4) and devfs-v203 (for 2.5).

  • Davide Libenzi has posted a patch which implements "task struct coloring." Coloring staggers the alignment of task structures so that they do not all land on the same cache line (which is currently the case). The result should be improved kernel performance, especially on SMP systems. A later version of the patch also adds kernel stack coloring.

  • Bert Hubert has posted a set of documents describing the kernel's network traffic control capabilities. Traffic control has been present since 2.2, and it provides some very nice features, but lack of good documentation has limited its usage. This work is a welcome step in the right direction.

  • Version v1.13 of the Dolphin PCI-SCI driver has been released by Jeff Merkey.

  • Keith Owens has released kdb v1.9 for the 2.4.16 kernel.

  • ext3 0.9.16 for 2.4 kernels was released by Andrew Morton.

  • The international kernel patch is back: a beta version for 2.4.16 was announced by Herbert Valerio Riedel.

  • Nathan Scott has posted a new version of the extended attributes interface.

  • A patch improving the performance of kernel statistics counters was posted by Ravikiran G Thirumalai.

  • Ian Stewart has announced a new release of the AC'97 "linmodem" driver.
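
The "task struct coloring" idea mentioned in the list above lends itself to a small arithmetic sketch. Everything numeric here is an assumption chosen for illustration (cache line size, number of sets, task area alignment, number of colors), and the toy_ functions are hypothetical; they are not taken from Davide Libenzi's patch. The point is only that identically aligned structures all index the same place in the cache, while a small per-task offset spreads them out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry and layout; none of it comes from the patch. */
#define TOY_LINE_SIZE   32u     /* bytes per cache line */
#define TOY_NUM_SETS    128u    /* 32 * 128 = 4096 bytes per way */
#define TOY_TASK_ALIGN  8192u   /* assumed alignment of each task area */
#define TOY_NUM_COLORS  8u      /* number of distinct offsets used */

/* Cache set that a given address maps to. */
unsigned int toy_cache_set(uintptr_t addr)
{
    return (unsigned int)((addr / TOY_LINE_SIZE) % TOY_NUM_SETS);
}

/* Uncolored: every task structure starts exactly on the alignment boundary. */
uintptr_t toy_task_addr_plain(unsigned int n)
{
    return (uintptr_t)n * TOY_TASK_ALIGN;
}

/* Colored: shift each structure by a small, per-task multiple of a line. */
uintptr_t toy_task_addr_colored(unsigned int n)
{
    assert(TOY_NUM_COLORS <= TOY_NUM_SETS);
    return (uintptr_t)n * TOY_TASK_ALIGN
           + (uintptr_t)(n % TOY_NUM_COLORS) * TOY_LINE_SIZE;
}
```

With these numbers, every uncolored structure maps to set 0, while the colored ones cycle through sets 0 to 7.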

Section Editor: Jonathan Corbet


December 6, 2001

Eklektix, Inc. Linux powered! Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds