
Kernel development


The current development kernel release is still 2.5.1; Linus's prepatch series is now up to 2.5.2-pre10. Many of the recent prepatches were dominated by kdev_t fixes, but that task now appears to be mostly complete. So 2.5.2-pre10 includes buffer cache changes from Al Viro, more USB updates, a large ARM update, and, incidentally, a new scheduler (see below).

As an aside, the kdev_t change is a classic example of how Linus manages kernel development. A proprietary software architect would, after deciding to proceed with this sort of data structure change, instruct one or more lower-level engineers to go through and make the new structure work everywhere. Linus, instead, just fixes the stuff that he uses personally, and assumes that people will come in with patches to deal with the rest. And that is, generally, exactly what happens. And if nobody bothers to fix a certain piece of code, that, too, is useful: it implies that nobody is using that subsystem, and it can be considered for removal.

The current "dj" patch from Dave Jones is 2.5.1-dj13; it is synchronized with 2.5.2-pre9, and includes some additional fixes.

The current stable kernel release is 2.4.17. Marcelo has released 2.4.18-pre2 as the next step toward the 2.4.18 stable release.

Those working with ancient kernels may be interested in the first 2.0.40 release candidate from David Weinehall.

Other kernel trees:

  • Andrea Arcangeli's 2.4 patch is 2.4.18-pre2aa1; it adds a number of fixes, a scheduler update of its own, user-mode Linux, TUX, and more to 2.4.18-pre2.

  • Michael Cohen has released the second 'mjc' kernel, which adds a number of experimental things to the 2.4.17 kernel.

  • Nathan Russell has decided to get into the act with 2.4.18-nj1; this one adds Rik van Riel's reverse mapping VM and a few fixes to the 2.4.18-pre1 kernel. The -nj1 announcement drew a complaint that, perhaps, the number of kernel trees was getting a little too large.

A new scheduler goes into 2.5.2. One can assume that the worst of the block I/O and kdev_t problems in the 2.5.2-pre series have been dealt with: Linus has decided to stir things up again by replacing the scheduler outright. To make things even more fun, the new code is not one of the scheduler implementations that has been floating around for a while; instead it is a completely new development from Ingo Molnar.

Ingo included a detailed description of how the new code works. The following discussion is drawn mostly from that posting; those interested in the details may well want to go to the original source.

The new scheduler starts, like most of the proposed alternatives, by creating per-CPU run queues. As the number of processors increases, the overhead caused by the current global run queue becomes intolerable. A separate run queue for each processor eliminates the need for a global spinlock, and avoids a great deal of cache line contention between processors.
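
As a rough illustration (the structure and field names here are invented for clarity, not taken from Ingo's patch), the core arrangement looks something like this:

    /* Illustrative sketch only: one run queue per CPU, each with its
     * own lock, so processors no longer fight over a single global
     * spinlock every time they schedule. */
    struct runqueue {
        spinlock_t lock;            /* protects only this CPU's queue */
        unsigned long nr_running;   /* runnable tasks on this CPU */
        struct list_head tasks;     /* the runnable tasks themselves */
    };

    static struct runqueue runqueues[NR_CPUS];   /* one per processor */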

The current Linux scheduler works by passing over every runnable process and calculating a "goodness" value; the process with the highest goodness is the one that gets to run. As the number of runnable processes increases, the goodness calculation gets more expensive. So the new scheduler does away with all of that. It implements, instead, an array of run queues, sorted by priority. When it is time to select a new process to run, the scheduler need only consult a bitmap to find the runnable process with the highest priority and pick it off the appropriate queue. The selection process is thus O(1): it does not take more time as the number of processes grows.
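
A hedged sketch of how that constant-time selection might look (MAX_PRIO, pick_next(), and the run_list field are assumptions for illustration; find_first_bit() and list_entry() are real kernel primitives):

    #define MAX_PRIO 140   /* assumed number of priority levels */

    struct prio_array {
        /* one bit per priority level; a set bit means that level
         * has at least one runnable task queued */
        unsigned long bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
        struct list_head queue[MAX_PRIO];   /* one task list per level */
    };

    /* Selection never scans the tasks themselves: find the first set
     * bit, then take the head of that level's list.  The cost is
     * constant no matter how many processes are runnable. */
    static struct task_struct *pick_next(struct prio_array *array)
    {
        int idx = find_first_bit(array->bitmap, MAX_PRIO);

        if (idx >= MAX_PRIO)
            return NULL;        /* no runnable tasks; go idle */
        return list_entry(array->queue[idx].next,
                          struct task_struct, run_list);
    }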

The code is actually a little more complicated than that: there are two run queue arrays per processor. Once a process exhausts its time slice, it is moved from the "active" to the "expired" array, where the priority sorting is retained. When all processes have used up their time slices, the expired and active arrays are switched and the cycle begins anew.
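
Extending the hypothetical runqueue sketch above with two priority arrays, the switch amounts to a pointer swap; nothing is copied or re-sorted:

    struct runqueue {
        spinlock_t lock;
        struct prio_array *active;    /* tasks with time slice left */
        struct prio_array *expired;   /* tasks that used their slice */
        struct prio_array arrays[2];  /* storage the pointers refer to */
    };

    /* When the active array drains, the expired array - still sorted
     * by priority - becomes the new active array in constant time. */
    static void switch_arrays(struct runqueue *rq)
    {
        struct prio_array *tmp = rq->active;

        rq->active = rq->expired;
        rq->expired = tmp;
    }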

The per-CPU run queue arrays also have the effect of keeping processes on the same processor. Efficient use of memory caches requires this behavior; the old scheduler also tried to enforce CPU affinity, but not always very effectively. With the new scheduler, a different problem could arise instead: all of the runnable processes could find themselves contending for a single CPU while other CPUs sit idle. To avoid this situation, the scheduler will occasionally move processes between processors if the run queues get too far out of balance.
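
In sketch form (find_busiest_queue() and pull_task() are hypothetical stand-ins; the real patch's balancing heuristics are more involved), a balancing pass looks roughly like:

    /* Run periodically on each CPU.  If some other queue is noticeably
     * longer than ours, pull a task over; otherwise leave tasks where
     * they are so they keep their cache footprint. */
    static void rebalance(struct runqueue *this_rq)
    {
        struct runqueue *busiest = find_busiest_queue();   /* hypothetical */

        if (busiest && busiest->nr_running > this_rq->nr_running + 1)
            pull_task(busiest, this_rq);   /* hypothetical migration */
    }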

Real-time processes are handled a little differently, and in a way that has drawn a few complaints. There is a single, global run queue for real-time processes. When such a process becomes runnable, all processors are interrupted and race to see who can execute the process first. The idea here is that real-time processes require the minimum possible latency, and there is no way to know in advance which processor will be able to drop what it's doing to run a real-time process. By setting all processors on the task, the new scheduler, through brute force, identifies the processor which can respond most quickly each time. But there will be a cost paid by all other processes on the system, which will find themselves interrupted (briefly, usually) every time a real-time process wants to run.
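
A minimal sketch of that broadcast, assuming a hypothetical global real-time queue (smp_send_reschedule() and smp_processor_id() are real kernel calls):

    /* When a real-time task becomes runnable, poke every other CPU;
     * the first processor to reach its scheduler wins the race. */
    static void rt_task_wakeup(struct task_struct *p)
    {
        int cpu;

        add_to_global_rt_queue(p);               /* hypothetical */
        for (cpu = 0; cpu < smp_num_cpus; cpu++)
            if (cpu != smp_processor_id())
                smp_send_reschedule(cpu);        /* reschedule IPI */
    }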

There are a few other features in the new code. One is that it restored the "run a newly forked child before the parent" behavior that was briefly tried in the 2.4 series; that change has been backed out in more recent versions of the patch, however. The new scheduler also tracks CPU usage over time (a few seconds' worth) in an attempt to identify the real CPU hogs on the system. Those processes will find their priorities decreased somewhat so that interactive tasks can run first.
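
Conceptually, the adjustment might look like the following (entirely illustrative; the constants and field names are invented, and the real patch's heuristics differ):

    #define USAGE_SCALE  1000   /* assumed full-scale usage measurement */
    #define MAX_PENALTY  5      /* assumed maximum priority penalty */

    /* A task's effective priority drifts downward as its measured CPU
     * usage over the last few seconds rises, so interactive tasks,
     * which spend most of their time sleeping, sort ahead of hogs. */
    static int effective_prio(struct task_struct *p)
    {
        int penalty = p->recent_cpu_usage * MAX_PENALTY / USAGE_SCALE;

        return p->base_prio + penalty;   /* larger value = lower priority */
    }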

Ingo's explanation also includes a number of benchmark results, all of which show improved performance with the new scheduler. The most impressive, perhaps, is the one that tests pure context switching behavior. With this benchmark, the 2.5.2-pre6 scheduler could do just over 240,000 context switches per second on a two-processor system, and 108,000 on an eight-processor system. In other words, the eight-processor system ran slower, due to increased contention for the global run queue and its lock. With the new scheduler, the results were 977,000 switches per second (two processors) and 6,117,000 per second (eight processors). Not bad.

For another set of benchmarks, see this posting from Mike Kravetz. These results show that the new scheduler could still use some work in places.

As of this writing, Ingo's latest patch is version E1, which is quite small (since most of the patch is already in 2.5.2-pre10). Early versions of the patch generated a fair number of bug reports, but those seem to have slowed down. It's hard to say what will happen in later 2.5 releases, but the scheduler wars may be over, for now.

ALSA gets ready. Advanced Linux Sound Architecture founder and maintainer Jaroslav Kysela has posted a patch containing the latest ALSA code, ready for inclusion into the 2.5.2-pre9 kernel. Among other things, this version of the subsystem shows the beginning of an effort to strip out the 2.2 and 2.4 compatibility code.

The thing that drew everybody's attention, however, was not the ALSA code itself, but its new directory organization. The patch creates a new top-level sound directory in the kernel tree; many developers had expected to see ALSA in the drivers/sound directory instead. The thinking is that ALSA includes much more than just drivers; there is a whole set of higher-level functions as well. Linus compares the ALSA code to the networking subsystem; network drivers are part of it, but there is much more involved.

The final organization is, in fact, likely to mirror how the networking code is handled. The sound directory will contain most of the ALSA code, but the low-level drivers will stay in drivers/sound. There is still no word on when ALSA will actually be merged, and there probably won't be until it shows up in a prepatch.

How to improve latency? The question of latency - the time it takes the kernel to respond to events - is back on the agenda. There seems to be a consensus that the 2.4 kernel is still too unresponsive for many needs. People working with streaming media tend to be the first to complain, but they are not the only ones.

There are two approaches out there for addressing the latency problems:

  • Find the specific places where the kernel executes for too long and insert explicit "scheduling points" to break them up (see the sketch after this list). Andrew Morton has an extensive low-latency patch which can provide 1ms maximum latency in most situations; he has also sent out a mini low-latency patch for inclusion into the 2.4 kernel. This patch provides a 36ms worst-case latency, but is far less intrusive.

  • Make the Linux kernel preemptible; this is done by allowing the kernel to be kicked off the processor any time it is not holding a spinlock. Essentially, any part of the kernel that is not a spinlock-protected critical section becomes an implicit scheduling point. The current preemptible kernel patch, for both 2.4 and 2.5, is available from Robert Love.
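
The scheduling-point approach, in sketch form (the loop and its helpers are hypothetical, but need_resched is the real 2.4-era flag on the task structure):

    /* A long-running kernel operation, broken up with an explicit
     * scheduling point: if something more important has become
     * runnable, give up the processor before the next chunk. */
    while (more_work_to_do()) {         /* hypothetical long loop */
        do_one_chunk();                 /* hypothetical unit of work */
        if (current->need_resched)
            schedule();                 /* explicit scheduling point */
    }
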
There are considerable differences of opinion over which approach is best - at least, for 2.5. Nobody is really proposing that the preemptible kernel patch go into 2.4; it is far too big a change for that. Many would like to see it in 2.5, however.

The main objection to the preemptible kernel patch is that it breaks a long-standing fundamental assumption: that code running in kernel space will not be preempted until it explicitly gives up the processor. This could lead to no end of weird, difficult-to-debug problems. But the truth of the matter is that a preemptible kernel does not add any complexity not already introduced by SMP systems. Preemptibility appears to have a good chance of winning in the end, but that, of course, is a call for Linus, and he has been silent on the topic.
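
The mechanism, roughly (the names follow the general idea rather than the exact patch): each task counts the spinlocks it holds, and preemption is permitted only when that count is zero.

    /* Taking a spinlock bumps the counter and blocks preemption. */
    #define preempt_disable()  (current->preempt_count++)

    /* Releasing the last lock makes the kernel preemptible again; if
     * a preemption was requested in the meantime, reschedule now. */
    #define preempt_enable()                            \
        do {                                            \
            if (--current->preempt_count == 0 &&        \
                current->need_resched)                  \
                preempt_schedule();                     \
        } while (0)

    /* spin_lock() calls preempt_disable() and spin_unlock() calls
     * preempt_enable(), so spinlock-protected critical sections stay
     * atomic, just as SMP already requires. */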

Other patches and updates released this week include:

  • Jens Axboe has posted a new I/O scheduler (i.e. "elevator") which attempts to ensure that block I/O requests are executed within a given time period. (People wanting to try the patch should use the later version which contains some fixes).

  • A new hashed wait queue patch was posted by William Lee Irwin III.

  • dietHotplug 0.4 was announced by Greg Kroah-Hartman.

  • A new radix-tree page cache patch was announced by Christoph Hellwig and Momchil Velikov.

  • Eric Raymond announced CML2-2.0, a major release which includes new language features and autoconfiguration support. It was quickly followed by the 2.0.1 'brown paper bag' release; the current version is 2.0.4.

  • The SGI Visual Workstation support in the kernel has gone unmaintained for so long that its removal was being considered. No longer: Jesse Barnes has posted a new Visual Workstation patch for 2.4.17.

  • Manfred Spraul has posted a new zero-copy pipe patch.

  • Rik van Riel has released version 11a of his reverse mapping VM implementation.

  • Release 1.0.12 of IBM's journaling filesystem was announced by Steve Best.

  • Paul Larson has announced the Linux Test Project 20020108 release.

  • Trent Jaeger has announced a set of tools for the verification of the hooks used by the Linux Security Module patch. A few problems have been turned up which may result in changes to the LSM patch.

  • Greg Kroah-Hartman has released klibc 0.1, a small C library for applications that go into an initramfs image. This posting was followed by a request for comments on what is really needed for an initramfs C library and the best way to get there.

Section Editor: Jonathan Corbet


January 10, 2002
