Kernel development

The current kernel release is still 2.4.2. The current 2.4.3 prepatch from Linus is 2.4.3pre4, which contains a small collection of important fixes (and an item marked "Alan Cox: continued merging," which could cover a lot of stuff).

Alan Cox's prepatch is up to 2.4.2ac20, and it is rather larger. It includes a fair amount of more ambitious changes, some of which have ominous tags like "hopefully fix the buslogic corruptions."

There have been no announced 2.2.19 prepatch releases over the last week, though 2.2.19pre17 was quietly dropped onto the FTP site on March 11.

SCHED_IDLE, again. It was a relatively slow week on linux-kernel, so perhaps it's fitting that one of the topics that came up was the old idea of a SCHED_IDLE scheduling class. A SCHED_IDLE process would run only if no other process wanted the CPU. This behavior is different from the usual Unix behavior; normally, even very low-priority processes will get a little bit of CPU time. A true SCHED_IDLE class would allow you to run that compute-intensive pig latin song title encoding code without it getting in the way of those all-important kernel builds.

The problem with SCHED_IDLE hasn't changed, though. Even idle tasks occasionally need kernel resources. It is possible for an idle task to obtain an important kernel lock or semaphore, then be kept off the CPU by a regular task. At that point, the system can hang; the idle task cannot run to release its resources, so everybody else just has to wait.

This is a variant of the classic "priority inversion" problem, where a low-priority process can monopolize resources needed by higher-priority tasks, and keep them from executing. Priority inversion can be a serious problem, especially if the system involved is on Mars at the time. But even terrestrial applications need to avoid situations that can cause this problem. For this reason, a true idle task has never been incorporated into the Linux scheduler.

This time around, the observation was made that processes rarely, if ever, hold important kernel resources when running in user space. In other words, locks and semaphores are only held while the process is running in kernel mode. So the usual sort of solution to priority inversion problems - complicated priority inheritance schemes and such - is overly complex for this situation. It should suffice to remove the idle task attribute from processes running in the kernel. Jamie Lokier posted a simple hack which implements this behavior on x86 systems.
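To make the argument concrete, here is a toy model of the problem and of the proposed fix. It is only a sketch under stated assumptions: one CPU, one lock, round-robin among eligible tasks, and "in the kernel" reduced to "between taking and releasing the lock." The names Task, simulate, scenario, and promote_in_kernel are invented for this illustration; none of this is taken from Jamie Lokier's patch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    idle_class: bool          # True for a SCHED_IDLE-style task
    start: int                # tick at which the task becomes runnable
    script: List[str]         # steps: "user", "lock", "kernel", "unlock"
    pc: int = 0               # index of the next step to execute
    in_kernel: bool = False   # True between "lock" and "unlock" (toy assumption)


def simulate(tasks: List[Task], promote_in_kernel: bool,
             max_ticks: int = 1000) -> Dict[str, int]:
    """Run a one-CPU, one-lock toy scheduler; return {task name: finish tick}."""
    owner: Optional[str] = None              # name of the current lock holder
    last_run = {t.name: -1 for t in tasks}   # used for round-robin ordering
    finished: Dict[str, int] = {}

    def counts_as_idle(t: Task) -> bool:
        # The proposed fix: ignore the idle attribute while in the kernel.
        return t.idle_class and not (promote_in_kernel and t.in_kernel)

    for tick in range(max_ticks):
        if len(finished) == len(tasks):
            break
        ready = []
        for t in tasks:
            if t.pc >= len(t.script) or tick < t.start:
                continue                     # finished, or not yet started
            if t.script[t.pc] == "lock" and owner is not None:
                continue                     # sleeping on the held lock
            ready.append(t)
        if not ready:
            continue
        # Idle-class tasks run only when no other task wants the CPU.
        normal = [t for t in ready if not counts_as_idle(t)]
        pool = normal or ready
        task = min(pool, key=lambda t: last_run[t.name])  # least recently run
        step = task.script[task.pc]
        if step == "lock":
            owner, task.in_kernel = task.name, True
        elif step == "unlock":
            owner, task.in_kernel = None, False
        task.pc += 1
        last_run[task.name] = tick
        if task.pc == len(task.script):
            finished[task.name] = tick
    return finished


def scenario() -> List[Task]:
    return [
        # The idle task grabs the lock before anyone else is runnable.
        Task("idle", True, 0, ["lock", "kernel", "kernel", "unlock", "user"]),
        # A CPU-bound regular task that never touches the lock.
        Task("hog", False, 1, ["user"] * 20),
        # A regular task that needs the lock the idle task is holding.
        Task("waiter", False, 1, ["lock", "unlock"]),
    ]


strict = simulate(scenario(), promote_in_kernel=False)
fixed = simulate(scenario(), promote_in_kernel=True)
print("strict idle class:", strict)
print("promoted in kernel:", fixed)
```

In the strict case the waiter cannot finish until the hog has used all the CPU time it wants; with a hog that never stops, the waiter would wait forever. With promotion, the idle task competes as a regular task only until it drops the lock, and the waiter gets through well before the hog completes.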

Such changes are 2.5 material, of course, so it may be some time before we know if some form of this patch will go in or not. Linus has been hostile to the SCHED_IDLE idea in the past, and this fix may not be adequate to address his concerns. Nonetheless, it's a step in the right direction; Linux may yet have an idle task implementation.

Preemptable kernel patch. With little fanfare, Nigel Gamble (who works at MontaVista Software) posted a patch to the 2.4.2 kernel which makes the Linux kernel preemptable. Normally, the kernel follows longstanding Unix tradition in that kernel code cannot have the processor taken away from it. When the system is running in kernel mode, the code will run until it voluntarily gives up the processor, or until it returns to user mode. The one exception to this rule is hardware interrupts, but very little work is supposed to be done by interrupt handlers.

This mode of operation has traditionally been convenient for kernel programmers, since it reduces the amount of concurrency (and, thus, race conditions) that they have to deal with. It also tends to increase latency, however; the amount of time it takes the system to respond to an event can increase. Thus, your sound card may be crying for more data, but if some other piece of kernel code is hogging the processor, the sound card will have to wait. In many situations, this sort of latency can cause problems.

The solution is to make the kernel preemptable, so that a higher-priority process can run even if the system is running in kernel mode. Once upon a time, this would have been a very large change, given the whole new set of concurrency issues that would have to be dealt with. But multiprocessor systems have all the same concurrency issues, and the kernel hackers have been forced to deal with them. At this point, adding preemption to the kernel adds very little in the way of problems.

So, Mr. Gamble's patch is surprisingly small. There are some scheduler changes, of course, to make the preemption happen. There is also a bit of code which disallows preemption any time that kernel code holds a spinlock. This is necessary for three reasons. First, spinlocks should be held for very short periods, so code which holds one should be allowed to run to completion. Second, spinlocks exist to prevent certain types of concurrency; a preemptable kernel patch should not defeat that purpose. Finally, preempting code which holds a spinlock could deadlock the system if another thread in the kernel attempts to obtain the same lock on the same processor.
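The spinlock rule can be pictured as a per-processor counter: taking a lock bumps it, dropping a lock decrements it, and a pending preemption is honored only when the counter is zero. The following is a minimal sketch of that idea, not code from the patch; the class Cpu and the names preempt_count, need_resched, and wakeup_higher_priority are illustrative assumptions.

```python
class Cpu:
    """Toy model of one processor running kernel code."""

    def __init__(self) -> None:
        self.preempt_count = 0      # number of spinlocks currently held
        self.need_resched = False   # a higher-priority task is waiting
        self.switches = 0           # how many preemptions actually happened

    def spin_lock(self) -> None:
        # Holding any spinlock makes the running code non-preemptable.
        self.preempt_count += 1

    def spin_unlock(self) -> None:
        self.preempt_count -= 1
        # Dropping the last lock is a natural point to honor a deferred request.
        self._maybe_preempt()

    def wakeup_higher_priority(self) -> None:
        # Called, for example, when an interrupt makes a better task runnable.
        self.need_resched = True
        self._maybe_preempt()

    def _maybe_preempt(self) -> None:
        if self.need_resched and self.preempt_count == 0:
            self.need_resched = False
            self.switches += 1      # stand-in for a real context switch


cpu = Cpu()
cpu.spin_lock()
cpu.wakeup_higher_priority()   # deferred: a spinlock is held
cpu.spin_unlock()              # preemption happens here instead
print(cpu.switches)
```

Counting, rather than keeping a single flag, lets nested locks work: preemption stays off until the outermost lock is released.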

This patch is not 2.4 material, of course; a change of this magnitude has to wait for the next development series. But Mr. Gamble has shown that this change is relatively straightforward; it would be surprising if some variant of this patch didn't show up early in the 2.5 series.

Is it time for a massive configuration variable renaming? Keith Owens thinks so, and has posted a patch which changes the name of every configuration variable that is automatically derived from other configuration variables. There are advantages to knowing which variables cannot be changed directly by the user; this patch makes that knowledge explicit by appending a _DERIVED suffix to each such variable.
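A rename of this kind is mechanical, which is part of why it can touch so many files at once. As a rough sketch of the sort of rewrite involved (not Keith Owens's actual tooling), the function below appends the suffix to whole-word matches only. The function name append_derived and the variable CONFIG_EXAMPLE_A are made up for the example.

```python
import re
from typing import Iterable


def append_derived(text: str, derived_names: Iterable[str]) -> str:
    """Append _DERIVED to each whole-word occurrence of the given names."""
    names = sorted(set(derived_names), key=len, reverse=True)
    if not names:
        return text
    # \b keeps CONFIG_EXAMPLE_A from matching inside CONFIG_EXAMPLE_A_EXTRA,
    # and also inside an already renamed CONFIG_EXAMPLE_A_DERIVED.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: m.group(1) + "_DERIVED", text)


source = "#ifdef CONFIG_EXAMPLE_A\nint x = CONFIG_EXAMPLE_A_EXTRA;\n#endif\n"
print(append_derived(source, {"CONFIG_EXAMPLE_A"}))
```

Because the underscore counts as a word character, running the rewrite a second time leaves the text unchanged, which matters when a patch has to be regenerated against a moving tree.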

Now, any time you post a patch which changes 130 variables and touches 553 source files, you're going to raise a few eyebrows. Doing so in a stable kernel series doesn't help, either. So it's not surprising that this patch attracted some complaints. These ranged from the usual "it's unnecessary" or "wrong solution" variety to a query from Eric Raymond, who is under the impression that his CML2 configuration scheme will be adopted in 2.5, and is thus wondering why people are bothering to mess with the older scheme.

In fact, nobody came out in support of the proposed change. This patch would appear to be doomed. Hopefully the 2.5 kernel series really will see a replacement of the kernel configuration system; at that point, a lot of things will get easier.

Actually, things have been somewhat quiet on the CML2 front for a while; Eric has pronounced it ready, and is mostly just waiting for it to be incorporated into the development tree. There has been one bit of progress, however. Back in November, the CML2 system was examined on this page; one of the things we noticed is that the CML2 compiler took an awfully long time to run. Eric finally looked into the performance side of things, and found something interesting: the compiler took 28 seconds to run on his system, and 26 of those were spent in the automatically-generated expression parser code. One might just conclude that there is some room for optimization there.

And, in fact, after recoding the parser by hand, Eric reported that the compiler's execution time had been cut in half. 2.5 kernel configuration is not going to have to be slow after all.

Other patches and updates released this week include:

Geert Uytterhoeven submitted a patch fixing up the frame buffer penguin logo code. Among other things, the penguin has, once again, lost its glass of beer. If the new logo looks rather grumpy, you'll know why...
Rik van Riel has been on a mission to add documentation to the memory management code. He's put out a patch fixing up mm.h, mmzone.h, and swap.h. More is apparently coming, eventually.
Ulrich Windl has posted his PPSkit (nanosecond timekeeping) patch, ported to the 2.4.2 kernel.
Daniel Phillips has reworked his ext2 directory index patch to work with the Linux page cache, rather than the buffer cache. Once again, he provides a detailed and interesting writeup of what he had to do to make it all work. The page cache version is about twice as fast as the older, buffer cache version.
IBM has released version 2.0 of its "dynamic probes" debugging facility.

Section Editor: Jonathan Corbet

March 15, 2001
