Kernel development

The current kernel release remains 2.4.9. The current 2.4.10 prepatch is 2.4.10pre2, which contains a number of fixes and cleanups, but nothing too revolutionary.

Alan Cox's current patch is 2.4.9ac3. This one contains everything from 2.4.9, with the exception of the min()/max() stuff and the virtual memory changes. The VM work, at least, is likely to find its way into the "ac" series slowly, in order to make it easier to assess the effects of each change individually. Also present in the 2.4.9ac series is a large merge of the MIPS port.

Whither 2.5? Back on June 21, Linus said that the 2.5 series "will open in a week or two." That, of course, was more than two months ago. As of September 4, it will have been a full eight months since 2.4 came out. Never in ten years of Linux development has there been such a long period without a development kernel.

This hiatus is making itself felt in a number of ways. One is that development items are finding their way into the "stable" kernel series. Back in January, Linus had laid out a stern policy on patches for 2.4:

In order for a patch to be accepted, it needs to be accompanied by some pretty strong arguments for the fact that not only is it really fixing bugs, but that those bugs are _serious_ and can cause real problems.

Patches since then have included API changes, wholesale driver replacements, zero-copy networking, and numerous other changes that would seem to have bent the above rule just a little bit.

Meanwhile, much of the serious development work that is on tap for 2.5 remains isolated and untested, or not done at all. And developers are increasingly wondering when the 2.5 series will start.

It is important, certainly, to hold the line on development work while the 2.4 kernel stabilizes. The developers need to maintain their focus on stability until the job is really done, and an open development series could easily distract many of them. But eight months is a long time without a development kernel. It seems time for 2.5 to start.

The min() and max() issue... Linus returned from Finland and put in his contribution to the debate on the changes to min() and max():

Yes, the new Linux min/max macros are different from the ones people are used to. Yes, I expected a lot of flamage. And no, I don't care one whit. Unlike EVERY SINGLE other C version of min/max I've ever seen, the new Linux kernel versions at least have a fighting chance in hell of generating correct code.

In other words, he does not intend to back down on this change, and people should just deal with it and get on with things. Most of the kernel hackers seem to have accepted this (with, perhaps, a final grumble or two), and the discussion has died down.
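For readers wondering what sort of incorrect code is at stake, here is a minimal sketch of one well-known pitfall: a signed value compared against an unsigned one. It is not the kernel's actual code. The names OLD_MIN, TYPED_MIN, clamp_old() and clamp_typed() are made up for illustration, and the typed macro assumes a compiler (such as gcc) that supports statement expressions. The general idea shown, that the caller names the type in which the comparison is done, is our reading of the goal behind the new macros.

```c
#include <assert.h>

/* Classic two-argument macro: the comparison is done in whatever type
 * C's usual arithmetic conversions pick for the two operands. */
#define OLD_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical typed variant: both operands are first converted to the
 * type chosen by the caller, and each operand is evaluated only once.
 * Relies on the GNU C statement-expression extension. */
#define TYPED_MIN(type, a, b) \
    ({ type min_a_ = (a); type min_b_ = (b); \
       min_a_ < min_b_ ? min_a_ : min_b_; })

/* Limit a (possibly negative) length to the size of a buffer. */
int clamp_old(int len, unsigned int bufsize)
{
    /* len is converted to unsigned for the comparison, so a negative
     * len looks enormous and bufsize is returned instead. */
    return (int)OLD_MIN(len, bufsize);
}

int clamp_typed(int len, unsigned int bufsize)
{
    /* The comparison is explicitly done on ints; a negative len
     * stays negative and can be caught by the caller. */
    return TYPED_MIN(int, len, bufsize);
}
```

With a length of -1 and a 16-byte buffer, clamp_old() quietly returns 16, while clamp_typed() returns -1. Whether the signed or the unsigned comparison is the right one depends on the call site; the point of naming the type is that somebody has to decide.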

An open question, still, is what Alan Cox will do in the "ac" series. 2.4.9ac3 does not have "the min/max thing which needs to be dealt with." Last week Alan had said that he would not incorporate this change in this form, though he does agree with its basic goals. The change, however, affects a lot of files throughout the kernel, and maintaining a kernel that differs from Linus's in this respect would be a lot of work for Alan and many other kernel developers. It would probably be much easier for everybody involved to just adopt Linus's new way of doing things and be done with it.

Then, there was the well-intentioned guy who suggested supporting both the new and the old min/max macros, and surrounding each call with a #ifdef. That idea didn't get too far...

In search of smart readahead. This week saw a complaint that disk read performance is very slow when numerous threads are all reading simultaneously. One suggestion that came out quickly was to increase the readahead limit for disk files. It's an approach that has worked for some people, but a more general solution requires a deeper look.

"Readahead," of course, is the act of speculatively reading a file's contents beyond what a process has asked for, with the idea that the process will get around to asking for it soon. When properly done, readahead can greatly increase read performance on a system, and most operating systems implement the technique. A larger readahead limit can help performance by creating more contiguous I/O operations for the disk, and by making it easier to stay ahead of the reading process. So increasing the readahead size would seem like a fairly straightforward decision. Until, of course, you realize that readahead requires memory, and the system might just have one or two other possible uses for that memory.

In fact, it can even be worse than that. As Rik van Riel points out, awful things can happen if the system tries to perform more readahead than it has memory for. When memory gets tight, pages used for readahead can be reclaimed for other purposes, with the result that the data so carefully read ahead gets dropped on the floor. When the reading process gets around to asking for that data, it has to be read from the disk again. In this mode, all readahead does is increase memory pressure and duplicate I/O operations; the system would be better off giving up on readahead.

The solution, it would seem, would be to be smarter about just how much readahead is done. When lots of memory is available, the readahead window should be large; as memory gets tight, that window should be reduced. There are, however, several competing ideas on how that smartness should be implemented:

A relatively simple approach, again from Rik van Riel, is to keep track, on a per-file basis, of how many pages are read ahead, and how many are actually still there waiting when the process wants them. A large discrepancy would mean that pages are getting tossed before they are used, and the readahead window should shrink. This approach has the added benefit of adapting the window for each reading process; the readahead window would naturally be larger for processes that move through their data quickly.
Daniel Phillips has a more complicated scheme involving active management of readahead pages as a separate class of memory page.
Linus points out that "trying to come up with a complex algorithm on how to change read-ahead based on memory pressure is just bound to be extremely fragile and have strange performance effects." He proposes, instead, to simply drop readahead requests when the I/O request queue for a disk fills up. It's a simple technique that is easy to implement and verify, though it is not clear that it would fix all of the readahead problems.
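The first and last of these ideas are simple enough to sketch in a few lines of C. What follows is only an illustration under stated assumptions, not code from any posted patch: the structure, the function names, the window limits, and the 75% and 95% thresholds are all invented here.

```c
#include <assert.h>

#define RA_MIN_PAGES 4U    /* assumed lower bound on the window */
#define RA_MAX_PAGES 128U  /* assumed upper bound on the window */

/* Hypothetical per-file readahead bookkeeping. */
struct ra_state {
    unsigned int window;  /* pages to read ahead next time */
    unsigned int issued;  /* pages read ahead in this interval */
    unsigned int hits;    /* of those, still in memory when wanted */
};

void ra_init(struct ra_state *ra)
{
    ra->window = 32U;     /* assumed starting window */
    ra->issued = 0U;
    ra->hits = 0U;
}

void ra_note_issued(struct ra_state *ra, unsigned int pages)
{
    ra->issued += pages;
}

void ra_note_hits(struct ra_state *ra, unsigned int pages)
{
    ra->hits += pages;
    if (ra->hits > ra->issued)  /* cannot hit more than was issued */
        ra->hits = ra->issued;
}

/* End of an interval: shrink the window if many readahead pages were
 * reclaimed before use, grow it if nearly all of them survived. */
unsigned int ra_adjust(struct ra_state *ra)
{
    if (ra->issued > 0U) {
        unsigned int pct = ra->hits * 100U / ra->issued;

        if (pct < 75U)
            ra->window /= 2U;
        else if (pct >= 95U)
            ra->window += ra->window / 4U;

        if (ra->window < RA_MIN_PAGES)
            ra->window = RA_MIN_PAGES;
        if (ra->window > RA_MAX_PAGES)
            ra->window = RA_MAX_PAGES;
    }
    ra->issued = 0U;
    ra->hits = 0U;
    assert(ra->window >= RA_MIN_PAGES && ra->window <= RA_MAX_PAGES);
    return ra->window;
}

/* The queue-based alternative: skip readahead entirely when the
 * disk's request queue is already full. */
int ra_should_submit(unsigned int queued, unsigned int queue_limit)
{
    return queued < queue_limit;
}
```

In this sketch, a file that had 32 pages read ahead but found only 8 of them still in memory would see its window halved to 16 pages; a later interval in which every page survived would let the window grow again. The queue check, by contrast, keeps no state at all, which is much of its appeal.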

Rik van Riel has stated his intention to proceed (with others) on an approach which dynamically scales the readahead window size "using heuristics not all that much different from TCP window scaling." Stay tuned for a patch.
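No code has been posted yet, so one can only guess at the details. If "TCP-like" is read as the additive-increase, multiplicative-decrease behavior of TCP congestion control (an assumption on our part), the core of such a heuristic might look like the sketch below. The function names and window limits are hypothetical.

```c
#include <assert.h>

#define RA_WIN_MIN 2U    /* assumed smallest window, in pages */
#define RA_WIN_MAX 256U  /* assumed largest window, in pages */

/* A read found its readahead page still in memory: probe for more
 * by growing the window one page at a time (additive increase). */
unsigned int ra_window_on_hit(unsigned int window)
{
    if (window < RA_WIN_MAX)
        window += 1U;
    return window;
}

/* A page was read ahead but reclaimed before use: back off quickly
 * by halving the window (multiplicative decrease). */
unsigned int ra_window_on_loss(unsigned int window)
{
    window /= 2U;
    if (window < RA_WIN_MIN)
        window = RA_WIN_MIN;
    if (window > RA_WIN_MAX)
        window = RA_WIN_MAX;
    assert(window >= RA_WIN_MIN && window <= RA_WIN_MAX);
    return window;
}
```

As with TCP, the asymmetry is deliberate: the window creeps upward while things go well and collapses quickly at the first sign that readahead pages are being thrown away.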

Journaling filesystem performance comparison. Andrew Theurer (of IBM) has posted the results of a performance comparison between several Linux filesystems. The standard ext2 filesystem beat all of the journaling filesystems by a fair amount; JFS was the fastest among the journaling systems. ReiserFS came in last in this set of tests.

It turns out that Randy Dunlap, too, has been testing journaling filesystems. He is using a different benchmarking tool, but has come up with roughly similar results.

The ReiserFS testing, as it turns out, was done with the default mount options, which pack file tails together to save disk space at a cost in performance. People interested in ReiserFS performance should mount with the "notail" option (mount -o notail). The above tests will be rerun with that option, but no results had been posted as of "press" time.

Other patches and updates released this week include:

Jens Axboe has released a new version of his zero-bounce high memory block I/O patch (which contains the 64-bit PCI DMA code from David Miller as well). Among other things, the number of drivers that support high memory I/O has been increased.
devfs v191 and devfsd v1.3.18 were released by Richard Gooch.
Robert Love has updated his patch which allows network devices to contribute to the /dev/random entropy pool.
Also from Robert Love is an updated version of the preemptable kernel patch by Nigel Gamble (covered in the March 15, 2001 LWN kernel page). Note that this patch "has not been tested" on SMP systems.
A snapshot of the x86-64 port, based on the 2.4.9 kernel, has been posted by Andi Kleen.
Keith Owens has released new versions of the ksymoops and modutils packages.
Version 0.9.0beta7 of the ALSA sound driver system has been released by Jaroslav Kysela.
Neil Brown has released version 0.5 of his "mdctl" RAID control utility.
A "diet" version of the hotplug scripts, intended for memory-constrained systems, has been released by Greg Kroah-Hartman.
The FOLK Project has a new, more bloated than ever release with a number of new features, including the Linux security module patch and more.

Section Editor: Jonathan Corbet

August 30, 2001
