[LWN.net]

See also: last week's Kernel page.

Kernel development


The current development kernel release is still 2.5.0. The current 2.5.1 prepatch is 2.5.1-pre10. On the surface, little has changed over the last week; most of the changelog entries seem to be some variant of "Jens Axboe: bio work." The thrashing of the block layer is taking some time to stabilize - as is to be expected from a change of this magnitude. The last of the disruptive block I/O changes have not yet hit the kernel, so this situation could persist for a while yet.

Also included in this prepatch is a Super-H architecture update, some network driver work, an NTFS update, USB fixes, memory pools (see below), and the inevitable superblock cleanup patches from Al Viro.

The current stable kernel release is 2.4.16. Marcelo's prepatches are up to 2.4.17-pre8; he has stated that the next prepatch will be the first 2.4.17 release candidate. Marcelo's stated plan is to have the final release be the same as the last release candidate; the hope is to be done with surprises caused by last-minute patches.

Memory pools are a new addition to the kernel as of 2.5.1-pre10. The idea behind "mempools," which were implemented by Ingo Molnar, is to provide a memory allocation function that is guaranteed to work, even when memory is tight. Some places in the kernel cannot afford to have memory allocations fail. For example, memory pressure can force the system to swap pages out, but the swap operation itself requires memory in order to execute. If the memory needed to set up the swap is not available, the system comes to a halt.

Memory pools work by simply preallocating a bunch of memory and keeping it aside until it's needed. The actual allocation and freeing of memory is handled by somebody else (the idea seems to be for mempools to be layered over the slab allocator); all mempools do is stock up ahead of time. Their use will thus increase the kernel's memory consumption (by the amount of memory that is set aside). For certain critical paths, though, they should help to improve the stability of the system under heavy load.
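The preallocation idea can be sketched in user-space C. This is a toy illustration only, not Ingo Molnar's kernel code: the names `toy_pool`, `toy_pool_create()`, `toy_pool_alloc()`, `toy_pool_free()`, and `toy_failing_alloc()` are hypothetical, and `malloc()` stands in for whatever underlying allocator the pool is layered over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space sketch of a preallocating pool; not the kernel's code. */
typedef void *(*toy_alloc_fn)(size_t size);

struct toy_pool {
    size_t elem_size;    /* size of each element */
    size_t min_nr;       /* number of elements kept in reserve */
    size_t curr_nr;      /* reserve elements currently on hand */
    void **reserve;      /* stack of preallocated elements */
    toy_alloc_fn alloc;  /* the "somebody else" who normally allocates */
};

/* Stand-in allocator that always fails, to simulate memory pressure. */
void *toy_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

void toy_pool_destroy(struct toy_pool *pool)
{
    while (pool->curr_nr > 0)
        free(pool->reserve[--pool->curr_nr]);
    free(pool->reserve);
    free(pool);
}

struct toy_pool *toy_pool_create(size_t min_nr, size_t elem_size,
                                 toy_alloc_fn alloc)
{
    struct toy_pool *pool;

    assert(min_nr > 0 && elem_size > 0 && alloc != NULL);
    pool = malloc(sizeof(*pool));
    if (!pool)
        return NULL;
    pool->elem_size = elem_size;
    pool->min_nr = min_nr;
    pool->curr_nr = 0;
    pool->alloc = alloc;
    pool->reserve = malloc(min_nr * sizeof(*pool->reserve));
    if (!pool->reserve) {
        free(pool);
        return NULL;
    }
    /* Stock up ahead of time, while memory is still available. */
    while (pool->curr_nr < min_nr) {
        void *elem = malloc(elem_size);
        if (!elem) {
            toy_pool_destroy(pool);
            return NULL;
        }
        pool->reserve[pool->curr_nr++] = elem;
    }
    return pool;
}

void *toy_pool_alloc(struct toy_pool *pool)
{
    /* Try the normal allocator first; dip into the reserve only if it fails. */
    void *elem = pool->alloc(pool->elem_size);

    if (elem)
        return elem;
    if (pool->curr_nr > 0)
        return pool->reserve[--pool->curr_nr];
    return NULL;  /* reserve exhausted; this toy simply gives up */
}

void toy_pool_free(struct toy_pool *pool, void *elem)
{
    /* Refill the reserve before handing memory back.
     * Assumes every element is compatible with free(). */
    if (pool->curr_nr < pool->min_nr)
        pool->reserve[pool->curr_nr++] = elem;
    else
        free(elem);
}
```

Creating a pool with `toy_failing_alloc` shows the reserve at work: the first `min_nr` allocations succeed even though the underlying allocator never does. The cost is visible too, since `min_nr * elem_size` bytes sit idle for as long as the pool exists.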

Coming soon: bigger device numbers. One of the long-stated plans for 2.5 is to increase the size of dev_t, the type which is used to represent device numbers. This type, as it stands now, has roots all the way back to the original Unix systems - it is a 16-bit quantity, with eight bits for the major number, and eight for the minor. It is inadequate for modern systems, which can have, literally, thousands of devices on them. So dev_t has to grow.

Linus laid out the plan some time ago (see the March 29, 2001 LWN Kernel Page): dev_t would grow to 32 bits. Of those, 12 would designate the major number, and 20 the minor number. A number of people would rather see 64-bit device numbers, but Linus is opposed to that.
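The two layouts can be compared with a few lines of C. This is only a sketch: the function names are hypothetical, and placing the major number in the upper bits of the 32-bit value is an assumption, since the plan as described gives only the field widths.

```c
#include <stdint.h>

/* Traditional layout described above: 8-bit major, 8-bit minor. */
uint16_t old_mkdev(unsigned major, unsigned minor)
{
    return (uint16_t)(((major & 0xffu) << 8) | (minor & 0xffu));
}

unsigned old_major(uint16_t dev) { return (unsigned)dev >> 8; }
unsigned old_minor(uint16_t dev) { return dev & 0xffu; }

/* Planned layout: 12-bit major, 20-bit minor.
 * Assumption: the major number occupies the top 12 bits. */
uint32_t new_mkdev(uint32_t major, uint32_t minor)
{
    return ((major & 0xfffu) << 20) | (minor & 0xfffffu);
}

uint32_t new_major(uint32_t dev) { return dev >> 20; }
uint32_t new_minor(uint32_t dev) { return dev & 0xfffffu; }
```

With these widths the number of major numbers goes from 256 to 4096, and each major number gets 1,048,576 minor numbers instead of 256.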

Changing device numbers raises a number of interesting compatibility problems. Consider, for example, a tar or dump archive containing a /dev directory. The archive contains the device numbers for every entry in that directory; if those numbers stop working after the dev_t change, everybody's backups have just been rendered invalid. System administrators, when faced with that prospect, tend to break out in a cold sweat, overindulge in beer, and switch to BSD.

Fortunately, that particular problem has a solution. In the new scheme, the major number zero is set aside as a marker for "legacy" device numbers. Any 32-bit device number with a major number of zero is interpreted as an old-style number and "just works." A change to the C library will be required before applications can exchange larger device numbers with the kernel, but the change should be relatively smooth beyond that.
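A decoder that honors the "major zero means legacy" rule might look like the following. It is a sketch under stated assumptions: the names `toy_devnum` and `toy_decode_dev()` are hypothetical, the 12-bit major number is assumed to sit in the top bits, and a legacy value is assumed to carry its old 8-bit major and 8-bit minor numbers in the low 16 bits.

```c
#include <stdint.h>

struct toy_devnum {
    unsigned major;
    unsigned minor;
};

/* Decode a 32-bit device number, treating major == 0 as the legacy marker. */
struct toy_devnum toy_decode_dev(uint32_t dev)
{
    struct toy_devnum d;
    uint32_t major = dev >> 20;   /* assumed: top 12 bits hold the major */

    if (major == 0) {
        /* Old-style number: 8-bit major, 8-bit minor in the low 16 bits. */
        d.major = (dev >> 8) & 0xffu;
        d.minor = dev & 0xffu;
    } else {
        d.major = major;
        d.minor = dev & 0xfffffu;
    }
    return d;
}
```

Under this scheme a device number read from an old archive, such as 0x0801, decodes to the same major and minor numbers it always had, while values with a nonzero upper field use the wider layout.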

On the kernel side, however, life could be more interesting. Kernel developers really do try to avoid breaking applications, but they are more willing to tear things up inside the kernel. Especially in a development series.

The kernel version of the device number type is kdev_t. It has long been meant to be an opaque type, but it's really just dev_t in kernel drag. People had assumed that kdev_t would grow along with dev_t, but that's not what Linus has in mind. Linus wants kdev_t to go away entirely. All of the interfaces in the kernel which currently use that type will be changed to take a pointer to an appropriate structure. Block drivers, thus, will see a pointer to a struct block_device rather than a device number. Some sort of struct char_device will also probably be created to handle a similar role.

In other words, the kernel will no longer use device numbers at all, except as a means of communication with user space. Internally, device numbers will not exist. A lot of kernel code is going to have to change to make this happen; one does not have to look very hard to see more unstable development kernel releases in the future - see, for example, Al Viro's description of some of the issues involved. But, then, that's what development kernels are for.
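The shape of that change can be illustrated with a toy example in which a device number is translated to a structure pointer once, at the boundary, and everything past that point works with the pointer. All of the names here (`toy_block_device`, `toy_lookup()`, `toy_capacity()`) and the table contents are hypothetical; the real struct block_device and the eventual kernel interfaces will certainly look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-device structure. */
struct toy_block_device {
    uint32_t devnum;        /* kept only for talking to user space */
    const char *name;
    unsigned long sectors;
};

static struct toy_block_device toy_devices[] = {
    { 0x0300u, "disk0", 1000 },
    { 0x0801u, "disk1", 2000 },
};

/* Boundary code: the only place that sees a raw device number. */
struct toy_block_device *toy_lookup(uint32_t devnum)
{
    size_t n = sizeof(toy_devices) / sizeof(toy_devices[0]);

    for (size_t i = 0; i < n; i++)
        if (toy_devices[i].devnum == devnum)
            return &toy_devices[i];
    return NULL;
}

/* "Internal" code: takes a structure pointer, never a device number. */
unsigned long toy_capacity(const struct toy_block_device *bdev)
{
    return bdev->sectors;
}
```

One consequence of this style is that internal code never has to care how wide a device number is, which is presumably part of its appeal.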

Where do important changes get tested? One would think that, now that we finally have a development kernel again, non-trivial changes would show up there before being merged into the stable 2.4 series. Thus, there was some surprise when support for "hyperthreading" on Pentium 4 processors went into 2.4.17-pre5. That support still does not exist in 2.5, and has thus not seen the wider testing that it could experience there.

The reasoning behind putting this change into 2.4, as explained by Alan Cox, is interesting. The claim that normal users will not be affected by the change is standard. But Alan also points out that, due to the ongoing block I/O work, the 2.5 series "isn't usable for that kind of thing in the near future." So, if a feature like hyperthreading is to be tried out, it must be added to the stable kernel series.

Things will get better as the block layer stabilizes - at least, until the next set of disruptive changes go in. Until then, it's a bit ironic that the only place to test certain kinds of changes is the stable kernel series.

(Hyperthreading, for those who are interested, is the hardware trick of making a single processor appear to be multiple virtual processors, as a way of keeping the processor busy while it waits for memory accesses. See Intel's Hyperthreading page for details.)

Work on the scheduler is also coming to a boil. It is a widely (though not universally) held belief that the Linux scheduler is overdue for a rewrite in 2.5. Quoting Alan Cox again:

Its a great scheduler for a single or dual processor 486/pentium type box running a home environment. It gets a bit flaky by the time its running oracle on a 4 way, it gets very flaky by the time its running lotus back ends on an 8 way. It doesn't take lunacy like java, broken JVM implementations and volcanomark to make it go astray.

The scheduler's performance on larger systems and under load has been shown to be inadequate numerous times. But there is little agreement on what should replace it.

Mike Kravetz and company at IBM have posted a new multi-queue scheduler patch for the 2.5.0 kernel. This scheduler cuts down on scheduling time by maintaining a separate run queue for each processor on the system. It tries to improve performance while maintaining the same behavior as the existing scheduler.

Alan Cox has a new scheduler of his own which works by maintaining a set of eight (currently) run queues for each processor. Picking a process to run is just a matter of taking the first one off the highest priority queue.
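The "take the first one off the highest priority queue" step can be sketched as follows. This is not Alan's code: the structure and function names are hypothetical, and treating queue 0 as the highest priority is an assumption made for the example.

```c
#include <stddef.h>

#define TOY_NR_QUEUES 8   /* "eight (currently)" run queues per processor */

struct toy_task {
    int pid;
    struct toy_task *next;
};

/* One of these per processor; each priority level is a FIFO list. */
struct toy_cpu_rq {
    struct toy_task *head[TOY_NR_QUEUES];
    struct toy_task *tail[TOY_NR_QUEUES];
};

void toy_rq_init(struct toy_cpu_rq *rq)
{
    for (int i = 0; i < TOY_NR_QUEUES; i++)
        rq->head[i] = rq->tail[i] = NULL;
}

/* Append a task to the queue for its priority (0 = highest, by assumption). */
void toy_enqueue(struct toy_cpu_rq *rq, struct toy_task *t, int prio)
{
    t->next = NULL;
    if (rq->tail[prio])
        rq->tail[prio]->next = t;
    else
        rq->head[prio] = t;
    rq->tail[prio] = t;
}

/* Pick the first task from the highest-priority non-empty queue. */
struct toy_task *toy_pick_next(struct toy_cpu_rq *rq)
{
    for (int prio = 0; prio < TOY_NR_QUEUES; prio++) {
        struct toy_task *t = rq->head[prio];

        if (t) {
            rq->head[prio] = t->next;
            if (!rq->head[prio])
                rq->tail[prio] = NULL;
            t->next = NULL;
            return t;
        }
    }
    return NULL;   /* nothing runnable on this processor */
}
```

The selection cost depends on the fixed number of queues rather than on the number of runnable processes, which is the attraction of this kind of design.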

Finally, Davide Libenzi has a scheduler patch which implements a per-CPU run queue and some load balancing code.
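As a rough illustration of what per-CPU run queues plus load balancing involve, consider the sketch below. It is not taken from any of the patches mentioned here: the names, the four-CPU configuration, and the threshold policy are all assumptions chosen to keep the example small.

```c
#define TOY_NR_CPUS 4   /* assumed configuration for the example */

/* Each CPU has its own run queue; only the queue lengths matter here. */
struct toy_sched {
    int nr_running[TOY_NR_CPUS];
};

static int toy_busiest(const struct toy_sched *s)
{
    int best = 0;

    for (int cpu = 1; cpu < TOY_NR_CPUS; cpu++)
        if (s->nr_running[cpu] > s->nr_running[best])
            best = cpu;
    return best;
}

static int toy_idlest(const struct toy_sched *s)
{
    int best = 0;

    for (int cpu = 1; cpu < TOY_NR_CPUS; cpu++)
        if (s->nr_running[cpu] < s->nr_running[best])
            best = cpu;
    return best;
}

/* Move one task from the busiest queue to the idlest one, but only when
 * the imbalance exceeds the threshold. Returns 1 if a task was moved. */
int toy_balance_once(struct toy_sched *s, int threshold)
{
    int from = toy_busiest(s);
    int to = toy_idlest(s);

    if (s->nr_running[from] - s->nr_running[to] <= threshold)
        return 0;   /* leave tasks where they are */
    s->nr_running[from]--;
    s->nr_running[to]++;
    return 1;
}
```

The threshold is what keeps processes from bouncing between processors: small imbalances are tolerated so that a task stays on the CPU where it last ran.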

All of these projects share the same goals: cut down on scheduling overhead, work harder to keep processes from moving between processors, and retain good performance in low-load situations. The low-load performance is considered critical: it is, after all, the normal situation for most systems, and the current scheduler handles it well. No patch which impairs low-load performance is likely to get too far.

The hyperthreading issue mentioned above is likely to throw a new set of complications into the mix. A processor which does hyperthreading looks like two independent CPUs, but it should not be scheduled as such - it is better to divide processes across real (hardware) processors first. Expect scheduling to be a hot topic for some time.
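A placement policy along those lines might be sketched like this. The function name, the four-logical-CPU configuration, and the convention that logical CPUs 2k and 2k+1 share one physical processor are all assumptions made for illustration; none of them come from the patches discussed here.

```c
#define TOY_NR_LOGICAL 4   /* assumed: two physical processors, two siblings each */

/* Choose a logical CPU for a new process. busy[cpu] is nonzero if that
 * logical CPU is already running something. Assumes logical CPUs 2k and
 * 2k+1 are siblings on the same physical processor. Returns -1 if every
 * logical CPU is busy. */
int toy_pick_cpu(const int busy[TOY_NR_LOGICAL])
{
    /* First choice: a logical CPU on a completely idle physical processor. */
    for (int cpu = 0; cpu < TOY_NR_LOGICAL; cpu++)
        if (!busy[cpu] && !busy[cpu ^ 1])
            return cpu;

    /* Otherwise settle for sharing a physical processor with a sibling. */
    for (int cpu = 0; cpu < TOY_NR_LOGICAL; cpu++)
        if (!busy[cpu])
            return cpu;

    return -1;
}
```

A scheduler that treated all four logical CPUs as equals could put two processes on sibling CPUs while the other physical processor sat idle; the first pass above avoids that.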

Linux Advanced Routing & Traffic Control Documentation Project. Bert Hubert has been working for some time on the documentation of the advanced Linux routing features. The Linux traffic control mechanism has been available since the 2.1 days, but is greatly underutilized. The quality of the available documentation has not helped here. The code is great, but it's hard to figure out how to use it. So an effort to shine some light in that direction is more than welcome.

Bert's work has now grown into the Linux Advanced Routing & Traffic Control documentation project, and a great deal of information is available there. The latest addition is the tc-cbq man page: "Nearly 2500 words, 8 printed pages, of nearly unintelligible gobledygook, explaining mostly how CBQ works." Good stuff.

Other patches and updates released this week include:

  • Andrea Arcangeli has made available (as a tarball containing a magicpoint file) the slides from his PLUTO talk on the new VM implementation. This is the first documentation that has been made available on the new code. We have also made the slides available in HTML format.

  • Daniel Phillips has posted his ALS paper on ext2 directory indexes, along with a wealth of benchmark results. Worth a look if this work interests you at all.

  • Kernel Traffic #145 (December 10) is available.

  • Rusty Russell has posted a patch making it easy for kernel code to set up per-CPU data areas.

  • The ltp-20011206 release is available from the Linux Test Project.

  • The latest User-mode Linux release from Jeff Dike is 0.53-2.4.16.

  • A new preemptible kernel patch is available from Robert Love.

  • Karim Yaghmour has released version 0.9.5pre4 of the Linux Trace Toolkit.

  • Jason Baietto has released a set of "multiprocessor control interface" programs. These allow users to bind tasks to processors and to perform other, similar operations.

  • Ben LaHaise has posted a patch which adds his kvec type (essentially a lightweight replacement for kiobufs) to the kernel. kvecs are needed for his asynchronous I/O work, among other things. Also available is this patch, which works the kvec structure into the new block I/O code.

  • Eric Raymond has released CML2 1.9.7.

  • The 2001_12_10 release of the security module code is available. Also available is a new security module adding labeled IPv4 networking to SELinux.

  • Lennert Buytenhek has released version 0.0.4pre1 of his bridging netfilter code.

  • Jozsef Kadlecsik has been added to the netfilter core team.

Section Editor: Jonathan Corbet


December 13, 2001

Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds