
See also: last week's Kernel page.

Kernel development


The current development kernel release is 2.5.15, released on May 9. Changes this time around include a resumption of the "device model" work (with an emphasis on the x86 PCI code), more IDE reworking (including the removal of /proc/ide - see last week's LWN Kernel Page), an NFS server update, many patches from the "dj" series, and lots of other fixes and updates.

The in-progress 2.5.16 patch, as seen in BitKeeper, includes an ISDN update, George Anzinger's 64-bit jiffies patch, the usual IDE patches, some networking updates, work on the new NFS export scheme, and more.

Dave Jones's latest patch is 2.5.15-dj1, which contains a relatively small set of fixes and updates.

The latest 2.5 status summary from Guillaume Boissiere is dated May 15.

The current stable kernel release is 2.4.18. No 2.4.19 prepatches have been released by Marcelo this week.

The current patch from Alan Cox is 2.4.19-pre8-ac4. The biggest change here is a new set of IDE updates by Andre Hedrick that went into -ac3. The 2.4 and 2.5 IDE subsystems continue to go in very different directions.

On the 2.2 front, Alan has released 2.2.21-rc4, the latest 2.2.21 release candidate. Unless something turns up, this one will become the real 2.2.21.

The future of in-kernel web servers. Some recent discussion of troubles with khttpd, the in-kernel web server which has been present since the early 2.3 days, led to the statement that khttpd would soon be removed from the 2.5 series. khttpd has its share of happy users, but it has been essentially unmaintained for years, and it has been superseded by Ingo Molnar's TUX server. So the kernel developers see little reason to keep it around.

The more interesting question, perhaps, is whether TUX will take the place of khttpd. There appears to be little consensus on whether TUX should go in or not. Some developers are worried about the impact of the TUX patch, while others claim it affects little other code. It is not clear how much of a performance benefit TUX really provides - some user-space web servers are said to be getting quite close to TUX in speed. And, of course, a number of people feel that an application like a web server has no place inside the Linux kernel.

Servers like TUX and khttpd remain interesting as a demonstration of how to create the shortest, fastest path between the network and files on a disk. Chances are that TUX will find its way into a mainline kernel sooner or later.

Per-driver filesystems made easy. Alexander Viro has long been a proponent of small, special-purpose filesystems as a way for device drivers (or other kernel subsystems) to communicate with user space. The mini filesystem approach, he says, is a far cleaner and safer technique than the alternatives: /proc, the ioctl() call, or devfs. This approach makes sense to a number of people, but it has not been widely adopted. After all, if you are not Al Viro (which is the case for most of us), hacking up a new filesystem can be a little intimidating.

So he has been trying for a while to make the task of writing driver filesystems easier. His latest posting includes a set of library functions which mostly concern themselves with the creation of superblocks for virtual filesystems. The superblock is a good thing to hide within a library layer; virtual filesystems just need something to hand to the VFS; there should be no need for each one to duplicate a lot of "fill in the superblock field" code.

The other half of the posting is a driver which creates a little filesystem to export the value of a set of VIA motherboard temperature sensors. The whole thing takes up 70 lines of code, and much of that, of course, is dealing with getting information from the sensors. The task of creating special purpose virtual filesystems has indeed been made easy.
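
To make the shape of the thing concrete, a single-file driver filesystem built this way might look like the sketch below. This is not Viro's actual code: it uses the simple_fill_super() helper and struct tree_descr as they appear in the mainline kernel's libfs code, and the "viafs" name, magic number, and via_read_sensor() function are invented for the example.

     /*
      * A hedged sketch of a tiny driver filesystem exporting one
      * read-only file. simple_fill_super() and struct tree_descr are
      * real libfs helpers; "viafs" and via_read_sensor() are
      * hypothetical, and the posting's actual interface may differ.
      */
     #include <linux/module.h>
     #include <linux/kernel.h>
     #include <linux/fs.h>
     #include <asm/uaccess.h>

     extern int via_read_sensor(void);   /* hypothetical sensor query */

     static ssize_t temp_read(struct file *file, char *buf,
                              size_t count, loff_t *ppos)
     {
         char tmp[16];
         int len = sprintf(tmp, "%d\n", via_read_sensor());

         if (*ppos >= len)
             return 0;
         if (count > len - *ppos)
             count = len - *ppos;
         if (copy_to_user(buf, tmp + *ppos, count))
             return -EFAULT;
         *ppos += count;
         return count;
     }

     static struct file_operations temp_fops = {
         .read = temp_read,
     };

     static int viafs_fill_super(struct super_block *sb, void *data, int silent)
     {
         static struct tree_descr files[] = {
             /* inodes 0 and 1 are reserved; entries start at 2 */
             [2] = { "temperature", &temp_fops, S_IRUGO },
             { "" }  /* empty name terminates the list */
         };
         return simple_fill_super(sb, 0x56494146, files);
     }

What remains is the usual register_filesystem() boilerplate, after which an administrator could mount the result with something like "mount -t viafs none /mnt/via" - which leads directly to the administrative question below.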

The trickier part in the long run may be on the system administration side. If the mini filesystem approach takes off, each system will have to be configured to mount these filesystems in the right places. /proc files and ioctl() calls just show up in their standard places, but filesystems must be explicitly mounted somewhere. How are VIA motherboard users to know that they can mount a devvia filesystem somewhere to read their temperature sensors? Add in a dozen other hardware-specific filesystems and one begins to see that some work on system administration tools will be needed to make it all easy to manage.

A different approach to asynchronous I/O. It started with a discussion of the O_DIRECT flag, which can be used to request that "direct" I/O be performed on a file. Direct I/O moves data directly between the userspace buffer and the device performing the I/O, without copying through kernel space. It can be faster, since it avoids copy operations and does not fill the system's page cache with data that will not be used again.

It was noted recently that benchmarks using O_DIRECT tend to perform worse than those using regular, cached I/O. The reason for this poor performance is reasonably straightforward: direct I/O, as implemented in Linux, is synchronous. The application must sleep and wait for the operation to complete, and there is no opportunity to reorder operations for better I/O performance. If you really want to make O_DIRECT work well, you need to combine it with asynchronous I/O.
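
For those who have not played with it, a minimal (synchronous) O_DIRECT read from user space looks something like the following sketch; the 512-byte alignment used here is an assumption, since the real requirement depends on the device and filesystem involved.

     /* A minimal sketch of a synchronous O_DIRECT read; the buffer
      * (and, in general, the offset and length) must be aligned to
      * the device block size - 512 bytes is assumed here. */
     #define _GNU_SOURCE             /* for O_DIRECT */
     #include <fcntl.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <unistd.h>

     int main(void)
     {
         void *buf;
         ssize_t n;
         int fd = open("datafile", O_RDONLY | O_DIRECT);

         if (fd < 0) {
             perror("open");
             return 1;
         }
         if (posix_memalign(&buf, 512, 4096))
             return 1;

         /* This read bypasses the page cache and, as described
          * above, blocks until the device completes the transfer. */
         n = read(fd, buf, 4096);
         printf("read %zd bytes\n", n);

         free(buf);
         close(fd);
         return 0;
     }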

So, one would think, there would be a motivation to get the asynchronous I/O patches into the 2.5 kernel. Linus, however, has other ideas, based on his opinion of O_DIRECT:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances.

In other words, one might conclude that he doesn't like it.

A statement like that, of course, raises an immediate question: how, exactly, would one design a high-performance, zero-copy, asynchronous I/O subsystem when the monkeys won't share their substances? Linus's answer is to split apart the two aspects of the problem: performing the I/O and connecting the data to user space.

In this new scheme, a process wishing to do asynchronous, direct reads from a file would, after opening that file, invoke a new system call:

     readahead(file_desc, offset, size);

This call sets the kernel to work populating the system's page cache with data from the file, starting at the given offset, for approximately size bytes. At this point, the data is in (kernel) memory, but is not visible to the userspace application. Actually getting at the data requires calling mmap() with a special MAP_UNCACHED flag.

This memory mapping is special in a couple of ways. One is that it does not set up any page tables when the mapping is established, so it happens very quickly. The other is that, when the user application generates a page fault (by trying to access the data it ordered with readahead()), the page is "stolen" from the page cache and turned into a private page belonging to the application. Until the fault happens, the read operation is entirely asynchronous; once the application actually tries to use the data, it will wait if the operation still has not completed.
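
Put together, the read side of the scheme would be used roughly as follows. This is purely illustrative: readahead() does exist as a system call, but MAP_UNCACHED does not, and the details are guesses at an interface that was only sketched on a mailing list.

     /* Illustrative only: MAP_UNCACHED is a proposed flag with no
      * implementation. */
     int fd = open("datafile", O_RDONLY);

     /* Ask the kernel to start filling the page cache asynchronously. */
     readahead(fd, 0, BUF_SIZE);

     /* Map the region; no page tables are built, so this is fast. */
     char *buf = mmap(NULL, BUF_SIZE, PROT_READ, MAP_UNCACHED, fd, 0);

     /* ... do other useful work while the I/O proceeds ... */

     /* The first access faults; the page is stolen from the page
      * cache, sleeping only if the read has not yet completed. */
     process(buf[0]);            /* hypothetical consumer */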

If the application is, instead, looking to write data, it starts by populating its mapped memory segment. When things are ready to go, another new system call:

     mwrite(file_desc, address, length);
is used. mwrite() puts the page back into the page cache (where it will get written eventually) and removes it from the process's page table. The (new) fdatasync_area() system call may be used to force (and wait for) specific pages to be written.
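
In the same illustrative spirit (neither mwrite() nor fdatasync_area() exists anywhere), the write side would look roughly like:

     /* Illustrative only: mwrite() and fdatasync_area() are proposed
      * system calls with no implementation. */
     memcpy(buf, data, length);          /* populate the mapped segment */

     /* Hand the pages back to the page cache (dropping them from the
      * process's page tables); writeback happens eventually. */
     mwrite(fd, buf, length);

     /* Optionally force the pages out and wait for completion. */
     fdatasync_area(fd, offset, length);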

A process which is simply copying data need never access the pages in the mapping directly. In this case, no page tables ever get built, and things go even more quickly. Pure copy cases are relatively rare, though, especially since this scheme would not support I/O to network connections (which do not use the page cache). The high-profile application for this sort of I/O (or O_DIRECT) is Oracle, which performs lots of I/O out of large segments.

So far, all this is just a scheme sketched out by Linus, with no implementation to play with. Should some ambitious kernel hacker code it up, however, it would be interesting to see how it really performs relative to other techniques.

Corrections on the buffer head work. Andrew Morton politely pointed out that your editor was more confused than usual when writing about Andrew's buffer head work last week. The bulk of that work actually affected the way the write() system call was handled. In the old scheme, data to be written back to files would find its way into the buffer head least-recently-used queue, where it would eventually be flushed to disk. With the new code, this data is written directly from the page cache, in a more page-oriented mode.

Buffer heads are still used to coordinate the I/O process, for now. Thanks to all the block layer work that has gone in, however, the block system now takes those buffer heads and digs down to the real pages underneath them. The obvious next step is to remove the buffer head "middleman" and submit pages directly to the block layer, at which point buffer heads will no longer be the main mechanism for block I/O.
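
For the curious, a page submitted directly through the 2.5 bio interface, with no buffer head in the middle, would look conceptually like the sketch below; the sector, block device, and completion handler are assumed to come from elsewhere.

     /* A conceptual sketch of buffer-head-free submission using the
      * 2.5 bio structure; sector, bdev, page, and write_done() are
      * assumed to be set up elsewhere. */
     struct bio *bio = bio_alloc(GFP_NOIO, 1);

     bio->bi_sector = sector;                /* where on the device */
     bio->bi_bdev = bdev;                    /* which device */
     bio->bi_io_vec[0].bv_page = page;       /* the page cache page itself */
     bio->bi_io_vec[0].bv_len = PAGE_SIZE;
     bio->bi_io_vec[0].bv_offset = 0;
     bio->bi_vcnt = 1;
     bio->bi_size = PAGE_SIZE;
     bio->bi_end_io = write_done;            /* completion handler */

     submit_bio(WRITE, bio);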

Sorry for the confusion.

Other patches and updates released this week covered kernel trees, core kernel code, device drivers, filesystems, kernel building, miscellaneous topics, and ports.

Section Editor: Jonathan Corbet


May 16, 2002

