See also: last week's Kernel page.

Kernel development

The current kernel release is still 2.4.3. Linus's 2.4.4 prepatch has reached 2.4.4pre4; it includes much more stuff from Alan Cox's "ac" series, a number of fixes, and, interestingly, the zero-copy networking patch (see below). Alan Cox's series (currently at 2.4.3ac9) is getting smaller as the patches get into the mainstream kernel, but there's still quite a bit of stuff there. Some of what's there, including the user-mode Linux patch, will evidently not go to Linus at all, at least for now.

Zero-copy networking will be in 2.4.4. This patch, by David Miller, Alexey Kuznetsov, and others, has been in development and testing for some time, and was incorporated into the "ac" kernel series back in 2.4.2ac4. In a way, it is a surprising change to see in a stable kernel series, since it makes fundamental changes deep in the networking code. From all reports, however, it is solid, and, in certain situations, it should produce significant performance benefits.

Zero-copy networking speeds things up by avoiding, whenever possible, copies of the data to be transferred. In an optimal case, a buffer full of data sent over the network by an application (an FTP server, say) will go directly to the network interface from the application's memory. Without zero-copy networking, however, that's not how things are done - at a minimum, the data is copied into kernel space and assembled into one or more packets before going to the wire. All that copying can slow things down and fill up the cache; it's not surprising that people want to eliminate it.

Making zero-copy work is not straightforward, and the patch is large. Various issues have to be dealt with, including:

  • A fast and flexible method must exist for locating the user data array in physical memory, locking it down, and making it available to the hardware. As has been covered before on this page, the "kiobuf" mechanism was deemed too heavyweight for the networking code. So zero-copy networking passes around simple structures with direct pointers to the struct pages for the user buffer.

  • A user buffer must be assembled into one or more packets, with headers, before transmission. Zero-copy requires that the separate pieces remain apart until joined by the hardware - the alternative is to copy the data into a kernel-space packet buffer. So the kernel must be able to keep track of packets that are stored in several distinct pieces, and the network drivers (and hardware) must be prepared to handle the "scatter/gather" operations that piece together the packets at transmission time.

  • Most network protocols require checksums to be calculated for packets at transmission time. Normally the kernel calculates the checksums, but doing so requires, of course, a pass over the data. If you are going to iterate over the data to calculate the checksum, you might as well copy it while you're at it; the difference in cost is relatively small. If, instead, you want to do zero-copy networking, your hardware must be capable of supplying the checksum - and the driver must be able to tell it to do so.

  • Systems where zero-copy networking makes sense are also likely to have tremendous amounts of memory - above the kernel "high memory" mark and perhaps more than can be addressed with 32 bits. If you're transferring data directly to and from user buffers, you must be prepared for them to be in high memory - and the device must be able to address that memory.
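
The checksum point above is easier to see in code. What follows is a minimal user-space sketch, not kernel code: the function name copy_and_checksum() is made up for this illustration, and it computes a 16-bit one's-complement checksum in the style of RFC 1071 while copying the data in the same loop. Since every byte is already being read for the sum, storing it somewhere else on the way through adds relatively little work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and return the
 * 16-bit Internet-style checksum (one's-complement sum, complemented)
 * of the data, all in a single pass over the buffer.
 */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;   /* wide accumulator, folded to 16 bits at the end */
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Each 16-bit word is read once: stored to dst and added to the sum. */
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A zero-copy transmit path cannot use a loop like this at all, since it never touches the data; that is why the checksum has to come from the hardware instead.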

To handle all of this stuff, the zero-copy networking patch makes some fundamental changes to the networking core code. Traditionally, packets are passed around via a struct sk_buff structure, usually referred to as an "skb." The skb contains the entire packet, headers and all. With zero-copy, an skb can now be "paged," or "nonlinear," meaning that it consists of several pieces which are not contiguous in memory. Much of the code which handles skb structures must be changed to take this new structure into account.
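
As a rough illustration of what "nonlinear" means, here is a simplified user-space model of a packet made of a linear header area plus a list of separate data fragments. Everything in it (struct demo_skb, struct demo_frag, the demo_ functions, the sizes) is hypothetical and chosen for this sketch; the real struct sk_buff is considerably more involved and refers to pages rather than plain pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FRAGS 6    /* arbitrary limit for this sketch */

/* One piece of packet data that lives elsewhere in memory. A plain
 * pointer stands in for the page reference the kernel would use. */
struct demo_frag {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical, simplified stand-in for a paged ("nonlinear") skb. */
struct demo_skb {
    unsigned char head[64];                  /* linear area: headers */
    size_t head_len;
    struct demo_frag frags[DEMO_MAX_FRAGS];  /* data kept in place */
    size_t nr_frags;
};

int demo_skb_is_nonlinear(const struct demo_skb *skb)
{
    return skb->nr_frags > 0;
}

/* Total packet length: linear part plus every fragment. */
size_t demo_skb_len(const struct demo_skb *skb)
{
    size_t total = skb->head_len;

    for (size_t i = 0; i < skb->nr_frags; i++)
        total += skb->frags[i].len;
    return total;
}

/* Attach a fragment by reference, without copying its contents.
 * Returns 0 on success, -1 if the fragment table is full. */
int demo_skb_add_frag(struct demo_skb *skb, const void *data, size_t len)
{
    if (skb->nr_frags >= DEMO_MAX_FRAGS)
        return -1;
    skb->frags[skb->nr_frags].data = data;
    skb->frags[skb->nr_frags].len = len;
    skb->nr_frags++;
    return 0;
}

/* Fallback for consumers that need contiguous data: copy all pieces
 * into out. Returns the bytes written, or 0 if out is too small. */
size_t demo_skb_linearize(const struct demo_skb *skb,
                          unsigned char *out, size_t out_size)
{
    size_t off = skb->head_len;

    if (demo_skb_len(skb) > out_size)
        return 0;
    memcpy(out, skb->head, skb->head_len);
    for (size_t i = 0; i < skb->nr_frags; i++) {
        memcpy(out + off, skb->frags[i].data, skb->frags[i].len);
        off += skb->frags[i].len;
    }
    return off;
}
```

Code that used to assume a single contiguous buffer now has two choices: walk the fragments itself, or pay for a copy along the lines of demo_skb_linearize(). That is the kind of adjustment the paragraph above refers to.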

The driver interface has also seen changes. There is a new "features" variable in the netdevice structure which is used to mark some of the capabilities of the device (and its driver); these include the ability to perform checksums, deal with high memory, and do scatter/gather I/O. This variable was actually added in 2.4.0-test12, just before the official 2.4.0 release, but it's only with the zero-copy patch that it is seeing some real use.
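
The idea of a capability bitmask can be sketched as follows. The flag names, their values, struct demo_netdev, and demo_can_zero_copy() are all invented for this example and should not be read as the kernel's actual definitions; the point is only that the networking core can test a few bits before deciding whether to hand a device a packet in pieces.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative capability bits; names and values are made up here. */
enum {
    DEMO_F_SG      = 1 << 0,   /* device can do scatter/gather I/O */
    DEMO_F_HW_CSUM = 1 << 1,   /* device can compute checksums */
    DEMO_F_HIGHDMA = 1 << 2    /* device can DMA to/from high memory */
};

/* Hypothetical, stripped-down stand-in for the netdevice structure. */
struct demo_netdev {
    const char *name;
    unsigned int features;
};

/*
 * Return nonzero if a buffer could be transmitted without copying:
 * the device must gather the pieces and supply the checksum itself,
 * and must reach high memory if that is where the buffer lives.
 */
int demo_can_zero_copy(const struct demo_netdev *dev, int buffer_in_highmem)
{
    unsigned int needed = DEMO_F_SG | DEMO_F_HW_CSUM;

    assert(dev != NULL);
    if (buffer_in_highmem)
        needed |= DEMO_F_HIGHDMA;
    return (dev->features & needed) == needed;
}
```

A driver that leaves the variable at zero keeps getting linear, already-checksummed packets, which is how unconverted drivers can continue to work unchanged.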

The change in the driver interface means that zero-copy I/O is only possible if the relevant network driver has been updated to support it. So far, only the AceNIC and Sun HME drivers have been fully converted. The work required appears not to be large, assuming that the hardware is reasonable, so more drivers will likely be updated in the future.

Zero-copy networking is not a win for everybody; it really only makes sense on high-end hardware and very fast networks. In that situation, though, it should be a real performance win; expect more amazing web server benchmark results in the near future.

Children first. Adam Richter posted a patch which makes a subtle change in the way the fork() system call works. It is interesting to look at as an example of how little tactical changes can affect operating system performance.

On Unix-like systems, the child of a process that forks normally gets a copy of the parent process's entire address space. Actually copying everything, of course, would be most inefficient. Read-only memory (such as program code) can be simply shared, but writable memory requires a bit more cleverness. The technique used is to share the data space, but to mark it "copy on write" (or "COW"). Both processes see the same COW pages, until one of them tries to make a change. At that point, the kernel makes a copy of the relevant page, making it private to the process, which is unaware that anything has happened.
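
The effect of copy-on-write is visible from user space only as isolation: each process sees its own data after a write. The following small sketch (the function name demo_cow_isolated() is made up for this example) forks, lets the child overwrite a value, and checks that the parent's value is untouched. It shows the semantics only; the page copying itself happens inside the kernel and cannot be observed directly this way.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, have the child write to memory inherited from the parent, and
 * verify that the parent's copy did not change.
 * Returns 0 if the two processes were isolated as expected, -1 otherwise.
 */
int demo_cow_isolated(void)
{
    int status;
    int ok;
    pid_t pid;
    int *value = malloc(sizeof *value);

    if (value == NULL)
        return -1;
    *value = 1;

    pid = fork();
    if (pid < 0) {
        free(value);
        return -1;
    }

    if (pid == 0) {
        /* Child: this write lands on a page shared with the parent,
         * so the child ends up with a private copy of that page. */
        *value = 42;
        _exit(*value == 42 ? 0 : 1);
    }

    /* Parent: wait for the child, then look at our own copy. */
    if (waitpid(pid, &status, 0) < 0) {
        free(value);
        return -1;
    }
    ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && *value == 1;
    free(value);
    return ok ? 0 : -1;
}
```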

The 2.4.3 kernel, on a fork(), puts the child process into the run queue and resumes executing in the parent. The child will run sometime later as part of the normal timesharing of the processor. It turns out, though, that this is not the best approach from a performance point of view.

The parent process will likely go on modifying its private data, causing the system to make copies of the various COW pages shared with the child process. But the child, in most cases, is unlikely to ever look at those pages; instead, it will probably perform a few operations, then go and exec() some other program, which breaks its attachment to the shared pages. If the child were to run first, the parent would probably not need to copy all those pages, and performance would be improved.
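
The argument can be checked with a toy model. The sketch below is not how the kernel tracks pages; struct demo_mm and the demo_ functions are hypothetical, and the model ignores any writes the child makes before its exec(). It simply counts how many pages get copied when the parent writes to some of them, depending on whether the child has already exec()ed and dropped its share.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGES 8    /* arbitrary number of writable pages */

/* Toy address-space model: which pages are still shared with the
 * child, and how many page copies the parent has caused so far. */
struct demo_mm {
    int shared[DEMO_PAGES];
    int copies;
};

/* After fork(), every writable page is shared copy-on-write. */
static void demo_fork(struct demo_mm *mm)
{
    for (size_t i = 0; i < DEMO_PAGES; i++)
        mm->shared[i] = 1;
    mm->copies = 0;
}

/* A write to a still-shared page forces a copy; a write to a page
 * the parent owns alone costs nothing extra. */
static void demo_parent_write(struct demo_mm *mm, size_t page)
{
    if (mm->shared[page]) {
        mm->shared[page] = 0;
        mm->copies++;
    }
}

/* exec() in the child drops its references to all inherited pages. */
static void demo_child_exec(struct demo_mm *mm)
{
    for (size_t i = 0; i < DEMO_PAGES; i++)
        mm->shared[i] = 0;
}

/* Run one fork/exec scenario in which the parent writes to the first
 * pages_written pages; return the number of page copies it caused. */
int demo_copies(int child_first, size_t pages_written)
{
    struct demo_mm mm;

    assert(pages_written <= DEMO_PAGES);
    demo_fork(&mm);
    if (child_first)
        demo_child_exec(&mm);
    for (size_t i = 0; i < pages_written; i++)
        demo_parent_write(&mm, i);
    if (!child_first)
        demo_child_exec(&mm);
    return mm.copies;
}
```

In this model, demo_copies(0, 5) reports five copied pages when the parent runs first, while demo_copies(1, 5) reports none when the child gets to exec() first.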

And, in fact, according to Linus, the performance difference is visible. As a result, this patch went into 2.4.4pre4 (though it does not show up in the changelog).

Other patches and updates released this week include:

  • Eric Raymond has released cml2-1.2.0. Testing activity has been high, resulting in a number of squashed bugs. The performance problems appear to be a thing of the past, and much of the recent discussion has moved to things like the proper colors to use in the X configuration interface. Eric has thanked everybody who has participated in the conversation, "even the most mossbacked grumbling conservatives."

  • Alexander Viro has posted a patch which moves ext2 directories to the page cache.

  • Bharata B. Rao has released a new version of his patch to arbitrate access to the debug registers in the kernel.

  • Maneesh Soni has a fix for the longstanding module unload race problems that uses a two-phase cleanup scheme.

  • Linus Torvalds sent out a design for a new fast user-space semaphore implementation. It would be blindingly fast, especially in the no-contention case, but would also abandon the SYSV semaphore API.

  • Jari Ruusu has released a filesystem encryption mechanism which is implemented as a loadable kernel module. It's aimed at people who want encrypted files, but do not want to apply the full international kernel patch.

  • A read-only Veritas filesystem implementation was released by Christoph Hellwig.

  • Johan Verrept has released a USB host controller interface for user-mode Linux. This code will allow the debugging of USB drivers in a user-mode kernel, making development of those drivers a much more pleasant task.

Section Editor: Jonathan Corbet

April 19, 2001

Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds