[LWN.net]


Kernel development


The current kernel release is still 2.4.0. Linus continues to put together a 2.4.1 prepatch, currently at 2.4.1-pre10. His approach remains conservative, and this patch (especially if you ignore ReiserFS) is relatively small.

Those looking for something meatier may want to consider, instead, 2.4.0-ac11 from Alan Cox. This release contains literally hundreds of patches - almost 10MB worth.

Cutting out the middleman in data transfers. The discussion started by David Miller's posting of an experimental zero-copy networking implementation (discussed on this page two weeks ago) continues, though it has moved into new areas. One of those is the optimization of data transfers to avoid copying the data as much as possible. Consider, for example, the sendfile() interface that Linux supports now; using sendfile(), an application (a web server, say) can transfer a disk file to a network socket without ever having to read it into user space. There is an obvious performance gain from operating in this mode for certain applications.
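As a rough illustration of the sendfile() pattern described above, here is a minimal sketch in C. The function and helper names (send_file_to_socket, demo_sendfile) and the temporary file path are invented for this example; only sendfile() itself and the standard socket calls are real interfaces. The demo helper uses a local socket pair in place of a network connection simply to stay self-contained.

```c
#define _GNU_SOURCE             /* exposes mkstemp() under strict -std= modes */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy `count` bytes from the start of in_fd to out_fd without staging
 * them in a user-space buffer. Returns 0 on success, -1 on error. */
int send_file_to_socket(int out_fd, int in_fd, size_t count)
{
    off_t offset = 0;           /* sendfile() advances this as it goes */

    while ((size_t)offset < count) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, count - (size_t)offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return -1;          /* file ended before `count` bytes */
    }
    return 0;
}

/* Hypothetical self-check: push a temporary file through a local socket
 * pair and confirm that the same bytes come out. Returns 0 on success. */
int demo_sendfile(void)
{
    const char msg[] = "page contents\n";
    const size_t len = sizeof(msg) - 1;
    char path[] = "/tmp/sendfile-demo-XXXXXX";
    char buf[64];
    size_t got = 0;
    ssize_t n;
    int sv[2], rc;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* the open descriptor keeps the file alive */
    if (write(fd, msg, len) != (ssize_t)len ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(fd);
        return -1;
    }
    rc = send_file_to_socket(sv[0], fd, len);
    close(sv[0]);               /* lets the reader see end-of-stream */
    close(fd);
    while (rc == 0 && got < sizeof(buf) &&
           (n = read(sv[1], buf + got, sizeof(buf) - got)) > 0)
        got += (size_t)n;
    close(sv[1]);
    return (rc == 0 && got == len && memcmp(buf, msg, len) == 0) ? 0 : -1;
}
```

A server would call send_file_to_socket() with an accepted connection and an open file descriptor; the file data never passes through a buffer owned by the application.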

So, why not extend the idea to its logical conclusion? Why not have a system call that says "copy data from here to there, and optimize as much as possible"? One approach to this mode is Larry McVoy's 'splice' interface, which tries to provide a general way for user space processes to control high-performance copies. It provides "push" and "pull" primitives which handle the destination and source sides of a copy, respectively, and give the application some latitude in how the two are put together.

Here are Linus's comments on splice and why it has not been implemented so far. Essentially, sendfile handled the task that most users wanted, the splice interface needed a bit of work, and it didn't fit well into the structure of the kernel at the time. The kernel has since evolved, and Linus's message hints that an implementation of a modified form of splice would be easier now, and that it might even be accepted.

One can take the idea further, however: why not, when appropriate, simply tell the hardware to copy the data between devices directly and leave the kernel (and the processor) out of it altogether? According to Linus, that's one of those great ideas that turns out not to be so great in practice. His short response to the idea was:

device-to-device copies sound like the ultimate thing.

They suck. They add a lot of complexity and do not work in general. And, if your "normal" usage pattern really is to just move the data without even looking at it, then you have to ask yourself whether you're doing something worthwhile in the first place.

Further into the discussion, Linus came up with other reasons to avoid direct device-to-device (D2D?) copies. One is that there is very little use for the capability in the end. One can talk, for example, of streaming video directly to disk - but how often will a user be recording video without wanting to look at it too? Another is that very little hardware supports that mode of operation. Linus sees a trend toward connecting hardware with direct, point-to-point links that are not amenable to direct operations between devices. Quoth Linus: "Just wait. My crystal ball is infallible."

TCP_CORK or MSG_MORE? Another branch of the same discussion has to do with getting optimal performance from network transfers. Imagine a web server using the sendfile() interface described above. In response to a request for a page, the server will first write out a short set of HTTP headers, then use sendfile() to actually transfer the page data. By the time the sendfile() call is made, however, the headers will already have gone out on the net as a very short packet. The result is poor performance on both the sending and receiving sides.

Linux has handled this issue with a TCP option called TCP_CORK. If an application sets that option on a socket, the kernel will not send out short packets. Instead, it will wait until enough data has shown up to fill a maximum-size packet, then send it. When TCP_CORK is turned off, any remaining data will go out on the wire.
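The cork-and-uncork sequence might look roughly like the following sketch. TCP_CORK, setsockopt(), and sendfile() are the real interfaces; the function names, the loopback helper, and the temporary file are assumptions made up for this example, and the demo needs a working loopback interface.

```c
#define _GNU_SOURCE             /* exposes mkstemp() under strict -std= modes */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn TCP_CORK on (1) or off (0) for a TCP socket. */
static int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Cork the socket, write the headers, sendfile() the body, then uncork
 * so that any partially filled packet goes out. Returns 0 or -1. */
int send_response_corked(int sock, const char *hdr, int file_fd, size_t file_len)
{
    size_t hdr_len = strlen(hdr);
    off_t off = 0;
    int rc = -1;

    if (set_cork(sock, 1) < 0)
        return -1;
    if (write(sock, hdr, hdr_len) != (ssize_t)hdr_len)
        goto out;
    while ((size_t)off < file_len) {
        ssize_t n = sendfile(sock, file_fd, &off, file_len - (size_t)off);
        if (n <= 0)
            goto out;
    }
    rc = 0;
out:
    if (set_cork(sock, 0) < 0)  /* always uncork, even after an error */
        rc = -1;
    return rc;
}

/* Hypothetical demo helper: build a connected TCP pair over 127.0.0.1. */
static int tcp_pair(int fds[2])
{
    struct sockaddr_in a;
    socklen_t len = sizeof(a);
    int lst = socket(AF_INET, SOCK_STREAM, 0);

    if (lst < 0)
        return -1;
    memset(&a, 0, sizeof(a));   /* port 0: let the kernel pick one */
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lst, (struct sockaddr *)&a, sizeof(a)) < 0 ||
        listen(lst, 1) < 0 ||
        getsockname(lst, (struct sockaddr *)&a, &len) < 0 ||
        (fds[0] = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        close(lst);
        return -1;
    }
    if (connect(fds[0], (struct sockaddr *)&a, sizeof(a)) < 0 ||
        (fds[1] = accept(lst, NULL, NULL)) < 0) {
        close(fds[0]);
        close(lst);
        return -1;
    }
    close(lst);
    return 0;
}

/* Hypothetical self-check: returns 0 if headers and body arrive intact. */
int demo_corked(void)
{
    const char hdr[] = "HTTP/1.0 200 OK\r\n\r\n";
    const char body[] = "hello\n";
    const size_t hlen = sizeof(hdr) - 1, blen = sizeof(body) - 1;
    char path[] = "/tmp/cork-demo-XXXXXX", buf[128];
    size_t got = 0;
    ssize_t n;
    int fds[2], rc, fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, body, blen) != (ssize_t)blen || tcp_pair(fds) < 0) {
        close(fd);
        return -1;
    }
    rc = send_response_corked(fds[0], hdr, fd, blen);
    close(fds[0]);
    close(fd);
    while (got < sizeof(buf) && (n = read(fds[1], buf + got, sizeof(buf) - got)) > 0)
        got += (size_t)n;
    close(fds[1]);
    return (rc == 0 && got == hlen + blen && memcmp(buf, hdr, hlen) == 0 &&
            memcmp(buf + hlen, body, blen) == 0) ? 0 : -1;
}
```

Note that the cork is persistent state on the socket: everything written between the two set_cork() calls is affected, no matter which code does the writing.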

TCP_CORK does the job reasonably well. Recently, however, a contingent led by Ingo Molnar has been pushing for a new interface which uses a flag called MSG_MORE. Rather than applying to the socket in general, MSG_MORE is attached to one or more write operations on that socket. It says "there will be more data coming," and the kernel knows to buffer data to get bigger packets. The advantages of this approach are said to be (1) it requires no persistent state on the socket, thus helping, among other things, to avoid programming errors; and (2) it avoids the system call overhead of toggling the TCP_CORK flag. Ingo used MSG_MORE in the implementation of the TUX kernel web server, and is happy with the results.
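For comparison, a per-call version might look like the sketch below. It assumes a kernel and C library that expose MSG_MORE as a send() flag, which is how later Linux releases provide it; the function names are invented for this example. The demo helper uses a local socket pair only to check that the bytes arrive in order; the packet-coalescing effect itself matters on TCP sockets.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the headers flagged "more data is coming", then the body with no
 * flag so that the kernel is free to transmit. Returns 0 or -1. */
int send_with_more(int sock, const char *hdr, const char *body)
{
    size_t hlen = strlen(hdr), blen = strlen(body);

    if (send(sock, hdr, hlen, MSG_MORE) != (ssize_t)hlen)
        return -1;
    if (send(sock, body, blen, 0) != (ssize_t)blen)
        return -1;
    return 0;
}

/* Hypothetical self-check over a local socket pair; returns 0 if the
 * receiver sees the headers followed by the body. */
int demo_msg_more(void)
{
    const char expect[] = "HDR\r\n\r\nBODY";
    char buf[64];
    size_t got = 0;
    ssize_t n;
    int sv[2], rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    rc = send_with_more(sv[0], "HDR\r\n\r\n", "BODY");
    close(sv[0]);               /* lets the reader see end-of-stream */
    while (rc == 0 && got < sizeof(buf) &&
           (n = read(sv[1], buf + got, sizeof(buf) - got)) > 0)
        got += (size_t)n;
    close(sv[1]);
    return (rc == 0 && got == sizeof(expect) - 1 &&
            memcmp(buf, expect, got) == 0) ? 0 : -1;
}
```

There is no socket state to set or clear here, but every writer has to know about the flag and pass it explicitly.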

Linus, however, is not convinced. MSG_MORE requires a flag to be set on every transfer, only works on sockets, and requires that the code that is doing the writing be aware of the flag. TCP_CORK, by contrast, works with programs using the standard I/O package, and it can be set on sockets that are passed to other applications, such as CGI scripts, that are completely unaware of its presence. The TCP_CORK flag preserves a lot more of the standard Unix stream semantics.

Conclusion: don't expect to see MSG_MORE show up in user space anytime soon.

Fixing the 2.4.0 USB breakage. When 2.4.0 came out, it included a last-minute change to the usb_device_id structure, which is used to find driver modules for specific USB devices. Unfortunately, the form of this change was such that it broke the USB autoloading mechanism entirely. Since then, the USB maintainers, along with modutils maintainer Keith Owens, have been trying to figure out a way to make things work again.

The problem is that modutils, which handles the actual module loading process, cannot distinguish the new usb_device_id structure from the old one. Making modutils work with the 2.4.0 version of the structure is not a problem - but then it will cease to work for earlier versions. Keith Owens places great importance on backward compatibility, and does not want to break things for any version. So he has produced a kernel patch which adds a version number to the relevant structures. With versioning, changes can be detected and everything can be made to work.
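To see why a version number helps, consider the following sketch of a reader that dispatches on one. Everything here is hypothetical: the structures are not the real usb_device_id layout and this is not the code from Keith Owens's patch. It only illustrates the general technique of tagging a binary table so that a tool can tell layouts apart.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header placed in front of a device table. */
struct table_header {
    uint32_t version;
};

/* Hypothetical old layout. */
struct entry_v1 {
    uint16_t vendor, product;
};

/* Hypothetical new layout: a field was inserted at the front, so the
 * old offsets no longer line up. */
struct entry_v2 {
    uint16_t match_flags, vendor, product;
};

/* Return the index of the matching entry, -1 if there is no match, or
 * -2 if the table carries a version this reader does not understand. */
int find_device(const unsigned char *table, size_t len,
                uint16_t vendor, uint16_t product)
{
    struct table_header h;
    size_t i;

    if (len < sizeof(h))
        return -1;
    memcpy(&h, table, sizeof(h));
    table += sizeof(h);
    len -= sizeof(h);

    if (h.version == 1) {
        for (i = 0; i < len / sizeof(struct entry_v1); i++) {
            struct entry_v1 e;
            memcpy(&e, table + i * sizeof(e), sizeof(e));
            if (e.vendor == vendor && e.product == product)
                return (int)i;
        }
        return -1;
    }
    if (h.version == 2) {
        for (i = 0; i < len / sizeof(struct entry_v2); i++) {
            struct entry_v2 e;
            memcpy(&e, table + i * sizeof(e), sizeof(e));
            if (e.vendor == vendor && e.product == product)
                return (int)i;
        }
        return -1;
    }
    return -2;                  /* unknown layout: refuse to guess */
}
```

Without the version field, the reader would have to assume one layout or the other, and would silently misread tables built with the layout it did not expect.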

Linus, however, does not want to apply the patch. It is, after all, a binary interface change; such changes are generally avoided within a stable kernel series. Besides, the only other kernels which used the USB device table were the 2.4.0-test kernels - that structure was added in 2.4.0-test10. Nobody feels all that bad about breaking the prerelease kernels, in the end.

Almost nobody, that is; Mr. Owens is still not entirely happy. He has released modutils-2.4.2 which makes the 2.4.0 format work, but he has done so "under protest." People who want to be able to switch between 2.4.0 and the 2.4.0-test kernels will have to keep two versions of modutils around; everybody else can just install 2.4.2 and USB autoloading will work again.

Should the kbuild list move to SourceForge? Michael Elizabeth Chastain has posted a proposal to move the kbuild mailing list (which discusses the kernel configuration and building system) to a SourceForge project. He has a few reasons, but any kbuild reader will know the first one intuitively: spam routinely exceeds real postings on that list. With luck, moving to a site with better spam filtering would help to make the list usable again.

The one objection to the move came in the form of this posting, which raised the concern that the free software world is becoming too dependent on SourceForge.

But it just concerns me when a single company has the ability to (temporarily) freeze the development of half the world's open-source software just by unplugging a roomful of servers, either voluntarily or not (think "court order").

This is a concern that LWN has raised in the past as well. This time, however, there was a semi-official response in the form of this message from Eric Raymond, who is on the VA Linux board of directors. According to Eric:

We're not blind to this problem. We don't want to be a chokepoint; it's in VA's interest for the community to know it's protected against accident or malfeasance. This is why we're developing a network of active mirror sites -- not just to improve performance, but so one of them could take the baton if the SourceForge primary site had to shut down for some reason.

It is good to see an acknowledgement of this concern from VA. SourceForge is a great resource, but it has led to an unprecedented concentration of free software projects in a single place.


Section Editor: Jonathan Corbet


January 25, 2001

Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds