
Kernel development

The current kernel release is 2.4.4, as it has been since April 28. The 2.4.5 prepatch is up to 2.4.5pre5; it still contains mostly bug fixes, cleanups, and stuff merged in from the "ac" series. Alan Cox's series, meanwhile, is at 2.4.4ac15. It contains, as usual, a larger set of fixes, including the results of a renewed janitorial effort that has targeted a number of common driver problems.

ioctl considered harmful. Last week we mentioned how the kernel hackers were considering alternatives to the venerable ioctl() system call. That consideration has turned into an all-out assault; it may well be that ioctl() will be a deprecated, legacy interface in the 2.6 kernel. One might even think that, now that the Linux kernel has established itself as a complete, high-performance, Unix-compatible kernel, the developers are beginning to look beyond Unix in search of better solutions.

The ioctl() call, of course, is the general "device control" system call; it handles operations that do not readily map to any of the other available calls. An application would call ioctl() to rewind a tape, format a diskette, or change the data rate on a serial line.

Clearly, these are important operations to be able to perform. So what's the problem with ioctl()? In the end, it comes down to structure. Consider the standard write() call:

    ssize_t write(int fd, const void *buf, size_t count);

Everybody knows exactly what this call is supposed to do - it moves count bytes from buf to whatever is represented by the file descriptor fd. The buf argument is read-only. Any implementation of write() which violated this understanding would not pass any sort of review. As Linus put it, "Code like that would not pass through anybody's yuck-o-meter."

The ioctl() interface is somewhat different:

    int ioctl(int fd, int request, ...);

Here, the request is a magic number (usually) understood only by the driver or filesystem behind the file descriptor. The third argument ("...") is not even named by the man page, so there is little point in asking how it should behave. It might be an integer, a structure, or a pointer, and it could be either an input or an output parameter, or both. It could even be a pointer to a structure containing other pointers which will be dereferenced by kernel code.
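
To make the ambiguity concrete, here is a small sketch using two real requests, FIONREAD and FIONBIO. Both take a pointer to an int as the third argument, yet one treats it as an output and the other as an input, and nothing in the ioctl() prototype says which is which. The helper names pending_bytes() and set_nonblocking() are invented for this illustration, and the sketch assumes a Linux system.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* FIONREAD: the third argument is an OUTPUT pointer to int.
 * The kernel stores the number of bytes waiting to be read. */
int pending_bytes(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}

/* FIONBIO: the third argument is an INPUT pointer to int.
 * The kernel reads the flag; nonzero enables non-blocking I/O. */
int set_nonblocking(int fd, int on)
{
    assert(fd >= 0);
    return ioctl(fd, FIONBIO, &on);
}
```

The two calls look identical at the call site; only the documentation for each request number (when there is any) reveals the direction of the data.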

What that means is that every ioctl() implementation is different, and that there are no standards for what is implemented or how it is done. Alexander Viro described the problem well:

I think that we have several thousands of these beasts. And that's several thousands of undocumented system calls hidden in bowels of sys_ioctl(). Undocumented == for most of them we have no information of argument types, which arguments are in-, out- or in-out, which contain pointers to other userland structures, etc.

The Linux system call interface is generally held under very tight control, since, for all practical purposes, it defines the kernel to the rest of the system. In the middle of that interface, however, is ioctl(), a loose cannon which circumvents that control and makes anything possible. It is, says Linus, unfixable:

Basically, ioctl's will _never_ be done right, because of the way people think about them. They are a back door. They are by design typeless and without rules. They are, in fact, the Microsoft of UNIX.

Thus far, the discussion has been heavily one-sided - ioctl() has very few defenders. Things get more interesting, of course, when you get into what the replacement for ioctl() might be.

One alternative would be to allow device options to be specified at open time, as part of the device "name". This option was discussed last week, and we'll see it again in the discussion of Ben LaHaise's patch, shortly. It would make some things simple, but does not eliminate the need for some sort of control interface - not all operations can be performed at open time.

Another approach calls for the opening of a control channel as a separate file descriptor, then invoking operations with write() and read() calls. Such an approach is workable and network-friendly, but it lacks the atomic nature of ioctl(). Other things can happen between the time an operation request is written and the time its results are read back.
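
A rough sketch of what the control-channel idea might look like from both sides follows. Everything here is hypothetical: the command string "rewind", the replies, and the function names are assumptions made for illustration, and a socket pair stands in for whatever channel a real driver would provide.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical "driver" side: read one textual command, send a reply. */
int driver_serve_one(int drv_fd)
{
    char cmd[64];
    ssize_t n = read(drv_fd, cmd, sizeof cmd - 1);

    if (n < 0)
        return -1;
    cmd[n] = '\0';

    const char *reply = strcmp(cmd, "rewind") == 0
                            ? "ok" : "error: unknown command";
    return write(drv_fd, reply, strlen(reply)) < 0 ? -1 : 0;
}

/* Application side, step 1: submit the request. */
int ctl_request(int ctl_fd, const char *cmd)
{
    assert(cmd != NULL);
    return write(ctl_fd, cmd, strlen(cmd)) < 0 ? -1 : 0;
}

/* Application side, step 2: collect the result as a C string. */
int ctl_result(int ctl_fd, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    ssize_t n = read(ctl_fd, buf, len - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

An application would call ctl_request() and then ctl_result() on the same descriptor. The gap between those two calls is exactly where the atomicity of ioctl() is lost: another thread sharing the descriptor could slip in a request of its own.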

Yet another possibility is for device drivers to create little control filesystems. A tape driver, for example, could export a filesystem for each device that included a file called rewind; a write to that file would cause the tape to be rewound. A variant of this approach would be for drivers to export a control interface via /proc, but, to a lot of people, /proc suffers from the same problems as ioctl().
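
From user space, the tape example might look something like the sketch below. No such filesystem exists as of this writing; the directory layout, the function name tape_rewind(), and the convention of writing "1" to trigger the operation are all assumptions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Trigger a rewind by writing to the hypothetical "<ctl_dir>/rewind"
 * control file.  Returns 0 on success, -1 on any failure. */
int tape_rewind(const char *ctl_dir)
{
    char path[4096];

    assert(ctl_dir != NULL);
    int n = snprintf(path, sizeof path, "%s/rewind", ctl_dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    /* No O_CREAT: the file must already be exported by the driver. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, "1\n", 2);
    int rc = close(fd);
    return (written == 2 && rc == 0) ? 0 : -1;
}
```

Since the interface is just a file, a shell script could do the same job by redirecting echo into the control file, with no C code at all.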

Linus thinks it might be sufficient to just redefine the call interface to explicitly specify the input and output buffers, and their lengths. That would nail down much of the semantics of ioctl() at a level where the kernel could support (and check) them; ioctl() calls would become much more standardized. He has also suggested just removing the ioctl() operation entirely from the standard driver interface in an attempt to force people to come up with something new - much like what he has done with major numbers.
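
No concrete interface has been posted, but the idea of explicit buffers and lengths can be illustrated with a user-space sketch. The structures, field names, and the ctl_dispatch() function below are all invented for this example; the point is only that a generic layer could validate sizes and directions before any driver code runs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request descriptor: direction and size are explicit. */
struct ctl_args {
    const void *in;       /* read-only input buffer */
    size_t      in_len;
    void       *out;      /* output buffer, written by the handler */
    size_t      out_len;
};

/* Hypothetical per-command entry: the sizes a command expects. */
struct ctl_cmd {
    size_t in_len;
    size_t out_len;
    int  (*handler)(const void *in, void *out);
};

/* Generic layer: reject mismatched sizes and missing buffers. */
int ctl_dispatch(const struct ctl_cmd *cmd, const struct ctl_args *a)
{
    assert(cmd != NULL && a != NULL);
    if (a->in_len != cmd->in_len || a->out_len != cmd->out_len)
        return -EINVAL;
    if ((cmd->in_len && !a->in) || (cmd->out_len && !a->out))
        return -EFAULT;
    return cmd->handler(a->in, a->out);
}

/* Toy handler: reads one int, writes back twice its value. */
static int double_it(const void *in, void *out)
{
    int v;

    memcpy(&v, in, sizeof v);
    v *= 2;
    memcpy(out, &v, sizeof v);
    return 0;
}

const struct ctl_cmd CTL_DOUBLE = { sizeof(int), sizeof(int), double_it };
```

With a table of such entries, the checking code is written once rather than in every driver, and the table itself documents each command's arguments.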

The shape of the solution is far from clear at this point; it is going to be an interesting development to watch. At least, from a distance... quoting Alexander Viro again:

It should be fixed, but it won't be easy and it won't be fast. If you want to help - wonderful. But keep in mind that it will take months of wading through the ugliest code we have in the tree. If you've got a weak stomach - stay out. I've been there and it's not a nice place.

Open-time options and disk partitioning. One way of avoiding the need for ioctl() calls, at least some of the time, is to make various device operations available at open time. So, for example, if you want to work with the serial port at /dev/ttyS0, and you need to talk at 9600 baud, you could just open /dev/ttyS0/speed=9600 and everything would be taken care of. This approach eliminates an ioctl() call, and also makes life easier for people who want to deal with devices from shell scripts.

Ben LaHaise has posted a patch which implements these "device arguments." It provides a generic mechanism which allows device drivers to specify which arguments they handle, and to do the parsing.
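
The parsing involved is simple enough to sketch in user-space C. This is not code from Ben's patch; the function names and the rule used here (the final path component is treated as an argument list if it contains an '=') are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split a name like "/dev/ttyS0/speed=9600" into the device path
 * (copied into dev, truncated if it does not fit) and the argument
 * string.  Returns a pointer to the arguments, or NULL if none. */
const char *split_device_args(const char *name, char *dev, size_t devlen)
{
    assert(name != NULL && dev != NULL && devlen > 0);
    const char *slash = strrchr(name, '/');
    const char *args = NULL;
    size_t n = strlen(name);

    if (slash && strchr(slash + 1, '=')) {
        args = slash + 1;
        n = (size_t)(slash - name);
    }
    if (n >= devlen)
        n = devlen - 1;
    memcpy(dev, name, n);
    dev[n] = '\0';
    return args;
}

/* Find key in "k1=v1,k2=v2" and parse its value as a decimal number.
 * Returns 0 on success, -1 if the key is absent or malformed. */
int device_arg_long(const char *args, const char *key, long *value)
{
    assert(key != NULL && value != NULL);
    size_t klen = strlen(key);

    while (args && *args) {
        if (strncmp(args, key, klen) == 0 && args[klen] == '=') {
            char *end;
            long v = strtol(args + klen + 1, &end, 10);

            if (end == args + klen + 1 || (*end != ',' && *end != '\0'))
                return -1;
            *value = v;
            return 0;
        }
        args = strchr(args, ',');   /* advance to the next pair */
        if (args)
            args++;
    }
    return -1;
}
```

A serial driver using something like this would ask for "speed" and ignore the rest; a name with no arguments at all simply behaves as it does today.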

Some people have complained about this sort of "side effect" associated with open() calls. But, as Linus pointed out, this sort of behavior is really nothing new. The first diskette drive on a system, for example, is known as /dev/fd0, but it can also be opened as /dev/fd0D360, /dev/fd0H1440, or any of a number of other names. Each name is associated with a different device behavior - one that can also be selected with ioctl(). Tape devices, too, have long used naming to distinguish between different densities, compression settings, rewind behavior, etc. The only thing that is new here is the use of a standard mechanism for device arguments, rather than encoding them in the device minor number.

Ben's patch goes further, however, by using device arguments to push the handling of disk partitions into user space. Once you have arguments, you can set things up to open a disk drive with a name like /dev/sda/offset=1024,limit=2048, which provides access to a subset of the disk. If you can do this, it is no longer necessary to understand partitions inside the kernel; it's just a matter of setting up your open-time (or mount-time) options.

There are a number of objections to this approach; see Linus's response for some of them. As a result, this particular approach to disks is unlikely to be adopted. It is, however, an interesting example of the kind of thinking that is happening around access to devices. The 2.5 series is going to be fun to watch.

ext3 on 2.4. Stephen Tweedie's work with the ext3 journaling filesystem is interesting, but many have lamented the fact that it works with the 2.2 kernel only. Now a port to 2.4 is available, thanks to the efforts of Andrew Morton and Peter Braam. There's a glitch or two left, but the overall assessment is "quite solid." You have to get it from a CVS tree, however; there is no more convenient distribution available. See the detailed instructions on how to check out a copy.

vger.kernel.org enables ECN. Rather later than had initially been threatened, the vger.kernel.org mailing list server has turned on the explicit congestion notification (ECN) option. That means it can no longer talk to systems which are behind a firewall that does not handle ECN properly. If you're wondering what happened to your linux-kernel mail (or mail from any of a number of other vger lists), ECN could well be your problem. See Jeff Garzik's ECN page for more information on ECN problems and how to deal with them.

Nobody distributes a standard Linux kernel? Last week's kernel page made that claim, stating that no distributor includes a "standard" kernel without adding patches. Such a claim is always dangerous, especially given the diversity of the Linux world and the sheer number of available distributions. So your author fully expected to get mail on how some distribution with a name like "Mongolian StoutLinux" does not patch its kernels.

What came back, instead, was a small pile of polite mail stating that Slackware distributes pristine kernels. Your author must confess to not having run Slackware since sometime in 1994, so some research was in order. It turns out that the claim is almost true. Slackware did add a couple of fixes to its 2.2 kernels, but the 2.4.4 kernel in the slackware-current directory is exactly as Linus made it.

Other patches and updates released this week include:

Alexander Viro has posted a patch which creates struct char_device as part of a continuing reorganization of how devices are handled internally.
Marko Kreen posted a request for comments on a next-generation devfs design.
Eric Raymond has released CML2-1.4.5.
Randy Dunlap has documented the x86 kernel initialization routine.
JFS 0.3.2 was released by IBM.
Hua Zhong has posted a kernel module which performs process checkpointing and restarting.
Ricardo Galli has performed a set of benchmark runs to determine the relative speeds of the XFS, ReiserFS, and ext2 filesystems.
USBMon 0.2, a USB monitoring tool, has been released by Dave Harding.
Ben Collins has released version 0.9.9.5 of the SILO Sparc boot loader.

Section Editor: Jonathan Corbet

May 24, 2001
