See also: last week's Kernel page.

Kernel development


The current kernel release is still 2.4.7. Linus released 2.4.8pre7 on August 7, but, as of this writing, the changelog had not been updated from the pre6 version. It contains more VM fixes (see below) and a number of other updates.

Alan Cox's latest is 2.4.7ac10. It contains a vast number of fixes and updates, but the most interesting part may be the merging of the ext3 journaling filesystem, which happened in 2.4.7ac4. While ext3 will likely not find its way into Linus's kernel for some time yet, its presence in the "ac" series is a firm step in that direction.

On synchronous directory operations. If an application renames (or makes a link to) a file, how can it know when that operation has found its way to the physical device and will not disappear if the system crashes? Most applications are not too concerned about such issues, as long as their operations make it to persistent storage eventually. But there are exceptions. In particular, a number of mail transfer agents (MTAs), such as Postfix and qmail, depend heavily on link and rename operations for reliable delivery of mail. They need to know when these operations have completed.

Many Unix-like systems, it seems, implement directory operations like link() in a synchronous manner. When link() returns, the operation has completed and will not disappear. Or, at least, any event that could cause it to disappear will be sufficiently severe that reliable mail delivery will be fairly low on the list of concerns. Linux (and the ext2 filesystem, in particular), however, performs directory operations asynchronously; they are buffered like most other filesystem operations. The result is better performance, at the cost of an increase in hair-pulling and grumbling from MTA authors.

Said authors, who tend not to be quiet or reserved people, have been fairly clear on how they feel about Linux's directory operation semantics. There have been claims that the Single Unix Standard requires synchronous directory operations, but that appears to be an issue upon which reasonable people can differ.

The answer from the Linux developers has been that, if an application needs a directory operation to be synchronous, it needs to ask for those semantics explicitly. That can be done in several ways. One could simply mount the filesystem with the sync option, but that is so painfully slow that nobody is much interested in it. Another is to request synchronous operations on the directory in question with the ext2 chattr +S option. It works, but MTA authors seem not to like it, perhaps because it makes all operations synchronous, even those which do not need to be. Finally, an application can open the directory in question and use fsync() to explicitly synchronize any outstanding operations there.

The fsync() option seems like the best, since it lets the application say when the synchronization must happen. But MTA authors grumble again and some, at least, refuse to do it. The complaint is that it's a special, nonportable coding requirement imposed only by Linux.
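
For reference, the amount of code involved is small. Here is a minimal sketch of the technique, with invented spool paths standing in for whatever a real MTA would use:

  /*
   * Sketch: make a link() "stick" by fsync()ing the directory that
   * received the new entry.  The paths are made up for the example.
   */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      int dirfd;

      /* "Deliver" a message by linking it into the queue directory. */
      if (link("spool/tmp/msg", "spool/new/msg") == -1) {
          perror("link");
          return 1;
      }

      /* Open the directory itself; read-only access is sufficient. */
      dirfd = open("spool/new", O_RDONLY);
      if (dirfd == -1) {
          perror("open directory");
          return 1;
      }

      /* Push the new directory entry out to disk before claiming
         that the message has been safely delivered. */
      if (fsync(dirfd) == -1) {
          perror("fsync directory");
          return 1;
      }
      close(dirfd);
      return 0;
  }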

What people would like to see, it seems, is one or both of the following:

  • An fsync() operation on a file also synchronizes directory entries belonging to that file. These semantics are difficult to implement in the general sense - file names are distinct from the files themselves, and a file can have more than one of them. Linus has pointed out a possible solution, however, that could work in this particular situation.

  • A new mount option, called something like dirsync, that would cause directory operations to be synchronous. Nobody has posted a patch to do this yet, but one may well be forthcoming.

This whole issue is a classic confrontation between groups of developers with strong ideas of how things should be done. In the end, however, Linux hackers want their platform to work well for mail delivery, while MTA authors would be happy if their applications worked properly on Linux. Some sort of solution should be achievable here.

Who maintains the Linux sound drivers? While people toss in a patch occasionally, it turns out that nobody is currently acting as the maintainer of the Linux sound drivers, and of the Open Sound System (OSS) drivers in particular. Those drivers have seen little change for quite a while, and all of the serious sound hackers have been off bashing on ALSA instead.

ALSA is expected to replace OSS in the 2.5 development kernel. As a result, one can detect a certain "why bother?" attitude in the air when OSS maintenance is discussed. The fact remains, however, that OSS will remain the standard sound driver in the 2.4 kernel; swapping in ALSA would be too big a change for a stable kernel series. Even the 2.4 series. So somebody really should be keeping an eye on it for a little while yet...

Chasing the virtual memory problems. Virtual memory performance in 2.4.x is still widely considered to be poor; it is, perhaps, the single largest outstanding problem with the 2.4 series. The effort to improve VM performance got some new energy this week when Ben LaHaise took a look at the problem. While Ben didn't actually nail down any VM bugs himself, his work was crucial in directing the attention of some of the other VM hackers - and Linus - to the right place. While Linux may not be out of the VM woods yet, some real problems have been found and fixed in the recent prepatches.

Ben's investigation showed that there were problems in how the kernel throttles write requests. There was some code in place which attempted to keep disk writes from overwhelming the system, but it did not work quite as intended. Instead, it had the effect of allowing the write queue(s) to grow to great lengths while, simultaneously, allowing an aggressive writer to keep other processes from submitting I/O requests for long periods of time. The long queues take up a lot of memory, of course. They could also grow so long that even a very fast drive would need, in some cases, several seconds to work through all of the queued operations. An interactive process could thus find itself unable to queue a request for some time, then waiting, again, for an operation that ended up at the wrong end of a very long queue.

The solution involves a couple of separate tweaks:

  • The old throttling code is simply removed, since it created fairness problems without actually solving the problem it was meant to address.

  • The maximum length of an I/O request queue is drastically reduced. This reduces the maximum latency that any individual request should experience, while, perhaps, reducing the effectiveness of the elevator algorithm slightly. This change also moves write throttling to the request allocation stage, which, it is hoped, should handle throttling in a fairer and more resource-efficient manner.

There are also, as it turns out, some problems with how the 2.4 kernel accounts for memory. Marcelo Tosatti has put in some patches to fix how the amount of free memory in each zone is calculated. And Linus found a bug in how the kernel decided how much memory it could use for I/O buffers. These problems, too, could allow the system to be overwhelmed by write operations that really should have been throttled.

Many of these fixes have gone into 2.4.8pre4 and subsequent releases; Alan Cox seems to be holding off on putting them into his series at this point. There are some good initial reports, but more testing (and more work) will certainly be required.

Rik van Riel, meanwhile, has posted a patch which should make 2.4.8 much friendlier to systems without large amounts of swap space. The current kernel, remember, keeps a page in swap even after it has been paged back into main memory. There are certain performance benefits to doing so, but systems with small swap areas can run out of swap space easily. And a system that has run out of swap is not a friendly place to work. Hopefully that problem is now a thing of the past.

Buried in VMAs. The Linux kernel makes use of "virtual memory areas" (VMAs) to keep track of the larger chunks of memory in use by any process. One VMA is associated with one range of memory, all using the same source or backing store and the same access permissions. Thus, for example, loading a shared library will generally create at least two VMAs: one for the library code, and one for its associated data area.

For a relatively simple example of how VMAs are set up, type:

  cat /proc/self/maps
to see the VMAs used by the cat command itself.

There are reasons for wanting to keep the number of VMAs under control. Each VMA requires a data structure in the kernel, so large numbers of VMAs will take up a significant amount of kernel memory. It is also often necessary to be able to find a specific VMA in a hurry. For example, when a page fault occurs, the kernel must locate the VMA describing the faulting address so that the fault can be resolved. The VMA lookup routine is reasonably efficient, but performance will still suffer if VMAs grow without bound. Normally there is no problem here; the emacs process being used to type this text - which is not a small process - has 53 virtual memory areas in use, which is a reasonable number. Netscape uses 64 VMAs.

Recently, however, Chris Wedgwood noticed that Mozilla was running rather sluggishly. Yes, lots of Mozilla users notice that, but this was a more severe case than usual. A quick look, via the handy /proc interface, showed that the process had over 5,000 VMAs currently mapped. That is more than enough to affect the performance of the Mozilla process, and of the system as a whole. Other GNOME applications, such as Evolution, show similar patterns.

Your editor runs Galeon, which, as everybody knows, is a much lighter program. And, in fact, it is, as of this writing, running within a svelte 1474 VMAs. Better, but still far too many. But the real problem, as has been discussed on the kernel list, can be seen if you look at the actual VMA mappings. Here is an excerpt:

  40c52000-40c5a000 rw-p 000bd000 00:00 0
  40c5a000-40c61000 rw-p 000c5000 00:00 0
  40c61000-40c69000 rw-p 000cc000 00:00 0
  40c69000-40c71000 rw-p 000d4000 00:00 0
  40c71000-40c74000 rw-p 000dc000 00:00 0
The pair of hexadecimal addresses on the left is the virtual address range covered by each VMA. A quick look shows that most of Galeon's VMAs are simple anonymous memory pages, and that they are contiguous. In other words, they could be represented by a single VMA rather than hundreds or thousands.

The Linux kernel makes an attempt to merge contiguous VMAs when it is relatively easy to do. But the more comprehensive merging code found in 2.2 has been abandoned, with the reasoning that (1) it is only useful in very rare cases, and (2) it is extremely difficult to get right. There is very little enthusiasm for thrashing up the VMA merging code again without compelling evidence that it is really necessary. Which means there is a need to understand just what is going on to cause this kind of behavior.

To this end, Mr. Wedgwood performed a detailed analysis of the system call pattern that brings about the explosion of VMAs. The problem, it seems, is with the malloc() implementation in the C library, which plays some tricky and complicated games with memory allocation. In particular, it does a lot of memory mapping, followed by partial unmapping for alignment purposes, and, crucially, changes to memory protection as segments of memory are parceled out.

The C library plays with protections, presumably, in an attempt to catch overruns of allocated memory. But, if you change the protection on a subsection of a VMA, that VMA must be split into two or more independently protected VMAs. When the kernel does this split, it could attempt to merge the newly protected VMA with those next to it, but currently does not. The result, for certain memory allocation patterns, is lots of VMAs.
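
The effect is easy enough to demonstrate. Here is a small sketch (not taken from any of the postings in this discussion) which maps a single anonymous region, then changes the protection on one page in the middle of it:

  /*
   * Sketch: demonstrate VMA splitting by changing the protection of
   * one page in the middle of an anonymous mapping, then counting
   * the lines in /proc/self/maps before and after.
   */
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/mman.h>

  static int count_vmas(void)
  {
      FILE *maps = fopen("/proc/self/maps", "r");
      int c, count = 0;

      if (!maps)
          return -1;
      while ((c = getc(maps)) != EOF)
          if (c == '\n')
              count++;
      fclose(maps);
      return count;
  }

  int main(void)
  {
      long page = sysconf(_SC_PAGESIZE);
      char *region;

      region = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (region == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      printf("VMAs before mprotect(): %d\n", count_vmas());

      /* Change one page in the middle; the kernel must split the VMA. */
      if (mprotect(region + 4 * page, page, PROT_NONE) == -1) {
          perror("mprotect");
          return 1;
      }
      printf("VMAs after mprotect():  %d\n", count_vmas());
      return 0;
  }

Running it should show the VMA count growing by two: the single mapped region has become three separately protected VMAs.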

It's possible that a patch will emerge which makes mprotect() perform VMA merging. But there appears to also be a certain inclination among the kernel hackers to blame the problem on the C library and forget about it. Relations across the kernel-glibc divide are not always the best, and it is precisely this sort of issue that can create disagreements. But, until one side or the other makes a change, some applications are going to run sluggishly under 2.4.

Other patches and updates released this week include:

  • Alexander Viro decided he was tired of waiting and submitted a patch fixing a race condition in devfs. Richard Gooch didn't like the fix. What followed started at the name-calling level, but then evolved into a productive technical discussion. One result is new devfs and devfsd releases from Richard; expect more in the near future.

  • The first release of the 2.5 kernel build system has been announced by Keith Owens. See the announcement for a detailed description of this release.

  • Also from Keith: a proposal to change the way /proc/ksyms works on the IA64 architecture (and, presumably, others that use function descriptors).

  • Richard Gooch has a new version of his patch which allows the 2.4 kernel (with devfs) to support up to 2144 SCSI devices.

  • Matthew Macleod has posted a version of the international crypto patch for 2.4.7. Jari Ruusu, meanwhile, has released loop-AES-v1.3d, which is just the file encryption part of the international crypto patch.

  • A new Compaq Hotplug PCI driver was released by Greg Kroah-Hartman.

  • IBM has released version 1.0.2 of its journaling filesystem.

  • Etienne Lorrain has announced version 0.4 of his "Gujin" bootloader.

  • Alexander Viro has implemented a general parser for mount options which, he hopes, will help to generalize and clean up the option handling in the various filesystems supported by Linux.

  • Mike Kravetz and associates have posted a scalable scheduler patch which addresses some of the scheduling problems seen on larger systems (see our OLS coverage for details). Linus didn't like the patch, but his objections had more to do with coding style than the actual changes made. A new version should be forthcoming soon.

  • A new security module patch has been posted by Greg Kroah-Hartman.

  • Andreas Gruenbacher has released version 0.7.15 of the access control list (ACL) patch.

  • HP has released version 0.8 of the HP OfficeJet driver.

Section Editor: Jonathan Corbet


August 9, 2001
