
See also: last week's Kernel page.

Kernel development


The current kernel release is 2.4.2. The official 2.4.2 patch is large due to the incorporation of the CRIS architecture and a big S/390 update, but it really contains only a few patches that matter to most users. It does have a fix for the ReiserFS "zero bytes" problem, and a couple of other worthwhile fixes, so 2.4 users will want to upgrade.

Alan Cox has been somewhat more active; his patch is up to 2.4.1ac20. On the 2.2 front, 2.2.19pre14 is out.

Trouble with exception tables. Those who were not looking closely may have missed the following note at the top of Alan Cox's announcement of 2.4.1ac15:

Question of the day for the VM folks: If CPU1 is loading the exception tables for a module and CPU2 faults.. what happens 8)

The answer, of course, was "the system dies an ugly death." The problem is interesting to look at as an example of how an obscure part of the kernel works, and how hard it can be to get fine-grained SMP working correctly. Prepare your pocket protectors, here we go...

In general, Linux system calls require that data be exchanged between the user process making the call and the kernel. The kernel does not (usually) access user memory directly; instead, data of interest is copied between a kernel space buffer and the user's memory. This copying is done with functions like copy_from_user() and copy_to_user().

Back in the 2.0 days, those functions (well, their equivalents, which had different names) went to a great deal of trouble to be sure that the user process could actually access the given memory array (thus avoiding security problems) and to ensure that the array was resident in memory. More modern kernels, however, let the processor's memory management unit handle most of those tasks. After some very simple checks (which keep the process from overwriting arbitrary kernel-space memory), the copy functions simply proceed with the operation.
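To make the "very simple checks" concrete, here is a minimal sketch in Python of a range test of that general kind. It is a toy model, not kernel code: the name range_is_user() and the 3 GB USER_LIMIT boundary (the usual user/kernel split on 32-bit x86) are assumptions made for illustration.

```python
# Toy model of the cheap sanity check done before a user-space copy.
# USER_LIMIT is an assumed boundary (3 GB, the common 32-bit x86 split);
# it is not taken from the article.
USER_LIMIT = 0xC0000000


def range_is_user(addr, size):
    """Return True if [addr, addr + size) lies entirely below USER_LIMIT."""
    if addr < 0 or size < 0:
        return False
    # Python integers do not wrap around, so the overflow case that a real
    # kernel must guard against cannot arise in this model.
    return addr + size <= USER_LIMIT
```

A check like this only keeps the copy away from kernel addresses. It says nothing about whether the user range is actually mapped, which is the part left to the memory management unit.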

So what happens if the user process has passed in a bogus pointer? The memory management unit will deliver an exception to the processor, indicating a "segmentation fault" sort of error. Normally, such errors are fatal when they occur in kernel code. But before the kernel goes off and generates one of those cheery "oops" messages, it performs a check to see if the kernel had been attempting a user space copy.

That check is done by way of an exception table. Through some deep inline assembly magic, every function which copies data to or from user space creates an exception table entry giving the address of the instruction actually doing the copy. When a fault happens in kernel mode, the kernel fault handler scans through the exception tables trying to match the address of the faulting instruction with a table entry. If a match is found, a special error exit is taken, the copy operation fails gracefully, and the system call returns an error (normally EFAULT, "bad address") to the calling process.
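The lookup described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the kernel's implementation: the class and function names are invented, the addresses are made up, and the use of a sorted table with binary search is a choice made for this example.

```python
from bisect import bisect_left


class ExceptionTable:
    """Sorted (faulting_insn_addr, fixup_addr) pairs for one image.

    Hypothetical model: one instance for the core kernel, one per module.
    """

    def __init__(self, entries):
        self.entries = sorted(entries)
        self._addrs = [insn for insn, _ in self.entries]

    def search(self, fault_addr):
        """Return the fixup address for fault_addr, or None if absent."""
        i = bisect_left(self._addrs, fault_addr)
        if i < len(self._addrs) and self._addrs[i] == fault_addr:
            return self.entries[i][1]
        return None


def handle_kernel_fault(tables, fault_addr):
    """Scan every table; take the error exit on a match, else oops."""
    for table in tables:
        fixup = table.search(fault_addr)
        if fixup is not None:
            return ("fixup", fixup)
    return ("oops", fault_addr)
```

A fault at an address listed in any table resolves to its fixup. Any other kernel-mode fault falls through to the oops path.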

So where is the problem? Every loadable module comes with its own exception table; when the kernel is handling a kernel fault, it must check the exception table for each loaded module. This check is performed by scanning a list of module structures to find all the tables. The problem is: the module structure is set up and added to the list before the exception table is copied in from the module file. Should the kernel try to handle a fault at that particular moment, it will be looking at an exception table which holds garbage.

A similar problem comes up when the module is removed. The module structure is removed from the list before the memory for the exception table is freed, which is the proper order of operations. But another CPU may have already found that exception table before the module structure was removed, and may be scanning it just when it is freed (and, perhaps, reused for some other purpose). Again, there is the potential for serious disorder.

Alan has fixed these problems in later "ac" kernel releases by putting in a lock to control access to the exception tables. When something is being done with an exception table, any other users must wait until the job is done. This fix will probably find its way into an official kernel release before too long.
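The idea of serializing access can be sketched with an ordinary mutex. What follows is a Python toy model and makes no claim to match Alan's patch: ModuleList and its methods are hypothetical names, and building the table completely before publishing it is an extra precaution added in this sketch.

```python
import threading


class ModuleList:
    """Toy registry of per-module exception tables guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # module name -> {insn_addr: fixup_addr}

    def load(self, name, entries):
        # Build the table completely before making it visible, so a
        # concurrent search can never see a half-filled table.
        table = dict(entries)
        with self._lock:
            self._tables[name] = table

    def unload(self, name):
        # Removal waits for any search that currently holds the lock.
        with self._lock:
            self._tables.pop(name, None)

    def search(self, fault_addr):
        # The whole scan runs under the lock, so no table can be removed
        # while it is being examined.
        with self._lock:
            for table in self._tables.values():
                if fault_addr in table:
                    return table[fault_addr]
        return None
```

The cost of this design is that every fault lookup and every module load or unload contends for the same lock. That is a reasonable trade for paths that run rarely.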

The fight to eliminate all race conditions of this kind from the kernel will probably never end, however. Multiprocessor systems are inherently tricky to work with. And module deletion, in particular, seems to be problematic. It may well be that, in the 2.4 series, removing a module from the system will never be entirely safe - even though almost everybody will get away with it almost all of the time. Various schemes exist for fixing the module deletion problems (see, for example, this note from Keith Owens), but they are all big enough that they are unlikely to make it into the 2.4 series.

ReiserFS and NFS. The various problems that have bitten (a small number of) ReiserFS users in 2.4 are being cleared up. But one larger problem remains: ReiserFS, as shipped in 2.4, does not support NFS. That limitation gets in the way of quite a few people who would like to use ReiserFS, but who also need to be able to export their filesystems.

For the short term, those who are not afraid of kernel patches can have a look at this message from Neil Brown describing where to get the patches he has made available. They are still under development, but they provide "reasonable NFS service" in their current state.

The picture for the longer term is a bit less clear. Neil has a plan for proper support of NFS with ReiserFS, and for improving NFS service in general. It is, however, a large change, requiring tweaks to every filesystem which needs to support NFS. Filesystem changes tend to make kernel hackers nervous, especially in the middle of a stable kernel series. And, in fact, Alan Cox responded that he was not interested in such an extensive patch.

Those who are curious about the troubles with NFS should look at Neil's justification for the changes. It is a lengthy, detailed, and well-argued discussion of how the current NFS implementation fails to mesh well with the various Linux filesystems, and exactly what needs to be fixed to make things work better. It was persuasive enough that Alan agreed that the approach made sense - for the 2.5 kernel series.

Thus, the 2.4 kernel may never support exporting of ReiserFS filesystems over NFS. Those who need this capability will have to apply the patch themselves. That is, if the distributors do not apply the patch themselves before shipping the 2.4 kernel. SuSE, at least, applied such a patch when it shipped ReiserFS with 2.2, so it would not be surprising to see that happen again.

Indexed directories in ext2. Daniel Phillips, creator of the TUX2 filesystem, recently encountered a problem:

Earlier this month a runaway installation script decided to mail all its problems to root. After a couple of hours the script aborted, having created 65535 entries in Postfix's maildrop directory. Removing those files took an awfully long time.

The difficulty here, of course, is that the ext2 filesystem keeps directories as a simple, linear list. When a directory is small everything works fine, but each lookup must scan the list, so the time for a single search grows linearly with the directory size, and the time to create or remove every file in a directory grows as the square of its size. Thus, the system slows to a crawl when it must work with large directories.
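A small Python model shows where the quadratic growth comes from. It is an illustration only, with invented names: it counts the name comparisons needed to create n distinct files when every create must first scan the whole list to confirm the name is unused.

```python
def linear_create_cost(n):
    """Count name comparisons to create n distinct files in a linear list."""
    comparisons = 0
    directory = []
    for i in range(n):
        name = f"file{i}"
        for existing in directory:
            comparisons += 1
            if existing == name:
                break  # name already present; never happens here
        else:
            directory.append(name)  # no match found, so add the entry
    return comparisons
```

Creating n files this way costs n(n-1)/2 comparisons, so doubling the number of files roughly quadruples the work.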

Mr. Phillips is not the type to just complain about this sort of performance problem. Instead, he posted a lengthy request for comments which not only described a fix for the problem, but which also included a patch which implements that fix.

Essentially, he has implemented a "uniform depth hash tree" for ext2 directories. It resembles, superficially, the balanced trees used in ReiserFS, but it is claimed to be much simpler to implement. Certainly the results are impressive. For small directories, the patched ext2 behaves much like the original; as the directory size gets larger, however, the performance difference becomes huge. A performance graph was posted; the patched system's performance is represented by the red bars that you can't really even see.
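To see why hashing helps so much, consider a deliberately simplified Python model. It is not Mr. Phillips's hash tree: it uses a flat, fixed-size set of buckets, and the function name, the bucket count, and the choice of SHA-256 as the hash are all assumptions made for this sketch. Each name is hashed to pick a bucket, and only that bucket is scanned.

```python
import hashlib


def hashed_create_cost(n, buckets=1024):
    """Count name comparisons to create n distinct files with hashing."""
    table = [[] for _ in range(buckets)]
    comparisons = 0
    for i in range(n):
        name = f"file{i}"
        digest = hashlib.sha256(name.encode()).digest()
        bucket = table[int.from_bytes(digest[:4], "big") % buckets]
        for existing in bucket:
            comparisons += 1
            if existing == name:
                break  # name already present; never happens here
        else:
            bucket.append(name)  # no match found, so add the entry
    return comparisons
```

With a single bucket the model degenerates to the linear scan and costs n(n-1)/2 comparisons. With many buckets each scan covers only a small fraction of the entries, so the total drops by roughly the number of buckets.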

For those who want hard data, there's a set of results in Mr. Phillips' posting. At the far end of the test, he timed a process that created 90,000 files in a directory. With the patched filesystem, this process took just over 13 seconds; with standard ext2, instead, it required over 33 minutes.

For those who would like to try the patch, bear in mind that it's in an internal form. "There has not been a lot of stability testing yet and indeed there are a number of unhandled error conditions in the code, and possibly some buffer leaks as well." Linus had a list of things to try as well. So a stable version of this patch is probably somewhat distant, but there seems to be a lot of interest in the final result.

The Adaptive Domain Environment for Operating Systems (Adeos) was unveiled by Karim Yaghmour this week. The purpose of Adeos is to allow multiple operating systems to share a single system; thus, it is designed as a small layer which controls and arbitrates access to the hardware. With sufficient cleverness, various operating systems can be run in such a way that they think they have the system to themselves, while, in reality, they are coexisting with other systems.

One obvious application of Adeos would be the creation of "simultaneous dual-boot" systems. Another, however, would be to support running Linux alongside a real-time kernel in a way that does not violate the RTLinux patent (see last week's LWN). Adeos could also be helpful for operating system development and debugging.

Of course, at the moment it's mostly just a design (which may be found on the Adeos web page). A small amount of code is available via the Adeos page on SourceForge, but it "will certainly crash your machine." But all projects have to start somewhere.

How short is too short? Some 2.4 users have found, to their surprise, that the 2.4 kernel ignores them when they try to set a small MTU (maximum transmission unit, the largest packet an interface will send) on a network interface. Setting a small MTU is a fairly common practice, especially among users of noisy point-to-point links. The theory, of course, is that using short packets limits the amount of data that is lost (and must be retransmitted) when noise corrupts a packet.
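The theory can be put into rough numbers with a back-of-the-envelope model in Python. Everything here is an assumption made for illustration: bit errors are independent, any corrupted packet is resent in full until it arrives intact, every packet is full-sized, and each carries 40 bytes of TCP/IP header.

```python
import math


def packet_loss_probability(packet_bytes, bit_error_rate):
    """Chance that at least one bit of the packet is corrupted."""
    return 1.0 - (1.0 - bit_error_rate) ** (8 * packet_bytes)


def expected_wire_bytes(payload_bytes, mtu, bit_error_rate, header_bytes=40):
    """Expected bytes sent to deliver payload_bytes intact.

    Simplification: every packet is treated as full-sized (mtu bytes).
    """
    if mtu <= header_bytes:
        raise ValueError("mtu must be larger than the header size")
    packets = math.ceil(payload_bytes / (mtu - header_bytes))
    p_loss = packet_loss_probability(mtu, bit_error_rate)
    # Each packet needs 1 / (1 - p_loss) transmissions on average.
    return packets * mtu / (1.0 - p_loss)
```

Under these assumptions a clean link favors large packets, since less of the traffic is header. On a sufficiently noisy link the retransmission cost dominates and a small MTU moves fewer bytes overall.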

The 2.4 kernel puts, by default, a floor of 536 octets on the MTU. Attempts to go below that are ignored. The reasoning, according to networking guru Alexey Kuznetsov, is that the IP protocol suite just does not work right below that size. Others have disagreed, however, pointing out that they have perfectly good connectivity using a smaller MTU.

The 536-octet limit is likely to go in the near future. Meanwhile, anybody who needs to go lower than that need only change the limit via the /proc interface:

    echo 256 > /proc/sys/net/ipv4/route/min_adv_mss

No source hacking required.

Other patches and updates released this week include:

  • ALSA 0.9beta1 is out. This is the first beta release of the ALSA (Advanced Linux Sound Architecture) code.

  • Mike Coleman has released SUBTERFUGUE 0.2, his "framework for observing and playing with the reality of software."

  • David Miller has released Zerocopy Beta 1. This is the first beta release of this code; "I currently feel that all performance and other issues have been addressed and that the patch is up for serious consideration for inclusion into a future 2.4.x release."

  • Paul Gortmaker noticed that the BUG macro, used for kernel debugging, creates several tens of kilobytes of string data in the kernel image. So he has produced a more efficient version which shrinks things down considerably.

  • A new version of the automatic kernel configurer was released by Giacomo Catenazzi.

  • Philip Blundell has released net-tools 1.59; this release fixes an unpleasant bug in 1.58, and an upgrade is recommended.

  • A FreeS/WAN redesign plan, in its third revision, has been posted by Richard Guy Briggs.

Section Editor: Jonathan Corbet


February 22, 2001

Copyright © 2001 Eklektix, Inc., all rights reserved
Linux ® is a registered trademark of Linus Torvalds