
See also: last week's Kernel page.

Kernel development


The current kernel release is still 2.4.6. Linus's 2.4.7 prepatch is up to 2.4.7pre7, with no word on when a real 2.4.7 release will happen (to say nothing of the much-awaited 2.5.0). Alan Cox, meanwhile, is at 2.4.6ac5.

Keeping your processes from wandering. In an ideal world, all processors on an SMP system would be identical, and it would not matter where any particular process runs. Life is different, of course, in the real world. Here, not all processors are the same from a process's point of view.

The bottleneck between the processor and memory forces the use of multiple layers of cache memory within each processor itself. By keeping frequently-accessed memory close to the processor, the cache has a major accelerating effect on performance. Often the best performance optimizations don't come from squeezing out instructions or unrolling loops; instead, they come from changing data access patterns to work better with the processor cache.
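As a simple, purely illustrative (user-space) example of what "working with the cache" means, consider summing a large two-dimensional array in C. Walking the array in the order it is laid out in memory uses each fetched cache line fully; walking it column-first touches a new cache line on nearly every access and tends to perform noticeably worse, even though the arithmetic is identical:

    #define N 1024

    /* Cache-friendly: visits memory sequentially, so each cache line
     * that is fetched gets fully used before it is evicted. */
    long sum_by_rows(int a[N][N])
    {
            long sum = 0;
            int i, j;

            for (i = 0; i < N; i++)
                    for (j = 0; j < N; j++)
                            sum += a[i][j];
            return sum;
    }

    /* Cache-hostile: the same work, but each access strides N ints
     * ahead, so lines often fall out of the cache before the next
     * element in them is needed. */
    long sum_by_columns(int a[N][N])
    {
            long sum = 0;
            int i, j;

            for (j = 0; j < N; j++)
                    for (i = 0; i < N; i++)
                            sum += a[i][j];
            return sum;
    }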

Foremost among those optimizations, certainly, is to avoid trashing the cache completely. But that is what happens when a process moves from one CPU to another. The cache which has been built up in the old CPU does not follow the process to its new home. As a result, the process runs slowly for some time as it fills the cache on the new processor, perhaps forcing out another process's data while it's at it.

For this reason, the Linux scheduler tries hard to avoid moving processes between CPUs. Normally it works reasonably well; if two jobs are running on a two-processor system, one would expect each job to stick to one processor. So a group of kernel hackers were surprised when they found a case where processes would continually cycle through all of the processors on a system. Another user reported similar behavior; he found that running a single, compute-intensive process on a two-processor system would actually go faster if he fired up "setiathome" to keep one of the processors occupied.

What appears to be happening is this: one CPU is happily running a process (we'll call it "p1") when it does something that makes another process ("p2") runnable. The scheduler decides that p2 should execute on a different CPU, so it sends an "inter-processor interrupt" to force the other CPU to go into the scheduler and pick up the new task. All appears to have been properly arranged, and the scheduler on the original CPU returns to the original process (p1) that was running there.

That process, however, quickly hits a stopping point, forcing a new scheduling decision. Because inter-processor interrupts take a while, p2 still has not started running on its intended CPU. Instead, the first CPU sees p2 ready to go, and starts running it. When p1 again becomes runnable, it will find that p2 has taken its place; it's p1 that gets booted out of its processor and has to move to a new home with a cold, unwelcoming cache.

With the right kind of load, that sequence of events can happen over and over, causing processes to move frequently through the system. The result is poor performance, bad benchmark results, and an increase in "Linux sucks" posts on the net.

The fix, as posted by Hubertus Franke, is to mark a process once the decision has been made that it will run on a different CPU. Other processors will not attempt to run a process marked in this way, while the target processor will make a point of running it. The fix removes the race condition between the two processors, and restores a bit of stability in this particular case. Of course, being a scheduler change, it may well make things worse for some other type of load, but nobody has identified that load yet...
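The shape of that fix, very roughly, looks something like the following. This is a sketch of the idea only, not the actual posted patch; the reserved_cpu field and the helper functions are made up for illustration, though smp_send_reschedule() is the real 2.4 routine for poking another processor:

    /* Hypothetical sketch, not the posted patch: once a target CPU has
     * been chosen for a newly-runnable task, mark the task so that no
     * other CPU will pick it up in the meantime. */

    /* Imagined extra field in the task structure:
     *   int reserved_cpu;     -1 when the task is not reserved */

    static void wake_and_reserve(struct task_struct *p, int target_cpu)
    {
            p->reserved_cpu = target_cpu;      /* claim the task before the IPI races */
            smp_send_reschedule(target_cpu);   /* tell the target CPU to reschedule */
    }

    /* In the scheduler's runqueue scan: */
    static int can_run_here(struct task_struct *p, int this_cpu)
    {
            /* Skip tasks reserved for another CPU; the target clears
             * the reservation when it actually starts running them. */
            return (p->reserved_cpu == -1) || (p->reserved_cpu == this_cpu);
    }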

Journaling filesystems are slower? While nobody disputes the benefits provided by journaling filesystems, the generally-accepted wisdom seems to be that journaling necessarily slows things down. After all, a journaling filesystem adds the overhead of maintaining the journal and very carefully serializing operations to preserve the integrity of the filesystem at all times. That extra work costs.

It turns out, however, that there is an important class of applications for which a journaling filesystem can be faster. Certain applications need to know when data written to the disk has actually been committed to the platter; usually they are working with explicit data ordering constraints of their own. Such applications will use one of the filesystem's synchronous write operations to enforce these constraints. Database systems can operate in this mode, and the NFS protocol requires that a (strictly conforming) NFS server perform synchronous writes as well.
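For example (a user-space illustration, not taken from the article), an application can get this behavior by opening a file with O_SYNC, or by calling fsync() after the write; either way, the call does not return until the data is on stable storage:

    #include <fcntl.h>
    #include <unistd.h>

    /* Append a record and do not return until it has reached the disk.
     * Illustration only; a real database would keep the file open,
     * batch its writes, and check errors far more carefully. */
    int append_record_durably(const char *path, const void *buf, size_t len)
    {
            int fd = open(path, O_WRONLY | O_APPEND | O_CREAT | O_SYNC, 0644);
            ssize_t n;

            if (fd < 0)
                    return -1;
            n = write(fd, buf, len);   /* blocks until the data is committed */
            close(fd);
            return (n == (ssize_t) len) ? 0 : -1;
    }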

A synchronous write operation can cause several disk head seeks, as the data and associated metadata are updated. And that, of course, can take a while. When journaling is in use, however, the story is different. Once all of the relevant data is in the journal, the filesystem can report a synchronous write as being complete; the full writeback can then happen at leisure, since the data is safe in the journal.

And the journal, of course, is laid out on a contiguous piece of the disk. Journaling thus removes the head seeks from synchronous writes and eliminates much of the latency from those operations. In some preliminary tests using ext3 and knfsd, performance was reported to be 1.5 times better. Journaling is not only safer; it may even be faster.

Cleaning out the right zones. Marcelo Tosatti has been working on a patch which provides detailed information on how the memory management system is working in the 2.4 kernel. After all, the various efforts to improve memory management can only be helped by having a view of what is actually going on. One of the first results that Marcelo has found is that the code that tries to free up pages in response to memory shortages is often not looking in the right place.

Linux divides physical memory into multiple "zones," each of which has different physical characteristics; for example, the DMA zone contains memory that may be used for DMA operations to ISA devices. (See the June 7 kernel page for a more detailed discussion of zones.) Memory allocation can be requested from one or more zones in particular. Often, only a specific zone will do for a particular request.
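In the 2.4 kernel, zone-specific allocation shows up directly in the allocation flags; a driver that needs ISA-DMA-capable memory, for instance, says so explicitly. A simplified sketch (GFP_DMA and GFP_KERNEL are the real flags):

    /* Ask for a buffer from the DMA zone -- memory below 16MB that
     * old ISA devices are able to address.  GFP_KERNEL allows the
     * allocator to sleep while it looks for memory. */
    void *dma_buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);

    /* An ordinary request, by contrast, may be satisfied from any
     * zone that happens to have free pages: */
    void *buf = kmalloc(4096, GFP_KERNEL);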

The problem is that, while the kernel allocates memory from specific zones, it does not take zones into account when freeing memory. Instead, it blindly passes through memory, freeing anything that looks like a reasonable candidate. As a result, the kernel could be freeing memory (i.e. taking it from processes that could use it) that belongs to a zone which already has plenty of free memory and does not need any more. Meanwhile, another zone could be under tremendous pressure that is not helped in any way by freeing pages in the first zone.

This sort of behavior has been suspected in the past, but Marcelo's instrumentation has shown that it really happens. So what is to be done but make a new patch which causes the kernel to go after pages belonging to the specific zones that are feeling pressure? Evidently some sorts of deadlock problems have already been solved by this patch. It will see some reworking (Linus had some quibbles with the implementation), but this one looks destined for a 2.4 kernel sometime soon. (See also: Dave McCracken's patch for a silly swapping bug that would prevent the use of high memory for swap reads; this one, too, could be responsible for a lot of problems.)
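Very roughly, the shape of such a change looks like the sketch below. This is an illustration of the idea only, not Marcelo's actual patch; the helper names are invented, though zone_t is the real 2.4 zone type:

    /* Hypothetical sketch of zone-aware reclaim, not the posted patch:
     * the page-freeing loop is told which zone is under pressure and
     * ignores pages that would not relieve it. */
    static int free_pages_from_zone(zone_t *needy_zone, int nr_pages)
    {
            struct page *page;
            int freed = 0;

            while (freed < nr_pages &&
                   (page = next_reclaim_candidate()) != NULL) {
                    if (zone_of(page) != needy_zone)
                            continue;   /* wrong zone: freeing this helps nobody */
                    if (evict_page(page))
                            freed++;
            }
            return freed;
    }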

Other patches and updates released this week include:

  • The Stanford Checker is back. The latest results include code which uses memory that has been freed (10 instances), and unsafe use of user-supplied values (52 instances), which can lead to nasty security bugs.

  • IBM has released version 2.2.0 of the Dynamic Probes kernel debugging tool.

  • Keith Owens has released a new version of the 2.5 kernel build system which has the "implicit dependency" problem solved.

  • Justin Gibbs has announced a beta release of version 6.2.0 of the aic7xxx SCSI driver. Among other things, it includes high addressing support.

  • The example driver code from the second edition of Linux Device Drivers is now available for download from the O'Reilly web site. The full release of the book source will take a little longer, however.

Section Editor: Jonathan Corbet


July 19, 2001
