See also: last week's Kernel page.
The current development kernel release is still 2.5.2. Linus has issued 2.5.3 prepatches up to 2.5.3-pre4; this prepatch has a working ATA/IDE layer, a reworking of the in-core inode structure, and a lot of fixes.
The current stable kernel release is 2.4.17. Marcelo's latest prepatch is 2.4.18-pre7, which contains quite a few fixes and updates, and not much else. A 2.4.18 release candidate has not yet been issued.
Other kernel tree releases: Dave Jones has released 2.5.2-dj4; this one is updated through 2.5.3-pre2. It adds a more recent version of the scheduler patch, various fixes, and a major update which makes all input devices use the input layer. As a result, people who run this kernel will have to enable some new configuration options, and probably change their X server configuration as well - unless, of course, they have no keyboards or mice to worry about. See Vojtech Pavlik's note for more information on what is required.
Andrea Arcangeli has announced 2.4.18-pre4-aa1. The most interesting thing in this patch, perhaps, is a change which lets the kernel put page tables in high memory. "It only compiles on x86 and it is still a bit experimental. I couldn't reproduce problems yet though."
2.0.40-rc2 has been released by David Weinehall. This one will go out as the real 2.0.40 stable release unless somebody comes up with a good reason why it shouldn't.
What Rik van Riel is up to. VM hacker Rik van Riel may have suffered a bit of a setback when Linus replaced his code with an alternative virtual memory implementation in the 2.4 series. He has not, however, given up on VM work; instead, he has been steadily releasing a set of "reverse mapping" ("rmap") patches for the 2.4 kernels. The rmap code was incorporated into 2.4.18-pre3-ac2 by Alan Cox, who compares the rmap VM favorably with the much-respected FreeBSD memory implementation.
So perhaps it's time to give rmap a look, starting with a bit of superficial background. Linux, of course, is a virtual memory system. This implies that every address generated by a process must be looked up (by the hardware) in a page table. The page table entry (PTE) will either map the address onto a physical memory address or note that the page is not present.
A key point is that a page table is a one-way mapping. Given a virtual address in a process's context, the corresponding physical address may be found. But there is, in the stock kernel, no easy way to find which page table entry (in which process's context) refers to a specific page. Things are complicated by the fact that pages can be shared among processes as well; if a page in memory holds code from the C library, for example, there will be many page table entries pointing to it. The only way to find them is to scan every process's page tables, looking for matching entries. That is a long and expensive task.
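As a rough illustration of that asymmetry, here is a toy model in C. It is only a sketch: the toy_pte and toy_proc types, the eight-page address space, and both function names are invented for this example and do not correspond to the kernel's real page table code. The forward lookup is a single array index; the reverse lookup has no choice but to visit every entry of every process.

```c
#include <stddef.h>

#define TOY_PAGES_PER_PROC 8        /* hypothetical: a tiny address space */

/* Toy page table entry; not the kernel's pte_t. */
struct toy_pte {
    int present;                    /* nonzero if the page is in memory */
    unsigned long frame;            /* physical frame number, valid if present */
};

/* Toy process: nothing but a flat, single-level page table. */
struct toy_proc {
    struct toy_pte table[TOY_PAGES_PER_PROC];
};

/* Forward lookup (virtual -> physical): one array index.
 * Returns 0 and stores the frame on success, -1 if the page is not present. */
int toy_translate(const struct toy_proc *p, size_t vpn, unsigned long *frame)
{
    if (vpn >= TOY_PAGES_PER_PROC || !p->table[vpn].present)
        return -1;
    *frame = p->table[vpn].frame;
    return 0;
}

/* Reverse lookup without rmap (physical -> PTEs): every entry in every
 * process's table must be examined to count the mappings of one frame. */
size_t toy_count_mappers(const struct toy_proc *procs, size_t nprocs,
                         unsigned long frame)
{
    size_t hits = 0;

    for (size_t i = 0; i < nprocs; i++)
        for (size_t v = 0; v < TOY_PAGES_PER_PROC; v++)
            if (procs[i].table[v].present && procs[i].table[v].frame == frame)
                hits++;
    return hits;
}
```

The cost of toy_count_mappers() grows with the total size of all page tables on the system, no matter how few of them actually refer to the frame in question.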
The one-way nature of the Linux VM makes memory management tasks harder. Before the kernel can free a given page in physical memory, it must find and invalidate every page table entry pointing to that page. This is done by scanning page tables, "freeing" pages by invalidating page table entries and decrementing the reference counts on the corresponding pages. When the reference count goes to zero, the system knows that the page is now truly free.
Managing physical memory by scanning virtual memory is inefficient. Many tables may have to be scanned before a given page can be freed. And if the kernel has a pressing need to free pages in a particular zone (subsection of physical memory), scanning virtual memory is not particularly helpful. It may be necessary to look at a large number of PTEs before finding a single page in the right zone.
The solution to these problems is reverse mapping, the creation of a data structure which, given a physical page, can return a list of PTEs which point to that page. The logical place for this information is the system memory map, which is an array of struct page structures, one for each page of physical memory on the system. Rik's patch adds a pte_chain member to the page structure; it points to a simple linked list of pointers to PTEs. Access is thus simple; if you have a physical page you want to work with, just go to its page structure and follow the chain.
Once you have that capability, a number of things become possible. Freeing a page is now straightforward, since all of the relevant PTEs can be found and modified at once. It is also easier to keep track of which pages are actually being used; follow the pte_chain and check each entry's "referenced" bit, and adjust the page's "age" accordingly. This information will help the VM system pick the right pages to throw out. If memory is tight in a particular zone, the physical pages in that zone can be scanned directly without having to sift through tremendous numbers of irrelevant page table entries. All of these features will help to create a more responsive and stable VM under varying loads.
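The general shape of such a chain can be sketched in a few lines of C. This is a simplified illustration and not code from Rik's patch: only the pte_chain member name comes from the description above, while the toy_ types, the flag values, the function names, and the very crude aging rule are all assumptions made for the example.

```c
#include <stdlib.h>

#define TOY_PTE_PRESENT    0x1u     /* hypothetical flag bits */
#define TOY_PTE_REFERENCED 0x2u

typedef unsigned int toy_pte_t;     /* toy PTE: just a word of flags */

/* One reverse-mapping entry: a link plus a pointer to a PTE. */
struct toy_pte_chain {
    struct toy_pte_chain *next;
    toy_pte_t *ptep;
};

/* Toy stand-in for struct page, with the added chain head. */
struct toy_page {
    struct toy_pte_chain *pte_chain;
    int age;
};

/* Record that *ptep maps this page. Returns 0, or -1 if out of memory. */
int toy_page_add_rmap(struct toy_page *page, toy_pte_t *ptep)
{
    struct toy_pte_chain *pc = malloc(sizeof(*pc));

    if (!pc)
        return -1;
    pc->ptep = ptep;
    pc->next = page->pte_chain;
    page->pte_chain = pc;
    return 0;
}

/* Follow the chain, clearing each "referenced" bit that is set, then
 * adjust the page's age (assumed policy: +1 if used, -1 if idle).
 * Returns the number of PTEs that had the bit set. */
int toy_page_referenced(struct toy_page *page)
{
    int referenced = 0;

    for (struct toy_pte_chain *pc = page->pte_chain; pc; pc = pc->next) {
        if (*pc->ptep & TOY_PTE_REFERENCED) {
            *pc->ptep &= ~TOY_PTE_REFERENCED;
            referenced++;
        }
    }
    if (referenced)
        page->age++;
    else if (page->age > 0)
        page->age--;
    return referenced;
}

/* Invalidate every PTE that maps the page and release the chain.
 * Returns the number of mappings removed. */
int toy_page_unmap_all(struct toy_page *page)
{
    int removed = 0;

    while (page->pte_chain) {
        struct toy_pte_chain *pc = page->pte_chain;

        page->pte_chain = pc->next;
        *pc->ptep = 0;              /* clear the entry: page not present */
        free(pc);
        removed++;
    }
    return removed;
}
```

Both walks cost time proportional to the number of mappings of the one page, not to the size of every page table on the system. A real implementation would also have to deal with locking and TLB flushing, which this sketch ignores.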
There is a cost, of course. The page structure has a new field, the pte_chain pointer. Then there are linked list entries for the reverse mappings. This extra memory usage matters. As a simplified, "back of the envelope" calculation, consider a 32-bit system with 128MB of main memory, using 4KB pages, and with exactly one PTE for every physical page. This system has 32768 pages; the overhead for the pte_chain, at 12 bytes/page, will occupy 384KB of memory - 96 pages, or about 0.3% of the total. That is a noticeable increase in the kernel's memory use, and it grows as pages are shared among more processes.
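The arithmetic behind that estimate is easy to check. The helper below is a hypothetical sketch (neither the structure nor the function exists anywhere in the kernel); it assumes one chain entry per physical page, as the estimate does. The 12 bytes presumably break down as a 4-byte pte_chain pointer in the page structure plus an 8-byte list entry holding two 4-byte pointers.

```c
/* Result of the back-of-the-envelope estimate. */
struct toy_overhead {
    unsigned long pages;            /* physical pages in the system */
    unsigned long bytes;            /* bytes spent on reverse mappings */
    unsigned long overhead_pages;   /* the same cost in whole pages */
};

/* Assumes exactly one PTE (one chain entry) per physical page. */
struct toy_overhead toy_rmap_overhead(unsigned long mem_bytes,
                                      unsigned long page_size,
                                      unsigned long bytes_per_page)
{
    struct toy_overhead o;

    o.pages = mem_bytes / page_size;
    o.bytes = o.pages * bytes_per_page;
    /* Round up: a partially used page is still a used page. */
    o.overhead_pages = (o.bytes + page_size - 1) / page_size;
    return o;
}
```

Calling toy_rmap_overhead(128UL * 1024 * 1024, 4096, 12) yields 32768 pages, 393216 bytes (384KB) of overhead, and 96 overhead pages. Pages mapped by many processes need one list entry per mapping, so real-world numbers will be higher.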
The memory used for reverse mappings is actually pretty small compared to the full overhead of the VM system. Even so, the rmap patch tries to mitigate that impact somewhat with another change. The standard Linux page structure includes a wait queue for any process that needs to perform an exclusive operation on the page. That wait_queue_head_t field takes up 12 bytes (on the i386 architecture, at least) and tends to be unused much of the time. It is not often that a process must actually wait on a page. So, in the rmap patch, the wait queue has been removed from the page structure. Instead, a much smaller list of wait queues is maintained; for a given physical page, a hash function is used to find the associated wait queue. Occasional collisions will occur, resulting in processes waking up when their pages are not yet ready. That is a performance penalty, but a small one that should be far outweighed by the memory savings.
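The hashed wait queue idea might look something like the following. This is a sketch under stated assumptions, not the patch's code: the table size, the multiplicative hash, the use of a page frame number as the key, and all of the toy_ names are invented for illustration.

```c
#include <stdint.h>

#define TOY_WAIT_TABLE_BITS 4       /* hypothetical: 16 shared queues */
#define TOY_WAIT_TABLE_SIZE (1u << TOY_WAIT_TABLE_BITS)

/* Toy stand-in for wait_queue_head_t. */
struct toy_wait_queue_head {
    int waiters;
};

/* One small table shared by all pages, replacing a queue in every page. */
static struct toy_wait_queue_head toy_wait_table[TOY_WAIT_TABLE_SIZE];

/* Multiplicative hash of the page frame number down to a table index. */
static unsigned int toy_hash_pfn(unsigned long pfn)
{
    uint32_t h = (uint32_t)pfn * 2654435761u;

    return h >> (32 - TOY_WAIT_TABLE_BITS);
}

/* Find the wait queue shared by this page and any page that collides with it. */
struct toy_wait_queue_head *toy_page_waitqueue(unsigned long pfn)
{
    return &toy_wait_table[toy_hash_pfn(pfn)];
}
```

Since there are far fewer queues than pages, some pages must share a queue. A process woken from a shared queue therefore has to recheck whether its own page is really ready, and go back to sleep if it is not.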
The patch contains some other bits, such as a simple "defragmenter" which tries to make large, contiguous memory allocations work (though most of the implementation work remains to be done), and a "drop behind" function which frees up pages belonging to files when a process is doing sequential I/O and has passed over them. There is also a more structured approach to "inactive" pages - pages which have been taken away from a process but which still contain the (potentially useful) data. The new code tries to keep around a fair number of clean, inactive pages; these pages can be quickly given back to their processes if called for, but are also available for allocation elsewhere if need be. Finally, the patch adds a fair number of general cleanups and a lot of comments.
Rik's patch has drawn a number of favorable reviews. For now, however, it is not being proposed for inclusion into 2.5. Indeed, it is only available for the 2.4 kernel series. Rik is currently working with 2.4 only as a way of having a stable base to start from. VM hacking can lead to weird and subtle bugs; it's not helpful if the rest of the kernel is also in great flux with bugs of its own. There will eventually be a 2.5 version, Rik tells us, when things have calmed down and the rmap patch itself is in a more finished state.
The current rmap version is release 12a.
What is up with the Athlon bug? The word first showed up on the Gentoo Linux site: a bug in the AMD Athlon CPU could cause data corruption on Linux systems. The word was that the problem had to do with what happens when 4MB pages are invalidated by the processor; the workaround was to tell the kernel to run without large pages with the mem=nopentium boot option.
The only problem is this: the Linux kernel only uses 4MB pages for kernel space itself. It maps all of (low) memory using large pages, then leaves the mapping alone - 4MB pages are never invalidated. The explanation left many kernel hackers unsatisfied, and the investigation continued.
What is actually going on, as posted by Gentoo's Daniel Robbins, is rather more subtle. The kernel's 4MB mappings cover all of (low) physical memory, including things like AGP memory. In some situations, the CPU can generate "speculative writes" to that memory via the 4MB mapping, and this has the effect of loading a cache line with data from memory. That cache line will eventually be written back to memory (even though the "speculative write" is never executed and the data has not been changed); unfortunately, the AGP processor may have modified the underlying memory in the meantime. The cached data is thus stale, and writing it back corrupts whatever the AGP processor had put there.
Real fixes are still in the works. Meanwhile the mem=nopentium option will work for people who are affected by this problem.
Creeping ACPI. Jes Sorensen tracked down a problem with his shiny new Vaio laptop; it seems that the interrupt line for his CardBus controller was not getting set up properly. He has posted a small, special-purpose fix which patches things up in that case.
The underlying problem, however, remains. Many of the older, BIOS-level hardware tables which have traditionally been used to configure things like interrupts are going away. Instead, the newer ACPI standard is being used. If the kernel is to be able to work with newer hardware, it will need a functioning ACPI implementation, including the AML interpreter.
Running the full-blown ACPI setup is not an entirely popular idea, as was discussed on this page last July. ACPI brings substantial amounts of kernel bloat, reliability worries, and security concerns. Many (or most) people who have really looked over ACPI tend to be unenthusiastic about putting it into their kernels.
Finding a solution that allows future hardware to work without equipping the Linux kernel with an interpreter that can run arbitrary, closed source code is going to be a challenge. Proposals for a "configure and dump" mode for ACPI will address the bloat concerns, but not the others. It will not be a good day when Linux can configure a disk drive, but only at the cost of running a bunch of closed, buggy AML code with, perhaps, some "digital rights management" software thrown in as a bonus.
Other patches and updates released this week include:
Section Editor: Jonathan Corbet
January 24, 2002