The Ottawa Linux Symposium 2000

Herein is our coverage of OLS. There is far more going on here than can be handled by one person, so it's far from complete. Nonetheless, we hope to capture some of the flavor of the event.

Jump directly to:

Wednesday
Thursday
Friday
Saturday

Wednesday at OLS

July 19, 2000
Jonathan Corbet

Ottawa is a pretty city - at least this time of year. It's bright, green, clean, and pleasant to walk around in. Certainly a great place for a conference. Some of the younger hackers seem to appreciate that the drinking age is 19, too...

The Ottawa Linux Symposium got off to a bit of a rough start. After sending around a note encouraging everybody to show up early for registration, they managed to get going an hour and a half late. In the end, they had to tell people to go to Miguel's noon keynote without badges, because otherwise they couldn't get through the lines in time. Oh well.

The conference organizers had a great idea, however: set up a wireless LAN in the convention center and Les Suites hotel, and lend out PCMCIA cards to the first hundred or so attendees. Too bad you have to load a separate driver, and that the system doesn't appear to be working all that well yet. The congress center was full of people glaring at their laptops in a puzzled way. The idea is nice, anyway...

Miguel showed up to give his keynote, only to note that nobody had gotten around to setting up a projector for his slides. At least the ensuing panic and delay allowed them to get more people through the registration process. (See the discussion of Miguel's talk in the July 20 LWN).

The rest of the afternoon was taken up by a pair of talks by Rik van Riel, Ben LaHaise, and Stephen Tweedie on the Linux memory management system. The talk originally included a lengthy introduction to memory management, but much of that got cut out to make up time lost by the delays earlier in the day. There was much technical talk on how things are done, and especially on the new "kiobuf" interface. Perhaps most interesting, however, were the plans for 2.4 and 2.5. There is still much that needs to be done with the memory subsystem before the 2.4 kernel can come out, including:

"Balanced page aging." Page aging is the scheme used to find which pages to throw out when memory is tight; the current setup has a tendency to pick on the wrong pages in some situations. The balanced scheme will split pages into three groups: active pages which are in current use, inactive pages which need to be saved to disk before being reused, and "scavenge" pages which are clean and can be used at any time. Both inactive and scavenge pages still contain useful data, and can still be used by processes, but this organization makes it easier to quickly find pages when the memory pressure is strong.
A "flush callback" mechanism, which would let the subsystem of the kernel which is responsible for writing a page to disk determine how that writing is done. Thus, for example, pages representing part of a modified file would be flushed to disk via a callback to the filesystem. The important feature here is that the callback can refuse to do the flush if conditions are not right. Thus a journaling filesystem, for example, is able to complete a transaction before having to deal with other pages to write.
"Pinned page reservations" will allow filesystems to reserve pages for short-term use. Reserved pages would still contain cache data, if any, and would be removed from the cache only if the filesystem actually calls on them.
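
As a rough illustration of the three page groups and the flush callback described above, here is a toy user-space model in Python. It is only a sketch: the class and method names (PageLists, deactivate, touch, reclaim) are invented for this example and do not correspond to any kernel interface.

```python
from collections import deque


class PageLists:
    """Toy model of the three page groups; all names here are hypothetical."""

    def __init__(self):
        self.active = deque()    # pages in current use
        self.inactive = deque()  # dirty pages: must be written out before reuse
        self.scavenge = deque()  # clean pages: reusable immediately

    def deactivate(self, page, dirty):
        """Move a page out of the active list into the appropriate group."""
        self.active.remove(page)
        (self.inactive if dirty else self.scavenge).append(page)

    def touch(self, page):
        """A process used the page again; its data is still there, so reactivate it."""
        for lst in (self.inactive, self.scavenge):
            if page in lst:
                lst.remove(page)
                self.active.append(page)
                return True
        return page in self.active

    def reclaim(self, flush):
        """Find a page to reuse under memory pressure.

        `flush(page)` stands in for the owner's flush callback; it returns
        False to refuse, e.g. a journaling filesystem in mid-transaction.
        """
        if self.scavenge:                 # clean pages cost nothing to take
            return self.scavenge.popleft()
        for page in list(self.inactive):  # otherwise ask the owner to flush
            if flush(page):
                self.inactive.remove(page)
                return page
        return None                       # every flush was refused
```

The point of the split is visible in reclaim(): clean pages are found without any scanning, and a refused flush simply moves the search on to the next candidate.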

The above are all considered "emergency" changes that must go in. The list of changes that can wait until 2.5 is a bit more vague, but includes:

"Physical page aging." The current aging scheme looks at process virtual memory. Doing things that way can give distorted usage information (leading to the eviction of the wrong pages), and can also cause the memory management subsystem to pass over the same memory many times. Looking directly at the physical memory array makes more sense.
Large pages. The current page size is fixed and relatively small; a number of architectures can handle much larger pages, often in a subset of the physical memory range. Using larger pages for at least part of memory can give better performance.
Better page table handling. Current page tables are nailed down in memory, even when the associated process is doing nothing; they can take up a lot of RAM. Through more sophisticated handling of page tables, such as shared tables and swappable tables, better behavior can be achieved.

The evening's program includes a reception with Jon 'maddog' Hall speaking and lots of beer; your author intends to participate as soon as this writing is complete... Despite the startup glitches, this is a high-quality conference. Watch this space for reports from the next three days.

Thursday

Thursday seems to have been filesystem day.

Theodore Ts'o started the day off with an overview of what's happening with Linux filesystems. Of course, he had to begin with half an hour of messing around with the video projector and the sound system before he could get going... Something in the Congress Center building is highly inimical to radio waves, to the detriment of both the wireless networking and the wireless microphones the conference is trying to use. Ted ended up simply yelling at the audience.

Filesystems, in the Ts'o view of the world, are divided up into three broad categories.

FAT filesystems use a file allocation table to keep track of what's going on. This approach leads to a number of problems, including extremely poor performance on fragmented files. It persists, however, due to its use in Windows systems and its very low space overhead on small (i.e. floppy) filesystems.
Inode filesystems separate the name of a file from its other metadata (location, protections, owner, ...). Most of the disk-based Linux filesystems are inode systems; these include ext2, ext3, reiserfs, xfs, jfs, etc.
Network filesystems keep data "out there" instead of on a local disk. Included here are the standards NFS and SMB/CIFS; other contenders in this area are AFS, CODA, Intermezzo, and so on.

Of much interest in the disk-based filesystem arena is, of course, the set of journaling filesystems. Currently the only one that is claiming to be ready for prime time on Linux is reiserfs; soon it will be joined by ext3 (see below), IBM's JFS, and SGI's XFS. Ted looks forward to the opportunity to benchmark JFS and XFS together on the same platform, something which has not been possible until now.

Another variant of interest is log-structured filesystems. These essentially treat the disk as a big circular buffer; they never overwrite blocks in place, so every operation is, instead, a replacement written elsewhere. Writes are fast because they always happen at the end of the buffer, and consistency is pretty much guaranteed. Log-structured filesystems have problems, however, in read performance and in their need for a "garbage collection" pass to keep space free at the end of the buffer area.
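
The append-only idea and the need for garbage collection can be sketched in a few lines of Python. This is a toy model under loud assumptions: a plain list stands in for the circular disk area, and the LogStructuredStore class and its methods are made up for illustration, not taken from any real filesystem.

```python
class LogStructuredStore:
    """Toy log-structured store: every write appends; nothing is overwritten in place."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of records the "disk" can hold
        self.log = []             # (block_id, data) records in write order
        self.index = {}           # block_id -> position of its newest record

    def write(self, block_id, data):
        """Append a new version of the block at the end of the log."""
        if len(self.log) >= self.capacity:
            self.collect_garbage()
            if len(self.log) >= self.capacity:
                raise OSError("log is full of live data")
        self.index[block_id] = len(self.log)
        self.log.append((block_id, data))

    def read(self, block_id):
        """Reads go through the index, since the newest copy can be anywhere."""
        return self.log[self.index[block_id]][1]

    def collect_garbage(self):
        """Drop superseded records, keeping only the newest copy of each block."""
        live = [rec for pos, rec in enumerate(self.log)
                if self.index[rec[0]] == pos]
        self.log = live
        self.index = {block_id: pos for pos, (block_id, _) in enumerate(live)}
```

Rewriting the same block several times leaves stale copies behind in the log; collect_garbage() is the pass that reclaims them, and it is where the extra cost of this design shows up.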

Another interesting sub-area is filesystems for flash memory. Flash brings its own constraints: it is expensive and thus small, and it can only be written a finite number of times. The limit on writing means that usage needs to be spread out over an entire flash array to avoid losing pieces too early. On the other hand, flash memory has no seek delays, so fragmentation is not a problem.

A couple of filesystems exist for flash memory now. "cramfs" is a compressed filesystem added recently to the 2.3 kernel development series. JFFS is a log-structured filesystem. Log-structured systems automatically spread usage over an entire device, and are thus well suited to flash memory.

Ted started out by saying that filesystems are expected to provide top performance for everybody. At this point, however, it also becomes clear that there are uses for application-specific filesystems. As a result, don't expect Linux to have fewer filesystems anytime soon - diversity helps Linux to be useful in many different situations.

Time was also taken out to address the issue of multi-stream files - those which, like Macintosh files, have more than one "fork" of data. Miguel de Icaza, on Wednesday, advocated support for such files; Ted Ts'o thinks it is a bad idea. None of the currently-used network protocols support multi-stream files, nor do the standard Linux tools. Any new API for multi-stream files would be nonstandard, since POSIX has no concept of such files. While Miguel states that the protocols and utilities simply need to be fixed, Ted says that the same functionality can be obtained in other ways, and that developers should "think different" to get the results they need.

Stephen Tweedie presented the work that he has done with the ext3 filesystem. In an (until now) unheard-of move, he actually started the talk on time, without difficulties...

Ext3 is the time-tested ext2 filesystem with the addition of journaling capabilities. Journaling works by bunching up the changes to a filesystem into atomic transactions; either all of the changes actually happen, or none of them do. For example, the simple task of writing some data to a file can involve numerous steps:

Allocating a new block to hold the data.
Updating the block pointers in the inode.
Updating the size of the file (and modification date) in the inode.
Adding the new space to the user's disk quota record.

Oh, yes, and also actually writing the data. If the system is interrupted when only some of those operations have completed, the filesystem ends up in an inconsistent and possibly dangerous state. That's when you have to run fsck to clean everything up.

In a journaling filesystem, all of those operations are first written to a special journal file, followed by a special "commit" record. Only when that has been done is the filesystem itself touched. If the system goes down in the middle, the whole thing is replayed from the journal file, so everything gets done. fsck becomes a thing of the past.
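
A minimal sketch of that write-ahead scheme, in Python, may make the commit record's role clearer. Everything here is an assumption made for illustration: the JournaledStore class, the dictionary standing in for the disk, and the crash_before_commit flag are invented and say nothing about how ext3 actually lays out its journal.

```python
class JournaledStore:
    """Toy write-ahead journal: updates reach `disk` only after a commit record."""

    COMMIT = "COMMIT"

    def __init__(self):
        self.disk = {}      # the "filesystem": key -> value
        self.journal = []   # records appended in order; assumed to survive a crash

    def log_transaction(self, updates, crash_before_commit=False):
        """Write all updates of one transaction to the journal, then commit."""
        for key, value in updates.items():
            self.journal.append((key, value))
        if not crash_before_commit:
            self.journal.append(self.COMMIT)

    def replay(self):
        """Recovery: apply every committed transaction, discard incomplete ones."""
        pending = []
        for record in self.journal:
            if record == self.COMMIT:
                for key, value in pending:
                    self.disk[key] = value
                pending = []
            else:
                pending.append(record)
        self.journal = []   # whatever is left in `pending` never committed


fs = JournaledStore()
# One complete transaction: block, inode, and quota updates travel together.
fs.log_transaction({"inode.size": 4096, "block.17": "data", "quota.alice": 1})
# A second transaction is interrupted before its commit record is written.
fs.log_transaction({"inode.size": 8192, "block.18": "more"}, crash_before_commit=True)
fs.replay()
```

After replay(), the store holds all three updates from the first transaction and none from the second, which is the all-or-nothing behavior described above.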

Stephen's goal with this work was to add the journaling capability to ext2 in a minimal fashion. He guarantees a 100% consistent filesystem at boot time, no matter what happens. But many other things that could go into ext2, such as b-trees, extent mapping, etc., have been left out in the name of simplicity.

The architecture used separates journaling out from the filesystem itself. A new journaling layer handles the details of making journaling work; it exports the capability to any module that needs it. So the addition of journaling to other filesystems should be relatively easy, if the interest is there.

For now, the journaling code writes everything to the journal file - including actual data written to files. In the long term, a more efficient implementation will journal only the file metadata. As long as the filesystem takes pains to write user data to the file before the metadata, there is no need to journal the user data. That, obviously, will greatly decrease the amount of data that has to go through the journal, and help to fix the "poor" (Stephen's word) write performance of ext3 now.

The ext3 journaling implementation is stable now, and working on a number of systems. There are still some user interface issues to deal with - some of the utilities need some work, and it's possible for an overzealous system administrator to delete the journal file. Those sorts of issues, along with a port to 2.4, will be dealt with in the future.

Steve Best of IBM finished out the afternoon with a discussion of the port of JFS to Linux. JFS is IBM's journaling filesystem, the release of which was announced at LinuxWorld back in February. Since then, nine "code drops" have been done, with successively higher levels of functionality.

JFS boasts some impressive features. File sizes can go up to 4 petabytes. Its journaling is metadata-only and thus, hopefully, fast. JFS is also an extent-based filesystem, which can make things much faster for large files. Dynamic allocation of inodes is done, adding flexibility while reducing the space wasted by static inode tables. And the filesystem can be defragmented and resized while online. There are still a few difficulties, though. The version of the filesystem they are porting comes from OS/2, and thus, for example, still does not understand case-sensitive filenames. (It retains the OS/2 disk structure, making filesystems portable between Linux and OS/2).

JFS's journal recovery is performed by a user-space program ("logredo"). Some members of the audience made an issue of the fact that it will be hard to use JFS as the root filesystem, since it must be mounted (to find logredo) before the journal has been replayed. The approach to that problem is to "mount read-only and hope for the best" until logredo can be run. Some people were unimpressed by this (ext3 does log recovery in the kernel, and does not have this problem), but the fact is that ext2 filesystems work that way now.

It was asked whether JFS might ever use the independent journaling layer that has been implemented for ext3. "Not now" is the answer - they want to complete their port before getting into large architectural changes.

The wireless networking is working at this point. It's just a matter of fixing a couple of things in the scripts and being in the right place. As it turns out, the rooms where the talks are held do not qualify as a "right place," frustrating all those who want to play around on the net while waiting for the speaker to say something interesting. Tip for future conference organizers: wireless networks are an absolutely great idea, but don't forget the technical support side of things. Even a "networking hints" bulletin board would have helped a lot of people get going.

Friday

So much for the good weather in Ottawa; today is gray and rainy. Fortunately one can get to the conference almost entirely without exposing oneself to the outdoors, by virtue of walking through the tremendous shopping mall which is attached to the conference center. The software store even has Corel Linux featured in its display window...

This morning Richard Gooch had scheduled a 10:00 BOF session on devfs. At about 10:10, the conference folks got around to setting up a room and putting up a sign. As a result, there were all of three of us there. Nonetheless, it was a fun conversation. Mr. Gooch has gotten past a critical hurdle in getting devfs into the kernel (after more than two years of effort), but there are still many people who oppose its existence. The devfs wars are not over yet.

At this point, the long-term fate of devfs probably rests in the hands of the distributors. If devfs starts cropping up in some high-profile distributions, it will be used by default. Thus SGI, which is sponsoring work on devfs, is said to be pushing some of the distributors to go in that direction. One large distributor, MandrakeSoft, is said to be seriously considering enabling devfs in its standard kernel.

Back in the main conference program, Deepak Saxena gave a presentation on the I2O bus and the status of its support under Linux. I2O was the subject of some concern in the Linux community a couple of years ago, due to the fact that its specification was only available under NDA. Those days are over now, of course, with the specification being openly available on the I2O web site and Intel supporting Linux I2O development.

I2O is driven by the problem that intensive I/O loads can swamp even modern processors. Driving a gigabit ethernet card, for example, can easily reduce a system to servicing interrupts and doing nothing else. The I2O approach is to place another CPU (the I/O Processor or IOP) between the main processor and the devices; this CPU offloads as much of the I/O processing load as possible.

There is nothing new about this sort of architecture - your author learned to program (far too many) years ago on a Control Data Cyber mainframe running KRONOS that was organized in just this way. When the I/O load gets too intense, it's time to throw another computer at the problem.

Anyway, the I/O processor now needs to know the details of driving the specific peripherals which are on the I2O bus. Perhaps the best feature of I2O, in the end, is that it has defined how the various types of cards are supposed to operate. Network cards, for example, have a specific interface that they must implement. Suddenly there is no more need for many dozens of ethernet drivers - a single driver will suffice.

All I/O is done through communication with the IOP. A message-passing scheme is used, which has good and bad points. On the good side, many operations can be performed with a single message in each direction, greatly reducing the interrupt load on the main processor. This scheme also tends to increase latencies, however. Certain applications, such as network communications with lots of short packets, can suffer from this latency increase.

The core I2O implementation for Linux exists now. The block storage device driver works, and the system can boot from an I2O disk. The LAN device driver works as well - on high-speed networks it gets throughput similar to that of the regular PCI drivers, but with significantly lower CPU overhead. There was no specific mention of other types of drivers, such as sequential storage.

For the future, look for higher-level operations to be split off onto the IOP. For example, the "socket" device class will implement an entire TCP/IP stack, taking all the protocol overhead out of the main processor. There is also an interest in implementing direct device-to-device transfers, allowing, for example, static files to be served to the web without involving the CPU.

The majority of us stuck with non-I2O systems were, instead, the topic of Jes Sorensen's talk on the optimization of SMP device drivers. This talk was a distillation of his experience with the AceNIC gigabit ethernet driver; it went from having real performance problems to being able to blast out some serious data.

The techniques involved are highly detailed, and probably not of great interest to those who don't hack device drivers. They include:

Separate parallel paths in the driver when possible. In an ethernet driver, that means keeping the transmit and receive paths separate.
Avoid cache contention between parallel paths.
Use memory-mapped I/O instead of port operations.
Avoid spinlock contention where possible; the use of the atomic type operations can help in this regard, sometimes.

In the end, very high performance remains a tricky and difficult goal.

Saturday

Les Suites isn't a bad hotel to be stuck in, but nobody has accused the elevators of being overly efficient. Getting in and out of the building can be a lengthy process... So imagine your author's joy at being joined on the way in by Slashdot's Emmett Plant - with accordion. Elevator music indeed...

Lars Marowsky-Brée gave a session on SuSE's work porting FailSafe, a high-availability system being open-sourced by SGI, to Linux. He took some time to distinguish this offering from what a number of other Linux vendors are doing: FailSafe is not "just another two-node solution." He was a little critical of the number of vendors who are reimplementing high availability systems so that they can have their own entry in the market. SuSE, instead, is going with code that scales up to 16 nodes, and which has been out there and working for five years.

FailSafe itself deals only with high availability. Thus it does not, for example, handle load balancing. Its job is to keep track of a set of network nodes and the services running on them. If something stops working, FailSafe will find a new place for it to run.

The system is very application oriented - it can be set up to do things like moving a large Oracle server from one node to another. It is also implemented entirely in user space, perhaps making it unique among high availability systems. No kernel additions are required.

FailSafe will be released in August (LinuxWorld "might make a good time" for the announcement). The code - 350,000 lines worth - will be licensed under the GPL, with the LGPL applied to some libraries that application developers can use. Meanwhile, binary snapshots are actually available now.

Despite the kernel-heavy nature of this conference, there is interest in the applications side as well. Dan Winship gave a presentation on Evolution, Helix's mail/address book/calendaring system. It is, according to Dan, "the most buzzword-compliant application for Linux." But it's something that we want to have anyway.

The core of the application remains the mail agent, where people spend a lot of time. It has a great many features, including the ability to display and send HTML-formatted mail. It tries, says Dan, to be "well behaved" about sending HTML mail. It can deal with local, POP, and IMAP mail now; there is interest in writing other back ends to allow, for example, pulling mail from services like Hotmail while stripping out the advertisements.

The "vFolder" scheme is intended to be a more flexible way of handling mail folders. Instead of setting up physical folders, Evolution throws everything into one big database and lets users overlay folders in any way they want. Thus it's easy to add a folder like "messages from Miguel." The user can then go on and make another one called "unread messages from Miguel containing the word 'sucks'". That folder, according to Dan, would be one of the larger ones...
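
The idea of a folder as a saved query over a single message store can be sketched briefly in Python. This is only an illustration of the concept as described in the talk; the Message and MailStore types and their fields are hypothetical and are not Evolution's actual data model or API.

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    subject: str
    body: str
    unread: bool = True


class MailStore:
    """One flat message store; a virtual folder is just a named predicate."""

    def __init__(self):
        self.messages = []
        self.vfolders = {}

    def add_vfolder(self, name, predicate):
        """Register a folder as a query; no messages are moved or copied."""
        self.vfolders[name] = predicate

    def folder(self, name):
        """Evaluate the folder's query against the whole store on demand."""
        return [m for m in self.messages if self.vfolders[name](m)]


store = MailStore()
store.messages += [
    Message("miguel", "bonobo", "this API sucks"),
    Message("miguel", "re: bonobo", "fixed now", unread=False),
    Message("ted", "ext2", "multi-stream files"),
]
store.add_vfolder("messages from Miguel", lambda m: m.sender == "miguel")
store.add_vfolder(
    "unread messages from Miguel containing 'sucks'",
    lambda m: m.sender == "miguel" and m.unread and "sucks" in m.body,
)
```

Because folders are only queries, the same message can appear in any number of them, and adding a new folder never requires refiling existing mail.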

Evolution stresses the integration between its components. Thus the address book is available for mail composition; it can also snarf "vcards" out of incoming mail to add new entries. Similar tricks are available with the calendar, making things like meeting scheduling easier. The calendar can also be used to mark messages as "reply to this later." If you've not actually made the reply within the allotted time window, you'll get an alarm to remind you.

The upcoming "powersearch" feature will also let users run a search like "show me all messages, contacts, and appointments related to OLS" and get a nice display of the whole thing.

Evolution exports its components for other systems to use. Thus it's easy to write scripts that work with the address book, for example. What Evolution will not do ("we're not stupid") is run scripts that come in via mail. Someday there will be the ability to add more front ends to the system - examples include web-based, text, and the obligatory emacs interface.

Other upcoming goodies include the ability to stick notes onto messages - that is apparently in the code now, though it has to be explicitly enabled in the build process. Project management features are on the list. They also want to integrate the "gnomacs" emacs component for mail composition. Finally, integration of encryption (PGP and GPG) is on the list - but not there yet.

David Miller's keynote was scheduled at 3:15 as the last event. Around 3:20 or so, the conference staff got around to setting up a video projector and the sound system... The lack of the projector, at least, can be understood, since David projected his talk from slides. But Alan Cox had a special introduction in mind...the very first mail message he ever got from David. It was a "subscribe" message sent to the linux-multicast mailing list - not to the request address...

David's talk was a history of the Linux kernel from a David-centric point of view. It was a lively, anecdote-laden presentation that would be impossible to reproduce here. So I'll content myself with a few moments in history:

David's first message to Linus. After struggling for three days to bring up an early SLS system, he dropped off a note from his father's AOL account proclaiming his success. Linus responded "as if I had a clue," thus ensuring that David would want to continue working with the system. Many people can get a clue if you help them on the way; Linus is still doing that with the kernel.
His first contribution: helping with the transition to the ELF binary format. That also helped him get a job at Rutgers after the admins noticed all of his (unauthorized) activity on their systems. The second contribution was moving the mailing lists over to vger.rutgers.edu. It turns out the lists have to move again - soon they will show up at vger.redhat.com...
Third contribution: the Sparc Linux project. It started as a water cooler joke - just "a little code" would be required. Seven months of "torture" later, David could run a shell on an SLC.
Once things were tuned better, David entered his famous "bragging on comp.os.solaris" phase. After an especially long message proclaiming Linux's superiority to Solaris, he got an answer from a Sun engineer: "Have you ever kissed a girl?"
Subsequent moves included taking over networking, the long process of getting 2.1 out, riding with Linus and getting pulled over, working with Cobalt Networks, and ending up at Red Hat. That last job pretty much lets him do what he wants, working with the kernel most of the time and jumping on to something else occasionally when the bug strikes.
The Mindcraft benchmarks. Two months after the benchmark, the TCP stack was fully SMP threaded; five months after that the entire networking subsystem was fully parallelized. The end result: last month's Specweb results which blow away everything else. Things like Mindcraft work perversely with Linux - once problems are pointed out, people just jump in and fix them.

Questions from the audience included "when's 2.4 coming out?" David doesn't know better than anybody else... but maybe a month and a half or so after big changes stop going into the kernel.

Thus ends the Ottawa Linux Symposium, except for the Helix Code party to be held tonight. This has been a top-quality event - despite my occasional pokes at the conference organization. This event is tightly focused on the code, with a near absence of hype. LinuxWorld-like events are good for what they are, but it is at events like OLS that the Linux community really can be seen.

I'll be back next year.