The Grumpy Editor's video journey part 2: Video editors

In the first installment in this series, your editor took on the task of getting video data onto his system in digital form. Part 3 talked about authoring DVDs with the nicely edited versions of those video clips. Now it's time to fill in the missing second part, wherein your editor turns raw captured video into something suitable for DVD creation.

The task to be accomplished is relatively simple: for each video clip, trim off the extra junk at the beginning and the end. Some of them also require internal editing; there were signs of operator error in the form of, say, extended sequences where the sole subject matter was the floor and, perhaps, the cinematographer's shoe. Nice transitions between the clips were desired - a basic fade to black at the end, if nothing else. The addition of titles is useful. And, as an added bonus, the video clips needed to be deinterlaced before being written in a form suitable for passing to the dvdauthor utility.

In the process, your editor encountered several tools in varying states of readiness. He has become better acquainted than ever with the notion of "build hell." A rather more than passing acquaintance with the behavior of the out-of-memory killer in 2.6.24-rc kernels has also been achieved. And, at the end, your editor believes he has a reasonable sense of the state of the art in Linux video editing.

Avidemux

Avidemux is a GTK-based editor which, according to its web page, is "designed for simple cutting, filtering and encoding tasks." It is an interesting combination of simplicity in some areas and great power and complexity in others. It has a lot of potential, but it also has a few rough edges.

For example, Avidemux handles DVD-style MPEG2 files without trouble.
But a reader who digs far enough into the documentation (which is extensive and useful, incidentally) finds a warning that one must exercise the "build VBR time map" option, or audio and video will become unsynchronized in the final product. This operation is nearly instantaneous on a five-minute clip; given the problems which can result from not doing it, why does Avidemux not just build this "time map" when the file is loaded? Why set a trap like that for your users?

The actual video editing operations are quite simple. Avidemux can only handle a single video clip, and that clip has a single set of begin/end points. It is possible to delete from the middle of a clip using those endpoints; deletion is instantaneous and leaves no sign on the timeline. There is no "undo" operation, but there is an option to dump all changes made to the file. There is a scrollbar which enables quick movement through the clip; the arrow keys move by single frames. In general, the interface is responsive on your editor's machine.

One place where Avidemux excels is in its selection of video filters. For example, your editor went looking for a filter to deinterlace the video; he found 21 different deinterlacing filters. Many of these filters have an extensive set of configuration options. Choosing the right filter and options for the job at hand is an intimidating task, and the documentation does not provide a whole lot of guidance. In the end, your editor got reasonable results with the "yadif" filter, as can be seen in the "before" and "after" images on the left.

A fade-to-black ending was achieved with another filter. It works beautifully, if one does not mind that (1) there is no choice of what to fade to beyond a "fade to black" toggle, (2) the portion of the clip to be affected must be identified by typing in frame numbers, and (3) those frame numbers are not adjusted should somebody, say, delete some video from an earlier part in the clip.
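
The bookkeeping behind complaint (3) is small. As a rough illustration only (this is not Avidemux code, and the function name and frame numbers are hypothetical), a Python sketch of how a fade's frame range would have to move when earlier frames are deleted might look like this:

```python
def shift_fade_range(fade_start, fade_end, del_start, del_end):
    """Adjust an inclusive fade frame range after frames del_start..del_end
    (also inclusive) are deleted from the clip.

    Hypothetical helper for illustration; not part of any real editor."""
    if fade_start > fade_end or del_start > del_end:
        raise ValueError("ranges must be given as start <= end")
    if del_start > fade_end:
        # The deletion lies entirely after the fade: nothing moves.
        return fade_start, fade_end
    if del_end >= fade_start:
        # The deletion touches the fade itself; how to handle that is a
        # policy decision this sketch does not attempt to make.
        raise ValueError("deleted span overlaps the fade range")
    removed = del_end - del_start + 1
    return fade_start - removed, fade_end - removed


# A 9000-frame clip with a fade over its last 60 frames (8940..8999);
# cutting 300 frames of floor footage (1000..1299) moves the fade earlier.
print(shift_fade_range(8940, 8999, 1000, 1299))  # (8640, 8699)
```

Without an adjustment of this sort, the fade in the example would still be pinned to frames 8940..8999, which no longer exist in a clip that has shrunk to 8700 frames.
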
The capability is there, but the interface needs some work.

Other filters allow cropping, mirroring, color modifications, noise removal, sharpening, blurring, addition of subtitles, the addition of logos from image files, the creation of animated DVD menus, etc. Should all of those be inadequate, the "swiss army knife" filter is there for more general low-level processing. There is also a scripting interface for Avidemux, though your editor did not attempt to make use of it.

The interface allows the user to view the video either before or after the filters have been applied - or both together. The latter mode tends to run slowly, though the post-filter output, by itself, worked just fine. In the end, saving the file out as a DVD "video object" does the job - though one has to assume that the rather spartan "save" dialog will do that.

Like most (but not all) video editors, Avidemux does not actually change the video data until told to render a new file. The list of edits, filters, etc. can be saved as a "project" file (an Avidemux script, really) so an editing session can be resumed at a future point using the original material.

The bottom line is that Avidemux is a capable and reasonably solid tool - your editor was not able to make it crash. Its long list of filters will be appealing to some users. Its inability to work with more than one clip at a time will rule it out for many others, though. Like so many other tools in this category, it's almost there.

Cinelerra

The Cinelerra tool has an interesting history. It was once known as "Broadcast 2000," before being withdrawn because somebody was worried about legal liability. Now it is available as "Cinelerra," but in two versions. The "official" version is published by a company named Heroine Warrior, which has no real interest in the hassles of dealing with a community or making regular releases.
Heroine Warrior is, however, generous enough to make the code available under the GPL; a group of developers has taken the code and made Cinelerra CV - the "community version." This version is supposed to be under active development and move more quickly, but it still doesn't seem to be moving all that fast, unfortunately.

There are some good documents for Cinelerra, but, reading them, one starts to encounter certain themes. For example:

    Cinelerra is not perfect. Before long you will be familiar with the tendency it has to crash

Or this one:

    Quicktime is not the standard for UNIX but we use it because it's well documented. All of the Quicktime movies on the internet are compressed. Cinelerra doesn't support most compressed Quicktime movies but does support some. If it crashes when loading a Quicktime movie, that means the format probably wasn't supported.

Cinelerra is by far the most complex - and capable - of the tools available for Linux. If you are looking for an editor designed for the creation of complicated video with lots of effects, Cinelerra is the tool for you. Unfortunately, Cinelerra does not appear to have a development community which is up to the maintenance of a tool of this size. So it is difficult to work with and not particularly robust.

At startup, Cinelerra puts up four individual windows. The "timeline" shows all of the tracks being edited, and is the place where much work actually gets done. There are two video windows; one displays the current state of the timeline, while the other can be used to look at individual clips outside of the timeline. Then the "resources" window holds everything else.

The timeline display is quite nice. Video thumbnails along the line give a rough sense of what is happening in each clip. The display of audio levels is also highly useful when one is trying to find specific events; it would be nice if other tools picked up this idea.
A number of editing operations can be performed directly on the timeline; each track, for example, has a horizontal line which can be manipulated to adjust the (audio or video) levels at any given point. So a fade-to-black, for example, is a simple matter of ramping the video level down at the right place.

For more complex operations, there is a large list of effects which can be applied. These effects show up on the timeline next to the tracks they operate on; their end points can easily be dragged around. Cinelerra will attempt to render effects when the timeline is being played, but that tends to slow the program (not the fastest tool to begin with) to a point where it cannot keep up with normal video rates.

Cinelerra does not modify any data until told to render the project. It cannot create DVD video objects directly; one must render audio and video separately, then multiplex them outside of the program. The edit list can be saved separately.

There is a whole host of features in Cinelerra not found anywhere else. For example, it can be used to drive a rendering farm for those big production jobs. There is a motion tracking subsystem built into it ("The intricacies of motion tracking are enough to sustain entire companies and build careers around"). There's a set of options for audio and video capture. And so on. But your editor could never get all that far with Cinelerra before it ran the system out of memory. One does, indeed, become familiar with its tendency to crash, but it's especially annoying when it takes the rest of the system down with it.

Cinelerra should really be one of the star applications in the free software world. It has a great deal of power and can do amazing things; it could be a professional-quality tool. What it needs is for the community to truly take charge of the "community version" and turn it into a system which is fast, robust, and easier to use.
To that end, it would help if the two people on the planet who can succeed in actually building this system would clean up that process and, in general, make Cinelerra more welcoming to new developers. The foundation for a great video editor is here, but there is a lot of finishing work to be done.

Kdenlive

Kdenlive is a KDE-based editor under active development; version 0.5 was released in August, 2007. Having not found a version for Rawhide, your editor set out to build this tool, only to give up in despair.

So, as an aside, your editor would like to offer a helpful suggestion to developers who want people to actually use their code: if you absolutely must use your own build tool instead of make, and there is just no alternative to using a tool which nobody has heard of or packages and which does not have a web site or working download location, please consider just packaging said tool with your code. Your editor is sure that "unsermake" is vastly superior to the alternatives which we all have on our systems already, but it doesn't help if you can't find it. Of course, even after solving that problem, your editor was not able to build this tool. Fortunately, Ubuntu ships it, so that is the version which was used here.

The initial Kdenlive experience is a little rough; it asks for a set of default parameters. How is one to choose between, say, "CIF NTSC" or "DV NTSC" or "DV NTSC Widescreen"? There is no help on offer to guide the user toward the right choice. Once past that, the user sees a window with three major panes which offer functionality similar to that available from Cinelerra.

The first step is to bring one or more video clips into the "project tree," which is (usually) visible in the upper left pane. These clips can be viewed in the "clip monitor" on the right. A clip of interest can then be dragged down to the timeline area, where it can be easily positioned relative to any others which are already there.
Kdenlive uses the "divide and conquer" editing method. To remove a section of a clip, the user positions to one end of that section, then selects "razor" to split the clip in two at that point. Another split at the other end isolates the section to be removed, which can then be deleted with a separate operation. There is (with the exception of transitions) no way to apply an operation to a part of a clip - the area of interest must always be razored out first.

As a result, the fade-to-black effect is not quite as easily achieved in Kdenlive as with some other tools. There is a "brightness" effect, but it changes the brightness to a constant value through the entire clip. The way to fade out a scene is to add a new clip with a solid color (easily done in Kdenlive), then use a crossfade transition to join the two clips together. Transitions are added by selecting the first track and, via the right-button menu, selecting the desired transition. Various parameters (such as the time required for the transition) can then be tweaked. It all works easily; Kdenlive is a fun tool for quickly piecing together different bits of video into a coherent whole.

There are separate video windows for displaying individual clips and the timeline as a whole; by default, they cannot both be viewed at the same time. Playback is responsive. It's a little more awkward than with some tools, though: the position cursor is small and hard to grab, and there is a shortage of keyboard shortcuts for moving around. The timeline is less informative and less functional than Cinelerra's, but the information one really needs is there.

When the project is done, there is a nice "export to DVD" option there to do the rest of the work. Kdenlive can create the video object files and fire up Qdvdauthor to do the rest, or it can create a basic, single-title DVD internally and (using k3b) burn it to a disc.
Your editor, thus, should have mentioned Kdenlive in the DVD authoring article, but he was unaware of this feature at that time. It all works easily; your editor was able to make a playable DVD with minimal trouble.

It was not the most beautiful DVD, though, because Kdenlive has no deinterlacing capability. Those of us unlucky enough to be starting with interlaced video must handle that operation separately, before or after the editing process.

While any of the editors discussed here could conceivably work with high-definition video, Kdenlive is the only one which appears to have been written with that in mind. Projects can be set up in HD formats without undue tweaking. Your editor was not in a position to test this capability, though.

All told, Kdenlive comes across as one of the most finished of the free editing tools. It is relatively straightforward to use and it has all of the features that most people are likely to need. For many applications, this could well be the first tool to reach for.

Kino

Despite its "K" name, Kino is a GTK-based video editor. It is quick and easy to use, but also lacking somewhat in power.

Kino only works with a single video format - the digital video (DV) format associated with contemporary camcorders. When started with something else (say, your editor's MPEG files from the capture card), it will offer to convert the file into DV. This process works, but the result is a significant (5-10x) increase in the size of the file.

There is no timeline in Kino; instead, it has a "storyboard" in the leftmost pane. Each video clip becomes a separate scene in the storyboard, with each being played strictly before the one after it. Like Kdenlive, Kino works by dividing clips and applying operations to the pieces. So trimming video is done by "splitting" the scene into wanted and unwanted parts, then deleting the latter.
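
The split-then-delete approach shared by Kdenlive and Kino can be modeled in a few lines. The following Python sketch is an illustration only: it assumes a storyboard is just an ordered list of half-open (start, end) frame ranges, and the helper names are hypothetical rather than taken from either program.

```python
def split_scene(storyboard, index, frame):
    """Split the scene at `index` into two scenes at `frame`.

    Scenes are half-open (start, end) frame ranges; a new list is returned."""
    start, end = storyboard[index]
    if not start < frame < end:
        raise ValueError("split point must fall strictly inside the scene")
    return (storyboard[:index]
            + [(start, frame), (frame, end)]
            + storyboard[index + 1:])


def delete_scene(storyboard, index):
    """Drop the scene at `index`, returning a new list."""
    return storyboard[:index] + storyboard[index + 1:]


# Trim a 9000-frame capture down to frames 120..8699:
# two splits and two deletions, one pair for each end.
board = [(0, 9000)]
board = split_scene(board, 0, 120)    # junk at the start | the rest
board = delete_scene(board, 0)
board = split_scene(board, 0, 8700)   # wanted part | junk at the end
board = delete_scene(board, 1)
print(board)  # [(120, 8700)]
```

Cutting a stretch of footage out of the middle of a scene follows the same pattern: split at each end of the unwanted section, then delete the piece in between, leaving two scenes on the storyboard.
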
The documents make much of the "powerful" three-point trim feature, but your editor doesn't get it; it just seems like a way to set the beginning and ending split points on the same screen, but the amount of work remains the same.

Moving within clips is quick and easy in Kino. There is also a scrollbar-based "jog wheel" for variable-speed motion in either direction. What your editor really likes, though, are the keyboard shortcuts, including vi-style bindings for moving, frame-by-frame, through the material. It makes finding the exact spot to make a cut a quick affair.

Kino offers a reasonable set of effects, though the interface and implementation are awkward. Most effects apply to a full scene, so the normal mode of operation is to split scenes where an effect is to be placed. There is an option to "limit" an effect to a period of time at the beginning or end of a scene, though, so something like fade-to-black or a crossfade can be done without making new scenes.

Or so one would think. Unlike most other editors, Kino does not apply effects at playback time; instead, an effect must be rendered when it is applied to the scene. The result is a new scene (even if the limit option described above is used) which contains the result of a new DV file created by the effect renderer. For good measure, the rendering code places the rendered file (with a name like 001.kinofx.dv) in the user's home directory, which can quickly become cluttered with them. This approach lets Kino display effects without performance problems, but it is a bit messy and inelegant.

While Kino only works with DV files, it has one of the nicest export dialogs around. There is a long list of options, one of which is DVD-style MPEG. There's even a "deinterlace" pulldown with a few options. The internal deinterlacer is, as advertised in the menu, very fast, but the results are not all that great.
If one, instead, has Kino use the external YUV deinterlacer, things will be exceedingly slow, but the results are worth it. Examples from both deinterlacers can be seen on the left.

By default, the DVD exporter creates the necessary video object file and a simple dvdauthor script for a minimal DVD. There are options, though, to burn the DVD immediately or to go into Qdvdauthor for further work.

One might mention here that, like most of the other tools discussed here, Kino does not play nicely with others when it comes to the audio subsystem. Each tool has its own way of responding to contention, though. In this case, if Kino is unable to get exclusive access to the audio device, it shows its displeasure by playing video (silently, of course) at ten times the normal speed. After a while one learns to recognize this particular tantrum, but it still would be nicer if the application would say something like "I'm not willing to share the audio device, can you please stop your music player if you want to play back your video?"

Bottom line: Kino is a reasonably capable editor which, after a very short learning period, is quick and fun to use. It may well be the best option for people with relatively simple needs. Those wanting more sophisticated capabilities, though, are likely to see it as an underpowered toy.

LiVES

The Linux Video Editing System (LiVES) is a relatively simple editor with some interesting capabilities. The web page claims:

    LiVES is good enough to be used as a VJ tool for professional performances, and as a video editor is capable of creating dazzling clips in a wide variety of formats.

Your editor, however, is not a VJ. So his experience with this tool was not the best.

The process of importing a video clip into LiVES is slow and disk-intensive. After some investigation, your editor figured out why: LiVES works by converting every video frame into a separate JPEG image file.
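
A back-of-the-envelope calculation gives a feel for the number of files involved. The Python sketch below assumes NTSC material at the standard 30000/1001 (roughly 29.97) frames per second and one file per frame; the clip lengths are made-up examples, and nothing here says anything about how large each JPEG ends up being.

```python
from fractions import Fraction

# Standard NTSC frame rate: 30000/1001 frames per second (about 29.97).
NTSC_FPS = Fraction(30000, 1001)


def frame_count(seconds, fps=NTSC_FPS):
    """Approximate number of frames (and thus per-frame image files) in a clip."""
    return round(seconds * fps)


for minutes in (5, 30):
    print(minutes, "minutes:", frame_count(minutes * 60), "files")
```

Under those assumptions a five-minute clip already needs 8991 files, and a half-hour tape works out to 53946.
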
The end result is a directory containing tens of thousands of images and a massive expansion in the size of the clip. It also cannot be good for system performance in general; your editor can only suggest that using a filesystem with indexed directories would be a good idea.

LiVES is one of those applications with such a sense of its own importance that it comes up maximized from the outset. The interface reconfigures itself on the fly depending on what operations are selected - in particular, video display windows come and go in a frequent and distracting manner. The default directory for video files is /usr/local. Cross-fading one clip into another works, but it loses the synchronization with the audio. Many tasks are done by running external programs; should one of those programs fail, LiVES will tell the user, but it does not pass on the information provided by that program. So figuring out why things fail is a matter of digging through debug and strace output.

Somewhere in this process, your editor decided that, while LiVES may indeed make VJs happy, it is not a serious editing tool for the rest of us. There is the potential for some nice features there, but this application needs a lot of work before it will be ready for general use.

PiTiVi

One gets used to thinking of video editors as being huge programs written in relatively fast languages. PiTiVi, however, is an exception to the rule: it's a smallish application written in Python. Of course, it's only small when one overlooks some of the external pieces - like gstreamer.

This application, too, was a bit of a challenge to get going. It has various dependencies not accounted for in its configure script, including some strange ones: why does a video editor need to import Zope modules? Still, your editor had better luck here than with some of the alternatives.

The good news is that, despite its Python implementation, PiTiVi is responsive when moving around in video clips.
On the other hand, moving around in clips is really about all that PiTiVi can do at this point. There is a rudimentary timeline display which does not do anything, and no editing options are available. So PiTiVi, while being a promising start, is not really an editor at this time.

Conclusion

Worth mentioning in passing: the Open Movie Editor looks like a tool with some promise. It disliked your editor's video files, though, claiming that it only supports files with a 25 frames/second rate. Your editor, deep in NTSC country, has no such files. Hopefully, as this project matures, it will achieve the generality this kind of tool must have.

The free software community can be aggravating sometimes. We clearly have the ability and the desire to create top-quality tools for tasks like video editing. But what we get is a half dozen tools, none of which is a complete solution to the problem. Your editor would be the first to say that competition between projects can be a good thing, inspiring everybody involved to push harder and achieve more. But, still, maybe having fewer competing tools might just help people to work together and make tools which are truly great.

That said, the state of the art in Linux video editing is not as bad as one might think. The tools are there to put together a decent video without a great deal of trouble. As mentioned above, Kdenlive is arguably the most polished of these tools, with Kino also being a good candidate for simpler applications. And Cinelerra remains in its position as the application that is going to be truly spectacular, once all of those loose ends finally get tied up.

Your editor once heard Lawrence Lessig say that text is like Latin for younger people today, and that video is the preferred way to communicate. If that is true, then we want to make it possible to communicate as richly as possible while using free tools. We have a good base to build on, and many smart people have solved many of the hardest problems.
Finishing the job is well within our capabilities.

The Grumpy Editor's video journey part 3: DVD authoring

As readers of the first part of this series will remember, your editor has set out on a project to digitize a set of old video tapes and turn them into properly-formatted DVD media suitable for handing out to the grandparents. Part 1 was about the task of capturing this data to disk; part 2 covers the video editors available for turning the captured data into something watchable, and part 3 covers the task of creating a DVD from the edited video.

Attentive readers may have noticed that part 2 has not yet been written; there are more editors available than your editor had expected (currently under review are Cinelerra CV, Kino, PiTiVi, LiVES, and Avidemux), so that process is taking longer than expected. For the purposes of this article, let us assume that your editor has a disk full of video clips which have been edited and properly formatted into the MPEG2/AC3 video object files expected by DVD players. There will be a discussion of the best ways to get those files there in the near future, promise.

Many of us have burned CDs and found the process to be relatively straightforward - the biggest obstacle is often just getting past the grumpiness built into cdrecord and its latter-day derivatives. Creating data DVDs is not a whole lot harder. So one might be inclined to approach the task of creating a video DVD with a "this will be easy" attitude. It is, in fact, a task just about anybody can learn to do, but it is on a different order of complexity than creating a CD full of music. A video DVD is, in truth, a program complete with its own hierarchical structure, menus, and code written for the simple virtual machine lurking within every DVD player. Creating a playable DVD requires writing that program.

If DVDs are programs, then the one compiler available for Linux systems is the command-line dvdauthor tool.
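
To give a sense of what the simplest such "program" looks like: dvdauthor reads an XML description of the disc. The Python sketch below generates a bare-bones, menu-less description with one title per video object; the function name and the file names are hypothetical, and a real disc with menus and buttons needs considerably more than this.

```python
import xml.etree.ElementTree as ET


def build_dvdauthor_xml(vob_files, dest="dvd"):
    """Return a minimal, menu-less dvdauthor project as an XML string.

    Each file gets its own <pgc>, so each one becomes a separate title.
    `dest` is the output directory dvdauthor is asked to write into."""
    root = ET.Element("dvdauthor", dest=dest)
    ET.SubElement(root, "vmgm")              # empty top-level menu set
    titleset = ET.SubElement(root, "titleset")
    titles = ET.SubElement(titleset, "titles")
    for path in vob_files:
        pgc = ET.SubElement(titles, "pgc")
        ET.SubElement(pgc, "vob", file=path)
    return ET.tostring(root, encoding="unicode")


# Hypothetical file names, for illustration only.
print(build_dvdauthor_xml(["birthday.vob", "recital.vob"]))
```

The resulting text would be saved to a file and handed to dvdauthor (typically with its -x option), which writes the DVD directory structure into the directory named by dest.
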
Regardless of how one builds a DVD, dvdauthor will be involved in the process at some point. This tool requires a collection of video objects representing the actual video titles and also implementing the menus, subtitles, and more. It's all tied together via a complex XML file (example) which is compiled by dvdauthor to create the final product.

It is possible to create all of these pieces by hand, and, doubtless, Real Linux Video Jocks would not do it any other way. One can use dvdauthor to help with the generation of parts of the XML file. There is documentation which seems fairly complete, if a bit terse. But the fact of the matter is that most people attempting to use this tool directly will give up in despair. There is no reason why DVD authors should have to work at this level; dvdauthor is essentially an assembler which, while being absolutely essential to do most of the heavy lifting, should be hidden from most polite company. DVD creation is a visual task; there should be visually-oriented tools for this job. The good news is that these tools do, indeed, exist.

DVDStyler

The first of these tools is DVDStyler, a GTK-based application. There are three basic tabs which are used to work through the tasks of piecing together a DVD; they are labeled "Directories," "Backgrounds," and "Buttons."

The directories tab pulls up a simple internal directory browser, useful for adding objects to the DVD. So, if the DVD author has a collection of VOB files containing video data, they can be found by way of this tab and added, one by one, to the DVD. Each object shows up in the bottom pane of the window, generally with an unhelpful annotation like "Title 2". There is no easy way to see what each of those titles is; one must query their properties and look at the associated file name.

As a grumpy aside, your editor must note that the directory browser uselessly starts at $HOME.
One need not work with much video data before realizing that special provisions must be made for its storage; video objects are unlikely to be kept in the home directory. Your editor has a hard time understanding why tools like this are unable to start file searches in the current working directory, which is a much more likely place to find things of interest. Switching to $HOME is not just a least-surprise violation; it actively makes things harder for the user.

The "Backgrounds" tab helpfully offers a dozen or so canned background images which can be used for the DVD menus. They are nice backgrounds, and they might just be useful for somebody struggling through the process of creating a DVD for the first time. Your editor, though, suspects that most users, by the time they create their second (working) DVD, might just want to supply their own background images. They will look for that option under the "Backgrounds" tab in vain, though. It is possible to supply a custom image: go to the large (video screen) pane, right-click, select "properties," and set an image there. It's easy, once you've figured it out. But one would think that, having gone to the trouble to provide an entire mode dedicated to background images, the developer would have thought to toss in a "none of the above" button.

The hardest part of creating a DVD (once one has suitable video in place, obviously) is getting the menus to work. DVDStyler starts with an empty main menu in place; it is up to the user to add entries which will do interesting things. That is done by way of the "Buttons" tab. There's a selection of arrows available, as well as the ability to add basic text buttons. The button of interest can be simply dragged to the right spot on the menu, sized appropriately, and configured to do the right thing. There are also "empty" buttons for more complicated situations where the real button text (or image) is found on the menu's background image.
Having added a button, the author must tell the system what happens in response to events on that button. To that end, there is a separate "properties" dialog. Usually one wants a button to cause a certain video title to be played, and that is easily configured. If more than one menu has been created, buttons can also be set to jump from one menu to the next. There is a "custom" blank for the harder cases which require direct entry of code to be executed by the DVD virtual machine. In DVDStyler, the selection of relatively obscure options (subtitles, languages, camera angles) can only be set up in this way.

Also required is a specification of what happens when one of the directional arrows is pressed. The default "auto" setting leaves that up to the player, which will probably do the right thing - the down arrow, for example, will move the focus to the next button below the current one. Anybody who is concerned about the user interface provided by the resulting DVD will probably want to set these actions explicitly, though - a somewhat tedious and time-consuming task.

Eventually, the time comes to actually create the DVD. Most first-time users will probably go to the DVD menu for this task, but the "burn" option is not there - it's under the "file" menu instead. The resulting dialog works nicely, giving the user the option to stop after generating the ISO image or to run a preview application (xine by default) before actually writing to the disk. Underneath this dialog is a whole set of helper commands which are run; those can be configured if need be, but most users will not tread there.

All told, your editor found DVDStyler to be the easier tool to use for quickly putting together a video disk. There is just one little problem: those disks never quite worked right on your editor's ancient DVD player. Somehow, a misunderstanding about how the menus should work crept in.
Your editor suspects, perhaps, that overlapping buttons may have something to do with it; the other application reviewed by your editor (QDVDAuthor) detected and corrected that situation, but DVDStyler did not. In any case, newer players had no problem with the generated disks, so this may not be a problem that most people need to be concerned with.

'Q' DVD-Author

The other DVD authoring application considered here is 'Q' DVD-Author (or qdvdauthor from here on out in an effort to save your editor's typing fingers). This is a Qt-based application aimed at providing complete DVD authoring capability. It is arguably more complete and mature than DVDStyler, but more complex as well.

Qdvdauthor provides a three-paned window with areas for the current set of audio/video objects, the DVD hierarchy, and the menu designer. The audio/video pane, on the left end, is clearly a work in progress. There is a thumbnail area which shows the opening frame of the associated video - sometimes. Other times it stays green and qdvdauthor silently leaves an mplayer process desperately cranking away in the background. It was only when the load average on your editor's system got to around 20 that he figured that one out. There is a "play" button which pops up a cheery "not yet implemented" button. The run time of each video title is also displayed. All told, it is a more useful display than what DVDStyler offers, with the potential to be quite a bit better yet.

The middle pane shows the current hierarchy of objects making up the DVD. It is a helpful display, given that DVDs truly are hierarchical objects. It likes to reset itself to the top, though, making it necessary to scroll repeatedly toward the bottom when the DVD gets more complex.

The right pane shows one of the DVD menus - or a couple of other things we'll see later on. One very nice feature is the little display at the bottom showing how much data has been committed to the DVD so far and how much room remains.
Video titles are easily added using the prominent "add movie" button. Once attention turns to the menu creation process, one notices that there is no separate "backgrounds" tab - but there is a button for adding a custom background image, which is what is really needed anyway. Your editor found that dragging a thumbnail from the video pane over to the menu area created a picture button which would play the associated title - a nice feature. The creation of text buttons (or those from a separate image) is a bit more labor-intensive, requiring the user to right-click on the background, select "add text", draw a rectangle to define the text area, fill in a rather gaudy text dialog (shown left) with the actual text (and tweak fonts and such), right-click on the newly-added text and select "define as button", then fill in the button properties dialog (shown right). That last step involves setting the button name (necessary - it would be nice if it defaulted to the button text) and picking the various associated actions. It takes a while. Eventually, the time comes to commit all of that work to an actual DVD. A click on the associated button gets that process going. If one has been sloppy in drawing out buttons, the first thing to come up will be a warning that some of the buttons overlap, accompanied by an offer to fix the problem automatically. One can also decline the offer (aborting the process) to fix the problem manually. This is as good a point as any to note that moving and resizing buttons in qdvdauthor is a real exercise in pain. The button areas have the usual grab points for moving, dragging edges and corners, or rotating the button. But none of those are visible until the user has clicked the mouse and committed himself to doing something. The end result is that attempts to drag a button often do something else - like rotating them to some strange angle. 
The basic interaction modes for operating on graphical objects in a display have been well understood for years; one can only imagine that whoever designed this interface was engaging in some sort of sadistic exercise which was sponsored by purveyors of strong drink. Once the buttons have been sorted out, selecting the burn operation brings up a rather intimidating dialog showing all of the commands which will be executed to get the job done. It's at this point that one realizes just how much behind-the-scenes magic is going on to make the DVD creation process actually happen. There are options to disable specific parts of the process (actually burning the disk, for example), and the adventurous can edit the commands before they run. Most people, though, will probably just hit the "OK" button at the bottom and watch the process unfold. Which it does, just as one would expect. There's a few other nice features hidden in this application. The menu pane can be made to show the XML file which will be generated for dvdauthor; it can also be put into a garish and complex dialog which facilitates the addition of subtitles. There is a template mechanism for menus, and a network-based repository from which qdvdauthor can download new templates. There is an operation which will convert the entire DVD between the NTSC and PAL formats - your editor has not yet exercised this option, but, given that some of the grandparents for whom this work is intended live in Europe, it will eventually come in handy. There is a little-used plugin mechanism and a theme feature as well; long-neglected Motif users will be glad to know there is a style for them. The addition of audio to menus and intro/outro sequences to titles is relatively straightforward. There is also an option to make DVD slideshows out of a series of still images. Conclusion Either one of these applications can get the job done. 
They both show the best of how an application on a Unix-like system can add power by using existing tools. Neither DVDStyler nor qdvdauthor actually does much of the work of creating menus or burning DVDs; they mostly just put together fiendishly-complex command lines and call out to the tools which have been designed to do that work well. Overall, the combination works reasonably well. A feature which is lacking from both tools is a "hold my hand" mode for people who are not - and do not want to be - experts in DVD creation. A sequence of screens which would set up an initial menu, import titles, and create buttons for each would be most helpful in this regard. As it is, users must have their own internal checklist in mind when creating DVDs, and it is easy to miss things. Your editor, while certainly slower than most, is unlikely to be the only one to have created an impressive pile of coasters before finally producing a DVD which actually worked as intended. While the tools edited here are, in your editor's opinion, the best available for Linux for this task, there are some others to be aware of: Tovid is a set of command-line tools for the creation of DVD menus and putting the whole structure together. They hide much of the underlying complexity and may prove useful for users not wanting to work with a graphical interface. VideoLink is an interesting tool which enables the creation of DVD menus in HTML. It then renders them with a web browser and prepares the result for burning to a DVD. Kino (which will be covered in depth in part 2) can produce a simple dvdauthor script to make a no-menu DVD with a single title. KDE DVD Authoring Wizard is a kdialog script which steps the user through the creation of a simple DVD. It provides the handholding mentioned above, but, arguably, simplifies out too much of the process. Of all these tools, it must be said that qdvdauthor is, at this time, the most complete and capable. 
It provides access to almost any capability supported by current DVD players, is relatively easy to use, and works most of the time. With luck, the developers (who released the 1.0.0 version reviewed here in November, 2007) will devote themselves to smoothing out the remaining rough edges, leaving us with a tool which DVD authors at any level can use.

The future of unencrypted web traffic

Hypertext transfer protocol (http) is the heart of the web, providing the means to retrieve content from remote servers. It is an unencrypted, text-based protocol, which allows malicious intermediaries to snoop on and potentially modify the traffic. Unfortunately, internet service providers (ISPs) are getting increasingly bold in manipulating the traffic that they carry. This has led some to call for the elimination of http, in favor of encrypted http (aka secure http or https).

An ISP is perfectly situated to gather an enormous amount of information about its users, their website preferences and habits (often called clickstream data). Some have reportedly been selling some of that data in a thinly-anonymized form to advertisers and others. As AOL's well-intentioned, but poorly implemented, release of search queries showed, it is rather easy to analyze this kind of data and pierce the anonymity, identifying the specific user.

Another recent ISP trick is to modify a retrieved web page to display other information – under the control of the ISP – which looks like it comes from the website itself. Canadian ISP Rogers Internet has been testing a system to add content to the Google homepage for their customers who are near their monthly bandwidth limits. There are also plans afoot for ISPs to use clickstream data to target advertising – though just where those ads would show up is far from clear.

This kind of manipulation is unlikely to be what internet users expect – to the extent they think about it at all.
The model folks tend to use is that of a phone company; we do not expect them to sell our call records to the highest bidder, nor do we give them license to modify our calls. Various telecommunications privacy laws protect that data, but those laws have not (yet) been applied to internet traffic. In addition, ISPs tend to have a monopoly or near-monopoly, which restricts alternative, less-intrusive ISPs from competing. Fortunately, there are technical solutions possible in the internet realm that would be difficult or impossible to implement network-wide in the phone system. Encrypting website traffic will go a long way towards eliminating this kind of ISP abuse, though it is no panacea. As more of these kinds of privacy invasions occur, we should see more routine use of https by websites. Currently, https is almost exclusively used for e-commerce transactions; typing in credit card numbers and the like. Authentication via username and password is another area that sees widespread encrypted pages. Sites may start to use https for their entire site to combat clickstream and page rewriting abuse – though there will still be some information leakage as the ISPs can still see what sites are being visited. In order to make an https connection, the server must have a certificate with its public key. Typically those are signed by an authority recognized by browsers which allows the browser to authenticate that the certificate belongs to the host visited. Getting signed certificates is a bit cumbersome, costs some money, and they need to be renewed periodically – all of which adds up to a headache for a site, especially a small, non-commercial site, that wants to switch to using https. Self-signed certificates are an alternative, but because they are susceptible to man-in-the-middle attacks, browsers warn their users when they receive one. Another problem with this approach is the extra processing required on the server to support encrypting each and every request. 
There is a non-trivial amount of extra work that must be done per request and cannot be cached. Sites that wish to avoid the problems that some ISPs are introducing will just have to bear that cost.

Pushing bits is not very glamorous, but that is really what one hires an ISP to do. Since they seem to be finding new and exciting ways to interfere with those bits – Comcast messing with BitTorrent traffic, for example – internet users will have to find ways to thwart their schemes, and encryption will be a big part of that effort. Using https site-wide is only one step; other services will also need to be protected from ISP abuse. What if an ISP started manipulating the results returned from DNS queries, perhaps routing some to a server they control?

Development issues part 1: Project communication

Free software projects, like all projects, live and die by their communications; developers must be able to talk to each other easily so that a consistent, coherent result emerges. But developers have differing ideas about what methods to use. A discussion on the Emacs development list provides a nice contrast between two of the main communications methods used by projects today.

Traditionally, developer communications have been handled by the venerable mailing list, but that is changing, at least for some projects. Internet relay chat (IRC) has become the tool of choice for newer projects, which may leave those who are not inclined towards realtime communication out of the loop. Development methodologies are evolving, and some are adopting the new ways more quickly than others – some may never adopt them at all.

The difference between communicating in IRC or via a mailing list is in some ways like the difference between text messaging and email. Email has its advantages, in that the recipient chooses the time to read and respond to the message, but it is often seen as slow.
Text messaging or IRC have the advantage of speed; people receive a message and generally respond immediately. But that speed comes at a cost – interrupting the recipient. It also requires a full-time internet connection. While email archives are somewhat cumbersome to use, they are usable. IRC logs are exceedingly painful as they are not subject-based; they just cover a specific time span of all conversation on the channel. Email conversations may play out over days or weeks, but they are generally easier to follow compared to the multiple interleaved conversations that occur on IRC channels. It is in the nature of the medium: IRC conversations are meant to be used immediately, not reread weeks later. It is, in some ways, a culture clash. Younger developers tend to be more inclined towards realtime communications, while older hackers tend to be more comfortable with mailing lists. In what would seem to be an uphill battle, Eric S. Raymond has been advocating a more "modern" development style for GNU Emacs. His messages, appearing on Emacs-devel, champion a development style that includes IRC communication, a bug tracking system, and a version control system (VCS) more advanced than CVS. Raymond's experiences working with the Battle for Wesnoth development team exposed him to some of the newer techniques used in project communication, particularly IRC. He reached a somewhat surprising conclusion about IRC: And far from finding I can't keep up, I've discovered that I like the stimulation. I grok how the kids feel about this, because mailing-list-only projects have started to seem slow and boring to me, too. The Wesnoth project uses IRC for all day-to-day design and development decisions, leaving the mailing list for more complicated discussions and white papers. This has the effect of excluding interested developers who are not able or willing to monitor an IRC channel throughout their day, but that is unlikely to be the intent. 
The reverse is also true: the perceived slow pace of mailing-list only projects has the effect of excluding those with a strong preference for a faster style of development. As Raymond shows, though, there is hope that members of one school can retrain – if they wish – for the other. While decision making by IRC does not seem to be in the cards any time soon for Emacs, an upgrade to something other than CVS seems to have gained more traction. Richard Stallman has been asking a lot of questions about git while other developers discuss other distributed version control systems (DVCS), like darcs, monotone, arch, and Mercurial. Raymond is working on a survey of the VCS landscape that, once completed, he and others hope will guide the project into a better VCS choice. One of the main DVCS features that seems of interest to Stallman is the "offline" capabilities. Having the entire history of a project and being able to do commits of work in progress while being disconnected from the internet are features that CVS does not have. Stallman is adamant that the tools used to develop Emacs be usable by those who are not always connected to the net which makes a DVCS rather attractive. The Emacs project is one of the oldest free software projects in existence; it is, like its founder, fairly resistant to change. While Emacs itself is used by hackers everywhere, it is increasingly falling behind its competitors, at least partially because of the slow pace at which it is developed. Raymond's belief is that by upgrading the tools used to take advantage of advances made since CVS and mailman were new, the time between Emacs releases could be reduced to something more sane. Doing that could go a long way towards making Emacs more relevant to younger hackers: When those Eclipse fans pointed and laughed because we're still stuck on CVS and don't have a bug tracker, what counter could I have had? 
They know these are bad choices and they know that I know it -- so when they write off Emacs as old, tired, and irrelevant to anything they're interested in, I find it increasingly difficult to reply. It is unlikely that just some tool changes will be enough to resurrect the flagging popularity of Emacs, but there are hopeful signs. Some of Raymond's suggestions met a warmer reception than one might have expected. It is clear that a fair number of Emacs fans and developers are frustrated with the current state of affairs. It may be that "just some tool changes" are enough to reinvigorate the project to a point where it attracts more developers and users. That can only be a good thing for Emacs. The Linux Libertine Open Fonts Project The Libertine Open Fonts Project, which first showed up on LWN in May, 2006, is an open source font project. The project's leader is Philipp H. Poll. The Libertine project description states: Letters and fonts have two charakteristics: On the one hand they are basic elements of communication and fundament of our culture, on the other hand they are cultural goods and artcraft. You are able to see just the first aspect, but when it comes to software you’ll see copyrights and patents even on the most elementary fonts. Therefore we want to give you a free alternative: This is why we founded the Libertine Open Fonts Project. The Libertine license information states: Our fonts are free in the sense of the GPL and OFL. In a nutshell: Changing the font is allowed as long as the derivative work is published under the same license again. Pedantics keep claiming that the embedded use of GPL-fonts in i.e. PDFs requires the free publication of the PDF as well. This, of course, is absolute nonsense, because - to our opinion - the font is not significantly changed by the embedding. To abolish the conflict some members of the FSF have written an addition to the license: the so called “Font Exception”. 
Our fonts’ GPL contains this font exception (since version 2.7). Since version 2.1.9 LinuxLibertine is also licensed under the OFL, which will clarify usability-conflicts. The Libertine font files are available as both TTF (TrueType) and OTF (OpenType) fonts. The Linux-compatible LaTeX typesetting system supports the Libertine fonts. See the Libertine LaTeX document [PDF] for usage and installation instructions. Libertine includes a wide variety of Font Styles. Numerous languages are supported, and many special characters are available. For a look at some of the LaTeX accessible font characters, see the glyph list document [PDF]. Version 2.7.9 of the Libertine font project was recently announced. This release adds hinting, which allows the fonts to be used with Microsoft Word. Other changes include improved kern pairs for better typography, some minor tweaks and some bug fixes. The libertine fonts are available for download here. The fonts come in a standard .tgz file which includes all of the font collections as both .ttf and .otf files. The Fontforge source files are also available. Fontforge is an open-source outline font editor. RCU part 3: the RCU API [Editor's note: this is the third and final installment in Paul McKenney's "What is RCU?" series. The first and second parts remain available for those who might have missed them. Many thanks to Paul for letting LWN run these articles.] Introduction Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU is most frequently described as a replacement for reader-writer locking, but has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. 
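The claim that readers never synchronize directly with updaters can be illustrated with a toy model. The sketch below is not kernel code: it is written in Python, the names `Route` and `RcuCell` are hypothetical, and Python's garbage collector stands in for the grace-period machinery that the kernel needs before an old version may be freed. It assumes CPython, where loading a single object reference is atomic.

```python
import threading


class Route:
    """One version of the shared data; never modified once published."""

    def __init__(self, gateway, metric):
        self.gateway = gateway
        self.metric = metric


class RcuCell:
    """Toy read-copy-update cell (hypothetical name, not a kernel API)."""

    def __init__(self, initial):
        self._current = initial
        # Serializes updaters against each other; readers never take it.
        self._update_lock = threading.Lock()

    def read(self):
        # A reader just loads the current reference and uses that version.
        return self._current

    def update(self, **changes):
        with self._update_lock:
            old = self._current
            new = Route(old.gateway, old.metric)  # copy
            for name, value in changes.items():
                setattr(new, name, value)          # update the private copy
            self._current = new                    # publish the new version
        return old


cell = RcuCell(Route("10.0.0.1", 5))
in_flight = cell.read()           # a reader that started before the update
replaced = cell.update(metric=1)  # an updater changes the data meanwhile
```

The reader that obtained `in_flight` before the update keeps seeing a consistent old version, while any later `read()` sees the new one. What this model leaves out is exactly what the kernel API has to provide: a way to know when no reader can still hold the old version.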
This leads to the question "what exactly is RCU?", a question that this document addresses from the viewpoint of the Linux kernel's RCU API, in the following sections:

RCU has a Family of Wait-to-Finish APIs
RCU has Publish-Subscribe and Version-Maintenance APIs
So, What is RCU Really?

These sections are followed by a references section and the answers to the Quick Quizzes.

RCU has a Family of Wait-to-Finish APIs

The most straightforward answer to "what is RCU" is that RCU is an API used in the Linux kernel, as summarized by the pair of tables in this section (the first table shows the wait-for-RCU-readers portions of the API, while the second table shows the publish/subscribe portions of the API). Or, more precisely, RCU is a family of APIs as shown in the first table, with each column corresponding to a member of the RCU API family.

If you are new to RCU, you might consider focusing on just one of the columns in the following table. For example, if you are primarily interested in understanding how RCU is used in the Linux kernel, "RCU Classic" would be the place to start, as it is used most frequently. On the other hand, if you want to understand RCU for its own sake, "SRCU" has the simplest API. You can always come back for the other columns later. If you are already familiar with RCU, the following pair of tables can serve as a useful reference.

Quick Quiz 1: Why are some of the cells in the above table colored green?

The "RCU Classic" column corresponds to the original RCU implementation, in which RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), which may be nested. The corresponding synchronous update-side primitives, synchronize_rcu(), along with its synonym synchronize_net(), wait for any currently executing RCU read-side critical sections to complete. The length of this wait is known as a "grace period". The asynchronous update-side primitive, call_rcu(), invokes a specified function with a specified argument after a subsequent grace period.
For example, call_rcu(p,f); will result in the "RCU callback" f(p) being invoked after a subsequent grace period. There are situations, such as when unloading a module that uses call_rcu(), when it is necessary to wait for all outstanding RCU callbacks to complete. The rcu_barrier() primitive does this job. In the "RCU BH" column, rcu_read_lock_bh() and rcu_read_unlock_bh() delimit RCU read-side critical sections, and call_rcu_bh() invokes the specified function and argument after a subsequent grace period. Note that RCU BH does not have a synchronous synchronize_rcu_bh() interface, though one could easily be added if required. Quick Quiz 2: What happens if you mix and match? For example, suppose you use rcu_read_lock() and rcu_read_unlock() to delimit RCU read-side critical sections, but then use call_rcu_bh() to post an RCU callback? In the "RCU Sched" column, anything that disables preemption acts as an RCU read-side critical section, and synchronize_sched() waits for the corresponding RCU grace period. This RCU API family was added in the 2.6.12 kernel, which split the old synchronize_kernel() API into the current synchronize_rcu() (for RCU Classic) and synchronize_sched() (for RCU Sched). Note that RCU Sched does not have an asynchronous call_rcu_sched() interface, though one could be added if required. Quick Quiz 3: What happens if you mix and match RCU Classic and RCU Sched? The "Realtime RCU" column has the same API as does RCU Classic, the only difference being that RCU read-side critical sections may be preempted and may block while acquiring spinlocks. The design of Realtime RCU is described in the LWN article The design of preemptible read-copy-update. Quick Quiz 4: What happens if you mix and match Realtime RCU and RCU Classic? The "SRCU" column displays a specialized RCU API that permits general sleeping in RCU read-side critical sections, as was described in the LWN article Sleepable RCU. 
Of course, use of synchronize_srcu() in an SRCU read-side critical section can result in self-deadlock, and so should be avoided. SRCU differs from earlier RCU implementations in that the caller allocates an srcu_struct for each distinct SRCU usage. This approach prevents SRCU read-side critical sections from blocking unrelated synchronize_srcu() invocations. In addition, in this variant of RCU, srcu_read_lock() returns a value that must be passed into the corresponding srcu_read_unlock().

The "QRCU" column presents an RCU implementation with the same API structure as SRCU, but optimized for extremely low-latency grace periods in the absence of readers, as described in the LWN article Using Promela and Spin to verify parallel algorithms. As with SRCU, use of synchronize_qrcu() in a QRCU read-side critical section can result in self-deadlock, and so should be avoided. Although QRCU has not yet been accepted into the Linux kernel, it is worth mentioning given that it is the only RCU implementation that can boast deep sub-microsecond grace-period latencies.

Quick Quiz 5: Why do both SRCU and QRCU lack asynchronous call_srcu() or call_qrcu() interfaces?

Quick Quiz 6: Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?

The Linux kernel currently has a surprising number of RCU APIs and implementations. There is some hope of reducing this number, evidenced by the fact that a given build of the Linux kernel currently has at most three implementations behind four APIs (given that RCU Classic and Realtime RCU share the same API). However, careful inspection and analysis will be required, just as would be required for one of the many locking APIs.

RCU has Publish-Subscribe and Version-Maintenance APIs

Fortunately, the RCU publish-subscribe and version-maintenance primitives shown in the following table apply to all of the variants of RCU discussed above.
This commonality can in some cases allow more code to be shared, which certainly reduces the API proliferation that would otherwise occur. The first pair of categories operate on Linux struct list_head lists, which are circular, doubly-linked lists. The list_for_each_entry_rcu() primitive traverses an RCU-protected list in a type-safe manner, while also enforcing memory ordering for situations where a new list element is inserted into the list concurrently with traversal. On non-Alpha platforms, this primitive incurs little or no performance penalty compared to list_for_each_entry(). The list_add_rcu(), list_add_tail_rcu(), and list_replace_rcu() primitives are analogous to their non-RCU counterparts, but incur the overhead of an additional memory barrier on weakly-ordered machines. The list_del_rcu() primitive is also analogous to its non-RCU counterpart, but oddly enough is very slightly faster due to the fact that it poisons only the prev pointer rather than both the prev and next pointers as list_del() must do. Finally, the list_splice_init_rcu() primitive is similar to its non-RCU counterpart, but incurs a full grace-period latency. The purpose of this grace period is to allow RCU readers to finish their traversal of the source list before completely disconnecting it from the list header -- failure to do this could prevent such readers from ever terminating their traversal. Quick Quiz 7: Why doesn't list_del_rcu() poison both the next and prev pointers? The second pair of categories operate on Linux's struct hlist_head, which is a linear linked list. One advantage of struct hlist_head over struct list_head is that the former requires only a single-pointer list header, which can save significant memory in large hash tables. The struct hlist_head primitives in the table relate to their non-RCU counterparts in much the same way as do the struct list_head primitives. 
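The difference between list_del() and list_del_rcu() described above can be made concrete with a toy model. The following Python sketch is only an illustration under stated assumptions: `Node`, `POISON`, `list_del_rcu_model()`, and `finish_walk()` are hypothetical names, the memory barriers and grace periods of the real primitives are not modeled, and a single thread plays both the reader and the updater.

```python
POISON = object()  # stand-in for the kernel's list-poison pointer values


class Node:
    """Element of a circular, doubly-linked list (like struct list_head)."""

    def __init__(self, value):
        self.value = value
        self.next = self
        self.prev = self


def list_add_tail(new, head):
    last = head.prev
    new.prev, new.next = last, head
    last.next = new
    head.prev = new


def _unlink(node):
    node.prev.next = node.next
    node.next.prev = node.prev


def list_del(node):
    """Non-RCU removal: both pointers are poisoned."""
    _unlink(node)
    node.next = POISON
    node.prev = POISON


def list_del_rcu_model(node):
    """RCU-style removal: next is left intact for readers already here."""
    _unlink(node)
    node.prev = POISON


def finish_walk(pos, head):
    """Continue a forward traversal from pos until it wraps around to head."""
    seen = []
    while pos is not head:
        seen.append(pos.value)
        pos = pos.next
    return seen


head = Node(None)
n1, n2, n3 = Node(1), Node(2), Node(3)
for n in (n1, n2, n3):
    list_add_tail(n, head)

reader_pos = n2                                 # a reader is standing on n2
list_del_rcu_model(n2)                          # an updater removes n2
in_flight_view = finish_walk(reader_pos, head)  # [2, 3]
fresh_view = finish_walk(head.next, head)       # [1, 3]
```

The reader that was already on the removed element can still follow its next pointer back into the list, while new traversals no longer see it. Had list_del() been used instead, the same reader would have stepped onto the poison value.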
The final pair of categories operate directly on pointers, and are useful for creating RCU-protected non-list data structures, such as RCU-protected arrays and trees. The rcu_assign_pointer() primitive ensures that any prior initialization remains ordered before the assignment to the pointer on weakly ordered machines. Similarly, the rcu_dereference() primitive ensures that subsequent code dereferencing the pointer will see the effects of initialization code prior to the corresponding rcu_assign_pointer() on Alpha CPUs. On non-Alpha CPUs, rcu_dereference() documents which pointer dereferences are protected by RCU. Quick Quiz 8: Normally, any pointer subject to rcu_dereference() should always be updated using rcu_assign_pointer(). What is an exception to this rule? Quick Quiz 9: Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members? So, What is RCU Really? At its core, RCU is nothing more nor less than an API that supports publication and subscription for insertions, waiting for all RCU readers to complete, and maintenance of multiple versions. That said, it is possible to build higher-level constructs on top of RCU, including the reader-writer-locking, reference-counting, and existence-guarantee constructs listed in the companion article. Furthermore, I have no doubt that the Linux community will continue to find interesting new uses for RCU, just as they do for any of a number of synchronization primitives throughout the kernel. Finally, a complete view of RCU would also include all of the things you can do with these APIs. Acknowledgements We are all indebted to Andy Whitcroft, Jon Walpole, and Gautham Shenoy, whose review of an early draft of this document greatly improved it. I owe thanks to the members of the Relativistic Programming project and to members of PNW TEC for many valuable discussions. I am grateful to Dan Frye for his support of this effort. 
This work represents the view of the author and does not necessarily represent the view of IBM. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others. References This section gives a short annotated bibliography describing using RCU, Linux-kernel RCU implementations, background, and historical perspectives. For more information, see Paul E. McKenney's RCU Page. Using RCU Overview of Linux-Kernel Reference Counting (McKenney, January 2007) [PDF]. Overview of Linux-kernel reference counting (including RCU) prepared for the Concurrency Working Group of the C/C++ standards committee. RCU and Unloadable Modules (McKenney, January 2007). Describes how to unload modules that use call_rcu(), so as to avoid RCU callbacks trying to use the module after it has been unloaded. Recent Developments in SELinux Kernel Performance. James Morris describes a performance problem in the SELinux Access Vector Cache (AVC), and its resolution via RCU in a patch by Kaigai Kohei. Using Read-Copy-Update Techniques for System V IPC in the Linux 2.5 Kernel (Arcangeli et al., June 2003) [PDF]. Describes how RCU is used in the Linux kernel's System V IPC implementation. Linux-Kernel RCU Implementations The design of preemptible read-copy-update (McKenney, October 2007). Describes a high-performance RCU implementation for realtime use. Sleepable RCU (McKenney, October 2006). Description of SRCU. Using Promela and Spin to verify parallel algorithms (McKenney, August 2007). Description of the QRCU patch. RCU dissertation (McKenney, July 2004) [PDF]. Section 2.2.20 (pages 62-64) gives a history of RCU-like mechanisms, a very brief summary of which can be found below. Chapter 4 (pages 71-98) and Appendix C (pages 326-345) review a number of different types of RCU implementations, summarizing a number of earlier papers. Chapter 5 (pages 137-178) gives an overview of a number of "design patterns" guiding use of RCU. 
Chapter 6 (pages 179-234) describes some early uses of RCU. Using RCU in the Linux 2.5 Kernel (October 2003). Brief summary of why RCU can be helpful, along with an analogy between RCU and reader-writer locking. Anyone who is laboring under the misapprehension that the Linux community would never have independently invented RCU should read this netdev posting and this one as well. Both postings pre-date the earliest known introduction of RCU to the Linux community. Background Real-Time Linux Wiki. Provides much valuable information on the -rt patchset for both kernel and application developers. Home of the -rt kernel patchsets. Memory Ordering in Modern Microprocessors (McKenney, August 2005) [PDF]. Gives an overview of how Linux's memory-ordering primitives work on a number of computer architectures. Historical Perspectives on RCU and Related Mechanisms Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System (Gamsa et al., February 1999) [PDF]. Independent invention of a mechanism very similar to RCU. Tornado is a research operating system developed at the University of Toronto. This operating system uses its analog to RCU pervasively. Some of the University of Toronto students brought this operating system with them to IBM Research, where it was developed as part of the K42 project. Read-Copy Update: Using Execution History to Solve Concurrency Problems (McKenney and Slingwine, October 1998) [PDF]. First non-patent publication of DYNIX/ptx's RCU implementation. Passive Serialization in a Multitasking Environment (Hennessey et al., February 1989). This patent describes an RCU-like mechanism that was apparently used in IBM's VM/XA mainframe hypervisor. This is the earliest known production use of an RCU-like mechanism. Concurrent Manipulation of Binary Search Trees (Kung and Lehman, September 1980). The earliest known publication of an RCU-like mechanism, using a garbage collector to implicitly compute grace periods. 
Answers to Quick Quizzes

Quick Quiz 1: Why are some of the cells in the above table colored green?

Answer: The green API members (rcu_read_lock(), rcu_read_unlock(), and call_rcu()) were the only members of the Linux RCU API that Paul E. McKenney was aware of back in the mid-90s. During this timeframe, he was under the mistaken impression that he knew all that there is to know about RCU.

Quick Quiz 2: What happens if you mix and match? For example, suppose you use rcu_read_lock() and rcu_read_unlock() to delimit RCU read-side critical sections, but then use call_rcu_bh() to post an RCU callback?

Answer: If there happened to be no RCU read-side critical sections delimited by rcu_read_lock_bh() and rcu_read_unlock_bh() at the time call_rcu_bh() was invoked, RCU would be within its rights to invoke the callback immediately, possibly freeing a data structure still being used by the RCU read-side critical section! This is not merely a theoretical possibility: a long-running RCU read-side critical section delimited by rcu_read_lock() and rcu_read_unlock() is vulnerable to this failure mode. This vulnerability disappears in -rt kernels, where RCU Classic and RCU BH both map onto a common implementation.

Quick Quiz 3: What happens if you mix and match RCU Classic and RCU Sched?

Answer: In a non-PREEMPT or a PREEMPT kernel, mixing these two works "by accident" because in those kernel builds, RCU Classic and RCU Sched map to the same implementation. However, this mixture is fatal in PREEMPT_RT builds using the -rt patchset, due to the fact that Realtime RCU's read-side critical sections can be preempted, which would permit synchronize_sched() to return before the RCU read-side critical section reached its rcu_read_unlock() call. This could in turn result in a data structure being freed before the read-side critical section was finished with it, which could in turn greatly increase the actuarial risk experienced by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired by the need for preemptible RCU read-side critical sections.

Quick Quiz 4: What happens if you mix and match Realtime RCU and RCU Classic?

Answer: That would be up to you, because you would have to code up changes to the kernel to make such mixing possible. Currently, any kernel running with RCU Classic cannot access Realtime RCU and vice versa.

Quick Quiz 5: Why do both SRCU and QRCU lack asynchronous call_srcu() or call_qrcu() interfaces?

Answer: Given an asynchronous interface, a single task could register an arbitrarily large number of SRCU or QRCU callbacks, thereby consuming an arbitrarily large quantity of memory. In contrast, given the current synchronous synchronize_srcu() and synchronize_qrcu() interfaces, a given task must finish waiting for a given grace period before it can start waiting for the next one.

Quick Quiz 6: Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?

Answer: In principle, you can use synchronize_srcu() with a given srcu_struct within an SRCU read-side critical section that uses some other srcu_struct. In practice, however, doing this is almost certainly a bad idea. In particular, the following could still result in deadlock:

Quick Quiz 7: Why doesn't list_del_rcu() poison both the next and prev pointers?

Answer: Poisoning the next pointer would interfere with concurrent RCU readers, who must use this pointer. However, RCU readers are forbidden from using the prev pointer, so it may safely be poisoned.

Quick Quiz 8: Normally, any pointer subject to rcu_dereference() must always be updated using rcu_assign_pointer(). What is an exception to this rule?
Answer: One such exception is when a multi-element linked data structure is initialized as a unit while inaccessible to other CPUs, and then a single rcu_assign_pointer() is used to plant a global pointer to this data structure. The initialization-time pointer assignments need not use rcu_assign_pointer(), though any such assignments that happen after the structure is globally visible must use rcu_assign_pointer(). However, unless this initialization code is on an impressively hot code-path, it is probably wise to use rcu_assign_pointer() anyway, even though it is in theory unnecessary. It is all too easy for a "minor" change to invalidate your cherished assumptions about the initialization happening privately.

Quick Quiz 9: Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?

Answer: It can sometimes be difficult for automated code checkers such as "sparse" (or indeed for human beings) to work out which type of RCU read-side critical section a given RCU traversal primitive corresponds to. For example, consider the following: Is the rcu_dereference() primitive in an RCU Classic or an RCU Sched critical section? What would you have to do to figure this out?

Development issues part 2: Bug tracking

Once upon a time, free software was a relatively rare commodity, and there was a real novelty in being able to run a free package for a specific purpose. The availability of a free C compiler, for example, was cause for celebration. The fact that said compiler was not always the most reliable program on the system did little to reduce enthusiasm; many of us persisted in irrational endeavors (like trying to use gcc to build the X Window System) despite the occasionally painful (and predictable) consequences. And, in the process, we helped to make both programs more reliable.
There comes a time, though, when even the most die-hard free software proponent wishes that things would just work. As our software finds its way into more situations where failures are unwelcome (at best), the level of tolerance for bugs is falling. The desire for fewer flaws, however, runs counter to the desire for increasingly capable (and thus more complex) software. Somehow we have to find ways to simultaneously grow our systems and reduce the total number of bugs. To this end, a few projects have been having some interesting discussions on the tracking and fixing of bugs.

As has been discussed in this companion article, Eric Raymond has been busily stirring up trouble on the Emacs development list. His point, deemed reasonable by your editor, is that Emacs must adopt a number of relatively modern development practices if it is to have any hope of remaining relevant at all. One of his key points is that Emacs needs to have a real bug tracking system. Says Eric:

Now I consider Emacs: 1100K lines, a COCOMO estimate of over 328 years, and no issue database. I think I understand much better now why the team has only been able to ship one release in five years. Trying to converge on a releasable state with as poor a view of the Emacs bug load as we have must be damn near impossible.

While some of Eric's suggestions appear to be non-starters - imagine trying to get Richard Stallman to hang out on an IRC channel - the bug tracker suggestion might just go somewhere. Certainly it could only be an improvement for a project of that size to have some sort of idea of what the current list of outstanding bugs looks like. It might even help bring about another Emacs release before the end of the decade.

Bug trackers are not a magical solution to the bug problem, though; in fact, they can create some problems of their own. The Fedora project, which does have a bug tracker, is currently trying to figure out what to do with the contents of that tracker.
It seems that said tracker contains over 13,000 bugs, almost 10,000 of which apply to Fedora 7 and later. A bug database of this size is simply overwhelming to anybody who tries to do something about it. As a result, Fedora users are filing bugs, only to see nothing happen in response. Not even a "thanks for your report" message. This situation is discouraging for everybody involved, causing Fedora users to give up on reporting bugs and developers to fear looking at the tracker. In the Fedora case, there appears to be a near-consensus that the biggest problem is in triaging bug entries. This is not a job which can be automated; somebody has to go through bug submissions, weed out the duplicates, identify those which are really "features," figure out which developer should be notified, etc. Tying bug entries to those found in upstream trackers would be a highly useful bonus. Without this sort of effort, the bug tracker quickly fills with low-quality entries which help nobody. For the most part, nobody is doing this job for Fedora now. Red Hat is not paying for a staff member to triage bugs, and the wider community has not filled this gap. In the short term, any sort of solution looks like it will have to come from the community, so the Fedora folks are wondering what can be done to encourage more participation. Simply asking for help is the obvious first step, as is making sure that the process is easy. Then they may consider the tactics adopted by other large projects - Mozilla's policy of expressing its appreciation by sending a T-shirt, for example. As an aside, one of the more useful bits of information to come from this discussion was the existence of this family of URLs: Fill in the name, and the result is an immediate list of open bugs for the given package. Thus, for example, a visit to bugz.fedoraproject.org/gcc yields a list of compiler bugs. This result can be had directly from bugzilla, of course, but this interface is faster and easier. 
The Fedora developers have discussed a number of related issues, such as whether the Fedora bug database should be separated from the RHEL system and what can be done to make Red Hat better appreciate the value of doing more of its quality assurance work in the Fedora repository. But the core problem is just getting human attention applied to the bug reports. Digging through bug databases is a relatively unglamorous job; it is not an easy path toward rock-star hacker status. But it is an important and relatively easy way to help make free software better.

Just in time to serve as an example of how well bug management can work, the GNOME project has posted its annual bugzilla statistics. It seems that over 110,000 GNOME bugs were filed in 2007; almost 109,000 of them were closed. The top bug-closers for the year were:

It is worth pondering for a moment on the amount of energy required to close over 14,000 bugs in a year - that's almost 40 per day, every day, without a break. This kind of energy does exist within our community, and some projects are putting it to very good use.

While it is easy to get a contrary impression, the kernel does, in fact, have a bug tracker; there is also, in the form of Natalie Protasevich, somebody who handles the care and feeding of that tracker. But, as a recent episode shows, that still is not always sufficient to actually get the bugs fixed.

On November 13, 2007, a bug in the SCSI subsystem was reported to the linux-kernel mailing list. It was put into the tracker as bug 9370 on the same day. Some developers looked at it over the next few days, but, even though a specific commit which appeared to cause the bug had been identified, no solution was forthcoming. Discussion eventually died out. At least until January 2, when Ingo Molnar decided to stir the pot by posting a patch to revert the seemingly guilty commit. At that point the discussion picked up and a reliable way of reproducing the bug was found.
The commit which was said to have caused the problem was, in fact, not guilty; it had just caused an older bug to come to light.

The discussion did not stop there, though. A number of charges went back and forth which do not require discussion here. But one core point is this: as long as the bug report sat in the tracker, nothing much appeared to be happening with it - though, it seems, the SCSI developers had not forgotten it and were trying to figure out what was really going on. But once the problem came back to the linux-kernel list in the form of a brute-force solution, the root cause was found in short order. The key here was bringing the problem to the attention of a wider group of people; the crucial recipe for reproducing the problem came from a developer who had not been looking at the problem previously. In the kernel context, at least, giving wide exposure to a bug often helps immensely in getting that bug fixed. That is especially true for the sort of hard-to-reproduce bugs which tend to come up in kernel programming.

So, while bug trackers are a useful tool for ensuring that problems do not fall through the cracks, it seems that one of the most potent anti-bug tools we have - discussing the problem via a widely-distributed email list - is the same tool we have been using for decades.

The Linux trace toolkit's next generation

Instrumenting a running kernel for debugging or profiling is on the wish list of many administrators and developers. Advocates of OpenSolaris like to point to DTrace as a feature that Linux lacks, though SystemTap has started to close that gap. The Linux Trace Toolkit next generation (LTTng) takes a different approach and was recently submitted for inclusion in the kernel (in two patches: arch independent and arch dependent). LTTng relies upon kernel markers to provide static probe points for its kernel tracing activities.
It also provides the ability to trace userspace programs and combine that data with kernel tracing data to give a detailed view of the internals of the system. Unlike other tools, LTTng takes a post-processing approach, storing the data away as efficiently as possible for later analysis. This is in contrast to SystemTap and DTrace which have their own mini-languages that specify what to do as each trace point is reached. One of the major design goals of LTTng is to have as little impact on the system as possible, not only when it is actually tracing events, but also when it is disabled. Kernel hackers are quite resistant to debugging solutions that add any significant performance penalty when not in use. In addition, any significant delays while enabled may change the system timing such that the bug or condition being studied does not occur. For this reason, LTTng does not take the path that various dynamic tracing solutions have used and avoids the expense of a breakpoint interrupt by using the static markers. Another major design goal is to provide monotonically increasing timestamp values for events. The original LTT uses timestamps derived from the kernel Network Time Protocol (NTP) time, which can fluctuate somewhat as adjustments are made – sometimes going backward. LTTng uses a timestamp derived from the hardware clocks that will work on various processor architectures and clock speeds. In addition, the timestamps can be correlated between different processors in a multi-processor system. As LTTng gathers its data, it uses relayfs to get the data to a userspace daemon (lttd) that writes the data to disk. The daemon is started from the lttctl command-line tool, which controls the tracing settings in the kernel via a netlink socket. A user wishing to investigate tracing could use lttctl to start and stop a trace; once the trace is complete, the data could be viewed and analyzed. The LTT viewer (LTTV) is the program that is used to analyze the data gathered. 
It provides both GUI and text-based viewers to interpret the binary data generated by LTTng and present it to the user. Multi-gigabyte files of tracing data are not uncommon when using LTTng, so a tool like LTTV is indispensable for visualization and filtering to allow the user to focus on the events of interest. LTTV has a plugin mechanism that allows users to develop their own display and analysis tools, while using the LTTV framework and filtering capabilities.

An advantage of using static probe points – though some may see it as a disadvantage – is that they can be maintained with the kernel code they are targeting. If the kernel markers patch is merged, subsystems can add probe points at places they find interesting or useful, and those markers will be carried along in the kernel source and updated as the kernel changes. Other solutions rely on matching an external list of probes with the version of the running kernel, which can result in mismatches and incorrect traces. Also, SystemTap will be able to use any markers that get added to the kernel as is, so users who want the abilities that it provides will also benefit.

LTTng is being developed at the École Polytechnique de Montréal with support from quite a few Linux companies. It has the looks of a very well thought out framework that builds upon the tracing work that has been done before. It certainly won't make it into 2.6.24, but it would seem to have a good chance of making it into a future mainline kernel.

LWN.net: a ten-year timeline (part 1)

LWN is about to celebrate a birthday. Picking the true anniversary of an enterprise like LWN can be a bit tricky - there are many points which could be said to mark the true birth of the organization. After some thought, we have decreed that LWN.net was born on January 30, 1998. So we have a tenth anniversary coming up. That's a long time - far longer than any of us thought we would be doing this. Life is funny that way, somehow.
One cannot let a date like this go by without at least partially taking advantage of its hype-creation possibilities. So there will be a few things happening to celebrate our decade of writing about Linux, culminating with some sort of celebration on the 30th, when your editor will be speaking at this year's (sold-out!) linux.conf.au in Melbourne, Australia. One of those will be a short series of articles - starting with this one - looking back at those ten years. What a long, strange trip it has been.

Back in early 1997, your editor was the manager of a software development, system administration, and data delivery group at the National Center for Atmospheric Research. He had, at that point, been using Linux for a few years. It was running on a number of servers, of course, but we had also deployed it on desktops and used it for the acquisition and display of meteorological data, including high-bandwidth (for the time) doppler radar data. Don't let anybody tell you that real-time Linux is a new thing.

At this time, your editor was seeing two futures: (1) an increasingly dilbertesque life spent mostly in meetings, and (2) the clearly bright future of Linux. So he was actively looking for ways to move out of conference rooms and toward Linux, and talking over schemes with a number of friends. An early idea - to commercialize one of the first weather stations ever put on the World Wide Web with LWN editor Forrest Cook - never quite took off. But that thought process continued.

During that same time, Elizabeth Coolbaugh had just left a very similar position at the same institution; she was looking for a new project for the next phase of her life. After some discussions, Liz and your editor settled on a business idea which seemed to have some promise. It was not to be the last silly decision they were to make.
You see, at that time there was a struggling Linux distributor named Red Hat which was beginning to get the sense that there might be a market for its boxed Linux product in the corporate world. But companies need support, and Red Hat lacked the ability to provide that support. So the company's management came up with the "support partner" concept. Upon being accepted into this program, partner companies would be able to sell Red Hat-backed support certificates, which Red Hat would help to market. This widespread network of Linux experts would be able to provide local support to clients and would, for the hardest problems, be able to get help from Red Hat itself. It looked like a winner for everybody involved. That program was not yet operational at this time, though - but Red Hat promised it would be Real Soon Now. Your soon-to-be editors, not yet having done much business with Red Hat beyond ordering an occasional CD, believed this promise. But it still made sense to do something productive while waiting. The idea that emerged after some talk was to put up a regular newsletter about what was happening in the fast-evolving Linux community. Even back then, keeping up with everything was hard, so we figured that the service would be valuable. As an added bonus, it would attract attention to this new support company (called Eklektix) and show just how blindingly smart and up on Linux we were. Discussion of details occurred slowly through much of 1997. On January 22, 1998, the first issue of LWN was posted; it talked about the 2.1.79 kernel, the brand-new spinlock mechanism, the devfs debate, the creation of Red Hat Advanced Development Labs, and attempts to bring Java to Linux. The January 29, 1998 issue changed the format and led off with Netscape's announcement that it would be releasing the source code for its browser. 
We also found all of two news articles about Linux (we posted every one we found in those days) and talked about NFS problems, the devfs debate, the Debian 2.0 release roadmap, and gcc 2.8 problems.

At this point, we had posted two issues, but had not actually told anybody about them. Unsurprisingly, traffic was low. That changed on January 30, when our announcement made it out to the comp.os.linux.announce newsgroup - the best way to get the news out at that time. As promotional text the announcement was rudimentary at best, but it had the desired result - we got over 1000 page views on that first day, which seemed like a lot at the time. LWN was off and running.

Some highlights from the early days of LWN:

February 12, 1998: Eric Raymond starts pushing "open source" instead of free software. Worries over whether Intel's proposed "Merced" architecture would support Linux.

February 19, 1998: Richard Stallman fights back against Open Source. SCO claims to be the largest provider of Unix-based servers. Jesse Berst's famous "could you get fired for choosing Linux?" article runs. Jaroslav Kysela launches the "Ultra" (later ALSA) sound driver project.

March 12, 1998: Ralph Nader suggests that Dell should sell Linux-installed systems.

March 19, 1998: Bruce Perens resigns from the Debian project, saying: "I'm sorry it had to be this way, but I feel that my mission to bring free software to the masses really isn't compatible with Debian any longer, and that I should be working with one of the more mainstream Linux distributions." Sendmail, Inc. was launched.

April 2, 1998: the Mozilla source release happens. Alan Cox joins Red Hat. The feature freeze for the 2.2 kernel is announced. The Open Group announces that use of the X Window System will require fees - but Linux users had XFree86 and didn't care.

It's fair to say that we didn't entirely grasp the significance of the events reported in the April 2 edition.
The hiring of Alan Cox was one of the first in a long series - before then, almost nobody actually had a job which involved developing Linux. The Open Group's attempt to relicense X was thoroughly defeated by the existence of a free version with an active development community - a story which would be repeated a number of times in the coming years. April 30, 1998: Red Hat gets around to launching its support program, with Eklektix as one of the four they had managed to sign up. Kernel development halts as a result of the birth of Linus's second child. May 28, 1998: LWN moves to its own domain at LWN.net. The Linux Standard Base is proposed. Your editor first describes himself as "grumpy" after producing LWN by himself (Liz was at Linux Expo). PC Week calls Linux "a communist operating system in a capitalist society" and predicts its demise. Red Hat 5.1 is released. July 16, 1998: KDE 1.0 is released; KDE v. GNOME flamewars spread across numerous mailing lists and web sites. July 23, 1998: Oracle ports some of its products to Linux. Linus decrees that 8MB of memory will be needed for the 2.2 kernel. The Oracle announcement seems mundane now, but the existence of Oracle products for Linux was a specific indicator that many people were looking for. It was an indication that Linux was a "serious" platform. Richard Stallman, of course, thought that Oracle's announcement was terrible news. July 30, 1998: Debian 2.0 is released. Rumors circulate that IBM is considering Linux. Linux-Mandrake is launched. August 13, 1998: the Open Source Initiative is launched, flame wars result. Richard Stallman calls for free documentation for free software. The kernel goes into a "hard code freeze" - not the first or last time that a Linus-decreed freeze would prove to be less hard than anticipated. The devfs discussion continues. Red Hat states that it cannot legally ship Qt or KDE. August 20, 1998: Red Hat launches Rawhide. Bruce Perens bails out of the Linux Standard Base effort. 
October 1, 1998: Intel and Netscape (and two venture capital firms) invest in Red Hat. Also notable this week was the first of the big "Linus burnout" episodes, making it clear that something in the kernel development process needed to change. Let us now pause for a moment. From this distance, it may be hard to appreciate just how big the news of the Red Hat investments was. For all that had happened, Linux was still a somewhat obscure phenomenon, unknown to much of the information technology world. When Intel put money into Red Hat, it became clear to all that both Linux and Red Hat were headed toward success. This was, in some real sense, the point where Linux entered the dotcom bubble, though the real action was still a year away. The 2.1.123 release failed to compile as a result of some merging errors; developers got upset about the state of affairs and a long, inflammatory discussion resulted. Linus stormed out of the virtual room and took a vacation. It was a somewhat scary series of events which foreshadowed more to come; getting the kernel development process to scale as the community grew was a multi-year process. During this time, LWN was also growing in both readership and size; it was taking increasing amounts of time. We eventually had to move the server from its initial location (behind an ISDN line in your editor's basement) to a proper hosting facility. But, remember, LWN was not the main endeavor; it was an attention attractor for the support services offered by Eklektix, Inc. This business plan was not going particularly well. Those who dealt with Red Hat in that era know that, as a company, it was a rather chaotic place. The marketing for the support partners never happened, and the backup services for the support plans the partners were able to sell themselves were, shall we say, less than the customers thought they deserved given what they had paid. The support partner program was not a big success for anybody involved. 
As a result, one of the first things Red Hat did with its new pile of cash was to cancel this program and start building its own, internal support operation. Eklektix continued to push its own support offerings for a while, but the fact of the matter is that it was not a fun business: it seemed to mostly consist of cleaning up after low-budget ISPs which could not be bothered to install security updates. So the search for alternatives began. Meanwhile:

October 16, 1998: Larry McVoy contacts LWN and describes his upcoming "BitKeeper" software as a way of making Linus "scale". Debian takes an official position against KDE.

November 5, 1998: The Halloween Memo.

November 19, 1998: The Qt library becomes available under the new QPL, eliminating roadblocks for the distribution of KDE. VA Research (later known as VA Linux, then VA Software, then SourceForge) gets a big venture capital infusion. Red Hat hires Matthew Szulik as CEO.

The first LWN Linux timeline was released at the end of 1998.

January 28, 1999: LWN's first anniversary. The 2.2 kernel is released, complete with a trivially-exploited security hole. Linus decrees that 32-bit Linux will never support more than 2GB of memory. The TCP-wrappers distribution is compromised. The Windows refund movement gathers steam.

February 11, 1999: perhaps the first big discussion of binary-only modules.

February 25, 1999: IBM announces support for Red Hat Linux on its systems.

About this time, Eklektix announced that its new line of business would be training - and Linux system administration training in particular. The announcement was timed for the first ever LinuxWorld conference; both LWN editors spoke there, with Jon delivering a system administration tutorial to 450 attendees. It was the start of a new phase - though it was not much more successful than the one which came before.

If the investments in Red Hat were the beginning of the Linux bubble, LinuxWorld was where the inflation began in earnest.
The amount of money on display there was impressive to say the least. The Red Hat party will live forevermore in the memory (or lack of memory, as the case may be) of all who attended. LinuxCare, which was supposed to be the big support success story for Linux, was unveiled at this conference. Never had there been so much overt commercial interest around Linux. March 25, 1999: It turns out that BitKeeper is to come out under a not-really-open-source license. April 8, 1999: Discouraged Mozilla developers resign from the project - there was a time when it seemed like a usable Mozilla browser would never come. Dell buys a piece of Red Hat. Al Gore claims to have an open source presidential campaign. RMS battles for "GNU/Linux" on linux-kernel. April 15, 1999: the Mindcraft study. It turned out that some of Mindcraft's criticisms were right, but we fixed the problems in a hurry. April 27, 1999: The last Linux Expo is held in Raleigh. It is interesting to note that, during this time, LWN got its first acquisition offer: from Red Hat. We turned it down: the terms of the offer looked much like indentured servitude under firm Red Hat control. But we did work a deal with the company to supply news items for its portal site. Yes, during this time, Red Hat's business model was aiming toward becoming the dominant network portal for Linux-related information. Remember, this was 1999. June 10, 1999: Red Hat files for its IPO. VA Linux bulks up on free software developers. July 1, 1999: Slashdot is acquired by Andover.net. Eric Raymond and Richard Stallman feud over "open source." July 22, 1999: Red Hat gives Linux hackers an opportunity to buy pre-IPO stock. August 12, 1999: Red Hat goes public, with great success. Andover acquires Freshmeat.net. The second LinuxWorld conference is held. The Red Hat IPO was the beginning of a new phase: clearly somebody was making a lot of money from Linux, even if who wasn't exactly clear. What was clear is that Eklektix was not on the list. 
When we planned out the training offering, we had a set of spreadsheets with some truly wonderful numbers on the income which was sure to result. Somehow reality failed to match the spreadsheets. So we came to realize that we needed to look in other directions.

At this time, advertising was beginning to bring in some actual money. But, more to the point, as the market heated up, companies were showing increasing amounts of interest in anybody who had any sort of Linux credibility or mindshare. We had some of that credibility at that time. So we decided to see what would happen if we let the word out that LWN was for sale. Suffice to say that the result was a far wilder ride than we could have ever anticipated. But that will be the topic of next week's installment.

2.6.24 - some statistics

As of this writing, the 2.6.24 kernel is getting close to a release - though there is likely to be one more -rc version to look at first. The rate of change has slowed significantly, though, and the final regressions are being chased down. So it seems like a suitable time to look at the patches which went into this kernel and where they came from.

This is, in many ways, a record-breaking development cycle. Over 10,000 individual changesets have been merged this time around, with a net growth of almost 300,000 lines of code. 950 developers contributed this code; of those, 358 contributed just one patch. By comparison, the previous cycle (2.6.23) merged some 6200 patches from about 860 developers. Given that, it's not surprising that the 2.6.24 cycle has been a little longer than some of its predecessors.

Without further ado, here is the list of top contributors to this kernel:

By either method of counting, Thomas Gleixner comes out at the top of the list by virtue of his work on the i386/x86_64 architecture merger. Bringing those architectures together and making the result work well was a huge job; this effort will continue into future development cycles.
(For the curious, simply renamed files were not counted as "changed lines" in the generation of these numbers). Note that many of these patches also carry a signoff by Ingo Molnar, but git only stores the name of a single "author" for a changeset. Other contributors of large numbers of changesets in 2.6.24 include Bartlomiej Zolnierkiewicz (lots of IDE driver patches), Adrian Bunk (cleanups all over the kernel tree), Ralf Baechle (MIPS architecture work), Pavel Emelyanov (mostly network and PID namespaces), Tejun Heo (serial ATA and a number of sysfs cleanups), Johannes Berg (wireless networking), and Al Viro (mostly annotation patches and related fixes). If one looks at the number of changed lines, the list of developers changes almost completely: Zhu Yi (iwlwifi driver), Auke Kok (e1000 driver), Michael Buesch (wireless networking and the b43 driver), Ivo van Doorn (rt2x00 wireless driver), Matthew Wilcox (SCSI, especially advansys and sym53c8xx drivers), Adrian Bunk (cleanups and code deletions), Larry Finger (mainly addition of the b43 legacy driver), and David Miller (networking and SPARC64). If one assigns developers' contributions to employers and totals the results, the following numbers emerge (note that these tables have been updated since initial publication to fix an error): In many ways, these lists look similar to those posted for past kernels. But there are a few things which jump out this time around: Intel has made it to the top of the "by lines changed" list - and not just by a little bit. This happened by virtue of the work done by four of the top-20 developers, but also by dozens of others who contributed to the 2.6.24 kernel. Intel has a lot of people working on the kernel, many of whom spend little time in the limelight. Movial found its way onto the list for the first time as a result of having hired a very active developer. The amount of work done by people known to be hacking on their own time has grown a bit. 
This change is mostly a result of more complete information on our side - many developers have moved out of the "unknown" category. Quite a bit of the no-employer work this time around was done on the wireless networking tree; since much of the interesting work in this area currently involves reverse engineering, perhaps it is not surprising that relatively few companies are willing to sponsor it. All told, some 130 distinct employers were identified for the contributors to 2.6.24. That is a lot of companies to be working on one body of code. Looking at the Signed-off-by headers of patches is always interesting; if one removes the signoffs added by the authors themselves, what is left is a list of the gatekeepers - those who channel the code into the mainline. The people who signed off on the most patches which they did not write are: There are not a lot of changes here from previous development cycles. While quite a few developers add signoffs to code and pass it on, they work for a relatively small number of companies - 7 employers account for 70% of the non-author signoffs. Finally, given that we are starting a new year, it is worth taking a quick look back at the entirety of 2007. In 2007, Linus merged just over 30,000 changesets (more than 80 per day, every day) from 1900 developers working for (at least) 200 companies. All told, they changed over 2 million lines of code, growing the kernel by more than 750,000 lines. The kernel developers are, in other words, touching over 5,000 lines of code every day - that is a high rate of change. The top contributors over the course of the year (by changesets) were: It should be noted that the employer numbers are more approximate than usual. Some developers changed employers in 2007, but LWN, as a matter of policy, does not maintain a database of developers and their employers over time. 
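The gatekeeper tally described above (signoffs that remain once each author's own signoff is removed) can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the tooling used to produce these statistics: the `Commit` record and the sample data are hypothetical, and real input would have to be parsed out of the repository history.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Commit:
    author: str          # the single author git records for a changeset
    signoffs: List[str]  # names taken from the Signed-off-by lines


def gatekeeper_counts(commits: List[Commit]) -> Counter:
    """Count signoffs on patches that the signer did not write."""
    counts: Counter = Counter()
    for commit in commits:
        # A set keeps a duplicated Signed-off-by line from counting twice.
        for name in set(commit.signoffs):
            if name != commit.author:
                counts[name] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    Commit("Alice", ["Alice", "Maintainer A", "Maintainer B"]),
    Commit("Bob", ["Bob", "Maintainer A"]),
    Commit("Maintainer A", ["Maintainer A", "Maintainer B"]),
]
counts = gatekeeper_counts(sample)
```

In this made-up sample, "Maintainer A" is credited only for the two patches written by others; the patch that "Maintainer A" authored does not add to that total.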
Still, the picture is relatively constant - the same companies continue to contribute approximately the same percentage of the patches going into the kernel over relatively long periods of time. Overall, the picture that results from all these numbers is one of a widespread and healthy development community. There appears to be no shortage of jobs for kernel developers, but also room for those who work outside of the office. The kernel truly is a common resource, with literally thousands of people working to improve it. And it shows no signs of slowing down anytime soon. Your editor would like to profusely thank Greg Kroah-Hartman for his help in improving these statistics. GoboLinux GoboLinux is an alternative distribution that redefines the entire filesystem hierarchy. The distribution joined the LWN Distributions List in late October 2003 at version 007. Now at version 014, the project has made quite a bit of headway. The website has been translated into several major languages, along with much of the documentation. An early article written by GoboLinux creator Hisham Muhammad explains how the distribution evolved from a custom Linux From Scratch installation, and the motivation for changing the directory structure. The whole thing started when I had to install programs at the University. As I had no write access to the standard Unix directories, I created my own directories under $HOME the way I saw fit. I upgraded the programs from source constantly, and couldn't use a package manager. My solution was the most obvious one: to place each program in its own directory, such as ~/Programs/AfterStep. Soon the environment variables (PATH, LD_LIBRARY_PATH...) got bigger and bigger, so I created centralized directories for each class of files, containing symbolic links: ~/Libraries, ~/Headers and so on. A natural evolution was to write shell scripts to handle the links, configures and Makefiles. 
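The scheme in that quote (one directory per program, plus shared directories full of symbolic links) can be sketched roughly as follows. This is not GoboLinux's actual tooling: the `link_program` helper, the mapping from `lib` and `include` to `Libraries` and `Headers`, and the version number are all assumptions made for the sake of illustration.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed mapping from a program's subdirectory to the shared directory
# that collects symbolic links for that class of file.
CLASS_DIRS = {"lib": "Libraries", "include": "Headers"}


def link_program(root: Path, name: str, version: str) -> List[Path]:
    """Link the files of Programs/<name>/<version> into the shared directories."""
    program_dir = root / "Programs" / name / version
    created = []
    for subdir, shared in CLASS_DIRS.items():
        source_dir = program_dir / subdir
        if not source_dir.is_dir():
            continue
        shared_dir = root / shared
        shared_dir.mkdir(parents=True, exist_ok=True)
        for item in sorted(source_dir.iterdir()):
            link = shared_dir / item.name
            if link.is_symlink():
                link.unlink()  # re-point an existing link at this version
            link.symlink_to(item)
            created.append(link)
    return created


# Demonstration in a throwaway directory standing in for $HOME.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    lib = root / "Programs" / "AfterStep" / "1.0" / "lib"
    lib.mkdir(parents=True)
    (lib / "libexample.so").write_text("placeholder")
    links = link_program(root, "AfterStep", "1.0")
    link_names = [link.name for link in links]
    points_at_program = (
        (root / "Libraries" / "libexample.so").resolve()
        == (lib / "libexample.so").resolve()
    )
```

With links like these, variables such as PATH and LD_LIBRARY_PATH only need to name the shared directories, however many programs are installed.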
I downloaded the 014 release and stuck the CD into my ancient Sony Vaio laptop. After booting I was first prompted for my preferred language and keyboard settings and then taken to a console screen with text advising me to "run startx to run the live CD or you can install from here." I ran startx and soon was looking at a familiar KDE desktop. This release features KDE 3.5.8, Glibc 2.5 and Xorg 7.2. From here you'll find a desktop icon for GParted and another to install GoboLinux, so you can easily create a separate partition for GoboLinux before an installation. I ran it as a live CD and brought up a Konsole so I could poke about the filesystem hierarchy. The home directory looks much like any other Linux system, but a cd /, followed by ls -al reveals something else entirely. There are only six subdirectories here: Depot, Files, Mount, Programs, System, and Users. Depot proved to be empty, but the other directories have their own subdirectories, which branch further as necessary. For example, I found everything needed to compile the Linux kernel for a variety of architectures under /Files/Compile/Sources/linux-2.6.23.8/ (the version used by this release). To see all the installed programs just look at /Programs where each package has its own subdirectory. Different versions of the packages can also be easily installed without conflict, since the directory structure includes the version number, e.g. /Programs/Xorg/7.2/. The home directory for users is under /Users instead of /home, but it works just the same. As a long-time Unix/Linux user I'm used to the old hierarchy, with cryptic names like /etc and /bin. I thought I might have a hard time getting used to GoboLinux. Instead, I found it intuitive and easy to work with. Next time you are looking for something different in a desktop, give GoboLinux a try. The launch of RPM 5.0 Stable version 5.0.0 of RPM, the rpm package manager, formerly known as the Red Hat package manager, has been announced.
RPM5 is a fork of RPM; it should not be confused with the version used by Red Hat, Fedora, SUSE, and others, which can still be found at rpm.org. The project description states: RPM is a powerful and mature command-line driven package management system capable of installing, uninstalling, verifying, querying, and updating Unix software packages. Each software package consists of an archive of files along with information about the package like its version, a description, and the like. There is also a library API, permitting advanced developers to manage such transactions from programming languages such as C, Perl or Python. Traditionally, RPM is a core component of many Linux distributions, including Red Hat Enterprise Linux, Fedora, Novell SUSE Linux Enterprise, openSUSE, CentOS, Mandriva Linux, and many others. But RPM is also used for software packaging on many other Unix operating systems like FreeBSD, Sun OpenSolaris, IBM AIX and Apple Mac OS X through the cross-platform Unix software distribution OpenPKG. Additionally, the RPM archive format is an official part of the Linux Standard Base (LSB). The RPM5 developers certainly have a high opinion of what this release brings: The relaunch of the RPM project in spring 2007 and today's following availability of RPM 5 marks a major milestone for the previously rather Linux-centric RPM. RPM now finally evolved into a fully cross-platform and reusable software packaging tool. RPM Version 5.0.0 differs in numerous ways from other versions. As noted above, the project aims to be cross-platform. Much of the code is said to have been cleaned up and numerous bugs have been fixed. The RPM build process has been completely rewritten to improve portability. The code base has been ported to all of the major UNIX-based platforms and Windows. All of the most widely used open-source and proprietary compilers are now supported. Supported compression formats now include bzip, bzip2 and LZMA. 
Initial support has been added for XAR, the XML Archive file format, while support for the old RPMv3 format has been removed. New package specification features have been added and RPM 5 can now automatically track vendor distribution files. In the last several years, the RPM project has been plagued by a bit of controversy. The issues mainly centered around maintenance of the code and which version was used by Red Hat. In August, 2006, LWN asked Who maintains RPM? More recently, Ralf S. Engelschall from the OpenPKG distribution has posted a blog entry that discusses the project's history and considers which version is "official". Lastly, the initial RPM 5.0.0 announcement on LWN produced some lively discussion of RPM issues. The much-trumpeted release of RPM5 seems unlikely to put an end to this controversy, to say the least. RPM5 would appear to have a certain amount of development energy and momentum, but it is not used by any major distributions and it is not at all clear that this will change; in particular, Red Hat and Fedora seem highly unlikely to drop their version of RPM for RPM5. So this fork - and the bad feelings that go along with it - will probably persist indefinitely. That's not what anybody would wish for a crucial (and normally relatively boring) system tool like rpm. Hiding open ports with shimmer Open TCP or UDP ports on an internet-facing host can be worrisome to an administrator; they almost feel like an invitation to an attacker. If a service with an unknown or unpatched vulnerability is listening behind the port, the host could be compromised. Admins have come up with some reasonable ways to deflect the simplest of these attacks: changing the well-known port or port knocking. The new shimmer project provides a twist, by using cryptographic techniques to choose the port to open. The basic idea is that one port (within a chosen range) will be open to real traffic of the service that the admin wants to hide – ssh or a private web server for example.
The number of that port will be able to be calculated by both client and server using a secret that they share. A client that connects to the proper port gets forwarded to the real service. In addition to the proper port, 15 other ports are opened and connected to a blacklist service. Any connection made to those ports will result in the source IP address being banned for 15 minutes. The server redoes the calculation each minute, coming up with a new set of 16 ports – one good and 15 bad. In order to calculate the port number, the shared secret (key) is combined with the time (to the nearest minute), and the name of the service, then hashed using SHA-256. The hash is used as an AES key to encrypt the numbers 0 through 15. Those values are mapped into the port range and serve as the 16 port numbers for that minute. In order to handle small clock variations between client and server, the server actually keeps each set of 16 open for three minutes – adding the set for the minutes before and after the current one. While this seems like it provides a great deal of security to hide an open port behind, in reality it is more showy than useful. As with simple port knocking, or changing the well-known port number, it is vulnerable to an attacker that can monitor traffic to the server and observe successful connections. Shimmer leaves three ports wide open at any given time with 45 ports that will cause an IP to get blacklisted. Depending on the size of the port range chosen, the odds aren't that bad of randomly guessing the right port. Someone with a few thousand IP addresses to use probably won't have any difficulty. Much like the other techniques, shimmer will likely deflect all but the most determined of attackers, but is unlikely to provide much in the way of a barrier against those. It sounds attractive and uses cryptographic terms and techniques which may make it seem more secure than it really is.
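The port calculation described above can be sketched roughly as follows. This is not shimmer's code, and it will not interoperate with it. Python's standard library has no AES, so the sketch openly substitutes HMAC-SHA256 for the "encrypt the numbers 0 through 15" step. The way the inputs are joined, the handling of collisions, and the choice of the first value as the "good" port are likewise assumptions.

```python
import hashlib
import hmac
from typing import List, Set, Tuple

PORTS_PER_MINUTE = 16  # one good port plus 15 blacklist ports


def ports_for_minute(secret: bytes, service: str, minute: int,
                     low: int, high: int) -> List[int]:
    """Derive 16 distinct ports in [low, high]; index 0 is treated as 'good'."""
    span = high - low + 1
    if span < PORTS_PER_MINUTE:
        raise ValueError("port range too small")
    # Combine secret, time (in minutes) and service name, then hash.
    # The "|" separator is an assumption, not shimmer's format.
    material = b"|".join([secret, str(minute).encode(), service.encode()])
    key = hashlib.sha256(material).digest()
    ports: List[int] = []
    counter = 0
    while len(ports) < PORTS_PER_MINUTE:
        # Stand-in for AES-encrypting the counter under the derived key.
        block = hmac.new(key, counter.to_bytes(16, "big"), hashlib.sha256).digest()
        port = low + int.from_bytes(block[:8], "big") % span
        if port not in ports:  # assumed policy: skip colliding values
            ports.append(port)
        counter += 1
    return ports


def server_open_ports(secret: bytes, service: str, minute: int,
                      low: int, high: int) -> Tuple[Set[int], Set[int]]:
    """Good and blacklist ports for the previous, current and next minute."""
    good: Set[int] = set()
    bad: Set[int] = set()
    for m in (minute - 1, minute, minute + 1):
        ports = ports_for_minute(secret, service, m, low, high)
        good.add(ports[0])
        bad.update(ports[1:])
    return good, bad - good


# Hypothetical secret, service name, minute counter and port range.
client_port = ports_for_minute(b"shared secret", "ssh", 1001, 20000, 29999)[0]
good, bad = server_open_ports(b"shared secret", "ssh", 1000, 20000, 29999)
```

Here the client's clock runs one minute ahead of the server's, yet its port still falls within the server's three-minute window. The same arithmetic shows the weakness: if the range holds N ports, a single random probe hits an open port with probability about 3/N and a blacklist port with probability about 45/N. For a hypothetical range of 10,000 ports, that is 0.03% and 0.45% respectively.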
Using it without understanding this could lead to a false sense of security. Ten-year timeline, part 2: the bubble days Last week, we began a multi-part series looking at the soon-to-be ten years of LWN. At the end of that episode, we were coming to the realization that the training business was, perhaps, not going to perform quite as well as our spreadsheets had suggested it might. It turns out that spreadsheets created with free software can be just as deceptive as those done with proprietary programs - who would have ever guessed? So we decided to look into whether it might be possible to make some sort of deal with some other company - preferably one with some money - to keep the show going. Just how one might go about looking for such a deal is not immediately obvious - especially if you're a bunch of technical people who have no clue about how corporate acquisitions are done. Somehow, hanging an "Acquire Us!" sign on the front page did not quite seem like the right way to go. After some thought, we decided that the best approach might be to just quietly slip the word to a few people that we might be open to offers, then sit back and see what happened. As it turned out, that was all we needed to do. Much of the following story has never been told - but all of the non-disclosure agreements have run out by now, so this seems like the right time. Meanwhile, things were happening at a furious pace in the Linux community. August 26, 1999: Red Hat and Caldera get around to year-2000 compliance. The 2.3.15 patch is "huge", touching all of 600 files (2.6.24 currently has changes to over 10,000 files). The first Ottawa Linux Symposium concludes. September 2, 1999: Sun buys StarDivision, but uses its "community source license" for the code. Red Hat shuts down "Red Hat Linux" vendors on Amazon. September 9, 1999: SCO (old SCO, mind you, not the current company) trashes Linux in Europe. 
Bruce Perens worries that Sun may be trying to grab control of the Linux desktop through its acquisition of StarDivision. Disruptive changes in the "stable" 2.2 kernel upset users. September 16, 1999: the 2.3 kernel goes into "feature freeze," with Linus predicting a release by the end of the year. He neglected to specify which year, though. Cobalt networks files to go public. LinuxOne - a company nobody had ever heard of - files to go public. Andover.net (the company which had bought Slashdot) files to go public. The first ext3 filesystem patches are released. The 2.3 feature freeze is instructive - 2.4.0 was not released until January, 2001 - 16 months after this "freeze" went into effect. Over the next months we'll see plenty of reasons for the delay in the 2.4.0 release; Linus was famously not a great release manager. But releases which failed to arrive were the norm back in those days. Free software was much like proprietary software in that regard. One has to look back to realize just how much better we have gotten at getting software releases out in a reasonable period of time. The IPO filings were beginning to pile up - much to your editor's chagrin. Actually reading those things is a painful chore, and we felt that we needed to examine all of them. The relative newcomers out there may be wondering who that LinuxOne company is. So were we, at the time. LinuxOne materialized out of thin air, slapped its name onto a copy of Red Hat Linux, and called itself a Linux company. They clearly hoped to get in on the general mania and make a bunch of money before people caught on - they nearly achieved it, too. September 30, 1999: Caldera spinoff Lineo gets going - remember Embedix and Embrowser? Red Hat drops LWN news from its web site. Lineo got spun out of Caldera for a couple of apparent reasons: (1) to isolate the DR-DOS lawsuit which was being pursued against Microsoft, and (2) to try to double the number of public offerings. 
The first objective was achieved, and the suit was ultimately successful. In the end, though, Lineo still failed to get off the ground. October 7, 1999: Sun announces that it will be releasing the Solaris source code. The OpenBSD project grabs the last freely-licensed version of ssh and starts the OpenSSH project. October 14, 1999: TurboLinux gets a big chunk of venture money. SCO (old SCO) buys a chunk of the Linux Mall. Crypto export rules in the U.S. begin to soften. The devfs discussion continues. SGI, VA Linux, and O'Reilly launch a commercialized version of the Debian distribution. VA Linux files for its IPO. Old-timers will remember the Linux Mall - that was the place, once upon a time, where we bought our Linux CDs (and stuffed penguins too). Yes, we actually bought Linux on CD and waited for it to show up via mail, though it may seem a little strange now. The Linux Mall, and its founder Mark Bolzern, were fixtures in the early days of Linux. As Linux grew and bandwidth increased, though, the Linux Mall was having a bit of a hard time of it. The name was famous, though, and the site got a lot of traffic, so companies interested in getting into the Linux hype were interested in it. It may be getting a bit ahead of the story, but this is as good a place as any to let it be known that one of the things that the Linux Mall wanted to do with its new-found wealth was to acquire a media outlet like LWN. It was part of the bigger plan of creating a full-featured e-commerce "mall" centered around Linux. We considered the offer long and hard, but, in the end, declined it. Just as well: the Linux Mall missed the IPO boat and got folded into EBIZ, which, in turn, eventually went bankrupt. Had we taken that path, there would be no LWN now. October 21, 1999: LinuxToday is acquired by Internet.com; co-founder Dave Whitinger leaves the building. 
ATI announces that it will be releasing 3D programming information for its video adapters - the good news here is that it's finally getting around to doing that. November 4, 1999: DVD encryption is cracked and DeCSS is released. The Y2K-related "windowing" patent threatens the kernel. Burn all GIFs day. The kernel gets past the longstanding 1GB limit on installed memory. Slackware 7 (the successor to Slackware 4) is released. The non-profit Red Hat Center for Open Source launches - and is never heard from again. November 11, 1999: Cobalt Networks goes public, shares begin trading at $130. November 18, 1999: The Linux Business Expo is held as part of the once-famous COMDEX event. Red Hat acquires Cygnus. BitKeeper is said to be getting closer to release. Mozilla hits milestone 11 and is said to be getting closer to release. Advogato.org launches. LWN has only rarely operated booths at conferences, but we did have one at the Comdex Linux Business Expo. For the curious, here's a picture from the event featuring LWN editor Rebecca Sobol. That week's LWN edition was produced from that booth after the floor closed, under the watchful eye of security guards who didn't think we should be there. Your editor remembers it as one of the coldest experiences of his life. During the show, we were subjected to constant, highly-amplified screaming obnoxiousness from the large booth being run by LinuxToday - the acquisition, it seemed, had put that site onto a rather less dignified path. The other thing LWN was doing at this event was talking with potential suitors. One of those was a company called Atipa, which was operating a large booth of its own. Atipa was a VA-style Linux box vendor with a grand plan for a Linux portal site which would, eventually, be the place people went for Linux information. They thought that LWN would make a good addition to that portal, and were pushing hard to make a deal. We met a few times with Atipa's CEO, a charismatic man who told a good story.
The company, he said, was going to outdo even the coming VA Linux IPO, which was already clearly going to be big. Along the way he was going to pick up companies like Applix and open-source the ApplixWare office suite - something which would have been nice at the time. He stated flat out that he was soon to be a billionaire, and that we could share in that bonanza. It was quite the tale, but we tended to walk out of these meetings believing every word of it. With some distance, though, the glow always faded. We wondered why our visit to the company's headquarters revealed a building almost devoid of people. The magic "profit happens here" step in their plans seemed less inevitable when looked at later. In the end, we did not take this deal. Thereafter, we received (unverifiable) word that Atipa's investors started asking some harder questions and found that, perhaps, they, too, had allowed themselves to be charmed more than they should have. Atipa rather abruptly found a new CEO, the IPO never happened, and investors, presumably, lost their money. Also at the Linux Business Expo, we met with some representatives from O'Reilly. They were getting the O'Reilly network off the ground, and thought that LWN might make a good addition to it. They eventually offered us a deal (which looked more like a traditional angel investment than an acquisition) and a network affiliation which would have given us a portion of the revenue from the ads they sold. Your editor, who has a lot of respect for the people at O'Reilly, has always had a bit of regret at turning down this offer. It was an opportunity to get business advice from some very smart people. But it would almost certainly have been fatal to LWN once the advertising market fell apart. Meanwhile, the acquisition of Cygnus by Red Hat led to a fair amount of online worrying about whether Red Hat was set to take over Linux by virtue of employing a number of GCC developers. 
Such fears look a little silly now, but they seemed real then. December 9, 1999: Andover.net goes public. The kernel gets NUMA support (during a feature freeze, remember). Sun announces a Linux Java release, rolling over the "Blackdown" team which had been working on this release for years. December 12, 1999: VA Linux goes public, setting the record for the largest first-day gain in NASDAQ history. Eric Raymond gets rich and lets us all know about it. The non-free BitKeeper license is revealed. LinuxCare acquires the Puffin Group and gets another $32 million. The Linux Capital Group launches; it starts by funding Progeny Linux. Companies send out "we use Linux" press releases in an attempt to make their stock price go up. The VA IPO was not just the peak of the Linux bubble - it could well be the peak of the dotcom bubble as a whole. It was not possible to watch that stock rise to well over $300 a share on the first day and not be overwhelmed by a sense of unreality. Still, it seemed like no more than what Linux deserved, and people somehow expected it to continue. January 6, 2000: Linux survives Y2K. Red Hat buys Hell's Kitchen Software, does nothing with it. VA Linux launches the SourceForge site. January 13, 2000: Caldera Systems (later to become SCO) files for its IPO. The kernel gets a new block driver API and 32-bit UIDs - still during the feature freeze. January 20, 2000: LinuxCare files for its IPO. Linus Torvalds shuts down the sale of a number of Linux-related domain names. Secure Computing Corporation announces that it will be developing (what becomes) SELinux. Enoch becomes Gentoo Linux. TurboLinux completes another funding round. Once upon a time, Caldera Systems was supposed to be among the biggest winners in the distribution sector - they had the business connections and the distribution channels. "Linux for business" got the company far enough to do an IPO, but not much beyond that. 
This is, of course, the company which eventually became the SCO Group. Caldera was well overshadowed by LinuxCare, though. The distribution business always looked like a hard one to maintain over the long term - that is why Red Hat was trying to be a web portal company. Services were going to be the real gold mine, and LinuxCare was going to be at the top of the Linux support industry. The company got money from left and right (a funding round produced offers of ten times the target amount) and hired a long list of well-known Linux hackers. Need we say that LWN's editors paid a visit to LinuxCare during this time? It was a hard time for LinuxCare to discuss acquisitions, since the IPO process was already underway, but discuss they did. So we went to the famous San Francisco headquarters. Your editor's memories from that day are strong. LinuxCare was filled with hundreds of people who all believed they were on the way toward an IPO that would exceed even VA Linux; suffice to say they were happy about the prospect. Meanwhile, though, a couple hundred of them were all working in a single not-very-large room called "the barn"; it resembled, more than anything else, a school lunchroom filled with long tables. Everybody worked on a laptop because there was no room in their tiny piece of table space for anything else. They all complained about having colds. It looked awful. LinuxCare's negotiator was an ex-fighter jet pilot who retained the "top gun" attitude. When valuations were discussed, we were told that offering LinuxCare's pre-IPO shares at $50-60 each was being generous to us. Issues like editorial control were not really even on the table. In the end, we turned this deal down, but with a feeling like we were throwing a winning lottery ticket in the trash. Of course, subsequent events showed that we need not have worried about this particular missed opportunity. February 10, 2000: Real-time Linux turns out to be patented. VA Linux acquires Andover.Net. 
The KDE project moves to SourceForge. Atipa acquires Enhanced Software Technologies. The Linux Fund announces that it will be filing for an IPO. The Andover.Net acquisition was announced at LinuxWorld in New York - LWN was there, of course. The initial deal included a massive pile of cash to be handed to Andover.Net's shareholders, but people questioned that handout to the extent that it eventually went away. Andover.Net's owners had to content themselves mostly with VA Linux shares, which, already, were worth considerably less than they had been on IPO day. In the end, Andover.Net turned out to be a good buy for VA Linux, once it became clear that the Linux-installed computer business was harder than it had looked. We were approached by a VA executive at LinuxWorld to see if we were interested in maybe being acquired sometime. By then, though, we had so many offers that we couldn't really give them all serious consideration. So we did not pursue that opportunity. But, at this event, we did talk with some representatives from ZDNet, who were also looking for a Linux site to buy. The offer they made was, by far, the most generous of any. By some reckoning, we should have taken it. Certainly it would have come out better than most of the other options we had. But ZDNet would have exercised more editorial control than we would have liked, and, being already a public company, it didn't offer that IPO "pop" that we somehow thought was our due. So we ended up not taking that path. February 17, 2000: devfs is merged into the mainline kernel. Also merged is the "softnet" core networking rework. Remember, the kernel is in a feature freeze. February 24, 2000: Eazel is founded with the goal of improving Linux usability. To your editor, Eazel never made sense from the beginning. There was, truly, no revenue model. Indeed, it seemed like a scam designed to draw venture money for the purpose of writing Nautilus. 
To that extent it succeeded, but the investors cannot have been happy in the end. March 2, 2000: Atipa announces $30 million in investments. March 23, 2000: Caldera Systems goes public; its share price merely doubles. The planned date for LinuxCare's IPO passes with no offering. April 4, 2000: Linuxcare's IPO is pushed back to April 24 - or so they say. EBIZ acquires longtime Linux CD distributor InfoMagic. Atipa Linux Solutions acquires DCG Computer Corp. Sendmail Inc. gets $35 million in funding. This was the point where LWN announced that it had been acquired by a company called Tucows. We had, in fact, been talking with them for some months, and had made the decision in February. It took some time, though, for the lawyers to hammer out the final agreement. In the end, we were probably exceedingly lucky: market conditions were going downhill in a hurry by this point and, had the negotiations stretched out much longer, Tucows might have started looking for reasons to back out of the deal. Or maybe not. We went with Tucows for a number of reasons, but at the top of the list was that they were clearly smart and decent people who, while arguably being carried away by the bubble like the rest of us, clearly had a functioning business underneath it all. Their acquisition of LWN never yielded the benefits they were looking for, but the people at Tucows always treated us well and we still count them as friends. Perhaps we were smart, or perhaps we were just very lucky, but, in retrospect, we came out of a complex, high-stakes process having made what was probably the best possible decision. The Tucows acquisition made it possible for LWN editors Rebecca Sobol and Forrest Cook to join as regular staff members. It also positioned us within a safe harbor for the dotcom crash, which was already in progress. But the story of those years will be the subject of next week's installment. 
Unprivileged mounts There are a number of filesystem-related patches aimed at the upcoming 2.6.25 merge window; one of those is the unprivileged mount patch by Miklos Szeredi. This patch enables an unprivileged user process to call the mount() system call and - in certain circumstances - have that call actually succeed. It could eventually lead to a situation where users have more flexibility to create their own environments and the setuid mount utility is no longer needed. This patch adds a new field (uid) to the vfsmount structure, allowing the kernel to keep track of the owner of a specific filesystem mount. The system administrator can give ownership of a specific mount to a user with the new MNT_SETUSER flag. A common pattern might be to bind-mount a user's home directory on top of itself, giving the user the ownership of that mount. Once that has been done, the user is allowed to freely mount other filesystems below that mount point - with a couple of conditions: There is a system-wide limit on the number of allowed user mounts; once that limit is hit, no more unprivileged mounts will be allowed until somebody unmounts something. The current patch has no provision for per-user or per-group mount limits, but such a feature would not be particularly hard to add should the need arise. The filesystem type must be marked as being safe for unprivileged mounts. Miklos notes that a filesystem must go through "a thorough audit" before this flag can be set with any confidence. The patch, as posted, marks the fuse filesystem (which allows for the creation of filesystems implemented in user space) as being safe; fuse was designed for this mode of operation in the first place. Bind mounts are also allowed, with some additional conditions. If the system allows the mount, the flags allowing for setuid and device files will be forcibly cleared - unless the user has the requisite capabilities anyway. 
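The conditions above can be summarized in a toy model. This is Python rather than kernel C, and it is not taken from the patch: only the `uid` field and the `MNT_SETUSER` flag come from the description above, while the class names, the path handling, and the way the limit is counted are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Mount:
    path: str
    fstype: str
    owner_uid: Optional[int] = None  # models the new uid field in vfsmount
    nosuid: bool = False
    nodev: bool = False


@dataclass
class MountTable:
    max_user_mounts: int    # the system-wide limit
    safe_fstypes: Set[str]  # types audited as safe for unprivileged mounts
    mounts: List[Mount] = field(default_factory=list)
    user_mounts: int = 0

    def _covering(self, path: str) -> Optional[Mount]:
        """Return the deepest existing mount at or above path."""
        best = None
        for m in self.mounts:
            prefix = m.path.rstrip("/") + "/"
            if path == m.path or path.startswith(prefix):
                if best is None or len(m.path) > len(best.path):
                    best = m
        return best

    def user_mount(self, uid: int, fstype: str, path: str,
                   privileged: bool = False) -> Mount:
        parent = self._covering(path)
        if parent is None or parent.owner_uid != uid:
            raise PermissionError("not below a mount owned by this user")
        if fstype not in self.safe_fstypes:
            raise PermissionError("filesystem type not marked safe")
        if self.user_mounts >= self.max_user_mounts:
            raise PermissionError("system-wide user mount limit reached")
        # setuid and device files are disabled unless the caller is privileged.
        new = Mount(path, fstype, owner_uid=uid,
                    nosuid=not privileged, nodev=not privileged)
        self.mounts.append(new)
        self.user_mounts += 1
        return new


# The administrator hands a bind mount of the home directory to uid 1000,
# standing in for a mount made with MNT_SETUSER.
table = MountTable(max_user_mounts=1, safe_fstypes={"fuse"})
table.mounts.append(Mount("/home/alice", "bind", owner_uid=1000))
m = table.user_mount(1000, "fuse", "/home/alice/remote")
```

In this model a second mount by the same user fails because the limit of one has been reached, and a request for a filesystem type that is not in the safe set is refused outright.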
Users are allowed to unmount filesystems they own, again without privilege, but cannot unmount any others. Another new mount flag (MNT_NOMNT) marks a specific filesystem as being the end of the line - no unprivileged submounts are allowed below it. The end result of all this should be a mechanism by which users can organize their filesystem hierarchies without any need for administrative privileges, and without the risk of compromising system security. One might well wonder why this change to the mount() system call is called for, given that users have been able to do unprivileged mounts for years. The answer is that the current mechanism has a couple of shortcomings. Every potential unprivileged mount must be explicitly enabled via a line in /etc/fstab. That works well for simple situations, such as allowing a user to mount a CD or a USB storage device. When users start wanting to do more complicated things, like mounting their own special fuse filesystems, the /etc/fstab mechanism breaks down. There is a separate, setuid program which grants the right to make unprivileged fuse mounts, but it represents a workaround rather than a proper solution. The current user mount mechanism also requires that the mount utility be installed setuid root. Every setuid binary is a potential security hole, so there is value in eliminating privileged programs when possible. The unprivileged mount patch offers the possibility of eliminating the setuid mount program while simultaneously leaving policy control in the hands of the system administrator. So, unless something surprising comes up, chances are good that this capability will appear in the 2.6.25 kernel. Making code reviews easier with Review Board Reviewing code is a thankless, but very important, task for any software project. 
For free software projects, the "many eyes make all bugs shallow" aphorism only works if the eyes actually focus on the code in question. Review Board is a web-based application that helps reviewers examine the code, while making it easier for a developer to track those reviews. Born out of frustration with the process of code reviews at VMware, Review Board has made a great deal of progress since being released last May. The idea behind it is to centralize all of the pieces that need to come together for a review: code diffs, screenshots of UI functionality, comments by other developers, etc. On many projects, reviews are handled by email, but that can be difficult to use; various pieces of the puzzle are spread around in multiple messages and locations. Often a reviewer needs to see more context than a simple email diff provides or wants to comment on a related section of code that is not contained in the diff; each requires a reviewer to do more work. In a complicated set of changes, ensuring that the developer and any other reviewers can follow what code the comments pertain to can also be difficult. It is these kinds of problems that Review Board is meant to solve. Review Board presents a side-by-side diff view, shown at right, with lots of extras, many of which will be familiar to users of other graphical diff tools. Changed lines are highlighted in different colors based on whether they are additions, deletions, or changes. Changes on a particular line are highlighted in a slightly darker color so that they can be distinguished more easily as well. The numbered tabs along the left edge provide a link to a reviewer's comments about that section of the code. This is where Review Board shows that it is much more than just a diff viewer. Using AJAX techniques, Review Board allows a reviewer to interact very naturally with the code. They can highlight a certain section, which will pop up a text widget that records comments associated with that section of code. 
When other reviewers or the developer read those comments, the code snippet is included, with a link back to the code in the diff view. Each of these comments can then be commented upon, which allows for a conversation about the code to develop. It is not just code that can be annotated; screenshots of application functionality or bugs can be attached to reviews, as well. Sections of the screenshot can be highlighted and commented upon, as shown at left. This feature is an excellent example of where a web-based tool can shine; doing the same task in text-based email would be painful. Not all projects need it, but those that do will find it quite useful, as anyone who has spent time trying to describe a UI problem in email will attest. Inter-diffs is another useful feature that Review Board provides. Often in the code review process, several revisions of the original patch are made. It can be tedious to wade through a large diff, most of which has been uncontroversial (or resolved earlier) to get to the changes in the area of interest. Review Board has the ability to show changes between any two revisions of the patch, which should reduce much of the hassle. Another thing that Review Board does is to assist in managing code reviews. When a developer posts something for review, various reviewers can be notified via email. Review Board keeps track of that information, presenting users with a "dashboard" view of their pending reviews, both those they submitted and those that others have asked them to do. This high-level overview is the first screen the user sees when they log on to the system, shown at right. This makes keeping track of work that needs to be done – or who to prod to get a review moving again – much easier. Currently, Review Board best supports the Subversion and Perforce version control systems (VCS), but support for others, including the distributed VCSs Mercurial and git, is being actively developed and is usable in its current state. 
Released under an MIT license, Review Board is written in Python, using the Django web framework. Development is hosted at Google Code; the developers, unsurprisingly, use the software for internal code reviews. Other systems to assist in the code review process do exist. Codestriker is a Perl-based web application that has similar aspirations to Review Board. Also of interest is Python founder Guido van Rossum's first project at Google: a code review system he calls "Mondrian". It is closely tied to Google proprietary code, though, so it seems unlikely to be released as free software – though it might make an appearance as a tool for Google Code projects to use. Code reviews are very powerful, but generally painful to perform; any tool that claims that "Code reviews are fun again! ...almost.", as Review Board does, will be welcomed by many. It will be interesting to see whether a code review tracker becomes a standard part of newer free software projects. Over the last few years, we have seen the rise of distributed VCS, bug trackers, and wikis to assist in distributed development. Will Review Board – or something like it – be the next tool to be added? State of the unionfs LWN last looked at the unionfs filesystem almost exactly one year ago. Things have been relatively quiet on the unionfs front during much of that time, but unionfs has not gone away. Now the unionfs developers are back with an improved version and a determined push to get the code into 2.6.25. So another look seems indicated. The core idea behind unionfs is to allow multiple, independent filesystems to be merged into a single, coherent whole. As an example, consider a user with a distribution install DVD full of packages, a small disk, and painfully slow bandwidth. It would be nice to keep the DVD-stored packages around for future installation. 
What is also nice, though, is to be able to keep a directory full of updates from the distributor and use those, when they exist, in favor of the read-only DVD version. Using unionfs, this user could mount the DVD read-only, then mount a writable filesystem (for the updates) on top of the DVD. Updated packages go into the writable filesystem, but all of the available packages are visible, together, in the unified view. To avoid confusion, the user could delete obsoleted packages, at which point they would no longer be visible in the unionfs filesystem, even though they cannot actually be deleted from the underlying DVD. Thus unionfs allows the creation of an apparently writable filesystem on a read-only base; many other applications are possible as well. If a user rewrites a file which is stored on a read-only "branch" of a union filesystem, the response is relatively straightforward: the newly-written file is stored on a higher-priority, writable branch. If no such branch exists, the operation fails. Dealing with the deletion of a file from a read-only branch is trickier, though. In this case, unionfs will create a "whiteout" in the form of a special file (starting with .wh.) on a writable branch. Some reviewers have disliked this approach since it will clutter the upper branch with those special files over time. But it is hard to come up with another way to handle deletion, especially if (as is the case here) your goal is to keep core VFS changes to an absolute minimum. That hasn't kept the unionfs developers from trying, though. Off to the side, they have a version of unionfs which maintains a small, special-purpose partition of its own (on writable storage). Metadata (whiteouts, in particular) is stored to this special unionfs partition and no longer clutters the component filesystems. 
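The branch and whiteout behavior described above can be modeled in a few lines. This is only a toy sketch with in-memory dictionaries standing in for filesystems; the Branch and Union classes are hypothetical and say nothing about how the unionfs code is actually organized. The only detail taken from the text is the ".wh." whiteout prefix.

```python
WHITEOUT_PREFIX = ".wh."


class Branch:
    """One component filesystem, modeled as a name -> contents dictionary."""

    def __init__(self, name, files=None, writable=False):
        self.name = name
        self.files = dict(files or {})
        self.writable = writable


class Union:
    def __init__(self, branches):
        self.branches = branches   # ordered from highest to lowest priority

    def lookup(self, filename):
        for branch in self.branches:
            if WHITEOUT_PREFIX + filename in branch.files:
                return None        # a whiteout hides every lower-priority copy
            if filename in branch.files:
                return branch.files[filename]
        return None

    def _index_of(self, filename):
        for i, branch in enumerate(self.branches):
            if filename in branch.files or WHITEOUT_PREFIX + filename in branch.files:
                return i
        return len(self.branches) - 1

    def _writable_at_or_above(self, index):
        for i, branch in enumerate(self.branches[: index + 1]):
            if branch.writable:
                return i
        raise PermissionError("no writable branch at or above the file")

    def write(self, filename, data):
        target = self.branches[self._writable_at_or_above(self._index_of(filename))]
        target.files.pop(WHITEOUT_PREFIX + filename, None)
        target.files[filename] = data

    def delete(self, filename):
        if self.lookup(filename) is None:
            raise FileNotFoundError(filename)
        t = self._writable_at_or_above(self._index_of(filename))
        self.branches[t].files.pop(filename, None)
        # If a lower branch still holds a copy, mask it with a whiteout.
        if any(filename in b.files for b in self.branches[t + 1:]):
            self.branches[t].files[WHITEOUT_PREFIX + filename] = ""
```

Stacking a writable "updates" branch over a read-only "dvd" branch, a rewrite lands in the upper branch, and a delete leaves the DVD copy in place while hiding it behind a ".wh." entry.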
There are other advantages to the dedicated partition scheme, including the ability to include one unionfs as a branch in a second union; see the unionfs ODF document for more information on this approach, which the developers hope to slowly migrate into the version they are currently proposing for the mainline. Another persistent problem with unionfs has been coping with modifications made directly to the component branches without going through the union. The January, 2007 version of the patch came packaged with some dire warnings: direct modification of unionfs branches could lead to system crashes and data loss. Given that filesystems which have been bundled into a union still exist independently, they will always present a tempting target for modification, even when there is not a specific reason (wanting to put files onto a specific component filesystem, for example). So a unionfs implementation which cannot handle such modifications sets a trap for every user who uses it. The developers claim to have solved this problem in the current version of the patch. Now, almost every entry into the unionfs code causes it to check the modification times for the relevant file in all layers of the union. If the file turns out to have been changed, unionfs will forget about the file and reload the information from scratch, causing the most current version of the file (or directory) to be visible to the user. This approach solves the problem in a relatively efficient manner, with one exception: unionfs cannot tell when a process modifies a file which it has mapped into its address space with mmap(). So, in that case, changes may not be visible to processes accessing the affected file through the unionfs. In both cases, the unionfs developers would really prefer to have better support from the VFS. Some operating systems have provided native support for whiteouts, but Linux lacks that support. 
There is also no way for a filesystem at the bottom of a stack of filesystems to notify the higher layers that something has been changed. Fixing either of these would require significant VFS modifications, though, and the changes might propagate down into the individual filesystem implementations as well. So nobody is expecting them to happen anytime soon. Another significant change in unionfs is the elimination of the ioctl() interface for the management of branches. All changes to an existing unionfs are now done using the remount option of the mount command. This change eliminates the need for a separate utility for unionfs configuration and makes it possible to do complicated changes in an atomic manner. The end result of all this is that the unionfs hackers think that the time has come to put the code into the mainline. There, it would become the second supported stacking filesystem (the first being eCryptfs), and would help toward the long-term goal of making the VFS layer work better with stacking. Some people speak as if the merging of unionfs into 2.6.25 is a done deal, but that is not yet guaranteed. Christoph Hellwig, whose opinion on such things carries a heavy weight, is opposed to the unionfs idea: I think we made it pretty clear that unionfs is not the way to go, and that we'll get the union mount patches clear once the per-mountpoint r/o and unprivileged mount patches series are in and stable. Unionfs hacker Erez Zadok responds that unionfs is working - and used - now, while getting union support into the VFS is a distant prospect. So he recommends: I think a better approach would be to start with Unionfs (a standalone file system that doesn't touch the rest of the kernel). And as Linux gradually starts supporting more and more features that help unioning/stacking in general, to change Unionfs to use those features (e.g., native whiteout support). 
Eventually there could be basic unioning support at the VFS level, and concurrently a file-system which offers the extra features (e.g., persistency). When one looks at a recent posting of the union mount patch, it's hard to see it as a near-term solution. As described by its author (Bharata Rao), this work is in an early, exploratory state; there are a number of problems for which solutions are not really in sight. The union mount approach, which does the hard work in the VFS layer, may well be the right long-term approach, but it will not be in a state where it can be shipped to users anytime soon. In the end, the problem is a hard one, and unionfs has a considerable lead toward being a real solution. That, alone, is not enough to guarantee that unionfs will make it into the 2.6.25 kernel, but it does help that cause considerably. Anybody opposing the merger of unionfs will have to explain why the union filesystem capability should not be available to Linux users in 2008. A better btrfs Chris Mason has recently released Btrfs v0.10, which contains a number of interesting new features. In general, Btrfs has come a long way since LWN first wrote about it last June. Btrfs may, in some years, be the filesystem most of us are using - at least, for those of us who will still be using rotating storage then. So it bears watching. Btrfs, remember, is an entirely new filesystem being developed by Chris Mason. It is a copy-on-write system which is capable of quickly creating snapshots of the state of the filesystem at any time. The snapshotting is so fast, in fact, that it is used as the Btrfs transactional mechanism, eliminating the need for a separate journal. It supports subvolumes - essentially the existence of multiple, independent filesystems on the same device. Btrfs is designed for speed, and also provides checksumming for all stored data. Some kernel patches show up and quickly find their way into production use. 
For example, one year ago, nobody (outside of the -ck list, perhaps) was talking about fair scheduling; but, as of this writing, the CFS scheduler has been shipping for a few months. KVM also went from initial posting to merged over the course of about two kernel release cycles. Filesystems do not work that way, though. Filesystem developers tend to be a cautious, conservative bunch; those who aren't that way tend not to survive their first few encounters with users who have lost data. This is all a way of saying that, even though Btrfs is advancing quickly, one should not plan on using it in any sort of production role for a while yet. As if to drive that point home, Btrfs still crashes the system when the filesystem runs out of space. The v0.10 patch, like its predecessors, also changes the on-disk format. The on-disk format change is one of the key features in this version of the Btrfs patch. The format now includes back references on almost all objects in the filesystem. As a result, it is now easy to answer questions like "to which file does this block belong?" Back references have a few uses, not the least of which is the addition of some redundant information which can be used to check the integrity of the filesystem. If a file claims to own a set of blocks which, in turn, claim to belong to a different file, then something is clearly wrong. Back references can also be used to quickly determine which files are affected when disk blocks turn bad. Most users, however, will be more interested in another new feature which has been enabled by the existence of back references: online resizing. It is now possible to change the size of a Btrfs filesystem while it is mounted and busy - this includes shrinking the filesystem. If the Btrfs code has to give up some space, it can now quickly find the affected files and move the necessary blocks out of the way. So Btrfs should work nicely with the device mapper code, growing or shrinking filesystems as conditions require. 
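The uses of back references mentioned above can be illustrated with a toy model. The two dictionaries here are hypothetical stand-ins (a forward map from files to blocks, and a back-reference map from blocks to owners); they are not the Btrfs on-disk format.

```python
def check_back_references(file_blocks, block_owner):
    """Cross-check forward pointers against back references.

    file_blocks: file name -> set of block numbers the file claims to own
    block_owner: block number -> file name recorded in the back reference
    Returns a list of (file, block, recorded_owner) mismatches.
    """
    errors = []
    for name, blocks in file_blocks.items():
        for block in sorted(blocks):
            owner = block_owner.get(block)
            if owner != name:
                errors.append((name, block, owner))
    return errors


def files_affected_by(bad_blocks, block_owner):
    """Answer 'to which file does this block belong?' for a set of bad blocks."""
    return {block_owner[b] for b in bad_blocks if b in block_owner}
```

A shrink operation would use the same block-to-owner lookup to find which files hold blocks in the region being given up, so that those blocks can be moved out of the way.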
Another interesting feature in v0.10 is the associated in-place ext3 converter. It is now possible to non-destructively convert an existing ext3 filesystem to Btrfs - and to go back if need be. The converter works by stashing a copy of the ext3 metadata found at the beginning of the disk, then creating a parallel directory tree in the free space on the filesystem. So the entire ext3 filesystem remains on the disk, taking up some space but preserving a fallback should Btrfs not work out. The actual file data is shared between the two filesystems; since Btrfs does copy-on-write, the original ext3 filesystem remains even after the Btrfs filesystem has been changed. Switching to Btrfs forevermore is a simple matter of deleting the ext3 subvolume, recovering the extra disk space in the process. Finally, the copy-on-write mechanism can be turned off now with a mount option. For certain types of workloads, copy-on-write just slows things down without providing any real advantages. Since (1) one of those workloads is relational database management, and (2) Chris works for Oracle, the only surprise here is that this option took as long as it did to arrive. If multiple snapshots reference a given file, though, copy-on-write is still performed; otherwise it would not be possible to keep the snapshots independent of each other. For those who are curious about where Btrfs will go from here, Chris has posted a timeline describing what he plans to accomplish over the coming year. Next on the list would appear to be "storage pools," allowing a Btrfs filesystem to span multiple devices. Once that's in place, striping and mirroring will be implemented within the filesystem. Longer-term projects include per-directory snapshots, fine-grained locking (the filesystem currently uses a single, global lock), built-in incremental backup support, and online filesystem checking. 
Fixing that pesky out-of-space problem isn't on the list, but one assumes Chris has it in the back of his mind somewhere. ext3 metaclustering The ext3 system uses the classic Unix block pointer method for keeping track of the blocks in each file. For a given file, the on-disk inode structure contains space for twelve block numbers; they point to the first twelve blocks in the file - the first 48KB of space. If the file is larger than that, a 13th pointer contains the address of the first indirect block; this block contains another 1024 (on a 4K block filesystem) block pointers. Should that not suffice, there's a 14th pointer for the double-indirect block - each entry in that block is the address of an indirect block. And if even that is not enough, there's a 15th entry pointing to a triple-indirect block full of pointers to double-indirect blocks. This is a very efficient representation for small files - the kinds of files Unix systems typically held, once upon a time. In current times, when one can forget about that directory full of DVD images and never even notice the lost space, it does not work quite as well - there is a lot of overhead for all of those individual block pointers, and a large data structure to manage. That is why removing a large file on an ext3 filesystem can take a long time - the system has to chase down all of those indirect blocks, which, in turn, forces a lot of disk activity and head seeks. For this reason, contemporary filesystems tend to use extent-based mechanisms to associate blocks with files, but that is not really an option for ext3. An additional problem with all those indirect blocks is that filesystem checkers must locate and verify them all. That, again, causes a lot of head seeking and makes fsck run slowly. Slow filesystem checking was the motivation behind this patch from Abhishek Rai which attempts to improve performance on filesystems with a lot of indirect blocks. 
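To get a feel for how many indirect blocks the block pointer scheme described above requires, the arithmetic can be sketched as follows. The function name is hypothetical; the 12 direct pointers and the 1024 pointers per 4K block come from the text, which implies 4-byte block pointers.

```python
def indirect_blocks_needed(file_size, block_size=4096, pointer_size=4, direct=12):
    """Count the indirect, double-indirect and triple-indirect blocks
    needed to map a file of file_size bytes."""

    def ceil_div(a, b):
        return -(-a // b)

    ptrs = block_size // pointer_size           # 1024 pointers per 4K block
    remaining = ceil_div(file_size, block_size) - direct
    if remaining <= 0:
        return 0                                # direct pointers suffice
    count = 1                                   # the single indirect block
    remaining -= ptrs
    if remaining <= 0:
        return count
    covered = min(remaining, ptrs * ptrs)
    count += 1 + ceil_div(covered, ptrs)        # double block + its indirect blocks
    remaining -= covered
    if remaining <= 0:
        return count
    if remaining > ptrs ** 3:
        raise ValueError("file too large for the triple-indirect scheme")
    # triple block + its double-indirect blocks + their indirect blocks
    count += 1 + ceil_div(remaining, ptrs * ptrs) + ceil_div(remaining, ptrs)
    return count
```

By this count, a 4GB file on a 4K-block filesystem needs 1025 blocks of pointers, each of which fsck or a file removal has to visit.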
The approach taken is relatively simple: the patch just tries to group indirect block allocations together on the disk. The current ext3 code will allocate indirect blocks when they are needed to account for data blocks being added to the file; they are usually placed adjacent to those data blocks. One might think that this placement would speed subsequent accesses to the file, but that is not necessarily so; the reading or writing of the indirect block will tend to happen at a different time than operations on the data blocks. What this placement does accomplish, though, is the distribution of the indirect blocks all over the disk. So a process which must examine all of the indirect blocks associated with a file must cause the disk to do a lot of head seeks. The "metaclustering" approach works by reserving a set of contiguous blocks at the end of each block group. Whenever an indirect block is needed, the filesystem tries to get one from this dedicated area first. The end result is that all of the indirect blocks are located next to each other. Should somebody need to read a number of those blocks without being interested in the contents of the data blocks, they can grab them all quickly with minimal seeking. Filesystem checkers, as it happens, need to do exactly that - as does the file removal process. The patch did not come with benchmarks, but the speedup that comes from the elimination of all those seeks should be significant. Even so, Andrew Morton questioned the need for this patch, worrying that its benefits do not justify the risks that come with modifying an established, heavily-used filesystem: In any decent environment, people will fsck their ext3 filesystems during planned downtime, and the benefit of reducing that downtime from 6 hours/machine to 2 hours/machine is probably fairly small, given that there is no service interruption. Others disagreed, though, noting that it's the unplanned filesystem checks which are often the most time-critical. 
That includes the delightful "maximal mount count" boot-time check which, in your editor's experience, always happens when one is trying to get set up to give a talk somewhere. So this patch might just find eventual acceptance - it should be relatively low-risk and does not require any on-disk format changes. This is a filesystem patch, though, so nobody will be in any hurry to get it into the mainline before a lot of testing and review has been done. SAMP? A few articles making predictions for 2008 had put an initial public offering by MySQL on their list. The company had clearly been heading in that direction for a while; sales were growing, venture capital was coming in, etc. In the end, though, the MySQL IPO seems destined not to happen - Sun Microsystems got there first. The deal is structured as a full acquisition - Sun will pay about $800 million for all outstanding shares of MySQL stock. In addition, about $200 million in options will be covered, so, overall, this is a billion-dollar deal. Not bad for a company which is based on free software. Sun is making the right noises about how this deal will work. There is no talk of taking MySQL proprietary or changing its license. MySQL will continue to be supported on all platforms, and not just Solaris. A series of grants will be made to help university researchers advance the state of the art in database management systems. There is a lot of talk about continuing to support "the community," though details are (perhaps necessarily) scarce. CEO Jonathan Schwartz says that Sun will be working to improve "the rest of the LAMP" stack, though he says nothing about the "L" (for Linux) part. Chances are that this deal will be a good thing for MySQL users. Sun is clearly making MySQL an important part of its overall strategy (in these days, one does not toss $1 billion toward unimportant objectives) and can be expected to continue - or accelerate - development of the system. 
Sun's free software orientation is strong enough that the chances of parts or all of MySQL going proprietary seem small. Indeed, nothing in Sun's releases says anything about MySQL's commercial licensing business; the emphasis appears to be strongly on support and services. So MySQL might just become even more open than it is now. Sun appears to be positioning itself to compete strongly with Oracle. Both companies are working hard to be able to offer the entire software stack to their customers. So Oracle's push into the Linux distribution business and Sun's database venture are both aimed at having the same story for their sales staff to tell: we, in some way, own and control all of the software you are looking to run. No problems with incompatibilities, finger-pointing, etc. As an added bonus, Sun will happily sell you the hardware you need too. Do expect an increase in efforts aimed at moving MySQL users away from the (Oracle-owned) InnoDB engine, though. For Sun to sell that story, though, it will have to continue to push Solaris hard as an alternative to Linux. Either that, or the company will eventually find itself shopping for a Linux distributor of its own. Either way, it seems likely that competitive pressures for operating systems (and higher layers) sales and support are set to increase, especially in the high-performance web server area. Red Hat, whose PostgreSQL-based database offering appears to have fallen below the radar, may find itself scrambling for a response. Sun makes a big point of being able to sell the entire package, and there is some truth to that. Processors, storage, systems software, database software, programming languages, office suites, and more can all be had from one company. What remains to be seen is whether this is really what customers want. There is a lot of value in being able to integrate components from multiple sources and not being dependent on a single vendor. 
Your editor, who managed a transition from being an all-DEC shop to an all-Sun shop some twenty years ago, is not convinced that those days are worth going back to. A kernel security hole Security holes can sneak into code in surprising ways, even in highly scrutinized codebases. Perhaps even more surprising is how long they can persist in something as popular as the Linux kernel before someone notices. The release of stable kernels 2.6.22.16 and 2.6.23.14 this week is instructive for both of those reasons. The bug that led to the releases is fixed by a two-line patch, but might be exploitable to cause filesystem corruption. If it were a bug in a driver for an obscure piece of hardware, with relatively few users, it might have been less eye-opening, but it was in the Virtual File System (VFS) layer of the kernel. VFS is the abstraction that allows all kernel filesystems to be used identically regardless of their underlying implementation. The open() system call is used to open any file on any type of filesystem; VFS is what makes that work. In fact it is the open() path that is affected by the bug. Due to a faulty test, the bug allows directories to be opened for writing, which is generally a recipe for disaster. It could also allow a file on a read-only filesystem to be opened for writing – depending on the underlying filesystem implementation, that could lead to corruption. Both problems are only locally exploitable. The bug was introduced in a change to support NFS in October of 2005 – more than two years ago; all kernels since 2.6.15 are affected. The change was aimed at making NFSv4 open calls be atomic (because an open is really a lookup followed by an open), but also did some code reorganization that changed the semantics of a flag variable. That variable was being used to determine the access mode for directories and read-only filesystems, so that change subtly broke the tests. 
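A toy model can show how a test against the wrong bitmask lets a write request slip through. Everything in this sketch is illustrative: the bit values, the helper functions, and in particular the assumption that a truncate request implies write access in the access mask without setting the write bit in the flag word. It is not the kernel's code; only the names flag and acc_mode follow the kernel's may_open() parameters.

```python
# Illustrative bit values only, chosen so the two masks look deceptively alike.
FMODE_READ, FMODE_WRITE = 0x1, 0x2   # bits in the open "flag" word
TRUNCATE = 0x200                     # hypothetical "truncate on open" bit in flag
MAY_WRITE, MAY_READ = 0x2, 0x4       # bits in the access-mode mask


def access_mode(flag):
    """Derive the permission mask from the open flags."""
    acc = 0
    if flag & FMODE_READ:
        acc |= MAY_READ
    if flag & FMODE_WRITE:
        acc |= MAY_WRITE
    if flag & TRUNCATE:
        acc |= MAY_WRITE   # assumption: truncation needs write permission
    return acc


def may_open_buggy(is_dir, flag, acc_mode):
    # Tests the flag word, so a request that only implies a write is missed.
    if is_dir and (flag & FMODE_WRITE):
        return "EISDIR"
    return "ok"


def may_open_fixed(is_dir, flag, acc_mode):
    # Tests the derived access mode, which covers the implied write too.
    if is_dir and (acc_mode & MAY_WRITE):
        return "EISDIR"
    return "ok"
```

In this model a directory opened with FMODE_READ | TRUNCATE passes the buggy test but is refused by the fixed one, while a plain write request is refused by both, which is the sort of case that ordinary testing would never expose.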
Part of the problem is that the tests are in a function called may_open(), which takes two flag parameters, acc_mode and flag. The incorrect code was using flag in the tests when it should have been using acc_mode. Each of them is a bitmask of values that, on first glance, might be easy to confuse – each is related to permissions. The bit values for each have names like FMODE_WRITE and MAY_WRITE, which would seem to have a fair amount of overlap. This may explain why the problem was not spotted at the time it was introduced. There may be no easy solution to this kind of problem – other than more scrutiny. Using different types, rather than plain int, for each flag might have helped, but since the tests were using the right kind of bit values for flag, that is a somewhat hard sell. Something unpleasant to consider in all of this is that this may not be the first time this problem has been noticed. It may just have been the first time it was noticed by someone who reported it. Folks with a malicious intent are much less inclined to report bugs. This particular bug is not one that would be particularly useful to attackers, but we would do well to remember that fixing a two-year-old hole means that systems were vulnerable for all that time. It is not only the good guys who can read code. Use Ubuntu Tweak to adjust hidden GNOME options Ubuntu Tweak is a GNOME desktop configuration tool that works with versions 7.04 and 7.10 of the Ubuntu distribution. From the application's splash screen: This is a tool for Ubuntu which makes it easy to change hidden system and desktop settings. Ubuntu Tweak is currently only for the GNOME Desktop Environment. Version 0.2.4 of Ubuntu Tweak was announced in December, 2007: "With many bugs fixed and two language added, the stable version of Ubuntu Tweak 0.2.4 released!" Installation was trivial: the .deb file was downloaded in the Firefox web browser; that, in turn, allowed the installer application to be run. A minute later, the software was ready to go. 
The application was automatically added to the GNOME Applications/System Tools pulldown menu. So, what can Ubuntu Tweak do? There are a number of top-level icons, some with multiple sub-icons. Top-level categories include: Computer, Startup, Desktop, System and Security. Clicking on the Computer icon reveals useful information such as the hostname, distribution version, kernel rev, platform, CPU type and speed and memory capacity. The username, home directory, shell and default language are also displayed. The Startup icon allows the user to toggle features such as the automatic saving of session changes, the logout prompt, remote TCP connections and the splash screen. The Desktop icon allows numerous features to be adjusted on the Desktop Icon Settings, the Metacity window manager, Compiz Fusion, the GNOME panel and menu and the Nautilus file browser. The System icon has toggles and sliders for controlling various power management parameters. Lastly, the Security option has toggles for disabling the Run Application dialog, the Lock Screen, Printing, Printer Setup, Save to Disk and User Switching. That's about all there is to this version of Ubuntu Tweak, though there is room to add many more control options. Ubuntu Tweak seems like a useful tool for managing options that don't really fit anywhere else on the desktop environment. The only surprise is that this is, by name, only useful for the Ubuntu distribution. It seems as though making a multi-distribution GNOME-tweak would not require many changes to the code. Is Gentoo in crisis? It all started with a blog post by Daniel Robbins. That was on January 11. But of course, it didn't really start there. That's just when the internal furor over the revocation of the Gentoo Foundation's corporate license became public. Developers had been trying to figure out what to do in the internal gentoo-core mailing list for about a week, and as such things do, it leaked. The larger-scale problems didn't even start there. 
The Gentoo Weekly Newsletter hasn't been posted for 13 weeks, and the Gentoo homepage hadn't seen any changes in the same amount of time. Furthermore, Gentoo's second release of 2007, dubbed 2007.1, never happened and on Monday was announced canceled. What do these problems mean? Is Gentoo collapsing? Another blog post by Daniel Robbins suggests part of the answer—serious communication problems exist between developers and the rest of the Gentoo community. The relevant aspect here is that developers are so focused on working in their little areas that they fail to tell the world what they're doing. Everyone wants to develop, and nobody wants to spend time telling the world what's being developed. Most developers don't want to spend time doing anything but develop. In the same way, developers don't enjoy spending time dealing with "boring" issues like donations, copyright, tax returns, etc., nor are they generally any good at it. Development remains active in the background—new versions of packages appear, bugs are fixed, the gentoo-dev mailing list is quite active, and so is IRC. Developers continue to blog on Planet Gentoo. But none of that is apparent to Gentoo users, who go to the homepage, read the weekly newsletter, and wait for the next release. To users, things can look like they're in stasis. That's where Gentoo needs to concentrate its efforts: telling the world what developers are doing. To accomplish that, the project will either need to find new contributors interested in doing this or streamline its processes so that less effort is required to communicate (for example, automatically including Planet information or new versions from packages.gentoo.org on the homepage). Specifically, one hope with the foundation is to hand off the work to people who enjoy dealing with it, so developers can concentrate on development—people at Software in the Public Interest, or the Software Freedom Conservancy. 
An announcement on the Gentoo homepage proposing a move to a monthly newsletter brought nearly 20 offers of help in only 2 days, so it may be that the project hasn't been looking for non-development help in all the right places. Gentoo isn't dying, but its developers need to tell that to the world. Ten-year timeline part 3: The Tucows years This is the third installment in a ten-year retrospective inspired by LWN's tenth anniversary; those who have not yet seen them may want to have a look at Part 1 and Part 2. At the end of the second part, LWN had just emerged from the peak of the dotcom bubble having made a deal with Tucows. For almost two years we operated as a part of that company; here are some highlights from that time. April 13, 2000: Linuxcare postpones its IPO indefinitely and rearranges its management. Minix is released as free software. April 20, 2000: Linux Business Expo in Chicago. Microsoft's FrontPage back door is exposed. Devfs flame wars continue. Red Hat fired by its ad agency. Shares of Caldera, VA Linux Systems and Andover.Net all fall below their IPO prices. April 27, 2000: Oracle creates Miracle Linux in Japan. Red Hat launches its embedded developer's kit. May 4, 2000: Linuxcare lays off 35% of its staff and officially cancels its IPO. Needless to say, by this time we were happy to have found a relatively stable place to be - times were starting to look a little tough. Between the end of the Linuxcare IPO - once supposed to be the biggest and best of them all - and the fact that other Linux companies had fallen below their initial prices, it seemed that the honeymoon was pretty well over. By this time, LWN's revenue stream from advertising had pretty well dried up too. Red Hat's embedded business is a classic case of a lost opportunity. The acquisition of Cygnus should have placed Red Hat in a strong position in this sector, but, somehow, it all slipped away.
May 11, 2000: Red Hat changes direction, dumps its news site, and jumps into the venture capital business. The first public BitKeeper release happens. The Free Standards Group is formed. May 18, 2000: Rumors of Wine 1.0. IBM releases the S/390 port. Memory management problems plague the pre-2.4 development kernels. One might think it cynical and mean-spirited to point out that we're still waiting for Wine 1.0. But we'll do it anyway. The memory management issues with 2.4 were to be with us for some time, as it turned out. May 25, 2000: The Linux Mall and EBIZ merge. Lineo files for an IPO. Eric Raymond decides to rewrite the kernel configuration system. June 8, 2000: A fight over whether Reiserfs should go into the 2.4 kernel. June 22, 2000: British Telecom claims to own a patent on linking and starts suing ISPs for being part of the world wide web. 2.4.0 test kernels come out in two flavors with different memory managers. More Reiserfs flames. Given that the 2.4.0 release was far overdue, one would think that arguments over whether a completely new filesystem should be added would be considered out of place. But they did happen, with Hans Reiser showing a level of anger and paranoia that put much of the community off of dealing with him for years. It is rare that kernel developers are accused of putting corporate interests above those of the kernel as a whole, but that happened here. It is actually worth reflecting on this a bit: kernel developers work for roughly 200 companies, many of which are direct competitors. But that competition has remained almost entirely absent from the development process. We are very good at developing common resources in a highly collaborative way while competing at different levels. June 29, 2000: MySQL switches to the GPL, moves to SourceForge. 2.4.0-test2 is officially blessed with penguin pee. July 20, 2000: Miguel de Icaza proclaims that "Unix sucks" at OLS. Sun releases StarOffice under the GPL.
Rumors circulate that Caldera might acquire SCO; if only we'd known where that would go. Larry Wall announces that Perl 6 will be a complete rewrite of the language. If only we'd known where that would go - or not go. A set of locking changes goes into the 2.4.0-test kernel - which is allegedly stabilizing for release. August 3, 2000: Copyleft is sued by the DVDCCA for putting the DeCSS code on T-shirts. Caldera's acquisition of SCO's Unix business (and name) becomes official. August 17, 2000: The GNOME Foundation is formed. Debian 2.2 ("potato") is released. August 24, 2000: KDE/GNOME flame wars break out anew. Eric Raymond strongly criticizes Linus's management practices. VA Linux claims that SourceForge hosts "over 76%" of the world's free software. Caldera/SCO announces the "Linux and Unix marriage" - something it will wish to annul later on. Something which was widely understood, but little talked about, during this time was the great amount of effort VA Linux put into recruiting projects to SourceForge. It was a clear effort to become the home for as much software as possible. Quite a few prominent projects moved over with great fanfare, only to drift away more quietly later on. SourceForge still hosts a great many projects, but it is seen by many now as a home of last resort. August 31, 2000: The Open Source Development Lab announces its existence. September 7, 2000: Trolltech releases Qt under the GPL. The CueCat saga begins. The RSA patent is released into the public domain - two weeks before it expires. Lest anybody think that the dotcom silliness was truly over by this point, the CueCat story should convince them otherwise. Digital Convergence spent many millions of dollars sending around free barcode scanners on the idea that people would want to swipe codes from advertisements and be taken to the associated web site.
This company considered using the scanner for any other purpose to be a violation of the DMCA, and made loud threats at people distributing drivers which enabled such uses. The company's threats came to nothing, but they foreshadowed the DMCA follies to come. September 14, 2000: Linus decrees that the kernel is licensed under version 2 (only) of the GPL. September 21, 2000: Sun acquires Cobalt Networks. Caldera dumps $3 million into EBIZ. Linus proclaims the kernel to be in "final freeze," with only critical fixes being accepted. September 28, 2000: the Red Hat Network launches. Red Hat 7 is released, featuring "gcc-2.96," a release which the GCC project never made. The Red Hat Network was the core of what was to become the subscription services which support the company so nicely now. Back then, though, that outcome still was not clear, and Red Hat continued to experiment with a number of business ideas. October 26, 2000: KDE 2.0 is released. LynuxWorks files for an IPO. November 2, 2000: Turbolinux files for an IPO. Linuxcare shuts down its European operation. Linus describes the 2.4.0-test10 kernel as having "no known bugs." December 7, 2000: The 2.4.0-test12 prepatches include the new PA-RISC architecture and rework of the task queue API - both of which, apparently, were fixes for critical problems. EBIZ tells its shareholders that things will get better soon, honest. December 21, 2000: Corel sells its Linux business to (what becomes) Xandros. January 11, 2001: the 2.4.0 kernel is released at last. Linus warns that it's not yet open season for new patches. The first SELinux prototype is released. Many people had begun to worry that 2.4.0 would never come. The story of the development of this kernel, though, was not done yet. January 18, 2001: The Ramen worm attacks Red Hat Linux systems. Turbolinux and Linuxcare agree to merge. Lineo withdraws its IPO application. VA Linux warns that earnings will not be up to expectations. 
Helix Code gets $15 million in venture investments. The InterBase backdoor is discovered. Reiserfs gets merged for the 2.4.1 kernel. The first linux.conf.au happens. February 8, 2001: SUSE (still SuSE then) lays off most of its US staff. February 22, 2001: VA Linux lays off 25% of its staff, gets a new CEO. Turbolinux cancels its IPO. Microsoft's Jim Allchin calls Linux "un-American". March 15, 2001: Eazel releases Nautilus 1.0, lays off half its staff. March 22, 2001: The Stanford Checker surfaces with a long list of potential kernel bugs. EBIZ announces a plan to acquire Linux NetworX. By this point, things were looking downright scary. During the bubble days, almost anybody who wanted to work in free software development could get a job somewhere. By this point, though, quite a few people were without jobs and some of them were leaving the community altogether. The Stanford Checker was a GCC derivative which could do static analysis; for many, it was the first real demonstration of what that kind of tool could do. Despite some early reassurances, this code was never released; instead, it was used to found Coverity. The community has benefited strongly from Coverity's work, but imagine what we could have done with the source to the Checker. It is a little sad that we have been unable to develop similar capabilities in free software. April 5, 2001: Wind River Systems buys BSDi. The first kernel summit is held. Alan Cox states that the 2.4 kernel is not yet stable. Larry Wall begins to post the design of Perl 6. April 19, 2001: Wind River Systems lays off the Slackware staff. MandrakeSoft starts asking for donations from users. April 26, 2001: Ed Felten receives DMCA threats over his breaking of the Secure Digital Music Initiative watermarking scheme. Eric Raymond proclaims his intent to hack the kernel's social systems. 
The threats against Ed Felten - who had participated in a contest put on by SDMI proponents - were a strong signal that, in the U.S., the DMCA could bite developers hard. Worse was to come, though. Meanwhile, Eric Raymond's attempts to "hack" a rather unimpressed kernel community provided a steady stream of comic relief. May 3, 2001: Turbolinux and Linuxcare cancel their merger. VA Linux posts horrific quarterly earnings. Sony releases Linux for the Playstation 2 console. May 10, 2001: EBIZ cancels its acquisition of Linux NetworX. The Bergen Linux Users Group implements RFC 1149. May 17, 2001: Eazel shuts down. Enhanced Software Technologies - owned by Atipa - shuts down. May 24, 2001: MandrakeSoft lays off 20% of its employees, including its CEO. Your editor has said previously that Eazel's plan never seemed (to him) to make sense; the investors finally came to the same conclusion and pulled the plug. Another plan which did not make sense was what had happened to MandrakeSoft: outside managers placed in the company by its venture capitalists had decided that Mandrake should be an e-learning company - not exactly its area of core expertise. That strategy just about destroyed MandrakeSoft before the decision to go back to its distributor roots was made. The company has taken many years to recover from that mistake. June 21, 2001: Red Hat turns a profit. GCC 3.0 is released. June 28, 2001: Caldera announces plans to move its distribution to per-seat licensing. Linus announces that the 2.5 development series will open "in a week or two." Meanwhile memory management problems continue to plague the 2.4 kernel (now at 2.4.5). VA Linux leaves the hardware business. MandrakeSoft announces plans for an IPO. LynuxWorks withdraws its IPO application. In these difficult days, the fact that Red Hat could produce a profit - even a tiny one - offered a ray of hope.
The failure of VA Linux to make it in the hardware business was a sobering counterexample, though, given that VA was once the most prominent company selling Linux-installed systems. July 4, 2001: Version 1.0 of the Linux Standard Base is released. July 12, 2001: The Mono project is launched. Atipa shuts down. July 19, 2001: MySQL and NuSphere end up alleging GPL violations (and more) in court. Dmitry Sklyarov is arrested on DMCA charges in Las Vegas. EBIZ warns stockholders that more money must be found or the company will not be viable. More than anything else, the arrest of Dmitry was a wakeup call for the community. It seemed that, in the U.S., any developer could be arrested for interfering with the business plans of large companies. As a result of this action, some developers still refuse to travel to the U.S. August 2, 2001: MandrakeSoft completes its IPO, raising €4.2 million. August 16, 2001: LWN co-founder and editor Liz Coolbaugh leaves LWN. We still miss Liz - but she remains a good friend. August 30, 2001: Dmitry Sklyarov is charged with conspiracy and faces 25 years in prison. VA Linux takes the SourceForge software proprietary. September 6, 2001: IBM and others put millions of dollars into SUSE to keep it from bankruptcy. Sistina takes its Global Filesystem (GFS) proprietary. September 13, 2001: Caldera turns in horrific quarterly earnings; layoffs and a reverse stock split follow. Lineo lays off a large portion of its staff. Great Bridge, a company seeking to commercialize PostgreSQL, shuts down entirely. EBIZ goes into chapter 11 bankruptcy. September 27, 2001: The 2.4.10 kernel is released. Few people remember September, 2001, as one of their favorite months. Beyond the terrible events occurring in the wider world, the problems in the commercial Linux sector just seemed to get steadily worse. The 2.4.10 kernel release is an important point as well. 
Here is where the longstanding memory-management problems came to a crux; Linus responded by ripping out the 2.4.9 VM code and replacing it with a completely different implementation. What followed may be the closest we ever came to a fork in the Linux development process. Some distributors stayed with 2.4.9 for a long time - RHEL 2 systems (still supported by Red Hat) are still running a kernel which, at least, claims to be 2.4.9. The worst passed, however, and this is the point at which 2.4 started toward something resembling stability. October 4, 2001: The World Wide Web Consortium proposes allowing patented technology with proprietary licensing into web standards. SUSE brings in another round of funding and announces the layoff of 120 people. October 11, 2001: Michael Hammel leaves LWN. Tucows, which had not been helped by having launched a major new offering on September 11, laid off a number of people, including Michael. His desktop columns had been a welcome addition to LWN, and his departure was a big loss. October 18, 2001: Progeny stops development of its Debian-based distribution. October 25, 2001: Lindows announces its existence. November 8, 2001: Linus announces that 2.5 will start soon. Marcelo Tosatti is named as the 2.4 maintainer. IBM open-sources Eclipse. The European software patent directive picks up steam. November 29, 2001: The 2.5 kernel development series starts - with a filesystem corruption bug. December 6, 2001: The Mandrake Club is launched as a fund-raising initiative. Initially the Mandrake Club was meant to function as a sort of tip jar. As financial problems at MandrakeSoft got worse, though, it became the storefront through which the Mandrake distribution was sold. Not everybody liked how the Club was run, but it doubtless helped MandrakeSoft to survive into the present. December 20, 2001: Charges against Dmitry Sklyarov are "deferred" and he returns home to Russia. January 17, 2002: DeCSS creator Jon Johansen is indicted in Norway. 
January 31, 2002: LWN is unacquired. 2.5 kernel patches get dropped, leading to another "Linus does not scale" discussion. The indictment of Mr. Johansen made it clear that DMCA-like problems were not limited to the USA. Meanwhile, by this time, Tucows had come to terms with the fact that its acquisition (and ongoing operation) of LWN was not helping it, given the directions its business was taking. So, after some discussion, LWN was unacquired - it was given back to its creators, with Tucows holding on to a small piece just in case. The parting was on the best of terms; it revalidated our decision to go with Tucows in the first place. But, after almost two years, it was time for LWN to venture back out into a scary world as an independent business. That was the beginning of a new phase, with its own ups and downs, which will be discussed in the next installment. The LV2 Audio Plugin Standard LADSPA, Richard Furse's Linux Audio Developer's Simple Plugin API, provides a plug-in framework for software audio effects. LADSPA applications are divided into two categories, host applications and plugins. From the LADSPA site: LADSPA is a standard that allows software audio processors and effects to be plugged into a wide range of audio synthesis and recording packages. For instance, it allows a developer to write a reverb program and bundle it into a LADSPA "plugin library." Ordinary users can then use this reverb within any LADSPA-friendly audio application. Most major audio applications on Linux support LADSPA. Recently, the LV2 Audio Plugin Standard was announced by Dave Robillard; the aim of LV2 is to replace LADSPA: LV2 is a standard for plugins and matching host applications, mainly targeted at audio processing and generation. LV2 is a simple but extensible successor of LADSPA, intended to address the limitations of LADSPA which many applications have outgrown.
While LADSPA has been quite successful with many plugins and hosts, it is quite limited and can't be extended without breaking existing implementations. LV2 in contrast is designed with extensibility in mind right from start. One of the LADSPA limitations comes from the use of fixed data fields in the plugin binaries. LV2 defines its plugin data by using the Resource Description Framework (RDF) standard. This allows for a much wider variety of plugin data definitions. The RDF files also allow for the inclusion of multiple string definitions, which allows for plugin internationalization. The core LV2 code is intentionally designed to be small and generic, while allowing for support of independently designed extensions. Plugin identification has been changed from an ID number to a URI; this allows for extended capabilities such as the reference or fetching of plugins across the network. While LADSPA only used floating point numbers for port connections, LV2 supports port type extensions. This can be used to handle MIDI, OSC (OpenSound Control), frequency domain and other types of data. LV2 bundles all of the data for each plugin into a single directory for easy access. As with ALSA, the actual lv2 core specification is relatively simple: the lv2core-1.tar.gz source file consists of a C header file, some build files and documentation. Several software packages were released at the same time as the LV2 standard announcement. SLV2 0.4.2 is a C library that is used to access the LV2 plugins: "Unlike LADSPA, LV2 is (more or less) designed with the assumption that hosts will use a library to discover/load/use plugins. SLV2 is one such library, which does the Right Thing with as little burden on host authors as possible."
The lv2dynparam extension and helper was also announced: "The extension consists of a header describing the extension interface and libraries, one for plugins and one for hosts, to expose functionality in more usable, from programmer point of view, interface." Three LV2 compatible plugins were also announced by author Nedko Arnaudov; these include the lv2vocoder version 1, Simple Sine Generator 20080109 and zynadd plugin version 1. Arnaudov also released zynjacku version 1, a JACK based GTK2 host for LV2 synthesizers. The success of LV2 will revolve around its adoption by one or more of the major LADSPA applications, as well as the conversion of more LADSPA plugins. Conceptually, LV2 seems like a step forward for the Linux audio plugin architecture. Finding system latency with LatencyTOP Stuttering audio or an unresponsive desktop – typically caused by operating system latency – are two things that annoy users. They can be difficult problems to diagnose, though, as they are transient and buried deep inside the kernel. A new tool, LatencyTOP, seeks to provide more information on where latency is occurring so that it can be fixed or avoided. Latency is the measure of how much time elapses between when an action is initiated and when its effects become visible. If a user clicks the mouse button in an application, the latency is the amount of time between that click and when the associated action begins. There are lots of different reasons for latency, some of which are outside of Linux's control; being able to measure what latency the OS is contributing will be very useful. LatencyTOP is reporting on a specific subset of latency causes, as described in the announcement: There are many types and causes of latency, and LatencyTOP [focuses on the] type that causes audio skipping and desktop stutters.
Specifically, LatencyTOP focuses on the cases where the applications want to run and execute useful code, but there's some resource that's not currently available (and the kernel then blocks the process). This is done both on a system level and on a per process level, so that you can see what's happening to the system, and which process is suffering and/or causing the delays. LatencyTOP measures the average and maximum amount of latency in various operations by inserting annotation calls in the kernel. An example from the announcement is instructive: The scheduler accumulates any time spent sleeping, between the set_latency_reason() and restore_latency_reason() calls, charging it to the "sync system call". Any lower level calls to set the latency reason will be ignored in this code path – they may be useful in other code paths – as it is the highest level active reason that gets charged. The current interface for annotating is likely to change, though the semantics will stay the same. Comments on the original submission suggested using the kernel markers feature that was merged for 2.6.24. LatencyTOP developer Arjan van de Ven seems amenable to that; reusing a kernel interface, rather than adding a new one, is generally the right choice. There is other work to do as well; the patch was submitted for other kernel hackers to test and comment on, not to be merged into the mainline. LatencyTOP comes with a userspace application, shown at right, that displays the information gathered. It reads from the /proc/latency_stats file that is created by the LatencyTOP infrastructure patch – so long as you enable CONFIG_LATENCYTOP in the kernel. It displays the nine – an off-by-one in the code as it would seem that ten were intended – largest latencies over the past 30 seconds in the upper pane. A list of process names runs along the bottom of the display, which can be selected with the arrow keys. The latency sources for that process will then be shown in the lower pane.
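The "highest level active reason gets charged" rule described above can be modeled in a few lines of userspace Python. This is only a sketch of the semantics, not the kernel patch: the LatencyAccounting class, the account_sleep() method and the idea of returning the previous reason as a token are assumptions made for illustration; only the names set_latency_reason and restore_latency_reason come from the announcement.

```python
class LatencyAccounting:
    """Toy model of LatencyTOP's annotation semantics (not kernel code)."""

    def __init__(self):
        self.active_reason = None   # highest-level reason currently in force
        self.totals = {}            # reason -> accumulated sleep time

    def set_latency_reason(self, reason):
        """Set the reason unless a higher-level one is already active.

        Returns the previous state so the caller can restore it later.
        """
        previous = self.active_reason
        if self.active_reason is None:
            self.active_reason = reason
        # Otherwise the lower-level request is ignored, as the text describes.
        return previous

    def restore_latency_reason(self, previous):
        """Undo the matching set_latency_reason() call."""
        self.active_reason = previous

    def account_sleep(self, duration):
        """Charge sleep time to whichever reason is active right now."""
        reason = self.active_reason or "Unknown reason"
        self.totals[reason] = self.totals.get(reason, 0) + duration


acct = LatencyAccounting()
outer = acct.set_latency_reason("sync system call")
inner = acct.set_latency_reason("writing a page")   # ignored: outer wins
acct.account_sleep(500)
acct.restore_latency_reason(inner)
acct.account_sleep(200)
acct.restore_latency_reason(outer)
acct.account_sleep(50)                               # no annotation active
```

In this model the nested annotation never shows up in the totals; all 700 units slept inside the outer pair are charged to "sync system call", and time slept with no annotation in force is left unattributed.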
The example at left shows the tool with the firefox process selected. As can be seen, there are still lots of areas that need annotations – "Unknown reason" along with the wait channel are displayed when the reason has not been set. When narrowing a problem down, it should be straightforward for a kernel hacker to add annotations to the appropriate locations. LatencyTOP, like its sibling PowerTOP – also developed by van de Ven at the Intel Open Source Technology Center – is a powerful tool for trying to track down system problems. It will probably undergo some changes along the way: the userspace application is still rather rudimentary and the kernel data collection needs finer-grained locking. But, before too long, a mainstream tool to measure system latency based on this work should appear. A better ext4 Last week's Kernel Page may have been filesystem-heavy, but there was still a big omission, in the form of ext4. But ext4, being the successor to ext3, may well be the filesystem many of us are using a few years from now. Things have been relatively quiet on that front - at least, outside of the relevant mailing lists - but the ext4 developers have not been idle. Some of their work has now come to the surface with Ted Ts'o's posting of the ext4 merge plans for 2.6.25. One of the changes going into ext4 is a lifting of the longstanding 4KB block size limit. That does not mean that just any block size works, though, and this feature will benefit fewer people than one might think, for one specific reason: the block size must still be no larger than the page size on the host system. So those of us running x86 systems with 4KB pages will be stuck with 4KB blocks still. And, on any system, the maximum block size is now 64KB. One amusing effect of this change is that the size of a directory entry can now be as large as 64KB as well. But the field which holds the size of directory entries is only 16 bits wide. 
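The arithmetic behind that problem is easy to check: a 16-bit field tops out at 65535, one short of a 64KB entry. The Python sketch below shows the mismatch and one way a reserved value could stand in for the full size. The encoding is hypothetical and chosen for illustration (it assumes no real entry is ever exactly 65535 bytes long); it is not presented as the actual ext4 on-disk rule, and the function names are invented.

```python
FIELD_BITS = 16
FIELD_MAX = (1 << FIELD_BITS) - 1   # 65535: largest value the field can hold
MAX_ENTRY_SIZE = 64 * 1024          # 65536: an entry spanning a whole 64KB block


def encode_rec_len(size):
    """Squeeze an entry size into 16 bits (hypothetical sentinel encoding)."""
    if size == MAX_ENTRY_SIZE:
        return FIELD_MAX            # reserved value meaning "64KB"
    if not 0 < size < FIELD_MAX:
        raise ValueError("entry size %d cannot be represented" % size)
    return size


def decode_rec_len(raw):
    """Recover the entry size from the stored 16-bit value."""
    if raw == FIELD_MAX:
        return MAX_ENTRY_SIZE
    return raw
```

Every size from 1 to 65534 round-trips unchanged, and the one value that would otherwise overflow is carried by the reserved pattern.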
So a special hack has been employed to recognize 64KB directory entries and keep everything consistent. Some internal variables have overflow problems as well. Block numbers are stored as a signed, 32-bit quantity, and so are block group numbers. That limits the maximum size of a filesystem to a mere 256PB. In 2.6.25, these values will become unsigned long variables, eliminating that intolerably low limit. Through some trickery, the inode field which stores the number of blocks associated with a file will be expanded to 48 bits, raising the maximum size of an individual file to just under 2^48 512-byte blocks. The work does not stop there, though: another patch redefines that field to mean the number of filesystem blocks (instead of 512-byte sectors) used by the file. This is a change which has to be handled carefully, since it is an on-disk format change which could create trouble for people with existing ext4 filesystems. Everybody who is using ext4 should certainly be doing so with the knowledge that it's a development filesystem and is only suitable for storing files which are not valuable for more than about 30 minutes - Rawhide OpenOffice.org updates, say. But it still would be nice to not trash every existing ext4 filesystem out there. So the i_blocks field will continue, by default, to hold the number of 512-byte blocks. But, if that field exceeds 32 bits and forces the use of 48-bit numbers, it is thereafter interpreted as filesystem blocks. Since no existing filesystems are yet using 48-bit numbers, this approach successfully avoids breaking them. Journal checksums are another feature arriving for 2.6.25. If the system crashes, the journal is used to recover any transactions which were committed, but which did not actually make it to disk. It sure would be nice to know that the journal, as stored in the filesystem, is intact before using it to make changes elsewhere.
The checksum enables the filesystem to ensure that the journal is good and avoid (further) corrupting the filesystem if it is not. An interesting side benefit is that the checksum loosens the constraints on how the journal is written to disk, since an incompletely-written journal will now be detected; that should help to improve filesystem performance slightly. Note that full data checksumming is still not on the agenda for ext4. But checksumming the journal is a good (if small) step in the right direction. Another change is a VFS API change, in that it turns the i_version field of the inode structure into an unsigned, 64-bit value on all architectures. This version number is incremented when the file is changed, and it's stored (split into two fields) in the on-disk inode. 64-bit version numbers are required by NFSv4, which uses them to provide the dreaded "stale file handle" error when things change. There is a new ioctl() (EXT4_IOC_MIGRATE) which can be used to explicitly request that the on-disk inode for a file be converted to the ext4 format. The ext4 filesystem is extent-based, and has been for some time. "Extent-based" means that it tracks block allocations by extents (first block, number of blocks) rather than storing pointers to each individual block, as is done in ext3. There are a number of performance benefits to doing things this way, especially for larger files. Those benefits disappear, though, if a file's blocks cannot be grouped into the smallest number of extents possible. One technique which greatly helps in optimizing block allocations for files is to allocate them in relatively large groups, rather than individually. In 2.6.25, ext4 will contain the multi-block allocator, which does exactly that. One might think that allocating a few blocks at a time would not be that big of a change, but the multi-block allocator is by far the most complex patch in the set. 
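To make the extent idea mentioned above concrete, here is a small Python sketch that collapses a per-block list, the way ext3-style block pointers would record a file, into (first block, number of blocks) pairs. The function name and the list-of-tuples representation are assumptions for illustration and bear no relation to ext4's on-disk extent format.

```python
def blocks_to_extents(blocks):
    """Collapse block numbers (in file order) into (first_block, count) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the current extent: just extend it.
            first, count = extents[-1]
            extents[-1] = (first, count + 1)
        else:
            # A gap or a jump backwards starts a new extent.
            extents.append((block, 1))
    return extents


contiguous = blocks_to_extents([100, 101, 102, 103, 200, 201])
fragmented = blocks_to_extents([5, 9, 7])
```

The six blocks in the first call need only two extents, while the scattered blocks in the second need one extent apiece, which is the case where the benefit disappears and why allocating blocks in large contiguous groups matters.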
A lot of effort and heuristics go into deciding how many blocks to allocate, finding the optimal set of blocks, tracking the allocation, recovering blocks which end up never being used, ensuring that an application cannot read pre-allocated (but unwritten) blocks in search of leaked secrets, etc. It is quite a bit of code, but it is worth the trouble; multi-block allocation will be enabled by default in 2.6.25. As noted above, a number of these patches force changes to the on-disk data structure. According to Ted, though, these should be the last on-disk changes for ext4. There are some features which still will not have been merged when 2.6.25 comes around - delayed allocation and online defragmentation among them - but they should not require format changes. So ext4 is getting closer to the point where it is considered ready for production use. It is not at that point yet, though, and people who use it are still doing so at their own risk. To help drive that point home, Ted has proposed a new mount flag (called test_fs) which communicates to the kernel the user's understanding that they are about to mount a developmental filesystem and will not go filing lawsuits if things go wrong. In the absence of this mount option, an ext4 filesystem will refuse to mount. One might think that child-proofing the filesystem in this way would not be necessary, but some extra care in this area can only be a good thing. Filesystem-related surprises are rarely welcome. Web security vulnerabilities and Javascript Various recent, unrelated security issues seem to have a common thread: Javascript. It is not the fault of the language, exactly, nor of any particular implementation. It is the fundamental nature of how the language is used that often causes it to be "front and center" when security problems are found on the web. 
Imagine that your computer reaches out across the net, to an unverified site, over an unencrypted link and grabs code that it executes with little in the way of further inspection. When put that way, it sounds rather dangerous, but that is exactly what browsers do with Javascript code. There are limits to what Javascript is allowed to do—meant to thwart malicious uses—but it has to have some privileges on the local machine in order to be useful. One of the recent outbreaks is the "random js" attack, which propagates through Javascript served by legitimate websites. It generates a random .js filename for each visitor—which is where the name comes from—inserting a reference to it in a page on the site. It also stores the IP address of the visitor so that it does not repeat the infection multiple times. The payload then tries to exploit a dozen or more Windows vulnerabilities to install malware of various sorts. The payload is not a problem for Linux users, but the websites hosting the attack are running Apache, many on Linux. The big unresolved question is how the servers were infected. It could be as simple as getting root access via insecure or intercepted root passwords. Or there could be some, as yet unknown, exploit. That certainly bears watching. Because of the privileges that Javascript has on a local host, it can be used to spread malware, by exploiting the trust that users—those that even concern themselves with such things—have in the website they are visiting. It can also play a role in redirecting traffic away from a trusted site, even though the site itself has not been compromised. A post by Nat Torkington at O'Reilly illustrates a common problem that content providers need to worry about. O'Reilly's perl.com site carried advertising that required them to load Javascript from the advertiser's site. All was well until the domain expired. 
A porn site bought it and started providing the required Javascript file with new contents redirecting the users to their site. A man-in-the-middle or DNS cache poisoning attack could be used for similar results on a smaller scale. One can certainly see how it might be used by phishers as well. It is a difficult problem, as website owners need to be able to call out to advertisers' Javascript, but users typically do not expect to run code from a site they did not directly access. A once-theoretical attack on home routers has started to show up in the wild. It uses Javascript to exploit a vulnerability in home routers to change the DNS entries for a popular Mexican bank. After that, accesses to the bank would instead go to the malicious website which would collect usernames and passwords, allowing the attacker to access the accounts. Once again, users probably do not expect that surfing to a random site could suddenly expose them to bank account compromise. There are some things that can be done. For users, if Javascript cannot be disabled entirely - something increasingly difficult in the "Web 2.0" world - it can at least be leashed using NoScript for Firefox. For website owners, Google's Caja project seeks to define a subset of Javascript which implements an object-capability language, which would make it easier to sandbox remote code. If this effort succeeds, one can imagine that users could restrict their browsers to only use the Caja subset some day as well. A Code of Conduct The openSUSE project board has proposed a code of conduct for mailing lists and IRC. This would be in addition to the existing Guiding Principles, mailing list netiquette guide and IRC rules. There seems to be a trend among open source projects to adopt a code of conduct. As the number of people participating on mailing lists and IRC channels increases, so does the level of poorly stated questions, off-topic chatter and other annoyances.
As levels of frustration increase so does the potential for rudeness. Whether a poster intends to be rude, or is only perceived to be rude makes little difference. The international nature of this communication almost ensures there will be some misunderstandings based on culture and language. So do codes of conduct really work? They can, but often they do not. If the code is not enforced then there is no incentive for anyone to read the code, much less follow it. If the code is too actively enforced it will stifle communication. Somewhere in between there must be a happy medium. Finding it can be a challenge for even the most diplomatic of enforcers. There are no quick fixes for the problems that come with active channels of communication. There are many documents throughout the web that urge people to be polite and helpful, how to ask better questions and how to provide better answers. LWN readers may be more aware of them than the average netizen. It is up to the aware to educate the unaware in as kind and gentle a manner as possible. Memory management notifiers Virtualized guests running under Linux like to think that they are doing their own memory management. The truth of the matter, though, is that the host system cannot allow guests to directly modify the page tables used by the hardware; allowing that sort of access would compromise the security of the host. So, somehow, the host must be involved in the guest's memory management. One common technique is through the use of shadow page tables. Guest systems maintain their own page tables, but they are not the tables used by the memory management unit. Instead, whenever the guest makes a change to its tables, the host system intercepts the operation, checks it for validity, then mirrors the change in the real page tables, which "shadow" those maintained by the guest. 
One problem with this technique, as implemented in Linux currently, is that there is no easy way for the host to feed page table changes back to the guest. In particular, if the host system decides that it wants to push a given page out to swap, it can't tell the guest that the page is no longer resident. So virtualization mechanisms like KVM avoid the problem altogether by pinning pages in memory when they are mapped in shadow page tables. That solves the problem, but it makes it impossible to swap processes running KVM-based virtual machines out of main memory. This seems like a good thing to fix. And a fix exists, in the form of the MMU notifiers patch posted by Andrea Arcangeli (from his shiny new Qumranet address). This patch allows an interested subsystem to be notified whenever specific memory management events take place. The process starts by setting up a set of callbacks, which are then bundled into an mmu_notifier structure. The interested code registers its notifier against mm, the mm_struct structure associated with a given address space. It is not expected that anybody will be interested in all memory management events, so notifiers are associated with specific address spaces. Once the notifier is in place, the callbacks will be invoked when interesting things happen: release() is called when the relevant mm_struct is about to go away. So it will be the last callback made to that notifier. age_page() indicates that the memory management subsystem wants to clear the "referenced" flag on the page associated with the given address. This callback should return the previous value of the referenced bit, or the closest approximation available on the host architecture. invalidate_page() and invalidate_range() are both ways of telling the guest that the given address(es) are no longer valid - the page has been reclaimed. Upon return from this callback, the affected address range should not be referenced by the guest.
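The notifier pattern described above can be modeled in user space as follows. This is a hedged sketch, not the patch's actual code: every `toy_` name, every signature, and the `toy_shadow` subscriber are assumptions made for illustration, and only the release and invalidate_page callbacks are modeled (age_page and invalidate_range are left out for brevity). It shows the shape of the idea: callbacks bundled in a structure, registered against one address space, and invoked by the memory management side before a page goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mm;
struct toy_notifier;

/* Hypothetical callback set; the real patch's signatures may differ. */
struct toy_notifier_ops {
    void (*release)(struct toy_notifier *n, struct toy_mm *mm);
    void (*invalidate_page)(struct toy_notifier *n, struct toy_mm *mm,
                            unsigned long addr);
};

struct toy_notifier {
    const struct toy_notifier_ops *ops;
    struct toy_notifier *next;   /* next notifier on the same address space */
    void *priv;                  /* subscriber's private data */
};

/* Stand-in for an address space; it only holds its notifier list. */
struct toy_mm {
    struct toy_notifier *notifiers;
};

/* Attach a notifier to one specific address space. */
void toy_notifier_register(struct toy_notifier *n, struct toy_mm *mm)
{
    assert(n != NULL && mm != NULL && n->ops != NULL);
    n->next = mm->notifiers;
    mm->notifiers = n;
}

/* Memory management side: a page is being reclaimed, so tell subscribers. */
void toy_mm_reclaim_page(struct toy_mm *mm, unsigned long addr)
{
    for (struct toy_notifier *n = mm->notifiers; n != NULL; n = n->next)
        if (n->ops->invalidate_page)
            n->ops->invalidate_page(n, mm, addr);
}

/* Address space teardown: release is the last callback each notifier gets. */
void toy_mm_teardown(struct toy_mm *mm)
{
    for (struct toy_notifier *n = mm->notifiers; n != NULL; n = n->next)
        if (n->ops->release)
            n->ops->release(n, mm);
    mm->notifiers = NULL;
}

/* Example subscriber: a "hypervisor" tracking a single shadow mapping. */
struct toy_shadow {
    unsigned long mapped_addr;
    bool valid;
};

void toy_shadow_invalidate_page(struct toy_notifier *n, struct toy_mm *mm,
                                unsigned long addr)
{
    struct toy_shadow *s = n->priv;

    (void)mm;
    if (s->valid && s->mapped_addr == addr)
        s->valid = false;        /* drop the shadow mapping for this page */
}

void toy_shadow_release(struct toy_notifier *n, struct toy_mm *mm)
{
    struct toy_shadow *s = n->priv;

    (void)mm;
    s->valid = false;            /* address space is gone; nothing stays mapped */
}

const struct toy_notifier_ops toy_shadow_ops = {
    .release = toy_shadow_release,
    .invalidate_page = toy_shadow_invalidate_page,
};
```

With something like this in place, the subscriber no longer needs to pin the page: it can keep its shadow mapping until the invalidate callback arrives and drop it then.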
For the curious, the KVM patches (showing how these notifiers are used there) have also been posted. While this patch set is aimed at KVM, there has been some interest from other directions as well - virtual machines are not the only places where separate (but related) page tables are maintained. Graphical processing units on contemporary video cards are an example - they have their own memory management units and have some interesting management issues of their own. Remote DMA (RDMA) engines are another possible user. So these patches have attracted comments from a few potential users, and have changed significantly since their first posting. The discussion is still ongoing, so further changes may come about before the notifiers find their way into the mainline. Ten-year timeline part 4: the end and the beginning When your editor started this series, the idea was to have four installments covering the ten-year life (so far) of LWN. Well, this is the fourth installment, and it gets less than halfway there. This is not, it seems, a topic which inspires brevity. So this series will continue past the anniversary, though your editor anticipates picking up the pace a bit for the second five years. There is less to be learned, arguably, by looking at events in the relatively recent past. Anyway, at the end of the third installment, LWN had been unacquired by Tucows and was, once again, on its own. The worst of the dotcom bust may have passed, but it was still a somewhat scary environment in which to be attempting to restart a business. It was, in fact, even scarier than we had thought when we so naively set out to show that we could do a better job of bringing in the cash than Tucows did. February 7, 2002: Linus tries BitKeeper at last. February 14, 2002: Sun states that it will "ship a full implementation of the Linux operating system." Dave Whitinger joins LWN.net. Dave Whitinger was, of course, one of the founders of LinuxToday. 
He joined LWN with the intent of helping us develop the advertising side of the business. That did not work out as intended, but it is hardly Dave's fault; it was a terrible time to be trying to sell advertising. February 28, 2002: Sun cuts off free access to StarOffice, but we had OpenOffice.org by then and didn't mind. BitKeeper starts to settle in as the kernel's source management system. Linus stuck with BitKeeper after his initial trial, setting a number of things in motion. For the next few years, the use of proprietary software at the core of the kernel development process would be a constant source of unhappiness and worry - and, in fact, the story had just the sort of unhappy ending that some observers had feared. But this was also the move which rationalized the kernel work flow and made the whole system scale; the incredible rate of change we see now would not have been possible without it. The use of BitKeeper also made the community aware of what distributed source control could do and, eventually, inspired the creation of a number of free programs with the same essential features. One could say that the community would have eventually developed these systems on its own without the push from Larry McVoy and BitKeeper, and that's probably true. But the fact is: we didn't do it at that time, so we had no real alternative to BitKeeper. March 7, 2002: Martin Dalecki's "IDE cleanup" patches start to raise concerns among kernel developers, who have this strange notion that their disks should actually work. A petition against the use of BitKeeper circulates on the net. Eric Raymond goes around telling the world that the kernel development process is "in crisis." March 14, 2002: Richard Stallman claims that the GNU HURD will be ready by the end of the year. MandrakeSoft pleads for donations to keep the business alive - and LWN does too. Martin Dalecki officially takes over IDE maintenance - and breaks more systems. 
We got about $5,000 from our initial plea for donations. It was a real act of generosity on the part of our readers, but one does not keep a business with five employees going for very long with that sort of money. March 28, 2002: The proposed "consumer broadband and digital television promotion act" would require DRM technology in all software which touches digital media. Lineo lays off more staff. April 25, 2002: More BitKeeper flames. Lineo goes through a "recapitalization" effort to be able to do things like pay its employees. May 2, 2002: OpenOffice.org 1.0 is released. June 6, 2002: LWN switches to the "new" site code. Red Hat applies for a few software patents. ADEOS, a real-time system which avoids the RTLinux patent, is released. UnitedLinux launches. Mozilla 1.0 is released. It is amazing how many readers hated the new code. Certainly there were a lot of silly things in the initial version of the site; we fixed a number of them in a hurry. Many readers disliked the ability to post comments - often posting comments to that effect. The addition of comments was something we thought about carefully for a long time; we were quite concerned that they could ruin the feel of the site. In the end, it seems, trusting our readers has paid off; the quality of the conversation here is often quite good. UnitedLinux was a cooperative effort between Caldera, Conectiva, SuSE, and Turbolinux; the idea was to join together to create a common base from which each could then craft a separate product. The effort was never all that successful, and the presence of Caldera would, of course, doom it outright in the end. But it was a big deal at the time. It is interesting to see that Mandriva (despite MandrakeSoft's refusal to join UnitedLinux) and Turbolinux are now attempting a very similar sort of arrangement. June 13, 2002: Secure Computing Corporation claims patents on SELinux. June 27, 2002: The 2002 kernel summit sets October 31 as the date for the 2.6 feature freeze. 
GNOME 2.0 is released. July 4, 2002: Darl McBride takes over at SCO. July 25, 2002: LWN announces "the end of the road." The "IDE cleanup" patch series (up to number 100) causes system lockups and file corruption. Debian GNU/Linux 3.0 ("woody") is released. Version 1.0 of the Ogg Vorbis codec is released. By the end of July, we had come to realize that the advertising business was not going to work out for LWN, and we were short of other ideas. The bank account had reached a point where we could not pay even very small expenses. So we concluded that it was time to throw in the towel and try something else - though we had no clue of what "something else" might be. It was with a heavy heart that we announced our plan to shut down the site. What happened next is that our donation box, which had sat mostly empty after the initial announcement, was suddenly topped up to the tune of about $35,000. Many of the donations came with notes to the effect of "use this to throw a big party." This, shall we say, got our attention. We decided that, just maybe, the subscription idea was worth a try after all, and decided to make a go of it. It was not the end after all. August 1, 2002: A new beginning. HP tries to use the DMCA to shut down disclosure of security holes. August 15, 2002: Distributions from MandrakeSoft, Red Hat, and SuSE are certified to be compliant with the Linux Standard Base. This was when our credit card merchant bank at the time decided that all those donations might just be fraudulent. So they seized the money back out of our bank account. That, too, got our attention. It took a few months and some lawyer time to get the money you all had sent in our direction; during that time, it was money from PayPal (the subject of everybody else's horror stories) that kept the lights on while our main source of cash was blocked. Needless to say, we got a new merchant bank, which we still use to this day. 
The new bank exhibits a rather higher clue level than the old one did, but we also learned a valuable lesson: don't mess with the credit card money pipeline. Every now and then, somebody asks why we don't accept pure donations; this is why. August 22, 2002: Martin Dalecki quits and the entire series of 115 "IDE cleanup" patches is deleted from the 2.5 kernel. August 29, 2002: British Telecom's attempt to patent the web dies in court. The BitKeeper license changes. Caldera becomes the SCO Group. September 12, 2002: Some patches get dropped after Linus starts running his mail through a spam filter. It's hard to believe that, only 5+ years ago, somebody with an email address as well distributed as Linus's could get by without spam filtering. There are a lot of free "productivity" applications, but, arguably, few have actually increased productivity to the extent that SpamAssassin has. September 26, 2002: The first development release of the "Phoenix" browser is announced. UnitedLinux upsets the community by releasing a closed beta. Phoenix was the Mozilla Foundation's answer to (relatively) lightweight browsers like Galeon, which had managed to turn the Gecko engine into something which was truly usable. The Phoenix browser proved popular, and eventually became the tool now known as Firefox. October 3, 2002: The first subscriber-only weekly edition. Eldred v. Ashcroft is argued in the U.S. Supreme Court. Eldred v. Ashcroft, argued by Lawrence Lessig, was an attempt to roll back copyright extension in the US; it eventually was unsuccessful. To this day, there still has not really been a successful challenge to the extensions to copyright passed over the last few decades - though some especially nasty attempts to make things even worse were defeated. With the October 3, 2002 edition, LWN adopted the new policy of requiring subscriptions in order to read our original content prior to the publication of the weekly edition. 
That policy has stayed essentially unchanged since then, despite the occasional temptation to increase the subscriber-only period. Subscription rates have also stayed unchanged, even though raising them is also tempting. Subscriptions have certainly been successful, in that they have kept the operation going in the years since then. And there is a real joy associated with being truly answerable to our readers instead of advertisers. Nonetheless, it is a challenging business; people do not like to pay to read web-based content. The fact that so many of our readers are willing to do so is most gratifying. Trends in other parts of the net are moving away from this approach, though, with formerly subscription sites moving to pure advertising models. So it will be interesting to see how it all plays out in the future. Meanwhile, next week's installment will look at how things went for Linux (and LWN) starting toward the end of 2002. Stay tuned. A ten-year retrospective from LWN's other co-founder Hello to all LWN readers! For the tenth anniversary of LWN, I've been dragged out of my closet to say a few words. Am I stunned that LWN is still going after 10 years? Not really. Much more stunning to me is the realization that the number of years LWN has been published without me are now almost double the number of years it was published with me. That is much harder to get over. As a result, all new readers from 2002 on have no reason to know who I am or what I've written in the past. For those of you that remember me and have asked about me, thank you and rest assured that I haven't forgotten you either. My name is Elizabeth Coolbaugh (Liz) and I was there for the very first issue as well as many issues that followed in 1998 through 2001. I've always said it was the very best job I ever had. I wish for all of you, if you haven't experienced it yet, a job where your first weeks of work are greeted with happy, enthusiastic letters. 
As the years went by, letters of praise, though much sparser, never totally ceased. You couldn't have a better incentive to work harder and harder! Jon has done an excellent job of going over the history of the first few years already, so all I can add is some tidbits or personal viewpoints. I'll mention that for me, the start of LWN was actually back in the early 1980's, when Jon, Becky and I came together as a programming team in the then infamous "Assembly Language Programming" class offered through the Engineering School at CU Boulder. We got a chance to experience lots of late nights, interesting hardware experiences and how to keep going with pizza, chocolate, caffeine, etc. That is a good way to get to know your future business partners. Jon and Becky never let me down and we all found different strengths to add to the mix. Forrest was around, too, though not working with us directly at the time. Jon mentioned that I was between jobs at the time we began. In fact, I had left NCAR three months pregnant. I loved working at NCAR for many, many years, but I had always said that I would leave it when the work stopped being fun. It actually stopped being fun about two years before that, but I had weathered rough times before and waited to make sure the situation wasn't going to turn-around before choosing to move on. The challenge of a new baby on the way (and the continuing challenge of the Multiple Sclerosis that eventually led to my departure from LWN) finally made it "the right time". So I'd actually had most of a year off to recuperate, re-organize, have a baby and test the job market waters. What I wanted was a job that used my professional skills and yet was part-time, to help me keep the health I'd regained. What a pipe-dream! Companies that would have gladly recruited me full-time just tossed my resume into the nearest recycle bin. 
The nicer ones told me to go out and find someone else with identical skills who wanted to job-share a full-time job and they would be willing to consider the possibility. Not bloody likely. So when Jon and I were having lunch and he suggested we might be able to work together to create something giving me what I wanted and allowing him to eventually leave NCAR, it seemed to be the right idea at the right time. I never regretted the decision, but in fact, I had a full-time working spouse to cushion the decision. The reaction of my husband, Brandon, to becoming the sole support of the family and a new father in one fell swoop was a little different -- much like a deer fully blinded by headlights. In the spirit of true confessions, though I had fifteen years' experience in the computing field and had worked with many different operating systems, VMS and Solaris being primary, I'd never actually touched a Linux system. Jon's unwavering belief in my ability to pick it all up in a heartbeat was both daunting and encouraging at the same time. So I installed my first Linux system only three or four months before we first started publishing. It did give me a fresh, unbiased view of the whole community, though. Okay, not totally unbiased. I did sit on the emacs side of the whole emacs/vi war. To get started, I subscribed to, say, a hundred different newsgroups and mailing lists full of people I'd never met, topics I'd never heard of and flame wars I didn't care to read. It was truly a new skill to develop to learn to skim through them searching for the topics people cared about, the posts that actually carried real information and gently lift each little kernel of "news" out and place it into the newsletter, then wait to hear how well I'd done. The response was totally overwhelming. I will never, ever forget the emails we received those first couple of months. New people were finding us each week and so the responses kept coming in.
They drove me to try and make my contributions worthy of the praise they sent. It is because of those emails that I'm not surprised LWN is still out there today. People wanted and needed what we had to offer. Jon's vision of what people liked and wanted has always been clear and that is another important piece of why LWN is still going strong. My take on the Red Hat Support fiasco: I have no hard feelings. Although my work as a systems administrator had always included supporting people and I had enjoyed the interaction, I had no idea what I was getting into offering 24-hour support from my home. Just as my daughter was getting old enough to give me a full night's sleep, I was getting phone calls at 2am and 3am, having to wake up to a fully alert state and go into emergency fix-it mode. I'm surprised I survived until all the contracts we had sold finally expired. In the long run, Red Hat's ideas gave us the courage to start our own business and since writing for LWN was what I learned to love, I consider the end result to have worked out for the best. I also carefully noted for the future that telephone support work was definitely going to be a last resort for any future career moves. Meanwhile, since the few contracts we had didn't bring in enough to pay the bills, let alone enough to support Jon's full-time entry, I also did contract work as a technical writer, remote or on-site administration of Linux for some local companies and I don't even remember what else. Eventually, Jon had to take the risk, forgo waiting for a reliable income and quit his day job in order to increase the income stream. Note that his early work on LWN was always done in addition to continuing his full-time job and trying to increase our income stream at the same time. No wonder he got grumpy if I was out sick or, worse, got to head to a fun Linux conference, leaving him to pick up the slack! Of course, it was terrifying in turn for me when the situation reversed and Jon was unavailable.
Picking up the kernel page for the week? Ack! I didn't usually complain. Instead, I kept my head low, worked hard and hoped not to see too many corrections or criticisms come in. It was wonderful for both Jon and me when we were finally able to add Becky to the mix. I think initially we were only able to scrape up enough to pay her for 10 hours a week, but every hour helped. I haven't forgotten, Becky (okay, it should be Rebecca, but she'll always be Becky to me), the hours you put in at a very low rate of pay. Of course, we did pay you first -- the downside to being the business owners for us. Over the course of the next couple of years, we continued to bring in our income from other sources. We did actually initiate putting some advertising on our site and it brought in a tiny amount of money, but the bread and butter of the company continued to be contract work done in addition to the weekly publication. That included our most successful side foray, building and teaching Linux classes. What else did I love about LWN? I so enjoyed the friendships I made throughout so many different communities. Will Rogers once said he never met a man he didn't like. Well, I've met many! But truly, in all the years I worked for LWN, I never met anyone I didn't like. Sometimes people I liked said things or did things that I didn't like, but underneath it, they were all good people, smart, idealistic and very strongly opinionated. That was part of what I liked and enjoyed, so I never held people's opinions against them. The conferences I attended and at which I spoke were like the icing on the cake. I got to meet in person people I had only come to know through newsgroups and mailing lists or occasionally personal correspondence. I got to meet even more people and share in the excitement.
And yes, I do remember the late nights going out for food, drink and conversation with you -- the Atlanta Showcase, LinuxWorld San Jose, Embedded Systems Conference San Jose, LinuxWorld New York, the Colorado Linux Info Quest and the Singapore Linux Conference. Each one provides me with rich memories. My trip out to Singapore was one high-point. So many good and wonderful people and such a wonderful experience. I thought it was to be the first of many international conferences that I would be attending and I am still so sad that it was my last. I particularly regret never making it out to any of the early Linux conferences in India, despite invitations. Professionally, though, the highlight of the work was actually developing myself as a journalist, rather than a computer expert. I enjoyed researching more in-depth articles. When rumors floated my way, I loved actually going out and contacting the people involved first hand by telephone -- short-circuiting email and the rest, to discuss the issues and get their first-hand viewpoints. Since our community wasn't exactly hounded by the media back then, everybody actually wanted to talk to me and was more than happy to give me the straight scoop, instead of just seeing themselves misquoted elsewhere the next day, with the resultant flames. Best of all, I was occasionally able to get the sources of both sides of a controversy together and talk. I can think of at least twice where problems got resolved as a result, people got together and I got the scoop on a story the next day that had literally changed as a result of my work. Very heady stuff. Jon has already done an excellent job of covering our experience with the dot-com bubble, so I won't add to his description. It was truly a unique life experience that we enjoyed to the fullest, knowing that another like it was unlikely to come by us again. We were very fortunate in our decisions and I agree that the people at Tucows were extremely good to us.
Well, at this point, all this happened a long time ago. I had a great time and regret nothing I did, only the things I didn't get time to do. For those who have asked after me personally, be assured that health-wise, giving up my job was again the right choice at the right time and I'm doing much, much better than I was in August of 2001. You're still not likely to see me back any time in the near future. I focus my research skills nowadays on tracking traditional and alternative medical discoveries, implementing what seems good to me and serving as an ad-hoc resource for other family members. Oh yes, and serving as a chauffeur to my daughter, who is now ten years old, just as LWN is. Take care, all of you, remember to be proud of what you are achieving and *always* have fun doing it. I stand by my opinion that when work ceases to be fun, it is time for a change. LCA: The state of Debian The Debian miniconf is one of the oldest of linux.conf.au traditions. This year, Martin Krafft was the person who - with short notice - got to lead off this gathering with the "state of Debian" talk. Debian, as always, is an active project, and it seems that much is going well. The Debian security team has grown over the last year. Martin noted that Debian, for all practical purposes, had no security support for a period after the Sarge release. Those days are over, though, and Debian's security support is, once again, solid. There is now good security support for the testing distribution as well; in fact, testing updates often come out before those for the stable distribution. That result comes from the fact that testing updates do not need to support all architectures and there are fewer embargo issues. The upcoming Lenny release, it was noted, will have implemented most of the features called for in the security-hardening specification. The state of translations is good; Debian supports 58 languages now, and may support 77 by the Lenny release.
The Smith Review Project has been working through the package base, ensuring that package descriptions are, well, descriptive, in proper English, and easily translatable. On the ports side, the Sparc32 port has been officially retired, to the dismay of relatively few users. The Lenny release will include a new port: Debian GNU/kFreeBSD, which is based on the FreeBSD kernel. Martin thought this port would appeal to those Debian users who have been complaining about the increasing "multimedia orientation" of the Linux-based distribution. Much work is going into making the package repository more searchable. The debtags project, which is putting a set of standardized tags onto packages, is relatively advanced. This effort will address a number of longstanding problems, like the fact that a search for "image editor" does not turn up GIMP, which is an "image manipulation program." Debtags will also make it possible to search for packages which are related to other packages. There is also the apt-xapian-index project, which is working toward indexing all package metadata and providing a fast search capability. Other bits of current status: The debian-med project - building a version of Debian aimed at the medical industry - is headed toward a 1.0 release. The Debian mirror network is growing. There are six new primary mirrors, and around 100 new secondary mirrors. Lenny will use UTF-8 nearly exclusively. Developers are working on fixing the remaining packages which do not yet support UTF-8. The venerable dselect is almost retired. There are still dselect users out there; Martin suggests that all of those folks move to aptitude. There are a lot of new games coming into the distribution. The Etch-and-a-half release will be happening soon. This is a version of Etch which offers a 2.6.24 kernel - needed to make Etch work on newer hardware. The original 2.6.18 kernel will remain an option for Etch users.
Looking forward to 2008, Martin noted that the Lenny release is currently planned for December. Lots of emphasis on "planned" - given Debian's history in this regard, few people actually expect the release to happen on time. Martin did say that things have been getting better in this regard, with Etch being "only" four months behind schedule. A Lenny release which is only a couple months late seems feasible. Something which is just coming into play is the new "Debian maintainer" status. Unlike full developers, maintainers cannot vote, have no access to the debian-private list, and do not have much access to the wider Debian infrastructure. About all they really can do is upload a specific set of packages. So the "maintainer" designation is good for those who want to maintain a small set of packages, but who are not looking to be an active participant in Debian as a whole, and who do not want to run the "new maintainer" gauntlet. Martin was asked whether there was any thought of downgrading any existing developers to maintainers. He said that there was some interest in doing that. There are currently just over 1000 developers, all of whom have full access to the repository. Some 400 of those are inactive, but they still possess a key which lets them make changes to the system; this is a clear security issue. The MIA project is looking to identify these people and, eventually, move them to inactive status. On the issue of whether the project would be forcibly downgrading active developers who, for whatever reason, are not entirely welcome in the community, Martin says that will not be happening. There is just no way to do it without bringing massive disruption and flame wars, and nobody wants that. There was also a question on the role of the debian-private list. 
The biggest use of debian-private, according to Martin, is vacation announcements; developers need to let the project know that they will not be around, but they do not wish to announce their absence to the wider world. There are some other discussions there too, of course. Current policy says that debian-private discussions will be disclosed after three years in the absence of a request to the contrary. There's an effort afoot to disclose older traffic from before the adoption of that policy, but that requires the assent of all of the participants. The debian-women project, unfortunately, is currently stalled; the main participants have not had the time to push things forward. The #debian-women channel remains active, though, and is generally a nice and supportive place to be. There are currently about twelve active female contributors to Debian. Martin thinks that women are becoming more present in general, though, and he stated that "the Debian cowboy days are done." On the packaging front: the packages.qa.debian.org site has been redone in "beautiful CSS." There are now RSS feeds for those who want to follow the status of specific packages. A new "LowThresholdNMU" flag has been added; this is essentially a statement on the part of the maintainer that he will not get offended if others upload fixes to the package. Packages can now use bzip2 compression. There has also been a major rework of the shared library infrastructure, which now looks at actual symbol use when determining shared library dependencies. This change should make it possible to install individual packages from testing into a stable system without having to update all of the libraries that package uses. There is a growing trend toward team maintenance, especially for the larger package sets. This approach increases the robustness of the system and minimizes problems with MIA maintainers. Version control systems are working their way into the Debian infrastructure. 
Packages can now have a set of Vcs-* headers which point to the upstream source repository; these can be used, for example, with the debcheckout command to clone the source repository without having to know anything about the source management system used. Version control systems also offer a solution to the current problem of "hackish packaging tools" being used by many developers. In the future, source packages might just include a shallow repository which can be fed straight to git (or some other system). This project is stalled at the moment, but Martin thinks it will go somewhere; it would be nice if the distributors could come up with a common scheme that they can all use. The final topic in this session was a question from the audience on whether Debian might ever go to a shorter release cycle. The projected 18 months for Lenny seems like a step in that direction, but 18 months is still quite a bit longer than the cycles used by many other free distributions. Martin thinks that going shorter is unlikely. The fact of the matter is that distribution upgrades are a hassle, requiring a fair amount of administrative attention. Ubuntu may have made some progress with its use of upgrade scripts, but the basic problem remains. On top of that, shorter release cycles would necessarily lead to a shortening of the time for which security updates are available for any specific release. And that, in turn, would force users into more frequent updates whether they want to do that or not. So one should not expect six-month release cycles from Debian anytime soon. What got into 2.6.25 As of this writing, some 3800 patches have been merged into the mainline git repository since the release of 2.6.24. That is fewer than one might have expected, but Linus's travel to linux.conf.au is slowing the process somewhat. Expect more than the usual amount of interesting stuff to be merged relatively late in the merge window period. 
User-visible changes include: New drivers have been added for Globe Trotter HSDPA wireless cards, HIFN 795x crypto accelerator chips, Xceive xc2028 and xc5000 tuners, Cirrus Logic CS5345 analog-to-digital converters, several Beholder TV tuners, Syntek DC1125 cameras, Silicon Labs Si470x FM radio receivers, Atmel AT91CAP9 processors, Qualcomm MSM7X00A processors, Marvell Orion system-on-a-chip devices, Marvell Feroceon processors, SuperH 7203 and 7263 processors, SGI IP28 systems, R6040 Ethernet adapters, Broadcom NetXtremeII 10Gb network adapters, RTL8180 and 8185-based wireless network cards, Microchip ENC28J60 Ethernet chips, and, finally, Atheros-based wireless network adapters. The Seagate ST-02/Future Domain TMC-8xx and PSI240i SCSI drivers have been removed due to lack of interest and maintenance. Salsa20 stream cipher support has been added to the crypto layer (at least for the x86 architecture - it's an assembly implementation). Some realtime work has gone into the scheduler; in particular, the kernel will be more aggressive about moving tasks between processors when multiple realtime tasks are contending for the same CPU. The implementation of cpusets has been made to work more closely with the scheduler domains mechanism. The option to make the big kernel lock preemptible has been made the default; eventually the non-preemptible version will go away altogether. High-resolution timers can be used for preemption, making fair scheduling more accurate. The group scheduling feature has been enhanced with realtime support. The Preemptible read-copy-update patches have been merged. Support for the LatencyTop utility has been merged. Kprobes support for the ARM architecture has been added. The new CLONE_IO flag to clone() causes I/O contexts (used in the CFQ block I/O scheduler) to be shared with the new child process. 
The idle class for I/O scheduling has been changed to not be 100% idle when the device is busy; as a result, it is far less likely to cause priority inversion problems and is no longer limited to privileged processes. A long list of new ext4 features, including large file support, (very) large filesystem support, journal checksumming, multi-block allocation, and more, has been added. The splice() system call now supports TCP receive streams. Controller area network protocol support has been merged. The network traffic shaper, long obsolete and scheduled for removal, is gone. Quite a bit of work has been done on the network namespace code which was first merged in 2.6.24. Extending namespace awareness through the entire networking subsystem is a big job which is, at this point, mostly complete. Changes visible to kernel developers include: Chinese translations of a number of core kernel development documents have been added to the tree. There have been a great many changes to the low-level device model APIs dealing with kobjects and ksets. These changes have, in turn, forced a large number of adjustments throughout the tree. See Documentation/kobject.txt for an overview of the new API. There is a new set of security module functions for dealing with filesystem mount and unmount operations. The chained scatterlist API has been augmented with the sg_table patches. There have been some changes to the block request completion API. See this article for a description of the new way of doing things. As of this writing, the merging process has just begun, so expect a long list again next week. Among other things, the x86 tree update, with 908 changesets, is waiting in the wings. There is quite a bit of code yet to be merged for this development cycle. A new block request completion API The 2.6 block layer has traditionally provided a pair of functions by which a driver could indicate that an I/O request had been completed. 
A call to end_that_request_first() signaled the transfer of a certain amount of data and would return a value indicating whether the request as a whole was complete. Once all sectors in a request had been transferred, it was up to the driver to pass the request to end_that_request_last() for final cleanup. There was also a function called simply end_request() which might or might not end the entire request, depending on how much data had been transferred. This API has worked for a long time, but it has occasionally proved confusing for driver developers. It was also hard for drivers to communicate useful error information with this interface. So, as of 2.6.25, there will be a new way for drivers to indicate request completion. After a block driver has transferred one or more sectors (or failed in the attempt), it should now make a call to blk_end_request(rq, error, nr_bytes), where rq is the I/O request, error is zero or a negative error code, and nr_bytes is the number of bytes successfully transferred. If blk_end_request() returns zero, the request is fully processed and the driver can forget about it. Otherwise there are still sectors to be transferred and the driver should continue with the same request. blk_end_request() must acquire the queue lock to do its job. If the driver already holds that lock, it should call __blk_end_request() instead. Block drivers traditionally did a number of housekeeping tasks between calls to end_that_request_first() and end_that_request_last(). These include calling add_disk_randomness() to contribute to the entropy pool, returning any tags used with the request, and removing the request from the queue. All of that stuff is now done within blk_end_request(), so drivers can forget about it. The occasional driver had to carry out other tasks between the completion of the request and its removal from the queue. 
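The return-value contract of blk_end_request() described above can be illustrated with a small user-space sketch. This is not kernel code: struct mock_request, mock_blk_end_request(), and mock_transfer() are hypothetical stand-ins invented for illustration, and they model only the "zero means done, non-zero means keep going" behavior, not locking, tags, or queue handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct request;
 * only the number of bytes still outstanding is modeled. */
struct mock_request {
    unsigned int bytes_left;
    int last_error;
};

/* Mimics the contract described in the text: returns 0 once the whole
 * request has been completed, non-zero while data remains. */
int mock_blk_end_request(struct mock_request *rq, int error,
                         unsigned int nr_bytes)
{
    assert(rq != NULL);
    rq->last_error = error;
    if (nr_bytes >= rq->bytes_left) {
        rq->bytes_left = 0;
        return 0;               /* fully processed; driver can forget it */
    }
    rq->bytes_left -= nr_bytes;
    return 1;                   /* continue with the same request */
}

/* Driver-style loop: complete fixed-size chunks until the request is
 * reported as finished; returns the number of completion calls made. */
int mock_transfer(struct mock_request *rq, unsigned int chunk)
{
    int calls = 0;
    int pending;

    assert(chunk > 0);
    do {
        pending = mock_blk_end_request(rq, 0, chunk);
        calls++;
    } while (pending);
    return calls;
}
```

In this sketch a driver simply keeps working on the same request for as long as the completion call returns non-zero, which is the single-call pattern that replaces the old end_that_request_first()/end_that_request_last() pair.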
For drivers with this kind of special need, there is a separate function to call, which takes a driver-supplied callback, drv_callback(). In this version, drv_callback() will be called (without the queue lock held) between the completion of the request and its final cleanup. If the callback returns a non-zero value, that final cleanup will not be done. This function will always acquire the queue lock - there is no version for drivers which have already taken that lock. In general, though, the use of the callback functionality is likely to be a sign that the driver is being trickier than it really needs to be. This change was accompanied by a fair number of patches converting all in-tree drivers to the new interface. The old completion functions have been removed, so out-of-tree drivers will need updating before they will work with 2.6.25. Gerbv reaches the 2.0 release milestone Gerbv (Gerber Viewer) is a utility for displaying CAD files that are used in the manufacture of electronic printed circuit boards: Gerbv is a viewer for Gerber (RS-274X) files. It is one of the utilities affiliated with the gEDA project. Gerber files are generated from PCB CAD systems and sent to PCB manufacturers as the basis for the manufacturing process. The standard supported by gerbv is RS-274X. In the 1980s, computer generated Gerber files were used to drive photo-plotter machines made by the Gerber Systems Corporation. The photo plotters used a mechanically stepped light source and rotating image wheels to optically imprint an image of a circuit board onto a large piece of film. The film was then used to manufacture the printed circuit board. Additionally, PCB manufacturing requires information for defining the size and placement of drill holes (drill files). The photo plotting machines are now obsolete, but the Gerber standard remains as a standard in the PCB manufacturing business. The output from Gerber file plots can look considerably different than the original CAD drawings, making a visualization tool like Gerbv important. 
Gerbv can be used for examining the CAD files generated by such software as CadSoft Eagle, a popular commercial application with a freely downloadable hobby version. Another Linux-compatible printed circuit CAD application is PCB. PCB is less powerful than Eagle, but is open-source software. LWN examined PCB a long time ago. Version 2.0.0 of Gerbv was recently announced: "Gerbv release 2.0.0 represents a whole new look for gerbv. Most importantly, the layer control GUI has been made much more powerful through the outstanding work of Julian Lamb. Julian has also re-worked the GUI's button and menus to make them more convenient to use. We are certain that you will find gerbv-2.0.0 even easier to use than before because of Julian's amazing work!" The feature list for Gerbv 2.0.0 now includes: Display of RS-274x Gerber files. The complete implementation of the current Gerber spec. Display of Excellon drill files. Display of XYRS pick-place files for surface mount technology. A completely redesigned GUI. Controls for zoom/pan and fit to screen. A measure tool for making mouse-controlled distance calculations. User selected display of the various layers. Support for transparency so that multiple layers can be viewed. Report windows showing Gerber and drill code stats and errors. A built-in print button. Use of the Cairo graphics library, enabling export of PDF, PS, SVG, and PNG files. Incorporation of a new unit test suite in the code. Improved file-type autodetection. Expanded configuration options for the build system. The project's SourceForge screenshot page gives several examples of Gerbv 2.0.0 in use. Installation of Gerbv 2.0.0 was straightforward. The source code was downloaded, uncompressed and untarred. The standard Unix configure/make/make install steps were performed on an Ubuntu Feisty Fawn system; no problems were encountered. Gerbv 2.0.0 was tested on some Eagle CAD files that your author had worked on in the past. 
Startup was easy; running the command gerbv slc1.* had the desired effect of pulling in all of the various layers for the test project. Moving and zooming around the layers showed the CAD graphics in detail, as expected. The analyze tools produced a lot of useful status information for the various files. Details in the copper layers that did not show up in Eagle (version 4.16) were easily seen with Gerbv. In the past, your author has encountered problems with Eagle incorrectly displaying the placement and scaling of text on the silk screen layer. This showed up when CAD files were taken to a board manufacturer. Gerbv displayed the text as it appears on the manufacturer's system, which is the desired behavior. The export functions were experimented with. Export to a png file worked as expected. Export to a PostScript file caused Gerbv to hang up. Export to a PDF file took a very long time to complete, and gpdf took a long time to load the file. When gpdf finished rendering, it only displayed large polygons that were barely visible due to their almost identical colors. Export to svg produced a file that caused the mirage image viewer to hang when reading. An attempt to convert the svg file to a jpg file with convert resulted in an error. Clearly, this is still a .0.0 release with some bugs. Despite these problems, Gerbv 2.0.0 is a tool that is useful, if not critical, for performing Linux-based printed circuit board design. Avoiding the OOM killer with mem_notify Having applications that use up all the available memory can be a fairly painful experience. For Linux systems, it generally means a visit from the out-of-memory (OOM) killer, which will try to find processes to kill. As one would guess, coming up with rules governing which process to kill is challenging - someone, somewhere, will always be unhappy with a choice the OOM killer makes. Avoiding it altogether is the goal of the mem_notify patch. 
When memory gets tight, it is quite possible that applications have memory allocated - often caches for better performance - that they could free. After all, it is generally better to lose some performance than to face the consequences of being chosen by the OOM killer. But, currently, there is no way for a process to know that the kernel is feeling memory pressure. The patch provides a way for interested programs to monitor the /dev/mem_notify file to be notified if memory starts to run low. /dev/mem_notify is a character device that signals memory pressure by becoming readable. Interested programs can open the file and then use poll() or select() to monitor the file descriptor. Alternatively, signal-driven I/O can be enabled via the FASYNC flag and the system will deliver a SIGIO signal to the process when the device becomes readable. If it becomes readable, the process should free any memory that it can afford to give up. If enough memory is freed this way, the kernel will have no need to call in the OOM killer. The crux of the patch is how to decide that memory pressure is occurring. mem_notify modifies shrink_active_list() to look for movement of an anonymous page to the inactive list, which is an indication that some pages will likely be swapped out soon. When that occurs, memory_pressure_notify() (with the pressure flag set to 1) will be called for that zone. When the number of free pages for the zone increases above a threshold - based on pages_high and lowmem_reserve for the zone - memory_pressure_notify() is called again, but with the pressure flag set to 0, effectively ending the memory pressure event for that zone. If there are numerous processes waiting for a memory pressure notification, it could be counterproductive to wake them all at once - the "thundering herd" problem. To combat this, the patch set adds the ability to wake fewer processes than are waiting on the poll event, by adding the poll_wait_exclusive() function. 
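The user-space side of this scheme, as described above, amounts to polling a file descriptor and dropping caches when it becomes readable. The following is a minimal sketch of that pattern, assuming nothing about the device beyond "readable means pressure". The helper names wait_for_pressure() and drop_cache() and the struct app_cache type are hypothetical, and /dev/mem_notify exists only on kernels carrying the patch, so the sketch takes an already-open descriptor rather than opening the device itself.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Wait until fd becomes readable, which is how the patch is described
 * as signaling memory pressure.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_for_pressure(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int ret;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0)
        return -1;              /* poll() itself failed */
    if (ret == 0)
        return 0;               /* timed out: no pressure reported */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Hypothetical application cache that can be given up under pressure. */
struct app_cache {
    void *data;
    size_t size;
};

/* Free the cache and report how many bytes were released. */
size_t drop_cache(struct app_cache *c)
{
    size_t freed = c->size;

    free(c->data);
    c->data = NULL;
    c->size = 0;
    return freed;
}
```

On a patched kernel, a program would presumably obtain the descriptor with open("/dev/mem_notify", O_RDONLY) and call drop_cache() whenever wait_for_pressure() returns 1; any other readable descriptor, such as a pipe, can stand in for the device when trying the sketch out.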
poll_wait_exclusive() will in turn call add_wait_queue_exclusive() so that a member of the wake_up() family can be used that will limit the number of processes woken up. Previously, only poll_wait() was available; it uses add_wait_queue(), which does not provide this ability. Also, to reduce the frequency of processes waking up to reclaim memory, memory_pressure_notify() will only do that once every five seconds. The /proc/zoneinfo output has been changed to include the mem_notify status. This can be used by a human for diagnostic purposes or by a program to check the current status of zones for memory pressure. The embedded community has a lot of interest in seeing this feature get added to the kernel. Devices like phones and PDAs are often running close to their memory limits and the OOM killer is currently unavoidable when the user opens yet another application. With this patch in place, programs that use a lot of memory, but could get by with less, can be changed to free up their caches and the like when memory gets tight. As memory hungry programs get changed, other users will benefit as well. The patch, submitted by Kosaki Motohiro, has been through several iterations on linux-kernel. The work was originally started by Marcelo Tosatti, with the fifth version recently posted by Kosaki. Previous versions have been well received and with relatively few comments on this iteration, it would seem to be getting close to being merged. LCA: Bruce Schneier on the two sides of security The conference portion of linux.conf.au opened on Wednesday morning with a keynote by Bruce Schneier. LCA is a sold-out event; in fact, there are rather more attendees than can be fit into the hall where the keynotes are held. Thus the room was packed, with the second-class citizens - those with yellow badges who put off registration until late - watching a remote feed in a separate room. Those folks may have had a more distant experience, but it was almost certainly a cooler one too. 
Bruce's key point is that we need to rethink how we try to achieve security, though it took a while to explain just why that is. Security, he says, has two components: The feeling of security: that which helps us to sleep well at night. The reality of security: whether we are, in fact, secure. These two aspects of the problem are entirely separate from each other, but they both have to be addressed if our security goals are to be achieved. Security is always a set of tradeoffs which we are all making every day. As an example, consider that, in all likelihood, nobody in the audience was wearing a bulletproof vest. It's not that the vests do not work; instead, nobody feels that the cost of wearing a bulletproof vest is justified given the risk. On a bigger scale, the answer to the question of how to prevent more 9/11-like attacks is clear: ban all aircraft. In fact, that was done in the US for a few days after those attacks, but, in the longer term, that is not a tradeoff that people are willing to make. So the fundamental question for any security tradeoff is: is it worth it? As it happens, we are quite bad at making that decision. We tend to respond to feelings rather than reality. Spectacular risks drive us more than everyday risks. We fear the strange over the familiar and the personified (think Osama bin Laden) over the anonymous. Involuntary risks are seen as being bigger than those entered into voluntarily. In the end, evolution has equipped us quite well for making tradeoffs in the small communities we lived in many, many thousands of years ago. We are less well equipped for the world we live in now. Since we respond to feelings more than reality, there are strong economic incentives for solutions which address feelings. The result is snake-oil products and security theater. 
Sometimes people notice that they are being sold bad security (later Bruce mentioned a US survey which indicated that the Transportation Security Agency is now less trusted than the taxation agency), but, all too often, they don't. They have a poor understanding of the risks and the costs involved, and there are plenty of people with strong interests in confusing the issue. The security market is a lemons market, one where buyers and sellers have asymmetric access to information. Economic research shows that, in such markets, the bad products tend to drive the good ones out of the market. There is no easy way to evaluate the work which has gone into the creation of a truly secure product, so buyers respond to other, less reliable signals. Things like price, sales claims, or the Gartner Group. These signals are sloppy and prone to manipulation. When security is outsourced to outside agencies - governments, say - the problem gets even worse. In the business world, information eventually brings some order to a lemons market. As businesses learn about what really works, access to information evens out - though there is always a problem with very rare, high-cost events where information is not available. In the individual world, though, it is much harder, because fear plays a much bigger role. The fact of the matter is that fear is wired deeply into how we work - it is a result of a very old part of our brain. As humans, we have the ability to override our fears when reason indicates that we should, but it is a hard thing to do. The default state is that fear rules. So this is Bruce's core point: the feelings matter. All that security theater out there is not entirely stupid; any security solution must address the fears that people feel. We must address both aspects of security. The problem is where the feeling of security and the reality of security diverge from each other. If only feelings are addressed, security has not really been achieved. 
If only the reality of security is addressed, people feel insecure and may make bad decisions. Either way, the full problem has not been solved. Addressing this all-too-common problem is hard, though; Bruce knows of no better way than the spreading of good information. Your editor's perspective follows - nothing from this point on was said during the talk. It seems that he has a point here. Consider some common situations in the free software world: A large number of security updates from a distributor may be an indication that the reality of security is being achieved: problems are being found and fixed before they are exploited. But all those updates can undermine the feeling of security. The seemingly endless stream of Wireshark updates is a case in point; most of these problems are found through proactive auditing by the developers and have never been exploited by the Bad Guys. But the feeling of insecurity associated with Wireshark can be strong. This feeling can push users toward other software which, while not having that long history of security updates, is actually less secure. A system running SELinux may, in fact, be highly secure. But many administrators still turn it off. SELinux does not make them feel secure because they do not understand it, and they fear (rightly or wrongly) that it will interfere with the proper operation of the system. But, by turning it off, they undoubtedly expose themselves to a number of attacks which SELinux would block. We should hear Bruce's point and think a bit more about how we can ensure that free software creates the feeling of security - but a feeling which is backed up by real security. It's a hard problem, one which lacks technical solutions. But we'll find ourselves less secure than we would otherwise be if we do not address that side of the issue. Finding bugs lurking in the DOM The Document Object Model (DOM) for HTML is quite useful for handling a variety of dynamic effects for web pages, but it is complex. 
It interacts with Javascript and CSS (or they with it) in ways that are sometimes surprising; the DOM has often been the source of browser bugs. A new project, from well-known DOM bug finder Michal Zalewski, seeks to systematically exercise the DOM in browsers to eliminate as many holes as it can. The project, with the unassuming name of DOM access checker (or dom-checker) was just announced on the full-disclosure mailing list (along with Bugtraq and others). Zalewski and colleague Filipe Almeida, both of Google, describe their tool as follows: DOM access checker is a tool designed to automatically validate numerous aspects of domain security policy enforcement (cross-domain DOM access, Javascript cookies, XMLHttpRequest calls, event and transition handling) to detect common security attack or information disclosure vectors. The checker consists of three HTML files and a Javascript configuration file that can be loaded from the internet via HTTP (a live version is available from the project website) or from the local disk, using the file:// protocol. Ideally, they should be loaded from both places and give the same results. The screenshot for a sample run using Firefox 3 (Fedora/3.0b3pre-0.beta2.12.nightly20080121.fc9 for the curious) is at left. After pressing the "Click here to begin tests" button, the Javascript test harness runs 15 major tests, each with many separate subtests. Each subtest reports success or failure to the screen as it runs. Firefox 3 failed 15 of the 1500 or so checks in the standard set of tests. According to the announcement, "DOM Checker had been used to find a number of major security bypass and information disclosure problems in several popular browsers." Zalewski and Almeida worked with the browser teams to resolve the most serious issues. But common browsers will still fail up to 30 of the less important tests - for privacy, rather than security, holes. 
The hope is that the browser vendors pick up these tests to use as part of their quality assurance process. They could also be used for regression testing to find problems that have crept in while fixing other bugs or adding new features. The checker is a framework that could easily be extended with additional tests covering other areas of DOM functionality. With the advent of AJAX, DOM manipulations via Javascript are being used more and more by web sites, so tools to discover these kinds of bugs are welcome. LCA: Bringing X into a two-handed world Our graphical interfaces, as implemented through the X Window System, are designed around a single keyboard and a single mouse. But humans are social creatures who want to work together and share systems; they also tend to design their activities around the fact that we have two hands. Moving X out of the single-device model is not a task for the faint of heart, but Peter Hutterer is making a go of it. His LCA talk on multi-pointer X was an interesting update on where this work stands. The X device model is based on the idea of a core keyboard and a core pointer. Even in a situation where multiple input devices are present (a second mouse plugged into a laptop, say), the application still only sees a single, core device. There is no way to tell, using these core devices, which physical device generated any given event. This, of course, will be an obstacle for any application wanting to provide multi-device support. As it happens, the XInput extension has provided basic multiple-device support for many years. XInput events look much like core device events, except that (1) applications must register to receive them separately, and (2) they include an ID number identifying the device which generated the event. XInput does not solve the problem by itself, though, for a couple of reasons. 
Beyond the fact that it does not provide a way for users to specify how different devices should be handled, XInput suffers from the little difficulty that approximately 100% of X applications do not make use of it. So nobody is listening to all those nice XInput events with associated device IDs. The one exception Peter mentioned is the GIMP, which uses XInput to deal with tablets. Of course, multiple devices work on current systems; that is because the X server also generates core events for all devices. That causes the device ID to be lost, but, since applications do not care, this is not a problem, for now. But it does mean that we are still stuck in a world where systems have a single pointer and a single keyboard. Luckily for us, says Peter, multi-pointer X is on the horizon. MPX extends X through the creation of the concept of "master" and "slave" devices. Master devices are those which generate events seen by MPX-aware clients; they are virtual devices which can be created and destroyed by the user at will. Slave devices, instead, correspond to the physical devices attached to the system. Through the use of a modified xinput command, users can create masters and attach specific slaves to them. In the MPX world, one of three things will happen whenever something is done with a physical (slave) device: The X server will create an XInput event from the slave device and deliver it to any applications which have asked for such events. If that event is not delivered (because nobody was interested), a core event from the associated master device is created and queued for delivery. If the event is still undelivered, the server will create an XInput event from the master device to which the slave is attached and attempt to deliver that. The end result is a scheme where multiple devices still work as expected with non-MPX-aware applications. 
But when an application which does take advantage of MPX shows up, it will have access to the real information about what the user is doing. Peter ran a demo of some of the things he was able to do. By default, there is still only one pointer and one keyboard. Once a new master is created, though, and slave devices attached to it, things get more interesting. Two mouse pointers exist on the screen, each of which can be used independently. It's possible to be typing into two separate windows at the same time. Or, with the right window manager, the user can move windows simultaneously, or resize a window by grabbing two corners at the same time. It was great fun to watch. MPX brings with it an API which can be used with multi-device applications. When applications use it, says Peter, the result is "eternal happiness." That just leaves the problem of "the other 100%" of the application base which lacks this awareness. To a certain extent, things just work, even when independent pointers are used in the same application. There are some exceptions, though, which have required some workarounds in the system. For example, applications typically respond when the pointer enters a specific window - illuminating a button within the application, for example. Things work fine when two pointers enter that button. But, likely as not, once the first pointer leaves the button, it will go dark and refuse to respond to events from the other pointer. The solution is to nest enter and leave events, so that only the first entry is reported to the application, and only the final exit. Another problem results when a mouse button is pushed while another button is being held down (for a drag operation, perhaps) on a different device. Do that within Nautilus, and the application simply locks up - not the eternal happiness Peter was hoping for. So, when the application holds a grab on one device (as happens when buttons are held down), no other button events will be reported.
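The nesting of enter and leave events described above amounts to counting the pointers currently inside a window. A minimal sketch, assuming a hypothetical `NestedCrossing` class and string pointer names (none of this is MPX's real interface):

```python
class NestedCrossing:
    """Track which pointers are inside one window (or button).

    Only the first entry and the final exit are passed on to the
    application; crossings in between are swallowed.
    """

    def __init__(self):
        self.inside = set()   # pointers currently within the window
        self.reported = []    # events the application actually sees

    def enter(self, pointer):
        was_empty = not self.inside
        self.inside.add(pointer)
        if was_empty:
            self.reported.append("enter")

    def leave(self, pointer):
        if pointer not in self.inside:
            return  # ignore a leave from a pointer that never entered
        self.inside.discard(pointer)
        if not self.inside:
            self.reported.append("leave")
```

With two pointers over the same button, the application sees one "enter" when the first arrives and one "leave" only after both have gone, so the button stays lit for as long as any pointer is over it.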
Also problematic is what to do when the application asks where the pointer is: which pointer should be reported? In this case, the server simply assigns one pointer as the one to report on. All of this makes standard applications work - almost all the time. Some interesting problems remain, though. How, for example, should a window manager place new windows in a multi-user, multi-device situation? Users will want their windows in their part of the display space, but the window manager has no real way of knowing where that is - or even which user the window "belongs" to. In general, the whole paradigm under which desktop applications have been developed is unprepared to deal with a multi-device world. Things will get worse as more types of input devices enter the picture. Touch screens are bad enough; they have no persistent state, so things change every time the user touches the device. But touch screens of the future will report multiple touch points simultaneously, and each of those will have attributes like the area of the touch, the pressure being applied, etc. Perhaps the device will sense elevation - a third dimension above the device itself. All of this is going to require a massive rethinking of how our applications work. There are going to be a lot of big problems. But that, says Peter, is what happens when one explores new areas. One gets the sense that he is looking forward to the challenge. LCA: Disintermediating distributions One of the mini-confs which happened ahead of linux.conf.au proper was the "distribution summit," meant to be a place where representatives and users of all distributions could talk about issues of interest to all. The highlight of this event, perhaps, was Jeff Waugh's talk on disintermediating distributions - or, as he rephrased it, "distributed distributions." If his ideas take hold, they could be the beginning of a new relationship between free software projects and their users. 
It all started, says Jeff, some years ago, when he ran into Mark Shuttleworth fresh from a visit to Antarctica. Mark's pitch, says Jeff, "sounded like crack" at the time. By 2003 or so, it just didn't seem like there was a whole lot of room for a new distribution. But Mark had some interesting ideas, and Jeff signed on; the result, of course, was Ubuntu. Ubuntu has clearly had some success, but, in some important ways, it has failed to work out - at least for Jeff. He found himself distracted by Ubuntu's lack of participation in Debian, from which it derived its product. There was a real tension between tracking Debian and tracking upstream projects more directly. Despite Jeff's insistence that Ubuntu should be tracking (and pushing updates into) Debian's unstable distribution, Ubuntu often chose to go with upstream, resulting in what is, in effect, a fork of the Debian distribution - in terms of both the technology and the community. What Ubuntu was doing was taking upstream packages, modifying them, bringing in shiny new features, and generally looking for ways to differentiate itself from the other distributors. So, for example, the first Ubuntu release contained a great deal of Project Utopia work (aimed at making hardware "just work" with Linux) which had been done by developers from other distributions; Ubuntu shipped it first, though, and got a lot of credit for it. Novell's behind-closed-doors development of Xgl was motivated primarily by the wish to keep Ubuntu from shipping it first. Meanwhile, Red Hat had slowly learned that trying to differentiate itself by diverging from upstream was a path to pain. So Red Hat's developers created AIGLX, in an open, community oriented manner; the result is that AIGLX has proved to be the winning technology. Events like these led Jeff to wonder about just where the integration of packages should be done - upstream or downstream? 
From Jeff's (GNOME-based) upstream point of view, he wonders why he doesn't have a direct relationship with his users. While most projects deliver their code through middlemen (distributors), there is an example of a project which has managed to maintain a much more direct relationship: Firefox. Most Firefox users are direct clients of the project - though most of them are Windows users. The Firefox trademark has been used to ensure that, even when distributors are involved, the upstream developers get a say in what is delivered to users. So, what happens if you take out the middleman? It's instructive to look back at what life was like before there were distributors. It was, Jeff says, much like pigs playing in mud; perhaps they enjoyed it, but it was messy. There are, in fact, a lot of good things that distributors have done for us. You can get a fully integrated stack of software from one source, and the distributor acts, in a way, as the user's advocate toward the upstream project. We don't want to lose out on all that. But, if one were to look at facilitating a more direct relationship between development projects and their users, one would want to take advantage of a number of maturing technologies. These include: OpenID. Any process of distributing distributions must look at distributed identity, and OpenID is the way to do it. DOAP. "Sounds terrible" but it's a useful way of describing a project with XML. With a DOAP description, a user can find a project's mailing lists, bug tracker, source repository, etc. Atom. This is how projects can distribute information about what they are doing. XMPP. This is a Jabber-based message queueing and presence protocol. It can be used for more active publishing of information than Atom can do. Distributed revision control. Lots of functionality for integration between projects, and between upstream and downstream.
Jeff sees git as a step backward, though; some of the other offerings, he thinks, have much better user interfaces. Also important are the packaging efforts which are underway in a number of places. These include Fedora, which is "becoming competitive with Debian" as a community project. OpenSUSE has put together a build system which can create packages for a number of distributions. Debian has had a community build system for years; there is interest in Debian in going the next step, though - ideas like building packages directly from a distributed version control system. Ubuntu's Launchpad was "a spectacular vision," though the reality is "a bit of a snore"; it didn't achieve its goal of helping upstream and downstream work together. Then there's Bugzilla, which is the "bug filing gauntlet" between projects and their users. The Debian bug tracking system has done a better job of facilitating bug reports by allowing them to be submitted by email. But most big projects are using Bugzilla. It would be much improved by using OpenID (so that users would not have to register to file bugs) and some sort of Atom-based feed which would make querying bugs easy. If you take out the distribution, what do you replace it with? How do we achieve consistency? We need to create standards for how we interact with each other. And we can, in fact, be very good at consistency and standards when the need is clear. Good release management is a step toward that goal. GNOME once had very bad release management, but has pulled it together. Doing time-based releases was a hard sell, but few developers would want anything else now. Now GNOME release management just works. Consistency in source management is needed. Once upon a time that was done through CVS, but CVS is no longer up to the job, and now every project is using a different distributed version control system. But, sooner or later, one of the competing projects will win out and "hopefully we'll have clarity again." 
Autotools and pkgconfig can also go a long way toward creating consistency between projects. So, if we can push the available tools up into the upstream projects, those projects can get better at producing packages for distributions themselves. Once the tools (like bug trackers) can talk to each other, people will start making more use of them and network effects will take over. But, at the moment, the knowledge about integration remains at the distribution level. Debian, Jeff thinks, is well placed to take on a project like this and push its integration knowledge upstream. While Debian has typically been ten years ahead of everybody else in its packaging and integration abilities, it currently has a "relevancy problem." Finding ways to help upstream projects support their users more directly while maintaining overall integration and consistency would be a perfect way for Debian to maintain its leadership in this area. That could change the game for everybody, bringing projects closer to their users and making us all "happy as pigs in mud." More stuff for 2.6.25 Since last week's installment, some 3800 changesets have been merged into the mainline git repository. Some of the more interesting user-visible changes found in that patch stream include: Support for new hardware, including RDC R-321x system-on-chip processors, Onkyo SE-90PCI and SE-200PCI sound devices, Xilinx ML403 AC97 controllers, TI TLV320AIC3X audio codecs, Realtek ALC889/ALC267/ALC269 codecs, VIA VT1708B HD audio codecs, SiS 7019 Audio Accelerator devices, C-Media 8788 (Oxygen) audio chipsets, Asus AV200-based sound cards, Freescale MPC8610 audio devices, Audiotrak Prodigy 7.1 HiFi audio devices, Conexant 5051 audio codecs, MediaTek/TempoTec HiFier Fantasia sound cards, wireless RNDIS devices (and Broadcom 4320-based devices in particular), USB printer gadgets (intended for use in printer firmware), and NetEffect 1/10Gb ethernet adapters. 
The nearly-unused ALSA sequencer instrument layer has been removed. SELinux has a new set of checks which allow the creation of policies which control the flow of packets into and out of the system. Netfilter has a more flexible "hashlimit" mechanism for limiting the number of packets to/from a given source over time. There is a new "flow" classifier for the network fair queueing code which allows the more flexible creation of traffic policies. The futex mechanism has a new "bitset wait" mechanism which allows for more targeted wakeups. This feature will be used by glibc to implement optimized reader-writer locks. PCI hotplug is no longer an experimental feature. Support for PCI Express ASPM, a power management protocol, has been added. The virtio "balloon" driver (which can be used to change the amount of memory used by a KVM guest) and PCI driver have been added. The CLONE_STOPPED bit (for the clone() system call) is said to be unused and is planned for removal. For 2.6.25, a warning will be printed. The timerfd() system call is back, with a reworked, more capable API. The page map patches, which enable much better accounting of memory use by processes, have been merged. The "PM QOS" infrastructure allows both kernel and user-space code to register quality-of-service requirements (in the form of CPU DMA latency, network latency, and network throughput). These requirements will be taken into account when the kernel considers putting the system into a lower-power state. Per-process capability bounding sets (which permanently remove potential capabilities from a process) are now supported. 64-bit capability mask support has also been merged. The simplified mandatory access control kernel (SMACK) security module has been merged. The smbfs filesystem has (finally) been deprecated in favor of CIFS. It is now scheduled for removal in 2.6.27. There is a new RPC transport module allowing (client) NFS mounts using RDMA. 
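The futex "bitset wait" item above lends itself to a small model. As far as your editor understands the mechanism, each waiter supplies a bit mask when it goes to sleep, and a wakeup only reaches waiters whose mask shares a bit with the mask given by the waker. The sketch below models only that matching rule; the `BitsetWaitQueue` class and the `READER`/`WRITER` bits are hypothetical, and no real futex system call is made.

```python
from collections import deque

MATCH_ANY = 0xFFFFFFFF  # a mask that matches every waiter
READER = 0x1            # hypothetical bit for waiting readers
WRITER = 0x2            # hypothetical bit for waiting writers


class BitsetWaitQueue:
    """Toy model of a wait queue with bitset-targeted wakeups."""

    def __init__(self):
        self._waiters = deque()  # (name, bitset) pairs in arrival order

    def wait(self, name, bitset=MATCH_ANY):
        if bitset == 0:
            raise ValueError("a waiter with an empty bitset could never be woken")
        self._waiters.append((name, bitset))

    def wake(self, max_count, bitset=MATCH_ANY):
        """Wake up to max_count waiters whose bitset overlaps `bitset`."""
        woken = []
        remaining = deque()
        for name, waiter_bits in self._waiters:
            if len(woken) < max_count and (waiter_bits & bitset):
                woken.append(name)
            else:
                remaining.append((name, waiter_bits))
        self._waiters = remaining
        return woken
```

A reader-writer lock built on such a queue could wake every waiting reader with one call and leave the writers asleep, which is the sort of targeted wakeup the entry above describes.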
Changes visible to kernel developers include: A large number of SUNRPC symbols (rpc_* and rpcauth_*) have been changed to GPL-only exports. The x86 architecture merger continues, with quite a few files being coalesced. The "flatmem" and "discontigmem" memory models have been removed on the 64-bit x86 architecture; "sparsemem" is now used for all builds. The x86 spinlock implementation has been replaced with a "ticket spinlock" mechanism which provides fair FIFO behavior. The fastcall function attribute didn't do anything on the x86 architecture, so it has been removed. x86 has a new set of functions for easily manipulating page attributes; these take a beginning address. There is also a set of set_pages_* functions which take a struct page pointer rather than a beginning address. Early-boot debugging of x86 systems via the FireWire port is now supported. Bidirectional command support has been added to the SCSI layer. There is a new process state called TASK_KILLABLE. It is a blocked state similar to TASK_UNINTERRUPTIBLE, with the difference that a wakeup will happen upon delivery of a fatal signal. The idea is to allow (almost) uninterruptible sleeps, but to still allow the process to be killed outright - thus ending the problem of unkillable processes stuck in the "D" state. There is a new set of functions for using this state: wait_event_killable(), schedule_timeout_killable(), mutex_lock_killable(), etc. add_disk_randomness() has been unexported as there are no more in-tree users. pci_enable_device_bars() has been replaced by two more-specific functions: pci_enable_device_io() and pci_enable_device_mem(). The high-resolution timer API has been augmented with a function which moves the given timer's expiration forward past the current time as determined by the associated clock.
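The name and prototype of the new high-resolution timer function do not appear in the text above, so the following sketch models only the described behavior: pushing an expiration time forward until it lies past "now." It assumes, as the kernel's existing hrtimer_forward() does, that the timer advances in whole multiples of an interval and that the number of intervals skipped is returned. The name `forward_past` and the use of plain integers for times are hypothetical.

```python
def forward_past(expires, now, interval):
    """Advance `expires` in steps of `interval` until it is later than `now`.

    Returns (overruns, new_expires), where overruns is the number of
    intervals added. All values are integers in one unit (say, nanoseconds).
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if now < expires:
        # The timer has not expired yet; nothing to do.
        return 0, expires
    # Whole intervals that fit into the elapsed time, plus one more
    # so that the new expiration is strictly in the future.
    overruns = (now - expires) // interval + 1
    return overruns, expires + overruns * interval
```

A periodic timer callback would typically use such a call to re-arm itself without drifting: even if the callback ran late, the next expiration stays on the original grid of intervals.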
The device structure now holds a pointer to a device_dma_parameters structure, whose parameters are used by the DMA mapping layer (and the IOMMU mapping code in particular) to ensure that I/O operations are set up within the device's constraints. The PCI layer supports this feature with two new functions; drivers for devices with unusually strict DMA limitations should probably use them to ensure that those restrictions are respected. One thing which has not made it into 2.6.25 is the KGDB debugger for the x86 architecture. Amusingly, a linux.conf.au kernel mini-conf discussion of "sneaking" KGDB past Linus proceeded for some time before the participants noticed him standing in the back of the room listening to the whole thing. His current position is that he won't pull it as part of the x86 tree, and he's still not much interested in the idea in general. As of this writing, the merge window is still open and could stay that way for as much as a week. So more interesting code could still find its way in through this merge window; stay tuned. An interview with the new openSUSE community manager Joe 'Zonker' Brockmeier has joined the openSUSE project as the openSUSE community manager. We were pleased to have the opportunity to ask Zonker a few questions about his new job. Many LWN readers will remember that you were a regular contributor to LWN. Any comments on what you have been up to between there and here? Sure -- I stopped contributing to LWN when I took a full-time job with OSTG/Linux.com (now the company known as SourceForge), and had to stop freelancing. I was editorial director there for two years, and then joined Linux Magazine as Editor-in-Chief. I've missed contributing to LWN, but I still read LWN religiously. As community manager will you be employed by Novell? Yes. Will you report to the openSUSE board? I will be working with the board, but I report to Justin Steinman at Novell.
It's an unusual position, though, because my job is in large part to be an advocate/ombudsman for the community. openSUSE has adopted a Code of Conduct for mailing lists and IRC. As community manager, will policing this traffic be a part of your job? No -- we don't plan to have anyone actively policing the lists looking for violations. Instead, the board is working on a policy to allow community members to bring violations of the Code to the board to decide whether disciplinary action should be needed. I hope that it's something that won't be needed often, or at all -- and I don't think it will be needed often. How much control does Novell hold over openSUSE development? Should there be more or less control? Is Novell allowing the community to make its own decisions? Right now, I'd say Novell is still guiding development pretty closely, but would like the community to have a more prominent voice in the direction of the development of openSUSE. I think the Fedora Project is a pretty good model here, and I really think Max Spevack did a great job in terms of helping Fedora come into its own. The openSUSE Board appointed last November is a step towards giving the community more control over governance of the project. This is a new position. How much latitude will you have to define what the community manager is/does? Well, certain aspects of the job are already well-defined. For example, a big part of the job will be traveling to conferences to speak about openSUSE and also to organize an openSUSE conference. But there's definitely some room to define the role as well. OpenSUSE has a weekly news letter which has come out almost weekly since its inception last November. Do you have any plans to get involved with that? Is it useful? Yes, I do plan to contribute and help out with that where needed. I think it's very useful -- communication is vital to the health of a project like openSUSE. 
There are a lot of people contributing to openSUSE, and without something like the weekly news, it would be easy for contributors to lose track of what their colleagues are doing. It's also important to spreading the news outside of the openSUSE community so that other open source projects know what we're up to and possibly find ways to collaborate and help reduce duplication of effort between projects. Finally, I think it's a good way to show what various contributors are doing and help recognize the contributors that are having an impact on openSUSE. What are your plans for the openSUSE community? Over the long term, I'd like to help foster increased adoption of openSUSE by a significant amount -- which means doing a better job of promoting the distro, as well as communicating with potential users and finding out what it is they need/want from openSUSE and working on delivering that. (I'd encourage LWN readers to check out the alpha builds for openSUSE 11.0 and give us feedback as we're working on the final 11.0 release that should be done in July.) I also want to work on developing a recognition system so that contributors are acknowledged for their work, which we're doing more on already -- we just announced our membership program for contributors to be recognized. I also want to make sure we're providing a "roadmap" so that potential contributors have a clear path into the project and know where to get started -- whether that's development, artwork, documentation, quality assurance, advocating openSUSE, or supporting other users. Also, organize the first openSUSE conference, make sure openSUSE is better represented at other conferences, and help provide potential contributors with a roadmap to becoming contributors. I'd like to make it as easy as possible for people to participate. Finally, but not least -- I want to do what I can to help coordinate increased cooperation between Linux distros and reduce duplication of effort. 
While a lot of folks might like to portray the situation as openSUSE vs. Fedora, Ubuntu, or any other distro, I don't see it that way -- if someone is already happily using another distro, then I consider that a win. I want to focus on attracting people who aren't running Linux at all yet. There's plenty of work left to do, and I hope we can do a better job of pooling our resources to attract those people. Is there anything you would like to add? Just that I'd like to encourage LWN readers to visit zonker.opensuse.org and news.opensuse.org for updates on the openSUSE project, and to feel free to contact me (zonker@opensuse.org) with any questions, suggestions, and comments related to openSUSE. Thank you for taking the time to answer our questions. linux.conf.au 2008 linux.conf.au has an interesting structure which differentiates it from most other events. Every year, a completely new set of organizers takes over the event, moves it to a new city, and puts its own stamp on it. They have a great deal of freedom in how they run LCA, but there is still a group of Linux Australia members and past organizers who keep an eye on things and help ensure that the event does not run into problems. The result is a conference which has a lot of fresh energy every year, but which is also reliably interesting. Many attendees consider it to be one of the best Linux events to be found anywhere in the world. This year, LCA was held in Melbourne, Australia; the organizing team was led by Donna Benjamin. The now-familiar LCA formula was followed, but with some small changes. The tutorial day is no more, replaced by relatively short tutorial sessions on each day. The traditional auction for charity was also gone this year; instead, a raffle (with Greg Kroah-Hartman's 2.6.22 contributor poster as the main prize) yielded some $1000 for a local penguin refuge. 
The raffle was certainly a lower-pressure, less alcohol-fueled way of raising money, but LCA without Rusty Russell as auctioneer just isn't quite the same. That quibble notwithstanding, LCA 2008 was an interesting, well-organized, and well-attended event. Ms. Benjamin and company have certainly upheld the standards for this conference. A number of LCA talks have been covered in separate LWN articles, and a few more may yet follow. This article will quickly review a few other high points, as seen from your editor's perspective. It's worth noting that videos for almost all of the talks have been posted on the conference web site. Certainly one high point came on January 30, the day that LWN celebrated its tenth anniversary. The crowd sang a rousing - if not entirely harmonious - version of "happy birthday" after Bruce Schneier's keynote. The following morning tea featured special LWN muffins; they were, much to your editor's delight, of the intense chocolate variety. It is hard to imagine a better place or time to celebrate ten years of LWN. While most LCA presentations are quite technical in nature, there are exceptions. Australian lawyer Kimberlee Weatherall's talk on legal issues was called "Stop in the name of law"; it covered a number of topics of interest to a global audience. Kimberlee, it's worth noting, was the recipient of the "Rusty Wrench" award for service to the free software community at last year's LCA in Sydney. The Digital Millennium Copyright Act, she noted, is ten years old now. At this point, the debate on its anti-circumvention provisions is essentially done, and anti-circumvention has won; she is not expecting to see any major changes in countries which have adopted such laws. The music industry may be moving away from use of DRM, but "they were never very good at it anyway." DRM is still going strong in other areas, such as movies and subscription television. Similarly, the fight to end software patents is over, and we have lost.
There are incredible numbers of software patents issued every year; every one of those patents represents a significant investment by its owner. The total amount of investment in these patents is huge; that amount of money is almost impossible to displace. It is also very hard to define what a software patent really is; there are thousands of them in Europe, which ostensibly does not allow software patents. No matter how the rules are written, lawyers will find a way around them. What is happening on the patent front, instead, is a more constructive engagement with the process. Some reform is happening in the US, as a result of the KSR decision and various attempts to mitigate the costs associated with patents. So the situation might improve slowly over time. GPLv3 is out. It now has to pass two tests: the market test (will projects use it?) and any legal tests which might be brought. Kimberlee expressed some doubts on whether GPLv3 will really hold up in court, but did not elaborate on them. There is a new threat out there which we should not underestimate: the push to force copyright enforcement duties onto ISPs. This effort takes two forms: getting "infringers" disconnected, and requiring ISPs to filter data passing through their networks. There are a lot of problems with either approach, but that is not stopping the industry (and others, such as anti-porn crusaders) from pushing hard for ISP responsibility. This is a fight to watch. So what should the free software community do? Not much, says Kimberlee, except to keep coding. The production of good code brings us allies with money, and that's what we're going to need. As long as we are successful, people will go out of their way to protect us. Keep doing what we do, and things should come out OK. Anthony Baxter is the Python release manager; he was also the keynote speaker for the third day of the conference. He is, to say the least, an entertaining speaker, so this would be a good one to watch on video.
The talk was about coming changes in Python, and Python 3.0 in particular. The 3.0 release, he says, is "the one where we break all of your code." It's the first backward-incompatible update of the language (at least, if you don't deal in C extension modules). There are a lot of changes to the language which your editor will not repeat here; they are well documented on the Python web sites. As noted, many of these changes will cause existing code to break. This is being done, says Anthony, because the Python language is now 16 years old. Like all 16-year-olds, it has a number of annoying features. It's time to clean out a lot of accumulated cruft and get back to the minimal, "there is one way to do it" vision that has always driven the language. Perhaps what's most interesting is what won't be done. The language will not be bloated - it will stay Python. There will be no braces; white space will still be used to mark blocks of code. The much-criticized global interpreter lock will remain. And, importantly, this will be an incremental (if big) update - there will be no overall rewrite of the interpreter. The experience of certain other projects (being Perl 6 and Mozilla) shows that total rewrites tend to be much longer, more painful affairs than anybody might envision at the outset. There will be migration tools, of course, and warnings built into the forthcoming 2.6 release which will point out things that may cause migration difficulties. The 2.x series will be supported for some years into the future. And, says Anthony, there will be no Python 4.0 release. This is their one chance to break everything and start over, and they plan to get it right this time. Dave Jones is the head maintainer for the Fedora kernel. At LCA 2008 he took a break from pointing out user-space problems and talked about "a day in the life of a distribution kernel maintainer." 
The real subject of the talk was the process that the Fedora project goes through to put together the kernels they ship. There are currently three developers working on the Fedora kernel (Dave, Chuck Ebbert, and Kyle McMartin), and "several dozen" working on the RHEL kernels. Most of the RHEL folks are doing backports of fixes, drivers, etc. to the older kernels used by RHEL releases. Once a kernel has been chosen for release, it's time to start adding patches. Some interesting numbers were put up at this point. Red Hat Linux 7 had 70 patches added to its 2.2.24 kernel. That number went slowly up, to the point where Fedora Core 6 had 191 patches. There are currently 63 patches added to the Fedora 8 kernel, though that may grow over the life of this release. By comparison, RHEL 5 is shipping a 2.6.18 kernel with 1628 patches added to it - a very different world. There are all kinds of patches which go into a distributor kernel. These include security technologies (ExecShield) which have not made it into the mainline, changes to some default parameters, the silencing of certain "scary messages" which tend to provoke lots of needless bug reports, out-of-tree drivers, patches which help debug problems found in the field, stuff which has been vetoed upstream, and more. Then it's a matter of putting the package out and dealing with the subsequent bug reports - lots of them. The closing ceremony included the traditional introduction of the organizer for next year's event. This event will go, for the first time ever, to Hobart, Tasmania; see MarchSouth.org for more information. There is some information on what this team is planning in the bid document [1.6MB PDF]; your editor is intrigued by the following: "The official Speakers' Dinner will be held at a mystery location south of Hobart following a 40 minute river cruise on a high speed luxury catamaran." It's never too soon to get that talk proposal together.
Finally, the last few LCA events have included the passing of the "Rusty Wrench" award to somebody who has performed a great service to the community. Recipients so far are Rusty Russell (after whom the award is named), Pia Waugh, and Kimberlee Weatherall. The Rusty Wrench was not awarded at LCA2008, though. It seems that, in the future, the Rusty Wrench will be part of an extensive set of awards which will be handed out at a separate "gala dinner" event held in the (Australian) winter. The awarding of the Rusty Wrench was a nice LCA feature which will be missed, but, then, there are advantages to having another excuse to visit Australia. PostgreSQL releases version 8.3 Version 8.3 of the PostgreSQL DBMS was announced on February 4, 2008: "Today the PostgreSQL Global Development Group releases the long-awaited version 8.3 of the most advanced open source database, which cements our place as the best performing open source database." Version 8.3 brings many new features. First on the list is the cleaning up of data type conversions. This improvement may cause backwards-compatibility issues with older applications, but will ensure better data integrity in the future. There are four new capabilities that aim to improve the consistency of response times; these include Heap Only Tuple for speeding up access to frequently updated data, asynchronous commits, spread checkpoint autotuning and a just-in-time background writing strategy. There have been numerous speed improvements including better recovery time for the write ahead log, faster small-merge joins, faster LIKE/ILIKE comparisons, improvements to searches using LIMIT, lazy XID assignment for improving read-mostly database speed and function costing for faster query planning. Large database support improvements include synchronized scans for multiple users, level 2 cache scan protection to prevent CPU thrashing and reductions in the size of headers for variable size fields.
Windows users will benefit from new Visual C++ support and some code rewrites. Administration improvements include output of logs to database-loadable files, SSPI and GSSAPI support for Kerberos authentication, embeddable GUC settings at function creation time, parallel autovacuum workers, the pg_standby tool for configuring warm standby servers and a new ability to specify the position of NULLs at the beginning or end of results. Development improvements include API improvements to the full text search tool, plan invalidation for clearing cached plans and automatically dropping plans when tables are updated, and updatable cursors. Data type enhancements include full support for the ANSI SQL:2003 XML spec, support for 128 bit UUIDs, support for arrays of compound types and support for ENUM columns with a defined ordered list of alternatives. The ENUM enhancement allows applications to be migrated from the MySQL DBMS. The PostgreSQL stored procedure language has a simplified syntax for row-returning functions and new support for scrollable cursors, which allows procedures to perform complex row manipulations. A number of new accessory tools are being released with PostgreSQL 8.3 including a multi-threaded connection pooler, a distributed, horizontally scaled table interface, an SNMP interface, a SELinux-based security extension, a new GUI with debugging and step-through execution capabilities, a new replicated query agent, a multi-master asynchronous replication system, an integrated clustering tools project and an improved replication system. For more information on the new features in PostgreSQL 8.3, see the release notes. The feature matrix gives a tabular view of features added versus the version number. 
To speed up the next release, the PostgreSQL team plans to implement a new development plan for version 8.4: "In the 8.4 development cycle we would like to try a new style of development, designed to keep the patch queue to a limited size and to provide timely feedback to developers on the work they submit. To do this we will replace the traditional 'feature freeze' with a series of 'commit fests' throughout the development cycle. The idea of commit fests was discussed last October in -hackers, and it seemed to meet with general approval. Whenever a commit fest is in progress, the focus will shift from development to review, feedback and commit of patches. Each fest will continue until all patches in the queue have either been committed to the CVS repository, returned to the author for additional work, or rejected outright, and until that has happened, no new patches will be considered." Version 8.3 represents a major step forward for PostgreSQL; if the new development style bears fruit, the next major version will come about more quickly. CRFS and POHMELFS Performance, or lack thereof, has often been a knock against the venerable Network File System (NFS), but no real competition has emerged. NFS also has some serious flaws for programmers and users, with behavior that is markedly different from that of local filesystems. Both of these problems are spurring the creation of new network filesystems, two of which were announced in the last week. The Coherent Remote File System (CRFS) was introduced last week at linux.conf.au by Zach Brown of Oracle. It uses BTRFS (pronounced "butter-f-s") as its storage on the server, rather than layering atop any POSIX filesystem as NFS does. According to Brown, BTRFS has a number of important features that outweigh the inconvenience for users of getting their data into a BTRFS volume. The biggest is the ability to do compound operations (creating or unlinking a file, for example) in an atomic and idempotent manner.
CRFS has a userspace daemon (crfsd) that talks to the BTRFS volume as well as to multiple clients. The clients use the kernel VFS caching infrastructure extensively, and thus are implemented as kernel modules. A user wishing to access the underlying BTRFS volume on the server must mount it as a CRFS volume; crfsd must have exclusive access to the BTRFS volume. This is also different from NFS, which will cooperate with local mounts of the underlying filesystem. The basic idea behind CRFS is to have clients cache as much of the filesystem data as they can while using cache coherency protocols to reduce the amount of network traffic that gets generated. Clients keep track of the cache state for each object they have stored, while the server tracks the cache state of all objects that any client has. The messages between server and client consist of cache state transitions and the data being transferred. Data transfer in both directions is done using CRFS "item ranges". CRFS uses the BTRFS key scheme to represent objects (file data, directories, directory entries, inodes, etc.) in the filesystem. An item range is a contiguous section of the key space, specified by a minimum and maximum key value as part of the message. When the client is filling its cache, it can request a particular key but also offer to take other surrounding keys as part of the response; if the server sees those keys in the BTRFS leaf node, it can send them along as well. The current performance of CRFS for a simple untar is something on the order of a 3x speedup over asynchronous NFS mounts. Comparing to synchronous NFS mounts (where each write has to actually hit the remote disk) is not sensible; there is a roughly 10x speed difference between the two types of NFS mounts. Brown has been working on CRFS for "about a year" and is planning to release the code eventually.
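The item-range idea can be made a little more concrete with a small sketch. The CRFS code has not been released, so everything below is an assumption made for illustration: the structure and function names are hypothetical, the key is modeled loosely on the BTRFS (objectid, type, offset) triple, and the range bounds are treated as inclusive. It shows only how a server might decide which keys from a leaf fall inside the range a client has offered to accept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key, loosely modeled on the BTRFS (objectid, type, offset) triple. */
struct item_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Hypothetical item range: a contiguous section of the key space.
 * Treating both bounds as inclusive is an assumption of this sketch. */
struct item_range {
    struct item_key min;
    struct item_key max;
};

/* Order keys by objectid, then type, then offset. */
static int key_cmp(const struct item_key *a, const struct item_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* True if the key lies within [min, max]. */
static bool range_contains(const struct item_range *r, const struct item_key *k)
{
    return key_cmp(&r->min, k) <= 0 && key_cmp(k, &r->max) <= 0;
}

/* Count the keys in a leaf that could be sent along with a response
 * covering the given range. */
static size_t keys_in_range(const struct item_key *leaf, size_t n,
                            const struct item_range *r)
{
    size_t count = 0;

    assert(key_cmp(&r->min, &r->max) <= 0);   /* range must not be inverted */
    for (size_t i = 0; i < n; i++)
        if (range_contains(r, &leaf[i]))
            count++;
    return count;
}
```

A client asking for one key while offering to accept its neighbors would, in this model, send a range whose minimum and maximum bracket the key it actually needs.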
Until the code is released, the slides [PDF] and video [Theora] from his talk, along with a few postings to his weblog, are the only sources of information about CRFS. Another filesystem that aims to have a broader reach than CRFS is the Parallel Optimized Host Message Exchange Layered File System (POHMELFS), announced in a linux-kernel posting by Evgeniy Polyakov. POHMELFS is meant to be a building block for a distributed filesystem that would offer a multi-server architecture and allow for disconnected filesystem operations. Polyakov has only been working on it for a month, so it is, at best, the start of a proof of concept. The POHMELFS vision is in some ways similar to CRFS in that the clients will handle as much as possible locally, with minimal server interaction. Like CRFS, client kernel modules talk to a server userspace daemon, using cache coherency protocols to keep the data and metadata in sync. For CRFS, the coherency is not yet implemented, but is fleshed out to some extent, while POHMELFS has quite a bit of fleshing out to do. Unlike CRFS, POHMELFS supports POSIX filesystems on the server side and the code is available now. There are some rather large hurdles to overcome in the POHMELFS vision, not least of which is handling file IDs in separate client-side filesystems such that they can be synchronized with the server. The current code implements a write-through cache version that creates objects on the server before they are used in the client side cache. There is also an additional patch that implements a hack to disable the writeback cache and use only the client side caching. The latter is, not surprisingly, very fast, but not terribly usable for multiple mounts of the filesystem. Essentially Polyakov is showing the benefits of client-side caching, but in the context of a broader scheme. It will be a long time, if ever, before we see some descendant of either of these filesystems in the kernel.
There is much work to be done, but they are worth looking at to see where networking and distributed filesystems may be headed. For them to be useful outside of just the Linux world—like the ubiquity of NFS—there would have to be some kind of standardization followed by adoption by the major players. That will take a very long time. Security hardening for Debian Making the programs in a distribution more resistant to exploits—a process known as hardening—is a fairly common way to reduce the attack surface for the distribution. Many distributions have made an effort in this area, with some adding in an overall security architecture, like AppArmor for SUSE or SELinux for Red Hat and Fedora distributions. Debian is currently looking at enabling some hardening features, potentially throughout a large swath of packages that it distributes. The features being considered and the concerns raised provide an interesting look at the tradeoffs. A posting to debian-devel-announce regarding hardening features for Lenny started the conversation. Those packages that are most susceptible—network services, packages that parse files from untrusted sources, or those that have been the subject of a security alert—should enable a set of security tools that will help deflect attacks against them. Various attacks rely upon certain characteristics of Linux binaries that allow them to be exploited. By altering the way the binaries are built, those particular threats can be mitigated. The experimental hardening-wrapper package makes enabling the various toolchain differences as easy as setting DEB_BUILD_HARDENING=1 in the environment. This will change gcc, g++, and ld to use the desired flags when building packages. Each hardening feature can also be disabled separately by setting DEB_BUILD_HARDENING_xyzzy=0 (where xyzzy is the name of a hardening feature) if they cause build or performance problems for a particular package. 
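To make the class of bugs concrete, here is a minimal sketch of two patterns that this sort of hardening is aimed at: passing untrusted input as a format string, and copying into a fixed-size buffer without a bound. The helper names format_untrusted() and copy_bounded() are hypothetical and exist only for this illustration; the code shows the defensive form of each pattern and does not depend on any of the hardening flags being enabled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: place an untrusted string into a fixed buffer.
 *
 * The risky form would be snprintf(dst, dstlen, untrusted): any "%s" or
 * "%n" in the input would then be interpreted as a format directive.
 * Using a literal "%s" format treats the input purely as data. */
static int format_untrusted(char *dst, size_t dstlen, const char *untrusted)
{
    return snprintf(dst, dstlen, "%s", untrusted);
}

/* Hypothetical helper: copy at most dstlen - 1 bytes and always
 * NUL-terminate, instead of an unbounded strcpy(). Returns the number
 * of bytes copied, not counting the terminator. */
static size_t copy_bounded(char *dst, size_t dstlen, const char *src)
{
    size_t n = strlen(src);

    assert(dstlen > 0);          /* need room for the terminator */
    if (n >= dstlen)
        n = dstlen - 1;          /* truncate rather than overflow */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

The toolchain changes act as a safety net for code that was not written this carefully: the aim is for mistakes of this kind to be caught at build time or to fail safely at run time rather than become exploitable.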
The specific features enabled are described in the original posting as well as, in more detail, on the Debian wiki entry for Hardening. They are:
using -Wformat to catch printf() family calls that do not have a string literal for the format string, which can lead to problems if the argument came from an untrusted source and contains format specifiers.
using -D_FORTIFY_SOURCE=2 to validate glibc calls such as strcpy() when the buffer sizes are known at compile time, which can help stop buffer overflow attacks.
using -fstack-protector to thwart most stack smashing attacks.
creating Position Independent Executables (PIE), which facilitates using the Address Space Layout Randomization that is available in some kernels. This makes it difficult for an attacker to have any knowledge of what the addresses for the program's sections will look like.
using ld -z relro to change certain sections to be read-only once ld has made its modifications while loading the program. This can thwart attacks that try to overwrite the Global Offset Table (GOT).
Many other distributions have already been down this path: Gentoo has a page describing their hardened toolchain, Mark Cox of Red Hat has a detailed look at the evolution of security features in Red Hat and Fedora releases, OpenSUSE has a page about its security features, and so on. There is a price to be paid in binary size, execution speed, and cache behavior for these techniques, but for most environments, where resources are not massively constrained, the cost is worth it. It makes new attacks against those systems more difficult to design, which will make users and administrators sleep a little better at night. Ticket spinlocks Spinlocks are the lowest-level mutual exclusion mechanism in the Linux kernel.
As such, they have a great deal of influence over the safety and performance of the kernel, so it is not surprising that a great deal of optimization effort has gone into the various (architecture-specific) spinlock implementations. That does not mean that all of the work has been done, though; a patch merged for 2.6.25 shows that there is always more which can be done. On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by an integer value. A value of one indicates that the lock is available. The spin_lock() code works by decrementing the value (in a system-wide atomic manner), then looking to see whether the result is zero; if so, the lock has been successfully obtained. Should, instead, the result of the decrement operation be negative, the spin_lock() code knows that the lock is owned by somebody else. So it busy-waits ("spins") in a tight loop until the value of the lock becomes positive; then it goes back to the beginning and tries again. Once the critical section has been executed, the owner of the lock releases it by setting it to 1. This implementation is very fast, especially in the uncontended case (which is how things should be most of the time). It also makes it easy to see how bad the contention for a lock is - the more negative the value of the lock gets, the more processors are trying to acquire it. But there is one shortcoming with this approach: it is unfair. Once the lock is released, the first processor which is able to decrement it will be the new owner. There is no way to ensure that the processor which has been waiting the longest gets the lock first; in fact, the processor which just released the lock may, by virtue of owning that cache line, have an advantage should it decide to reacquire the lock quickly. One would hope that spinlock unfairness would not be a problem; usually, if there is serious contention for locks, that contention is a performance issue even before fairness is taken into account.
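The decrement-based scheme described above can be modeled in user space with C11 atomics. What follows is a minimal sketch under stated assumptions, not the kernel's code (which is architecture-specific assembly); the names old_spinlock, old_spin_lock() and friends are invented for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of the 2.6.24-style lock: 1 means "free". */
typedef struct {
    atomic_int value;
} old_spinlock;

static void old_spin_init(old_spinlock *l)
{
    atomic_init(&l->value, 1);
}

/* One acquisition attempt: atomically decrement and succeed only if the
 * result is zero. atomic_fetch_sub() returns the value from before the
 * decrement, hence the "- 1". */
static int old_spin_try_once(old_spinlock *l)
{
    return atomic_fetch_sub(&l->value, 1) - 1 == 0;
}

static void old_spin_lock(old_spinlock *l)
{
    while (!old_spin_try_once(l)) {
        /* Somebody else owns the lock: spin until the value becomes
         * positive, then go back and try the decrement again. */
        while (atomic_load(&l->value) <= 0)
            ;
    }
}

static void old_spin_unlock(old_spinlock *l)
{
    assert(atomic_load(&l->value) <= 0);   /* caller must hold the lock */
    atomic_store(&l->value, 1);
}
```

Nothing in old_spin_lock() records who arrived first; whichever processor wins the race to decrement after the store in old_spin_unlock() becomes the new owner, which is the source of the unfairness.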
Nick Piggin recently revisited this issue, though, after noticing: On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times. This sort of runtime difference is certainly undesirable. But lock unfairness can also create latency issues; it is hard to give latency guarantees when the wait time for a spinlock can be arbitrarily long. Nick's response was a new spinlock implementation which he calls "ticket spinlocks." Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes called "next" and "owner." Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter. So, in the new scheme, the value of a lock is initialized (both fields) to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner." The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated.
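The ticket scheme can likewise be sketched in user space with C11 atomics. This is an illustration under stated assumptions rather than the merged patch: the names are invented, the placement of "next" in the high byte and "owner" in the low byte is a choice made for this sketch, and the unlock path uses a compare-and-swap loop so that incrementing "owner" never carries into "next" (the kernel code can simply operate on a single byte).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: "next" lives in the high byte, "owner" in the low byte. */
typedef struct {
    _Atomic uint16_t slock;
} ticket_lock;

static void ticket_init(ticket_lock *l)
{
    atomic_init(&l->slock, 0);   /* both fields start at zero */
}

/* Take a ticket: note the old value and bump "next" in one atomic step. */
static uint8_t ticket_take(ticket_lock *l)
{
    return (uint8_t)(atomic_fetch_add(&l->slock, 0x0100) >> 8);
}

/* The "now serving" number. */
static uint8_t ticket_owner(ticket_lock *l)
{
    return (uint8_t)(atomic_load(&l->slock) & 0xff);
}

static void ticket_lock_acquire(ticket_lock *l)
{
    uint8_t me = ticket_take(l);

    while (ticket_owner(l) != me)
        ;   /* spin until our number comes up */
}

static void ticket_unlock(ticket_lock *l)
{
    uint16_t old = atomic_load(&l->slock);
    uint16_t new;

    /* While the lock is held, "next" is always ahead of "owner". */
    assert((old >> 8) != (old & 0xff));
    do {
        /* Increment only the low byte, wrapping at 256 without a carry. */
        new = (uint16_t)((old & 0xff00) | ((old + 1) & 0x00ff));
    } while (!atomic_compare_exchange_weak(&l->slock, &old, new));
}
```

Because each field is a single byte, ticket numbers wrap after 256; that wraparound is what bounds the number of processors that can safely contend for one lock in this form.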
But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that? With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree. LCA: Two talks on the state of X The X window system is the kernel of the desktop Linux experience; if X does not work well, nothing built on top of it will work well either. Despite its crucial role, X suffered from relative neglect for a number of years before being revitalized by the X.org project. Two talks at linux.conf.au covered the current state of the X window system and where we can expect things to go in the near future. Keith Packard is a fixture at Linux-related events, so it was no surprise to see him turn up at LCA. His talk covered X at a relatively high, feature-oriented level. There is a lot going on with X, to say the least. Keith started, though, with the announcement that Intel had released complete documentation for some of its video chips - a welcome move, beyond any doubt. There are a lot of things that X.org is shooting for in the near future. 
The desktop should be fully composited, allowing software layers to provide all sorts of interesting effects. There should be no tearing (the briefly inconsistent windows which result from partial updates). We need integrated 2D and 3D graphics - a goal which is complicated by the fact that the 2D and 3D APIs do not talk to each other. A flicker-free boot (where the X server starts early and never restarts) is on most distributors' wishlist. Other desired features include fast and secure user switching, "hotplug everywhere," reduced power consumption, and a reduction in the (massive) amount of code which runs with root privileges. So where do things stand now? 2D graphics and textured video work well. Overlaid video (where video data is sent directly to the frame buffer - a performance technique used by some video playback applications) does not work with compositing, though. 3D graphics does not always work that well either; Keith put up the classic example of glxgears running while the window manager is doing the "desktops on a cube" routine - the 3D application runs outside of the normal composite mechanism and so cannot be rotated with all the other windows. On the tearing front, only 3D graphics supports no-tearing operations now. Avoiding tearing is really just a matter of waiting for the video retrace before making changes, but the 2D API lacks support for that. The integration of APIs is an area requiring some work still. One problem is that Xv (video) output cannot be drawn offscreen - again, a problem for compositing. Some applications still use overlays, which really just have no place on the contemporary desktop. It is impossible to do 3D graphics to or from pixmaps, which defeats any attempt to pass graphical data between the 2D and 3D APIs. On the other side, 2D operations do not support textures. Fast user switching can involve switching between virtual terminals, which is "painful." 
Only one user session can be running 3D graphics at a time, which is a big limitation. On the hotplug front, there are some limitations on how the framebuffer is handled. In particular, the X server cannot resize the framebuffer, and it can only associate one framebuffer with the graphics processor. Some GPUs have maximum line widths, so the one-framebuffer issue limits the maximum size of the internal desktop. With regard to power usage: Keith noted that using framebuffer compression in the Intel driver saves 1/2 watt of power. But there are a number of things to be fixed yet. 2D graphics busy-waits on the GPU, meaning that a graphics-intensive program can peg the system's CPU, even though the GPU is doing all of the real work. But the GPU could be doing more as well; for example, video playback does most of the decoding, rescaling, and color conversion in the CPU. But contemporary graphics processors can do all of that work - they can, for example, take the bit stream directly from a DVD and display it. The GPU requires less power than the CPU, so shifting that work over would be good for power consumption as well as system responsiveness. Having summarized the state of the art, Keith turned his attention to the future. There is quite a bit of work being done in a number of areas - and not being done in others - which leads toward a better X for everybody. On the 3D compositing front, what's needed is to eliminate the "shared back buffers" used for 3D rendering so that the rendered output can be handled like any other graphical data. Eliminating tearing requires providing the ability to synchronize with the vertical retrace operation in the graphics card. The core mechanism to do this is already there in the form of the X Sync extension. But, says Keith, nobody is working on bringing all of this together at the moment. Getting rid of boot-time flickering, instead, is a matter of getting the X server properly set up sufficiently early in the process. 
That's mostly a distributor's job. To further integrate APIs, one thing which must be done is to get rid of overlays and to allow all graphical operations (including Xv operations) to draw into pixmaps. There is a need for some 3D extensions to create a channel between GLX and pixmaps. Supporting fast user switching means adding the ability to work with multiple DRM masters. Framebuffer resizing, instead, means moving completely over to the EXA acceleration architecture and finishing the transition to the TTM memory manager. In the process, it may become necessary to break all existing DRI applications, unfortunately. And multiple framebuffer support is the objective of a project called "shatter," which will allow screens to be split across framebuffers. Improving the power consumption means getting rid of the busy-waiting with 2D graphics (Keith says the answer is simple: "block"). The XvMC protocol should be extended beyond MPEG; in particular, it needs work to be able to properly support HDTV. All of this stuff is currently happening. Finally, on the security issue, Keith noted the ongoing work to move graphical mode setting into the kernel. That will eliminate the need for the server to directly access the hardware - at least, when DRM-based 2D graphics are being done. In that case, it will become possible to run the X server as "nobody," eliminating all privilege. There are few people who would argue against the idea of taking root privileges away from a massive program like the X server. In a separate talk, Dave Airlie covered the state of Linux graphics at a lower level - support for graphics adapters. He, too, talked about moving graphical mode setting into the kernel, bringing an end to a longstanding "legacy issue" and turning the X server into just a rendering system. That will reduce security problems and help with other nagging issues (graphical boot, suspend and resume) as well. Mode setting is the biggest area of work at the moment.
Beyond that, the graphics developers are working on getting TTM into the kernel; this will give them a much better handle on what is happening with graphics memory. Then, graphics drivers are slowly being reworked around the Gallium3D architecture. This will improve and simplify these drivers significantly, but "it's going to be a while" before this work is ready. The upcoming DRI2 work will improve buffering and fix the "glxgears on a cube" problem. Moving on to graphics adapters: AMD/ATI has, of course, begun the process of releasing documentation for its hardware. This happened in an interesting way, though: AMD went to SUSE in order to get a driver developed ahead of the documentation release; the result was the "radeonhd" driver. Meanwhile, the Avivo project, which had been reverse-engineering ATI cards, had made significant progress toward a working driver. Dave took that work and the AMD documentation to create the improved "radeon" driver. So now there are two competing projects writing drivers for ATI adapters. Dave noted that code is moving in both directions, though, so it is not a complete duplication of work. (As an aside, from what your editor has heard, most observers expect the radeon driver to win out in the end). The ATI R500 architecture is a logical addition to the earlier (supported) chipsets, so R500 support will come relatively quickly. R600, instead, is a totally new processor, so R600 owners will be "in for a wait" before a working driver is available. Intel has, says Dave, implemented the "perfect solution": it develops free drivers for its own hardware. These drivers are generally well done and well documented. Intel is "doing it right." NVIDIA, of course, is not doing it right. The Nouveau driver is coming along, now, with 5-6 developers working on it. Dave had an RandR implementation in a state of half-completion for some time; he finally decided that he would not be able to push it forward and merged it into the mainline repository. 
Since then, others have run with it and RandR support is moving forward quickly. It was, he says, a classic example of why it is good to get the code out there early, whether or not it is "ready." Performance is starting to get good, to the point that NVIDIA suddenly added some new acceleration improvements to its binary-only driver. Dave is still hoping that NVIDIA might yet release some documents - if it happens by next year, he says, he'll stand in front of the room and dance a jig. Ten-year timeline part 5: Not just SCO Part 4 of this retrospective ended in October, 2002, when LWN adopted its current subscription model. That change brought a certain amount of stability for LWN (too much, we might argue), but, in the wider Linux world, things continued to happen. This installment picks up where the last left off. During this period, the business of Linux was relatively quiet - not that many acquisitions, but not many failures either. But quite a bit was happening around legal issues, copyright enforcement, and more... October 10, 2002: BitKeeper flames return as the non-compete clause in its license comes to light. The sendmail source distribution is trojaned. BitKeeper flames were a more-or-less constant feature in those days, but BitKeeper became an established part of the kernel development process anyway. In the October 10, 2002 edition, your editor wrote: "If Larry McVoy (or his board of directors) wakes up hung over one morning and decides to end free access to BitKeeper, the show is over." That was, unfortunately, an example of your editor's crystal ball working rather better than usual. The trojaning of sendmail was the first of a few such incidents. It looked like a scary trend for a while, but, in fact, the frequency of this kind of attack has dropped quite a bit in the intervening years. October 31, 2002: the first cryptographic code is finally merged into the Linux kernel. The first Reiser4 snapshot is posted. 
December 19, 2002: The Creative Commons project is launched. ElcomSoft (Dmitry Sklyarov's employer) is acquitted of DMCA violation charges. Kernel developers start to complain that the 2.5 feature freeze is thawing. January 16, 2003: The U.S. Supreme Court decides in favor of unlimited copyright term extensions. MandrakeSoft enters bankruptcy. The SCO Group starts making noises about its "Unix IP." January 30, 2003: SCO forms SCOSource and makes rather more dire noises about Linux. By this point, there was a certain amount of discomfort over the direction SCO was taking. But nobody had any clue of just how weird it would actually get. February 6, 2003: The MS-SQL worm infects the net - in about 15 minutes. LWN begins its "porting drivers to 2.6" series. Remember the days of disruptive worms? MS-SQL was one of the scariest, in that it did most of its propagation in just a few minutes. We don't see too many worms like that anymore; contemporary crackers prefer to turn systems into zombies and rent them out. March 13, 2003: The SCO Group files a $1 billion lawsuit against IBM. And so it began, with SCO telling the world that the Linux community could not possibly have achieved what it did unless the work had been stolen by IBM. For the remainder of this retrospective, your editor will attempt to keep the number of SCO-related entries to a minimum. It has been quite an experience to go back and reread all of those McBride/Enderle/Boies/DiDio/Lyons/etc. quotes, and it is tempting to put them all here. But that temptation will be resisted; those who want to relive that bit of bizarre history in more detail can read the LWN pages directly or dig through the considerable resources at Groklaw. SCO is about as scary as Y2K now, but, in 2003, the SCO suit was a frightening event. To many of us it seemed possible that, maybe, one out of thousands of developers might have slipped something improper into the kernel code base.
And, in any case, we were under attack by a company with millions of dollars to burn and a loud-mouthed CEO. The whole thing cost us a lot of time and anxiety - and, for those most directly involved, money. Nonetheless, your editor will reiterate his claim that, overall, the SCO attack has been good for us. We needed to improve our legal defenses; as Linux grew, there could be no doubt that people would attempt to use the legal system to grab a piece of the pie. In SCO we had an arrogant assailant with no substance; we were attacked by a clown. We got the ability to straighten up our processes, arrange better legal help, and prove that our code is clean without the inconvenience of facing a complaint with a bit of legitimacy. The community is now close to immune from copyright-based attack, and is much better poised to deal with similar attackers (patent trolls, for example) who could still do us some serious damage. March 27, 2003: Keith Packard is kicked out of the XFree86 core team. Red Hat Linux 9 - the last Red Hat Linux release - is announced. May 15, 2003: SCO suspends Linux sales and sends a warning letter to 1500 Linux users. May 22, 2003: The GNU and Ghostscript projects part ways. Microsoft buys a $10 million Unix license from SCO. May 29, 2003: Novell claims that it, not SCO, owns Unix. Kernel developers get upset about the fact that there has been no 2.4 kernel release for six months. The 2.5 kernel gets a reworked char device layer, IDE tagged command queueing support and the USB gadget subsystem - seven months into the 2.5 feature freeze. The city of Munich decides to move to Linux. Novell's claim was clearly significant at the time, though it fell below the radar again for several months. In the end, of course, this was the factor which killed SCO. That is convenient, but almost unfortunate too: there would have been value in seeing the substance of SCO's claims demolished in court.
In these days of fast releases, it is interesting to consider that, for the first half of 2003, there were no stable kernel releases at all. June 19, 2003: Linus Torvalds moves to OSDL. The kernel gets a massively reworked ext3 filesystem - eight months into the feature freeze. SCO raises its claim for damages to $3 billion and "terminates" IBM's AIX license. Software patents return to the European Parliament. July 10, 2003: Andrew Morton moves to OSDL. OSDL was often controversial in the Linux community, but nobody doubted that providing a home for developers like Linus and Andrew was a good thing. Until now, neither had held a job where working on Linux was their primary duty. Meanwhile, few suspected how big the software patent battle in Europe would become - or that the anti-patent side would emerge victorious (for now). July 17, 2003: The 2.6.0-test1 kernel is released; it includes the new anticipatory disk I/O scheduler. Slackware celebrates its 10th anniversary. The Mozilla Foundation is created. July 24, 2003: Red Hat gets out of the boxed distribution business. Mozilla starts requesting donations from users. Selling Linux in boxes was how Red Hat got going, so the end of that business was a clear sign that things had changed. The separation of Mozilla and AOL (which had bought Netscape) was a little scary at the time; it seemed that the project could fade away before the Mozilla browser became truly ready and that it was an Internet Explorer future for all of us. Things were a little lean at Mozilla for a while. Now that Mozilla is bringing in tens of millions of dollars every year, the idea that it once sought donations is amusing. August 7, 2003: Novell acquires Ximian. Red Hat files suit against SCO. SCO offers the "intellectual property license for Linux." SELinux is merged for the 2.6.0-test3 kernel. August 21, 2003: SCO shows some "copied code." 
SCO, remember, "encrypted" its slides of "copied" code by switching them to a Greek font - a scheme which the community, somehow, managed to overcome. The code in question was straight from ancient Unix; it had been contributed by SGI, and had already been removed by the time it was revealed. After this, nobody worried that SCO might come up with the "millions of lines" of code that, it said, it could prove it owned. September 25, 2003: The Fedora project launches. Software patents pass in the European Parliament. Sun's Jonathan Schwartz says "We do not believe that Linux plays a role on the server. Period." October 16, 2003: Under pressure from the FSF and others, LinkSys releases source for its WRT54G routers. Fedora started with all kinds of talk about what a community-oriented project it would be. The reality was rather slower in coming, but is beginning to be visible now. Meanwhile, Fedora was a useful (and used) distribution from the outset. The LinkSys settlement was the result of a long battle. It was an important early GPL enforcement action which led to the creation of a number of distributions created for the sole purpose of doing interesting things on LinkSys routers. The ironic result is that LinkSys almost certainly sold quite a few more units than it would have if it had continued to hold on to the code. October 23, 2003: SCO gets $50 million from BayStar. November 6, 2003: Novell acquires SUSE. A fight erupts over the "Linux Gazette" name. December 24, 2003: SCO claims ownership of the Unix ABI. The 2.6.0 kernel is released. Red Hat acquires Sistina. The Mozilla Foundation asks for more donations. 2.6.0 took almost exactly three years after 2.4.0 came out. For the few developers who had observed the 2.4 feature freezes, their code - which could be four years old at this point - was only now making it into an official mainline release. 
It was not yet understood at this point, but, once 2.6.0 came out, the "new kernel development model" started to take shape. Never again would we go years between major stable releases. January 22, 2004: SCO files its "slander of title" suit against Novell. Linus gets dunked. January 29, 2004: UnitedLinux dies a quiet death. SCO sends a letter to the U.S. Congress. Version 2 of the Apache License is adopted. February 5, 2004: XFree86 leader David Dawes changes the project's license. There had been trouble in XFree86 for a long time, but the license change brought it all to a head. This was the move which killed XFree86, led to the creation of the revitalized X.org, and, eventually, brought life back to X development. February 12, 2004: The Grumpy Editor makes his debut. The first Grumpy Editor article was never intended to be the beginning of a series; your editor was simply grumpy that the Galeon browser had gone the route of many early GNOME 2.x applications: less configurability, fewer features, and worse performance. The persona proved popular with readers, though, and the Grumpy Editor has been making irregular appearances on LWN ever since. February 19, 2004: The Netfilter team settles its first GPL enforcement action in Europe. February 26, 2004: X11 development moves to the freedesktop.org project. MandrakeSoft is ordered by a French court to stop using the "Mandrake" name. March 4, 2004: SCO sues AutoZone and DaimlerChrysler. EV1Servers.Net buys an expensive SCO license - a move they certainly still regret. FreeS/WAN shuts down. The attack on Linux users had been long foreshadowed - and feared. Regardless of the validity of its claims, SCO could certainly make life hard for Linux by attacking those who use it. The attacks were so laughable, though, that they had no appreciable effect, even in the short term. March 11, 2004: The Anderer memo surfaces, tying SCO to Microsoft. The tenth anniversary of the green card spam. 
March 18, 2004: Open Source Risk Management launches. MandrakeSoft files its plan to exit bankruptcy. For those who don't remember, OSRM was a scheme to sell insurance against legal attacks to users of free software. But, by this point, nobody was all that worried about SCO, and OSRM never did take off. On the other hand, MandrakeSoft did succeed in getting out of bankruptcy and is still with us. March 25, 2004: BitMover claims that the pace of kernel development has doubled as a result of the adoption of BitKeeper. This installment started with BitKeeper, and will end there. For all the complaints about BitKeeper and its associated "don't piss off Larry" license, few could contest the claim that kernel development was proceeding at a much faster pace. We needed a tool like that. To this day, it remains discouraging that we were not able to develop a distributed revision control system for ourselves until Larry McVoy and BitMover showed the way. If there was ever an itch in need of scratching, this was it. The next installment (which will most likely appear two weeks from now) will start with April, 2004 and come fairly close to the present. Stay tuned. vmsplice(): the making of a local root exploit As this is being written, distributors are working quickly to ship kernel updates fixing the local root vulnerabilities in the vmsplice() system call. Unlike a number of other recent vulnerabilities which have required special situations (such as the presence of specific hardware) to exploit, these vulnerabilities are trivially exploited and the code to do so is circulating on the net. Your editor found himself wondering how such a wide hole could find its way into the core kernel code, so he set himself the task of figuring out just what was going on - a task which took rather longer than he had expected. The splice() system call, remember, is a mechanism for creating data flow plumbing within the kernel. 
It can be used to join two file descriptors; the kernel will then read data from one of those descriptors and write it to the other in the most efficient way possible. So one can write a trivial file copy program which opens the source and destination files, then splices the two together. The vmsplice() variant connects a file descriptor (which must be a pipe) to a region of user memory; it is in this system call that the problems came to be. The first step in understanding this vulnerability is that, in fact, it is three separate bugs. When the word of this problem first came out, it was thought to only affect 2.6.23 and 2.6.24 kernels. Changes to the vmsplice() code had caused the omission of a couple of important permissions checks. In particular, if the application had requested that vmsplice() move the contents of a pipe into a range of memory, the kernel didn't check whether that application had the right to write to that memory. So the exploit could simply write a code snippet of its choice into a pipe, then ask the kernel to copy it into a piece of kernel memory. Think of it as a quick-and-easy rootkit installation mechanism. If the application is, instead, splicing a memory range into a pipe, the kernel must, first, read in one or more iovec structures describing that memory range. The 2.6.23 vmsplice() changes omitted a check on whether the purported iovec structures were in readable memory. This looks more like an information disclosure vulnerability than anything else - though, as we will see, it can be hard to tell sometimes. These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in the 2.6.23.15 and 2.6.24.1 kernel updates, released on February 8. On February 10, Niki Denev pointed out that the kernel appeared to be still vulnerable after the fix. In fact, the vulnerability was the result of a different problem - and it is a much worse one, in that kernels all the way back to 2.6.17 are affected. 
At this point, a large proportion of running Linux systems are vulnerable. This one has been fixed in the 2.6.22.18, 2.6.23.16, and 2.6.24.2 kernels, also released on the 10th. By now, with luck, all of these bugs have been firmly stomped - though we still need to see a lot of distributor updates. The problem, once again, is in the memory-to-pipe implementation. The function get_iovec_page_array() is charged with finding a set of struct page pointers corresponding to the array of iovec structures passed in by the calling application. Those pointers are stored in an array called pages, which has room for PIPE_BUFFERS entries; PIPE_BUFFERS happens to be 16. In order to avoid overflowing this array, get_iovec_page_array() calculates npages, the number of pages it needs to look up, and checks it against the space remaining in the array. In that check, off is the offset into the first page of the memory to be transferred, len is the length passed in by the application, and buffers is the current index into the pages array. Now, if we turn our attention to the exploit code for a moment, we see it setting up a number of memory areas with mmap(); some of that setup is not necessary for the exploit to work, as it turns out. At the end, the code hands vmsplice() an iovec structure built from two values: an address, map_addr, and a length. The map_addr address points to one of the areas created with mmap() which, crucially, is significantly more than PIPE_BUFFERS pages long. And the length is passed through as the largest possible unsigned long value. Now let's go back to fs/splice.c, where the vmsplice() implementation lives. We note that, prior to the fix, the kernel did not check whether the memory area pointed to by the iovec structure was readable by the calling process. Once again, this looks like an information disclosure vulnerability - the process could cause any bit of kernel memory to be written to the pipe, from which it could be read. But the exploit code is, in fact, passing in a valid pointer - it's just the length which is clearly absurd.
Looking back at the code which calculates npages, we see something interesting: since len will be ULONG_MAX when the exploit runs, the addition will cause an integer overflow - with the effect that npages is calculated to be zero. Which, one would think, would cause no pages to be examined at all. Except that there is an unfortunate interaction with another part of the kernel. Once npages has been calculated, the next line of code passes it to get_user_pages(), the core memory management function used to pin a set of user-space pages into memory and locate their struct page pointers. While the npages variable passed as an argument is an unsigned quantity, the prototype for get_user_pages() declares it as a simple int called len. And, to complete the evil, this function processes pages in a do {} while(); loop which decrements len at the bottom of each pass and only then tests whether it has reached zero. So, if get_user_pages() is passed a len argument of zero, it will pass through the mapping loop once, decrement len to a negative number, then continue faulting in pages until it hits an address which lacks a valid mapping. At that point it will stop and return. But, by then, it may have stored far more entries into the pages array than the caller had allocated space for. The practical result in this case is that get_user_pages() faults in (and stores struct page pointers for) the entire region mapped by the exploit code. That region (by design) has more than PIPE_BUFFERS pages - in fact, it has three times that many, so 48 pointers get stored into a 16-pointer array. And this turns the failure to read-verify the source array into a buffer overflow vulnerability within the kernel. Once that is in place, it is a relatively straightforward exercise for any suitably 31337 hacker to cause the kernel to jump into the code of his or her choice. Game over. (Update: as a linux-kernel reader pointed out, the story is a little more complicated still at this point; this is an unusual sort of buffer overflow attack.)
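The chain of events described above can be modeled in a few lines. What follows is a sketch in Python, not kernel code: the 4KB page size, the 64-bit unsigned long, the exact rounding formula, and the helper names pages_needed(), clamp_to_array() and pin_pages() are all assumptions made for illustration. Only PIPE_BUFFERS = 16, the ULONG_MAX length, and the 48-page mapping come from the discussion above.

```python
# Model of the arithmetic discussed above; names and constants are illustrative.
PAGE_SHIFT = 12                    # assumption: 4KB pages
PAGE_SIZE = 1 << PAGE_SHIFT
PIPE_BUFFERS = 16                  # value given in the article
ULONG_MAX = (1 << 64) - 1          # assumption: 64-bit unsigned long


def pages_needed(off, length):
    """Round off + length up to whole pages with wrapping unsigned arithmetic."""
    total = (off + length + PAGE_SIZE - 1) & ULONG_MAX   # wraps like a C unsigned long
    return total >> PAGE_SHIFT


def clamp_to_array(npages, buffers):
    """The overflow guard: never ask for more slots than remain in the array."""
    return min(npages, PIPE_BUFFERS - buffers)


def pin_pages(requested, mapped_pages):
    """Mimic a do/while loop which decrements a signed counter after each page.

    Returns the number of page pointers which would be stored.
    """
    remaining = requested          # a plain signed int, so it can go negative
    stored = 0
    while True:                    # do { ... } while (remaining && still mapped)
        stored += 1                # one page is always processed before the test
        remaining -= 1
        if remaining == 0 or stored >= mapped_pages:
            break
    return stored


if __name__ == "__main__":
    npages = clamp_to_array(pages_needed(0, ULONG_MAX), buffers=0)
    print("npages after the guard:", npages)
    print("pointers stored:", pin_pages(npages, mapped_pages=48))
```

In this model the guard sees a request for zero pages and lets it through, while the loop, which only tests its counter after decrementing it, runs until the mapping ends. A loop which checked for a zero count before doing any work would store nothing.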
The fix which was applied simply checks the address range that the application is trying to splice into the pipe. Since a range of length ULONG_MAX is unlikely to be valid, the vulnerability is closed - as are any potential information disclosure problems. This vulnerability is a clear example of how a seemingly read-only vulnerability can be escalated into something rather more severe. It also shows what can happen when certain types of sloppiness find their way into the code - if get_user_pages() is asked to get zero pages, that's how many it should do. Your editor is working on a patch to clean that up a bit. Meanwhile, everybody should ensure that they are running current kernels with the vulnerability closed. A report from SCALE 2008 Escaping the cold for 70 degree days in Los Angeles might be a reason for some—Colorado-based LWN Editors for example—but it clearly is not the reason that most folks choose to attend Southern California Linux Expo (SCALE). Many of the approximately 1400 attendees already live in the region, so it is the speakers, participants, and the expo floor that bring them in. I attended the sixth annual SCALE (SCALE 6x), just held, February 8-10 and it didn't take me very long to see why it continues to grow and prosper. SCALE is a three day event, with two main conference days on Saturday and Sunday and a set of mini-conferences running in parallel on Friday. Each mini-conference covers a focused topic of interest to the community, with this year's topics examining Women in Open Source (WIOS), Open Source Software in Education (OSSIE), and Demonstrating Open Source Healthcare Solutions (DOHCS). It was a full day as each had eight or more hour-long sessions. Allison Randal kicked off the WIOS track with a presentation aimed at encouraging more women to give presentations at conferences. Her talk, "The Art of Conference Presentations", was not particularly gender specific, of course. 
It covered the process of proposing, creating and giving talks at conferences. Randal's advice was cogent, from avoiding "cute" titles to establishing credibility via your biography without feeling like you are bragging. Her most important point was to not wait around until you are the perfect speaker, but to go out and start speaking; your voice and style will come with practice. Over in the OSSIE track, Dan Anderson related his experiences teaching computer science concepts to middle and high school students over the last fourteen years. His approach is to use computing as a bridge between math, science, and technology. He discussed the process of creating, or trying to create, a stable curriculum in the face of rapid technological change. Because the hardware, operating systems, and languages all change quickly, his courses need to focus on concepts that are not specific to any of those. Over the years he has taught, the language used in the advanced placement course - dictated by the state CollegeBoard company - has gone from Pascal, through C++, to Java, with some rumblings being heard about moving to Python. As he points out, "much of what a High School student learns about technology will be outdated by the time they graduate from college." He uses How to Design Programs as the core text for his courses. It uses a graphical programming environment called DrScheme, based on Scheme, that allows different subsets of the language to be used based on the skill level of the student. Anderson has integrated various peripherals, like cameras and audio equipment, into the environment so that students can interact with the real world in interesting ways. His students work on projects like voice authentication and computer vision; this year's project is to recognize tic-tac-toe as drawn on a white board. Other topics from OSSIE included a tutorial introduction to the Moodle content management system (CMS) for online learning.
Much like other CMS projects, Moodle allows the creation of websites with various kinds of content - audio, video, images, and text - but organized as a course. It provides a framework and philosophy to guide the development of online classes. Students access the content via the web, completing tasks, taking quizzes, and participating in forums and chats with other students. Charles Edge (no relation) spoke about the challenges of implementing directory services for educational institutions. One problem is that the term "directory services" covers a large amount of ground, from tracking users (both employees and students) to allowing single sign-on (SSO) into multiple machines and services throughout the school. The biggest challenge can be handling the sheer numbers of people to be tracked. Open source solutions do exist: OpenLDAP for storing the information, Kerberos for single sign-on, and Simple Authentication and Security Layer (SASL) for extending the reach of the SSO into other services. But they are complex to configure and administer. For scalability and robustness in large installations, Edge suggests Microsoft's Active Directory, which was not a particularly popular opinion with the open source oriented audience. The first day closed with a WIOS panel discussion, where six of the women presenting or showing at the conference discussed the issues facing women in open source. The discussion was informal and wide-ranging with a great deal of audience participation. Audience members asked questions as well as offered opinions and theories on why the participation of women is low and what can be done to make things better. No real conclusions were reached, as is usual for discussions of this topic; it is one of the more puzzling attributes of the free/open source community. The animated and amusing Ubuntu community manager Jono Bacon gave a rousing keynote to start things off on Saturday.
He tried to ensure that everyone was awake by leading a greeting in multiple languages (including Klingon). His main point was to describe the responsibilities of the various "factions" that jockey to determine the future of open source software—companies, distributions, and communities—trying to show that each has an important role. In fact, it is up to all constituents to ensure that the greater Linux ecosystem thrives and that each group works well with the others. It was all pretty much "motherhood and apple pie" stuff, but well described and illustrated—all with Chuck Norris to keep track of the score. Bacon did provide the quote of the show when he said that free software was "started by a guy with a beard who was pissed off at a printer." Saturday was also the first day that the expo floor was open. Some 80 booths were there, representing companies large and small as well as lots of free software projects. One of the more interesting booths contained a working simulator of a 747 cockpit. All of the instruments were driven from a realtime Linux box and the FlightGear flight simulator was used to generate the cockpit window view. The two machines communicated over the network and various laptops were able to view the flight from other perspectives by getting updates from the simulator. It was rather impressive. The linuxastronomy.org project was also on hand with their telescope prototype. The telescope will be controlled via a Linux machine allowing it to be pointed at locations as specified by users. A Linux desktop application will send locations to the telescope over the internet, allowing it to be remotely controlled so that it can be installed in a mountaintop or other location with (relatively) little light pollution and good viewing conditions. In addition, the project was demonstrating many of the free astronomy programs available for Linux. A mobile audio studio product, Indamixx, did not have a booth, but could be seen all over the show. 
The company loaned two of the UMPC-based devices to the conference which were used to do podcasts of interviews with speakers and attendees. The device runs Linux with Audacity and ardour along with other free software. The company has tweaked things to make it all work well and be easy to use on the device. It looks to be quite capable as well as easily portable. In another interesting talk, David Maxwell of Coverity gave an update on their project to scan free software for security holes. The US Department of Homeland Security gave Coverity a grant to work with free software projects to use the Coverity Prevent static code analysis tool (once known as the "Stanford Checker") on the code. The scan project has found over 7,000 defects in around a hundred free software projects since its inception. Maxwell is the Open Source Strategist for Coverity; he is looking for more projects to participate. He is encouraging any free/open source software project to get in touch with him to get signed up for the program. Projects that join get their code scanned with a report being generated on the Coverity website for project members to view. The projects can then fix any of the issues that are actually bugs, mark others as "not a bug", and resubmit the code. The Coverity system will check the latest code out of their source code repository and check it again. Once all issues that the tool finds are handled, the project can move up to a higher "rung on the scan ladder" which will allow them to be scanned by more recent versions of the Coverity tool. Bdale Garbee had perhaps the geekiest talk of the show on Saturday afternoon with "Open Avionics for Model Rockets". Garbee gave an overview of the hobby, which has gone far beyond the Estes rockets that many of us dabbled with in our youth. These rockets can go to 10,000 feet and above; just how high they go is one of the questions that led folks to start outfitting them with instruments. 
Deploying the recovery system—typically a parachute—at apogee is very desirable and a barometric sensor with a little bit of logic tied to the ejection charge can do just that. Unfortunately, all of the commercially available options for these systems are completely closed; even the protocol to talk to the device is not released by the manufacturers. Garbee decided to once again combine one of his hobbies with open source to design and build an open device. Both the hardware and software will be released under free licenses (GPL and Open Hardware License); he had version 0.1 of the hardware (missing the accelerometer due to a problem in the board layout) with him at the show. The AltusMetrum system also has an onboard barometric sensor and will be able to support things like GPS devices and radio transmitters—so that lost rockets do not stay lost. Garbee expects to flight test the board and design version 0.2 of the hardware over the coming months. Sunday's keynote, by Stormy Peters of OpenLogic was entitled "Would you do it again for free?". Peters looked at whether external rewards, usually money, affect the motivation of open source developers; in particular, if the pay stops, will the project work stop as well? She cited four separate "studies" (including two that weren't intended as studies) that seemed to show that adding a reward, or penalty, can sometimes have a counter-intuitive effect (see an entry in her weblog for more information). Peters came to no firm conclusions about what the long-term effects of paying open source developers would be, but there are some mitigating factors that seem to provide hope that developers would continue if the paychecks stopped. When a payment or reward is in line with expectations for doing a particular task, it is much less demotivating. Also, if the payment is for working on the project, not tied to a specific goal or milestone, it is also less of a problem. 
Both of those are typically the case with folks who are paid—40% of open source developers are, according to Peters—for their work in the community. After a last wander through the show floor, I was able to catch a few minutes of the talk given by Ken Gilmer and Angel Roman of Bug Labs describing their modular embedded Linux gadget building system. The system consists of a core module along with various plug-in devices: camera, motion detector, GPS, etc. that can be combined into a single Java programmable device. Many additional peripheral modules are planned. The software that runs on the device is free and Bug Labs has a community site to share application code; they are clearly hoping that they can foster a community of users and developers. As can be seen, SCALE offers a wide variety of technical content in a well organized and fun conference. It has grown beyond the capacity of the Airport Westin where it has been held for the last few years; expect a new, bigger venue somewhere in LA next year. Over the last few years, SCALE has drawn from more areas of the southwest US in moving from a small, local conference to a regional one. If things continue, in another few years it may grow into a national conference; one can only hope that if that happens, it will continue to be as well run and interesting as it is today. Before the 2.6.25 merge window closed... The 2.6.25 merge window closed on February 10, after the merging of an eye-opening 9450 non-merge changesets. Most of the changes merged for 2.6.25 were covered in the first and second "what got merged" articles. This, the third in the series, covers the final 1900 patches merged before the window closed. 
User-visible changes include: There are new drivers for SC2681/SC2691-based serial ports, Dallas DS1511 timekeeping chips, AT91sam9 realtime clock devices, Compaq ASIC3 multi-function chips, Cell Broadband Engine memory controllers, Marvell MV64x60 memory controllers, PA Semi PWRficient NAND flash interfaces, Marvell Orion NAND flash controllers, Freescale eLBC NAND flash controllers, Sharp Zaurus SL-6000x keyboards, Fujitsu Lifebook Application Panel buttons, IPWireless 3G UMTS PCMCIA cards, intelligent storage device enclosures, Winbond W83L786NG and W83L786NR sensor chips, Texas Instruments ADS7828 12-bit 8-channel ADC devices, and Sony MemoryStick cards. Also added are updated video drivers for Radeon R500 chipsets (2D acceleration is now supported) and Intel i915 chipsets (suspend and resume now work properly). Several more obsolete OSS audio drivers have been removed. The old mxser driver has also been removed in favor of mxser_new, now called simply "mxser." File descriptors returned by inotify_init() now support signal-based (using SIGIO) I/O. There is also a new notification event (IN_ATTRIB) sent when the link count of a watched file changes. The mac80211 (formerly Devicescape) wireless subsystem is no longer marked "experimental." The memory use controller for containers has been merged. This controller was described in this LWN article, but the patch has evolved somewhat since then and the details have changed. Some documentation can be found in Documentation/controllers/memory.txt. ACPI thermal regulation support has been added; see Documentation/thermal/sysfs-api.txt for details on how it works. The ACPI code also now supports the Windows Management Instrumentation interface, and uses that support to make recent Acer laptops work. ACPI now provides support for users who want to override their system's Differentiated System Description Table (DSDT). The XFS filesystem now supports the fallocate() system call. 
ATA-over-Ethernet (AoE) now properly supports devices with multiple network interfaces (and, thus, multiple paths to the host). Support for the MN10300 architecture (little-endian mode only) has been added. Support for a.out binaries has been removed from the ELF loader. Pure a.out systems will still work, though. Disk I/O statistics (as seen in /proc/diskstats and under /sys/block) have been augmented with more information about request merging and I/O wait time. The S390 architecture now implements dynamic page tables - processes will use 2-, 3-, or 4-level page tables depending on the size of their address space. The ext4 "in development" flag has been added; mounting an ext4 filesystem will now require an explicit "I know this might explode" option. Changes visible to kernel developers include: Many nopage() methods have been replaced by the newer fault() API; the near-term plan is to remove nopage() altogether. See this article for a description of the new way of "page not present" handling. This cycle has also seen a bit of a reinvigoration of the long-stalled project to eliminate the big kernel lock. A number of BKL-removal patches have been merged, with more certainly to come. A generic resource counter mechanism was merged as part of the memory controller patch set; see <linux/res_counter.h> for the details. reserve_bootmem() has a new flags parameter. Most callers will set it to BOOTMEM_DEFAULT; the kdump code, though, uses BOOTMEM_EXCLUSIVE to ensure that it is the only one to touch the memory. Most architectures now have support for cmpxchg64() and cmpxchg_local(). There is a new set of string conversion functions. These functions convert the given strings to various forms of long values, but they will return an error status if the given string value, as a whole, does not represent a proper integer value. These functions are now used in the parsing of kernel parameters.
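The difference between converting a leading run of digits and insisting that the string, as a whole, be a proper integer can be sketched as follows. This is an illustration in Python rather than the kernel's C code; the names parse_lenient() and parse_strict() are hypothetical, the example handles unsigned decimal values only, and the (status, value) return convention is an assumption chosen to mirror the "error status" behavior described above.

```python
import errno

DIGITS = "0123456789"


def parse_lenient(text):
    """Convert the leading digits and silently ignore whatever follows them."""
    digits = ""
    for ch in text:
        if ch not in DIGITS:
            break                  # stop at the first non-digit, without complaint
        digits += ch
    return int(digits) if digits else 0


def parse_strict(text):
    """Return (0, value) only when the entire string is a decimal integer.

    Anything else, including an empty string, yields (-EINVAL, None).
    """
    if text and all(ch in DIGITS for ch in text):
        return 0, int(text)
    return -errno.EINVAL, None


if __name__ == "__main__":
    for candidate in ("64", "64k", "sixty-four"):
        print(repr(candidate), parse_lenient(candidate), parse_strict(candidate))
```

With the lenient form, a mistyped parameter value such as "64k" quietly becomes 64; the strict form reports the error instead, which is the safer behavior when parsing something like a boot-time parameter.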
At this point, the merging of features is done (though there has been a bit of pushing for one or two things to slip in) and the stabilization period begins. With luck, that process will go a little more quickly than it did with 2.6.24. The Chandler Project moves forward The Chandler Project is a small-group collaboration application that is being produced by the non-profit Open Source Applications Foundation (OSAF). OSAF was founded by Mitchell Kapor. The foundation's History document reveals some background information. The project has been under development for a number of years. Version 0.1 of Chandler was announced in April, 2003. From the Chandler FAQ entry on "What is Chandler?": Chandler Project is an open source, standards-based personal information manager (PIM) built around small group collaboration and a core set of information management workflows modelled on Inbox usage patterns and David Allen's GTD (Getting Things Done) methodology. See Vision for a more in-depth answer to this question. Chandler provides an all-inclusive view of personal information; it can operate on notes, email, tasks, appointments, events, contacts, documents and additional personal resources. The Chandler Desktop application provides a single user interface with the ability to enter, view, search, group and share all of the supported types of information. The software is cross-platform; it currently runs on the Linux, Windows and Macintosh platforms. The Chandler software is being distributed under version 2.0 of the Apache Software License. The Chandler features document explains how the project is arranged: Chandler consists of a cross-platform (Windows, Mac OS X and Linux) Chandler Desktop application and Chandler Hub, a sharing service and web application. Chandler is open source and standards-based. The FeatureList document covers the Chandler capabilities in more detail; some screenshots are included.
OSAF provides free access to the Chandler Hub; information there is available to any user with an account and a web browser. The Chandler Server provides a central store for locally managed information. There are some demo movies that show Chandler in action. Some of the basic Chandler concepts and terms are explained: Item: Chandler has four kinds of items: Note, Message, Task and Event. Chandler items can be of multiple kinds, e.g. Scheduled Tasks and Invitations. Collection: Chandler's primary mechanism for grouping items. Collections can contain items of any kind. Application Area: Chandler has four application areas: Mail, Tasks, Calendar and an all-inclusive All area. Chandler's application areas are a way to filter down your collections by item kind. Triage Status: An attribute on every item that is Chandler's principle mechanism for helping you manage what you're working on. The three triage statuses are NOW, LATER and DONE. Tickler Alarm: A custom alarm you can set on any item to automatically triage that item to NOW at a time you specify. Two new releases were recently announced: Chandler Desktop 0.7.4 and Chandler Server 0.12.0. The new Chandler Desktop change summary says: "The 0.7.4 release adds a Tip of the day feature and a German translation contributed by a user. The triage status behavior was improved to be more useful. There have been dozens of bug fixes across the application, as well as fixes to the build and testing infrastructures." The new Chandler Server change summary says: "This release supports a standalone WAR form of Cosmo ready to drop in to an existing Tomcat installation. A security issue allowing unauthorized access when a collection had been shared was fixed. A number of smaller bugs have also been fixed for Unicode usernames, error logging, and the calendar web UI." Chandler is in an active phase of development. The software has evolved from an interesting concept to a functioning system in recent years.
Organizations and individuals who have a need for some advanced management and communications capabilities should be able to find some benefits from using Chandler. Eee PC security or lack thereof The Eee PC has garnered a lot of press for its small form factor, low weight, and solid-state disk, but it has also made a poor showing with security researchers. RISE Security released a report on the security of the Eee last week, showing that it can be subverted ("rooted") right out of the box from ASUS. Unfortunately, it is even worse than that as, even after updating an Eee using the standard mechanism, the hole is not patched. The vulnerability identified by RISE is in the Samba daemon (smbd), version 3.0.24, which is installed and runs on stock Eee PCs. The vulnerability, CVE-2007-2446 was identified and patched last May, so the Eee is shipping with a version of Samba known to be vulnerable to an arbitrary code execution flaw for nine months or so. In itself, that is not completely surprising. When hardware vendors install a distribution—or commercial OS like Windows—they tend to install the latest released version, which is likely to be out of date with respect to security issues. A vendor installing Fedora 8 or Debian etch today will be behind on countless security updates. But, unlike the Samba problem discovered on the Eee, updates do exist in the standard places. If the new user updates their system immediately, there is a fairly small window of vulnerability. Unfortunately for Eee owners, the modified Xandros distribution that comes with it does not yet have an update for Samba. This leaves all Eee PCs vulnerable to being rooted by anyone on the same network. Since the Eee is meant as a mobile device, it likely spends a lot of its time connected to various public networks, especially wireless networks. 
The Eee makes an interesting target for attackers because it very well might have authentication information for banks or brokerages as well as other private or confidential files. Some have seriously downplayed the threat, but it is clear they don't understand it: The root attack performed was relatively easy to do, if you like command lines. Maybe Asus or Xandros could work on a patch for this. It almost makes one wonder how many other exploits are lying under the surface just waiting to be found. But, it's not like this actually puts you in danger, just how many hackers are going to be looking for the Asus EeePC or even Xandros based system online and attack them? Probably not many. Sales of the Eee last year were around 300,000 units, large enough to be an attractive target for the malicious. Because there is not an update to close the hole, Eee users have to rely on other means to protect themselves. This eeeuser.com comment thread provides some of the better advice for dealing with the problem. Removing the Samba package seems to be the simplest, if fairly heavy-handed, way to avoid the hole - but many folks need a working Samba. There is no way to disable Samba from the Eee GUI, which is the way most owners plan to interact with the machine. This whole incident makes it seem like ASUS (and perhaps Xandros) are not terribly interested in the security of the machines that they sell. There is a larger issue here. When the normal means of getting security patches comes from the same medium that is also the biggest security threat, there will always be windows of vulnerability. Even if hardware vendors diligently update the distribution they install, there is still some shelf-life and shipping time during which security updates can be released. Various studies have shown that there may not be enough time to download patches before an unpatched system succumbs to an attack. It is a difficult problem to solve completely.
Any solution must be very straightforward and consistent so that unsophisticated users can be trained to do it as a matter of course. News about security issues needs to get more widespread attention as well, so that those same users know when the procedure needs to be followed. Firewalls and other network protections only go so far if the machine needs to reach out to the internet to pick up its updates. If distributions provided some kind of blob (tar file, .deb, .rpm, etc.) that contained all of the security updates since the release, users could grab that from a different (presumably patched or not vulnerable) machine, put it on a USB stick or some other removable media and get it to the new machine. A utility provided by the distribution could then process that blob to apply all the relevant patches—all while the vulnerable machine stayed off the net. As the world domination plan continues, threats against Linux will become more commonplace; we need to try and ensure that users, especially the unsophisticated ones, can be secure in their choice of Linux. linux-next and patch management process The kernel development process operates at a furious pace, merging on the order of 10,000 changesets over the course of a 2-3 month release cycle. There have been many changes over the last few years which have helped to make this level of patch flow possible, and the process has been optimized significantly. An ongoing discussion on the kernel mailing list has made it clear, though, that a truly optimal solution has not yet been found. It started with the announcement of the linux-next tree. This tree, to be maintained by Stephen Rothwell, is intended to be a gathering point for the patches which are planned to be merged in the next development cycle. So, since we are currently in the 2.6.25 cycle, linux-next will accumulate patches for 2.6.26. The idea is to solve the patch integration issues there and reduce the demands on Andrew Morton's time. 
The question which was immediately raised was this: how do we deal with big API changes which require changes in multiple subsystems? These changes are already problematic, often requiring maintainers to rework their trees in the middle of the merge window. Trying to integrate such changes earlier, in a separate tree, could bring a new set of problems. There will be a lot of conflicts between patches done before and after the API change, and somebody is going to have to put the pieces back together again. Andrew does some of that now, but the problem is big enough that not even Andrew can solve it all the time. The bidirectional SCSI patches merged for 2.6.25 were held up as an example; that change required coordinated SCSI and block layer patches, and it never was possible to get the whole thing working in -mm. Arjan van de Ven asserted that the only way to make large API changes work is to merge them first, at the beginning of the merge window. The merged patch would fix all in-tree users of the changed API, as is the usual rule. Maintainers of all other trees could then merge with the updated mainline, fixing any new code which might be affected by the API change. This is, essentially, the approach which was taken for the big device model changes in 2.6.25; they hit the mainline at the beginning of the merge window, then everybody else got to adapt to the new way of doing things. Greg Kroah-Hartman worries that this approach is not sufficient, especially when live trees are being merged. If an API change in one tree forces a change to a separate tree, the coordination issues just get hard. Keeping the secondary changes in the primary tree risks conflicts with patches in the proper subsystem tree. Patches which reach across trees are also, increasingly, being discouraged as making life harder for everybody. But the fixup patch will not apply to its nominal subsystem tree as long as the API change itself is not there. 
In the -mm tree, this sort of problem is glued together by a series of fixup patches maintained by Andrew; Greg says that the linux-next tree would need something similar. David Miller's suggestion was to resolve this sort of conflict through frequent rebasing of the -next tree. Rebasing is an operation (supported by git and other code management tools) which takes a set of patches against one tree and does what's required to make them apply to a different version of the tree. It can be quite useful for maintaining patches against a moving target - which kernel trees tend to be. David talked about how he rebases his (networking subsystem) trees frequently as a way of eliminating conflicts with the mainline and, in the process, cleaning some cruft out of the development history. It turns out, though, that this frequent rebasing is not popular with the developers who are downstream of David. Rebasing the tree forces all downstream contributors to do the same thing, and to deal with any merge conflicts that result. It makes it much harder to prepare trees which can be pulled upstream and creates extra work. This was where Linus jumped into the conversation and expressed his dislike of rebasing. He echoed the complaints from downstream developers that a constantly-rebased tree is hard to prepare patches against. It also confuses the development history, making changes to other developers' patches in silent ways. After somebody's patch set has been rebased, it is no longer the patches that were sent. So, says Linus: So there's a real reason why we strive to *not* rewrite history. Rewriting history silently turns tested code into totally untested code, with absolutely no indication left to say that it now is untested. It is about here that Andrew Morton commented that git does not appear to be matching entirely well with the way that kernel developers work. Some of the solution may be found in tools more oriented toward the management of patch queues - such as quilt. 
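Linus's objection that a rebased patch set "is no longer the patches that were sent" can be illustrated with a toy model. The sketch below is not git's real object format; it only assumes, as git does, that a commit's identity depends on its parent's identity as well as on its own content. The function names, patch titles, and the 12-character truncated hash are all made up for the illustration.

```python
import hashlib


def commit_id(parent_id: str, patch: str) -> str:
    """Toy commit ID: a hash over the parent's ID and the patch text."""
    return hashlib.sha1(f"{parent_id}\n{patch}".encode()).hexdigest()[:12]


def apply_series(base_id: str, patches: list[str]) -> list[str]:
    """Stack a patch series on top of base_id; return the resulting IDs."""
    ids = []
    parent = base_id
    for patch in patches:
        parent = commit_id(parent, patch)
        ids.append(parent)
    return ids


# Hypothetical history: a subsystem tree based on an older mainline,
# then the same patches rebased after an API change lands upstream.
old_base = commit_id("", "mainline at start of merge window")
new_base = commit_id(old_base, "big API change lands in mainline")
series = ["net: add feature A", "net: fix feature A"]

before = apply_series(old_base, series)
after = apply_series(new_base, series)

for patch, old, new in zip(series, before, after):
    print(f"{patch}: {old} -> {new}")
```

The patch text is identical in both runs, yet every ID changes, so anything a downstream developer built on the old IDs has to be moved as well. What the model cannot show is the other half of the complaint: the rebased code now sits on a base against which it was never tested.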
There may be a renewed push to get more quilt-like functionality built into git (along the lines of the stacked git project) in the near future. Linus is also not entirely pleased with how the integration of patches only happens in the mainline: I'm also a bit unhappy about the fact you think all merging has to go through my tree and has to be visible during the two-week merge period. Quite frankly, I think that you guys could - and should - just try to sort API changes out more actively against each other, and if you can't, then that's a problem too. His suggestion is that a separate git tree should be created to contain a large API change - and nothing else. Affected subsystem maintainers could then merge that tree and develop against the result. In the end, all of the pieces should merge nicely in the mainline. This approach raises a number of interesting issues. The API-change tree has to be agreed upon by everybody, and it must be quite stable - lots of changes at that level will create downstream trouble. There must also be a high degree of confidence that this API-change tree will, in fact, get merged into the mainline; should Linus balk, everybody else's trees will no longer be applicable to the mainline. Replacing the current "tree of trees" patch flow with something messier could create a number of coordination issues. And there are fears that a mainline tree built from this process would fail to build in many of its intermediate states, which would make tools like "git bisect" much harder to use. Even so, it could be part of the long-term solution. Linus also took the opportunity to complain about large-scale API changes in general: Really. I do agree that we need to fix up bad designs, but I disagree violently with the notion that this should be seen as some ongoing thing. 
The API churn should absolutely *not* be seen as a constant pain, and if it is (and it clearly is) then I think the people involved should start off not by asking "how can we synchronize", but looking a bit deeper and saying "what are we doing wrong?" He also stated that the costs of big API changes are high enough that we should, more often, stay with older interfaces, even if they are not as good as they could be. Others disagreed, claiming that Linux must continue to evolve if it is to stay alive and relevant. The rate of change seems unlikely to fall in the near future. There may be some changes to how big changes are done, though. As suggested by Ted Ts'o, more changes could be done by creating entirely new interfaces rather than breaking old ones. With Ted's scheme, the old interface would be marked "deprecated" at the beginning of the merge window. Developers would then have the entire development cycle to adjust to the change, and the deprecated interface would be removed before the final release. There is resistance to this approach, based on the observation that getting rid of deprecated interfaces tends to be harder than one would expect. But, still, it is a relatively painless way of making changes. The current transition (in the memory management area) from the nopage() VMA operation to fault() is an example of how it can work. Nick Piggin has been slowly changing in-tree users with the eventual goal of removing nopage() altogether. For now, though, both interfaces coexist in the tree and nothing has been broken. Like the kernel itself, its development process is undergoing constant change and (hopefully) improvement. As the development community and the rate of change continue to grow, the process will have to adjust accordingly. What changes come out of this discussion remains to be seen. But it's worth noting that Andrew Morton fears that the biggest problem - regressions and bugs - will be relatively unaffected.
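The deprecate-then-remove scheme attributed to Ted Ts'o above can be sketched outside the kernel. The Python below is only an analogy: the two functions borrow the names nopage() and fault() from the example in the text, but their signatures and behavior are invented here, and routing the old interface through the new one is an assumption about how such a transition might be arranged, not a description of the kernel code.

```python
import warnings


def fault(address: int) -> str:
    """The new interface (hypothetical stand-in)."""
    return f"page for {address:#x}"


def nopage(address: int) -> str:
    """The old interface: still works, but tells its callers to move on."""
    warnings.warn(
        "nopage() is deprecated; use fault() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return fault(address)


if __name__ == "__main__":
    warnings.simplefilter("always")
    print(nopage(0x1000))  # old callers keep working, with a warning
    print(fault(0x1000))   # converted callers are silent
```

Both interfaces coexist for a while, unconverted callers are nagged but not broken, and the old entry point can be deleted once the warnings stop appearing. The resistance mentioned in the text is to that last step, which tends to be put off.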
Autodownloading considered harmful A Fedora user recently asked: might it be possible for the project to put together a package which would automatically download and install the (proprietary) Google Earth application? Debian has googleearth-package, which makes an installable package from the downloaded application, but there is no such convenience for Fedora users. The quick answer appeared to be "no" - Fedora is for free software only, and packaging tools for proprietary programs do not fit the bill. It did not take long for others to point out the "autodownloader" facility shipped with the Fedora games spin now. This tool is needed to make certain games work where the game is free software, but it needs proprietary data to provide the full experience. Games like Quake3 and Rise of the Triad fit this description. With autodownloader, these games can be shipped with Fedora and the proprietary data will be fetched automatically on the destination machine. This scenario does not seem all that different than downloading a proprietary application like Google Earth and installing it. The difference, as seen by the Fedora camp, is that autodownloader can only obtain data, not code. The fact that much of that data may, in fact, be code which is fed to a virtual machine within the game is sort of glossed over. In the discussion, it was also suggested that games requiring autodownloader should come with enough free data to be minimally usable, though that does not seem to have been enforced with great vigor. Alan Cox's suggestion that the real test should be "is it possible to create free data for this game?" makes some sense, but that is not the operative rule now. Such a discussion cannot go on long, though, before somebody brings up the real sore point: CodecBuddy. 
This time, it was Hans de Goede who raised the issue: Not only does it automatically download some gratis closed source code, it even offers the user to buy closed source code, effectively free advertising for commercial closed source! According to Hans, there is no point in discussing autodownloader as long as CodecBuddy remains in the repository. Outgoing Fedora leader Max Spevack is trying to organize a discussion aimed at reaching some sort of clarity on these issues. Christopher Blizzard had an interesting idea: hand more of the decisions about (and responsibility for) the shipping of problematic code to the upstream projects. The Miro project was held up as an example. Christopher's proposal has some echoes of the disintermediation of distributions discussion which was covered here last week. When it comes to patent-encumbered codecs, distributions like Fedora would happily accept disintermediation. In the absence of a real solution to the patent problem, some sort of disintermediation may be the only workable answer for distributions like Fedora. They may not be willing to ship the code, but others are. So it's mostly just a matter of making the connection between those repositories and the users as straightforward and painless as possible. Spending time with search engines to find useful programs or data may build character, but it does not help create a useful or pleasurable Linux user experience. Ten-year timeline part 6: almost to the present Part 5 of this increasingly long series stopped in March, 2004, when BitMover loudly proclaimed that the use of BitKeeper had doubled the pace of kernel development. This installment picks up from there, looking at a year when BitKeeper remained in the news, the SCO case was in progress, software patents became more threatening, and more. April 8, 2004: The first X.org release. SELinux shows up in a Fedora Core 2 test release. Red Hat v. SCO is put on indefinite hold (where it remains to this day). 
Anti-software-patent demonstrations are held in Europe. This week featured some important news. The launch of X.org signaled the resurrection of Linux desktop work and the beginning of a much more interesting and promising era. Meanwhile, Fedora took the lead in pushing SELinux-based mandatory access control technology into a general-purpose system. That work is still very much in progress nearly four years later, but, like it or not, SELinux has become an important part of our defensive arsenal. April 15, 2004: The 2.6.6 kernel gains POSIX message queues, filesystem speedups, internal API changes, laptop mode, 4K stacks, auditing, the CFQ I/O scheduler, and more. Sun and Microsoft make a $2 billion deal. Lindows becomes Linspire. April 22, 2004: Linspire files to go public. BayStar tells SCO it wants its money back. April 29, 2004: Gentoo founder Daniel Robbins leaves the project. Something else which was going on during this time was a rising level of discontent over the management of the Fedora project, which was not turning out to be the open community that many had hoped for. Pause for a moment and revisit this classic dialog posted by Konstantin Ryabitsev, which so clearly documented how the situation was seen by the community at that time. Fedora has come a long way since then. May 20, 2004: The European Council approves the software patent directive, sending it back to the Parliament for final passage. Remember: the directive approved by the Council was the original version which legitimized software patents, not the version amended by the Parliament which did not. Thus started the final (so far) round in the fight against European software patents - a round which we eventually won. May 27, 2004: The kernel adopts the Signed-off-by: convention. The 2.6.7 kernel gains scheduling domains, the object-based reverse mapping VM, filtered wakeups, and more. 
The thing to remember here is that 2.6 was alleged to be a stable kernel series, and everybody was still waiting for 2.7 to start. Linus defended the massive VM changes with the claim that they were, in fact, an "implementation detail." The realization that the kernel development process had, in fact, already changed did not come through until... July 22, 2004: The "new" kernel development process is adopted. This kernel summit decision - which, among other things, said that there would be no 2.7 kernel - surprised almost everybody. Certainly there have been some issues since then, but nobody really wants to go back to the old, pre-2.6 days. August 5, 2004: Open Source Risk Management funds a study showing that the kernel infringes on 283 patents, offers patent suit insurance. SCO Forum is held, featuring a keynote by Rob Enderle; the rest of the world looks on incredulously. The Munich Linux deployment is put on hold as a result of software patent fears. August 19, 2004: Lindows gives up on its IPO. The 2.6.8.1 kernel is released. There were interesting cross-currents happening at this time. On the one hand, companies like Open Source Risk Management were trying to use SCO as a way to scare companies (and individual developers) into buying its insurance offerings. On the other, there was a hallucinogenic aspect to the SCO Forum discussions that escaped nobody; SCO's time of being taken seriously by the wider world was already done. It's worth noting that OSRM still exists, but its insurance offering now is for companies worried about GPL-infringement suits. Meanwhile, 2.6.8.1 was the first three-dot kernel release ever; it was rushed out in response to an unpleasant, last-minute bug in 2.6.8. August 26, 2004: IBM brings GPL-infringement charges against SCO. LWN fails to reproduce the posted reiser4 filesystem benchmarks, gets in trouble with Namesys. September 16, 2004: Sun announces plans to open-source Solaris. 
OSDL and the Free Standards Group announce a plan for cooperation on the Linux Standard Base. OSDL and the FSG were, at this point, separate groups which, at times, almost seemed to be in competition with each other. Those days, of course, are no more: the two have since merged and become the Linux Foundation. September 23, 2004: the Ubuntu distribution announces its existence. Who would have thought that one could create a major new distribution in 2004? One might well wonder whether the situation is any less open now. October 7, 2004: the bnetd developers lose their DMCA case. Concerns about kernel quality are expressed. Microsoft's FAT patent is overturned. October 14, 2004: Novell says it will use its patents "as appropriate" to defend free software projects against patent attacks. Jeff Merkey offers $50,000 for the right to take the kernel proprietary. The realtime preemption patch set gets started. October 21, 2004: the first Ubuntu release (4.10) comes out. Busybox 1.0 is released at last. Mozilla begins fund raising to advertise Firefox in the New York Times. November 11, 2004: Firefox 1.0 is released. Novell gets $500 million in anti-trust cash from Microsoft. The Firefox 1.0 release was, in a very real sense, the much-delayed culmination of the process which began back in 1998, when Netscape announced that it would be releasing its code. Firefox was almost seven years in the making, but, sometimes, late really is better than never. Even those of us who use a different browser should be thankful for the effect Firefox has had toward the creation of a standard-compliant web and a competitive environment for web browsers. November 18, 2004: the Linux Core Consortium is formed by Conectiva, MandrakeSoft, Progeny, and Turbolinux. December 2, 2004: MandrakeSoft turns a profit. 
Whether it's called United Linux, the Linux Core Consortium, or Manbo-Labs, this is an idea which returns on occasion: pool effort on the creation of a base distribution so that each player can concentrate their differentiation efforts on the higher levels. It often seems not to work, though. It is hard to compete with more community-based distributions through the establishment of a base platform by corporate fiat. It seems that the true "base" distributions have names like Debian or Fedora. January 13, 2005: Debian runs afoul of the Mozilla trademark policy. The European Parliament attempts to restart the software patent discussion from the beginning. January 27, 2005: Sun starts releasing Solaris code under the CDDL. February 3, 2005: The Software Freedom Law Center is founded. Eben Moglen starts talking about GPLv3. Russ Nelson becomes the president of the Open Source Initiative - briefly. February 10, 2005: IBM's requests for summary judgment in the SCO case are dismissed - temporarily - by Judge Kimball. BitKeeper flame wars return, this time about the locking-up of history metadata and license-based prohibitions on its extraction. The locking-up of metadata within BitKeeper was a sore point even for developers who had accepted BitKeeper in general. Larry McVoy was unsympathetic, though, stating that he was operating within his rights. This episode was the beginning of the end for BitKeeper and the kernel. March 3, 2005: MandrakeSoft acquires Conectiva. The European Commission ignores the European Parliament's request to restart the software patent directive process. March 10, 2005: Kernel quality concerns lead to the creation of the -stable tree. Those quality concerns are not gone now, though they have diminished somewhat. The -stable tree seemed like an experiment at the time, but it has proved successful and is still being produced almost three years later. 
April 7, 2005: The BitKeeper era comes to an abrupt end when the free-beer license for the software is terminated by BitMover. (Unfounded) rumors about a merger between UserLinux and Ubuntu circulate. April 14, 2005: Linus posts the first version of git. MandrakeSoft becomes Mandriva. The termination of free-beer BitKeeper was probably inevitable from the very beginning of its existence; trying to maintain a closed system with proprietary data formats in the middle of a highly open process was always a losing proposition. For some time, many of us had feared that it could end in a much uglier way than it actually played out. We, the community, had danced on some thin ice for a while, but, when it broke, the water was only ankle-deep. We got lucky. As your editor has said before, BitKeeper did us a lot of good by bringing order to the kernel development process when things had been working very poorly, and by showing the world what distributed revision control could do. It set the stage for what came after. Git was not the first free distributed revision control system, but it was the first to be employed on such a massive scale. In a real sense, git launched a new era of free software development. On that note, this article will end - and, probably, the retrospective series ends as well. As events become more recent, the difficulty of putting them into historical perspective gets greater. A retrospective covering the remaining 2+ years risks becoming a repeat of the annual timelines and adding little of value. That period is best left for the 20-year retrospective. So, the entire LWN staff would like to say "thanks!" one last time to our readers, who have treated us so well for the last ten years. It has been an incredible ride. SCO to continue the fight? Just as it seemed the SCO saga was drawing to a close, a new player, with up to $100 million to risk, has come on the scene. 
Stephen Norris Capital Partners (SNCP) has made an offer to take SCO private while providing a line of credit to allow the company to continue its operations. If the bankruptcy court in Delaware agrees to the plan - which is not a foregone conclusion - SCO and its various legal cases could be with us for a long time to come. SNCP will put up $5 million in cash to essentially purchase between 51 and 85% of SCO; the exact percentage is dependent upon how much of the $95 million credit line is used to pay off Novell and/or IBM. If there is no payment, because SCO eventually wins those cases, SNCP will get 51%. If the payment is over $30 million, SNCP gets 85%; in between those two, the percentage of ownership will be pro-rated between the two. The actual transaction would issue "Series A Preferred" stock to SNCP (and its investors), which would be convertible into SCO "New Common Stock"; the current common stockholders would see their shares "extinguished" and a trust established for them. This deal would take SCO private: it would no longer be publicly traded or subject to SEC reporting requirements. Under the proposed agreement, the credit line has an interest rate of the London Interbank Offered Rate (LIBOR) plus "1700 basis points" (17% for those without a high-finance background), which currently works out to be around 20%. This is clearly not cheap money, but it does provide a rather large war chest for SCO to continue the fight.
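The arithmetic of the proposed deal can be worked through in a few lines. This is a sketch under stated assumptions: the article says only that the stake is "pro-rated" between the two endpoints, so straight-line interpolation between 51% at no payment and 85% at $30 million is a guess at what the MOU means, and the 3% LIBOR figure is simply what the article's "around 20%" implies. The function names are invented for the example.

```python
def sncp_stake_pct(payment_musd: float) -> float:
    """SNCP's ownership (percent) for a given payout to Novell/IBM.

    payment_musd is the payment in millions of dollars. Linear
    interpolation between the endpoints is an assumption.
    """
    floor_pct, cap_pct, cap_payment = 51.0, 85.0, 30.0
    if payment_musd <= 0:
        return floor_pct
    if payment_musd >= cap_payment:
        return cap_pct
    return floor_pct + (cap_pct - floor_pct) * payment_musd / cap_payment


def credit_line_rate_pct(libor_pct: float, spread_bp: int = 1700) -> float:
    """Interest rate in percent: LIBOR plus a spread in basis points."""
    return libor_pct + spread_bp / 100.0  # 100 basis points = 1 percent


if __name__ == "__main__":
    for payment in (0, 15, 30, 50):
        stake = sncp_stake_pct(payment)
        print(f"${payment}M paid: SNCP {stake:.0f}%, trust {100 - stake:.0f}%")
    print(f"rate at 3% LIBOR: {credit_line_rate_pct(3.0):.0f}%")
```

Under these assumptions, every million dollars paid out moves a little over one percentage point of the company from the current stockholders' trust to SNCP, until the 85% cap is reached.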
The Memorandum of Understanding (MOU) [PDF] makes it clear that interest payments are part of what the line of credit is supposed to pay for: The purpose of the loan is to provide funds for (i) working capital for SCO following its emergence from bankruptcy, (ii) to pay interest when due under the Debt Financing, and (iii) to support the prosecution of the Reorganized Debtor's Litigation Claims, including providing letters of credit or other financial arrangements adequate to support any required appellate bonds (in which event the Reorganized SCO shall pay the reasonable letter of credit fees and expenses), and to effect payment of any final award against the Reorganized Debtor). SCO's bombastic CEO, Darl McBride, will be required to resign as a condition of the deal. The Series A stockholders would be entitled to elect four of the seven board members, ensuring that they control the day-to-day direction of the company. The CEO would hold another seat, as would an "outside executive with suitable industry expertise." The remaining seat would be open to anyone and voted on by the current common stockholders. What do the current stockholders get from this deal? Not much in the short term, as the MOU would set up a trust with $2 million (from the $5 million cash investment) to be distributed amongst the current stockholders. The current common stock would be "extinguished" and the trust would hold "New Common Stock" equivalent to the 15-49% left over based on the amount of the credit line used. Shareholders would get a pro-rata interest in the trust based on their current percentage of ownership. Based on 22 million outstanding shares, the distribution will amount to around $0.09 per share. Since SCO sued IBM in March 2003, most of the stock speculation has been based on some kind of monetary settlement from IBM. 
Investors in SCO since that time have essentially been betting on that outcome; the new arrangement still allows the current stockholders to hold onto their litigation lottery ticket. Any settlement money that comes to SCO as a result of the Novell and IBM cases would be paid to the trust in the percentage of ownership of the company that it holds (i.e. 15-49%). At that time, the trust would also get its percentage of four times the previous year's earnings. These would then be distributed to the members of the trust. It's a fairly complicated deal, and this just covers the high points; the curious are directed to the MOU itself. It is a bit premature to proclaim that SCO is going private or getting $100 million as some in the press have done. The bankruptcy court will have its say; Novell may have an objection or two as well, though, as things currently stand, it would be the likely beneficiary of some substantial part of the line of credit. We may get a read on how confident Novell is based on what, if any, objections it raises. It is hard to imagine that SNCP thinks SCO's business prospects are such that a large financial commitment is warranted. This is very clearly an attempt to wring money out of the current litigation - and perhaps start additional lawsuits. It is interesting to note that in addition to the Novell and IBM lawsuits, the MOU specifically mentions the Autozone case. There is speculation that the idea of a "Linux tax" on users is an outcome that SNCP and its investors covet. The question is, does SNCP truly believe that the claims made by SCO - without much in the way of supporting evidence so far - are likely to succeed on their merits? Or does it think that providing enough incentive, in the form of a further protracted legal battle, might cause someone to settle? The IBM case has been dragging on for almost five years now.
With the kind of money SCO would have at its disposal if this deal goes through, dragging out for another five does not seem implausible. At some point IBM or Novell may tire of the whole thing and try to cut some kind of deal. One hopes not, but that may be exactly what SNCP is betting on. The other side of that coin is that if that doesn't happen, we may well get a real hearing on some of IBM's counterclaims, in particular the GPL-infringement claims. That could be very interesting to watch. The state of Nouveau, part I [Editor's note: the following is the first in a two-part article on the status of the Nouveau project. This installment is an introductory piece describing the problem; the second part (to appear in one week) looks at how Nouveau development is being done and its current status.] Nouveau is an effort to create a complete open source driver for NVidia graphics cards for X.org. It aims to support 2D and 3D acceleration from the early NV04 cards up to the latest G80 Cards and work across all supported architectures like x86-64, PPC and x86. The project originated when Stéphane Marchesin set out to de-obfuscate parts of the NVidia-maintained nv driver. However, NVidia had corporate policies in place about the nv driver, and had no plans to change them at the time. So they refused Stéphane's patches. This left Stéphane with the greatest open source choice: "fork it"! At FOSDEM in February 2006, Stéphane unveiled his plans for an open source driver for NVidia hardware called Nouveau. The name was suggested by his IRC client's French autoreplace feature which suggested the word "nouveau" when he typed "nv". People liked it, so the name stuck. The FOSDEM presentation got the project enough publicity to engage the curiosity of other developers. Ben Skeggs was one of the first developers to sign up. 
He had worked on reverse engineering the shader components of the R300 (one of ATI's graphics chips) and on writing parts of the R300 driver; as a result, he had a great deal of experience with graphics drivers. He initially showed interest in the NV40 shaders only, but he got caught in the event horizon and has worked on every aspect of the driver for NV40 and later cards.

The project engaged other developers with short- and long-term interest. It also generated a large amount of interest due to a pledge drive started by an independent user. However, the project was mainly developed on IRC, and it was quite difficult for newcomers to get any insight into previous development; reading IRC logs is impractical at best. With this in mind, KoalaBR decided to start summarizing development in a series of articles known as the TiNDC (The irregular Nouveau Development Companion). This series of articles proved very useful for attracting developers and testers to the project. TiNDC issues are published every two to four weeks; as of this writing, the current issue is TiNDC #34.

Linux.conf.au 2007 saw the first live demo of Nouveau. Dave Airlie had signed up to give a talk on the subject; he managed to persuade Ben Skeggs that showing a working glxgears demo would be a great finish to the talk. Ben toiled furiously with the other developers to get the init code into shape for his laptop card, and the presentation was a great success.

After Nouveau missed out on a Google Summer of Code place, X.org granted the project a Vacation of Code alternative. This saw Arthur Huillet join the team to complete proper Xv support on Nouveau. Arthur saw the light and continued with the project once the VoC ended. In autumn 2007, Stuart Bennett and Maarten Maathuis vowed to get Nouveau's RandR 1.2 support into better shape. Since then a steady stream of patches has advanced the code greatly.
The project now has nine regular contributors (Stéphane Marchesin, Ben Skeggs, Patrice Mandin, Arthur Huillet, Pekka Paalanen, Maarten Maathuis, Peter Winters, Jeremy Kolb, Stuart Bennett), with many more part-time contributors, testers, writers and translators.

NVidia card families

This article will use the NVidia GPU technical names as opposed to marketing names. Where there are both "N" and "G" names, the "N" variant (NV4x, NV5x) will be used. Further information can be found on the Nouveau site.

Graphic Stack Overview

Before jumping into the Nouveau driver, this section provides a short background on the mess that is the Linux graphics stack. This stack has a long history dating back to Unix X servers and the XFree86 project. This history has led to a situation quite unlike the driver situation for any other device on a Linux system. The graphics drivers existed mainly in user space, provided by the XFree86 project, and little or no kernel interaction was required. The user-space component known as the DDX (Device-Dependent X) was responsible for initializing the card, setting modes and providing acceleration for 2D operations.

The kernel also provided framebuffer drivers on certain systems to allow a usable console before X started. The interaction between these drivers and the X.org drivers was very complex and often caused problems regarding which driver "owned" the hardware.

The DRI project was started to add support for direct rendering of 3D applications on Linux. This meant that an application could talk to the 3D hardware directly, bypassing the X server. OpenGL was the standard 3D API, but it is a complex interface which is definitely too large to implement in-kernel. GPUs also provided completely different low-level interfaces.
So, due to the complexity of the higher-level interface and the nonstandard nature of the hardware APIs, a kernel component (DRM) and a user-space driver (DRI) were required to securely expose the hardware interfaces and provide the OpenGL API.

Shortcomings of the current architecture have been noted over the past few years; the current belief is that GPU initialization, memory management, and mode setting need to migrate to the kernel in order to provide better support for features such as suspend/resume, proper cohabitation of X and the framebuffer driver, kernel error reporting, and future graphics card technologies.

The GPU memory manager implemented by Tungsten Graphics is known as TTM. It was originally designed as a general VM memory manager, but was initially targeted at Intel hardware. On top of this memory manager, a new modesetting architecture for the kernel is being implemented. This is based on the RandR 1.2 work found in the X.org server.

GPU architecture

Graphics cards are programmed in numerous ways, but most initialization and mode setting is done via memory-mapped I/O. This is just a set of registers accessible to the CPU via its standard memory address space. The registers in this address space are split up into ranges dealing with various features of the graphics card such as mode setup, output control, or clock configuration. A longer explanation can be found on Wikipedia.

Most recent GPUs also provide some sort of command processing ability where tasks can be offloaded from the CPU to be executed on the GPU, reducing the amount of CPU time required to execute graphical operations. This interface is commonly a FIFO implemented as a circular ring buffer into which commands are pushed by the CPU for processing by the GPU. It is located somewhere in a shared memory area (AGP memory, PCIGART, or video RAM). The GPU will also have a set of state information that is used to process these commands, usually known as a context.
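To make the circular ring buffer idea concrete, here is a minimal sketch of a command FIFO with one producer (standing in for the CPU) and one consumer (standing in for the GPU). The class and method names are hypothetical and chosen only for illustration; this is not how any real driver or card lays out its FIFO, and it ignores DMA, memory placement, and hardware registers entirely.

```python
class CommandRing:
    """Illustrative fixed-size ring buffer; names are hypothetical, not a driver API."""

    def __init__(self, size):
        self.slots = [0] * size
        self.size = size
        self.put = 0  # index of the next slot the producer (CPU) will write
        self.get = 0  # index of the next slot the consumer (GPU) will read

    def free_slots(self):
        # One slot is always left unused so that put == get can only mean "empty".
        return (self.get - self.put - 1) % self.size

    def push(self, word):
        """Producer side: queue one command word, or report that the ring is full."""
        if self.free_slots() == 0:
            return False
        self.slots[self.put] = word
        self.put = (self.put + 1) % self.size  # wrap around at the end of the buffer
        return True

    def pop(self):
        """Consumer side: fetch the oldest command word, or None if the ring is empty."""
        if self.get == self.put:
            return None
        word = self.slots[self.get]
        self.get = (self.get + 1) % self.size
        return word
```

Both indices only ever move forward and wrap around, so the same memory is reused indefinitely; the producer has to wait only when it catches up with the consumer.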
Most modern GPUs only contain a single command processing state machine. However, NVidia hardware has always contained multiple independent "channels", each of which consists of a private FIFO (push buffer), a graphics context and a number of context objects. The push buffer contains the commands to be processed by the card. The graphics context stores application-specific data such as matrices, texture unit configuration, blending setup, shader information, etc. Each channel has 8 subchannels to which graphics objects are bound in order to be addressed by FIFO commands.

Each NVidia card provides between 16 and 128 channels, depending on model; these are assigned to different rendering-related tasks. Each 3D client has an associated channel, while some are reserved for use in the kernel and the X server. Channels are context-switched by software via an interrupt (on older cards) or automatically by the hardware on cards after the NV30.

So what is stored within the FIFO? Each NVidia card offers a set of objects, each of which provides a set of methods related to a given task, e.g. DMA memory transfers or rendering. Those methods are the ones used by the driver (or, on a higher level, the rendering application). Whenever a client connects, it uses an ioctl() to create the channel. After that, the client creates the objects it needs via an additional ioctl().

Currently there are two types of possible clients: X (via the DDX driver) and OpenGL (via DRI/Mesa). An accelerated framebuffer using the new mode setting architecture (nouveaufb) will also be a future client, to avoid conflicts with nvidiafb.

Let's have a look at a small number of objects:

From this list, you can see that there are object types which are available on all cards (NV_MEMORY_TO_MEMORY_FORMAT) while others are only available on certain cards. For example, each class of card has its own 3D-engine object, such as NV10TCL on NV1x and NV20TCL on NV2x. An object is identified by a unique number: its "class".
This id is 0x5f for NV_IMAGE_BLIT, 0x9f for NV12_IMAGE_BLIT and 0x39 for NV_MEMORY_TO_MEMORY_FORMAT. If you want to use functionality provided by a given object, you must first bind this object to a subchannel. The card provides a certain number of subchannels which correspond to a certain number of "active" (or "bound") objects.

A command in the FIFO is made of a command header, followed by one or more parameters. The command header usually contains the subchannel number, the method offset to be called, and the number of parameters (a command header can also define a jump in the FIFO but this is outside the scope of this document). Each method the object provides has an offset which has to be set in the command. In order to limit the number of command headers to be written, thereby improving performance, NVidia cards will call several subsequent methods in a row if you provide several parameters.

How do we refer to an object? The data written to the FIFO doesn't hold any info about that... Binding an object to a subchannel is done by writing the object ID as an argument to method number 0. For example: 00044000 5c00000c binds object id 5c00000c to subchannel 2. This object ID is used as a key in a hash table kept in the card's memory which is filled up when creating objects.

The creation of an object relies on special memory areas. RAMIN is "instance memory", an area of memory through which the graphics engines of the card are configured. A RAMIN area is present on all NVIDIA chipsets in some form, but it has evolved quite a bit as newer chipsets have been released. Basically, RAMIN is what contains the objects. An object is usually not big (128 bytes in general, up to a few kilobytes in case of DMA transfer objects).

There are also a few specific areas in RAMIN that are worth mentioning:

RAMFC, the FIFO Context Table. It is a global table that stores the configuration/state of the FIFO engine for each channel. It doesn't exist in the same way on NV5x, where the FIFO has registers that contain pointers to each channel's PFIFO state, rather than a single global table.

RAMHT, the FIFO hash table. A global table, used by PFIFO to locate context objects, except on NV5x, where each channel has its own hash table.

Additional information can be found on the Nv object types and Honza Havlicek pages on the Nouveau site.

Reverse engineering: more than NVIDIA deserves?

Reverse engineering is a longstanding tradition in the free software community. It has often been the only way to get hardware to work when the manufacturer refuses to make documentation available, but there is more to it than that. Some of us, certainly, enjoy the challenge of figuring out how a particular device works. And our sense of freedom tells us that it is our right to understand the hardware which we have purchased and rightfully own. We, as a group, tend not to respond well to those who tell us that reverse engineering a product is not the right thing to do. But, increasingly, your editor is hearing voices within the community which are saying just that.

One of the most prominent reverse engineering projects at the moment is Nouveau, which is starting to have some real success in making NVIDIA graphics adapters work with free software; see this week's Kernel Page for an article on the state of this project. NVIDIA hardware has been a problem for a long time, of course. It is said to be nicely-designed, and it is certainly present in a significant percentage of new machines, but NVIDIA has had no interest in making free drivers (or documentation) available for some years. So the only way for owners of this hardware to use it with reasonable performance under Linux is to use NVIDIA's proprietary kernel module, and that is a price many of us are not willing to pay.

There are currently about eight developers working to make the Nouveau driver better.
They have reached a point where their understanding of the hardware and their reverse engineering tools are quite good; that, in turn, is enabling fast progress toward the creation of a working driver. With this kind of developer attention, the Nouveau driver may reach a stable state over the course of the next year, at least for some versions of the hardware. And that, it seems, should be a good thing. Except for one little issue. NVIDIA's competition in this market is provided mainly by Intel and AMD/ATI. Intel provides free drivers for its hardware as a matter of company policy, and AMD has pushed a much more friendly policy onto ATI since the middle of last year. So free drivers for Intel video adapters come with distributions, and the first ATI drivers are beginning to become available. One rather perverse result of this situation is that there are almost no community developers working on the Intel drivers at all. The development and maintenance of those drivers is an expense carried by Intel alone. One could argue that the lack of hardware documentation from Intel has made it hard for other developers to participate; Intel is now beginning to address that problem by burying the community in comprehensive, Creative Commons-licensed hardware programming manuals. It will be interesting to see how much more community help Intel gets as a result of its documentation release. ATI, which has not, to date, provided working, free drivers, is arguably getting more help from the community and, especially, from distributors who have an interest in working drivers. But that company, too, is putting in resources of its own toward that goal. NVIDIA, instead, is giving us nothing - and, in return, we are giving it an eight-person development team dedicated to the production of free drivers for its hardware. 
Once Nouveau is in a working state, Linux users will be able to buy NVIDIA hardware in the knowledge that it will simply work without requiring them to download and use binary-only kernel modules. The result of that can only be higher sales for NVIDIA. While talking to developers at linux.conf.au, your editor heard a number of them say that NVIDIA does not deserve a gift of this magnitude from the community. We are now quite close to having free support for video hardware at all performance levels, supplied by friendly companies. Rather than penalize those companies by making a free gift to their biggest competitor, some say, shouldn't NVIDIA be made to pay for its behavior by exclusion from our community until it comes around? There is a point here. The biggest lever we have when talking with hardware companies (or any company, for that matter) is money. Companies which see themselves as missing out on the Linux market will find a strong incentive to change their behavior. So if NVIDIA finds that system resellers are not using its chipsets for Linux-based systems, it will have to reconsider its position with regard to free drivers. In the past, there was no credible alternative to NVIDIA, so the company had no real reason to fear that it could lose money as a result of its uncooperative behavior. Now there are well-supported alternatives at the lower end of the market, and the prospect of the same for high-end graphics as well. So there will be no need to buy hardware from this particular vendor, and, since the alternatives will be well supported, every reason to buy from somebody else. Unless NVIDIA's hardware, too, is made to work via a community-supported driver. Should this happen, one could well say that we, as a community, have taken a prize away from companies which have treated us well and handed it to their competitor (which has not). 
Arguably, the community should not pursue the creation of reverse-engineered drivers in situations where competing vendors are playing by our rules. Otherwise, we are sending a rather conflicted message to both types of companies. It may really be true that, in the long run, the Nouveau driver is harmful to our real interests.

All of this discussion may be moot. There's no way that any of us could keep others from reverse-engineering their hardware and writing drivers, even if we wanted to. Anybody arguing against the mainline inclusion of a GPL-licensed driver for popular hardware is likely to end up in a minority position, to say the least. So, as a community, we cannot make a collective decision to stop this kind of development. But, as individual developers, we may occasionally want to give a moment's thought to the question of whether our activities are truly beneficial in the long run.

Marketing Fedora

It is an exciting time for free software, as massive strides forward have been made in increasing both market share and mind share within the less technically orientated circles of society. Ubuntu is now available pre-installed on Dell systems and SUSE on Lenovo systems, the Xandros-based eeePC has sold millions already, and the One Laptop Per Child project has gone into mass production. Stephen Fry, the popular British actor, is even pledging his support in national newspapers.

Taking advantage of this momentum and using it to help extend existing communities is going to be vital for any free software project. With this in mind, Fedora is seeking to set itself on solid ground with a revitalised marketing effort which hopes both to define Fedora's position in the world and to find new ways of growing its user and contributor base. Recently the first tentative steps have been made along this path with the revitalising of Fedora's community marketing team. In Fedora talk there is now an official Special Interest Group (SIG).
Following on from a session at the recent Fedora Users' and Developers' Conference the SIG is gaining a lot of momentum, with input from Red Hat's professional marketing team pouring in. This help is being provided on top of their Red Hat duties, and so their involvement is exactly the same as that of any other community members. So far their contributions have largely been aiding the creation of a long term marketing plan for Fedora, which will help to provide a more consistent message across Fedora's many outlets. This means that not only will Fedora's community Ambassadors be better briefed on what the key promotional aspects of Fedora are, but they'll have a better understanding of the best methods to achieve this and more support in terms of marketing collateral. The same benefits will also apply to Fedora's online marketing efforts, including their Developer Interviews and Release Overviews. Creating this plan still depends on overcoming a number of challenges. Foremost amongst these is understanding exactly what Fedora is, and what its target audience is. Recently Fedora has gone from being a simple distribution, to the upstream for an increasing number of projects. Thanks to its open build tools and custom re-spinning applications there are a growing number of custom spins, and other projects such as the Creative Commons LiveContent CDs and DVDs, as well as offerings from the Fedora Unity Project. Graphical tools such as Revisor have made re-spinning easy. Other Fedora derivatives, notably Red Hat Enterprise Linux and the OLPC, don't rely on the custom re-spinning applications, but do rely on Fedora source code to build their distributions. To accompany this, and widening Fedora's mission even further is the newly launched beta of a service called Fedora TV. 
Its goals are to encourage the use and development of free media formats such as Ogg Vorbis/Theora, PNGs and SVGs, as well as encouraging the continued development of the free software tools to create media in these formats.

This is not to say that Fedora is no longer focused on its core purpose of providing a distribution which showcases the latest and greatest that free software has to offer. Fedora 9 (Sulphur) Alpha was released recently, and a quick glance at its release notes shows a lot of interesting new features appearing. Along with the usual bundle of software updates, including KDE 4 and GNOME 2.21.4, a lot of attention has been given to Anaconda, Fedora's system installer. In particular, Anaconda now has the ability to resize partitions as well as to create and install the system on encrypted partitions. Also exciting is the inclusion of FreeIPA, a system which "... combines the power of the Fedora Directory Server with FreeRADIUS, MIT Kerberos, NTP and DNS to provide an easy, out of the box solution" for managing various auditing, identity and policy processes. If the events following Fedora 8's release are anything to go by, we can expect to see many of these features appearing in other distributions during the autumn 2008 and spring 2009 release cycles.

Another significant challenge for the Fedora Marketing SIG is not just defining what Fedora is, but persuading people that they want to be a part of it. In the short term this means promoting the large amount of work that Fedora does upstream and making it as easy as possible for people to get involved by lowering their barriers to entry. In the long term this means, as Paul Frields, Fedora's new project leader, recently commented, overcoming the "... decline of volunteerism in the USA overall ..."

Of course, talk and good intentions are wonderful, but without practical results they are meaningless. To this end the Fedora Marketing SIG is already beginning to pick up speed.
Concrete, long-term plans are being laid with the aid of Red Hat's professionals, and in the short term Fedora seems to be cropping up in popular news sites more often than it has done for quite a while. Fedora developers are gaining increased recognition for the work that they put in, which often shows up in other distributions. With the release of Fedora 9 (Sulphur) Alpha, the increased attention that it received in comparison to previous early development releases, and an already impressive set of new features, the future seems bright.

Rob Savoye discusses the Gnash project

On February 14, 2008 the Boulder Linux Users Group presented a talk by Rob Savoye entitled Gnash, and the quest for Open Media politics and legalities. This article aims to cover some of the key points raised by Rob.

The Gnash home page describes the project:

Gnash is a GNU Flash movie player. Previously, it was only possible to play flash movies with proprietary software. While there are some other free flash players, none support anything beyond SWF v4. Gnash is based on GameSWF, and supports many SWF v7 features.

Gnash is cross-platform software. It currently works on Linux, MacOS, Windows and some embedded platforms. Under Linux, it runs on the KDE, Gnome and FLTK desktop environments. Gnash can be run in standalone mode or as a browser plugin for Mozilla Firefox and Konqueror. The software currently runs on small platforms such as cell phones and PDAs, larger desktop systems and game platforms. Gnash does not yet run on the ROCKbox platform, but that is an interesting idea.

Gnash has been developed with efficiency in mind from the beginning. One of the main design goals has been to trap all possible errors and deal with them correctly.

The Open Media Now! Foundation has been created as a support base for Gnash:

OMNow is a foundation dedicated to the development, support and empowerment of an open media infrastructure.
Upon this infrastructure stand companies and individuals who need free media solutions. Free media solutions save companies money and give them control over product technology. Such solutions support individuals by offering them legal ways to create, distribute and display their creative works. Our foundation opens the media market by actively developing operating system-agnostic and cross-platform solutions.

Gnash development originally started because of a need for an open-source alternative to proprietary Flash/FLV players. Red Hat's Bob Young is supporting the Gnash project. His desire was to have a legal but free client that allowed Linux users to view online video sites like YouTube.

Gnash development has been done using a clean-room reverse engineering technique. By agreeing to the license for the Adobe (formerly Shockwave) Flash player, a developer gives up the right to develop a competing product. This has limited the input from some "tainted" developers to remotely testing the application and reporting bugs.

Rob made a number of comments on the Gnash development process. Reverse engineering of a proprietary format has been tricky; it involved a lot of effort from numerous people. Developers involved in this type of project require a lot of personal motivation. After enough hours staring at hex dumps, one is able to recognize data structures and read the text represented by hex-encoded ASCII. Patterns emerge in the hex output; some apparent bugs have even been found in the data generated by proprietary CODECs.

The Gnash project has wider goals than just providing a free media player. The writing of open-source creation tools, servers and clients is in the planning stages. One interesting concept is to have Gnash negotiate with a content server and automatically switch to a free CODEC mid-stream. There are plans to support a broader selection of free video CODECs. This is somewhat hampered by the numerous and fuzzy legal issues around CODECs.
FLV is currently the most common online video format; it tends to lock users in and has successfully locked in the market. Gnash hopes to break this lock by giving Gnash free CODECs with more features such as higher quality video and better bandwidth utilization. Interestingly, the mobile phone platform, which has a much quicker design cycle turnaround, may lead the way for open video formats. Due to its small memory footprint, Gnash is often the best, if not the only, option for providing video on phones.

Patent-free CODECs can have a large appeal to content providers. With proprietary CODECs, it is up to the provider to pay the licensing fees. This can often consume most of the profit such an organization brings in. Free CODECs will enable a much larger group of content providers to open up. The Wikipedia online encyclopedia project has recently started experimenting with a collaborative video project.

Rob mentioned one interesting side topic that applies to many free software projects. There are three stages of project development. The first is making software that works in a basic way. This is relatively easy, and is where many projects get stuck. The next stage is to make the software work well. Some, but not many, free software projects graduate to this level. The last stage is to make a product. This is something that only a few free software projects ever achieve. A product works well for almost all users and is easy to figure out. Bugs are rarely encountered. It can take more effort to move to the product level than the other stages combined.

Wrapping things up, Rob mentioned that the Gnash project is very much in need of some assistance from a GUI expert; knowledge of both KDE and GNOME is desirable. Interested people should apply. Also, a new release of Gnash should be out fairly soon.
KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

When I was but a wee computer science student at New Mexico Tech, a graduate student in OS handed me an inch-thick print-out and told me that if I was really interested in operating systems, I had to read this. It was something about a completely lock-free operating system optimized using run-time code generation, written from scratch in assembly running on a homemade two-CPU SMP with a two-word compare-and-swap instruction - you know, nothing fancy. The print-out I was holding was Alexia (formerly Henry) Massalin's PhD thesis, Synthesis: An Efficient Implementation of Fundamental Operating Systems Services (html version here). Dutifully, I read the entire 158 pages. At the end, I realized that I understood not a word of it, right up to and including the cartoon of a koala saying "QUA!" at the end. Okay, I exaggerate - lock-free algorithms had been a hobby of mine for the previous few months - but the main point I came away with was that there was a lot of cool stuff in operating systems that I had yet to learn.

Every year or two after that, I'd pick up my now bedraggled copy of "Synthesis" and reread it, and every time I would understand a little bit more. First came the lock-free algorithms, then the run-time code generation, then quajects. The individual techniques were not always new in and of themselves, but in Synthesis they were developed, elaborated, and implemented throughout a fully functioning UNIX-style operating system. I still don't understand all of Synthesis, but I understand enough now to realize that my grad student friend was right: anyone really interested in operating systems should read this thesis.

Run-time code generation

The name "Synthesis" comes from run-time code generation - code synthesis - used to optimize and re-optimize kernel routines in response to changing conditions.
The concept of optimizing code during run-time is by now familiar to many programmers, in part from Transmeta's processor-level code optimization, used to lower power consumption (and many programmers are familiar with Transmeta as the one-time employer of Linus Torvalds).

Run-time code generation in Synthesis begins with some level of compile-time optimization: optimizations that will be efficient regardless of the run-time environment. The result can be thought of as a template for the final code, with "holes" where the run-time data will go. The run-time code generation then takes advantage of data-dependent optimizations. For example, if the code evaluates A * B, and at run-time we discover that B is always 1, then we can generate more efficient code that skips the multiplication step and run that code instead of the original. Fully optimized versions of the code, pre-computed for common data values, can be simply swapped in without any further run-time computation.

Another example from the thesis:

[...] Suppose that the compiler knows, either through static control-flow analysis, or simply by the programmer telling it through some directives, that the function f(p1, ...) = 4 * p1 + ... will be specialized at run-time for constant p1. The compiler can deduce that the expression 4 * p1 will reduce to a constant, but it does not know what particular value that constant will have. It can capture this knowledge in a custom code generator for f that computes the value 4 * p1 when p1 becomes known and stores it in the correct spot in the machine code of the specialized function f, bypassing the need for analysis at run-time.

Run-time code generation in Synthesis is a fusion of compile-time and run-time optimizations in which useful code templates are created at compile time that can later be optimized simply and cleanly at run time.
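The "template with holes" idea can be illustrated at a much higher level than Synthesis operates. The sketch below is only an analogy: Synthesis patches constants into machine-code templates, whereas this example fills a hole in a Python source template and compiles it at run time. The two-argument form of f and all names here are assumptions made for illustration, and p1 is assumed to be an integer.

```python
def f_generic(p1, x):
    """Unspecialized version: recomputes 4 * p1 on every call."""
    return 4 * p1 + x


# Template prepared ahead of time; {k} is the "hole" filled in once p1 is known.
TEMPLATE = "def f_specialized(x):\n    return ({k}) + x\n"


def make_specialized_f(p1):
    """Build a version of f for one fixed integer p1 (hypothetical helper)."""
    k = 4 * p1                       # the constant is computed exactly once
    source = TEMPLATE.format(k=k)    # fill the hole in the template
    namespace = {}
    exec(source, namespace)          # run-time code generation step
    return namespace["f_specialized"]
```

Calling make_specialized_f(3) returns a function that computes the same result as f_generic(3, x) but no longer performs the multiplication on each call, which is the data-dependent saving the thesis describes.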
Quajects

Understanding run-time code generation is a prerequisite for understanding quajects, the basic unit out of which the Synthesis kernel is constructed. Quajects are almost but not quite entirely unlike objects. Like objects, quajects come in types - queue quaject, thread quaject, buffer quaject - and encapsulate all the data associated with the quaject. Unlike objects, which contain pointers to functions implementing their methods, quajects contain the code implementing their methods directly. That's right - the actual executable instructions are stored inside the data structure of the quaject, with the code nestled up against the data it will operate on. In cases where the code is too large to fit in the quaject, the code jumps out to the rest of the method located elsewhere in memory. The code implementing the methods is created by filling in pre-compiled templates and can be self-modifying as well.

Quajects interact with other quajects via a direct and simple system of cross-quaject calls: callentries, callouts, and callbacks. The user of a quaject invokes callentries in the quaject, which implement that quaject's methods. Usually the callentry returns to the caller as normal, but in exceptional situations the quaject will invoke a method in the caller's quaject - a callback. Callouts are places where a quaject invokes some other quaject's callentries.

Synthesis implements a basic set of quajects - thread, queue, buffer, clock, etc. - and builds higher-level structures by combining lower-level quajects. For example, a UNIX process is constructed out of a thread quaject, a memory quaject, and some I/O quajects.

As an example, let's look at the queue quaject's interface. A queue has two callentries, queue_put and queue_get. These are invoked by another quaject wanting to add entries to or remove entries from the queue. The queue quaject also has four callbacks into the caller's quaject, queue_full, queue_full-1, queue_empty, and queue_empty-1.
When a caller invokes the queue_put method and the queue is full, the queue quaject invokes the queue_full callback in the caller's quaject. From the thesis: The idea is: instead of returning a condition code for interpretation by the invoker, the queue quaject directly calls the appropriate handling routines supplied by the invoker, speeding execution by eliminating the interpretation of return status codes. The queue_full-1 method is executed when a queue has transitioned from full to not full, queue_empty when the queue doesn't contain anything, and queue_empty-1 when the queue transitions from empty to not empty. With these six callentries and callbacks, a queue is implemented in a generic, extensible, yet incredibly efficient manner. Pretty cool stuff, huh? But wait, there's more! Optimistic lock-free synchronization Most modern operating systems use a combination of interrupt disabling and locks to synchronize access to shared data structures and guarantee single-threaded execution of critical sections in general. The most popular synchronization primitive in Linux is the spinlock, implemented with the nearly universal test-and-set-bit atomic operation. When one thread attempts to acquire the spinlock guarding some critical section, it busy-waits, repeatedly trying to acquire the spinlock until it succeeds. Synchronization based on locks works well enough but it has several problems: contention, deadlock, and priority inversion. Each of these problems can be (and is) worked around by following strict rules: keep the critical section short, always acquire locks in the same order, and implement various more-or-less complex methods of priority inheritance. Defining, implementing, and following these rules is non-trivial and a source of a lot of the pain involved in writing code for modern operating systems. To address these problems, Maurice Herlihy proposed a system of lock-free synchronization using an atomic compare-and-swap instruction. 
Compare-and-swap takes the address of a word, the previous value of the word, and the desired new value of the word. It swaps the previous and new values of the word if and only if the previous value is the same as the current value. The bare compare-and-swap instruction allows atomic updates of single pointers. To atomically switch between larger data structures, a new copy of the data structure is created, updated with the changes, and the addresses of the two data structures swapped. If the compare-and-swap fails because some other thread has updated the value, the operation is retried until it succeeds. Lock-free synchronization eliminates deadlocks, the need for strict lock ordering rules, and priority inversion (contention on the compare-and-swap instruction itself is still a concern, but rarely observed in the wild). The main drawback of Herlihy's algorithms is that they require a lot of data copying for anything more complex than swapping two addresses, making the total cost of the operation greater than the cost of locking algorithms in many cases. Massalin took advantage of the two-word compare-and-swap instruction available in the Motorola 68030 and expanded on Herlihy's work to implement lock-free and copy-free synchronization of queues, stacks, and linked lists. She then took a novel approach: Rather than choose a general synchronization technique (like spinlocks) and apply it to arbitrary data structures and operations, instead build the operating system out of data structures simple enough to be updated in an efficient lock-free manner. Synthesis is actually even cooler than lock-free: Given the system of quajects, code synthesis, and callbacks, operations on data structures can be completely synchronization-free in common situations. For example, a single-producer, single-consumer queue can be updated concurrently without any kind of synchronization as long as the queue is non-empty, since each thread operates on only one end of the queue. 
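The copy, update, and swap loop described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: Python exposes no compare-and-swap instruction, so the Word class below emulates its atomicity with an internal lock, and the names Word and lock_free_update are invented for this example.

```python
import threading


class Word:
    """A single shared word offering compare-and-swap.

    Assumption: the internal lock only emulates the atomicity that a CPU
    gives the real instruction; Python exposes no CAS of its own.
    """

    def __init__(self, value):
        self._value = value
        self._atomic = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected` (same object)."""
        with self._atomic:
            if self._value is expected:
                self._value = new
                return True
            return False


def lock_free_update(word, modify):
    """Copy, modify, and swap; retry if another thread got there first."""
    while True:
        old = word.load()
        new = modify(old)  # build an updated copy; `old` is left untouched
        if word.compare_and_swap(old, new):
            return new


shared = Word(())  # the shared "data structure" is an immutable tuple


def worker(thread_id, count):
    for i in range(count):
        lock_free_update(shared, lambda t: t + ((thread_id, i),))


threads = [threading.Thread(target=worker, args=(n, 250)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every append copies the whole tuple, which is exactly the copying cost the text attributes to the general approach; Massalin's two-word compare-and-swap structures are designed to avoid it.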
When the callback for queue empty happens, the code to operate on the queue is switched to use the lock-free synchronization code. When the quaject's queue-not-empty callback is invoked, the quajects switch back to the synchronization-free code. (This specific algorithm is not, to my knowledge, described in detail in the thesis, but was imparted to me some months ago by Dr. Massalin herself at one of those wild late-night kernel programmer parties, so take my description with a grain of salt.) The approach to synchronization in Synthesis is summarized in the following quote: Avoid synchronization whenever possible. Encode shared data into one or two machine words. Express the operation in terms of one or more fast lock-free data structure operations. Partition the work into two parts: a part that can be done lock-free, and a part that can be postponed to a time when there can be no interference. Use a server thread to serialize the operation. Communications with the server happens using concurrent, lock-free queues. The last two points will sound familiar if you're aware of Paul McKenney's read-copy-update (RCU) algorithm. In Synthesis, thread structures to be deleted or removed from the run queue are marked as such, and then actually deleted or removed by the scheduler thread during normal traversal of the run queue. In RCU, the reference to a list entry is removed from the linked list while holding the list lock, but the removed list entry is not actually freed until it can be guaranteed that no reader is accessing that entry. In both cases, reads are synchronization-free, but deletes are separated into two phases, one that begins the operation in an efficient low-contention manner, and a second, deferred, synchronization-free phase to complete the operation. The two techniques are by no means the same, but share a similar philosophy. Synthesis: Operating system of the future? 
The design principles of Synthesis, while powerful and generic, still have some major drawbacks. The algorithms are difficult to understand and implement for regular human beings (or kernel programmers, for that matter). As Linux has demonstrated, making kernel development simple enough that a wide variety of people can contribute has some significant payoffs. Another drawback is that two-word compare-and-swap is, shall we say, not a common feature of modern processors. Lock-free synchronization can be achieved without this instruction, but it is far less efficient. In my opinion, reading this paper is valuable more for retraining the way your brain thinks about synchronization than for copying the exact algorithms. This thesis is especially valuable reading for people interested in low-latency or real-time response, since one of the explicit goals of Synthesis is support for real-time sound processing. Finally, I want to note that Synthesis contains many more elegant ideas that I couldn't cover in even the most superficial detail - quaject-based user/kernel interface, per-process exception tables, scheduling based on I/O rates, etc., etc. And while the exact implementation details are fascinating, the thesis is also peppered with delightful koan-like statements about design patterns for operating systems. Any time you're feeling bored with operating systems, sit down and read a chapter of this thesis. [ Valerie Henson is a Linux file systems consultant and proud recipient of a piggy-back ride from Dr. Alexia Massalin. ] kgdb getting closer to being merged? The kernel source level debugger, kgdb, has been around for a long time, but never in the mainline tree. Linus Torvalds is not much of a fan of debuggers in general and has always resisted the inclusion of kgdb. That looks like it might be changing somewhat, with the inclusion of kgdb in 2.6.26 now a distinct possibility. 
Over the years, Torvalds has made various pronouncements about debuggers, particularly kernel debuggers; a long message to linux-kernel in 2000 seems to outline his objections: I happen to believe that not having a kernel debugger forces people to think about their problem on a different level than with a debugger. I think that without a debugger, you don't get into that mindset where you know how it behaves, and then you fix it from there. Without a debugger, you tend to think about problems another way. You want to understand things on a different _level_. An attempt to sneak kgdb into the mainline via x86 architecture updates failed, but Torvalds did open the door a crack towards merging the kgdb changes: "I won't even consider pulling it unless it's offered as a separate tree, not mixed up with other things. At that point I can give a look." That has spawned the kgdb-light effort, spearheaded by Ingo Molnar. The original hope to get it included into 2.6.25 has been dashed, but with Molnar rapidly iterating to address kernel hacker concerns, the number of complaints seems to be decreasing. Molnar is up to version 10 of the kgdb-light patchset in something like three days since the first was posted. The various linux-kernel threads show a number of very hopeful developers waiting with bated breath to see if kgdb can finally make its way into the mainline. The light version of kgdb still has most of the capabilities of the original kgdb and any additional, possibly more intrusive, features can be added later. Molnar is clearly trying to do things the right way, with a merge of the non-intrusive kgdb functionality that can eventually be used by multiple architectures. He points out that there are already gdb remote stubs in three separate architectures in the mainline, continuing: So we could have done it the same way, by doing cp kernel/kgdb.c arch/x86/kernel/gdb-stub.c and merging that. 
Nobody could have said a _single_ word - we already have lowlevel UART code in early_printk.c that we could have reused. But we wanted to do it _right_ and not add an arch/x86/kernel/gdb-stub.c special hack. Discussions about the patches have been mostly to point out problems or areas that need cleaning up. The philosophical objections have been mostly avoided, quite possibly because Molnar has been scrupulously trying to make a "no impact" set of patches: this kgdb series has _obviously_ zero impact on the kernel, because it just does not touch any dangerous codepath. From this point on KGDB can evolve in small, well-controlled baby steps, as all other kernel code as well. To that end, the patch changes 22 files (rather than the 41 touched by the original kgdb submission), removing "_all_ critical path impact" and the low-level serial drivers—as Molnar points out, kgdb should not be in the driver business. In addition, the "kgdb over polled consoles" support has been reworked and cleaned up. Various hacks to get at module symbols have been removed as a better solution for that problem is needed. So far, no show stopping problems have been identified, so it really seems to come down to what Torvalds thinks; for that, we may have to wait until the 2.6.26 merge window opens in April or May. The dangers of weak random numbers Amit Klein has been looking into pseudo-random number generators (PRNG) again. He has found a number of problems in the algorithms that make it easier to guess the next number generated. Much like his earlier work on Berkeley Internet Name Daemon (BIND), Klein found that with a small amount of traffic, predicting the next DNS transaction ID or IP fragmentation ID is possible. Anything that uses random numbers for security purposes—as opposed to, say, choosing which fortune to deliver—needs to ensure that their random numbers are cryptographically strong. 
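The difference between a guessable ID and a cryptographically strong one can be shown in a few lines. This is a minimal sketch and does not reproduce any of the algorithms Klein analyzed: the SequentialIds class and the helper names are invented, and the strong variant simply uses Python's standard secrets module, which draws from the operating system's CSPRNG.

```python
import secrets


class SequentialIds:
    """16-bit ID source that simply increments (and wraps around)."""

    def __init__(self, start=0):
        self._next = start & 0xFFFF

    def next_id(self):
        value = self._next
        self._next = (self._next + 1) & 0xFFFF
        return value


def predict_next(observed_id):
    """An observer of one sequential ID knows the next one with certainty."""
    return (observed_id + 1) & 0xFFFF


def strong_id():
    """Draw a 16-bit ID from the operating system's CSPRNG."""
    return secrets.randbelow(1 << 16)


ids = SequentialIds(start=0xFFFE)
observed = ids.next_id()
guess = predict_next(observed)
actual = ids.next_id()       # the guess matches this value
unpredictable = strong_id()  # no comparable shortcut for an observer
```

Even a strong generator leaves only 65536 possible values in a 16-bit field, so the point of the exercise is to force an attacker to search that whole space rather than a small corner of it.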
In his report, Klein looks at a specific algorithm that has been implemented, with slight variations, in multiple places. It was introduced into OpenBSD in 1997 to randomize two 16-bit IDs to protect against predictability. Prior to that, both DNS transaction IDs and IP fragmentation IDs were essentially just incrementing counters. Various attacks, like idle scanning and DNS cache poisoning were possible because those IDs could be predicted. The OpenBSD PRNG algorithm was then used in their BIND 9 implementation, replacing the solution that Internet Systems Consortium (ISC)—maintainer of BIND—had used. ISC added a random number for the 16-bit DNS transaction ID, instead of an incrementing counter, as part of BIND 9. Klein's earlier work found problems with that PRNG—avoided by the OpenBSD version—leading to a certain amount of smugness on the part of the OpenBSD folks. It is clear that the OpenBSD algorithm is better than the one ISC introduced in BIND 9, but Klein was still able to find ways to break it. The method requires much more computation than was needed to crack BIND 9 transaction IDs, roughly six minutes of computation on a fairly high-end processor. Klein presents various ideas to parallelize the algorithm for multi-core or multi-processor computation that could bring that number way down. So, there is no working exploit available, but it is well within the grasp; a determined attacker could make use of the techniques to poison the cache of OpenBSD servers. In addition, Klein found ways to exploit the IP fragmentation ID predictability to do idle scanning, host operating system fingerprinting, and other kinds of information leaks; it may also be possible to inject an attacker-controlled packet into a TCP/IP connection, called a blind data injection. The belief in the strength of the OpenBSD PRNG made it an attractive option for others in the BSD family to adopt. 
NetBSD, FreeBSD, and DragonFly BSD all adopted a variant of the algorithm for the IP fragmentation ID, as did the FreeBSD-derived Mac OS X. It should be noted that only OpenBSD and Mac OS X enable the fragmentation ID randomization by default; the others have a setting for it, but their default behavior is sequential IDs (i.e. id++), which is clearly even easier to predict. The security team for each of the OSes had a fairly predictable response, with one notable exception. NetBSD, FreeBSD, and DragonFly BSD all changed the PRNG algorithm for less predictability; Apple claimed to be working on the problem but could not provide a timeline for a fix. The exceptional response came from OpenBSD, who are "completely uninterested in the problem," according to an email from the OpenBSD coordinator (presumably Theo de Raadt) that Klein quotes. The email goes on to say that the problem is "completely irrelevant in the real world." This kind of bluster is surprising from the OS that prides itself on security; it was, after all, the first to introduce randomization of these IDs. It may be that exploiting the predictability is hard to do, but Klein's techniques clearly reduce the search space drastically, which is not what you want from a PRNG. The other BSDs found it important enough to change; what does OpenBSD know that they don't? It would be foolish for Linux users to write this off as a "BSD problem" (though the random numbers used for IP fragmentation IDs by Linux are considered to be cryptographically strong) because there very well may be problems elsewhere in Linux or the applications that are typically run on it. We are not immune to making mistakes, so all uses of random numbers should be scrutinized. New development needs to remember these lessons of the past as well, so that we can avoid this kind of problem in the future. Directions in UMPC-land It is an exciting time for Linux users who are interested in ultra-mobile PCs (UMPCs). 
New models are being announced frequently with many (dare we say most?) coming with at least the option to have Linux pre-installed. The low-cost models probably require Linux in order to make their price point, but even higher-end UMPCs seem to be made with Linux firmly in mind. In many ways, the One Laptop Per Child (OLPC) project has driven the demand for low-cost machines for adults as well. Commercial offerings from ASUS (Eee PC), Everex (Cloudbook), Elonex (One), along with a rumored UMPC from HP are giving both the OLPC and Intel's ClassmatePC some competition. Add in Nokia's N810 and you have a half-dozen very mobile solutions featuring Linux, though the ClassmatePC seems to be more geared towards Windows XP. None of them has quite the right set of features to be the ultimate UMPC, but we seem to be headed in the right direction, so it is worth contemplating what that machine might look like. Battery life is the Achilles' heel of mobile devices; some kind of breakthrough in power consumption or energy storage needs to happen for big strides to be made in this area. Because of weight considerations, today's UMPCs tend to have small batteries and three hours or less of battery life. Something on the order of twelve hours (with a measurement in days being the real goal) is more like what is needed. Perhaps some kind of human-powered or alternative charging mechanism can play a role. It is probably the biggest challenge to reaching something approaching an ultimate device. Part of the reason that battery life is so low is how much power the display consumes. With rotating media on its way out (at least for these kinds of devices), the display is one of the areas where power savings would be felt most strongly. The E-Ink displays, such as those used by the newer e-book readers, have some great properties in terms of power consumption, but the speed at which they update makes them undesirable for general computer use. 
Many of us spend a fair amount of time looking at a static screen for several to many seconds at a time. Web pages or e-books might be candidates for using E-Ink, perhaps, but not Wesnoth or typing a document. Perhaps a dual-mode screen that combined an LED and E-Ink display could blend the best of both. OLPC has an innovative display with many of the characteristics needed, which can also be viewed in sunlit conditions. Former OLPC CTO Mary Lou Jepsen's startup is licensing the XO display technology, so we may see it in a UMPC before too long. The size of the display will likely need to be larger than today's offerings as well. That will be a balancing act between size, weight, and cost which will be interesting to see play out. A touchscreen is another feature that will be necessary as the display should be usable separate from the keyboard. Some way of transforming a small laptop into a tablet PC and e-book reader would be very desirable, with bonus points awarded if that transformation is fast and seamless. A full-sized or nearly so keyboard is also a necessity. Too much of the work that we do involves words and numbers that need to be input. If this device is to become an integral part of a day-to-day routine, thumb or child-sized keyboards just won't cut it. Wifi and wired connectivity are obvious, while Bluetooth would seem to be a good addition to provide internet via cell phone. Some might want to integrate actual cell phone functionality into the device itself, to avoid the multiple device hassle. Given that the size of a UMPC won't ever reach that of a cell phone, that seems like a stretch, but for those who want it, an optional feature seems like the way to provide that. Like the OLPC, the device should be ruggedized, able to withstand reasonable amounts of abuse without much more than a case scratch. This is another area where flash disks will help as there won't be the threat of losing data when the disk heads suffer rapid deceleration. 
The price per gigabyte for solid-state drives will drop to the point where a few hundred GB will be possible at a reasonable price. Carrying around one's favorite music as FLACs, rather than in some lossy format, should be possible. A fairly modest and power-friendly processor with a GB or two of RAM should round out the basics of the hardware. The device will run Linux, of course, and might have a few other peripherals: camera, microphone, speakers, etc. All should be available for $500-700, at least in a very functional low-end configuration. When might we see such a device? Two to three years seems quite likely, certainly before five years have passed. When it's ready, please send one to LWN for review in care of the author. A Beijing trip report China would seem like an ideal environment for free software. The Chinese have a need for vast amounts of software as their country rapidly industrializes, they have reasons to prefer software which is not controlled by American corporations, and they have been coming under some pressure from those same corporations to do something about their little habit of copying proprietary software without much regard for details like license agreements. Free software offers them the ability to take control of their own software, make sure it lacks unwelcome surprises, and copy it as much as they like. And China has been making a lot of use of Linux and free software, but, as is the case with many Asian countries, China's presence in the development community is relatively small. Encouraging participation from Asian countries has been a goal of the Linux Foundation for some time; one result of that is the series of symposiums held in Japan over the last few years. Now, for the first time, the Foundation has extended this series to China. On February 19 and 20, the first Linux Developer Symposium China was held in Beijing. This event was organized in cooperation with the China Open Source Promotion Union (COPU). 
Your editor had the privilege of speaking at this meeting. This was not the kind of developer-oriented gathering that one might expect to find in many other parts of the world. Far too many suits and ties, for example. Often the focus of the event appeared to be the creation of photo opportunities while people (who were not developers) gave speeches. In general, it was organized in a mode of talking to the participants, rather than talking with them. The agenda makes this clear: 17 speakers on the first day, with only one break (for lunch). The talks were well received by a sellout crowd, but there was not a lot of opportunity for people to talk. The second day featured a round table discussion and a set of BOF sessions. The round table was interesting, though it focused on issues which are not necessarily development oriented: Linux adoption in mobile devices, competing with pirated copies of Windows, etc. The BOF was, in many ways, the most interesting part of the whole event; this was where participants could find people with similar interests and simply ask questions. Your editor fielded questions on security modules, the kevent interface, community participation in Asia, language issues, and more. Chinese developers, like their Japanese counterparts, seem to be reluctant to ask questions in front of a large group. But, in a closer situation, the floodgates open and all kinds of questions come out. Unfortunately, the second day was open only to a small subset of the conference attendees, and that subset was heavy on the managerial side. So a lot of people who could have benefited most from the BOF session were not there. One topic which never came up - until your editor raised it briefly at the round table session - was license compliance. For the most part, it does not seem to be on the radar there. 
Your editor was told that GPL violations are common with products which are sold in the Chinese market but not exported elsewhere; the people involved can assume, with seemingly good reason, that nobody will take them to court. There is also a fair amount of driver work being done for companies in other countries; once the code is shipped the original developers forget about it and move on to the next project. Quite a bit of that code never makes it into the mainline. This sort of activity fails to give back to the community which provided Linux in the first place. But it also hurts the developers involved. They do not become part of the community, do not get recognition for their work, and miss the opportunity to learn from others. During the press conference on the first day, it was noted that Chinese companies are having a hard time hiring Linux developers, and that more training opportunities would be a good thing. Your editor felt the need to point out that, of all the people working in free software projects, very few of them are specifically trained to do so. It's more a matter of individual initiative. Training is good, but the training received in Chinese universities should be more than adequate for those looking to get involved with free software. Andrew Morton took that theme further by pointing out that, rather than complaining about difficulties in hiring, these companies would be better off encouraging community participation and skills development within their existing staff. That would be more productive than chasing the same small set of developers that everybody else is trying to hire. On the second day, Dave Neary made the crucial point that community participation is something that individuals - not companies - do. There are a lot of companies worldwide which have a hard time understanding how free software development works, and China is no exception. One last note on hiring free software hackers. 
Your editor ran across this article, which states: In China, 43 per cent of IT graduates are unemployed, and hacker "training" web sites are creating a pool of effective malware authors and paying them like a legitimate business. In such a situation (assuming the claim is true - something your editor cannot vouch for), finding developers who are able and willing to learn how to hack on free software should not be that hard. Meanwhile, your editor was struck by the energy and initiative shown by the Beijing Linux Users Group, which helped with many aspects of the event. BLUG is busily organizing gatherings and creating a local community out of Beijing's hackers. A real spark is glowing there; it will be interesting to see how that group develops in the near future. All told, the event was a clear success. It was a proper media event which raised the profile of Linux in China and showed that Linux developers care enough about the country to pay a visit. A mixture of local and imported developers were able to present their work to an attentive and interested audience. The discussions brought developers closer and, hopefully, sent them away with interesting things on their "to do" lists. And, importantly, the visiting developers learned something about China that goes beyond the proper technique for eating Peking Duck or the effort required to climb the Great Wall (or to circumvent the rather obnoxious great firewall). With luck, we have a better understanding of what developers are up to in that part of the world and how we can help them to participate fully in our projects. And that can only be a good thing. (Some pictures from the event have been posted. Unbelievable numbers of photos were taken, so more can be expected to surface at some point. But, under no circumstances should anyone look at the scurrilous photo posted by Andrew Morton.) 
The state of Nouveau, part 2 [Editor's note: this is the second in a two-part series on the state of the Nouveau driver for NVIDIA hardware. The first installment is recommended reading for those who have not yet seen it.] Sources of information, and reverse engineering tools As very little information is available on NVidia's hardware design and implementation, the Nouveau project has developed a number of tools to gain a better understanding of card architecture and programming model. These tools, along with some previously available information, are what are used to create the driver. The Haiku/BeOS projects have a driver that came from a software development kit NVidia released for NV03/04 cards, and also gathered some information from an unobfuscated nv driver that appeared briefly in XFree86. This driver has improved mode-setting code compared to nv, and a basic 3D driver using hard-coded objects running in a single context. More information was available in the nvclock utility, which allows overclocking NVidia GPUs on Linux. Its lead developer Roderick Colenbrander (Thunderbird) has helped out Nouveau in the clock setup, i2c and tv-out areas. renouveau The first utility developed was called renouveau. renouveau is mainly concerned with reverse engineering the NVidia binary driver by black-boxing it, feeding it certain inputs and watching what it writes to the hardware. It runs a large batch of OpenGL tests which exercise most of the GPU's capabilities and generates a set of dump files which are sent to the Nouveau developers. The tool works by mapping the card registers and the FIFO assigned to the current application. It then records the current state of both FIFO and registers, executes small OpenGL tests, and compares the final state against the initial saved state. It then dumps this info, which can be parsed into a human readable form using an XML register/command database. (Some developers would argue the hex is readable to them). 
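The snapshot-and-compare step at the heart of this approach can be sketched as follows. This is not renouveau's code: the function names are invented, the register offsets and values are made up, and a plain dictionary stands in for the mapped card registers and FIFO.

```python
def diff_state(before, after):
    """Return {register: (old, new)} for every register whose value changed."""
    changed = {}
    for reg in sorted(set(before) | set(after)):
        old, new = before.get(reg), after.get(reg)
        if old != new:
            changed[reg] = (old, new)
    return changed


def run_probe(read_state, test):
    """Snapshot the state, run one small test, and report what it changed."""
    before = read_state()
    test()
    after = read_state()
    return diff_state(before, after)


# Stand-in for the card: hypothetical register offsets mapped to values.
fake_card = {0x100: 0x0, 0x104: 0x7}


def fake_test():
    # Pretend the driver reacted to a test by touching two registers.
    fake_card[0x100] = 0x1
    fake_card[0x108] = 0xCAFE


report = run_probe(lambda: dict(fake_card), fake_test)
```

Here report maps each touched offset to its old and new value, with None marking a register that had no recorded value before the test; translating offsets into names is the job the XML database does for the real tool.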
renouveau has advantages in that it can be run very simply by end users, on various card architectures, without requiring root privileges. It doesn't tamper with the binary driver, and does not require much technical knowledge. MMioTrace MMioTrace is a tool for tracing memory-mapped I/O (MMIO) access within the kernel. The NVidia driver contains a kernel module which is responsible for a lot of card initialization and mode setting. This activity cannot easily be traced by user-space tools such as renouveau. MMioTrace uses relayfs and debugfs to relay the tracing data to userspace. MMioTrace works by replacing the calls to the kernel's ioremap(), ioremap_nocache(), and iounmap() functions made by the driver that is to be probed with wrappers that call into MMioTrace. When the driver module in question calls ioremap() to access the MMIO registers, the pages are mapped as not-present in the kernel address space instead. MMioTrace can be set up to only trace address ranges which are likely to be touched by the driver you are interested in, thus reducing the number of useless MMIO accesses recorded. When the module then tries to access the register space, a page fault will occur. In the page fault handler the address is detected and the attempted action recorded. The page is then marked present and the page-faulting code is single-stepped to execute the instruction doing MMIO. After that the page is set to "not present" again so that the cycle can be restarted for the next access to the page. MMioTrace has some restrictions on tracing into the legacy ISA address range, as marking those pages not present crashes the kernel. A solution to this may be forthcoming but would require patching the kernel. MMioTrace is usable for all types of drivers running in the kernel, not just graphics drivers. It is not shipped with the kernel as of yet and was shipped as a working external module up to 2.6.23. 
However 2.6.24 has seen the removal of certain features that mean MMioTrace will need to be upstreamed for it to work with 2.6.25 or later kernels. If you are interested in more details, you should have a look at the MMioTrace page. valgrind-mmt Valgrind-mmt is a plugin for the valgrind debugging suite. It traces MMIO accesses from a user-space process (like the X.org server) where the NVidia DDX code is loaded. This was originally written by Dave Airlie for tracing ATI hardware and has since been extended by a number of other developers. It is used in Nouveau in a way similar to renouveau: to dump the contents of a FIFO. Valgrind-mmt allows reliably tracing the X.org FIFO, which is something renouveau cannot do very well. Tracing the X.org FIFO is sometimes required as it is the only way to see how some 2D features are implemented. Using MMioTrace to implement a new feature Commands are usually sent to the card by writing in the command FIFO, not by touching registers directly. But initialization of the card (including notably mode setting), as well as some other operations, are done via MMIO operations from within the kernel. Below is an example of how MMioTrace was used to reverse engineer the YV12 video overlay that is present in some NVidia cards. Video formats Videos are usually not encoded in the RGB colorspace. Most video codecs work in the YUV colorspace instead, where Y stands for luminance (black and white image), and U and V represent the chrominance (i.e. color). Since eye perception is higher for luminance, codecs usually drop a fraction of U and V samples in order to save space. When the card is asked through e.g. X-Video to display a video frame, it is passed a buffer containing YUV data, usually in YV12 or YUY2 format. 
FourCC.org can give you details about those formats, but for the purposes of this article, we will just say that YUY2 is a format that keeps one chrominance sample (alternately U or V) per luminance sample, thus giving "YUYVYUYV" to the card (16 bits per pixel), and YV12 is a format that keeps two chrominance samples (one U, one V) per 2x2 luminance block, which gives an effective 12 bits per pixel of video. YV12 is 25% smaller than YUY2 and is the format used by most popular codecs. Your author has yet to find any movie codec that does not output YV12 (or I420, which conceptually is the same - it just inverts the position of U and V in the buffer). Some months ago, Nouveau's Xv implementation was inherited from nv. Besides being extremely slow, nv supported only the YUY2 format, and converted YV12 input to YUY2 in software before uploading the data to the card. While working on improving performance, we quickly came to wonder if NVidia cards supported YV12 in hardware. Due to the 25% size reduction, this would naturally decrease the volume of bus transfers, which plays a very important role in Xv throughput, especially on PCI cards. We verified that by running performance tests on the NVidia binary driver, playing YV12 and YUY2 videos (using mplayer's -yuy2 option). Our performance tests consisted simply of mplayer's "benchmark" mode. The results were extremely clear: the operation required just over 20 seconds in YUY2 mode, and just over 15 seconds in YV12 mode. No need to reach for your calculator: it is a 25% difference, which matches the difference in data size exactly. The most obvious explanation is that the data is sent to the hardware in YV12 format. So the situation was: we had an Xv driver that handled YUY2 video only, we knew (or thought, with a high degree of confidence and hope) that the hardware supported YV12, but no existing driver like rivatv had code for it. Some reverse engineering had to take place. MMioTrace doesn't enter the arena just now, however.
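The size arithmetic behind that 25% figure can be checked with a short sketch. The 720x576 frame size is an assumption chosen for illustration (the article does not say what was played in the benchmark), and frame_bytes() is a hypothetical helper, not part of any driver.

```python
def frame_bytes(width, height, fmt):
    """Return the size in bytes of one video frame in the given YUV format."""
    if fmt == "YUY2":
        # Packed: one Y per pixel plus one chroma sample (U or V, alternating)
        # per pixel, i.e. 16 bits per pixel.
        return width * height * 2
    if fmt == "YV12":
        # Planar: a full-resolution Y plane plus U and V planes that each hold
        # one sample per 2x2 block, i.e. 12 bits per pixel on average.
        return width * height + 2 * (width // 2) * (height // 2)
    raise ValueError("unknown format: " + fmt)


# Assumed frame size, for illustration only.
WIDTH, HEIGHT = 720, 576

yuy2_size = frame_bytes(WIDTH, HEIGHT, "YUY2")
yv12_size = frame_bytes(WIDTH, HEIGHT, "YV12")
saving = 1 - yv12_size / yuy2_size

print(yuy2_size, yv12_size, saving)
```

For any frame with even dimensions the ratio is 12/16, so the saving is 25% regardless of the size picked here. That is the same ratio as the roughly 15 and 20 second benchmark times.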
As mentioned above, most of the time, commands are sent to the card by writing to the command FIFO, and not by touching registers. So we first checked the X command FIFO using valgrind-mmt and found some commands related to video. However, it quickly turned out that those were software methods, that is to say, dummy methods that make the card generate an interrupt asking for the kernel to handle it. It's somewhat similar to an ioctl() call into the kernel module, except that it's in sync with the FIFO. First lesson learned: video overlay setup is being done by the kernel module. We then MMioTraced the NVidia binary driver, playing YUY2 and YV12 video (same dimensions, window position, ... - the only thing that differed was the format), and compared the outputs. And among the 150 kilobytes of resulting data, we found (for YUY2 mode): While for YV12 mode: So here we had a different value being written into FORMAT, and three unknown registers. From a reading of existing documentation and code, it turned out that bit 0 of FORMAT was previously unknown to us. Next we tried to get the feature to work in our driver. We tried setting that bit without touching the three unknown registers, and got no video at all. So it had an effect, but we weren't sure if it really was the "YV12 format" bit. Looking further into the MMioTrace output showed that what was written into the three registers was in fact fairly similar to what was done for the image buffer setup, and we were able to make an educated guess at what was supposed to be written here. (It was the setup of the color buffer, while the "main" buffer was used for luminance data.) In the end, we got YV12 to work in Nouveau's Xv without converting to YUY2, which represented an increase in performance of (about) the expected 25%. MMioTrace enabled us to discover how the card needed to be programmed to do YV12 in hardware, which was apparently known by nobody outside of NVidia before.
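The comparison step described above (trace the same playback once per format, then look for register writes that differ) can be sketched as follows. Everything concrete here is hypothetical: the (operation, address, value) tuple layout is an assumed in-memory form rather than MMioTrace's log format, and the addresses and values are placeholders, not real NVidia register offsets.

```python
def written_values(trace):
    """Map each register address to the list of values written to it, in order."""
    writes = {}
    for op, addr, value in trace:
        if op == "W":  # reads are ignored for this comparison
            writes.setdefault(addr, []).append(value)
    return writes


def diff_writes(base, other):
    """Return (registers written only in other, registers written differently)."""
    base_w, other_w = written_values(base), written_values(other)
    new_regs = sorted(addr for addr in other_w if addr not in base_w)
    changed = {
        addr: (base_w[addr], other_w[addr])
        for addr in sorted(base_w)
        if addr in other_w and base_w[addr] != other_w[addr]
    }
    return new_regs, changed


# Placeholder traces: 0x100 stands in for the FORMAT register, and
# 0x200-0x208 stand in for the three registers seen only in YV12 mode.
yuy2_trace = [("W", 0x100, 0x10000), ("W", 0x104, 0xF0), ("R", 0x108, 0x1)]
yv12_trace = [
    ("W", 0x100, 0x10001), ("W", 0x104, 0xF0), ("R", 0x108, 0x1),
    ("W", 0x200, 0x1000), ("W", 0x204, 0x2000), ("W", 0x208, 0x400),
]

new_regs, changed = diff_writes(yuy2_trace, yv12_trace)
print([hex(addr) for addr in new_regs])
for addr, (old, new) in changed.items():
    # XOR shows which bits differ between the two modes.
    print(hex(addr), hex(old[0] ^ new[0]))
```

Keeping everything except the video format identical between the two runs is what makes such a diff small enough to read.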
The knowledge of how to program YV12 in hardware ended up in nv_video.c, in NVPutOverlayImage. It is interesting to note that MMioTrace simply records all register reads and writes - you can see almost everything that the kernel module does to the card. The downside to "almost everything" is that the saved data set gets large fast. Reducing the trace range and tracing only for short periods of time helps a bit, but after a few minutes of mmiotracing your logs will still reach the megabyte range. Sifting through those thousands of lines to find what one is looking for takes some getting used to. We used MMioTrace to reverse-engineer YV12 overlay, but we also used it to reverse-engineer a very large part of card initialization code and mode setting - and it will most certainly be useful for many other things that involve a kernel module. It is not limited to Nouveau, and is able to trace MMIO operations from any of your (binary) kernel modules, thereby allowing reverse-engineering of drivers for other hardware.

Current development in Unix graphics and its influence on Nouveau

We'll now take a peek into the future of 3D acceleration on Linux. 2007 saw a number of major changes in how Linux and X11 handle graphics. A lot of improvements are coming into use: EXA for 2D acceleration, TTM for memory management, Gallium3D for 3D, the new DRI2 interface, etc. All this needs driver-side changes, which can take some time to be done. With the advent of programmable graphics hardware, the old graphics driver model in Mesa became unsuitable. The current Mesa model is designed for cards which are based around OpenGL fixed-function operations. Fixed-function cards have hardware blocks designed for each part of the GL pipeline. The driver model for this requires each new piece of fixed functionality to call into the driver, which can get complex. This also causes a lot of code to be duplicated in each driver.
A new driver model, called Gallium3D, tries to simplify the driver interface and increase the amount of shared code. It is designed to cater for OpenGL 3.0's needs as well as current OpenGL and DirectX APIs. It is also designed to allow portable drivers across all major platforms/OSes. It assumes programmable graphics hardware with, at least, fragment shaders. Now that we know why the design was changed, let's have a look at the architecture of Gallium3D. Gallium3D splits the DRI driver into three major components: the common "state tracker", the OS-dependent "winsys" layer, and the hardware-specific 3D driver. The winsys is in charge of 2D action and most of the housekeeping and OS-specific bits, while the hardware driver does the 3D. Each driver needs to implement a hardware driver and a winsys part. If an existing driver gets ported to another OS, only the winsys part needs to be redone. There is also a fully working reference software 3D driver called softpipe. It is a software renderer showing the Gallium3D concepts and how to implement them, which also acts as a software fallback driver for things the hardware cannot handle. Another component of the new graphics subsystem is the TTM-based memory manager. TTM is a unified in-kernel manager for all GPU-accessible memory. Previous memory management was split between X drivers, mostly using static allocations. TTM was originally designed and implemented for Intel hardware, and had to be adapted to handle NVidia hardware and Nouveau's software design. The main feature added to TTM was called fence classing, which was required to support NVidia's multiple hardware contexts.

Current Status

When we shifted work from reverse engineering to driver development last year, we were asked when a driver would be ready. We predicted late 2007, but we only got part of the work done. Except for NV5x cards, we basically have a good-to-reasonably-well working 2D driver.
Releasing an official "2D" driver was considered but, at this point, the kernel interfaces are not considered stable enough to support for the long term. When a DRM kernel module is shipped in Linus's kernel, the interfaces are required to be supported indefinitely. This would be unwise for Nouveau as the interface is evolving to accommodate changes for TTM and mode setting, and supporting old interfaces may place hard-to-support requirements on newer ones. Currently, Nouveau can claim:

- Basic 2D rendering on all cards (through EXA).
- EXA composite (implementing the XRENDER extension) works via the 3D engine on all cards except NV5x and NV04. In the case of NV04, hardware limitations make a composite implementation difficult if not impossible. NV1x was just recently completed, which was a major feat as these cards only have two fixed function register combiners and no shaders.
- Xv from NV04 up to NV4x thanks to the work of Arthur Huillet. Depending on the hardware, it uses either the blitter (on NV4->NV4x), the overlay (on NV4->NV30) or a video texture (on NV40). Xv performance is on par with that of the nvidia binary driver on some cards.
- PPC support: at least some PPC based systems work. Most endian-based problems are solved thanks to the help of the PS3 RSX project and Ben Herrenschmidt. However, some systems are exhibiting DMA hangs when trying to do uploads to the card. The code is currently being audited and most of the PPC bugs have been fixed.
- xrandr 1.2 support is being worked on; basic mode setting should mostly work on NV3x, NV4x and NV5x cards. More sophisticated features, like dual head support, are actively being worked on and progress is fast.
- The Nouveau-specific DRM code has some preliminary work done for TTM; for example, we have one FIFO allocated for DRM use only. However, a fair amount of work is left until we have something really useful there.
- Ben Skeggs is working on a Gallium3D driver for NV4x and NV5x. This driver does work for NV4x but is neither feature complete nor bug free. NV5x does not work currently.
- Stephane is working on supporting shaderless cards with Gallium3D. That would be a generic framework which, in the case of NVidia cards, could support shader instructions on cards from NV04 up to, but not including, NV30. This framework is not specifically designed for NVidia cards but should help older ATI/Intel cards too.

The weak spot is currently the NV50. On these cards, 2D is working the same as nv but saving and restoring the console / virtual terminal state doesn't work. All that is nice and somewhat important to have, but I hear you ask "what about 3D"? The short answer is: We don't have 3D working. The longer answer is: NV5x doesn't work and needs more reverse engineering as a lot has changed from NV4x. For all other cards the needed information is available but there are many pieces in the puzzle to build a final driver. As a proof of concept, glxgears works on NV1x, NV3x and NV4x but with some glitches. However, work on the Mesa DRI driver has ceased in order to target Gallium3D. A somewhat working Gallium3D driver exists with many bugs and glitches. The NV4x is getting better every day but isn't usable for games yet. Gallium3D itself is still a work in progress and the same holds true for our Gallium3D driver. Currently, a fair amount of work is going on in the mode setting field, with Maarten Maathuis and Stuart Bennett enhancing this part of the code. This leads to RandR1.2 (dual head) support in Nouveau. Once this is done, we plan to move it into kernel land, following the other drivers. A kernel API has been defined for that purpose. Basically this API looks like a simplified RandR1.2 API, which should make porting easy. So what is coming next? This is only a rough outlook of what we want to do mid term:

- Finish 2D work, which includes mode setting and RandR1.2.
- More reverse engineering for NV5x cards.
- Implement TTM support.
- Implement Gallium3D drivers.
This one is obvious for the cards with shaders. However, as Gallium3D expects shaders, older cards are left out in the cold unless Stephane gets his framework working. In case the framework isn't feasible, a DRI driver for older cards may be the only option. By the way: If you are interested in more details, please have a look at our Wiki and TiNDC ("The Irregular Nouveau Development Companions") or join us in #nouveau on freenode (logs are available). So, to keep with tradition, let's have some screenshots. Here's a shot of Neverball running under the Nouveau driver: And OpenArena with a Nouveau Gallium3D build from January 2008 displays this: It seems the weapon is a bit too dark but otherwise we couldn't find obvious differences. Further information about Gallium3D can be found on the Tungsten Graphics site.

Conclusion

So that is our current status. Our roadmap shows that the next milestone would be Quake, which is not so far away on NV4x, but which has some problems to overcome on the other cards. Our first estimate of Autumn / Winter 2007 held up well for the 2D part but, as we detailed earlier, was somewhat delayed due to decisions out of our control like TTM and Gallium. However, the decision was the right one as Nouveau will be one of the most advanced and future proof drivers available. And finally: I would like to take this opportunity to thank Arthur Huillet, Ben Skeggs, David Airlie and Stephane Marchesin for their great help on this article. It definitely was a team effort!

Tracing memory-mapped I/O operations

Device drivers, in the end, usually do one thing: they communicate with the hardware by way of a set of memory-mapped I/O (MMIO) registers. So when one is trying to figure out what a driver is doing - for debugging purposes, perhaps - it is often interesting to look at the sequence of MMIO operations the driver performs.
If one is trying to reverse-engineer a driver which is available only in binary form, watching what is done with MMIO registers may be the only way to figure out how the hardware works. To this end, the developers behind the Nouveau project developed a tool called "mmiotrace" which helps them to watch what is going on with memory-mapped I/O. Now that tool is being fixed up and pushed toward the mainline. Drivers gain access to MMIO regions with ioremap() (or one of the higher-level functions like pci_iomap()), so that is the logical place to hook in a tracing infrastructure. So the current mmiotrace patch adds some new variants of ioremap():

These functions perform like ioremap() and ioremap_nocache(), in that they return an I/O memory pointer which can be used by the driver to get at MMIO space. What goes on internally, though, is quite different. On the x86 architecture (as with most others), I/O memory space is accessed with memory operations through the page tables in the usual way, so ioremap() just returns an address which maps onto the desired physical space. The tracing versions, though, take the extra step of marking the pages within the I/O region as not being present in the system; as a result, whenever code attempts to access that space, a page fault will be generated. Normally, page faults incurred when running in kernel mode will cause a kernel oops. There are exceptions, though; the functions which copy data between user and kernel space are one example. The mmiotrace patch adds another exception which tests faulting addresses against the MMIO region(s) being traced. Should the address indicate that an MMIO access is being attempted, the mmiotrace code will:

- Mark the relevant page as being present in memory.
- Set the TF (trace) bit in the faulting thread's processor state mask.
- Invoke a "pre" handler provided by higher-level tracing code.
- Indicate that the fault has been handled and return to the faulting code.
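As a rough illustration, the fault-handling steps above can be modeled in user space. This is a toy Python model, not kernel code: the TracedRegion class and all of its names are hypothetical, the present flag stands in for the page-table bit, trap_flag for the TF bit, and a plain list for the relay channel. The model also includes the single-step trap that follows the re-run instruction, at which point the page is hidden again.

```python
class TracedRegion:
    """Toy model of an MMIO region kept 'not present' so every access faults."""

    def __init__(self, size):
        self.registers = [0] * size  # stand-in for the device registers
        self.present = False         # the page starts out marked not present
        self.trap_flag = False       # models the TF (trace) bit
        self.pending = None          # access noted by the "pre" handler
        self.log = []                # stand-in for the relay channel

    def _page_fault(self, op, offset, value):
        # Fault path: mark the page present, arm single-stepping, and let
        # the "pre" handler note which access is about to happen.
        self.present = True
        self.trap_flag = True
        self.pending = (op, offset, value)

    def _single_step_trap(self, result):
        # Trap after the re-run instruction: hide the page again, clear the
        # trace bit, and let the "post" handler log the completed access.
        op, offset, value = self.pending
        self.present = False
        self.trap_flag = False
        self.pending = None
        self.log.append((op, offset, result if op == "R" else value))

    def write(self, offset, value):
        if not self.present:
            self._page_fault("W", offset, value)
        self.registers[offset] = value  # the re-run instruction succeeds
        if self.trap_flag:
            self._single_step_trap(value)

    def read(self, offset):
        if not self.present:
            self._page_fault("R", offset, None)
        value = self.registers[offset]  # the re-run instruction succeeds
        if self.trap_flag:
            self._single_step_trap(value)
        return value


region = TracedRegion(4)
region.write(1, 0xAB)
readback = region.read(1)
print(region.log)
```

Because the page is hidden again after every access, each read and write produces exactly one log entry, which is also why the logs grow so quickly.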
Once all this has happened, the instruction which originally caused the page fault will be rerun, successfully this time. But the setting of the trace bit will cause a new processor trap after that instruction has been executed. At that point, the page is marked unavailable once again, the trace bit is reset (assuming it wasn't set elsewhere), the tracing layer's "post" handler is called, and life continues as normal until the next fault happens. The tracing layer really only has one task: figure out what the code was trying to do in MMIO space and log the action by way of the relay interface. Figuring things out means learning enough about the instruction which caused the page fault to determine which address was being accessed, whether a read or write was being performed, the size of the data being transferred, and the actual value read or written. So there is a certain amount of architecture-specific instruction grubbing code involved, which, for the current patch, is only provided for x86 machines. Since tracing is enabled by calling a special version of ioremap(), it is not possible to trace a driver module without making changes to its source and rebuilding it. That might seem like a strange requirement for a tool meant to help with reverse engineering (among other things). The driver being studied by the Nouveau project uses a GPL-licensed shim to link into the kernel, so making modifications in that case was not a hard thing to do. A more general solution may eventually need to be found, though, for situations where that sort of glue layer is not present. Beyond that, this patch is likely to go through a number of changes before it finds its way into the mainline. Reviewers have found a number of things which need fixing, and there's a few too many places in the code where the comments say (literally) "if this happens, all hell breaks loose." It also seems likely that mmiotrace will be merged with the recently-posted ftrace tracing mechanism. 
There is time to get this work done before the 2.6.26 merge window opens, but the mmiotrace hackers will need to keep the work moving forward. Merging drivers early Drivers tend to be a world unto themselves, with bugs only affecting a subset—often a tiny subset—of kernel users. Until a driver gets merged into the kernel though, anyone wishing to test it, or help clean it up, has to jump through some hoops. To try and help reduce those barriers, Linus Torvalds and others have been advocating early merging of drivers; getting them into the kernel and incrementally improving them from there. This policy of early merging of drivers is not universally embraced, with a recent remote DMA (RDMA) ethernet driver, which lives in the infiniband tree, getting singled out. Based on the problems he observed in the driver, Adrian Bunk asked: "Is it really intended to merge drivers without _any_ kind of review?" This was, perhaps, an overly dramatic question as the driver has undergone review, but not all of the changes have been reflected in the mainline version. There is still work to do, as Infiniband maintainer Roland Dreier points out: Just to be clear, this driver was reviewed. Many issues were found, and many were fixed while others are being worked on. It's a judgment call when to merge things, but in this case given the good engagement from the vendor, I didn't see anything to be gained by delaying the merge. It is a sentiment shared by other kernel hackers as well. When there is a developer who is responding to the feedback along with a working driver, getting it into the mainline kernel—where more eyes can scrutinize it—is seen as a positive step. Torvalds is very interested in seeing drivers earlier so that more collaboration can happen: I'd really rather have the driver merged, and then *other* people can send patches! The thing is, that's what merging really means - people can work on it sanely together. 
Before it's merged, it's a lot harder for people to work on it unless they are really serious about that driver, so before merging, the janitorial kind of things seldom happen. Other maintainers explained their criteria for accepting drivers that are not quite up to usual kernel standards. The consensus seems to be that drivers with the following characteristics are acceptable:

- compiles and seems to work
- has no obvious security holes
- has an active maintainer
- does not affect people who don't have the hardware
- does not introduce unnecessary or not fully thought out user space interfaces

There is little in the way of a downside to making drivers available earlier. Since they are self-contained, they generally don't cause problems elsewhere in the kernel. As long as reviewers are keeping an eye out for security problems, which could lead to an unsuspecting user's box being compromised, there are not many ways for a driver to negatively impact the kernel as a whole. User space interfaces via ioctl(), sysfs, or other means also need to be closely examined as they will have to be maintained as part of the kernel interface. Along the way, much grumbling was heard about checkpatch, the perl script that complains about various stylistic problems with a patch. Notably absent from the list above is any kind of requirement that checkpatch errors or warnings be handled. The main complaint against checkpatch is its checks for line length; the resulting "fixes" to kernel source sometimes leave much to be desired. While it is generally agreed that too many overly long lines can result in code that is difficult to read, exactly what constitutes such a line tends to be an aesthetic judgment. Slavish adherence to a fixed number of characters on a line in order to appease checkpatch is clearly seen as a problem. To some, this makes checkpatch less than useful, bordering on dangerous to readability.
Torvalds stated that he has considered removing it from the kernel tree on more than one occasion. Human judgment is required to interpret the warnings from checkpatch and sometimes it is not being applied. On the other hand, Ingo Molnar gives an impassioned defense of the tool: Based on this first hand experience, my opinion about checkpatch has changed, rather radically: i now believe that checkpatch is almost as important to the long term health of our kernel development process as BitKeeper/Git turned out to be. If i had to stop using it today, it would be almost as bad of a step backwards to me as if we had to migrate the kernel source code control to CVS. Molnar goes on to outline the pros and cons of checkpatch, all of which stands in stark contrast to some of his earlier complaints about the tool. For most drivers, the path into the kernel has been made a lot easier. This will have the effect of getting working, or mostly working, drivers into the hands of users more quickly. More importantly, it will also get the code into the hands of the Linux kernel community faster. The likely result is a fully working, cleanly coded driver sooner than it might have happened in the past. An already quick turnaround for hardware support in Linux may have just gotten faster.

The Linux Desktop Testing Project reaches the 1.0.0 release

The Linux Desktop Testing Project is a cross-UNIX GUI testing framework. The project was started in 2005. In the Linux world, LDTP originally just supported the GNOME desktop environment. KDE support was planned from the beginning; this capability is now in place with the recently released KDE 4.0. In addition to operating with the two major Linux desktops, LDTP is being used by Mozilla and OpenOffice.org. From the LDTP home page: Linux Desktop (GUI Application) Testing Project (LDTP) is aimed at producing high quality test automation framework and cutting-edge tools that can be used to test Linux Desktop and improve it.
It uses the Accessibility libraries to poke through the application's user interface. The framework also has tools to record test-cases based on user-selection on the application. LDTP is a Linux / Unix GUI application testing tool. It runs on Linux / Solaris / FreeBSD / Embedded environment (Palm source). LWN looked at version 0.8 of LDTP last February; that article gives an overview of the software's operation. LDTP version 0.9.0 was released in August 2007; it featured new Firefox automation support and bug fixes. This week, version 1.0.0 was announced: This release features number of important breakthroughs in LDTP as well as in the field of Test Automation. This release note covers a brief introduction on LDTP followed by the list of new features and major bug fixes which makes this new version of LDTP the best of the breed. Useful references have been included at the end of this article for those who wish to hack / use LDTP. New features in this release include the Object Oriented LDTP, the LDTP Editor with record and replay functionality, major bug fixes and lots of work on the documentation. The Linux Desktop Testing Project is maturing and its scope is getting wider. LDTP can become an important tool for automated testing of GUI-based applications. With a bit of effort on the part of developers, LDTP can improve the quality of applications and speed up the testing of new releases.

Interoperating with Microsoft

Last week, with much fanfare, Microsoft announced a change in its practices in order to "expand interoperability". It is a rather sizable shift away from some of its previous inflammatory statements about free software (though it scrupulously avoids that term), but whether it is the harbinger of a more open Microsoft, or yet another empty pronouncement, is still unclear. It does contain things of interest to the community, in particular the patent enumeration, but there are pitfalls as well.
The largest chunk of what Microsoft promises is documentation for APIs and protocols used by some of its most popular products. The company immediately released some 30,000 pages of Windows protocol specifications, much of which the Samba project had to pay to access last December. In addition, it will be releasing documentation suitable for developers wishing to interoperate with "Windows Vista (including the .NET Framework), Windows Server 2008, SQL Server 2008, Office 2007, Exchange Server 2007, and Office SharePoint Server 2007, and future versions of all these products." Microsoft has also promised to list which of the documented protocols are covered by one of its patents or patent applications. We may finally start to get a handle on the infamous "235 patents" that Linux and free software supposedly infringe. These patents will be available for license on the standard "reasonable and non-discriminatory" (RAND) terms, with an interesting addition: "low royalty rates". The patent list is not yet available, but may be of use in ways that Microsoft does not intend; invalidating some of the patents with prior art, for example. As Microsoft is well aware, RAND terms are a non-starter for free software because they restrict redistribution of the code. The company has tried to soften that blow, perhaps, by rehashing its "covenant not to sue" developers that originated as part of the Novell interoperability agreement. The covenant may be a great public relations ploy, but does little to alleviate concerns that free software developers will have in implementing patented protocols. It is the rare developer who finds an itch to develop code to talk to Microsoft servers and who has no thought of using or distributing it commercially. There are also provisions in the announcement for documentation of Microsoft implementations of industry standards. A cynic might wonder why additional information is needed; they are, after all, supposed to be standards.
The unfortunate reality is that Microsoft does extend such standards for its own purposes in incompatible ways; having that kind of information can only help web browsers, directory services, and other multi-platform tools. For a company as adamantly opposed to Open Document Format (ODF) as it claims to be, it is a bit surprising to see that they plan changes to Microsoft Office to "promote user choice among document formats". APIs for document format plug-ins along with the ability for users to make their own choice about the default save format will be added. How reasonable those APIs are and how faithfully they can encapsulate Office documents will be an interesting test of both Microsoft's sincerity and ODF's capabilities. It is also a pretty clear attempt to at least appear to be playing nicely with ODF while its competing OOXML format is being considered for an ISO standard. There are also various platitudes about "opening dialogs" and "expanding outreach" with the community included in the announcement. It will be interesting to see how that actually plays out. It is, however, hard to imagine even a year ago seeing a posting on a Microsoft-sponsored site entitled "How open source has influenced Windows Server 2008". In less than seven years, we have moved from a "cancer" to influencing its flagship products. One obvious conclusion that can be drawn from this and other Microsoft initiatives is that it is feeling a fair amount of pressure from customers, the European Union, standards groups, and free software. These kinds of changes, even if they don't go as far as the rhetoric would lead one to believe, are a pretty substantial shift in Microsoft culture and thinking. Unfortunately, they do also seem to be angling for the long-sought "Linux tax"—a payment, even just a small one, for each and every Linux deployment. So far, Microsoft doesn't seem to have caught on to the idea that most Linux installations are free in both senses of the term. 
There is no per-installation, per-processor, per-core licensing stream to tap into. One of the headaches that free software users avoid is keeping track of all those licenses, enforced by the ever-present threat of a Business Software Alliance audit. Microsoft has, to a limited extent, already tapped into (and likely tapped out) that revenue from the deals with Novell and other distributors. Overall, this seems like a positive step. It clearly acknowledges the role that free software (or open source if you prefer) is playing in both the commercial marketplace and the marketplace of ideas. The actual effects of this announcement for our community may be small, but it may also be indicative of Microsoft moving in a more cooperative direction. That would be a rather nice thing to see.

A brief look at some distribution news

In the process of reading through a number of distribution mailing lists your editor encountered several items that seemed worthy of mention, but none that seemed to provide enough for a complete article. So the following will be a brief look at a variety of topics. The Fedora Bug Zappers subproject was recently announced on the fedora-devel mailing list. This is a team of people who triage bugs and act as a bridge between the users and developers. The team is meeting regularly, and new bug zappers are always welcome. Donnie Berkholz ran an informal survey that was answered by 50 Gentoo developers. The results have been graphed, one page per question. For example, the question "What are the top 3 issues facing Gentoo?" is here. "Developers' top 5 issues are manpower, publicity, goals, developer friction, and leadership." The pie chart shown on the previous page has been replaced by a bar chart. There are eight more questions that remain to be charted. The openSUSE project has been discussing the creation of a developer blog. Although other blogs exist, they tend to range off-topic.
This would be specifically a place to talk about development topics, such as new features in YaST. Posts would be tagged so that people who wanted to find more about YaST could find all entries with that tag. Ubuntu wants all users to be involved with bug squashing. Do 5 a day - every day!, says Daniel Holbach. What you can do? That's up to you, your interests and your abilities. - If you're a developer, you can help out reviewing patches and getting them uploaded. - If you want to just confirm new bugs, you can do that. - If you have experience with a certain package and want to triage bugs you can do that and forward them upstream if necessary. - If you know your way around Ubuntu quite well, you can help assign bugs to the right package. That's not a bad idea, regardless of your distribution of choice. Cascading security updates When following the distributions' security updates on a daily basis, as we do at LWN, certain days are more work than others. Two weeks ago we had a rather full update with no less than 28 packages updated for Fedora (most of those for both F7 and F8), along with a handful of updates from other distributions. It turns out that the majority of the Fedora updates had a single cause: a set of serious vulnerabilities in Mozilla Firefox. How does a single update to an application ripple so far that more than a dozen packages have to be rebuilt? One would think there would be shared libraries that would get updated, with applications picking up those changes the next time they are run. That is, in theory, how things are supposed to work, but in this case, the underlying libraries have no fixed application binary interface (ABI). So, changes to those libraries require any applications that use them to be rebuilt and retested. Gecko is the rendering engine used by Mozilla in their products to display HTML. Various other packages have started using it as well because of its speed and standards compliance. 
Because Mozilla sometimes breaks the ABI between releases, even minor releases, distributions may be stuck rebuilding those applications when a new version of the library is released. Normally, that only happens when packaging a new version of the distribution—or when serious security flaws are found. Mozilla's solution for this problem is XULRunner which will provide a stable ABI for applications. As XULRunner and its companion libxul become more widely available, the applications that currently link to the Gecko libraries will presumably switch to avoid these kinds of problems in the future. It is highly unlikely that we have seen the last security problem in the Gecko engine, so reducing the cascade that results from finding one would be welcome. Because of problems with the ABI changing in the past, Fedora chooses to make the applications' library version number exactly track the Mozilla release number. Some other distributions do not do that, so unless the ABI does change, they do not need to update each package that uses the libraries. This has some advantages, but could lead to broken applications if an ABI change goes unnoticed. We have also seen similar cascades of updates, most notably from the xpdf PDF viewer. Unlike Gecko, there is no library for xpdf, leading multiple applications to include its source into their own. When a flaw is found, several different applications (cups, gpdf, etc.) across all distributions need to be updated immediately, leading to a similar effect as was seen with the Gecko vulnerabilities. Hopefully, over time, the development of the poppler library will mitigate this problem somewhat. There are lots of good reasons to separate code into components where possible, but security is an important one. Creating and maintaining an ABI is sometimes difficult, but generally worth the trouble. Imagine the chaos that could result from a security vulnerability requiring an ABI change in glibc. 
An object debugging infrastructure Thomas Gleixner has discovered that being the maintainer of a core kernel infrastructure module can bring some special challenges. Whenever somebody's kernel oopses in the timer code, for example, Thomas tends to hear about it. The only problem is that the timer code is almost never where the bug is. Instead, it's far more likely that some other kernel subsystem has corrupted an active timer, leaving a bomb that will only explode later, in the timer code, when that timer is set to expire. At that point, it can be hard to figure out where the real problem is, as the culprit will be long gone. In response, Thomas developed some special-purpose code aimed at finding the real source of timer-related problems, preferably before it brings down the kernel. He has now generalized that code and posted it as the object debugging infrastructure patch, which was subsequently significantly revised. As this code develops, it has the potential to help find whole classes of especially difficult bugs before they bring the system down. There are a few steps involved in adding support for object debugging to a new subsystem. The first is to create and populate a debug_obj_descr structure (defined in <linux/debugobjects.h>). The name field of that structure is the name of the subsystem; it is used in debugging output. We will return to the other fields below. The next step is to call into the object debugging code whenever an action of interest involves one of the tracked objects; there is a set of functions used for this purpose. In each case, addr is a pointer to the object being operated on, and descr is a pointer to the debug_obj_descr structure mentioned above. The meaning of each call is: debug_object_init(): the object is being initialized. debug_object_activate(): it is being added to a subsystem list. For timer debugging, this action happens when add_timer() is called. debug_object_deactivate(): the object is being removed from a subsystem list.
debug_object_destroy(): the object is being destroyed and is no longer referenced within the subsystem. This call is not used in the version 2 patch set. debug_object_free(): the object is being freed. The debugging code maintains a hashed set of lists for tracking objects; each object is added to the appropriate list when one of the above calls is made. As actions are performed on the objects, their state is tracked. In this way, the debugging code is able to test for a number of common mistakes, including deactivating an object which is not active, reinitializing active objects, or adding objects twice. When something goes wrong, a backtrace is sent to the system logs. Since this backtrace identifies where the original error is made, it is likely to be far more useful than the trace associated with the system crash which will probably come later. But this infrastructure can also help to make that crash less likely, in that each subsystem can register a set of "fixup functions." These, of course, are all the methods in the debug_obj_descr structure which we glossed over above. For example, if a call to debug_object_init() is made with an object which has already been activated, the debugging infrastructure will respond with a call to the fixup_init() callback, passing in the object in question and its current state (ODEBUG_STATE_ACTIVE in this case). The callback should return zero if it is able to, somehow, repair the damage. Even if things cannot be truly fixed, though, there is still use for this function; the timer code, for example, will disable an active timer if the calling code mishandles it. The kernel will almost certainly not operate as expected, but, at least, it has a smaller chance of crashing at some random time in the future. Most debugging checks are performed in response to calls from within the subsystem itself. 
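The state tracking and fixup mechanism described above can be illustrated with a small user-space model. This is only a sketch in Python, not the kernel code: the class, the state names other than "active", and the rule that a successful fixup lets the reinitialization proceed are all assumptions made for illustration. The one convention taken from the text is that a fixup callback returns zero when it has repaired the damage.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the article only names the "active" state.
    NONE = auto()
    INIT = auto()
    ACTIVE = auto()
    INACTIVE = auto()


class ObjectTracker:
    """Toy model of per-object state tracking with fixup callbacks."""

    def __init__(self, name, fixup_init=None):
        self.name = name
        self.fixups = {"init": fixup_init}
        self.states = {}    # object -> State
        self.warnings = []  # stands in for backtraces sent to the system log

    def _complain(self, action, obj, state):
        self.warnings.append(f"{self.name}: {action} on {state.name} object {obj!r}")
        fixup = self.fixups.get(action)
        # As in the article, a fixup returns zero if it repaired the damage.
        return fixup is not None and fixup(obj, state) == 0

    def init(self, obj):
        state = self.states.get(obj, State.NONE)
        if state is State.ACTIVE and not self._complain("init", obj, state):
            return  # not repaired: leave the tracked state alone
        self.states[obj] = State.INIT

    def activate(self, obj):
        state = self.states.get(obj, State.NONE)
        if state is State.ACTIVE:
            self._complain("activate", obj, state)  # added twice
            return
        self.states[obj] = State.ACTIVE

    def deactivate(self, obj):
        state = self.states.get(obj, State.NONE)
        if state is not State.ACTIVE:
            self._complain("deactivate", obj, state)
            return
        self.states[obj] = State.INACTIVE


disabled = []


def disable_timer(obj, state):
    """Hypothetical fixup: 'disable' the mishandled timer and report success."""
    disabled.append(obj)
    return 0


tracker = ObjectTracker("timer", fixup_init=disable_timer)
tracker.init("t1")
tracker.activate("t1")
tracker.init("t1")        # bug: reinitializing an active object
tracker.deactivate("t2")  # bug: deactivating an object which is not active
for line in tracker.warnings:
    print(line)
```

Each mistake is reported at the point where it is made, which is the property that makes the resulting trace more useful than the one from a later crash.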
There is one useful check which cannot be done that way, though: detecting the freeing of objects which are still under some sort of subsystem management. To catch that mistake, Thomas's patch inserts a hook into functions like kfree() and free_hot_cold_page(). Every time an object is freed, the code checks through the appropriate list to see if it is still seen as being active in some subsystem. Freeing an object which is still known to a subsystem is almost always a bug - one which can be hard to track down later on. The check on freed memory objects is clearly a useful debugging tool. It could also have a nontrivial overhead, though, since it requires searching a list every time some memory is freed. So it has its own configuration option and can be configured out of the kernel, even if the rest of the debugging code is built in. At this point, only the timer subsystem is covered by this infrastructure, but there are plenty of other obvious candidates. Perhaps at the top of the list would be kobjects, which are famously susceptible to all kinds of programming mistakes. So expect to see the coverage of this code grow in the near future. Ryzom returns? Toward the end of 2006, a company called Nevrax went out of business. Nevrax was the operator of an online multiplayer game called Ryzom which had developed a dedicated (if insufficiently lucrative) following. A group of free software developers, former Nevrax employees, and assorted Ryzom players sensed an opportunity here: perhaps the source for Ryzom could be obtained from the failing company and turned into free software. It seemed like a winning solution for all sides: Nevrax's creditors could get whatever money could be raised for the code, Ryzom players would continue to have a game, and the free software community would get an extensive new code base. All that was needed was to convince the relevant bankruptcy court that this was a good idea. 
To that end, the Free Ryzom project raised some €170,000 in pledges - an impressive amount of money. The Free Software Foundation offered $60,000 toward this goal. But, in court, another suitor (Gameforge) won out with a plan to keep the game proprietary. The Free Ryzom folks became the Virtual Citizenship Association and faded from view; it seemed that this story was done. Only it seems it's not done. In February, the project sent out a news update on what had been happening over the past year. It seems that Gameforge stopped paying its employees in June, 2007, and, by August, was not paying its creditors. In October, Gameforge France went back into the bankruptcy process; then, last February, the Ryzom servers were shut down. This particular plan to save Ryzom, it seems, was not as successful as one might have liked. So it seems that the Ryzom source might, once again, be up for grabs. A news update suggests that the process is moving quickly, but the project could make a try for the code if it is able to come up with a large (at least €230,000) bid in the immediate future. As of this writing, the Free Ryzom folks are examining their options and trying to come to a decision on the best course to take. There can be no doubt that this code would be a valuable acquisition. Despite the fact that some of the very first multiplayer online games were free software (consider Netrek, for example, which occupied rather too much of your editor's time some 15 years ago, or some of the early MUD and MOO systems), free software does not have much to offer in that area now. The lack of competitive offerings in this area is one of the biggest motivations for people to use Windows. A free Ryzom could be a strong step toward better online gaming with free software.
That said, one has to wonder why we, the larger free software community, seem to be unable to put together a competitive game without relying on a huge infusion of source from the proprietary world. There are certainly projects out there; consider Battle for Wesnoth or WorldForge, for example. Wesnoth is an addictive game with basic multiplayer capability and an active developer community, but it is a turn-by-turn game with relatively rudimentary graphics - though the graphics and soundtracks are quite nice by free software standards. WorldForge has high ambitions and a lot of infrastructure, but it never really seems to get out of that pre-alpha state. A look at WorldForge's CVS logs suggests that very few developers are actively contributing to the project. There are critics of the free software community who would argue that gaming is the sort of program that free software just cannot do as well as proprietary software. A certain amount of planning and direction is required to pull together a coherent virtual world, quite a bit of artistic work (artwork, sounds, etc.) is required, and so on; a project without a business-based revenue stream just cannot compete in this area. There might be some truth to this claim - but not that much. When one looks at all that we have accomplished, it does not seem like an online multiplayer game - challenging though it might be - should be beyond our capabilities. What seems more likely is that we just haven't gotten the project management right yet. Anybody who has hung around with people who are interested in computing knows that game playing is certainly an itch that many feel the need to scratch. We just haven't yet made it easy enough for that scratching to happen. What's needed is a relatively simple core upon which people can easily create virtual worlds.
It should be straightforward for people who are not developers - artists, musicians, script writers - to contribute to the system, and their contributions should be made welcome. The desktop projects have had a certain amount of success in bringing in non-developer contributors; a look at how they have done that could be worthwhile. Arguably, we should have most of the pieces we need. Battle for Wesnoth has shown that it's possible to put together a community which goes beyond just software developers. WorldForge seems to have a good start on some important pieces of infrastructure. There may be some useful code to be had from the Second Life client, which has been free for a year now. We are a large and talented community, we certainly have the ability to do something interesting in this area. It should not be necessary to wait until we get a code dump from a dead proprietary software company. The rest of the vmsplice() exploit story Back in February, LWN published a discussion of the vmsplice() exploit which showed how the failure to check permissions for a read operation led to a buffer overflow within the kernel. Subsequently, a linux-kernel reader pointed out that the article stopped short of a complete explanation: this is not an ordinary buffer overflow exploit. Travel schedules and such prevented the writing of an immediate followup, but your editor would still like to tell the full story. So this article picks up where the last one left off and describes how the vmsplice() exploit makes use of this buffer overflow to take over the system. When vmsplice() is being used to feed data from memory into a pipe, the function charged with making it all happen is vmsplice_to_pipe(), found in fs/splice.c. It declares a couple of arrays of interest: PIPE_BUFFERS, remember, is 16 on exploitable configurations. 
Both of these arrays are passed into get_iovec_page_array(), which, as described in the previous article, makes a call to get_user_pages() to fill in the pages array. As a result of the failure to check whether the calling application is allowed to read the requested region of memory, get_user_pages() will overflow the pages array, writing far more than PIPE_BUFFERS pointers into it. These are, however, pointers to legitimate kernel data structures; it remains to be seen how this overflow enables the attacker to take control of the system. The partial array is also passed into get_iovec_page_array(); it describes the portion of each page which should be written into the pipe. To that end, a loop is run immediately after returning from get_user_pages(), storing an offset and a length in partial for each mapped page. Since full pages are being written in this case, the calculated offset will be zero, and the length will be PAGE_SIZE (4096). The loop's bound, error, is the return value from get_user_pages(); that will be the number of pages actually mapped: 46, in the case of the exploit. Remember that the partial array is also dimensioned to hold 16 entries, so this loop will overflow that array as well. Both of these arrays are declared, one right after the other, in vmsplice_to_pipe(). A quick test by your editor suggests that the partial array will be placed below pages in memory, so, once partial is overflowed, the loop will start overwriting pages instead. So the pages array will end up containing alternating values of zero and 4096 rather than the real struct page pointers it had before. (It's worth noting that the exploit still works if the arrays are placed in the opposite order, since the overflow causes code down the line to think that pages is larger than it really is). Once all this has happened, control returns to vmsplice_to_pipe() - the overflow is not big enough to have overwritten the return address. A call to splice_to_pipe() is supposed to finish the job, but something interesting happens there.
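The double overflow described above can be reproduced with a toy model of the stack. This Python sketch is not kernel code, and its layout is an assumption made for illustration: each partial entry is taken to be two words (offset, len), each pages entry one word, with partial sitting immediately below pages. The constants 16, 46 and 4096 come from the text.

```python
PIPE_BUFFERS = 16        # declared size of both arrays
PAGE_SIZE = 4096
MAPPED_PAGES = 46        # return value of get_user_pages() in the exploit

WORDS_PER_PARTIAL = 2    # assumed layout: (offset, len)
PARTIAL_BASE = 0
PAGES_BASE = PARTIAL_BASE + PIPE_BUFFERS * WORDS_PER_PARTIAL


def simulate_overflow():
    """Return the first PIPE_BUFFERS slots of 'pages' after both loops run."""
    # A flat list of words stands in for the stack frame; it is sized so
    # that the model itself never runs off the end.
    stack = ["unused"] * (MAPPED_PAGES * WORDS_PER_PARTIAL)

    # First overflow: 46 page pointers written into an array sized for 16.
    for i in range(MAPPED_PAGES):
        stack[PAGES_BASE + i] = f"struct page #{i}"

    # Second overflow: the loop fills in an (offset, len) pair per page,
    # running past the end of 'partial' and into 'pages'.
    for i in range(MAPPED_PAGES):
        stack[PARTIAL_BASE + i * WORDS_PER_PARTIAL] = 0
        stack[PARTIAL_BASE + i * WORDS_PER_PARTIAL + 1] = PAGE_SIZE

    return stack[PAGES_BASE:PAGES_BASE + PIPE_BUFFERS]


pages = simulate_overflow()
print(pages)
```

Under these assumptions, every slot of pages ends up holding either zero or 4096 in place of a real pointer, matching the alternating pattern the article describes.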
Toward the beginning of this function, this test is made: Looking back at the exploit code, we see that it closes the read side of the pipe before calling vmsplice(). So splice_to_pipe() will quit almost immediately. On its way out, however, it does this: The call to get_user_pages() will have locked each of the relevant pages into memory to allow the kernel to work with them; this is the cleanup code which goes back and unlocks the pages which will not be used. But remember that the pointers in the pages array have been overwritten, and are now either zero or 4096. What would normally happen here is a kernel oops, since those are not legitimate addresses. The exploit code has done something tricky, though: using some special mmap() calls, it has created some anonymous memory at the bottom of its address space. Directly dereferencing user-space addresses while running in kernel mode is frowned upon for a number of reasons; it can blow up in a number of ways. But, if the address is valid and the relevant page is resident in memory, direct access to user-space memory will work. So, when the kernel starts to work with the addresses that it thinks are struct page pointers, it does not get any sort of fault; instead, it gets the data placed in that memory by the exploit. Needless to say, that data has been arranged carefully. The Linux kernel normally manages each page as an independent object. There are times, however, when pages are grouped into larger units, called "compound pages." This generally happens when physically contiguous allocations larger than one page are needed by the kernel; when this happens, a compound page is passed back to the caller. These pages are special in that they must be split back apart when they are released back into the system, and there may be other cleanup work to do. So compound pages have an attribute not found on normal pages: a destructor which is called when the page is freed. 
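The release path for such pages can be sketched in a few lines. This is a simplified Python model rather than the kernel's implementation; the class and field names are hypothetical, and the real kernel stores the destructor elsewhere in its page structures. It shows only the rule that matters here: dropping the last reference to a compound page invokes its destructor.

```python
class Page:
    """Hypothetical stand-in for a kernel page structure."""

    def __init__(self, count, compound=False, destructor=None):
        self.count = count            # reference count
        self.compound = compound      # is this the head of a compound page?
        self.destructor = destructor  # only meaningful for compound pages


def release_page(page):
    """Drop one reference; return True if the page was freed."""
    page.count -= 1
    if page.count > 0:
        return False                  # somebody else still holds a reference
    if page.compound and page.destructor is not None:
        page.destructor(page)         # whatever function the page names runs here
    return True


calls = []
normal = Page(count=1)
big = Page(count=2, compound=True, destructor=lambda p: calls.append("destructor"))

freed_normal = release_page(normal)   # freed, no destructor involved
first = release_page(big)             # still referenced, nothing happens
second = release_page(big)            # last reference: destructor is called
print(calls)
```

The point to notice is that the release code trusts whatever function the page structure names, which is harmless only as long as the structure itself is trustworthy.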
So consider how the exploit sets up its low-memory page structures. When the kernel looks for a page structure at user-space address zero, it will find something which looks like a compound page. The destructor (stored in the lru.next field of the second page structure) is set to kernel_code(), a function defined within the exploit itself. Since the count is set to one, the call to page_cache_release() (which decrements that count) will conclude that there are no further references and, since the page looks like a compound page, the destructor will be called. At this point, the exploit has arbitrary code running in kernel mode, and the show is truly over. This code just sets the process's uid to zero (giving it root access), then engages in some assembly-language trickery to return immediately to user space, shorting out the rest of the cleanup process. There are a couple of interesting implications from all of this. One, clearly, is that this exploit is not something which was bashed out by a script kiddie somewhere. It was written by somebody who understands low-level kernel code quite well and who is able to use that understanding to escalate an apparent information-disclosure vulnerability into a full code execution problem. It is, clearly, a mistake to underestimate those who write exploits, not all of whom immediately make their works known to the development community. One also should not assume that they have not already written exploits for other, still unfixed bugs. Also worth noting is the fact that ordinary buffer overflow protection may well not have been effective against this vulnerability. The return address on the stack was not overwritten, and no exploit code was put in data areas. This episode has caused a renewed interest in technical security measures in the kernel. These measures are good, but it would be a mistake to think that they will fix the problem.
What is really needed is stronger review of patches with security in mind; it is not yet clear to your editor that this review is happening. The GNOME Foundation launches an accessibility outreach program The GNOME Foundation has announced a new outreach program for the GNOME accessibility project: The GNOME Foundation is running an accessibility outreach program, offering US$50,000 to be split among individuals. This program will promote software accessibility awareness among the GNOME and broader Free Software communities, as well as harden and improve the overall quality of the GNOME accessibility offering. The program is sponsored by GNOME Foundation, Mozilla Foundation, Google's Open Source Program Office, Canonical, and Novell. Applications were opened for review starting on March 1, the project closes on December 31. Acceptance of long-term tasks closes on October 1, short-term task acceptance closes on December 15. The goal of the program is to work on improving shortcomings in the existing GNOME accessibility system. There is an aim to increase awareness of accessibility-related issues, encourage developers to work on accessibility issues and generally improve accessibility in free software. From the project announcement: "There will be two tracks to the program: In the first track accepted individuals will work towards accomplishing one of the major projects nominated for the program, earning US$6,000 and can take up to six months to complete the task. The second track will reward contributors US$1,000 for fixing five bugs out of a pool of accessibility bugs nominated by the program judges." The program rules explain the contract that the developers will work under, the process of claiming tasks, the judging process and more. A list of tasks has been announced: "Are you a developer who wants to become more familiar with accessibility? Are you an artist that can draw? Maybe you might also be interested in becoming a module maintainer some day. 
A great way to get started is by fixing bugs, and we're offering you a way to get paid to do it. :-)" The list of long-term tasks includes: Writing and updating accessibility documentation. Improving accessibility support in the Evince document viewer. Adding and improving GNOME magnification support. Building an accessibility testing framework. Adding new participant-defined accessibility projects. Developers who need some income and are willing to improve availability of GNOME to all should consider taking on a task. NDISwrapper dodges another bullet Hardware compatibility has long been a problem for Linux—though it has gotten much better over the years—so it will be surprising to some to see a kernel change that will make some hardware cease working. For others, who follow kernel development a bit more closely, it will come as no great surprise that NDISwrapper was disabled by a change made to the kernel back in January. NDISwrapper has never been very popular with kernel hackers, but, because it is GPL licensed and allows more hardware to be used, there are folks on both sides of the argument. For a while, it looked like NDISwrapper had lost that argument, but the 2.6.25-rc4 release restores the functionality it requires. NDISwrapper is a kernel module that is used to load Windows-only drivers into Linux. For some hardware, notably wireless network cards, it is the only way to support them because the manufacturer provides neither specifications nor a working Linux driver. Unfortunately, many of these cards are installed in laptops where it is difficult or impossible to replace them with Linux-friendly alternatives. This is what led to implementing the Network Device Interface Specification (NDIS) for Linux. NDIS is an ancient—it was originally developed by Microsoft and 3Com for MS-DOS in the mid to late 1980s—interface for networking devices, which is still in use today. 
The NDISwrapper code has been around since 2003, but always as a separate module that must be built by the user (or distribution) and loaded into the kernel. It is not part of the mainline kernel, nor will it ever be; maintaining a glue layer that allows proprietary, closed-source drivers to be linked into the kernel is not high on anyone's list. But NDISwrapper is GPL-licensed. Its code is available for inspection or modification by all, so that is not the problem; it is the intent that matters. When a binary-only driver (the NVidia video driver, for example) is loaded into the kernel, a "taint" flag is set, indicating that the kernel is tainted by code that cannot be examined. Bug reports for tainted kernels are routinely ignored, unless they can be reproduced in an untainted kernel. Life, it seems, is too short to try to diagnose problems that could easily have been created by a buggy driver that cannot be debugged. Originally, the taint flag was just a means to detect and ignore those bug reports, but over time it has become part of a mechanism to restrict which symbols a module can access. Some kernel symbols are considered so integral that any module using them must be a derivative work. Therefore, modules that want to use them must be GPL-licensed. Modules declare their license using the MODULE_LICENSE macro, while symbols are exported using either EXPORT_SYMBOL or EXPORT_SYMBOL_GPL. Any module that doesn't have a compatible license doesn't get access to the GPL-only symbols. Few would argue for a GPL module which existed to re-export all of the GPL-only symbols to non-GPL modules. But that is not what NDISwrapper does; instead it implements NDIS, but in order to do that, it needs access to GPL-only symbols, mostly for USB and workqueue interfaces. It would be hard to contend that NDIS drivers are derivative of the Linux kernel; they were written for an entirely different system using an interface that predates Linux.
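The symbol-access rule described above amounts to a simple lookup. The following Python sketch is a model of the idea only, not of the kernel's module loader: the symbol names are invented, and the set of accepted license strings is an illustrative assumption rather than the kernel's actual list.

```python
# Assumed, illustrative set of license strings treated as GPL-compatible.
GPL_COMPATIBLE = {"GPL", "GPL v2", "Dual BSD/GPL"}

# Hypothetical export table: symbol name -> exported as GPL-only?
EXPORTS = {
    "demo_plain_helper": False,  # think EXPORT_SYMBOL
    "demo_core_helper": True,    # think EXPORT_SYMBOL_GPL
}


def resolve(symbol, module_license):
    """Return the symbol name if a module with this license may use it."""
    if symbol not in EXPORTS:
        raise LookupError(f"unknown symbol {symbol}")
    if EXPORTS[symbol] and module_license not in GPL_COMPATIBLE:
        raise PermissionError(f"{symbol} is GPL-only")
    return symbol


def can_use(symbol, module_license):
    """Convenience wrapper returning True or False instead of raising."""
    try:
        resolve(symbol, module_license)
        return True
    except PermissionError:
        return False


print(can_use("demo_core_helper", "GPL"))           # GPL module, GPL-only symbol
print(can_use("demo_core_helper", "Proprietary"))   # refused
print(can_use("demo_plain_helper", "Proprietary"))  # ordinary export is allowed
```

In this model, the only thing separating a working module from a broken one is whether the loader regards it as GPL-compatible, which is why a change in how a module is classified can break it without any change to the module's own code.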
This is why NDISwrapper developers and users think that an exception should be made for it. Clearly the Windows drivers taint the kernel, but accessing a subset of the GPL-only functionality through NDISwrapper should be allowed, they argue. Since NDISwrapper itself is GPL, the normal module loading rules would allow it to access GPL-only symbols, except that an explicit check for NDISwrapper was added to the 2.6.16 kernel. The question, then, revolves around what should be done when the kernel detects it being loaded. NDISwrapper has always been careful to mark the drivers that it loads as tainted, but the recent patch marks the module itself as tainted, disallowing access to the GPL-only symbols and breaking NDISwrapper. Absent that patch, only the kernel is marked as tainted—the module itself is not. A similar situation occurred back in October 2006, which LWN covered on the Kernel page, when a stricter interpretation of tainting started to be enforced. At that point, NDISwrapper stopped working and it looked like it might stay that way, until Andrew Morton stepped in with objections to breaking NDISwrapper with no warning. Shortly thereafter, a patch was merged that only marked the kernel as tainted when NDISwrapper is loaded. At that point, the issue fell by the wayside, until now. Part of the problem is that marking a symbol as GPL-only means different things to different developers. For some, it is a means to warn proprietary driver developers that they are straying into territory that makes distribution of their drivers very likely to be a violation of the GPL, while others want to use it to completely eliminate binary-only kernel drivers. There is no policy that clearly delineates which interpretation is "correct". Meanwhile, NDISwrapper has been in use by many for four years or more; breaking it now, with little or no warning, is likely to create some very unhappy users. 
Linus Torvalds clearly thinks there are no licensing issues with NDISwrapper: Quite frankly, my position on this has always been that the GPLv2 explicitly covers _derived_ works only, and that very obviously a Windows driver isn't a derived work of the kernel. So as far as I'm concerned, ndiswrapper may be distasteful from a technical and support angle, but not against the license. Jon Masters, the author of the patch that inadvertently made this change, had an excellent suggestion that should be pursued to try and reduce these kinds of problems in the future: Since we've brought it up, one good thing I would like to see come of this perhaps is a clearer understanding of what the kernel should and should not be doing in terms of "license compliance enforcement". We have had lots of talk, but perhaps a "policy" document is worthwhile. Another interesting battle will be that surrounding exporting init_mm() which was removed in early versions of 2.6.25, but then restored in 2.6.25-rc4. It is fairly clearly a low-level kernel interface that is unused by any in-tree driver, so its export was removed. One rather glaring exception is that the out-of-tree NVidia binary drivers do use it. Its export has been restored for one more development cycle, but it is clearly seen as something that should not be touched by drivers. It could be quite a struggle between the developers and users of a very popular driver and the kernel hackers that don't want to see kernel API abuse. Issues surrounding the GPL are always contentious on linux-kernel; this one is no different. While NDISwrapper is an out-of-tree driver, it has hardly been invisible, so complaints when it breaks should come as no surprise. A simple renaming will avoid the current kernel check, so breaking it that way will mostly be an annoyance to users rather than a real barrier to its use. 
Since there is no real consensus amongst kernel hackers on the binary driver issue, it is hard to see one emerging with regard to NDISwrapper, but that would be the best outcome. One way or another, it needs to be decided; NDISwrapper shouldn't come under a periodic threat of breaking. If it is determined to be a violation of the kernel interfaces, that should be clearly indicated and its users should be given some warning so they can find alternatives. File monitoring with Mortadelo and SystemTap SystemTap is a tool to help gather information about running Linux systems which has been available for some time now. But applications that use the tool have been few and far between. Mortadelo is a GUI tool that uses SystemTap to observe and record system calls. It is more of a proof-of-concept than a complete application - though it is useful in its current form - but it does start to show some of the things that can be done using SystemTap. Mortadelo specifically intercepts system calls that deal with accessing files, collecting the arguments to the calls as well as the return codes. It is patterned after the Windows Filemon program, which is used in much the same way that a Linux user might use strace, only with a GUI. Problems with permissions or files that do not exist are the kinds of things that Mortadelo could be used to diagnose. The data collected is displayed in a list in the GUI (shown at left), which can then be filtered using regular expressions to pull out the information of interest. Because it uses SystemTap, Mortadelo gathers information from all running processes at once, allowing the user to choose which parts they are interested in. The filtering is somewhat primitive, in that particular fields cannot be chosen to filter on, but still useful because it searches each entry fully. System calls that return an error are highlighted in red, making it easy to pick them out.
By choosing appropriate strings to filter on, all permission errors in the system or every access of a particular filename can be seen. The GUI allows one to start and stop the recording as well as to save the captured data to a file. Each entry includes a timestamp, the process name and pid, the system call, return code, and arguments. The application is written in C#, using the Mono framework; one of the authors has an interesting weblog entry comparing Mono and Python for developing this kind of tool. Mortadelo's interface to SystemTap is fairly straightforward: it spawns a stap command and sends it the probe points and code via stdin. It then reads the stap output, parsing it and displaying it in the window. There were some tricks to getting it to build and run, but Eugene Teo's instructions for running it on Fedora 8 were quite helpful. Part of the problem was in getting SystemTap going on the system, which is a problem we have mentioned before. There were some other small hurdles as well, but Teo's hints and proper application of grep were enough to get past those. Mortadelo's impact isn't so much in the application itself as it is in some of the ideas behind it. Using SystemTap for GUI tools will help users and administrators, especially those who are not command-line savvy. If Mortadelo, or some descendant of it, becomes popular, that will help make SystemTap use more widespread. Distributors will start packaging it in more readily usable forms, perhaps installing it by default. That will in turn help anyone tasked with keeping a Linux system smoothly functioning, whether they are GUI-centric or not. Realtime adaptive locks The realtime patchset has one overriding goal: provide deterministic response times in all situations. To that end, much work has been done to eliminate places in the kernel which can be the source of excessive latencies; quite a bit of that work has been merged into the mainline over the last two years or so.
One of the biggest remaining out-of-tree components is the sleeping spinlock code. Sleeping spinlocks have advantages and disadvantages. A recently posted set of patches has the potential to significantly reduce one of the biggest disadvantages of the realtime spinlock code. Mainline spinlocks work by repeatedly polling a lock variable until it becomes available. This busy-waiting code thus "spins" while waiting for a lock. Spinlocks are quite fast, but they can also be a source of significant latencies: a processor which is holding a lock can delay others for indefinite amounts of time. In the mainline kernel, it is also not possible to preempt a thread which holds a spinlock - another source of latencies. (See this article for a more detailed description of the mainline spinlock implementation). The realtime patch set addresses this problem in a couple of ways. One of those is to cause threads waiting for a contended lock to sleep rather than spin. As a result, lock contention cannot create latencies on processors which are not holding the lock. When spinning is removed, it is also possible to make code preemptible even when it holds a lock without causing deadlock problems. That allows a high-priority process to run regardless of any lower-priority processes which might currently hold locks on the current CPU. Finally, the realtime patch set has added priority awareness and priority inheritance to the locking code to ensure that the highest-priority process is always able to run. This is all good stuff, but there is one little disadvantage: the extra overhead imposed by the more complicated locks can reduce system throughput considerably. This is a cost that the realtime developers have been willing to pay; it is often necessary to make trade-offs between throughput and latency. 
Recently, though, some developers at Novell have come to the conclusion that the throughput cost of the realtime patch set need not be as severe as it currently is; the resulting adaptive realtime locks patch brings the throughput of the realtime kernel to a level much closer to that found in the mainline - at least, for some workloads. The core observation encapsulated in this patch set is that hold times for spinlocks tend to be quite short, especially in the realtime kernel. So the cost of putting a waiting thread to sleep may well exceed the cost of simply busy-waiting until the lock becomes free. So adaptive locks behave more like their mainline counterpart and simply spin until the lock becomes available. There are some twists, though, which are necessitated by the realtime system:

- The spinning cannot go on forever, since it may cause unacceptable latencies elsewhere in the system. So an adaptive lock will only spin up to a configurable number of times (the default is 10,000) before giving up and going to sleep.

- Since lock holders are preemptible in the realtime kernel, it is possible that the thread which currently holds the lock was previously running on the same CPU as the process trying to acquire the lock. In that situation, spinning for the lock is clearly a bad thing to do. In the absence of a loop counter, it would be a hard deadlock situation; with the counter, it would just be an unnecessary delay. Either way, the result is undesirable, so, if the lock owner is running on the same processor, the thread waiting for the lock simply goes to sleep.

- If the lock owner is, instead, itself sleeping while waiting for something, there is little point in having another thread stay awake in the hope that the owner will release the lock soon. So, in this case too, a thread contending for a lock will simply go to sleep rather than spin.

One other throughput improvement is obtained by changing the lock-stealing code.
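The three spin-or-sleep rules described above amount to a small decision procedure. The following is a Python model of that logic, not kernel code: the Owner record, should_spin(), and wait_for_lock() are hypothetical names, and the only figure taken from the article is the default limit of 10,000 spins.

```python
from dataclasses import dataclass

DEFAULT_SPIN_LIMIT = 10_000  # default loop count quoted in the article


@dataclass
class Owner:
    cpu: int        # CPU the lock owner last ran on
    sleeping: bool  # True if the owner is itself blocked on something


def should_spin(owner, waiter_cpu, spins_so_far, limit=DEFAULT_SPIN_LIMIT):
    """Decide whether a waiter keeps busy-waiting or goes to sleep."""
    if spins_so_far >= limit:
        return False  # spinning has gone on too long
    if owner.cpu == waiter_cpu:
        return False  # the owner cannot make progress while we spin on its CPU
    if owner.sleeping:
        return False  # the owner will not release the lock soon
    return True


def wait_for_lock(owner, waiter_cpu, release_after, limit=DEFAULT_SPIN_LIMIT):
    """Simulate a waiter; the lock becomes free after `release_after` polls.

    Returns ("acquired", polls) or ("sleep", polls).
    """
    spins = 0
    while should_spin(owner, waiter_cpu, spins, limit):
        if spins >= release_after:
            return ("acquired", spins)
        spins += 1
    return ("sleep", spins)
```

A short hold time on another CPU ends in "acquired" without a context switch; the other three cases fall through to "sleep", which is where the realtime kernel's existing sleeping-lock path would take over.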
Locks in the realtime system are normally fair, in that threads waiting for a lock will get it in first-come-first-served order. A higher-priority process will jump the queue, however, and "steal" the lock from lower-priority processes which have been waiting for longer. The adaptive locks patch tweaks this algorithm by allowing a running process to steal a lock from another, equal-priority process which is sleeping. This change adds some unfairness to the locking code, but it allows the system to avoid a context switch and keep a running, cache-warm process going. Some benchmark results [PDF] have been posted. On the test system, the dbench benchmark runs at about 1500 MB/s on a stock 2.6.24 system, but at just under 170 MB/s on a system with the realtime patches applied. The adaptive lock patch raises that number back to over 700 MB/s - still far from a mainline system, but much better than before. The improvement in hackbench results is even better, while the change in the all-important "build the kernel" benchmark is small (but still positive). A fundamental patch like this will require quite a bit of review and testing before it might be accepted. But the initial results suggest that adaptive locks might be a big win for the realtime patch set. A draft proposal for Fedora spins This week Jeff Spaleta posted a draft proposal for a spin submission and approval process. For those interested in creating officially approved Fedora spins, it is worth a look. Anyone can create a Fedora spin for their personal use. Just create a kickstart file to install the packages you want. There are various ways of doing this, but the Anaconda kickstart is probably the most common. This kickstart file tells the Anaconda installer what packages you want, and you have your own Fedora spin. This draft is about creating official spins that will be listed at the Fedora Project Spins Tracker, and available for interested users to get the official Fedora spin of their choice. 
However, there does need to be a way to cleanly distinguish between Released Spins and Contributed Spins. What will it take to create an official Fedora spin according to this proposal? The first step is to get a kickstart file into the Kickstart Pool, where the file will be reviewed and tested by a peer group of Spin Maintainers. If the peer group approves, then the spin proposal goes to the board for review. If the Fedora Board approves the spin, it will be granted trademark usage and from there it can be added to the Fedora CVS. A number of steps need to be completed for this plan to work. First is the creation of Spin Guidelines. The guidelines will specify a minimum level of technical quality for kickstart files, and contain a naming scheme for new spins. The not-yet-formed peer group of Spin Maintainers will have some say in these Guidelines, although the release engineering team will probably create the first draft. There is a long way to go to get a straightforward way for a Fedora Special Interest Group (or anyone else) to get a spin approved, but such things always have to start somewhere. Authentication bypass in routers An authentication bypass vulnerability is one of the more dangerous problems that a web application can have. It allows the attacker to perform some action that the application designer saw fit to restrict to authenticated users without providing said authentication. Using these techniques, an attacker can control a targeted web application from afar without even wasting time cracking bad passwords - a dream scenario for such people. If an authentication bypass is found in the latest social networking site, the flaw could cause embarrassment, but if that bypass is in your home router, much worse things could result. A series of articles over at GNUCITIZEN highlights quite a variety of authentication bypass flaws in various embedded devices including routers.
The flaws come from their research and recent router hacking challenge, which challenged readers to find holes in their routers. (There is no table of contents for the series, so here are links to the four installments: 1, 2, 3, and 4). Most authentication bypass flaws are caused by a conceptual mistake made by web programmers: believing that the "normal" way of accessing the site is the only way to access it. This manifests itself as applications that check for particular URLs to see if they require credentials without considering the possibility of aliasing. For example, web servers will generally ignore double-slashes in a URL, but if the application checks for /privileged/page and gets /privileged//page it may very well fall prey to an authentication bypass. Other similar schemes can be used to make the URL look different, but arrive at the same place. A far uglier possibility is applications that believe you can only get to a particular URL via a page that enforces authentication. This is a belief in "security through obscurity"; that attackers won't be able to guess the URLs for the pages "behind" the authentication screen. This is almost comical in that there are many ways to find out what those URLs are, not least by buying the device and accessing them yourself. Pages that require authentication need to check that the credentials have been provided whenever the page is accessed—without regard for what URL got them there. Some applications do all of the checking correctly on the pages that show various settings in a form allowing them to be changed, but the action of the form submits it to a different program. Inexplicably, sometimes that program does not check for credentials. Perhaps the programmer believes that web forms can only be submitted from the page that they have created, but it is trivially easy to generate an HTTP POST with the appropriate parameters. 
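The double-slash alias mentioned above is easy to reproduce. The sketch below contrasts a naive string comparison with a check that reduces the URL to one canonical path first; the protected path is the article's own example, while the function names and the router.example host are made up for illustration.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit

PROTECTED = {"/privileged/page"}


def naive_needs_auth(url):
    """Flawed check: compares the raw path string against the protected list."""
    return urlsplit(url).path in PROTECTED


def canonical_path(url):
    """Reduce a URL to one canonical path before any access decision."""
    path = unquote(urlsplit(url).path)  # decode %70 and friends
    path = re.sub(r"/+", "/", path)     # collapse repeated slashes
    return posixpath.normpath(path)     # resolve "." and ".." segments


def needs_auth(url):
    """Same decision as the naive check, but made on the canonical path."""
    return canonical_path(url) in PROTECTED


aliases = [
    "http://router.example/privileged//page",
    "http://router.example/privileged/./page",
    "http://router.example/public/../privileged/page",
    "http://router.example/%70rivileged/page",
]
```

Every entry in aliases slips past naive_needs_auth() yet is caught by needs_auth(). Normalizing is still only a partial defense: it does nothing for a form handler that never checks credentials at all, so verifying them inside every privileged handler remains the more robust design.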
It certainly does no good to protect the current value of settings from non-authenticated users if they can easily change them to any values they want. In terms of web security, authentication bypass is usually quite easy to avoid; it is a matter of ensuring valid credentials anywhere they are required. Before performing any action that requires a logged-in user, check the cookie (or other persistent authentication mechanism) for validity to perform the action requested. For people using routers at home, perhaps the best advice is to make sure the administrative interface is not internet facing. Routers have a pretty bad track record of getting this right, so far, as the hacking challenge and other research have shown. GCC 4.3.0 exposes a kernel bug A change to GCC for a recent release coupled with a kernel bug has created a messy situation, with possible security implications. GCC changed some assumptions about x86 processor flags, in accordance with the ABI standard, that can lead to memory corruption for programs built with GCC 4.3.0. No one has come up with a way to exploit the flaw, at least not yet, but it clearly is a problem that needs to be addressed. The problem revolves around the x86 direction flag (DF), which governs whether block memory operations operate forward through memory or backwards. The main use for the flag is to support overlapping memory copies, where working backwards through memory may be required so that the data being copied does not get overwritten as the copy progresses. Debian hacker Aurélien Jarno reported the problem, which was found when building Steel Bank Common Lisp (SBCL) using the new compiler, to linux-kernel on March 5th. GCC's most recent release, 4.3.0, assumes that the direction flag has been cleared (i.e. memory operations go in a forward direction) at the entry of each function, as is specified by the ABI (which is, somewhat amusingly, found at sco.com [PDF]).
Unfortunately, this clashes with Linux signal handlers, which get called, incorrectly, with the flag in whatever state it was in when the signal occurred. This has the effect of leaking one bit of state from the user space process that was running when the signal occurred to the signal handler, which could be in another process. That, in itself, is a bug, seemingly with fairly minimal impact. Prior to 4.3, GCC would emit a cld (clear direction flag) opcode before doing inline string or memory operations, so those operations would start from a known state. In 4.3, GCC relies on the ABI mandate that the direction flag is cleared before entry to a function, which means that the kernel needs to arrange that before calling a signal handler. It currently doesn't, but a small patch fixes that. The window of vulnerability is small, but was observed in SBCL. The sequence of events that would lead to memory corruption is as follows:

1. a user space program does an operation (memmove() for example) that sets DF
2. a signal occurs for some process
3. the kernel calls the signal handler
4. the signal handler does a memmove() in what it thinks is a forward direction
5. the memory is copied in the reverse direction, leading to corruption

It is hard to see how that could be turned into a security breach, but it would be a mistake to assume that it can't. Other kernel bugs, like the one that allowed the recent vmsplice() exploit, have looked like memory corruption, but were found to be more than that. The DF issue may turn out to be harmless from a security standpoint, but it should not be assumed. So, now the question is: what to do about it? It is clear that the kernel should not leak the DF state to signal handlers, regardless of what GCC does. It is interesting to note that this behavior is the same (DF is not cleared on entry to a signal handler) on BSD kernels, leading some to claim that it is the ABI that is incorrect and that GCC should revert to its old behavior.
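The corruption sequence described above can be modeled in a few lines. This is a Python simulation rather than real x86 code: rep_movsb() imitates how the processor steps its source and destination pointers up or down depending on DF, and handler_copy() stands in for a hypothetical signal handler compiled on the assumption that DF is clear on entry. The buffer contents and offsets are invented for the demonstration.

```python
def rep_movsb(mem, dst, src, count, df):
    """Model of x86 `rep movsb`: copy `count` bytes, moving both pointers
    by +1 after each byte if DF is clear, or by -1 if DF is set."""
    step = -1 if df else 1
    for _ in range(count):
        mem[dst] = mem[src]
        dst += step
        src += step
    return mem


def handler_copy(mem, df_on_entry):
    """Hypothetical signal handler that expects DF to be clear: it means to
    copy the 4 bytes at offset 8 into the 4 bytes at offset 4."""
    return rep_movsb(mem, dst=4, src=8, count=4, df=df_on_entry)


# Layout: "keep" must not be touched, "XXXX" is the target, "data" the source.
good = handler_copy(bytearray(b"keepXXXXdata"), df_on_entry=False)
bad = handler_copy(bytearray(b"keepXXXXdata"), df_on_entry=True)
```

With DF clear, good ends up as b"keepdatadata". With a stale DF inherited from the interrupted code, the copy walks downward from the starting addresses, so bad becomes b"kXXXdXXXdata": only one byte of the target is written and the unrelated "keep" region is overwritten. An explicit cld before the copy (the pre-4.3 GCC behavior) or a kernel that clears DF before entering the handler both restore the first result.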
Solaris kernels do clear the DF before calling signal handlers. This problem has existed for 15 years; GCC has always emitted code that worked correctly on kernels that did not follow the ABI, until now. Part of the problem is that there is an enormous number of installed kernels that are vulnerable to this problem, but only if GCC 4.3 is installed. That version of GCC is not, yet, in widespread use, so the thinking is that GCC should revert its behavior now, before it gets into distributions. As kernels with the fix become more widespread, the "proper" behavior could be restored. The GCC folks don't necessarily see it that way, so it is unclear what will happen. While it is true that distributors can control what kernel version and GCC version they ship, those aren't the only ways that either GCC or GCC-compiled binaries get installed. It is a bit of a ticking time bomb for random memory corruption at a minimum. Handling those bug reports will be very difficult and time consuming. While the new behavior of GCC is correct, and the kernel is broken, it would be very helpful to back out this change, perhaps providing the new behavior via a command-line argument for those who are sure their binaries will be running on patched kernels. Some discussion on the gcc-devel list would indicate that a GCC 4.3.0.1 or 4.3.1 may be forthcoming. Still waiting for Flash Those of us who were using Linux full-time around the turn of the century will remember that the state of web browsing on Linux was a little scary then. The only real option available was the binary-only Netscape 4 client; it was buggy and old. It really seemed like the web was going to move forward without Linux, and that there was not a whole lot we could do about it. Things have improved somewhat on that front; we now have a few top-quality web browsers to choose between.
At the same time, though, one might be forgiven for thinking that we are heading back into a similar situation, but involving Flash this time around. For all practical purposes, there is only one viable option for Flash on Linux: the binary-only plugin provided by Adobe. But that plugin is not just proprietary software; it also is somewhat old and buggy, and there is nothing we can do to fix it. For an increasing part of the web experience, we still have a second-rate, proprietary platform. When one thinks of Flash, naturally, one thinks of video sites like YouTube. But there is more to the Flash experience than silly videos and obnoxious advertising. Some parts of Google are heavily into flash, as can be seen from that company's finance sites or analytics offerings. Your editor's children will attest that there's no end of game sites which require Flash, and for which the Linux plugin fails to work properly. Looking for any way to reduce the total amount of time spent in airplane seats, your editor recently investigated "around the world" tickets; that search ended up at this travel planning site which, of course, requires Flash. And so on. Like it or not, Flash is the language in which an increasing number of interactive sites are being coded, and Linux does not have proper support for it. With this in mind, your editor decided to give the recently-announced Gnash 0.8.2 release a try. This release was billed as the first beta version of Gnash, so there was reason to hope that it would be something close to a true solution to the Flash problem. In reality, Gnash is a step in the right direction, but the Flash issue will be with us for some time yet. For now, the acid test for a Flash player would appear to be YouTube, so that is the first place your editor went. The experience there was mixed. It is, in fact, possible to watch YouTube videos using the Gnash Firefox plugin. Hearing them is another matter, though; they all played silently. 
It would not be surprising to learn that getting audio is a matter of filling in a missing codec - but it would sure be nice if the software were to say something to that effect. Pausing and playing the video worked, but skipping around in it did not. Playing videos from other sites was uniformly unsuccessful. The "around the world" calculator appeared to load properly, but then took off as if somebody were punching all of its buttons at once. Charts on Google sites are uniformly blank. Some flash games mostly worked; others showed more input-related confusion. Few of them were truly playable. On the other hand, Flash "intros" and advertisements mostly work as intended - just what your editor wanted. So Gnash is not really there yet. In truth, this software is not in a condition where the use of the term "beta" makes sense; there is a lot of work yet to be done. There are few of us clamoring for support for more obnoxious advertising - especially among the LWN readership, as your plentiful emails over the last couple of months have made clear. What we want is working support for the useful Flash applications out there - and there are a few of those at this point. Gnash does not, currently, provide that support. (Your editor also tried out Swfdec 0.6.0, with generally worse results). That said, it is clear that a lot of work has been done to get Gnash to this point. Your editor has no real way to judge how much more is required to get full support for even Flash version 7; chances are it is not a small job. Needless to say, support for newer versions of Flash will require even more work. But there now appears to be a solid platform upon which that work can be done, and that is an important start. Gnash has the look of a project which has overcome some of the biggest initial hurdles and is now setting a pace to finish the job.
With luck, it will have reached the point where the fact that it almost works will inspire new developers to come in and fill in the remaining pieces. Adobe has the ability to make this job a lot easier. Your editor has heard, informally, that the company has taken a less hostile position toward the Gnash developers than it had in the past, but it certainly is still not helping them. The Flash specifications are not available to anybody trying to create a Flash player, and, unsurprisingly, the Flash EULA forbids any sort of reverse engineering. That EULA, incidentally, also forbids running Adobe's player on any "non-PC device," including tablets and phones. That restriction suggests that Adobe sees business opportunities in the lack of a free Flash player for such systems and intends to ensure that this scarcity continues. So, despite the occasionally friendly noises Adobe has been making toward the Linux community, we should not expect a great deal of help from that direction. Someday, people will figure out that closed standards (like Flash) are best avoided. Meanwhile, Flash is a fact of life that we will need to deal with. It appears that we are getting closer to being able to deal with it - but we are not there yet. A better DMA memory allocator As any device driver author knows, hardware can be a pain sometimes. In the early days of Linux, peripherals attached to the ISA bus inflicted their particular variety of pain by being unable to use more than 24 bits to access memory. What that meant, in practical terms, was that ISA devices could not perform DMA operations on memory above 16MB. The PCI bus lifted that restriction, but, for some time, there were quite a few "PCI" devices that were minimally modified ISA peripherals; many of those retained the 16MB limit. To handle the needs of these devices, Linux has long maintained the DMA memory zone. Drivers which need to allocate memory from that zone would specify GFP_DMA in their allocation requests. 
The memory management code takes special care to keep memory in that zone available so that DMA requests can be satisfied. In this way, the system can provide reasonable assurance that memory will be available to perform DMA in ways which meet the special needs of this particularly challenged hardware. The only problem is that there aren't a whole lot of devices out there which still have the old 24-bit addressing limitation. So the DMA zone tends to sit idle. Meanwhile, there are devices with other sorts of limitations. Many peripherals only handle 32-bit addresses, so their DMA buffers must be allocated in the bottom 4GB of memory. There is a subset, however, with stranger limitations - 30 or 31-bit addresses, for example. The kernel's DMA library provides a way for drivers to disclose that sort of embarrassing limitation, but the memory management code does not really help the DMA layer make allocations which satisfy those constraints. So drivers for such devices must use the DMA zone (which may not be present on all architectures), or hope that normal zone memory fits the bill. Andi Kleen has set out to clean up this situation with a new DMA memory allocator. His solution is to take a chunk of memory out of the kernel's buddy allocator entirely and manage it in an entirely different way, forming a reserve pool for DMA allocations. The result is a bit of a departure from normal Linux memory management algorithms, but it may well be better suited to the task at hand. The new "mask" allocator grabs a configurable chunk of low memory at boot time. Allocations from this region are made with a separate set of calls, with the core API being: alloc_pages_mask() looks a lot like the longstanding alloc_pages() function, but there are some important differences. The size parameter is the desired size of the allocation, rather than the "order" value used by alloc_pages(), and mask describes the range of usable addresses for this allocation.
Though mask looks like a bitmask, it is really better understood as the maximum address value that the allocated memory may have; "holes" in the mask would make no sense. A call to alloc_pages_mask() will first attempt to allocate the requested memory using the normal Linux memory allocator, on the assumption that the reserved DMA memory is an especially limited resource. If the allocation fails, perhaps because there's no physically-contiguous chunk of sufficient size available, then the allocator will dip into the reserved DMA pool. If the normal allocation succeeds, though, the allocated memory must still be tested against the maximum allowable address: the normal memory allocator, remember, has no support for allocating below an arbitrary address. So if the returned memory is out of bounds, it must be immediately freed and the reserved pool will be used instead. That reserved pool is not managed like the rest of memory. Rather than the buddy lists maintained by the slab allocator, the DMA allocator has a simple bitmap describing which pages are available. It will normally cycle through the entire memory region, allocating the next available chunk of sufficient size. If that chunk is above the memory limit, though, the allocator will move back to the lower end of the reserved pool and allocate from there instead. Since DMA allocations tend to be short-lived, one would expect that a suitable block of memory would either be available or become available in the near future. One other difference of note is that, unlike the slab allocator, the DMA allocator does not round memory allocation sizes up to the next power of two. DMA allocations can be relatively large, so that rounding can result in significant internal fragmentation and memory waste. At the next level up, Andi has added a new form of mempool which uses the DMA allocator: This pool will behave like normal mempools, with the exception that all allocations will be below the limit passed in as mask.
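The allocation strategy described above (try normal memory first, fall back to a bitmap-managed reserve that cycles through the pool and wraps to the low end) can be illustrated with a toy model. This is a Python sketch, not the patch's code: MaskPool and alloc_with_mask() are hypothetical names, page frame numbers stand in for addresses, and the "normal" allocator is passed in as a pair of callables.

```python
class MaskPool:
    """Toy model of a reserved pool tracked by a per-page bitmap."""

    def __init__(self, base_pfn, npages):
        self.base = base_pfn
        self.used = [False] * npages
        self.cursor = 0  # where the next search starts (next-fit)

    def _find(self, start, npages, max_pfn):
        """Index of the first run of `npages` free pages at or after `start`
        that lies entirely at or below max_pfn, or None."""
        run = 0
        for i in range(start, len(self.used)):
            if self.base + i > max_pfn:
                break
            run = 0 if self.used[i] else run + 1
            if run == npages:
                return i - npages + 1
        return None

    def alloc(self, npages, max_pfn):
        """Allocate exactly `npages` pages (no power-of-two rounding)."""
        idx = self._find(self.cursor, npages, max_pfn)
        if idx is None:
            idx = self._find(0, npages, max_pfn)  # wrap to the low end
        if idx is None:
            return None
        for i in range(idx, idx + npages):
            self.used[i] = True
        self.cursor = idx + npages
        return self.base + idx

    def free(self, pfn, npages):
        for i in range(pfn - self.base, pfn - self.base + npages):
            self.used[i] = False


def alloc_with_mask(normal_alloc, normal_free, pool, npages, max_pfn):
    """Prefer normal memory; use the reserved pool only when needed."""
    pfn = normal_alloc(npages)
    if pfn is not None:
        if pfn + npages - 1 <= max_pfn:
            return pfn
        normal_free(pfn, npages)  # out of bounds: give it straight back
    return pool.alloc(npages, max_pfn)
```

The model keeps the two properties the text emphasizes: the reserve is touched only when the ordinary allocator cannot satisfy the address limit, and a request for three pages consumes three pages rather than four.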
These pools are used in the block layer, where memory allocations for DMA must succeed. One might object that reserving a big chunk of low memory for this purpose reduces the total amount of memory available to the system - especially if the DMA allocator is cherry-picking normal memory whenever it can anyway. But the cost is not as bad as one might think. These patches do away with the old DMA zone, which, for all practical purposes, was already managed as a reserved (and often unused) memory area. Some 64-bit architectures also set aside a significant chunk (around 64MB) of low memory for the swiotlb - essentially a set of bounce buffers used for impedance matching between high memory (>4GB) buffers and devices which cannot handle more than 32-bit addresses. With Andi's patch set, the swiotlb, too, makes allocations from the DMA area and no longer has its own dedicated memory pool. So the total amount of memory set aside for I/O will not change very much; it could, in fact, get smaller. For most driver authors, there will be little in the way of required changes if this patch set gets merged. The DMA layer already allows drivers to specify an address mask with dma_set_mask(); with the DMA allocator in place, that mask will be better observed. The one change which might affect a few drivers is further down the line: eventually the GFP_DMA memory allocation flag will go away. Any driver which still uses this flag should set a proper mask instead. So far, there has been little discussion resulting from the posting of these patches. Silence does not mean assent, of course, but it would appear that there is little opposition to this set of changes. Some topics related to MP3 players In many parts of the world, the U.S. is looked upon as a place with particularly poor taste in "intellectual property" legislation; the DMCA and software patents are often held up as examples. 
DMCA-like laws have since spread to other parts of the planet, which, for some reason, has not made people living there any more appreciative of the American legal regime. But it is often pointed out that software patents remain an almost entirely American problem; people in other parts of the world (Europe, say) need not worry about them. If only it were so. On March 5, German police raided a booth at the CeBit conference in Hannover. That booth, run by Meizu, contained an iPhone-clone product, but nobody cared about that. Instead, the contraband which absolutely had to be suppressed was a music player for which Sisvel (an Italian company which has done this kind of thing before) had not been paid royalties on its MP3 patents. The player, as it happens, did not even have MP3 playback capability, but that didn't seem to matter. The police duly cleared the booth of all mention of the offending device and saved another day for free enterprise. This is a pure software patent action, and the U.S. has no part in it. Software patents are truly a global problem. (Police raids raise the stakes in an interesting way, though; even in the U.S., things usually start with a polite letter from a lawyer first). Anybody who wonders why companies like Red Hat exercise great care around software patents (and MP3 patents in particular) need only look at episodes like this. The selling of enterprise Linux products is likely to be distinctly harder if your prospective customers see your conference booth being forcibly shut down by the authorities. Meanwhile, it occurred to your editor, while thinking about music players, that little has been said about the Rockbox project on LWN in recent times. Rockbox, remember, is a GPL-licensed firmware which runs on a wide variety of music players. It offers a wider range of features, has more codecs, is more customizable, and has better accessibility support than the stock firmware on any of these devices. And it's free software.
Since LWN last looked at this project, the Rockbox developers have added a number of new features and new platforms. The abandoned 3.0 release has never happened; the Rockbox developers appear to have given up on the idea of formal releases for now. The daily snapshots generally work quite well, though, and there are lots of satisfied Rockbox users out there. The only problem is: it's not clear how many more such users may arrive in the future. Despite the fact that Rockbox supports a lot of players, absolutely none of the supported platforms are currently in production. So anybody looking to buy a player which can run Rockbox must go digging around on auction sites. Many Rockbox users do exactly that, but many more potential users would rather not get their devices that way. Rockbox ports to current devices are underway, but the developers are fighting an uphill battle. Manufacturers tend to be uncooperative when it comes to releasing hardware information, so a certain amount of reverse engineering is required. And, by the time that work is done, the manufacturers have moved on to a new product. Music players are consumer electronics devices, and, like most such devices, their product lifetime tends to be quite short. So developers on a project like Rockbox will forever be trying to catch up. Your editor, meanwhile, still lugs around his ancient iRiver H340. People look at it strangely, as if they expect there to be a hatch on the back so that the user can occasionally add another shovel full of coal. But it works beautifully with Rockbox, and a replacement looks hard to find.
Your editor wishes that at least one manufacturer would realize that it could provide better functionality at a lower cost by designing its players to run Rockbox from the beginning. Perhaps the project needs better advocacy within the player industry. There is another approach which could be considered here. The OpenMoko project is trying to rearrange the mobile telephone market by offering a completely open product. Surely a music player, being a much simpler device, would be amenable to the same treatment? As it turns out, there are a couple of groups of people trying to jump-start just this kind of effort. They have a prototype design, and a competing design as well. Both look like they could produce a respectable player at a reasonable cost - a player designed to run free software from the outset. Designing a device which can run Rockbox and produce decent audio (and video) output is not that hard, given the components which are available. Turning it into a product which is small and sleek enough that people want to buy it seems likely to be harder. Getting a full device manufactured at a reasonable cost may be the hardest of all; that requires significant up-front money and a distribution channel which can sell enough units to make the whole thing cost-effective. There's also the little issue of those MP3 patents to take care of. There is no real sign that the Rockbox player developers are thinking on this level at this time. One of the prototype designs carries a Creative Commons noncommercial license in an attempt to prevent others from thinking that way. So the resulting hardware may end up being little more than a kit for especially dedicated hobbyists. Unless somebody picks up the ball and tries to commercialize a product like this, Rockbox may be stuck in its role as the software of choice for last year's players. The good news in all this is that Linux-based tablet devices seem likely to become cheaper, more abundant, and more compact.
Since these devices can make fine media players, we may eventually get our completely open gadget via that path. Modulo patent problems, of course. Emacs chooses Bazaar The Emacs development process is undergoing some changes; Richard Stallman has handed off project maintenance duties, while a change in the version control system (VCS) seems to be in the offing. Some of the modernization suggestions made by Eric Raymond last December are taking root. Stallman has not completely stepped away from Emacs development—it's doubtful anyone expected him to—but his approach on how to choose a VCS for Emacs is raising a few eyebrows. Currently, Emacs is tracked with CVS, but a distributed VCS (DVCS) is definitely planned down the road—how far is unclear at this point. In earlier discussions, Stallman was particularly interested in the offline capabilities of DVCS; being able to do commits, diffs, and see revision history while unconnected to the internet is a useful feature for him. Many other Emacs developers see a DVCS as a major upgrade to the development process, the question then becomes which DVCS to use. The main contenders are git, Mercurial (aka hg), or Bazaar (aka bzr); there are other options, of course, but they were quickly eliminated due to speed or feature set issues. There was some hope that a comparative VCS study that Raymond was working on would help lead the project to the proper choice, but the study has been delayed—a major release of Wesnoth is underway which has taken Raymond from that task. There were some discussions of the merits of the various systems but, in the meantime, Bazaar joined the GNU project which changed the equation somewhat. Stallman announced: We should use Bzr because that is becoming a GNU package. GNU packages should show loyalty to each other when possible, and in this case it is possible. As might be expected, short-circuiting a technical discussion for a political expedient is not met with universal approval. 
Juanma Barranquero sums up his (and others') objections: What I'm trying to say is: I won't discuss which dVCS we choose (unless it makes Windows development a PITA). But I agree with Jeremy Maitin-Shepard that the cause of free software is strengthened by us selecting among the free alternatives the one that best serves our technical, not political, needs. There is a certain irony in noting that one of the perceived weaknesses of git was its poor support for Windows development. It is certainly understandable, but the idea that one of the flagship GNU projects would make a decision based on tool availability for a proprietary operating system gives one pause. That isn't one of Stallman's requirements of course, he sees the decision as essentially a choice amongst equals: We already know the most important thing about what we will find from a careful study of git, mercurial and Bzr. We will find that each has its advantages and disadvantages -- but none of them conclusive. Each will be preferred by some people, but any one of them would work out well enough. As Thomas Lord (author of another GNU VCS, arch), points out, there is a cost to agonizing over a choice like this: Probably so but any group of smart people could easily spend a year arguing about it. Not even a year arguing about which system is best but a year arguing just about what "best" means in this context. Over-optimizing a choice like that can be a *huge* resource suck and projects and groups fail all the time because of falling into such traps. No technical barriers to using Bazaar have been raised, it is, as Stallman asserts, a fairly arbitrary choice. Unsurprisingly, Stallman chooses the one that serves his agenda. The new maintainers, Stefan Monnier and Chong Yidong, presumably agree with that agenda, in any case they have not indicated any resistance to the choice. So it seems that Emacs will be moving to Bazaar. 
Jason Earl has been pulling the CVS history into a Bazaar repository that should be available soon. The import process seems to be taking a fair amount of time—something on the order of a week—which is hopefully not indicative of the operational speed of Bazaar. Assuming the conversion works and developers can get their work done using it, this would be a pretty high-profile project to use it. Other GNU software may follow suit, which could be a big boost to the visibility of Bazaar; precisely what Stallman was aiming for. Monitor disks with the S.M.A.R.T. monitoring tools The S.M.A.R.T. Monitoring Tools (Smartmontools) is a cross-platform set of utilities that are able to monitor operating data from hard drives: The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI hard disks. In many cases, these utilities will provide advanced warning of disk degradation and failure. It should run on any modern Darwin (Mac OSX), Linux, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, eComStation, QNX, or Windows system. Wikipedia defines SMART as the Self-Monitoring, Analysis, and Reporting Technology: "Mechanical failures, which are usually predictable failures, account for 60 percent of drive failure. The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventative action — such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T." Version 5.38 of Smartmontools was recently announced. Improvements include: Several Libata/Marvell driver improvements. New additions to the drive database. ATA-8 updates. New Dragonfly support. Support for the QNX operating system. A new no-fork option for smartd. Better support for systems with large numbers of disks. 
Improvements to the descriptions of the SMART Attribute list. A workaround for a Samsung firmware bug. Improvements to the CCISS support system. New selective self-test command line options. Build system portability improvements. Numerous bug fixes. Building Smartmontools was straightforward. The code was downloaded and unpacked. The usual configure, make and make install steps were performed on an Ubuntu 7.04 system with no troubles. The operation instructions from the README file were followed and the software was able to discover data from the one hard drive on the test system. This example output shows the wide variety of drive information that Smartmontools can display. The drive appears to be healthy. If you are a systems administrator who needs to keep track of hard drive reliability data, Smartmontools should be able to provide some useful drive information. With the addition of a small amount of glue-logic scripting, it should not be too difficult to set up an automated drive monitoring system. Extended Validation certificates and cross-site scripting Cross-site scripting (XSS) is a frequent topic on security forums because it is a common web application flaw that can lead to a variety of unpleasant surprises. One of the more frequently seen abuses of an XSS flaw is in the aid of a phishing attack. With the advent of Extended Validation (EV) certificates coupled with the accompanying browser UI changes, some XSS attacks will become much more powerful. By now, most users are familiar with SSL certificates, which are used to authenticate one or both sides of an HTTPS connection to the other. EV certificates are a step up from a more pedestrian SSL certificate as the recipient must undergo more scrutiny from the certificate authority (CA) before being granted one. We covered EV certificates in more detail in November 2006, but they are just now starting to be installed more widely. Netcraft reported the problem a few weeks ago with regard to sourceforge.net.
Sourceforge is one of the 4,000 or so sites with an EV certificate, but it also has an XSS problem. So anyone using the site for XSS purposes now gets the benefit of the higher trust that is supposed to be embodied in an EV certificate. Browser vendors are being encouraged to highlight the EV certificates in their UI so as to give users more confidence in those sites. The most recent Firefox 3 betas as well as IE7 are highlighting the site name in green in the address bar to denote this higher trust. Unfortunately, the extra validation does not extend to testing the site for XSS flaws, which could leave users easily fooled. A phishing attack could use an XSS flaw in a search box or error message, for example, to add content to the appearance of a site. That content is really coming from the XSS attack but it would appear under the "green means go" address bar for the EV certificate-protected site. That content could include a login screen that sent the credentials elsewhere or a cookie stealing attack for session hijacking. For any site with sensitive information, XSS attacks are already a problem, EV certificates just add another mechanism for exploiting the user's trust. Much like the padlock icon that appeared many years ago to denote a "secure" (really, just encrypted) connection, this new green address bar indicator is somewhat difficult to explain. Based on the vetting process for EV certificates, there should be a real entity behind an EV certificate—or at least there was one at the time of issuance—but it is by no means an endorsement of the security of everything on a web page that has one. It is, like the original padlock, more nuanced than that. Unfortunately, users are not good at security nuances. They want yes or no answers to "Is this site safe?"; that answer is nearly always "maybe" or perhaps "probably". At one time, the padlock icon was seen as a "yes" answer; now the green address bar may take its place. 
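To make the search-box scenario above concrete, here is a minimal Python sketch of a reflected XSS hole and its usual fix. The function names, the page fragment, and the attacker.example URL are all hypothetical; this is not code from any site mentioned here. It only illustrates that a page which echoes a query string unescaped will render attacker-supplied markup (a fake login form, say) under the site's own certificate, while escaping the input turns that markup into inert text.

```python
import html


def render_results_unsafe(query):
    # Hypothetical search page that echoes the query verbatim:
    # any markup in the query becomes part of the page.
    return "<p>Results for: " + query + "</p>"


def render_results_escaped(query):
    # Same page, but the query is HTML-escaped before being echoed,
    # so the browser displays it as text instead of interpreting it.
    return "<p>Results for: " + html.escape(query) + "</p>"


# Hypothetical phishing payload: a login form that posts elsewhere.
payload = (
    '<form action="https://attacker.example/steal">'
    'Password: <input name="p"></form>'
)

unsafe_page = render_results_unsafe(payload)
escaped_page = render_results_escaped(payload)
```

In the unsafe page the injected form is live markup; in the escaped page the same characters arrive as `&lt;form ...` and are merely displayed. Neither outcome is affected by what kind of certificate the site presents.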
Somehow users need to be taught to look beyond simple answers and websites need to clean up their act so that their users are not scammed. The number of sites with XSS problems is staggering (a look at xssed.com is instructive) and new ones crop up all the time. In many ways, XSS is an attack against users rather than directly against a site. This may make it less of a priority to fix than a direct attack, like a SQL injection, might be. That is very unfortunate for their users, especially if they have a shiny new EV certificate. How to use a terabyte of RAM We have not yet reached a point where systems - even high-end boxes - come with a terabyte of installed memory. But products like those from Violin Memory make it clear that the day is coming; one can buy a Violin box with 500GB in it now. So it seems worth asking the question: once one has spent the not inconsiderable sum to buy a box like that, what does one do with all that memory - especially now that the Firefox developers have gotten serious about fixing memory leaks? Perhaps it's time for some wild ideas. And there is no better source for such ideas than Daniel Phillips, whose Ramback patch has stirred up a bit of discussion this week. The core idea behind Ramback is that all of that memory is turned into a ramdisk, but with a persistent device attached to it. In normal conditions, all application I/O involves only the ramdisk, and is, thus, quite fast ("Every little factor of 25 performance increase really helps."). In the background, the kernel worries about synchronizing data from the ramdisk onto permanent storage. But the synchronization process is mostly concerned with I/O performance, rather than providing guarantees about just when any given block will make it onto the disk platters. Ramback thus differs from the normal block I/O caching done by the kernel in a number of ways. 
It keeps the entire device in memory, so that, in steady-state operation, applications need never encounter a disk I/O delay. Should an application call fsync(), the expected result (blocking until the data is written to physical media) will not happen. Filesystems take great care to order operations in a way that minimizes the risk of data loss in a crash; Ramback ignores all of that and writes data to physical media in whatever order it decides is best. As Daniel put it, the "most basic principle" of Ramback's design is: [T]he backing store is not expected to represent a consistent filesystem state during normal operation. Only the ramdisk needs to maintain a consistent state, which I have taken care to ensure. You just need to believe in your battery, Linux and the hardware it runs on. Which of these do you mistrust? Ramback does include an emergency mode which will endeavor to bring the disk up to date in a hurry should the UPS indicate that power has been lost. But that does not seem to be enough for everybody. In the resulting discussion, nobody complained about the sort of performance benefits that a tool like Ramback could provide. But there was a lot of concern about data integrity; it seems that many people distrust their battery, their hardware, and Linux. And that has led to a sort of impasse, with several developers claiming that Ramback would be too risky to use and Daniel dismissing their concerns as FUD. FUD or not, those concerns are likely to be a difficult barrier for Ramback to overcome. Meanwhile, Daniel is looking for people to help test out the code, but that presents challenges of its own: This driver is ready to try for a sufficiently brave developer. It will deadlock and livelock in various ways and you will have to reboot to remove it. But it can already be coaxed into running well enough for benchmarks, and when it solidifies it will be pretty darn amazing. So far, reports from suitably courageous testers have been, well, scarce. 
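The write-back behavior described above can be modeled in a few lines. What follows is a toy Python sketch, not Daniel's driver: the class name, the method names, and the ascending-block flush order are assumptions made purely for illustration. It shows the three properties that worried people: reads and writes touch only memory, fsync() returns without forcing anything to the backing store, and the flusher picks its own write order.

```python
class RambackModel:
    """Toy model of a ramdisk with a lazily synchronized backing store."""

    def __init__(self, backing):
        self.backing = backing      # dict of block number -> bytes; stands in for the disk
        self.ram = dict(backing)    # the entire device is held in memory
        self.dirty = set()          # blocks changed in RAM but not yet on "disk"

    def read(self, block):
        return self.ram[block]      # never touches the backing store

    def write(self, block, data):
        self.ram[block] = data
        self.dirty.add(block)

    def fsync(self):
        # Returns immediately: nothing is forced out to the backing store.
        return None

    def flush(self, max_blocks=None):
        # Write dirty blocks in ascending block order (an assumed policy),
        # regardless of the order in which they were written.
        order = sorted(self.dirty)
        if max_blocks is not None:
            order = order[:max_blocks]
        for block in order:
            self.backing[block] = self.ram[block]
            self.dirty.discard(block)
        return order

    def emergency_flush(self):
        # Stand-in for the "power has been lost" mode: push everything out now.
        return self.flush()
```

Between a write and the eventual flush, the backing dictionary holds stale data, and after a partial flush it may hold a mix of old and new blocks. That window is what the data-integrity discussion is about.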
Your editor fears that this work could suffer the same fate as many of Daniel's other patches: they can contain brilliant ideas and great coding but just don't quite survive the encounter with the real, messy world. But we need people thinking about how our systems will work in the coming years; one hopes that Daniel won't stop. News from the Debian security team A note from the Debian security team shows a number of new initiatives and plans. The team recently expanded by two while looking for up to two more folks to round it out. That, coupled with a number of new initiatives makes for some interesting news from the Debian security world. Adding people to the team adds more eyes to find bugs, but, perhaps more importantly, adds more hands to actually patch the code when bugs are found. In many cases, the upstream project will fix the vulnerability in its latest release, leaving the distribution security team to backport the fix into whatever version they are shipping. This takes knowledge; one must understand the code and how to build it for Debian. They have not set the bar low for the kind of folks they are looking for: You need to be familiar with how the wide variety Debian packages are maintained, patched and built. If you're not scared by packages generating their patch series by applying sed statements from cdbs include files before passing the patches through an awk filter to quilt until they're finally built with yada, you might be the right person. The team is now using Request Tracker to track security bugs and updates. Two separate categories have been established, one for upstream bugs that are not yet public, the other for publicly known bugs. This allows the team to track all the bugs, but not prematurely release information about security vulnerabilities that are not yet public. Two other changes will help with the quality of security patches. 
The first is a public patch review mailing list that is being formed to allow interested parties to see what patches are being proposed. Presumably this would only apply to public vulnerabilities or the list membership will need to be tightly controlled. The other quality-boosting change is to use the time between when a patch is completed and when it has been ported and built for all of the architectures to further test the patch. The team is looking for large installations that normally install security updates in their own test environment before rolling them out to their live systems. Leveraging those test environments to further exercise the patched code can only lead to better code in the long run. Security is an important part of any distribution, so it is nice to see these kinds of initiatives. More team members, testing, and tracking are all likely to bring about a faster and better response to security problems in the future. Who maintains dpkg? The Debian project is known for its public brawls, but the truth of the matter is that the Debian developers have not lived up to that reputation in recent years. The recent outburst over the attempted "semi-hijacking" of the dpkg maintainership shows that Debian still knows how to run a flame war, though. It also raises some interesting issues on how packages should be maintained, how derivative distributions work with their upstream versions, and what moral rights, if any, a program's initial author retains years later. Dpkg, of course, is the low-level package management tool used by Debian-based distributions; it is the direct counterpart to the RPM tool used by many other systems. Like RPM, it is a crucial component in that it determines how systems will be managed - and how much hair administrators will lose in the process. And, like RPM, it apparently causes a certain sort of instability in those who work with it for too long.
Ian Jackson wrote dpkg back in 1993, but, by the time a few years had passed, Ian had moved on to other projects. In recent times, though, he has come back to working on dpkg - but for Ubuntu, not for the Debian project directly. One of his largest projects has been the triggers feature, which enables one package to respond to events involving other packages in the system. This feature, which is similar to the RPM capability by the same name, can help the system as a whole maintain consistency as the package mix changes; it can also speed up package installations. Triggers have been merged into Ubuntu's dpkg and are currently being used by that distribution. The upstream version of dpkg shipped by Debian does not have trigger support, though, and one might wonder why. If one listens to Ian's side of the story, the merging of triggers has been pointlessly (perhaps even maliciously) blocked for several months by Guillem Jover, the current Debian dpkg maintainer. So Ian concluded that the only way to get triggers into Debian in time for the next release ("lenny") was to carry out a "semi-hijack" of the dpkg package. By semi-hijack, Ian meant that he intended to displace Guillem while leaving in place the other developers working on dpkg, who were encouraged to "please carry on with your existing working practices." Ian also proceeded to upload a version of dpkg with trigger support, and without a number of other recently-added changes. It is worth noting that all of this work went into a separate repository branch, pending a final resolution of the matter. So when the upload was rejected (as it was) and Ian was deprived of his commit privileges (as he was), there was no real mess to clean up. Those wanting a detailed history of this conflict can find it in this posting from Anthony Towns. It is a long story, and your editor will only be able to look at parts of it. 
One of the relevant issues here is that Guillem Jover appears to be a busy developer who has not had as much time to maintain dpkg as is really needed. Since the beginning of the year, he has orphaned a number of other packages (directfb and bmv, for example) in order to spend more time on dpkg. But, as a result of time constraints, a number of dpkg patches have languished for too long. While this was happening, Guillem put a fair amount of the time he did have into reformatting the dpkg code and making a number of other low-level changes, such as replacing zero constants with NULL. Ian disagrees strongly with the reformatting and such - unsurprisingly, the original code was in his preferred style. And this is where a lot of the conflict comes in, at two different levels. Ian disagrees with the coding style changes in general, saying: Everyone who works on free software knows that reformatting it is a no-no. You work with the coding style that's already there. Many developers will disagree on the value of code reformatting; some projects (the kernel, for example) see quite a bit of it. Judicious cleaning-up of code can help with its long-term maintainability. All will agree, though, that reformatting can make it harder to merge large changes which were made against the code before the reformatting was done. This appears to be a big part of Ian's complaint: unnecessary (to him) churn in the dpkg code base makes it hard for him to maintain his trigger patches in a condition where they can be merged. Code churn is a part of the problem, but Ian's merge difficulties are also a result of doing the trigger work in the Ubuntu tree rather than in Debian directly. Ian did try to unify things back in August, but that was after committing Ubuntu to the modified code. 
Ubuntu's dpkg is currently significantly different from Debian's version, and, while one assumes that, sooner or later, Debian will acquire the trigger functionality, there is no real assurance that things will go that way. Dpkg has been forked, for now, and the prospects for a subsequent join are uncertain. Ian also asserts that, as the creator of dpkg, he is entitled to special consideration when it comes to the future of that package. His semi-hijack announcement makes that point twice. But one of the key features of free software is this: when you release code under a free license, you give up some control. It seems pretty clear that Ian has long since lost control over dpkg in Debian. So who does control this package, and how will this issue be resolved? Certainly Ian's hijack attempt found little sympathy, even among those who think that dpkg has not been well maintained recently. There are some who say that the disagreement should be taken to the Debian technical committee, which is empowered to resolve technical disputes between developers. But faith in this committee appears to be at a low point, as can be seen in this recent proposal to change how it is selected: It's been pretty dysfunctional since forever, there's not much that can be done internally to improve things, and since it's almost entirely self-appointed and has no oversight whatsoever the only way to change things externally is constitutional change. Meanwhile, the discussion has gone quiet, suggesting that, perhaps, it has been moved to a private venue. The dpkg commit log, as of this writing, shows that changes are being merged, but triggers are not among them. It is hard to imagine that the project will fail to find a way to get the triggers feature merged and the maintenance issues resolved, but that does not appear to have happened yet. Generic semaphores Most kernel patches delete some code, replacing it with newer and (presumably) better code. 
Much of the time, it seems, the new code is more voluminous than what came before. Occasionally, though, a patch comes along which deletes over 7600 lines of code - replacing it with a mere 314 lines - while claiming to maintain the same functionality. Matthew Wilcox's generic semaphore patch is one of those changes. In essence, a semaphore is a counter with a wait queue attached to it. When kernel code wants to access the resource protected by the semaphore, it makes a call to down(), passing a pointer to the semaphore (sem). This call will check the counter associated with sem; if it is greater than zero, the counter will be decremented and control returns to the caller. Otherwise the caller will be put to sleep until sometime in the future when the counter has been increased again. Increasing the counter - when the protected resource is no longer needed - is done with a call to up(). Semaphores can be used in any situation where there is a need to put an upper limit on the number of processes which can be within a given critical section at any time. In practice, that upper limit is almost always set to one, resulting in semaphores which are used as a straightforward mutual exclusion primitive. In current kernels, semaphores are implemented with highly-optimized, architecture-specific code. There are, in fact, more than twenty independent semaphore implementations in the kernel code base. Matthew's patch rips all of that out and replaces it with a single, generic implementation which works on all architectures. After the patch is applied, a semaphore is a small structure holding three things: a spinlock, a counter called count, and a list of waiting processes called wait_list. The implementation follows from this definition in a straightforward way: the spinlock is used to protect manipulations of count, while wait_list is used to put processes to sleep when they must wait for count to increase. The actual code, of course, is somewhat complicated by performance and interrupt-safety considerations, but it remains relatively short and simple. One might ask: why weren't semaphores done this way in the first place?
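Before getting to the answer, the counter-plus-wait-queue design described above can be illustrated with a small userspace analogue in Python. This is a sketch only and not the kernel code from the patch: a threading.Lock stands in for the spinlock, threading.Event objects stand in for sleeping processes, and the class itself is hypothetical. Handing the semaphore directly to the oldest waiter in up(), rather than incrementing the counter, is a design choice of this sketch.

```python
import threading
from collections import deque


class Semaphore:
    """Userspace analogue: a counter, a lock guarding it, and a queue of waiters."""

    def __init__(self, count):
        self.lock = threading.Lock()   # stands in for the spinlock
        self.count = count
        self.wait_list = deque()       # one Event per sleeping caller, oldest first

    def down(self):
        with self.lock:
            if self.count > 0:
                self.count -= 1        # fast path: resource available
                return
            waiter = threading.Event()
            self.wait_list.append(waiter)
        # Sleep outside the lock until up() hands the semaphore to us.
        waiter.wait()

    def up(self):
        with self.lock:
            if self.wait_list:
                # Pass ownership straight to the oldest waiter; count is unchanged.
                self.wait_list.popleft().set()
            else:
                self.count += 1
```

Created with a count of one, this behaves as a simple mutual exclusion primitive: the first down() succeeds at once, and a second caller sleeps until the holder calls up().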
The answer is that, once upon a time (prior to 2.6.16), semaphores were one of the primary mutual exclusion mechanisms in the kernel. The 2.6.16 cycle brought in mutexes from the realtime tree, and most semaphore users were converted over. So semaphores, which were once a performance-critical primitive, are now much less so. As a result, any need there may have been for carefully hand-tuned, architecture-specific code is gone. So the code might as well go too. The other question which comes up is: why are semaphores still being used at all? The number of semaphore users has dropped considerably since 2.6.16, but there are still a number of them in the kernel. Some of those could certainly be converted to mutexes, but doing so requires a careful audit of the code to be sure that the semaphore's counting feature is not being used. Once that work is done, it may turn out that, in some places, a semaphore is truly the right data structure. So semaphores are likely to remain - but they'll require rather less code than before. Installfest generates 350 Linux computers for schools On Saturday March 1st, Untangle and the Alameda County Computer Resource Center (ACCRC) organized the first of what is hoped to be many "Installfest for Schools" events. It took place at four San Francisco Bay area locations (San Francisco, Berkeley, San Mateo and Novato) and refurbished 350 older computers with Ubuntu for northern California schools. The primary goal of the installfest was to give children in disadvantaged neighborhoods the same access to technology that students in wealthy school districts grow up with. However, the event was also about curbing waste. 132 million PCs were bought in the year 2000 alone and none of them can run Vista. But older hardware works great with GNU/Linux and extending the life of these PCs will keep thousands of tons of toxic electronic waste out of the landfill. And let's not forget about budgetary waste. 
With many states facing budget crises that will inevitably force deeper classroom spending cutbacks, why should our schools spend their scarce resources on proprietary software licenses? In fact, cutbacks may create an incredible window of opportunity for the GNU/Linux desktop movement to establish itself within schools. The installfest drew approximately 130 free and open source software community volunteers across the four locations. We started with over 1,000 older, discarded computers that had been collected by ACCRC through donations from the general public, local businesses and municipal governments. Some of the computers were smooth sailing: they met the hardware specification, had all of the necessary components and installed without any problems. Other computers had software install problems, but those were easy to solve because so many of the Bay Area's most hardcore free and open source software gurus participated and, with their combined expertise, no error message went unattended. The rest of the computers required a little more care, as many of them were missing a hard drive, NIC or enough RAM to run Ubuntu. Yet, by disassembling problematic boxes it was easy to form a pool of spare parts that could then be stitched back together to create working computers. The week after the installfest, ACCRC put the finished systems through a 72-hour burn-in test and we now have 350 computers that have already started being donated to schools. The Ascend School in Oakland received the first batch of nine computers. Other schools that have received open source computers from the ACCRC include: Lockwood School (Oakland), Whittier Elementary School (Oakland), Casa Grande High School (Petaluma), Woodside Elementary School (Concord), KIPP San Francisco Bay Academy (San Francisco), and Mission High School (San Francisco). This event was about donating open source computers to schools in Northern California.
However, ACCRC regularly donates to schools nationwide (and sometimes internationally). Schools in need of computers should fill out ACCRC's school application form [PDF]. Computer hardware and software specifications The minimum specifications for each computer were an 800MHz processor (PIII or AMD), 256MB of RAM and a 20GB hard drive, but we were pleasantly surprised to find a handful of P4 processors in the mix as well. One location even received a batch of 6 dual core systems with elegant slim cases (who throws those out, and what else are they looking to get rid of?) but, ironically, we couldn't install them during the event because they were only equipped with DMS-59 DVI ports that required special monitor cables. Each system received a fresh copy of Ubuntu 7.10 desktop with the latest apt-get upgrade applied as of February 27, 2008. Because the computers were going into schools with little or no GNU/Linux expertise, it was important to try to create a positive first experience, so we worked with Creative Commons to package samples of pictures from Flickr and music from Jamendo to show off the fun side of the donated computers. No Starch Press also donated PDF copies of Ubuntu for Non-Geeks that were loaded onto each computer so that help for common support questions was never more than a click away. Install specifications Each location was set up with 10 to 40 workstations that had permanent keyboards, mice, monitors and cables so that the volunteers only had to move the desktops themselves back and forth. The process was started by booting from custom install CDs and the packages were applied over the network via Apache HTTP servers. The custom CDs were optimized to make the Ubuntu OS installation as fast and easy as possible.
Physically placing the CD into the drive and booting from disc was really all that was required because the additional content from Creative Commons and No Starch Press was bundled as Debian packages that were automatically installed via the network just like the other Ubuntu updates and patches. The installfest networks were based on dual Pentium III servers with a RAID array and Gigabit network cards plugged into a 24-port Gigabit switch. It was important to have a fast setup because updating as many as 40 systems at once placed a heavy load on drives and network connections. Electricity was also a concern as most of the outlets available had 15 or 20 Amp circuits. Given the intensity of the installation/reboot workload and the relatively power-inefficient CRT monitors, we drew the line at 5 workstations per 15 Amp circuit: an extra machine might have fit, but blowing the circuit breaker would have caused a big disruption, especially if the breaker happened to be in a locked closet. Community goes the extra mile With 130 volunteers showing up, Untangle and ACCRC really had a lot of help in pulling the Installfest for Schools off. However, the community did far more than just show up; our volunteers really went the extra mile to save the day as we stumbled across a handful of unexpected hiccups. One particularly inspirational moment came when the San Mateo location ran out of computers: our volunteers drove their own cars across the Bay to pick up extra hardware rather than close the location early! We also owe a debt of gratitude to 3 members of the San Francisco Linux Users' Group (Christian Einfeldt, Jim Stockford and Daniel Mizyrycki), who worked long hours to set up and clean up that location. We also received lots of help from free and open source software related organizations. Mozilla in particular really stepped up to the plate by blogging about the event and then bringing schwag and pizza for all 130 volunteers!
But Mozilla wanted to get their hands dirty as well and Mozilla team members showed up to lend a hand at each location. Creative Commons and the No Starch Press helped put together content. Also, O'Reilly, OSI, the Linux Foundation, Sun and Canonical really helped get the word out with supportive blog mentions that encouraged participation as well. Future plans Moving forward, Untangle and ACCRC hope to continue organizing bigger and better Installfests for Schools. Our goal is to turn the one-time regional event into a distributed national event occurring on a regular basis. If we're able to find some friendly organizations to help out, we may even be able to go international. Stay tuned because you'll be hearing from us sooner rather than later about the next Installfest for Schools. Anyone wishing to help should stay informed by signing up for the installfest mailing list. As we move more into a distributed national event, we need all of the help that we can get identifying local schools, old computer donors and feet on the street volunteers to make sure everything goes smoothly. That work will be coordinated on the mailing list. [ Andrew Fife, of Untangle, is one of the organizers of the project. ] The return of authoritative hooks The containers developers have what would seem to be a relatively straightforward problem: they would like to control access to devices on a per-container basis. Then containers could safely be granted access to specific devices without compromising the overall security of the system - even if a container has a root-capable process which can create new device files. Implementing this feature has been a longer journey than these developers had imagined, though, with the "device whitelist" feature being sent around to different kernel subsystems almost like one of those famous garbage barges from years past. 
A final resting place may have been found, though, and it may signal a change in how some security decisions are made in the kernel in the future. The original version of the patch, posted by Pavel Emelyanov, set up a control group for the management of device accessibility within containers. The actual rules - and their enforcement - were stored deep within the device model subsystem. This drew an objection from Greg Kroah-Hartman, who suggested that, instead, this kind of access control should be done either with udev or with the Linux security module (LSM) subsystem. Udev does not give the desired degree of control and, apparently, can be problematic for those wanting to run older distributions within containers, so it was not seriously considered. The LSM suggestion was, after some resistance, taken to heart, though. The result was the device whitelist LSM patch, posted by Serge Hallyn. It was a stacking security module which made changes to a number of hooks. This is where James Morris came in and suggested that, instead, the whitelist should just be added to the existing capabilities security module. Then there would be no need for a separate module and things could be generally simplified. So Serge duly rolled out version 3 of the patch which moved the whitelist into the capabilities module. But this one ran into resistance as well. Quoting James Morris again: Moving this logic into LSM means that instead of the cgroups security logic being called from one place in the main kernel (where cgroups lives), it must be called identically from each LSM (none of which are even aware of cgroups), which I think is pretty obviously the wrong solution. Casey Schaufler also didn't like this idea: When the next feature comes along are we going to stuff it into capabilities, too? Maybe we'll cram it into audit or CIPSO instead, but how long can this go on?
Eventually we need a mechanism that allows more or less general mix-and-match, maybe with a few rules like "don't mix plaids and stripes" to keep things sane or these lesser facilities have no chance. Seems like we're still making LSM too hard to use. At this point, the complaint was clearly not with just the device whitelist, but with the capabilities module as well. It seems that capabilities are a bit of a poor fit with the LSM idea as a whole. The fact that they exist at all is a bit of a historical artifact; some developers wanted to see them implemented that way to show the flexibility of the LSM interface and to let capabilities be omitted from embedded setups. As it happens, it's still not possible to remove capabilities, and they impose a bit of a cost on all other security modules. The core problem is this: LSM, fundamentally, is a restrictive mechanism. An LSM hook can deny an action, but it can never empower a process to do something it would not have been allowed to do in the absence of the security module. The decision to disallow "authoritative hooks" was made explicitly back in 2001 as a way of restricting the scope of LSM modules and, hopefully, ensuring that those modules would not themselves become security problems. But capabilities are an inherently authoritative mechanism - a capability check verifies the existence of a special permission which would otherwise not be there. The device whitelist is the same sort of thing: it grants access which would otherwise be denied. So it fits poorly with the LSM model. Serge came back with yet another patch which takes the whitelist code out of the LSM framework and, instead, inserts a separate set of hooks into the relevant places in the code. Those hooks sit right next to the LSM hooks, but operate in a permissive manner. So far, this approach seems to be passing muster, with no developers (yet) talking about booting it out into yet another subsystem. Things may yet change, though.
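The difference between the restrictive LSM model and hooks which operate in a permissive manner can be pictured with a small model. The following Python sketch is purely illustrative and is not kernel code: the function names, the request dictionary, and the whitelist contents are all hypothetical. It assumes that a restrictive hook can only veto what the base permission check already allowed, while a permissive hook may grant access that the base check denied.

```python
# Toy model of restrictive versus permissive (authoritative) hooks.
# Nothing here corresponds to a real kernel interface.

# Hypothetical whitelist: (container, device) pairs that may be accessed.
WHITELIST = {("web", "/dev/null"), ("web", "/dev/zero")}


def base_check(request):
    """Stand-in for the ordinary permission check made before any hook runs."""
    return request["base_allowed"]


def whitelist_hook(request):
    """Return True if the container is whitelisted for the device."""
    return (request["container"], request["device"]) in WHITELIST


def restrictive_decision(request, hooks):
    """LSM-style: every hook may deny, but none can grant extra access."""
    return base_check(request) and all(hook(request) for hook in hooks)


def permissive_decision(request, hooks):
    """Authoritative-style: a hook may allow what the base check denied."""
    return base_check(request) or any(hook(request) for hook in hooks)


# A whitelisted device that the base check alone would refuse.
denied_by_base = {"container": "web", "device": "/dev/null", "base_allowed": False}
print(restrictive_decision(denied_by_base, [whitelist_hook]))
print(permissive_decision(denied_by_base, [whitelist_hook]))
```

In this model the whitelist can never widen access when it runs as a restrictive hook, which is the mismatch with the LSM design described above.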
Casey Schaufler is now talking about the creation of a "Linux privilege module" framework for the management of all permissions checks. The normal discretionary access control checks could be moved there, as could all capability and "are they root?" logic. And, of course, the device whitelist code. Nobody has really spoken out against this idea - but, then, nobody has seen any code yet either. But, if things continue in this direction, authoritative hooks may have finally found a home, many years after having been rejected from the LSM mechanism. Python gears up for 2.6 and 3.0 Things are heating up in the Python world in advance of two major synchronized releases of the language. As it heads towards Python 3000 (aka Py3k or Python 3.0), alongside the transitional version 2.6, the development team is narrowing its focus to just those items that are required for the releases. Along the way, the conversations taking place on python-devel provide a look inside the development and release process decisions that a project needs to make as releases loom. Py3k is the next-generation version of Python, as we described last September. It will not be backward compatible with programs written for Python 2.x in a wide variety of ways. Python 2.6 is an effort to bridge the gap, enabling much of the 3.0 functionality so that new programs can start using it. It can also provide warnings for code that will not work with Py3k. Python 2.6 was originally scheduled for an April 2008 release, in advance of the August 2008 release planned for Py3k. Now the two are slated for synchronized releases, roughly monthly, until the final release now scheduled for early September 2008. 
The synchronization is seen as important for two reasons as Python's Benevolent Dictator For Life (BDFL) Guido van Rossum outlines: Not only could this potentially save the release manager and his assistants some time, doing the final releases together sends a clear signal to the community that both versions will receive equal support. Because Py3k is such a radical change, the 2.x series will continue for a long time. van Rossum's recent PyCon keynote (PDF slides) mentions five years as the time frame for 2.6 to be supported, with 2.7 and 2.8 releases possible. A stable development platform for the next few years is very important for current Python users as is giving them a long time to migrate their code. The third alpha of Py3k was released at the end of February along with the first alpha of 2.6. Additional alpha releases of each are slated for April and May as laid out in Python Enhancement Proposal (PEP) 361. Those are to be followed by betas in June and July with the final release planned for September 3. All of that adds up to a fairly aggressive schedule, but the team seems confident—at least so far. One of the issues that the Python hackers are trying to figure out is how to track the items still left to be done. van Rossum describes the scope of the problem: In order to make such a tight release schedule we should try to come up with a list of tasks that need to be done, and prioritize them. This should include documentation, and supporting tools like 2to3. It should include features, backports of features, cleanup, bugs, and whatever else needs to be done (e.g. bugbot maintenance). No one had any major objections to van Rossum's suggestion of using the bug tracker to track the tasks, with Christian Heimes pointing out: Despite the url bugs.python.org it's an issue tracker and not a bug tracker. We track patches, feature requests, ideas and bugs in the same tracker. 
The bug tracker allows for different priorities to be set on bugs (or tasks) that are entered into it, which led van Rossum and others to wonder about the proper usage of that field. One of the problems is distinguishing between issues that must be addressed before the next release versus those that must be addressed sometime before the final release. In some sense, both are "critical" and "show-stopping" (depending on which show you are focused on). Brett Cannon reported the scheme they came up with: So "release blocker" blocks a release. "Critical" could very easily block a release, but not the current one. "High" issues should be addressed, but won't block anything. "Normal" is normal. And "low" is for spelling errors and such. This can elevate bugs that are relatively minor, but need to be handled before a final release, into a category that inflates their importance. But, not elevating the bugs can lead to them incorrectly being set aside for a later release. van Rossum wondered about this bug priority "inflation", but it is the way that 2.6/3.0 release manager Barry Warsaw wants to handle things: Critical is the right one to use. Neal and I will basically be moving issues between 'release blocker' and 'critical' with the former meaning this issue blocks the upcoming release. Other projects or project managers might make different decisions on how to handle bug priorities, but the important thing is to make a reasonable decision quickly. Once that was done, the tasks were added to the tracker and could be prioritized correctly within the framework and without a lot of hand-wringing about which way is "best". It is an important skill for project managers of all kinds to learn. Things are progressing rapidly on python-devel these days—not surprising with two major releases due in less than six months. There is a lot of work to be done, but the Python hackers aren't shrinking from those tasks. 
In addition, the team has also been able to change their processes as needed to support their tight schedule. With hard work and a bit of luck that should put Py3k and its 2.6 sibling on our development machines by autumn. A new suspend/hibernate infrastructure While attending conferences, your editor has, for some years, made a point of seeing just how many other attendees have some sort of suspend and resume functionality working on their laptops. There is, after all, obvious value in being able to sit down in a lecture hall, open the lid, and immediately start heckling the speaker via IRC without having to wait for the entire bootstrap sequence to unfold. But, regardless of whether one is talking about suspend-to-RAM ("suspend") or suspend-to-disk ("hibernation"), there are surprisingly few people using this capability. Despite the efforts which have been made by developers and distributors, suspend and hibernate still just do not work reliably for a lot of people. For your editor, suspend always works, but the success rate of the resume operation is about 95% - just enough to keep using it while inspiring a fair amount of profanity in inopportune places. Various approaches to fixing suspend and hibernation have been proposed; these include TuxOnIce and kexec jump. Another possibility, though, is to simply fix the code which is in the kernel now. There is a lot that has to be done to make that goal a reality, including making the whole process more robust and separating the suspend and hibernation cases which, as Linus has stated rather strongly several times, are really two different problems. To that end, Rafael Wysocki has posted a new suspend and hibernation infrastructure for devices which has the potential to improve the situation - but at a cost of creating no fewer than 20 separate device callbacks.
For the (relatively) simple suspend case, there are four basic callbacks, prepare(), suspend(), resume(), and complete(), which should be provided in the new pm_ops structure by each bus and, eventually, by every device. When the system is suspending, each device will first see a call to its prepare() callback. This call can be seen as a sort of warning that the suspend is coming, and that any necessary preparation work should be done. This work includes preventing the addition of any new child devices and anything which might require the involvement of user space. Any significant memory allocations should also be done at this time; the system is still functional at this point and, if necessary, I/O can be performed to make memory available. What should not happen in prepare() is actually putting the device into a low-power state; it needs to remain functional and available. As usual, a return value of zero indicates that the preparation was successful, while a negative error code indicates failure. In cases where the failure is temporary (a race with the addition of a new child device is one possibility), the callback should return -EAGAIN, which will cause a repeat attempt later in the process. At a later point, suspend() will be called to actually power down the device. With the current patch, each device will see a prepare() call quickly followed by suspend(). Future versions are likely to change things so that all devices get a prepare() call before any of them are suspended; that way, even the last prepare() callback can count on the availability of a fully-functioning system. The resume process calls resume() to wake the device up, restore it to its previous state, and generally make it ready to operate. Once the resume process is done, complete() is called to clean up anything left over from prepare(). A call to complete() could also be made directly after prepare() (without an intervening suspend) if the suspend process fails somewhere else in the system.
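The ordering of the four suspend callbacks can be made concrete with a toy simulation. The Python sketch below is not kernel code and does not reproduce the patch; the Device class, the suspend_cycle() function, and the retry limit are hypothetical names invented for illustration. It models the behavior described for the current patch (each device sees prepare() followed by suspend()) and makes three simplifying assumptions: an -EAGAIN from prepare() is retried immediately rather than later in the process, devices are woken in reverse order, and a device whose prepare() fails outright gets no further calls.

```python
import errno


class Device:
    """Toy device exposing the four suspend-related callbacks."""

    def __init__(self, name, busy_attempts=0, prepare_error=0):
        self.name = name
        self.busy_attempts = busy_attempts  # times prepare() reports -EAGAIN
        self.prepare_error = prepare_error  # hard (negative) error code, or 0
        self.log = []                       # record of callbacks seen

    def prepare(self):
        if self.busy_attempts > 0:
            self.busy_attempts -= 1
            self.log.append("prepare: -EAGAIN")
            return -errno.EAGAIN
        if self.prepare_error:
            self.log.append("prepare: failed")
            return self.prepare_error
        self.log.append("prepare")
        return 0

    def suspend(self):
        self.log.append("suspend")
        return 0

    def resume(self):
        self.log.append("resume")
        return 0

    def complete(self):
        self.log.append("complete")


def suspend_cycle(devices, max_retries=2):
    """Run prepare()/suspend() on each device, then resume()/complete()."""
    suspended = []
    ok = True
    for dev in devices:
        ret = dev.prepare()
        retries = 0
        # Assumption: a temporary failure is simply retried on the spot.
        while ret == -errno.EAGAIN and retries < max_retries:
            ret = dev.prepare()
            retries += 1
        if ret != 0:
            ok = False  # give up and unwind the devices handled so far
            break
        dev.suspend()
        suspended.append(dev)
    # If ok, the system would sleep here; either way, wake in reverse order.
    for dev in reversed(suspended):
        dev.resume()
        dev.complete()
    return ok


disk = Device("disk")
usb = Device("usb", busy_attempts=1)
suspend_cycle([disk, usb])
print(usb.log)
```

The log for the "usb" device shows one temporary failure followed by the normal prepare, suspend, resume, complete sequence.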
The hibernation process is more complicated, in that there are more intermediate states. In this case, too, the process begins with a call to prepare(). Then calls are made to freeze() and poweroff(). The freeze() callback happens before the hibernation image (the system image which is written to persistent storage) is created; it should put the device into a quiescent state but leave it operational. Then, after the hibernation image has been saved and another call to prepare() made, poweroff() is called to shut things down. When the system is powered back up, the process is reversed through calls to quiesce() and restore(). The call to quiesce() will happen early in the resume process, after the hibernation image has been loaded from disk, but before it has been used to recreate the pre-hibernation system's memory. This callback should quiet the device so that memory can be reassembled without being corrupted by device operations. A call to complete() will follow, then a call to restore(), which should put the device back into a fully-functional state. A final complete() call finishes the process. There are still two more hibernation-related callbacks: thaw() and recover(). These functions will be called when things go wrong; once again, each of these calls will be followed by a call to complete(). The purpose of thaw() is to undo the work done by freeze() or quiesce(); it should put the device back into a working state. The recover() call will be made if the creation of the hibernation image fails, or if restoring from that image fails; its job is to clean up and get the hardware back into an operating state. For added fun, there are actually two sets of pm_ops callbacks. One is for normal system operation, but there is another set intended to be called when interrupts are disabled and only one CPU is operational - just before the system goes down or just after it comes back up. Clearly, interactions with devices will be different in such an environment, so different callbacks make sense.
But the result is that fully 20 callbacks must be provided for full suspend and hibernate functionality. These callbacks have been added to the bus_type structure as: Fields by the same name have also been added to the pci_driver structure, allowing each device driver to add its own version of these callbacks. For now, the old PCI driver suspend() and resume() callbacks will be used if the pm_ops structures have not been provided, and no drivers have been converted (at least in the patch as posted). As of this writing, discussion of the patch is hampered by an outage at vger.kernel.org. There are some concerns, though, and things are likely to change in future revisions. Among other things, the number of "no IRQ" callbacks may be reduced. But, with luck, the final resolution will leave us all in a position where suspend and hibernate work reliably. The Banshee Music Management and Playback Utility The Banshee project is creating a music management and playback utility for the GNOME desktop. The Banshee home page states: Import, organize, play, and share your music using Banshee's simple, powerful interface. Rip CDs, play and sync your iPod, create playlists, and burn audio and MP3 CDs. Most portable music devices are supported. Banshee also has support for podcasting, smart playlists, music recommendations, and much more. Version 1.0 Alpha 1 (0.98.1) of Banshee has been announced. New features in this release include: A code rewrite with an emphasis on performance improvements and better resource usage. A new Album Browser feature with the ability to display album artwork. A Play Queue feature for building on-the-fly music playlists. New search capabilities for locating artists, albums and song titles. Integration with the Last.fm music sharing service. A built-in 10 band audio equalizer. The new ability to play from a playlist while browsing new sources. The version 1-0.98.1 change log file has more detailed information on the new release. 
This 1.0 alpha release of Banshee is missing a number of features that were present in the earlier 0.13.2 version. There is no support for hardware devices yet, so it is not possible to import or burn CDs, talk to iPod devices or deal with USB or MTP devices. Numerous plugins have also been left out, so it is not possible to access podcasts, internet radio, music sharing services, etc. The release announcement states: Do not despair, these features will be added back before the final 1.0 release. Many hardware related features are projected to land in the Alpha 2 and 3 releases of Banshee 1.0. We expect releases in quick succession leading up to the final 1.0 release. Banshee 1-0.98.1 was installed on a system running an Athlon XP 1700 processor and 512MB of RAM. The operating system was the alpha 6 release of Ubuntu Hardy Heron for i386. The following steps were required to get the software running: Banshee fired up as expected. Your author converted a few CDs to flac files and copied them to the system for testing. It did not take much effort to figure out how to play individual tracks and build playlists. The standard play/pause buttons and skip to previous or next track buttons worked as one would expect. The built-in equalizer worked, although it tended to produce audible clipping if a frequency band was turned up too high. Unlike earlier versions of Banshee, the only internet music channel shown in version 1.0 was Last.fm. It was possible to use the standalone last.fm binary to access the site, but Banshee was only able to list the selections, not play them. The error message: don't know how to handle audio/mpeg... led to the source of the problem. The installation page was consulted, a large collection of gstreamer0.10-plugins were installed with the Synaptic package manager, and Banshee was restarted. Last.fm content came through loud and clear. One final issue was noticed with Banshee. 
When the application was run from the command line and exited using the GUI, it left the GNOME terminal in a locked-up state. Future releases of Banshee will likely include fixes for some of the aforementioned issues. Banshee is an interesting application that can be used for combining a wide variety of audio listening functions into one place. Electing the openSUSE board The openSUSE project takes another step in becoming a true community project. The current openSUSE board, appointed by Novell, will soon be replaced by an elected board. The question that is being debated on the opensuse-project mailing list is "Who can vote for the openSUSE board?" Among the openSUSE community there are Members and a larger number of Users. "'openSUSE Members' are specifically distinguished contributors who have brought a continued and substantial contribution to the openSUSE project. They are approved by the openSUSE board." Becoming a user is as easy as registering on the wiki. Some possible answers to the "who can vote" question include: members only; anyone (members + registered users); members + non-members vouched for by members; members + users who have signed the Guiding Principles. At this time the number of members is low. There are concerns that having members (who are appointed by the board) as the only voters for the board could exclude the greater community. On the other hand, opening up elections to the greater user community is difficult to police. It should be verifiable that those who are eligible to vote have only one vote counted. Other projects may serve as a guide for this issue. Debian has the Debian Voting Information page which defines how voting is done and how votes are counted. Debian restricts voting to Debian Developers (DDs), who must sign their vote with their key, which is also on the official keyring. DDs may vote more than once, but only the last vote is counted, so voting is restricted and it's easy to ensure one-vote-per-person.
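The "only the last vote is counted" rule is simple to express in code. The Python sketch below is a minimal illustration under stated assumptions, not Debian's actual counting procedure (which is defined on its Voting Information page and is not modeled here): the tally() function is a hypothetical name, each ballot names a single choice, and membership in an "eligible" set stands in for verifying a signature against the official keyring.

```python
def tally(ballots, eligible):
    """Count one vote per eligible voter, keeping only the last ballot.

    ballots:  iterable of (voter, choice) pairs in the order received
    eligible: set of voters allowed to vote (stand-in for a keyring check)
    """
    last_choice = {}
    for voter, choice in ballots:
        if voter in eligible:
            # A later ballot from the same voter replaces the earlier one.
            last_choice[voter] = choice

    counts = {}
    for choice in last_choice.values():
        counts[choice] = counts.get(choice, 0) + 1
    return counts


ballots = [
    ("alice", "A"),
    ("bob", "B"),
    ("alice", "B"),    # alice changes her mind; only this ballot counts
    ("mallory", "A"),  # not on the eligible list, so ignored
]
print(tally(ballots, eligible={"alice", "bob"}))
```

Keying the ballots by voter identity is what makes one-vote-per-person easy to enforce here; an election open to anonymous registered users has no equally reliable key.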
The Fedora project has defined Fedora Board Elections more recently than Debian. This document states that 5 of 9 seats on the board are appointed by the board. Voting is open for the remaining seats to those who have a valid account in the Fedora Account System. Getting an account on the Fedora Account System requires an application and approval process that is somewhat similar to becoming an openSUSE Member. The GNOME Foundation Elections process was also raised as a model. GNOME membership is open to any contributor willing to go through the application process. Given those three examples, it does seem that voting privileges are typically restricted to a subset of the community that has made both a commitment and continuing contributions to the project. The main difference is that openSUSE membership is relatively new and is therefore a small segment of the greater community. Over time the membership will grow and members-only elections may become more appealing. In any case, the procedures that are defined for this election may be changed for subsequent elections. Breaking CAPTCHA Perhaps someday it will be considered discrimination against a sentient, but these days a way to distinguish between programs and humans is required for many web-based applications. Keeping spambots from posting comments in weblogs or other bots from signing up for a web service are two of the most common applications for separating humans and bots. As has often been the case in the past, though, when the stakes are high enough, attackers will find ways to circumvent barriers like this. The most common means of testing for humans in web site sign-ups and the like is a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). Typically these are images that contain some text that has been mangled so that it is still recognizable by humans, but not by programs; at least that is the theory.
Variations on the theme include asking math or "common sense" questions that programs will supposedly not be able to figure out, though it is more likely that no attacker has had enough interest in breaking them. Serious CAPTCHAs tend to use images that can be created on the fly, giving nearly infinite variety. Some of the most sophisticated CAPTCHAs are those used by various free web mail services: Hotmail, Yahoo, and Gmail. These services provide quite a bit of storage that might be of use to an attacker, but they also lend their reputation to mail that gets sent from those accounts. Domains like yahoo.com and gmail.com are very unlikely to be blacklisted. Mail coming from those domains may also score lower in various spam testing rules, which may be exactly what an attacker is looking for. Various techniques have been tried in the past to circumvent CAPTCHAs, with the most successful ones using humans. It seems that many folks will happily solve CAPTCHAs in order to view pornography or for cash. Over the last year, though, CAPTCHA-breaking programs have started to appear. In a very detailed report, Websense presents evidence that Gmail's CAPTCHA has been cracked. Earlier reports indicate that attackers have cracked Yahoo, Windows Live, and Hotmail CAPTCHAs as well. Cracked does not mean a 100% success rate (humans cannot even achieve that); the program just needs to work often enough to provide the attackers with the accounts they want. These programs use some image processing and optical character recognition (OCR) techniques to decipher the puzzle, removing humans from the equation entirely. Typical success rates are in the 20-35% range. For attackers with botnets available to spread out the work, this could yield an amazing number of accounts in relatively short order. CAPTCHAs have a number of bad characteristics: they are annoying to most and unusable by those who are visually impaired. Yet they are pervasive.
Alternate techniques using audio have so far been found wanting; a more interesting method is Asirra from Microsoft Research. Asirra uses 3 million images of dogs and cats from animal shelters that have been categorized. The test then shows a dozen random images from the database and asks the "human" to select all the cat photos. This would seem much more difficult for a program to handle. The picture database would need regular updates to thwart attackers just collecting all the images and doing their own categorization—perhaps with help from porn viewers or poor folk. Also, computer recognition systems will someday be able to recognize dogs and cats. It is a difficult problem to solve, but one that needs to be addressed. Systems like OpenID are not enough—it is not what they were designed for—as there is nothing stopping bots from having OpenIDs. Some mechanism that would allow reputation or trust to accumulate on a given ID might help prove that its holder is a human—or at least a well-behaved bot. Designing a reputation service that is decentralized will also be difficult, but it is the right direction for solving these kinds of problems. Bruce Perens and the OSI board The Open Source Initiative (OSI) was formed almost ten years ago to safeguard the "Open Source" name. Over the years it has approved licenses and attempted some other activities while, generally, having little relevance to the wider community. It has often been seen as a relatively closed and non-democratic organization. Now one of OSI's founders is trying to get back into the organization and change its direction; the outcome of the resulting discussion may (or may not) change the direction of the OSI. Bruce Perens has launched a bid to be elected to the OSI board of directors, but this bid has not been particularly well received by the current board. 
His on-line petition to collect community support specifies a number of reasons that he wants to be on the board—those reasons are ruffling some feathers. Outgoing board member Matt Asay has taken Perens to task for some of his statements as has OSI president Michael Tiemann. Perens's reasons for wanting to be on the board are threefold: reducing the over-representation of vendors, trying to ensure Microsoft does not get a seat on the board, and reducing license proliferation. The idea of a Microsoft seat on an open source organization's board is sure to rile a segment of the community, which is undoubtedly part of what Perens is hoping for. The likelihood of that happening is pretty small, though. Tiemann makes it clear that the board doesn't elect companies at all: The OSI nominates people to the board despite their corporate affiliations, not because of them. The idea that the OSI would elect a "Microsoft" board member is as absurd as the idea that we'd elect a "Google" board member or an "IBM" board member. We elect people based on their own merits, not the merits (or demerits) of the companies or organizations they are affiliated with. Microsoft and its employees do not currently contribute to open source in any substantial way, so there is little that would lead the board to nominate them. If that ever changes, it would be pretty disingenuous to deny someone a seat because of their employer's past—or even at that time, current—misbehavior. In addition, it is hard to see how one board member—Perens or someone "controlled" by Microsoft—is going to make such a crucial difference in what the board does anyway. In many ways, the Microsoft connection is a red herring—one sure to rally the troops, though. Reducing license proliferation is a noble goal, one that the OSI tried to tackle a few years back without much in the way of tangible success. 
Perens states that he would like to see OSI do more to reduce the number of licenses, but his claims about the number of licenses needed have raised eyebrows: Another problem is the failure to reduce the number of different licenses in general use. My own work in this area shows that only four licenses, all compatible with each other, can satisfy all common business and non-business purposes of Open Source development. Three of these licenses have essentially the same text, and the fourth is very short. Life would be easier if more projects used them. While it would be difficult to shut down approval of new licenses, I think OSI could be more proactive at reducing license proliferation. Part of the reason that Tiemann and others are skeptical is some obvious bad blood between the board and Perens over the license proliferation committee. LWN covered some of that "debate" in August 2005. Perens clearly believes he should have been a member just as strongly as others on the board seem to feel he should not have been. When the board was formed without him as a member, Perens refused to participate in the process in any way. It seems to stick in the craw of some for Perens to now claim that he has the solution. Russ Nelson, former OSI president and current board member (as well as a member of the committee), sums up the frustration in a comment on Tiemann's post: I don't see how Bruce can claim to have a short list of four licenses. I start with BSD, GPLv2, GPLv3, LGPLv2 and LGPLv3 and that's five. If he thinks that people should simply agree with him that all GPLv2 should be relicensed GPLv3, I invite him to spend some time with Linus Torvalds, who notoriously and politely disagrees. Having a solution is not the same as convincing people to adopt it. It is rather interesting to see Perens trying to get back on the board that he famously resigned from in 1999 after having founded the organization with Eric Raymond in 1998.
This is not the first time Perens has lost interest and/or resigned from some form of community leadership position; Debian and UserLinux spring to mind. Though none of the expressed concerns about his candidacy have mentioned it, some must be wondering how long it would be before ideology or a shifting focus caused Perens to move on from a board position if he were elected. Perens has been an excellent advocate for free software and/or open source over the years, but his tendency towards self-promotion grates on some. It may not be an ego thing, as he claims, but it certainly rubs some people the wrong way. The ego issue is one of the reasons that board observer Andrew Oliver does not support Perens for the board: A return to a very Amerocentric hacker culture voice with big egos is not the answer to OSI's problems. I think OSI is on the path to real fundamental change. I'd like to hear Bruce explain what he'd do differently in collaboration with others who may not always agree with him. Asay certainly doesn't see Perens as having the right credentials either: The OSI needs a vibrant membership of those currently shaping the open source landscape. It's possible that its current make-up doesn't reflect this. Point well taken. But it's equally possible - indeed, I'd say probable - that Bruce's directorship wouldn't change this. I like Bruce but aside from the occasional picketing he does, I can't point to anything substantive he has done for open source in the past half-decade or so. The petition drive came about because Tiemann encouraged Perens to show that there was strong community support for him to be a part of the board. As of this writing, the petition has garnered more than 1700 "signatures", which Perens believes is enough: Regarding my candidacy, OSI's board, through its president, asked me to show an uprising of strong community support if the board was to to elect me. I have. 
Now that I have done what you asked, are you going to hide behind complaints about my campaign, which is really quite mild in its criticism and is in no way the "scorched earth" that Matt refers to, or are you going to do what you said? If you OSI can't handle a political opponent on my laid-back scale, you'd only looking for yes-men. The OSI board is "self-replacing" with current board members nominating and electing candidates for empty slots. Each director serves for a three-year term, with roughly one-third coming up for election each year—though this year there are five slots to be filled. Three directors are standing for re-election, leaving two slots open. Unfortunately, it's not clear when the actual election will be held, nor is there likely to be any advance notice of who has been nominated. Transparency, it seems, is not one of the attributes of OSI. Self-replacement and overlapping terms of office tend to give a certain stability to a board, but it also creates a kind of inbreeding. It is unlikely that a board will nominate people who think substantially differently from themselves. This is one thing that Perens is trying to circumvent with his very public candidacy. Whatever else can be said about Perens's candidacy, it is clear that he would bring a different voice into the OSI boardroom. But, what is OSI really? Is it an organization that is somehow supposed to represent all of the diverse voices in the community? At the moment it appears to exist for the purpose of approving licenses and "protecting the Open Source Definition". Perens thinks it could be more than that. OSI itself seems to agree as they have been moving towards more relevance in the community. Oliver describes that effort: OSI is trying to solve its problems, by becoming more grassroots and less bottom up. Meanwhile, it is trying to grow the movement by expanding its international representation. 
Corporations do influence OSI, in that not all of the board has a free hand to say what is on their mind publicly. However, the solution is to make the OSI board what it should be: a governance board. OSI and its board are currently in a state of flux, trying to define a role for themselves that is broader than just a license approval body. There doesn't seem to be a lot of discontent within the board that might lead to Perens or another controversial figure being added. Whether this leads to continued stagnation or a more vibrant OSI remains to be seen. A more interesting question might be: will anyone care? If OSI starts to do visible things for the community, it will finally acquire some relevance. Given the attitude towards his candidacy, it seems unlikely that Perens will be able to lead the board in that direction. That leaves it up to the current board and the two new members—neither of whom is likely to be Perens—to find a way to make the community care.

Atomic context and kernel API design

An API should refrain from making promises that it cannot keep. A recent episode involving the kernel's in_atomic() macro demonstrates how things can go wrong when a function does not really do what it appears to do. It is also a good excuse to look at an under-documented (but fundamental) aspect of kernel code design. Kernel code generally runs in one of two fundamental contexts. Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary. But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context. There is more to it than that, though: any kernel function moves into atomic context the moment it acquires a spinlock.
Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever. "Deadlocking forever" tends not to appear on users' wishlists for the kernel, so the kernel developers go out of their way to avoid that situation. To that end, code which is running in atomic context carefully follows a number of rules, including (1) no access to user space, and, crucially, (2) no sleeping. Problems can result, though, when a particular kernel function does not know which context it might be invoked in. The classic example is kmalloc() and friends, which take an explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying whether sleeping is possible or not. The wish to write code which can work optimally in either context is common, though. Some developers, while trying to write such code, may well stumble across the following definitions from <linux/hardirq.h>: It would seem that in_atomic() would fit the bill for any developer trying to decide whether a given bit of code needs to act in an atomic manner at any specific time. A quick grep through the kernel sources shows that, in fact, in_atomic() has been used in quite a few different places for just that purpose. There is only one problem: those uses are almost certainly all wrong. The in_atomic() macro works by checking whether preemption is disabled, which seems like the right thing to do. Handlers for events like hardware interrupts will disable preemption, but so will the acquisition of a spinlock. So this test appears to catch all of the cases where sleeping would be a bad idea. Certainly a number of people who have looked at this macro have come to that conclusion. But if preemption has not been configured into the kernel in the first place, the kernel does not raise the "preemption count" when spinlocks are acquired. 
So, in this situation (which is common - many distributors still do not enable preemption in their kernels), in_atomic() has no way to know if the calling code holds any spinlocks or not. So it will return zero (indicating process context) even when spinlocks are held. And that could lead to kernel code thinking that it is running in process context (and acting accordingly) when, in fact, it is not. Given this problem, one might well wonder why the function exists in the first place, why people are using it, and what developers can really do to get a handle on whether they can sleep or not. Andrew Morton answered the first question in a relatively cryptic way: in_atomic() is for core kernel use only. Because in special circumstances (ie: kmap_atomic()) we run inc_preempt_count() even on non-preemptible kernels to tell the per-arch fault handler that it was invoked by copy_*_user() inside kmap_atomic(), and it must fail. In other words, in_atomic() works in a specific low-level situation, but it was never meant to be used in a wider context. Its placement in hardirq.h next to macros which can be used elsewhere was, thus, almost certainly a mistake. As Alan Stern pointed out, the fact that Linux Device Drivers recommends the use of in_atomic() will not have helped the situation. Your editor recommends that the authors of that book be immediately sacked. Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again: The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves. This pattern is consistent through the kernel - once again, the GFP_ flags example stands out in this regard. 
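The convention Morton describes, where the caller states its context and the callee never guesses, can be illustrated with a toy model. The sketch below is ordinary Python, not kernel code: `ToyAllocator`, `Gfp`, `handle_interrupt()`, `handle_syscall()`, and the page counts are all invented for illustration. Only the GFP_KERNEL/GFP_ATOMIC distinction comes from the discussion above.

```python
"""Toy model of the 'caller tells callee' convention (not kernel code)."""
from enum import Enum


class Gfp(Enum):
    # Hypothetical stand-ins for the kernel's GFP_KERNEL and GFP_ATOMIC.
    KERNEL = "caller may sleep"
    ATOMIC = "caller must not sleep"


class ToyAllocator:
    def __init__(self, free_pages, reclaimable_pages):
        self.free_pages = free_pages
        self.reclaimable_pages = reclaimable_pages
        self.sleeps = 0  # number of times the allocator "slept" to reclaim

    def alloc(self, pages, flags):
        """Return True on success, False on failure (modelling a NULL return).

        The allocator never tries to detect its context; it trusts `flags`.
        """
        if self.free_pages < pages and flags is Gfp.KERNEL:
            # Waiting for reclaim is allowed only because the caller said so.
            self.sleeps += 1
            self.free_pages += self.reclaimable_pages
            self.reclaimable_pages = 0
        if self.free_pages < pages:
            return False
        self.free_pages -= pages
        return True


def handle_interrupt(allocator):
    # Models atomic context: the caller knows it cannot sleep and says so.
    return allocator.alloc(4, Gfp.ATOMIC)


def handle_syscall(allocator):
    # Models process context: sleeping is acceptable here.
    return allocator.alloc(4, Gfp.KERNEL)
```

In this model, the same allocation request fails immediately in the "interrupt" path and succeeds after a simulated sleep in the "system call" path. The allocator contains no context-detection logic at all; the knowledge lives entirely with the callers.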
But it's also clear that this practice has not been documented to the point that kernel developers understand that things should be done this way. Consider this recent posting from Rusty Russell, who understands these issues better than most: This flag indicates what the allocator should do when no memory is immediately available: should it wait (sleep) while memory is freed or swapped out (GFP_KERNEL), or should it return NULL immediately (GFP_ATOMIC). And this flag is entirely redundant: kmalloc() itself can figure out whether it is able to sleep or not. In fact, kmalloc() cannot figure out on its own whether sleeping is allowable or not. It has to be told by the caller. This rule is unlikely to change, so expect a series of in_atomic() removal patches starting with 2.6.26. Once that work is done, the in_atomic() macro can be moved to a safer place where it will not create further confusion.

Kernel markers and binary-only modules

Kernel markers are a mechanism which allows developers to put static tracepoints into the kernel. Once placed, these markers can be used by operations staff to trace well-known events in running systems without that staff having to know about kernel code. Solaris provides a long list of static tracepoints for use with DTrace, but Linux, thus far, has none. That situation should eventually change - static markers were only merged into the mainline in 2.6.24. But, as the developers start to look more seriously at markers, some interesting issues are coming up. One of those emerged as a result of this patch from Mathieu Desnoyers which allows proprietary modules to contain markers. The fact that current kernels do not recognize markers in binary-only modules is mostly an accident: markers are disabled in modules with any sort of taint flag set as a way to prevent kernel crashes - a kernel oops being a rather heavier-weight marker than most people wish to encounter.
Mathieu tightened that test in a way that allows markers in proprietary modules, saying "let's see how people react." Needless to say, he saw. One might well wonder why the kernel developers, not known for their sympathy toward proprietary modules in general, would want to consider letting those modules include static tracepoints. The core argument here is that static markers allow proprietary modules to export a bit more internal information to the kernel, and to their users. It is seen as a sort of (very) small opening up on the part of the proprietary module writer. Mathieu says: I think it's only useful for the end user to let proprietary modules open up a bit, considering that proprietary module writers can use the markers as they want in-house, but would have to leave them disabled on shipped kernels. The idea is that, by placing these tracepoints, module authors can help others learn more about what's going on inside the module and help people track down problems. The result should be a more stable kernel which - whether proprietary modules have been loaded or not - is generally considered to be a good thing. On the other hand, there's no shortage of developers who are opposed to extending any sort of helping hand to binary module authors. Giving those modules more access to Linux kernel internals, it is argued, only leads to trouble. Ingo Molnar put it this way: Why are we even arguing about this? Binary modules should be as isolated as possible - it's a totally untrusted entity and history has shown it again and again that the less infrastructure coupling we have to them, the better. Ingo also worries that allowing binary modules to use markers will serve to make the marker API that much harder to change in the future. Since that API is quite young, chances are good that changes will happen. As much as the kernel developers profess not to care about binary-only modules, the fact of the matter is that there are good reasons to avoid breaking those modules.
The testing community certainly gets smaller when testers cannot load the modules they need to make their systems work in the manner to which they have become accustomed. So it is possible that allowing proprietary modules to use markers could make the marker API harder to fix in future kernel releases. The grumbles have been loud enough that Mathieu's patch will probably not be merged for 2.6.25. The idea is likely to come back again, but not necessarily right away: the marker feature may have been merged in 2.6.24, but it would appear that 2.6.25 will be released with no actual markers defined in the source. It's not clear that binary-only module authors are pushing to add tracepoints when none of the other developers are doing so. Until somebody starts actually using static markers, debates on where they can be used will continue to be of an academic nature.

Distribution-friendly projects - Part 1

[Editor's note: This article, which looks at the interactions of software projects and distribution providers, will be presented in three parts.]

Introduction

In today's world most users of Linux don't build their system from scratch by downloading the sources of the applications and libraries they need and building them by hand. Most users will use one or more distributions (the ones that best suit their needs), and they'll stick with the packages provided by the distribution for as long as they can. Power users may know how to get the software they want and build it so it runs, but the average user won't go around looking for software that is not readily available to them. The job of a distribution is, of course, to provide as much software as its users will need, sometimes changing the software so that it suits the needs of its users better. The distribution's developers, the so-called downstream developers, have different responsibilities compared to the original software developers, the upstream developers.
The former are responsible directly to their users, while the latter are usually more focused on implementing their software correctly according to their own standards (which means, for instance, implementing a protocol exactly as described by the standard, or supporting a file format exactly as it should be). Most of the time, these two objectives are compatible with one another, and users face an interface that hides the details of the implementation. Sometimes, though, there are user requests that upstream developers won't acknowledge: for instance, to parse a file that was written improperly by a commonly-used tool (maybe a proprietary tool that does not support free software). In these cases, some distributions tend to edit the source, creating a modified version for that particular distribution, with a different behaviour, interface, or whatnot. It's because of cases like this, especially in the last few years, that there have been many arguments between original developers and distributions, which sometimes involved legal threats, forks or removal of software from distributions' repositories. It's not fun to watch these arguments going by, and sometimes it's all because of differences in opinion between the developers, or in how their experiences have affected their views. Starting with the idea that everybody wants to have the software they wrote used, this article will try to explain what distributors want and why they ask the original developers to cooperate toward that goal. People who have worked both as an upstream developer and as a downstream maintainer usually know what is being done with their code in a distribution and why. For people who have only seen one side, understanding the needs or the reasons of the other side can be very difficult.

Technical and philosophical needs

The majority of the points where upstream and downstream have different views can be divided into technical and philosophical points.
On the technical side, distributors need to make the software build on their system, without lots of workarounds, and it should follow the same behaviour as other software in their setup. On the philosophical side, they have needs relating to user requests and expectations. Users expect some consistency in how software looks and behaves on their system. Often, both of these kinds of matters relate to the policy (written or unwritten) of that distributor. While one might actually expect a philosophical debate between developers on formats and how to implement a protocol, it's difficult to understand how so many arguments are caused by different technical requests. Unfortunately, even the technical needs are often different between upstream projects and distributions. The only way to accommodate both is to provide choices, something that more often than not is considered bad by the upstream developers, who do not want the complication of too many choices. I sincerely doubt there will ever be a time when all the upstream developers and the downstream maintainers will be on the same page, but it is possible to at least try to understand what the other side wants, and see if it's possible to cover their needs without regressing, even if that means increasing the complexity a bit. It is true that most of today's tools, in every area, are more sophisticated and complex than their equivalents years ago (tens of years for computer tools, hundreds of years for other areas).

[This ends part 1 of this article. Part 2 will look at the technical needs of distributions and the upstream developers. Finally, part 3 will cover the philosophical concerns and present some conclusions. Stay tuned for part 2, which should air in two weeks.]

Striking gold in binutils

A new linker is not generally something that arouses much interest outside of the hardcore development community—or even inside it—unless it provides something especially eye-opening.
A newly released linker, called gold, has just that kind of feature, though, because it runs up to five times as fast as its competition. For developers who do a lot of compile-link-test cycles, that kind of performance increase can significantly increase their efficiency. Linking is an integral part of code development, but it can be invisible, as it is often invoked by the compiler. The sidebar accompanying this article is meant for non-developers or those in need of a refresher about linker operation. For those who want to know even more, the author of gold, Ian Lance Taylor, has a twenty-part series about linker internals on his weblog, starting with this entry. For Linux systems, the GNU Compiler Collection (GCC) has been the workhorse by providing a complete toolchain to build programs in a number of different languages. It uses the ld linker from the binutils collection. With the announcement that gold has been added to binutils, there are now two choices for linking GCC-compiled programs.

A linker overview

For non-developers, a quick overview of the process that turns source code into executable programs may be helpful. Compilers are programs that turn C—or other high-level languages—into object code. Linkers then collect up object code and produce an executable. Usually the linker will not only operate on object code created from a project's source, but will also reference libraries of object code—the C runtime library libc for example. From those objects, the linker creates an executable program that a user can invoke from the command line. The linker allows program code in one file to refer to a code or data object in another file or library. It arranges that those references are usable at run time by substituting an address for the reference to an object. This "links" the two properly in the executable.
Things get more complicated when considering shared libraries, where the library code is shared by multiple concurrent executables, but this gives a rough outline of the basics of linker operation. The intent is for gold to be a complete drop-in replacement for ld—though it is not quite there yet. It is currently lacking support for some command-line options and Linux kernels that are linked with it do not boot, but those things will come. It also currently only supports x86 and x86_64 targets, but for many linker jobs, gold seems to be working well. The speed seems to be very enticing to some developers, with Bryan O'Sullivan saying: When I switched to using gold as the linker, I was at first a little surprised to find that it actually works at all. This isn't especially common for a complicated program that's just been committed to a source tree. Better yet, it's as fast as Ian claims: my app now links in 2.6 seconds, almost 5.4 times faster than with the old binutils linker! Performance was definitely the goal that Taylor set for gold development. It supports ELF (Executable and Linking Format) objects and runs on UNIX-like operating systems only. Only supporting one object/executable format, along with a fresh start and an explicit performance goal are some of the reasons that gold outperforms ld. Tom Tromey likes the looks of the code: I looked through the gold sources a bit. I wish everything in the GNU toolchain were written this way. It is very clean code, nicely commented, and easy to follow. It shows pretty clearly, I think, the ways in which C++ can be better than C when it is used well. Because the implementation is geared for speed, Taylor used techniques that may confuse some. He has some concerns about the maintainability of his implementation: While I think this is a reasonable approach, I do not yet know how maintainable it will be over time. 
State machine implementations can be difficult for people to understand, and the high-level locking is vulnerable to low-level errors. I know that one of my characteristic programming errors is a tendency toward code that is overly complex, which requires global information to understand in detail. I've tried to avoid it here, but I won't know whether I succeeded for some time. Overall, it seems to be getting a nice reception by the community, with O'Sullivan commenting that he is "looking forward to the point where gold entirely supplants the existing binutils linker. I expect that won't take too long, once Mozilla and KDE developers find out about the performance boost." Once gold gets to that point, Taylor is already thinking about concurrent linking—running compiler and linker at the same time—as the next big step. There are two other ongoing projects that are working with the greater GCC ecosystem in interesting ways: quagmire and ggx. Quagmire is an effort to replace the GNU configure and build system—consisting of autoconf, automake, and libtool—with something that depends solely on GNU make. Currently, that system uses various combinations of the shell, m4, and portable makefiles to make the building and installation of programs easy—the famous "./configure; make" command line. The tools were written that way to try to ensure that users did not need to install additional packages to configure and build GNU tools. Quagmire, which has roots in a posting by Taylor, recognizes that GNU make is ubiquitous, so basing a system around that makes a great deal of sense. The ggx project is Anthony Green's step-by-step procedure to create an entire toolchain that can build programs for a processor architecture that he is creating as a thought experiment. The basic idea is to design the instruction set based on the needs of the compiler, in this case GCC, rather than the needs of the hardware designers.
He is using GCC's ability to be retargeted for new architectures, along with its simulation capabilities to create a CPU that he can write programs for. As of this writing, he has a "hello world" program working, along with large chunks of the GCC test suite passing. Well worth a look.

Introducing Sphinx, the Python documentation toolchain

The first public release of the Python Sphinx documentation system, which should not be confused with the CMU Sphinx speech recognition project, has been announced. Sphinx is a tool that makes it easy to create intelligent and beautiful documentation for Python projects, written by Georg Brandl and licensed under the BSD license. It was originally created to translate the new Python documentation, but has now been cleaned up in the hope that it will be useful to many other projects. (Of course, this site is also created from reStructuredText sources using Sphinx!) The Sphinx introduction states: "The focus is on hand-written documentation, rather than auto-generated API docs. Though there is limited support for that kind of docs as well (which is intended to be freely mixed with hand-written content), if you need pure API docs have a look at Epydoc, which also understands reST." An interesting feature of the Sphinx web pages is the inclusion of their own document source code. The document source code from the previously mentioned Sphinx introduction page is a good place to go to get a look at the reStructuredText language that Sphinx uses. More information on that language can be found in the A ReStructuredText Primer, the Quick reStructuredText user reference and the reStructuredText Cheat Sheet. The Sphinx feature list includes:

- Cross-platform, works under a variety of operating systems.
- Support for the HTML, Windows HTML Help, and LaTeX output formats.
- Can use Jinja from the Django project for creating HTML templates.
- Includes semantic markup and automatic links for cross-referencing.
- The documentation tree is hierarchically structured.
- Indexes are automatically generated.
- Sphinx can optionally use the Pygments programming language syntax highlighter.
- Supports a number of extensions for code snippet testing and more.

The Python source code and related files for Sphinx are available for download here. The change log shows that a number of recent releases have been made. As of this writing, the current version is release 0.1.61950, dated March 26, 2008. If you need to maintain a collection of web-based or print-based project documentation, Sphinx could be a very useful tool.

Toward a free metaverse

Last month, an article about another attempt to free the proprietary Ryzom game expressed frustration with the implied idea that the free software community could not, on its own, create a game experience comparable to Ryzom. One of the resulting comments took issue with (what was seen as) a dismissive attitude toward the Second Life client and pointed out some of the work which is being done based on that client. So your editor decided to take another look. The bottom line is this: the work being done in this area is still in an early and unstable state, but it does have the potential to open a new frontier for free software in the area of virtual environments. The Second Life client for Linux is now in a beta release. "Beta," in this case, means that all of the features have, in some way, been implemented; now it's just a matter of making it all actually work. Your editor found the client to be slow, unwieldy, crash-prone, and very fussy about its graphics environment. Your editor's well-supported (in X) Intel-based desktop was not adequate for this client, for example; the associated documentation recommends a long list of cards which (for now) are only supported with proprietary drivers. Still, on the right system, the client is able to render three-dimensional worlds with the same quality that, well, Second Life has on any platform.
An alternative is OpenViewer, a C#/Mono-based, BSD-licensed viewer project. Your editor had little luck getting this client going, but the screenshots are nice. The developers appear to have made significant progress toward the creation of a functional, three-dimensional client; this is a project to watch. Less far along is the Aether project, which is working on an OpenViewer-based client meant to run within Firefox; thus far, it has a nice design diagram but not much else. There is also RealXtend, a project based on the Second Life client which is emphasizing performance and visual quality. Unfortunately, it also seems to be emphasizing Windows support, so your editor did not give it a try. Free software clients are certainly an important tool to have; we will not be able to access this kind of virtual environment without them. But it would be a real shame if these clients simply facilitated a world where we use free clients to access locked-down, proprietary virtual worlds on somebody else's server. What would be much better would be the ability to create our own virtual worlds - using free software, of course - and to link those worlds into a larger virtual universe. That is the formula which made the World Wide Web (and many other Internet services) work, and it should certainly be applicable in this context as well. The good news is that people are working in this area. One project, OpenSim, has the look of something which is about to achieve much wider awareness as its features mature. In short, OpenSim is a virtual world server which can be deployed to create environments much like what one would find in Second Life. It works with the Second Life client and with OpenViewer as well, and it presents a very similar experience - at least, in the virtual worlds which have been deployed so far. Since it's free software, it can be customized toward the creation of different kinds of environments, including role-playing games and such.
It is written with C# and Mono - seemingly a common choice for this kind of software. The Mono environment, for all its faults and potential pitfalls, may well make it easier to create a cross-platform application with the requisite features. What makes OpenSim really interesting, though, is its ability to connect servers together in a "grid" mode. Once this is done, a virtual world is not limited to a single entity's server (or imagination). Servers across the net can be interconnected into a single, larger world. This is the feature which has the potential to take OpenSim from another interesting project into something which transforms the net. There are a number of people organizing grids with OpenSim now; there is a list of public grids on the OpenSim site. Some of them appear to be relatively proprietary operations offering the opportunity to buy virtual land - though subprime loans are unavailable. Others allow anybody to connect their server into the grid and become part of the whole. These grids appear, in general, to be in a sort of early adopter state at the moment, but much of the fundamental functionality is there. How hard could it be to make it all work properly at this point? The answer to that question, of course, is "quite hard." But the fact remains that people are working on this very interesting problem, and they are making significant progress toward solving it. These projects bear watching; they may well be planting the seeds of the systems we will all be using in the coming years.

Predictive ELF bitmaps

When the kernel executes a program, it must retrieve the code from disk, which it normally does by demand paging it in as required by the execution path. If the kernel could somehow know which pages would be needed, it could page them in more efficiently. Andi Kleen has posted an experimental set of patches that do just that.
Programs do not know about their layout on disk, nor is their path through the executable file optimized to reduce seeking, but with some information about which pages will be needed, the kernel can optimize the disk accesses. If one were to gather a list of the pages that get faulted in as a program runs, that information could be saved for future runs. It could then be turned into a bitmap indicating which of the pages should be prefetched. Once you have such a bitmap, where to store it becomes a problem. Kleen's method uses a "hack" to the ELF format on disk, putting the bitmap at the end of the executable. This has a number of drawbacks: a seek to get the info, modifying the executable each time you train, and only allowing a single usage pattern system-wide. It does have one very nice attribute, though: the bitmap and executable stay in sync; if the executable changes, due to an upgrade for instance, the bitmap would get cleared in the process. Alternative bitmap storage locations - somewhere in users' home directories for example - do not have this property. Andrew Morton questions whether this need be done in the kernel at all: Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD, use /proc/self/maps and the new pagemap code to work out which pages of which files were faulted in, write that info into the elf file (or a separate per-executable shadow file), then use that info the next time the app is executed, either with an LD_PRELOAD or just a wrapper. Ulrich Drepper does not want to see the ELF format abused in the fashion it was for this patch; Kleen does not either, but he used it as an expedient. Drepper thinks the linker should be taught to emit a new header type which would store the bitmap. It would be near the beginning of the ELF file, eliminating the seek. A problem with that approach is that old binaries would not be able to take advantage of the technique; a re-linking would be required. 
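As a rough illustration of the bitmap idea described above, here is a minimal Python sketch that turns a list of faulted-in file offsets into a one-bit-per-page bitmap and reads it back. The 4096-byte page size and the function names (build_bitmap, pages_to_prefetch) are assumptions made for the example; this is not the format used by Kleen's patches.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def build_bitmap(faulted_offsets, file_size):
    """Return a bytearray with one bit set per page that was faulted in."""
    n_pages = (file_size + PAGE_SIZE - 1) // PAGE_SIZE
    bitmap = bytearray((n_pages + 7) // 8)
    for offset in faulted_offsets:
        page = offset // PAGE_SIZE
        if page < n_pages:  # ignore offsets beyond the end of the file
            bitmap[page // 8] |= 1 << (page % 8)
    return bitmap


def pages_to_prefetch(bitmap, n_pages):
    """List the page indices whose bit is set in the bitmap."""
    return [p for p in range(n_pages) if bitmap[p // 8] & (1 << (p % 8))]


if __name__ == "__main__":
    bm = build_bitmap([0, 100, 8192, 36869], 10 * PAGE_SIZE)
    print(pages_to_prefetch(bm, 10))
```

A training run would feed in the offsets observed while the program executed; a later run would walk the bitmap and prefetch only the marked pages.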
Then the question arises, how does that bitmap get initialized? Drepper suggests that systemtap be used: To fill in the bitmaps one can have separate a separate tool which is explicitly asked to update the bitmap data. To collect the page fault data one could use systemtap. It's easy enough to write a script which monitors the minor page faults for each binary and writes the data into a file. The binary update tool and can use the information from that file to generate the bitmap. Kleen's patch walks the page tables for a process when it is exiting, setting a bit in the bitmap if that page has been faulted in. Drepper sees this as suboptimal: Over many uses of a program all kinds of pages will be needed. Far more than in most cases. The prefetching should really only cover the commonly used code paths in the program. If you pull in everything, this will have advantages if you have that much page cache to spare. In that case just prefetching the entire file is even easier. No, such an improved method has to be more selective. The problem is in finding the balance between just prefetching the entire executable - which might be very wasteful - and prefetching the subset of pages that are most commonly used. It will take some heuristics to make that decision. As Drepper points out, recording the entire runtime of a program "will result in all the pages of a program to be marked (unless you have a lot of dead code in the binary and it's all located together)." The place where Drepper sees a need for kernel support is in providing a bitmap interface to madvise() so that any holes in the pages that get prefetched do not get filled by the readahead mechanism. The current interface would require a call to madvise() for each contiguous region, which could add up to a large number of calls. Both he and Morton favor the bulk of the work being done in user space. 
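To see why one madvise() call per contiguous region can add up, consider this small Python sketch. It groups a per-page "wanted" list into (first page, page count) runs; each run would correspond to one madvise() call under the current interface. The function name contiguous_runs and the list-of-flags input are assumptions made for illustration only.

```python
def contiguous_runs(page_flags):
    """Group wanted pages into (first_page, page_count) runs."""
    runs = []
    start = None
    for index, wanted in enumerate(page_flags):
        if wanted and start is None:
            start = index  # a new run begins here
        elif not wanted and start is not None:
            runs.append((start, index - start))  # a hole ends the run
            start = None
    if start is not None:
        runs.append((start, len(page_flags) - start))
    return runs


if __name__ == "__main__":
    flags = [1, 1, 0, 1, 0, 0, 1, 1, 1]
    for first, count in contiguous_runs(flags):
        print(f"one call covering pages {first}..{first + count - 1}")
```

A heavily fragmented pattern, where every other page is wanted, needs one call for each wanted page; a bitmap interface would describe the same set in a single call.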
Overall, there is lots more work to do before "predictive bitmaps" make their way into a Linux system - if they ever do. To start with, some benchmarking will have to be done to show that performance improves enough to consider making a change like this. David Miller expresses some pessimism about the approach: I wrote such a patch ages ago as well. Frankly, based upon my experiences then and what I know now, I think it's a lose to do this. It is an interesting idea though, one that will likely crop up again if this particular incarnation does not go anywhere. Since the biggest efficiency gain is from reducing seeks, though, it may not be interesting long-term. As Morton says, "solid-state disks are going to put a lot of code out of a job." Voting machine integrity through transparency It is hard to believe that governments would spend money on voting equipment that they are not allowed to test, but that is exactly what multiple counties in New Jersey appear to have done. They are certainly not alone; many other places are likely to have the same restrictions on "their" voting machines. This begs the question: where are the free software voting systems? Union County wanted to ask Ed Felten to look at the voting machines it purchased from Sequoia Voting Systems because of several anomalies - less charitably known as miscounts - observed when using them in the primary elections. Once Sequoia got wind of the plan, they emailed Felten a nastygram because he might engage in "non-compliant analysis" of the machines in violation of the Sequoia license. It seems quite likely that is exactly what Felten and the county clerk had in mind, as a third-party analysis is the only sensible way to evaluate voting machines. Other jurisdictions have done better of late, with Felten's Freedom to Tinker weblog noting that California has denied certification for two voting machines from Election Systems & Software (ES&S). 
California Secretary of State Debra Bowen has been at the forefront of trying to ensure that voting machines work correctly. LWN's home state of Colorado also decertified a number of voting machines, but, like the earlier California study, it was done after those machines were purchased. As in California, it seems likely that Colorado will be using those machines in November. Things are getting a little better, perhaps, but no one has, as yet, tried to take on the four major voting machine makers with a system that is built with security in mind. There is no reason that the source code for a voting machine could not be made available for study. The voting machine vendors claim all sorts of proprietary secret sauce in their code, but that isn't the real reason they hide it. Covering up their shoddy code is much more likely. Every independent review of voting machines has found numerous, fundamental security flaws that should make anyone with an interest in the integrity of the election process cringe. Many of those analyses were done without the source code, so there is little doubt that even uglier problems would have been found in the code itself. It just cannot be that difficult to produce something vastly more secure than what is made available today. One could speculate about the motives of these companies, but instead looking at what could be built, with mostly off-the-shelf software, is more fruitful. The place to start is by hiring a few good security-minded developers, while lining up an independent review team. One might guess that Felten and his associates would be a good place to start. A stripped down Linux system could very easily be the basis for a voting machine, but other free software choices would serve just as well. Some user interface code for touchscreens and alternative input methods for those with disabilities would need to be written. 
Some kind of printing output device would need to be made a part of the system so that voter-verifiable audit trails - better yet, ballots that can be put into a locked box - can be created. Source code availability does not, in and of itself, ensure vote security. That code needs to be reviewed by as many experts as can be found. In addition, there needs to be some mechanism to show that the source code being reviewed is the same as that being run. For that reason, the system itself might run on some kind of Trusted Platform Module (TPM) chip so that interested parties can verify that the published code is the same as that running on the system. If the system runs Linux, it might use the integrity management patches for that. Most importantly, the outside interfaces (network, USB, PCMCIA, etc.) to the device would either not be present or be very tightly controlled. Any kind of removable vote recording memory would need adequate cryptographic safeguards to eliminate tampering between vote taking and vote tabulating machines. Instead of an emphasis on PR, schmoozing, and bamboozling non-technical folks, the focus of a free software voting system would be on transparency. The number one goal would be to give everyone, from the least technical voter to the Bruce Schneiers of the world, confidence in the machines and the process. It is hard to fathom how anyone could want anything less. A creative example of the value of free drivers Free operating systems differ from the proprietary variety in a number of ways. One of the differences which is most evident to all users is in the provision of device drivers. With free systems, device drivers are free software, provided with the system itself. Proprietary systems tend to provide relatively few drivers; instead, proprietary drivers are shipped with the hardware itself and installed separately. 
Anybody who wonders about which model works better would be well advised to look at the events of March 28, when Creative Labs shut down an outside developer who had been working to improve Creative's drivers. Creative is, of course, a long-time manufacturer of audio hardware. Opinions vary on the quality of that hardware, but there can be no doubt that Creative has been successful in this market. Creative's customers have found, though, that moving to Vista has been an unusually painful experience, even by the standards of that particular system. It seems that Creative's drivers have failed to provide the same level of functionality found in previous versions, leaving customers with crippled hardware. Strangely enough, said customers have not been entirely pleased with this state of affairs. Enter a developer called "Daniel_K". Daniel took the time to figure out how the hardware worked and to patch Creative's drivers to, once again, provide access to the full capability of the hardware. He then made those drivers available to others. Creative hardware owners were happy about this: somebody had finally managed to solve the problems they had been complaining about. One would have expected Creative to be happy too; happy customers tend to be good for business. That's not the way of it, though. Instead, Creative removed links to the fixed drivers from its forums and posted a public cease-and-desist letter. According to Creative's Phil O'Shaughnessy: By enabling our technology and IP to run on sound cards for which it was not originally offered or intended, you are in effect, stealing our goods. When you solicit donations for providing packages like this, you are profiting from something that you do not own. If we choose to develop and provide host-based processing features with certain sound cards and not others, that is a business decision that only we have the right to make. There can be little doubt that Creative is operating within its legal rights here. 
It has retained proprietary rights to its driver software, and it has imposed the usual sort of "thou shalt not reverse engineer" EULA on its users. So, while Daniel_K may (or may not) have been able to legally reverse engineer the driver (depending on his location), he almost certainly did not have the right to redistribute modified versions of Creative's drivers. Asking for donations to help him continue this activity will not have made him any friends at Creative either. When dealing with other peoples' proprietary software in this manner, one should not be surprised to get shutdown notices. Creative may be on solid ground legally, but it still makes sense to look at what is going on here. One might have attributed the driver problems to a lack of competence at Creative, or, perhaps, to the general sort of misery that (your editor has heard) goes along with Vista. Instead, Creative's crippled drivers were the result of a "business decision." Rather than allow its customers to get the most out of the hardware they thought they owned, Creative decided to restrict that functionality, presumably as a way of motivating those customers to buy newer, shinier, better-supported hardware. Daniel_K, by making Creative's customers happier, was threatening Creative's chosen business strategy. Now consider a company whose hardware is supported by free drivers. That company lacks the ability to use crippled drivers as a tool to "encourage" customers to replace their hardware. Instead, that company has every incentive to provide the best hardware possible and to ensure that said hardware works to its fullest capability. Such a company would welcome an outsider who made their products work better; those outsiders would be more likely to receive job offers than cease-and-desist letters. Rather than calling out the lawyers, this company could focus on the business of being a hardware company. Your editor knows which sort of company he would (and does) choose to buy hardware from. 
Free drivers are not just a path toward higher-quality support, though that is typically the result. They are not just a way to help ensure that the kernel as a whole remains stable and debuggable. And free drivers are not just a way to help ensure that all can learn and benefit from the work which was done to get the hardware working. They are also a way to avoid the threat of manipulation by hardware vendors who have decided that providing the best value for customers is no longer a winning business strategy. That is a sort of freedom which is worth having. Debian Project Leader Election 2008 The Debian Project Leader election is well underway. The debate is over and the first call for votes has gone out. If it seems like the process is going faster this year, that's because it is. Last year a constitutional amendment to reduce the length of the DPL election process was adopted by the developers. There were three candidates nominated for this year's election; Marc Brockschmidt, Raphaël Hertzog and Steve McIntyre. Information about this election can be found on this year's vote page. Steve McIntyre has been a Debian Developer for more than 11 years. During that time he acquired a wide range of packaging experience, worked on creating the official CDs (and DVDs) and hosting machines used by Debian. Steve also served as Assistant Project Leader under Anthony Towns, so he has some idea of what the job entails. This is not the first time he's run for DPL either. In addition to this year's platform, his 2006 and 2007 platforms are also available. While Steve has no plans to appoint a DPL team, he is willing to delegate tasks when appropriate. His goals include improving communications within the project and improving the workflow, getting people to ask for help when they need it or to step down when they can't devote enough time to the job. In my opinion, a key part of working effectively is honesty. 
We can all suffer from a lack of time to do the jobs that we've promised to do. After all, real life has a nasty habit of intruding on our so-called "spare" time. So long as we don't let things delay too far, we can cope and still contribute. But at some point, we need to be more honest with ourselves and actually admit that we can't continue with the jobs that we've promised to do. It's a hard thing to do, but in a friendly community where we're all working together towards a common goal there should be no shame in asking for help. Raphaël Hertzog is also no stranger to DPL elections. He ran in 2002 and 2007, in addition to this year. Raphaël has proposed a small team of two other individuals (Moritz Muehlenhoff and Lucas Nussbaum) to help him with the DPL duties. His goals include making Debian more visible and recruiting more contributors. While the number of packages in Debian increased a lot since 2001, the number of active developers stayed the same. We could definitely use more developers to continue increase the quality of our distribution (teams with hundreds of bugs are quite common). We made a first step with the Debian Maintainer proposal, but we can do more. I'm not saying that we should give upload rights to less skilled people: we don't want to compromise on quality. He would also like to improve the core teams such as keyring managers, NM/DAM, ftpmasters, and the press team. Unofficial services that have proved useful (mentors.debian.net and backports.org) should be integrated officially into Debian. Marc Brockschmidt has been a Debian Developer since 2004 and has been involved in many parts of Debian since then, including helping with the New Maintainer process, as an AM to dozens of people, at the NM Frontdesk and working with the release team. He also helps to manage a network of hosts used for autobuilding, porting and other Debian-related services. 
Improving communications is a popular goal for DPL candidates, but Marc has some thoughts on that: Before writing this platform, I had a look at the platforms of the past years and was amazed that nearly everyone talked about "improving communication", usually meaning that flaming shouldn't be allowed. I don't think this is possible - we can hardly replace all involved developers by cuddly stuffed animals. Good software developers have a strong opinion about topics dear to their heart, two good developers usually have two different opinions. Discussion, even bordering on flames, is OK - as long as it leads to a result. He would like to see more "Bits from ..." mails on debian-devel-announce for better internal communication. He would also like to see better presentation of Debian to outsiders. Like Raphaël, he would like backports.org to become an official Debian service. Summer of Code has been useful in bringing together some cool ideas with people who can work on them. Marc would like to see that wiki page remain active throughout the year. Marc admits that he doesn't have as much free time as the DPL will take, and plans to delegate heavily, especially finding others to present Debian to the rest of the world at conferences. Voting for these candidates will be open until April 13 and the term for the new DPL will start soon after, on April 17, 2008. Toward better direct I/O scalability Linux enthusiasts like to point out just how scalable the system is; Linux runs on everything from pocket-size devices to supercomputers with several thousand processors. What they talk about a little bit less is that, at the high end, the true scalability of the system is limited by the sort of workload which is run. CPU-intensive scientific computing tasks can make good use of very large systems, but database-heavy workloads do not scale nearly as well. There is a lot of interest in making big database systems work better, but it has been a challenging task. 
Nick Piggin appears to have come up with a logical next step in that direction, though, with a relatively straightforward set of core memory management changes. For some time, Linux has supported direct I/O from user space. This, too, is a scalability technology: the idea is to save processor time and memory by avoiding the need to copy data through the kernel as it moves between the application and the disks. With sufficient programming effort, the application should be able to make use of its superior knowledge of its own data access patterns to cache data more effectively than the kernel can; direct I/O allows that caching to happen without additional overhead. Large database management systems have had just that kind of programming effort applied to them, with the result that they use direct I/O heavily. To a significant extent, these systems use direct I/O to replace the kernel's paging algorithms with their own, specialized code. When the kernel is asked to carry out a direct I/O operation, one of the first things it must do is to pin all of the relevant user-space pages into memory and locate their physical addresses. The function which performs this task is get_user_pages(): A successful call to get_user_pages() will pin len pages into memory, those pages starting at the user-space address start as seen in the given mm. The addresses of the relevant struct page pointers will be stored in pages, and the associated VMA pointers in vmas if it is not NULL. This function works, but it has a problem (beyond the fact that it is a long, twisted, complex mess to read): it requires that the caller hold mm->mmap_sem. If two processes are performing direct I/O within the same address space - a common scenario for large database management systems - they will contend for that semaphore. This kind of lock contention quickly kills scalability; as soon as processors have to wait for each other, there is little to be gained by adding more of them. 
There are two common approaches to take when faced with this sort of scalability problem. One is to go with more fine-grained locking, where each lock covers a smaller part of the kernel. Splitting up locks has been happening since the initial creation of the Big Kernel Lock, which is the definitive example of coarse-grained locking. There are limits to how much fine-grained locking can help, though, and the addition of more locks comes at the cost of more complexity and more opportunities to create deadlocks. The other approach is to do away with locking altogether; this has been the preferred way of improving scalability in recent years. That is, for example, what all of the work around read-copy-update has been doing. And this is the direction Nick has chosen to improve get_user_pages(). Nick's core observation is that, when get_user_pages() is called on a normal user-space page which is already present in memory, that page's reference count can be increased without needing to hold any locks first. As it happens, this is the most common use case. Behind that observation, though, are a few conditions. One is that it is not possible to traverse the page tables if those tables are being modified at the same time. To be guaranteed that this will not happen, the kernel must, before heading into the page table tree, disable interrupts in the current processor. Even then, the kernel can only traverse the currently-running process's page tables without holding mmap_sem. Lockless operation also will not work whenever pages which are not "normal" are involved. Some cases - non-present pages, for example - are easily detected from the information found in the page tables themselves. But others, such as situations where the relevant part of the address space has been mapped onto device memory with mmap(), are not readily apparent by looking at the associated page table entries. 
In this case, the kernel must look back at the controlling vm_area_struct (VMA) structure to see what is going on - and that cannot be done without holding mmap_sem. So it looks like there is no way to find out whether lockless operation is possible without first taking the lock. The solution here is to grab a free bit in the page table entry. The PTE for a page which is present in memory holds the physical page frame address. In such addresses, the bottom 12 bits (for architectures using 4096-byte pages) will always be zero, so they can be dedicated to other purposes. One of them is used to indicate whether the page is present in memory at all; others indicate writability, whether it's a user-space page, whether it is dirty, etc. Nick's patch grabs one of the few remaining bits and calls it "PAGE_BIT_SPECIAL," indicating "special" pages. These are pages which, for whatever reason, do not have a readily-accessible struct page associated with them. Marking "special" pages in the page tables can help in a number of places; one of those is making it possible to determine whether lockless get_user_pages() is possible on a given page. Once these pages are properly marked in the page tables, it is possible to write a function which makes a good attempt at a lockless get_user_pages(). Nick's proposal is called fast_gup(): This function has a much simpler interface than get_user_pages() because it does not handle many of the cases that get_user_pages() can deal with. It only works with the current process's address space, and it cannot return pointers to VMA structures. But it can iterate through a set of page tables, testing each page for presence, writability, and "non-specialness," and incrementing each page's reference count (thus pinning it into physical memory) in the process. If it works, it's very fast. If not, it undoes things then falls back to get_user_pages() to do things the slow, old-fashioned way. How much is this worth? 
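The control flow described above (test the flag bits, take references, and back out on the first page that does not qualify) can be modeled in a few lines of Python. This is a user-space toy and not kernel code: the bit positions, the dictionary of reference counts, and the names make_pte() and fast_pin() are all assumptions made for the sketch.

```python
PAGE_SHIFT = 12                    # 4096-byte pages
FLAG_MASK = (1 << PAGE_SHIFT) - 1  # low 12 bits are free to hold flags

# Illustrative bit assignments; not guaranteed to match any real architecture.
PTE_PRESENT = 1 << 0
PTE_WRITE = 1 << 1
PTE_USER = 1 << 2
PTE_SPECIAL = 1 << 9


def make_pte(frame_addr, flags):
    """Combine a page-aligned frame address with flag bits."""
    assert frame_addr & FLAG_MASK == 0, "frame address must be page-aligned"
    return frame_addr | flags


def fast_pin(ptes, refcounts, write):
    """Try to pin every page without a lock.

    Returns True on success. On the first unsuitable page, drops the
    references already taken and returns False so that the caller can
    fall back to the slow, locked path.
    """
    pinned = []
    for pte in ptes:
        ok = bool(pte & PTE_PRESENT) and bool(pte & PTE_USER)
        ok = ok and not (pte & PTE_SPECIAL)
        if write and not (pte & PTE_WRITE):
            ok = False
        if not ok:
            for frame in pinned:  # undo the partial work
                refcounts[frame] -= 1
            return False
        frame = pte & ~FLAG_MASK
        refcounts[frame] = refcounts.get(frame, 0) + 1
        pinned.append(frame)
    return True


if __name__ == "__main__":
    normal = PTE_PRESENT | PTE_USER | PTE_WRITE
    counts = {}
    print(fast_pin([make_pte(0x1000, normal)], counts, write=True), counts)
```

The point of the "special" bit in this model is that the decision to bail out is made from the page table entry alone, without consulting any other structure.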
Nick claims a 10% performance improvement running "an OLTP workload" (one of those unnameable benchmark suites, perhaps) using IBM's DB2 DBMS system on a two-processor (eight-core) system. The performance improvement, he says, may be greater on larger systems. But even if it remains at "only" 10%, this work is a clear step in the right direction for this kind of workload. [Update: this interface was merged for the 2.6.27 kernel; the name was changed to get_user_pages_fast() but it is otherwise the same.] Where 2.6.25 came from The Linux Foundation has just published a white paper, written by Greg Kroah-Hartman, Amanda McPherson, and your editor, reviewing the origins of the code merged into the kernel from 2.6.11 through 2.6.24. As LWN readers know, the 2.6.25 kernel is getting close to release. So this seems like as good a time as any to look at what happened with the process in this release cycle. As of this writing, 12,269 individual changesets have been merged for 2.6.25 - a new record. That beats the previous record (2.6.24, with a mere 10,353 changesets) by almost 2,000. There were 1,174 individual developers involved with 2.6.25, 419 of whom contributed one single patch. All told, those developers worked for 159 employers (that your editor could identify). The changes added 766,979 lines of code and removed 399,791, for a total growth of 367,188 lines. Here is an updated version of a plot that your editor has been fond of showing during talks in recent years: This plot shows a cumulative count of lines changed over time, with kernel release dates added in. The effects of the merge window policy can be seen in the stair-step appearance of the plot. The steps appear to be getting bigger, but the time between releases has also increased slightly, so the overall rate of change remains roughly constant. It is a high rate, with over five million lines changed - well over half the total - in the last two years. So who did this work? 
Here is the traditional table of the most active developers in the 2.6.25 series: There are some familiar names on this list, but also some new ones. Bartlomiej Zolnierkiewicz contributed more changesets than any other developer; his work is contained entirely within the IDE subsystem. Patrick McHardy works in the networking area, mostly (but not exclusively) with the netfilter subsystem. Adrian Bunk continues to make small fixes all over the tree and to relentlessly hunt down unused code for removal. Ingo Molnar remains busy in his new role as one of the x86 maintainers; scheduler work also accounts for a number of his changes. Paul Mundt maintains the SuperH architecture. The picture is a little different when one considers how many lines of code were changed. Jesper Nilsson's work was done within the CRIS architecture. David Howells works all over the tree; his largest contribution was the addition of the MN10300 architecture code. Eliezer Tamir contributed the bnx2x (Broadcom Everest) network driver, and Kumar Gala works with the PowerPC architecture. There is relatively little change in the lists of employers associated with all of this work (please remember that the numbers associated with employers are necessarily approximate): As usual, one can also look at who applies a Signed-off-by header to code for which they are not the author. These headers illustrate the chain of trust which gets code into the kernel. For 2.6.25, the top approvers of patches are: Some of these developers are quite busy; Andrew Morton is signing off more than twenty patches every day - weekends included. The gatekeepers to the kernel continue to work for a relatively small number of companies, with the top ten employers accounting for over 75% of all non-author signoffs. All told, all these numbers paint a picture of a development process which is healthy and continues to set a fast pace. 
It incorporates work from an increasingly large community of developers who are able to work in a highly cooperative manner despite the fact that their employers are fierce competitors. There are very few projects like it. (Thanks to Greg Kroah-Hartman for his help in the creation of these statistics). UBIFS The steady growth in flash-based memory devices looks set to transform parts of the storage industry. Flash has a number of advantages over rotating magnetic storage: it is smaller, has no moving parts, requires less power, makes less noise, is truly random access, and it has the potential to be faster. But flash is not without its own idiosyncrasies. Flash-based devices operate on much larger blocks of data: 32KB or more. Rewriting a portion of a block requires running an erase cycle on the entire block (which can be quite slow) and writing the entire block's contents. There is a limit to the number of times a block can be erased before it begins to corrupt the data stored there; that limit is low enough that it can bring a premature end to a flash-based device's life, especially if the same block is repeatedly rewritten. And so on. A number of approaches exist for making flash-based devices work well. Many devices, such as USB drives, include a software "flash translation layer" (FTL); this layer performs the necessary impedance matching to make a flash device look like an ordinary block device with small sectors. Internally, the FTL maintains a mapping between logical blocks and physical erase blocks which allows it to perform wear leveling - distributing rewrite operations across the device so that no specific erase block wears out before its time - though some observers question whether low-end flash devices bother to do that. The use of FTL layers makes life easy for the rest of the system, but it is not necessarily the way to get the best performance out of the hardware. 
If you can get to the device directly, without an FTL getting in the way, it is possible to create filesystems which embody an awareness of how flash works. Most of our contemporary filesystems are designed around rotating storage, with the result that they work hard to minimize time-consuming operations like head seeks. A flash-based filesystem need not worry about such issues, but it must be concerned about things like erase blocks instead. So making the best use of flash requires a filesystem written with flash in mind. The main filesystem for flash-based devices on Linux is the venerable JFFS2. This filesystem works, but it was designed for devices which are rather smaller than those available today. Since JFFS2 must do things like rebuild the entire directory tree at mount time, it can be quite slow on large devices - for relatively small values of "large" by 2008 standards. JFFS2 is widely seen as reaching the end of its time. A more contemporary alternative is LogFS, which has been discussed on these pages in the past. This work remains unfinished, though, and development has been relatively slow in recent times; LogFS has not yet been seriously considered for merging into the mainline. A more recent contender is UBIFS; this code is in a state of relative completion and its developers are asking for serious review. UBIFS depends on the UBI layer, which was merged for 2.6.22. UBI ("unsorted block images") is not, technically, an FTL, but it performs a number of the same functions. At the heart of UBI is a translation table which maps logical erase blocks (LEBs) onto physical erase blocks (PEBs). So software using UBI to access flash sees a device providing a simple set of sequential blocks which apparently do not move. In fact, when an LEB is rewritten, the new data will be placed into a different location on the physical device, but the upper layers know nothing about it. 
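The logical-to-physical translation described above can be pictured with a small Python model. It is only a sketch: the class name EraseBlockMap, the in-memory lists, and the "pick the least-erased free block" policy are assumptions made for illustration and do not reflect UBI's real data structures or its wear-leveling algorithm.

```python
class EraseBlockMap:
    """Toy model of a logical-to-physical erase block translation table."""

    def __init__(self, n_pebs):
        self.erase_counts = [0] * n_pebs  # wear per physical block
        self.data = [None] * n_pebs       # contents of each physical block
        self.leb_to_peb = {}              # the translation table
        self.free = set(range(n_pebs))    # erased, unused physical blocks

    def write_leb(self, leb, payload):
        """Rewrite a logical block by placing the data in a fresh physical block."""
        if not self.free:
            raise RuntimeError("no free physical erase blocks")
        # Assumed policy: least-worn free block, lowest index on ties.
        new_peb = min(self.free, key=lambda p: (self.erase_counts[p], p))
        self.free.remove(new_peb)
        self.data[new_peb] = payload
        old_peb = self.leb_to_peb.get(leb)
        self.leb_to_peb[leb] = new_peb
        if old_peb is not None:
            self._erase(old_peb)  # the old copy is now dead

    def _erase(self, peb):
        self.data[peb] = None
        self.erase_counts[peb] += 1
        self.free.add(peb)

    def read_leb(self, leb):
        """Upper layers read by logical number and never see the physical one."""
        return self.data[self.leb_to_peb[leb]]


if __name__ == "__main__":
    ubi = EraseBlockMap(3)
    for payload in (b"a", b"b", b"c"):
        ubi.write_leb(0, payload)
    print(ubi.read_leb(0), ubi.leb_to_peb, ubi.erase_counts)
```

Rewriting the same logical block three times touches three different physical blocks in this model, which is the effect that keeps a single hot block from wearing out early.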
So UBI makes problems like wear leveling and bad block avoidance go away for the upper layers. UBI also takes care of running time-consuming erase operations in the background when possible so that upper layers need not wait when writing a block. One little problem with UBI is that the logical-to-physical mapping information is stored in the header of each erase block. So when the UBI layer initializes a flash device, it must read the header from every block to build the mapping table in memory; this operation clearly takes time. For 1GB flash devices, this initialization overhead is tolerable; in the future, when we'll be booting our laptops with terabyte-sized flash drives in them, the linear scan will be a problem. The UBIFS developers are aware of this issue, but believe that it can be solved at the UBI level without affecting the higher-level filesystem code. By using UBI, the UBIFS developers are able to stop worrying about some aspects of flash-based filesystem design. Other problems remain, though. For example, the large erase blocks provided by flash devices require filesystems to track data at the sub-block level and to perform occasional garbage collection: coalescing useful information into new blocks so that the remaining "dead" space can be reclaimed. Garbage collection, along with the potential for blocks to turn bad, makes space management on flash devices tricky: freeing space may require using more space first, and there is no way to know how much space will actually become available until the work has been done. In the case of UBIFS, space management is an even trickier problem for a couple of reasons. One is that, like a number of other flash filesystems, UBIFS performs transparent compression of the data. The other is that, unlike JFFS2, UBIFS provides full writeback support, allowing data to be cached in memory for some time before being written to the physical media. 
Writeback gives large performance improvements and reduces wear on the device, but it can lead to big trouble if the filesystem commits to writing back more data than it actually has the space to store. To deal with this problem, UBIFS includes a complex "budgeting" layer which manages outstanding writes with pessimistic assumptions on what will be possible. Like LogFS, UBIFS uses a "wandering tree" structure to percolate changes up through the filesystem in an atomic manner. UBIFS also uses a journal, though, to minimize the number of rewrites to the upper-level nodes in the tree. The latest UBIFS posting raised questions about how it compares with LogFS. The resulting discussion was ... not entirely technical, but a few clear points came out. UBIFS is in a more complete state and appears to perform quite a bit better at this time. LogFS is a lot less code, avoids the boot-time linear scan of the device, and is able to work (with some flash awareness) through an FTL. Which is better is not a question your editor is prepared to answer at this time; what does seem clear is that the growing competition between the two projects has the potential to inspire big improvements on both sides in the near future. SDCC, the Small Device C Compiler SDCC is a multi-platform, multi-target C cross compiler that was originally written by Sandeep Dutta and has been further improved by a number of other people: SDCC is a retargetable, optimizing ANSI - C compiler that targets the Intel 8051, Maxim 80DS390, Zilog Z80 and the Motorola 68HC08 based MCUs. Work is in progress on supporting the Microchip PIC16 and PIC18 series. SDCC is Free Open Source Software, distributed under GNU General Public License (GPL). Some of the features include: ASXXXX and ASLINK, a Freeware, retargetable assembler and linker. extensive MCU specific language extensions, allowing effective use of the underlying hardware. 
a host of standard optimizations such as global sub expression elimination, loop optimizations (loop invariant, strength reduction of induction variables and loop reversing), constant folding and propagation, copy propagation, dead code elimination and jump tables for 'switch' statements. MCU specific optimizations, including a global register allocator. adaptable MCU specific backend that should be well suited for other 8 bit MCUs. independent rule based peep hole optimizer. a full range of data types: char (8 bits, 1 byte), short (16 bits, 2 bytes), int (16 bits, 2 bytes), long (32 bit, 4 bytes) and float (4 byte IEEE). the ability to add inline assembler code anywhere in a function. the ability to report on the complexity of a function to help decide what should be re-written in assembler. a good selection of automated regression tests. The SDCC package components include the sdcc compiler, the sdcpp C preprocessor, assemblers and linkers for the supported target processors, a simulator for the 8051 processor, the sdcdb source debugger and the packihx Intel hex file packing tool. Version 2.8.0 of SDCC was announced on March 30, 2008; the announcement lists the changes in that release. Your author downloaded SDCC 2.8.0 as a .tar.bz2 file onto a machine running Ubuntu 7.04 "Feisty Fawn". The file was uncompressed and untarred. The configure script was run, and one package dependency issue was resolved by installing flex. The second run of configure worked, as did the make and make install steps. Running sdcc -v produced the expected result: SDCC : mcs51/gbz80/z80/avr/ds390/pic16/pic14/TININative/xa51/ds400/hc08 2.8.0 #5117 (Apr 1 2008) (UNIX). A few test cases were compiled and assembled using the default MCS51 target, then using the -mz80 switch to produce output for a Z80 processor. All of the tests seemed to work, and produced readable Intel Hex files that appear to be suitable for movement to a development platform.
Your author recognized the hex C30001 at the beginning of the code as a Z80 jump instruction (activate the wayback machine). While this is a long way from developing a working embedded application on real hardware using SDCC, it does show that the system builds and is stable enough to consider using as a development platform. The Z80 and mcs51 microprocessors have been around since the late 1970s; newer versions are still being produced. The Microchip PIC microcontroller family and the Atmel AVR family are currently very popular microcontroller platforms. The AVR is the processor used in the recently featured Arduino open hardware microprocessor design, although that uses a different development system. SDCC allows microprocessor applications to be written in C, and that greatly expands the range of problems that can be solved by small embedded machines. The field of C cross-compilers has traditionally been dominated by proprietary Windows-based software. SDCC allows one to develop embedded microprocessor designs using open-source software under Linux. WebKit rising Once upon a time, there were no usable free web browsers for the Linux environment; the binary-only Netscape releases were all that was available to us. For many, the solution to the problem was to be found in the release of the Netscape source code; some years later, we got the Mozilla and Firefox browsers (based on the Gecko rendering engine) from this work. The KDE project, though, took a different route in the late 1990s, developing the KHTML renderer to use with the Konqueror application. A few years later, Apple surprised the world by selecting KHTML as the base for its Safari browser, despite the fact that Gecko was more widely deployed. What followed was essentially a fork of KHTML and some bad blood between Apple and the KDE project. Over time, the two sides have come to a better understanding, but KHTML and Apple's version (WebKit) have remained separate.
The existence of two KHTML forks may not last that much longer, though, and some interesting things appear to be happening. One of those things is that Konqueror is slowly being moved over to WebKit as its rendering engine. The decision to go in this direction was made at the 2007 Akademy gathering, and work has been proceeding ever since. Current Ubuntu development releases include a preview version of Konqueror on WebKit. Work can be expected to continue in this direction, with the result that KHTML will slowly lose its prominence in the KDE project. The fork, in other words, is beginning to join, with the resulting software being called "WebKit." [Update: as can be seen in the comments, this paragraph overstated the case somewhat. Things might end up as described here, but that is not the case now.] Meanwhile, it seems that people are actually starting to use Safari, to the point that web designers are thinking that they should actually test their sites with it. For what it's worth, Safari currently accounts for just over 3% of visits to LWN.net - relatively small compared to Firefox (over 60%), but, when added to Konqueror's 4.5%, it makes half of Internet Explorer's 15% share. One can argue that the mix of browsers used by LWN readers is not typical of the net as a whole, but, even so, it looks like WebKit-based browsers just might become a significant part of the Internet's software base. The story does not stop there, though. When a GNOME project announces, on April 1, that it is moving over to a major component which came from the KDE camp, one can be forgiven for not taking it seriously. But it would appear that this announcement from the Epiphany developers, saying that they are moving to WebKit as their sole rendering engine, is the real thing.
Epiphany, remember, is the closest thing that GNOME has to an official web browser; it has users who swear by its better integration with the GNOME desktop. But Epiphany has always been based on the Gecko engine, and it seems that not a whole lot of users have seen reasons to stick with it over Firefox, which provides rather more functionality on the same engine. Epiphany is not a big force in the browser arena currently. Last year, the Epiphany developers added an abstraction layer which allowed the browser to operate over multiple rendering engines, including WebKit. Now they have decided to take that layer back out and to support just one rendering engine: WebKit. The development team cites a number of reasons for moving away from Gecko, including release-cycle mismatches, a feature set which is driven by a competing project, and a lack of attention being paid to the Gecko/GTK embedding API. Gecko, they have decided, is not the best fit for Epiphany. WebKit, instead, was designed for embedding - the WebKit project's goals explicitly rule out building a browser themselves - and the GNOME API is said to work very nicely. WebKit in GNOME uses technologies like Cairo and Pango, like many other GNOME applications. Overall, the Epiphany team feels like WebKit is a better match for what they are trying to do - and they suggest that a number of other GNOME projects move in that direction as well. The initial response from other GNOME participants appears to be positive, with the exception of some concerns about accessibility support in WebKit - concerns which, presumably, can be addressed. The GNOME/KDE flame wars, happily, are some years behind us. Developers from both projects are more interested in cooperation these days, but, so far, much of that cooperation has been around relatively small, low-level components. An HTML rendering engine is not a small, low-level component, though. 
If both projects seriously work toward the improvement of WebKit, they will have started an era of rather higher cooperation than has been seen in the past. If this cooperation holds together, it can only be to the benefit of both projects, and to all other users of WebKit as well. The Gecko engine is good code and a highly successful project. But it is also controlled by a company (Mozilla Corporation) whose agenda, beneficial though it may be, does not include the creation of successful competing browsers. So it's not entirely surprising that Gecko has not proved to be entirely suitable for groups trying to create those competing browsers. WebKit, at the outset, looks like it is better suited to this task. The WebKit project has expressed interest in working with GNOME; there might just be a productive partnership in the making here. But it's worth remembering that WebKit, too, is a project developed by a company with its own objectives, few of which make any mention of turning 2009 into the real year of the Linux desktop. For now, though, WebKit has the look of a project with all the right attributes: real independence, merit-based access to the source repository, no requirement for copyright assignments, reasonable licensing, and the right goals. It may well be positioned to become a core component in the Linux desktop. OOXML gets ISO approval The votes are in, with Microsoft's Office Open XML (OOXML) format gaining international standard status. Both Microsoft and Ecma International jumped the gun a bit by proclaiming victory a day before the official announcement, but the writing was on the wall since the balloting closed on March 29. There are now two competing standards for office document formats that have been approved by the International Organization for Standardization (ISO): OOXML and Open Document Format (ODF). 
The most recent vote was an opportunity for the national bodies to change their votes from September based on the outcome of the Ballot Resolution Meeting (BRM). The September vote was relatively close but OOXML did not pass, which led Ecma and Microsoft to try to address the 3,500 comments (1,000+ after eliminating duplicates) made by participating countries. The comments and the Microsoft/Ecma solutions to them were discussed during the five-day BRM in Geneva in late February. When the BRM was announced, many wondered how that number of comments could be handled in a week-long meeting; unfortunately, the answer is: not very well. There was simply too much to cover, so the majority of comments, mostly substantive issues with OOXML, didn't get discussed and were voted on en masse. The majority of participants abstained (18) or failed to vote (4), with six voting to accept the changes proposed by Microsoft/Ecma and four voting against. This allowed the BRM process to complete, leaving it up to the national bodies to decide whether to change their September votes. The outcome was again fairly close, but a net change of seven votes from "disapprove" to "approve" moved OOXML into approval. Of the 32 votes from Participating countries, 24 were for approval, which is beyond the two-thirds majority required. Also, 86% of the Observing countries voted to approve, which is above the 75% required. In both cases, abstentions are not counted. At some level, the outcome should not be surprising. Microsoft put a huge effort into ensuring OOXML standardization. Some would claim that they "gamed" the system, and it's pretty clear they did; what's less clear is why, and what they plan to do next. Their tactics have been questionable, which leads many to believe they have an ulterior motive. To start with, Ecma International essentially rubber-stamped a "specification" that Microsoft presented as ECMA-376.
Then it was introduced to ISO on the "fast-track" process, which is meant for mature standards that have few gray areas or controversial parts. Whatever else can be said of OOXML, nearly anyone that is not firmly in the Microsoft camp can see that it is in no way mature, clear, or non-controversial—it is flawed at multiple levels. One of the most puzzling things about the process is how we have ended up with two standards. In general, standards are supposed to be, well, standard, allowing multiple implementations that use the standard, but innovate in other areas. HTML and HTTP are standards, whereas Firefox, Safari, Konqueror, Opera, and Internet Explorer all implement those standards—some more faithfully than others—but provide different sets of features on top. Microsoft's argument for multiple standards is a disingenuous one: choice. It would seem that Microsoft wants to paint this as a VHS vs. Betamax battle, where the consumer is able to choose the one best suited for their needs. But, both of the video recording standards were proprietary, with many arguing that the technically inferior choice "won". Microsoft is, of course, no stranger to having its choices—again arguably technically inferior and generally pushed through its near-monopoly on the desktop—come out on top. One might be able to argue that competition between the standards is consumer-friendly if there is a level playing field. In order for that to happen, Microsoft would have to implement and deploy the competitive standard—something it has clearly said it will not do. It is hard to see how customers are going to be able to determine which of the two formats is "better" when most of them will only be given one choice. Many also fear that free software (and other non-Microsoft proprietary) implementations of the standard will not be fully interoperable with the de facto standard because of specification inadequacies or patents. 
Many, including ODF editor Patrick Durusau, have called for OOXML to be passed so that it can be clarified. Setting aside the obvious cart-before-the-horse problem, standards bodies are notoriously slow (the fast-track approval of OOXML has taken more than a year, for example); expecting that clarifications can be made through that process is somewhat alarming. More likely, changes will be made in the format emitted by various Microsoft products and then shoehorned into the standard some months or years later. The claim that billions of documents exist in OOXML, which leads many to believe it should be adopted, is particularly galling. There is no OOXML standard yet (the final document has not yet been produced), but that is a minor issue. The fact is that even though a form of OOXML is available in recent Microsoft products, it is not the default and most documents have not been stored using it. The billions of documents are mostly stored in various versions of the proprietary DOC format that non-Microsoft users have been struggling to read for years. The opponents of OOXML had their own share of misbehavior during this process. It is pretty unlikely that everyone who favored OOXML passage is in the pay of Microsoft, for example. The doom and gloom predictions of what will happen have sometimes been over the top as well. Free software is not about restricting choices; if folks want to store documents in OOXML, that is their decision. So, what will happen to ODF? To many it looks like a truly vendor-neutral standard, warts and all, will be shoved aside by a truly vendor-specific one. Andy Updegrove, who has followed this process closely and fairly objectively in his weblog, sees things a bit differently.
There is still a long way to go before OOXML supplants ODF, if it ever does, according to Updegrove: That answer is this: if anyone had asked me to predict in August of 2005 (the date of the initial Massachusetts decision that set the ODF ball rolling) how far ODF might go and what impact it might have, I would never have guessed that it would have gone so far, and had such impact, in so short a period of time. I think it's safe to say that whatever happens with the OOXML vote is likely to have little true impact at all on the future success of ODF compliant products. It is possible that Microsoft is changing its ways, but longtime Microsoft watchers, especially those who have been harmed by their tactics in the past, remain skeptical. One would guess Microsoft will be on its best behavior for the next two months while objections to the approval can still be raised. After that, we will see—over time—whether this is yet another lock-in play or whether they wish to play fair in the document storage arena. Every move they make will be closely scrutinized; there are risks to reverting to their previous behaviors. But, if we end up with a truly open standard, free of patent nonsense, and implementable by all, it doesn't really matter whether it is OOXML or ODF. Biometrics for identification Using a fingerprint or other physical characteristic, called biometric data, for identity verification seems, at first glance, like a perfect solution to the problem. Unfortunately, there are some basic problems with using biometric information that way. If the biometric data can be gathered by others, it no longer makes such a good identifier. As part of a political protest against including fingerprints in passports, the Chaos Computer Club (CCC) published a fingerprint of German Home Secretary Wolfgang Schäuble. Schäuble is a supporter of collecting fingerprint data to combat terrorism. 
The club not only published the picture, but also a film that can be placed over a finger to deceive fingerprint scanners. A club spokesman has usage recommendations as reported in heise online: "We recommend that you use the film whenever your fingerprint is taken, such as when you enter the US, stop over at Heathrow, or even when you touch bottles at your local super market -- just to be on the safe side." It seems unlikely that CCC's distributed finger film will actually leave the Secretary's print on a glass surface, but more sophisticated versions of the same basic idea should be able to. Various folks have shown that using an image of someone's fingerprint can fool most scanners. Even sophisticated scanners can be spoofed when that image is placed over a live finger, with body temperature and pulse. The problem is that while a fingerprint is unique, it isn't secret. CCC got theirs from a sympathizer who picked it up from a glass used by the Secretary during a speech. Bruce Schneier is, as usual, ahead of the curve on this. In an article from nearly ten years ago, he drives home the point: "The moral is that biometrics work great only if the verifier can verify two things: one, that the biometric came from the person at the time of verification, and two, that the biometric matches the master biometric on file. If the system can't do that, it can't work. Biometrics are unique identifiers, but they are not secrets. (Repeat that sentence until it sinks in.)" Other forms of biometric identification exist, but are susceptible to the same kinds of problems. A voiceprint or facial identification scanner could be fairly easily subverted by secretly recording or photographing the subject. Retinal scans are trickier, perhaps, but technology to remotely (and surreptitiously) read them will probably come along.
In many cases, an attacker may not even need to go to that amount of trouble because they can just extract—or pay to have someone else extract—that information from some database. More and more of this kind of information is being gathered and centralized. The US has started fingerprinting all ten fingers of non-citizens who enter the country—other countries have started doing it in retaliation. One could hope the data retention policy for that information is similar to that of White House emails, but it is probably longer. Worse yet, it is probably stored with photographs, passport information, and signature of the subject. The key to using biometrics correctly is to repeat the Schneier mantra: Biometrics are powerful and useful, but they are not keys. They are useful in situations where there is a trusted path from the reader to the verifier; in those cases all you need is a unique identifier. They are not useful when you need the characteristics of a key: secrecy, randomness, the ability to update or destroy. Biometrics are unique identifiers, but they are not secrets. Revocation of a biometric identifier is difficult or impossible—if it is even known to be compromised. One could potentially switch fingers for fingerprint identification, or even switch eyes—once. Switching voiceprint, face, or DNA if and when that gets used, will be essentially impossible. Biometrics suffer from the same failure mode as using the same password everywhere, unless you can somehow use a different characteristic for each biometrically "protected" dataset—hard to do with limited body parts. Biometric data does have its uses, but it has limitations as well. It seems seductively simple that your fingerprint is the same as you, but it isn't necessarily true. Now we just need to teach the politicians, which might be something that Schäuble is starting to learn. 
Memory allocation failures and scary warnings People who put their Linux systems under a certain amount of memory stress - and who look at their logfiles - may notice an occasional message indicating that a "page allocation failure" has occurred, followed by a scary backtrace. These people may also notice that, despite the apocalyptic appearance of this message, the world often fails to end. In fact, the system tends to carry on just fine. For this reason, Dave Jones, who probably gets ten emails for every backtrace generated on a Fedora system, has suggested that these messages are simply noise which should be removed. Whether that should really happen is not entirely clear, though; understanding why requires a bit of background. In general, the kernel's memory allocator does not like to fail. So, when kernel code requests memory, the memory management code will work hard to satisfy the request. If this work involves pushing other pages out to swap or removing data from the page cache, so be it. A big exception happens, though, when an atomic allocation (using the GFP_ATOMIC flag) is requested. Code requesting atomic allocations is generally not in a position where it can wait around for a lot of memory housecleaning work; in particular, such code cannot sleep. So if the memory manager is unable to satisfy an atomic allocation with the memory it has in hand, it has no choice except to fail the request. Such failures are quite rare, especially when single pages are requested. The kernel works to keep some spare pages around at all times, so the memory stress must be severe before a single-page allocation will fail. Multi-page allocations are harder, though; the kernel's memory management code tends to fragment pages, making groups of physically-contiguous pages hard to find. 
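The difference between single-page and multi-page requests can be illustrated with a toy free-page map in Python. This is a simplified sketch: the real kernel allocator does not scan a flat list like this, and the find_contiguous() helper is purely hypothetical. It only shows how plenty of free memory can coexist with a failed request for two adjacent pages.

```python
def find_contiguous(free_map, count):
    """Return the index of the first run of `count` free pages, or None."""
    run = 0
    for i, free in enumerate(free_map):
        run = run + 1 if free else 0
        if run == count:
            return i - count + 1
    return None


# Every other page is free: half of memory is available, but fragmented.
fragmented = [i % 2 == 0 for i in range(16)]
free_pages = sum(fragmented)
single = find_contiguous(fragmented, 1)   # a lone page is easy to find
pair = find_contiguous(fragmented, 2)     # two adjacent pages are not
```

Here eight of sixteen pages are free, yet the two-page request finds nothing. A caller that can sleep could wait for pages to be moved or reclaimed; an atomic caller cannot.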
In particular, if the system is under pressure to the point that there is not much free memory available at all, the chances of successfully allocating two (or more) contiguous pages drop considerably. Multi-page allocations are not often used in the kernel; they are avoided whenever possible. There are situations where they are necessary, though. One example is network drivers which (1) support the transmission and reception of packets too large to fit into a single page, and which (2) drive hardware which cannot perform scatter/gather I/O on a single packet. In this situation, the DMA buffers used for packets must be larger than one page, and they must be physically contiguous. This is a situation which will become less pressing over time; scatter/gather capability in the hardware is increasingly common, and drivers are being rewritten to make use of this capability. With sufficiently smart hardware, the need for multi-page allocations goes down considerably. But all of that skirts around the main point, which is that kernel code is supposed to handle allocation failures properly. There is never any guarantee that memory will be available, so kernel code must be written defensively. Allocation failures must be handled without losing any more capability than is strictly necessary. If one assumes that kernel code is written correctly, there should be no need to issue warnings on allocation failures. Things should just continue to work, perhaps without users noticing at all. And, in fact, things often do just work. But the discussion resulting from Dave's suggestion makes it clear that few developers are confident that all kernel code does the right thing in the face of memory allocation problems. In cases where an allocation failure is not dealt with correctly, the system may go down in random places, leaving few clues as to what really happened. In that kind of situation, the allocation failure warning may be the only useful information which survives the crash.
For this reason, some people want to see the warnings left in place. As it happens, the memory allocator supports a special bit (__GFP_NOWARN) which causes the warning not to be emitted if a specific allocation fails. So it has been suggested that the allocations made from code which is known to handle failures properly have __GFP_NOWARN set. That would kill the warnings in code known to do the right thing while leaving it for all other callers, presumably limiting the warnings to places where there might truly be a problem. Jeff Garzik strongly opposed this idea, though, saying that it clutters up the code and "punishes good behavior." The other reason given for keeping the warnings in place is to make it clear when a system is running under persistent memory pressure. Such systems will not be performing optimally; often there are changes which can be made to relieve the pressure and help the system to run more smoothly. So it has been suggested that the warning could be reduced in frequency and made less scary. Nick Piggin suggests: So I think that the messages should stay, and they should print out some header to say that it is only a warning and if not happening too often then it is not a problem, and if it is continually happening then please try X or Y or post a message to lkml... An alternative idea would be to keep some sort of counter somewhere which could be queried by curious system administrators. Of course, the real solution is to ensure that all kernel code is robust in the face of allocation failures. This can be hard to do, since the error recovery paths in any code are not often exercised or tested. Fortunately, the fault injection framework can help in this situation. Kernel developers can use this framework to simulate allocation failures in specific regions of code, then watch to see what happens. Your editor's impression, though, is that relatively few developers are using this tool. 
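The idea behind fault injection can be shown with a small user-space analogy in Python. None of this is the kernel framework's interface: FailingAllocator and setup_rings() are hypothetical names, and the "allocator" simply returns None on chosen calls. The point is the testing pattern: force a failure at each allocation site in turn and check that the error path cleans up after itself.

```python
class FailingAllocator:
    """Hypothetical allocator that fails on selected call numbers."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.calls = 0
        self.live = 0            # buffers currently allocated

    def alloc(self, size):
        self.calls += 1
        if self.calls in self.fail_on:
            return None          # simulated allocation failure
        self.live += 1
        return bytearray(size)

    def free(self, buf):
        self.live -= 1


def setup_rings(allocator):
    """Allocate two buffers; on failure, release anything already obtained."""
    rx = allocator.alloc(4096)
    if rx is None:
        return None
    tx = allocator.alloc(4096)
    if tx is None:
        allocator.free(rx)       # the rarely exercised error path
        return None
    return rx, tx


# Inject a failure at each allocation site in turn and look for leaks.
results = []
for failing_call in (1, 2):
    allocator = FailingAllocator(fail_on=[failing_call])
    outcome = setup_rings(allocator)
    results.append((outcome, allocator.live))
```

If the `allocator.free(rx)` line were missing, the second run would finish with one live buffer, which is exactly the kind of bug that never shows up until memory is actually tight.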
So confidence in the kernel's handling of allocation failures may remain low, and the desire to keep the warning around may remain high. vringfd() One of the core features of the (now stalled) kevent subsystem was a circular buffer intended for efficient movement of data between the kernel and user space. Kevent may have run out of steam, but the ring buffer idea is back via a different path. Rusty Russell is now proposing a new system call (called vringfd()) which turns some of the virtio work into a new kernel-to-user ring buffer interface. The submitted patch is breathtaking in its lack of documentation on this new system call, especially considering that its author is quite good with that sort of writing. Your editor has taken this omission as a personal challenge and, as a result, has set about reverse engineering the (somewhat complex) vringfd() interface. A user-space process which wishes to set up a vring for communication with the kernel must create a slightly complicated data structure first. One starts by deciding how many entries the ring should have; this number must be a power of two which fits into an unsigned, 16-bit value. Given this number (we'll call it RING_SIZE), the data structure looks like this: The page alignment for the used array is important - that array might be mapped separately into kernel space. The array must fit into a single page, which puts a practical limit of 256 entries for RING_SIZE on systems with 4096-byte pages. If this API goes forward, chances are good that a way will be found to raise this limit. Individual descriptors in the ring are described with this structure: For a simple buffer, the application would simply point addr at the beginning and set len to the appropriate value. If the buffer is to be written to by the kernel, the application should also set VRING_DESC_F_WRITE in the flags field. Things can get more complicated than that, though, in that the vringfd() interface supports multipart scatter/gather buffers. 
To set up such a buffer, user space would use one vring_desc entry for each segment of the buffer. For all but the final segment, the VRING_DESC_F_NEXT flag (saying "use the next descriptor too") should be set, and next should be the index of the next descriptor. When the kernel grabs a buffer, it will follow the chain and use all segments found until the final one (which lacks the VRING_DESC_F_NEXT flag) is encountered. Before the kernel will use buffers set up by the application, though, user space must indicate that the buffer is ready. That is done through the vring_avail structure: The ring array holds indexes into the descriptors array. The idx field should always be the index of the last valid entry in ring. When a new buffer is ready for transfer to or from the kernel, the application will store the index of the first descriptor into ring[idx+1], then increment idx. When the ring is first established, the kernel remembers the position of idx, so the first buffer should be added here after the vringfd() system call is made. The kernel will consume buffers from the available ring as needed. Once the requested operation has been performed on the buffer and the kernel is done with it, the buffer will show up in the used area, which is structured this way: In the vring_used structure, idx is the index of the next entry in ring which may be written by the kernel; it will be incremented after the ring is updated. When a buffer is placed in the used ring, the id field will be the index of the descriptor, and len will be the actual length of the data transferred. Note that the flags fields in the vring_avail and vring_used structures appear to be unused. 
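The descriptor chaining and available-ring bookkeeping described above can be modeled in a few lines of Python. This is a toy model of the behaviour as reverse engineered in the text, not the kernel's code. The Desc and AvailRing classes and the chain_segments() helper are names invented for this sketch, the flag values are assumed to match those used by virtio, and wrapping the ring slot at the ring size (and idx at 16 bits) is an assumption.

```python
from dataclasses import dataclass

# Flag values assumed for illustration (taken to match virtio's ring header).
VRING_DESC_F_NEXT = 1   # "use the next descriptor too"
VRING_DESC_F_WRITE = 2  # the kernel will write into this buffer

RING_SIZE = 8  # must be a power of two


@dataclass
class Desc:
    """Toy stand-in for one vring_desc entry (hypothetical layout)."""
    addr: int = 0
    length: int = 0   # the article's "len" field
    flags: int = 0
    next: int = 0


def chain_segments(descs, head):
    """Follow a descriptor chain starting at index `head`.

    Returns the descriptor indexes making up one buffer; the walk stops at
    the first descriptor that lacks VRING_DESC_F_NEXT.
    """
    chain = [head]
    while descs[chain[-1]].flags & VRING_DESC_F_NEXT:
        if len(chain) >= len(descs):
            raise ValueError("descriptor chain loops")
        chain.append(descs[chain[-1]].next)
    return chain


class AvailRing:
    """Toy stand-in for vring_avail: idx names the last valid entry in ring."""

    def __init__(self, size, start_idx=0):
        self.size = size
        self.ring = [0] * size
        self.idx = start_idx

    def publish(self, head):
        # Store the head descriptor index in the slot after idx, then bump idx.
        # Wrapping the slot at the ring size and idx at 16 bits is an assumption.
        self.ring[(self.idx + 1) % self.size] = head
        self.idx = (self.idx + 1) & 0xFFFF


descs = [Desc() for _ in range(RING_SIZE)]
# A three-segment scatter/gather buffer using descriptors 2 -> 5 -> 3.
descs[2] = Desc(addr=0x1000, length=64, flags=VRING_DESC_F_NEXT, next=5)
descs[5] = Desc(addr=0x2000, length=128, flags=VRING_DESC_F_NEXT, next=3)
descs[3] = Desc(addr=0x3000, length=32)  # final segment: no NEXT flag

avail = AvailRing(RING_SIZE)
avail.publish(2)  # only the first descriptor of the chain goes into the ring

segments = chain_segments(descs, avail.ring[avail.idx % RING_SIZE])
total_length = sum(descs[i].length for i in segments)
```

Only the head of the chain is published; the remaining segments are found through the next fields, which is why a single available-ring entry can describe a buffer of any number of pieces.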
Once the application has this whole data structure set up, it can establish the ring buffer with the kernel with the new system call: Here, addr is the base address of the data structure described above, ring_size is the number of descriptors in the ring, and last_used is a 16-bit unsigned integer indicating which entry in the used ring was last consumed by the application. Failure to keep last_used current will not slow things down, but it will keep poll() from working properly. The return value will be a file descriptor associated with the ring. Creating the vring is only part of the job, though. The next step is to connect it with a kernel subsystem for the transfer of data. Rusty's patch includes vring support in the tun virtual network driver; to use that support, an application makes a special ioctl() call to provide the vring file descriptor to the tun driver. Any other subsystem will need a similar mechanism to support vring. If the application is using the ring to transfer data into the kernel, it must (1) set up one or more descriptors for full data buffers in the available ring, then (2) make a write() call to the vring file descriptor. The buffer and length passed to write() are ignored; all that matters is that a write was done to that file descriptor. When write() returns the operation will have been set in motion, but it cannot be considered to be complete until the ring descriptors show up in the used ring. For data transfers from the kernel to user space, the application simply puts buffers into the available ring, then waits until they show up in the used ring. A poll() on the vring file descriptor will block until buffers are available. The kernel determines whether unconsumed buffers exist in used by comparing the vring_used->idx index against the application-supplied last_used value. It's worth noting that, depending on how the relevant kernel subsystem works, buffers may not actually make it into the used ring until the poll() call is made. 
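The comparison between vring_used->idx and last_used can be sketched as follows. This is an illustration under stated assumptions rather than the patch's logic: both values are treated here as free-running 16-bit counters, so the difference is taken modulo 2^16, and the function names pending_used() and poll_ready() are invented for the example.

```python
def pending_used(used_idx, last_used):
    """Number of used-ring entries the application has not yet consumed.

    Both counters are treated as 16-bit unsigned values that wrap around,
    which is an assumption of this sketch.
    """
    return (used_idx - last_used) & 0xFFFF


def poll_ready(used_idx, last_used):
    """Mimic the readiness test: poll() would return once this is true."""
    return pending_used(used_idx, last_used) != 0


# The kernel has advanced to 5, the application has consumed up to 3.
two_waiting = pending_used(5, 3)
# The counters wrapped: the kernel is at 2, the application still at 0xFFFE.
four_waiting = pending_used(2, 0xFFFE)
# A stale last_used makes the ring look permanently ready.
stale_ready = poll_ready(7, 0)
```

The last case shows why an application that never updates last_used would find poll() unhelpful: the ring always appears to hold unconsumed buffers, so the call has nothing to wait for.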
On the kernel side, a developer wanting to add vring support to a subsystem will start by creating a set of vring_ops: All of these functions take a private pointer given when the subsystem attaches to the vring (to be described shortly). The pull() callback is invoked when the application calls poll(); if there is any descriptor processing which must be done with user space accessible, this is the place to do it. If pull() adds any buffers to the used ring, it should return the number of buffers; it can also return a negative error code. push() is called from a write() call indicating that there are buffers ready to be transferred into the kernel; it returns zero or a negative error code. The destroy() callback is called when the vring file descriptor is closed. All of these callbacks are optional. Attaching to a vring is done with: For this call, fd is a file descriptor corresponding to a vring, ops is the operations structure described above, data is a private data pointer which is passed into the vring_ops callbacks, and atomic_use is nonzero if the kernel needs to be able to add buffers to the used ring in atomic context. The return value is a pointer to an internal vring data structure or an ERR_PTR() value if something goes wrong. To obtain a buffer from the available ring, a call is made to: This function will fill in an array of iovec structures corresponding to the next available buffer. If the kernel expects to write to the buffer, it should set in_iov to the iovec array, num_in pointing to the length of in_iov, and in_len pointing to a location to store the total length of the buffer (or NULL if that information is not useful). For transfers into the kernel, out_iov, num_out, and out_len should be set similarly. Note that the addresses stored in the iovec arrays are user-space addresses; vring_get_buffer() does not validate them, so the caller must do so. 
It is possible to pass both in_iov and out_iov; in this case, one of the two will be set, depending on whether the next buffer in the available ring has the VRING_DESC_F_WRITE flag set. In most cases, though, only one of the two sets of parameters will have non-NULL values. The apparent intent of the API is that, if bidirectional transfers between user space and the kernel are needed, two separate vrings should be used. The return value from vring_get_buffer() will be one of (1) a positive descriptor index, (2) zero, indicating that no buffers are available, or (3) a negative error code. The descriptor index should be saved for the final step, which is indicating that the kernel is done with a specific buffer: Either one of these functions indicates that the buffer identified by id should be put into the used ring; len is the amount of data actually transferred. If sleeping is not possible, vring_used_buffer_atomic() should be used - but the vring must have been attached with the atomic_use flag set. There does not appear to be a way for a subsystem to detach from a vring; it must, instead, wait for the application to close the associated file descriptor. This interface is in an early stage, and the code has a number of limitations and FIXME comments. So things seem likely to evolve before vringfd() is seriously considered for merging into the mainline kernel. The idea of a ring buffer for this kind of communication seems to come around on a regular basis, though, so it would seem that there is a demand for this kind of API.

Video forums for free software

Over the last few years, we have seen the rise of video content on the web, but much of that content has been locked up in non-free formats. Patented video codecs are a big part of the problem; there are free alternatives (Theora and Dirac, for example), but they are not widely used.
Free software projects often use videos as part of their marketing and documentation, using screencasts to highlight interesting or exciting features of the program for example. But the choices for collecting and distributing video content leave much to be desired for free software advocates. The Fedora project has been looking into this problem lately, in support of its FedoraTV project. A recent thread on the fedora-advisory-board mailing list looks at various alternatives now that the original host of FedoraTV content, luluTV, has gone out of business. Greg DeKoenigsberg outlines the problem: The original goal of Fedora TV was to provide a "Fedora-friendly" home for videos that we had some control over. I think this is still a worthwhile strategic goal, but since we no longer have the help of dedicated engineers, I no longer think it's a sensible tactical goal. The question that follows: "we've got lots of people who are excited about making Fedora videos. What's the best way, in the short term, to gather those videos together to make them accessible?" He goes on to outline the criteria for finding a near-term solution, starting with the absolute requirements: Ogg Theora format, one-click download, and a robust, stable hosting site. Also important, but not as critical are things like the ability to extract static screenshots for posting in various places, an easy way for community members to know when new videos are available (an RSS feed for example), and a way for uploaders to easily associate a license with their video. These should resonate with most projects that have an interest in providing a video forum for their community as they are likely to have many of the same needs. Transcoding the videos to Flash to reach the largest possible audience is DeKoenigsberg's "controversial" criteria. It is an unfortunate truth that, even for fairly strong free software proponents, the Flash browser plugin provides the simplest route to viewing online videos. 
Other solutions exist and work, but require a great deal more effort to enable additional software repositories so that the proprietary or patented codecs can be installed. Interestingly, there were no arguments presented against the transcoding suggestion. For Fedora, where Theora—or other free codec—viewers are easily available, Flash transcoding might be less of a requirement. Other projects, especially those that are cross-platform, may find that a large part of their community is either unable or unwilling to install additional software to view videos. Users of non-free operating systems are largely unaware of the video codec problems; their OS comes with a no-extra-cost video viewer that just works. Because of that, transcoding to Flash does at least provide a way to present videos that can be relatively easily viewed by free and non-free systems alike. Various solutions to the hosting problem were discussed, from partnering with archive.org to rolling their own using MediaWiki, Plumi, or some of the technology released by luluTV. One of the suggestions that got the most attention was to create a Miro channel hosted, at least temporarily, on Fedora project servers. Miro has a lot of promise as a viewer and organizer of videos, with a BitTorrent client built-in, but it doesn't solve the other half of the problem: how to allow the community to contribute. There is, it seems, a growing need for a free community video forum, both from a code and a hosting perspective. The bandwidth and storage requirements of video are enormous, so covering the actual cost will be a big challenge. Places like YouTube allow short videos to be uploaded, but they can only be played back via Flash. In addition, their software is not free, so they only solve parts of the problem. There are no obvious free solutions, yet, but it is a problem that we will be facing more frequently. Somehow leveraging Miro as a free, cross-platform video delivery system may make the most sense. 
Providing a way for the community to upload video content into the channels would make for a mostly working FedoraTV and other projects like that. Miro supports free codecs as well, which might help to start weaning people away from their current non-free codec addiction. Then we can start figuring out how to pay for the network and hard disk capacity required. OpenSSH bug falls through the cracks Linux distributions often patch the software they distribute, to fix bugs or add features. Anything they add is pushed upstream to the project responsible for the package—at least in theory. When that theory is not borne out in practice, it can lead to the kind of unhappiness and finger pointing that went along with a recent OpenSSH release. The release notes point at Debian for failing to report it upstream, but the bug was actually fixed much earlier, in Red Hat Enterprise Linux 4 (RHEL4). The bug in question is rather nasty, allowing a local attacker to hijack X Windows programs of a user who logged in using ssh with X forwarding enabled. Under those circumstances, the ssh client and server arrange that any X programs started on the logged-in machine actually display on the client machine's desktop. This is very useful for running X programs across the internet as the X traffic is encrypted as part of the ssh session. Due to a broken interaction with Internet Protocol version 6 (IPv6)—the next generation protocol for internet traffic—ssh can get confused about the port number of the X server. If a particular port (which maps to the X DISPLAY environment variable) is not available to be used under IPv4—the protocol in use today—but is available under IPv6, the ssh server will incorrectly set DISPLAY. If it is an attacker's program that is listening on the IPv4 port, it will be able to hijack X programs that get run. Up until sometime in the last several years, this would not have happened for most Linux boxes because IPv6 was generally not enabled. 
In that case, the ssh server would recognize that it could not get the port it wanted and try another, eventually setting DISPLAY correctly. Because IPv6 is much newer, these kinds of bugs may exist in other network programs. This bug should serve as a reminder to developers to closely check their IPv6 support. Clearly, though, the bug fell through the cracks. The OpenSSH team shows its annoyance in the release notes: We apologise for any inconvenience resulting from this release being made so shortly after 4.9. Unfortunately we only learned of the below security issue from the public CVE report. The Debian OpenSSH maintainers responsible for handling the initial report of this bug failed to report it via either the private OpenSSH security contact list (openssh@openssh.com) or the portable OpenSSH Bugzilla (http://bugzilla.mindrot.org/). It was reported in January to the Debian bug tracking system, but not fixed and released until late March. OpenSSH does releases every six months or so, with 4.9 being released on March 30. Having to turn around another release four days later to fix a problem that was known for a few months could certainly make for annoyed developers. So how did the bug get fixed in Debian, with a Common Vulnerabilities and Exposures (CVE) number being assigned, but without notifying the OpenSSH team? The Debian bug entry is instructive, because it documents some of the steps that led to the hurried release. In particular, Phil Miller thought he had done the right thing to report the problem in February: As noted in the control section, I have forwarded this to Theo DeRaadt, the point of contact for security issues found in OpenBSD's software. That email must have gotten lost or been eaten by a spam filter as de Raadt would presumably have gotten it to the right people had he seen it. The bug description clearly puts it in the realm of a security problem, but the bug was not classified that way in the Debian system. 
Had it been, it would have been handled differently, possibly triggering an email to the proper place. But the bug report also shows that Red Hat fixed it in 2005. It was reported to Red Hat by a customer and got entered into their bugzilla as bug #163732. Unfortunately, that bug report is confidential because it contains potentially sensitive customer information. This makes it difficult to track further. Indications are that it was not seen as a security problem and that it was believed to have been already known as an OpenSSH bug. Apparently no one checked to make sure the OpenSSH folks knew of it though. Closer cooperation between the OpenSSH maintainers for Red Hat and the upstream team would probably have helped. Red Hat has been carrying the patch along for quite some time. Because the security implications were not clear and the patch is quite simple, it may not have seemed to be all that necessary to get it upstream. Though, there are more than twenty patches listed in the fedora OpenSSH CVS repository for rawhide, which will become Fedora 9. The OpenSSH team would be well served by paying closer attention to various distribution patches to their code as well. It is certainly plausible that those interested in finding security holes to exploit might start by seeing if any patches floating around for critical services like OpenSSH were useful. By being more proactive, OpenSSH might have found and fixed this bug much earlier. The way this particular bug avoided notice seems to be mostly happenstance; if there is blame to be placed, there is plenty to go around. RHEL and other "enterprise" distributions have long support cycles which means that the versions of various packages being maintained are well behind the upstream project. It doesn't take very many bug reports getting shot down because they have already been fixed in a more recent version before distribution maintainers lose enthusiasm for making those reports. 
But it is an essential part of the process. The OpenSSH team has the reputation of being somewhat difficult to work with, which may have helped this particular problem get overlooked. It is a difficult problem to solve fully. Distributions have their own set of requirements which may be in opposition to those of the upstream project. Those projects may also have policies and procedures that distributions are not up to speed on. The Linux kernel often sees the same kind of conflicts, which is why distributions often maintain their own set of kernel patches for features their customers need. But it is in everyone's best interest to work those problems out so that distributions carry along as few patches as possible while upstream projects do not miss out on bug fixes and features. Distribution-friendly projects Part 2 [Editor's note: This article, which looks at the interactions of software projects and distribution providers, is presented in three parts. Part 1 introduces the concepts found here, in part 2.] Technical needs Under the name technical needs we're going to see a series of requests that distributors often have to make to the original developers of the software they want to package. Not all these requests are made by all distributors. Some will care more about one particular aspect than another. Some might apply only on non-mainstream distributions, and some distributions might just want to take care of philosophical needs and leave the technical side entirely alone, even if similar distributions aren't exactly common. Most of the technical needs described in this article are present in the policies set forth by Debian (written), Gentoo (mostly unwritten), and apply to other distributions as well. Some of these needs won't be encoded in any policy and are often not requested explicitly by the developers. Those are mostly details that make a distributor's life easier. These details may not be mandatory, but it's still worth considering them. 
The easier the life of the downstream maintainer is, the easier it is for the software to be packaged. Also, it's important to note that when a distribution makes a request, it might not be alone. Other distributions might want to take advantage of the same change, but they didn't have time to request it, or simply preferred to wait before packaging the software until some issues were resolved. Don't just ignore the request because the distribution which contacted you already took care of the issue by patching your software. Acknowledge the request and apply the patch; it will make both your life and theirs easier in the long term.

Sane version information

Distributions often rely on the version information provided by the original software developers. This usually means that they don't expect huge changes between version $x.y.z$ and version $x.y.(z+1)$. One very common scheme for versions is the major, minor, micro version, which in the example above would be respectively $x$, $y$ and $z$ (it's a common misconception that $y$ is the major version component). The way this kind of scheme is usually applied relates to the compatibility of the programming interface (API and ABI). Changes in the software warrant increments of various version components depending on the amount of change in the interfaces: adding zero or more interfaces, without changing or removing previous interfaces, or the behaviour expected from them - meaning the software is entirely compatible with the older version - usually only requires an increment of the micro version; changing or removing interfaces, usually deprecated ones - in such a way that older software might need to be adapted, but not rewritten - usually requires an increment of the minor version; changing the interface entirely - requiring users of the software to rewrite their code, or otherwise make major structural changes - usually requires an increment of the major version.
Obviously, increasing one component will usually involve resetting to zero the version components on its right. There might be other components, too. For instance, if the source archive has to be regenerated without any code change (missing file, updated addresses for the maintainers or the homepages), rather than changing the version entirely, a suffix might just be added at the end of the version, making it, for instance, $1.2.3a$ or $1.2.3c$. If just a security issue has been fixed, it could also be expressed by adding a nano component to the version, like $1.3.34.1$, to emphasize that there is no change other than the security fix. The source archives for the software should be named after both the project and the version, resulting in names like foobar-1.3.4.tar.gz. Having different versions of the same software that don't follow the same naming causes confusion. It is quite important for the distributions that source archives not be changed without changing the name: distributions usually make sure that the checksum (usually MD5, but often nowadays SHA1) of the archive is the one they recorded, and changing the tarball without notice often leads to failed builds. There is a similar issue with the naming of the directory inside the archive. Most distributions assume that the source is contained in a directory with the same name as the archive (minus the extension), but often enough the archive contains sources not organised in a directory, or a directory with the name of the project without the version. Similarly, if possible the directory name should also contain any suffixes, to avoid adding extra cases in their presence. Distribution methods like Ruby Gems and Python Eggs mandate similar version schemes for their packages for the same reason free software distributions prefer them: it makes it easier to compare versions, and to know when something has to be updated.
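A minimal sketch of why a regular scheme helps: with predictable versions and archive names, comparing and bumping versions becomes mechanical. The helper names (parse_version, bump, split_archive_name) are invented for this example, the single-letter suffix and the .tar.gz/.tar.bz2 extensions are assumptions about what a project might use, and no real distribution's tooling is being reproduced here.

```python
import re


def parse_version(text):
    """Split '1.3.34.1' or '1.2.3a' into (numeric components, letter suffix)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)([a-z]?)", text)
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return numbers, match.group(2)


def bump(version, level):
    """Increment major, minor or micro and reset the components to its right."""
    numbers, _suffix = parse_version(version)
    major, minor, micro = (numbers + (0, 0, 0))[:3]
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "micro":
        return f"{major}.{minor}.{micro + 1}"
    raise ValueError(f"unknown level: {level!r}")


def split_archive_name(filename):
    """Return (project, version) from a name like 'foobar-1.3.4.tar.gz'."""
    match = re.fullmatch(r"(.+)-(\d+(?:\.\d+)*[a-z]?)\.tar\.(?:gz|bz2)", filename)
    if match is None:
        raise ValueError(f"archive name lacks a version: {filename!r}")
    return match.group(1), match.group(2)


# Tuples compare component by component, so 1.10.0 sorts after 1.9.9,
# a nano component sorts after the release it fixes, and a regenerated
# archive ('1.2.3a') sorts after the plain '1.2.3'.
newer_minor = parse_version("1.10.0") > parse_version("1.9.9")
security_fix = parse_version("1.3.34.1") > parse_version("1.3.34")
regenerated = parse_version("1.2.3a") > parse_version("1.2.3")
project, version = split_archive_name("foobar-1.3.4.tar.gz")
```

A plain string comparison would get the first case wrong ("1.10.0" sorts before "1.9.9" as text), which is one reason packaging tools want versions they can split into numeric components.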
Internal libraries One common issue considered by both Debian and Gentoo policies relates to the use of internal copies of libraries. Sometimes the software needs some uncommon libraries to work properly. These libraries are unlikely to be found on users' systems, which would require them to download and install them separately. Such a task is not easy for new users. A few projects will keep an internal copy of the libraries they want to use for that reason, and will use that internal copy unconditionally. Adding an internal copy of a library seems cheap to the original developers, and it's convenient for users to download and install a single package, however this causes a large number of problems to the distributors. The first problem is that they might have to patch the same bug several times. Let's all think of zlib as a practical example, a very common library implementing the classic deflate algorithm of compression. It's a very small library, that a lot of projects imported internally over the years. Not too long ago, a serious security issue was found in the code of zlib, and all the distributors had to patch it out as fast as they could. In a perfect world, patching zlib, and eventually rebuilding everything that linked to it would have sufficed. Unfortunately, we're not in a perfect world. More software was packaged with internal copies of the library, requiring each of those packages to be patched to make sure the issue was solved. There are many other implications with using internal bundled copies of libraries, and most of them are critical for distributors. These problems increase their complexity when the internal copies of libraries are modified to suit better the use the application has for them. In those cases, even though the source might be advertised as being part of another library, they are actually different from that library, and their replacement might be impossible, or may cause further problems. 
The code is no longer shared between programs: not only the source code, which requires extra work to fix bugs and security issues, but also executable code and data. When shared libraries are used, the memory used by processes loading them is reduced, as they will share code and part of the data. This cannot be done when using static libraries or, worse, internal copies of libraries.

Symbols may collide during loading: modern Linux and Unix systems use the ELF format for programs and libraries. This format provides a so-called flat namespace in which the symbols (data and functions) are looked up. When a library uses internal copies of other libraries, two definitions of the same symbol might collide, and just one of them can be used. If the interface used by the library has changed subtly, it is possible that this will lead the program down an execution path that was not intended and is not safe.

Distribution-specific changes need to be duplicated: as will be discussed later on, distributions sometimes need to make changes to source code, to fix bugs (security related or not), or to change paths of files, for instance. Internal copies require downstream maintainers to repeat these changes multiple times.

For this reason, a good compromise between the needs of the original authors and the needs of the distributions is to treat internal copies of libraries as untouchable, thus disallowing any changes in their interface or behaviour. That way those users who get the package directly from upstream still have only one package to download and build. The distributions, which want to share code as much as possible, should have a way to ask the build system to use the system copy of that library. An easy way to implement that is to provide a --with-system-libfoo option to the ./configure call (for autoconf, for instance), or to give a WITH_SYSTEM_LIBFOO handle at the make command line.
By allowing the distributions to use their own copies of libraries, the developers still preserve the ability for the user not to install extra dependencies, while also giving the distributions the power they need to avoid changing the original code, sometimes in a conflicting way. It is important for the upstream authors not to change the behaviour of bundled libraries, as the distributions will most likely want to use a shared system library instead. Modifications made to a bundled library will likely cause problems for users who get the package from their distribution's repository, where it has been built with a shared system library.

An easy choice for optional dependencies

Almost all distributions prefer having a choice about the optional dependencies of a package. Source-based distributions (like Gentoo and FreeBSD's ports system) offer the same (or more) choices as the original project. Gentoo's USE flags and FreeBSD's knobs let the user choose which options will be enabled. Binary distributions (like Debian or Red Hat) might want to choose options to ensure that the final binary package does not try to use dependencies that are not present in their official repositories. Again, if a project does not provide an easy way to control whether some optional dependency is used, most distributions will either try to work around that problem (by forcing cache discovery variables) or change the build system themselves to get the choice to disable or enable some dependency. This creates problems similar to the ones discussed above: different distributions might use slightly different changes, which may cause errors when merging them in, and they might make errors that introduce new bugs. As above, it's just a matter of providing a switch in the build system (like a --disable-feature or --without-feature in autoconf, or a WITHOUT_FEATURE knob for make).
If the software has a plug-in infrastructure, binary distributions might also just package the different plug-ins in different packages, allowing the user to choose which ones to install. Software without plug-in structures might require building different packages with different feature sets. For instance, if a software can use either OpenSSL or GnuTLS as implementation of SSL/TLS layers, then the distribution might create two packages, linking to one or the other. The user could then choose between the two. When some optional dependencies are discovered by the build system, used if present and ignored if not, without a way to tell the software to not build the optional feature that uses a library that is present on the system, we're talking about an automagic dependency. Automagic dependency is a term used to indicate when a package, optionally using another, discovers its presence automatically, without allowing for the user (or the downstream maintainer) to ask not to use it. This kind of dependency is usually a problem just for source distributions, as they build the software on users' systems, which may or may not have the same configuration as the developer working on the build scripts. Binary distributions on the other hand build their code in controlled environments having only the stated dependencies installed. This might actually confuse one of their developers in thinking that a given dependency is mandatory, seeing it enabled in their local build, and not finding an option to disable it. In general, automagic dependencies should be avoided; having a soft failure default is usually equivalent for the user passing by - you enable the dependency if found, disable it if not found, but still give a way to tell the build system to disable it even when found. This preserves the behaviour intended by the original developers, but also provides the control that (source) distributions want to have over what is built. 
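The difference between an automagic dependency and a soft-failure default can be sketched as a three-state switch. The function below is a hypothetical illustration, not taken from any real build system: the name resolve_feature() and the "auto"/"yes"/"no" values are invented here, and failing hard when a feature is explicitly requested but its dependency is missing is an assumption about sensible behaviour rather than something the text prescribes.

```python
def resolve_feature(request, found):
    """Decide whether an optional feature gets built.

    request -- "auto" (nothing specified), "yes" (explicitly enabled)
               or "no" (explicitly disabled)
    found   -- True if the build system detected the dependency
    """
    if request == "no":
        # Explicit opt-out wins, even when the library is installed.
        # An automagic build system lacks exactly this branch.
        return False
    if request == "yes":
        if not found:
            # Assumption: an explicit request that cannot be met is an error.
            raise RuntimeError("feature requested but dependency not found")
        return True
    if request == "auto":
        # Soft-failure default: use the dependency if present, skip it if not.
        return found
    raise ValueError(f"unknown request: {request!r}")


# The casual user passes nothing and gets whatever the system supports.
casual_with_lib = resolve_feature("auto", found=True)
casual_without_lib = resolve_feature("auto", found=False)
# A source distribution can keep the feature off on a system that has the library.
distro_disabled = resolve_feature("no", found=True)
```

The "auto" branch keeps the behaviour the original developers intended, while the "no" branch gives downstream maintainers the override that an automagic dependency denies them.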
Control over how the software is built

Another problem shared by both binary and source distributions is having control over how the software is built. For binary distributions this usually means being able to impose options on the compiler, linker and other tools during the package build process, so they respect the distribution's standard options. For source distributions, this means allowing the user to choose the options to provide to the compiler, linker, assembler and other build tools, on a package-by-package basis. This does not mean that the distributions want to force-feed extra optimisations into software that might be fragile, though that seems to be the biggest concern of developers who do not want to provide a way to change the options used at compile time. Distributions might want to reduce the optimisations used, or they might just wish to enable (or disable) warnings to more easily spot potential problems with their packages. Distributions might also want to build debug information, or remove debug messages, and so on. There is a huge number of possible combinations. When the distributions want to reduce optimisation, that might be because they need to create packages which work on older or less capable architectures that are not compatible with these optimisations. Or they know that some of these optimisations are not going to work in their environment. They might know that their version of the compiler does not support the optimisation, or there could be other reasons. Usually, the distribution knows the best way to handle the package for its own environment. This also leads to a compromise between upstream developers and downstream maintainers: the former should provide their own default options and optimisations, leaving a way to override these defaults as the distributions see fit. On the other hand, distributions should try their best to determine when problems might be caused by their own choice of optimisations.
Distributions should not expect upstream developers to fix problems that they have caused with their choice of optimisations. This way, it's usually possible to keep the relationship between upstream and downstream on good terms even when the set of optimisations used is totally different. More often than not, the problem is not the developers' willingness to provide an override, but rather getting such an override to actually work. While most distribution developers can fix these problems with relative ease, original developers would probably want to facilitate the work of their distributors by checking their own releases so that setting very minimal options to the compiler will work as intended. A common mistake is hard-setting CFLAGS (or similar variables) in the configure.ac file for autoconf (which otherwise has proper support for user-chosen options). While we're talking about compiler optimisations it's important to note that for some software, e.g. number crunching software (multimedia applications, cryptography tools, etc.), enabling extra optimisations is desirable. Even so, it should be possible to disable extensive optimisation. These optimisations are usually fragile, and only work in particular environments (compiler type and version, and architectures), so having a way for distributors to decide what they actually want to enable is a very real need. But having a way to provide options to the compiler (CFLAGS for C and CXXFLAGS for C++) is not all that is needed: most modern distributions might want access to the options used by the linker (LDFLAGS) to change the kind of hash tables to be generated, or to enforce particular security measures. For custom-prepared build systems, it's a common mistake to ignore this need, or to support it in the wrong way. Linker options should go before the list of object files, which in turn should go before the list of libraries to link to.
This is another common mistake that distributors can fix with relative ease, but it would be better taken care of by the original developers, as it would require repeating the same steps for (almost) all distributions. [This ends part 2 of this article. Stay tuned for part 3, which will cover the philosophical concerns and present some conclusions.] Improving syncookies Back in 1997 TCP SYN flood attacks were all the rage among script kiddies. A SYN flood is a denial of service attack that uses up server resources by initiating, but not completing, a connection. Attacks via this method still remain a problem today though they are now more likely to be launched by sophisticated botnets rather than an individual. A first line defense against SYN floods is the syncookie. The syncookie was not designed for Linux specifically but found its way into kernel 2.1.44 via a patch from Andi Kleen. This long-time feature generated some recent discussion when a patch was submitted adding syncookie support to IPv6. The patch has now been queued for acceptance but in discussion along the way the community also began to tackle some longstanding limitations of syncookies and reaffirmed how relevant the feature continues to be. To fully describe syncookies some background on how TCP uses a three way handshake to establish a connection is in order. The first packet of any TCP session received by the server is known as the SYN packet because it carries the synchronize control flag. The SYN flag indicates that its sender wishes to open a new connection. That flag is only used during the opening sequence. The server responds with a packet also containing the SYN flag because the connection needs to be opened in both directions. This second packet also carries the ACK flag and is known as the SYN-ACK. It serves to both open the connection from the server to the client and to acknowledge receipt of the opening packet from the other host. 
Finally, the client sends a bare ACK packet to the server to acknowledge receipt of server-to-client SYN-ACK and the connection is then fully established. During a SYN flood a server receives the first packet of the three-way TCP handshake and responds with a SYN-ACK but no further data is ever received from the initiating client. When the SYN-ACK is generated most servers will also create an entry in the SYN queue. This queue is the waiting area for half-open connections awaiting handshake completion. The attacker intentionally orphans those entries and instead generates more SYN packets which in turn take up more entries in the queue. The server needs to wait for a long timeout before giving up and recovering the connection resources. During this time the attacker can flood it with many more half-open connections. Eventually the server runs out of resources and cannot accept any new connections without dropping some, perhaps legitimate, connection from the queue. Simple solutions such as placing a quota on the number of partially open connections per peer or using dynamically adjusted packet filters do not work because the SYN packets are easy to forge with fake source addresses. A syncookie allows the server to defer using up any resources until the third packet in the three-way handshake has been received. At that time the peer's address has been mildly authenticated because the final packet in the handshake contains a reference to the sequence number that was sent by the server in the second packet. With this assurance, packet filters and resource quotas keyed to the peer's address will again be useful defenses against resource attacks. The basic mechanism of the syncookie works by carefully manipulating the initial sequence number value of the connection instead of choosing it at random. Upon receiving a SYN the server carefully encodes the vital information that would have been stored as state in the SYN queue. 
This encoded information is cryptographically hashed with a secret key to form the sequence number of the SYN-ACK and sent to the client. The third packet of a legitimate handshake, which is the ACK from the client back to the server, contains this sequence number (plus one) in its acknowledgment number field. In this way all the information necessary to fully open the connection is presented back to the server without having to maintain state while the handshake is being completed. The major downside to syncookies is that they only have space to encode the most basic of TCP handshake options. At the time of initial syncookie deployment this was not a large problem because the only option prominently in use at the time was the Maximum Segment Size (MSS) option. This option is provided to help the peer avoid unnecessary fragmentation by sending packets that the other end of the connection knows a priori are too large to cross its network. This is exactly the kind of information that is normally stored as state in the SYN queue. The syncookie designers knew that this option was important to performance and found 3 bits for it in the encoded syncookie. These bits are used to approximate the real value of the option to one of 8 common values. In the intervening years new options have come into prominence and these are not syncookie compatible. The most important of these are the window scaling and Selective Acknowledgment (SACK) options. These features respectively allow the TCP congestion control window to grow beyond 64KB and be more efficient in the case of minor packet losses from those large windows. Without using these features it is impossible to get good transfer rates on networks with large bandwidth or large latency. Many household broadband links require at least the window scaling option to fully utilize the network connection. 
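To make the mechanism and its limitation concrete, here is a minimal sketch of a syncookie in Python. It is not the Linux implementation: the 5/3/24 bit layout, the HMAC-SHA256 hash, the key, the MSS table values, and the names make_cookie() and check_ack() are all assumptions chosen for illustration. The only details taken from the text above are that the cookie becomes the SYN-ACK sequence number, that it comes back (plus one) in the client's acknowledgment field, and that three bits select one of eight approximate MSS values.

```python
import hashlib
import hmac

SECRET_KEY = b"example-only-secret"   # assumption: a per-server secret
MSS_TABLE = (64, 256, 512, 536, 1024, 1440, 1460, 4312)  # illustrative values
MAX_AGE = 2                           # accept cookies up to 2 time slots old


def mss_index(mss):
    """Approximate the client's MSS by the largest table entry not above it."""
    index = 0
    for i, value in enumerate(MSS_TABLE):
        if value <= mss:
            index = i
    return index


def _mac24(conn, counter, index):
    """24-bit keyed hash over the connection identity, time slot and MSS index."""
    message = repr((conn, counter, index)).encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(conn, mss, counter):
    """Build the 32-bit SYN-ACK sequence number; no per-connection state is kept."""
    index = mss_index(mss)
    return ((counter & 0x1F) << 27) | (index << 24) | _mac24(conn, counter, index)


def check_ack(conn, ack_number, now):
    """Validate the client's final ACK; return the recovered MSS or None."""
    cookie = (ack_number - 1) & 0xFFFFFFFF   # the ACK carries cookie + 1
    slot = cookie >> 27
    index = (cookie >> 24) & 0x7
    mac = cookie & 0xFFFFFF
    for age in range(MAX_AGE + 1):
        counter = now - age
        if counter >= 0 and (counter & 0x1F) == slot \
                and _mac24(conn, counter, index) == mac:
            return MSS_TABLE[index]
    return None
```

A server built this way would call make_cookie() when the SYN arrives and check_ack() when the final ACK arrives, allocating connection state only if the check succeeds. All 32 bits are spoken for, so there is nowhere to record a window scale or a SACK flag. The cost is easy to estimate: a sender can have at most one window of data in flight per round trip, so a 64KB window on a path with a 100ms round-trip time tops out at roughly 5 Mbit/s.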
Due to this limitation, and the modest computation overhead of the cryptographic hash, the Linux stack only resorts to syncookie based connections when the number of half-open connections exceeds a high watermark controlled by the net.ipv4.tcp_max_syn_backlog sysctl. These connections are less featureful than normal connections but they are only resorted to when the queue would otherwise require active pruning. It turns out that the cookie mechanism is only implemented for IPv4. Recently, Glenn Griffin posted patches that add IPv6 support for syncookies. Andi Kleen, author of the original syncookie patch, wondered if the mechanism should be continued at all, much less added to IPv6: "Syncookies are discouraged these days. They disable too many valuable TCP features (window scaling, SACK) and even without them the kernel is usually strong enough to defend against syn floods and systems have much more memory than they used to be. So I don't think it makes much sense to add more code to it, sorry." Andi's argument was three-pronged. His first point was about the reduced abilities of cookie initiated connections as already described in this article. Over time the value of these options has increased and therefore the cost of using syncookies has increased too. His second point was that Linux no longer uses all of the memory necessary for a full connection until the new connection is fully open. Instead it uses a "minisock" for that period. The minisock is a 96 byte struct tcp_request_sock structure holding the minimum state necessary to get the connection fully opened. The fully established struct tcp_sock is 1616 bytes. Both structure size measurements refer to a 64-bit kernel. Finally, Andi points out that the queue management routines for an overloaded SYN queue are more sophisticated now than the dumb head drop algorithm that was in place when syncookies were first deployed.
The suggestion was that in aggregate these advances might make Linux robust enough without syncookies so that they could therefore be removed altogether. Instead of engaging in a theoretical discussion some readers set up and ran their own experiments. One of the best parts of the Linux community is the tendency to put real data behind their arguments. While there is often disagreement over the realism of the measured scenarios, the data points always help us better understand the dynamics of kernel code. Willy Tarreau: "My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP reverse proxy consisted in injecting 250 hits/s of legitimate traffic with 8000 SYN/s of noise.[..] Without SYN cookies, the average response time was about 1.5 second and unstable (due to retransmits), and the CPU was set to 60%. With SYN cookies enabled, the response time dropped to 12-15ms only, but CPU usage jumped to 70%. The difference appears at a higher legitimate traffic rate." Ross Vandegrift: "Under no SYN flood, the server handles 750 HTTP requests per second, measured via httping in flood mode. With a default tcp_max_syn_backlog of 1024, I can trivially prevent any inbound client connections with 2 threads of syn flood. Enabling tcp_syncookies brings the connection handling back up to 725 fetches per second." This data compellingly supports the continued value of the syncookie and that position seems to have won the day. The IPv6 syncookie patches are now queued within the network 2.6.26 development tree. However, the biggest news is probably that this discussion brought renewed energy to the problem of lost handshake options. Florian Westphal and Glenn Griffin have recently presented a solution to the most damaging aspect of that problem too. Their solution is to leverage the echoed TCP timestamp option in a way similar to the way classic syncookies leverage the echoing of the SYN-ACK sequence number in the subsequent ACK.
The timestamp option was introduced with RFC 1323 and is widely deployed on modern Linux, Windows, and FreeBSD (including OS X) systems. Its main purpose is to be able to increase the frequency of round trip time measurements in the presence of large congestion control windows. Using the timestamp to preserve the window scale and SACK option values requires modifying the timestamp of the SYN-ACK packet to include the state necessary to support them. During a normal handshake the client will echo the modified timestamp value of the SYN-ACK packet back to the server as part of the timestamp option in the third packet of the handshake and thus propagate the SACK and window scale information without keeping any state on the server. In order to make room in the timestamp for this new information the least significant 9 bits of the timestamp are shaved off. The encoded representations of the window scale and SACK options are then transferred back and forth at the minor cost of reduced granularity of TCP timestamps during the handshake exchange. Timestamps lose their least significant 512 jiffies with this approach. Below are two different TCP handshakes completed with syncookies and the timestamp patch. Note that the lowest bits of the SYN-ACK timestamp are the same in each handshake even at different points in time because each handshake uses the same SACK and window scaling options. As a result the timestamp values in each SYN-ACK are different but the lower nine bits share the same 0x166 value. While there is no guarantee that the timestamp option will be supported by every TCP peer, timestamps are widely deployed on the most common operating systems. Additionally, because timestamps, window scaling, and selective acknowledgments are all features related to high latency and bandwidth networks it would be unlikely to find an implementation that supported only a subset of these options.
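The timestamp trick can be sketched in a few lines of Python. The article gives only the size of the borrowed field (the low 9 bits), so the layout used here, one bit for "SACK permitted" and four bits for the window scale, is an assumption made for illustration and may well differ from what the actual patch does; the function names are likewise hypothetical.

```python
TS_OPTION_BITS = 9                       # low bits borrowed from the timestamp
TS_OPTION_MASK = (1 << TS_OPTION_BITS) - 1


def encode_timestamp(now_ticks, wscale, sack_ok):
    """Return a SYN-ACK timestamp whose low 9 bits carry the handshake options.

    Assumed layout (not taken from the patch): bit 0 = SACK permitted,
    bits 1-4 = window scale. RFC 1323 limits the scale to 0..14.
    """
    if not 0 <= wscale <= 14:
        raise ValueError("window scale must be between 0 and 14")
    options = (wscale << 1) | int(sack_ok)
    # Clear the low 9 bits of the clock value, then store the options there.
    return ((now_ticks & ~TS_OPTION_MASK) | options) & 0xFFFFFFFF


def decode_timestamp(echoed_value):
    """Recover (wscale, sack_ok) from the timestamp echoed in the client's ACK."""
    options = echoed_value & TS_OPTION_MASK
    return (options >> 1) & 0xF, bool(options & 1)
```

Two handshakes that negotiate the same options at different times produce different timestamps with identical low nine bits, which is the pattern described above. The price is that the value sent can differ from the true clock by up to 511 ticks.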
One shortcoming of the scheme is that it is not general enough to be future-proof as new handshake based options may continue to be deployed. At this time the MSS, SACK, window scaling, and timestamp options are the only handshake options seen with any regularity other than the NOP option which is just used for packet alignment. However, the whole point of an extensible option scheme is to leave room for future improvements. The IANA registry that records option values was last updated in February 2007 to reserve option code 27 for use with Experimental RFC 4782 "Quick Start for TCP and IP". Only time will tell if that particular option will be the next challenge to the syncookie scheme or if something else will rise first. The timestamp patch has only been posted very recently, and there has been little discussion of it beyond the developers who worked directly on it. It is not clear whether or not it will be accepted right away into the mainline, but it certainly seems to address a well known core problem with the syncookie at a minor cost. With the updates for IPv6 and modern TCP option schemes syncookies appear primed to keep providing sweet relief in their somewhat esoteric networking security niche. Perhaps they will keep chugging away for another 10 years without having to be re-baked. Discussing desktops at the Collaboration Summit Your editor is typing this from the Linux Foundation's collaboration summit, currently in progress in Austin, Texas. The day's agenda includes giving a talk on the state of the kernel during the evening reception; beer-fueled hecklers would appear to be in your editor's near future. The first day, though, included a rather more sober panel on the state of the Linux desktop which revealed some interesting thoughts on where things are going. 
This panel, moderated by Steven Vaughan-Nichols, featured John Hull from Dell, David Liu (gOS), Jim Mann (HP), Timothy Chen (Via), Kelly Fraser (Xandros), Grégoire Gentil (Zonbu), Ellis Wang (Asus), Debra Kobs-Fortner (Lenovo), and a representative from Everex whose name your editor did not catch. Together, they represented a wide range of industries, from component makers and operating system vendors to providers of complete systems. They take different approaches to the Linux desktop, but they are all optimistic about where it is heading - though some are more so than others. So how are these vendors doing with desktop Linux? While all of the vendors were optimistic, some were more guarded than others. Dell states that sales have "met expectations," but are aimed mostly at niche markets so far. There is, they say, a lot of interest in emerging markets, where users can start with Linux from the outset and do not have to migrate from other platforms. HP was also moderate in its enthusiasm, saying that its sales are "right about at the industry average." Lenovo was cautiously optimistic; their Thinkpad offerings are targeted at business users, which is a slower market to get into. According to Lenovo, most of their Linux-based sales are custom products designed for specific businesses. Rather more enthusiasm came from gOS, the company which supplied the distribution for Wal-Mart's low-end PC. Sales, they say, are "very good." Asus is clearly happy with the success of the Eee PC. That success, they say, comes from the effort put into designing a complete solution for users, with features like quick booting and solid-state storage: "you drop it, it still works." Everex says that "sales are brisk"; the company is pleased and will continue to offer Linux-based products - including the "MyMiniPC", a small system aimed specifically at MySpace users. Via's components are found in a number of small Linux systems, including the Eee PC, so Via is happy. 
It's too early for real results from Zonbu, which is trying to use Linux-based systems for a "computers as a service" business model. But, says Zonbu, Linux is the best platform for companies trying new models. Finally, Xandros also is optimistic, especially about "new form factors" for the desktop, a place where Microsoft, they say, "stumbled." The panel was asked what the development community can do to help these desktop businesses; in response, Arjan van de Ven piped up from the audience, asking what the companies are doing for the kernel community. From Lenovo, the word is that developers can work to get drivers into enterprise distributions as soon as possible. That request, of course, gets back to the tension between enterprise distributions and the desire for current code; this subject was not pursued further here, though. Dell would like to see more collaboration with other vendors in the production of drivers. The Via representative came straight out and said that "we don't do much" to support the community, but insisted that their intentions are good. He said that community support is hard for a Taiwanese company to do, but didn't say why. Via does plan to open a community site at linux.via.com.tw with driver code and more, but this site is not yet in place. Support of users came up briefly. The HP representative said that the company expects distributors to provide backup support, but the first call will always go to the vendor of the hardware. That can be a problem, especially for the small devices which are seeing so much success at the moment; a single support call can wipe out any profit on the sale of one of those systems. Selling "constrained systems" which only do a few things helps; but, earlier, Mr. Mann had also talked about the difficulty of installing additional applications on these systems.
There would appear to be some tension between providing a truly open device and keeping support costs down. The word from Asus is that a system like the Eee PC generates a lot of relatively trivial calls - things like "how do I search on the web?" So there is a real need to train users which has little to do with Linux itself. On the subject of applications, the gOS representative discussed a strategy of putting as much as possible on the web. The problem with local applications which look like Microsoft products is that users then expect those applications to behave like Microsoft products. It is better to have something which is obviously different and, presumably, better. Xandros called for better style guides and consistency throughout the interface; clones of other products are not what the market needs. On the HP side, the biggest request was "don't make people open a terminal." Perhaps the most amusing comment came from the Via representative, who described a "Maddog/Shuttleworth" choice. He asserted that his grandparents would find Jon "maddog" Hall (who was in the audience) to be a rather scary presence, while Mark Shuttleworth comes across as a friendly gentleman. Our interfaces, he says, need to look more like Mark Shuttleworth. Your editor, who has always found Maddog to be one of the friendliest people he knows, does not entirely buy into this analogy. But perhaps there is something to be said for clean-shaven interfaces. There was some talk of asking suppliers to provide hardware which is supported by free software. Perhaps the most telling comment came from Lenovo, which, apparently, has been asking for Linux-supported hardware "for a number of years." Free drivers are not a priority, though; the first priority is just having things work. So there is still some work to be done in this direction. 
Arguably the most interesting theme which came from this discussion - and from the first day of the summit as a whole - is that nobody is really pushing all that hard to get Linux into traditional desktop settings. The real action at the moment would appear to be in small devices like the Eee PC. These "greenfield" areas where there is no established presence to compete against offer vendors a market where they are not trying to migrate users away from other products. They would appear to be convinced that Linux can be a strong contender there - maybe the strongest. So soon we may truly see the year of the Linux desktop - for specific types of "desktop." Design simple menus with Cursed Menu The Cursed Menu project implements a terminal-based menu system via the Curses terminal control library: Cursed Menu aims to create an ncurses based menu system for character based sessions. This menu program could be used to create user, system administration, or utility menus for clients connecting with text based clients such as telnet, ssh, or rlogin. Version 1.0.3 of Cursed Menu was recently announced. Despite being unable to find any documentation whatsoever on the project page, your editor decided to try out the software. The code was downloaded as a tar.bz2 file, uncompressed and untarred. The configure script was run on a system running Ubuntu 7.04. There was one dependency issue that was fairly easily solved by installing the libncurses5-dev package. After fixing that, the software configured and built correctly. The next logical action was to take a look at the source code in the src/ subdirectory. The source files were mostly .cc and .hh, indicating a C++ project. The cursedmenu binary was run and a blue curses screen similar to the example screenshot showed up. Navigating through the menus was simply a matter of using the arrow keys for movement and the Enter key for selecting an item.
A longer description of the item under the cursor showed up on the lower left corner of the terminal screen. A little more digging through the code revealed the configuration system for Cursed Menu. Each menu has an associated .cmd file; here's what the default main menu .cmd file looks like: Customizing the .cmd file was fairly intuitive; shell commands were added to the ItemExec lines and ran when the menu item was selected. The cursedmenu binary picked up the changes in the .cmd file without recompilation. Cursed Menu provides a quick and easy way to control simple shell scripts and could be useful for many purposes. The project could really benefit from some basic documentation; a simple README file with a description of the available commands would be a good start. Despite this lack, the code seems to function nicely and can be put to use as-is. Backscatter increase clogs inboxes Backscatter, also known as blowback, is the result of a spammer forging the sender address on an email that is sent to a non-existent address. Many mail servers do not reject invalid addresses when they receive the email and instead generate a bounce message sometime later. The unfortunate victim, then, is the one whose address was forged as the sender. Sometimes, hundreds or thousands of bounce messages can be generated which flood the inbox of an innocent bystander. Backscatter seems to be on the rise recently; the LWN inbox has seen a huge increase in the number of bounces over the last week or so. There may be some connection to some Google domains contributing to the problem, but that cannot explain all of it. One basic problem is that many mail servers are generating the bounce messages after accepting mail for invalid addresses, rather than rejecting it while the SMTP transaction is still in progress. When a mail server gets a connection from a sending machine, it gets several pieces of information about the email in addition to its contents.
Both a "from" and "to" address are included in this extra information, which is usually called the envelope, for obvious reasons. After receiving each piece of the envelope, a mail server has the opportunity to reject the message. Typically this isn't done for valid-looking sender addresses, except in limited blacklist situations, but it certainly can and should be done when the recipient address is invalid. Due to a variety of mail server configuration issues, many mail servers do not avail themselves of the opportunity to reject mail for invalid recipients. Instead, they defer their decision until sometime later. Servers that relay mail will not know whether some of the addresses they relay are valid, while other servers (qmail for example) separate the SMTP conversation program from the local delivery program for security reasons and thus do not have that information available. Other valid or semi-valid reasons exist, but once the mail has been accepted, the proper means of indicating a bad address is no longer available. In the days before spam (remember those?) a mail server could generally trust that the sender address in the envelope was the real sender. So an incorrectly addressed email could be bundled up in a bounce message and sent to the sender. If the sender address is valid, such a bounce differs very little from one that is generated by the sender's machine when the mail gets rejected at SMTP time. Unfortunately, the majority of sender addresses these days are forged. But spammers don't want to use just any forged address; they want to use something that is valid or appears valid. Mail servers have gotten better at testing sender addresses for validity before accepting mail from them. So, where does an enterprising spammer get a valid email address? They pick one at random from their list of "500,000 guaranteed opt-in email addresses" that they bought from some other miscreant.
They use those lists to send their spam to as well as using them to choose sender addresses to use. As might be guessed, the SpamAssassin mailing lists have been discussing the problem recently, especially trying to find ways to reduce the amount received. SpamAssassin does have the VBounce plugin to recognize bounce messages. By default, it doesn't increase the score of bounces by much as it is meant to be used with procmail to put bounces in a separate place from spam. Another idea floated on the list is to use SPF or DKIM records for a domain. The belief is that spammers avoid using those domains because it is likely to cause their message to be immediately classified as spam. Anecdotal evidence seems to indicate that backscatter can be significantly reduced in this way. Notes from the Collaboration Summit Your editor has certainly attended no shortage of Linux-related conferences. Many of those are developer conferences, which are invariably interesting events. Others are oriented around marketing or outreach, with rather more variable results. The Linux Foundation's Collaboration Summit, which ran from April 8 to 10, is unique, though, in that it attracts representatives from throughout the Linux ecosystem. Developers are not in short supply (though it seemed like there were fewer than last year), but those developers spend three days talking with corporate executives, industry analysts, and, crucially, a number of high-profile users. This mixture of people creates a very different dynamic which supports a whole range of interesting conversations. One of the first events was the kernel developers' panel, moderated by your (normally rather immoderate) editor. Panelists James Bottomley, Matt Domsch, Dave Jones, Christoph Lameter, Ted Ts'o, Arjan van de Ven, and Chris Wright discussed a variety of topics ranging from kernel quality (getting better), code review, development process participation, hardware support, and more. 
Your editor was not able to take notes from the panel; perhaps the best report which has come up so far can be found in this InformationWeek article by Charles Babcock. IDC analyst Al Gillen spent half an hour going through a bunch of chart-heavy slides on the future of Linux in the marketplace. Overall, things look good, in that a market worth $20 billion in 2007 is expected to go up to $50 billion in 2011. There were lots of associated details which have been reported elsewhere. One interesting aspect was watching how the analyst trade copes with "non-paid" Linux deployments - which, according to Mr. Gillen, is 43% of the total. There was talk about how "monetizing" these deployments is a challenge for those looking to make money in the Linux marketplace. He expressed surprise at just how many companies are confident in their ability to support Linux deployments on their own. But he also talked about just how important that non-paid base is for the support of the entire ecosystem. Non-paid deployments may be a "challenge" to those who would prefer to be paid, but their absence would be a rather larger challenge. There was an echo of this insight when Red Hat CTO Brian Stevens talked. One of Red Hat's goals, he says, is to give customers the immense value that goes with a "zero cost to exit" offering. There is no RHEL lock-in. To that end, he says, the folks at CentOS have done Red Hat a great favor. Brian also talked about the difference between the old "selling the distribution" business model, which gave Red Hat an incentive to put lots of shiny new things into each release, and the current model, which puts the focus on continuity instead. Since Red Hat's customers have already paid for the next release, Red Hat doesn't need to add lots of cool new features to encourage them all to upgrade. He then spent the rest of his talk on the various cool new features the company is working on, including messaging, realtime support, and more. 
Marten Mickos, once CEO of MySQL and now a vice president at Sun Microsystems, gave a talk which was intended to make listeners feel good about Sun and its plans for free software. It bothers him, he says, when people ask whether MySQL will remain committed to Linux; it strikes him as a demonstration of uncertainty about the future of Linux in general. That uncertainty is unnecessary; Linux's future is strong, regardless of what MySQL does. But MySQL (and Sun) do remain committed to Linux as a platform; the era of monolithic computing platforms is over, and companies have to support customers who will make their own choices at each level in the stack. So LAMP as an "architecture of participation" will remain supported by Sun well into the future. An industry panel on "the state of Linux" was a useful view into how some large companies see the platform. They are all seeing growth in Linux; Bdale Garbee (representing HP) noted that Linux is "showing up in everything" that customers are planning. IBM's Dan Frye said that Linux is ready for any kind of workload. Oracle's Wim Coekaerts did note, though, that Oracle's revenue from Linux, at a mere $2 billion, is "still lagging." There was a fair amount of discussion on how to work with the development community; NetApp's Brian Pawlowski asserted that "money helps." By that, he means employing developers to work within the community and advance the platform. Bdale noted that HP tries to work "in" the community, not "with" it. Dan Frye echoed that thought, saying that it's important to have people with credibility in the community and to allow them to work inside the community for long periods of time. Motorola's Christy Wyatt, instead, worried that her company still doesn't have the necessary wisdom to work effectively with the development community; Linux and the mobile industry, she says, are still relatively new to each other. 
Wim related a story from the first kernel summit wherein an Oracle representative presented a laundry list of desired features. That is, he says, not the right way to do things; the community tends not to react well to wishlists with no development effort behind them. Oracle now has a Linux development team which is entirely separate from the normal product teams; among other things, it has a blanket approval to contribute the code it develops, avoiding the lengthy and tiresome internal legal review process. The company has also adopted a policy of making projects open from the beginning, getting much-needed review early in the process. Other participants noted that working with a company's legal department can often be the hardest part of community participation. Dan suggested bringing in the legal department at the beginning of a project and keeping them around; sticking with a single counsel who can slowly be educated in free software ways is also important. Bdale said that we were likely to need "legal domain experts" for some time yet, but that the situation is getting better; most lawyers now have at least some understanding of how free software licensing works. A couple of panelists discussed the legal headaches that come with mixing components with different licenses; they would certainly like to see fewer licenses going into the future. The final session from the first day covered the state of mobile Linux. It was about the only contentious panel on a day where the majority of the sessions were mostly educational in nature. One area of disagreement was over security models. Some platforms (such as ACCESS) work with a fine-grained set of privileges, while Google's Android uses sandboxing and controlled access to resources determined by asking the user. The fine-grained approach is seen by some as an ideal way for carriers to lock down handsets and exert firm control over what handset owners can do - not the desired outcome. 
On the other hand, asking users is seen as insecure; it's not usually too hard to get users to agree to almost anything.

Perhaps the lowest moment in this panel came when Google's Eric Chu was asked about participation with the community as opposed to developing everything as a private fork. He replied that the Android code was open; it sits in a repository somewhere. But there will be no effort to engage with (for example) the kernel community and merge this code until it is "done." That approach runs against what others had been saying since the kernel panel that morning: one must get code out there as early as possible. When the Android developers finally decide that their code is ready, they are likely to have a nasty surprise when they try to merge it into the kernel and are told that much of it is unsuitable by design.

Google came off looking somewhat bad here, but the truth of the matter is that most of the (many) mobile Linux projects are operating in similar ways. Getting these projects to really work with the communities whose code they are using is, as with many embedded applications, a challenge. One can hope that the suggestions given to these projects at the summit will be taken to heart. That sort of communication is what makes this event worthwhile; it is often hard for this particular mixture of people to come together in other contexts.

The Collaboration Summit was heavy on conversation in general, often to great effect. One well-known developer commented to your editor that the Summit had the biggest disparity between the official content and the "hallway track" that he had ever seen. The hallway track was good, with, hopefully, lots of good things to come from it in the coming months.

TOMOYO Linux and pathname-based security

It takes a certain kind of courage to head down a road when one can plainly see the unpleasant fate which befell those who went before.
So one might think that the fate of AppArmor would deter others from following a similar path. The developers of TOMOYO Linux are not easily put off, though. Despite having a security subsystem which shares a number of features with AppArmor, these developers are pushing forward in an attempt to get their code into the mainline.

AppArmor, remember, is a Linux security module which uses pathnames to make security decisions. So it is entirely conceivable that two different security policies could apply to the same file if that file is accessed by way of two different names. This approach helps make AppArmor easier to administer than SELinux, but it has given AppArmor major problems in the review process for a few reasons:

- There has been strong resistance to the addition of any new security modules at all, to the point that proposals to remove the LSM framework altogether have been floated.
- Some security developers see a pathname-based mechanism as being fundamentally insecure. SELinux developers, in particular, have been very strongly against pathname-based security. To these developers, security policies should apply directly to objects (or to labels attached directly to objects) rather than to names given to objects.
- The current Linux security module hooks, not being developed with pathname-based security in mind, do not provide sufficient information to the low-level file operation hooks. So AppArmor had to reconstruct pathnames within its security hooks. The method chosen for this reconstruction was, one might say, not universally admired.

If the TOMOYO Linux developers are serious about getting their code into the mainline, they will need to have answers to these objections. As it happens, the first two obstructions have mostly gone away. Casey Schaufler's persistence finally resulted in the merging of the SMACK security module for 2.6.25; it is the only such module, other than SELinux, ever to get into the mainline.
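The pathname concern raised above (one file, two names, two different answers) can be made concrete with a toy model. This is only a sketch under loud assumptions: the file names, the inode number, the label name and the default-allow fallback are all invented for illustration, and none of this reflects how AppArmor, TOMOYO Linux or SELinux actually express or enforce policy.

```python
# Toy model only: every name, number and rule here is hypothetical.

# Two names (think hard links) that refer to the same underlying file object.
names_to_inode = {
    "/etc/secret": 1001,
    "/tmp/second-name": 1001,  # a second name for the same inode
}

# Pathname-based policy: rules are attached to names.
path_policy = {"/etc/secret": "deny"}

# Label-based policy: rules are attached to a label stored with the object.
inode_labels = {1001: "secret_label"}
label_policy = {"secret_label": "deny"}


def check_by_path(path, default="allow"):
    """Decide using only the name by which the file was reached."""
    return path_policy.get(path, default)


def check_by_label(path, default="allow"):
    """Decide using the label on the object the name resolves to."""
    label = inode_labels.get(names_to_inode[path])
    return label_policy.get(label, default)


if __name__ == "__main__":
    for name in names_to_inode:
        print(name, check_by_path(name), check_by_label(name))
```

In this model the pathname check gives different answers for the two names while the label check gives the same answer for both, which is the core of the objection described above. Whether that difference matters in practice depends on how the policy is written; the default-allow fallback was chosen here only to make the contrast visible.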
Now that SMACK has paved the way, talk of removing the LSM framework (which had been strongly vetoed by Linus in any case) has ended and the next security module should have an easier time of it. Linus has also decreed that pathname-based security modules are entirely acceptable for inclusion into the kernel. So, while some developers remain highly skeptical of this approach, their skepticism cannot, on its own, be used as a reason to keep a pathname-based security module out. Pathname-based approaches appear to be "secure enough" for a number of applications, and there are some advantages to using that approach. All of the above is moot, though, if the TOMOYO Linux developers are unable to implement pathname-based access control in a way which passes muster. The recent TOMOYO Linux patch took a different approach to this problem: since the LSM hooks do not provide the needed information, the developers just added a new set of hooks, outside of LSM, for use by TOMOYO Linux. And, while they were at it, they added new hooks at all enforcement points. This was not a popular decision, to say the least. The whole idea behind LSM was to have a single set of hooks for all security modules; if every module now adds its own set of hooks, that purpose will have been defeated and the kernel will turn into a big mess of security hooks. Duplicating the LSM framework is not the way to get a security module into the mainline. So, somehow, the TOMOYO Linux developers will need to implement pathname-based security in a different way. The most obvious thing to do would be to modify the existing hooks to supply the requisite information (being a pointer to the vfsmount structure). The problem here is that, at the point where the LSM hooks are called, that structure is not available; it is only used at the higher levels of the virtual filesystem code. 
So either some core VFS functions would have to be changed (so the vfsmount pointer could be passed into them), or a new set of hooks would need to be placed at a level where that pointer is available. It appears that the second approach - adding new hooks in the namespace code - will be taken for the next version of the patch. As the TOMOYO Linux developers work through this problem, they are likely to be closely watched by the (somewhat reduced in number) AppArmor group. There appears to be a resurgence of interest in getting AppArmor merged, so we will probably see AppArmor put forward again in the near future. That will be even more likely if TOMOYO Linux is able to solve the pathname problem in a way which survives review and gets into the kernel. Bisection divides users and developers The last couple of years have seen a renewed push within the kernel community to avoid regressions. When a patch is found to have broken something that used to work, a fix must be merged or the offending patch will be removed from the kernel. It's a straightforward and logical idea, but there's one little problem: when a kernel series includes over 12,000 changesets (as 2.6.25 does), how does one find the patch which caused the problem? Sometimes it will be obvious, but, for other problems, there are literally thousands of patches which could be the source of the regression. Digging through all of those patches in search of a bug can be a needle-in-the-haystack sort of proposition. One of the many nice tools offered by the git source code management system is called "bisect." The bisect feature helps the user perform a binary search through a range of patches until the one containing the bug is found. All that is needed is to specify the most recent kernel which is known to work (2.6.24, say), and the oldest kernel which is broken (2.6.25-rc9, perhaps), and the bisect feature will check out a version of the kernel at the midpoint between those two. 
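The binary search just described can be sketched in a few lines. This is an illustration under simplifying assumptions, not a description of how git implements bisect: it treats the history as a straight line of numbered changesets (which git history is not), and the function name `first_bad_commit`, the `is_bad` callback and the culprit number 7421 are invented for the example.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search a linear history for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad. Returns a tuple of
    (index of the first bad commit, number of build-and-test rounds).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2      # the "midpoint kernel" to build and boot
        tests += 1
        if is_bad(commits[mid]):
            bad = mid                # bug already present: look earlier
        else:
            good = mid               # still works: look later
    return bad, tests


if __name__ == "__main__":
    # Entry 0 stands for the last good release; 12,000 changesets follow it.
    history = list(range(12001))
    culprit = 7421                   # hypothetical offending changeset
    index, rounds = first_bad_commit(history, lambda c: c >= culprit)
    print("first bad changeset:", index, "found after", rounds, "tests")
```

With roughly 12,000 changesets between the good and bad kernels, the loop needs about log2(12000) rounds, which works out to 13 or 14 build-and-test cycles depending on where the culprit sits.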
Finding that midpoint is non-trivial, since, in git, the stream of patches is not a simple line. But that's the sort of task we keep computers around for. Once the midpoint kernel has been generated, the person chasing the bug can build and test it, then tell git whether it exhibits the bug or not. A kernel at the new midpoint will be produced, and the process continues. With bisect, the problematic patch can be found in a maximum of a dozen or so compile-boot-test cycles.

Bisect is not a perfect tool. If patch submitters are not careful, bisect can create a broken kernel when it splits a patch series. The patch which causes a bug to manifest itself may not be the one which introduced the bug. In the worst case, a developer may merge a long series of patches, finishing with one brief change which enables all the code added previously; in this case, bisect will find the final patch, which will only be marginally useful. If the person reporting the bug is running a distributor's kernel, it may be hard to get that kernel in a form which is amenable to the bisection process. Bisection might require unacceptable downtime on the only (production) system which is affected by the bug. And, of course, the process of checking out, building, booting, and testing a dozen kernels is not something which one fits into a coffee break. It requires a certain determination on the part of the tester and quite a bit of time.

All of the points above would suggest that requesting a bisection from a user reporting a bug should be done as a last resort. In that context, it is worth looking at the story of a recent bug report which suggests that some observers, at least, think that kernel developers are relying a little too heavily on this tool.

On April 9, Mark Lord reported a regression in the networking stack; after making a couple of guesses, the network developers suggested that the problem be bisected.
Mark replied that he did not have the time to go through a full bisection, and that he would much rather be provided a list of commits which might be at fault. That list was not forthcoming, though; there were no developers who had an idea of where the problem might be and, as it turns out, the developer who introduced the bug lives in a time zone which caused him to miss the discussion. Mark's response was strong: Years ago, Linus suggested that he opposed an in-kernel debugger mainly because he preferred that we *think* more about the problems, rather than just finding/fixing symptoms. This 100% reliance upon git-bisect is worse than that. It has people now just tossing regressions into the code left and right, knowing that they can toss all of the testing back at the poor folks whose systems end up not working. Andrew Morton also worries that developers resort too quickly to a bisection request rather than working with users as was once done. Either that, he says, or developers just ignore the report from the beginning. Other developers have answers to these worries, of course. Kernel developers often are not in a position to reproduce a reported bug; it may depend on the specifics of the user's hardware or workload. So they must depend on the user to try things and inform them when a change fixes the problem. Here's David Miller's view on how things used to work: In fact, this is what Andrew's so-called "back and forth with the bug reporter" used to mainly consist of. Asking the user to try this patch or that patch, which most of the time were reverts of suspect changes. Which, surprise surprise, means we were spending lots of time bisecting things by hand. We're able to automate this now and it's not a bad thing. The other answer that one hears is that the situation now is much different, with far more users, much more code, and more problems to deal with. 
The old "back and forth" mode was better suited to smaller user and developer communities; in the current world, things must be done differently. David Miller again: What people don't get is that this is a situation where the "end node principle" applies. When you have limited resources (here: developers) you don't push the bulk of the burden upon them. Instead you push things out to the resource you have a lot of, the end nodes (here: users), so that the situation actually scales. There is another aspect of the problem which is spoken about a bit less frequently: developers must prioritize bug reports and decide which ones to work on. Unlike some projects, the kernel does not have anybody serving in any sort of bug triage role, so, in the absence of a disgruntled and paying customer, most developers make their own decisions on which problems to try to solve. It should not be surprising that problems with the most complete information are the ones which are most likely to be addressed first. A bug report with a bisection that fingers a specific commit is a report with very good information, one which is generally easy to resolve. As an example, consider Mark Lord's report again; he did eventually take the time (five hours, apparently) to bisect the problem and report the results; the bug was found and fixed almost immediately thereafter - despite the fact that the responsible developer was still sleeping on the other side of the planet. Even less spoken about is the fact that quite a few problems are one-off occurrences. Somewhere out there in the world, there is a single user who, due to a highly uncommon mixture of hardware and software, experiences a problem which affects (almost) nobody else. Marginal hardware, out-of-tree patches, and overclocking only make the problem worse. 
Arjan van de Ven's kernel oops summaries are illustrative in this regard; the statistics for the 2.6.25-rc kernels show that a half-dozen problems account for over half of the reports, while the vast majority of oopses have only a single occurrence. Kernel developers have learned that this kind of problem report tends to go away by itself; the affected user finds a way around the issue (or just gives up) and nobody else ever complains. One can well argue that trying to chase down this kind of problem is not a good use of a kernel developer's time. The hard part is figuring out which reports are of this variety. One relatively straightforward way is to wait until reports from other users confirm the problem - or until a sufficiently determined user bisects the problem and provides a commit ID. In this sense, bisection serves as a sort of triage mechanism which requires users to perform enough work to show that the problem is real. So the developers do have very good reasons for requesting bisections from users. That said, there is reason to worry that many users will simply stop sending in bug reports. If the only response they can expect is a bisection request (which they may be in no position to answer), they may see no point in reporting bugs at all. Fewer bug reports is not the path toward more solid kernel releases. So, as useful as it is, bisection will have to be a tool of last resort in most cases. The good news is that the development community does seem to understand that; bisection remains just one of the many tools we have for the isolation and solution of problems. The not-quite-so-good news is that, as Al Viro and James Morris have pointed out, the real problem is in the review of code so that fewer bugs are created in the first place. That is not a problem which can be solved with bisection. e1000 v. e1000e Ingo Molnar was recently bitten by a problem which, in one form or another, may affect a wider range of Linux users after 2.6.26. 
Linux currently has two drivers for Intel's e1000 network adapters, called "e1000" and "e1000e". The former driver, being the older of the two, supports all older, PCI-based e1000 adapters. There is, shall we say, a relative shortage of developers who are willing to stand up for the quality of the code in this driver, but it works and has a lot of users. The e1000e driver, instead, supports PCI-Express adapters. It is a newer driver which is seen as being better written and easier to maintain. It is intended that all new hardware will be supported by this driver, and that, in particular, all PCI-Express hardware will use it. The only problem is that a few PCI-Express chipsets were added to the older e1000 driver before this policy was adopted. Since the newer driver also supports those chipsets, there are two drivers (with two completely different bodies of code) supporting the same hardware. The e1000 maintainers would like to end this duplication and put the e1000 driver into a stable maintenance mode. To that end, earlier this month, it was announced that, as of 2.6.26, the PCI IDs corresponding to PCI-Express devices would be removed from the e1000 driver, and that all users of that affected hardware need to move over to e1000e. The e1000 developers had originally tried to make this move for 2.6.25, but they committed a fundamental faux pas in the process: they broke Linus's machine. So that change got reverted before 2.6.25-rc1 came out. Instead, now, we have the announcement that the change is coming in the next cycle (when the e1000e problems, presumably, will be fixed) and a bit of configuration trickery has been added; it causes the e1000 driver to not claim PCI-Express devices if the e1000e driver has been built into the kernel. Ingo's problem is that he built the e1000 driver into his kernel, but ended up with e1000e configured as a module which was never loaded. 
That combination leads to a network adapter which does not work at all, since the built-in driver no longer claims it. Ingo, a bit disgruntled at having to spend an hour tracking down the problem, has suggested that it is a regression which must be fixed. The e1000 driver maintainers have resisted doing so, but Linus, having also been burned, agrees. So, while this transition is likely to go ahead as scheduled, 2.6.25 will probably have a configuration change designed to keep others from falling into a similar trap. OMFS and the value of obscure filesystems Your editor has never dabbled in filesystems development. He has a suspicion, however, that there is a tense moment in every new filesystem developer's life: when Christoph Hellwig's review shows up in the mailbox. Christoph's reviews, while not always being pleasant reading, tend to be right on the money with regard to problems in filesystem implementations - and problems in new filesystems are common. Christoph's stamp of approval is almost required for the merging of a filesystem, so, when the initial posting of a filesystem is greeted with reviews that read, nearly in their entirety, "looks good," one would assume that the path into the mainline would be straightforward. The story of OMFS, though, shows that this assumption does not always hold. Reviewers have only been able to find the smallest of details to fix, but there is opposition to its merging, especially from Andrew Morton. The objection is that this filesystem - found on devices like the Rio Karma music player and ReplayTV boxes - has a very small user base. OMFS developer Bob Copeland, in his initial posting, suggested that fewer than twenty people might be using it at this time. New devices with this filesystem are no longer being made, so the chances of the user base growing significantly are small. Andrew's objection is that the addition of any new code creates a new maintenance burden for kernel developers. 
Whenever a VFS interface is changed, all filesystems must be fixed to work with the new API. So the addition of a filesystem imposes costs which, he says, should be outweighed by the benefits that new filesystem brings. In the case of an obscure filesystem with a small and (presumably) decreasing user base, says Andrew, it is not clear that the benefits are sufficient. He asks: Just as a thought exercise: should we merge a small and well-written driver which has zero users? Andrew would rather see OMFS turned into a user-space filesystem using FUSE. Chris Mason is also concerned: Even though OMFS seems to be using the generic interfaces well, there is still a testing burden for every change. Someone needs to try it, report any problems and get them fixed. Since none of the people making the changes is likely to have an OMFS test bed, all of that burden will fall on Bob, his users, and anyone who tries to compile the module (Andrew). OMFS supporters note that the code is written well and can serve as an example for other filesystem authors. They also note that code with small user bases is often merged - that, in fact, in some areas, developers have said they want all code, regardless of how few people are using it. Running OMFS through FUSE, they say, would be harder for users to set up and less efficient in operation. Says Christoph: Moving a simple block based filesystem means it's more complicated, less efficient because of the additional context switches and harder to use because you need additional userspace packages and need to setup fuse. We made writing block based filesystems trivial in the kernel to grow more support for filesystems like this one. In this case, it looks like Andrew will back down on this one and let the next version of the OMFS patches into -mm. From there, if all goes well, it could make the jump into the mainline, possibly as early as 2.6.27. 
But Andrew is clearly unhappy about that outcome, and may well raise the question again in the future: is "well written" really sufficient to justify merging new filesystems into the kernel? Turnitin and fair use The McLean, Va. High School students whose copyright infringement lawsuit against iParadigms, LLC and its Turnitin plagiarism-detection software system was dismissed on summary judgment on March 11 have filed a notice of appeal [PDF] to the Fourth Circuit Court of Appeals. That was likely a surprise to iParadigms, whose CEO John Barrie confidently predicted that hell would freeze over before the students would appeal. Yet, appeal they have. So this story isn't over yet. District Court Judge Claude Hilton's Opinion [PDF] ruled that Turnitin's use was highly transformative and hence fair use; that is one of the issues that will be appealed, as Robert Vanderhye, the attorney representing the students pro bono, explained to me in an email interview: What the judge held, and what we are appealing, are (1) if a minor clicks on to the Turnitin.com website he/she is bound by the conditions of the "Agreement" even if it denies the student the ability to enforce his/her copyright, and (2) as a matter of law the Turnitin use is transformative so that it is fair use instead of copyright infringement. With respect to the first, we submit that the Court misinterpreted Virginia law, and did not apply the controlling Virginia cases that we cited. With respect to the second there clearly are facts in dispute. 
Among the facts in dispute are a) does the Turnitin system work to deter plagiarism, or does it actually encourage plagiarism since it is so easily avoided by anyone who really wants to plagiarize; b) is the Turnitin system so insecure that students papers can easily be recovered by a hacker so as to easily allow theft of the students' works, or for a criminal to use information contained in student works against them; and c) how can the Turnitin use be transformative when they will send a student's work verbatim to someone outside the student's school system without the student's permission, or even knowledge. Also, with respect to the second point, Turnitin violates the FERPA since student names, schools, and personal information are usually on the student works; since it violates FERPA as a matter of law the Turnitin system is against the public interest, and therefore there can be no fair use. He mentions that there are facts in dispute because a court is only supposed to grant summary judgment if the pleadings and supporting documents, when viewed in the light most favorable to the non-moving party, show that there is no genuine issue as to any material fact. Fed. R. Civ. P. 56(c). The major issues being appealed then are: Was it error to dismiss this lawsuit on summary judgment? Can minors lose copyright rights, because of clicking "I agree" to an agreement that their schools compelled them to agree to? What about the privacy issues under the Family Educational Rights and Privacy Act (FERPA)? But the key question is, Is this fair use? iParadigms' point of view, one that the lower court agreed with, is that a lot of high schools and universities use this software and rely on it. They find plagiarism goes down significantly. Turnitin isn't using the creative parts of the papers for commercial gain, the judge said; it's a system of integrity checking. And that's a transformative use. 
Similarities between Google Books and Turnitin:

- The computer does the copying, not humans.
- Both archive complete copies of the works.
- Neither gets the works directly from the copyright holder.
- Both claim the use is transformative.

Differences:

- The students are minors.
- There are arguably privacy issues with Turnitin.
- The student papers are unpublished works.
- The conceivable market harm is distinguishable.
- There is no way students can opt out. Any author can opt out of Google Books.

Turnitin represents itself as a system for protecting copyrights. For that matter, so is Google Books, in that it's a kind of digital card catalogue, letting us know where to find books with information we want.

In Perfect 10, Inc. v. Google, Inc. (the thumbnail photo case, hence another works-in-a-computer-database fact pattern) the court found that, too, was transformative and hence fair use. Judge Hilton notes this finding in his order on page 13. The photos had one purpose originally, the court found, but putting them into a database was something not originally intended, and the search engine "provides a social benefit by incorporating an original work into a new work, namely, an electronic reference tool." The purpose is limited and the works are used only for comparative purposes that provide a social benefit. He does mention the exception to that, however, in that if there is a request to see the work a student's paper allegedly seems to have plagiarized, a teacher can obtain that work to evaluate. Hence the appeal over archiving by students who don't want their works used that way.

If the students have issues about having to use the system, they should take it up with the schools, the judge ruled, because that is who is giving Turnitin authority to do what they are doing with these student papers, and he thought the schools had the right.
As for fair use, Judge Hilton found that this was a transformative use, and he quoted a definition of transformative from a case, Harper & Row Publishers, Inc. v. Nation Enterprises, to mean that it "adds something new, with a further purpose or different character". If use is transformative, he wrote, it's "strong evidence" that the use is fair use. iParadigms has on its website a legal opinion [PDF] it commissioned from Foley & Lardner. Fair use is a bit hard to pin down. Even the legal opinion notes that fair use is very much dependent on the facts of each situation: Determining whether a copyright exists in a particular work or is infringed by a particular use of the work is difficult. The analysis is so fact-specific that relatively minor variations between the facts of superficially similar cases often lead to diametrically different conclusions. To grasp the students' point of view, imagine if a company decided to offer a service to check for infringed code, so it collected all the world's proprietary software it could get its hands on, without permission from the original authors. Say it got copies from the world's libraries. And there was no way to opt out. Now, imagine that if the software thought it found a match, you could request to see the proprietary code that it was thought to infringe. Do you think the proprietary software companies or the authors of that code would view that as a transformative fair use? The crux of the students' issue, then, is the archiving. They don't want their papers to remain in the system, even if they must submit them for originality review. It bothers them that iParadigms archives the students' manuscripts and then uses them for profit, while they, the students, lose control over their own work without getting any compensation. The students have their own website, Don'tTurnItIn.com, and they have some additional court filings available there. 
A lot of commentary so far has cited Judge Hilton's ruling, because of its fair use arguments, viewing the opinion as perhaps being helpful to Google in the litigation brought against it by the Authors Guild and others regarding Google Books, and I'm sure you can see why. But there are significant differences too.

Some have argued that copyright law is out of date in a digital world, the Internet being nothing but one huge copying machine. Computers copy, and so some suggest it would be more logical and less damaging to penalize wrongful distribution, not copying. In that sense, the judge's ruling was quite progressive. Indeed, it's hard to read his opinion without concluding that to Judge Hilton, copying by a computer isn't a problem, so long as human eyes are not involved, the use is transformative, and there is no distribution for profit or any market harm.

In iParadigms' Counterclaims [PDF], there were several other causes of action, trying to mold the facts into a claim of "trespass to chattels" and even claims of violations of the Computer Fraud and Abuse Act, as well as Virginia's Computer Crimes Act. Those are serious allegations. On the first, the assertion was that the plaintiffs allegedly used nyms like 'Rube Goldberg' and 'Perpetual Motion' to improperly file papers in the Turnitin system without authorization. The court dismissed those counterclaims, pointing out that you have to prove actual damages and, in the case of trespass to chattels, some impairment of quality or condition or use. It's a bit hard to come up with a dollar figure for how harmed one is by someone's use of a nym. As for filing the papers without authority, where's the financial harm, the court asked.

Trespass to chattels in meat space is like someone taking your car for a joy ride, getting into a fender bender, and then bringing the car back without fixing the fender or even filling the gas tank back up.
Not only is the car damaged, but you didn't have use of it while it was out being driven around, and so you couldn't drive it to the airport yourself as you intended and missed your job interview. And it's your car, your personal property, which is what chattel means. Like many other legal concepts, it has been applied to the digital world, as if physical property and intellectual property are identical, and in some ways, it fits. AOL was an early trailblazer in using trespass to chattels successfully against spammers, arguing that the sheer volume of emails interfered with their being able to use their own system as intended to service their real customers properly (here's one example). iParadigms also claimed that the terms of their Usage Policy provided for indemnification to iParadigms arising out of any use of the Turnitin website. It also has a user agreement that you are confronted with and must click "I Agree" to in order to submit papers to Turnitin. The judge made a distinction between the user agreement and the Usage Policy, however, noting that there was no "I Agree" to the Usage Policy or any evidence that the students saw it, and it was not referenced or incorporated into the user agreement. So he decided that while the students were bound by what they said "I Agree" to, they never agreed to the Usage Policy. But the appeal asks whether these minors ever gave a legally binding assent, since their "I Agree" was really "My School Says I Have to Agree". In some respects, this EULA issue may be as interesting to track as the fair use questions.

The Cairo Project reaches a new milestone

The cairo project is producing a cross-platform universal vector graphics library: Cairo is a 2D graphics library with support for multiple output devices. Currently supported output targets include the X Window System, Win32, image buffers, PostScript, PDF, and SVG file output. Experimental backends include OpenGL (through glitz), Quartz, and XCB.
Cairo is designed to produce consistent output on all output media while taking advantage of display hardware acceleration when available (eg. through the X Render Extension). Cairo is used by the GNOME desktop environment and some KDE applications. The Wikipedia article on cairo has more background information on the project. LWN investigated cairo back in August, 2005 at the time of the 0.9.0 release. Progress on cairo has been steady since then, with releases coming out frequently. Major version 1.6.0 of cairo was recently announced: This is a major update to cairo, with new features and enhanced functionality which maintains compatibility for applications written using cairo 1.4, 1.2, or 1.0. We recommend that anybody using a previous version of cairo upgrade to cairo 1.6.0. A list of the major changes in cairo 1.6.X includes: PDF generation has been greatly improved; the number of rasterized image fallbacks has been greatly reduced. The PostScript and PDF output code have had a number of efficiency and portability improvements. The pixman library has been split out so that it can be shared by cairo and the X server. Cairo 1.6.X now supports arbitrary X TrueColor and 8-bit PseudoColor visuals. The Mac OS X Quartz backend is now an official part of cairo and the API has been stabilized. A new win32 printing backend has been added. There have been a number of minor API additions to cairo. Numerous "robustness fixes" have been added. Other enhancements and bug fixes have been added. As is typical with major releases, several bug fix releases quickly followed. The first was version 1.6.2 which addressed a problem with certain PostScript printers. That was followed by version 1.6.4: "The cairo community is wildly embarrassed to announce the 1.6.4 release of the cairo graphics library. This release reverts the xlib locking change introduced in 1.6.[2], (and the application crashes that it caused)."
Hopefully the code will now stabilize and be adopted by the upstream applications. Congratulations go out to Carl Worth and the other cairo developers for this major release and their continued work on this important project.

ELC: Trends in embedded Linux

Henry Kingman, editor of LinuxDevices, opened the Embedded Linux Conference with a look at the trends in embedded development since he started covering the subject in 1999. Based largely on the annual surveys run by LinuxDevices, his keynote speech highlighted the growth of Linux as an embedded operating system as well as where it is headed in the next few years. The conference, which started April 15 in Mountain View, California, gathers around 175 embedded developers for three days of talks on a wide variety of embedded topics. Sponsored by the Consumer Electronics Linux Forum (CELF), the conference has become the premier technical conference for the ever-growing embedded Linux community. Each day has a keynote, with kernel hacker Andrew Morton and CELF architecture group chair (and conference organizer) Tim Bird rounding those out, followed by a half-dozen presentation slots, with three parallel presentations. Bird introduced Kingman as one of the main providers of news about embedded Linux, relating that LinuxDevices and LWN.net are his "two main sources of information" about the community. Bird marveled at the body of work that Kingman has amassed: "this guy is prolific". He also reminisced a bit about the early days of embedded Linux, from his days at Lineo to his current work at Sony: it was hard to get people to pay attention to Linux; now Sony is putting Linux into almost everything. Kingman acknowledged Bird's introduction, but said that he didn't know "if that makes me an expert in the forest, or lost in the trees". He looked back to a 1999 San Francisco Bay Linux Users Group meeting with Linus Torvalds as the featured speaker.
Kingman said that Torvalds wanted Linux to be a desktop operating system but that he saw the embedded space as the big growth area. Later that year, Kingman attended the first LinuxWorld conference where he saw some folks from Transmeta talking about squashfs and cramfs. An article he wrote about those filesystems was published by Rick Lehrbaum, founder of LinuxDevices. That was the first of more than 3000 articles Kingman has since written for LinuxDevices. Kingman then presented the results of the most recent LinuxDevices reader survey. The survey gathers information about what LinuxDevices readers are doing or planning with regard to embedded Linux development. It has been run for eight years, providing some interesting information on changes in the readers' attitudes over the years. Usage of Linux in embedded development projects crossed a threshold this year, with more than 50% of the 812 respondents saying that they are currently using it. Usage of Linux has been growing year over year, but didn't cross the halfway mark until 2008. More than 61% believed their company would be using Linux within the next two years. The ARM family of processors has continued its growth with 30% of the readers using it, while 25% are using x86 variants. ARM overtook x86 three years ago; that trend looks to be continuing with respondents seeing 31% ARM versus 23% x86 over the next two years. Kingman said that he thinks Intel is trying to reverse that trend because spending on consumer devices is predicted to "outstrip IT spending". There were a couple of questions asking where respondents obtain the version of Linux they use in their products. Ubuntu has a somewhat surprising share at 8%. For a relatively new distribution that is not specifically targeted at that market, it stands out, as does its predicted growth to 10% over the next two years. Kernel.org at 16% and Debian at 14% are the leading sources, with uClinux tied with Ubuntu at 8%, and MontaVista and Fedora at 6% each.
Unsurprisingly, per-unit royalties were not popular, with two-thirds of respondents unwilling to pay them, but 60% were willing to pay for development and support of embedded Linux, so it is not just the free-beer aspect that is drawing companies to Linux. Most (45%) get their sources as a free download from a community site like kernel.org or handhelds.org, with 18% getting them bundled with their hardware. Only 11% said that cost was the greatest influence on their choice. Legal threats are still on the minds of some, with copyright or patent concerns being considered a significant threat by roughly half of the respondents. SCO has fallen off the radar, with only 2.5% thinking that it is still a threat. "None of the above" was the big winner, presumably meaning that there are no significant threats, at 40%. Kingman finished with a request of the embedded community to let him know what things should be covered in more depth and any additional areas they wish to see covered. He is looking for input on what the community wants to talk about: "we want to be your website."

GCC and pointer overflows

On April 4, CERT put out a scary advisory about the GNU Compiler Collection (GCC). This advisory raises some interesting issues on when such advisories are appropriate, what programmers must do to write secure code, and whether compilers should perform optimizations which could open up security holes in poorly-written code. In summary, the advisory states: Some versions of gcc may silently discard certain checks for overflow. Applications compiled with these versions of gcc may be vulnerable to buffer overflows. [...] Application developers and vendors of large codebases that cannot be audited for use of the defective length checks are urged to avoiding [sic] the use of gcc versions 4.2 and later. This advisory has disappointed a number of GCC developers, who feel that their project has been singled out in an unfair way.
But the core issue is one that C programmers should be aware of, so a closer look is called for. To understand this issue, consider the following code fragment: Here, the programmer is trying to ensure that len (which might come from an untrusted source) fits within the range of buffer. There is a problem, though, in that if len is very large, the addition could cause an overflow, yielding a pointer value which is less than buffer. So a more diligent programmer might check for that case by changing the code to read: This code should catch all cases, ensuring that len is within range. There is only one little problem: recent versions of GCC will optimize out the second test (returning the if statement to the first form shown above), making overflows possible again. So any code which relies upon this kind of test may, in fact, become vulnerable to a buffer overflow attack. This behavior is allowed by the C standard, which states that, in a correct program, pointer addition will not yield a pointer value outside of the same object. So the compiler can assume that the test for overflow is always false and may thus be eliminated from the expression. It turns out that GCC is not alone in taking advantage of this fact: some research by GCC developers turned up other compilers (including PathScale, xlC, LLVM, TI Code Composer Studio, and Microsoft Visual C++ 2005) which perform the same optimization. So it seems that the GCC developers have a legitimate reason to be upset: CERT would appear to be telling people to avoid their compiler in favor of others - which do exactly the same thing. The right solution to the problem, of course, is to write code which complies with the C standard. In this case, rather than doing pointer comparisons, the programmer should simply write something like: There can be no doubt, though, that incorrectly-written code exists.
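The code fragments discussed above do not appear in this text, so what follows is only a minimal sketch of the two styles of check, not the article's own code. The names BUFLEN, len_in_range_fragile(), len_in_range() and len_in_range_demo() are hypothetical, chosen purely for illustration. The first function uses the pointer-wraparound test which a compiler is allowed to discard; the second compares the length as a plain integer, so no out-of-range pointer is ever formed.

```c
#include <assert.h>
#include <stddef.h>

#define BUFLEN 128 /* hypothetical buffer size, chosen for illustration */

/*
 * Fragile form: the wraparound test depends on pointer arithmetic that
 * leaves the object, which the C standard leaves undefined. A compiler
 * may therefore assume "buffer + len < buffer" is never true and drop it.
 * Returns 1 if len is judged to be in range, 0 otherwise.
 */
int len_in_range_fragile(const char *buffer, size_t len)
{
    const char *buffer_end = buffer + BUFLEN;

    if (buffer + len >= buffer_end || buffer + len < buffer)
        return 0;
    return 1;
}

/*
 * Standard-conforming form: compare the length against the buffer size
 * as integers; nothing here can overflow or be optimized away.
 */
int len_in_range(size_t len)
{
    return len < BUFLEN;
}

/* Usage example: boundary cases for the integer comparison. */
void len_in_range_demo(void)
{
    assert(len_in_range(0));
    assert(len_in_range(BUFLEN - 1));
    assert(!len_in_range(BUFLEN));
    assert(!len_in_range((size_t)-1)); /* a huge, attacker-style value */
}
```

The integer comparison is also shorter than the pointer version, which is part of why it is the preferred form: the check says what the programmer means, and the compiler has no license to remove it.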
So the addition of this optimization to GCC 4.2 may cause that bad code to open up a vulnerability which was not there before. Given that, one might question whether the optimization is worth it. In response to a statement (from CERT) that, in the interest of security, overflow tests should not be optimized away, Florian Weimer said: I don't think this is reasonable. If you use GCC and its C frontend, you want performance, not security. After all, the real issue is not the missing comparison instruction, but the fact that this might lead to subsequent unwanted code execution. There are C implementations that run more or less unmodified C code in an environment which can detect such misuse, but they come at a performance cost few are willing to pay. Joe Buck added: Furthermore, there are a number of competitors to GCC. These competitors do not advertise better security than GCC. Instead they claim better performance (though such claims should be taken with a grain of salt). To achieve high performance, it is necessary to take advantage of all of the opportunities for optimization that the C language standard permits. It is clear that the GCC developers see their incentives as strongly pushing toward more aggressive optimization. That kind of optimization often must assume that programs are written correctly; otherwise the compiler is unable to remove code which, in a correctly-written (standard-compliant) program, is unnecessary. So the removal of pointer overflow checks seems unlikely to go away, though it appears that some new warnings will be added to alert programmers to potentially buggy code. The compiler may not stop programmers from shooting themselves in the foot, but it can often warn them that it is about to happen.

An LWN.net Distribution List update

It's that time of year again -- the time when we look at how the LWN Distributions List has changed over the past year. Last year's update can be found here.
At that time the list had 485 "active" distributions, with an additional 58 listings in the Historical section. This year the list has grown to 491 active distributions, while the Historical listing has shrunk to 56. We define a historical distribution as one that is no longer under development, but we leave them on the list as long as there is still code to be found. As always, it can be a challenge separating the slow-paced distributions from the historical ones. There are, inevitably, some projects that are still in the active part of the list that have not been developed in years. Occasionally historical projects come out with new releases. Distributions will be removed from the list if their website times out repeatedly over a period of time, but that's not the end of it. Entries are moved to an internal list, where they are rechecked a few more times. Sometimes projects come back and are re-added to the list. In the last year every link on the list has been checked at least once. Almost half the list has been checked again. In addition to regular link checking, new distributions are added and existing entries are updated with new releases and other information. We do our best to keep the list up-to-date. That said, if you know of distributions that should be added, or removed, or changed in any way, just let us know. Now it's time to say goodbye to the distributions that have been removed in the last year, in no particular order: Brutalware, Progeny Componentized Linux, herbix, BeatrIX Linux, Deep-Water/Linux, distccKNOPPIX, LinuxDefender Live!, LNX-BBC, Mandows, Mediainlinux, RunOnCD, RxLinux, LinuxInstall.org, Turkix, XoL, Aleph ARMlinux, UltraLinux, epiOS, APAWS Linux with Gallery, Linux for Windows 9X, Phat Linux, GNU/Linux TerminalServer for Schools, BSLinux, CAEN Linux, FlightLinux, Laonux, LibraNet GNU/Linux, Linux in a Pillbox (LIAP), Mastodon, Phlak, PHP Solutions Live, Sentinix, slimlinux, Snootix, Tunix, uOS, Icepack Linux and Think BlueLinux.
ELC: Morton and Saxena on working with the kernel community

In many ways, Andrew Morton's keynote set the tone for this year's Embedded Linux Conference (ELC) by describing the ways that embedded companies and developers can work with the kernel community in a way that will be "mutually beneficial". Morton provided reasons, from a purely economic standpoint, why it makes sense for companies to get their code into the mainline kernel. He also provided concrete suggestions on how to make that happen. The theme of the conference seemed to be "working with the community" and Morton's speech provided an excellent example of how and why to do just that. Conference organizer Tim Bird introduced the keynote as "the main event" for ELC, noting that he often thought of Morton as "kind of like the adult in the room" on linux-kernel. Readers of that mailing list tend to get the impression that there's more than one of him around because of all that he does. He also noted that it was surprising to some that Morton has an embedded Linux background—from his work at Digeo. Morton believes that embedded development is underrepresented in kernel.org work relative to its economic importance. This is caused by a number of factors, not least the financial constraints under which much embedded development is done. An exceptional case is the chip and board manufacturers who have a big interest in seeing Linux run well on their hardware so that they can attract more customers. But even those do not contribute as much as he would like to see to kernel development. An effect of this underrepresentation is a risk that it will tilt kernel development more toward the server and desktop. The kernel team is already accused of being server-centric, and there is some truth to that, "but not as much as one might think". Kernel hackers do care about the desktop as well as embedded devices, but without an advocate for embedded concerns, sometimes things get missed.
Something Morton would like to see is a single full-time "embedded maintainer". That person would serve as the advocate for embedded concerns, ensuring that they didn't get overlooked in the process. An embedded maintainer could make a significant impact for embedded development. Not all kernel contributions need to be code, he said. There is a need just to hear the problems that are being faced by the embedded community along with lists of things that are missing. "Senior, sophisticated people" are needed to help prioritize the features that are being considered as well. Morton often finds out things he didn't know at conferences, things that he should have known about much earlier: "That's bad!" Morton is trying to incite the embedded community to interact with the kernel hackers more on linux-kernel. He said that a great way to get the attention of the team is to come onto the mailing list and make them look bad. Unfavorable comparisons to other systems or earlier kernels, for example, especially when backed up with numbers, are noticed quickly. He said that it is important to remember that the person who makes the most noise gets the most attention. One of the areas that he is most concerned about is the practice of "patch hoarding"—holding on to kernel changes as patches without submitting them upstream to the kernel hackers. It is hopefully only due to a lack of resources, but he has heard that some are doing it to try and gain a competitive advantage. This is simply wrong, he said; companies have a "moral if not legal obligation" to submit those patches. There are many good reasons for getting code merged upstream that Morton outlined.
The code will be better because of the review done by the kernel hackers; once it is done, the maintenance cost falls to near zero as well. He also touted the competitive advantage, noting that getting your code merged means that you have won—competing proposals won't get in. Being the first to merge a feature can make it easier on yourself and harder on your competition. There are downsides to getting your code upstream as well. Most of those stem from not getting code out there early enough for review. The kernel developers can ask for significant changes to the code, especially in the area of user space interfaces. If a company already has lots of code using the new feature and/or interface, it could be very disruptive; "sorry, there's no real fix for that except getting your code out early enough". Another downside that companies may run into is with competitors being brought into the process. Morton and other kernel hackers will try to find others who might have a stake in a new feature to get them involved so that everybody's needs are taken into account. This can blunt the "win" of getting your feature merged. Some are also concerned that competitors will get access to the code once it has been submitted; "tough luck" Morton said, everything in the kernel is GPL. Morton had specific suggestions for choosing a kernel version to use for an embedded project. 2.6.24 is not a lot better than 2.4.18 for embedded use, but it has one important feature: the kernel team will be interested in bugs in the current kernel. He suggests starting with the current kernel, upgrading it while development proceeds, freezing it only when it is time to ship the product. He also suggests that a company create an internal kernel team with one or two people who are the interface to linux-kernel. This will help with name recognition on the mailing list, which will in turn get submitted patches more attention.
Over time, by participating and reviewing others' code, the interface people will build up "brownie points" that will allow them to call in favors to get their code reviewed, or to help smooth the path for inclusion. The kernel.org developers appear to give free support, generally very good support, Morton said, but it is not truly free. Kernel hackers do it as a "mutually beneficial transaction"; they don't do it to make more money for your company, they do it to make the kernel better. Morton is definitely a big part of that, inviting people to email him, especially if "five minutes of my time can save months of yours". The decision about when to merge a new feature is hard for some to understand. Many consider Linux a dictatorship, which is incorrect; it is instead "a democracy that doesn't vote". The merge decision is made on the model of the "rule of law" with kernel hackers playing the role of judges. Unfortunately, there are few written rules. Some of the factors that go into his decision about a particular feature are its maintainability, whether there will be an ongoing maintenance team, as well as the general usefulness of the feature. Depending on the size of the feature, an ongoing maintenance team can be the deciding factor. It is not so important for a driver, but a new architecture, for example, needs ongoing maintenance that can only be done by people with knowledge of and access to the hardware. MontaVista kernel hacker Deepak Saxena gave a presentation entitled "Appropriate Community Practices: Social and Technical Advice" later in the conference that mirrored many of Morton's points. He showed some examples of hardware vendors making bad decisions that got shot down by the kernel developers, mostly because they didn't "release early and release often". There is a dangerous attitude that "it's Linux, it's open source, I can do anything I want" which is true, but won't get you far with the community.
Saxena has high regard for the benefits of working with the system: if your competitor is active in the community, they are getting an advantage that you aren't. Like Morton, he believes that some members of the development team need to get involved in kernel.org activities. "The community is an extension of your team, your team is an extension of the community." He also has specific advice for hardware vendors: avoid abstraction layers, recognize that your hardware is not unique, and think beyond the reference board implementation. Generalizing your code so that others can use it will make it much more acceptable, as will talking with the developers responsible for the subsystems you are touching. Abstraction layers may be helpful for hardware vendors trying to support multiple operating systems, but they make it difficult for the kernel hackers to understand and maintain the code. The kernel.org folks are not interested in finding and fixing bugs in an abstraction layer. He also points out additional benefits of getting code merged. Once it is in the kernel, the company's team will no longer have to keep up with kernel releases, updating their patches to follow the latest changes. The code will still need to be maintained, but day-to-day changes will be handled by the kernel.org folks. An additional benefit is that the code will be enhanced by various efforts to automatically find bugs in mainline kernel code with tools like lockdep. It is clear that the kernel hackers are making a big effort to not only get code from the embedded folks, but also some of their expertise. There are various outreach efforts to try and get more people involved in the Linux development process; these two talks are certainly a part of that. By making it clear that there are benefits to both parties, they hope to make an argument that will reach up from engineering to management resulting in a better kernel for all. 
The 2.6.26 merge window opens

That shiny new 2.6.25 kernel which was released on April 16 is now ancient history; some 3500 changesets have been merged into the mainline git repository since then. Some of the most significant user-visible changes include: New drivers for Korina (IDT rc32434) Ethernet MACs, SuperH MX-G and SH-MobileR2 CPUs, Solution Engine SH7721 boards, ARM YL9200, Kwikbyte KB9260, Olimex SAM9-L9260, and emQbit ECB_AT91 boards, Digi ns921x processors, the Nias Digital SMX crypto engines, AMCC PPC460EX evaluation boards, Emerson KSI8560 boards, Wind River SBC8641D boards, Logitech Rumblepad 2 force-feedback devices, Renesas SH7760 I2C controllers, and SuperH Mobile I2C controllers. The PCI subsystem now supports PCI Express Active State Power Management, which can yield significant power savings on suitably equipped hardware. There is a new security= boot parameter which allows the specification of which security module to use if more than one is available. Network address translation (NAT) is now supported for the SCTP, DCCP, and UDP-Lite protocols. There is also netfilter connection tracking support for DCCP. The network stack can now negotiate selective acknowledgments and window scaling even when syncookies are in use. Another long series of network namespace patches has been merged, continuing the long process of making all networking code namespace-aware. Mesh networking support has been added to the mac80211 layer. It is currently marked "broken," though, until various outstanding issues are fixed. 4K stacks are now the default for the x86 architecture. This change is controversial and could be reversed by the time the final release happens. SELinux now supports "permissive types" which allow specific domains to run as if SELinux were not present in the system at all.
A number of enhancements have been made to the realtime group scheduler, including multi-level groups, the ability to mix processes and groups (and have them compete against each other for CPU time), better SMP balancing, and more. Support for the running of SunOS and Solaris binaries has been removed; it has long been unmaintained and did not work well. The kernel now has support for read-only bind mounts, which provide a read-only view into an otherwise writable filesystem. This feature (the implementation of which was more involved than one might think) is intended for use in containers and other situations where even processes running as root should not be able to modify certain filesystems. Changes visible to kernel developers include: At long last, support for the KGDB interactive debugger has been added to the x86 architecture. There is a DocBook document in the Documentation directory which provides an overview on how to use this new facility. Page attribute table (PAT) support is also (again, at long last) available for the x86 architecture. PATs allow for fine-grained control of memory caching behavior with more flexibility than the older MTRR feature. See Documentation/x86/pat.txt for more information. Two new functions (inode_getsecid() and ipc_getsecid()), added to support security modules and the audit code, provide general access to security IDs associated with inodes and IPC objects. A number of superblock-related LSM callbacks now take a struct path pointer instead of struct nameidata. There is also a new set of hooks providing generic audit support in the security module framework. The now-unused ieee80211 software MAC layer has been removed; all of the drivers which needed it have been converted to mac80211. Also removed are the sk98lin network driver (in favor of skge) and bcm43xx (replaced by b43 and b43legacy). The generic semaphores patch has been merged. The semaphore code also has new down_killable() and down_timeout() functions. 
The ata_port_operations structure used by libata drivers now supports a simple sort of operation inheritance, making it easier to write drivers which are "almost like" existing code, but with small differences. A new function (ns_to_ktime()) converts a time value in nanoseconds to ktime_t. The final users of struct class_device have been converted to use struct device instead. If all goes well, the class_device structure will be removed later in the 2.6.26 cycle. Greg Kroah-Hartman is no longer the PCI subsystem maintainer, having passed that responsibility on to Jesse Barnes. The seq_file code now accepts a return value of SEQ_SKIP from the show() callback; that value causes any accumulated output from that call to be discarded. Needless to say, this development series is still young and, as of this writing, the merge window has over a week to run. So there will be a lot more code going into the mainline before the shape of 2.6.26 becomes clear.

The Grumpy Editor encounters the Hardy Heron

Your editor is not always known for making life easy for himself. Perhaps one of the clearest examples of masochistic behavior would be a certain preference for running development distributions on mission-critical systems. That said, your editor has stuck with a stable distribution on his laptop through a round of intensive travel earlier this year. But that was too easy, so, shortly before heading off to the Linux Foundation's Collaboration Summit, the laptop got moved to the Ubuntu "Hardy Heron" distribution. Needless to say, there have been some interesting ups and downs (literally) since then. There is always a certain thrill that comes with upgrading a system and finding that important features no longer work. In this case, the problem was suspend and resume, which your editor uses heavily. In fact, the system would suspend just fine - as long as one failed to notice that, behind the cleverly darkened screen, the laptop's backlight had been left on.
Needless to say, this new behavior is not helpful if one's goal is to save power while the system is suspended, but it gets worse than that. Your editor discovered this nice surprise after carrying the computer in a backpack for a few hours; by the time it came out, it was almost too hot to hold. Happily, no permanent damage appears to have been done. Or, perhaps, unhappily. Your editor has been looking for an excuse to get a new laptop for a while. The problem turned out to be a HAL configuration error combined with a strange internal model number which makes your editor's Thinkpad X31 different from, seemingly, every other X31 on the planet. Once your editor found the bug report and attached a "me too" comment, the solution was quick in coming. On the net, one can find complaints that Ubuntu is unresponsive to bug reports, but that was certainly not the experience here. As an aside, it seems worth noting that life seems to have gotten more complicated, with a lot more code wrapped around the kernel than there once was. The problematic configuration file was /usr/share/hal/fdi/information/10freedesktop/20-video-quirk-pm-ibm.fdi - not a place where your editor, who is not a HAL expert, would have thought to look. That, it seems, is the price of more capable hardware and software, but sometimes your editor pines for the days when it seemed possible to carry a full understanding of the system within a single brain. GNOME developers are (perhaps unjustly in recent years) known for taking a minimal approach to configuration options. That can be irritating, but just as annoying is their tendency to reset the options they do provide over major updates. Once suspend and resume work, your editor demands something else of a laptop when traveling: absolute silence. So the return of beeps to gnome-terminal was not appreciated. 
Those were easily silenced, but the GNOME developers also saw fit to bring back the blinking cursor - and they took away the configuration option which abolishes that intolerable feature. Your editor first ran into the unstoppable blink with Rawhide; a query to the developers there turned up a quick answer. It seems that the GNOME developers have decided to create a single, system-wide parameter to control blinking cursors. Now, your editor approves of the concept of being able to turn off that behavior everywhere with a single switch - but only as long as that switch isn't hidden where nobody will ever find it. In this case, the GNOME developers have taken this feature, wrapped it in old newspapers, and stashed it behind the furnace in the basement; then they put a trunk on top of it. It is a rare user who will find it unassisted. In the hopes that it may save one or two readers from some time spent with a search engine, your editor will now divulge the top-secret incantation which turns blinking cursors off: Naturally, a terminal window is required to run this command. It would have been nice if the developers who packaged this code for Hardy Heron had found a way to smooth over this change, but no such luck; as far as your editor can tell, no distributor has made that effort. Another bit of fun is that your editor is no longer able to set the desktop background; the relevant configuration windows are ineffective. In this case, it would appear that the task of implementing the user's background choices has been moved to nautilus - just the place your editor would have thought to look for it. As it happens, your editor has no use for file managers and does not run nautilus - and is punished with an immutable Ubuntu-brown background for that sin. Happily, your editor still knows how to run xsetroot. All of the above is a set of relatively minor grumbles, all of which are rectified in relatively short order.
Once those details have been taken care of, the Hardy Heron release works quite well. One of the biggest aggravations from previous upgrades - having OpenOffice.org reformat the slides in all of your editor's presentations - was not present this time around. Hopefully we are moving into an era where "it didn't mangle my documents" is not something considered worthy of mention. There was one very nice surprise as well. Your editor's laptop previously required almost 12 watts of power when running unplugged. This laptop is not at the bleeding edge of current technology, so the amount of time it was able to run without a recharge has been dropping for a while. With the Hardy release, steady-state power consumption has dropped to just over 9 watts - a big improvement. The credit for this change belongs to developers at all levels: kernel, applications, distributors, etc. The end result is a system which runs much more efficiently, and that is a good thing. All told, your editor is reasonably content; this distribution looks like one which might just be worth keeping around. That's a good thing, since Ubuntu plans to maintain it as a "long-term support" release. Not that your editor intends to make much use of that long-term support; there should be a new development series starting soon, after all. One of the nice things about development distributions is that support never ends as long as one stays on the treadmill and the project itself remains alive. ELC: A taste of the conference Technical conferences generally provide a wealth of choices, to the point where participants have to make tough decisions at times to pick the session to sit in on. This year's Embedded Linux Conference was no exception; there were multiple slots where the author had to wish that he could be in more than one place at a time. But, he did manage to take notes in some of those that he attended; hopefully some of the conference flavor can come through in the following report. 
Power management MontaVista's Kevin Hilman presented an approach for handling power management on embedded devices that focused on changes that can be made to the kernel, but noted that there is much that can be done by applications too. Because of the time and money budgets available for embedded projects, many do not have the resources to do a complete job of tuning the kernel to get the best possible power performance. There is also no "one size fits all" solution for power management; there are too many device-specific issues to allow that. Hilman's approach is to target specific "building blocks" that embedded developers can incorporate into their project. Each block will provide some savings, so the project can stop when the desired performance is reached, or when it is time to ship the device. One of the easier steps is to customize the idle loop in the kernel, putting the processor to sleep when there is no work to be done. There are different kinds of sleep, though, generally trading off power savings and wakeup latency. The cpuidle subsystem provides a means to specify those values in an architecture-independent way, which, along with a platform-independent "governor", can put the processor into various sleep modes. The only platform-dependent pieces are the hooks to enter each of the different sleep states. A similar approach is taken by the CPUfreq subsystem, which can reduce the clock frequency of the CPU to reduce power consumption using the Dynamic Voltage and Frequency Scaling (DVFS) feature of some processors. "Operating points" (OPs), which are voltage and frequency tuples, are defined for the hardware. There are various generic CPUfreq governors that can then be used to determine when to change OPs and which to change to. The governor will invoke a platform-specific driver to effect that change.
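The split between a generic governor and a platform-specific driver can be sketched in a few lines of Python. This is only an illustration of the division of labor described above, not the algorithm of any real CPUfreq governor: the OperatingPoint type, the OPS table, the headroom parameter, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One voltage/frequency tuple (hypothetical representation)."""
    freq_mhz: int
    millivolts: int


# Hypothetical table; real values would come from the hardware documentation.
OPS = [
    OperatingPoint(200, 900),
    OperatingPoint(400, 1000),
    OperatingPoint(800, 1200),
]


def pick_operating_point(load, ops=OPS, headroom=0.8):
    """Generic 'governor': choose the slowest OP that can carry the load.

    load is the fraction (0.0 to 1.0) of the fastest OP's capacity in use.
    An OP qualifies if the load fits within `headroom` of its frequency.
    """
    fastest = max(op.freq_mhz for op in ops)
    needed_mhz = load * fastest
    for op in sorted(ops, key=lambda op: op.freq_mhz):
        if needed_mhz <= headroom * op.freq_mhz:
            return op
    # Nothing has enough headroom: run flat out.
    return max(ops, key=lambda op: op.freq_mhz)


def set_operating_point(op):
    """Stand-in for the platform-specific driver that applies the change."""
    return f"{op.freq_mhz} MHz @ {op.millivolts} mV"
```

Only set_operating_point() would need to know anything about the hardware; the selection policy can be swapped without touching it, which is the point of the building-block approach.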
In addition, power management "quality of service" is currently being discussed to allow applications to request a certain level of performance that may override some of the lower-level sleep or frequency decisions. Embedded SELinux SELinux has a well-earned reputation for being able to restrict processes to only use those resources that have been specifically allowed by policy, but it is rather resource intensive. Yuichi Nakamura presented Hitachi's research into bringing SELinux into a more resource constrained embedded environment. One of the first problems they encountered was the need for flash filesystems that support extended attributes (xattrs), which is where SELinux stores labels for files. Only jffs2 currently supports xattrs, so that is the one they used. The next big hurdle was trying to get a set of policies that were stripped down to the needs of an embedded platform. Nakamura started with the SELinux reference policy (refpolicy) and started removing rules. The sheer number of rules and policies that needed to be removed was daunting—as was the need to understand what was being removed. He also ran into strange dependencies: removing a sendmail policy caused a problem in the apache rules. The solution was to create a simplified policy language and policy editor that reduced the problem to something more tractable for the embedded world. In the process it greatly reduced the size of the policy files, from 4.6M down to 60K. Another problem encountered was the performance and size of SELinux, which is a common embedded woe. Through some hand optimization of the read/write path, along with removing some unused permissions checks, they were able to increase the performance by a factor of ten on their SuperH reference platform. By changing some static buffers in SELinux to a dynamic allocation they also saved 250K of runtime memory. Much of that work was merged into 2.6.24. 
There is still work to be done, but with the changes, SELinux is viable for embedded platforms.

GCC and kernel hacking

Two sessions provided various tips and tricks for embedded development, with Gene Sally of Timesys focused on GCC, while IBM's Hugh Blemings shared some of the things he has learned from the kernel hackers he works with. Sally discussed the different ways that developers could get a GCC toolchain for their target processor. One of the bigger hurdles that an embedded developer faces is getting a cross-compiler toolchain, one that runs on his development workstation, but generates code for the target platform. There are several ways to get the toolchain: as a tarball for popular development/target combinations, by using helper tools like crosstool or buildroot from uClibc, or by building it from source directly. Building from source is the most difficult, of course, but allows for the most customizations and flexibility. Sally went on to describe a handful of useful GCC command-line options for helping to debug cross-compilers or just to better understand what GCC is doing:

gcc -### - show what GCC would have executed
gcc -v - show what GCC is executing
gcc -g x.c -o x; objdump -S x - show the C and generated assembly code
gcc -E -dM - </dev/null - show all predefined GCC macros
gcc -C -E - show pre-processor output, but leave comments intact
gcc -M - show all include file dependencies (for use in Makefiles)
gcc -MM - like above, but ignore system include files

Blemings concentrated on the development infrastructure by describing the lab that he used to port the kernel to a Taishan PowerPC-based evaluation board. When undertaking a project like that, "get to know your hardware team" because they will have lots of important information and shortcuts that can be used as part of the board "bringup".
At IBM in Canberra, where Blemings is based, they have gotten to the point where they can bring up Linux on any board where they can "access memory and point the PC [program counter] at it"; his tips have come out of that environment. One of the most important things is to realize that you will be building kernels over and over again, so optimizing your environment for that will save lots of time. His suggestion was to start with a "honkin'" compile box; he described an IBM multi-processor box as an excellent choice but noted that the cost was so high he couldn't get one. It would, however, do "3k/sec"—that's compile 3 kernels per second. In the absence of something like that, he suggested borrowing cycles by using ccache and distcc to reduce and parallelize the compilation that needs to be done. Even adding relatively modest machines into the distcc pool can significantly reduce time spent waiting for a new kernel. Ubuntu mobile and embedded (UME) and Maemo One of the hottest areas in embedded Linux these days is the mobile internet device (MID) market. There were two talks on MID-focused distributions, with Canonical's David Mandala giving an overview of Ubuntu Mobile and Embedded (UME) and Nokia's Kate Alhola talking about the status and future directions of Maemo Mobile Linux. UME is a relatively recent addition to the mobile device space—they are anxiously awaiting hardware to run on—whereas Maemo has been around for a while, powering the Nokia N770, N800, and N810 internet tablets. UME is an effort to apply the Ubuntu distribution and philosophy to touchscreen devices. Mandala explained that they are taking existing Linux applications and adapting them for small screens that use fingers, rather than keyboard and mouse, as the input device. The resolution of the displays is typically something approaching that of low-end desktops, but the physical space they take up is far smaller (i.e. 
the dots per inch or DPI is high) making it difficult to do development without actual hardware. The UME project is working with Intel's Moblin.org project to target Atom processor based systems. It uses the Hildon application framework atop GNOME Mobile, running on an Ubuntu 8.04 (Hardy Heron) distribution. Mandala stressed that Linux should be "invisible" on these devices as users just want applications that work to browse the web, use email, and the like. The main focus of UME has, so far, been on the user interface, though power consumption, memory footprint, and speeding up boot times are all on their radar. Canonical is very interested in fostering a community around UME, but that has been "a bit of a challenge", mostly due to a lack of hardware to run on. Mandala expects a few different hardware devices to be available "soon" and that will make it easier to attract a development community. As should come as no surprise after Nokia's purchase of Trolltech early this year, Alhola announced that Maemo would be supporting both GTK and Qt in the near future. This is part of Nokia's belief that there is "no single truth", so Maemo supports multiple paths to development on the platform. Maemo directly supports C, C++, and Python, while the community has added support for Java, Objective C, Vala, and Mono. Nokia makes a very clear distinction in its product line between phones, which are largely closed platforms, and tablets, which are open. Open source software is an essential part of their strategy as they want to build an application ecosystem around their products. "We are taking open source to the consumer mainstream," Alhola explains. One of the interesting tools that Nokia is working on as part of Maemo is Scratchbox, which is a toolkit geared towards making cross-compilation easier. It does this by making the development environment look and act like the execution environment, using QEMU to simulate the target hardware. 
Scratchbox supports both ARM and x86 targets, with experimental support for additional architectures. It uses standard toolchains and distributions where possible and is released under the GPL. LogFS LogFS is a flash filesystem that is targeted at the larger flash devices that are becoming more widespread. Unlike some filesystems currently in use, most notably jffs2, LogFS is specifically designed to avoid some of the performance and scalability problems that come with larger devices. Jörn Engel is the developer of LogFS, with some support from the Consumer Electronics Linux Forum (sponsor of ELC), so he gave an update on the status of the project. Engel used an unconventional scale (the sucks/rules meter) to measure the progress that had been made in the last year. The scale runs from -10 to 10 and measures the "suckiness" of particular features of the filesystem. Taking a page from This Is Spinal Tap, the score for the mount speed of LogFS was measured at 11 both last year and this. It is clearly the feature that Engel is most proud of as it takes 10-60ms to mount a filesystem; a similarly sized jffs2 takes on the order of one second. Engel looked at around ten separate attributes of the filesystem, first rating them on where LogFS was a year ago, then re-rating based on where it is today. The conclusion is that the average measure has moved from -2.75 to -0.55, so that "on average, it hardly sucks". He says he is getting confident enough to submit it to Andrew Morton for inclusion in his tree, hopefully on its way into the mainline. Engel is clearly somewhat frustrated with people who are waiting until it is "done" to start using LogFS—though there are some fairly serious usability problems that would tend to limit testers—proclaiming: "LogFS is finished, try it now, today!" In conclusion There were more talks, of course, as well as an active "hallway track" for the roughly 175 participants. 
ELC is a well-run and very interesting conference that is worth consideration for anyone who uses, or plans to use, Linux as an embedded operating system. This year's venue, the Computer History Museum was a nice facility for a conference of this size. It also had some great exhibits that will bring back memories for anyone who has been using computers, calculators, or game systems over the past 50 years or so—well worth a visit when one is in Silicon Valley. Integrating and Validating dynticks and Preemptable RCU Introduction Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU is most frequently described as a replacement for reader-writer locking, but it has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. In early 2008, a preemptable variant of RCU was accepted into mainline Linux in support of real-time workloads, a variant similar to the RCU implementations in the -rt patchset since August 2005. Preemptable RCU is needed for real-time workloads because older RCU implementations disable preemption across RCU read-side critical sections, resulting in excessive real-time latencies. However, one disadvantage of the -rt implementation was that each grace period required work to be done on each CPU, even if that CPU is in a low-power “dynticks-idle” state, and thus incapable of executing RCU read-side critical sections. The idea behind the dynticks-idle state is that idle CPUs should be physically powered down in order to conserve energy. In short, preemptable RCU can disable a valuable energy-conservation feature of recent Linux kernels. 
Although Josh Triplett and Paul McKenney had discussed some approaches for allowing CPUs to remain in low-power state throughout an RCU grace period (thus preserving the Linux kernel's ability to conserve energy), matters did not come to a head until Steve Rostedt integrated a new dyntick implementation with preemptable RCU in the -rt patchset. This combination caused one of Steve's systems to hang on boot, so in October, Paul coded up a dynticks-friendly modification to preemptable RCU's grace-period processing. Steve coded up rcu_irq_enter() and rcu_irq_exit() interfaces called from the irq_enter() and irq_exit() interrupt entry/exit functions. These rcu_irq_enter() and rcu_irq_exit() functions are needed to allow RCU to reliably handle situations where a dynticks-idle CPU is momentarily powered up for an interrupt handler containing RCU read-side critical sections. With these changes in place, Steve's system booted reliably, but Paul continued inspecting the code periodically on the assumption that we could not possibly have gotten the code right on the first try. Paul reviewed the code repeatedly from October 2007 to February 2008, and almost always found at least one bug. In one case, Paul even coded and tested a fix before realizing that the bug was illusory, but in all cases, the “bug” was in fact illusory. Near the end of February, Paul grew tired of this game. He therefore decided to enlist the aid of Promela and spin, as described in the LWN article Using Promela and Spin to verify parallel algorithms. This article presents a series of seven increasingly realistic Promela models, the last of which passes, consuming about 40GB of main memory for the state space. Quick Quiz 1: Yeah, that's great!!! Now, just what am I supposed to do if I don't happen to have a machine with 40GB of main memory??? More important, Promela and Spin did find a very subtle bug for me!!!
This article is organized as follows:

Introduction to Preemptable RCU and dynticks
    Task Interface
    Interrupt Interface
    Grace-Period Interface
Validating Preemptable RCU and dynticks
    Basic Model
    Validating Safety
    Validating Liveness
    Interrupts
    Validating Interrupt Handlers
    Validating Nested Interrupt Handlers
    Validating NMI Handlers

These sections are followed by conclusions and answers to the Quick Quizzes.

Introduction to Preemptable RCU and dynticks

The per-CPU dynticks_progress_counter variable is central to the interface between dynticks and preemptable RCU. This variable has an even value whenever the corresponding CPU is in dynticks-idle mode, and an odd value otherwise. A CPU exits dynticks-idle mode for the following three reasons: to start running a task, when entering the outermost of a possibly nested set of interrupt handlers, and when entering an NMI handler. Preemptable RCU's grace-period machinery samples the value of the dynticks_progress_counter variable in order to determine when a dynticks-idle CPU may safely be ignored. The following three sections give an overview of the task interface, the interrupt/NMI interface, and the use of the dynticks_progress_counter variable by the grace-period machinery.

Task Interface

When a given CPU enters dynticks-idle mode because it has no more tasks to run, it invokes rcu_enter_nohz(). This function increments dynticks_progress_counter and checks that the result is even, after first executing a memory barrier to ensure that any other CPU that sees the new value of dynticks_progress_counter will also see the completion of any prior RCU read-side critical sections.
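The parity convention can be made concrete with a small Python model. This is a sketch of the counter protocol only, not the kernel code: the DynticksCpu class and its in_dynticks_idle() helper are hypothetical, the starting value of zero is an assumption made for the sketch, memory barriers have no Python equivalent and appear only as comments, and a plain assert stands in for the kernel's check. The exit path, which the text turns to next, is the mirror image of the entry path.

```python
class DynticksCpu:
    """Toy model of one CPU's dynticks_progress_counter (hypothetical class)."""

    def __init__(self):
        # Assumption for this sketch: the CPU starts in dynticks-idle mode,
        # so the counter starts at an even value.
        self.dynticks_progress_counter = 0

    def in_dynticks_idle(self):
        # Even means dynticks-idle, odd means not idle.
        return self.dynticks_progress_counter % 2 == 0

    def rcu_enter_nohz(self):
        # The kernel executes a memory barrier *before* this increment so that
        # prior RCU read-side critical sections are seen as complete.
        self.dynticks_progress_counter += 1
        assert self.in_dynticks_idle(), "counter must be even once idle"

    def rcu_exit_nohz(self):
        self.dynticks_progress_counter += 1
        # The kernel executes a memory barrier *after* this increment so that
        # later RCU read-side critical sections are seen after the new value.
        assert not self.in_dynticks_idle(), "counter must be odd when running"
```

Calling the two methods out of order trips the corresponding assertion, which is the same mistake the kernel's own sanity checks are there to flag.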
Similarly, when a CPU that is in dynticks-idle mode prepares to start executing a newly runnable task, it invokes rcu_exit_nohz(). This function again increments dynticks_progress_counter, but follows it with a memory barrier to ensure that if any other CPU sees the result of any subsequent RCU read-side critical section, then that other CPU will also see the incremented value of dynticks_progress_counter. Finally, rcu_exit_nohz() checks that the result of the increment is an odd value. The rcu_enter_nohz() and rcu_exit_nohz() functions handle the case where a CPU enters and exits dynticks-idle mode due to task execution, but do not handle interrupts, which are covered in the following section. Interrupt Interface The rcu_irq_enter() and rcu_irq_exit() functions handle interrupt/NMI entry and exit, respectively. Of course, nested interrupts must also be properly accounted for. The possibility of nested interrupts is handled by a second per-CPU variable, rcu_update_flag, which is incremented upon entry to an interrupt or NMI handler (in rcu_irq_enter()) and is decremented upon exit (in rcu_irq_exit()). In addition, the pre-existing in_interrupt() primitive is used to distinguish between an outermost or a nested interrupt/NMI. Interrupt entry is handled by the rcu_irq_enter() function shown below: Quick Quiz 2: Why not simply increment rcu_update_flag, and then only increment dynticks_progress_counter if the old value of rcu_update_flag was zero??? Quick Quiz 3: But if line 7 finds that we are the outermost interrupt, wouldn't we always need to increment dynticks_progress_counter? Line 3 fetches the current CPU's number, while lines 4 and 5 increment the rcu_update_flag nesting counter if it is already non-zero. Lines 6 and 7 check to see whether we are the outermost level of interrupt, and, if so, whether dynticks_progress_counter needs to be incremented.
If so, line 9 increments dynticks_progress_counter, line 10 executes a memory barrier, and line 11 increments rcu_update_flag. As with rcu_exit_nohz(), the memory barrier ensures that any other CPU that sees the effects of an RCU read-side critical section in the interrupt handler (following the rcu_irq_enter() invocation) will also see the increment of dynticks_progress_counter. Interrupt exit is handled similarly by rcu_irq_exit(): Line 3 fetches the current CPU's number, as before. Line 5 checks to see if the rcu_update_flag is non-zero, returning immediately (via falling off the end of the function) if not. Otherwise, lines 6 through 11 come into play. Line 6 decrements rcu_update_flag, returning if the result is not zero. Line 8 verifies that we are indeed leaving the outermost level of nested interrupts, line 9 executes a memory barrier, line 10 increments dynticks_progress_counter, and line 11 verifies that this variable is now even. As with rcu_enter_nohz(), the memory barrier ensures that any other CPU that sees the increment of dynticks_progress_counter will also see the effects of an RCU read-side critical section in the interrupt handler (preceding the rcu_irq_exit() invocation). These two sections have described how the dynticks_progress_counter variable is maintained during entry to and exit from dynticks-idle mode, both by tasks and by interrupts and NMIs. The following section describes how this variable is used by preemptable RCU's grace-period machinery. Grace-Period Interface Of the four preemptable RCU grace-period states shown below (taken from The Design of Preemptable Read-Copy Update), only the rcu_try_flip_waitack_state() and rcu_try_flip_waitmb_state() states need to wait for other CPUs to respond. Of course, if a given CPU is in dynticks-idle state, we shouldn't wait for it.
Therefore, just before entering one of these two states, the preceding state takes a snapshot of each CPU's dynticks_progress_counter variable, placing the snapshot in another per-CPU variable, rcu_dyntick_snapshot. This is accomplished by invoking dyntick_save_progress_counter(), shown below: The rcu_try_flip_waitack_state() state invokes rcu_try_flip_waitack_needed(), shown below: Lines 7 and 8 pick up current and snapshot versions of dynticks_progress_counter, respectively. The memory barrier that follows these fetches ensures that the counter checks in the later rcu_try_flip_waitzero_state follow the fetches of these counters. Lines 10 and 11 return zero (meaning no communication with the specified CPU is required) if that CPU has remained in dynticks-idle state since the time that the snapshot was taken. Similarly, lines 12 and 13 return zero if that CPU was initially in dynticks-idle state or if it has completely passed through a dynticks-idle state. In both these cases, there is no way that that CPU could have retained the old value of the grace-period counter. If neither of these conditions holds, line 14 returns one, meaning that the CPU needs to explicitly respond. For its part, the rcu_try_flip_waitmb_state() state invokes rcu_try_flip_waitmb_needed(), shown below: This is quite similar to rcu_try_flip_waitack_needed(), the difference being in lines 12 and 13, because any transition either to or from dynticks-idle state executes the memory barrier needed by the rcu_try_flip_waitmb_state() state. Quick Quiz 4: Can you spot any bugs in any of the code in this section? We now have seen all the code involved in the interface between RCU and the dynticks-idle state. The next section builds up the Promela model used to validate this code.
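As a rough illustration of the snapshot-and-compare idea, the following Python sketch models the snapshot step and the memory-barrier variant of the check. The function names follow the text, but the bodies are a reading of the prose above rather than the kernel source: per-CPU variables are modeled as plain dictionaries keyed by CPU number, the memory barrier is omitted, and the second test (any change in the counter at all) is an interpretation of the statement that every transition to or from dynticks-idle state executes a memory barrier.

```python
def dyntick_save_progress_counter(counters, snapshots, cpu):
    """Record the CPU's current counter value for later comparison."""
    snapshots[cpu] = counters[cpu]


def rcu_try_flip_waitmb_needed(counters, snapshots, cpu):
    """Return 1 if this CPU must still execute a memory barrier, else 0."""
    curr = counters[cpu]
    snap = snapshots[cpu]
    # Unchanged and even: the CPU has been in dynticks-idle state the whole
    # time, so there is nothing to wait for.
    if curr == snap and curr % 2 == 0:
        return 0
    # Any change means at least one transition into or out of dynticks-idle
    # state, and each such transition executes a memory barrier.
    if curr != snap:
        return 0
    # Unchanged and odd: the CPU has been running throughout, so it must
    # respond explicitly.
    return 1
```

A CPU snapshotted while idle (say, at 4) needs no attention; one snapshotted while running (say, at 5) must be waited for until its counter moves to 6 or beyond.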
Validating Preemptable RCU and dynticks This section develops a Promela model for the interface between dynticks and RCU step by step, with each of the following sections illustrating one step, starting with the process-level code, adding assertions, interrupts, and finally NMIs. Basic Model This section translates the process-level dynticks entry/exit code and the grace-period processing into Promela. We start with rcu_exit_nohz() and rcu_enter_nohz() from the 2.6.25-rc4 kernel, placing these in a single Promela process that models exiting and entering dynticks-idle mode in a loop as follows: Lines 6 and 20 define a loop. Line 7 exits the loop once the loop counter i has exceeded the limit MAX_DYNTICK_LOOP_NOHZ. Line 8 tells the loop construct to execute lines 9-19 for each pass through the loop. Because the conditionals on lines 7 and 8 are exclusive of each other, the normal Promela random selection of true conditions is disabled. Lines 9 and 11 model rcu_exit_nohz()'s non-atomic increment of dynticks_progress_counter, while line 12 models the WARN_ON(). The atomic construct simply reduces the Promela state space, given that the WARN_ON() is not strictly speaking part of the algorithm. Lines 14-18 similarly model the increment and WARN_ON() for rcu_enter_nohz(). Finally, line 19 increments the loop counter. Quick Quiz 5: Why isn't the memory barrier in rcu_exit_nohz() and rcu_enter_nohz() modeled in Promela? Quick Quiz 6: Isn't it a bit strange to model rcu_exit_nohz() followed by rcu_enter_nohz()? Wouldn't it be more natural to instead model entry before exit? Each pass through the loop therefore models a CPU exiting dynticks-idle mode (for example, starting to execute a task), then re-entering dynticks-idle mode (for example, that same task blocking). The next step is to model the interface to RCU's grace-period processing.
For this, we need to model dyntick_save_progress_counter(), rcu_try_flip_waitack_needed(), rcu_try_flip_waitmb_needed(), as well as portions of rcu_try_flip_waitack() and rcu_try_flip_waitmb(), all from the 2.6.25-rc4 kernel. The following grace_period() Promela process models these functions as they would be invoked during a single pass through preemptable RCU's grace-period processing. Lines 6-9 print out the loop limit (but only into the .trail file in case of error) and model a line of code from rcu_try_flip_idle() and its call to dyntick_save_progress_counter(), which takes a snapshot of the current CPU's dynticks_progress_counter variable. These two lines are executed atomically to reduce state space. Lines 10-22 model the relevant code in rcu_try_flip_waitack() and its call to rcu_try_flip_waitack_needed(). This loop is modeling the grace-period state machine waiting for a counter-flip acknowledgment from each CPU, but only that part that interacts with dynticks-idle CPUs. Line 23 models a line from rcu_try_flip_waitzero() and its call to dyntick_save_progress_counter(), again taking a snapshot of the CPU's dynticks_progress_counter variable. Finally, lines 24-36 model the relevant code in rcu_try_flip_waitmb() and its call to rcu_try_flip_waitmb_needed(). This loop is modeling the grace-period state machine waiting for each CPU to execute a memory barrier, but again only that part that interacts with dynticks-idle CPUs. Quick Quiz 7: Wait a minute! In the Linux kernel, both dynticks_progress_counter and rcu_dyntick_snapshot are per-CPU variables. So why are they instead being modeled as single global variables? The resulting model, when run with the runspin.sh script, generates 691 states and passes without errors, which is not at all surprising given that it completely lacks the assertions that could find failures. The next section therefore adds safety assertions.
Validating Safety A safe RCU implementation must never permit a grace period to complete before the completion of any RCU readers that started before the start of the grace period. This is modeled by a grace_period_state variable that can take on three states as follows: The grace_period() process sets this variable as it progresses through the grace-period phases, as shown below: Quick Quiz 8: Given there are a pair of back-to-back changes to grace_period_state on lines 25 and 26, how can we be sure that line 25's changes won't be lost? Lines 6, 10, 25, 26, 29, and 44 update this variable (combining atomically with algorithmic operations where feasible) to allow the dyntick_nohz() process to validate the basic RCU safety property. The form of this validation is to assert that the value of the grace_period_state variable cannot jump from GP_IDLE to GP_DONE during a time period over which RCU readers could plausibly persist. The dyntick_nohz() Promela process implements this validation as shown below: Line 13 sets a new old_gp_idle flag if the value of the grace_period_state variable is GP_IDLE at the beginning of task execution, and the assertion at line 18 fires if the grace_period_state variable has advanced to GP_DONE during task execution, which would be illegal given that a single RCU read-side critical section could span the entire intervening time period. The resulting model, when run with the runspin.sh script, generates 964 states and passes without errors, which is reassuring. That said, although safety is critically important, it is also quite important to avoid indefinitely stalling grace periods. The next section therefore covers validating liveness. Validating Liveness Although liveness can be difficult to prove, there is a simple trick that applies here. 
The first step is to make dyntick_nohz() indicate that it is done via a dyntick_nohz_done variable, as shown on line 26 of the following: With this variable in place, we can add assertions to grace_period() to check for unnecessary blockage as follows: We have added the shouldexit variable on line 5, which we initialize to zero on line 10. Line 17 asserts that shouldexit is not set, while line 18 sets shouldexit to the dyntick_nohz_done variable maintained by dyntick_nohz(). This assertion will therefore trigger if we attempt to take more than one pass through the wait-for-counter-flip-acknowledgment loop after dyntick_nohz() has completed execution. After all, if dyntick_nohz() is done, then there cannot be any more state changes to force us out of the loop, so going through twice in this state means an infinite loop, which in turn means no end to the grace period. Lines 32, 39, and 40 operate in a similar manner for the second (memory-barrier) loop. However, running this model results in failure, as line 23 is checking that the wrong variable is even. Upon failure, spin writes out a “trail” file, which records the sequence of states that leads to the failure. Use the spin -t -p -g -l dyntickRCU-base-sl-busted.spin command to cause spin to retrace this sequence of states, printing the statements executed and the values of variables. Note that the line numbers do not match the listing above because spin takes both functions in a single file. However, the line numbers do match the full model. We see that the dyntick_nohz() process completed at step 34 (search for “34:”), but that the grace_period() process nonetheless failed to exit the loop. The value of curr is 6 (see step 35) and the value of snap is 5 (see step 17). Therefore the first condition on line 21 above does not hold because curr != snap, and the second condition on line 23 does not hold either because snap is odd and because curr is only one greater than snap.
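The same stall can be reproduced with a tiny exhaustive search written in Python. This is a heavily simplified sketch and not a substitute for spin: it models only the counter increments of dyntick_nohz() and the first (acknowledgment) wait loop of grace_period(), it treats "dyntick_nohz() is finished and the loop still wants to wait" as the liveness failure, and the two tests in waitack_needed() are written from the description of the conditions given above. The find_stalls() function, its state layout, and the choice of three passes are assumptions made for the illustration.

```python
from collections import deque


def waitack_needed(curr, snap):
    """Acknowledgment test as described in the text at this point."""
    if curr == snap and curr % 2 == 0:
        return False  # idle for the entire time
    if curr - snap > 2 or snap % 2 == 0:
        return False  # passed through idle, or snapshot taken while idle
    return True       # must wait for this CPU


def find_stalls(max_loops):
    """Explore every interleaving and return the (curr, snap) pairs at which
    the wait loop can never exit because the counter will not change again."""
    # State: (counter, increments still to come, snapshot or None).
    start = (0, 2 * max_loops, None)
    seen = {start}
    queue = deque([start])
    stalls = set()
    while queue:
        counter, steps_left, snap = queue.popleft()
        successors = []
        if steps_left > 0:
            # dyntick_nohz() step: leave or re-enter dynticks-idle mode.
            successors.append((counter + 1, steps_left - 1, snap))
        if snap is None:
            # grace_period() step: take the snapshot.
            successors.append((counter, steps_left, counter))
        elif steps_left == 0 and waitack_needed(counter, snap):
            # No more state changes are possible, yet the loop keeps waiting.
            stalls.add((counter, snap))
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return stalls
```

With three passes through the dynticks loop, find_stalls(3) reports exactly one stuck state, (6, 5), which matches the curr and snap values read out of the trail file: the snapshot was taken while the CPU was running its last task, and the CPU then went idle for good.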
So one of these two conditions has to be incorrect. Referring to the comment block in rcu_try_flip_waitack_needed() for the first condition: The first condition does match this, because if curr == snap and if curr is even, then the corresponding CPU has been in dynticks-idle mode the entire time, as required. So let's look at the comment block for the second condition: The first part of the condition is correct, because if curr and snap differ by two, there will be at least one even number in between, corresponding to having passed completely through a dynticks-idle phase. However, the second part of the condition corresponds to having started in dynticks-idle mode, not having finished in this mode. We therefore need to be testing curr rather than snap for being an even number. The corrected C code is as follows: Making the corresponding correction in the model results in a correct validation with 661 states that passes without errors. However, it is worth noting that the first version of the liveness validation failed to catch this bug, due to a bug in the liveness validation itself. This liveness-validation bug was located by inserting an infinite loop in the grace_period() process, and noting that the liveness-validation code failed to detect this problem! We have now successfully validated both safety and liveness conditions, but only for processes running and blocking. We also need to handle interrupts, a task taken up in the next section. Interrupts There are a couple of ways to model interrupts in Promela: using C-preprocessor tricks to insert the interrupt handler between each and every statement of the dyntick_nohz() process, or modeling the interrupt handler with a separate process. A bit of thought indicated that the second approach would have a smaller state space, though it requires that the interrupt handler somehow run atomically with respect to the dyntick_nohz() process, but not with respect to the grace_period() process.
Fortunately, it turns out that Promela permits you to branch out of atomic statements. This trick allows us to have the interrupt handler set a flag, and recode dyntick_nohz() to atomically check this flag and execute only when the flag is not set. This can be accomplished with a C-preprocessor macro that takes a label and a Promela statement as follows: One might use this macro as follows: Line 2 of the macro creates the specified statement label. Lines 3-8 are an atomic block that tests the in_dyntick_irq variable, and if this variable is set (indicating that the interrupt handler is active), branches out of the atomic block back to the label. Otherwise, line 6 executes the specified statement. The overall effect is that mainline execution stalls any time an interrupt is active, as required. Validating Interrupt Handlers The first step is to convert dyntick_nohz() to EXECUTE_MAINLINE() form, as follows: Quick Quiz 9: But what would you do if you needed the statements in a single EXECUTE_MAINLINE() group to execute non-atomically? Quick Quiz 10: But what if the dyntick_nohz() process had “if” or “do” statements with conditions, where the statement bodies of these constructs needed to execute non-atomically? It is important to note that when a group of statements is passed to EXECUTE_MAINLINE(), as in lines 11-14, all statements in that group execute atomically. The next step is to write a dyntick_irq() process to model an interrupt handler: Quick Quiz 11: Why are lines 44 and 45 (the in_dyntick_irq = 0; and the i++;) executed atomically? Quick Quiz 12: What property of interrupts is this dyntick_irq() process unable to model? The loop on lines 7-47 models up to MAX_DYNTICK_LOOP_IRQ interrupts, with lines 8 and 9 forming the loop condition and line 45 incrementing the control variable. Line 10 tells dyntick_nohz() that an interrupt handler is running, and line 44 tells dyntick_nohz() that this handler has completed.
Line 48 is used for liveness validation, much as is the corresponding line of dyntick_nohz(). Lines 11-24 model rcu_irq_enter(), and lines 25 and 26 model the relevant snippet of __irq_enter(). Lines 27 and 28 validate safety in much the same manner as do the corresponding lines of dyntick_nohz(). Lines 29 and 30 model the relevant snippet of __irq_exit(), and finally lines 31-42 model rcu_irq_exit(). The implementation of grace_period() is very similar to the earlier one. The only changes are the addition of line 10 to add the new interrupt-count parameter and changes to lines 19 and 41 to add the new dyntick_irq_done variable to the liveness checks. This model results in a correct validation with roughly half a million states, passing without errors. However, this version of the model does not handle nested interrupts. This topic is taken up in the next section. Validating Nested Interrupt Handlers Nested interrupt handlers may be modeled by splitting the body of the loop in dyntick_irq() as follows: This is similar to the earlier dyntick_irq() process. It adds a second counter variable j on line 5, so that i counts entries to interrupt handlers and j counts exits. The outermost variable on line 7 helps determine when the grace_period_state variable needs to be sampled for the safety checks. The loop-exit check on line 10 is updated to require that the specified number of interrupt handlers are exited as well as entered, and the increment of i is moved to line 39, which is the end of the interrupt-entry model. Lines 12-15 set the outermost variable to indicate whether this is the outermost of a set of nested interrupts and to set the in_dyntick_irq variable that is used by the dyntick_nohz() process. Lines 32-37 capture the state of the grace_period_state variable, but only when in the outermost interrupt handler.
Line 40 has the do-loop conditional for interrupt-exit modeling: as long as we have exited fewer interrupts than we have entered, it is legal to exit another interrupt. Lines 41-48 check the safety criterion, but only if we are exiting from the outermost interrupt level. Finally, lines 63-66 increment the interrupt-exit count j and, if this is the outermost interrupt level, clear in_dyntick_irq. This model results in a correct validation with a bit more than half a million states, passing without errors. However, this version of the model does not handle NMIs, which are taken up in the next section. Validating NMI Handlers We take the same general approach for NMIs as we do for interrupts, keeping in mind that NMIs do not nest. This results in a dyntick_nmi() process as follows: Of course, the fact that we have NMIs requires adjustments in the other components. For example, the EXECUTE_MAINLINE() macro now needs to pay attention to the NMI handler (in_dyntick_nmi) as well as the interrupt handler (in_dyntick_irq) by checking the dyntick_nmi_done variable as follows: We will also need to introduce an EXECUTE_IRQ() macro that checks in_dyntick_nmi in order to allow dyntick_irq() to exclude dyntick_nmi(): It is further necessary to convert dyntick_irq() to EXECUTE_IRQ() as follows: Note that we have open-coded the “if” statements (for example, lines 16-28). In addition, statements that process strictly local state (such as line 56) need not exclude dyntick_nmi(). Finally, grace_period() requires only a few changes: We have added the printf() for the new MAX_DYNTICK_LOOP_NMI parameter on line 11 and added dyntick_nmi_done to the shouldexit assignments on lines 22 and 46. Quick Quiz 13: Do you always write your code in this painfully incremental manner??? The model results in a correct validation with several hundred million states, passing without errors.
Conclusions This effort provided some lessons (re)learned: Promela and spin can validate interrupt/NMI-handler interactions. Documenting code can help locate bugs. In this case, the documentation effort located this bug. Validate your code early, often, and up to the point of destruction. This effort located one subtle bug that might have been quite difficult to test or debug. Always validate your validation code. The usual way to do this is to insert a deliberate bug and verify that the validation code catches it. Of course, if the validation code fails to catch this bug, you may also need to verify the bug itself, and so on, recursing infinitely. However, if you find yourself in this position, getting a good night's sleep can be an extremely effective debugging technique. Finally, if cmpxchg instructions ever become inexpensive enough to tolerate them in the interrupt fastpath, their use could greatly simplify this code. The Promela model for an atomic-instruction-based implementation of this code has more than an order of magnitude fewer states, and the C code is much easier to understand. On the other hand, one must take care when using cmpxchg instructions, as some code sequences, if highly contended, can result in starvation. This situation is particularly likely to occur when part of the algorithm uses cmpxchg, other parts of the algorithm use atomic instructions that cannot fail (e.g., atomic increment), contention for the variable in question is high, and the code is running on a NUMA or a shared-cache machine. Sadly, almost all multi-socket systems with either multi-core or multi-threaded CPUs fit this description. Acknowledgments We are indebted to Andrew Theurer, who maintains the large-memory machine that ran the full test. We all owe a debt of gratitude to Vara Prasad for his help in rendering this article human-readable. 
Distribution-friendly projects Part 3 [Editor's note: This article, which looks at the interactions of software projects and distribution providers, is presented in three parts. Part 1 introduces the issues downstream distributions have with upstream software providers. Part 2 covers the technical needs of the distributions.] Philosophical requests The philosophical side of packaging is mostly of concern to the users, although the global goals of the distribution will have some philosophical issues as well. In this section I'll try to cover the most common requests of the downstream distributions to the various upstream providers. Technical requests can usually be taken care of with compromises and options at build-time, so a solution will usually be found in time. Philosophical needs, on the other hand, often put distributions and original developers on two opposite positions, and cannot be fixed unless one of the two accepts the position of the other. While technical needs are usually shared between distributions, different distributions have different goals, and in turn different philosophical needs, and this adds to the problem. A project wanting to accommodate the requests from a distribution might make life more difficult for another, that has different philosophical issues. User expectations Users of a distribution usually expect a certain behaviour from the software they install. For graphical applications, they also might expect a certain graphical aspect so that all the applications blend in together. These problems may be easier to solve as non-technical concerns go, as they usually require providing more choices. One common example comes with fonts in graphical environments: most distributions these days ship with all the graphical applications set to use the same font family. This is done to follow the style rule of using as few different fonts in a single view as possible (you don't usually see a book using ten or twelve different fonts in the text). 
So one common request is for software to provide a way to change the default font without user intervention (which otherwise often involves changing the source code). This still seems to be a technical issue; for example, using TrueType fonts requires the use of some sophisticated libraries for font rendering. Some projects don't want to complicate their code with them. Similarly, some distributions like to provide anti-aliasing for the default fonts, but some projects dislike the whole idea of using anti-aliased fonts. A similar issue might arise with GUI toolkits. Distributions tend to provide the same graphical aspect for the software they install; this usually means that both GTK+ and Qt are set to similar, if not identical, themes. So one other thing distributions might like is for graphical software to rely on a theme-capable toolkit. Themes and skins, though, are often disliked by minor projects, which might also not be using the GTK+ or Qt toolkits, as they tend to make the code more complex and slow. Users expect to be able to set their language preference and to have all applications use that language. Unfortunately, adding support for translations is a far from trivial task, and adds complexity to the software, which is something that smaller software projects especially try to avoid. Similar complexity also stems from supporting text encodings like UTF-8, which - for certain distributions like Fedora - is a prerequisite for the software to be added to their repository. It's easy to see that user requests are often pretty complex and hard to implement on many existing projects. The best way to make a project more appealing for distributions is to start from the beginning with these points in mind. Distribution philosophical policies There are also distribution policies regarding issues that are simply philosophical. In this category you can find the requests of distributions for removing code that deals with non-free formats, like most multimedia formats.
All these issues are usually centered around licensing, copyright and patent-encumbered formats or algorithms: Multimedia formats, video, audio or picture formats are often considered patent-encumbered (sometimes with the exception of Xiph formats like Ogg, Vorbis and Theora). Most distributions would like support for these to always be optional, for this reason. This might actually start to be a problem for multimedia applications as their main goal is just to provide support for these patent-encumbered formats. Usage of GPL-incompatible libraries by GPL applications, and other forms of license mixing, are quite a big issue for almost all binary distributions. Some distributions will not be able to legally redistribute packages with licensing issues. This is why distributions often push for optional support for GnuTLS as a replacement for OpenSSL, or build packages without SSL support by default. (In addition to license problems, SSL support is also encumbered with special cryptography legislation, which makes its handling even more of a problem.) In both multimedia and cryptography there is the problem of patent-encumbered algorithms. Even when the format itself is not encumbered, the algorithm used to generate the result might be. Distributions want a way to opt out of some of the support in the software. Support for network analysis tools can also be a problem. While the boundary between analysis and cracking can be quite thin, laws like the ones enacted in Germany start making it difficult for distributions to carry tools that are even well inside the definition of network analysis rather than network cracking. For some distributions, there are extra policies that depend on the license used by the project itself. For instance, if the license disallows changes to the source (verbatim copy only), it might prevent the distribution from taking the proper steps to fix technical issues.
Each distribution has its philosophical goals, such as a focus on Free Software or a focus on corporate users. These goals will influence the distribution's view of philosophical issues. Companies tend to take a more defensive look at these issues than community projects. European projects tend to be more lax when it comes to software patent problems, and other countries don't enforce copyright laws. These are the types of factors that will affect a distribution's philosophical goals. Conclusions The issues described in this section can also conflict directly with the goals of the upstream project (and they often do), in which case the project is very unlikely to be packaged officially by at least some distributions. This is why I categorized these issues under the philosophical name rather than technical. It's actually fairly common for issues of this kind to spark debates inside and outside projects and distributions, as developers and maintainers have very different feelings about them. One example is a software package in which the team is composed entirely of native English speakers. Such a project has a fair chance of starting without considering that some users might expect the user interface to be available in any other language. While this might sound a bit harsh, it's often true. It is nearly impossible for a project to address all of these issues for every possible distribution. And it becomes even harder when the project itself focuses on keeping the code as simple as possible, even when that means ignoring features and optimizations that require adding complexity. For this reason, the issues listed here have to be taken as suggestions, rather than requirements, depending on the original goals for the project. If the project wants to be widely accepted though, it will need to provide a good mitigation method for these issues.
Afterword In addition to these issues, which relate directly to the source and the software, there are some extra suggestions that can be given to make the whole project friendlier to distributors, entailing process and workflow changes. Making it visible in the source code to whom and where to send patches and suggestions is certainly a nice initial step. When preparing a patch for software that misbehaves, it's usually simpler to read the documentation shipped with the software. Distribution developers are less likely to open a browser to check the homepage of the software, looking for the address to use. This also means that you should always make sure to update your address referenced in the software documentation. If you change your email address, and the old one is no longer reachable, it is a good idea to re-roll the tarball even if no code changes are present (in these cases, adding a suffix to the tarball is better than changing the version). Even better, having a public way to track patches is often useful: distributions can easily see if someone else fixed that issue before, and how. It might save a downstream maintainer some work, and it helps to have a solution that works for more than one person at a time. Acknowledging the patches, and pointing out what is not going to work, is something that helps reduce the frustration for maintainers, as it allows them to better suit the coding style of the project each time. Ignoring a patch entirely is not a good idea, as these patches are rarely non-issues. It's also important to notice that the patches almost never concern optimizations. Most distributions are generally less concerned with the performance of the software than with its correctness as defined by their own policies. Distributions would often prefer to reduce the speed of the software if that makes it better follow their policy.
This also has to be understood and accepted, and at best worked around by either improving the optimization, allowing an option, or by making a trade-off. Having a public repository for the source is often helpful, but often not in major ways. While it makes it easier for the downstream maintainers to check the progress of the code, it might not be trivial to identify where the correct source is. If the project uses branches it might well be that the upstream developers have already moved away from the broken code by the time a distribution tries to package it. If a repository is available, regular tagging and branching are helpful, and make it easier for a maintainer to find what has actually changed between two or more versions. Finally This article cannot cover all the possible requests that a distribution might have, and it does not even get close to all the possible requests by all possible distributions. There are even more requests relating to portability, for distributions that target particular non-mainstream hardware (e.g. distributions targeting embedded devices, like OpenWRT) or distributions for other (Free and non-Free) operating systems (like Cygwin, Fink or FreeBSD ports). The remaining issues have to be coped with on an incremental basis, with collaboration of the downstream maintainers for that distribution. Following these practices might make it easier for those people to contact the upstream developers to work out a solution. They make the project appear friendlier, much like saying "Please let us know how we can improve your users' experience with our software." They change the mindset of a project so that it is less frustrating for packagers to prepare it for distributions. Going against all these points not only makes the job harder for the distributors, but might give the (wrong) idea that the project is not open to accommodating distributions. 4K stacks by default?
The kernel stack is a rather important chunk of memory in any Linux system. The unpleasant kernel memory corruption that results from overflowing it is something that is to be avoided at all costs. But the stack is allocated for each process and thread in the system, so those who are looking to reduce memory usage target the 8K stack used by default on x86. In addition, an 8K stack requires two physically contiguous pages (an "order 1" allocation) which can be difficult to satisfy on a running system due to fragmentation. Linux has had optional support for 4K stacks for nearly four years now, with Fedora and RHEL enabling it on the kernels they ship, but a recent patch to make it the default for x86 has raised some eyebrows. Andrew Morton sees it as bypassing the normal patch submission process: This patch will cause kernels to crash. It has no changelog which explains or justifies the alteration. afaict the patch was not posted to the mailing list and was not discussed or reviewed. It is not surprising that patch author Ingo Molnar sees things a little differently: what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years [ ... ] and we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks. So basically the kernel default just follows the common distro default now. (distros and users can still disable it) As described in an earlier LWN article, the main concerns about only providing 4K for the kernel stack are for complicated storage configurations or for people using NDISwrapper. There is fairly high disdain for the latter case—as it is done to load proprietary Windows drivers into the kernel—but it could lead to a pretty hideous failure in the former. Data corruption certainly seems like a possibility, but, regardless, a kernel crash is definitely not what an administrator wants to have to deal with. 
Arjan van de Ven summarized the current state, noting that NDISwrapper really requires 12K stacks, so having 8K only makes it less likely those kernels will crash. The stacking of multiple storage drivers (network filesystems, device mapper, RAID, etc.) is a bigger issue: we need to know which they are, and then solve them, because even on x86-64 with 8k stacks they can be a problem (just because the stack frames are bigger, although not quite double, there). Proponents of default 4K stacks seem to be puzzled why there is objection to the change since there have been no problems with Red Hat kernels. But Andi Kleen notes: One way they do that is by marking significant parts of the kernel unsupported. I don't think that's an option for mainline. The xfs filesystem, which is not supported in RHEL or Fedora, can potentially use a great deal of stack. This leads some kernel hackers to worry that a complicated configuration that uses it, an "nfs+xfs+md+scsi writeback" configuration as Eric Sandeen puts it, could overflow. Work is already proceeding to reduce the xfs stack usage, but it clearly is a problem that xfs hackers have seen. David Chinner responds to a question about stack overflows: We see them regularly enough on x86 to know that the first question to any strange crash is "are you using 4k stacks?". In comparison, I have never heard of a single stack overflow on x86_64.... It would seem premature to make 4K stacks the default. There is good reason to believe that folks using xfs could run into problems. But there is a larger issue, one that Morton brought up in his initial message, then reiterated later in the thread: Anyway. We should be having this sort of discussion _before_ a patch gets merged, no? The memory savings can be significant, especially in the embedded world. Coupled with the elimination of order 1 allocations each time a process gets created, there is good reason to keep working toward 4K stacks by default. 
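The size of the savings is simple arithmetic. As a rough sketch, assuming 4096-byte pages and a hypothetical system running 2000 threads (both figures are illustrative and do not come from the discussion), the total stack cost of each configuration can be compared as follows:

```python
PAGE_SIZE = 4096  # assumed x86 page size in bytes


def stack_bytes(order):
    """Bytes in a kernel stack allocated as 2**order contiguous pages."""
    return PAGE_SIZE << order


def total_stack_kib(threads, order):
    """Total kernel stack memory, in KiB, for the given number of threads."""
    return threads * stack_bytes(order) // 1024


threads = 2000  # hypothetical thread count
saved = total_stack_kib(threads, 1) - total_stack_kib(threads, 0)
print(saved)  # 8000 KiB saved by moving from order-1 to order-0 stacks
```

The saving grows linearly with the number of threads, and each stack in the 4K case is a single page, so no contiguous pair of pages has to be found.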
As of this writing, the default remains for 4K stacks in Linus's tree, but that could change before long. Image handling vulnerabilities Bugs that linger for eight years without a fix are probably annoying to whoever reported them; perhaps others as well. When those bugs have possible security implications, it is hard to see how they can remain unfixed for even eight months, let alone years, but that appears to be the case with some GTK image handling bugs. Code to handle image formats has been the source of numerous vulnerabilities along the way, which makes it even harder to see why these have languished so long. A call for ideas for a hackfest on the GNOME foundation mailing list seems like a bit of a strange place to find information about vulnerabilities, but in the ensuing thread, Michael Chudobiak brought up some bugs that he would like to see addressed, perhaps as part of a hackfest: I'd like to suggest one possible topic: The pixbuf loaders. They're slow and memory intensive, and this drags down anything that needs thumbnails (Nautilus, etc). There is a lot of opportunity to improve the responsiveness of the desktop here. The bugs he listed were from 2002 (80925), 2004 (142428), and 2008 (522803), but Alan Cox mentioned that he reported one of them as a GNOME security bug "about eight years ago". In his opinion all of the bugs were of the "well known, never fixed" variety. Because the code in question lives in GTK—used by many GNOME applications—"quite a few gnome apps fed small compressed images explode". The basic problem is that the routines handling images create the full-resolution image in memory regardless of the size requested. In addition, various memory-intensive techniques are used to scale the image to the requested size. This impacts Nautilus and other GNOME programs that create thumbnails of large images. 
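To see why a small compressed file can be so expensive, consider the decoded size. The following is a minimal sketch, assuming 4 bytes per pixel (RGBA) and hypothetical image dimensions; neither number is taken from the bug reports:

```python
BYTES_PER_PIXEL = 4  # assumed RGBA, 8 bits per channel


def decoded_bytes(width, height):
    """Memory needed to hold a fully decoded image of the given size."""
    return width * height * BYTES_PER_PIXEL


full = decoded_bytes(10000, 10000)   # hypothetical large source image
thumb = decoded_bytes(128, 128)      # hypothetical thumbnail size
print(full // (1024 * 1024))  # 381 (MiB) for the full-resolution decode
print(thumb // 1024)          # 64 (KiB) for the thumbnail itself
```

Under these assumptions the intermediate full-resolution buffer is several thousand times larger than the thumbnail that was actually requested, before counting any extra memory used during scaling.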
Presumably, a denial of service, at a minimum, can result from these operations, though there may be other ways to exploit any program crashes that result. Cox has a plan to see them get fixed: Unfortunately they are well known but nobody seems to care. I'll forward your message to the vendor security list and we'll see what happens. Probably the bug just needs to be made *very* public to incentivise people to fix it 8) The vendor security list, often abbreviated vendor-sec, is a closed mailing list for distribution security teams to exchange information about vulnerabilities in various programs. It is closed so that bugs that are not publicly known can be freely discussed. Whether Cox's posting to that list spurs any action remains to be seen. It is a rare week where LWN does not report some kind of image handling botch as a new vulnerability. This week, a cups vulnerability in handling PNG files could lead to a denial of service; last week we reported an Opera vulnerability in handling images in HTML canvas elements that could possibly lead to arbitrary code execution. Image handling is an area where all bugs need to be scrutinized carefully for potential security issues. Hopefully, part of the problem is that the GNOME hackers did not realize the security implications of the bugs. There does seem to be ample complaint about performance problems, though, to get some kind of action over the last six or eight years. This is a set of related bugs that have seemingly been overlooked for a long time. Perhaps that time is now coming to an end. Firebird adds new features with version 2.1 Firebird is one of the popular open-source relational database management systems (RDBMS) that runs under Linux. From the about Firebird document: Firebird is a relational database offering many ANSI SQL standard features that runs on Linux, Windows, and a variety of Unix platforms. 
Firebird offers excellent concurrency, high performance, and powerful language support for stored procedures and triggers. It has been used in production systems, under a variety of names, since 1981. The Firebird Project is a commercially independent project of C and C++ programmers, technical advisors and supporters developing and enhancing a multi-platform relational database management system based on the source code released by Inprise Corp (now known as Borland Software Corp) on 25 July, 2000. Stable version 2.1 of Firebird was announced on April 18, 2008: "Firebird 2.1 is a full version release that builds on the architectural changes introduced in the V.2.0 series. Thanks to all who have field-tested the Alphas and Betas during 2007 and the first quarter of 2008 we have a release that is bright with new features and improvements, including the long-awaited global temporary tables, a catalogue of new run-time monitoring mechanisms, database triggers and the injection of dozens of internal functions into the SQL language set." A summary of new features from the release announcement includes: Database triggers for making user-defined triggers have been added. Global temporary tables are now available for the handling of non-persistent data. New common table expressions are available for making dynamic recursive queries. An optional RETURNING clause which supports update, insert and delete operations has been added. The MERGE function now has an UPDATE OR INSERT statement for performing conditional operations. The new LIST() function can retrieve information in the form of a comma-separated list. New built-in functions have been added to replace UDF library calls. Text BLOBs up to 32K in length can now masquerade as varchars. Procedural SQL (PSQL) local variables can now be declared using domains. PSQL variables and arguments can be COLLATEd. A new DDL CREATE COLLATION command has been added, replacing the need for a script. 
New Unicode collations can be applied to any character set. The ability to perform run-time database snapshot monitoring via SQL has been added. The performance of the remote protocol has been improved to better support operation on slow networks. More details on the version 2.1 release are available in the release notes [PDF]. The document should be read by those who are upgrading from older versions of Firebird. The release notes list a number of additional changes, including: The reworking of the on disk structure (ODS). Improvements to the PSQL error stack trace. The availability of more context information. A new fbsvcmgr command-line interface to the Services API. Support for named cursors. Implementation of the new XNET local transport protocol. A rework of the garbage collection mechanism. The Services API to Classic architecture port has been finished. Lock timeouts are now available for WAIT transactions. New Database Shutdown Modes have been added. The NULL handling for UDFs has been improved. There have been synchronization logic improvements. Support has been added for 64 bit platforms. Larger record enumeration limits are now supported. Debugging improvements have been added. Connection handling on the POSIX superserver has been improved. The PSQL invariant tracking system has been reworked. The ROLLBACK RETAIN clause is now supported. There have been improvements made to the optimizer routines. Numerous Windows improvements have been added. Clearly, the Firebird developers have been busy working on this software. If the above lists aren't enough, the Firebird home page notes that there is a mechanism for users to request more new features. The development roadmap for 2008 gives an idea of where the project is headed. Several bug fix releases are scheduled for version 2.1 in the near future and work on the next major release, version 2.5, is already in progress. Firebird is available for download here. 
OLPC at a turning point It looks like hard times for the One Laptop Per Child project. Quite a few key developers have left, including Mary Lou Jepsen, Ivan Krstić, Andres Salomon, and Walter Bender. Laptop deployments are far below the several million that the project had hoped for by this time, and many of the goals for the system's software have not been achieved. There is persistent talk of supporting Windows, with suggestions that Linux could be dropped altogether. An ongoing thread on the project's development mailing list shows that quite a few participants are concerned about where things are going. To many, it seems, OLPC is about to go down as a noble failure. These rumors may be just a bit premature, though. When considering what may really come of OLPC, it's worth keeping a few things in mind. One of those is the fact that the project has just completed a major push to its first mass-production system. Your editor has watched the project closely enough to see that, as with many such efforts, the people involved have been putting in lots of long hours to get the job done. When this kind of pressure is lifted, it is natural to take a break, catch up on the house work, and, perhaps, find a new job. So the departure of some key staff at this stage is not entirely surprising. A look at the state of OLPC's software suggests that the project had set an overly ambitious set of goals for its first release. When that happens, one must jettison some objectives; the later that this is done, the more likely it is that the wrong objectives will be tossed overboard. There are signs that OLPC tried to do too much for too long, with an end result which is not as stable, as fast, or as fully-featured as one would like. As many people close to the project have noted, the laptop's software remains immature. 
But, as former president Walter Bender put it: While [we] have heard a lot of noise about performance in the media and from some members of the development community, it has not, in my experience been a major road-block in the school trials and deployments. There are lots of bugs and lots of things that could be improved upon, and these should certainly be addressed, but the characterizations being made in this thread do not reflect the realities of the OLPC deployments--the children and teachers are using the laptops and are learning. Finally, the number of laptops delivered to children is far below the level the project had planned upon. Fewer deployments means a lower impact for the project, but it also cannot be helping to create the economies of scale the project had counted on to push the cost down. There have also been some embarrassing failures along the way, including the misplacing of a large number of "Give one get one" orders until after it was too late to include them in the manufacturing run. All of the above points to a need to make some changes in how the project is run. Changes always create uncertainty, so it would be surprising if OLPC participants were not a little nervous at the moment. What happens in the next few months will likely determine OLPC's fate. The project's leadership has famously said in the past that OLPC is an education project, not a laptop project. Some people have recently expressed concerns that, in fact, OLPC is turning into a laptop project, with deployment numbers being the main goal. Nicholas Negroponte doesn't help when he allows himself to be quoted as being "mainly concerned with putting as many laptops as possible in children's hands." If OLPC becomes primarily a low-cost laptop vendor, and especially if it goes to proprietary operating systems as a means toward that end, it will lose much of the community that has grown up around the project. And that would be a shame. 
There is great beauty in the idea of putting a well-designed learning tool into the hands of children and empowering those children by providing a system which is completely open and hackable. A large and motivated community of highly-capable people came together behind that vision and did their best to rethink how this technology should work and create something better. Deployment groups in a number of countries have gotten the resulting systems into the hands of thousands of children, and many of them are reporting good results. A lot of good things have happened here, and it doesn't have to end now. But it might end soon. To pull things together, the project will have to communicate a clearer vision of where it plans to go with its software at all levels; Mr. Negroponte's statement of continued support for Sugar appears to be an attempt to start this process. The operational side of the project needs to get its act together. Some transparency on, for example, what is being done with donation money and what agreements have been made with outside corporations, would be most helpful. And, most of all, the group of volunteers working with this project have to be convinced anew that they are not wasting their time. If the project's leadership can manage all of that, there may well be great things coming from OLPC in the future. The 2.6.26 merge window, part 2 Since last week's summary was written, another 3700 changesets have found their way into the mainline git repository. 
The most significant user-visible changes include:

New drivers have been merged for Wolfson WM9713 codecs, TI DAVINCI AC97 sound chips, Emagic Audiowerk 2 soundcards, x86 PC speakers (new driver which makes them look like sound cards), Asus AV100 (Xonar DX) sound cards, Micron MT9M001 and MT9V022 cameras, PXA27x Quick Capture cameras, Kworld ATSC 120 tuners, cx23417 MPEG encoders, Integrant ITD1000 tuners, Philips TDA10048HN-based demodulators, Philips SAA7171/3/4 audio/video decoders (the last out-of-tree IVTV driver), Auvitek AU8522 demodulators, Samsung S5H1411-based tuners, framebuffer, keyboard, and mouse virtual devices (for Xen), several Wolfson Microelectronics touchscreens, wireless Xbox 360 controllers, Zhen Hua PPM-4CH transmitters, SPCP8x5 USB to serial adaptors, NCR 53c9x SCSI controllers (replacement driver), Freescale 8610 and 5121 display interface units, Intel 965G/965GM integrated graphics controllers, TI OMAP sound controllers (including the one on the Nokia N810), Eee PC function keys, and Intel IXP4xx Ethernet devices.

There is now "basic" support for braille screen readers.

Support for the One Laptop Per Child XO architecture has been merged into the mainline.

The new virtual files found in /proc/pid/mountinfo provide information on all filesystem mounts visible to the relevant process.

The new virtual file /proc/vmallocinfo displays information on use of vmalloc space within the kernel.

The SPARC Niagara architecture now has NUMA support.

The Xen balloon driver (allowing memory to be added to or removed from virtual guests) has been merged.

By default, /dev/mem can no longer be used to access RAM; Fedora and Red Hat have applied this patch for years, but now it has found its way into the mainline.

The KVM paravirtualization subsystem now supports the S/390, PowerPC 440, and ia64 architectures.

Per-process "securebits" are supported. These bits control how a process's capability bits are managed; the patch is intended to help those who would transition over to a fully capability-based system. See this article for a more detailed description of this feature.

The getrusage() system call has a new RUSAGE_THREAD option which causes it to return information about the current thread only.

The device whitelist control group patch (described briefly in this article) has been merged.

It is now possible to create and use partitions on network block devices (NBD).

The audit subsystem can now test events against the type of the file being operated upon.

The VFS now makes backing device information available under /sys/class/bdi. Interested people can look at per-device readahead and writeback variables there.

The FUSE filesystem now supports the creation of shared writable memory mappings.

Changes visible to kernel developers include:

ioremap() on the x86 architecture will now always return an uncached mapping. Previously, it had taken a more relaxed approach, leaving the caching as the BIOS had set it up. The practical result was to almost always create uncached mappings, but with occasional exceptions. Drivers which depend on a cached mapping will now break; they will need to use ioremap_cache() instead.

The Video4Linux2 API now defines a set of controls for camera devices; they allow user space to work with parameters like exposure type, tilt and pan, focus, and more.

On the x86 architecture, there is a new configuration parameter which allows gcc to make its own decisions about the inlining of functions, even when functions are declared inline. In some cases, this option can reduce the size of the kernel's text segment by over 2%.

The legacy IDE layer has gone through a lot of internal changes which will break any remaining IDE drivers.

The nopage() virtual memory area operation has been removed; all in-tree code is now using fault() instead.

The SLUB allocator supports a new sysfs file (/sys/kernel/slab/name/order) which allows system administrators to change the size of page allocations used by the named slab.

A condition which triggers a warning from WARN_ON will now also taint the kernel.

The get_info() interface for /proc files has been removed. There is also a new function for creating /proc files:

    struct proc_dir_entry *proc_create_data(const char *name, mode_t mode,
                                            struct proc_dir_entry *parent,
                                            const struct file_operations *proc_fops,
                                            void *data);

This version adds the data pointer, ensuring that it will be set in the resulting proc_dir_entry structure before user space can try to access it.

The object debugging infrastructure has been merged.

The merge window remains open; tune in next week for (what should be) the final set of changes merged for 2.6.26.

Ksplice: kernel patches without reboots

The kernel developers are generally quite good about responding to security problems. Once a vulnerability in the kernel has been found, a patch comes out in short order; system administrators can then apply the patch (or get a patched kernel from their distributor), reboot the system, and get on with life knowing that the vulnerability has been fixed. It is a system which works pretty well.

One little problem remains, though: rebooting the system is a pain. At a minimum, it requires a few minutes of down time. In many situations, that down time cannot be tolerated. Reboots also disrupt any ongoing work, break existing network connections, and can cause the loss of results from long-running processes. And, most importantly of all, reboots prove traumatic for a certain subset of Linux administrators who prize a long uptime above almost all other things. Administrators currently have to choose between multi-year uptimes and security fixes; anything which frees them from a dilemma of this magnitude can only be welcome.

That "anything" might just be a recently-announced project called ksplice. With ksplice, system administrators can have the best of both worlds: security fixes without unsightly reboots.
An in-depth explanation of how ksplice works can be found in this document [PDF]. In short, ksplice requires as input the source tree for the running kernel and the security patch. It will then build two kernels, one with the patch and one without; the kernels are built with a special set of options which makes it easy to figure out which functions change as a result of the patch. The two kernels are then compared to find those functions. Changes can propagate further than one might expect, especially if, for example, an inline function is modified. Once a list of changed functions has been made, the updated code for those functions is packaged into a kernel module and loaded into the system.

Then comes the tricky part: getting the running kernel to start using the new code. That requires patching the running code, which is a risky thing to do. Ksplice starts with a call to stop_machine_run(), which dumps a high-priority thread onto each processor, thus taking control of all processors in the system. It then examines all threads in the system to ensure that none of them are running in the functions to be replaced. If no thread is found there, trampoline jumps are patched into the beginning of each replaced function (they "bounce" calls to the old code into the replacement code) and life continues. Otherwise ksplice will back off and try again later.

This method imposes a number of limitations. One is that only code changes can be patched in with ksplice; patches which make changes to data structures cannot be accommodated. Another comes from the retry-based approach to ensuring that no threads are running in the patched functions; what happens if one of those functions is never free? Kernel functions like schedule(), sys_poll(), or sys_waitid() are likely to always have processes running within them.
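The procedure described above can be modeled in a few lines. What follows is a toy sketch, not ksplice's code: builds are represented as dictionaries mapping function names to compiled bytes, each thread is a list of the function names on its stack, and every helper name (changed_functions(), safe_to_patch(), try_splice()) is invented for the illustration, as is the retry limit.

```python
def changed_functions(old_build, new_build):
    """Names whose compiled code differs between the two builds.

    A function present only in the new build also counts as changed.
    """
    return sorted(
        name for name, code in new_build.items() if old_build.get(name) != code
    )


def safe_to_patch(thread_stacks, targets):
    """True when no thread has any of the target functions on its stack."""
    targets = set(targets)
    return all(targets.isdisjoint(stack) for stack in thread_stacks)


def try_splice(snapshots, targets, max_attempts=3):
    """Attempt the splice once per snapshot, giving up after max_attempts.

    Each snapshot stands for the thread stacks seen while all processors
    are held. On success the result maps every old function to the name of
    its replacement (the "trampoline"); on failure it is None.
    """
    for _, stacks in zip(range(max_attempts), snapshots):
        if safe_to_patch(stacks, targets):
            return {name: name + "__patched" for name in targets}
        # Some thread is inside a target function: back off and retry.
    return None
```

A function that is briefly busy gets patched on a later attempt. A function that shows up on some stack in every snapshot, as schedule() would, exhausts the retries and try_splice() returns None.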
In cases like this, ksplice will eventually give up and inform the user that the patch cannot be done; it is simply not possible to make changes to those particular functions.

These limitations mean that, out of 50 security patches examined by the ksplice developers, eight could not be applied with ksplice. So multi-year uptimes are probably still incompatible with the application of all security patches. Even so, ksplice certainly has the potential to reduce patch-related downtime considerably. Chances are good that there will be a fair amount of interest in ksplice at sites running high-uptime, mission-critical systems.

There are a few things in the way of an immediate merge of this code into the mainline. One is a matter of coding quality and can be fixed. Then, there is the matter of the lead developer being unconvinced that merging this code makes sense since it is, essentially, a standalone feature. Andi Kleen's response made the (usual) reasons for merging the code clear:

To be honest you weren't the first to come up with something like this (although you're the first to post to l-k as far as I know). But the usual problem of something that is kept out of tree is that it eventually bitrots and gets forgotten. The only sane way to make such extensions a generically usable linux feature is to merge them to mainline.

So, presumably, the code will eventually be proposed for a mainline merge. But there is one other little difficulty pointed out by Tomasz Chmielewski: Microsoft holds a patent described this way:

A system and method for automatically updating software components on a running computer system without requiring any interruption of service. A software module is hotpatched by loading a patch into memory and modifying an instruction in the original module to jump to the patch.

Microsoft came up with this novel new technique in the distant past: 2002.
The posting immediately brought out a crowd of surprised graybeards who distinctly remember using such techniques on their PDP-11 systems some decades before Microsoft "invented" hot-patching. The basic claim of the patent would thus appear to be invalidated by some decades' worth of prior art, but some of the dependent claims include features (such as capturing all other processors on the system) which were unlikely to be useful on PDP-11s. Given that the kernel developers are now well aware of this patent, they must take it into account when deciding whether to accept this code into the mainline. It would not be surprising if they chose to avoid baiting the Microsoft FUD machine in this way, even if they all agreed that the patent lacked validity. So a promising technology risks being left out of the kernel as the result of a software patent which was filed at least 30 years too late. On the conviction of Hans Reiser On April 28, a California jury found Hans Reiser guilty of first-degree murder. There has been a lot of speculation in the press, both before and after the conviction, on what the loss of Mr. Reiser will mean for the Linux community. Much of that speculation, it seems, lacks an understanding of what Mr. Reiser's role in the community really was. Your editor will take no position on whether his conviction was correct or just. But there are things to be said about what this conviction will mean. Hans Reiser was, of course, the designer (and, to an extent, implementer) of the reiserfs filesystem. When it was merged, reiserfs had the distinction of being the first journaling filesystem for Linux which was intended for general use; it also offered good performance in some situations, especially those involving lots of small files. Reiserfs saw a significant amount of use and was adopted by a handful of distributors. There are, doubtless, quite a few reiserfs deployments still operating out there. Mr. 
Reiser's role in reiserfs development and maintenance ended some years ago, though. He stopped work on it when reiser4 development started, and even opposed the incorporation of improvements done by others. Reiserfs continues to be maintained independently of its creator, though there is not much interest in adding features to it at this point. Reiserfs is nearing the end of its run, and nothing which happened this week has changed that situation in any way. There is more concern about what will happen with Reiser4, Mr. Reiser's next generation filesystem. Many reports have suggested that current events spell the end for this project, but it is worth taking a look at the longer history. Reiser4 is not exactly new; it was first posted in 2002. Mr. Reiser made an unsuccessful effort to get it merged for the 2.6.0 kernel, and frequently thereafter. He blamed commercial interests and politics for his failure in this regard, but the real situation is more straightforward than that. Reiser4 tried to do a number of things very differently from other filesystems. It included some very non-POSIX semantics which raised red flags within the development community. There was a multipurpose reiser4() system call which implemented a wide range of features and included an in-kernel interpreter for a special language. There was a low-level plugin mechanism which raised concerns (not all justified) about varying on-disk formats and proprietary formats. Reiser4 did many things at the filesystem level that others thought should be done at the virtual filesystem level instead. The "files as directories" feature, beyond striking people as strange, opened up a wide range of trivial deadlock scenarios. In summary, this code was nowhere near ready for inclusion into the mainline kernel. Kernel development projects which are done in isolation often encounter this kind of surprise when they are finally posted to the development community. Over the next few years work on reiser4 continued. 
Many of the problems were solved by simply removing most of the features which made reiser4 unique, turning it into just another filesystem. Once you have just another filesystem, attention will turn to performance; in this case, many people found that they got benchmark results which differed from those posted by Mr. Reiser. Community interest in this filesystem fell over time, and the development rate fell as well. There was still work happening to prepare reiser4 for the mainline kernel when Mr. Reiser was arrested, but it was moving slowly. Perhaps the biggest obstacle to the inclusion of reiser4, though, was the confrontational approach taken toward the rest of the community. When developers pointed out problems with reiser4, Mr. Reiser had a tendency to question their motives rather than pay attention to what they were saying. His interactions with the community were characterized by statements like: What makes you think kernel developers have a deep understanding of the value of connectivity in the OS? They don't. The average kernel developer is not particularly bright. A number of developers reached a point where they simply chose not to engage with him any more. By rejecting the development community, Mr. Reiser remained forever an outsider to it. And that is why the practical effect of Mr. Reiser's conviction on the community will be relatively small, at least in the short term. As brilliant as he is, his effectiveness was limited by his disregard for the rest of the community and his certainty of always being right. He could have accomplished much more with a different approach. That said, his loss is unfortunate. He did prove able, over a number of years, to raise funds for Linux filesystem work, and the community benefited from that work. Some of the reiser4 developers are still interested in working on that code, and they still submit patches. But now nobody is paying them to do that work, which puts the whole enterprise in danger. 
There are limits to how long reiser4 development can be carried forward as a labor of love. The biggest loss, though, is elsewhere. More than anybody else, Mr. Reiser put a lot of thought into what our systems should look like in the future. He saw capable filesystems as the way to make our systems far more powerful than they are now. In a world where the filesystem was the only namespace of any significance on the system, all objects would be equal and the number of potential connections between them would explode. His long-term goal was not (just) better benchmarks; it was to create a filesystem which could serve as this all-encompassing namespace. It was a radical idea, and, perhaps, impractical. But our future comes from ideas like that. After a few relatively quiet years, there is now a flurry of activity around Linux filesystems. The challenges in this area are large, but we have many highly capable developers working on the problem and there can be no real doubt that Linux filesystems will continue to be among the best available anywhere. But that development community has lost a voice which, for all its faults, had some unique and innovative things to say, and we are all poorer for it. Restricting root with per-process securebits Linux capabilities have had a long and somewhat tortuous journey as part of the Linux kernel. Slowly—and very carefully—functionality is being added to this security feature to get it to a point where it is a viable alternative to the all-or-nothing setuid(0) model. A recently merged patch adds a per-process securebits feature that will allow capabilities-based daemons or subsystems to coexist with existing setuid utilities. Linux capabilities break up the privileged tasks normally associated with root (i.e. uid 0) into finer-grained abilities which can be individually granted or revoked for specific processes. The idea is to change the standard Unix model that root has all special privileges while all other users have none. 
The terminology is always a bit contentious, though, as Linux capabilities are derived from a POSIX proposal that was never adopted, but shares the name "capabilities" with an entirely different approach; this article is only concerned with capabilities of the Linux variety. There has long been interest in creating a Linux system that did not rely upon a single root account. Capabilities are seen as the way to get there, but they have suffered from a bit of a chicken-and-egg problem. With the recent work to add file-based capabilities and restore CAP_SETPCAP to its original meaning, a true capabilities-based system is becoming possible. In the patch, which has been merged for 2.6.26, Andrew Morgan describes the new functionality: The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel. The patch removes the global securebits variable, replacing it with an entry in struct task_struct, that can be manipulated by a process, but only for itself—and any children. Morgan envisions hybrid systems that have some utilities using capabilities to get their privileges along with some setuid(0) utilities. In that scenario, a capabilities-based utility or daemon may wish to limit what its children can do, even if they execute a setuid(0) binary. As part of the evolution, process trees can be created that cannot get root privileges. Processes which have the CAP_SETPCAP capability can change their securebits setting via the prctl() system call. 
There are three separate bits that govern the interaction of capabilities and setuid:

SECURE_NOROOT – enabling this gives no special privileges to uid 0.

SECURE_NO_SETUID_FIXUP – setting this bit disables capability fixes when transitioning from or to uid 0 via setuid. This might be done for compatibility with older programs that use setuid to reduce their privileges.

SECURE_KEEP_CAPS – when set, a process can retain its capabilities even when transitioning to a normal (not uid 0) user. This bit is cleared by exec().

Each of these bits also has a companion *_LOCKED bit that, if set, will not allow any user program to alter the corresponding setting. As Morgan notes in the patch, a program that can set its capabilities (has CAP_SETPCAP) can drop all privileges for itself and any child process by doing:

    prctl(PR_SET_SECUREBITS, 0x2f);

This is the equivalent of setting SECURE_NOROOT, SECURE_NOROOT_LOCKED, SECURE_NO_SETUID_FIXUP, SECURE_NO_SETUID_FIXUP_LOCKED, and SECURE_KEEP_CAPS_LOCKED.

The memory of the sendmail-capabilities bug from 2000 makes some a bit queasy (or worse) about any patches that involve capabilities and setuid. Andrew Morton asks: "what was the bug which caused us to cripple capability inheritance back in the days of yore? (Some sendmail thing?)" That bug arose because unprivileged users could take away the CAP_SETUID capability from setuid binaries like sendmail. When sendmail then used setuid to drop its privileges, the call failed, but sendmail did not check the result, so it kept running with full privilege. This could be leveraged by a user to gain root privileges. It was a disconnect between capabilities and the longstanding behavior of Unix-like systems when dropping privileges. Morgan has written a detailed description of the sendmail-capabilities bug in response to Morton's questions.
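The way the securebits combine into a single value can be checked with a little arithmetic. This is a minimal sketch which assumes the bit positions used in the kernel's linux/securebits.h header (0 through 5, each *_LOCKED bit directly above the bit it guards); the helper names are invented for the illustration, and nothing here actually calls prctl().

```python
# Bit positions, assumed to match linux/securebits.h.
SECURE_NOROOT = 0
SECURE_NOROOT_LOCKED = 1
SECURE_NO_SETUID_FIXUP = 2
SECURE_NO_SETUID_FIXUP_LOCKED = 3
SECURE_KEEP_CAPS = 4
SECURE_KEEP_CAPS_LOCKED = 5

SECBIT_NAMES = {
    SECURE_NOROOT: "SECURE_NOROOT",
    SECURE_NOROOT_LOCKED: "SECURE_NOROOT_LOCKED",
    SECURE_NO_SETUID_FIXUP: "SECURE_NO_SETUID_FIXUP",
    SECURE_NO_SETUID_FIXUP_LOCKED: "SECURE_NO_SETUID_FIXUP_LOCKED",
    SECURE_KEEP_CAPS: "SECURE_KEEP_CAPS",
    SECURE_KEEP_CAPS_LOCKED: "SECURE_KEEP_CAPS_LOCKED",
}


def secbit_mask(*bits):
    """Combine bit positions into one securebits value."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


def lockdown_mask():
    """No root privilege, no setuid fixups, KEEP_CAPS left off; all locked."""
    return secbit_mask(
        SECURE_NOROOT,
        SECURE_NOROOT_LOCKED,
        SECURE_NO_SETUID_FIXUP,
        SECURE_NO_SETUID_FIXUP_LOCKED,
        SECURE_KEEP_CAPS_LOCKED,
    )


def decode(mask):
    """List the names of the securebits set in mask, lowest bit first."""
    return [name for bit, name in sorted(SECBIT_NAMES.items()) if mask & (1 << bit)]
```

Under those assumptions, lockdown_mask() comes out as 0x2f. Note that SECURE_KEEP_CAPS itself stays clear while its lock bit is set, so a child process cannot turn it back on. The value only takes effect once a process holding CAP_SETPCAP hands it to prctl(); it is simply the arithmetic behind the lockdown that Morgan describes.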
He makes it clear that he wants to move toward full capability support without breaking existing code: I'm basically interested in evolving the capability implementation back to the POSIX.1e model and making it whole - but most certainly *without crippling legacy superuser support in the process* . As folk get more comfortable with this full capability model. I believe we can delete more cruft from the main kernel, but even that clean up will leave a fully functional legacy model in place. I feel it should be for something like init, or one of its children to be able to run subsystems in capability-only or legacy modes. Morton seemed satisfied that his concerns had been addressed, but still wonders about the future for capabilities: "So how do we ever get to the stage where we can recommend that distributors turn these things on, and have them agree with us?" This was echoed by Ismail Dönmez, who was looking for concrete examples of how to use the per-process securebits feature. Morgan provides a pointer to some examples along with his belief that sometime soon the capabilities developers will become confident enough to recommend turning off the "experimental" flag for the SECURITY_FILE_CAPABILITIES kernel configuration. That flag governs both the file-based capabilities as well as the per-process securebits. In addition, Morgan says: More importantly I'm hopeful that in that time we'll have accumulated enough documentation and user-space experience and examples to convince others that this is, indeed, a viable feature to support in mainstream distributions. A developerWorks article on file-based capabilities by Serge Hallyn and a web page on POSIX capabilities by Chris Friedhoff were both mentioned in the thread as good references for the work being done to actually use capabilities in systems. Those pre-date the securebits work, so Dönmez was looking for use-cases for the new feature. 
Morgan replied that containers were one, deferring to Hallyn who has some ideas on using securebits:

We tend to talk about 'system containers' versus 'application containers'. A system container would be like a vserver or openvz instance, something which looks like a separate machine. I was going to say I don't imagine per-process securebits being useful there, but actually since a system container doesn't need to do any hardware setup it actually might be a much easier start for a full SECURE_NOROOT distro than a real machine. Heck, on a real machine init and a few legacy [daemons] could run in the init namespace, while users log in and apache etc run in a SECURE_NOROOT container. But I especially like the thought of for instance postfix running in a carefully crafted application container (with its own virtual network card and limited file tree and no visibility of other processes) with SECURE_NOROOT on.

Capabilities are an interesting, but complicated, security feature. For most of the ten years they have been part of the Linux kernel, they have either been broken, ignored, or both. With the latest work being done by Hallyn, Morgan, and others, capabilities are finally becoming a fully-working alternative to things like SELinux. It will be interesting to see if more user utilities will become capability-aware and whether distributions start using capabilities. Some day, root may just fade away.

Large educational Linux deployment for Brazil

Numbers like 52 million are attention grabbers, especially when they refer to students getting access to Linux. That's the number of Brazilian public school students who will have access to Linux-based educational computers in some 53,000 labs spread throughout the country. As reported on Mauricio Piacentini's weblog, the Brazilian government already has 17,000 of the labs up and running and plans to have all of them rolled out by the end of 2009. The project, called ProInfo, is run by Brazil's Ministry of Education (MEC).
Piacentini heard about it at the recent Fórum Internacional Software Livre (FISL) conference, which is held annually in Porto Alegre, Brazil. He noted that the project is not only providing computers and infrastructure, but also a "Linux Educacional" distribution with free educational and entertainment software along with other "open content". The distribution is Debian-based using KDE 3.5 as its desktop. Packages from the KDE Education Project (KDE-Edu) and KDE Games Center (KDEGames) were included. The project customized the interface, adding a quick navigation bar at the top (seen at left). This is the second version of the distribution; it incorporates feedback from installations of the previous version. The distribution ISOs, open content, and some documentation (all in Portuguese) can be found at the MEC ProInfo website.

ProInfo has devised several lab configurations, depending on where the school is located. Urban labs have equipment for up to fifteen students, whereas rural installations have power-friendly hardware that can support up to five users. There is also a configuration targeted at schools for people with special needs; it has a large display and accessibility tools added to the distribution. ProInfo also has a project that sounds much like OLPC, except in Portuguese: Um Computador por Aluno ("One computer per student"), which plans to bring 150,000 laptops (possibly Intel Classmate PCs) to students over the next year or so.

Some have quibbled about the number of students estimated, but even if it is overestimated by a factor of two or three (which seems unlikely), it is still an enormous project that will impact a huge number of students. Free software is perfect for these kinds of projects, because it can reduce the hardware requirements significantly, eliminate licensing nightmares, and provide a look "under the hood" for students who are interested.
Computer skills are largely portable if some of those students end up using other operating systems in the future, but because they are using free software now, any documents, pictures, music, and other data files will be able to move with them. Folks from the KDE project are justifiably proud of this deployment. It uses KDE 3.5, but plans are afoot to work with MEC to explore using KDE4 down the road according to KDE hackers Piacentini and Aaron Seigo. Many have been concerned about the future of KDE 3.5, but the project has always maintained that it will be around for a long time. As Seigo says: KDE 3.5 will be supported in the market for many years to come due to deployments such as this one. Looking towards the future, KDE4 will likely make some things even easier for them in the future, such as how to implement the navigation bar they added to the top of desktop as a result of usability research done involving this specific audience. With Plasma, a few lines of JavaScript is all that would be needed. Proponents of the other desktops or distributions should be cheering this deployment as well. There will probably be lots of lessons learned that can apply to other projects in Brazil or elsewhere that standardize on a different set of software components. This is an exciting project for the free software community. But even more importantly, it is great to see so many of these tools become available to those who have not yet been exposed to them. Distributions in the Summer of Code For the fourth year, Google's Summer of Code will pay undergraduate students to work with some of the world's top developers on open-source projects. Students and mentors also get a T-shirt, which for many of us is motivation enough. Many of the accepted projects are not surprising, such as GNOME, KDE, Drupal, and Python. One interesting category of projects, however, is distributions. Aren't they just writing packages? What would they do with a Summer of Code project? 
That's what this article aims to discover. This year, four distributions were accepted for a combined total of 40 slots: Debian, Fedora, Gentoo, and openSUSE. Conspicuous in their absence are other major distributions such as Mandriva and Ubuntu. One wonders what happened—did they apply (if not, how come?); were they rejected? Ubuntu participated in 2006 and 2007, so it is curious that the distribution is not in SoC this year. In addition to these four distributions, three of the BSDs participated as well, receiving a combined total of 35 slots: DragonFly BSD, FreeBSD, and NetBSD. Since these are operating systems in addition to their own package distributions, many of their slots are devoted to core OS code, while the Linux distributions' slots are not. Let's take a closer look at the types of distribution projects in this year's Summer of Code. Many of Debian's 12 projects relate to installation (two slots), configuration management (two slots), or package management/development (seven slots). The exception is a project to make an embedded, Debian-based NAS device. Another 12 slots went to Fedora, which shared two of its slots with JBoss. Fedora has a more eclectic mix: it devoted two slots to package management and two to configuration management, investing the remaining slots in features for a translation framework (three), creation of a new Web interface for the hardware profiler Smolt, enhancement of the booting profiler Bootchart to use SystemTap, and creation of a simple, non-linear video editor for ogg video to integrate with the screencasting tool recordmydesktop. Gentoo received six slots, of which two relate to package management. The other four are dedicated to diverse projects: implementing OpenPAM-compatible modules for Linux, improving a Web-based, WYSIWYG XML editor, making it easy to set up a Beowulf cluster, and improving Gentoo's embedded network-appliance framework. 
OpenSUSE got ten slots; five of these are going toward package management/development, and one is going toward installation. The remaining four are the most generally interesting: implementing a face-based authentication module, enabling ext4 as GRUB's boot partition, interactive crash analysis (presumably an improvement upon what recent GNOME versions do rather than a duplication), and creation of a GUI manager for LTSP thin clients. Now let's take a quick look at BSD land. Of DragonFly's projects, six out of seven are OS-related, and the other is installation-related. FreeBSD received 21 slots, of which many are devoted to the core OS—of the rest, four are related to package management/development, and one aims to improve Wine support. NetBSD received 14 slots, of which many again went to the core OS. Other than that, one slot went to installation and another to package management. Distributions and "mixed" distributions/OSs unsurprisingly devote a large quantity of their efforts to their core competencies of package management, configuration management, and installation. At least in the Summer of Code, however, they do devote a significant amount of effort to solving larger problems that affect people outside the distribution. Sun and corporate open source Over the last couple of weeks there has been an interesting set of articles posted on various weblogs on how Sun is managing its open source projects. As more companies try to get involved with free software, they may find things to learn from this discussion. So here are a few thoughts on corporate open source. It all started with a posting by Ted Ts'o which stated: So if you run into a Sun salescritter or a Sun CEO claiming that OpenSolaris is just like Linux, it's not. Fundamentally, Open Solaris has been released under a Open Source license, but it is not an Open Source development community. 
Maybe it will be someday, as some Sun executives have claimed, but it's definitely not a priority by Sun; if it was, it would have been done before now. The posting drew responses from Dave Neary and Alvaro Lopez Ortega, among others; both the original messages and the responses to it are worth reading in their entirety. In summary, the responses say that (1) Sun really is trying to be a good open source player, and (2) Sun has done as well as could be expected, given that the creation of true open source communities is hard. The first part can only be true. Sun has been the source of a great deal of free software, including packages like OpenOffice.org which are found in almost every Linux distribution. This company has released its core operating system as open source, and it is making noises about, finally, making Java truly open at all levels. There are few companies which have contributed code at this level, and that should be recognized. Beyond any doubt, Sun is contributing to this community. What people question, though, is Sun's interest in creating real communities around its open source projects. These projects are notoriously hard to participate in and contribute to. As Ted points out, OpenSolaris currently gets less than one patch per day from outside the company, the project's governing board is made up entirely of Sun employees, and its (non-distributed) revision control system lives inside the Sun firewall. External OpenSolaris developers have been known to quit with messages like: Sun agreed that "OpenSolaris" would be governed by the community and yet has refused, in every step along the way, to cede any real control over the software produced or the way it is produced, and continues to make private decisions every day that are later promoted as decisions for this thing we call OpenSolaris. 
Rather than be honest about it and restructure the community to correspond to this MySolaris style of over-the-wall development, Sun prefers to lie to the external community members while ignoring their input. OpenOffice.org, too, remains hard to work with; thus the many discouraged comments on the ooo-build wiki from developers who want to get things done: Many ooo-build patches are ready for up-streaming but there is no / little response from up-stream. Worse there is the perception that taking leadership and actually doing something about merging fixes would be firmly opposed. Finally - even when maintainers are active, responsive & friendly - there is no agreed mechanism for blanket approving fixes - or sub-types of trivial fixes, which thus tend to fester in IssueZilla. The key to what is going on here can be found in many places, including in Alvaro's posting: Besides, the OpenSolaris development model is quite different because of a number of technical reasons. IMO, the first one is something as simple as that we want to ensure its quality by following a number of processes. Another very important technical point is that we want OpenSolaris to continue being binary compatible (ABI) with the previous Solaris revisions, which is something Linux could not even dream of. The real issue is control; Sun does not want to relinquish control over how its projects evolve. This is not a particularly uncommon situation with corporate-controlled projects; these projects will always be subject to the controlling company's agenda. Thus, no developer is likely to be successful in projects like: Adding features to MySQL which provide the functionality which is otherwise being reserved for the "enterprise" offerings. Adding packages to Fedora which make Red Hat's legal department nervous. 
Adding features to projects owned by the Free Software Foundation which, in the FSF's opinion, are not consistent with its goals; support for loading Emacs modules from an external repository is one example. Making any changes to Firefox which could threaten Mozilla Corporation's revenue stream from Google. Companies which control open source projects in this way are generally acting within their rights; they may even be acting in their own best interests. The software is still open source. But the retention of this sort of control will have an effect on the community which builds around the software. In many cases, it can have the effect of preventing the creation of that community in the first place. And that, too, may be what the company had in mind. There are a number of company-controlled open source projects which, by all appearances, are mostly for show and bragging rights. The company does not really seem to have much interest in developing a significant external community. In cases like this, if the software on offer is valuable enough, the result will often be a more community-oriented fork. Projects like ADempiere, LedgerSMB, and Cinelerra CV result from this kind of frustration. Opinions clearly differ on whether Sun is truly uninterested in the creation of outside development communities for its projects, or whether it simply is having a hard time letting go. If the latter is the case, then Sun might be well advised to follow Dave Neary's suggestion and create a separate, non-profit foundation for the development of OpenOffice.org. Sun's apologists are right when they say that turning a large blob of proprietary code into free software is a hard thing to do. But it's harder if you don't give the community the power to help; in the case of OpenOffice.org, there would appear to be enough of an interested community to make a real go at it. This might be Sun's best chance to show that it can create real development communities around its software. 
Stream video and audio with Boxtream Boxtream is a GPL-licensed streaming video and audio system that is being developed by Jerome Alet and a team of developers at the University of Nice in France: Boxtream is a mobile and autonomous audio and video streaming and recording studio. Of course, depending on your own hardware choices, the number and extent of capabilities and the quality of the final results may vary, but at least the software part should be versatile enough to accommodate even the most basic hardware. Boxtream was mostly designed to stream live courses featuring a professor and his slides (or any other computer based output like software training, web browser, video player...), but can also be used to stream congresses, interviews and the like. Boxtream uses a veritable smorgasbord of open-source components to achieve its results. Scripting is done in Python, and metadata is stored in XML. The GStreamer multimedia framework library is used for handling the audio/video data and the Icecast streaming media server is used for media distribution. Video and audio are encoded with Ogg Theora and Ogg Vorbis. The Graphviz graph visualization software is used for presenting a graphical view of the video system's scenario. A few notable Boxtream features include a graphical interface, support for on-disk recording, selectable audio and video rates, support for image overlays and automation for all tasks. The Boxtream features list is more complete. Boxtream supports a number of video switching devices as well as other video and audio equipment. The hardware list has more information. This architecture diagram gives a pictorial view of a fairly complicated Boxtream system. An online example shows the system being used for a scientific conference. Boxtream version 0.998 was announced on April 27, 2008. Changes include support for more video hardware, inclusion of the dia schema software, bug fixes and a license change from GPLv2 to GPLv3. 
If your organization is in need of a full-featured video conferencing system, you should give Boxtream a look. The Tahoe secure filesystem The Tahoe filesystem is designed as a secure, distributed filesystem that is available as free software. Tahoe is also designed for fault tolerance so that data remains available even in the presence of missing or malicious peers. In March, the project released a 1.0 version which makes this a good time to take a peek. The basics of Tahoe are somewhat similar to GNUnet or Freenet in that the data is encrypted and spread around to multiple nodes in the network. Unlike those, though, Tahoe does not seek to provide anonymity. The nodes making up a Tahoe filesystem are called a "grid". Grids consist of some number of peers acting as storage server nodes along with an "introducer" that knows all of the other nodes and is the central point of contact for the grid. Files are stored in Tahoe by first being encrypted on the local machine using AES. They are then broken into "shares", ten by default, that are distributed to different servers in the grid. Before that happens, though, the encrypted file is encoded in such a way that the whole file can be recovered even if only a subset of the shares can be retrieved. This encoding, known as "erasure coding", is the key to the fault-tolerance of the Tahoe system. By default, Tahoe encodes the shares such that retrieving three of the ten is sufficient to recover the entire file. It also increases the size of the file by the expected 10/3 ratio. The suggested use case for Tahoe is a "friendnet" where some group of friends share their storage with each other in a way that reduces or eliminates the need for backups. Tahoe also has ways to share data in either read-only or read-write (immutable or mutable in Tahoe-speak) modes. Tahoe is used as a commercial backup system by Allmydata, sponsor of the Tahoe project. 
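The default encoding described above, where any three of ten shares are enough to recover the file, can be checked with a little arithmetic. The sketch below is purely illustrative: the function name is made up for this example, and it assumes that each share holds roughly one third of the encrypted file while ignoring hashes and any other per-share metadata. It is not taken from the Tahoe code.

```python
import math


def erasure_coding_cost(file_size, needed=3, total=10):
    """Estimate storage use for a file split into `total` shares,
    any `needed` of which suffice to rebuild it.

    Assumption: each share carries about 1/needed of the encrypted
    file; per-share overhead (hashes, headers) is ignored.
    """
    share_size = math.ceil(file_size / needed)
    stored = share_size * total
    return {
        "share_size": share_size,            # bytes held by each server
        "stored_bytes": stored,              # bytes held by the whole grid
        "expansion": stored / file_size,     # close to total / needed
        "tolerated_losses": total - needed,  # shares that may go missing
    }


cost = erasure_coding_cost(3_000_000)
print(cost)
```

Under these assumptions, a file of three million bytes occupies about ten million bytes across the grid (the 10/3 expansion mentioned above), and up to seven of the ten shares can be lost before the file becomes unrecoverable.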
Tahoe is designed to be secure, which means that it protects the integrity and confidentiality of the data stored in it. SHA-256 is used extensively to ensure consistency of the plaintext, ciphertext, and shares. Files stored in the system are identified by long identifiers called capabilities, that look something like: For mutable files, there are two versions of the capability, one that allows only reading, while the other allows writing as well. Anyone who does not have a capability string for a particular file cannot access it at all. Multiple user interfaces are available for Tahoe, including a web interface, a command-line interface, a FUSE extension and a web API. Tahoe is written in Python, using some C extensions for efficiency. It uses the Twisted framework for event handling, pycryptopp (a Python interface to the Crypto++ library) for its encryption needs, and zfec for the erasure coding. All of the Tahoe code is available under the GPL. Installing Tahoe was fairly straightforward—there were a few hiccups which have since been resolved—using the installation guide. Joining the test grid was as easy as putting an introducer identifier into a file and starting Tahoe from the command line. In some basic testing, it seems to work quite well, overall, though it did not seem to use available bandwidth as efficiently as it might. This brief overview only scratches the surface of the information available about Tahoe; there is much more on the documentation page. For anyone interested in distributed, secure, and/or fault-tolerant filesystems, Tahoe is definitely worth a look. The last things through the 2.6.26 merge window About 500 changesets were merged after the publication of the first and second 2.6.26 merge window summaries. 
The merge window is now closed; here is the final set of changes which got in: New drivers for Solarflare Communications Solarstorm SFC4000 controller-based Ethernet controllers, Hauppauge HVR-1600 TV tuner cards, ISP 1760 USB host controllers, Cypress c67x00 OTG controllers, and Intel PXA 27x USB controllers. 8KB stacks are, once again, the default for the x86 architecture. "Out-of-memory situations are less problematic than silent and hard to debug stack corruption." The klist type now has the usual-form macros for declaration and initialization: DEFINE_KLIST() and KLIST_INIT(). Two new functions (klist_add_after() and klist_add_before()) can be used to add entries to a klist in a specific position. As had been planned, struct class_device has been removed from the driver core, along with all of the associated infrastructure. Classes are now implemented with an ordinary struct device. kmap_atomic_to_page() is no longer exported to modules. There are some new generic functions for performing 64-bit integer division in the kernel: Unlike do_div(), these functions are explicit about whether signed or unsigned math is being done. The x86-specific div_long_long_rem() has been removed in favor of these new functions. There is a new string function: It compares the two strings while ignoring an optional trailing newline. The prototype for i2c probe() methods has changed: The new id argument supports i2c device name aliasing. There is a new configuration option (MODULE_FORCE_LOAD) which controls whether the loading of modules can be forced if the kernel thinks something is not right; it defaults to "no." How not to sell embedded Linux Every now and then one should have a look at some unabashed fear, uncertainty, and doubt (FUD) material. It's good to know what the other side is saying, the level of unintended humor is often high, and, on occasion, one even learns something. Your editor's suggestion for FUD of the week is this Embedded.com article by Dan O'Dowd. 
Therein, one will learn about the impending death of embedded Linux as told by the companies which sell embedded Linux. In particular, Mr. O'Dowd looks at some marketing material from MontaVista and Wind River, and concludes: This embedded Linux bashing from embedded Linux's strongest proponents should give pause to those who are thinking through their embedded operating system strategy. If embedded Linux champions are saying that embedded Linux is terrible, why would anyone want to risk their products or their company on it? One can easily pick holes in this article, starting with the assertion that MontaVista and Wind River are "Linux's strongest proponents." One could also recall that we have heard this kind of thing before; in 2004, Mr. O'Dowd (who happens to be the founder and CEO of a proprietary embedded systems software vendor) helpfully warned us that "intelligence agencies and terrorists" would contribute "subversive software" to Linux and lectured on the need for secret source code to achieve true security. One could point out that many of the points put forward by Mr. O'Dowd appear to be pure fantasy. All of these rebuttals would be valid, but they risk missing an important point to be gained from this article - though it's not quite the point Mr. O'Dowd is trying to make. Mr. O'Dowd obtains his "facts" from two sources: an advertisement by Wind River Systems (which your editor was unable to find online) and, primarily, from a column by MontaVista founder Jim Ready in Military Embedded Systems magazine. Mr. 
Ready's evident purpose is to frighten embedded systems vendors into buying his company's services; to that end, he lays it on pretty thick: To keep abreast of the changes occurring on a daily basis, a developer needs to monitor the email traffic of 11 different and unsynchronized open source projects: kernel.org, the core home of the Linux kernel; the gcc and glibc projects (the core tool chain and libraries from FSF at fsf.org); and at least nine other components that would typically comprise a useable Linux development environment. Kernel.org itself may have up to 5,000 messages a day with 1,000 of these being patches that need to be evaluated and possibly applied to the source base. Simply ignoring the traffic, figuring that the system in use seems to be working well enough, can lead to disastrous consequences later. For example, a recent security patch that took all of 13 lines of code to implement against an embedded Linux system would have taken more than 800k lines of source patches to implement if the previous trail of patches had been ignored. It's a classic case of pay now or really pay later. Somebody must have had a great deal of fun putting all of those numbers together. The generation of ordinary random numbers can be managed through traditional methods like a toss of the dice, picking numbers out of a hat, or reading corporate earnings estimates. Randomness on this scale, though, can only be achieved through the use of special-purpose software. Even by kernel.org standards, 5,000 messages per day is fairly intense, though your editor, a subscriber to the linux-kernel, git-commits-head, and mm-commits lists, can attest that the order of magnitude is right at least. But your editor cannot even begin to grasp the thought process which turns a 13-line security patch into 800,000 lines of code. Imagine posting that to linux-kernel. "Pay now or really pay later" indeed. But the provenance of the numbers is not really the point here. Mr. 
Ready is perpetrating the fallacy that, to build an embedded system with Linux, one starts with the various components and integrates them all by hand. If a company were to take that path, it might well incur the high costs that Mr. Ready warns about. Creating your own distribution - and maintaining it over a product's life - is, indeed, a difficult and expensive job. But it is a rare vendor which does that; even Gentoo users outsource much of the integration work to their distributor. Why would any vendor create its own distribution when there are so many out there to base a product on? Customizing a distribution for an embedded application is not a trivial job, but it's not rocket science either. The distributor will keep up with most of those mailing lists, and, somehow, a reasonable distribution also manages to ship security updates which do not involve 800,000 lines of code. There is no reason for embedded systems vendors to wander into the expensive mess that Mr. Ready describes; the creation of a suitable distribution is much easier than that. Even so, many vendors may decide that, in fact, they would rather not be in the business of customizing distributions. They might, instead, look to a vendor to do that work for them. It makes perfect sense for companies like MontaVista and Wind River (among others) to offer to provide a stable, integrated, and supported platform to embedded systems vendors for a fee. There is honest value in this line of business. But one does have to wonder why these companies feel the need to scare companies into buying their services. Those services, properly rendered, have a real value which can be sold without resort to outright FUD. Failure to focus on that value gives encouragement to people like Mr. O'Dowd, who would be most pleased if embedded Linux were to go away altogether. This does not seem like a sensible business strategy. 
Companies which seek to make money from Linux might just want to think twice before poisoning the well they are trying to drink from. That is the real lesson to be learned from this particular piece of writing. Blizzard tests the reach of copyright law Free software users rarely, if ever, need to be concerned about the license that governs the applications they use. Unlike developers or distributors, users are unlikely to pay attention to whether a program is released under a BSD, GPL, or some other license—not so with proprietary software. If Blizzard Entertainment has its way, it could get a whole lot worse, with proprietary vendors controlling the behavior of its users and enforcing it by way of the Copyright Act. Blizzard, makers of the online role-playing game World of Warcraft (WoW), has filed a lawsuit against MDY, Inc., makers of a tool that assists players in gaining levels within the game. The Glider program essentially plays the game for a user, creating a more powerful character, with additional riches, while the user is otherwise occupied. Some would claim it is a legitimate way to avoid some of the drudgery of "leveling up" a new character, while others would see it as a means of cheating. In any case it is clearly a violation of the Terms of Use (TOU) of WoW. But those terms are only accepted by a user when they agree to the End User License Agreement (EULA) that comes with the game. Blizzard would seem to have plenty of ammunition to take action against players that use Glider, but instead of suing its customers for breach of contract—perhaps they have learned something by watching the music industry—they went after the easier target. Had they only sued MDY for "tortious interference with contracts", it probably would have attracted little attention. But Blizzard did something that aroused the interest of the Electronic Frontier Foundation (EFF), Public Knowledge, and others by trying to stretch copyright law to cover MDY's actions. 
Certainly Blizzard is no stranger to using copyright law—in particular the much-despised Digital Millennium Copyright Act (DMCA)—in ways that many have found objectionable. The courts, at least in the Blizzard v. BNETD case, have agreed with Blizzard, though, shutting down the development of an alternative server for players of their games. Because of that, any time Blizzard makes a copyright claim, serious scrutiny from various watchdogs can be expected. Blizzard's claim is that, by running Glider, its users are not only in violation of the contract they agreed to, but they are also committing copyright infringement. As has been seen in various file-sharing lawsuits, whenever copyright is supposedly violated on a computer, any program even tangentially involved in that violation is then accused of "contributory infringement"; this is the second claim that Blizzard makes against MDY in its suit. Under Blizzard's interpretation, users are allowed to copy the program into the RAM of their computer as long as they do not violate the TOU. If they do violate them, their license to copy to RAM—a necessary step to be able to use the program at all—is terminated; they are infringing Blizzard's copyright and liable for damages starting at $750 per illegal RAM copy. If Blizzard's interpretation is upheld by the courts, many other acts would also serve as copyright infringements: choosing a character name that violates any of the thirteen name restrictions spelled out in the TOU, transmitting or posting "any content or language which, in the sole and absolute discretion of Blizzard, is deemed to be offensive...", or "anything that Blizzard considers contrary to the 'essence' of the Program", for example. Under those conditions, Blizzard could essentially claim copyright infringement any time they wish; racking up another $750+ each time the program is used. Public Knowledge outlined two good reasons that the copyright infringement claim should be discarded. 
It is well established that it is not an infringement if making a copy is required to use the copyrighted material, as it is for software. Blizzard's argument that due to the terms of the EULA, those who buy WoW are not "owners" but instead license the software is also weak. The courts have always looked on software purchases as sales, not rentals under some company-controlled license, in much the same way that music and movies are purchased. Copyright owners would love to be able to eliminate the "first sale doctrine" that allows owners to sell used books and other copyrighted content, but the courts have so far been unwilling to go along. One would hope that the courts would be persuaded not to see this dispute in terms of copyright either, but there is the risk that a tool used for "cheating" might not get the benefit of a well-reasoned view. There have been many occasions where the US courts have made surprising decisions regarding copyright. Undoubtedly there are various copycat suits waiting in the wings should such a decision be reached. In the end, though, neither Blizzard nor any copycats really want to go after the actual "infringers"—also known as customers—they want to go after others who allow users to use (or abuse) their software in ways they do not like. It is a classic proprietary software control strategy, and, thankfully, something that free software users do not have to endure. There is an interesting comparison to be made with free software licensing, though. Licenses like the GNU GPL also restrict behavior based on copyright law; GPLv3, for example, makes some specific requirements on the patent-licensing agreements that one can make with third parties. Like Blizzard, those who release software under a free license can make a claim of copyright infringement (not breach of contract) if the terms of that license are not adhered to. 
There is a crucial difference, though: free software licenses do not regulate the use of the software, only its distribution. By claiming that users of the software violate copyright if it does not like their behavior, Blizzard is attempting to extend the reach of copyright law far beyond anything seen in the free software community. It is certainly understandable that Blizzard would prefer that its users did not employ Glider or other, similar software. They believe it unbalances the game; making it unfair to other players. In the past, they have temporarily or permanently banned players for using bot software, but Glider is evidently more difficult to detect, which led to the current lawsuit. Blizzard must police its own game, however, and should not expect others to do it for them. It is hard to see that Glider is doing anything particularly wrong here, though Blizzard may prevail on either or both of its claims. If players want to find ways around things they don't like about the game, they will, unless Blizzard finds technological means to prevent it. It would appear that there is a substantial business opportunity in helping players avoid some of the boring, repetitive parts of playing the game—one that Blizzard currently ignores. Though there is no direct threat to free software from this litigation (unless one is developing free game-playing robots), any potential expansion of copyright is worth watching. The community relies upon copyright law to enforce its licenses, so watching how judges make decisions about such issues is important. While it may be that Blizzard is in the right to go after "cheaters" and a company that helps them, it should not be doing that by trying to expand the reach of its copyrights to this extreme. Time to slow down? All communities develop rituals over time. One of the enduring linux-kernel rituals is the regular heated discussion on development processes and kernel quality. 
To an outside observer, these events can give the impression that the whole enterprise is about to come crashing down. But the reality is a lot like the New Year celebrations your editor was privileged enough to see in Beijing: vast amounts of smoke and noise, but everybody gets back to work as usual the next day. Beyond that, though, discussions of this nature have real value. Any group which is concerned about issues like quality must, on occasion, take a step back and evaluate the situation. Even if there are no immediate outcomes, the ideas raised often reverberate over the following months, sometimes leading to real improvements. The immediate inspiration for this round of discussion was broken systems resulting from the 2.6.26 merge window. This development cycle has had a rougher start than some, with more than the usual number of patches causing boot failures and other sorts of inconvenient behavior. That led to some back-and-forth between developers on how patches should be handled. Broken patches are unfortunate, but one thing is worth noting here: these problems were caught and fixed even before the 2.6.26-rc1 kernel release was made. The problems which set off this round of discussion are not bugs which will affect Linux users. But, beyond any doubt, there will be other bugs which are slower to surface and slower to be fixed. The number of these bugs has led to a number of calls to slow down the development process in one way or another. To that end, it is worth noting that the process has slowed down somewhat, with the 2.6.26 merge window bringing in far fewer changesets than were seen for 2.6.24 or 2.6.25. Whether this slower pace will continue into future development cycles, or whether it's simply a lull after two exceptionally busy cycles remains to be seen. But, if the process does not slow down on its own, there are developers who would like to find a way to force it to happen. 
Some have argued for simply throttling the process by, for example, limiting new features in each development cycle to specific subsystems of the kernel. There has also been talk of picking the subsystems with the worst regression counts and excluding new features from those subsystems until things improve. The fact of the matter, though, is that throttling is unlikely to help the situation. Slowing down merging does not keep developers from developing, it just keeps their code out of the tree. An extreme example can be found in the 2.4 kernel: the merging of new code was heavily throttled for a long time. What happened was that the distributors started merging new developments themselves because their users were demanding them. So a lot of kernels which went under the name "2.4" were far removed from anything which could be downloaded from kernel.org. That way lies fragmentation - and almost certainly lower quality as well. Linus actually takes this argument further by arguing that quickly merging patches leads to better quality: [M]y personal belief is that the best way to raise quality of code is to distribute it. Yes, as patches for discussion, but even more so as a part of a cohesive whole - as _merged_ patches! The thing is, the quality of individual patches isn't what matters! What matters is the quality of the end result. And people are going to be a lot more involved in looking at, testing, and working with code that is merged, rather than code that isn't. Andrew Morton has also argued against throttling: If we simply throttled things, people would spend more time watching the shopping channel while merging smaller amounts of the same old crap. Kernel developers are, of course, known to be hard-core shoppers, so giving them more opportunity to pursue that activity is probably not the best idea. 
Seriously, though: Andrew is in favor of a slower development process, but only when approached from a different angle: his point is that an increased focus on quality will, as a side effect, result in slower development. Kernel developers need to be focused on finding and fixing bugs rather than creating new ones and/or shopping. It is worth noting that a substantial portion of the development community appears to believe that there are no real problems in this regard. Bugs are being found and fixed at a high rate and the kernel is solid for most users. Arjan van de Ven notes: Are we doing worse on quality? My (subjective) opinion is that we are doing better than last year. We are focused more on quality. We are fixing the bugs that people hit most. We are fixing most of the regressions (yes, not all). Subsystems are seeing flat or lower bugcounts/bugrates. Ted Ts'o points out that a lot of problems result from obscure and low-quality hardware, and that it's not possible to make everybody happy. Andrew is unconvinced, though, and seems to fear that the kernel is declining in quality. In a sense, though, that part of the discussion is moot. Nobody would argue against the idea that fewer bugs is a worthy goal, regardless of whether one believes that the current process has quality problems. So talk of ways to make things better is always on-topic. Testing remains a big issue; the kernel, more than almost any other project, is highly sensitive to the systems on which it is run. Many problems (arguably the majority of them) are related to specific hardware, or specific combinations of hardware; there is no way for the developers, who do not have all possible hardware to test on, to ever find all of these bugs. Users have to help with that process. 
Getting widespread testing coverage is always hard; Peter Anvin argues that the current process has actually made that harder: One thing is that we keep fragmenting the tester base by adding new confidence levels: we now have -mm, -next, mainline -git, mainline -rc, mainline release, stable, distro testing, and distro release (and some distros even have aggressive versus conservative tracks.) Furthermore, thanks to craniorectal immersion on the part of graphics vendors, a lot of users have to run proprietary drivers on their "main work" systems, which means they can't even test newer releases even if they would dare. There is, in fact, a wealth of development kernels to test, and it is not always clear where users and developers should be concentrating their testing effort. A consensus may be forming, though, that more people should be looking at the linux-next tree in particular. Linux-next is where all of the patches intended for the next merge window are supposed to congregate; the current contents of linux-next, as of this writing, are targeted toward 2.6.27. This is the place where early integration issues and other problems should be found; if linux-next is well tested, the number of problems showing up in the next merge window should be somewhat reduced. The linux-next tree is an interesting experiment. It is, for all practical purposes, making the development cycle longer: since linux-next exists, the 2.6.27 cycle has, in some sense, already started. Linux-next also does something which kernel developers have tended to resist: causing the stabilization period for one development cycle to overlap with active development for the next cycle. In the past, it has been argued that this kind of overlap will cause developers to prioritize the creation of new toys over fixing the problems with last week's toys. 
Some people argue that this is happening now: developers are not spending enough time dealing with bugs - and that their carelessness is creating too many bugs in the first place. Others assert that, while it will never be possible to fix every reported bug, the bugs that really matter are being addressed. A real resolution to this disagreement seems unlikely; the creation of meaningful metrics on kernel quality is a difficult task. About the best that can be done is to try to keep the regression list as small as possible; as long as systems which once worked continue to work, it is hard to argue too forcefully that things are headed in the wrong direction.

Read-only bind mounts

Bind mounts can be thought of as a sort of symbolic link at the filesystem level. Using mount --bind, it is possible to create a second mount point for an existing filesystem, making that filesystem visible at a different spot in the namespace. Bind mounts are thus useful for creating specific views of the filesystem namespace; one can, for example, create a bind mount which makes a piece of a filesystem visible within an environment which is otherwise closed off with chroot(). There is one constraint to be found with bind mounts as implemented in kernels through 2.6.25, though: they have the same mount options as the primary mount. So a command which combines mount --bind with the -o ro option will fail to make /vital_data read-only under /untrusted_container if it was mounted writable initially. On your editor's 2.6.25 system, the failure is silent - the bind mount will be made writable despite the read-only request and no error message will be generated (the mount man page does document that options cannot be changed). There is clear value in the ability to make bind mounts read-only, though. Containers are one example: an administrator may wish to create a container in which processes may be running as root.
It may be useful for that container to have access to filesystems on the host, but the container should not necessarily have write access to those filesystems. As of 2.6.26, this sort of configuration will be possible, thanks to the merging of the read-only bind mounts patches by Dave Hansen. As it happens, it's still not possible to create a read-only bind mount in a single step as described above; the read-only attribute can only be added with a remount operation afterward. So the necessary sequence is a plain mount --bind followed by a remount of the new mount point with the remount and ro options. This sequence raises an interesting question: what if some process opens a file for write access between the two mount operations? A system administrator has the right to expect that a read-only mount will, in fact, only be used for read operations. The 2.6.26 patch is designed to live up to that expectation, though the amount of work required turned out to be more than the developers might have expected. Filesystems normally track which files are opened for write access, so an attempt to remount a filesystem read-only can be passed to the low-level filesystem code for approval. But the low-level filesystem knows nothing about bind mounts, which are implemented entirely within the virtual filesystem (VFS) layer. So making read-only access for bind mounts work requires that the VFS keep track of all files which have been opened for write access. Or, more precisely, the VFS really only needs to keep track of how many files are open for write access. The technique chosen was to create something which looks like a write lock for filesystems. Whenever the VFS is about to do something which involves writing, it must first call mnt_want_write(), passing a pointer to the mount (a struct vfsmount) in question. The return value is zero if write access is possible, or a negative error code otherwise. This call can be found in obvious places - such as in the implementation of open() - when write access is requested.
But write access comes into play in many other situations as well; for example, renaming a file requires write access for the duration of the operation. So mnt_want_write() calls have been sprinkled throughout the VFS code. When write access is no longer needed, the "write lock" should be released with a call to mnt_drop_write(), which takes the same mount pointer. One of the discoveries which has been made is that write access is needed in rather more places than one might have thought. In particular, it turns out that there is need for mnt_want_write() calls within the low-level filesystems as well as in the VFS layer. So getting the read-only bind mounts patch into shape has been an ongoing process of finding the spots which have been missed and adding mnt_want_write() calls there. In an attempt to make this process a bit less error-prone, Miklos Szeredi has put together a set of VFS helper functions which encapsulate the situations where write access is needed. Those functions have not been merged for 2.6.26, however. Superficially, mnt_want_write() is easy to understand - it simply increments a counter of outstanding write accesses. The problem with a simple implementation, though, is that a shared, per-filesystem counter would create scalability problems. On multiprocessor systems, the cache line containing the counter would bounce around the system, slowing things considerably. A common response to this type of problem is to turn the counter into a per-CPU variable, allowing operations on the counter to remain local to each processor. When somebody needs to know the total value of the counters, it's a simple matter of adding each CPU's version; this operation is slow, but it is also rare. On big systems, though, the number of CPUs can be large - as can the number of filesystems, and bind mounts will only increase that number. The result is a multiplicative effect which, once again, is a scalability problem, only this time it manifests itself in the form of excessive memory use.
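The trade-off between a shared counter and per-CPU counters can be made concrete with a toy model. The following Python sketch is not kernel code: the class names, the slots_needed() helper, and the CPU and mount counts are all invented for illustration. It models a single shared counter and a per-CPU counter, and computes how many per-CPU slots would be needed if every mount carried one counter per processor.

```python
class SharedCounter:
    """One counter per mount; every CPU updates the same location."""

    def __init__(self):
        self.value = 0

    def inc(self, cpu):
        # In a real kernel, this shared cache line would bounce between CPUs.
        self.value += 1

    def dec(self, cpu):
        self.value -= 1

    def total(self):
        return self.value


class PerCpuCounter:
    """One slot per CPU; updates stay local, reading the total is a sum."""

    def __init__(self, ncpus):
        self.slots = [0] * ncpus

    def inc(self, cpu):
        self.slots[cpu] += 1

    def dec(self, cpu):
        # A slot may go negative if the matching inc happened on another CPU;
        # only the sum across all slots is meaningful.
        self.slots[cpu] -= 1

    def total(self):
        # Slow but rare: walk every CPU's slot.
        return sum(self.slots)


def slots_needed(ncpus, nmounts):
    """Memory cost of the pure per-CPU approach: one slot per (CPU, mount)."""
    return ncpus * nmounts


if __name__ == "__main__":
    counter = PerCpuCounter(ncpus=4)
    counter.inc(0)
    counter.inc(3)
    counter.dec(1)
    print(counter.slots, counter.total())
    print(slots_needed(ncpus=1024, nmounts=5000))
```

The per-CPU version keeps each update local at the price of a slower total(), and slots_needed() shows the multiplicative memory cost that grows with both the number of processors and the number of mounts.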
The read-only bind mounts patch resolves this situation by, in effect, going back to global counters which are cached on specific processors. To that end, each CPU has a mnt_writer structure which includes, among other things, a spinlock, a count field, and an mnt pointer to a mount. At any given time, this structure will hold a local count for one filesystem, represented by mnt. If the processor needs to adjust the write count for that filesystem, it's a simple matter of incrementing or decrementing count. When the processor's attention turns to a different filesystem, it must first adjust the global count for the old filesystem, then it can switch its local mnt_writer structure to the new one. The result is a compromise between purely local and purely global counters which yields "good enough" performance on benchmarks designed to stress the system. Read-only bind mounts join with other features (such as shared subtrees) to create a flexible set of tools for the construction of the filesystem namespace. It is not clear how much of this functionality is being used at this time, but, as the implementation of containers in the mainline gets closer to completion, there is likely to be more interest in this capability. Linux systems in coming years may have much more complex filesystem layouts than have been seen in the past.

Rietveld: another code review aid

With the release of Rietveld, another tool for those interested in doing web-based code reviews is now available. We looked at Review Board back in January. It was inspired by an internal Google tool, written by Python creator and Google employee Guido van Rossum, called Mondrian. That tool in turn spawned Rietveld. The feature sets of Rietveld and Review Board are strikingly similar, which is not surprising as they both used Mondrian as a model. Van Rossum originally wanted to turn Mondrian into a free software project, but it was too tied to "proprietary Google infrastructure", so he started over, with Rietveld as the result.
Both tools are implemented in Python using the Django framework, but one major difference is that Rietveld is written to use Google App Engine. There are multiple ways to get a set of patches into the Rietveld system to create an "issue" (the term used for a patch set undergoing review), from an upload of a unified diff to using a Python script to retrieve the patches from a repository. Currently Rietveld only supports Subversion, but van Rossum would like to see support added for other version control systems over time. Review Board has a bit of a head start in this area, so it supports Mercurial, Git, Bazaar, Perforce, Subversion and CVS. Once an issue has been created in the system, reviewers can then be invited to comment on the changes. Navigating through the diff is straightforward, with JavaScript being used liberally to give an interactive "local application" feel to the interface. Double-clicking on a line brings up a comment box that a reviewer can fill in to attach some comments to that line. All comments are held as "drafts" until the reviewer is satisfied with their review, at which point they "publish" the comments for the author and other reviewers to see. The Rietveld project is free software, released under the Apache 2.0 license, while the application itself runs via the Google App Engine. Anyone can browse the system, but only folks who have a Google account can add issues and comments or conduct reviews using the tool. Because it uses App Engine, people wanting to try it out on their code need not find a server to install and run the application, as would be required with Review Board; they can just upload a set of patches, invite some reviewers, and proceed. This kind of simplified deployment is one of the benefits that Google App Engine is meant to provide. For free software projects, where code review is purposely done in the open, Rietveld provides a way to quickly try the application out.
Those who wish to keep their source code secret may want to install their own instance of Review Board or another tool. It may be possible to install Rietveld in a different environment by replacing the App Engine-specific pieces, but that clearly is not where it is targeted. While Rietveld does not provide much in the way of additional functionality beyond what Review Board offers (in fact, it lags Review Board in some areas), it does provide a very nice introduction to the Google App Engine interface. Developers will undoubtedly be using the code as a template for their own ideas once Google makes more App Engine accounts available. Given the shared history, language, and framework, it isn't impossible to imagine that Review Board and Rietveld might join forces one day. Even if they don't, some cross-pollination is inevitable, which will result in both getting better. Hopefully, with more projects using one or both, better code for the community will be the result.

Looking ahead to Mandriva Linux 2009

With Mandriva Linux 2008 Spring out the door, the first steps toward Mandriva Linux 2009 are in progress. Ideas are being collected on this wiki page and Bugzilla is open for suggestions and ideas. The wiki page begins with instructions for entering ideas and suggestions into Bugzilla. A number of items are in the wish list for kernel and hardware support. The ML 2009 kernel will use libata, the one item already marked as complete (better late than never). Other wishes include an installed and enabled kerneloops package, full support for Lenovo Thinkpads T60/T61 (and T62 in the future) (with all the bells, whistles, drivers, hotkeys, LEDs, etc. working), making Xen work properly (or dropping it), and patches for kernel-level mode setting. There is a request for virtualbox 1.6 to be added to the toolchain, along with cmake and svn.
The RPM and URPMI requests include better separation of free and non-free so that non-free sources do not get installed on an otherwise free system, and better dependency handling. Some requests involve making it easier to use a lightweight desktop/window manager. There is an Xfce edition for ML 2008.1, but some would like the Xfce edition to be an official part of the 2009 release. Requests for improved icewm support are joined by requests for LXDE and Enlightenment 17. No matter how good an installer is, there is always room for improvement and some ideas are on the list. The same could be said for system tools, and several improvements to Drakxtools are also on the list. The list ends with suggestions for better internationalization and localization support. For those who have ideas about improving Mandriva Linux, now is the time to get involved. File bug reports where features seem to be missing, and help make ML 2009 better than ever.

Pygments - the Python Syntax Highlighter

Pygments is a multi-language syntax highlighter that is written in Python and distributed under the BSD license. The project description states:

It is a generic syntax highlighter for general use in all kinds of software such as forum systems, wikis or other applications that need to prettify source code. Highlights are:

- a wide range of common languages and markup formats is supported
- special attention is paid to details that increase highlighting quality
- support for new languages and formats are added easily; most languages use a simple regex-based lexing mechanism
- a number of output formats is available, among them HTML, RTF, LaTeX and ANSI sequences
- it is usable as a command-line tool and as a library
- ... and it highlights even Brainf*ck!

The project FAQ notes that Pygments supports a long (and expandable) collection of input languages. It can produce output as HTML, LaTeX, RTF and ANSI sequences for console output.
The software can be run from the pygmentize command-line tool, or accessed from your own Python code. See the command line reference for details on running pygmentize. Pygments version 0.10 was recently announced. Changes include the addition of 15 new language lexers, expansion of the Makefile lexer's capabilities, the ability to output in several image formats, a new style and other enhancements and fixes. Installation of Pygments was straightforward on an Ubuntu 7.04 system. A tar.gz file was downloaded from the Python package site. The file was uncompressed with gunzip and extracted with tar. Running python setup.py install as root installed the software, and it was ready to run. After a quick read of the Command Line Usage document, your author was able to run pygmentize on some Python code and produce some rather pleasing colorized HTML output. The project's demo page has a number of examples of Pygments' output; it also allows you to upload your own code to see how it looks after formatting. Pygments looks to be a well designed generic tool. It could be useful for online and offline documentation, code analysis, education and much more. This list of projects is already putting Pygments to use; perhaps your project could make use of it as well.

Cryptographic splicing makes for a Wordpress vulnerability

Authentication bypass vulnerabilities are particularly painful because they allow an attacker to access and potentially modify things that should be off-limits. It is important to ensure that when fixing that kind of bug, one does not introduce a different, but equally potent, hole. A recent Wordpress vulnerability clearly demonstrates the care that needs to be taken. The problem started in November 2007, when Steven Murdoch reported a problem with Wordpress authentication cookies. Essentially, the cookie that Wordpress used was an MD5 hash calculated using a value stored in the database's user table.
Any attacker that could get read access to the database, via a SQL injection or looking inside a database backup for example, could generate a cookie value that would allow them access as that user. The password itself was not stored in the database as plaintext, but the value used in the cookie was just a simple MD5 of the stored value. So, the value stored was MD5(password) and the cookie value was MD5(MD5(password)). Murdoch released his advisory in advance of a fix, because the vulnerability was being actively exploited. It was entered as bug #5367 into the Wordpress bug tracking system and a long conversation about how to properly fix it ensued. As part of that discussion, Murdoch suggested that a paper entitled "Dos and Don'ts of Client Authentication on the Web" [PDF] be consulted. The paper covers various issues regarding cookies and the kinds of attacks that can be made against them. Some, but not all, of its recommendations were followed. The new cookie scheme was released at the end of March as part of the Wordpress 2.5 release. Authentication cookie values were now built from the username, an expiration time, and a keyed hash computed over username . expiration (with the '.' operator representing concatenation). This took into account the hazards of a straightforward hash of a stored value and added an expiration to the cookie, but it failed to protect against a cryptographic splicing attack. When calculating the hash of the concatenation of the username and expiration (along with a secret known by the server), no delimiter was used between the two. This means that the hash for username "foobar" with expiration "20080507" is the same as the hash for username "foo" with expiration "bar20080507". This allows anyone whose username begins with another user's username to generate a legitimate cookie for that other user. Using the example above, user "foobar" could create valid cookies for a user "foo" (or any other prefix substring).
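The splicing problem is easy to reproduce. The sketch below is a simplified model and not Wordpress code: the secret, the function names, and the choice of HMAC-SHA256 from the Python standard library are assumptions made for illustration. It computes a keyed hash over the bare concatenation of username and expiration, shows that the two pairs from the example above collide, and then shows that inserting a delimiter between the fields separates them.

```python
import hashlib
import hmac

# Hypothetical server-side secret; stands in for whatever key the server holds.
SECRET = b"server-side secret"


def mac_no_delimiter(username, expiry):
    """Keyed hash over username . expiry with nothing between the fields."""
    message = (username + expiry).encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def mac_with_delimiter(username, expiry):
    """Same keyed hash, but with a '|' separating the two fields."""
    message = (username + "|" + expiry).encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Both calls hash the same byte string, "foobar20080507".
    print(mac_no_delimiter("foobar", "20080507")
          == mac_no_delimiter("foo", "bar20080507"))    # True

    # With a delimiter, the hashed strings differ.
    print(mac_with_delimiter("foobar", "20080507")
          == mac_with_delimiter("foo", "bar20080507"))  # False
```

A delimiter only helps if it cannot appear inside the fields themselves; a scheme that allowed "|" in usernames would need length prefixes or some other unambiguous encoding instead.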
Many Wordpress weblogs allow new users to create an account with any name they choose, so long as it is not already taken. By choosing one that starts with the administrator's username, an attacker can generate a cookie for themselves, modify it slightly, and have a valid cookie to access the administrator account. No password cracking is required, nor is any access to the database needed. Wordpress 2.5.1 has been released to address this problem. Earlier versions could disable the registration feature and delete or suspend any user accounts with suspicious usernames as a workaround. If those suspicious accounts exist, though, it would not be surprising to find that the real administrator no longer knows the proper password for that account. The paper that Murdoch referenced clearly indicated the danger from cryptographic splicing, but the Wordpress implementers must have missed it. Cookie authentication schemes are a necessary evil for web applications (it would be nearly unusable to have to authenticate on each page), but they are difficult to get right. A careful reading of the paper will help, as will using already vetted libraries or frameworks. Authentication is one of those things that is both hard to get right and extremely important to get right.

A Talk with Fedora Project Leader Paul Frields

Late last week I had the pleasure of talking with Fedora Project Leader Paul Frields. Our conversation covered a range of Fedora Project topics, including Fedora 9, the latest Fedora release. One thing Paul is passionate about is getting people to volunteer. There are many ways to get involved with the Fedora Project, lots of sub-projects and Special Interest Groups (SIGs) that people can join depending on their interests and talents. The Fedora Project wiki is a good starting point for finding out more. The Join Fedora page also goes into the various roles that a Fedora contributor might be suited for, with easy links to setting up a Fedora account and using the Fedora Account system.
You don't have to be a programmer or a computer expert to contribute to the project. Joining the Fedora Project is easier now than it ever was during Fedora's five-year history. As a result, Fedora now has over 2000 registered account holders. That includes about 350 ambassadors who promote Fedora in their local area. In addition to making it easier to become a Fedora contributor, a variety of new web applications/collaborative tools are now available for contributors. Of course all Fedora infrastructure is Free Software, available in the Fedora repository, and running on Fedora. All registered account holders may vote in Fedora elections, which is worth noting because there is an election coming up in June. The composition of the Fedora board was recently changed so that five of the nine board seats are filled by elected members. Four of those seats will be voted on in the next election. The other board members are appointed by Red Hat, but are not necessarily Red Hat employees. Red Hat retains some control by employing and appointing the Project Leader. Paul took a job with Red Hat when he was offered the position of Project Leader. Paul mentioned that former Fedora Project Leader Max Spevack is moving to the Netherlands to organize and manage Fedora volunteers in Europe. Paul also mentioned that Fedora has many Brazilian contributors. Of course Red Hat employs some Fedora engineers. There are fourteen Red Hat employees working full time on Fedora, mostly acting as team leaders and organizing the volunteers. In addition, all Red Hat engineers will spend some fraction of their time working on Fedora in areas where Red Hat Enterprise Linux is involved. Some people think of Fedora as a beta for Red Hat Enterprise Linux, but it's more realistic to think of Fedora as the upstream source for its enterprising cousin and spin-offs such as CentOS. So even though Fedora is a community project, Red Hat is still very involved in its development.
FUDCon (Fedora User & Developer Conference) is an event held on an irregular schedule several times per year. Some are smaller events held in conjunction with a larger event, such as the May 30, 2008 FUDCon, which will be held at LinuxTag in Berlin, Germany. Further out, there is some talk of having a mini-FUDCon at the 2009 linux.conf.au. The Boston FUDCon, coming up in June, will run for several days. Co-located with the Red Hat Summit, the Boston FUDCon will feature hackfests, a barcamp and technical talks. The Red Hat Summit will bring in Red Hat customers, and include talks about actual use cases. These talks should be interesting for Fedora developers, who will have a chance to see what people are doing with their work downstream. FUDCon is open to anyone, so stop by if there is a FUDCon in your area. On to the just-released Fedora 9 and the upcoming Fedora 10. Fedora 9 is one of the first major releases to feature KDE 4 by default. To make this work, the KDE SIG has built a compatibility library to keep KDE 3 applications running properly. For Fedora 10, Casey Dahlin is working on replacing the init system with upstart, the system developed for Ubuntu. Some other items that we touched on briefly: Fedora maintains an open build system and works at getting patches upstream. The project also strives to cooperate with other distributions. From what I've seen, Fedora 9 looks very good, attractive and functional. Now that rawhide has moved on to Fedora 10, it will be a rough ride for at least a few days. So stick with Fedora 9, or get it from a mirror near you. Fedora 9 is Paul's first release as Project Leader and he had a few words to add. "It's been less than five years since the first release of Fedora (back when it was called Fedora Core), and in that time Fedora has become not just a vibrant, innovative, and extremely popular Linux distribution, but also a thriving community.
A community that believes that free and open source software is not just something you *use*, it's something you *do* -- something to which you *contribute*."

Distributed bug tracking

It is fair to say that distributed source code management systems are taking over the world. There are plenty of centralized systems still in use, but it is a rare project which would choose to adopt a centralized SCM in 2008. Developers have gotten too used to the idea that they can carry the entire history of their project on their laptop, make their changes, and merge with others at their leisure. But, while any developer can now commit changes to a project while strapped into a seat in a tin can flying over the Pacific Ocean, that developer generally cannot simultaneously work with the project's bug database. Committing changes and making bug tracker changes are activities which often go together, but bug tracking systems remain strongly in the centralized mode. Our ocean-hopping developer can commit a dozen fixes, but updating the related bug entries must wait until the plane has landed and network connectivity has been found. There are a number of projects out there which are trying to change this situation through the creation of distributed bug tracking systems. These developments are all in a relatively early state, but their potential - and limitations - can be seen. One of the leading projects in this area is Bugs Everywhere, which has recently moved to a new home with Chris Ball as its new maintainer. Bugs Everywhere, like the other systems investigated by your editor, tries to work with an underlying distributed source code management system to manage the creation and tracking of bug entries. In particular, Bugs Everywhere creates a new directory (called .be) in the top level of the project's directory. Bugs are stored as directories full of text files within that directory, and the whole collection is managed with the underlying SCM.
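To give a feel for the idea, here is a minimal sketch of a bug stored as a directory of small text files. It is not Bugs Everywhere's actual on-disk format or code: the file names (summary, severity, body), the helper functions, and the use of random UUIDs for directory names are assumptions made purely for illustration.

```python
import tempfile
import uuid
from pathlib import Path


def create_bug(project_root, summary, severity="minor"):
    """Create a bug directory under <project_root>/.be and return its path."""
    bug_dir = Path(project_root) / ".be" / str(uuid.uuid4())
    (bug_dir / "comments").mkdir(parents=True)
    (bug_dir / "summary").write_text(summary + "\n")
    (bug_dir / "severity").write_text(severity + "\n")
    return bug_dir


def add_comment(bug_dir, text):
    """Store a comment in its own randomly named directory under the bug."""
    comment_dir = Path(bug_dir) / "comments" / str(uuid.uuid4())
    comment_dir.mkdir()
    (comment_dir / "body").write_text(text + "\n")
    return comment_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        bug = create_bug(root, "Crash on startup", severity="serious")
        add_comment(bug, "Reproduced on 2.6.25.")
        # List the files that would then be added and committed with the SCM.
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                print(path.relative_to(root))
```

Because every new bug and comment lands in a freshly named path, two branches that each add entries touch disjoint files. A field such as severity lives in a single file, however, so concurrent edits to it in two branches would meet in the same place at merge time.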
The advantages to an approach like this are clear. The bug database can now be downloaded along with the project's code itself. It can be branched along with the code; if a particular branch contains a fix for a bug, it can also contain the updated bug tracker entry. That, in turn, ensures that the current bug tracking information will be merged upstream at exactly the same time as the fix itself. Contemporary projects are characterized by large numbers of repositories and branches, each of which can contain a different set of bugs and fixes; distributing the bug database into these repositories can only help to keep the code and its bug information consistent everywhere. There are also some disadvantages to this scheme, at least in its current form. Changes to bug entries don't become real until they are committed into the SCM. If a bug is fixed, committing the fix and the bug tracker update at the same time makes sense; in cases where one is trying to add comments to a bug as part of an ongoing conversation, the required commit is just more work to do. The fact that, in git at least, one must explicitly add any new files created by the bug tracker (which have names like 12968ab9-5344-4f08-9985-ef31153e504f/comments/97f56c43-4cf2-4569-9ef4-3e8f2d9eb1fe/body) does not help the situation. Beyond that, tracking bugs this way creates two independent sets of metadata - the bug information itself, and whatever the developer added when committing changes. There is currently no way of tying those two metadata streams together. Then, there is the issue of merging. Bugs Everywhere appears to reflect some thought about this problem; most changes involve the creation of new, (seemingly) randomly-named files which will not create conflicts at merge time. It did not take long, however, for your editor to prove that changing the severity of a bug in two branches and merging the result creates a conflict which can only be resolved by hand-editing the bug tracker's files.
Said files are plain text, but that is less comforting than one might think. All of this can make distributed bug tracking look like a source of more work for developers, which is not the path to world domination. What is needed, it seems, is a combination of more advanced tools and better integration with the underlying SCM. Bugs Everywhere, by trying to work with any SCM, risks not being easily usable with any of them. A project which is trying for closer integration is ticgit, which, as one might expect, is based on git. Ticgit takes a different approach, in that there are no files added to the project's source tree, at least not directly; instead, ticgit adds a new branch to the SCM and stores the bug information there. That allows the bug database to travel with the source (as long as one is careful to push or pull the ticgit branch!) while keeping the associated files out of the way. Ticgit operations work on the git object database directly, so there is no need for separate commit operations. On the other hand, this approach loses the ability to have a separate view of the bug database in each branch; the connection between bug fixes and bug tracker changes has been made weaker. This is something which can be fixed, and it would appear (from comments in the source) that dealing with branches is on the author's agenda. Ticgit clearly has potential, but even closer integration would be worthwhile. Wouldn't it be nice if a git commit command would also, in a single operation, update the associated entry in the bug database? Interested developers could view a commit which is alleged to fix a bug without the need for anybody to copy commit IDs back and forth. Reverting a bugfix commit could automatically reopen the bug. And so on.
In the long run, it is hard to see how a truly integrated, distributed bug tracker can be implemented independently of the source code management system. There are some other development projects in this area, including: Scmbug is a relatively advanced project which aims "to solve the integration problem once and for all." It is not truly a distributed bug tracker, though; it depends on hooks into the SCM which talk to a central server. Regardless, this project has done a significant amount of thinking about how bug trackers and source code management systems should work together. DisTract is a distributed bug tracker which works through a web interface. To that end, it uses a bunch of Firefox-specific JavaScript code to run local programs, written in Haskell, which manipulate bug entries stored in a Monotone repository. Your editor confesses that he did not pull together all of the pieces needed to make this tool work. DITrack is a set of Python scripts for manipulating bug information within a Subversion repository. It is meant to be distributed (and, eventually, "backend-agnostic"), but its use of Subversion limits how distributed it can be for now. Ditz is a set of Ruby scripts for manipulating bug information within a source code management system; it has no knowledge of the SCM itself. As can be seen, there is no shortage of work being done in this area, though few of these projects have achieved a high level of usability. Only Scmbug has been widely deployed so far. A few of these projects have the potential to change the way development is done, though, once various integration and user interface issues are addressed. There is one remaining problem, though, which has not been touched upon yet. A bug tracker serves as a sort of to-do list for developers, but there is more to it than that. It is also a focal point for a conversation between developers and users. 
Most users are unlikely to be impressed by a message like "set up a git repository and run these commands to file or comment on a bug." There is, in other words, value in a central system with a web interface which makes the issue tracking system accessible to a wider community. Any distributed bug tracking system which does not facilitate this wider conversation will, in the end, not be successful. Creating a distributed tracker which also works well for users could be the biggest challenge of them all. The big kernel lock strikes again When Alan Cox first made Linux work on multiprocessor systems, he added a primitive known as the big kernel lock (or BKL). This lock, originally, ensured that only one processor could be running kernel code at any given time. Over the years, the role of the BKL has diminished as increasingly fine-grained locking - along with lock-free algorithms - has been implemented throughout the kernel. Getting rid of the BKL entirely has been on the list of things to do for some time, but progress in that direction has been slow in recent years. A recent performance regression tied to the BKL might give some new urgency to that task, though; it also shows how subtle algorithmic changes can make a big difference. The AIM benchmark attempts to measure system throughput by running a large number of tasks (perhaps thousands of them), each of which is exercising some part of the kernel. Yanmin Zhang reported that his AIM results got about 40% worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the problem; the guilty patch turned out to be the generic semaphores code. Reverting that patch made the performance regression go away - at the cost of restoring over 7,000 lines of old, unlamented code. The thought of bringing back the previous semaphore implementation was enough to inspire a few people to look more deeply at the problem.
It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few years ago. That part of the process was easy - there aren't a whole lot of other semaphores left in the kernel, especially in performance-critical places. But the BKL stubbornly remains in a number of core places, including the fcntl() system call, a number of ioctl() implementations, the TTY code, and open() for char devices. That's enough for a badly-performing BKL to create larger problems, especially when running VFS-heavy benchmarks with a lot of contention. Ingo Molnar tracked down the problem in the new semaphore code. In short: the new semaphore code is too fair for its own good. When a semaphore is released, and there is another thread waiting for it, the semaphore is handed over to the new thread (which is then made runnable) at that time. This approach ensures that threads obtain the semaphore in something close to the order in which they asked for it. The problem is that fairness can be expensive. The thread waiting for the semaphore may be on another processor, its cache could be cold, and it might be at a low enough priority that it will not even begin running for some time. Meanwhile, another thread may request the semaphore, but it will get put at the end of the queue behind the new owner, which may not be running yet. The result is a certain amount of dead time where no running thread holds the semaphore. And, in fact, Yanmin's experience with the AIM benchmark showed this: his system was running idle almost 50% of the time. The solution is to bring in a technique from the older semaphore code: lock stealing. If a thread tries to acquire a semaphore, and that semaphore is available, that thread gets it regardless of whether a different thread is patiently waiting in the queue. 
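The cost of strict fairness, and what lock stealing buys back, can be illustrated with a toy event simulation. It is a sketch under stated assumptions rather than the kernel's semaphore code: time is counted in abstract ticks, each thread makes one request, a sleeping waiter needs `wake` ticks before it actually runs, and the function name `finish_time()` and the sample workload are invented for this example.

```python
import heapq
from collections import deque

# Event kinds; the numeric order doubles as the tie-break within one tick.
RELEASE, RETRY, ARRIVE = 0, 1, 2


def finish_time(arrivals, hold, wake, steal):
    """Toy model: return the tick at which the last critical section ends."""
    events = [(t, ARRIVE, tid) for tid, t in enumerate(arrivals)]
    heapq.heapify(events)
    sleepers = deque()   # threads asleep in the semaphore's wait queue
    busy = False         # True while some thread owns the semaphore
    last_end = 0

    def start(now, tid):
        """Thread tid begins its critical section at tick `now`."""
        nonlocal busy, last_end
        busy = True
        last_end = now + hold
        heapq.heappush(events, (last_end, RELEASE, tid))

    while events:
        now, kind, tid = heapq.heappop(events)
        if kind == RELEASE:
            busy = False
            if sleepers:
                head = sleepers.popleft()
                if not steal:
                    busy = True  # fair policy: ownership is handed over at once
                # Either way, the head waiter needs `wake` ticks to start running.
                heapq.heappush(events, (now + wake, RETRY, head))
        elif kind == RETRY:
            if steal and busy:
                sleepers.appendleft(tid)  # the lock was stolen; sleep again
            else:
                start(now, tid)
        else:  # ARRIVE: a running thread asks for the semaphore
            if busy:
                sleepers.append(tid)
            else:
                start(now, tid)
    return last_end


if __name__ == "__main__":
    workload = [0, 1, 2, 4, 7, 10]  # ticks at which six threads ask for the lock
    print("fair handoff :", finish_time(workload, hold=3, wake=2, steal=False))
    print("lock stealing:", finish_time(workload, hold=3, wake=2, steal=True))
```

With this made-up workload the fair policy needs 28 ticks to get through all six critical sections, while the stealing policy needs 23. The difference is time in which the semaphore was owned by a thread that had not yet started running.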
Or, in other words, the thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown. In a multiprocessor computer, though, it makes the difference between acceptable and unacceptable performance - even a thread which gets its lock stolen will benefit in the long run. Interestingly, the patch which implements this change was merged into the mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the revert was that the patch broke semaphores in other situations; for some usage patterns, the semaphore code could fail to wake a thread when the semaphore became available. This bug could certainly have been fixed, but it appears that things will not go that way - there is a bit more going on here. What is happening instead is that Linus has committed a patch which simply turns the BKL into a spinlock. By shorting out the semaphore code entirely, this patch fixes the AIM regression while leaving the slow (but fair) semaphore code in place. This change also makes the BKL non-preemptible, which will not be entirely good news for those who are concerned with latency issues - especially the real time tree. The reasoning behind this course of action would appear to be this: both semaphores and the BKL are old, deprecated mechanisms which are slated for minimization (semaphores) or outright removal (BKL) in the near future. Given that, it is not worth adding more complexity back into the semaphore code, which was dramatically simplified for 2.6.26. And, it seems, Linus is happy with a sub-optimal BKL: Quite frankly, maybe we _need_ to have a bad BKL for those to ever get fixed. As it was, people worked on trying to make the BKL behave better, and it was a failure. 
Rather than spend the effort on trying to make it work better (at a horrible cost), why not just say "Hell no - if you have issues with it, you need to work with people to get rid of the BKL rather than cluge around it". So the end result of all this may be a reinvigoration of the effort to remove the big kernel lock from the kernel. It still is not something which is likely to happen over the next few kernel releases: there is a lot of code which can subtly depend on BKL semantics, and there is no way to be sure that it is safe without auditing it in detail. And that is not a small job. Alan Cox has been reworking the TTY code for some time, but he has some ground to cover yet - and the TTY code is only part of the problem. So the BKL will probably be with us for a while yet. Extending system calls Getting interfaces right is a hard, but necessary, task, especially when that interface has to be supported "forever". Such is the case with the system call interface that the kernel presents to user space, so adding features to it must be done very carefully. Even so, when Ulrich Drepper set out to remove a hole that could lead to a race condition, he probably did not expect all the different paths that would need to be tried before closing in on an acceptable solution. The problem stems from wanting to be able to create file descriptors with new properties—things like close-on-exec, non-blocking, or non-sequential descriptors. Those features were not considered when the system call interface was developed. After all, many of those system calls are essentially unchanged from early Unix implementations of the 1970s. The open() call is the most obvious way to request a file descriptor from the kernel, but there are plenty of others. In fact, open() is one of the easiest to extend with new features because of its flags argument. Calls like pipe(), socket(), accept(), epoll_create() and others produce file descriptors as well, but don't have a flags argument available. 
Something different would have to be done to support additional features for the file descriptors resulting from those calls. The close-on-exec functionality is especially important to close a security hole for multi-threaded programs. Currently, programs can use fcntl() to change an open file descriptor to have the close-on-exec property, but there is always a window in time between the creation of the descriptor and changing its behavior. Another thread could do an exec() call in that window, leaking a potentially sensitive file descriptor into the newly run program. Closing that window requires an in-kernel solution. Back in June of last year, after some false starts, Linus Torvalds suggested adding an indirect() system call, as a way to pass flags to system calls that don't currently support them. The indirect() call would apply a set of flags to the invocation of an existing system call. This would allow existing calls to remain unchanged, with only new uses calling indirect(). User space programs would be unlikely to call the new function directly; instead, they would call glibc functions that handled any necessary indirect() calls. Davide Libenzi created a sys_indirect() patch in July, but Drepper saw it as "more complex than warranted". So Drepper created his own "trivial" implementation, which was described on this page in November. It was met with a less than enthusiastic response on linux-kernel for being, amongst other things, an exceedingly ugly interface. The alternative to sys_indirect() is to create a new system call for each existing call that needed a flags argument. This was seen as messy by some, including Torvalds, leading some kernel hackers into looking for alternatives. The indirect approach also had some other potential benefits, though, because it was seen as something that could be used by syslets to allow asynchronous system calls.
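The close-on-exec window described above can be made concrete with a short Python sketch. It assumes a Linux kernel and a Python version recent enough to offer a flag-taking pipe call, which Python exposes as `os.pipe2()`; the helper names `has_cloexec()`, `pipe_two_step()` and `pipe_atomic()` are invented for this illustration.

```python
import fcntl
import os


def has_cloexec(fd):
    """Report whether the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def pipe_two_step():
    """Create a pipe, then mark both ends close-on-exec afterwards."""
    r, w = os.pipe2(0)            # plain pipe, no flags: same as pipe()
    exposed = not has_cloexec(r)  # the window: an exec() here would leak r and w
    for fd in (r, w):
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    return r, w, exposed


def pipe_atomic():
    """Ask the kernel to set close-on-exec as part of creating the pipe."""
    return os.pipe2(os.O_CLOEXEC)


if __name__ == "__main__":
    r, w, exposed = pipe_two_step()
    print("two-step: exposed before fcntl():", exposed)
    print("two-step: protected afterwards:  ", has_cloexec(r) and has_cloexec(w))
    r2, w2 = pipe_atomic()
    print("atomic:   protected from the start:", has_cloexec(r2) and has_cloexec(w2))
```

In the two-step version, another thread calling exec() between the creation of the pipe and the fcntl() calls would pass both descriptors on to the new program. The flag-taking version leaves no such gap, because the descriptor never exists without the flag.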
No decision seemed to be forthcoming, leading Drepper to ask Torvalds for one: Will you please make a decision regarding sys_indirect? There has been no other proposal so the alternative is to add more syscalls. To bolster his argument that sys_indirect() was the way to go, Drepper also created a patch to add some of the required system calls. He started with the socket() family, by adding socket4(), socketpair5(), and accept4()—tacking the number of arguments onto the function name a la wait3() and wait4(). Drepper's intent may not have been well served by choosing those calls as Alan Cox immediately noted that the type argument could be overloaded: Given we will never have 2^32 socket types, and in a sense this is part of the type why not just use that would be far far cleaner, no new syscalls on the socket side at all. Michael Kerrisk looked over the set of system calls that generate file descriptors, categorizing them based on whether they needed a flag argument added. He observed that roughly half of the file descriptor producing calls need not change because they could either use an overloading trick like the socket calls, the glibc API already added a flags argument, or there were alternatives available to provide the same functionality along with flags. In response, Drepper made one last attempt to push the indirect approach, saying: Or we just add sys_indirect (which is also usable for other syscall extensions, not just the CLOEXEC stuff) and let userlevel (i.e., me) worry about adding new interfaces to libc. As you can see, for the more recent interfaces like signalfd I have already added an additional parameter so the number of interface changes would be reduced. Even though the indirect approach has some good points, Torvalds liked the approach advocated by Cox, saying: Ok, I have to admit that I find this very appealing. It looks much cleaner, but perhaps more importantly, it also looks both readable and easier to use for the user-space programmer. 
Ultimately, developers will only use these new interfaces if they can easily test for the existence of the new code. Torvalds gives an example of how that might be done using the O_NOATIME flag to open(), which has only been available since 2.6.8. It is this testability issue that makes him believe the flags-based approach is the right one: And that's the problem with anything that isn't flags-based. Once you do new system calls, doing the above is really quite nasty. How do you statically even test that you have a system call? Now you need to add a whole autoconf thing for it existing, and when it does exist you still need to test whether it works, and you can't even do it in the slow-path like the above (which turns the failure into a fast-path without the flag). This new approach, with a scaled down number of new system calls rather than adding a general-purpose system call extension mechanism like sys_indirect(), is now being pursued by Drepper. In the explanatory patch at the start of the series, he lays out which of the system calls will require a new user space interface: paccept(), epoll_create2(), dup3(), pipe2(), and inotify_init1(), as well as those that do not: signalfd4(), eventfd2(), timerfd_create(), socket(), and socketpair(). Drepper has already made several iterations of patches addressing most of the concerns expressed by the kernel developers along the way. There have been some architecture specific problems, but Drepper has been knocking those down as well. If no further roadblocks appear, it would seem a likely candidate for inclusion in 2.6.27. Release synchronization For those who have not seen it, Mark Shuttleworth's recent The Art of Release posting is worth a look. He starts with some rather self-congratulatory talk about the Ubuntu 8.04 release, saying: To the best of my knowledge there has never been an "enterprise platform" release delivered exactly on schedule, to the day, in any proprietary or Linux OS. 
One could quibble with this claim in a number of ways, but it is true that Ubuntu got out a release designed to be supported for a number of years, and they did it when they said they would. That, of course, is only part of the job; now they have to follow through on that little promise of supporting this distribution into 2011. The initial signs are good: Ubuntu's support thus far has been solid, and it would appear that the distribution will not be going away anytime soon. One might well question whether the timely release of 8.04 is noteworthy. As a community we are increasingly spoiled; an increasingly large number of projects and distributions manage to get out regular releases on a reasonably predictable schedule. Even kernel releases, once known to slip for a year or more, are now predictable to within a couple of weeks. Now that free software releases are rather more predictable and reliable than, say, airline departures, why is the Ubuntu 8.04 release noteworthy? The answer is the long-term support commitment. Theoretically, a distribution intended for this sort of long lifetime will have had a degree of extra care put into its preparation. Important components will have been given extra time to stabilize so that the distribution will be more reliable from the outset. Some thought will have gone into the selection of packages shipped with an emphasis on supportability over the long term. The whole process requires more effort and a higher degree of assurance that all of the pieces are truly ready. The degree to which Ubuntu has done all of that work should become clear over time. Certainly the software selected for this release is rather less seasoned than the packages found in a Red Hat or SUSE enterprise release. But "older" does not necessarily mean "better" or "more stable," so the real proof will be in how well this distribution holds up for the next three years. Meanwhile, Mr. 
Shuttleworth has already stated that the next long-term support release will be happening in April, 2010. Ubuntu's success with 8.04, he says, allows this commitment to be made almost two years in advance. There is, however, a possibility that things could change: There's one thing that could convince me to change the date of the next Ubuntu LTS: the opportunity to collaborate with the other, large distributions on a coordinated major / minor release cycle. If two out of three of Red Hat (RHEL), Novell (SLES) and Debian are willing to agree in advance on a date to the nearest month, and thereby on a combination of kernel, compiler toolchain, GNOME/KDE, X and OpenOffice versions, and agree to a six-month and 2-3 year long term cycle, then I would happily realign Ubuntu's short and long-term cycles around that. This idea is not new, but Mr. Shuttleworth seems to be particularly attached to it. There is no doubt that there would be advantages to aligning schedules in this way. The kernel developers, who have been known to make a special effort for a release destined to be used by a major enterprise distributor, could focus especially hard on a stable release knowing that it would be widely used. Higher-level projects could do the same. The distributors could also, perhaps, find a way to collaborate on the long-term maintenance of these components, rather than duplicating the effort of backporting patches into older code. Perhaps they could even get together for a joint release party, saving even more money. Or perhaps this is all a nice idea which fails to survive its encounter with reality. Enterprise distribution releases tend to be highly-publicized events. Ubuntu might be happy to share its limelight with the larger distributors, but that feeling might not be reciprocated on the other side. It is hard to imagine Red Hat or Novell wanting to have their big enterprise distribution release be just one of many happening during the same month. 
It is also hard to see Ubuntu making an agreement with the enterprise distributors which specifies both a release date and the versions of the major components. Ubuntu 8.04 was released with the 2.6.24 kernel, which was almost exactly three months old at the time. Red Hat Enterprise Linux 5 was released in mid-March, 2007, when the 2.6.20 kernel was current - but Red Hat shipped the six-month-old 2.6.18 kernel instead. Aligning schedules would require more than picking a date; it would also require adopting similar stabilization periods. It is far from clear that Ubuntu would want to fall that far behind the leading edge for the sake of alignment. And, frankly, it's hard to imagine Debian making a credible commitment (within one month) to a release date at all. So aligned schedules for enterprise distributions seem like a hard sell. A better approach might be to try to wean these distributions off the "freeze and backport" model of support; this model is expensive to sustain, brings risks of its own, and doesn't always fit the needs of enterprise customers. If the enterprise distributors were able to track more current software - rather than backporting pieces of it into older software - better alignment of releases might just come naturally. Debian, OpenSSL, and a lack of cooperation A rather nasty security hole in the Debian OpenSSL package has generated a lot of interest - along with a fair amount of controversy - amongst Linux users. The bug has been lurking for up to two years in Debian and other distributions, like Ubuntu, based on it. There are a number of lessons to be learned here about distributions and projects working together or, as in this case, failing to work together. Back in April 2006, a Debian user reported a problem using the OpenSSL library with valgrind, a tool that can check programs for memory access problems. It was reporting that OpenSSL was using uninitialized memory in parts of the random number generator (RNG) code.
Using memory before it is initialized to a known value is a well-known way to create hard-to-find bugs, so it is not surprising that the valgrind report caused some consternation. Debian hacker Kurt Roeckx tracked the problem down to what he thought were two offending lines of code and posted a question on the openssl-dev mailing list: What I currently see as best option is to actually comment out those 2 lines of code. But I have no idea what effect this really has on the RNG. The only effect I see is that the pool might receive less entropy. But on the other hand, I'm not even sure how much entropy some unitialised data has. What do you people think about removing those 2 lines of code? There were few responses, but they were not opposed to removing the lines, including one from Ulf Möller using an openssl.org email address: "If it helps with debugging, I'm in favor of removing them." Unfortunately, as was discovered recently, while removing one of the two lines was harmless, removing the other essentially crippled the RNG so that OpenSSL-generated cryptographic keys were easy to predict. (For more technical details on the bug and what should be done to respond to it, see our article on this week's Security page.) It turns out, at least according to OpenSSL core team member Ben Laurie, that openssl-dev is not for discussing development of OpenSSL. That may be true in practice, but the OpenSSL support web page describes it as: "Discussions on development of the OpenSSL library. Not for application development questions!" In addition, the address suggested by Laurie (openssl-team-AT-openssl.org) does not appear in any of the OpenSSL documentation or web pages. If it wasn't the right place, it would seem that the OpenSSL developers could have provided a helpful pointer to the right address, but that did not occur.
It probably was not clear that Roeckx was asking the questions in an official Debian capacity, nor that he was planning to change the Debian package based on the answer to his questions. As Laurie rightly points out, he should have submitted a patch, proposing that it be accepted into the upstream OpenSSL codebase. That probably would have garnered more attention, even if it was only posted to openssl-dev. It seems very unlikely that the patch in question would have ever made it into an OpenSSL release. It is in the best interests of everyone, distributions, projects, and users, for changes made downstream to make their way back upstream. In order for that to work, there must be a commitment by downstream entities—typically distributions, but sometimes users—to push their changes upstream. By the same token, projects must actively encourage that kind of activity by helping patch proposals and proposers along. First and foremost, of course, it must be absolutely clear where such communications should take place. Another recently reported security vulnerability also came about because of a lack of cooperation between the project and distributions. It is vital, especially for core system security packages like OpenSSH and OpenSSL, that upstream and downstream work very closely together. Any changes made in these packages need to be scrutinized carefully by the project team before being released as part of a distribution's package. It is one thing to let some kind of ill-advised patch be made to a game or even an office application package that many use; SSH and SSL form the basis for many of the tools used to protect systems from attackers, so they need to be held to a higher standard. Another of Laurie's points, which also bears out the need for a higher standard, is the timing of the check-in to a public repository when compared to that of the advisory. 
Any alert attacker monitoring the repository could have made very good use of the resulting five- or six-day head start to exploit the vulnerability. While it is certainly possible that some with malicious intent already knew about the flaw (though no exploits have been reported), alerting potential attackers to this kind of hole well in advance of alerting the vulnerable users is unbelievably bad security protocol. This is the kind of problem that could have been handled quickly and quietly by all concerned. All affected distributions - though it might be difficult to list all of the Debian-derived distributions out there - could have been contacted so that the advisory and updates to affected packages could have been coordinated. One of these days, one of these problems is going to give Linux a security black eye unless the community can do a better job of working together. Debian vulnerability has widespread effects The recent Debian advisory for OpenSSL could lead to predictable cryptographic keys being generated on affected systems. Unfortunately, because of the way keys are used, especially by ssh, this can lead to problems on systems that never installed the vulnerable library. In addition, because the OpenSSL library is used in a wide variety of services that require cryptography, a very large subset of security tools are affected. This is a wide-ranging vulnerability that affects a substantial fraction of Linux systems. For a look at the chain of errors that led to the vulnerability, see our front page article. Here, we will concentrate on some of the details of the code, the impact of the vulnerability, and what to do about it. An excellent tool for finding memory-related bugs, Valgrind was used on an application that used the OpenSSL library.
It complained about the library using uninitialized memory in two locations in crypto/rand/md_rand.c. While the lines of code look remarkably similar (modulo the pre-processor directive), their actual effect is very different. The first is contained in the ssleay_rand_add() function, which is normally called via the RAND_add() function. It adds the contents of the passed in buffer to the entropy pool of the pseudo-random number generator (PRNG). The other is contained in ssleay_rand_bytes(), normally called via RAND_bytes(), which is meant to return random bytes. It adds the contents of the passed in buffer - before filling it with random bytes to return - to the entropy pool as well. The major difference is that removing the latter might marginally reduce the entropy in the PRNG pool, while removing the former effectively stops any entropy from being added to the pool. For both RAND_add() and RAND_bytes(), the buffer that gets passed in may not have been initialized. This was evidently known by the OpenSSL folks, but remained undocumented for others to trip over later. The "#ifndef PURIFY" is a clue that someone, at some point, tried to handle the same kind of problem that Valgrind was reporting for the similar, but proprietary, Purify tool. While it isn't necessarily wrong to add these uninitialized buffers to the PRNG pool, it is something that tools like Valgrind will rightly complain about. Since it is dubious whether it adds much in the way of entropy, while constituting a serious hazard for the uninitiated, some kind of documentation in the code would seem mandatory. The major response from the OpenSSL team seems to be on core team member Ben Laurie's weblog, where he has a rant entitled "Vendors Are Bad For Security". In it, and its follow-up, he makes some good points about mistakes that were made, while seeming to be unwilling for OpenSSL to take any share of the blame.
The end result is that OpenSSL would create predictable random numbers, which would then result in predictable cryptographic keys. According to the advisory:

"Affected keys include SSH keys, OpenVPN keys, DNSSEC keys, and key material for use in X.509 certificates and session keys used in SSL/TLS connections. Keys generated with GnuPG or GNUTLS are not affected, though."

A program that can detect some weak keys has also been released. It uses 256K hash values to detect the bad keys, which would imply 18 bits of entropy in the PRNG pool of vulnerable OpenSSL libraries. By using hashes of the keys in the detection program, the authors do not directly give away the key values that get generated, but it should not be difficult for an attacker to generate and use that list.

For affected Debian-derived systems, the cleanup is relatively straightforward, if painful. The SSLkeys page on the Debian wiki has specific information on how to remove weak keys along with how to generate new ones for a variety of affected services. Obviously, none of those steps should be taken until the OpenSSL package itself has been upgraded to a version that fixes the hole.

A bigger problem may be for those installations based on distributions that were not directly affected because they did not distribute the vulnerable OpenSSL library. Those machines may very well have weak keys installed in user accounts as ssh authorized_keys. A user who generated a key pair on some vulnerable host may have copied the public key to a host that was not vulnerable. This would allow an attacker to access the account of that user by brute-forcing the key from the 256K possibilities.

Because of that danger, the Debian project suspended public key authentication on debian.org machines. In addition, all passwords were reset because of the possibility that an attacker could have captured them by decrypting the ssh traffic using one of the weak keys.
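The arithmetic behind the "18 bits" figure, and the general idea of detecting weak keys by hash, can be sketched as follows. This is a toy under stated assumptions: toy_keygen(), fingerprint(), and the reduced seed space of 1024 are hypothetical and say nothing about how the real detector or OpenSSL key generation work. The point is only that a generator with a small seed space can be enumerated ahead of time, and that a blacklist of hashes is enough to recognize its output.

```python
import hashlib
import math

BLACKLIST_SIZE = 256 * 1024  # the "256K hash values" mentioned in the text


def bits_of_entropy(possibilities):
    """Bits needed to index this many equally likely outcomes."""
    return math.log2(possibilities)


def toy_keygen(seed):
    """Hypothetical deterministic key generator: same seed, same key."""
    return hashlib.sha256(b"toy-key:" + str(seed).encode()).digest()


def fingerprint(key):
    """Hash of a key, so the blacklist need not contain the key itself."""
    return hashlib.sha1(key).hexdigest()


def build_blacklist(seed_space):
    """Enumerate every possible seed and record each key's fingerprint."""
    return {fingerprint(toy_keygen(seed)) for seed in range(seed_space)}


def is_weak(key, blacklist):
    """A key is flagged if its fingerprint appears in the blacklist."""
    return fingerprint(key) in blacklist


# A small seed space keeps the example fast; the text's figure is 256K.
blacklist = build_blacklist(1024)
weak_key = toy_keygen(123)
other_key = hashlib.sha256(b"from a healthy generator").digest()

entropy_bits = bits_of_entropy(BLACKLIST_SIZE)
```

The same enumeration that builds the blacklist is what an attacker would run to produce candidate keys, which is why publishing only hashes offers little protection.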
One would guess that debian.org machines would have a higher incidence of weak keys, but any host that allows users to use ssh public key authentication is potentially at risk.

The weak key detector (dowkd) has some fairly serious limitations:

"dowkd currently handles OpenSSH host and user keys and OpenVPN shared secrets, as long as they use default key lengths and have been created on a little-endian architecture (such as i386 or amd64). Note that the blacklist by dowkd may be incomplete; it is only intended as a quick check."

In order to ensure that there are no weak keys installed as public keys on other hosts, it may be necessary to remove all authorized_keys (and/or authorized_keys2) entries for all users. It may also be wise to set all passwords to something unknown. Until that is done, there still remains a chance that a weak key may allow access to an attacker. It is an unpleasant task, but one that needs to be done by those who administer a multi-user system.

Getting a handle on caching

Memory management changes (for the x86 architecture) have caused surprises for a few kernel developers. As these issues have been worked out, it has become clear that not everybody understands how memory caching works on contemporary systems. In an attempt to bring some clarity, Arjan van de Ven wrote up some notes and sent them to your editor, who has now worked them into this article. Thanks to Arjan for putting this information together - all the useful stuff found below came from him.

As readers of What every programmer should know about memory will have learned, the caching mechanisms used by contemporary processors are crucial to the performance of the system. Memory is slow; without caching, systems will run much slower. There are situations where caching is detrimental, though, so the hardware must provide mechanisms which allow for control over caching with specific ranges of memory.
With 2.6.26, Linux is (rather belatedly) starting to catch up with the current state of the art on x86 hardware; that, in turn, is bringing some changes to how caching is managed.

It is good to start with a definition of the terms being used. If a piece of memory is cachable, that means:

- The processor is allowed to read that memory into its cache at any time. It may choose to do so regardless of whether the currently-executing program is interested in reading that memory. Reads of cachable memory can happen in response to speculative execution, explicit prefetching, or a number of other reasons. The CPU can then hold the contents of this memory in its cache for an arbitrary period of time, subject only to an explicit request to release the cache line from elsewhere in the system.

- The CPU is allowed to write the contents of its cache back to memory at any time, again regardless of what any running program might choose to do. Memory which has never been changed by the program might be rewritten, or writes done by a program may be held in the cache for an arbitrary period of time. The CPU need not have read an entire cache line before writing that line back.

What this all means is that, if the processor sees a memory range as cachable, it must be possible to (almost) entirely disconnect the operations on the underlying device from what the program thinks it is doing. Cachable memory must always be readable without side effects. Writes have to be idempotent (writing the same value to the same location several times has the same effect as writing it once), ordering-independent, and size-independent. There must be no side effects from writing back a value which was read from the same location. In practice, this means that what sits behind a cachable address range must be normal memory - though there are some other cases.
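The idempotency requirement can be made concrete with a small model. The classes below are hypothetical: PlainMemory stands in for ordinary RAM and FifoDevice for a device register where every write has a side effect. A cache is free to write a line back more than once, so the sketch compares what each target looks like after one write versus a repeated write of the same value.

```python
class PlainMemory:
    """Ordinary RAM: a write simply stores a value at an address."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, value):
        self.cells[addr] = value


class FifoDevice:
    """Hypothetical device register: every write queues a command."""

    def __init__(self):
        self.commands = []

    def write(self, addr, value):
        self.commands.append((addr, value))


def program_write(target):
    """What the program asked for: a single write."""
    target.write(0x10, 0xAB)


def repeated_writeback(target):
    """What a cache may do: write the same value back more than once."""
    target.write(0x10, 0xAB)
    target.write(0x10, 0xAB)


ram_once, ram_twice = PlainMemory(), PlainMemory()
program_write(ram_once)
repeated_writeback(ram_twice)

dev_once, dev_twice = FifoDevice(), FifoDevice()
program_write(dev_once)
repeated_writeback(dev_twice)

# RAM ends up in the same state either way; the device does not.
ram_is_idempotent = ram_once.cells == ram_twice.cells
device_is_idempotent = dev_once.commands == dev_twice.commands
```

For the RAM model the extra write is invisible; for the device model it issues a second command, which is the kind of side effect that makes such an address range unsuitable for caching.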
If, instead, an address range is uncachable, every read and write operation generated by software will go directly to the underlying device, bypassing the CPU's caches. The one exception is with writes to I/O memory on a PCI bus; in this case, the PCI hardware is allowed to buffer and combine write operations. Writes are not reordered with reads, though, which is why a read from I/O memory is often used in drivers for PCI devices as a sort of write barrier. A variant form of uncached access is write combining. For read operations, write-combined memory is the same as uncachable memory. The hardware is, however, allowed to buffer consecutive write operations and execute them as a smaller series of larger I/O operations. The main user of this mode is video memory, which often sees sequential writes and which offers significant performance improvements when those writes are combined. The important thing is to use the right cache mode for each memory range. Failure to make ordinary memory cachable can lead to terrible performance. Enabling caching on I/O memory can cause strange hardware behavior, corrupted data, and is probably implicated in global warming. So the CPU and the hardware behind a given address must agree on caching. Traditionally, caching has been controlled with a CPU feature called "memory type range registers," or MTRRs. Each processor has a finite set of MTRRs, each of which controls a range of the physical address space. The BIOS sets up at least some of the MTRRs before booting the operating system; some others may be available for tweaking later on. But MTRRs are somewhat inflexible, subject to the BIOS not being buggy, and are limited in number. In more recent times, CPU vendors have added a concept known as "page attribute tables," or PAT. PAT, essentially, is a set of bits stored in the page table entries which control how the CPU does caching for each page. 
The PAT bits are more flexible and, since they live in the page table entries, they are difficult to run out of. They are also completely under the control of the operating system instead of the BIOS. The only problem is that Linux has not, until now, supported PAT on the x86 architecture, despite the fact that the hardware has had this capability for some years.

The lack of PAT support is due to a few things, not the least of which has been problematic support on the hardware side. Processors have stabilized over the years, though, to the point that it is possible to create a reasonable whitelist of CPU families known to actually work with PAT. There have also been challenges on the kernel side; when multiple page table entries refer to the same physical page (a common occurrence), all of the page table entries must use the same caching mode. Even a brief window with inconsistent caching can be enough to bring down the system. But the code on the kernel side has finally been worked into shape; as a result, PAT support was merged for the 2.6.26 kernel. Your editor is typing this on a PAT-enabled system with no ill effects - so far.

On most systems, the BIOS will set MTRRs so that regular memory is cachable and I/O memory is not. The processor can then complicate the situation with the PAT bits. In general, when there is a conflict between the MTRR and PAT settings, the setting with the lower level of caching prevails. The one exception appears to be when one says "uncachable" and the other enables write combining; in that case, write combining will be used. So the CPU, through the management of the PAT bits, can make a couple of effective changes:

- Uncached memory can have write combining turned on. As noted above, this mode is most useful for video memory.

- Normal memory can be made uncached. This mode can also be useful for video memory; in this case, though, the memory involved is normal RAM which is also accessed by the video card.
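The conflict rule described above can be written down as a small function. This is a simplified model of the rule as stated in the text, not a description of the hardware: it assumes only three modes and a simple ordering between them, whereas real processors define more memory types and more detailed combination tables. The names CacheMode and effective_mode() are made up for this sketch.

```python
from enum import IntEnum


class CacheMode(IntEnum):
    """Three modes, ordered from least to most caching (a simplification)."""
    UNCACHABLE = 0
    WRITE_COMBINING = 1
    WRITEBACK = 2


def effective_mode(mtrr, pat):
    """Combine an MTRR setting and a PAT setting per the rule in the text."""
    # Stated exception: uncachable plus write combining gives write combining.
    if {mtrr, pat} == {CacheMode.UNCACHABLE, CacheMode.WRITE_COMBINING}:
        return CacheMode.WRITE_COMBINING
    # Otherwise the setting with the lower level of caching prevails.
    return min(mtrr, pat)


# The two effective changes listed above:
video_memory = effective_mode(CacheMode.UNCACHABLE, CacheMode.WRITE_COMBINING)
shared_ram = effective_mode(CacheMode.WRITEBACK, CacheMode.UNCACHABLE)
```

Under this model, video_memory comes out as write combining and shared_ram as uncachable, matching the two cases in the list.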
Linux device drivers must map I/O memory before accessing it; the function which performs this task is ioremap(). Traditionally, ioremap() made no specific changes to the cachability of the remapped range; it just took whatever the BIOS had set up. In practice, that meant that I/O memory would be uncachable, which is almost always what the driver writer wanted. There is a separate ioremap_nocache() variant for cases where the author wants to be explicit, but use of that interface has always been rare.

In 2.6.26, ioremap() was changed to map the memory uncached at all times. That created a couple of surprises in cases where, as it happens, the memory range involved had been cachable before and that was what the code needed. As of 2.6.26, such code will break until the call is changed to use the new ioremap_cache() interface instead. There is also an ioremap_wc() function for cases where a write-combined mapping is needed.

It is also possible to manipulate the PAT entries for an address range explicitly, using the set_memory_uc(), set_memory_wc(), and set_memory_wb() functions. These functions will set the given pages to uncachable, write-combining, or writeback (cachable), respectively. Needless to say, anybody using these functions should have a firm grasp of exactly what they are doing or unpleasant results are certain.

The Freedom of Fork

One of the important rights that Free Software gives you is the ability to take the source code of any software, modify it, and release it again under a compatible Free Software license. It is a very important freedom, as it not only allows users to customize the software they use to better suit their requirements, but also enables distributions to patch software to build in their environment. Environmental changes include new architectures and different versions of system tools and libraries. As with other important freedoms, this ability can prove to be a huge problem if not handled properly.
There can be problems for the original author, the person doing the fork, and the users of the various versions of the software. The story of Free Software is full of good examples of forks handled correctly, like the EGCS fork that transformed the GNU C Compiler into the GNU Compiler Collection (GCC), or more recently the replacement of Jörg Schilling's cdrtools with the cdrkit package that is now found in most distributions. Unfortunately, the list of bad examples is longer.

Historically, forking a project was a difficult task for most single developers: handling version control repositories (especially with CVS) was not something done easily. It limited the task of forking to experienced developers, who usually had enough common sense to know when forking was not an option.

Nowadays, forking is much easier: Subversion allows developers to easily fetch the whole history of a project. Distributed version control systems (DVCS) like git, Mercurial, Bazaar-NG and others remove the need for a central repository, making forking and branching two very similar activities. Recently, the GitHub hosting site has made this action even more prominent by adding a "fork" button on the pages for the repositories hosted on its servers, allowing anybody to create a new branch (or fork) of a project with a single mouse click.

The Downsides of Forking

Forking is not always the best option. It should probably be considered the last resort. Forking divides efforts as the two projects often take slightly different turns. The result of the fork is that the two versions of the code diverge, even though they share the same interface and most of the background logic. This creates a series of technical problems that reflect on the non-technical attributes of a program.

A forked project reuses a big part of the code from the original project. This causes code duplication, with its usual problems, and one in particular: security risks.
A forked project is usually vulnerable to the problems the original project had, unless that part of the code has been rewritten or modified with time. As the forks evolve, authors often miss the security issues fixed by their ancestor, making it harder for developers to track the issues down.

Another common problem is the division of users' contributions. Users usually just report issues to one project, the one they use. So either the developers of the two projects exchange information about the bugs they fix in the common code, or the problems will likely be ignored by one of the two projects, increasing the distance between the projects.

You can find this very problem with software like Ghostscript, the omnipresent PostScript processor, used to generate, view and convert PostScript files. Its development is currently divided into multiple forks which do not always give their code back to the originating project. You can find one version released under the AFPL (Aladdin Free Public License), one released under the GPL, a commercial/proprietary one, and one version that used to be developed by Easy Software Products, the authors of the CUPS printing system.

The reasons for the forks here were mostly related to licensing issues and, in the case of ESP, to better support for CUPS. In the end, the development of different bloodlines for the project caused, and still causes, problems for distribution maintainers. Distribution issues include keeping packages aligned, which means doubling the effort needed to fix the code if it breaks or if it doesn't follow policy.

Another case where dividing the development effort has caused problems is in the universe of Logitech mouse control software. The lmctl project was started as a tool to control some settings of Logitech devices, like resolution and cordless channels. The code has to know which devices have which settings available. To do this, it keeps a table of USB identifiers.
As new devices started appearing on the market and Linux users started using them, the table became outdated. Distributions patched this up, but in different ways, creating inconsistent tables. Some users started releasing their own modified versions of lmctl with an extended table to support different devices.

While explicit forks of entire projects have problems, the fact that they delineate where they took the code from makes it easier to track down the source of bugs and handle security vulnerabilities. On the other hand, when a project borrows some code and imports it into its source distribution, this kind of tracking becomes more difficult. Free Software licenses explicitly allow, and push for, importing code between projects; cross-pollination also improves general code quality over time.

For most distributions, an internal imported copy of a library inside another project is also a violation of policy. For this reason the developers will most likely try to make the project use a shared, external copy of the code. This works fine when the other library is simply bundled together untouched, but it becomes a nuisance if there are subtle changes which might not be apparent at first glance.

When you want to have an internal copy of a library, you should treat it as an untouchable piece of code. Instead of spending time fixing bugs inside that copy of the code, the developers should try to fix the bugs in the original sources, so that everybody (including themselves) can make use of the improvement.

A real-world example of this is the FFmpeg source code. FFmpeg is imported by many different Free Software projects in the area of multimedia: xine, MPlayer, GStreamer. While it is a very wide common ground for all these projects, as well as for some others that aren't importing a copy of it, like VLC, some of the imports change the source code in more or less subtle ways.
In the case of xine, the whole build system is replaced to integrate it with the automake-based build system used by the rest of the library. Further patching is done to the sources themselves so that they behave in a slightly different way than the original. The code rots quickly, and bugs that were already fixed in the in-development sources of FFmpeg still sprout in xine-lib.

Maintaining such an import is a difficult and boring task, to the point that the developers, in the past two years, have spent a lot of energy toward the goal of not using an internal copy of FFmpeg anymore. The result is that the difference between the original FFmpeg and the internal copy is now much smaller, mostly limited to the build system. Instead of advising against using an external copy of FFmpeg, the project now advises against using the internal one. For the next minor version of xine-lib, FFmpeg is being used pristine, entirely unpatched, and it will probably not even be bundled with the library in the near future.

Successful Forks

Of course it's not all bad. There are successful forks in Free Software, and many of them are now more famous than their parents. I've already named the GNU Compiler Collection, which is the GCC that almost all Free Software users have at hand at the moment. Most people use GCC version 3 and later, which started as a fork of the other GCC (the GNU C Compiler), version 2.

The original development of GCC was, like many other GNU projects, very closed to the community. As Eric S. Raymond defined it in his book The Cathedral and the Bazaar, it was Cathedral-style development, which often prepares the ground for forks, and this was no exception. Multiple forks of the GCC code were created. Their goals, while different, often didn't clash, and could easily have been worked on at the same time. Some of the forks were then merged into the EGCS project, which eventually replaced the original GCC.
Again citing GNU's Cathedral style of development, it's difficult not to talk about GNU Emacs and its brother XEmacs. Created originally to support one particular product, the XEmacs project is nowadays a mostly standalone project. XEmacs is kept at arm's length from GNU Emacs, mostly because of licensing and copyright assignment issues. Neither version can be considered a superset of the other because they both implement features in their own way.

The situation is better for Claws Mail, which started as a different branch of Sylpheed under the name Sylpheed Claws. Originally the intention was to develop new features that could one day find their way back to the original code. Claws Mail has since declared itself independent and is now a stand-alone project. In this case, the exchange of code between the two projects has basically halted, as the code bases have diverged so much that they retain very little in common.

In the case of the Ultima Online server emulators, forks became daily events, and cross-pollination had grown to the point where at least five projects were linked by family ties. The UOX3 source code has been forked, reused, imported and cut down so that it is present in WolfPack, LoneWolf, NoX-Wizard and Hypnos. Almost all of the UOX3 forks involved re-writing parts of the code, as it had stratified to the point of not being maintainable. The forks continued copying from one another to make use of the best features available.

Forking vs. Branching

There are a few good reasons why you might want to detach, temporarily, from a given development track. These include development of experimental features, new interfaces, backend rewrites or resurrection of a project whose original authors are unavailable. In most of these cases, forking is not the best solution but branching most likely is.
Although the border between these two actions has started to blur thanks to distributed VCS, branching usually doesn't involve setting up a new web page for the project, changing its name or finding a new goal. And a branch is usually related, tightly or not, to the original project. Merges between the two code bases often happen at more or less regular intervals, and ideas and bug reports are shared.

Branches are usually meant to be merged into the main development track, sooner for small testing branches, or later for huge rewrites. They don't usually require dividing efforts, as the problems affecting the main branch get their fixes propagated to the other branches when they merge back the original code.

One common problem with developing through branches has been poor support in the Subversion version control system. In Subversion the branches are represented as a different path in the repository, with almost no help for branches in the merge operations. With a modern distributed VCS, branches are so cheap that any checkout is, from some points of view, a different branch, and the merge operations are one of the main focuses. Projects like the Linux kernel or xine-lib rely heavily on an above-average number of branches. These are often short-lived and used for testing purposes.

Looking to the Future

Forks will never end in Free Software as they are supported by one of the freedoms that make Free Software what we all want it to be. The future will, of course, bring new forks. Recently there has been a lot of talk about Funpidgin, a fork of the widespread Pidgin Instant Messaging client (formerly Gaim). Again it seems like it was the Cathedral-style development of the original code that motivated a fork that could give (some of) the users what they wanted.

And even though GNU Emacs opened its process quite a lot, its forks haven't stopped sprouting.
This is despite the fact that Richard Stallman, original author and mastermind behind the GNU project, stepped down as maintainer, putting in place Stefan Monnier and Chong Yidong. Aquamacs Emacs is still diverging from the original GNU Emacs in order to support Apple's Mac OS X, while different versions are being developed to support the multiple user interfaces one can use on that operating system. Similarly, although the Windows port of Emacs is already pretty solid, there are extensions being written to make it easier for users to adapt it to the Microsoft environment.

Forks are usually the effect of a closed-circle development, a Cathedral, where some of the developers or users can't see their objective being fulfilled, with all their energy being poured in. So just look for the projects that don't seem to be getting much love from a community, and you might find a fork starting to make its first leaves.

Then there is the Poppler project, which merged together the modified versions of the XPDF code imported by projects like GNOME and KDE for their PDF viewers. Poppler is soon going to be a nearly omnipresent PDF rendering library on Free Software desktops and beyond. This summer's milestone KDE 4.1 release will include the new oKular document viewer, which will use Poppler for PDF rendering on the (stable) KDE users' desktops.

Conclusions

I'd suggest that anybody thinking about creating a fork should think twice. Forking is rarely a good choice; better choices can be branching or, if you need just part of the code, working together as the Poppler developers did to separate out and share the common parts. When you want to make some changes to a software project, propose branching it, show the results to the original developers and discuss with them how to improve the code. Most of the time you'll find authors are open to the changes.

A fork is a grave matter.
It might bring innovation to the Free Software community, but it could also separate developers that could otherwise work together, maybe in a better way. In this light, GitHub's one-click forking capability seems like a dangerous feature.

The ever-increasing ease of forking everything, from small projects to parts of, or even entire, distributions (think about Debian's repositories and Gentoo's overlays) is increasing the fragmentation of Free Software projects. Biodiversity in software can be a very good thing, just like in nature, but people should first try their best to work together, rather than one against the other.

Debian contemplates patch management

Developers in the Debian project had a busy week cleaning up after the openssl vulnerability was disclosed. Once that was taken care of, they moved on to process-related issues. Clearly, some shortcomings in how Debian handles patches to the programs it ships have been revealed; now the project would like to face those problems and make things work better in the future. The resulting discussion shows Debian at its introspective best, and may well have results that other distributors will want to pay attention to. As a Fedora developer noted: "This bug could easily have been us on the receiving end." All distributors make changes to their packages, so all of them are potentially exposed to this kind of failure.

Debian's packaging policy resembles that of most other distributions. A Debian source package is supposed to contain a tarball of the upstream source distribution, without changes. Any distribution-specific patches are included separately and applied when the source package is prepared for building. There are a couple of Debian-specific issues to be faced, though: From the discussion, it seems that the "pristine upstream tarball" rule is occasionally bent by developers.
Sometimes there is no alternative: some upstream source distributions contain material which, due to its licensing, cannot be shipped by Debian. The justification for other cases is not always quite as clear. Debian's patches are all mashed together and included as a single diff file. So there is no metadata describing the patches, and they are difficult to separate from each other. In this regard, Debian differs from RPM-based distributions, which generally keep each patch separate. The end result of all this is that Debian's patches are hard for others to review, hard for upstream projects to consider, and even hard for other Debian developers to get a handle on. Raphaël Hertzog started a discussion on how to improve this situation. A key part of his approach (and an idea which others have been pursuing as well) is to make changes to the Debian source package format which would make the nature of each patch explicit. At a minimum, packagers would include a debian/patches directory with the source; that directory would contain each patch, broken out into a separate file. Some Debian packages are built this way already, though the practice is far from universal. Beyond that, though, it would be nice to have the source package itself understand the patch stream and its associated metadata. There are a few proposals for this; Raphaël favors the "3.0 (quilt)" format, which keeps the patches (in a separate tarball) as a quilt series. This format seems to have a certain amount of support; among other things, its simplicity would make it easy for Debian developers to create packages in this format without having to learn new tools. The quilt series file - like the spec file used with RPM packages - makes it clear which patches must be applied, and in which order. There are other variants of the 3.0 source package format, though. The "3.0 (git)" format contains a git repository containing the upstream source and a series of patches to it. 
This approach has the advantage of including the history of the patches along with the other metadata; it could also, arguably, make it easier for other distributors (and upstream) to cherry-pick patches of interest. On the other hand, a git-based package format requires the availability of git and has the potential to make those packages larger. The GitSrc FAQ has some more information on this format; there's also a "3.0 (bzr)" format variant out there. Any of these new formats, if widely adopted, would bring a new level of transparency to Debian's patching activities. It would enable the creation of a "patches.debian.org" site (clearly inspired by patches.ubuntu.com) where anybody could quickly look at the changes which have been made to any given package. There are some developers who doubt the utility of this; they worry that upstream developers won't want to poll a site to see what changes have been made to their code. One developer at least (GNOME hacker Vincent Untz) thinks that a patches.debian.org site would be a step in the right direction, though. Another quibble which has been heard is that Debian does not need any new infrastructure for patch management. The right place for patch tracking, it is said, is with the upstream project. Nobody seems to challenge the claim that more patches need to go back upstream, but there is also the fact that quite a few patches will never get there. The upstream developers for a number of projects seem to have different goals and are seen by the distribution maintainers as being overtly uncooperative. And some patches - such as those removing non-free material - may not be something that even cooperative upstream maintainers want. So there will always be a need for distribution-specific patches; the "track it upstream" approach will not solve the whole problem. Meanwhile, Joey Hess brought a completely different idea to the discussion: just treat every divergence from upstream as a bug. 
Each patch would have a corresponding entry in the Debian bug tracking system (BTS) with a special tag. Anybody could then query the list of outstanding bugs, view the patches, and participate in the associated discussion. Using the BTS brings some real technical advantages, in that the system already exists. But, Joey says, the real benefit is elsewhere: The biggest reason for using the BTS is not technical. It's that, if we decide that the project will treat divergence from upstream as a bug, then we've effectively decided that maintainers will be responsible for both minimising unnecessary divergence, communicating about it to upstream, and for keeping track of what divergence exists. Because developers are responsible for their bugs. A separate patch tracking mechanism, instead, would be a mostly automatic subsystem on the side which might not bring the same sort of pressure to bear on developers. The BTS approach is not universally acclaimed either. Some developers claim that most Debian-specific patches are not really Debian bugs - they are, instead, upstream bugs. Regardless of whether that is really true, distribution bug trackers generally carry a great many entries which, in the end, describe bugs in upstream packages. Another complaint is that creating and maintaining BTS entries would be just another bit of bureaucratic work imposed on Debian developers. Beyond any doubt, some developers would see it that way. But this may be a place where a bit more bureaucracy makes some sense. The Linux distributors of the world (certainly not just Debian) are carrying thousands of patches against the free programs they distribute. Making the nature and extent of those patches more readily apparent can only be beneficial for users, reviewers, distributors, and upstream maintainers. One clear conclusion from recent events is that all distributors could do more to let the rest of the community know about the changes they are making. 
A distributor's ability to patch a program is a crucial part of the whole ecosystem - it's the distributors' way of balancing their users' needs against the upstream maintainer's policies. But distributors should be clear about the changes they are making, willing to merge those changes upstream whenever possible, and wanting feedback on those patches. Any "bureaucracy" which helps to make that happen can only help our community as a whole in ways that go far beyond the avoidance of another openssl disaster. One final note: the existence of source package formats which incorporate distributed version control system repositories shows that developers have been thinking about this problem for a while; it's not just a response to recent events. There is an effort underway to think about what the intersection of version control and packaging can really achieve for all distributors; the folks working on this project can be found at vcs-pkg.org. They are working on organizing a gathering this September in Extremadura. Vcs-pkg is worth watching; it has the potential to make things work better for developers and users of all distributions. Kill BKL Vol. 2 Last week's big kernel lock article discussed a BKL-related performance regression and concluded that we would likely see a new interest in its elimination. In the intervening week, that interest has indeed come to the fore. There are now a couple of different efforts afoot to get rid of this long-lasting lock. One might well wonder why the BKL is so persistent. Over the last (approximately) fifteen years, thousands of locks have been added to the kernel, pushing the BKL into increasingly obscure corners. But there are a lot of those corners, including a great many explicit lock_kernel() calls, the open() method for every char device, most ioctl() implementations, all fasync() implementations, and more. The BKL can be found throughout the kernel, and doesn't appear ready to go without a fight. 
Part of the problem is simply that locking is hard. So going in and changing the locking of some crufty, old driver is not at the top of the list for a lot of developers, who would generally rather be creating crufty new drivers. Beyond that, though, the BKL is special. It was originally created to be more than just a locking primitive; its purpose is to allow BKL-covered code to pretend that it is still running on an old, uniprocessor system. So its semantics are very different from any other lock in the Linux kernel. For example, the BKL nests, so programmers can add lock_kernel() calls anywhere without worrying about whether the BKL might already have been acquired elsewhere. As with a mutex, code holding the BKL can sleep; however, the scheduler will magically release the BKL until the holding thread wakes up again. So there can be various threads in kernel space, all of which think they hold the BKL, but only one of them will actually be running at any given time. The end result is that it is hard to get a handle on what is happening with the BKL at any given time; code can depend on it without ever really being aware of its existence. As Ingo Molnar put it in his kill the BKL tree announcement: Furthermore, the BKL is not covered by lockdep, so its dependencies are largely unknown and invisible, and it is all lost in the haze of the past ~15 years of code changes. All this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong. That doesn't mean that people aren't willing to try; Ingo's tree - to which we will return shortly - is a major effort in that direction. But first, consider another initiative which, somewhat accidentally, turned up an example of just how subtle BKL-related issues can be. 
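The nesting and release-on-sleep semantics described above can be modeled in user space. What follows is only a toy sketch, not kernel code: the class name `BigLockModel` and its `sleep()` helper are invented for illustration, and only `lock_kernel()` and `unlock_kernel()` echo names from the text.

```python
import threading


class BigLockModel:
    """Toy user-space model of the BKL semantics (hypothetical, not kernel code)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._local = threading.local()  # per-thread nesting depth

    def depth(self):
        return getattr(self._local, "depth", 0)

    def lock_kernel(self):
        # Only the outermost call takes the real lock; nested calls just count.
        if self.depth() == 0:
            self._mutex.acquire()
        self._local.depth = self.depth() + 1

    def unlock_kernel(self):
        if self.depth() == 0:
            raise RuntimeError("unlock_kernel() without lock_kernel()")
        self._local.depth = self.depth() - 1
        if self.depth() == 0:
            self._mutex.release()

    def sleep(self, blocking_call):
        # Model of the scheduler's behavior: drop the lock while the holder
        # sleeps, then take it back before the holder resumes.
        held = self.depth() > 0
        if held:
            self._mutex.release()
        try:
            return blocking_call()
        finally:
            if held:
                self._mutex.acquire()

    def is_locked(self):
        return self._mutex.locked()
```

A caller can nest `lock_kernel()` freely, and while it is inside `sleep()` the lock is available to other threads even though the caller still "thinks" it holds it. That is the property which makes BKL dependencies so hard to see.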
As was mentioned above, the kernel grabs the BKL whenever a process opens a char device; the BKL is held while the associated driver's open() function runs. To eliminate the BKL, one must remove this particular use of it; one cannot just take it out, however, without breaking every driver which does not have proper locking internally. So, in fact, this lock_kernel() call cannot be removed until every driver's open() function has been audited and, if necessary, fixed. That's a big flag day. An alternative, which your editor rashly jumped into doing, is to push the acquisition of the BKL down one level. Every open() function is forced to be correct through the addition of explicit lock_kernel() and unlock_kernel() calls; once all of the in-tree drivers have been fixed in this way, the higher-level call in chrdev_open() can be removed. This work may seem like a step backward, in that it replaces a single lock_kernel() call with approximately 100 others. But it's actually a big step forward, in that each driver can now be audited and fixed independently. This work has now been done, the resulting tree is in linux-next, and, if all goes well, it should be ready for 2.6.27. While doing this work, though, your editor noticed quite a few drivers with open functions that either were completely empty (all they do is "return 0") or did something relatively trivial. These functions, one would think, do not need to acquire the BKL; they touch no global resources and cannot possibly race with any other part of the kernel. In fact, as was suggested by others, the empty open() functions could just be removed altogether. It was Alan Cox who pointed out that life is not quite so simple. Under the current regime, an empty_open() function whose entire body is "return 0" is really better modeled as one which calls lock_kernel(), then unlock_kernel(), and only then returns 0. These two may seem the same, but there is a crucial difference: in the second form, empty_open() will not return until it can acquire the BKL.
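The difference between the two forms can be sketched in user space. This is a rough model under stated assumptions, not kernel code: a Python `RLock` stands in for the BKL, `ToyDriver`, `empty_open_locked()` and `demo()` are hypothetical names, and the driver's initialization path is assumed to run with the big lock held.

```python
import threading

bkl = threading.RLock()  # stand-in for the BKL; like the BKL, an RLock nests


class ToyDriver:
    def __init__(self):
        self.registered = threading.Event()
        self.finish_init = threading.Event()
        self.state = None  # the "internal data structures", not yet set up

    def init(self):
        # Assumption: initialization runs with the big lock held.
        with bkl:
            self.registered.set()    # the device becomes visible
            self.finish_init.wait()  # window in which an open() can arrive
            self.state = "ready"     # internal state is finally initialized


def empty_open(driver):
    # First form: returns immediately and may see uninitialized state.
    return driver.state


def empty_open_locked(driver):
    # Second form: cannot return until the big lock has been available once.
    bkl.acquire()
    bkl.release()
    return driver.state


def demo():
    driver = ToyDriver()
    results = {}
    init_thread = threading.Thread(target=driver.init)
    init_thread.start()
    driver.registered.wait()

    results["unlocked"] = empty_open(driver)  # runs inside the window

    opener = threading.Thread(
        target=lambda: results.update(locked=empty_open_locked(driver)))
    opener.start()
    driver.finish_init.set()  # let initialization complete
    init_thread.join()
    opener.join()
    return results
```

In this model the unlocked form observes the driver before its state exists, while the locked form can only return after initialization has let go of the lock.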
In other words, after empty_open() runs, one knows that the BKL became available at least once. And this matters: a classic device driver error is to (1) register a device with the kernel, then (2) initialize all of the internal data structures needed to manage that device. Should some other process attempt to open and use the device between those two steps, unpleasant things can happen. The lock_kernel() call in the open() function, despite protecting no critical section directly, serializes the opening of the device with the driver's initialization, and thus prevents mayhem. So, says Alan: "I think it would be best to make them lock/unlock kernel in the first pass and then work through them. The BKL can be subtle and evil, but as I brought it into the world I guess I must banish it ;)" Alan will not be alone in that effort, though, and Ingo Molnar's "kill the BKL" tree is likely to help this work considerably. Ingo's approach is to get rid of most of the features which make the BKL special. So, with his patches, the BKL becomes just another mutex which, crucially, can be tracked with the lock validator. It is no longer released when a thread calls schedule(), a change which forced the addition of a few explicit "release, schedule, and reacquire" sequences in code which would otherwise deadlock. There are also a number of warnings added to point out calls which should not be made while holding the BKL. And so on. This patch set, in essence, removes the BKL entirely, replacing it with just another big lock which happens to do nesting. And the nesting might go too at some point. So the BKL becomes more visible and easier to understand. And, presumably, easier to eliminate. Linus likes this approach, though he would like to see it reworked to the point that it can be merged into the mainline relatively soon.
Doing that would require putting most of the changes behind a configuration option decorated with a sufficient number of scary warnings; then people who wanted to test this code could turn it on and see what explodes. The number of explosions would probably be relatively small - but probably not zero. This set of changes, along with the other work being done, suggests that significant progress toward the elimination of the BKL can be expected over the next few kernel development cycles. Once it's gone, we'll have a kernel without legacy locking issues, and without the unpleasant performance issues that the BKL can bring. That will still take a while, though; there is simply no substitute for actually looking at all the BKL-covered code and ensuring that it will run safely in the absence of that protection. It's a painstaking job requiring moderate skills which can only be rushed so much. Appropriate sources of entropy A steady stream of random events allows the kernel to keep its entropy pool stocked up, which in turn allows processes to use the strongest random numbers that Linux can provide. Exactly which events qualify as random—and just how much randomness they provide—is sometimes difficult to decide. A recent move to eliminate a source of contributions to the entropy pool has worried some, especially in the embedded community. The kernel samples unpredictable events for use in generating random numbers, storing that data in the entropy pool. Entropy is a measure of the unpredictability or randomness of a data set, so the kernel estimates the amount of entropy each of those events contributes to the pool. Many kernels run on hardware that is lacking some of the traditional sources of entropy. In those cases, the timing of interrupts from network devices has been used as a source of entropy, but it has always been controversial, so it was recently proposed for removal. 
Two of the best sources of random data for the entropy pool—user interaction via a keyboard or mouse and disk interrupts—are often not present in embedded devices. In addition, some disk interfaces, notably ATA, do not add entropy, which extends the problem to many "headless" servers. But network interrupts are seen as a dubious source of entropy because they may be able to be observed, or manipulated, by an attacker. In addition, as network traffic rises, many network drivers turn off receive interrupts from the hardware, allowing the kernel to poll periodically for incoming packets. This would reduce entropy collection just at the time when it might be needed for encrypting the traffic. This is not the first time eliminating the IRQF_SAMPLE_RANDOM flag from network drivers has come up; we looked at the issue two years ago (though the flag was called SA_SAMPLE_RANDOM at that time). It has come up again, starting with a query on linux-kernel from Chris Peterson: "Should network devices be allowed to contribute entropy to /dev/random?" Jeff Garzik, kernel network device driver maintainer, answered: "I tend to push people to /not/ add IRQF_SAMPLE_RANDOM to new drivers, but I'm not interested in going on a pogrom with existing code." For anyone that is interested in such a pogrom, Peterson proposed a patch to eliminate the flag from the twelve network drivers that still use it. This sparked a long discussion on how to provide entropy for those devices that do not have anything else to use. While the actual contribution of entropy from network devices is questionable, mixing that data into the pool does not harm it, as long as no entropy credit—the current estimate of entropy in the pool—is awarded. Alan Cox proposed a new flag to track sources like that: A more interesting alternative might be to mark things like network drivers with a new flag say IRQF_SAMPLE_DUBIOUS so that users can be given a switch to enable/disable their use depending upon the environment. 
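The distinction between mixing data into the pool and awarding entropy credit for it can be illustrated with a toy model. This sketch is purely illustrative: `ToyEntropyPool` is a hypothetical class, SHA-256 is only a stand-in mixing function, and the kernel's real pool works differently.

```python
import hashlib


class ToyEntropyPool:
    """Illustrative model only; not how the kernel's pool is implemented."""

    def __init__(self):
        self._state = b"\x00" * 32
        self.credit_bits = 0  # running estimate of entropy in the pool

    def mix(self, sample: bytes, credit_bits: int = 0):
        # Every sample is stirred into the pool state...
        self._state = hashlib.sha256(self._state + sample).digest()
        # ...but only sources deemed trustworthy raise the estimate.
        self.credit_bits += credit_bits

    def state(self) -> bytes:
        return self._state
```

A "dubious" source in this model would call `mix()` with the default credit of zero: the pool contents change, so the data can only help, but nothing is added to the estimate that gates blocking reads.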
Some were in favor of an approach like this, but Adrian Bunk notes that: If he can live with dubious data he can simply use /dev/urandom . If a customer wants to use /dev/random and demands to get dubious data there if nothing better is available fulfilling his wish only moves the security bug from his crappy application to the Linux kernel. Part of the problem stems from a misconception about random numbers gotten from /dev/random versus those that are read from /dev/urandom, which we described in a Security page article last December. In general, applications should read from /dev/urandom. Only the most sensitive uses of random numbers—keys for GPG for example—need the entropy guarantee that /dev/random provides. In a system that is getting regular entropy updates, the quality of the random numbers from both sources is the same. There is still an initialization problem for some systems, though, as Ted Ts'o points out: Hence, if you don't think the system hasn't run long enough to collect significant entropy, you need to distinguish between "has run long enough to collect entropy which is causes the entropy credits using a somewhat estimation system where we try to be conservative such that /dev/random will let you extract the number of bits you need", and "has run long enough to collect entropy which is unpredictable by an outside attacker such that host keys generated by /dev/urandom really are secure". A potential entropy source, even for embedded systems, is to sample other kernel and system parameters that are not predictable externally. Garzik suggests: EGD demonstrates this, for example: http://egd.sourceforge.net/ It looks at snmp, w, last, uptime, iostats, vmstats, etc. And there are plenty of untapped entropy sources even so, such as reading temperature sensors, fan speed sensors on variable-speed fans, etc. Heck, "smartctl -d ata -a /dev/FOO" produces output that could be hashed and added as entropy. 
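As a rough sketch of the "hash it and add it" idea, a user-space collector might condense a handful of readings into a fixed-size digest before handing it to the kernel. The sample strings in the usage note and the function names are hypothetical. On Linux, bytes written to /dev/random are mixed into the pool without raising the entropy count, so this kind of contribution earns no credit by itself. The second function shows the interface most applications should use for their own random bytes.

```python
import hashlib
import os


def digest_samples(samples):
    """Hash a list of byte strings into one fixed-size blob."""
    h = hashlib.sha256()
    for sample in samples:
        # Length-prefix each sample so adjacent samples cannot blur together.
        h.update(len(sample).to_bytes(4, "big"))
        h.update(sample)
    return h.digest()


def session_token(nbytes=16):
    # os.urandom() reads from the same kernel generator that backs
    # /dev/urandom; it is the right choice for nearly all applications.
    return os.urandom(nbytes)
```

For example, `digest_samples([b"uptime 1234", b"temp 41C"])` yields 32 bytes which could then be written to the pool by a suitably privileged process.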
Another source is from hardware random number generators. The kernel already has support for some, including the VIA Padlock that seems to be well thought of. Not all processors have such support, however. The Trusted Platform Module (TPM) does have random number generation and is becoming more widespread, especially in laptops, but there is no kernel hw_random driver for TPM. Garzik advocates adding a kernel driver for what he calls the "Treacherous Platform Module", but as others pointed out, it can all be done in user space using the TrouSerS library. Even for the hardware random number generators that are supported in the kernel there is no automatic entropy collection, as it is left up to user space to decide whether to do that. This is done to try and keep policy decisions about the quality of the random data out of kernel code. Systems that wish to sample that data should use rngd to feed the kernel entropy pool. rngd will apply FIPS 140-2 tests to verify the randomness of the data before passing it to the kernel. Andi Kleen is not in favor of that approach: Just think a little bit: system has no randomness source except the hardware RNG. you do your strange randomness verification. if it fails what do you do? You don't feed anything into your entropy pool and all your random output is predictable (just boot time) If you add anything predictable from another source it's still predictable, no difference. There is concern that some of the hardware random number generators are poorly implemented or could malfunction, so it would be dangerous to automatically add that data into the pool. Doing the FIPS testing in the kernel is not an option, leaving it up to user space applications to make the decision. There is nothing stopping any superuser process from adding bits to the entropy pool—no matter how weak—but the consensus is that the kernel itself must use sources it knows it can trust. 
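To give a feel for the sort of check involved, here is a minimal sketch of one of the FIPS 140-2 statistical tests, the "monobit" test, applied to 20,000-bit blocks. It is not rngd's code: the function names are hypothetical, and the bounds are the ones given in the original FIPS 140-2 text as best recalled, so they should be verified against the standard before being relied upon.

```python
FIPS_BLOCK_BYTES = 2500  # 20,000 bits per tested block


def monobit_ok(block: bytes) -> bool:
    """Return True if the count of one-bits falls inside the accepted range."""
    if len(block) != FIPS_BLOCK_BYTES:
        raise ValueError("monobit test expects a 20,000-bit block")
    ones = sum(bin(byte).count("1") for byte in block)
    # Assumed bounds from FIPS 140-2: 9,725 < ones < 10,275.
    return 9725 < ones < 10275


def filter_blocks(blocks):
    """Keep only blocks that pass; a real daemon would run further tests too."""
    return [block for block in blocks if monobit_ok(block)]
```

A stuck generator which emits all zeros or all ones fails immediately. A block of repeated 0xAA bytes passes, though, which is why the monobit test is only one of several and why passing it says little about real unpredictability.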
Another instance of this problem—in a different guise—appears in a discussion about random numbers for virtualized I/O, with Garzik asking: "Has anyone yet written a "hw" RNG module for virt, that reads the host's random number pool?" Rusty Russell responded with a patch for a virtio "hardware" random number generator as well as one that adds it into his lguest hypervisor. The lguest patch reads data from the host's /dev/urandom, which is not where H. Peter Anvin thinks it should come from: There is no point in feeding the host /dev/urandom to the guest (except for seeding, which can be handled through other means); it will do its own mixing anyway. The reason to provide anything at all from the host is to give it "golden" entropy bits. The virtio implementation only provides the hw_random implementation, thus it requires user space help to get entropy data into the kernel. Much like any process that can read /dev/random, lguest could exhaust the host entropy pool, so there was some discussion of limiting how much random data guests can request from the device. A guest implementation could then use a small pool of entropy read from the host to seed its own random number generator for the simulated hardware device. Removing the last remaining uses of IRQF_SAMPLE_RANDOM in network drivers seems likely, though some way to mix that data into the entropy pool without giving it any credit is still a possibility. With luck, that will encourage more effort into incorporating new sources of entropy using tools like EGD or, for systems that have it available, random number hardware. For systems that lack the traditional entropy sources, this should lead to a better initialized and fuller pool, while eliminating a potential attack by way of network packet manipulation. Barriers and journaling filesystems Journaling filesystems come with a big promise: they free system administrators from the need to worry about disk corruption resulting from system crashes. 
It is, in fact, not even necessary to run a filesystem integrity checker in such situations. The real world, of course, is a little messier than that. As a recent discussion shows, it may be even messier than many of us thought, with the integrity promises of journaling filesystems being traded off against performance. A filesystem like ext3 works by maintaining a journal on a dedicated portion of the disk. Whenever a set of filesystem metadata changes are to be made, they are first written to the journal - without changing the rest of the filesystem. Once all of those changes have been journaled, a "commit record" is added to the journal to indicate that everything else there is valid. Only after the journal transaction has been committed in this fashion can the kernel do the real metadata writes at its leisure; should the system crash in the middle, the information needed to safely finish the job can be found in the journal. There will be no filesystem corruption caused by a partial metadata update. There is a hitch, though: the filesystem code must, before writing the commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times. There is another hitch: the ext3 and ext4 filesystems, by default, do not use barriers. 
The option is there, but, unless the administrator has explicitly requested the use of barriers, these filesystems operate without them - though some distributions (notably SUSE) change that default. Eric Sandeen recently decided that this was not the best situation, so he submitted a patch changing the default for ext3 and ext4. That's when the discussion started. Andrew Morton's response tells a lot about why this default is set the way it is: Last time this came up lots of workloads slowed down by 30% so I dropped the patches in horror. I just don't think we can quietly go and slow everyone's machines down by this much... There are no happy solutions here, and I'm inclined to let this dog remain asleep and continue to leave it up to distributors to decide what their default should be. So barriers are disabled by default because they have a serious impact on performance. And, beyond that, the fact is that people get away with running their filesystems without using barriers. Reports of ext3 filesystem corruption are few and far between. It turns out that the "getting away with it" factor is not just luck. Ted Ts'o explains what's going on: the journal on ext3/ext4 filesystems is normally contiguous on the physical media. The filesystem code tries to create it that way, and, since the journal is normally created at the same time as the filesystem itself, contiguous space is easy to come by. Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things. The commit record will naturally be written just after all of the other journal log data has made it to the media. That said, nobody is foolish enough to claim that things will always happen that way. Disk drives have a certain well-documented tendency to stop cooperating at inopportune times. 
Beyond that, the journal is essentially a circular buffer; when a transaction wraps off the end, the commit record may be on an earlier block than some of the journal data. And so on. So the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen fairly reliably. There can be no doubt that running without barriers is less safe than using them. Anybody can turn on barriers if they are willing to take the performance hit. Unless, of course, their filesystem is based on an LVM volume (as is the default on certain distributions); it turns out that the device mapper code does not pass through or honor barriers. But, for everybody else, it would be nice if that performance cost could be reduced somewhat. And it seems that might be possible. The current ext3 code - when barriers are enabled - performs a sequence of operations like this for each transaction:

1. The log blocks are written to the journal.
2. A barrier operation is performed.
3. The commit record is written.
4. Another barrier is executed.
5. Metadata writes begin at some later point.

On ext4, the first barrier (step 2) can be omitted because the ext4 filesystem supports checksums on the journal. If the journal log data and the commit record are reordered, and if the operation is interrupted by a crash, the journal's checksum will not match the one stored in the commit record and the transaction will be discarded. Chris Mason suggests that it would be "mostly safe" to omit that barrier with ext3 as well, with a possible exception when the journal wraps around. Another idea for making things faster is to defer barrier operations when possible. If there is no pressing need to flush things out, a few transactions can be built up in the journal and all shoved out with a single barrier.
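The way a journal checksum can stand in for the first barrier can be sketched in a few lines. This is a simplified model, not the jbd2 on-disk format: the dictionary-based commit record and the function names are invented for illustration, and CRC32 is used only as a convenient checksum.

```python
import zlib


def commit_record(log_blocks):
    """Build a commit record carrying a checksum over the transaction's blocks."""
    return {"checksum": zlib.crc32(b"".join(log_blocks))}


def replay_ok(blocks_on_disk, commit):
    """Decide at recovery time whether a transaction may be replayed."""
    if commit is None:
        return False  # the commit record never reached the media
    return zlib.crc32(b"".join(blocks_on_disk)) == commit["checksum"]
```

If the drive wrote the commit record before all of the log blocks and the system then crashed, the blocks found in the journal at recovery time will not match the recorded checksum, `replay_ok()` returns False, and the transaction is discarded rather than replayed in a corrupt state.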
There is also some potential for improvement by carefully ordering operations so that barriers (which are normally implemented as "flush all outstanding operations to media" requests) do not force the writing of blocks which do not have specific ordering requirements. In summary: it looks like the time has come to figure out how to make the cost of barriers palatable. Ted Ts'o seems to feel that way: I think we have to enable barriers for ext3/4, and then work to improve the overhead in ext4/jbd2. It's probably true that the vast majority of systems don't run under conditions similar to what Chris used to demonstrate the problem, but the default has to be filesystem safety. Your editor's sense is that this particular dog is now wide awake and is likely to bark for some time. That may disturb some of the neighbors, but it's better than letting somebody get bitten later on. Mozilla looks to simplify embedding There has been a longstanding complaint about the difficulty in embedding Mozilla into other applications, but an effort is underway to change that. Mozilla evangelist Christopher Blizzard is coordinating a group of interested developers to redefine the application programming interfaces (APIs), libraries, and embedding "story" to try to make it easier for other applications. Mozilla is leading the way, but they want to build a community around embedding, so they are reaching out to developers that wish to help guide the effort. Embedding the Gecko rendering engine—the guts of Mozilla's web content handling—will allow separate programs to deal with and use the web without writing the code themselves. New applications can leverage all of the work done by Mozilla to handle HTML, CSS, Javascript, etc. to concentrate on their specific task. There are several embedding use cases cited on the Mozilla wiki, but the focus of this new effort has been on applications where handling web content is just part of the task at hand. 
To some extent, this effort is probably being driven by the rise of WebKit, which has a specific focus on being embeddable. WebKit is derived from the KHTML rendering engine, which underlies Konqueror, as modified by Apple for their Safari browser. There has been a fair amount of press about WebKit lately, which, along with the defection of the Epiphany browser from Gecko to WebKit, may have given Mozilla more motivation to make Gecko more embeddable. Two meetings have occurred so far to discuss and plan a strategy for providing better embedding support. Blizzard has a lengthy report from the first which goes into some detail about the direction they are headed. The other was held in early May, but there are no reports from that as yet. This is a young project that is looking for more interested folks to get involved. One of the larger complaints about trying to embed Gecko into other applications is that there are multiple ways to do it. It is difficult for a developer to know which is right for their application. Blizzard says: Sometimes you use libxul, sometimes you use the win32 embedding widget, sometimes you use the gtk embedding widget, sometimes you have to reach down into internal interfaces to change things and some times you don't. Having a single story around how to make use of the embedding APIs on your platform and in your environment is one of our goals. Another area that needs work is providing a stable API. One of the downsides to not having stability at the API or application binary interface (ABI) is that security holes in Gecko tend to cascade throughout all the other applications that use it. But Blizzard does not expect to nail down the API right away: So we will have some iteration during early development and will start locking things down once we have a better sense of what people [want] and what we'll need to change internally once we understand about our user's specific use cases. Stable API is a goal, but it's a longer goal.
The more that we have people help us understand and contribute code out of the gate the faster we will get here. The diagram at right gives an overview of how the new API will fit. There is existing code at both the top and bottom of the diagram, while most of the middle is new. Applications will be able to use some of the embedded functionality through platform-specific APIs (for GNOME, Windows, or OS X) or write directly to the new embedding APIs for more capabilities. One of the more interesting decisions is to use the existing APIs as a model, but not for creating a fully compatible implementation. Blizzard explains: Note that trying to be a drop in replacement to WebKit or MSHTML/WebBrowser Control is not on the table. Therein lies madness. You end up chasing compatibility instead of just trying to make something that works really really well. But we can learn what works well from them and what doesn't and hopefully apply that to our new embedding interfaces. The project has started on a roadmap of features that need to be worked on, beginning with the basics. Reorganizing the libraries and header files to create a software development kit (SDK) is high on that list. One of the bigger issues that needs to be addressed is how to handle profiles: the directory (i.e. $HOME/.mozilla) that Mozilla uses for user-specific data storage. Some use cases will want to run without a profile, but the current code expects to always have one available. The full list in the meeting report is worth a read. This is an interesting project that should lead to more interesting applications down the road. The barriers to working with Gecko today are fairly high, but the advantages to using a well-tested, well-supported, and reasonably fast rendering engine for applications that need it are compelling. Those barriers look to be lowering in the not-too-distant future. Blame Fedora. Again.
As your editor writes, the Fedora development list is the scene of an extended, heated discussion about Fedora 9. One might think that some users would be unhappy about the inclusion of KDE 4, say, or maybe it's an issue with Firefox 3, with its refusal to run older extensions and persistent fsync() bug. It would not be hard to imagine users being upset by the continued presence of Codeina. In fact, nobody seems to have much to say about those issues. Instead, a small group of very vocal users is complaining about the X Window System. That, too, might not be completely beyond imagination. Your editor can certainly attest that Rawhide users had more than their share of X-related fun over the course of the Fedora 9 development cycle. The interesting thing, though, is that just about all of the problems reported by Rawhide users got fixed before the final release. So, while Fedora 9 has a lot of very new X infrastructure, it seems to be fairly solid infrastructure. The problem, instead, is that NVIDIA has not shipped a version of its binary-only graphics driver which works on Fedora 9. These vocal users feel that the Fedora Project has done them a major disservice by shipping a release without an NVIDIA-compatible X server. Instead, they say, Fedora should either have declined to ship a "pre-release" server, or it should provide a separate set of packages with an older server for NVIDIA users. NVIDIA seems to agree: Fedora 9 is shipping a pre-release X server. If you can't wait for an updated NVIDIA graphics driver and the limited support provided in 173.08 graphics driver release is insufficient for your purposes, please use the X.Org nv driver or fall back to a supported distribution. There are a few responses to be made to this set of claims, starting with the "pre-release" bit. The server is only "pre-release" by a relatively short period of time, and, more importantly, the ABI for this server release has been frozen for a few months now. 
The X developers have made it clear that the ABI will not change before the 1.5 release ships. So there's no real reason why NVIDIA could not release a driver if it chose to do so. But NVIDIA has not so chosen. More to the point, NVIDIA has implemented a clear policy of not releasing drivers for a given X version until that version appears in a stable release by a major distribution. This is a policy which forces some distributor to ship a version of X which is not supported by NVIDIA. Criticizing a distribution like Fedora for being the first one out with a new X version seems misplaced; if one is averse to the use of new software, there are probably better distributions to be running. But what about the compatibility packages request? Beyond the inconvenient fact that putting resources into supporting proprietary software is contrary to Fedora's policies, that sort of support is expensive to provide. See Adam Jackson's response for a blunt summary of just how expensive. If Fedora developers start putting their time into that sort of project, they will be putting less time into making Fedora itself better. This does not seem like a good tradeoff for Fedora users who, after all, have chosen a distribution with a "100% free software" policy. And, certainly, some Fedora users appreciate the priorities that the developers have taken: Well I'm an Intel & Radeon user and Xorg in F9 is dramatically better better for all my machines. So, yes, if new code improves life for the open source drivers, lets do this again & again in future releases. I don't want my desktop experience held hostage by one company with binary drivers. In fact, X has gotten significantly better, and it has gotten better more quickly as a result of Fedora's decision to go with the upcoming release. Any attempt to maintain compatibility with proprietary drivers would, at best, slow that progress down significantly. Users unquestionably have the right to hook binary-only drivers into their systems. 
But ensuring that those drivers work with current free software is their problem - not the free software developers' problem. The use of proprietary software may have some advantages for some people, but it does put users at the mercy of the only people who can fix or update that software: the software's owner. Most developers (most!) do not overtly wish to make life difficult for users of binary drivers. But asking them to go out of their way to shield binary driver users from the decisions made by their vendors is not just excessive; it actively risks making things worse for free software users. Anybody who wants to criticize Fedora can certainly find any number of valid things to gripe about. Your editor would start with the two obnoxious PackageKit icons which materialized on the GNOME panel, and which, it seems, cannot be made to go away without the application of a fair amount of dynamite. Why does a Rawhide user need a constant reminder that there are updates available? But the failure to provide an NVIDIA-compatible X server does not seem like an appropriate thing to complain about. One should not blame Fedora for being free software. Use Rakarrack for Electric Guitar Effects Rakarrack is a new GUI-based application that can turn a Linux machine into a collection of audio effects for use in the making of music. The developers include Josep Andreu, Daniel Vidal and Hernán Ordiales with help from other individuals. Rakarrack version 0.1.2 was recently announced; it appears to be the first public release. From the project's web page: Rakarrack is a guitar effects processor for GNU / Linux simple and easy to use but it contains features that make it unique in this field of applications. It contains 10 effects: Linear Equalizer, Parametric Equalizer, Compressor, Distorsion, Overdrive, Echo, Chorus, Phaser, Flanger and Reverb. It integrates a tuner and a MIDI converter (experimental). It can also be handled by an external MIDI controller.
The settings designed by the user can be stored in presets and these presets can be used to create banks of effects. The README file in the source code has some information on the motivation behind the project: "This app born after an informal conversation about effects for guitar over GNU/linux. The major part of this apps are discontinued or simply not have new versions after few years. Josep Andreu say on the IRC chat "I can made an app based on the effects set hid[d]en on code of ZynAddSubFX (by Paul Nasca Octavian). Some time after here is the result of our work..." The project screen shots show the GUI layout and various color schemes. Compared to a typical hardware audio processor, the GUI has big advantages over the usual LCD display that most effect units have. One need not hunt around a pushbutton-controlled memory to view and change the many adjustable parameters, and the system disk provides nearly unlimited configuration storage possibilities. To hear Rakarrack in action, listen to the demo by Carlos Pino (ogg format). One might wonder if audio effects processors will soon follow mobile phones, TiVo-like video recorders and consumer-based audio recorders in the transition from proprietary operating systems to Linux-based embedded systems. Such a system could be put together with a small Linux-compatible embedded platform, an LCD interface such as LCDproc (with the aforementioned UI limitations), keyboard and audio interfaces and some DSP software similar to Rakarrack. In the meantime, if you have a need for a versatile effects processor and can spare some CPU cycles, Rakarrack may be an effective solution. The software is available for download here. Session cookies for web applications Two weeks ago on this page, we reported on some Wordpress vulnerabilities that were caused by incorrectly generating authentication cookies. The article was a bit light on details about such cookies, so this follow-up hopes to remedy that.
In addition, Steven Murdoch, who discovered both of the holes, recently presented a paper on a new cookie technique that provides some additional safeguards over other schemes. HTTP is a stateless protocol which means that any application that wishes to track multiple requests as a single session must provide its own way to link those requests. This is typically done through cookies, which are opaque blobs of data that are stored by browsers. Cookies are sent to the browser as part of an HTTP response, usually after some kind of authentication is successful. The browser associates the cookie with the URL of the site so that it can send the cookie value back to the server on each subsequent request. Servers can then use the value as a key into some kind of persistent storage so that all requests that contain that cookie value are treated as belonging to a particular session. In particular, it represents that the user associated with that session has correctly authenticated. The cookie lasts until it expires or is deleted by the user. When that happens, the user must re-authenticate to get a new cookie which also starts a new session. Users find this annoying if it happens too frequently, so expirations are often quite long. If the user explicitly logs out of the application, any server-side resources that are being used to store state information can be freed, but that is often not the case. Users will generally just close their browser (or tab) while still being logged in. It is also convenient for users to be allowed multiple concurrent sessions, generally from multiple computers, which will cause the number of sessions stored to be larger, perhaps much larger, than the number of users. Applications could restrict the number of sessions allowed by a user, or ratchet the expiration value way down, but they typically do not for user convenience. 
This allows for a potential denial of service when an attacker creates so many sessions that the server runs out of persistent storage. For this reason, stateless session cookies [PDF] were created. Stateless session cookies store all of the state information in the cookie itself, so that the server need not keep anything in the database, filesystem, or memory. The data in the cookie must be encoded in such a way that they cannot be forged, otherwise attackers could create cookies that allow them access they should not have. This is essentially where Wordpress went wrong. By not implementing stateless session cookies correctly, a valid cookie for one user could be modified into a valid cookie for a different user. A stateless session cookie has the state data and expiration "in the clear" followed by a secure hash (SHA-256 for example) of those same values along with a key known only by the server. When the server receives the cookie value, it can calculate the hash and if it matches, proceed to use the state information. Because the secret is not known, an attacker cannot create their own cookies with values of their choosing. The other side of that coin is that an attacker can create spoofed cookies if they know the secret. Murdoch wanted to extend the concept such that even getting access to the secret, through a SQL injection or other web application flaw, would not feasibly allow an attacker to create a spoofed cookie. The result is hardened stateless session cookies [PDF]. The basic idea behind the scheme is to add an additional field to stateless session cookies that corresponds to an authenticator generated when an account is first set up. This authenticator is generated from the password at account creation by iteratively calculating the cryptographic hash of the password and a long salt value. Salt is a random string—usually just a few characters long—that is added to a password before it gets hashed, then stored with the password in the clear. 
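As a rough illustration of the stateless session cookie layout described above (state and expiration in the clear, followed by a keyed hash of those same values), here is a minimal Python sketch. It substitutes HMAC-SHA-256 from the standard library for the plain hash-with-key construction, and the key value, the "|" field separator, and the function names make_cookie and verify_cookie are all assumptions made for illustration; none of them come from Wordpress or from Murdoch's paper.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical server-side secret; a real deployment would use a long random key.
SERVER_KEY = b"example-server-secret"


def _tag(payload: str) -> str:
    """Keyed SHA-256 hash (HMAC) over the clear-text part of the cookie."""
    return hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()


def make_cookie(state: str, lifetime_s: int, now: Optional[float] = None) -> str:
    """Build 'state|expiration|tag'; state and expiration stay in the clear."""
    now = time.time() if now is None else now
    expires = str(int(now) + lifetime_s)
    payload = f"{state}|{expires}"
    return f"{payload}|{_tag(payload)}"


def verify_cookie(cookie: str, now: Optional[float] = None) -> Optional[str]:
    """Return the state if the tag matches and the cookie is unexpired, else None."""
    now = time.time() if now is None else now
    try:
        # Split from the right so the state itself may contain the separator.
        state, expires, tag = cookie.rsplit("|", 2)
    except ValueError:
        return None
    expected = _tag(f"{state}|{expires}")
    # Constant-time comparison of the received and recomputed tags.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return None
    if not expires.isdigit() or int(expires) < now:
        return None
    return state
```

The server stores nothing per session here: verification needs only the cookie and the key, and a cookie whose state or expiration has been altered fails the comparison. This sketch involves no salt; the salt described just above matters only for the hardened variant.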
Salt is used to defeat the use of rainbow tables to crack passwords. Hardened stateless session cookies use a 128-bit salt value, then repeatedly calculate HASH(prev|salt), where prev is the password the first time through and the hash value from the previous calculation on each subsequent iteration. The number of iterations is large, 256 for example, but not a secret. Once that value is calculated, it is hashed one last time, without the salt, and then stored in the user table as the authenticator. When the cookie value is created after a successful authentication, only the output of the iterative hash itself is placed in the cookie, not the authenticator that is stored in the database. Cookie verification then must do the standard stateless session cookie hash verification, to ensure that the values have not been manipulated, then hash the value in the cookie to verify it against the authenticator in the database. If it sounds complicated, it is; the performance of doing 256 hashes is also an issue, but it does protect against the secret key being lost. Because an attacker cannot calculate a valid authenticator value to put in the cookie (doing so would require breaking SHA-256), they cannot create their own spoofed cookies. While it is not clear that the overhead of all of these hash calculations is warranted, it is an interesting extension to the stateless session cookie scheme. In his paper, Murdoch mentions some variations that could be used to further increase the security of the technique. Exherbo announced. Sort of... A new distribution called Exherbo has announced its existence. It's at least partly inspired by Gentoo and has borrowed some Gentoo code. Exherbo is not a Gentoo fork in the conventional sense. Although it shares some code with Gentoo, and although many concepts are similar, and although many of the people involved were or are Gentoo developers, most Exherbo code is rewritten from scratch.
Exherbo is not your average distribution, nor does it aspire to be. In fact, Exherbo is not for users at all. Exherbo is designed to be a developer's playground. A place to experiment, to innovate, and to break packages with impunity. So far there isn't much there. The projects page lists only two projects so far: Arbor, an exheres-format (the Exherbo package format) repository for base system and assorted useful packages, and Genesis, which aims to be a replacement init daemon. There are two mailing lists available, the main development list and a commit mailing list. The source repository has some packages in git and a few more in subversion. There's a Bugzilla bug tracker too. So there isn't much yet, but the infrastructure is there to support what may come. Perhaps the most interesting part of the site for most people is the Planet Exherbo, a typical blog space for developers to talk about what they are doing, or would like to do, or whatever. For example you'll find this post [warning, site is currently reported by Firefox 3 as an "Attack Site", content can also be found on the Planet site] by Anders Ossowicki which explains: First of all, Exherbo was announced because some elements of it will be discussed at an upcoming conference. Rather than having a blank page and let people start various rumors it seemed wise to at least let people know what was going on. But in an effort not to hype it above what it was, we didn't hand over all available information and code. Unfortunately Slashdot picked up the announcement because some tard decided it would be a great idea to submit it to them. We did not do that ourselves because, as we state on the website, we have no need for users at the moment and exherbo won't fulfill users demands for the foreseeable future. That is not to say exherbo won't ever become useful but we're not there at the moment. Some very basic things still need to be worked out properly. So there it is. 
Do not download and expect a working distribution. Do not expect a release of a working distribution any time soon. But if you are a developer with an itch to scratch, this might be the place to do so. Just to keep it all together, here's the original LWN announcement and all associated comments. The Grumpy Editor's Guide to distributions for laptops Laptop installation has traditionally been one of the biggest challenges faced by Linux users. These systems come with no end of special-purpose hardware, and they bring particular needs of their own. More recently, getting a laptop into a basic, working state has become less of a challenge - at least, for carefully-chosen systems. Life has gotten much easier in this area. But a contemporary laptop user is not content with "it boots Linux." A well-provisioned laptop in 2008 should be able to make full use of all the hardware, suspend and resume reliably, avoid turning presentations into extended projector-related hassles, and get the most out of the battery. Your editor has, in the past, proved that he could get a laptop to suspend through a sufficient investment of his life into building kernels and tweaking configurations. Your editor, in the present, has little patience for that kind of messing around. The manual creation of power management configurations should really, at this point, go the way of hand-crafting XFree86 modelines. Both were once ways of showing one's advanced Linux skills, but both are now just unnecessary pain. A period of relatively little travel recently made it possible to follow through on an old suggestion from Arjan van de Ven: install a number of distributions on a laptop and compare how they perform. To this end, your editor's aging Thinkpad X31 was pressed into service with offerings from several distributors. In each case, a recent stable (or occasionally beta) distribution was installed while doing a minimum of work beyond clicking "next": no "expert" installations were done.
All available updates were applied. Then, a number of things were checked: Powertop was installed (if not already present) and run to measure the steady-state power usage of the machine. The laptop was as idle as your editor could get it to be, with the backlight at minimum brightness; the system was left long enough for the power usage numbers to stabilize. The idea was to get the lowest possible value for each distribution. Suspend (to RAM) and hibernate (suspend to disk) were tested. Various laptop-specific buttons were tested. The X31, for example, has a button combination which controls a small light which illuminates the keyboard. The wireless network adapter was tested. The X31 presents an interesting complication in that it has an Atheros-based adapter, which, until recently, has not been supportable with free software. An external monitor was connected to determine how much work is required to drive an external projector. During the process, any other events of note were recorded as well. Late in the process of writing this article, your editor was lucky enough to receive a shiny new HP 2510p laptop, thanks to the generosity of the folks at HP (and Bdale Garbee in particular). This machine, being based on Intel chipsets, is fully supported by free software. It promises to make future travels much more pleasant; having a toy like this show up in the mail makes it hard to maintain a grumpy attitude. The above tests were run on the new machine, but only for a subset of the distributions. Debian Lenny (unstable testing) Your editor chose to perform this experiment with a mid-May Debian Lenny testing release, rather than the aging stable distribution. That installed a system with a 2.6.22 kernel which, of course, has no ath5k driver. So no wireless on the X31 for Debian users - at least, not without installing the proprietary MadWifi module. Unsurprisingly, the Debian installer did not offer MadWifi as an option. 
Suspend works, as long as the user does not mind a corrupted display on resume; it's possible to see enough to perform an orderly reboot, but not much more. It is strange that Debian would have this problem; suspend has worked on this laptop with kernels significantly older than 2.6.22. Hibernate was not accessible via its usual place on F12, but, when invoked from the menus, worked properly. Other laptop keys worked without problem. The external display port did not work under Debian. The only way to get video out of that port is to have the monitor plugged in when the system boots. Power consumption on an idle system was 10.7 watts, with the system waking up an average of 67 times every second. This is far from the worst power performance your editor saw over the course of this exercise, but also far from the best. All told, Debian Lenny in its current form is not one of the better systems for laptops - at least, for this particular laptop. Some of the other distributors have made much more progress in this area in recent years. Fedora 9 The installation from the Fedora 9 DVD went without any significant problems. One of the nicest things about this particular distribution was its inclusion of the ath5k driver as part of its 2.6.25 kernel. It seems that ath5k does not work well for all chipsets, but the X31 wireless adapter works quite well with it. So, with Fedora 9, the X31 laptop works with 100% free software. Another thing worthy of note: Fedora 9 was the only distribution tested which offered to install the system on an encrypted disk. Given the frequency with which laptops are lost, encrypting the data on them seems like something a lot of users would want to have. Suspend and hibernate worked on this system, with one little glitch: the backlight remained on after the system was suspended. Your editor ran into the same problem with Ubuntu Hardy during its development cycle; after some conversation in Launchpad, the problem was quickly fixed. 
So a bug has been filed in the Fedora tracker pointing to that resolution, but no activity has been seen so far. The power consumption for Fedora was 8.9 watts, with the processor waking up an average of 45 times per second. The NetworkManager applet offers a "disable wireless" operation which, indeed, will disable the wireless interface. It does not power it down, though, so power consumption is unchanged. Actually uninstalling the ath5k module dropped power consumption to 8.2 watts. Plugging into an external display worked, though it was necessary to bring up the "screen resolution" dialog to bring up the external port. On the 2510p, the display was run in a strange, non-native resolution during the installation, making the text harder to read. The installed system, however, did not have this problem. This system ran at 11.0 watts, with a surprising 145 wakeups per second. Following Powertop's advice, your editor shut down the Bluetooth interface and the HAL CD polling daemon, bringing power usage down to 10.1 watts. Once again, NetworkManager was unable to save any power by disabling the wireless. The hardware's wireless button did power down the interface, bringing power usage down to 8.6 watts. But (and this is true for all distributions tested), NetworkManager was never able to make use of that interface again until the system was rebooted. All told, Fedora 9 works quite nicely for laptop installations; this distribution has made quite a bit of progress over the last few releases. Some grumpiness about the GNOME setup is appropriate, though. Fedora's hackers seem especially enamored of those dialog notifier windows which pop up from the panel icons. The experience is rather like trying to work while being heckled by a sizable crowd of unhelpful bystanders. One window, in particular, announced that closing the lid would no longer suspend the system because some (unnamed) program was blocking that action. 
That might be useful information, but knowing which program was getting in the way would have been more helpful. But even more helpful would be to not have to dismiss little notifier windows all the time. There's also something in the GNOME system on Fedora which feels entitled to adjust the backlight brightness anytime it thinks that the user has screwed it up again. This happens even after the "dim display on idle" options have been disabled, and often results in making the display brighter on an idle system. If the user has set the backlight brightness, the system should not presume to readjust it. One should not have to wrestle with one's computer over the brightness of the display. OpenSolaris Some whim or other inspired your editor to install the OpenSolaris 2008.05 release. It has been almost ten years since the last encounter with Solaris, so, perhaps, it was time for a brief reunion. Brief it was. The installation procedure for this operating system is textual; it seems rather primitive next to the effort Linux distributors have been putting into making their installers attractive. There is a license acceptance stage, where the poor user gets to scroll through all of the licenses applicable to the software in this distribution - 244 licenses in all. There's no requirement to indicate acceptance, though. The installed system worked with the Atheros wireless by virtue of a binary-only driver. Initially it only worked so well, though; this system, from Sun "the network is the computer" Microsystems, installs itself configured to use a local hosts file (only) for hostname lookups. Your editor had to manually tweak nsswitch.conf to get it to use DNS. Sun's equivalent to NetworkManager is the "network automagic daemon," which is obscure in spots but seems to work. There are no power savings to be had from turning off the wireless interface. On the power front, once your editor tracked down a Powertop port, the system was seen to be drawing 11.5 watts.
Unlike with any Linux distribution, Solaris runs the processor at its fastest speed at all times; there does not appear to be any concept of CPU frequency control. The laptop fan runs constantly under Solaris. There is no suspend capability, no hibernate. In general, it would appear that the Solaris developers have not put a whole lot of effort into the power management problem so far - at least, not on x86; the OpenSolaris power management page says that life is better with the Sparc port and that all this goodness is coming to x86 Real Soon Now. The external video port did not work at all under OpenSolaris. Your editor was charmed to notice that the Solaris folks have retained the classic "log off now or risk your files being damaged" message in the shutdown procedure. On the 2510p, the OpenSolaris CD brought up GRUB, but did not succeed in booting into the installer. All told, OpenSolaris has some catching-up to do. Laptops were almost certainly not at the top of the priority list for Project Indiana, but it is still a little discouraging to see how far behind things are. openSUSE 11.0 Beta 3 The openSUSE development cycle is heading toward its close, so your editor decided to go with the beta 3 release. It must be said that this distribution got off on rather the wrong foot; it puts up an end-user license agreement which prohibits redistribution for compensation, bundling openSUSE with any other "offering," reverse engineering, transfer of the software, use in a production environment, or publishing benchmark results (but only if you're a software vendor). Users are required to stop using the software upon termination of the license, which happens after 90 days, after the next release, or whenever Novell says so. 
And, just in case one was considering the crime of using the release for too long: The Software may contain an automatic disabling mechanism that prevents its use after a certain period of time, so You should back up Your system and take other measures to prevent any loss of files or data. There's a certain amount of weasel-wording to the effect that Novell is not trying to take away any rights conferred by the real licenses on the software it ships. So the EULA has little force. But it is not consistent with the mores of the community from which Novell took this software, and it leaves a bad taste in one's mouth. Installation is relatively straightforward, though a bit more mouse-intensive than some other distributions. But one has to watch carefully: openSUSE, by default, configures the system to automatically log in the user account created at installation time. An amusing addition is that, after suspending and resuming the system (which works), a password prompt will be presented, even though none is required on a cold boot. openSUSE, like Fedora, thinks that it's smarter than the user and is entitled to readjust the backlight at any time. As mentioned, suspending the system worked without trouble. Hibernation, however, failed; it goes straight to resume without halting the system. openSUSE ships the ath5k driver, so the wireless interface worked flawlessly with free software. The external monitor port is always on under openSUSE; the dialogs offered to create a Xinerama setup, but that operation failed. Power consumption was 11.2 watts, with 106 wakeups happening per second. Your editor noticed that beagled was running, something which was not observed on other systems. Powertop noticed too, and politely offered to kill it off; that brought the system down to 78 wakeups with slightly less power used. Removing the ath5k driver brought consumption down to 10.8 watts. Experience with the 2510p was quite similar. Hibernate still fails.
Power usage is a low 9.0 watts; 8.8 when the "kill beagled" option is selected. Unfortunately, this lower usage is likely to be a result of the wireless interface not working. NetworkManager is able to present a list of access points, but does not succeed in associating with any of them. This is a device with a free driver, well supported in the 2.6.25 kernel shipped by openSUSE; its failure to work is discouraging. Many of the glitches encountered in this distribution are easily explained by pointing out that it is a beta release. One can only assume that many of them will be fixed up before the final version. With that done, openSUSE has the potential to be a solid system for laptops; many of the right pieces are there. Your editor, though, will have a hard time considering an openSUSE installation; that unpleasant EULA has left a lasting impression. Ubuntu 8.04 Ubuntu made its name partially through its attention to laptop installations, so your editor had reasonably high expectations from the "Hardy Heron" long-term-support release. Those expectations were met, for the most part. The installation CD did its job, and the resulting system worked well. The Ubuntu time zone selector deserves special mention, though: it tries to pan the world map under the mouse, with the effect that the target one is aiming for moves away as one gets close. It's a video game of sorts, but it can be a little frustrating, especially with a laptop-style mouse device. Wireless works, but Ubuntu silently installs the MadWifi driver to bring that about. Suspend and hibernate work, as do the various Thinkpad buttons. Ubuntu demonstrates some of the same backlight obnoxiousness as the other GNOME-based distributions - but quite a bit less of it. This system drew 9.5 watts of power, with 47 wakeups per second. With this configuration, disabling the wireless in NetworkManager did reduce power usage considerably - down to 8.1 watts. 
It would seem that the MadWifi driver still knows something about powering down the hardware that ath5k doesn't. Even so, removing MadWifi entirely dropped consumption still further, to 7.8 watts. On the 2510p, things generally worked well. Power consumption was 10.1 watts, with an amazing 217 wakeups per second, though. Part of the problem here appears to be a bug in the i915 driver which causes it to generate a steady stream of interrupts if the 3D engine is engaged. Ubuntu turns on Compiz by default, causing the video processor to pound on the CPU. Turning off "visual effects" cut the wakeup rate considerably. Following Powertop's advice and disabling the Bluetooth interface as well dropped the system down to 9.7 watts and 50 wakeups per second. Concluding notes Here's a table summarizing some of the results reported above: The second power number, when present, indicates what is achievable with minimal tweaking: turning off wireless or letting Powertop shut things down. More invasive techniques (unloading modules, for example, or changing kernel boot parameters) are not included. For the 2510p, the results are: Two other distributions were tried, but did not make it all the way through the survey process: Gentoo. Playing with Gentoo has been on the list for years. So an install disk was downloaded and your editor launched into the "quick install guide." It is clear that Gentoo employs a rather long value of "quick." This guide prints over many pages, includes 39 "code listings," requires creating each filesystem by hand, etc. Your editor would still like to play with Gentoo, but there was no time for such an exercise now. Life has gotten too short to go through that kind of obstacle course just to get Linux installed on a computer. Slackware. In this case, your editor was able to get through the somewhat rustic Slackware 12.1 installation procedure. It was kind of nostalgic to see LILO again.
The system ran, and even brought up the window system, but the system would lock hard as soon as your editor tried to bring up a terminal window. That, too, was not the sort of experience which had been sought. What comes out of all this work is that the Linux community now has a few good options for laptop-friendly distributions. Getting Linux running well on a laptop need no longer be an act of advanced wizardry. That said, there's clearly still room for improvement. Even well-supported hardware does not always cooperate well. For a laptop system, in particular, it is important to be able to power down unneeded hardware without having to dig into the system configuration or unload kernel modules. If the wireless interface, FireWire port, modem, Bluetooth interface, etc. are not being used, they should not be drawing power. After all, if the laptop's user is going to have something to actually do through a long series of LinuxWorld keynotes, it's important to stretch that battery as far as possible. Progress has been made, but there is more to do. Your editor must now make a choice as to which distribution will remain on these laptops. For the X31, the choice makes itself: Fedora. It works the best while installing only free software. One could retrofit a 2.6.25 kernel into an Ubuntu installation to get the ath5k driver, but it's nicer to not have to do that. For the 2510p, the choice is not quite so clear. It might, in the end, be Ubuntu for the slightly lower power consumption and fewer backlight hassles. The potential (not always realized) for online upgrades might also tip things a little more in the Ubuntu direction. All of that will have to be traded off against Fedora's out-of-the-box encrypted installation, though. But either Ubuntu or Fedora is a fine choice for this machine; it is nice to be in a position where there are a couple of high-quality alternatives. GEM v.
TTM Getting high-performance, three-dimensional graphics working under Linux is quite a challenge even when the fundamental hardware programming information is available. One component of this problem is memory management: a graphics processor (GPU) is, essentially, a computer of its own with a distinct view of memory. Managing the GPU's memory - and its view of system RAM - must be done carefully if the resulting system is intended to work at all, much less with acceptable performance. Not that long ago, it appeared that this problem had been solved with the translation table maps (TTM) subsystem. TTM remains outside of the mainline kernel, though, as do all drivers which use it. A recent query about what would be required to get TTM merged led to an interesting discussion where it turned out that, in fact, TTM may not be the future of graphics memory management after all. A number of complaints about TTM have been raised. Its API is far larger than is needed for any free Linux driver; it has, in other words, a certain amount of code dedicated to the needs of binary-only drivers. The fencing mechanism (which manages concurrency between the host CPUs and the GPU) is seen as being complex, difficult to work with, and not always yielding the best performance. Heavy use of memory-mapped buffers can create performance problems of its own. The TTM API is an exercise in trying to provide for everything in all situations; as a result it is, according to some driver developers, hard to match to any specific hardware, hard to get started with, and still insufficiently flexible. And, importantly, there is a distinct shortage of working free drivers which use TTM. So Dave Airlie worries: I was hoping that by now, one of the radeon or nouveau drivers would have adopted TTM, or at least demoed something working using it, this hasn't happened which worries me... 
The real question is whether TTM suits the driver writers for use in Linux desktop and embedded environments, and I think so far I'm not seeing enough positive feedback from the desktop side All of these worries would seem to be moot, since TTM is available and there is nothing else out there. Except, as it turns out, there is something out there: it's called the Graphics Execution Manager, or GEM. The Intel-sponsored GEM project is all of one month old, as of this writing. The GEM developers had not really intended to announce their work quite yet, but the TTM discussion brought the issue to the fore. Keith Packard's introduction to GEM includes a document describing the API as it exists so far. There are a number of significant differences in how GEM does things. To begin with, GEM allocates graphical buffer objects using normal, anonymous, user-space memory. That means that these buffers can be forced out to swap when memory gets tight. There are clear advantages to this approach, and not just in memory flexibility: it also makes the implementation of suspend and resume easier by automatically providing backing store for all buffer objects. The GEM API tries to do away with the mapping of buffers into user space. That mapping is expensive to do and brings all sorts of interesting issues with cache coherency between the CPU and GPU. So, instead, buffer objects are accessed with simple read() and write() calls. Or, at least, that's the way it would be if the GEM developers could attach a file descriptor to each buffer object. The kernel, however, does not make the management of that many file descriptors easy (yet), so the real API uses separate handles for buffer objects and a series of ioctl() calls. That said, it is possible to map a buffer object into user space. But then the user-space driver must take explicit responsibility for the management of cache coherency. 
To that end there is a set of ioctl() calls for managing the "domain" of a buffer; the domain, essentially, describes which component of the system owns the buffer and is entitled to operate on it. Changing the domains (there are two, one for read access and one for writes) of a buffer will perform the necessary cache flushes. In a sense, this mechanism resembles the streaming DMA API, where the ownership of DMA buffers can be switched between the CPU and the peripheral controller. That is not entirely surprising, as a very similar problem is being solved. This API also does away with the need for explicit fence operations. Instead, a CPU operation which requires access to a buffer will simply wait, if necessary, for the GPU to finish any outstanding operations involving that buffer. Finally, the GEM API does not try to solve the entire problem; a number of important operations (such as the execution of a set of GPU commands) are left for the hardware-specific driver to implement. GEM is, thus, quite specific to the needs of Intel's driver at this time; it does not try for the same sort of generality that was a goal of TTM. As described by Eric Anholt: The problem with TTM is that it's designed to expose one general API for all hardware, when that's not what our drivers want... We're trying to come at it from the other direction: Implement one driver well. When someone else implements another driver and finds that there's code that should be common, make it into a support library and share it. The advantage to this approach is that it makes it relatively easy to create something which works well with Intel drivers. And that may well be a good start; one working set of drivers is better than none. On the other hand, that means that a significant amount of work may be required to get GEM to the point where it can support drivers for other hardware. 
There seem to be two points of view on how that might be done: (1) add capabilities to GEM when needed by other drivers, or (2) have each driver use its own memory manager. The first approach is, in many ways, more pleasing. But it implies that the GEM API could change significantly over time. And that, in turn, could delay the merging of the whole thing; the GEM API is exported to user space, and, as a result, must remain compatible as things change. So there may be resistance to a quick merge of an API which looks like it may yet have to evolve for some time. The second approach, instead, is best described by Dave Airlie: Well the thing is I can't believe we don't know enough to do this in some way generically, but maybe the TTM vs GEM thing proves its not possible. So we can then punt to having one memory manager per driver, but I suspect this will be a maintenance nightmare, so if people decide this is the way forward, I'm happy to see it happen. However the person submitting the memory manager n+1 must damn well be willing to stand behind the interface until time ends, and explain why they couldn't re-use 1..n memory managers. One other remaining issue is performance. Keith Whitwell posted some benchmark results showing that the i915 driver performs significantly worse with either TTM or GEM than without. Keith Packard gets different results, though; his tests show that the GEM-based driver is significantly faster. Clearly there is a need for a set of consistent benchmarks; performance of graphics drivers is important, but performance cannot be optimized if it cannot be reliably measured. The use of anonymous memory also raises some performance concerns: a first-person shooter game will not provide the same experience if its blood-and-gore textures must be continually paged in. Anonymous memory can also be high memory, and, thus, not necessarily accessible via a 32-bit pointer. 
Some GPU hardware cannot address high memory; that will likely force the use of bounce buffers within the kernel. In the end, GEM will have to prove that it can deliver good performance; GEM's developers are highly motivated to make their hardware look good, so there is a reasonable chance that things will work out on this front. The conclusion to draw from all of this is that the GPU memory management problem cannot yet be considered solved. GEM might eventually become that solution, but it is a very new API which still needs a fair amount of work. There is likely to be a lot of work yet to be done in this area. (Thanks to Timo Jyrinki for suggesting this topic.) Getting the right kind of contributions Most free software projects encourage contributors—it is the rare project that has an overabundance—but contributions vary greatly in quality. Encouraging good submissions, or those likely to lead to useful contributions down the road, is an important part of any project. But it is a delicate balance. It can be difficult to determine the kinds of tasks suitable for new contributors that will lead to more important contributions later. The flip side of that coin is how to handle contributions that appear to lead elsewhere. Just wading through the significant submissions on a large project's mailing list—linux-kernel being an excellent example—is extremely time consuming; adding noise, in the form of less-than-completely-useful patches, only makes that job harder. New contributors generally want to start with something relatively easy, though, which leads to the tension. Discouraging patches that aren't particularly useful in a way that won't chase off prospective kernel hackers is hard. Al Viro's rather intemperate call for discussion of a linux-wanking mailing list on linux-kernel is probably not the right approach. He was responding to a patch that reformatted a kernel header file to line up the arguments. 
Viro is not known for his diplomatic skills, but he was responding to a problem that he and other kernel hackers see. There is an increasing amount of trivial cleanup work being submitted that is not translating to more substantial, useful contributions later on. In a followup post, Viro explains his concern: We are getting another self-contained area. Namely, "pick a pointless mechanical work out of ever-growing pile, do it, learn nothing, pick more, maybe look into finding new classes of such mindless stuff". Of course it always had been there; what changes is that now it's not just a transient state one might hit on the way in to be slightly embarrassed about years later. It gets more visible, it gets self-sustained and it gets more and more sticky - it became a subculture in its own right and as far as I can see it is offering more and more incentives to stay in it instead of moving on. There is a real cost associated with posts to linux-kernel. It is the main communication mechanism for kernel development so those involved need to work through the posts there. David Miller laments the time he spends sorting through it all: After deleting all of the noise posted here, I'm often too burnt out to do real work with what's left and just delete that too. :-/ It's worse than the postmaster and list owner mail I process each day for vger.kernel.org Wouldn't you like me to instead have the energy left to review some useful patches? The kernel project provides a number of resources for people who are interested in getting involved but don't know where and how to start. The Kernel Newbies effort is specifically designed to help people get started with the kernel by running a wiki, mailing list, and IRC channel that are focused on the needs of, well, newbies. The idea is to provide information and mentoring that will lead to useful contributions to the kernel. 
A subproject is the Kernel Janitors, who focus on cleaning up kernel code: We go through the Linux kernel source code, doing code reviews, fixing up unmaintained code and doing other cleanups and API conversions. It is a good start to kernel hacking. Both of these efforts are targeted at getting people up to speed so that the kernel as a whole improves. All of the work is important, but there are many other kernel tasks that are not getting done, possibly because contributors are concentrating on cleanups. Andrew Morton has some suggestions for interested folks: One could understand a developer deciding to write a do-nothing whitespace patch as a general throat-clearing exercise, but when asked, I recommend against that. I generally recommend that people just download and test the latest -rc, linux-next and -mm kernels and build and run them. Because they surely will find things which need fixing. Often simple little things like compilation errors, sometimes things which need a bisection search. One problem, though, is that much of that work is more difficult than a whitespace cleanup. For those who are interested in getting their name "up in lights" - in the form of a kernel commit message - the trivial patch path appears easier. Responses like Viro's may deter them, but they risk making linux-kernel look like a hostile place that does not encourage new developers. Some extremely important kernel tasks often get little or no recognition. Submitting detailed bug reports, bisecting the kernel to find the patch that broke things, or testing proposed fixes go unrecognized - at least in the kernel commit log. There have been thoughts of adding tags to the patches that would note these contributions, but no concrete proposal has been made. Two other documentation efforts are underway to assist new kernel developers. Jesper Juhl is working on a Kernel Newbies Guide to be included in the kernel Documentation tree.
It may get folded into Documentation/HOWTO or added as a separate file, but the idea is to steer folks in the right direction - and away from the kinds of patches that raise the ire of kernel hackers. LWN's Jonathan Corbet also mentioned a longer document he is working on with support from the Linux Foundation that should be ready for review in June. There may be some rudeness or hostility towards new developers on linux-kernel, but it rarely rises to the level seen on the openbsd-misc mailing list last October. In response to a query about a list of less complicated tasks for OpenBSD - similar to what the Linux Kernel Janitors maintain - project leader Theo de Raadt, who is really not noted for his diplomacy, blasts: Surely they are too busy whining at us for lists, to actually search for the lists. I'll say it again more clearly -- all of you whiners just plain suck. We know you'll never write diffs, and it is up to you to prove us wrong. If you don't write diffs, we have a difficult time feeling any loss. This is sort of the extreme end of the "show us the code" attitude, but, in his own inimitable way, de Raadt is reacting to the same problem. It takes time and effort to shepherd new kernel hackers. Spending time mentoring folks who will never end up contributing is a waste; that time is better spent finding and fixing bugs or adding features. As Linux hacker Ted Ts'o puts it: The real question is whether people who are wanking about whitespace and spelling fixes in comments will graduate to writing real, useful patches. If they won't, there's no point to encouraging them. How does a project determine which newly interested people will end up being useful contributors versus those that will not? It is a difficult problem that warrants some thought. It surely isn't just kernel projects that have it, as any large, high-profile project will have both a fairly high barrier to entry along with some developers who should be discouraged.
Obviously there will never be a clear-cut "future contributor" test, but there may be ways to get a better idea. In the meantime, flaming well-meaning folks to a crisp is unlikely to help. Referring inappropriate patches to Kernel Newbies or something similar - on the off chance the person can be redirected - might be a start. The Grumpy Editor reviews Claws Mail The Grumpy Editor's guide to graphical email clients was published almost exactly four years ago. At that time, your editor was looking for a client which could replace an MH-based setup which, for all its age, provided a degree of speed and flexibility which was hard to match. Your editor gets a lot of mail - even before lists like linux-kernel are factored in - so there is a real need for a mail client which can process messages without adding even a few seconds of overhead. At that time, none of the clients reviewed were up to the task; it seems that developers of graphical clients value a number of features above speed and flexibility. That review mentioned a client called sylpheed-claws; at that time, this client was being managed as a sort of development branch for sylpheed, with every intent of getting changes back into that system. Since then, sylpheed-claws has evolved into a full fork intended to create an independent application; its new name is Claws Mail. In 2004, your editor had found sylpheed-claws to be an unstable platform at best; in 2008, it seemed like time to go back and see what the developers had accomplished in the last four years. To that end, Claws Mail 3.4.0 was installed and put through its paces. The good news is that this client has, indeed, stabilized over time. Your editor was unable to make it crash - always a nice feature in a mail client. Many of the features which were under development four years ago are now stable and supported - and, generally, well documented. Claws Mail has come a long way.
The Claws Mail developers emphasize configurability, so there's a wide variety of options to wander through. The layout of the window is highly configurable, allowing the user to make the best use of the available screen space. Most aspects of the client's behavior can be tweaked. For somebody who is willing to wander through a long series of configuration screens, Claws Mail offers the ability to adapt the client to just about any set of needs. Dealing with email is a keyboard-intensive activity. One of your editor's biggest complaints with graphical clients has been the need to switch constantly between the keyboard and the mouse - a transition which breaks focus and steals time. Claws Mail has improved things in this regard, in that a wide variety of actions can be handled without the mouse. And, unlike some other graphical clients, changing the keyboard bindings is easily done. For some simple operations - plowing through a mail folder, reading and deleting messages - Claws Mail can be visibly slow. Working over IMAP does not help, of course, but it is slower than with, for example, Thunderbird. In addition, by default, Claws Mail will not display a message which becomes selected as the result of, say, deleting the message before it. So the cycle of deleting a message and viewing the next one requires two keystrokes or clicks. That particular problem can be configured away, of course. Much of the remaining slowness can be mitigated by turning off the "execute moves and deletes immediately" option - a change which also makes it easier to recover from overzealous "delete finger" reflexes. One common bit of workflow for your editor involves feeding a message to an external program. As a general rule, graphical mail clients do not make this possible, though this feature is almost universal in non-graphical clients. Claws Mail includes the concept of "actions," which are, essentially, external programs which act on messages. 
This feature almost solves the problem; actions can be set up with quite a bit of flexibility, and they can be bound to keystrokes. But there is no equivalent to the "|" operation provided by textual clients, meaning that it's not possible to pipe a message into an arbitrary command. Claws Mail only passes through the mail headers which are visible on the screen - and there appears to be no way to configure that behavior. HTML mail appears to be an unfortunate fact of life on the contemporary net. Claws Mail will render such mail as text by default; there are also a couple of plugins which can render HTML mail as intended by its sender. It warmed your editor's heart to note that Claws Mail (unlike certain other clients) does not send HTML mail by default. In fact, it lacks the ability to send HTML mail at all. These developers seem to have their priorities in the right place. Offline operation is another nice feature in a mail client. Claws Mail has such a feature, but your editor was only able to get it partially working. The client can gather up mail for offline reading, but changes and sending of mail lead to a series of "I can't do this" dialogs. Some more configuration (e.g. setting up a local drafts folder) helps in this regard, but this area looks a bit like a work in progress. There's no end of other features, of course. Claws Mail supports encrypted mail, spell checking, filtering of messages on arrival (with an optional Perl plugin for those especially complicated filtering jobs), a mail template facility, color-labeling of mail, tagging, scoring, watching of threads, and more. There are plugins which will turn on a laptop LED when mail arrives, strip attachments, view PDF files, track RSS feeds, deal with vCalendar messages, etc. There is a complex search mechanism which can do a lot more than just string matches. It is, in summary, a highly capable tool with more features than just about anybody is likely to use. So has your editor made the change?
Not yet. Ways around some of the speed issues will have to be found, and it may be necessary to write a plugin to make Claws Mail work with some LWN processes. A few other details need to be made to work correctly. But it can be said that Claws Mail has gotten closer than any other graphical mail client that your editor has tried to date. Responding to ext4 journal corruption Last week's article on barriers described one way in which things could go wrong with journaling filesystems. Therein, it was noted that the journal checksum feature added to the ext4 filesystem would mitigate some of those problems by preventing the replay of the journal if it had not been completely written before a crash. As a discussion this week shows, though, the situation is not quite that simple. Ted Ts'o was doing some ext4 testing when he noticed a problem with how the journal checksum is handled. The journal will normally contain several transactions which have not yet been fully played into the filesystem. Each one of those transactions includes a commit record which contains, among other things, a checksum for the transaction. If the checksum matches the actual transaction data in the journal, the system knows that the transaction was written completely and without errors; it should thus be safe to replay the transaction into the filesystem. The problem that Ted noticed was this: if a transaction in the middle of the series failed to match its checksum, the playback of the journal would stop - but only after writing the corrupted transaction into the filesystem. This is a sort of worst-of-all-worlds scenario: the kernel will dump data which is known to be corrupt into the filesystem, then silently throw away the (presumably good) transactions after the bad one. The ext4 developers quickly arrived at a consensus that this behavior is a bug which should be fixed. But what should really be done is not as clear as one might think.
Ted's suggestion was this: So I think the right thing to do is to replay the *entire* journal, including the commits with the failed checksums (except in the case where journal_async_commit is enabled and the last commit has a bad checksum, in which case we skip the last transaction). By replaying the entire journal, we don't lose any of the revoke blocks, which is critical in making sure we don't overwrite any data blocks, and replaying subsequent metadata blocks will probably leave us in a much better position for e2fsck to be able to recover the filesystem. A bit of background might help in understanding the problem that Ted is trying to solve here. In the default data=ordered mode, ext3 and ext4 do not write all data to the journal before it goes to the filesystem itself. Instead, only filesystem metadata goes to the journal; data blocks are written directly to the filesystem. The "ordered" part means that all of the data blocks will be written before the filesystem code will start writing the metadata; in this way, the metadata will always describe a complete and correct filesystem. Now imagine a journal which contains a set of transactions similar to these (in this order): (1) a file is created, with its associated metadata; (2) that file is then deleted, and its metadata blocks are released; (3) some other file is extended, with the newly-freed metadata blocks being reused as data blocks. Imagine further that the system crashes with those transactions in the journal, but transaction 2 is corrupt. Simply skipping the bad transaction and replaying transaction 3 would lead to the filesystem being most confused about the status of the reused blocks. But just stopping at the corrupt transaction also has a problem: the data blocks created in transaction 3 may have already been written, but, as of transaction 1, the filesystem thinks those are metadata blocks. That, too, leads to a corrupt filesystem.
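The difference between the replay behaviors discussed above can be made concrete with a toy model. What follows is only an illustrative sketch, not the jbd2 code: the Transaction class, the commit() and replay() helpers, and the policy names ("buggy", "stop_before", "replay_all") are all invented for this example, CRC32 stands in for whatever checksum the real journal uses, and the journal_async_commit special case and revoke blocks are ignored entirely.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Transaction:
    blocks: dict    # hypothetical: block number -> bytes to write on replay
    checksum: int   # value stored in the (modeled) commit record


def _crc(blocks):
    """Checksum every block of a transaction, in block-number order."""
    crc = 0
    for num in sorted(blocks):
        crc = zlib.crc32(num.to_bytes(8, "little") + blocks[num], crc)
    return crc


def commit(blocks):
    """Build a transaction whose commit record covers its blocks."""
    return Transaction(dict(blocks), _crc(blocks))


def is_valid(txn):
    return _crc(txn.blocks) == txn.checksum


def replay(journal, policy):
    """Replay a journal; return (filesystem, indices of replayed transactions).

    policy is one of:
      "buggy"       - write the corrupt transaction, then stop
      "stop_before" - stop at the first bad checksum, writing nothing more
      "replay_all"  - replay everything, bad checksums included
    """
    fs, replayed = {}, []
    for i, txn in enumerate(journal):
        ok = is_valid(txn)
        if not ok and policy == "stop_before":
            break
        fs.update(txn.blocks)
        replayed.append(i)
        if not ok and policy == "buggy":
            break
    return fs, replayed


# Three transactions, loosely following the scenario in the text;
# the second one is damaged after its commit record was computed.
journal = [
    commit({10: b"metadata: file created"}),
    commit({10: b"metadata: file deleted"}),
    commit({20: b"metadata: other file extended"}),
]
journal[1].blocks[10] = b"garbage"

for policy in ("buggy", "stop_before", "replay_all"):
    fs, replayed = replay(journal, policy)
    print(policy, replayed)
```

With the second transaction damaged, the "buggy" policy replays the first two transactions (writing the garbage), "stop_before" replays only the first, and "replay_all" replays all three. None of the three is obviously right, which is the point of the disagreement described here.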
By replaying the entire journal, Ted hopes to catch situations like that and leave the filesystem in an overall better shape. It is, perhaps, not surprising that there was some disagreement with this approach. Andreas Dilger argued: The whole point of this patch was to avoid the case where random garbage had been written into the journal and then splattering it all over the filesystem. Considering that the journal has the highest density of important metadata in the filesystem, it is virtually impossible to get more serious corruption than in the journal. The next proposal was to make a change to the on-disk journal format ("one more time") turning the per-transaction checksum into a per-block checksum. Then it would be possible to get a handle on just how bad any corruption is, and even corrupt transactions could be mostly replayed. As of this writing, that looks like the approach which will be taken. Arguably, the real conclusion to take from this discussion was best expressed by Arjan van de Ven in an entirely different context: "having a journal is soooo 1999". The Btrfs filesystem, which has a good chance of replacing ext3 and ext4 a few years from now, does not have a journal; instead, it uses its fast snapshot mechanism to keep transactions consistent. Btrfs may, thus, avoid some of the problems that come with journaling - though, perhaps, at the cost of introducing a set of interesting new problems. The Open Graphics Project prepares to release hardware The Open Graphics Project is working to produce an open-hardware PCI graphics card with open-source drivers. The Wikipedia entry for OGP is a good source for information on the project. The OGP project vision is detailed in the About document: There is a market for graphics hardware with good support for free software and free operating systems (there may or may not be a market for open graphics hardware also, but that is beyond the scope of this project). 
Such a graphics card would benefit from lower software development cost and mindshare in order to be commercially viable. Free software could benefit from the active cooperation of the manufacturer of such a card to create better drivers and to get a card that better meets the requirements of free software. Currently, the market for such cards is not served very well. NVIDIA has no offering in this market, ATI's older cards have very limited support, while their new ones have none, and Matrox has no offering in this market either. XGI are off to a good start but still no 3D code yet. In order to get manufacturers to make such hardware, we have to show that it will be economically viable to do so. OGP is working with the company Traversal Technology to develop the hardware side of the project, known as the OGD1. OGP recently announced that it is now taking pre-orders for the OGD1 board. The card will initially cost $1500; there will be a $100 discount for the first 100 orders. Larger quantity orders will receive a significant discount. The initial price may seem rather high for a video card when similar mass-produced products can be had for several hundred dollars. This can partly be justified by the fact that the OGD1 is more of a development platform than a commodity video card. The OGD1 is also useful for embedded and stand-alone video products, where commodity parts are not available and custom designs are expensive. Additionally, part of the money raised by selling OGD1 cards will go toward funding OGP. The OGD1 FAQ addresses the price issue: "OGD1 is actually very competitively priced compared to FPGA kits with similar capabilities and capacity. For very small FPGA projects, OGD1 may be over-kill. But for larger projects, OGD1 is a must and a bargain." The OGD1 rev B hardware specs explain the board's features and show a photo of the board.
The basic capabilities include a maximum resolution of 2560x1600 pixels, 256MB of 200MHz video memory, DVI, RGB, S-Video and composite video outputs, a PCI/PCI-X interface and user-specified I/O. A number of commercial video card manufacturers have been warming up to the concept of open-source drivers. For several years, Intel's policy has been to provide free drivers for all of their video products. ATI has released documentation for their Graphical Processing Unit (GPU) and AMD is also supporting open-source drivers. The LWN 2007 kernel summit coverage notes: "Starting with the R500 chipset and going forward, AMD will fully support free drivers for all of its graphics processors. This support will not take the form of a release of the current proprietary ATI driver; that code is not considered to be something that anybody would really want to look at. So there will be a clean start. AMD will release specifications and a skeleton driver with the plan to have 2D support working by the end of the year. The company is clearly hoping that the community will do much of the work on the driver, but it also plans to participate actively in the process." While the OGD1 is somewhat in competition with commercial video card manufacturers, the developers are encouraging the release of more open-source drivers and specification information. According to the OGD1 FAQ: "We applaud ATI for doing the right thing and making available their GPU documentation for use by Free Software developers. There are certain market segments where ATI's offering may affect us, but there are other market segments (e.g. embedded systems, single-board computers, servers, special-purpose, etc.) where our growth potential is entirely unaffected. Moreover, they in no way impact our broader goals of enabling hardware hacking and bringing open hardware to the people."
If you are a developer who wants to get involved in the development of video card firmware, or you need a well-supported video architecture for an embedded project, the OGD1 could prove to be an effective solution. An interview with Jim Ready Jim Ready has a long history in the embedded systems market. Most recently, he founded MontaVista, now one of the most successful embedded Linux companies. A recent LWN article took issue with some of Jim's comments; it only seemed fair to give him the opportunity to present his side of the story. Thus, this interview. We asked several questions about MontaVista and its approach to Linux marketing, and Jim took quite a bit of time to answer them in detail. So, without further ado... You have been working in the embedded Linux market for some years. How has that market changed over that time? What do you think are the prospects for embedded Linux now? The single biggest change, and one that gives me great pleasure, is that embedded Linux is now mainstream, part of the landscape, and arguably the fastest growing embedded OS. Believe me, when we started in 1999 that was hardly the case. Of course, the complexity of the devices that our customers are building continues to increase. The underlying hardware is typically a highly integrated system with loads of I/O, for example the SOCs (System On a Chip) such as the TI 3430. That in turn drives both complexity in Linux as well as in the application and middleware software stacks. It's pretty amazing to realize that a little Linux-based handheld device running on batteries is powerful enough to have supplied the State of California government's computing needs not so many years ago. Where do you think MontaVista's sweet spot is in that market? Companies who are highly focused on their value-add who want a first class partner to supply them a suitable Linux and associated services (consulting and training etc.) upon which they will develop their application.
The more formal approach a company takes towards their own software development, the more they care about meeting schedules and the higher their requirements for quality, the "sweeter" MontaVista looks. When you're basing a billion dollar product line on someone's OS you care about what's going in your product. So when Motorola, NEC or Panasonic end up shipping 30 million phones, they need a supplier who can meet their technical, schedule and quality requirements. The phones have to ship for the Christmas season, and no one wants to recall millions of devices. As another example, our Carrier Grade Linux distribution is the core OS in deployed NEC systems which have established 99.9999% availability (that's no more than ~31.5 seconds of unscheduled downtime in a year, which is a DoCoMo requirement). Our Professional Edition is the OS for two different patient monitoring systems that have been through FDA certification. We're truly fortunate to have thousands of customers, both big and not-so-big. Embedded systems vendors have, as a group, been criticized for their lack of participation in the free software development process. Are you happy with MontaVista's level of contribution? What, in your mind, are some of the highlights of MontaVista's community participation? Most contribution surveys show MV in the top 10 of Linux contributors. (No other embedded Linux vendor even makes it into the top 30.) Arguably MontaVista contributes more to Linux relative to our size than any other company in the world. It has always been a cornerstone of our strategy to be a major contributor to Linux. We figured the more gas we poured on the Linux fire the faster we could erode the RTOS suppliers' installed base and speed the movement towards Linux.
We are perhaps best known for our work in helping make Linux real-time capable, but over time we have also made significant contributions in the PPC, XScale, MIPS and ARM trees as well as some other specific projects such as kgdb, LTT, DPM (Dynamic Power Management) etc. Your recent article in Military Embedded Systems was seized upon by a proprietary embedded vendor as proof that Linux is too expensive and difficult for embedded applications. Assuming you disagree with his conclusions, where do you think his reasoning went wrong? Well, there really wasn't any reasoning, just ranting. But having said that, Dan (Dan O'Dowd, Green Hills' CEO) implied that our business model was to allow customers to more or less get in trouble by developing their own from-scratch Linux distribution and then charge them for support to bail them out. Of course that's not what we do. Rather, we build a robust, fully tested and supported embedded Linux distribution (MontaVista Linux Professional Edition, for example) and deliver that to our customers. We then maintain and support that specific version of MontaVista Linux over time, even as the community dashes forward. In fact we have maintenance obligations that can be as long as a decade from initial deployment. That approach gets the customer out of the business of making their own distribution, maintaining and supporting it with all the accompanying costs. So we shield the customer from the complexity and change rate that they otherwise would be exposed to if they were on their own. They don't have to watch all the patches, monitor the newsgroups and otherwise be tied up; they can get on to building their product. Dan purposely ignored the fact that a commercial embedded Linux distribution makes it very easy to use Linux as an embedded OS. I suspect that's why he tried to hide it. Your article suggests that an embedded systems manufacturer using Linux would start by assembling the kernel and development toolchain by hand.
Why do you think they would do that? Even in the absence of vendors like MontaVista, there are numerous options which do not require assembling systems at such a low level; why would a vendor not use one of them? We know from direct experience that even starting with what appears to be "pre-assembled" distribution, from a semiconductor maker or elsewhere, a developer sometimes isn't getting what they think they are. Don't get me wrong, almost any Linux distribution can serve as a starting point, maybe 99.99% perfect, but our customers demand more than that. They want to be at the end of the Linux development cycle, not the beginning. For example, a Linux distribution we recently started working with had the following problems: The code explicitly ignored Linux coding standards by adding hardware dependencies. That code would never be accepted into the upstream trees, and this kind of fork creates debugging issues and additional maintenance burden. The drivers were not SMP-safe, real-time safe nor did they support DPM, yet the device was designed for applications where all three could well be required. In order to take advantage of these advanced features, the device driver would need to be re-written from scratch. The code contained numerous defects that caused the system to crash. Error returns were not checked and other problems indicating very poor coding practices. These are exactly the type of quality issues that should compel businesses to find a Linux commercialization partner. We had the great pleasure of fixing all these problems as we assembled our distribution. Even with our standard practice of pushing back the changes, as you well know, there is no guarantee by the community that these changes make it back into the appropriate open source trees. The fact is it is difficult for a prospective Linux developer to have any idea of the state of the Linux distribution they might select. 
A high quality, commercial distribution can give a developer some peace of mind about what they are getting. For example, MontaVista has a formal development process in place for each of its releases, with quantitative criteria that must be met for defects (0 critical defects, for example, with a sharply declining overall new defect detection curve) before the distribution can ship. Our processes have been formally audited by a number of our largest customers in order to assure themselves of what they are getting from us. And as we mentioned above, the proven results from devices in the field speak to our abilities. As for other starting points, you'll have to ask them about their process. There were some interesting numbers in that article. Where did the 5000 messages/day for kernel.org come from - which lists? Since our engineers live and breathe Linux and the other Open Source components that make up an embedded Linux distribution, we have a pretty good feel for the overall rate of traffic we keep up with. Based upon that experience we measured the overall message traffic that a developer would have to monitor to keep abreast of the daily ins and outs of the typical mix of software they would use for an embedded Linux project. We aggregated the total under kernel.org, which isn't precisely correct, but the paragraph preceding that statement clearly referred to a set of lists one would have to monitor. For example, the monitoring would include not only lkml, but also the lists for other significant parts of the software typically used for an embedded project, including the list maintained for the specific architecture used (MIPS, PPC, ARM, etc.), the real-time list, networking, IPv6, security, advanced filesystems etc. By the way, the lkml list on May 21, 2008 contained ~500 messages, and gcc contained ~100, just for starters. So it wasn't just lkml at 5000 but a total set that can total up to 5000 per day.
Does the fact that lkml is only ~500 a day (and "only" ~200-300 on weekends) make it any less daunting? I don't think so. You say: "a recent security patch that took all of 13 lines of code to implement against an embedded Linux system would have taken more than 800k lines of source patches to implement if the previous trail of patches had been ignored." How was that number arrived at? Which security patch were you describing? How could it possibly require 800,000 lines of patches "to implement" this security fix? This example comes from a sequence back in 2006 (CVE-2006-1528 to be precise), but the "problem" is just as true today. Here's the setup: A developer decides to use Linux and has taken the strategy of minimizing their costs by using a community-maintained Linux kernel. (This story would be true if the developer started with the typical semiconductor distribution, by the way.) The community has a good reputation for stability and defect resolution and therefore the developers think they can minimize their own effort. They start with Linux 2.6.10 and base their device application software on this release. During testing, they notice a defect and find that the defect has been identified (the good news) and fixed in 2.6.13 (the bad news). So now they have a problem: moving up to 2.6.13, where the defect is fixed, also introduces 846,233 new lines of code (the delta between 2.6.10 and 2.6.13). This magnitude of change restarts their QA process, since so much code has changed in the underlying Linux kernel. Their other choice is to backport the fix, which in this particular case is 13 lines (we know because we did it), but now the developer has taken on maintenance of their own Linux, which was what they were trying to avoid in the first place. This drift between a Linux release you have baselined and the fact that defects are often fixed in newer releases presents a less than perfect set of choices for developers.
Whether you wanted to or not, you're in the Linux maintenance business. This drift problem is true of many distributions, not just dealing with kernel.org. If our customer found the same defect, we have the obligation to fix it in the release that they purchased from us; we don't force them to potentially destabilize their environment by sending them a newer kernel release where the defect was originally fixed. I guess it all depends how cavalier one is about changing your underlying operating system after you've developed and tested your application. In general our customers are very strict about minimizing changes, and so are we. At least some of MontaVista's marketing would appear to focus on making Linux look scary. Are you not concerned that this approach might have the effect of making Linux in general look less attractive and, thus, playing into the hands of proprietary systems vendors? No one is a bigger proponent of embedded Linux than MontaVista (and we have the contributions to prove it). But it doesn't do us any good to have folks try Linux, get over their heads and fail, and attribute that failure to Linux. We have seen over the past 8 years any number of projects that got into trouble by not understanding what to expect when they downloaded some Linux and started in by themselves. In fact one of our very earliest customers, back in 1999, had started off building their own Linux, and hit a hardware integration bug that stopped them dead in their tracks for weeks, putting their project in real trouble. Had we not been able to help them out, their alternative was Windows CE. Ugh! Why shouldn't many millions of lines of complex operating system code that changes daily be a little scary, especially when your business is making devices, not operating systems? I think it is a mistake to "trivialize" the difficulty in owning large amounts of any software, including Linux. 
That's why I think it's important for folks to be well informed about what they are getting into, so they can make good decisions on how they will approach using Linux for their system, whether they do-it-themselves or go commercial. In either case we want them to succeed. Is there anything else you would like to pass on to LWN's readers? We can all be quite proud of the enormous progress Linux has made in transforming the embedded OS marketplace from one which was highly fragmented and largely devoid of standards, to an environment based upon a highly functional OS which is a truly open standard: Linux. A whole cast of characters made this possible: visionary customers who dared think it was possible to embed Linux in their devices, the semiconductor companies making sure Linux was ported to their chips, commercial companies such as MontaVista and others making rock solid distributions that were capable of being deployed by the millions, and numerous individuals who made significant contributions along the way. It's a pretty powerful combination that's hard to beat. We would like to thank Jim for taking the time to answer our questions. Using the firmware loader for static data Some device drivers need firmware to load into the hardware at initialization time. The kernel firmware loader interface exists to support that functionality, but it requires help from user space which may not be available in all environments. David Woodhouse has proposed a patch that would eliminate that requirement so that more drivers can use the firmware loader rather than craft their own solution. Embedded devices will be one of the main users of this ability. Many of those do not have a user space filesystem available at boot time—via initrd or initramfs—but they still need to access firmware images to download to peripherals. The new request_firmware() implementation would allow those devices to link the firmware into the kernel while still using the kernel firmware infrastructure. 
Woodhouse has an excellent summary of what he is trying to do in the patch posting: "Some drivers have their own hacks to bypass the kernel's firmware loader and build their firmware into the kernel; this renders those unnecessary. Other drivers don't use the firmware loader at all, because they always want the firmware to be available. This allows them to start using the firmware loader. A third set of drivers already use the firmware loader, but can't be used without help from userspace, which sometimes requires an initrd. This allows them to work in a static kernel." A driver that has static firmware data declares it with a firmware_name and a blob. The firmware_name is used as a key to find the specific firmware when request_firmware() is called; blob is a pointer to the actual code. The declaration adds the firmware to the end of an array of struct builtin_fw elements. When a call is made to request_firmware(), the new code linearly searches the array for a matching key before calling out to user space. This allows any statically created firmware blobs to take precedence over those in the filesystem. Whichever is found is returned. There seemed to be strong agreement that Woodhouse's approach was the right way to go. His original implementation copied the firmware blob before returning it to a request_firmware() caller, which required a vmalloc(), a waste of precious memory on embedded devices. Woodhouse was concerned that some drivers might modify the firmware before loading it into the device. Once he started looking, he found examples of that, but instead of penalizing all devices, he changed the firmware data returned in a struct firmware to be constant. This constitutes an API change for anyone using the request_firmware() interface. In-tree drivers have been modified by Woodhouse appropriately, but out-of-tree drivers need to be aware of the change.
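The lookup order described above, along with the consequence of handing back constant data, can be sketched in ordinary user-space C. This is only an illustration under stated assumptions: the structure layout, the table, and the names builtin_fw_entry, find_firmware(), load_from_userspace() and copy_firmware() are all hypothetical stand-ins, not the kernel's definitions or Woodhouse's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a built-in firmware record; the field
 * names are assumptions made for this sketch. */
struct builtin_fw_entry {
    const char *name;           /* key matched against the requested name */
    const unsigned char *data;  /* read-only firmware image */
    size_t size;                /* image length in bytes */
};

static const unsigned char demo_blob[] = { 0xde, 0xad, 0xbe, 0xef };

/* Images linked into the binary; declarations would append here. */
static const struct builtin_fw_entry builtin_table[] = {
    { "demo.bin", demo_blob, sizeof(demo_blob) },
};

/* Placeholder for the user-space fallback; it always fails here. */
static const struct builtin_fw_entry *load_from_userspace(const char *name)
{
    (void)name;
    return NULL;
}

/* Linear search of the built-in table first, user space second, so a
 * built-in image takes precedence over one in the filesystem. */
const struct builtin_fw_entry *find_firmware(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(builtin_table) / sizeof(builtin_table[0]); i++)
        if (strcmp(builtin_table[i].name, name) == 0)
            return &builtin_table[i];
    return load_from_userspace(name);
}

/* A caller that must patch the image copies it into its own buffer,
 * since the shared data is const. Returns bytes copied, 0 on failure. */
size_t copy_firmware(const struct builtin_fw_entry *fw,
                     unsigned char *buf, size_t buflen)
{
    if (fw == NULL || buflen < fw->size)
        return 0;
    memcpy(buf, fw->data, fw->size);
    return fw->size;
}
```

Only the callers that need a writable image pay for the copy; everybody else reads the image in place, which is the memory saving that motivated the change.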
Any driver that needs to modify the data must make a copy for itself. Another feature that would be useful for memory-constrained devices is compression of the firmware in the kernel image. This is on Woodhouse's radar, but is not seen as a feature that must be in the first release. Not copying the data for most drivers is a bigger win, but compression, especially for large firmware images, might help. In those cases, though, both the compressed and uncompressed data will be in memory while the driver is downloading it. Getting this work included into 2.6.26 has been discussed, even though the merge window has closed. Woodhouse thinks it might be possible: "Well, it's supposedly too late, but it's dead simple and shouldn't have much chance of breaking anything, so I suppose as long as we don't include the korg1212 patch and the rest of the similar patches which we're still working on, that's not such an insane request." This is a fairly simple patch that adds some very useful functionality, especially for the embedded community. Woodhouse has recently stepped up as one of the kernel embedded maintainers, so we may see more things like this from him in the future. It is unlikely that Linus Torvalds will merge this feature so late in the 2.6.26 cycle, but inclusion into 2.6.27 seems quite probable. Fedora's Packager Sponsors Responsibility Policy A Linux distribution is really the sum of its packages. The more packages that are available, the more useful it becomes for a wide range of needs. Case in point, Debian has some 20,000 plus packages available to its users, and to the wide variety of Debian-based distributions. Fedora doesn't have quite as many packages available (yet), but the project hasn't been working at it for nearly as long either. Of course having thousands of packages available is no good if they won't interact well with each other. A distribution isn't just a collection of random binary packages.
Packaging guidelines are critical for ensuring that any package you (the user) install works well with the rest of your system. Fedora is working toward having an ever growing number of volunteers maintaining an ever growing number of packages, and still having an integrated distribution that works whether you want the "Everything Spin" or one of the highly specialized Spins, or something in between. One part of making that happen is having sponsors for new volunteers, and coming up with a policy to guide these sponsors. A draft version of the Packager Sponsors Responsibility Policy was posted to Fedora-devel late last week. The wiki version contains some additions and clarifications. With the new policy, sponsors are maintainers who have a good record of package maintenance and have shown a willingness to review packages and assist others. Sponsors act as mentors for new contributors and as package reviewers, and ultimately they are responsible for making sure that bugs are fixed in their sponsored packages. The policy also indicates some conditions where a sponsorship might be revoked: a maintainer that no longer wishes to contribute to Fedora, a maintainer that refuses to follow guidelines, or irreconcilable differences between the maintainer and the Sponsor. In this event it is the responsibility of the Sponsor to orphan the maintainer's packages, and do any other needed cleanups. Like all such policies, it will evolve over time, but all in all it is a good start to a policy that should help new maintainers get involved with the Fedora project. Attacking network cards When considering the vulnerabilities of a system, the hardware is usually ignored. Software certainly presents the biggest target - fairly easily exploited as we have seen - but a new class of attacks goes directly at the hardware, specifically network cards. The results can range from a permanent denial-of-service to a complete compromise of the card's function.
One researcher has overly cutely dubbed this kind of attack "phlashing" because it attacks the firmware on the card, which is typically stored in flash. The basic idea is that an attacker will rewrite the firmware using an image under their control. That image could do any number of fairly nasty things to the card. Two separate researchers have recently reported on their explorations into this type of attack. Arrigo Triulzi's posting to the, evidently private, Robust Open Source mailing list was reported on Ben Laurie's weblog. Rich Smith of HP also gave a talk on his PhlashDance fuzzing tool at the EuSecWest conference. In both cases, network devices were compromised via insecure remote firmware update capabilities. Smith's research focuses on causing permanent denial-of-service through overwriting the firmware, presumably with garbage. At that point, the card will no longer function and may, in fact, no longer be able to be updated—remotely or locally—which turns it into a paperweight. More importantly, no network traffic can use the device, so if it is situated in a critical router, for example, it could affect a large number of systems. A more insidious attack is described by Triulzi. He replaces the firmware with new code, effectively reprogramming the device to do whatever he wants. One of the attacks goes like this: [...] I've reached my goal of writing a totally transparent firewall bypass engine for those firewalls which are PC-based: you simply overwrite the firmware in both NICs and then perform PCI-to-PCI transfers between the two cards for suitably formatted IP packets (modern NICs have IP "offload engines" in hardware and therefore can trigger on incoming and outgoing packets). The resulting "Jedi Packet Trick" (sorry, couldn't resist) fools, amongst others, CheckPoint FW-1, Linux-based Strongwall, etc. This is of course obvious as none of them check PCI-to-PCI transfers. 
An additional trick, noted by Laurie and others is to use those same techniques to read or write the main memory of the host computer. This could certainly allow sensitive information to leak—or the host itself to be compromised. As Laurie says: "You might even be able to read disk, too, depending on the disk controller." This is truly frightening stuff that is flying under the radar of most network administrators. There are no known attacks in the wild, but it would seem only a matter of time before that happens. This is definitely something to keep an eye on. Other than avoiding vulnerable network hardware—lists of which do not seem to be available from either researcher—there doesn't seem to be much that can be done to deal with phlashing attacks. A properly programmed I/O memory management unit (IOMMU) might alleviate some of the worst cases by disallowing DMA outside of approved ranges, but card vendors need to make updates more difficult. It might be more convenient for an administrator of a large network to update multiple cards across the wire, but the price paid for that convenience seems too high. A summary of 2.6.26 API changes The 2.6.26 development cycle has stabilized to the point that it's possible to look at the internal API changes which have resulted. They include: At long last, support for the KGDB interactive debugger has been added to the x86 architecture. There is a DocBook document in the Documentation directory which provides an overview on how to use this new facility. Some useful features (e.g. KGDB over Ethernet) are not yet supported, but this is a good start. Page attribute table (PAT) support is also (again, at long last) available for the x86 architecture. PATs allow for fine-grained control of memory caching behavior with more flexibility than the older MTRR feature. See Documentation/x86/pat.txt for more information. ioremap() on the x86 architecture will now always return an uncached mapping. 
Previously, it had taken a more relaxed approach, leaving the caching as the BIOS had set it up. The practical result was to almost always create uncached mappings, but with occasional exceptions. Drivers which depend on a cached mapping will now break; they will need to use ioremap_cache() instead. See this article for more information on this change and caching in general. The generic semaphores patch has been merged. The semaphore code also has new down_killable() and down_timeout() functions. The final users of struct class_device have been converted to use struct device instead. The class_device structure, along with its associated infrastructure, has been removed. The nopage() virtual memory area operation has been removed; all in-tree code is now using fault() instead. The object debugging infrastructure has been merged. Two new functions (inode_getsecid() and ipc_getsecid()), added to support security modules and the audit code, provide general access to security IDs associated with inodes and IPC objects. A number of superblock-related LSM callbacks now take a struct path pointer instead of struct nameidata. There is also a new set of hooks providing generic audit support in the security module framework. The now-unused ieee80211 software MAC layer has been removed; all of the drivers which needed it have been converted to mac80211. Also removed are the sk98lin network driver (in favor of skge) and bcm43xx (replaced by b43 and b43legacy). The ata_port_operations structure used by libata drivers now supports a simple sort of operation inheritance, making it easier to write drivers which are "almost like" existing code, but with small differences. A new function (ns_to_ktime()) converts a time value in nanoseconds to ktime_t. Greg Kroah-Hartman is no longer the PCI subsystem maintainer, having passed that responsibility on to Jesse Barnes. 
The seq_file code now accepts a return value of SEQ_SKIP from the show() callback; that value causes any accumulated output from that call to be discarded. The Video4Linux2 API now defines a set of controls for camera devices; they allow user space to work with parameters like exposure type, tilt and pan, focus, and more. On the x86 architecture, there is a new configuration parameter which allows gcc to make its own decisions about the inlining of functions, even when functions are declared inline. In some cases, this option can reduce the size of the kernel's text segment by over 2%. The legacy IDE layer has gone through a lot of internal changes which will break any remaining out-of-tree IDE drivers. A condition which triggers a warning from WARN_ON will now also taint the kernel. The get_info() interface for /proc files has been removed. There is also a new function for creating /proc files which takes a data pointer, ensuring that it will be set in the resulting proc_dir_entry structure before user space can try to access it. The klist type now has the usual-form macros for declaration and initialization: DEFINE_KLIST() and KLIST_INIT(). Two new functions (klist_add_after() and klist_add_before()) can be used to add entries to a klist in a specific position. kmap_atomic_to_page() is no longer exported to modules. There are some new generic functions for performing 64-bit integer division in the kernel. Unlike do_div(), these functions are explicit about whether signed or unsigned math is being done. The x86-specific div_long_long_rem() has been removed in favor of these new functions. There is a new string function which compares two strings while ignoring an optional trailing newline. The prototype for i2c probe() methods has changed; the new id argument supports i2c device name aliasing. One change which did not happen in the end was the change to 4K kernel stacks by default on the x86 architecture.
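Two of the additions in the list above are easy to illustrate outside the kernel: a string comparison that ignores an optional trailing newline, and 64-bit division helpers that state their signedness in the name. What follows is a user-space sketch only; streq_ignoring_newline(), udiv64_rem() and sdiv64_rem() are hypothetical names chosen for this example, and the kernel's real prototypes are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the strings match exactly, or differ only by a single
 * trailing newline on one of them (as with "echo on > file"). */
bool streq_ignoring_newline(const char *a, const char *b)
{
    /* Walk the common prefix. */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    if (*a == *b)   /* both strings ended at the same point */
        return true;
    if (*a == '\0' && *b == '\n' && b[1] == '\0')
        return true;
    if (*b == '\0' && *a == '\n' && a[1] == '\0')
        return true;
    return false;
}

/* Unsigned 64-by-32 division; the remainder comes back through rem.
 * The caller must not pass a zero divisor. */
uint64_t udiv64_rem(uint64_t dividend, uint32_t divisor, uint32_t *rem)
{
    *rem = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/* Signed counterpart. C99 division truncates toward zero, so the
 * remainder takes the sign of the dividend. A zero divisor, or
 * INT64_MIN divided by -1, is undefined and left to the caller. */
int64_t sdiv64_rem(int64_t dividend, int32_t divisor, int32_t *rem)
{
    *rem = (int32_t)(dividend % divisor);
    return dividend / divisor;
}
```

With separate entry points for signed and unsigned operands, the caller's intent is visible at the call site instead of being inferred from the argument types. Returning to the one change that did not happen, the 4K stack default on x86: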
This is still a desired long-term goal, but it is hard to say when the developers might have enough confidence to make this change. Mark Shuttleworth on the future of Ubuntu The life of South African Mark Shuttleworth has been a kind of geek dream: found and sell Internet company for $500+ million in mid-20s; spend $20 million to become the second space tourist; and create a GNU/Linux distribution with a cool name that has become the most popular on the desktop. Here, he talks to Glyn Moody about Ubuntu's new focus on the server side, why Ubuntu could switch from GNOME to KDE, and what happens to Ubuntu and its commercial arm, Canonical, if Shuttleworth were to fall out of a spaceship. I believe you made about $500 million when you sold the certificate authority Thawte Consulting to Verisign in 1999. Creating a GNU/Linux distribution is not the most obvious follow-up to that: what were the steps that led from the early part of your life to the current phase? I have a belief that we should all paint our lives as boldly as we can, and we should explore the things that are the most interesting to us personally. I'm always disappointed when I see people asking the question: "What's going to be the next big thing? What career should I choose? Where will the most money be paid?" It's impossible to know what the future holds, but it's very possible to know what you might be personally interested in. So after Thawte, I spent some time setting up the [Shuttleworth] Foundation and some time setting up the [HBD] Venture Capital group, which I wasn't going to run personally, but which I thought was a good thing to have, and put a team in place to do that. And then I thought: what are the most interesting challenges out there, what are the opportunities that I'm sort of uniquely positioned to do? And the opportunity to go to Russia and train there and then fly was the opportunity that I chose. After that, it was more difficult. There were three things that I was looking at. 
Each of them was exploring the impact of the Internet in society and in commerce, but in different ways. And of all of them, [Ubuntu] is the project I thought was the most interesting, the most difficult, the biggest scale project. And ultimately, if we succeed, the one that will have the biggest impact. So I took this one on. Given that Ubuntu's roots are on the desktop, what's behind the recent shift in strategy to address the server side too? That's not a change in strategy, it's more a pull through. We started with a very narrow focus on the desktop, and that allowed us to punch in. As we've penetrated the industry, there's a natural pull through where someone who's started using us on their desktop has now started setting up Ubuntu on a server. You could always run Ubuntu on a server; there was never a significant reason not to. That body of users has now reached a critical mass on the server, and so our server work is now more responding to that than a shift in strategy. We continue to make the desktop our labor of love; the server requires a very enterprise-oriented approach. We've built out a dedicated team that just handles that. We haven't re-assigned people who are desktop specialists and asked them to test a server. You're not worried you're spreading yourselves too thinly? That is a risk, and that's something we discuss here a lot. There are benefits to offering a platform that can be used in both configurations. We see companies often saying: "We love your desktop. We would definitely choose your desktop if we could also use you on the server." Companies don't like to introduce arbitrary diversity in technology. Everybody has heterogeneous systems, but they don't like to make that situation worse without a very good reason for it. Ubuntu is a very good server for certain use-cases now, just like Ubuntu is a very good desktop for certain use-cases.
Our challenge over the next couple of years is just to broaden the base to which it appeals on both fronts. On the server, it's very much a question of taking time to build the portfolio of relationships with other vendors. There are a lot of applications - what we call solutions - which are now free software-based: standard web-serving, mail-serving and so on. Ubuntu does very well for those. Increasingly, the challenge for us now is to build out the portfolio of non-free software certifications, everything from Oracle through SAP and thousands and thousands of pieces in between. That will take time; it's not something we can achieve overnight. One of the interesting things you've floated recently is the idea of coordinated releases amongst GNU/Linux distributions. Where did the idea come from, and what would the benefits be? What I'm really, profoundly interested in, is how a different approach to technology makes new things possible. The business model of the proprietary software industry is licensing software to new customers or updates of software to existing customers. You make money when you have a new version. So there's an imperative both to release new versions and to have a whole bunch of new features in those versions, specific features that you articulate in advance. In the free software world, we don't have that to cloud our thinking. We accept that development goes at the pace that it goes. If we operate on a basis that we only integrate new features into the platform when we consider them ready, then we can effectively release the platform at any time. When you look at the world through those glasses, it makes sense then to articulate not that you'll ship the product when you have certain features, but you'll ship it at a certain time.
That's actually really useful to all of your users, because they can plan for a particular time. This wasn't our stroke of genius: GNOME was the one that really championed this idea. We took the fairly radical step of saying we could do that across the whole ecosystem. The reason that is radical is because when you're one project, you can make decisions for yourself. But obviously as Ubuntu, we aggregate everyone from the Linux kernel to the GNOME project through the Firefox web browser and the Apache web server, and a ton of stuff in between. So people said: "How on earth will you tell them when to ship their stuff so that you can ship what you want?" We've simply taken the view that we have a very carefully-managed release process, and a new version from one of those projects just doesn't get in unless it's ready at the time it needs to be ready for us to have confidence that it can be integrated and tested. What this has really done is it's separated, very elegantly, the processes associated with R&D, which is focused on what new features we're going to develop, and how to manage that, which is very difficult to put on a particular schedule, from the process of integration, testing and distribution. Now, if I look at a company like Oracle or Microsoft, they have both of those responsibilities. So you end up in this horrible situation where they start saying now: "you'll have the next generation file system in this version and it'll ship on that date." And then reality intervenes, and that puts them in a very awkward situation. We just don't have that. To come back to the original idea, we try to understand what's the essential difference between the way we produce software and the way other people produce software, and what becomes possible because of that, that wasn't possible before, both economically and technologically. That's really what Ubuntu's all about. 
We want to express fully the real nature of free software, as a true commercial, economic entity in its own right. Have you had any feedback yet from the other distributions? Not yet, no. This is something that we've only just started articulating. My hope is that other distributions will see the benefits of synchronizing all of our releases. It doesn't matter whose cycle we converge on, but the idea of synchronizing releases then cues all of those thousands of other projects, that if they want their latest technology shipped by a particular date, if they're able to get it done by a particular time, then that will happen not just with Ubuntu, but with a whole bunch of different platforms. I think it's a powerful idea. There are commercial interests that might block it. It will be interesting to see if the other commercial distributions are nervous to put themselves in a situation where they really are being compared, apples to apples. We'll see. Given that more and more computing will be done in the cloud, is that going to be a threat or an opportunity for Ubuntu? It's a real opportunity, both on the server side and on the client side. To build a server-side cloud infrastructure, you want an operating system which is not licensed per seat or per processor or per machine or per instance. It is simply freely available with all of its updates, and Ubuntu meets that. You can go from a hundred instances in the cloud to a hundred thousand instances in the cloud and legally pay Canonical no more money. You will probably want to have some sort of support relationship with us, but that's entirely separate from the actual licensing of the platform, and it's not required in any way. We cut a deal to support you in the way that you need support. So, economically on the server side that's a very big winner, and Ubuntu is seeing a lot of adoption and traction there. 
You also want something that can be shrunk down so that in your cloud server you only have the pieces which you really need. Every extra piece is an extra piece of disk space that's not being used; it's an extra piece of memory that's not being used. It's an extra thing that can have a security issue that's not being used. And so you may as well get rid of it. Ubuntu's very modular - probably the most modular of the commercial platforms; this comes from our Debian heritage. On the client side, for cloud computing you really want something that "speaks the Internet", and does so very well and very securely, and speaks the web very well and very securely. Ubuntu running Firefox is a really compelling option there. So I think there's a good chance that the next YouTube is running in the cloud and running on Ubuntu. One of the versions of Ubuntu is Gobuntu, which has no non-free elements whereas Ubuntu does have some. Where do you stand on the question of including proprietary elements in a free software distribution? Very clearly, I'm a pragmatist. The non-free pieces of Ubuntu are nothing to do with Canonical's commercial interests. It's not like we've put pieces in there that suit us and don't suit anybody else. They're drivers for hardware where the manufacturers of that hardware haven't yet wrapped their heads around the idea of releasing the source code that makes their hardware work. They're not applications. We work with those vendors to help them understand that in fact it's to their advantage to make their source code open source. They will get much better quality. 
We have real examples of this. We have much better quality drivers with much better reliability that make their hardware more attractive to a bigger portion of the market. But we are willing to put in drivers that are not yet open source, because we figure it's more important to give everybody's grandma the opportunity to actually run free software applications on a free software environment, even if they need some proprietary drivers to get their hardware going. That puts us squarely in the pragmatist camp rather than the purist camp. Gobuntu is an attempt to create a version of Ubuntu that does away with that, but also that is specifically designed to be a platform where other ideas about Copyleft can be explored - this meme about collaborative creation of something is extremely powerful and software is just the tip of the iceberg - we've already seen Wikipedia. I think every industry is going to need to adjust its thinking to say: "How can this participative computing phenomenon energize us?" Gobuntu aimed to do that. People didn't really flock to it, so I think we will stop doing Gobuntu. People liked the idea, but not the people who would actually invest their time in it. I think it's too closely associated with Ubuntu. There's another one called gNewSense, which is exactly the same - Ubuntu with all the non-free stuff taken out. But because it's a separate organization, people feel more comfortable participating there. I don't mind, really. On a related issue, do you worry that GNOME is becoming too involved and enmeshed with Microsoft technologies? If the patent problem with GNOME becomes too great, might you switch to KDE one day? I think it's very healthy that we have multiple desktop platforms, and that they're both committed to free software and sources of innovation and inspiration and competition. We picked GNOME mostly because of its approach to the release cycle and because it had a real strong commitment back in 2004 to usability. 
Since then, KDE has also embraced the idea of usability as a primary driver, and they've done some really interesting things on the technology front. I keep a level of awareness of KDE, and I run KDE at home just to make sure I have a sense of where it's going and how it is doing. I like the rivalry. We might [switch]; it's good to have that option. As for patents in software, I think society does a very bad deal when it gives someone a monopoly in exchange for nothing. The traditional patent deal was you gave someone a monopoly in exchange for disclosure of a trade secret. You can't really have trade secrets in software. Of course, the entrenched interests like to frame this as "patents are all about innovation", when they really aren't. There's very strong, academic, peer-reviewed research that suggests that patents stifle the pace of change and innovation. The real insight with patents is that what society is buying with that monopoly is disclosure. And so the real benefit to society is accelerated disclosure of new ideas - not convincing people to invest. People have ideas all the time. You can't stop the human mind from innovating. People do research and development to win customers, that's what it's really about. It's not to file patents. So the entrenched patent holders really aren't doing much of a service to society when they articulate their position in very flawed terms. With regard to GNOME and Microsoft, I'm not concerned. My view is that to win, you have to have your own vision. You have to have a very clear idea of what you can deliver that's unique. You can't go around sort of chasing someone else's coat tails. So while I respect the people in the free software community who invest a lot of time in making compatible implementations of other people's technology, I don't think that's the real recipe for success for free software. We have to give people a reason to use our platform for itself, not because it's a cheap version of someone else's. 
And in fact, the real successes of free software have been the places where it has just blown away the alternatives. The Internet runs on free software, and not because it has copied anything from Microsoft. The proprietary software guys like to accuse free software of not innovating and not doing anything other than sort of walking down the same path that they've already walked, which is always easier. That's just not true, but guys like the Mono Project are reinforcing that stereotype. Finally, one of the issues that has traditionally preoccupied the Linux community is: what happens if Linus falls under a bus? So I was wondering what happens to Canonical and Ubuntu if you fall under a spaceship or something? Fall *out* of a spaceship! Well, I've made suitable preparations so that if I'm looking the wrong way when the bus comes, economically both Canonical and Ubuntu are fine: there are provisions in my will to make any additional investments needed. As to the other things that I do for the project, they will have to find someone else to step into my shoes. You know, there's a lot of good talent, technically, commercially and socially. I think the project would continue.

Glyn Moody writes about open source at opendotdotdot.

An interview with the new embedded maintainers

Embedded Linux is getting a lot of attention these days. A new kernel.org mailing list, linux-embedded—archived here—has been set up, with discussions and patches already being posted. In addition, Paul Gortmaker and David Woodhouse have volunteered to be the "embedded maintainers" for the kernel to help coordinate the embedded Linux community. They graciously agreed to a joint email interview to shed some light on their new roles. LWN: What is your background with Linux, especially with embedded Linux? David: I got involved in Linux while I was at University, and ended up working at Nortel during one of the summer vacations, on a project for networking over mains power lines. 
It involved Linux boxes as routers, and I was working on solid state storage for that. From that, and from the basic support we had for similar devices in the PCMCIA code base, the MTD [Memory Technology Device] subsystem grew. After a while, I ended up working for Red Hat's engineering services division, doing board ports, drivers and other work. That's when JFFS2 was written, as part of a customer contract. I've been at Red Hat since 2000, in various rôles including spending most of the last couple of years on OLPC. Due to HR misconduct, I handed in my notice on Monday and will be going elsewhere. I spoke to my new boss before volunteering for the 'embedded maintainer' rôle, and he was happy with that—it's another Linux-friendly company where I'll be doing kernel development, and community interaction will continue to be part of my day job. Paul: I started using Linux back in the pre 1.0 days, and having always been one to take things apart and see how it works, being able to do that with the OS appealed to me. I put together various documents to help people back when the entry level into Linux was quite high, started fixing and writing drivers, and on it went from there. In 2005, I joined Wind River, where I've been primarily focused on kernel and board specific kernel patches, and this has given me the opportunity to be exposed to all the different architectures and lots of board variants within each architecture family. LWN: What is the role you see for the embedded Linux maintainers for the kernel? David: A bunch of things really. It's not like a normal maintainer rôle where we take ownership of a certain section of code; it's a bit more fluid. To start with, one of the things we really need to do is work with the various people who are using Linux in "embedded" situations, and help them to work better with the community. That isn't just the vendors of consumer equipment—it's communities like OpenWRT, handhelds.org, OLPC too. 
In no other field is the development of the Linux kernel so balkanised, with people all over the place carrying their own patches or even full trees of code. Another part of the job, which is actually something I've been doing for years anyway, is reviewing general changes in the kernel with a particular mind to how they affect embedded systems. That's not just bloatwatch, although obviously that's a part of it. It also covers things like watching the IBM zSeries folks provide execute-in-place support for block devices under z/VM, and saying "hey, how can we use the same memory management for XIP from flash?". The other main part of it is implementing features in the core kernel which are motivated by "embedded" requirements. Like the tricks for compiling parts of the kernel with "-fwhole-program --combine" to let GCC optimise better and reduce code size, for example. A certain amount of it, especially the new linux-embedded@vger.kernel.org list, I expect to be a kind of targeted kernelnewbies—but obviously with a more specific focus on embedded issues, and to a certain extent on professional developers rather than having such a high proportion of hobbyists. Although I certainly wouldn't want to discourage the hobbyists and students from getting involved with embedded. It's a good way to get people to send you cute toys, after all! I was trying to avoid having a 'linux-embedded' git tree, but for small things like the patch Tim Bird just sent to the linux-embedded list to introduce CONFIG_CONSOLE_TRANSLATIONS, I suppose it makes sense—so I've created that at git://git.infradead.org/embedded-2.6.git. Paul: There are several things that can be done here that will all benefit Linux and its users in the end. 
To start with, I'm hoping that we can close some of the entry level gap between people who don't necessarily track kernel development but yet have decided to develop on Linux with a specific embedded use case in mind, and those people who are long time Linux developers. We can also improve the linkage between people writing feature changes and some of the users of those features who are likely to be impacted, but otherwise would probably go unheard from. We can also look at externally maintained features of interest to embedded users, and try and determine what is the blocking factor that is stopping it (or parts of it) from being merged upstream, and then assist in removing those barriers where possible. LWN: What are the specific problems that are faced by embedded developers trying to use Linux? What can you do to make that situation better? David: I think the biggest single problem has always been the same—it's that people are too focused on getting their stuff out the door as quickly as possible without much thought to working with upstream. Managers aren't budgeting the time to get things merged, and engineers aren't talking about their design early enough that it can be improved before it's a fait accompli. That extra time isn't just about being a good citizen—failing to do it almost always comes back to bite you personally, when you come to do a new product, a product update, or even need to merge in changes from upstream to fix bugs. But everybody seems to need to learn that the hard way, it seems. Paul: A lot of times, you get the situation where a group who is developing for an embedded platform is focused 100% on getting their product up, running and deployed. The developers involved aren't necessarily hard core Linux folks, and it usually plays out by them picking a kernel version, getting their stuff in their local tree, and that is it. 
They may not know git, they probably don't have insight into who the respective subsystem maintainers are, they may perceive LKML as too hostile, or they may not have management buy-in on trying to push stuff upstream. But inevitably, some time passes, and then they have a carry forward task where they try and do a big jump uprev of all their changes, and this repeats forever. Most people who have had to endure the jump uprev vs. a continual tracking and carrying of changes will tell you the jump is not the way to go for a multitude of reasons, but it seems a lesson that everybody ends up having to learn on their own. So, I'm hoping we can get some of these people more aligned with the typical Linux developer workflow—i.e. work from the latest codebase, create logical changesets that can be submission candidates etc. I've been in a couple of meetings recently where we've had the opportunity to educate embedded developers on the advantages of doing this, and the feedback has been positive so far. LWN: The size of the kernel is getting larger in general, is it getting too big for some embedded applications? What, if anything, should be done to remedy that situation? David: I know there are people who'll want to take me out back and shoot me for this, but I think a large part of the solution to that is knowing when Linux is the answer, and accepting that sometimes it isn't. I've always been a bit dubious about implementing XIP support in Linux, for example, on the basis that if you care that much, you should probably have been using something like eCos anyway. Getting back to the real question, though, there are things we can do. The smaller, more efficient "slub" memory allocator is an example, as is the --combine thing I mentioned above. The trick is to find ways to improve matters without just littering the whole thing with ifdefs. Paul: There will always be some hardware or some use case where Linux isn't the right choice. 
It only makes sense to use the right tool for the job. However we do want to make sure that Linux is that right tool in as many cases as possible. On the plus side, the resources that are found on a typical embedded target today are a lot more rich than they were years ago. We just need to make sure that in optimizing for the general x86 use case, we don't inadvertently hinder these more fringe use cases coming from the embedded world. LWN: What do you see as the priorities for kernel work to better support embedded Linux? David: One important priority right now is replacing JFFS2. I wrote it, so I'm allowed to say that—it was good for its time, with NOR flash devices on the order of 32MiB. But having made it work on 1GiB of NAND flash in OLPC, I certainly agree with the observation that it's being pushed past its design limits. I'm very keen to get LogFS and/or UBIFS merged into the kernel and stabilised to the point where we can really start moving to them. We need to revamp the MTD API fairly urgently too. It was derived from the PCMCIA code we had at the time without much planning, and we really need to improve on it now. There may be a certain amount of bias in the items I've picked out, I suppose. Paul: The embedded community as a whole is probably the biggest user of all the architectures outside of the x86 based platforms. Sometimes the functionality of certain things doesn't get much testing outside of the basic x86 family. For example, one of the features that there is considerable interest in is the full preempt_rt patch set. Yet once you stray outside of the x86 family, you are pretty much guaranteed to run into drivers specific to embedded targets that don't play nice once this patch set is in place. This isn't such a surprise, simply because the intersection of the two hasn't been explored yet. 
I think there is value here in getting these types of intersections explored sooner rather than later, by reducing some of the gap between the people working on these sorts of features, and those intending to use them on embedded platforms. LWN: Do you have any specific goals for timelines of getting various features merged? David: Other than "ASAP" for LogFS and UBIFS, not particularly. Stuff is merged when it's ready. Paul: At this point in time, no. I'm not really interested in hijacking anyone's project or feature and trying to drive it towards some self-imposed merge deadline. I'd rather work with them to try and find out what the problem areas are, help with those where possible, be they logistical or technical and get them to a point where they feel that they can offer up merge candidates. LWN: What problems do you foresee in working with other kernel developers who may have less (or no) interest in the concerns of the embedded community? Are there specific features that may be difficult or impossible to get merged? David: I know it's fashionable to claim there's a big disconnect between embedded and big-iron users, but actually there's a lot more overlap than many people seem to realise. I mentioned XIP earlier; can you also guess who was first to implement tickless support? A lot of the problem has been people who show up and throw their code over the wall, then run away. Or worse, those who don't even throw it over the wall at all. People seem to have forgotten how long it took us to educate the enterprise vendors and get them to work nicely with us; we're a bit behind the curve on the embedded side but we're getting there. And organisations like CELF are doing good work on that front, too. Paul: We have to be realistic. 
There will always be some features that either are too invasive to be sensible merge candidates, or the particular feature has such a small user base, that it may not make sense from a carrying cost point of view to target it for inclusion in the standard kernel. Fortunately, I think the Linux developer community at large has generally been flexible in accommodating most things, while at the same time excluding things where the best interest of the kernel as a whole needed to come first. In such cases where a feature doesn't look to be a probable merge candidate, not all is lost. We have to capitalize on the remaining value adds that come with still working with it as if it was a merge candidate. Things like cherry-picking parts of it that are of global value and thus reducing the carrying cost. Or being able to voice an opinion at the appropriate time if the maintainer of the feature notices that a proposed change somewhere else in the kernel will impact the feature that they have been maintaining independently. So I think we still want to work towards getting the people handling these "harder" features of interest to the embedded community working more in parallel with the main kernel community. LWN: The term "embedded Linux" covers a huge spectrum of devices and uses of Linux, everything from devices where the OS is completely invisible up through internet tablets and UMPC devices that are essentially desktops squeezed into a smaller package. Where on that spectrum do your interests lie? What do you think the challenges of trying to support all of those different uses will be? David: My interest is everywhere in that spectrum—and beyond. Too much focus on one small area is the way to ensure that you solve your own problems while pessimising things for other people. I think it's important to keep a certain amount of holistic focus, because that's how we can make sure that Linux scales well both up and down. Paul: Absolutely. 
It seems that people naturally associate embedded with the small and resource constrained end of the scale. But the reality is that there are people who are wanting to use Linux in embedded applications where the baseline hardware has 16 cores and gigabytes of memory. On the one end of the scale you are interested in things like efficiency of resource usage, quick boot times, and on the other end of the scale, your interests are more likely around features relating to specific high availability features that may not be present in the standard kernel tree. These are clearly separate problem spaces, but the common thing they both share is that you've got a group using a specific piece of hardware with a specific use case in mind. This tends to bring out the "works for us, let's get it done and shipping" mentality, and the work tends to never make it out to where others can review it and look at merging bits that make sense. I'm hoping this is where we can make a difference. We would like to thank David and Paul for taking time to answer these questions.

Profiling kernel code coverage

Measuring which lines of code get executed and how often can be a useful tool for debugging or testing. That capability has long been available for user space programs in the form of gcov. A recent patch seeks to allow kernel hackers access to the same tool. There are three main components to making gcov work with the kernel: changing the build to add the -fprofile-arcs -ftest-coverage gcc flags, hooking up the gcc-generated code to record the coverage information, and providing a way for the kernel to output the data to user space. The GCOV_PROFILE kconfig option governs whether to include gcov into the build, while GCOV_PROFILE_ALL activates profiling for the entire kernel. If desired, individual directories and files can be selectively included or excluded from being instrumented. 
The new kernel/gcov directory contains the necessary functions to support the gcc-generated profiling code. This includes handling statically linked kernel code as well as kernel modules that are loaded. Information gathered from code in modules can be either preserved or discarded when they are unloaded. This will allow analysis of the module unloading path that could be useful for detecting resource leaks or other problems in that process. A user space program compiled for gcov will write a binary file to the filesystem for each source file that contains the data corresponding to the execution path through that file. The kernel needs to do that differently, so instead it writes to a file in debugfs. Each source file that is compiled for gcov will store its information in /sys/kernel/debug/gcov/path/file.gcda, where /sys/kernel/debug is the debugfs mount point and path is the path to the file in the kernel tree. The individual .gcda files can also be written to, which will result in setting the accumulated data for that source file back to zero. Once the data has been gathered, gcov can be invoked to produce a file that annotates the source showing each line with the number of times it has been executed. LCOV is a graphical tool that can also be used to examine the coverage information. LCOV and the gcov kernel patches both come from the Linux Test Project which has an extensive kernel test suite and is using gcov to expand the coverage of their tests. As part of the patch set, the seq_file interface has been extended to allow writing of arbitrary binary data to a virtual file. Currently, the seq_file interface is somewhat character oriented, so a function has been added to fs/seq_file.c to provide that ability: As the prototype implies, it writes len bytes from data to the seq_file seq. Efforts to get gcov support into the kernel have been around since 2002, but the code was recently rewritten to be a better fit for recent kernels. 
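To make the debugfs layout described above concrete, here is a minimal Python sketch of how a test script might locate the coverage data for a given source file. It only follows the naming scheme given in the text (/sys/kernel/debug/gcov/path/file.gcda). The helper names gcda_path and reset_coverage are hypothetical and not part of the patch, and the byte written to reset the counters is an assumption, since the text says only that writing to the file zeroes the data.

```python
from pathlib import PurePosixPath

# Assumption: debugfs is mounted at the conventional location named in the text.
DEBUGFS_MOUNT = PurePosixPath("/sys/kernel/debug")


def gcda_path(source_file, debugfs_mount=DEBUGFS_MOUNT):
    """Map a kernel-tree source path (e.g. 'kernel/sched.c') to its .gcda file.

    Hypothetical helper: it simply applies the path/file.gcda scheme
    described in the article.
    """
    src = PurePosixPath(source_file)
    if src.is_absolute():
        raise ValueError("expected a path relative to the kernel tree")
    # Replace the source suffix with .gcda and place it under <debugfs>/gcov/.
    return debugfs_mount / "gcov" / src.with_suffix(".gcda")


def reset_coverage(source_file):
    """Zero the accumulated counters for one source file.

    Assumption: any write to the .gcda file triggers the reset; the
    content written here ('0') is arbitrary.
    """
    with open(gcda_path(source_file), "w") as f:
        f.write("0")


if __name__ == "__main__":
    print(gcda_path("kernel/sched.c"))
```

Running reset_coverage would require a kernel built with the gcov patch and root access to debugfs; gcda_path is pure string manipulation and can be used anywhere.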
In the patch, Peter Oberparleiter says "due to regular requests, I rewrote the gcov-kernel patch from scratch so that it would (hopefully) be fit for inclusion into the upstream kernel." One of the bigger changes is to move the user space interface for gcov from /proc into debugfs. It seems that the technical issues have largely been addressed in the third version of the gcov patch. It can provide useful information, especially for increasing the reach of test coverage—something that can only help reduce kernel bugs—so it could make for a nice kernel addition. Whether it will be picked up into linux-next or -mm and pushed towards an eventual mainline merge remains to be seen.

Fedora harnesses the power of idle computers with Nightlife

Bryan Che, a member of the product management team at Red Hat, recently introduced Fedora Nightlife, a project he hopes will motivate people to donate their computer's downtime to processing data for scientific research and other socially beneficial work. The heavy lifting will be done by the University of Wisconsin-Madison's Condor workload management system which will be responsible for the scheduling and logistics of donated computer power and, in the end, Che hopes to build a network of more than a million nodes of Fedora systems to help process data for everything from Web-indexing projects to medical research. "[W]e have begun talking with the guys over at Wikia about helping them index the Web for their open source search engine," says Che. "It would be great if we could help with tasks for the Fedora infrastructure team at some point with things like automated builds or tests. There is a lot of scientific research that requires lots of computing power, and there are lots of students who could use access to a grid for research. I'd love to have all sorts of projects like these participate." 
Che says that the scope and type of projects that join will largely be dictated by the community, and he's hoping to draw on its collective expertise to "shape Nightlife into a useful community service." His end goal, however, isn't just to make computer resources available but to also develop a basis for larger infrastructure projects. Che notes, "For example, much of the high performance computing (HPC) jobs these days are done on Linux — and particularly Fedora or Red Hat. This puts us in a prime position to be able to shape and build out an entire open source stack for research computing on grids. Today, many people depend upon proprietary (and often costly) libraries for their scientific research or even enterprise computing. Nightlife will provide us a great forum to engage these users to see what are their needs and provide them with a fully open source solution that they can use for their valuable research." Naturally, security is of primary importance when individual computers are clustered together or outside data is inserted into a system for processing. Che says the Nightlife team takes security very seriously and has a number of measures in place to protect users' computers and ensure the application code is safe as well. "[W]e will require that projects that want to leverage Nightlife must distribute their packages and source code through Fedora," explains Che. "This will allow us to inspect what the applications are doing and make sure there isn't anything malicious. On the execution side, one of the capabilities that we've added to Condor recently is integration with our libvirt virtualization technology. This will enable people to execute Nightlife jobs entirely within a virtual machine bubble that is shielded from their physical computers. 
"We are also looking at taking advantage of SELinux technology, which we've developed with the NSA, as a mechanism for tightly locking down jobs so that they can only perform tasks for which they are explicitly granted permission." Che is quick to point out that although Fedora has committed plenty of resources to Nightlife, it is not Fedora-specific — indeed it's not even Linux-specific. Since Condor supports executing processes on many different platforms, Mac OS, Windows, Unix, and Linux distributions of any flavor are capable of donating resources. Not all features will be available on non-Linux platforms, however, if they lack certain underlying technologies. For instance, Windows lacks a built-in hypervisor for running virtual environments and doesn't support SELinux for lock-downs. "I would welcome anyone to donate spare capacity to Nightlife [and] I'd hope that people from all sorts of platforms join us," encourages Che. "[T]here isn't any reason why other communities couldn't participate with us and even start adding some of these capabilities to a Nightlife client for their platforms. From a development standpoint, the upstream code lives in the Condor project at the University of Wisconsin. So, anyone can contribute at that project as well without having any involvement with Fedora." When the project was announced last week, some community members were puzzled as to why Fedora chose to use Condor instead of BOINC, a similar project developed by University of California-Berkeley. Che points out that, though the two efforts have a lot in common, they each have an entirely different focus. He says BOINC's mission is "very much focused on enabling desktops/laptops to provide computing capacity as part of a larger grid [while] Condor is more general-purpose; it can take idle capacity and utilize it well, but it is primarily a good resource scheduler for dedicated grids." 
While some people's comparisons of Condor and BOINC focus on the technology behind the projects, others see similarities between the Condor and Nightlife projects themselves. In actuality, they are really quite different. "Condor's client can use a BOINC client to process data as backfill (when there are no other jobs to run)," notes Che. "So, there is no need to view these projects as competitive. Indeed, one possibility is to use Nightlife to increase the number of machines participating in BOINC." Of course, a low barrier to entry is also important for widespread adoption of Nightlife. Since many enterprises and researchers already run Condor for their dedicated grids, Che says it was a logical choice for the project. Dr. Keith Laidig can easily see the intrinsic value of Nightlife and how it will benefit the scientific community at large. He runs the computing infrastructure for the computational biophysics group in the Department of Biochemistry at University of Washington, and regularly relies on outside computing power to crunch data for researchers. Under the direction of Professor David Baker, about four years ago the group created Robetta, an automated prediction server that farms out work to other systems via Condor which has proven "quite successful at keeping the wait times [for research results] down to the range of 'months'." Laidig recently told the Nightlife community, "If we had access to more computing power, even that available from modest periods of inactivity, we could put that power to work to address many pressing issues in bio-medical research such as HIV/AIDS vaccine design, improvement of existing drugs and/or design new drugs, and creation of new methods to harness biology to address issues such as carbon sequestration." As Laidig explained to LWN, reducing the wait times for results to even a matter of weeks is not out of the question. "Given sufficient computing power, the processing time would drop even further. 
In principle, the processing could take a day or less, depending on computing power, queue depth, etc." Laidig says it's hard to estimate just how much donated computer access his lab would need in order to see an appreciable improvement in research turn-around time, but he estimates they currently use around 300 - 400 processors running around the clock to maintain the current work flow. "Should we gain, say, 1,500 machines that could work for 8 hours... we'd be matching that, taking into account overhead. Now, I'd like to increase that by a factor of ten or more." Though he would be happy to see Nightlife flourish, Laidig notes there are some things to consider before committing your computer's resources to the project. "Not to throw a wet blanket on things, but [there are] issues that folks should keep in mind. Their gear would be using electricity and generating heat. There are also network bandwidth considerations as well; some data-sets necessary to undertake distributed work can be sizable (100 MBs) which can soak up resources. There's the local disk space usage, too. "Folks should be made aware of the 'costs' of contributing. Then, should their desire to contribute outweigh the costs, they should join up!" Some community members have indeed expressed concerns about the energy consumption associated with idling computers and suggest that the ecological harm of running the CPUs and fans of an unattended machine outweighs the benefit of charity in the name of science. In response to an animated discussion about Nightlife at Slashdot, one enterprising commenter tested how much energy his idle computer uses and found that it cost upwards of $70 per year. Che responded to the criticism by acknowledging that although cycle harvesting can be viewed as a "waste of energy," it can, in fact, save energy in the long run. 
In addition to the notion that energy to process data will eventually be used at some point or another anyway, Nightlife also distributes energy consumption over a wide geographical area, thereby reducing the overall energy burden on a single data center or location. Future plans for Nightlife include making it a first-boot option for Fedora so when a user does a fresh install, they are prompted to donate computer power to the project. Of course, before Che can attain his million-node goal, there are several smaller goals to accomplish along the way. "At the earliest, we wouldn't be able to start reaching numbers at this level until after Fedora 10 — and that's probably pushing it." Moving the firmware out It seems that David Woodhouse had a bit of an ulterior motive when he recently reworked the kernel firmware loader. That is not to say the work is not useful in its own right, but one of his goals is more apparent now: removing all of the firmware from the kernel source tree. By making it easy to separate the firmware blobs—while still allowing them to be statically built into kernels—he has provided a possible path for all firmware needed by any Linux driver to live in a single place. The firmware issue is somewhat contentious, with licensing and political issues that tend to annoy the kernel developers. Arguments about the "legality" of distributing firmware with the kernel flare up from time to time. Separate from that, there are some good reasons why it makes sense to keep the firmware in its own place: some distributions need or want to distribute their kernels without firmware blobs and some hardware manufacturers will not allow their firmware to be distributed with the kernel because of concerns about the GPL. The current situation makes it harder for both users and distributors. Woodhouse brought up the idea of pulling the firmware out of the kernel in a post to linux-kernel and ksummit-2008-discuss. 
The agenda for this year's Kernel Summit is under discussion, so he proposed that it be discussed there. He is clearly trying to anticipate the technical concerns that others might have: By the time the kernel summit comes around, we should have made decent progress on moving _all_ the firmware blobs to the firmware/ directory. And at that point I'd like to remove them completely, to a separate git tree and tarball. Those who really want to build them in to their static kernel would still be able to, but it wouldn't be the default behaviour. Unsurprisingly, there are some fairly strenuous objections. David Miller is quite annoyed: Sorry, that's taking things too far. I've fought, like, forever, to keep the tg3 driver with it's firmware in-tree. I refuse to let the driver get broken like that, it's staying working, and that means in-tree and linked into the driver. If debian or whoever else have these concerns and want to rip the firmware out, it is one hundred percent their problem to patch things out of the kernel tree they use. But there are other reasons to collect firmware in one single place, as Arjan van de Ven notes: Right now it's a royal pain for users to get all the right pieces of firmware.... having ONE place to put all that would go a long way of making that side of things easier. If you want to argue that that should be in the kernel tarball itself, you won't hear me complain. But others will... and for that a 2nd tarball might well be the answer. Just we shouldn't need 100 tarballs. There is a very real concern, though, that putting firmware without source into the kernel is a GPL violation. It is impossible to know for sure without a court decision, which is something that no one wants to have to deal with. Companies—and their lawyers—tend to be very conservative when it comes to inviting lawsuits, so removing unrelated, possibly actionable code from the kernel sources is of great benefit to them. As Woodhouse says: And it isn't just the nutters. 
Fedora also wants to ship the firmware in a separate package from the kernel -- since the alleged GPL violation is such a _gratuitous_ risk given that we always use an initrd anyway, and because people want to be able to do 'Free' spins which don't feature the firmware at all, even in the source packages. By making it easier to put all of the firmware in one non-GPL tree, hardware vendors, and their lawyers, may be willing to allow the firmware to be distributed. If Woodhouse's plan for supporting both compile-time and runtime loading of the firmware is successful and reasonably transparent, there should be little difference for kernel developers, but big improvements for users and distributors. It is unclear whether this is something that will be resolved in email, as Woodhouse hopes, or will require a discussion at the Kernel Summit in September, but it's an idea with a lot of merit that may find its way into the mainline at some point. What's up with the Intrepid Ibex The ibex is a type of wild mountain goat with large recurved horns that are transversely ridged in front, found in Eurasia, North Africa, and East Africa. That is the Wikipedia definition. For the Ubuntu community, the Intrepid Ibex is the next version of the operating system, and the topic under discussion at the recent Ubuntu Developer Summit (UDS) in Prague. There are a number of YouTube videos from the UDS, with Mark Shuttleworth and others talking about Intrepid Ibex and related topics. Mark's two-part video covers the various versions of Ubuntu, from the server to the platform-specific remixes, to collaboration with other distributions and upstream developers, and more. The Intrepid Ibex, scheduled for release next October, will also be known as version 8.10 - 8 for the year and 10 for the month of its release. With the Hardy Heron, Ubuntu's second LTS (Long Term Support) release, out the door, the Ibex marks the beginning of a new LTS cycle. As such, it is likely to be a bit wild and woolly. 
It is a time to bring in new technology and experiment with possibilities. There will be plenty of time later for stabilizing the next LTS release, Ubuntu 10.04 LTS, scheduled for release in April 2010. This UDS had several tracks, and some reports are available: Community looks at getting the community involved in a helpful way; Server looks at improving Ubuntu as a server distribution; Platform covers 3G networking, the Education Edition, Firefox KDE integration, boot performance, and more; QA looks at how to measure quality and at bug tracking issues; and Desktop points to several other wiki documents dealing with single sign-on, Compiz, and other desktop topics. ItWire takes a look at the new features planned for Ubuntu's Intrepid Ibex and hopes for improved wireless networking. "Two key design goals were announced from the beginning. Firstly, the user interaction model will be re-engineered to ensure Ubuntu works as well as responsively as possible on hardware ranging from squinty little subnotebooks through to high-end powerful workstations. Secondly, and the one on my mind, is the goal of pervasive internet access. Ubuntu have explicitly stated they wish this release of Ubuntu - finally - to tap into bandwidth wherever you may be. Once more the goat metaphor comes to the fore, "No longer will you need to be a tethered, domesticated animal - you'll be able to roam (and goats do roam!) the wild lands and access the web through a variety of wireless technologies. We want you to be able to move from the office, to the train, and home, staying connected all the way."" Cody Somerville, leader of Xubuntu, tells us Why Xubuntu Intrepid is going to rock. 
The Xubuntu Intrepid Strategy document contains a clear mission statement and takes a deeper look at this variant: Xubuntu will provide (The goal of Xubuntu is to produce) an easy to use distribution, based on Ubuntu, using Xfce as the graphical desktop, with a focus on integration, usability and performance, with a particular focus on low memory footprint. The integration in Xubuntu is at a configuration level, a toolkit level, and matching the underlying technology beneath the desktop in Ubuntu. Xubuntu will be built and developed autonomously as part of the wider Ubuntu community, based around the ideals and values of Ubuntu. Kubuntu fans will find this entry in Jonathan Riddell's blog of interest. "Kubuntu Intrepid Version makes the decision to move to KDE 4 by default (anything else is history). KDE 3 libs will still be available for applications without a KDE 4 version, but the desktop won't be. It's a good time to move to KDE 4 since Intrepid is intended to be a more cutting edge release." The Kubuntu Intrepid wiki takes a look at some specific design goals for the KDE variant. Some of the defaults for Kubuntu have been defined. We will remove sounds for actions. Actions do not need to attract the user's attention. We would like a new, shorter, login sound, Scott Wheeler has volunteered to make one. At the 4.1 release we will consider which default Plasmoids to include. The Desktop Plasmoid should be on by default. And so on. Other goals for Intrepid are still somewhat fuzzy, which means there is still time to make proposals for what you want. If you run Ubuntu (or a variant thereof) but it's not quite what you want it to be, get involved and help make it better. Matplotlib announces a major release Matplotlib is a cross-platform numerical plotting and analysis library for Python: matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. 
matplotlib can be used in python scripts, the python and ipython shell (ala matlab or mathematica), web application servers, and six graphical user interface toolkits. matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc, with just a few lines of code. Matplotlib version 0.71 was last examined on LWN in January, 2005. Recently, major release version 0.98.0 was announced: matplotlib 0.98.0 is a major release which requires python2.4 and numpy 1.1. It contains significant improvements and may require some advanced users to update their code; see migration and API_CHANGES. We are supporting a maintenance branch of the older code available at matplotlib 0.91.3. The major changes in matplotlib 0.98.0 include a complete rewrite of the transformation infrastructure and new support for user-defined transformations and projections. The full list of changes is available in the CHANGELOG file. The new matplotlib release coincides with the new release (version 1.1.0) of NumPy, the fundamental package needed for scientific computing with Python: "This is the first minor release since the 1.0 release in October 2006. There are a few major changes, which introduce some minor API breakage. In addition this release includes tremendous improvements in terms of bug-fixing, testing, and documentation." Looking forward to upcoming and in-progress matplotlib development, the Goals document explains a number of new matplotlib capabilities that are in the planning and development stages. If you need to create any number of scientific data plots, matplotlib is an excellent choice for the job. It truly lives up to the claim of being easy to use. The latest matplotlib source code is available for download here. 
oCERT and oss-security Two recently announced organizations, the Open Source Computer Emergency Response Team (oCERT) and Open Source Software Security (oss-security), are both looking to assist projects with security issues in a complementary way. Each is focusing on different kinds of problems that free software projects face when trying to secure their code. oCERT is modeled on the various national CERT organizations, but focused on free software: The service aims to help both large infrastructures, like major distributions, and smaller projects that can't afford a full-blown security team and/or security resources. This means aiding coordination between distributions and small project contacts. The goal is to reduce the impact of compromises on small projects with little or no infrastructure security, avoiding the ripple effect of badly communicated or handled compromises, which can currently result in distributions shipping code which has been tampered with. In addition, oCERT is doing vulnerability research on free software projects. So far, they have released four advisories after coordinating with the affected projects and distributions. It is a way for team members, or anonymous researchers, to collect their vulnerability research and push it through the process. The oCERT team consists of five security professionals from Inverse Path, Google, and Intel, along with a two-person advisory board. Various projects have also signed up as members including several Linux distributions, security and other free software tools, as well as OpenBSD. In order to become a member, a project or organization must meet some fairly stringent membership requirements that include agreeing to the disclosure policy. Others can submit vulnerability information without becoming a member. 
oss-security is more of an open group, without any formal membership, that is looking to foster more discussion of security issues: The purpose of oss-security is to encourage public discussion of security flaws, concepts, and practices in the open source community. We don't want to simply be an information clearinghouse, or to replace any of the current security lists and groups. The goal is to fill an existing vacuum by encouraging active participation of those interested in the ideas and unique challenges in securing Open Source software. This includes activities such as flaw discovery, understanding, reporting, and overall best practices. The oss-security mailing list is one of the focal points of the group's efforts. Some of the topics currently being discussed are helping projects with code reviews, getting CVE IDs assigned for specific vulnerabilities, and the IP address change of the "L" root nameserver. The oss-security wiki seeks to gather relevant security information from projects and vendors in a single location. This includes security contacts, helpful mailing lists, bug tracker locations, distribution security patch repositories, and the like. If it gets fully populated and is kept up-to-date, it will be a tremendous resource for the community. Up to a certain point, more organizations looking to improve free software security can only be a good thing. Each of these seems to have a focus that is not met by existing groups, so they can hopefully fill a need in the community. The private, vendor-sec mailing list has long been used by distributors, whereas oCERT and oss-security are more focused on the project side of the equation. With luck, that will lead to better code and more coordination for projects and distributions. 
Andrew Morton on kernel development Andrew Morton is well-known in the kernel community for doing a wide variety of different tasks: maintaining the -mm tree for patches that may be on their way to the mainline, reviewing lots of patches, giving presentations about working with the community, and, in general, handling lots of important and visible kernel development chores. Things are changing in the way he does things, though, so we asked him a few questions by email. He responded at length about the -mm tree and how that is changing with the advent of linux-next, kernel quality, and what folks can do to help make the kernel better. Years ago, there was a great deal of worry about the possibility of burning out Linus. Life seems to have gotten easier for him since then; now instead, I've heard concerns about burning out Andrew. It seems that you do a lot; how do you keep the pace and how long can we expect you to stay at it? I do less than I used to. Mainly because I have to - you can't do the same thing at a high level of intensity for over five years and stay sane. I'm still keeping up with the reviewing and merging but the -mm release periods are now far too long. There are of course many things which I should do but which I do not. Over the years my role has fortunately decreased - more maintainers are running their own trees and the introduction of the linux-next tree (operated by Stephen Rothwell) has helped a lot. The linux-next tree means that 85% of the code which I used to redistribute for external testing is now being redistributed by Stephen. Some time in the next month or two I will dive into my scripts and will find a way to get the sufficiently-stable parts of the -mm tree into linux-next and then I will hopefully be able to stop doing -mm releases altogether. So. The work level is ramping down, and others are taking things on. What can we do to help? I think code review would be the main thing. 
It's a pretty specialised function to review new code well. The people who specialise in the area which the new code is changing are the best reviewers but unfortunately I will regularly find myself having to review someone else's stuff. Secondly: it would help if people's patches were less buggy. I still have to fix a stupidly large number of compile warnings and compilation errors and each -mm release requires me to perform probably three or four separate bisection searches to weed out bad patches. Thirdly: testing, testing, testing. Fourthly: it's stupid how often I end up being the primary responder on bug reports. I'll typically read the linux-kernel list in 1000-email batches once every few days and each time I will come across multiple bug reports which are one to three days old and which nobody has done anything about! And sometimes I know that the person who is responsible for that part of the kernel has read the report. grr. Is it your opinion that the quality of the kernel is in decline? Most developers seem to be pretty sanguine about the overall quality problem. Assuming there's a difference of opinion here, where do you think it comes from? How can we resolve it? I used to think it was in decline, and I think that I might think that it still is. I see so many regressions which we never fix. Obviously we fix bugs as well as add them, but it is very hard to determine what the overall result of this is. When I'm out and about I will very often hear from people whose machines we broke in ways which I'd never heard about before. I ask them to send a bug report (expecting that nothing will end up being done about it) but they rarely do. So I don't know where we are and I don't know what to do. All I can do is to encourage testers to report bugs and to be persistent with them, and I continue to stick my thumb in developers' ribs to get something done about them. I do think that it would be nice to have a bugfix-only kernel release. 
One which is loudly publicised and during which we encourage everyone to send us their bug reports and we'll spend a couple of months doing nothing else but try to fix them. I haven't pushed this much at all, but it would be interesting to try it once. If it is beneficial, we can do it again some other time. There have been a number of kernel security problems disclosed recently. Is any particular effort being put into the prevention and repair of security holes? What do you think we should be doing in this area? People continue to develop new static code checkers and new runtime infrastructure which can find security holes. But a security hole is just a bug - it is just a particular type of bug, so one way in which we can reduce the incidence rate is to write less bugs. See above. More careful coding, more careful review, etc. Now, is there any special pattern to a security-affecting bug? One which would allow us to focus more resources on preventing that type of bug than we do upon preventing "average" bugs? Well, perhaps. If someone were to sit down and go through the past five years' worth of kernel security bugs and pull together an overall picture of what our commonly-made security-affecting bugs are, then that information could perhaps be used to guide code-reviewers' efforts and code-checking tools. That being said, I have the impression that most of our "security holes" are bugs in ancient crufty old code, mainly drivers, which nobody runs and which nobody even loads. So most metrics and measurements on kernel security holes are, I believe, misleading and unuseful. Those security-affecting bugs in the core kernel which affect all kernel users are rare, simply because so much attention and work gets devoted to the core kernel. This is why the recent splice bug was such a surprise and head-slapper. I have sensed that there is a bit of confusion about the difference between -mm and linux-next. How would you describe the purpose of these two trees? 
Which one should interested people be testing? Well, things are in flux at present. The -mm tree used to consist of the following: 80-odd subsystem maintainer trees (git and quilt), eg: scsi, usb, net. various patches which I picked up which should be in a subsystem maintainer's tree, but which for one of various reasons didn't get merged there. I spend a lot of time acting as backup for leaky maintainers. patches which are mastered in the -mm tree. These are now organised as subsystems too, and I count about 100 such subsystems which are mastered in -mm. eg: fbdev, signals, uml, procfs. And memory management. more speculative things which aren't intended for mainline in the short-term, such as new filesystems (eg reiser4). debugging patches which I never intend to go upstream. The 80-odd subsystem trees in fact account for 85% of the changes which go into Linux. Pretty much all of the remaining 15% are the only-in-mm patches. Right now (at 2.6.26-rc4 in "kernel time"), the 80-odd subsystem trees are in linux-next. I now merge linux-next into -mm rather than the 80-odd separate trees. As mentioned previously, I plan to move more of -mm into linux-next - the 100-odd little subsystem trees. Once that has happened, there isn't really much left in -mm. Just the patches which subsystem maintainers leaked. I send these to the subsystem maintainers. the speculative not-for-next-release features the not-to-be-merged debugging patches. Do you have any specific goals for the development of the kernel over the next year or so? What would they be? Steady as she goes, basically. I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there! Eventually we should get into more of a maintenance mode where we just fix bugs, tweak performance and add new drivers. Famous last words. And it's just vaguely possible that we're starting to see that happening now. 
I do get a sense that there are less "big" changes coming in. When I sent my usual 1000-patch stream at Linus for 2.6.26 I actually received an email from him asking (paraphrased) "hey, where's all the scary stuff?" In the early-May discussions, Linus said a couple of times that he does not think code review helps much. Do you agree with that point of view? Nope. How would you describe the real role of code review in the kernel development process? Well, it finds bugs. It improves the quality of the code. Sometimes it prevents really really bad things from getting into the product. Such as rootholes in the core kernel. I've spotted a decent number of these at review time. It also increases the number of people who have an understanding of the new code - both the reviewer(s) and those who closely followed the review are now better able to support that code. Also, I expect that the prospect of receiving a close review will keep the originators on their toes - make them take more care over their work. There clearly must be quite a bit of communication between you and Linus, but much of it, it seems, is out of the public view. Could you describe how the two of you work together? How are decisions (such as when to release) made? Actually we hardly ever say anything much. We'll meet face-to-face once or twice a year and "hi how's it going". We each know how the other works and I hope we find each other predictable and that we have no particular issues with the other's actions. There just doesn't seem to be much to say, really. Is there anything else you would like to say to LWN's readers? Sure. Please do contribute to Linux, and a great way of doing that is to test latest mainline or linux-next or -mm and to report on any problems which you encounter. Nothing special is needed - just install it on as many machines as you dare and use them in your normal day-to-day activities. If you do hit a bug (and you will) then please be persistent in getting us to fix it. 
Don't let us release a kernel with your bug in it! Shout at us if that's what it takes. Just don't let us break your machines. Our testers are our greatest resource - the whole kernel project would grind to a complete halt without them. I profusely thank them at every opportunity I get :) We would like to thank Andrew for taking time to answer our questions. Implications of pure and constant functions Introduction Attributes and why you should use them Free Software development is often a fun task for developers, and it is its low barrier to entry (on average) that makes it possible to have so much available software for so many different tasks. This low barrier to entry, though, is also probably the cause of the widely varying quality of the code of these projects. Most of the time, the quality issues one can find are not related to developers' lack of skill, but rather to lack of knowledge of how the tools work, in particular, the compiler. For non-interpreted languages, the compiler is probably the most complex tool developers have to deal with. Because a lot of Free Software is written in C, GCC is often the compiler of choice. Modern compilers are also supposed to do a great job at optimizing the code by taking code, often written with maintainability and readability in mind, and translating it into assembler code with a focus on performance. Code analysis for optimization (which is also used for warnings about the code) has the task of taking a semantic look at the code, rather than syntactic, and identifying various fragments of algorithms that can be replaced with faster code (or with code that uses a smaller memory footprint, if the user desires to do so). This task is a pretty complex one and relies on the compiler knowing about the function called by the code. 
For instance, the compiler might know when to replace a call to a (local, static) function with its body (inlining) by looking at its size, the number of times it is called, and its content (loops, other calls, variables it uses). This is because the compiler can give a semantic value to the code for a function, and can thus assess the costs and benefits of a particular transformation at the time of its use. I specified above that the compiler knows when to inline a function by looking at its content. Almost all optimizations related to function calls work this way: the compiler, knowing the body of a function, can decide when it is worthwhile to replace a call with its body; when it is possible to avoid calling the function at all; and when it is possible to call it just once and thereby avoid multiple calls. This means, though, that these optimizations can be applied only to functions that are defined in the same unit in which they are used. These functions are usually limited to static functions (functions that are not defined as static can often be overridden both at link time and runtime, so the compiler cannot safely assume that what it finds in the unit is what the code will be calling). As this is far from optimal, modern compilers like GCC provide a way for the developer to provide information about the semantics of a function, through the use of attributes attached to declarations of functions and other symbols. These attributes provide information to the compiler on what the function does, even though its body is not available. Consequently, the compiler can optimize at least some of its calls. This article will focus on two particular attributes that GCC makes available to C developers: pure and const, which can declare a function as either pure or constant. 
The next section will provide a definition of these two kinds of functions, and after that I'll get into an analysis of some common optimizations that can be performed on the calls of these functions.

As with all the other function attributes supported by GCC and ICC, the pure and const attributes should be attached to the declarative prototype of the function, so that the compiler knows about them when it finds a call to the function even without its definition. For static functions, the attribute can be attached to the definition by putting it between the return type and the name of the function:

Pure and Constant Functions

For the purposes of this article, functions can be divided into three categories, from the smallest to the biggest: constant functions, pure functions, and all remaining functions, which can be called normal functions.

As you can guess, constant functions are also pure functions, but not all pure functions are constant functions. In many ways, constant functions are a special case of pure functions. It is, therefore, best to first define pure functions and how they differ from all the rest of the functions.

A pure function is a function with basically no side effect. This means that pure functions return a value that is calculated based on given parameters and global memory, but cannot affect the value of any global variable. Pure functions cannot reasonably lack a return value (i.e. have a void return type).

GCC documentation provides strlen() as an example of a pure function. Indeed, this function takes a pointer as a parameter, and accesses it to find its length. This function reads global memory (the memory pointed to by parameters is not considered a parameter), but does not change it, and the value returned derives from the global memory accessed.

A counter-example of a non-pure function is the strcpy() function. This function takes two pointers as parameters.
It accesses the latter to read the source string, and the former to write to the destination string. As I said, the memory areas pointed to by the parameters are not parameters on their own, but are considered global memory and, in that function, global memory is not only accessed for reading, but also for writing. The return value derives directly from the parameters (it is the same as the first parameter), but global memory is affected by the side effect of strcpy(), making it not pure.

Because a pure function leaves the global memory state untouched, two calls to the same pure function with the same parameters, with no change to global memory in between, will have to return the same value. As we'll see, it is a very important assumption that the compiler is allowed to make.

A special case of pure functions is constant functions. A pure function that does not access global memory, but only its parameters, is called a constant function. This is because the function, being unrelated to the state of global memory, will always return the same value when given the same parameters. The return value is thus derived directly and exclusively from the values of the parameters given.

The way a constant function "consumes" pointers is very different from the way other functions do: it can handle them as both parameter and return value only if they are never dereferenced, for accessing the memory they are referencing would be a global memory access, which breaks the requirements of constant functions.

Of course these requirements have to apply not only to the operations in the given function, but also recursively to all the functions it calls. A function can at best be of the same kind as the least restrictive kind of function it calls. If it calls a normal function, it can only be a normal function itself; if it calls only pure functions, it can be either pure or normal, but not constant; and if it calls only constant functions, it can be constant.
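To make the three categories concrete, here is a minimal sketch using GCC's attribute syntax. The function names (square, count_char, fill, twice, quadruple) are invented for illustration and do not come from the article; only the `__attribute__((pure))` and `__attribute__((const))` spellings are GCC's own.

```c
#include <stddef.h>

/* Constant: the result depends only on the value of the argument. */
__attribute__((const)) int square(int x);

/* Pure: reads the memory behind s, but writes nothing. */
__attribute__((pure)) size_t count_char(const char *s, char c);

/* Normal: writes through dst, so it is neither pure nor constant. */
void fill(char *dst, char c, size_t n);

int square(int x)
{
    return x * x;
}

size_t count_char(const char *s, char c)
{
    size_t hits = 0;

    for (; *s != '\0'; s++)
        if (*s == c)
            hits++;
    return hits;
}

void fill(char *dst, char c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = c;
}

/* On a static definition the attribute can sit between type and name. */
static int __attribute__((const)) twice(int x)
{
    return 2 * x;
}

/* Calls only a constant function, so it could be declared constant too. */
int quadruple(int x)
{
    return twice(twice(x));
}
```

Note that count_char() dereferences its pointer, which is why it can be pure at best, while square() and twice() touch nothing but their arguments.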
As with inlining, when no attribute is attached to a function, the compiler will be able to decide whether it is pure or constant only if the function is static (with the exception of special cases for freestanding code and other advanced options). When a function is not static, even if it's local, the compiler will assume that the function can be overridden at link time or at runtime, so it will not make any assumption based on the body of the definition it may find.

Optimizing Function Calls

Why should developers bother with marking functions pure or constant, though? As I said, these two attributes help the compiler to know some semantic meaning of a function call, so that it can apply more aggressive optimization than it can to normal functions.

There are two main optimizations that can be applied to these kinds of functions: CSE (Common Sub-expression Elimination) and DCE (Dead Code Elimination). We'll soon see in detail, with the help of the compiler itself, what these two consist of. Their names, however, are already rather explicit: CSE is used to avoid duplicating the same code inside a function, usually factoring out the code before branching or storing the results of common operations in temporary variables (registers or stack), while DCE will remove code that would never be executed, or that would be executed but whose result would never be used.

Both of these optimizations can be implemented in the source code, to an extent, reducing the usefulness of declaring functions pure or constant. On the other hand, as I'll demonstrate, doing so often reduces the readability of the code by obscuring the actual algorithm in favor of making it faster. This does not apply to all cases, though: sometimes doing the optimization "manually", directly in the source code, makes it more readable, and makes the code resemble the output of the compiler more closely.
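As a rough sketch of what doing these two optimizations "manually" means, compare the two functions below. Both names are hypothetical; strlen() is used because GCC's documentation cites it as a pure function. Whether a given compiler actually turns the first form into the second depends on the optimization level and target.

```c
#include <stddef.h>
#include <string.h>

/* Written for readability: strlen(s) appears three times. */
size_t padded_len_plain(const char *s, size_t min)
{
    size_t unused = strlen(s) * 2;  /* dead: the value is never read */

    (void)unused;                   /* the cast only silences a warning */
    return strlen(s) < min ? min : strlen(s);
}

/*
 * Roughly what CSE and DCE amount to when done by hand: the dead
 * computation is gone and the common call is stored in a temporary.
 */
size_t padded_len_manual(const char *s, size_t min)
{
    size_t len = strlen(s);

    return len < min ? min : len;
}
```

Here the hand-optimized form happens to be just as readable, which is the exception the paragraph above mentions.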
About Assemblers and Examples

When talking about optimization, it's quite difficult to visualize the task of the compiler, and the way the code morphs from what you read in the C source code into what the CPU is really going to execute. For this reason, the best way to write about optimizations is to use examples, showing what the compiler generates starting from the source code.

Given the way in which GCC works, this is actually quite easy. You just need to enable optimization and append the -S switch to the gcc command line. This switch stops the compiler after the transformation of C source code into assembly, before the result is passed to the assembler program to produce the object file.

Although I suspect a good fraction of the people reading this article would be comfortable reading IA-32 or x86-64 assembly code, I decided to use the Blackfin [1] assembly language, which should be readable for people who have never studied a particular assembly language. The Blackfin assembler is more symbolic than IA-32: instead of having operations named movl and addq, the operations are identified by their algebraic operators (=, +), while the registers are merely called R1, R2 and so on.

Calling conventions are also quite easy to understand: for all the cases we'll look through in the article (at most four parameters, integers or pointers), the parameters are passed through the registers, starting in order from R0. The return value of the function call is also stored in the R0 register.

To clarify the examples which will appear later on, let's see how the following C source code is translated by GCC into Blackfin code:

becomes:

As the Blackfin does not have a 32-bit immediate load, you have to load high and low addresses separately (in whichever order); the assembler will take care of properly loading the high 16 bits of the label to the upper part of the register, and the low 16 bits to the lower part.
Once the parameters are loaded, the function is called almost identically to any other call operation on other architectures; note the prefixed underscore on symbols' names.

Integers, whether constants, parameters or variables, are also loaded into the registers for calls. Blackfin doesn't have 32-bit immediate loading, but if the constant to load fits into 16 bits, it can be loaded through sign extension by appending the (X) suffix.

When accessing a global memory location, the P2 pointer is set to the address of the memory location...

... and then dereferenced to assign that memory area. Being a RISC architecture, Blackfin does not have direct memory operations.

The return value for a function is loaded into the R0 register, and can be accessed from there.

The rts command is the return from subroutine, and usually indicates the end of the function, but like the return statement in C, it might appear in any place of the routine.

In the following examples, the preambles with declarations and data will be omitted whenever these are not useful to the discussion.

Concerning optimization levels, the code will almost always be compiled with at least the first optimization level enabled (-O1). This is both because it makes the code cleaner to read (using register-to-register copies for parameter passing, instead of saving to the stack and then restoring from it) and because we need optimization enabled to see how the optimizations are applied.

Also, most of the time I'll refer to the fastest alternative. Most of what I say, though, applies also to the smaller alternative when using the -Os optimization level. In any case, the compiler always weighs the cost-to-benefit ratio between the optimized and the unoptimized version, or between different optimized versions. If you want to know the exact route the compiler takes for your code, you can always use the -S switch to find out.

DCE and Unused Variables

One area where DCE is useful is to avoid operations that result in unused data.
It's not that uncommon for a variable to be defined by an operation, complex or not, and then never be used by the code, either because it is intended for future expansion or because it's a remnant of older code that has been removed or replaced. While the best thing would be to get rid of the definition entirely, users expect the compiler to produce a good result with sloppy code too, so that operation should not be emitted.

The DCE pass can remove all the code that has no side effect, when its result is not used. This includes all mathematical operations and functions known to be pure or constant (as neither is allowed to change the global state of the variables). If a function call is not known to be at least pure, it may change the global state, and its call will not be eliminated, as shown in the following code:

Which, once compiled with -O1, [2] produces the following Blackfin assembler:

As you can see, the call to the pure function has been eliminated (the res2 variable was not being used), together with the algebraic operation, but the impure function is still called, even though its return value is discarded. This is because the compiler does not know whether the latter function has side effects on the global memory state, so it has to emit the call. This is equivalent to the following code (which produces the same assembler code):

The Dead Code Elimination optimization can be very helpful to reduce the overhead caused by code written to conform to the C89 standard, where you couldn't mix variable (and constant) declarations with executable code.

In those sources, you had to declare variables at the top of the function, and then start to check for prerequisites. If you wanted to make it explicit that some variable had to keep its value, by making it constant, you would often have to fill it in before the prerequisites could be checked.
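The unused-variable case described in this section can be sketched as follows. This is not the article's original listing, which did not survive; the helper names (pure_sum, impure_sum, total, impure_calls) and the call counter are assumptions made for illustration, with res1, res2 and res3 playing the roles the text describes.

```c
/* Hypothetical counter: modified only by the impure helper. */
static int call_count;

__attribute__((pure)) int pure_sum(const int *v, int n);
int impure_sum(const int *v, int n);

/* Reads memory through v but changes nothing: pure. */
int pure_sum(const int *v, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Same result, but with a side effect on global state: not pure. */
int impure_sum(const int *v, int n)
{
    call_count++;
    return pure_sum(v, n);
}

int total(const int *v, int n)
{
    int res1 = impure_sum(v, n); /* result unused, but the call must stay */
    int res2 = pure_sum(v, n);   /* result unused: the call may be dropped */
    int res3 = n * 4;            /* unused arithmetic: may be dropped too */

    (void)res1; (void)res2; (void)res3; /* only silences warnings */
    return n;
}

/* Lets a caller observe the side effect of impure_sum(). */
int impure_calls(void)
{
    return call_count;
}
```

Compiling this with `gcc -O1 -S` and reading the output for total() is a simple way to check which calls a particular GCC version and target keep.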
Legacy code aside, DCE is also useful when writing debug code, so that it doesn't look out of place because of the use of lots of #ifdef directives. Take for instance the following code:

The assert_se macro has different behavior from the standard assert, as it has side effects, which basically means that the code passed to the assertion is called even though the compiler is told to disable debugging. This is a somewhat common trick, although its effects on readability are debatable.

With getsomestring() pure, when compiling without debugging, the DCE will remove the calls to all three functions: getsomestring(), strncmp() and strlen() (the latter two are usually declared as pure by both the C library and by GCC's built-in replacements). This is because none of these functions has a side effect, resulting in a very short function:

If our getsomestring() function weren't pure, even though its return value is not going to be used, the compiler would have to emit the call, resulting in rather more complex (albeit still simple, compared with most real-world functions) assembler code:

Common Sub-expression Elimination

The Common Sub-expression Elimination optimization is one of the most important optimizations performed by the compiler, because it's the one that, for instance, replaces multiple indexed accesses to an array so that the actual memory address is calculated just once.

What this optimization does is find common operations executed on the same operands (even when they are not known at compile time), decide which ones are more expensive than saving the result in a temporary (register or stack), and then swap the code around to take the cheapest course.

While its uses are quite varied, one of the easiest ways to see the work of the CSE is to look at the code generated when using the ternary if operator.
Let's take the following code: The compiler will optimize the code as: As you can see, the pure function is called just once, because the two references inside the ternary operator are equivalent, while the other one is called twice. This is because there was no change to global memory known to the compiler between the two calls of the pure function (the function itself couldn't change it – note that the compiler will never take multi-threading into account, even when asking for it explicitly through the -pthread flag), while the non-pure function is allowed to change global memory or use I/O operations. The equivalent code in C would be something along the following lines (it differs a bit because the compiler will use different registers): The Common Sub-expression Elimination optimization is very useful when writing long and complex mathematical operations. The compiler can find common calculations even though they don't look common to the naked eye, and act on those. Although sometimes you can get away with using multiple constants or variables to carry out temporary operations so that they can be re-used in the following calculations, leaving the formulae entirely explicit is usually more readable, as long as the formulae are not intended to change. Like with other algorithms, there are some advantages to reducing the source code used to calculate the same thing; for instance you can easily make a change directly to the definition of a constant and get the change propagated to all the uses of that constant. On the other hand, this can be quite a problem if the meaning of two calculations is very different (and thus can vary in different ways with the evolution of the code), and just happen to be calculated in the same way at a given time. 
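A minimal sketch of the ternary-operator case discussed above might look like this. The names (pure_level, impure_level and the clamp_* wrappers) are hypothetical and stand in for the article's lost listing; the impure helper records each call in cfg[1] so that its side effect is observable.

```c
__attribute__((pure)) int pure_level(const int *cfg);
int impure_level(int *cfg);

/* Only reads memory: two identical calls can be merged by CSE. */
int pure_level(const int *cfg)
{
    return cfg[0];
}

/* Counts its own calls in cfg[1], so every call has to be emitted. */
int impure_level(int *cfg)
{
    cfg[1]++;
    return cfg[0];
}

/* The pure call appears twice; the compiler may emit it just once. */
int clamp_pure(const int *cfg)
{
    return pure_level(cfg) > 0 ? pure_level(cfg) : 0;
}

/* The impure call appears twice and runs twice when the test succeeds. */
int clamp_impure(int *cfg)
{
    return impure_level(cfg) > 0 ? impure_level(cfg) : 0;
}

/* Roughly the C equivalent of what CSE does to clamp_pure(). */
int clamp_pure_by_hand(const int *cfg)
{
    int level = pure_level(cfg);

    return level > 0 ? level : 0;
}
```

With a positive cfg[0], clamp_impure() leaves cfg[1] incremented by two, which is exactly the behavior the compiler must preserve and the reason it cannot merge the two calls.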
Another rather useful place where the compiler can further optimize code with CSE, where it wouldn't be so nice or simple to do manually in the source code, is where you deal with static functions that are inlined by the compiler. Let's examine the following code for instance: In this code, you can find four basic expressions: (p1 * 16), (p2 * 16), (3 << a) and (4 << b). Each of these four expressions is used twice in the somefunc() function. Thanks to the CSE, though, the code will calculate each of them once, even though they cross the function boundary, producing the following code: As you can easily see (the assembly was modified a bit to improve its readability, the compiler re-ordered loads of registers to avoid pipeline stalls, making it harder to see the point), the four expressions are calculated first, and stored respectively in the registers R0, R1, R7 and R3. These kinds of sub-expressions are usually harder to see in the code and also harder to implement. Sometimes they get factored out on their own parameter, but that can be more expensive during execution, depending on the calling conventions of the architecture. Cheats As I wrote above, there are some requirements that apply to functions that are declared pure and constant, related to not changing or accessing global memory; not executing I/O operations; and, of course, not calling further impure functions. The reason for this is that the compiler will accept what the user declares the function to be, whatever its body is (as it's usually unknown by the compiler at the call stage). Sometimes, though, it's possible to fool the compiler so that it treats impure functions as pure or even constant functions. Although this is a risky endeavor, as it might truly cause bad code generation by the compiler, it can sometimes be used to force optimization for particular functions. An example of this can be a lookup function that scans through a global table to return a value. 
While it is accessing global memory, you might want the compiler to promote it to a constant function, rather than simply to a pure one. Let's take for instance the following code: If the lookup() function is only considered a pure function, as it is, adhering to the rules we talked about at the start of the article, it will be called three times in testfunction(), like this: Instead, we can trick the compiler by declaring the lookup() function as constant (the data it is reading is constant, after all, so at a given parameter it will always return the same result). If we do that, the three calls will have to return the same value, and the compiler will be able to optimize them as a single call: In addition to lookup functions on constant tables, this trick is useful with functions which read data from files or other volatile data, and cache it in a memory variable. Take for instance the following function that reads an environment variable: This is not truly a constant function, as its return value depends on the environment. Even so, assuming that the environment of the process is left untouched, its return value will never change between calls. Even though it will affect the global state of the program (as the cachedval static variable will be filled in the first time the function is called), it can be assumed to always return the same value. Tricking the compiler into thinking that a function is constant even though it has to load data through I/O operations, as I said, is risky, as the compiler will think there is no I/O operation going on; on the other hand, this trick might make a difference sometimes, as it allows the expression of functions in more semantic ways, leaving it up to the compiler to optimize the code with temporaries, where needed. 
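As a rough illustration of the environment-variable case described above, the sketch below declares a caching reader as constant even though it is not. The variable name MYAPP_MODE, the "default" fallback and both function names are assumptions made for this example; only the cachedval name comes from the text. The declaration is deliberately a lie to the compiler, and it is safe only as long as the environment really is left untouched.

```c
#include <stdlib.h>
#include <string.h>

/*
 * The cheat: this function reads the environment and fills a static
 * cache, yet it is declared constant so that repeated calls can be
 * merged by the compiler.
 */
__attribute__((const)) const char *get_cached_mode(void);

const char *get_cached_mode(void)
{
    static const char *cachedval;

    if (cachedval == NULL) {
        const char *env = getenv("MYAPP_MODE"); /* hypothetical variable */

        cachedval = (env != NULL) ? env : "default";
    }
    return cachedval;
}

/* Written "semantically": the getter is simply called where needed. */
int mode_is_one_of(const char *a, const char *b)
{
    return strcmp(get_cached_mode(), a) == 0 ||
           strcmp(get_cached_mode(), b) == 0;
}
```

If the program ever calls setenv() or putenv() on that variable, the declaration stops being harmless, which is the risk the text warns about.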
One example can be the following code:

Note: To make sure that the compiler won't reduce the three function calls to their return values right away, the static sub-functions return values taken from global variables; the meanings of those variables are not important.

Considering the above source code, if get_testval() is impure, as the compiler will automatically find it to be, it will be compiled into:

As you can see, get_testval() is called twice, even though its result will be identical. If we declare it constant, instead, the code of our test function will be the following:

The CSE pass combines the two calls to get_testval() into one. Again, this is one of the optimizations that are harder to achieve by manually changing the source code, since the compiler can have a larger view of the use of its value. A common way to handle this is by using global variables, but that might require one more load from memory, while CSE can take care of keeping the values in registers or on the stack.

Conclusions

After what you have read about pure and constant functions, you might have some concerns about the average use of them. Indeed, in a lot of cases, these two attributes allow the compiler to do something you can easily achieve by writing better code.

There are two objectives you have to keep in mind that are related to the use of these (and other) attributes. The first is code readability, because sometimes the manually optimized functions are harder to read than what the compiler can produce. The second is allowing the compiler to optimize legacy or external code.

While you might not be too concerned with letting legacy code or code written by someone else get away with slower execution, a pragmatic view of the current Free Software world should take into consideration the fact that there are probably thousands of lines of legacy code around.
Some of that code, written with pre-C99 declarations, might be even using libraries that are being developed with their older interface, which could be improved by providing some extra semantic information to the compiler through use of attributes. Also, it's unfortunately true that extensive use of these attributes might be seen by neophytes as an easy solution to let sloppy code run at a decent speed. On the other hand, the same attributes could be used to identify such sloppy code through analysis of the source code. Although GCC does not issue warnings for all of these cases, it already warns for some of them, like unused variables, or statements without effect (both triggered by the DCE). In the future more warnings might be reported if pure and constant functions get misused. In general, like with many other GCC function attributes, their use is tightly related to how programmers perceive their task. Most pragmatic programmers would probably like these tools, while purists will probably dislike the way these attributes help sloppy code to run almost as fast as properly written code. My hopes are that in the future better tools will make good use of these and other attributes on different levels than compilers, like static and dynamic analyzers. [1] The Blackfin architecture is a RISC architecture developed by Analog Devices, supported by both GCC and Binutils (and Linux, but I'm not interested in that here). [2] I have chosen -O1 rather than -O2 because in the latter case the compiler performs extra optimization passes that I do not wish to discuss within the scope of this article. Detect and record video movement with Motion Motion is a video application that monitors a video4linux device such as a USB camera and records movement within the image: Motion is a program that monitors the video signal from one or more cameras and is able to detect if a significant part of the picture has changed; in other words, it can detect motion. 
The program is written in C and is made for the Linux operating system. Motion is a command line based tool whose output can be either jpeg, ppm files or mpeg video sequences.

An installation of Motion was performed on a machine with a 3GHz Athlon 64 processor running Ubuntu 7.04 (Feisty Fawn). The most recent version of Motion (v 3.2.10.1) was downloaded, and the file was uncompressed and untarred. The normal configure, make and make install steps were performed. If one wishes to record mpeg movies, the libavcodec and libavformat libraries must be installed prior to running configure. The make install step needed a bit of manual intervention: it was necessary to create the /var/run/motion directory and copy the motion-dist.conf configuration file to /usr/local/etc/motion.conf.

The config file was modified to define a USB camera, the camera's default resolution was defined and the destination directory for images was set. The framerate parameter was lowered to 2 to slow down the rate of accumulation of image files.

A Kensington Model 67015 VideoCAM VGA USB camera was plugged into the computer. It is a good idea to run a real-time video monitoring application such as xawtv or EffecTV (in DumbTV mode) to adjust the camera's focus, brightness and contrast settings.

Running Motion was simply a matter of typing "motion" on the command line. The program takes about 25 seconds to start recording movement; presumably most of this time is spent learning the contents of the video. After this delay, the software would output a line of text and create one .jpg file for each movement it detected. The images were inspected with the Mirage image viewer and a changing sequence of static images was observed.

Motion has a wide variety of capabilities and configurable parameters. The Motion Guide and Config File Options are a good place to read about the various capabilities, and the FAQ gives answers to common questions.
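The configuration changes described above might look roughly like the following fragment of /usr/local/etc/motion.conf. The option names are those used by Motion 3.2-era configuration files as far as I know; every value except the framerate of 2 is a placeholder chosen for illustration, not something taken from the article.

```
# Video4linux device to capture from (placeholder path)
videodevice /dev/video0

# Capture resolution; use one the camera actually supports (placeholder values)
width 640
height 480

# Upper limit on frames captured per second; the article lowered this to 2
framerate 2

# Directory where image files are written (placeholder path)
target_dir /var/tmp/motion
```

The Config File Options page mentioned above is the place to confirm the exact names and permitted ranges for the installed version.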
One can imagine a number of uses for Motion: cube farm denizens could find out what is causing their pens to disappear at night, and people in high crime areas could use it to catch vandals and thieves in the act. The on_picture_save configuration directive can execute a script on motion detection; this could be used to copy captured images to a distant web server for remote monitoring. This feature was tested by adding a line like this:

    on_picture_save scp %f remote-host:/directory-path

to the config file; the operation worked as expected.

It should be noted that inexpensive USB cameras may only work in a very limited set of lighting conditions. Serious surveillance would require an NTSC or PAL video input adapter and a better camera, or a high resolution webcam.

Apparently, there have been no major releases of Motion in a long time, but the developers' mail archive shows that recent work has been done on the project. A new point release just showed up this week; it added a fix for a security bug. If you are looking for a way to do automated video surveillance, Motion is an excellent tool for the job.

A new kernel tree: linux-staging

There's a new kernel tree in town. The linux-staging tree was announced by Greg Kroah-Hartman on 10 June. It is meant to hold drivers and other kernel patches that are working their way toward the mainline, but still have a ways to go. The intention is to collect them all together in one tree to make access and testing easier for interested developers.

According to Kroah-Hartman, linux-staging (or -staging as it will undoubtedly be known) "is an outgrowth of the Linux Driver Project, and the fact that there have been some complaints that there is no place for individual drivers to sit while they get cleaned up and into the proper shape for merging." Collecting the patches in one place should increase their visibility in the kernel community, potentially attracting more developers to assist in fixing, reviewing, and testing them.
The intent is for -staging to house self-contained patches—Kroah-Hartman mentions drivers and filesystems—that should not affect anyone who is not using them. Because of that, he is hoping that -staging can get included in the linux-next tree. As he says to Stephen Rothwell, maintainer of -next, in the announcement: Yes, I know it contains things that will not be included in the next release, but the inclusion and basic build testing that is provided by your tree is invaluable. You can place it at the end, and if there is even a whiff of a problem in any of the patches, you have my full permission to drop them on the floor and run away screaming (and let me know please, so I can fix it up.) The -next tree is meant for things that are headed for inclusion in the "N+1" kernel (where 2.6.N is the release under development), so including code not meant for that release is bending the rules a bit. As of this writing, Rothwell has not responded to the request to include -staging, but it would clearly benefit those patches to have a wider audience—with only a small impact on -next. There is no set timeline for patches to move from -staging into mainline, Kroah-Hartman says: Based on some of the work that is needed on some of these drivers, it is much longer than N+2, unless we have some people step up to help out with the work. It's almost all janitorial work to do, but I know I personally don't have enough time to do it all, and can use the help. The -staging tree is seen as a great place for Kernel Janitors and others who are interested in learning about kernel development to get their start. The announcement notes: "The code in this tree is in desperate need of cleanups and fixes that can be trivially found using 'sparse' and 'scripts/checkpatch.pl'." In the process of cleaning up the code, folks can learn how to create patches and how to get them accepted into a tree. 
From there, the hope is that more difficult tasks will be undertaken—with -staging or other kernel code—leading to a new crop of kernel hackers. The current status of -staging shows 17 patches, most of which are drivers from the Linux Driver Project. Kroah-Hartman is actively encouraging more code to be submitted for -staging, as long as it meets some criteria for the tree. The tree is not meant to be a dumping ground for drivers that are being "thrown over the wall" in hopes that someone else will deal with them. It is also not meant for code that is being actively worked on by a group of developers in another tree somewhere—the reiser4 filesystem is mentioned as an example—it is for code that would otherwise languish. The reaction on linux-kernel has so far been favorable, with questions being asked about what kinds of patches are appropriate for the tree, in particular new architectures. The -staging tree fills a niche that has not yet been covered by other trees. It also serves multiple purposes, from giving new developers a starting point to providing additional reviewing and testing opportunities for new drivers and other code. With luck, that will hasten the arrival of new features—along with new developers. Google announces Gadgets for Linux Google recently announced the release of their Gadgets for the Linux desktop, and, unlike some of their other desktop offerings, they released it under a free software license. While it is not earth-shattering technology, Gadgets does provide some interesting features and amusing diversions. It also generates some hope that Google is getting better at understanding what free software users are looking for, so perhaps things like the Google Desktop for Linux will be better integrated and more useful in the future. Gadgets are a cross-platform way to create simple applications that can run on web pages and desktops. 
The gadget API provides a means to retrieve content from other sites and display it along with a user interface. Many kinds of applications can be created, from clocks and calendars to RSS-feedreaders and "picture of the day" viewers. There are numerous gadgets available, a semi-random collection on a KDE desktop can be seen at left. Google has created a handful of gadgets, but the vast majority are available from others in various categories including News, Sports, Finance, Fun and Games, Technology, and Communication. The gadget browser shown below, at right, allows easy access to an amazing number of choices, many of which are variations on a theme. To get started with gadgets, it is first necessary to build the tool. Google does not yet provide .rpm or .deb files for various distributions. The "how to build" page was useful, but there was some difficulty in trying to translate the dependencies into Fedora 9 package names. A page in a language I don't know needed no translation, however. Linux commands, it seems, are multi-lingual. Building from the Apache-licensed source tarball was straightforward after that. Gadgets for Linux comes in both GTK+ and Qt flavors which allows for integration with the two dominant Linux desktop environments. The screenshots accompanying this article are from the Qt version, but a bit of a look at the GTK+ version seemed roughly the same—though the Qt version lacks the sidebar dock. This is a beta release, perhaps more of a beta than many Google releases, so there are still a fair number of glitches. Perhaps 20% of the gadgets tried had one problem or another, with some seeming not to function at all. Having no experience with gadgets on other platforms, it was not clear whether these were caused by bugs in the gadgets themselves or the desktop gadget program. The main benefit of the gadget API seems to be the cross-platform capabilities. 
Gadgets can run—largely unchanged—on Linux, Mac OS X, or Windows, but can also run in browsers on web pages at social networking sites or on other pages. If the API can deliver that wide a range of platform choices, it could open up a much wider audience for folks that want to develop their gadgets on Linux. Still missing is one of the tools recommended for developing gadgets, Gadget Designer, which is only available for Windows. The documentation for creating a gadget makes it look like a tedious exercise in XML manipulation and JavaScript programming, but there may be tools available or in development to make some of that easier. Overall, gadgets look like an interesting project. There is really nothing new about the kinds of applications that can be built using the API, but there are few choices to build those kinds of programs in a truly cross-platform way. Google's choice to support Linux—and support it well—accompanied by the code under a free software license is, perhaps, the best news of all.

openSUSE merges forums ahead of 11.0 release

The openSUSE project announced this week it has merged its three largest English-language community support forums under one big green umbrella and relaunched them as the openSUSE Forums. According to data supplied by openSUSE, the combined number of suseforums.net, suselinuxsupport.de, and openSUSE Novell support forum members was in the tens of thousands — a number expected to rise with the upcoming release of openSUSE 11.0. Even though the new forums are already up and running smoothly, the team has no intention of resting on its laurels. They're already working on implementing similar changes with forums in other languages and better integration with the rest of the site. Project Manager Rupert Horstkötter says there are also plans for a "user-rating for the whole openSUSE community, integrated with forums.opensuse.org, and all other openSUSE services.
Besides all of that, we hope to be able to attract more independent forum communities for the official openSUSE forums." Keith Kastorff, the site admin for suseforums.net, says the idea began to take shape during an openSUSE project meeting back in 2007. "A big topic was the need for an 'official' openSUSE forum, and the duplication of effort, expertise, and resources we had in play," he recalls. "I volunteered to reach out to some of the independent SUSE focused forums to see if I could generate any interest in a merge." Then he contacted people involved with Novell and suselinuxsupport.de and "things moved forward from there." Kastorff says getting the project underway was slow going at first and admits that some members were wary of Novell's involvement. "The open source community is sometimes skeptical of commercial players, but we found nothing but tremendous support from Novell," he says. It's not surprising there were a number of technical hurdles to overcome in bringing the three forums together. One of the main issues was an inability to merge the member databases, and it was eventually decided to simply archive them within a section of the new forum. "Like any project, we had to make compromises to achieve the end goal," says Kastorff. "We knew going in we had different cultures in play, and there were times the dialogs between the various merging staffs got intense, but the team's strong commitment to bettering the openSUSE community kept us focused on the prize." Indeed, it was a team effort. More than 30 people worked behind the scenes to import the help sections of the separate forums and archive over 400,000 posts prior to launching forums.opensuse.org. In order for the project to work, the various groups — each with their own goals and ideas — needed to work together and trust in the end goal. Horstkötter says it was "a lot of work to combine different cultures into one big forum for the openSUSE community, but it was a great time.
I feel like I met some new friends during the project." "We had three teams — one from Novell, two from different grassroots projects that had sprung up to serve the community and had developed their own style and ways of working together," recalls openSUSE Product Manager Michael Löffler. "To merge the three, the staff for each forum had to be comfortable putting all their eggs in one basket (Novell hosting the forums) and agreeing on a common set of rules, moderation guidelines, etc. It took some time and effort to work everything out, but I think that the three teams are working quite well together now." Just as important as teams working together is the impact that merged forums will have on the openSUSE community overall. "Having a unified forum means that all interested users can converse and support one another in one location — so you don't have the duplication of effort," says Löffler. "I'm really glad [they] launched in time for 11.0 — I expect that a lot of new users are going to be interested in openSUSE with this release, and I am very happy we have the forums to help support them."

SCADA system vulnerabilities

Core Security released a security advisory on 11 June that details a fairly pedestrian stack-based buffer overflow vulnerability. It is similar to the hundreds or thousands of flaws of this kind reported over the years except for one thing: it was found in large industrial control systems for things like power and water utility companies. That there is a vulnerability is not surprising—there are certainly many more—but it does give one pause about the dangers of connecting these systems to the internet. The bug was found in a Supervisory Control and Data Acquisition—better known as SCADA—system and could be exploited to execute arbitrary code. Given that SCADA systems run much of the world's infrastructure, an exploit of a vulnerable system could have severe repercussions.
The customers of Citect, the company that makes the affected systems, include "organizations in the aerospace, food, manufacturing, oil and gas, and public utilities industries." Makers of SCADA systems nearly uniformly tell their customers to keep those systems isolated from the internet. But as Core observes: "the reality is that many organizations do have their process control networks accessible from wireless and wired corporate data networks that are in turn exposed to public networks such as the Internet." So, the potential for a random internet bad guy to take control of these systems does exist. None of that should be particularly surprising when you stop to think about it, but it is worrying. Many SCADA systems—along with various other control systems—were designed and developed long before the internet started reaching homes and offices everywhere. They were designed for "friendly" environments, with little or no change for the hostile environment that characterizes today's internet. Also, as we have seen, security rarely gets the attention it deserves until some kind of ugly incident occurs. Even for systems that were designed recently, there are undoubtedly vulnerabilities, so it is a bit hard to believe that they might be internet-connected. According to the advisory, though, SCADA makers do not necessarily require that the systems be physically isolated from the network, instead customers can "utilize technologies including firewalls to keep them protected from improper external communications." Firewalls—along with other security techniques—do provide a measure of protection, but with the stakes so high, it would seem that more caution is required. It is probably convenient for SCADA users to be able to connect to other machines on the LAN, as well as to the internet, but with that convenience comes quite a risk. 
Even systems that are just locally connected could fall prey to a disgruntled employee exploiting a vulnerability to gain access to systems they normally wouldn't have. One can envision all manner of havoc that could be wreaked by a malicious person (or government) who can take over the systems that control nuclear power plants, enormous gas pipelines, or some chunk of the power grid. Unfortunately, it will probably take an incident like that to force these industries into paying as much attention to their computer security as they do to their physical security. The Kernel Hacker's Bookshelf: Ultimate Physical Limits of Computation Moore's Law - we all know it (or at least think we do). To be annoyingly exact, Moore's Law is a prediction that the number of components per integrated circuit (for minimum cost per component) will double every 24 months (revised up from every 12 months in the original 1965 prediction). In slightly more useful form, Moore's Law is often used as a shorthand for the continuing exponential growth of computing technology in many areas - disk capacity, clock speed, random access memory. Every time we approach the limit of some key computer manufacturing technology, the same debate rages: Is this the end of Moore's Law? So far, the answer has always been no. But Moore's Law is inherently a statement about human ingenuity, market forces, and physics. Whenever exponential growth falters in one area - clock speed, or a particular mask technique - engineers find some new area or new technique to improve at an exponential pace. No individual technique experiences exponential growth for long, instead migration to new techniques occurs fast enough that the overall growth rate continues to be exponential. The discovery and improvement of manufacturing techniques is driven on one end by demand for computation and limited on the other end by physics. In between is a morass of politics, science, and plain old engineering. 
It's hard to understand the myriad forces driving demand and the many factors that affect innovation, including economies of scale, cultural attitudes towards new ideas, vast marketing campaigns, and the strange events that occur during the death throes of megacorporations. By comparison, understanding the limits of computation is easy, as long as you have a working knowledge of quantum physics, information theory, and the properties of black holes.

The "Ultimate Laptop"

In a paper published in Nature in 2000, Ultimate Physical Limits of Computation (a free preprint is available on arXiv), Dr. Seth Lloyd calculates (and explains) the limits of computing given our current knowledge of physics. Of course, we don't know everything about physics yet - far from it - but just as in other areas of engineering, we know enough to make some extremely interesting predictions about the future of computation. This paper wraps up existing work on the physical limits of computing and introduces several novel results, most notably the ultimate speed limit to computation. Most interesting in my mind is the calculation of a surprisingly specific upper bound on how many years a generalized Moore's Law can remain in effect (keep reading to find out exactly how long!). Dr. Lloyd begins by assuming that we have no idea what future computer manufacturing technology will look like. Many discussions of the future of Moore's Law center around physical limits on particular manufacturing techniques, such as the limit on feature size in optical masks imposed by the wavelength of light. Instead, he ignores manufacturing entirely and uses several key physical constants: the speed of light c, Planck's reduced constant h (normally written as h-bar, a symbol not available in standard HTML, so you'll have to just imagine the bar), the gravitational constant G, and Boltzmann's constant kB.
These constants and our current limited understanding of general relativity and quantum physics are enough to derive many important limits on computing. Thus, these results don't depend on particular manufacturing techniques. The paper uses the device of the "Ultimate Laptop" to help make the calculations concrete. The ultimate laptop is one kilogram in mass and has a volume of one liter (coincidentally almost exactly the same specs as a 2008 Eee PC), and operates at the maximum physical limits of computing. Applying the limits to the ultimate laptop gives you a feel for the kind of computing power you can get in luggable format - disregarding battery life, of course.

Energy limits speed

So, what are the limits? The paper begins with deriving the ultimate limit on the number of computations per second. This depends on the total energy, E, of the system, which can be calculated using Einstein's famous equation relating mass and energy, E = mc^2. (Told you we'd need to know the speed of light.) Given the total energy of the system, we then need to know how quickly the system can change from one distinguishable state to another - i.e., flip bits. This turns out to be limited by the Heisenberg uncertainty principle. Lloyd has this to say about the Heisenberg uncertainty principle:

In particular, the correct interpretation of the time-energy Heisenberg uncertainty principle ΔEΔt ≥ h is not that it takes time Δt to measure energy to an accuracy ΔE (a fallacy that was put to rest by Aharonov and Bohm) but rather that a quantum state with spread in energy ΔE takes time at least Δt = πh/2ΔE to evolve to an orthogonal (and hence distinguishable) state. More recently, Margolus and Levitin extended this result to show that a quantum system with average energy E takes time at least Δt = πh/2E to evolve to an orthogonal state.
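The bound in the quotation can be turned into numbers with a short Python sketch. This is only an illustration: the constants are the standard SI values for c and the reduced Planck constant, the function names are made up for this example, and the one-kilogram mass is that of the ultimate laptop described above.

```python
import math

# Physical constants in SI units.
C = 299792458.0            # speed of light, m/s
HBAR = 1.054571817e-34     # reduced Planck constant ("h-bar"), J*s


def rest_energy(mass_kg):
    """Total energy E = m * c^2, in joules."""
    return mass_kg * C ** 2


def min_flip_time(energy_j):
    """Margolus-Levitin bound: shortest time (in seconds) for a system of
    average energy E to evolve to an orthogonal state, pi * hbar / (2 * E)."""
    return math.pi * HBAR / (2.0 * energy_j)


def max_ops_per_second(energy_j):
    """Upper bound on logical operations per second, 2 * E / (pi * hbar)."""
    return 1.0 / min_flip_time(energy_j)


energy = rest_energy(1.0)  # one kilogram, as for the ultimate laptop
print(f"E         = {energy:.4e} J")
print(f"max ops/s = {max_ops_per_second(energy):.4e}")
```

With these constants the sketch gives an energy just under 9 x 10^16 joules and a rate of roughly 5.4 x 10^50 operations per second; small differences in the last digits relative to the paper come down to rounding of the constants.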
In other words, the Heisenberg uncertainty principle implies that a system will take a minimum amount of time to change in some observable way, and that the time is related to the total energy of the system. The result is that a system of energy E can perform 2E/πh logical operations per second (a logical operation is, for example, performing the AND operation on two bits of input - think of it as single bit operations, roughly). Since the ultimate laptop has a mass of 1 kilo, it has energy E = mc^2 = 8.9874 x 10^16 joules. The ultimate laptop can perform a maximum of 5.4258 x 10^50 operations per second. How close are we to the 5 x 10^50 operations per second today? Each of these operations is basically a single-bit operation, so we have to convert current measurements of performance to their single-bit operations per second equivalents. The most commonly available measure of operations per second is FLOPS (floating point operations per second) as measured by LINPACK (see the Wikipedia page on FLOPS). Estimating the exact number of actual physical single-bit operations involved in a single 32-bit floating point operation would require proprietary knowledge of the FPU implementation. The number of FLOPS as reported by LINPACK varies wildly depending on compiler optimization level as well. For this article, we'll make a wild estimate of 1000 single-bit operations per second (SBOPS) per FLOPS, and ask anyone with a better estimate to please post it in a comment. With our FLOPS to SBOPS conversion factor of 1000, the current LINPACK record holder, the Roadrunner supercomputer (near my home town, Albuquerque, New Mexico), reaches speeds of one petaflop, or 1000 x 10^15 = 1 x 10^18 SBOPS. But that's for an entire supercomputer - the ultimate laptop is only one kilo in mass and one liter in volume.
Current laptop-friendly CPUs are around one gigaflop, or 10^12 SBOPS, leaving us about 39 orders of magnitude to go before hitting the theoretical physical limit of computational speed. Finally, existing quantum computers have already attained the ultimate limit on computational speed - on a very small number of bits and in a research setting, but attained it nonetheless.

Entropy limits memory

What we really want to know about the ultimate laptop is how many legally purchased DVDs we can store on it. The amount of data a system can store is a function of the number of distinguishable physical states it can take on - each distinct configuration of memory requires a distinct physical state. According to Lloyd, we have "known for more than a century that the number of accessible states of a physical system, W, is related to its thermodynamic entropy by the formula: S = kB ln W" (S is the thermodynamic entropy of the system). This means we can calculate the number of bits the ultimate laptop can store if we know what its total entropy is. Calculating the exact entropy of a system turns out to be hard. From the paper:

To calculate exactly the maximum entropy for a kilogram of matter in a liter volume would require complete knowledge of the dynamics of elementary particles, quantum gravity, etc. We do not possess such knowledge. However, the maximum entropy can readily be estimated by a method reminiscent of that used to calculate thermodynamic quantities in the early universe. The idea is simple: model the volume occupied by the computer as a collection of modes of elementary particles with total average energy E.

The following discussion is pretty heavy going; for example, it includes a note that baryon number may not be conserved in the case of black hole computing, something I'll have to take Lloyd's word on. But the end result is that the ultimate laptop, operating at maximum entropy, could store at least 2.13 x 10^31 bits.
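As a sanity check on that figure, the relation S = kB ln W can be run backwards. A memory of n bits has W = 2^n distinguishable states, so S = n kB ln 2. The Python sketch below (function names are illustrative, and the 4.7 GB single-layer DVD capacity is an assumption not taken from the paper) recovers the entropy implied by the quoted bit count and answers the DVD question in rough terms.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K


def entropy_from_bits(bits):
    """Entropy (J/K) of a system with W = 2**bits states: S = bits * kB * ln 2."""
    return bits * K_B * math.log(2)


def bits_from_entropy(entropy_j_per_k):
    """Inverse relation: number of bits = S / (kB * ln 2)."""
    return entropy_j_per_k / (K_B * math.log(2))


ULTIMATE_BITS = 2.13e31   # storage figure quoted from the paper
DVD_BITS = 4.7e9 * 8      # assumption: one single-layer DVD holds 4.7 GB

print(f"implied entropy: {entropy_from_bits(ULTIMATE_BITS):.3e} J/K")
print(f"DVDs that fit:   {ULTIMATE_BITS / DVD_BITS:.2e}")
```

The implied entropy comes out at about 2 x 10^8 joules per kelvin, and the DVD count is on the order of 10^20.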
Of course, maximum entropy means that all of the laptop's matter is converted to energy - basically, the equivalent of a thermonuclear explosion. As Lloyd notes, "Clearly, packaging issues alone make it unlikely that this limit can be obtained." Perhaps a follow-on paper can discuss the Ultimate Laptop Bag... How close are modern computers to this limit? A modern laptop in 2008 can store up to 250GB - about 2 x 10^12 bits. We're about 19 orders of magnitude away from maximum storage capacity, or about 64 more doublings in capacity. Disk capacity as measured in bits per square inch has doubled about 30 times between 1956 and 2005, and at this historical rate, 64 more doublings will only take about 50 - 100 years. This isn't the overall limit on Moore's law as applied to computing, but it suggests the possibility of an end to Moore's law as applied to storage within some of our lifetimes. I guess we file system developers should think about second careers...

Redundancy and error correction

Existing computers don't approach the physical limits of computing for many good reasons. As Lloyd wryly observes, "Most of the energy [of existing computers] is locked up in the mass of the particles of which the computer is constructed, leaving only an infinitesimal fraction for performing logic." Storage of a single bit in DRAM uses "billions and billions of degrees of freedom" - electrons, for example - instead of just one degree of freedom. Existing computers tend to conduct computation at temperatures at which matter remains in the form of atoms instead of plasma. Another fascinating practical limit on computation is the error rate of operations, which is bounded by the rate at which the computer can shed heat to the environment. As it turns out, logical operations don't inherently require the dissipation of energy, as von Neumann originally theorized.
Reversible operations (such as NOT), which do not destroy information, do not inherently require the dissipation of energy; only irreversible operations (such as AND) do. This makes some sense intuitively; the only way to destroy (erase) a bit is to turn that information into heat, otherwise the bit has just been moved somewhere else and the information it represents is still there. Reversible computation has been implemented and shown to have extremely low power dissipation. Of course, some energy will always be dissipated, whether or not the computation is reversible. However, the erasure of bits - in particular, errors - requires a minimum expenditure of energy. The rate at which the system can "reject errors to the environment" in the form of heat limits the rate of bit errors in the system; or, conversely, the rate of bit errors combined with the rate of heat transfer out of the system limits the rate of bit operations. Lloyd estimates the rate at which the system can reject error bits to the environment, relative to the surface area and assuming black-body radiation, as 7.195 x 10^42 bits per meter^2 per second.

Computational limits of "smart dust"

Right around the same time that I read the "Ultimate Limits" paper, I also read A Deepness in the Sky by Vernor Vinge, one of many science fiction books featuring some form of "smart dust." Smart dust is the concept of tiny computing elements scattered around the environment which operate as a sort of low-powered distributed computer. The smart dust in Vinge's book had enough storage for an entire systems manual, which initially struck me as a ludicrously large amount of storage for something the size of a grain of dust. So I sat down and calculated the limits of storage and computation for a computer one μm^3 in size, under the constraint that its matter remain in the form of atoms (rather than plasma).
Lloyd calculates that, under these conditions, the ultimate laptop (one kilogram in one liter) can store about 10^25 bits and conduct 10^40 single-bit operations per second. The ultimate laptop is one liter and there are 10^15 μm^3 in a liter. Dividing the total storage and operations per second by 10^15 gives us 10^10 bits and 10^25 operations per second - about 1 gigabyte in data storage and so many FLOPS that the prefixes are meaningless. Basically, the computing potential of a piece of dust far exceeds the biggest supercomputer on the planet - sci-fi authors, go wild! Of course, none of these calculations take into account power delivery or I/O bandwidth, which may well turn out to be far more important limits on computation.

Implications of the ultimate laptop

Calculating the limits of the ultimate laptop has been a lot of fun, but what does it mean for computer science today? We know enough now to derive a theoretical upper bound for how long a generalized Moore's Law can remain in effect. Current laptops store 10^12 bits and conduct 10^12 single-bit operations per second. The ultimate laptop can store 10^31 bits and conduct 10^51 single-bit operations per second, a gap of a factor of 10^19 and 10^39 respectively. Lloyd estimates the rate of Moore's Law as a factor of 10^8 improvement in areal bit density over the past 50 years. Assuming that both storage density and computational speed will improve by a factor of 10^8 per 50 years, the limit will be reached in about 125 years for storage and about 250 years for operations per second. One imagines the final 125 years being spent frantically developing better compression algorithms - or advanced theoretical physics research. Once Moore's Law comes to a halt, the only way to increase computing power will be to increase the mass and volume of the computer, which will also encounter fundamental limits.
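The extrapolation above is simple enough to sketch in a few lines of Python. The function name and its defaults are illustrative; the only inputs are the figures already quoted, namely gaps of 10^19 (storage) and 10^39 (speed) and an assumed steady improvement of 10^8 every 50 years.

```python
import math


def years_to_limit(gap_factor, improvement=1e8, period_years=50.0):
    """Years needed to close `gap_factor` when performance improves by
    `improvement` every `period_years`, assuming steady exponential growth."""
    periods = math.log10(gap_factor) / math.log10(improvement)
    return periods * period_years


storage_years = years_to_limit(1e19)   # 10^12 bits today vs 10^31 at the limit
speed_years = years_to_limit(1e39)     # 10^12 ops/s today vs 10^51 at the limit

print(f"storage limit reached in about {storage_years:.0f} years")
print(f"speed limit reached in about   {speed_years:.0f} years")
```

The exact results are a little under 120 and 245 years, which round to the "about 125" and "about 250" used in the text. The order-of-magnitude inputs are so rough that the difference is not meaningful.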
An unpublished paper entitled Universal Limits on Computation estimates that the entire computing capacity of the universe would be exhausted after only 600 years under Moore's Law. 250 years is a fascinating in-between length of time. It's too far away to be relevant to anyone alive today, but it's close enough that we can't entirely ignore it. Typical planning horizons for long-term human endeavors (like managing ecosystems) tend to max out around 300 years, so perhaps it's not unthinkable to begin planning for the end of Moore's Law. Me, I'm going to start work on the LZVH compression algorithm, tomorrow. One thing is clear: we live in the Golden Age of computing. Let's make the most of it. Valerie Henson is a Linux consultant specializing in file systems and owns a one kilo, one liter laptop. Peter Zijlstra: From DOS to kernel hacking In a linux-kernel thread about fixing the Kernel Janitors project, Peter Zijlstra spoke up, with a bit of his perspective on attracting better kernel contributors. As he is a relatively recent addition to the kernel community, his path from Linux user to kernel hacker may serve as a template of sorts for others who are starting out now. We asked Peter to answer a few questions by email to help fill in some more of the details. LWN: How did you get started with Linux? What attracted you? Peter: Around the time Win95 came around, IIRC [if I remember correctly]. I used to do demo coding on DOS, which involved rebooting your machine every time you messed up, and whereas DOS reboots quite quickly, doing the same on Win95 was anything but quick. A friend of mine introduced me to Unix/Linux at the time, and I started learning all about programming in a real environment. Basically all programming up to that point was in a freestanding environment where you had to poke the hardware to get anything done. 
So initially it was the charm of a proper multitasking OS (with memory protection) that got me to use it – not having to reboot your machine every time, and the luxury of being able to run a debugger. LWN: How quickly did you start poking around in the kernel? What did you first start to look at and why? Peter: The kernel ... well that took a seriously long while. The above introduction to Linux was around 95/96 IIRC. My first real kernel patches were about 10 years later. In those 10 years I learnt a lot about programming. I learnt about Unix system programming, I learnt about C++, multi-threading, database engines, and a whole range of interesting things. Somewhere along I got a real internet connection and started lurking on mailing lists, including LKML – I must have been reading that on and off for about 5 years by the time I really sat down and wrote some patches. During that time I might have sent in some trivial build fixes, and I remember finding a priority leak in one of the realtime patches. But I wasn't actively coding on the kernel – I just liked running real exotic stuff, you know Gentoo and building just about everything from CVS. So what got me started on the kernel ... I can't quite remember how it happened, but I ran into some of Rik's [van Riel] Advanced Page Replacement stuff. I had worked on that problem space earlier while doing database engines, and had recently run into it again at work. So I started reading those papers and some of the proposed kernel patches, and I started to itch. I dropped basically everything I was working on in my spare time (hacking WindowMaker, writing a C++ ASN.1-DER serialization class, writing a new LDAP server and I'm sure some other projects that are rotting away on a harddrive somewhere :-) and started hacking. Why ... I'm not sure – it sure got me back to where I started out – crashing machines (and boot times haven't improved over those past 10 years at all). 
I think because of the challenge – I knew I could write whatever it was I was coding and this page replacement stuff was a whole new challenge, and TBH [to be honest] the kernel code didn't look too hard at the time (phew how ignorant I was..)

LWN: How well were your contributions received by kernel hackers? Did you make any missteps along the way?

Peter: Some better than others. I think it's natural for every kernel hacker to grow a huge pile of discarded patches. Not everything will make it. But don't get discouraged by that, you did get to learn something from doing them. Mis-steps, feh, still do ;-) Unlike most people seem to think, kernel hackers are human too.

LWN: What suggestions do you have for folks that are looking at getting involved in kernel hacking today?

Peter: Just do it – seriously it's that easy. Oh and don't be afraid of criticism, you'll get it anyway – in spades. Criticism is not personal, it's about your patch, there are two things you can do: (1) take it and act upon it, or (2) convince the other he's wrong. OK it can get personal, but that is only if you repeatedly fail the above two points.

LWN: There has been a lot of talk about the Kernel Janitors project recently, do you think that is a good way to get started with kernel development? What do you think should be done differently in that (or other) project(s) to attract more and better contributors?

Peter: I'm not sure. The Kernel Janitors thing doesn't really seem to work out. I think that might be due to two things: (1) we don't have enough simple but interesting things lined up (not saying there are none, but we don't have a ready list). I think a proper challenging project would be much better than moronic code clean ups. (2) the kernel really isn't a place for newbies; now let me explain this before it gets all mis-interpreted :-) Things really get a lot easier if you're fairly competent at (Unix) system programming before starting at the kernel.
Kernel hacking is a solitary business in that you need to do things, nobody is going to do them for you. That is not saying nobody can help you if you have a question. Also, nobody is going to force you to do something – you need to want to do it. Now, none of this means you can't start hacking the kernel without knowing C or any programming at all, but you'd better be ready for one hell of a ride (Yes, there are people who learnt C from doing kernel stuff, but that is going to take a serious amount of will-power to pull off). So I guess what I'm saying is that you need to really want to do it. There is no other way to become a kernel hacker than by simply doing it.

LWN: Do you work on Linux for your job, as a hobby, or both?

Peter: Both; initially it was spare time besides $JOB. But after keeping this up for about a year my wife nudged me to look for a kernel job, since I obviously enjoyed hacking the kernel more than $JOB, and she'd get some of that spare time back ;-) So I applied for a kernel position at a few of the larger vendors, and Red Hat won the race. Already having had a year's worth of exposure to kernel code and LKML certainly helped in getting this amazing opportunity. Have I already mentioned I absolutely love working on the kernel? So now I get to poke at the kernel all day, every day...

LWN: What are your current kernel projects? What kinds of things do you see yourself doing in the kernel in the future?

Peter: Current active projects are group scheduling and some -rt work. I should pick up the swap over network code again, and there are some other loose ends. The future ... well we'll see what happens, loads of interesting stuff to do.

We would like to thank Peter for taking the time to answer our questions.

The state of the pageout scalability patches

The virtual memory scalability improvement patch set overseen by Rik van Riel has been under construction for well over a year; LWN last looked at it in November, 2007.
Since then, a number of new features have been added and the patch set, as a whole, has gotten closer to the point where it can be considered for mainline inclusion. So another look would appear to be in order. One of the core changes in this patch set remains the same: it still separates the least-recently-used (LRU) lists for pages backed up by files and those backed up by swap. When memory gets tight, it is generally preferable to evict page cache pages (those backed up by files) rather than anonymous memory. File-backed pages are less likely to need to be written back to disk and they are more likely to be well laid-out on disk, making it quicker to read them back in if necessary. Current Linux kernels keep both types of pages on the same LRU list, though, forcing the pageout code to scan over (potentially large numbers of) pages which it is not interested in evicting. Rik's patch improves this situation by splitting the LRU list in two, allowing the pageout code to only look at pages which might actually be candidates for eviction. There comes a point, though, where anonymous pages need to be reclaimed as well. The kernel will make an effort to pick the best pages to evict by going for those which have not been recently referenced. Doing that, however, requires going through the entire list of anonymous pages, clearing the "referenced" bit on each. A large system can have many millions of anonymous pages; iterating over the entire set can take a long time. And, as it turns out, it's not really necessary. The VM scalability patch set now changes that behavior by simply keeping a certain percentage of the system's anonymous pages on the inactive list - the first place the system looks for pages to evict. Those pages will drift toward the front of the list over time, but will be returned to the active list if they are used. 
Essentially, this patch is applying a form of the "referenced" test to a portion of anonymous memory - whether or not anonymous pages are being evicted at the time - rather than trying to check the referenced state of all anonymous pages when the kernel decides it needs to reclaim some of them. Another set of patches addresses a different situation: pages which cannot be evicted at all. These pages might have been locked into memory with a system call like mlock(), be part of a locked SYSV shared memory region, or be part of a RAM disk, for example. They can be either page cache or anonymous pages. Either way, there is little point in having the reclaim code scan them, since it will not be possible to evict them. But, of course, the current reclaim code does have to scan over these pages. This unneeded scanning, as it turns out, can be a problem. The extensive unevictable LRU document included with the patch claims: For example, a non-numal x86_64 platform with 128GB of main memory will have over 32 million 4k pages in a single zone. When a large fraction of these pages are not evictable for any reason [see below], vmscan will spend a lot of time scanning the LRU lists looking for the small fraction of pages that are evictable. This can result in a situation where all cpus are spending 100% of their time in vmscan for hours or days on end, with the system completely unresponsive. Most of us are not currently working with systems of this size; one must spend a fair amount of money to gain the benefits of this sort of pathological behavior. Still, it seems like something which is worth fixing. The solution, of course, is yet another list. When a page is determined to be unevictable, that page will go onto the special, per-zone unevictable list, after which the pageout code will simply not see it anymore. 
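The list-splitting idea can be made concrete with a toy model. The sketch below is not kernel code and does not reflect the patch set's actual data structures; the names `Page`, `Zone`, and `reclaim` are hypothetical, and a single `locked` flag stands in for every reason a page might be unevictable. It shows only the shape of the idea: file-backed pages are scanned first, anonymous pages second, and a locked page is moved aside the first time it is encountered so that later scans never visit it again.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    file_backed: bool
    locked: bool = False  # hypothetical stand-in for mlock() and friends


class Zone:
    """Toy per-zone model: one LRU per page type plus an unevictable list."""

    def __init__(self):
        self.file_lru = deque()
        self.anon_lru = deque()
        self.unevictable = []
        self.scanned = 0  # pages examined so far, to show the savings

    def add(self, page):
        lru = self.file_lru if page.file_backed else self.anon_lru
        lru.append(page)

    def _scan(self, lru, want):
        evicted = []
        for _ in range(len(lru)):
            if len(evicted) == want:
                break
            page = lru.popleft()
            self.scanned += 1
            if page.locked:
                # Shunt the page aside; it will not be scanned again.
                self.unevictable.append(page)
            else:
                evicted.append(page)
        return evicted

    def reclaim(self, want):
        # Prefer file-backed pages; fall back to anonymous memory.
        evicted = self._scan(self.file_lru, want)
        evicted += self._scan(self.anon_lru, want - len(evicted))
        return evicted
```

In this model a locked page costs exactly one scan over its lifetime, no matter how many times `reclaim()` is called afterward; that is the property the unevictable list is meant to provide.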
As a result of the variety of ways in which a page can become unevictable, the kernel will not always know at mapping time whether a specific page can go onto the unevictable list or not. So the pageout code must keep an eye out for those pages as it scans for reclaim candidates and shunt them over to the unevictable list as they are found. In relatively short order, the locked-down pages will accumulate in this list, freeing the pageout code to concentrate on pages it can actually do something about. Many of the concerns which have been raised about this patch set over the last year have been addressed. A few remain, though. Some of the new features require new page flags; these flags are in extremely short supply, so there is always pressure to find ways of implementing things which do not allocate more of them. There are a few too many configuration options and associated #ifdef blocks. And so on. Addressing these may take a while, but convincing everybody that these (rather fundamental) memory management changes are beneficial under all circumstances may take rather longer. So, while this patch set is making progress, a 2.6.27 merge is probably not in the cards. Multi-system administration with Func Managing multiple computer systems can involve a lot of repetitive tasks: connecting to each, performing some update, status check, or configuration tweak, and then moving on to the next machine. These kinds of things can be scripted of course, but scripts of that nature typically need to be adjusted frequently as machines come and go or the tasks change. The Fedora Unified Network Controller (Func) is a tool that will help simplify system administration, but there is more to it than that—it is a framework for doing two-way secure communication, from the command line, scripts, or applications. Func is written in Python, providing an API for scripts written in that language, but it can also be used from the command line. 
Each client machine (a minion, in Func-speak) runs the funcd daemon, which contacts the master server, or overlord. From the overlord machine, commands can then be issued to individual minions or to subsets of them. Some of the power of Func can be seen in simple one-line commands which, for example, will restart the web server on all of the minions. Similar kinds of tasks, but with more control, can be handled through the Python API. A somewhat contrived example from the Func website gives a sense of what can be done: it looks for minions that are running a web server and reboots each that it finds. Managing keys can be a hassle when using ssh as an administrative tool, so Func uses another tool, Certmaster, to assist with keys. Certmaster provides a set of utilities and a Python API for managing SSL certificates. Clients generate certificate signing requests (CSRs), which contain their public keys and are sent to the Certmaster on the overlord. Administrators can either sign them from the command line or enable auto-signing. The minion then retrieves the signed certificate so that the overlord and minion can communicate over an encrypted channel from then on. Func is not meant to replace ssh; instead, it is intended to provide multi-system and scripting capabilities which are not the strengths of ssh. Like ssh, though, Func is meant to be easy to deploy (eventually ubiquitous, at least for Fedora), simple to use, and easy to extend. It also has a pluggable architecture that allows Python modules to be integrated easily into Func, expanding the abilities of the minions. The documentation shows how to use the func-create-module command to generate template code which allows the administrator to ignore the Func requirements and concentrate on the task at hand. There is nothing particularly Fedora-specific about Func; that's just where it was born. There are some efforts underway to add it to other distributions.
Most of the work would be in creating distribution-specific analogs for things like restarting services and querying hardware configurations. Red Hat has been releasing a steady stream of system administration tools over the last year or so. The Emerging Technology (ET) group has developed quite an ecosystem of tools to support installations with large numbers of servers that are frequently installed and upgraded. One might think they have a large infrastructure of such servers. One of those tools that is frequently discussed in conjunction with Func is Cobbler. It is meant to simplify the configuration of a server to handle network installation and booting for a large server farm. From the web page: In short, Cobbler helps build and maintain network installation infrastructure really easily. It's highly customizable to your particular methods of operation through a wide variety of options, a powerful command line, a Web interface, a pluggable extension mechanism, and (for developers) its own Python API. Cobbler lets administrators forget how software gets installed and delivered and lets them concentrate instead on what they want to install where. Cobbler and the other tools coming out of the ET group are not just targeted at physical machines, but also virtualized environments. By using Cobbler, the puppet configuration manager, and the oVirt virtual machine manager, thousands of systems of various kinds can be managed in a centralized fashion. As would be expected, all of the code is available as free software. These tools are quite interesting for system administrators, particularly those who use Fedora and have lots of systems to maintain. Even for small home networks, though, Func at least could come in handy. For overworked administrators—no matter the size of their domain—better tools are always welcome. 
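Returning to Func for a moment: the overlord-to-minion fan-out it provides can be sketched in a few lines of plain Python. This is an in-process toy, not Func's API; the `Minion` and `Overlord` classes, their methods, and the host names are all invented for illustration, and the real tool talks to funcd daemons over certificate-secured connections rather than calling methods directly. The status convention (0 means "running") is borrowed from the usual init-script exit codes.

```python
import fnmatch


class Minion:
    """Hypothetical stand-in for one managed machine."""

    def __init__(self, hostname, running_services):
        self.hostname = hostname
        self.running = set(running_services)
        self.rebooted = False

    def service_status(self, name):
        # 0 means "running", following the init-script convention.
        return 0 if name in self.running else 1

    def reboot(self):
        self.rebooted = True
        return 0


class Overlord:
    """Hypothetical dispatcher: run one method on every matching minion."""

    def __init__(self, minions):
        self.minions = {m.hostname: m for m in minions}

    def call(self, pattern, method, *args):
        results = {}
        for host, minion in self.minions.items():
            if fnmatch.fnmatch(host, pattern):
                results[host] = getattr(minion, method)(*args)
        return results


def reboot_web_servers(overlord):
    """Find minions running httpd and reboot each one found."""
    statuses = overlord.call("*", "service_status", "httpd")
    web_hosts = sorted(h for h, status in statuses.items() if status == 0)
    for host in web_hosts:
        overlord.call(host, "reboot")
    return web_hosts
```

The point of the design is that the caller names a set of machines with a glob and gets back a per-host dictionary of results, which a script can then act on without maintaining its own host list.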
Why some drivers are not merged early Arjan van de Ven's kernel oops report always makes for interesting reading; it is a quick summary of what has made the most kernels crash over the past week. It thus points to where some of the most urgent bugs are to be found. Sometimes, though, this report can raise larger issues as well. Consider the June 16 report, which notes that quite a few kernel crashes were the result of a not-quite-ready wireless update shipped by Fedora. Ingo Molnar was quick to jump on this report with a process-related complaint: i suspect Fedora has done this to enable more hardware, and/or to fix mainline wireless bugs? I wish we would do such new driver merging in mainline instead, so that we had a single point of testing and single point of effort. Same for Nouveau: Fedora carries it and i dont understand why such a major piece of work is not done in mainline and not _helped by_ mainline. He then took the discussion further with this observation: That's my main point: when we mess up and dont merge OSS driver code that was out there in time - and we messed up big time with wireless - we should admit the screwup and swallow the bitter pill. This comment drew some unhappy responses from the networking developers, who feel that they have been unfairly targeted for criticism. Wireless drivers have been merged at the first real opportunity, they say, and trying to put them in earlier would have only made things worse. In fact, your editor will submit that mistakes were made with wireless drivers, but those mistakes have little to do with delaying their inclusion into the mainline. What went wrong with wireless is this: Early wireless developers did not really try to solve the wireless networking problem; they just wanted to get their adaptor to work. Wireless maintainer John Linville once told your editor that, for years, these adaptors were treated as if they were Ethernet adaptors, which they certainly are not.
When these developers did get around to dealing with issues specific to wireless networking, they created their own wireless stacks contained within their drivers. So no general wireless framework was created. It's only in 2004 that Jeff Garzik started a project to create a generic wireless stack for Linux - and he started with a stack (HostAP) which, sometime later on, was seen as not being the best choice. So the work on HostAP - late to begin in the first place - was eventually abandoned. The networking stack which was eventually developed - mac80211 - began its life as a proprietary code base created with no community review or oversight at all. Predictably, it had all kinds of problems which required well over a year of work to resolve. Until mac80211 was in reasonable shape, there was no real way to get drivers ready for inclusion. The result of all this (and the occasional legal hassle as well) is that wireless networking on Linux lagged for years, and is only now reaching something close to a stable state. So it is not surprising that there has been a lot of code churn in this area, or that things occasionally break. But it is hard to see how trying to merge wireless drivers sooner would have helped the situation significantly. The non-merging of the Nouveau driver - the reverse-engineered driver for NVIDIA adapters - also has a simple explanation: the developers have not yet asked for this merge to happen. Nouveau is not considered to be at a point where it works yet, and, importantly, there are still user-space API issues which must be worked out. Breaking user-space code is severely frowned upon, so merging of code is nearly impossible if its user-space interfaces are still in flux. 
James Bottomley put forward another reason why a driver may stay out of the mainline even though the author would like to see it merged: For the record, my own view is that when a new driver does appear we have a limited time to get the author to make any necessary changes, so I try to get it reviewed and most of the major issues elucidated as soon as possible. However, since the only leverage I have is inclusion, I tend to hold it out of tree until the problems are sorted out. In other words, their control over access to the mainline tree is the one club subsystem maintainers have at hand when they feel the need to push a developer to make changes to a driver. It may well be that simply merging drivers regardless of technical objections (something which a number of developers are pushing for) will reduce the incentive for developers to get their code into top shape - and it's not always clear that others will step in and do the work for them. On the other hand, the idea that in-tree code tends to be less buggy than out-of-tree code is relatively uncontroversial. So, for many drivers at least, a "merge first and fix it up later" policy may well lead to the best results in the shortest period of time. One thing that is clear is that this discussion will not be going away anytime soon; chances are good that this year's kernel summit (happening in September) will end up revisiting the issue. Looking ahead to Mandriva 2009 Mandriva developer Adam Williamson recently announced the plans for Mandriva Linux 2009. The schedule and other details are available on the 2009 development wiki. There will be two alpha releases, two beta releases and two release candidates before the final release in October 2008. The first alpha will be available very soon, as the scheduled date is June 25, 2008. As usual, Mandriva 2009 will be available in the Free, One (live CD) and PowerPack editions. So what's in store?
Users of Cooker, Mandriva's development branch, will have already noticed the churn as gcc is upgraded to 4.3. There's also the switch to newer technologies such as libata and PolicyKit. The final kernel is not yet fixed but will likely be 2.6.26, with server, desktop and desktop586 flavors. The technical specifications are available in SVN, where they are changed to reflect progress. I looked at the PDF snapshot for more information. KDE 4.1 and GNOME 2.24 will both be available, along with updated packages such as OpenOffice.org 3 and Firefox 3. There's a new design for the installer, and a live distribution upgrade mode for MandrivaUpdate. The package management tools will be smarter about the removal of packages that are no longer required. The Windows migration tools have also gotten smarter, making it easier than ever for new users to get started with Linux. That's just the beginning. There is much more coming up in Mandriva Linux 2009. The Wine project releases version 1 Wine (Wine Is Not an Emulator) is one of the long-standing Windows interoperability projects that runs under Linux and other Unix-based systems: Wine is an Open Source implementation of the Windows API on top of X, OpenGL, and Unix. Think of Wine as a compatibility layer for running Windows programs. Wine does not require Microsoft Windows, as it is a completely free alternative implementation of the Windows API consisting of 100% non-Microsoft code, however Wine can optionally use native Windows DLLs if they are available. Wine provides both a development toolkit for porting Windows source code to Unix as well as a program loader, allowing many unmodified Windows programs to run on x86-based Unixes, including Linux, FreeBSD, Mac OS X, and Solaris. Wine is free software, released under the GNU LGPL. Although not game-specific, the ability to run Windows games has always been one of the major driving forces behind Wine.
The Wine AppDB page lists the numerous Windows applications that have been made to work under Wine. Photoshop CS2 stands out as one of the few highly popular Wine-compatible Windows applications that is not a game. The Wine Features document lists Wine's capabilities; it can run applications from DOS through Windows XP, though Windows Vista compatibility is not yet mentioned. The About Wine document explores the project's history, contributors, myths and more. The history document details the magnitude of the project: "Wine has grown to over 1.4 million lines of C code over the past decade. Nearly 700 people have contributed in some fashion. As always, you can expect Wine to be released sometime this year; or maybe early next year." Version 1.0 of Wine was announced (see the LWN reader comments) on June 17, 2008: The Wine team is proud to announce that Wine 1.0 is now available. This is the first stable release of Wine after 15 years of development and beta testing. Many thanks to everybody who helped us along that long road! There has been a series of Wine 1.0 release candidates over the last month involving a ton of bug fixes, janitorial code work, translation improvements and more. The details are available in the series of release notes for RC1, RC2, RC3, RC4, RC5 and finally version 1.0. Binary packages and source code for Wine 1.0 are available for download. In an arrangement that is fairly unusual for open-source projects, a commercial distribution of Wine known as CrossOver is available from CodeWeavers. CrossOver Linux 7.0, which is synchronized with Wine 1.0, was announced this week. The Application Security Desk Reference The Open Web Application Security Project (OWASP) has undertaken an ambitious project to create a reference manual (in the same vein as the Physician's Desk Reference) covering application security.
The book, along with a companion wiki, is meant to be the starting point for researchers, developers, and code reviewers when performing a number of security-related tasks. The book is currently in an alpha state, with OWASP looking for more reviewers and authors to get the book into a finished state by August. The Application Security Desk Reference (ASDR) will be a 900+ page book, extensively tagged (and cross-referenced in the wiki) to provide a multi-dimensional view of security threats, attacks, vulnerabilities, and impacts. The book introduces a set of principles that will help guide developers in avoiding these problems along with controls (aka countermeasures) to evade or eliminate them. The authors provide a description of why they took this approach: Application security information cannot be organized into a one-dimensional taxonomy that is useful for all purposes, although many have tried. For example, organizing application security by vulnerability helps tool vendors, but makes it very difficult for architects to select controls. We've adopted the folksonomy tagging approach to solving this problem. We simply tag our articles with a number of different categories. You can use these categories to help get different views into the complex, interconnected set of topics that is application security. The PDF 0.9 version is available, and it is already quite useful, though there is still a fair amount of work to do. An important goal is to provide a foundation: The ASDR is helpful as basic reference material when performing such activities as threat modeling, security architecture review, security testing, code review, and metrics. We intend to encourage understanding and consistency when discussing these basic foundational elements of application security. Security only works if people can make informed decisions about risk. The ASDR provides that basic information to help ensure all stakeholders are involved.
Technical books have an unfortunate tendency to rapidly go stale because the industry moves so quickly. Maintaining the wiki will help alleviate this problem by allowing for a dynamic reference that can be periodically produced in dead tree form as well. Much of this kind of information can be found in books and on the web, but collecting it up into one place is very valuable. Three sections of the current draft stand out as being closest to completion: Principles, Attacks, and Vulnerabilities. Principles contains 17 basic things to keep in mind as part of gaining a "security consciousness". It defines terms in clear language and provides reasons why the principle should be followed. An example: Security through obscurity is a weak security control, and nearly always fails when it is the only control. This is not to say that keeping secrets is a bad idea, it simply means that the security of key systems should not be reliant upon keeping details hidden. More than 50 attacks are listed, along with examples and concise descriptions. In addition, there are several hundred vulnerabilities listed, each with examples as well as information on which platforms or languages are affected. It clearly sets out to be a clearinghouse of application security information and looks like it is succeeding in that. For anyone with an interest in security, it is well worth a look. For those who are skilled in security techniques, assisting with the review and content creation might be in order. Deki helps Mozilla developers collaborate There was undoubtedly plenty of activity this week at the Mozilla Developer Center ahead of the release of Firefox 3. Thanks to a special tool created by the team at MindTouch and implemented into its latest product offering, Deki, Mozilla developers all across the globe were able to view the site in their native tongue.
The "polyglot" language feature is only one of several components that make up Deki, an open source collaboration tool for communities and the enterprise. The polyglot can distinguish between different languages across a single system so it's no longer necessary for IT professionals to allocate sections of a web site's infrastructure to overcome language barriers. Instead, multiple languages are consolidated into one system and a site's pages are then localized according to user settings. Deki functions much like a traditional wiki, but with more features and practical applications. In fact, the company originally called the product "Deki Wiki" but realized it was too limiting and recently dropped "Wiki" from the name altogether. Developers can use Deki as a way to organize and aggregate project data, share documents and media, or even author and create collaborative applications from the ground up. Groups and organizations also use Deki as a platform for managing a large knowledge base, coordinating team-based projects, or as a file repository. Deki is part application, part platform. It behaves much the same way as other content management frameworks like Drupal and Joomla!, but has the underpinnings of a wiki that give it collaborative features as well. Furthermore, everything under Deki's hood can be accessed via the API on which it was built, and can be extended in any programming language. At the heart of the platform is MindTouch Dream, which forms the application's architecture, and uses Deki as its interface. It's a .NET representational state transfer (REST) framework that runs on .NET 2.0 and Mono 1.2; .NET runs on Microsoft Windows Server 2003 and 2008, while Mono runs on Debian, Fedora, Ubuntu, openSUSE, and Apple OS X (see the web site for complete details). Data manipulation is done in XML using standard HTTP verbs, and data conversions to PHP, JSONP, etc. are done automatically behind the scenes.
Licensed under the GNU GPL and LGPL, Deki and Dream together can be completely customized and scaled to the needs of any size organization. Company co-founders Aaron Fulkerson and Steve Bjorg were approached last winter by Mozilla's Chief Evangelist Mike Shaver about implementing Deki in time for the upcoming re-launch of its Developer Center. "Mike had reviewed our API and architectural documentation and he was enthusiastic about MindTouch Deki," recalls Fulkerson. "Later on the phone, we discussed Mozilla's needs, pains, and how MindTouch Deki seemed to be the perfect solution. We also day-dreamed a little about what the Mozilla community might build on the MindTouch platform. By my recollection, we both were pretty excited about the opportunity." Given the Developer Center's wide geographical reach, barriers were to be expected as it struggled to cater to a group that collectively spoke dozens of different languages. In response, Bjorg and Fulkerson put together a design that allows for a multi-lingual Web site that scales as needed. As Mozilla's needs grow, additional languages can easily be added by translating a single file and submitting it for inclusion in the official Deki build. In fact, all current translations have come from the community, and more are on the way. Deki isn't just for large organizations. Development platform-as-a-service provider Bungee Connect uses it as a documentation repository at the moment, but according to the Director of Bungee Connect's Developer Community, Ted Haeger, the plan is to soon make it the community platform for its Developer Network. "Our developers are very interested in programmable Web technologies, and Deki will allow us to provide them the most feature-complete wiki API we have seen yet. We expect to see some interesting and exciting things built by combining Bungee Connect and MindTouch Deki," he says.
The decision to choose Deki over other similar options "was driven overwhelmingly by the architecture of the product. Because Deki provides a complete RESTful API, it makes it an extremely attractive offering for us," notes Haeger. Indeed, he considers the API Deki's best feature. "MindTouch has done an outstanding job with it," Haeger says. "Additionally, they have written their PHP front-end to the Deki API, which means that the API is central to the product rather than an afterthought. However, we should note that Deki's default PHP user interface is extremely polished, too. That combined with other must-haves, such as a permissions system that is considerably more flexible than what other wikis provide, helped solidify our decision." Though there are varying levels of support options available, Haeger says Bungee Connect hasn't yet decided which to choose. They do plan, however, to lean on MindTouch for assistance as they migrate company documentation from MediaWiki to Deki. For organizations planning to take on the task themselves, Fulkerson points to the helpful guide on its site and the MediaWiki-to-Deki converter they have written: "As we always have done, we've released the source code to our public SVN repository. It's stable and has had generous test coverage, but this should be considered a beta release." As Deki continues to gain traction in the enterprise as an agile content management system, Fulkerson and Bjorg say they knew they were on to something when they caught wind of the first user-organized conference held in Belgium last fall. Notes Fulkerson, "This was a pretty clear indication people liked what we're doing." A belated look at the Red Hat/Firestar patent settlement On June 11, Red Hat announced that it had reached a settlement in the software patent lawsuit it was defending against Firestar Software, Inc. and DataTern, Inc. This settlement is of interest to the community; it may point toward how such cases may go in the future.
Unfortunately, the amount of information which has been released so far leaves as many questions as answers, including the fundamental question of whether this settlement is as good for the community as Red Hat is claiming. The suit involves patent #6,101,502, which claims the concept of creating an impedance-matching layer to connect relational databases to object-oriented applications. The first claim reads like this: 1. A method for interfacing an object oriented software application with a relational database, comprising the steps of: selecting an object model; generating a map of at least some relationships between schema in the database and the selected object model; employing the map to create at least one interface object associated with an object corresponding to a class associated with the object oriented software application; and utilizing a runtime engine which invokes said at least one interface object with the object oriented application to access data from the relational database. One might well wonder how object-oriented programmers managed before 1998, when this patent was filed. Firestar claimed that a piece of JBoss violated the patent and duly filed suit; Red Hat has been fighting back ever since. The June 11 announcement appears to bring an end to this particular dispute. While Red Hat has not agreed that it was in violation of this patent, the company did not reach a settlement which clears it of infringement. Instead, Red Hat agreed to license the patent for itself and for its customers. The thing that makes this settlement a little more interesting is that Red Hat did not stop there; it also obtained a license for the project's upstream developers. From the settlement FAQ posted by the company: Upstream developers receive a perpetual, fully paid-up, royalty-free, irrevocable worldwide license to the patents in suit to engage in any and all activities with respect to a predecessor version of a Red Hat product. 
Those developers also receive a perpetual covenant not to sue with regard to all of DataTern's and Amphion's other patents on claims related to Red Hat products. The press release adds: The settlement also protects derivative works of, or combination products using, the covered products from any patent claim based in any respect on the covered products. Essentially, all that have innovated to create, or that will innovate with, software distributed under Red Hat brands are protected, as are Red Hat customers. So, in other words, this license and covenant covers the "predecessor versions" of any package shipped by Red Hat. Once a particular project finds its way into RHEL, it's part of the deal. This very carefully-worded text leaves one very interesting question open: what about users of the software who are not Red Hat customers? It would appear that developers are covered, presumably even as they develop the program beyond the "predecessor version" shipped by Red Hat. It has been made abundantly clear that Red Hat's customers are covered. There is a lot of text in the press release and FAQ suggesting that non-customer users should be protected too, but that is never said explicitly. An omission like that in a carefully-written, lawyer-vetted document can speak loudly; one must wonder what is going on. Another interesting question is this: what about all of the other projects out there which are using object-relational glue layers? One can only assume that this set includes just about every object-oriented application which is using a relational database. The language makes it pretty clear that this patent has not been licensed for free software in general; it only applies to the specific piece of JBoss which was under dispute. The press release claims that the settlement covers derivative works, leading one to imagine that it would be possible to incorporate some small function from JBoss into an entirely unrelated project and get the patent license with it.
But there is no way to know whether this interpretation matches the real settlement or not. And therein lies the real problem at this time: the actual terms of the settlement, and of the licenses and covenants, have not been published. One presumes that will change at some point; your editor queried Red Hat on when that might be, but did not receive an answer by the time this article was written. Without knowing what the actual agreement is, nobody can really assume that they have received any protection at all. One other claim from the FAQ merits attention: The settlement should encourage the open source community by providing broad protection as to the patents covered by the agreement. More generally, the settlement demonstrates Red Hat's commitment to standing up for the community against patent aggressors. We believe it will serve as a precedent that should discourage future similar cases. All of this is somewhat debatable, and needs to be questioned. As noted above, the actual breadth of the protection obtained is yet to be disclosed. The more relevant question, though, is: did Red Hat really "stand up for the community" in this case, and will it discourage these cases in the future? Your editor is not convinced of either. The way to stand up against this patent aggressor would have been to invalidate the patent and put an end to it forevermore. A quick trip to your editor's bookshelf turned up David Taylor's Business Engineering With Object Technology, dated 1995, which discusses difficulty with relational databases and impedance-matching layers. Grady Booch's Object Solutions (1996) says: "Thus, it is reasonable to approach the design of a data-centric system by devising a thin object-oriented layer on top of a more traditional relational database technology." Or look at Object-Oriented Modeling and Design by Rumbaugh et al. (1991), which has an entire chapter on mapping objects into relational databases.
In other words, there can be no shortage of prior art in this case; this is not an idea which was first conceived in 1998. But, rather than take this approach, Red Hat chose to settle. It is not said anywhere, but chances are good that some money changed hands here, and, by accepting a license for this patent, Red Hat has given it some legitimacy. Other free software projects - those which Red Hat does not ship - have apparently been left open to the same attack. Is this really the way to "discourage future similar cases"? Of course, such criticism is easy to make from the sidelines; it's easy for those of us not directly involved in the suit to claim that Red Hat should have taken the higher-risk, higher-expense road and fought this case to the end. There is no doubt that such an approach would be better for the community - assuming Red Hat prevailed - but Red Hat's management must make its own choices about which battles it is to fight. Given that it chose to settle, Red Hat clearly tried to do the right thing by obtaining some sort of protection for the community beyond its customer base. Time will tell how well that will work out and whether it will serve as a model for future settlements or not. A day in the life of linux-next The merge window phase of the kernel development cycle is a hectic time. Over a period of about two weeks, between 5,000 and 10,000 changesets find their way into the mainline git repository. Simply managing that many patches would be hard enough, but the job is made more complicated by the fact that these changesets are not all independent of each other. The first changes to be merged can change the code base in ways that cause later patches to fail to apply. So merge windows have traditionally required maintainers to rework their queued patches to resolve conflicts which arise as other trees are merged. 
Given the tight time constraints (patches which aren't ready when the merge window closes generally sit out until the next cycle starts), this integration process has been known to put a fair amount of pressure on subsystem maintainers. The other person feeling the stress was Andrew Morton; one of his many jobs was to bash subsystem trees together in his -mm releases. That took a lot of his time and didn't really solve the problem in the end; much of the work which shows up in -mm isn't necessarily intended for the next development cycle. The end result of all this is that each merge window brought together large amounts of code which had never been integrated before. Back in February, the linux-next tree was announced as a way to help ease some of these problems. We are now nearing the end of the first full development cycle to use linux-next, so it's worth taking a look to see how it is working out. The idea behind this tree is relatively simple. Linux-next maintainer Stephen Rothwell keeps a list of trees (maintained with git or quilt) which are intended to be merged in the next development cycle. As of this writing, that list contains 95 trees, all full of patches aimed at 2.6.27. Once a day, Stephen goes through the process of applying these trees to the mainline, one at a time. With each merge, he looks for merge conflicts and build failures. The original plan for linux-next stated that trees causing conflicts or build failures would simply be dropped. In reality, so far, Stephen usually takes the time to figure out the problem; he'll then fix up or drop an individual patch to make everything fit again. When this process is done, he releases the result as the linux-next tree for the day. Others then grab it and perform build testing on it; some people even boot and run the daily linux-next releases. 
All this results in a steady stream of problem reports, small fixes, patches moving from one tree to another, and so on - various bits of integration work required to make all of the pieces fit together nicely. There is an interesting sort of implicit hierarchy in the ordering of the trees. Subsystem trees which are merged early in the process are less likely to run into conflicts than those which come later. When two trees do come into conflict, it's the owner of the later tree - the one which actually shows the conflict - who feels the most pressure to fix things up. The history so far, though, shows that there has been very little in the way of finger-pointing when conflicts arise, as they do almost every day. All of the developers understand that they are working on the same kernel, and they share a common interest in solving problems. So, thus far, linux-next appears to be functioning as intended. It is serving as an integration point for the next kernel and helping to get many of the merging problems out of the way ahead of time. One aspect of this whole system remains untested, though: the movement of patches from linux-next into the mainline. As things stand now, there is no automatic movement between the trees; instead, maintainers will send their pull requests directly to Linus as always. If Linus refuses to merge certain trees, or if he merges them in an order different from their ordering in linux-next, integration problems could return. In the end, it seems like linux-next will have to drive the final integration process more than is anticipated now, but it will probably take a few development cycles to figure out how to make it all work. Meanwhile, anybody who is interested in 2.6.27 can, to a great extent, run it now by grabbing linux-next. 
This tree has clarified one aspect of the development process: the 2-3 month "development cycle" run by Linus is, in fact, just the tip of the kernel development iceberg. It is the final integration and stabilization stage. Linux-next nearly doubles the length of the visible development cycle by assembling the next kernel long before Linus starts working on it. And even linux-next only comes into play toward the end of a patch's life. In the past, Linus has pointedly worked to avoid overlapping the development and stabilization phases of the development cycle. There was no development tree at all for almost a year while 2.4 was beaten into reasonable shape. This separation was maintained out of a simple fear that an open development tree would distract developers from the more important task of finding and fixing bugs in the current stable release. That separation is a thing of the past now; there are literally dozens of development trees which are open for business at all times. That can only be worrisome to those who are concerned about the quality of kernel releases; why should developers concern themselves with 2.6.26 bugs when 2.6.27 is being assembled and 2.6.28 is already on the radar? Whether such concerns are valid is likely to be a matter of ongoing debate. Meanwhile, however, linux-next appears to have settled in as a long-term feature of the kernel development landscape. It is serving its purpose as a place to find and resolve integration problems; it has also had the effect of taking much of that integration work off of Andrew Morton's shoulders. And that, in turn, should free him to spend more time trying to get developers to fix all those bugs. (See the linux-next wiki for more information on how to work with this tree). What's AdvFS good for? On June 23, HP announced that it was releasing the source for the "Tru64 Advanced Filesystem" (or AdvFS) under version 2 of the GPL. This is, clearly, a large release of code from HP. 
What is a bit less clear is what the value of this release will be for Linux. In the end, that value is likely to be significant, but it will probably be realized in relatively indirect and difficult-to-measure ways. AdvFS was originally developed by Digital Equipment Corporation for its version of Unix; HP picked it up when it acquired Compaq, which had acquired DEC in 1998. This filesystem offers a number of the usual features. It is intended to be a high-performance filesystem, naturally. Extent-based block management and directory indexes are provided. It does journaling for fast crash recovery. There is an undelete feature. AdvFS is also designed to work in clustered environments. Much of the thought that went into AdvFS was concerned with avoiding the need to take the system down. There is a snapshot feature which can be used to make consistent backups of running systems. Defragmentation can be done online. There is a built-in volume management layer which allows storage devices to be added to (or removed from) a running filesystem; files can also be relocated across devices. The internal volume manager can perform striping of files across devices, but nothing more advanced than that; AdvFS will happily work on top of a more capable volume manager, though. There are a few things which AdvFS does not have. There is no checksumming of data, and, thus, no ability to catch corruption. Online filesystem integrity checking does not appear to be supported. The maximum filesystem size (16TB) probably seemed infinite in the early 1990s, but it's starting to look a little tight now. In general, AdvFS looks like something which was a very nice filesystem ten or fifteen years ago, but it has little that is not either available in Linux now, or in the works for the near future. And AdvFS doesn't even work with Linux - no porting effort has been made, and it's not clear that one will be made. 
So is this release just another dump of code being abandoned by its corporate owner? One could make a first answer by saying that, even if this were true, it would still be welcome. If a company gives up on a piece of code, it's far preferable to put it out for adoption under the GPL than to let it rot until nobody can find it anymore. But there may well be value in this release. Even if there is no point in trying to make it work under Linux, the AdvFS code is the repository of more than a decade of experience of making a high-end filesystem work in a commercial environment. Your editor had stopped working with DEC systems by the time AdvFS came out, but the word he heard from others is that the early releases were, shall we say, something that taught administrators about the value of frequent backups. But after a few major releases, AdvFS had stabilized into a fast, solid, and reliable filesystem. The current code will embody all of the hard lessons that were learned in the process of getting to that point. Chris Mason, who is currently working on the Btrfs filesystem, puts it this way: The idea is that well established filesystems can teach us quite a lot about layout, and about the optimizations that were added in response to customer demand. Having the code to these optimizations is very useful. Having that code licensed under the GPL is especially useful: any code which is useful in its current form can be pulled quickly into Linux. And, even when the code itself cannot be used, the ideas that it embodies can be borrowed without fear. And that is exactly what HP was hoping to encourage with this release: In case its not clear, this is a GPLv2 technology release, not an actual port to Linux. We're hoping that the code and documentation will be helpful in the development of new file systems for Linux that will provide similar capabilities, and perhaps used to make tweaks to existing file systems. And that would appear to be likely to happen. 
Over time, the best ideas and experience from AdvFS should find their way into the filesystems supported by Linux, even if AdvFS, itself, never becomes one of those filesystems. So HP has made a significant contribution to the kernel development process, one which will probably never show up in the changeset counts and other easily-obtained metrics. (Those interested in learning more about AdvFS would be well advised to grab the documentation tarball from the AdvFS sourceforge page. The "Hitchhiker's guide" is a good starting place, though, at 229 pages, it's not for hitchhikers who prefer to travel light.) Debian Lenny and the Eee PC? The ASUS Eee PC, a subnotebook computer, was first introduced at COMPUTEX Taipei 2007. The first models came with a modified version of the Xandros operating system. Xandros has roots in Debian, and strives to be easy to use for first-time Linux users and Windows-centric businesses. The company has never been afraid of using proprietary components to make that happen, which has made it less popular with free software fans. The little PCs, meanwhile, proved to be very popular. According to Wikipedia, ASUS sold over 300,000 units in 2007. Microsoft must have felt left out, so the next generation of the little notebooks was available with a modified version of XP. At the 2008 COMPUTEX, DistroWatch noted that "not all was well at the ASUS stand. As a visitor interested in Linux, I was disappointed to find just one of the products on display running the open source operating system. Even worse was the fact that the entire area was plastered with advertisements displaying large Windows and Microsoft logos. The only flyer available at the stand was a Microsoft one entitled 'It's better with Windows'." Naturally, the free software community has been working on free Linux variants to run on these small boxes. The most notable projects are EeeDora, a Fedora-based variant, and the DebianEeePC project. 
Now it seems the Debian effort may have a chance at becoming an official OS for the 2009 Eee PC. In a recent post to the Debian-eeepc-devel mailing list, Ben Armstrong says, "I just received an encouraging note from Ellis Wang of Asus in Taiwan following up on Martin Michlmayr's suggestions to Asus about how they could work more closely with the Debian community. Ellis has assigned Robert Huang the task of putting a working relationship in place between Asus and Debian, with backup provided by five other Asus employees." It would be great if ASUS would make pre-installed Debian Eee PC models. But even if they don't, free software enthusiasts can install their choice of EeeDora or custom Debian for themselves. Symbian to be another open mobile platform The already crowded open source mobile phone software market just got more so as Nokia has announced plans to open up the Symbian operating system. Symbian currently has the biggest installed base of any mobile OS, which makes this announcement somewhat more surprising—market leaders generally do not radically change their successful methods. What it means for the various Linux mobile phone initiatives is unclear, but it certainly shakes things up a bit. Nokia, along with many of the biggest players in the mobile phone market, has formed the Symbian Foundation to provide its members with the OS on a royalty-free basis. Several other components are being donated to the foundation as well, to create a complete platform for mobile applications. The plan is for all of the code to be released using the Eclipse Public License over the next two years. In order to own the code, Nokia is purchasing the 52% of Symbian Limited that it does not currently own for more than $400 million. This will allow Nokia to donate Symbian, along with its S60 smartphone platform, which runs atop Symbian, to the foundation. 
Sony Ericsson and Motorola will donate their UIQ user interface layer, while NTT DoCoMo will donate its Mobile Oriented Application Platform (MOAP). Nearly two dozen companies have come together to form the foundation, including handset makers, mobile carriers, and chip manufacturers. Interestingly, there is substantial overlap between Symbian Foundation members and those of the Open Handset Alliance—the umbrella organization for Google's Android effort—and the LiMo Foundation. Whether this reflects impatience with the pace of Android/LiMo development or just an effort to hedge their bets remains to be seen. Membership in the foundation is open to all who are willing to pay the $1500 annual membership fee. That fee will allow the use of all of the components that make up the Symbian platform on a royalty-free basis. Any developers that wish to create software for the platform need not join as there will be a developer program available at no charge. The foundation is expected to start operations in 2009. Opening up Symbian is seen as a reaction to Android and other free software efforts in the mobile phone space. One of the advantages touted for Linux solutions is the zero cost—particularly the lack of per-unit royalties. By moving Symbian to this model, the foundation undercuts that advantage. Because Symbian is already a dominant player in the smartphone market—with a large development community—there are some who believe it will redirect efforts currently focused on Linux to Symbian. That remains to be seen, of course, but Linux-based smartphones are still in their infancy. MontaVista's Mobilinux has been installed in more than 35 million mobile devices, mostly in Asian markets, but, perhaps because of it being controlled by a single company, hasn't really generated a large developer community. It may also be targeting mobile carriers who are not very interested in allowing users to customize their phones—at least not to the extent Android and others envision. 
There is a widening rift between the "free" and "locked down" camps for mobile devices. With this move, Nokia—and the other foundation members—seem to be moving toward allowing users more freedom, though undoubtedly some handset makers and carriers will opt for locking down their phones regardless of the openness of the underlying OS. One need look no further than the iPhone for an example of a tightly controlled application environment that is, at least so far, very popular with consumers. In the long run, it is hard to imagine that mobile device users will be willing to stick with the limited choices of applications provided by their carrier or phone maker. As more open alternatives become available, there will be a pushback from handset buyers that will be harder for the carriers to resist. For many, their mobile phone is the most sophisticated computer they own and the history of personal computers would indicate that a thriving ecosystem of the third-party applications is an important part of the purchasing decision. That requires developers. The current proliferation of open mobile phone software platforms is, in many ways, a battle for developer mindshare. LiMo, Android, and OpenMoko are all Linux-based development platforms that support multiple hardware devices, which should allow applications to run on many different mobile devices with minimal porting. How well that works in practice is still an open question. For many of the established players in the mobile device market, Symbian is a known quantity. It has shipped on countless devices—its strengths and weaknesses are well understood. Turning it into a free software release will allow, at least potentially, members to move the Symbian code in the direction they want. But will that stop, or substantially slow down, the adoption of Linux-based solutions? 
In order for that to happen, Symbian itself will need some kind of developer community, something like what currently exists for the kernel and user space applications on Linux. Whether the opening of the code will be enough to attract that community is an open question. It may be that developers at the member companies will be forced to form that community - something that could affect the bottom line. One of the key problems that the various Linux-based efforts face is that of fragmentation. The vendors of royalty-based mobile platforms - primarily Microsoft and Palm - tend to point to the multiple incompatible Linux efforts as proof. They tout the control that a single vendor provides to ensure compatibility. Others, like Apple and RIM (maker of Blackberry email phones), do not license their software to others so they tightly control the hardware, which tends to avoid fragmentation. Within a particular initiative, fragmentation is likely to be a very bad thing, but having multiple platform choices tends to provide healthy competition and thus help consumers. Over time, some of the current Linux-based platforms may fall by the wayside to leave fewer choices, but that will likely happen due to technical considerations, part of which will be determined by the third-party application developers. One question remains, though: what happens with Qt, or more specifically the Qtopia Phone Edition? Nokia bought Trolltech early this year, at least partially for their mobile toolkit. Will they port it to Symbian and donate it to the foundation? They could, of course, port it but keep it separate, but that would seem to lead down the path toward fragmentation. It seems somewhat unlikely that they would change Trolltech's successful hybrid of GPL and commercial licenses, but before this announcement few thought that Symbian would be freed. 
Nokia has certainly adopted a more open-friendly stance of late - they clearly see it as a way to generate more business - so it certainly is not out of the realm of possibility. While opening up Symbian may inhibit Linux adoption on mobile devices, it can only be seen as a good thing for consumers and the free software community as a whole. In many ways, it validates the free software development model along with the idea of freedom for users and developers. The competition between Linux and Symbian will also likely help both improve. Expect lots of interesting devices and applications in the next few years because of it. The Elastix PBX system Elastix is a Linux-based Private Branch eXchange (PBX) telephony system that is built on the CentOS Linux distribution. Elastix uses the Asterisk PBX software as its base and adds a number of extensions. Elastix is being developed by PaloSanto Solutions. From the Elastix User Manual [pdf]: "Elastix is an appliance software that integrates the best tools available for Asterisk-based PBXs into a single, easy-to-use interface. It also adds its own set of utilities and allows for the creation of third party modules to make it the best software package available for open source telephony. The goals of Elastix are reliability, modularity and ease-of-use. These characteristics added to the strong reporting capabilities make it the best choice for implementing an Asterisk-based PBX." Some of the Elastix features include: A web-based user interface. A built-in help interface. Modular design for easy management of features. Support for multiple virtualized systems on one platform. Can present a variety of system status reports. A built-in voicemail system. Support for VoIP telephony. Support for faxes with fax to email conversion. Support for instant messaging. A built-in mail server. Support for video phones. A billing interface. Support for automatic outgoing telemarketing calls. Multi-language support. 
The screen shots show the Elastix user interface in action. Stable version 1.1 of Elastix was recently announced: "This new version contains updates to more than 130 packages. It also brings together the new "Agenda" module which allows you to access an integrated Calendar and Phone Book in a very user-friendly manner. The calendar module allows a user to schedule events which can activate automatic phone call reminders. In addition, version 1.1 brings a Phone Book interface which you should all be pretty familiar with. It lists people's names with their phone numbers. The interesting thing here is that you can click-to-call your contacts in the Phone Book. And that is not all! We have placed special emphasis on the end user. Starting with version 1.1 the end user may login to Elastix and find a "Dashboard" with quickly accessible information about personal emails, calendar, faxes, voicemails, etc." An Elastix 1.1 CD image was downloaded and burned onto a CDROM. The CD was installed onto an old 1.4 GHz Athlon system with a 15GB hard drive. To actually use the system, an Asterisk-compatible telephone interface card should be installed on the host machine. The system installed with no problems, booted up and the login screen came up with a message to access the system via the web on the DHCP-supplied LAN address. The Elastix web interface was accessed from another local machine. At this point, the documentation (still at version 0.9) fell short due to a lack of information on the required username/password. A little searching on Google revealed the answer (admin/palosanto) from the online Elastix PBX Installation instructions. Once logged into the web interface, clicking through the many different pages showed that the system appeared to be functioning normally. An incredible array of capabilities exists in the system and it looks to be fairly easy to master. 
It was not possible to test any real telecom uses due to the lack of a telephone interface card; however, adding and configuring a card can be done after the system has been installed. If you have a need for a low cost PBX, or simply want an easy way to play with Asterisk, Elastix is a good way to proceed. Freezing filesystems and containers Freezing seems to be on the minds of some kernel hackers these days; whether it is the northern summer or southern winter that is causing it is unclear. Two recent patches posted to linux-kernel look at freezing (essentially, suspending) two different pieces of the kernel: filesystems and containers. For containers, it is a step along the path to being able to migrate running processes elsewhere, whereas for filesystems it will allow backup systems to snapshot a consistent filesystem state. Other than conceptually, the patches have little to do with each other, but each is fairly small and self-contained so a combined look seemed in order. Takashi Sato proposes taking an XFS-specific feature and moving it into the filesystem code. The patch would provide an ioctl() for suspending write access to a filesystem (freezing), along with a thawing option to resume writes. For backups that snapshot the state of a filesystem or otherwise operate directly on the block device, this can ensure that the filesystem is in a consistent state. Essentially the patch just exports the freeze_bdev() kernel function in a user accessible way. freeze_bdev() locks a file system into a consistent state by flushing the superblock and syncing the device. The patch also adds tracking of the frozen state to the struct block_device state field. In its simplest form, freezing or thawing a filesystem would be done with a single ioctl() call on fd, where fd is a file descriptor of the mount point and the argument is ignored. In another part of the patchset, Sato adds a timeout value as the argument to the ioctl(). 
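As a rough illustration of the basic freeze/thaw call described above, here is a minimal Python sketch. It assumes the request names FIFREEZE and FITHAW and the request numbers that later appeared in the mainline <linux/fs.h> (_IOWR('X', 119, int) and _IOWR('X', 120, int), as encoded on x86); the patch under discussion may well differ. The frozen() helper and its ioctl parameter are hypothetical names introduced only for this example.

```python
import fcntl
import os
from contextlib import contextmanager


def _iowr(type_char, nr, size):
    """Encode a read/write ioctl request number the way x86 Linux does."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# Assumption: values as found in later mainline kernels, not taken from the patch.
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


@contextmanager
def frozen(mount_point, ioctl=fcntl.ioctl):
    """Freeze the filesystem at mount_point for the duration of a with-block.

    The ioctl parameter is a hypothetical hook so the sequence of calls can be
    dry-run without root privileges or a freezable filesystem.
    """
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        ioctl(fd, FIFREEZE, 0)        # argument 0: no timeout requested
        try:
            yield
        finally:
            ioctl(fd, FITHAW, 0)      # always thaw, even if the backup fails
    finally:
        os.close(fd)
```

A backup tool would wrap its snapshot step in "with frozen(mount_point):" so that the thaw is issued even when the snapshot raises an exception. Running it for real requires root privileges and a filesystem that supports freezing.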
For XFS compatibility—though courtesy of a patch by David Chinner, the XFS-specific ioctl() is removed—a value of 1 for the pointer argument means that the timeout is not set. A value of 0 for the argument also means there is no timeout, but any other value is treated as a pointer to a timeout value in seconds. It would seem that removing the XFS-specific ioctl() would break any applications that currently use it anyway, so keeping the compatibility of the argument value 1 is somewhat dubious. If the timeout occurs, the filesystem will be automatically thawed. This is to protect against some kind of problem with the backup system. Another ioctl() flag, FIFREEZE_RESET_TIMEOUT, has been added so that an application can periodically reset its timeout while it is working. If it deadlocks, or otherwise fails to reset the timeout, the filesystem will be thawed. Another FIFREEZE_RESET_TIMEOUT after that occurs will return EINVAL so that the application can recognize that it has happened. Moving on to containers, Matt Helsley posted a patch which reuses the software suspend (swsusp) infrastructure to implement freezing of all the processes in a control group (i.e. cgroup). This could be used now to checkpoint and restart tasks, but eventually could be used to migrate tasks elsewhere entirely for load balancing or other reasons. Helsley's patch set is a forward port of work originally done by Cedric Le Goater. The first step is to make the freeze option, in the form of the TIF_FREEZE flag, available to all architectures. Once that is done, moving two functions, refrigerator() and freeze_task(), from the power management subsystem to the new kernel/freezer.c file makes freezing tasks available even to architectures that don't support power management. As is usual for cgroups, controlling the freezing and thawing is done through the cgroup filesystem. Adding the freezer option when mounting will allow access to each container's freezer.state file. 
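The freezer.state interface can be driven from user space with ordinary file reads and writes. The following is a minimal Python sketch of that interaction, using the state names given in this article (RUNNING, FREEZING, FROZEN); the helper names, the retry count, and any cgroup directory passed in are assumptions made for illustration, not part of the patch.

```python
import os
import time


def read_state(cgroup_dir):
    """Return the current contents of the cgroup's freezer.state file."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def write_state(cgroup_dir, state):
    """Write a new state (FROZEN or RUNNING) to freezer.state."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def freeze(cgroup_dir, retries=5, delay=0.1):
    """Try to freeze a cgroup, retrying while it is stuck in FREEZING.

    Returns True once the state reads FROZEN; otherwise cancels the attempt
    by writing RUNNING and returns False.
    """
    for _ in range(retries):
        write_state(cgroup_dir, "FROZEN")
        if read_state(cgroup_dir) == "FROZEN":
            return True
        time.sleep(delay)             # some task could not be frozen yet
    write_state(cgroup_dir, "RUNNING")
    return False


def thaw(cgroup_dir):
    """Thaw a frozen cgroup."""
    write_state(cgroup_dir, "RUNNING")
```

The retry loop is one possible policy; a caller could equally move the offending tasks out of the cgroup before trying again.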
This file can be read to get the current freezer state, or written to change it: writing FROZEN requests a freeze, and writing RUNNING requests a thaw. It should be noted that it is possible for tasks in a cgroup to be busy doing something that will not allow them to be frozen. In that case, the state would be FREEZING. Freezing can then be retried by writing FROZEN again, or canceled by writing RUNNING. Moving the offending tasks out of the cgroup will also allow the cgroup to be frozen. If the state does reach FROZEN, the cgroup can be thawed by writing RUNNING. In order for swsusp and cgroups to share the refrigerator() it is necessary to ensure that frozen cgroups do not get thawed when swsusp is waking up the system after a suspend. The last patch in the set ensures that thaw_tasks() checks for a frozen cgroup before thawing, skipping over any that it finds. There has not been much in the way of discussion about the patches on linux-kernel, but an ACK from Pavel Machek would seem to be a good sign. Some comments by Paul Menage, who developed cgroups, also indicate interest in seeing this feature merged. Notes on the Fedora board election The Fedora Project recently held an election to fill four seats on its governing board. This is the first vote to happen since Red Hat decided to let the community elect the majority of the board's members. The results of this vote surprised the Fedora community in a couple of ways, leading to an extended discussion on how this community should be governing itself - and whether it can do that at all. In the end, Tom Callaway, Jesse Keating, and Seth Vidal were elected to the board for two release cycles, and Jef Spaleta for one cycle. The fifth elected seat is currently held by Matt Domsch; three of the appointed seats are currently held by Bill Nottingham, Karsten Wade, and Harald Hoyer. Red Hat has not yet announced who will be put into the fourth appointed seat. The newly-elected members are all well-known Fedora contributors who have done a lot for the project. 
So why are there questions? It comes down to two points: (1) Three of the four representatives elected to the board are employed by Red Hat. So, while Red Hat has given up its ability to directly appoint the majority of the board, that board will still be dominated by Red Hat employees. (2) Of the 4069 Fedora community members who were entitled to vote in this election, only 250 actually turned in ballots. A 6% turnout strikes many as being somewhat lower than one would expect from a fully-engaged community. Though nobody said so directly, some people apparently suspected that Red Hat employees voted in rather larger numbers than anybody else, and that they duly elected some of their own to fill the board seats. The truth of the matter is probably not so simple; what we are seeing is a middle stage in the Fedora Project's ongoing effort to become a more open, community-oriented effort. A few possible reasons for the low turnout were put forward. One had to do with how the election was conducted. The self-nomination process evidently does not sit well with some people, who would rather see candidates nominated by their peers. The range voting mechanism used by the project seems complex and intimidating - though it still seems simple compared to the Condorcet scheme employed by Debian. There were also some complaints that the election was not run in a sufficiently high-profile manner, to the point that many community members might not have known that an election was underway at all. Greg DeKoenigsberg put forward a different hypothesis to explain why so few people voted: "IMHO, a properly functioning governance body *should* be so effective that no one cares much either way when it comes time to replace the membership. From my perspective, low turnout means low dissatisfaction. All other indicators seem to point to continued success for Fedora and its contributors... I myself almost didn't vote. Why? Because I liked the entire slate of candidates." 
In this point of view, everybody is so happy that there's no need to get involved in the process. There is a contrary point of view which is also worth considering, though: What I mean is that almost all Fedora related decisions come out of Red Hat anyway. The few +1 from community seats during FPB meetings don't matter, do they? They are just noise. By this line of reasoning, instead of everybody being happy, the community is in despair and sees no point in participating in a process which seems unlikely to change anything. The truth of the matter is almost certainly somewhere in between. The Fedora project has clearly opened considerably in recent years, to the point that it is one of the most transparent and active distributions out there. The community contributes a lot of work and certainly participates in discussions about the future of the project. But Red Hat still holds considerable sway; the fact that it employs a great number of Fedora developers is, by itself, enough to ensure that. Red Hat's large presence is also enough to explain the large number of Red Hat employees elected to the board. Those are the people who have the luxury of working on Fedora full time; it is not surprising that they tend to be the most prominent developers in the community. Additionally, there is a certain tendency for outsiders who become strong community members to eventually become Red Hat employees as well. Red Hat has been increasing its investment in Fedora and hiring a number of people to work on it; the fact that they would be inclined to hire people who are already doing good work with Fedora should not be surprising. So when Fedora developers look at a ballot and think about the names found there, chances are good that they will vote for the people they have seen working hard and accomplishing things within the community. And those people, at this point, are likely to be Red Hat employees. 
Until a time comes when other companies find it worthwhile to pay full-time Fedora developers, this situation is not likely to change much. The free software community is full of examples of company-dominated projects. The bulk of these projects are subject to a high degree of control by the sponsoring company. That is natural; these companies have specific needs which they expect their development projects to meet. Making such projects truly open can be hard. Red Hat has gone farther than many in its efforts to make Fedora open, even if said efforts have come later than some would like. Hopefully Red Hat will continue to follow that path, but, to a great extent, the next steps have to be taken by others. When the investment into Fedora from outsiders exceeds Red Hat's investment, Red Hat will be less of a dominant force. Until then, efforts to increase the number of people voting in board elections - while being worthwhile and welcome - are unlikely to significantly change the results of those elections. Leaking browser history Browser history is fairly sensitive information for most people. If there were a way for random web sites to grab a list of other sites you have visited recently, it would cause a fair amount of concern. Unfortunately, a longstanding problem in the HTML Document Object Model (DOM) makes for an information leak nearly as bad as that. The problem stems from the handy feature that browsers implement to show you which links you have already visited. Browsers show visited links in a different color by turning on the "visited" style for the link. Many sites, such as LWN, then change the default colors for both visited and non-visited links via the site's Cascading Style Sheet (CSS). This information gets recorded in the DOM for the page, which can be queried from Javascript.
Because of the nature of the leak, scripts cannot get a full dump of the browser's history, but they can get the visited status for a set of sites they are interested in. A web site that wishes to gather this kind of information need only add a link to each site of interest—often in an unreadable font size or color—and send over a bit of Javascript to read the DOM status for each link. While this problem has been known since at least 2002, there is no easy fix while still being compliant with the CSS standard. Because of that, most or all browsers are vulnerable. It has recently been in the news because it is being used in a benign, or at least semi-benign, way. These days many news sites and blogs have small images that correspond to various social networking sites—digg, reddit and the like—that allow voting on particular stories or postings. Those images are buttons that register a vote or submission of the site that displays them. With the proliferation of these sites, a great deal of screen real estate was being taken up by these icons, many of which were not useful because the person viewing them never visited those particular sites. To reduce the clutter, Aza Raskin created some Javascript code to determine which of the social networking sites a particular user had visited so that only the icons for those sites were displayed. Many people would find that to be a useful hack, one that was fairly minimally intrusive, which it is at some level. Others, with a more strict personal privacy desire, might find it more than a bit creepy. Reducing clutter is one thing, but this technique can be used to gather much more sensitive information than which of the many social networking "news" sites you visit. It is tempting to remind readers of the NoScript Firefox extension, but it has become increasingly difficult to do nearly anything on the web without enabling Javascript. 
Many sites essentially hide their content behind a Javascript test, refusing to display it unless Javascript is enabled. This makes it difficult to avoid giving away some of your browsing history to dodgy sites—or those with cross-site scripting vulnerabilities—other than by avoiding them entirely. It is an unfortunate side effect of a useful property that, as the discussion on the Mozilla bugzilla shows, will be difficult to completely eliminate. It should be noted that the links do not have to be obfuscated—by adding a dash of Javascript LWN could know whether you have visited digg or reddit. But, of course, we don't force Javascript on our readers. More DTrace envy Nearly a year ago, we looked at the status of SystemTap in the context of Sun's much-hyped DTrace tool. Since that time there has been progress, but the basic problem still remains: Linux does not have a good, ready-to-run answer to those wanting the equivalent functionality of DTrace. Due to an apparent disconnect between the developers of SystemTap and the kernel hackers, tracing for the Linux kernel—never mind user space programs—is not up to the competition. Both SystemTap and DTrace are tools meant to help administrators track down performance and other problems on production systems by instrumenting the kernel. Because SystemTap has not matured to the point of easy usability, DTrace is often seen as a prime differentiator between Linux and Solaris. In a posting to the ksummit-2008-discuss mailing list—where Kernel Summit topics are considered—Matthew Wilcox brought up the subject based on his experience at a recent PostgreSQL conference: There was a lot of buzz around DTrace. Sun and a couple of other companies have put DTrace hooks into postgres, so they now have some really useful canned queries. If you're running Solaris or MacOS, of course. So there was a lot of talk about switching away from Linux. This can't possibly be a good thing for us. 
I don't personally know what the state of our competing projects are, but clearly they haven't got their hooks into postgres ... at least not upstream. Typically Linux has been in the forefront of interesting new technologies for free operating systems. When Sun opened up Solaris, though, a few features jumped ahead of their Linux counterparts, in particular the ZFS filesystem and DTrace. SystemTap is supposed to provide the tracing functionality while Btrfs is the leading candidate for a "next generation" filesystem. But, so far, SystemTap has not lived up to its potential. There are a few reasons for disappointment with SystemTap, some of which were pointed out by James Bottomley: When I go around end users, I find people in two camps: The ones who've drunk the sun coolaid and won't take anything on linux that isn't a fully replicated dtrace (sort of like windows people who demand the availability of outlook on linux) and people who are migrating to Linux and trying to use systemtap for tracing. These latter seem to have a number of genuine concerns including latency, the time it takes to actually go from command executing to functional trace, the inability to trace user programs (dtrace can) and concerns about the amount of perturbation the probes actually place inside the kernel. Those are all valid concerns, but the biggest problem for users is that, unless they are knowledgeable about kernel internals, it is difficult to know how to use SystemTap. A more simplified interface, one that is less reliant on kernel internals, needs to be created; the way to do that is through the placement of static trace points in the kernel and the creation of "tapsets" to make them easily usable. The SystemTap developers think the kernel hackers are in the best position to do that work. 
Ted Ts'o agrees but sees some barriers: The big thing that are missing are the tapsets — the macro libraries that allow a system administrator to use it to find and solve performance problems without being a kernel developer, and more importantly, the documentation for said macro libraries so a system administrator can actually use it. [ ... ] the real problem isn't as much kernel developers, it's that (a) it's too hard for many kernel developers to use (and so many kernel developers are [not] using it), and (b) there aren't enough tapsets. The latter is something that kernel developers can help solve, but unfortunately I'm not sure discussing it at the Kernel Summit will necessarily lead to making forward progress. If the kernel developers have trouble using SystemTap, they are unlikely to add the tapsets that would make it more usable for system administrators and others who have some general kernel knowledge but not enough to sensibly instrument it. For people using distribution kernels—at least for the enterprise distributions and Fedora—it is only somewhat painful to get SystemTap up and running. But kernel hackers tend to run their own kernels, often many different versions in a short period of time, so they need to be able to easily build one that works with SystemTap and includes all of the debugging information that it requires. SystemTap developer Frank Ch. Eigler has a long reply to many of the complaints in the thread. It seems clear that the SystemTap folks and the kernel hackers have not been communicating—there are solutions to many of the problems that were cited. They are in various states of readiness, but are mostly working. So SystemTap is most of the way there for kernel tracing as long as you are well-versed in kernel internals, but that has been true for some time. In order to get SystemTap to where it needs to be, the kernel hackers need to be involved.
Building the infrastructure and waiting for tapsets to magically appear is not a recipe for success. The SystemTap hackers need to be engaging the kernel community, as well as distributions, to make the tool into something that gets used. SystemTap can use static probe points, kernel markers—merged into 2.6.24—but it is notable that no one has, as yet, made use of them. A concerted effort needs to be made to make the tool more usable for the kernel developers who can, in turn, help make it more usable for others. There is a clear problem when folks like Ts'o regularly try, but find it too difficult to be useful: But maybe as more people try using it, they'll discover some of these rough edges, and will start trying to fix it. Every couple of months, I've tried using it, and because it [h]as so many rough edges, I've normally found it less work to debug the kernel using manual methods rather trying to make Systemtap work on my system and with my kernel development workflow. It is a commonly heard complaint that while SystemTap is difficult to use, DTrace "just works" for Solaris; Eigler responds: Yeah, so I hear, but think about how different their target environment is. Their kernel hardly changes (several fixed APIs, ABIs): this has huge implications. Their kernel was willing to insert probes (~ markers), a bunch of build system changes (debug info subset transcribing). Here in linux land, we suffer multifaceted tensions and it is hard to go toward a goal without obstructions (well-meaning as they may be). A bunch of third-party scripts are often conflated with "dtrace", which is just a matter of growing the user community enough, and giving them a good tool to build on top of. A growing set of runnable end-user scripts is already packaged with systemtap, intended for use by nonexperts, more help (e.g. concise problem statements about what you'd like to measure/see) would be welcome. 
Many administrators and other users of tracing facilities are not necessarily interested in kernel-level tracing, but would really like to be able to use the instrumented versions of things like PostgreSQL. That is in the plan according to Eigler: "We aim to piggyback on these efforts by reusing the dtrace instrumentation calls embedded into postgres etc., if at all possible." Until the rough edges can be smoothed on the kernel side, Bottomley wonders if it even makes sense to start considering user space: Although there are differing opinions about what systemtap could and should do, it's clear that it's not working incredibly well for its design space: the kernel, so talking about extending it to userspace is a premature. DTrace sounds like a nice working solution that has many uses and many happy users. If one can ignore the self-congratulatory postings from its lead developer, it might be worth having in Linux, but that simply is not going to happen. Paul Fox is working on a port of DTrace to Linux, but that ignores the licensing realities that would never allow it to become part of Linux. It also ignores the difficult path a DTrace port would face getting merged into the mainline. (We hope to have an article from Mr. Fox on his DTrace porting work soon, stay tuned). For all of the talk out of Sun about how they would love to make DTrace a part of Linux, they clearly made a choice to ensure that could not happen. Even if any technical barriers were lifted, the CDDL is not compatible with the GPL. It is perfectly fine as a free software license, but if you wish to get things into Linux, they must be licensed in a GPL-compatible way. This was well understood at the time Sun freed Solaris, so this must have been a conscious decision. Given how much their marketing organization likes to tout DTrace, it would seem to be a choice that Sun is quite happy with. 
Linux will eventually get the tracing support it needs, in a way that is easily accessible to users, but it may take some time. Conversations like the recent one on ksummit-2008-discuss are an important part of getting there. It would appear that better support for the use cases of kernel developers will be forthcoming. It is mostly a matter of documentation along with simplifying some of the building and installation issues. Once the kernel hackers actually start using it, progress is likely to be fairly swift. This is the way free software development works; it generally does not track a straight path to a solution, but often wanders about in the solution space for a while. It is highly unlikely that a development like DTrace could have come about in the way that it did in a true community-developed operating system. For that you need everyone pulling in the same exact direction, which may be why Sun is reluctant to turn over much of the governance of Solaris to the community. That may help them develop things more quickly, because there will be fewer barriers, but it won't help them to foster the kind of development community that characterizes Linux. Making power policy just work The sched_mc_power_savings parameter (cleverly hidden under /sys/devices/system/cpu) was introduced in the 2.6.18 kernel. If this parameter is set to one (the default is zero), it changes the scheduler load balancing code in an interesting way: it makes an ongoing effort to gather together processes on the smallest number of CPUs. If the system is not heavily loaded, this policy will result in some processors being entirely idle; those processors can then be put into a deep sleep and left there for some time. And that, of course, results in lower power consumption, which is a good thing. Vaidyanathan Srinivasan recently noted that, while this policy works well in a number of situations, there are others where things could be better. 
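As a minimal sketch of how the sched_mc_power_savings knob described above might be inspected or changed from user space, the following Python reads and writes the parameter. The full path is an assumption built from the directory and parameter name given in the text; the file is only present on kernels of that era with the multi-core scheduler option enabled, writing to it requires root, and the helper names are invented for illustration.

```python
from pathlib import Path

# Assumed location: the directory and parameter name come from the text above.
# The file only exists on kernels that provide this knob.
SCHED_MC_PATH = Path("/sys/devices/system/cpu/sched_mc_power_savings")


def read_power_savings(path: Path = SCHED_MC_PATH) -> int:
    """Return the current setting; 0 is the default (no packing)."""
    return int(path.read_text().strip())


def set_power_savings(enabled: bool, path: Path = SCHED_MC_PATH) -> None:
    """Write 1 to gather processes onto fewer CPUs, 0 to restore the default."""
    path.write_text("1\n" if enabled else "0\n")


if __name__ == "__main__":
    if SCHED_MC_PATH.exists():
        print("sched_mc_power_savings =", read_power_savings())
    else:
        print("sched_mc_power_savings is not available on this kernel")
```

The path is passed as a parameter so the helpers can be exercised against an ordinary file on systems where the sysfs entry does not exist.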
The sched_mc_power_savings policy is relatively conservative in how it loads processes onto CPUs, taking care to not overload those CPUs and create excessive latency for applications. As a result, the workload on a large system can still end up spread out more widely than might be optimal, especially if the workload is bursty. In response, Vaidyanathan suggests making the power savings policy more flexible, with the system administrator being able to select a combination of power savings and latency which works well for the workload. On systems where power savings matters a lot, a more aggressive mode (which would pack processes more tightly into CPUs) could be chosen. This suggestion was controversial. Nobody disputes the idea that smarter power savings policy would be a good idea. But there is resistance to the idea of creating more tuning knobs to control this policy; instead, it is felt, the kernel should work out the optimal policy on its own. As Andi Kleen puts it: Tunables are basically "we give up, let's push the problem to the user" which is not nice. I suspect a lot of users won't even know if their workloads are bursty or not. Or they might have workloads which are both bursty and not bursty. There are a couple of answers to that objection. One is that the system cannot know, on its own, what priorities the users and/or administrators have. Those priorities could even change over time, with performance being emphasized during peak times and low power usage otherwise. Additionally, not all users see "performance" the same way; some want responsiveness and low latency, while others place a higher priority on throughput. If the system cannot simultaneously optimize all of those parameters, it will need guidance from somewhere to choose the best policy. And that's where the other answer comes in: that guidance could come from user space. 
Special-purpose software running on large installations can monitor the performance of important applications and adjust resources (and policies) to get the desired results. Or, in a somewhat different vision, individual applications could register their performance needs and expected behavior. In this case, the kernel is charged with somehow mediating between applications with different expectations and coming up with a reasonable set of policies. In the middle of all this, it was pointed out that a mechanism by which expectations can be communicated to the kernel already exists: the nice level (priority) associated with each process. In a simple view of the world, a process's nice level would tell the kernel how to manage it with regard to power savings; on a system with a number of niced processes, those processes would be gathered onto a subset of processors during periods of relatively low activity. In essence, this policy says that it is not worthwhile to power up more processors just to give better throughput to low-priority processes. It does not take long, though, to come up with situations where the use of nice levels leads to the wrong sort of results. Peter Zijlstra observed that he has niced processes (created with distcc) which should have access to all of the CPU power available, but which should not contend with interactive processes on the same system. In such cases, those processes should have a high nice value with regard to CPU usage, but that should not interfere with their ability to move onto idle CPUs, if any exist. So the answer may take the form of a separate "powernice" command which would regulate a process's priority when it comes to causing the system to draw more power. Nice levels may (or may not) prove to be sufficient information to let the system choose an optimal power policy. But it will be some time before anybody really knows that; work on optimizing power usage - especially on server systems - is not in an advanced state.
So pressure to add tuning knobs for power policies may continue, for one simple reason: people want ways of experimenting with different policies and seeing what the results are. Until we really know what the effects of different policies are - on both power usage and system performance - it will be hard to build a system which can choose an optimal policy on its own. Netgear's open router Your editor was recently reminiscing about an early stage of his career, which involved the administration of a VAX 11/780 computer. The VAX was a highly successful product, as was its native operating system VMS. Quite a few VAX customers chose to do without VMS, though, and put early versions of BSD Unix on them instead. Digital Equipment Corporation never entirely appreciated those customers. To DEC, every BSD installation looked like a lost VMS service contract. The company should, instead, have seen those installations as an extra sale gained as a result of the VAX's ability to run a nice operating system. Almost 30 years later, some parts of the computing industry have come to understand that there is value in selling hardware which can run operating systems provided by others. Microsoft made that point in a big way, of course, but there are also significant parts of the industry which benefit from making systems which can run Linux - and, in particular, a version of Linux which is not necessarily supplied by the vendor. But other sectors still seem to see the ability for the customer to put (or replace) Linux on their systems the way DEC saw Unix in the early 1980's. They see no value in letting their customers make changes to their systems, choosing instead to lock those systems down and keep total control. Embedded systems are often singled out as an example of this type of behavior, and vendors of small routers tend to be especially inclined in this way. 
It is not a coincidence that a substantial portion of the high-profile GPL-enforcement cases to date have involved consumer-level routers. Some vendors, at least, are getting smarter and doing what they need to do to avoid licensing problems. But relatively few of them welcome customers who want to replace the software on "their" devices. There are exceptions, though, and their number just grew with this announcement from Netgear. The WGR614L router looks like a fairly straightforward consumer wireless router, with the usual set of features. LWN readers will doubtless be glad to hear that it is "Works with Windows Vista" certified. It has a four-port Ethernet switch, an 802.11g access point, and a mighty 240 MHz CPU and 16MB of RAM. All of the stuff one would expect from an inexpensive desktop device. But what makes this device interesting is that it's designed to be open and hackable. The source code for the factory-installed firmware is available from Netgear's community web site; it's amusingly packaged as a zip file containing a single, compressed tarball which, in turn, holds a bleeding-edge 2.4.20 kernel tree. But anybody wanting something a bit more contemporary and community-oriented can replace that firmware altogether with a package like Tomato or DD-WRT; indeed, Netgear almost seems to encourage its customers to do so. Every one of those customers then gets the benefit of the effort which has gone into the development of those router distributions - with little effort required on Netgear's part. Those customers can improve this platform and make their changes available to other customers; that makes Netgear's hardware more valuable. If there are bugs in the system, a single motivated customer can fix them and make those fixes available to everybody else. And all of this comes at almost no cost to Netgear. It is always fun to see Linux turn up in new places. 
It's now a routine experience to realize that one's new television, camcorder, music player, or automobile runs Linux. But locked-down, Linux-based devices are not far removed from the fully proprietary systems which preceded them. Whether or not one agrees that locking down systems in this way is legally or morally defensible, it's easy to conclude that it is undesirable. A Linux system which is cast in concrete loses a part of the vital energy which makes Linux what it is. So it is always a welcome development when a vendor decides to take a more open path. With any luck at all, the wider public will eventually realize that more open devices are more powerful devices, and, as a result, such devices will prove more successful. That is the path that brings us more control over our systems and, eventually, to World Domination. A look at openSUSE 11.0 openSUSE 11.0 was released about two weeks ago, to generally good reviews. TuxMachines ran some lighthearted tests last fall and again recently, comparing the latest Mandriva release with the latest openSUSE release. This time around openSUSE edged out Mandriva in a near tie. Other good reviews can be found on LinuxPlanet, DownloadSquad and many other places around the web. There are plenty of options for getting a hold of this release. You can buy a boxed set, an option that has all but disappeared from the Linux distribution scene. The box comes with complete end-user documentation, installable media for 32 Bit and 64 Bit systems, plus 90 days of end-user installation support. Most people will probably download the release in one form or another. Choose from the 32-bit, 64-bit or PowerPC platforms. Get a DVD, a Live CD or use a network install. The live CD comes in a GNOME or a KDE version. There's plenty of documentation online to go along with that: release notes, the openSUSE 11.0 startup document and the step-by-step installation guide. The KDE live CD only contains KDE 4.
If you would prefer KDE 3.5, it is available on the DVD or the network install. Benjamin Weber has a blog post on the inclusion of KDE4. "There should be a KDE3.5 installable livecd. This was not produced as there were insufficient resources to produce and test three installable livecds. Someone can always step up and help produce one." Xfce 4.4 is also available for those who want something lighter than either GNOME or KDE. Other applications available in this release include Firefox 3.0, OpenOffice.org 2.4, Banshee 1.0 and Wine 1.0. KIWI LTSP is the LTSP5 implementation on openSUSE. The previous openSUSE release added Giver, an easy GTK+ file-sharing tool. This release includes Kepas, a KDE application for file-sharing. Underneath all that you'll find Linux 2.6.25.4, AppArmor 2.3, Xen 3.2.1 RC1, Alsa 1.0.16, glibc 2.8 branch, binutils 2.18.50 SVN, cmake 2.6, gcc 4.3 branch, gdb 6.8, Perl 5.10, ConsoleKit 0.2.10, CUPS 1.3.7, D-Bus 1.2.1, NetworkManager 0.7 SVN, PackageKit 0.2.1, PolicyKit 0.7, PulseAudio 0.9.10, Samba 3.2pre2 and X.org 7.3. These and other highlights are listed here. Those familiar with openSUSE will notice that the installer and the package management have been overhauled for this release. Also, NetworkManager has been improved and should autodetect an EVDO card without any major problems. Of course it's impossible to squash all bugs, but the Most Annoying Bugs 11.0 list is quite short and most have workarounds. All in all, this looks like a great release for openSUSE. TASK_KILLABLE Like most versions of Unix, Linux has two fundamental ways in which a process can be put to sleep. A process which is placed in the TASK_INTERRUPTIBLE state will sleep until either (1) something explicitly wakes it up, or (2) a non-masked signal is received. The TASK_UNINTERRUPTIBLE state, instead, ignores signals; processes in that state will require an explicit wakeup before they can run again. There are advantages and disadvantages to each type of sleep.
Interruptible sleeps enable faster response to signals, but they make the programming harder. Kernel code which uses interruptible sleeps must always check to see whether it woke up as a result of a signal, and, if so, clean up whatever it was doing and return -EINTR back to user space. The user-space side, too, must realize that a system call was interrupted and respond accordingly; not all user-space programmers are known for their diligence in this regard. Making a sleep uninterruptible eliminates these problems, but at the cost of being, well, uninterruptible. If the expected wakeup event does not materialize, the process will wait forever and there is usually nothing that anybody can do about it short of rebooting the system. This is the source of the dreaded, unkillable process which is shown to be in the "D" state by ps. Given the highly obnoxious nature of unkillable processes, one would think that interruptible sleeps should be used whenever possible. The problem with that idea is that, in many cases, the introduction of interruptible sleeps is likely to lead to application bugs. As recently noted by Alan Cox: Unix tradition (and thus almost all applications) believe file store writes to be non signal interruptible. It would not be safe or practical to change that guarantee. So it would seem that we are stuck with the occasional blocked-and-immortal process forever. Or maybe not. A while back, Matthew Wilcox realized that many of these concerns about application bugs do not really apply if the application is about to be killed anyway. It does not matter if the developer thought about the possibility of an interrupted system call if said system call is doomed to never return to user space. So Matthew created a new sleeping state, called TASK_KILLABLE; it behaves like TASK_UNINTERRUPTIBLE with the exception that fatal signals will interrupt the sleep. 
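The difference between the three sleep states described above can be summarized in a small Python model. This is only a sketch of the wakeup rules as the text states them, not kernel code; the class and function names (SleepState, Event, ends_sleep) are invented for illustration, and the model assumes that a fatal signal always counts as a non-masked signal.

```python
from enum import Enum


class SleepState(Enum):
    INTERRUPTIBLE = "TASK_INTERRUPTIBLE"
    UNINTERRUPTIBLE = "TASK_UNINTERRUPTIBLE"
    KILLABLE = "TASK_KILLABLE"


class Event(Enum):
    EXPLICIT_WAKEUP = "explicit wakeup"
    ORDINARY_SIGNAL = "non-masked, non-fatal signal"
    FATAL_SIGNAL = "fatal signal"


def ends_sleep(state: SleepState, event: Event) -> bool:
    """Return True if the event ends a sleep in the given state."""
    if event is Event.EXPLICIT_WAKEUP:
        return True  # every state responds to an explicit wakeup
    if state is SleepState.INTERRUPTIBLE:
        return True  # any non-masked signal interrupts the sleep
    if state is SleepState.KILLABLE:
        return event is Event.FATAL_SIGNAL  # only fatal signals get through
    return False  # TASK_UNINTERRUPTIBLE ignores all signals


if __name__ == "__main__":
    # Print the full table of states against events.
    for state in SleepState:
        for event in Event:
            print(f"{state.value:22} {event.value:30} {ends_sleep(state, event)}")
```

The only row that differs between TASK_UNINTERRUPTIBLE and TASK_KILLABLE is the fatal-signal case, which is the whole point of the new state.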
With TASK_KILLABLE comes a new set of primitives for waiting for events and acquiring locks: For each of these functions, the return value will be zero for a normal, successful return, or a negative error code in case of a fatal signal. In the latter case, kernel code should clean up and return, enabling the process to be killed. The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that does not mean that the unkillable process problem has gone away. The number of places in the kernel (as of 2.6.26-rc8) which are actually using this new state is quite small - as in, one need not worry about running out of fingers while counting them. The NFS client code has been converted, which can only be a welcome development. But there are very few other uses of TASK_KILLABLE, and none at all in device drivers, which is often where processes get wedged. It can take time for a new API to enter widespread use in the kernel, especially when it supplements an existing functionality which works well enough most of the time. Additionally, the benefits of a mass conversion of existing code to killable sleeps are not entirely clear. But there are almost certainly places in the kernel which could be improved by this change, if users and developers could identify the spots where processes get hung. It also makes sense to use killable sleeps in new code unless there is some pressing reason to disallow interruptions altogether. The OLPC project releases 10GB of sound samples The One Laptop Per Child project recently released a large collection of sound samples: Loops, Grooves, Licks, Stings, Hits, Pads, Melodic Motives/Themes/Phrases, Sound-Effects, City and Country Soundscapes, Motors, Machines, Toys, Guns, Explosions, Swords, Armor, Cars, Jets, Pot & Pans, Acoustic and Synthetic Noises, Acoustic and Electronic Drums, Voices, Western and World Instruments, Real and Human Animals, Industrial and Natural Ambiences, Film and Game Foley, and more, more, more! 
This huge collection of new and original samples have been donated to Dr. Richard Boulanger @ cSounds.com specifically to support the OLPC developers, students, XO users, and computer and electronic musicians everywhere. They are FREE and are offered under a CC-BY license for downloading and use in your teaching, your demos, your research, your music, your remixes, your songs, your games, your videos, your slideshows, your websites, and your XO activities. The sample collection comes from a number of sources including the Open Path Music recording label, Zenph Studios (a musical software company), the Berklee College of Music, the Berklee Music Synthesis Alumni, Berklee Shares.com, the Worldwide Community of Csound Developers, Teachers and Users and Dr. Richard Boulanger. The sample collection is somewhat random in nature, though there are similarities in the material from the various sources, such as many single notes from common musical instruments. The recording quality tends to be decent, although some of the sound samples have audible hum, hiss, aliasing issues and rough beginnings or endings. All of the samples are recorded in mono and are available in several sample rates. The samples have also had their volumes normalized. An obvious improvement to the collection would involve compressing the samples with FLAC to save disk space. The majority of the samples have durations of a few seconds or less, but there are also a number of long selections from long ambient recordings or groupings of short sounds. The sound descriptions for the various collections are somewhat generic; the best way to get a good understanding of the entire library is to download a group of sub-collections and play through the various sounds. Having a few gigabytes of empty disk space is a good idea. Unleashing a random audio file player on the collection can be amusing, if somewhat annoying after a while.
Your editor listened to a random selection from the first seven sections from the Berklee College of Music Sampling Archive; the collection is quite diverse. One can imagine a number of possible uses for such a large library of sounds. Adding audio to games is an obvious use for the sounds. One could create accessibility applications for the visually impaired. In keeping with the OLPC theme, a teacher could sort through the sounds and use them for educating children about animals, musical instruments and other things that they may not experience in daily life. On the artistic side, the samples could be put to good use making audio tracks and movies. With the appropriate sample playing software, new and interesting musical instruments could be created. If your software project has a need for some open-licensed audio clips, the OLPC collection is a good source. Producing a large collection of sounds such as this would involve many hours of work. Ruby security flaws expose release process problems Some serious integer overflows in the Ruby language were recently discovered and fixed, but the process has left some in the community unhappy about how it was done. One of the biggest problems was that the official patched versions of the language broke its signature application: Rails. The overflows may lead to arbitrary code execution, which left some users in a quandary, trying to decide whether to close known holes in the language or to keep their web applications running. There still seems to be some question about whether the holes are exploitable or not, but one thing is abundantly clear: they were fixed in the public CVS several days before any kind of security announcement was made. It was made worse by referring to the CVE numbers in the commit message. For anyone looking for a possibly exploitable Ruby flaw - one that had yet to be publicly announced - that would be a glaringly obvious place to start.
When a release and announcement went out, some of the versions specified would cause Rails, the web application framework, to segfault. No new updates have been posted to the Ruby language web site leaving distributions and users to fill in the gap. Some frantic scrambling can be seen on a thread on the ruby-talk mailing list as folks with production Rails applications cast about for solutions. Part of the problem may stem from the number of separate language versions the Ruby team is trying to support. Three stable versions (1.8.5, 1.8.6, and 1.8.7) as well as one development version (1.9.0) are all affected by these vulnerabilities. Unfortunately, all four of the updated packages had one or more problems that either didn't fix all of the vulnerabilities or broke Rails. Those are still the versions suggested as a fix as of this writing. The new versions were based on the latest code in the CVS tree which evidently had not been tested completely. There are several test suites available for Ruby and Rails that would have caught these problems, but they apparently were not run. It is certainly important to get security fixes out quickly, but introducing other vulnerabilities and/or incompatibilities with existing code is a rather high price to pay. As is waiting ten (and counting...) days for a proper fix from upstream. For the most part, Linux distributions have resolved the problem for themselves by either backporting the fixes into the version they already support or by fixing the updated version provided. For example, Fedora 9 has done three separate releases to fully resolve the problem, the first to upgrade to the suggested upstream version (1.8.6p230), a second to resolve a segfault introduced somewhere between p114 and p230, and a third to handle the problem of Rails being broken. There is some indication that the Ruby team does not consider the flaws to be exploitable for code execution but, if so, they are still clearly denial-of-service vulnerabilities. 
The continued silence, at least on the official website, should also give one pause. The release process for Ruby seems to have fairly serious holes in it. This has caused some to issue a plea for a release process on the ruby-core mailing list. In addition, Dominique Brezinski claims that these bugs or some that were closely related were disclosed several years ago (see comment 43) and essentially ignored at that time. This is disconcerting for a language that is being increasingly used in web applications and other internet-facing services. One can only hope that this incident will serve as a wake-up call to the Ruby developers. Failing that, if additional incidents like this occur, it may instead serve as a wake-up call for those who depend on Ruby. Some development statistics for 2.6.26 - and beyond When 2.6.26-rc1 was released, your editor noted that, at a mere 7500 commits, it looked like 2.6.26 would be a smaller than usual development cycle. Interestingly, though, 2.6.26 has caught up. As of this writing (waiting for 2.6.26-rc9), this development cycle has incorporated 10,102 changesets for a net addition of 169,439 lines of code to the kernel. That makes it still significantly smaller than 2.6.25, but it is by no means small. The developer base remains as broad as ever: 1065 developers (representing some 150 companies) have contributed to 2.6.26; just over 1/3 of those developers contributed one single changeset. The 2.6 development model says that the bulk of the changes should be merged during the merge window (before the -rc1 release), with only fixes coming thereafter. Here's how things break down for recent releases: So, while the bulk of the big patches enter the kernel during the merge window, at least 25% of the total - and often more - come thereafter. That's a lot of fixes. So who were the most active developers this time around?
Here's the top 20: In terms of the number of changesets merged, Harvey Harrison got to the top of the list with a wide variety of janitorial fixes. Bartlomiej Zolnierkiewicz continues to put significant effort into cleaning up the IDE subsystem, even though most distributors have moved away from that code and are using the newer PATA layer instead. Glauber Costa has been tirelessly working in the x86 architecture code; in particular, he continues to work toward the goal of unifying the 32-bit and 64-bit code to the greatest extent possible. Adrian Bunk has made a career of cleaning up the code base and eliminating unneeded code. And Joe Perches dedicated much time to eliminating warnings from the checkpatch.pl script. There have been complaints from the developers that the volume of "cleanup" patches is reaching a point that it is drowning out the rest and interfering with "real work." We're seeing some of that volume here, with three of the top five changeset contributors doing cleanup work - some of which is seen to be more valuable than the rest. On the lines changed side, we see a mostly different set of developers. In this case, the top slots were earned by deleting code. Stephen Hemminger finally succeeded in getting rid of the old sk98lin driver. Adrian Bunk tore out the bcm43xx driver, the ieee80211 software MAC layer, the xircom_tulip_cb driver, and various other bits and pieces. David Miller removed a bunch of old SPARC code, but replaced it with various other facilities; he also took the PowerPC low-level memory manager and made it generic. Steven Toth works in the Video4Linux layer; he added some new drivers and a bunch of cleanups. Ben Hutchings added the Solarstorm SFC4000 driver. When one thinks about 2.6.26 features, the things that come to mind include KGDB, almost-ready network namespaces, almost-ready mesh networking support, a working (shall we say "almost ready"?)
realtime group scheduler, read-only bind mounts, page attribute table support, the object debugging infrastructure, and, of course, the vast pile of new drivers. One has to look hard to find the developers behind that work in the lists above (some of them are certainly there). Which just reinforces an important point: there is interest and information in counting changesets and lines changed, but the correlation between those numbers and serious accomplishments in kernel programming is weak at best. Unfortunately, "real work" is awfully hard to measure in any sort of automated way. So what the heck; we'll go back to the numbers we can measure. Here's the most active companies for 2.6.26: This list tends not to change too much from one release to the next; in particular, the top companies are always the same. If we look at who is attaching Signed-off-by tags to code they didn't write, we get a sense for who the gatekeepers to the kernel are. These are the developers and companies who are herding code into the mainline: Once again, these numbers tend not to change that much from one development cycle to the next. Subsystem maintainers do not change often. What's next? This is the first full development cycle where the linux-next tree was in operation. At this stage in the cycle, linux-next should look very much like 2.6.27 - or, at least, 2.6.27-rc1. Your editor pulled the July 2 linux-next tree and ran some statistics; this tree contains 6527 changesets from 619 developers. Just over 400,000 lines of code are touched, with a net addition of 38,000 lines. If linux-next is to be believed, the most active 2.6.27 developers will be: These numbers reflect a number of the larger developments which can be expected for 2.6.27: incredible amounts of KVM work, the merging of the UBIFS filesystem, the ftrace tracing framework, a lot of reworking of the TTY layer, a lot of firmware thrashing, and ongoing big kernel lock removal work. 
It will be most interesting to see how these numbers compare with what actually shows up in 2.6.27-rc1. Recent numbers suggest that quite a few patches will hit the mainline without having been in the linux-next tree - either that, or 2.6.27 will be a relatively small release. If nothing else, we will see which developers do not yet get their work into linux-next for integration testing ahead of the merge window. Mozilla plans for Firefox 3 and beyond The gift wrap is scarcely off Firefox 3 and the Mozilla community is already looking toward its next update. The first alpha release of Firefox 3.1, codenamed Shiretoko, may be released as early as this month, while its final release might see the light of day by year's end. Let's take a look at where this popular Internet browser is headed in the coming months, and what new features users can expect to see. Several features were nearly included for Firefox 3.0 but didn't make the cut because they weren't completely ready. New features expected to be in version 3.1 include a history and bookmark organizer with unified search and smart folder capabilities, and visual tab switching that shows thumbnail images of the web sites opened in each tab when moused over, both of which were abandoned in lieu of other, more critical features. According to an email sent to the mozilla.dev.planning mailing list, Mozilla's Vice President of Engineering, Mike Schroepfer, says there are other features expected to make it into version 3.1. For instance, native JSON DOM bindings (preferred by web developers over its JavaScript counterparts), an improved Awesomebar, support for cross-site XMLHttpRequest for the development of more powerful web applications, and better system integration are a few of the features Mozilla is anxious to get into the hands of users. 
Schroepfer says, "This, along with the overall quality of Gecko 1.9 as a basis for mobile and the desire to get new platform features out to web developers sooner has [led us] to want to do a second release of Firefox this year." In the event a feature isn't ready for version 3.1's targeted ship date, Schroepfer says rather than hold the release, it will simply be included in the next major release instead. In a recent blog post, Schroepfer says the new decision to aim for shorter, date-driven release cycles is in large part due to Mozilla's desire to "deliver releases of the quality and impact of Firefox 3 with much greater frequency." More frequent indeed; the gap between the release of Firefox 2.0 and 3.0 was almost two years. Not surprisingly, Firefox 4 is expected to usher in a whole host of changes, not the least of which is the introduction of Mozilla2, "an extensive update to the Mozilla platform to feature highlights like ActionMonkey, the merge of Mozilla's JavaScript engine (SpiderMonkey) and Tamarin, Adobe's JavaScript virtual machine open-sourced in late 2006." Details of the features expected to ship with Firefox 4 are sketchy, but the Vice President of Mozilla Labs, Chris Beard, has two projects currently under development that he'd like to see included: Weave and Prism. Weave is similar to the wildly popular browser synchronization add-on, Foxmarks. While Foxmarks only syncs an individual's bookmarks across machines, Weave's goal is to replicate a user's entire browsing experience — including bookmarks, favorites, passwords, and preferences — no matter where they access the Internet. Prism takes aim at Google Gears by making browser functionality available even while offline. Previously known as WebRunner, Prism is based on an idea called site specific browsers (SSB) and is already implemented in Fluid for Mac OS X, Adobe Air, and Microsoft Silverlight. 
Prism team member Matthew Gertner explains, "Rather than running programs in normal web browsers like Firefox or Safari, wedged in a tab between New York Times articles and TechCrunch posts, each app is given its own dedicated browser, which is customized to include many of the desktop features that users know and love." For a taste of what Prism can do within Firefox 3, download this extension. Of course, one of the biggest questions on the minds of many people these days is: what's up with the mobile version of Firefox? Although it looks like there's a ways to go before Mobile Firefox turns up on your Razr or BlackBerry, the rapid release cycle of Firefox will help push the project along. Schroepfer says, "There are already devices shipping with early versions of Gecko 1.9 at the core. More are coming soon and we'll be releasing milestones of full branded versions of Firefox (with XUL and the Firefox team taking a lead in the user experience) later this year. This lines up well with Firefox 3.1 and a synchronized release schedule will make everything run more smoothly." The development team is working on sorting through some of the basic differences among mobile devices such as a touch screen versus non-touch screen interface, virtual versus tactile keyboards, and so on. If you're interested in trying out the prototypes, they're available on the team's wiki page. Firefox 3 has been downloaded more than 8 million times since its release on June 17th, and more than 90% of users download the latest version of the browser within 7 days of its release. Clearly, Firefox has a large and growing user base, no doubt due in large part to Mozilla's willingness to offer new and useful features in a timely fashion. Notes on the Viacom ruling Google's purchase of YouTube always seemed questionable to some observers: it looked as if Google were buying itself a whole new source of copyright lawsuits. One of the benefits of that purchase came through on July 2, when a U.S. 
District Court ordered Google to hand over its complete set of YouTube traffic logs, containing information about every video viewed on the service. See Groklaw for the full text of the order. If this order stands (and it appears that Google will not appeal it), millions of users worldwide will have their viewing data handed over to a litigious entertainment industry company. There are a couple of important implications to draw from this turn of events, so LWN will venture a little far afield and take a look. The data involved includes, for each video viewed, the time, which video was involved, which YouTube user account was used, and the IP address the request came from. Viacom claimed that the privacy of YouTube users is not threatened by this release of data, and the court agreed. But account names can be correlated across sites, and IP addresses (especially time-correlated IP addresses) can easily identify exactly who was watching a particular video. Viacom promises it would never use this data to launch enforcement actions against individuals; the fact that the company feels the need to make that promise suggests that Viacom feels it could use this data to that end. One other interesting aspect of the ruling which has been commented upon less is this: Google has also been ordered to hand over every video which has been removed from the site. Once again, that is a great deal of data. It also drives home the point that, on a site like YouTube, nothing is really removed: all of those "removed" videos are still there, waiting for some company with enough lawyers to go after it. All of this data is to be handed over regardless of what jurisdiction the users thought they were in. Nobody's privacy or data retention laws apply here. This is a worldwide compromise of personal data. So lesson number one is obvious: attending to one's personal security requires being very careful about the data tracks that one leaves on other people's servers.
Regardless of any site's privacy policy or any country's data sharing laws, that data is there for the grabbing. The course of events which led to the compromise of vast amounts of video-viewing data can also lead to the disclosure of electronic mail, accounting data, online chat sessions, purchase histories, software downloads, or which edgy Second Life neighborhood one likes to hang out in. Indeed, records of video viewing activity are more strongly protected in the U.S. than many other types of data; other types of information may well prove easier to get. What we leave on remote machines seems to stay there indefinitely, and it's an open book for those with sufficient legal power on their side. The second lesson is for anybody running a publicly-available server, as many LWN readers do. The video activity database being grabbed by Viacom is said to be about 12 terabytes deep - before getting into the "removed" videos. It should not be surprising that a data stash of that size would attract this kind of action. If you gather together that much information on the behavior of many millions of people, somebody, somewhere, is going to try to get their hands on it. How could it possibly be any other way? Not enough people are asking this question: why does Google/YouTube hold that much data about its users? Why does it retain the ability to replay their actions years after the fact? And why do "removed" videos not go away? If that data did not exist in the first place, there would be no question of disclosing it to an attacking corporation. A company which keeps that amount of data around is prioritizing whatever commercial value it sees in that data over the privacy and security of its users.
And, by inviting raids from corporations (which we hear about) and governments (which we might not hear about), such companies are not helping their own security either. So there are strong arguments for simply not retaining all that data in the first place. Naturally, some governments are doing their best to force that kind of retention, but that's a different battle. In the absence of legal constraints, a standard policy mandating short data retention periods makes a lot of sense. It behooves all of us to think about what kind of data we leave lying around - either through our activities or by facilitating the activities of others - and to keep it to a minimum. The most secure data is data which does not exist. The current development kernel is...linux-next? One of the development process advantages brought by git (and by BitKeeper before it) is the ability to see the up-to-the-second, bleeding-edge status of Linus's tree. So any developer who wants to know where the front edge of development lies can grab that tree and make patches fit into it. But the value of the mainline repository for development would appear to be less than it once was. The mainline is no longer where the action is. Consider, for example, this response from Andrew Morton after finding that a patch posted to linux-kernel would not compile for him: I assume this patch was prepared against some ancient out-of-date kernel such as current Linus mainline. Guys, we have a new development tree now. He followed up with this statement: But what I am repeatedly seeing is people cheerfully raising 2.6.27 patches against the 2.6.26 tree when we have a nice 2.6.27 tree for developing against. Those days are over, guys. So the message would appear to be clear: development work should be done against the linux-next tree rather than against the mainline kernel. There are some clear advantages to having work done in this way. 
Patches developed against linux-next should merge cleanly during the next merge window. Developers will be testing each other's trees as they work, causing bugs to turn up earlier in the process. And, of course, Andrew won't have to complain about patches which fail to build for him - at least, not as often. Linux-next is a somewhat strange base on which to try to develop, though. It is built anew every day from over 100 subsystem trees, each of which can, itself, change from one day to the next. So linux-next is a moving target, just like the mainline is. But, unlike the mainline, linux-next has no consistent or coherent history. Every day's linux-next tree is a completely new creation with a unique - and transient - history. Consider a developer who bases some work on a mainline release - 2.6.26-rc9, say. That developer's work will be derived from a specific commit in the mainline tree, known as b7279469d66b55119784b8b9529c99c1955fe747 in this case. The history from 2.6.26-rc9 is well defined, and that series of patches can be merged into any other repository which also contains 2.6.26-rc9; the identity of that commit is consistent and immutable across all repositories. With such a development tree, it is (relatively) easy to track the mainline as it advances, and to merge one's work when the time comes. A git tree based on the mainline sits on a solid foundation. It is not possible to base a tree on linux-next in the same way. Development can begin at a specific commit, but tomorrow's linux-next tree may not contain that commit at all. The various component trees will have advanced independently of the previous day's linux-next tree, which can, in itself, complicate things. But the process of making all those trees come together can involve tasks like moving patches from one tree to another, or fixing intermediate patches which break things. That makes the end result better, but at the cost of rebasing those trees. 
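Why a rebuilt or rebased base invalidates work built on top of it can be seen in a toy model of commit identity. The sketch below is an assumption-laden simplification, not git's real object format: the hypothetical `commit_id()` helper just hashes the parent's ID together with a patch description, which is enough to show that a commit's identity depends on everything beneath it.

```python
import hashlib


def commit_id(parent_id: str, patch: str) -> str:
    """Toy commit identity: a hash over the parent's ID and the change itself."""
    return hashlib.sha1(f"{parent_id}\n{patch}".encode()).hexdigest()


def build_history(base_id, patches):
    """Apply a patch series on top of base_id and return the resulting IDs."""
    ids = []
    parent = base_id
    for patch in patches:
        parent = commit_id(parent, patch)
        ids.append(parent)
    return ids


# The 2.6.26-rc9 commit named in the text: a fixed, immutable base.
mainline = "b7279469d66b55119784b8b9529c99c1955fe747"
work = ["add driver", "fix locking"]  # hypothetical patch series

# Two developers starting from the same mainline commit get identical history.
on_mainline_a = build_history(mainline, work)
on_mainline_b = build_history(mainline, work)

# Each day's linux-next is a freshly created merge with a new identity.
next_day1 = commit_id(mainline, "linux-next merge, day 1")
next_day2 = commit_id(mainline, "linux-next merge, day 2")
on_day1 = build_history(next_day1, work)
on_day2 = build_history(next_day2, work)
```

In this model the same two patches produce matching IDs when based on the mainline commit, but entirely different IDs on each day's linux-next, so a series built on yesterday's tree has nothing to attach to in today's.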
Rebasing completely rewrites the development history, causing the old history to disappear from the tree. So a patch series based on the previous history loses its foundation. And, since linux-next is built from its components every day, a patch developed on top of linux-next may, when integrated into that tree, be merged somewhere in the middle of the sequence; in other words, the patch will be merged into a tree which differs considerably from the tree on which it was developed. As Stephen Rothwell, the maintainer of the linux-next tree, put it: One downsides of the way linux-next works is that, because it is recreated every day, you cannot really base anything on it that is to be merged into it. Another interesting aspect of linux-next development involves API changes. The longstanding rule in kernel development is that internal kernel interfaces can be changed if there is a good reason to do so, but that the person making the change is obligated to fix all in-tree code broken by that change. If an API change is introduced into linux-next, though, the developer is simply not able to fix any code which enters linux-next by way of the other subsystem trees. If the developer does get patches into those trees for the API change, they can no longer be built on top of kernels which lack that change - the mainline, for example. API changes have, in other words, become harder to do - a situation which some may see as a good thing. What all this means is that API changes must be handled through techniques like the creation of backward-compatibility layers; those layers can then be removed a development cycle or two later once the transition is complete. Or changes can be split up and added to individual subsystem trees; that, however, can lead to interesting ordering dependencies between the trees. In some cases, we are seeing 2.6.27 changes being merged into 2.6.26 in stub form as a way of making all of the pieces fit together. 
Then, there is the simple matter that developers like to have a stable base upon which to create their code. The linux-next tree, since it contains large amounts of relatively new code, will also contain its share of new bugs. That makes developers, who are often having enough trouble just tracking down their own bugs, somewhat grumpy. Development against the mainline tends to have a lower probability of forcing developers to look for bugs which are not of their own making. Many of these complaints have an easy answer: the pain which comes from making all the pieces fit together in linux-next must be faced at some point anyway. The real difference is that linux-next allows those problems to be dealt with at leisure, while the older "merge everything in the mainline" model compressed much of that work into the merge window. How beneficial that really is will be seen for the first time in the 2.6.27 merge window; if linux-next is serving its intended function, 2.6.27 should come together with rather less hassle than its immediate predecessors did. But, regardless of the value provided by linux-next for integration and testing purposes, the fact remains that it is a difficult platform upon which to develop patches. That process is somewhat like building a house on a sand bar; overnight the tide comes in and completely reshapes the land underneath you. That is why most (possibly all) of the subsystem trees used to assemble linux-next are, themselves, based on the mainline. The solution to that problem will have to evolve over time. The linux-next tree is a new institution which is still finding its proper place in the development process. Easier ways to develop patches against the linux-next tree will certainly be worked out; it may well turn out that quilt-like tools work better for this task than git. But, for now, linux-next is an excellent integration and testing resource, but it has not quite yet managed to become the true Linux kernel development tree. 
Enhanced printk() merged A change very late in the development cycle for 2.6.26 provides a framework for extending printk() to handle new kinds of arguments. Linus Torvalds just merged the change—after -rc9—presumably partially because he knew he could trust the author, but also because it should have no effect on the kernel. It will provide for better debugging output once code is changed to take advantage of it. The core idea is to extend printk() so that kernel data structures can be formatted in kernel-specific ways. In order to get some compile-time checking, the %p format specifier has been overloaded. For example, %pI might be used to indicate that the associated pointer is to be formatted as a struct inode, which could print the most interesting fields of that structure. GCC will be able to check for the presence of a pointer argument, but because it does not understand the I part, cannot enforce that it is a pointer of the right type. Extending printk() in this manner allowed Torvalds—who authored the patch—to add two new types to printk(): %pS for symbolic pointers and %pF for symbolic function pointers. In both cases, the code uses kallsyms to turn the pointer value into a symbol name. Instead of a kernel developer having to read long address strings and then trying to find them in the system map, the kernel will do that work for them. The %pF specifier is for architectures like ppc and ia64 that use function descriptors rather than pointers. For those architectures, a function pointer points to a structure that contains the actual function address. By using the %pF specifier, the proper dereferencing is done. As an example of how the augmented printk() could be used, Torvalds converted printk_address(). 
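The symbolic lookup behind %pS can be imitated in user space to show the idea. What follows is a rough Python model rather than the kernel's implementation: the symbol table, its addresses, and the `kprintf()`, `format_pointer()`, and `symbolize()` helpers are all hypothetical, the "name+offset" output format is an assumption, and the function-descriptor dereference done for %pF is not modelled.

```python
import bisect
import re

# Hypothetical symbol table: (start address, name), sorted by address.
SYMBOLS = [(0x1000, "schedule"), (0x1400, "printk"), (0x2000, "do_fork")]

# "%p" followed by any run of alphanumeric extension characters.
_SPEC = re.compile(r"%p([A-Za-z0-9]*)")


def symbolize(addr, symbols):
    """Map an address to 'name+offset' using the nearest preceding symbol."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return f"{name}+{addr - start:#x}"


def format_pointer(ext, addr, symbols):
    """Format one pointer argument according to its %p extension."""
    if ext in ("S", "F"):
        sym = symbolize(addr, symbols) if symbols else None
        if sym:
            return sym
    # Plain %p, an unknown extension, or no symbol table: print hexadecimal.
    return f"{addr:#x}"


def kprintf(fmt, *args, symbols=SYMBOLS):
    """Expand every %p specifier in fmt, consuming one argument per specifier."""
    it = iter(args)
    return _SPEC.sub(lambda m: format_pointer(m.group(1), next(it), symbols), fmt)
```

With this table, `kprintf("caller: %pS", 0x1410)` gives `caller: printk+0x10`, while passing an empty symbol table falls back to the raw hexadecimal address, mirroring the fallback used when kallsyms is not available.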
The CONFIG_KALLSYMS dependency and the kallsyms_lookup() were removed, essentially leaving a one-line function: If kallsyms is not present, the new printk() just reverts to printing the address in hexadecimal, which allows the special case handling to be done there. The clear intent is to allow additional extensions to printk() to support other kernel data structures. The change to vsprintf(), which underlies printk(), actually allows for any sequence of alphanumeric characters to appear after the %p. The new pointer() helper function currently only implements the two new specifiers, but others have been mentioned. The most likely additions are for things like IPv4, IPv6, and MAC addresses. Torvalds specifically mentions using %p6N as a possibility for IPv6 addresses. Some would rather have seen a different syntax; %p{feature} was suggested, but that would conflict with some current uses of %p in the kernel. Torvalds is happy with his choice: I _expressly_ chose '%p[alphanumeric]*' because it's basically totally insane to have that in a *real* printk() string: the end result would be totally unreadable. The patch took an interesting route to the kernel, with much of the discussion evidently going on in private between Torvalds, Andrew Morton, and others before popping up on the linuxppc-dev and linux-ia64 mailing lists. The patch itself has not been posted to linux-kernel in its complete form, but was committed on July 6. While it is a bit strange to see such a change this late in the development cycle, it is a change that should have no impact as there are no plans to actually use the new specifiers in 2.6.26. Multiqueue networking One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device.
The core networking code will call a driver's hard_start_xmit() function to let the driver know that a packet is ready for transmission; it is then the driver's job to feed that packet into the hardware's transmit queue. The result is a data structure which looks vaguely like this: "Vaguely" because the list of sk_buff structures (SKBs - the internal representation of packets) does not exist in this form within the kernel; instead, the driver maintains the queue in a way that the hardware can process it. This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. On the other hand, the queues for this kind of traffic may be relatively short; if a video packet doesn't get sent on its way quickly, the receiving end will lose interest and move on. So it might be better to just drop video packets which have been delayed for too long. On the other hand, the "background" level only gets transmitted if there is nothing else to do; it is well-suited to low-priority traffic like bittorrent or email from the boss. It would make sense to have a relatively long queue for background packets, though, to be able to take full advantage of a lull in higher-priority traffic. Within these devices, each class of service has its own transmit queue. This separation of traffic makes it easy for the hardware to choose which packet to transmit next. 
It also allows independent limits on the size of each queue; there is no point in filling the device's queue space with background traffic which is not going to be transmitted in any case. But the networking subsystem does not have any built-in support for multiqueue devices. This hardware has been driven using a number of creative techniques which have gotten the job done, but not in an optimal way. That may be about to change, though, with the advent of David Miller's multiqueue transmit patch series. The current code treats a network device as the fundamental unit which is managed by the outgoing packet scheduler. David's patches change that idea somewhat, since each transmit queue will need to be scheduled independently. So there is a new netdev_queue structure which encapsulates all of the information about a single transmit queue, and which is protected by its own lock. Multiqueue drivers then set up an array of these structures. So the new data structure can, with sufficient imagination, be seen as a network device pointing to an array of netdev_queue structures, each with its own list of outgoing packets. Once again, the actual lists of outgoing packets normally exist in the form of special data structures in device-accessible memory. Once the device has these queues set up for it, the various policies associated with each class of service can be implemented. Each queue is managed independently, so more voice packets can be queued even if some other queue (background, say) is overflowing. David would appear to have worked hard to avoid creating trouble for network driver developers. Drivers for single-queue devices need not be changed at all, and the addition of multiqueue support is relatively straightforward. The first step is to replace the alloc_etherdev() call with a call to its multiqueue counterpart, which takes a new queue_count parameter. That parameter describes the maximum number of transmit queues that the device might support. The actual number in use should be stored in the real_num_tx_queues field of the net_device structure.
Note that real_num_tx_queues can only be changed when the device is down. A multiqueue driver will get packets destined for any queue via the usual hard_start_xmit() function. To determine which queue to use, the driver should call a helper function provided for the purpose; its return value is an index into the array of transmit queues. One might well wonder how the networking core decides which queue to use in the first place. That is handled via a new net_device callback, select_queue(). The patch set includes an implementation of select_queue() which can be used with WME-capable devices. About the only other required change is for multiqueue drivers to inform the networking core about the status of specific queues. To that end, there is a new set of functions which operate on individual queues. A call to netdev_get_tx_queue() will turn a queue index into the struct netdev_queue pointer required by the other functions, which can be used to stop and start the queue in the usual manner. Should the driver need to operate on all of the queues at once, there is a set of helper functions for that as well. Naturally, there are a few other details to deal with, and the multiqueue interface is likely to evolve somewhat over time. At one point, David was hoping to have this feature ready for inclusion into 2.6.27, but that goal looks overly ambitious now. It does seem that much of the ground work will be merged in the next development cycle, though, meaning that full multiqueue support should be in good shape for merging in 2.6.28. What's coming in OpenSSH 5.1 OpenSSH is an important tool for remote connectivity: "OpenSSH is a FREE version of the SSH connectivity tools that technical users of the Internet rely on. Users of telnet, rlogin, and ftp may not realize that their password is transmitted across the Internet unencrypted, but it is. OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other attacks.
Additionally, OpenSSH provides secure tunneling capabilities and several authentication methods, and supports all SSH protocol versions." On July 6, 2008 a call for testing was issued for OpenSSH version 5.1: "OpenSSH 5.1 is almost ready for release, so we would appreciate testing on as many platforms and systems as possible. This release is one of the biggest in recent years, with two hackathons' worth of improvements and fixes for some of our most recalcitrant bugs." A large number of new features are being added to the OpenSSH suite of utilities. Some of the feature highlights include: Experimental SSH fingerprint visualization (see this paper [pdf]) will produce visual representations of host keys for quick key validation. The sshd daemon will get a new extended test mode with capabilities for dumping the configuration and testing match rules. A "df" command has been added to the sftp client for displaying server filesystem information. There will be a new mechanism for disabling further session requests between ssh and sshd. The ssh-keygen command will get a new -l option that will allow searching for a host in the known_hosts file. ssh and sshd will better support port forward destination hosts with multiple forward addresses. Some basic interoperability tests have been added for Twisted Conch. Configuration file changes: Classless Inter-Domain Routing (CIDR) address/masklen matching will be added to sshd_config "Match address" blocks and authorized_keys "from" restrictions. A new sshd_config AllowAgentForwarding option will control authentication agent forwarding. The sshd_config MaxSessions option will give finer grained control to the number of multiplexed sessions. sshd_config "Match group" blocks will get new support for group negation. sshd_config match blocks will now support the MaxAuthTries option. Performance improvements. Documentation improvements. Bug fixes. 
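Of the items above, the address/masklen matching is the easiest to show in a few lines. What follows is a minimal sketch of the idea, restricted to IPv4: an address matches a network such as 192.168.1.0/24 when the first masklen bits of the two agree. It is not taken from OpenSSH, whose implementation is not described here; the function names parse_ipv4() and cidr_match() are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse dotted-quad text such as "192.168.1.77" into a 32-bit value
 * in host byte order. Returns 0 on success, -1 on malformed input.
 */
static int parse_ipv4(const char *s, uint32_t *out)
{
    unsigned a, b, c, d;
    char tail;

    /* The trailing %c catches garbage after the fourth octet. */
    if (sscanf(s, "%3u.%3u.%3u.%3u%c", &a, &b, &c, &d, &tail) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
    return 0;
}

/*
 * Return 1 if addr lies inside net/masklen, 0 if it does not,
 * and -1 if any of the arguments cannot be interpreted.
 */
static int cidr_match(const char *addr, const char *net, unsigned masklen)
{
    uint32_t a, n, mask;

    assert(addr != NULL && net != NULL);
    if (masklen > 32 || parse_ipv4(addr, &a) != 0 ||
        parse_ipv4(net, &n) != 0)
        return -1;
    /* Shifting a 32-bit value by 32 is undefined, so /0 is special. */
    mask = masklen == 0 ? 0 : (uint32_t)(0xFFFFFFFFu << (32 - masklen));
    return (a & mask) == (n & mask);
}
```

A call such as cidr_match("192.168.1.77", "192.168.1.0", 24) reports a match, while the same address tested against 192.168.2.0/24 does not. A real implementation would also need to handle IPv6 addresses, which this sketch leaves out.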
For those who would like to experiment with the new features, a series of snapshot releases are available for download. Questions and answers with Stormy Peters Those who have followed the GNOME project over the last few years have seen the wishlist item for a "business manager" or "executive director" for the GNOME Foundation; the subject was especially likely to come up during Foundation board elections. This position has remained unfilled for some time, seemingly a result of uncertain funding and the difficulty of finding the right person. These problems would appear to be in the past now; on July 7, the GNOME Foundation announced that this position would be filled by Stormy Peters, formerly of OpenLogic. Stormy now has the challenge of helping an energetic and independent-minded development community build on its success and achieve its ambitious goals for the future. We asked her a few questions about how she thought that might go; here's what we got back. LWN: This is a new position, in that the GNOME Foundation has never had an executive director before. So people may be wondering what you'll actually be doing. How do you expect to be spending your time in this position? Actually, the GNOME Foundation has had an executive director before but not for the past few years. I will spend my time strengthening relationships with the existing sponsors, working on finding new industry partners and helping the Board of Directors and the community execute some of their great ideas for GNOME. The GNOME community's goal is to provide an easy to use, intuitive interface for Linux and Unix as well as a powerful development platform. A year from now, what do you hope your biggest accomplishments will be? The GNOME community has a tremendous amount of passion and a real dedication to making a development platform and a desktop that is easy to use. 
I think showing the world that, getting the word out and showing how it is changing the way people are able use their computers and mobile devices is key. So to answer your question, I'd like to see a stronger Foundation (more sponsors and members), increase the amount of great ideas that get executed, and make GNOME a household name. :) Next year, it seems reasonably likely that there will be a combined GNOME/KDE developers conference in Europe. What are your thoughts on the current state of cooperation with KDE, and how do you think it could be improved? I hope we have a combined GUADEC/Akademy next year. KDE and GNOME have been working more closely together during the past year or so and they have accomplished some good things like with dbus. I think anytime you get great developers together, good things happen. One high-profile GNOME goal was 10x10 - 10% of the desktop market by 2010. In mid-2008, it seems fairly clear that this goal will not be achieved. Do you think that the desktop remains a suitable target for free software, or should GNOME deemphasize the traditional desktop in favor of other goals? I do think that a free and open source desktop is still a great goal. While the number of free and open source desktops out there might be small, it is growing tremendously. Just look at the number of laptops that ship with GNU/Linux (from Dell, Asus and other) as well as the number of mobile devices that are based on free and open source software. Though the GNOME Foundation is not intended to control the technical direction of the project, it clearly cannot be without influence there. Are there technical directions you would like to see the development community take, directions which would help to convince manufacturers to incorporate GNOME technologies and contribute to GNOME development? I'll be working closely with the community and the board of advisors to figure out how I can best help with technical directions. 
One thing we'd like to see from our sponsors - through our board of advisors - is more information on what end-users would like to see in GNOME. In the past you have spoken about how introducing money into free software development can have a demotivating effect on developers. Do you fear that sort of problem as GNOME becomes more commercially successful? How would you hope to avoid that kind of difficulty? I don't think it's an issue in the short term as growing the GNOME Foundation doesn't directly correspond to hiring lots of developers. But that said, I think the key is maintaining the intrinsic motivations that make GNOME contributors such a passionate group of developers. Thanks to Stormy for being kind enough to answer our questions in the middle of what must have been a highly busy time at GUADEC in Istanbul. SELinux and Fedora Red Hat has undoubtedly done more to make SELinux usable than any other organization, but has it actually reached the point where it can be enabled by default for all desktops? The Fedora project clearly thinks so. Not only is SELinux enabled, but the installer no longer has an option to disable it or to put it into "permissive" mode. Most of the posts in a thread on the fedora-devel mailing list see that as the right choice, but some are not so sure. Jon Masters started things off by making a request to restore the installation option, giving several reasons summing up with: But there are numerous other justifications I could give, including my personal belief that it's absolutely nuts to thrust SE Linux upon unsuspecting Desktop users (who don't know what it is anyway) without giving them the choice to turn it off. His reasons were unconvincing to many as he was not considered to be a "normal" desktop user; the things he was doing were much more technical than the users that are being targeted by the SELinux policies distributed with Fedora 9. 
The problems he reported were resolved quickly, but the fact remains that there are paths through Fedora, even just using desktop applications, that will result in SELinux-caused failures. The Red Hat SELinux team is very responsive, but users will get frustrated quickly if things they are trying to do fail in mysterious (to them) ways. Alan Cox argues against providing an installation choice because he doesn't think users have enough context to make a sensible choice. He likens it to a car with multiple choices for safety features: "This car has brakes, enable them ?" "Would you like the seatbelts to work ?" "Shall I enable the airbag ?" When push comes to shove, Masters and a few others see the default of SELinux installed in "enforcing" mode as being too restrictive. It is likely to cause users to become annoyed with Fedora as a whole because one or more paths through the applications have not yet been tested. That, unfortunately, is the crux of the issue: SELinux policies are being developed in a reactive manner based on testing applications and adding exceptions for actions they perform. As a security tool, SELinux is a good choice, because it essentially denies everything by default. Policies are added that will allow certain actions for users and applications. Its complexity is legendary, however, which is why Red Hat (and others) have made a substantial effort to make it work semi-invisibly. They started by generating policies for network-facing services and have now moved into securing desktop applications, particularly programs like web browsers which are increasingly the target of attacks. SELinux has three modes: disabled, which turns off SELinux; permissive, which just logs attempts to do things that violate the policies; and enforcing, which disallows any access that is denied by the policies. When getting applications to work with SELinux, permissive mode is typically used.
The log messages are analyzed to determine what changes should be made to the policies or to the application so that they work together. If there are features that were not tested in the application that require additional privileges, the first user that tries that feature in enforcing mode will run into trouble. When that happens, SELinux can be put into permissive mode with a simple GUI or configuration file change, followed by a reboot. One of the problems is that users may very well not know that SELinux is the source of their problem. There are tools, like SETroubleShoot, that can help alert users, but it is still a frustrating, hard to comprehend problem at times. Once the user has "fixed" the problem by disabling SELinux, they are unlikely to turn it back on. It is a difficult choice, but Fedora is firmly on the side of forcing non-technical users into using SELinux, at least until it breaks. More technical users will know about SELinux and, perhaps, be able to make more informed choices. One of Red Hat's SELinux developers, James Morris, neatly sums up the reasons it is important to continue pushing SELinux: The only way to really make progress in improving security is to make it a standard part of the computing landscape; for it to be ubiquitous and generalized, which is the aim of the SELinux project. [...] Punting the decision to the end user during installation is possibly the worst option. It's our responsibility as the developers of the OS to both get security right and make it usable. It's difficult, indeed, but not impossible. There are efforts underway to add easier ways for users to report SELinux log messages, perhaps even in an automated way, so that policy or application problems get identified and fixed more quickly. While it may not be easy for long-time Linux users to adjust to an SELinux-enabled system, it is getting to the point where average users, who never use the command line, rarely run into problems. 
And those are just the kind of users who need the level of security that SELinux can provide. Fedora takes Linux to college The idea of Linux in the classroom is nothing new. From a grassroots push for district-wide adoption in secondary schools, to a plan to offer the One Laptop Per Child program in every developing country, the FOSS community is always looking for ways to encourage schools to use Linux. Recently, however, there's a new movement afoot that's aimed at snapping up a segment of computer users before they spend their money on computers with commercial operating systems. Linux is headed to college. For the last few weeks, volunteer members of Fedora's marketing team have been kicking around ideas on ways to encourage college students to give Linux a try and draw new users into the Fedora fold. Rather than approach university IT departments running Windows to convince them to switch operating systems, the team hopes to create a groundswell of college-aged users who will march into classrooms and lecture halls with Fedora-laden laptops and eagerly dive into work-study projects that focus on Linux development. Jack Aboutboul, Red Hat's Community Engineer and the main impetus behind the tentatively-named Campus Ambassador program, says though it is similar to Fedora's existing Ambassador program, the new program will have a different governance model and slightly different goals. Students from Auburn, Texas A&M, Berkeley, and other U.S. colleges, as well as team members attending universities in other countries, have already shown an interest in assuming the role of Campus Ambassadors, and have agreed to speak at campus events about the benefits of Fedora and of Linux in general. Taking the idea a step further, many Fedora team members would like to see the development of promotional material designed with college students in mind, such as posters that encourage students in the art department to volunteer their skills creating artwork for Linux distros. 
As one Fedora marketing team member notes, "How many marketing majors are aware that there are real life marketing opportunities for them within the Fedora project while they are still students? Reaching these students should be one focus of any campus outreach." At least one school, Cabrillo College in Santa Cruz County, California, is already hard at work promoting Fedora on its campus. In addition to forming a GNU/Linux Users Group (LUG) and holding regular installfests, the LUG is also creating its own Fedora-based distro called Seahawk GNU/Linux, named after the school mascot. LUG President Larry Cafiero explains, "Not that the world needs yet another distro, mind you, but we're using the project as a teaching tool more than an actual distro that will take the world by storm." He says that not only do students gain hands-on familiarity with Linux, but "those who get introduced to GNU/Linux through the school-based distro get a sort of introduction to Fedora as well." Since Fedora already has a strong Ambassador Program, the question of why a separate university Ambassadorship is necessary has come up. Essentially, it boils down to a difference in how users will be mentored. In the typical Ambassador arrangement, Fedora users simply evangelize Linux and encourage people to give Fedora a try while offering assistance and tips along the way. Marketing team member Chris Tyler sees the role of Campus Ambassador as more finely-tuned and as "a matchmaker between a student, a potential need (project), and community resources." Tyler says that there are many benefits to this arrangement, including the opportunity for students to work on projects with a larger user base, which will therefore have a bigger real-world impact than student projects that remain inside the walls of the school. Team member Jeff Spaleta says finding projects with a long shelf life is vital to keeping students interested in Linux, and good for the long-term health of the community.
"If students as part of their degrees need to work on a year or semester-long project, I want Fedora to be obvious place to look for compelling things to work on, with an aim towards well scoped projects that have a good chance for long lived utility," he says. "I hate seeing good academic projects die because there was no real plan to hand them off outside of that academic group which incubated them." Team members seem to be in agreement that the Ambassador program is a winning situation for everyone. Students get hands-on experience — and, in some cases, a grade — for participating in a software development project. Computer technology departments can offer a wider learning environment with little to no investment, Fedora may garner new users, and the Linux community as a whole grows. In an effort to move the Campus Ambassador project forward, Jack Aboutboul plans to formally present the idea at a Community Architecture meeting later this month. Secrecy and the DNS flaw By now, most folks will have seen reports of the design flaw discovered in DNS as it has seen fairly widespread coverage, even in the non-technical press. It is rare to see such a coordinated disclosure and security update amongst that many of the big players in the computer industry. While fixes abound, the actual problem has yet to be disclosed, which has both positives and negatives. Responsible disclosure policies dictate that vulnerabilities be kept secret until all affected vendors can create an update. Because this flaw is in the design of DNS, most implementations were affected. This still doesn't quite explain the roughly six months between the discovery of the problem and the release of the fix. Evidently it took a meeting of the minds at the Microsoft campus in March to decide upon the right course of action. Once the fixes were done, presumably they were released on the next "patch Tuesday"—Microsoft's monthly security update day. 
Normally, once fixes are available, information about the vulnerability is released. But, for a number of reasons, that has not happened in this case. One of the main reasons is that DNS is an essential internet service and it will take time for affected users to patch their systems. In addition, there have been no reports of this flaw being exploited "in the wild", reducing the pressure to divulge it. Security researcher Dan Kaminsky discovered the flaw and he has yet another, "blatantly selfish" reason for keeping it quiet as he would like to be able to announce it at Black Hat in Las Vegas in early August: While I'm out there, trying to get all these bugs scrubbed — old and new — please, keep the speculation off the @public forums and IRC channels. We're a curious lot, and we want to know how things break. But the public needs at least a chance to deploy this fix, and from a blatantly selfish perspective, I'd kind of like my thunder not to be completely stolen in Vegas. None of these seem like horrible reasons to keep the vulnerability quiet for a time (roughly 30 days), but they do leave some DNS implementations and worried administrators without the information they need to evaluate the situation. Administrators do not know what traffic patterns or other symptoms to look for to determine if exploits are being attempted. Smaller, less prominent DNS implementations were not included in the collaboration, thus they don't have enough information to decide whether they are vulnerable or not. A perfect example is Dnsmasq, a lightweight DNS server for smaller networks. Dnsmasq is often used in embedded Linux distributions targeted for home wireless routers. Simon Kelley, Dnsmasq developer, was asked about the vulnerability; his response speaks volumes: I wasn't contacted in advance about this, and no patch for dnsmasq has been released. Since the exact nature of the new vulnerability has not (as far as I know) been announced, I don't know if dnsmasq is vulnerable. 
Kelley has since released a patched version, but it is still unknown whether it is needed or, really, if it even fixes the problem. It is difficult to know for sure that a security hole has been closed if information about the hole is not available. This points to the problems that can come from withholding vulnerability information. Based on the patches and some information from Kaminsky and others, it is clear that this is a cache poisoning vulnerability. Since source port randomization is the change that was applied to alleviate, but not eliminate, the flaw, we can surmise that Kaminsky found a way to reduce the number of spoofed replies that need to be sent to something tractable. According to the Internet Systems Consortium, developers of the BIND DNS server, the only true solution is DNSSEC, which implies that the current fixes only make cache poisoning less likely, not impossible. Source port randomization is a technique that has been advocated by Daniel J. Bernstein (djb) for many years. He implemented it in his djbdns name server long ago. Essentially, it chooses a random source UDP port for each query that the name server makes, which has the effect of increasing the randomness that an attacker needs to be able to predict before being able to poison the cache. While the market share of Dnsmasq may be minuscule, there are certainly other DNS implementations that are also concerned. In addition, we are relying on those who are "in the know" to be on the lookout for suspicious traffic that might indicate the vulnerability being exploited. Kaminsky is certainly under no obligation to reveal anything, but one wonders if the safest course would have been for him to provide details now, even at the expense of his "thunder". What Red Hat and Firestar agreed to On July 15, Red Hat and Firestar released the terms of the settlement [PDF] of their patent suit. When we last looked at this settlement, those terms were not available.
Now we can examine exactly what was agreed to and assess the degree of protection that Red Hat actually negotiated for the wider community. It may be tempting to say that recent events have reduced the relevance of this settlement, but that would be a mistake; what Red Hat has done here still matters. Those recent events, of course, are dominated by Sun's announcement that it had successfully challenged the Firestar patent; the US Patent and Trademark Office (PTO) has officially rejected all of Firestar's claims. As your editor (along with numerous others) has said, this should not have been a particularly hard thing to do; the weakness of this particular patent was evident after even a cursory reading. So one might well wonder why Red Hat chose to pay the troll in this particular case. And, incidentally, Red Hat did pay. Naturally enough, the specific payment terms have been removed from the agreement, but a payment was a part of the deal. It is nice that Sun took a less compromising approach to this case, even though it was not named as a defendant. But Sun's success has not rendered this settlement moot, for a few reasons. To begin with, Firestar now has two months to fight the PTO decision and reinstate its patent. That looks like a difficult task, but, with the PTO, one never really knows. Second, the settlement does not cover just that one patent; it covers just about any patent that Firestar owns or will acquire in the next five years - though some of that coverage goes away in 2013. And, perhaps most importantly, Red Hat clearly sees this settlement as a template for the resolution of other patent suits which are certain to come in the future. The settlement itself reads somewhat like a Pascal program; one must start toward the bottom and read it in reverse.
Following that analogy, the main program can be found in section 5.2: Licensor grants and promises to grant to Red Hat Community Members a perpetual, fully paid-up, royalty-free, irrevocable worldwide license of the Licensed Patents to engage in any and all activities related to Red Hat licensed Products, including without limitation to make, have made, use, have used, sell, have sold, offer for sale, have offered for sale, provide or have provided, distribute or have distributed, import or have imported and Red Hat Licensed Product and services related to any Red Hat Licensed Product. So, these patents have been licensed for any practical purpose to anybody who happens to be a Red Hat Community Member, as long as they are working with Red Hat Licensed Software. Well, almost any purpose; there is a small catch, as will be seen shortly. First, though, it is time to read the declarations toward the top of the settlement to see what those terms really mean. Who, exactly, is a Red Hat Community Member? ...any Entity that is a licensee or licensor of, contributes to, develops, authors, provides, distributes, receives, makes, uses, sells, offers for sale, or imports, in whole or in part, directly or indirectly, any Red Hat Licensed Product, including without limitation any upstream contributor to, or downstream user or distributor of, a Red Hat Licensed Product. This definition is clearly quite comprehensive; anybody who makes use of the software is considered to be a Red Hat Community Member. Your editor is pondering offering for sale a line of "Proud Red Hat Community Member" T-Shirts at the next Debconf or OpenBSD hackfest. This is a club that we all get to join. The other key term, though, is "Red Hat Licensed Product," because only such products are covered by the settlement. The definition of this product is simple: "Red Hat Licensed Product" means any Red Hat Product, Red Hat Derivative Product, or Red Hat Combination Product. 
Now, perhaps, we have moved away from Pascal programming and are stuck with the unenviable task of making sense of a convoluted Java class hierarchy. One of the subclasses, the definition of "Red Hat Product," is crucial: ...(a) any product, process, service, or code developed by, licensed by, authored by, distributed under a Red Hat Brand by, made by, sold under a Red Hat Brand by, offered for sale under a Red Hat Brand by, sponsored by, or maintained by Red Hat, (b) any predecessor version of any of the foregoing, including without limitation any upstream predecessor version any of the foregoing... So essentially, a Red Hat Product is anything developed or shipped by Red Hat under one of its trade names. So anything in Red Hat Enterprise Linux qualifies. The important thing that Red Hat didn't see fit to specify in its early PR is that anything in Fedora - also being software distributed under a Red Hat Brand - qualifies too. Since Fedora packages rather more software than RHEL does, that broadens the coverage of this agreement considerably. Also important is the "any predecessor version" clause. Coverage under this agreement does not apply to just the specific, possibly patched version of a program shipped by Red Hat; anything which came before in that package's upstream is also part of the deal. And, incidentally, this coverage does not go away if Red Hat stops shipping a package; just one shipped version will do. The Red Hat Brand has become the magic touch which confers protection against Firestar patents onto any software it touches. Thus far, we have coverage for Red Hat's packages and their predecessors upstream. What happens, though, if the upstream project continues to develop the software beyond the version shipped by Red Hat? That's where the "Red Hat Derivative Product" category comes in: "Red Hat Derivative Product" means any product, process, service, or code that is a direct or indirect Derivative of at least one Red Hat Product. 
So the combination of "any predecessor version" and the definition of a Derivative Product means that the entire project is covered, from its first version through anything it will do in the future - though, once again, there's a catch. But, before we get to that, there is the third subclass: "Red Hat Combination Product." It refers to a grouping of something which is one of the two product types described above and something unrelated - an aggregation. The apparent intent is to cover situations like dynamic linking: an application which links to a covered library will, itself, be covered. These definitions, too, appear to be quite broad. Just about anything which has been shipped by Red Hat, or which has even shared the same disk drive as something shipped by Red Hat, qualifies. But, as has been mentioned before, there is one catch in the form of an excluded class of software: a Red Hat Derivative Product that infringes the particular Licensed Patent at issue without use of or reference to any portion or functionality in or from a Red Hat Product on which the Red Hat Derivative Product is based. (There is similar language for Combination Products as well.) What this section is saying is that, if a derived product contains infringing code, that infringing code must have been part of the covered Red Hat product as well. In other words, outsiders cannot bless their particular patent infringement by grabbing enough code from some other project to create a derived product. One can see why this restriction was seen to be necessary; without it, any software (free or proprietary) could have easily been brought under the coverage umbrella. Instead, one must first convince Red Hat to distribute that software at least once. Plenty of other legalese can be found in the agreement, of course; interested readers are encouraged to read the whole thing. But the core of it is what's described above.
Notably absent (unless it has been redacted from the payment section, which seems unlikely) is any discussion of what happens if the patent is held to be invalid. So, even if Sun is ultimately successful in its challenge (as seems likely), Red Hat will not be getting its money back under the terms of this agreement. Red Hat's initial press release claimed that this settlement demonstrated the company's commitment to standing up for the community in the face of patent trolls, and stated that it would discourage any future such cases. At this point it seems fairly evident that Sun has made a better show of standing up for the community and discouraging future cases. What Red Hat has done, though, is to show us how future patent problems could be resolved in the absence of obvious prior art. If one must pay the troll, one would do well to come out with an agreement like this one and, at least, keep the troll away from the rest of the community. Whether patent holders who actually have a legal leg to stand on will be willing to agree to such a settlement remains to be seen; the nature of the game is such that, unfortunately, we are likely to get an answer to that question sooner or later. 2.6.27: what's coming (part 1) Linus wasted no time after the 2.6.26 release; he opened the 2.6.27 merge window less than 24 hours later. As of this writing, the process has barely begun with a mere 3000 changesets merged. So we do not have a complete picture of what will be in the next kernel release. But we can look at what has been merged so far. 
User-visible changes include: New drivers for CompuLab EM-x270 audio devices (as found on the Toshiba e800 PDA), Philips UDA1380 codecs, Wolfson Micro WM8510 and WM8990 codecs, Atmel AT32 audio devices, AK4535 codecs, SGI HAL2 audio devices (as found in Indy and Indigo2 workstations), SGI O2 audio boards, crypto engines found in Intel IXP4xx processors, Freescale Security Engine processors, AMD I/O memory management units, Marvell Loki (88RC8480), Kirkwood (88F6000), and Discovery Duo (MV78xx0) system-on-chip processors, IBM Power Virtual Fibre Channel Adapters, and GEFanuc C2K cPCI single-board computers. The old "ppc" architecture has been removed; all platforms are now supported by the integrated "powerpc" architecture code. The SCSI command filter - which controls which SCSI commands can be sent to a device by which kind of user - is now per-device and can be changed via sysfs. The block subsystem now has support for hardware which can perform data integrity checking; this will allow some kinds of errors to be caught before the associated data is lost forever. See this article for more information on the block-layer integrity feature. The "dummy" Linux security module has been removed; the default module is now the capabilities module. The crypto code has gained support for the RIPEMD-128, RIPEMD-160, RIPEMD-256, and RIPEMD-320 hash algorithms. Asynchronous hashing is now supported and is implemented by the "cryptd" software crypto daemon. Xen now has support for the saving and restoring of virtual machines - possibly migrating them to different hosts in between. The new virtual file /sys/firmware/memmap shows the memory map as it was configured by the system BIOS before the kernel booted. The ftrace lightweight tracing framework has been merged. See Documentation/ftrace.txt for more information on ftrace. The mmiotrace tool has been merged. 
Mmiotrace will capture and print out memory-mapped I/O accesses, making it a useful tool for the reverse-engineering of binary drivers. The ARM and powerpc architectures now support the latencytop tool. The RDMA code has acquired support for the InfiniBand "base memory management extension" operations. The IP-over-InfiniBand code can now perform large receive offload (LRO). Delayed allocation support has been added to the ext4 filesystem, which is getting quite close to its target feature set. The SATA layer now has enclosure management support; this allows the system to do things like blink an LED to indicate a specific drive in a large enclosure. The SGI IRIX binary compatibility layer has been removed. Changes visible to kernel developers include: The register_security() function has been removed. Security modules which wish to implement stacking must now do so explicitly. The request_queue_t type is gone at last; block drivers should use struct request_queue instead. Quite a bit of big kernel lock removal work has been merged. For char devices, the open() method from struct file_operations is no longer protected by the BKL. Calls to fasync() have also lost BKL protection. Many drivers have been converted to use the firmware loader, making it possible to strip the firmware from the kernel for those who are inclined to do so. See this article for more information on the firmware work. The API work in the i2c layer continues; there is now an autodetection capability which allows new-style drivers to detect devices on their buses automatically. The SCSI layer has gained new support for "device handlers," which are mostly concerned with multipath management. Some of this code has been moved over from the device mapper. Come back next week for the next episode in the "what's coming in 2.6.27" series. Block layer: integrity checking and lots of partitions One likes to think of disk drives as being a reliable store of data. 
As long as nothing goes so wrong as to let the smoke out of the device, blocks written to the disk really should come back with the same bits set in the same places. The reality of the situation is a bit less encouraging, especially when one is dealing with the sort of hardware which is available at the local computer store. Stories of blocks which have been corrupted, or which have been written to a location other than the one which was intended, are common. For this reason, there is steady interest in filesystems which use checksums on data stored to block devices. Rather than take the device's word that it successfully stored and retrieved a block, the filesystem can compare checksums and be sure. A certain amount of checksumming is also done by paranoid applications in user space. The checksums used by BitKeeper are said to have caught a number of corruption problems; successor tools like git have checksums wired deeply into their data structures. If a disk drive corrupts a git repository, users will know about it sooner rather than later. Checksums are a useful tool, but they have one minor problem: checksum failures tend to come when they are too late to be useful. By the time a filesystem or application notices that a disk block isn't quite what it once was, the original data may be long-gone and unrecoverable. But disk block corruption often happens in the process of getting the data to the disk; it would sure be nice if the disk itself could use a checksum to ensure that (1) the data got to the disk intact, and (2) the disk itself hasn't mangled it. To that end, a few standards groups have put together schemes for the incorporation of data integrity checking into the hardware itself. These mechanisms generally take the form of an additional eight-byte checksum attached to each 512-byte block. 
The host system generates the checksum when it prepares a block for writing to the drive; that checksum will follow the data through the series of host controllers, RAID controllers, network fabrics, etc., with the hardware verifying the checksum along each step of the way. The checksum is stored with the data, and, when the data is read in the future, the checksum travels back with it, once again being verified at each step. The end result should be that data corruption problems are caught immediately, and in a way which identifies which component of the system is at fault. Needless to say, this integrity mechanism requires operating system support. As of the 2.6.27 kernel, Linux will have such support, at least for SCSI and SATA drives, thanks to Martin Petersen. The well-written documentation file included with the data integrity patches envisions three places where checksum generation and verification can be performed: in the block layer, in the filesystem, and in user space. Truly end-to-end protection seems to need user-space verification, but, for now, the emphasis is on doing this work in the block layer or filesystem - though, as of this writing, no integrity-aware filesystems exist in the mainline repository. Drivers for block devices which can manage integrity data need to register some information with the block layer. This is done by filling in a blk_integrity structure and passing it to blk_integrity_register(). See the document for the full details; in short, this structure contains two function pointers. generate_fn() generates a checksum for a block of data, and verify_fn() will verify a checksum. There are also functions for attaching a tag to a block - a feature supported by some drives. The data stored in the tag can be used by filesystem-level code to, for example, ensure that the block is really part of the file it is supposed to belong to. 
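As a rough illustration of the per-block scheme described above, the following Python sketch attaches an eight-byte integrity tuple to each 512-byte block. The tuple layout used here (a CRC-32 of the data followed by the low 32 bits of the target sector number) and the names generate_tuple() and verify_tuple() are assumptions made for illustration only. They play the roles that generate_fn() and verify_fn() play in the blk_integrity structure, but they are not the kernel's signatures, and real hardware formats differ in detail.

```python
import struct
import zlib

SECTOR_SIZE = 512   # bytes of data per block, as in the article
TUPLE_SIZE = 8      # bytes of integrity metadata per block


def generate_tuple(data: bytes, sector: int) -> bytes:
    """Build a hypothetical 8-byte integrity tuple for one block.

    The layout (CRC-32 of the data, then the low 32 bits of the sector
    number) is an illustrative assumption, not a real on-wire format.
    """
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte block")
    checksum = zlib.crc32(data) & 0xFFFFFFFF
    return struct.pack(">II", checksum, sector & 0xFFFFFFFF)


def verify_tuple(data: bytes, sector: int, integrity: bytes) -> bool:
    """Return True if the tuple matches both the data and its location."""
    if len(integrity) != TUPLE_SIZE:
        return False
    return generate_tuple(data, sector) == integrity


if __name__ == "__main__":
    block = bytes(range(256)) * 2
    tag = generate_tuple(block, sector=7)
    print(verify_tuple(block, 7, tag))                # intact block
    print(verify_tuple(b"\xff" + block[1:], 7, tag))  # corrupted data
    print(verify_tuple(block, 8, tag))                # misdirected write
```

Folding the sector number into the tuple is what lets a check of this kind catch a block written to the wrong location, not just one whose contents were damaged in transit.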
The block layer will, in the absence of an integrity-aware filesystem, prepare and verify checksum data itself. To that end, the bio structure has been extended with a new bi_integrity field, pointing to a bio_vec structure describing the checksum information and some additional housekeeping. Happily, the integrity standards were written to allow the checksum information to be stored separately from the actual data; the alternative would have been to modify the entire Linux memory management system to accommodate that information. The bi_integrity area is where that information goes; scatter/gather DMA operations are used to transfer the checksum and data to and from the drive together. Integrity-aware filesystems, when they exist, will be able to take over the generation and verification of checksum data from the block layer. A call to bio_integrity_prep() will prepare a given bio structure for integrity verification; it's then up to the filesystem to generate the checksum (for writes) or check it (for reads). There's also a set of functions for managing the tag data; again, see the document for the details. Extended partitions One of the more long-lived annoyances in the Linux block layer has been the limit on the number of partitions which can be created on any one device. IDE devices can handle up to 64 partitions, which is usually enough, but SCSI devices can only manage 16 - including one reserved for the full device. As these devices get larger, and as applications which benefit from filesystem isolation (virtualization, for example) become more popular, this limit only becomes more irksome. The interesting thing is that the work needed to circumvent this problem was done some years ago when device numbers were extended to 32 bits. Some complicated schemes were proposed back in 2004 as a way of extending the number of partitions while not changing any existing device numbers, but that approach was never adopted.
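The partition limits above fall out of simple minor-number arithmetic, which can be sketched as follows. This is an illustration, not kernel code: the helper names are hypothetical, and the 12-bit major / 20-bit minor split is the kernel-internal layout as your editor understands it rather than something stated above.

```python
MINORBITS = 20                      # assumed kernel-internal split: 12-bit major, 20-bit minor
MINORMASK = (1 << MINORBITS) - 1


def mkdev(major: int, minor: int) -> int:
    """Pack a major/minor pair into a single 32-bit device number."""
    return (major << MINORBITS) | minor


def dev_major(dev: int) -> int:
    return dev >> MINORBITS


def dev_minor(dev: int) -> int:
    return dev & MINORMASK


def static_minor(disk_index: int, partition: int, minors_per_disk: int = 16) -> int:
    """Traditional static numbering: each disk owns a fixed block of minors.

    Partition 0 stands for the whole device, so a disk with 16 minors has
    room for only 15 real partitions.
    """
    if not 0 <= partition < minors_per_disk:
        raise ValueError("partition number does not fit in this disk's minor range")
    return disk_index * minors_per_disk + partition


if __name__ == "__main__":
    # Second SCSI disk, first partition, under the static scheme.
    dev = mkdev(8, static_minor(1, 1))
    print(dev_major(dev), dev_minor(dev))
```

With a fixed block of minors per disk, a sixteenth partition has nowhere to go; the 32-bit device number itself has plenty of room, which is why the limit is a matter of numbering policy rather than of space.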
In the meantime, increasing use of tools like udev has pretty much eliminated the need for device number compatibility; on most distributions, there are no persistent device files anymore. So when Tejun Heo revisited the partition limit problem, he didn't bother with obscure bit-shuffling schemes. Instead, with his patch set, block devices simply move to a new major device number and have all minor numbers dynamically assigned. That means that no block device has a stable (across boots) number; it also means that the minor numbers for partitions on the same device are not necessarily grouped together. But, since nobody really ever sees the device numbers on a contemporary distribution, none of this should matter. Tejun's patch series is an interesting exercise in slowly evolving an interface toward a final goal, with a number of intermediate states. In the end, the API as seen by block drivers changes very little. There is a new flag (GENHD_FL_EXT_DEVT) which allows the disk to use extended partition numbers; once the number of minor numbers given to alloc_disk() is exhausted, any additional partitions will be numbered in the extended space. The intended use, though, would appear to be to allocate no traditional minor numbers at all - allocating disks with alloc_disk(0) - and creating all partitions in that extended space. Tejun's patch causes both the IDE and sd drivers to allocate gendisk structures in that way, moving all disks on most systems into the (shared) extended number space. Even though modern distributions are comfortable with dynamic device numbers (and names, for that matter), it seems hard to imagine that a change like this would be entirely free of systems management problems across the full Linux user base. Distributors may still be a little nervous from the grief they took after the shift to the PATA drivers changed drive names on installed systems.
So it's not really clear when Tejun's patches might make it into the mainline, or when distributors would make use of that functionality. The pressure for more partitions is unlikely to go away, though, so these patches may find their way in before too long. Ubuntu, security response, and community contributions A recent interview with Mark Shuttleworth is raising a few eyebrows. The Austrian news site derStandard sat down with Ubuntu founder and Canonical CEO Shuttleworth at GUADEC in Istanbul asking about many aspects of Ubuntu, desktops, and Linux in general. His answers to questions about synchronizing releases with other major distributions included some controversial claims. Last May, Shuttleworth suggested that the major enterprise distributions (Red Hat, SUSE, Debian, and Ubuntu) should coordinate their release cycles to foster better stabilization of Linux components. None of the other distributions have expressed much in the way of interest in that plan—at least publicly—though Shuttleworth says there have been some interesting discussions behind the scenes. In answer to a question about the belief that Ubuntu has much more to gain than either Red Hat or Novell, Shuttleworth said: Well we have a better security track record than Red Hat, we do that by focusing very hard on security, making sure the updates are available as fast as possible on Ubuntu, independent studies have generally ranked Ubuntu number one. Below is a table that summarizes the response time for a few vulnerable packages over the last several months. It shows when the vulnerability was first announced along with the first update from each of four major distributions. Note that some distributions fixed the vulnerability at different times for different versions, so the date below is the first; other distribution versions may have waited longer for an update. 
There doesn't appear to be any clear "winner", though Red Hat seems to beat Ubuntu in most cases—at least on this set of vulnerabilities. It would be much easier to do this kind of comparison if Ubuntu followed Red Hat's lead and published regular assessments of its security performance. It is rather easy to make sweeping statements, referring to unnamed "independent studies", while it is much harder to actually gather the information and present it. Red Hat's transparency on its security performance is something that all distributions should strive for—especially those who would tout their security response. But the security issue is just a part of a fairly pervasive perception that Ubuntu and Canonical are not contributing very much back to the community. That is the underlying concern that Shuttleworth is addressing. He continues: So what I'm trying to say here, that the notion that Canonical wouldn't contribute anything in such a situation and it would be a one way flow is something I disagree with. Look for example at the fact that Ubuntu has usually better hardware support, if we all were on the same kernel the others could take the drivers we put in there and have hardware support that is just as good as Ubuntu. While supporting more hardware is an excellent goal, doing it by merging unsupported drivers into the kernel is not the recommended path. As Red Hat kernel hacker Dave Jones puts it: Does no-one else see the hypocrisy in this statement ? Here's how it reads to me... "It would be great if everyone just shipped the Ubuntu kernel and debugged the random crap we merge that we don't have the resources to do ourselves". If only there were some kind of process of getting drivers merged upstream to kernel.org. Perhaps then we COULD be on the same kernel. Oh wait, there is a process. Ubuntu just chooses to ignore it. Canonical, unlike the other major enterprise distribution vendors, is not known for its kernel contributions. 
It is a much smaller organization than Red Hat or Novell, so its support organization is rather small as well. Trying to support lots of hardware is a difficult task. Doing it with out-of-tree and binary-only drivers makes it that much harder. Historically there has also been friction between Ubuntu and its upstream distribution, Debian, at least partially because of a perception that it does not contribute back. It is against this backdrop that Shuttleworth is speaking. The fact that he feels that he needs to defend Ubuntu speaks volumes. Some of the complaints might be written off to jealousy over the popularity of Ubuntu, but there is a fair amount of truth to them as well. Canonical and the Ubuntu community have done some fairly amazing things in a short period of time, but they did it by leveraging lots of work by Debian and others. It is important to be a contributing member of the larger Linux ecosystem, so Ubuntu and Canonical need to work to remove this perception of the distribution—regardless of its merits. Talk alone won't do that; action is required. Trust and mirrors A recent look at attacks on package managers has much of interest. None of the attack methods are particularly new at some level, but applying them to the update process is. When the mechanism that is used to keep one's system updated with respect to security vulnerabilities is itself susceptible, it is definitely worth a look. Much of the problem stems from the fact that many community distributions rely on volunteer mirrors to distribute updates. These mirrors could be malicious, which would allow them to distribute bad code to systems that are checking for updates. In addition, mirrors are perfectly placed to notice which machines are updating for particular vulnerabilities—information that could be used in attacks. The study looked at ten of the most popular Linux and BSD package management systems and found all of them to be vulnerable to one or more of the flaws they identified.
Package managers track metadata—information about what package versions and dependencies there are—as well as the packages themselves in formats like .rpm or .deb. Typically, the packages are cryptographically signed (using GPG for example) so that they can be verified as genuine by client systems. Some package managers also sign the metadata, but some do not, which allows for additional attacks. The biggest issue with mirrors is the information that they gain. When a client requests a certain package, it is pretty easy to guess that it is probably vulnerable to whatever security flaw is being fixed in that new package. A malicious mirror—or one that has been subverted—could try to attack the client machine via the flaw being fixed. A suitable vulnerability could be used to completely compromise the client machine. Once a particular chunk of data, either package or metadata, has been signed, it is valid more or less forever. This can be used by malicious mirrors in two ways: serving up old metadata that points clients at known vulnerable package versions or serving up old packages that are known to have flaws. In both cases, it is a kind of "replay" attack, using old, valid data for malicious purposes. In most cases, package managers will not downgrade to previous package versions unless explicitly instructed to, so machines that have already upgraded are not generally vulnerable to a package replay. However, if a client reliably contacts a particular mirror for metadata, that mirror can continue serving an older version until an exploit of interest comes along. By knowing that the client has not upgraded—because it has been held back by the mirror-served metadata—an attacker can exploit the newly-discovered vulnerability at their convenience. Mirrors can also perform "endless data" attacks where the data transfer for the package or metadata is never terminated. The mirror keeps sending more and more data until it fills the client disk. 
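Two of the attacks described above, metadata replay and endless data, can be blunted by simple client-side checks. The Python sketch below is an illustration under stated assumptions, not a description of how any particular package manager works: it assumes that the signed metadata carries a timestamp, and the names UpdateError, check_metadata_freshness(), and read_capped() are hypothetical.

```python
import io


class UpdateError(Exception):
    """Raised when a mirror's response fails a client-side sanity check."""


def check_metadata_freshness(new_timestamp: float, last_seen_timestamp: float,
                             now: float, max_age_seconds: float) -> None:
    """Reject metadata that is older than what we already saw, or too old.

    Assumes the timestamp is covered by the metadata signature; an
    unsigned timestamp could simply be rewritten by the mirror.
    """
    if new_timestamp < last_seen_timestamp:
        raise UpdateError("metadata is older than a copy already seen (replay?)")
    if now - new_timestamp > max_age_seconds:
        raise UpdateError("metadata is too old to trust")


def read_capped(stream, max_bytes: int, chunk_size: int = 65536) -> bytes:
    """Read a download fully, giving up once it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise UpdateError("download exceeds the expected size")
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    data = read_capped(io.BytesIO(b"x" * 100), max_bytes=100)
    print(len(data))
```

The freshness check only helps if the client remembers what it last saw and if the mirror cannot forge the timestamp. The size cap needs an expected size from a trusted source; without a cap of this sort, an endless transfer simply continues until the client's disk is full.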
Such an attack is likely to "only" cause a denial of service on the machine that is being updated, but that can still be a serious result, especially when the update process is automated. Unsigned metadata can allow for several other kinds of attacks. Manipulating the dependencies that are provided or needed by a package can lead to various kinds of problems. A dependency on a non-existent package will stop the update from happening, while a dependency on a package of the attacker's choosing can lead to complete compromise. There is not a lot that can be done to solve the information gathering problem. Subscription-based distributions generally avoid this problem by providing their own servers rather than relying upon mirrors. For community distributions, there really is no central authority that has the resources to do that. Also, controlling all the mirrors only goes so far; if any are compromised, the same kinds of attacks are possible. Downloading the packages to a non-vulnerable host is probably the best avoidance technique, but is difficult to do in practice. The lessons from this study are clear. Metadata should be signed and only downloaded from "trusted" servers. If there is a concern about man-in-the-middle attacks, an encrypted connection should be used between the clients and servers with certificates being checked to ensure the connection is going where expected. In the end, it comes down to trusting the mirrors that one uses. It is not terribly surprising that mirrors can cause these kinds of problems, but the study authors did an excellent job pulling together the different kinds of attacks. The picture that they paint is not particularly pretty, but it is one we needed to see. Handling kernel security problems Even the most casual observer of the linux-kernel mailing list must have noticed that, in the shadow of the firmware flame war, there is also a heated discussion over the management of security issues.
There have also been some attempts to turn this local battle into a multi-list, regional conflict. Finding the right way to deal with security problems is difficult for any project, and the kernel is no exception. Whether this discussion will lead to any changes remains to be seen, but it does at least provide a clear view of where the disagreements are. Things flared up this time in response to the 2.6.25.10 stable kernel update. The announcement stated that "any users of the 2.6.25 kernel series are STRONGLY encouraged to upgrade to this release," but did not say why; none of the patches found in this release were marked as security problems. As it happens, there were security-related fixes in that update; some users are upset that they were not explicitly called out as such. They have reached the point of accusing the kernel developers of hiding security problems. These problems, it is said, are fixed with relatively benign-sounding commit messages ("x86_64 ptrace: fix sys32_ptrace task_struct leak," for example) and users are not told that a security fix has been made. This, in turn, is thought to put users at risk because (1) they do not know when they need to apply an update, and (2) there is no clear picture of how many security problems are surfacing in the kernel code. So, as "pageexec" (or "PaX Team") put it: the problem i raised was that there's one declared policy in Documentation/SecurityBugs (full disclosure) yet actual actions are completely different and now Linus even admitted it. the problem arising from such inconsistency is that people relying on the declared disclosure policy will make bad decisions and potentially endanger their users. there're two ways out of this sitution: either follow full disclosure in practice or let the world at large know that you (well, Linus) don't want to. in either case people will adjust their security bug handling processes and everyone will be better off. 
There are two aspects to the charge that the kernel is not following a full disclosure policy: commit messages are said to obscure security fixes, and kernel releases do not highlight the fact that security problems have been fixed. There is an aspect of truth to the first charge, in that Linus will freely admit to changing commit logs which discuss security problems too explicitly: I literally draw the line at anything that is simply greppable for. If it's not a very public security issue already, I don't want a simple "git log + grep" to help find it. That said, I don't _plan_ messages or obfuscate them, so "overflow" might well be part of the message just because it simply describes the fix. So I'm not claiming that the messages can never help somebody pinpoint interesting commits to look at, I'm just also not at all interested in doing so reliably. His goal here is clear: make life just a little harder for people who are searching the commit logs for vulnerabilities to exploit. One may argue over whether this policy amounts to hiding security problems, or whether it will be effective in reducing exploits (and plenty of people have shown their willingness to do such arguing), but the fact remains that it is the policy followed by Linus at this time. In his view, the committing of a fix is the disclosure of the problem, and there is no need to be more explicit than that. That view extends to the whole security update process found in much of the community. He has no respect for embargo policies or delayed disclosure, and he criticizes the "whole security circus" which, in his opinion, emphasizes the wrong thing: It makes "heroes" out of security people, as if the people who don't just fix normal bugs aren't as important. In fact, all the boring normal bugs are _way_ more important, just because there's a lot more of them. 
I don't think some spectacular security hole should be glorified or cared about as being any more "special" than a random spectacular crash due to bad locking. Beyond that, it is often hard to know which patches are truly security fixes. It has been argued at times that all bugs have security relevance; it's mostly just a matter of figuring out how to exploit them. So explicitly marking security fixes risks taking attention away from all of the other fixes, many of which may also, in fact, fix security issues. Thus, Linus says: If people think that they are safer for only applying (or upgrading to) certain patches that are marked as being security-specific, they are missing all the ones that weren't marked as such. Making them even _believe_ that the magic security marking is meaningful is simply a lie. It's not going to be. So why would I add some marking that I most emphatically do not believe in myself, and think is just mostly security theater? That said, the stable kernel updates go out with patches which are known to be security fixes. Some people clearly believe that being STRONGLY encouraged to update is not sufficient notification of that fact. It does seem that there has been a trend away from explicit recognition of security issues in the stable releases. The inclusion of CVE numbers was once common; in the 2.6.25 series, only 2.6.25.1, 2.6.25.2, and 2.6.25.5 had such numbers in the changelogs. It is, indeed, true that a straightforward reading of the stable release changelogs will not tell users whether those releases fix relevant security issues. There are a number of answers to that complaint too, of course. The real information is in the source code, and that is always public. The fixes in the stable series are unlikely to be all that relevant to most users anyway; they are running distributor kernels which are many months behind even the -stable series and which may (or may not) be affected by a specific problem. 
In the end, users who are concerned about security issues in their kernels have somebody to turn to: their distributors. Linux distributors follow disclosure rules and tend to do a pretty thorough job of fixing the known security problems and propagating those fixes to users. For users who need a high level of long-term support, there are distributors who are more than willing to provide that kind of service for a fee. As is often the case, what it really comes down to here is resources. It would be nice if somebody were to follow the patch stream (well over 100 patches/day into the mainline) and identify each one which has security implications. For each patch, this person could then figure out which kernel version was first affected by the vulnerability, obtain a CVE number, and issue a nicely-formatted advisory. But this is a huge job, one which nobody is likely to do in an uncompensated mode for any period of time. So somebody would have to pay for this work. And, to a great extent, that is just what the distributors are doing now - with the nice addition that they backport the fixes into the kernels they support. It is worth noting that those distributors have not been doing a whole lot of complaining about how security fixes are handled now. Instead, the complaining has come, primarily, from the maintainers of the out-of-tree grsecurity project which, from a suitably cynical point of view, could be seen to benefit from raising the profile of Linux kernel security problems. But, regardless of the validity of any such charge, there may be some value in what they are asking. It is good to have a clear sense for what the security problems in a piece of code are. If nothing else, it helps the project itself to understand where it stands with regard to security and whether things are getting better or worse. 
So it would be nice if the kernel developers could be a bit more diligent and organized in how they track security issues, much like the tracking of regressions has improved over the last couple of years. But this kind of improvement will not happen until somebody decides to put the work into it. Actually putting some time into documenting kernel security issues will accomplish far more than complaining on mailing lists. Control model railroads with JMRI JMRI is the Java Model Railroad Interface, a cross-platform open-source project that has been developed by a long list of contributors: The JMRI project is building tools for model railroad computer control. We want it to be usable to as many people as possible, so we're building it in Java to run anywhere, and we're trying to make it independent of specific hardware systems. JMRI is intended as a jumping-off point for hobbyists who want to control their layouts from a computer without having to create an entire system from scratch. JMRI provides the DecoderPro and PanelPro applications, tools for model railroaders who want to configure DCC decoders and create control panels. DCC, the Digital Command Control system, uses a PC-connected interface to send power and two-way control signals over the model railroad track to control boards on model train engines and other peripherals such as track switches and lights. The protocol allows for the control of multiple engines; each engine can have addressable lights, sound effects, smoke generators, etc. The JMRI Hardware Support document lists a wide variety of supported DCC interface devices and other controller options. The JMRI Help System document and DecoderPro Manual are good places to read about the capabilities of the system. 
Production version 2.2 of JMRI was announced on July 15, just in time for the 2008 National Model Railroad Convention in Anaheim, CA: "At long last, the 2.1.* series of JMRI test releases has resulted in something good enough for new users to start with, our definition of a "production" release. We're therefore making a new production version, JMRI 2.2, available today." A number of JMRI clinics are being held at the NMR convention. The release notes for version 2.2 mention support for many new devices, improved support for existing devices, new scripts, documentation improvements and more. The JMRI project has suffered a legal controversy: "For the last three years JMRI has been under attack by Matt Katzer and his attorney Kevin Russell. They have been using various coercive tactics, some of which we believe are illegal, in an attempt to put a stop to JMRI's work or to extract money from JMRI. Katzer, through his attorney Russell, obtained a patent on model railroad technology that other people had developed years before. Using a "continuation" application, they applied for a patent that covered JMRI after JMRI had openly published its code. Because Katzer and Russell didn't provide the prior art to the Patent Office, the patent was promptly issued." (Also see this LWN article from April 2006). Donations are being accepted for the JMRI legal defense fund. Despite having no compatible hardware, your author decided to download JMRI 2.2 onto an Ubuntu Hardy Heron system with the default OpenJDK Runtime Environment version 1.6.0-b09. The JMRIdemo application was run and everything started up as expected. The demo allows the user to step through the user interface and see the various configuration and control screens. To get an idea of the amount of complexity that a JMRI system can handle, see the SP Shasta Route model railroad layout that is featured at this year's NMR convention. 
Gentoo: New release, "new" leadership Last week, lots of Gentoo news came out, so it's a good time to look at what happened and what it means. Gentoo's 2008.0 release is its first in more than a year, despite the project's goal of releasing twice a year. Fortunately, Gentoo releases don't mean much because it's already a live distribution rather than a snapshot in time with occasional updates. A release provides a new kernel with the accompanying driver support, occasionally a flashy new bootsplash, and the usual bugfixes to the GUI installer, which is not universally loved. But what happened to make this release come so long after the last one? First, 2007.1 was canceled, largely because so many security vulnerabilities came out that it was impossible to keep up with release rebuilds. 2008.0 was scheduled to come out in March, so it slipped 4 months. Tobias Klausmann described the problems well. Here are a couple of them: Building release media in itself isn't easy to begin with - catalyst is a powerful but complex (and complicated) tool. ... On top of this, the central release coordinator has to keep in mind all of the gritty details of the arches that will see release media. There's arches like ppc which also have a differently-bitted cousin (ppc64); there are arches that are very, very slow when building stuff (MIPS). On top of that, some software just doesn't build on some arches (no Java on alpha, for example) which can make deciding what to put on the LiveCD very hairy. People have lives. This is one that bit us this time: life struck at a very bad point (not that the event had been any better post-release). This occupied the time of a dev for a prolonged time. It made painfully obvious that in some spots, stand-in personnel wasn't there. In addition, Tobias cited three other problems: Release work is unpopular. The release engineering team is perpetually undermanned, basically because the work is boring and otherwise unrewarding. 
Bike shedding creates secrecy. Everyone's trying to chip in their own ideas of how things should work without having any experience or clue of what their ideas mean. Reproducing installation bugs is hard. This is much like the Linux kernel because the release engineers just don't have the hardware. In some ways, it's worse, because the people who file distribution bugs about problems installing are often inexperienced Gentoo users who don't know how to file a good bug. Often, bugs that make it to the upstream project have already been filtered by the distribution, but that of course hasn't happened here. The main problem delaying 2008.0 was real life interfering with a critical developer. This is being addressed by creating new processes and backup people who can take over when others aren't around. As for the other problems, it's unclear how to fix them. Suggestions would be appreciated. The other major news in Gentoo is the election of a new council. The council is a group of 7 people who lead Gentoo by making decisions on global issues. Two things make this election interesting: It was a forced election that resulted indirectly from a controversy over expelling developers from the project. It happened because of a technicality in the Gentoo Linux Enhancement Proposal (GLEP) that gave the council its authority. The GLEP requires monthly meetings and forces an election if a majority of council members don't show up to a meeting. The controversy came about because this was an additional meeting beyond the usual one, specifically to discuss the appeals of 3 developers who were fired. It was poorly announced (only mentioned in the meeting minutes). It's unclear whether a majority of council members even agreed on the time. The election involved people who think the social side of development matters versus people who think only the technical side matters. In Gentoo, the silent majority of developers rarely post to mailing lists, preferring to simply do development. 
Votes like this are often the only way they choose to express their opinions. In the past year, 50% of the traffic on the main development list came from 20 people, yet nearly 150 people voted in the council election and more than 250 are listed as active. The 145 voters approach the highest number ever in a council election; here's how it compares with previous years: This is the highest turnout since the first year the council existed, showing that the developer community is significantly more interested in who leads it than it was in the intervening years. To understand exactly who they voted for, these histograms show how highly each candidate was ranked, in order of result. The left side indicates that a candidate was highly ranked, and the right side shows that a candidate was poorly ranked. Of particular interest is the position of "astinus," a developer who retired during the election but was still voted above three other people. Since these three people all favor ignoring social issues when they come from someone with good technical contributions, this really shows how strongly the Gentoo development community supports the creation of a friendlier environment. Notably, of the previous council, every single one of the five members who ran for the new council was re-elected. This shows that the community didn't care about the mistakes that resulted in the new election. It also shows that the community supported the existing council's actions and believed in what its members were saying about the need for social change within Gentoo. With its new release and its accompanying publicity, Gentoo has renewed interest from many users and has shown that it remains a distribution under active development. Having a new council in place for the next year puts Gentoo in position to rebuild its development community and keep development thriving so the publicity and new users gained by the release don't fade away. 
Fedora and distributed source packages Fedora's new version of RPM, announced on July 9, has hit the Rawhide repositories; after inspiring some initial cries of pain, it would appear to be settling in well. It is good to see activity on Red Hat's version of RPM after a long period where nothing much was happening. In the process of bringing this new code to Rawhide, the RPM developers have also inspired some interesting side discussions on topics like whether such a major change should have gone through the official "features" process first. But the most extended (and arguably most interesting) discussion came from an unexpected direction. Doug Ledford is known in kernel development circles, but, being an RHEL engineer, he has not been seen much in the Fedora camp. He joined the RPM discussion with a feature request of his own: he would like a set of tags which would facilitate the location of a package's source code in a distributed version control system (DVCS). So these tags would indicate which DVCS is in use (git, mercurial, etc.), where the repository is to be found, the tag corresponding to the source code for a specific version of a package, etc. And, Doug let it be known, it would be nice if he could have those tags soon; tomorrow would be nice, but before the Fedora 10 release in particular. Once this information exists for a package, interesting things can be done. For example, source RPM packages could become much smaller; rather than containing a tarball and a set of distributor-applied patches, a source package could just hold the DVCS information. An "installation" of that package would then just go to the source repository and check out the sources from there. If the source repository is managed carefully, it could help the cooperation between Fedora and the upstream projects; patches could be pushed and pulled between repositories with ease. 
This kind of mechanism could also make it easier for the Fedora project to distribute "spins" created by outsiders by reducing the resources required to make the associated source code available. See this lengthy pitch from Doug for more discussion of the advantages of the distributed source package approach. Of course, there are some obstacles too. Not all projects are using a DVCS, so integration with those projects would be more difficult. Quite a few projects have material in their repositories which, for legal reasons, cannot be distributed by Fedora. Finding a way to excise that material without breaking the connections between repositories could be challenging. The tarballs distributed by many upstream projects - which are the starting place for Fedora packages now - often contain changes which are not reflected in their source repositories. Those changes can include the removal of non-distributable material, or simply generating the configure file. These challenges are real, and some of them will take a fair amount of work to resolve. But it seems clear that things eventually need to go in this direction. Tighter integration between projects and distributors can only help the whole free software ecosystem work in a more efficient manner. Tarballs reflect a form of frozen state which is entirely divorced from the code's history - and from its future. Or, as Doug put it: It's all about the repo. A tarball is something you hand off to poor saps that haven't joined the 21st century, all the while snickering at their inability to get with the times. It is nothing more than a middle man step that interferes with efficiency of operation and that should be cut out of the loop. A source package format that can maintain its connections wherever it goes can only make the whole system work better. So it is good that the Fedora folks (including those beyond Doug who have been thinking about this issue for a while now) are working on this problem. 
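As a rough illustration of the idea, the sketch below models the three pieces of information Doug asked for (which DVCS, where the repository lives, and which tag matches the package version) and turns them into the command an "installation" of such a source package might run. The field names, the example URL, and the tag format are all assumptions made for this sketch; nothing here reflects actual RPM tags, which had not been defined at the time of the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocation:
    """Hypothetical stand-in for the proposed tags; the names are assumptions."""
    vcs: str       # which DVCS is in use, e.g. "git" or "hg"
    repo_url: str  # where the repository is to be found
    tag: str       # tag corresponding to this package version


def checkout_command(src: SourceLocation, dest: str) -> list:
    """Build (but do not run) the command that would fetch the tagged sources."""
    if src.vcs == "git":
        # git clone accepts a tag name for --branch and checks it out
        return ["git", "clone", "--branch", src.tag, src.repo_url, dest]
    if src.vcs == "hg":
        # hg clone --updaterev updates the working copy to the given tag
        return ["hg", "clone", "--updaterev", src.tag, src.repo_url, dest]
    raise ValueError(f"unsupported DVCS: {src.vcs}")


# Illustrative values only; the repository and tag do not exist.
example = SourceLocation("git", "https://example.org/foo.git", "foo-1.2-3")
print(" ".join(checkout_command(example, "foo-1.2-3")))
```

The sketch only builds the command line rather than running it, which keeps the mapping from tags to checkout visible without depending on network access or on either tool being installed.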
There was, however, an interesting omission from the discussion; as far as your editor can tell, nobody ever mentioned the work being done by the vcs-pkg project, which is aimed toward this goal: Our goal is to integrate version control with distro package maintenance. We want to recognise all involved in the process, from upstream, the package maintainers of the various distributions, their security and release teams, and power users, who aren't afraid to fix their own bugs, and give maximum flexibility to them. This group is mostly Debian-based, but its members are making a concerted effort to create solutions which are independent of any given distribution (or DVCS). It can only make sense for Fedora to work with this project - or at least have a look at what vcs-pkg is doing and come up with a good excuse why a different solution has to be invented for Fedora. The integration of distributed version control and packaging can only reach its full potential if, among other things, it facilitates cooperation between distributors and their upstream providers, their users, and, importantly, other distributors. If each distributor brews up its own solution (again), they'll have a hard time sharing their work with each other. Few upstream projects will have the patience to integrate with several disparate distributor systems, so that integration will be much less likely to happen. All of this can be avoided, though, if the distributors decide now to work toward some common standards for the use of distributed version control in packaging. Kernel security problems: a response I would like to try to clarify a few points in the article, "Handling kernel security problems" by Jonathan Corbet. First off, I speak only for myself, not for the other half of the Linux -stable team, Chris Wright, who might totally disagree with me, nor for the other kernel developers who help out with the security@kernel.org alias, nor for my current employer Novell. 
Also note that all of my -stable development is done on my own time, and is not part of my role at my current job. All of that out of the way, I object to a few things stated in the original article: It does seem that there has been a trend away from explicit recognition of security issues in the stable releases. The inclusion of CVE numbers was once common; in the 2.6.25 series, only 2.6.25.1, 2.6.25.2, and 2.6.25.5 had such numbers in the changelogs. It is, indeed, true that a straightforward reading of the stable release changelogs will not tell users whether those releases fix relevant security issues. A number of times, when we do -stable releases, there are no CVE numbers issued for the "security" related issues that are fixed in there. This happens when the fix is first made in Linus's tree, and is either forwarded to the stable@kernel.org alias saying, "we need to get this out now", or just by the fact that it is only later that people realize that a CVE number should be allocated. And yes, the trend is away from explicit recognition of security issues, exactly following Linus's statement that you quote from. It comes down to who are the users of the -stable kernel series. I personally see these kernels for two different groups of people: Those who want to follow the latest kernel.org releases and not rely on a distribution for their kernel versions. For distributions to base releases on, and to pick and choose patches from. The first group should always update to the latest -stable kernel update as they are relying on the -stable team to always provide them the latest fixes that are known to be needed for them. Simply marking things as "security related" can be misguided as Linus points out. The change log entries should show all users what was fixed, and if they run a machine where this code is used, then they should upgrade. It's as simple as that. 
In fact, in the 2.6.25.11 release I tried to say exactly that: It contains one bugfix, any user of the 2.6.25 kernel on x86-64 with untrusted local users is very STRONGLY recommended to upgrade. How much clearer can I be? Does a user of the -stable tree, who has to be technically competent to be able to do such a thing in the first place, need to know more to decide if they need to upgrade their machines or not? It seems people are upset that I am no longer using the magic words "security fix", and that is true, I am not saying that anymore. As Linus and others have noted, marking some bugs as being "security-related" is not helpful, especially as not everyone can even agree - or sometimes even know at release time - whether a bug has security implications or not. Also note that this release does not refer to a CVE number. This is because, as of this moment, there still is not a number assigned, despite asking the relevant groups for such an assignment. I never want to hold up a release by waiting for any such number, so I personally will just not use them in the future in -stable releases unless they are already contained in the original changelog entry in Linus's tree. The second group, the distributions, all seem very happy with how the -stable releases are conducted. They have the capability to pick and choose from the fixes and apply them to their older kernel versions and ship them to their customers as they see fit. The distros all know what things are security related by the fact that they know and understand the code and the threat model as they have developers assigned to handle such security issues, and have done so for years. In your summary, you state: It is good to have a clear sense for what the security problems in a piece of code are. If nothing else, it helps the project itself to understand where it stands with regard to security and whether things are getting better or worse. 
So it would be nice if the kernel developers could be a bit more diligent and organized in how they track security issues, much like the tracking of regressions has improved over the last couple of years. I think the individual developers of the kernel all know quite well what the security problems for their code are. This is backed up by the fact that these developers are the ones usually making the fix and telling the -stable team that a specific patch is needed to be added. What you seem to be asking for is a way to somehow classify bugs and fixes in the kernel tree as "security related" or not. And that goes back to Linus's original point. To try to do so marginalizes bugs which are somehow not so designated as not worth fixing. However, if someone wants to do this work for the kernel community, and it proves to be useful over time, I'll be the first in line to say that I was wrong. Interview: Wind River's John Bruggeman If you wanted a symbol of Linux's impact on the world of embedded systems, you could do worse than consider the edifying case of Wind River's Damascene conversion. Once one of free software's fiercest critics, today Wind River is a cheerleader for the benefits of open source, of sharing, and of giving back to the community. John Bruggeman is Wind River's Chief Marketing Officer. Here he talks to Glyn Moody about why you can't use any old Linux for embedded systems, the respective strengths and weaknesses of the Linux-based mobile platforms from the LiMo Foundation and Google's Android, and what effect Nokia's announcement that it would be open-sourcing the Symbian operating system will have on the sector. Once upon a time, Wind River was synonymous with anti-Linux: what happened? The market changed, and I think that open source became a very, very important part of the addressable market we wanted to reach. 
And if Wind River was going to be relevant and going to be important in the marketplace, we would have to have an open source and specifically a Linux-based solution for our customers. So, basically, the market thrust us into it, demanded that we do it, and I think it was all for the best that that happened. What do you have to do to Linux to make it suitable for the embedded market? The embedded marketplace has requirements that aren't in the general enterprise computing market. Things like size becomes very critical, and memory utilization and power management and some other features like that. Standard Linux wasn't optimized or suited for device types that face those challenges. Those are kind of software elements, but there is also a hardware element. In the enterprise computing space, you are basically living in an [Intel architecture] world and everything is pretty constant and stable and predictable. Well, that is the anti-case with what we see in embedded. You have a plethora of hardware environments. Each hardware environment has their own specific nuances and special techniques and tips and trips. And making Linux work really well with hardware is a tough problem. How would you compare your Linux offering with your proprietary VxWorks solution? VxWorks is where you need absolute real-time determinism, where you need things like safety and security, [and to] meet certain regulatory standards and certification standards: those kinds of applications are the sweet spot for our VxWorks software. More general solutions, where application availability, middleware integration, [and] where lots and lots of ecosystem partners are required, that's in the sweet spot of our Linux software. Is there any reason why your Linux software couldn't take on the other kinds of things as well? I think, over time, probably not. But, that's a long time way. A great example of that would be security certification for an airplane. 
The standards and the requirements to meet those certifications are very, very complex. They are very difficult and I think Linux is a long way away from being able to do that. What's the kind of split between the VxWorks and Linux, in terms of revenue? Today about 80% of our revenue is VxWorks, but the fastest-growing segment of our business is Linux. It's growing in the triple digits quarter over quarter over quarter. We announced it well north of $50 million for us this year. Do you think one day you'll ever be wholly open source? Wholly? I don't think so. There will always be certain types of devices in which VxWorks will be a superior solution. But the Linux portion of our business will continue to grow, and I see a day where our Linux business is every bit as big as the VxWorks business. What are the key attractions of Linux for your customers? Let me start with Linux in general. The first is availability of the ecosystem. The need to accelerate the pace of development is becoming critical. Many, many of our customers used to be vertical integrators - they even manufactured their own silicon and they would go all the way up to the top. And we're seeing a change that's happening at light speed, where they are shifting from a vertical integrator to an application developer. And they are really differentiating themselves on the user experience, on the type of applications they develop. The attraction of Linux is there's this massive development community developing that infrastructure stuff that they used to spend so much time on, that enabled application development: they don't have to do that anymore. The second thing is obviously cost. They really can get it at a significantly lower development cost than they did when they used to have to build it themselves. What's your business model? We provide things like integration testing and validation. 
Open source is a bunch of packages and the magic is how well are they put together and how reliable are they, and how well has that been tested, and can you validate and stand behind that? We have over 300 support engineers located globally around the world, in different time zones. We have the richest indemnity and warranty program in the industry. We don't stand behind Wind River, we stand behind open source. Moving on to the mobile phone space, can you say a little about LiMo and Android, and what your involvement in those has been? Linux has the opportunity to revolutionize the mobile phone space - not just smart phones, but feature phones, converged phones, [Mobile Internet Devices - MIDs]. What's holding it back right now is the fragmentation. There are just way too many different Linux distributions. What that means is the ecosystem can't aggregate and surround anything of any critical mass. So, two initiatives have broken out that seem to be aggregators or consolidators: one is LiMo and one is Android. We're not smart enough to know which one is going to be the ultimate consolidator, so we're tremendously active in both. We joined LiMo as a board member and we work very, very hard with the architectural committee to become the Linux foundation for all LiMo-based development. What that means is the common integration environment, which is the Linux-built system, the tool chain, is all based on Wind River technology. And therefore any contribution that's made to LiMo [is] based on our technology - we contributed that common integration environment to the LiMo foundation. [Open Handset Alliance's Android] was announced about six or nine months or so after LiMo, and Google came out and said Wind River is their Linux commercialization partner. We have been working with them for about two years. We've done a number of hardware integrations for them. That's one of our core competences: how do you get Android running on the hardware. 
We have phones coming out for both. We see a lot of activity on both and a lot of momentum for both. How would you contrast the two initiatives? LiMo truly is a consortium of equals. There are multiple operators: Vodafone, Docomo, Verizon, Orange, others. A bunch of carriers and a bunch of handset OEMs: Motorola, Samsung, LG, Panasonic, NEC. And the board is made up of those guys and Wind River. And we see that really is sort of: how do we get a common ground between fierce competitors? How do we, for the good of the industry, standardize around that stuff that's non-differentiating? OHA is really a Google-driven initiative. They make product decisions and they make feature decisions. So, let's talk pros and cons about this. When it's not a democracy, when the decision-making is very clear, decisions can be made quickly and things move very fast. On the LiMo side, where it's a lot of people, with a lot of experience building phones, who know what really matters, and what's important and what works and what doesn't work, they can bring a lot of different experience, a wealth of different perspectives together. Sometimes it might take a little longer to make a decision over here but I really understand and can see why that decision works over there. Where this one races ahead, this one's a little more methodical and carefully constructed. But they're both building compelling platforms and will both be successful in the marketplace. Alongside LiMo and Android, we will have an open source Symbian at some point; what effect is that going to have on this whole market? If you look at the smartphone market, it's 7% today of the total phone marketplace. So, from a percentage basis, it's not big. But what we're seeing is more and more feature phone-like capabilities blurring with the smartphone. So even though it's a small part of the market today, it's very strategic, because it does have implications down-market on the feature phones. 
Symbian's got 60% of the smartphone market. And Microsoft's 20 to 30% of that market. Certainly they are not among equals, but Microsoft's been gaining share against Symbian and against Nokia. So, I think this was an aggressive and a bold and clever move against Microsoft. Vis-a-vis Linux, the Symbian move just endorsed what was going on. It said if you're going to be competitive, if you're going to be relevant years from now, you'd better have an open source model. I love that endorsement of Linux. On the other hand, their solution is years away. Nokia said: Well, we'll have it in the first half in 2010. Both Android and LiMo will have phones out by the end of this year. So, there should be a lot of activity. Now if I'm an ecosystem member, am I going to wait for 2010, or am I going to develop today, and address real design opportunities and real win opportunities today? I think Linux has a window of opportunity. We're going to see mass adoption of Linux-based devices, whether they are phones, or converged devices or MIDs, or whatever they are. However this market evolves, Linux is going to have two years' worth of product out there in the marketplace, doing stuff, before we see Symbian open source. While Nokia made a brilliant and bold move, it might be too late, because there is enough Linux momentum, especially behind OHA and LiMo, that I think they left that too long. What about the other player in the closed-source world, Apple with its iPhone? Apple will always be what Apple is. Apple is just fantastic, touches the super, niche, high end - somebody willing to pay $700 for a phone. And there is a big market for that - if you think a big market is 10 million phones. That's going to be there and that's not threatened or messed with in any of this stuff, because they are always going to come out with some really creative form factor or killer application: they are going to touch 10 million people. 
Three years from now we'll see a couple billion phones in the marketplace. So, let Apple go be content with that [10 million]. Let RIM go hit their niche part of the market. I don't see that catching fire. So you've got the smart phones, the MIDs and now these ultraportables - the $300-400 machines that run GNU/Linux. How do you see that three-way contest panning out? I think all three devices meet certain use cases. I don't see, in the near future, or even the mid-term future, a MID overtaking a phone. There's a reason people talk on phones, but there's this whole different class of people in different use scenarios, they need a MID. What is becoming very, very clear is, it's not about voice and it's not about text or email, it's going to be about a true, rich Internet experience. Can a web page be represented on these devices at the same clarity, the same quality, the same speed, as they are on the PC? When I look at YouTube, I don't want to look at a fuzzy, webcam image. I want to see [High Definition] quality on that thing. So, the devices we're seeing today, they're being required to be able to deliver that level of video representation and audio, that's [as good as] my music device and that's as good as my home entertainment system. In what other embedded sectors is Linux becoming important? One of the fastest-growing areas of Linux we see right now is in the automobile: in the in-vehicle entertainment, in the dashboard, in the navigation. Those, for years and years and years, have been relegated to proprietary software stacks, because there's this big stigma that an automobile is hard. It moves and it bumps and there's temperature and there's all these safety requirements, and that's proprietary stuff. I think Apple helped change the game, because everybody wanted their iPod in their car without a bunch of wire striking around. 
Automobile manufacturers worked on a development cycle that is five to seven years, and all of a sudden the iPod hits and they have one quarter to figure out how to get that thing in there. This is a whole new business and process problem that the automotive manufacturers had not been in before. They all stood up and said: We don't know how to do this. And then the next new application came in and the next new application and, all of a sudden, they said: There's been a tremendous disruption in the industry; we've got to change the underlying principles of how we design these applications. And Linux is clearly the solution for that, because it's all about the application and how extensible can the platform be, and how well can we count on consumer-like speed in an automotive-like marketplace. The second market that I would say we're seeing is in the home. Things like broadband access points - how you get content into the house: that's going Linux now. Every new data standard, Linux is keeping pace with that better than anything else out there. We're seeing a general theme here. There's a real need for content - I want YouTube and I want cable and I want satellite and I want data. We're seeing those three C's of content, of connectivity, and of complexity. When you have those three things there, Linux is a tremendous solution.

Glyn Moody writes about open source at opendotdotdot.

Deep packet inspection

At its core, the internet is a set of agreements; not just on protocols, but also on practices amongst carriers. Part of the explosive growth of the internet, in both participants and services, can be attributed to these agreements. When a new technology like deep packet inspection (DPI) comes along to threaten these long-standing practices, it should be cause for concern. Internet packets are constructed much like postal mail.
There is an envelope with addressing information contained in the packet header and a message which is contained in the data payload portion of the packet. Internet carriers are supposed to make their best effort to deliver a packet based on the information in its header. DPI violates that compact by looking inside the data portion, as the packet is en route to its destination, and making decisions based on that. There are some potentially valid uses for DPI—network performance monitoring and law enforcement surveillance, perhaps even with a warrant, are two—but the potential for abuse is large. Because network processing has gotten to the point where devices can do more than just observe and record, packets are being modified and generated on-the-fly in a technique known as deep packet processing (DPP). Various examples of DPI and DPP—generally lumped together as DPI—have been in the news over the last year. Comcast used DPI to try and throttle Bittorrent traffic, while Phorm and NebuAd have used it to rewrite web pages to deliver advertising to unsuspecting users. The DPI problem has gotten enough attention that even various governments have started showing interest. The designer of User Datagram Protocol (UDP)—the connectionless analog to Transmission Control Protocol (TCP)—David Reed recently testified to the US Congress about DPI. In his testimony [PDF] he outlines numerous technical issues, but the biggest may lead to breaking the fundamental model of internet communication: This is the real risk: [a] service or technology unnecessary to the correct functioning of the Internet is introduced at a place where it cannot function correctly because it does [not] know the endpoints' intent, yet it operates invisibly and violates rules of behavior that the end-users and end-point businesses depend to work in a specific way. We have seen this behavior from internet companies in other guises as well. 
Verisign and various ISPs have tried redirecting failed DNS queries to pages they control (and generally fill with ads). Once again, that breaks many applications; it functions more or less correctly for web browsing, but other applications depend on receiving proper errors when querying for nonexistent domains. Because many ISPs hold a near-monopoly on high-speed access in a particular geographical area, they can hold their customers hostage with little concern that competition will come along to force a change. It is this abuse of their monopoly position that tends to interest regulators. In addition, most of their customers are unlikely to notice these "enhancements", making it easier to get away with—at least until those more technically savvy recognize and raise the issue. Using encrypted communications, HTTPS for web browsing for example, is one defense against DPI. There is some cost associated with encryption, of course, but it is one that is likely to be borne if internet carriers persist in these shenanigans. Another option might be Obfuscated TCP, which is a technique to do backwards-compatible encryption at the packet level. Because it doesn't require all hosts to support it at once—it is negotiated between the endpoints when the connection is established—it could incrementally be added into the arsenal of tools to thwart DPI. DPI uses techniques that have generally been attributed to the "cracking" community. Things like man-in-the-middle attacks and IP address spoofing are difficult-to-solve security problems for many applications. When the "legitimate" middlemen start manipulating packets using these means for their own benefit, they come very close to—or cross—the line into illegality. This is a battle about control; our freedoms to communicate and innovate on the internet are at stake. 
A phone system that randomly inserted advertising into calls or a postal system that kicked back letters whose contents it didn't like as undeliverable would not be considered functioning systems. The internet requires the same treatment.

2.6.27 merge window, part 2

As of this writing, just over 6200 changesets have been merged into the mainline git repository since the 2.6.26 release. Merge activity appears to be slowing down somewhat; most of the major trees seem to have been pulled. Andrew Morton has not yet started to unload the -mm tree into the mainline, though; until that happens, the merge window can be expected to remain open.

User-visible changes merged since last week's summary include:

There are new drivers for Samsung S3C SD/MMC interfaces, Atmel Multimedia card interfaces, Ricoh Bay1Controller cards, S/390 QDIO controllers, Renesas SuperH SH7710 and SH7712 Ethernet controllers, Option HSDPA/HSUPA mobile network devices, Broadcom BCM57711 Ethernet adapters, Mikrotik RouterBoard 532 series boards, Anysee DVB-T/C USB2.0 receivers, Sensoray 2255 video capture devices, Siano SMS10xx digital television devices, SuperH Mobile CEU camera controllers, Niagara2 hardware random number generators, HTC Shift (X9500) touchscreens, iNexio serial touchscreens, Sahara TouchIT-213 touchscreens, Xilinx XPS PS/2 controllers, Maxim MAX7301 GPIO expanders, HP iLO/iLO2 management processors, Atheros L1E Gigabit Ethernet adapters, Marvell XOR DMA engines, Synopsys DesignWare DMA controllers, and Intel version 3.0 I/OAT DMA engines.

There is also a new PCI "slot detection driver" which will attempt to find all PCI slots in the system and create corresponding entries in /sys/bus/pci/slots/.

Worthy of note: the "gspca" set of video drivers, long maintained outside of the mainline kernel tree, has been merged. These drivers support a large number of video devices; with their merge, most video camera devices on the market are supported by Linux.

The Fujitsu laptop driver has been updated with better hotkey and backlight support for more Fujitsu models.

The UBIFS filesystem for flash-based storage devices has been merged.

The multiqueue networking patches have been merged.

The IA-64 architecture has gained a paravirt_ops implementation to support virtualization.

The new directories found at /sys/dev/char and /sys/dev/block contain pointers to sysfs entries for devices organized by device number.

Changes visible to kernel developers include:

The new suspend and hibernate infrastructure has been merged, providing a wider set of callbacks for power management events. The PCI and platform bus interfaces have been enhanced with support for this new infrastructure.

The TTY layer continues to evolve; significant changes include the introduction of a new tty_port structure meant to hold information common to all TTY ports and a rework of the line discipline code.

The mac80211 code has a new module which can simulate any number of IEEE 802.11 radios; it is suitable for testing mac80211 functionality and associated user-space tools.

There is a new "rfkill" mechanism for unified handling of "radio off" switches on wireless devices.

A number of Video4Linux2 format-related callbacks have been renamed to make them match the names used with the associated buffer types. In addition, the vidioc_enum_fmt_vbi_cap() callback has been deprecated and marked for removal in 2.6.28.

The videobuf layer now has support for controllers which cannot do scatter/gather I/O.

The USB "gadget" framework has been massively reworked to provide better support for composite devices.

The prototype for device_create() has changed. Those who see a resemblance between the new prototype and device_create_drvdata() are right; all in-tree users were converted over to that interface, the old device_create() was removed, and device_create_drvdata() was renamed. For now, a macro makes calls to device_create_drvdata() do the right thing, but that macro will probably go away before the 2.6.27 final release.

User-space UIO drivers can now write a signed value to the /dev/uioX device to enable and disable interrupts.

Debugfs (finally) has a function for removing an entire directory tree, debugfs_remove_recursive(). As a result, code creating hierarchies in debugfs no longer needs to remember the dentry of every file it creates.

The tail end of the 2.6.27 merge window will be covered in next week's LWN Kernel Page.

Tracing: no shortage of options

Three weeks ago, LWN looked at the renewed interest in dynamic tracing, with an emphasis on SystemTap. Tracing is a perennial presence on end-user wishlists; it remains a handy tool for companies like Sun Microsystems, which wish to show that their offerings (Solaris, for example) are superior to Linux. It is not surprising that there is a lot of interest in tracing implementations for Linux; the main surprise is that, after all this time, Linux still does not have a top-quality answer to DTrace - though, arguably, Linux had a working tracing mechanism long before DTrace made its appearance.

Even a casual reader of the kernel mailing list will have noticed that there are a lot of tracing-related patches in circulation at the moment. There are so many, in fact, that it is hard to keep track of them all. So this article will take a quick look at the code which has been posted in an attempt to make the various options a bit clearer.

SystemTap

SystemTap remains the presumptive Linux tracing solution of choice. It is hampered by a few problems, though, including usability issues, a complete lack of static trace points in the mainline kernel, and no user-space tracing capability. On the usability side, we are seeing a few more kernel developers trying to put SystemTap to work and posting about the problems they are having.
If one takes as a working hypothesis the notion that, if kernel hackers cannot make SystemTap work, many other users are likely to encounter difficulties as well, then one might conclude that addressing the reported problems would be a priority for the SystemTap developers. The SystemTap developers do seem to be interested in these reports, which is a good sign. There are other things happening in the SystemTap arena, including the release of version 0.7 on July 15. This release adds a number of new features and tapsets, and a substantial set of examples as well. Meanwhile, Anup Shan has posted an interesting integration of SystemTap and the fault injection framework, allowing tapsets to control fault injection and trace the results. James Bottomley has been playing some with the SystemTap code; one result of that work is changes to SystemTap's internal relocation code in an attempt to make it more acceptable for mainline kernel inclusion. There can be no doubt that the out-of-tree nature of much of the SystemTap support code has made it harder for that code to progress, so any improvement which makes it more likely that some of this code will be merged is welcome. Also by James is this patch implementing a new way to put markers into the kernel. The addition of markers (or static tracepoints) has always been problematic in that many of these markers, by their nature, need to go into some of the hottest code paths in the kernel. To support dynamic tracing, these markers need to be available on production systems, so they must work without creating any significant performance regressions. Quite a bit of work has gone into the static marker code which is in the kernel (but mostly unused) now, but some developers are still uncomfortable with putting them into performance-critical paths. James's patch addresses these concerns by putting the tracepoints entirely outside of the code paths. 
Rather than add some sort of marker to the code, these markers just make a note of where in the code the marker is supposed to be; this note is stored in a separate part of the kernel binary. That information is enough for a run-time tool to patch in an actual jump to a tracing function should somebody want to see the information from that tracepoint. An additional benefit is that these markers do not interfere with any optimizations done by the compiler. Other solutions can insert optimization barriers which, while they do make life easier for the tracing subsystem, also affect the speed of the code even when the trace points are not active.

Ftrace

The text above said that the kernel's static tracepoint code is "mostly unused." That would have been better expressed as "completely," except that the 2.6.27 kernel will include a user in the form of the ftrace framework. One of the things which makes ftrace truly unique is that its documentation was not only merged before the code itself, but well before: the 2.6.26 kernel includes the excellent Documentation/ftrace.txt file.

The ftrace (which stands for "function tracer") framework is one of the many improvements to come out of the realtime effort. Unlike SystemTap, it does not attempt to be a comprehensive, scriptable facility; ftrace is much more oriented toward simplicity. There is a set of virtual files in a debugfs directory which can be used to enable specific tracers and see the results. The function tracer after which ftrace is named simply outputs each function called in the kernel as it happens. Other tracers look at wakeup latency, events enabling and disabling interrupts and preemption, task switches, etc. As one might expect, the available information is best suited for developers working on improving realtime response in Linux.
The ftrace framework makes it easy to add new tracers, though, so chances are good that other types of events will be added as developers think of things they would like to look at.

Tracepoints

The kernel markers mechanism is meant to be the way that static tracepoints are inserted into the kernel. To that end, a great deal of effort went into making these markers fast; they are, for all practical purposes, a set of no-op instructions until somebody wants to turn one on, at which point the real tracing code is patched into the running kernel. Since they were merged, however, kernel markers have been the subject of a few grumbles.

In particular, kernel markers use a somewhat awkward mechanism to ensure that any arguments passed to the tracing function are interpreted correctly there. Each marker has a printk()-style format string associated with it; that string describes the type of each "argument" (a variable or expression within the code being traced). When tracing code activates a marker, it will supply a function to be called when the marker is hit and a format string describing the arguments that the function expects. The marker code will ensure that both format strings match; otherwise the marker will not be enabled.

The problem is that the format string requires extra work to write and is only approximate in its specification of the types involved. These strings can make it clear that a given argument is a pointer, for example, but they say nothing about what type is pointed to. In response to various efforts to get around this issue, Mathieu Desnoyers (the original author of the kernel marker work) has proposed a new mechanism called tracepoints. They are another way of putting static trace points into the kernel, but with a simpler and more type-safe way of putting the pieces together.

With tracepoints, every trace point must be declared in a header file with a mildly ugly set of macros: a DEFINE_TRACE() invocation which supplies the tracepoint's name along with TPPROTO() and TPARGS() clauses. This definition will create a new tracepoint with the given name; call it tracepoint_name.
Any function attached to that tracepoint must have a function prototype as provided in the TPPROTO() macro; the names of the associated arguments are provided with TPARGS().

Perhaps this is better understood with an example. The tracepoints patch set includes quite a few static points for use with the LTTng tracing toolkit. There is one called sched_wakeup which fires whenever the scheduler wakes up a process. It is defined with a DEFINE_TRACE() declaration naming sched_wakeup, with the trace function's prototype given in TPPROTO() and the argument names rq and p given in TPARGS(). The actual insertion of the tracepoint is a single line in the scheduler code which calls trace_sched_wakeup() with rq and p. Note the trace_ prefix added to the supplied name. At this point in the code, a tracing function can be called with rq (the run queue of interest) and p (the process which is waking up) as parameters. Until an actual function is connected to the tracepoint, though, this line is essentially a no-op.

Connection of a trace function is done through a call to register_trace_sched_wakeup(), which is created as part of the DEFINE_TRACE() definition; it will connect the supplied trace function to the tracepoint. The fact that the function prototype for the trace function is supplied as part of the tracepoint definition means that the compiler can perform thorough type checking; if the prototypes do not match up, compilation will fail. And that, in turn, should put an end to those embarrassing situations where turning on tracing causes the system to go down in flames.

Interestingly, tracepoints have dispensed with much of the mechanism developed to minimize the runtime impact of kernel markers; in particular, they do not use the "immediate values" code. Profiling has shown that the performance impact of tracepoints is so low that there is little value in the added complexity of runtime patching of kernel code. Still, there are signs that some kernel developers will object to the addition of tracepoints in their current form. Developers want tracing support - but not at the cost of slower performance, even if that cost is hard to measure.
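The declare, insert, and register steps described above can be modelled outside the kernel. What follows is a minimal user-space sketch in Python, not kernel code: the names Tracepoint, RunQueue, Task, record_wakeup and wake_up are hypothetical stand-ins invented for illustration, and the prototype check happens when a probe is registered rather than at compile time as it would with the real macros.

```python
import inspect
from dataclasses import dataclass


@dataclass
class RunQueue:
    """Hypothetical stand-in for the scheduler's run queue."""
    cpu: int


@dataclass
class Task:
    """Hypothetical stand-in for the process being woken up."""
    pid: int


class Tracepoint:
    """Toy model of a static tracepoint with a declared probe prototype."""

    def __init__(self, name, *arg_types):
        self.name = name
        self.arg_types = list(arg_types)
        self.probes = []

    def register(self, probe):
        # Compare the probe's annotated parameter types with the declaration.
        # In the kernel proposal this comparison is left to the C compiler.
        params = inspect.signature(probe).parameters.values()
        probe_types = [p.annotation for p in params]
        if probe_types != self.arg_types:
            raise TypeError(f"probe does not match prototype of {self.name}")
        self.probes.append(probe)

    def unregister(self, probe):
        self.probes.remove(probe)

    def __call__(self, *args):
        # With no probes attached, the loop body never runs.
        for probe in self.probes:
            probe(*args)


# Declaration (assumed names), mirroring the trace_ prefix convention.
trace_sched_wakeup = Tracepoint("sched_wakeup", RunQueue, Task)

wakeups = []


def record_wakeup(rq: RunQueue, p: Task) -> None:
    """Example probe: remember which CPU woke which pid."""
    wakeups.append((rq.cpu, p.pid))


def wake_up(rq, p):
    """Hypothetical instrumented code path containing the tracepoint."""
    trace_sched_wakeup(rq, p)
```

In this model, calling wake_up() before any probe is registered does nothing. After trace_sched_wakeup.register(record_wakeup), each call appends an entry to wakeups, and a probe annotated with different parameter types is rejected with a TypeError instead of being called with arguments it would misinterpret.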
Tracehook Finally, Roland McGrath recently surfaced with the tracehook patch set. Tracehook has a rather different focus; it is, essentially, a cleanup of the way the kernel handles the ptrace() system call. The tracehook patches try to organize all of the process tracing code (much of which is architecture-dependent) into one place where it can be dealt with as a unit. Tracehook is meant to be a first step toward the merging of a new version of the utrace code. Utrace has long been planned as the successor to the current ptrace() implementation, which has few admirers. But utrace has encountered a number of difficulties, so its path into the kernel has been slow. It disappeared from the lists entirely for a while, but a new version of the patches is said to be coming soon; Roland notes that he expects "some vigorous feedback" when that happens. The real importance of the ptrace() rework is that it is the path toward integrated tracing of kernel- and user-space events. And that, of course, is one of the biggest features offered by DTrace which is not yet available in SystemTap. Getting user-space tracing into the kernel - especially if it could work with the tracepoints already being inserted into some applications for DTrace - would be a major step forward for Linux. A lot of people will be watching when this patch set comes around again. Meanwhile, Roland would like to see the tracehook code merged for 2.6.27. He is late to the party, though, and this code has not done any time in linux-next. So it is not yet clear whether tracehook will go in before the merge window closes, or whether, instead, it will have to wait for 2.6.28. In summary... As can be seen, there is a lot happening in the area of tracing support for Linux. Tracing, it seems, is an idea whose time has come, at last. If the pieces described here can be merged and integrated into a unified framework, and if it can all be made sufficiently easy to use, the time for "DTrace envy" will come to an end. 
Those "ifs" are not small ones, though. There is quite a bit of work to be done yet; hopefully the current level of energy will remain until the job is done.

Anticipating the sunset

In his two years at the top of Sun Microsystems, Jonathan Schwartz has embraced a number of ambitious changes. While one need not look too far to find complaints about how Sun works with the free software community, there can be no doubt that Mr. Schwartz has made the company far more open than it was in the past. Free software is an important part of Sun's overall strategy; this can be seen in the company's claims to have contributed more code to the community than any other source.

Unfortunately, Mr. Schwartz's time at Sun has been accompanied by a 50% decline in Sun's stock price. Whether he could possibly have done any better given the state of the company when he took over and the state of the economy now is something one could debate, but we'll not do that here. More interesting, from the community's point of view, are the rumors that he could soon be looking for a new job.

It has often been said that if corporations were people, they would have the personality of a sociopathic teenager. Certainly companies can exhibit no end of the sort of moody, capricious, and even self-destructive behavior sometimes seen in adolescents - then they come back and ask for more money. An abrupt change at Sun could well bring in a CEO determined to show that his predecessor's policies were fundamentally wrong and were primarily responsible for Sun's problems. And that could bring some interesting changes.

Imagine a Sun which decided that it could no longer afford to share its Valuable Intellectual Property with the world. Perhaps Solaris, OpenOffice, Java, etc. would be relicensed under the new, Sun Proprietary Overtly Indecent License (SPOIL), with no more free releases. Hungry lawyers could start prowling for cases where Solaris code has been mixed into projects with incompatible licenses.
StarOffice might go OOXML-only. MySQL could shift to a new, undocumented on-disk format with users' data subject to Sun-controlled DRM on every table. The new Java license would forbid the publication of not just benchmark results, but also of criticism of features of the language. Clearly, some of these scenarios are rather far afield - though they are fun to make up. But, if we have learned anything from the SCO story, it must be that a company which presents itself as a solid part of the community can, in short order, turn around and go against us. Even if Sun does not degenerate to the point of starting legal attacks against free software, it could certainly put an end to the many contributions that it is making now. Whenever one deals in company-owned free software, one should consider what happens if that company goes away. Projects with distributed copyright ownership are mostly immune to this kind of problem; there is no single company which could create huge problems for the Linux kernel by withdrawing its participation, for example. (Along these lines, it's worth noting that Evolution recently stopped requiring copyright assignments from its developers). But, in situations where a single company owns the copyrights and dominates development, a change of heart could make a real difference to downstream users. It all depends on what sort of community has developed around the code. If future versions of Solaris were to be proprietary-only, the current releases would still be out there. But the Solaris development community outside of Sun is tiny, so chances are good that such a move would kill OpenSolaris as a free software project - to the extent that it is one now. Anybody wishing to continue to use Solaris would probably have to move to the proprietary version. OpenOffice.org would likely survive, though the external development community - never encouraged that much by Sun - would have to organize itself and, perhaps, choose a new name. 
Java is entirely subject to Sun's policies regarding conformance tests and such; it could easily revert to its status from a few years ago. And so on. The point is that a change of heart at Sun could easily make us appreciate the company's relatively friendly attitude now, and could create difficulties for distributors and users of Sun-sponsored projects. There are plenty of other single-owner projects out there, of course. Many of them are entirely dependent on the continued good will (and viability) of their sponsoring companies. Others are less so. Copyrights on code released by the GNU project are generally owned by the Free Software Foundation. But, if Richard Stallman were to hit his head in an unfortunate contra dancing accident and decide that, henceforth, FSF-owned code would only be released under the binary-only GPLv4, those projects would not suffer much. Instead, the development community behind that code - strongly influenced but not controlled by the FSF - would quickly move to a new home and continue its work. For a practical example, see the creation of X.org in the wake of the relicensing of XFree86. With any luck at all, the silly scenarios outlined above will not come to pass. But there is value in pondering how things could go. Such thought quickly leads to the conclusion that a vibrant development community is not just good because it leads to faster progress and more cool features. That community is the source for the long-term support for the code, support which is not subject to one company's quarterly results. Notes from the Fedora project The Fedora folks have a lot of important problems on their mind. As part of that, there is currently a tense election underway - to choose the codename for the Fedora 10 release. There's a list of nine suitably silly, Red-Hat-legal-approved names to choose from. Your editor, fresh from another failed Rawhide update, suggests voting for "terror." Even though Rawhide hasn't been that terrible recently. 
Another election - this one for the membership of the Fedora Engineering Steering Committee (FESCO), just finished. FESCO members this time around will be Bill Nottingham, Kevin Fenzi, Dennis Gilmore, Brian Pepple, David Woodhouse, Jarod Wilson, Josh Boyer, Jon Stanley and Karsten Hopp. For the curious, the FESCO mission is: FESCo handles the process of accepting new features, the acceptance of new packaging sponsors, Special Interest Groups (SIGs) and SIG Oversight, the packaging process, handling and enforcement of maintainer issues and other technical matters related to the distribution and its construction. The new feature aspect of the job could be interesting in the near future; there has been some clear confusion on what constitutes a new feature, as compared to a mere "enhancement" which does not involve FESCO. The surprising (to some) replacement of RPM in Rawhide was one of those ambiguous issues which brought this question to the fore. There is now an enhanced draft feature policy up for review which, it is hoped, will clarify the situation. Back in June, the results from the Fedora board election raised some concerns about the process. One reaction to these concerns can now be seen in this proposal for term limits for board members. The reasoning behind this proposal is explained thusly by project leader Paul Frields: The problem at hand was the perceived dominance by full-time Fedora people on the Board. People who spend their entire $DAYJOB as well as their spare time on Fedora are automatically very involved and visible. That can translate directly to votes on the basis of name recognition, which really disadvantages people who are very involved, but in a somewhat more limited fashion because they don't have the luxury of doing Fedora all day every day. So the full-time Fedora folks are simply too prominent, to the point that they need to be eased off the stage after a couple of terms on the board to make room for everybody else. 
Of course, there's a couple of exceptions. The Fedora project leader, not being an elected member of the board, has no such limits. More to the point, though: term limits would not apply to those board members appointed by Red Hat. The reasoning here is: Extending these term limits to Red Hat appointed seats is not sensible for a number of reasons -- institutional knowledge, flexibility, etc. As of this writing, there has not been a whole lot of discussion of the term limit proposal; opinions which have been posted are not entirely positive. Fedora project members will want to consider whether this proposal can achieve its stated goal. It would be unfortunate if an up-and-coming outsider - with associated institutional memory - got term-limited off the board just as they were really hitting their stride. Finally, OLPC enthusiasts may want to have a look at the newly-formed OLPC special interest group. This group is working to make the Fedora distribution (already shipped by OLPC) as well suited to that platform as possible. One of the results should include a special Sugar "spin" of Fedora. There is a mailing list available for interested people to join. Interview: Kristen Carlson Accardi Kristen Carlson Accardi is a Linux kernel developer for Intel's Open Source Technology Group. She is the maintainer for the PCIE hot-plug driver, the SHPC hot-plug driver, and the PCI hot-plug subsystem in the Linux kernel. She is currently working on SATA drivers, including implementing power management features. Kristen is the benevolent dictator for the upcoming Linux Plumbers Conference. We interviewed her about LPC, why so many Linux developers live near Portland, Oregon, and life as a kernel developer. What is Linux Plumbers Conf? And why the "Plumbers" part? Linux Plumbers Conference is a conference for developers working on the low level programming of Linux, including kernel, libraries, and system applications such as udev, hal, and dbus. 
We came up with the name "Plumbers" because we wanted to represent these areas as basic system infrastructure which has many connections. Plus these programs are sort of the nasty, grimy, unglamorous underbelly of the system - not unlike the pipes in your house. Essential - but nobody wants to know they are there and everyone takes them for granted until they don't work.

Running a conference is a lot of work in addition to your full time job as a Linux kernel developer. What made you decide to start Linux Plumbers Conf?

Actually, it was the idea of a group of people. The Portland Linux kernel community gets together once a month or so to socialize and drink beer. At one of these gatherings we had a conversation about how difficult it was to solve big picture problems that cross multiple project boundaries. We felt that there are some cases where you really need to be able to just get everyone in a room and be able to hash things out in person, but there wasn't really a forum for this. Existing conferences were either too narrow (like Kernel Summit or the X developers summit) or too broad for our purposes. Then someone said something like "Hey, why don't we just make our own conference". Because we are nothing more than a group of developers with a shared love of beer, we went to the Linux Foundation and asked them to collaborate with us, and it's been a wonderful partnership. It's definitely been a challenge for a bunch of software engineers to try and organize a conference, but we've leaned heavily on LF for advice and we've learned a lot in the past year.

Most conferences are centered around talks in which speakers present their work, but open source developers often skip the talks so they can discuss ongoing projects face-to-face. How is LPC balancing these needs?

Our format for the conference is based on the idea that we would have a bunch of "microconferences".
Each microconf is meant to represent a topic that should be small enough to be able to adequately discuss in a few hours, and should preferably span multiple project areas. Each microconf is being organized by a single expert in the area who dictates the content of the microconf. The microconf runner may decide to have a couple talks and an hour or so for discussion, or they may decide to split the group into teams and solve some specific problems. We are leaving this up to the microconf runner to decide, although we are recommending that talks be not more than 25 minutes in length so that there is ample time for discussion and questions. We also have a general track for presentations that do not fall under our predefined MC topics. In addition to the rooms for the microconfs, we have several rooms that are going to be available for "unconference" style talks. People wishing to get together in smaller groups will be able to reserve a room at the beginning of the conference. Our larger rooms will also be available in the afternoon for working sessions. For several years, developers have been organizing individual summits and workshops for particular projects, like networking and file systems. LPC microconfs are similar, but they're held all in the same location and time. Why did you want to put the microconfs together into one conference? We did this to encourage cross project communication. Individual summits are great for solving narrow problems, but they tend to compartmentalize developers from each other. Who is organizing and sponsoring LPC? LPC is organized by a group of volunteers from the Portland Linux development community and is underwritten by the Linux Foundation. We are a group of developers who just wanted to attend a conference which didn't happen to exist yet, so we made our own. 
Because we are all volunteers, we have very little overhead for this conference, and the money our sponsors have given us is being used directly on making the conference as productive and memorable as we can make it, with hopefully a little left over to start over again next year. Our Platinum level sponsors are Intel and IBM, with NetApp sponsoring at the Gold level, and HP, MontaVista, and Google at the Silver. In addition the Linux Foundation and Portland State University have given us so much more than money - they have been true collaborators and we are so grateful for all their time and effort. Were there any sponsorships you didn't accept? Not that I can recall - we actually started fund raising a little late and missed a lot of people's planning cycles. We were extremely lucky that there were so many great sponsors like Intel, IBM, NetApp, HP and Google that believed our conference was valuable enough to find the money in their budget despite the short notice. How did you decide on the location of LPC? Portland State University was always our first choice for LPC. We wanted a non-corporate, friendly environment that was downtown. It was very important to us as well to have a "green" conference - hey, we are Oregonians! We wanted a place where there were plenty of hotels and restaurants within walking distance so that people would not have to rent a car. In addition, we didn't want the more traditional convention center or hotel atmosphere, nor could we afford it. Tell us more about LPC as a green conference. As frequent conference-goers, we are all a little dismayed by the waste generated from conferences. Disposable drinking cups and bottled water, flyers and schwag that immediately hits the garbage bin when you get back to your hotel, and driving around from event to hotel and back again are just some of the things that we decided we'd like to not have at our conference. As such, we are not distributing printed material at the conference. 
We're also limiting our schwag to only things we've deemed useful, and we are working with our caterers to reduce paper waste and provide foods from local, sustainable sources where possible. How did you get started in Linux kernel development? I started using Linux in college back in 1994 or 1995 - I wanted to be able to work on my homework at home rather than in the lab, and all we had in those days was a horrendously slow modem connection to the school. For years afterward, all I wanted to do for a living was to work on Linux, but it wasn't until around 1999 that I got my first chance to write some drivers for Linux while working in Intel's networking division. I had previously written device drivers for Netware - a job I'd gotten right out of college. After working on out-of-tree drivers for embedded systems and research projects for many years, I finally joined Intel's Open Source Technology Center in 2005 and was able to start contributing upstream in a meaningful way. Portland is home to many top Linux developers, including Linus Torvalds. Why do you think Portland is so attractive to open source developers? Honestly - I have no idea. People ask this question all the time, and all we can do is speculate. I know why a lot of us live here - it's a great city to live in. At some point you get enough critical mass of developers that you start attracting others. It could be any number of things. Maybe because it's easier to thumb our noses at Redmond from here? In your opinion, what are some of the most important technical trends in Linux kernel development today? Low power features in hardware are driving a lot of kernel development these days. Tell us about some of the places you've traveled for your job. When you work in open source, you have to travel to meet your "co-workers". I've had a chance to go to OLS a few times, Sydney for LCA a couple years ago, and Cambridge last year for Kernel Summit and LinuxConfEU. 
Recently I traveled to FISL in Porto Alegre, Brazil. I've also been to Ireland for Skycon - a fun and interesting conference. I'm actually looking forward to not having to travel to attend LPC. Thanks, Kristen, for taking the time to answer our questions. Linux-next meets the merge window Recent LWN articles on the linux-next tree have noted that, while this tree has been working well in its role of identifying merge conflicts between subsystem trees, it has not yet been through a full kernel development cycle. 2.6.27 will be the first kernel release where linux-next was in existence for the entire preceding cycle; in theory, everything which goes into 2.6.27 should have been aged in linux-next first. As the end of the 2.6.27 merge window nears, a look at how linux-next has affected the process seems warranted. One might think that linux-next maintainer Stephen Rothwell would be able to take a break during the merge window; it should mostly be a matter of watching the linux-next tree drain into the mainline. As it happens, the daily linux-next postings suggest a fair amount of scrambling to deal with merge conflicts, build failures, and more. There are a number of reasons for this, one of which is that subsystem trees are merged into the mainline in an order which is completely unrelated to their order in linux-next. Patches which remain in linux-next are being applied to a highly unstable base. Another interesting phenomenon has been a fair number of patches appearing in linux-next during the merge window. Some of these are actually patches intended for 2.6.28; once maintainers have dumped their 2.6.27 patches into the mainline, they are starting to acquire stuff for the next time around. Stephen has asked them not to do that, requesting that 2.6.28 material not be directed toward linux-next until after the 2.6.27-rc1 release. The goal is that linux-next should be nearly empty when 2.6.27-rc1 comes out. 
Other patches, though, are intended for 2.6.27 but simply have not done their time in the linux-next tree. That has led to a certain amount of developer grumpiness at times. It is interesting to note, though, that one of the biggest examples of linux-next avoidance - David Miller's merging of the multiqueue networking code which he had finished writing hours before - has generated relatively few complaints. But various other types of conflicts have generated a steady stream of terse notes from Andrew Morton (who is in the unfortunate position of basing his work on top of linux-next) on how new stuff should have been in linux-next weeks ago. Another area of, say, colorful conversation has been around the TTY subsystem, currently being subjected to a much-needed thrashing by Alan Cox. Some developers have been unhappy with Alan for merging code which failed to compile, even though those problems had already been identified in linux-next. Alan, instead, has become irritated with other developers who have surprised him with TTY-layer changes of their own, causing Alan's patches not to apply. Alan has some quaint notions about actually testing his patches, so the resolution of this kind of conflict requires the running of a new set of regression tests and such; after this had happened a few times in a row, he started getting a little short-tempered. These issues would appear to have been worked out at this point, but the idea behind linux-next was to keep them from happening in the first place. Yet another source of occasional merge issues is the rebasing of trees. Rebasing, in git-speak, is the process of modifying the commit history in a repository to cause a series of patches to look like they were written against a later version of the code than they really were. Rebasing can be a useful technique; it generates a series of patches which applies cleanly to the current state of the tree without generating a bunch of unsightly merge commits. 
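The effect of a rebase on commit identity can be sketched with a toy model. This is a minimal illustration, not git's actual object format: it assumes only that a commit's name is a hash covering both its content and the name of its parent, which is the property that matters here. The function names (commit_id, build_history, rebase) and the version strings are hypothetical.

```python
import hashlib


def commit_id(parent, patch):
    """Name a commit by hashing its parent's name together with its content.

    Toy model only: real git hashes a structured commit object, but the
    dependence on the parent is the part being illustrated.
    """
    data = (parent + "\n" + patch).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build_history(base, patches):
    """Apply patches in order on top of base and return the commit names."""
    names = []
    parent = base
    for patch in patches:
        parent = commit_id(parent, patch)
        names.append(parent)
    return names


def rebase(patches, new_base):
    """Replay the same patches on a newer base."""
    return build_history(new_base, patches)


if __name__ == "__main__":
    patches = ["fix tty locking", "add new driver"]
    old = build_history("v2.6.26", patches)
    new = rebase(patches, "v2.6.27-rc1")
    for before, after in zip(old, new):
        print(before[:12], "->", after[:12])
```

In this model the patches themselves are unchanged, yet every commit in the replayed series gets a different name, because each name depends on everything beneath it.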
Rebasing can be especially useful in the context of linux-next. If testing turns up a patch which breaks the build, simply committing a fix will leave a period in the history where the kernel cannot be built, and that is bad for people running bisections. With the use of git's history editing features, the offending patch can be fixed in place and all evidence of the mistake disappears. In essence, that embarrassing commit mentioning the Eurasian campaign can be fixed up to properly note that we've always been at war with Eastasia. But rebasing a repository changes the history (by design), creating, in the process, an entirely new set of commits. Those commits are new code, to the point that any results from testing the older version may no longer apply. The commits also have new names, so any other developer who was using a version of the repository will be shaken off and unable to merge. Issues related to rebasing have come up a couple of times during the merge window, leading Linus to post a series of lectures on the problems that rebasing can cause. It is clearly a tool which must be used with restraint, but occasional use of rebasing can, in the linux-next context, lead to a better final merge. Finding the right balance is something each developer will have to learn. In the end, the merge window remains a bit of an unruly time. The process of channeling the work of several hundred developers into the mainline over a two-week period is unlikely to ever be an entirely smooth experience. But, for all its glitches, the 2.6.27 merge window has been (so far!) easier than 2.6.26. The presence of the linux-next tree almost certainly has something to do with that. This tree's role continues to evolve, but its benefits are starting to be felt. The Elisa Media Center project Elisa Media Center is a cross-platform (Windows Vista, XP, and Linux, eventually Mac) media management project that is sponsored by Fluendo. 
The company is also known for its sponsorship of the GStreamer multimedia framework. The Elisa project's home page explains: Elisa is an open source cross-platform Media Center featuring an intuitive interface with a professional look and feel which can be easily used with a standard TV remote control. Elisa is designed to be easily extensible through plugins. It relies on Python and Twisted as core technologies. Elisa can manage movies, photographs, and music. It can work with media from locally connected peripherals, other machines on the LAN and the Internet. The software includes support for IR remotes and touchscreens. Elisa uses a modular design with support for plugins which give the system access to various media sites and other information. A fairly out of date feature list explains the capabilities in more detail. A good way to see the capabilities of the software is to take a look at the flashy demo video and screenshots. Following on the heels of the recently announced version 0.5.1 (the initial public 0.5 series release), version 0.5.2, entitled "Good news everyone", was announced this week: The main outlines of this release are: - The integration of a media scanner that indexes one's music collection and allows one to browse it by Artists/Albums, with automatic albums' covers and artists' photos retrieval; - The localization of the UI. Thanks to contributions from the community Elisa is currently fully translated in Spanish, Catalan, French, Italian, German, Dutch, Polish, Swedish and Brazilian Portuguese. The Elisa source code is available for download; packaged versions for Ubuntu and Debian should appear soon. GNOME 3.0 worries The mood on some GNOME mailing lists in the weeks prior to the recently-concluded GUADEC conference was somewhat somber; some members of the community were clearly feeling that GNOME development had slowed down, that the project lacked vision, and that GNOME was threatening to lose its relevance with users. 
GNOME subsequently emerged from GUADEC with a new executive director, plans for a 3.0 release, and a new burst of enthusiasm. It's amazing what a week in an exotic city with large amounts of beer can achieve. Since then, however, the enthusiasm has dropped a bit, and work on a proposed 3.0 press release appears to have stalled. GNOME is now faced with some big decisions, and it's not clear what the project will do. The initial driving force behind this effort appears to be a plan by the developers of the GTK+ toolkit to move to a new ABI without concerning themselves with backward compatibility. Years of enforced ABI stability have left GTK+ with a large pile of compatibility cruft which the developers would like to leave behind; in addition, there are major changes planned which would be hard to do in a backward-compatible mode. So the GTK+ developers would like to start over with a 3.0 release. Lots of planning is being done to make the transition easy; among other things, care will be taken to ensure that GTK+ 3.0 will coexist nicely with older installations. But, in the end, it's an incompatible ABI change. At this point, the loudest objections seem to come from Miguel de Icaza. He fears that a new version of GTK+ will leave independent system vendors behind and, perhaps, lead to a series of ABI-breakage events. In particular, Miguel takes issue with the plan to make the ABI changes for the GTK+ 3.0 release, and only add the new features (which, like much of the GNOME 3.0 plan, are somewhat fuzzy at the moment) later. The needed new features, he says, should be driving the whole process. And, if at all possible, those features should be added in a way which does not require an ABI flag day. It would appear that the GTK+ developers are determined to make this change, though, so expect it to go forward. 
But a GTK+ change is not the same as a GNOME change; there is no particular need for GNOME to make a major release just because an important library it uses has done so. Anybody who has looked at the linkage of a GNOME application knows that GNOME uses a lot of libraries; they cannot all drive major GNOME releases. So, one might ask, what is happening with GNOME in particular that warrants a 3.0 release? This question was, arguably, most eloquently asked by Luis Villa, who has described GNOME 3.0 as "a terrible idea." Luis's point is that an ABI change is not enough to motivate a major release; instead, there must be a fundamental vision of a better way to do things. That vision, he says, is not there now. This is not an unprecedented situation in the GNOME community: 2.0 almost failed for this exact reason- before there was a clear vision about doing usability/simplicity-centered design, the new version number was a huge invitation to insert $VISION here, leading to all kinds of crack. A 3.0 process without a clearly-articulated vision will invite the same sort of "crack." It will also throw away the rare public relations opportunity that comes with a major update: Finally, from a media perspective: the reason GNOME 2.0 was a success in the Linux media, and the reason KDE 4.0 has been a failure, is that GNOME 2.0 had a clear, persuasive story around it: simplification and usability. No one in the media cared that we had a new toolkit, except where it had specific features (mainly i18n) that had user benefits. Writers ate up our usability story- they could tell their readers the story we put out there, and it made sense to them. KDE 4 has no coherent user-focused story, so this incredible opportunity to reach out to the press has been squandered. There are, certainly, interesting ideas to be found in the GNOME community. The online desktop ideas, Document-centric GNOME, and the mobile initiatives are examples. 
But it is true that nobody has, yet, put together a concept of GNOME 3.0 which is broad enough to unify and direct all that work while simultaneously being concise enough to fit onto a bumper sticker. Chances are good that most GNOME developers do not know what GNOME 3.0 really means; those outside of the development community will have even less of a clue. The KDE 4.0 experience should be on the GNOME project's collective mind as it ponders a possible 3.0 release. Future KDE users may see KDE 4.0 as the turning point where their desktop started becoming truly great, but, for now, it does not look like a whole lot of fun for the KDE development community. GNOME developers, one assumes, would prefer not to have a similar experience. GNOME 2.x has been around for some time; it may well be true that it is time to make a big jump. It would be gratifying to see some new energy and directions from the highly creative GNOME development community. If the project can come up with a set of overall goals which can inspire that community toward a set of common ends, GNOME 3.0 could be a spectacular success. But those goals, if they exist, have not been communicated to the community yet, and that is making some GNOME developers nervous. 2.6.27 - the rest of the story The 2.6.27 merge window closed with the 2.6.27-rc1 release on July 28. Some 8100 changesets were merged this time around, making 2.6.27 another busy development cycle. 
A number of interesting things went in since last week's update; the most significant changes visible to Linux users include: There are new drivers for ILI9320 LCD controller chips, Cobalt server LCD frame buffers, SH7760/SH7763 integrated LCD controllers, NXP pca9532 LED controllers, Philips PCA955x I2C LED controllers, WMI-based hotkeys on HP laptops, Maxim MAX73xx I2C port expanders, Micronas DRX3975D/DRX3977D DVB-T demodulators, DvbWorld 2102 DVB-S USB2.0 receivers, MaxLinear MxL5007T silicon tuners, Renesas SH7763 evaluation boards, Renesas Solutions AP-325RXA boards, Renesas R0P7785LC0011RL boards, and Atmel integrated touchscreens. Also added is "mISDN," a new, modular ISDN driver intended to replace older code for a number of ISDN cards. Support for using mISDN drivers remotely via an IP tunnel has been added. The Palm T|X handheld computer is now supported. The tmpfs filesystem has gained support for asynchronous I/O. The hugetlbfs mechanism can now support multiple huge page sizes. There is a new directory (/sys/kernel/hugepages) with information on huge page allocations. The x86 (64-bit) architecture now supports 1GB pages; PowerPC can go to 16GB. Most system calls which create file descriptors can now accept a set of flags; this change allows the race-free establishment of close-on-exec semantics, requesting non-blocking opens, and more. Developers wanting to use this capability will have to wait for a version of glibc which adds the requisite interfaces. The unmaintained v850 architecture has been removed. The kexec jump patch set, which uses the kexec mechanism as an alternative way of implementing suspend-to-disk, has been merged. The omfs filesystem has been merged. /proc now has a file (called syscall) for each process; when read, it displays the process's current system call and the supplied arguments. 
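The flag-accepting descriptor-creation calls mentioned above can be tried from user space once the C library exposes them. The following is a minimal sketch using Python's os.pipe2() wrapper, on the assumption that pipe2() is among the calls covered by this change; the helper names (open_pipe_with_flags, fd_flags) are hypothetical. Both flags are applied by the kernel when the descriptors are created, so there is no window in which another thread could fork and exec with the descriptors still inheritable.

```python
import fcntl
import os


def open_pipe_with_flags():
    """Create a pipe that is close-on-exec and non-blocking from the start."""
    return os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)


def fd_flags(fd):
    """Report (close_on_exec, non_blocking) for a file descriptor."""
    cloexec = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
    nonblock = bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)
    return cloexec, nonblock


if __name__ == "__main__":
    read_end, write_end = open_pipe_with_flags()
    print("read end :", fd_flags(read_end))
    print("write end:", fd_flags(write_end))
    os.close(read_end)
    os.close(write_end)
```

The older alternative is to create the descriptor first and set the flag with a separate fcntl() call afterward; the gap between those two steps is the race that the new interfaces close.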
Linux users hoping to upgrade their systems in the near future will be glad to know that a series of patches designed to make the kernel scale to 4096 processors has been merged. Changes visible to kernel developers include: The tracehook mechanism for defining static trace points (described in this article) has been merged, along with a number of trace points in the core kernel. A new, lockless form of get_user_pages() has been added: Details of this interface can be found in this article, with the one note that early versions were called fast_gup() instead. (See also the related lockless page cache work, which was also merged). The long-debated mmu-notifiers patch has been merged. The notifiers allow external memory management units (as may be seen in some graphics cards or in virtualized guests) to be told about decisions made by the core memory management code. There is a new framework for debugging boot-time memory initialization; there's also "a few basic defensive measures" intended to prevent difficult-to-debug boot problems. The new function: returns a true value if the pointed-to object is on the current kernel stack. There is a new macro for issuing warnings: It's much like WARN_ON() in that it will produce a full oops listing; the difference is the added printk()-style format string and arguments. A new helper function: waits for the specific workqueue job work to finish executing. dma_mapping_error() and pci_dma_mapping_error() have new prototypes: In each case, they have gained a new argument specifying which device the mapping is being done for. There are a couple of new radix tree functions: They are useful for looking up multiple items in a single call. Slab cache constructors no longer have a pointer to the cache itself as an argument; they now take a single void * pointer to the object itself. 
The long list of Video4Linux2 ioctl() callbacks has been moved into its own structure (struct v4l2_ioctl_ops) which is pointed to by the ioctl_ops member of struct video_device. Now begins the long task of finding and fixing all the bugs in all this new code. If the usual pattern holds, that process will take about two months, suggesting that we can expect 2.6.27 sometime in October. Harald Welte on his new role with VIA Hiring a well-known free software advocate to oversee efforts to work with the community is a good plan for any company, but for a company that has had rocky community relations, it may be essential. VIA Technologies has done just that, by contracting with Harald Welte to help guide its strategy to work more closely - and less contentiously - with the community. VIA announced a new effort aimed at cooperation with the free software world last April, but got off to a slow start that had people wondering about its commitment to fulfilling that promise. Welte will be well placed to ensure that community concerns are heard within VIA. Highly visible in the community for his work on things like netfilter/iptables and, more recently, the Openmoko phone, Welte has the skills to provide VIA with excellent advice. He has also won several awards for his work on GPL enforcement as founder and driving force behind the gpl-violations.org project. We caught up with Welte at this year's Ottawa Linux Symposium to discuss his new role. Because of his work on Openmoko, Welte had been traveling frequently to Taiwan, making a number of industry contacts amongst the companies located in Taiwan. About nine months ago, he was "invited to talk to VIA and give them some feedback from the community". The company, he says, knew from the beginning it needed community input, but how to get that was not decided until late May or early June, when they asked Welte to provide it on a regular basis. 
The push from within VIA came from management, specifically product management, which is somewhat surprising - in the US and Europe, at least, it is typically engineering that pushes for better community relations. "It's a really big opportunity for me being a representative of the community to talk to a company at this high of a level. That's what makes me very optimistic." VIA primarily needs to get drivers and other software for their graphics hardware cleaned up and submitted upstream. It is not just the X.org drivers for 2D and 3D graphics that need to be mainlined; there are also DRM and DRI patches that are maintained out-of-tree. He wants to see kernel patches get moved upstream to kernel.org, while X patches get merged into X.org code. A free 2D driver supporting most VIA chips, old and new, will be available soon. Welte sees his role as "focusing more on the open source strategy inside VIA". That includes improving the skills of VIA's R&D group so that they produce drivers that are mainline quality. Various kinds of problems exist in the drivers: the coding style may not meet the kernel requirements, or they may not use the proper APIs. Currently, drivers exist for new products that are supposed to ship with mainline drivers available; Welte will help ensure that happens. "I perceive myself as community person rather than a VIA person." He points to Intel as a "shining star" example of supporting free and open source software, though "sometimes they might focus a bit too much on drivers than on open documentation," especially for wireless hardware. One of the areas that VIA is working on is open documentation for its hardware, but Welte isn't sure when it will be released - though some 800 pages were released this week. 
Schedules are largely out of his control, as they are subject to a wide variety of variables within VIA. His role with VIA is a chance to "really make a silicon manufacturer understand how the open source community works and what the benefits are to working with it". He will be traveling back and forth from his home in Berlin quite a bit; "that's good, I love Taipei". He has also started to learn to speak Chinese. It seems like a great fit that, in some ways, Dave Jones predicted in his blog posting linked above: "I'm beginning to think the only way VIA will ever really 'get it together' is if they employed someone from the Linux community who actually understands how all this works, because it seems someone in Taiwan isn't getting the memos." Perhaps a little late, but it seems that VIA has gotten and understood the memos now. MARS and The Cell Broadband Architecture This article is based on a talk given by Geoff Levand at the Linux Symposium in Ottawa on July 24, 2008. The latest TOP500 Supercomputers list was released last month and the new front-runner is using a processor quite unlike what you would find in your laptop. The Cell Broadband Architecture (simply referred to as "Cell" in this article) was produced as a joint venture between IBM, Toshiba and Sony. The Cell is available in server hardware but is most commonly found in Sony's Playstation 3 gaming console. The Cell is interesting because of its unusual design and performance characteristics. The Cell is described as a heterogeneous multicore CPU. It has one Power Processing Element (PPE) which is a general purpose processor and up to 8 Synergistic Processing Elements (SPEs). An SPE is a high-performance vector processing unit with 256KiB of local memory and its own DMA unit. The PPE, SPEs and memory and I/O controllers are connected by a high speed bus. The PPE is quite slow compared to modern processors so the SPEs must be used to achieve good performance. 
This means writing software that takes the Cell's design into consideration because there is no simple way to optimize existing applications. Once an application has been designed to use the Cell's SPEs effectively it may run many times faster than when run on a traditional CPU. GCC with the Cell SDK can emit code for both the PPE and SPEs, including passing messages and managing overlays when the SPE code size exceeds 256KiB. The Linux kernel can also manage multitasking the SPEs with its scheduler. These conveniences make it easier to write code for the Cell processor, but they can have a significant impact on performance. Preemptive multitasking on an SPE involves swapping all the local memory of the current process with the local memory of the process to be run. This requires time and bus bandwidth for the processor. Ideally you would always have at least as many SPEs as processes you need to run so that your process would never be swapped out. The Multicore Application Runtime System (MARS) framework is a prototype of a cooperative multitasking system for the Cell that tries to address the performance overhead of running many processes on the Cell's available SPEs. MARS uses a library on the PPE and a very small kernel on the SPEs. MARS currently has a priority-based cooperative scheduler. This scheduler lets you specify how much context you need to save when your process is swapped out. In the "run complete" case no context needs to be saved allowing the next process to run much more quickly. Synchronizing of processes is commonly required between the Cell's SPEs and PPE. The only way to synchronize with the existing Cell SDK is to cause your SPE to busy-wait on a semaphore, but the MARS scheduler gives you the option of swapping out a process and doing other work instead. Cooperative multitasking does have its downsides. You lose protection between your processes, and one process could hang and require intervention to release the PPE. 
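The scheduling model described above can be illustrated with a small sketch. This is not the MARS API and does not run on SPEs; it is a toy priority-based cooperative scheduler built on Python generators, with hypothetical names (worker, run_cooperative), where each `yield` stands in for a point at which a task voluntarily gives up the processor.

```python
import heapq


def worker(name, steps):
    """Build a task factory; the task logs its progress and yields after each step."""
    def make(log):
        for i in range(steps):
            log.append(f"{name}:{i}")
            yield  # manual yield point: the task gives up control here
    return make


def run_cooperative(tasks):
    """Run (priority, factory) pairs to completion; lower number runs first.

    A task is never preempted. It keeps control until it yields or
    finishes, so a task that does neither would stall every other task.
    """
    log = []
    queue = []
    seq = 0  # tie-breaker giving round-robin order within one priority
    for priority, make in tasks:
        heapq.heappush(queue, (priority, seq, make(log)))
        seq += 1
    while queue:
        priority, _, task = heapq.heappop(queue)
        try:
            next(task)  # run the task up to its next yield point
        except StopIteration:
            continue  # ran to completion; no context needs to be kept
        heapq.heappush(queue, (priority, seq, task))
        seq += 1
    return log


if __name__ == "__main__":
    order = run_cooperative([
        (1, worker("a", 2)),
        (1, worker("b", 2)),
        (0, worker("urgent", 1)),
    ])
    print(order)
```

The only state kept for a suspended task is its generator object, a rough analogue of letting a task declare how much context must be saved; a task that finishes without yielding corresponds to the "run complete" case.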
It is also necessary to place manual yield points throughout your code or design each process to be short-lived. However, if your application needs to make the most of the Cell architecture, MARS is a promising starting point and addresses the need for a more efficient approach to scheduling. The lockless page cache One of the biggest problems in kernel development is dealing with concurrency. In a system where more than one thing can be happening at once, one must always take care to keep multiple threads of control from interfering with each other and corrupting the system as a whole. In the same way that two roads become more dangerous when they intersect, connecting two or more processors to the same memory greatly increases their potential for the creation of mayhem. Travelers to the US are often amused (or irritated) by the often-favored solution to roadway concurrency: putting in traffic lights. Such a light will indeed (if observed) eliminate the potential for a number of unpleasant race conditions within intersections, but at a performance cost: traffic going through the intersection must often stop and wait. This solution also scales poorly; as more roads (or lanes with different destinations) feed into the same intersection, each of them experiences more red-light time. In kernel programming, the first tool for controlling concurrency - locks in various forms - is directly analogous to traffic lights. It is not coincidental that the name for a common locking primitive (semaphore) matches the name for a traffic light (semaforo) in a number of Latin-derived languages. Locks enforce exclusive access to a kernel resource in the same way that a traffic light enforces exclusive access to an intersection, and with many of the same costs. When too many processors end up waiting at the same lock, the performance of the system as a whole can suffer significantly. There are two common approaches to mitigating scalability problems with locks. 
For many years after the 2.0 kernel came out, these problems were addressed through the creation of more locks, each controlling a smaller resource. Lock proliferation is effective, in that it reduces the chance that two processors will be trying to acquire the same lock at the same time. Since it works so well, this approach has led to the creation of thousands of locks in the Linux kernel. Proliferation has its limits, though. Adding locks increases complexity; in particular, with more locks, the chances of creating occasional deadlock situations increase. Deadlocks can be avoided through the careful observation of rules on the acquisition of locks, and the order in which they are acquired in particular. But nobody will ever be able to sort out - and document - the proper relative locking order for thousands of locks. So kernel developers must make do with rules for some of the most important locks and the vigilance of the lockdep tool to find any remaining problems. The other problem with lock proliferation is harder to get around, though. The acquisition of a lock requires writing a value to a location in shared memory. As each processor acquires a lock, it must change that value, which causes that processor to acquire exclusive access to the cache line holding the lock variable. The cache lines for heavily-used locks will fly around the system in a way that badly hurts performance, even if no processor ever has to wait for another to release the lock. Adding more locks will not fix this problem; instead, it will just create more bouncing cache lines and make things worse. So, as the number of processors grows, the path to continued scalability must not include the wholesale creation of new locks; indeed, it requires the removal of locks in the most performance-critical paths. 
And that is what this whole long-winded introduction leads up to: the 2.6.27 kernel will include some changes by Nick Piggin which implement lockless operation in some important parts of the virtual memory subsystem. And those, in turn, will lead to faster operation on multiprocessor systems. The first of these changes is a new function for obtaining direct access to user-space pages from the kernel. This function works much like get_user_pages(), but, in exchange for some limits on its operation, it is able to do its job without acquiring the mmap semaphore; that, in turn, can lead to a 10% performance boost on "a threaded database workload." The details of how this function works were covered here last March (though the function was called fast_gup() back then), so we'll not repeat that discussion here. The other big change is a set of patches which Nick has been carrying for quite some time: the lockless page cache. The page cache holds in-memory copies of pages from files on disk; its purpose is to improve performance by minimizing disk I/O. Looking up pages in the page cache is a common activity; it happens as a result of file I/O, page faults, and more. So it needs to be fast. In 2.6.26 kernels, each mapping (each connection between the page cache and a specific file in a filesystem somewhere) has its own lock. So processors will not normally contend for the locks unless they are operating on the same file. But locks for commonly-accessed files (shared libraries, for example) are likely to be frequently bounced between processors. Most page cache operations are lookups - read-only operations which make no changes. In the lookup operation, the lock protects a few aspects of the task, including:
1. A given page within the mapping must be looked up in the mapping's radix tree to find its location in memory (if any).
2. If the page is resident in the page cache, it must have its reference count increased so that it will not be evicted before the code performing the lookup has done whatever it needs to do.
The radix tree, itself, is a complicated data structure; it must be protected from modification while the lookup is being performed. For certain, performance-critical parts of the radix-tree code, that protection is done through (a) some rules on what can be called when, and (b) the use of read-copy-update (RCU). As a result, the radix tree lookup can be done in a lockless manner. There is still a problem, though: a given page may be evicted from the page cache (or simply moved) between steps 1 and 2 above. Should that happen, the second step will increment the reference count for a page which now belongs to a different mapping, and return an incorrect pointer. The kernel developers have, through lots of experience over many years, learned that system crashes resulting from data corruption are quite hard on throughput. So true scalability requires that this kind of scenario be avoided; thus the mapping lock, which prevents page cache changes from being made until the reference count has been properly updated. Nick made an interesting observation here: it actually doesn't matter if the wrong reference count gets incremented as long as one ensures that the specific page mapping is still valid afterward. The result is a new, low-level page cache function, page_cache_get_speculative(). If the given page has a reference count of zero, then the page has been removed from the page cache; in that case this function returns zero and the reference count will not be changed. If the reference count is non-zero, though, it will be increased and a non-zero value will be returned. Incrementing a page's reference count will prevent that page from being evicted or moved until the count goes back to zero. 
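The speculative-reference scheme can be illustrated with a toy model. What follows is a minimal, single-threaded Python sketch, not kernel code: the Page class, the dictionary standing in for the radix tree, the on_lookup test hook, and all function names other than the idea behind page_cache_get_speculative() are hypothetical. In the kernel the "take a reference unless the count is zero" step is a single atomic operation; here that atomicity is simply assumed. The try_freeze() helper models the compare-and-exchange that the reclaim side relies on.

```python
class Page:
    """Toy stand-in for a page cache page (hypothetical, for illustration only)."""

    def __init__(self, mapping, index):
        self.mapping = mapping  # which file mapping the page belongs to
        self.index = index      # offset of the page within that mapping
        self.refcount = 1       # the page cache itself holds one reference


def get_speculative(page):
    """Take a reference unless the count is zero.

    Assumption: the test and the increment happen as one atomic step,
    as they would in the kernel; this model is single-threaded.
    """
    if page.refcount == 0:
        return False
    page.refcount += 1
    return True


def put_page(page):
    """Drop a reference taken earlier."""
    page.refcount -= 1


def try_freeze(page, expected):
    """Reclaim side: move the count from `expected` to zero, or fail.

    Models a compare-and-exchange: it only succeeds if nobody else has
    taken a reference in the meantime.
    """
    if page.refcount != expected:
        return False
    page.refcount = 0
    return True


def find_get_page(tree, mapping, index, on_lookup=None):
    """Look up a page, pin it speculatively, then verify the pin.

    `tree` is a dict keyed by (mapping, index), standing in for the radix
    tree. `on_lookup` is a test-only hook that lets a caller simulate a
    change happening between the lookup and the pin.
    """
    key = (mapping, index)
    while True:
        page = tree.get(key)
        if page is None:
            return None           # not resident in the cache
        if on_lookup is not None:
            on_lookup(page)
        if not get_speculative(page):
            continue              # page is being removed; look again
        if tree.get(key) is page:
            return page           # still the right page, and now pinned
        put_page(page)            # pinned the wrong page; undo and retry
```

The point of the design is that a wrongly taken reference is harmless as long as it is detected and dropped, while a count of zero is a one-way signal that the page belongs to the reclaim code. Note that this sketch would spin forever if a frozen page were left in the tree; it assumes, as a simplification, that whoever freezes a page also removes it.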
So kernel code which has incremented a specific page's reference count will thereby ensure that the page stays in its current state. In the page cache case, the code can obtain a speculative reference to a page found in a mapping's radix tree. But it does not, yet, know whether it actually got a reference to the page it was looking for - something may have happened between the radix tree lookup and the obtaining of the reference. So it must check - after the reference has been acquired - to be sure that it has the right page. If not, it releases the reference and tries again. Eventually it will either pin down the right page or verify that the relevant part of the file is not resident in memory. Lockless operation forces a bit more care on the part of the page reclaim code, which is trying to get a page's reference count down to zero so that it can remove the page. Since there is no locking around the reference count now, the reclaim code must set it to zero while checking, in an atomic manner, that nobody else has incremented it. That is the purpose of the atomic_cmpxchg() function, which will only perform the operation if it does not collide with another processor. Since page_cache_get_speculative() will not increment the reference count if it is zero, the reclaim code knows that, by getting that count to zero, it now has exclusive control of the page. The end result of all this is that a set of locking operations has been removed from the core of the page cache, improving the scalability of that code. There is, of course, a cost, in the form of trickier code with a more complex set of rules which must be followed. Chances are that we will see more of this kind of code, though, as the number of processors in our systems increases. OLS: The state of Linux wireless networking Kernel wireless maintainer John Linville outlined the past, present, and future of the Linux wireless stack on the first day of this year's Ottawa Linux Symposium. 
In his presentation, he ranged from early efforts, which were "a sore spot for Linux", to the future where it is likely that Linux will have support for some features before "that other OS". Along the way, he looked at various issues that wireless support in Linux faces, including vendor participation, suspend and resume, and regulatory issues. Linville has been the maintainer of Linux wireless for two and a half years since being recruited into the job by David Miller and Jeff Garzik. When he took over, wireless support was in disarray, as there were competing stacks to support different hardware. Users were faced with lots of pain in getting things working when "they just want their hardware to work," said Linville. Since that time, things have greatly changed. The original wireless hardware was what is called "Full MAC hardware", where the implementation of the wireless protocols was handled by the hardware, generally in firmware. The drivers made these devices appear to be regular wired ethernet devices, though they did require some special configuration for SSID and the like. Because the hardware would enforce various regulatory requirements, vendors would generally work with the community in order to support the hardware. All of that changed with the advent of "Soft MAC hardware"—which Linville likened to winmodems—where the CPU implements most of the protocol. It is a cheaper solution for vendors, but it requires an 802.11 stack for the kernel. The ieee80211 drivers came along to support the Intel Centrino wireless hardware, but they only supported those few devices. Johannes Berg added the ieee80211softmac driver that added some additional hardware support, but it was a kludgy solution. Since then, Linville said, folks have realized that it was "sort of a mistake to go down that road". Enter the Devicescape stack. It was a feature-rich 802.11 stack for Linux that was popular with developers. 
After some locking and SMP problems were resolved, it was merged into 2.6.22 as the mac80211 driver. Once that happened, wireless drivers started using it, to the point where Linville showed a chart of the current drivers, almost all of which use mac80211. "It's been a boon to us to pick up the mac80211 code." One notable driver that does not support mac80211 is the libertas driver for the OLPC. Unlike most other current devices, it is a Full MAC device with special requirements. It has support for power saving modes that do not yet exist in mac80211. Because it is a mesh-networking device that still participates in forwarding network traffic when the system is powered down, it has needs that are not yet supported. Drivers in progress was the next topic Linville addressed. Several of these are in need of developers to work on them, specifically for the Airgo chipset and Atmel USB chipset. The TI chipset drivers have had some questions raised about the reverse engineering process and may require a legal vetting similar to what the SFLC did for ath5k. Marvell is sponsoring development of a mac80211 based driver for its hardware. This driver may also support 802.11n which allows for greater range and higher speeds than current-generation 802.11. Using data from LWN, Linville looked at the activity level of the wireless development in Linux. He was amazed to note "how much of the 2.6.26 kernel came through this laptop". Using his Signed-off-by as a proxy for wireless LAN commits, he noted 4.3-5.6% of the kernel commits in the last three releases (.24 through .26) were for wireless. In each kernel, wireless was either the fourth or fifth highest number of commits. The compat-wireless-2.6 project is aimed at supporting newer hardware in older kernels. Because folks are wary of running kernel.org kernels or their distribution supports an older kernel—but they want to run with the latest hardware—the project backports wireless drivers to kernels as old as 2.6.21. 
It is a set of scripts and patches that build against the user's kernel. Unfortunately, the project may not last much longer as the multiqueue changes that have been merged for 2.6.27 may change the drivers enough that they will be infeasible to backport. At the top of the list for new features is removal of the wireless extensions in favor of the new cfg80211 mechanism. According to Linville, "nobody likes wireless extensions, and nobody likes the existing tools". The wireless extensions have vague semantics, can have problems with race conditions, and because they are implemented by ioctl() calls, they encourage duplication of code in multiple drivers. cfg80211 will bring a much cleaner API along with fixing some existing bugs like the 31-character limit for SSIDs. Access point (AP) mode is another feature that is coming. Typically, APs use similar or identical hardware to that in wireless MACs. For Soft MAC hardware, all that is needed is support on the CPU side for AP mode, which is coming for mac80211. Mesh networking, which has been popularized by the OLPC project, is also coming to mac80211. Cozybit has provided an implementation which will allow Linux to have a feature unavailable for Windows. Areas that are needed, but are not yet being worked on, were next on Linville's agenda. Suspend and resume support is "flawed for mac80211 due to connection management issues". Because mac80211 is unaware of suspend and resume, drivers must work around it by de-registering and re-registering with it, which can be slow. Adding support for suspend and resume is on the list, as is supporting power saving modes. Linville went on to discuss three big issues that are largely outside of the control of the wireless hackers: firmware licensing, vendor participation, and regulatory concerns. Because drivers for Windows come with the firmware in the driver, many hardware vendors do not license the firmware blob separately. 
This means that it is unclear what can be done with those blobs. Certain vendors—Intel and Ralink were specifically called out—provide liberal licenses for their firmware. Users are encouraged to "vote with your dollars" by purchasing devices that either do not require firmware or that have a clear, free software friendly license. Another consideration when deciding which vendors to support is whether they are engaged with the community. For the most part, all vendors but Broadcom are working with the wireless hackers by providing documentation and/or source code. Some are even providing dedicated developers to work on Linux drivers—Intel was the first, but both Atheros (which just released a driver for its ath9k hardware) and Marvell have also begun doing that. Government regulations about what can and cannot be done in the unlicensed frequencies used by wireless are a concern that is frequently cited by vendors when refusing to work with the community. Unfortunately, their concerns are not completely without merit as hardware vendors are expected to ensure compliance with the regulations. "Non-compliance could be a huge loss" for those companies. As Linville points out, though, most vendors find a way to support Linux drivers. In answer to a question, Linville said that most WiMAX and 3G wireless devices are Full MAC designs, so there should be little or no regulatory concern, which, in turn, means that Linux support should not be much of a problem—at least until Soft MAC devices come along. Overall, Linux wireless has come a long way, but there is lots still to do. One gets the sense that the wireless team is up to the task. OLS: Shuttleworth on free software development In the third keynote given at this year's Ottawa Linux Symposium (OLS), Mark Shuttleworth spoke about "The Joy of Synchronicity". 
In his speech, he discussed his idea of synchronizing releases between major distributions but he also advocated time-based, rather than feature-based, releases for free software in general. He believes that a release has value in and of itself; by doing them on a regular schedule, a project will get into a kind of cadence that is useful for developers, testers, and users alike. Before starting, Shuttleworth was subjected to the traditional introduction by the previous year's keynote speaker—James Bottomley, in this case. Bottomley looked at Shuttleworth's postings to newsgroups over the years, noting three year-long valleys in the graph where there were no postings. It turns out these corresponded to events in Shuttleworth's life. The first is when he received a substantial amount for selling Thawte to Verisign: "when someone is being productive on the mailing list, never give them half a billion dollars," Bottomley said. For the second, he has a pretty good excuse as he was not on planet Earth; the last corresponds to starting Ubuntu. In a nod to Bottomley and the other kernel hackers, Shuttleworth mentioned that he had been working on his slides up until close to the start of his speech, while doing some unrelated things in the background—like updating his system. That picked up a new kernel as well and he did a suspend to RAM when he was done; only later in the cab ride to the Congress Centre did he think: "maybe that was a mistake". It turned out to work just fine, which is a testament to both the kernel and to distribution update mechanisms. The alliterative theme of the speech was that free software development should be guided by "cadence, collaboration, and customers". The cadence is a regular schedule for releases, similar to what GNOME—who pioneered this technique, according to Shuttleworth—and the Linux kernel do. This gets a project into a rhythm that makes it more predictable, which enables all interested people to schedule themselves around it. 
He compared this to various development methodologies such as "Agile" and "Lean". Industries are governed by rules, so if you want to change an industry, you "have to find which rules are only in our heads". Cross-project collaboration is one of those rules. "Nowhere is it written that projects can't collaborate." It is harder to do that if each distribution is working with different versions of the various base-level tools: the kernel, X.org, GNOME/KDE, OpenOffice.org, Mozilla, and so on. Shuttleworth contends that it is releases, rather than features, that bring attention to Linux. In answer to critics who believe that distributions should compete with each other, he says that is just "an opportunity to create friction." Free software companies don't compete on versions, but rather on philosophies and what things they focus on. He likens it to food courts or automobile sales malls where there are many choices in one location which serves to increase the sales of all. For major transitions, Shuttleworth is a fan of establishing meta-cycles, the idea that every N releases is a major release, which may result in breaking some backwards compatibility or introducing completely new functionality, along the lines of KDE 4 or GNOME 3.0. As an example, he used a six month release cycle where every fourth or sixth was a major release. For a distribution, that might be a long-term support release, rather than a major change. One of the key requirements that Shuttleworth sees is the need to "keep the trunk pristine", by doing integration on the trunk and feature development on branches. Along with this is the need for more and better tests. While not necessarily believing in test-driven development, he certainly leans that way. In any case, all the tests should pass before committing to the trunk. Many projects do not yet have an extensive test suite, but this needs to change. 
He quoted a Chinese proverb that "the best time to plant a tree is 20 years ago, the second best time is today". He mentioned that he is working on a robot that controls the trunk of a development tree. Developers will request it to merge from a branch, so the robot merges the branch and runs all the tests. If the tests pass, it commits; otherwise the merge gets kicked back to the developer. He sees distributions as "an effective conduit of upstream to users"; to that end, he believes that agreeing on versions of vital infrastructure can only help. Bugs that users find will be more likely to be fixed; those versions will also get better testing which will help developers. It is a conversation that free software should be having because it is a "very exciting idea" that won't work for every project but should be attempted and experimented with. In answer to criticism about Ubuntu not contributing as much as other distributions would in his proposed synchronized release, Shuttleworth was adamant that it was not true. He hates to see the antagonism and vitriol between distributions. "We have much bigger fish to fry and they are probably not here today." If all of the distributions were to standardize on a particular version of some project for their next release, what happens if that project falls behind? There are risks associated with that, Shuttleworth admits, but if it were happening, more resources would be available to help the project catch up. In the worst case, perhaps falling back to the previous version would have to happen. "Being tightly coupled has risks." This is clearly an idea that Shuttleworth feels strongly about, not necessarily that it be adopted fully, but that it be discussed and considered. Certainly some of his ideas have a great deal of merit. We will have to wait and see whether the grander vision will ever be implemented. 
Debian Lenny is frozen The Debian project is gearing up for the release of Debian Lenny, the next stable release of the Debian GNU/Linux operating system. This week we heard that Debian Lenny has been frozen. What does the freeze mean and when can we expect Debian Lenny to be released? To answer the second question first, the release is currently expected in September. While the testing branch is very close to what Debian Lenny will be, there are still Release Critical bugs to squash and other work that must happen before Lenny is pronounced stable. This Debian "lenny" Release Information page gives some pointers to various progress pages where you can find out more about the bugs that still need to be fixed. Mostly what the freeze means is that there are no more automatic uploads from Debian's unstable branch to the testing branch. Most Debian packages start out in unstable, also known as sid. That gives people a chance to test the packages and report any bugs. Assuming that these packages are working well, they will be automatically uploaded to the testing branch after a certain amount of time. Now though, testing is frozen, so a release manager will need to evaluate each unstable package and manually upload the package to testing, if it is judged suitable for Lenny. Chapter 5.13.3 of the Debian developers reference covers direct updates to testing, if you are looking for more detailed information. When Debian releases a stable distribution the user can be assured that they are getting a very stable operating system. All the packages will interact well with one another. It will not be the most up-to-date system available, because stability is considered more important than new versions of packages. Many Debian users agree. Some will continue to run Etch, the current stable version, until several months after Lenny is released. If you want a stable system, but need just one or two more current packages, you might consider building those packages yourself. 
Backports.org is another way of getting a few more current packages for your stable system. AptPinning allows you to run certain packages from one version, say unstable, on your stable system. There will be some risk with each of these methods, as newer packages may require newer libraries or have other dependencies. The more you change your stable system, the more instability you introduce. The lenny package list will help you find out what packages are currently in Lenny. Some digging through the sections there will show that Lenny includes linux-image-2.6-486 (2.6.25+14), and that dpkg (1.14.20) and hal (0.5.11-2) are among the Administration Utilities. The Python section lists python (2.5.2-1) among the many related packages. To find out if Lenny has what you are looking for, just browse through the sections. OLS: SELinux from academia to your desktop One of the nice things about conferences is the ability to catch up on where a particular project is headed, generally from one of the lead developers. Ottawa Linux Symposium did not disappoint in this area, with several "State of ..." talks. On day two of the four-day conference, James Morris looked at SELinux from its academic roots to its plans for the future. SELinux got its start from university research in the 80s and 90s that recognized that Discretionary Access Control (DAC) did not protect very well against the kinds of attacks that were becoming prevalent. This spawned the idea of Mandatory Access Control (MAC), in which the system makes all of the policy decisions regarding access, so users cannot change the permissions on files or other objects at their discretion. SELinux is a MAC system. Originally developed by the US National Security Agency (NSA) in the 90s, SELinux was released under the GPL in December 2000. 
At the Kernel Summit in 2001, SELinux was proposed for inclusion in the 2.5 development-series kernels (remember those?), but was rejected by Linus Torvalds because there was no consensus amongst the various competing security models. This is what led to the creation of the Linux Security Module (LSM) interface. It was the LSM interface that got Morris involved in SELinux. It took until the 2.6 release in December 2003 before SELinux was available in the mainline, which is about three years after its release. This is "not atypical for a significant change to the kernel," Morris said. The next phase was to get it enabled and working in distributions. Because he works for Red Hat, Fedora (Core in those days) was an obvious choice. FC2 was the first release with SELinux, but it was disabled by default because the policy was too strict. "Every time we switched it on, we would find bugs in the applications." Security bugs, that is. So, Fedora came up with the idea of a "targeted" policy that only affected network-facing services. This was released as part of FC3—which formed the basis for Red Hat Enterprise Linux (RHEL) 4. It was an attempt to get SELinux "switched on and doing something useful". It worked well enough that it inspired confidence in the technology by proving it was viable. SELinux developers realized that "if we run into problems, we can fix them". Since 2005, SELinux has emerged from a research orientation to a tool that is usable—with a very active development community. "Even being part of the project, it's hard to follow all that goes on" in the SELinux community. Morris then outlined some of the more significant developments over the last few years. The development of the reference policy by Tresys was a tremendous addition to SELinux. It was a "step forward in policy thinking" because it provides a framework around which to design policy. 
By getting rid of the original "spaghetti code" policy, it "made policy much more understandable to policy developers". Loadable policy modules broke up the monolithic policy that was originally part of SELinux into separate pieces. Each can then be loaded individually based on "policy booleans". The two of these together allow policy to be built and administered in sensible chunks, as well as allowing sites to "customize policy to support local conditions". Because of library and toolchain improvements, you no longer have to dig through files to edit, compile, and load policy either. Many of the reputation problems that SELinux has stem from the early days when it was well nigh impossible to track down policy problems and fix them. It is this frustrating user experience that SELinux is trying to tackle these days. The targeted policy is being merged with the "strict" policy and hundreds of modules covering different applications have been added. Policy failure—where the policy is written incorrectly causing a user to be unable to do something they should be able to—is "something you don't want the user to know about", but unfortunately that is unworkable. Because the system is under development, bugs will occur; there is nothing more frustrating for a user than to be denied access but to be unable to figure out why. That is where setroubleshoot can help. Inspired by GNOME's bug buddy, it alerts the user to policy violations and tries to help find the cause of the problem—to the point of suggesting possible fixes. It is somewhat dangerous, in that users may blindly follow the fixes without understanding what they are doing, but it helps psychologically. "Instead of a black box stopping your system from doing what you wanted, now you have a transparent box." System administrators have a much nicer set of tools to manage policies as well as filesystem labels. audit2why can analyze SELinux output to provide reasons, once again with possible fixes, for policy violations. 
It is "not the optimum way to develop policy," but it can help. In addition, semanage is the "go to tool" for managing SELinux that is becoming quite powerful. Policy development has several GUI tools that have become available. SLIDE is an Eclipse plugin that assists in policy development. It also includes support for testing and deploying policies. Hitachi has developed SEEdit, which is a tool that provides a simplified policy language specifically targeted at embedded devices. It is a higher-level language that removes much of the complexity from SELinux policy while still compiling into compatible policy files. Performance and scalability have been two areas that have seen much work over the past few years. Many performance and memory reduction patches have come from Japan from the work on embedded SELinux. On the performance critical path, RCU has been used to eliminate some locking, while caching values rather than recalculating them has also provided better performance. One of the areas that the SELinux hackers are most excited about is threat mitigation. "We have seen evidence that SELinux has provided protection for normal desktop users." Tresys tracks these kinds of threats in their SELinux Mitigation News. In the final analysis, this is what SELinux is meant to do, so it is gratifying to see concrete results. SELinux has been adopted widely in Fedora and RHEL, but plans for the future include making it available on other distributions. Ubuntu is shipping SELinux in addition to AppArmor, while Debian and Gentoo are targeted for better SELinux support. SELinux techniques are being pushed beyond the kernel, into virtualization (XSM), the desktop (XACE), storage (Labeled NFS), and applications like databases (SEPostgreSQL). There is also a push into other operating systems, like the OpenSolaris Flexible MAC project. 
The challenges facing SELinux in the future are in areas like usability, which is a "fundamental problem in security", and documentation, which is "not very good, in some ways really bad". Morris also wants to keep the community of users and developers growing. While SELinux has had a difficult path—first in getting into the kernel at all, then to becoming usable, and finally to actually preventing the kinds of attacks it was designed to stop—the developers seem to overcome each hurdle. It is a complex beast, that in some ways defies analysis, but it can help to protect systems. Like it or hate it, it seems likely to be with us for a long time. OLS: Smack for embedded devices The Simplified Mandatory Access Control Kernel (Smack) is a Linux access control mechanism akin to SELinux. As its name would imply, it is a much less complex scheme that requires far fewer resources than SELinux, which may make it more palatable to developers of embedded systems. Smack developer Casey Schaufler gave a talk at the recent Ottawa Linux Symposium (OLS) outlining how it could be used for embedded devices. Smack has the distinction of being the second user of the Linux Security Module (LSM) kernel interface to be merged into the mainline. This finally put to rest the idea that the LSM might some day be removed from the kernel, requiring all security solutions to be implemented in terms of SELinux. But Smack comes at Mandatory Access Control (MAC)—which is at the heart of both SELinux and Smack—from a different perspective. Schaufler believes that MAC rules should be explicitly specified rather than implicit in a set of policies a la SELinux. In order to get everyone up to speed, Schaufler gave an overview of MAC and Smack. The main thing to remember about MAC is that it is not user controlled. The system makes all decisions about access and the attributes of files that govern access. 
The standard UNIX model, by way of comparison, is a Discretionary Access Control (DAC) system, where users can change the security attributes of objects under their control. Smack relies on labels for subjects, which are active entities, and objects, which are passive. An access is then an operation that is performed by a subject, generally a task/process, on an object, which is typically a file. In order to determine whether the access succeeds or fails, Smack compares the subject and object labels. If they match, access is granted; if they do not, the explicit access rules are consulted. If one of those rules matches the attempted access, it is granted; otherwise it is denied. There are three system labels defined, along with access rules governing their behavior, but all other rules must be explicitly added by the administrator. Labels are simply strings up to 23 characters long. Rules then specify a subject label, an object label, and a desired access (read, write, execute, append). After mounting a smackfs filesystem at /smack, rules can be written to /smack/load, which stores them in the kernel for immediate use. It is important to note that objects inherit the label of the subject that creates them. That means that the label on an executable is only relevant to determine whether the subject process is allowed to execute it. The process that gets created has the label of the subject that executed it, not the label associated with the executable file. The same goes for processes that create files: those files get the label of the process. This is very different from the SELinux label inheritance rules. There is more to it, of course, but not a lot more, which is what makes it attractive to some. Interested readers are directed to our article, Schaufler's OLS paper [PDF], or the Smack home page for more detailed looks at Smack. Schaufler outlined specific reasons that a simplified system, like Smack, would be attractive in the embedded world.
Many embedded devices are single-purpose and geared towards one user. Because cost is often a major factor, the device only needs to implement the exact set of functions that it is meant to provide. As Schaufler puts it: "feature completeness is uninteresting". Cost often plays a role in the amount of system resources provided, particularly RAM and flash, as well. A solution that uses less memory fits well with the embedded mindset. There have been some efforts to pare down SELinux and its enormous policy file for the embedded world (including a paper at OLS [PDF], and a presentation at the Embedded Linux Conference that we covered briefly), but it is still rather large. It is also a great deal more complex than Smack, which was a major thrust of Schaufler's presentation. One problematic area for putting SELinux on embedded devices is that most flash filesystems do not have support for extended attributes (xattrs). Both Smack and SELinux use xattrs to store labels for files, but Smack can provide a default label for an entire filesystem to avoid requiring xattr support. Also, system files automatically default to the "_" (called floor) label so, in many cases, labels on individual files may not be required. In his talk, Schaufler gave several examples of specific sets of applications and how they could be easily cordoned off from each other while still working together. The model he used was of a mobile phone with multiple applications. The phone's system data would have the default floor label which means they can be read—but not written—by a process with any label. One of Schaufler's examples was of two different applications that each retrieved content from the network to display to a user. Each retrieved headlines from different services, one from CNN, the other from ESPN. 
At times the content might overlap, in which case the phone vendor wanted each to be able to read the other's data, potentially displaying a sports story as part of the regular news or vice versa. This is easily handled by two Smack rules: one giving the CNN label read access to objects labeled ESPN, and one giving the ESPN label read access to objects labeled CNN. Assuming that the CNN application runs with the CNN label, and the ESPN process with ESPN, they can each read and write their own private data (because the labels match). Because of those two rules, they can also read each other's private data. If, at some point, the phone provider decided those two applications should not be able to share data, those rules simply need to be removed; no filesystem relabeling or anything else is required. Another example that Schaufler gave was of a video process and an audio process that cooperated in sharing system resources by sending messages to each other. They had no need to share data, just to send UDP messages. In Smack, a process can send a UDP packet if it has write access to the label of the other process. So two Smack rules could be used: one giving the Video label write access to the Audio label, and one giving Audio write access to Video. One might expect that giving write permission would allow Video, for example, to write to data with the Audio label. This is not the case because UNIX file semantics require read access in order to write file data (because the inode of the file must be read). So under this set of rules, each process can send UDP packets to (and receive them from) the other, but cannot access any of the data labeled for the other process. Schaufler had some other examples in his presentation (slides [PDF]) that were geared more towards exploring Smack capabilities than specifically at embedded applications. He concluded by directly comparing Smack and SELinux in terms of complexity. Clearly Smack is vastly simpler; whether it has enough capabilities to provide the protection that embedded developers require remains to be seen. On the other hand, whether SELinux can be made to work reasonably in embedded environments is also an outstanding question.
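The access check and the two example rule sets described above can be modeled in a few lines. What follows is a minimal sketch in Python, not Smack's kernel implementation: the function names are hypothetical, the rule strings assume a "subject object access" layout, and only the floor label's readable-by-anyone behavior is modeled (the other system labels are left out).

```python
FLOOR = "_"  # the floor label: readable, but not writable, by any subject


def parse_rules(text):
    """Parse lines of 'subject object access' into a lookup table (assumed layout)."""
    rules = {}
    for line in text.strip().splitlines():
        subject, obj, access = line.split()
        rules[(subject, obj)] = set(access)
    return rules


def allowed(rules, subject, obj, access):
    """Return True if `subject` may perform `access` ('r', 'w', ...) on `obj`."""
    if subject == obj:
        return True          # matching labels: access is granted
    if obj == FLOOR and access == "r":
        return True          # floor-labeled data can be read by anyone
    # Otherwise only an explicit rule can grant the access.
    return access in rules.get((subject, obj), set())


def can_write_file(rules, subject, obj):
    """Writing file data needs read access as well as write access."""
    return allowed(rules, subject, obj, "r") and allowed(rules, subject, obj, "w")


def can_send_udp(rules, sender, receiver):
    """Sending a UDP packet needs write access to the receiver's label."""
    return allowed(rules, sender, receiver, "w")


# The headline-sharing example: each application may read the other's data.
NEWS_RULES = parse_rules("""
CNN ESPN r
ESPN CNN r
""")

# The messaging example: each process may send packets to the other.
AV_RULES = parse_rules("""
Video Audio w
Audio Video w
""")
```

With these tables, the CNN subject can read ESPN-labeled data but not write it, and Video can send packets to Audio without being able to write Audio's files. Passing an empty rule table models the vendor removing the sharing rules: nothing needs to be relabeled for access to disappear.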
It will be interesting to watch.

A kernel message catalog

Kernel developers will often use printk() to output a message when something goes wrong. Such messages tend to be helpful to kernel developers; if nothing else, they can be used to find the place in the source where the message is emitted, and that, in turn, is most useful for somebody trying to figure out what the message is really saying. So, if your kernel tells you, for example, "lguest is afraid of being a guest," a quick dig through the source turns up a comment reading "Lguest can't run under Xen, VMI or itself. It does Tricky Stuff." Problem solved - or, at least, understood. But, for the bulk of Linux users and administrators, the act of printk() interpretation by recourse to the kernel source is, itself, Tricky Stuff. If the kernel cannot tell them directly what the problem is, they would much rather have a more straightforward means of translating messages into some sort of useful English. Or maybe not: for many Linux users, English may not be much more helpful than straight kernel-speak. It would be really nice to translate those messages into some sort of useful French, or Chinese, etc. What it comes down to, in the end, is that printk() alone will never be able to provide sufficient information to users in a way which can be understood and used to solve problems. Just over one year ago, LWN looked at some proposals for adding structure to kernel messages. After that, the discussion went quiet, to the point that it seemed like not much was happening in the messaging area. But one should not forget that we are dealing with companies like IBM which have been creating massive binders full of kernel message documentation for several decades. They're not going to give up so easily. So the posting (by Martin Schwidefsky) of a new kernel messaging proposal is not an entirely surprising event.
In the latest scheme, each source file which generates structured messages defines a macro KMSG_COMPONENT as a string naming the specific subsection. This name will often match the name of the module which is created from that code, but that is not necessarily the case. The name, once chosen, is supposed to remain fixed forevermore; it becomes, in essence, part of the user-space interface and should always match the documentation. Then, each message is assigned an integer identification number. The combination of the component name and the message number should be unique throughout the kernel; it is used by various tools to associate a more detailed explanation of whatever the message is intended to communicate. The message number is used with one of a number of new printk()-like functions: The "_dev" versions take an additional struct device argument (like dev_printk()) and encode the device name in the resulting message. That message (for all variants) will include the component name and the message number in any output. So, for example, the S/390 "xpram" driver includes the following: Should this particular error check trigger, the resulting message will look like this: Thus far, our user is probably not feeling much better informed than before. But there is additional information which is made available and associated with that message tag. In this particular case, it looks like this: Here, we have a more verbose description of the message. Even more helpfully (one hopes), there is a discussion of what can be done to make this message go away. This information can be provided within the source or in a separate documentation file; it can also, presumably, be nicely formatted and distributed to paying customers as a binder for the system administrator's bookshelf. It can be translated into other languages for Linux users worldwide (and beyond: one could have a lot of fun with the Klingon translation for this kind of material). 
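The relationship between a message tag and its description can be illustrated with a small model. This is a sketch in Python rather than the patch's C macros or its Perl checker: the class and function names are hypothetical, the "component.number:" prefix is an assumed tag layout, and the "demo" component and its message text are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    text: str          # the message format string
    explanation: str   # what the message means
    user_action: str   # what can be done to make it go away


class MessageCatalog:
    """Maps (component, number) pairs to their longer descriptions."""

    def __init__(self):
        self._entries = {}

    def describe(self, component, number, description):
        key = (component, number)
        if key in self._entries:
            # The component/number pair must be unique throughout the catalog.
            raise ValueError(f"message ID {component}.{number} is already in use")
        self._entries[key] = description

    def lookup(self, component, number):
        return self._entries.get((component, number))

    def undocumented(self, emitted):
        """Return the emitted message IDs that lack a description."""
        return sorted(key for key in set(emitted) if key not in self._entries)


def format_message(component, number, fmt, *args):
    """Prefix the formatted text with its tag (assumed 'component.number:' layout)."""
    return f"{component}.{number}: {fmt % args}"


catalog = MessageCatalog()
catalog.describe("demo", 1, Description(
    text="%d is not a valid number of widgets",
    explanation="The widget count given at load time is out of range.",
    user_action="Reload the module with a supported widget count.",
))
line = format_message("demo", 1, "%d is not a valid number of widgets", 42)
```

A tool that sees the tag at the start of `line` can call `lookup("demo", 1)` to fetch the explanation, while `undocumented()` plays the role of the check for messages that have no description block.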
The patch includes a script (written in Perl with undocumented messages, of course) which (when invoked with make D=1) will go through the source and make sure that every kernel message has an associated description block; it can also format the descriptions into man pages if desired. There are checks for missing descriptions or overloaded message ID numbers; the script does not, at the moment, check for a change in the message text. Martin's first posting made this work specific to the S/390 architecture; following a suggestion from Andrew Morton, he made it generic in later versions. The cost of this work is zero for those who do not use it, so there is a reasonable chance that it will find its way into the mainline eventually. Before the message catalog system can be truly useful, though, developers will have to go through and document a substantial portion of the messages created by the kernel - and keep that documentation current as the kernel evolves.

Can user-space bugs be kernel regressions?

Adding new functionality to the kernel while maintaining the interfaces for user space is the standard kernel development practice. Sometimes, though, that can tickle bugs in user-space programs in unpleasant ways. When that happens, it is clearly a regression (something that worked before no longer does), but is it a kernel regression? In the end, it doesn't matter, it seems, because the kernel needs to change to keep the user-space program working, even at the expense of "ugliness". Clearly for purely internal kernel functionality, there is no mandate for compatibility across kernel versions. But, when the user-space interface is involved, things get a bit trickier. A change that alters the way a documented interface works is essentially never done; user-space interfaces are maintained forever. When new functionality properly uses a documented interface, but breaks a user-space program, it gets murkier.
That situation came up recently when Andrew Morton noticed that the linux-next tree broke the X server on his laptop. The problem was quickly diagnosed as a bug in the Synaptics touchpad driver for X. An array that was being passed to an ioctl() was sized based on the number of bits, rather than bytes, it should contain. Thus the maximum buffer length passed was off by a factor of eight. As a solution, Dmitry Torokhov offered up a patch, not to kernel code, but to the synaptics X driver. That didn't sit particularly well with Morton and others, eventually leading to a pronouncement from Linus Torvalds: "If somebody has the commit that broke user space, that commit will be _reverted_ unless it's fixed. It's that simple. The rules are: we don't knowingly break user space." Torokhov clearly felt that it was the driver, not his changes, that was at fault, which is entirely understandable because it's true. That doesn't alter the fact that new kernels would break existing, working configurations on laptops everywhere. The kernel change just fully used an existing, documented interface, as Torokhov explained: "It is not like we broke ABI here. The program (synaptics driver) had a grave bug. Older kernels happened to paper over the bug because they did not fill the whole buffer that was advertised as available. Now that we have more data to report the bug bit us. Declaring an array of 64 bytes, but telling the kernel it can store up to 511 bytes into it is obviously a bug." But, as Morton points out: "It really really doesn't matter what the causes are or which piece of code is at fault or anything else like that. What _does_ matter is that people's stuff will break. Apparently lots of people's. That's a problem. A _practical_ problem. Can we pleeeeeeze be practical and find some way of preventing it?" Since the code was in linux-next, it was targeted at the 2.6.28 kernel.
In Torokhov's thinking, this would allow something approaching six months for distributions to update the synaptics driver. But that is a fundamental misunderstanding of how and when kernels are upgraded: it is not only by way of distributions. Introducing a change like this would result in many messages to linux-kernel from unhappy folks with broken X servers. Kernel hackers purposely build and run kernels on a wide variety of hardware and distributions. That includes older distributions that no longer get updates; they would be stuck with the buggy driver, and thus a non-working X server, essentially forever. Obviously, they could rebuild the synaptics driver (kernel hackers have been known to compile things other than kernels), but that isn't the point. There are major benefits to also having lots of regular users update their kernels frequently. Trying to ensure that there won't be any unnecessary barriers to doing that can only help. Torvalds describes it this way: "And if we want to encourage people to upgrade their kernel very aggressively (and we absolutely do!), then that means that we have to also make sure it doesn't require them upgrading anything else." Torvalds and Torokhov worked out a fix that preserved the old behavior for a specific passed-in buffer length, while allowing the new events to be delivered to any other users of the ioctl() that passed in the proper length. Torvalds commented: "Yeah, it's not pretty, but pragmatism before beauty." It is, to some extent, a gray area. Regressions are bad for any number of reasons, but maintaining hackarounds for buggy user-space programs has its own set of problems. The hope is that eventually the need for the workaround goes away so that it can be removed. It would seem difficult to determine when the last user of the old synaptics driver finally upgrades, so this code could be with us for a long time. Given the alternative, the price seems worth it.
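The arithmetic behind the bug, and the shape of the compatibility fix, can be shown with a short model. This is a sketch in Python, not the kernel's code: the function names are hypothetical, the 64-byte buffer and the claimed length of 511 come from Torokhov's description, and the 96-byte size used for the newer kernel's data is an invented figure for illustration.

```python
LEGACY_BUFFER_BYTES = 64   # what the buggy driver actually allocated
LEGACY_CLAIMED_LEN = 511   # what it told the kernel: a bit count, not a byte count


def bytes_for_bits(nbits):
    """Number of whole bytes needed to hold `nbits` bits."""
    return (nbits + 7) // 8


def bytes_copied(kernel_data_bytes, claimed_len, legacy_workaround):
    """How many bytes the kernel writes into the caller's buffer.

    With the workaround enabled, the one known-bad length is treated as the
    bit count it really was; every other caller gets what it asked for.
    """
    if legacy_workaround and claimed_len == LEGACY_CLAIMED_LEN:
        claimed_len = bytes_for_bits(claimed_len)
    return min(kernel_data_bytes, claimed_len)


# Older kernel: only 64 bytes of data existed, so the wrong length was harmless.
old_kernel = bytes_copied(64, LEGACY_CLAIMED_LEN, legacy_workaround=False)

# Newer kernel with more data to report (96 bytes assumed): the buffer overflows.
new_kernel_unfixed = bytes_copied(96, LEGACY_CLAIMED_LEN, legacy_workaround=False)

# Newer kernel with the special case: the buggy caller sees the old behavior.
new_kernel_fixed = bytes_copied(96, LEGACY_CLAIMED_LEN, legacy_workaround=True)

# A correct caller that passes a real byte count still receives everything.
correct_caller = bytes_copied(96, 96, legacy_workaround=True)
```

The special case keys on a single length value, which is why it preserves the old behavior for the buggy driver without limiting callers that pass the proper size.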
Though Torvalds was absolute in condemning any known regression, even for programs that are clearly misusing an interface, there must be a line somewhere. If some obscure program, with few users, gets broken by the kernel doing something documented and reasonable, it is hard to imagine that this kind of workaround will be required. This particular problem was relatively easy to decide; the next might not be.

Building custom appliance distributions with rBuilder

Linux distributions can be a pain. Users have to go through the whole process of installation, configuration, and updates, and, often, all they really want to do is to run a single application. The vendors of that application, meanwhile, feel the need to support as many distributions as possible, even though the actual system running underneath their code is nearly irrelevant. Wouldn't it be nice if users could simply get their desired application as an "appliance" which comes with all the necessary component parts nicely hidden inside? As it happens, rPath has been in the appliance business for a little while now. Recently, the company has made its appliance-building infrastructure available to free-software products in the form of rBuilder Online. In essence, rBuilder can be used to create and maintain a custom distribution oriented around the delivery of a specific application. The result is a "software appliance" which, in theory, makes the given application available in a self-contained, standalone distribution. There are a number of example appliances available on the site. They include Bongo, an attempt to revitalize work on the Hula mail client; Gallery, a standalone photo album; LochDNS, a DNS server; and Openfiler, a storage management system. There are several others oriented around content management systems, telephony applications, database servers, and more. All told, quite a few projects have shown interest in creating software appliances for their applications.
Your editor grabbed a copy of the Openfiler appliance and installed it onto a spare box which had been cluttering up the office. Appliances from rBuilder start out looking like a Fedora system; they use the same Anaconda installer. The installed system also shows a lot of Red Hat heritage, such as /etc/sysconfig, various system-config-* commands, an /etc/inittab file which credits Mark Ewing and Donnie Barnes, etc. But there is a crucial difference: there is no rpm command. Instead, these appliances are based on rPath's Conary package management system, which takes a very different approach to the software management problem. But there are still similarities with Fedora: your editor attempted a conary updateall operation on the LochDNS appliance, only to see it fail with a set of file conflict errors; it was almost like running Rawhide again. Appliance users are not supposed to have to dirty their fingertips with command-line administrative operations, though. To help them avoid this fate, rBuilder-based appliances come with the rPath Appliance Platform Agent, otherwise known as a web-based administration interface. Once the user gets past the usual set of obnoxious Firefox dialogs ("this site has an SSL certificate which is not only unknown, but is almost certainly hostile and is ugly besides"), this interface provides a set of administrative screens for standard tasks (networking, updating the system, etc.) along with some specific to the Openfiler application. In theory, it should be possible to manage one of the appliances without ever going to the command line - or even knowing that the command line exists. In practice, how well that works depends a lot on how the administration screens are designed. In the Openfiler case, quite a bit of clicking around in circles was required, but your editor did finally succeed in setting up a volume based on a USB key, performing a software update, and shutting down the system at the end.
The creation of appliances would appear to be relatively straightforward; details can be found in this document. One creates an account in the rBuilder system, then puts together a file describing which components (packages) are necessary in the final system. Those components will presumably include at least one application provided by the appliance builder - that application being the reason for the creation of the software appliance in the first place. The "rMake" system will then pull all of the pieces together, bring in any needed dependencies, and wrap it all up inside a minimal distribution; the resulting system image seems to run at about 300MB. There are several possible output formats, including the Anaconda-based installation CD image; the rPath folks would appear to have put a lot of effort into making appliances work on a number of virtualization platforms as well. Appliances can be built for VMWare, various forms of Xen, VirtualIron, and Microsoft VHD. Notably absent is anything based on Lguest or KVM. Even more notably absent is any kind of live CD appliance; anything not running in a virtual machine must be installed onto the host system's disks. rPath's Conary servers seem to be set up to handle software updates. It is also possible to obtain source for the packages found in an appliance through the rBuilder site, though one must do a little digging first. Both of these features are important: anybody creating a distribution-based appliance has to arrange for updates and source availability somehow. One assumes that most appliance creators have no real desire to get into the broader distribution business, so it's nice for them to be able to offload these tasks. Anybody distributing these appliance images should note that rPath does not appear to have undertaken any obligation to continue to provide these services in the future. 
Should rPath decide to stop, some interesting questions on who is ultimately responsible for satisfying the source-availability provisions of the GPL could come up. Naturally enough, rPath offers commercial services for those who would like stronger guarantees about long-term support, or who want to include proprietary software in their appliances. For the time being, this approach to software distribution would seem to be most useful for companies which are in the business of building real, hardware-based appliances. Distributing software in virtual machines has the look of a new and truly impressive form of bloat; even "just enough operating system" is a lot of baggage for an application to drag around. For situations where one wants to try out a complex system, appliance distribution may be worth its cost, but one would probably not want to get every application this way. There may be value, though, in software distributions which can run almost anywhere, and which can be nicely isolated from the outside world. Locking network-exposed applications - server processes or web browsers - into their own little world could help to avoid a lot of security problems in a way which seems more straightforward than SELinux or containers. But, perhaps most interestingly, the appliance approach could eliminate a number of distribution-compatibility issues by putting many more people into the distribution business. Now anybody can throw together a special-purpose distribution without having to deal with all of the plumbing that makes the whole thing actually work. Something interesting will certainly come of this idea, even if it's hard to say just what that might be at the moment.

The TALPA molehill

The TALPA malware scanning API was covered here in December, 2007. Several months later, TALPA is back - in the form of a patch set posted by a Red Hat employee.
The resulting discussion has certainly not been what the TALPA developers would have hoped for; it is, instead, a good example of how a potentially useful idea can be set back by poor execution and presentation to the kernel community. The idea behind TALPA is simple: various companies in the virus-scanning business would like a hook into the kernel which allows them to check for malware and prevent its spread. So the patch adds a hook into the VFS code which intercepts every file open operation. A series of filters can be attached to this intercept, with the most important one being a mechanism which makes the file being opened available to a user-space process as a read-only file descriptor. That process can scan the file and tell the kernel whether the open operation should be allowed to proceed or not. In this way, the scanning process can prevent any sort of access to files which are deemed to contain bits with evil intentions. There are a few other details, of course. A caching mechanism prevents rescanning of unchanged files, increasing performance considerably. There is also a hook on close() calls which can trigger the rescanning of a file. Processes can exempt themselves from scanning if it might get in their way; scanning can also be turned off for specific files, such as those used for relational database storage. But the patch set is relatively small, as it really does not have that much to do. This capability could well prove to be useful. Even if one is not concerned about malware infections on Linux systems, a lot of files destined for more vulnerable platforms can pass through Linux servers. There is also the potential for the detection of attempted exploits of the Linux host. 
Normally, in the Linux world, the way we respond to knowledge of a specific vulnerability is to patch the problem rather than scan for exploits, but there may be systems which cannot be restarted on short notice, and which could benefit from an updated scanning database while running code with known vulnerabilities. Also, as Alan Cox pointed out, this feature could be useful for entirely different objectives, such as efficient indexing of files as they change. What might be best of all, though, is that this hook could replace a number of rather less pleasant things being done by anti-malware vendors now. Some of these products use binary-only modules, plant hooks into the system call table, and generally behave in unwelcome ways. Moving all of that to a user-space process behind a well-defined API could be beneficial for everybody involved. The patches have gotten a generally hostile reception on the kernel mailing lists, though. Some developers are uninspired about the ultimate objective: "So you are going to try to force us to take something into the Linux kernel due to the security inadequacies of a totally different operating system? You might want to rethink that argument." That's an objection which can be worked around; the kernel developers do not normally want to determine which applications will or will not be supported by the system as a whole. Another objection, though, might be harder: this hook is said not to be the best solution to the problem. Instead of putting a hook deep within the VFS layer, the anti-malware people could simply hook into the C library (perhaps with LD_PRELOAD), put the malware scanning directly into the processes (mail clients or web servers, say) which are passing files through the system, or embed the scanning into a stackable filesystem implemented with FUSE (or a similar mechanism).
That has led to counterarguments that scanning implemented in this manner could be evaded by a hostile application - by performing system calls directly, for example, instead of going through the C library. Certain kinds of attacks, it is said, could get around a purely user-space solution. That argument, however, highlights the real problem with this posting. The patch includes a set of 13 "requirements," including intercepting file opens, caching results, exempting processes, and so on. But none of these requirements describe the problem which is really being solved. In particular, as noted by Al Viro and others, there is no description of the threat which this patch is intended to mitigate: "Various people had been asking for _years_ to define what the hell are you trying to prevent. Not only there'd been no coherent answer (and no, this list of requirements is _not_ that - it's 'what kind of hooks do we want'), you guys seem to be unable to decide whether you expect the malware in question to be passive or to be actively evading detection with infected processes running on the host that does scanning." If the scanning host could be infected, then a scanning mechanism which could be circumvented by a rogue program is indeed a problem. But that is a very different threat than simply trying to prevent evil attachments from creating mayhem on Windows boxes; it does not appear to be a threat which these patches are trying to address. The lack of a clearly described problem has caused the discussion of these patches to go around in circles; it is not possible to evaluate (1) whether the goals of these patches are worth supporting, or (2) whether the patches can actually be successful in achieving those goals. The code, in other words, cannot be reviewed. Until the TALPA developers can clarify that situation, their work will look like an example of "shoot first, then aim." That kind of code tends not to make it into the mainline, even if it could be useful in the end.
Firefox to support Theora video

Video in the browser, at least for Linux, has always resorted to somewhat clunky solutions (Flash plug-ins or external programs), but that is likely to change in Firefox 3.1. Recent commits to the Firefox development tree have added support for the HTML 5 <video> and <audio> tags as well as native Ogg Vorbis and Theora support. Providing multimedia support directly in a free browser, with no plug-in required, is a huge step forward both for Linux and for the royalty-free codecs. The battle over video and audio formats is an ugly one, largely because they are patent minefields. The "mainstream" formats, MPEG-4 for video and MP3 for audio, are licensed on a royalty basis to companies that want to implement playback. Obviously, Mozilla is not in a position to pay a per-installation royalty, so that leaves various ad hoc methods using Javascript and plug-ins (that users have to track down) to make audio-video playback work in its browser. The new feature (seen at left), tried on one of the recent nightly Firefox builds, seemed to work pretty well given that it is still under development. The video played smoothly, but the audio was not functional, only producing a rumbling, clicking soundtrack. The Wikimedia Commons video collection, a nice collection of Theora videos, was used for the test. Some have seen the lack of Theora content currently on the web as a reason to downplay Firefox's support for the format, which is unfortunate, as Mozilla hacker Robert O'Callahan was quick to point out. Unlike the current situation, once a Firefox with video support is released, there will be one format that all content producers can be sure will be available for Firefox. Depending on whose numbers you believe, that means that somewhere between 10 and 25% of web surfers (or more than 100 million people) will be using it.
Even with the dominance of Internet Explorer, the plethora of codec plug-ins has made it somewhat difficult for content providers to decide upon which video formats to support. With a substantial fraction of browsers supporting a particular free format, that situation may change. Wikimedia will certainly help by providing reasons for those not using Firefox to demand Theora plug-ins (if not integrated Theora support) for their browsers. As more content is available in that format, the pressure will build on Microsoft and Apple. As we mentioned in an article on web video formats last December, more content is the key to Theora support. Some have argued that Vorbis and Theora are just as likely to be patent-encumbered as the more mainstream codecs, but so far that is unproven. There is no licensing authority that claims to have patents covering those codecs. Though Mozilla has some depth to its pockets, largely due to its deal with Google, patent holders might be loath to attack a free software browser. In many ways, patent holders risk upsetting their entire apple cart if their attacks rise too high into the public consciousness. Clearly, though, Mozilla will be taking on some amount of risk with this move. There have also been arguments that the Theora codec produces inferior video compared to those used by MPEG-4 and others. There is certainly truth to that assertion, but there is ongoing work to bring Theora more in line with the quality of its competitors. Because it isn't controlled by a licensing authority with little or no interest in improving it, there is hope that Theora, or some descendant of it, could produce superior results some day. Dirac, also known by the name of its C language implementation Schrödinger, is another royalty-free codec that is being looked at for inclusion into Firefox. There are currently some performance issues with decoding, but if those get resolved, there might be two free choices for video codecs in Firefox.
There are lots of entrenched interests that would like to see Theora, Vorbis, Dirac, and others like them disappear. They are quite happy with the current state of affairs. For the most part, though, users are not. Even on "well supported" platforms, video—and to a lesser extent audio—is a confusing jumble of plug-ins and formats that make it somewhat painful to use. Flash and Silverlight are supposed to "solve" these problems, but they do it in a not-quite-free way that still requires plug-ins. If web users start to find it easier to use the video formats embedded in their browser, and content producers take notice, it could completely change video on the web. Looking forward to Fedora 10 The Fedora 10 alpha release is now available. At this point, the next Fedora release (due at the end of October) should be mostly feature-complete, though the project reserves the right to continue development work through the beta release (currently planned for August 19). So this seems like a good opportunity to have a look at some of the features which can be expected in Fedora 10. Rawhide users, who are well known for their masochistic tendencies, are already running the 2.6.27-rc kernels. Given that 2.6.27 should come out in the early part of October, chances are good that this is the kernel version which will come standard with Fedora 10. So Fedora users will be among the first to get enhanced webcam support, UBIFS, ftrace, multiqueue networking, and more. Improved webcam support is an explicit goal for Fedora 10 in general. The kernel upgrade will help a lot in that regard, but Fedora is taking aim at another longstanding problem: quite a few video applications still use the Video4Linux1 API, despite the fact that said API has been deprecated for years. To help improve this situation, Hans de Goede has been working on another long-missing piece: a user-space library to make the Video4Linux2 API easier for applications to use. 
It will handle things like format conversions, which, by policy, are not allowed in the kernel; it also does better impedance matching between the V4L1 and V4L2 interfaces. The end result of this work will be better-working webcams for Fedora users - and for everybody else. A similar objective for Fedora 10 is better support for remote controls. The LIRC remote control package has always been a some-assembly-required affair; Fedora developers are trying to improve this situation and get remote controls to just work. "Just works," alas, is not a phrase which has been heard often enough around the PulseAudio sound server. The upcoming Fedora release will have a seriously rewritten PulseAudio; the biggest change is a shift to timer-based audio scheduling instead of the older interrupt-driven technique. The promised result will be glitch-free audio; those who are curious about the details of how this will work can find them on this page. PulseAudio is getting better. Another big change, of course, is the shift to RPM 4.6 - the first real update to the RPM package manager in many years. Being fully aware of the consequences of a failed RPM upgrade, the Fedora developers are proceeding with great caution. The on-disk format will not be changed anytime soon, and newer RPM features are not, yet, being used in Fedora; that means that they can revert back to the older RPM if need be without leaving systems stranded. After some early glitches, RPM 4.6 would appear to be working fairly well, though, so this upgrade will probably stick. Beyond that, Fedora users can expect a long list of new goodies. NetworkManager now has a feature allowing the sharing of network connections via wireless. There are plans to provide much-improved support of the Haskell programming language, though that project appears to be moving slowly. And there is an interesting new security audit tool intended to look for security problems and signs of intrusions. 
Your editor would have loved to try out this tool, but, as of this writing, the version in Rawhide appears to be lacking some fundamental features - like being able to start up successfully. Stay tuned. One thing that apparently will not be in Fedora 10, despite the occasional user request, is KDE 3.5. Some KDE users are not, yet, happy with the state of development of KDE 4 and would like to have their old, familiar desktop back. This note from Fedora leader Paul Frields explains why KDE 3.5 will not be returning to Fedora. In summary: Fedora exists to push the leading edge, Qt3 is no longer maintained, and shipping KDE 4 helps that platform improve more quickly. So KDE 3.5 will not be coming back - unless somebody else goes to the trouble of packaging and maintaining it. All told, there is a lot of work going into this distribution release. The best way to really see what's going on - and to help the process - is, of course, to try out the alpha release and report any problems which result. After making good backups, of course. The GNOME 2.24 module proposals The GNOME desktop environment is built in a modular manner with API-stable platform modules and less API-stable desktop modules. Desktop modules can be transitioned to platform modules as they mature. The Damned Lies about GNOME translation site describes the GNOME modules: "Modules are separate libraries or applications, with one or more branches of development included. They are usually taken from CVS, and we keep all relevant information on them (Bugzilla details, web page, maintainer information,...)." The site contains an extensive list of modules for the current GNOME 2.22 release. On August 4, 2008, a list of modules to be included in the upcoming GNOME 2.24 was posted. A quick tour of the new modules to be included follows: empathy: "Empathy consists of a rich set of reusable instant messaging widgets, and a GNOME client using those widgets. 
It uses Telepathy and Nokia's Mission Control, and reuses Gossip's UI. The main goal is to permit desktop integration by providing libempathy and libempathy-gtk libraries. libempathy-gtk is a set of powerful widgets that can be embeded into any GNOME application." project hamster: "Project Hamster is time tracking for masses. It helps you to keep track [of] how much time you have spent during the day on activities you have set up. Whenever you change from doing one task to other, you change your current activity in Hamster. After a while you can see some statistics of how many hours you have spent on what. Maybe print it out, or export to some suitable format, if time reporting is a request of your employee." clutter: "Clutter is an open source software library for creating fast, visually rich and animated graphical user interfaces. Clutter uses OpenGL (and optionally OpenGL ES for use on Mobile and embedded platforms) for rendering but with an API which hides the underlying GL complexity from the developer." libcanberra, announced here, is a lightweight sound event library that implements the XDG sound theming/naming specs. PolicyKit (from an LWN article): "Mounting removable filesystems, CDs, USB devices, and the like, is a classic example of a root-only task that some non-privileged users might be allowed to perform. In the past, various mechanisms using groups or mount options in /etc/fstab have been used with some success, but the mechanisms were specific to mounting and did not provide the flexibility that some administrators would like. Network configuration - particularly for wireless networking - is another common task that users might be allowed to do. PolicyKit is an attempt to centralize these kinds of decisions into a single policy file that the administrator can use to set the kinds of access regular users should be allowed." 
There are also a few modules which were not accepted this time around: Conduit: "Conduit is a synchronization application for GNOME. It allows you to synchronize your files, photos, emails, contacts, notes, calendar data and any other type of personal information and synchronize that data with another computer, an online service, or even another electronic device. Conduit manages the synchronization and conversion of data into other formats." Conduit was partially rejected due to an incomplete UI, but allowed as an external dependency for use by other applications. It should be ready for inclusion in GNOME 2.26. WebKit: "WebKit is an open source web browser engine. WebKit is also the name of the Mac OS X system framework version of the engine that's used by Safari, Dashboard, Mail, and many other OS X applications. WebKit's HTML and JavaScript code began as a branch of the KHTML and KJS libraries from KDE." The plan is to replace the Gecko HTML rendering engine with WebKit in time for GNOME 2.26. libgda (part of Gnome-DB): "Libgda is a database abstraction layer which hides all the database backend specifics from the user, offering a simple interface to each supported database (MySQL, PostgreSQL and SQLite are fully functional while Oracle and MDB are useable and missing features) to run queries." Libgda is required by the Anjuta IDE; it will either be included optionally or bundled with Anjuta. There is, of course, a lot more to GNOME 2.24 than a few new modules; see the roadmap for more information. This GNOME release is currently scheduled for September 24. Kernel Hacker's Bookshelf: The Practice of Programming In The Mythical Man-Month, Fred Brooks observes that the productivity of experienced programmers frequently varies by a factor of 10 or more. What makes the 10x programmers so much better? Undoubtedly some of the difference is due to native facility with language or logic. 
But even with these advantages, no one is born writing beautiful, elegant, maintainable code; everyone goes through a learning process. How do we learn to be good programmers? In many ways, the art of computer programming is still stuck in the era of the master-apprentice system. Some of us are lucky enough to learn to program in something like "the UNIX room" at Bell Labs, where you could shoulder-surf the likes of Ken Thompson and Dennis Ritchie. Occasionally someone practices pair-programming instead of just arguing passionately about it, and once in a very long while, a 10x programmer will actually teach another person how to program. Unfortunately, formal university education rarely teaches students about the practical aspects of programming, as any holder of a computer science degree will readily attest, and few programmers have the time, interest, or ability to write accessible books about programming. As a result, most programmers are doomed to a decade of re-inventing wheels by trial and error. Brian Kernighan and Rob Pike are two 10x programmers who do have the time, interest, and ability to write a book about software engineering best practices. The Practice of Programming aims to fill the gaps in the training of most computer programmers. From the book: Topics like testing, debugging, portability, performance, design alternatives, and style - the practice of programming - are not usually the focus of computer science or programming courses. Most programmers learn them haphazardly as their experience grows, and a few never learn them at all. This book probably won't make you ten times more productive, but it can easily make you twice as productive (and half as frustrated). If I could send one book to a programmer trapped on a desert island, this would be the book - and I'd send the same book to the new programmer who just joined my development team. Overview The Practice of Programming differs from most programming books in several enjoyable ways. 
Rather than promoting a particular new programming philosophy, Kernighan and Pike focus on three principles: simplicity, clarity, and generality. As you might guess from the title, the book is short on theory and long on practice. About one third of the ~250 page book is taken up by actual real-world example code, starting with the original dodgy code and showing the step-by-step evolution to better code. Most examples are in C, but the principles illustrated readily translate to other languages. The writing style of this book is refreshingly practical and down-to-earth, without losing generality. The authors avoid stark black-and-white pronouncements, preferring to discuss why different techniques are useful under different conditions. Clarity is another hallmark of their style; they use as few words as possible to clearly state each point, and dismiss trivialities and side issues quickly and cleanly. A typical example of this approach is their advice on brace and indentation style: "The specific style is less important than its consistent application. Pick one style, preferably ours, use it consistently, and don't waste time arguing." The book is organized into nine chapters, each covering a topic such as testing or debugging that usually requires an entire book on its own. The table of contents includes headings like "Test as You Write the Code," "Consistency and Idioms," "Strategies for Speed," "Other People's Bugs," and "Programs that Write Programs." I can't cover the whole book in this review, but I'll go into detail on two of my favorite chapters, "Performance" and "Notation." Performance The introduction of this chapter gives some very direct advice: "The first principle of optimization is don't." Computers are fast - go run lmbench on your desktop to update your sense of just how fast. For example, some system calls are now in the sub-microsecond range under Linux on modern hardware. 
Armchair optimization - the practice of making small theoretical optimizations as you code, at the expense of readability, portability, or correctness - is especially foolish in light of Donald Knuth's observation that 4% of the code typically accounts for more than half of the run-time of the program. Kernighan and Pike's first piece of advice is to write simple, clear, concise code, and optimize only when you have some tangible reason to do so. The chapter begins with a real-world optimization problem: a spam-filter that worked well enough in testing but bogged down in production. The tangible reason for optimizing this program is that the mail queues were filling up with undelivered mail - a clear justification for optimization if there ever was one. The authors show the process they went through to optimize the spam-filter, step-by-step: profiling, analysis, a first attempt at optimization, re-factoring the problem, addition of pre-computation, and measurement of the results. This overview is welcome not only as a good programming war story but also because the overall flow of code optimization is non-obvious (otherwise, "How would you go about optimizing a program?" would not be such a common interview question). The rest of the chapter talks about best practices for each step of optimization. The first topic is timing and profiling, as it should be. All too often, even good programmers measure performance by "feel" - if you don't believe me, search LKML. Sometimes no easy tool exists to measure what is being optimized, but it's still better to write some kind of measurement tool, no matter how clunky or approximate. Human perception and judgment are heavily influenced by preconceptions and the vast majority of theoretical optimizations have negligible effects on performance. A more subtle piece of advice is to turn performance results into pictures or graphs. 
Chris Mason's seekwatcher is an excellent example; it turns block traces into graphs - and even movies! The authors cram a surprisingly complete demonstration of profiling into less than two pages, using prof on their spam-filter as the example. They show how to identify hot spots and do basic sanity checking on the results - e.g., match up the number of times a function call shows up in the profile with the number of iterations of the main loop. While they include some caveats on trusting profiling results, I wish they had spent some time on the design of profiling tools to show the kinds of biases and errors that so often make profiling results misleading. Perhaps it's because I work on systems software, but I've found that I really have to know the details of whether the profiler is using a periodic timer, hardware counters, includes time spent sleeping for IO in the kernel, how many events are dropped or missed, etc. A useful technique to demonstrate, and one in keeping with their minimalist, do-it-yourself philosophy, would be manually bisecting the code with timers to find hot spots when normal profiling tools fail. The discussion on rewriting code goes beyond "find the top function and optimize it" - it also addresses eliminating calls to hot functions entirely and doing modest amounts of pre-computation. A fair portion of the section on code tuning has been superseded by improved compilers which can do, e.g., loop-unrolling automatically, but it still teaches valuable lessons about how to read code and understand its true cost and complexity. Notation The chapter on notation unfolds elegant, beautiful solutions one by one, turning normally painful problems into fun coding exercises. Each technique - little languages, special-purpose notation, programs that write programs, virtual machines - is accompanied by a concrete demonstration of how to implement the bare minimum of the technique to get the job done. 
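As a small illustration of special-purpose notation, the sketch below uses a short format string to describe a binary layout, much as printf() uses one to describe text output. It is only a minimal sketch and not the book's code: the function names pack and unpack, the format letters c, s, and l (1, 2, and 4 unsigned bytes), and the choice of big-endian byte order are all assumptions made for this example.

```python
# Hypothetical mini-notation: each letter in the format string names a field.
# 'c' = 1 byte, 's' = 2 bytes, 'l' = 4 bytes; all unsigned, big-endian.
SIZES = {"c": 1, "s": 2, "l": 4}


def pack(fmt: str, *values: int) -> bytes:
    """Encode one integer per format letter into a byte string."""
    if len(fmt) != len(values):
        raise ValueError("format and argument count differ")
    out = bytearray()
    for code, value in zip(fmt, values):
        # to_bytes() raises OverflowError if the value does not fit the field.
        out += value.to_bytes(SIZES[code], "big")
    return bytes(out)


def unpack(fmt: str, data: bytes) -> tuple:
    """Decode a byte string back into integers, one per format letter."""
    values = []
    offset = 0
    for code in fmt:
        size = SIZES[code]
        chunk = data[offset:offset + size]
        if len(chunk) != size:
            raise ValueError("data too short for format")
        values.append(int.from_bytes(chunk, "big"))
        offset += size
    return tuple(values)


if __name__ == "__main__":
    # A made-up packet: type byte, 16-bit length, flag byte, 32-bit sequence.
    packet = pack("cscl", 1, 0x0203, 4, 0x05060708)
    print(packet.hex())
    print(unpack("cscl", packet))
```

The whole "language" is one dictionary and two loops, yet each packet layout becomes a single readable string instead of a page of shifts and masks. For real work, Python's standard struct module provides a far more complete notation of the same kind.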
The suggestion to "write a new language" seems absurd in the face of most day-to-day programming problems, but writing a very small, very specialized language can save the programmer much time and many bugs, even when replacing only a few hundred lines of conventional code. Their first example, after printf() format specifiers, is a notation for packing and unpacking network packets. I recently implemented this technique and can report that it worked beautifully, repaying the time I invested in it within days of completion. Another exercise in minimalism is their demonstration of how to write a basic grep in around 100 lines of C, without relying on external libraries. Most of us will never need to re-implement regular expressions from scratch, but we may encounter a problem best solved by writing a small general purpose pattern matcher. Another example demonstrates the power (and danger) of keeping a variety of scripting languages and data processing tools at your fingertips. The authors implement a crude text-only web browser with about 50 lines of Awk, Tcl, and Perl, again using only built-in language support and no external libraries or modules. Here as elsewhere, Kernighan and Pike refuse to make hard and fast assertions about the One True Scripting Language; they'd rather you used the right language for the right job. From the book: These languages together are more powerful than any one of them in isolation. It's worth breaking the job into pieces if it enables you to profit from the right notation. It can be argued that this approach is less justified now, given the modern plethora of scripting languages written specifically to address the limitations of earlier scripting languages. However, their argument still rings true for me, as someone who has never settled down into one scripting language. 
I have a decade of experience using a hodge-podge of random scripting languages, and when I do write in one scripting language, I end up spending a lot of time contorting language features to fit situations they were not designed for. The section on virtual machines shows how to implement a minimal special purpose virtual machine (the Z-machine for Zork comes to mind immediately). The remaining sections cover programs that write programs, using macros to generate code (a common technique in Linux header files), and just a little taste of run-time code generation. Summary The Practice of Programming embodies its own principles: simplicity, clarity, generality. First published in 1999, it has aged well due to its focus on general principles of good programming rather than language-specific tricks and tips. The book has something to offer to programmers at all levels of experience; beginners will benefit most but experienced developers will appreciate the more advanced and subtle techniques in the later chapters. Of all the books on the Kernel Hacker's Bookshelf, this one should never be missing. Kernel-based checkpoint and restart Your editor, who has carefully hidden several years of experience in Fortran-based scientific programming from this readership, encountered checkpoint and restart facilities a long time ago. In those days, programs which would run for days of hard-won CPU time on an unimaginably fast CDC or Cray mainframe would occasionally checkpoint themselves, minimizing the amount of compute time lost when (not if) the system went down at an inopportune time. It was a sort of insurance policy, with the premiums being paid in the form of regular checkpoint calls. Central processor time is no longer in such short supply, but there is still interest in the ability to checkpoint a running application and restore its state at some future time. 
One obvious application of this capability is to restore the application on a different machine; in this way, running applications can be moved from one host to another. If the "application" is an entire container full of tasks, you now have the ability to shift those containers around without the contained tasks even being aware of what is going on. That, in turn, can provide for load balancing, or just the ability to move containers off a machine which is being taken down. Linux does not have this capability now. Anybody who thinks about adding it must certainly find the prospect daunting; applications have a lot of state hidden throughout the system. This state includes open files (and positions within the files), network sockets and pipes connected to remote peers, signal states, outstanding timers, special-purpose file descriptors (for epoll_wait(), for example), ptrace() status, CPU affinities, SYSV semaphores, futexes, SELinux state, and much more. Any failure to save and properly restore all of that state will result in a broken process. It is no wonder that Linux does not do checkpoint and restart; most rational developers would be driven away by the complexities involved in making it work in an even remotely robust manner. But, then, there was a time when rational programmers would not have attempted the creation of Linux in the first place. So it should not be surprising to see that developers are working on the checkpoint and restart problem. The latest attempt can be seen in this patch set posted by Dave Hansen (but originally written by Oren Laadan). It is far from being ready for prime-time use, but it does show the sort of approach which is being taken. For some time, the prevailing wisdom was that checkpoint and restart should be pushed as much into user space as possible. A user-space process could handle the marshaling of process state and writing it to a file; the kernel would only get involved when it was strictly necessary. 
It turns out, though, that this involvement is needed fairly often, requiring the addition of "lots of new, little kernel interfaces" to make everything work. So, at a meeting at OLS, the checkpoint/restart developers decided to take a different approach and move the work into the kernel. The result is the creation of just two new system calls, checkpoint() and restart(). A call to checkpoint() will write an image of the current process to the given fd. The pid argument identifies the init process for the current process's container; it is saved to the image but not otherwise used in the current patch. If the operation succeeds, the return value will be a unique (until the system reboots) "checkpoint image identifier". restart() reverses the process; crid is the image identifier, which is not currently used. The flags argument is currently unused in both system calls. These interfaces seem likely to change; future enhancements to the interface are likely to include capabilities like checkpointing other processes and groups of processes. The CAP_SYS_ADMIN capability is currently required for both checkpoint() and restart(). That is somewhat unfortunate, in that it would be nice if ordinary, unprivileged processes were able to checkpoint and restart themselves. There are some real security implications which must be kept in mind, though, especially when one considers the sort of damage that could result from an attempt to restart a carefully-manipulated checkpoint image. Making restart() secure for unprivileged use will not be a job for the faint of heart. At this stage of development, the patch does not even attempt to solve the entire problem. It is able to save the current state of virtual memory (but only in the absence of non-private, shared mappings), current processor state, and the contents of the task structure. That is enough to checkpoint and restart a "hello, world" program, but not a whole lot more. But that is a reasonable place to start. 
Given the complexity of the problem, proceeding in careful baby steps seems like the right way to go. So we're probably not going to have a working checkpoint facility in the kernel in the near future, but, with luck and patience, we'll eventually have something that works. Moving the Data Center, a LinuxWorld Keynote from Kevin Clark Last week your author was in San Francisco attending LinuxWorld 2008. One keynote was from Kevin Clark, Director of IT Operations at Lucasfilm. Lucasfilm is the production company that brought us Star Wars, Indiana Jones and many other movies and related merchandise. As the Director of IT Operations, Kevin is responsible for the IT needs of four separate divisions in five locations. In 2005 the main data center was moved to a new facility; Kevin talked about the challenges and lessons learned in the process of moving a high availability data center, while making three movies and maintaining high security. The four divisions of Lucasfilm all have different needs; to meet those needs, the data center has machines running Linux, Unix, Windows and a few Macs. Industrial Light and Magic (ILM) is the biggest user of Linux. This is the division that does the special effects for Lucasfilm and many other movies such as Disney's "Pirates of the Caribbean" series. LucasArts, Lucas Licensing and Lucas Animation are the other three. These three divisions handle the production of movie-based video games, action figures, official web sites, animated films and other related endeavors. When Hollywood producers want special effects, they want something that hasn't been seen before, something amazing. With each new movie the producer strives to out-do other movies. ILM must be on the bleeding edge of special effects technology, while maintaining high availability and high security. ILM Linux clusters run around the clock, producing "some of the best special effects the industry has to offer." Downtime is not an option, even for a major move. 
Kevin's talk was about moving the data center, and not particularly about Linux. He did have some nice, short films showing off some of ILM's work. Did you know that Pirates of the Caribbean was not filmed on a ship at sea? It's just rendered that way. For the new data center, Kevin knew he wanted to consolidate systems such as email, databases, storage and backup/recovery. He knew he needed flexible power and cooling requirements and a flexible distribution design with lots of storage for the rendering clusters and the backups and also web hosting for movie sites and other related businesses. The center has high bandwidth requirements, both internally and externally. Also, there are always many people trying to get the scoop on the latest movies and games, so high security is paramount. He chose technologies from AMD, Foundry, NetApp, HP and Juniper to accomplish his goals. The new data center has over 700 miles of fiber and over 2000 miles of copper with a global WAN for sites at the Telco depot, Letterman Digital Arts Center, Skywalker Ranch, Big Rock Ranch and Singapore Animation. There are 400 terabytes of storage. The AMD blades have 32 gigabytes of memory and they stack them 66 blades per rack. There are lots of racks and floor to ceiling airflow cools them. When filming, all shots are archived, so there is high volume at all times and complete disaster recovery is required. Kevin had a few lessons that he learned from the data center move: DC power has limitations, equipment interoperability is key and should be built to scale following a network design. The center has needs outside of IT to consider. All the pieces must be fully redundant. You always think that it is fully redundant until it fails. Power and cooling requirements must be balanced. Run the computers hotter to save power, but not so hot that they fail. The data center is a continually moving target with constant pressure to be more energy efficient. More virtualization could help. 
Getting light to move faster would help. We were left to wonder how one might overcome the limitations of DC power, or how to get light to move faster. Those points did get a laugh from the audience though. All in all, one might wish for something more Linux related at LinuxWorld, but it was an entertaining presentation. Details of the DNS flaw revealed Dan Kaminsky spoke to a packed house at Black Hat on 6 August to outline the fundamental flaw he found in the Domain Name System (DNS). Contrary to his hopes, though, the flaw was discovered and publicized before his presentation. The vulnerability is interesting in its own right, but the implications of what can be done with it are staggering. In addition, the "fix" has well understood shortcomings that can still potentially be exploited to poison DNS caches. We reported on the vulnerability in early July, including Kaminsky's request that security folks not publicly speculate about the flaw. As one might guess, that request was largely ignored. When security researcher Halvar Flake published his speculation, another researcher, who was known to have the details of the flaw, publicly confirmed it, but just as quickly removed the confirmation. While it sounds a bit like a security community soap opera, it was fairly clearly caused by the attempt to contain the vulnerability information. An important part of DNS is the ability to delegate to another nameserver. When looking up example.net, first one of the root nameservers is consulted; it does not know the answer so it delegates to one of the nameservers that handles .net addresses. The delegation response includes the names of the servers being delegated to, but also helpfully includes the IP address of those servers as well. It is this helpful addition, which is meant to reduce DNS traffic, that can be exploited. The key to DNS cache poisoning is that the first good answer wins. 
If an attacker can send a packet with all of the proper information, but with his own IP address substituted for the correct one, and that packet reaches the querying server first, the attacker wins. In order for that to happen, the attacker needs to arrange or know that the victim will be making a particular query as well as be able to create a response that will be considered "good". Each DNS query has a 16-bit transaction ID; early implementations just had an incrementing counter, but since that time random transaction IDs have been used. In order for a DNS response to be accepted, it must have the same transaction ID as the request. Just over a year ago, we wrote about a cache poisoning vulnerability in BIND that was caused by a predictable random number generator. When an attacker can narrow down the possible values for transaction IDs, it reduces the number of responses they must generate commensurately. Absent any method to predict transaction IDs, an attacker must send 32K responses on average before the correct response arrives—which is difficult, at best, to do. If the attacker can cause the victim to make multiple requests, though, they can increase their chances. Because DNS servers cache the results of their queries, repeated requests for the same host information will not generate additional lookups. Kaminsky observed that if you make the victim request information about multiple, probably non-existent names in a domain, it will have to make a request to the nameserver responsible for that domain multiple times. If the victim queries for foo1.example.net, foo2.example.net, etc., it will use a different, random transaction ID for each request. The attacker can flood the victim with packets purporting to delegate the request to another server, ns.example.net say, but include an IP address under its control as the IP for that server. 
The net result is that if one of the attacker's responses gets accepted, because it finally guessed the right transaction ID, the victim's nameserver cache has been poisoned. The attacker can control all lookups in the entire example.net domain because it has substituted its own server as the nameserver for that domain. Because of the birthday paradox, the attacker does not need to generate anywhere near 32K responses to have a high probability of having one with a correct transaction ID. In his testing, Kaminsky found that he could poison a cache like this in less than 10 seconds. This technique works all the way up the hierarchy of DNS servers, potentially allowing top-level-domain or root nameservers to be poisoned. It is clearly a very serious flaw that can be exploited in a huge number of ways. Kaminsky's Black Hat slides [Powerpoint format, but viewable in OpenOffice], detail many different implications and are well worth a read. Also, for an excellent description of how DNS works as well as more details on the flaw Kaminsky found, see Steve Friedl's illustrated guide. The "fix" that was rolled out in a coordinated fashion by many different vendors is to randomize the source UDP port for each query. This is a technique that was implemented years ago in Daniel Bernstein's djbdns and has been recommended by various cache poisoning researchers (notably Amit Klein) for some time. By doing this, an attacker must also guess the proper UDP port to send the response to, which can provide up to an additional 16 bits of randomness to the query. In the best case, where all possible UDP source ports are used, that increases the number of possible responses from 64K to over 4 billion. That seems like it would take the attack out of the realm of possibility, but that clearly isn't the case. Kaminsky and the vendors all knew that adding source port randomization only made it harder—not impossible. 
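To put rough numbers on the search spaces mentioned above, here is a minimal Python sketch. It is a simplified model, not a description of Kaminsky's actual attack tooling: it assumes every query draws its transaction ID (and, when randomized, its source port) uniformly at random, and that the attacker lands a fixed number of distinct forged replies before the genuine answer arrives. The function names and the figure of 100 forged replies per query are illustrative assumptions.

```python
from math import expm1, log1p


def search_space(bits: int) -> int:
    """Number of distinct values an attacker must choose among."""
    return 2 ** bits


def average_guesses(space: int) -> float:
    """Mean number of distinct guesses needed to hit one fixed unknown value."""
    return (space + 1) / 2


def success_probability(space: int, forged_per_query: int, queries: int) -> float:
    """Chance that at least one forged reply matches in `queries` attempts.

    Assumption: each query is an independent trial in which the attacker
    gets `forged_per_query` distinct guesses out of `space` possibilities.
    """
    per_query = min(forged_per_query, space) / space
    if per_query >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed stably for very small p
    return -expm1(queries * log1p(-per_query))


TXID_ONLY = search_space(16)       # transaction ID alone
TXID_AND_PORT = search_space(32)   # transaction ID plus a fully random source port

if __name__ == "__main__":
    for label, space in (("txid only", TXID_ONLY), ("txid + port", TXID_AND_PORT)):
        p = success_probability(space, forged_per_query=100, queries=1000)
        print(f"{label}: space={space}, P(success)={p:.6f}")
```

Under these assumptions, the repeated-query trick matters because every new lookup for a fresh name is another independent trial; source port randomization shrinks the per-trial odds by a factor of up to 65,536 but does not reduce them to zero.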
Linux kernel hacker Evgeniy Polyakov has done some experiments with the patched version of BIND on a gigabit ethernet LAN, finding that he could poison a cache in under ten hours. As he points out: "So, if you have a GigE lan, any trojaned machine can poison your DNS during one night." Other solutions are actively being sought, but it is a difficult problem because backward compatibility with countless DNS installations needs to be maintained. As always when a DNS problem is publicized, DNSSEC is touted as the solution. There are numerous technical and political problems that have stood in the way of DNSSEC adoption; those seem unlikely to just disappear. This DNS flaw is serious, but there are plenty of serious internet security issues as Kaminsky points out in his blog: Even if we go from 32 bits of entropy to 128 bits — even if we deploy DNSSec — we're still going to deliver email insecurely. We're still going to have an almost entirely unauthenticated web. We're still going to ignore SSL certificate errors, and we're still going to have application after application that can't autoupdate securely. That, at the end of the day, is a far larger problem than this particular DNS issue. While there may be bigger problems in our internet infrastructure, there are few things that are as pervasive as DNS. Kaminsky points out a number of non-obvious places where it is used—and could be abused—such as mailer lookups of HELO strings to try and decide whether to accept email or web servers doing reverse lookups for logfile messages. It is a little surprising that something so integral had such an obvious, in retrospect, flaw in its design that went undetected for around 25 years. It makes one wonder what else is lurking out there. Block layer discard requests Solid-state, flash-based storage devices are getting larger and cheaper, to the point that they are starting to displace rotating disks in an increasing number of systems. 
While flash requires less power, makes less noise, and is faster (for random reads, at least), it has some peculiar quirks of its own. One of those is the need for wear leveling - trying to keep the number of erase/write cycles on each block about the same to avoid wearing out the device prematurely. Wear leveling forces the creation of an indirection layer mapping logical block numbers (as seen by the computer) to physical blocks on the media. Sometimes this mapping is done in a translation layer within the flash device itself; it can also be done within the kernel (in the UBI layer, for example) if the kernel has direct access to the flash array. Either way, this remapping comes into play anytime a block is written to the device; when that happens, a new block is chosen from a list of free blocks and the data is written there. The block which previously contained the data is then added to the free list. If the device fills up with data, that list of free blocks can get quite short, making it difficult to deal with writes and compromising the wear leveling algorithm. This problem is compounded by the fact that the low-level device does not really know which blocks contain useful data. You may have deleted the several hundred pieces of spam backscatter from your mailbox this morning, but the flash mapping layer has no way of knowing that, so it carefully preserves that data while scrambling for free blocks to accommodate today's backscatter. It would be nice if the filesystem layer, which knows when the contents of files are no longer wanted, could communicate this information to the storage layer. At the lower levels, groups like the T13 committee (which manages the ATA standards) have created protocol extensions to allow the host computer to indicate that certain sectors are no longer in use; T13 calls its new command "trim." Upon receipt of a trim command, an ATA device can immediately add the indicated sectors to its free list, discarding any data stored there. 
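The remapping and free-list behavior described above can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the class `ToyFlashMap` and its methods are hypothetical names, real translation layers work on erase blocks and keep spare capacity, and nothing here reflects the actual UBI code or any device firmware.

```python
from collections import deque


class ToyFlashMap:
    """Toy logical-to-physical block map with a free list (not a real FTL)."""

    def __init__(self, physical_blocks: int) -> None:
        self.free = deque(range(physical_blocks))  # physical blocks ready for use
        self.mapping = {}                          # logical block -> physical block
        self.data = {}                             # physical block -> stored payload

    def write(self, logical: int, payload: bytes) -> int:
        """Write to a fresh physical block, then recycle the old one."""
        if not self.free:
            raise RuntimeError("no free physical blocks")
        new_phys = self.free.popleft()
        self.data[new_phys] = payload
        old_phys = self.mapping.get(logical)
        self.mapping[logical] = new_phys
        if old_phys is not None:
            del self.data[old_phys]
            self.free.append(old_phys)
        return new_phys

    def read(self, logical: int) -> bytes:
        return self.data[self.mapping[logical]]

    def trim(self, logical: int) -> None:
        """Host says the logical block is unused: drop the data, free the block."""
        phys = self.mapping.pop(logical, None)
        if phys is not None:
            del self.data[phys]
            self.free.append(phys)


if __name__ == "__main__":
    dev = ToyFlashMap(physical_blocks=4)
    for block in range(3):
        dev.write(block, b"spam")
    print("free after writes:", len(dev.free))
    dev.trim(1)  # the filesystem deleted the file that owned block 1
    print("free after trim:", len(dev.free))
```

Without the `trim()` call, the model keeps the stale payload mapped indefinitely and the free list only shrinks, which is the situation the text describes for a device that cannot tell deleted data from live data.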
Filesystems, in turn, can cause these commands to be issued whenever a file is deleted (or truncated). That will allow the storage device to make full use of the space which is truly free, making the whole thing work better. What Linux lacks now, though, is the ability for filesystems to tell low-level block drivers about unneeded sectors. David Woodhouse has posted a proposal to fill that gap in the form of the discard requests patch set. As one might expect, the patches are relatively simple - there's not much to communicate - though some subtleties remain. At the block layer, there is a new request function which can be called by filesystems: This call will enqueue a request to bdev, saying that nr_sects sectors starting at the given sector are no longer needed and can be discarded. If the low-level block driver is unable to handle discard requests, -EOPNOTSUPP will be returned. Otherwise, the request goes onto the queue, and the end_io() function will be called when the discard request completes. Most of the time, though, the filesystem will not really care about completion - it's just passing advice to the driver, after all - so end_io() can be NULL and the right thing will happen. At the driver level, a new function to set up discard requests must be provided: To support discard requests, the driver should use blk_queue_set_discard() to register its prepare_discard_fn(). That function, in turn, will be called whenever a discard request is enqueued; it should do whatever setup work is needed to execute this request when it gets to the head of the queue. Since discard requests go through the queue with all other block requests, they can be manipulated by the I/O scheduler code. In particular, they can be merged, reducing the total number of requests and, perhaps, pulling together enough sectors to free a full erase block. 
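As a rough illustration of why merging discard requests can help, the following Python sketch coalesces adjacent sector ranges and reports which erase blocks end up fully covered. It is not the kernel's I/O scheduler logic: the function names are hypothetical, and the erase-block size of 16 sectors used in the example is an arbitrary assumption.

```python
def merge_discards(ranges):
    """Coalesce (start_sector, nr_sects) ranges that touch or overlap."""
    merged = []
    for start, count in sorted(ranges):
        if count <= 0:
            continue
        end = start + count
        if merged and start <= merged[-1][1]:
            # Extends or overlaps the previous range: grow it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end - start) for start, end in merged]


def covered_erase_blocks(ranges, sectors_per_erase_block):
    """Indices of erase blocks lying entirely inside the merged ranges."""
    full = []
    for start, count in merge_discards(ranges):
        first = -(-start // sectors_per_erase_block)      # round up
        last = (start + count) // sectors_per_erase_block  # round down
        full.extend(range(first, last))
    return full


if __name__ == "__main__":
    requests = [(8, 8), (0, 8), (32, 4)]
    print(merge_discards(requests))
    print(covered_erase_blocks(requests, sectors_per_erase_block=16))
```

In this example, neither of the two eight-sector discards covers a whole erase block by itself, but once merged they do, which is the effect the paragraph above hopes for.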
There is a danger here, though: the filesystem may well discard a set of sectors, then write new data to them once they are allocated to a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation, causing the newly-written data to be lost. So discard operations will need to function as a sort of I/O barrier, preventing the reordering of writes before and after the discard. There may be an option to drop the barrier behavior, though, for filesystems which are able to perform their own request ordering. Outside of filesystems, there may occasionally be a need for other programs to be able to issue discard requests; David's example is mkfs, which could discard the entire contents of the device before making a new filesystem. For these applications, there is a new ioctl() call (BLKDISCARD) which creates a discard request. Needless to say, applications using this feature should be rare and very carefully written. David's patch includes tweaks for a number of filesystems, enabling them to issue discard requests when appropriate. Some of the low-level flash drivers have been updated as well. What's missing at this point is a fix to the generic ATA driver; this will be needed to make discard requests work with flash devices using built-in translation layers - which is most of the devices on the market, currently. That should be a relatively small piece of the puzzle, though; chances are good that this patch set will be in shape for inclusion into 2.6.28. Udev rules and the management of the plumbing layer Once upon a time, a Linux distribution would be installed with a /dev directory fully populated with device files. Most of them represented hardware which would never be present on the installed system, but they needed to be there just in case. Toward the end of this era, it was not uncommon to find systems with around 20,000 special files in /dev, and the number continued to grow. 
This scheme was unwieldy at best, and the growing number of hotpluggable devices (and devices in general) threatened to make the whole structure collapse under its own weight. Something, clearly, needed to be done. For a little while, it seemed like that something might be devfs, but that story did not end well. The real solution to the /dev mess turned out to be a tool called "udev," originally written by Greg Kroah-Hartman. Udev would respond to device addition and removal events from the kernel, creating and removing special files in /dev. Over time, udev gained more powerful features, such as the ability to run external programs which would help to create persistent names for transient devices. Udev is now a key component in almost all Linux systems. It's like the plumbing in a house; most people never notice it until it breaks. Then they realize how important a component it really is. Udev is configured via a set of rules, found under /etc/udev/rules.d on most systems. These rules specify how devices should be named, what their ownership and permissions should be, which kernel modules should be loaded, which programs should be run, and so on. The udev rule set also allows distributors and system administrators to tweak the system's device-related behavior to match local needs and taste. Or maybe not. Udev maintainer Kay Sievers has recently let it be known that he would like all distributors to be using the set of udev rules shipped with the program itself. Says Kay: We should all unify as far as possible. Red Hat, SUSE and Gentoo are already using the same rules files, with a minimal rules set on top, in a distro specific file. We ask the rest of the universe to join us, and do the same. This request was surprising to some. A Linux system is full of utilities with configuration files under /etc; there is not normally a push for all distributions to use the same ones. So why should all distributors use the same udev rules? 
The reasoning here would appear to come down to these points: The udev rules files are not really configuration files - they are, instead, code written in a domain-specific language. For a distributor to change those files is akin to patching the underlying C code; far from unheard of, but generally seen as being undesirable. As a way of underscoring this point, the udev developers are moving the udev rules out of /etc and into /lib. There is little reason for distributors to differentiate themselves based on their device naming schemes, and every reason to have all Linux systems use the same device names. For the situations where reasonable distributions may still differ - which group should own a device, for example - there is a mechanism to add distributor-specific rules. Increasingly, other packages will depend on a specific udev setup for the underlying system. Distributors which use their own rules will have a harder time making these new tools work right. That last point refers, in particular, to DeviceKit, a set of tools designed to make the management of devices easier. Between them, udev and DeviceKit are being positioned to replace most of the functionality in the much-maligned hal utility. See this posting from David Zeuthen for lots more information on DeviceKit and the migration away from hal in general. The only problem is that some distributors aren't playing along. Marco d'Itri, the Debian udev maintainer, responded that a common set of udev rules is "not going to happen." The default rules, he says, do not meet Debian's need to support older kernels, and, besides, "I consider my rules much more readable and elegant than yours". Ubuntu maintainer Scott James Remnant is also reluctant to use the default rules. Scott appears to be willing to consider a change to the default rules if it can be made to work right; Marco, instead, seems determined to hold out. 
When encouraged to send patches to improve the default rules (and make them more elegant), he responded: Tell me what's missing from my rules instead, I will fix it and then you will be able to use them. If nothing is missing, then you can replace the files right now. It appears likely that most of the distributors will come to see the udev rules as code which is to be maintained upstream; even Debian may come along eventually. As this happens, the layer of "plumbing" which sits just on top of the kernel should be worked into better shape. Kernel developers may find themselves involved in this process; David has posted a proposal that all new kernel subsystems, before being merged, must be provided with a set of udev rules. That would help the udev developers get a set of default rules into shape before the distributors feel the need to step in to make things work. Increasingly, the operation of the kernel is being tied to a set of low-level user-space applications; there is not much which can be done with a bare kernel. How all of this low-level plumbing should work, and how it should interoperate with the kernel, is still being worked out. The management of udev policies is just one of the outstanding issues. So the upcoming Linux Plumbers Conference would seem to be well timed; there's a lot to talk about. OLS: Audio Streaming over Bluetooth On July 23 Marcel Holtmann delivered a presentation on the state of Audio Streaming over Bluetooth at the 2008 Linux Symposium in Ottawa. Holtmann's background involves working on improving Linux Bluetooth audio support for laptops and embedded systems such as cell phones. Marcel expressed frustration with the complexity of the Bluetooth specifications which include approximately 20 protocols and 40 profiles. Profiles include things like mono headsets, in-car usage and high quality stereo headphones. 
There are protocols for serial device emulation, phone book access, caller ID information, text messaging and multiple options for audio and video. Bluetooth defines separate protocols for streaming and control, such as skipping tracks, seeking within tracks, and displaying ID3 information. Having these aspects split into different protocols was called "messy" because they are always used together. Mono headsets are supported by the Synchronous Connection Oriented link (SCO), while the Advanced Audio Distribution Profile (A2DP) is designed for high quality stereo audio. For audio compression Bluetooth defines a royalty-free SubBand-Codec (SBC) to avoid fees for use of common codecs like MP3 and AAC. All A2DP devices must support SBC, but many also support decoding MP3 and AAC. Linux's SBC support was initially very poor, but some developers from the Instituto Nokia de Tecnologia in Brazil stepped up to improve encoding and now the LGPL SBC implementation rivals some of the best commercial implementations. Early Bluetooth headset support in Linux involved copying all the audio data over sockets from the application to the Bluetooth daemon. The daemon would then copy the data again to the device, causing unnecessary CPU usage and increasing latency. The current design works by setting up channels and connecting external applications directly to the device sockets. Marcel also mentioned investigating a shared memory approach for better performance at the cost of some extra complexity. Adding support for a Bluetooth audio device is quite different than for standard audio hardware: compressed data must be sent directly to the devices, possibly with ID3 and other information. If the audio being played is in a format that a device does not support, it must be decoded and re-encoded first. Bluetooth devices will also appear and disappear while audio is being played. Marcel on ALSA: "I won't touch it anymore."
ALSA's primary failing is that it wasn't designed to support virtual devices. He is also not convinced that the current direction of PulseAudio is suitable for Bluetooth audio; in particular, there is no support for changing codecs while audio is being sent to a device. GStreamer, however, can support the concept of virtual devices, sending out encoded data and sending ID3 information when required. If a file format is supported by a Bluetooth device, GStreamer can easily be told to send it as-is without re-encoding it. It can also handle the passing off of the encoding and decoding tasks to special hardware, which is commonly required for embedded systems. Future work includes adding more intelligence to the handling of control signals. When the user presses Pause and there are multiple devices and streams active, which stream should be affected? The current implementation applies the action to all streams, but it may be better to be able to tell which control device is associated with which stream. There is also ongoing work to support new hardware. Marcel has had some issues with headsets that are very sensitive to timing, but don't provide enough timing information for the problem to be fixed reliably. There have also been some problems supporting "Enhanced" Synchronous Connection-oriented (eSCO) Links due to vendors that are unwilling to cooperate with the developers. For more information on Bluetooth development see Marcel's OLS Paper [pdf] and BlueZ.org, the site for the official Linux Bluetooth protocol stack. Distributions at LinuxWorld 2008 I went to LinuxWorld last week primarily to lead a Birds of a Feather discussion, the title of which was "Which Linux Distribution is Right for Me?" It seemed to be generally well received, though a few people left early after it became clear that there were no flashy slides, nor was I going to reveal the "One True Linux Distribution". I don't believe there is one true distribution, just as there is no one true use for Linux.
So I pointed people to The List and we talked about a few distributions that might meet some specific needs that people had. There was plenty of time left over to walk around the Expo, looking for distribution booths on the show floor. Oracle had a big booth to the right of the entrance. Access was on the other side. The Linux Garage was an interesting place, full of various embedded devices. Did you know that the Open Moko phones are currently available with three versions of their OS? Version 2007.2 is the oldest. It uses gtk and supports caller dialing contacts. The ASU 2008.8 OS is based on Qt. The latest and greatest Open Moko system is the FSO (FreeSmartphone.Org) which makes use of gtk, Qt and Python. Next up will be a version using Trolltech's Qtopia for the GreenPhone. The NSLU2 comes with Debian or OpenWRT. OpenWRT is also used in the FON wireless router and the Meraki wireless router. The latter can be managed via a web interface. OpenWRT will also run on ASUS WL520GU and the Gateway Avila, but it is not installed by default. Canonical had a large booth. In one half they were showing off Netbooks, with the Ubuntu remix for the Netbook. The other half had various business partners showing off the software packages that were available on Ubuntu. Ubuntu was also the distribution of choice at the Installfest. Xubuntu was used on the really low memory machines. Untangle was a major sponsor of the Installfest. Linpus and gOS had crowded booths, so I didn't get very close. I did find some pictures from the gOS booth. Fedora and openSUSE had booths in the .org pavilion, where I stopped for a quick chat but didn't get any pictures. Fedora had computers from Shuttle, with Fedora pre-installed. openSUSE's mlasars had this to say about LWE 2008. Linux Magazine's Joe Casad interviewed Fedora's Karsten Wade (video) and Karsten had some reflections on his blog. I also stopped at the Vyatta booth.
I reviewed Vyatta briefly several years ago, but at that time the distribution didn't do DHCP. The new version of Vyatta does DHCP, VPN and lots of other things. Vyatta recently announced a firewall/router product that they plan to start shipping in a few weeks. Foresight joined up with Shuttle Computers at their booth. Small and quiet Shuttle computers were also at the Fedora booth. Shuttle will install Foresight or Fedora (and probably other distributions) if you like. Foresight is based on rPath and has been known for closely following the GNOME desktop. It seems that Foresight is now planning on a KDE edition. Chandler finally reaches a 1.0 release The Chandler project has been around since 2001, periodically releasing new versions of its personal information management (PIM) tool, but never quite reaching the 1.0 milestone - until now. Over that time, Chandler has undergone various major revisions of both code and philosophy, while the rest of the software industry has hardly been standing still. Whether Chandler is relevant or important going forward is an open question, but it does have some interesting ideas as well as potentially useful code. Chandler is the brainchild of Mitch Kapor, of Lotus 1-2-3 fame, who started the project as part of his Open Source Applications Foundation (OSAF). Kapor and others have funded OSAF to work on Chandler over the last seven years, but in January all that changed. Kapor announced that he was leaving the board and only continuing to finance Chandler until the end of 2008. The 1.0 release is to some extent a "last gasp" attempt to build a community of users and developers to continue Chandler development down the road. Since the time when Chandler was originally envisioned as a shareable calendar and information manager, many other, similar tools have come about. Evolution is a free software example, while Google Calendar is popular, but proprietary and closed.
Neither of those cover the full feature spectrum that Chandler aspires to, but they have been available for quite some time. The idea behind Chandler will be familiar to those who know about the Getting Things Done system. Organizing and integrating to-do lists, calendar events, email, and notes into a single system—and single application—is the driving force. These items (known as "notes") can be tagged into various collections (like Home, Work, etc.), assigned as events in the calendar, or mailed to others. The calendar works like one would expect. Events have the standard fields: start/end time, frequency for recurring events, various alarm options, etc. Events get color-coded based on their collection and the calendar itself can be viewed at various granularities: day, week, or month. Based on their proximity in time, as well as user choice, events get "triaged" into categories of "Done", "Now", or "Later". There are multiple synchronization options available with Chandler. Keeping calendars in sync amongst multiple different systems, with different import/export formats is clearly something that the Chandler team focused on. Because Chandler is cross-platform—written in Python and available on Linux, OS X, and Windows—it can interface both with tools that run on those platforms as well as with internet services like Google Calendar. As yet there is no Outlook/Exchange synchronization available which leaves out a rather large portion of the potential audience one would guess. The Chandler desktop is only one of two pieces of the Chandler project; the other is the Chandler server. It is the means to share Chandler information, either with other users or just with other computers. Data can be synchronized to the server, then retrieved on another Chandler desktop elsewhere. For those that do not want to run their own server, the project runs a version of the server as the Chandler hub, which offers free accounts. The 1.0 release looks like a solid tool. 
It has some enthusiastic users, but will that translate to a larger development community? Chandler development has always been directed—and funded—by the OSAF, so it suffers from a smaller development community than it might have otherwise. Projects that start as proprietary, but then open their code, sometimes have difficulties allowing a community to influence or control the direction of that code thereafter. We have seen that with OpenSolaris and other projects. Chandler seems to suffer from some of those same problems, even though it came about differently. By removing the funding, Kapor may well have jump started Chandler development. Seven years is a long time by any standard, but for software, it is an eternity. By keeping a relatively tight grip on the direction of the project, the OSAF may well have kept interested folks who were not on their payroll from getting involved. If the project can move to a more open style, with frequent releases, it may be able to regain some of that lost time. It is an intriguing tool, but it is way behind schedule. GeekPAC to fight for information rights There's little question that plenty of people are annoyed at how difficult it is to rip movies from legally purchased DVDs into formats readable by handheld devices or media players. The lack of consistency in document formats is an ongoing headache for anyone who receives files that are only readable with certain software. Information rights management has become enough of a frustration that a group has formed specifically to deal with the problem head on. GeekPAC is a political action committee made up of volunteers who are taking their complaints straight to Capitol Hill. Last year California Assemblyman Mark Leno authored AB 1668, a bill designed to encourage the state to adopt the Open Document Format as the standard format for government documents. Not surprisingly, Microsoft came out against the bill and it was eventually struck down in committee. 
CollabNet Community Manager and longtime FOSS supporter John Mark Walker was angry. Realizing that the open source community had no voice during the hearings and no way to fight back against the opposition's lobbyists, Walker decided to mobilize support from within the ranks of the FOSS community and let them do what they do best — rally behind a cause and prove once again that there's strength in numbers. So he founded GeekPAC. GeekPAC's goal is to pull together enough funding — a mere $2,200 — to file the necessary paperwork to be formally recognized by the Federal Elections Committee as a Political Action Committee (PAC). Then the group will locate politicians or candidates in the House and Senate who support hot-button technology issues like copyright reform and net neutrality. Once identified, GeekPAC will help support their campaigns and lobby together for change. "If all we do is fund some campaigns, create a few attack ads, and do the occasional lobbying, I'll be pretty disappointed," says Walker. "The real goal here is to educate people as to why they should care. Frankly, those of us who care about our rights in the information age have done a really poor job of communicating the importance or relevance." Indeed, Walker suggests that ambiguous verbiage and a lack of communication with people outside the tech industry has been the biggest hindrance to effecting large-scale change. "One of the problems is that we insist on using terms like 'digital rights,' the usage of which basically leaves out a large percentage of the population. Most people don't know what that means, and they assume that digital doesn't include them, because they don't work in the tech industry and have little contact with people who do. So lots of digerati swing around their proverbial phalli and talk 'digital rights' this and 'DRM' that, and it becomes a kind of high-tech circle jerk that is constraining and ultimately self-limiting." 
A better approach, he says, would be to frame these important issues as "information rights." Once people realize that the bills politicians are voting on aren't about obscure concepts but rather affect human rights at a basic level, Walker is confident GeekPAC will make great strides toward changing minds at the national level. "It's really about the free flow of information and letting free markets do their job. Once you start there, it's a quick hop and a skip down the path of the founding principles of this great country," explains Walker. He goes on to note that these issues affect people at every socio-economic level, from patents that limit free market trade, to "information restrictions that affect our ability to adequately educate the public." Walker asserts that without a total overhaul of the United States patent and copyright laws, the information divide will never narrow, and will ultimately lead to larger problems down the road. "It's really about education, innovation, and reducing the bar to entry so that America can remain competitive in the 21st century." One of the overriding reasons Walker chose to launch GeekPAC now is that this is an important election year and political issues are on the minds of many. Though he acknowledges people have been discussing these topics for years, talking just isn't enough. "In the 10 years that have passed since the DMCA, we still haven't been able to mount a credible reform effort, and countless horrible things have taken place on our watch that co-opt our so-called inalienable rights. We must do more, and I can't think of a better time to do more than an election year," he says. GeekPAC is taking a multi-faceted approach to locating politicians to support. The group's supporters and volunteers are encouraged to recommend candidates who they know believe in GeekPAC's goals and direction. Politicians can also contact the group directly and ask to be considered for backing from GeekPAC.
Once chosen, candidates are asked to sign a simple pledge promising to "protect my constituents' fair use rights to information [and] support the use of open standards in government for the storage and archiving of public data." Walker says GeekPAC is most interested in helping candidates who take a strong stance on open standards and open access, copyright reform, patent reform, and net neutrality. "Obviously, we'll be most enthusiastic about candidates who support all of those, but we will help campaign for candidates who support at least one of those items." The name GeekPAC may ring a bell for those who have been around the FOSS community for a while. A similar group was formed more than five years ago but never quite got off the ground. Though the two organizations don't share any common members, they do have the same goals — and an affection for the domain name. Before GeekPAC morphed into its current state, it was known as BytesFree — a similar group, but without the political slant. Walker says he originally planned to stay with that name, until he learned that the geek-pac.org domain was available, and then everything fell into place. Walker formally launched GeekPAC at last week's LinuxWorld Expo by hosting a Birds of a Feather get-together at the end of a long day of sessions. While current and would-be volunteers strategized and planned, Walker took a few minutes to share the group's vision with notable columnist and FOSS supporter Doc Searls. Though GeekPAC's premise is strong, not everyone is convinced of its viability. LinuxWorld community blogger Don Marti says the idea is likely to fail, in part, because of a poor choice of names. He claims the inclusion of the term "geek" is insulting and suggests it doesn't relay the true goals of the group. "Creative Commons is a great name. Electronic Frontier Foundation is pretty good," Marti suggests. 
"You have to get in some words that imply that the people in the organization actually make something useful and that the organization's goals are public goods. Network Growth and Productivity Council?" Marti also notes that GeekPAC should include singers, podcasters, and other sub-groups affected by information rights. Though the underlying commonality among the members of GeekPAC is an understanding of how these issues impact the FOSS community, Marti says that's not enough of a reason to form a splinter group of nothing but techies. "There's a community that already exists around these issues — why split off the subset of EFF supporters who happen to be into free software?" asks Marti. "Of course EFF itself can't be involved because they're tax-exempt, but the target is clearly the same people, and their friends and colleagues. A 'free software users for DMCA reform' group would be like 'cat owners for a balanced budget'." At the end of the day, it won't be the group's name or membership demographic that decides GeekPAC's success. Walker says it will be "When politicians and candidates start referencing us by name because our influence is large enough to matter." Why the JMRI decision matters The Java Model Railroad Interface (JMRI) project is not one to sit at the top of the Debian popularity contest results; it provides tools for model railroad enthusiasts. But the legal wrangling around JMRI has made it one of the more important projects in our community at this time. JMRI has suffered some legal setbacks, but much of that was turned around by the US Federal Circuit Court of Appeals on August 13. The result is a vindication for much of the legal reasoning behind free software licenses. JMRI was charged with patent infringement back in 2006. As part of the legal counterattack, JMRI developer Robert Jacobson charged patent holders Michael Katzer and Kamind Associates, Inc. with copyright infringement for its use of JMRI code. 
The Federal District Court in this case had concluded that the terms of the Artistic License were contract terms, and not conditions on the copyright license itself. That ruling was seen as a major setback. The authors of free software licenses have gone to great lengths to restrict themselves to copyright licensing and to avoid contract law altogether. There are a couple of important reasons for this: A contract is only binding if all parties have voluntarily entered into it. There have been mutterings from some corners for years that licenses like the GPL are not truly enforceable because the recipients of software under those licenses have never signed the relevant contracts. Such mutterings have become relatively hard to hear, but they are still out there. A software license is, instead, a unilateral grant of privilege which does not require agreement. As such, it should be easier to enforce. Violation of the terms of a contract sets up the guilty party to be sued for damages. Copyright infringement, instead, allows for injunctive relief, allowing the copyright owner to immediately shut down the infringing activity. Many of those who would ignore the terms of free software licenses fear injunctions far more than they fear suits for damages. Both points are crucial. If you look at clause 5 of the GNU General Public License (version 2, in this case), you read: You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Anybody who distributes a copyrighted work will be doing so in violation of the author's exclusive rights. If a distributor has a license from the owner, though, then this distributor has a legal defense.
The question raised in this case was, in summary, this: if somebody distributes free software without adhering to the terms of the license, does that somebody still have a license at all? The District Court ruled that this person did, indeed, still have a license to distribute the software, though they might be liable for damages for not having followed all of the terms. The Appeals Court, instead, said that failure to hold to the conditions meant that the license simply did not exist; distributing free software in a manner contrary to its license is copyright infringement, not breach of contract. This decision was reached in a sufficiently high court that the conversation should be finished in the United States; we now have a high-level legal precedent that software licenses are licenses, and that they can be enforced with injunctions. In US-style law, precedents are everything; the absence of a clear precedent always causes a certain degree of legal uncertainty. We now have that precedent; as a result, anybody seeking to enforce a free software license in the US is now standing on firmer ground. There are some other interesting conclusions to be drawn from this ruling. Copyright law in the US does not recognize any sort of moral rights to copyrighted works; it is, in classic American style, all about the protection of economic rights. Some have argued that, since free software is, well, free of charge, there is no economic harm in violating its licenses, and, thus, copyright law has nothing to say. But the Appeals Court saw things differently, stating that there was a clear economic interest in the Artistic license: The clear language of the Artistic License creates conditions to protect the economic rights at issue in the granting of a public license. These conditions govern the rights to modify and distribute the computer programs and files included in the downloadable software package. 
The attribution and modification transparency requirements directly serve to drive traffic to the open source incubation page and to inform downstream users of the project, which is a significant economic goal of the copyright holder that the law will enforce. So the reasoning that free software licenses are unenforceable due to the lack of an economic interest fails to hold water. Similarly, the interesting idea that free software license incompatibility does not really exist, recently promoted on LWN by Brian Cantrill, seems unlikely to stand up to serious scrutiny. Some voices on the net have worried that this ruling could also give sharper teeth to exploitive proprietary end user license agreements. The Electronic Frontier Foundation is one example: While we're pleased to see a panel of learned judges endorse the legal foundations of the open source software paradigm, the decision may also encourage proprietary software vendors who frequently fill their "end user license agreements" with restrictions that are denominated as "conditions" on the license. If violating a "condition" in a EULA results in copyright infringement liability, what's to stop a software vendor from imposing conditions that are unrelated to copyright law (e.g. an agreement not to disparage the copyright owner, or to wear pink bunny ears on Tuesdays), or even antithetical to copyright law (e.g. a waiver of fair use rights)? If this comes to pass, restrictions on reverse engineering, publication of reviews, lack of bunny ears, etc. may, indeed, become easier to enforce. Such an outcome would not necessarily be a bad thing for users of free software, though. If anything, it will simply make the value of freedom that much more clear. Finally, it is worth noting well that this outcome did not just happen on its own. 
Behind the scenes, concerned lawyers from groups like the Stanford Center for Internet and Society and the Electronic Frontier Foundation, who have understood all along what was at stake here, have put in a great deal of work to get this ruling. They were successful despite the fact that the old Artistic License was not the strongest position to be arguing from. Many of us would prefer to not have to think about legal issues much of the time. But we should be happy and grateful that some very capable people have been willing to put in the effort to defend our rights in cases like this one. (The full ruling is available in PDF format, or in plain text on Groklaw). Triggers: less busy busy-waiting Kernel code must often wait for something to happen elsewhere in the system. The preferred way to wait is to use any of a number of interfaces to wait queues, allowing the processor to perform other tasks in the mean time. If the kernel code in question is running in an atomic mode, though, it cannot block, so the use of wait queues is not an option. Traditionally, in such situations, the programmer simply must code a busy wait which sits in a tight loop until the required event takes place. Busy waits are always undesirable, but, in some situations, they become even more so. If the wait is going to be relatively long, it would be better to put the processor into a lower power state. After all, nobody cares if it executes its empty loop at full speed, or, even, whether the loop executes at all. If the wait is running within a virtualized guest, the situation can be even worse: by looping in the processor, a busy wait can actively prevent the running of the code which will eventually provide the event which is being waited for. In a virtualized environment, it is far better to simply suspend the virtual system altogether than to let it busy wait. Jeremy Fitzhardinge has proposed a solution to this problem in the form of the trigger API. 
A trigger can be thought of as a special type of continuation intended for use in a specific environment: situations where preemption is disabled and sleeping is not possible, but where it is necessary to wait for an external event. A trigger is set up in either of the two usual patterns: There is a sequence of calls which must be made by code intending to wait for a trigger: Triggers are designed to be safe against race conditions, in that if a trigger is fired after the trigger_reset() call, the subsequent trigger_wait() call will return immediately. As with any such primitive, false "wakeups" are possible, so it is necessary to check for the condition being waited for and wait again if need be. Code which wishes to signal completion to a thread waiting on a trigger need only make a call to: This code should, of course, ensure that the waiting thread will see that the resource it was waiting for is available before calling trigger_kick(). A reader of the generic implementation of triggers may be forgiven for wondering what the point is; most of the functions are empty, and trigger_wait() turns into a call to cpu_relax(). In other words, it's still a busy wait, just like before except that now it's hidden behind a set of trigger functions. The idea, of course, is that better versions of these functions can be defined in architecture-specific code. If the target architecture is actually a virtual machine environment, for example, a trigger can simply suspend the execution of the machine altogether. To that end, there is a new set of paravirt_ops allowing hypervisors to implement the trigger operations. Jeremy has also created an implementation for the x86 architecture which uses the relatively new monitor and mwait instructions. In this implementation, a trigger is a simple integer variable. A call to trigger_reset() turns into a monitor instruction, informing the processor that it should watch out for changes to that integer variable. 
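The reset/wait/kick pattern described above can be modeled in user space. What follows is a toy sketch in Python, not the kernel API: the class `ToyTrigger` and the helper `wait_until()` are hypothetical names invented for illustration, and a `threading.Event` stands in for whatever the architecture code would really do. It only aims to show why a kick arriving after the reset makes the next wait return at once, and why the waiter rechecks its condition in a loop.

```python
import threading


class ToyTrigger:
    """Userspace model of the reset/wait/kick pattern (not the kernel API)."""

    def __init__(self):
        self._fired = threading.Event()

    def reset(self):
        # Arm the trigger: forget any kick that happened before this point.
        self._fired.clear()

    def wait(self, timeout=None):
        # Return True as soon as kick() has run since the last reset().
        return self._fired.wait(timeout)

    def kick(self):
        # Wake the waiter; the caller publishes its state change first.
        self._fired.set()


def wait_until(trigger, condition, timeout=1.0):
    """Re-arm, recheck, wait: tolerates early kicks and false wakeups."""
    while True:
        trigger.reset()
        if condition():
            return True
        if not trigger.wait(timeout):
            return False  # this toy gives up; a real busy wait would not


if __name__ == "__main__":
    trigger = ToyTrigger()
    state = {"ready": False}

    def producer():
        state["ready"] = True  # make the resource visible first...
        trigger.kick()         # ...then signal the waiter

    worker = threading.Thread(target=producer)
    worker.start()
    print("condition seen:", wait_until(trigger, lambda: state["ready"]))
    worker.join()
```

The ordering in `wait_until()` is the point of the sketch: because the condition is checked after the reset, a kick that lands between the check and the wait is not lost.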
The mwait instruction built into trigger_wait() halts the processor until the monitored variable is written to. No more busy waiting is required. There is a certain elegance to the monitor/mwait implementation, but Arjan van de Ven worries that it may prove to be too slow. So changes to the x86 implementation are possible. There have not been a lot of comments about the API itself, though, so the trigger functions may well make it into the mainline in something close to their current form. In defense of Ubuntu Criticisms of the Ubuntu distribution and Canonical, its corporate sponsor, are not hard to come by. Depending on who is speaking, Ubuntu and Canonical are guilty of profiting from the free software community without giving back to it, forking important projects or distributions, legitimizing the use of binary-only system components, and more. Of all of these gripes, it is the "contributing to the community" complaint which is heard most. If one believes these complaints, Ubuntu is a parasitic operation which does not understand how the community works and which is harmful to the community as a whole. Your editor would like to submit that these charges are overblown. Ubuntu is far from perfect, and it could certainly give back more than it does, but Ubuntu does not deserve the level of opprobrium it is receiving from certain parts of our community. It is interesting to note that there appears to be a special place for distributors among those who would criticize. Red Hat, it has been said, drives things toward its own profit and has, in the past, pushed far too much bleeding-edge software on its long-suffering users. Fedora is accused of remaining insufficiently open, excessively bleeding-edge, and refusing to make the watching of flash videos just work. Novell/SUSE has done a deal with the devil. Debian, we are told, is simultaneously too chaotic and too bureaucratic, and it can never get a release out on time. 
Some charge that Gentoo's community is dysfunctional, and that, in any case, it's made up of people with too much time on their hands. And Ubuntu stands accused of taking the work of others while failing to give back to or even credit the community from which it draws its software. It is not surprising that distributors are specially blessed with this sort of criticism. Most free software users never deal directly with the upstream projects which create the software they use. Instead, they get it all from a single middleman - the distributor. So the distributor has a great deal of influence over what kind of experience those users have; the distributor is also the obvious guilty party when things seem to go wrong. Lots of people have opinions about their distributor, but they know little about the projects that actually develop their software. That said, much of the criticism of Ubuntu is coming from the developer community, which does have a more detailed view of the full ecosystem. It is worth thinking about why that might be. While Ubuntu's contributions may not be as high as one might like, they are most certainly not zero. There are Ubuntu developers who are Debian developers, X.org developers, GNOME developers, and so on. If this page is to be believed, Ubuntu developers are also contributing to the HURD. The page does not say why, sorry. The developers who castigate Ubuntu are uniformly silent about the number of kernel patches coming from the Mandriva camp. They have nothing to say about how much Xandros gives back to Debian. Nobody totals up contributions from Gentoo. There are no complaints about Slackware's presence in the community. Arch Linux developers do not hear that they are not doing enough. There are no high-profile articles on how rPath is taking advantage of free software developers. Yet Ubuntu's contributions most likely exceed those from all of the distributions named here, with the possible (but far from certain) exception of Gentoo.
Ubuntu, it would seem, is being held to a higher standard than many of its peers. One reason for Ubuntu's special treatment must certainly be its nature as the cool kid who showed up out of nowhere. Sudden success can breed a certain amount of animosity, especially when much of that success is perceived to be built on the work of others. It is a rare distribution list which has not seen the occasional "I'm tired of your distribution, I'm moving to Ubuntu now" message; that kind of stuff gets old after a while. And when something gets old and irritating, it's tempting to respond in a short-tempered way. But the real reason must be elsewhere: Ubuntu has overtly set itself up to be held to a higher standard. It has been positioned as a strongly community-oriented distribution with the mission of saving the world for free software. Debian-derived distributions which make less noise about community - Xandros, say - receive less grief for their lack of participation in the community. Nobody expects anything from them, so nobody complains. But people do expect something different from Ubuntu; it's supposed to be a part of our community. So when it seems that Ubuntu is not contributing patches upstream or that it's maintaining forks of important software components, and when tools like Launchpad remain proprietary, it feels like a promise has not been kept. There is no doubt that Ubuntu could do better than it has. But we should not lose track of what Ubuntu has done. Ubuntu has created a distribution which appeals to a whole new class of Linux users. The fact that much of this work was done elsewhere notwithstanding, Ubuntu has shown that a Linux system can wear a friendlier, easier-to-use face. In the process, it has made Debian suitable for a larger class of users. Ubuntu has shown that a Debian-based distribution can make regular, stable releases and still ship contemporary software. 
Ubuntu has lived up to its promises of support, including providing top-quality security support. And all of this is happening in a way that, we are told, should become commercially self-sustaining at some point. On top of all this, Ubuntu employs a number of developers who work within the community. Yes, it would be a good thing if there were more of these developers. It would also be good if more fixes and enhancements escaped Ubuntu's repositories and made it back upstream. Ongoing encouragement at all levels should help to make this happen. But, as we encourage Ubuntu to live up to its ambitious goals of being a full member of our community, we should not lose our perspective. We are, beyond doubt, richer as a result of Ubuntu's existence. Desktop talks from LinuxWorld 2008 Conference The LinuxWorld 2008 (August 4 - 7) Conference program had plenty of talks that sounded interesting. Unfortunately I only found time to attend two talks, both from the Desktop Linux Track. The first was from John Walicki, Open Client Architect at IBM, who presented "Desktop Linux Architects Speak Out". The second was from Don Hardaway and Craig Van Slyke, professors at John Cook School of Business and Saint Louis University, respectively, who entitled their talk "Open Source on the Desktop: Why Not?". There were a couple of common themes in both of these talks. First was that Linux is ready for the general desktop. The second was that the desktop effects of Compiz and similar technologies are vital for attracting people to the Linux desktop. Wobbly windows may not be very useful in practice, but putting a presentation on a cube can be effective. Mostly though it's the "wow factor" that gets people's attention. In many cases, open source applications are just as good as, or better than, their proprietary counterparts. Don and Craig did a study in which they asked university business students to recreate documents and spreadsheets that they had previously done using MS Office.
Twenty-eight of 28 students thought that it was just as easy to produce documents of equal quality with OOo Writer. OOo Calc was similarly approved by 26 of the 28 students. There were areas where John Walicki thought Linux needed improvement. Accessibility, making computers useful for people with disabilities, is an important area, as is power management, making computing greener by using less electricity. Linux is greener when it comes to keeping old hardware working longer. One big plus is collaboration, getting KDE applications to run seamlessly on GNOME and vice versa, or when multiple distributions adopt a single tool (upstart, PackageKit, etc.). The collaboration enables the tools to become much better, much faster. John's assessment of the State of Linux Desktop is that it is growing, with hot products that are making rapid changes. Preloads are well established, and Linux is the hottest technology in emerging markets, appliances, and green computing. His forecast is for steady growth. Don Hardaway and Craig Van Slyke had a different perspective as academics. They study people, and looked at why people choose one technology over another. Don presented the '3 leg stool' model for acceptance of technology. There are the 'tech leg', the 'people leg' and the 'organizational leg'. The open source tech leg gets the most attention, and the organizational leg is getting better, but the people leg has been neglected. The first thing about getting people to try new technologies is to realize that people resist change. However the perception of risk is relative to their knowledge. Those of us that use open source technology on a regular basis are comfortable with it, but for those who don't know anything about it there is a perceived risk that makes them reluctant to try it. If they learn more about open source the perception of risk is reduced. There are stages in technology adoption. First people must be aware that it exists. 
Then something about it must attract their interest. Once that happens they are more willing to evaluate the technology. If the evaluation is favorable, they will try it out. Many of Don and Craig's students had never heard of Linux. Once they had heard, things like the desktop effects of Compiz got their interest. Some began to evaluate Linux, and some are probably still using it. To gain the relative advantage, Linux must be better than the competition. Linux costs less and is virus free, but, in the absence of a good image, people will be reluctant to try it. Craig thought gOS had a good image, but the ease-of-use was not there in all cases. Wireless, streaming media and some applications were difficult for him to get going. Craig found the EeePC with Xandros was very easy to use and he got everything going without resorting to the command line. He thinks the Netbooks will give Linux another boost. So the average user might find sharper graphics appealing, but if things don't work the way they expect or they have to resort to the command-line to get it done, they won't switch. To get more people to switch, a good first step is to hand out live CD/DVDs to people that have never heard of Linux. Explain that they can play around with Linux and then take the disc out of the drive and reboot to whatever was there before. If they realize that Linux can also extend hardware life, they just might be sold. Tangled up in threads Certain kinds of programmers are highly enamored with threads, to the point that they use large numbers of them in their applications. In fact, some applications create many thousands of threads. Happily for this kind of developer - and their users - thread creation on Linux is quite fast. At least, most of the time. A situation where that turned out not to be the case gives an interesting look at what can happen when scalability and historical baggage collide. 
A user named Pardo recently noted that, in some situations, thread creation time on x86_64 systems can slow significantly - as in, by about two orders of magnitude. He was observing thread creation rates of less than 100/second; at such rates, the term "quite fast" no longer applies. Happily, Pardo also did much of the work required to track down the problem, making its resolution quite a bit easier. The problem with thread creation is the allocation of the stack to be used by the new thread. This allocation, done with mmap(), requires locating a few pages' worth of space in the process's address range. Calls to mmap() can be quite frequent, so the low-level code which finds the address space for the new mapping is written to be quick. Normally, it remembers (in mm->free_area_cache) the address just past the end of the previous allocation, which is usually the beginning of a big hole in the address space. So allocating more space does not require any sort of search. The mmap() call which creates a thread's stack is special, though, in that it involves the obscure, Linux-specific MAP_32BIT flag. This flag causes the allocation to be constrained to the bottom 2GB of the virtual address space - meaning it should really have been called MAP_31BIT instead. Thread stacks are kept in lower memory for a historical reason: on some early 64-bit processors, context switches were faster if the stack address fit into 32 bits. An application involving thousands of threads cannot help being highly sensitive to context switch times, so this was an optimization worth making. The problem is that this kind of constrained allocation causes mmap() to forget about mm->free_area_cache; instead, it performs a linear search through all of the virtual memory areas (VMAs) in the process's address space. Each thread stack will require at least one VMA, so this search gets longer as more threads are created. 
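The cost of that linear search can be illustrated with a small simulation. This is a toy model under loud assumptions: the address space is a sorted Python list of `(start, end)` ranges, addresses and sizes are arbitrary integers, and `alloc_first_fit()` and `total_examined()` are hypothetical helpers, not a rendering of the kernel's code. It counts how many existing mappings a bottom-up first-fit search has to look at as stacks accumulate.

```python
import bisect


def alloc_first_fit(vmas, size, limit):
    """Linear first-fit search below `limit`.

    `vmas` is a sorted list of (start, end) ranges. Returns a tuple of
    (address or None, number of VMAs examined) and records the new range
    on success.
    """
    addr = 0
    examined = 0
    for start, end in vmas:
        examined += 1
        if start - addr >= size:
            break  # the hole in front of this VMA is big enough
        addr = max(addr, end)
    if addr + size > limit:
        return None, examined  # no room below the limit
    bisect.insort(vmas, (addr, addr + size))
    return addr, examined


def total_examined(n_stacks, stack_size, limit):
    """Sum the search work for n_stacks back-to-back constrained allocations."""
    vmas = []
    total = 0
    for _ in range(n_stacks):
        addr, examined = alloc_first_fit(vmas, stack_size, limit)
        total += examined
        if addr is None:
            break
    return total


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, "stacks ->", total_examined(n, 1, 10**9), "VMAs examined")
```

In this model the k-th allocation walks past all k-1 earlier stacks, so the total work grows with the square of the thread count, whereas an allocator that simply remembered the end of the previous allocation would do constant work each time.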
Where things really go wrong, though, is when there is no longer room to allocate a stack in the bottom 2GB of memory. At that point, the mmap() call will return failure to user space, which must then retry the operation without the MAP_32BIT flag. Even worse, the first call will have reset mm->free_area_cache, so the retry operation must search through the entire list of VMAs a second time before it is able to find a suitable piece of address space. Unsurprisingly, things start to get really slow at that point. But the really sad thing is that the performance benefit which came from using 32-bit stack addresses no longer exists with contemporary processors. Whatever problem caused the context-switch slowdown for larger addresses has long since been fixed. So this particular performance optimization would appear to have become something other than optimal. The solution which comes immediately to mind is to simply ignore the MAP_32BIT flag altogether. That approach would require that people experiencing this problem install a new kernel, but it would be painless beyond that. Unfortunately, nobody really knows for sure when the performance penalty for large stack addresses went away or how many still-deployed systems might be hurt by removing the MAP_32BIT behavior. So Andi Kleen, who first implemented this behavior, has argued against its removal. He also points out that larger addresses could thwart a "pointer compression" optimization used by some Java virtual machine implementations. Andi would rather see the linear search through VMAs turned into something smarter. In the end, MAP_32BIT will remain, but the allocation of thread stacks in lower memory is going away anyway. Ingo Molnar has merged a single-line patch creating a new mmap() flag called MAP_STACK. This flag is defined as requesting a memory range which is suitable for use as a thread stack, but, in fact, it does not actually do anything. 
Ulrich Drepper will cause glibc to use this new flag as of the next release. The end result is that, once a user system has a new glibc and a fixed kernel, the old stack behavior will go away and that particular performance problem will be history. Given this outcome, why not just ignore MAP_32BIT in the kernel and avoid the need for a C library upgrade? MAP_32BIT is part of the user-space ABI, and nobody really knows how somebody might be using it. Breaking the ABI is not an option, so the old behavior must remain. On the other hand, one could argue for simply removing the use of MAP_32BIT in the creation of thread stacks, making the kernel upgrade unnecessary. As it happens, switching to MAP_STACK will have the same effect; older kernels, which do not recognize that flag, will simply ignore it. But if, at some future point, it turns out there still is a performance problem with higher-memory stacks on real systems, the kernel can be tweaked to implement the older behavior when it's running on an affected processor. So, with luck, all the bases are covered and this particular issue will not come back again. Standards, the kernel, and Postfix Standards like POSIX are meant to make life easier for application developers by providing rules on the semantics of system calls for multiple different platforms. Sometimes, though, operating system developers decide to change the behavior of their platform—with full knowledge that it breaks compatibility—for various reasons. This requires application developers to notice the change and take appropriate action; not doing so can lead to a security hole like the one found in the Postfix mail transfer agent (MTA) recently. The behavior of links, created using the link() system call—on Linux, Solaris, and IRIX—is what tripped up Postfix. In particular, what happens when a hard link is made to a symbolic link. Many long-time UNIX hackers don't realize that you can even do that, nor what to expect if you do. 
Postfix relied on a particular, standard-specified behavior that many operating systems, including early versions of Linux, follow. Links can be a somewhat confusing, or possibly unknown, part of UNIX-like filesystems, so a bit of explanation is in order. A link created with link()—also known as a hard link—is an alias for a particular file. It simply gives an additional name by which a particular chunk of bytes on the disk can be referenced. For example: creates a second entry in the filesystem (called /link/to/foo) which points to the same file as /path/to/foo. The file being linked to must exist and reside on the same filesystem as the link. Symbolic links, on the other hand, are aliases of a different sort. A symbolic link creates a new entry (e.g. inode) in the filesystem which contains the path of the linked-to file as a string. There is no requirement that the file exist or be on the same filesystem—the only real requirement is that the path conform to standard pathname rules. The symlink() system call is used to create them: Both symbolic links and hard links can also be created from the command line using the ln command (adding a -s option for symbolic links). So, when making a hard link to a symbolic link, there are two choices: either follow the symbolic link to its, possibly nonexistent, target and link to that or link to the symbolic link inode itself. POSIX requires that the symbolic link be fully resolved to an actual existing file, which is the behavior that Postfix relies upon. The exact sequence of events is lost in the mists of time, but Linux changed to non-standard behavior—at least partially for compatibility with Solaris—in kernel version 1.3.56 (which was released in January 1996). Some discussion prior to that change adds an additional reason for it: user space has no way to make a link to a symbolic link without it. Some saw that as a flaw in the interface and proposed the change. 
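The difference between the two kinds of link is easy to observe from user space. The sketch below uses Python's `os.link()` and `os.symlink()` wrappers around the system calls of the same names; the file names and the helpers `demo_links()` and `link_resolved()` are made up for illustration, and it assumes a Linux-like filesystem that supports both link types. It deliberately does not assert what a hard link to a symbolic link produces, since that is exactly the behavior that varies between systems.

```python
import os
import tempfile


def demo_links(workdir):
    """Create a file, a hard link, a symlink and a dangling symlink."""
    target = os.path.join(workdir, "foo")
    hard = os.path.join(workdir, "hard-foo")
    soft = os.path.join(workdir, "soft-foo")
    dangling = os.path.join(workdir, "dangling")

    with open(target, "w") as f:
        f.write("hello\n")

    os.link(target, hard)     # a second name for the same inode
    os.symlink(target, soft)  # a new inode holding the path as a string
    os.symlink(os.path.join(workdir, "missing"), dangling)  # target need not exist

    return {
        "same_inode": os.stat(target).st_ino == os.stat(hard).st_ino,
        "link_count": os.stat(target).st_nlink,
        "symlink_text": os.readlink(soft),
        "dangling_exists": os.path.exists(dangling),
        "dangling_is_link": os.path.islink(dangling),
    }


def link_resolved(oldname, newname):
    """Hard-link to the file oldname finally names, whatever link() itself does."""
    os.link(os.path.realpath(oldname), newname)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for key, value in demo_links(d).items():
            print(key, "=", value)
        via = os.path.join(d, "via-symlink")
        link_resolved(os.path.join(d, "soft-foo"), via)
        print("resolved link is a symlink:", os.path.islink(via))
```

Note that `link_resolved()` is not atomic: the symbolic link could be changed between the resolution and the `os.link()` call, so it is a convenience for illustration rather than something to rely on in security-sensitive code.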
An application developer that wanted the original behavior would be able to implement it by resolving any symbolic links before making the hard link. To further complicate things, it appears that the POSIX behavior was restored in the 2.1 development series, only to be changed back in late 1998. This change led to the comments currently in fs/namei.c for the function implementing link(): Where oldname is the file being linked to and newname is the name being created. For the curious, KAB is Kevin Buhr and ADM is Alan Modra. Unfortunately, according to Postfix author Wietse Venema, the link(2) man page didn't change until sometime in 2006. This makes it fairly difficult for application developers to learn about the change, especially because they may not follow kernel development closely. Postfix allows root-owned symbolic links to be used as the target for local mail delivery, specifically to handle things like /dev/null on Solaris, which is a symbolic link. Because an attacker can make a link to a root-owned symbolic link on vulnerable systems, Postfix can get confused and deliver mail to files that it shouldn't. This can lead to privilege escalation (via executing code as root) by making a hard link to a symbolic link of an init script (CVE-2008-2936). As Venema outlines in the Postfix security advisory, the problem can be resolved by requiring that symbolic links used for local delivery reside in a directory that is only writeable by root. It is not a perfect solution, though: "This change will break legitimate configurations that deliver mail to a symbolic link in a directory with less restrictive permissions." There are other workarounds for people who don't want to use the provided patch to Postfix. Protecting the mail spool directory is one solution; Venema provides a script to use to do that. Some systems can be configured to disallow links to files owned by others, which is another way to avoid the problem. 
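The kind of check the advisory describes, accepting a symbolic link only when it is root-owned and lives in a directory that only root can write to, might look something like the following. This is not Postfix's code; it is a minimal sketch with hypothetical function names, and it ignores ACLs and other access-control mechanisms that a real implementation would have to consider.

```python
import os
import stat


def dir_writable_only_by_root(st):
    """True if `st` describes a root-owned directory with no group/other write bit."""
    return (
        stat.S_ISDIR(st.st_mode)
        and st.st_uid == 0
        and not (st.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
    )


def symlink_target_acceptable(path):
    """Judge a symbolic link by its owner and by its parent directory."""
    link_st = os.lstat(path)  # lstat() looks at the link itself, not its target
    if not stat.S_ISLNK(link_st.st_mode):
        return False  # this sketch only judges symbolic links
    if link_st.st_uid != 0:
        return False  # only root-owned links are considered at all
    parent = os.path.dirname(os.path.abspath(path))
    return dir_writable_only_by_root(os.stat(parent))
```

A world-writable spool directory (mode 1777, say) fails the directory test even when root owns it, which is why the quoted caveat about breaking some legitimate configurations applies.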
This issue has given Postfix a bit of a black eye, but that is rather unfair. The problem was found by a SUSE code inspection, but it has existed in certain kinds of Linux installations of Postfix for a long time. It could be argued that testing should have found it (there is a simple test for vulnerable systems), but relying on documented behavior that is part of an important standard that Linux is said to support is not completely unreasonable either. It is likely that the full implications of not supporting the standard were not completely understood until recently. Linux was still in its infancy when the original change went in. One would like to think that a change of that type today would be nearly impossible because it breaks the kernel's user-space interface. If it were to happen, somehow, the resulting hue and cry would be loud enough that application developers would hear. But that's for intentional changes; a bug introduced into a dark corner of the kernel's API might go unnoticed for quite some time. Hopefully, none of those lingers for ten years before being discovered.

Update: The original article referred to CVE-2008-2937 as also being a consequence of the link issue, which it is not. It is an unrelated issue that was found during the same code review.

Regulating wireless devices

Whenever a Linux system communicates with the rest of the world, it must follow a whole set of rules on how that communication is done. Basic TCP/IP networking would work poorly indeed in the absence of an observed agreement on how the networking medium should be used. Wireless networking has all of those constraints, plus a set of its own. Since wireless interfaces are radios, they must conform to rules on the frequencies they can use, how much power they may emit, and so on. If all goes well, Linux will finally have a centralized mechanism for ensuring that wireless devices are operated according to that wider set of rules.
Regulations on radio transmissions bring some extra challenges. They are legal code, so their violation can bring users, vendors, and distributors into unwanted conversations with representatives of spectrum enforcement agencies. The legal code is inherently local, while wireless devices are inherently mobile, so those devices must be able to modify their behavior to match different sets of rules at different times. And some wireless devices can be programmed in quite flexible ways; they can be operated far outside of their allowed parameters. The possibility that one of these devices could be configured - accidentally or intentionally - in a way which interferes with other uses of the spectrum is very real. The potential for legal problems associated with wireless interfaces has cast a shadow over Linux for a while. Some vendors have used it as an excuse for their failure to provide free drivers. Others (Intel, for example), have reworked their hardware to lock up regulatory compliance safely within the firmware. And still, vendors and Linux distributors have worried about what kind of sanctions might come down if Linux systems are seen to be operating in violation of the law somewhere on the planet. Despite all that, the Linux kernel has no central mechanism for ensuring regulatory compliance; it is up to individual drivers to make sure that their hardware does not break the rules. This situation may be about to change, though, as the Central Regulatory Domain Agent (CRDA) patch set, currently being developed by Luis Rodriguez, approaches readiness. At the core of CRDA is struct ieee80211_regdomain, which describes the rules associated with a given legal regime. It is a somewhat complicated structure, but its contents are relatively simple to understand. They include a set of allowable frequency ranges; for each range, the maximum bandwidth, allowable power, and antenna gain are listed. 
There's also a set of flags for special rules; some domains, for example, do not allow outdoor operation or certain types of modulation. Each domain is associated with a two-letter identifying code which, normally, is just a country code. There is a new mac80211 function which drivers can call to get the current regulatory domain information. But, unless the system has some clue of where on the planet it is currently located, that information will be for the "world domain," which, being designed to avoid offending spectrum authorities worldwide, is quite restrictive. Location information is often available from wireless access points, allowing the system to configure itself without user intervention. Individual drivers can also provide a "location hint" to the regulatory core, perhaps based on regulatory information written to a device's EEPROM by its vendor. If need be, the system administrator can also configure in a location by hand. The database of domains and associated rules lives in user space, where it can be easily updated by distributors. When the name of the domain is set within the kernel, an event is generated for udev which, in turn, will be configured to run the crda utility. This tool will use the domain name to look up the rules in the database, then use a netlink socket to pass that information back to the kernel. From there, individual drivers are told of the new rules via a notifier function. The database is a binary file which is digitally signed; if the signature does not match a set of public keys built into crda, then crda will refuse to use it. This behavior will protect against a corrupted database, but is also useful for keeping users from modifying it by hand.
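As a rough picture of the data involved, the following Python sketch models a regulatory domain as a set of frequency ranges with per-range limits and flags, plus a lookup which falls back to a restrictive world domain when the location is unknown. It is not the layout of struct ieee80211_regdomain; every class name, field name, code, and number here is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqRange:
    start_mhz: float
    end_mhz: float
    max_bandwidth_mhz: float
    max_power_dbm: float
    max_antenna_gain_dbi: float
    flags: frozenset = frozenset()   # e.g. {"NO_OUTDOOR"}; names are invented


@dataclass(frozen=True)
class RegDomain:
    code: str          # two-letter identifier, normally a country code
    ranges: tuple

    def rule_for(self, freq_mhz):
        """Return the range covering freq_mhz, or None if it is not allowed."""
        for r in self.ranges:
            if r.start_mhz <= freq_mhz <= r.end_mhz:
                return r
        return None


WORLD_CODE = "00"  # placeholder code for the world domain in this sketch

# Illustrative numbers only; these are not real regulatory limits.
DATABASE = {
    WORLD_CODE: RegDomain(WORLD_CODE, (FreqRange(2400, 2480, 20, 10, 3),)),
    "XA": RegDomain("XA", (
        FreqRange(2400, 2480, 40, 20, 6),
        FreqRange(5150, 5250, 40, 17, 6, frozenset({"NO_OUTDOOR"})),
    )),
}


def lookup(code):
    """Find the domain for a code, falling back to the world domain."""
    return DATABASE.get(code, DATABASE[WORLD_CODE])


def transmit_allowed(domain, freq_mhz, power_dbm):
    """Check a requested frequency and power against the domain's rules."""
    rule = domain.rule_for(freq_mhz)
    return rule is not None and power_dbm <= rule.max_power_dbm
```

A driver in this picture would call lookup() whenever the location changes and then clamp its settings to whatever transmit_allowed() permits.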
No distributors have made any policy plans public, but one assumes that the signing keys for the CRDA database will not be distributed with the system. We're dealing with free software, so getting around this kind of restriction will not prove challenging for even moderately determined users, but it should prevent some people from cranking their transmitted power to the maximum just to see what happens. The CRDA mechanism, once merged into the kernel and once the wireless drivers actually start using it, should be enough to ensure that Linux systems with well-behaved users will be well-behaved transmitters. Whether that will be enough to satisfy the regulatory agencies (some of which have been quite explicit about their doubts about whether open-source regulatory code can ever be acceptable) remains to be seen. But it is about the best that we can do in a free software environment.

Injunction lifted against MIT students

Three MIT students won a victory in court this week, but it was a rather bittersweet one as the injunction that was overturned was, at best, dubious. The students had researched the security of the Massachusetts Bay Transportation Authority's (MBTA) tickets and pre-paid cards. They were planning to give a presentation about their findings at the DEFCON security conference when MBTA sued them. Even after the Electronic Frontier Foundation (EFF) stepped in to represent the students, MBTA was able to get a ten-day injunction that made the presentation impossible. The judge who issued the injunction relied on the Computer Fraud and Abuse Act, a statute aimed at preventing computer intrusions, to make his decision. He ruled that speaking at a conference was a "transmission" of a computer program that could harm MBTA by allowing people to get free subway rides. The free speech rights of the students, Zack Anderson, RJ Ryan and Alessandro Chiesa, were completely ignored by the judge.
Unfortunately, when a second judge lifted the injunction this week, he did it on narrow grounds, not considering the First Amendment issues either. Instead, he ruled that MBTA was unlikely to succeed on the merits of its case. While the injunction has been lifted, the suit continues. MBTA is likely to be the biggest loser in all of this for a number of reasons, not least of which is the "Streisand Effect". By trying to squelch discussion of its security problems, MBTA ensured that the story got much wider play than it would have as a report from DEFCON. As Barbra Streisand found out when she tried to remove aerial pictures of her Malibu estate from a California coastal survey, suing someone to stop information from flowing rarely works; in fact, on the internet, it generally backfires. After getting an "A" in the class taught by Professor Ron Rivest (the R in RSA), the students met with MBTA to outline what they had found. They also provided a confidential report that included all of the details. They told MBTA that they planned to keep some of those details out of the DEFCON presentation to stop others from trivially exploiting the system. With no advance warning, 48 hours before the presentation, MBTA sued to get an injunction. Had MBTA done its homework, it would have realized that the slides of the presentation were already available, both on the net and on CDs given to the conference attendees. Worse still, MBTA entered the confidential report, with details left out of the presentation, into the open court record. For an agency that claimed that release of the information would cause harm, it did far more harm to itself than the students did. It is a common fallacy that security problems are somehow magically kept at bay if they are not discussed. Time and again we see organizations try to stifle discussion of security problems rather than actually address them.
Any system that is likely to attract the attention of "white hat" security researchers is very likely to have attracted others as well. In fact, for a system like MBTA's, where large amounts of money can be made, the chances that someone of malicious intent isn't already looking for vulnerabilities are vanishingly small. By treating the "MIT Three" as criminals, MBTA has done itself and the Boston-area taxpayers a disservice. The students are willing to work with the agency to identify and fix the problems, but not while they are being sued. The agency told the judge this week that it would take five months to fix the problems identified; it is hard to see how that is expedited by spending time in court. While the students were under a gag order, various MBTA officials were saying that there were no security problems. Because their First Amendment rights had been suspended, the students were unable to respond to defend their research. Only recently has the agency confessed that it does, indeed, have security problems. This is one of the reasons that "prior restraint" on free speech has been deemed unconstitutional in various cases, including the famous "Pentagon Papers" case. It is hard to see how the students could have been more "responsible" with their disclosure. It is not as if these vulnerabilities came out of left field; similar types of problems had been reported for other transit systems. Had MBTA done its job, the students might not have been able to find any flaws to report on. But, instead of thanking them and, perhaps, hiring them, MBTA tried to bully them. The next time someone finds a flaw in its systems, they may decide to anonymously report it with full details, or exploit it for free subway rides.

One week of infrastructure issues

On August 14, Fedora leader Paul Frields sent out a terse announcement regarding "an issue in the infrastructure systems" supporting the project.
This "issue" could lead to some service outages, for which he apologized. Also included in the note was this ominous warning:

    We're still assessing the end-user impact of the situation, but as a precaution, we recommend you not download or update any additional packages on your Fedora systems.

As this article is written (August 20, just barely in time for the LWN weekly publication deadline), there have been a couple of uninformative updates, but the situation persists and nobody seems to know what is really going on. The Fedora team, it would seem, is quite good at keeping secrets when the need arises. As a result, Fedora users worldwide have spent almost a full week wondering what has happened and whether they need to be worried about it. In such a situation, there is a delightful amount of space for wild speculation. Your editor does not usually start his drinking binge until after publication, but, for the purposes of interpreting the following, one should assume that it was already well underway. This "issue" could be explained by any of the following:

- Maybe a Fedora developer - on a drinking binge of his own, perhaps - tripped over a power cord. The resulting mess not only deprived an important server of power, but said developer, on his way toward the floor, managed to take the entire rack down with him. Ever since, the infrastructure team has been trying to reassemble a set of working systems from the rubble.

- Last month, Fedora slipped a small patch into gcc designed to ensure that the results from the most recent board election - where one slot went to a candidate who was not a Red Hat employee - would never be repeated. But the patch was botched, and most mathematical operations in gcc-compiled programs have been returning random numbers ever since. Now the Fedora team is trying to quietly replace the broken binaries before anybody notices.

- It turns out that the rights to the Fedora name had never actually been secured, and the real owner got an injunction shutting the project down. As soon as all the branding has been changed, Fedora will be reborn as Leopard-Skin Pillbox Hat Linux. Just wait until you see the new desktop themes.

- The package signing key has been compromised, as have the build servers. For the last six months, every version of Firefox shipped by Fedora has reported account names, passwords, and credit card numbers to a server located on a ship in international waters near Colombia. The openssh client has been similarly modified. The Fedora team has been slow to get an explanation out because it takes time to relocate your home and family to an undisclosed location on a different continent.

- A vulnerability in RPM has enabled the creation of a large ecosystem of hostile mirrors operated by competing criminal groups. Most Fedora users have been installing compromised updates for the last year or so.

- No less than three Fedora system administrators turned out to be the type of people who will give out their password for a bar of chocolate. The provider of sweets really only wanted to fix the longstanding claws-mail dependency problems in Rawhide, but the project hit the panic button anyway.

- The Fedora team simply wanted to take a vacation in an undisclosed location on a different continent and didn't want to deal with a bunch of email on their return.

The real point of this being, of course, that none of us know what is going on, creating a situation described by Alan Cox as "leaving people in the dark assuming the worst - a very bad way to create long term trust." Distributors occupy a crucial part of our ecosystem; they absolutely need to have the trust of their users. There is just too much that can go wrong at that level. One can only assume that something fairly serious has happened.
By all accounts, the Fedora team has been working flat-out to get things resolved as quickly as possible; they seem to be doing an exceptional job under a great deal of pressure. They have undoubtedly earned a big round of thanks - and lots of beers - from the Fedora community as a whole. But Fedora's leadership appears to have failed here. If Fedora users need to be concerned about the software running on their systems, they should have been told by now. If they can relax and stop worrying, they should have been told that as well. Instead, the Fedora user community has been left wondering for nearly a week while the infrastructure they count on is torn down and rebuilt from the beginning. Given that, Fedora users have shown a tremendous amount of patience and restraint; the user community clearly has a high degree of confidence in the project in general, and has been willing to wait until the project is ready to come clean. To retain that confidence, the Fedora project will have to tell the full story in a clear manner - and sooner would certainly be better. A good explanation of why Fedora users were made to wait so long before hearing anything about how this "infrastructure issue" affects them will also be needed. Fedora users are concerned about what has happened so far, but their real response will be determined by what Fedora does next.

Fedora, Red Hat, and distributor security

On August 22, the Fedora Project released an "infrastructure report" confirming what most observers had, by then, suspected: the project had suffered a major security breach. The attacker got as far as a system used to sign packages distributed by Fedora. That, of course, is something close to a worst-case scenario: if an intruder has control over such a system, it's a relatively small step to capture the package signing key and the passphrase used to employ that key.
And those, in turn, could be used to create hostile packages which would be accepted as genuine by Fedora installations worldwide. Fortunately for the Fedora Project (and its users), an audit has determined that nobody made use of the key while the intruder was present. So, even if some means for capturing something as transient as the passphrase were in place, that passphrase was not exposed, and, thus, cannot have been compromised. Needless to say, the project is changing its package signing key anyway. Interestingly, the Fedora Project was not the only target in this attack: Red Hat, too, was compromised. Unlike Fedora, Red Hat did not issue a statement specifically about this intrusion; instead, the information was included in an openssh security update. In this case, the attacker was more successful, to the point of being able to sign "...a small number of OpenSSH packages relating only to Red Hat Enterprise Linux 4 (i386 and x86_64 architectures only) and Red Hat Enterprise Linux 5 (x86_64 architecture only)." This language deserves to be questioned: it is only necessary to sign a single openssh package (certainly qualifying as a "small number") to compromise thousands of RHEL hosts, and the "only" terms describe what must be a large majority of deployed RHEL systems. Seriously scary, but Red Hat has been able to convince itself that none of the compromised packages were fed out to RHEL subscribers. So this attack, too, failed - but not by much. Needless to say, disclosures like this raise more questions than they answer. The one question that Fedora and Red Hat will have to answer at some point is this: how did the intruder get in? One assumes that Fedora and Red Hat are running their own distributions on their internal systems; it thus stands to reason that, if those distributions contained a vulnerability that allowed the attacker to get in, many other systems will also be vulnerable. 
If, instead, this compromise is the result of administrator or developer error (or, say, of a lost laptop), administrators responsible for Fedora and RHEL systems can breathe a little easier. Either way, they deserve to know how this series of events came to pass. Some people would like to have that information immediately. Beyond that, there has, predictably, been a fair amount of grumbling (also predictably, from a relatively small number of people) on the Fedora lists on how this incident was handled. Your editor, too, has argued that some information took too long to emerge. He will now argue that, while Fedora still has more to disclose, the project has said enough to give itself some breathing room while it struggles to put its infrastructure back together and figure out what really happened. There are all kinds of good reasons why more information may not be immediately forthcoming, including the obvious possibility that nobody really knows yet how the intruder gained access. There is little to be gained from hammering on Fedora at this time. That said, anything the project can say to tell its users whether they should be worried about an undisclosed vulnerability in their systems would be most welcome, and sooner would be better than later. Meanwhile, what can be done, and what Fedora board member Jeff Spaleta, in particular, has been pushing for, is to think about how things should be handled the next time. Says Jeff: Did we have a communication problem? Maybe. But communication problems are not equivalent to trust issues. But considering that was a first of its kind event for us as a project, I don't think its necessarily unexpected to see some miscommunication. I don't think any of us, either inside Red Hat or outside had talked through how this sort of thing should be handled. I don't remember a serious public discussion about how to deal with communication of an event like this before having an event like this.
And I'm not going to let the assumption stand that to do things differently should have been obvious to those in a position to deal with the information... If people want things to be better, if god forbid something like this happens again, then a serious effort to write a communication process has to be written up and it must be agreeable to legal as a workable process that won't set off any legal liability landmines. The Fedora Project should certainly write up a policy for situations like this. It would be good for Fedora's users, but it could have an effect far beyond that: such a policy could serve as an example for other distributors as well. It is, after all, probably safe to say that Fedora is not the only distributor which has, thus far, neglected to put plans in place for this sort of disaster. We all need such plans. For better or for worse, distributors have come to occupy an important position with regard to the security of much of the net. Millions of systems run packages signed by Linux distributors; they depend, implicitly, on the security of the process used to create those packages. That process is not a small one; it can involve hundreds of developers, before even counting all of those involved in upstream development projects. The consequences of a failure anywhere in that chain of trust can be severe. It is not surprising that the distribution system was attacked; perhaps the only real surprise is that it has not happened more often - that we know of. These attacks will happen again; distributors need to have a firm idea of how they will respond. A related subject is worth a quick mention: as of this writing, the Fedora Project has issued no security updates since August 12, almost a full two weeks. A number of significant vulnerabilities, including the postfix symbolic link vulnerability, remain unpatched for Fedora users. Red Hat has done better, but not by much. 
Linux users depend heavily on the distributor security update process, and, for these two distributions, that process has been severely disrupted. If there had been a truly serious vulnerability disclosed during this time, people charged with keeping Fedora and RHEL systems secure might have found themselves in a difficult position. One need not be overly paranoid to envision this type of disruption being done intentionally as part of a zero-day attack on the net. This incident should serve as a sort of wake-up call for both distributors and their users. Distributors wanting to retain their users' trust should be thinking about documenting things like:

- How the packaging chain is kept secure. It would be good to know how many people are able to sign packages, and how they gain access to the systems where this signing is done.

- What sort of plans the distributor has in place for dealing with security problems. One can only assume that Red Hat dedicated (and continues to dedicate) a large amount of staff time to understanding and recovering from this incident. Would other distributors be willing and able to do the same?

- What are the plans for dealing with a major security breach? How might a critical security update be propagated during a time when the integrity of the packaging system has been compromised? Should something go wrong, when and how will information be communicated to the wider community?

Conversely, anybody who is deploying an important Linux-based system should be asking such questions when choosing a distribution for that system. If the system requires high-assurance guarantees in this regard, it probably makes sense to look at the vendors who are willing to provide such guarantees for a fee. But, again, the lesson we have learned from recent events is that the time to ask these questions is now, and not when something has gone wrong and people are running around in circles.
As a whole, Linux users have been very well served by distributors since the very beginning. The distributors pull together thousands of software releases and integrate them into a coherent whole; they then make the results available, often for free. They provide fixes when things break, and most of them pay particular attention to fixing security-related problems. And they have done a top-quality job of not being used as a conduit for hostile software. It's a great system, and that has not changed. But we have learned something about how heavily we depend on that system, and how it can fail. Proper application of the lessons from this episode should help us all to be more secure in the future.

The SCons build tool reaches the 1.0 milestone

After a two-month release candidate stabilization period, version 1.0 of the SCons build tool has been released. The SCons description states:

    SCons is an Open Source software construction tool - that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.

SCons is being distributed under the MIT license. Steven Knight is the main developer; the rest of the SCons development team consists of Chad Austin, Charles Crain, Steve Leblanc, Greg Noel, Gary Oberbrunner, Anthony Roach, Greg Spencer and Christoph Wiedemann. The SCons project history is described as follows:

    SCons began life as the ScCons build tool design which won the Software Carpentry SC Build competition in August 2000. That design was in turn based on the Cons software construction utility. This project has been renamed SCons to reflect that it is no longer directly connected with Software Carpentry (well, that, and to make it slightly easier to type...).
An SCons document entitled TheBigPicture and the Wikipedia entry explain some of the unique SCons features. SCons:

- Is designed in a modular fashion.
- Uses Python scripts for configuration files.
- Has automatic dependency analysis features for C, C++ and Fortran.
- Supports many other languages and documentation formats.
- Supports multiple compilers for a given language.
- Provides a global view of all source tree dependencies.
- Uses MD5 signatures for detecting file changes.
- Has built-in support for numerous version control systems.
- Can access a large number of utility tools.
- Operates with a large collection of command line options.
- Integrates with a number of popular IDEs.
- Supports parallel compilation with load control.
- Is user extensible.
- Supports cross-platform operation and project development.

To get an idea of where SCons stands among the variety of build tools that are available, the documentation includes a comparison between SCons and other tools. The project's documentation is quite voluminous. The nearly 10,000-line man page is somewhat daunting; it even dwarfs the 8000-line mplayer man page. Fortunately, the document is available in an indexed HTML version for easier reading. A test installation of SCons 1.0 was tried on an Ubuntu i386 Hardy Heron machine. The code was downloaded, uncompressed and untarred, then the following command was executed as root from the source directory: python setup.py install. A test of SCons was performed on a relatively simple C program that prints out the data from a stepped sine wave (sine2hex.c). After plowing through some of the man page and doing a bit of digging through the SCons User Guide, your author succeeded in compiling and linking the program. An SConstruct file was created to describe the project; it consisted of a single line. Typing scons caused SCons to compile and link the program. That is, of course, only the tip of the iceberg, but it shows that the software is not too difficult to get started with.
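The idea behind using MD5 signatures rather than timestamps to detect file changes can be shown with a few lines of Python. This is a sketch of the general technique, not SCons's implementation; the function names and the in-memory signature table are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def md5_signature(path):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(path, signatures):
    """Compare the current signature with the stored one, then update the store."""
    current = md5_signature(path)
    changed = signatures.get(path) != current
    signatures[path] = current
    return changed


def demo():
    """Show that only content changes, not timestamp changes, trigger a rebuild."""
    results = []
    signatures = {}
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "sine2hex.c")
        with open(path, "w") as f:
            f.write("int main(void) { return 0; }\n")
        results.append(needs_rebuild(path, signatures))   # never seen before
        results.append(needs_rebuild(path, signatures))   # nothing changed
        os.utime(path, (0, 0))                            # timestamp change only
        results.append(needs_rebuild(path, signatures))
        with open(path, "a") as f:
            f.write("/* edited */\n")                     # real content change
        results.append(needs_rebuild(path, signatures))
    return results
```

A timestamp-based tool such as make would treat the os.utime() step as a change; a signature-based check does not, at the cost of reading every source file.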
SCons is being used by a variety of closed-source and open-source software projects; the References section lists these and includes user comments about the advantages of switching from other build tools. If you need a next-generation tool for maintaining a large cross-platform project, SCons should be able to do the job.

AXFS: a compressed, execute-in-place filesystem

Filesystems are clearly an area of high development interest at the moment; hardly a week goes by without a new filesystem release for Linux popping up on some list or other. All of this development is motivated by a number of factors, including the increasing size of storage devices and the increasing capability of solid-state storage. Beyond that, though, there is the simple fact that there is no single filesystem which is optimal for all applications. The recently-announced AXFS filesystem is a clear example of what can be done if one targets a specific use case and optimizes for that case only. At first impression, AXFS seems like a simple and limited filesystem. It is, for example, read-only; the AXFS developers have made no provision for changing the filesystem after it is created. Some filesystems have a great deal of code dedicated to the creation of the optimal layout of file blocks on disk; AXFS has none of that. Instead, it has a simple format which divides the media into "regions" and, almost certainly, spreads accesses across the device. There is no journaling, no logging, no snapshots, and no multi-device volume management. What AXFS does provide is compressed storage using zlib. It is, clearly, aimed at embedded systems using flash-based storage. For such devices, a compressed filesystem can be built using the provided tools, then loaded into a minimal amount of flash on each device. It thus joins a number of other compressed filesystems - cramfs and squashfs, for example - provided for this sort of application.
One interesting aspect of compressed, flash-oriented filesystems is their apparent ability to stay out of the mainline kernel. By posting AXFS for review on linux-kernel, developer Jared Hulbert may be trying to avoid a similar fate. The feature which makes AXFS different from squashfs and cramfs is its support for execute-in-place (XIP) files. Some types of flash can be mapped directly into the processor's address space. When running programs stored on that flash, copying pages of executable code from flash into main memory seems like a bit of a waste: since that code is already addressable by the processor, why not run it from the flash? Executing code directly from flash saves RAM; it also makes things faster by eliminating the need to copy those pages into RAM at page fault time. As a result, systems using XIP tend to boot more quickly, a feature which designers (and users) of embedded systems appreciate. Linux has had an execute-in-place mechanism for a few years now, but relatively few filesystems make use of it. AXFS has been designed from the beginning to facilitate XIP operation - that's its reason for existence (and the origin of the "X" in its name). There is an additional twist, though. One would ordinarily consider compressed storage and XIP to be mutually exclusive - there is little value in mapping compressed executable code into a process's address space. To be executed in place, a page of code must be stored uncompressed. What makes AXFS unique is its ability to mix compressed and uncompressed pages in the same executable file. So pages which will be frequently accessed can be stored uncompressed and executed in place. Pages with infrequently-needed code or which contain initialized data can be stored compressed to save space and decompressed at fault time. This is a slick feature, but it is not of great use if one does not know which pages of an executable file are heavily enough used to justify storing them without compression.
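The per-page mix of compressed and uncompressed storage can be illustrated with a toy model in Python. This is not the AXFS on-media format; the page size, the function names, and the idea of passing in a set of "hot" page numbers are all assumptions made for the sketch. Like AXFS, it uses zlib for the compressed pages.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def build_image(data, hot_pages):
    """Split data into pages; store hot pages raw and the rest zlib-compressed."""
    image = []
    for index, offset in enumerate(range(0, len(data), PAGE_SIZE)):
        page = data[offset:offset + PAGE_SIZE]
        if index in hot_pages:
            image.append((False, page))                 # usable in place
        else:
            image.append((True, zlib.compress(page)))   # must be expanded first
    return image


def read_page(image, index):
    """Return a page's contents, decompressing only when it was stored compressed."""
    compressed, payload = image[index]
    return zlib.decompress(payload) if compressed else payload


def stored_size(image):
    """Total bytes of payload held in the image."""
    return sum(len(payload) for _, payload in image)
```

Every page added to hot_pages saves a decompression at fault time but costs space in the image, which is why knowing which pages are actually used matters.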
Trying to determine this information and manually pick the representation of each page seems like an error-prone exercise - not to mention one which would tend to create high employee turnover. So another method is needed. To that end, AXFS provides a built-in profiling mechanism. Each AXFS filesystem is represented by a virtual file under /proc/axfs; writing "on" to that file will cause AXFS to make a note of every page fault within the filesystem. Reading that file then yields spreadsheet-like output showing, for each file, how many times each page was faulted into the page cache. Using this data, it is possible to generate an AXFS filesystem image with an optimal number of compressed pages for the target system. Filesystems normally need a few rounds of review before they can make it into the mainline; some filesystems need rather more than that. AXFS is sufficiently simple, though, that it may find a quicker path into the kernel. So far, the comments have mostly been positive, with the biggest complaint being, perhaps, that its name is too close to that of the existing XFS filesystem. So a 2.6.28 merge for AXFS, while far from guaranteed, would appear to be not entirely out of the question. TALPA strides forward When last we left TALPA, it was still floundering around without a solid threat model, but over the last several weeks that part has changed. Eric Paris proposed a fairly straightforward—though still somewhat controversial—model for the threats that TALPA is supposed to handle. With that in place, there is at least a framework for kernel hackers to evaluate different ways to solve the problem, while also keeping in mind other potential uses. It seems almost tautological, but anti-virus scanning is supposed to, well, scan files. In particular, they scan for known malware and block access to files when they are found to be infected. 
For better or worse, scanning files is seen as an essential security mechanism by many, so TALPA is trying to provide a means to that end. Paris describes it this way: This is a file scanner. There may be all sorts of marketing material or general beliefs that they provide security against all sorts of wide and varied threats (and they do), but in all reality the only threats they provide any help against are those that can be found by scanning files. Simple as that. Some may argue this isn't "good" security and I'm not going to make a strong argument to the contrary, I can stand here for days and show cases where this is highly useful but no one can provide a threat model more than to say, "we attempt to locate files which may be harmful somewhere in the digital ecosystem and try to deny access to that data." All of the various scenarios where active processes can infect files with malware or actively try to avoid scanning can be ignored under this model. While this looks like "security theater" to some, it avoids the endless what-ifs that were bogging down earlier discussions. It may not be a threat model that appeals to many of the kernel hackers, but it is one that they can work with. To many kernel developers—used to efficiency at nearly any cost—time consuming filesystem scans seem ludicrous, especially since they only "solve" a subset of the malware problem. But the fact remains that Linux users, particularly in "enterprise" environments, believe they need this kind of scanning and are willing to pay for products that provide it. The current methods used by anti-virus vendors to do the scanning are problematic at best, causing users to run kernels tainted with binary modules. With a threat model—however limited—in place, work can proceed to find the right way to add this functionality into the kernel. Paris is narrowing in on a design that calls out to user space, both synchronously and asynchronously depending on the operation. 
File access might go something like this:
open() - causes interested user-space programs to be notified asynchronously; anti-virus scanners might kick off a scan if needed
read()/mmap() - causes a synchronous user-space notification, which allows anti-virus scanners to block access until scanning is complete; if malware is found, cause the read/mmap to return an error
write() - whenever the modification time (mtime) of a file is updated, asynchronously notify user space; this would allow anti-virus scanners to re-scan the data as desired
close() - asynchronous user-space notification; another place where anti-virus scanners could re-scan if the file has been dirtied
Where and how to store the current scanning status of a file is still an open question. Various proposals have been discussed, starting with a non-persistent flag in the in-memory inode of a file. While simple, that would cause a lot of unnecessary additional scanning as inodes drop out of the cache. Persistent storage of the scanned status of a file alleviates that problem, but runs into another: how do you handle multiple anti-virus products (or, more generally, scanners of various sorts); whose status gets stored with the inode? For this reason, user-space scanners will need to keep their own database of information about which inodes have been scanned. For anti-virus scanners, they will also want to keep information about which version of the virus database was used. Depending on the application, that could be stored in extended attributes (xattrs) of the file or in some other application-specific database. In any case, it is not a problem for the kernel, as Ted Ts'o points out: I'm just arguing that there should be absolutely *no* support in the kernel for solving this particular problem, since the question of whether a file has been scanned with a particular version of the virus DB is purely a userspace problem. 
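What such a user-space database might look like can be sketched in a few lines of Python. This is a hypothetical illustration, not part of any TALPA proposal: the `ScanCache` class and its methods are invented here, and the sketch keeps its records in memory, where a real scanner would presumably use xattrs or an on-disk database.

```python
import os


class ScanCache:
    """Remembers which files were scanned clean, and with which signature DB."""

    def __init__(self, db_version):
        self.db_version = db_version
        # (device, inode) -> (mtime in ns, signature DB version at scan time)
        self._clean = {}

    def _key_and_mtime(self, path):
        st = os.stat(path)
        return (st.st_dev, st.st_ino), st.st_mtime_ns

    def mark_clean(self, path):
        """Record that the file, as it exists right now, scanned clean."""
        key, mtime = self._key_and_mtime(path)
        self._clean[key] = (mtime, self.db_version)

    def needs_scan(self, path):
        """True if the file is unknown, has changed, or the DB is newer."""
        key, mtime = self._key_and_mtime(path)
        return self._clean.get(key) != (mtime, self.db_version)
```

A file is rescanned when either its mtime or the scanner's signature version differs from what was recorded, which is exactly the state the kernel has no need to know about.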
It is important to keep the scanned status out of kernel purview in order to ensure that policy decisions are not handled by the kernel itself. This is in keeping with the longstanding kernel development principle that user space should make all policy decisions. This allows new applications to come along, ones that were perhaps never envisioned when the feature was being designed. For example, Alan Cox describes another reason that the state of the file with respect to scanning should be kept in user space: This is another application layer matter. At the end of the day why does the kernel care where this data is kept. For all we know someone might want to centralise it or even distribute it between nodes on a clustered file system. The latest TALPA design includes an in-memory clean/dirty flag that can short circuit the blocking read notification (when clean). That flag gets set to dirty whenever there is an mtime modification. This optimizes the common case of reading a file that hasn't changed. Further optimizations are possible down the line as Paris mentions: If some general xattr namespace is agreed upon for such a thing someday a patch may be acceptable to clear that namespace on mtime update, but I don't plan to do that at this time since comparing the timestamp in the xattr vs mtime should be good enough. Various other uses for the kinds of hooks proposed for TALPA have also come up in the discussion. Hierarchical storage management, where data is transparently moved between different kinds of media, might be able to use the blocking read mechanism. File indexing applications and intrusion detection systems could use the mtime change notification as well. This is a perfect example of kernel development in action; after a rough start, the TALPA folks have done a much better job working with the community. Some might argue that the kernel development process is somehow suboptimal, but it is the only way to get things into Linux. 
As the earlier adventures of TALPA would indicate, flouting kernel tradition is likely to go nowhere. While it is still a long way from being included—pesky things like working code are still needed—it is clearly on a path to get there some day, in one form or another. Sysfs and namespaces Support for network namespaces - allowing different groups of processes to have a different view of the system's network interfaces, routes, firewall rules, etc. - is nearing completion in recent kernels. A look at net/Kconfig turns up something interesting, though: network namespaces can only be enabled in kernels which do not support sysfs - the two are mutually exclusive. Since most system configurations need sysfs to work properly, this limitation has made it harder than it would otherwise be to use, or even test, the network namespace feature. The problem is simple: the network subsystem creates a sysfs directory for each network interface on the system. For example, eth0 is represented by /sys/class/net/eth0; therein one can find out just about anything about how eth0 is configured, query its packet statistics, and more. But, when network namespaces are in use, one group of processes may have a different eth0 than another, so they cannot share a globally-accessible sysfs tree. One solution might be to add the network namespace as an explicit level in the sysfs tree, but that would break user-space tools and fails to properly isolate namespaces from each other. The real solution is to build namespace awareness more deeply into sysfs. Eric Biederman has been working on a set of sysfs namespace patches for the last year or so; those patches now appear to be getting close to ready for inclusion into the mainline. Network namespaces will be the first user of this feature, but it has been written in a way that makes it possible for any system namespace to provide differing views of parts of the sysfs hierarchy. The core concept is that of a "tagged" directory in sysfs. 
Any sysfs directory can be associated with (at most) one type of tag, where that type identifies which type of namespace controls how that directory is viewed. Thus, for example, /sys/class/net would have a tag type identifying the network namespace subsystem as the one which is in control there. The /sys/kernel/uids directory, instead, will be managed by the user namespace subsystem. Once a directory is given a tag type, all subdirectories and attribute files inherit the same type. Namespace code makes use of tagged sysfs directories by adding an entry to enum sysfs_tag_type, defined in <linux/sysfs.h>, to identify its specific tag type. The namespace must also create an operations structure: The purpose of the mount_tag() method is to return a specific tag (represented by a void * pointer) for the current process. This tag, normally, will just be a pointer to the structure describing the relevant namespace; for example, network namespaces implement this method as follows: The tag operations must be registered with sysfs using: Thereafter, there are two ways of associating tags with a sysfs hierarchy. One of those is to make a tagged directory directly with: The directory associated with kobj will have differing contents depending on the value of the tag of the given type. The actual tag associated with the contents of this directory will be determined (at creation time) by calling a new function added to the kobj_type structure: The sysfs_tag() function is usually a short series of container_of() calls which, eventually, locates the appropriate namespace for the given kobj. An alternative way to attach tags to a directory tree is to associate it directly with the class structure. To that end, struct class has two new members: When the class is instantiated, it will have tags of the given tag_type; the specific tag for a given class will be found by calling the sysfs_tag() function. 
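The effect of tagging can be modeled outside the kernel with a small Python toy. It is only a conceptual sketch and does not reproduce the kernel interface: `TaggedDir`, `add()`, and `listdir()` are names made up for this illustration, and plain Python objects stand in for namespace pointers.

```python
class TaggedDir:
    """A directory whose visible entries depend on the viewer's tag."""

    def __init__(self, tag_type):
        # Which kind of namespace controls this directory, e.g. "net".
        self.tag_type = tag_type
        # tag (one specific namespace) -> {entry name: value}
        self._entries = {}

    def add(self, tag, name, value):
        """Create an entry that is visible only to viewers holding this tag."""
        self._entries.setdefault(tag, {})[name] = value

    def listdir(self, viewer_tags):
        """List entries for a viewer.

        viewer_tags maps each tag type to the viewer's own tag, playing the
        role of "which namespace is the current process in?".
        """
        tag = viewer_tags[self.tag_type]
        return sorted(self._entries.get(tag, {}))


# Two network namespaces, each with its own eth0.
ns_a, ns_b = object(), object()
net = TaggedDir("net")
net.add(ns_a, "eth0", "interface in namespace A")
net.add(ns_a, "lo", "loopback in namespace A")
net.add(ns_b, "eth0", "interface in namespace B")
```

A process in namespace A sees `eth0` and `lo`, one in namespace B sees only its own `eth0`, and neither can see the other's entries.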
Finally, if a specific tag ceases to be valid (because the associated namespace is destroyed, normally), a call should be made to: This call will cause all sysfs directories with the given tag to become invisible, and to be deleted when it is safe to do so. Adding tagged directory support requires some significant changes to the sysfs code. But the interface has been designed to make it very easy for other subsystems to make use of tagged directories; it's a simple matter of providing functions to return the specific tag values which should be used. At this point, the biggest challenge might be making sense of sysfs when its contents may be different for each observer. But that is a challenge associated with namespaces in general. A new "contrib" repository for openSUSE There has been a recent discussion on the opensuse-factory mailing list about the creation of a repository for non-core packages. The concern expressed at the beginning of the discussion is that openSUSE has too many repositories of unknown quality. Right now many openSUSE community members have home repositories with software packages not found in the main openSUSE repository. Some have software that other openSUSE users would like, some have highly experimental packages that most users would rather avoid. It is difficult for the user to find the packages they want, or know which ones they might find suitable. Pascal Bleser expressed some of those concerns: The goal of the contrib repository must indeed be "stability", which essentially means two things: - - feature freeze: when the Factory repository is freezed, the contrib repository must be freezed too; only allow bugfix upgrades (as, clearly, I doubt we'd find enough human resources to backport fixes) and reject feature upgrades - - stable software: packages that are in there need a lot of testing and must hence be picked carefully The point is to make an "additional" type of repository, not an "always the latest". 
And then we should think about how to have those packages tested properly in order to gain an acceptable level of quality in there when openSUSE distro releases happen (or, rather, when they're freezed). Following the alpha/beta/RC cycles of Factory and issue the same calls for testing could be an option. Alexey Eromenko had some ideas of what that might look like: Yes, "Contrib" is planned to be a community-driven extension of Factory, with all Factory standards and limits applied. This means, that user's will have early version of contrib available for 11.1. "early" doesn't mean unstable, but it means that number of packages are expected to be limited. Only stable software will make it into contrib. All unstable software will remain in user's Home projects in OBS. Pascal Bleser wondered about how a package is determined to be stable. Yes, sure, but => "after it becomes stable" <= That's precisely the point. How do we decide whether a package is stable enough to go into contrib ? Through a release management team ? But maybe we need to offer a comfortable way for people to test packages before they make it into contrib, and having a staging repository is one way of doing it. (I'm just throwing ideas, I'm not saying it's necessarily _the_ way to do it) Richard Guenther proposed some sort of staging repository. It as well makes sense to stage Contrib (I would like this for Factory, too, but it's probably easiest to try with Contrib first). If you are familiar with the Debian way then you know there is the unstable and the testing repositories. So there should be something like Contrib:/Unstable (feel free to pick a more suitable name) where a new package (version) should reside for some time before it is migrated to the main Contrib repository. Criterias ideally would be "zero bugs of severity greater than normal" - but of course this would require proper bugzilla integration (or completely manual migration). 
Staging Contrib helps getting more peer review and avoids breaking Contrib itself. At the point the next openSUSE is freezed development can continue in the unstable branch but only critical fixes are migrated to Contrib. Alexey Eromenko suggested that the new repository should have stable versioned branches, but unstable packages should remain in Home repositories. Speaking about stable/unstable trees: -I think that stable must have braches, yes, (contrib-stable-11.1, contrib-stable-11.2, etc...) - but only for future releases, not backports. Reason is simple: We will find BETA-testers for 11.1/11.2, but unlikely to find enough testers for packages for 10.2. -unstable: I prefer this branch should exists in user's OBS, but if there are volunteers, it could be part of contrib. Because it is unstable, I don't think it needs branching. The discussion continued from there. For now unstable packages remain in the user's Home repository and a small review team has been formed to review these potential candidates. The discussion and its results have been documented on the Contrib wiki page, along with a wish list of packages, for those who are interested in learning more about the Contrib repository and the shape it might take. Django nears 1.0 milestone The Django web application framework is nearing an important milestone: version 1.0. Like Ruby on Rails, TurboGears, and others, Django is meant to streamline application development for the web by providing easy-to-use libraries for tasks that are commonly performed by dynamic web sites, such as database access and HTML templating. The Django project has just released the second beta of the 1.0 release, with the final release due in early September. Django is Python-based, with an eye towards getting an application (or the beginnings of one) up and running quickly. The framework is quite "Pythonic", so it will be very accessible to those used to programming in that language. 
Django also has an extensive set of documentation including an on-line (and dead tree) book. While Django can be used to build nearly any kind of web site, it has a "sweet spot" that is well described in the introduction to The Django Book: Because Django was born in a news environment, it offers several features (particularly its admin interface, covered in Chapter 6) that are particularly well suited for "content" sites – sites like eBay, craigslist.org, and washingtonpost.com that offer dynamic, database-driven information. (Don't let that turn you off, though – although Django is particularly good for developing those sorts of sites, that doesn't preclude it from being an effective tool for building any sort of dynamic Web site. There's a difference between being particularly effective at something and being ineffective at other things.) The database abstraction (or model) layer is at the heart of what Django provides to a programmer. Most dynamic web sites use some kind of database, so Django supports multiple, popular database systems, of both free software and commercial varieties. Because the model layer is a high-level description of the data, moving from one database backend to another is greatly simplified. In addition, the flexibility of the model API means that many applications can do all of the queries that they need without ever descending into SQL—though the facility is there if it is needed. An example taken from the book nicely illustrates the simplicity of Django's model API: From this information, along with some configuration concerning database type and name, Django can generate and execute the appropriate SQL to build a database table to store a Book. As fields get added and removed from the model, the proper commands to synchronize the model and the database can be generated. From application code (i.e. 
the "view" code), then, models can be used in various ways, for instance: This can then be used in an HTML template as follows: That is, of course, a simple example, (and it lacks the URL mapping piece) but it gives a flavor of the power that Django provides. It is also an example that most folks, even non-programmers, can follow to some extent. Like many model-view-controller (MVC) based frameworks, Django splits up the various pieces of functionality in an attempt to break the coupling between the user interface, "business" logic, and data storage, allowing each to be worked on separately. In particular, the template language is meant to be used by web page designers who have little programming background. One of the nicer features is the automatically generated administrative interface. Many web frameworks have incorporated an easy way for a site administrator to start entering data into their models. This allows developers to get their application running quickly with real data without having to code up a bunch of tedious data entry forms. One of the bigger changes from the current 0.96 Django release and the upcoming 1.0 is a complete overhaul of this interface. Many developers have been using the development versions of Django from the subversion repository because the released version (which is what is packaged by distributions) has lagged. There are a number of backward-incompatible changes since 0.96 and the documentation is geared towards the 1.0 version (though it should be noted that versions for each of the last two releases are also readily available). Stabilizing the API has been the driving force behind the 1.0 release. Going forward, compatibility will be maintained unless a security or other serious problem is found. Django has numerous other interesting features: authentication, session handling, and a caching system that is geared towards scalability. 
It is also fully ready for internationalization, with "full support for multi-language applications, letting you specify translation strings and providing hooks for language-specific functionality." Due to be released on September 2, just in time for DjangoCon, 1.0 is, unsurprisingly, both feature and "string" frozen—only serious bug fixes are still going in. Like other projects, including Python itself, Django is "governed" by an independent foundation; the newly formed Django Software Foundation in this case. The original developers of Django are still active both as foundation board members and as active developers (and users) of Django. There are lots of web frameworks to choose from, in nearly every computer language used today—though none for COBOL as far as we have heard—so Django is just "yet another" web framework at some level. Django does have some things going for it that others may lack: a development community that is active and uses the framework (generally the development version checked out of the subversion repository) for live, high-volume sites, excellent documentation, and a well thought-out design. For anyone looking for a Python-based web application framework, at least taking Django for a spin will be time well spent. EFF continues fight for rights and freedoms There's little doubt that emerging technology is improving our way of life, but it's also creating a quagmire of legal issues surrounding the rights and restrictions we face while living in a digital age. The once ambiguous concept of "digital rights" has now become an all-encompassing term used to designate a wide range of rights that have the potential to be trampled on as courts sort out how Constitutional freedom applies to emerging and existing technologies. LWN recently chronicled GeekPAC, an organization looking for new ways to protect our rights via the political battlefront. 
The Electronic Frontier Foundation (EFF), one of the oldest non-profit organizations dedicated to establishing and defending our rights in a digital world, takes a different approach. Since the EFF's mission encompasses such a large body of issues, it's no longer practical to say they're protecting "digital rights." Rebecca Jeschke, Media Relations Coordinator for the EFF, says, "Instead, in our increasingly networked world, they are simply 'rights.' But we'll continue to educate folks on the issues." To do so, the EFF focuses its energies on several important issues including free speech, intellectual property, privacy, and innovation. At first blush, it may be easy to dismiss the work they do as something that only applies to people who download music illegally or who need to protect their online content from thieves. In fact, it may surprise some people to know that the EFF also defends the privacy of airline travelers and cell phone users, issues not typically associated with the purveyors of digital freedoms. One reason the EFF's reach is so wide is the way technology infiltrates our everyday lives. It's easy to understand why sharing the contents of a store-bought music CD with hundreds of people on the Internet may infringe on the rights of the artist hoping to sell his music. In the case of an airline traveler, rights infringement takes on a completely new form when the Transportation Security Administration's data analysis and screening software wrongly decides someone is a security risk. Not only is there no way to challenge the error, it's a mistake that's likely to haunt them for the rest of their lives. The more pervasive technology becomes, the more stories of this nature arise. Take, for example, the seemingly innocuous library book. Many public and school libraries are employing RFID technology to track books and other borrowed items. 
People throw these books into their bag or backpack without realizing the affixed tracking tags can actually be used to track them as well. It's doubtful the government would be interested in the whereabouts of a 9-year old walking home from school, but it's easy to see how this technology can be mishandled or abused. To be sure, no one is suggesting that technology be removed from our daily lives. The mission of the EFF and its supporters is to effect accountability and protect people's rights within the courts. Jeschke says one of the biggest battles surrounding digital freedom that we're likely to hear about in the next year or so is the issue of coders' rights. In response to a gradual uptick of cases in which coders, software engineers, and computer science students are being falsely accused of hacking and other nefarious crimes, the EFF has developed the Coders' Rights Project. According to the EFF, coders are becoming reluctant to explore and research ways to make our technology safer for fear of being prosecuted under laws like the Digital Millennium Copyright Act (DMCA) and the Computer Fraud and Abuse Act. The Coders' Rights Project protects researchers through "education, legal defense, amicus briefs and involvement in the community with the goal of promoting innovation and safeguarding the rights of curious tinkerers and hackers on the digital frontier." Jeschke says another big issue to watch involves the National Security Agency (NSA) and its interest in wiretapping phones and email without first obtaining a court order. Though expressly illegal since 1978, President George W. Bush authorized the NSA to proceed anyway and when the news became public in 2005, the EFF immediately sprang into action against the telecommunications companies assisting the government with their illegal practice. Congress passed an amendment of the original law that grants telecommunications companies immunity, and the EFF is currently working to have that law repealed. 
Other issues of importance in the upcoming months are expected to be in the area of copyright and fair use in user-generated content. The proliferation of YouTube and other online video hosting sites is creating a new and exciting level of creativity, along with some cinema screen-sized headaches about how content owned by others is permitted to be used. For example, a homegrown animated video of original content is fine to post online. Setting that video to a favorite Rolling Stones song, however, crosses the line into copyright infringement. Or does it? What if the main character is simply wearing a t-shirt bearing the band's hand-drawn logo? These are some of the issues the EFF is hoping to sort out. As a non-profit organization, the EFF is funded solely through individual and corporate donations. In fact, a full two-thirds of the foundation's operating budget comes from individual donations, much of which is funneled directly into litigation. The EFF's status as a charitable organization does not permit the solicitation of politicians and governmental figures to support its cause. Instead, the foundation fights legal battles in court, advises policymakers, and uses its corps of 50,000 volunteers to educate the public. One such EFF contributor is SourceForge.net Community Manager Ross Turk, who has been donating consistently to the EFF for 3 years and has been a staunch supporter for much longer. He says: I think the world is changing. Technology has made things possible now that weren't possible before, but I think the system has become highly motivated to preserve itself by making sure people don't do things in new and interesting ways. The EFF's mission is, as I see it, to help the system adapt to the world that we live in today by forcing it to take a closer look at the way it deals with patents, the limitless power it grants industry, and the way it views free speech in an online age. 
I like that they protect the world's innovators, and I like that they thwart those who try to use the technology we have created to monitor and control us. I'm also very happy to know that they're there to help us protect members of our community who are attacked for doing what they like to do. Turk also notes that EFF Bootcamp, a one-day training session presented by the foundation's attorneys, has benefited him professionally because it helped him "understand the difference between enforcement and oppression." It's precisely that kind of education that has kept the EFF going strong for 18 years. The first step in protecting our rights in the burgeoning age of technology is to understand how the things we invent and rely on have the potential to impact our freedoms. Firefox 3 SSL certificate warnings Users of Firefox 3 have likely seen the new warnings for various "invalid" SSL certificates. Unlike earlier versions of Firefox, these new warnings are much scarier, as well as more difficult to ignore; clicking through to the web site is decidedly more time consuming. This is exactly as the Mozilla folks intend, but it has raised some eyebrows, and ire, amongst site owners and Firefox users. SSL certificates are used to enable encrypted communication (i.e. https) between browsers and web sites. Web site owners generate a public and private key for use in the encryption. The public key gets wrapped up in an X.509 certificate and must be signed by someone. For larger sites, it is typically a certificate authority (CA) that signs the certificate, but that generally costs money. Many smaller sites will sign their own certificate, creating what is known as a self-signed certificate. As part of the negotiation of an encrypted connection, a web site will present its certificate to the browser. In order to prevent man-in-the-middle attacks against the encrypted connection, the browser needs to verify that the certificate belongs to the web site it believes it is talking to. 
It does that by verifying the signature of the CA. A signature can only be verified if the browser has the public key of the CA that has signed the certificate. Because there are a multitude of CAs, a "web of trust" is established whereby a number of root CAs sign the certificate of lesser CAs, who might in turn sign for other CAs. A browser developer, like Mozilla, chooses a set of root certificates that they trust. When verifying the certificate from some random website, the browser follows the signature chain; if it reaches one of those trusted root certificates, the web site certificate is valid. A self-signed certificate will, of course, fail this test. When a user comes across a site that has such a certificate, Firefox 3 puts up a nasty warning. The images that accompany this article are screenshots of the warning, along with two of the three steps one must take to accept the certificate. They were generated by visiting https://bugzilla.gnome.org. The days of a single pop-up message that could easily be clicked through are long gone. There are a few different issues here. To start with, there are a large number of legitimate sites that have self-signed certificates. In order to access those sites, users are being trained to click through a series of dialogs and scary ("Legitimate banks, stores, and other public sites will not ask you to do this") warnings, just as they were trained to do with the single pop-up message in earlier Firefox versions. Mozilla's position is that self-signed certificates are untrustworthy: not necessarily invalid, but not something that the browser can trust without asking the user. Because most users are not very sophisticated, the warnings need to be detailed and somewhat frightening. The problem is that users of all kinds may get annoyed by the dialogs, then train themselves to essentially ignore them. 
Because there are CAs, like StartSSL, that provide free certificate signing (as well as others that cost less than $20/year), Mozilla is clearly trying to push web sites into moving away from self signing. There is a risk of man-in-the-middle attacks from self-signed certificates because anyone can create a certificate that purports to be for any other given web site. To some extent, though, the level of danger depends on what the encryption is trying to protect.

For sites that do e-commerce or transmit and receive sensitive information, there is no question that a CA-signed certificate is required. There are other reasons to encrypt traffic, though, including evading deep packet inspection (DPI), where the risks of accepting a bogus certificate are relatively low. One might get ads injected into their web browser inappropriately: annoying, but hardly fatal.

There is no simple solution. Mozilla is erring on the side of caution by trying to protect its users while still allowing them to override its protections. Other techniques, possibly like the Perspectives Firefox extension, may help alleviate the problem in the long term. Until then, we may have to just grit our teeth and click our way past the multiple warnings.

SCHED_FIFO and realtime throttling

The SCHED_FIFO scheduling class is a longstanding, POSIX-specified realtime feature. Processes in this class are given the CPU for as long as they want it, subject only to the needs of higher-priority realtime processes. If there are two SCHED_FIFO processes with the same priority contending for the CPU, the process which is currently running will continue to do so until it decides to give the processor up. SCHED_FIFO is thus useful for realtime applications where one wants to know, with great assurance, that the highest-priority process on the system will have full access to the processor for as long as it needs it.

One of the many features merged back in the 2.6.25 cycle was realtime group scheduling.
As a way of balancing CPU usage between competing groups of processes, each of which can be running realtime tasks, the group scheduler introduced the concept of "realtime bandwidth," or rt_bandwidth. This bandwidth consists of a pair of values: a CPU time accounting period, and the amount of CPU that the group is allowed to use - at realtime priority - during that period. Once a SCHED_FIFO task causes a group to exceed its rt_bandwidth, it will be pushed out of the processor whether it wants to go or not.

This feature is required if one wants to allow multiple groups to split a system's realtime processing power. But it also turns out to have its uses in the default situation, where all processes on the system are contained within a single, default group. Kernels shipped since 2.6.25 have set the rt_bandwidth value for the default group to be 0.95 out of every 1.0 seconds. In other words, the group scheduler is configured, by default, to reserve 5% of the CPU for non-SCHED_FIFO tasks.

It seems that nobody really noticed this feature until mid-August, when Peter Zijlstra posted a patch which set the default value to "unlimited." At that point it became clear that some developers have a different idea about how this kind of policy should be set than others do. Ingo Molnar disagreed with the patch, saying:

The thing is, i got far more bugreports about locked up RT tasks where the lockup was unintentional, than real bugreports about anyone _intending_ for the whole box to come to a grinding halt because a high-prio RT tasks is monopolizing the CPU.

Ingo's suggestion was to raise the limit to ten seconds of CPU time. As he (and others) pointed out: any SCHED_FIFO application which needs to monopolize the CPU for that long has serious problems and needs to be fixed.

There are real problems associated with letting a SCHED_FIFO process run indefinitely.
Should that process never get around to relinquishing the CPU, the system will simply hang forevermore; there is no possibility of the administrator slipping in with a kill command. This process will also block important things like kernel threads; even if it releases the processor after ten seconds, it will have seriously degraded the operation of the rest of the system. Even on a multiprocessor system, there will typically be processes bound to the CPU where the SCHED_FIFO process is running; there will be no way to recover those processes without breaking their CPU affinity, which is not a step anybody wants to take. So, it is argued, the rt_bandwidth limit is an important safety breaker. With it in place, even a runaway SCHED_FIFO cannot prevent the administrator from (eventually) regaining control of the system and figuring out what is going on. In exchange for this safety, this feature only robs SCHED_FIFO tasks of a small amount of CPU time - the equivalent of running the application on a slightly weaker processor. Those opposed to the default rt_bandwidth limit cite two main points: it is a user-space API change (which also breaks POSIX compliance) and represents an imposition of policy by the kernel. On the first point, Nick Piggin worries that this change could lead to broken applications: It's not common sense to change this. It would be perfectly valid to engineer a realtime process that uses a peak of say 90% of the CPU with a 10% margin for safety and other services. Now they only have 5%. Or a realtime app could definitely use the CPU adaptively up to 100% but still unable to tolerate an unexpected preemption. What could make the problem worse is that the throttle might not cut in during testing; it could, instead, wait until something unexpected comes up in a production system. Needless to say, that is a prospect which can prove scary for people who create and deploy this kind of system. 
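The arithmetic behind the default limit is easy to check. The sketch below assumes that the period and runtime are exposed as the sched_rt_period_us and sched_rt_runtime_us files under /proc/sys/kernel, with a negative runtime meaning "unlimited"; those file names are not given in the discussion above, and the helper names are made up for illustration.

```python
from pathlib import Path


def rt_share(runtime_us, period_us):
    """Fraction of each accounting period available to realtime tasks."""
    if runtime_us < 0:
        # Assumption: a negative runtime disables throttling entirely.
        return 1.0
    return min(runtime_us / period_us, 1.0)


def reserved_share(runtime_us, period_us):
    """Fraction of each period left over for non-realtime tasks."""
    return 1.0 - rt_share(runtime_us, period_us)


def read_rt_limits(proc_dir="/proc/sys/kernel"):
    """Read (runtime_us, period_us) from a running system, if available."""
    base = Path(proc_dir)
    runtime = int((base / "sched_rt_runtime_us").read_text())
    period = int((base / "sched_rt_period_us").read_text())
    return runtime, period


# The default described in the article: 0.95 seconds out of every 1.0 seconds.
default_rt = rt_share(950_000, 1_000_000)
default_reserved = reserved_share(950_000, 1_000_000)
```

With the default values this gives a 95% ceiling for realtime tasks and a 5% reserve for everything else; Peter Zijlstra's "unlimited" setting corresponds to a reserve of zero.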
The "policy in the kernel" argument was mostly shot down by Linus, who pointed out that there's lots of policy in the kernel, especially when it comes to the default settings of tunable parameters. He says: And the default policy should generally be the one that makes sense for most people. Quite frankly, if it's an issue where all normal distros would basically be expected to set a value, then that value should _be_ the default policy, and none of the normal distros should ever need to worry. Linus carefully avoided taking a position on which setting makes sense for the most people here. One could certainly argue that making systems resistant to being taken over by runaway realtime processes is the more sensible setting, especially considering that there is a certain amount of interest in running scary applications like PulseAudio with realtime priority. On the other hand, one can also make the case that conforming to the standard (and expected) SCHED_FIFO semantics is the only option which makes sense at all. There has been some talk of creating a new realtime scheduling class with throttling being explicitly part of its semantics; this class could, with a suitably low limit, even be made available to unprivileged processes. Meanwhile, as of this writing, the 0.95-second limit - the one option that nobody seems to like - remains unchanged. It will almost certainly be raised; how much is something we'll have to wait to see. DRI, BSD, and Linux The Direct Rendering Infrastructure project has long been working toward improved 3D graphics support in free operating systems. It is a crucial part of the desktop Linux experience, but, thus far, DRI development has been done in a relatively isolated manner. Development process changes which have the potential to make life better for Linux users are in the works, but, sometimes, that's not the only thing that matters. The DRI project makes its home at freedesktop.org. 
Among other things, the project maintains a set of git repositories representing various views of the current state of DRI development (and the direct rendering manager (DRM) work in particular). This much is not unusual; most Linux kernel subsystems have their own repository at this point. The DRM repository is different, though, in that it is not based on any Linux kernel tree; it is, instead, an entirely separate line of development. That separation is important; it means that its development is almost entirely disconnected from mainline kernel development. DRM patches going into the kernel must be pulled out of the DRM tree and put into a form suitable for merging, and any changes made within the kernel tree must be carefully carried back to the DRM tree by hand. So this work is not just an out-of-tree project; it's an entirely separate project producing code which is occasionally turned into a patch for the Linux kernel. It is not surprising that DRM and the mainline tend not to follow each other well. As Jesse Barnes put it recently: Things are actually worse than I thought. There are some fairly large differences between linux-core and upstream, some of which have been in linux-core for a long time. It's one thing to have an out-of-tree development process but another entirely to let stuff rot for months & years there. The result of all this has been a lot of developer frustration, trouble getting code merged, concerns that the project is hard for new developers to join, and more. As the DRM developers look to merge more significant chunks of code (GEM, for example), the pressure for changes to the development process has been growing. So Dave Airlie's recent announcement of a proposed new DRM development process did not entirely come as a surprise. There are a number of changes being contemplated, but the core ones are these: The DRM tree will be based on the mainline kernel, allowing for the easy flow of patches in both directions. 
The old tree will be no more. A more standard process for getting patches to the upstream kernel will be adopted; these will include standard techniques like topic branches and review of patches on the relevant mailing lists. Users of the DRM interface will not ship any releases depending on DRM features which are not yet present in the mainline kernel.

The result of all this, it is hoped, will be a development process which is more efficient, more tightly coupled to the upstream kernel, and more accessible for developers outside of the current "DRM cabal." These are all worthy objectives, but there may also be a cost associated with these changes resulting from the unique role the DRI/DRM project has in the free software community.

There is clearly a great deal of code shared between Linux and other free operating systems, and with the BSD variants in particular. But that sharing tends not to happen at the kernel level. The Linux kernel is vastly different from anything BSD-derived, so moving code between them is never a straightforward task. GPL-licensed code is not welcome in BSD-licensed kernels, naturally, making it hard for code to move from Linux to BSD even when it makes sense from a technical point of view. When code moves from BSD to Linux, it often brings a certain amount of acrimony with it. So, while ideas can and do move freely, there is little sharing of code between free kernels.

One significant exception is the DRM project, which is also used in most versions of BSD. One of the reasons behind the DRM project's current repository organization is the facilitation of that cooperation; there are separate directories for Linux code, BSD code, and code which is common to both. Developers from all systems contribute to the code (though the BSD developers are far outnumbered by their Linux counterparts), and they are all able to use the code in their kernels.
When working in the common code directory, developers know to be careful about not breaking other systems. All told, it is a bit of welcome collaboration in an area where development resources have tended to be in short supply - even if it benefits the BSD side more than Linux. Changing the organization of the DRM tree to be more directly based on Linux seems unlikely to make life easier for the BSD developers. Space for BSD-specific code will remain available in the DRM repository, but turning the "shared-code" directory into code in the Linux driver tree will make its shared status less clear, and, thus, easier for Linux developers to break on BSD. Additionally, it seems clear that this code may become more Linux-specific; Dave Airlie says: However I am sure that we will see more of a push towards using Linux constructs in dri drivers, things like idr, list.h, locking constructs are too much of a pain to reinvent for every driver. Much of this functionality can be reproduced through compatibility layers on the BSD side, but it must carry a bit of a second-class citizen feel. Dave has, in fact, made that state of affairs clear: The thing is you can't expect equality, its just not possible, there are about 10-15 Linux developers, and 1 Free and 1 Open BSD developer working on DRM stuff at any one time, so you cannot expect the Linux developers to know what the BSD requirements are. The fact that fewer people will be able to commit to the new repository - in fact, it may be limited to Dave Airlie - also does not help. So FreeBSD developer Robert Noland, while calling this proposal "the most fair" of any he has heard, is far from sure that he will be able to work with it: I am having a really difficult time seeing what benefit I get from continuing to work in drm.git with this proposed model. 
While all commits to master going through the mailing list, I don't anticipate that I have any veto power or even delay powers until I can at least prevent imports from breaking BSD. Then once I do get it squared away, I'm still left having to send those to the ML and wait for approval to push the fixes. I can just save myself that part of the hassle and work privately. If I'm going to have to hand edit and merge every change, I don't see how it is really any harder to do that in my own repo, where I'm only subject to FreeBSD rules. On the other hand, it's worth noting that OpenBSD developer Owain Ainsworth already works in his own repository and seems generally supportive of these changes. Given the difference between the numbers of Linux-based and BSD-based developers, it seems almost certain that a more Linux-friendly process will win over. There is one rumored change which will not be happening, though: nobody is proposing to relicense the DRM code to the GPL. The DRM developers are only willing to support BSD to a certain point, but they certainly are not looking to make life harder for the BSD community. So they will try to accommodate the BSD developers while moving to a more Linux-centric development model; that is how things are likely to go until such a time as the BSD community is able to bring more developers to the party. High- (but not too high-) resolution timeouts Linux provides a number of system calls that allow an application to wait for file descriptors to become ready for I/O; they include select(), pselect(), poll(), ppoll(), and epoll_wait(). Each of these interfaces allows the specification of a timeout putting an upper bound on how long the application will be blocked. In typical fashion, the form of that timeout varies greatly. poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution. 
They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might program a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. An error of six orders of magnitude seems like a bit much, especially given that contemporary hardware can easily support much more accurate timing.

Arjan van de Ven recently surfaced with a patch set aimed at addressing this problem. The core idea is simple: have the code implementing poll() and select() use high-resolution timers instead of converting the timeout period to low-resolution jiffies. The implementation relied on a new function, schedule_hrtimeout(), to provide the timeouts. It takes a timeout period, time, which is interpreted according to a mode argument (either HRTIMER_MODE_ABS or HRTIMER_MODE_REL).

High-resolution timeouts are a nice feature, but one can immediately imagine a problem: higher-resolution timeouts are less likely to coincide with other events which wake up the processor. The result will be more wakeups and greater power consumption. As it happens, there are few developers who are more aware of this fact than Arjan, who has done quite a bit of work aimed at keeping processors asleep as much as possible. His solution to this problem was to only use high-resolution timeouts if the timeout period is less than one second. For longer timeout periods, the old, jiffy-based mechanism was used as before.

Linus didn't like that solution, calling it "ugly." His preference, instead, was to have schedule_hrtimeout() apply an appropriate amount of fuzz to all timeout values; the longer the timeout, the less resolution would be supplied. Alan Cox suggested that a better mechanism would be for the caller to supply the required accuracy with the timeout value.
The problem with that idea, as Linus pointed out, is that the current system call interfaces provide no way for an application to supply the accuracy value. One could create more poll()-like system calls - as if there weren't enough of them already - with an accuracy parameter, but that looks like a lot of trouble to create a non-standard interface which few programmers would bother to use.

A different solution came in the form of Arjan's range-capable timer patch set. This patch extends hrtimers to accept two timeout values, called the "soft" and "hard" timeouts. The soft value - the shorter of the two - is the first time at which the timeout can expire; the kernel will make its best effort to ensure that it does not expire after the hard period has elapsed. In between the two, the kernel is free to expire the timer at any convenient time.

It's a useful feature, but it comes at the cost of some significant API changes. To begin with, the expires field of struct hrtimer goes away. Rather than manipulate expires directly, kernel code must now go through a set of new accessor functions. Once that's done, the range capability is added to hrtimers. By default, the soft and hard expiration times are the same; code which wishes to set them independently can use new "set" functions which take both a time and a delta. In these functions, the specified time is the soft timeout, while time+delta provides the hard timeout value. There is also another form of schedule_timeout() that works with these ranges.

With this infrastructure in place, poll() and friends can be given approximate timeouts; the only remaining question is just how wide the range of times should be. In Arjan's patch, that range comes from two different sources. The first is a new field in the task structure called timer_slack_ns; as one might expect, it specifies the maximum expected timer accuracy in nanoseconds. This value can be adjusted via the prctl() system call.
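To see why a soft/hard window helps with power consumption, consider a toy model in which each timer may fire anywhere between its soft time and its hard time (soft plus a slack value, in the spirit of timer_slack_ns). The code below is only an illustration of the idea, not the kernel's algorithm; the function names and the numbers are invented, and times are in arbitrary units.

```python
def expiry_window(soft, slack):
    """Return the (soft, hard) window, with hard = soft + slack."""
    return (soft, soft + slack)


def coalesce(windows):
    """Pick the fewest wakeup times so that every window contains one.

    Greedy interval stabbing: handle windows in order of hard deadline
    and schedule a new wakeup only when the latest one falls before
    the current window's soft time.
    """
    wakeups = []
    for soft, hard in sorted(windows, key=lambda w: w[1]):
        if not wakeups or wakeups[-1] < soft:
            # Waking as late as allowed gives later timers the best
            # chance of sharing this wakeup.
            wakeups.append(hard)
    return wakeups


soft_times = [100, 130, 160]

exact = coalesce([expiry_window(t, 0) for t in soft_times])
slack_50 = coalesce([expiry_window(t, 50) for t in soft_times])
slack_60 = coalesce([expiry_window(t, 60) for t in soft_times])
```

With no slack, the three timers need three separate wakeups; a slack of 50 lets the first two share one, and a slack of 60 lets all three expire together. That is the trade being made: a bounded loss of accuracy in exchange for fewer processor wakeups.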
The default value is set to 50 microseconds - approximate to a certain degree, but still far more accurate than the timeouts in current kernels. Beyond that, though, there is a heuristic function which provides an accuracy value depending on the requested timeout period. In the case of especially long timeouts - more than ten seconds - the accuracy is set to 100ms; as the timeouts get shorter, the amount of acceptable error drops, down to a minimum of 10ns for very brief timeouts. Normally, poll() and company will use the value returned by the heuristic, but with the exception that the accuracy will never exceed the value found in timer_slack_ns. The end result is the provision of more accurate timeouts on the polling functions while, simultaneously, preserving the ability to combine timeouts with other system events. Linux 3.0? The Linux kernel summit is happening this month, so various discussion topics are being tossed around on the Ksummit-2008-discuss mailing list. Alan Cox suggested a Linux release that would "throw out" some accumulated, unmaintained cruft as a topic to be discussed. Cox would like to see that release be well publicized, with a new release number, so that the intention of the release would be clear. While there will be disagreements about which drivers and subsystems can be removed, participants in the thread seem favorably disposed to the idea—at least enough that it should be discussed. There is already a process in place for deprecating and eventually removing parts of the kernel that need it, but it is somewhat haphazardly used. Cox proposes: At some point soon we add all the old legacy ISA drivers (barring the odd ones that turn up in embedded chipsets on LPC bus) into the feature-removal list and declare an 'ISA death' flag day which we brand 2.8 or 3.0 or something so everyone knows that we are having a single clean 'throw out' of old junk. 
It would also be a chance to throw out a whole pile of other "legacy" things like ipt_tos, bzImage symlinks, ancient SCTP options, ancient lmsensor support, V4L1 only driver stuff etc. Cox's list sparked immediate protest about some of the items on it, but the general idea was well received. There are certainly sizable portions of the kernel, especially for older hardware, that are unmaintained and probably completely broken. No one seems to have any interest in carrying that stuff forward, but, without a concerted effort to identify and remove crufty code, it is likely to remain. Cox has suggested one way to make that happen; discussion at the kernel summit might refine his idea or come up with something entirely different. Part of the reason that unmaintained code tends to hang around is that the kernel hackers have gotten much better at fixing all affected code when they make an API change. While that is definitely a change for the better, it does have the effect of sometimes hiding code that might be ready to be removed. In earlier times, dead code would have become unbuildable after an API change or two leading to either a maintainer stepping up or the code being removed. The need to make a "major" kernel release, with a corresponding change to the major or minor release number is the biggest question that the kernel hackers seem to have. Greg Kroah-Hartman asks: Can't we do all of the above today in our current model? Or is it just a marketing thing to bump to 3.0? If so, should we just pick a release and say, "here, 2.6.31 is the last 2.6 kernel and for the next 3 months we are just going to rip things out and create 3.0"? There is an element of "marketing" to Cox's proposal. Publicizing a major release, along with the intention to get rid of "legacy" code, will allow interested parties to step up to maintain pieces that they do not want to see removed. 
As Cox puts it:

I thought it might be useful to actually draw some definite lines so we can actually get around to throwing stuff out rather than letting it rot forever and also if its well telegraphed both give people a chance to fix where the line goes and - yes - as a marketing thing as much as anything else to define the line in a way that non-techies, press etc get. Plus it appeals to my sense of the open source way of doing things differently - a major release about getting rid of old junk not about adding more new wackiness people don't need 8)

Arjan van de Ven thinks that gathering the list of things to be removed is a good exercise:

I like the idea of at least discussing this, and for a bunch of people making a long list of what would go. Based on that whole list it becomes a value discussion/decision; is there enough of this to make it worth doing.

Once the list has been gathered and discussed, van de Ven notes, it may well be that it can be done under the current development model, without a major release. "But let's at least do the exercise. It's worth validating the model we have once in a while ;)"

This may not be the only discussion of kernel version numbers that takes place at the summit. Back in July, Linus Torvalds mentioned a bikeshed painting project that he planned to bring up. It seems that Torvalds is less than completely happy with how large the minor release number of the kernel is; he would like to see numbers that have more meaning, possibly date-based:

The only thing I do know is that I agree that "big meaningless numbers" are bad. "26" is already pretty big. As you point out, the 2.4.x series has much bigger numbers yet. And yes, something like "2008" is obviously numerically bigger, but has a direct meaning and as such is possibly better than something arbitrary and non-descriptive like "26".

Version numbers are not important, per se, but having a consistent, well-understood numbering scheme certainly is.
The current system has been in place for four years or so without much need to modify it. That may still be the case, but with ideas about altering it coming from multiple directions, there could be changes afoot as well. For the kernel hackers themselves, there is little benefit—except, perhaps, preventing the annoyance of ever-increasing numbers—but version numbering does provide a mechanism to communicate with the "outside world". Users have come to expect the occasional major release, with some sizable and visible chunk of changes, but the current incremental kernel releases do not provide that numerically; instead, big changes come with nearly every kernel release. There may be value in raising the visibility of one particular release, either as a means to clean up the kernel or to move to a different versioning scheme—perhaps both at once. Cinelerra 4 arrives Cinelerra is a compositing video and audio editor that is being developed by Heroine Virtual LTD's Adam Williams when he isn't playing with autonomous miniature helicopters. Cinelerra is derived from the now-discontinued Broadcast 2000 project. The project is described: Unleash the 50,000 watt flamethrower of content creation in your UNIX box. Cinelerra does primarily 3 things: capturing, compositing, and editing audio and video with sample level accuracy. It's a movie studio in a box. If you want the same kind of editing suite that the big boys use, on an efficient UNIX operating system, it's time for Cinelerra. Cinelerra is not community approved and there is no support from the developer. Donations to community websites do not fund Cinelerra development. The Wikipedia entry for Cinelerra summarizes the project's window set: The user is presented with four screens: 1. The timeline, which gives the user a time-based view of all video and audio tracks in the project, as well as keyframe data for e.g. camera movement, effects, or opacity; 2. 
the viewer, which gives the user a method of "scrubbing" through footage; 3. the resource window, which presents the user with a view of all audio and video resources in the project, as well as available audio and video effects and transitions; and 4. the compositor, which presents the user with a view of the final project as it would look when rendered. The compositor is interactive in that it allows the user to adjust the positions of video objects; it also updates in response to user input.

The main Cinelerra page lists the software's many features. Version 4.0 of Cinelerra was released on August 8, 2008; the change log details the most recent feature additions. Older project history is available in the news document. One big change for this release is the availability of pre-compiled binaries for 32 and 64 bit versions of Ubuntu 8.04. This can be a real time saver due to the complexity of the build process, and will make the software accessible to a wider variety of users.

Cinelerra works best with specific hardware configurations. An NVidia graphics card is recommended: "Cinelerra supports OpenGL shaders on NVidia graphics cards. The video crunching power that was once exclusively the domain of SGI minicomputers is now yours. NVidia users can run many effects in realtime instead of rendering them. OpenGL also opens up new video resolutions, up to 4096x4096 on high end cards." And a 64 bit Linux platform is a good idea: "Since it's Linux, it's been 64 bit compliant for years. In fact, Cinelerra is only recommended for 64 bit mode. The reason is the large amount of virtual memory required for page flipping and floating point images often exceeds the limit of 32 bits."

Your author has used Cinelerra in the past for audio editing, as described in an earlier article.
Cinelerra has one capability that is hard to find in other Linux audio editing software: the ability to split (render) a huge .wav file into a group of smaller .wav files across multiple position labels, all in one operation. This feature is useful for processing long audio recordings such as digitized vinyl album sides and copies of digital audio (DAT) tapes. This was the first operation that Cinelerra 4 was tried on. After some initial crashing difficulties, a startup warning message about an insufficient shmmax value was heeded. Changing shmmax is simply a matter of running echo 0x7fffffff > /proc/sys/kernel/shmmax as root before starting Cinelerra. After doing that, your author was unable to make the software crash while processing audio.

Lacking a high resolution video camera, your author used his Nikon Coolpix S10 VR digital camera to produce low resolution .mov format movies with mono audio tracks. Cinelerra was able to display videos from this camera, specifically movies of thunderstorms. Individual frames containing lightning strikes were located by single stepping through interesting sections of the movie; the still frames were then grabbed from the screen using an external application (xv). The single-step capability allowed the life cycle of a lightning bolt to be observed. This is a much less expensive way to procure photographs of lightning compared to using lots of 35mm film and specialized hardware.

Attempts to do actual video editing were somewhat less successful than simple playback. Creating a fade-in at the beginning of a short video clip worked, but several attempts to add a second video track crashed Cinelerra, as did saving a modified track. This may be related to the camera's data, which has confused other video players (mplayer) in the past, or to the lack of a professional quality video device. The computer was running a (not recommended) 32-bit version of Ubuntu and an older Radeon video card.
As with high-end audio processing, it is probably best to put together a system with the specific hardware and operating system that is recommended for the application.

While Cinelerra is more of a professional video tool than a generic desktop application, it nonetheless has some very useful capabilities outside of its primary application space. It is the most full-featured video playback application that your author has experimented with, and it functions nicely as an audio processing tool.

Spinning Fedora

There was a discussion recently on the fedora-advisory-board list about when a derivative is an official spin vs. one that is Fedora based. It started out innocently enough with a request for trademark approval for an Appliance Operating Spin. Right away Bill Nottingham noted that SELinux is disabled in this spin and wondered why. The answer was simple enough: there are some current issues with the building tool and SELinux. It was a simple enough start to what turned into a somewhat lengthy discussion of what makes Fedora Fedora.

This is not the first time that the Fedora Advisory Board has tackled this issue, but it seems that not all board members are in complete agreement on the difference between an official Fedora spin and something which is merely Fedora based. Jesse Keating recalled a conversation that took place during the merge of core and extras on whether or not there should be a "Fedora Standard Base". That is, a basic set of things you must have in your "spin" in order to call it Fedora. These include things like rpm, yum, and SELinux (at least in my opinion), but we never really coded this up nor hashed out what should be in the FSB, or if FSB was even a good name for the concept.

A draft version of trademark guidelines is available, and awaiting comments and approval by the Fedora Board. The guidelines in this document do not make any packages mandatory for trademark approval.
They do state that official spins will include only those packages that are available in the official Fedora repository. Pretty much all spins, with the notable exception of the Everything Spin, will contain a subset of all the packages in the repository and are left to choose which packages they need or don't need. Axel Thimm posted that official spins should have high standards and should improve the brand name. Currently I cannot imagine Fedora w/o rpm or yum, but I can imagine it w/o selinux if I think about very small footprints, nano-Fedoras and all the recent suggestion. I wouldn't mind my phone to advertise that it runs on Fedora, even if selinux was turned off (but the high standard of security is ensured in another way). Since we can't envision what nice spins/derivatives people will come up with (I first heard of the appliance spin), we should not statically enforce any requirements, but instead have the board be the checking instance like it is now. Of course, it's not just about the trademarks. The discussion also brought up the kickstart pool and whether unofficial spins should be included in the pool, or even whether all official spins should be included. So there could be trademarked Fedora spins that aren't allowed in the kickstart pool, perhaps because of their choice of packages. Or there could be "Xora", a Fedora based distribution, that would be in the kickstart pool and available in the Fedora Hosted service. Jeff Spaleta looked at how the kickstart pool might be structured. Under the current workflow, there are essentially 3 different technical levels. 1) Spin SIG best practices to get into kickstart pool 2) Technical issues which are associated with trademark approval 3) Technical requirements for RelEng for 'release' of a spin. These can be layered technical hurdles, which the kickstart pool could be structured to mimic.
The bottom line, in this instance, seems to be that AOS (Appliance Operating Spin) will likely get trademark approval, since it only contains official Fedora packages. However, unless they get SELinux running on it, either with permissive mode or with a custom policy, it won't get into the kickstart pool. Or perhaps it will be relegated to a second-class pool. It may seem odd that an appliance needs SELinux, but as Jeroen van Meeuwen says: "On the other hand, of course we do have an agenda to push and that agenda includes SELinux as being one of the core features of the entire Fedora line of products (including the few enterprise linux spin-offs). It's one of the main features and we would rather see appliances built upon an AOS that has SELinux enforcing by default while it can still be disabled." Feature removal sparks Git flamewar Removing features from a tool is never easy. Once there is enough of a user base to complain about annoyances, there is also a vocal group that uses and likes those same annoyances. The recent removal of the git-foo style commands from Git is just such a case, but many of those using those commands did not find out about the removal until after the change was made, which only served to increase their outrage. Until version 1.6.0, Git had always had two ways to invoke the same functionality: git foo and git-foo. This was done by installing many (usually more than 100) different entries into /usr/bin for all of the different git subcommands. Some were concerned that Git was polluting that directory, but the bigger issue was the effect on new users. Partially because of shell autocompletion, a new user might be overwhelmed by the number of different Git commands available; even regular users might find it difficult to find the command they are looking for if they have to sort through 100 or more. Many of the Git subcommands that exist are not necessarily regularly used.
There are quite a number of "plumbing" commands that rarely, if ever, should be invoked by users. Those are best hidden from view, which can be done by moving them out of /usr/bin. This has been done for the 1.6.0 release, but Junio Hamano opened up a can of worms when he posted a request for discussion about taking the next step to the Git mailing list. In the 1.6.0 release, the only things exposed in /usr/bin are the git binary itself along with a few other utilities; the rest have been moved to /usr/libexec/git-core. The hard links for each of the git-foo commands have been maintained in the new location, which allows folks that still want the old behavior to get it by adding: to .bashrc (or some other startup file, depending on the shell). This would allow users—especially scripts—to continue using the dash versions of commands. Unfortunately, for many users, the first they heard about this change was when things stopped working after they installed 1.6.0. The Git team admittedly did not get the word out very well; by trying to be nice, they missed an opportunity to make users notice the change. As Hamano puts it: But that niceness backfired. Many people seem to argue now that we should have annoyed people by throwing loud deprecation notices to stderr when they typed "git-foo", and we should have risked breaking their scripts iff they relied on not seeing anything extra on the stderr. Hamano got caught in the middle to some extent as he wasn't particularly in favor of the original change, but at the time it was decided, there were few advocates for keeping 100+ commands in /usr/bin. There were several complaints about having that many commands, but chief amongst them was confusion for new users. By removing them from /usr/bin and providing an autocompletion script for bash that completes only a subset of the git subcommands, users will have fewer options to scan through—and to be scared of. 
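The effect of adding the relocated directory to the search path can be sketched as follows. This is only an illustration of the idea, not the line the Git developers suggested for a startup file: the helper names append_to_path and dashed_commands are hypothetical, and the default directory is simply the /usr/libexec/git-core location named above (on an installed system, git --exec-path reports the actual directory).

```python
import os

# Directory named in the text for the relocated git-foo commands. This is an
# assumption for illustration; `git --exec-path` prints the real location.
GIT_EXEC_DIR = "/usr/libexec/git-core"


def append_to_path(path_value, extra_dir=GIT_EXEC_DIR):
    """Return path_value with extra_dir appended, unless it is already there."""
    entries = [p for p in path_value.split(os.pathsep) if p]
    if extra_dir not in entries:
        entries.append(extra_dir)
    return os.pathsep.join(entries)


def dashed_commands(exec_dir=GIT_EXEC_DIR):
    """List the git-foo entries found in exec_dir (empty if it is missing)."""
    try:
        names = os.listdir(exec_dir)
    except OSError:
        return []
    return sorted(n for n in names if n.startswith("git-"))
```

Appending rather than prepending keeps the directory last in the search order, so nothing already on the path is shadowed; scripts that call git-foo directly would find the commands again once their environment carries the extended value.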
The original plan called for moving the dash-style commands out, which has been done, but also eventually removing the links for any of the git-foo commands that are implemented in the core git binary. Over time, much of the functionality that was handled by external commands has migrated into the main git program. It is the eventual removal of the links that Hamano is asking about in his message, but much of the response was flames about the step already taken; some could not see any advantage to moving the git-foo commands out of /usr/bin. David Woodhouse is one of those who wants things to remain the same: Just don't do it. Leave the git-foo commands as they were. They weren't actually hurting anyone, and you don't actually _gain_ anything by removing them. For those occasional nutters who _really_ care about the size of /usr/bin, give them the _option_ of a 'make install' without installing the aliases. Several others agreed, but that particular horse had already left the barn. Throughout the thread, Linus Torvalds was increasingly strident about the $PATH-based workaround, which effectively ends the discussion that Hamano was trying to have. For that workaround to continue working, the links must be installed in /usr/libexec/git-core. Though it strays from the original intent, it is a reasonable compromise, one that will serve git-traditionalists as well as new users and others who no longer want the git-foo syntax. Two things have helped keep the controversy alive: some documentation, test, and example scripts still refer to dash-style commands, but worse than that, one must do man git-foo to get the man page for that subcommand. It is a convention within the Git community to use the dash style when referring to commands in text, which explains some of the usage. Because man requires a single argument, the dash style is used there as well, though git help foo is a reasonable alternative. 
For users who started relatively early with Git, and are aware of the dash style commands, these examples further muddy the water. It is a difficult problem. Projects must have room to change, but once users become used to a particular way of doing things, they will resist changing, sometimes quite loudly. As Petr "Pasky" Baudis points out, though, Git is still evolving: You can't ask us to stop making any incompatible changes - Git is still too young for that and it's UI got evolved, not designed. But we do document the changes we do, even though we might do a better job *spreading* the word. The Git developers still see it as a young tool that may still undergo some fairly substantial modifications, while the hardcore users see it as a fixed tool that they use daily, or more frequently, to get work done. The tension between those two views is what leads to flamewars like we have seen here. Certainly the Git folks could have done a much better job in getting the word out (Hamano was looking for suggestions on how to do that better in his original post), but users are going to have to be flexible as well. The Kernel Hacker's Bookshelf: UNIX Internals Back in 2001, I landed my (then) dream job as a full-time Linux kernel developer and distribution maintainer for a small embedded systems company. I was thrilled - and horrified. I'd only been working as a programmer for a couple of years and I was sure it was only a matter of time before my new employer figured out they'd hired an idiot. The only solution was to learn more about operating systems, and quickly. So I pulled out my favorite operating systems textbook and read and re-read it obsessively over the course of the next year. It worked well enough that my company tried very hard to convince me not to quit when I got bored with my "dream job" and left to work at Sun. That operating systems textbook was UNIX Internals by Uresh Vahalia.
UNIX Internals is a careful, detailed examination of multiple UNIX implementations as they evolved over time, from the perspective of both the academic theorist and the practical kernel developer. What makes this book particularly valuable to the practicing operating systems developer is that the review of each operating systems concept - say, processes and threads - is accompanied by descriptions of specific implementations and their histories - say, threading in Solaris, Mach, and Digital UNIX. Each implementation is then compared on a number of practical levels, including performance, effect on programming interfaces, portability, and long-term maintenance burden - factors that Linux developers care passionately about, but are seldom considered in the academic operating systems literature. UNIX Internals was published in 1996. A valid question is whether a book on the implementation details of UNIX operating systems published so long ago is still useful today. For example, Linux is only mentioned briefly in the introduction, and many of the UNIX variants described are now defunct. It is true that UNIX Internals holds relatively little value for the developer actively staying up to date with the latest research and development in a particular area. However, my personal experience has been that many of the problems facing today's Linux developers are described in this book - and so are many of the proposed solutions, complete with the unsolved implementation problems. More importantly, the analysis is often detailed enough that it describes exactly the changes needed to improve the technique, if only anyone took the time to implement them. In the rest of this review, we'll cover two chapters of UNIX Internals in detail, "Kernel Memory Allocation" and "File System Implementations." 
The chapter on kernel memory allocation is an example of the historical, cross-platform review and analysis that sets this book apart, covering eight popular allocators from several different flavors of UNIX. The chapter on file system implementations shows how lessons learned from the oldest and most basic file system implementations can be useful when solving the latest and hottest file system design problems. Kernel Memory Allocation The kernel memory allocator (KMA) is one of the most performance-critical kernel subsystems. A poor KMA implementation will hurt performance in every code path that needs to allocate or free memory. Worse, it will fragment and waste precious kernel memory - memory that can't be easily freed or paged out - and pollute hardware caches with instructions and data used for allocation management. Historically, a KMA was considered pretty good if it only wasted 50% of the total memory allocated by the kernel. Vahalia begins with a short conceptual description of kernel memory allocation and then immediately dives into practical implementation, starting with page-level allocation in BSD. Next, he describes memory allocation in the very earliest UNIX systems: a collection of fixed-size tables for structures like inodes and process table entries, occasional "borrowing" of blocks from the buffer cache, and a few subsystem-specific ad hoc allocators. This primitive approach required a great deal of tuning, wasted a lot of memory, and made the system fragile. What constitutes a good KMA? After a quick review of the functional requirements, Vahalia lays out the criteria he'll use to judge the allocators: low waste (fragmentation), good performance, simple interface appropriate for many different users, good alignment, efficient under changing workloads, reassignment of memory allocated for one buffer size to another, and integration with the paging system. 
He also takes into consideration more subtle points, such as the cache and TLB footprint of the KMA's code, along with cache and lock contention in multi-processor systems. [PULL QUOTE: This is an example of how even the oldest and clunkiest algorithms can influence the design of the latest and greatest. END QUOTE] The first KMA reviewed is the resource map allocator, an extremely simple allocator using a list of <base, size> pairs describing each free segment of memory, sorted by base address. The charms of the resource map allocator include simplicity and allocation of exactly the size requested; the vices include high fragmentation and poor performance under nearly every workload. Even this allocation algorithm is useful under the right circumstances; Vahalia describes several subsystems that still use it (System V semaphore allocation and management of free space in directory blocks on some systems) and some minor tweaks that improve the algorithm. One tweak to the resource map allocator keeps the description of each free region in the first few bytes of the region, a technique later used in the state-of-the-art SLUB allocator in the Linux kernel. This is an example of how even the oldest and clunkiest algorithms can influence the design of the latest and greatest. Each following KMA is discussed in terms of the problems it solves from previous allocators, along with the problems it introduces. The resource map's sorted list of base/size pairs is followed by power-of-two free lists with a one-word in-buffer header (better performance, low external fragmentation, but high internal fragmentation, esp. 
for exact power-of-two allocations), the McKusick-Karels allocator (power-of-two free lists optimized for power-of-two allocation; extremely fast, but prone to external fragmentation), the buddy allocator (buffer splitting on power-of-two boundaries plus coalescing of adjacent free buffers; poor performance due to unnecessary splitting and coalescing), and the lazy buddy allocator (buddy plus delayed buffer coalescing; good steady-state performance but unpredictable under changing workloads). The accompanying diagrams of the data structures and buffers used to implement each allocator are particularly helpful in understanding the structure of the allocators. After covering the simpler KMAs, we get into more interesting territory: the zone allocator from Mach, the hierarchical allocator from Dynix, and the SLAB allocator, originally implemented on Solaris and later adopted by several UNIXes, including Linux and the BSDs. Mach's zone allocator is the only fully garbage-collected KMA studied, with the concomitant unpredictable system-wide performance slowdowns during garbage collection, which would strike it from most developers' lists of useful KMAs. But as with the resource map allocator, we still have lessons to learn from the zone allocator. Many of the features of the zone allocator also appear in the SLAB allocator, commonly considered the current best-of-breed KMA. The zone allocator creates a "zone" of memory reserved for each class of object allocated (e.g., inodes), similar to kmem caches in the later SLAB allocator. Pages are allocated to a zone as needed, up to a limit set at zone allocation time. Objects are packed tightly within each zone, even across pages, for very low internal fragmentation. Anonymous power-of-two zones are also available. Each zone has its own free list and once a zone is set up, allocation and freeing simply add and remove items from the per-zone free list (free list structures are also allocated from a zone). 
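The per-zone free list just described can be modeled in a few lines. This is a toy sketch under stated assumptions, not Mach's code: the Zone class and its fields are hypothetical, objects are represented by byte offsets into the zone rather than real memory, and the zone limit is expressed as a maximum object count.

```python
class Zone:
    """Toy model of a zone: a pool of same-sized objects with its own free list."""

    def __init__(self, name, object_size, max_objects):
        self.name = name
        self.object_size = object_size
        self.max_objects = max_objects   # limit fixed when the zone is created
        self.next_offset = 0             # next never-used offset in the zone
        self.free_list = []              # offsets of freed objects, reused first

    def alloc(self):
        if self.free_list:
            return self.free_list.pop()  # fast path: reuse a freed object
        if self.next_offset // self.object_size >= self.max_objects:
            raise MemoryError(f"zone {self.name!r} is full")
        offset = self.next_offset
        self.next_offset += self.object_size  # objects are packed back to back
        return offset

    def free(self, offset):
        # Freeing only pushes onto the list; nothing is returned to the
        # system here, which is why a separate reclamation pass is needed.
        self.free_list.append(offset)
```

Both operations are constant-time list manipulations, which is where the zone allocator's speed comes from; the cost is that pages stay attached to the zone until something else reclaims them.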
Memory is reclaimed on a per-page basis by the garbage collector, which runs as part of the swapper task. It uses a two-pass algorithm: the first pass counts up the number of free objects in each page, and the second pass frees empty pages. Overall, the zone allocator was a major improvement on previous KMAs: fast, space efficient, and easy to use, marred only by the inefficient and unpredictable garbage collection algorithm. The next KMA on the list is the hierarchical memory allocator for Dynix, which ran on the highly parallel Sequent S2000. One of the major designers and implementers is our own Paul McKenney, familiar to many LWN readers as the progenitor of the read-copy-update (RCU) system used in many places in the Linux kernel. The goal of the Dynix allocator was efficient parallel memory allocation, in particular avoiding lock contention between processors. The solution was to create several layers in the memory allocation system, with per-cpu caches at the bottom and collections of large free segments at the top. As memory is freed or allocated, regions move up and down one level of the hierarchy in batches. For example, each per-cpu cache has two free lists, one in active use and the other in reserve. When the active list runs out of free buffers, the free buffers from the reserve list are moved onto it, and the reserve list replenishes itself with buffers from the global list. All the work requiring synchronization between multiple CPUs happens in one big transaction, rather than incurring synchronization overhead on each buffer allocation. The Dynix allocator was a major advance: 3 - 5 times faster than the BSD allocator even on a single CPU. Its memory reclamation system was far more efficient than the zone allocator's, performed on an on-going basis with bounded worst case performance on each operation. Performance on SMP systems was unparalleled. 
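The active/reserve scheme for per-cpu caches can be sketched as below. This is a simplified model, not the Dynix implementation: the class names, the batch size, and the lock_acquisitions counter (standing in for real cross-CPU synchronization) are all assumptions, and the handling of frees mirrors the allocation path by analogy rather than from the description above.

```python
class GlobalPool:
    """Shared pool; in a real kernel each call here would take a lock."""

    def __init__(self, buffers):
        self.buffers = list(buffers)
        self.lock_acquisitions = 0       # counts synchronized transactions

    def take_batch(self, count):
        self.lock_acquisitions += 1
        batch, self.buffers = self.buffers[:count], self.buffers[count:]
        return batch

    def give_batch(self, batch):
        self.lock_acquisitions += 1
        self.buffers.extend(batch)


class PerCpuCache:
    """Per-CPU cache with an active list and a reserve list."""

    def __init__(self, pool, batch_size):
        self.pool = pool
        self.batch_size = batch_size
        self.active = pool.take_batch(batch_size)
        self.reserve = pool.take_batch(batch_size)

    def alloc(self):
        if not self.active:
            # Active list ran dry: promote the reserve list, then refill the
            # reserve from the global pool in one batched transaction.
            self.active = self.reserve
            self.reserve = self.pool.take_batch(self.batch_size)
        if not self.active:
            raise MemoryError("no free buffers")
        return self.active.pop()

    def free(self, buf):
        if len(self.active) >= self.batch_size:
            # Assumed mirror image of alloc: hand the reserve back to the
            # global pool in one batch and demote the active list.
            if self.reserve:
                self.pool.give_batch(self.reserve)
            self.reserve = self.active
            self.active = []
        self.active.append(buf)
```

With a batch size of 4, the first four allocations touch only the local active list; the shared pool is visited once per batch rather than once per buffer, which is the property the text credits for the low synchronization overhead.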
The final KMA in this chapter is the SLAB allocator, initially implemented on Solaris and later re-implemented on Linux and BSD. The SLAB allocator refined some existing techniques (simple allocation/free computations for small cache footprint, per-object caches) and introduced several new ones (cache coloring, efficient object reuse). The result is an allocator that was both the best performing and the most efficient by a wide margin - only 14% fragmentation versus 27% for the SunOS 4.1.3 sequential-fit allocator, 45% for the 4.4BSD McKusick-Karels allocator, and 46% for the SunOS 5.x buddy allocator. Like the zone allocator, SLAB allocates per-object caches (along with anonymous caches in useful sizes) called kmem caches. Each cache has an associated optional constructor and destructor function run on the objects in a newly allocated and newly freed page, respectively (though the destructor has since been removed in the Linux allocator). Each cache is a doubly-linked list of slabs - large contiguous chunks of memory. Each slab keeps its slab data structure at the end of the slab, and divides the rest of the space into objects. Any leftover free space in the slab is divided between the beginning and end of the objects in order to vary the offset of objects with respect to the CPU cache, improving cache utilization (in other words, cache coloring). Each object has an associated 4-byte free list pointer. The slabs within each kmem cache are in a doubly linked list, sorted so that free slabs are located at one end, fully allocated slabs at the other, and partially allocated slabs in the middle. Allocations always come from partially allocated slabs before touching free slabs. Freeing an object is simple: since slabs are always the same size and alignment, the base address of the slab can be calculated from the address of the object being freed. This address is used to find the slab on the doubly linked list. Free counts are maintained on an on-going basis.
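The address calculation used when freeing an object is a single mask operation when the slab size is a power of two. The sketch below illustrates that arithmetic only; the 4096-byte slab size, the function names, and the color_offset parameter (standing in for the leading free space left by cache coloring) are assumptions, not values taken from any particular kernel.

```python
SLAB_SIZE = 4096  # assumed slab size in bytes; must be a power of two


def slab_base(obj_addr, slab_size=SLAB_SIZE):
    """Round an object address down to the start of its enclosing slab."""
    if slab_size <= 0 or slab_size & (slab_size - 1):
        raise ValueError("slab_size must be a power of two")
    # Clearing the low bits works because every slab starts on a
    # slab_size-aligned boundary.
    return obj_addr & ~(slab_size - 1)


def object_index(obj_addr, object_size, color_offset=0, slab_size=SLAB_SIZE):
    """Index of the object within its slab, given the slab's coloring offset."""
    return (obj_addr - slab_base(obj_addr, slab_size) - color_offset) // object_size
```

For example, an object at address 0x12345 in a 4096-byte slab belongs to the slab starting at 0x12000; no search or per-object header is needed to find it.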
When memory pressure occurs, the slab allocator walks the kmem caches freeing the free slabs at the end of the cache's slab list. Slabs for larger objects are organized differently, with the slab management structure allocated separately and additional buffer management data included. This section of UNIX Internals has aged particularly well, partly because the SLAB allocator continues to work well on modern systems. As Vahalia notes, the SLAB allocator initially lacked optimizations for multi-processor systems, but these were added shortly afterward, using many of the same techniques as the Dynix hierarchical allocator. Since then, most production kernel memory allocators have been SLAB-based. Recently, Christoph Lameter rewrote SLAB to get the SLUB allocator for Linux; both are available as kernel configuration options. (The third option, the SLOB allocator, is not related to SLAB - it is a simple allocator optimized for small embedded systems.) When viewed in isolation, the SLAB allocator may appear arbitrary or over-complex; when viewed in the context of previous memory allocators and their problems, the motivation behind each design decision is intuitive and clear. File Systems Implementations UNIX Internals includes four chapters on file systems, covering the user and kernel file system interface (VFS/vnode), implementations of on-disk and in-memory file systems, distributed/network file systems, and "advanced" file system topics - journaling, log-structured file systems, etc. Despite the intervening years, these four chapters are the most comprehensive and practical description of file systems design and implementation I have yet seen. I definitely recommend it over UNIX File System Design and Implementation - a massive sprawling book which lacks the focus and advanced implementation details of UNIX Internals. 
The chapter on file systems implementations is too packed with useful detail to review fully in this article, so I'll focus on the points that are relevant to current hot file system design problems. The chapter describes the System V File System (s5fs) and Berkeley Fast File System (FFS) implementations in great detail, followed by a survey of useful in-memory file systems, including tmpfs, procfs (a.k.a. /proc file system), an early variant of a device file system called specfs, and a sysfs-style interface for managing processors. This chapter also covers the implementation of buffer caches, inode caches, directory entry caches, etc. One of the features of this chapter (as elsewhere in the book) is the carefully chosen bibliography. Bibliographies in research papers serve a double purpose as demonstrations of the authors' breadth of knowledge in the area and tend to be cluttered with more marginal references; the per-chapter bibliographies in UNIX Internals list only the most relevant publications and make excellent supplementary reading guides. System V File System (s5fs) evolved from the first UNIX file system. The on-disk layout consisted of a boot block followed by a superblock followed by a single monolithic inode table. The remainder of the disk is used for data and indirect blocks. File data blocks are located via a standard single/double/triple indirect block scheme. s5fs has no block or inode allocation bitmaps; instead it maintains on-disk free lists. The inode free list is partial; when no more free inodes are on the list, it is replenished by scanning the inode table. Free blocks are tracked in a singly linked list rooted in the superblock - a truly terrifying design from the point of view of file system repair, especially given the lack of backup superblocks. 
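The single/double/triple indirect scheme mentioned above maps a logical block number within a file to a chain of pointer lookups. The following sketch shows that mapping; the figures of 10 direct pointers per inode and 256 pointers per indirect block are illustrative assumptions, and locate_block is a hypothetical name.

```python
def locate_block(logical_block, direct=10, ptrs_per_block=256):
    """Return (level, indices) describing how to reach a logical file block.

    Level 0 is a direct pointer in the inode; levels 1-3 are single, double,
    and triple indirect. `indices` lists the slot to follow at each step.
    """
    if logical_block < 0:
        raise ValueError("negative block number")
    if logical_block < direct:
        return 0, [logical_block]
    remaining = logical_block - direct
    span = ptrs_per_block              # blocks reachable at the current level
    for level in (1, 2, 3):
        if remaining < span:
            indices = []
            for _ in range(level):
                span //= ptrs_per_block
                indices.append(remaining // span)
                remaining %= span
            return level, indices
        remaining -= span
        span *= ptrs_per_block
    raise ValueError("block number beyond triple-indirect range")
```

The returned level is also the number of indirect blocks that must be read before the data block itself, so an uncached access deep into a large file costs up to three extra reads on top of the inode lookup.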
In many respects, s5fs is simultaneously the simplest and the worst UNIX file system possible: its throughput was commonly as little as 5% of the raw disk bandwidth, it was easily corrupted, it had a 14 character limit on file names, and so on. On the other hand, elements of the s5fs design have come back into vogue, often without addressing the inherent drawbacks still unsolved in the intervening decades. The most striking example of a new/old design principle illustrated by s5fs is the placement of most of the metadata in one spot. This turned out to be a key performance problem for s5fs, as every uncached file read virtually guaranteed a disk seek of non-trivial magnitude between the location of the metadata at the beginning of the disk and the file data, located anywhere except the beginning of the disk. One of the major advances of FFS was to distribute inodes and bitmaps evenly across the disk and allocate associated file data and indirect blocks nearby. Recently, collecting metadata in one place has returned as a way to optimize file system check and repair time as well as other metadata-intensive operations. It also appears in designs that keep metadata on a separate high-performance device (usually solid state storage). The problems with these schemes are the same as the first time around. For the fsck optimization case, most normal workloads will suffer from the required seek for reads of file data from uncached inodes (in particular, system boot time would suffer greatly). In the separate metadata device case, the problem of keeping a single, easily-corrupted copy of important metadata returns. Currently, most solid-state storage is less reliable than disk, yet most proposals to move file system metadata to solid state storage make no provision for backup copies on disk. Another cutting edge file system design issue first encountered in s5fs is backup, restore, and general manipulation of sparse files. 
System administrators quickly discovered that it was possible to create a user-level backup that could not be restored because the tools would attempt to actually write (and allocate) the zero-filled unallocated portions of sparse files. Even more intelligent tools that do not explicitly write zero-filled portions of files still had to pointlessly copy pages of zeroes out of the kernel when reading sparse files. In general, the file and socket I/O interface requires a lot of ultimately unnecessary copying of file data into and out of the kernel for common operations. It has only been in the last few years that more sophisticated file system interfaces have been proposed and implemented, including SEEK_HOLE/SEEK_DATA and splice() and friends. The chapters on file systems are definitely frustratingly out of date, especially with regard to advances in on-disk file system design. You'll find little or no discussion of copy-on-write file systems, extents, btrees, or file system repair outside of the context of non-journaled file systems. Unfortunately, I can't offer much in the way of a follow-up reading list; most of the papers in my file systems reading list are covered in this book (exceptions include the papers on soft updates, WAFL, and XFS). File systems developers seem to publish less often than they used to; often the options for learning about the cutting edge are reading the code, browsing the project wiki, and attending presentations from the developers. Your next opportunity for the latter is the Linux Plumbers Conference, which has a number of file system-related talks. Another major flaw in the book, and one of the few places where Vahalia was charmed by an on-going OS design fad, is the near-complete lack of coverage of TCP/IP and other networking topics (the index entry for TCP/IP lists only two pages!). Instead, we get an entire chapter devoted to streams, at the time considered the obvious next step in UNIX I/O. 
If you want to learn more about UNIX networking design and implementation, this is the wrong book; buy some of the Stevens and Comer networking books instead. Summary UNIX Internals was the original inspiration for the Kernel Hacker's Bookshelf series, simply because you could always find it on the bookshelf of every serious kernel hacker I knew. As the age of the book is its most serious weakness, I originally intended to wait until the planned second edition was released before reviewing it. To my intense regret, the planned release date came and went and the second edition now appears to have been canceled. UNIX Internals is not the right operating systems book for everyone; in particular, it is not a good textbook for an introductory operating systems course (although I don't think I suffered too much from the experience). However, UNIX Internals remains a valuable reference book for the practicing kernel developer and a good starting point for the aspiring kernel developer. Find SQL injection vulnerabilities with sqlmap SQL injections are a particularly nasty type of web application vulnerability that can lead to loss or disclosure of the contents of a database. Testing a web application to find SQL injection holes can be a tedious process, which is where the sqlmap tool may come in handy. sqlmap automates the process of testing a particular web page for various kinds of SQL injection flaws. Sqlmap is a command-line driven Python application that can help in both finding and exploiting SQL injections. By giving it a URL and parameter names of interest (from HTML forms or GET parameters), it tries to determine which of those parameters cause different output based on their value, indicating that they control the dynamic behavior of the application. Those parameters are then tested by repeatedly making an HTTP request with slightly different values. Each of the values passed corresponds to a SQL injection technique, such as appending a single-quote. 
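Why an appended single quote is a useful probe can be seen with a small, self-contained experiment against an in-memory SQLite database. This is not how sqlmap is implemented; every name below is hypothetical, and an "ok" versus "error" outcome stands in for the comparison of HTML responses that a real test would make.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with one row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id TEXT, name TEXT)")
    db.execute("INSERT INTO items VALUES ('1', 'widget')")
    return db


def lookup_unsafe(db, item_id):
    """Builds SQL by string concatenation, as a vulnerable page might."""
    query = "SELECT name FROM items WHERE id = '" + item_id + "'"
    return db.execute(query).fetchall()


def lookup_safe(db, item_id):
    """Passes the value as a bound parameter, so a quote is just data."""
    return db.execute("SELECT name FROM items WHERE id = ?", (item_id,)).fetchall()


def respond(lookup, db, value):
    """Run a lookup and reduce the outcome to what a client could observe."""
    try:
        return ("ok", lookup(db, value))
    except sqlite3.Error:
        return ("error", None)


def quote_changes_behavior(lookup, db, value):
    """True if appending a single quote turns a normal reply into an error."""
    baseline = respond(lookup, db, value)
    probed = respond(lookup, db, value + "'")
    return baseline[0] == "ok" and probed[0] == "error"
```

With the concatenating lookup, the extra quote unbalances the string literal and the database rejects the statement; with the parameterized lookup, the same input is simply a value that matches no row. That difference in behavior is the signal a tester looks for.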
Based on whether the HTML response is different from the original response, the potential for a SQL injection can be inferred. The tool also tests an often overlooked input source: cookies. The user can specify a cookie value which the tool will then manipulate to attempt a SQL injection via the cookie. Since many applications store their session information in a database using the cookie value as a key, this is a relatively common route to SQL injection, one that penetration tests sometimes miss. While it does help remove some of the tedium involved in testing for SQL injections, sqlmap is by no means an automated solution. A fair amount of work is required to find a vulnerable parameter. Once a vulnerability has been found, though, a great deal of information, including database contents, can be retrieved with a single command. Like many security tools, sqlmap can be used by those of malicious intent rather easily. The automated retrieval of database passwords and contents from a vulnerable application is particularly powerful, and thus dangerous. For some database installations, there is even a mode that will get a shell prompt on the server as the user that runs the database application. Because it is free software, sqlmap is very useful for understanding SQL injections and, perhaps more importantly, what kinds of things an attacker can do by abusing a vulnerable application. There is excellent documentation, both for developers and users. Sqlmap recently released version 0.6 and is certainly worth a look for anyone interested in testing a web application or curious about SQL injection in general. Kernel security, year to date Earlier this year, your editor asked a high-profile kernel developer, in a public discussion at a conference, about the seemingly large number of kernel-related security bugs. Was the number of these vulnerabilities of concern, and what was being done about it?
The answer that came back was that security issues aren't a huge concern, that most of the reported issues were obscure local exploits requiring the presence of specific hardware. Serious issues, like the vmsplice() vulnerability, are rare. More recently, as part of the panic associated with getting a talk together for the Linux Plumbers Conference, your editor decided to take a closer look at kernel vulnerabilities. It turns out that there are, in fact, quite a few of them. The vulnerabilities which have been given CVE numbers in 2008 (so far) are: That is 41 CVE numbers (so far) for 2008 - not a small number. Fully 1/3 of these vulnerabilities were in the networking subsystem, which is scary: this is the most likely place to find remotely-exploitable problems in the kernel. It is true that sites not running SCTP or DCCP can forget about many of those, and IPv6 is responsible for a few of the rest, so most of those vulnerabilities were not a concern for most sites. Many of the remaining vulnerabilities were in the core kernel or in architecture-specific code. The number of vulnerabilities found in drivers - the part of the kernel which has long been sneered at as containing the worst code - is actually quite small. On the other hand, four of the CVE-listed vulnerabilities (the Xen, AppArmor, and utrace problems) were caused by out-of-tree code added by distributors. There is no way to know how many vulnerabilities were fixed without obtaining a CVE number - or without even realizing that a vulnerability existed in the first place. When a single program is responsible for this many vulnerabilities, it makes sense to ask why. The kernel, of course, is a very large program; more code means more bugs, some of which will have security implications. Beyond that, though, the kernel runs in a special, privileged environment. Flaws which would simply be fixed as just-another-crash in a normal application are denial-of-service vulnerabilities in the kernel - or worse. 
So a larger number of vulnerabilities in the kernel does not, by itself, imply that the kernel's code is worse than that of other programs; it only reflects the fact that the consequences of kernel bugs tend to be more severe. The discovery (and repair) of vulnerabilities does not necessarily imply that our current process is creating a lot of vulnerabilities; it could be that we are mostly fixing older problems. If the developers are fixing vulnerabilities more quickly than they are adding more, life should be good in the long run. The vulnerabilities in the list above vary from those which are very old (affecting 2.4 kernels too) to some which are very new (the UVC driver was added in 2.6.26). Some of them are in code which, while being intended for the mainline, has not yet been merged. It is probably impossible to say whether security problems are being fixed more quickly than they are being created, but one thing is clear: all of that code flowing into the mainline is bringing a certain number of security problems with it. For that reason, it is a little discouraging that there is little work being done in the kernel community with the explicit goal of improving the security of the kernel. Few patches are reviewed with security issues in mind; the vmsplice() vulnerability, as one example, was a clear failure of the review process. There are undoubtedly many people who are doing fuzz testing and such - some of them are even the good guys - but much of the formal testing going on seems aimed more at API conformance than at security verification. There must be more work going on behind the scenes, but it is still hard to avoid a sense of a certain amount of complacency with regard to security issues. As a community, we take pride in the security of our system. But one vulnerability per week is not the most inspiring security record. It would be good to find a way to do better than that. 
Better tools must be a part of the solution, but more thorough code review is also needed. There still is no substitute for a pair of eyeballs looking for ways in which new code might be subverted. Asking for more security-oriented review seems ambitious when code review is already one of the biggest bottlenecks in the development process. But the alternative would appear to be to continue to add to our collection of CVE numbers. System calls and rootkits A patch to add some security checks before making system calls would seem like a reasonable addition to the kernel, but because it is, at best, a half-measure, it received a less than enthusiastic response. Preventing rootkits—malware that alters the kernel to hide its presence and function—from altering the system call table was the rationale behind the patch, but it would only work for the current crop of rootkits. Once that change was made, rootkit authors would just change their modus operandi in response. There are many possible ways that a root user—or malware running as root—can modify a Linux system to run rootkit code. Some currently "popular" rootkits modify the system call table, though it is ostensibly read-only. Some commercial malware scanners that run on Linux have also been known to use this technique. In both cases, certain system calls are re-routed from the standard kernel code to code that lives elsewhere. That code, running in kernel mode, can then do just about anything it wants with the system. Arjan van de Ven proposed a patch that hooked into the system call entry code to check the address of the call to ensure that it was within the addresses occupied by kernel code. He describes the change and its impact this way: The patch below, while obviously not perfect protection against malware, adds some cheap sanity checks to the syscall path to verify the system call is actually still in the kernel code region and not some external-to-this region such as a rootkit. 
The overhead is very minimal; measured at 2 cycles or less. (this is because the branches get predicted right and the rest of the code is almost perfectly parallelizable... and an indirect function call is a branch issue anyway) Various kernel hackers pointed out the flaws inherent in that scheme. As Andi Kleen succinctly puts it: This just means that the root kits will switch to patch the first instruction of the entry points instead. [...] So the protection will be zero to minimal, but the overhead will be there forever. One of the more interesting ideas to come out of the discussion was Alan Cox's thoughts on using a hypervisor to enforce protections: The only place you can expect to make a difference here is in virtualised environments by teaching KVM how to provide 'irrevocably read only' pages to guests where the guest OS isn't permitted to change the rights back or the virtual mapping of that page. Ingo Molnar described a rather complicated scheme that might increase the likelihood of a rootkit being detected, but with a fairly high cost—in build complexity as well as the ability to debug the resulting kernel. The compiler would be changed to insert calls to rootkit checks randomly throughout the kernel binary in ways that would be difficult or impossible for a rootkit to detect and evade. In the end, though, a rootkit could simply install a new kernel that does exactly what it wants, then cause, or wait for, a reboot. Without some kind of hardware enforcement (e.g. Trusted Platform Module) or locked-down virtualization, Linux is defenseless against attacks that run as root. The kernel could change to thwart a particular kind of attack, such as van de Ven's patch, but other kinds of attacks will still succeed. It is clearly a situation where "the only way to win is not to play this game", as Pavel Machek—amongst others—noted in the thread. In the end, van de Ven wrote off the patch as an exercise in measuring the cost of this kind of runtime checking. 
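The idea behind the check can be sketched in a few lines of Python. This is a user-space simulation, not the actual patch: the addresses, the TextRegion class, and find_suspect_entries() are hypothetical names chosen for illustration, and the real check ran in the system call entry path rather than scanning a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRegion:
    """Half-open address range [start, end) standing in for kernel code."""
    start: int
    end: int

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.end


def find_suspect_entries(table, kernel_text):
    """Return the system call numbers whose handler lies outside kernel_text."""
    return [nr for nr, addr in enumerate(table)
            if not kernel_text.contains(addr)]


# Made-up addresses for illustration only.
KERNEL_TEXT = TextRegion(0xC0100000, 0xC0400000)
clean_table = [0xC0101000, 0xC0102400, 0xC0103800]

# A rootkit-style hook: entry 1 now points at code outside the kernel text.
hooked_table = list(clean_table)
hooked_table[1] = 0xF8A00000

print(find_suspect_entries(clean_table, KERNEL_TEXT))   # no entries flagged
print(find_suspect_entries(hooked_table, KERNEL_TEXT))  # entry 1 flagged
```

The same model shows why the objection raised in the discussion holds: a rootkit that overwrites the first instruction of a legitimate handler leaves every table entry inside the kernel text range, so a range check of this kind would not notice it.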
The patch was a fairly low-cost solution, but without any major upside. The real upside was getting kernel hackers thinking about the problem, which could lead to some better solutions down the road. Tightening the merge window rules The 2005 kernel summit included a discussion on a recurring topic: how can the community produce kernels with fewer bugs? One of the problems which was identified in that session was that significant changes were often being merged late in the development cycle with the result that there was not enough time for testing and bug fixing. In response, the summit attendees proposed the concept of the "merge window," a two-week period in which all major changes for a given development cycle would be merged into the mainline. Once the merge window closed, only fixes would be welcome. Three years later, the merge window is a well established mechanism. Over that time, the discipline associated with the merge window has gotten stronger; it is now quite rare that significant changes go into the mainline outside of the merge window. The one notable exception is that new drivers can be accepted later in the cycle, based on the reasoning that a driver, being completely new and self-contained functionality, cannot cause regressions. Even then, there are hazards: the UVC webcam driver, merged quite late in the 2.6.26 cycle (in 2.6.26-rc9), brought a security hole with it. The merge window rule is often expressed as "only fixes can go in after the -rc1 release." Recent discussions have made it clear, though, that Linus is starting to develop a rather more restrictive view of how development should go outside of the merge window. The imminent 2008 kernel summit may well find itself taking on this topic and making some changes to the rules. In short, Linus has concluded that "fixes only" is not disciplined enough; a lot of work characterized as a "fix" can, itself, be a source of new regressions.
So here's how Linus would like developers to operate now: Here's a simple rule of thumb: if it's not on the regression list, if it's not a reported security hole, if it's not on the reported oopses list, then why are people sending it to me? There can be no doubt that the tighter rules have come as a surprise to a number of developers - if nothing else, the frequency with which Linus has found himself getting grumpy with patch submitters makes that clear. And, the truth of the matter is that Linus has not enforced anything like the above rule in the past. Beyond new drivers, post-merge-window changes have typically included things like coding style and white space fixups, minor feature enhancements, defconfig updates, documentation updates, annotations for the sparse tool, and so on. Relatively few of these changes come equipped with an entry on the regression list. To look at this another way, here's a table which appeared in the 2.6.26 development statistics article, updated with 2.6.27 (to date) information: * (Through September 9). 2.6.27 appears to be following the trend set by previous kernels: on the order of 25% of the total changesets will be merged outside of the nominal merge window. The most recent 2.6.27 regression summary shows a total of 150 regressions during this development cycle, of which 33 were unresolved. That suggests that at least 2300 patches merged since 2.6.27-rc1 were not fixes for listed regressions. So the "regression fixes only" policy is truly new - and not really effective yet. Should this policy hold, it could have a number of interesting implications including, perhaps, an increase in the number of non-regression fixes shipped in distributor kernels. It might make developers become more diligent about reporting regressions so that the associated fix can be merged. With fewer changes going in later in the cycle, development cycles might just get a little shorter, perhaps even to the eight weeks that was, once, the nominal target.
And, of course, we might just get kernel releases with fewer bugs, which would be a hard thing to complain about. In the short term, though, expect more grumpy emails to developers who are still trying to work by the older rules. What's up with the Intrepid Ibex Ubuntu's current development release is called the Intrepid Ibex, which is soon to become v8.10. The Alpha5 release was announced this week, which is pretty close to on schedule. One more alpha release is planned, followed by a single beta, and the final release should be available by October 30, 2008. Looking at the blueprints for Intrepid we see a number of high priority items such as 3G networking, which will be integrated into NetworkManager. Another high priority item is an improved flash experience, which is aimed at improving the plugin finder wizard, better interaction with sites that use the flash detection kit, and an improved user-experience for selecting available alternatives. Internally there are the Package Status Pages, which are meant to provide a web page for each of the top 20-30 packages in Ubuntu showing bug counts and other vital signs and statistics. What else is new in Intrepid? GNOME 2.23.91, X.Org server 7.4, Linux kernel 2.6.27, and Network Manager 0.7 are all being included. An encrypted private directory will also be added to each home directory. In addition, there's a Guest session available from the User Switcher panel applet to give temporary access with restricted privileges. Dynamic Kernel Module Support (DKMS) is also available in Intrepid. It allows kernel drivers to be automatically rebuilt when new kernels are released. This makes it possible for kernel package updates to be made available immediately without waiting for rebuilds of driver packages, and without third-party driver packages becoming out of date. Finally, the "Last successful boot" recovery entry retains a copy of your running kernel and makes it available from the boot loader. 
This makes it possible for old kernel packages to be safely auto-removed by the package manager, instead of being kept indefinitely. Kubuntu will be using KDE4, with no plans to support KDE3. The Kubuntu wiki for Intrepid says, "KDE 3 is obsolete and largely unmaintained. Keeping with KDE 3 would offer no advantage over giving users Hardy." Bug squashing has been ongoing, with a number of focused Hug Days. The latest of these will be held September 11 to focus on bugs that don't have a package assigned to them. There are still a few known issues in the Alpha5 release, but overall the development is progressing nicely. Of course, if wild mountain goats are not your thing (however intrepid they might be), you can always wait for the more mythological Jaunty Jackalope, which will be in the planning stages at a Ubuntu Developer Summit (UDS) in Mountain View, California next December. Waiting for Rockbox 3.0 - again Rockbox is a GPL-licensed replacement firmware for a number of digital audio players. LWN published an article on the imminent Rockbox 3.0 release in May, 2006. Well over two years later, it is clear that some projects use a larger value of "imminent" than others. In this case, the Rockbox developers concluded that certain problems simply were not going to be resolved in any reasonable 3.0 time frame; rather than make a major release with known problems, they simply gave up on 3.0 at that time. As a result, the current stable Rockbox release is Rockbox 2.5, from September, 2005. It is probably safe to bet that few Rockbox users are running 2.5, which only had support for a handful of Archos players. Grabbing a daily build is a fact of life in the Rockbox community. Meanwhile, Rockbox has performed a valuable service for Debian developers who would otherwise have to struggle to find a project with longer release cycles than their own. Perhaps that state of affairs is about to change. 
Back in July, the project announced that, once again, an attempt was to be made for a 3.0 release. On August 15, Rockbox went into feature freeze, with the 3.0 release planned for "within a couple (as in two) weeks." That, of course, was a few (as in three) weeks ago, but this release is clearly getting closer. Now would seem like the time for the project to begin its hype campaign with lots of screenshot-heavy articles on all of the features this major release will bring. Evidently the Rockbox developers have some strange ideas about actually working on the code, though; they haven't gotten around to the promotional side of things yet. So, while the Rockbox manual is reasonably comprehensive and current, it's hard to come up with a list of changes for the 3.0 release. At the top of any list would have to be the list of supported players, which has expanded considerably since the 2.5 release. The Rockbox buyer's guide gives a good summary of the currently-supported players. Alas, none of these players are currently in production, though some can still be found on auction sites and elsewhere. There is progress toward support for some more contemporary players; early successes have been announced for the Cowon iAudio D2 and iAudio i7 devices. Those players will not be supported in the 3.0 release, of course, and the Rockbox developers have reserved the right to withhold support for other players as well if it is not stable enough. Beyond that, changes to Rockbox in recent times include the ever-growing list of codecs (including some video formats on suitable players), a five-band parametric equalizer, an increasingly powerful theme capability with many user-contributed themes, album art display, a highly capable tag database, Speex codec support for the voice-based interface, and a whole host of new plugins including the much-anticipated Lamp plugin which displays a blank screen at full intensity, turning your player into an expensive, short-lived flashlight. 
Rockbox 3.0, it seems, will have something for almost everybody. It also appears that 3.0 may include the hard-to-find RBUtil program - a Qt-based tool which automates the process of installing Rockbox. Given that installation can be a bit of a sweaty-palms experience overshadowed by the fear of turning that nice, new player into a brick, any help which can be given is more than welcome. Bricks, after all, are not known for high-fidelity sound. Another recent event in the Rockbox community is the creation of the Rockbox Steering Board, currently consisting of Daniel Stenberg, Linus Nielsen Feltzing, Dave Chapman, Paul Louden, and Jens Arnold. The mandate for this board is not particularly clear; it seems to be intended to help break deadlocks in technical discussions. There have been some concerns raised that the creation of this board is a sign that Rockbox is moving into a more bureaucratic, slow-moving mode, but those worries are probably premature. Rockbox developers also recently decided that all of the project's code would be licensed as "GPLv2 or later." While there is no plan for Rockbox to switch to GPLv3, the developers wanted their code to be available to other projects which are using that license. Since Rockbox does not require copyright assignments, this change will require an audit to find any GPLv2-only code and either relicense it or remove it. There have been no public announcements on how that process is going. The Rockbox project faces a number of challenges. Cooperation from vendors is essentially zero, so all ports require a reverse engineering effort. Target platforms go through their market lifecycle quickly, making it difficult to get a port stable before the target device disappears.
Its programming environment is highly specialized and resource-constrained, limiting the pool of developers who can work on the project. And, someday, the whole effort may lose its relevance as platforms become more capable and it gets easier to just run Linux on them. For now, though, there is nothing better for those who want a dynamic and user-oriented operating system for their digital audio player, and it continues to improve. Fedora distributes new keys The Fedora project is back on track after its recent "infrastructure issues" with new package signing keys as well as packages and updates signed with the new keys. Fedora users should be able to pick up the new key and update their systems now, with a minimum of hassle—just verifying and accepting the new key. But, no further information has been released about exactly what went wrong, leading to more speculation and some worry in the Fedora community. When a user gets a package from their distribution—or, more likely, a mirror of their distribution repository—they need to have some way to determine that it is a valid package. Distributors sign packages using a private key; that signature can then be verified by using the distribution's public key. If the private key gets compromised somehow, malicious packages could be created that would be indistinguishable from the real versions. This is why private signing keys must be well guarded, usually by isolating them on separate machines and encrypting them with a password. According to one of the announcements about the problem, there is no evidence that the passphrase used to guard the Fedora private signing key has been compromised, though the clear implication is that the encrypted key file may have been captured. Out of an abundance of caution—and perhaps the concern that the passphrase might be guessed or brute-forced—the project decided to generate new keys. 
Along with new keys come various headaches: re-signing all of the packages as well as getting the keys installed on users' machines. Getting the keys to users is largely a matter of getting the new fedora-release package (along with PackageKit and friends for GUI-enabled updates) installed. That package contains the new key and repository name (updates-newkey). Of necessity, those updates are the last that will be signed with the old key, so they will install on existing Fedora systems. Once that package makes its way out to the mirrors, users can install it so that they can proceed with any needed updates using the new key. Running yum clean metadata was helpful at the time of this writing to accelerate the process; depending on which mirror is being used and when it gets updated, that may not be needed. After fedora-release is installed, yum list updates gives a long list of available updates, all signed with the new key. All a user needs to do is verify the key and add it to the RPM key database. Verifying the key is a manual step, as a user must check its fingerprint against the one published on the web site. The method described requires importing the key into gpg, then running gpg --fingerprint fedora@fedoraproject.org to see the key fingerprint; this is clearly something that could be made easier. As part of phase one of the re-signing, Fedora has re-signed all Fedora 8 and 9 package updates. Phase two is ongoing, re-signing each package that is distributed as part of the original release of Fedora 8 and 9. Fedora 10 already has a new signing key as well. From the perspective of a possible compromise of the signing keys, things are well on their way back to normal. But there is still the nagging issue of how this all came about to begin with. Several different questions about the intrusion were directed at the Fedora board from community members in their IRC meeting on September 9.
Unfortunately, there was no new information forthcoming, nor was there any indication of when that information might be available. According to the board member Tom "spot" Callaway, information will be released "when we're told that we can by the parties running the investigation, not a second before, and not a second later." Red Hat is clearly holding all information about the intrusion as a closely guarded secret—whether that is at the behest of law enforcement or just lawyers is unclear. While there was no timeline given, the clear sense that one got from the meeting is that it might be weeks or months before clearance will be granted to even confirm that they know how the intrusion occurred. In addition, the Fedora board has not been officially briefed on the incident; some members have knowledge because of their Red Hat responsibilities, but the rest are in the dark. If one needed a reminder that Fedora is not an independent distribution, but instead is subject to the whims of Red Hat, this is a clear demonstration. The justification for secrecy is that Red Hat is a publicly traded company so intrusions into its systems need to be treated differently. Some board members believe that had there not been an intrusion into the servers that handle packages for Red Hat Enterprise Linux—that is if it had only been Fedora servers that were affected—the incident would have been handled much more transparently. Overall, the board is clearly unhappy about the situation but, perhaps because they are almost all Red Hat employees, don't see that there is much that can be done about it. That too should serve as a reminder. It should be noted that Debian has had several server compromises over the years (for example, 1 and 2), which is, perhaps, a poor record of server security, but it is an excellent example of transparency. Debian is rather well known for its independence, which is part of what allows it to be so open. 
Those incidents do serve as examples; perhaps they are not an exact fit for the current Fedora/RHEL intrusion but that remains to be seen. It may very well be that Red Hat is between a rock and a hard place here. As a friend to free software, Red Hat is unparalleled, but once in a while it shows that it is foremost a corporation with responsibilities to its shareholders. When those responsibilities conflict with the transparency we have come to expect from free software projects, especially with regard to security issues, that transparency must be set aside. One can argue that Red Hat is being overly protective of the details (confirmation that they either know or do not know how the intrusion occurred, for example), but that argument really can't be made until all the facts are known. For that we must wait for the process to run its course. The OpenBTS project creates a stand-alone cell phone network On September 3, 2008, Harvind Samra announced the new OpenBTS project: The Open BTS Project is an effort to construct an open-source Unix application that uses the Universal Software Radio Peripheral (USRP) to present a GSM air interface ("Um") to standard GSM handset and uses the Asterisk software PBX to connect calls. The combination of the ubiquitous GSM air interface with VoIP backhaul could form the basis of a new type of cellular network that could be deployed and operated at substantially lower cost than existing technologies in greenfields in the developing world. OpenBTS is currently a work in progress; released components (and the associated pile of telecom acronyms) include a Gaussian minimum-shift keying (GMSK) radio modem and interface code for the USRP hardware, GSM forward error correction (FEC) coders and decoders, GSM L3 message serializers/deserializers, a hybrid GSM/SIP control layer, and a partial short message service (SMS) stack implementation. There are plans for expanding the functionality of the various components of the code.
The fairly short project FAQ notes a potential legal issue with a proposed workaround solution: "Although the project founders have built a more complete GSM BTS (base transceiver station), some of that code may be the subject of a legal dispute. While the authors deny any wrongdoing is this matter, it would still not be prudent to release all of the code in these circumstances... Hopefully, the incomplete parts can be replaced quickly." The OpenBTS developers ran a recent alpha-level system field test at the 2008 Burning Man art/technology festival in the Nevada desert. They applied for and received a temporary FCC license, memorialized by this poster, in order to keep everything legal with the licensing authorities. Around $7000 worth of radio equipment was assembled. To top it off, everything was powered by a small wind generator and a 12V battery. A WiFi backhaul connection was made to a nearby satellite ground station to provide VoIP connectivity to the external world. Some interesting technical problems were encountered, including being flooded by connections from active cell phones that were looking for connection points when the system was first activated. Another issue discovered was a "security hole" involving unlimited external long distance dialing. After sorting through the various issues, the system was declared operational. Many in-system and external voice and text connections were made, and the alpha test was declared a success. The live field test exposed a lot of real-world problems that led to numerous code improvements. There's no doubt that sitting in a tent in a hot and windy desert is a fairly difficult environment to develop code in, but progress was made nonetheless. The OpenBTS project illustrates the kind of technical advances that can be made by a small but dedicated group of people using open-source software and open hardware.
LIRC delurks The Linux Infrared Remote Control project (LIRC) provides drivers for a number of infrared receivers and transmitters. It is, perhaps, most heavily used by people running MythTV and similar packages; it would, after all, completely ruin the experience to have to get up from the couch to change channels. Despite their established user base, and despite the fact that a number of distributors ship the code, the LIRC drivers have never found their way into the mainline kernel. In more recent times, little effort has gone into their development and maintenance; the link to "Caldera OpenLinux" on the project's web site would seem to make that clear. But LIRC is useful code, and, as is the case with most out-of-tree drivers, most people would really rather see LIRC in the mainline kernel. Merging into the mainline got a step closer on September 9, when Jarod Wilson posted a version of the LIRC drivers for consideration. Jarod, it seems, has been working (with Janne Grunau) on these drivers for some months; in the process, they have eliminated "tens of thousands" of complaints from the checkpatch.pl script and cleaned up a number of things. Even after that work, though, the LIRC drivers are clearly not yet up to normal kernel standards. Some very strange coding conventions are used in places. Many of the drivers have broken (or completely absent) locking. Duplicated code abounds. One driver has implemented a command parser in its write() function. Another driver is for hardware which already has a different driver in the mainline. And, importantly, these drivers do not work with the input subsystem. In the past, Linus Torvalds (and others) have argued for merging drivers as soon as possible. If the code is poor, its chances of being improved get much higher once it's in the mainline and others can fix it.
The LIRC drivers would appear to strongly support the notion that out-of-tree code is, almost by necessity, worse code. These drivers have been around for almost a decade, have been packaged by distributors, and have been used by large numbers of people. Despite all of that, they contain a large number of serious problems which have never been addressed. Now that the drivers have been posted to the linux-kernel list, quite a few of these problems are being pointed out; Jarod and Janne have been responding to reviews and fixing the issues. The "merge drivers early" philosophy would argue for pushing LIRC into 2.6.28, even if serious problems remain. Presence in the mainline will raise the visibility of the code, inspiring (one hopes) more developers to work on fixing it up. Merging LIRC will also free distributors from the need to create separate packages for those drivers. One important question will have to be addressed before merging LIRC can be seriously considered, though: its user-space API. Once LIRC is merged, its user-space API will be set in stone, so any problems with that API need to be resolved first. LIRC, being out of the mainline, did not follow the development of the input subsystem, so it does not behave like other input drivers - even in-tree drivers for infrared remotes. The use of an in-kernel command-line parser in at least one driver is sure to raise eyebrows; that sort of interaction should really be handled via ioctl() or sysfs. All told, it is hard to imagine this code being merged until the API problems have been resolved. Changing the LIRC API will, of course, lead to problems of its own. There is user-space code which depends on the current API; any changes will break that code. The kernel community will certainly understand this problem, but is unlikely to be swayed by it. 
There are a number of risks associated with maintaining production kernel code out of the mainline tree; one of those risks is that your established APIs will not be accepted by the kernel development community. So an API change may simply be part of the cost of getting LIRC into the mainline at this late date. It should be a cost worth paying. Once LIRC is in the mainline, interested developers will work to continue to bring the code up to kernel standards. The community will maintain it going forward. All Linux users will get the LIRC drivers with their kernel, with no need to deal with external packages. Getting there may be a bit frustrating for users of remotes and (especially) for the developers who have taken on the task of getting this code into the mainline. But, once it's done, remotes will just be more normal hardware, supported by the kernel like everything else. DR rootkit released under the GPL A free software Linux rootkit has been announced with a number of interesting features. Its availability may, unfortunately, help lower the bar for "script kiddies" and others, but it also provides a nice look into what makes up a rootkit. The rootkit, called DR for Debug Register, uses some new techniques to evade detection, such that even a change recently proposed for inclusion in the kernel would have missed it. A rootkit is malware that typically hooks into the kernel to hide its presence from administrators. Usually, rootkits can hide their processes from /proc, which in turn means ps won't see them, but sophisticated rootkits do much more than that. DR can also hide network sockets and files in the filesystem that are associated with rootkit processes. There are some benefits to this approach as the announcement describes: The major benefit of the DR rootkit is that all this happens transparently to the end user. The children of a hidden process are also automatically hidden. The sockets a hidden process creates are also hidden. 
But if you are a hidden process, you can see hidden resources. This makes the DR rootkit nicely manageable. Unlike many rootkits, DR does not alter the system call table directly. Instead, it sets a hardware breakpoint on the syscall_call() function which gets called whenever a system call is made. When that breakpoint is reached, a handler is set up to watch for an access to the memory location where the specific system call's function pointer lives (i.e. syscall_table[__NR_syscall]). When the address is retrieved from that location, the breakpoint handler substitutes the address of the code the rootkit wants to run: the system call hook. The system call hooks are where the work is done to evade detection. By hooking fewer than a dozen different calls, DR can hide its processes, files, and sockets. By creating a program that does an exec() of a special filename (one that starts with "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"), one can set the "hidden" bit on the process; spawning a shell or running some malware after the exec() fails will cause those processes to no longer be visible to the rest of the system. There are some limitations outlined in the announcement, the biggest of which is that DR is implemented as a kernel module without any attempt to hide its presence. Doing an lsmod will show it clearly, but there are other ways to detect it as well. Fixing those is on the "to do" list and won't take a very large effort to complete. DR was created by Immunity, Inc. as part of their penetration testing efforts and has been released under the GPLv2. It contains roughly 1200 lines of well-documented code that should be of interest to anyone curious about rootkits. It is not the first rootkit available with source code (Adore predates it by several years, and there are probably others), but it is an interesting, if a bit scary, release. Review: Intellectual Property and Open Source Free software inevitably runs into the body of law known collectively as "intellectual property."
Many developers do their best to avoid the legal side of things whenever possible; others seem to like nothing better than extended debates on the topic. Regardless of one's own feelings in the matter, the fact remains that the legal system exists, that it affects our lives, and that we can only be better off if we understand it. To that end, O'Reilly has published Intellectual Property and Open Source by Van Lindberg. The book starts off with a Lessig-like comparison between code intended for computers and legal code. The legal code base is not as clean as one might like: It gets worse: every line of the legal code was written by committee, and almost every line of it has been patched by a later piece of legislation or modified by a court. Indeed, IP law is rooted in a more than 200-year-old codebase. Is it any wonder it's a mess? Mr. Lindberg is clearly trying to write for programmers, so code-based analogies abound. Patents are like regular expressions - quite powerful in the technologies they can match, but you never really know what they will catch until you try them. Patent documents are structured like ELF program headers, and the patent system as a whole is a sort of memoization scheme (we get a Python Fibonacci number generator as an example here). Contracts are like a distributed version control system - they let anybody create their own, localized law. And so on. Roughly the first half of the core part of the book is dedicated to explaining how the four main branches of intellectual property (patents, copyright, trademarks, and trade secrets) work. The chapter on the patent system notes some of the problems with software patents (in particular, the industry's use of oral tradition and the late recognition of software patents make most prior art invisible to investigators), but, to a great extent, it seems to be written for people who want to obtain patents, rather than those who feel the need to defend themselves against software patents.
It might have been nice to get a treatment of the often-quoted idea that software developers are better off not knowing about patents because that way they cannot be accused of willful infringement, but that topic was not touched. There is also no talk of the Open Invention Network or any other efforts to protect the community as a whole. The copyright chapter is a reasonably thorough treatment of the subject which notes how the scope of copyright has expanded over the years. The current situation is compared to an "allow by default" security policy where anything which can be said to have an expressive aspect gets copyright protection by default. Derivative works are discussed at length, leading to this interesting observation: The copyright complexity of open source software systems is in large part due to the rules surrounding derivative works. A large project like the Linux kernel has hundreds or thousands of authors... As a result, nobody really owns the Linux kernel; the best description of its status is that it is owned jointly by its developers. Just a few pages earlier, it is stated that joint ownership means that each author has full rights over the entire work and can do just about anything with it - like license it to others. A finding that the kernel was a joint work could lead to some unpleasant consequences; one hopes that Mr. Lindberg is not really saying that could happen. The book mentions the abstraction-filtration-comparison test used by some courts to determine if one body of code is derived from another, but says nothing about how that test works. It would have been nice to learn a bit more, since that is an important part of how copyright cases are resolved in the US. Also nice would have been some discussion of the value of registration of copyrights. The chapter finishes with this discouraging note: Under a legal realist analysis, any use of copyrighted material that was objectionable or questionable would be struck down as infringing. 
Non-objectionable use of copyrighted material would be allowed only if the political and economic interests in support of the use were more powerful than the political and economic interests against the use. Unfortunately, this is, in my opinion, the best guide to the outcome of any future copyright case. The discussion of trademarks (compared to desktop shortcut icons) is pretty much as one would expect. The chapter is more concerned with obtaining and defending trademarks than balancing trademarks against the ideals of free software. There is not much to say about trade secrets, though the chapter does touch on what happens if unreleased code is incorporated into a free application. The author concludes that the open development process makes this kind of contamination less likely than with proprietary projects. Next we move into a chapter on contracts and licenses which talks mostly about how contracts are formed and enforced. The book takes a strong position that all licenses are contracts; they are just a special form of contract which grants permission to use some sort of intellectual property. The other point of view (that licenses are distinct from contracts) is touched upon, but dismissed this way: The "pure license" interpretation favored by Eben Moglen makes the enforcement of the GPL much easier, there is no need to consider offers, or acceptances, or the other particulars of contract law discussed in this chapter. Unfortunately, it is impossible to say for certain if a particular agreement will be considered a license, a contract, or considered both a contract and a license. It is a tricky and case-specific question focused on whether the agreement includes a "restriction on the scope" of permissible action or whether it is simply a "covenant" to act in a certain way. Later on, the author refers to the GPL in particular as a "Schrödinger's license" with a currently undetermined nature; it might be "just a license" after all. 
Clearly there is some confusion on this point. It is worth noting that the book predates the appeals court decision in the JMRI case, which makes the "it's a license" interpretation far more likely. There is a chapter on the "economic and legal foundations of open source," talking about how the community works and, in particular, how free licenses work. There is little here which would be new to most LWN readers, but it might be good to hand to the corporate legal office. Speaking of that office, the next chapter talks about how to contribute to a project without getting into trouble with your employer. There is talk about proprietary information agreements, some important cases (including the Medsphere case, which your editor wishes had been more prominent on his radar), works for hire, and so on. The key advice from the author is to disclose your work and your ideas to your employer as soon as possible - preferably before beginning employment. This is a chapter that many free software developers should read. Chapter 10 is about choosing a license for a free software project. The importance of the topic is stressed - as is the importance of not trying to write one's own license. The author recommends that most projects should limit themselves to considering the 2-clause BSD license, the Apache license (v2), the Mozilla Public License, the GPL or LGPL (versions 2 or 3, though GPLv3 is said to be "a better and surer foundation for future development"), or the Open Software License (v3). Chapter 11 is about the issues involved in accepting patches from others. The author strongly recommends using some sort of signed contributor agreement or even copyright assignments. Getting assignments, he says, allows for "unified legal control," ease of relicensing, and the ability to do commercial licensing. It's probably good advice for a strongly corporate-controlled project, but may not fit with more community-oriented projects.
Unfortunately, the book perpetuates this particular fiction: In order to represent a code base against legal challenges, a single entity must have copyright ownership of all the code in that project. And, to make it worse: A good example of this is the BusyBox project... When people found out that BusyBox was being distributed in proprietary products without adherence to the license restrictions, the Software Freedom Law Center (SFLC) was able to file suit on behalf of the project because there were only two people that owned all the copyrighted code. There are a few problems here. No single entity owns the entire Linux kernel, but that code has been quite vigorously defended against some strong legal challenges. (It is interesting, actually, that the author managed to write this entire book without mentioning SCO once.) Kernel developers have also been able to enforce the kernel's copyright numerous times. Meanwhile, a quick look at the BusyBox code is sufficient to turn up copyright assertions from far more than two developers. Unified ownership of a code base may be the right thing for some projects, but the reasons cited here are clearly not applicable. That complaint notwithstanding, this chapter does contain useful information that should be kept in mind when accepting patches from others. Chapter 12 is about the GPL in particular. There is a lot of talk about just what is a derived work under the GPL - does it apply to kernel modules, for example? Unfortunately, the answer is "we just don't know." So, while the chapter is a reasonable summary of how the GPL works, once again there will be little there for most LWN readers. Chapter 13 gets into reverse engineering, providing a quick overview of how it can be done without getting into trouble. According to the book, reverse engineering is generally allowed in the US, even to the point of disassembling proprietary code to learn its secrets. 
There are a lot of pitfalls, though, and the DMCA changes the game significantly. This chapter is a good starting point, but anybody wanting to do reverse engineering in the US will probably want to learn rather more than what is on offer here. The final chapter talks about the creation of a non-profit corporation to own and/or manage a code base. It's mostly about what's required to create a corporation and keep it in good standing. This information may be useful to some, but it seems a little out of place here. After that, there are 80 pages of license lists and the full texts of a number of free software licenses. Perhaps it's useful reference material, but it's all easily available online; it's not clear that dedicating nearly 25% of the book to this material was necessary. The subtitle of this book is "a practical guide to protecting code," which makes one omission especially striking: there is not a word on how a project should deal with license violations. There is, by now, a fair amount of collective wisdom on how such problems should be approached, but it has not been collected here. There's also little talk of protecting projects against software patent problems, no talk of patent pools, and no talk of related issues like the Microsoft/Novell deal. Software patents have cast a big shadow over free software in the US, but the issue is not really touched upon in this book. It is also worth noting that the book is very heavily based on US law, and the author never attempts to look beyond the border. Certainly it would never have been possible to cover intellectual property law worldwide, but this narrow focus is still a little puzzling. Much intellectual property law in the US is based on international agreements, so an understanding of those agreements would help with the larger picture. A mention of the Berne Convention would not have been out of place, for example.
The other problem is that free software tends to have little respect for borders; there are few projects which are limited to a single country. Even if a project is based in the US, the existence of contributors elsewhere in the world is almost certain. Free software is a global phenomenon; it is not sufficient to think about US law alone. Despite these complaints, your editor has to say that this is a valuable book. It covers many of the basics of the law in a much clearer way than has been done before. Anybody who manages or contributes to a free software project (in the US, at least) should be familiar with the concepts discussed here. And certainly all of the people peppering the net with IANAL posts would be better informed after reading Intellectual Property and Open Source. This book should bring some light to a complex but crucially important part of the legal code which governs our actions, and that is a good thing. KS2008: Linux 3.0 Prior to this year's kernel summit, Alan Cox had suggested a possible topic: devote a development cycle to the removal of old, unused features - possibly breaking compatibility in places - and release the result as Linux 3.0. Alan did not attend the summit itself (the fact that it is being held in the U.S. was enough to ensure that), but his suggested topic was the first order of business. The result: it looks like there is no Linux 3.0 forthcoming right away, and the removal of older features is not on the agenda. There was some talk about the cost of maintaining older drivers and interfaces which are used by few people. This code requires updates for API changes and may contain security holes. In many cases, the drivers are for hardware which is unable to support features needed by contemporary software, with the result that users complain about tools like PulseAudio not working properly. Linus came into the discussion early to state his unhappiness with the idea. 
The cost of maintaining these old drivers, he asserts, is essentially zero. And, in places where there are costs, that is OK with him as well. In particular, it's fine with Linus if API changes are a pain; he wants developers to have to think about whether an API change is worth the trouble or not. Linus also pointed out that a lot of hardware which kernel developers see as being useless junk is, in fact, still useful in many parts of the world. There are a lot of people using old stuff, and he does not want to pull the rug out from under them. He is also not concerned about claims of possible security problems with the older code; should such problems exist, he says, they will affect so few people that it's really not worth the trouble for any self-respecting cracker to exploit them. So, he concluded, any sort of driver removal might end up getting rid of all of five drivers, which is probably not worth the effort. James Bottomley worried that, by disclaiming concern about things like security issues, we could be creating a two-tier system of support. Older hardware may be nominally supported, but no developers are really interested in keeping the code up, and nobody has the hardware to test it. Christoph Hellwig pointed out that creating a major release which only removed features would be a "marketing disaster." From there, the discussion began to drift a bit. Dave Jones suggested (to general applause) that a useful thing to deprecate would be the "deprecated" marker used within the kernel source. Deprecated functions generate large numbers of warnings, but nobody bothers to fix them; all the deprecation warnings really do is mask other, more important warnings. Christoph noted that the checkpatch.pl script can also warn about deprecated functions, and that it is a much better place for such warnings: there, the warnings affect the person submitting a patch instead of everybody building a kernel.
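For readers who have not run into the marker under discussion, here is a minimal sketch of how a deprecation annotation behaves with GCC. It is an illustration only, not kernel code: the macro name example_deprecated and all of the function names are hypothetical, and the only assumption carried over from the kernel is that its marker has been built on the same GCC "deprecated" function attribute.

```c
#include <assert.h>

/* Hypothetical stand-in for a deprecation marker; it simply wraps the
 * GCC function attribute, which is an assumption about how such a
 * marker is built rather than a copy of any kernel header. */
#define example_deprecated __attribute__((deprecated))

/* The new interface that callers are supposed to migrate to. */
int scale_value(int value, int factor)
{
    return value * factor;
}

/* The old interface, kept for compatibility and flagged as deprecated. */
example_deprecated int double_value(int value);

int double_value(int value)
{
    return scale_value(value, 2);
}

/* An unconverted caller: this still compiles and works, but GCC emits a
 * deprecation warning for the call below on every build of this file. */
int legacy_caller(int value)
{
    return double_value(value);
}

/* A converted caller: same result, no warning. */
int converted_caller(int value)
{
    return scale_value(value, 2);
}

/* Both callers must agree, which is why nothing forces a conversion. */
void check_callers(void)
{
    assert(legacy_caller(21) == converted_caller(21));
}
```

The warning lands on whoever compiles legacy_caller(), not on whoever wrote it, and the build succeeds regardless. Multiply that by every unconverted call site in a large tree and the result is the wall of noise that Dave was complaining about.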
Then it was suggested that, perhaps, a concerted effort should be made toward the removal of all warnings from the kernel build. That idea did not get very far either. Quite a few warnings from GCC are bogus, in that they are complaining about entirely valid code. Fixing warnings like that risks masking other problems and introducing bugs in its own right. Christoph suggested that the warning issue could only really be resolved when we start shipping GCC with the kernel source. The sparse tool was discussed for a bit; the warnings generated by sparse are seen as being more useful much of the time. But, as Linus noted, sparse has its own set of bogus diagnostics and is not a perfect solution either. Heading back toward the original topic, the developers talked about the maintenance of ancient system call compatibility interfaces. Linus talked about how nice it is to know that we can still run binaries from 1991; we should be proud of that fact. The associated cost is, once again, quite small. Matt Mackall then said that, if we are continuing to maintain those interfaces forever, there is little point in discussing the removal of other interfaces. The end result from this discussion would appear to be that there will be no change. Compatibility with old hardware and interfaces remains a priority for the kernel, especially as long as the cost of retaining that compatibility is small. KS2008: Minisummit reports There is an increasing trend toward the use of "minisummits" for the detailed discussion of issues specific to a kernel subsystem. Kernel summits typically include a slot where the results from these sessions can be reported back to the group as a whole. The results from three such events were discussed at the 2008 kernel summit. Power management Len Brown went over the power management summit held last July in Ottawa; some notes from that gathering were posted here in August. 
The talk started with a quick recap of recent events in the power management area; these include the mainstream adoption of the tickless kernel, the establishment of lesswatts.org, and the creation of acpica.org, which, among other things, contains public bugzilla and git servers for the ACPI reference implementation. The number of unresolved bugs in the ACPI subsystem is dropping; it was 222 in 2004, but is 59 now (though one should also count the 45 bugs which have been pushed out to a separate suspend/resume category). New bugs continue to come in at a steady rate, but the ACPI developers have been working at addressing them and simultaneously taking care of the backlog. Most of the bugs, Len notes, are problems which have always been present in the code; very few of them are regressions being added by current work. Andi Kleen (who was the ACPI maintainer while Len took a short sabbatical) made the claim that ACPI is the only kernel subsystem which knows how many bugs it has. There was some talk of the TI OMAP/ARM architecture which, after some effort, is now running entirely on current kernel releases. The flow of patches back upstream is still small, though, and is in need of improvement. Even suspend and resume work for this architecture, but they are too slow for current needs. USB autosuspend was mentioned briefly. It works, except for the systems which it breaks completely. As a result, it is currently disabled by default. That, says Len, is an unfortunate situation; disabled-by-default code, in many cases, might as well have never been written. Wireless networking John Linville summarized the wireless networking summit, also held in Ottawa. One topic of interest is the cfg80211 API, a wireless configuration interface which is intended to replace the much-maligned wireless extensions interface. One idea being considered is to use DBus to carry cfg80211 messages, which currently travel via netlink.
That change would require putting a DBus implementation into the kernel itself, which, says John, might not be quite as crazy as it sounds. The wireless regulatory framework was covered briefly. Power management is an issue for wireless networking as well. The wireless protocols allow a device to announce its intention to go to sleep for a while; the access point will then buffer packets until the interface wakes up again. Linux needs support for this feature, as well as for some more basic things. The mac80211 layer, for example, still lacks support for suspend and resume. Vendor support is getting better, especially with Atheros hiring a community developer and beginning to contribute to the ath9k driver (though not, yet, to the older ath5k driver). Broadcom, on the other hand, remains as uncooperative as ever. There will be another wireless networking summit held in the next year, almost certainly in Europe. Containers The final topic of the session was containers. Several developers in this area got together to talk about outstanding issues. Namespaces in general were dealt with quickly; there are no real changes planned in this area. On the other hand, the group decided to shift the checkpoint/restart functionality over to a "one big syscall" approach; that work has since been covered in this article. The checkpoint developers are still working on getting a simple case - checkpointing a single process with no outstanding signals or other complicated situations - working before dealing with the more complex issues. The biggest area of interest would appear to be in resource management - the general task of keeping containers within a set of resource usage boundaries. The biggest problems in this area appear to be in the control group interface. The current interface does not offer any sort of transactional semantics, and it is hard for user-space administrative processes to learn about resource-oriented events. 
Some of these problems may be addressed via a new FIFO attached to each control group. There is a lot of work going into I/O bandwidth controllers. Too much work, in fact; there are four independent implementations circulating and they do not appear to be converging. Some sort of consensus on the way forward will need to be reached, but it is not yet clear what that consensus will be. Other work in progress includes a swap controller (which will be merged with the memory controller), the beginning of a network traffic controller, and some early effort toward a user-space library for working with control groups. A member of the group asked whether the memory controller still requires the addition of a pointer to the page structure. The kernel keeps a page structure for every page of memory; there are a lot of these structures, so struct page may be the most ruthlessly compressed data structure in the system. Adding a new pointer is not a price that the developers will willingly pay. Balbir Singh replied that this pointer is still there for now, but there is a patch which removes it. The problem is that this patch comes with a 4% performance loss; work toward lessening that impact continues. KS2008: When should drivers be merged? The rough consensus in the kernel development community over the last couple of years has held that device drivers should be merged into the mainline as soon as possible. Even if these drivers have significant problems, it is better to get them into the mainline where they are more likely to be fixed. This approach was reconsidered at a Kernel Summit 2008 session, but the group left the policy essentially unchanged. There are two fundamental lines of thought on this subject. James Bottomley started off the session with his feeling that the time before merging presents the best opportunity to get driver authors to improve their work. The possibility of merging the code provides an incentive which vanishes once the code goes in.
So James likes to hold code submissions out of the mainline until the worst problems have been addressed. On the other hand, Arjan van de Ven doesn't like the idea of "holding code hostage" in this way. From this point of view, about the only reason to hold drivers out of the mainline is obvious security or user-space API problems. In the absence of those, getting the code merged into the mainline, where it will be more accessible for others to fix, is the best way to improve bad drivers. Linus is clearly in the second camp. Drivers which are out of the mainline tree, he says, simply do not get better. People just do not spend much time looking at out-of-tree code. Additionally, not accepting drivers from vendors may put us into a position of having no real traction with those vendors; each of their subsequent drivers will have the same problems. By getting those drivers into the tree and fixing them, we may be able to push them toward producing better code. Otherwise, says Linus, we may be "shooting ourselves in the foot." On the other hand, Greg Kroah-Hartman reported some strong successes with his linux-staging tree. That tree currently hosts some 15 drivers, most of which are steadily improving over time. Being in linux-staging is apparently enough to draw some attention to a driver, and that helps to get it into better shape. Much of the discussion was devoted to an attempt to set a line dividing drivers which can be merged from those which cannot. There was not a whole lot of success, though. It really appears to be a case-by-case sort of problem. For example, what about one vendor driver which reads a configuration file directly from /etc? Such behavior is normally frowned upon. But, if the driver is already out there and being used, putting it into the mainline will not make things worse - we already have the problem. So, especially when the driver is already in widespread use, we might as well just merge it. 
Some ways of mitigating problems with drivers were discussed. Some of the worst behaviors could be configured out, allowing the merging of a barely functional driver which can then be improved in place. Really nasty drivers can set a taint bit in the kernel as a warning to developers trying to track down bugs on the affected systems. Another idea involves outfitting badly-written drivers with strong warnings to keep other developers from copying the code found therein. It was suggested that the distributors could ship drivers from the linux-staging tree, perhaps with the taint feature added. The answer to that was that, if the drivers are being shipped by distributors, they might as well be in the mainline. Linus stated that anything the distributors ship should really be merged as well. There are practical difficulties, though; Fedora ships the Nouveau driver, which still has not committed to a stable user-space API. Until that API stabilizes, Nouveau cannot be merged into the mainline, but there is still value in getting the driver tested by Fedora users. There were a few conclusions from the discussion. The taint flag for substandard drivers will probably be added. There might be a drivers/staging directory for such drivers as well. Greg will take responsibility for getting some of those linux-staging drivers into the mainline; he has, it was suggested, just become the official crap maintainer. Firefox 3 EULA raises a ruckus End User License Agreements—or EULAs—are a mainstay of the proprietary software world that tend to rub free software advocates the wrong way. When a EULA is presented in a click-through window as part of the initial execution of a program, it can really raise some ire as Mozilla is finding out. Its plan to present a click-through license for Firefox 3 on Linux has not met with widespread approval; quite the reverse in fact. 
The issue has been kicking around since at least last May, when Fedora folks noticed that Firefox 3 builds moved the EULA popup window from the installer—which Linux folks rarely see—to the first time Firefox is run. More recently the issue erupted in the Ubuntu community when a user filed a bug that reads, in part: STARTING UP A CERTAIN 3.0.2 VERSION OF FIREFOX BROWSER MAKES AVAILABLE TO YOU A VERY CAPITAL END USER LICENSE AGREEMENT. THIS AGREEMENT IS OBNOXIOUS and largely irrelevant to Ubuntu users. The predictable outcry followed, mostly because people who are used to free software have a visceral reaction to seeing a click-through EULA. For that reason alone it is a poor choice by Mozilla, at least on Linux. Windows users, who make up a substantial portion of the Firefox userbase, are generally unfazed by EULAs as they are confronted by them regularly—generally blithely clicking through with little or no hesitation. There are a number of objections to the Mozilla EULA, starting with the current text of the license. Mozilla Corporation chairperson Mitchell Baker agreed with the critics of the license text, saying "the most important thing here is to acknowledge that yes, the content of the license agreement is wrong." New license text is now available in draft form, but it still doesn't address an underlying issue: do we need to consult a lawyer when we install or run free software? One of the guiding principles of free software is that it doesn't limit what "end users" can do with the software, it only limits those who wish to distribute it. When a page or two of legalese—undoubtedly toned down from what the lawyers would really like—is presented to a new user, what exactly are they supposed to do with it? Users have rights under free software licenses, and it is important that they can find out about them, but it is fairly rare for a program, or even a distribution, to require a user to click through a copy of the license. 
Mozilla's position is that they need to protect their trademarks as well as inform users about the web services used to try to detect phishing and malware sites. In answer to those who think a click-through EULA is unnecessary - often using Linux distributions as a counterexample - Baker points out:

    It's hard to tell what's "necessary." It's an unsettled area and may vary across different locales. We've traditionally been more conservative on this point than many Linux distros.

So far, Mozilla does not seem willing to budge from its requirement to show the EULA as a click-through agreement. Fedora was able to get a waiver of sorts for Fedora 9 which allowed shipping Firefox 3 without the EULA while the projects worked out language they both could live with. In Fedora 9, Firefox opens to a page that describes the web services when it is run for the first time. Some kind of compromise along these lines for Linux distributions would seem to satisfy most of the concerns for both sides, but other than for Fedora 9, that solution has not been blessed by Mozilla.

Fedora Engineering Manager Tom "spot" Callaway has an excellent overview of the history as well as a nice analysis of the EULA. He notes that almost all of the terms in the EULA are either covered by applicable laws or by the Mozilla Public License (MPL). None of that really matters, though, as distributions really only have two choices, as outlined by Ubuntu leader Mark Shuttleworth:

    Mozilla Corp asked that this be added in order for us to continue to call the browser Firefox. Since Firefox is their trademark, which we intend to respect, we have the choice of working with Mozilla to meet their requirements, or switching to an unbranded browser.

That is the risk that Mozilla takes; if it is too heavy-handed in what it requires to call a browser "Firefox", distributions will take the code without the trademarks and call it "Iceweasel" as Debian has or "abrowser" which is the Ubuntu equivalent.
The Iceweasel "fork" was made because Mozilla objected to Debian backporting security fixes into older browsers without its consent, while abrowser has come about because of the EULA issue. Given that Linux users were some of the earliest and most enthusiastic adopters of Firefox, it is truly unfortunate that many may have to run it under other names.

There is an issue that may be getting lost in the shuffle here as well. Fedora board member Jef Spaleta has expressed concerns about how to notify users about web services:

    "We" as in everybody doing open source software has absolutely no fraking idea as to how to appropriately notify users about the services agreements associated with on-by-default web services. "We" collectively aren't giving it a lot of thought. "We" have this amorphous concept about the online desktop experience which is going to deeply integrate web services and enhance the day-to-day desktop user experience. But that enhancement comes at a cost..and that cost is the complication associated with "terms of service" for a vast array of different web service vendors.

Web services clearly bring along a number of additional concerns. There are privacy issues to consider. In many places, particularly Europe, there are fairly stringent requirements regarding data collection and retention that are required to be communicated to users. How that will be done for free software that uses these services is an open question. As Spaleta points out, Mozilla may be the only free software organization that is even looking at the problem.

The EULA mess is a situation that certainly could have been handled better by Mozilla. One hopes that some kind of compromise can be worked out so that users aren't poked in the eye with legal documents - that aren't even valid in many jurisdictions - and distributions don't feel like they need to fork to preserve their freedoms. Mozilla definitely has some legitimate interests to protect, but it needs to find a saner way to do that.
There is hope that this is happening, as Baker has described in an update on her blog:

    We've come to understand that anything EULA-like is disturbing, even if the content is FLOSS based. So we're eliminating that. We still feel that something about the web services integrated into the browser is needed; these services can be turned off and not interrupt the flow of using the browser. We also want to tell people about the FLOSS license - as a notice, not as as EULA or use restriction. Again, this won't block the flow or provide the unwelcoming feeling that one comment to my previous post described so eloquently.

More details are imminent, but it looks like this could all resolve amicably.

The 2008 Linux Kernel Summit

The 2008 Linux Kernel Summit was held September 15 and 16 in Portland, Oregon, immediately prior to the Linux Plumbers Conference. At this invitation-only meeting, some 80 developers discussed a number of issues relevant to the kernel and its future development. The following reports were written by Jonathan Corbet, who attended the event and was a member of its program committee. This reporting was sponsored by LWN's subscribers; if you appreciate this kind of content, please consider subscribing to LWN and helping us create more of it.

Day 1

The sessions held on the first day were:

Linux 3.0: should the developers do a Linux 3.0 release with a focus on dumping older, unneeded code?
Minisummit reports: reports from gatherings of power management, wireless networking, and containers developers.
When should drivers be merged? A wide-ranging discussion on the trade-offs between getting drivers into the kernel quickly and waiting until they are up to kernel coding standards.
Filesystem and block layer interaction; what contemporary file systems need to be able to get the most out of storage devices.
Cross-subsystem issues; how do we evolve subsystems which are heavily used by several other parts of the kernel?
Tools, and the new Patchwork tool in particular.
Bootstrap code. Why does every distributor throw together its own initrd/initramfs code, and can that situation be improved?
Kernel quality and release process, various discussions on how to produce better kernels and a near-decision to move to a one-week merge window.

Day 2

Tracing. A lengthy discussion on user requirements for kernel tracing and how those requirements might eventually be met.
Documentation. We always want more and better documentation, but what documentation would be most useful to the development community?
There was a brief bug-fixing session aimed at the top entries on KernelOops.org. Over the course of half an hour, the developers were able to fix 13 of the top 14 bugs. It was widely agreed that this was a productive use of time which will probably be repeated at future events.
More minisummit reports covering virtualization, networking, and kernel bloat.
All about threads; kernel thread pools and threaded interrupt handlers in particular.
Projects with large user-space components; how can we make it easier for the direct rendering infrastructure project to work with the mainline kernel?
Rafael Wysocki led a session on the new suspend/resume infrastructure. Most of that talk was concerned with the API, which was covered here back in March, so it will not be written up again now. Some changes will likely be made; stay tuned to LWN for the details. Linus did ask the crowd how many people were still unable to suspend their laptops. The number of hands raised was quite small; things have clearly gotten better in this area.
Fixing the Kernel Janitors Project. How can we do a better job of bringing new developers into the kernel community?

The closing party (which was also the Linux Plumbers Conference opening party) was the venue chosen for the annual election of members to the Linux Foundation's Technical Advisory Board.
The move out of the regular kernel summit sessions was intended to allow a wider group of people to participate in the election. It would appear to have been successful in that regard; there were record numbers of both candidates and voters. The board members elected this time around were James Bottomley, Kristen Carlson Accardi, Chris Mason, Dave Jones, Chris Wright, and Christoph Hellwig. Christoph was elected to a one-year term; all of the others will serve two-year terms.

Next year's kernel summit is currently scheduled for October 18 to 20 in Tokyo, Japan.

KS2008: Filesystem and block layer interaction

Much is happening with Linux filesystems currently; this is a situation which is likely to persist for some time. As filesystems develop, it is becoming clear that there need to be some changes in the interactions between the filesystem and block I/O layers. This kernel summit session discussed some of the places where changes are needed, but did not get much into their implementation.

Chris Mason is the lead developer of the up-and-coming btrfs filesystem. One of the items on Chris's shopping list is a way for filesystems to obtain a better understanding of the topology and nature of the storage system underneath them. He would like, for example, to be able to determine whether a filesystem is sitting on a solid-state device or on a traditional rotating disk. Certain decisions will be made very differently depending on the nature of the underlying device; filesystems stored on solid-state drives, for example, can be laid out without being concerned about seek times. The topology of the device also matters. Especially when multipath storage systems are in use, the filesystem would like to be able to understand what the various paths are, and to be able to partition the storage into truly independent failure domains. With this information, filesystems can find the optimal ways to perform I/O to the underlying devices.

Information needs to flow the other way as well.
Upcoming filesystems will perform extensive checksumming on data, so they will be able to inform the storage layer when a block has gone bad. For mirrored devices, that will enable the storage driver to recover the block from an uncorrupted mirror - if the filesystem is able to tell it which mirror went bad. Chris asked for information on storage latency - how long operations can be expected to last - and the optimal I/O sizes and alignments. The motivation behind this request is to optimize I/O to solid-state devices. Here Linus jumped in and suggested that the filesystem developers should "take a deep breath and wait a year." Solid-state devices will change a lot over that time, and many of the problems which exist now will be gone by then. So filesystems designed for today's solid-state drives will contain a lot of useless code by the time those drives are truly widespread. It is better, Linus says, to just treat them as a fast, random-access disk and not worry about the details. Another request was for filesystems to be able to allocate their own bio structures, rather than using the block layer's allocation functions. That would allow the filesystems to store their own private data with the bio without the need to tack on a chain of separate structures via the bi_private pointer. There's also a general need to rework the address space operations to facilitate better layout and more rational locking. The kswapd process is a bit of a problem for contemporary filesystems. Kswapd is charged with freeing up pages for the memory allocator; it needs to be able to get its job done at times when system memory is very tight. Currently kswapd will attempt to write out dirty pages so that they can be freed. The problem is that this writeout can require more memory to carry out; as filesystems become more complex, the amount of extra memory needed seems to be growing. That can lead to deadlocks if that extra memory is not available. 
So the filesystem developers would like kswapd to concern itself exclusively with clean pages, which can be freed without performing I/O. One answer that came back was that the writepage() VFS callback can be treated as advisory. That is what btrfs does now; if a writepage() call comes in the context of a process with the PF_MEMALLOC bit set (meaning that the system is trying to free memory), the call will simply fail. That is all legal, but it can hurt performance.

In the end, kswapd does writeout because, historically, it was possible for a Linux system to end up with all of its pages being dirty. In that kind of situation, writeout is the only way to make memory available again. But current kernels are able to keep close tabs on how much of memory is dirty at any given time, and they can avoid getting into that kind of situation. So writeout in kswapd is no longer necessary; it can, instead, be handled in contexts where memory is not in critically short supply. This change seems likely to be made in the near future.

The final topic, discussed briefly, was I/O barriers. The filesystem developers would really like it if the more complex storage layers - such as the software RAID and device mapper code - would implement write barriers. That is a hard thing to do with the current concept of barriers, though; the performance costs will be high. James Bottomley noted that a better job could be done with a more complex barrier API. But it is not clear whether the benefits that would come would be worth the extra cost.

KS2008: Cross-subsystem interactions

Jean Delvare has an interesting job: he is the maintainer of the i2c subsystem. Most users don't think much about i2c devices, but kernel developers involved with a number of subsystems are well aware of them. For example, a webcam driver is often three drivers internally: one for the camera controller, one for the camera itself, and one for the i2c connection which allows the host system to configure the camera.
There are i2c buses lurking within a number of other devices, so the i2c layer pops up in a number of places. Thus, changes to the i2c layer affect quite a few developers in other parts of the kernel. Jean talked about some of the lessons he has learned over the years. It is necessary to cleanly separate the subsystems. In each case, developers need to know the subsystems their code works with and the associated maintainers; that knowledge will make it much easier to get changes accepted. All subsystems should be treated equally, and subsystem maintainers should be warned about changes as early as possible. It is best to work with those maintainers to chart out a path which makes the changes as easy as possible for everybody to deal with. So where do the problems come in? The universal answer was API changes, especially when there are disagreements over how an API should look. Linus pointed out that disagreements often come about when the maintainership boundaries are unclear. There is, for example, a lot of architecture-specific code which is essentially part of the PCI layer, leading to conflicts between architecture and PCI maintainers. Sometimes the only real solution is to refactor the code - not a simple or quick process. Nick Piggin talked about problems getting maintainers to accept new APIs. Part of his difficulty seems to stem from his idea that, when an API needs to change, the maintainers of affected subsystems should do the work of adjusting to the change. Strangely enough, people tend to react poorly when others create more work for them. Still, Nick would like subsystem maintainers to help out when an API change needs to happen. A developer who fixes a buggy API should be rewarded; forcing that developer to fix all users of the affected API seems, instead, like punishment. Linus disagreed with that reasoning, though. 
He noted that there is a lot of code in the kernel which is effectively unmaintained; there is nobody else who can take on the work of fixing it when an API changes. And, he says, some developers are far too eager to make API changes. Anything which causes them to hesitate, and to maybe think about how to minimize the pain caused by the change, is a good thing.

There were no real conclusions from this session; having said their piece, the developers moved on to the next topic.

KS2008: Development tools

Paul Mackerras was the leader of a kernel summit session dedicated to development tools. In the end, though, only one tool was discussed: the Patchwork system used by the PowerPC development community. Patchwork is a patch management system; its job is to ensure that posted patches are properly tracked, reviewed, and disposed of.

The Patchwork system can be configured to watch a mailing list; whenever a message containing a patch is posted, it is added to the database. Any followup discussion is also captured and stored with the patch. Maintainers can go into the system, review patches, delegate them to other maintainers, and mark them for their final destination. Patches which are set to be merged into a subsystem tree can be grouped into bundles; the maintainer can then extract them as a mailbox file suitable for feeding to the git-am tool. A nice feature of Patchwork is that it can recognize messages containing Acked-by lines and automatically note the acks in the original patch.

Patchwork was generally recognized as a useful tool; the developers began discussing whether it should be used for the kernel as a whole. It was noted that all maintainers need to commit to using it, or it will quickly clog up with patches that nobody is paying attention to. Nobody has any illusions that all kernel developers can be convinced to start working with this new tool; Andrew Morton stated that he was probably too stuck in his ways to make use of it.
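As an illustration of the Acked-by recognition mentioned above, here is a minimal sketch of how a tool might pull acks out of a followup message. This is not Patchwork's actual code; the function name collect_acks and the sample reply are hypothetical, and the only assumption is that an ack appears as an unquoted line beginning with "Acked-by:".

```python
import re

# Matches an unquoted trailer line such as
# "Acked-by: Jane Developer <jane@example.org>".
# Lines quoted with ">" are deliberately not matched, so an ack that is
# merely being quoted in a reply is not counted again.
ACK_RE = re.compile(
    r"^[ \t]*Acked-by:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def collect_acks(message_body):
    """Return the names on Acked-by lines, in order, without duplicates.

    Hypothetical helper for illustration only.
    """
    seen = []
    for name in ACK_RE.findall(message_body):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # A made-up followup message.
    reply = (
        "> some quoted patch text\n"
        "Looks fine to me.\n"
        "\n"
        "Acked-by: Jane Developer <jane@example.org>\n"
    )
    print(collect_acks(reply))
```

A real tracker would also have to associate each followup with the right patch, presumably through the mail threading headers; that part is left out here.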
Some alternatives - such as having patches automatically age out of the system - were discussed. But it was generally agreed that trying to deal with the full linux-kernel mailing list would probably be too big of a step at this time. So a more likely outcome is that one or more subsystems will start experimenting with Patchwork, perhaps running it on one of the kernel.org systems. The SCSI or ext4 subsystems may be the early adopters here. If that trial works out, expanding the use of Patchwork may be considered.

KS2008: Bootstrap code

Initramfs is a useful tool; it allows a filesystem (in cpio format) to be tacked on to the end of the kernel executable image. When the kernel boots, it unpacks the filesystem into RAM and mounts it as the initial root filesystem. Therein will be found enough bootstrap code to get the system properly initialized and running from the real root filesystem. It is possible to boot a system without an initramfs, but essentially all distributors make use of this facility.

Dave Jones, the Fedora kernel maintainer, made the claim that the initramfs code is one of the most boring parts of any distribution. Even so, all distributors still roll their own initramfs code. It is a pain, and it doesn't make any sense. So Dave looked into what's going on in this code to see if the situation could be made any better.

The Red Hat initramfs image, used in Fedora, is the product of many years' worth of heritage and workarounds. Whenever the developers have run into an early bootstrap problem, they have thrown another hack into the initramfs code to make things work again. This code is ugly, but nobody wants to switch to anybody else's version. They fear that a different initramfs will lack all those hard-earned workarounds, and, besides, everybody feels that their particular solution is the best.

So what does the initramfs code do? Its job is to load any necessary storage drivers, then wait for the storage devices to settle.
The swap system needs to be enabled. If the swap partition contains a hibernation signature, a resume from disk operation is begun. Otherwise the initramfs code must find the root filesystem (an operation which may require setting up the device mapper or getting networking going), mount it, then switch over to the real operating system. Red Hat's version has to support a wide variety of root filesystems, and contains a lot of crufty code.

The situation is pretty much the same with the other distributors. Where things differ, it often has to do with differing kernel configurations, and, in particular, differences of opinion over whether specific code should be built into the kernel or built as a module.

Differences between initramfs setups can create some annoying problems. Sometimes these differences are enough to cause some kernel configurations to fail on one distribution. It would make life easier for everybody if a more uniform set of tools were used for early system initialization. This code could be part of the kernel tree, and it could change, when needed, in response to kernel changes. In the end, things would just work.

There are a few details that would have to be dealt with. Some distributions use the in-kernel hibernation (suspend-to-disk) code, while others are using TuxOnIce. It seems like maybe it's time for everybody to standardize on one hibernation solution. While most distributions have long since switched over to the parallel ATA drivers, some are still using the older IDE subsystem. Not everybody supports root filesystems on iSCSI devices. And so on. But these are problems which should be amenable to a solution.

Dave is going to start by adding a "make mkinitrd" option to the kernel build system; it will create a version of the Fedora mkinitrd for now. Others will be encouraged to join in and help make it work for everybody. Beyond that, Dave suggested that the developers could start to build a set of reference boot scripts in the kernel.
Once again, this is an area where distributors tend to roll their own code; they could benefit from bits of code showing the best way to initialize parts of the system. Al Viro pointed out that there will be problems coming from the fact that different distributors use different shells in their early boot code. That led to an extended discussion of the evils of nash and the celebrations which will ensue upon its eagerly-awaited demise.

There was some brief discussion of klibc - a small version of the C library intended for use in initramfs code. That project has been stalled for some time due to lack of interest; it could probably be restarted without too much trouble. The problem is that, despite all their wishes, distributions often end up having to use glibc in their initramfs filesystems. The biggest driver here appears to be internationalization, which is not properly handled by the various stripped-down libc implementations out there.

Getting back to the concept of a uniform set of initramfs tools, Linus suggested that the process could start with some baby steps. The kernel could include some bits of code which are automatically added into whatever initramfs image the distributor provides. There are challenges to making that work too, of course. The best way, perhaps, is just to dump everybody's initramfs and start over with a new, clean version. That project may get underway before too long.

KS2008: Kernel quality and release process

The first day of the 2008 kernel summit concluded with two sessions dedicated to the quality of our kernels and the process used to produce them. Arjan van de Ven started off talking about the data acquired by the Kerneloops project. In a short period of time, Arjan has accumulated information from tens of thousands of kernel crashes and warnings. From that data, he is able to draw some conclusions about how the kernel fails and how well the developers are doing at fixing problems.
Initially, Kerneloops worked by grabbing oops reports from the kernel mailing lists. Since then, a number of distributors have added facilities to find oops tracebacks in the kernel logs and ship them off to the project (after obtaining confirmation from the user, of course). This tool is now the source of the vast majority (99%) of the oops reports in the system. One of the things Arjan noted is that many of the biggest problems encountered by users are never reported on the kernel mailing lists; the problem reports one sees there are not indicative of what users are actually running into.

At any given time, the top ten bugs account for a full 60% of the reports; the top 25 make up 70%. So, while there still appear to be many ways to make a kernel crash, most user problems are caused by a very small number of bugs. Fix those problems, and most users will see their troubles go away.

At the other end of the scale, almost half of the bugs are represented by a single report. While some of those reports will be the result of obscure timing-related issues, most of them are more likely to be the result of hardware problems. So a lot of the reported problems do not really require any action from the developers.

A number of reported bugs result from the utrace code. Utrace is an out-of-tree tracing enhancement shipped by Fedora; it seems that, perhaps, this code still isn't quite ready for prime time. There are also quite a few which are attributable to binary-only modules.

Linus asked how many developers get the occasional oops reports mailed out by the project; maybe ten people raised their hands. Linus would like to see that report mailed to a lot more people, and the regression reports too. If this information got to more developers, perhaps more bugs would get fixed.

Regressions

That was a natural point to move into a discussion of regressions led by Rafael Wysocki.
Rafael put up a number of plots of regression counts and associated fixes; by fitting a logarithmic function to regression reports and a line to fixes, he was able to extrapolate the point where the two curves intersect and, in theory, all regressions are fixed. It turns out that recent kernels have been released 1-3 weeks before this point is reached. According to his data, Rafael suggests that the optimal time to release 2.6.27 would be in about three weeks. One problem raised by Rafael was that fixes for regressions take far too long to get into the mainline. Some subsystem maintainers like to let regression fixes sit in the linux-next tree for a while. It was pointed out, though, that presence in linux-next did not help find the original regression, so there is unlikely to be any value in letting fixes age there; they should, instead, go straight into the mainline. Rafael also noted that some regressions attract no debugging effort at all; it seems that nobody is interested in working on them. It can be disheartening for users to hear nothing about a reported regression at all; somebody should at least tell them why the problem is not being worked on. He also noted that regressions which have been bisected (to identify the change which first caused the problem to happen) tend to get fixed much more quickly. The data from the bisection is undoubtedly useful, but the real benefit probably comes from fingering the guilty party, who then feels the need to get a fix in place. Another thing Rafael pointed out is that we have a small core of dedicated testers; most of our regressions are reported by a small, recurring group of people. Perhaps we could recruit some of those people to help with the management of bugs. They could track reports, get more information from users, and harass maintainers to get fixes in place. 
These people have already shown a certain amount of dedication; giving them this kind of role would let them expand the help they are able to give to the kernel community.

There was also some talk of trying to track the amount of test coverage the kernel is receiving. There could be some sort of mechanism set up, perhaps tied into Fedora's "smolt" system, to report successful boots of the kernel on specific hardware. There are obvious privacy issues which would have to be addressed, and the whole thing would take a certain amount of work. It is not clear that anybody feels this idea is important enough to put the requisite amount of time into.

Release process

Matt Mackall asked a question: what would happen if we were to cut the merge window down to one week - merging less code - and shorten the development cycle to match? With some discipline, maybe we could produce a stable kernel release every six weeks. Linus responded that he would love to see this happen. His main motivation was to reduce the size of the -rc1 releases, which have gotten quite big in recent development cycles. A smaller -rc1 would be easier to debug and should, hopefully, stabilize more quickly.

Quite a bit of time went into discussing this idea. The shorter merge window was clearly worrisome to some developers who feel that the two-week window is already painfully short. Merging of trees with dependencies on other trees would get harder. It would also be harder to get good testing coverage, since there would be less time for testers to play with each release. Some code simply takes a long time to fix; it's not clear that this stabilization could be compressed into the shorter cycle. There would have to be some higher barriers to ensure that code which does get in through a particular merge window is truly ready.

Andrew Morton jumped in with a complaint about code that shows up in the mainline, but which has never made an appearance in linux-next or the -mm tree.
He acknowledged that this would always happen, but asserted that it should be an extraordinary event. The guilty subsystem maintainer, he says, should at least make excuses for doing this. Much of the problem, it was said, comes from vendors who show up with last-minute patches that they want to see merged. The answer was to tell them that it is too late, that the merge window is for subsystem maintainers, not for vendors.

Getting back to the shorter cycle, Linus pointed out that it would require a great deal of care from everybody involved, especially the first time around. It would require a development cycle which does not start with a lot of pending code - a problem, since there is always a big pile of patches waiting by the time the merge window opens. Al Viro suggested only merging a subset of subsystem trees in any development cycle, only accepting trivial patches from the rest. James Bottomley responded that, if his trees lost out in a given development cycle, his definition of "trivial" would surely change.

Another suggestion was to simply merge linux-next, but Linus did not like that. He goes out of his way to limit the amount of code he merges each day as a favor to the people who test the nightly repository snapshots. Pulling in all of linux-next would make that impossible.

Yet another option is to only pull trees for which the pull request is in place before the merge window opens. This idea seemed popular for a while.

Just about when it looked like a consensus for trying the idea was settling into place, Matthew Wilcox stated that he didn't like it. His work involves tracking down performance issues, a process which can take quite a bit of time. A shortened development cycle would not allow the time needed to get that work done. Andrew Morton said that he saw no real point in the change; it wasn't addressing any of our biggest problems, and we would lose economies of scale in testing large numbers of changes.
Dave Airlie said it would require testers to do twice as much work, dealing with -rc1 kernels twice as often. Ben Herrenschmidt worried that the tighter deadlines would make developers rush, leading to lower-quality code. And Dave Jones said that changing the cycle would make future kernel releases less predictable, making communications with vendors and customers harder.

These comments essentially ended the discussion of the shorter development cycle idea. In the end, concluded Linus, it was better not to mess with something which isn't completely broken. So nothing may have come of it, but it was an interesting exploration of how things could be done differently.

KS2008: Tracing

Tracing is a hot issue in the Linux community, mostly as the result of the actions of an allegedly friendly company: Sun Microsystems has been putting a lot of marketing energy into telling customers that DTrace makes Solaris a better system. The fact of the matter is that Linux does not lack tracing tools, but it does seem to lack tools which are usable to the wider community. The second day of the 2008 kernel summit started with a pair of sessions dedicated to determining where the gaps are and trying to figure out what to do about them.

James Bottomley started with a description of his experiences with SystemTap, the utility which is most often cited as our answer to DTrace. He had a lot of trouble getting it to work with his system. In his mind, the root cause for all this trouble is the simple fact that nobody from the development community is actually using SystemTap. A quick query of the room suggested that about half of the developers present had tried using SystemTap at one point or other; maybe 20% actually succeeded. So there is a roadblock of sorts here; SystemTap needs attention from kernel developers to progress, but those developers find it unsuited to their needs and difficult to use, so they tend to ignore it.
But kernel developers are not the targeted user base for a tool like SystemTap; it is aimed at end users and deployed systems. To help clarify what those users need, Vinod Kutty from the Chicago Mercantile Exchange took some time to talk about his needs for tracing tools. In general, these users need a higher level of visibility into running, production systems. They need to be able to track down slowdowns, look at the environment in which processes are running, and, in general, to be able to look in corners of the system which nobody will have anticipated in advance. All of this has to happen while the system is running in production; it is, he says, somewhat like needing to look under the hood of a car while driving at 100mph. Also useful is the ability to run tracing tools in a "flight recorder" mode, where an administrator can look at historical data after something goes wrong. And it is necessary to be able to look at user-space events as well as those from the kernel. Events generated in user space are often more meaningful to the people running the system. All of this is needed to be able to communicate with distributors about where the problems come up, so that the distributor can work toward a fix. Current tracing tools for Linux are insufficient. Linus asked: is tracing needed primarily to track down bugs or to find performance issues? It turns out that performance problems are the big issue. James asked what parameters were the most important; the answer mentioned individual process I/O events, user-space events, and the ability to map user events to kernel-level events. Moving on, Vinod also noted that low impact is important; the tracing tool cannot place a heavy load on the system. These tools need to start quickly. Current tools are far too big. There is also a need for good filtering; tracing tools can generate a lot of data. Administrators and developers need a way to boil down all that data to an amount they can deal with. 
Even better are tools which can spot problems in the trace stream and raise red flags when they happen. And, of course, tracing tools really cannot crash the system while they are running. SystemTap still falls a little short in this area; it's not hard to bring down a system while trying to trace it. Adding a DTrace-style virtual machine was discussed; in theory, a VM can make the tracing tool demonstrably safer. Vinod responded that it could be useful, but the proof of maturity is in watching the software run for a while. This is where Linus came in to proclaim that he hates every tracing tool he has seen. SystemTap is far too complicated; these tools need to be simpler. Adding a virtual machine to SystemTap would just make things more complicated; that's not the way to fix its problems. According to Linus, most of the problems solved by tracing come down to figuring out scheduling issues, and we have the tools to do that now. We should be making better use of the simple tools which are currently in the kernel before trying to put more complicated stuff in. We should, for example, make latencytop work better and push to get it into the enterprise distributions. This "use the tools we already have" suggestion came back many times during the session. Christoph Hellwig brought up another recurring theme: while dynamic tracing is nice, there is a lot of value to be had from well-placed static trace points which are managed by the maintainer of the code. Matthew Wilcox added that the user-space trace points (for DTrace) added to PostgreSQL have proved to be highly useful for database administrators; people running PostgreSQL now have a strong motivation to do so on Solaris systems. We would do well to match that functionality on Linux. A key component of SystemTap is the collection of "tapsets," scripts which allow a user to look into the kernel for specific information. 
These tapsets are a problem, though; they are tightly tied to a specific kernel, but the kernel is constantly being changed. So tapsets go stale quickly. Moving these tapsets into the kernel might help, but they will still be a separate body of code which is prone to breaking. Static trace points, which can be maintained directly with the code they monitor, are much more likely to continue to work in the long term. Martin Bligh noted that Google maintains a set of 20-30 static trace points for use with the LTTng trace tool. This very small set of trace points is sufficient to solve most problems that Google encounters. Martin will, hopefully, be posting those trace points for inclusion into the mainline, though Google's associated tools might not be available. Vinod finished this portion of the session by stating that he likes the tapset concept. It allows him (or somebody in his group) to write a script aimed at a specific situation, and others can make use of it immediately. There's no need to wait for the release of a more specialized tool. Trace toolkits Mathieu Desnoyers spent a few moments introducing the LTTng tracing package. LTTng is a static tracing tool, depending on markers placed in the kernel itself. It has been designed for high performance and simplicity; that, in turn, should help to make it safe to use on production systems. All LTTng trace points have to be in the kernel code itself and be maintained by the appropriate subsystem maintainers. The core kernel code includes a module for precise time stamping (needed to preserve the ordering of events which go through different per-CPU relay buffers), the relaying code, and a netlink-driven module to control tracing. There is a user-space library and, of course, a set of analysis tools. LTTng can support "flight recorder" mode which can initiate tracing when a specific trigger situation comes about. There is also a mechanism for putting markers into user space.
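The role of time stamps in ordering events from separate per-CPU buffers, and the "flight recorder" idea of keeping only recent history, can be illustrated with a small user-space sketch. This is a toy model in Python, not LTTng code: the FlightRecorder class, its methods, the event names, and the use of a simple counter as a stand-in for a precise clock are all assumptions made for illustration.

```python
import heapq
import itertools
from collections import deque


class FlightRecorder:
    """Toy model: one bounded ring buffer per CPU, merged by time stamp."""

    def __init__(self, ncpus, capacity):
        # deque(maxlen=...) drops the oldest entry when full, so each
        # buffer only ever holds the most recent `capacity` events.
        self.buffers = [deque(maxlen=capacity) for _ in range(ncpus)]
        # Hypothetical clock: a global counter standing in for a
        # precise hardware time stamp.
        self._clock = itertools.count()

    def trace(self, cpu, name, **fields):
        """Stand-in for a static trace point firing on the given CPU."""
        self.buffers[cpu].append((next(self._clock), cpu, name, fields))

    def dump(self):
        """Merge the per-CPU buffers into one stream ordered by time stamp."""
        # Each buffer is already sorted by time stamp, and time stamps are
        # unique, so the merge never compares anything past the first field.
        return list(heapq.merge(*self.buffers))


rec = FlightRecorder(ncpus=2, capacity=2)
rec.trace(0, "sched_switch", prev=1, next=2)
rec.trace(1, "irq_entry", irq=19)
rec.trace(0, "syscall_entry", nr=3)
rec.trace(0, "syscall_exit", nr=3)   # CPU 0 buffer is full: oldest event is dropped

for stamp, cpu, name, fields in rec.dump():
    print(stamp, cpu, name, fields)
```

Without the shared time stamp there would be no way to interleave the two buffers correctly after the fact; with it, the merge is a simple sorted merge of already-ordered streams.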
Frank Eigler spent some time talking about SystemTap; he used much of that time to defend the design decisions which had been made. When the SystemTap project started, the kernel had almost no tracing features at all, so they had to pick a path that worked. Since there is a lot of hostility to putting a virtual machine into the kernel, they had to go with code generation instead. They used kprobes because that was the mechanism that was available. And so on. In general, SystemTap has a lot of the same objectives as LTTng, plus, of course, the dynamic tracing feature. There are "some demos" showing working user-space tracing. James stated that there was a real need for the users of these tools - kernel developers, in this case - to provide input into how they work. Frank responded that the SystemTap team has been crying out for people to help. It's clear, though, that this particular user base is not sufficiently engaged in the development process. It was said that the real users of SystemTap are Red Hat consultants, who find that it works well with the standard RHEL kernel. But people trying to use SystemTap with a current mainline kernel have to download "a shaky weekly tarball" to try to make it work. Until SystemTap is easier to use with the mainline kernel, it will be a hard sell in the development community. The problem there, of course, is that keeping SystemTap current while it is out of the mainline tree is always going to be a struggle. Resolving that problem will require getting more of that code merged. It seems that the core SystemTap code is about 15,000 lines - small, according to Frank. This could maybe go in, but Linus is resistant, saying that we need to get the current, simple, in-kernel tracing tools into a usable state before we try to add more of them. Ted Ts'o remarked that there is a real difference with SystemTap: it is the only Linux-based tracing package which, like DTrace, allows users to run code at the trace points. 
Thus it is able to do more complicated triggering, filtering, and analysis. Thomas Gleixner responded that this is all good, but what is really needed is a simple trace package which does not require the installation of a whole set of new tools. He does tracing (using ftrace) on a number of platforms, including embedded systems, and he isn't willing to deal with the hassle involved in adding another complicated set of software. After that the conversation wandered into various, relatively obscure technical topics like the details of how buffering mechanisms should work, who should really be managing trace points, who manages instrumentation as a whole, and so on. But there was a general sense that the summit wasn't the venue for that kind of low-level detail, which isn't where the real problems are anyway. The tracing topic will be revisited at the Linux Plumbers Conference, so it was decided to defer much of the discussion of the details until then. The openSUSE Project's first board elections The openSUSE Project is about to hold its first board election. The process is well underway, with the first phase nearly over. All members of the openSUSE project may vote and can run for the board positions, but there is a fast-approaching deadline by which to register for the vote or to declare an intention to run. The last call for candidates, received a bit too late for last week's LWN issue, states that the application deadline is September 24th, 12:00 UTC. An election committee has been formed to oversee the elections. Four people, two from Novell and two from the community, will organize and oversee the election. Committee members Claes Backstrom, Andrew Wafaa, Marko Jung, and Vincent Untz have agreed not to run for this election so that they might remain impartial. The initial openSUSE board was appointed by Novell.
Pascal Bleser, a member of that board, has written a blog post about the openSUSE Board and the elections giving his view of what the board does and does not do. "One point that really must be clarified (again) is that the Board is not responsible for taking technical decisions. That's other people's job, e.g. AJ as the director of openSUSE and platform, Coolo as the openSUSE distribution project manager, or Michl as the openSUSE product manager." Pascal also has a followup post answering some additional questions about the time commitments and involvement expected of a board member. Andreas Jaeger, also a member of the current board, has also written about the board, how it's organized and what upcoming board members might expect. "I'm part of the first openSUSE board and in my opinion we're still bootstrapping it and forming it. Federico mentioned that it took the GNOME board several years until they were really functional - so this shaping of the board is not only in the openSUSE project an evolutionary process that takes time and is influenced by e.g. (constructive) criticism, praise, communication in general, and decisions." New board members will be able to shape the board from the inside. With a new board, community members can also help shape the board with questions, comments and letting their expectations be known. The board will consist of five members: a Novell-appointed chairperson, two Novell employees and two community members (not employed by Novell). So far there are three Novell candidates and five non-Novell candidates. The list of candidates with pointers to their platforms can be found here. We will soon be into the campaign period, which runs from September 25th to October 9th. During this time period there will be blog entries from the candidates, interviews by the openSUSE news team, and a moderated Q&A session on IRC.
There is also a feature in the openSUSE election in which each eligible voter may appoint a second openSUSE member to be eligible to vote. The option to appoint a second voter will be available during the campaign period and may allow a few people who missed the September 24th deadline to vote. The actual election begins as the campaign period ends. Each eligible voter will be able to cast their votes once. No changes will be allowed. Votes will be stored anonymously in the electronic system. Balloting will close on October 23rd, with the winners announced once the election committee has had a chance to verify and count the votes. If you care about the openSUSE project, this is a great time to get involved. Run for the board, vote in the election, and have a say in the shape of things to come. Audacity gets new functionality via Google Summer of Code Audacity is a popular and award-winning multi-track open-source and cross-platform audio editor project that is built on the wxWidgets GUI library. LWN looked at Audacity in 2006. The Audacity project announced its participation in the 2008 Google Summer of Code student code writing event on April 21, 2008. GSoC 2008 is wrapping up and the Audacity site notes the progress made this summer: Four students participating with Audacity in Google Summer of Code successfully completed their projects, and their code will be in future versions of Audacity. The four projects were: FFmpeg support, to greatly increase the range of file formats that can be imported and exported. New GUI classes for future use in displaying audio tracks. On-demand/level-of-detail file loading, for near-instant loading and editing of uncompressed files. Sticky labels that stay with the audio through cut and paste. The Audacity GSoC projects page details the goals and achievements made by the students; we'll examine the results here.
Руслан Ижбулатов worked on adding FFmpeg support to Audacity in order to allow importing and exporting of a wider variety of audio file types. From the FFmpeg site: "FFmpeg is a complete solution to record, convert and stream audio and video. It includes libavcodec, the leading audio/video codec library. FFmpeg is developed under Linux, but it can compiled under most operating systems, including Windows." Audacity natively supports the WAV, AIFF, MP3, Ogg Vorbis, and FLAC formats; the FFmpeg library supports those and adds support for GSM WAV, MP2, M4A (AAC), AMR, WMA, and many more formats. The Project Progress page has details on how to access this new functionality. The page also includes the full list of FFmpeg-supported formats. The FFmpeg library can be linked and loaded dynamically at run time; this allows it to be distributed as a separate package and removes any CODEC licensing issues from Audacity. Johannes Kulick added two new wxWidgets GUI classes and used those in Audacity to improve the display of audio tracks. His project abstract states: "Audacitys main user interface is the track panel. Its GUI architecture is written from scratch by the audacity team and as the team noticed the TrackPanel.cpp is a horrendous mess which is neither easy to maintain nor to extend. There are the wxWidgets classes wxGridSizer and wxFlexGridSizer which fit well in the requirements of the track panel. They arrange its content in a table. While in wxGridSizer all rows have the same height and all columns have the same width, in wxFlexGridSizer classes each row can have its own height and each column can have its own width. This is the way the Track panel is arranged, too, but there is one more thing which is important: the ability to drag and drop each track and drag the height of each track as well. And here is the big disadvantage of the wxWidgets classes: they lack the ability of being dragable.
If there were classes which have these ability this would be a big step to get a cleaner track panel architecture for Audacity. So the project idea is that I will implement two classes wxDragGridSizer and wxDragFlexGridSizer which have the ability to do exactly these things." The Project Progress tracks the steps that were done to achieve the end results and the additional report covers extra work that was done to extend support for the wxAUI (Advanced User Interface) toolbar and window docking library. Michael Chinen's project involved on-demand/level-of-detail file loading for near-instant loading and editing of uncompressed files. The Project Progress explains: "The QuickLoad project added near-instant loading of PCM uncompressed files without waiting for waveform calculation to complete. Playing and editing is now possible on demand at any point in the track while the waveform image is still being calculated in the background." The Description section further clarifies the new capability: "Previously, it might be necessary to wait several minutes for the file to load and be useable while the waveform computation was completed. The waveform image will draw itself automatically during computation, but users can move the point in the file from which computation takes place, thus allowing them to view and edit any point in the file instantly. " This project also allowed for further improvements to Audacity: "One of the reasons the Quickload project was approved was because the OD framework will provide a method in which other tasks, such as loading non-wav formats, processing effects, and exporting, can be made multithreaded. The current implementation of the OD framework is written generally so that this is possible, which means that future implementations of OD tasks will be done writing a minimum of code. Taking advantage of polymorphism, this kind of thing should get easier and easier as more tasks are made to support OD." 
Mark Deutsch worked on adding sticky labels that stay with the audio through cut and paste operations. The Project Progress explains: "Label Track Enhancements removed a long-standing limitation that Audacity's labels did not stick to the audio track and move and edit with them." Further: "The biggest single addition from this project was the concept of linking tracks. Two or more linked tracks form a group. When an action is performed in one track, the other tracks in the group mirror that action. For example, if a group consists of one audio track and one label track, deleting part of the audio track will also delete that part of the label track. This linking is done implicitly, and depends on the layout of the tracks. A group is defined as a set of contiguous audio tracks followed by a contiguous set of label tracks." The sticky labels addition also improves the way Audacity handles insertions and other operations: "This functionality doesn't only handle deletes, though. Inserting audio, whether through pasting or using the "Generate" functions also shifts the grouped tracks correspondingly. The "Change" functions (Change Speed/Tempo/Pitch) are also supported. Slowing down a track will insert silence into linked tracks to keep all the tracks sync'd. Similarly, speeding up a track inserts silence into that track to achieve the same result." Lars Luthman was unable to finish the fifth project, Support for the LV2 plugin architecture, but he did organize the problem space and produce some code that should be useful for future work. The Project Progress report shows what was accomplished, and the main Audacity projects document explains how it ended: "The project which did not pass still had plenty of good coding work and skill behind it, indeed believed to be fully working on the linux platform. It was communication, possibly to modify the goals shortly after mid term, that really let it down." 
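The track-linking rule quoted above for the sticky labels project (a group is a run of contiguous audio tracks followed by a run of contiguous label tracks, and an edit in one track is mirrored across its group) lends itself to a short illustration. What follows is a Python sketch of that rule as described in the quote, not Audacity's implementation; the Track class, the groups() and delete_range() functions, and the decision to drop labels that fall inside a deleted region are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    kind: str                       # "audio" or "label"
    length: float = 0.0             # seconds; meaningful for audio tracks
    labels: list = field(default_factory=list)  # (time, text) pairs; label tracks


def groups(tracks):
    """Split a track list into groups: audio tracks followed by label tracks."""
    result, current = [], []
    for track in tracks:
        # An audio track that follows a label track starts a new group.
        if track.kind == "audio" and current and current[-1].kind == "label":
            result.append(current)
            current = []
        current.append(track)
    if current:
        result.append(current)
    return result


def delete_range(group, start, end):
    """Delete the region [start, end) from every track in one group."""
    span = end - start
    for track in group:
        if track.kind == "audio":
            # Shorten the audio by however much of the region overlaps it.
            overlap = min(end, track.length) - min(start, track.length)
            track.length -= max(0.0, overlap)
        else:
            kept = []
            for time, text in track.labels:
                if time < start:
                    kept.append((time, text))          # before the cut: unchanged
                elif time >= end:
                    kept.append((time - span, text))   # after the cut: shifted left
                # Labels inside the cut are dropped (an assumption).
            track.labels = kept


tracks = [
    Track("audio", length=10.0),
    Track("label", labels=[(2.0, "intro"), (5.0, "noise"), (8.0, "outro")]),
    Track("audio", length=10.0),    # starts a second, separate group
]
first, second = groups(tracks)
delete_range(first, 4.0, 6.0)
print(tracks[0].length, tracks[1].labels, tracks[2].length)
```

Because grouping depends only on track order, the second audio track is untouched by the deletion; that matches the quote's statement that the linking is implicit and depends on the layout of the tracks.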
The 2008 GSoC projects added a number of useful new capabilities to Audacity. The wxWidgets project also benefited from the work with some enhancements that can be used by other projects. Once again, GSoC proves itself as a program that can focus in on areas of open-source applications that need improvements, and produce useful results in a short time span. GSoC is successful in bringing the guidance of experienced mentors together with the coding muscle of inspired students. User manuals for free software Documentation for free software is generally a problem area, both for users and developers. But developers at least have the code to consult, whereas most users are left poking around through menu items and consulting multiple web pages. The FLOSS Manuals project is using techniques similar to those used in free software development to produce manuals for users. The project seeks to create the kind of manuals that users may be used to from proprietary software packages. The project's About page describes the manuals being produced: FLOSS Manuals make free software more accessible by providing clear documentation that accurately explains their purpose and use. Each manual explains what the software does and what it doesn't do, what the interface looks like, how to install it, how to set the most basic configuration necessary, and how to use its main functions. To ensure the information remains useful and up to date the manuals are regularly developed to add more advanced uses, and to document changes and new versions of the software. There is a wide variety of manuals in progress, covering graphics and audio tools, OpenOffice, Firefox, WordPress for blogging, and more. The most recent addition is a set of eight manuals for the One Laptop Per Child XO. These were created as part of an XO/Sugar book sprint held in August in Austin, Texas. The manuals cover the XO hardware and Sugar interface as well as six different activities that are available as part of Sugar.
The use of a "sprint" is just part of the adoption of free software development strategies. The project is set up to allow for collaborative development by a community. FLOSS Manuals describes it this way: The manuals on FLOSS Manuals are written by a community of people, who do a variety of things to keep the manuals as up to date and accurate as possible. Anyone can contribute to a manual – to fix a spelling mistake, to add a more detailed explanation, to write a new chapter, or to start a whole new manual. The way in which FLOSS Manuals are written mirrors the way in which FLOSS (Free, libre open source) software itself is written: by a community who contribute to and maintain the content. The manuals themselves are available in a variety of formats: HTML, PDF, as well as dead tree. One of the more interesting features is the remix capability. Using an AJAX interface, one can pick and choose from the chapters of existing manuals to create a custom manual that includes only the pieces required for some group of users. Remixers can choose their own cover and title, then export it all as a PDF file. Alternatively, one can cut and paste some JavaScript code into a web page that creates a reader application on the page. In this way, the custom manual will always be up-to-date with the latest changes made to the chapters. FLOSS Manuals clearly fills a real need in the free software world. The manuals have a rather professional look that will immediately stand out to users. There is a lot of work to be done, but it would appear that the project has made an excellent start. As one might guess, it is always looking for more interested folks to write, edit, and proofread manuals. (Thanks to LWN reader David Farning for suggesting we look at this project.)
KS2008: Documentation Your editor got talked into kicking off the kernel summit discussion on documentation; if this coverage is sketchier than usual, it's because it's hard to try to lead a discussion and take notes at the same time. After some of the obligatory introductory notes on how documentation is always a problem, it was asked: how many kernel developers had actually gotten something useful from the in-tree documentation directory recently? Almost all attendees raised their hands. There is value, it seems, in the documentation which is available now. That said, there are also traps. An aspiring camera driver author would, upon exploring the documentation directory, stumble across a detailed file describing just how those drivers should be written. The author is Alan Cox, who might be considered to be a reasonably authoritative source. But this document describes the deprecated Video4Linux1 API; if our author wrote a new driver to that API, he or she would probably feel a little misled once the initial reviews came back. The value of that document in 2008 is probably negative. There are plenty of equally musty documents in the kernel documentation tree. The real problem is that documentation has no subsystem maintainer, nobody who will clean out the old stuff. The legendary lack of organization in that directory is also a result of a lack of overall maintenance. The question that was put to the developers was: what do you want from kernel documentation? Linus had a clear answer; what he wants is better release notes for each kernel version. It's not clear how to get there; maybe some sort of automated way of finding descriptions of new features in the git changelogs. What's even less clear is how this work could improve on the high-quality work done over at the kernelnewbies.org site. Matthew Wilcox asked for some quality control on documentation submissions. 
He noted, in particular, that the coding style document would appear to have drifted from its original intent over the years. One useful form of documentation that developers would like to see more of is test programs for new features. Test code for new system calls is especially useful; it describes how the system call should work, and allows architecture maintainers to verify that they have connected things up properly. There were questions on how much of the supplied kernel documentation is truly useful; maybe much of it should be removed? There are some obviously useful files, like those describing kernel boot and tuning parameters. The KernelDoc documents have their value; much of that documentation appears in the code itself, and the KernelDoc code checks to make sure that the documentation matches the associated function definitions. Much of the rest tends to be out of date and unused. One result of the discussion might be an effort to remove some of the oldest, most fictional documentation. Beyond that, though, it looks mostly like business as usual. OpenSSH and keystroke timings Theoretical security weaknesses have a tendency to move from the realm of theory to that of practice over time. Sometimes it is the result of more compute power being applied or better algorithms being developed, but a weakness is certainly not going to get stronger. So when Kevin Neff started discussing fixing a weakness in OpenSSH on the openbsd-misc mailing list, the folks writing it off as "theoretical" may have been jumping the gun. When it is in interactive mode—a user typing into a terminal session for example—ssh sends each key pressed by the user in a separate packet. By observing the timing between packets, an observer may be able to determine something about what was typed just by using traffic analysis, without attempting to break the encryption. 
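A small sketch may help show how little an eavesdropper needs for this kind of traffic analysis. The Python below works only on packet arrival times, never on packet contents. It is an illustration rather than a real attack tool: the keystroke_bursts() function, the one-second pause threshold, and the sample time stamps are all invented for the example, and it assumes one packet per keystroke as described above.

```python
def keystroke_bursts(timestamps, pause=1.0):
    """Group packet arrival times into bursts of typing.

    A gap of at least `pause` seconds between two packets is treated as
    the boundary between bursts (an arbitrary, assumed threshold).
    Returns one (keystroke_count, inter_packet_gaps) tuple per burst.
    """
    bursts, current = [], []
    for t in timestamps:
        if current and t - current[-1] >= pause:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return [
        (len(b), [round(later - earlier, 3) for earlier, later in zip(b, b[1:])])
        for b in bursts
    ]


# Simulated arrival times (seconds) of encrypted packets, one per keystroke.
packets = [0.00, 0.12, 0.25, 0.40, 3.00, 3.15, 3.45]

# Two bursts are found: four keystrokes, then three.
for count, gaps in keystroke_bursts(packets):
    print(count, "keystrokes; gaps between them:", gaps)
```

The burst lengths alone reveal how many characters were typed, and the gaps within each burst are the inter-keystroke timings that statistical techniques can then work on.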
Researchers found that the inter-packet timing correlated well with the inter-keystroke timing, so that using statistical techniques they were able to reduce the search space for cracking a password by a factor of 50. This weakness was outlined in a 2001 paper entitled "Timing analysis of keystrokes and timing attacks on SSH" [PDF] which looked specifically at the timing-based attack: In this paper we study users' keyboard dynamics and show that the timing information of keystrokes does leak information about the key sequences typed. Through more detailed analysis we show that the timing information leaks about 1 bit of information about the content per keystroke pair. Because the entropy of passwords is only 4-8 bits per character, this 1 bit per keystroke pair information can reveal significant information about the content typed. The paper looked at the now-deprecated SSH1 protocol, which led some to conclude that the weakness no longer applied. Damien Miller pointed out that it was likely to still be valid: There is no reason to believe that keystroke timing attacks will be impossible against protocol 2 where they work against protocol 1. They might just be a little more tricky. Pointing at the paper and discounting it because it is ssh1 only is sticking your head in the sand. It is usually easier to research attacks on simpler protocols and work up to more complicated ones later. There is a fair amount of information that can be gleaned just by looking at the traffic generated over an encrypted session, especially if the attacker can gather a sizable amount of it. There are fairly clear patterns in interactive sessions that can be extracted and used alongside the inter-keystroke timing information to potentially garner lots of useful information. Darrin Chandler describes it this way: The reason why I think it's a weakness is that you can gather statistics on typing and use those to infer things.
I.e., you can extract meaningful information from the encrypted session. If you're snooping on ssh and see a short burst of typing followed by another ssh session from the remote machine you can guess they typed 'ssh host.example.com' by the length of typing and the host connected to. Nice crib. Oh, after than connect was there another short burst? Probably the password. How many keystrokes can probably be inferred. Perhaps stats on interkey timing can be used to make some intelligent guesses, such as the 4th char is NOT punctuation because is followed char 3 too closely. Or whatever. Overall, the reception to making OpenSSH less susceptible to this kind of analysis was positive. It is clearly a difficult attack to mount, logistically if nothing else, but it is not impossible either. Better timing information or analysis techniques might make it easier over time as well and that is enough of a reason to look at ways to fix it. KS2008: more minisummit reports The kernel summit dedicated a slot on its second-day agenda to the presentation of more minisummit reports and lightning talks. First up was Chris Wright, who reported from the virtualization minisummit held last April in Austin. The developers at the minisummit learned a lot about the hardware roadmaps maintained by various vendors. There was talk of improving cooperation with Qemu. The possibility of VMWare open-sourcing its user-space tools was raised, though, it seems, there is no prospect of getting that company's drivers released. The problem with the drivers is not legal; it's just that they are so tightly integrated with the VMWare hypervisor that there is little point in putting them out there. Beyond that, there were a number of discussions on topics (like checkpoint/resume) which have since turned into code. And it was noted that the virtualization developers would like improved hugetlb support. 
In particular, they like the active defragmentation patches which make it more likely that huge page allocations will succeed. David Miller discussed the 2008 networking minisummit, otherwise known as a couple of developers wandering by his house to discuss ideas. He mainly talked about the multi-queue work, which has been covered on LWN separately. One interesting point to note is that, while multi-queue is useful for wireless networking, it is also an important high-end scalability improvement. When the system tries to drive a 10Gb/s network card at full speed, the locking contention on a single queue gets to be a significant performance problem. Matt Mackall presented his bloatwatch work, which monitors the text and data sizes of kernel releases. Unsurprisingly, the kernel is getting larger over time - a development which does not please embedded systems vendors. Bloatwatch allows interested people to see which code changes caused a given kernel release to grow. Matt would like to have more people using this tool and trying to keep a lid on kernel growth. It was suggested that bloatwatch could be run against linux-next and used to catch bloat before it gets into the mainline. Linus asked if growth could be correlated with the information in the git repository, making it possible to shame individual developers. Overall, it was noted that kernel growth is lagging far behind Moore's law, suggesting that the kernel is requiring a smaller portion of system memory over time. Still, it would be good to use even less; Matt figures that about half the growth in the kernel is something which can be avoided with some thought. KS2008: All about threads Ben Herrenschmidt led a session on the management of thread pools in the kernel. Kernel threads are typically used as a way for kernel code to do long-running work (which might sleep) as a separate task. The main mechanism used in the kernel now is the workqueue interface, but workqueues are not perfect.
They have become a sort of last resort for all kinds of tasks which need to run in process context. Problems with workqueues include the fact that they serialize all tasks, even when that serialization is not needed. In some cases, this serialization could lead to deadlocks. Workqueues offer developers the choice of setting up their own dedicated worker threads or using keventd - a set of per-CPU threads shared across all users. The dedicated threads are often overkill for the developer's needs, but using keventd can lead to unpredictable latencies. Often there is no good choice. What's needed is an API that can allow more than one thing to happen on any given CPU while still providing shared threads and low latency. One idea is to allow keventd to fork. There could be a new form of workqueue with an "asynchronous" flag set. When a task is queued, keventd would fork and process the task immediately. It would be a relatively easy change to make, but it would also be somewhat inefficient - forks are expensive. Another option would be to go with one of the existing thread pool implementations; there are already a few in circulation. The pdflush daemon has a simple mechanism which can grow and shrink the pool of threads based on demand. Btrfs has a thread pool which is tightly tailored to its needs; it does not resize the pool, but it does provide low latency. The sunrpc code has a thread pool which Ben described as "scary." There is also a proposal from David Howells for a "slow work" mechanism. It is the most generic of the options, and supports resizing as well. The options were discussed for a bit; Linus's suggestion at the end was to just extend the workqueue interface to provide a small, fixed-size pool. Ben replied that the code for resizing the pool is sufficiently simple that there is no point in leaving it out. Thomas Gleixner led a discussion on a related subject: the threaded interrupt handlers which are currently living in the realtime tree. 
It seems that the realtime developers have finally recovered from having taken on the maintainership of the x86 code and are now getting back to thinking about getting the remaining realtime code merged. The realtime tree is set up to thread almost all interrupt handlers, but that will not work for the mainline. Some devices will continue to run with synchronous interrupt handling, and the idea of running software interrupts in threads is not popular with the networking developers. So the suggestion is to provide a new version of request_irq() which would allow a driver to set up a threaded interrupt handler. In the absence of a change by the driver maintainer, interrupt handlers would continue to be run synchronously. Linus strongly requested that a new request function be added, rather than making a change to request_irq() itself. It seems he is still feeling the pain of previous changes to request_irq(), which have required fixing massive numbers of drivers. The separate request function was always in the plan; the requirements are significantly different. In particular, drivers using threaded interrupt handlers still need to provide a small, synchronous handler which can determine whether the driver's device is actually interrupting. Without that small handler, it is hard to make the handling of shared interrupt lines work right. There was some discussion of details, but no real objection to the overall plan. So chances are good that threaded interrupt handlers will be posted for the 2.6.28 or 2.6.29 development cycles. KS2008: Kernel code with large user-space components The direct rendering infrastructure (DRI) code has always played by different rules than the rest of the kernel. It is an out-of-tree project which has produced wildly different sets of APIs over the years. And it has never quite been as good as anybody would like. This recent LWN article covers some changes happening in the DRI camp. 
The unique nature of DRI can be traced back to the fact that much of the problem can only be solved in user space. At the 2008 kernel summit, graphics developer Dave Airlie led a session on "best practices" for the creation of kernel code which, like DRI, has large user-space pieces. Dave says that developers for much of the kernel have an easy life; they can work toward the implementation of a well-defined interface which has been specified by POSIX for years. But some folks are not so privileged. In the graphics world, every device must expose a different interface to user space; every attempt to standardize these interfaces has produced highly ugly results. There is no standard here, and there is no real prospect of creating one. Actually, that is not quite true; the standard for this kind of device is OpenGL. But there is little interest in putting a full OpenGL implementation into the kernel itself. So there has to be a wide channel of communication between user space and the kernel, and it will always be somewhat device-specific. The DRI project develops its code outside of the mainline because stabilizing this user-space API is hard. The bulk of the code (90%) is in user space, and, until all that user-space code works, it is not at all clear that the interface with the kernel is correct. Once the code goes into the mainline, that API must be frozen. So DRI code will remain outside until the developers can be confident that the API has reached a stable state. The other reason for out-of-tree development is the need to make life easier for testers. There are a fair number of people who are interested in testing graphics drivers, but who are not kernel developers. The DRI project wants to allow these testers to operate on a stable base - the kernel provided by their distributor, preferably - and not have to run bleeding-edge mainline kernels. So the DRI code has enough backward-compatibility code in it to allow it to run with a range of kernel versions. 
This code is not welcome in the mainline, so it must be removed before any DRI code is submitted upstream. But it must remain in the DRI tree, or the project will lose a lot of testers. Dave had a couple of requests for the kernel development community. One of those was to be allowed to keep the backward compatibility code even when drivers are sent upstream. Compatibility would not have to be long term - three development cycles, perhaps - but the ability to run across that range of kernels would make life a lot easier. It would also eliminate the need for the DRI developers to rewrite the code immediately before submission to the mainline - a process which does not help to assure stable operation. There was not a lot of opposition to this idea. Linus did note, though, that the DRI developers have not been complaining about API changes which cause them trouble. His suggestion was that they let the community know when API changes create pain; perhaps some of those changes could be reworked to lower their impact on out-of-tree code. The other request was to be allowed to put exports for kernel symbols into the mainline even though the code using those exports is not yet being merged. The presence of those exports would, again, make life easier for testers. This idea, too, drew no serious opposition. It was suggested that any such exports should be accompanied by a comment explaining why it exists and should not be removed. KS2008: Fixing the Kernel Janitors Project James Bottomley started off this session by saying that he had proposed it after being annoyed by one too many white space patches in his mailbox. He does not believe that encouraging people to blindly fix white space problems is a good way to bring in new developers. So the central question for this session was: how can we do a better job of involving newcomers in the kernel development process? 
Linus asked that new developers not start by trying to fix warnings - a task which currently appears on the "to do" list run by the janitors project. In the past, he has not enjoyed that experience at all. Beginner fixes for warnings tend to be aimed at silencing the warning rather than really understanding what is going on; as a result, they often break things. A better place for people to start, he says, is by testing the kernel and providing good bug reports. Andi Kleen said that task lists can be useful. He put together a document on how to switch code over to the unlocked_ioctl() file operation, thus eliminating the big kernel lock. Some people made use of it and got some useful work done. Linus pointed out, though, that a certain Alan Cox followed that document and got things wrong, forcing the developers to revert his broken patch. Matthew Wilcox stated that the problem in the kernel community is not a shortage of patches - it's a shortage of review. So he would rather start new developers on tasks like bug reports. Jeff Garzik noted that good results can be had by encouraging new developers to acquire an obscure piece of hardware and improve the driver. That only works if one is willing to put in a fair amount of mentoring time, though. Mentoring is a subject that came around a few times. Greg Kroah-Hartman's Linux driver project work has provided a forum for mentoring, and that has helped a number of developers to improve their skills. But Dave Airlie asked how many developers had been "mentored" into the system; almost no hands were raised. The thing that creates new kernel developers still appears to be bugs that irritate people into fixing them. That led to the inevitable suggestion that the developers in the room should fix fewer bugs, providing more opportunities for the recruiting of new developers. Having prospective developers run regression tests was suggested, but was not received with a great deal of enthusiasm. 
Far better, said Linus, was to have people test out as much hardware as they can; that's where the real problems lie. One often-cited problem with the janitors project is that it is not good at graduating developers to bigger and better tasks. Any sort of mentoring effort should be oriented toward helping developers to grow while, at the same time, having them do something useful at every step. Andrew Morton - who was quieter than usual this year - noted that quite a few people who express interest in kernel development disappear before too long. Putting effort into mentoring them can thus lead to a lot of wasted time. It is better, he said, to do this kind of mentoring in a group situation. There have been problems, though, with people posting incorrect answers to questions on mailing lists, so group mentoring must be handled carefully as well. Andrew repeated his statement that the best thing for new developers to do is to ensure that every system they have access to runs perfectly with current kernels. An attempt was made to get some action items out of the session. The creation of a mentoring project was suggested, but nobody stepped forward to take that on. There was a request for more distributors to package testing kernels for users who would like to experiment with the leading edge. Al Viro, though, argued for a stronger emphasis on getting people to read code, rather than write it. That reading can take the form of code review or simply taking the time to figure things out. Linus would like a tool which could create a minimally useful kernel configuration for a given system. A full distributor configuration takes far too long to build, and the prospect of creating a custom configuration is increasingly daunting. Linus noted that his first kernel for a new system never works - he is certainly not unique in that regard. It turns out that such a tool exists; it will be dusted off and posted soon. 
LPC: Fitting into the kernel ecosystem The first Linux Plumbers Conference started on September 17, 2008; the opening talk was a keynote by Greg Kroah-Hartman. He got the conference going with a provocative sermon on how the development ecosystem works and the niche we all occupy within it. It was a fun talk - unless you happen to work for Canonical. He started with an apology to Canonical, though. In earlier talks, he had said that only eight kernel patches had ever come from Canonical. In fact, he has been corrected; the proper number is 100. So, Greg asked, why is he picking on Canonical? His answer came in the form of a table of contributors to the kernel. It looked like this: Then Greg asked: does anybody from Canonical want to say anything? Nobody did. Greg then moved on to the Linux ecosystem, putting up a slide showing the larger components of this ecosystem - the low-level stuff that makes Linux what it is. Some of the largest components, beyond the kernel, were GCC, binutils, X.org, and the man pages distribution. Looking at lines of code, the kernel amounts to about 40% of the total. Other large components are all significantly smaller. It turns out that Greg has been doing repository data mining in a number of projects beyond the kernel. So, for projects like GCC, X.org, and binutils, he was able to put up tables listing the top contributors. The results varied somewhat, but there were a number of recurring themes. Red Hat tends to be toward the top of the list on all of these projects; companies like IBM and Novell also appear regularly. CodeSourcery is a significant contributor to GCC and binutils. The U.S. National Security Agency contributes 2.1% of the patches into X.org; why is not clear. In all of these projects there are significant contributions from unpaid developers, but those contributions are overshadowed by those from paid developers. And Canonical is always at the bottom of the chart - if it is there at all.
At this point Greg moved to a whiteboard to present his view of how the community works. At the development level, you have developers contributing to projects, which then release the code. There may be a few users at that level who feed back information (and maybe patches), but, in general, the biggest consumers of the project's releases are the distributors. Distributors package everything and provide it to their users. At this point, another feedback loop comes into play: users feed their experiences and problems back to the distributor. Those distributors will respond to the user feedback, improving their products. The amount of feedback from the distributors to the upstream projects varies, but it tends to be small. For enterprise distributions, it is quite small; they are running ancient versions of everything and have little to do with current upstream. The community-oriented distributions, such as Fedora or openSUSE, tend to feed more changes back to their upstream sources. Then, there is the matter of redistributors who base their products on another distributor's work; these are distributors like Ubuntu or CentOS. There are no contributions back to the community from that kind of distributor at all. They are not functioning as a part of the Linux ecosystem. Greg finished up with what appears to be the message he came to the Linux Plumbers Conference to deliver: if you are a developer, if you want to be a part of the ecosystem, and if you work for a non-contributing company: quit. There are plenty of companies that understand the ecosystem and which need good people; at least one company, it seems, had wanted to set up a recruiting table at the conference. It is a very good time for people with community participation skills; there is no reason for anybody who wants to work in the community to stay on the outside. 
[As a postscript, it is amusing to note that, while the conference did not allow companies to set up recruiting tables, nobody has prevented prospective employers from filling a prominently-placed whiteboard with information about available positions.] LPC: Linux audio: it's a mess Audio is a fitting topic for the first day of the Linux Plumbers Conference. Users want sound to Just Work, and there's lots of working code in individual projects. But so far, it seems like nobody has everything quite plumbed together in an annoyance-free way. Lennart Poettering, a lead developer of PulseAudio and Red Hat employee, moderated the miniconference and started with a summary of the state of Linux audio: "it's a mess." The audio miniconference came up with two steps toward cleaning up the mess, though. First, come up with a coherent story for application developers on what sound API to use, and how. Second, clean up the often-confusing array of user-visible audio level controls. PulseAudio first appeared to regular users in Fedora, starting with version 8, and now, as Lennart puts it, is for up-to-date users, "the software that currently breaks your audio." PulseAudio is a sound server that mixes audio from multiple applications and passes it along to the sound hardware. It offers advanced features such as network transparency: an application running on another system can play a sound, and PulseAudio makes it come out the speakers of the machine where the user is actually working. Supporting it shouldn't be a big change for most application developers to handle. It will handle applications written to the kernel's maintained audio API, ALSA, using the PulseAudio backend for alsa-lib. So the PulseAudio transition has been relatively painless for the distributions. An earlier sound server project, the Enlightened Sound Daemon (ESD), is falling out of favor and Media Application Server (MAS) has never really caught on.
However, one of the competing sound servers looks likely to remain. On the pro audio side, the low-latency sound server JACK is the recommended option. JACK, the "Jack Audio Connection Kit," as Dave Phillips writes, "holds the keys to the kingdom" for connecting studio applications such as the Ardour digital audio workstation and the Rosegarden MIDI sequencer. "If you want all of the features, no one audio system supports all of them," Lennart said. Apple and Microsoft each have a single sound server that does both desktop and pro audio, but nobody at the session seemed to have much interest in that direction for Linux. PulseAudio is optimized for general desktop use and power savings, and supports scheduling features that should minimize wakeups but still allow for reasonably low-latency playback of streaming audio. It's also network-transparent and supports features such as placing desktop sound events based on mouse position. Network audio and desktop effects don't tempt pro audio users. JACK's uncompromising approach toward latency means it's likely to hog too much power to be acceptable to battery-life-watching desktop users, but fine for a studio with a rack full of gear. So two sound servers, one for pro and one for the masses, seems to be fine with both sets of users. Abusing ALSA PulseAudio, however, can't give applications direct access to the hardware, and currently only about 70% of ALSA applications use the API in a PulseAudio-safe way, Lennart said. Some high-profile applications are among those doing audio wrong. "Flash and Skype are really really broken applications, especially Flash," he said. Adobe split out the parts of its code that talk to the audio subsystem, and certain other plumbing, into an open-source library, libflashsupport. But Flash remains broken. The proprietary Flash library talks to libflashsupport from multiple threads, and one thread calls a destructor while another continues to send data. 
"It works until you close the browser window and then you get a race," Lennart said. Developers who want to play audio have a sometimes-confusing choice of tools, including PortAudio and GStreamer. (PortAudio is cross-platform, which is likely why the popular cross-platform audio editing application Audacity uses it.) GStreamer is relatively feature-intense and heavyweight, also handling video and transcoding. (Write a player with GStreamer and you get the ability to play your collection of C64 SID files for free.) "If someone comes and says, 'I want to write an audio application. Which API should I use?' I don't have a good answer," Lennart said. The current best answer seems to be to write to the PulseAudio-safe subset of ALSA. Jeff Licquia of the Linux Standard Base (LSB), in the audience, mentioned that ALSA is on track for inclusion in LSB 4.0, and is a trial use module for 3.2. LSB aims to define a compatibility standard for Linux applications, and aims to do the kind of application developer education that Linux audio developers seem to need. Applications seeking LSB certification must run all of the LSB tests, but can fail anything tagged as trial use. "We're only keeping the stuff that we hope will be around for the long term," he said. If the LSB-safe subset of ALSA fits into the PulseAudio-safe subset of ALSA, application developers could write to ALSA and test with LSB. "I would like to be able to tell people to use libsydney," Lennart said. Libsydney, in progress, is intended to be a networking-friendly general-purpose audio API. ALSA and the HD-Audio widget problem In ALSA, the hardware/software interface is in good shape, but the software-to-user interface needs some work.
Takashi Iwai, a core ALSA developer and Novell employee, pointed out in a talk that the line count for /sound code in the kernel is actually shrinking, except for ASoC (system on a chip) and HD-audio. "There will be no more sound cards, especially PCI," he said. The one exception is the Sound Blaster X-Fi for gamers, which is currently not supported well in ALSA. Creative announced proprietary drivers in 2006, but one ALSA developer recently did get access to a data sheet under NDA. The new audio standard, HD-Audio, is commonly found on new systems, and it's well-supported at the kernel level. However, it's based on "widgets" with vendor-configurable I/O pins. A driver can't tell how the HD-Audio part is connected, so some Linux plumbing work is required to identify which of the many exposed level controls is the right one to show the user. An audience member pointed out the need to tweak multiple level settings on his hardware to get the right level without distortion. Linux will need more information on how each machine has its HD-Audio hardware hooked up in order to reliably give the user a useful volume control. Leo Laporte on open micro-blogging Radio talk show and podcast host Leo Laporte doesn't think operating systems or network infrastructures should ever be proprietary. He's the host of The Tech Guy radio show, which airs every weekend on stations around the United States, and of FLOSS Weekly, a regular podcast in which Laporte discusses different aspects of the Free, Libre, and Open Source software community. On The Tech Guy show, Laporte answers questions from computer users who call in to get advice and find ways to make their computers run better. Most of his callers are Windows users, but Laporte usually finds a way to mention Linux and other open source software during the course of his show. Laporte says he has been writing software for decades, and that he has always shared the source code, even before he had a notion of open source.
"It was public domain then. But even then, I understood that if you're programming, the most interesting part is to see other people's code and be able to modify it. That's just a natural way to work." His first shot at installing Linux was back in 1994 when he got his hands on a copy of Slackware. "It was murder — but it opened my eyes to the growing open source world." At the time, Laporte was the host of a cable television show called Tech TV. "We were the first television show to install Linux live." On that show, Laporte hosted some of the biggest names in FLOSS, including Linus Torvalds and Richard Stallman, during Tech TV's run. "The longer I worked as a computer journalist, the more obvious it became to me that proprietary software is a bad idea. It's not natural to be secretive and it doesn't make sense." Laporte says that especially in the enterprise, the technological infrastructure should be open. "That should never be proprietary. Protocols, standards, and code need to be open." When it comes to applications, Laporte is a bit more flexible. "If you want to write an app that is closed source, I can see there are reasons why one might want to do that and that's fine with me. But closing the operating system makes no sense, and it is bad for everybody." Laporte, a Twitter user with over fifty-five thousand followers, recently announced he would no longer use Twitter, but would instead now throw his support behind Laconica, the open source micro-blogging platform on which Identi.ca is built. Laporte spoke extensively about Laconica on FLOSS Weekly last month when he chatted with Evan Prodromou, the original author of Laconica and the person who maintains identi.ca. "Laconica is identical to Twitter, but it's open, which is huge, and, more than open just in terms of it being open source." 
Laporte says open standards are just as important in this case, and that the protocols for micro-blogging should become commoditized so that others can build on top of the infrastructure instead of having to start from scratch. Laconica also offers users the option to release all their micro-posts under a Creative Commons attribution license, making the service about as "open as you could hope for," writes Dan Brickley, co-founder of the Friend of a Friend project (FOAF). With Laconica, different micro-blogging services can communicate with each other since the platform is open, unlike Twitter's service. This makes it possible for different communities to form their own branded services in which users can still search for and follow users in other communities, tying them together in what has become known as a "federation." Right now, Laconica is running on dozens of disparate servers, whose users can all subscribe to each other's updates. Laconica is built using the OpenMicroBlogging specification, which is completely open, free, and independent of any one central maintenance authority, unlike Twitter's proprietary protocol. Laporte believes that this kind of federation, which could be called distributed micro-blogging, is the key to overcoming the scalability issues that have plagued Twitter, resulting in frequent outages for the popular service. "If you can't scale, that's another reason to have a more distributed system. Maybe we shouldn't have two million people on one Twitter. Maybe we should have five thousand people on four hundred 'twitters.' I have three thousand people on my system, and that's just about right." Laporte's system is called the TWiT Army, [Note that the web site is currently down] named after another of his podcasts known as This Week in Tech, or TWiT. "The conversation [there] has been very cohesive. The conversation is with people you know. With Twitter, it turns into a broadcast medium instead of a conversation.
Now, it is a very useful way to get a message out to all those people. But I would love to have all those people all in their own communities, able to search across the federation by keyword, and if I post something of interest they'll find out about it." Laporte says he is not trying to go "head to head" against Twitter. But he is convinced that Laconica is a better way to do micro-blogging. "One of my problems with Twitter is that I contribute a lot of content and they shut down access to it. I want to be part of an open platform — that's where the innovation is going to occur." Laporte says that features Twitter previously offered but has shut down, including instant messaging and "track," are two of the most valuable features that Twitter offered. "Comcast realized a huge value from Track," he says. Comcast customer service agents were tracking Twitter posts to monitor complaints or issues posted by users, and then following up directly with those people. "Twitter was saying, 'well it's too demanding,' but the conspiracy theory is that they realize this is where the real value of Twitter is and they want to try to monetize it." With Laconica, Laporte says, these types of features can remain open and accessible, not subject to the whims of proprietary ownership. Laporte, Prodromou, and others including RSS pioneer Dave Winer, are talking about a collaborative effort to standardize and open the protocols for micro-blogging. The group is planning a conference for all who are interested in the concept of open micro-blogging, called the BearhugCamp. Laporte says, "we would very much like to encourage Twitter to become a part. The idea is to get all the players to the table and encourage them to support the Extensible Messaging and Presence Protocol (XMPP) (developed by Jabber). We're creating a new messaging medium with emerging open standards, in new and exciting ways. 
It's not really about Twitter at all – Twitter gave us this idea of micro-blogging, and now we're onto the next thing: let's make it open." LPC: What's happening with webcams Christmas is coming early for webcam users. Support for hundreds of popular webcams, available from Michel Xhaard's GSPCA project, is merged for inclusion in the upcoming 2.6.27 kernel. The amount of tweaking required from the user, the distribution, or both, has been cut, and it's likely that a random webcam will now just work out of the box. Even with the much-wanted drivers becoming part of mainstream Linux, a small matter of plumbing remains. Webcams, Hans de Goede pointed out at the Linux Plumbers Conference, produce a variety of compressed video data. "They all came up with interesting proprietary compressed video formats," he says. The out-of-tree version of GSPCA did some decoding in kernel space, but the decoding of many camera-specific custom video formats had to be ripped out, as doing that kind of work in-kernel is a Linux faux pas. That's where Hans's libv4l comes in. Announced in June, the new library (actually a set of three) does the format conversion. While not a Red Hat employee at the time (he is now) Hans posted a "BetterWebcamSupport" feature idea on the Fedora wiki, writing, "Currently many webcams do not work with Fedora out of the box even though a Linux driver exists for them." The problem was partly fixed with the GSPCA cleanup and inclusion upstream, and partly became the rationale for libv4l. Besides the core libv4lconvert library, the package includes libv4l2, to emulate a /dev/videoX device which, transparently to the application, will deliver "sane" video formats. There's also a libv4l1 to do the same thing but for the V4L1 API. An audience member asked why the library is separate from gstreamer, which is already set up for video transcoding. 
V4L2 developer Hans Verkuil responded from the audience that "it's something that you do not want to have in the kernel, but it has to be small and fast." That leaves out gstreamer as a general solution, since some webcam applications don't need gstreamer or can't afford the space it takes. Therefore, a separate library. It needs one more feature, too: vendors install camera chips however they'll fit, which means the same camera module could be right side up on one product and upside down on another. Therefore, libv4l has software support for flipping images, but it still needs the data to know when to flip: a table identifying which hardware has the camera module in which orientation. Brandon Philips at SUSE has another piece of the puzzle, a "frame server" that lets multiple applications share the webcam—doing for the webcam what PulseAudio does for the sound hardware. You can't shoot a photo with Cheese while another app has the webcam open, as he showed in a screenshot. You can always rely on the computer hardware industry to figure out ways to save a little money on something if it's possible to solve the problem in software. Many new webcams have motorized focus but no hardware autofocus. Autofocus is up to the host system—which means a focusing daemon needs to see the video at the same time as an end-user application. So providing access for the autofocus daemon is another reason for the frame server. Someone on the mailing list has the autofocus math that will form the guts of the daemon figured out, but it's a fairly intensive calculation and will need to be done on an occasional frame of video, not each frame. While the original frame server idea would have one shared memory segment per system, with access for multiple users, PulseAudio developer Lennart Poettering pointed out the potential security risks of that idea from the audience. "Memory mapping across privileges is a really bad idea," he said. 
He suggested putting the frame server in the user session to prevent users from, at least, killing each other's webcam applications. The webcam market is one where Linux is an afterthought if it's a thought at all. The Linux conferences aren't teeming with employees of webcam manufacturers. The support Linux does have shows that the community can still support hardware on its own when it has to. LPC: Booting Linux in five seconds At the Linux Plumbers Conference Thursday, Arjan van de Ven, Linux developer at Intel and author of PowerTOP, and Auke Kok, another Linux developer at Intel's Open Source Technology Center, demonstrated a Linux system booting in five seconds. The hardware was an Asus EEE PC, which has solid-state storage, and the two developers beat the five second mark with two software loads: one modified Fedora and one modified Moblin. They had to hold up the EEE PC for the audience, since the time required to finish booting was less than the time needed for the projector to sync. How did they do it? Arjan said it starts with the right attitude. "It's not about booting faster, it's about booting in 5 seconds." Instead of saving a second here and there, set a time budget for the whole system, and make each step of the boot finish in its allotted time. And no cheating. "Done booting means CPU and disk idle," Arjan said. No fair putting up the desktop while still starting services behind the scenes. (An audience member pointed out that Microsoft does this.) The "done booting" time did not include bringing up the network, but did include starting NetworkManager. A system with a conventional hard disk will have to take longer to start up: Arjan said he has run the same load on a ThinkPad and achieved a 10-second boot time. Out of the box, Fedora takes 45 seconds from power on to GDM login screen. A tool called Bootchart, by Ziga Mahkovec, offers some details. In a Bootchart graph of the Fedora boot (fig. 
1), the system does some apparently time-wasting things. It spends a full second starting the loopback device—checking to see if all the network interfaces on the system are loopback. Then there's two seconds to start "sendmail." "Everybody pays because someone else wants to run a mail server," Arjan said, and suggested that for the common laptop use case—an SMTP server used only for outgoing mail—the user can simply run ssmtp. Another time-consuming process on Fedora was "setroubleshootd," a useful tool for finding problems with Security Enhanced Linux (SELinux) configuration. It took five seconds. Fedora was not to blame for everything. Some upstream projects had puzzling delays as well. The X Window System runs the C preprocessor and compiler on startup, in order to build its keyboard mappings. Ubuntu's boot time is about the same: two seconds shorter (fig. 2). It spends 12 seconds running modprobe running a shell running modprobe, which ends up loading a single module. The tool for adding license-restricted drivers takes 2.5 seconds—on a system with no restricted drivers needed. "Everybody else pays for the binary driver," Arjan said. And Ubuntu's GDM takes another 2.5 seconds of pure CPU time, to display the background image. Both distributions use splash screens. Arjan and Auke agreed, "We hate splash screens. By the time you see it, we want to be done." The development time that distributions spend on splash screens is much more than the Intel team spent on booting fast enough not to need one. How they did it: the kernel Step one was to make the budget. The kernel gets one second to start, including all modules. "Early boot" including init scripts and background tasks, gets another second. X gets another second, and the desktop environment gets two. The kernel has to be built without initrd, which takes half a second with nothing in it. So all modules required for boot must be built into the kernel. 
"With a handful of modules you cover 95% of laptops out there," Arjan said. He suggested building an initrd-based image to cover the remaining 5%. Some kernel work made it possible to do asynchronous initialization of some subsystems. For example, the modified kernel starts the Advanced Host Controller Interface (AHCI) initialization, to handle storage, at the same time as the Universal Host Controller Interface (UHCI), in order to handle USB (fig.3). "We can boot the kernel probably in half a second but we got it down to a second and we stopped," Arjan said. The kernel should be down to half a second by 2.6.28, thanks to a brand-new fix in the AHCI support, he added. One more kernel change was a small patch to support readahead. The kernel now keeps track of which blocks it has to read at boot, then makes that information available to userspace when booting is complete. That enables readahead, which is part of the early boot process. How they did it: readahead and init Fedora uses Upstart as a replacement for the historic "init" that traditionally is the first userspace program to run. But the Intel team went back to the original init. The order of tasks that init handles is modified to do three things at the same time: first, an "sReadahead" process, to read blocks from disk so that they're cached in memory, second, the critical path: filesystem check, then the D-Bus inter-process communication system, then X, then the desktop. And the third set of programs to start is the Hardware Abstraction Layer (HAL), then the udev manager for hot-plugged devices, then networking. udev is used only to support devices that might be added later—the system has a persistent, old-school /dev directory so that boot doesn't depend on udev. The arrangement of tasks helps get efficient use out of the CPU. For example, X delays for about half a second probing for video modes, and that's when HAL does its CPU-intensive startup (fig. 4). 
In a graph of disk and CPU use, both are at maximum for most of the boot time, thanks to sReadahead. When X starts, it never has to wait to read from disk, since everything it needs is already in cache. sReadahead is based on Fedora Readahead, but is modified to take advantage of the kernel's new list of blocks read. sReadahead is to be released next week on moblin.org, and the kernel patch is intended for mainline as soon as Arjan can go over it with ext3 filesystem maintainer Ted Ts'o. (Ted, in the audience, offered some suggestions for reordering blocks on disk to speed boot even further.) There's a hard limit of 75MB of reads in order to boot, set by the maximum transfer speed of the Flash storage: 3 seconds of I/O at 25MB/s. So, "We don't read the whole file. We read only the pieces of the file we actually use," Arjan said. sReadahead uses the "idle" I/O scheduler, so that if anything else needs the disk it gets it. With readahead turned off, the system boots in seven seconds, but with readahead, it meets the target of five. X is still problematic. "We had to do a lot of damage to X," Arjan said. Some of the work involved eliminating the C compiler run by re-using keyboard mappings, but other work was more temporary. The current line of X development, though, puts more of the hardware detection and configuration into the kernel, which should cut the total startup time. Since part of the kernel's time budget is already spent waiting for hardware to initialize, and it can initialize more than one thing at a time, it's a more efficient use of time to have the kernel initialize the video hardware at the same time it does USB and ATA. X developer Keith Packard, in the audience and also an Intel employee, offered help. Setting the video mode in the kernel would let the kernel initialize it at the same time as the rest of the hardware, as shown in figure 3.
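The figures quoted in this talk can be tied together in a few lines of Python. What follows is only an illustrative sketch, not the Intel team's tooling: the stage budgets and the Flash numbers are the ones reported above, while the helper names and every duration in CHAINS are hypothetical placeholders chosen to show the arithmetic. The model also ignores contention for CPU and disk between the chains.

```python
# Stage budgets in seconds, as quoted from the talk.
TIME_BUDGET = {"kernel": 1.0, "early boot": 1.0, "X": 1.0, "desktop": 2.0}

# Flash read limit: seconds of I/O and transfer rate, as quoted from the talk.
IO_SECONDS = 3
FLASH_MB_PER_S = 25

# The three chains that the modified init starts together. Step names follow
# the article; every duration is a made-up placeholder, not a measurement.
CHAINS = {
    "readahead": [("sReadahead", 3.0)],
    "critical path": [("fsck", 0.25), ("D-Bus", 0.25), ("X", 1.0), ("desktop", 2.0)],
    "background": [("HAL", 0.75), ("udev", 0.5), ("networking", 0.5)],
}


def read_limit_mb(io_seconds=IO_SECONDS, mb_per_s=FLASH_MB_PER_S):
    """Most data that can be read during boot at the given transfer rate."""
    return io_seconds * mb_per_s


def over_budget(measured, budget=TIME_BUDGET):
    """Return {stage: overshoot in seconds} for stages that exceed their budget."""
    return {
        stage: measured[stage] - limit
        for stage, limit in budget.items()
        if measured.get(stage, 0.0) > limit
    }


def chain_time(steps):
    """Steps within one chain run one after another, so their times add up."""
    return sum(duration for _name, duration in steps)


def serial_time(chains):
    """Elapsed time if every step of every chain ran back to back."""
    return sum(chain_time(steps) for steps in chains.values())


def parallel_time(chains):
    """Elapsed time if the chains run side by side: the longest chain decides."""
    return max(chain_time(steps) for steps in chains.values())


if __name__ == "__main__":
    print("total time budget (s):", sum(TIME_BUDGET.values()))
    print("read limit (MB):", read_limit_mb())
    # Hypothetical measurements: X overshoots its one-second allotment.
    print("over budget:", over_budget({"kernel": 0.9, "early boot": 1.0, "X": 1.5, "desktop": 2.0}))
    print("serial (s):", serial_time(CHAINS), "parallel (s):", parallel_time(CHAINS))
```

The point of the per-stage check is the one Arjan made: a stage that overshoots is flagged against its own allotment, rather than being hidden inside a total that merely got "faster."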
The fast-booting system does not use GDM but boots straight to a user session, running the XFCE desktop environment. Instead of GDM, Arjan said later, a distribution could boot to the desktop session of the last user, but start the screensaver right away. If a different user wanted to log in, he or she could use the screensaver's "switch user" button. In conclusion, Arjan said, "Don't settle for 'make boot faster.' It's the wrong question. The question is 'make boot fast'." And don't make all users wait because a few people run a filesystem that requires a module or sendmail on their laptops. "Make it so you only pay the price if you use the feature." Distributions shouldn't have to maintain separate initrd-based and initrd-free kernel packages, he said later. The kernel could try to boot initrd-free, then fall back if for whatever reason it couldn't see /sbin/init, as might happen if it's missing the module needed to mount the root filesystem. PowerTOP spawned a flurry of power-saving hacks from all areas of the Linux software scene. The combination of Bootchart, readahead, and a five-second target looks likely to set off a friendly boot time contest among Linux people as well. At the conference roundup Friday, speaker Kyle McMartin announced that both Fedora and Ubuntu have fixed some delays in their boot process, and there was much applause. FIGURE CREDIT: Arjan van de Ven and Auke Kok, Intel LPC: Upstart 1.0 plans: manifesto for a new init Let's make two things clear about Upstart, a proposed replacement for the Linux "init" process. First, it's not there to speed up boot, and second, it's not intended to parallelize startup. "Upstart is not for what most people think it is for," said its author, Scott James Remnant, in a talk in the dbus miniconference at the Linux Plumbers Conference. 
What it is there for is to expand the capabilities of "init" on Linux, replace some scripts and workarounds with rules that are intended to be easier to understand and modify, and enable future improvements. Remnant is a Canonical employee, and Upstart is in Fedora as of version 9, making it a welcome example of a Canonical-sponsored project finding its way into other distributions. While Greg Kroah-Hartman mentioned a list of core software on the Linux platform in his Plumbers Conference talk, "the one thing he never put in there was init," Remnant said. The Linux init, originally by Miquel van Smoorenburg, has been unchanged for years, and is modeled on the System V Unix init, which is even older. Instead of updating it, Remnant says that, for too long, distributions have just worked around it. The startup process has traditionally consisted of shell scripts, started by init, but containing workarounds and extensions accumulated over the years. For example, Debian has a wrapper program called start-stop-daemon, that manages PID files, to keep track of what process ID a daemon process ends up with. Upstart handles that itself. Current features of upstart include sending notifications for system events, for example, when a service starts; eliminating race conditions, by offering dependency tracking; and removing some service startups from the critical path for boot, again by handling dependencies. Upstart allows a distribution or sysadmin to spell out the critical path in a script, and also specify dependencies. Tracking dependencies allows distributions to eliminate "sleep" loops from the boot sequence, and instead take actions based on events. Events are not limited to the runlevel changes familiar to sysvinit users, but can depend on other things on the system. But what other things? Future directions for Upstart could be ambitious. For 1.0, Remnant is considering adding the ability to do tasks based on cron-like criteria such as "hourly." 
But should upstart really replace cron? Another possibly useful direction would be an "idle" event. The Common Unix Printing System (CUPS) is a service that makes sense to start "30 seconds before the user thinks of clicking on the print button," he said. CUPS is not in the critical path for boot, but needs to be running to detect printers before the user needs them. Should it be possible to start non-critical services when the system becomes idle? Even though fast boot isn't the goal of upstart, Remnant is optimistic about being able to help. Some of the slow booting problems that Arjan van de Ven and Auke Kok identified at the conference are deep in the weeds of nested scripts, and might be smoked out by a simpler init layout. "To make boot fast we have to do a bunch of different stuff. it makes it easy for us to do the real work," Remnant said. The Linux Plumbers Conference: a summary Back in the early days of Linux, a developer wishing to meet his or her peers at a conference had a relatively small number of alternatives. Two of those - Linux Expo and the Atlanta Linux Showcase - were held in the United States. But it has been a long time since the US has hosted a serious developer-oriented conference - especially for developers who are working on the lower layers of the system. The US-based conferences died out as a result of a combination of a number of factors, including poor management, competition from the Ottawa Linux Symposium and (yes, really) LinuxWorld, and a feeling among certain developers that becoming the next Dmitry Sklyarov would not be a fun way to spend the rest of the year. There is a certain appeal to overseas events, but that appeal fades more quickly than one might expect. The need for long-haul travel also excludes US-based developers who are unable to arrange funding. So, for some years, the development community in the US has been wishing for a local conference. 
More recently, a dedicated group of Portland-based developers led by Kristen Carlson Accardi, with some help from the Linux Foundation, decided to do something about it. The result was the first edition of the Linux Plumbers Conference, held September 17 to 19. Staging this conference in a world which does not lack for conferences was a bit of a risk, and the organizers added a few risks of their own to the mix. Looking back, your editor can say that those risks were well repaid; the first Linux Plumbers Conference was a great success. The "plumbing" focus of this event was well chosen. While it is still possible to run a system with a bare kernel and a shell as the init process, Linux systems used for real work increasingly have a layer of user-space software tightly wrapped around the kernel. Quite a bit of kernel-based functionality only works properly in the presence of a tightly-coupled user-space component; examples include system initialization, 3D graphics, and much more. The kernel, along with its collection of user-space software, makes up the "plumbing" layer which makes everything else work. Kernel developers have had ample opportunities to get together in recent years, but there has been no concerted effort to bring together the developers for the full plumbing layer until now. The other significant change made by the LPC organizers was to do away with the "everybody delivers a paper" format used by most conferences. Instead, the conference was planned as a series of 2.5-hour "microconferences," each with a specific focus. Each microconference, which had its own "runner," was able to select its own mode of operation. They generally included a certain number of presentations on relevant topics; in this sense, the microconferences resemble the topic-specific tracks found at many academic gatherings. Where things differ, though, is that most of the microconferences were explicitly oriented toward discussion and problem solving. 
The best speakers did not (just) talk about their own project; they raised challenges for the group as a whole to address. It worked spectacularly well. Throughout the event, your editor saw rooms full of people who were fully engaged in the work at hand. The discussions had wide participation, most of the necessary people were generally in the room, and there were relatively few bored people checking email. And, most importantly, a lot of real work got done. Developers came out of the sessions with a clear idea of what needs to be done, agreement with others on how it was to be done, and, sometimes, working code. So, what did all of these developers talk about? Developers interested in storage talked about the iogrind tool and a number of outstanding problems; some notes from the session have been posted. The Audio microconference covered a wide range of issues; see this LWN article for a summary. A session on tracing saw presentations by developers of a number of competing technologies, followed by a focused effort to design a unified low-level shared relay buffer. The video input session, for all practical purposes, continued on and off through the entire conference; that group of developers, which had never met before, set in motion some major redesign efforts for the Video4Linux layer. The bootstrap and initialization session was dominated by Arjan van de Ven's five-second boot demonstration; having been given that challenge, developers from multiple distributions set about the task of getting their systems to boot quickly. A session on server management looked for solutions to a number of challenges facing Linux administrators. Kernel/user-space APIs were the topic of another lively session which, while perhaps concluding little, raised a lot of issues on how those APIs should be designed. 
The power management session concluded that the suspend/resume problem is solved ("if you disagree, you bought the wrong hardware") and made progress on a number of other problems; now, they say, all that is left is the coding. The "future displays" session pounded out the path toward kernel-based graphics mode setting and quite a bit more. And the desktop integration session, while reaching "not a lot of conclusions," examined a number of relevant issues; the discussion on Upstart from that session will be covered here separately. Beyond that, LPC attendees could choose from a handful of more traditional presentations, a provocative keynote from Greg Kroah-Hartman, a rather less provocative kernel update from your editor, a git tutorial taught by some guy named Linus, and no shortage of evening celebrations. All told, the Linux Plumbers Conference was one of the most productive, interesting, and generally worthwhile events your editor has been to in quite some time - and your editor has been to rather more than the usual number of events. There will be a lot of interesting developments kicked off by this gathering, once the exhausted attendees get some rest. This conference is off to a good start. And it is just a start; the organizers are already working on the 2009 edition. It will, once again, be held in Portland. The general format will likely remain the same, but there will be no kernel summit before the 2009 event (the summit will be in October 2009 in Tokyo). Instead, there is a reasonable chance that a more traditional, presentation-oriented conference will be planned to coincide with the 2009 Plumbers Conference. With this new event, the active local community, and the success of this year's conference, LPC2009 looks promising already. After 2009, the Plumbers team hopes to take a page from the linux.conf.au playbook and pass the event onto a new set of volunteer organizers somewhere else in North America. 
This form of organization has helped to keep linux.conf.au vital and interesting for many years; it makes sense to do something similar with the Linux Plumbers Conference. Now might be a good time for any North American community which would like to host this event in 2010 to start thinking about how it could be done. The Optimistic Contributor Returns - Parted Magic Part 2 About eleven months ago, I wrote an article for LWN about the Parted Magic Linux Live CD distribution, a distribution with the elemental purpose of partitioning hard drives. At that time, the primary developer, Patrick Verner, had announced his intention to stop work on the distribution due to lack of support from the community. I lamented the fate of the project and wondered how many other promising projects had died under similar circumstances. I vowed to try and do better to support open software myself and called upon the community at large to do the same. Fast forward to today, and your Optimistic Contributor feels vindicated in his self-appointed choice of title. Why, you may ask? Well, to put it simply, the project did not die. To find out what happened, I spoke again with Verner on September 14th, 2008. OC - When we last spoke in October of 2007, you had posted on your website that development of Parted Magic would cease after version 1.9 was released. Since that time, you have released many more versions up to 3.0 (with 3.1 on deck). What motivated you to continue the project? PV - There were very little donations, help with code, or users giving me at least a pat on the back. Between 1.8 and 1.9 was by far the lowest point in this project. To this day I still think your article saved the project, well, sort of. After your LWN article I received the best month of donations and offers for help. The worse mistake I made was not asking for help in the first place. Once I started asking for help and starting directly asking for small donations the project turned around at a rapid pace. 
The best advice I could give anybody working on OSS projects is to ask. People assume you like doing it for free and don't need any help. The project makes about $400 a month now and it's nice because I can take the family out bowling a few times a week, buy some new computer hardware, or buy something for the house. OC - Since development has continued, the distro seems to have evolved at a steady pace. What features would you like to highlight, or rather, what feature(s) are you most proud of? PV - The best thing about Parted Magic is the fact it's not based on another distribution. Parted Magic is it's own entity and has the flexibility to go where ever it needs to go and add whatever may be required to perform needed tasks. There really isn't any comparison between Parted Magic and any other distro. It's really off the wall compared to the rest. Original thinking and process is what makes Parted Magic different and it's what I'm most proud of. OC - You have started what appears to be a project within a project with MiniPM (aka Beef Drapes). What itch were you trying to scratch with this new project? PV - MiniPM is a small project designed to run partimage over PXE. It really wasn't too hard to create and won't be heavily maintained. It fills a small niche and so far it seems to do what it's supposed to and nothing more. It's not much of a diversion. http://partedmagic.com/beef_drapes is my test directory. It's not a separate project or fork. OC - What do you believe will drive you to continue development on both projects for the foreseeable future? PV - When this project is no longer useful or donations starting declining back to 1.8 levels I'm out. I don't want to do this for free. It's fun to work on and I really enjoy it, but how can I justify the hours spent to my wife if I'm getting nothing tangible in return? It was always a goal of mine to do this for a living and I'm still hopeful it could happen. 
All it would take is $2 from every person that finds this project useful. I work 50+ hours a week at my day job so things happen pretty slow here. I couldn't even imagine how fast things would happen and the quality this project could provide if I just had more time. OC - If you could give advice to any open source programmer on how to keep a project going, what would you say? PV - Enjoy what you are doing, grow a thick skin, and find motivation to do it. OC - How has your opinion of the open source community changed in the last 10 months? PV - Not at all. I failed to ask, that was my problem. If you want anything from the open source community you need to ask and give back what was given to you. OC - Is there anything you would like to add? PV - Sure. Use http://partedmagic.com/beef_drapes and tell me what needs to be fixed before the next release. This is a big benefit to all Parted Magic users. Now, your Optimistic Contributor would like to take credit for helping to save the project, but all I did was inform the community of the situation. It was the community itself that did the actual saving. The donations, the offers of help, just the notes of thanks were enough to keep Verner going. Verner's response to one of my questions really resonated: "If you want anything from the open source community you need to ask and give back what was given to you." I read that statement several times. After letting it sink in, I realized how effectively Verner got straight to the point. In my previous article I made the common statement that freedom isn't free. Verner has taken that one step further in saying that a community isn't a community without communication and give and take. That sounds obvious after the fact, but I am glad Verner put the idea so clearly in my head. I can only hope (as I am ever the Optimist) that others within the open source community receive the same level of clarity as I have. So what about version 3.0 itself?
Just like the motivation of the project maintainer, the project itself has undergone a bit of a revolution. Almost the entire underpinnings have been updated or redesigned. The user interface still looks very similar to what 1.9 was, but everything just seems smoother and more polished than before. It is actually hard to believe that the project is put together by a handful of individuals. The best way to experience what the distribution is capable of (besides reading my original article) is to take Verner's last answer to heart: "Use http://partedmagic.com/beef_drapes and tell me what needs to be fixed before the next release. This is a big benefit to all Parted Magic users." LPC: The future of Linux graphics On the final day of the Linux Plumbers Conference, Keith Packard ran a microconference dedicated to future displays. A number of topics were discussed there, but the key session had to do with the near-term future of Linux video drivers. Longtime LWN readers will be more than familiar with the story: Linux has multiple subsystems charged with managing graphics hardware, the user-space driver model adopted by XFree86 leads to all kinds of problems, support for 3D graphics is not what it should be, etc. That whole story was recounted here, but with a notable difference: solutions are in the final stabilization stages, and these problems will soon be history. There are two major components to the work which is being done: graphics memory management and kernel-based mode setting. A contemporary graphics processor (GPU) is really a CPU in all respects, including the possession of a sophisticated memory management unit. Managing the sharing of memory between user space, the kernel, and the GPU is fundamental to the implementation of correct, high-performance graphics. One year ago, the TTM subsystem looked like the solution to the memory management problem, but TTM grew increasingly unworkable as the understanding of the problem improved. 
So now the Graphics Execution Manager (GEM) code looks like the way forward; it is currently being prepared for merging into the mainline kernel. Kernel-based mode setting, instead, is meant to get user-space code out of the business of messing around directly with the hardware. Putting the kernel in charge of the configuration of the video adapter has a long list of advantages. Suspend and resume have a much better chance of working, for example. Once the X server stops accessing hardware directly, it no longer needs to run as root; having that much untrusted code running with full privileges has made people nervous for many years. In the current scheme, the kernel cannot change the graphics mode if it needs to; that means that, for example, if the system panics, a graphical user will never see the message. With kernel-based mode setting, the kernel can switch to a different mode and allow the user to frantically try to read the message before it scrolls off the screen. Kernel-based mode setting will also make fast user switching work much better, without the need to use a separate virtual terminal for each user session. One of the first topics of discussion was: how does the kernel decide when to switch to the panic screen to show the user an important message? There are quite a few different paths by which the kernel can indicate distress; should a kernel message be presented every time a WARN_ON() condition is encountered? There would appear to be a need to unify the error paths in the kernel to help simplify this kind of decision. Jesse Barnes suggested that the kernel could simply switch on every message emitted with printk(), on the theory that such a policy would lead to a rapid and welcome reduction in kernel verbosity. The real debate in this session, though, had to do with development process. As has been discussed previously on LWN, much of the video driver work is done outside of the mainline kernel tree.
We are now seeing a big chunk of that work being prepared for a merge. But the new mode setting interface is a big API change which will require adjustments from user space; a new kernel expecting to handle mode setting may not give the best results when run with an older user space X server. So there will be a big flag day of sorts when everything changes and all of the new code gets run for the first time. Linus is not pleased with the notion of a video graphics flag day; he made a long appeal for a more incremental approach to fixing the video driver work. In his opinion, the flag day will lead to a whole bunch of untested code being made active all at once; there will certainly be design mistakes which show up, and the whole thing will fail to work properly. At which point another flag day will be required. Linus was not impressed by the claim that Fedora users have selflessly been testing this code for everybody; in his view, the kernel developers are not doing this testing. He sees the whole thing as a recipe for disaster. The real problem - and the reason for the out-of-tree development - is that all of this work requires the creation of a number of new, complex user-space ABIs. That is true for both mode setting and memory management, and the two cannot be easily separated from each other. Until the combination as a whole is seen to work, the video driver developers simply cannot commit themselves to a stable user-space interface - and that means that their code cannot be merged. As an example, TTM was cited. Had that code been pushed when it looked like the right solution, there would now be even bigger problems to solve. In summary, the graphics developers believe that the approach they are taking is as incremental as they can make it. Whether they convinced Linus of that fact is unclear, but he eventually seemed to accept the plan. 
He did ask for them to push the mode setting code upstream first, but that code cannot work without memory management support. So GEM will go into the mainline ahead of kernel-based mode setting. Once everything is in the kernel, it will be possible to boot a system with either kernel-based or user-space mode setting, so both new and old distributions will be supported. Someday, in the distant future, support for mode setting in user space can be removed. Much sooner than that, though, we should all be running much-improved graphics code and will have long since forgotten how things used to be. Newer kernels and older SELinux policies A subtle change in 2.6.25 recently left Andrew Morton with a less than completely functioning system, but it also demonstrated a user-space interface that may sometimes be overlooked: SELinux. The problem stemmed from a change to facilitate containers by making /proc/net into a symbolic link, which tripped up SELinux policies that had been written for earlier kernels. Putting policy into user space is a guiding principle of kernel development, but that can sometimes lead to an unexpected synchronization required between those policies and the kernel. The change itself was fairly minor, making /proc/net be a symbolic link to /proc/self/net so that containers would only see their network devices, rather than those of the enclosing system. But when Morton ran a recent kernel on his Fedora Core 5 and 6 systems, he got: Further investigation found that even ls got permission errors when looking at /proc/net. As is usual with mysterious "permission denied" errors, SELinux was the underlying cause. When the change was made, back in March, it was reviewed by the SELinux developers, but no one noticed that it would cause an additional permission check—on the symbolic link itself. So, when resolving things like /proc/net/dev or other entries in that directory, the "labels" on the symbolic link were checked. 
Of course, /proc is a synthetic filesystem, so the labels are generated from SELinux code rather than retrieved from extended attributes (xattrs).

Distributions have updated their policies to allow access to the symbolic link - probably by noticing the SELinux denial in log messages - so most folks never saw the problem. As Morton found out, though, existing distribution policy files (those shipped with FC5 and FC6 for example) would still disallow the access. Morton regularly runs newer kernels with older distributions to try to catch exactly this kind of error; he is probably one of very few, perhaps the only one, doing that.

Because the distribution-supplied kernel was being changed, some argued that requiring users to update their SELinux policies is not an onerous requirement. Paul Moore puts it this way:

Maybe I'm in the minority here, but in my mind once you step away from the distro supplied kernel (also applies to other packages, although those are arguably less critical) you should also bear the responsibility to make sure you upgrade/tweak/install whatever other bits need to be fixed.

Morton did not buy that argument, saying:

Nope. Releasing a non-backward-compatible kernel.org kernel is a big deal. We'll do it sometimes, with long notice, much care and much deliberation. We did it this time by sheer accident. That's known in the trade as a "bug".

But SELinux developer Stephen Smalley points out that permissions checks are not normally considered part of the kernel to user space interface. It is something of a gray area, though. Clearly the standard UNIX permission checks are part of that interface, at least partially because the kernel does handle the policy for those checks. Since the policies that govern the decisions about SELinux access denial come from user space, it is a bit hard to argue that changes to the kernel will not ripple out.
Smalley describes the problem:

I should note here that for changes to SELinux, we have gone out of our way to avoid such breakage to date through the introduction of compatibility switches, policy flags to enable any new checks, etc (albeit at a cost in complexity and ever creeping compatibility code). But changes to the rest of the kernel can just as easily alter the set of permission checks that get applied on a given operation, and I don't think we are always going to be able to guarantee that new kernel + old policy will Just Work.

One possible solution to the immediate problem was floated by Smalley: SELinux could change the label that it returns for symbolic links under /proc. It is not clear that anyone really wants that change, and there has been no movement to add it. As Morton says, "people who are shipping 2.6.25- and 2.6.26-based distros probably wouldn't want such a patch in their kernels anyway."

Longer term, Eric Biederman asks about supporting xattrs for /proc. That would allow user space to label the proc filesystem appropriately, removing one of the special cases. Unfortunately, doing so would create yet another incompatibility between newer kernels and older user spaces.

In the end, because the bug was only seen by Morton, many months after it was introduced, it may just be ignored. The larger issue of how permissions checks fit into the kernel to user space interface, though, may rear its head again.

e1000e and the joy of development kernels

The 2.6.27-rc regression list posted on September 21 contains - deep within the list - an entry reading "e1000e: 2.6.27-rc1 corrupts EEPROM/NVM". One might be forgiven for missing it; the list of regressions is still (unfortunately) long, and there is nothing there to indicate that it is a notable problem.
But it is: this particular bug goes beyond breaking networking; when it bites, it corrupts the EEPROM on the device, causing it to cease to function forevermore (or, at least, until the user can manage to flash the EEPROM with working code). This is a problem which is worth fixing.

As of this writing, though, nobody seems to know what the problem is. There was some confusion resulting from the fact that the related e1000 driver also suffered from an EEPROM corruption problem - but that turns out to have been an entirely different bug. The e1000 problem was fixed by putting a lock around accesses to the EEPROM, preventing corruption caused by concurrent access. But something else is going on with the e1000e.

Figuring out what that "something else" is appears to be a challenge. The problem is not readily reproducible, and there is this little problem that triggering the bug more than once requires the replacement of the affected hardware. It's not even clear which kernel versions are affected, though it appears that only the 2.6.27 development series shows the bug. There is some correlation between e1000e corruptions and graphics driver crashes, leading David Miller to pursue a hypothesis that the real culprit is changes to the X server, but that idea has not yet been proven. Other developers suspect a concurrency-related problem similar to the e1000 bug.

As of this writing, the bulk of what is known can be found in this advisory from Mandriva. Kernel developers are adding information to the kernel bugzilla entry as they find it. It has been suggested that anybody running 2.6.27 on a potentially affected system might want to save a copy of the current EEPROM contents with a command like:

(That assumes, of course, that the relevant device is eth0 on your system). With the saved data, it should be possible to recover the device if the worst happens; without, chances are that victims will have to return their systems to the vendor.
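The command itself did not survive in this copy of the article. As a hedged illustration only, the Python sketch below wraps one plausible way to take such a backup: it assumes the ethtool utility's -e (EEPROM dump) option with raw output enabled. The helper names and the backup file name are made up for this example; running it for real requires root privileges and a driver which supports EEPROM dumps.

```python
import subprocess


def eeprom_dump_command(device="eth0"):
    """Build the (assumed) ethtool invocation that dumps a NIC's EEPROM as raw bytes."""
    return ["ethtool", "-e", device, "raw", "on"]


def save_eeprom(device, backup_path, run=subprocess.run):
    """Write the EEPROM dump for `device` to `backup_path`.

    Returns the number of bytes saved. `run` is injectable so that the
    function can be exercised without touching real hardware.
    """
    result = run(eeprom_dump_command(device), capture_output=True, check=True)
    with open(backup_path, "wb") as backup:
        backup.write(result.stdout)
    return len(result.stdout)
```

A call such as `save_eeprom("eth0", "eth0-eeprom.bin")` would leave a binary copy of the EEPROM contents in the current directory. This sketch only takes the backup; writing the data back to a damaged device is a separate and riskier operation which it does not attempt.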
In one sense, this bug demonstrates that the system works. It was caught while the kernel was still in the stabilization phase; one can be certain that it will be obliterated somehow before any stable 2.6.27 release comes out. On the other hand, the first report of this problem hit the net on August 8; the problem was known for over a month before distributors started responding to it and the all-out hunt for the cause began. That is a long time for any regression to persist, but it is especially long when one is dealing with a regression which has the ability to regress hardware back to a stone-age state.

The distributors have now responded; most of them have withdrawn kernels with the affected drivers. So far, nobody has posted tools to help affected users recover their hardware (suggestions to use ibautil should be ignored and forgotten about as soon as possible). Such a tool is forthcoming, but it would be hard to blame the relevant engineers for focusing on fixing the problem first. With any luck at all, the root cause will have been isolated by the time you read this.

There is one thing that will not have changed, though. Testers of unstable software - especially the kernel - have often been warned that said software can do all kinds of terrible things to their systems. It is easy to ignore those warnings; even -rc1 kernels actually work for most people, most of the time. But, as we have seen in this case, the potential for catastrophic bugs is real. Development code can brick your network adapter, scramble your filesystems, open up severe security holes, or save your documents as OOXML. When experimenting with unstable code - even if it has been neatly packaged by your distributor - it is always prudent to have good backups and an even better sense of humor.

Mobile phone or penetration tool?

The NeoPwn is a pocket-sized network penetration tool based on Linux and free software.
The form factor should be familiar to anyone that has paid attention to the Linux mobile phone market as NeoPwn is based on the OpenMoko Neo FreeRunner. When the device starts shipping, users will be able to do network monitoring and penetration testing from an unobtrusive platform—then call home with it. NeoPwn comes with an impressive array of free software security tools, including things like Metasploit, Aircrack-ng, WifiZoo, Wireshark, and many others. They all run on top of a customized Linux 2.6.24 kernel—sources to be released when the hardware ships, which is scheduled for October 1—from the microSD flash module. A full Debian distribution is included on a flash filesystem that has been optimized for performance and size. The company behind NeoPwn has also created a GUI interface to the system for hardware control as well as attack automation. The interface is meant to reduce the need for using the command line for the most common types of attacks. Using the tools, Wired Equivalent Privacy (WEP) keys can be cracked in 5 to 14 minutes depending on whether the network has clients connected or not. The NeoPwn is not set up to crack Wifi Protected Access (WPA) keys on the device itself, but it can capture the handshake for use by programs on more powerful systems. There are several different options for purchasing the NeoPwn—all of them rather pricey. The basic model is $699 for the phone (normally $399), software, and some useful accessories. One can also just purchase the software on a 2GB microSD card for $79. The website has a prominent warning that might deter some, however: "Please be advised that if you do not choose a complete system, you will have to program the phone's bootloader manually for the correct microSD bootloader entry, to the NAND memory. This can be dangerous if you do not know what you are doing!" 
The standard FreeRunner Wifi has firmware limitations that will not allow monitoring or packet injection - pretty important capabilities for a network security tool - so various USB Wifi cards come with the NeoPwn. Also, since a custom kernel is used, one cannot make phone calls and do penetration testing at the same time. At boot time, one must choose between the two modes. Even with those limitations, the FreeRunner seems like an excellent choice as a platform.

For those puzzled by the name, "pwn" is used for the word "own" in the "leetspeak" used by many in the security community - both white and black hat. Breaking into and controlling a network or system is then "pwning" it. NeoPwn is not alone in using the term. Metasploit author H D Moore's iPwn Mobile makes UMPC-based penetration testing devices. Both the NeoPwn and iPwn Mobile's Infiltrator look like useful devices for those needing an off-the-shelf solution, but because they are based on free software, the core capabilities are available to those with a lower budget.

By showing what can be done with open mobile phones like the FreeRunner, NeoPwn is doing a great service for both OpenMoko and the free software community. Undoubtedly various malicious folks will get their hands on devices like this, so it is important that security researchers and professionals have access to them as well.

openSUSE and the distribution of proprietary software

Every Linux distributor must find its own peace when it comes to the issue of proprietary software. Some distributors will avoid anything non-free to the point of tearing firmware out of the kernel. Others, like Fedora or Debian, will not include any non-free code. Distributors like Ubuntu are rather more willing to facilitate the use of non-free software, but even they are, perhaps, not 100% comfortable with it. And distributions like Xandros positively embrace proprietary code.
OpenSUSE (like SuSE Linux before it) has traditionally taken a position which is relatively friendly toward proprietary software. It was only in 2006 that Novell announced its intention to stop shipping non-GPL kernel modules, but it never made any such promises with regard to user space. So a typical openSUSE installation disk includes a number of proprietary goodies, including the Adobe Flash player, a number of fonts, ARCAD, the Acrobat PDF reader, the Opera web browser, RealPlayer, and more. The presence of all this proprietary code is unwelcome to some users, of course, but it has another interesting effect: it requires that openSUSE be distributed with an end-user license agreement which has some very un-free-software-like terms. Among other things, it reads: Novell reserves all rights not expressly granted to You. You may not: (1) reverse engineer, decompile, or disassemble the Software except and only to the extent it is expressly permitted by applicable law or the license terms accompanying a component of the Software; or (2) transfer the Software or Your license rights under this Agreement, in whole or in part. In other words, redistribution of the openSUSE DVD is not permitted. Members of the openSUSE mirror network are, technically, in violation of the EULA, though nobody appears to be in a hurry to call them on that. But the EULA raises eyebrows and makes some users uncomfortable; many people got into free software to avoid dealing with agreements like that. The need for the EULA, rather than problems with proprietary software in general, is causing developers at Novell to reconsider which packages should go onto an openSUSE DVD. To that end, Novell product manager Michael Löffler has proposed a new scheme whereby the DVD would only contain redistributable software (including proprietary software, such as firmware, which allows redistribution). 
The openSUSE project would set up a network-based repository from which other proprietary applications could be installed; the installer would then install a couple of packages (the Adobe Flash player and Fluendo's MP3 codec) by default. The end result for most users would be the same: an openSUSE installation with both free and proprietary software. At least, that would be the case for users with a decent network connection. But those users would also gain a DVD with a much less restrictive EULA allowing the DVD to be redistributed at will. (The current plan is to still have an agreement for trademark control and warranty disclaimer reasons, even though other software distributors have managed to eliminate EULAs for those purposes). At this point, it would also be easy to add an option to simply skip the configuration of the non-free repository for users who want a "clean" installation. Most responses to this proposal have been positive. The happiness is not universal, though; one user complained: I don't think Novell, openSUSE and us should be influenced by "bad press" of doubt quality and change what is a key point of openSUSE: offering also proprietary software ready to go on the DVD. Moving these packages to an online repository makes no difference from downloading and installing them by hand. It is true that one-stop shopping has long been a feature of the SUSE distribution. And a recent survey [PDF] suggests that a significant portion of the openSUSE user base makes use of at least a few of the proprietary tools included there. If the presence of this code is truly a "key point" of openSUSE, then taking it out could risk upsetting users at a time when, by some accounts, the visibility of this distribution is already dropping. This risk would be mitigated by a couple of factors, though. One is that the need to download those packages over the net is not much of a stopping point for most users. 
After all, people installing Linux from a CD or DVD have usually resigned themselves to a massive download of package updates after the first boot anyway. Tossing a few more packages into that download - assuming they weren't set to be updated by then anyway - is not going to change the experience in any significant way.

But the other relevant point is that the need for much of this proprietary code is decreasing. Java used to be a big part of the openSUSE proprietary software load, but Java is now free. Your editor cannot remember when he last encountered a PDF file which could not be managed by at least one free viewer - though, evidently, such files do still exist. Perhaps the biggest remaining problem is Flash; progress is being made there, but Flash is most certainly not a solved problem. Beyond that, though, there are few situations indeed where a proprietary application is really needed for ordinary tasks.

The openSUSE distribution is not distancing itself from proprietary software at this time; it is just reorganizing its management of that software to address one of the problems it brings. But it is still hard to avoid the temptation to read between lines and look forward to a day when openSUSE, too, distributes only free software - not as a result of any sort of push for purity, but just because its users no longer have any need for anything else.

Low-level tracing plumbing

Kernel and user-space tracing were heavily discussed at both the kernel summit and the Linux Plumbers Conference. Attendees did not emerge from those discussions with any sort of comprehensive vision of how the tracing problem will be solved; there is not, yet, a consensus on that point. But one clear message did come out: we may end up with several different tracing mechanisms in the kernel, but there is no patience for redundant low-level tracing buffer implementations.
All of the potential tracing frameworks are going to have to find a way to live with a single mechanism for collecting trace data and getting it to user space. This conclusion may look like a way of diverting attention from the intractable problems at the higher levels and, instead, focusing everybody on something so low-level that the real issues disappear. There may be some truth to that. It is also true, though, that there is no call for duplicating the same sort of machinery across several different tracing frameworks; coming up with a common solution to this part of the problem can only lead to a better kernel in the long run. But there is another objective here which is just as important: having all the tracing frameworks using a single buffer allows them to be used together. It is not hard to imagine a future tracing tool integrating information gathered with simultaneous use of ftrace, LTTng, SystemTap, and other tracing tools that have not been written yet. Having all of those tools using the same low-level plumbing should make that integration easier. With that in mind, Steven Rostedt set out to create a new, unified tracing buffer; as of this writing, that patch was already up to its tenth iteration. A casual perusal of the patch might well leave a reader confused; 2000 lines of relatively complex code to implement what is, in the end, just a circular buffer. This circular buffer is not even suitable for use by tracing frameworks yet; a separate "tracing" layer is to be added for that. The key point here is that, with tracing code, efficiency is crucially important. One of the main use cases for tracing is to debug performance problems in highly stressed production environments. A heavyweight tracing mechanism will create an observer effect which can obscure the situation which called for tracing in the first place, disrupt the production use of the system, or both. To be accepted, a tracing framework must have the smallest possible impact on the system. 
So the unified trace buffer patch applies just about every known trick to limit its runtime cost. The circular buffer is actually a set of per-CPU buffers, each of which allows lockless addition and consumption of events. The event format is highly compact, and every effort is made to avoid copying it, ever. Rather than maintain a separate structure to track the contents of an individual page in the buffer, the patch employs yet another overloaded variant of struct page in the system memory map. (Your editor would not want to be the next luckless developer who has to modify struct page and, in the process, track down and fix all of the tricky not-really-struct-page uses throughout the kernel). And so on.

The patch itself does a fairly good job of describing the trace buffer API; that discussion will not be repeated here. It is worth taking a quick look at the low-level event format, though:

This format was driven by the desire to keep the per-event overhead as small as possible, so there is a single 32-bit word of header information. Here, type is the type of the event, len is its length (except when it's not, see below), time_delta is a time offset value, and array contains the actual event data.

There are four types of events; one of them (RINGBUF_TYPE_PADDING) is just a way of filling out empty space at the end of a page. Normal events generated by the tracing system (RINGBUF_TYPE_DATA) have a length given by the len field, which is stored right-shifted by two bits (that is, in units of four bytes). So the maximum event length is 28 bytes (the largest value of the three-bit len field, seven, multiplied by four), which is not very long. For longer events, len is set to zero and the first word of the array field contains the real length.

The other two event types have to do with time stamps. Over the course of the discussion, it became clear that high-resolution timing information is needed with all events, for two reasons.
The recording of events into per-CPU arrays, while essential for performance, does have the effect of separating events which are related in time; the addition of precise timekeeping will allow events to be collated in the proper order. That collation could be handled through some sort of serial counter, but some performance issues can only be understood by looking closely at the precise timing of specific events. So events need to have real time data, at the highest resolution which is practical.

Just how that data will be recorded is still unclear, and may end up being architecture dependent. Some systems may use timestamp counter data directly, while others may be able to provide real times in nanoseconds. Whatever format turns out to be used, there is no doubt that it will require 64 bits of storage. But most of the time data is redundant between any two events, so there is no real desire to add a full 64-bit time stamp to every event in the stream. The compromise which was reached was to store the amount of time which passes between one event and the next in the 27 bits allotted. Should the time delta be too large to fit in that space, the trace buffer code will insert an artificial event (of type RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space.

The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to help keep the buffer timestamps in sync." This little bit of functionality has not yet been implemented, though.

The rate of change of the trace buffer code appears to be slowing somewhat as comments from various directions are addressed; it may be getting close to its final form. Then it will be a matter of implementing the higher-level protocols on top of it. In the meantime, though, the attentive reader may be wondering: what about relayfs? The relay code has been in the kernel for years, and it was intended to solve just this kind of problem.
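Before turning to relayfs, the event header described above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the patch's code: the three-bit width of len follows from a 32-bit header holding a two-bit type (four event types) and a 27-bit time_delta, but the placement of the fields within the word (type in the lowest bits), the rounding of payload lengths up to a multiple of four, and the function names are all assumptions made for this example.

```python
TYPE_BITS = 2    # four event types
LEN_BITS = 3     # 32 - 2 - 27
DELTA_BITS = 27  # time since the previous event

MAX_INLINE_LEN = ((1 << LEN_BITS) - 1) << 2  # 7 * 4 = 28 bytes
MAX_DELTA = (1 << DELTA_BITS) - 1


def pack_header(ev_type, payload_len, time_delta):
    """Pack the three fields into one 32-bit header word (assumed bit order)."""
    if not 0 <= ev_type < (1 << TYPE_BITS):
        raise ValueError("event type does not fit in two bits")
    if not 0 <= time_delta <= MAX_DELTA:
        # In the design described above, a separate time-extension
        # event would carry the larger delta.
        raise ValueError("time delta does not fit in 27 bits")
    if payload_len <= 0:
        raise ValueError("payload length must be positive")
    rounded = (payload_len + 3) & ~3  # assumed: lengths are whole words
    # A len field of zero means "the real length is in array[0]".
    len_field = rounded >> 2 if rounded <= MAX_INLINE_LEN else 0
    return (ev_type
            | (len_field << TYPE_BITS)
            | (time_delta << (TYPE_BITS + LEN_BITS)))


def unpack_header(word):
    """Return (type, len field, time_delta) from a header word."""
    ev_type = word & ((1 << TYPE_BITS) - 1)
    len_field = (word >> TYPE_BITS) & ((1 << LEN_BITS) - 1)
    time_delta = word >> (TYPE_BITS + LEN_BITS)
    return ev_type, len_field, time_delta
```

Under these assumptions, a 12-byte event is stored with a len field of 3, a 40-byte event gets a len field of 0 (its real length would go into the first word of array), and any delta of 2^27 or more is rejected, which is the case the time-extension event exists to handle.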
The most direct (if not most politic) answer to that question was probably posted by Peter Zijlstra:

Dude, relayfs is such a bad performing mess that extending it seems like a bad idea. Better to write something new and delete everything relayfs related.

Deleting relayfs would not be that hard; there are only a couple of users, currently. But relayfs developer Tom Zanussi is not convinced that the problems with relayfs are severe enough to justify tossing it out and starting over. He has posted a series of patches cleaning up the relayfs API and addressing some of its performance problems. At this point, though, it is not clear that anybody is really looking at that work; it has not received much in the way of comments.

One way or the other, the kernel seems set to have a low-level trace buffer implementation in place soon. That just leaves a few other little problems to solve, including making dynamic tracing work, instrumenting the kernel with static trace points, implementing user-space tracing, etc. Working those issues out is likely to take a while, and it is likely to result in a few different tracing solutions aimed at different needs. But we'll have the low-level plumbing, and that's a start.

Ubuntu debuts its Upstream Report

Ubuntu has taken some heat over the years for its relationship with upstream projects, but the distribution seems determined to change that impression. To that end, Ubuntu has started by looking at bugs and bug reporting between the distribution and upstream projects. The visible result is the beta release of the Ubuntu Upstream Report, which displays the progress of getting bugs upstream.

Users of Ubuntu report lots of bugs in the software they use but, for the most part, those bugs aren't in any way specific to Ubuntu; they tend to also exist in the upstream project. Ubuntu collects its bugs at Canonical's Launchpad web site which allows linking those bugs to bugs in the bug tracking system of an upstream project.
Once the link - or watch as it is called in Launchpad - is established, updates to the upstream bug's status will be reflected in the Ubuntu bug as well. That capability has been available for some time, but as Ubuntu looked at ways to improve how well their bugs were flowing upstream, they needed a way to measure how well watches were being used. Canonical's Ubuntu community manager Jono Bacon describes the idea behind the report:

In terms of this project, I was keen to see graphs that show the number of upstream bug linkages going on, the total number of open vs. upstream bugs and how many bugs are fixed elsewhere. We could use these graphs to determine our progress in improving our bug workflow, but this was not enough - we also needed raw data about which projects needed the most focus. Which projects were struggling the most with bug figures? Which projects were not forwarding bugs upstream? Which projects didn't have an upstream bug tracker registered in Launchpad? We had all the answers to these questions in Launchpad, but no means of gathering them. To fix this, we created the Ubuntu Upstream Report.

The report ranks Ubuntu projects by the number of open bugs, while also showing how many have progressed towards upstream. Bugs in Ubuntu get triaged by the Ubuntu bug team, with some of them getting classified as "upstream" - meaning that they exist in the project itself, rather than just Ubuntu's build. Upstream bugs that are linked to a bug in the project's bug tracker are considered "watch" bugs. Each successive stage shows the difference between the previous, both as a number and a percentage so that it is easy to see how bugs are being handled as well as where the bottlenecks are. This dashboard-style interface also allows sorting by column and retrieving lists of bugs by following the numeric links.

The report was created by Jorge Castro, who is in charge of external project developer relations for Canonical.
The tool has multiple uses, as Castro explains: We wanted to provide a tool that not only shows upstreams how well we're linking and forwarding bugs, but a day-to-day tool for maintainers to see where there are targets of opportunity to forward to upstream. And lastly, for triagers we wanted to provide real-time working "bug lists" that you can work through if you want to help be the bridge that connects the downstream Ubuntu Package to the upstream project. Part of the idea is for the report to be used by participants in Ubuntu's 5-A-Day initiative. 5-A-Day is an effort to make the Ubuntu bug list better by encouraging users and developers to work on five bugs each day. Users can do things like try to reproduce the bug, cleaning up and adding more information to the report; while developers can triage bugs or look at patches to the upstream project to see if they are needed for Ubuntu. The report will also help those who are running or participating in Bug Jams—focused efforts to gather people together to move Ubuntu bugs along. Linking to existing upstream bugs or creating new ones for problems that Ubuntu users find can be helpful for projects. Some projects will find it more helpful than others, as Bacon notes: If we do link a bug upstream, we had no firm idea how useful an upstream actually find our bug data. Our discussions suggested very mixed reactions - a small project is likely to have a very different perspective on bugs than a large project. Just think about this in purely quantitative states - a small project will likely get fewer bugs, and these bugs can probably be dealt with by a small collection of volunteers. This is unlikely to scale to something like the Linux kernel or OpenOffice.org. One of the problems, of course, is the one-way nature of the watch link—Ubuntu sees changes to the upstream bug, but the reverse is not true—as projects have to come looking in Launchpad for updates. 
There is also resistance to using Launchpad because it is not free software, though that is slated to change by mid-2009.

Overall, this new report and the focus on improving upstream relations are very welcome, but tracking bugs only goes so far; fixing upstream bugs is an important, but missing, piece. In order to not be seen as just a consumer of upstream software, one needs to not only report bugs, but fix them as well. For all of the various bug-related efforts that Ubuntu is sponsoring, there is very little mention of actually fixing problems and sending patches upstream. There are tools like Harvest that make it easier to find upstream patches - bug fixes and enhancements for possible inclusion in the Ubuntu packages - but the focus is clearly on improving Ubuntu, as opposed to improving the software ecosystem that makes up the distribution.

It is important to remember that the efforts so far are just a start; Ubuntu is working on additional projects to improve its upstream relations. One gets the sense that they have heard the criticisms and are working to address them. Like it or not, Ubuntu has its own way of doing things which may mean it takes longer than some would like, but it certainly looks to be headed in the right direction.

Plugging into GCC

Almost one year ago, LWN examined the GCC plugin mechanism - or, more exactly, the lack of such a mechanism. Despite the increasing level of interest in adding special-purpose modules to the GCC compiler, GCC has no API which allows this addition to be done. So developers working on GCC extensions are faced with the daunting prospect of patching their code directly into the compiler. This situation looked unlikely to change; the Free Software Foundation's fears that a plugin mechanism would be used by proprietary extensions were just too strong. One year later, though, things look a little different; there may be a plugin-capable GCC available in the (relatively) near future.
There are a lot of good reasons for wanting to add plugins to the GCC compiler. The implementation of better optimization techniques is an obvious example, but there is more than that. The EDoc++ project has put together a static analysis tool which performs checking of exception handling in C++ code - and generates documentation while it's at it. Mozilla uses its Dehydra tool to find potential problems in the browser's code base. The LLVM compiler can be thought of as a sort of GCC plugin, currently. The Middle End Lisp Translator project is working on a Lisp-like language which, in turn, can be used within plugins for static analysis and code transformations. The list goes on; just about any project working on the processing of programs can benefit from hooking into the GCC platform. The concern that has long been expressed by the FSF (which owns the copyrights on GCC) is that a general plugin mechanism would make it possible for companies to traffic in binary-only GCC modules. Rather than contribute a new analysis or optimization tool - or a new language - to the community, companies might have an incentive to distribute their work separately under a restrictive license. That runs very much counter to what the FSF is trying to accomplish, so opposition from that direction is not particularly surprising. But the pressure for some sort of plugin API is not going away, so the GCC developers have been thinking about ways to make it possible without upsetting Richard Stallman. One alternative which has been discussed is to require plugins to be written in a high-level scripting language - Python or Perl, perhaps. Then plugins would, for all practical purposes, have to be distributed in source form. Even if they carried a hostile license, it would be possible to study them and learn how they actually work. Another possibility is to take a page from the Linux kernel's book and keep the plugin API unstable. 
If the API changed with every GCC release, GCC would become a moving target which would be much harder for proprietary vendors to keep up with. An unstable API may be the way things go in any case - there may be no other way to allow GCC itself to continue to progress quickly - but experience with the kernel shows that an unstable API is not, by itself, enough to scare off a determined proprietary software vendor. It might reduce the number of proprietary GCC modules, but it would not eliminate them. Alternatively, one could require plugin modules to declare their license to the GCC core, which could then reject plugins that lack a suitable license. Again, experience with the kernel suggests that there are limits to how far one can get with this approach. Proprietary plugin vendors could distribute a version of GCC with the license check patched out - or just have their plugin lie about its license. Yet another possibility is to not worry about the problem at all; it is not clear that the world is full of vendors waiting for an opportunity to abuse a GCC plugin API. As GCC developer Ian Lance Taylor puts it: The FSF doesn't want plugins because they are concerned that people will start distributing proprietary plugins to gcc. I personally think this is a fear from twenty years ago which shows a lack of understanding of today's compiler market, but, that said, the FSF wants to cover themselves for the future as well. Someday, perhaps, the FSF will feel sufficiently confident to allow unrestricted plugin access to GCC, but that does not appear to be in the cards at this time. What does appear to be happening, though, is an attempt to enable plugins by way of some licensing trickery. The GCC suite is covered by the GPL, a fact which does not, in itself, affect the licensing of any program which is compiled by GCC. But GCC is more than just the compiler; it also includes a runtime library needed to make most GCC-compiled programs actually run. 
Linking to the runtime library could cause the resulting program to be a derived product of that library; since the runtime library is licensed under the GPL, that could be a concern for anybody compiling non-GPL-licensed code. To address that concern, the runtime code has long carried an exception to the GPL: As a special exception, you may use this file as part of a free software library without restriction. Specifically, if other files instantiate templates or use macros or inline functions from this file, or you compile this file and link it with other files to produce an executable, this file does not by itself cause the resulting executable to be covered by the GNU General Public License. This exception does not however invalidate any other reasons why the executable file might be covered by the GNU General Public License. That is the language which enables the distribution of proprietary software built with GCC. The plan, said to be under consideration currently, is to change the wording of that exemption; essentially, it would no longer apply to code compiled with the use of proprietary GCC plugins. The new license is not finalized, but Mr. Taylor guesses it will look something like this: [I]f you modify gcc by adding GPL-incompatible software used to generate code, it is likely that you will not be granted any exception to the GPL when using the runtime library. In other words, if you 1) add an optimization pass to gcc using the (hypothetical) plugin architecture, and 2) that optimization pass is not licensed under a GPL-compatible license, and 3) you generate object code using that optimization pass, and 4) you link that generated object code with the gcc runtime library (e.g., libgcc or libstdc++-v3), then you will not be permitted to distribute the resulting executable except under the terms of the GPL. 
The actual wording of the new runtime license has been a long time in coming; the FSF's lawyers want to get it right so that it discourages undesired conduct while staying out of the way for everybody else. It also does not appear to be the FSF's highest priority at the moment. So nobody really knows when it might become official - though there have been notes to the list suggesting that it could happen in the near future. What we do seem to know is that it will happen, sooner or later, and the addition of a plugin mechanism to GCC will become possible. So the developers are starting to think about how the API will work. There are a couple of existing GCC plugin frameworks already, and plenty of thoughts on how they could be improved; see, for example, this discussion for an idea of what is being talked about. But the details are likely to be of interest mostly to GCC hackers, while the end result will be beneficial to a much wider community of developers and users. Moving the -staging tree Greg Kroah-Hartman was tagged as the "maintainer of crap" at this year's Kernel Summit for his willingness to shepherd drivers of lower quality into the mainline. He has not shrunk from that label, when introducing a patch set that would merge some of those drivers. In fact, he has embraced the label: as part of his patch, he introduced the TAINT_CRAP flag for use in tainting kernels that load these, well, crappy drivers. There has been an ongoing struggle between those who want to see drivers get included as quickly as possible versus those who want to see them approach or attain normal kernel quality levels first. Kroah-Hartman started the -staging tree last June as a way to increase the visibility, thus testing and bug fixing, of out-of-tree drivers. Because drivers in that tree have been steadily improving—to the point where several have graduated to the mainline—the belief is that moving -staging itself into the mainline kernel will result in even faster progress. 
So, Kroah-Hartman has introduced a new directory (drivers/staging) to hold these drivers, as well as a mechanism to automatically taint the kernel if any of them get loaded. That will warn users when loading the module—at least if they check their logs—and include that info in any oops message that kernel might produce. Kernel hackers can then filter out problems depending on what the taint is—problems in kernels tainted with binary-only drivers are generally actively ignored. Getting those drivers into the mainline, though, will make it much easier for folks who want to test them. In addition, clean-ups and fixes for the drivers will go in as mainline patches, raising the visibility of the developers working on them. The change should have very minimal impact on other kernel users and developers. In particular, developers will not have to worry about reflecting API changes into drivers/staging as Kroah-Hartman will keep them up-to-date. The main complaint about the proposal has been that it duplicates the functionality or intent of the EXPERIMENTAL flag. There was also some belief that tainting the kernel was unduly harsh, but as Kroah-Hartman points out: "It isn't costing anything, and if a developer doesn't want to debug the kernel if such a driver is loaded, this allows them to do this." As part of the thread, Paul Mundt explains why EXPERIMENTAL has no meaning in the kernel today: EXPERIMENTAL today is pretty damn meaningless. What it tends to mean in practice is that somethings needs some more testing, someone wants to be able to pull out the EXPERIMENTAL card when someone enables their option and their kernel blows up, the option/feature hasn't been around in the kernel for that long, or someone has just been too lazy to remove the flag (this last one probably covers about 90% of in-tree cases today). Stuff that is actively broken (in case of your kernel blowing up, not building, etc.) tends to be shoved under BROKEN instead. 
Mundt goes on to show that almost all of the default configurations enable CONFIG_EXPERIMENTAL, further reducing its meaning. It would be nice to audit all of the uses and restore the meaning of the flag, but that is beyond the scope of what Kroah-Hartman has set out to do. There still would be a difference, though, even if EXPERIMENTAL were meaningful. Mundt continues: The other key difference is that even with experimental stuff in the kernel, you will still get support, so it's not really a taintable offense. Stuff in staging/ on the other hand while potentially not actively hostile against the rest of the system, is still very much an unknown, and therefore the only safe thing to do is to taint the system and allow individual developers to make a choice regarding whether any resulting oopses are worth looking at or not. There are still some who are concerned about adding less-than-kernel-quality code. Randy Dunlap puts it this way: "I think that we have enough quality problems without adding crap." But Linus Torvalds has always been solidly in the "merge early" camp, so this proposal seems likely to go in for 2.6.28. Besides, as Stefan Richter notes: OTOH many if not most of the -staging drivers are ones which are already in use. Their users already deal with whatever quality problems these drivers have, in addition to having to fight with the installation hassles that are inherent to out-of-tree drivers. In a fairly short span of time, merging drivers into the mainline has gotten a whole lot easier. At one time, developers might have to work on a driver for several development cycles before it reached a quality level that would allow it to be merged. In the interim, the -staging tree made things easier and more visible for testers and developers; soon that visibility will rise substantially again. LAME ain't lame no more LAME (Lame Ain't an MP3 Encoder) is a long-running open-source MP3 encoder project.
From the About LAME document: "...LAME is the source code for a fully LGPL'd MP3 encoder, with speed and quality to rival and often surpass all commercial competitors. LAME is an educational tool to be used for learning about MP3 encoding. The goal of the LAME project is to use the open source model to improve the psycho acoustics, noise shaping and speed of MP3. LAME is not for everyone - it is distributed as source code only and requires the ability to use a C compiler. However, many popular ripping and encoding programs include the LAME encoding engine..." The LAME project has announced the first release in several years: "After rough[ly] two years of development, the LAME project has released a new version (3.98.2) of the best-known Open Source MP3 encoder. All users are encouraged to use it, see new improvements regarding the previous releases and send feedback for the project." LAME has a long and interesting development history. From the LAME home page: "LAME development started around mid-1998. Mike Cheng started it as a patch against the 8hz-MP3 encoder sources. After some quality concerns raised by others, he decided to start from scratch based on the dist10 sources. His goal was only to speed up the dist10 sources, and leave its quality untouched. That branch (a patch against the reference sources) became Lame 2.0, and only on Lame 3.81 did we replaced of all dist10 code, making LAME no more only a patch. The project quickly became a team project. Mike Cheng eventually left leadership and started working on tooLame, an MP2 encoder. Mark Taylor became leader and started pursuing increased quality in addition to better speed. He can be considered the initiator of the LAME project in its current form. He released version 3.0 featuring gpsycho, a new psychoacoustic model he developed. In early 2003 Mark left project leadership, and since then the project has been lead through the cooperation of the active developers (currently 4 individuals)." 
Numerous additional developers have contributed to the project. The slightly out-of-date project version history documents the changes to the code since September 1998. Improvements added to version 3.98 (started in May 2007) include: Numerous bug fixes were implemented. A lot of code cleanup was done. Support was added for newer versions of various libraries. Many build system improvements were done. The RPM specification was updated. Numerous changes were made to the lame front end switches. New VBR code, derived from the NSPSY psymodel, was added. There were changes to the new VBR psymodel. The out of bits strategy for the newer VBR code was overhauled. PCM WAVE_FORMAT_EXTENSIBLE support was added. Support for ID3v2 total track count was added. ID3v2 TLEN support was added. The ATH adjustment was improved for low volume cases. A new SSE version of the FFT code was used. A flush option was added for flushing the output stream in lame.exe. The FFTSSE and FFT3DNOW assembler code was backported from the Lame4 branch. Building the newest version of LAME on an Ubuntu 8.04.1 LTS (Hardy Heron) i386 system was straightforward. An older Ubuntu package of LAME was first removed from the system using the Synaptic package manager. The LAME version 3.98.2 source code was downloaded, unzipped and untarred. The configure script was run; no missing dependencies were found. The usual make and make install steps were done. A few test case .wav files were encoded with the command lame file.wav file.mp3 and the files were played with the SoX play command as well as the closed-source RealPlayer application. Everything worked as expected, and sounded as good as one can expect for an MP3 file. Overall, the latest changes to LAME fall into the category of maintenance or the addition of mostly user-transparent features. It is good news that this important piece of software is going into another phase of active development.
The CME Group sees a future with the Linux Foundation The Linux Foundation has another new organization on the membership roster this week. The CME Group announced it has joined the nonprofit organization, and its associate director, Vinod Kutty, will chair the Foundation's End User Council. The CME Group is made up of three derivatives, or futures, exchanges: the Chicago Board of Trade, and the New York and Chicago Mercantile Exchanges. Linux has played a major part in the financial services industry for many years, and representatives of the CME Group say it's time to become more involved in the evolution of open source technology. In a prepared statement Kevin Kometer, Managing Director and Chief Information Officer of CME Group, says, "Our Linux Foundation membership allows us to move beyond just being users of Linux to being participants in the direction of this important technology. Joining the Linux Foundation and being deeply involved in Linux will also help the exchange determine the future use of our own technology." Practically speaking, the move will increase the Group's input into the development of software for the financial industry, thereby giving them a boost in a very competitive global marketplace. Kutty explains, "By most accounts, derivatives exchanges around the world do not compete with one another. Unlike the securities markets that compete for listings, the majority of derivatives products are created with intellectual capital or they are licensed products. Our main competition comes in the form of the over-the-counter (OTC) marketplace where 80% of the world's derivatives trade; only 20% of derivatives globally trade on an exchange. The OTC products often are similar or lookalike products to what an exchange would trade." That competitive threat is a chief reason the CME Group chose to join the Linux Foundation.
"We're excited to see CME join, but not surprised at its intent," says Amanda McPherson, the Linux Foundation's VP of marketing and developer programs. "CME realizes that direct collaboration with the Linux community gives them a competitive advantage. They have bet their business on Linux to very good effect. We're seeing the innovators and leaders understand that to get the most of Linux it's important to collaborate with the community directly. Through our end user council and the yearly Collaboration Summits, companies like CME can collaborate closely with the brightest minds in Linux." While it's unusual for large financial exchanges to sit down with kernel developers, it's not unheard of. Head Bubba, IT manager for international financial services group, Credit Suisse, was part of a panel that met with developers at last year's Kernel Summit to talk about the challenges companies face when using Linux. Kutty will be picking up where Bubba left off. After attending this year's Kernel Summit, Kutty is slated to speak on behalf of the CME Group at October's Linux Foundation's End User Summit in New York, where he'll be talking about how the exchange has deployed Linux and where he hopes to see it go in the future. Historically, financial transactions have taken place on an exchange's trading floor in a process known as "open outcry." This method is increasingly being replaced by electronic trading, however, and the financial industry appears to be ready to embrace open source technology in the process. McPherson says, "the NYSE and most bank's trading systems are based on Linux. We're entering a third phase of adoption by financial services and Linux. At first it was just small, skunk works projects. Then it moved into broad-based adoption through vendors. Now we're seeing companies getting the most out of their investment by partnering directly with the community." 
As a means to that end, Kutty will work with members of the End User Council, Linux vendors, and also leaders within the Linux community to collaborate on technical and legal issues that affect FOSS. The CME Group has relied on Linux since 2003 and though it employs a variety of commercial and open source tools, Linux remains the dominant technology in use today. Kutty describes what they hope to accomplish: The open source solutions tend to address some niches at the web tier as well as scripting tools, performance monitoring tools, log file analysis, development tools and simple document/content management. Additionally, many of the GNU tools that are bundled with our Linux distribution are taken for granted as being available for use on any system we deploy, typically by our sysadmins as part of day-to-day operations. Some pre-date our migration to Linux because it was and is possible to use GNU tools on commercial UNIX. As open source alternatives to commercial products mature, we evaluate them and select them if they make sense. We're trying to play a more active role in the evolution of these products higher up the stack than the OS, but our initial priority is to focus on Linux improvements. Given the current state of the economy in the US, any small advantage for the financial industry is welcome. McPherson says Linux and open source technology can certainly help play a role in fixing what's broken. "The great thing about Linux is it's open and gives customers a great deal of flexibility in working with their vendors. It runs on multiple architectures and you can get support from various vendors (or not pay for support at all). This will become more and more appealing in our current economic environment. But given the collaborative development model, Linux thrives in any economic environment because of the choice it provides."
The state of the e1000e bug Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with this comment: "This one should be the last one: we're certainly not running out of regressions, but at the same time, at some point I just have to pick some point, and on the whole the regressions don't look _too_ scary." This assertion raised a few eyebrows among those who are nervously watching the e1000e corruption bug. While the development community disagrees on all kinds of issues, there is a reasonably strong consensus that hardware-destroying bugs can be seen as "scary." Given that, it would be nice to say that this particular regression has been tracked down and fixed, but that is not the case. As of this writing, nobody knows what is causing systems with 2.6.27-rc kernels to occasionally overwrite the EEPROM on e1000e network adapters. The progress which has been made, while discouragingly small, does narrow down the problem a bit: There was an early hypothesis that the GEM graphical memory manager code might be responsible for the problem. There have been reports of corruption on distributions which do not package GEM, though, so GEM is no longer a suspect. For similar reasons, the idea that the page attribute table (PAT) work could somehow be responsible has been discarded. There has been a strong correlation between corrupted hardware and the presence of Intel graphics hardware. That has led to a lot of speculation that the X.org Intel driver may somehow be doing the actual corruption, though a separate bug in the e1000e driver may be enabling that to happen. But there is now a report of corruption with a system running NVIDIA graphics. If that report is truly the same problem, then the X.org hypothesis will be substantially weakened.
(As an aside, it's worth pondering what would have happened if NVIDIA users had reported the problem first; the temptation to blame the proprietary NVIDIA driver could have been strong enough to delay action on the bug for some time). So the signs point toward a problem localized within the e1000e driver, but it is too early to draw that conclusion. This bug remains mysterious, and it could turn out to have surprising origins. The nature of this bug makes it harder than usual to track down. It seems to be dependent on some sort of race condition, so it is hard to reproduce. Worse, the way in which the bug makes itself known has the effect of greatly reducing the number of testers trying to reproduce it. People who can avoid that combination of software are doing so, and distributors shipping development kernels have disabled the e1000e driver. Dave Airlie's approach, "But I'm leaving this up to Intel, I don't think HP will take it too kindly if I keep returning my laptop," must be fairly typical. One gets the sense that a fairly hot fire has been ignited underneath a number of posteriors at Intel; its developers are active in the discussion and clearly want to get this one solved. One objective has been the creation of a utility which would return corrupted hardware to a functioning state, but that tool has been slow in coming. Restoring trashed e1000e adapters appears to be a hard problem, but this is one that Intel has to get right. If more testers are to be encouraged to risk corruption with the idea that the recovery tool will fix them up again, that tool needs to actually work when the time comes. So it is hard to blame Intel for taking the time to ensure that the recovery tool will do its job, but, in the meantime, its absence is making testing harder. Frans Pop raised an interesting long-term concern: even if this bug is fixed tomorrow, it will be present in most of the 2.6.27 history.
Anybody bisecting the kernel in an attempt to track down an unrelated bug risks being bitten by a zombie version of the e1000e bug. There may be no way to deal with that threat other than the posting of some big warnings. Rewriting the bug out of the mainline repository's history is possible with git, but it would create disruption for everybody working from a clone of the repository. Meanwhile, there could be some interesting consequences if the resolution of this problem takes much more time. It is hard to imagine that the 2.6.27 kernel could be released with a regression of this magnitude; let us say that the reaction in the mainstream press would not be kind. A 2.6.27 delay could force delays in a number of upcoming distribution releases. This kind of cascading delay would not look good; it would, instead, be reminiscent of the troubles encountered by certain proprietary software companies. That said, the system is clearly working. Testers found the problem before the code was released in anything resembling a stable form. Developers are now chasing after the bug as quickly as they can. There will be no stable kernel or distribution releases which corrupt hardware. This situation is a pain, but it will soon be resolved and forgotten. ParanoidLinux: from fiction to reality A novel for young adults by Cory Doctorow has inspired the creation of a new Linux distribution focused on privacy. ParanoidLinux is still in the planning stages, but it adopts some interesting ideas from Doctorow's book to place atop a Debian Testing base. It is targeted at those who have a very strict need to disguise their documents and network traffic because of a repressive regime. Doctorow is familiar to many in the free software world, for his work as a science fiction author as well as a digital rights activist and blogger. His recent novel, Little Brother, is set in the US after another devastating terrorist attack.
Because of the attack, most civil liberties have been suspended, leading some characters to use an alternative operating system: ParanoidLinux is an operating system that assumes that its operator is under assault from the government (it was intended for use by Chinese and Syrian dissidents), and it does everything it can to keep your communications and documents a secret. It even throws up a bunch of "chaff" communications that are supposed to disguise the fact that you're doing anything covert. So while you're receiving a political message one character at a time, ParanoidLinux is pretending to surf the Web and fill in questionnaires and flirt in chat-rooms. Meanwhile, one in every five hundred characters you receive is your real message, a needle buried in a huge haystack. It is that description, along with others in the book, that is guiding the development of the "real" ParanoidLinux. While it is relatively easy to come up with a fictional privacy-oriented operating system, the reality of building one is rather challenging. The project has only existed since May, so the current focus is to get some kind of alpha system put together as a starting point. The idea of "chaff" is one that has been taken up on the ParanoidLinux wiki. There are several facets to the problem: how does one generate normal-looking traffic while somehow transferring encrypted data as part of that traffic? There are existing techniques that could be used. Chaff combines the ideas of steganography (hiding even the existence of a message) with cryptographic techniques. The discussion about chaff makes it clear that the ParanoidLinux developers are looking at Doctorow's ideas carefully before implementing them. Chaff is certainly not a panacea, as it won't hide the traffic from an adversary that has specifically targeted someone. It is, instead, a means to fly under the radar, to appear to be a "normal" internet user with standard traffic patterns. Using Tor (i.e.
The Onion Router) is one way to anonymously use the internet, within limits, but traffic bound for a Tor node would be very suspicious to any monitoring agency. Another privacy-enhancing feature would be full-disk encryption, but that would be yet another red flag for an agency that was inspecting the computer. These are the kinds of trade-offs that are being discussed by the project as it tries to narrow its focus to something that can be implemented in the near term. Hiding, or at least obfuscating, the existence of ParanoidLinux on the computer is another piece of the puzzle. It could be very dangerous to be required by the authorities to boot one's ParanoidLinux laptop. But, if it appears to be a "regular" system, perhaps looking much like Windows, it may escape scrutiny. Encrypted data might then be stored on partitions that are not directly accessible from the desktop. This is an interesting project for those who worry about government crackdowns or perhaps already live under a repressive regime. Even if the ParanoidLinux distribution does not meet one's needs, the various discussions on options and different ways to approach a privacy-oriented operating system will be useful. One hopes not to ever need such a system, but knowing that people are thinking about the problem, while working toward a usable version, is certainly reassuring. For that, we can thank Doctorow for popularizing the idea. Some views from Vision Your editor had the honor of speaking at MontaVista's Vision 2008 conference recently. This conference - a gathering of MontaVista's customers - provided an opportunity to observe how (part of) the embedded industry sees itself and its role in the larger Linux community. Relations between embedded systems and Linux as a whole have often been a little uneasy, a situation which probably will not change in the near future.
That said, there are signs that embedded developers are starting to think about the value of engaging more directly with the development community that they depend on. William Mills is the Chief Technologist for Open Linux Solutions at Texas Instruments; his brief presentation at Vision was an interesting demonstration of how attitudes in the industry are changing. According to Mr. Mills, TI's method for developing Linux drivers for its products involved doing the work behind closed doors, then distributing the result through MontaVista. That approach has changed, though. TI now does its driver work in a public git tree, with a focus on merging the code upstream as a first priority. Customers who want to work directly with upstream kernels can get the code directly. In a sense, it would appear that TI has removed MontaVista as the intermediary which distributes drivers for TI hardware. But TI still distributes code through MontaVista, so customers looking for a supported, integrated offering can still get a distribution which suits their needs. There's no shortage of embedded systems vendors who lack the skills and the desire to support a Linux distribution themselves; for those vendors, buying a supported system makes a lot of sense. For everybody else, the software is free and part of the mainline kernel, as it should be. MontaVista founder Jim Ready discussed "the state of embedded Linux," focusing on areas where there is a bit of a mismatch between what the Linux community is providing and what the embedded industry needs. Certain kinds of functionality are missing; the ability to do user-space interrupt synchronization was one example. The rate of change in the kernel is very high, presenting embedded vendors with the difficult choice of backporting fixes or upgrading to a more recent kernel. Tracing and profiling tools are not up to the level needed by the industry. 
Jim also talked some about realtime functionality, which currently must be patched into the kernel separately. He complained that changes made to the mainline kernel often break the realtime patch sets, leaving developers scrambling to make things work again. Keeping these patches in a working state requires constant effort; it is a significant cost. All of this may sound like whining from an industry which has earned a reputation for taking more from Linux than it is willing to put back in. But Jim put the blame directly on the embedded industry itself; embedded vendors, he says, still haven't quite gotten it. While taking some pride in MontaVista's position in the list of top contributors to the kernel, he suggested that MontaVista should be enjoying the company of more embedded systems firms. The embedded industry should be contributing more to the kernel than it is. What it comes down to, says Jim, is that the center of gravity in the Linux development world can be found in enterprise computing. Vendors in that industry are contributing heavily to the kernel and, as a result, the kernel tends to fit their needs better. The embedded community needs to get together and figure out how it, too, can become a more prominent contributor and work to drive the kernel in directions which suit its needs. Judging from the response in the room, many of those in the audience seem to agree with this point of view. Some see it differently, though. During your editor's talk, a member of the audience asked whether the embedded community should stop using a kernel developed by enterprise system vendors and, instead, make its own version of the kernel suited to its needs. Needless to say, your editor discouraged this approach; the cost of forking the kernel and fragmenting the development community would vastly exceed the value of any benefits gained. But the questioner seemed unconvinced. 
The clear conclusion to be drawn from that exchange is that there are still people in the embedded industry who do not see the value of working with the larger Linux development community. It is easy to fault the embedded community for its failure to contribute back, but it also makes sense to look in the mirror and ask if we couldn't make a more persuasive case for joining in. There has been a sustained effort to encourage the embedded systems industry to become a full participant in our community; over the years, that work has yielded a steady stream of successes. By continuing and improving this work, we'll continue the process of bringing our community together. Then we'll truly have a single system that runs on everything from wrist watches to supercomputers.

Moving interrupts to threads

Processing interrupts from the hardware is a major source of latency in the kernel, because other interrupts are blocked while that processing is done. For this reason, the realtime tree has a feature, called threaded interrupt handlers, that seeks to reduce the time spent with interrupts disabled to a bare minimum, pushing the rest of the processing out into kernel threads. But it is not just realtime kernels that are interested in lower latencies, so threaded handlers are being proposed for addition to the mainline.

Reducing latency in the kernel is one of the benefits, but there are other advantages as well. The biggest is probably reducing complexity by simplifying or avoiding locking between the "hard" and "soft" parts of interrupt handling. Threaded handlers will also help the debuggability of the kernel and may eventually lead to the removal of tasklets from Linux. For these reasons, and a few others as well, Thomas Gleixner has posted a set of patches and a "request for comments" to add threaded interrupt handlers.

Traditionally, interrupt handling has been done with a top half (i.e.
the "hard" irq) that actually responds to the hardware interrupt and a bottom half (or "soft" irq) that is scheduled by the top half to do additional processing. The top half executes with interrupts disabled, so it is imperative that it do as little as possible to keep the system responsive. Threaded interrupt handlers reduce that work even further, so the top half would consist of a "quick check handler" that just ensures the interrupt is from the device; if so, it simply acknowledges the interrupt to the hardware and tells the kernel to wake the interrupt handler thread.

In the realtime tree, nearly all drivers were mass converted to use threads, but the patch Gleixner proposes makes it optional: driver maintainers can switch if they wish to. Automatically converting drivers is not necessarily popular with all maintainers, but it has an additional downside, as Gleixner notes: "Converting an interrupt to threaded makes only sense when the handler code takes advantage of it by integrating tasklet/softirq functionality and simplifying the locking."

A driver that wishes to request a threaded interrupt handler will call request_threaded_irq(). This is essentially the same as request_irq() with the addition of the quick_check_handler. As requested by Linus Torvalds at this year's Kernel Summit, a new function was introduced rather than changing countless drivers to use a new request_irq().

The quick_check_handler checks to see if the interrupt was from the device, returning IRQ_NONE if it isn't. It can also return IRQ_HANDLED if no further processing is required or IRQ_WAKE_THREAD to wake the handler thread. One other return code was added to simplify converting to a threaded handler. A quick_check_handler can be developed prior to the handler being converted; in that case, it returns IRQ_NEEDS_HANDLING (instead of IRQ_WAKE_THREAD) which will call the handler in the usual way. request_threaded_irq() will create a thread for the interrupt and put a pointer to it in the struct irqaction.
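The way the four return codes steer an incoming interrupt can be modeled in a few lines. What follows is only a toy simulation in Python, not kernel code: the four names come from the description above, but their numeric values, the dispatch() and run_thread() helpers, the dictionary standing in for a device, and the list standing in for the handler thread's wakeup are all assumptions made for illustration.

```python
from enum import Enum


class IrqReturn(Enum):
    """Return codes named in the article; the numeric values are arbitrary."""
    IRQ_NONE = 0            # interrupt did not come from this device
    IRQ_HANDLED = 1         # fully dealt with by the quick check
    IRQ_WAKE_THREAD = 2     # defer the real work to the handler thread
    IRQ_NEEDS_HANDLING = 3  # unconverted driver: call the handler directly


def dispatch(quick_check, handler, dev, thread_queue):
    """Model what happens when an interrupt arrives (hypothetical helper).

    thread_queue stands in for waking the per-interrupt kernel thread.
    """
    result = quick_check(dev)
    if result is IrqReturn.IRQ_WAKE_THREAD:
        thread_queue.append((handler, dev))  # handler runs later, in the thread
    elif result is IrqReturn.IRQ_NEEDS_HANDLING:
        handler(dev)                         # old style: run immediately
    return result


def quick_check(dev):
    """Check whether the device raised the interrupt and acknowledge it."""
    if not dev["pending"]:
        return IrqReturn.IRQ_NONE
    dev["pending"] = False                   # acknowledge to the "hardware"
    return IrqReturn.IRQ_WAKE_THREAD


def handler(dev):
    """The slow part of interrupt processing."""
    dev["handled"] += 1


def run_thread(thread_queue):
    """Drain deferred work, as the handler thread would once woken."""
    while thread_queue:
        fn, dev = thread_queue.pop(0)
        fn(dev)


queue = []
device = {"pending": True, "handled": 0}
first = dispatch(quick_check, handler, device, queue)   # device interrupted
second = dispatch(quick_check, handler, device, queue)  # not ours this time
run_thread(queue)
print(first.name, second.name, device["handled"])
```

In this model the quick check does nothing beyond the ownership test and the acknowledgment; everything else waits for run_thread(), which mirrors the goal of keeping the interrupts-disabled portion as short as possible.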
In addition, a pointer to the struct irqaction has been added to the task_struct so that handlers can check the action flags for newly arrived interrupts. That reference is also used to prevent thread crashes from causing an oops. One of the few complaints seen so far about the proposal was a concern about wasting four or eight bytes in each task_struct that was not an interrupt handler (i.e. the vast majority). That structure could be split into two types, one for the kernel and one for user space, but it is unclear whether that will be necessary. Andi Kleen has a more general concern that threaded interrupt handlers will lead to bad code: "to be honest my opinion is that it will encourage badly written interrupt code longer term," but he seems to be in the minority. There were relatively few comments, but most seemed in favor—perhaps many are waiting to see the converted driver as Gleixner promises to deliver "real soon". If major obstacles don't materialize, one would guess the linux-next tree would be a logical next step, possibly followed by mainline merging for 2.6.29. Some development statistics for 2.6.27 It's that time of the development cycle again: the 2.6.27 kernel, if not yet released by the time you read this, will be shortly. Various other LWN articles have looked at features found in this release; here we will look at where that code came from. As of 2.6.27-rc9, a total of 10,604 non-merge changesets had been added to the mainline for the 2.6.27 kernel; those patches added a total of 826,000 lines of code while removing 608,000, for a net growth of 217,000 lines. There were 1,109 developers who contributed to 2.6.27, representing over 150 employers. 376 of those developers contributed a single patch during this development cycle. 
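A quick check of the arithmetic behind those figures may be useful. The sketch below uses only the numbers quoted above; the variable names are made up for illustration. Subtracting the rounded line counts gives a net figure slightly different from the one quoted, presumably because the quoted net growth was computed from unrounded counts.

```python
# Figures quoted in the text for 2.6.27 (as of -rc9); line counts are rounded.
changesets = 10604
lines_added = 826000
lines_removed = 608000
developers = 1109
single_patch_developers = 376

# Net growth implied by the rounded line counts.
net_growth = lines_added - lines_removed

# Share of developers who contributed exactly one patch, in percent.
single_patch_share = 100.0 * single_patch_developers / developers

# Average number of changesets per developer.
changesets_per_developer = float(changesets) / developers

print(net_growth)
print(round(single_patch_share, 1))
print(round(changesets_per_developer, 1))
```

Roughly a third of the developers in this cycle contributed a single patch, while the average works out to between nine and ten changesets per developer.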
The most active developers for 2.6.27 were: On the changeset side, Ingo Molnar ended up on top by virtue of the creation of large numbers of mostly x86-related changes, including a big subarchitecture reorganization; Ingo's count also includes the addition of ftrace, though much of that code was written by others. Bartlomiej Zolnierkiewicz continues to rework the old IDE layer, and Adrian Bunk, as always, energetically cleans up code all over the tree. David Miller's total includes the multiqueue networking code and a lot of other changes; Alan Cox did a lot of TTY work and big kernel lock removal. Your editor was disappointed to come in at #23, and, thus, off the bottom of the table. Time to send in some quick white space fixes. More seriously, though, it's worth noting that there are relatively few patches of the "trivial change" variety in the mix this time around. If we look at changed lines, Paul Mackerras comes out on top as the result of a single patch removing the obsolete ppc architecture. David Woodhouse reworked the management of firmware throughout the driver tree. Jean-François Moine brought the GSPCA webcam drivers into the tree, then put vast amounts of effort into cleaning them up. Artem Bityutskiy added the UBIFS flash filesystem, and Luis Rodriguez merged the ath9k wireless driver. If we look at the companies behind this work, we get the following results (note that, as always, these results are somewhat approximate): There are not too many surprises in this table - in particular, the list of companies at the top tends not to change very much. That said, a few things are worthy of note. One is that Sun Microsystems has made its first appearance on this list. People complain about this company, but Sun's engineers have been quietly fixing things all over the tree. Broadcom is another company with a mixed reputation in the Linux community, but Broadcom is happy to provide support for some of its network adapters. 
Nokia's strong showing in the lines-changed table results primarily from the contribution of the UBIFS filesystem. The most welcome change, though, is the first appearance of Atheros on this list. Atheros is a company which has quickly moved from a position of complete non-cooperation to one of supporting all of its hardware in the mainline kernel. To say that this is an encouraging development would be an understatement. All told, the 2.6.27 development cycle shows that the process continues at full pace in a seemingly healthy state. Developers from all over the industry are all working together to make the kernel better for all. The number of companies which see participation in the process as being in their interest is growing, as is the number of developers who contribute patches. The Linux kernel, it seems, is in good shape. Accessibility in Linux systems The Linux kernel recently saw the addition of a "basic Braille screen reader", and thus, the addition of a drivers/accessibility subdirectory and its corresponding CONFIG_ACCESSIBILITY option. It is worth noting that one of the first reactions was "what the heck is accessibility?" This shows how the idea is still quite unknown to developers. And yet the issue of GNU/Linux accessibility, i.e. the usability of GNU/Linux by disabled people (e.g. blind people) is, of course, not new. Work in that area has been conducted for a long time: the speakup speech screen reader saw its 0.07 version against Linux 2.2.7 in 1999, and the brltty Braille screen reader started in 1995. The basic Braille screen reader that has just been added to the Linux kernel is just the emerging part of that work which has been around since then. With the popularization of GNU/Linux among non-technical people, there has been renewed interest in mainline accessibility support: the GNOME desktop, OpenOffice.org and Firefox 3 can now be rendered via Braille and speech synthesis thanks to the AT-SPI framework and the Orca screen reader. 
KDE will soon follow when these technologies get rebased on D-BUS. In addition, accessibility menus have started appearing in the upstream distributions. One of the main concerns for disabled people used to be the lack of Javascript support in text-mode web browsers and the lack of office suite support. With more and more companies and governments migrating to Linux, particularly since some states require accessibility of tools used in government, renewed development effort was becoming more and more of a must. In Massachusetts, people had even signed a petition against the migration to libre software because it was not yet accessible at the time!

What is Accessibility?

Accessibility, sometimes abbreviated a11y, means making software usable by disabled people. That includes blind people of course, but also people who have low vision, are deaf or colorblind, have only one hand, or can move only a few fingers or even only their eyes. It also includes people with (even light) cognitive troubles or who are just not familiar with the language. Last but not least, it includes elderly people, who often have a bit of all these disabilities. Yes, that actually means everybody is concerned, eventually. That means support for special devices, but also general care during development, like not assuming that an audible alarm will be heard or a transient message will be read.

Maybe one of the most obvious accessibility techniques is speech synthesis, which turns text into audio that can be sent to speakers or headphones. There used to be hardware speech synthesizers (supported by the speakup drivers), but these have often been replaced by software speech synthesis. While the quality of commercial software speech synthesis is very good these days, the quality of free software varies a lot. While there is very good libre English speech synthesis, the support of other languages is quite uneven.
For instance, the Festival and eSpeak libre engines easily support a wide range of languages, but their sound is rather robotic. There are better phoneme libraries like mbrola, but they are often not completely libre. To better handle all these potential speech synthesis backends, the speech dispatcher daemon takes care of automatically choosing the appropriate synthesis according to the desired language and style. Another very popular kind of device is Braille terminals. These "show" text by raising and lowering little pins which thus form Braille patterns. Because their cost is very high, a Braille terminal often has room for only 40 characters or even 20 or 12. They integrate keys to navigate around the screen, so the user ends up reading it piece by piece. Compared to speech synthesis, the reading accuracy is far better, but not everybody can read Braille, and the cost remains very high (on the order of $5,000). The support of the various existing devices is very good: both the brltty and suseblinux screen readers support a very wide range of devices. Blind people will actually often use a combination of speech synthesis and Braille devices. As for other kinds of disabilities, the kind of devices varies a lot. It ranges from joysticks (natively supported by X.org) to eye-tracking systems (managed by dasher), via press button (supported by the GNOME Onscreen Keyboard) or mere screen magnification (implemented by gnome-mag). Everyday Use The eternal Command Line Interface vs Graphical User Interface flamewar actually also holds for people using a Braille terminal or speech synthesis. The contrast is perhaps even exacerbated by the inherent difficulties of performing anything with a computer when being disabled. The old traditional way of using a GNU/Linux system, the text console, has been working well with Braille devices and speech synthesis for a long time. The principle is indeed quite simple: there are 25 lines of 80 characters and text appears sequentially. 
Screen readers for Braille terminals would thus just automatically display what was last written and permit the user to navigate among these 25 lines. Screen readers for speech synthesis (e.g. speakup or yasr) would speak text as it appears on the screen, and have some review facilities similar to what Braille screen readers have. This works quite well because applications are limited to the TTY interface; they cannot have non-accessible fancy features such as graphical buttons. Some applications may still not be so easy to read, e.g. if they draw ASCII art or use colors to show active buttons, but they often have options to make them more accessible. A collection of tips can be found on this wiki.

Accessibility of graphical desktops is, on the other hand, quite a recent matter, in part because the issue is technically much less simple: while applications on the text console are limited to producing text, these days graphical applications usually render text as bitmaps themselves, so the textual information is not available to screen readers outside of the application. There have been application adaptation attempts in the past (like ultrasonix), but they never really became popular. The GNOME project has been developing AT-SPI (Assistive Technology Service Provider Interface) for the past decade, and that has become really promising with the advent of the Orca screen reader.

AT-SPI can be understood as a protocol between screen readers (e.g. Orca) and applications. To be "accessible", applications thus have to implement AT-SPI, or use a toolkit that implements it (like GTK and soon Qt), so that screen readers can get the logical and textual content of the application. Orca is not yet as good as what mature, proprietary Windows screen readers can achieve, but it is already usable for everyday work. It is progressing rapidly, notably thanks to the support of Sun and the involvement of the Accessibility Free Software Group.
At the time of writing, only gtk+ 2 (and thus the GNOME desktop and gtk+ 2 applications), Java/Swing, the Mozilla suite, OpenOffice.org, and acrobat reader implement AT-SPI and thus are accessible. Qt (and thus the KDE desktop) is expected to support it once it gets rebased on D-BUS. To get the best results, the latest versions of applications should be used: for instance, Firefox is really usable only starting from version 3. Another approach is the use of self-reading applications. For instance, Firevox is a version of Firefox that integrates a dedicated screen reader. That permits a tighter interaction between the reader and the application, but that is of course limited to that particular application. Another example is emacspeak, which is a vocalized version of emacs. Some people simply just use emacspeak and nothing else, as emacs already meets all their needs. All in all, as usual the mileage varies. Some people will be very happy with the mature, efficient screen reading of the text console, while other people will consider that as a regression (like going back to DOS) and prefer using intuitive environments such as the GNOME desktop, even if the Orca screen reader is still quite young. It is actually quite common to use both: for instance the text console for the usual work, and the graphical environment for tasks that require it, like browsing Javascript-powered websites or manipulating OpenOffice documents. Upstream Integration Now, how can all of that be installed? Most distributions already provide most of the useful packages, but they often lack documentation on which tools are useful according to the various disabilities. The Linux Accessibility Resource Site is a quite complete source of information on the various tools that one could use. There is also a wiki page meant for administrators to get started with accessibility needs. 
A point worth noting, however, is that some distributions have accessibility components built into their installation CDs. For instance, starting from Etch (aka Debian GNU/Linux 4.0), the Debian installer automatically detects Braille terminals and if found, switches to text mode, runs brltty, and makes sure that brltty gets installed and configured on the target system. Other distributions often have been non-officially adapted into so-called "Braillified" installation images. The very important point is that it permits disabled people to be completely independent from the help of sighted people, even when the (re)installation of a system has to be done! That is clearly one area in which Windows is far behind GNU/Linux achievements. Future Challenges To sum it up, "accessible" GNU/Linux is getting its democratization step as well, just a bit shifted in time compared to the average Linux democratization. There are, of course, things that could be improved. Even if distributions usually contain accessibility software, it is hard for accessibility-newcomers to know which software will be useful for the various kinds of disabilities users can have, so distributions will have to develop wizards to help them. In the meanwhile, websites such as the Linux Accessibility Resource Site can be used as sources of information. In any case, discussion with the disabled users is essential to establish a suitable solution (setting up Braille output would be useless if the user can not read Braille for instance). Beyond the mere use of GNU/Linux or its installation, one area that still is not really accessible at all is the early stages of the boot process. With future development of the recently added basic Braille screen reader, the Linux kernel should eventually be able to provide basic feedback even before user space screen reader daemons can be started from the hard disk. 
Bootloaders like lilo and grub are able to emit basic beeps, but being able to accurately edit the kernel command line, for example, would require some support. Last but not least, tinkering with BIOS settings is currently possible for disabled people only on high-end machines that can drive a serial console. The democratization of the EFI platform could be an opportunity to embed basic screen reading functionalities.

[Samuel Thibault has been working on accessibility since 2002, when he and a blind colleague designed the BrlAPI client/server Braille output engine, now used by Orca for Braille support. Since then he has worked on various accessibility tasks, from the Debian installer support to Braille standardization. In his professional life, he conducted a PhD on thread scheduling on high-end machines, and is now a lecturer at the University of Bordeaux.]

Python 2.6 makes its debut

Version 2.6 of the Python language was announced on October 2, 2008. A.M. Kuchling's extensive What’s New in Python 2.6 document covers the main goal of this release: "The major theme of Python 2.6 is preparing the migration path to Python 3.0, a major redesign of the language. Whenever possible, Python 2.6 incorporates new features and syntax from 3.0 while remaining compatible with existing code by not removing older features or syntax. When it’s not possible to do that, Python 2.6 tries to do what it can, adding compatibility functions in a future_builtins module and a -3 switch to warn about usages that will become unsupported in 3.0."

Python 2.6 marks some changes in the language's development process: "While 2.6 was being developed, the Python development process underwent two significant changes: we switched from SourceForge’s issue tracker to a customized Roundup installation..." Python 2.6 also included a switch to the reStructuredText documentation format via the Sphinx Python documentation generator. A.M.
Kuchling explains the reason for the move: "The Python documentation was written using LaTeX since the project started around 1989. In the 1980s and early 1990s, most documentation was printed out for later study, not viewed online. LaTeX was widely used because it provided attractive printed output while remaining straightforward to write once the basic rules of the markup were learned. Today LaTeX is still used for writing publications destined for printing, but the landscape for programming tools has shifted. We no longer print out reams of documentation; instead, we browse through it online and HTML has become the most important format to support." Numerous changes have been made to the Python language and its large collection of modules. Many of these changes came through the Python Enhancement Proposal (PEP) system including: PEP 343: the "with" statement. PEP 366: main module explicit relative imports. PEP 370: per-user site-packages directory. PEP 371: addition of the multiprocessing package to the standard library. PEP 3101: advanced string formatting. PEP 3105: make print a function. PEP 3110: catching exceptions in Python 3000. PEP 3112: byte literals in Python 3000. PEP 3116: new I/O library. PEP 3118: revising the buffer protocol. PEP 3119: introducing abstract base classes. PEP 3127: integer literal support and syntax. PEP 3129: class decorators. PEP 3141: a type hierarchy for numbers. Many new modules were added and a lot of existing modules were extended in Python 2.6. The list includes: ast (abstract syntax tree), future_builtins, json (JavaScript object notation), plistlib (property list parser), ctypes, and ssl. A number of modules were deprecated in this release, including: audiodev, bgenlocations, buildtools, bundlebuilder, Canvas, compiler, dircache, dl, fpformat, gensuitemodule, ihooks, imageop, imgfile, linuxaudiodev, mhlib, mimetools, multifile, new, pure, statvfs, sunaudiodev, test.testall, and toaiff. 
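Several of the language-level items in that list are easiest to appreciate side by side. The following is a minimal sketch, not taken from the release notes, that exercises a handful of them; the names tag, register and Release are invented for illustration, and the code is written so that it should also run unchanged under Python 3.

```python
from __future__ import print_function  # PEP 3105; a no-op on Python 3

import json                            # module added in 2.6
from contextlib import contextmanager


@contextmanager
def tag(log, name):
    """PEP 343: usable in a "with" statement; the closing entry is always logged."""
    log.append("<" + name + ">")
    try:
        yield
    finally:
        log.append("</" + name + ">")


def register(cls):
    """PEP 3129: a class decorator that marks the class (hypothetical example)."""
    cls.registered = True
    return cls


@register
class Release(object):
    def __init__(self, major, minor):
        self.major, self.minor = major, minor


log = []
with tag(log, "p"):                        # PEP 343: the "with" statement
    log.append("body")

label = "Python {0}.{1}".format(2, 6)      # PEP 3101: str.format()
mask = 0b1010 | 0o17                       # PEP 3127: binary and octal literals
data = b"abc"                              # PEP 3112: byte literals

try:
    int("not a number")
except ValueError as err:                  # PEP 3110: "except ... as" syntax
    caught = type(err).__name__

encoded = json.dumps({"version": [2, 6]})  # serialize with the json module

print(log, label, mask, len(data), caught, encoded, Release.registered)
```

Each of these constructs is also valid in Python 3, which is the point of the release: code can adopt the newer forms on 2.6 before any migration takes place.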
Finally, there were many minor module changes, C API changes, optimizations, interpreter changes and platform-specific changes to Python 2.6. Python continues to be a live and evolving language; this release represents a fairly large set of changes that will pave the way forward to Python 3.

Btrfs to the mainline?

One of the kernel projects that seems to be attracting a fair amount of attention these days is the new, copy-on-write filesystem, Btrfs. While still rather immature (the disk format is slated to be finalized by the end of the year), Btrfs has reached a point where lead developer Chris Mason wants to start talking about when to merge it into the mainline. Some are advocating moving quickly, while others are a bit more skeptical that merging it will lead to faster development.

Merging Btrfs would have a number of advantages, but more eyes is what Mason is seeking:

"But, the code is very actively developed, and I believe the best way to develop Btrfs from here is to get it into the mainline kernel (with a large warning label about the disk format) and attract more extensive review of both the disk format and underlying code. The Btrfs developers are committed to making the FS work and to working well within the kernel community. I think everyone will be happier with the final result if I am able to attract eyeballs as early as possible."

Typically, kernel code is not merged until it is ready, but an argument can be made that filesystems, like device drivers, are sufficiently isolated from the rest of the kernel that an early inclusion will do little harm. Also, a kind of precedent was set by the early "merge" of ext4, though that was an evolution of the existing ext3 filesystem, while Btrfs is entirely new. Andrew Morton has been encouraging Mason to get Btrfs "into linux-next asap and merge it into 2.6.29."
He describes his reasoning:

"My thinking here is that btrfs probably has a future, and that an early merge will accelerate its development and will broaden its developer base. If it ends up failing for some reason, well, we can just delete it again. For various reasons this approach often isn't appropriate as a general policy thing, but I do think that Linux has needed a new local filesystem for some time, and btrfs might be The One, and hence is worth a bit of special-case treatment."

Adrian Bunk is not convinced that an early merge will bring the benefits that Morton is touting. He points to an early ext4 development plan, noting that the timelines outlined in that message were, perhaps, overly optimistic. "When comparing with what happened in reality it kinda disproves your 'acceleration' point." There is a difference, though, between ext4 and Btrfs, that Serge Hallyn points out:

"OTOH, maybe it's just me, but I think there is more excitement around btrfs. Myself I'm dying for snapshot support, and can't wait to try btrfs on a separate data/scratch partition (where i don't mind losing data). btrfs and nilfs - yay. Ext4? <yawn>"

That can make all the difference. The original timeline showed mid-2007 as a target for a stable ext4 filesystem, but the project overshot that by a year or so. A recent patch proposes renaming ext4dev to ext4 because it "is getting stable enough that it's time to drop the 'dev' prefix." Unexpected difficulties led to ext4 development taking longer, as Mason describes:

"Ext4 has always had to deal with the ghost of ext3. Both from a compatibility point of view and everyone's expectations of stability. I believe that most of us underestimated how difficult it would be to move ext4 forward."

Many seem to think that Btrfs is different, but it still has a ways to go. Currently, it does not handle I/O errors very well, while running out of space on the disk can be fatal. But it is getting close to usable, at least for testing and benchmarking.
Getting the code into the mainline would cause more folks to look at it, as well as test various filesystem changes against it. Mason gives an example of how that can work:

"For example, see the streaming write patches I sent to fsdevel last week. I wouldn't test against ext4 as often if I had to hunt down external repos just to get something consistent with the current development kernels. ext4 in mainline makes it much easier for me to kick the tires."

Btrfs has an aggressive schedule that targets a 1.0 release this year. The focus of that release is to nail down the on-disk format so that changes after that point will be backward compatible. Given that 2.6.29 will likely be released in early to mid-2009, it seems quite possible that Btrfs will be "merge-worthy" by then, which means that it really is not premature to start considering it now.

New Release Season

Right now there are several major distributions preparing new releases. Ubuntu, openSUSE, Mandriva and Fedora are all on semi-regular six-month schedules, releasing each spring and fall. Debian has a much longer schedule, but that project is also nearing the release of Debian 5.0 "Lenny".

Ubuntu 8.10, "Intrepid Ibex", is due for a final release on October 30, 2008. Some new features have been added since the release of Ubuntu 8.04 "Hardy Heron". Some highlights include GNOME 2.24 with tab support in the Nautilus file manager and new file types supported by File Roller. X.Org 7.4 has better support for hot-pluggable input devices such as tablets, keyboards, and mice. Ubuntu 8.10 Beta includes Linux kernel 2.6.27, a release with better hardware support and numerous bug-fixes. The ecryptfs-utils package has been included with support for a secret encrypted folder in your Home Folder. A new recovery feature retains a copy of your running kernel and makes it available from the boot loader as a "Last successful boot" option.
Network Manager 0.7 has some new features that are included in this release. There are also a few known issues with the beta release, so check the wiki before installation.

openSUSE 11.1 is currently at beta 2. Some changes since the first beta include VirtualBox 2.0.2, the disabling of the Intel e1000e driver, OpenOffice.org 3.0RC2 from the openSUSE build service, plus GNOME 2.24.0, KDE 4.1.2, Mono 2.0 RC 3, Compiz 0.7.8, and more. You can see an expanded package list for the factory tree at DistroWatch. Just scroll down to see all the packages with version numbers. You can also find out more about openSUSE 11.1 on this page, which includes links to the most annoying bugs and the roadmap, which calls for a final release on December 18, 2008.

Mandriva 2009.0 "sophie" may already be officially released, since it is due on October 9, 2008. The second release candidate wiki site lists some major new features including improved boot speed, support for LUKS encrypted partitions in the installer and diskdrake, improved support for netbook hardware, support for the Intel G41 graphics chipset, and GNOME 2.24 final. KDE4 is the default desktop for sophie. You can find out more about KDE/Mandriva integration here. The 2009.0 Development page has more information.

Fedora 10 "Cambridge" is currently scheduled for release on November 25, 2008. The accepted feature list for F10 includes an AMQP infrastructure that makes it easy to build scalable, interoperable, high-performance enterprise applications. F10 also has better printing, better remote support, faster startup, the Echo Icon Theme, Eclipse 3.4, GNOME 2.24, RPM 4.6, the Sugar desktop (used in OLPC), and much more.

Debian 5.0 "lenny" was originally scheduled for release in September. Now the release date is "when it's ready", which should be soon. We covered lenny in the July 31st edition, at the freeze. "Now to explain what, exactly, we mean by "freeze".
The freeze upload policy of uploading changes in through unstable if possible will be continued to apply until the release." Since then there has been lots of bug fixing. See more in the Debian "lenny" Release Information page. Debian 5.0 won't have the newest packages like the distributions mentioned above, but when Debian 5.0 is declared stable you will have just that; a stable system that will be supported for several years. Partial disclosure We are increasingly seeing disclosures of security vulnerabilities that don't actually disclose much, except that the researcher has found something. Unfortunately, we have also seen lots of evidence that once the presence of a flaw is known, it doesn't take very long for folks to figure out what the vulnerability is. Of course, we don't have any data on how long it takes those with a malicious intent to find the flaws, but clearly the "white hats" find them quickly. So what or who, exactly, are those practicing "partial disclosure" protecting? Partial disclosure is clearly a part of the "security circus" that Linus Torvalds recently castigated, as it serves to increase the notoriety of security researchers, without necessarily doing anything to help protect users. Several recent examples come to mind of researchers who have found real flaws, but for various reasons don't want to disclose the details. Instead they "tease" the world by talking around what they found, trying—and generally failing—to leave out enough information so that others can't immediately follow in their footsteps. Dan Kaminsky's DNS flaw was an interesting example in that Kaminsky only disclosed the vulnerability to affected software vendors, allowing them multiple months to produce patches. He then wanted to give administrators time to apply the patches so he delayed disclosing the flaw for another month or so. He also had an admittedly selfish reason for delaying disclosure: he wanted to announce it at the Black Hat security conference. 
Because of the addition of source port randomization as the fix, it didn't take very long for other security researchers to come up with the vulnerability. Attackers may have come up with it even more quickly, but because there were no details available, developers of other, smaller DNS servers—not privy to the initial disclosure—were unable to determine whether their code was vulnerable. It is commendable that Kaminsky worked with the vendors to fix the problem, but there were clearly holes in his disclosure methods. A worse case can be seen with the recent spate of reports about "clickjacking". It started with a report of a canceled talk at the OWASP AppSec conference. The name is clearly suggestive of where the vulnerability might be, and the description of the canceled talk gave enough information that others were able to duplicate it. This led one of the original researchers to release the vulnerability information. So, in the interim, there was enough information floating around to find and exploit the flaws, and now the vulnerability info has been released, but there are no fixes available for many of them. It is hard to see what delaying the disclosure did for anyone—researchers or users—here. It did generate lots of press, though, partially because of the name as Bruce Schneier pointed out pre-disclosure: "Clickjacking" is a stunningly sexy name, but the vulnerability is really just a variant of cross-site scripting. We don't know how bad it really is, because the details are still being withheld. But the name alone is causing dread. Yet another recent example is the denial of service reported for nearly any TCP device. Like clickjacking, it is being described in scary ways—which may well be justified: Robert and I talk a lot, and I asked him if he'd be willing to DoS us, and he flatly said, "Unfortunately, it may affect other devices between here and there so it's not really a good idea." Got an idea of what we're talking about now? 
This appears not to be a single bug, but in fact at least five, and maybe as many as 30 different potential problems. They just haven't dug far enough into it to really know how bad it can get. The results range from complete shutdown of the vulnerable machine, to dropping legitimate traffic.

There may well be enough information in the description of what the researchers found—and, in particular, how they found it—for an enterprising attacker to find it for themselves. In the meantime, the rest of us are left in the dark. Security researchers are clearly under no obligation to disclose their research sensibly, but it would seem that either releasing all the details at once, or keeping them completely secret, would be better than these partial disclosures.

LK2008: The values of the Linux community

The opening keynote speaker for the 2008 Linux-Kongress was James Bottomley, who presented his views on the Linux community's values. What these values are, says James, is not entirely obvious. Related groups - the free software community, for example - have well-articulated value systems which define them. The Linux community's values are not so clearly expressed, but, he says, they are central to what we do.

James started with a bit of history, noting that the initial value placed on software was entirely commercial. Once the industry realized that software could be worth far more to its users than it costs to create, the proprietary mode became dominant - and that has affected the evolution of programming in general. The value placed on the code by its developers became irrelevant, leading to "paycheck coding." There is no value placed on creativity, and such a model leads to bad code. Eventually Richard Stallman came along and challenged the commercial view of software. But, during this time, about the only alternative to commercial software was the BSD Unix distribution, and that got caught up in the lawsuit by AT&T.
So closed software took over; Windows won on commodity platforms, but proprietary software also became dominant in the Unix arena.

In 1991, Linux hit the scene; since then, it has become the most popular and vibrant free software operating system available. In a sense, this is interesting, in that Linux is licensed under the GPL, a license that many companies hate. Apple explicitly chose BSD as the base for Mac OS to avoid GPL-licensed code. But, despite this antipathy, lots of companies use Linux, and even contribute to its development. It is interesting, James says, to look at why that is.

The reason is the Linux community's values. In particular, the community prizes technical merit above all other considerations - including small things like what any company or user would like to have. Also prized is passion; code supported by a developer who clearly cares about it will generally fare better in the review process. If the code quality and the passion are there, the community does not care about much of anything else. Factors like the source of the code or who might benefit from its incorporation don't really matter.

In particular, contributors to the kernel are not required to sign on to any particular belief system or any specific view of freedom. A contributor may have an FSF-like belief in free software, or, instead, be a corporate developer who does not care about software freedom at all. Even the BSD community requires acquiescence to a specific view of freedom. A Linux contributor, instead, need only be willing to contribute the code under the share-alike rules of the GPL.

As a result, anybody can play with Linux, regardless of philosophy or corporate status. We have a community which is defined by contributions, not by a specific set of values regarding software freedom. That has allowed the formation of a very diverse community with a specific shared interest: creating the best kernel we can. There are some significant benefits from this approach.
It forces companies to recognize their engineers' values; that, in turn, makes for more motivated developers. Developers who are interested in improving Linux can get resources and support from corporations. Users get high-quality code from developers who care about what they are doing. Companies get the ability to focus on their little piece of the problem while taking advantage of the community-maintained kernel for the rest; they can also offload their older code to the community for long-term maintenance.

James compared the Linux way of doing things with the US constitution. That document only mentions freedom three times, yet it has become a blueprint which has supported freedom for over 200 years. It is a relatively short document. The proposed EU constitution, instead, is about 20 times the length, before taking into account other documents which are referenced. That document would appear to be somewhat bloated; the goals would be better served by a more concise formulation.

Similarly, the Linux community spends little time talking about freedom. Instead, the focus is on a set of brief principles involving code quality and passion. Freedom is not legislated; it arises as an emergent value inherent in the Linux way of doing things. Linux has managed to bring about software freedom without talking about it, and without imposing a view of software freedom on its contributors. In the process, Linux has succeeded in creating something which is as free as - or more free than - the GNU system envisioned by the Free Software Foundation.

During the question period, James wished for a free software advocate who would argue the point with him, but no such person emerged. He will, it seems, have to repeat the talk in a different venue before he can have that debate.

Merged for 2.6.28

As of this writing, 4193 non-merge changesets have been incorporated for the 2.6.28 kernel.
In other words, this merge window is just beginning, having merged probably less than half of the patches which will eventually find their way into the mainline. What we see so far are a lot of drivers and incremental improvements, but not many major changes. User-visible changes for 2.6.28 include: There are new drivers for Analog Devices SSM2602, AD1882A and AD1980 codecs, Freescale MPC5200 I2S audio devices, Texas Instruments TLV320AIC26 codecs, Tascam US-122L USB Audio/MIDI interfaces, Wolfson Micro WM8580, WM8900, WM8903, and WM8971 audio devices, Blackfin SPORT peripheral interface controllers, NVIDIA HDMI HD-audio codecs, Toshiba RBTX4939 MIPS boards, Atheros L2 10/100 network adapters, Cisco 10G Ethernet adapters, JMicron JMC250 chipset-based network adapters, QLogic QLGE 10Gb Ethernet adapters, SMSC LAN95XX based USB 2.0 10/100 ethernet devices, AFEB9260 ARM-based boards (an open source board design), Arcom/Eurotech VIPER boards, AT91SAM9X watchdog devices, ITE IT8716, IT8718, IT8726, and IT8712 Super I/O watchdogs, W83697UG/W83697UF watchdog devices, TLV320AIC23 codecs, Micron MT9M111 camera chips, Magic-Pro DMB-TH tuners, Afatech AF9015 and AF9013 DVB-T USB2.0 receivers, Conexant cx24116/cx24118 tuners, DVB cards based on SDMC DM1105 PCI chip, Silicon Laboratories SI2109/2110 demodulators, ST STB6000 DVBS Silicon tuners, numerous Fujifilm FinePix cameras, ALi video camera controllers, WM8400 AudioPlus HiFi codecs, and SGS-Thomson M48T35 Timekeeper RAM chips. Support for the old Sun 4 architecture and ColdFire serial ports has been removed. There is a new sysfs file (unload_heads) which can be used by a user-space process to tell an ATA disk to retract its heads and prepare for an impact. When used in conjunction with an accelerometer, this feature could be used to attempt to preserve a disk in a falling laptop. Improved support for ptrace() - and support for precise event-based sampling in particular - has been added for the x86 architecture. 
The crypto subsystem has gained support for deterministic ANSI X9.31 A.2.4 pseudo-random number generation. The SMACK security module can now be configured to enforce mandatory access control rules on privileged processes. There is a script which can be used to generate a minimal "dummy" policy for SELinux. The smallest workable policy, it seems, is 587 lines long. Some sound devices can detect the presence of audio devices on input and output jacks. The ALSA layer now allows drivers for those devices to register those jacks and report the presence of devices attached to sound cards through the input layer. Work with multiqueue networking continues; 2.6.28 will include the ability to associate a separate queueing discipline with each internal packet queue. The wireless regulatory compliance subsystem has been merged. The kernel now supports the Phonet packet protocol used by Nokia cellular modems. See networking/phonet.txt in the kernel documentation directory for more information. Also added to core networking is support for the Distributed Switch Architecture protocol, with initial support for a number of Marvell switch chips. The netfilter layer has been augmented to support network namespaces. The ext4 system has lost the "ext4dev" name; this is a signal that the developers are getting ready to declare it ready for production use. Ext4 has also gained a set of static tracepoints for use with SystemTap or other tracing tools. The FIEMAP ioctl() for extent mapping has been added. Xen has added CPU hotplugging support. Version 4 of the rpcbind protocol is now supported; this enables the kernel to offer RPC services via IPv6. The OCFS2 filesystem has gained a number of features, including POSIX locks, extended attributes, and use of the JBD2 journaling layer. Changes visible to kernel developers include: Discard request and request timeout handling have been added to the block layer; a number of other internal API changes have been made as well. 
See this article for details. Video4Linux2 drivers no longer have their open() function called with the big kernel lock held. The lock_kernel() calls have been pushed down into individual drivers within the mainline tree; external drivers will need to be fixed. The merge window is likely to remain open until approximately October 24. Connecting to Microsoft Exchange with OpenChange Working with a Windows network from Linux has never been a smooth ride. While Samba, Wine and OpenOffice.org have made many components workable, connecting to the Microsoft Exchange email server has remained unreliable. Now the OpenChange developers hope to change that, providing the same capabilities as Microsoft Outlook in a range of Linux-native clients like Kontact and Evolution. OpenChange is not yet workable, but partial operation can demonstrate its potential. If you want to connect to Exchange at the moment, you have a few options. Evolution can connect using a hack with Outlook Web Access, providing email, shared folders, calendars and contacts. But it's far from reliable; I tried to get by with it at the office, warts and all, and managed it for a couple of weeks before resigning myself to Windows. The other options are even worse -- just use the webmail client, or use the IMAP server for email and hacks such as this one to get at other data in a manner similar to Evolution. Working from home on Kubuntu, I find it easier to just use the webmail client. OpenChange is taking a much more sensible approach. At the heart of the project is a MAPI-compatible API, which allows clients to talk directly to Exchange and access all of its functionality. The code is still being actively developed, but some application developers have started playing around with it; the first code for Evolution came out in January 2008. According to Brad Hards, an OpenChange and Kontact developer, "OpenChange can do most of the Exchange tasks now, though it can't currently do free/busy." 
For the curious, OpenChange developer Julien Kerihuel has written a simple command-line client. It's currently available in Ubuntu Intrepid and Debian Experimental, though you're better off compiling it yourself as it is changing quite rapidly. It isn't especially well documented, and the manpage implies some functionality that Kerihuel is still working on, but I did have some success. First, you need to set up a new profile; you can then check whether it has worked by listing your mailboxes.

I managed to send a test email, which I picked up in Outlook without problems. When I opened the same email in KMail, however, it had a "winmail.dat" binary file attached, which you wouldn't normally get in emails from Outlook. You can also interrogate folders, send emails, create and delete contacts and calendar appointments, and access most of the other Exchange functionality. Kerihuel: "Openchangeclient is a test case for libmapi, it's a useful way to test if a problem is in the client application or in libmapi, and there is a plugin for sugarcrm, so it may remain in future." There's a proxy server using Samba too, for those who want yet another way of connecting.

For Kontact users, usable integration is probably a good six months away. The akonadi resource can deal with most of OpenChange's functionality, "at least a bit", according to Hards, though "Kontact can't currently make use of it because it isn't converted to akonadi yet." KDE 4.2 should come out with akonadi integration, but the OpenChange functionality might not yet be stable enough for large quantities of important data. Hards thinks KDE 4.3 is probably "the sweet spot."

Until then, Ballmer's mantra remains relevant; OpenChange and its client implementations could do with developers, developers, developers.
Cracking this nut could throw open Exchange to a new range of clients, and as Kontact and its peers become stable on Windows and Mac OS X, an entrenched Windows server will pose less of a threat to free software migrations on desktops.

LK2008: Embedded and Mobile Linux

Linux-Kongress 2008 attendees had the opportunity to hear two different sessions dedicated to organizations trying to improve the state of Linux support for embedded and mobile systems. They have similar goals, but are taking different approaches and have different levels of resources available to them.

The first of these is OpenSourceEmbedded, presented by uClinux developer Jeff Dionne. He opened with a statement that, ten years ago, Linux-based embedded systems were nearly unknown. Now those systems are everywhere, with hundreds of millions of deployments. Embedded systems, he says, make up the largest installed base of Linux systems.

All is not perfect, though, in the embedded sphere. Linux still has an uncomfortably large footprint for embedded use. There is also no unified distribution for embedded use; instead, the industry is full of homemade solutions made by vendors. He would like to address this situation through the creation of a next-generation platform. It would take the form of a kit that developers could start with, equipped with design examples for a number of applications: telephones, digital video recorders, etc.

There are two hardware platforms being targeted initially by this effort. One is a Plasma MIPS processor - a very simple device which can be implemented with an FPGA. A simulator for this processor runs to about 600 lines of code. The other, more advanced platform is a LEON 2/3 SPARC processor, a full system which has a memory management unit and supports multiprocessor configurations. Examples of the first processor include a RealTek MIPS system, while the LEON SPARC CPU is similar to current SuperH 3 processors.
The Plasma and LEON SPARC processors are being designed now, with the intent of producing them as open hardware designs. On top of these processors will be a base operating system layer with a "mini-POSIX" environment. There will be an interesting packaging system which stores components as separate "blocks" in flash, outside of any filesystem. The running system will be assembled from the blocks by the boot loader. This organization is designed to avoid bricking; any bad or corrupted components can simply be bypassed without affecting the functioning of the rest of the system. This, evidently, is how PalmOS did things. The next challenge is creating a community around this whole effort. To that end, resources are to be put up at opensourceembedded.org - though nothing is available as of this writing. The site will include project hosting, along with the ability to download the development kits. Jeff says that the uClinux experience has shown that the kit approach works; with a ready-to-use code base like that, a community can come together. There are also plans to create an organization behind this effort which, among other things, can enter into non-disclosure agreements with hardware manufacturers. This organization will also work to help vendors ship GPL-compliant products. OpenSourceEmbedded appears to be in an early state, so it's hard to make any guesses about how successful it will be. For more information, see Jeff's slides [PDF]. Mobile Linux The closing session at the 2008 Linux-Kongress was a talk by Dirk Hohndel, who began by noting that Linux-Kongress is, in fact, the oldest Linux event. It was first held in 1994, and hosted many of the kernel developers who were active at that time; Dirk estimates that about half of the development community was to be found in a single room. It would take a rather larger room to accomplish that now. Dirk complimented the event on its avoidance of commercialism and its sustained focus on the technology. 
The technology that Dirk came to talk about was mobile Linux. He started by expressing his disappointment with desktop Linux. It has become a collection of poorly-integrated applications which are somehow trying to replicate Windows 95. The result does not work well on the desktop, and it most certainly is not optimized for the mobile environment.

But, says Dirk, mobile Linux is not really embedded Linux either. Embedded Linux evokes images of access points and other single-application boxes which are not meant to be extended past a single function. They are not concerned with the user's experience, and they are not concerned with mobility. The subject here is devices which have a screen and which can have new applications installed onto them. So some sort of desktop-like interface is needed, but current desktop Linux does not fill the bill.

According to Dirk, the problem with desktop Linux is the fundamental approach: developers are not the target audience for this software, but they are making all the interface decisions. What's needed is input from people who are specialized in interface design and human-computer interaction. That leads to a "scary thought": interface specialists are generally not coders, but they will be making decisions that coders are expected to implement. That is not a normal mode of operation in the free software community, but it is needed here.

Other problems include the proliferation of "80% done" projects. Much of the work has been done, but nobody wants to do the work to finish the job. There are also far too many choices; in general, says Dirk, people do not like it if they have to choose between more than two alternatives. When dealing with the Linux desktop, it's hard to find situations where there are fewer than six choices. And, overall, the Linux desktop lacks consistency. That, says Dirk, is why he uses an Apple laptop. Apple enforces a consistent design across the application space and, he says, the result is very nice.
Devices should be simple and natural to use; such devices are increasingly hard to find anywhere. As an example, he held up a paper notebook. The device boots very quickly and has a nice "touch-based" pencil-oriented interface. No manuals or explanations are needed. Linux-based devices should be just as easy to use. But, at the same time, they need to offer an experience which is close to what people expect from an ordinary desktop computer. Such a device should have access to the Internet, and users should be able to install software.

Dirk then pulled out an Eee PC system and gave the five-second boot demonstration. This work, he says, is an example of what is being done by Intel in support of the Moblin project. Intel is trying to solve some of the hardest problems in the mobile space, contributing the results for everybody to use.

To that end, Moblin is working toward the creation of a base distribution for mobile systems. The user interface will be based on the GNOME mobile work, but with a lot of enhancements. The end goal is the creation of a Linux distribution for mobile devices which is far better than the state of the art today. It is not, he says, an attempt to compete with distributors; instead, Moblin is providing a base which the distributors can build on. Intel's effort will naturally focus on Intel processors, but contributions for any architecture are welcome at Moblin.

In conclusion, Dirk noted that Linux's success on the server side was relatively easy. The mobile problem is much harder. Intel is hoping that others will join in to help Moblin reach its rather ambitious goals.

OpenOffice.org releases 3.0, faces new challenges

A new version of the popular free software office application suite, OpenOffice.org (OOo) 3.0, was released this week to lots of press and enough download traffic to bring down its webserver. While the release isn't a huge leap forward in terms of features, it does provide some compelling enhancements.
Perhaps the most interesting is the increased focus on extensions, a la Firefox, that don't require modifying the core OOo code. This may help combat the problem—or perceived problem—that Sun is stifling OOo development through its bureaucratic procedures for adding new functionality.

The first thing one notices when starting up OOo 3.0 is the new splash screen, but it appears for only a short time. One of the major complaints about the suite has been how long it takes to start up—something that has been addressed in 3.0. The application opens to a new welcome screen (seen at left) that presents a more friendly appearance, rather than an empty window, for new users. Once past that point, the various tools look much as they did in OOo 2.4 and earlier versions.

The other changes are mostly under the covers; they will be noticed by power users, but are not immediately obvious to basic users. These include:

Writer (word processor) has a new slider for zooming
Writer allows multi-page display and editing
Calc (spreadsheet) allows up to 1024 columns per sheet
Draw (drawing) can handle poster-size files
Impress (presentation) supports multiple monitors for presentations
Writer has additional editing modes for multi-lingual support as well as wiki document editing
Calc has a new equation solver
Chart (graphing) has improved graphical output

The OOo extensions repository has many different kinds of add-ons that provide new or enhanced functionality for users. The most popular is the PDF import extension, which allows loading PDF files into the application for editing. Given that OOo has long had the ability to natively export PDFs, importing them is an excellent addition.

Clearly Sun and the OOo project see extensions as a fertile ground for innovation by folks who are not necessarily OOo "contributors"—as they have not signed the Sun Contributor Agreement (SCA) [ PDF, currently unavailable due to the download traffic problems ].
Sun's community manager for OOo, Louis Suarez-Potts, puts it this way: OOo 3.0 adds to that freedom by using extensions much the same way that Firefox does: it gives all users the freedom to add new features, functionality. At present, we have a couple of hundred, and they have proved popular. We've also done minimal advertising. I anticipate that in the coming months, as 3.0 gains yet more popularity (all servers are down at the moment), there will be more and more interesting extensions out there. I can see extensions that radically depart from what we consider "office" tools---and why not? OOo is an integrated set of tools based on fairly conservative conceptions of office software. But there is no compelling reason to stick with the conservative past, and every reason to be creative. One of the new features that OOo developers are most excited about won't affect Linux users at all. OOo 3.0 has a native Mac OS X look and feel, rather than the earlier X11-based interface. A native Windows version has always been a part of OpenOffice (and its precursor, StarOffice), but the new default theme is said to be particularly attractive on that platform. There are various new features aimed at those currently using—or needing to interoperate with—Microsoft Office. There is support for Access database files as well as improved Visual Basic for Applications (VBA) macro support. Somewhat controversially, OOo 3.0 has added the ability to read (but not write) Office Open XML (OOXML) files. OOXML is the newly minted standard for office documents that Microsoft and Ecma pushed through the ISO standardization process earlier this year. Support for OOXML is one of the contentious areas surrounding OOo. There are two (vocal) developer camps, one Sun-centric, the other Novell-centric; unsurprisingly they tend to clash over OOXML as well as development pace and direction issues. It has gotten to the point where a fork, called Go-OO, has come about, led by Novell's Michael Meeks. 
Go-OO's version of OOo has been adopted by several distributions, leading some to see it as a "hostile" fork. Sun's chief open source officer, Simon Phipps, clearly sees Go-OO (and the related OO-Build) as an attempt by Novell to control OOo:

The result of this is that go-oo.org is definitely a hostile and competitive fork of OpenOffice.org, and OO-Build is no longer a helpful downstream since it no longer upstreams much of anything (especially for Mac), small changes excepted. Unlike Groklaw I'd still hesitate to call OO-Build a fork, but Go-OO is unmistakably one, just look at the web site, the Windows build and the rhetoric. The motivation for Go-OO being hosted and promoted by Novell and its staff seems unmistakable to me, as does the fact it is a Novell-sponsored fork. They are promoting Microsoft's flakey XSLT-based OOXML support, they are isolating Linux from OpenOffice.org (so that no-one in the main OpenOffice.org community is able to get support contracts from Linux users). And it is all cleverly wrapped in a community-friendly story about hackers and their freedom and evil, controlling Sun, delivered without interference from Novell corporate.

Meeks's most recent look at OOo development is the proximate cause of much of the current sniping in various blogs. Meeks analyzes commits to the OOo codebase to try to extract trends in the development of the tool. His conclusion is stark—undoubtedly inflammatory to those in the Sun camp—"Crude as they are - the statistics show a picture of slow disengagement by Sun, combined with a spectacular lack of growth in the developer community."

While there have been various responses to the analysis—including this LWN comment thread—there has, as yet, been no real counter-analysis that comes to a different conclusion. Perhaps there are other ways to slice and dice the data that look more favorable to growth in the OOo community, but if not, the conclusion is worrisome.
OOo is a very useful and widely used tool which offers a way out of Microsoft lock-in. Because of Novell's close association with Microsoft, people worry that Go-OO is an underhanded means for another kind of lock-in—this time to Novell.

In what seems almost a taunt—as well as a validation of the accusation of a hostile fork—Meeks adds a postscript to his analysis:

Why is my bug not fixed ? why is the UI still so unpleasant ? why is performance still poor ? why does it consume more memory than necessary ? why is it getting slower to start ? why ? why ? - the answer lies with developers: Will you help us make OpenOffice.org better ? if so, probably the best place to get started is by playing with go-oo.org and getting in touch [...]

There have long been complaints about the pace of OOo development, along with calls for creating a foundation to oversee it. It would seem that OOo is at a bit of a crossroads. If Sun's commitment is reduced, without a corresponding increase in contributions from others, OOo could stagnate—or Go-OO could take over.

Ostensibly, the SCA is one of the sticking points for some contributors. They do not trust Sun not to take their contributions in a proprietary direction. But the conflict is really rooted in issues of control and development direction—two things likely to lead to forking. While having two forks is suboptimal, perhaps, it may lead to improvements in both the code and the development process for OOo.

There are legitimate concerns on both sides of the issue—undoubtedly the mostly silent user community has yet another perspective—but there is enough bad blood between them that it is hard to see it resolving in some relatively amicable way. The office application suite is an extremely lucrative product, at least in the proprietary world. One gets the sense that both Sun and Novell are seeing dollar signs which are clouding their vision. A neutral foundation of some kind might be a good first step towards reconciliation.
SELinux permissive domains

Readers of this page—along with the kernel page—will not find it surprising that SELinux is a complex beast. It is, however, the dominant security framework for Linux, pushed hard by Red Hat, but also being adopted, slowly, by SUSE, Ubuntu, and others. Over the years, through lots of hard work, it has become somewhat less complex, at least for administrators; a new feature, called permissive domains, will help further ease the administration of SELinux-enabled systems.

These days, SELinux has two modes, the aptly named enforcing and permissive modes. When in enforcing mode, SELinux will not allow operations that are not permitted by the policy, whereas in permissive mode, a violation is just logged and the operation is allowed to continue. Administrators trying to track down an SELinux problem with an application—whether a real security issue or just a problem with the policy—can put the system into permissive mode, then study the logs to determine what policies are being violated. Or they can use audit2allow to make those policy changes for them.

Until permissive domains, though, the choice between permissive and enforcing was binary for the entire system. Putting a system into permissive mode meant that various attacks that SELinux might normally stop on other applications would instead just be logged. With permissive domains, a single process, or group of related processes, can be marked as permissive, while the rest of the system stays in enforcing mode.

Red Hat SELinux hacker Dan Walsh describes permissive domains on his blog. One of the motivations is to help third-party software developers feel more comfortable about shipping SELinux policy with their application:

Another problem SELinux has is that third party software companies want to ship with SELinux policy for their software but do not trust that they have tested it well enough to run their confined applications in enforcing mode.
I have talked to developers of stock market software that wanted to write policy for an application, distribute it to a live environment of several hundred machines, and then gather the AVCs as they happen, using this information to fine-tune their policy. After a long period of time, where they saw no AVCs, they might be willing to put their policy in enforcing mode. In RHEL5 they need to put the entire machine in permissive mode, but permissive domains solve this problem.

Permissive domains are available in recently updated Fedora 9 systems and will come standard with Fedora 10. As Walsh shows, enabling permissive mode for a domain is trivial: a single semanage invocation along the lines of "semanage permissive -a httpd_sys_script_t" would put all CGI scripts into permissive mode, and the matching -d option ("semanage permissive -d httpd_sys_script_t") removes permissive mode for the CGI script domain (httpd_sys_script_t).

This is definitely a nice step forward for assisting with policy development, but there is still a lingering problem with the recommended way to generate SELinux policies. Walsh describes how that is done:

Finally, when someone wants to write policy for a new confined domain, we tell the policy writer to build a minimal policy using tools like system-config-selinux. Then we advise them to put the machine in permissive mode, run the confined application, collect the AVC messages, use audit2allow to generate new policy, and try again. Lather, rinse, repeat. This puts the entire machine at risk, since it is no longer protected by SELinux. With permissive domains, you can mark the new domain as permissive and avoid putting the machine at risk.

The problem, of course, is that blindly using audit2allow is extremely dangerous. It assumes that the application has no security problems, that all of its accesses should be permitted; if that can be assumed, what is SELinux for? By taking all of the violations and turning them into policy changes, the application, rather than the policy developer, decides on the access it requires.
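The difference between system-wide permissive mode and a permissive domain can be made concrete with a toy model. The Python sketch below is only an illustration of the decision logic described above, not how SELinux is implemented (the real checks happen in the kernel); the ToyPolicy class, its fields, and the sample type names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Toy model only; real SELinux decisions are made in the kernel."""
    enforcing: bool = True                              # system-wide mode
    allowed: set = field(default_factory=set)           # (domain, target, perm) rules
    permissive_domains: set = field(default_factory=set)
    log: list = field(default_factory=list)             # stands in for AVC messages

    def check(self, domain: str, target: str, perm: str) -> bool:
        """Return True if the operation is allowed to proceed."""
        rule = (domain, target, perm)
        if rule in self.allowed:
            return True
        # A violation is recorded whether or not it ends up being blocked.
        self.log.append(rule)
        # It proceeds only if the whole system or this one domain is permissive.
        return (not self.enforcing) or domain in self.permissive_domains


# Illustrative type names; only the CGI script domain is marked permissive.
policy = ToyPolicy(permissive_domains={"httpd_sys_script_t"})
print(policy.check("httpd_sys_script_t", "shadow_t", "read"))  # logged, proceeds
print(policy.check("ftpd_t", "shadow_t", "read"))              # logged, blocked
```

In this model, marking one domain permissive leaves every other domain subject to enforcement, which is the property that makes the feature attractive for policy development on live machines.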
Using audit2allow correctly is much more complex, requiring a good understanding of SELinux and the existing policies and domains. To be fair to Walsh, in a related post, he does warn: Whenever you generate policy in this way you should really examine the te file for what rules audit2allow has generated and try [to] make sure they make sense, and don't open a security [hole]. It is always good to ask if the policy is good on a list like fedora-selinux. If you believe this is a bug in policy, please open a bugzilla. Then we can fix the policy for others. The audit2allow manpage is even more explicit: Care must be exercised while acting on the output of this utility to ensure that the operations being permitted do not pose a security threat. Often it is better to define new domains and/or types, or make other structural changes to narrowly allow an optimal set of operations to succeed, as opposed to blindly implementing the sometimes broad changes recommended by this utility. Certain permission denials are not fatal to the application, in which case it may be preferable to simply suppress logging of the denial via a dontaudit rule rather than an allow rule. Using audit2allow is, unfortunately, the way that most SELinux policy is developed. There aren't enough SELinux experts—there may never be enough—to actually look at the code for applications and determine a priori what the policy should look like. So, testing applications by running them to determine what permissions they require is the only sane way to do it, error-prone though it may be. What is Ulteo? Gaël Duval, founder of Mandrake-Linux, started Ulteo after he was laid off by Mandriva in 2006. The first alpha release was announced several months later. In the past two years the project has had some time to mature and with the announcement that OpenOffice.org 3.0 is available through Ulteo.com it seemed like a good time to revisit the project. 
Ulteo is aimed at Windows users, and gives them a slow and easy way to convert to Linux using the first of several sub-projects: the Ulteo Online Desktop. Many Linux applications are available through a Java-enabled web browser such as Firefox or Internet Explorer. OpenOffice.org, KPdf, Kopete, Skype, Thunderbird + Enigmail, Gimp and Digikam, Inkscape and Scribus and many other applications are available in the Online Desktop without installing any new software on the PC. A subscription to Ulteo Premium provides extra storage for documents and other benefits.

Once the user becomes comfortable with Linux applications, they could be ready for the Ulteo Application System, which is an installable system for the PC. The Application System features automatic document synchronization/backup, automatic updates and upgrades, and all the applications included in the Online Desktop.

The Ulteo Virtual Desktop seems to be much the same as the Online Desktop. It is designed to run under Windows and allows the use of both Linux and Windows applications. The Virtual Desktop uses coLinux to provide the Linux desktop on Windows.

The final Ulteo product, for now at least, is the Documents Synchronizer. This, like the Virtual Desktop, is Windows software, but it can be used with the Online Desktop to back up and retrieve documents, whether these are produced locally with Windows applications or with Linux applications using the Online Desktop.

Ulteo is not something that will be of immediate interest to the average LWN reader. Presumably most readers are already knowledgeable about running Linux and its applications. However, most of us probably do know someone who is not ready to run Linux natively. At least some of those people could start using the Online Desktop and become more familiar with various Linux applications without having to download and install those applications. Who knows where they might go after that.
Fedora checking community health with EKG

Measuring the health of communities is an interesting, difficult task. The Fedora project has recently started using a new tool, called EKG, to try to get an overview of the demographics of the free software projects that are sponsored by the distribution. EKG is still young, but already provides some interesting information. Because it is GPL-licensed, as is the Fedora norm, it can be picked up by other distributions or interested parties to target their own projects.

At its core, EKG is a few Ruby scripts that process mailing list data so that graphs can be produced. Currently, it produces both pie charts and line graphs that indicate the number of Red Hat posters versus those from elsewhere. A portion of the most recent set of graphs can be seen at right.

Red Hat's Michael DeHaan has taken on development of EKG to use as a tool to measure how well various projects are building a community separate from Red Hat. There are lots of free software projects that have been released by Red Hat (or Fedora, which often amounts to the same thing) but may or may not be seen as useful tools outside of Fedora. By looking at the mailing list traffic, particularly over time, some idea of which projects are building a community, and which aren't, can be derived. As the project page puts it:

The premise is simple... what are the demographics behind open source projects that we run in Fedora?
Who posts
Who contributes
What projects are most active?
What projects need a little help?

Mailing lists are just one measure of the health of a project, of course, so DeHaan is looking at other metrics. Commits to the project repository, along with the identities of the committers, would seem an obvious choice. Better graphs with more useful information on each axis as well as time series of the pie charts are also on the "to do" list.
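The kind of tally behind those charts is easy to picture. EKG itself is written in Ruby and its code is not reproduced here; the Python sketch below is an independent illustration of counting posts by sender domain and splitting them into "inside" and "outside" posters. The function names, the INSIDE_DOMAINS set, and the sample addresses are all assumptions made for the example.

```python
from collections import Counter
from email.utils import parseaddr

# Assumption: which domains count as "inside" posters for this tally.
INSIDE_DOMAINS = {"redhat.com"}


def sender_domain(from_header: str) -> str:
    """Return the lower-cased domain of a From: header, or '' if there is none."""
    _, addr = parseaddr(from_header)
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def tally(from_headers):
    """Count posts per domain and return (per-domain counts, inside, outside)."""
    domains = Counter(sender_domain(h) for h in from_headers)
    inside = sum(n for d, n in domains.items() if d in INSIDE_DOMAINS)
    return domains, inside, sum(domains.values()) - inside


# Made-up From: headers standing in for a mailing list archive.
sample = [
    "Alice Example <alice@redhat.com>",
    "bob@gmail.com",
    "Carol <carol@GMAIL.com>",
    "Dave <dave@example.org>",
]
domains, inside, outside = tally(sample)
print(domains.most_common(), inside, outside)
```

A domain-based split like this is crude (employees posting from personal addresses are counted as outsiders), which is one reason a single metric of this kind should be read with some caution.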
He is also looking at derived statistics that will allow direct comparison of different projects by using equations that in some way model success. It is difficult to draw any conclusions from the limited graphs that are currently available. One thing that does stand out, though, is the popularity of gmail.com email addresses, which seem to account for around one-quarter of posts. One can also certainly see projects that are completely dominated by "inside" (i.e. Red Hat) folks. The JBoss lists are a good example. Projects are trying various ways to measure how well they are doing their job; EKG is another way to do that. For the kernel, the statistics on each release are gathered by LWN, as well as over longer periods by the Linux Foundation. Ubuntu has its Upstream Report which looks at how well bugs are getting to upstream bug trackers. Undoubtedly other projects have their own ways of trying to measure their impact. As yet, there is no mailing list for EKG development. We look forward to the day when EKG is applied to its own development list. It would seem that some kind of "metahealth" measurement of the community might be able to be derived from that data. Block layer: solid-state storage, timeouts, affinity, and more The 2.6.28 merge window has seen the addition of a number of changes to the block layer. Here's a summary of the new features and APIs which have gone in. Solid-state storage devices There are some enhancements aimed at improving the kernel's support of solid state storage devices. One of those, the discard API, has been covered here before. This API allows high-level block subsystem users (filesystems) to indicate that a particular range of blocks no longer contains useful data. That allows the low-level device to incorporate those blocks into its garbage collection scheme and to stop worrying about their contents when performing wear leveling. Since the initial LWN article, though, the API has changed a little. 
The way to issue a discard request is now: The end_io() parameter seen in previous versions of the API is no longer present. There is no way for callers to know when the request completes, or, indeed, if the request completes at all. Since the caller is indicating a lack of interest in the given sectors, it really should not matter what the device does thereafter. There is a filesystem-level function for creating discard requests: Here, the interface is expecting block numbers using the filesystem block size, rather than 512-byte sectors. User-space programs can issue discard requests with the new BLKDISCARD ioctl() call. Needless to say, such operations should be done with care; about the only logical user of this ioctl() would be mkfs programs. Block drivers which support discard requests will provide a suitable function to the block layer: In the absence of a "prepare discard" function, discard requests for the device will fail. The block layer has also added a flag by which drivers can indicate that a device is not rotating storage, and, thus, does not suffer from seek delays. By setting QUEUE_FLAG_NONROT (with queue_flag_set() or queue_flag_set_unlocked()), a driver tells the block layer that it is working with a solid state device. I/O schedulers can use that information to avoid plugging the queue - a useful technique for combining requests to rotating storage devices, but a useless operation when there is no seek penalty to avoid. Request affinity On large, multiprocessor systems, there can be a performance benefit to ensuring that all processing of a block I/O request happens on the same CPU. In particular, data associated with a given request is most likely to be found in the cache of the CPU which originated that request, so it makes sense to perform the request postprocessing on that same CPU. With 2.6.28, sysfs entries for block devices will include an rq_affinity variable. 
If it is set to a non-zero value, CPU affinity will be turned on for that device. According to the patch changelog, turning this feature on can reduce system time by 20-40% on some benchmarks. Timeout handling Robust device drivers typically have to be written to handle cases where devices fail to complete operations they have been instructed to do. In a few cases, higher-level code helps with this task; the networking layer, for example, can track outgoing packets and let a driver know when a transmit operation has taken too long. In most other drivers, though, it's up to the driver itself to notice when an operation seems to be taking too long. Like the network subsystem, the block layer manages queues of requested operations. As of 2.6.28 the block layer will, again like networking, have a mechanism for notifying drivers about request timeouts; that, in turn, will allow a bunch of timeout-related code to be removed from the lower layers. Timeout handling in the block layer can be more complex, though, and the associated API reflects that complexity. A block driver must register a function to handle timed-out requests: The amount of time a request should be outstanding before timing out is set up with: The tracking of per-request timeouts is done within the block layer; the timer for any individual request is started when that request is dispatched to the driver by the I/O scheduler. Should a request fail to complete before the timeout period passes, the driver's timeout function will be called with a pointer to the languishing request. The driver then can do one of three things: Figure out that, in fact, the request was completed as expected, but that completion had not been noticed by the driver. A dropped interrupt could bring out such a situation, for example. In this case, the driver returns BLK_EH_HANDLED, and the request will be marked as completed. Decide that the request needs more time, perhaps because it has been re-issued by the driver. 
A BLK_EH_RESET_TIMER will start the timer again for this request. Punt and return BLK_EH_NOT_HANDLED. The block layer currently does nothing at all when it gets this return code; future plans appear to include aborting the request within the block layer when this return value is encountered. If things look bad, the driver may decide to abort any outstanding requests, reset the device, and start over. There are a couple of new functions which can help with this task: These functions will abort the given request, or all requests on the queue, as appropriate. Part of that process involves calling the driver's timeout handler for each aborted request. Other changes in brief Some other block-layer changes include: The handling of minor numbers has been changed, allowing disks to have an essentially unbounded number of partitions. The cost of this change is that minor numbers may be attached to a different major number, and they might not all be contiguous; for this reason, drivers must set the GENHD_FL_EXT_DEVT flag before the extended numbers will be used. See this article for more information on this change. The prototypes of blk_rq_map_user() and blk_rq_map_user_iov() have changed; there is now a gfp_mask parameter. This allows these functions to be used in atomic context. kblockd_schedule_work() has an additional parameter specifying the relevant request queue. The new function bio_kmalloc() behaves much like bio_alloc(), but it does not use a mempool to guarantee allocations and can thus fail. It is, all told, one of the busier development cycles for the block layer in recent times. Fedora and long term support The news that Wikipedia was in the process of switching away from Red Hat and Fedora—and to Ubuntu—has stirred up some Fedora folks. The relatively short, 13 month support cycle for Fedora releases was fingered as a major part of the problem in a gigantic thread on the fedora-devel mailing list. 
Some would like to see Fedora be supported for longer, so that it could be used in production environments, but that is a fundamental misunderstanding of what Fedora has set out to do. The idea of supporting Fedora beyond the standard "two releases plus one month", which should generally yield 13 months, is not new. It was, after all, the idea behind the Fedora Legacy project. Unfortunately, Fedora Legacy ceased operations at the end of 2006, largely due to a lack of interested package maintainers. So, calls for a "long term support" (LTS) version of Fedora are met with a fair amount of skepticism. Just such a call went up in response to the Wikipedia news. Patrice Dumas outlined the need: [...] it seems to me that a true Fedora LTS is missing, that would allow those who want things that are new, including for testing but cannot afford changing everything each year (servers for example or user desktops). It seems to me that fedora ends up being used almost exclusively as single user desktop, so that testing of other functionalities is likely to be less widespread. Fedora is not meant for production use, nor for those who cannot upgrade at least yearly. It has an entirely different mission, which Jon Stanley sums up: Well, in all fairness, Fedora's stated goal is to advance the state of free software. You get that by being bleeding-edge. Unfortunately, being bleeding edge also means not being suitable for production environments - these are two fundamentally incompatible goals. This is why Red Hat Linux split into two - Fedora and RHEL. RHEL is a derivative distribution of Fedora. Many believe that folks who want "Fedora LTS" would be better served by Red Hat Enterprise Linux (RHEL) or, for those that do not want to pay for a distribution with support, an RHEL derivative such as CentOS or Scientific Linux. But those don't have the package diversity available with Fedora. 
A stable release would also want to freeze major packages at a particular version—only backporting security fixes into that version—which is definitely not what is done with Fedora while it is being supported. Dumas wants to see something that finds a middle ground: Fedora legacy (or fedora lts) would not be the same than centos. Maybe a Centos + repository with more recent stuff would be, but currently I think that there is something in the middle between fedora and centos that is missing. The Extra Packages for Enterprise Linux (EPEL) project is meant to help fill that gap, by maintaining additional packages—beyond what Red Hat maintains—for RHEL and compatible distributions. Typically, though, those packages will also be held at a version level that will, with time, grow rather obsolete, at least to those who want to more closely follow the upstream project. And, of course, there aren't as many packages available for the enterprise distributions, even with EPEL, as there are for Fedora. It would seem the classic tension between "bleeding edge" and stable as described by Stanley. Though it isn't clear how it would solve that problem, there are calls for reviving Fedora Legacy. There are few opposed to the idea of continuing Fedora support—if enough people can be found to do it—but the implementation details seem to bog things down. There is a bit of a "chicken and egg" problem in that attracting package maintainers is hard to do without a project to point to, but convincing the Fedora Engineering Steering Committee (FESCo) that it is worthwhile without having those maintainers will be difficult. One of the sticking points is the availability of infrastructure—servers and bandwidth primarily—for any nascent legacy project to use. The Fedora board is seen as being resistant to allowing the use of the Fedora infrastructure for such a project. 
In response to someone who pointed out that the board's approval is not required, Dumas disagrees:

When it requires cooperation with the infrastructure, it does. It is also possible to start something external like rpmfusion, but the amount of work is very big. My proposal only made sense if the economies of scale realized by working inside the fedora project were realized. Still, if somebody provides the infrastructure, sure I'll try to help with a project similar than the one I proposed, but I cannot myself do anything for the infrastructure part.

There is also the question of what kind of guarantees a legacy project would make about how long it would support older releases. Dumas and others seem to be in favor of essentially no commitment; maintainers would continue supporting their packages for as long as they wished. While there is some attraction to that idea (it certainly reduces the number of maintainers required), it is unclear that it actually provides a useful service. The idea that some security fixes are better than none is attractive, but David Woodhouse cautions against that view:

If we present the _appearance_ of a distro with security updates, while in fact there are serious security issues being unfixed, then that is _much_ worse than the current "That distro is EOL. Upgrade before you get hacked" messaging. For anything to have the Fedora name on it, it _must_ have guaranteed security fixes for at least the highest priority issues.

As the original Fedora Legacy project wound down, it left just this kind of impression by promising support, but often not delivering it. For several years, updates for serious security problems were delivered late, if at all. Any new effort in that direction would have to be very clear about what it was delivering and how it planned to get the job done. A project that offered few, if any, guarantees would not be seen as something very useful, but making guarantees that don't get met is far worse.
While there are clearly Fedora users that would be interested in hanging on to their operating system for longer than one year, it isn't clear that there are enough of them (and, more importantly, enough maintainers) to make a legacy project successful. Agreement on the goal of the project, along with the promises it would make to adopters, is important. It is difficult to see how the Fedora powers-that-be could allocate resources to such a project without those things. As Shmuel Siegel points out:

You are looking for infrastructure support from Fedora without indicating that there is a benefit to Fedora. Supply without demand is no more useful than demand without supply. Since Fedora views itself as "the cutting edge distro", you have an uphill PR fight. Give the Fedora project a reason to spend some of their limited resources on you. At least let them know your target audience and why they would be interested.

At least at this point, it doesn't seem like a revival of Fedora Legacy is in the cards, which leaves the problem unaddressed. Perhaps adding enough additional packages to EPEL will allow CentOS to truly become "Fedora LTS". It should be noted that while the original concern that LTS users might be switching to Ubuntu could well be true, Ubuntu LTS doesn't have a solution to the problem of package versions slowly getting obsolete either. Newer packages and stability are fundamentally at odds; trying to solve that problem is probably far too large a job for any community distribution.

HTTP response splitting

HTTP response splitting (HRS) is a technique that attackers can use to inject their own content into a web page. It exploits the way that HTTP delimits the boundary between its headers and the page content. It also is an example of that classic web application security bugaboo: improper filtering of user input.
The basic idea is that by injecting one or more carriage-return line-feed (CRLF) sequences into the output that a vulnerable web application returns, an attacker can control what goes to the victim's web browser. The HTTP response from a web server contains two parts: the headers that describe the content and the body which contains the HTML for the page. Each header is delimited by one CRLF and the header section is set off from the body by two CRLFs. It looks something like: Where the first section is the headers, followed by the start of the HTML content. The headers above are generated by the LWN web server directly, but sometimes headers can contain information that comes from a user's request, often in the form of cookies or redirections. If an attacker can sneak an extra CRLF or two into a header he controls, he can effectively create new header lines, or inject his own body content. Typically this is done by using the URL-encoding values for CR and LF: %0d and %0a. If the web application is not careful to check for and filter those characters, the HTTP response can be split. If, for example, the value of the name variable is set into a cookie using code like: then a name like "jake%0d%0a%0d%0a<html>surprise!</html>" could lead to some rather unexpected results. Obviously this is relatively benign, and only impacts someone who sets their name that way, but it does start to give an idea of the power of HRS. Incidentally, the code above is not random, it is adapted from that used to demonstrate a recent Mono HRS vulnerability. If one can only inject headers into one's own session, it hardly merits mention, but there are ways for an attacker to inject into a victim's browser stream. Perhaps the simplest is just by passing a parameter in the URL in time-honored fashion: http://some.vulnerable.site/app?name="...". If the attacker can get the victim to follow that link, they can control headers and body of what gets returned by the server. 
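The cookie scenario can be reproduced in a few lines. The following Python sketch is a toy model, not the Mono code referred to above: build_response() and split_response() are hypothetical helpers, and the status line, Content-Type header, and page body are assumed for illustration. It shows how a decoded CRLF pair ends the header section early, and one conservative defense, which is to reject any header value containing CR or LF.

```python
from urllib.parse import unquote

CRLF = "\r\n"


def build_response(name: str, sanitize: bool) -> str:
    """Build a raw HTTP response that stores `name` in a cookie.

    Toy model: a real application would go through its framework's
    response API rather than assembling the text by hand.
    """
    value = unquote(name)  # %0d%0a becomes a real CR and LF here
    if sanitize and any(ch in value for ch in "\r\n"):
        # Refuse control characters instead of trying to repair the value.
        raise ValueError("CR or LF in header value")
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/html",
        f"Set-Cookie: name={value}",
    ]
    body = "<html>expected page</html>"
    return CRLF.join(headers) + CRLF + CRLF + body


def split_response(raw: str):
    """Split at the first blank line, the way a client locates the body."""
    head, _, body = raw.partition(CRLF + CRLF)
    return head, body


attack = "jake%0d%0a%0d%0a<html>surprise!</html>"
head, body = split_response(build_response(attack, sanitize=False))
print(body.startswith("<html>surprise!</html>"))  # the injected content comes first
```

With sanitize=True the same input raises ValueError before any header is emitted. Rejecting the value outright is a design choice made for this sketch; stripping or encoding the offending characters are alternatives, but they are easier to get subtly wrong.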
Depending on the application, persistent versions, where a redirection URL, for example, was stored in a database, might be another way for an attacker to exploit HRS. HRS is not new, Amit Klein first described it [PDF] in 2004, but it does keep cropping up. As described in Klein's paper, it can be used for cross-site scripting (XSS), web cache poisoning, web site hijacking, and other nefarious activities. More recently, Jeremiah Grossman found HRS vulnerabilities to be surprisingly widespread. He was also surprised at the variety and nastiness of the effects of HRS vulnerabilities. HRS is not as well known as some of the other web application flaws, but it is a serious problem that needs to be considered when building or auditing such applications. Hopefully, we are starting to see some decline in the number of SQL injection, XSS, and other higher profile vulnerabilities, which may mean that attackers start looking towards the more obscure for exploitation. In what is likely to be a never-ending battle for control of our web applications, getting out ahead of the attacker community can only be a good thing. 2.6.28 merge window, part 2 As of this writing, just under 6200 non-merge changesets have been merged into the mainline kernel since the 2.6.27 release. This merge window should be drawing to a close around October 24, so we are getting closer to seeing what 2.6.28 will look like. 
User-visible changes merged since last week's update include: New drivers have been merged for Maxim/Dallas DS3234 SPI realtime clock chips, VIA UniChrome Family graphics chipsets, Toshiba Mobile IO framebuffers, C-Media CM109 USB phones, the touchpad shipped on OLPC XO systems, Automata Sercos III PCI cards (via UIO), Delcom USB 7-segment LED displays, generic USB test-and-measurement devices, Freescale QE/CPM USB device controllers, Vernier Software Technologies USB spectrometers, GPIO-connected NAND flash devices, Freescale i.MX2 and i.MX3 flash controllers, OMAP2/OMAP3-connected OneNAND flash devices, Dialog DA9030/DA9034 multifunction controllers, and Texas Instruments TWL4030/TPS659x0 multifunction controllers. The driver staging tree has been moved into the mainline. It brings with it a new TAINT_CRAP flag and suitably tainted drivers for Meilhaus ME-4000 data collection boards, Go 7007 ("some weird device") video controllers, Agere ET-1310 Gigabit Ethernet controllers, Atmel at76c503/at76c505/at76c505a wireless USB cards, Alacritech SLIC Technology non-accelerated 10Gb Ethernet cards, Alacritech IS-NIC gigabit Ethernet cards, Winbond w35und wireless network adapters, and Prism 2.5 USB wireless network adapters (a driver which includes its own 802.11 stack). Also added are an echo cancellation module and a driver which enables the passing of network packets over a USB link. A lot of work on the Intel i915 graphics driver has been merged; this work includes the Graphics Execution Manager (GEM) GPU memory management subsystem and "IGD OpRegion" support which enables ACPI backlight control. It looks like kernel-based mode setting might not make it for 2.6.28, but much of the rest of the big graphics rework is now merged. The way video drivers handle waiting for vertical blank cycles has been changed to reduce interrupts - and, thus, power consumption. Rik van Riel's memory management scalability patches have, at long last, been merged. 
These patches separate the management of anonymous, file-backed, and completely unevictable pages, eliminating a lot of useless page scanning. Another VM improvement causes the system to free a page's swap space after that page is brought back into RAM; this effectively increases the amount of swap available on the system. Nick Piggin's rewritten vmap layer should give significant performance improvements, especially as the number of CPUs on a system grows. Huge pages will now be included in core dumps, making the debugging of applications using those pages easier. The container freezer has been merged. It is now possible for the system to freeze all processes within a container (control group) as a unit. The KVM virtualization code has seen a number of improvements, including the ability to assign PCI devices to guests and support for Intel "Tukwila" processors. Kprobes are now supported by the SuperH architecture. There is a new ext3 mount option (data_err=abort) which causes filesystem operations to abort when I/O errors are encountered. In the absence of this option, the old behavior (continue but complain in the system log) remains. In-kernel interrupt balancing for 32-bit x86 systems has been removed. This feature has been deprecated (in favor of user-space balancing) for some time. Changes visible to kernel developers include: A number of tracing-related patches have been merged. These include the tracepoints mechanism, some instrumentation in the core scheduler code, improvements to the ftrace function tracing feature, a new ftrace-based stack tracer, a new ftrace-based boot (initcall) tracer, and the low-level trace buffer code. The sysctl strategy() function prototype has changed: the unused name and nlen parameters have been removed. Asynchronous I/O support can now be configured out of the kernel, saving about 7KB of space on systems where AIO is not needed. 
As planned, device_create_drvdata() has been renamed to device_create(), with the same parameters. There is now a mechanism to enable and disable output from pr_debug() and dev_dbg() calls on a per-module basis. Control is through a virtual file in debugfs. There is no documentation file associated with this change; instructions on how to use this feature can be found in the patch changelog. The new dev_WARN() function: will output the formatted warning, along with a full stack trace. This will allow the warnings to be collected at kerneloops.org and incorporated into the reports there. The new %pR formatting directive allows printk() and friends to output the contents of resource structures. There is a new function intended to make life easier for PCI driver writers: This function will remap the entire PCI I/O memory region, as selected by the bar argument. See next week's Kernel Page for a summary of the final days of the 2.6.28 merge window. A tale of two conferences Like many communities, the Linux community depends heavily on conferences as a way to help our developers and users know each other and work well together. We make highly effective use of electronic communications, but there is truly no substitute for occasionally getting together, sharing a beer or three, and engaging in some high-bandwidth discussion. So it stands to reason we want our events to be as productive and useful as possible, especially given the expense of participating in them. Your editor recently had the fortune of attending, over the course of one week, two conferences which are arguably the oldest and the newest in our community. They were both interesting events, but they were very different in their organization and attendance. Both show both strengths and weaknesses in our organization of face-to-face events. Arguably, the first Linux-related event ever was Linux-Kongress 1994. 
That gathering brought together developers working on the Linux kernel for the first time; it played host to a large portion of the (quite small) development community. For a period of time thereafter, Linux-Kongress was the development event for people working at or near the kernel level. It didn't take too long for other conferences (notably Linux Expo in the US) to grab some of the spotlight, but, unlike Linux Expo, Linux-Kongress is still an active conference. The 2008 event, in Hamburg, Germany, was well organized and a lot of fun; it was a pleasant gathering of a part of the community which your editor visits far too rarely. It was a technical conference for technical people, with a number of well-known developers present. But it must be said: Linux-Kongress is a small and relatively obscure event in 2008. There were maybe 200 attendees; much of the northern European development community was absent. Even some developers based in Hamburg declined to attend. The quality of the talks was not uniformly good, though some were excellent. And, in stark contrast to the recent Linux Plumbers Conference, it's hard to point at much work that got done. For something that was once the Linux development gathering, Linux-Kongress has clearly come down in the world. It is interesting to observe that Europe, while being the home to large numbers of free software developers, lacks a definitive development conference. That is not to say that no interesting events happen there; GUADEC and Akademy are probably the biggest desktop conferences, and the upcoming combined event is something to look forward to. But developers looking for a pan-European, Linux-oriented conference will not find one. LinuxConf.eu, a combination of the UKUUG and Linux-Kongress events held in Cambridge last year, offered the potential to become such an event, but the LinuxConf.eu idea appears to have stalled for now. 
From Hamburg, your editor flew straight to New York City, where the Linux Foundation's End-User Summit was held. This event, happening for the first time, differs greatly from Linux-Kongress in many ways. To begin with, it was an invitation-only event, and one which explicitly excluded the press (which is why there have been no LWN articles from there). It was also intended to host a mixture of developers and users, and to allow them to talk to each other. These characteristics led to a different sort of conference experience. The invitation-only nature of some Linux Foundation events naturally leads to complaints. We do not run an invitation-only community; excluding people from our conferences seems to run counter to the inclusive atmosphere we normally try to encourage. The Linux Foundation's reasoning here is easy to understand, though: many of the targeted end users (who represent mainly the financial industry in New York) have a hard time talking about what they are doing in any setting. In an open conference with press in attendance, those people will simply keep their mouths closed - if they show up at all. The user community represented by the financial industry is important; they are a significant part of the business which keeps the enterprise distributions going. Even now, they are highly sought after as customers. It is important to know what they are thinking and what their biggest difficulties with Linux are. In the absence of an event like the End User Summit, this information will only be communicated directly to the enterprise distributors under a non-disclosure agreement.
An invitation-only summit is fundamentally exclusive at one level, but it does help the development community (as opposed to one or two companies) get a sense for what this user community is thinking. So what are they thinking? They feel some stress between the stability of enterprise distributions and the desire to have the features developed by the community in recent years. They want good tracing mechanisms, but do not necessarily need the dynamic tracing provided by tools like DTrace or SystemTap. They like Linux because its broad hardware support frees them from reliance on any specific hardware vendor. They are very interested in work on next-generation filesystems. Some of them, at least, very much want to better understand how our development process works and, possibly, participate in it. See the Linux Foundation's press release for a summary of what was discussed there. It was a productive gathering, especially once the CEOs got off the stage and the attendees were able to talk to each other. But it points out another thing that we, as a community, lack: there are few forums where developers and users can get together and learn from each other. Developers tend to prefer the company of other developers; convincing them to go to more user-oriented events can be a challenge. So the closest thing we have to a combined user/developer event is the single-vendor conferences held by companies like Red Hat and Novell. Those, needless to say, are not the most community-oriented gatherings. They are not the best way to learn what our users are thinking. The proposed LinuxCon event, to be co-located with the 2009 Linux Plumbers Conference, may help to fill in this gap somewhat. Our community is blessed with a wealth of interesting gatherings worldwide. But that doesn't mean that we can't do better. 
Whether the subject is a true pan-European Linux gathering, user-oriented conferences, or something else altogether, there are always opportunities to find ways to help our community be more cohesive and productive. The trick is to expand communications to a broader community - as seen in our newest conference - while growing the open collaborative spirit exemplified by our oldest one. The source of the e1000e corruption bug When LWN last looked at the e1000e hardware corruption bug, the source of the problem was, at best, unclear. Problems within the driver itself seemed like a likely culprit, but it did not take long for those chasing this problem to realize that they needed to look further afield. For a while, the X server came under scrutiny, as did a number of other system components. When the real problem was found, though, it turned out to be a surprise for everybody involved. Tracking down intermittent problems is hard. When those problems result in the destruction of hardware, finding them is even harder. Even the most dedicated testers tend to balk when faced with the prospect of shipping their systems back to the manufacturer for repairs. So the task of finding this issue fell to Intel; engineers there locked themselves into a lab with a box full of e1000e adapters and set about bisecting the kernel history to identify the patch which caused the problem. Some time (and numerous fried adapters) later, the bisection process turned up an unlikely suspect: the ftrace tracing framework. Developers working on tracing generally put a lot of effort into minimizing the impact of their code on system performance. Every last bit of runtime overhead is scrutinized and eliminated if at all possible. As a general rule, bricking the hardware is a level of overhead which goes well beyond the acceptable parameters. So the ftrace developers, once informed of the bisection result, put in some significant work of their own to figure out what was going on. 
One of the features offered by ftrace is a simple function call tracing operation; ftrace will output a line with the called function (and its caller) every time a function call is made. This tracing is accomplished by using the venerable profiling mechanism built into gcc (and most other Unix-based compilers). When code is compiled with the -pg option, the compiler will place a call to mcount() at the beginning of every function. The version of mcount() provided by ftrace then logs the relevant information on every call. As noted above, though, tracing developers are concerned about overhead. On most systems, it is almost certain that, at any given time, nobody will be doing function call tracing. Having all those mcount() calls happening anyway would be a measurable drag on the system. So the ftrace hackers looked for a way to eliminate that overhead when it is not needed. A naive solution to this problem might be, rather than putting in an unconditional call to mcount(), to get gcc to add code which tests a flag and calls mcount() only when tracing is enabled. But the kernel makes a lot of function calls, so even this version will have a noticeable overhead; it will also bloat the size of the kernel with all those tests. So the favored approach tends to be different: run-time patching. When function tracing is not being used, the kernel overwrites all of the mcount() calls with no-op instructions. As it happens, doing nothing is a highly optimized operation in contemporary processors, so the overhead of a few no-ops is nearly zero. Should somebody decide to turn function tracing on, the kernel can go through and patch all of those mcount() calls back in. Run-time patching can solve the performance problem, but it introduces a new problem of its own. Changing the code underneath a running kernel is a dangerous thing to do; extreme caution is required. Care must be taken to ensure that the kernel is not running in the affected code at the time, processor caches must be invalidated, and so on.
To be safe, it is necessary to get all other processors on the system to stop and wait while the patching is taking place. The end result is that patching the code is an expensive thing to do. The way ftrace was coded was to patch out every mcount() call point as it was discovered through an actual call to mcount(). But, as noted above, run-time patching is very expensive, especially if it is done a single function at a time. So ftrace would make a list of mcount() call sites, then fix up a bunch of them later on. In that way, the cost of patching out the calls was significantly reduced. The problem now is that things might have changed between the time when an mcount() call is noticed and when the kernel gets around to patching out the call. It would be very unfortunate if the kernel were to patch out an mcount() call which no longer existed in the expected place. To be absolutely sure that unrelated data was not being corrupted, the ftrace code used the cmpxchg operation to patch in the no-ops. cmpxchg atomically tests the contents of the target memory against the caller's idea of what is supposed to be there; if the two do not match, the target location will be left with its old value at the end of the operation. So the no-ops will only be written to memory if the current contents of that memory are a call to mcount(). This all seems pretty safe, except that it fell down in one obscure, but important case. One obvious place where an mcount() call could go away is in loadable modules. This can happen if the module is unloaded, of course, but there is another important case too: any code marked as initialization code will be removed once initialization is complete. So a module's initialization function (and any other code marked __init) could leave a dangling reference in the "mcount() calls to be patched out" list maintained by ftrace. 
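The compare-and-exchange guard described above can be illustrated with a small user-space sketch in C11. This is not the ftrace code: the function name and the two 32-bit "instruction" values are hypothetical placeholders, and the example works on ordinary memory only. It shows just the property the ftrace developers were relying on, which is that the no-ops are stored only if the expected call is still present.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings, made up for illustration; they are not
 * real x86 instruction sequences. */
#define FAKE_MCOUNT_CALL 0xE8AABBCCu
#define FAKE_NOPS        0x90909090u

/*
 * Patch a recorded call site: store the no-ops only if the site still
 * holds the expected call. Returns true if the site was patched,
 * false if it held something else and was left alone.
 */
static bool patch_out_call(_Atomic uint32_t *site)
{
    uint32_t expected = FAKE_MCOUNT_CALL;

    assert(site != NULL);
    return atomic_compare_exchange_strong(site, &expected, FAKE_NOPS);
}
```

On ordinary RAM, a stale entry in the list is harmless under this scheme: the comparison fails and the old contents remain in place. The guarantee is only as good as the memory behind the pointer, though, which is where the rest of this story comes in.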
The final piece of this puzzle comes from this little fact: on 32-bit architectures, memory returned from vmalloc() and ioremap() share the same address space. Both functions create mappings to memory from the same range of addresses. Space for loadable modules is allocated with vmalloc(), so all module code is found within this shared address space. Meanwhile, the e1000e driver uses ioremap() to map the adapter's I/O memory and NVRAM into the kernel's address space. The end result is this fatal sequence of events: A module is loaded into the system. As part of the module's initialization, a number of mcount() calls are made; these call sites are noted for later patching. Module initialization completes, and the module's __init functions are removed from memory. The address space they occupied is freed up for future use. The e1000e driver maps its I/O memory and NVRAM into the address range recently occupied by the above-mentioned initialization code. Ftrace gets around to patching out the accumulated list of mcount() calls. But some of those "calls" are now, actually, I/O memory belonging to the e1000e device. Remember that the ftrace code was very careful in its patching, using cmpxchg to avoid overwriting anything which is not an mcount() call. But, as Steven Rostedt noted in his summary of the problem: The cmpxchg could have saved us in most cases (via luck) - but with ioremap-ed memory that was exactly the wrong thing to do - the results of cmpxchg on device memory are undefined. (and will likely result in a write) The end result is a write to the wrong bit of I/O memory - and a destroyed device. In hindsight, this bug is reasonably clear and understandable, but it's not at all surprising that it took a long time to find. One should note that there were, in fact, two different bugs here. One of them is ftrace's attempt to write to a stale pointer. 
But the other one was just as important: the e1000e driver should never have left its hardware configured in a mode where a single stray write could turn it into a brick. One never knows where things might go wrong; hardware should never be left in such a vulnerable state if it can be helped. The good news is that both bugs have been fixed. The e1000e hardware was locked down before 2.6.27 was released, and the 2.6.27.1 update disables the dynamic ftrace feature. The ftrace code has been significantly rewritten for 2.6.28; it no longer records mcount() call sites on the fly, no longer uses cmpxchg, and, one hopes, is generally incapable of creating such mayhem again. Reworking vmap() Kernel memory is normally allocated in relatively small chunks - usually just a single page at a time. As the size of an allocation grows, satisfying that allocation with physically-contiguous pages gets progressively harder. So most of the kernel has been written with an eye toward avoiding the use of large, contiguous allocations. There are times, though, when a large memory array needs to be virtually contiguous, but not necessarily physically contiguous. One example is the allocation of space for loadable modules; any given module should live in a single, contiguous address range, but nobody cares how it's laid out in physical RAM. For cases like this, the kernel provides a set of functions like vmalloc() and vmap(). Functions like vmalloc() have long been known to be somewhat expensive to use. They have to work with a single shared (and limited) address range, and they require making changes to the kernel's page tables. Page table changes, in turn, require translation lookaside buffer (TLB) flushes, which are a costly, all-CPUs operation. So kernel developers have generally tried to avoid using these functions in performance-critical parts of the kernel. Nick Piggin has noticed, though, that the performance characteristics of vmalloc() and friends are catching up with us. 
The vmalloc() address space is kept on a linked list and protected by a global lock, which does not scale very well. But the real cost is in freeing memory regions in this space; the ensuing TLB flush must be done using an inter-processor interrupt to every CPU, each of which must then flush its own TLB. People normally do not buy more CPUs unless they have more work to run on them, so systems with more processors will, as a general rule, be performing more mapping and freeing in the vmalloc() range. As systems grow, there will be more global TLB flushes, each of which disrupts more processors. In other words, the amount of work grows proportional to the square of the number of processors - meaning that everything falls down, eventually. To make things worse, Nick has a longstanding series of patches which, among other things, do a lot of vmap() calls to support larger block sizes in the filesystem layer and page cache. Merging those patches would add significantly to the amount of time the system spends managing the vmalloc() space, which would not be a good thing. So fixing vmalloc() seems like a good thing to do first. As of 2.6.28, Nick has, in fact, fixed the management of kernel virtual allocations. The first step is to get rid of the linked list and its corresponding global lock. Instead, a red-black tree is used to track ranges of available address space; finding a suitable region can now be done without having to traverse a long list. The tree is still protected by a global lock, which poses potential scalability problems. To avoid this issue, Nick's patch creates a separate, per-CPU list of small address ranges which can be allocated and freed in a lockless manner. New functions must be called to make use of this facility: A call to vm_map_ram() will create a virtually-contiguous mapping for the given pages. The associated data structures will be allocated on the given NUMA node; the memory will have the protection specified in prot. 
With the version of the patch merged for 2.6.28, mappings of up to 64 pages can be made from the per-CPU lists. Note that these functions do not allocate memory; they just create a virtual mapping for a given set of pages. They are a replacement for vmap() and vunmap(), not vmalloc() and vfree(). It is probably possible to rewrite vmalloc() to use this mechanism, but that has not happened. So vmalloc() calls still require the acquisition of a global lock. There's another trick in this patch set which is used by all of the kernel virtual address management functions. Nick realized that it is not actually necessary to flush TLBs across the system immediately after an address range is freed. Since those addresses are being given back to the system, no code will be making use of them afterward, so it does not matter if a processor's TLB contains a stale mapping for them. All that really matters is that the TLB gets cleaned out before those addresses are used again elsewhere. So unmapped regions can be allowed to accumulate, then all flushed with a single operation. That cuts the number of TLB flushes significantly. How much faster do things run? Nick's patch contains some benchmark results. With an artificial test aimed at demonstrating the difference, the new code runs 25 times faster. By changing the vmap() code in the XFS filesystem to use vm_map_ram() instead, some workloads were sped up by a factor of twenty. So it seems to work.

Mozilla releases Firefox 3.1 Beta 1

Version 3.1 Beta 1 of the popular Mozilla Firefox web browser was announced on October 14, 2008. This is a testing release: Firefox 3.1 Beta 1 is a public preview release intended for developer testing and community feedback. It includes many new features as well as improvements to performance, web compatibility, and speed. We recommend that you read the release notes and known issues before installing this beta.
The release announcement and the Web Developer Feature Overview page discuss the new capabilities in more detail. The major new additions include: Support has been added for the HTML <video> and <audio> elements using the Ogg Theora and Ogg Vorbis formats. Geolocation features have been added, but not in the Linux version. The Gecko layout engine has some improved web standards implementations. More CSS 2.1 and CSS 3 properties have been implemented. Support for the CSS @font-face rule has been added (Mac OS X and Windows only), allowing support for downloadable user-specified TrueType fonts. Support for Access Control for Cross-Site Requests has been added. Beta support for Mozilla's TraceMonkey JavaScript engine has been added. Some new customizations are available for controlling the Smart Location Bar. JavaScript web worker threads are being worked on. New graphics, SVG and CSS capabilities are being added. Improvements have been made to the browser tabs including: A new "Open a new tab" button has been added to the tab bar. Support for switching between tabs with Ctrl-Tab has been added. Tabs can now be dragged and dropped between Firefox windows. More features are planned for the official Firefox 3.1 release. Your author spent an entire day doing his normal LWN work using Firefox 3.1 Beta 1 on an Ubuntu 8.04 system. The only problem that showed up was choppy and aliased audio playback when viewing some of the recommended test videos. Otherwise, the browser worked well. Firefox 3.1 Beta 1 is available for download; it is a good idea to read the release notes first.

OpenStreetMap contemplates licensing

Maps are cool; there's no end of applications which can make good use of mapping data. There is plenty of map data around, but it's almost exclusively proprietary in nature. That makes this data hard to use with free applications; it's also inherently annoying.
We, as taxpayers, own those streets; why should we have to pay somebody else to know where the streets are? Your editor likes to grumble about such things; meanwhile, the OpenStreetMap project (OSM) is busily doing something about it. OSM has put together a database and a set of tools making it easy for anybody to enter location data with the intent of producing a free mapping database with global coverage. It is an ambitious project, to say the least, but it's working: Right now on each and every day, 25,000km of roads gets added to the OpenStreetMap database, on the historical trend that will be over 200,000km per day by the end of 2009. And that doesn't include all the other data that makes OpenStreetMap the richest dataset available online. OSM data is not limited to roads; just about any point or track of interest can be added to the database. If current trends continue, OSM could well grow into the most extensive geolocation database anywhere - free or proprietary. And those trends could well continue; one of the nice aspects of this kind of project is that no particular expertise is needed to contribute. All you need is a GPS receiver and some time; some OSM local groups have even acquired a set of receivers to lend out to interested volunteers. This is our planet, and we can all help to map it. All this work raises an interesting question, though: under what license should this accumulated data be distributed? Currently, the OSM database is covered by the Creative Commons Attribution-ShareAlike 2.0 license. It is a copyleft-style license, requiring that derived products be made available under the same license. So, for example, if a GPS navigator manufacturer were to include an enhanced version of the OSM database in its products, it would have to release the enhanced version under the CC by-SA license. The OSM project is not happy with this license, though, and is looking to make a change. 
The attribution requirement is ambiguous in this context; do users need to credit every OSM contributor? Does making a plot of OSM data with added data layered on top create a derived product? But the scariest question is a different one: can the CC by-SA license cover the OSM database at all? Copyright law covers creative expression, not facts. The information in the OSM database is almost entirely factual in nature; one cannot copyright the location of a street corner. So what OSM is trying to protect is not the individual locations, but the database as a whole. Copyright law does allow for the protection of databases, but that law is far more complex than the law for pure creative works, and it varies far more between jurisdictions. Europe has a specific (though much-derided) database right, the US has far weaker database protections, and other parts of the planet lack this protection altogether. So it may well be that, if some evil corporation decides to appropriate the OSM database for its own nefarious, proprietary purposes, there will be nothing that the OSM project can do about it. So the project is thinking of making a switch to the Open Database License (ODbL), which is still being developed. It, too, is a copyleft-style license, but it is crafted to make use of whatever database protection is available in a given jurisdiction. To that end, the ODbL is explicitly structured as a contract between the database owner and the user. In any jurisdiction where database rights are not recognized under copyright law, the contractual nature of the ODbL should provide a legal basis to go after license violators. But the use of contract law muddies the water considerably; there are good reasons why free software licenses are carefully written to avoid that path. Contracts are only valid if they are explicitly and voluntarily entered into by all parties. If the OSM cannot show that a license violator agreed to abide by the license, it has no case under contract law. 
The project has a plan to address this problem: To ensure that potential users are aware of and agree to the contract terms, we are proposing to require a click-through agreement before downloading data. (All registered users would agree to this on signing up so will not need a further click-through on each download.) Registration and clickthrough licensing are obnoxious, to say the least. But, in any case, the only people who will go through that process are those who obtain the database directly from OpenStreetMap. The ODbL allows redistribution, naturally, and it does not require that explicit agreement be obtained from recipients of the database. So it is hard to see an outcome where copies of the database lacking a "signed" contract do not proliferate. Additionally, reliance on contract law makes it very hard to get injunctive relief, weakening any enforcement efforts considerably. The ODbL includes an anti-DRM measure; if a vendor locks down a copy of the database with some sort of DRM scheme, that vendor must also make an unrestricted copy available. This license tries to distinguish between "collective databases" (which are not derived works) and "derivative databases" (which are). Drawing layers on top of an OSM-based map is a collective work; tracing lines from such a map is a derivative work. It is, in general, a complex bit of work. It is complex enough that a number of OSM contributors are wondering if it's all worth it. Jordan Hatcher is one of the authors of the ODbL, and he supports its use with OSM, but even he understands the concerns that some people have: The [Science Commons] point is that all this sort of stuff can be a real pain, and isn't what you are really doing is wanting to create and manipulate factual data? Why spend all the time on this when the innovation happens in what you can do with the data, and not with trying to protect the data in the first place. 
There is an active group within OSM which is opposed to this kind of licensing and would, in fact, rather just get down to the task of collecting and distributing the data. They express themselves in terms like this: One thing I really love about OSM is the pragmatic, un-political approach: You don't give us your data, fine, then we create our own and you can shove it. Not: You don't give us your data, fine, then we create a complex legal licensing framework that will ultimately get you bogged down in so many requests by prospective users who would like to use our data and yours but cannot and you will sooner or later have to release your data according to the terms we dictate and then we will have won and the world will be a better place. These contributors would rather that OSM release its data into the public domain - or something very close to that. Rather than put together a complicated license, they prefer to just publish their data for anybody to use as they see fit. There have been all of the usual discussions which resemble any "GPL vs. BSD" licensing flame war one has ever seen - except that the OSM folks appear to be a very polite crowd. It comes down to the usual question: will the OSM database become more complete and useful if those who extend it are forced to contribute back their changes? The public domain contingent clearly does not believe that any improvements to the database obtained via licensing constraints will be worth the trouble. So it seems likely that there will be some sort of fork involving the creation of a smaller, purely public-domain OSM database. It may well be an in-house fork, with the public domain data being merged into the larger, more restrictively licensed database for distribution. Regardless of how that goes, this split raises issues of its own: how are the two databases to be kept distinct in the face of cooperative additions and edits?
Any relicensing of the database also brings up another interesting question: what to do about all of the existing data, which may or may not be copyrighted by those who contributed or edited it? The license change may well require a process of getting assent from all contributors and purging data obtained from those who do not agree. This proposed timeline shows how the project is thinking about working through this task. It is hard to imagine this process going entirely smoothly. The OSM community clearly has a set of thorny issues to work out. Given that, it's not surprising that this process has already been dragged out over the better part of a year. How this issue is eventually resolved will certainly serve as an example - not necessarily a good example - for other projects working on free compilations of factual data. Let us hope that OSM can come to a solution which lets this project continue to grow and generate a valuable database that we all will benefit from. K12Linux - Fedora 9 with LTSP The K12Linux project builds on the efforts of K12LTSP, which started working with the Linux Terminal Server Project (LTSP) on Red Hat Linux before switching to Fedora and CentOS. The newly named K12Linux project recently announced the release of K12Linux Release Candidate 1. The Linux Terminal Server Project provides software that adds thin-client support to Linux distributions. The project's documentation page has pointers to using LTSP with Ubuntu, openSUSE, Fedora and Debian, along with instructions for Integrating LTSP-5 into your favorite Linux distribution. LTSP provides server and client software for a single server and many thin clients or diskless terminals. This can be an inexpensive way to provide files and applications for many users. While often used in schools, LTSP has many other applications as well. 
K12 refers to the USA primary school system, where children start their education in Kindergarten (from the German) and go through grade 12 before going on to a university. This brings us back to K12Linux, the new name for continuing efforts to integrate LTSP with Fedora. Currently these efforts are focused on LTSP 5 and Fedora 9. This RC release contains Fedora 9 and all updates as of October 12, 2008, with LTSP-5.1.26, ldm-2.0.13, ltspfs-0.5.5, many bug fixes and new K12Linux-themed artwork for the login screen. This release comes as a live image suitable for a USB key or a DVD; both come with the client chroot already installed and configured. If you are already running Fedora 9 and would like to try this release, you can use the instructions in the install guide instead of the live media. Either way, if you are looking for an easy way to get LTSP running, give K12Linux a try.

Closing out the 2.6.28 merge window

About 1000 changesets were merged after the previous summary was posted here. Many of those came from architecture-specific trees. Other changes merged this time around include: There are new drivers for Mellanox ConnectX 10GbE network adapters, PowerPC PPC40x and PPC44x GPIO controllers, Panasonic "Let's Note" laptop special keys, Sharp SL-6000 backlight and LCD devices, Dialog Semiconductor DA9030/DA9034 backlight devices, Tabletkiosk Sahara Touch-iT backlight devices, and Toshiba TX4939 SoC ATA controllers. One more not-ready-for-prime-time driver was merged via the staging tree; this one supports Redrapids Pocket Change cardbus devices. The staging tree also brought an extensive set of fixes to the drivers added earlier in the merge window. The kernel has gained support for ultra-wideband protocol stacks. UWB can be used for normal networking, but the immediate application is wireless USB, which will be supported in 2.6.28. The ACPI docking station code has gained support for bay and battery hotplug events. The IA64 architecture now supports Xen.
Also added to IA64 is support for DMA remapping devices (IOMMUs). Support for kdump has been added to the PowerPC architecture. The 9P (Plan9) filesystem now has RDMA support. Changes visible to kernel developers include: There is a new core_param() macro: Its purpose is to define "core" parameters and let them be represented in /sys/module/kernel/parameters. It is now possible to create a workqueue running at realtime priority with: The block driver API has changed considerably, with the inode and file parameters being removed from most block device operations. The new API looks like this: The new prototypes do away with the file and inode structure pointers which were passed in previous kernels. Note that the ioctl() method is now called without the big kernel lock; code needing BKL protection must explicitly define a locked_ioctl() function instead. The range timer API has been merged; callers can now specify a time period in which they would like the timeout to be delivered. The kernel can then take advantage of the range to coalesce wakeups and keep the processor idle for longer periods. This time around, linux-next maintainer Stephen Rothwell has put together a list of linux-next patches which did not get into 2.6.28. Perhaps the biggest omission was the credentials work, which seemed poised to go in this time around. Other changes which failed to get merged include the message catalog code (which looks like it will need a change of approach) and TOMOYO Linux (which seems to be caught up in the same old "new security module with pathname-based rules" swamp). Now the stabilization period starts. Linus, perhaps, was trying to set the tone for this development cycle when he released a much smaller and earlier 2.6.28-rc2 than would have normally been expected. By way of comparison: 2.6.25-rc2 had 359 patches applied since 2.6.25-rc1. For 2.6.26-rc2, 446 changesets were merged, and, for 2.6.27-rc2, the count was 780. 
For 2.6.28-rc2, instead, a total of 22 changes went in. Says Linus:

And hey, maybe we can even _continue_ the nice model of "just small fixes after -rc1". I know, it sounds insane, but it's a real pleasure to do an -rc2 with just a handful of fixes for real problems that real people see. What a concept!

Should this pattern hold, it may well be that 2.6.28 will stabilize more quickly and successfully than its predecessors. It will, in any case, be interesting to watch.

Networking change causes distribution headaches

A seemingly innocuous change to the networking code that went into the 2.6.27 kernel is now causing trouble for various distributions. Ubuntu, Fedora, and openSUSE are all buttoning up their packages for a release in the near future (with Ubuntu's due this week), so kernel changes are not particularly welcome. Unfortunately, if the problem is not addressed, some users may never be able to download a fix because their TCP/IP won't interoperate with some broken equipment on the internet.

The problem stems from changes that were made to clean up the TCP option code that were merged back in July as part of the 2.6.27 merge window. TCP options are a mechanism to expand the functionality of the protocol as conditions change. There are a handful of commonly used options that the two endpoints of a connection can agree to use, for things like maximum segment size (MSS), window scaling, selective acknowledgment (SACK), and timestamps. Options have been added over time to provide more internet robustness and performance as well as to support higher-bandwidth physical connections.

A perfectly reasonable, if unintended, consequence of the code change was that the options were put into the header in a slightly different order. According to the relevant RFCs, options can appear in any order in the option section of the TCP header. But some home and/or internet routers seem to expect a fixed order, refusing to make connections if the order is "wrong".
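To make the ordering issue concrete, here is a minimal Python sketch (not kernel code) that packs the option section of a TCP header in two different orders. The option kind and length values follow the standard TCP option encodings; the helper names and the sample values (an MSS of 1460, a window scale of 7) are assumptions chosen purely for illustration.

```python
import struct

NOP = b"\x01"  # one-byte no-operation option, used here as padding


def mss(value):
    # Kind 2, length 4: maximum segment size
    return struct.pack("!BBH", 2, 4, value)


def sack_permitted():
    # Kind 4, length 2: SACK permitted
    return struct.pack("!BB", 4, 2)


def timestamps(tsval, tsecr=0):
    # Kind 8, length 10: timestamp value and echo reply
    return struct.pack("!BBII", 8, 10, tsval, tsecr)


def window_scale(shift):
    # Kind 3, length 3: window scale shift count
    return struct.pack("!BBB", 3, 3, shift)


def build_options(*opts):
    """Concatenate options and pad to a multiple of four bytes."""
    raw = b"".join(opts)
    return raw + NOP * ((-len(raw)) % 4)


def option_kinds(raw):
    """Walk the option bytes and return the kinds in wire order."""
    kinds, i = [], 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:          # end of option list
            break
        if kind == 1:          # NOP has no length byte
            i += 1
            continue
        length = raw[i + 1]
        if length < 2:         # malformed option; stop rather than loop
            break
        kinds.append(kind)
        i += length
    return kinds


mss_first = build_options(mss(1460), sack_permitted(),
                          timestamps(12345), window_scale(7))
sack_first = build_options(sack_permitted(), mss(1460),
                           timestamps(12345), window_scale(7))
```

Both byte strings carry the same set of options, and a receiver that parses them as `option_kinds()` does will accept either; a device that assumes one particular layout will not.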
In particular, it would seem that the MSS option needs to appear before the SACK option. The bug was reported to Ubuntu Launchpad in early September, but not a lot of progress was made until it was added to the kernel.org bugzilla in early October. It seems to have only affected a relatively small number of users—Red Hat's Dave Jones said that there were no reports from users of the rawhide 2.6.27 kernel—as it was rather hardware-specific. This made it difficult to track down for the majority of folks who couldn't reproduce it. Ubuntu user Aldo Maggi, who filed the kernel bug, sets a marvelous example of how to work with the kernel hackers to track down the problem as can be seen in the bugzilla entry. Eventually, the option re-ordering problem was discovered and a patch was submitted by Ilpo Järvinen that restored the order of the options. Along the way, with help from Mandriva, it was discovered that turning off TCP timestamps by way of: worked around the problem without changing the kernel—at the cost of losing the TCP timestamp functionality. So it would seem that the problem has been solved—the patch has been merged into Linus Torvalds's tree for 2.6.28—but there are still a few unresolved issues. The three distributions that are preparing new releases are all based on 2.6.27, but as yet, there has not been a -stable kernel release that picks up the patch, though it is likely to come fairly soon. In the meantime, Fedora has added the patch to its kernel in rawhide, so Fedora 10 (and eventually Fedora 9 when it gets rebased on 2.6.27) will have the fix. openSUSE is waiting a bit to see what gets submitted by the kernel networking developers to the -stable team. As Novell/SUSE kernel hacker Greg Kroah-Hartman puts it: "We still have a while to go before the final 11.1 kernel is released, so we feel no pressure here." Unfortunately, Ubuntu got caught very late in its release cycle as 8.10 (or Intrepid Ibex) is due on October 30. 
The original plan as outlined by Debian/Ubuntu hacker Steve Langasek was to note the problem in the release notes for 8.10, but not address the underlying problem until after the release: The kernel fix is known upstream; implementing it requires kernel uploads and installer rebuilds, which it's just not possible to fit in between the release candidate and the release. We will certainly want to include this fix in a kernel update as soon as possible after the release, but this is unfortunately in a class of bugs that we can't fix the week of release (even turning timestamps off requires a kernel upload, unless we want to permanently disable tcp timestamp support for Ubuntu 8.10). That led many in the Launchpad bug thread to note that it was going to be a real mess, especially for the least technical of users. Nick Lowe sums up the problem: [...] You should really delay for this if you need more time... RC shouldn't mean Release ComeHellOrHighWater The users who are most likely to hit this are home users behind their aged/unmaintained consumer routers who are highly unlikely to understand why they can't access the Web and will just go elsewhere... Certainly, the release notes are not the first place an affected user would go if they ran into the problem. More than likely, they would just decide that Ubuntu—by extension Linux—is simply broken, so it is a relief to see that Ubuntu eventually relented. For 8.10, the procps package has been changed to work around the problem by turning off timestamps. Once a new kernel package is released with the re-ordering patch included, timestamps can presumably be restored. This kind of problem—where affected users may not be able to retrieve an update to fix it—should really be part of the definition of a show-stopping (i.e. release date slipping) problem. 
It was rather galling to some that Ubuntu would consider shipping with this known issue, simply to make its 8.10 release in the 10th month of 2008 (which is how Ubuntu releases are numbered). Ubuntu is justifiably proud of its record of shipping releases on time, but it cannot do that at the expense of its users. While the workaround that was implemented was suboptimal, perhaps, it does ensure that users—especially non-technical users—won't find that web surfing doesn't work in Linux. It should also allow Ubuntu to release on schedule. [ Thanks to Nick Lowe for giving us a heads-up about this issue. ] Tracking tbench troubles Kernel developers tend to have a mixed view of benchmarks. A benchmarking tool can do an effective job of quantifying specific aspects of system performance. But benchmarks are not real workloads; optimizing for a benchmark can often distort a system in ways which are detrimental to real applications. Since kernel hackers do not always see benchmark optimization as their top priority, they can sometimes assign a lower priority to benchmark regressions as well. But, sometimes, benchmark problems indicate a real problem in the kernel. The tbench benchmark is meant to measure networking performance; it consists of a collection of processes quickly making lots of small requests from a server process. Since the requests are small, there is not much time spent actually moving data; it's all a matter of shifting small packets around - and scheduling between the processes. Back in August, Christoph Lameter reported that tbench performance in the mainline kernel had been declining for some time. His system was able to move 3208 MB/sec with a 2.6.22 kernel, but only 2571 MB/sec with a 2.6.27-rc kernel. Each of the releases in between showed a decline from the one which came before, with 2.6.25 showing an especially big hit. 
Others were able to reproduce the results, and they engaged in various rounds of speculation on where the problem might be, but it seems that, initially, nobody actually dug into the system to see what was going on. At linux.conf.au 2007, Andi Kleen gave a talk describing various types of kernel hackers. One of those was the "Russian mathematician" who, he suspected, was often a room full of talented developers operating under a single name. Evgeniy Polyakov can only have reinforced that view when, in early October, he tracked down the biggest offending commit through a process which, he says, involved "just [a] couple of hundreds of compilations." In the process, he put together a plot of tbench performance which, he says, is suitable for scaring children. Through a massive amount of work, he was able to point the finger at a scheduler patch - not something in the networking stack at all. In particular, Evgeniy found that the patch adding high-resolution preemption ticks was the problem. The idea behind this patch was to make time slices more accurate by scheduling preemption at just the right time. It makes sense; once the regular clock tick has been eliminated, there is no reason not to arrange for preemption to happen when the scheduling algorithm says it should. Unfortunately, it seems that this change also adds sufficient overhead to slow down tbench performance considerably; when Evgeniy backed it out, his performance went from 373 MB/sec to 455 MB/sec. That would seem to be a pretty clear indication that something is amiss with high-resolution preemption ticks. At this point, the public discussion went quiet, though it appears that a number of developers were working on it off-list. David Miller eventually tracked down the worst of the trouble to the wakeup code, something he was rather vocally unhappy about having had to do. Eventually a patch was merged (for 2.6.28-rc2) disabling the high-resolution preemption tick feature. 
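For a sense of scale, the throughput figures reported in this discussion can be turned into relative changes. The short Python sketch below does only that arithmetic; the function name is an assumption, and the numbers are the ones quoted in the text.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Christoph Lameter's figures: 2.6.22 versus a 2.6.27-rc kernel (MB/sec)
regression = percent_change(3208, 2571)

# Evgeniy Polyakov's figures: before and after backing out the
# high-resolution preemption tick patch (MB/sec)
recovery = percent_change(373, 455)

print(f"2.6.22 -> 2.6.27-rc: {regression:.1f}%")
print(f"patch backed out:    {recovery:+.1f}%")
```

The two sets of numbers come from different machines, so they should not be compared directly; each only shows the size of the change on its own system.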
Since the discussion is private, it's not quite clear why this change took as long as it did. But there are a couple of plausible reasons. One is that this particular feature is disabled by default anyway, so most users will not encounter the performance problem it creates. But there is also the question of weighing the benchmark result against the effects on other, "real" workloads. Ingo Molnar said:

But it's a difficult call with no silver bullets. On one hand we have folks putting more and more stuff into the context-switching hotpath on the (mostly valid) point that the scheduler is a slowpath compared to most other things. On the other hand we've got folks doing high-context-switch ratio benchmarks and complaining about the overhead whenever something goes in that improves the quality of scheduling of a workload that does not context-switch as massively as tbench. It's a difficult balance and we cannot satisfy both camps.

So, by this view, performance on scheduler-intensive benchmarks must be weighed against the wider value of other scheduler enhancements. David Miller has a different view of the situation, though:

If we now think it's ok that picking which task to run is more expensive than writing 64 bytes over a TCP socket and then blocking on a read, I'd like to stop using Linux. :-) That's "real work" and if the scheduler is more expensive than "real work" we lose.

In David's view, scheduler performance has been getting consistently worse since the switch to the completely fair scheduler in 2.6.23. He would like to see some energy put into recovering some of the performance of the pre-CFS scheduler; in particular, he thinks that Ingo and company should work to fix (what he sees as) a regression that they caused. For the time being, the worst performance regression has been "fixed" by disabling the high-resolution preemption tick feature; Ingo says that the feature will not come back until it can be supported without slowing things down.
But the scheduler seems to have gotten slower in a number of other ways as well. Your editor will make a prediction here: now that the issue has been called out in such clear terms, somebody will find the time to fix these problems to the point that the CFS scheduler will be faster than the O(1) scheduler which preceded it. Beyond that, there are suggestions that the scheduler cannot take the blame for all of the observed regressions in tbench results. So developers will have to look at the rest of the system to figure out what's going on. The good news is that this is a clear challenge with an objective way to measure success. Once a problem reaches that level of clarity, it's usually just a matter of some hacking. Debian's election season: old firmware and new contributors Longtime LWN readers will be aware of your editor's tendency toward the publishing of wild predictions at the beginning of each year. The 2007 predictions irritated some Debian developers and users by suggesting that, after getting the Etch release out the door, the project would go back to arguing about firmware issues. At the end of the year, it became necessary to acknowledge that this prediction, like so many others, had failed to come to pass. In retrospect, the error in this prediction was obvious: the Debian Project traditionally saves the firmware argument for the end of the release process. After all, they need to find some way to delay a release once it's looking close to ready. The problem with firmware, of course, is that it is a binary blob lacking the corresponding source, and, sometimes, even a license allowing its distribution. Many developers and users see that blob as being part of the hardware; as long as the blob is distributable, it does not bother them. Others, though, regard firmware blobs as proprietary software and their incorporation into the kernel as a GPL violation. 
The Debian Project, which promises to deliver a 100% free distribution to its users, houses many developers from the latter camp. These developers, who see firmware distribution as a violation of the project's social contract, can be counted upon to raise the issue each release cycle. In 2004, the project responded by passing a general resolution suspending some social contract provisions through September 1 of that year on the reasoning that it would be long enough to get the Sarge release done. Putting a date on a Debian release tends to be a mistake, though; Sarge was not finished until June, 2005. By unspoken consensus, that date was somehow deemed to have fallen before September 1, 2004. In 2006, the project voted again on firmware. Having learned from experience, the exception they allowed this time lacked a date, simply saying that the presence of binary-only firmware in the Etch release was something the project was willing to tolerate. The 2008 discussion started when Ben Finney pointed out that a number of firmware-related entries in the Debian bug tracking system had been quietly marked "lenny-ignore" - not relevant to the upcoming Lenny release. This action, many have subsequently argued, runs counter to the social contract and constitution, which do not allow the shipping of non-free software to be swept under the carpet in this way. They would, instead, like to see the kernel team remove the (relatively few) firmware blobs remaining in the kernel. Such a change, it is said, should be relatively easy; recent changes within the kernel are helpful in this regard - though said changes became available in 2.6.27, which is not the kernel expected to be shipped with the Lenny release. For the 2.6.26 kernel used by Lenny, Ben Hutchings reports that he has done the necessary work to excise the remaining firmware. 
On the other side, there are developers who are more concerned about (1) getting the Lenny release out as quickly as possible, and (2) making sure that hardware Just Works for Lenny users. They would rather that the process of removing firmware continue independently of (and without delaying) the Lenny release. This is Debian that we're talking about, so the issue will probably be decided by way of a general resolution. There are currently two sets of resolutions being circulated, though neither has reached a final state for voting. The first set addresses the Lenny question, providing two options: either delay Lenny until the firmware removal work is complete, or accept that - just once more, really this time, honest - a major Debian release will include some firmware in its kernel. (The "ship Lenny" option is actually two options, one allowing firmware and one allowing Debian Free Software Guidelines violations in general). What the project will decide once this resolution comes to a vote is unclear - but Debian's developers have always voted to get the release out in the past. The second proposal addresses what happens after the Lenny release; it says that any package which violates the Debian Free Software Guidelines for more than 180 days will be forced into the non-free repository. The clear hope here is to ensure that this tiresome discussion doesn't happen yet again in the next release cycle. By the time the next release is getting close to ready, any non-compliant packages will have long since been banished to the non-free wasteland. If it ever comes down to moving the kernel to non-free, though, one can assume that the discussion will resume with a vengeance. Developers, Members, Maintainers, and Contributors Meanwhile, a different disagreement is headed toward - you guessed it - a general resolution. Long-time Debian watchers have noted that another recurring topic of debate is the acceptance of new developers. 
The new maintainer process involves long delays, tests of ideological purity, and more. Even when it works smoothly (which seems to generally be the case in recent years) it requires a certain amount of patience and determination on the part of an aspiring Debian Developer. The difficulty of the process is a design feature; Debian developers occupy a position of some trust, and the project wants to make sure that applicants are serious. Over time, though, it has become clear that this process is costing the project the time and energy of talented contributors who do not wish to jump through all the hoops. In response, the project created a "Debian maintainer" designation which allows the uploading of packages, but withholds many of the other privileges enjoyed by full developers. This change appears to have been successful in enabling a larger group of developers to contribute to Debian. More recently, Joerg Jaspert has proposed lowering the bar to certain types of contribution even further. The proposal reads: Debian is about developing a free operating system, but there's more in an operating system than just software and packages. If we want translators, documentation writers, artists, free software advocates, et al. to get endorsed by the project and feel proud for it, we need some way to acknowledge that. To that end, Joerg would create a new "Debian Contributor" classification. Contributors would be those doing translations or documentation; the proposal doesn't say that contributors don't touch code, but one gets that sense. Contributors would still have to jump through some hoops, but they would be fewer. They would not be able to upload packages on their own. The proposal also changes the Debian Maintainer standards, making that designation a little bit harder to get. Finally, the proposal states that all new applicants to the project would become Contributors or Maintainers. 
Only after a six-month period would they be able to apply for full Debian Developer or Debian Member status -- "Debian Member" being another new category that, while being equivalent to Debian Developer in almost all respects, would not have package upload privileges. Interestingly, there has not been much discussion of the substance of this proposal. But there has been a fair amount of debate over how it is being done. It would appear that some developers see this change as being imposed by a single project official without the debate that Debian changes normally require. Martin Krafft has further asserted that this kind of change goes beyond Joerg's authority as Debian account manager, a claim that Joerg denies. So now there are proposed general resolutions being circulated. An early version simply decreed that the proposed changes were "suspended" in favor of changes to be made through a more consensus-oriented process. Later versions soften the language somewhat, and thank Joerg for his effort in this area - but still require a "consensus or general resolution" before changes are adopted. In any form, the clear point of the resolution is to slow down the process and open it up for a wider discussion. Again, voting has not begun on any specific resolution, so we don't yet know what will even be voted on, much less how it will come out. But we can expect that, as a certain presidential election process finally (thankfully) comes to a close, activity will be picking up on a different set of votes. Digitizing Vinyl Records with Audacity The Audacity sound editor is an excellent application with many uses. Your author recently started working on a long-term project to convert the better parts of his ancient vinyl phonograph record collection to FLAC files so that they could be added to his digital audio library. Audacity was chosen to do the audio recording and processing work. Prior to undertaking such a project, one must first assemble the appropriate equipment. 
An older desktop computer with an Athlon 2500 processor and 500MB of RAM was used for the computing platform. Besides a sufficiently powerful CPU, the second most important piece of hardware is a decent sound card. An M-AUDIO Delta 44 was chosen. Standard sound cards should also work, but the Delta 44 has higher quality A-D converters that are mounted external to the computer for lower noise. The Ubuntu Studio distribution was used on the machine, although any current Linux distribution should work. The turntable is an ancient Technics SL-D3 and a Pioneer SX-780 receiver is used as the phono preamp. One of the Tape Record Outputs from the Pioneer receiver is fed into the Delta 44 sound card with an appropriate set of adapter cables. The turntable's tracking weight, anti-skid settings and platter speed should all be adjusted appropriately. One of the new USB turntables could probably be used here if you don't already have access to the legacy hardware.

The Audacity sound editor needs to be set up by entering the Edit->Preferences menu; the audio quality was set to 44,100 Hz sampling at 16 bits (standard CD quality). Depending on your needs, other sample rates can be used. One of the more important configuration steps involves making sure the Software Playthrough button in the Audio I/O preference window is deselected. On this particular machine, enabling Software Playthrough results in audible sample loss on the recording. Audio monitoring is done through the Pioneer receiver. The audio meter should be enabled on the main Audacity window and the GNOME ALSA sound mixer is used to set the sound card input levels.

The machine is now ready to record. It is a good idea to make a few test recordings on various album tracks to set the sound card's input level adjustment. A loud track should be played and the input level should be adjusted to achieve fairly high readings on the meter without any clipping.
Unless you only need to extract one track, it is best to record an entire album side in one pass. Recording should be enabled prior to setting the needle on the record, and disabled after the needle has been lifted. Be sure to use an appropriate record cleaner on the disc to get rid of any dust particles.

When an album side has been successfully recorded and the levels look reasonable, it is time to do some trimming. Listen to the beginning of the recording with the volume up a bit; at some point the sound will probably begin with a fade in. Select the audio from the beginning of the recording, past the initial pop from the needle landing in the groove, and ending a few seconds before the first track starts. Delete the selection with Edit->Delete. Next, select from the new beginning to where the sound begins. Use Effect->Fade In to make a smooth transition from quiet to the beginning of the audio. Perform a similar edit at the end of the album side. Delete everything from a few seconds beyond the last sound to the end of the recording and put a Fade Out at the end of the side.

If your album has a few clicks and pops, now is the time to remove them. Select the entire recording with Edit->Select->All and de-click with Effect->Click Removal. The default click filter settings seem to work fairly well.

The next step involves putting labels at the beginning of each song, assuming the album's material is not one long track. First, create a label track with Tracks->Add New->Label Track. Hit the << rewind button and type Control-B; this puts a label at the beginning of the recording. Move through the album side and put more labels at the middle of each song transition. It is a good idea to zoom in and put the label on a wave zero-crossing point to prevent clicks at the beginnings of individual tracks. If you zoom in, you can often see a change in wave patterns that is left over from the master tape splice.
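The zero-crossing advice can be illustrated outside of Audacity. The Python sketch below is not how Audacity places labels; it is a hypothetical helper, operating on a plain list of sample values, that finds the zero crossing closest to a chosen split position.

```python
def nearest_zero_crossing(samples, index):
    """Return the position nearest to index where the waveform touches
    or crosses zero, or None if there is no such position."""
    n = len(samples)

    def is_crossing(i):
        if samples[i] == 0:
            return True
        # A sign change between the previous sample and this one
        return i > 0 and (samples[i - 1] < 0) != (samples[i] < 0)

    # Search outward from the requested position, one sample at a time
    for offset in range(n):
        for i in (index - offset, index + offset):
            if 0 <= i < n and is_crossing(i):
                return i
    return None


# A tiny made-up waveform: positive, then negative, then positive again
wave = [5, 3, 1, -2, -4, -1, 2, 6]
split_at = nearest_zero_crossing(wave, 4)
```

Cutting where the signal is at (or passing through) zero avoids an abrupt jump in amplitude at the start of a track, which is what would otherwise be heard as a click.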
The recording should now look something like the first frame of the Audacity Images. It is a good idea to listen carefully to the entire recorded album side. If the recording has any obnoxiously loud clicks and pops that weren't removed with the Click Removal step, Audacity can smooth them out. To smooth out a click, locate the offending waveform by playing and pausing, then zoom in multiple times until the click is visible. Select a small region around the click and use Effect->Repair to smooth out the waveform. Zoom out and play the area where the click removal was performed to verify the operation. Audacity is very forgiving: if you don't like the results of the click removal or make another type of mistake, Edit->Undo will reverse most operations. An example Repair operation is shown in the Audacity Images.

At this point, it is time to split the album side into individual audio files. Select File->Export Multiple, choose the desired export format such as WAV, select Split files: based on labels and Name files: Numbering consecutively. Click the Export button and Audacity will render the individual track files. Audacity can create .mp3 and .flac files at this point, or that can be done at a later time. You can now exit Audacity, saving any edit information if you think you will need to work on the recording later. The same operations are performed on the B-side of the record.

Your author likes to use a short BASH script to rename the Audacity-generated file names to his own name scheme. The track files are all grouped together in one directory and converted to FLAC format with the command flac *.wav. A meta-data text file is created with digitizing notes, track titles and any other information that you wish to save. Lastly, all of the files are played one more time to verify that there are no problems. The original album side tracks can now be safely deleted to reclaim some disk space.
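The author's BASH script is not shown, so the following Python sketch only illustrates the general idea of the renaming step. The input file names, the album-number-title naming scheme and the function names are all assumptions; adapt them to whatever Audacity actually produced and to your own conventions.

```python
import re
from pathlib import Path


def safe_name(text):
    """Drop punctuation and turn runs of whitespace into underscores."""
    cleaned = re.sub(r"[^\w\- ]", "", text).strip()
    return re.sub(r"\s+", "_", cleaned)


def plan_renames(wav_names, titles, album):
    """Pair each exported file (in sorted order) with a new name of the
    hypothetical form Album-NN-Title.wav."""
    if len(wav_names) != len(titles):
        raise ValueError("need exactly one title per exported file")
    plan = []
    for number, (old, title) in enumerate(zip(sorted(wav_names), titles), 1):
        new = f"{safe_name(album)}-{number:02d}-{safe_name(title)}.wav"
        plan.append((old, new))
    return plan


def apply_renames(directory, plan):
    """Carry out a plan produced by plan_renames() inside directory."""
    for old, new in plan:
        (Path(directory) / old).rename(Path(directory) / new)


# Made-up example: two files exported from side A of an album
plan = plan_renames(["side_a-02.wav", "side_a-01.wav"],
                    ["Intro", "Second Song!"], "My Album")
```

Building the list of renames separately from applying it makes it easy to print and check the plan before any file is touched. Once the files are renamed, running flac *.wav in that directory encodes them as described above.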
With enough editing effort, it is possible to make a digital copy of a vinyl record that sounds better than the original. Performing all of the above steps on a large collection of albums is a big undertaking, but the reward comes in turning a hard to play discrete music library into an easy to play digital library. For further information on this topic, see the followup article.

Directions for GNOME 3.0

Earlier this year at the Gnome Users and Developers Conference, it was announced that there would be a Gnome 3.0 and discussions about how to make the transition are now open. Since then, there has been another gathering of Gnome developers, discussing and making plans about how they would like to modernize the interface. Over the past few days, a number of blog posts have appeared on Planet Gnome discussing some of the happenings at this five-day event, and I felt a summary of the ideas so far might be useful to everyone concerned.

The Journal

The idea that has perhaps received the clearest exposition, along with some concrete work on beginning to make it a reality, is a refreshed way to handle day-to-day file management based on the OLPC's journal concept. Federico Mena-Quintero posted to his blog reporting on his team's brainstorming session. What's wrong with how we handle file management today? Federico says:

Let's consider a very common workflow: download an image from a web site, make some modifications to it, and attach it to an e-mail. When you do "save image as" in your web browser, it will default to ~/Downloads or even ~/Desktop. When you do "file/open" in the GIMP, it will default to the last directory you used in the GIMP, even if it was from days ago (on my machine right now, the GIMP defaulted to look at files from ~/src/some-random-directory) ... The end result is that your workflow gets shattered to pieces, as programs try to be helpful within themselves, but they totally fail at being helpful within your workflow.
So, programs contribute to having files scattered around everywhere, and there is no easy way to look at everything together. To solve this problem, they began from the premise that humans are fairly good at knowing when they did things: "I started typing my homework last Monday, because I knew it was due on my Thursday class" and "I mailed you that photo two weeks ago, right after my birthday party" were the examples given. From here, the argument is that if we can present users with a journal view of what they did, they can forget about where they put a file and just browse through a time line to find what they were looking for. The journal would not only keep track of files you created, but also websites you visited and IM conversations you had, and would even allow you to make notes about particular entries. An example of this final kind of functionality might be noting down reference numbers from receipts or customer service representatives. The other two major features of the journal would be the ability to star important items, so they're kept in a separate section, and the ability to create files from directly within the journal, allowing it to act as a kind of scrapbook. As well as Federico's own proof of concept implementation, you can also find similar ideas in Mayanna's timeline, a fork of Gimmie, and the Nemo file manager.

Task Orientation

This post didn't arise out of the User Experience Hackfest, but from GUADEC earlier in the year. Karl Lattimer has posited that the application-centric workflow is broken, and that people don't use a computer with the intention of using a particular application, but with the intention of completing a particular task. Obviously tasks rarely stand on their own, but often form part of a larger project.
Karl comments that he believes Federico is making moves in the right direction with the journal, providing users with the capacity to track what they did and when - perhaps a kind of project management framework - but he believes that we also need to provide users with the ability to track why things were done, gathering metadata about the tasks and building a picture of the relationships between them. The example he uses is that of an email received from a colleague asking us to update a file by a certain deadline: from this we could extract the file, the deadline, who sent it to us, and possibly even what needs doing to the file, all of which could be fed into the journal or other interface. This obviously has some practical challenges when it comes to considering how it could be implemented, but if realized could deliver an automated task list that's closely linked with templates for commonly performed tasks, doing away with the idea of static workspaces and applications for ever. Karl sums up his thoughts nicely in this paragraph: For us to get there we need to invent some cool stuff, semantics is one part, organising the data by what it is rather than where it is, especially when the user has a tendency to loose things in the jungle of file systems. Journals and revision control are another part of it, remembering what we've been doing and when, but also templates and schema's are part of it too, hiding the notion of an application behind the tasks you want to achieve and the things you want to get done. The Desktop Shell During this hackfest session, the team tried to forget about the current Gnome interface and focus on what makes sense for users; ironically, Vincent Untz decided to start his post, about how the team forgot about the current Gnome interface, with some observations of the current Gnome interface. The problems he identified in the current interface were four-fold. 
Firstly, finding the window you want can be difficult when using the default applet, particularly if you have more than a few windows open, and particularly if you have a smaller screen. Secondly, few people make use of the multiple workspaces idea, largely because they were just unaware of their existence. Thirdly, application menus are a slow and inefficient way to open up new applications; some take advantage of launchers or the run dialog to improve on this, but most don't know how to do this. And finally, the current panel is certainly very powerful, but its power is wasted in unneeded flexibility such as being able to position the panel in the middle of the screen. Perhaps the most controversial proposal to fix these problems so far is to restrict Gnome to a single static panel: by removing one panel we'd be saving valuable screen real estate, and by having a layout we can depend on we'd be able to use "hot corners" more effectively, allowing users to easily set their presence, as well as to launch a new "activities overlay mode". While the idea of a single panel hasn't raised too much concern, the static point has: Mathias Hasselmann responds with "Static Panel Nonsense", suggesting that many Gnome users, himself included, as well as Mac OS and Windows users, heavily customize the layout of their panels with custom launchers, and to improve something by removing existing functionality is not a good approach. The most promising proposal from my point of view, and what seems to be a common OLPC inspired train of thought amongst Gnome's community, is the notion of activities. An activity is essentially what Karl Lattimer described as a project, made up of individual tasks, and what many Gnome users organize into separate work spaces in the current environment. In the current Gnome environment, Vincent argues, activities and work spaces are static: a user configures 8 desktops and sticks with them. 
His proposal is that activities should be far more flexible, and if a user wants to start a new one then we should help them by creating a new desktop automatically. Where Next Reportedly the release team are busy preparing a plan for how we can move from Gnome 2.x to 3.0, with the current plan appearing to be that what would have been called 2.30 will become 3.0. In this time frame, the very least of what we can expect to see is a revamped Gtk+, but what changes the user can expect to see is far harder to tell as there are no known plans for a radical interface overhaul like that seen during the development of KDE 4. Instead, it appears that the Gnome release team are planning on sticking to their current principles with regard to what features will become a core part of the desktop stack: adoption by popular distributions, stability, and a proven track record will all be required for features to make it in. This may seem like it rules out huge amounts of innovation, but there are a number of existing frameworks in Gnome that are very exciting (PolicyKit, PackageKit, Clutter, GVFS, desktop search, D-Conf, online desktop), and perhaps the 3.0 development cycle will see these mature and finally deliver on their promise of revolutionizing the user experience, with many of these technologies forming the backbone of the ideas discussed in this article. Another kind of cookie It has become increasingly difficult to use the web without some kind of Flash player, but a little-known "feature" of Flash is causing some privacy concerns. In some ways, Local Shared Objects (LSOs aka Flash cookies) are similar to browser cookies, but there are a number of significant differences as well. In addition, because the dominant Flash player is closed-source, one must depend on Adobe's ability to faithfully implement the security model. In all, Flash cookies are something that web users should be cognizant of. 
At its core, an LSO is a chunk of data that is stored on a user's disk based on the domain that the Flash program was downloaded from. Only Flash programs from that domain should have access to the data and, unlike browser cookies, much more data can be stored. By default, 100K bytes can be used per domain, which is a sizable increase from the 4K available for browser cookies. The amount of storage for a Flash cookie can be increased with the assent of the user, or decreased via the management interface. Another major difference from the now-familiar browser cookies is that the interface for managing them is less-than-obvious. From a given Flash application, there is a "Settings" menu that allows control of the LSOs from that site. To see the sites that have stored Flash cookies or to have more global control over them, one must visit Adobe's site. There are also third-party applications and browser add-ons that will allow more control. A user can also resort to the ultimate control—removing them from the filesystem (~/.macromedia/Flash_Player/#SharedObjects). There are many benign things that a Flash application might do with a bit of local storage—caching data, storing preferences, etc.—but they can also be used to track users in much the same way that browser cookies are used. Because Flash cookies are less well-known, and harder to manage, though, they may be more effective because they are removed or restricted less often. Another important thing to note is that there is no requirement that there be a visible Flash application on the web site. A site could embed a Flash application with no visible elements simply to store a cookie. Unless the user has a browser add-on like NoScript, they will get no indication that anything has happened. Assuming that there aren't any holes in Adobe's implementation of the Flash security model, Flash cookies aren't much different—or more dangerous—than browser cookies. But that assumption is a bit worrisome. 
For Firefox or other free software browsers, the code can be inspected to verify correct behavior. Either Flash or Firefox could have some flaw that allowed cross-site cookie access (which would be a rather nasty information disclosure vulnerability), but for Flash, we can only take Adobe's word. Privacy advocates have been successful in getting the idea of deleting browser cookies into the consciousness of concerned users, but Flash cookies seem to have flown below the radar. A recent blog posting that was widely reported has helped to raise the profile of Flash cookies so that users will, hopefully, know that they exist. Those with a desire to strictly control their privacy will be better able to do so. With luck, it may also lead Adobe to provide an easier and more visible interface to manage them as well. DebXO for the XO laptop The XO laptop was developed for the One Laptop Per Child (OLPC) project. Two weeks ago the XO Software Release 8.2.0 was announced. This week the DebXO project has taken off, with the goal of providing a Debian-based alternative for the XO laptop. Work has been in progress for at least a couple of months, but versions 0.2 and 0.3 were announced this week. As of this writing, Andres "dilinger" Salomon has released three versions; the debxo-latest symlink points to the latest release. According to the version 0.2 announcement, DebXO has EXT3 images for booting from USB and/or SD. While DebXO 0.1 only had a GNOME desktop, 0.2 includes KDE, LXDE, Sugar, Awesome, and GNOME desktops. Version 0.3 provides some important bug fixes for problems found in 0.2. This project is obviously still in its infancy, but it seems like a good start on an alternative for the XO laptop. If you have an XO and are interested in helping out, you could start by testing the current versions. There is a git repository with the code, which has a web interface, or just use git clone to grab the code.
Squashfs submitted for the mainline The Squashfs compressed filesystem is used in everything from Live CDs to embedded devices. Many or most distributions ship it in such situations, but squashfs has been maintained outside of the mainline kernel for years. That appears to be changing as it was recently submitted for inclusion in the mainline by Phillip Lougher. The reaction has been generally favorable, with Andrew Morton requesting that Lougher move it forward: "Please prepare a tree for linux-next inclusion and unless serious problems are pointed out I'd suggest shooting for a 2.6.29 merge." So it seems like a good time to take a look at some of the features and capabilities of Squashfs. The basic idea behind Squashfs is to generate a compressed image of a filesystem or directory hierarchy that can be mounted as a read-only filesystem. This can be done to archive a set of directories or to store them on a smaller capacity device than would normally be required. The latter is used by both Live CDs and embedded devices to squeeze more into less. It has been nearly four years since Squashfs was last submitted to linux-kernel. Since that time, it has been almost completely rewritten based on comments from that attempt. In addition, it has gone through two filesystem layout revisions in part to allow for 64-bit sizes for files and filesystems. Another major change is to make the filesystem little-endian, so that it can be read on any architecture, regardless of endian-ness. The mksquashfs utility is used to create the image, which can then be mounted either via loopback (from a file) or from a regular block device. One of the features added since the original attempt to mainline Squashfs—to address complaints made at that time—is the ability to export a Squashfs filesystem via NFS. Squashfs uses gzip compression on filesystem data and metadata, achieving sizes roughly one-third that of an ext3 filesystem with the same data. 
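To make the compression scheme a little more concrete, here is a minimal Python sketch of the general idea of compressing an image in independent fixed-size blocks, so that any one block can be read back without decompressing the rest. It is not the Squashfs on-disk format: the helper names, the in-memory list of blocks, and the use of Python's zlib module are assumptions made purely for illustration. The 128K block size is the Squashfs default.

```python
import zlib
from typing import List

BLOCK_SIZE = 128 * 1024  # Squashfs default block size


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress each fixed-size block independently (hypothetical helper)."""
    return [
        zlib.compress(data[off:off + block_size])
        for off in range(0, len(data), block_size)
    ]


def read_block(blocks: List[bytes], index: int) -> bytes:
    """Decompress a single block without touching any of the others."""
    return zlib.decompress(blocks[index])


def compressed_size(blocks: List[bytes]) -> int:
    """Total size of all compressed blocks, ignoring any index or metadata."""
    return sum(len(block) for block in blocks)


if __name__ == "__main__":
    # Repetitive sample data standing in for file contents.
    sample = b"".join(b"line %06d of a sample file\n" % i for i in range(20000))
    blocks = compress_blocks(sample)
    print(len(sample), "bytes ->", compressed_size(blocks),
          "bytes in", len(blocks), "blocks")
```

Because each block is compressed on its own, a read-only filesystem built this way can satisfy a random read by decompressing only the blocks that cover the requested range.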
The performance is quite good as well, even when compared with the simpler cramfs—a compressed read-only filesystem already available with the kernel. According to Lougher, these performance numbers were gathered a number of years ago, with older versions of the code; newer numbers should be even better. Previously, some kernel developers were resistant to adding another compressed filesystem to the kernel, so Lougher outlines a number of reasons that Squashfs is superior to cramfs. Certainly support for larger files and filesystems is compelling, but the fact that cramfs is orphaned and unmaintained will likely also play a role. In addition, Squashfs supports many more "normal" Linux filesystem features like real inode numbers, hard links, and exportability. Morton had a laundry list of overall suggestions for making Squashfs better in the email referenced above, but documentation is certainly one of the areas that is somewhat lacking. In particular, Squashfs maintains its own cache, which puzzles Morton: Why not just decompress these blocks into pagecache and let the VFS handle the caching?? The real bug here is that this rather obvious question wasn't answered anywhere in the patch submission (afaict). How to fix that? Methinks we need a squashfs.txt which covers these things. One of the reasons that Squashfs doesn't use the page cache is that it allows for multiple block sizes, from 4K up to 1M, with a default of 128K. Better compression ratios can be achieved with a larger block size, but that doesn't work well with the page cache as Jörn Engel notes: "One of the problems seems to be that your blocksize can exceed page size and there really isn't any infrastructure to deal with such cases yet." Lougher has moved the code into a git repository, presumably in preparation to get it into linux-next. He notes that the CE Linux Forum has been instrumental in providing funding over the last four months to allow him to work on getting Squashfs into the mainline. 
With the additional testing that will come from being included in linux-next, it seems quite possible we could see Squashfs in 2.6.29. Android's first vulnerability A company's response to security vulnerabilities is always interesting to watch. Google has the reputation of being fairly cavalier regarding flaws reported in its code; the first security vulnerability reported for the Android mobile phone software appears to follow that pattern. Unfortunately for users of Android phones, though, Google's attitude and relatively slow response might some day lead to an "in the wild" exploit targeting the phones. The flaw was first reported to Google on October 20 by Independent Security Evaluators (ISE), but was not patched for the G1 phone—the only shipping Android phone—until November 3. Details on the vulnerability are thin, but it affects the web browser and is caused by Google shipping an out-of-date component. Presumably a library or content handler was shipped with a known security flaw that could lead to code execution as the user id which runs the browser. It should be noted that compromising the browser does not affect the rest of the phone due to Android's security architecture. Unlike the iPhone, separate applications are run as different users, so that phone functionality is isolated from the browser, instant messaging, and other tools. An iPhone compromise in any application can lead to the attacker being able to make phone calls and get access to private data associated with any application; clearly Google made a better choice than Apple. One interesting recent development, though, is the availability of an application that provides a root-owned telnet daemon. With that running, a simple telnet gets full access to the phone's filesystem. From there, jailbreaking—circumventing the restrictions placed by a carrier on applications—as well as unlocking the phone from a specific carrier are possible. 
While it is easy to see how that might be useful for the owner of an Android phone, it also opens the phone to rather intrusive attacks, and it probably is not what T-Mobile (and other carriers down the road) had in mind. Google's first response to the vulnerability report was to whine that Charlie Miller, who discovered the flaw, was not being "responsible" by talking about it before a fix was ready. Miller did not disclose details, but did report the existence of the flaw, along with some general information about it. Google's previous reputation regarding vulnerability reporting, as well as how it treated Miller, undoubtedly played a role in his decision. Perhaps the most galling thing is that the flaw was in a free software component that had been updated prior to the Android release to, at least in part, close that hole. It would seem that the Android team was not paying attention to security flaws reported in the free software components that make up the phone software stack. Hopefully, this particular occurrence will serve as a wake-up call on that front. Given that the fix was already known, it is a bit puzzling that it would take two weeks for updates to become available. It was the first update made for Android phones in the field, but one hopes the bugs in that process were worked out long ago. Overall, Google's response leaves rather a lot to be desired. If Google wants security researchers to be more "responsible" in their disclosure, it would be well served by looking at its own behavior. Taking too much time to patch a vulnerability, especially one with a known and presumably already tested fix, is not the way to show the security community that it takes such bugs seriously. Whining about disclosure rarely, if ever, goes anywhere; working in a partnership with folks who find security flaws is much more likely to bear fruit.
Testing Fedora on the OLPC In preparation for this year's version of the Give One, Get One (G1G1) promotion of the One Laptop Per Child (OLPC) XO, the Fedora OLPC special interest group (SIG) has undertaken a rather large testing effort. With the assistance of 80 mostly-free XOs, the group has been running Fedora 10 on the hardware, trying to shake out Fedora and OLPC bugs. The idea is to help lift some of the burden from the OLPC developers, while also providing some distribution testing focused on areas specific to the OLPC hardware. G1G1 participants can optionally purchase an SD card pre-loaded with a Fedora 10 live distribution, so that they can run a full Fedora desktop on the XO. Normally, it runs a stripped-down version of Fedora 9 with the Sugar interface as the only desktop available. Part of the Fedora OLPC effort is to help reduce the operating system burden for the OLPC folks. Fedora OLPC liaison (and Red Hat Senior Community Architect) Greg DeKoenigsberg describes where the project is headed: The Fedora community is working closely with OLPC to incorporate their changes upstream, and we are also working to package Sugar as a standard desktop environment for Fedora. Our hope is that, in future releases, the XO can run a completely stock version of Fedora — that way, OLPC will not have to bear any costs of maintaining the distro itself, and can focus their resources where they are most effective: the hardware, and Sugar. Back in September, DeKoenigsberg put out a call for folks interested in testing, with the incentive of a "mostly" free XO. Participants needed to be willing to buy an SD card to put Fedora on and to spend 20 hours testing Fedora on the XO. There were more volunteers than laptops, as would be expected, but 80 XOs—most refurbished returns from the original G1G1 last year—got into the hands of many "experienced Fedora community members." The XOs were provided by the OLPC project through its developer program. 
The testing has already "found and resolved a number of potential release blockers," according to DeKoenigsberg. There is an extensive test plan that outlines the different testing areas as well as the methodology of testing and reporting bugs found. In many ways, this is just a test of Fedora on a new hardware platform, with the focus on things that set the XO apart: power management, networking, the built-in camera, display, performance, etc. But there is more to the SIG than just testing the XO. The task list has a number of different activities that are currently underway. Getting a developer key to each person who chooses the Fedora 10 option in G1G1 is an important piece of the puzzle—the XO security policy will not allow it to boot from SD without it. Various Sugar tasks are high on the list as well. One of those is the Fedora Sugar spin, a Live CD that allows running the Sugar environment on any computer. So far, there are just a few Sugar "activities"—roughly equivalent to applications for things like web browsing or word processing—available for the spin, but that is another of the tasks that Fedora OLPC will be working on. There is currently a bit of an awkward debate on the fedora-advisory-board mailing list about how "official" the Sugar spin really is—as it missed the deadline for the Fedora 10 freeze—but it would seem that many are in favor of granting it a waiver. The Fedora OLPC SIG's mission statement—To provide the OLPC project with a strong, sustainable, scalable, community-driven base platform for innovation—makes it clear it sees a big role in assisting OLPC going forward. The testing effort is just one facet of that, as DeKoenigsberg notes: We hope to have success with the Fedora on XO testing project, but the real goal is longer term and more strategic. OLPC has placed a very large bet on open source software. In order to be successful, they need knowledgeable contributors — which Fedora has in abundance. 
There may be more than a million XOs in the wild by the end of this year, and all of them will be running a remix of Fedora by default. In Fedora, we have a responsibility to help make OLPC successful, and the Fedora community takes that responsibility very seriously. The OLPC project is one with great promise. It has suffered at times from the mixed message that it gives regarding free vs. proprietary software, but it could, clearly, be a marvelous example of free software in action. In order for that to happen, though, there will need to be a concerted effort by the free software community to assist. The Fedora OLPC SIG looks to be an excellent step in that direction. Linux and object storage devices The btrfs filesystem is widely regarded as being the long-term future choice for Linux. But what if btrfs is taking the wrong direction, fighting an old war? If the nature of our storage devices changes significantly, our filesystems will have to change as well. A lot of attention has been paid to the increasing prevalence of flash-based devices, but there is another upcoming technology which should be planned for: object storage devices (OSDs). The recent posting of a new filesystem called osdfs provides a good opportunity to look at OSDs and how they might be supported under Linux. The developers of OSDs were driven by the idea that traditional, block-based disk drives offer an overly low-level interface. With contemporary hardware, it should be possible to push more intelligence into storage devices, offloading work from the host while maintaining (or improving) performance and security. So the interface offered by an OSD does not deal in blocks; instead, the OSD provides "objects" to the host system. Most objects will simply be files, but a few other types of objects (partitions, for example) are supported as well. The host manipulates these objects, but need not (and cannot) concern itself with how those objects are implemented within the device. 
A file object is identified by two 64-bit numbers. It contains whatever data the creator chooses to put in there; an OSD does not interpret the data in any way. Files also have a collection of attributes and metadata; this includes much of the information stored in an on-disk inode in a traditional filesystem - but without the block layout information, which the OSD hides from the rest of the world. All of the usual operations can be performed on files - reading, writing, appending, truncating, etc. - but, again, the implementation of those operations is handled by the OSD. One thing that is not handled by the OSD, though, is the creation of a directory hierarchy or the naming of files. It is expected that the host filesystem will use file objects to store its directory structure, providing a suitable interface to the filesystem's users. One could, presumably, also use an OSD as a sort of hardware-implemented object database without a whole lot of high-level code, but that is not where the focus of work with OSDs is now. The OSD protocol is a T10-sanctioned extension to the SCSI protocol. It is thus expected that OSD devices will be directly attached to host systems; the protocol has been designed to perform well in that mode. It is also expected, though, that OSDs will be used in network-attached storage environments. For such deployments, the OSD designers decided to offload another task from the host systems: security. To that end, the OSD protocol includes an extensive set of security-related commands. Every operation on an object must be accompanied by a "capability," a cryptographically-signed ticket which names the object and the access rights possessed by the owner of the capability. In the absence of a suitable capability, the drive will deny access.
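The capability idea can be illustrated with a short Python sketch. It is emphatically not the OSD wire format or its actual integrity-check algorithm; it only shows the general pattern of a ticket naming an object and a set of rights, signed with a key that the drive can verify. The field layout, the permission bits, the function names, and the choice of HMAC-SHA256 are all assumptions made for the example. Capabilities also carry an expiration time, so one is included here.

```python
import hashlib
import hmac
import struct

READ, WRITE = 0x1, 0x2      # hypothetical permission bits
_FIELDS = ">QQIQ"           # two 64-bit object numbers, rights, expiration time
_SIG_LEN = hashlib.sha256().digest_size


def make_capability(key: bytes, partition_id: int, object_id: int,
                    rights: int, expires: int) -> bytes:
    """Pack the ticket fields and append an HMAC-SHA256 signature."""
    body = struct.pack(_FIELDS, partition_id, object_id, rights, expires)
    return body + hmac.new(key, body, hashlib.sha256).digest()


def check_capability(key: bytes, cap: bytes, partition_id: int,
                     object_id: int, wanted: int, now: int) -> bool:
    """True only if the ticket is authentic, unexpired, and covers the request."""
    body, sig = cap[:-_SIG_LEN], cap[-_SIG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False        # forged or tampered ticket
    try:
        pid, oid, rights, expires = struct.unpack(_FIELDS, body)
    except struct.error:
        return False        # malformed ticket
    return ((pid, oid) == (partition_id, object_id)
            and now < expires
            and (rights & wanted) == wanted)
```

In this sketch the checking side needs nothing but the shared key and the ticket itself, which is what allows a drive to make access decisions without consulting the host.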
It is expected that capabilities will be handed out by a security policy daemon running somewhere on the network. That daemon may be in possession of the drive's root key, which allows unrestricted access to the drive, or it may have a separate, partition-level key instead. Either way, it can use that key to sign capabilities given out to processes elsewhere in the system. (Drives also have a "master" key, used primarily to change the root key. Loss of the master key is probably a restore-from-backup sort of event.) Capabilities last for a while (they include an expiration time) and describe all of the allowed operations. So the act of actually obtaining a capability should be relatively rare; most OSD operations will be performed using a capability which the system already has in hand. That is an important design feature; adding "ask a daemon for a capability" to the filesystem I/O path would not be a performance-enhancing move. In theory, it should be relatively easy to make a standard Linux filesystem support an OSD. It's mostly a matter of hacking out much of the low-level block layout and inode management code, replacing it with the appropriate object operations. The osdfs filesystem was created in this way; the developers started with ext2. After taking out all the code they no longer needed, the osdfs developers simply added code translating VFS-level requests into operations understood by the OSD. Those requests are then executed by way of the low-level osd-initiator code (which was also recently submitted for consideration). Directories are implemented as simple files containing names and associated object IDs. There is no separate on-disk inode; all of that information is stored as attributes to the file itself. The end result is that the osdfs code is relatively small; it is mostly concerned with remapping VFS operations into OSD operations. 
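As a rough illustration of how a host filesystem might layer names on top of a flat object store, the following Python sketch keeps each directory as an ordinary object whose data is a table of names and object IDs, with the "inode" information held as attributes on the object. The ToyOSD class, the helper names, and the JSON encoding of directory contents are all invented for this example and say nothing about the real osdfs on-disk layout. For brevity, objects are identified by a single integer rather than the pair of 64-bit numbers a real OSD uses.

```python
import json


class ToyOSD:
    """In-memory stand-in for an object storage device (hypothetical)."""

    def __init__(self):
        self._data = {}      # object ID -> bytes
        self._attrs = {}     # object ID -> dict of attributes
        self._next_id = 1

    def create(self, **attrs) -> int:
        oid = self._next_id
        self._next_id += 1
        self._data[oid] = b""
        self._attrs[oid] = dict(attrs)
        return oid

    def write(self, oid: int, data: bytes) -> None:
        self._data[oid] = bytes(data)

    def read(self, oid: int) -> bytes:
        return self._data[oid]

    def get_attr(self, oid: int, name: str):
        return self._attrs[oid][name]


def mkdir(osd: ToyOSD) -> int:
    """Create an empty directory: a plain object holding a name table."""
    oid = osd.create(kind="dir")
    osd.write(oid, json.dumps({}).encode())
    return oid


def add_entry(osd: ToyOSD, dir_oid: int, name: str, child_oid: int) -> None:
    """Record name -> object ID in the directory object's data."""
    entries = json.loads(osd.read(dir_oid))
    entries[name] = child_oid
    osd.write(dir_oid, json.dumps(entries).encode())


def lookup(osd: ToyOSD, root_oid: int, path: str) -> int:
    """Walk a slash-separated path, one directory object at a time."""
    oid = root_oid
    for part in path.strip("/").split("/"):
        if not part:
            continue
        if osd.get_attr(oid, "kind") != "dir":
            raise NotADirectoryError(path)
        entries = json.loads(osd.read(oid))
        if part not in entries:
            raise FileNotFoundError(path)
        oid = entries[part]
    return oid
```

The point of the sketch is that nothing in the store itself knows about names or hierarchy; the host code builds both out of ordinary reads and writes on objects.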
Anybody wanting to test this code may run into one small problem: there are few OSDs to be found in the neighborhood computer store. It would appear that most of the development work so far has been done using OSD simulators. The OSC software OSD is, like osdfs, part of the open-osd project; it implements the OSD protocol over an SQLite database. There is also an OSD simulator hosted at IBM, but it would not appear to be under current development. Simulator-based development and testing may not be as rewarding as having a shiny new device implementing OSD in hardware, but it will help to ensure that both the software and the protocol are in good shape by the time such hardware is available. It should be noted that the success of OSDs is not entirely assured. An OSD takes much of the work normally done in an operating system kernel and shoves it into a hardware firmware blob where it cannot be inspected or fixed. A poor implementation will, at best, not perform well; at worst, the chances of losing data could increase considerably. It may yet prove best to insist that storage devices just concentrate on placing bits where the operating system tells them to and leave the higher-level decisions to higher-level code. Or it may turn out that OSDs are the next step forward in smarter, more capable hardware. Either way, it is an interesting experiment. See this article at Sun for more information on how OSD works. Hierarchical RCU Introduction Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU improves scalability by allowing readers to execute concurrently with writers. In contrast, conventional locking primitives require that readers wait for ongoing writers and vice versa. RCU ensures coherence for read accesses by maintaining multiple versions of data structures and ensuring that they are not freed until all pre-existing read-side critical sections complete.
RCU relies on efficient and scalable mechanisms for publishing and reading new versions of an object, and also for deferring the collection of old versions. These mechanisms distribute the work among read and update paths in such a way as to make read paths extremely fast. In some cases (non-preemptable kernels), RCU's read-side primitives have zero overhead. Although Classic RCU's read-side primitives enjoy excellent performance and scalability, the update-side primitives, which determine when pre-existing read-side critical sections have finished, were designed with only a few tens of CPUs in mind. Their scalability is limited by a global lock that must be acquired by each CPU at least once during each grace period. Although Classic RCU actually scales to a couple of hundred CPUs, and can be tweaked to scale to roughly a thousand CPUs (but at the expense of extending grace periods), emerging multicore systems will require it to scale better. In addition, Classic RCU has a sub-optimal dynticks interface, with the result that Classic RCU will wake up every CPU at least once per grace period. To see the problem with this, consider a 16-CPU system that is sufficiently lightly loaded that it is keeping only four CPUs busy. In a perfect world, the remaining twelve CPUs could be put into deep sleep mode in order to conserve energy. Unfortunately, if the four busy CPUs are frequently performing RCU updates, those twelve idle CPUs will be awakened frequently, wasting significant energy. Thus, any major change to Classic RCU should also leave sleeping CPUs lie. Both the existing and the proposed implementations have Classic RCU semantics and identical APIs; however, the old implementation will be called “classic RCU” and the new implementation will be called “tree RCU”.
- Review of RCU Fundamentals
- Brief Overview of Classic RCU Implementation
- RCU Desiderata
- Towards a More Scalable RCU Implementation
- Towards a Greener RCU Implementation
- State Machine
- Use Cases
- Testing

These sections are followed by concluding remarks and the answers to the Quick Quizzes.

Review of RCU Fundamentals

In its most basic form, RCU is a way of waiting for things to finish. Of course, there are a great many other ways of waiting for things to finish, including reference counts, reader-writer locks, events, and so on. The great advantage of RCU is that it can wait for each of (say) 20,000 different things without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes using explicit tracking.

In RCU's case, the things waited on are called "RCU read-side critical sections". An RCU read-side critical section starts with an rcu_read_lock() primitive, and ends with a corresponding rcu_read_unlock() primitive. RCU read-side critical sections can be nested, and may contain pretty much any code, as long as that code does not explicitly block or sleep (although a special form of RCU called "SRCU" does permit general sleeping in SRCU read-side critical sections). If you abide by these conventions, you can use RCU to wait for any desired piece of code to complete.

RCU accomplishes this feat by indirectly determining when these other things have finished, as has been described elsewhere for Classic RCU and realtime RCU. In particular, as shown in the following figure, RCU is a way of waiting for pre-existing RCU read-side critical sections to completely finish, including memory operations executed by those critical sections. However, note that RCU read-side critical sections that begin after the beginning of a given grace period can and will extend beyond the end of that grace period.
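The grace-period rule just described can be written down as a tiny Python model. This is only a sketch of the semantics, not of how any RCU implementation works: real RCU deliberately avoids tracking individual readers, whereas this model lists them explicitly to make the rule visible. The function name and the representation of a reader as a (start, end) pair of timestamps are assumptions made for illustration, as is the choice to treat a reader that starts at exactly the same instant as the grace period as not pre-existing.

```python
from typing import Iterable, Tuple

# (time of rcu_read_lock(), time of matching rcu_read_unlock())
Reader = Tuple[float, float]


def earliest_grace_period_end(gp_start: float, readers: Iterable[Reader]) -> float:
    """Earliest time at which a grace period starting at gp_start may end.

    Only readers that began before the grace period are waited for;
    readers that begin later are ignored, however long they run.
    """
    pre_existing_ends = [end for start, end in readers if start < gp_start]
    return max(pre_existing_ends + [gp_start])


if __name__ == "__main__":
    readers = [(0, 5), (3, 12), (9, 11), (10, 30)]
    print(earliest_grace_period_end(10, readers))
```

In the sample data, the reader running from time 10 to time 30 begins after the grace period starts, so the grace period does not wait for it and that reader extends past the grace period's end.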
The following section gives a very high-level view of how the Classic RCU implementation operates. Brief Overview of Classic RCU Implementation The key concept behind the Classic RCU implementation is that Classic RCU read-side critical sections are confined to kernel code and are not permitted to block. This means that any time a given CPU is seen either blocking, in the idle loop, or exiting the kernel, we know that all RCU read-side critical sections that were previously running on that CPU must have completed. Such states are called “quiescent states”, and after each CPU has passed through at least one quiescent state, the RCU grace period ends. Classic RCU's most important data structure is the rcu_ctrlblk structure, which contains the ->cpumask field, which contains one bit per CPU. Each CPU's bit is set to one at the beginning of each grace period, and each CPU must clear its bit after it passes through a quiescent state. Because multiple CPUs might want to clear their bits concurrently, which would corrupt the ->cpumask field, a ->lock spinlock is used to protect ->cpumask, preventing any such corruption. Unfortunately, this spinlock can also suffer extreme contention if there are more than a few hundred CPUs, which might soon become quite common if multicore trends continue. Worse yet, the fact that all CPUs must clear their own bit means that CPUs are not permitted to sleep through a grace period, which limits Linux's ability to conserve power. The next section lays out what we need from a new non-real-time RCU implementation. RCU Desiderata The list of RCU desiderata called out at LCA2005 for real-time RCU is a very good start: Deferred destruction, so that an RCU grace period cannot end until all pre-existing RCU read-side critical sections have completed. Reliable, so that RCU supports 24x7 operation for years at a time. Callable from irq handlers. 
Contained memory footprint, so that mechanisms exist to expedite grace periods if there are too many callbacks. (This is weakened from the LCA2005 list.) Independent of memory blocks, so that RCU can work with any conceivable memory allocator. Synchronization-free read side, so that only normal non-atomic instructions operating on CPU- or task-local memory are permitted. (This is strengthened from the LCA2005 list.) Unconditional read-to-write upgrade, which is used in several places in the Linux kernel where the update-side lock is acquired within the RCU read-side critical section. Compatible API. Because this is not to be a real-time RCU, the requirement for preemptable RCU read-side critical sections can be dropped. However, we need to add a few more requirements to account for changes over the past few years: Scalability with extremely low internal-to-RCU lock contention. RCU must support at least 1,024 CPUs gracefully, and preferably at least 4,096. Energy conservation: RCU must be able to avoid awakening low-power-state dynticks-idle CPUs, but still determine when the current grace period ends. This has been implemented in real-time RCU, but needs serious simplification. RCU read-side critical sections must be permitted in NMI handlers as well as irq handlers. Note that preemptable RCU was able to avoid this requirement due to a separately implemented synchronize_sched(). RCU must operate gracefully in face of repeated CPU-hotplug operations. This is simply carrying forward a requirement met by both classic and real-time. It must be possible to wait for all previously registered RCU callbacks to complete, though this is already provided in the form of rcu_barrier(). Detecting CPUs that are failing to respond is desirable, to assist diagnosis both of RCU and of various infinite loop bugs and hardware failures that can prevent RCU grace periods from ending. 
Extreme expediting of RCU grace periods is desirable, so that an RCU grace period can be forced to complete within a few hundred microseconds of the last relevant RCU read-side critical section completing. However, such an operation would be expected to incur severe CPU overhead, and would be primarily useful when carrying out a long sequence of operations that each needed to wait for an RCU grace period. The most pressing of the new requirements is the first one, scalability. The next section therefore describes how to make order-of-magnitude reductions in contention on RCU's internal locks. Towards a More Scalable RCU Implementation One effective way to reduce lock contention is to create a hierarchy, as shown in the following figure. Here, each of the four rcu_node structures has its own lock, so that only CPUs 0 and 1 will acquire the lower left rcu_node's lock, only CPUs 2 and 3 will acquire the lower middle rcu_node's lock, and only CPUs 4 and 5 will acquire the lower right rcu_node's lock. During any given grace period, only one of the CPUs accessing each of the lower rcu_node structures will access the upper rcu_node, namely, the last of each pair of CPUs to record a quiescent state for the corresponding grace period. This results in a significant reduction in lock contention: instead of six CPUs contending for a single lock each grace period, we have only three for the upper rcu_node's lock (a reduction of 50%) and only two for each of the lower rcu_nodes' locks (a reduction of 67%). The tree of rcu_node structures is embedded into a linear array in the rcu_state structure, with the root of the tree in element zero, as shown below for an eight-CPU system with a three-level hierarchy. The arrows link a given rcu_node structure to its parent. Each rcu_node indicates the range of CPUs covered, so that the root node covers all of the CPUs, each node in the second level covers half of the CPUs, and each node in the leaf level covers a pair of CPUs.
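The embedding of the tree into a linear array can be sketched in a few lines of user-space Python. This is an illustrative model only, not the kernel's layout code: the `RcuNode` fields and the `build_node_array` helper are hypothetical names, and the sketch assumes a uniform fanout of at least two at every level.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RcuNode:
    level: int              # 0 is the root level
    parent: Optional[int]   # array index of the parent, None for the root
    first_cpu: int          # lowest CPU number covered by this node
    last_cpu: int           # highest CPU number covered by this node


def build_node_array(nr_cpus: int, fanout: int) -> List[RcuNode]:
    """Lay out the tree breadth-first in one list, root at index 0."""
    # Count the nodes on each level, working up from the leaves to the root.
    counts = []
    n = nr_cpus
    while True:
        n = -(-n // fanout)          # ceiling division
        counts.append(n)
        if n == 1:
            break
    counts.reverse()                 # counts[0] == 1 is the root level

    nodes: List[RcuNode] = []
    parent_level_start = 0           # index of the first node one level up
    for level, count in enumerate(counts):
        level_start = len(nodes)
        span = fanout ** (len(counts) - level)   # CPUs covered per node
        for j in range(count):
            parent = None if level == 0 else parent_level_start + j // fanout
            nodes.append(RcuNode(level, parent, j * span,
                                 min((j + 1) * span, nr_cpus) - 1))
        parent_level_start = level_start
    return nodes
```

For eight CPUs and a fanout of two, this yields seven nodes: the root at index 0 covering CPUs 0-7, two second-level nodes covering four CPUs each, and four leaves covering a pair of CPUs each, with every non-root node recording the index of its parent.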
This array is allocated statically at compile time based on the value of NR_CPUS. The following sequence of six figures shows how grace periods are detected. In the first figure, no CPU has yet passed through a quiescent state, as indicated by the red rectangles. Suppose that all six CPUs simultaneously try to tell RCU that they have passed through a quiescent state. Only one of each pair will be able to acquire the lock on the corresponding lower rcu_node, and so the second figure shows the result if the lucky CPUs are numbers 0, 3, and 5, as indicated by the green rectangles. Once these lucky CPUs have finished, then the other CPUs will acquire the lock, as shown in the third figure. Each of these CPUs will see that they are the last in their group, and therefore all three will attempt to move to the upper rcu_node. Only one at a time can acquire the upper rcu_node structure's lock, and the fourth, fifth, and sixth figures show the sequence of states assuming that CPU 1, CPU 2, and CPU 4 acquire the lock in that order. The sixth and final figure in the group shows that all CPUs have passed through a quiescent state, so that the grace period has ended. In the above sequence, there were never more than three CPUs contending for any one lock, in happy contrast to Classic RCU, where all six CPUs might contend. However, even more dramatic reductions in lock contention are possible with larger numbers of CPUs. Consider a hierarchy of rcu_node structures, with 64 lower structures and 64*64=4,096 CPUs, as shown in the following figure. Here each of the lower rcu_node structures' locks are acquired by 64 CPUs, a 64-times reduction from the 4,096 CPUs that would acquire Classic RCU's single global lock. Similarly, during a given grace period, only one CPU from each of the lower rcu_node structures will acquire the upper rcu_node structure's lock, which is again a 64x reduction from the contention level that would be experienced by Classic RCU running on a 4,096-CPU system. 
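The walk-through above can be replayed with a small user-space Python model. It is a sketch under stated assumptions rather than the kernel's algorithm: the `Node` class, the `qsmask` field name, and the helper functions are hypothetical, the hierarchy is fixed at two levels, and a lock acquisition is represented simply by bumping a counter. A CPU clears its bit in its leaf node and moves up only if it was the last in its group, so it never needs more than one node's lock at a time.

```python
class Node:
    """One node; qsmask holds one bit per child still owing a quiescent state."""

    def __init__(self, nchildren, parent=None, bit_in_parent=0):
        self.nchildren = nchildren
        self.parent = parent
        self.bit_in_parent = bit_in_parent
        self.qsmask = 0
        self.lock_acquisitions = 0   # stands in for taking this node's lock


def build_two_level(nleaves, cpus_per_leaf):
    root = Node(nleaves)
    leaves = [Node(cpus_per_leaf, root, i) for i in range(nleaves)]
    return root, leaves


def start_grace_period(root, leaves):
    """Set every bit: all children must report for the new grace period."""
    for node in [root] + leaves:
        node.qsmask = (1 << node.nchildren) - 1


def report_qs(leaves, cpu):
    """Record a quiescent state for cpu; return True if the grace period ended."""
    cpus_per_leaf = leaves[0].nchildren
    node, bit = leaves[cpu // cpus_per_leaf], cpu % cpus_per_leaf
    while node is not None:
        node.lock_acquisitions += 1
        node.qsmask &= ~(1 << bit)
        if node.qsmask:
            return False             # siblings are still outstanding
        node, bit = node.parent, node.bit_in_parent
    return True                      # the root's mask emptied
```

Replaying the six-CPU example in the order 0, 3, 5, 1, 2, 4 leaves each leaf's lock acquired twice and the root's lock three times, and only the final report (CPU 4) ends the grace period. With 64 leaves of 64 CPUs each, every lock in the model is acquired exactly 64 times per grace period.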
Quick Quiz 1: Wait a minute! With all those new locks, how do you avoid deadlock? Quick Quiz 2: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead? Quick Quiz 3: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!! The implementation maintains some per-CPU data, such as lists of RCU callbacks, organized into rcu_data structures. In addition, rcu (as in call_rcu()) and rcu_bh (as in call_rcu_bh()) each maintain their own hierarchy, as shown in the following figure. Quick Quiz 4: OK, so what is the story with the colors? The next section discusses energy conservation. Towards a Greener RCU Implementation As noted earlier, an important goal of this effort is to leave sleeping CPUs lie in order to promote energy conservation. In contrast, classic RCU will happily awaken each and every sleeping CPU at least once per grace period in some cases, which is suboptimal in the case where a small number of CPUs are busy doing RCU updates and the majority of the CPUs are mostly idle. This situation occurs frequently in systems sized for peak loads, and we need to be able to accommodate it gracefully. Furthermore, we need to fix a long-standing bug in Classic RCU where a dynticks-idle CPU servicing an interrupt containing a long-running RCU read-side critical section will fail to prevent an RCU grace period from ending. Quick Quiz 5: Given such an egregious bug, why does Linux run at all? This is accomplished by requiring that all CPUs manipulate counters located in a per-CPU rcu_dynticks structure. Loosely speaking, these counters have even-numbered values when the corresponding CPU is in dynticks idle mode, and have odd-numbered values otherwise. RCU thus needs to wait for quiescent states only for those CPUs whose rcu_dynticks counters are odd, and need not wake up sleeping CPUs, whose counters will be even. 
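The even/odd convention is easy to state in code. The following one-function Python sketch is illustrative only (`cpus_to_wait_for` is a hypothetical helper, and a plain list stands in for the per-CPU rcu_dynticks structures): it picks out the CPUs whose counters are odd, which are the only ones RCU needs to hear from.

```python
def cpus_to_wait_for(dynticks_counters):
    """Return the CPUs whose dynticks counter is odd, i.e. not dynticks-idle.

    dynticks_counters[i] is the current counter value for CPU i.
    """
    return [cpu for cpu, value in enumerate(dynticks_counters)
            if value % 2 == 1]
```

For the 16-CPU example from the introduction, with four busy CPUs holding odd counters and twelve sleeping CPUs holding even ones, only the four busy CPUs are returned, and the sleeping twelve are left alone.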
As shown in the following diagram, each per-CPU rcu_dynticks is shared by the “rcu” and “rcu_bh” implementations. The following section presents a high-level view of the RCU state machine. State Machine At a sufficiently high level, Linux-kernel RCU implementations can be thought of as high-level state machines as shown in the following schematic: The common-case path through this state machine on a busy system goes through the two uppermost loops, initializing at the beginning of each grace period (GP), waiting for quiescent states (QS), and noting when each CPU passes through its first quiescent state for a given grace period. On such a system, quiescent states will occur on each context switch, or, for CPUs that are either idle or executing user-mode code, each scheduling-clock interrupt. CPU-hotplug events will take the state machine through the “CPU Offline” box, while the presence of “holdout” CPUs that fail to pass through quiescent states quickly enough will exercise the path through the “Send resched IPIs to Holdout CPUs” box. RCU implementations that avoid unnecessarily awakening dyntick-idle CPUs will mark those CPUs as being in an extended quiescent state, taking the “Y” branch out of the “CPUs in dyntick-idle Mode?” decision diamond (but note that CPUs in dyntick-idle mode will not be sent resched IPIs). Finally, if CONFIG_RCU_CPU_STALL_DETECTOR is enabled, truly excessive delays in reaching quiescent states will exercise the “Complain About Holdout CPUs” path. The events in the above state schematic interact with different data structures, as shown below: However, the state schematic does not directly translate into C code for any of the RCU implementations. Instead, these implementations are coded as an event-driven system within the kernel. Therefore, the following section describes some “use cases”, or ways in which the RCU algorithm traverses the above state schematic as well as the relevant data structures. 
Use Cases This section gives an overview of several “use cases” within the RCU implementation, listing the data structures touched and the functions invoked. The use cases are as follows: Start a new grace period. Pass through a quiescent state. Announce a quiescent state to RCU. Enter and leave dynticks idle mode. Interrupt from dynticks idle mode. NMI from dynticks idle mode. Note that a CPU is in dynticks idle mode. Offline a CPU. Online a CPU. Detect a too-long grace period. Each of these use cases is described in the following sections. Start a New Grace Period The rcu_start_gp() function starts a new grace period. This function is invoked when a CPU having callbacks waiting for a grace period notices that no grace period is in progress. The rcu_start_gp() function updates state in the rcu_state and rcu_data structures to note the newly started grace period, acquires the ->onofflock (and disables irqs) to exclude any concurrent CPU-hotplug operations, sets the bits in all of the rcu_node structures to indicate that all CPUs (including this one) must pass through a quiescent state, and finally releases the ->onofflock. The bit-setting operation is carried out in two phases. First, the non-leaf rcu_node structures' bits are set without holding any additional locks, and then each leaf rcu_node structure's bits are set in turn while holding that structure's ->lock. Quick Quiz 6: But what happens if a CPU tries to report going through a quiescent state (by clearing its bit) before the bit-setting CPU has finished? Quick Quiz 7: And what happens if all CPUs try to report going through a quiescent state before the bit-setting CPU has finished, thus ending the new grace period before it starts? Pass Through a Quiescent State The rcu and rcu_bh flavors of RCU have different sets of quiescent states.
Quiescent states for rcu are context switch, idle (either dynticks or the idle loop), and user-mode execution, while quiescent states for rcu_bh are any code outside of softirq with interrupts enabled. Note that a quiescent state for rcu is also a quiescent state for rcu_bh. Quiescent states for rcu are recorded by invoking rcu_qsctr_inc(), while quiescent states for rcu_bh are recorded by invoking rcu_bh_qsctr_inc(). These two functions record their state in the current CPU's rcu_data structure. These functions are invoked from the scheduler, from __do_softirq(), and from rcu_check_callbacks(). This latter function is invoked from the scheduling-clock interrupt, and analyzes state to determine whether this interrupt occurred within a quiescent state, invoking rcu_qsctr_inc() and/or rcu_bh_qsctr_inc(), as appropriate. It also raises RCU_SOFTIRQ, which results in rcu_process_callbacks() being invoked on the current CPU at some later time from softirq context. Announce a Quiescent State to RCU The aforementioned rcu_process_callbacks() function has several duties: Determining when to take measures to end an over-long grace period (via force_quiescent_state()). Taking appropriate action when some other CPU detected the end of a grace period (via rcu_process_gp_end()). “Appropriate action” includes advancing this CPU's callbacks and recording the new grace period. This same function updates state in response to some other CPU starting a new grace period. Reporting the current CPU's quiescent states to the core RCU mechanism (via rcu_check_quiescent_state(), which in turn invokes cpu_quiet()). This of course might mark the end of the current grace period. Starting a new grace period if there is no grace period in progress and this CPU has RCU callbacks still waiting for a grace period (via cpu_needs_another_gp() and rcu_start_gp()). Invoking any of this CPU's callbacks whose grace period has ended (via rcu_do_batch()).
These interactions are carefully orchestrated in order to avoid buggy behavior such as reporting a quiescent state from the previous grace period against the current grace period. Enter and Leave Dynticks Idle Mode The scheduler invokes rcu_enter_nohz() to enter dynticks-idle mode, and invokes rcu_exit_nohz() to exit it. The rcu_enter_nohz() function decrements a per-CPU dynticks_nesting variable and increments a per-CPU dynticks counter, the latter of which must then have an even-numbered value. The rcu_exit_nohz() function increments this same per-CPU dynticks_nesting variable, and again increments the per-CPU dynticks counter, the latter of which must then have an odd-numbered value. (The dynticks_nesting variable is thus zero while the CPU is in dynticks-idle mode, which is what the zero checks in the interrupt handlers described below rely on.) A given CPU's dynticks counter can be sampled by other CPUs. If the value is even, the sampled CPU is in an extended quiescent state. Similarly, if the counter value changes during a given grace period, the sampled CPU must have been in an extended quiescent state at some point during the grace period. However, there is another dynticks_nmi per-CPU variable that must also be sampled, as will be discussed below. Interrupt from Dynticks Idle Mode Interrupts from dynticks idle mode are handled by rcu_irq_enter() and rcu_irq_exit(). The rcu_irq_enter() function increments the per-CPU dynticks_nesting variable, and, if the prior value was zero, also increments the dynticks per-CPU variable (which must then have an odd-numbered value). The rcu_irq_exit() function decrements the per-CPU dynticks_nesting variable, and, if the new value is zero, also increments the dynticks per-CPU variable (which must then have an even-numbered value). Note that entering an irq handler exits dynticks idle mode and vice versa. This enter/exit anti-correspondence can cause much confusion. You have been warned. NMI from Dynticks Idle Mode NMIs from dynticks idle mode are handled by rcu_nmi_enter() and rcu_nmi_exit().
These functions both increment the dynticks_nmi counter, but only if the aforementioned dynticks counter is even. In other words, NMIs refrain from manipulating the dynticks_nmi counter if the NMI occurred in non-dynticks-idle mode or within an interrupt handler. The only difference between these two functions is the error checks, as rcu_nmi_enter() must leave the dynticks_nmi counter with an odd value, and rcu_nmi_exit() must leave this counter with an even value. Note That a CPU is in Dynticks Idle Mode The force_quiescent_state() function implements a two-phase state machine. In the first phase (RCU_SAVE_DYNTICK), the dyntick_save_progress_counter() function scans the CPUs that have not yet reported a quiescent state, recording their per-CPU dynticks and dynticks_nmi counters. If these counters both have even-numbered values, then the corresponding CPU is in dynticks-idle state, which is therefore noted as an extended quiescent state (reported via cpu_quiet_msk()). In the second phase (RCU_FORCE_QS), the rcu_implicit_dynticks_qs() function again scans the CPUs that have not yet reported a quiescent state (either explicitly or implicitly during the RCU_SAVE_DYNTICK phase), again checking the per-CPU dynticks and dynticks_nmi counters. If each of these has either changed in value or is now even, then the corresponding CPU has either passed through or is now in dynticks idle, which as before is noted as an extended quiescent state. If rcu_implicit_dynticks_qs() finds that a given CPU has neither been in dynticks idle mode nor reported a quiescent state, it invokes rcu_implicit_offline_qs(), which checks to see if that CPU is offline, which is also reported as an extended quiescent state. If the CPU is online, then rcu_implicit_offline_qs() sends it a reschedule IPI in an attempt to remind it of its duty to report a quiescent state to RCU.
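The counter manipulations and the two-phase scan described above can be modelled together in user-space Python. This is a sketch under stated assumptions, not the kernel code: the class and function names are hypothetical, memory barriers and irq disabling are ignored, and dynticks_nesting is assumed to be 1 while a CPU runs normally and 0 while it is dynticks-idle, which is what the zero tests in rcu_irq_enter() and rcu_irq_exit() imply.

```python
class CpuDynticks:
    """Per-CPU counters for one CPU, which starts out running (not idle)."""

    def __init__(self):
        self.dynticks = 1          # odd: not in dynticks-idle mode
        self.dynticks_nesting = 1  # assumed 1 while running, 0 while idle
        self.dynticks_nmi = 0      # even: no NMI from idle in progress

    def enter_nohz(self):
        self.dynticks_nesting -= 1
        self.dynticks += 1
        assert self.dynticks % 2 == 0

    def exit_nohz(self):
        self.dynticks_nesting += 1
        self.dynticks += 1
        assert self.dynticks % 2 == 1

    def irq_enter(self):
        self.dynticks_nesting += 1
        if self.dynticks_nesting == 1:     # prior value was zero: was idle
            self.dynticks += 1
            assert self.dynticks % 2 == 1

    def irq_exit(self):
        self.dynticks_nesting -= 1
        if self.dynticks_nesting == 0:     # back to dynticks-idle mode
            self.dynticks += 1
            assert self.dynticks % 2 == 0

    def nmi_enter(self):
        if self.dynticks % 2 == 0:         # only NMIs from idle are counted
            self.dynticks_nmi += 1
            assert self.dynticks_nmi % 2 == 1

    def nmi_exit(self):
        if self.dynticks % 2 == 0:
            self.dynticks_nmi += 1
            assert self.dynticks_nmi % 2 == 0


def save_phase(cpus, holdouts):
    """Phase 1: snapshot both counters; return CPUs with both counters even."""
    snapshots = {c: (cpus[c].dynticks, cpus[c].dynticks_nmi) for c in holdouts}
    idle = {c for c, (dt, nmi) in snapshots.items()
            if dt % 2 == 0 and nmi % 2 == 0}
    return snapshots, idle


def force_phase(cpus, snapshots, holdouts):
    """Phase 2: return CPUs whose counters have each changed or are now even."""
    def passed(current, saved):
        return current != saved or current % 2 == 0

    return {c for c in holdouts
            if passed(cpus[c].dynticks, snapshots[c][0])
            and passed(cpus[c].dynticks_nmi, snapshots[c][1])}
```

In this model, a CPU that is idle during the first phase is reported immediately, a CPU that was servicing an interrupt from idle is caught by the second phase once the handler returns, and a CPU whose counters are unchanged and odd remains a holdout, which is the case in which a reschedule IPI would be sent.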
Note that force_quiescent_state() does not directly invoke either dyntick_save_progress_counter() or rcu_implicit_dynticks_qs(), instead passing these functions to an intervening rcu_process_dyntick() function that abstracts out the common code involved in scanning the CPUs and reporting extended quiescent states. Quick Quiz 8: And what happens if one CPU comes out of dyntick-idle mode and then passes through a quiescent state just as another CPU notices that the first CPU was in dyntick-idle mode? Couldn't they both attempt to report a quiescent state at the same time, resulting in confusion? Quick Quiz 9: But what if all the CPUs end up in dyntick-idle mode? Wouldn't that prevent the current RCU grace period from ever ending? Quick Quiz 10: Given that force_quiescent_state() is a two-phase state machine, don't we have double the scheduling latency due to scanning all the CPUs? Offline a CPU CPU-offline events cause rcu_cpu_notify() to invoke rcu_offline_cpu(), which in turn invokes __rcu_offline_cpu() on both the rcu and the rcu_bh instances of the data structures. This function clears the outgoing CPU's bits so that future grace periods will not expect this CPU to announce quiescent states, and further invokes cpu_quiet() in order to announce the offline-induced extended quiescent state. This work is performed with the global ->onofflock held in order to prevent interference with concurrent grace-period initialization. Quick Quiz 11: But the other reason to hold ->onofflock is to prevent multiple concurrent online/offline operations, right? Online a CPU CPU-online events cause rcu_cpu_notify() to invoke rcu_online_cpu(), which initializes the incoming CPU's dynticks state, and then invokes rcu_init_percpu_data() to initialize the incoming CPU's rcu_data structure, and also to set this CPU's bits (again protected by the global ->onofflock) so that future grace periods will wait for a quiescent state from this CPU.
Finally, rcu_online_cpu() sets up the RCU softirq vector for this CPU. Quick Quiz 12: Given all these acquisitions of the global ->onofflock, won't there be horrible lock contention when running with thousands of CPUs? Detect a Too-Long Grace Period When the CONFIG_RCU_CPU_STALL_DETECTOR kernel parameter is specified, the record_gp_stall_check_time() function records the time and also a timestamp set three seconds into the future. If the current grace period still has not ended by that time, the check_cpu_stall() function will check for the culprit, invoking print_cpu_stall() if the current CPU is the holdout, or print_other_cpu_stall() if it is some other CPU. A two-jiffies offset helps ensure that CPUs report on themselves when possible, taking advantage of the fact that a CPU can normally do a better job of tracing its own stack than it can tracing some other CPU's stack. Testing RCU is fundamental synchronization code, so any failure of RCU results in random, difficult-to-debug memory corruption. It is therefore extremely important that RCU be highly reliable. Some of this reliability stems from careful design, but at the end of the day we must also rely on heavy stress testing, otherwise known as torture. Fortunately, although there has been some debate as to exactly what populations are covered by the provisions of the Geneva Convention, it is still the case that it does not apply to software. Therefore, it is still legal to torture your software. In fact, it is strongly encouraged, because if you don't torture your software, it will end up torturing you by crashing at the most inconvenient times imaginable. Therefore, we torture RCU quite vigorously using the rcutorture module. However, it is not sufficient to torture the common-case uses of RCU. It is also necessary to torture it in unusual situations, for example, when concurrently onlining and offlining CPUs and when CPUs are concurrently entering and exiting dynticks idle mode. 
I use a script to online and offline CPUs, and use the test_no_idle_hz module parameter to rcutorture to stress-test dynticks idle mode. Just to be fully paranoid, I sometimes run a kernbench workload in parallel as well. Ten hours of this sort of torture on a 128-way machine seems sufficient to shake out most bugs. Even this is not the complete story. As Alexey Dobriyan and Nick Piggin demonstrated in early 2008, it is also necessary to torture RCU with all relevant combinations of kernel parameters. The relevant kernel parameters may be identified using yet another script, and are as follows: CONFIG_CLASSIC_RCU: Classic RCU. CONFIG_PREEMPT_RCU: Preemptable (real-time) RCU. CONFIG_TREE_RCU: Classic RCU for huge SMP systems. CONFIG_RCU_FANOUT: Number of children for each rcu_node. CONFIG_RCU_FANOUT_EXACT: Balance the rcu_node tree. CONFIG_HOTPLUG_CPU: Allow CPUs to be offlined and onlined. CONFIG_NO_HZ: Enable dyntick-idle mode. CONFIG_SMP: Enable multi-CPU operation. CONFIG_RCU_CPU_STALL_DETECTOR: Enable RCU to detect when CPUs go on extended quiescent-state vacations. CONFIG_RCU_TRACE: Generate RCU trace files in debugfs. We ignore the CONFIG_DEBUG_LOCK_ALLOC configuration variable under the perhaps-naive assumption that hierarchical RCU could not have broken lockdep. There are still 10 configuration variables, which would result in 1,024 combinations if they were independent boolean variables. Fortunately the first three are mutually exclusive, which reduces the number of combinations down to 384 (3 × 2^7), but CONFIG_RCU_FANOUT can take on any of the 63 values from 2 to 64, increasing the number of combinations to 12,096 (3 × 63 × 2^6). This is an infeasible number of combinations. One key observation is that only CONFIG_NO_HZ and CONFIG_PREEMPT can be expected to have changed behavior if either CONFIG_CLASSIC_RCU or CONFIG_PREEMPT_RCU is in effect, as only these portions of the two pre-existing RCU implementations were changed during this effort.
This cuts out almost two thirds of the possible combinations. Furthermore, not all of the possible values of CONFIG_RCU_FANOUT produce significantly different results, in fact only a few cases really need to be tested separately: Single-node “tree”. Two-level balanced tree. Three-level balanced tree. Autobalanced tree, where CONFIG_RCU_FANOUT specifies an unbalanced tree, but such that it is auto-balanced in absence of CONFIG_RCU_FANOUT_EXACT. Unbalanced tree. Looking further, CONFIG_HOTPLUG_CPU makes sense only given CONFIG_SMP, and CONFIG_RCU_CPU_STALL_DETECTOR is independent, and really only needs to be tested once (though someone even more paranoid than am I might decide to test it both with and without CONFIG_SMP). Similarly, CONFIG_RCU_TRACE need only be tested once, but the truly paranoid (such as myself) will choose to run it both with and without CONFIG_NO_HZ. This allows us to obtain excellent coverage of RCU with only 15 test cases. All test cases specify the following configuration parameters in order to run rcutorture and so that CONFIG_HOTPLUG_CPU=n actually takes effect: The 15 test cases are as follows: Force single-node “tree” for small systems: Force two-level tree for large systems: Force three-level tree for huge systems: Test autobalancing to a balanced tree: Test unbalanced tree: Disable CPU-stall detection: Disable CPU-stall detection and dyntick idle mode: Disable CPU-stall detection and CPU hotplug: Disable CPU-stall detection, dyntick idle mode, and CPU hotplug: Disable SMP, CPU-stall detection, dyntick idle mode, and CPU hotplug: This combination located a number of compiler warnings. 
Disable SMP and CPU hotplug: Test Classic RCU with dynticks idle but without preemption: Test Classic RCU with preemption but without dynticks idle: Test Preemptable RCU with dynticks idle: Test Preemptable RCU without dynticks idle: For a large change that affects RCU core code, one should run rcutorture for each of the above combinations, and concurrently with CPU offlining and onlining for cases with CONFIG_HOTPLUG_CPU. For small changes, it may suffice to run kernbench in each case. Of course, if the change is confined to a particular subset of the configuration parameters, it may be possible to reduce the number of test cases. Torturing software: the Geneva Convention does not (yet) prohibit it, and I strongly recommend it!!! Conclusion This hierarchical implementation of RCU reduces lock contention and avoids unnecessarily awakening dyntick-idle sleeping CPUs, while also helping to debug Linux's hotplug-CPU code paths. This implementation is designed to handle single systems with thousands of CPUs, and on 64-bit systems has an architectural limitation of a quarter million CPUs, a limit I expect to be sufficient for at least the next few years. This RCU implementation of course has some limitations: The force_quiescent_state() function can scan the full set of CPUs with irqs disabled. This would be fatal in a real-time implementation of RCU, so if hierarchy ever needs to be introduced to preemptable RCU, some other approach will be required. It is possible that it will be problematic on 4,096-CPU systems, but actual testing on such systems is required to prove this one way or the other. On busy systems, the force_quiescent_state() scan would not be expected to happen, as CPUs should pass through quiescent states within three jiffies of the start of a quiescent state. On semi-busy systems, only the CPUs in dynticks-idle mode throughout would need to be scanned.
In some cases, for example when a dynticks-idle CPU is handling an interrupt during a scan, subsequent scans are required. However, each such scan is performed separately, so scheduling latency is degraded by the overhead of only one such scan. If this scan proves problematic, one straightforward solution would be to do the scan incrementally. This would increase code complexity slightly and would also increase the time required to end a grace period, but would nonetheless be a likely solution. The rcu_node hierarchy is created at compile time, and is therefore sized for the worst-case NR_CPUS number of CPUs. However, even for 4,096 CPUs, the rcu_node hierarchy consumes only 65 cache lines on a 64-bit machine (and just you try accommodating 4,096 CPUs on a 32-bit machine!). Of course, a kernel built with NR_CPUS=4096 running on a 16-CPU machine would use a two-level tree when a single-node tree would work just fine. Although this configuration would incur added locking overhead, this does not affect hot-path read-side code, so should not be a problem in practice. This patch does increase kernel text and data somewhat: the old Classic RCU implementation consumes 1,757 bytes of kernel text and 456 bytes of kernel data for a total of 2,213 bytes, while the new hierarchical RCU implementation consumes 4,006 bytes of kernel text and 624 bytes of kernel data for a total of 4,630 bytes on a NR_CPUS=4 system. This is a non-problem even for most embedded systems, which often come with hundreds of megabytes of main memory. However, if this is a problem for tiny embedded systems, it may be necessary to provide both “scale up” and “scale down” implementations of RCU. This hierarchical RCU implementation should nevertheless be a vast improvement over Classic RCU for machines with hundreds of CPUs. After all, Classic RCU was designed for systems with only 16-32 CPUs. At some point, it may be necessary to also apply hierarchy to the preemptable RCU implementation. 
This will be challenging due to the modular arithmetic used on the per-CPU counter pairs, but should be doable. Acknowledgements I am indebted to Manfred Spraul for ideas, review comments, bugs spotted, as well as some good healthy competition, to Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers, Lai Jiangshan, Andi Kleen, Andy Whitcroft, Gautham Shenoy, and Andrew Morton for review comments, and to Thomas Gleixner for much help with timer issues. I am thankful to Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos, Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton Blanchard, and Nathan Lynch for keeping machines alive despite my (ab)use for this project. We all owe thanks to Peter Zijlstra, Gautham Shenoy, Lai Jiangshan, and Manfred Spraul for helping (in some cases unwittingly) render this document at least partially human readable. Finally, I am grateful to Kathy Bennett for her support of this effort. This work represents the view of the authors and does not necessarily represent the view of IBM. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others. Answers to Quick Quizzes Quick Quiz 1: Wait a minute! With all those new locks, how do you avoid deadlock? Answer: Deadlock is avoided by never holding more than one of the rcu_node structures' locks at a given time. This algorithm uses two more locks, one to prevent CPU hotplug operations from running concurrently with grace-period advancement (onofflock) and another to permit only one CPU at a time to force a quiescent state to end quickly (fqslock). These are subject to a locking hierarchy, so that fqslock must be acquired before onofflock, which in turn must be acquired before any of the rcu_node structures' locks. Also, as a practical matter, refusing to ever hold more than one of the rcu_node locks means that it is unnecessary to track which ones are held.
Such tracking would be painful as well as unnecessary. Back to Quick Quiz 1. Quick Quiz 2: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead? Answer: RCU works with no problems on systems with a few hundred CPUs, so allowing 64 CPUs to contend on a single lock leaves plenty of headroom. Keep in mind that these locks are acquired quite rarely, as each CPU will check in about once per grace period, and grace periods extend for milliseconds. Back to Quick Quiz 2. Quick Quiz 3: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!! Answer: OK, have it your way, then!!! Set CONFIG_RCU_FANOUT=16 and (for NR_CPUS=4096) you will get a three-level hierarchy with 256 rcu_node structures at the lowest level, 16 rcu_node structures as intermediate nodes, and a single root-level rcu_node. The penalty you will pay is that more rcu_node structures will need to be scanned when checking to see which CPUs need help completing their quiescent states (256 instead of only 64). Back to Quick Quiz 3. Quick Quiz 4: OK, so what is the story with the colors? Answer: Data structures analogous to rcu_state (including rcu_ctrlblk) are yellow, those containing the bitmaps used to determine when CPUs have checked in are pink, and the per-CPU rcu_data structures are blue. Later on, we will see that data structures used to conserve energy (such as rcu_dynticks) will be green. Back to Quick Quiz 4. Quick Quiz 5: Given such an egregious bug, why does Linux run at all? Answer: Because the Linux kernel contains device drivers that are (relatively) well behaved. Few if any of them spin in RCU read-side critical sections for the many milliseconds that would be required to provoke this bug. The bug nevertheless does need to be fixed, and this variant of RCU does fix it. Back to Quick Quiz 5.
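The node counts quoted above (65 rcu_node structures for NR_CPUS=4096 with a fanout of 64, and 256 + 16 + 1 with a fanout of 16) can be checked with a little arithmetic. The following is only a sketch: the function name rcu_node_levels() is invented for illustration, and it assumes that each level of the tree is sized by dividing the level below it by the fanout and rounding up, which matches the figures in the text but is not taken from the kernel source.

```python
def rcu_node_levels(nr_cpus, fanout):
    """Return the number of rcu_node structures per level, leaves first.

    Hypothetical helper: assumes each level holds ceil(children / fanout)
    nodes, continuing upward until a single root node remains.
    """
    if nr_cpus < 1 or fanout < 2:
        raise ValueError("need nr_cpus >= 1 and fanout >= 2")
    levels = []
    remaining = nr_cpus
    while True:
        remaining = -(-remaining // fanout)  # ceiling division
        levels.append(remaining)
        if remaining == 1:
            return levels


if __name__ == "__main__":
    for cpus, fanout in ((4096, 64), (4096, 16), (16, 64)):
        levels = rcu_node_levels(cpus, fanout)
        print(cpus, fanout, levels, sum(levels))
```

With these assumptions, a tree sized for only 16 CPUs collapses to a single node, which is why a kernel built with NR_CPUS=4096 carries a second level that a 16-CPU machine does not strictly need.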
Quick Quiz 6: But what happens if a CPU tries to report going through a quiescent state (by clearing its bit) before the bit-setting CPU has finished? Answer: There are three cases to consider here: (1) A CPU corresponding to a not-yet-initialized leaf rcu_node structure tries to report a quiescent state. This CPU will see its bit already cleared, so will give up on reporting its quiescent state. Some later quiescent state will serve for the new grace period. (2) A CPU corresponding to a leaf rcu_node structure that is currently being initialized tries to report a quiescent state. This CPU will see that the rcu_node structure's ->lock is held, so will spin until it is released. But once the lock is released, the rcu_node structure will have been initialized, reducing to the following case. (3) A CPU corresponding to a leaf rcu_node that has already been initialized tries to report a quiescent state. This CPU will find its bit set, and will therefore clear it. If it is the last CPU for that leaf node, it will move up to the next level of the hierarchy. However, this CPU cannot possibly be the last CPU in the system to report a quiescent state, given that the CPU doing the initialization cannot yet have checked in. So, in all three cases, the potential race is resolved correctly. Back to Quick Quiz 6. Quick Quiz 7: And what happens if all CPUs try to report going through a quiescent state before the bit-setting CPU has finished, thus ending the new grace period before it starts? Answer: The bit-setting CPU cannot pass through a quiescent state during initialization, as it has irqs disabled. Its bits therefore remain non-zero, preventing the grace period from ending until the data structure has been fully initialized. Back to Quick Quiz 7. Quick Quiz 8: And what happens if one CPU comes out of dyntick-idle mode and then passes through a quiescent state just as another CPU notices that the first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same time, resulting in confusion? Answer: They will both attempt to acquire the lock on the same leaf rcu_node structure. The first one to acquire the lock will report the quiescent state and clear the appropriate bit, and the second one to acquire the lock will see that this bit has already been cleared. Back to Quick Quiz 8. Quick Quiz 9: But what if all the CPUs end up in dyntick-idle mode? Wouldn't that prevent the current RCU grace period from ever ending? Answer: Indeed it will! However, CPUs that have RCU callbacks are not permitted to enter dyntick-idle mode, so the only way that all the CPUs could possibly end up in dyntick-idle mode would be if there were absolutely no RCU callbacks in the system. And if there are no RCU callbacks in the system, then there is no need for the RCU grace period to end. In fact, there is no need for the RCU grace period to even start. RCU will restart if some irq handler does a call_rcu(), which will cause an RCU callback to appear on the corresponding CPU, which will force that CPU out of dyntick-idle mode, which will in turn permit the current RCU grace period to come to an end. Back to Quick Quiz 9. Quick Quiz 10: Given that force_quiescent_state() is a two-phase state machine, don't we have double the scheduling latency due to scanning all the CPUs? Answer: Ah, but the two phases will not execute back-to-back on the same CPU. Therefore, the scheduling-latency hit of the two-phase algorithm is no different than that of a single-phase algorithm. If the scheduling latency becomes a problem, one approach would be to recode the state machine to scan the CPUs incrementally. But first show me a problem in the real world, then I will consider fixing it! Back to Quick Quiz 10. Quick Quiz 11: But the other reason to hold ->onofflock is to prevent multiple concurrent online/offline operations, right? Answer: Actually, no! 
The CPU-hotplug code's synchronization design prevents multiple concurrent CPU online/offline operations, so only one CPU online/offline operation can be executing at any given time. Therefore, the only purpose of ->onofflock is to prevent a CPU online or offline operation from running concurrently with grace-period initialization. Back to Quick Quiz 11. Quick Quiz 12: Given all these acquisitions of the global ->onofflock, won't there be horrible lock contention when running with thousands of CPUs? Answer: Actually, there can be only three acquisitions of this lock per grace period, and each grace period lasts many milliseconds. One of the acquisitions is by the CPU initializing for the current grace period, and the other two are for onlining and offlining some CPU. These latter two cannot run concurrently due to the CPU-hotplug locking, so at most two CPUs can be contending for this lock at any given time. Lock contention on ->onofflock should therefore be no problem, even on systems with thousands of CPUs. Back to Quick Quiz 12. GFDL 1.3: Wikipedia's exit permit Wikipedia is one of the preeminent examples of what can be done in an open setting; it has, over the years, accumulated millions of articles - many of them excellent - in a large number of languages. Wikipedia also has a bit of a licensing problem, but it would appear that recent events, including the release of a new license by the Free Software Foundation, offer a way out. Wikipedia is licensed under the GNU Free Documentation License (GFDL). The GFDL has been covered here a number of times; it is, to put it mildly, a controversial document. Its anti-DRM provisions are sufficiently broad that, by some people's interpretation, a simple "chmod -r" on a GFDL-licensed file is a violation. But the biggest complaint has to do with the GFDL's notion of "invariant sections." These sections must be propagated unchanged with any copy (or derived work) of the original document.
The GFDL itself must also be included with any copies. So a one-page excerpt from the GNU Emacs manual, for example, must be accompanied by several dozen pages of material, including the original GNU Manifesto. So the GFDL has come to be seen by many as more of a tool for the propagation of FSF propaganda than a license for truly free documentation. Much of the community avoids this license; some groups, such as the Debian Project, see it as non-free. Many projects which still do use the GFDL make a clear point of avoiding (or disallowing outright) the use of cover texts, invariant sections, and other GFDL features. Some projects have dropped the GFDL; in many cases, they have moved to the Creative Commons attribution-sharealike license which retains the copyleft provisions of the GFDL without most of the unwanted baggage. Members of the Wikipedia project have wanted to move away from the GFDL for some time. They have a problem, though: like the Linux kernel, Wikipedia does not require copyright assignments from its contributors. So any relicensing of Wikipedia content would require the permission of all the contributors. For a project on the scale of Wikipedia, the chances of simply finding all of the contributors - much less getting them to agree on a license change - are about zero. So Wikipedia, it seems, is stuck with its current license. There is one exception, though. The Wikipedia copyright policy, under which contributions are accepted, reads like this: Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. The presence of the "or any later version" language allows Wikipedia content to be distributed under the terms of later versions of the GFDL with no need to seek permission from individual contributors. 
Surprisingly, the Wikimedia Foundation has managed to get the Free Software Foundation to cooperate in the use of the "or any later version" permission to carry out an interesting legal hack. On November 3, the FSF and the Wikimedia Foundation jointly announced the release of version 1.3 of the GFDL. This announcement came as a surprise to many, who had no idea that a new GFDL 1.x release was in the works. This update does not address any of the well-known complaints against the GFDL. Instead, it added a new section: An MMC [Massive Multiauthor Collaboration Site] is "eligible for relicensing" if it is licensed under this License, and if all works that were first published under this License somewhere other than this MMC, and subsequently incorporated in whole or in part into the MMC, (1) had no cover texts or invariant sections, and (2) were thus incorporated prior to November 1, 2008. The operator of an MMC Site may republish an MMC contained in the site under CC-BY-SA on the same site at any time before August 1, 2009, provided the MMC is eligible for relicensing. In other words, GFDL-licensed sites like Wikipedia have a special, nine-month window in which they can relicense their content to the Creative Commons attribution-sharealike license. This works because (1) moving to version 1.3 of the license is allowed under the "or any later version" terms, and (2) relicensing to CC-BY-SA is allowed by GFDL 1.3. Legal codes, like other kinds of code, have a certain tendency to pick up cruft as they are patched over time. In this case, the FSF has added a special, time-limited hack which lets Wikipedia make a graceful exit from the GFDL license regime. This move is surprising to many, who would not have guessed that the FSF would go for it. Lawrence Lessig, who calls the change "enormously important," expresses it this way: Richard Stallman deserves enormous credit for enabling this change to occur. 
There were some who said RMS would never permit Wikipedia to be relicensed, as it is one of the crown jewels in his movement for freedom. And so it is: like the GNU/Linux operation system, which his movement made possible, Wikipedia was made possible by the architecture of freedom the FDL enabled. One could well understand a lesser man finding any number of excuses for blocking the change. For whatever reason, Stallman and the FSF chose to go along with this change, though not before adding some safeguards. The November 1 cutoff date (which precedes the GFDL 1.3 announcement) is there to prevent troublemakers from posting FSF manuals to Wikipedia in their entirety, and, thus, relicensing them. Now that Wikipedia has its escape clause, it needs to decide how to respond. The plan would appear to be this: Later this month, we will post a re-licensing proposal for all Wikimedia wikis which are currently licensed under the GFDL. It will be collaboratively developed on meta.wiki and I will announce it here. This re-licensing proposal will include a simplified dual-licensing proposition, under which content will continue to be indefinitely available under GFDL, except for articles which include CC-BY-SA-only additions from external sources. (The terms of service, under this proposal, will be modified to require dual-licensing permission for any new changes.) This proposal will be followed by a "community-wide referendum," with a majority vote deciding whether the new policy will be adopted or not. Expect some interesting discussions over the next month. This series of events highlights a couple of important points to keep in mind when considering copyright and licensing for a project. There is a certain simplicity and egalitarianism inherent in allowing contributors to retain their copyrights. But it does also limit a project's ability to recover from a suboptimal license choice later on. 
Licensing inflexibility can be a good thing or a bad thing, depending on your point of view, but it is certainly something which could be kept in mind. The other thing to be aware of is just how much power the "or any later version" text puts into the hands of the FSF. The license promises that later versions will be "similar in spirit," but the GPLv3 debate made it clear that similarity of spirit is in the eye of the beholder. It is not immediately obvious that allowing text to be relicensed (to a license controlled by a completely different organization) is in the "spirit" of the original GFDL. Your editor suspects that most contributors will be willing to accept this change, but there may be some who feel that their trust was abused. Finally, it's worth noting that "any later version" includes GFDL 2.0. The discussion draft of this major license upgrade has been available for comments for a full two years now. The FSF has not said anything about when it plans to move forward with the new license, but it seems clear that anybody wanting to comment on this draft would be well advised to do so soon. Large I/O memory in small address spaces In the good old days, video graphics drivers ran in user space and the kernel had little to do with video memory. More recently, graphics developers have decisively voted for change and, in the process, moved video memory management into the kernel. So now the kernel must often manipulate video memory directly. And that, as it turns out, is harder than one might expect - at least, on 32-bit machines if the user actually cares about reasonable performance. The problem is that 32-bit machines have a mere 4GB of virtual address space. Linux (usually) splits that space in two; the bottom 3GB are given to user space, while the kernel itself occupies the top 1GB. 
Splitting the space in this way yields an important advantage: there is no need to adjust the memory management configuration on transitions between kernel and user space, which speeds things up considerably. The down side is that the kernel has to fit in the remaining gigabyte of memory. That would not seem like much of a problem, even with contemporary kernels, but remember one thing: the kernel needs to map physical memory into its address space before it can do anything with it. So the amount of virtual address space given to the kernel limits the amount of physical memory it can manipulate directly. One other thing that must fit into the kernel's address space is the vmalloc() area - a range of addresses which can be assigned on the fly to create needed mappings in the kernel. When a virtually-contiguous range of memory is allocated with vmalloc(), it is mapped in this range. Another user of this address space is ioremap(), which makes a range of I/O memory available to the kernel. Device drivers typically need access to I/O memory, so they use ioremap() to map it into the kernel's address space. Graphics adapters are a little different, though, in that they have large I/O memory regions: the entirety of video memory. Contemporary graphics adapters can carry a lot of video memory, to the point that mapping it with ioremap() would require far too much address space, if, indeed, it fits in there at all. So a straight ioremap() is not feasible; life was much easier in the old days when this I/O memory was mapped into user space instead. The Intel i915 developers, who are the farthest ahead when it comes to kernel-based GPU memory management, ran into this problem first. Their initial solution was to map individual pages as needed with ioremap() (or, strictly, ioremap_wc(), which turns on write combining - see this article for more details), and to unmap them afterward. This solution works, but it's slow.
Among other things, an ioremap() operation requires a cross-processor interrupt to be sure that all CPUs know about the address space change. It is a function which was designed to be called infrequently, outside of performance-critical code. Making ioremap() calls a part of most graphical operations is not the way to obtain a satisfactory first-person shooter experience. The real solution comes in the form of a new mapping API developed by Keith Packard (and subsequently tweaked by Ingo Molnar). It draws heavily on the fact that Linux has had to solve this kind of problem before. Remember that the kernel (on 32-bit systems) only has 1GB of address space to work with; that is the maximum amount of physical memory it can ever have directly mapped at any given time. Any physical memory above that amount is called "high memory"; it is normally not mapped into the kernel's address space. Access to that memory requires an explicit mapping - using kmap() or kmap_atomic() - first. High memory is thus trickier to use, but this trick has enabled 32-bit systems to support far more memory than was once thought possible. The new mapping API draws more than inspiration from the treatment of high memory - it uses much of the same mechanism as well. A driver which needs to map a large I/O area sets up the mapping with a call to io_mapping_create_wc(), passing the base address and size of the region. This function returns the struct io_mapping pointer, but it does not actually map any of the I/O memory into the kernel's address space. That must be done a page at a time with a call to one of io_mapping_map_atomic_wc() or io_mapping_map_wc(), each of which takes the struct io_mapping pointer and an offset into the region. Either function will return a kernel-space pointer which is mapped to the page at the given offset. The atomic form is essentially a kmap_atomic() call - it uses the KM_USER0 slot, which is a good thing for developers to know about. It is, by far, the faster of the two, but it requires that the mapping be held by atomic code, and only one page at a time can be mapped in this way.
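The one-page-at-a-time discipline of the atomic form can be illustrated with a toy user-space model. This is emphatically not the kernel API: the class IoMappingModel, its methods, and the use of a bytearray to stand in for video memory are all invented for illustration, and the "slot" here is simply a Python attribute rather than a fixed kernel mapping slot.

```python
PAGE_SIZE = 4096  # assumed page size for this model


class IoMappingModel:
    """Toy model: a large region reachable only through one page-sized window."""

    def __init__(self, size):
        self.region = bytearray(size)  # stands in for the device's I/O memory
        self.atomic_slot = None        # the single window; None means free

    def map_atomic(self, offset):
        """Expose the page containing 'offset' through the single slot."""
        if self.atomic_slot is not None:
            raise RuntimeError("atomic slot already holds a mapping")
        start = offset - offset % PAGE_SIZE  # round down to a page boundary
        window = memoryview(self.region)[start:start + PAGE_SIZE]
        self.atomic_slot = window
        return window

    def unmap_atomic(self, window):
        """Release the window so that another page can be mapped."""
        if window is not self.atomic_slot:
            raise ValueError("not the currently mapped window")
        window.release()
        self.atomic_slot = None


if __name__ == "__main__":
    mapping = IoMappingModel(16 * PAGE_SIZE)
    page = mapping.map_atomic(5 * PAGE_SIZE)
    page[0] = 0xAB                 # write through the window
    mapping.unmap_atomic(page)     # must unmap before mapping another page
    print(hex(mapping.region[5 * PAGE_SIZE]))
```

The point of the model is the constraint, not the mechanism: the region can be arbitrarily larger than the window, but a second map_atomic() call fails until the first window has been released.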
Code which might sleep must use io_mapping_map_wc(), which currently falls back to the old ioremap_wc() implementation. Mapped pages should be unmapped when no longer needed, of course, with io_mapping_unmap_atomic() or io_mapping_unmap() as appropriate. There are some interesting aspects to this implementation. One is that struct io_mapping is never actually defined anywhere. The code need not remember anything except the base address, so the return value from io_mapping_create_wc() is just the base pointer which was passed in. The other is that all of this structure is really only needed on 32-bit systems; a 64-bit processor has no trouble finding enough address space to map video memory. So, on 64-bit systems, io_mapping_create_wc() just maps the entire region with ioremap_wc(); the individual page operations are no-ops. Keith reports that, with this change, Quake 3 (used for testing purposes only, of course) runs 18 times faster. The far more serious Dave Airlie tested with glxgears and got an increase from 85 frames/second to 380. This is a big enough improvement that they would like to see this code go into 2.6.28, which will contain the GEM memory manager code. Linus responds: I'm inclined to agree. Not that I think 380fps sounds very impressive (I get 850+ fps with _software_ rendering, for chissake), but because 85 fps is a joke, and clearly without this setup there's not even any point to try to do any other optimizations. As a result, this code has been merged into the mainline and will appear in 2.6.28-rc4. Linux Connectivity for the Wii Remote Linux has had support for numerous hand-held infrared remote control devices for many years through the Linux Infra Red Controller (LIRC) drivers. There has been recent work to include LIRC in the kernel. The Nintendo Wii Remote is a more sophisticated remote control that was developed for the Wii game platform; it is accessible through a collection of Linux tools called CWiid.
Wikipedia describes the Wii Remote: The Wii Remote, sometimes nicknamed "Wiimote", is the primary controller for Nintendo's Wii console. A main feature of the Wii Remote is its motion sensing capability, which allows the user to interact with and manipulate items on screen via movement and pointing through the use of accelerometer and optical sensor technology. Another feature is its expandability through the use of attachments. The Wii Remote was announced at the Tokyo Game Show on September 16, 2005. The Wiimote hardware capabilities (photo) include: Two-way wireless Bluetooth connectivity to the host. A screen-mounted Sensor Bar with multiple IR light sources and a 5 meter range. A built-in IR camera with distance and rotation sensing capabilities. A three axis accelerometer for detecting hand motions. Six general purpose remote control pushbuttons labeled A, -, Home, +, 1 and 2. An up-down-left-right four-way pushbutton. A power switch. Four remote controlled LEDs. A built-in speaker for providing audio effects. A "rumble" device for producing vibrations. Built-in non volatile memory with space for user data. A hardware expansion port. Powered by two AA cells, can use rechargeable types. CWiid was written by L. Donnie Smith and has been released under the GPLv2. The project has been around since March, 2007 and is currently at version 0.6.00. The libcwiid API document explains the CWiid software interface. There are currently at least twelve programs using CWiid. Some of the highlights include control of DMX lighting systems with Wiimote Control, 3D display of chemical structures using the Avogadro molecular editor, the WiiOSC control device for music programs and a newly released prototype Wiimote Control for the Ardour multi-track audio editor. Although the Wiimote control is ideal for use in games, there don't appear to be any such developments under Linux at this point. 
One of the more interesting uses of the Wiimote is Head Tracking for an immersive 3D experience, based on the work of Johnny Chung Lee. This approach to 3D visualization produces full-color displays, unlike the old-fashioned 3D movie technology that uses glasses with red and green lenses. Other 3D technologies require expensive LCD shutters that tend to produce a lot of flicker. The head tracking 3D technology would be well suited for use by the physically disabled. New Wiimote devices can be purchased for $40 or less. Many of them exist on the used markets, thanks to the popularity of the Wii platform. If your favorite application could benefit from a two-way wireless remote control device with a wide variety of features, the Wiimote looks like a good choice. Interview with the openSUSE board The openSUSE project recently welcomed the first community elected board. The previous board was appointed by Novell. The new board consists of both Novell employees and non-Novell members of the community. From the Non-Novell side of the community Pascal Bleser and Bryen Yunashko were elected, and from the Novell side Henne Vogelsang and Federico Mena-Quintero were elected. Novell appointed Michael Löffler as chairman of the board. We asked the new board a few questions and are pleased to present their responses. LWN: There was some discussion in the mailing lists prior to the election about the definition of a member. The current definition says: "'openSUSE Members' are specifically distinguished contributors who have brought a continued and substantial contribution to the openSUSE project." Do you agree with this definition? What is your definition of "continued and substantial"? Pascal: We agree to that definition. The only potential issue with it is the name "Members". We had a long discussion on our opensuse-project mailing-list (that is open to everyone) about a proper name (and the process too), but we didn't manage to come up with something less ambiguous.
Fedora's "Ambassador" title wouldn't be too bad, but actually even more confusing as it is not the same role. Unfortunately there is no red line we can cross between "non substantial" and "substantial", which is why the Board discusses and votes on each membership request. Typically, membership is granted to individuals who have been contributing to the community since more than half a year, in domains such as packaging, translating, authoring content on the Wiki, hacking, helping out by answering questions or administrating mailing-lists, our forum, or on IRC, etc... This is by no means an exhaustive list. We are looking for verifiable contributions though, and we will discuss how to proceed for granting Member status in the future, as the current process doesn't scale that well. Several other options were discussed on our opensuse-project mailing-list when we initiated the idea. Bryen: This will always be a continuously evolved definition as we identify people who contribute to the project as a whole, or in part, in new ways we might not have thought of previously. Myself, I became a member for my advocacy for a11y (accessibility through computing) and encouraging others to think about the needs of people with accessibility issues. I talked about it, I provided relevant information and I participated regularly in openSUSE meetings. I see "continued and substantial contribution" as someone who opens doors to making openSUSE a more relevant platform for users. LWN: Does Novell adequately support openSUSE? Should Novell do more to support the project? Henne: Novell is investing a lot in openSUSE. Nearly the whole technical infrastructure of the project is taken care of by Novell. Novell is also putting a lot of manpower and money into the project. But of course we could always use more, the project is unsatisfiable in that regard. So is Novell adequately supporting openSUSE? Yes. Could Novell do more? Definitely. Should Novell do more: Yes please. 
Bryen: One of the things that attracted me to the openSUSE Community was the active participation of Novell developers within the project. They continue to make themselves accessible and over time, they have given the reins to other people from the community and empowering us all to do more for the project. While I think Novell could stand to do a bit more promoting of openSUSE to the general public, I think they've done a fine job with Community Manager Joe Brockmeier. In many ways, I think it is premature to determine whether Novell *needs* to do more. It's only been a week since the polls closed for our new Board and we've had a good outcome of voter turnout at 75%. This makes it the first time that our Board has a community-backed mandate and a strong one at that. So it isn't a question of whether they've done enough thus far, but more a question of what more will they do now that Novell sees how strongly vested Community members are in the project as stakeholder. Pascal: But this doesn't mean it's one-way. The non-Novell contributors in the community also do a lot, most of them during their free time, as in almost any FOSS project. The community is very important to Novell, and I believe that the relationship should be seen as being equal partners. Michael: Could Novell do more? Of course as everybody could do more. But would it make sense? I doubt it. Rather than having even more support by Novell I'd love to see more sponsors stepping up to base the openSUSE project on broader shoulders and loose the dependency on one sponsor. LWN: Does Novell exert too much control over the project? Where are the areas where Novell could allow more community control? Bryen: The only time I've ever really seen any resistance from Novell is when they are unable to provide adequate manpower and support for a particular feature or request. But beyond that, there seems to be great transparency by the teams within openSUSE. 
If we were to go down the path towards greater community control, it is more about whether there is adequate manpower within the community to provide the support it needs. I don't think Novell is resistant to that at this time, but ultimately, it is about increasing membership where we can all work together seamlessly. Pascal: I don't believe that Novell exerts too much control over the project. It's rather the opposite. It is important to understand that this is an evolving process, where we started from (more or less) everything closed except for a few when S.u.S.E. GmbH was acquired by Novell to the point where Novell pushed for opening up many things around openSUSE when it launched opensuse.org three years ago. More and more domains and teams have opened up towards the community, and the community has grown with it. Right now, we're rather in a position where Novell is actually looking for more contributors from the community, with existing open processes (open discussions on our mailing-lists, source code available in public Subversion repositories, etc...). There are still a few areas where we're still at the beginning of opening up (actually rather at the point of starting to think about ways to do that properly), such as having non-Novell employees co-maintain core distribution packages or the openSUSE reference guide. As said, it's in flux, and certain things take time, but Novell definitely hasn't been standing in the way, quite the contrary. And while Novell ultimately has the most resources in certain or even most areas of the project, especially in building the distribution and providing security maintenance during the openSUSE release lifetime, there is always room for discussion. One notable example was the thread about whether KDE 3 should be removed from openSUSE 11.1, as KDE 4 is where almost all KDE developers put their efforts in. At first, Novell's product management position was to drop KDE 3 because it would mean supporting it for 2 years. 
But the discussion lead to a compromise, as many believe that KDE 4 wasn't quite ready enough. In the end, openSUSE 11.1 will still have KDE 3, but KDE 3 will be dropped in 11.2. The KDE3 maintenance during the lifetime of openSUSE 11.1 will be taken care of by Novell employees. So, again, while Novell commits most of the resources, the opinion of the community is important to them, for obvious reasons. Michael: With regards to more community control I think we (the project) need to define clearer rules how to contribute and I'd love to see co-maintainership for more and more packages, long term even core packages. LWN: Is the current board, with 5 members (2 Novell + 3 non), a good size and does it achieve the right balance of corporate vs. community? Pascal: I believe that 5 individuals form a good team size to effectively get things done. The background of each member also happens to strike an interesting, diverse and good balance of opinions and influences both from within Novell employees as well as from the non-Novell employees in the community. This is clearly very healthy and can only lead to better representation of the community's opinions. I can also imagine that at some point, that differentiation between Novell employee and non-Novell employee Board position shall be removed. Remember that it is our first elected Board. We take some decision and define some processes because we believe that they offer a good balance. Actually, the idea behind that separation was to make sure that there were two seats occupied by non-Novell employees (and not less). Bryen: What is important is that we ensure adequate representation of the community and that the community is heard. As the Community grows, we might have to revisit the size of the Board and consider adding adequate representation. Henne: I also have the feeling that on the topic of Novell and openSUSE there is a big misconception. There really is no versus in that relationship. 
In fact the openSUSE community consists of people that support the openSUSE project and its goals. Some of those people are employed by Novell, some of them are not. So it's not "corporate vs. community" but rather "community equals corporate and non-corporate". Also, the openSUSE Board isn't the dictator of this project. The project consists of many many different areas where people lead and make decisions on a daily basis. Some of those people are employed by Novell, some of them are not. So to execute control over this Board does not give you control over the openSUSE project. Just over the openSUSE Board. We understand that this is a controversial topic. But you shouldn't get too theoretical while reflecting upon it. It is very tempting but this is a total non-issue in reality. We are not trying to find the best theoretical possibility to govern a project. We are hackers that try to get things done. LWN: In the openSUSE election members were able to name another contributor (non-member) to be a voter. Would you like to see this continue in future elections? Do you know if there were many non-members who voted in this election? Henne: I think we all agree with our election officials that this was a special rule for the first election (see Board Election page). So this will not continue for future elections. There were 25 non-members eligible. How many of them voted is not public. Bryen: The non-member voters really represented a small fraction of the electors as a whole. 25 out of 237. There were certainly mixed feelings across the board about the idea of franchising votes, but the idea was noble by the election committee to find ways to further identify potential members and increase membership. Since one of the Board's mandates is to grow the Community, I don't see the need for franchising votes in upcoming elections. By the time the next election comes along, we should have done significant work in reaching out to more new potential members. 
LWN: How do regular users get more involved? How can they contribute to both the technical work and the decision making? Henne: As with any other open source project: Be there or be square! Seriously, it's as simple as showing up and participating. That's one of the beauties of free and open source projects. To contribute to our user support just subscribe to one of our mailing lists, join one of our IRC channels or login to our forum and help people as best as you can; as described here. To contribute to our documentation go and help with organizing and authoring in our Wiki. To help to translate the openSUSE distribution into your language join one of our various translation teams and translate strings to your native language. To contribute to the distribution use the openSUSE Build service. And those are just the four main entrances into our project. You can have it as specialized as helping openSUSE HAM users to transmit via radio, spinning your own version of openSUSE for educational use or creating artwork to be included in the distribution. The same holds true for decision making. All of the different areas make their own decisions. So to influence decisions you go there, participate and voice your opinion. Decisions that concern the overall project are always discussed at the opensuse-project mailing list or are a topic in the bi-weekly opensuse-project meeting. So you just go there and do the same. It is really as simple as that. Bryen: I think my personal story is the best example of all. Of all the members on the Board, I'm probably the least technical. I'm more of an active user than anything else. How did I get involved? I attended meetings on IRC, advocated a11y, got involved in education by forming the openSUSE Helping Hands project and as co-editor of the openSUSE-Tutorials.com web site. Users can become members through advocacy, promoting openSUSE regularly at local events, getting online and providing support to new users in IRC and the forums. 
It's really not that difficult to become a member and you don't have to be a technical genius to become a member. Pascal: I'd like to add that while a number of things are in place, we clearly have to work even further on lowering the barriers for people who are willing to contribute. Even better tools, better documentation, even more translations. As always, and as for any other FOSS project, there is always room for improvement. Michael: I can just support Pascal's opinion that it would be beneficial for us to lower barriers and provide a clearer description of what and how everybody can contribute and present our existing tools better. In particular, we could implement some cross-functionality, such as better integration of our Bugzilla and the openSUSE build service. LWN: Do you see areas of collaboration with other distributions either currently or in the future? Henne: Collaboration is our foundation. We would cease to exist if we didn't collaborate with everyone else in the free and open source community. Among them are, of course, also other distributions. Whether openSUSE project members hack with members of the Debian or Slackware projects on some upstream project like the Linux kernel, or that we coordinate when it comes to security issues with everyone else on vendor-sec, or that we try to consolidate tools we use like we do with Fedora on smolts.org. There are of course also the big collaboration projects we support like the Linux Standard Base or freedesktop.org. So we are collaborating heavily with other distributions already and we will continue to do so in the future. Bryen: Generally speaking, collaboration is an integrated feature of open source, so yes, collaboration exists across the board between openSUSE and other distributions. Pascal: I'd like to see that going even further. As one of the organizers of the FOSDEM conference, one of our primary goals there is to foster cross-pollination between projects with similar goals and domains. 
While difference and choice are some of the key features of FOSS (yes, I believe that having many distributions is very healthy), there are situations where working together on a few things makes sense. There isn't always a point in reinventing the wheel. Sharing development efforts and commoditizing tools for contributors is clearly something I'd like to see happening more often. While openSUSE is definitely a brilliant distribution in many regards and while we have a healthy community that consists of great people, we still have a lot to do, and so do the other distribution projects. They're not less good nor less deserving, we all have our strengths and weaknesses, so we should work together to make Linux and FOSS a better experience to everyone. If you're feeling at home with another distribution, great, contribute there! And if you think you'd like to contribute to our distribution or to our community, you are definitely welcome here too. And if you believe there are domains where we can work together, please get in touch with us at board -AT- opensuse.org. Editor's note: We would like to thank the openSUSE Board for taking the time to answer our questions. Readers may notice that not all board members have answered every question. This was their choice and not due to any censorship on our part. The end of the road for Firefox 2 By some accounts, the Firefox browser is now responsible for a full 20% of web traffic. As the number of Firefox users grows, so does the need for top-quality support; 20% makes for a large number of potential attack points. So it is interesting to note that Mozilla is now planning to end Firefox 2 support in the near future, perhaps before the end of the year. This change could leave a lot of users - and not just Firefox users - in a difficult position. One obvious question to ask would be: have most Firefox users moved on to Firefox 3? 
Apparently, about two out of three users have made the change, but millions of users have yet to move away from the older browser. The Mozilla project would like to get as many of those users to switch before ending support; that, in turn, requires looking at why they haven't yet upgraded. There seem to be a few prominent reasons beyond sheer inertia: Some users have systems which are not supported by Firefox 3. Many of these, it seems, are running old versions of Windows - 9x or NT4. In these cases, the operating system itself has long since ceased to receive support, so it's not entirely clear that continuing to support the browser does a whole lot of good. Others are dependent on extensions which have not been ported to Firefox 3. While most actively-developed extensions were ported some time ago, it appears that there are quite a few extensions which, while still having significant numbers of users, have been abandoned by their developers. Zack Weinberg has suggested that the project could make an active effort to find new maintainers for those extensions, or even fix a few of them itself. The Firefox 3 experience is not problem-free for all users; there have been some complaints about printing on some systems, for example. Finding - and fixing - the remaining blockers is clearly an important thing for the Firefox developers to do. Somehow, ways will probably be found to coax most of these users into moving forward to a newer browser. Beyond doubt, though, some will be left behind, and some of those may learn the hard way what "unsupported" really means. But that will be true no matter how long Firefox 2 is supported; there's never a way to get all users to upgrade. Firefox is not different from any other application in this regard, with the sole exception that its user base is larger than most. There is another important aspect to this story, though: this decision will affect users well beyond those who use Firefox. 
The end of Firefox 2 support will also bring an end to support for the Gecko 1.8.1 platform. And this version of Gecko is used by several applications beyond Firefox, including Camino, SeaMonkey, Sunbird, Miro, Instantbird, and Thunderbird. All of these platforms currently use Gecko - the soon-to-be-discontinued version of Gecko - for HTML rendering. There is a fair amount of concern about Thunderbird in particular. This mail client was recently kicked out of the Mozilla nest to fend for itself. Thunderbird developers are working toward a Thunderbird 3 release (the third alpha release came out in mid-October) which will use a newer version of Gecko. But the 3.0 release is still several months away - some months after the end of Gecko 1.8.1 support. Naturally enough, the Thunderbird developers worry that their current users will be running in an unsupported mode; that does not strike them as the best start for their newly-independent project. The word from the Mozilla Foundation seems to be that the Gecko platform will continue to be supported, in some minimal fashion, for a while yet. According to Samuel Sidler: The triage and release team that currently works on Firefox and Thunderbird 2.0.0.x releases will continue to triage requests for Thunderbird 2.0.0.x and maintain its releases until six months after the release of Thunderbird 3. Note that this will mean that browser-specific security and stability bugs will likely be ignored/minused. We'll only be considering bugs that affect Thunderbird 2.0.0.x. So it seems that Thunderbird should be covered - as long as the people who decide whether bugs are "browser-specific" do their job properly. But experience has shown many times that it can be hard to understand the full implications of a given bug. It would not be all that surprising for one or more "browser-specific" bugs to turn out to be fully exploitable in Thunderbird. Beyond that, though, applications like SeaMonkey and Camino are browsers. 
Developers from those projects are, needless to say, concerned that their needs are not being taken into account. They are not attracted by the idea of shipping a browser based on a platform where browser-specific bugs are being ignored. Mozilla developers have tried to reassure these groups that the situation is not as bad as it seems, but how things will work for them is far from clear. The real answer was, perhaps, suggested by Samuel: The community can take over this branch, just as has been done for Gecko 1.8.0 (currently managed by Linux vendors) In other words, Mozilla would like to outsource the maintenance of this code to the community, and to distributors in particular. The good news is that this is free software, so this kind of extended maintenance is possible as long as the interest is there to do it. Gecko is a non-trivial body of software to maintain, but it should be possible for the various interested projects, along with distributors still shipping this code, to pool their effort and get the job done. In their spare time, perhaps, they can give some thought to how they might avoid getting caught in the same situation when Firefox 3 reaches the end of its supported life. The sad story of the em28xx driver Over the last year or two, the kernel development process has been changed in a deliberate attempt to make the addition of new drivers easier. It has become clear that out-of-tree drivers often do not get any better until they are merged; meanwhile, users want those drivers and distributors are shipping them. So it would seem that everybody's interests are served by getting those drivers into the mainline tree. Experience with drivers merged under this policy has generally been positive; once those drivers head for the mainline, they get more attention and tend to improve quickly. Given that, one might well wonder why Markus Rechberger's recently submitted "empia" driver series is encountering so much resistance. 
This driver works with a number of video acquisition devices based on Empia chips; many of those are not supported by the kernel now. As an Empia Technology employee, Markus has access to the relevant data sheets and is, thus, well placed to write a fully-functional driver. There are users who will attest that the drivers work, and that Markus provides good support for them. But, as things stand now, it would appear that this driver is not headed for the mainline. What we have here is a classic story of an impedance mismatch between a developer and the development community. In the process, this long story has helped to give the Video4Linux development community a bit of a reputation as a dysfunctional family - a perception which those developers are only now beginning to overcome. The sad truth would seem to be that, while working with the community is something that a couple thousand developers do with little trouble every year, there will always be a few who have difficulties. A quick review of some of the history is in order here. Markus was one of the authors of the original em28xx driver, first merged for the 2.6.15 kernel. His efforts to enhance that driver quickly ran into trouble, though, when he tried to make substantial changes to the low-level tuner interface - changes which affected a number of other drivers. These changes were not popular in the Video4Linux community, and there were fears that they could break unrelated drivers. So this code was not merged. In response to this rejection, Markus claimed ownership of the em28xx driver and asked that it be removed from the mainline kernel. He then continued development of the code, hosting it on his own server. There was even a period where the code was relicensed to the MPL, apparently as part of an attempt to prevent it from being taken into the mainline. Eventually, Markus came back with a new approach which moved much of the tuner code into user space. 
That solution, too, failed to pass review; nobody else could really see much advantage in moving that much driver code out of the kernel. The fact that Markus clearly intended to have some of that code appear in the form of binary-only blobs did not help his case. So the user-space approach, like its predecessor, was not merged. While Markus was working on his own version of the code, others were putting patches into the mainline em28xx driver. At times, Markus tried to block those changes. The tone of the discussion is, perhaps, best seen from this note sent to Video4Linux maintainer Mauro Carvalho Chehab: Best would be to replace you as a maintainer since you don't have any respect of others work either. Companies should be aware that if they try to submit any code to you they will loose the authority over _their_ work. Of course, losing "authority" over code is inherent in releasing that code under a license like the GPL. This attempt to exercise control over freely-licensed code was slapped down by Andrew Morton and others, but it left unpleasant memories behind. Now Markus is back with a driver that, to all appearances, duplicates the functionality of a driver which is already in the mainline kernel. It is not hard to see this submission as an attempt to retake control of that driver and, perhaps, restart the discussions from past years. So it is not entirely surprising that this driver has not been received with a great deal of enthusiasm. In short, Markus has been told to go away until he is prepared to submit his work in the form of a series of small patches to the in-tree em28xx driver. The advantages of improving the current driver, rather than duplicating some of its functionality in a new code base, are clear. It would avoid the confusion which can come from having two drivers for the same hardware in the tree, and it would minimize the risk of losing important fixes which have been applied to the in-tree code. 
This is, also, the way that kernel developers are normally expected to do their work. On the other hand, video developer Hans Verkuil reviewed the new driver and concluded: In my opinion it's pretty much hopeless trying to convert the current em28xx driver into what you have. It's a huge amount of work that no one wants to do and (in this case) with very little benefit. This review notwithstanding, Mauro has indicated that he is not interested in accepting this patch. But rejecting Markus's new driver out of hand might just be a mistake. There seems to be little doubt that it has developed well beyond the in-tree driver; it supports a wider range of devices. Failure to merge it risks losing the work that has been done, and, perhaps, losing the future work of a developer who, for all his faults, is clearly trying to provide a better experience for Video4Linux users. Having multiple drivers for the same hardware in the kernel is not an ideal situation, but it is also not without precedent. The IDE and parallel ATA subsystems provide redundant support for a wide range of hardware. The e1000 and e1000e drivers had overlapping coverage for some time. In such cases, the long-term goal is usually to work toward the removal of one of the drivers. So one could make the case for merging the new driver and, eventually, removing the older one. In the process, the new driver could receive some much-needed attention from other developers. It has coding style and copyright attribution problems; a quick review has also left your editor wondering about locking issues. But such problems are common to drivers which have spent a lot of time out of tree; they are simply something to fix. Meanwhile, this driver contains the result of years of work and access to the relevant data sheets; freezing it out may not be in the best interests of kernel developers or users. 
Tracking of testers and bug reporters - a status report A recurring topic at kernel summits is proper recognition for users who report bugs and test fixes. These people help the development process considerably, but they are far less visible than the developers who are creating those bugs in the first place. Since we would like to have more testers and reporters, it makes sense to reward them in whatever way we can. One of the strongest currencies we hold is credit for work done. So it stands to reason that crediting those who help the development process is in the interest of everybody involved. One mechanism developed for this purpose is a set of tags applied to patches before they are merged into the mainline. When a patch fixes a bug, the user(s) who reported that bug should be credited through the addition of a Reported-by: tag. Similarly, testers are credited with the Tested-by: tag. As it happens, some developers have adopted the habit of using Reported-and-tested-by: as a way of saving valuable newlines in the common case where a user fills both roles. There is a certain warm feeling that comes with having one's name stored in a changelog entry in the kernel source repository. But the amount of visibility which comes from this event is relatively small. So your editor decided to hack up his git data mining utility to track these tags. Without further ado, here are the top problem reporters and patch testers for the 2.6.27 development cycle: All told, there were a total of 205 Reported-by: and 153 Tested-by: credits entered during the 2.6.27 kernel cycle. This is arguably a reasonable start for a new tag, but it seems clear that a lot of problem reporters are not, yet, being credited in this manner. Your editor became curious to see just who is taking the time to credit these people; they, too, deserve some credit. A bit more script hacking yielded these tables: The end result: Adrian Bunk gave over 20% of the total bug reporting credits - to himself. 
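The tag counting described above can be approximated in a few lines. The following is a minimal sketch, not your editor's actual utility: it assumes the changelog text has already been obtained (for example from something like "git log v2.6.26..v2.6.27" in a kernel tree), and the function name count_credits() is invented for illustration. The combined Reported-and-tested-by: form is counted once in each column.

```python
from collections import Counter

# Tags of interest, lower-cased. The combined tag credits both roles.
REPORT_TAGS = {"reported-by", "reported-and-tested-by"}
TEST_TAGS = {"tested-by", "reported-and-tested-by"}


def count_credits(log_text):
    """Count reporter and tester credits in changelog text.

    Hypothetical helper for illustration: returns a pair of Counters
    (reported, tested) keyed by the credited person's name, with any
    "<email>" part stripped off.
    """
    reported = Counter()
    tested = Counter()
    for line in log_text.splitlines():
        tag, sep, who = line.strip().partition(":")
        if not sep:
            continue  # no colon, so this cannot be a tag line
        tag = tag.lower()
        name = who.split("<")[0].strip()
        if not name:
            continue
        if tag in REPORT_TAGS:
            reported[name] += 1
        if tag in TEST_TAGS:
            tested[name] += 1
    return reported, tested
```

Calling reported.most_common(10) on the result would give a "top reporters" list of the kind discussed here; a real tool would also need to cope with inconsistent spellings of names and addresses, which this sketch ignores.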
Beyond that, a number of the core developers are taking at least some time to credit those who report bugs and test patches. But, in the end, the 10,628 changesets merged for 2.6.27 probably contained quite a few more patches which could have carried such tags. If the reporting and testing tags are to become truly useful and significant, they will have to be more universally used. While your editor was at it, he also collected statistics for Reviewed-by: tags. These tags differ in that they are offered by the reviewer, who thereby states that a reasonably thorough review has been done and the code has not been found seriously wanting. Code review is perennially in short supply in just about any free software project, so, again, proper credit for reviewers seems like more than just a good idea. Here are the top 2.6.27 credited reviewers: If these numbers are to be believed, only 123 reviews were performed over the 2.6.27 development cycle. Even the most cynical observer is likely to agree that a bit more reviewing than that is going on. Most reviewers do not offer the associated tag, so their contribution goes unrecorded. In particular, Andrew Morton, who seems to review almost every patch which appears, should be at the top of the above list. Clearly, the task of ensuring proper credit for testers, bug reporters, and reviewers is still in its initial stages. But one has to start somewhere; this is more information than we had before. Hopefully, over time, the habit of crediting those who help with the development process will become more widespread. And that, with luck, will encourage more testing and bug reporting and, as a result, a better kernel. NLUUG/ELCE: Embedded devices and free software On successive days, Harald Welte and David Woodhouse gave different views of the relationship between embedded companies and the free software communities whose code the companies are increasingly using. 
Their outlooks were not contradictory, but instead complementary; each came at the topic from a different direction. Welte looked mostly at what companies, particularly chip vendors, could do better, while Woodhouse looked at what things the community could do to improve. Welte and Woodhouse spoke at the co-located NLUUG autumn Mobility conference and Embedded Linux Conference Europe in Ede, the Netherlands, November 6 and 7. The Congrescentrum De Reehorst facility was excellent, well-suited to an event of this type, which is not surprising as NLUUG has been holding two events there each year for the last ten years or so. In addition, the conference was well-organized and run, clearly displaying the experience that comes from the 26 years that NLUUG has been in existence. [ The following covers Welte's presentation; Woodhouse's talk will be covered in a subsequent article. ] Welte kicked things off on Thursday with a talk entitled "How chipmakers should (not) support free software". As the conference got a bit of a late start and was already 15 minutes behind at that point, Welte said that he would make the time up because "everyone can understand gzip compressed speech". More seriously, he outlined his experience as a member of the Linux community, embedded developer, chip manufacturer from his recent work with Via, as well as a customer of consumer-grade embedded devices for gpl-violations.org; all of which result in multiple relevant points of view. Linux is being found in more and more devices today, some less than obvious. Welte listed fairly well-known things like mobile phones and in-flight entertainment systems, but then noted that there are DSL Access Multiplexers (DSLAMs), payphones, ATMs, as well as vending and exercise machines that also run Linux. Vendors of those devices are using free and open source software (FOSS) because of its strengths, which he outlined. 
There is a great deal of innovative and creative development done in FOSS because the barriers to entry are fairly low: the codebase is easy to read—at least in comparison to closed source—and there are standard development tools that are freely available. Because development is done in the open, developers will be embarrassed if their software architecture or code is bad. This also results in better security because of the code review that takes place. The outcome of using FOSS this way is that "we should have a perfect world" with tons of embedded products, all secure and maintainable, that allow for additional or alternate functionality via third parties. The first of those, many embedded products, has been achieved, but we are still waiting for the other two, Welte said. He contrasted a user's experience with Linux on PCs today with the experience provided by most embedded devices. For PCs, you can download the kernel, build it and it will run, with most hardware supported. You can choose from multiple distributions, any of which will have a kernel close to that of a mainline kernel and provide regular security updates. These are "things we are used to for many years", but things are not that way in the embedded space. In the embedded world, every CPU or system-on-a-chip (SoC) has its own kernel tree, typically based on some ancient version of the kernel, that never gets cleaned up or submitted for mainline inclusion. So, they get no benefit from new features or security fixes in the kernel. There are no distributions to choose from, either for users or board makers and, even if updates are generated, there is generally no packaging system to use to update the code; re-flashing the entire device is required. In Welte's words, "this sucks!" The embedded vendors get unstable and unmaintainable software with "security nightmares" and no innovation from elsewhere. 
The vendors have kernels that have diverged so far from the mainline that new features or fixes can't be backported, nor can their kernels get merged upstream. This is because the vendors tend to be very short-sighted, only focusing on getting one particular device out the door. From Welte's perspective, embedded vendors do not understand the real potential of FOSS. They do not think in terms of creating platforms that others can build atop. In general, "they would rather sell a new [device] rather than improve the existing one". So, the vendors compete on the basis of the features their proprietary competitors implement rather than figuring out how to take advantage of the true strengths of FOSS. If, instead, they used FOSS to its fullest, they could outcompete the proprietary vendors in ways that could not be matched—except by using FOSS. Turning to the chip vendors, Welte points out that there are two types of customers: Linux-aware and Linux-unaware. The Linux-aware customers—whose numbers are growing—will seek out vendors whose Linux support is better. It is already relatively late in the game: "if you don't have proper FOSS support, you will lose the 'openness competition'". Chip manufacturers should be engaging in "sustainable development" by releasing kernels developed against the mainline in cooperation with the community. One large mistake these vendors make is to think their customers are only the tier-one companies that buy chips directly. There are many more downstream users of a chip once it has been integrated into other hardware; the buyers of those devices are also important as they will determine the success or failure of the product. Unsurprisingly, Welte recommends that the development be done in the open, with a public development tree. Releases should not just be stable snapshots or big code drops; "post early, post often" should be the governing principle. 
FOSS is not just a technology, as chip vendors tend to think, it is a research and development philosophy that needs to be integrated into both the internal and external processes of the chip vendor. On the external side, making documentation available, without a non-disclosure agreement (NDA)—or at worst a FOSS-friendly NDA—is essential. Internally, there is normally quite a bit of learning required to understand the FOSS philosophy. This will require training for engineers as well as product management folks. Having a clear FOSS support strategy, with clear goals, is important for making it work. Product management needs to understand that supporting Linux is mostly a process of understanding the development model. The Linux APIs are not a particularly big hurdle, but understanding the community and how to work within it can be. Supporting Linux should mean supporting the mainline, not just N distributions, as N will grow over time, which leads to more problems. It is important to recognize that Linux-aware customers care as much about the quality of the code as they do about price and performance. Engineering management needs to encourage engineers to communicate with the community, which requires real internet access. When faced with adding functionality to some FOSS code, they should be looking at ways to cooperate with others who have similar needs, rather than reinventing the wheel. Engineers need to figure out how and where to ask the right kinds of questions. They also need to learn that code is written to be read, not just executed; "this is something new to many people". The community also has responsibilities to help the chip makers by providing "non-partisan" documentation because these manufacturers often have "no clue where to start or who to talk to" when they start considering supporting Linux. Commercial embedded distributors have a different perspective from the community so documentation from the community viewpoint is required. 
Welte says that various Linux Foundation sponsored efforts are helping in this area, but more needs to be done. A mentoring program of some sort might help by having FOSS developers willing to work with engineers to walk them through the process of getting their code upstream. The community must also work to keep from scaring chip vendor engineers away by being overly rude or terse; it is important that valid criticism be fully explained. Welte sees a number of current or looming problems for chip vendors in supporting Linux, mostly involving patents or technology licensing issues. Various licensing regimes (like those for MPEG or Sony's memory stick) impose requirements that essentially preclude the development of free software drivers to talk to devices that implement those technologies. Everyone in the industry has these problems, though, so Welte suggests that they band together to present a case to the license holders; with enough smaller players working together, their voice can be heard. On the whole, Welte is somewhat pessimistic about where embedded devices are headed. He certainly sees more FOSS being used in devices in the future, but expects to see them still be restricted so that they cannot leverage the full potential of FOSS. He does see "some very dim light at the end of a very far tunnel" with projects like Openmoko, but also efforts by some chip vendors, notably Intel, to fully support Linux. It was not that many years ago when the desktop Linux situation looked as bleak as the embedded space does today, so there is hope. Presentations like Welte's can only help to bring that about. The audience contained many embedded developers; hopefully they can help their companies' management see the benefits that Welte outlines so that his perfect world comes about sooner. If the desktop is any guide, though, it will come about eventually. /dev/ksm: dynamic memory sharing The kernel generally goes out of its way to share identical memory pages between processes. 
Program text is always shared, for example. But writable pages will also be shared between processes when the kernel knows that the contents of the memory are the same for all processes involved. When a process calls fork(), all writable pages are turned into copy-on-write (COW) pages and shared between the parent and child. As long as neither process modifies the contents of any given page, that sharing can continue, with a corresponding reduction in memory use. Copy-on-write with fork() works because the kernel knows that each process expects to find the same contents in those pages. When the kernel lacks that knowledge, though, it will generally be unable to arrange sharing of identical pages. One might not think that this would ordinarily be a problem, but the KVM developers have come up with a couple of situations where this kind of sharing opportunity might come about. Your editor cannot resist this case proposed by Avi Kivity: Consider the typical multiuser gnome minicomputer with all 150 users reading lwn.net at the same time instead of working. You could share the firefox rendered page cache, reducing memory utilization drastically. Beyond such typical systems, though, consider the case of a host running a number of virtualized guests. Those guests will not share a process-tree relationship which makes the sharing of pages between them easy, but they may well be using a substantial portion of their memory to hold identical contents. If that host could find a way to force the sharing of pages with identical contents, it should be able to make much better use of its memory and, as a result, run more guests. This is the kind of thing which gets the attention of virtualization developers. So the hackers at Red Hat, formerly Qumranet (Izik Eidus, Andrea Arcangeli, and Chris Wright in particular), have put together a mechanism to make that kind of sharing happen. The resulting code, called KSM, was recently posted for wider review.
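The idea of forcing pages with identical contents to be shared can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the KSM code: the 4096-byte page size, the function name merge_identical_pages(), and the use of Python byte strings as "pages" are all made up for illustration. A hash is used to find candidate duplicates, and a full byte-for-byte comparison confirms the match before two pages are pointed at the same frame.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this toy model


def merge_identical_pages(pages):
    """Return (frames, mapping): the distinct page contents that must be
    kept, and for each input page the index of the frame it now uses."""
    frames = []    # distinct page contents actually kept in "memory"
    by_hash = {}   # SHA-1 digest -> indexes of frames with that digest
    mapping = []
    for page in pages:
        digest = hashlib.sha1(page).digest()
        target = None
        for idx in by_hash.get(digest, []):
            # A matching hash is only a hint; compare the full contents
            # before deciding that two pages can share a frame.
            if frames[idx] == page:
                target = idx
                break
        if target is None:
            frames.append(page)
            target = len(frames) - 1
            by_hash.setdefault(digest, []).append(target)
        mapping.append(target)
    return frames, mapping


# Two hypothetical guests, each holding some zeroed and some identical pages.
zero = bytes(PAGE_SIZE)
guest_a = [zero, b"A" * PAGE_SIZE, zero]
guest_b = [zero, b"B" * PAGE_SIZE, b"A" * PAGE_SIZE]
all_pages = guest_a + guest_b

frames, mapping = merge_identical_pages(all_pages)
print(len(all_pages), "pages stored in", len(frames), "frames")
```

A real implementation would also have to mark the shared frame copy-on-write, so that a write by one owner breaks the sharing for that owner only; the toy model leaves that part out.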
KSM takes the form of a device driver for a single, virtual device: /dev/ksm. A process which wants to take part in the page sharing regime can open that device and register (with an ioctl() call) a portion of its address space with the KSM driver. Once the page sharing mechanism is turned on (via another ioctl()), the kernel will start looking for pages to share. The algorithm is relatively simple. The KSM driver, inside a kernel thread, picks one of the memory regions registered with it and starts scanning over it. For each page which is resident in memory, KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical, all processes with a reference to the scanned page will be pointed (in COW mode) to the other one, and the redundant page will be returned to the system. As long as nobody modifies the page, the sharing can continue; once a write operation happens, the page will be copied and the sharing will end. The kernel thread will scan up to a maximum number of pages before going to sleep for a while. Both the number of pages to scan and the sleep period are passed in as parameters to the ioctl() call which starts scanning. A user-space control process can also pause scanning via another ioctl() call. The initial response to the patch from Andrew Morton was not entirely enthusiastic: The whole approach seems wrong to me. The kernel lost track of these pages and then we run around post-facto trying to fix that up again. Please explain (for the changelog) why the kernel cannot get this right via the usual sharing, refcounting and COWing approaches. The answer from Avi Kivity was reasonably clear: For kvm, the kernel never knew those pages were shared. They are loaded from independent (possibly compressed and encrypted) disk images.
These images are different; but some pages happen to be the same because they came from the same installation media. Izik Eidus adds that, with this patch, a host running a bunch of Windows guests is able to overcommit its memory 300% without terribly ill effects. This technique, it seems, is especially effective with Windows guests: Windows apparently zeroes all freed memory, so each guest's list of free pages can be coalesced down to a single, shared page full of zeroes. What has not been done (or, at least, not posted) is any sort of benchmarking of the impact KSM has on a running system. The scanning, hashing, and comparing of pages will require some CPU time, and it is likely to have noticeable cache effects as well. If you are trying to run dozens of Windows guests, cache effects may well be relatively low on your list of problems. But that cost may be sufficient to prevent the more general use of KSM, even though systems which are not using virtualization at all may still have a lot of pages with identical contents. The Gumstix Overo - a miniature X Window System platform Attendees at this year's Kernel Summit were treated to an early prototype version of the Gumstix Overo miniature Linux-powered CPU board on top of the Overo Buddy motherboard. The system packs all of the functions of a desktop computer onto a platform that is slightly larger than a credit card. The specifications for the Overo processor board include: a 600 MHz Texas Instruments OMAP 3503 processor; 256 MB of DDR RAM; 256 MB of NAND Flash RAM; a microSD adapter slot with a 2.0 GB memory stick; WiFi and Bluetooth ports; a USB 2.0 port; stereo audio input and output ports; a port for driving a graphical LCD panel; and an assortment of analog and digital I/O ports. The Overo Buddy motherboard adds even more functionality, including a digital video (DVI) controller and two more USB ports.
Upon receiving the Overo Buddy board, the only way to establish a connection was via an emulated serial connection over one of the USB ports using the provided USB cable, as explained here. This worked as advertised; it was possible to watch the system boot up and then log into a root shell. At this point, your author decided to try the installation of the latest software on the removable microSD memory. As directed by the instructions, the software image was downloaded and installed on the memory using another machine and the provided microSD adapter card. Again, this proceeded without any problems and the machine booted with the new image. Running the full X environment required purchasing a USB hub, a USB keyboard and mouse, an assortment of USB cables and a Mini DVI to DVI adapter for the monitor connection. The Mini DVI adapter was a bit wide, and the strain relief around the Overo Buddy's power supply connector had to be clipped off to allow the two connectors to be plugged in at the same time. Getting the USB cabling right was a bit of a challenge. On the first attempt, the DVI monitor showed an X login window, but the keyboard and mouse were not active. Digging through the documentation revealed the source of the problem: the OTG USB port needed a type A cable and your author was using a type B cable. The Wikipedia USB documentation was consulted, and your author used a special surface mount soldering iron to create a tiny solder jumper between pins 4 and 5 of the Overo Buddy's micro-USB jack, simulating the correct cable. Upon booting, the keyboard and mouse came to life. When logging into the Overo's X Window System, one is presented with the simple but effective Enlightenment window manager. Applications include the typical collection of an X terminal, a file manager, a text editor (gpe_edit), the Midori web browser, a mail client, an instant messenger client, and a selection of four games.
Also included are the AbiWord word processor, the Gnumeric spreadsheet and basic audio record and play utilities. A large collection of GUI-based admin tools and window system configuration tools is available. Both ssh and scp are also installed on the system, so secure network connections are possible. Unfortunately, both the audio recorder and player froze up during basic tests, and their windows did not go away until the system was rebooted; this appears to be some kind of audio hardware issue. The next step to having a functioning system would be to have some kind of networking. The Overo processor has built-in 802.11 wireless networking and Bluetooth, but neither of those systems functioned. That is a known issue with some of the early-run prototype boards. One still has the option of adding USB WiFi and Ethernet boards to the Overo; several devices are supported natively. Once networking can be established, it should be possible to use the network-based applications, transfer user data, and add more application packages. Having so much functionality in something as tiny as the Overo Buddy board seems like an amazing technological feat. Gumstix has truly achieved a new milestone in the miniaturization of Linux systems. Production versions of this system are scheduled for release in the fourth quarter of 2008. Reinventing the Fedora desktop Now that Fedora 10 is nearing completion, it is time to start looking forward to the shape of Fedora 11. Matthias Clasen started a discussion with a post to the Fedora-desktop list, including a pointer to the whiteboard where people can fill in their ideas. The page contains some ideas guaranteed to warm an editor's heart and a few which inspire rather less enthusiasm. So what are the Fedora desktop people pondering? Some of the ideas include: Removing icons from the desktop menus. The reasoning behind this change would appear to be "Windows and OS X do it that way." Fixing up power management.
Among other things, those posting to the wiki note "When the user changes the brightness, he doesn't appreciate if the computer turns it right back down again"; better late than never. Better power management also involves turning off blinking cursors, which would also be a welcome change. "Better fonts" is on the list; that seems to translate to better and easier ways for users to install new fonts. There is some wondering about whether the current packaging system is really the best way to deal with fonts. The volume control has been singled out for special attention. One of its claimed problems is the vast number of sliders which can appear for a complex audio device; it is true that it can become overwhelming. But playing "find the hidden slider" when some audio source is inaudible is not a better state of affairs. There is also a worrisome note to the effect that Windows has a better volume control because it is not removable. So, in the future, we may have a volume control whether we want it or not. Replacing the panel altogether, along the lines of the ideas bashed out at the recent GNOME hackfest, is under consideration. This would, of course, be a major change to the desktop which would not be welcomed by all users. Somebody has noticed that the flurry of "notification" windows can get a little irritating. So different approaches to notifications are being considered. A new approach to system settings is also under consideration. The idea would be to get away from the "preferences" and "administration" menus in favor of a single window with a search feature. There is talk of better location awareness, but it appears to be limited to mundane tasks like setting the time zone automatically. It seems like it should be possible to set more ambitious goals in this area. The Fedora developers note that Ubuntu beat them to shipping a working "guest user" implementation. 
Surely they will now contribute to improving that implementation, rather than making their own...right? Evidently users should not be asked to distinguish between hibernating the system (which saves memory to disk and powers off) and suspending (which keeps main memory powered up). To avoid this problem, Fedora might implement a "hybrid suspend" which saves to disk but still keeps RAM energized for a fast restart. There are a number of practical problems to solve in this area, not the least of which being that waiting for a full hibernate when you want to suspend the system quickly can be obnoxious. Fast boot is, naturally, on the list. There is a lot more on the list - far more than the Fedora developers can hope to implement (or even integrate) in the near future. But the process is a good one, and some of these ideas will certainly show up in future Fedora releases. With any luck at all, the Linux desktop will continue to improve for a long time. NLUUG/ELCE: Embedded Linux and the community As one of two embedded maintainers for the Linux kernel, David Woodhouse is in an excellent position to see where the community is failing to keep up its end of the bargain. At the recent co-located NLUUG and Embedded Linux conferences, his keynote on the second day made it very clear what areas he sees that need improvement. We fairly regularly hear about things that companies should be doing—see the report on Harald Welte's first day keynote—but the community should certainly keep an eye on its behavior as well. In his presentation, Woodhouse notes multiple projects that are not upstreaming their changes; he also notes things that individuals could do to make Linux better. He started by pointing out that "it's not entirely clear what 'embedded' means", as there are many kinds of devices that have embedded attributes. 
Things like headless, handheld, low power, small size, limited RAM, or limited persistent storage tend to be a part of the description of embedded devices, but there is "no real definition that I'm aware of that makes any sense". Woodhouse then went on to see if he could define what an "embedded maintainer" is and does. He doesn't see the role as chasing patches to get them included upstream; it is more of an advocate role. Keeping an eye out for stupidity in the kernel using Bloatwatch and other tools, as well as encouraging people (in various companies as well as in different parts of the community) to work together on solutions to problems they have in common, are all part of the job. From Woodhouse's perspective, companies are "getting a lot better" in terms of their Linux support. Less promising is the community: "We suck, really". He looked at a number of community embedded projects (like OpenWrt, Maemo, Moblin, and OLPC) to see how well they work with upstream; what he found was rather discouraging. By looking at several concrete criteria, such as how many unsubmitted local kernel patches there were, how accessible their source is, and how old the kernel is that the project is using, Woodhouse is judging those projects the same way that companies are measured. Of the four projects that he looked at, only one, OLPC, was "mostly OK"; the rest varied from "less good" to "FAIL". Moblin, for example, had only 23 outstanding patches, but those were against kernel 2.6.24. OpenWrt had a better kernel version, 2.6.27, but had 160 outstanding patches, plus an extra 425 files weighing in at 125,000 lines of code, which prompted a "sorry!" from an OpenWrt developer in the audience. OLPC has just a few outstanding patches against 2.6.27.4, while Woodhouse couldn't even find the kernel source for Maemo. Getting work upstream is extremely important.
Running older kernels and backporting fixes and features may seem like it saves time, but "it never works in the long run, it's a false economy". Woodhouse listed the usual suspects as reasons to get things upstream: code review, compile testing, updates for kernel API changes, and automated bug checking. He also mentioned the Kernel Janitors, whose efforts are generally useful, even though they are "often a little misguided, sometimes they don't engage their brain before sending patches". All of these benefits only come from getting code into the mainline. The theme of the talk is summed up in one statement: "Divergence is pain". Any time that your code is not current with the most recent kernels or your patches are not making their way upstream, it should be felt as pain because diverging from upstream will end up causing exactly that. The pain may not be felt until later, but Woodhouse wants developers to recognize the problems caused by divergence so that they are averse to it right from the start. Looking at the reasons why code is hoarded is instructive, he says. One of the reasons that is often heard, along with Woodhouse's opinion of it, is summed up in a bullet point on one of his slides: "too hard to [struck out: write decent code] get code accepted". Another reason is that there is not enough time in the schedule for getting code merged. Many "see it as an extra part of the process after the driver is complete", which is the wrong way to look at it. Drivers and other features should be shared early on the appropriate mailing list so that any problems are dealt with near the beginning of development. An issue related to code quality is that many times drivers are developed for ancient versions of the kernel, but that really shouldn't be a barrier as any "decent code will port relatively easily". Sometimes there is resistance to changes by the upstream developers.
An example he noted was a feature that allowed multicast to be optionally removed from the IPv4 networking stack. It saved a fair amount of space for embedded devices that did not need that functionality, but David Miller and other networking developers were not very interested. This is where the embedded maintainer role can come into play, as Woodhouse can step in to try to help convince the upstream developers. Woodhouse had specific suggestions for making the situation better. "For a start, put everything in git trees" as it allows others to look at and test the code. Each feature should have its own topic tree that gets pulled into the main tree, and developers should regularly assess the outstanding code to determine if it is ready to be moved upstream. Working with the upstream developers, getting them involved, and getting them to care about the feature or driver is crucial. In cases where a logjam develops, call on Woodhouse or Andrew Morton; they "can't promise any miracles, but often it can help". Something that Woodhouse would like to see more developers do is to adopt a driver. There are countless drivers in SourceForge and elsewhere that are not upstream, so he suggests that folks "pick one driver, just tidy it up and make it acceptable upstream". Incidentally, Woodhouse is no fan of SourceForge: "I don't think I wrote 'don't use SourceForge' on any of the slides, but pretend that it's there". He mentioned the -staging tree as a possible destination for adopted drivers, though he is skeptical of the tree, "but it exists, we should see if we can get something from it". Woodhouse summed up his talk with a simple statement: "We need to work better as a community before we can point fingers at companies who don't play nicely". It is certainly true that the community needs to set a good example for companies to follow. By highlighting some of our failures, Woodhouse has done the community a great favor; we can and, with luck, will do better.
Fedora release cycles: longer or shorter? The Fedora 10 release is currently planned for November 25 - somewhat later than had been originally intended. Delays in Fedora releases are certainly not unheard-of, even when the project isn't coping with a major compromise of its fundamental infrastructure (the full story of which, it should be noted, still has not been told). Fedora 10 looks like it will be worth the wait, but the project is not waiting for the release to start thinking about its upcoming release cycles. A couple of discussions related to this topic provide some interesting insights into the pressures being felt by Fedora's leadership. A recent video review of Fedora 10 was seen by the project as being something other than entirely favorable. But the biggest complaint expressed by the project is on a different subject: credit for work which is done by Fedora developers. Quoting Fedora leader Paul Frields: Another point that had me scratching my head was the same host indicating that Fedora had a lot of features that were in Ubuntu 8.10. This is certainly true, but the differentiator is that many of these features were *built* by Fedora contributors, inside and outside Red Hat. It's important for us to keep emphasizing this fact. Subsequent discussion indicates that a number of Fedora developers feel that other distributions - Ubuntu in particular - are stealing Fedora's thunder by shipping Fedora-developed improvements first. This is not the first time this kind of concern has been raised; it has been asserted that Novell's behind-closed-doors XGL work was done that way to keep Ubuntu from shipping it first. Fedora does not appear to be considering pulling its development from public view - that would run counter to the project's open nature - but some other responses are being discussed. More than anything else, the Fedora project would like to ensure that the world knows about the work its developers are doing. 
Initiatives like the feature list for each release help to get information out ahead of the actual software release. There is also talk of more aggressive blogging, outreach to news sites, etc. The project has even posted a proposed marketing schedule which would help to ensure that all the right marketing activities are happening at the right points in the release cycle. Former Fedora leader Max Spevack had a different suggestion to offer: If "features" and "first" are hurting because of where we are in the calendar compared to the Ubuntu release, allowing them the chance to release their new distro first and to receive a lot of credit for new features when reviewers and press don't understand where the upstream work is being done (in Fedora, for example), then Fedora Marketing should ask the Fedora Board to think about altering our "May Day" and "Halloween" release targets by a little bit, so that Fedora's cycle finishes before Ubuntu's. This proposal brings to mind a vision of distributors racing to be the first to release, leading to ever-shorter cycles and a corresponding decrease in release quality. It is hard to imagine that the first mover has such an overwhelming marketing advantage; there must be a better way. It does not look like Fedora will attempt a "first post" counterattack anytime soon. In fact, if the recently-posted Fedora 11 release schedule proposal is adopted, the exact opposite will happen. In the past, Fedora has responded to a much-delayed release by shortening the following release cycle in an attempt to get back on schedule. For Fedora 11, it would appear that this will not happen; there will be no attempt to go for a "May Day" release. The reasoning against shortening the Fedora 11 cycle comes down to this: Fedora 11 will be extremely important to Red Hat Enterprise Linux (otherwise known as RHEL). RHEL 6 planning has looked to use Fedora 10 and Fedora 11 as releases to work out new technologies and features that are desired in RHEL 6. 
This includes a lot of upstream work that is being done, and targeted to land in these two releases. So a shortened Fedora 11 cycle would make it harder to get all of the changes planned for RHEL6 in. That's problematic for Red Hat, and, since Red Hat pays for much of Fedora's existence, Red Hat's problems become Fedora's problems. Beyond that, though, it seems that a number of core Red Hat engineers will be working on Fedora during the next cycle to help get RHEL6-targeted features into shape. If the next cycle is shorter, Fedora will get less attention from those developers. Fedora would like to avoid that situation and take advantage of the RHEL team's attention while it can. So the proposal is to retain the six-month cycle for Fedora 11 and release around the beginning of June. The Fedora 12 cycle, though, would be shortened to get the project back to the original schedule. The hope is that the advance notice will make it easier to plan for a short release cycle; Jesse Keating also suggests that the project "could even focus more on polish issues in F12 than large sweeping features." The more cynically-minded among us might conclude that Fedora 11 will be stuffed full of bleeding-edge new stuff that the RHEL team wants to evaluate, and Fedora 12 will be the release where all of that work is actually stabilized. But your editor would never want to be cynical. The initial response to the proposed schedule is almost entirely positive, so it seems likely that things will go that way. Some Fedora developers may feel that releasing behind Ubuntu gives the project a public relations disadvantage, but other concerns are seen as being more important. Since those "other concerns" can be seen as "take the time to focus a lot of work on pulling together new features for an upcoming stable release," this set of priorities seems hard to argue with. 
Storm botnet used to study spam Spam is a problem that all email users suffer from, but getting a handle on the economics of spamming has never been easy. A group of researchers has changed that to some extent by publishing a study [PDF] that looks at the conversion rate of spam emails. While the methods they used were somewhat ethically questionable, the data the study provides is quite useful and interesting. In the study, the Storm botnet's "command and control" (C&C) infrastructure was infiltrated in such a way that spam messages sent by Storm worker nodes would point the URLs in the spam at sites controlled by the researchers. By doing this, they could determine how much spam was sent and, more importantly, how much of it was clicked on. While sending spam is not very costly, it clearly does not have a zero cost. This means that, unbelievable though it sometimes seems, people actually do click through spam emails; not only that, they actually make purchases from the sites where they land. The researchers set up fake pharmacy sites (selling male enhancement products amongst other things) that would be reached via the spam links. To protect the spam "victims", a visitor to the site would be allowed to get to the checkout stage before being shown a site error. It seems plausible that nearly everyone willing to fill their shopping cart with such products and enter the checkout process is a very likely buyer. In this way, the study could count not only those who followed the links, but also those who were likely to buy. What they found was that of 350 million emails sent (they estimate 82 million were actually delivered), ten thousand recipients visited the site for a click-through rate of 0.003%. Of those, 28 users actually tried to check out with products totaling over $2700. The study was run for 26 days, so this could have resulted in roughly $100 per day of revenue. Also of interest were the campaigns that were run to test the propagation of the Storm malware.
This is normally done by sending spam that directs users to a website (via a "you have received a postcard" message) and entices them into clicking a link that will download and install the malware. The percentages of click-throughs were slightly higher (0.004-0.006%), but a rather large percentage of those (almost 10%) actually clicked the malware link once they reached the website. The researchers' version would download a benign executable, but the clear implication is that a small, but useful, number of users would actually add themselves to the botnet more-or-less voluntarily. While the study is quick to point out that it represents only one data point, there is some value in extrapolating what the botnet might be able to generate in terms of revenue: Different campaigns, using different tactics and marketing different products will undoubtedly produce different outcomes. Indeed, we caution strongly against researchers using the conversion rates we have measured for these Storm-based campaigns to justify assumptions in any other context. At the same time, it is tempting to speculate on what the numbers we have measured might mean. We succumb to this temptation below, with the understanding that few of our speculations can be empirically validated at this time. The conclusion is that something on the order of $7000-9500 per day could be generated, which equates to $2.5-3.5 million per year, a tidy sum by any measure. There is some additional speculation that because of the retail cost of sending spam (rumored to be something like $80 per million sent), it only makes sense that the Storm operators and the "pharmacies" are one and the same. The sites used for propagation of the Storm malware have similarities to those used by the shopping sites, which also indicates a close association between the two.
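The arithmetic behind the figures quoted above is easy to check. The short sketch below uses only the rounded numbers given in this article (so "ten thousand" visits and "over $2700" are treated as exactly 10,000 and $2700, which is an assumption); the variable names are made up for illustration. It also prices the whole 350-million-message campaign at the rumored $80 per million, a figure the article itself calls a rumor.

```python
# Figures as quoted in the article (rounded there, so results are approximate).
emails_sent = 350_000_000
emails_delivered = 82_000_000    # the researchers' estimate
site_visits = 10_000             # "ten thousand"
checkouts = 28
revenue = 2700.0                 # dollars, "over $2700"
days = 26

click_rate_sent = site_visits / emails_sent * 100            # percent of sent
click_rate_delivered = site_visits / emails_delivered * 100  # percent of delivered
revenue_per_day = revenue / days
checkouts_per_million_sent = checkouts / emails_sent * 1_000_000

# Rumored retail price of sending spam: $80 per million messages.
retail_cost = emails_sent / 1_000_000 * 80

# Annualizing the study's extrapolated daily revenue range.
low_per_year = 7000 * 365
high_per_year = 9500 * 365

print(f"click-through (of sent):      {click_rate_sent:.4f}%")
print(f"click-through (of delivered): {click_rate_delivered:.4f}%")
print(f"revenue per day:              ${revenue_per_day:.2f}")
print(f"checkouts per million sent:   {checkouts_per_million_sent:.2f}")
print(f"retail cost of the campaign:  ${retail_cost:,.0f}")
print(f"extrapolated yearly revenue:  ${low_per_year:,} to ${high_per_year:,}")
```

At the rumored retail price, sending the 350 million messages would cost about ten times the revenue they brought in, which is the reasoning behind the speculation that the botnet operators and the "pharmacies" are the same people.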
The study makes the following, perhaps overly optimistic, argument: If true, this hypothesis is heartening since it suggests that the third-party retail market for spam distribution has not grown large or efficient enough to produce competitive pricing and thus, that profitable spam campaigns require organizations that can assemble complete "soup-to-nuts" teams. Put another way, the profit margin for spam (at least for this one pharmacy campaign) may be meager enough that spammers must be sensitive to the details of how their campaigns are run and are economically susceptible to new defenses. The full paper is well worth a read for those interested in botnets or spam, but there are some ethical questions to consider as well. Is it reasonable to use other people's computers for your research without their consent? There is no easy answer to that question. The researchers outline their argument, which boils down to "we strictly reduce harm". Because they are just intercepting and modifying orders that are already being sent to workers, their research did not increase the amount of spam sent, nor did it increase the work that others' computers would do. Since the spam that they arrange to be sent is harmless—at least in terms of selling bogus medicine or propagating malware—they have actually reduced the number of harmful spams sent. While their arguments seem at least well-thought-out, it is not something that would be fun to try to explain to a judge bent on enforcing some of the poorly-thought-out computer crime statutes. The researchers seem confident that their methods will pass muster, though: "We have been careful to design experiments that we believe are both consistent with current U.S. legal doctrine and are fundamentally ethical as well." It is difficult to see how this kind of data could be gathered without co-opting Storm or another spam-sending botnet. 
From that standpoint, the researchers took the only path they could, and they certainly appear to have considered the legal and ethical landscape. While there may be a tendency to overestimate how widely applicable the data is (which the authors warn against), it does provide a nice look under the covers of the botnets delivering spam to one's inbox daily. The libferris virtual filesystem The Unix mantra "everything is a file" gives you great flexibility over where you store your data and how information is manipulated and replicated. Unfortunately, many things in Unix and Linux are not files, or at least not files that you would want to interact with directly. For example, a PostgreSQL database is ultimately stored in a collection of binary files, though you probably wouldn't want to interact with those files directly. Instead of storing settings in a collection of tiny files, many applications use XML to store settings in a single file, but then have to deal with parsing XML instead of just reading little files. libferris lets you mount both PostgreSQL and XML and provides you with a useful way to interact with the data contained in both as a virtual filesystem. Other operating systems like Plan 9 pushed the envelope further than Unix, making more things "just a file". Unfortunately, to use Plan 9 you had to abandon your trusty old Unix roots and jump to an entirely new operating system. I started the libferris virtual filesystem project back in 2001 to push the "everything is a file" concept further, all on a Linux base. Libferris is a virtual filesystem implemented as a shared library with FUSE bindings. Because FUSE is already in the Linux kernel, you don't have to do any kernel patching to use libferris. Because libferris is a shared library and not in the kernel, it can use other libraries to help it mount data sources like XML, relational databases and Emacs, to name a few.
And as an upshot of being out of kernel, I can work on letting libferris mount anything I like, no matter how strange it might be, without any third party approval. There are actually two ways to use libferris -- through a native C++ interface and using the normal Unix APIs with FUSE. The FUSE interface is very useful if you want to rsync(1) some structured information from an XML file into a PostgreSQL database. Just mount them both with FUSE and rsync away. A few other interesting things you can do with the FUSE interface are exposing data as a virtual office document, using XSLT stylesheets that libferris processes for you, and geotagging with Google Earth. The design of libferris revolves around two primitives: exposing file contents as C++ std::iostreams, and rich metadata support through an interface similar to Extended Attributes (EA) attr_get(3). Since then libferris has gained sophisticated support for indexing both the full text contents of files as well as their metadata. Libferris is written in C++ and aims to take full advantage of the language. Interfaces are designed to be as easy to pick up for C++ programmers as possible; for example, displaying a directory can be done using iterators, find(), begin() to end() etc. Both the types of things that libferris can provide as virtual filesystems and the metadata handling are done through a plugin interface. The handling of metadata is done through the Extended Attributes (EA) interface. This EA interface is also virtualized -- if you write an attribute to file:///foo/bar and the kernel filesystem supports extended attributes, then the value will be saved in a kernel level EA using attr_set(3). On the other hand, if file:///foo/bar happens to exist on a network filesystem that does not support EA, then your value is saved in RDF by libferris. In both cases the value can be read again using an identical interface.
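The virtualized attribute interface described above can be illustrated with a small sketch. This is not libferris code and does not use its API: it is a Python approximation that tries a kernel extended attribute first and, where the platform or filesystem refuses, falls back to a side store. The JSON sidecar file stands in for the RDF store, and the names `set_attr`, `get_attr` and the `.attrs.json` suffix are hypothetical, chosen only for this illustration.

```python
import json
import os
import tempfile


def _sidecar_path(path):
    # Hypothetical fallback location: a JSON file next to the target.
    return path + ".attrs.json"


def _load_sidecar(path):
    # Read the fallback store, treating a missing file as "no attributes".
    try:
        with open(_sidecar_path(path), encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def set_attr(path, name, value):
    """Store a string attribute and report which backend accepted it."""
    try:
        # Kernel extended attribute in the "user." namespace (Linux only).
        os.setxattr(path, "user." + name, value.encode("utf-8"))
        return "kernel"
    except (AttributeError, OSError):
        # No xattr support on this platform or filesystem: use the sidecar.
        attrs = _load_sidecar(path)
        attrs[name] = value
        with open(_sidecar_path(path), "w", encoding="utf-8") as f:
            json.dump(attrs, f)
        return "sidecar"


def get_attr(path, name, default=None):
    """Read an attribute back through the same interface for both backends."""
    try:
        return os.getxattr(path, "user." + name).decode("utf-8")
    except (AttributeError, OSError):
        return _load_sidecar(path).get(name, default)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bar")
        open(target, "w").close()
        backend = set_attr(target, "rating", "5")
        print(backend, get_attr(target, "rating"))
```

The caller never needs to know which backend was used; that is the property the text attributes to the libferris EA interface, reproduced here in miniature.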
Looking at filesystems in an abstract way -- a hierarchy of files, file contents, and metadata associated with files and directories as key-value pairs -- there is somewhat of a resemblance to the data model of XML. There are obvious differences, though: XML elements can have multiple text nodes as contents, an XML element does not need to have specific unique names for each child XML element and so on. In many cases it can be advantageous to smooth over the differences and view a filesystem as XML and vice versa. Over the years libferris has gained the ability to interact with its virtual filesystems as virtual Document Object Models (DOMs). The reverse is also true: you can take an xerces-c DOM and interact with it as a virtual filesystem. Using virtual DOMs makes it easy to create a view of a filesystem using a browser and XSLT. See xml.com for information on using XQuery against a libferris virtual filesystem. The ability to mount XML and Berkeley db4 data as filesystems has long been a part of libferris. If you want to store a filesystem inside a platform independent format, then using XML is great, whereas the speed of individual file lookup in a Berkeley db4 database holding a great many file records can come in handy. Each format has its advantages, but they are all just virtual filesystems as far as libferris is concerned. When a filesystem can offer what it likes through key-value pairs (EA) associated with files, relational databases can also be viewed as a virtual filesystem. Databases, views, tables and result sets become directories, tuples become files named by the value of their primary key, and the individual values of tuples are exposed as Extended Attributes on their tuple file. Again, PostgreSQL appears just like another virtual filesystem. For relational data there are a few caveats; for example, to create a new "file" in a table you must supply at least the primary key EA as well as any EA which are explicitly marked "not null" in the database.
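The relational mapping described above can be sketched in a few lines. The following is an illustration only, not libferris code: it swaps in Python's built-in SQLite module for PostgreSQL so that it runs without a server, and the `people` table, its columns, and the helper names `list_table` and `create_file` are all assumptions made for the example. A table is treated as a directory, each tuple as a file named by its primary key, and each column value as an attribute of that file.

```python
import sqlite3


def list_table(conn, table, key):
    """Return {file_name: {attribute: value}} for every tuple in a table."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    listing = {}
    for row in cur:
        attrs = dict(zip(columns, row))
        # The tuple's "file name" is the value of its primary key.
        listing[str(attrs[key])] = attrs
    return listing


def create_file(conn, table, attrs):
    """Create a new 'file' (tuple) from a dict of attribute values."""
    names = list(attrs)
    cols = ", ".join(f'"{n}"' for n in names)
    marks = ", ".join("?" for _ in names)
    conn.execute(
        f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
        [attrs[n] for n in names],
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people ("
        "login TEXT PRIMARY KEY NOT NULL, name TEXT NOT NULL, email TEXT)"
    )
    create_file(conn, "people", {"login": "alice", "name": "Alice"})
    print(list_table(conn, "people", "login"))
    try:
        # Omitting a NOT NULL attribute is rejected by the database itself.
        create_file(conn, "people", {"login": "bob"})
    except sqlite3.IntegrityError as err:
        print("refused:", err)
```

The second `create_file` call shows the caveat from the text: the database refuses a new "file" that lacks an attribute declared "not null".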
Libferris will automatically mount many filesystems for the user. For example, if you try to read an XML file as though it is a directory then libferris will implicitly mount it as one for you. This does blur the lines between what is a directory and what is a file in the system. There is some additional metadata that libferris makes available if you would like to avoid the automatic mounting. For example, if you wish not to descend into XML files then read the is-file metadata and if it is true do not attempt to descend into the file. One of the motivations for creating libferris as a project of its own was to be able to expose anything that I felt could be interacted with in an interesting manner as a filesystem as one. So libferris can mount some things that folks might not think of as filesystems -- including Firefox, Emacs, DBus, LDAP, Evolution, Amarok, klipper, xmms, X Window System and gphoto2. The metadata plugins for libferris currently support extracting information from file formats automatically, for example, EXIF, XMP and ID3 tags. Metadata overlays are also supported, so you can see what tags you have associated with an image in f-spot through extended attributes in libferris. I use the term overlays because a central repository of tag data (in this case from f-spot) is scattered over an entire filesystem in libferris. The lower level metadata plugins handle more standard extended attributes usage, for example using attr_set(3) to store values or saving them in RDF. Many of the standard utilities have been rewritten to use the native libferris API and take advantage of extra features it offers. Things like ls, cp, mv, rm, cat, io-redirection, touch, head and tail all have native libferris versions which are shipped with the main tarball. These all also serve as code samples for how to use the libferris API. 
Extensions to the normal clients include the ability to output directory listings in XML for ferrisls, ferriscp has the ability to use memory mapped IO as well as the more standard open(), read() and write() calls to perform the copy. Using memory mapped IO this way also uses the madvise(2) MADV_SEQUENTIAL call to let the kernel correctly select caching policy. The indexing support in libferris is also handled using plugins. Two different indexing plugin types exist; full text and metadata. There are two types of plugin, because the strategy for how to create an index can be quite different depending on if you are performing a search for some words in a document text or if you wish to find files with certain metadata values. Using inverted files can be great for resolving a ranked full text query for "alice wonderland" but finding all files in either my home directory or /pictures that have been modified in December 2008 can be solved in a number of ways. There are currently indexing plugins for CLucene, Lucene, LDAP, Federations of other libferris indexes, ODBC, PostgreSQL, Redland (RDF), Xapian, Beagle, Strigi and some custom designs. There are likely to be more index plugins explicitly designed to work on NAND Flash in the future. Those interested in indexing and libferris should see this article. A major advantage of closely combining the index and search operations into the virtual filesystem is that anything the virtual filesystem can see can be indexed. When searches are performed you should also be able to interact with any of the results as a virtual filesystem. This avoids the issue where a discrete search library might return a URL that the client can not do anything with. So, what does it look like to code using libferris? Most objects in ferris are smart pointers, many using intrusive reference counting. The type for such objects is prefixed with "fh_" to indicate a ferris handle. 
The notion of files and directories is amalgamated into a single "Context" abstraction. To get a smart pointer to a filesystem path the Resolve() function is used. So without further ado, to get a file and its metadata with libferris:

Libferris is steadily gaining commercial interest. Currently I provide things like custom builds of libferris, explicit support for new test cases in the core regression test suite that are important to clients and, of course, extensions to libferris to perform a specific task that might be desired. There are packages available for both 32-bit and 64-bit Fedora 8, 9 and Ubuntu 7.10 (Gutsy) as well as 32-bit packages for openSUSE 10.3. Unfortunately there is currently a bug in building 64-bit stldb4 on openSUSE. Install the libferris-suite package to pull in all the dependencies. Feel free to email the witme-ferris mailing list or add comments to this article suggesting any weird and wonderful (and obscure) filesystems you have experienced in the past. Though my libferris.TODO file always grows more than it shrinks, I'm always happy to add new and exciting suggestions near the top of it.

UKUUG: Arnd Bergmann on interconnecting with PCIe

PCI express (PCIe) is not normally considered as a way to connect computers; rather, it is a bus for attaching peripherals, but there are advantages to using it as an interconnect. Kernel hacker Arnd Bergmann gave a presentation at the recent UKUUG Linux 2008 conference on work he has been doing on using PCIe for IBM. He outlined the current state of Linux support as well as some plans for the future. The availability of PCIe endpoints for much of the hardware in use today is one major advantage. By using PCIe, instead of other interconnects such as InfiniBand, the same throughput can be achieved with lower latency and power consumption. Bergmann noted that avoiding using a separate InfiniBand chip saves 10-30 watts, which adds up rather quickly on a 30,000 node supercomputer.
There are some downsides to PCIe as well. There is no security model, for example, so a root process on one machine can crash other connected machines. There is also a single point of failure because if the PCIe root port goes down, it takes the network with it or, as Bergmann puts it: "if anything goes wrong, the whole system goes down". PCIe lacks a standard high-level interface for Linux and there is no generic code shared between the various drivers—at least so far. As an example of a system that uses PCIe, Bergmann described the "Roadrunner" supercomputer that is currently the fastest in existence. It is a cluster of hybrid nodes, called "Triblades", each of which has one Opteron blade along with two Cell blades. The nodes are connected with InfiniBand, but PCIe is used to communicate between the processors within each node by using the Opteron root port and PCIe endpoints on the Cells. There is other hardware that uses PCIe in this way, including the Fixstars GigaAccel 180 accelerator board and an embedded PowerPC 440/460 system-on-a-chip (SoC) board, both of which use the same Axon PCIe device. Bergmann also talked about PCIe switches and non-transparent bridges that perform the same kinds of functions as networking switches and bridges. Bridges are called "non-transparent" because they have I/O remapping tables—sometimes IOMMUs—that can be addressed by the two root ports that are connected via the bridge. These bridges may also have DMA engines to facilitate data transfer without host processor control. Bergmann then moved on to the software side of things, looking at the drivers available—and planned—to support connection via PCIe. The first driver was written by Mercury Computers in 2006 for a Cell accelerator board and is now "abandonware". It has many deficiencies and would take a lot of work to get it into shape for the mainline. 
Another choice is the driver used in the Roadrunner Triblade and the GigaAccel device which is vaguely modeled on InfiniBand. It has an interface that uses custom ioctl() commands that implement just eight operations, as opposed to hundreds for InfiniBand. It is "enormous for a Linux device driver", weighing in at 13,000 lines of code. The Triblade driver is not as portable as it could be, as it is very specific to the Opteron and Cell architectures. On the Cell side, it is implemented as an Open Firmware driver, but the Opteron side is a PCIe driver. There is a lot of virtual ethernet code mixed in as well. Overall, it is not seen as the best way forward to support these kinds of devices in Linux. Another approach was taken by a group of students sponsored by IBM who developed a virtual ethernet prototype to talk to an IBM BladeCenter from a workstation by way of a non-transparent bridge. Each side could access memory on the other by using ioremap() on one side and dma_map_single() on the other. By implementing a virtio driver, they did not have to write an ethernet driver, as the virtio abstraction provided that functionality. The driver was a bit slow, as it didn't use DMA, but it is a start down the road that Bergmann thinks should be taken. He went on to describe a "conceptual driver" for PCIe endpoints that is based on the students' work but adds on things like DMA as well as additional virtio drivers. Adding a virtio block device would allow embedded devices to use hard disks over PCIe or, by implementing a Plan 9 filesystem (9pfs) virtio driver, individual files could be used directly over PCIe. All of this depends on using the virtio abstraction. Virtio is seen as a useful layer in the driver because it is a standard abstraction for "doing something when you aren't limited by hardware". Networking, block device, and filesystem "hosts" are all implemented atop virtio drivers, which makes them available fairly easily. 
One problem area, though, is the runtime configuration piece. The problem there is "not in coming up with something that works, but something that will also work in the future". Replacing the ioctl() interface with the InfiniBand verbs (ibverb) interface is planned. The ibverb interface may not be the best choice in an abstract sense, but it exists and supports OpenMPI, so the new driver should implement it as well. Two types of virtqueue implementations are envisioned, one for memory-mapped I/O (MMIO) and the other for a DMA-based virtqueue. The MMIO would be the most basic virtqueue implementation, with a local read of a remote write. Read access on PCIe is much slower than write because a read must flush all writes then wait for data reception. Data and signaling information would have separate areas so that data ordering guarantees could be relaxed on the data area for better performance, while strict data ordering would be set for the signalling area. The DMA engine virtqueue implementation would be highly hardware-specific to incorporate performance and other limitations of the underlying engine. In some cases, for example, it is not worth setting up a DMA for transfers of less than 2K, so copying via MMIO should be used instead. DMA would be used for transferring payload data, but signaling would still be handled via MMIO. Bergmann noted that the kernel DMA abstraction may not provide all that is needed so enhancements to that interface may be required as well. Bergmann did not provide any kind of time frame in which this work might make its way into the kernel as it is a work in progress. There is much still to be done, but his presentation laid out a roadmap of where he thinks it is headed. In a post-talk email exchange, Bergmann points to his triblade-2.6.27 branch for those interested in looking at the current state of affairs, while noting that it "is only mildly related to what I think we should be doing". 
He also mentioned a patch by Ira Snyder that implements virtual ethernet over PCI, which "is more likely to go into the kernel in the near future". Bergmann and Snyder have agreed to join forces down the road to add more functionality along the lines that were outlined in the talk.

BBC opens a little more content for Linux

The British Broadcasting Corporation (BBC) has long dabbled with free software, starting a number of new projects and opening content via their backstage developer network. Now they've announced a bold new step forward, releasing an experimental service (initially just for Linux users) with open access to some multimedia content, which has already spun out in unexpected ways. The BBC's Research and Innovation team took a fairly conventional commissioning process for this experiment. Having identified the feature (help existing content to "surface" in multimedia applications, so users don't need to browse around the web site), they went on to find the right approach. George Wright and his team settled on integrating BBC content into the Totem media player with Canonical, aiming to get a first version out with the recent Intrepid release. Things then moved quickly. Discussions with the company contracted to do the Totem work (Collabora) started in spring 2008, although according to Christian Schaller from Collabora "it was probably around July things got concrete". Over a few autumn months the work was completed, opening up a large number of radio shows to Ubuntu users worldwide (although much of the content is restricted to the UK because that's who pays the TV license that funds the BBC). This great new feature, exclusive to Ubuntu, was promoted in the Intrepid press release but received little attention in the media. Given that it still only delivers a fraction of the content you can get through iPlayer (proprietary Windows software full of DRM technology) this is hardly surprising.
That you can stream Dirac-encoded videos released under Creative Commons licenses is obviously still a bit geeky for most. But that doesn't stop free software developers. Barely days after the Totem announcement, Nikolaj Hald Nielsen wrote a script to neatly integrate the content in Amarok 2.0. As a core Amarok developer his main motivation was familiar: "I wanted to inspire other people to write similar scripts for Amarok 2, and I think it is important to have some good example scripts ready when Amarok 2.0.0 final is released." I've been watching the Amarok 2 betas come along, and having given the "get more features" dialogs in KDE a miss over the past few years, I was pleasantly surprised how well this worked. You just go to the script manager, click to get some more scripts, install the BBC script and—like magic—you get all the BBC content in the "internet" tab on the left. Wright's team did all the hard low-level work to make this kind of adaptation straightforward. The Amarok script has delighted Wright, who is a long-time Amarok user; they've even been in touch with Nielsen to see how they can help improve the integration. The question everyone wants an answer to is: will this ever match iPlayer for content range? Wright's team have a fairly wide remit, but they're not in charge of releasing content, so this is unlikely to change the Corporation's attitude towards DRM overnight. According to Wright, the content teams have given great feedback, but over the past five years we've seen promises of an open Creative Archive wither away, with a consumer-facing focus on proprietary products like iPlayer. Truly open content from the BBC, or even the volume of copyrighted-but-available archives released by the National Public Radio (NPR) in the US (also integrated into Amarok ), is probably still a long way off. This new service is strictly experimental, Wright says, "it's a way to experiment with distribution platforms and free software." 
They've also learned a lot more about developing in a free software community; although many of them have been Linux users for years, this was a first for them. Working to the feature freezes for Gnome and Ubuntu Intrepid meant the UI isn't as nice as they might have hoped, but it's a great start. The open service is here to stay. They're not sure if they'll keep developing the Totem feature and patching against mainline in Ubuntu or Totem; time will tell. More work between Collabora, the BBC, and Canonical is also uncertain. But, since the code is all open, we can definitely expect the Totem and Amarok features to be maintained. We can also look forward to more open content integrated into free desktops in the future in a way that is extremely difficult to do with proprietary platforms.

Blending Debian

Last week we introduced Debian Pure Blends, and now this week we'd like to look a bit deeper into the concept, the white paper and how this idea compares to similar ideas. To begin with, the Debian Pure Blend is not a new idea. It's a new name for an existing concept that goes back to early 2004. Discussions probably started earlier, but April 2004 is when a mailing list was opened for this topic. At DebConf5, held in Helsinki, Finland in July of 2005, there were talks about Debian Derivatives and Custom Debian Distributions. Custom Debian Distributions (CDD) was the previous name for Debian Pure Blends and the derivatives are now forks. A white paper, available in PDF or HTML, was originally written in 2004 to describe the CDD concept. It has been recently modified for the new name of Debian Pure Blends. There are a few places in the white paper where its age shows. These are mostly references to distributions other than Debian. You'll find some mention of Mandrake, for example. The combined Mandrakesoft and Conectiva forming the new entity Mandriva was finalized later in 2004. Debian 3.0 (Woody) appears to have been the stable version when the document was new.
Since then Debian has released 3.1 (Sarge) and 4.0 (Etch), and is nearing the 5.0 release (Lenny). While the dates are old, the whole stands as a definition of what is a Pure Blend and what is a fork. The Pure Blend is based on Debian stable (currently Etch). It contains only packages found in the stable repository. A Pure Blend must retain 100% compatibility with the stable repository. A system administrator using a Pure Blend could easily install additional packages from Debian's sizeable repository. It is not uncommon for one or more developers of a Pure Blend to also be Debian Developers who are able to maintain the packages needed by the Blend within the Debian archive. The document is also a valuable resource for anyone who wishes to create their own Pure Blend. The list of forks in section 5.1.1 could use some attention, although this is not really important to the overall topic. Currently listed are Linspire, Xandros and Libranet. Libranet died in 2006 following the death of its founder Jon Danzig. Linspire was acquired by Xandros earlier this year and what was Linspire is now part of Xandros. The free version of Linspire, called Freespire, is still around. Roughly speaking, Freespire is to Xandros as Fedora is to Red Hat: a community project to test drive new technologies which may find their way into the enterprise distribution. Whether Freespire is a fork or something more pure remains to be seen. Freespire 5.0 is not finalized yet. It appears that Freespire will wait for the official Debian 5.0 (Lenny) release before its final 5.0 stable release. Another fork that might be mentioned here is Ubuntu. This popular distribution didn't exist when this document was originally created. The first Ubuntu release was 4.10 preview (Warty Warthog), dated September 2004. Ubuntu is clearly a fork though, based on Debian's unstable branch, known as sid. Packages from Debian's stable repository might work on Ubuntu, but that is by no means a sure thing.
So how does this compare to other distributions? At this time Debian remains the most popular base, whether the spinoff is Pure or a fork. This is largely due to the size of Debian's repository. There are simply more packages to choose from. Fedora's repository has about half the number of packages, but it continues to grow. Fedora would like to become more widely used as a base. The project is still working on a draft of trademark guidelines, where a "Spin" is much like a Pure Blend and a "Remix" is more of a fork. Spin maintainers are welcome to become Fedora contributors and package the free software needed by the Spin. Red Hat addressed this issue some years ago, when Red Hat Enterprise spinoffs flourished following the demise of the old Red Hat Linux distribution. Red Hat made separate packages with its logos and trademark so that spinoffs could more easily take the free software, without the commercial baggage. At first separating the logos from the free software was a difficult process. Debian has an official logo and an unofficial logo, for other projects to use. Fedora is coming up with its own rules, with the draft trademark guidelines. The terminology for spinoffs varies as well. A Fedora Spin is mostly equivalent to a Debian Pure Blend. A Fedora Remix is more of a fork. Regardless of what they are called, these spinoff distributions make the free software landscape a richer and more diverse place.

UKUUG: The right way to port Linux

Arnd Bergmann pulled double duty at the recent UKUUG Linux 2008 conference by giving a talk on each day of the event. His talk on Saturday, entitled "Porting Linux to a new architecture, the right way", looked at various problems with recent architecture ports along with a project he has been working on to simplify that process. By creating a generic template for architectures, some of the mistakes of the past can be avoided.
This is one of Bergmann's pet projects, that "I like to do for fun, when I am hacking on the kernel, but not for IBM". The project and talk were inspired by a few new architectures that were merged—or were submitted for merging—in the last few years. In particular, the Blackfin and MicroBlaze architectures were inspiring, with the latter architecture still not merged, perhaps due to Bergmann's comments. He is hoping to help that situation get better. The biggest problem with architecture ports tends to be code duplication because people start by copying all of the files from an existing architecture. In addition, "most people who don't know what they are doing copy from x86, which in my opinion is a big mistake". According to Bergmann, architecture porters seem to "first copy the header files and then change the whitespace", which makes it difficult to immediately spot duplicated code. He points to termbits.h as an example of an include file that is duplicated in multiple architectures unnecessarily as the code is the same in most cases. He also notes there is "incorrect code duplication", pointing to new architectures that implement the sys_ipc() system call, resulting in "brand new architectures supporting a broken interface for x86 UNIX from the 80s". That call is a de-multiplexer for System V IPC calls that has the comment—dutifully duplicated into other architectures—"This is really horribly ugly". Then there are problems with "code duplication by clueless people" which includes a sembuf.h implementation that puts the padding in the wrong place because of 64 vs. 32-bit confusion. In addition, because code is duplicated in multiple locations, bug fixes that are made for one architecture don't propagate to all the places that need the fix. As an example he noted a bug fix made by Sparc maintainer David Miller in the x86 tree that didn't make it into the Sparc tree. 
Finally, there are ABIs that are being needlessly propagated in new architecture ports: system calls that are implemented in terms of newer calls are still present in new ports even though it could all be handled in libc. The "obvious" solution is to create a generic architecture implementation that can be used as a starting point for new ports. Bergmann has been working on that, resulting in a 3000 line patch that "should make it very easy for people to port to new architectures". To start with, it defines a canonical ABI that is a list of all of the system calls that need to be implemented for a new architecture. It puts all of the required include files into the asm-generic directory that new ports can just include, or copy if they need to modify them. Unfortunately, things are not quite that simple, of course; there are a number of problem areas. There are "lots of things you simply cannot do in a generic way". Most of these things are fairly hardware-specific areas like MMU support, atomics, interrupts, task switching, byte order, signal contexts, hardware probing and the like. Bergmann decided to go ahead by defining away some of these problems in his example architecture. So, there is no SMP or MMU support, with the asm-generic/atomic.h and asm-generic/mmu_context.h include files being appropriately modified. Many of the architecture-specific functions have been stubbed out in arch/example/kernel/dummy.c so that he can compile the template architecture. The example architecture uses an Open Firmware device tree to describe the hardware that is available at boot time. Open Firmware "is a bit like what you have with the new Intel EFI firmware, but it's a lot nicer". A flattened device tree data structure is passed to the kernel at boot time by the bootloader, so Bergmann will be able to make it to the next step: making it boot. As one might guess, there is still more work to be done.
There are eight header files that are needed from the asm-example directory, but Bergmann hopes to reduce that some. He notes that there are other architecture-specific areas that need work. For example, every single architecture has its own implementation of TCP checksums in assembly language, which may not be optimal. Bergmann pointed attendees at the ukuug2008 branch of his kernel.org playground git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git to see the current state of his example architecture. It looks to be a nice addition to the kernel that will likely result in better architecture ports down the road.

MinGW and why Linux users should care

The Minimalist GNU for Windows (MinGW) project is a way to get GCC and tools like binutils working to build software for the Windows environment, something that might not sound very interesting to Linux users or developers. But there are a number of advantages to porting and regularly testing free software on Windows, as Red Hat's Richard Jones and Dan Berrange explain in the following interview. Richard and Dan also describe Red Hat's involvement, how developers can participate, as well as how it all helps the free software cause.

LWN: Could you describe the MinGW project? How did it get started?

Richard: For some time I have been making Windows builds of libvirt available and, frankly, it was a real chore. I needed a Windows virtual machine to do it. But Windows is so frustrating to use and maintain: it doesn't come with any of the tools such as shells or version control that we are used to, and because I was only doing builds once a month or so I'd go back to it and find something had gone wrong that would require maintenance or even reinstallation. During this time, we didn't routinely build libvirt for Windows. New code would inevitably break something.
I had to fix things on Windows, then copy the code back to Linux and check that my fixes didn't break the Linux build, then come up with a patch, and all of this was complicated by the fundamental incompatibility of Windows with the rest of the world -- even simply copying code back and forth is irritatingly difficult when one machine is a Windows machine. (There's no ssh or scp or tar, files get executable bits set or have CRLF line endings, etc.) At the same time we were getting a strong demand for the rest of our virt tools on Windows. Enough was enough. We decided that the only way to deal with this was to remove Windows from the equation. We wanted to build and test libvirt and the virt tools for Windows routinely (daily or more often), from the Fedora host, using the normal development environment. The way to do this is through cross-compilation (the Fedora MinGW project) and testing under emulation (Wine). Debian & Ubuntu have been shipping the MinGW cross-compiler for quite a while, but it's important to say that the cross-compiler itself is the easy bit. The hard part about this project are the 50+ libraries and development tools that we ship and maintain alongside. Without those, just having the cross-compiler is fairly useless. Dan: The libvirt project started a few years ago to provide an API for managing Xen virtualization hosts. Initially it was just a locally accessed C library, but over time the project expanded in scope to allow remote RPC access to the management APIs, and over other virtualization technology like QEMU, KVM, OpenVZ, LXC (native Linux containers) & User-Mode Linux. Shortly after we added support for RPC, a number of community members expressed an interest in using the client side from the Windows platform to manage their Unix hosts. Periodically people would contribute patches to make libvirt build on Windows, but soon after they were applied, new unrelated work would break the Windows build again. 
It became clear that if the libvirt community was to officially support building a Windows client, then all developers needed to be able to easily test builds for Windows. The obvious stumbling block here is that most of our community developers do not use or even own Windows machines for testing. The MinGW project provides a cross compiler toolchain and stubs for the Win32 APIs to allow building of Windows executables and DLLs from a Linux host. Add in WINE and you can also run your cross-compiled build. MinGW and WINE are completely open source, so we can provide a very good level of support without ever having to purchase a Windows license or leave our primary Linux development environment. We are not the first people to see the value in MinGW for supporting Windows platforms in open source software. Prior to the start of the Fedora MinGW effort, Fedora developers would have to build all the cross compilers & libraries themselves. This is not particularly hard, but it is a lot of wasted effort to have everyone duplicating the work. Providing the MinGW compiler toolchain, and important libraries such as libxml, gnutls, libpng, libjpeg, GLib, GTK, etc. directly in the Fedora repositories enables developers to focus on their own code, rather than the cross-compilers. LWN: What is Red Hat's involvement in MinGW? Richard: Dan and I work for a Red Hat group responsible for fostering the development of new tools and technologies. We have an eye to productisation and I spend quite a lot of time going to customer conferences and asking them what they want to see, but as for whether MinGW will make it into some future supported Red Hat product I cannot say. Dan: Red Hat initiated development on the libvirt project and supports its ongoing evolution with significant developer resources. Red Hat wants the libvirt project to be the de facto standard for managing virtualization hosts, and the project community members want Windows to be a supported client platform.
The work we are doing on the MinGW project in Fedora is thus a response to demand from the libvirt community for better Windows support in our releases. It is just a small part of our day job, alongside major libvirt feature development for Linux systems and in particular KVM & Xen. LWN: Why does Red Hat care? Are you going into the Windows software business now? Richard: Red Hat certainly cares about libvirt, and making libvirt available on the widest range of platforms. The alternatives to libvirt are interfaces like XenAPI and VMWare's APIs, which lock customers into proprietary technologies. Any way we can make it easier to provide open APIs and open source software even on closed platforms like Windows is a win for Red Hat, the Linux community, and even for Windows users. Dan: As Richard says, this effort isn't about any particular Red Hat product. It is a community focused effort to address demand from libvirt users for better Windows client support. People are interested in open source virtualization technology like Xen and KVM, as an alternative to closed source solutions. Open source exists in a heterogeneous world though, and even if someone decides to migrate their servers to virtual machines on a Linux KVM host, they may still need to manage these servers from a Windows desktop. The MinGW project helps us maintain a reliable client build for the Windows platform, and thus lets a broader spectrum of users take advantage of open source virtualization technology. Growing the size of the libvirt community, and encouraging use of virtualization is what is important to Red Hat, and the MinGW project is one small part of that effort. LWN: Why should free software developers care about MinGW? Does it do anything for them? Richard: There's been some opposition, along the lines of "why are we helping Windows?". IMHO people who say that are ignoring both history and reality. 
First the history bit: the GNU project started off as a set of better compilers and command-line tools for the proprietary Unix systems of the day. I remember before Linux was around that you'd get some horrible system like HP-UX or (in my case) OS-9, and the first thing you would do would be to install all the GNU tools. Without real GNU grep, make, awk, bash, those systems were less than useful. Eventually when GNU got a kernel (Linux) we moved over to that system because it came with all the good tools. Second the reality bit: Windows users are locked into proprietary applications and file formats, everything from Photoshop to QuickBooks to MSN to Illustrator. No Windows user can switch without first switching all their applications, which is going to be a very long transition process. Therefore we need a way to enable the developers of Gimp, GnuCash, Pidgin, Inkscape (to pick four out of hundreds) to easily build and test their software for Windows, so they can ship their software for Windows, respond easily to bug reports, and break that proprietary lock-in. Fedora MinGW does this - in fact we already used our compiler and huge chain of libraries to port Inkscape. Dan: The libvirt project started off with a strong Linux focus due to our immediate needs for a management API for Xen in Fedora and later RHEL-5. Over time the community has contributed patches to improve our portability to non-Linux platforms, in particular Solaris and more recently Windows. While Red Hat's focus is on Linux, enabling portability to other platforms is important because it grows the size of your developer community.
Every significant open source project has a huge wishlist of features and nowhere near enough developers and testers to address them all. Cross-platform portability enlarges the pool of potential contributors. They may initially only send minor patches to fix portability bugs for Windows, but over time they can end up working on major new features that benefit every platform. Another thing we've found in porting to other platforms, is that it can generally improve the quality of the codebase. Different compilers and runtime environments expose different bugs in an application. The more combinations you can regularly build & test on, the better the overall quality of your code. LWN: Is there anything in particular that developers should keep in mind to make life easier for people building their code for MinGW? Richard: My pet list would be:
Don't write your own build system. Use autoconf/automake/libtool or cmake. That's not to say I'm a great fan of autoconf, but these really do make cross-compilation almost trivial. Autoconf-based programs can generally be cross-compiled by doing:
Don't try to run executables during the build phase. It doesn't work when you're cross-compiling.
Do use pkg-config. And if you can't use pkg-config, then make sure your *-config program is a shell script, not a binary.
Do use common, portable libraries such as glib, gtk, libvirt or any of our other libraries.
Please use Fedora MinGW to routinely cross-compile your own code for Windows.
Dan: I have been pleasantly surprised at just how easy it has been to build many open source libraries with MinGW. Despite almost universal dislike for autotools, the applications which use autotools have been some of the easiest to port, particularly when it comes to building DLLs. The apps with home-brewed build systems have been much more involved. I definitely echo Richard's suggestion to stick to a broadly supported build system like autotools or cmake.
Any project which is serious about enabling support for Windows in their releases should make sure they are running regular automated builds & tests of their codebase. This is actually just good sense for any software engineering project regardless of whether Windows support is desired - it just happens to be particularly useful for configurations that developers rarely test on a day-to-day basis to avoid otherwise unnoticed regressions. If you are not using a support library like GLib, QT or NSPR (which provides a degree of cross-platform portability) then seriously consider making use of Gnulib. This is a library of code which you can drop into an application, fixing POSIX API portability problems on various platforms. As an example, it replaces Winsock's socket() call so it returns real file descriptors that you can use in both read() and recvfrom(). It can't fix all problems - such as the lack of fork()/exec() on Windows - but if your application / library is written against POSIX, using Gnulib will significantly improve your portability across all Linux, UNIX and Windows platforms. LWN: What are the biggest challenges that your project faces now? How can the community help? Richard: Scaling the project is a big challenge. Red Hat dedicates quite limited resources to this project. The only way we can scale it is if the application developers themselves start to use our tools to build and maintain their own programs. I would like to see everyone who has an important Linux app or library start building and shipping for Windows routinely. Bringing open APIs, apps and file formats to Windows users is important: It's important to Windows users because it breaks their lock-in and makes switching to a fully free platform easier down the road. It's important for you, because your potential audience of users will increase by a factor of 10x or 20x. Dan: Spreading the package maintenance job across a larger number of Fedora members is an important task. 
There is a limit to how many packages a single person can do a good job at maintaining. To make it manageable we track & pull patches from the native builds to the MinGW cross-compiled builds of common packages. Ultimately we still need more package maintainers to look after the cross-compiled builds. There are some core pieces of the open source ecosystem which do not work / are not fully portable to a Win32 environment. The most obvious one being DBus, which is used by an ever increasing number of apps for local RPC. There have been a number of efforts to port DBus, but none ever completely finished & merged into the official releases. LWN: Anything else you'd like to say to LWN readers? Richard: Get involved. Dan: Cross platform portability is often beneficial to your project even if you personally only care about its use in Linux. In the libvirt case it is opening up use of libvirt & virtualization to a set of users who have only ever had access to closed source virtualization technology. Portability broadens the pool of potential contributors to your project. Open source developers on the various BSDs, OpenSolaris, and Windows all have the potential to make valuable contributions to your project. [ We would like to thank Richard and Dan for taking time to answer our questions. ] Tbench troubles II LWN has previously covered concerns over slowly deteriorating performance by current Linux systems on the network- and scheduler-heavy tbench benchmark. Tbench runs have been getting worse since roughly 2.6.22. At the end of the last episode, attention had been directed toward the CFS scheduler as the presumptive culprit. That article concluded with the suggestion that, now that attention had been focused on the scheduler's role in the tbench performance regression, fixes would be relatively quick in coming. 
One month later, it would appear that those fixes have indeed come, and that developers looking for better tbench results will need to cast their gaze beyond the scheduler. The discussion resumed after a routine weekly posting of the post-2.6.26 regression list; one entry in that list is the tbench performance issue. Ingo Molnar responded to that posting with a pointer to an extensive set of benchmark runs done by Mike Galbraith. The conclusion Ingo draws from all those runs is that the CFS scheduler is now faster than the old O(1) scheduler, and that "all scheduler components of this regression have been eliminated." Beyond that: In fact his numbers show that scheduler speedups since 2.6.22 have offset and hidden most other sources of tbench regression. (i.e. the scheduler portion got 5% faster, hence it was able to offset a slowdown of 5% in other areas of the kernel that tbench triggers) This improvement is not something that just happened; it is the result of a focused effort on the part of the scheduler developers. Quite a few changes have been merged; they all seem like small tweaks, but, together, they add up to substantial improvements in scheduler performance. One change fixes a spot where the scheduler code disabled interrupts needlessly. Some others (here and here) adjust the scheduler's "wakeup buddy" mechanism, a feature which ties processes together in the scheduler's view. As an example, consider a process which wakes up a second process, then runs out of its allocated time on the CPU. The wakeup buddy system will cause the scheduler to bias its selection mechanism to favor the just-waked process, on the theory that said process will be consuming cache-warm data created by the waking process. By allowing cooperating processes like this to run slightly ahead of what a strictly fair scheduling algorithm would provide, the scheduler gets better performance out of the system as a whole. The recent changes add a "backward buddy" concept. 
If there is no recently-waked process to switch to, the scheduler will, instead, bias the selection toward the process which was preempted to enable the outgoing process to run. Chances are relatively good that the preempted process might (1) be cooperating with the outgoing process or (2) have some data still in cache - or both. So running that process next is likely to yield better performance overall. A number of other small changes have been merged, to the point that the scheduler developers think that the tbench regressions are no longer their problem. Networking maintainer David Miller has disagreed with this assessment, though, claiming that performance problems still exist in the scheduler. Ingo responded in a couple of ways, starting with the posting of some profiling results which show very little scheduler overhead. Interestingly, it turns out that the networking developers get different results from their profiling runs than the scheduler developers do. And that, in turn, is a result of the different hardware that they are using for their work. Ingo has a bleeding-edge Intel processor to play with; the networking folks have processors which are not quite so new. David Miller tends to run on SPARC processors, which may be adding unique problems of their own. The other thing Ingo did was, for all practical purposes, to profile the entire kernel code path involved in a tbench run, then to disassemble the executable and examine the profile results on a per-instruction basis. The postings that resulted (example) point out a number of potential problem spots, most of which are in the networking code. Some of those have already been fixed, while others are being disputed. It is, in the end, a large amount of raw data which is likely to inspire discussion for a while. To an outsider, this whole affair can have the look of an ongoing finger-pointing exercise. And, perhaps, that's what it is. 
But it's highly-technical finger-pointing which has increased the understanding of how the kernel responds to a specific type of stress while also demonstrating the limits of some of our measurement tools and the performance differences exhibited by various types of hardware. The end result will be a faster, more tightly-tuned kernel - and better tbench numbers too. NLnet Foundation seeks projects to fund A little-known organization—at least outside of its native home in the Netherlands—has quietly been funding various free software projects to the tune of roughly €2.5 million a year. Most of those projects have been in the Netherlands or Europe, but it is looking to expand its reach to the rest of the world. It is "actively encouraging" submissions of funding proposals for projects that involve network technology and will be released as open source, according to NLnet Foundation Director Valer Mischenko. The Foundation grew out of the Netherlands' first internet provider, NLnet, which laid the original backbone along the rails in that country. In 1998, it was sold to UUNet and the proceeds were invested into the Foundation. The intent of the money was to fund technology, particularly internet technology. Because the internet depends on interoperability, it just makes sense to require projects that are funded to release their code, Mischenko says. The Foundation prides itself on being quick to answer requests for funding as there are "not too many bureaucratic layers" to the organization. Projects that try to get government funding often fall behind because it takes so much time and effort to get a grant of some kind—the technology may well have moved on. Depending on the size of the project, and the amount of funding required, answers can come as quickly as just a few weeks. Each year, two themes are chosen to focus on so that projects in those areas get priority for funding. 
For 2008, those themes are "Identity, Privacy, and Presence" and "Open Document Format" (ODF). While ODF is not directly connected to network technology, the internet will be a poorer place without open formats that can be freely shared. Part of the ODF effort was helping governments understand the importance of open formats in general and ODF in particular. One of the outcomes of that work was that all agencies in the Netherlands must start using open formats or justify why they cannot. The ODF theme is just one area where the Foundation has broadly interpreted its mission. It has helped fund the FSF Europe (FSFE) Freedom Task Force project for several years. In addition, it provided €200,000 to help pay for Eben Moglen's time to work on GPLv3 at the FSF. Mischenko notes that it is important for the foundation to fund things that will help "protect the network"; he and the board see these efforts as important in that regard. The bulk of funding this year has gone into the Identity, Privacy, and Presence theme. A list of the currently funded projects has a number of interesting entries from support for Tor hidden services and an improved routing algorithm for GNUnet to hardware projects such as RFID Guardian and e-Passport. The current structure of funding is made up of four "layers", each corresponding to how much the Foundation will provide as well as how long it will provide funding for. The first layer is for things like funding trips for developers and other community members to attend conferences and the like. The second layer is for commitments of up to €30,000. Currently around 15% of proposals for second layer funding are granted. For larger projects, the third layer can provide 2-4 years of funding of up to €500-600,000 per year. The fourth layer projects are currently fixed for the next five years as the Foundation is funding DNSSEC work at NLnet Labs as well as work on intelligent agents at Vrije Universiteit Amsterdam. 
Mischenko said that the board is "willing to hear about ideas that don't fit into the layers". He said that the Foundation will continue its current funding model "unless we hear a great world-changing idea that we put all our money in and then we are gone". It is not just projects that can be funded by the Foundation; any person, company, or organization can apply. "As long as it is a network technology and it will be put in open source", the Foundation will consider funding it. [ Along those lines, the author would like to thank the NLnet Foundation for helping to fund his recent trip to the co-located NLUUG autumn Mobility conference and Embedded Linux Conference Europe in Ede, the Netherlands. ] SSH plaintext recovery vulnerability A somewhat mysterious SSH vulnerability has been reported in a way that unfortunately looks a bit like partial disclosure. In this case, though, there is a workaround that is supposed to alleviate the problem, so there are good reasons - as opposed to publicity-oriented reasons - to announce the flaw. While the flaw is difficult to exploit, it does expose up to 32 bits of plaintext from within an SSH session, a rather worrisome failure mode. The flaw has only been confirmed in OpenSSH 4.7p1, but the announcement indicates that it is likely to be much more widespread: "We expect any RFC-compliant SSH implementation to be vulnerable to some form of the attack." The flaw is in the design of SSH and can allow an attacker who has "control over the network" - presumably the ability to monitor and inject traffic - to recover 32 plaintext bits with a very low probability (2^-18). The bits recovered come from an attacker-selected block of ciphertext. The attack leads to the termination of the SSH connection, so iterative attacks will be difficult or impossible. It is hard to get too worked up about that kind of attack, even with many of the details lacking, but typically these kinds of flaws can be expanded in various ways.
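As a rough back-of-the-envelope sketch of what a 2^-18 success probability means in practice: assuming (this is an assumption made here, not something the advisory states) that each attempt is an independent trial and that every attempt costs the attacker one SSH connection, the expected effort follows from the geometric distribution. The function names below are purely illustrative.

```python
from fractions import Fraction


def expected_attempts(p):
    """Mean number of independent trials until the first success
    (the mean of a geometric distribution with success probability p)."""
    return 1 / p


def chance_within(p, attempts):
    """Probability of at least one success in `attempts` independent trials."""
    return 1 - (1 - p) ** attempts


# Success probability quoted in the advisory for recovering 32 bits.
P_32_BITS = Fraction(1, 2 ** 18)

if __name__ == "__main__":
    # Expected number of torn-down connections before one success.
    print(expected_attempts(P_32_BITS))  # 262144
    # Chance of at least one success after 1000 forced disconnections.
    print(float(chance_within(P_32_BITS, 1000)))
```

Under those assumptions an attacker would, on average, have to break a quarter of a million connections to recover a single 32-bit block, and a thousand attempts would succeed well under one percent of the time.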
The announcement mentions variants that recover 14 bits with a probability of 2^-14. It also carries the following warning: "The success probabilities for other implementations are unknown (but are potentially much higher)." It is a security tautology that vulnerabilities only get bigger over time, which we have seen in various contexts, notably in DNS cache poisoning flaws over the years. Another bit of information provided by the Centre for the Protection of National Infrastructure (CPNI), the UK government agency that issued the advisory, is that the attack analyzes "the behaviour of the SSH connection when handling certain types of errors". This particular attack is also only applicable to the default cipher-block chaining (CBC) mode, so switching to counter (CTR) mode works around the flaw. OpenSSH supports the use of AES in CTR mode, which is what the advisory recommends using: "A switch to AES in counter mode could most easily be enforced by limiting which encryption algorithms are offered during the ciphersuite negotiation that takes place as part of the SSH key exchange (see RFC 4253, Section 7.1)." There is quite a bit of information in the advisory that might lead a determined attacker in the "right" direction. It might also provide enough for someone to come up with attacks that are more probable and/or reveal more plaintext. So far, the Internet Storm Center is reporting that they have not seen any evidence that the flaw is being exploited in the wild. OpenSSH has not, as yet, addressed the issue, at least on its security page. At least in its current form, there is probably very little to worry about from this flaw, but very security-conscious SSH users will want to apply the workaround. Interview with Paul Frields Paul Frields is the Fedora Project Leader, and in the days before the Fedora 10 release he was giving telephone briefings to the media. I took advantage of about an hour of Paul's time to talk about Fedora and the Fedora 10 release.
The following article is based on that conversation. To begin with, we talked about Fedora's new Special Interest Group (SIG) for servers running Fedora. Fedora is a fast-paced distribution, and therefore not suitable for all servers. There are many places where Fedora makes an excellent server, though. Some of those places are in-house, non-internet-facing servers or servers behind a separate firewall. Fedora is also used in server farms, on home servers, and in other places where the 13-month life cycle is not a problem. The Roadrunner supercomputer, a hybrid cluster with both IBM PowerXCell and AMD Opteron processors, runs both Red Hat Enterprise Linux and Fedora. Roadrunner holds the number 1 spot in the Top500 list. Fedora is more than a bleeding edge desktop, although it is good at that. Fedora sponsors the development of many projects through FedoraHosted.org, and provides many other contributions to upstream projects. Extra Packages for Enterprise Linux (EPEL) is a community effort by Fedora developers to provide high-quality add-on packages that complement Red Hat Enterprise Linux and its compatible spinoffs such as CentOS or Scientific Linux. Fedora also contributes to the One Laptop Per Child (OLPC) project. Fedora does serve many needs, including those of "remixers", the creators of derivative distributions. The new trademark guidelines, still in draft form, are designed to spell out the DOs and DON'Ts of creating a remix. Remixers can choose packages from the official Fedora repository, EPEL, RPM Fusion and other repositories. Packages can also be built from source, with or without patches, to create the distribution they want. Naturally, I asked Paul about the infrastructure/security problems that were announced last August. LWN covered the issue in August and September. We have yet to see a final analysis of what happened. Paul did say that a team of Red Hat engineers and Fedora volunteers rebuilt everything from scratch, and signed the packages with new keys.
Beyond that, we were told that the investigation is ongoing and more information will be available once the investigation is complete. Fedora 10 was announced this week, along with the RPM Fusion and ATrpms repositories, updated for Fedora 10. Here are some highlights of this release. With Fedora 9 it became possible to create a persistent USB device, that is, a key that can be updated, remember settings, and store some data. With Fedora 10 you have all that, plus you can encrypt your home directory on the key. The new NetworkManager features connection sharing to enable collaboration everywhere. PackageKit advances the software management system with its ability to use yum, apt, conary, and other existing tools. PackageKit can search for codecs and listen to dbus and communications between applications. With the long-term roadmap for PackageKit, this utility will understand what packages you need and will get them for you. F10 has faster boot times, kernel mode setting, and improved virtualization with KVM. Paul said that the number of Fedora Ambassadors doubles each year. The ambassador program is world-wide, with people who represent the Fedora Project to the wider public, help spread the word about Fedora, Linux, and Open Source, become a point of contact for local community members, channel feedback to the Fedora Project, help recruit project contributors, and think of creative ways to promote Fedora. Fedora 10 has more official spins than ever before. These are specialized distributions that contain only packages in the main Fedora repository. A small sampling includes the Fedora Electronics Lab (FEL) Spin, Fedora KDE Desktop, Fedora Edu/Math Spin and Fedora XFCE Desktop. So check out Fedora 10, or one of the many spins and remixes that are available. The Grumpy Editor's Asian Tour Your editor, having actually managed to spend a few weeks at home, once again succumbed to the allure of long-distance travel.
What is life, after all, without jet lag, economy-class seats, and airline meals? The excuse this time was the combination of the Linux Foundation's Japan Linux Symposium and the Consumer Electronics Linux Forum's Korea Technical Jamboree. Both events are intended to increase communications with the Asian technical community and encourage participation in the development process. They are also an opportunity for developers from other parts of the world to learn more about what their colleagues are thinking. This trip was your editor's second Japanese adventure, so it is interesting to look at what has changed over the intervening 16 months. The organization of the event remains about the same, down to the pizza-and-sushi party at the end of the first day. The agenda was more heavily oriented toward filesystems this time around, along with an overview of control group resource controllers by Hiroyuki Kamezawa. There was a big difference, though, in how the discussions went. Japanese audiences are notoriously quiet and unwilling to ask questions, but the attendees at the Japan Linux Symposium have gotten over this constraint. Questions and discussion abounded - and this is a good thing. Free software development does not work well if people are unwilling to ask questions or raise concerns. The fact that Japanese developers seem to be becoming more willing to participate in this way bodes well for their participation in the process as a whole. How much are these developers participating now? Your editor did a quick and unscientific pass over the changes merged for the 2.6.28 kernel. It appears that a full 5% of those patches came from Japanese developers. If we exclude the work of one prolific developer who currently lives in Europe, it can be said that about 4% of 2.6.28 came from Japan itself. There has been a distinct increase in the amount of kernel code coming from that part of the world, and that can only be a good thing. 
The Linux Foundation's events in Japan (which began in the OSDL days and have been occurring regularly for a few years now) are, perhaps, producing the intended result. Partly in recognition of the larger role now played by Japan in the free software community, the Japan Symposium will be taken to a higher level next year. The 2009 Kernel Summit will be held in Tokyo in October, followed by an expanded, three-day Symposium hosting talks by developers from all over the world. Planning for this event is just getting underway; expect the call for papers to come out early next year. It should be an interesting gathering in a fun city; your editor is already looking forward to attending. The Korea Technical Jamboree was a lower-key gathering, held for a single afternoon on the 25th floor of a Seoul skyscraper. It lacked some of the infrastructure of the Japan Symposium (simultaneous translation, for example), but made up for it in enthusiasm. Your editor found a highly-engaged group of developers interested in talking about the technology. While much of the discussion was, surprisingly enough, in Korean, your editor was able to figure out that virtualization is high on the list of topics that this group was interested in. There was also talk of business models and more. What there was less of, though, was talk of working with the community. From this brief encounter, your editor can guess that the Korean community is still working through the stage of figuring out what it can get from free software. Developers there seem to have, for the most part, not yet reached the point of sharing control of our free operating system and driving it in directions which better suit their needs. By their own admission, Korean developers are a little behind their Japanese counterparts in this regard, but that situation may not last for long. One event your editor was not able to attend was FreedomHEC Taipei, held at the same time. 
Harald Welte was there, though, and posted a brief report: I was really happy about FreedomHEC. It is really about time that the Linux world and the Taiwan-based chipset vendors and system integrators start much more interaction. It is a simple economic fact that a lot of hardware development, both in the PC mainboard, Laptop as well as the embedded device space happens in Taiwan. It is also very true, that for whatever reason the gradual Linux revolution in the server and desktop market in the EU, the US and other markets such as Southern America has not really reached Taiwan. Harald concludes that a higher Linux awareness in Taiwan should lead to better hardware support worldwide. With any luck at all, events like FreedomHEC, like those in neighboring regions, will help to create that awareness and expand our global development community. Your editor was also unable to attend FOSS.in this year, despite a desire to return to that part of the world. FOSS.in is experimenting with a new event plan which is strongly oriented toward the production of tangible results; it has clearly been influenced by the success of the Linux Plumbers Conference. India has vast numbers of capable developers, relatively few of whom actively participate in our community now. That number has been growing, though, and events like FOSS.in have a lot to do with that change. Finally: while your editor saw a lot of people expressing enthusiasm for Linux, many of them seemed to be doing it with Windows laptops. It seems that the value of Linux has not yet made itself felt in the desktop setting, even among those whose job it is to develop for or promote Linux. It would be interesting to know why more of this work can't move off of proprietary platforms. Some of the answer may be related to episodes like this: your editor had rashly upgraded his laptop to a new stable distribution release (we'll call it Incredibly Irritating for the purposes of this discussion) just prior to traveling. 
The obligatory check to ensure that video projection still worked got forgotten this time; it had always worked before, what could go wrong this time? But it seems that this "upgrade" moved the tools needed to interface with RandR into a separate package, which it did not bother to install. So it was not possible to tell the laptop to send video out the external port. Suffice it to say that, five minutes prior to giving a talk, while disconnected from the network, one does not want to hear "you need to install this package before I'll turn on your external video port" from one's computer. Your editor will accept the blame for not having verified this functionality before traveling, but, still: things like this should Just Work, especially with a distribution which claims to have invested much energy into making such things Just Work. The presenters using Windows laptops were not having to contend with this kind of challenge.

That little glitch notwithstanding, this trip was a big success. The hospitality was amazing, interest was high, and there is always value in seeing how other groups are approaching free software. Our community continues to grow; many good things will come from that.

Ksplice and kreplace

Rebooting a system to apply a security update is a pain. In some situations, it's more than a pain; for various reasons, many systems cannot be taken down at all without compromising the work they are supposed to be doing. Back in April, LWN looked at Ksplice, a mechanism designed to enable the installation of kernel updates without the need to reboot the system. Since then, work has continued on Ksplice, a new version has been posted, and the project is starting to push toward mainline inclusion. So another look is called for.

The core idea behind Ksplice remains the same: when given a source tree and a patch, it builds the kernel both with and without the patch and looks at the differences.
To that end, the compilation procedure is modified to put every function and data structure into its own executable section. That makes life a little harder for the compiler and the linker, but developers are notably insensitive to the difficulties faced by those tools. With things split up this way, it is relatively easy to identify a minimal set of changes in the binary kernel image which result from the patch. Ksplice can then, with some care, patch the new code into the running kernel. Once this work is done, the old kernel is running the new code without ever having been rebooted.

This technique works well for code changes, but different challenges come with changes to data structures. Back in April, Ksplice could not handle that kind of change. Even so, the project's developers claimed to be able to apply the bulk of the kernel's security updates using Ksplice. Since then, though, the developers have applied some energy to this problem. With the addition of a couple of new techniques - which require extra effort on the part of the person preparing the patch for Ksplice - it is now possible to apply 100% of the 65 non-DoS security patches released for the kernel since 2005.

In some cases, a kernel patch will simply require that a data structure be initialized differently. The way to handle this change in an update through Ksplice is to modify the relevant data structures on the fly. To effect such changes, a patch can be modified to hand a function, func, to ksplice_apply(). While Ksplice is applying the changes - and while the rest of the system is still stopped - the given func will be called. It can then go rooting through the kernel's data structures, changing things as needed. For example, CVE-2008-0007 came about as a result of a failure by some drivers to set the VM_DONTEXPAND flag on certain vm_area_struct structures.
Ksplice is able to apply the fix to the drivers without trouble, but that is not helpful for any incorrectly-initialized VMAs present on the running system. So the modifications to the patch add some functions which set VM_DONTEXPAND on existing VMAs, then use ksplice_apply() to cause those functions to be executed. The result is a fully-fixed system. Changes to data structure definitions are harder. If a structure field is removed, the Ksplice version of the patch can just leave it in place. But the addition of a new field requires more complicated measures. Simply replacing the allocated structures on the fly seems impractical; finding and fixing all pointers to those structures would be difficult at best. So something else is needed. For Ksplice, that something else is a "shadow" mechanism which allocates a separate structure to hold the new fields. Using shadow structures is a fair amount of additional work; the original patch must be changed in a number of places. Code which allocates the affected structure must be modified to allocate the shadow as well, and code which frees the structure must be changed in similar ways. Any reference to the new field(s) must, instead, look up the shadow structure and use that version of the field. All told, it looks like a tiresome procedure which has a significant chance of introducing new bugs. There is also the potential for performance issues caused by the linear linked list search performed to find the shadow structures. The good news is that it is only rarely necessary to modify a patch in this way. The Ksplice developers do not appear to be done yet; from the latest patch posting: We're currently working on the problem of making it feasible to apply the entire stable tree using Ksplice. 
Although Ksplice's original evaluation focused on patches for CVEs, we understand the idea that "security bugs are just 'normal bugs'" (i.e., tracking security bugs separately from normal bugs can be difficult and isn't necessarily advisable). We ultimately want to provide to long-running machines hot updates for all of the bug fixes that go into the corresponding stable tree. This is an ambitious goal; a single stable series can add up to hundreds of changes, some of which can be reasonably large. It will be interesting to see how many users are really interested in this particular sort of update; sites running critical systems tend to have older "enterprise" kernels which are no longer receiving stable tree updates. But a Ksplice which is flexible enough to handle that kind of update stream should also be useful for distributors wanting to provide no-reboot patches to their customers. Meanwhile, Nikanth Karthikesan has posted a facility called kreplace. On the surface, it looks similar to Ksplice, but the goal is a little different: its purpose is to allow a developer to quickly try out a change on a running kernel. Kreplace works by simply patching out and replacing one or more functions in the kernel. Kreplace may have its value, but the initial reaction has not been greatly enthusiastic. Among other things, it has been pointed out that Ksplice also has a facility to allow for quick experimentation with changes - though it will be quick only if the developer is already set up to use Ksplice with the running kernel. A final concern with either of these solutions is that they are, for all practical purposes, employing rootkit techniques. A mechanism which can be used by distributors to patch running systems can also be (mis)used by others. Vendors of binary-only modules could, for example, use Ksplice or kreplace to get around GPL-only exports and other inconvenient features of contemporary kernels. 
Crackers could also use it, of course, but they already have their own rootkit tools and gain no real benefit from an officially-supported runtime patching mechanism. Whether this aspect of Ksplice is of concern to the development community may be seen in the coming months as this code gets closer to mainline inclusion.

Driver API: sleeping poll(), exclusive I/O memory, and DMA API debugging

There are currently a number of proposed driver API changes being discussed on the lists. None of them are major, but they are worth being aware of.

poll()

Most of the functions in the file_operations structure are concerned with I/O. So it is not surprising that these functions are allowed to sleep. Except that, as it turns out, one of them - poll() - cannot. There is nothing inherent in the poll() or select() system calls which would require the driver poll() callback to be nonblocking; this requirement is, instead, a result of the implementation. In essence, the core poll() implementation is a loop which puts the process into the TASK_INTERRUPTIBLE state, calls the poll() callback for each file of interest and, if none of them reports an event, calls schedule_timeout_range() to sleep until something changes.

The problem is relatively straightforward: if a specific driver chooses to sleep in its poll() callback, the current task state will get set back to TASK_RUNNING and schedule_timeout_range() will return immediately. So a sleeping driver turns the main loop into a busy-wait.

The solution, as developed by Tejun Heo, is also straightforward. His patch causes sys_poll() to define a custom wakeup function which, in turn, sets a new triggered flag when called. That eliminates the need to put the process into TASK_INTERRUPTIBLE for the duration of the main loop; that can be done, instead, right before actually sleeping.

Most driver writers can remain unaware of this change, which looks highly likely to be merged for 2.6.29. But, for those who need it, there will be one more degree of flexibility in the implementation of poll() callbacks.

Exclusive I/O memory

For a while, developers involved in the hunt for the e1000e corruption bug thought that the X server might be the problem.
The real bug turned out to be elsewhere, but the suspicion cast upon X led to the development of a new API designed to make it harder for user-space programs to interfere with the operation of an in-kernel driver. In particular, it seemed sensible to prevent user space from manipulating I/O memory which has been allocated by device drivers. This can be achieved by not allowing an mmap() call on /dev/mem to map regions already given to drivers. If the STRICT_DEVMEM configuration option is set, the kernel will protect its own memory from mapping by user space; protecting I/O memory is really just a matter of extending that mechanism.

Arjan van de Ven has implemented that feature in his MMIO exclusivity patch. He chose, however, not to make this protection the default. Instead, drivers which want exclusive access to an I/O memory region should call one of a set of new functions; there is also a new, low-level allocation macro. In each case, these functions are equivalent to their non-exclusive cousins, except for the changed name and the resulting exclusive allocation.

There may be cases where a developer wants to be able to map a region from user space on a development system, regardless of what the driver thinks. For such situations, there is a new iomem=relaxed boot parameter. When relaxed is selected, exclusive allocations are not enforced. Clearly this is not an option which one would want to set on a production system, but it may be useful in development environments.

DMA API debugging

The last topic is not actually an API change, but it's worth a look anyway. The kernel provides a nice API for setting up DMA operations. In many cases, the associated functions do little or no work; the system they are running on does not require any additional effort. The result is that a lot of "tested" driver code may, in fact, have serious errors in its use of the DMA API.
When those drivers are run on a different system - one with an I/O memory management unit (IOMMU) in particular - those errors could lead to no end of unpleasant behavior.

Kernel developers like the idea of finding bugs before they bite users on remote systems. To help make that happen with the DMA API, Joerg Roedel has posted a new DMA API debugging facility. This feature, when built into the kernel, should make it possible to find a number of previously-hidden bugs in device drivers. It has, in fact, already turned up a few problems with in-tree drivers, mostly in the networking subsystem.

Use of this facility simply requires enabling a configuration option; the API itself does not change. Once it's enabled, this code will check for a number of problems, including freeing DMA buffers with a different size than was given at allocation time, freeing buffers which were never allocated at all, mixing coherent and non-coherent functions on the same buffer, confusion over I/O directions, and more. Each of these problems might slip by on a developer's test system, but might create havoc where an IOMMU is being used. When a problem is found, a warning and stack traceback are logged.

The response to this facility has been positive. The biggest complaint seems to be about the fact that it is implemented as an x86-specific feature. So it will probably have to be made generic before merging - after all, developers on other platforms are entirely capable of introducing DMA-related bugs too. Once it goes in, this feature should probably be enabled on any system used for driver development.

Character devices in user space

There is a lot of functionality - things like filesystems and device drivers - that is normally considered to be a kernel task, but that has, over time, been allowed to move into user space. The UIO user space driver framework came along in 2.6.23, while filesystems in user space (FUSE) have been around since 2.6.14.
Tejun Heo would like to see this idea broadened even further with the character devices in user space (CUSE) patches. At first blush, the uses for a character device implemented in user space are not obvious. Looking a bit deeper, though, one finds numerous programs—both open and closed source—that rely on legacy character drivers. Those drivers are currently in the kernel, but need not be if there were a way to implement them in user space. In addition, older, deprecated interfaces, such as Open Sound System (OSS) can be better supported without constantly fiddling with the in-kernel emulation. Providing better OSS support is one of the prime motivators for CUSE as Heo announced in a linux-kernel posting introducing the OSS proxy. The proxy uses CUSE to implement the /dev/dsp, /dev/adsp, and /dev/mixer devices that programs using OSS expect. Adrian Bunk didn't necessarily see this as a good thing: Sorry for being destructive, but 6 years after ALSA went into the kernel we are slightly approaching the point where all applications support ALSA. The application you list on your webpage is UML host sound support, and I'm wondering why you don't fix that instead of working on a better OSS emulation? But Heo sees the current state of OSS emulation as a rather complicated mess that, for better or worse, needs cleaning up: We now have in-kernel OSS emulation which can't mux with other streams, aoss [ALSA OSS emulation] with its own supported and broken list and can also be routed through PA [PulseAudio] by configuring ALSA right and then padsp [PA OSS emulation] with its own supported and broken list and nothing works good enough. So, if we have one thing which just works, we can in time put all those to rest. But there are other uses for CUSE too. Greg Kroah-Hartman notes that legacy software for talking to Palm Pilots, much of which is binary-only, expects to talk to a /dev/pilot serial port. 
The kernel carries around a driver, but "a libusb userspace program can handle all of the data to the USB device instead". So CUSE could be used to eventually remove another crufty driver from the kernel, while still maintaining compatibility with old user space code. CUSE is implemented on top of FUSE as there is a fair amount of overlap between them. Character devices and filesystems implement many of the same file operations—things like open(), close(), read(), and write()—which makes them a good match. Heo has a separate patchset for FUSE that implements additional operations for filesystems some of which will be used by CUSE. The additional FUSE operations include an implementation of ioctl() that is necessarily rather ugly. Because an ioctl implementation can access memory in unpredictable ways—and those data structures can be arbitrarily deep—there needs to be a mechanism for user-space CUSE devices to read and write that memory. The CUSE server does not have direct access to the caller's memory, so a multi-step ioctl() with retries must be implemented. This particular bit of ugliness is only allowed for in-kernel use, so that CUSE (or other things like it) can allow "unrestricted" ioctl() implementations. All FUSE filesystems are still required to have "restricted" ioctls where the kernel can determine the direction and amount of data that is transferred. poll() support has also been added to FUSE, which, in turn, requires a separate patch that allows poll() callbacks to sleep (described in this article). Once the FUSE changes are in place, the actual implementation of CUSE is relatively small, weighing in around 1000 lines plus some housekeeping to rename and export FUSE symbols. At its core, it collects up a FUSE-mounted filesystem that connects to the user-space implemented device along with the kernel-exported character device, binding the two together. FUSE handles the interaction with the user-space code, in the same way that it does for a filesystem. 
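The multi-step ioctl() with retries mentioned above can be pictured with a small user-space simulation. What follows is only a sketch of the idea, not the FUSE wire protocol and not kernel code: the names CallerMemory, server_ioctl(), and kernel_ioctl(), the header layout, and the "retry"/"done" replies are all hypothetical and chosen for illustration. The point is that the server never touches the caller's memory; it can only ask the kernel side to fetch more ranges and call it again.

```python
import struct

# Hypothetical argument layout: a header holding the address and length
# of a second buffer, i.e. a data structure that is two levels deep.
HEADER = struct.Struct("<II")


class CallerMemory:
    """Stands in for the address space of the process calling ioctl()."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, addr, length):
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        self.data[addr:addr + len(payload)] = payload


def server_ioctl(arg, fetched):
    """User-space device side: sees only the ranges fetched on its behalf."""
    wanted = [(arg, HEADER.size)]
    if (arg, HEADER.size) not in fetched:
        return ("retry", wanted)          # need the header first
    buf_addr, buf_len = HEADER.unpack(fetched[(arg, HEADER.size)])
    wanted.append((buf_addr, buf_len))
    if (buf_addr, buf_len) not in fetched:
        return ("retry", wanted)          # now need the buffer it points to
    reply = fetched[(buf_addr, buf_len)].upper()
    return ("done", [(buf_addr, reply)])  # ranges to write back to the caller


def kernel_ioctl(memory, arg, max_rounds=8):
    """Kernel side: copy the requested ranges and retry until the server is done."""
    fetched = {}
    for rounds in range(1, max_rounds + 1):
        status, payload = server_ioctl(arg, fetched)
        if status == "done":
            for addr, data in payload:
                memory.write(addr, data)
            return rounds
        fetched = {(addr, length): memory.read(addr, length)
                   for addr, length in payload}
    raise RuntimeError("server kept asking for more memory")
```

In this toy version each extra level of pointer indirection costs one more round trip, which gives some feel for why the article calls the mechanism ugly and why ordinary FUSE filesystems are kept to "restricted" ioctls whose data direction and size the kernel can work out on its own.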
CUSE creates a device for commands, /dev/cuse, which is opened by a program that wants to implement a particular character device. CUSE queries the opener to determine which device it is implementing and then creates the device node. For most operations, CUSE just hands off to FUSE, but for open() it, instead, opens a file from the FUSE mount, storing the file handle for use by later operations.

In many ways, CUSE is a kind of impedance matching layer that creates something that acts like a character device, but has no hardware directly behind it. This allows CUSE to ignore things like hardware interrupts; those would need to be handled by something else, typically a downstream driver - the soundcard driver in the OSS proxy case. This is one of the big differences between UIO and CUSE. UIO is much more like a regular kernel device driver that requires kernel code to handle interrupts. CUSE drivers, on the other hand, can be created without ever touching kernel space.

The only objection so far seems to be Bunk's complaint about supporting OSS when it has been deprecated for so long. As Heo points out, though, there are still many applications that only support OSS. In addition, all of the code that has been submitted is "way smaller than the in-kernel ALSA OSS emulation which is somewhat painful to use these days", Heo says. Since there are other potential users of CUSE, not just the OSS proxy, it would seem that, absent any major objections, CUSE could make it into 2.6.29.

An open letter to Evgeniy Polyakov

[Editor's note: the following article may look like a message to a specific kernel developer, but it is really about the development process in general. Over the years, your editor has seen too many worthy hackers run into development process problems; the end result is often that we lose that person's contributions. We are not so rich that we can afford that sort of loss.
The desire to prevent such problems was the motivation behind your editor's recently-written development process document - and this letter.] Dear Evgeniy, Your editor has chosen to write to you in a public manner because he hates to see talented developers get frustrated with the kernel process and storm off. We do not have an excess of capable hackers, especially those who can work at your level. Losing one hurts. Your editor hopes that this eventuality can be avoided in this case - for you, and for others who may be encountering the same sort of frustrations you are. Getting code into the kernel can be a pain, sometimes. That said, some 1160 developers have managed it since the opening of the 2.6.28 merge window in October. It is possible to get code merged with sufficient care. You first posted your distributed storage (DST) patch back in 2007; LWN took a look at it at that time. Since then, this code has come a long way. Beyond the basic task of exporting (and accessing) storage volumes across the net, this code claims "bullet-proof memory allocations," zero-copy transport, failover recovery with full transaction support, support for IPv6 and beyond, and a number of features including encrypted data channels. And, it is said, this code is fast. In general, it looks like good stuff. You have posted the DST code on the mailing lists a number of times - too many, apparently, for your tastes. Frustration with the process appears to have led to the behavior described in your recent weblog post: To understand the roots of this issue, I made a simple experiment with the previous DST release. I added following lines into the patch to catch reviewer's eyes: As you may expect, this does not compile and thus was never read by the people who are subscribed to the appropriate mail lists. I got one private mail about this fact for the whole week. The same DST code (without above lines) was sent public first time more than month ago and was resent 3 times after that. 
That's why I do not care about DST inclusion anymore. I do not care about its linux-kernel@ feedback. So, because the fourth posting of identical code in one month received little attention, DST now risks joining Kevents, network channels, network tree memory management, asynchronous crypto, and more in that place where dusty, out-of-tree stuff lives. This would not be a good outcome. So let us look at what can be done to avoid that - for your sake, for DST users' sake, and for the sake of other developers who may follow. One way to get more reviews for your code is to pay attention to what those reviewers are saying. Andrew Morton spent some time on DST back in October. He had a number of concrete requests - such as documenting the user-space ABI and the network protocol - which have not been satisfied. He also asked for better code documentation in general: So please. Go through all the code and make it tell a story. Ask yourself "how would I explain all this to a kernel developer who is sitting next to me". It's important, and it's an important skill. The November 25, 2008 version of DST still does not tell that story, and that makes it very hard for other developers to understand. Code review, as you know, is in critically short supply in most free software projects. Getting reviews for difficult-to-understand code is hard, especially when it is a large body of complex code which occupies a niche in which relatively few developers have expertise. So it's not surprising that your most recent comment involved white space - anybody can make that kind of review without any need to actually understand what's going on. Not only does your patch not tell a story, but the individual pieces of it do not even contain changelogs. For a patch set marked "consider for inclusion," that is a fatal error. 
Playing along with the system on things like that can seem like a waste of time, especially if you hold out no real hope of the patch being merged, but it is a necessary sign of respect for the people you are asking to consider the patch. No maintainer will accept a patch without a changelog.

While we're on the topic of documentation, your kernel configuration help text reads, in its entirety: This driver allows to create a distributed storage block device. You owe your users a little bit more than that. Why might they want to use DST? Where can they get the associated tools? This, too, is a fatal error for any substantive kernel change.

And, while we're still somewhat on the subject of reviews: Andrew naturally called out the generic-looking thread pool implementation buried deep within DST; shouldn't it be pulled out and made more generic? Your response can be paraphrased as "I can't be bothered to get the API past the review process, which, in any case, is biased toward those who are 'closer to the high end'." But pulling out this code and merging it separately might be the ideal starting point for getting the larger patch set into the kernel. A generic thread pool hiding within a storage device driver, instead, will be an ongoing impediment to inclusion.

Then there is the issue of motivation: why should the kernel developers want to merge this patch? Who are the users of it - do you have users now? How does it compare to other distributed storage technologies already in the kernel? What's the performance like - can you post some benchmark results? As it stands, DST looks like a nice piece of technology, but its benefits are still unclear. Tell that story, and the level of interest may well go up.

Finally, your editor would like to counsel patience. Some patches just take longer than others to find their way into the kernel. That is especially true of complex patches which touch on issues like memory management and which add new user-space ABIs.
As a close-to-home example, look at David Howells's FS-Cache code, recently reposted for consideration. The first LWN article on this code was published more than four years ago. David is probably getting a little tired of maintaining this code out-of-tree, but he sticks with it, responds to reviews, and appears to be getting closer to inclusion.

Evgeniy, you appear to be a brilliant and productive hacker. You charge into places that scare off most kernel developers, and you always come back out with something interesting. We need developers like you. But we need developers like you who can work with the process - no matter how frustrating it gets. The kernel process is certainly far from perfect, but it is built around a set of principles which have served us well for many years. You could easily rise up through that process to become one of the "high end" developers who, you say, have an easier time getting code merged. Or you could take your marbles and storm home, making snide comments about reviewers on the way. But that would not be good for anybody involved.

(See also: Evgeniy's response to this article.)

ELCE: Free software strategies for business

Shane Coughlan, legal coordinator for the Free Software Foundation Europe (FSFE), spoke about the advantages of free software from a business perspective at the recent Embedded Linux Conference Europe. His talk was not necessarily directed at his audience, most of whom were already free software users, but, instead, at the bosses of his audience, the management of companies using or considering using free software. His approach was to use the language that management understands while making a strong case for the value that free software can bring.

Coughlan noted the obligatory analyst projections, including 4% of European GDP coming from free software by 2010 as well as 80% of commercial software projected to contain free software by 2011.
These are eye-opening numbers, so Coughlan went on to explain why those numbers are that high. Businesses are created to deliver value to their investors; in order to succeed, they will need to "deliver value now and deliver more value later and that's how you are going to run a successful business". A short-term outlook is not going to deliver real success. Paraphrasing Bill Clinton, he said "it's for the long term, stupid".

Proprietary software allows businesses to "do some stuff", but free software allows them to "do more stuff". As Coughlan describes it, the correct approach is for a business to "do more and keep doing it"; using free software makes that easier. "From a business perspective, free software rocks."

The key to free software is not in the cost nor is it in the availability of source code, he said, as those do not embody the freedoms that are important. The freedoms to "use, study, share, and improve", known as the four freedoms, are what give free software its edge. They allow for more flexibility and growth than other kinds of software, he said.

If free software has so many upsides, what's the catch? "Free software is powered by licenses", so businesses need to understand those licenses and, just as importantly, the reasoning behind those licenses. This is no different than any other license, but a common problem is that people don't read the licenses or follow the terms. If they do, there is no problem, though. So, there is a catch, but "the catch isn't too big".

A business must apply some management science to determine its strategy: whether to use an existing solution or work on building a new one. If it decides to build something new, does it foster some kind of community model or not? These are the kinds of questions that need to be answered as part of determining a free software strategy. Communication with people in the community is important, as is choosing licenses that are popular and compatible.
There are ways to reduce any risk associated with free software by using existing best practices. That means pro-actively resolving issues, not just putting free software into a product, then "pray, and be upset when someone tells us we were naughty". One of the resources available to help management is the FSFE's Freedom Task Force (FTF) which is set up to assist everyone in understanding free software licensing. The FTF does training and consulting for businesses to help with licensing or other issues. If one is having trouble getting management on-board, refer them to FTF, "we won't actually lock them up and brainwash them", Coughlan said. While companies are resistant to releasing their code, "if you're doing your marketing right and you're not relying on temporary monopolies, you can probably release quite a lot" of code without any business harm. It has been estimated that the body of free software is "worth" $12 billion, so a company can reimplement it, "at an estimated cost of $12 billion, or you can share your $2-3 million [investment] and use the code". It's a matter of recognizing the immense benefits that come with free software. Coughlan also described a legal network that the FTF is fostering in Europe, where lawyers and legal experts can discuss issues of importance to free software, especially across jurisdictional boundaries. That network can help provide businesses with legal information to help reduce risks. There is, as yet, no US equivalent, though some US lawyers are participating with the European network. "Still, I'm confident that eventually the US will catch up with us", he said. He wrapped up with some thoughts on the GPLv3, noting that "adoption in the first year has been very, very promising". In fact, it has been adopted faster than he expected. He did note that there are some problems with license incompatibilities, but that those are probably unavoidable. 
The ideal situation would be for every license to be able to work with every other, but it doesn't work that way, which is a bit of an inconvenience, but not really a problem at this point.

Coughlan did not really say very much that LWN readers won't have heard before, but he did put it together in a way that should resonate with businesspeople. It was also interesting to get a look at what FSFE, and particularly FTF, are up to. There is a lot of important free software work, completely separate from development, going on in Europe. Because I am US-based (hopefully not too US-biased), that sometimes gets overlooked, so it was very nice to have a chance to hear about that work.

FFADO approaches the 2.0 release

The FFADO (Free Firewire Audio Drivers) project allows the support of FireWire (IEEE 1394) audio devices under Linux:

The FFADO project aims to provide a generic, open-source solution for the support of FireWire based audio devices for the Linux platform. It is the successor of the FreeBoB project. FFADO is a volunteer-based community effort, trying to provide Linux with at least the same level of functionality that is present on the other operating systems. It is a work in progress, we are close, but we are not quite there yet.

The About document explains further: "We try to support any FireWire device available out there. The FFADO codebase is a framework that has been built with this in mind. This however doesn't mean that all FireWire devices work with FFADO. In order to support a device, we need cooperation from manufacturers, or somebody that want[]s to reverse engineer the protocol. Luckily we have support from the manufacturers of the three major platforms vendors build their devices around (BridgeCo, TC Applied Technologies and ECHO). The exact devices supported (or not supported) can be found on our device list."
Release candidate 1 of FFADO 2.0 was announced this week: "This release candidate is intended to collect feedback about the library under wide-spread usage. The code should be free of major bugs. We are looking for packagers that are interested in creating packages for their favorite distribution. Please contact us if you can help us out with this." Users of FreeBoB are encouraged to try this release out. The full change log shows the latest changes to the software; most of the work involves bug fixing. The feature list is also found there. Capabilities include support for an unlimited number of 24-bit audio I/O channels, all device sample rates, an unlimited number of MIDI I/O channels, the S/PDIF audio interface format, the ADAT SMUX I/O format, external synchronization, internal mixers and other device controls, and device aggregation on an externally synced bus. The project documentation has more information. The installation notes from the FAQ pages explain how the various components of the software work together. If your favorite application requires FireWire support, or you need to migrate away from the unsupported FreeBoB library, now would be a good time to give FFADO a try. Distribution advisories Here at LWN, we get a chance to see a fair number of security advisories in the course of a week (sometimes even in just a single day), so we tend to notice the quality, or lack thereof, of these important announcements. There are a few important pieces of information that need to be a part of any security update announcement, but sadly sometimes they aren't included. Overall, the quality of advisories seems to be declining, which is something that we would like to see change. While it clearly would make collecting security advisories easier for us, that is not the primary motivation for this look at security reporting: users are not being well-served by the current state of affairs.
Distributions need to remember that the audience for their security announcements is their users. Those users require some basic information to make an informed choice about whether they need to apply the update as well as how urgently. In order to make those decisions, the following should be present in advisories: the package affected; the problem that is being fixed; the impact of the vulnerability; some kind of unique identifier for the alert; links to relevant additional information (CVE, bugzilla, ...); and where and how to update the package. Consistent formatting of advisories is a definite plus. Users are not as familiar with either the package or the distribution as the person writing the alert is, so it should be written with that in mind. The most important thing is to concisely communicate the severity and urgency of the problem in a way that the reader can understand, and figure out what to do about it. The biggest problem seen with alerts of late is a lack of information about the problem they are fixing. As an example, consider the recent Fedora advisory on kvm. It refers to a recent CVE number (CVE-2008-4539) which is "reserved", but no details are present, and says that it fixes a "cirrus vulnerability". It also references a bugzilla entry that apparently addresses a separate CVE from 2007 (CVE-2007-1320); if you follow that link in the bugzilla, you finally end up somewhere with actual information, though the connection between the two problems is not particularly obvious. Another example of this is CentOS advisories, which suffer from a number of problems, but the most vexing for folks trying to determine whether they need to update is this lack of bug information. It is not all that hard to get the information, as a typical alert has a link to the appropriate Red Hat advisory, but why make users take that step? A concise summary of the bug(s), as well as a reference to the (generally very complete) Red Hat errata, would be quite useful.
There is certainly nothing wrong with linking to sources of additional information, but the basics of the problem and its impact should be available in the alert. Unique identifiers for advisories are useful for a number of reasons: keeping track of which have been addressed, having a unique search string to use, or referring to them in conversations, bug reports, etc. When the identifier is not unique, it muddies the waters a bit, making it more difficult than it needs to be. Sometimes mistakes are made (like the spate of recent Fedora alerts with the same FEDORA-2008-10000 identifier), but there appear to be distribution policies about using identifiers multiple times. CentOS uses the same identifier on multiple advisories, one per architecture, but also shared between CentOS releases. So the same identifier will be applied to an s390 update for CentOS 4 as is applied to x86_64 for CentOS 5. Another identifier reuse problem comes from Fedora. When mozilla (or more recently xulrunner) library vulnerabilities occur, Fedora pro-actively rebuilds and updates all of the packages that depend on those libraries. This is very much to its credit as the API is not (yet) stable, but all of the resulting alerts refer to the same identifier. For those who try to track vulnerabilities along with alerts, that results in messy listings that don't provide much in the way of helpful information. Other library bugs result in much saner listings where one could relatively easily track down, and keep straight, the advisories for various packages. There are other problems as well. Alerts that combine unrelated fixes do "avoid flooding mailing lists", but they are a bit painful to tease apart for users that are tracking specific packages. Too much history, in the form of changelogs (example), can also be confusing.
If there is only a link to provide vulnerability information, as is the CentOS way, it should probably go directly to a page about the flaw, not to some page that lists all recent upstream flaws (example). And on and on. Certain distributions have been singled out here, but that is not really the point. These are just recent examples of problems that are regularly seen in distribution security alerts. It should be noted that the commercial distributions (SUSE, Ubuntu, Red Hat, Mandriva) seem to do a much better job overall, which is not surprising, but sometimes they fail as well. The key thing to remember is that security announcements are meant to be read by users and acted upon. If information is lacking, the communication will fail. This is not the first time we have looked at the problem; way back in 2000, security page editor Liz Coolbaugh took a look at security advisories and had some of the same complaints seen here. Her conclusion is still valid: it is not that distributions are not trying or that they don't care, but at times the contents of their advisories slip below the radar. After her article, things got better with security alerts; hopefully, this gentle prodding will have a similar effect. A look at free software in Ecuador I recently spoke at the Congress on Free Software and Democratization of Knowledge hosted in Quito by the Universidad Politecnica Salesiana of Ecuador. My general report about the conference and Free as in Freedom knowledge in that country is at the P2P Foundation blog; the trip, however, was also an excellent occasion to check out the most interesting Free Software projects currently taking place in Ecuador. It turns out that there is a lot of activity at the Government level to promote Free Software, and interesting news from some cool projects developed locally. FOSS in the Government A recent presidential decree mandates that most national Public Administrations migrate entirely to Free Software. Ing.
Mario Albuja, head of the Subsecretariat for Information Technology of the Presidency of Ecuador, explained during the congress the reasons and the general guidelines of this initiative. Later on, I was able to get more details in a couple of meetings with the members of his staff. Among the most important things going on right now are the studies and tests for a Government digital signature application which runs on GNU/Linux and a unified document management system for 45 central Public Administrations. There is also a field trial of the GPL hospital management software Care2X in the works. The initial implementation of the digital signature project, which uses Free Software whenever possible, is based on keys and digital certificates stored on SafeNet iKey 2032 USB tokens from Entrust. The first official field test will take place in the coming weeks, when President Correa himself will use one such key to sign a decree. The Certificate Authority infrastructure which will issue keys and certificates is the same one implemented by Banco Central del Ecuador in November 2007. The software application, instead, runs inside any browser. A PostgreSQL backend stores all the documents, together with administrative metadata, on a CentOS-based server. The decrees waiting for electronic signature are presented to the user via a simple Apache/PHP front-end. The actual digital signature happens through a Java applet which reads the encrypted key from the USB token thanks to libraries provided by Entrust. Another big step in the process of freeing Ecuadorian institutions from proprietary software will be the formal ratification of OpenDocument 1.0 by the Ecuadorian Institute of Standards (INEN). Large-scale usage of this format for public documents should take off right after that, around mid-2009. All the public officials I talked with really believe in the potential of Free Software for a developing country like Ecuador.
This only makes more relevant, and worthy of careful consideration, a comment I got from them: there is, they say, no coordination or common vision among the developers of the several FOSS applications they need to deploy. This was no surprise, of course: people at the Subsecretariat understand how FOSS development works. Nevertheless, the fact that there is no unified, local, reliable source for support, with predictable, if not guaranteed, response times, is creating more problems for them than they expected when they began. There may be quite a business opportunity here for local FOSS entrepreneurs. Talking with hackers Rafael Bonifaz told me what's new in the Elastix world. In case you have never heard of it, Elastix is a specialized GNU/Linux distribution born and (mostly) developed in Ecuador. Its goal is to solve all the communication problems of organizations of any size. Elastix integrates in one easy to administer package all you need to have PBX, VoIP, email, instant messaging, fax and fax/email gateway through Asterisk, Hylafax, Postfix and Openfire for Jabber. You can manage all the PBX functions with a customized version of freepbx. Other tools developed by the Elastix team provide hardware detection, centralized automatic configuration of phones and billing support with a2billing. Elastix is doing great in Ecuador: RTS and Aerolineas Galapagos (Aerogal), which are respectively one of the most important TV channels and one of the main domestic airlines in Ecuador, are using it. Specifically, Aerogal is running its call center off Elastix, which is also being deployed in the Ministry of Public Health. Rafael, who is the current coordinator of the Elastix Community, is also proud of the fact that Elastix is the only GNU/Linux distribution for communications which has two manuals, totaling about five hundred pages, freely downloadable from the Internet: Elastix Without Tears [PDF] by Ben Sharif and Unified communications with Elastix [PDF] by Edgar Landivar.
The second manual is still a beta version, currently available only in Spanish. There already is, however, a new mailing list devoted to coordinating all the translation efforts for this second book. Also thanks to Rafael, after learning about Elastix, I met a local group of Java developers who have very recently begun developing a new, interesting content management system called Melenti. Adrian Cadena, member of the Melenti team, explained to me that he and his partners needed a GPL, friendly, easy to use and fast CMS that could scale well from personal web pages to corporate portals. Another must on their requirement list was ease of integration with enterprise software (Java or not) for ERP, CRM and SAP services. That's why, three months ago, after some unsatisfactory experiences with the popular Joomla CMS, they started writing Melenti. One of the main features of Melenti should be performance under high loads. Adrian said they are aiming for something able to handle hundreds of thousands of clicks per second, something which Joomla "simply could not handle, when we tried it". Melenti administrators, instead, would be able to configure load balancing without problems, thanks to an interface based on Jndi and other tools. Melenti should run on any JEE infrastructure, from Websphere to JBoss, BEA, Oracle AS, Tomcat, Jetty and more. According to Adrian, Melenti will also be much simpler to set up and extend than most other GPL software for Content Management. Installation should be as simple as dropping a .war file into your flavor of JEE container and following the steps of the graphical wizard which will pop up. Writing Melenti "gadgets", that is plugins, should also be easier than with Joomla, Drupal, Php-nuke and similar products. This is because, says Adrian, "unlike those products, Java has worldwide standards like Spring, JPA, JSF, GWT and so on: new developers can just take a look at the core Melenti API and start writing their own gadgets in no time."
The first releases of Melenti will support basic CMS functions like management of web pages, images and other files. There will also be interfaces for banner rotation, creation of user polls and a Web Services Creator. The latter is a simple wizard to create Web Services from existing Melenti gadgets. The first alpha version of Melenti has just been uploaded to Sourceforge. You're obviously welcome to have a look at the code and to participate in the development of Melenti. Let's go back now to the reason why I went to Quito, that is, Free Software and Democratization of Knowledge. Quiliro Ordonez, with one friend and other occasional volunteers, is now implementing in the field a project first announced in 2007: placing Free Software in a school of the community of Quilapungo, south of Quito, which serves about 200 students. Thus far, Quiliro has installed 2 servers and 4 thin clients running gNewSense. He chose this distribution because it is "100% free software, without non-free repositories or blobs in the kernel which promote functionality before anything else, as this would weaken our position for freedom." He's also very happy with TCOS, which made setting up the thin clients a breeze. The school staff will use Projecto Alba, a modular administration and planning software for schools first developed in Argentina. While gNewSense worked fine out of the box, Quiliro and his partners had to localize Alba to adapt it to the terminology and procedures adopted in Ecuadorian schools. Eventually, the school in Quilapungo will have about 40 GNU/Linux workstations, but Quiliro doesn't plan to stop there. If all goes well, Quilapungo will be presented as a pilot project in a proposal for Free Software deployment in all public schools in Ecuador. Let's wish Quiliro good luck! Tux3: the other next-generation filesystem There is a great deal of activity around Linux filesystems currently.
Of the many ongoing efforts, two receive the most attention: ext4, the extension of ext3 expected to keep that filesystem design going for a few more years, and btrfs, which is seen by many as the long-term filesystem of the future. But there is another project out there which is moving quickly and is worth a look: Daniel Phillips's Tux3 filesystem. Daniel is not a newcomer to filesystem development. His Tux2 filesystem was announced in 2000; it attracted a fair amount of interest until it turned out that Network Appliance, Inc. held patents on a number of techniques used in Tux2. There was some talk of filing for defensive patents, and Jeff Merkey popped up for long enough to claim to have hired a patent attorney to help with the situation. What really happened is that Tux2 simply faded from view. Tux3 is built on some of the same ideas as Tux2, but many of those ideas have evolved over the eight intervening years. The new filesystem, one hopes, has changed enough to avoid the attention of NetApp, which has shown a willingness to use software patents to defend its filesystem turf. Like any self-respecting contemporary filesystem, Tux3 is based on B-trees. The inode table is such a tree; each file stored within is also a B-tree of blocks. Blocks are mapped using extents, of course - another obligatory feature for new filesystems. Most of the expected features are present. In many ways, Tux3 looks like yet another POSIX-style filesystem, but there are some interesting differences. Tux3 implements transactions through a forward-logging mechanism. A set of changes to the filesystem will be batched together into a "phase," which is then written to the journal. Once the phase is committed to the journal, the transaction is considered to be safely completed. At some future time, the filesystem code will "roll up" the journal changes and write them back to the static version of the filesystem. The logging implementation is interesting. 
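The phase-and-roll-up scheme described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `ForwardLog` class and its method names are hypothetical, blocks are modeled as dictionary entries, and nothing here reflects Tux3's actual on-disk format or code.

```python
class ForwardLog:
    """Toy model of forward logging (hypothetical, not Tux3 code)."""

    def __init__(self):
        self.static = {}    # rolled-up state: block number -> data
        self.journal = []   # committed phases, oldest first
        self.pending = {}   # changes batched into the current phase

    def write(self, block, data):
        # Changes accumulate in the current phase until it is committed.
        self.pending[block] = data

    def commit_phase(self):
        # Once the phase is in the journal, its changes count as complete.
        if self.pending:
            self.journal.append(self.pending)
            self.pending = {}

    def roll_up(self):
        # At some later time, fold the journal into the static version.
        for phase in self.journal:
            self.static.update(phase)
        self.journal = []

    def read(self, block):
        # Only committed state is visible: newest phase first, then static.
        for phase in reversed(self.journal):
            if block in phase:
                return phase[block]
        return self.static.get(block)
```

In this model a reader sees the same committed data before and after `roll_up()`, which is why the journal can be treated as part of the filesystem state rather than as a temporary staging area.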
Tux3 uses a variant of the copy-on-write mechanism employed by Btrfs; it will not allow any filesystem block to be overwritten in place. So writing to a block within a file will cause a new block to be allocated, with the new data written there. That, in turn, will require that the filesystem data structure which maps file-logical blocks to physical blocks (the extent) will need to be changed to reflect the new block location. Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change. The end result is that the results of a specific change can remain in the log for some time. In Tux3, the log can be thought of as an integral part of the filesystem's metadata. This is true to the point that Tux3 doesn't even bother to roll up the log when the filesystem is unmounted; it just initializes its state from the log when the next mount happens. Among other things, Daniel says, this approach ensures that the journal recovery code will be well-tested and robust - it will be exercised at every filesystem mount. In most filesystems, on-disk inodes are fixed-size objects. In Tux3, instead, their size will be variable. Inodes are essentially containers for attributes; in Tux3, normal filesystem data and extended attributes are treated in almost the same way. So an inode with more attributes will be larger. Extended attributes are compressed through the use of an "atom table" which remaps attribute names onto small integers. 
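The atom table idea can be sketched in a few lines. The `AtomTable` class, the `pack_xattrs` helper and the attribute names used below are all hypothetical illustrations of interning names as small integers; they are not taken from the Tux3 source.

```python
class AtomTable:
    """Maps attribute names to small integers (hypothetical sketch)."""

    def __init__(self):
        self._atom_by_name = {}
        self._name_by_atom = []

    def intern(self, name):
        # Reuse the existing atom if this name has been seen before.
        atom = self._atom_by_name.get(name)
        if atom is None:
            atom = len(self._name_by_atom)
            self._atom_by_name[name] = atom
            self._name_by_atom.append(name)
        return atom

    def name(self, atom):
        # Reverse lookup, needed when listing a file's attributes.
        return self._name_by_atom[atom]


def pack_xattrs(table, xattrs):
    """Replace each attribute name with its atom before storing it."""
    return {table.intern(name): value for name, value in xattrs.items()}
```

Each distinct name is stored once in the table, and every inode that uses it carries only the integer.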
Filesystems with extended attributes tend to have large numbers of files using attributes with a small number of names, so the space savings across an entire filesystem could be significant. Also counted among a file's attributes are the blocks where the data is stored. The Tux3 design envisions a number of different ways in which file blocks can be tracked. A B-tree of extents is a common solution to this problem, but its benefits are generally seen with larger files. For smaller files - still the majority of files on a typical Linux system - data can be stored either directly in the inode or at the other end of a simple block pointer. Those representations are more compact for small files, and they provide quicker data access as well. For the moment, though, only extents are implemented. Another interesting - but unimplemented - idea for Tux3 is the concept of versioned pointers. The btrfs filesystem implements snapshots by retaining a copy of the entire filesystem tree; one of these copies exists for every snapshot. The copy-on-write mechanism in btrfs ensures that those snapshots share data which has not been changed, so it is not as bad as it sounds. Tux3 plans to take a different approach to the problem; it will keep a single copy of the filesystem tree, but keep track of different versions of blocks (or extents, really) within that tree. So the versioning information is stored in the leaves of the tree, rather than at the top. But the versioned extents idea has been deferred for now, in favor of getting a working filesystem together. Also removed from the initial feature list is support for subvolumes. This feature initially seemed like an easy thing to do, but interaction with fsync() proved hard. So Daniel finally concluded that volume management was best left to volume managers and dropped the subvolume feature from Tux3. One feature which has never been on the list is checksumming of data. 
Daniel once commented: Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, I consider the spectre of disks passing bad data as good to be largely vendor FUD. That said, checksumming will likely appear in the feature list at some point, I just consider it a decoration, not an essential feature. Tux3 development is far from the point where the developers can worry about "decorations"; it remains, at this point, an embryonic project being pushed by a developer with a bit of a reputation for bright ideas which never quite reach completion. The code, thus far, has been developed in user space using FUSE. There is, however, an in-kernel version which is now ready for further development. According to Daniel: The functionality we have today is roughly like a buggy Ext2 with missing features. While it is very definitely not something you want to store your files on, this undeniably is Tux3 and demonstrates a lot of new design elements that I have described in some detail over the last few months. The variable length inodes, the attribute packing, the btree design, the compact extent encoding and deduplication of extended attribute names are all working out really well. The potential user community for a stripped-down ext2 with bugs is likely to be relatively small. But the Tux3 design just might have enough to offer to make it a contender eventually. First, though, there are a few little problems to solve. At the top of the list, arguably, is the complete lack of locking - locking being the rocks upon which other filesystem projects have run badly aground. The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review. Work on an fsck utility does not appear to have begun. 
There has been no real benchmarking work done; it will be interesting to see how Daniel can manage the "never overwrite a block" policy in a way which does not fragment files (and thus hurt performance) over time. And so on. That said, a lot of these problems could end up being resolved rather quickly. Daniel has put the code out there and appears to have attracted an energetic (if small) community of contributors. Tux3 represents the core of a new filesystem with some interesting ideas. Code comments may be scarce, but Daniel - never known as a tight-lipped developer - has posted a wealth of information which can be found in the Tux3 mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a reserved unilateral right to relicense the code to anything else - but developers who are comfortable with that are likely to find an interesting and fast-moving project to play in. KSM runs into patent trouble On the kernel page a few weeks ago, we took a look at KSM, a technique to reduce memory usage by sharing identical pages. Currently proposed for inclusion in the mainline kernel, KSM implements a potentially useful—but not particularly new—mechanism. Unfortunately, before it can be examined on its technical merits, it may run afoul of what is essentially a political problem: software patents. The basic idea behind KSM is to find memory pages that have the same contents, then arrange for one copy to be shared amongst the various users. The kernel does some of this already for things like shared libraries, but there are numerous ways for identical pages to get created that the kernel does not know about directly, thus cannot coalesce. Examples include initialized memory (at startup or in caches) from multiple copies of the same program and virtualized guests that are running the same operating system and application programs. Unfortunately, as Dmitri Monakhov points out, the KSM technique appears to be patented by VMware. 
A patent for "Content-based, transparent sharing of memory units" was filed in July 2001 and granted in September 2004. The abstract seems to clearly cover the ideas behind KSM: [...] The context, as opposed to merely the addresses or page numbers, of virtual memory pages that [are] accessible to one or more contexts are examined. If two or more context pages are identical, then their memory mappings are changed to point to a single, shared copy of the page in the hardware memory, thereby freeing the memory space taken up by the redundant copies. The shared copy is ten preferable [sic] marked copy-on-write. Sharing is preferably dynamic, whereby the presence of redundant copies of pages is preferably determined by hashing page contents and performing full content comparisons only when two or more pages hash to the same key. It should be noted that the abstract has no legal bearing; that comes from the (always tortuously worded) claims, which can be seen at the link above. In this case, as far as can be determined, the claims and abstract are in close agreement. The dates above are rather important because there is some "prior art" to consider, namely the mergemem patch first announced in March of 1998. It is substantially the same as the patented idea: it looks for identical "context pages", then changes the memory mappings to point to a single copy-on-write page. This would seem to be a clear example of the idea being implemented well before the patent was filed, so it should invalidate the patent. As with everything surrounding software patents, though, it isn't as easy as that. In order to invalidate a patent, either a court must rule that way or the patent office must be convinced to re-examine it, then find that the prior art makes it invalid. Both of these methods take time and usually money and lawyers as well. Free software projects may have time, but the other two are typically out of reach.
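For readers who want a concrete picture of the technique at issue, the hash-then-compare approach described in the quoted abstract can be sketched as follows. This is a user-space toy with hypothetical names (`merge_identical_pages`, integer page ids, pages as byte strings); it is not KSM's or mergemem's code, and it does not model the copy-on-write marking that a real implementation would need.

```python
import hashlib


def merge_identical_pages(pages):
    """Map each page id to the id of a canonical page with equal contents.

    `pages` is a dict of page id -> bytes (hypothetical representation).
    """
    by_hash = {}   # digest -> ids of canonical pages with that digest
    mapping = {}
    for page_id, content in pages.items():
        digest = hashlib.sha1(content).digest()
        candidates = by_hash.setdefault(digest, [])
        for canonical in candidates:
            # Full comparison only when hashes match, guarding against collisions.
            if pages[canonical] == content:
                mapping[page_id] = canonical
                break
        else:
            # No identical page seen so far: this one becomes canonical.
            candidates.append(page_id)
            mapping[page_id] = page_id
    return mapping
```

Every page whose mapping points at another id is a redundant copy whose memory could be freed once the mappings are switched to the shared page.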
Alan Cox suggests that "perhaps the Linux Foundation and some of the patent busters could take a look at mergemem and re-examination". While that might eventually resolve the problem, it is a multi-year process at best. The folks behind the KSM project are some of the kvm hackers from Qumranet, which is now part of Red Hat. It is certainly conceivable that VMware might consider kvm a competitor and try to use this patent as a "competitive" weapon. That concern is probably enough to keep KSM out of the mainline until the issue is resolved. There is a much quicker resolution available should VMware wish to pursue it. As IBM has done with the RCU patent, VMware could license its patent for use in GPL-licensed code. There is much to be gained by doing that, at least in terms of positive community relations, and there is little to be lost, unless VMware truly believes that the patent will stand up to scrutiny. Both VMware and its parent, EMC, are members of the Linux Foundation, so one could see a role for the foundation in helping to put that kind of agreement together. The original mergemem idea did not make it into the kernel, but the code is still available for those running Linux 2.2.9. It appears that it was not pushed very hard in the face of some security concerns, which will need to be addressed by KSM as well. Processes could create a page of memory with known contents then, after waiting for the checker process (or kernel thread) to run, see if memory usage has increased. Based on that information, one can determine if other processes have a page with identical values. It would seem rather difficult to exploit, but clearly does allow some information to leak. It will come as no surprise to most LWN readers that software patents are an increasingly dense minefield that can derail free software projects. Unfortunately, it is the kind of problem that has no solution in the technical domain where such projects excel.
The political arena is where any solution will have to come from, though there seems to be some hope that judicial opinions (like the Bilski decision) may limit the scope of the damage. It is a problem that we are likely to see more frequently until there is some kind of resolution. MySQL 5.1 and development models The MySQL development team decided to celebrate the (US) Thanksgiving holiday with the release of MySQL 5.1.30, the first "general availability" (read "production-ready") release in the 5.1 series. There is a lot of good stuff in 5.1.30, including table partitioning, row-based replication, a new plugin API, a built-in job scheduler, and more; see the nutshell summary for more information. It's a celebration point for a long development series; the MySQL developers are to be congratulated for what they have accomplished with this release. Behind the celebration, though, one can hear the grumbling from unhappy developers and users. This release has been a long time in coming; the first 5.0 GA release was in October, 2005 - just over three years ago. The first 5.1 release candidate (5.1.22) came out in September, 2007; seven more "release candidates," many with major changes, were announced over the following 14 months. So the 5.1 production release came rather later than desired, but some developers feel that it was still too soon; the complaints reached a climax in this lengthy posting from Michael "Monty" Widenius, the original creator of MySQL. His point of view, in short, is that this release has fatal bugs, and that these bugs come from a number of flaws in how MySQL development is managed. Your editor cannot claim to be an expert on the MySQL development community. But Monty, presumably, is an expert on this community, so his observations have a higher than usual likelihood of reflecting something close to reality. Reading various dissenting posts (example) has done little to make your editor feel otherwise.
And, in any case, much of what Monty says rings true when compared against experiences from elsewhere in the free software community. As projects grow, they must occasionally revisit their development models. There is little happening here which is truly unique to MySQL. Monty asserts: MySQL 5.1 was declared beta and RC way too early. The reason MySQL 5.1 was declared RC was not because we thought it was close to being GA, but because the MySQL manager in charge *wanted to get more people testing MySQL 5.1*. This didn't however help much, which is proved by the fact that it has taken us 14 months and 7 RC's before we could do the current "GA". This caused problems for developers as MySQL developers have not been able to do any larger changes in the source code since February 2006! Two things jump out of that statement. One is that MySQL apparently suffers from an inadequate testing community. Needless to say, that is not a problem which is unique to this project; testing is a scarce resource throughout our community. MySQL users who are unhappy with the results of the development process might want to ask themselves if they are doing enough to help with the testing process. Like it or not, testing software and finding bugs is one of the costs of "free" (beer) software. If this testing doesn't happen during the development cycle, it will end up happening with the "stable" releases instead. The other attention-getter above is the statement that MySQL developers have been unable to make major changes since early 2006. One need only think back to the 2.4 kernel days to see the kind of damage that can result from pent up "patch pressure." Developers get frustrated, major changes start to find their way into "release candidate" code, and the number of bugs tends to increase. The existence of a separate MySQL 6 development branch helps, perhaps, in reducing patch pressure, but it can also only serve to distract developers from stabilizing current release candidates. 
Related to this is another assertion: Too many new developers without a thorough knowledge of the server have been put on the product trying to fix bugs. This in combined with a failing review process have introduced of a lot new bugs while trying to fix old bugs. Review would appear to be a big part of the problem in general. It may well be that a failure of review has caused the introduction of new bugs with fixes. But one could argue that the problem is deeper than that: any code which failed to stabilize over fourteen months of release candidates should, almost certainly, never have been merged into the MySQL trunk to begin with. It seems that there are not enough eyeballs being applied to major new features before they go in. Your editor has resisted the temptation to make comparisons with other relational database manager projects, but there is value in comparing this state of affairs with the review problems faced by PostgreSQL in recent years. An inability to get additions to PostgreSQL properly reviewed resulted in those additions not being merged. That, in turn, leads to delayed releases with fewer than the desired number of features, neither of which is particularly pleasing for users or developers. But, on the other hand, PostgreSQL does not appear to have the same kind of trouble stabilizing its major releases. Perhaps the key point to take away from all of this, though, is here: In addition, the MySQL current development model doesn't in practice allow the MySQL community to participate in the development of the MySQL server. MySQL is very much a corporate-owned, corporate-driven project, and it has been for a long time. Decisions on what to include are made internally; there is little discussion of development decisions on the project's mailing lists. It is hard to find information on how to contribute to the project; some of the available information still tells prospective contributors to use BitKeeper. 
All code is copyrighted by MySQL (now Sun), which reserves (and uses) a right to distribute that code under proprietary licenses. All of the above reflects an arrangement which has worked well for years, and which has produced an immensely valuable database manager used by vast numbers of people. But it is not a community project, so development decisions will not necessarily reflect the best interests of the wider user or developer communities. If, as Monty suggests, those decisions are made in ways which favor features and deadlines over quality, there will be little that the community can do about it. Mercurial 1.1 - a major feature release The Mercurial project is described as: "a fast, lightweight Source Control Management system designed for efficient handling of very large distributed projects." The Major Features document presents an overview of Mercurial's capabilities and Understanding Mercurial explains how Mercurial works as a distributed source control system. Mercurial version 1.1 was announced this week: "This is a major release with numerous new features." The What's New document explains the many changes that were added to Mercurial 1.1. Highlights include a new resolve command for tracking in-progress merges, a new repository format, performance improvements, support for Python 2.6, bug fixes and work on the documentation. The web interface now has a canvas-based repository graph, new themes, improved WSGI compliance, support for the display of nested repositories and other improvements. The Mercurial commands have gone through numerous improvements and extensions, some bugs have also been fixed. Some new extensions have been added to Mercurial 1.1, including a rebase extension for rebasing changesets, a bookmarks extension for providing git-like branches, a zeroconf extension for publishing repositories and an hgcia extension for communicating with CIA. Some of the existing extensions have undergone a variety of improvements. 
Version 1.2 of the mercurial plugin for the Eclipse IDE was also announced this week. According to Wikipedia, Mercurial was started in 2005 and the software is being used by such high profile projects as Mozilla, OpenSolaris and Xen. This latest release shows that the code continues to undergo active development, and holds an important place in the world of source code control systems. Debugfs and the making of a stable ABI Remi Colinet recently proposed the addition of a new virtual file, /proc/mempool, which would display the usage of memory pools within the kernel. Nobody really disagreed with the idea of making this information available, but there were some grumbles about putting it into /proc. Once upon a time, just about anything could go into that directory, but, in recent years, there has been a real attempt to confine /proc to its original intent: providing information about processes. /proc/mempool is not about processes, so it was considered procfile-non-grata. It was suggested that another home should be found for this file. Where that other home should be is not obvious, though. Somewhere like /sys/kernel might seem to make sense, but sysfs has rules of its own. In particular, the one-value-per-file rule makes it hard to create an easy file where developers can simply query the state of a kernel subsystem, so sysfs is not a suitable home for this file either. The next option is debugfs, which was created in December, 2004. Debugfs is meant to be an aid for kernel developers; it explicitly disclaims any rules on the types of files that can be put there. All rules except for one: debugfs is not a mandatory part of any kernel installation, and nothing found therein should be considered to be a part of the stable user-space ABI. It is, instead, a dumping ground where kernel developers can quickly export information which is useful to them. 
Since debugfs is not a part of the user-space ABI, it seems like a poor place to put things that users might depend on. When this was pointed out, it became clear that the non-ABI status of debugfs is not as well established as one might think. Quoting Matt Mackall: The problem with debugfs is that it claims to not be an ABI but it is lying. Distributions ship tools that depend on portions of debugfs. And they also ship debugfs in their kernel. So it is effectively the same as /proc, except with the 1.0-era everything-goes attitude rather than the 2.6-era we-should-really-think-about-this one. Pushing stuff from procfs to debugfs is thus just setting us up for pain down the road. Don't do it. In five years, we'll discover we can't turn debugfs off or even clean it up because too much relies on it. As an example, Matt pointed out the extensively-documented usbmon interface which provides a great deal of information about what's happening on a USB bus. If it is not an ABI, he says, nobody should be upset if he submits a patch which breaks it. That is a perennial problem with interfaces between the kernel and user space; changing them causes pain for users. That is why incompatible changes to user-space interfaces are almost never allowed; an important goal for the kernel development process is to avoid breaking user-space programs. One might think that this problem could be avoided for a specific interface by explicitly documenting it as an unstable interface. The files in Documentation/ABI/testing are meant to serve that role; anything found there should be considered to be unstable. But, as soon as people start using programs which depend on a specific interface, it has, for all practical purposes, hardened into part of the kernel ABI. Linus put it this way: The fact that something is documented (whether correctly or not) has absolutely _zero_ impact on anything at all. What makes something an ABI is that it's useful and available. 
The only way something isn't an ABI is by _explicitly_ making sure that it's not available even by mistake in a stable form for binary use. Example: kernel internal data structures and function calls. We make sure that you simply _cannot_ make a binary that works across kernel versions. That is the only way for an ABI to not form. So a given kernel interface can be kept away from ABI status if it is so hard to get to, and so unstable, that nothing ever comes to depend on it. The kernel module interface certainly fits this bill. Modules must generally be built for the exact kernel they are intended to work with, and they must often be built with the same configuration options and the same compiler. Anybody who has gotten into the dark business of distributing binary-only modules has learned what a challenge it can be. Debugfs is different, though. It is enabled in a number of distributor kernels, even if, perhaps, it is not mounted by default. Once a set of files gets placed there, their format tends to change rarely. So it is possible for people to write programs which depend on debugfs files. And the end result of that is that debugfs files can become part of the stable kernel ABI. That is generally not a result that was intended by anybody involved, but it happens anyway. The only way to avoid it would be to deliberately shake up debugfs every kernel cycle - and few developers have much desire to do that. This is a discussion without a whole lot in the way of useful conclusions; it leaves /proc/mempool without a home. ABI design, it turns out, is still hard. In the longer term, dealing with an ABI which was never really designed, but which just sort of settled into being, is even harder. There does not appear to be any substitute for thinking seriously about every interface between kernel and user space, even if it's just for a developer's debugging tool. 
Packaging qmail for Debian An effort to get the qmail mail transfer agent (MTA) into Debian repositories has run aground due to various concerns, but the overriding one seems to be a distaste for qmail itself. Distributions make package availability decisions based on "taste" all the time, but they are generally made strictly on technical grounds, which does not seem to be the case here. While it has its share of detractors, qmail is a relatively popular MTA—with an excellent security track record—and one of the main impediments, its license, has changed in the last year. Because of that, it makes it a bit hard to understand why qmail would be kept out of Debian. More than six months ago, Gerrit Pape had uploaded qmail and related packages to the ftp-master system, but they have yet to be added to the official Debian archive. He recently outlined his efforts in a post to debian-devel trying to see if he could break a kind of standoff between him and the ftpmasters, who are the folks that decide which packages get moved into the official archives. More than two months after his first upload of the packages, Pape got a reply from Joerg Jaspert outlining multiple technical reasons why the packages were being opposed, but also containing the following disheartening verdict: Aside from these technical - and possibly fixable - problems, we (as in the ftpteam) have discussed the issue, and we are all of the opinion that qmail should die, and not receive support from Debian. As such we *STRONGLY* ask you to reconsider uploading those packages. After that, Pape addressed some, but not all, of the technical complaints and uploaded updated packages along with a reply to Jaspert's rejection on September 1. Since that time, there has been no action on the packages nor any further communication from the ftpteam, which is what led to the debian-devel post. Responses there mostly backed the ftpmaster's "decision"; qmail, it seems, is not very popular with many Debian developers. 
Unfortunately, some of the complaints are based on old or faulty information. There is a reasonably active upstream and, since Daniel J. Bernstein (aka djb) released the code into the public domain, there is no longer the need to patch qmail to get a sensible MTA. There are some legitimate concerns, in particular the backscatter that gets created by the default qmail configuration, but it is rather disingenuous to list security as one of those problems. While not as bulletproof as djb would have it, qmail does have a long record of few security problems. In response to claims that the Debian security team would have more work because of qmail's inclusion, Moritz Muehlenhoff makes it clear that the team won't block qmail. Florian Weimer puts it this way: Like Moritz, I don't see issues with security support, provided that the number of additional patches is rather small. (To my knowledge, badly patched qmail with a SMTP AUTH bypass vulnerability was one of the few MTAs which were actually exploited to send spam in recent times.) I'm also not sure if upstream can be considered dead, and arguments along that line are not very convincing because similar criticism could be brought against our default MTA. I can understand that people have strong feelings. I'm willing to provide security support, but it's extremely unlikely that I'll run qmail on production MTAs ever again. 8-/ In the end, it comes down to emotions, largely. People generally feel strongly about qmail, either hating it or loving it, with few who know much about it anywhere in between. Clearly the ftpteam has the responsibility to reject packages on technical grounds, but are they the arbiters of taste for Debian as well? An earlier thread about including qmail, from shortly after djb freed the code, showed a fair amount of interest in qmail, along with some opposition. 
It is unlikely that all Debian developers are happy with all of the packages currently supported by the distribution, so singling qmail out seems rather arbitrary. As Wouter Verhelst notes: As long as qmail is free, packaged properly, and integrates well with the rest of Debian, I don't see why anyone should oppose its packaging. Whether or not it's a good MTA, the fact is that it's a *popular* MTA. That alone should be a good reason to package it. Installing qmail has always been painful; it is a package that cries out for distribution integration, which Pape is trying to provide. Whether it gets into the official repositories or not, unofficial qmail packages do exist. If the problems with qmail are largely packaging-related, it is hard to see how they will get fixed by staying unofficial. But if the problems are based on an emotional response to qmail itself - whether based in technical concerns or not - it is hard to see how a developer can overcome them. Variations on fair I/O schedulers An I/O scheduler is a subsystem of the kernel which schedules I/O operations to the various storage devices to get the best possible throughput from those devices. The algorithm is often reminiscent of the algorithm used by elevators when dealing with requests coming from different floors to go up or down. This is the reason I/O scheduling algorithms are also called "elevators." I/O requests are submitted in an order designed to minimize disk head movement (thus minimizing disk seek times) while still guaranteeing good I/O rates. The next request chosen depends on the current disk head position, so that requests are serviced quickly and less time is spent seeking, or moving the disk head. However, algorithms may also consider other aspects such as fairness or time guarantees. The Completely Fair Queuing (CFQ) I/O scheduler is one of the most popular I/O scheduling algorithms; it is used as the default scheduler in most distributions.
As the name suggests, the CFQ scheduler tries to maintain fairness in its distribution of bandwidth to processes, and yet does not compromise much on the throughput. The elevator's fairness is accomplished by servicing all processes and not penalizing those which have requests far from the current disk head position. It grants a time slice to every process; once the task has consumed its slice, the slice is recomputed and the task is added to the end of the queue. The I/O priority is used to compute the time slice granted and the offset in the request queue. The Budget Fair Queuing scheduler The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, still suffers from fairness problems, especially towards processes which make the best possible use of the disk bandwidth. If the same time slice is assigned to two processes, they may each get different throughput, as a function of the positions on the disk of their requests. Moreover, due to its round robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk. The Budget Fair Queuing (BFQ) scheduler, developed by Fabio Checconi and Paolo Valente, changes the CFQ round-robin scheduling policy based on time slices into a fair queuing policy based on sector budgets. Each task is assigned a budget measured in number of sectors instead of amount of time, and budgets are scheduled using a slightly modified version of the Worst-case Fair Weighted Fair Queuing+ (WF2Q+) algorithm (described in this paper [compressed PS]), which guarantees a worst-case complexity of O(log N) and boils down to O(1) in most cases. The budget assigned to each task varies over time as a function of its behavior. However, one can set the maximum value of the budget that BFQ can assign to any task.
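The fairness problem with time slices can be made concrete with a little arithmetic. The sketch below is only an illustration: the disk figures (transfer time per sector, seek time, request size) are invented for the example and are not measurements of any real device or of CFQ itself. It compares how many sectors a sequential and a "seeky" process move in the same time slice, and how long each needs to consume the same sector budget.

```python
# All figures are hypothetical and chosen only to make the arithmetic easy.
US_PER_SECTOR = 10          # transfer time per sector, in microseconds
SEEK_US = 8000              # cost of one seek, in microseconds
SECTORS_PER_REQUEST = 8     # size of each request


def request_cost_us(seeks_per_request):
    """Time needed to serve one request, including any seeks."""
    return SECTORS_PER_REQUEST * US_PER_SECTOR + seeks_per_request * SEEK_US


def sectors_in_slice(slice_us, seeks_per_request):
    """Sectors transferred within a fixed time slice (time-based fairness)."""
    requests = slice_us // request_cost_us(seeks_per_request)
    return requests * SECTORS_PER_REQUEST


def time_for_budget_us(budget_sectors, seeks_per_request):
    """Time needed to consume a fixed sector budget (budget-based fairness)."""
    requests = budget_sectors // SECTORS_PER_REQUEST
    return requests * request_cost_us(seeks_per_request)


slice_us = 100_000  # the same 100ms slice for both processes
sequential_sectors = sectors_in_slice(slice_us, seeks_per_request=0)
seeky_sectors = sectors_in_slice(slice_us, seeks_per_request=1)
print(sequential_sectors, seeky_sectors)

budget = 96  # the same sector budget for both processes
print(time_for_budget_us(budget, 0), time_for_budget_us(budget, 1))
```

With equal time slices the two processes receive very different amounts of data; with equal sector budgets they receive the same amount of data, but the seeky process holds the disk far longer, which is the behavior the budget timeout is meant to bound.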
BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in sectors. There are limits, though: processes which take too long to exhaust their budget are penalized, and the scheduler selects the next process to dispatch I/O. The next budget is calculated from the feedback provided by the requests serviced. BFQ also introduces I/O scheduling within control groups. Queues are collected into a tree of groups, and there is a distinct B-WF2Q+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case. BFQ supports I/O priority classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues or groups are served only if there are no best effort queues or groups in the same control group, and best effort queues and groups are served only if there are no real-time queues or groups. Compared to cfq-cgroups (explained later), it lacks per-device priorities. The developers, however, claim that this feature can be incorporated easily. Algorithm Requests coming to an I/O scheduler fall into two categories, synchronous and asynchronous. Synchronous requests are those for which the application must wait before continuing to send further requests - typically read requests. On the other hand, asynchronous requests - typically writes - do not block the application's progress while they are executed. In BFQ, as in CFQ, synchronous requests are collected in per-task queues, while asynchronous requests are collected in per-device (or, in the case of hierarchical scheduling, per group) queues. When the underlying device driver asks for the next request to serve and there is no queue being served, BFQ uses B-WF2Q+, a modified version of WF2Q+, to choose a queue. It then selects the first request from that queue in C-LOOK order and returns it to the driver.
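The C-LOOK ordering used in that last step can be sketched in a few lines. This is a minimal illustration rather than the kernel's implementation: the function name and the representation of requests as bare sector numbers are assumptions made for the example. Requests at or beyond the head position are served in ascending order, after which the head jumps back to the lowest pending sector and sweeps upward again.

```python
import bisect


def c_look_order(pending_sectors, head):
    """Return the pending sectors in C-LOOK service order.

    pending_sectors: iterable of sector numbers awaiting service
    head: current position of the disk head (hypothetical input)
    """
    ordered = sorted(pending_sectors)
    # Index of the first request at or beyond the head position.
    split = bisect.bisect_left(ordered, head)
    # Sweep upward from the head, then wrap around to the lowest sector.
    return ordered[split:] + ordered[:split]


order = c_look_order([98, 183, 37, 122, 14, 124, 65, 67], head=53)
print(order)
```

The head only ever services requests while moving in one direction, so a request just behind the head waits for the wrap-around rather than causing a reversal.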
C-LOOK is a disk scheduling algorithm in which the next request picked is the one with the next-highest disk sector relative to the current position of the disk head. Once the disk has serviced the maximum sector number in the request queue, it positions the head at the request with the lowest sector number. When a new queue is selected it is assigned a budget, in disk sector units, which is decremented each time a request from that queue is served. When the device driver asks for new requests and there is a queue under service, they are chosen from that queue until one of the following conditions is met: (1) the queue exhausts its budget, (2) the queue is spending too much time to consume its budget, or (3) the queue has no more requests to serve. On termination of a request, the scheduler recalculates the budget allocated to each process depending on the feedback it gets. For example, for greedy processes which have exhausted their budgets, the budget is increased, whereas a process which has been idle for a long time has its budget decreased. The maximum budget a process can get is a configurable system parameter (max_budget). Two other parameters, timeout_sync and timeout_async, control the timeout for consuming the budget of the synchronous and asynchronous queues respectively. In addition, max_budget_async_rq limits the maximum number of requests serviced from an asynchronous queue. If a synchronous queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells the device driver that it has no requests to serve even if there are other active queues) for a short period, in anticipation of a new request from the task owning the queue. Test Results The developers compared six different I/O scheduling algorithms: BFQ, YFQ, SCAN-EDF, CFQ, the Linux anticipatory scheduler, and C-LOOK.
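The three conditions which end service of a queue, and the budget feedback which follows, can be sketched as below. This is a simplified model under stated assumptions, not the BFQ code: the function names, the doubling and halving of the budget, the minimum budget, and the decision to leave the budget unchanged after a timeout are all invented for illustration. The sketch also treats a queue which ran out of requests as the "idle" case.

```python
from collections import deque


def serve_queue(requests, budget, time_limit_ms, cost_ms):
    """Dispatch from one queue until a stop condition holds.

    requests: deque of request sizes, in sectors
    budget: sector budget assigned to the queue
    time_limit_ms: time allowed for consuming the budget
    cost_ms: function mapping a request size to its service time
    Returns (reason, sectors_served).
    """
    served = 0
    elapsed_ms = 0
    while True:
        if served >= budget:
            return "budget_exhausted", served      # condition (1)
        if elapsed_ms >= time_limit_ms:
            return "timeout", served               # condition (2)
        if not requests:
            return "empty", served                 # condition (3)
        size = requests.popleft()
        served += size
        elapsed_ms += cost_ms(size)


def next_budget(budget, reason, max_budget, min_budget=8):
    """Hypothetical feedback rule: grow greedy queues, shrink idle ones."""
    if reason == "budget_exhausted":
        return min(budget * 2, max_budget)   # never exceed max_budget
    if reason == "empty":
        return max(budget // 2, min_budget)
    return budget  # "timeout": left unchanged in this sketch


reason, served = serve_queue(deque([8] * 10), budget=32,
                             time_limit_ms=100, cost_ms=lambda size: 1)
print(reason, served, next_budget(32, reason, max_budget=48))
```

Capping the result at max_budget mirrors the configurable limit described above; the timeout plays the role of timeout_sync or timeout_async, depending on the type of queue.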
They ran a multitude of test scenarios analogous to real-life workloads, measuring throughput, bandwidth distribution, latency, and short-term time guarantees. With respect to bandwidth distribution, BFQ came out best, and it is a good algorithm for most scenarios. There were also extensive tests comparing BFQ against CFQ, and the results are available here. The throughput of BFQ is more or less the same as CFQ, but it scores well in distributing I/O bandwidth fairly among the processes, and displays lower latency with streaming data. Using sector budgets instead of time as the basis for granting slices for fair bandwidth distribution is an interesting concept. The algorithm also employs timeouts to terminate the service of "seeky" processes which take too much time to consume their budget, and penalizes them. The feedback from current requests helps determine future budgets, making the algorithm self-learning. Such tight bandwidth distribution would be a requirement for systems running virtual machines, or container classes. However, it remains to be seen how BFQ stands the test of time against the tried-and-tested stable CFQ. See the BFQ technical report [PDF] for (much) more information. Expanded CFQ Control Groups provide a mechanism for aggregating sets of tasks, and all their future children, into hierarchical groups. These groups can be allocated dedicated portions of the available resources, or resource sharing can be prioritized within these groups. Control groups are controlled by the cgroups pseudo-filesystem. Once mounted, the top level directory shows the complete set of existing control groups. Each directory made under the root of that filesystem makes a new group, and resources can be allocated to the tasks listed in the tasks file in the individual group's directory. Control groups can be used to regulate access to CPU time, memory, and more. There are also several projects working toward the creation of I/O bandwidth controllers for control groups.
One of those is the expanded CFQ scheduler patch for cgroups by Satoshi Uchida. This patch set introduces a new I/O scheduler called cfq-cgroups, which brings cgroups to the I/O scheduling subsystem. This scheduler, as the name suggests, is based on the Completely Fair Queuing I/O scheduler. It can take advantage of hierarchical scheduling of processes, with respect to the cgroup they belong to, each cgroup having its own CFQ scheduler. I/O devices in a control group can be prioritized. The time slice given to each hierarchical group per device is a function of the device priority. This helps to shape I/O bandwidth per group, per device. Usage To use cfq-cgroups, select it as the default scheduler at boot by passing elevator=cfq-cgroups as a boot parameter. This can also be dynamically changed for individual devices by writing cfq-cgroups to /sys/block/<device>/queue/scheduler. There are two levels of control: through the cgroups filesystem, for individual groups, and through sysfs, for individual devices. Like any other control group, cfq-cgroup is managed through the cgroup pseudo-filesystem. To access the cgroups, mount the pseudo cgroups filesystem: The cgroup directory, by default, will have a file called cfq.ioprio, which contains the individual priority on a per-device basis. The time slice received per device per group is a function of the I/O priority listed in cfq.ioprio. The tasks file represents the list of tasks in the particular group. To make more groups, create a directory in the mounted cgroup directory: The new directories are automatically populated with files, cfq.ioprio, tasks etc, which are used to control the resources in this group. To add tasks to a group, write the process ID of the task to the tasks file: The cfq.ioprio file contains the list of devices and their respective priorities. Each device in the cgroup has a default I/O priority of 3, while the valid values are 0 to 7.
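The file operations described above (selecting the scheduler for a device, creating a group, and adding a task to it) can be sketched as plain file writes. This is a hedged illustration only: the helper names are invented, the mount point of the cgroup filesystem is left to the caller, and the format of values written to cfq.ioprio is not shown because the text does not specify it. On a real system these writes would require root and a kernel carrying the cfq-cgroups patch.

```python
import os


def select_scheduler(device, name, sys_block="/sys/block"):
    """Write a scheduler name to <sys_block>/<device>/queue/scheduler."""
    path = os.path.join(sys_block, device, "queue", "scheduler")
    with open(path, "w") as f:
        f.write(name)


def create_group(cgroup_root, group):
    """Create a control group by making a directory under the mount point."""
    path = os.path.join(cgroup_root, group)
    os.makedirs(path, exist_ok=True)
    return path


def add_task(group_path, pid):
    """Add a task to a group by writing its process ID to the tasks file."""
    with open(os.path.join(group_path, "tasks"), "w") as f:
        f.write(str(pid))
```

Passing the sysfs and cgroup roots as parameters keeps the helpers usable against a scratch directory for experimentation, for example `add_task(create_group("/tmp/fake-cgroup", "group1"), os.getpid())`.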
To change the priority of a device for the cgroup group1, run: This would change the priority of the entire group. To change the I/O priority of a specific device: To change the default priority while keeping the priority of the devices unchanged: The device view shows the list of cgroups and their respective priorities on a per-group basis. This can be changed by: The device view contains other parameters similar to those of the CFQ scheduler, such as back_seek_max or back_seek_penalty, which are specific to the control of the individual device, just as in the traditional CFQ. Implementation The patch introduces a new data structure called cfq_driver_data for the control of I/O bandwidth for cgroups. All driver-related data has been moved from the traditional cfq_data structure to the cfq_driver_data structure. Similarly, cfq_cgroups is a new data structure to control the cgroup parameters. The organization of data can be thought of as a matrix with cfq_cgroups as rows and cfq_driver_data as columns, as shown in the diagram below. At each intersection, there is a cfq_data structure which is responsible for all CFQ-related queue handling, so that each cfq_data corresponds to one cfq_cgroup and cfq_driver_data combination. When a new cgroup is created, the cfq_data from the parent cgroup is copied into the new group. While inserting new nodes of cfq_data into the cgroup, the cfq_data structure is initialized with the priority of the cfq_cgroup. This way all data of the parent is inherited by the child cgroup, and shows up in the respective files per group in the cgroup filesystem. Scheduling of cfq_data within the CFQ scheduler is similar to that of the native CFQ scheduler. Each node is assigned a time slice. This slice is calculated according to the I/O priority of the device, using the per-device base time slice. The time slice offset forms the key of the red-black node to be inserted in the service tree.
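The matrix organization described above can be modeled with a dictionary keyed by (cgroup, device) pairs. The sketch below is a loose analogy, not the patch's data structures: the class names, the base slice value, and in particular the slice formula are assumptions made for illustration. It shows one cfq_data-like cell per intersection, and a child group inheriting its per-device priorities from its parent.

```python
BASE_SLICE_MS = 100     # hypothetical per-device base time slice
DEFAULT_IOPRIO = 3      # default priority; valid values are 0 to 7


class CfqData:
    """One cell of the matrix: queue state for a (cgroup, device) pair."""

    def __init__(self, cgroup, device, ioprio):
        self.cgroup = cgroup
        self.device = device
        self.ioprio = ioprio

    def slice_ms(self):
        # Invented linear formula: a lower ioprio value gets a longer slice.
        return BASE_SLICE_MS * (8 - self.ioprio) // 5


class CfqMatrix:
    """Rows are cgroups, columns are devices."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.cells = {}

    def add_cgroup(self, name, parent=None):
        for dev in self.devices:
            if parent is None:
                prio = DEFAULT_IOPRIO
            else:
                prio = self.cells[(parent, dev)].ioprio  # inherit from parent
            self.cells[(name, dev)] = CfqData(name, dev, prio)

    def set_ioprio(self, cgroup, device, ioprio):
        if not 0 <= ioprio <= 7:
            raise ValueError("ioprio must be between 0 and 7")
        self.cells[(cgroup, device)].ioprio = ioprio

    def device_view(self, device):
        """Per-device column: the priority of every cgroup on this device."""
        return {cg: cell.ioprio
                for (cg, dev), cell in self.cells.items() if dev == device}


matrix = CfqMatrix(["sda", "sdb"])
matrix.add_cgroup("root")
matrix.set_ioprio("root", "sda", 1)
matrix.add_cgroup("group1", parent="root")
print(matrix.device_view("sda"))
```

Reading a row gives the per-group view exposed through the cgroup filesystem, while reading a column gives the per-device view exposed through sysfs.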
One cfq_data entry is picked from the start of the red-black tree and scheduled. Once its time slice expires it is added to the tree again, after recalculation of its time slice offset. So, each cfq_data structure acts as a queue node per device, and, within each CFQ data structure, requests are queued as with a regular CFQ queue. Both BFQ and cfq-cgroups are attempts to bring a higher degree of fairness to I/O scheduling, with "fairness" being tempered by the desire to add more administrative control via the control groups mechanism. They both appear to be useful solutions, but they must contend with the wealth of other I/O bandwidth control implementations out there. Coming to some sort of consensus on which approach is the right one could prove to be a rather longer process than simply implementing these algorithms in the first place. System integrity in Linux Ensuring that a Linux system is only running "approved" programs—ones that haven't been maliciously replaced—is one of the goals of the integrity patches currently being proposed for the Linux mainline. With some hardware assistance, in the form of a Trusted Platform Module (TPM) chip, systems will be able to protect against unauthorized binaries as well as attest to other systems that they are only running good code. These patches have been around for a number of years in various forms, but it would seem they are getting close to being merged. Perhaps more interestingly, we are starting to see them be used by various projects. Over on the kernel page, we have looked at the integrity patches several times, most recently in March 2007. The core idea is to complement mandatory access control (MAC) systems, such as SELinux, by preventing attacks that are made when that system isn't running—the machine has been booted with a different kernel for example. 
It is generally considered a security truism that physical access to a device moots any security measures, but with a properly outfitted TPM-based system, that is no longer the case. Conceptually, there are two parts to the integrity feature. One is the extended verification module (EVM) that associates each file with a hash that has been calculated over its contents and metadata. That hash is then signed by the TPM chip ensuring that unauthorized changes will be noticed. The other half is the integrity measurement architecture (IMA) which tracks the use of mmap(). IMA verifies the hashes of files that have been mapped in executable mode and then keeps track of them in a way that the TPM can sign. EVM then provides the protection against tampering with binaries, while IMA can provide a signed attestation of which executables have been run. Previous incarnations of EVM and IMA used the Linux Security Modules (LSM) interface, but that has a very unfortunate side effect: inability to also run SELinux. LSM code has no way to stack or cooperate, so there can only be one module active at a time. Since integrity and MAC are intended to work together, this was seen as a rather serious impediment, so the most recent versions add in hooks for Linux Integrity Modules (LIM). IMA is then added as a LIM integrity provider rather than as an LSM. In response to an Andrew Morton query about the need for LIM/IMA (EVM has been incorporated into IMA over time), David Safford listed several users of the code: LIM/IMA's maintenance of a TPM hardware anchored file measurement list is fundamental to the Trusted Computing Group's standards efforts. Several projects have implemented the TNC (Trusted Network Connect) and PTS (Platform Trust Services) standards (see below). There are three demo packaged distros which have integrated these apps, two of which are government funded (EU and US), with definite customer interest. 
We are working with the RHEL team to provide a supported, patched kernel for HAP. All of these so far have used the old LSM based IMA, and have asked for a supported, upstreamed implementation, with the ability to work with SELinux. While that looks a bit like alphabet soup, there is a lot of useful information there (and in his links further down in the post linked above). The biggest news is the three distributions that are implementing "Trusted Computing". The High Assurance Platform (HAP) program is funded by the US National Security Agency (NSA), the folks who brought us SELinux, while the Open Trusted Computing project is funded by the European Commission. While the security that can be provided by a Trusted Computing platform is useful for some installations, there are some potential pitfalls as well. Systems with TPM hardware can be configured to only run binaries that are signed by some external authority. If manufacturers were to enable that functionality, but only provide the key to "trusted" software companies, it would lead to a horrendous loss of freedom. This is why some have called it "Treacherous Computing". There are numerous examples of systems that do not necessarily preserve physical security, but that one might want to ensure were running the proper code—voting and cash machines come quickly to mind. For those situations, as well as countless others, Trusted Computing will be a real boon. We just need to be vigilant so that hardware vendors (or, worse yet, governments) don't start restricting what we can run on our own machines. Dueling performance monitors Low-level optimization of performance-critical code can be a challenging task. At this point, one assumes, the potential for algorithmic improvements in the targeted code has been realized; what is left is trying to locate and address problems like cache misses, mis-predicted branches, and so on. 
Such problems can be impossible to find by just looking at the code; one needs support from the hardware. The good news is that contemporary hardware provides that support; most processors can collect a wide range of performance data for analysis. The bad news is that, despite the fact that processors have been able to collect that data for many years, there has never been support for this kind of performance monitoring in the mainline kernel. That situation may be about to change, but, first, the development community will have to make a choice between a venerable out-of-tree implementation and an unexpected competitor. The "perfmon" patch set has been under development for some years, but, for a number of reasons, it has never found its way into the mainline kernel. The most recent version of the patch was posted for review by Stéphane Eranian in late November. The perfmon patches show the signs of all those years of development work and usage experience; they offer a wide set of features and extensive user-space support. The full perfmon patch adds twelve system calls to the kernel; the posted version, though, trims that count back to five in the hope that a narrower interface will have a better chance of getting into the mainline. The additional system calls, one assumes, will be proposed for inclusion sometime after the perfmon core is merged. The reduced interface is described in the patch set; briefly, an application hooks into the performance monitoring subsystem with a call to: This system call returns a file descriptor to identify the performance monitoring session. The regs parameter is used to return a list of performance monitoring registers available on the current system; flags is currently unused. Specific performance counter registers can be manipulated with: These system calls can be used to write values into registers (thus programming the performance monitoring hardware) and to read counter and configuration information from those registers. 
Actually doing some performance monitoring requires a couple more calls: A call to pfm_attach() specifies which process is to be monitored; pfm_set_state() then turns monitoring on and off. There are a couple of distinctive aspects to the perfmon interface. One is that it knows almost nothing about the specific performance monitoring registers; that information, instead, is expected to live in user space. As a result, the bare perfmon system call interface is probably not something that most monitoring applications would use; instead, those system calls are hidden behind a user-space library which knows how to program different types of processors for the desired results. Beyond that, perfmon uses the ptrace() mechanism to stop the monitored process while performance counters are being queried; as a result, the monitoring process must have the right to trace the target process. On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new performance counter subsystem. The announcement states: We are aware of the perfmon3 patchset that has been submitted to lkml recently. Our patchset tries to achieve a similar end result, with a fundamentally different (and we believe, superior :-) design. This is not the first time that these developers have shown up with an out-of-the-blue reimplementation of somebody else's subsystem; other examples include the CFS scheduler, high-resolution timers, dynamic tick, and realtime preemption. Most of the time, the new code quickly supplants the older version - an occurrence which is not always pleasing to the original developers - but the situation does not seem quite as straightforward this time. The proposed interface is much simpler, adding a single system call: This call will return a file descriptor corresponding to a single hardware counter. A call to read() will then return the current value of the counter. 
The hw_event_period can be used to block reads until the counter overflows the given value, allowing, for example, events to be queried in batches of 1000. The pid parameter can be used to target a specific process, and cpu can restrict monitoring to a specific processor. There are a few advantages claimed for the new implementation. The simplicity of the system call interface is one of those; it is possible to write a very simple application to perform monitoring tasks, with no additional libraries required. The second version of the patch includes a simple "kerneltop" utility which can display a constantly-updated profile of anything the performance counting hardware can monitor. Another advantage is the avoidance of ptrace(); this reduces the amount of privilege needed by the monitoring process and avoids perturbing the monitored process by stopping and restarting it. The management of counters is said to be more flexible, with facilities for sharing counters between processes and reserving them for administrative access. The low-level hardware interface is said to be simpler as well. Those claimed advantages notwithstanding, a number of complaints have been raised with regard to the new performance monitoring code. Two of those seem to be at the top of the list: the single counter per file descriptor API, and programming the hardware performance monitoring unit inside the kernel. On the API side, the biggest concern is that putting each counter behind its own file descriptor makes it very hard to correlate two or more counters. Reading two counters requires two independent read() system calls; as is always the case, just about anything could happen between those two calls. So it's hard to tell how two different counter values relate to each other. But that sort of correlation is exactly what developers doing performance optimization want to do. Paul Mackerras says: Your API has as its central abstraction the "counter". 
I am saying that that is the wrong abstraction. The abstraction really needs to be a set of counters that are all active over precisely the same interval, so that their values can be meaningfully compared and related to each other. In response, Ingo argues that the loss of precision caused by independent read() calls is small - much smaller than the muddying of the results caused by stopping the target process so that all of the counters can be read at the same time. That argument does not appear to have convinced the detractors, though. The other complaint is that moving the counter programming task into the kernel requires that the kernel know about the complexities of every possible performance monitoring unit it may encounter. This hardware sits at the core of the most performance-critical CPU subsystems, so its design parameters value non-interference above features or a straightforward programming interface. So programming it can be a complex business, involving sizeable tables describing how various operations interact with each other. The perfmon code keeps those tables in a user-space library, but the alternative implementation won't allow that. Quoting Paul again: Now, the tables in perfmon's user-land libpfm that describe the mapping from abstract events to event-selector values and the constraints on what events can be counted together come to nearly 29,000 lines of code just for the IBM 64-bit powerpc processors. Your API condemns us to adding all that bloat to the kernel, plus the code to use those tables. Paul (and others) argue that this information - which can add up to hundreds of kilobytes - is better kept in user space. There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was posted for review. It must, indeed, be a shock to work on a subsystem for years, then find a proposed replacement sitting in one's mailbox. 
As David Miller put it: And also, another part of the backlash is that the poor perfmon3 person was completely blindsided by this new stuff. Which to be honest was pretty unfair. He might have had great ideas about the requirements (even if you don't give a crap about his approach to achieving those requirements) and thus could have helped avoid the past few days of churn. So, at this point, what will happen with performance monitoring is unclear at best. Perhaps, though, this discussion will have the effect of raising the profile of performance monitoring, which has been without proper kernel support for many years. The merging of either solution - or, perhaps, a combination of both - seems like it has to be an improvement over having no support at all. A new realtime tree It has been just over four years, now, since the realtime discussion got serious and the realtime preemption patch set got its start. During that time, your editor has heard many predictions for when the bulk of the realtime work would be merged; generally, the guess has been "within about a year." While a lot of realtime work has been merged, some of the core components of the realtime tree remain outside of the mainline. Beyond that, the realtime developers have been relatively quiet over the last year - at least on the realtime front. Having taken on some little side tasks - unifying the x86 architecture and maintaining it going forward, for example - some of those developers have been just a little bit distracted recently. The realtime patch set has not gone away, though. If nothing else, the fact that a number of distributors are shipping this code is enough to ensure continued interest in its development. So your editor noted with interest the recent announcement of a new -rt tree with an updated set of realtime patches. This tree will be of interest for anybody wanting to look at the realtime work in the context of the 2.6.28 kernel or beyond. 
One of the core technologies in the realtime tree is a change to how spinlocks work. Spinlocks in the mainline will busy-wait until the required lock becomes available; they thus occupy the processor to no useful end when acquiring a contended lock. Holding a spinlock will also prevent a thread from being preempted. This behavior is generally best for system throughput; it also makes it easier to write correct code. But anything which prevents a CPU from immediately servicing the highest-priority process runs counter to the chief design goal of a realtime operating system: providing deterministic response times in all situations. So, for the realtime patches, classic spinlocks had to go. The solution was to turn most spinlocks into a form of mutex with priority inheritance. A process which attempts to acquire a contended "spinlock" will no longer spin; instead, it goes to sleep and waits for the lock to become free, making the processor available to another thread. Code which holds one of these non-spinlocks is no longer immune to preemption; a higher-priority thread can always push it out of the way. By changing spinlocks in this way, the realtime hackers were able to eliminate one of the largest sources of latency in the mainline kernel. Much of that work found its way into the mainline some time ago in the form of the mutex API, but spinlocks themselves have not been changed in the mainline. To minimize the pain of maintaining the realtime patches, the developers simply redefined the spinlock_t type to be the new mutex type instead. Except that, as it turns out, some spinlocks in low-level parts of the kernel really do need to be spinlocks still. So those were switched to a new raw_spinlock_t type - but without changing the various spin_lock() calls. Instead, some truly frightening macro trickery was introduced to cause the spinlock API to do the right thing when passed either of two entirely different mutual exclusion primitives. 
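As a loose illustration of that idea (in Python rather than kernel C, and not the realtime tree's actual macros), a single entry point can choose its behavior from the type of lock it is handed. The class names `RawSpinLock` and `SleepingLock` and the functions `spin_lock()` and `spin_unlock()` below are hypothetical stand-ins. The kernel does its dispatch at compile time in the preprocessor, and priority inheritance is not modeled here at all.

```python
import threading
from functools import singledispatch


class RawSpinLock:
    """Hypothetical stand-in for a true spinlock: waiters busy-wait."""

    def __init__(self):
        self._flag = threading.Lock()
        self.spins = 0  # failed acquisition attempts made while waiting


class SleepingLock:
    """Hypothetical stand-in for the mutex-like lock: waiters block."""

    def __init__(self):
        self._mutex = threading.Lock()


@singledispatch
def spin_lock(lock):
    # Fallback for anything that is not one of the two known lock types.
    raise TypeError(f"unsupported lock type: {type(lock).__name__}")


@spin_lock.register(RawSpinLock)
def _(lock):
    # Busy-wait: keep retrying a non-blocking acquire until it succeeds.
    while not lock._flag.acquire(blocking=False):
        lock.spins += 1
    return "raw"


@spin_lock.register(SleepingLock)
def _(lock):
    # Blocking acquire: the calling thread waits until the lock is free.
    lock._mutex.acquire()
    return "sleeping"


@singledispatch
def spin_unlock(lock):
    raise TypeError(f"unsupported lock type: {type(lock).__name__}")


@spin_unlock.register(RawSpinLock)
def _(lock):
    lock._flag.release()


@spin_unlock.register(SleepingLock)
def _(lock):
    lock._mutex.release()
```

Callers write `spin_lock(some_lock)` in both cases; the return value only names the path that was taken, which makes the dispatch visible when experimenting.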
This bit of macro magic was always going to be an impediment to mainline inclusion, so the realtime developers never really expected to merge the lock code in that form. The new realtime tree now shows how the realtime developers think this work might get into the mainline. It involves a more explicit separation of the two types of "spinlocks" - and a lot of code churn. In the realtime tree, most locks of type spinlock_t are changed to a new lock_t type. There is a new set of operations for this type: For a normal, non-realtime kernel build, lock_t will be the same as spinlock_t, and things will work as they always have. On realtime kernels, instead, lock_t will be a mutex type. The other variants of the spinlock API will be represented in the new API (there is an acquire_lock_irqsave(), for example), but none of them will actually disable interrupts in a realtime kernel. Meanwhile, spinlock_t will remain a true spinlock type. This change gets rid of the tricky macros, but at the cost of changing the declarations of and operations on almost all spinlocks in the kernel. That is a lot of code changes: a quick grep turns up over 20,000 spin_lock*() calls in the upcoming 2.6.28 kernel. That will make for some pain if and when this change is merged. But in the mean time, it can only make for a lot of pain for the people who have to maintain this patch out of tree. To make their lives a little easier, the realtime developers have created a couple of scripts to do the bulk of the work. First, all spinlocks in a pristine kernel are converted to lock_t, then the few locks which truly must be spinlocks are switched back. This work is kept in a separate branch which is regenerated when needed; in this way, the realtime developers avoid the need to do nasty merges to keep up with current kernels. Your editor has heard talk of another locking change which does not, yet, appear in this tree. 
One problem with the realtime patch set is that it requires distributors to create yet another kernel build - something they hate doing - if they want to support realtime operation. In an effort to make life easier for distributors, the realtime developers are working on a scheme whereby a kernel would determine at run time whether it should be running in a realtime mode. If so, spinlocks will be changed to sleeping locks by patching the kernel binary as it boots. Kernels built this way will be able to run efficiently in either mode. The branches of the realtime tree provide a quick guide to the other parts of the realtime work which remain outside of the mainline. The threaded interrupt handler code is one example; that change could be proposed (again) for merging in the near future. The priority workqueue mechanism sits in another branch, as do patches aimed at Java support, filesystem changes, memory management changes, and more. Then, there's a branch for stuff which will never be merged; for example, there is this patch which gives Java programs direct access to physical memory - not something which strikes most kernel developers as a good idea. All told, there is a great deal of work sitting in the realtime patch set; this work is finally being organized into a proper git tree. The "upstream first" policy says that vendors should merge their code upstream before shipping it to customers. The 2.6.x development model is built on the idea that no change is too fundamental to be accepted into a regular, 3-month development cycle. The realtime patches would appear to be an exception to both rules. It has taken over four years to get to a point where some of the fundamental realtime technologies are close to ready for the mainline, but distributors have been shipping it for at least three of those years. It has, in other words, been one of the biggest forks of the Linux kernel, ever. 
The plan has always been to join this fork back with the mainline, though; perhaps, finally, that goal is getting closer. With luck, it will happen within about a year. Tracking down a "runaway loop" The Linux boot process, at least as provided by distributions, depends on help from user space, with drivers being loaded as required from the initial filesystem (initramfs/initrd). Loading drivers requires using tools built into initramfs and if those tools break, the kernel won't boot. But when a working kernel configuration and initramfs are used with a new kernel, the result is expected to be a kernel that successfully boots. When that doesn't happen, bugs are filed regarding kernel regressions but, as a recent example shows, the actual problem may be elsewhere. The original report was made in late October, but no progress was made until Evgeniy Polyakov saw it again in early December. The symptom was a kernel that hangs after printing: four times on the console. Since nothing in the user space (initramfs) or kernel configuration had changed, it seemed to clearly point to something in the kernel itself. It turns out that the "runaway loop" message is meant to indicate that the request_module() function has been invoked recursively. So in an effort to load the driver for the character device with major/minor numbers 5/1—which corresponds to /dev/console—request_module() was invoked again. The code in kernel/kmod.c: Python 3 is out - now what? For some years now, the Python development community has been talking about "Python 3000," the far-future release which would allow a complete rethinking of the language to fix the various annoyances which had built up over time. On December 3, that talk came to fruition with the Python 3.0 release. This release is the end result of a great deal of thought and development; it represents the vision Guido van Rossum and company have for the language into the indefinite future. 
Now that it's out, the Python community as a whole appears to have stopped for a "now what?" moment. The wider Python development community appears to be split into three camps on Python 3.0; the situation amusingly resembles the classic folk tale "Goldilocks and the three bears." One set (the "too large" crowd) seems to think that an incompatible version of Python should never have been released, that languages should stay compatible forever. Another group ("too small") can handle the idea of an incompatible transition, but thinks that the Python community should have added more shiny features to the language while they were at it. And, of course, there's a "just right" crowd taking the position that the changes in Python 3 are just about as they should be. See this discussion by James Bennett for a well-argued description of the "just right" position. Time will tell which position is closest to reality. If the "too large" group is right, Python 3 (or Python in general) will fade away as developers, unhappy with the break, move to a language they like better. If Python 3 is too small, there will be strong pressure for a Python 4 in the too-near future. Your editor, though, thinks that the Python community has come pretty close to getting it right. Things that truly needed to be fixed got fixed, but the Python developers resisted the temptation to try to do too much. They watched, from a safe distance, what happened with the Mozilla rewrite and Perl 6, and wisely concluded that their lives - and the lives of those who use Python - would be better if they avoided a similar experience. So they limited their goals and were able to get the job done in a reasonable amount of time. Except, of course, that the job is not really done. To begin with, the presence of a few difficulties with the 3.0 release should not surprise anybody. The developers forgot to remove the deprecated cmp() function, with the result that newly-converted code may come to depend on it. 
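For code that would otherwise lean on cmp(), a minimal sketch of the usual replacement idioms follows. The helper name `three_way` and the sample data are illustrative only. Note that `functools.cmp_to_key` appeared in later Python releases (2.7 and 3.2), so it is not available on 3.0 itself.

```python
from functools import cmp_to_key


def three_way(a, b):
    """Return -1, 0 or 1, in the manner of the old cmp() builtin."""
    return (a > b) - (a < b)


words = ["pear", "Apple", "fig"]

# Preferred style: express the ordering as a key function.
by_length = sorted(words, key=len)


# Legacy style: keep an existing comparison function and adapt it.
def compare_caseless(a, b):
    return three_way(a.lower(), b.lower())


caseless = sorted(words, key=cmp_to_key(compare_caseless))
```

Where a key function expresses the ordering directly, it is generally the simpler choice; the adapter is mainly useful for carrying over comparison functions that already exist.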
There are some performance issues. A couple of other features are not working quite right. Getting Unicode truly straightened out may take a while yet - a problem which is certainly not unique to Python. The list seems to be quite short given that this is a major release of a complex programming language, but there are still things to fix. So there will almost certainly be a 3.0.1 release before the end of the year, and a 3.0.2 in (approximately) February. Meanwhile, the Python hackers have made it clear that the 2.x version of the language will be supported for some years yet. Version 2.6, available now, includes a number of features aimed at making the eventual port to 3.0 easier. As the porting projects get serious, other ways to help that process will become clear; there will be an eventual 2.7 release which incorporates those lessons wherever possible. A 2.8 release further down the road has not been ruled out. The current plan seems to be to maintain Python 2.x for at least the next three years. That is good because, for many Python developers, it is not yet really time to make the jump to 3.0. The core language appears to be in reasonably good shape, but a language like Python involves much more than the core. Most non-trivial code makes heavy use of the wide variety of Python libraries, and, at this point, many or most of those libraries do not support Python 3. So, now is a good time for library maintainers to be looking at moving to 3.0, but application developers who try to port their code now are likely to run into frustration. Porting smaller programs or subsystems as an exercise in learning the new language may make sense, but complex application porting probably cannot happen for a little while yet. What distributors should be doing is another question. 
So far, it would appear that only Fedora is having a (public) discussion on how to handle the Python 3 transition - see this thread - and they don't really know what they are going to do yet. Fedora's maintainers, it seems, would prefer to stay with Python 2 for the indefinite future; the chances of Python 3 making an appearance in Fedora 11 are quite small. There is a strong wish to avoid maintaining both 2.x and 3.x on the same distribution release; they would rather make a clean switch. Your editor suspects that the flag-day approach to the language transition is not going to work. There are a lot of packages which need to be ported, and many of the people doing the porting would appreciate support from their distributor. Red Hat dragged its feet for a long time on the transition to Python 2, with the result that many users had to build and install the newer version of the language themselves. For Fedora to do the same with Python 3 is a sure path toward user frustration. That said, keeping both versions of the language around is not a task for the faint of heart. Installing a different version of Python itself is quite easy. Keeping a whole set of modules for multiple versions is distinctly less so. This will be especially true for Fedora; some other distributions (especially the Debian-derived ones) have better mechanisms for (and experience in) maintaining multiple versions of core system tools. The reluctance on the part of the Fedora developers to take on this work is thus unsurprising. Perhaps this would be a good opportunity for offers of help from the wider Fedora community. It may well take a couple of years, but this transition will eventually be made and people will eventually wonder what all the fuss was about. And, when it's done, we'll have a cleaner, more maintainable, more Unicode-rational version of an important programming language to work with. That, one hopes, will be worth the short-term pain involved in getting there. 
(For more information, see the Python3000 FAQ, currently under development).

Create and Manage Gantt Charts with GanttProject

GanttProject is an open-source cross-platform Java application that can be used to generate Gantt charts for the management of projects. Different components of GanttProject have been released under the GPL and Apache licenses. The project is described: GanttProject is a free and easy to use Gantt chart based project scheduling and management tool. Our major features include: Task hierarchy and dependencies, Gantt chart, Resource load chart, Generation of PERT chart, PDF and HTML reports, MS Project import/export, WebDAV based groupwork. The learn about document explains more of the project's features and some screen shots show some examples of what an older version of GanttProject looks like. Version 2.0.8 of GanttProject was recently announced: The major improvement in GanttProject 2.0.8 is that task web links now appear in PDF and HTML exports. Besides, those who use filesystem paths as web links, now can specify relative path to a file from .gan file location. GanttProject 2.0.8 also includes a few bugfixes and localization improvements for Croatian, Japanese and Colombian users. Installation of GanttProject 2.0.8 on an Ubuntu 8.04 system was fairly straightforward. The software was downloaded and unzipped. The prerequisite Sun Java Runtime Environment was downloaded and installed. The ganttproject.sh startup file was given execute permission and run; the application started up as expected. GanttProject is easy to figure out. There are top-level tabs for creating charts and resources (people). Tasks can be added, assigned date ranges and a variety of other attributes. Tasks can be tied to other predecessor tasks and assigned to people. It only took a few minutes of poking around the software to create a new project, produce a simple Gantt chart and output a PostScript file that was suitable for printing. 
GanttProject is not alone in its ability to generate Gantt charts under Linux. Planner is a project management tool for the GNOME desktop environment and TaskJuggler is yet another project management tool. Both of these applications have a broader project management scope. If your needs only require generating Gantt charts, GanttProject is a straightforward application that can be used to easily produce professional looking results.

Interview: Vernor Vinge

Science fiction writer Vernor Vinge is best-known for novels like A Fire Upon the Deep and Rainbows End, as well as the concept of The Singularity -- the idea that, in the next couple of decades, humans will become or create a super-human intelligence. What is less well-known is that Vinge has been a free software supporter since the earliest days of the Free Software Foundation (FSF). He has served several times on the jury for the FSF Awards and spoke at an FSF-sponsored event held last month in San Diego to coincide with the LISA conference. As someone who deals regularly with large-scale speculations, Vinge places free software in a larger historical context. He even speculates that free software may be one of the factors that will shortly bring about the Singularity. Part of Vinge's interest in free software is personal. A mathematician and computer scientist, he quickly found that the rise of proprietary software greatly increased the difficulties of teaching. "When I looked at contracts and user-agreements," he recalls, "the legalese was extraordinarily intimidating, not just because it was complicated, but because it actually seemed to restrict things to the point where it was really difficult to imagine how a student could follow the agreement and still do a project. So the openness that was in the GNU General Public License (GPL) was really very, very welcome." Vinge soon got into the habit of giving students "a little spiel about the GPL" and encouraging them to license their projects under the GPL. 
"If they did that," he says, "that would mean I would be able to use their stuff in later projects with other students. And a very large percentage of students in most classes thought it was a cool enough idea that they actually did use [the GPL] in their projects."

The historical trend to cooperative infrastructure

However, as important as free software may have been to Vinge in his teaching, what seems to interest him the most is placing free software in a broader historical context. Early on, Vinge came to view free software -- and, later on, the Internet and social networking applications that it was instrumental in creating -- as part of a historical trend towards creating an increasingly elaborate "infrastructure of trust and cooperation" that increases the rate of technological advance. Vinge says: "There are business inventions of the last 2000 years like the widespread use of loans and credit, the use of insurance, the use of limited liability corporations, all of which involve at least at the beginning, a leap of trust." To Vinge, free software, the Internet and social networking are simply the latest extensions to the infrastructure created from such institutions. What these institutions all have in common is that they allow people to interact in more creative and productive ways. More specifically, he sees free software as the natural and more logical extension of the insight that had produced the shareware culture a few years before the start of the GNU Project and the FSF. With the emergence of the personal computer, entrepreneurs were finding that "the barriers to entry were so low that you didn't need a lot of the overhead that was involved in commercial stuff, and you might just be able to get away with trusting people to pay you. There was much blind feeling around the concept of producing stuff in some sort of context that was different from cars." 
According to Vinge, what the GPL and the software and institutions that have grown up around it have produced is "a platform for experimenting with social invention. In the 20th and 19th century, if you wanted to experiment with a new infrastructure for people to interact in, in most cases, like with the railroads, you needed enormous effort. And now -- we can actually do social experiments -- cooperative experiments -- much more cheaply, and you can design ways for people to interact based on just the software guiding what the interactions are like." Vinge acknowledges that the consequences have not always been beneficial. "One thing the last ten years have proved is that we seem to be very bad at thinking how stuff can be abused," he says, no doubt thinking of such phenomena as crackers and online predators. "Any time you can make something a hundred or a thousand times cheaper than it was before, there are probably side-effects. But there's a tendency when something works really, really well to push it hard and deliberately avoid thinking about side-effects." Still, the main change has been beneficial overall in Vinge's view. In particular, he says: "One nice thing is that the price of failure is a lot lower than what you might imagine in the 19th century. Say someone spent ten million 1850 dollars, to make steam-powered dirigibles. Now, it doesn't work, and you've just spent a lot of money, and you don't have anything except a lot of ruined effort. Now, there's still ruined effort if something doesn't work out, but you can retarget or repurpose much more easily, and you can justify taking much larger leaps of faith than you could in 1850." The result is that more experimentation, and more and quicker development, become possible. In this view, free software represents the currently most-advanced realization of the possibilities inherent in computer technology. 
"It's an interesting, science-fictiony, parallel-world story to imagine what would have happened if Richard Stallman hadn't come along with the GPL," says Vinge. "Without Richard Stallman's insight, I think we would have eventually got something like what we got with free software, but it would have been a very interesting muddle. [The process] could have gone for years, and it could easily have gone on so many years that it impacted the era in which really large stuff can be built in the free model. So, overall, I think we would have got something, but, even now, the low overhead involved and even the insight that comes from the GPL would not be with us." In other words, the GPL and modern computer structures are all "in the tradition of the last few centuries. They're taking the traditions that we saw with the industrial revolution and adding several layers of magnitude to that flexibility." Bringing on The Singularity Although speculation is part of Vinge's stock in trade as an SF novelist, he is cautious about predicting the future. "I always rush to say, 'Terrible things could happen!'" he says. "A giant meteor could hit the earth, or a civil war could happen." However, caution aside, Vinge does concede that "we have the tools to keep running along the same lines for some time. And, in the absence of disaster, it quickly runs to the point where you're talking about stuff that's of the same significance as the rise of the human race within the animal kingdom." In other words, the Singularity arrives. Vinge does not offer a map of exactly how free software and its infrastructure will lead to the Singularity. But, given the probable inability of humans to understand super-human intelligence, he should not be expected to do so. "It's easy to imagine," he says, "but you run out of adjectives and high-sounding words that could mean anything to someone like us." 
All that can really be said is that, as the latest manifestation of the historical trend to increasingly complex cooperative infrastructures, free software plays a large role in creating a future in which the Singularity becomes increasingly inevitable. "I think that's going to happen in the relatively near historical future," says Vinge. "And these sorts of trends are all consistent with that possibility." Meanwhile, Vinge is personally content with the improvements that have come to free software in the last couple of years. He is particularly pleased that you can download and install a stable and easy to use operating system in an afternoon. "If you look back over the last ten years, you see how easy it's become to do things," he says. "It's silly to put number to this, but it's ten or a hundred times easier now. I can remember spending days getting PPP to work. And now, you just plug this cable into that socket, and it works. I feel much more able to do what I have to do without having to worry very much, without having Catch-22s nibble me to death. Things have really come together in a coherent and useful way." Fedora and CAPP Removing the ability for regular users to execute "system" programs has a certain appeal, but does it really provide any extra security? A thread on the fedora-devel mailing list explores that question in the context of usermod (and other, similar tools), which had their permissions changed more than two years ago in an effort to meet security certification requirements. Whether these changes, and at some level the certifications themselves, actually increase the security of the system is the open question. Callum Lerwick noticed that running usermod no longer worked as a regular user. He has a habit of doing that to get a quick overview of the command syntax and options from the help page, but unless he uses sudo, that doesn't work. 
That was done on purpose as Steve Grubb describes: These should have been gone for quite a while...and on purpose. You cannot do anything with them unless you are root. Allowing anyone even to execute them would require lots of bad things for our LSPP/CAPP evaluations. LSPP and CAPP are two protection profiles that are used for Common Criteria security certifications (such as EAL3) that Red Hat Enterprise Linux (RHEL) has earned. Because these tools can modify trusted databases (e.g. /etc/shadow), attempts to run them by untrusted users must be added to the audit log in order to comply with the certifications. But adding audit events requires the CAP_AUDIT_WRITE capability bit; in today's systems that effectively means setuid(0). As Grubb puts it: "IOW, if we open the permissions, we need to make these become setuid root so that we send audit events saying they failed." Leaving aside the idea that only processes with root permissions are allowed to generate auditable events—which seems a bit bizarre—there is still the question of how much protection is provided by changing the file permissions. Seth Vidal asks: And do we seriously think we can keep the code away from a non-root user by chmodd'ing the binaries? A user can get a binary for anything fedora can install in about 30s w/firefox. Allowing users to download binaries "takes the system out of the certified configuration", according to Grubb, "So, if you need to be in the CAPP certified configuration, don't let users do this." This fairly clearly demonstrates the dubious nature of the security afforded by the current certifications. For the most part, the protection profiles define away nearly all of the interesting threats that most systems face today. To a large extent, CAPP/LSPP certifications are the kinds of things listed in marketing materials for "enterprise" operating systems rather than serious attempts to address the real security needs of the vast majority of network connected systems. 
Grubb provides an excellent overview of some of the requirements of CAPP, along with how they are implemented in Fedora as part of the discussion. The CAPP information page gives the full story, however: The CAPP provides for a level of protection, which is appropriate for an assumed non-hostile and well-managed user community requiring protection against threats of inadvertent or casual attempts to breach the system security. The profile is not intended to be applicable to circumstances in which protection is required against determined attempts by hostile and well-funded attackers to breach system security. But CAPP does require that all attempts to modify trusted databases like the shadow password file generate an audit trail, so there is a lower-level audit rule set up for that file. Any access to /etc/shadow, for example, is logged as Grubb describes in his overview. That, though, begs other questions as Lerwick points out: So we *are* auditing low level filesystem calls? So then what, other than security theater, does auditing execution of usermod gain us? The answer is that auditing execution of usermod by non-root users gains exactly one thing: CAPP compliance. It requires that binaries which modify trusted databases leave an audit trail. Even though any actual attempt to access the underlying file will be logged, just accessing the binary that could modify the file is also something that must be logged. Part of the dismay displayed in the thread comes from the fact that Fedora will probably never be certified with CAPP for any number of reasons. So taking away longstanding user abilities, though there are reasonable alternatives like man usermod, for a certification that won't be done, doesn't sit well with some in the Fedora community. Though, as Jef Spaleta notes, there might be a use for the certification in a Fedora spin: Is there need for certified 'appliance' situations that a new 3rd party could leverage Fedora to create? 
I can imagine all sorts of no network software appliance situations where the CAPP certification applies and a Fedora derived image would be a good development target. There is always going to be tension between the security needs of an "enterprise" distribution like RHEL and a more user/desktop-oriented distribution like Fedora. While the specific reduced functionality in this case is fairly minimal, the discussion increased the visibility of the auditing required for certification as well as what that means for both distributions. The original decision was made back in the Fedora Core days when there was much less visibility and community input into the process. Discussions like this will only help continue the process of opening up Fedora while also exposing some of the inadequacies of security certifications. Problems with Fedora 10 LWN has received several emails regarding bugs in Fedora. These are serious bugs that can prevent you from installing new updates, or new packages of any kind. Fedora users may want to be aware of the following and, perhaps, wait until things settle down a bit. To start things off, bug #475068 was reported for Fedora 9 with x86_64. This bug is present in Fedora 10 and also affects x86 systems. There was a workaround for this bug, for Fedora 10 users, involving using yumdownloader to install an older version of dbus. Unfortunately the older packages won't show up on all mirrors. It is still possible to recover from this bug by manually editing /etc/dbus-1/system.conf and rebooting the system. Fedora 9 users will need this version of PackageKit. For Fedora 10 you'll want this version of PackageKit. Bug #475069 covers a dbus access problem with bluez. 
If you are seeing the error message: "Agent registration failed: A security policy in place prevents this sender from sending this message to this recipient, see message bus configuration file (rejected message had interface "org.bluez.Adapter" member "RegisterAgent" error name "(unset)" destination "org.bluez").", this may help. Fedora 9 users will want bluez-utils-3.36-3.fc9. Fedora 10 users should grab bluez-4.22-2.fc10. If you are still running Fedora 8 the proper package to get is bluez-utils-3.35-5.fc8. Another bug that may be troubling you is bug #469434, in which subnetmask settings are not saved. For some people this has been fixed. That fix did not seem to work for everyone though. The system-config-network-1.5.94-2.fc10 update does seem to work. If you run into the error "PackageKit failed to get a TID" you will want to see this forum thread about the problem, which affected several people on December 7, 2008. So far, no fix seems to be forthcoming. Bugs in PackageKit are especially troubling for some, since you can't install an update using the GUI tools. Your editor completed a fresh install of Fedora 10 last weekend on an aging Thinkpad laptop. After the usual update she could no longer find or update any packages. A manual yum update did not help. It would appear that bug #475656 addresses the error "failed to get a TID: A security policy in place prevents this sender from sending this message to this recipient...". No doubt an SELinux expert could edit the offending policy. The rest of us will have to wait for a fix. Editor's note: as noted in the comment below, this is a DBus security problem and has nothing to do with SELinux. This last bug was reported December 9, and by December 10 a fix was already being tested. A look at KOffice 2.0 Beta 3 The KDE office application suite, KOffice, is getting closer to its 2.0 release. Beta 3 was announced November 19, with another beta due any day. 
The final release is expected early next year, so it seems like a good time to take it for a spin. The beta releases are available for Kubuntu Intrepid Ibex (8.10), making it relatively easy to try out. There are also openSUSE and Debian packages available as well as source code (of course). The author didn't look forward to trying to build KOffice on his normal Fedora 9 desktop, so borrowing an Intrepid laptop from the wife was in order; after that enabling the "Unsupported Updates" and installing the koffice-kde4 package (which didn't seem to work through the GUI, but apt-get worked just fine) is all that it took. The initial impression was a bit rocky as most of the small handful of ODF files that were opened caused KOffice to crash. It is a beta, though, so some of that is to be expected. Trying again with the imminent Beta 4 and filing bugs for failures should be high on the author's list. The one presentation file that successfully opened in KPresenter seemed to have lost much of the formatting that was present in the original, which was also disheartening. It should be noted that the author is hardly an office suite "power user". Normally, OpenOffice.org is used for minimal business documents (invoices mainly), simple spreadsheets (expense reports, football pools), and boring, bullet-list slides for presentations (as anyone who has been to one will attest). By and large, these simple needs are met by OpenOffice, with the added bonus of being mostly able to open the various Microsoft-format documents that unfortunately cross the desktop. Any other office suite with similar capabilities would serve just as well. Opening spreadsheets in KSpread provided the most reliable experience when opening existing documents, but there were still a number of problems. Formulas did not calculate automatically regardless of the auto-recalculate setting, but the data was there, unlike some of the other document types. 
KWord seemed to be unable to open any of the ODF documents tried, crashing in all cases. One "handy" .doc file opened, but the formatting and contents were mangled; OpenOffice can reproduce the formatting of that document pretty well. KWord also crashed on exit from that document. Perhaps betas are not the place to try opening existing files. There clearly are many new features in KOffice 2.0, but the major ones, porting to KDE4/Qt4 and using the Flake object library throughout, are infrastructural in nature; they aren't obvious to users. Much like KDE 4.0, it would appear that KOffice 2.0 is a launching pad for subsequent releases. There is an emphasis on a consistent user interface between the various applications which does stand out when using KOffice. For better or worse, the OpenOffice interface is fairly consistent between applications as well, but seems more cluttered, or more poorly organized somehow. Using Flake everywhere will be a boon to those who are power users as it treats everything as a "shape" that can be transformed (via scale, rotate, skew) and moved between any of the separate applications. Vector graphics can cohabitate with raster graphics and text easily. Using KOffice 2.0 is fairly straightforward for simple tasks. It is noticeably slower than OpenOffice on the same hardware. Opening files, even empty documents, seems to take an inordinate amount of time. Even moving around within KSpread or KWord seemed sluggish. Presumably these are things that will be fixed; whether that will be in the next few months or for KOffice 2.1 remains to be seen. This beta gives the impression of great promise, but not yet a very usable tool. Of course, there is more to KOffice than just the three applications mentioned. The database application Kexi is not yet part of the KOffice 2.0 release, nor is the Visio-like flowchart program Kivio. Two drawing applications, Karbon14 for vectors and Krita for raster graphics, have been released with the beta. 
Other than a quick startup to see if the interface was consistent with the rest of the suite (it was), the author didn't try them. The same goes for KPlato, the project management and planning application, though it has a rather different look - no toolboxes on the right hand side - likely because of its very different needs. Perhaps unfairly, the author expected a bit more from this beta release. It would seem there is still a fair amount of work to do before the final 2.0 version, but there are still a few months left. For whatever reason, previous attempts to use KOffice had always caused the author to quickly switch back to OpenOffice. Even though there were so many problems, this KOffice - or more likely 2.1 - somehow seems more plausible to switch to. Another look in a few more months is likely called for. The FSF raises the stakes for Cisco On December 11, the Free Software Foundation announced the filing of a GPL-infringement lawsuit against Cisco. This action represents another step in a long series of license-compliance issues involving Cisco and its subsidiaries. It may look like just another licensing lawsuit, but it represents an interesting step in the evolution of attitudes toward compliance with the GPL. The eventual outcome is fairly predictable, but the process is still worth watching. Cisco does look like a serial offender with regard to the GPL. Most of its problems in this area were actually acquired with its purchase of Linksys; routers made by Linksys have been followed by GPL issues since at least 2003. Over those years, a fairly consistent pattern has developed: a new Linksys product is released which, upon inspection, is determined to be running GPL-licensed software. There is no corresponding source release, which is a clear violation of the GPL. After a series of contacts and negotiations, some of the copyright holders involved succeed in getting a source release - though that release is not always as complete as it should be. 
The problem appears to be solved - until the next product comes out. The sad part is that there is almost certainly no real desire on the part of Cisco or Linksys to violate the GPL. The company is being set up for trouble by its suppliers - the firms based in the far east which actually make the hardware sold under the Linksys name. Those suppliers feel, perhaps with good reason, that they need not concern themselves with the details of license compliance. There is not, after all, much of a history of successful license enforcement in that part of the world. So they deliver an infringing product which Cisco then resells; it could well be that Cisco honestly has no idea that those products incorporate software in violation of its license. Of course, it could also be that Cisco does not really want to know about such problems. Nameless original equipment manufacturers in China are a difficult target for those who would enforce the GPL; a high-profile American company is clearly easier game. Beyond that, though, Cisco is a legitimate target for a lawsuit: the company is distributing GPL-licensed software without making the source available. It is also an appealing target because Cisco is in a position to apply pressure on those nameless suppliers: if a company of that size refuses to resell equipment which does not come with fully-licensed software (whether free or proprietary), its suppliers will learn to pay attention. The FSF is arguing, in essence, that it is Cisco's responsibility to put a program in place to ensure that its suppliers are delivering properly-licensed software. It is Cisco which should be finding licensing problems in its products, not the owners of the code it is using. The complaint [PDF] describes a long series of meetings with Cisco. Several times, the complaint says, "Defendant corresponded with Plaintiff repeatedly regarding the matter and Plaintiff believed in good faith that a satisfactory resolution of its concerns could be reached." 
But then more problems always turned up. So, after a few years, the FSF has given up: Given Defendant's extensive history of violating Plaintiff's Licenses, Plaintiff considers Defendant's current and proposed activities insufficient to ensure Defendant's future compliance. Defendant has refused to meet several of Plaintiff's reasonable requirements for reinstatement of Defendant's right to distribute the Programs. Defendant has not demonstrated that it has meaningfully improved its software review process which failed to prevent previous violations, or that it intends to do so. Defendant has refused to acknowledge its previous violations or inform the users who received Infringing Products of its omissions. And Defendant has refused to provide regular compliance reports to Plaintiff regarding Defendant's pervasive exploitation of Plaintiff's software. Nonetheless, Defendant continues to distribute the Infringing Products and Firmware in violation of Plaintiffs' exclusive rights under the Copyright Act. The complaint alleges that Cisco is guilty of copyright infringement. The court is asked to provide injunctive relief - taking the offending products off the market. The FSF is also asking for damages, attorney's fees, and "all profits derived by Defendant from its unlawful acts." All this would be a heavy price for Cisco to pay. And it could well be that a court would go along with most of these requests. The fact of the matter, though, is that things are unlikely to get that far. Unlike, say, SCO, Cisco has not made any statements about the validity of the GPL. It is an active contributor to GPL-licensed projects, including the Linux kernel. Cisco's behavior looks more like negligence than malice. This suit will probably get the attention of people in very high levels of management at Cisco; they, in turn, will almost certainly come to the table and find a way to make the FSF go away. There is no value for them in any other course of action. 
So this episode will blow over, probably within a few months. But there are still a couple of interesting things to note here. One is that the Linux kernel is not involved in this suit at all, and neither is Busybox. Those two projects have been at the center of most GPL-enforcement actions thus far. The FSF, though, is focusing on projects that it owns: glibc, GCC, coreutils, binutils, gdb, and wget. That widens the scope somewhat, showing that GPL compliance is not just required for a small number of programs. Incidentally, all of the code at issue in this suit is licensed under GPLv2; version 3 of the license is not part of this action. This suit also marks a bit of a change for the FSF, which, traditionally, has strongly favored quiet resolution of GPL-compliance issues. It seems that even the FSF has a point where its patience runs out. It may also be that the influence of the Software Freedom Law Center, which appears to be rather more willing to go to court, is being felt at the FSF. In any case, it is reasonable to expect that the FSF might find itself involved in more legal actions in the future. This lawsuit will doubtless be used by people to show how use of GPL-licensed software can create risks for companies. The truth is more straightforward, though. Use of any copyrighted material without an accompanying license is generally against the law; incorporating such material into products will always be a risky thing to do. There is nothing special about the GPL in that regard. The Grumpy Editor's 2008 retrospective Holidays are an exercise in tradition. One of the more charming holiday traditions around LWN is to look at the predictions made at the beginning of the year and measure them against reality. There is, after all, great value in things which make us laugh. This year's predictions were featured in the January 3, 2008 edition. As might be expected, some of them were better than others. 
What was predicted Your editor's first prediction was that support for Flash playback would mature in 2008. In some sense, that may be true. Your editor's desktop system, running the Rawhide build of Gnash, can now faithfully display a wide variety of Flash ads, web site "intros," and various other thoroughly useless bits of media. A Flash-based "interactive tour" offered by LWN's bank worked nicely. But support for many other Flash features, including audio and simple playback from online sites, still is not especially solid, and other interactive Flash applications do not work at all. This problem, it seems, is still not solved. The prediction of the KDE 4.0 release required little in the way of foresight, as did the prediction that users would be unhappy. That stage was well set before the beginning of the year. A continued focus on power management was also an easy thing to foresee; there will be great value in making our systems more power-efficient into the indefinite future. Flush from those two obvious successes, your editor went off and stated that the bulk of the realtime tree would be merged into the mainline kernel by the end of the year. Oh well. Your editor should know by now that expecting deterministic merge times for realtime patches is a sure path to disappointment; latencies in this area are always higher than one would like. In this case, the realtime developers got stuck in a high-priority interrupt (taking over the x86 architecture) with the result that realtime work got preempted and suffered from severe starvation. As predicted, debate over Microsoft's OOXML format continued, and Microsoft succeeded in obtaining standard status for that format anyway. Things have since gotten quieter, though, perhaps because people see it as a done deal and no longer worth fighting about. The GPL was the subject of two predictions this year. One was that more projects, perhaps even glibc, would move to GPLv3. 
There is a steady stream of analyst verbiage to the effect that GPLv3 is quickly growing in popularity (example), but the truth of the matter is that the number of conversions in projects which really matter appears to be low. Projects with significant numbers of developers and users continue to approach GPLv3 with caution. The other prediction was that GPL enforcement actions would continue, and perhaps grow. The recent FSF lawsuit against Cisco makes it clear that the GPL enforcers are serious about what they are doing. Your editor cannot help but wonder, though, whether the increasingly litigious actions by the Software Freedom Law Center might not eventually lead to a serious backlash within the community. We are about freedom, not punitive damages. Enforcement of the GPL is necessary if we expect our licenses to be taken seriously, but overly zealous - or greedy - litigation could encourage those who say that use of free software exposes companies to an unacceptable level of risk. Your editor included a rosy prediction about the One Laptop Per Child project and where it would go over the course of the year. In fact, OLPC has continued to work toward its goal of putting laptops into the hands of children around the world. But your editor completely missed the way internal divisions would rise to the surface and distract OLPC developers from what they are trying to do. OLPC seems to have moved beyond the worst of that, and much-needed development on the Sugar software continues. But the project seems far from its original goals, and the increasing popularity of ultra-mobile systems, while vindicating the original vision behind the OLPC hardware, threatens to render the XO hardware obsolete and irrelevant. Ever the optimist, your editor said that the days of hardware hassles would be over. We are closer. Finding an off-the-shelf system - server, desktop, laptop, or palmtop - which is fully supported by Linux is now easily done. 
OK, maybe the modem is not supported, but few people will be inconvenienced by that omission anymore. That said, there will probably never be a shortage of uncooperative hardware manufacturers; if we value our free operating system, we must continue to support manufacturers who work with our community, and avoid those which do not. The prediction that the intensity of competition between distributors would increase was reasonably well satisfied. One need only look at Novell's "migrate from Red Hat" offering or the continued attacks on Ubuntu, not all of which have to do with its community participation. Finally, the three "community" predictions at the end of last January's article were all satisfied reasonably well. None of them were especially daring, so that should not be surprising. What was not predicted One commenter in January asked about the lack of predictions about SCO. In December, it is hard to say that SCO deserved a place there. The company still exists in some form, but it no longer has much to warrant the attention of the Linux community. Your editor predicts that there will be no SCO predictions in 2009 either. So what else did your editor miss? Perhaps at the top of the list is the evolution of the Linux platform as it is used in mobile devices, and in cellular telephones in particular. Google's (unpredicted by your editor) Android platform has made a splash, regardless of what one might think of its openness. The first Android phone has been reasonably well received, and it would appear that more are on the way. The merger of the LiPS and LIMO consortia shows that some consolidation is happening in this area. The announced plans to open Symbian were also an interesting development. In the near future, the handset business seems likely to be firmly dominated by free software - though, alas, the bulk of those handsets will not be designed to pass the benefits of that freedom on to their owners. 
Your editor has often predicted software patent troubles, though he did not do so in 2008. What was completely unforeseen, though, was Red Hat's resolution with Firestar Software. The company got itself out of a patent bind, and, in the process, removed the patent as a threat to the wider development and user community too. We may see this sort of solution repeated for patent problems in the future - if we are lucky. Finally, unpredicted - and unpredictable - was the series of "infrastructure issues" which shut down much of the Fedora project for a good month. That episode showed us a number of things: how much some of us depend on our distributors' infrastructure, how vulnerable we can be to intrusions, and how the interests of the companies behind some distributions can interfere with the availability of useful information. Months after the fact, we still have no idea what happened with the Fedora project; it is not unreasonable to wonder if we will ever know. Despite problems like that, and other small distractions (the total meltdown of the global financial system, for example), Linux has only grown stronger over the last year. Our community has grown, our software has gotten better, and the economy around free software has gotten stronger. Your editor predicted that, too, but not even he is so arrogant as to claim credit for having foreseen something nearly as obvious as the sunrise. SLQB - and then there were four The Linux kernel does not lack for low-level memory managers. The venerable slab allocator has been the engine behind functions like kmalloc() and kmem_cache_alloc() for many years. More recently, SLOB was added as a pared-down allocator suitable for systems which do not have a whole lot of memory to manage in the first place. Even more recently, SLUB went in as a proposed replacement for slab which, while being designed with very large systems in mind, was meant to be applicable to smaller systems as well. 
The consensus for the last year or so has been that at least one of these allocators is surplus to requirements and should go. Typically, slab is seen as the odd allocator out, but nagging doubts about SLUB (and some performance regressions in specific situations) have kept slab in the game. Given this situation, one would not necessarily think that the kernel needs yet another allocator. But Nick Piggin thinks that, despite the surfeit of low-level memory managers, there is always room for one more. To that end, he has developed the SLQB allocator which he hopes to eventually see merged into the mainline. According to Nick: I've kept working on SLQB slab allocator because I don't agree with the design choices in SLUB, and I'm worried about the push to make it the one true allocator. Like the other slab-like allocators, SLQB sits on top of the page allocator and provides for allocation of fixed-sized objects. It has been designed with an eye toward scalability on high-end systems; it also makes a real effort to avoid the allocation of compound pages whenever possible. Avoidance of higher-order (compound page) allocations can improve reliability significantly when memory gets tight. While there is a fair amount of tricky code in SLQB, the core algorithms are not that hard to understand. Like the other slab-like allocators, it implements the abstraction of a "slab cache" - a lookaside cache from which memory objects of a fixed size can be allocated. Slab caches are used directly when memory is allocated with kmem_cache_alloc(), or indirectly through functions like kmalloc(). In SLQB, a slab cache is represented by a data structure which looks very approximately like the following: (Note that, to simplify the diagram, a number of things have been glossed over). The main kmem_cache structure contains the expected global parameters - the size of the objects being allocated, the order of page allocations, the name of the cache, etc. 
But scalability means separating processors from each other, so the bulk of the kmem_cache data structure is stored in per-CPU form. In particular, there is one kmem_cache_cpu structure for each processor on the system. Within that per-CPU structure one will find a number of lists of objects. One of those (freelist) contains a list of available objects; when a request is made to allocate an object, the free list will be consulted first. When objects are freed, they are returned to this list. Since this list is part of a per-CPU data structure, objects normally remain on the same processor, minimizing cache line bouncing. More importantly, the allocation decisions are all done per-CPU, with no bad cache behavior and no locking required beyond the disabling of interrupts. The free list is managed as a stack, so allocation requests will return the most recently freed objects; again, this approach is taken in an attempt to optimize memory cache behavior. SLQB gets its memory in the form of full pages from the page allocator. When an allocation request is made and the free list is empty, SLQB will allocate a new page and return an object from that page. The remaining space on the page is organized into a per-page free list (assuming the objects are small enough to pack more than one onto a page, of course), and the page is added to the partial list. The other objects on the page will be handed out in response to allocation requests, but only when the free list is empty. When the final object on a page is allocated, SLQB will forget about the page - temporarily, at least. Objects are, when freed, added to freelist. It is easy to foresee that this list could grow to be quite large after a burst of system activity. Allowing freelist to grow without bound would risk tying up a lot of system memory doing nothing while it is possibly needed elsewhere. 
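The allocation order described so far (free list first, then a partial page, then a fresh page) can be illustrated with a toy model. This is only a sketch in Python, not SLQB's kernel code: the class names, the `(page, slot)` tuples standing in for objects, and the `objects_per_page` parameter are all invented for illustration, and a single instance stands in for one processor's kmem_cache_cpu structure.

```python
class Page:
    """Toy stand-in for a page obtained from the page allocator."""

    def __init__(self, page_id, objects_per_page):
        self.page_id = page_id
        # Per-page free list: slots that have not yet been handed out.
        self.free_slots = [(page_id, i) for i in range(objects_per_page)]


class PerCpuCache:
    """Hypothetical model of one CPU's view of a slab cache."""

    def __init__(self, objects_per_page=4):
        self.objects_per_page = objects_per_page
        self.freelist = []        # freed objects, managed as a stack
        self.partial = []         # pages that still have unallocated slots
        self.pages_allocated = 0  # pages requested from the "page allocator"

    def _new_page(self):
        page = Page(self.pages_allocated, self.objects_per_page)
        self.pages_allocated += 1
        self.partial.append(page)
        return page

    def alloc(self):
        # The free list is consulted first; pop() returns the most
        # recently freed object, which is likely still cache-hot.
        if self.freelist:
            return self.freelist.pop()
        # Free list empty: take a slot from a partial page, or get a new page.
        page = self.partial[0] if self.partial else self._new_page()
        obj = page.free_slots.pop(0)
        if not page.free_slots:
            # Final object allocated: forget about the page for now.
            self.partial.remove(page)
        return obj

    def free(self, obj):
        # Freed objects go back on this CPU's free list.
        self.freelist.append(obj)


cache = PerCpuCache(objects_per_page=2)
first, second, third = cache.alloc(), cache.alloc(), cache.alloc()
cache.free(first)
cache.free(second)
print(cache.alloc() == second)  # the most recently freed object comes back first
```

The model deliberately lets `freelist` grow without limit, which is exactly the problem the watermark mechanism is meant to address.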
So, once the size of the free list passes a watermark (or when the page allocator starts asking for help freeing memory), objects in the free list will be flushed back to their containing pages. Any partial pages which are completely filled with freed objects will then be returned back to the page allocator for use elsewhere. There is an interesting situation which arises here, though: remember that SLQB is fundamentally a per-CPU allocator. But there is nothing that requires objects to be freed on the same CPU which allocated them. Indeed, for suitably long-lived objects on a system with many processors, it becomes probable that objects will be freed on a different CPU. That processor does not know anything about the partial pages those objects were allocated from, and, thus, cannot free them. So a different approach has to be taken. That approach involves the maintenance of two more object lists, called rlist and remote_free. When the allocator tries to flush a "remote" object (one allocated on a different CPU) from its local freelist, it will simply move that object over to rlist. Occasionally, the allocator will reach across CPUs to take the objects from its local rlist and put them on remote_free list of the CPU which initially allocated those objects. That CPU can then choose to reuse the objects or free them back to their containing pages. The cross-CPU list operation clearly requires locking, so a spinlock protects remote_free. Working with the remote_free lists too often would thus risk cache line bouncing and lock contention, both of which are not helpful when scalability is a goal. That is why processors accumulate a group of objects in their local rlist before adding the entire list, in a single operation, to the appropriate remote_free list. On top of that, the allocator does not often check for objects in its local remote_free list. 
Instead, objects are allowed to accumulate there until a watermark is exceeded, at which point whichever processor added the final objects will set the remote_free_check flag. The processor owning the remote_free list will only check that list when this flag is set, with the result that the management of the remote_free list can be done with little in the way of lock or cache line contention. The SLQB code is relatively new, and is likely to need a considerable amount of work before it may find its way into the mainline. Nick claims benchmark results which are roughly comparable with those obtained using the other allocators. But "roughly comparable" will not, by itself, be enough to motivate the addition of yet another memory allocator. So pushing SLQB beyond comparable and toward "clearly better" is likely to be Nick's next task. System calls and 64-bit architectures Adding a system call to the kernel is never done lightly. It is important to get it right before it gets merged because, once that happens, it must be maintained as part of the kernel's binary interface forever. The proposal to add preadv() and pwritev() system calls provides an excellent example of the kinds of concerns that need to be addressed when adding to the kernel ABI. The two system calls themselves are quite straightforward. Essentially, they combine the existing pread() and readv() calls (along with the write variants of course) into a way to do scatter/gather I/O at a particular offset in the file. Like pread(), the current file position is unaffected. The calls, which are available on various BSD systems, can be used to avoid races between an lseek() call and a read or write. Currently, applications must do some kind of locking to prevent multiple threads from stepping on each other when doing this kind of I/O. 
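For a feel of what a positioned scatter read looks like from user space, here is a minimal sketch using Python's os.preadv() wrapper (available in Python 3.7 and later on platforms that provide the underlying call). The read_record() helper, the record layout, and the sample data are invented for illustration; none of it comes from the patchset.

```python
import os
import tempfile


def read_record(fd, offset, header_len, body_len):
    """Fill a header and a body buffer from `offset` in a single call."""
    header = bytearray(header_len)
    body = bytearray(body_len)
    # Scatter read: the buffers are filled in order, starting at `offset`.
    # The file position of `fd` is neither used nor changed.
    nread = os.preadv(fd, [header, body], offset)
    return nread, bytes(header), bytes(body)


with tempfile.TemporaryFile() as f:
    f.write(b"xxxxHEADbody-bytes")   # 4 bytes of padding, then the record
    f.flush()
    fd = f.fileno()
    os.lseek(fd, 0, os.SEEK_SET)     # park the file position at 0
    nread, header, body = read_record(fd, 4, 4, 10)
    position_after = os.lseek(fd, 0, os.SEEK_CUR)
```

Because the offset travels with the call, there is no window between a seek and a read in which another thread could move the file position, so no lock around the pair is needed.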
The prototypes for the functions look much like readv/writev, simply adding the offset as the final parameter:

    ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);

But, because off_t is a 64-bit quantity, this causes problems on some architectures due to the way system call arguments are passed. After Gerd Hoffmann posted version 2 of the patchset, Matthew Wilcox was quick to point out a problem: "Are these prototypes required? MIPS and PARISC will need wrappers to fix them if they are. These two architectures have an ABI which requires 64-bit arguments to be passed in aligned pairs of registers, but glibc doesn't know that (and given the existence of syscall(3), can't do much about it even if it knew), so some of the arguments end up in the wrong registers." Several other architectures (ARM, PowerPC, s390, ...) have similar constraints.

Because the offset is the fourth argument, it gets placed in the r3 and r4 32-bit registers, but some architectures need it in either r2/r3 or r4/r5. This led some to advocate reordering the parameters, putting the offset before iovcnt to avoid the problem. As long as that change doesn't bubble out to user space, Hoffmann is amenable to making the change: "I'd *really* hate it to have the same system call with different argument ordering on different systems though". Most seemed to agree that the user-space interface as presented by glibc should match what the BSDs provide; anything else causes too many headaches for folks trying to write standards or portable code. To fix the alignment problem, the system call itself takes the reordered version of the arguments.

That led to Hoffmann's third version of the patchset, which still didn't solve the whole problem. There are multiple architectures that have both 32-bit and 64-bit versions, and the 64-bit kernel must support system calls from 32-bit user-space programs. Those programs will put 64-bit arguments into two registers, but the 64-bit kernel will expect that argument in a single register.
Because of this, Arnd Bergmann recommended splitting the offset into two arguments, one for the high 32 bits and one for the low: "This is the only way I can see that lets us use a shared compat_sys_preadv/pwritev across all 64 bit architectures". When a 32-bit user-space program makes a system call on a 64-bit system, the compat_sys_* version is used to handle differences in the data sizes. If a pointer to a structure is passed to a system call, and that structure has a different representation in 32-bits than it does in 64-bits, the compat layer makes the translation. Because different 64-bit architectures do things differently in terms of calling conventions and alignment requirements, the only way to share compat code is to remove the 64-bit quantity from the system call interface entirely. That just leaves one final problem to overcome: endian-ness. As Ralf Baechle notes, MIPS can be either little or big-endian, so the compat_sys_preadv/pwritev() needs to put the two 32-bit offset values together in the proper way. He recommended moving the MIPS-specific merge_64() macro into a common compat.h include file, which could then be used by the common compat routines. So far, version 4 of the patchset has not emerged, but one suspects that the offset argument splitting and use of merge_64() will be part of it. The implementation of the operation of preadv() and pwritev() is very obvious, certainly in comparison to the intricacies of passing its arguments. The VFS implementations of readv()/writev() already take an offset argument, so it was simply a matter of calling those. It is interesting to note that as part of the review, Christoph Hellwig spotted a bug in the existing compat_sys_readv/writev() implementations which would lead to accounting information not being updated for those calls. This is not the first time these system calls have been proposed; way back in 2005, we looked at some patches from Badari Pulavarty that added them. 
Other than a brief appearance in the -mm tree, they seem to have faded away. Even if this edition of preadv() and pwritev() does not make it into the mainline (so far there are no indications that it won't), the code review surrounding it was certainly useful. Getting a glimpse of the complexities around 64-bit quantities being passed to system calls was quite informative as well.

Profiling the Power Usage of a Desktop PC

Reducing the power usage of a desktop computer can bring about a number of benefits. Whether your goal is to save money on your power bill, reduce your carbon footprint or eliminate unwanted heat and noise from your office, a bit of effort can produce a more power-efficient computer. Effort spent reducing power can have an even larger effect on servers and other machines that run 24 hours a day than on machines that are only on during work hours. This work was done on a nearly ten-year-old PC, but the process still applies to more modern hardware.

The test setup consisted of an opened-up desktop PC, a P3 International Kill-a-watt meter and a collection of peripheral cards and disk drives. The Kill-a-watt has a 1W resolution; if a reading alternated between two values, such as 8 and 9 Watts, the estimated value was called 8.5 Watts. Some of the measurements made were small enough that they were "in the noise". Other variables included devices with inconsistent power usage and inconsistent line voltage. The resulting measurements show the actual power used by the power supply; this may differ from the DC power used by the tested components. Lastly, the Kill-a-watt meter also shows power factor; a fairly consistent value of 0.67 was read.

The tests were performed on the machine while it was in a number of different software states. Many of the tests were done at the BIOS prompt; the disk drive and network adapter tests were done while the machine was running Linux (Ubuntu 8.10).
Power consumed by external devices such as the LCD video monitor and amplified speakers was not taken into account. When a peripheral such as a disk drive was removed for a test, the drive was disconnected from power and the interface cable was removed to eliminate possible power consumption by bus termination resistors.

The tested computer used a fairly old, but still adequate, Asus A7V333 motherboard with an AMD Athlon 1700 processor clocked at 1466 MHz. The RAID option was not present on the motherboard. A pair of 256MB PC2700 DIMMs were used for the memory. The power supply was a 300W Antec PP-303X. Initially, the machine was loaded down with two hard drives, both CDR and DVD-RW drives, a floppy drive, an AGP video card with an ATI Radeon 8500 GPU, and both wired and wireless 802.11 networking cards. The machine was shut down, all of the PCI and AGP cards were removed and the disks were disconnected.

The first power test involved the PC2700 memory DIMMs. With no memory, power consumption was 72 Watts. Adding one DIMM caused the power to drop to 67 Watts. Your author guesses that with no memory, the CPU runs in some kind of power-consuming loop. Interestingly, the two DIMMs had significantly different power usage. The Kensington Value Ram with Hynix chips caused the machine to use 73 Watts, versus 67 Watts with the generic Chinese RAM with unbranded chips. With both DIMMs installed, power consumption was 75 Watts. We can deduce that the Kensington RAM used 8 Watts while the Chinese RAM used 2 Watts. Sufficient RAM is critical for good system performance, but the brand seems to be significant in the area of power usage. Tests with additional brands of memory seem to be in order.

Fans consume a fair amount of power. A quick unplugging of the noisy CPU fan caused the power to go from 75 Watts to 72 Watts. The CPU would melt down without this 3 Watt component, so it was left in place. It may be possible to find a more efficient CPU fan.
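The memory and fan figures above are all obtained by subtracting one wall-socket reading from another. The short Python sketch below simply restates that arithmetic with the article's numbers; the variable names are made up for illustration, and the "baseline" line is an inference from those numbers rather than something that was measured.

```python
# Wall-socket readings from the memory test, in Watts.
no_memory = 72
generic_only = 67
kensington_only = 73
both_dimms = 75

# Adding a second DIMM to a machine that already holds the other one
# isolates the draw of the DIMM that was added.
kensington_draw = both_dimms - generic_only    # 8 W
generic_draw = both_dimms - kensington_only    # 2 W

# Implied draw of the rest of the machine once memory is installed.
baseline_with_memory = both_dimms - kensington_draw - generic_draw  # 65 W

# The CPU fan: reading with the fan minus reading without it.
cpu_fan_draw = 75 - 72                         # 3 W
```

The implied 65 Watt baseline is lower than the 72 Watts measured with no memory at all, which is consistent with the guess that a memoryless machine spins in some power-consuming loop.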
The case had a front-mounted "push fan". This consumed around 2 Watts of power. The power supply's built-in fan provides plenty of air circulation, so the front fan was disconnected. This also made the machine a bit quieter.

The floppy drive is virtually useless now that 4GB USB memory sticks can be purchased for under $10. The floppy drive consumes about half a Watt of power, so the savings are small. But big savings can come from many small cuts, so the device was left unplugged.

The Asus CD-S500/A CDR drive consumed about 1 Watt of power when tested; the Sony CRX320E DVD-RW drive consumed about 2 Watts. Most people can get by with a single removable media drive, or none at all. The DVD-RW drive would be the obvious choice for a single-drive system. If one can put up with the occasional inconvenience of rebooting, it should be possible to put a DPDT power switch on the back of the machine to allow shutting off the +5V and +12V lines to the removable media drive. All together, the floppy and two optical drives consumed around 3.5W when idle.

The Radeon 8200 video card was somewhat of a power hog; it consumed around 8 Watts of power with no built-in fan. A lower performance ATI-S3 AGP video card consumed 4 Watts. If high performance video operation (for running Google Earth, for example) is not critical, the S3 card should be sufficient. As with sufficient memory, this sacrifice may not be worth the power savings.

The next part of the power test involved the fixed disk drives. The main boot device was a Western Digital WD600 60GB PATA disk. It consumed about 7 Watts of power at the BIOS prompt; power went up by about 5 Watts when the system was running Linux and the drive was active. Some of this power is likely being consumed by the CPU and memory and some is used to power the disk's head actuator motor.
An auxiliary Western Digital WD2500 250GB SATA drive and its associated SATA PCI adapter card consumed around 9 Watts of power when idle, and about 5 Watts more when active. Interestingly, as the machine was more heavily loaded with drives and peripherals, system activity became less of a factor in overall power consumption. Hard drives are among the more power-hungry devices in a system; putting all of your data on a single drive is a good way to save power.

A generic-brand 10/100 Ethernet controller with an Intel chip consumed about 1 Watt of power at the BIOS level. Running Linux and moving a lot of data across the card caused the power consumption to jump by about 8 Watts; as with the disk drive test, a lot of that increase is likely caused by CPU and memory use. A Hawking Technology HWP54G 802.11 wireless Ethernet card also consumed about 1 Watt when idle and a few Watts more when busy.

The fully loaded system, with 512MB of RAM, two hard drives, two optical drives, two network adapters, the Radeon video card, the floppy disk drive and the front fan, consumed about 108 Watts of power when idle and a similar amount when busy. When the machine was stripped down to one hard drive, no optical or floppy drives, the lower performance S3 video card and no front fan, its power dropped to 80 Watts idle and 88 Watts when busy, or between 74 and 81 percent of the original power consumption. This is enough of a reduction in power usage to justify the effort of testing.

Don't forget that even when it is completely powered down, the computer may still act as a phantom load; this system consumed a full 3 Watts when it was off. An easy remedy to that problem is to route the power plugs for the CPU, video monitor and speaker through a switched power strip.

Debian goes to the polls

It is general resolution season at the Debian Project.
As was discussed here in October, Debian seeks to resolve two questions: one regarding types of developers in the project, and one being the perennial firmware debate. As of this writing, the first vote is done, while the second remains open. But it has become clear that, regardless of the outcome of the firmware vote, this issue has stressed the Debian community, perhaps to the breaking point.

Taking the easier subject first: Joerg Jaspert's proposal to create new classes of Debian developers was always going to be controversial. The real purpose of the associated general resolution was to put the brake on those changes. That purpose was fulfilled; the winning choice in that (low-turnout) vote was "Invite the DAM to further discuss until vote or consensus, leading to a new proposal." So the project will go back to doing one of the things it excels at: talking. What form the membership proposal will have when it re-emerges from discussion - if it ever does - is unclear.

The other vote - open until December 21 - is essentially about whether the upcoming "Lenny" release will be delayed until all known violations of the Debian Free Software Guidelines have been resolved - and whether firmware blobs in the kernel count as such violations. The question being asked is not so simple, though; in fact, Debian developers have no fewer than seven different options to vote upon. The nature of this ballot, how it was constructed, and how it will be decided has led to significant acrimony within the project.

It is worth looking at what the seven options are (each introduced here by its actual ballot title):

1. Reaffirm the Social Contract. The titling of this option is somewhat controversial - all Debian developers committed to supporting the Social Contract before gaining their status. What this option really means is "delay the Lenny release until all DFSG violations known on November 1, 2008 have been resolved."

2. Allow Lenny to release with proprietary firmware. This option allows the Lenny release to happen, as long as no new firmware blobs make their way into the distribution. The language here is quite similar to what has been found in the resolutions allowing the Sarge and Etch releases to happen despite ongoing firmware concerns. This option has been deemed by project secretary Manoj Srivastava to require a three-to-one supermajority vote to pass.

3. Allow Lenny to release with DFSG violations. This choice, also requiring a supermajority, has almost the same effect as option 2.

4. Empower the release team to decide about allowing DFSG violations. Here, the project (again, with a supermajority) would say that it trusts the release team to make the right decisions. The team is currently working toward a release which includes firmware, so, again, the end result would be about the same: allow the Lenny release process to go ahead.

5. Assume blobs comply with GPL unless proven otherwise. The actual text of this choice does not mention the GPL at all; in fact, it reads very much like options 2 and 3. However, this one was not deemed to require a supermajority vote.

6. Exclude source requirements for firmware. This option (which requires a supermajority) says that, for all practical purposes, firmware is not software and, thus, a corresponding source distribution is not required.

7. Further discussion. This outcome seems inevitable regardless of how the developers vote. If it were to win, though, then the outcome of this general resolution would be to decide nothing.

See this posting for the full text of all seven options.

So why are many Debian developers unhappy with this ballot? There would appear to be a few reasons, the first of which is the long list of options. Some developers would have rather seen a simple "can Lenny release or not?" vote, with related issues being handled in a separate resolution. The titles given to some of the choices are seen by some as deceptive.
"Reaffirm the Social Contract" really means "delay Lenny," and "Assume blobs comply with GPL" goes with a resolution that never mentions the GPL at all. Developers who are unhappy with a long, messy ballot are even less happy with option titles which seem confusing at best, or deceptive at worst. Then, there is the matter of the supermajority requirements. Some developers wonder why option 2 requires a three-to-one vote, while an almost identical resolution for Etch did not require a supermajority in 2006. The decision on majority requirements is made entirely by the project secretary, who has the task of determining whether a given resolution "overrides a foundation document" or not. A few developers have made the claim that Manoj's decisions are based less on clear understanding of what really "overrides a foundation document" and more with the goal of ensuring that his own favored outcome wins. That last is, needless to say, a strong charge. As it happens, Manoj is the proposer of the "assume blobs comply with GPL" option; he also seconded options 1 and 2. Two of the options he has publicly supported do not have the supermajority requirement attached to them, so, perhaps, one could argue that Manoj is, indeed, trying to rig the vote. On the other hand, those two options conflict with each other: one would delay Lenny indefinitely, the other would wave the firmware problem away. So if this is an attempt to steal an election, it is one with a highly uncertain outcome, even if it is successful. The more straightforward interpretation - that a long-serving project secretary is interpreting the project's constitution to the best of his understanding, ability, and good faith - would seem to be the more likely alternative. Still, that has not prevented a discussion involving statements like this: Recognizing the validity of the vote is not a "must". The alternative is that we end up in a state of constitutional crisis. 
That's unfortunate, but it's also unfortunate that our Secretary is failing to act in a manner that safeguards the integrity of that office. Other, more reasoned - but still unhappy - voices are pondering the replacement of the project secretary. It turns out that how to do that is not entirely clear, though. Some others have asked project leader Steve McIntyre - who has been conspicuously quiet in this whole discussion - to intervene. He finally responded this way: I've been talking with Manoj already, in private to try and avoid flaming. I specifically asked him to delay this vote until the numerous problems with it were fixed, and it was started anyway. I'm *really* not happy with that, and I'm following through now. What "following through" means remains unclear. The Debian project leader does not command vast powers which can be brought to bear on a problem like this. Debian is an exceptional project in that it operates in a democratic mode under a formal constitution. Unlike many other projects, Debian lacks a benevolent dictator or a backing corporation with the ability to force a decision. So we do not know what Steve will be able to do to resolve this issue. What we do know is that quite a few developers are going to be unhappy with this vote regardless of how it comes out. Talk of "constitutional crisis" is almost certainly overblown; Debian has muddled its way through no end of strong disagreements in the past. But that still leaves a lot of room for public conflict which further diminishes Debian's reputation and further delays the Lenny release. What one can hope is that, somehow, the project will manage to muddle through to an understanding on firmware that can prevent all this from happening yet again when the next major release cycle comes near to completion. 
Localization under a government umbrella

In an era of wider governmental adoption of free software, the Serbian authorities decided to take a different approach toward promoting GNU/Linux and free software in the business sector and among the general public. Instead of directly adopting free software and open standards, they decided to fund several localization projects with the goal of improving the competitiveness of free software on the Serbian IT market.

The first information about the government's plans to help the localization of free software appeared in December 2007, when several Serbian media outlets reported on the issue. Shortly after the news was revealed, the official press release (Google cached page, since the site has changed and has no resources in English at the moment) from the Serbian Ministry of Telecommunications and Information Society was published, giving all the details that were available to the public at the time. In short, February was set as the deadline for the first results, which meant localized versions of Ubuntu, Fedora, Mozilla Firefox, Thunderbird and OpenOffice.org. The projects were funded by the ministry and delegated to several Serbian computer science faculties for organization and implementation.

All of them, except the Ubuntu localization team, showed their first results in March at a presentation organized by the ministry. Ubuntu was late because the localized version was planned for the LTS (long term support) release, which came out in April. Shortly after Ubuntu 8.04 was released, localized Ubuntu ISOs appeared on the project servers. Ubuntu was known as a distribution which had neither a localized installer nor Serbian translations of its characteristic software. In order to provide better localization, people from the Faculty of Electrical Engineering in Belgrade forked Ubuntu and named the new distribution cp6Linux.
The name cp6Linux has been read as a symbolic way to write "SerbLinux", since cp6 can be understood as "Serb" in something that might be considered Cyrillic "leetspeak". The development team never confirmed this, though. "Linux for human beings who speak (only) Serbian" is packaged in three flavors: Home, School and Business. Besides this packaging, the cp6 development team customized the visual identity and adapted the user interface to make it friendlier for users coming from Windows.

The most important task, and the purpose of cp6's existence, is not entirely completed, but the situation is a lot better than with a vanilla Ubuntu installation. The live disk bootstrap interface and the live system installer are translated into Serbian. System tools and package managers are also localized, but translations of package descriptions and configuration messages are missing. The graphical configuration tools shipped with Ubuntu, like restricted-manager, are translated too, so it seems that cp6 2008 (which is the first and so far the only version) is basically targeting localization of the GUI applications and tools. The cp6 team produced a 52-page Creative Commons licensed user manual (CC-NC-SA), covering the most important aspects of installing and using cp6Linux 2008.

The Fedora localization team (Google translation) took a different strategy and decided to produce localized flavors of Fedora, with no forks and no rebranding. The Serbian Fedora localization community was already quite well organized and productive, so the first thing that people from the Faculty of Organization Sciences in Belgrade did was to get in touch with translators who had already worked on Fedora. According to them, 19416 of 32480 strings in total were localized already, and they've localized 98% of 19500 unlocalized strings, which leads us to the total score of 99% localized strings.
In real life, almost 100% localized strings means localized configuration tools, package management GUIs and a localized installation interface. YUM and package descriptions, as with cp6Linux, remain untranslated. Most of the work was done on Fedora 8, which is available for download from the project servers with Serbian localization and settings out of the box. There is no information about ISOs or localization details for Fedora 9 or 10 on the project website.

Mozilla products were localized by people from the Electronic Faculty in Niš. As in the case of Fedora, the project's organizers continued existing efforts. The final results, for GNU/Linux and Windows, are Cyrillic and Latin versions available for download from the project website (Firefox 2.0.0.12 and Thunderbird 2.0.0.9).

Back in Belgrade, localization of OpenOffice.org was delegated to the Faculty of Mathematics. Again, the project continued existing efforts and took over the coordination of the official Serbian translation team. The first steps toward a localized OpenOffice.org date back to 2001, when a group of Serbian free software users got together for a big translation marathon organized by ICT Tower, a local OSS-oriented software company. Sadly, without any external support, they failed to sustain interest in the project and the translations were never updated. The second big push was in the summer of 2005, when Novell gave some money to the "prevod.org" group for improving Serbian localization in SUSE. Following the OpenOffice release 2, "prevod.org" members returned to keeping up with GNOME translations, and once again the OpenOffice.org translation was left unmaintained.

"In December 2008 the Ministry of telecommunications and information society Republic of Serbia started four projects for free software localization," explains Goran Rakic, Serbian OpenOffice.org native language project lead. According to Rakic, the biggest achievement of the project is the continuous series of localized releases: 2.4, 2.4.1 and 3.0.
"We did large QA and localization quality is better then ever", he states. Project statistics show that more than 30,000 localized installations have been distributed via the project site, more than 3000 of them in just one week after the 3.0 release. Rakic reveals that localized OOo is used inside the government too, with some large deployments and many more to go. Rakic looks into the future, saying that the "Ministry and Faculty of Mathematics in Belgrade signed contract for three years with option to extend and we are just one year in it. I can say that future looks bright for all current and new OpenOffice.org users in Serbia."

It is very hard to give a general conclusion about the implementation and impact of these projects. First of all, the public was never informed of any study related to the use of localized versions of any software in Serbia, so it's impossible to predict how many users might directly benefit from those activities. The only numbers that we can use for any sort of analysis are download statistics, which don't necessarily reflect the real level of acceptance or everyday use of localized programs and distributions.

Contributions and translations from the Faculty of Organization Sciences have gone upstream, and cooperation with the Fedora translation team seems to be established and functioning, according to the information on the Serbian team page. In contrast, it seems that the cp6Linux translations didn't go upstream, since there are no noted contributions on Launchpad. As in the case of Fedora, communication and cooperation for the Mozilla work is managed on the Serbian Mozilla localization team wiki. OpenOffice.org is the only project that actually took over coordination of the localization team, at least officially. As for the distributions, both use GNOME as the default desktop environment; GNOME has a strong and devoted localization community, whose work was packaged in cp6Linux and in the Serbian Fedora.
GNOME translation is not a part of the government-funded activities, though. In the meantime, the Faculty of Technical Sciences from Novi Sad started to work on Alfresco localization, and the results are available on the Alfresco Forge page.

This non-typical approach to free software from the government was motivated by the expectation that localization will become another recommendation for free software adoption in Serbia, according to Mr. Nebojša Vasiljevic, assistant to the Minister of Telecommunications and Information Society for information society, in his interview for GNUzilla magazine (issue 36). He also said that those projects are not part of any strategy involving switching to free software in governmental institutions.

The Android Dev Phone 1

Your editor's long-suffering spouse will attest that gadgets are never in short supply in the house. Many of them pass below her interest, but a new one has come in which has attracted attention throughout the household: an Android Dev Phone, otherwise known as the fully unlocked version of the G1 phone offered by T-Mobile. This phone is certainly a fun toy, but it has the potential to be a lot more than that.

The details of this device have been well publicized for a while now. It includes a nice touchscreen display, QWERTY keyboard, GPS receiver, accelerometer, 3.2 megapixel camera, and more. The whole thing is powered by Google's Linux-based Android platform. The Dev Phone is essentially the same device as that sold by T-Mobile, but with a crucially important difference: it is unlocked in all senses. This means not just that it can be used with any mobile carrier's SIM, but also that the base operating software has not been locked down. This is a phone for which the entire system can be rebuilt and replaced at will. The Dev Phone thus joins the OpenMoko Neo Freerunner on the very short list of truly open mobile handsets.
This device, though, has the advantage of being a bit more of a finished product with what appears to be a rather stronger software development team behind it. It also, for what it's worth, has some nice hardware capabilities that the Neo lacks: quad-band GSM, 3G (though not on the bands used by your editor's carrier, alas), keyboard, etc. Your editor believes that it will be a successful product. Over the course of the next few months, your editor plans to dig into this device and report on what he finds. How open is the device really? What does it take to put a new kernel onto it? What might it take to put a different operating system onto it altogether? And, in general, how does this whole Android thing work? Assuming that he does not brick the device early on, your editor hopes to get a real sense for what can be done with this device, how close its software is to what we normally think of as Linux, and where it might go into the future. It should be a fun project. First, though, one has to get through the stage of simply playing with the new toy. So the rest of this article will be a user-level review of sorts. The hardware: it feels generally solid. The device is larger and heavier than handsets your editor has used in the past, but that is to be expected. The keyboard works better than one might think given its size; even your relatively fat-fingered editor is able to type with reasonable speed and accuracy. The vibrator lacks strength. The camera seems to take nice photos (for a phone camera), but it is exceedingly slow. As with most color-screen devices, the display is entirely unreadable when the backlight is off. A nice touch with this phone is an indicator LED which blinks when the phone has something to tell you - an unread text message, for example - but the use of the LED seems to be somewhat inconsistent. Your editor has yet to get a sense for what the battery life would be in the absence of children playing with the device all day long. 
Complaints about battery life can be found on the net, but it appears that the phone should be able to get through two or three days of moderate usage where the GPS receiver is off most of the time. On the other hand, if you let your kids use it to mess around on video sites, the battery runs down relatively quickly. On the software side, this phone gets off to a bit of a rough start. It first requires the user to configure the phone to access data service from the carrier, a process which must be done by hand if that carrier is not T-Mobile. Your editor's last new phone recognized the carrier from the SIM and handled this task automatically. More annoying, though, is that the phone requires the creation of a Gmail account as part of its setup process. The fact that one does not have - and does not want - such an account is not relevant. So now your editor has an entry in the Gmail account database which will never be used. That, of course, ties in to why Google has gotten into this exercise in the first place. There are many features of the Android platform which are designed to tie the user in more closely to services provided by Google. Some features, such as the calendar, are really just an extension of the online offerings. The phone wants to sync the contacts list to...somewhere...and turning the feature off leads to unpleasant behavior. It is possible to use many of the features of the device without connecting back to the Google mother ship, but it's not the natural mode of operation. Another example is email handling. There is a separate icon for Gmail which just works; that application offers the features (such as threading) provided by that service. One can run a different mail application to connect to a POP or IMAP account somewhere, but it's a separate setup process. Later, with luck, one discovers the improved K9 client, which must be installed separately and which requires one to go through the setup process again. 
Even with K9, the non-Gmail mail client is not what it should be. There is no threading of messages, many basic commands (refiling messages, for example) are missing, etc. Then there are little problems like refusing to connect to a server if it doesn't think it can trust the SSL certificate and failing to authenticate if the user's password contains special characters. One assumes that this client will improve, or that other clients will be ported to the platform, but, for now, it doesn't seem to be a priority for the Android developers. More generally, though, the Android software is pretty slick. A fair amount of thought has been given to how interaction should work on this kind of device. Once one gets used to a few specific differences (holding a finger on an item on the screen for a few seconds often brings up otherwise hidden options, for example), navigating through applications comes fairly naturally. Only in some cases do inconsistencies pop up; your editor has noticed, for example, that applications have differing notions of how to zoom in and out. As a whole, the interface comes across as polished and attractive. That said, use of the display could be improved. On a small display, there will always be a certain tension between getting enough information on-screen and avoiding the creation of headaches through severe eye strain. Some users will do better with small fonts than others. But if Android offers an option to configure default font sizes, your editor cannot find it. So it becomes necessary to manually zoom almost every web page, almost every email, etc. to get a sufficient amount of information onto the screen. That gets a little tiresome after a while. The "Android Market" offers a wealth of applications, most of which are available as free software or, at least, in a free-beer mode. 
When browsing applications, one runs into the Android security model, which is oriented around a long set of capabilities which can be granted to applications. A program which needs to do things like access the net, obtain location data, change hardware settings, etc. must declare the capabilities it needs; these are then presented to the user at installation time. Most users will probably just say "yes," but it is worth taking a closer look. Your editor decided to decline the installation of a Mahjongg game after being unable to figure out why it was asking for full network access. Beyond the inevitable games (including one of the worst Tetris implementations seen in a while), there is a wide variety of available applications. The "Locale" tool makes up for the (surprising) lack of the sort of "profile" feature found on almost every handset your editor has ever seen; it performs tricks like using the GPS receiver to automatically change profiles when the phone enters the office or a theater. The "bubble" application (shown on the left) turns the handset into a portable level. There's no shortage of "smart shopper" applications, most of which can read a barcode using the camera and look up prices for items. There is a "power manager" which attempts to configure the device for optimal power use in a number of situations; it provides a basic profile functionality as well, though the user should be prepared to spend some time configuring the options into a workable form. There are plenty of travel-oriented applications which will fetch weather reports, currency rates, or call a taxi. One notable omission, with both the base phone and the available applications, is voice over IP functionality. This handset should be able to do VOIP beautifully, but almost no such functionality is available. There appears to be a tool for Skype users, but that's about it. There are a couple of applications that are of particular interest to your editor. 
ConnectBot is an SSH client which works surprisingly well; the developers are clearly working toward the creation of a tool useful for people logging into Linux-like systems. And the terminal emulator provides that all-important feature: a shell prompt on the device. Even more fun, on the Dev Phone, a simple "su" with no password will yield a root shell. Playing around on the device, your editor sees that the ARM processor provides a mighty 383 bogomips. It appears to have a little over 100MB of usable memory. It's running a 2.6.25 kernel (known to be heavily modified) with a single loadable module called "wlan." And so on. As useful as the keyboard is, trying to use it to type commands at a shell which lacks a history mechanism gets painful after a while. Time to go looking for an SSH server. There are other useful applications, of course, such as the one which actually makes phone calls. Like the others, it lacks perfection, but one can only assume that, on a platform driven by free software, imperfect applications will be improved or replaced. How easy it is to do such things is part of what your editor intends to find out in the coming months. Stay tuned. "Vishing" advisory targets Asterisk A light-on-details warning - issued late on a Friday no less - had users of the Asterisk telephony platform scrambling recently. It was issued by a US government group that includes the FBI, which tends to attract attention, and warned of unspecified vulnerabilities that would allow "vishing" attacks using subverted Asterisk systems. Vishing is a relatively new scam that uses phone calls in phishing expeditions (the name comes from combining 'voice' with 'phishing'), but typically using systems that are owned and run by the scammers. Evidently, the FBI received word that Asterisk systems were being subverted by way of a vulnerability (AST-2008-003) reported last March. Systems were then used to make "thousands of vishing telephone calls [...] 
within one hour" trying to elicit personal information - generally credit card numbers - from victims. By using caller ID spoofing techniques those calls could appear to be coming from the credit card company itself. Typically, a pre-recorded message would give the user another number to call, where they would be prompted to enter the information via an interactive voice response (IVR) interface. Asterisk is a multi-purpose free software suite that can act as a private branch exchange (PBX), handle VoIP traffic, do IVR, and more. Because it provides such a general purpose platform, it does make an attractive target. It is probably also enticing to control such a device that is being run by - and can be traced to - someone else. But the folks at Digium - original developers and primary maintainers of Asterisk - don't really think the problem is as bad as was indicated. The original problem was fixed months ago, so it clearly irks the Digium folks that it has been fingered now. In addition, the original advisory didn't even point to the vulnerability, so users and Digium were left to guess what exactly was being exploited. The advisory was updated to include information about AST-2008-003, but there is still some skepticism about the potential for exploitation. On Digium's blog, community manager John Todd thinks the problem was overstated: While I won't get into the details of configuration specifics, I would say that an administrator would have to consciously configure their system in what I believe to be an extremely unusual way in order to be victimized by this particular vulnerability. The flexibility of Asterisk lets a developer do almost anything, but it seems that there would need to be an almost absurd configuration circumstance that would allow this bug to be harmful in the way described. While it may well be that this particular vulnerability is difficult to exploit, there will likely be others down the road that are less so. 
While some users may be getting a little more wary about phishing and email-based scams in general, phone calls have generally been considered more trustworthy. But it is no longer true that phone numbers are definitely traceable back to a physical location with a billed party known by the telephone company. Much of this information can be spoofed or re-routed in ways that make detection more difficult. Phones have certainly been used in scams over the years, but the advent of caller ID has tended to put an undeserved stamp of authenticity on certain calls. If a pre-recorded message purports to come from GiantCompany and the caller ID entry has that name, it is easy to conclude that the call is genuine. Much of the same effort that has gone into educating the public about phishing will also need to be applied to vishing. This is certainly not the first instance of PBX systems being abused either. Subverting PBXs for free long distance calls is a longstanding trick in the "phreaking" community. But Asterisk provides a much more capable platform, thus a much more useful tool, both for those that run them and those that subvert them. Asterisk users need to keep that in mind when security issues come to light. Hv3 and the art of minimalist web-browsing Even if you appreciate full-featured applications like OpenOffice.org, Firefox, or GNOME, minimalist replacements have a fascination all their own. Not only are minimalist applications a throwback to the original traditions of Unix-like operating systems, but their emphasis on efficiency at the expense of extra features can force you to re-evaluate your computing needs. A case in point is Hv3, a web browser written in Tcl/Tk. Although currently in alpha and paying more attention to developers' needs than those of end users, Hv3 is already highly suitable for basic web-browsing, with a design philosophy all its own -- and, quite possibly, the fastest performance of any free software browser. 
Hv3 is available for both GNU/Linux and Windows. Packages of nightly builds are available for Puppy Linux, but the users of most distributions must fall back on statically-linked tarballs, following the instructions on the download page to obtain the latest build with wget, then decompress it and change the permissions. You can also download the source code, as well as Tkhtml3, the development tool used by Hv3 for embedding a standards-compliant HTML/CSS implementation in applications. When you start Hv3, you also have the option of installing hv3_polipo, a small web cache, in the same directory. You can run Hv3 without hv3_polipo -- at the expense of clicking through the same dialog each time you start the application -- but, if you are an end user, there is no reason not to install hv3_polipo. In fact, there is every reason to do so, since it increases Hv3's speed by at least 25%. Using Hv3 Hv3 opens on a gun metal gray window with four top-level menus, a toolbar consisting of five basic navigation choices, and the URL entry field (as well as debugging tools that are, presumably, temporary). At the bottom is a status bar that gives instructions for toggling between modes, but apparently does nothing yet. Both bookmarks and downloads open in separate tabs, rather than in a menu or a floating window, which makes for a less cluttered appearance than in most browsers, but does result in each new tab opening by displaying bookmarks. This default occasionally comes in handy, but is more often an annoying preliminary step to what you really want to do. Two unusual features in the Hv3 window are the ability to hide the menus and toolbar to maximize display space, and a tree view of the page's HTML source. Both are available from the right-click menu for a link. The tree view is especially welcome, since it is quicker to navigate than the plain text file of markup you get in most browsers. 
The difference, I suspect, is that Hv3 assumes that users are actively interested in looking through the markup and using it as efficiently as possible, so that the view is not just an afterthought. So far, at least, search capacity is minimal in Hv3, differing little from Firefox's except in the fact that searches of both the web and the current page are grouped together and given prominence by a top-level search menu. Again, the impression is that Hv3 developers are thinking of what might be convenient for those who make regular use of the feature. You can configure Hv3 from the Options menu, choosing the icon set to use in the toolbar, and the size (but not the typeface) to use for the widgets and on web pages. For some reason, you have three choices for font size on web pages: the page zoom, the font scale (a percentage), and the font size table (a description). You also have the option of disabling the display of images for greater speed, and of turning off support for ECMAScript, which provides support for what is commonly referred to as JavaScript. Bookmarks As you explore Hv3, you will probably want to start by opening the Bookmark tab. For one thing, Hv3 seems to have paid most attention to bookmarks among the most common browser features. Because bookmarks in Hv3 open in a separate tab, they display a tree-view list on the left, and the actual page on the right, making them easy to learn. More importantly, the default bookmarks include a short but adequate page explaining the features of Hv3. An especially noteworthy feature is the distinction between regular bookmarks, which open directly on the page, and snapshots, an archived version of a bookmark that can be used to work off-line. You can tell a regular bookmark because it is indicated in the tree view by having a cyan-colored circle for an icon, while a snapshot has an icon resembling a page. There is also a third type of bookmark that is a snapshot that retains a link to the original. 
You can identify this type of bookmark by clicking on its icon and watching it toggle back and forth between the other two, a distinction that seems all too easy to miss. Another reason for turning early to the Bookmarks tab is to use the Import Data button to import bookmarks from Firefox. The process lasts less than ten seconds, and is almost formidably efficient: Not only your personal bookmarks, but the default bookmarks for your distribution and Firefox's default bookmarks are added to the tree view -- regardless of whether they still appear on your personal toolbox in Firefox or not. Speed vs. Geekiness Many of Hv3's features suggest an effort to rethink functionality that you can easily take for granted in your daily browsing. However, what interests many people about minimalist web browsers is their speed. In this category, Hv3 is in a class by itself. Without hv3_polipo installed (see above), Hv3 loads pages roughly 50% faster than Firefox, and at about the same speed as Dillo, perhaps the best known minimalist browser. However, with hv3_polipo installed, Hv3 loads pages nearly twice as quickly as Firefox, and about 50% faster than Dillo. Moreover, Hv3 has the advantage over Dillo of supporting JavaScript, which means that it displays more pages correctly than Dillo does -- although, if you are watching, you will see any text-only alternative pages display before Hv3 renders a JavaScript page. If Hv3 would only include a Flash plugin, possibly using Gnash, the free Flash replacement, then its users would have few basic reasons to envy the users of heavyweight browsers like Firefox except the absence of an active extensions-building community. In its current release, Hv3 pays little attention to usability. Not only are the debugging tools prominently displayed, but some of the options, such as "GUI fonts" or "Force CSS metrics" seem pitched at the understanding of developers more than that of everyday users. 
However, the interface names are not that hard to figure out, particularly since they are relatively few. Presumably, too, the Hv3 team is more concerned with performance right now than finishing details, and will get around to such concerns closer to the first full release. For now, the lack of polish seems a small price to pay for the speed and simplicity of Hv3 -- to say nothing of the reminder that useful and thoughtful alternatives exist to well-known applications. Followups: performance counters, ksplice, and fsnotify There's been progress in a few areas which LWN has covered in the past. Here's a quick followup on where things stand now. Performance monitors In last week's episode, a new, out-of-the-blue performance monitoring patch had stirred up discussion and a certain amount of opposition. The simplicity of the new approach by Ingo Molnar and Thomas Gleixner had some appeal, but it is far from clear that this approach is sufficiently powerful to meet the needs of the wider performance monitoring community. Since then, version 3 and version 4 of the patch have been posted. A look at the changelogs shows that work on this code is progressing quickly. A number of changes have been made, including: The addition of virtual performance counters for tracking clock time, page faults, context switches, and CPU migrations. A new "performance counter group" functionality. This feature is meant to address criticism that the original interface would not allow multiple counters to be read simultaneously, making it hard to correlate different counter values. Counters can now be associated into multiple groups which allow them to be manipulated as a unit. There's also a new mechanism allowing all counters to be turned on or off with a single system call. The system call interface has been reworked; see the version 3 announcement for a description of the new API. The kerneltop utility has been enhanced to work with performance counter groups. 
"Performance counter inheritance" is now supported; essentially, this allows a performance monitoring utility to follow a process through a fork() and monitor the child process(es) as well. The new "timec" utility runs a process under performance monitoring, outputting a whole set of statistics on how the process ran. There are still concerns about this new approach to performance monitoring, naturally. Developers worry that users may not be able to get the information they need, and it still seems like it may be necessary to put a huge amount of hardware-specific programming information into the kernel. But, to your editor's eye, this patch set also seems to be gaining a bit of the sense of inevitability which usually attaches itself to patches from Ingo and company. It will probably be some time, though, before a decision is made here. Ksplice In November, we looked at a new version of the Ksplice code, which allows patches to be put into a running kernel. The Ksplice developers would like to see their work go into the mainline, so they recently poked Andrew Morton to see what the status was. His response was: It's quite a lot of tricky code, and fairly high maintenance, I expect. I'd have _thought_ that distros and their high-end customers would be interested in it, but I haven't noticed anything from them. Not that this means much - our processes for gathering this sort of information are rudimentary at best. The response on the list, such as it was, indicated that the distributors are, in fact, not greatly interested in this feature. Dave Jones commented: It's a neat hack, but the idea of it being used by even a small percentage of our users gives me the creeps.... If distros can't get security updates out in a reasonable time, fix the process instead of adding mechanism that does an end-run around it. Which just leaves the "we can't afford downtime" argument, which leads me to question how well reviewed runtime patches are. 
Having seen some of the non-ksplice runtime patches that appear in the wake of a new security hole, I can't say I have a lot of faith. The Ksplice developers agree that the writing of custom code to fit patches into a running kernel is a scary proposition; that is why, they say, they've gone out of their way to make such code unnecessary most of the time. This discussion leaves Ksplice in a bit of a difficult position; in the absence of clear demand, the kernel developers are unlikely to be willing to merge a patch of this nature. If this is a feature that users really want, they should probably be communicating that fact to their distributors, who can then consider supporting it and working to get it into the mainline. fsnotify The file scanning mechanism known as TALPA got off to a rough start with the kernel development community. Many developers have a dim view of the malware scanning industry in general, and they did not like the implementation that was posted. It is clear, though, that the desire for this kind of functionality is not going away. So developer Eric Paris has been working toward an implementation which will pass review. His latest attempt can be seen in the form of the fsnotify patch set. This code does not, itself, support the malware scanning functionality, but, says Eric, "you better know it's coming." What it does, instead, is to create a new, low-level notification mechanism for filesystem events. At a first look, that may seem like an even more problematic approach than was taken before. Linux already has two separate file event notifiers: dnotify and inotify. Kernel developers tend to express their dissatisfaction with those interfaces, but there has not been a whole lot of outcry for somebody to add a third alternative. So why would fsnotify make sense? Eric's idea seems to be to make something that so clearly improves the kernel that people will lose the will to complain about the malware scanning functionality. 
So fsnotify has been written - employing a lot of input from filesystem developers - to be a better-thought-out, more supportable notification subsystem. Then the existing dnotify and inotify code is ripped out and reimplemented on top of fsnotify. The end result is that the impact on the rest of the VFS code is actually reduced; there is now only one set of notifier calls where, previously, there were two. And, despite that, the notification mechanism has become more general, being able to support functionality which was not there in the past. And, to top it off, Eric has managed to make the size of the in-core inode structure smaller. Given that there can be thousands of those structures in a running system, even a small reduction in their size can make a big difference. So, claims Eric, "That's right, my code is smaller and faster. Eat that." What this code needs now is detailed review from the core VFS developers. Those developers tend to be a highly-contended resource, so it's not clear when they will be able to take a close look at fsnotify. But, sooner or later, it seems likely that this feature will find its way into the mainline. Development statistics for 2.6.28 As of this writing, the 2.6.28 kernel is getting quite close to its final release. The flow of patches into the mainline repository has slowed to a trickle. So it becomes appropriate to look at what was done in this development cycle, and where all that code came from. In these articles, your editor routinely forgets to thank Greg Kroah-Hartman, who continues to do a lot of work to ensure that these statistics are at least moderately accurate. So we'll get that taken care of at the outset: thanks, Greg! The 2.6.28 development cycle has seen the incorporation of just under 9,000 changesets; that makes it a bit smaller in this regard than 2.6.27 (10,600) or 2.6.26 (10,100). 
The development base broadened, though; 1,262 developers have contributed to 2.6.28, more than has been seen with its predecessors. Those developers added 769,000 lines of code while removing 285,000, for a net growth of 484,000 lines - a relatively large amount. Much of that growth came by way of a single developer, as we will see below. In recent development cycles, some 25% of the patches merged were accepted after the close of the merge window. Linus Torvalds has been making sounds about tightening the criteria for patches during the stabilization period, to the point that they would have to address known regressions to be accepted. A look at 2.6.28, though, shows that 1835 patches (so far) have gone in since 2.6.28-rc1. At 20% of the total, the patch flow rate during the stabilization period has fallen - but not by much. So where did these patches come from? Here's the top twenty contributors to 2.6.28: On the changesets side, David Miller contributes a lot of work to the network stack, but the bulk of his changes this time around are to the SPARC architecture code. Yinghai Lu is a constant source of x86 architecture patches. Al Viro returns to the list with a lot of cleanup work in the VFS code, user-mode Linux, and beyond. Bartlomiej Zolnierkiewicz continues to clean up the legacy IDE code, despite the fact that its user base is shrinking. And Alexey Dobriyan contributed work in a number of areas, with the bulk of it being in the netfilter subsystem and /proc. When looking at changed lines, one gets the sense that Greg Kroah-Hartman has been rather busy this time around. As it happens, Greg did not actually write most of that code; the bulk of it came in with the addition of the -staging tree. It seems that Greg, the self-named "maintainer of crap," has acquired substantial amounts of it. Inaky Perez-Gonzalez was the source of the patches adding support for ultrawideband radio and wireless USB. 
Expect to see him show up again soon; he is now working to get the WIMAX subsystem into the kernel. Mark Brown added drivers for a number of Wolfson Micro devices. Joseph Chan contributed the VIA framebuffer driver, and Pavel Machek added a handful of miscellaneous drivers. So who paid for this work to be done? The 2.6.28 employer table looks like this: In general, the employer tables tend not to change too much from one development cycle to the next. Greg's staging tree work did put Novell at the top of the lines-changed column, despite the fact that this work did not originate at Novell. As always, one needs to bear in mind that these numbers are approximate. One welcome change is the first-time appearance of VIA. It appears that this company is truly getting serious about supporting Linux, and that can only be a good thing. Writing all this code is important, but so is reviewing, testing, and reporting bugs. Continuing with a relatively new tradition, we'll look at who shows up in patch tags indicating this kind of participation, starting with the reviewers: At this point, we are seeing about one Reviewed-by tag for every 100 changes going into the mainline repository. Fortunately, the review situation is not quite that bad; most reviewers simply do not provide these tags for the patches they look at. The numbers for bug reporting and patch testing look like this: In each case, everybody with at least two credits was listed. The good news is that, while there's certainly some familiar names on that list, we are also seeing appearances by people who are not known as kernel developers. There really is a testing community out there which includes more than just developers. Your editor suspects that we still are not doing a very good job of crediting them for their work, but this convention is relatively new and we can still hope for progress in this direction. 
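As an aside, tallies of patch tags like the ones discussed here can be produced by scanning commit messages for tag lines. What follows is only a minimal sketch of that idea in Python, not the tooling actually used for these statistics; the function name, the regular expression, and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tag-line format: "Tag-name: Person Name <email>" on a line of its own.
TAG_LINE = re.compile(
    r"^[ \t]*(Reviewed-by|Reported-by|Tested-by):[ \t]*([^<\n]+)",
    re.MULTILINE,
)


def count_tags(messages):
    """Return {tag: Counter({name: credits})} for a list of commit messages."""
    counts = {}
    for message in messages:
        for tag, name in TAG_LINE.findall(message):
            counts.setdefault(tag, Counter())[name.strip()] += 1
    return counts


# Hypothetical commit messages, invented for this example.
sample_messages = [
    "fix foo\n\nReported-by: Bob Tester <bob@example.org>\n"
    "Reviewed-by: Alice Example <alice@example.org>\n",
    "fix bar\n\nTested-by: Bob Tester <bob@example.org>\n"
    "Reviewed-by: Alice Example <alice@example.org>\n",
]

for tag, people in sorted(count_tags(sample_messages).items()):
    for name, credits in people.most_common():
        print(f"{tag}: {name} ({credits})")
```

A real run would feed in the commit messages for the release in question; dividing the number of tags by the number of changesets then gives ratios of the kind quoted above.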
To that end, the developers who are crediting reporters and testers are: A quick grep shows that the number of Reported-by and Tested-by tags in patches was almost exactly the same over the 2.6.27 and 2.6.28 development cycles. Given the smaller number of patches in 2.6.28, this indicates that a slightly higher percentage of patches is now carrying those tags. Emphasis on "slightly" is in order, though; we are, for the most part, still not crediting a great many people who have helped to get 2.6.28 into shape. Unifying filesystems with union mounts Unification of filesystems is the concept of mounting several filesystems on a single mount point, with the resulting mount showing the logical combination of all the filesystems. Traditionally, when a filesystem is mounted on a directory, the existing contents of the directory are masked, and the content of the latest mounted filesystem is shown. These masked files are available only after the mounted filesystem is unmounted. Even though these files exist, they are inaccessible to the user. Union mount overcomes this by providing access to all directories and files present in the directory, even after a mount. In the kernel, the filesystems are stacked in order of their mount sequence: the first mounted filesystem is at the bottom of the mount stack, and the latest mount is at the top of the stack. Only the files and directories of the top of the mount stack are visible. With union mounts, directory entries from the lower filesystems are merged with the directory entries of the upper filesystem, thus making a logical combination of all mounted filesystems. Files with the same name in a lower filesystem are masked, as the upper one takes precedence. Union mounts could be used to update packages of a distribution on a DVD. A writable filesystem could be mounted over the read-only filesystem on the DVD. 
All new and updated package files would be written to the writable, topmost filesystem, while hiding the duplicate files of the read-only media, or even deleting files (this is done through white-outs, discussed later). This allows the user to change any of the files on the system, with the new file stored transparently in the image. Such a setup could be used to roll up an updated DVD, or maintain a package repository with the latest packages for network installs. As compared to other implementations, such as unionFS, union mounts try to do all directory entry unification handling in the VFS layer, instead of creating a new filesystem type. Some of the advantages of this approach are:
Simple and lightweight design: since all merges happen inside the VFS, there is no need for an additional filesystem layer to maintain and merge metadata.
No need for the user to reiterate the mount stack while mounting: the user is not required to list the directories participating in the union as a part of the mount command. The mount point alone is enough.
Bind mounts work without any problems: this is a VFS feature to remount part of the filesystem hierarchy at additional mount points.
Union mount, developed by Jan Blunck, Bharata B Rao, and Miklos Szeredi, is the first step in unifying mounts in the VFS. The patch implementation is similar to that of the Plan 9/Inferno operating system. Currently, it only does namespace unification at the root directory level and not in the subdirectories. To mount directories through union mount, the mount command must be modified to recognize and set the union mount options.
The util-linux patches that update the mount command can be found at ftp://ftp.suse.com/pub/people/jblunck/union-mount/ As an example, consider the following directory structure of two filesystems: Issuing the following commands will perform a union mount: After the union, the directory structure looks like: Unmounting the /mnt directory unwinds the filesystem mount stack: The filesystems are stacked in the mount order in the kernel. The MNT_UNION flag in vfsmnt is set while mounting union mounts. This helps to identify that the directory entries of the stacked filesystems are supposed to be merged. While performing the lookup sequence, if the MNT_UNION flag is set, all root directory entries of all filesystems are scanned. Scanning happens from top of the filesystem stack to bottom, and the first matching entry is returned. This way any duplicate entries in underlying filesystems are automatically ignored. Similarly, for the readdir() call, the directory entries are read from the topmost union mount directory to the lowest, and collected in the cache. The cache is responsible for collecting and keeping the directory entries across the stacked filesystem, with different callbacks for each filesystem. Like regular files, directories are seekable and the position of the following read is marked by the file position filp->f_pos. When reading from directories across filesystems, it is possible that the file position exceeds the inode size of the directory where it is merged. In such a situation, the file position is rearranged to select the correct directory in the union stack. This is done by subtracting the inode size if the file position exceeds it and selecting the next member of the union. This works for filesystems such as ext2 that use flat file directories. The directory entry offsets are arranged linearly and are always smaller than the inode size of the directory. 
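
The lookup order, the merged readdir() view, and the file-position arithmetic described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `Layer` class, the function names, and the use of a plain number for the directory inode size are all hypothetical and do not correspond to structures or functions in the union mount patches.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One filesystem in the union stack (hypothetical model, not a kernel type)."""
    name: str
    entries: dict   # root-directory entries: name -> contents
    dir_size: int   # stands in for the inode size of the root directory


def union_lookup(stack, name):
    """Scan from the top of the stack (stack[0]) down; the first match wins."""
    for layer in stack:
        if name in layer.entries:
            return layer.name, layer.entries[name]
    return None


def union_readdir(stack):
    """Collect entries top to bottom, ignoring names already seen higher up."""
    seen = set()
    merged = []
    for layer in stack:
        for name in layer.entries:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def locate_position(stack, f_pos):
    """Map a union-wide position to (layer index, offset in that directory).

    While the position exceeds the current directory's size, subtract that
    size and move on to the next member of the union. This assumes offsets
    are linear and smaller than the directory size, as with ext2.
    """
    for index, layer in enumerate(stack):
        if f_pos < layer.dir_size:
            return index, f_pos
        f_pos -= layer.dir_size
    return None  # position lies beyond the last directory in the union
```

With a two-layer stack in which both layers contain an entry named `b`, `union_lookup()` returns the upper layer's copy and `union_readdir()` lists `b` only once. The position mapping works only because of the linear-offset assumption noted in the comment.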
However, some filesystems return special cookies as directory entry offsets which are unrelated to the position in the directory or the inode size. Updating file->f_pos to accommodate more directories does not work for such filesystems. There can be multiple calls to the readdir()/getdents() routines for reading the entries of a single directory. Currently, the union directory cache is not maintained across these calls. Instead, for every call the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being returned to user space. The developers are working on making this efficient by maintaining the cache across readdir()/getdents() calls.

Future Plans: Writable Unions

Currently, the namespace unification is limited to the root filesystem directory entries. Future plans, known as writable unions, would come close to the namespace unification implemented by unionfs. Directory entry merging would not be limited to the root filesystem, but would be done for subdirectories as well. Though these patches have been developed, they still require some time and cleanup before they are ready for the mainline. Using the example above, a writable union mount of the two filesystems would contain: Note that the dir1 directory now contains both file_b1 and file_c1. All writes are directed to the topmost mounted filesystem if it is mounted read-write. Mounting a new filesystem upon the current union mount makes all filesystems lower in the stack read-only, though the unified namespace would appear read-write to the user. Any modifications to the files of lower filesystems are handled through copy-on-write. If a file belonging to the lower layers of the stack is opened, the entire file is copied to the topmost filesystem on the stack. This is also known as copy-up, where the file is copied to the topmost layer if it has to record a change.
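
Copy-up can be modelled in user space with ordinary directories standing in for the stacked filesystems. The following is a minimal sketch, not the kernel implementation: the function names are hypothetical, `layers[0]` is assumed to be the writable topmost layer, and every open is treated as an open for modification.

```python
import os
import shutil


def find_layer(layers, relpath):
    """Return the first layer (top first) that contains relpath, or None."""
    for layer in layers:
        if os.path.lexists(os.path.join(layer, relpath)):
            return layer
    return None


def open_for_write(layers, relpath):
    """Open relpath for update, copying it up to the top layer if needed."""
    top = layers[0]
    target = os.path.join(top, relpath)
    source_layer = find_layer(layers, relpath)

    # Recreate the parent directories in the top layer so that the file
    # shows up at the same path in the merged namespace.
    os.makedirs(os.path.dirname(target), exist_ok=True)

    if source_layer is not None and source_layer != top:
        # Copy-up: the entire file is copied, not just the changed part.
        shutil.copy2(os.path.join(source_layer, relpath), target)

    mode = "r+" if os.path.exists(target) else "w+"
    return open(target, mode)
```

After a call such as `open_for_write([top, lower], "dir1/file_b1")`, any changes land in the copy under `top`, while the original under `lower` is left untouched and is masked by the upper copy on lookup.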
While performing a copy-up, the directory path of the file is also recreated on the topmost filesystem, so that the next time it is mounted as a union, it appears in the same location. The older file gets masked during the directory merge the next time the filesystems are union-mounted in the same order. Rename on union mounts is handled through -EXDEV. -EXDEV is returned by a rename() operation if the source and destination file paths are on different mounted filesystems. In such a case, the application, such as mv, resorts to a copy operation, and unlinks the file from the filesystem from which it was moved. On union mounts, since any writes are performed in the topmost layer, a move operation to directories in the lower layers returns -EXDEV, which means the application must copy the file to the new directory. If both the source and destination of the rename() operation are in the topmost layer, the traditional rename method is used. Deletion of files is handled by a special file type called a white-out. The white-out file type is similar to a negative dentry: it describes a filename which isn't there. This is used to mark a file in the lower read-only filesystem as deleted, since only the topmost layer can be modified. However, white-outs would require support from all the filesystems, to store and recognize such a special file type. Currently, there is a special type, DT_WHT, defined in include/linux/fs.h, which denotes a white-out but is not in use. Directory namespace unification is a tough task. FreeBSD implementations gave up after calling it "messy code," and while unionfs entered the -mm tree for a brief period, it did not make it to mainline. Since the unification is pathname-based, it is best handled in the VFS instead of using a separate stacked filesystem.
The union mount offers a cleaner and more lightweight approach for merging directories; however, getting it to adhere to POSIX-compliant directory calls such as telldir() or seekdir() is still a challenge and is currently being worked on. The git repository to track union mounts is located at: under the union-dir branch. The union mounts developers intend to release the patches in a phased manner, starting with the current patch of root directory level merging. Further developments would see patches related to merging at the subdirectory level as well.

Refining the Process of Digitizing Vinyl Records

In October, your author discussed the process of digitizing vinyl records for the creation of a digital audio library. Since that time, the process has been performed on around 40 disks and a number of refinements have been made. This article discusses what has been learned in that time. One part of the digitizing process that has proven to work well involves treating one side of the original media as a single chunk of data. Many of the processing steps can be performed on these large data chunks before splitting up the individual tracks. After making numerous recordings, it was discovered that a single record level, 93 on the inputs of the M-Audio Delta 44, consistently produced recordings with a useful volume range on the majority of the records that were copied. An interesting phenomenon was observed with some recordings that were recorded with too much gain. On loud passages, as the waveform reached the upper or lower limit (rails in electronic-speak), instead of just flattening out, a complete inversion of the wave would occur, resulting in harsh-sounding rail-to-rail glitches. The source of the problem is open to speculation. If this should occur, it is best to make a new recording of the album side with a lower input level. Having two machines handy has helped to optimize the audio processing work.
One machine is dedicated to making the initial album side recordings. The sides are minimized in size by removing data before and after the recorded audio, and fade-ins and fade-outs are added to the whole album side. The album sides are copied to another machine with a faster processor for further processing. The original copy is kept around as a backup until the side has been fully processed. After copying the recorded album side to the secondary machine, a new recording can be started on the recording machine. The process of removing clicks and scratches from an album side has seen the most changes since the original article. This is a bit of a learned art. The first step now involves visually inspecting the waveform of the album side with Audacity. Often a few huge spikes will be visible on the recording. They can be removed by repeatedly selecting an area and zooming in until the zoom resolution shows individual samples as dots. The repair operation should be performed on all of the large clicks. Smaller clicks can often be found and removed by zooming into the quiet passages; an almost infinite amount of hunting, zooming and repairing can be done. Another good way to find clicks is to listen, pause, remove and move on. Most tracks can be cleaned up to a reasonable level without too much effort. Some albums can contain an incredible number of clicks while others can be nearly click-free. After the manual deglitching is done, the automated click removal step can be performed. This is now optional, but it can find additional clicks that are buried in busy waveforms. After whatever amount of declicking seems reasonable, the audio is exported from Audacity as a .wav file. Before exiting Audacity, the Stereonorm script (available here) is run on the .wav file to bring the left and right channel levels up to 100% volume.
If the normalization results look reasonable compared to the Audacity visual representation of the recording, Audacity is exited and restarted with the normalized recording. If the normalization numbers do not seem right compared to the visual wave representation, it is often possible to remove more offending large clicks, export again and rerun the normalization step. Although it may make audiophiles cringe, it may be beneficial to use the repair function to shave the level off the peaks of loud percussive waveforms. Done sparingly, this can be used to fix balance problems encountered during the normalization step. The version of Audacity that your author has been using, 1.3.4-beta on Ubuntu 8.04, has a few bugs that can cause crashes and the loss of time-consuming work. Occasionally after doing a lot of repairs, attempting to export a file as .wav produces a long stream of zero-length write errors. It is usually possible to recover from this by writing out the data in the Audacity native .aup format, exiting and restarting Audacity with the .aup file, and trying the .wav export again. On numerous occasions, adding a label track followed by doing more click repairs has caused Audacity to crash. It is advisable to perform the labeling step in a new instance of Audacity. Hopefully these bugs will disappear when the system gets updated to a newer version of Audacity. After investing many hours into the creation of a large audio library (now up to around 200GB), it becomes critical to back up the data. Fortunately, the price of IDE disks has dropped as fast as the capacity has risen and hard drives can be treated as high capacity data cartridges. Backups can easily be done by adding a temporary SATA or USB drive to a system and running an efficient rsync operation to copy any new or changed data to the offline archive.

openSUSE 11.1 is out

openSUSE 11.1 was released this week. This point release contains new features and bug fixes.
A series of sneak peeks looks at KDE 4.1.3, The Latest GNOME Desktop, Improved Installation, Easier Administration and more, with plenty of eye candy. There is a look at the download numbers as of December 24, 2008 and lots of coverage. DistroWatch summed up a lengthy review with: My only reservation is to do with proprietary codecs and drivers, which still needs some work to reach the same level as other distributions. For new users, this is still just too hard. I tried to get 3D working with ATI's proprietary driver and gave up in the end (X worked, but no 3D due to OpenGL errors). The 'recommended packages' feature of the package manager is a great idea and does install MP3 support automatically, but this is still second rate and users expect more. Overall I really feel that this version of openSUSE provides a complete desktop experience for the user. What does it have to offer you? Download it and give it a try, you might be pleasantly surprised at what you find. This version of openSUSE comes with a new OpenSUSE License with no EULA. DaniWeb interviewed community manager Joe "Zonker" Brockmeier. What's new in openSUSE 11.1? Tons. :-) More specifically, we have a lot of new software -- OpenOffice.org 3.0, GNOME 2.24, KDE 4.1.3, Banshee 1.4, and a lot more. We've also updated some important YaST modules (YaST is the system management tool for openSUSE) including the partitioner, printer module, and security module that allows users to examine their system's security. This release also introduces a major new feature called Nomad, which is a new remote desktop technology. (http://en.opensuse.org/Nomad) This was also a major update in other ways. First, this is the first release that was built in the openSUSE Build Service, which is an important step for allowing more contributions from the community over time. 
Also, we introduced a new, more friendly license and we removed some pieces of software from the DVD media that prevented redistribution, so now openSUSE is easier to obtain and distribute than ever before. We asked openSUSE developers to share a little about their views of the best new features or what they are most excited about. We will conclude this article with their responses.
Greg Kroah-Hartman: The new kernel version update, to the 2.6.27 release series, provides support for many new devices and platforms over the previous openSUSE releases.
Aaron Bockover: I am excited about Mono 2.0 in openSUSE 11.1 as it brings a number of major performance, memory, and stability improvements to our applications. From the developer point of view, Mono is more compelling than ever with full C# 3.0 support. openSUSE is hands-down the best distribution for developing on Mono.
Michael Meeks: My favourite OpenOffice.org feature, and a world-first, is the split build; this allows you to quickly compile just 'writer' against your installed libraries (finally, like all other applications); so you can get involved with OO.o much more easily. My second favourite is the console help when invoking a missing tools, telling you the command to install it and the respective package - that combined with the speedy zypper makes life exceeding smooth.
Hans Petter Jansson: I think one of my favorite 11.1 features must be that user switching (switching to another logged-in user's desktop without logging out) finally works seamlessly with GDM.
Joe 'Zonker' Brockmeier: Of all the features and updates in this release, there are two things that really make the release for me. One is the KDE 4 desktop, which has come a very long way. It has a lot of polish and I'm really impressed with the improvements since 11.0. The other is the new license, which makes openSUSE much easier to redistribute and gets rid of the EULA that openSUSE used to have.
PDF-based presentations with 3-D effects

At first, the idea of adding 3-D transitions to command line presentation software may give you a kind of cognitive dissonance. Just as you would if someone had added a GPS tracking system to a one-horse cart plodding along at two kilometers an hour, you have to wonder why anyone would bother. But the dissonance disappears as you start to explore the control and precision you have in command-line programs like PDFCube and Impressive (formerly KeyJNote). Both are small and efficient programs that allow you to add transitions and other special effects to PDF-based presentations, although the range of options varies considerably between the two programs. Before using either PDFCube or Impressive, you need to have support for 3-D graphics installed. PDFCube works well with OpenGL, as well as with the drivers and video cards listed on its hardware compatibility page. By contrast, Impressive is somewhat more erratic under OpenGL, with some transitions displaying slowly, especially when you have less than two gigabytes of RAM available. However, by picking and choosing effects, you can still test drive Impressive without resorting to proprietary drivers. Both applications are available as source code from their project sites. However, you will also need to install dependencies for PDF support, such as Poppler for PDFCube, and Xpdf Reader or Ghostscript for Impressive. Impressive also requires Perl and Python. For convenience, you may prefer to use the Debian packages for both programs, or, in the case of PDFCube, the packages available in the Fedora and Ubuntu repositories. Impressive is also available for OS X and Windows.

PDFCube

With version 0.0.3 just released, PDFCube is more a proof of concept than a finished application. In fact, it currently has only one transition effect: a spinning cube.
However, a day after the latest release, maintainer Mirko Maischberger posted a brief announcement on the project home page that he has already started work on "an abstraction layer for 3D effects (cube, fading, cover flow) to be done in C++ and OpenGL)." What you currently have in PDFCube is the basic engine. No options are available, so all you need to type to try PDFCube is pdfcube filename.pdf. However, before trying PDFCube, take the time to read its man page to learn how to navigate within the program. Unlike full office applications like OpenOffice.org Impress or KPresenter, PDFCube is driven completely by keyboard commands and, so far at least, does not work with the mouse at all. Fortunately, the basic commands are few. You press the 'c' or space key to move to the next page of a presentation using an effect, or the PageUp key to move to the next page without any effect or the PageDown key to move to the previous page without effect. You can also use the 'h', 'j', 'k', and 'l' keys to zero in on one of the corners of the current page, or the 'z' key to zoom in on the center. Pressing any of these keys zooms out again, while Esc stops the presentation. These are all the controls that you are likely to need. As Maischberger suggests on the project home site, the spinning cube is easy to overdo, so you might want to limit its use to major transitions. You can impose this limit by listing, after the file name, the numbers of the pages after which you want the transition. For instance, if you entered pdfcube filename.pdf 0 3, you would have the spinning cube between pages 1 and 2 and pages 4 and 5 only. Other transitions would lack the effect. Another point to be aware of with PDFCube is that it is designed for landscape-oriented pages. You can display PDF files with a portrait orientation, but the application currently gives you no way of scrolling up or down the page.
But, this limit aside, PDFCube shows a simplicity and performance that you don't often see in its desktop equivalents.

Impressive

At version 0.10.2, Impressive is already much more complete than PDFCube. It not only runs slideshows from directories with BMP, JPEG, PNG, and TIFF graphics as well as from PDFs, but also includes a complete set of controls for fine-tuning how its presentations run, to say nothing of several unique controls for running a presentation. You can view a complete list of options with impressive --help, or from the project documentation page. They include options to set up an automatic slideshow, complete with a loop from the end back to the beginning, to set the size of the presentation window, and just about every other aspect of the running and appearance of a presentation that you can imagine. Two especially noteworthy options are -d, which allows you to set a time for the entire presentation, then pace yourself by an unobtrusive bar along the bottom of the screen, and -u, which polls original files periodically to see if they are updated. If you want to use slide transitions, you will need to enter impressive --listtrans to see a list of over 20 possible transitions. All the transitions have names like SlideUp or WipeDownRight that are clear enough to be self-explanatory, although the help screen does include a slightly longer description. You can use a transition by adding its name with the -t option. However, unlike PDFCube, Impressive currently limits you to a single transition for the entire slide show, a limitation that might frustrate some users, but also prevents the aesthetic disaster of anyone using too many. In addition, Impressive includes several handy controls. Pressing the Tab key opens a view of all the slides in the presentation, while pressing the Enter key enables a spotlight that follows the mouse and can be used as a built-in pointer.
Still another option is to draw an enclosed shape with the mouse, which results in the rest of the screen darkening and blurring, so that the audience's attention is focused on the area you defined. You can add multiple highlighted areas, each of which you can close with a right mouse-click. The screen returns to normal when you close the last highlighted area. Impressive's view of all slides is reminiscent of the slide view in many programs, or the Sun Presenter Console for OpenOffice.org, but its highlight boxes and spotlight are both features that I haven't seen in desktop-oriented programs. These features alone make Impressive worth a look, but more experienced users might also appreciate the wealth of available options, even if they don't often use many of them.

Conclusion

Both PDFCube and Impressive are works in progress, with some way (and, at the current rate of development, perhaps some years) to go before their 1.0 releases. However, in the current versions, PDFCube has the superior basic engine, while Impressive allows users the greater control. Despite PDFCube's lack of options and Impressive's mediocre OpenGL support, both are worth keeping at least an occasional eye on. In their separate ways, both demonstrate that, contrary to what many desktop users seem to assume, command line applications are not just archaic remnants. You need time to enter all the options in a command line application, but, if you take the trouble to familiarize yourself with the applications, you may find their controls easier to use than the cluttered editing windows of a desktop application like OpenOffice.org Impress. Far from being outdated, applications like PDFCube and Impressive are practical demonstrations that command line applications can be both modern and innovative.

Justifying FS-Cache

In what must seem like a never-ending effort, David Howells is once again trying to get a generic mechanism to do local caching for network filesystems into the kernel.
The latest version, number 41, of his FS-Cache patches was posted back in November, so now he is asking for it to be added to linux-next. That would mean that the feature was on track for the mainline in 2.6.29, but it would appear that 2.6.30, if ever, is more likely. The idea behind FS-Cache is to create a way for "slow" filesystems to cache their data on the local disk, so that repeated accesses do not require accessing the underlying slow storage. Howells has been working on getting it into the kernel for a number of years; our first article about it appeared in 2004. The canonical example of where it might be useful is a network filesystem on a heavily-used or low-bandwidth link; the cost of re-reading data from the network may be much higher than retrieving it from a local disk. In addition, the cache can be persistent across reboots, allowing some files to live locally for a very long time. But Howells already has a fairly large, intrusive patch that is headed for 2.6.29: credentials. That patch touches a lot of code in the kernel, in particular the VFS layer. Christoph Hellwig is concerned about both credentials and FS-Cache going in at the same time: I don't think we want fscache for .29 yet. I'd rather let the credential code settle for one release, and have more time for actually reviewing it properly and have it 100% ready for .30. While that would delay the addition of FS-Cache, Andrew Morton has a larger concern: I don't believe that it has yet been convincingly demonstrated that we want to merge it at all. It's a huuuuuuuuge lump of new code, so it really needs to provide decent value. Can we revisit this? Yet again? What do we get from all this? Morton is worried about adding additional maintenance headaches with limited or no benefits. Using a local disk to cache data from a remote disk is only useful in some scenarios; it can certainly make things worse in others.
As Howells puts it: "It's a compromise: a trade-off between the loading and latencies of your network vs the loading and latencies of your disk; you sacrifice disk space to make up for the deficiencies of your network." What Morton is looking for is a push from users, be that end users or distributions that are shipping the feature. He would also like to see some benchmarks that show what gain there is when using FS-Cache. Howells has patiently answered these concerns, pointing at some benchmarks he had posted in November that showed some significant savings. The benchmarks used NFS over a deliberately slow link (to simulate a heavily used network) and showed a huge decrease in the time required to read a large file, but were essentially break-even when operating on a kernel tree. In the kernel tree benchmark, though, the reduction in network traffic was significant. More important, perhaps, is the fact that Red Hat has shipped FS-Cache in RHEL 5 and there are customers using it, as well as customers interested in using it, as Howells pointed out: We (Red Hat) have shipped it in RHEL-5 and some Fedora releases. Doing so is quite an effort, though, precisely because the code is not yet upstream. We have customers using it and are gaining more customers who want it. There even appear to be CentOS users using it (or at least complaining when it breaks). While shipping out-of-tree code is no guarantee that the feature will get merged (AppArmor is an excellent counterexample), actual users whose needs are being met by a particular feature are a fairly persuasive argument. Howells outlines some customer use cases for FS-Cache, for example: We have a number of customers in the entertainment industry who use or would like to use this caching infrastructure in their render farms. They use NFS to distribute textures (say a million and a quarter files) to the individual rendering units.
FS-Cache allows them to reduce the network load by satisfying subsequent NFS READ requests from each rendering unit's local cache rather than having to go to the network again. In all, it would seem that Morton's concerns were addressed. Whether that means the path is clear for 2.6.30, or whether these or other concerns will come to the fore, is a question that will likely have to wait another three months or so.

SSL man-in-the-middle attacks

A while back, we looked at the new Firefox 3 warnings for self-signed and expired SSL certificates. As annoying as some found those warnings to be, they certainly increased the visibility of "invalid" certificates. Those certificates could lead to man-in-the-middle attacks, which is what led Mozilla to issue such eye-opening warnings. More recently, Eddy Nigg of Startcom, an issuer of free SSL certificates, found another way to do man-in-the-middle attacks without setting off any of the new warnings. What Nigg found was that he could get a perfectly valid certificate for a domain he did not control: in this case mozilla.com. He could then masquerade as the secure Mozilla site with impunity; any browsers that landed there would verify the certificate as belonging to mozilla.com. He did it through a Comodo reseller with no questions asked: "Five minutes later I was in the possession of a legitimate certificate issued to mozilla.com – no questions asked – no verification checks done – no control validation – no subscriber agreement presented, nothing." That is clearly a bug in the verification process, but it is completely out of the control of the browser. The browser must trust some set of key signing authorities (i.e. Certificate Authorities or CAs), but has no way to control how well or poorly they actually vet the keys that they, or their downstream resellers, sign. We saw the same potential problem in a slightly different guise with "Extended Validation" certificates back in 2006. It all comes down to trusting CAs.
Sometime after Nigg's story hit Slashdot, Comodo revoked the certificate, which did cause Firefox to put up an error and disallow the connection. One wonders how many bad certificates have been issued but not revoked because a phisher or other scammer received them. One would think those folks would be less likely to publicly announce what they had done. Bringing attention to the problem will likely help, but there are just too many ways to create bad SSL certificates for those that really want them; bribing CA employees, if nothing else. Another useful outcome is that Richard Bejtlich got interested in just how the revocation process works. He collected packet data from accessing Nigg's certificate after it had been revoked, which gives a look inside the Online Certificate Status Protocol (OCSP). OCSP is designed to do just what it did: cause a bad certificate to fail when verified by the browser. Nigg's certificate listed an OCSP server that should be consulted. Because that information has been signed by the CA, it can't be tampered with. So long as the browser makes the OCSP check, certificates can be revoked in this manner, as long as the CA is aware that revocation is needed. Public key cryptography, the basis of SSL and many other encryption schemes, is an amazing method for doing encryption, but it does suffer from a major shortcoming: key exchange. For relatively simple situations, where both parties know each other and have a way to securely exchange keys, it works well. When trying to handle other kinds of communications, either a "web of trust" (a la PGP and GPG) or some kind of trusted authority is required. When those break down, man-in-the-middle and other scams are possible.