From:	 Tom Lord <lord@regexps.com>
To:	 fsb@crynwr.com
Subject: the model t
Date:	 Wed, 16 Jan 2002 01:39:00 -0800 (PST)



One view of the history of the automotive industry is that it is a
story about the evolution of manufacturing processes.  In the early
days, the companies that successfully implemented assembly lines
became dominant.  Nowadays, some of those same companies are acting to
remain competitive by redesigning their entire process, from design to
roll-out to showroom to post-sale support -- trying to take advantage
of the efficiencies and flexibility made possible by modern computing,
communications, parts production, and assembly technology.  The
classical assembly line is, of course, a dinosaur.

One view of the history of software companies is that it is a similar
story about the evolution of software engineering processes.  In this
story, FSBs have an interesting role.

In the 70s and early 80s, from what I've heard, a lot of development
was driven by small teams of gurus-with-low-badge-numbers.  The senior
OS hackers at minicomputer companies and the first implementors of
MacOS, for example, were figures of legend.  Their success was pretty
much a black art as far as the industry was concerned: difficult to
reproduce reliably.

In the mid-to-late 80s, we got software assembly lines. A lot of
processes started to become commoditized.  A company could plan a
large software product, break it down into modules, hire teams of
journeyman programmers, hand them marching orders, and succeed with
some reliability.  Team managers and project leads coordinated
development around conference tables and at white-boards.
Standardized specialties started showing up in job reqs: QA engineer,
release engineer, integration engineer, porting engineer,
documentation specialist.  These process innovations were sufficiently
efficient and reliable to win out over small teams of gurus --
software gurus as a group even got a bad reputation during these
years.

With roots in the 80s, but taking off commercially in the nineties, we
got "open source processes": distributed, asynchronous development,
transcending company and organizational boundaries, coordinated by
project maintainers, and often involving distributed, competitive
planning and unplanned development.

Compared to software assembly lines, open source processes seem so far
to have such advantages as: efficient development cost sharing,
opportunistic enhancements, rapid need-driven enhancements, low
bureaucratic overhead, multi-role programmers (i.e., efficient soaking
up of engineer hours), and implicit testing (Raymond's "many
eyeballs").  When open source processes work, they work very, very
well, accomplishing quickly and surprisingly what the assembly lines
could never do.

The trend, since the eighties, has been for open source projects to
grow in areas such as:

	- the number of contributing programmers

	- the size and complexity of source code

	- the number of goals for a project being simultaneously
          pursued

	- the importance of producing a steady stream of distributions
	  of monotonically increasing quality

In response to those trends, programmers have developed a series of
techniques and tools for coordinating development -- open source
processes have become more of a commodity.

In the early days of GNU, the standard project infrastructure was
mailing lists and an FTP site.  There was a list for bugs, a list for
announcements, and sometimes a list for discussion.  Changes to
software were exchanged as manually created and applied patch sets.
In most cases, a single individual acted as "maintainer", having sole
responsibility for making changes to the official sources.  Projects
were slow-paced, small, and simple enough that this was practical.

In the nineties, the mailing lists acquired web-browsable archives;
bugs started to be saved in web-based issue-tracking systems; many
projects had multiple maintainers who coordinated their changes to the
official sources by using a network-accessible CVS repository.  That
project-hosting infrastructure bundle became something of a commodity,
led, I believe, by SourceForge.

The costs and capabilities of that infrastructure and the processes
that use it are worthy of close attention from FSBs: that's our
manufacturing technology.  When we win against closed-shop assembly
lines, sure, there's an element of guruism at work, but it's our
software engineering tools and techniques that are the source of
reproducible success.

Moving from the general to the specific, there is one part of our
open source software engineering technology that is of particular
interest to me:

CVS is a bottleneck in our infrastructure.  On the one hand, CVS does
something incredibly useful: it helps multiple maintainers coordinate
changes to software.  On the other hand, CVS is very limiting: for the
most part, it helps only the people who have write access to a
repository.  Anyone else offering changes still has to go the
`diff/patch' route, relying on one of the maintainers to turn patch
sets into CVS transactions.  That raises the cost and inconvenience of
contributing and soaks up maintainer hours on the mundane task of
manually manipulating incoming patches.  Consider how these costs and
inconveniences are multiplied when a given set of changes doesn't
quite pass review -- and the maintainers call for revision and
resubmission.
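
To make the `diff/patch' route concrete: the contributor's half of it
amounts to producing a patch set like the one below and mailing it in.
This is only a sketch, using Python's difflib in place of diff(1); the
file name hello.c and its contents are made up for illustration.

```python
import difflib


def make_patch(old_text, new_text, path):
    """Return a unified diff that turns old_text into new_text."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="a/" + path, tofile="b/" + path))


# Hypothetical one-line fix to a hypothetical file.
old = "int main(void)\n{\n    return 1;\n}\n"
new = "int main(void)\n{\n    return 0;\n}\n"
patch = make_patch(old, new, "hello.c")
print(patch)
```

A maintainer with write access then has to apply that text by hand and
commit the result, and the whole exchange repeats for every revision
the reviewers ask for.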

CVS use creates some other undesirable artifacts, too.  To name a few:
The simple act of improving the maintainability of a project by
rearranging the source tree isn't well supported by CVS -- maintainers
therefore tend to resist making that kind of improvement.  CVS allows
development of new features on "branches", but has an awkward
interface for this and only minimal facilities for merging changes
between branches.  Consequently, many projects are characterized by a
free-for-all on the trunk line of development which therefore, rather
than steadily increasing in functionality and reliability, regularly
thrashes between working, broken, and even uncompilable states.

Unfortunately, I have no hard data that quantifies the costs imposed
by the shortcomings of CVS.  I do, however, read the developer mailing
lists of several projects and my impression has been that CVS is a
serious problem, even if it isn't always explicitly recognized as
such.

A better revision control system for open source processes would
eliminate the "privileged writer" problem (so that non-core
contributors aren't forced to resort to diff/patch), eliminate the
tree-rearrangement problem, and bend over backwards to encourage
development on branches (so that trunk-lines of development can more
easily be kept in stable, steadily improving states).

So I've written such a revision control system: `arch'. See:

		http://www.regexps.com

The user's guide is on-line, as is a simple repository browser for the
change history.  There's a read-only copy of arch's self-hosted
repository there too.  (Please let me know if the web pages give you
troubles on your particular browser -- this is the first time I've
tried using tables, images, and colors so heavily.)

Some of the key advantages of arch compared to CVS are:

	1. Atomic, whole-tree commits, reliable repository database.

	   These aim to fix some reliability nits with CVS.

	2. File and directory renames handled cleanly.

	   This aims to solve the problem mentioned above, of
	   maintainers being reluctant to re-arrange source trees.

	3. Fancy features for branching and merging.

	   These aim to provide a practical alternative to a
	   "free-for-all" on trunk lines of development.

	   For example, arch has a high level merge operator that is
	   especially good for projects where multiple maintainers of
	   a project each work on separate branches, merging to and
	   from a shared "trunk" to stay in sync (the `star-merge'
	   command, so called because the graph of trunk and branches
	   has a star topology).

	4. Distributed repositories.

	   This aims to eliminate the need for non-core contributors
	   to resort to diff/patch and to simplify the change-review
	   task for maintainers.

	   arch treats all accessible repositories as one big
	   repository, permitting branch and merge operations to span
	   repository boundaries.  "World-Wide Revision Control" :-)

	5. Automatic ChangeLog maintenance.

	6. Configuration management for multi-package distributions.

	7. Weighs in at about 30K lines of code.

	   (Some of the lines are rather wide, though :-)
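
As an illustration of what "atomic, whole-tree" means in item 1, here
is a minimal sketch of one common way to get all-or-nothing behavior:
stage the whole tree in a scratch directory, then publish it with a
single rename.  This is not a description of how arch stores
revisions; the function name commit_tree, the revision name rev-1, and
the directory layout are hypothetical, and the sketch assumes the
staging directory and the repository are on the same filesystem (so
that the rename is atomic on POSIX systems).

```python
import os
import shutil
import tempfile


def commit_tree(repo_dir, revision, files):
    """Store files (relative path -> text) as one revision, all or nothing."""
    final = os.path.join(repo_dir, revision)
    if os.path.exists(final):
        raise FileExistsError(final)
    # Build the complete tree under a scratch name first.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=repo_dir)
    try:
        for rel_path, text in files.items():
            target = os.path.join(staging, rel_path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "w") as f:
                f.write(text)
        # A single rename publishes every file of the revision at once.
        os.rename(staging, final)
    except BaseException:
        # On any failure, discard the partial tree; nothing was published.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final


# Hypothetical two-file tree committed to a throwaway repository.
repo = tempfile.mkdtemp()
commit_tree(repo, "rev-1", {
    "README": "hello\n",
    "src/main.c": "int main(void) { return 0; }\n",
})
print(sorted(os.listdir(repo)))
```

If writing any file fails, the scratch directory is thrown away and
the repository never shows a half-written revision, which is the
property a file-at-a-time commit cannot offer.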

arch is in pretty good shape in the sense that the core functionality
is done and I've been using it heavily myself.  The main weaknesses,
and hence the opportunities to contribute, are:

	1. I use it only on a BSD-based system.  Though porting
	   to other platforms should be easy, it won't be a no-op,
	   and it hasn't been done yet.

	2. Since revision control ought to be rock-solid reliable,
	   a comprehensive test suite for arch is an important goal:
	   but it's a large job. 

	3. The web interface and facilities for browsing revision
           history are a bit weaker than I'd like -- I'm working on
           that, though.

	4. No facility, yet, for automatically converting a CVS
           repository into an arch repository.

	5. No fancy GUI, yet, for drawing a graph that illustrates
	   the branching and merging history of a project.

	6. No fancy GUI, yet, for running arch commands via a 
	   control panel.

	7. For very large and/or active projects, some performance
	   tuning is likely to be desirable.  I've been using arch on
	   a tree with around 1500 files and find performance to be
	   acceptable.  (By way of contrast, GCC has around 6500
	   files, at least in the old distribution I have on hand.)
	   I perform a small handful of commits per day, whereas I
	   presume that GCC, across all branches, gets at least
	   dozens.  It is straightforward to speed up the arch
	   commands that might cause problems -- they were written for
	   simplicity and functionality first, omitting some obvious
	   speed-ups.

-t