From: Tom Lord <lord@regexps.com>
To: fsb@crynwr.com
Subject: the model t
Date: Wed, 16 Jan 2002 01:39:00 -0800 (PST)

One view of the history of the automotive industry is that it is a story about the evolution of manufacturing processes. In the early days, the companies that successfully implemented assembly lines became dominant. Nowadays, some of those same companies are acting to remain competitive by redesigning their entire process, from design to roll-out to showroom to post-sale support -- trying to take advantage of the efficiencies and flexibility made possible by modern computing, communications, parts production, and assembly technology. The classical assembly line is, of course, a dinosaur.

One view of the history of software companies is that it is a similar story about the evolution of software engineering processes. In this story, FSBs have an interesting role.

In the 70s and early 80s, from what I've heard, a lot of development was driven by small teams of gurus-with-low-badge-numbers. The senior OS hackers at minicomputer companies and the first implementors of MacOS, for example, were figures of legend. Their success was pretty much a black art as far as the industry was concerned: difficult to reproduce reliably.

In the mid-to-late 80s, we got software assembly lines. A lot of processes started to become commoditized. A company could plan a large software product, break it down into modules, hire teams of journeyman programmers, hand them marching orders, and succeed with some reliability. Team managers and project leads coordinated development around conference tables and at white-boards. Standardized specialties started showing up in job reqs: QA engineer, release engineer, integration engineer, porting engineer, documentation specialist. These process innovations were sufficiently efficient and reliable to win out over small teams of gurus -- software gurus as a group even got a bad reputation during these years.

With roots in the 80s, but taking off commercially in the nineties, we got "open source processes": distributed, asynchronous development, transcending company and organizational boundaries, coordinated by project maintainers, and often involving distributed, competitive planning and unplanned development.

Compared to software assembly lines, open source processes seem so far to have such advantages as: efficient development cost sharing, opportunistic enhancements, rapid need-driven enhancements, low bureaucratic overhead, multi-role programmers (i.e., efficient soaking up of engineer hours), and implicit testing (Raymond's "many eyeballs"). When open source processes work, they work very, very well, accomplishing quickly and surprisingly what the assembly lines could never do.

The trend, since the eighties, has been for open source projects to grow in areas such as:

- the number of contributing programmers
- the size and complexity of source code
- the number of goals for a project being simultaneously pursued
- the importance of producing a steady stream of distributions of monotonically increasing quality

In response to those trends, programmers have developed a series of techniques and tools for coordinating development -- open source processes have become more of a commodity.

In the early days of GNU, the standard project infrastructure was mailing lists and an FTP site. There was a list for bugs, a list for announcements, and sometimes a list for discussion. Changes to software were exchanged as manually created and applied patch sets.
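Concretely, that exchange looked something like this (the tree and file names here are only illustrative):

    # contributor: diff a pristine tree against a hacked-up copy
    diff -ru gizmo-1.0.orig gizmo-1.0 > frobnicate.patch

    # ...mail frobnicate.patch to the maintainer or the list...

    # maintainer: review the patch, then apply it to the official sources
    cd official/gizmo-1.0
    patch -p1 < /tmp/frobnicate.patch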
In most cases, a single individual acted as "maintainer", having sole responsibility for making changes to the official sources. Projects were slow-paced, small, and simple enough that this was practical.

In the nineties, the mailing lists acquired web-browsable archives; bugs started to be saved in web-based issue-tracking systems; many projects had multiple maintainers who coordinated their changes to the official sources by using a network-accessible CVS repository. That project-hosting infrastructure bundle became something of a commodity, led, I believe, by SourceForge.

The costs and capabilities of that infrastructure, and of the processes that use it, are worthy of close attention from FSBs: that's our manufacturing technology. When we win against closed-shop assembly lines, sure, there's an element of guruism at work, but it's our software engineering tools and techniques that are the source of reproducible success.

Moving from the general to the specific, one part of our open source software engineering technology is of particular interest to me: CVS is a bottleneck in our infrastructure.

On the one hand, CVS does something incredibly useful: it helps multiple maintainers coordinate changes to software. On the other hand, CVS is very limiting: for the most part, it helps only the people who have write access to a repository. Anyone else offering changes still has to go the `diff/patch' route, relying on one of the maintainers to turn patch sets into CVS transactions. That raises the cost and inconvenience of contributing and soaks up maintainer hours on the mundane task of manually manipulating incoming patches. Consider how these costs and inconveniences are multiplied when a given set of changes doesn't quite pass review -- and the maintainers call for revision and resubmission.

CVS use creates some other undesirable artifacts, too. To name a few:

The simple act of improving the maintainability of a project by rearranging the source tree isn't well supported by CVS -- maintainers therefore tend to resist making that kind of improvement.

CVS allows development of new features on "branches", but has an awkward interface for this and only minimal facilities for merging changes between branches (I'll sketch the usual incantations below). Consequently, many projects are characterized by a free-for-all on the trunk line of development, which, rather than steadily increasing in functionality and reliability, regularly thrashes between working, broken, and even uncompilable states.

Unfortunately, I have no hard data that quantifies the costs imposed by the shortcomings of CVS. I do, however, read the developer mailing lists of several projects, and my impression has been that CVS is a serious problem, even if it isn't always explicitly recognized as such.

A better revision control system for open source processes would eliminate the "privileged writer" problem (so that non-core contributors aren't forced to resort to diff/patch), eliminate the tree-rearrangement problem, and bend over backwards to encourage development on branches (so that trunk lines of development can more easily be kept in stable, steadily improving states).

So I've written such a revision control system: `arch'. See:

    http://www.regexps.com

The user's guide is on-line, as is a simple repository browser for the change history. There's a read-only copy of arch's self-hosted repository there too.

(Please let me know if the web pages give you troubles on your particular browser -- this is the first time I've tried using tables, images, and colors so heavily.)
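For concreteness, the branch-and-merge ritual I complained about above looks roughly like this under CVS (the branch name is made up; the commands are standard):

    # start a feature branch (a CVS "branch" is a kind of magic tag)
    cvs tag -b frobnicate-branch
    cvs update -r frobnicate-branch

    # ...hack and commit on the branch...

    # merge the branch back into the trunk
    cvs update -A
    cvs update -j frobnicate-branch
    cvs commit -m "merge frobnicate-branch"

CVS does not record what has already been merged, so anyone who wants to merge from the same branch a second time has to tag the merge points by hand and name them in later `-j' options. arch's `star-merge', described below, is aimed at exactly that kind of repeated merging.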
Some of the key advantages of arch compared to CVS are:

1. Atomic, whole-tree commits and a reliable repository database. These aim to fix some reliability nits with CVS.

2. File and directory renames handled cleanly. This aims to solve the problem mentioned above, of maintainers being reluctant to re-arrange source trees.

3. Fancy features for branching and merging. These aim to provide a practical alternative to a "free-for-all" on trunk lines of development. For example, arch has a high-level merge operator that is especially good for projects where multiple maintainers each work on separate branches, merging to and from a shared "trunk" to stay in sync (the `star-merge' command, so called because the graph of trunk and branches has a star topology).

4. Distributed repositories. This aims to eliminate the need for non-core contributors to resort to diff/patch and to simplify the change-review task for maintainers. arch treats all accessible repositories as one big repository, permitting branch and merge operations to span repository boundaries. "World-Wide Revision Control" :-)

5. Automatic ChangeLog maintenance.

6. Configuration management for multi-package distributions.

7. Weighs in at about 30K lines of code. (Some of the lines are rather wide, though :-)

arch is in pretty good shape in the sense that the core functionality is done and I've been using it heavily myself. The main weaknesses -- and, hence, the opportunities to contribute -- are:

1. I use it only on a BSD-based system. Though porting to other platforms should be easy, it won't be a no-op, and it hasn't been done yet.

2. Since revision control ought to be rock-solid reliable, a comprehensive test suite for arch is an important goal -- but it's a large job.

3. The web interface and facilities for browsing revision history are a bit weaker than I'd like -- I'm working on that, though.

4. No facility, yet, for automatically converting a CVS repository into an arch repository.

5. No fancy GUI, yet, for drawing a graph that illustrates the branching and merging history of a project.

6. No fancy GUI, yet, for running arch commands via a control panel.

7. For very large and/or active projects, some performance tuning is likely to be desirable. I've been using arch on a tree of around 1500 files and find performance to be acceptable. (By way of contrast, GCC has around 6500 files, at least in the old distribution I have on hand.) I perform a small handful of commits per day, whereas I presume that, across all branches, GCC gets at least dozens. It is straightforward to speed up the arch commands that might cause problems -- they were written for simplicity and functionality first, omitting some obvious speed-ups.

-t