
From:	Neil Brown <neilb@cse.unsw.edu.au>
To:	linux-kernel@vger.rutgers.edu
Date:	Thu, 11 May 2000 15:31:03 +1000 (EST)
Subject: devfs - the missing link


Hi there all.

I have been watching the devfs debate for the last year or so with
interest and amusement ... and occasional boredom:-)

I must admit that I have sympathies for both sides.  I think something
is definitely needed, and devfs certainly solves some problems, but
is it the "best" solution...?...
Or, put another way, I'm sure that it can be made to work, but I
suspect there could be a better way of doing it.

Possibly more informative than the content of the argument is the
fact of the argument.  The fact that (apparently) intelligent and
rational people cannot agree - or even agree to differ - about a
technical issue seems very significant.

Based on some years of observing and experiencing human behaviour, my
view is that when two people cannot agree on something, it is almost
always because they are failing to communicate.  Either they don't
correctly interpret the words that the other is using or (more
commonly) they are working from different sets of basic assumptions -
they have different axioms.  If you can pinpoint this failure to
communicate - if you can identify the item of information that one is
assuming and the other is not (or is assuming to be different), then
you can usually resolve the issue.

Obviously I believe there is a communication failure going on -- a
common understanding that is missing.  In this mail item I hope to
expose an important issue that seems to be being glossed over, and
then to elaborate the implications of this issue.  You may still not
like the resulting proposal, but I will have succeeded if people have
a clearer idea of what the divisive issues really are.


Like most important truths, the "important issue" can be stated in a
simple sentence (cf John 3:16) but can benefit from substantial
elaboration (cf the rest of the Bible).

My one sentence summary:
   Device special files are *not* devices, they are gateways to
   devices. 

Before I embark on the  elaboration, it might help to identify some
particular issues that seem to have caused particular disagreement.  I
believe that the approach discussed below answers all of these issues
to some degree.  I'll let you be the judge.
  1/ persistence of permissions on device files - not trivial when
     device files are not persistent.  Several solutions have been
     discussed with no clear agreement.
  2/ /dev in a chroot gaol.  This requires a /dev which is the same
     as, but different to, the "real" /dev.
  3/ 16 bit device numbers are too small.  Do we enlarge them?
     Deprecate them? If so, how?
  4/ Where and how is devfs mounted? /dev? /devices? at the same time
     as /? at the same time as /proc?
  5/ The choice of names of things in devfs - the Linus imposed scheme
     vs the original scheme.

The rest of the memo comprises:
   A:  A discussion of what device special files really are.
   B:  A brief outline of what (I think) I would like a device
       filetree to look like.
   C:  A new construct to carry device special files into the next
       century.
   D:  Some notes on backward compatibility.
   E:  Some closing comments.
   F:  My signature :-)

A:  A discussion of what device special files really are.

  You probably mostly know all this, but it needs to be said up front
  to make sure that we are communicating.

  In *traditional* Unix (I'm thinking Edition 7 Unix from Bell Labs
  specifically, though not that much has changed), devices were
  addressed by a static, 3.5 level numeric hierarchy.

  The three levels of this hierarchy are:
     1/ Block or Character devices (== fronted by buffer cache or not)
     2/ Major device number - Each number identified a driver
     3/ Minor device number - identified a particular device
            managed (driven?) by that driver.

  The extra 0.5 level of hierarchy came because some devices had
  sub-components: disc drives had partitions. Tape drives had
  rewind-on-close behaviour and no-rewind-on-close behaviour.  This
  resulted in some drivers splitting the bits in the minor number up
  into a device identifier and some extra bits to indicate how to
  interact with the device.
  Clearly, the three level hierarchy was already limiting back then.
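
  As a rough illustration of that 3.5 level scheme, here is a small
  Python sketch of a 16 bit device number (8 bits major, 8 bits minor)
  and of a driver carving extra meaning out of the minor number.  The
  tape layout (top bit = no-rewind, low bits = unit) and all the
  function names are assumptions made for the example, not any real
  driver's encoding.

```python
from typing import Tuple

NO_REWIND_BIT = 0x80  # assumed driver-private flag inside the minor number


def make_dev(major: int, minor: int) -> int:
    """Pack major and minor into one 16 bit device number."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << 8) | minor


def split_dev(dev: int) -> Tuple[int, int]:
    """Unpack a 16 bit device number into (major, minor)."""
    return (dev >> 8) & 0xFF, dev & 0xFF


def tape_minor(unit: int, no_rewind: bool) -> int:
    """Encode the 'extra half level': which unit, and how to treat it."""
    if not 0 <= unit < NO_REWIND_BIT:
        raise ValueError("unit does not fit below the flag bit")
    return unit | (NO_REWIND_BIT if no_rewind else 0)


def decode_tape_minor(minor: int) -> Tuple[int, bool]:
    """Recover (unit, no_rewind) from a minor number."""
    return minor & ~NO_REWIND_BIT & 0xFF, bool(minor & NO_REWIND_BIT)
```

  With only 8 bits of minor number, every flag bit a driver claims
  halves the number of units it can address, which is one way to see
  why the scheme was limiting.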

  That is the device hierarchy.  Unix wanted to (and this is one of
  the great strengths of Unix) access the device hierarchy from the
  filesystem.  Hence the device-special files.
  A device-special file has ownership and Access Control*, and
  identifies a particular device.  (note that devices in general have
  no ownership and no access control, though some drivers might
  restrict some ioctls (e.g. format-disc) to root).

  The semantic of the device-special file is that if the ACL gives you
  some particular access, you have that access to the referenced
  device, rather than access to the device special file
  itself. (i.e. write access doesn't mean that you can change the
  device-special file, only that you can write to the device).
  Only root can create device-special files.

  Hopefully this explains what I mean when I say that device special
  files are only gateways to devices, not the devices themselves.
  It is worth noting that it is quite possible to create two different
  device special files which refer to the same device, but have
  different owners and access control.  I haven't ever come across a
  case where this is useful, but it does help highlight the
  difference.

* When I refer to Access Control or ACL, I am primarily talking about
  standard Unix ugo+rwx access control, but any other ACL scheme could
  be used equally well.
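
  The gateway idea can be modelled in a few lines of Python.  This is
  only a sketch under simplifying assumptions: the DeviceFile class,
  the may_access() helper and the device tuple ("block", 8, 0) are
  made up for illustration, and only plain ugo+rwx checks are shown.

```python
from dataclasses import dataclass

READ, WRITE = 4, 2  # rwx bits within one ugo triplet


@dataclass(frozen=True)
class DeviceFile:
    """A device special file: ownership and mode, plus a device reference."""
    owner: int
    group: int
    mode: int       # permission bits, e.g. 0o660
    device: tuple   # (kind, major, minor); the device itself has no ACL


def may_access(node, uid, gids, want):
    """Classic ugo check against the gateway; root bypasses it."""
    if uid == 0:
        return True
    if uid == node.owner:
        bits = (node.mode >> 6) & 7
    elif node.group in gids:
        bits = (node.mode >> 3) & 7
    else:
        bits = node.mode & 7
    return (bits & want) == want


# Two gateways to the same (hypothetical) device, with different ACLs.
disc = ("block", 8, 0)
strict = DeviceFile(owner=0, group=6, mode=0o660, device=disc)
relaxed = DeviceFile(owner=1000, group=6, mode=0o644, device=disc)
```

  Here uid 1000 can write to the disc through "relaxed" but cannot
  even read it through "strict", although both refer to the same
  device.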

B:  A brief outline of what (I think) I would like a device
       filetree to look like.

  The traditional Unix device tree is clearly limiting.  There are two
  particular aspects that are limiting.
  1/ 3.5 level hierarchy is too rigid.
  2/ numeric identifiers are hard to manage, and not human-friendly. 

  The "obvious" response to this is to have a hierarchy that looks
  like a filesystem - with textual names for elements and arbitrarily
  many levels as suit particular types of devices - and this is what
  devfs does.
  My reason for proposing something different to the current devfs
  structure is that I am coming to the problem with different
  priorities.  devfs seems to want to copy the traditional layout of
  /dev, and with good reason.  I have no desire to mimic that, but
  instead a desire to mimic the 3-level hierarchy of devices numbers -
  but take it a bit further.

  This is far from a complete proposal and is intended to be largely
  indicative of what is possible rather than prescriptive of what
  should be done.
  Also, my knowledge of device technologies and nomenclature is
  fairly superficial, so please excuse me if I say something silly.

  The approach I would take to the device hierarchy is to have a
  controller/instance/function hierarchy where a "function" may be a
  "controller" of a different sort and so would have a symlink back up to
  the top.
  Taking for example my PCI based AHA2950 dual port SCSI controller
  with several discs on one port and a DLT library on the other, it
  might look something like:

  pci/                     directory containing all pci busses
  pci/0/                   directory containing pci buss 0.
                           Presumably PCI controllers have some sort
                           of address (IO port? memory?) so this might
                           be pci/0x880000000/ or some such.
  pci/0/2/		   directory of information about device 2 on
                           pci buss 0.  This '2' has a physical
                           meaning, it is not a sequential number.
                           Possibly this could be pci/0/device/2 so
                           that other information could go in pci/0
                           and not get confused with devices.
  pci/0/2/vendor	   file containing vendor id.  Yes, procfs-like
			   stuff goes here too.
  pci/0/2/deviceid	   file containing deviceid.  You might prefer
			   vendor and deviceid to be one file.. so might I.
  pci/0/2/function/        directory containing the different
			   functions of the device.
  pci/0/2/function/0 -> ?scsi/0
		           a symlink saying that function 0 is the
			   first scsi buss found.
			   The '?' means that something should go here
			   to say "go back to top of device
			   hierarchy".  It could be a sequence of "../"s,
			   or possibly something else. See later.
  
  pci/0/2/function/1 -> ?scsi/1
		           a symlink identifying that this is the
			   second scsi buss.

  scsi/			   directory containing all scsi busses
  scsi/0		   first scsi buss. the '0' is a purely
			   sequential number it has no external
			   meaning.		 
  scsi/0/master -> ?pci/0/2/function/0
			   A symlink so that you can find out where
			   this scsi controller exists.  "master"
			   probably isn't a good name.
  scsi/0/2/		   A directory with information about device 2
			   on scsi buss 0. In this case a disc drive.
			   Again, possibly scsi/0/device/2
  scsi/0/2/function/0 -> ?disc/1
		           This is a disc drive (though yours might
			   be a disk drive:-). It is drive 1. (drive 0
			   is probably ide/0/0 which is pci/0/1 ...)
  scsi/0/3/function/0 -> ?disc/2
  scsi/0/4/function/0 -> ?disc/3
			   more discs.
  scsi/1/0/function/0 -> ?tape/0
			   LUN0 of device 0 on this scsi buss is a
			   tape drive.
  scsi/1/0/function/1 -> ?changer/0
		           LUN1 is an auto-changer for changing tapes.

  disc/			   directory for all disc drives.  I think I
			   would include all ide and scsi
			   non-removable magnetic drives here.  There
			   would be separate cdrom/ floppy/.  I'm not
			   sure where ZIP drives would go.
			   Possibly the interesting distinction is
			   removable/non-removable....
  disc/1/		   Information about disc 1
  disc/1/master -> ?scsi/0/2/function/0
  disc/1/device		   This *is* the device. You read/write this
			   to access the disc drive. It might look (to stat())
			   like a device special file, or it might
			   look like a named pipe or a socket.
  disc/1/partition -> ?partition/1
		           Where to find the partitions.
			   I might be taking the levels of indirection
			   too far here. A disc/1/partition/ directory
			   might be better.
  partition/1/		   Partitions on the second partitioned device
  partition/1/style	   file containing the word "msdos\n"
  partition/1/master -> ?disc/1
  partition/1/table        raw partition table in format according to
			   "style"
  partition/1/1/part	   partition 1.  This is the real partition.
			   You can open/read/write/mount this. It
			   might look (to stat) like a device special
			   file or a socket or ...
  partition/1/2/part	   partition 2
  partition/1/2/partition -> ?partition/2
			   Partition two has "extended" partitions in
			   it!  Again, I might be taking indirection
			   too far.

  I think (hope) that you get the idea. The device tree reflects the
  physical organisation of devices where possible, and allows for
  "virtual" devices to help flatten the hierarchy.  The tree contains
  not only devices, but also information about devices such as is
  often found in /proc.
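
  To make the link structure concrete, here is a minimal Python sketch
  that follows the "?" links from the listings above.  It assumes that
  "?" simply means "restart at the top of the device tree", stores
  only the links (everything else is taken to be an ordinary directory
  or file), and uses a made-up resolve() helper; a real implementation
  would of course live in the kernel's path walk.

```python
# Link name -> target, both relative to the top of the device tree.
# Entries are copied from the example hierarchy, with the '?' dropped.
LINKS = {
    "pci/0/2/function/0": "scsi/0",
    "pci/0/2/function/1": "scsi/1",
    "scsi/0/master": "pci/0/2/function/0",
    "scsi/0/2/function/0": "disc/1",
    "disc/1/master": "scsi/0/2/function/0",
    "disc/1/partition": "partition/1",
}


def resolve(path, links=LINKS, max_hops=16):
    """Follow links component by component, restarting at the tree root."""
    pending = path.strip("/").split("/")
    done = []
    hops = 0
    while pending:
        done.append(pending.pop(0))
        target = links.get("/".join(done))
        if target is not None:
            hops += 1
            if hops > max_hops:
                raise RuntimeError("too many levels of links")
            # The link replaces everything walked so far.
            pending = target.split("/") + pending
            done = []
    return "/".join(done)
```

  Starting from the physical address of the controller, for example,
  resolve("pci/0/2/function/0/2/function/0/device") ends up at
  "disc/1/device".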

  The ownership/access control on things within the tree is minimal
  and (mostly) not changeable.  Almost everything is owned by root.  An
  exception might be slave ptys: those created by a particular user
  are owned by that user and can have their permissions changed.
  Access control is mostly wide open (you will see why later).  Some
  things might only be writable by root.  Directories are read-only.

  We already have some things in the tree that are not physical
  devices - partitions.  They are really a layer of interpretation
  on top of the data in the device.  There are other interpretations
  that we put on top of data, and they should be reflected in the
  tree.

  filesystem/		  (maybe fs/) directory of all filesystems
  filesystem/ext2	  sub-dir of ext2 file systems
  filesystem/ext2/some-long-hex-uuid/
		          a particular filesystem.  If there are uuid
			  conflicts, or if the filesystem doesn't
			  support uuids, some sequential number would
			  be used.
  filesystem/ext2/some-long-hex-uuid/dev -> ?disc/1/device
  filesystem/ext2/some-long-hex-uuid/fs/
			  The actual filesystem appears to be mounted
			  here, as well as at /usr (or wherever)
			  thanks to Al Viro's new vfs mounting stuff.
  filesystem/ext2/some-long-hex-uuid/mountpoints
			  a file listing current mount points, or
			  maybe a directory full of symlinks to the
			  mount points.

  md/other-hex-uuid/
			  directory with assorted stuff about an md
			  device.  There would be a link to the
			  actual device in disc/ and a superblock and
			  more. 
  etc.

  I understand that Richard is already planning stuff like this for
  devfs - /dev/volumes I believe.

  
  Just to bring you back to where we are up to, this hierarchy is NOT
  meant to replace /dev.  It replaces the block-or-char/major/minor
  hierarchy.  Like that hierarchy, it has little in the way of access
  control. 
  Though the bc/major/minor hierarchy is not directly accessible from
  the filesystem, it would be nice if this hierarchy were.  We could
  mount it somewhere like /devices.  However I would prefer it to go
  somewhere like //devices or //linux/devices.  It wouldn't get
  mounted there. It would simply always be there, much as / is always
  there and the bc/major/minor hierarchy is always ... wherever it
  is.
  It is true that linux doesn't currently differentiate // from /, but
  POSIX allows us to, and there has been talk about going that way,
  and we can keep that as a long term goal, and mount it in
  /proc/devices or similar for now.


  Also, as the devices in this tree have wide open permissions, if we
  mount it, we must make sure that the root directory is closed -
  owner root, permission 700.  Probably access through the root should
  require CAP_MKNOD as well so that it can safely be visible in chroot
  gaols (presumably chroot changes / but not //).

C:  A new construct to carry device special files into the next century.

  We have a new device tree, but what good is it if no-one can access
  it?  That is equally true of the old bc/major/minor device
  hierarchy.  We need gateways into it.

  To follow the pattern of device special files as outlined above, we
  need some sort of filesystem object which has ownership and access
  control, and contains a pointer to a device.  However in this case
  the pointer to the device is not a pair of numbers but is a textual
  name.   Sounds like a job for symlinks to me.
  
  Obviously symlinks as they are don't cut it, but there are three
  bits (setuid, setgid, sticky) that we can use to enhance symlinks -
  and Unix has a (murky?) history of using these bits ... creatively. 

  Let me propose that a symlink with, say, the setuid bit gets treated
  differently to symlinks, and somewhat like device special files.

  - chmod/chown on such a symlink (let's call it a devlink) applies to
    the symlink itself, and not to the target of the link.
  - If we decide that the device tree doesn't get mounted, then we
    could assert that such a devlink always points into the device
    tree instead of the filesystem, but I would prefer to mount the
    device tree at //devices, and have all the devlinks start
    //devices.
  - accessing a devlink (open) provides access to the referenced
    object, and has access permissions checked based on the ACL of the
    devlink. 
  - only root (CAP_MKNOD) can create devlinks, or set the setuid bit
    on a symlink. 
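
  The rules above can be sketched in Python as follows.  This is a toy
  model, not kernel code: the Link class and the helper names are
  invented for the example, the CAP_MKNOD check is reduced to
  "uid 0", and the group class is left out for brevity.

```python
import stat
from dataclasses import dataclass


@dataclass
class Link:
    mode: int     # file type and permission bits, as in st_mode
    owner: int
    target: str


def is_devlink(link):
    """A devlink is modelled as a symlink with the setuid bit set."""
    return stat.S_ISLNK(link.mode) and bool(link.mode & stat.S_ISUID)


def make_devlink(creator_uid, owner, perms, target):
    """Only root may create one (standing in for a CAP_MKNOD check)."""
    if creator_uid != 0:
        raise PermissionError("creating a devlink needs CAP_MKNOD")
    return Link(stat.S_IFLNK | stat.S_ISUID | perms, owner, target)


def may_open(link, uid, want):
    """Check 'want' (an rwx mask, 0-7) against the devlink's own bits."""
    if not is_devlink(link):
        raise ValueError("plain symlink: permissions come from the target")
    if uid == 0:
        return True
    shift = 6 if uid == link.owner else 0  # owner bits, else 'other' bits
    return ((link.mode >> shift) & want) == want
```

  For instance, make_devlink(0, 1000, 0o600, "//devices/camera/xxx")
  yields a link through which uid 1000 may read and write the camera,
  while other users may not.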

  This provides essentially the functionality of device special
  files, but with a more flexible hierarchy for devices.  It allows us
  to give away access to specific devices to specific users, and to
  have this access stored in a normal filesystem and to be persistent
  in the way that /dev normally is.
  As devices appear in multiple places in the device tree under
  multiple identities (by phys address, by uuid, by function) we can
  (hopefully) give away the access that we really want to give away.
  e.g. //devices/camera/xxx gives you access to digital camera with
  uuid xxx no matter which buss it gets connected on.

  However, this structure may be a bit too limiting.  Suppose that
  rather than giving away access to a specific device, I want to give
  away access to a directory full of devices. e.g. You can have
  access to any digital camera that gets plugged in.  I really want to
  be able to have a devlink that points to a directory.  What does
  that mean?  In particular, how is the ACL carried along if I chdir
  through a devlink, and how am I prevented from using ".." to walk
  all over the device tree?

  The abstraction that seems to work best here is a "mount".  When I
  access a devlink, particularly one to a directory, I want the
  directory to effectively be mounted on the symlink ... and with Al's
  new mount stuff there is no problem mounting different bits in
  different places, possibly mounting the one directory in several
  places.   This provides control of "..", but what does it do for
  preserving the ACL?

  Here I think we need one more bit of magic.
  Every object in the device tree will have ownership/ACL, though the
  ACLs will be chosen from a fairly limited set and almost everything
  will be owned by root.  However, some objects, particularly devices,
  will have a sticky bit set.  In the device filesystem, the sticky bit
  will have a special meaning. It means "use the owner/acl of the
  mountpoint".  The mountpoint is always available through the
  vfsmount structure so getting hold of this should be quite easy.
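
  A minimal Python sketch of that sticky-bit rule, assuming the
  mountpoint's owner and mode are already to hand (the Perm class and
  effective_perm() are hypothetical names, and no real vfsmount
  lookup is shown):

```python
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class Perm:
    owner: int
    mode: int   # permission bits, possibly with the sticky bit set


def effective_perm(obj, mountpoint):
    """Sticky objects borrow owner and permission bits from the mountpoint."""
    if obj.mode & stat.S_ISVTX:
        return Perm(mountpoint.owner, mountpoint.mode & 0o777)
    return Perm(obj.owner, obj.mode & 0o777)
```

  So a wide-open sticky device such as Perm(0, stat.S_ISVTX | 0o666),
  reached through a devlink owned by uid 1000 with mode 0o600, is
  treated as owned by uid 1000 with mode 0o600, while a non-sticky
  object keeps its own owner and mode.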

  One thing that this doesn't answer is how symlinks inside the device
  filesystem get treated when you have only mounted part of it.
  Possibly these symlinks need to be devlinks as well.  I haven't
  completely resolved this issue for myself, but I don't think it is
  insurmountable.

  So, in summary, we have
   - devlinks which are symlinks with setuid bit set.
   - chown/chmod affect devlinks directly, not the target like
     with symlinks.
   - accessing a devlink does some sort of magic mount
   - in the device filesystem, "sticky" objects get their permissions
     from the mountpoint.
   - only root(CAP_MKNOD) can make devlinks.

  It might actually be nice to use devlinks more generally:

    /usr -> //devices/filesystem/ext2/long-hex-uuid/fs

  obviates some of the need for /etc/fstab.

D:  Some notes on backward compatibility.

  But what about major/minor numbers?  Some applications (tar?) still
  require them.  And we need a clean transition from the old to the
  new.  My old /dev must still work while I am transitioning from the
  old style to the new.

  Making old device special files still work simply means providing a
  mapping from bc/major/minor to //device/path.   This could be
  encoded into the kernel, or could be provided by making a directory
  tree:
    /oldevices/char/1,1 -> //proc/mem
    /oldevices/char/5,0 -> //proc/self/controlling-tty
    /oldevices/block/3,0 -> //devices/disc/0/device
    ....
  and telling the kernel to look up this directory to resolve device
  special files.  Possibly a combination - the kernel "knows" about
  many common things and goes to /oldevices for what it doesn't
  understand - would be best.  As it is a transition, speed shouldn't
  be too important and caching could make up for any lack.
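
  The combined approach might look roughly like this in Python.  It is
  a sketch only: the LegacyResolver class is invented for the example,
  the split between "built in" and "looked up" entries is arbitrary,
  and a plain dict stands in for reading links under /oldevices.

```python
# Mappings the kernel is assumed to "know" without any lookup.
BUILTIN = {
    ("char", 1, 1): "//proc/mem",
    ("char", 5, 0): "//proc/self/controlling-tty",
}

# Stand-in for the /oldevices directory tree: link name -> target.
OLDEVICES = {
    "block/3,0": "//devices/disc/0/device",
}


class LegacyResolver:
    """Map (kind, major, minor) to a device path, caching directory hits."""

    def __init__(self, builtin, directory):
        self.builtin = dict(builtin)
        self.directory = directory
        self.cache = {}
        self.directory_lookups = 0  # counts the slow lookups

    def resolve(self, kind, major, minor):
        key = (kind, major, minor)
        if key in self.builtin:
            return self.builtin[key]
        if key not in self.cache:
            self.directory_lookups += 1
            # None is cached too, so unknown numbers are only looked up once.
            self.cache[key] = self.directory.get(f"{kind}/{major},{minor}")
        return self.cache[key]
```

  Repeated opens of an old-style block 3,0 node then cost one
  directory lookup in total; later ones are answered from the cache.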

  Providing major/minor numbers for programs that really want them is
  not so straightforward.  The best solution probably depends on what
  real problems there turn out to be.
  One idea would be that "mknod" on an existing "device" object would
  set the major/minor numbers of that device, and some boot-time script
  does the mknod for those devices that really need it.

  Possibly devices which haven't been 'mknod'ed just appear to have
  some automatically allocated unique major/minor number from an
  unused number space.  The number gets allocated the first time the
  device is used.
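
  Both ideas (explicit mknod, plus allocation on first use) fit in one
  small table.  The Python below is a sketch under assumptions: the
  class name, the choice of major 240 as the start of the "unused"
  range and the 256 minors per major are all placeholders.

```python
import itertools


class LazyDevNumbers:
    """Hand out (major, minor) pairs on first use; mknod can pin a pair."""

    def __init__(self, first_major=240):
        self.assigned = {}   # device path -> (major, minor)
        self.owner_of = {}   # (major, minor) -> device path
        # Endless stream of candidate pairs from the assumed unused range.
        self._free = ((first_major + n // 256, n % 256)
                      for n in itertools.count())

    def mknod(self, path, major, minor):
        """Explicitly set the numbers for a device, e.g. from a boot script."""
        pair = (major, minor)
        if self.owner_of.get(pair, path) != path:
            raise ValueError("device number already in use")
        old = self.assigned.get(path)
        if old is not None:
            del self.owner_of[old]
        self.assigned[path] = pair
        self.owner_of[pair] = path

    def number_for(self, path):
        """Return the device's numbers, allocating them on first use."""
        if path not in self.assigned:
            pair = next(p for p in self._free if p not in self.owner_of)
            self.mknod(path, *pair)
        return self.assigned[path]
```

  A device that was mknod'ed keeps the numbers it was given; anything
  else gets the next free pair the first time it is asked for, and the
  same pair on every later call.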

E:  Some closing comments.

  As said, this proposal is very incomplete.
    - The naming structure needs lots of thought by someone with lots
      of relevant experience.
    - Maybe devices should appear as device special files with
      sequentially assigned numbers, or maybe they should appear to be
      sockets. 
    - are the links in the device filesystem devlinks or symlinks, and
      how costly would it be to have literally hundreds of little
      mounts of the device filesystem?
    - complete semantics of devlinks need to be worked out - e.g. when
      something gets mounted on a devlink, can you still see the link,
      and how do you chown/chmod it?
    - there is no code
    - what happens when a name is accessed in the device filesystem
      that doesn't exist? Do we do a kmod callback or an autofs style
      upcall?  Do we cache negative responses?
    - more..

  There would still be a place for a devfsd like program, particularly
  for taking arbitrary action on hot-plug/unplug events, though
  probably for other things too.  I would hope that the system could
  still work reasonably well without devfsd running though.

  This change would not be optional like devfs is, as it substantially
  changes the way devices are handled.  This would imply that we would
  want to be able to have a stripped down device filesystem for use in
  embedded systems.  How much stripping down would depend on how much
  code it actually took to implement.

  One issue that this proposal doesn't directly resolve is allowing
  changes to permissions in /dev on a read-only root filesystem.
  However it doesn't provide for file creation in /tmp on a read-only
  root filesystem either:-)  I like the 'copy /dev to a tmpfs, and then mount
  that on /dev' approach.  I'm sure that other approaches are possible
  and could be "better", but I think the issue is separate from the
  issue of how to represent devices.

F:  My signature :-)

NeilBrown
