From: Neil Brown <neilb@cse.unsw.edu.au>
To: linux-kernel@vger.rutgers.edu
Date: Thu, 11 May 2000 15:31:03 +1000 (EST)
Subject: devfs - the missing link

Hi there all.

I have been watching the devfs debate for the last year or so with interest and amusement ... and occasional boredom:-)

I must admit that I have sympathies for both sides. I think something is definitely needed, and devfs certainly solves some problems, but is it the "best" solution...?... Or, put another way, I'm sure that it can be made to work, but I suspect there could be a better way of doing it.

Possibly more informative than the content of the argument is the fact of the argument. The fact that (apparently) intelligent and rational people cannot agree - or even agree to differ - about a technical issue seems very significant.

Based on some years of observing and experiencing human behaviour, my view is that when two people cannot agree on something, it is almost always because they are failing to communicate. Either they don't correctly interpret the words that the other is using or (more commonly) they are working from different sets of basic assumptions - they have different axioms.

If you can pinpoint this failure to communicate - if you can identify the item of information that one is assuming and the other is not (or is assuming to be different) - then you can usually resolve the issue.

Obviously I believe there is a communication failure going on -- a common understanding that is missing. In this mail item I hope to expose an important issue that seems to be being glossed over, and then to elaborate the implications of this issue. You may still not like the resulting proposal, but I will have succeeded if people have a clearer idea of what the divisive issues really are.

Like most important truths, the "important issue" can be stated in a simple sentence (cf John 3:16) but can benefit from substantial elaboration (cf the rest of the Bible).
My one sentence summary:

    Device special files are *not* devices, they are gateways to devices.

Before I embark on the elaboration, it might help to identify some particular issues that seem to have caused particular disagreement. I believe that the approach discussed below answers all of these issues to some degree. I'll let you be the judge.

1/ persistence of permissions on device files - not trivial when device files are not persistent. Several solutions have been discussed with no clear agreement.

2/ /dev in a chroot gaol. This requires a /dev which is the same as, but different to, the "real" /dev.

3/ 16 bit device numbers are too small. Do we enlarge them? Deprecate them? If so, how?

4/ Where and how is devfs mounted? /dev? /devices? at the same time as /? at the same time as /proc?

5/ The choice of names of things in devfs - the Linus imposed scheme vs the original scheme.

The rest of the memo comprises:

A: A discussion of what device special files really are.
B: A brief outline of what (I think) I would like a device filetree to look like.
C: A new construct to carry device special files into the next century.
D: Some notes on backward compatibility.
E: Some closing comments.
F: My signature :-)

A: A discussion of what device special files really are.

You probably mostly know all this, but it needs to be said up front to make sure that we are communicating.

In *traditional* Unix (I'm thinking Edition 7 Unix from Bell Labs specifically, though not that much has changed), devices were addressed by a static, 3.5 level numeric hierarchy. The three levels of this hierarchy are:

1/ Block or Character devices (== fronted by buffer cache or not)
2/ Major device number - each number identified a driver
3/ Minor device number - identified a particular device managed (driven?) by that driver.

The extra 0.5 level of hierarchy came because some devices had sub-components: disc drives had partitions.
Tape drives had rewind-on-close behaviour and no-rewind-on-close behaviour. This resulted in some drivers splitting the bits in the minor number up into a device identifier and some extra bits to indicate how to interact with the device. Clearly, the three level hierarchy was already limiting back then.

That is the device hierarchy. Unix wanted to (and this is one of the great strengths of Unix) access the device hierarchy from the filesystem. Hence the device-special files.

A device-special file has ownership and Access Control*, and identifies a particular device. (Note that devices in general have no ownership and no access control, though some drivers might restrict some ioctls (e.g. format-disc) to root.)

The semantic of the device-special file is that if the ACL gives you some particular access, you have that access to the referenced device, rather than access to the device special file itself (i.e. write access doesn't mean that you can change the device-special file, only that you can write to the device). Only root can create device-special files.

Hopefully this explains what I mean when I say that device special files are only gateways to devices, not the devices themselves.

It is worth noting that it is quite possible to create two different device special files which refer to the same device, but have different owners and access control. I haven't ever come across a case where this is useful, but it does help highlight the difference.

* When I refer to Access Control or ACL, I am primarily talking about standard Unix ugo+rwx access control, but any other ACL scheme could be used equally well.

B: A brief outline of what (I think) I would like a device filetree to look like.

The traditional Unix device tree is clearly limiting. There are two particular aspects that are limiting.

1/ 3.5 level hierarchy is too rigid.
2/ numeric identifiers are hard to manage, and not human-friendly.
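To make the "3.5 level" addressing from section A concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 8-bit/8-bit split of a 16 bit device number, the use of the top minor bit as a no-rewind flag, and all names in the code (DeviceAddress, pack_dev, decode_tape_minor, the major number 9) are hypothetical choices for the example, not something taken from any particular driver.

```python
from typing import NamedTuple, Tuple

# Assumed layout, for illustration only: a 16 bit device number with
# 8 bits of major and 8 bits of minor.
MAJOR_BITS = 8
MINOR_MASK = 0xFF

# Assumed driver-private split of the minor number (the extra "0.5" level):
NO_REWIND_BIT = 0x80  # hypothetical: top bit selects no-rewind-on-close
UNIT_MASK = 0x7F      # hypothetical: low seven bits select the drive


class DeviceAddress(NamedTuple):
    kind: str   # level 1: "block" or "char"
    major: int  # level 2: selects the driver
    minor: int  # level 3: selects the device, plus driver-private bits


def pack_dev(major: int, minor: int) -> int:
    """Pack major/minor into one 16 bit number under the assumed layout."""
    if not (0 <= major <= 0xFF and 0 <= minor <= MINOR_MASK):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << MAJOR_BITS) | minor


def unpack_dev(dev: int) -> Tuple[int, int]:
    """Inverse of pack_dev: return (major, minor)."""
    return (dev >> MAJOR_BITS) & 0xFF, dev & MINOR_MASK


def decode_tape_minor(minor: int) -> Tuple[int, bool]:
    """Split a tape minor into (unit, rewind_on_close)."""
    return minor & UNIT_MASK, not (minor & NO_REWIND_BIT)


if __name__ == "__main__":
    # Two addresses for the same physical drive, differing only in the
    # driver-private mode bit.
    rewinding = DeviceAddress("char", 9, 0)
    non_rewinding = DeviceAddress("char", 9, 0 | NO_REWIND_BIT)
    for addr in (rewinding, non_rewinding):
        unit, rewind = decode_tape_minor(addr.minor)
        print(addr, hex(pack_dev(addr.major, addr.minor)), unit, rewind)
```

The point of the sketch is that the "which drive" and "how to behave" information share one small integer, which is why the scheme runs out of room so quickly and why the numbers mean nothing to a human reader.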
The "obvious" response to this is to have a hierarchy that looks like a filesystem - with textual names for elements and arbitrarily many levels as suits particular types of devices - and this is what devfs does.

My reason for proposing something different to the current devfs structure is that I am coming to the problem with different priorities. devfs seems to want to copy the traditional layout of /dev, and with good reason. I have no desire to mimic that, but instead a desire to mimic the 3-level hierarchy of device numbers - but take it a bit further.

This is far from a complete proposal and is intended to be largely indicative of what is possible rather than prescriptive of what should be done. Also, my knowledge of device technologies and nomenclature is fairly superficial, so please excuse me if I say something silly.

The approach I would take to the device hierarchy is to have a controller/instance/function hierarchy where a "function" may be a "controller" of a different sort and so would have a symlink back up to the top.

Taking for example my PCI based AHA2950 dual port SCSI controller with several discs on one port and a DLT library on the other, it might look something like:

pci/
    directory containing all pci busses
pci/0/
    directory containing pci buss 0. Presumably PCI controllers have some sort of address (IO port? memory?) so this might be pci/0x880000000/ or some such.
pci/0/2/
    directory of information about device 2 on pci buss 0. This '2' has a physical meaning, it is not a sequential number. Possibly this could be pci/0/device/2 so that other information could go in pci/0 and not get confused with devices.
pci/0/2/vendor
    file containing vendor id. Yes, procfs-like stuff goes here too.
pci/0/2/deviceid
    file containing deviceid. You might prefer vendor and deviceid to be one file.. so might I.
pci/0/2/function/
    directory containing the different functions of the device.
pci/0/2/function/0 -> ?scsi/0
    a symlink saying that function 0 is the first scsi buss found. The '?' means that something should go here to say "go back to top of device hierarchy". It could be a sequence of "../"s, or possibly something else. See later.
pci/0/2/function/1 -> ?scsi/1
    a symlink identifying that this is the second scsi buss.

scsi/
    directory containing all scsi busses
scsi/0
    first scsi buss. The '0' is a purely sequential number; it has no external meaning.
scsi/0/master -> ?pci/0/2/function/0
    A symlink so that you can find out where this scsi controller exists. "master" probably isn't a good name.
scsi/0/2/
    A directory with information about device 2 on scsi buss 0. In this case a disc drive. Again, possibly scsi/0/device/2
scsi/0/2/function/0 -> ?disc/1
    This is a disc drive (though yours might be a disk drive:-). It is drive 1. (drive 0 is probably ide/0/0 which is pci/0/1 ...)
scsi/0/3/function/0 -> ?disc/2
scsi/0/4/function/0 -> ?disc/3
    more discs.
scsi/1/0/function/0 -> ?tape/0
    LUN0 of device 0 on this scsi buss is a tape drive.
scsi/1/0/function/1 -> ?changer/0
    LUN1 is an auto-changer for changing tapes.

disc/
    directory for all disc drives. I think I would include all ide and scsi non-removable magnetic drives here. There would be separate cdrom/ floppy/. I'm not sure where ZIP drives would go. Possibly the interesting distinction is removable/non-removable....
disc/1/
    Information about disc 1
disc/1/master -> ?scsi/0/2/function/0
disc/1/device
    This *is* the device. You read/write this to access the disc drive. It might look (to stat()) like a device special file, or it might look like a named pipe or a socket.
disc/1/partition -> ?partition/1
    Where to find the partitions. I might be taking the levels of indirection too far here. A disc/1/partition/ directory might be better.
partition/1/
    Partitions on the second partitioned device
partition/1/style
    file containing the word "msdos\n"
partition/1/master -> ?disc/1
partition/1/table
    raw partition table in format according to "style"
partition/1/1/part
    partition 1. This is the real partition. You can open/read/write/mount this. It might look (to stat) like a device special file or a socket or ...
partition/1/2/part
    partition 2
partition/1/2/partition -> ?partition/2
    Partition two has "extended" partitions in it! Again, I might be taking indirection too far.

I think (hope) that you get the idea. The device tree reflects the physical organisation of devices where possible, and allows for "virtual" devices to help flatten the hierarchy. The tree contains not only devices, but also information about devices such as is often found in /proc.

The ownership/access control on things within the tree is minimal and (mostly) not changeable. Almost everything is owned by root. An exception might be slave ptys: those created by a particular user are owned by that user and can have their permissions changed. Access control is mostly wide open (you will see why later). Some things might only be writable by root. Directories are read-only.

We already have some things in the tree that are not physical devices - partitions. They are really a layer of interpretation on top of the data in the device. There are other interpretations that we put on top of data, and they should be reflected in the tree.

filesystem/ (maybe fs/)
    directory of all filesystems
filesystem/ext2
    sub-dir of ext2 file systems
filesystem/ext2/some-long-hex-uuid/
    a particular filesystem. If there are uuid conflicts, or if the filesystem doesn't support uuids, some sequential number would be used.
filesystem/ext2/some-long-hex-uuid/dev -> ?disc/1/device
filesystem/ext2/some-long-hex-uuid/fs/
    The actual filesystem appears to be mounted here, as well as at /usr (or wherever) thanks to Al Viro's new vfs mounting stuff.
filesystem/ext2/some-long-hex-uuid/mountpoints
    a file listing current mount points, or maybe a directory full of symlinks to the mount points.

md/other-hex-uuid/
    directory with assorted stuff about an md device. There would be a link to the actual device in disc/ and a superblock and more.

etc. I understand that Richard is already planning stuff like this for devfs - /dev/volumes I believe.

Just to bring you back to where we are up to, this hierarchy is NOT meant to replace /dev. It replaces the block-or-char/major/minor hierarchy. Like that hierarchy, it has little in the way of access control.

Though the bc/major/minor hierarchy is not directly accessible from the filesystem, it would be nice if this hierarchy were. We could mount it somewhere like /devices. However I would prefer it to go somewhere like //devices or //linux/devices. It wouldn't get mounted there. It would simply always be there, much as / is always there and the bc/major/minor hierarchy is always ... wherever it is.

It is true that linux doesn't currently differentiate // from /, but POSIX allows us to, and there has been talk about going that way, and we can keep that as a long term goal, and mount it in /proc/devices or similar for now.

Also, as the devices in this tree have wide open permissions, if we mount it, we must make sure that the root directory is closed - owner root, permission 700. Probably access through the root should require CAP_MKNOD as well so that it can safely be visible in chroot gaols (presumably chroot changes / but not //).

C: A new construct to carry device special files into the next century.

We have a new device tree, but what good is it if no-one can access it? That is equally true of the old bc/major/minor device hierarchy. We need gateways into it.

To follow the pattern of device special files as outlined above, we need some sort of filesystem object which has ownership and access control, and contains a pointer to a device.
However in this case the pointer to the device is not a pair of numbers but is a textual name. Sounds like a job for symlinks to me.

Obviously symlinks as they are don't cut it, but there are three bits (setuid, setgid, sticky) that we can use to enhance symlinks - and Unix has a (murky?) history of using these bits ... creatively.

Let me propose that a symlink with, say, the setuid bit gets treated differently to symlinks, and somewhat like device special files.

- chmod/chown on such a symlink (let's call it a devlink) applies to the symlink itself, and not to the target of the link.

- If we decide that the device tree doesn't get mounted, then we could assert that such a devlink always points into the device tree instead of the filesystem, but I would prefer to mount the device tree at //devices, and have all the devlinks start with //devices.

- accessing a devlink (open) provides access to the referenced object, and has access permissions checked based on the ACL of the devlink.

- only root (CAP_MKNOD) can create devlinks, or set the setuid bit on a symlink.

This provides essentially the functionality of device special files, but with a more flexible hierarchy for devices. It allows us to give away access to specific devices to specific users, and to have this access stored in a normal filesystem and to be persistent in the way that /dev normally is.

As devices appear in multiple places in the device tree under multiple identities (by phys address, by uuid, by function) we can (hopefully) give away the access that we really want to give away. e.g. //devices/camera/xxx gives you access to the digital camera with uuid xxx no matter which buss it gets connected on.

However, this structure may be a bit too limiting. Suppose that rather than giving away access to a specific device, I want to give away access to a directory full of devices. e.g. You can have access to any digital camera that gets plugged in. I really want to be able to have a devlink that points to a directory.
What does that mean? In particular, how is the ACL carried along if I chdir through a devlink, and how am I prevented from using ".." to walk all over the device tree?

The abstraction that seems to work best here is a "mount". When I access a devlink, particularly one to a directory, I want the directory to effectively be mounted on the symlink ... and with Al's new mount stuff there is no problem mounting different bits in different places, possibly mounting the one directory in several places.

This provides control of "..", but what does it do for preserving the ACL? Here I think we need one more bit of magic.

Every object in the device tree will have ownership/ACL, though the ACLs will be chosen from a fairly limited set and almost everything will be owned by root. However, some objects, particularly devices, will have a sticky bit set. In the device filesystem, the sticky bit will have a special meaning. It means "use the owner/acl of the mountpoint". The mountpoint is always available through the vfsmount structure so getting hold of this should be quite easy.

One thing that this doesn't answer is how symlinks inside the device filesystem get treated when you have only mounted part of it. Possibly these symlinks need to be devlinks as well. I haven't completely resolved this issue for myself, but I don't think it is insurmountable.

So, in summary, we have

- devlinks which are symlinks with the setuid bit set.
- chown/chmod affect devlinks directly, not the target like with symlinks.
- accessing a devlink does some sort of magic mount.
- in the device filesystem, "sticky" objects get their permissions from the mountpoint.
- only root (CAP_MKNOD) can make devlinks.

It might actually be nice to use devlinks more generally:

    /usr -> //devices/filesystem/ext2fs/long-hex-uuid/fs

obviates some of the need for /etc/fstab.

D: Some notes on backward compatibility.

But what about major/minor numbers? Some applications (tar?) still require them.
And we need a clean transition from the old to the new. My old /dev must still work while I am transitioning from the old style to the new.

Making old device special files still work simply means providing a mapping from bc/major/minor to //device/path. This could be encoded into the kernel, or could be provided by making a directory tree:

    /oldevices/char/1,1 -> //proc/mem
    /oldevices/char/5,0 -> //proc/self/controlling-tty
    /oldevices/block/3,0 -> //devices/disc/0/device
    ....

and telling the kernel to look up this directory to resolve device special files. Possibly a combination - the kernel "knows" about many common things and goes to /oldevices for what it doesn't understand - would be best. As it is a transition, speed shouldn't be too important and caching could make up for any lack.

Providing major/minor numbers for programs that really want them is not so straightforward. The best solution probably depends on what real problems there turn out to be. One idea would be that "mknod" on an existing "device" object would set the major/minor numbers of that device, and some boot-time script does the mknod for those devices that really need it.

Possibly devices which haven't been 'mknod'ed just appear to have some automatically allocated unique major/minor number from an unused number space. The number gets allocated the first time the device is used.

E: Some closing comments.

As said, this proposal is very incomplete.

- The naming structure needs lots of thought by someone with lots of relevant experience.

- Maybe devices should appear as device special files with sequentially assigned numbers, or maybe they should appear to be sockets.

- are the links in the device filesystem devlinks or symlinks, and how costly would it be to have literally hundreds of little mounts of the device filesystem?

- complete semantics of devlinks need to be worked out - e.g. when something gets mounted on a devlink, can you still see the link, and how do you chown/chmod it?
- there is no code

- what happens when a name is accessed in the device filesystem that doesn't exist? Do we do a kmod callback or an autofs style upcall? Do we cache negative responses?

- more..

There would still be a place for a devfsd like program, particularly for taking arbitrary action on hot-plug/unplug events, though probably for other things too. I would hope that the system could still work reasonably well without devfsd running though.

This change would not be optional like devfs is, as it substantially changes the way devices are handled. This would imply that we would want to be able to have a stripped down device filesystem for use in embedded systems. How much stripping down would depend on how much code it actually took to implement.

One issue that this proposal doesn't directly resolve is allowing changes to permissions in /dev on a read-only root filesystem. However it doesn't provide for file creation in /tmp on a read-only root filesystem either:-) I like the 'copy /dev to a tmpfs, and then mount that on /dev' approach. I'm sure that other approaches are possible and could be "better", but I think the issue is separate from the issue of how to represent devices.

F: My signature :-)

NeilBrown