From: Richard Gooch <rgooch@atnf.CSIRO.AU>
To: linux-kernel@vger.rutgers.edu
Subject: 2nd draft README for devfs (long)

  Hi, all. After reading the various messages flying around on this
topic and doing some more thinking, I've updated the justification
section of my README for devfs.

Basically, I think the days of major and minor numbers are numbered
(pun unintended). I don't mean 8 bit major&minors are doomed. I mean
the whole concept is doomed. Increasing these to 16 bits each is
merely a kludge and, what's worse, it doesn't scale. Either you chew
heaps of RAM or you scan lists. See below for more detail.

Also, for those interested in seeing what the interface currently
looks like, I've appended an extract from the source.

Soon I will be posting a kernel patch that people can start playing
with.

                                Regards,

                                        Richard....
===============================================================================
void *dev_register (unsigned int major, unsigned int minor, umode_t mode,
                    uid_t uid, gid_t gid,
                    const char *name, unsigned int namelen,
                    struct file_operations *fops, int auto_owner)
/*  [SUMMARY] Register a device entry.
    <major> The major number. Not needed for regular files.
    <minor> The minor number. Not needed for regular files.
    <mode> The default file mode.
    <uid> The default UID of the file.
    <gid> The default GID of the file.
    <name> The name of the entry.
    <namelen> The number of characters in <<name>>.
    <fops> The file_operations structure. This must not be externally
    deallocated.
    <auto_owner> If TRUE then when a closed inode is opened the
    ownerships are set to the opening process and the protection is set
    to that given in <<mode>>. When the inode is closed, ownership
    reverts back to <<uid>> and <<gid>> and the protection is set to
    read-write for all.
    [RETURNS] A handle which may later be used in a call to
    [<dev_unregister>]. On failure NULL is returned.
*/

void dev_unregister (void *handle, const char *name, unsigned int namelen,
                     unsigned int major, unsigned int minor)
/*  [SUMMARY] Unregister a device entry.
    <handle> A handle previously created by [<dev_register>]. If this
    is NULL then the list of devices must be searched.
    <name> The name of the entry. This is ignored if <<handle>> is not
    NULL.
    <namelen> The number of characters in <<name>>.
    <major> The major number. This is used if <<handle>> and <<name>>
    are NULL.
    <minor> The minor number. This is used if <<handle>> and <<name>>
    are NULL.
    [RETURNS] Nothing.
*/
===============================================================================

Device File System (devfs) Overview

Richard Gooch <rgooch@atnf.csiro.au>

9-JAN-1998


What is it?
===========

Devfs is an alternative to "real" character and block special devices
on your root filesystem. Kernel device drivers can register devices by
name rather than major and minor numbers. These devices will appear in
the devfs automatically, with whatever default ownership and
protection the driver specified.


Why do it?
==========

There are several problems that devfs addresses. Some of these
problems are more serious than others (depending on your point of
view), and some can be solved without devfs. However, the totality of
these problems really calls out for devfs.

Major&minor allocation
----------------------
The existing scheme requires the allocation of major and minor device
numbers for each and every device. This means that a central
co-ordinating authority is required to issue these device numbers
(unless you're developing a "private" device driver), in order to
preserve uniqueness. Devfs shifts the burden to a namespace. This may
not seem like a huge benefit, but actually it is. Since driver authors
will naturally choose a device name which reflects the functionality
of the device, there is far less potential for namespace conflict.
Solving this requires a kernel change.
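
To make the "register by name" idea concrete, here is a small
userspace sketch. It is not the devfs implementation: model_register(),
model_unregister() and the fixed-size table are hypothetical names made
up for illustration, and the sketch ignores everything in the real
dev_register() interface except the name and its length. It only shows
that uniqueness can be enforced on the name itself, with no numbers
handed out by a central authority.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limits for this model only. */
#define MODEL_MAX_ENTRIES 16
#define MODEL_MAX_NAME    32

/* One registered name; a stand-in for a real devfs entry. */
struct model_dev {
    char name[MODEL_MAX_NAME];
    int  in_use;
};

static struct model_dev model_table[MODEL_MAX_ENTRIES];

/* Register an entry by name. Returns a handle, or NULL on failure
 * (bad name, name already taken, or table full). */
void *model_register(const char *name, unsigned int namelen)
{
    struct model_dev *free_slot = NULL;
    size_t i;

    if (name == NULL || namelen == 0 || namelen >= MODEL_MAX_NAME)
        return NULL;

    for (i = 0; i < MODEL_MAX_ENTRIES; i++) {
        struct model_dev *d = &model_table[i];

        if (d->in_use) {
            /* The only conflict possible is a duplicate name. */
            if (strlen(d->name) == namelen &&
                memcmp(d->name, name, namelen) == 0)
                return NULL;
        } else if (free_slot == NULL) {
            free_slot = d;          /* remember the first free slot */
        }
    }
    if (free_slot == NULL)
        return NULL;

    memcpy(free_slot->name, name, namelen);
    free_slot->name[namelen] = '\0';
    free_slot->in_use = 1;
    return free_slot;
}

/* Release an entry using the handle returned by model_register(). */
void model_unregister(void *handle)
{
    struct model_dev *d = handle;

    if (d != NULL)
        d->in_use = 0;
}
```

Registering "null" twice fails the second time, while "null" and
"zero" can coexist without either driver author having asked anybody
for a number.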

/dev management
---------------
Because you currently access devices through device nodes, these must
be created by the system administrator. For standard devices you can
usually find a MAKEDEV programme which creates all these (hundreds
of!) nodes. This means that changes in the kernel must be reflected by
changes in the MAKEDEV programme, or else the system administrator
creates device nodes by hand.
The basic problem is that there are two separate databases of
major and minor numbers. One is in the kernel and one is in /dev (or
in a MAKEDEV programme, if you want to look at it that way).
Solving this requires a kernel change.

/dev growth
-----------
I maintain a subset of the common /dev nodes, and I have nearly 600!
Others have twice this number. Most of these devices simply don't
exist because the hardware is not available. A huge /dev increases the
time to access devices (I'm just referring to the dentry lookup times
here: the next section shows some more horrors).
An example of how big /dev can grow is if we consider SCSI devices:

    bus           4 bits
    unit          8 bits
    LUN           8 bits
    partition     6 bits
    TOTAL        26 bits

This requires 64 Mega (64 * 1024 * 1024 = 2^26) inodes if we want to
store all possible device nodes. Even if we scrap different units and
LUNs, that's still 10 bits (bus plus partition) or 1024 inodes. Each
VFS inode takes around 256 bytes (kernel 2.1.78), so that's 256 kBytes
of inode storage!
This could be solved in user-space using a clever programme which
scanned the kernel logs and deleted /dev entries which are not
available and created them when they were available. This programme
would need to be run every time a new module was loaded, which would
slow things down a lot. Devfs is much cleaner.

Node to driver file_operations translation
------------------------------------------
There is an important difference between the way disc-based c&b nodes
and devfs make the connection between an entry in /dev and the actual
device driver.

With the current 8 bit major and minor numbers the connection between
disc-based c&b nodes and per-major drivers is done through a
fixed-length table of 128 entries. The various filesystem types set
the inode operations for c&b nodes to {chr,blk}dev_inode_operations,
so when a device is opened a few quick levels of indirection bring us
to the driver file_operations.

For miscellaneous character devices a second step is required: there
is a scan for the driver entry with the same minor number as the file
that was opened, and the appropriate minor open method is called. This
scanning is done *every time* you open a device node. Potentially, you
may be searching through dozens of misc. entries before you find your
open method.

Linux *must* move beyond the 8 bit major and minor barrier,
somehow. If we simply increase each to 16 bits, then the indexing
scheme used for major driver lookup becomes untenable, because the
major tables (one each for character and block devices) would need to
be 64 k entries long (512 kBytes on x86, 1 Mbyte for 64 bit
systems). So we would have to use a scheme like that used for
miscellaneous character devices, which means the search time goes up
linearly with the average number of major device drivers.

Note that devfs doesn't use the major&minor system. For devfs
entries, the connection is done when you look up the /dev entry. When
dev_register() is called, an entry holding the name and the
file_operations is appended to an internal table. If the dentry cache
doesn't have the /dev entry already, this internal table is scanned to
get the file_operations, and an inode is created. If the dentry cache
already has the entry, there is *no lookup time* (other than the
dentry scan itself, but we can't avoid that anyway, and besides Linux
dentries cream other OSes which don't have them:-). Furthermore, the
number of node entries in a devfs is only the number of available
device entries, not the number of *conceivable* entries.
Even if you remove unnecessary entries in a disc-based /dev, the
number of conceivable entries remains the same.

Devfs provides a fast connection between a VFS node and the device
driver, in a scalable way.

/dev as a system administration tool
------------------------------------
Right now /dev contains a list of conceivable devices, most of which I
don't have. A devfs would only show those devices available on my
system. This means that listing /dev would be a handy way of checking
what devices were available.

Major&minor size
----------------
Existing major and minor numbers are limited to 8 bits each. This is
now a limiting factor for some drivers, particularly the SCSI disc
driver, which consumes a single major number. Only 16 discs are
supported, and each disc may have only 15 partitions. Maybe this isn't
a problem for you, but some of us are building huge Linux systems with
disc arrays.
Solving this requires a kernel change.

Readonly root filesystem
------------------------
Having your device nodes on the root filesystem means that you can't
operate properly with a read-only root filesystem. This is because you
want to change ownerships and protections of tty devices. Existing
practice prevents you using a CD-ROM as your root filesystem for a
*real* system. Sure, you can boot off a CD-ROM, but you can't change
tty ownerships, so it's only good for installing.
Also, you can't use a shared NFS root filesystem for a cluster of
discless Linux machines (having tty ownerships changed on a common
/dev is not good). Nor can you embed your root filesystem in a
ROM-FS.
You can get around this by creating a RAMDISC at boot time, making
an ext2 filesystem in it, mounting it somewhere and copying the
contents of /dev into it, then unmounting it and mounting it over
/dev. A devfs is a cleaner way of solving this.

Non-Unix root filesystem
------------------------
Non-Unix filesystems (such as NTFS) can't be used for a root
filesystem because they variously don't support character and block
special files or symbolic links. You can't have a separate disc-based
or RAMDISC-based filesystem mounted on /dev because you need device
nodes before you can mount. Devfs can be mounted without any device
nodes.
Solving this requires devfs.

PTY security
------------
Current pseudo-tty (pty) devices are owned by root and read-writable
by everyone. The user of a pty-pair cannot change
ownership/protections without being suid-root.
This could be solved with a secure user-space daemon which runs as
root and does the actual creation of pty-pairs. Such a daemon would
require modification to *every* programme that wants to use this new
mechanism. It also slows down creation of pty-pairs.
An alternative is to create a new open_pty() syscall which does much
the same thing as the user-space daemon. Once again, this requires
modifications to pty-handling programmes.
The devfs solution would allow a device driver to "tag" certain device
files so that when an unopened device is opened, the ownerships are
changed to the current euid and egid of the opening process, and the
protections are changed to the default registered by the driver. When
the device is closed ownership is set back to root and protections are
set back to read-write for everybody. No programme need be changed.
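
The open/close behaviour of a "tagged" device file can be sketched in
userspace as a small state machine. This is only a model under stated
assumptions, not kernel code: struct model_entry and the model_*()
functions are hypothetical names, root is taken to be uid 0 and gid 0,
and "read-write for everybody" is taken to be mode 0666. The open
count is an extra assumption so that only the first open and the last
close change ownership.

```c
#include <sys/types.h>

/* Hypothetical model of a tagged devfs entry; not a kernel structure. */
struct model_entry {
    mode_t       default_mode;  /* protection registered by the driver */
    uid_t        uid;           /* current owner */
    gid_t        gid;           /* current group */
    mode_t       mode;          /* current protection */
    unsigned int open_count;    /* number of outstanding opens */
};

/* An unopened entry is owned by root and read-write for everybody. */
void model_init(struct model_entry *e, mode_t default_mode)
{
    e->default_mode = default_mode;
    e->uid = 0;
    e->gid = 0;
    e->mode = 0666;
    e->open_count = 0;
}

/* Opening an unopened entry hands it to the opening process and
 * applies the driver's default protection. Later opens change nothing. */
void model_open(struct model_entry *e, uid_t euid, gid_t egid)
{
    if (e->open_count++ == 0) {
        e->uid = euid;
        e->gid = egid;
        e->mode = e->default_mode;
    }
}

/* The last close gives the entry back to root, read-write for all. */
void model_close(struct model_entry *e)
{
    if (e->open_count == 0)
        return;                 /* not open: nothing to do */
    if (--e->open_count == 0) {
        e->uid = 0;
        e->gid = 0;
        e->mode = 0666;
    }
}
```

With a driver default of 0600, an entry opened by euid 1000 becomes
owned by 1000 with mode 0600 for as long as it stays open, and goes
back to root with mode 0666 after the last close. The programme doing
the open and close needs no changes, which is the point made above.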