Date:	Sun, 30 Jul 2000 18:23:05 -0400 (EDT)
From:	Alexander Viro <aviro@redhat.com>
To:	linux-kernel@vger.rutgers.edu
Subject: [RFC][Long][Horror story] Mount flags

Sorry for the length of that, but I really felt that the whole story was
needed to appreciate the situation. In short, I think that there is a need
for a new variant of mount(2). Yep, new syscall number. See below for the
reasons. Here it comes:

			Mount Flags, or
		   A Story of Interface Rot.

	Once upon a time life was simple, interfaces pleasant and a look at
mount(2) didn't raise the suspicion that Frankenstein's monster got what
he wanted. Back then mount(2) had 3 arguments - directory, device and rw
flag (unused, by the way). Alas, it didn't last. In March '92 mount(2) got
a new argument - fs type. So far, so good, but the story didn't end on that -
somewhere in July '92 msdosfs went in and brought mount(8) options that were
obviously fs-specific. And that brought a new argument - void *data.
sys_newmount(9), you are saying? You wish... That's what had actually happened:

 * Flags is a 16-bit value that allows up to 16 non-fs dependent flags to
 * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
 * data is a (void *) that can point to any structure up to 4095 bytes, which
 * can contain arbitrary fs-dependent information (or be NULL).
 * NOTE! As old versions of mount() didn't use this setup, the flags has to have
 * a special 16-bit magic number in the hight word: 0xC0ED. If this magic word
 * isn't present, the flags and data info isn't used, as the syscall assumes we
 * are talking to an older version that didn't understand them.


 * do_mount() does the actual mounting after sys_mount has done the ugly
 * parameter parsing. When enough time has gone by, and everything uses the
 * new mount() parameters, sys_mount() can then be cleaned up.
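
The magic-number check described in the first comment can be sketched roughly
as follows. This is a minimal illustration, not the kernel's actual code: the
MS_MGC_VAL and MS_MGC_MSK values follow the 0xC0ED convention quoted above,
while struct parsed_mount_args and parse_mount_flags() are hypothetical names
made up for this example.

```c
#include <stdbool.h>

#define MS_MGC_VAL 0xC0ED0000UL  /* magic expected in the high 16 bits */
#define MS_MGC_MSK 0xFFFF0000UL  /* mask selecting the high 16 bits */

/* Hypothetical holder for the result of the "ugly parameter parsing". */
struct parsed_mount_args {
    bool new_style;       /* true if the caller supplied the magic */
    unsigned long flags;  /* the 16 fs-independent flag bits, or 0 */
};

/* Hypothetical helper: accept flags (and data) only when the magic is present. */
struct parsed_mount_args parse_mount_flags(unsigned long raw)
{
    struct parsed_mount_args out = { false, 0 };

    if ((raw & MS_MGC_MSK) == MS_MGC_VAL) {
        out.new_style = true;
        out.flags = raw & 0xFFFFUL;  /* strip the magic, keep the low word */
    }
    /* Otherwise: assume an old-style caller and ignore flags and data. */
    return out;
}
```

With this sketch, a caller passing 0xC0ED0001 gets flag bit #0 honoured, while
a caller passing plain 1 is treated as old-style and gets no flags at all.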

Needless to say, this interface is still with us. Never mind that a current
kernel will simply refuse to exec() a binary from '92, the kludge is still
there. First bunch of flags was nice and sweet: ro, nodev, nosuid, noexec
and sync. Bits 0--4, indeed. But in January '93 we've got remount and it had
been implemented as a new flag. It wasn't a flag, indeed, but hey, why not
encode the action into the same argument and avoid API changes? So there it
went and got the bit #5. And so it stayed for a while. Flags (real flags,
that is, not remount one) were mirrored into ->i_flags of every inode and
everyone was happy.

In August '94 ->i_flags got two new bits - S_APPEND and S_IMMUTABLE. They
could not be passed by mount(2), indeed. They got bits #8 and #9, apparently
to make them visibly separate from the rest. Well, putting them at #16 and
above might be wiser, but hey, who will ever need more than 8 (OK, 7) mount
flags?

In the late '95 we got an implementation of quota. And ->i_flags got a
new bit - S_QUOTA, #7. Originally it got an inventive name S_WRITE, but
that insanity had been fixed in '98.

Fast-forward to October '96. POSIX mandatory braindam^Wlocking goes in.
Since nobody wants the overhead hitting all filesystems we are getting
a new mount flag. This time - real. OK, #6 is still free, so there it goes.

November '96, and we have one more flag - noatime. Oops, looks like
we had made a bad choice when append-only and immutable went in. Oh, well,
who actually cares? #10 it is.

Originally remount could change only read-only bit. Well, mandatory locking
and noatime also became changeable, so in September '97 somebody asked
himself why the rest didn't? At that point MS_RMT_MASK (flags that can be
changed by remount) started to look somewhat ugly. It got worse three months
later, when nodiratime went in (bit #11).

In October '98 RMK noticed that remount doesn't update ->i_flags, so macros
got uglier - now we were checking both for ->i_flags and ->s_flags.
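
The kind of double check mentioned above can be sketched like this. It is only
an illustration under assumptions: struct sb_sketch, struct inode_sketch and
flag_is_set() are hypothetical stand-ins, not the kernel's real structures or
macros, and MS_NOATIME is given the value matching bit #10 from the story.

```c
#include <stdbool.h>
#include <stddef.h>

#define MS_NOATIME 1024UL  /* bit #10, per the story above */

/* Hypothetical, heavily simplified stand-ins for the superblock and inode. */
struct sb_sketch    { unsigned long s_flags; };
struct inode_sketch { unsigned long i_flags; struct sb_sketch *i_sb; };

/*
 * A flag counts as set if either the superblock or the inode carries it.
 * The superblock check covers inodes whose mirrored i_flags were not
 * refreshed when a remount turned the flag on.
 */
bool flag_is_set(const struct inode_sketch *inode, unsigned long flag)
{
    if (inode->i_sb != NULL && (inode->i_sb->s_flags & flag))
        return true;
    return (inode->i_flags & flag) != 0;
}
```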

In April '99 unrelated events (rename() cleanup) had led to Yet Another
Mount Flags Ugliness(tm). This time the guilty party is known - it's me (AV).
I needed a way to tell rename() that some filesystems need special treatment
(silly-rename ones). Instead of putting that into ->s_type->fs_flags (after
all, that's a property of filesystem type) I've added a new bit to ->s_flags
(#15 - at that point we were visibly low on space; why not #16? Hell knows,
I plead temporary braindamage inflicted by contact with NFS).

A year later one more bit got there, this time in ->i_flags - S_DEAD. That
time I had finally seen the light (OK, actually I had seen the dire lack
of space, but let's pretend that I was clever) and it went into #16.

About the same time we've got Plan9-ish bindings. I made some noises about
a new syscall, but they were not too convincing. For several reasons: first
of all, the name (bind) had been already taken and bind9(2) was a half-hearted
proposal at best. Moreover, I wanted to debug it fast and didn't want to
change mount(8) source. So the quick kludge^Whack went in - -t bind. In other
words, passing the thing through the "type" argument.

The same batch of changes introduced unlimited stacking. Which looked fine
at first, but brought a lot of complaints, arguments and finally such an
example of misuse that drove the point through. It was an obvious exploit,
letting any user who can mount something (floppy, CD, whatever) drive
the system into OOM. Worse yet, cleaning up after that was damn hard, and
I don't mean washing the LART. That was it - we need more flags, since
the ability to overmount must be root-only. And checks should be in mount(8),
since it's suid-root and from the kernel POV all calls of mount(2) are
done by root. On the other hand, the actual test for presence of another
filesystem at the mountpoint must be left to mount(2) to avoid races.

OK, but we also want to be able to support union-mounts at some future
point. That means two more flags (head/tail of the union). We also want
to get rid of the -t bind kludge, so that's one more bit going our way.
However, currently we have only 3 unused bits - #12, #13 and #14. We can
get more if we relocate S_QUOTA, S_APPEND, S_IMMUTABLE and MS_ODD_RENAME,
though... OK, assume that we've done that, what do we have?
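
The bit budget can be checked with a small sketch. The bit positions are the
ones given in the story; the assignment of names to bits #0 to #4 is an
assumption based on kernel headers of that era, and the two helper functions
are hypothetical, written only for this illustration.

```c
/* Bits #0 to #4: the original real flags (order within 0..4 is assumed). */
#define MS_RDONLY       (1UL << 0)
#define MS_NOSUID       (1UL << 1)
#define MS_NODEV        (1UL << 2)
#define MS_NOEXEC       (1UL << 3)
#define MS_SYNCHRONOUS  (1UL << 4)
#define MS_REMOUNT      (1UL << 5)   /* an action, not a flag */
#define MS_MANDLOCK     (1UL << 6)
#define S_QUOTA         (1UL << 7)   /* inode-only */
#define S_APPEND        (1UL << 8)   /* inode-only */
#define S_IMMUTABLE     (1UL << 9)   /* inode-only */
#define MS_NOATIME      (1UL << 10)
#define MS_NODIRATIME   (1UL << 11)
#define MS_ODD_RENAME   (1UL << 15)  /* really a property of the fs type */

#define LOW16           0xFFFFUL

#define USED_NOW  (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | \
                   MS_SYNCHRONOUS | MS_REMOUNT | MS_MANDLOCK | S_QUOTA | \
                   S_APPEND | S_IMMUTABLE | MS_NOATIME | MS_NODIRATIME | \
                   MS_ODD_RENAME)

#define RELOCATABLE (S_QUOTA | S_APPEND | S_IMMUTABLE | MS_ODD_RENAME)

/* Bits of the low 16 that nothing uses today: #12, #13 and #14. */
unsigned long free_low16_now(void)
{
    return ~USED_NOW & LOW16;
}

/* Free bits after moving the four relocatable ones out of the low word. */
unsigned long free_low16_after_relocation(void)
{
    return ~(USED_NOW & ~RELOCATABLE) & LOW16;
}
```

Under these assumptions free_low16_now() yields 0x7000 (three bits) and
free_low16_after_relocation() yields 0xF380 (seven bits), which is the room
the following paragraph has to work with.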

#0 to #4, #6, #10 and #11 are used for real flags. Fine. #5 and some of
the rest are used for "action" flags. So we can fit into 16 bits, but
it's getting really, really crowded here. We can get a bit more if we notice
that MS_RENAME, MS_AFTER, MS_BEFORE and MS_OVER are mutually exclusive,
but that gets really ugly - we could fit into 3 bits instead of 4, but
they would be spread over not-contiguous area. And we can't do anything
about that without breaking every existing binary of mount(8). Moreover,
we can't do anything about the 0xc0ed kludge - all kernels since '92 are
going to send us to hell if we change that. Yes, Virginia, removal of that
check had been overdue for some 7 years, but there is no helping to that.

_Or_ we can do what needed to be done back in '92 and '94 and introduce
sys_newmount(action, mountpoint, type, flags, device, data). Why "action"
separate from "flags"? Well, see the story above. Mixing the bitmap and
number into one integer _never_ pays. And inside the kernel we will have
to start with separating them anyway. I could buy an argument about the
register pressure, but damnit, it's mount(2) we are talking about. If it's
a hotspot of your program I want to know what the hell are you trying to do.
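
To make the "action separate from flags" point concrete, here is a minimal
sketch of how the legacy arguments could be split. Everything in it is
hypothetical: the proposal above only gives the argument list, so the
enum mount_action_sketch values, struct mount_request_sketch and
split_legacy() are invented for illustration. MS_REMOUNT is bit #5, as in
the story.

```c
#include <string.h>

#define MS_REMOUNT 32UL  /* bit #5: an action hiding among the flags */

/* Hypothetical action numbers; the proposal does not enumerate them. */
enum mount_action_sketch { ACT_MOUNT, ACT_REMOUNT, ACT_BIND };

/* Hypothetical request: a plain number for the action, a bitmap for flags. */
struct mount_request_sketch {
    enum mount_action_sketch action;
    unsigned long flags;
};

/* Split legacy (type, flags) arguments into a separate action and bitmap. */
struct mount_request_sketch split_legacy(const char *type,
                                         unsigned long legacy_flags)
{
    struct mount_request_sketch req;

    /* Keep only the low 16 flag bits and drop the action bit from them. */
    req.flags = legacy_flags & 0xFFFFUL & ~MS_REMOUNT;

    if (legacy_flags & MS_REMOUNT)
        req.action = ACT_REMOUNT;
    else if (type != NULL && strcmp(type, "bind") == 0)
        req.action = ACT_BIND;   /* the "-t bind" kludge becomes an action */
    else
        req.action = ACT_MOUNT;
    return req;
}
```

The design point is that the action is a single number with mutually
exclusive values, so it never competes with real flags for bit positions.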

