ext3-2.4-0.9.5

From:	 Andrew Morton <akpm@zip.com.au>
To:	 lkml <linux-kernel@vger.kernel.org>,
	 "ext3-users@redhat.com" <ext3-users@redhat.com>
Subject: ext3-2.4-0.9.5
Date:	 Tue, 31 Jul 2001 01:18:49 +1000


The latest ext3 patches against linux-2.4.7 and linux-2.4.7-ac3 are at

	http://www.uow.edu.au/~andrewm/linux/ext3/

Changes since 0.9.4 include:

- Fixed a bug which could trip an assertion failure when using small
  journals under heavy load in full data journalling mode.

- A patch from Ted plus the latest version of e2fsprogs plus the stomping
  of various ext3 bugs gives us preliminary support for external journals.

- Redesigned the handling of synchronous operations.  Much simplified and
  several bugs fixed.

- Drastically improved throughput with synchronous mounts - they're now as
  efficient as `chattr +S'.

- Fixed an O(n^2) bottleneck in the commit code.

- Implemented transaction handle batching for a big throughput increase
  with synchronous operations.


The external journal code seems to work OK - brief usage details are at the
web site.  The intent here is that the external journal be an NVRAM device
(or a disk) which can be used to accelerate full-data journalling. 
Simulation using a normal RAM drive indicates that we can double throughput
with some loads (dbench) but not others (synctest).  More work is needed to
fully characterise this.


For the synchronous operations I've put together an application which
attempts to simulate an MTA's behaviour.  The simulator is called
`synctest' and it is in ext3 CVS.  There's a copy at
http://www.uow.edu.au/~andrewm/synctest.c - I'd really appreciate it if the MTA
guys could poke some useful holes in the modelling.

The simulator launches a (large) number of subprocesses.  Each subprocess
does the following:

  for 100 different filenames
    create a file
    write some data to the file (5k to 250k, exponential distribution)
    optionally fsync() the file
    close the file
    optionally fsync() the file's parent dir
    rename the file
    optionally fsync() the file's parent dir
    rename the file
    optionally fsync() the file's parent dir
    rename the file
    optionally fsync() the file's parent dir
    unlink the file from 30 passes ago.

(I'm told that postfix does a lot of renaming).
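
A minimal sketch of one pass of the loop above, in C.  This is not the code
from synctest.c: the names `sync_dir' and `one_pass', the `.0' to `.3'
suffix scheme used for the renames, and the fixed write size are all
assumptions made for illustration.  The unlink of the file from 30 passes
ago is left to the caller.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fsync() a directory through a read-only descriptor.  0 on success. */
static int sync_dir(const char *dir)
{
	int fd = open(dir, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	close(fd);
	return ret;
}

/*
 * One pass for a single file: create, write, optional fsync, close,
 * optional directory fsync, then three renames, each followed by an
 * optional directory fsync.  The file ends up as <dir>/<name>.3.
 * Returns 0 on success, -1 on the first failure.
 */
static int one_pass(const char *dir, const char *name, size_t len,
		    int do_fsync, int do_dirsync)
{
	char from[512], to[512];
	char *buf;
	int fd, i;

	snprintf(from, sizeof(from), "%s/%s.0", dir, name);
	fd = open(from, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	buf = malloc(len ? len : 1);
	if (!buf) {
		close(fd);
		return -1;
	}
	memset(buf, 'x', len);
	if (write(fd, buf, len) != (ssize_t)len ||
	    (do_fsync && fsync(fd) < 0)) {
		free(buf);
		close(fd);
		return -1;
	}
	free(buf);
	if (close(fd) < 0)
		return -1;
	if (do_dirsync && sync_dir(dir) < 0)
		return -1;

	/* name.0 -> name.1 -> name.2 -> name.3 */
	for (i = 1; i <= 3; i++) {
		snprintf(from, sizeof(from), "%s/%s.%d", dir, name, i - 1);
		snprintf(to, sizeof(to), "%s/%s.%d", dir, name, i);
		if (rename(from, to) < 0)
			return -1;
		if (do_dirsync && sync_dir(dir) < 0)
			return -1;
	}
	return 0;
}
```

On a `chattr +S' tree both flags would be left at zero; passing 1 for both
corresponds to the fsync()-on-data and fsync()-on-dir case.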


Now, it makes a very great deal of difference how these files are organised
in directories.  If you have 100 processes each doing synchronous operations
in separate directories then the new transaction batching in ext3 gives it
enormous scalability, whereas ext2 basically stops.

If you have 100 processes each doing synchronous operations in a single big
directory then ext2 does OK, and ext3 is only slightly quicker than ext2. 
This is because the VFS serialises operations on particular directories via
parent->i_sem and defeats ext3 transaction batching.

Most testing was performed on a `chattr +S' directory tree because that seems
to be a convenient way to operate popular MTAs.

ext3 relied upon the `chattr' setting to provide synchronous semantics for
all directory and write operations.  For ext2, the synctest `-f' option was
used to fsync the data at the end of the write.

The following tests were executed on a modern IDE disk with disk write
caching enabled.  Internal journal.  100 processes were used in every test. 
The number of `synctest' processes per directory was varied between tests.

The final column shows ext2 completion time without `chattr +S', but using
fsync() to sync the parent directory and the data.

processes/dir   number of      ext2 completion   ext3 completion  ext2, no chattr +S
                directories    time (min:sec)    time (min:sec)   (min:sec)

   50               2              7:24              5:10          3:24
   20               5              9:21              3:31
   10              10             11:09              3:05          6:01
    5              20             14:37              3:02
    1             100             23:10              2:44          9:44
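
The ratios between the columns can be worked out with a few lines of C.
This is only a sketch: it assumes the times in the table read as
minutes:seconds, it covers only the rows which have all three times filled
in, and the names `SECS', `ratio' and `print_ratios' are made up here.

```c
#include <stdio.h>

/* Convert a completion time given as minutes and seconds into seconds. */
#define SECS(m, s)	((m) * 60 + (s))

/* How many times longer the slow run took than the fast run. */
static double ratio(int slow_secs, int fast_secs)
{
	return (double)slow_secs / (double)fast_secs;
}

/* Print the ratios for the three fully populated rows of the table. */
static void print_ratios(void)
{
	/* { processes/dir, ext2, ext3, ext2 without chattr +S }, in seconds */
	static const int rows[3][4] = {
		{ 50, SECS(7, 24),  SECS(5, 10), SECS(3, 24) },
		{ 10, SECS(11, 9),  SECS(3, 5),  SECS(6, 1)  },
		{  1, SECS(23, 10), SECS(2, 44), SECS(9, 44) },
	};
	int i;

	for (i = 0; i < 3; i++)
		printf("%2d procs/dir: ext2/ext3 %.2f  "
		       "ext2-fsync/ext3 %.2f  chattr cost on ext2 %.2f\n",
		       rows[i][0],
		       ratio(rows[i][1], rows[i][2]),
		       ratio(rows[i][3], rows[i][2]),
		       ratio(rows[i][1], rows[i][3]));
}
```

Calling print_ratios() from a main() is all that is needed to run it.  For
the last row the ext2 and ext3 times are 1390 and 164 seconds, a ratio of
about 8.5.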


Apparently postfix will typically use 256 directories for hashing its
mailspool files.  The reason for this is, presumably, to avoid having single
directories with hundreds of thousands of files in them.  Postfix will spawn
hundreds of processes to work on those directories.  So the last row of this
table is the interesting one.
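
To illustrate what hashing spool files across 256 directories means, here
is a small C sketch.  It is not postfix's actual scheme: the hash function,
the two-hex-digit directory names and the names `spool_bucket' and
`spool_path' are all assumptions made for the example.

```c
#include <stdio.h>

#define NR_SPOOL_DIRS	256	/* the directory count quoted above */

/* Hypothetical string hash: picks one of NR_SPOOL_DIRS buckets. */
static unsigned int spool_bucket(const char *name)
{
	unsigned int h = 0;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h % NR_SPOOL_DIRS;
}

/*
 * Build "<spool>/<xx>/<name>", where <xx> is the bucket in hex.
 * Returns what snprintf() returns: the length the full path needs.
 */
static int spool_path(char *buf, size_t len, const char *spool,
		      const char *name)
{
	return snprintf(buf, len, "%s/%02x/%s", spool,
			spool_bucket(name), name);
}
```

With a few hundred processes spread over 256 such directories, each
directory sees roughly one process at a time, which is the situation in
the last row of the table.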

ext2 bogs down because it has so much metadata to write - it is spread all
over the disk and cannot benefit from write clustering.

ext3 stopped scaling at 20 processes per directory because the limiting
factor was checkpointing all the data and metadata into the main filesystem,
which is dominated by seeking.  The time taken to write the data to the
journal is negligible when compared with this.  In fact, the same testing
was performed with an external journal on RAM disk and the throughput was
basically unaltered.  More main memory will really help improve things here.

A 400 megabyte journal was used.  What happens is that ext3 happily writes
all outgoing data into the journal in linear 100 megabyte chunks until you
run out of either a) journal space or b) memory.  Then the whole world stops
for 15-20 seconds while hundreds of megabytes of stuff is written all over
the main filesystem.  This is optimal, but perhaps not desirable.  Using a
smaller journal size will tame this behaviour nicely.  Or use ordered-data
mode which runs smoothly, performs well and has full synchronous behaviour
and recoverability.


Conclusions.  Assuming that `synctest' is somewhat like a real MTA, and that
the MTA is using two-level hashing we can say that:


- chattr +S on ext2 costs you 2:1 or 3:1 throughput when compared with
  fsync()-on-data and fsync()-on-dir.

- full-journalling ext3 can offer a 3x to 10x improvement over ext2,
  depending upon how ext2 is used and the directory layout/task count.

- ext2 likes to have few directories, many processes per directory.

- ext3 likes many directories, few processes per directory.

- We can write data to the journal much faster than we can checkpoint that
  data into the main filesystem, so the benefit of an external journal device
  (spinning or NVRAM) has not been demonstrated.

- The holding of i_sem over the parent is a severe scalability limitation
  with synchronous metadata operations.  Better to have:

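	void *opaque;	/* filesystem-private handle for the started operation */

	down(&parent->i_sem);
	file->f_op->op(&opaque, args...);	/* start the operation under the lock */
	up(&parent->i_sem);
	if (IS_SYNC(inode))
		inode->i_op->wait_on_stuff(opaque);	/* wait with the lock dropped */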
