[LWN.net]
From:	 Andrew Morton <andrewm@uow.edu.au>
To:	 ext2-devel@lists.sourceforge.net,
	 "Peter J. Braam" <braam@mountainviewdata.com>,
	 Andreas Dilger <adilger@turbolinux.com>,
	 "Stephen C. Tweedie" <sct@redhat.com>
Subject: ext3 for 2.4
Date:	 Thu, 17 May 2001 21:20:38 +1000
Cc:	 linux-fsdevel@vger.kernel.org

Summary: ext3 works, page_launder() doesn't :)



The tree is based on the porting work which Peter Braam did.  It's
in CVS in Jeff Garzik's home on sourceforge.  Info on CVS is at
http://sourceforge.net/cvs/?group_id=3242 - the module name
is `ext3'.  There's a README there which describes how to
apply the patchset.

Current status is: quite solid.  Stress testing on x86/SMP
passes and performance in ordered data and writeback data
mode is good.  Journalled data performance is, of course, so-so.
The only big issue of which I am aware is a VM livelock
on SMP, discussed below.

The patch is against 2.4.4-ac9.

Today's changes:

- quotas appear to work OK.  I'll leave them turned on
  as I test things, and watch out for oddities.

  It's hard to find working quota tools.  Most of them
  either don't compile or don't understand
  ext3.  Jan Kara is maintaining a set of quota tools
  at http://www.sourceforge.net/projects/linuxquota/ which
  work well.  The current CVS tree from there seems to be
  under XFS development at present and needs a couple of
  patches to work against ext3 (and even ext2).  I can send them
  to whoever needs them.

- Recovery works fine now.  The bug was that I was splicing new
  blocks into a file in ext3_splice_branch() *before* doing a
  journal_get_write_access() on its parent's buffer.  Duh.

- Four debugging fields (b_alloc_transaction, etc.) have been
  removed from buffer_head.  These were debug fields for which
  I couldn't find a use in 2.4.  In 2.2, these were set
  in ext3_new_block() when we do a getblk() on the new block.
  In 2.4, we don't do the getblk() any more...

- Some tightening of the way commit feeds buffers into the 
  request queues.  At present, 256 buffers are fed into
  ll_rw_block() before we run tq_disk.  I *was* pushing
  thousands down at once; the smaller batches don't seem to hurt.
  Overall throughput with some benchmarks in ordered data
  mode has been significantly improved by this change.
  ext3 in general seems faster in 2.4 than in 2.2, presumably
  because of better request merging.

  Much more work needs to go into benchmarking and performance
  tuning.

- There's an issue with page_launder():

	ext3_file_write()
	-> generic_file_write()
	   -> __alloc_pages()
	      -> page_launder()
		 -> ext3_writepage()

  This is bad.  It will cause ext3 to be reentered while it
  has a transaction open against a different fs.  This will
  corrupt filesystems and can deadlock.

  Making ext3_file_write() set PF_MEMALLOC wasn't suitable.  It
  easily causes 0-order allocation failures within generic_file_write().

  The current approach to this is, in ext3_writepage(), to detect
  when ext3 is being reentered and to simply *return* without
  writing the page at all.

  This is kludgy but should work - the only place where the fs can
  be reentered via writepage() is from page_launder(), and
  page_launder() doesn't wait on the page. Quotas don't use
  writepage(), and reentry there is OK.

  If Marcelo's `priority' argument to writepage() goes in,
  this can be used in a more sensible manner.

  Note that this return-if-reentered code is not related to
  the VM livelock.  It has a big printk in it at present...

- Some new test tools:

  To simulate crashes I have added a new mount option:

	mount /dev/foo /mnt/bar -t ext3 -o ro-after=NNN

  When the fs is mounted this way a timer will fire after
  NNN jiffies and will turn the underlying device immutable.
  It does this by setting a flag which is tested in submit_bh().
  For WRITE requests submit_bh() will simply call
  bh_end_io(uptodate=1) and return.

  There's a new ext3 ioctl() which will block the caller until the
  device has gone readonly.  I semi-randomly chose

	#define EXT3_WAIT_FOR_READONLY         _IOR('w', 1, long)


  The intent here is that a controlling script will:

	1: Mount the fs with ro-after=1000	(ten seconds)
	2: Start a test script (eg: dbench)
	3: Block on the wait-for-readonly ioctl
	4: Wake up when the disk has "crashed"
	5: Kill off the test script
	6: Unmount the fs
	7: Mount the fs (let recovery run)
	8: Unmount the fs
	9: Run e2fsck to check that the fs is sane
	10: Modify the ro-after parameter
	11: Do it all again

  Scripts which do all this are in the testing/ and tools/
  directories.  I've been happily simulating crashes in
  the middle of `dbench 12' runs for an hour now.  All is well.

  I think this covers everything except for verifying that the
  data content of the files is sane.  That can be handled with
  test tools.  Special code will probably be needed to simulate
  crashes during truncate - with this shotgun approach the fs
  tends to go immutable before *any* of the truncate has committed,
  and it's as if nothing ever happened.

  The `ro-after' code and submit_bh() changes are conditional
  on CONFIG_JBD_DEBUG.



So.  The big outstanding issue is the VM livelock.  It only
happens on SMP.  If you have a huge amount of dirty data
in the system (a big `dd if=/dev/zero of=foo' will do it),
everything goes happily for 10-20 seconds and then it freezes.
The write drop-behind code should be cutting in here.

Running dbench, same problem.  We have seven runnable tasks, *all*
of them chugging away in page_launder(), none of them making any
appreciable progress.  Disk throughput falls to about one LED-flicker
per three seconds.

After 10-30 seconds, things magically clear themselves and 
it starts back up.  And then after ten seconds it chokes again.

It could be something silly I've done.  It doesn't seem to affect
other filesystems.  Tomorrow will tell.