From:	 Andrea Arcangeli <andrea@suse.de>
To:	 linux-kernel@vger.kernel.org
Subject: blkdev in pagecache
Date:	 Wed, 9 May 2001 04:34:56 +0200
Cc:	 "Stephen C. Tweedie" <sct@redhat.com>,
	 Linus Torvalds <torvalds@transmeta.com>,
	 Alexander Viro <viro@math.psu.edu>, Jens Axboe <axboe@suse.de>

Tonight I moved the blkdev layer into the pagecache with this patch:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/blkdev-pagecache-1

It is incremental and depends on the O_DIRECT functionality; the latest
O_DIRECT patch against 2.4.5pre1 is here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/o_direct-5

The main reason I moved the blkdev into the pagecache is that the
current blkdev layer provides horrible performance with fast I/O
subsystems capable of over 50 Mbyte/sec; I just doubled that with a
simple hack that you can see here if you're curious:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa2/00_4k_block_dev-1

(btw, the current rawio also uses a 512-byte bh->b_size granularity,
which is even worse than the 1024-byte b_size of the blkdev; O_DIRECT
is much smarter on this side as it uses the softblocksize of the fs,
which can be as large as 4k if you created the fs with -b 4096)

However, after running this 4k_block_dev-1 hack on some more machines I
noticed the blkdev layer was no longer able to update the superblock of
1k ext2 filesystems, and to make it "usable" in real life I needed to
fix that. But I didn't want to invest any further time in such a hack,
so I preferred to move the blkdev into the pagecache and fix the
problem on top of the new, better design (moving the blkdev into the
pagecache of course introduces the same problem too, as mentioned in
one of the points below).

I'll describe here some of the details of the blkdev-pagecache-1 patch:

- /dev/raw* and drivers/char/raw.c get obsoleted and replaced by
  opening the block device with O_DIRECT; it looks much saner and I
  basically got it for free by implementing the ~10 lines of the
  blkdev_direct_IO callback. Of course I didn't remove the /dev/raw*
  API, for compatibility. (A userspace sketch follows this point.)

  While testing O_DIRECT I destroyed the first 50 Mbyte of the root
  partition, so I will need to wait for the test box to come back
  alive before I can do further testing ;). But I fixed the bug that
  caused the corruption just before uploading the patch, so I don't
  expect further problems (it was only a s/i_dev/i_rdev/ thing); the
  regression testing was working well even while it was writing to
  the wrong disk ;).
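
  A minimal userspace sketch of the replacement (the device name and
  the single 4k read are just examples; with glibc you need
  _GNU_SOURCE to get O_DIRECT):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf;
		int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* blkdev O_DIRECT wants 4096-byte alignment of the
		 * offset, the length and the buffer address (see the
		 * granularity point below). */
		if (posix_memalign((void **) &buf, 4096, 4096)) {
			close(fd);
			return 1;
		}
		if (read(fd, buf, 4096) != 4096)
			perror("read");
		free(buf);
		close(fd);
		return 0;
	}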

- I force the virtual blocksize for all the blkdev I/O
  (buffered and direct) to work with a 4096-byte granularity instead
  of the current 1024-byte softblocksize, because we need that for
  higher performance; 1024 is too low, it wastes too much RAM and too
  much CPU. So a DBMS replacing rawio with blkdev O_DIRECT won't be
  able anymore to write 512 bytes to the disk and be sure it is a
  single atomic block update. If you use /dev/raw nothing changed of
  course; only opening the blkdev with O_DIRECT enforces a minimal
  granularity of 4096 bytes in the I/O. I don't think this is a
  problem, and O_DIRECT through the fs was anyway using the fs
  softblocksize instead of the hardblocksize as the unit of the
  minimal direct-IO granularity. (A sketch of the check follows.)
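
  The check this implies looks roughly like this (illustrative only,
  the real one sits in the direct-IO path of the patch):

	/* Illustrative only: refuse blkdev direct I/O that is not a
	 * multiple of the 4096-byte virtual blocksize. */
	static int blkdev_dio_aligned(loff_t offset, size_t count)
	{
		if ((offset & 4095) || (count & 4095))
			return -EINVAL;	/* sub-block granularity */
		return 0;
	}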

- writes to the block device won't end up in the buffer cache, so it
  will be impossible to update the superblock of an ext2 partition
  mounted ro, for example; it must not be mounted at all to update
  the superblock. I will need to invent a hack to fix this problem or
  it will get too annoying. One way could simply be to change ext2 to
  check that the buffer is uptodate before marking it dirty again,
  but maybe we could also do it in a generic manner that fixes all
  the filesystems at once (OTOH probably not that many fs need to be
  fscked online...). The ext2-side idea is sketched below.
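
  A minimal sketch of the ext2-side idea, using the 2.4 buffer-cache
  interfaces (the helper name is made up, this is not in the patch):

	/* Hypothetical helper (not in the patch): before redirtying
	 * the superblock buffer, re-read it if a pagecache write to
	 * the blkdev left the cached copy not uptodate. */
	static void ext2_redirty_super(struct super_block *sb)
	{
		struct buffer_head *bh = sb->u.ext2_sb.s_sbh;

		if (!buffer_uptodate(bh)) {
			ll_rw_block(READ, 1, &bh);
			wait_on_buffer(bh);
		}
		mark_buffer_dirty(bh);
	}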

- mmap should be functional but it's totally untested.

- currently the last `harddisk_size & 4095' bytes (if any) won't be
  accessible via the blkdev, to avoid sending the hardware requests
  beyond the end of the device. Not sure how/if to solve this, but it
  is definitely not a new issue: the same thing happens today in 2.2
  and 2.4 after you mount a 4k filesystem on a block device. OTOH I'm
  scared a mke2fs -b 1024 could get confused. But I really don't want
  to decrease the b_size of the buffer heads even if we fix this.
  (The arithmetic is worked out below.)
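
  To make the arithmetic concrete (the 10001 KB device size is made
  up):

	#include <stdio.h>

	int main(void)
	{
		/* A 10001 KB device: with a 4096-byte virtual
		 * blocksize only whole blocks are reachable. */
		unsigned long bytes = 10001UL << 10;	   /* 10241024 */
		unsigned long reachable = bytes & ~4095UL; /* 10240000 */
		unsigned long lost = bytes & 4095;	   /* 1024     */

		printf("%lu reachable, %lu lost at EOF\n",
		       reachable, lost);
		return 0;
	}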

- to share all the filemap.c code and not change too much stuff in
  the first patch, I added some ISBLK checks in fast paths, basically
  only to check against blk_size instead of inode->i_size. I also
  considered changing the i_size semantics for the blkdev inodes, but
  I didn't want to break all the filesystems yet, so I took the
  localized, slower way for now (I doubt it is noticeable in the
  benchmarks, but nevertheless it would be nice to optimize away
  those branches). The shape of the check is sketched below.
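
  The shape of those branches, roughly (illustrative, 2.4 interfaces;
  blk_size[] is in kilobytes):

	/* Illustrative only: blkdev inodes take their size from
	 * blk_size[] (KB, indexed by major then minor) instead of
	 * inode->i_size. */
	static inline loff_t blkdev_or_inode_size(struct inode *inode)
	{
		if (S_ISBLK(inode->i_mode)) {
			int major = MAJOR(inode->i_rdev);
			int minor = MINOR(inode->i_rdev);

			if (blk_size[major])
				return (loff_t) blk_size[major][minor] << 10;
			return 0;
		}
		return inode->i_size;
	}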

- once the blkdev is closed, in the block_close callback I run
  filemap_fdatasync; fsync_dev; filemap_fdatawait;
  invalidate_inode_pages2 (the fdatawait seems not necessary, but it
  won't hurt). I'm not calling truncate_inode_pages because those
  pages could still be mapped (->release is called when f_count goes
  down to zero, not when i_count reaches zero). I'd like to defer the
  invalidate_inode_pages2 to the time an fs gets mounted or when
  check_media_change triggers, like in 2.2, but this is another
  issue. The ordering is sketched below.
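
  In code the ordering looks roughly like this (a sketch only, not
  the patch verbatim; locking and error handling omitted):

	/* Sketch of the release-time flush ordering described
	 * above; not the patch verbatim. */
	static int block_close(struct inode *inode, struct file *filp)
	{
		filemap_fdatasync(inode->i_mapping);  /* start writeback    */
		fsync_dev(inode->i_rdev);	      /* flush buffer cache */
		filemap_fdatawait(inode->i_mapping);  /* wait on the pages  */
		invalidate_inode_pages2(inode->i_mapping); /* drop pages    */
		return 0;
	}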

Besides the s/i_dev/i_rdev/ thing, which is now fixed, it looked
stable, but better to back up before playing with it ;).

Andrea