Linux: Rewriting the Buffer Layer

Submitted by Jeremy
on June 24, 2007 - 12:54pm

Posting a series of three patches, Nick Piggin [interview] announced that he was working on a rewrite of the buffer layer, which he calls fsblock: "the name is fsblock because it basically ties the fs layer to the block layer." As to just what the buffer layer is, Nick explained, "the buffer layer is a layer between the pagecache and the block device for block based filesystems. It keeps a translation between logical offset and physical block number, as well as meta information such as locks, dirtyness, and IO status of each block. This information is tracked via the buffer_head structure." Before listing the improvements introduced by his rewrite, Nick offered a justification for the effort:

"Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems."


From: Nick Piggin [email blocked]
To:	Linux Kernel Mailing List [email blocked]
Subject: [RFC] fsblock
Date:	Sun, 24 Jun 2007 03:45:28 +0200


I'm announcing "fsblock" now because it is quite intrusive and so I'd
like to get some thoughts about significantly changing this core part
of the kernel.

fsblock is a rewrite of the "buffer layer" (ding dong the witch is
dead), which I have been working on, on and off and is now at the stage
where some of the basics are working-ish. This email is going to be
long...

Firstly, what is the buffer layer?  The buffer layer isn't really a
buffer layer as in the buffer cache of unix: the block device cache
is unified with the pagecache (in terms of the pagecache, a blkdev
file is just like any other, but with a 1:1 mapping between offset
and block).

There are filesystem APIs to access the block device, but these go
through the block device pagecache as well. These don't exactly
define the buffer layer either.

The buffer layer is a layer between the pagecache and the block
device for block based filesystems. It keeps a translation between
logical offset and physical block number, as well as meta
information such as locks, dirtyness, and IO status of each block.
This information is tracked via the buffer_head structure.

Why rewrite the buffer layer?  Lots of people have had a desire to
completely rip out the buffer layer, but we can't do that[*] because
it does actually serve a useful purpose. Why the bad rap? Because
the code is old and crufty, and buffer_head is an awful name. It must 
be among the oldest code in the core fs/vm, and the main reason is
because of the inertia of so many and such complex filesystems.

[*] About the furthest we could go is use the struct page for the
information otherwise stored in the buffer_head, but this would be
tricky and suboptimal for filesystems with non page sized blocks and
would probably bloat the struct page as well.

So why rewrite rather than incremental improvements? Incremental
improvements are logically the correct way to do this, and we probably
could go from buffer.c to fsblock.c in steps. But I didn't do this
because: a) the blinding pace at which things move in this area would
make me an old man before it would be complete; b) I didn't actually
know exactly what it was going to look like before starting on it; c)
I wanted stable root filesystems and such when testing it; and d) I
found it reasonably easy to have both layers coexist (it uses an extra
page flag, but even that wouldn't be needed if the old buffer layer
was better decoupled from the page cache).

I started this as an exercise to see how the buffer layer could be
improved, and I think it is working out OK so far. The name is fsblock
because it basically ties the fs layer to the block layer. I think
Andrew has wanted to rename buffer_head to block before, but block is
too clashy, and it isn't a great deal more descriptive than buffer_head.
I believe fsblock is.

I'll go through a list of things where I have hopefully improved on the
buffer layer, off the top of my head. The big caveat here is that minix
is the only real filesystem I have converted so far, and complex
journalled filesystems might pose some problems that water down its
goodness (I don't know).

- data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on
  64-bit (could easily be 32 if we can have int bitops). Compare this
  to around 50 and 100ish for struct buffer_head. With a 4K page and 1K
  blocks, IO requires 10% RAM overhead in buffer heads alone. With
  fsblocks you're down to around 3%.

- Structure packing. A page gets a number of buffer heads that are
  allocated in a linked list. fsblocks are allocated contiguously, so
  cacheline footprint is smaller in the above situation.

- Data / metadata separation. I have a struct fsblock and a struct
  fsblock_meta, so we could put more stuff into the usually less used
  fsblock_meta without bloating it up too much. After a few tricks, these
  are no longer any different in my code, and dirty up the typing quite
  a lot (and I'm aware it still has some warnings, thanks). So if not
  useful this could be taken out.

- Locking. fsblocks completely use the pagecache for locking and lookups.
  The page lock is used, but there is no extra per-inode lock that buffer
  has. Would go very nicely with lockless pagecache. RCU is used for one
  non-blocking fsblock lookup (find_get_block), but I'd really rather hope
  filesystems can tolerate that blocking, and get rid of RCU completely.
  (actually this is not quite true because mapping->private_lock is still
  used for mark_buffer_dirty_inode equivalent, but that's a relatively
  rare operation).

- Coupling with pagecache metadata. Pagecache pages contain some metadata
  that is logically redundant because it is tracked in buffers as well
  (eg. a page is dirty if one or more buffers are dirty, or uptodate if
  all buffers are uptodate). This is great because it means we can avoid that
  layer in some situations, but they can get out of sync. eg. if a
  filesystem writes a buffer out by hand, its pagecache page will stay
  dirty, and the next "writeout" will notice it has no dirty buffers and
  call it clean. fsblock-based writeout or readin will update page
  metadata too, which is cleaner. It also uses page locking for IO ops
  instead of an extra layer of locking which seems nice.

- No deadlocks (hopefully). The buffer layer is technically deadlocky by
  design, because it can require memory allocations at page writeout-time.
  It also has one path that cannot tolerate memory allocation failures.
  No such problems for fsblock, which keeps fsblock metadata around for as
  long as a page is dirty (this still has problems vs get_user_pages, but
  that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed
  before a page is dirtied. In the buffer layer, mmap writes can dirty a
  page with no backing blocks which is a problem if the filesystem is
  ENOSPC (patches exist for buffer.c for this).

- Block memory accessors for filesystems. If the buffer layer was to ever
  be replaced completely, this means block device pagecache would not be
  restricted to lowmem. It also doesn't have theoretical CPU cache
  aliasing problems that buffer heads do.

- A real "nobh" mode. nobh was created I think mainly to avoid problems
  with buffer_head memory consumption, especially on lowmem machines. It
  is basically a hack (sorry), which requires special code in filesystems,
  and duplication of quite a bit of tricky buffer layer code (and bugs).
  It also doesn't work so well for buffers with non-trivial private data
  (like most journalling ones). fsblock implements this with basically a
  few lines of code, and it should work in situations like ext3.

- Similarly, it gets around the circular reference problem where a buffer
  holds a ref on a page and a page holds a ref on a buffer, but the page
  has been removed from pagecache. These occur with some journalled fses
  like ext3 ordered, and eventually fill up memory and have to be
  reclaimed via the LRU (which is often not a problem, but I have seen
  real workloads where the reclaim causes throughput to drop quite a lot).

- An inode's metadata must be tracked per-inode in order for fsync to
  work correctly. buffer contains helpers to do this for basic
  filesystems, but any block can be only the metadata for a single inode.
  This is not really correct for things like inode descriptor blocks.
  fsblock can track multiple inodes per block. (This is non trivial,
  and it may be overkill so it could be reverted to a simpler scheme
  like buffer).

- Large block support. I can mount and run an 8K block size minix3 fs on
  my 4K page system and it didn't require anything special in the fs. We
  can go up to about 32MB blocks now, and gigabyte+ blocks would only
  require  one more bit in the fsblock flags. fsblock_superpage blocks
  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

  Core pagecache code is pretty creaky with respect to this. I think it is
  mostly race free, but it requires stupid unlocking and relocking hacks
  because the vm usually passes single locked pages to the fs layers, and we
  need to lock all pages of a block in offset ascending order. This could be
  avoided by doing locking on only the first page of a block for locking in
  the fsblock layer, but that's a bit scary too. Probably better would be to
  move towards offset,length rather than page based fs APIs where everything
  can be batched up nicely and this sort of non-trivial locking can be more
  optimal.

  Large blocks also have a performance black spot where an 8K sized and
  aligned write(2) would require an RMW in the filesystem. Again because of
  the page based nature of the fs API, and this too would be fixed if
  the APIs were better.

  Large block memory access via filesystem uses vmap, but it will go back
  to kmap if the access doesn't cross a page. Filesystems really should do
  this because vmap is slow as anything. I've implemented a vmap cache
  which basically wouldn't work on 32-bit systems (because of limited vmap
  space) for performance testing (and yes it sometimes tries to unmap in
  interrupt context, I know, I'm using loop). We could possibly do a self
  limiting cache, but I'd rather build some helpers to hide the raw multi
  page access for things like bitmap scanning and bit setting etc. and
  avoid too much vmaps.

- Code size. I'm sure I'm still missing some things, but at the moment we
  can do this in about the same amount of icache as buffer.c. If we turn
  off large block support, I think it is around 2/3 the size.
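
The overhead figures in the "data structure size" item can be reproduced with simple arithmetic. The following is only a sketch: the function name is made up for illustration, the struct sizes are the approximate ones quoted in that item, and the 32-byte fsblock is the hypothetical "int bitops" case rather than anything that exists in the patches.

```python
def metadata_overhead(page_size, block_size, struct_size):
    """Fraction of a page's size spent on per-block metadata structs.

    One struct (buffer_head or fsblock) is needed per block, so a page
    holding page_size // block_size blocks carries that many structs.
    """
    blocks_per_page = page_size // block_size
    return blocks_per_page * struct_size / page_size


PAGE_SIZE = 4096   # 4K page, as in the example above
BLOCK_SIZE = 1024  # 1K filesystem blocks

# Approximate sizes taken from the text; the last entry is hypothetical.
cases = [
    ("buffer_head, 64-bit (~100 bytes)", 100),
    ("fsblock, 64-bit (40 bytes)", 40),
    ("fsblock, 64-bit with int bitops (32 bytes, hypothetical)", 32),
]

for name, size in cases:
    ratio = metadata_overhead(PAGE_SIZE, BLOCK_SIZE, size)
    print(f"{name}: {ratio:.1%}")
```

Under these assumptions the three cases come out at roughly 9.8%, 3.9% and 3.1%, which is consistent with the "10%" and "around 3%" figures quoted in the list.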

That's basically it for now. I have a few more ideas for cool things, but
there are only so many hours in a day. Comments are non-existent so far,
and there is lots of debugging stuff and some things are a little dirty,
but it should be slightly familiar if you understand buffer.c. I'm not so
interested in hearing about trivial nitpicking at this point because things
are far from final or proposed for upstream. There is still a race or two,
but I think they can all be solved.

So. Comments? Is this something we want? If yes, then how would we
transition from buffer.c to fsblock.c?


From: Nick Piggin [email blocked]
To: Linux Kernel Mailing List [email blocked]
Subject: Re: [RFC] fsblock
Date: Sun, 24 Jun 2007 03:53:55 +0200


Just clarify a few things. Don't you hate rereading a long work you wrote?
(oh, you're supposed to do that *before* you press send?).

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> 
> I'm announcing "fsblock" now because it is quite intrusive and so I'd
> like to get some thoughts about significantly changing this core part
> of the kernel.
> 
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...
> 
> Firstly, what is the buffer layer?  The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).

I mean, in Linux, the block device cache is unified. UNIX I believe did
all its caching in a buffer cache, below the filesystem.

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We

Oh, and I don't have a Linux mkfs that makes minixv3 filesystems. I had an
image kindly made for me because I don't use minix. If you want to test
large block support, I won't email it to you though: you can just convert
ext2 or ext3 to fsblock ;)

From: Jeff Garzik [email blocked]
To: Nick Piggin [email blocked]
Subject: Re: [RFC] fsblock
Date: Sat, 23 Jun 2007 23:07:54 -0400


Nick Piggin wrote:
> - No deadlocks (hopefully). The buffer layer is technically deadlocky by
>   design, because it can require memory allocations at page writeout-time.
>   It also has one path that cannot tolerate memory allocation failures.
>   No such problems for fsblock, which keeps fsblock metadata around for as
>   long as a page is dirty (this still has problems vs get_user_pages, but
>   that's going to require an audit of all get_user_pages sites. Phew).
> 
> - In line with the above item, filesystem block allocation is performed
>   before a page is dirtied. In the buffer layer, mmap writes can dirty a
>   page with no backing blocks which is a problem if the filesystem is
>   ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow... The handling of ENOSPC prior to mmap write is
more an ABI behavior, so I don't see how this can be fixed with internal
changes, yet without changing behavior currently exported to userland
(and thus affecting code based on such assumptions).

> - An inode's metadata must be tracked per-inode in order for fsync to
>   work correctly. buffer contains helpers to do this for basic
>   filesystems, but any block can be only the metadata for a single inode.
>   This is not really correct for things like inode descriptor blocks.
>   fsblock can track multiple inodes per block. (This is non trivial,
>   and it may be overkill so it could be reverted to a simpler scheme
>   like buffer).

hrm; no specific comment but this seems like an idea/area that needs to
be fleshed out more, by converting some of the more advanced filesystems.

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

definitely useful, especially if I rewrite my ibu filesystem for 2.6.x,
like I've been planning.

> So. Comments? Is this something we want? If yes, then how would we
> transition from buffer.c to fsblock.c?

Your work is definitely interesting, but I think it will be even more
interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are
converted.

My gut feeling is that there are several problem areas you haven't hit
yet, with the new code.

Also, once things are converted, the question of transitioning from
buffer.c will undoubtedly answer itself. That's the way several of us
handle transitions: finish all the work, then look with fresh eyes and
conceive a path from the current code to your enhanced code.

Jeff

From: Nick Piggin [email blocked]
To: Jeff Garzik [email blocked]
Subject: Re: [RFC] fsblock
Date: Sun, 24 Jun 2007 05:47:55 +0200


On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> Nick Piggin wrote:
> >- No deadlocks (hopefully). The buffer layer is technically deadlocky by
> >  design, because it can require memory allocations at page writeout-time.
> >  It also has one path that cannot tolerate memory allocation failures.
> >  No such problems for fsblock, which keeps fsblock metadata around for as
> >  long as a page is dirty (this still has problems vs get_user_pages, but
> >  that's going to require an audit of all get_user_pages sites. Phew).
> >
> >- In line with the above item, filesystem block allocation is performed
> >  before a page is dirtied. In the buffer layer, mmap writes can dirty a
> >  page with no backing blocks which is a problem if the filesystem is
> >  ENOSPC (patches exist for buffer.c for this).
> 
> This raises an eyebrow... The handling of ENOSPC prior to mmap write is
> more an ABI behavior, so I don't see how this can be fixed with internal
> changes, yet without changing behavior currently exported to userland
> (and thus affecting code based on such assumptions).

I believe people are happy to have it SIGBUS (which is how the VM
is already set up with page_mkwrite, and what fsblock does).

> >- An inode's metadata must be tracked per-inode in order for fsync to
> >  work correctly. buffer contains helpers to do this for basic
> >  filesystems, but any block can be only the metadata for a single inode.
> >  This is not really correct for things like inode descriptor blocks.
> >  fsblock can track multiple inodes per block. (This is non trivial,
> >  and it may be overkill so it could be reverted to a simpler scheme
> >  like buffer).
> 
> hrm; no specific comment but this seems like an idea/area that needs to
> be fleshed out more, by converting some of the more advanced filesystems.

Yep. It's conceptually fairly simple though, and it might be easier
than having filesystems implement their own complex syncing that finds
and syncs everything themselves.

> >- Large block support. I can mount and run an 8K block size minix3 fs on
> >  my 4K page system and it didn't require anything special in the fs. We
> >  can go up to about 32MB blocks now, and gigabyte+ blocks would only
> >  require one more bit in the fsblock flags. fsblock_superpage blocks
> >  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> 
> definitely useful, especially if I rewrite my ibu filesystem for 2.6.x,
> like I've been planning.

Yeah, it wasn't the primary motivation for the rewrite, but it would
be negligent to not even consider large blocks in such a rewrite, I
think.

> >So. Comments? Is this something we want? If yes, then how would we
> >transition from buffer.c to fsblock.c?
> 
> Your work is definitely interesting, but I think it will be even more
> interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are
> converted.

Well minix has dir in pagecache ;) But you're completely right: ext2
will be the next step and then ext3 and things like XFS and NTFS
will be the real test. I think I could eventually get ext2 done (one
of the biggest headaches is simply just converting ->b_data accesses),
however unlikely a journalling one.

> My gut feeling is that there are several problem areas you haven't hit
> yet, with the new code.

I would agree with your gut :)

> Also, once things are converted, the question of transitioning from
> buffer.c will undoubtedly answer itself. That's the way several of us
> handle transitions: finish all the work, then look with fresh eyes and
> conceive a path from the current code to your enhanced code.

Yeah that would be nice. It's very difficult because of so much
filesystem code. I'd say it would be feasible to step buffer.c into
fsblock.c, however if we were to track all (or even the common)
filesystems along with that it would introduce a huge number of
kind-of-redundant changes that I don't think all fs maintainers would
have time to write (and as I said, I can't do it myself). Anyway,
let's cross that bridge if and when we come to it.

For now, the big thing that needs to be done is convert a "big" fs
and see if the results tell us that it's workable.

Thanks for the comments Jeff.

Just wanted to say that the

Jezze (not verified)
on
June 25, 2007 - 5:22am

Just wanted to say that the interview with Nick Piggin can be found on: http://kerneltrap.org/node/657 and not the link above!

Very interesting article by the way!

Sounds cool

Anonymous (not verified)
on
June 25, 2007 - 8:12am

Sounds cool. Linux is constantly evolving and improving.

Would this increase filesystem or IO performance?

Performance

Anonymous (not verified)
on
June 25, 2007 - 2:08pm

If we can get IO block size beyond 4k, IO is likely to improve hugely.
At the moment, you can make an XFS filesystem with a block size other than 4K, but you can't mount it.
Hopefully that will be resolved.

--John

No

Anonymous (not verified)
on
June 26, 2007 - 4:03am

"If we can get IO block size beyond 4k, IO is likely to improve hugely."

No, the difference on at least modern SATA disks is minuscule because they do sector readahead anyway. On PATA, a small gain has been measured for linear access patterns.

For random access, the situation can get worse because you are likely to transfer more data than needed between each seek. Also, the cost of cache misses is higher because the cache items cover more ground (hence fewer items in the cache - bad for semi-random access patterns)

The block size is mostly useful wrt VM integration and fragmentation issues.

All SATA disks do readahead

Anonymous (not verified)
on
June 26, 2007 - 8:44am

All SATA disks do readahead but all PATA disks don't? Sources please.

why do you understand it this way?

strcmp
on
June 26, 2007 - 12:45pm

i read it more as 'measurements show that the speed gain is likely bigger on pata'. who said "all"? you won't get a source for this kind of statement.

from what i know every modern disk (read: younger than 15 years, when the DOS benchmark coretest was popular) does readahead. but pata controllers tend to have an inefficient (backward-compatible) programming interface. maybe the protocol overhead dominates the access time if everything else is cached? and bigger blocks mean less protocol overhead (if the kernel doesn't read bigger blocks for read ahead purposes anyway...).

also there apparently never has been usable command queuing on pata, which is part of the 'legacy protocol' problem.

you're thinking small

Anonymous (not verified)
on
June 29, 2007 - 7:16am

Desktop applications and large data processing applications have very different usage patterns. You're talking about factors that affect the former, and large blocks are intended to help the latter. The performance gain is from reduced overhead on the server side--every time you retrieve data as a 256k block instead of 64 4k blocks you save a lot of work, and if you're streaming data on the order of hundreds of gigs that adds up to a lot of CPU savings. Yes, that's a configuration that would stink for a desktop workload, or even a database index, but for other workloads it's a huge win.

Is that strictly true?

Mr_Z
on
June 29, 2007 - 10:17am

If you can queue commands, how much of the cost of retrieving the 256K is in the commands themselves? For 64 contiguous 4K blocks, I don't expect the performance to be much different than one 256K block, especially when disks still operate in terms of 512 byte sectors at the protocol level anyway. In both cases you're sending 512 requests for 512 bytes.

The real impact comes from the amount of metadata you need to access to locate those blocks. If you have the nested indirect-block structure such as ext2/ext3, you will need to locate at least 2 other metadata blocks (the indirect tree head in the inode and at least one indirect block) to find all 64 blocks in the file. If the file is really big, you might have to go through a few more (triple-indirect, double-indirect, and finally indirect) before you've located the 64 blocks of interest. Since these indirection lookups are dependent (meaning you have to finish one before you can do the next), you could be there awhile.
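
To make the dependent-lookup cost concrete, here is a rough sketch of how many indirect blocks sit between the inode and a given data block. It assumes the classic ext2/ext3 layout (12 direct pointers in the inode, then one single-, one double- and one triple-indirect pointer, with 4-byte block numbers); the function name is made up for illustration and is not a kernel API.

```python
EXT2_NDIR_BLOCKS = 12   # direct block pointers held in the inode
BLOCK_PTR_SIZE = 4      # bytes per on-disk block number (assumed 32-bit)


def indirection_depth(logical_block, block_size):
    """Number of indirect blocks to read before reaching a data block.

    0 means the pointer lives directly in the inode; 1, 2 and 3 mean the
    block is reached through the single-, double- or triple-indirect tree.
    """
    ptrs_per_block = block_size // BLOCK_PTR_SIZE

    if logical_block < EXT2_NDIR_BLOCKS:
        return 0
    logical_block -= EXT2_NDIR_BLOCKS

    # Each further level covers ptrs_per_block times more blocks.
    for depth in (1, 2, 3):
        covered = ptrs_per_block ** depth
        if logical_block < covered:
            return depth
        logical_block -= covered

    raise ValueError("block index beyond triple-indirect range")


if __name__ == "__main__":
    block_size = 4096
    for offset in (0, 1 << 20, 1 << 30, 100 << 30):
        depth = indirection_depth(offset // block_size, block_size)
        print(f"offset {offset:>12}: {depth} dependent metadata read(s)")
```

With 4K blocks this gives depth 1 at a 1 MiB offset, 2 at 1 GiB and 3 at 100 GiB, and each level is a read that must complete before the next can be issued if it is not already cached.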

Granted, you can start fetching some of the data blocks while waiting for further lookups, and indeed I'm certain that's what happens. But, the metadata isn't threaded in with the data itself, so each metadata lookup that misses the cache also turns into a seek. If you're dealing with large files regularly, this adds up. If you regularly stream through files a single time, you don't get much benefit from the disk cache, either for the data or the metadata.

This is why people like extents so much. The number of extents required to track a file tracks with the number of fragments the file is broken into. Assuming you've laid down the file mostly contiguously, it will have few extents, and therefore little metadata associated with it. This property is largely independent of block size. The important aspect is that large contiguous swaths of file are represented compactly, and so large I/Os can be initiated without consulting very much metadata at all.
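
As a small illustration of why extent counts follow fragmentation rather than block size, the sketch below coalesces a file's list of physical block numbers into (start, count) pairs. The representation and function name are assumptions made for the example, not the on-disk format of any particular filesystem.

```python
def blocks_to_extents(physical_blocks):
    """Collapse a file's physical block numbers into (start, count) extents.

    physical_blocks lists the on-disk block number of each logical block,
    in file order. Adjacent blocks that are also adjacent on disk are
    merged into a single extent.
    """
    extents = []
    for block in physical_blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # Block continues the previous extent: just bump its length.
            start, count = extents[-1]
            extents[-1] = (start, count + 1)
        else:
            # Discontinuity on disk: start a new extent.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    contiguous = list(range(1000, 1064))   # 64 blocks laid out in a row
    fragmented = [10, 11, 12, 50, 51]      # two fragments
    print(blocks_to_extents(contiguous))
    print(blocks_to_extents(fragmented))
```

The 64 contiguous blocks collapse to one extent however small the blocks are, whereas a block-list scheme needs 64 pointers for the same file.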

Yes

Anonymous (not verified)
on
July 2, 2007 - 6:04am

XFS is an extent based filesystem, and it does get more efficient with larger blocks (on IRIX, because they aren't available on Linux yet). Remember where this thread started; regardless of whether or not you're using extents to manage your disk space, you're still going through the buffer layer, and you can still only deal in buffer-layer (block) units. If you look at the code implementing a filesystem you'll see a lot of system calls that basically say "get a block from the buffer". If you can reduce those system calls by an order of magnitude or more you can potentially save a lot of system call overhead. As far as actually reading the data from the disk, no, you don't have to issue one read for every 512 bytes; the scsi command set allows you to say "retrieve M bytes starting at offset N".

Waste space?

Anonymous (not verified)
on
June 26, 2007 - 9:00am

If you used a larger block size, wouldn't it waste more disk space?

PS. I am a noob.

Probably, but it depends

Mr_Z
on
June 26, 2007 - 4:18pm

If you're storing mostly large files on a filesystem like ext2 or ext3, which tracks storage in lists of allocated blocks, etc. then a larger block size will be both faster and more efficient storage-wise. This is due to the fact there is less metadata associated with each file. Less metadata means more storage left for the files, and less I/O associated with metadata.

This advantage largely evaporates when you go to an extent-based filesystem. There, block size is much less relevant, because storage is tracked in extents. Extents give a starting block # as well as a count of how many blocks comprise the extent. Above that is a list of extents. The number of extents is unlikely to change greatly relative to the block size.

A more typical filesystem has large numbers of small files, though. In fact, I recall seeing somewhere that there's an inverse relationship. So, for a more typical workload, you'd waste space with too large a block size unless you do something like tail-packing, or placing small files directly in inodes. Without those features, you want a block size that's just larger than the bulk of your "small files", without being too large. 4K seems to be a pretty good choice, both for that reason and the fact it lines up well with most TLB's page size.

(Note, Alpha uses an 8K page size from what I've heard.)
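
To put a number on the wasted space, here is a quick sketch that totals the slack (allocated but unused bytes) for a set of files at a given block size. It assumes no tail-packing and no data stored in inodes; the file sizes and function names are invented for the example.

```python
def allocated_bytes(file_size, block_size):
    """Bytes on disk for a file when space is handed out in whole blocks."""
    blocks = -(-file_size // block_size)   # ceiling division
    return blocks * block_size


def total_slack(file_sizes, block_size):
    """Sum of allocated-but-unused bytes across all files."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


if __name__ == "__main__":
    # Made-up mix of mostly small files.
    sizes = [100, 3000, 5000, 70000]
    for block_size in (4096, 65536):
        print(f"{block_size:>6}-byte blocks: "
              f"{total_slack(sizes, block_size)} bytes of slack")
```

For this made-up mix, 4K blocks leave 12012 bytes unused while 64K blocks leave 249580, which is the effect described above for workloads dominated by small files.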
