"Work continues to progress well but I've hit a few snags," noted Matthew Dillon, referring to the ongoing development of his HAMMER filesystem. He began by highlighting a number of problems with the current design, then adding, "everything else now works, and works well, including and most especially the historical access features." He continued:
"I've come to the conclusion that I am going to have to make a fairly radical change to the on-disk structures to solve these problems. On the plus side, these changes will greatly simplify the filesystem topology and greatly reduce its complexity. On the negative side, recovering space will not be instantaneous and will basically require data to be copied from one part of the filesystem to another."
Matt detailed his solution, which included getting rid of the previously described clusters, super-clusters, A-lists, and per-cluster B-Tree's, "instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem", adding that the filesystem would be implemented "as one big huge circular FIFO, pretty much laid down linearly on disk, with a B-Tree to locate and access data." He detailed the many improvements, noting that this also makes it possible to provide efficient real-time mirroring. He concluded, "it will probably take a week or two to rewire the topology and another week or two to debug it. Despite the massive rewiring, the new model is much, MUCH simpler then the old, and all the B-Tree code is retained (just extended to operate across the entire filesystem instead of just within a single cluster)."
From: Matthew Dillon Subject: HAMMER update 06-Feb-2008 Date: Feb 6, 2:55 pm 2008 Work continues to progress well but I've hit a few snags: * My current recovery model is just not working for cross-cluster transactions. * I can't fix the write performance issue due to the way the cluster localization and spike code works. * Handling nearly full filesystems is proving to be a problem due to the inability to estimate space requirements which in turn are related to the cluster localization the current model uses. * Efficient real-time mirroring is just not easy to do w/ the current cluster-localized model. Everything else now works, and works well, including and most especially the historical access features. I've already fallen in love with being able to create a softlink to a '@@0x<timestamp>' string to access snapshots. I've come to the conclusion that I am going to have to make a fairly radical change to the on-disk structures to solve these problems. On the plus side, these changes will greatly simplify the filesystem topology and greatly reduce its complexity. On the negative side, recovering space will not be instantanious and will basically require data to be copied from one part of the filesystem to another. My take is that since large filesystems (our target) do not usually fill up very quickly, and indeed it can take hours even if you are writing 60MB/sec to the disk, It is well worth taking a hit on free space recovery if it accomplishes all of the filesystem's other goals. This may seem a bit radical as solutions go, but the more I look at it the more I like it because it really does neatly tie up every single remaining issue. * Get rid of super-clusters, clusters, and typed buffers. Get rid of them entirely. Just have an array of buffers after the volume header. * Get rid of the A-list's (the recursive bitmap radix tree allocation model). Poof, gone. All of them, gone. Have no random block allocation model at all. * Get rid of the per-cluster B-Tree's. Instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem (currently a per-cluster B-Tree can only access information within that cluster). * Implement the filesystem as one big huge circular FIFO, pretty much laid down linearly on disk, with a B-Tree to locate and access data. * Random modifications (required for manipulating the B-Tree and marking records as deleted) will append undo records to this FIFO and the only write ordering requirement will be that the buffers containing these undo records be committed before the buffers containing the random modifications are committed. This completely solves the crash recovery issue, regardless of the size of a transaction. If you crash half way through doing a 200MB write() it will be able to unwind the whole thing. This (plus the B-Tree and layout changes) completely solves the write performance issue. * Physical recovery of 'disk space' in the filesystem requires advancing the circular FIFO's beginning index, which means copying data we wish to retain from the front of the FIFO to the end and discarding data at the head that we wish to destroy. This is the one big gotcha to the model, but the key point here is that the copying can occur off-peak, or whenever disk bandwidth is available, and can operate on large linear swaths of disk at once (with some random async writes for B-Tree fixups). * Since information is laid down linearly, on-disk space can be reserved for in-memory buffer and record caches, solving the issue of how to deal with nearly full filesystems. * Ultimately physical recovery can be optimized by introducing a large-block allocation model 'behind' the circular FIFO abstraction. This will solve the issue of dealing with bad blocks on the disk. * The filesystem can be recovered from horrible data loss basically by doing a copy-scan to first remove all auxillary data (like B-Tree nodes and older undo records that are no longer needed), and then regenerating the B-Tree from scratch for the entire filesystem. This would of course take a long time but it would only be needed to recover data after a head crash, where physical corruption might be present. * This also neatly solves real-time mirroring issues. A real-time mirror can simply track the FIFO linearly, and can record its FIFO index if interrupted to pick up where it left off later on. You can't get much better then that. I had given up on efficient real time mirroring when I settled on the localized cluster model. Now its back. It will probably take a week or two to rewire the topology and another week or two to debug it. Despite the massive rewiring, the new model is much, MUCH simpler then the old, and all the B-Tree code is retained (just extended to operate across the entire filesystem instead of just within a single cluster). -Matt Matthew Dillon <dillon@backplane.com>
From: Simon 'corecode' Schubert Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 4:09 pm 2008 Matthew Dillon wrote: > * Implement the filesystem as one big huge circular FIFO, pretty much > laid down linearly on disk, with a B-Tree to locate and access data. > > * Random modifications (required for manipulating the B-Tree and marking > records as deleted) will append undo records to this FIFO and the only > write ordering requirement will be that the buffers containing these > undo records be committed before the buffers containing the random > modifications are committed. This sounds quite like LFS now. LFS however split the volume in smaller blocks which could be "used", "empty" or "open", IIRC. Their background cleaner then could push remaining data from used blocks to a currently open one, marking the block "empty" after that, allowing the FS to write to the blocks again. They always had problems with the cleaner, I think. cheers simon -- Serve - BSD +++ RENT this banner advert +++ ASCII Ribbon /"\ Work - Mac +++ space for low €€€ NOW!1 +++ Campaign \ / Party Enjoy Relax | http://dragonflybsd.org Against HTML \ Dude 2c 2 the max ! http://golden-apple.biz Mail + News / \
From: Michael Neumann Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 5:02 pm 2008 Simon 'corecode' Schubert wrote: > Matthew Dillon wrote: >> * Implement the filesystem as one big huge circular FIFO, pretty much >> laid down linearly on disk, with a B-Tree to locate and access >> data. >> >> * Random modifications (required for manipulating the B-Tree and >> marking >> records as deleted) will append undo records to this FIFO and >> the only >> write ordering requirement will be that the buffers containing >> these >> undo records be committed before the buffers containing the >> random modifications are committed. > > This sounds quite like LFS now. LFS however split the volume in smaller > blocks which could be "used", "empty" or "open", IIRC. Their background > cleaner then could push remaining data from used blocks to a currently > open one, marking the block "empty" after that, allowing the FS to write > to the blocks again. How about "Generational Garbarge Collection"? Assuming that there are some files that will never be deleted this could give slighly better performance. Keep a "copy count" (a copy occurs if the cleaner has to copy data from the left end to the right end of the FIFO). If that increases over, say 3, copy it into the old generation FIFO. One problem of course is how to dimension each generation, and how many to use. I think that's basically how LFS works. Regards, Michael
From: Ivan Voras Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 5:05 pm 2008 Matthew Dillon wrote: > * Get rid of the A-list's (the recursive bitmap radix tree allocati= on > model). Poof, gone. All of them, gone. Have no random block > allocation model at all. >=20 > * Get rid of the per-cluster B-Tree's. Instead have just one globa= l > B-Tree for the entire filesystem, able to access any record anywh= ere > in the filesystem (currently a per-cluster B-Tree can only access= > information within that cluster). >=20 > * Implement the filesystem as one big huge circular FIFO, pretty mu= ch > laid down linearly on disk, with a B-Tree to locate and access da= ta. How will this affect parallel IO (reads, but especially writes)? Would=20 having such a global structure serialize it? (I'm assuming, possibly=20 wrongly, that having trees per-cluster allowed you to lock individual=20 clusters).
From: Matthew Dillon Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 6:15 pm 2008 :How will this affect parallel IO (reads, but especially writes)? Would=20 :having such a global structure serialize it? (I'm assuming, possibly=20 :wrongly, that having trees per-cluster allowed you to lock individual=20 :clusters). Reads will not be effected at all... the locking occurs at the B-Tree node layer. Writes will not be serialized and will still be asynchronous so the most typical striping setups on multi-disk filesystems should still yield very high performance. Writes WILL be far more likely to be sequential which should actually improve write performance. Also keep in mind that writes are buffered by the buffer cache, so there is a caching layer between userland and the physical disk. Mixed data writes (parallel write operations by multiple processes in different parts of the filesystem) will generally lay down new information sequentially on disk, which can be detrimental for read performance since the individual files will not be entirely sequential. I seem to recall a paper at a USENIX long ago where someone tested locality of reference for reads after laying down writes from parallel sources sequentially, and it was no worse then trying to zone the disparate writes, so I'm not really worried about this case. Also, once you get over a track or two's worth of data, it costs about the same to seek 3 tracks as it does to seek 10 tracks, so as long as writes are not *completely* strewn about due to lots of parallel write activity occuring, it shouldn't be a problem. They won't be because writes are cached in the buffer cache prior to being flushed out. We should get nice long bursts of sequentially ordered data on disk. -- I don't like to think that I wasted a ton of time building the cluster mechanism, and its kinda sad to see so much code removed. But most of the work over the last few months has been B-Tree centric, implementing the inode cache, high level VOPs, record structures, etc... and those parts of the codebase remain intact. It really got to the point where implementing the last bits was starting to take way way too much time. When things start to take that much time to do, I know I've made a mistake somewhere in the design. Better to fix it now then to try to slog through the complexity later on. -Matt Matthew Dillon <dillon@backplane.com>
From: Matthew Dillon Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 6:40 pm 2008 :This sounds quite like LFS now. LFS however split the volume in smaller = : :blocks which could be "used", "empty" or "open", IIRC. Their background = : :cleaner then could push remaining data from used blocks to a currently=20 :open one, marking the block "empty" after that, allowing the FS to write = : :to the blocks again. : :They always had problems with the cleaner, I think. : :cheers : simon It's the 'pushing data around' action that causes the most problems with recovery and cleaning, or more specifically, recovery after you crash while cleaning. It can get messy really fast, especially if you are trying to manage bitmaps. With the cluster idea I got stuck the moment a transaction spanned more the one cluster. I suddenly had two recovery domains related to one transaction, and I couldn't find a solution to it. A filesystem-wide FIFO abstraction solves all three problems by providing exactly one synchronization point (the circular fifo's indices) for the entire filesystem. I really do think that the performance issues with cleaning HAMMER will be addressable through the use of a very-large-block blockmap, or by implementing a set of named very-large blocks. I have some other ideas as well, such as treating the data associated with records as an out-of-band entity that is not made part of the FIFO itself. One last thing that must be considered is real time mirroring. There's something to be said for having a single logical FIFO for the entire filesystem in a clustered environment. It makes non-queued mirroring ridiculously easy to implement. -Matt
From: Matthew Dillon Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 6:25 pm 2008 :How about "Generational Garbarge Collection"? Assuming that there are :some files that will never be deleted this could give slighly better :performance. : :Keep a "copy count" (a copy occurs if the cleaner has to copy data from :the left end to the right end of the FIFO). If that increases over, say :3, copy it into the old generation FIFO. : :One problem of course is how to dimension each generation, and how many :to use. : :I think that's basically how LFS works. : :Regards, : : Michael The idea I have is turn the physical layout of the disk into a logical linear layout. That is, instead of actually laying the FIFO down linearly on the physical disk, instead create a very large backing blockmap that allows large (say 16MB) chunks of the disk to be shifted around simply by adjusting the block map. Now the FIFO copying can simply shift the underlying block map to move a 16MB block from the beginning of the FIFO to the end of the FIFO if no garbage collection is otherwise required. One can also have the idea of 'named very large blocks', where each one represents a separate recovery zone. This is like a blockmap allocation mechanism that allows you to move the block from the beginning of the FIFO to the end without having to rewire the references to elements within that block. I think the initial implementation has to just work with the disk. I am making sure my design is condusive to these sorts of optimizations but I am not going to try to implement a blockmap immediately. What is becoming very clear to me now that I'm actually coding this stuff, is that my original clustering idea was simply too highly integrated into the general filesystem operations, requiring complex code all over the place. The new idea is far, FAR less complex and yet as we see there are plenty of layer-based solutions that can solve the copy case issues without requiring the filesystem core to have much knowledge about the additional layer. -Matt Matthew Dillon <dillon@backplane.com>


the hammer also rises?
"instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem"
Reiser FS?
Ever needed to force fsck a
Ever needed to force fsck a reiserfs volume that contains deleted backup images of other reiserfs volumes? Do you know what happens?
Nobody knows that
>Do you know what happens?
Only reiserfsck knows the answer, but sure they are Very Bad Things.
I thought Reiser4 solved these issues. XFS uses a similar structure.
I think it's not fair to blame an structure based on a bad designed filesystem using it.
You're right,
You're right, but I didn't blame the structure. The original poster referred to reiserfs.
All programming languages
All programming languages evolve towards lisp, and now all filesystems evolve towards Reiser FS?
Between Hammer and Btrfs,
Between Hammer and Btrfs, things are looking exciting in the Linux filesystem world! The lack of a ZFS port now doesn't seem so bad. :-)
HAMMER is for DragonflyBSD,
HAMMER is for DragonflyBSD, not GNU/Linux.
If you're talking about the
If you're talking about the kernel , you can definitely lose the "GNU/" part. Even accoring to rms.
It can be ported to Linux.
It can be ported to Linux.