"HAMMER won't be ready for sure (things take however long they take), but the hardest part is working and stable and I'm just down to garbage collection and crash recovery," noted Matthew Dillon, discussing the status of what is ultimately intended to be a highly available clustering filesystem. The upcoming DragonFlyBSD release this month was originally intended to be 2.0 with a beta quality HAMMER, but the decision was recently made to call the release 1.12 while HAMMER continues to stabilize. Matt continued, "HAMMER is really shaping up now. Here's what works now: all filesystem operations; all historical operations; all Pruning features". During the discussion, he was asked how he planned to support multi-master replication, in reply to which he began:
"My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster."
From: Matthew Dillon
Subject: Re: HAMMER update 06-Feb-2008
Date: Feb 8, 5:22 pm 2008

:There's always the option of releasing a 1.12 version now (it's not like
:there haven't been enough changes to justify a new release). The 2.0 release
:is likely to get a lot of downloads, so I think shipping it with a pre-alpha
:hammer is a waste of an opportunity to attract more people. Not to mention
:that it's hard to put a time bound on this kind of development. So, let's
:just admit that and ship 2.0 as soon as Matt declares it ready for beta
:testing, regardless of what time of year it is. A real beta-state hammer
:justifies a 2.0 release on its own IMHO. Also, this will let Matt work on
:hammer without any tight deadlines.
:
:Aggelos

Yes, and I've agonized over this very possibility. Maybe the thing to do is to poll the people on kernel@.

HAMMER won't be ready for sure (things take however long they take), but the hardest part of it is working and stable and I'm just down to garbage collection and crash recovery. Crazily enough, that is what all the major surgery yesterday, and the continuing work, is about.

So what do people think? Should this month's release be 1.12 or 2.0?

-Matt
Matthew Dillon <firstname.lastname@example.org>
From: Matthew Dillon
Subject: Re: HAMMER update 06-Feb-2008
Date: Feb 10, 2:00 pm 2008

Ok, it's going to be 1.12. As much as I hate 2.0 being so late, there's no point calling it 2.0 without a production-ready HAMMER. Considering how few items are left on the list for HAMMER, it will likely be production-ready long before the mid-year release.

-Matt
Matthew Dillon <email@example.com>
From: Matthew Dillon
Subject: 1.12 Branching complete
Date: Feb 11, 7:58 pm 2008

1.12 has been branched! I've done a rough nrelease build with HEAD but haven't tested that a nrelease build with 1.12 produces the correct version information yet. Other than that it went pretty smoothly.

There were no installer issues once I completely regenerated the binary packages. I have updated the nrelease Makefile to use the new binary packages (including a completely new bootstrap) for fresh installs.

-Matt
From: Matthew Dillon
Subject: HAMMER update 10-Feb-2008
Date: Feb 10, 2:19 pm 2008

HAMMER is really shaping up now. Here's what works now:

* All filesystem operations
* All historical operations
* All Pruning features

Here's what is left:

* Freemap code (allocate and free big-blocks, which are 8MB blocks). Currently a hack so everything else can be tested; nothing is actually freed.
* UNDO FIFO and related recovery code. Most of the API calls are in place; the back-end buffer reservation, flushes, and recovery need to be implemented.
* Big-block cleaning code (this is different from the pruning code).
* Structural locking. The B-Tree is fine-grained locked but the locks for the blockmap are just a hack (one big lock).

These are all fairly low difficulty items; most of the infrastructure needed to support their function is already in place and the FIFO infrastructure has already been tested (just not mapped onto a blockmap yet).

I have already run some tests with regards to the blockmap allocation model and it looks very good. What I did was implement an array of blockmap entry structures rather than just an array of pointers to the actual physical big-blocks. The blockmap entry structure not only has a pointer to the underlying physical big-block, it also has a bytes_free field which specifies how many bytes in the underlying big-block are free. This is the only tracking done by the blockmap. It does not actually try to track WHERE in the big-block the free areas are... figuring that out will be up to the cleaning code. What this gives us is the following:

* Extremely fast freeing of on-disk storage elements. The target physical block doesn't have to be read or written, only the governing blockmap entry. With 8MB big-blocks and 32-byte blockmap entries one 16K buffer can track 4GB worth of underlying storage, which means that freeing large amounts of sparse information does not cause the disk to seek all over the place. This is far, FAR better than the cluster model I was using last week and had to throw away. Massively better. Like night and day.

* The all-free case can be detected and used to immediately return a completely-free big-block to the free pool. I've done some testing and what this means is that removing large files or medium-sized sub-trees WILL in fact result in some immediate gratification. The space freed from sparse removal and pruning will take time to actually become reusable, as the cleaning code will have to go through and finish cleaning out the big-block(s) in question.

* The cleaning code is not complicated in the least. All it needs to do is scan the B-Tree and check the blockmap entries for related references. If the associated big-block has greater than a certain percentage of space free, the cleaning code will attempt to pack the remaining data (as it comes across it in the B-Tree) into a new block. Since the B-Tree elements and records must be manipulated no matter which side you approach cleaning and packing from, this is no more difficult than trying to reverse engineer the remaining contents of a big-block.

-Matt
Matthew Dillon <firstname.lastname@example.org>
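The entry layout Matt describes is easy to picture, and so is the arithmetic behind "one 16K buffer can track 4GB". What follows is only an illustrative sketch, not the actual HAMMER on-disk format; the field names (phys_offset, bytes_free) and the padding are assumptions made for the example:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BIGBLOCK_SIZE  (8 * 1024 * 1024)    /* 8MB big-blocks */
    #define BUFFER_SIZE    (16 * 1024)          /* one 16K meta-data buffer */

    /*
     * Hypothetical 32-byte blockmap entry: a pointer to the underlying
     * physical big-block plus a coarse free-space counter.  The blockmap
     * does NOT record where inside the big-block the free areas are.
     */
    struct blockmap_entry {
        uint64_t phys_offset;   /* location of the physical 8MB big-block */
        int32_t  bytes_free;    /* bytes free somewhere inside that block */
        uint32_t reserved[5];   /* padding out to 32 bytes */
    };

    /* Freeing storage only touches the governing blockmap entry. */
    static void
    blockmap_free(struct blockmap_entry *entry, int32_t bytes)
    {
        entry->bytes_free += bytes;
        if (entry->bytes_free == BIGBLOCK_SIZE) {
            /* all-free case: the big-block can go straight back to the pool */
            entry->phys_offset = 0;
        }
    }

    int
    main(void)
    {
        struct blockmap_entry e = { .phys_offset = BIGBLOCK_SIZE, .bytes_free = 0 };

        assert(sizeof(struct blockmap_entry) == 32);
        blockmap_free(&e, BIGBLOCK_SIZE);       /* whole block freed at once */

        /* 16K / 32 = 512 entries; 512 * 8MB = 4GB of tracked storage */
        size_t entries_per_buf = BUFFER_SIZE / sizeof(struct blockmap_entry);
        unsigned long long tracked = (unsigned long long)entries_per_buf * BIGBLOCK_SIZE;
        printf("%zu entries per buffer, tracking %lluGB\n",
               entries_per_buf, tracked >> 30);
        return 0;
    }

At 32 bytes per entry a single 16K buffer holds 512 entries, and 512 entries times 8MB per big-block is exactly 4GB of tracked storage, which is the figure Matt quotes.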
From: walt
Subject: Re: HAMMER update 10-Feb-2008
Date: Feb 10, 6:41 pm 2008

Matthew Dillon wrote:
> HAMMER is really shaping up now...
> This is far, FAR better than the cluster model I was using last week
> and had to throw away. Massively better. Like night and day.

I'm fond of telling political hotheads (which you once were, but no longer ;o) that, before they destroy the system devised by their predecessors, they owe it to themselves to stop and find out exactly what problems their predecessors thought they were solving when they invented the existing system.

So -- in this instance you are both the establishment and the revolutionary at the same time. Could you explain in extra-bonehead language what problems you were solving with the cluster model, and whether you are still solving those problems with the newer model?

Thanks!
From: Matthew Dillon
Subject: Re: HAMMER update 10-Feb-2008
Date: Feb 10, 8:21 pm 2008

:So -- in this instance you are both the establishment and the revolutionary at
:the same time. Could you explain in extra-bonehead language what problems you
:were solving with the cluster model, and whether you are still solving those
:problems with the newer model?
:
:Thanks!

The three major pieces didn't integrate together well enough. It's really that simple. Individually they were great. Together they were not.

* Small (64MB) clusters with localized B-Trees.
  Plus: Recovering the filesystem from a massive failure is not difficult.
  Minus: Data and records cannot roam about; they have to be placed in the particular cluster covering their key.

* Fine-grained A-list allocator.
  Plus: Fine-grained allocation and freeing.
  Minus: It didn't help the problem of dead space in the cluster, or for that matter help the issue of dealing with full clusters, because particular records and data had to go into particular clusters.

* Per-cluster recovery mechanic.
  Plus: Very easy to recover a fixed portion of the B-Tree.
  Minus: Couldn't cross cluster boundaries. Couldn't deal with transactions that crossed cluster boundaries. Requires strict write ordering and for clusters to be marked 'open' (virtually synchronously, it turned out) to even detect that recovery was needed. I still would have needed to implement an UNDO FIFO on top of everything else, and despite the per-buffer locality of reference the model had, it still made an excessive number of little modifications all over the buffer that would have made the UNDO FIFO rather expensive.

So I had a fine-grained allocator that couldn't contribute to solving the issue of what to do when clusters were too full or too empty, a recovery mechanic that couldn't guarantee consistency for transactions which spanned more than one cluster, and clusters that wound up being too small (64MB) and caused management I/O to become too dispersed. Add to that the fact that the model required an excessive number of special cases in-code. Do a cvs diff just of the B-Tree and cursor code from before I ripped it out to now and you will see the new code is much, much cleaner, with virtually no special cases.

--

The new model works a lot better:

* Fine-grained allocation through what will effectively be a sequential (blockmap) index.

* Extremely good localization and packing of data (even better than the cluster model had, and the cluster model was pretty good in that regard).

* The ability (due to the blockmap) to rearrange the physical location of blocks of nodes, records, and data, in 8MB chunks, with no effort.

* Coarse-grained freeing which is extremely fast (emphasis on extremely... it is ultra fast), but is still able to detect the 'all free' case for a big-block, for some immediate gratification when deleting or pruning large amounts of material.

* For the UNDO FIFO, which I would have had to write anyway, the new model generates far fewer dispersed modifications, resulting in a smaller UNDO stream. A small UNDO stream reduces the number of buffer dependencies to, in most cases, not more than one (i.e. 25 buffers may need 1 buffer to be committed to disk before they can be committed).

The core allocator, next on my list to write, is literally just a big-block allocator, dealing with large 8MB blocks and not having to deal with any fine-grained issues within those blocks. You can't get much more straightforward than that!

-Matt
Matthew Dillon <email@example.com>
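To make the "you can't get much more straightforward" point concrete, here is a minimal sketch of a coarse allocator in the spirit Matt describes: one bit per 8MB big-block, allocate and free only whole blocks, and leave all fine-grained placement inside a block to higher layers. The names (bigblock_alloc, bigblock_free), the bitmap representation, and the volume size are invented for illustration and are not the actual HAMMER freemap code:

    #include <stdint.h>

    #define BIGBLOCK_SIZE  (8ULL * 1024 * 1024)    /* allocation granularity */
    #define NBIGBLOCKS     4096                    /* 4096 * 8MB = 32GB volume */

    /*
     * Hypothetical coarse allocator: one bit per 8MB big-block.  There is
     * no fine-grained state here at all; callers carve up a big-block
     * however they like and only come back when the whole block is free.
     */
    static uint8_t bigblock_bitmap[NBIGBLOCKS / 8];

    /* Allocate one whole big-block; returns its byte offset or -1 if full. */
    static int64_t
    bigblock_alloc(void)
    {
        for (int i = 0; i < NBIGBLOCKS; ++i) {
            if ((bigblock_bitmap[i / 8] & (1 << (i % 8))) == 0) {
                bigblock_bitmap[i / 8] |= (1 << (i % 8));
                return (int64_t)(i * BIGBLOCK_SIZE);
            }
        }
        return -1;
    }

    /* Return a completely-free big-block (e.g. when bytes_free hits 8MB). */
    static void
    bigblock_free(int64_t offset)
    {
        int i = (int)(offset / BIGBLOCK_SIZE);

        bigblock_bitmap[i / 8] &= ~(1 << (i % 8));
    }

    int
    main(void)
    {
        int64_t b = bigblock_alloc();   /* hand out one 8MB block */

        if (b >= 0)
            bigblock_free(b);           /* and give it straight back */
        return 0;
    }

Nothing in this toy version knows about B-Tree nodes, records, or data layout inside a block, which is exactly the separation of concerns Matt is aiming for.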
From: Jason Smethers
Subject: Re: HAMMER update 10-Feb-2008
Date: Feb 10, 7:51 pm 2008

Matthew Dillon wrote:
> Here's what is left:
> * Structural locking. The B-Tree is fine-grained locked but the
>   locks for the blockmap are just a hack (one big lock).

Have you decided how to implement multi-master replication yet?

--
Jason Smethers
From: Matthew Dillon
Subject: Re: HAMMER update 10-Feb-2008
Date: Feb 10, 8:27 pm 2008

:Matthew Dillon wrote:
:> Here's what is left:
:> * Structural locking. The B-Tree is fine-grained locked but the
:>   locks for the blockmap are just a hack (one big lock).
:
:Have you decided how to implement multi-master replication yet?
:
:--
:Jason Smethers

My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster.

The quorum algorithm is fairly straightforward; all the complexity there is basically in how to deal with broken connections, missing hosts, and things of that ilk.

The caching algorithm is going to be a lot more complex. It will have to have timeouts with quorum-based watchdogs for refreshment in order to deal with hosts that drop out of the cluster, on top of everything else it has to do. It will have to be range-based at multiple levels in order to limit the memory footprint and work with super-large filesystems. I don't even want to start thinking about it yet.

What HAMMER will provide to the system as a whole is the real-time mirroring aspects, inode numbers and transaction ids which are forever-unique (allowing data to be passively cached indefinitely), historical access for as-of transactions which will allow parallel transactions to occur and detect collisions at commit time rather than during the transaction, and a bunch of other features.

-Matt
Matthew Dillon <firstname.lastname@example.org>
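The quorum half of the plan is simple to sketch at its most basic level: a write commits only when a strict majority of replicas acknowledge it, so any two committing majorities must overlap in at least one node and two conflicting writes cannot both succeed. The sketch below shows only that basic idea, with invented names (struct replica, quorum_commit, apply); it says nothing about HAMMER's actual protocol, the broken-connection handling Matt mentions, or the cache coherency layer:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical replica handle; in a real cluster this would wrap a
     * network connection and the apply call would be asynchronous.
     */
    struct replica {
        bool up;                                    /* host reachable? */
        bool (*apply)(struct replica *, const void *buf, size_t len);
    };

    /*
     * Commit an update if and only if a strict majority of replicas
     * acknowledge it.  Majorities always intersect, so two conflicting
     * updates cannot both reach quorum.
     */
    static bool
    quorum_commit(struct replica *replicas, int nreplicas,
                  const void *buf, size_t len)
    {
        int acks = 0;

        for (int i = 0; i < nreplicas; ++i) {
            if (!replicas[i].up)
                continue;               /* broken connection / missing host */
            if (replicas[i].apply(&replicas[i], buf, len))
                ++acks;
        }
        return acks > nreplicas / 2;    /* strict majority required */
    }

    /* Toy replica that always acknowledges. */
    static bool
    always_ack(struct replica *r, const void *buf, size_t len)
    {
        (void)r; (void)buf; (void)len;
        return true;
    }

    int
    main(void)
    {
        struct replica cluster[3] = {
            { .up = true,  .apply = always_ack },
            { .up = false, .apply = always_ack },   /* dropped-out host */
            { .up = true,  .apply = always_ack },
        };

        /* 2 of 3 acknowledge: still a strict majority, so this commits. */
        return quorum_commit(cluster, 3, "update", 6) ? 0 : 1;
    }

The hard parts Matt points to (detecting and re-admitting dropped hosts, and the range-based distributed cache coherency layer on top) are exactly what this toy version leaves out.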