DragonFlyBSD

DragonFly BSD 2.0, HAMMER Filesystem

Submitted by Jeremy
on July 21, 2008 - 6:48pm
DragonFlyBSD

"Hurrah! 2.0 has been released!" said Matthew Dillon, announcing the eighth major release of DragonFly BSD. This release is the first to include HAMMER, a new clustering filesystem that already boasts an impressive list of features, including: "crash recovery on-mount, no fsck; fine-grained snapshots, snapshot management, snapshot-support for filesystem-wide data integrity checks; historically accessible by default; mirroring: queueless incremental mirroring, master to multi-slave; undo and rollback; reblocking; multi-volume, maximum storage capacity of 1-Exabyte." Other highlighted changes in this release include, "native fairq-queue implementation using ALTQ, for PF", and "native connection state recovery to PF, so router reboots do not drop active TCP connections."

The latest version of DragonFly BSD can be downloaded from a mirror. The download page explains:

"DragonFly CDs are 'live', which means that the CD will boot your system and let you log in as root (no password). You can use this feature to check for hardware compatibility and play with DragonFly a little before actually installing it on your hard drive."

HAMMER Performance and Mirroring

Submitted by Jeremy
on June 20, 2008 - 8:28am
DragonFlyBSD

Matthew Dillon continues to make significant progress on his HAMMER clustering filesystem for DragonFly BSD. He labeled the latest release 56c, noting that it, "represents an additional significant improvement in performance, [also including] bug fixes and most of the final media changes." A significant improvement in write performance was obtained by making the filesystem block size automatically increase from 16K to 64K when a file grows to larger than 1 MB. One remaining media change is required to optimize mtime and atime storage, at which point HAMMER will go into testing and bug fixing mode. Matt noted, "HAMMER's performance is extremely good now, and its system cpu overhead has dropped to roughly the same that we get from UFS", adding, "HAMMER is now able to sustain full disk bandwidth for bulk reads and writes. HAMMER continues to have far superior random-write performance, whether the system caches are blown out or not." Discussing future plans for the filesystem, Matt noted, "I could go on and on, there's so much that can be done with this filesytem :-)" Regarding one of these plans, he offered:

"I am not going to promise it, but there is a slight chance I will be able to get mirroring working by the release. I figured out how to do it, finally. Basically the solution is to add another field to the B-Tree's internal elements... the 'most recent' transaction id, and to propogate it up all the way to the root of the tree. The mirroring code can then optimally scan the B-Tree and pick out all records that have changed relative to some transaction id, allowing it to quickly 'pick up' where it left off and construct a record-level mirror over a fully asynchronous link, without any queueing. You can't get much better then that, frankly. "

HAMMER's B+Tree Implementation

Submitted by Jeremy
on June 17, 2008 - 7:52pm
DragonFlyBSD

"HAMMER makes no modifications to the B-Tree whatsoever on the front-end. When you create, delete, rename, write, etc... when you do those operations HAMMER caches them in a virtualization layer in memory and doesn't make any modifications to its on-media data structures (or their in-memory representations) at all until the meta-data is synced to disk," DragonFly BSD creator Matthew Dillon explained, comparing HAMMER, his clustering filesystem, to a wiki summary of Reiser4's implementations. He continued:

"HAMMER uses a modified B+Tree for its on-disk representation, which is a B-Tree with only keys at internal nodes and only records at the leafs. This was done to reduce structural bloat, allow for a leaf->leaf linking optimization in the future, and for other reasons. [...] HAMMER's internal nodes have a left and right bounding element. A standard B+Tree only has a left bounding element. By adding a right bounding element HAMMER can cache pointers into its B+Tree and 'pick up' searches, insertions, and deletions relative to the cached pointers instead of having to start at the root of the tree. More importantly, it can pickup searches, insertions, and deletions at internal nodes, not just leaf nodes. So I can cache a proximity pointer and if I do a good job I never have to traverse the B+Tree above that point."

Improving HAMMER Performance

Submitted by Jeremy
on June 12, 2008 - 1:17am
DragonFlyBSD

"After another round of performance tuning HAMMER all my benchmarks show HAMMER within 10% of UFS's performance, and it beats the shit out of UFS in certain tests such as file creation and random write performance," noted DragonFly BSD creator Matthew Dillon, providing an update on his new clustering filesystem. He continued, "read performance is good but drops more then UFS under heavy write loads (but write performance is much better at the same time)." He then referred to the blogbench benchmark noting, "now when UFS gets past blog #300 and blows out the system caches, UFS's write performance goes completely to hell but it is able to maintain good read performance." Matthew then compared this to HAMMER:

"HAMMER is the opposite. It can maintain fairly good write performance long after the system caches have been blown out, but read performance drops to about the same as its write performance (remember, this is blogbench doing reads from random files). Here HAMMER's read performance drops significantly but it is able to maintain write performance. UFS's write performance basically comes to a dead halt. However, HAMMER's performance numbers become 'unstable' once the system caches are blown out."

HAMMER Stabilizing

Submitted by Jeremy
on May 14, 2008 - 6:11am
DragonFlyBSD

Matthew Dillon sent out a series of updates about his developing HAMMER filesystem, noting that he is currently focusing on the reblocking and pruning code, tracking down a number of bugs resulting in B-Tree corruption. He also noted that previously HAMMER was comprised of three components: B-Tree nodes, records, and data. In his latest cleanups, he has entirely removed the record structure, "this will seriously improve the performance of directory and inode access." This change did require an on-media format change, "I know I have said this before, but there's a very good chance that no more on-media changes will be made after this point. The official freeze of the on-media format will not occur until the 2.0 release, however."

Matt added, "HAMMER is stable enough now that I am able to run it on my LAN backup box. I'm using it to test that the snapshots work as expected as well as to test the long term effects of reblocking and pruning." He then cautioned:

"Please note that HAMMER is not ready for production use yet, there is still the filesystem-full handling to implement and much more serious testing of the reblocking and pruning code is required, not to mention the crash recovery code. I expect to find a few more bugs, but I'm really happy with the results so far."

HAMMER Crash Recovery

Submitted by Jeremy
on April 24, 2008 - 5:20pm
DragonFlyBSD

"HAMMER is going to be a little unstable as I commit the crash recovery code," began DragonFly BSD creator Matthew Dillon, adding, "I'm about half way through it." He went on to list what's left for crash recovery to work with HAMMER, his new clustering filesystem, "I have to flush the undo buffers out before the meta-data buffers; then I have to flush the volume header so mount can see the updated undo info; then I have to flush out the meta-data buffers that the UNDO info refers to; and, finally, the mount code must scan the UNDO buffers and perform any required UNDOs." He continued:

"The idea being that if a crash occurs at any point in the above sequence, HAMMER will be able to run the UNDOs to undo any partially written meta-data. HAMMER would be able to do this at mount-time and it would probably take less then a second, so basically this gives us our instant crash-recovery feature."

Matt went on to add that as an advantage of significantly separating the front end VFS operations from the backend I/O it would now be possible to fix several stalls in the code, significantly improving HAMMER's performance.

Fair Queuing For ALTQ

Submitted by Jeremy
on April 10, 2008 - 12:45pm
DragonFlyBSD

"I have a question for the PF/ALTQ masters out there," Matthew Dillon began on the DragonFlyBSD kernel mailing list, having recently switched from using a Cisco router to a DragonFlySD server running PF. "I am trying to configure PF in a manner similar to what Cisco's fair-queue algorithm does. Cisco's algorithm basically hashes TCP and UDP traffic based on the port/IP pairs, creating a bunch of lists of backlogged packets and then schedules the packets at the head of each list." He went on to explain that he was unsuccessfully trying to configure the same thing with PF, "neither CBQ nor HFSC seem to work well. I can separate certain types of traffic but the real problem is when there are multiple TCP connections that are essentially classified the same, and one is hogging the outgoing bandwidth. So the question is, is there a PF solution for that or do I need to write a new ALTQ mechanic to implement fair queueing?"

Not finding a solution, he followed with a series of patches implementing what he needed. He explained the resulting logic noting, "unless something comes up I am going to commit this to DragonFly on Friday and call it done. I would be pleased if other projects picked up some or all of the work":

"The queues are scanned from highest priority to lowest priority; if the packet bandwidth on the queue does not exceed the bandwidth parameter and a packet is available, a packet will be chosen fro that queue; if a packet is available but the queue has exceeded the specified bandwidth, the next lower priority queue is scanned (and so forth); if NO lower priority queues either have packets or are all over the bandwidth limit, then a packet will be taken from the highest priority queue with a packet ready; packet rate can exceed the queue bandwidth specification (but will not exceed the interface bandwidth specification, of course), but under full saturation the average bandwidth for any given queue will be limited to the specified value."

HAMMER Approaches Alpha Status

Submitted by Jeremy
on March 25, 2008 - 6:49am
DragonFlyBSD

Matthew Dillon posted on update on his evolving HAMMER filesystem, noting that it "passes all standard filesystem stress tests and buildworld will run with a HAMMER /usr/obj". He also noted, "pruning and reblocking code is in and partially tested, but now needs more stringent testing; full historical access appears to be working but needs testing." He added, "there are two big-ticket and several little-ticket items left. HAMMER will officially go Alpha when the big-ticket items are done, and beta when we get a few of the little-ticket items done." The two "big-ticket" items left to be completed are UNDO crash recovery code, and handling for full filesystems. Matt summarized:

"I have no time frame for these items yet. It will depend on how quickly HAMMER moves to Alpha and Beta status. I will say, however, now that HAMMER's on-disk format has solidified, that I have a very precise understanding of the protocols that will be needed to accomplish fully cache coherent remote access for both replicated and non-replicated (remote mount style) access. And, as you know, fully coherent filesystem access across machines is going to be the basis for DragonFly's clustering across said machines. In summary, things are progressing very well."

DragonFly BSD 1.12, "A Maintenance Update"

Submitted by Jeremy
on February 28, 2008 - 8:17am
DragonFlyBSD

"Hello everyone! We are happy to say that the 1.12 release is now available!" began Matthew Dillon, announcing the latest stable version of DragonFly BSD. The project's home page explains, "DragonFly is an operating system and environment originally based on FreeBSD. DragonFly branched from FreeBSD in 2003 in order to develop a radically different approach to concurrency, SMP, and most other kernel subsystems." Regarding the latest release, Matt explained:

"This release is primarily a maintenance update. A lot of work has been done all over the kernel and userland. There are no new big-ticket items though we have pushed the MP lock further into the kernel.

"The 2.0 release is scheduled for mid-year. Of the current big-ticket item work, the new HAMMER filesystem is almost to the alpha stage of development and is expected to be production ready by the mid-year 2.0 release."

2.0 Becomes 1.12 While HAMMER Matures

Submitted by Jeremy
on February 12, 2008 - 6:28am
DragonFlyBSD

"HAMMER won't be ready for sure (things take however long they take), but the hardest part is working and stable and I'm just down to garbage collection and crash recovery," noted Matthew Dillon, discussing the status of what is ultimately intended to be a highly available clustering filesystem. The upcoming DragonFlyBSD release this month was originally intended to be 2.0 with a beta quality HAMMER, but the decision was recently made to call the release 1.12 while HAMMER continues to stabilize. Matt continued, "HAMMER is really shaping up now. Here's what works now: all filesystem operations; all historical operations; all Pruning features". During the discussion, he was asked how he planned to support multi-master replication, in reply to which he began:

"My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster."

A Better HAMMER

Submitted by Jeremy
on February 7, 2008 - 1:32am
DragonFlyBSD

"Work continues to progress well but I've hit a few snags," noted Matthew Dillon, referring to the ongoing development of his HAMMER filesystem. He began by highlighting a number of problems with the current design, then adding, "everything else now works, and works well, including and most especially the historical access features." He continued:

"I've come to the conclusion that I am going to have to make a fairly radical change to the on-disk structures to solve these problems. On the plus side, these changes will greatly simplify the filesystem topology and greatly reduce its complexity. On the negative side, recovering space will not be instantaneous and will basically require data to be copied from one part of the filesystem to another."

Matt detailed his solution, which included getting rid of the previously described clusters, super-clusters, A-lists, and per-cluster B-Tree's, "instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem", adding that the filesystem would be implemented "as one big huge circular FIFO, pretty much laid down linearly on disk, with a B-Tree to locate and access data." He detailed the many improvements, noting that this also makes it possible to provide efficient real-time mirroring. He concluded, "it will probably take a week or two to rewire the topology and another week or two to debug it. Despite the massive rewiring, the new model is much, MUCH simpler then the old, and all the B-Tree code is retained (just extended to operate across the entire filesystem instead of just within a single cluster)."

HAMMER Balancing

Submitted by Jeremy
on January 13, 2008 - 11:09am
DragonFlyBSD

"HAMMER is progressing very well with only 3-4 big-ticket items left to do," noted DragonFlyBSD creator Matthew Dillon regarding the ongoing development of his highly available clustering filesystem, "I'm really happy with the progress I'm making". He listed "on-the-fly recovery, balancing, refactoring of the spike code, and retention policy scan" as the remaining items needing to be implemented. "everything else is now working and reasonably stable. Of the remaining items only the spike coding has any real algorithmic complexity. Recovery and balancing just require brute force and the physical record deletion. The retention policy scan needs is already coded and working (just not the scan itself)."

Matt then defined what he meant by 'spike', "basically, when a cluster (a 64MB block of the disk) fills up a 'spike' needs to be driven into that cluster's B-Tree in order to expand it into a new cluster. The spike basically forwards a portion of the B-Tree's key space to a new cluster." He added, "refactoring the spike code means doing a better job selecting the amount of key space the spike can represent." He noted that balancing refers to the act of balancing the B-Tree representation of the filesystem, "we want to slowly move physical data records from higher level clusters to lower level clusters, eventually winding up with a situation where the higher level clusters contain only spikes and lower level clusters are mostly full." Matt continued:

"Keep in mind that HAMMER is designed to handle very large filesystems... in particular, filesystems that are big enough that you don't actually fill them up under normal operation, or at least do not quickly fill them up and then quickly clean them out. The balancing code is expected to need upwards of a day (or longer) to slowly iron out storage inefficiencies. If a situation comes up where faster action is needed, then faster action can be taken. I intend to take advantage of the fact that most filesystems (and, really, any large filesystem), takes quite a while to actually become full."

HAMMER Filesystem Working

Submitted by Jeremy
on January 3, 2008 - 2:58pm
DragonFlyBSD

"HAMMER is progressing well. The filesystem basically works, but there are some major pieces missing such as, oh, the recovery code, and I still have a ton of issues to work through... the poor fs blows up when it runs out of space, for example, due to the horrible spike implementation I have right now," DragonFlyBSD creator Matthew Dillon stated. HAMMER is a new highly available clustering filesystem aimed to be of beta quality by the DragonFlyBSD 2.0 release later this month. Matt notes,

"It isn't stable yet but some major milestones have been achieved. I am able to cpdup, rm -rf, and perform historical queries on deleted data."

Matt went on to caution, "please note that HAMMER is *NOT* yet ready for wider testing. Please don't start reporting bugs yet, because there are still tons of things for me to work through."

HAMMER Filesystem Update

Submitted by Jeremy
on November 17, 2007 - 2:20am
DragonFlyBSD

"HAMMER work is still progressing well, I hope to have most of it working in a degenerate single-cluster (64MB filesystem) case by the end of next week. (cluster == 64MB block of the disk, not cluster as in clustering)," noted Matthew Dillon on the DragonFlyBSD mailing list. He continued, "gluing the per-cluster B-Tree's together for the multi-cluster case is turning out to be more of a headache and will probably take at least 2 weeks to get working. Some fairly sophisticated heuristics will be needed to avoid unnecessary copying between clusters." Matt went on to note that the next DragonFlyBSD release will likely be delayed a month:

"I may decide to move the 2.0 release to mid-January to give myself some more time. This is similar to what we did for 1.8. Also, I think a January release is better then a Christmas release because people get busy with christmas-like things. I want the filesystem to be at least beta quality as of the release and I don't think its possible to get it there by mid-December."

HAMMER B-Tree Recovery

Submitted by Jeremy
on November 7, 2007 - 5:46am
DragonFlyBSD

"Speaking of on-disk B-Trees, ReiserFS' biggest problems are all based on its use of flexible B-Trees," suggested a reader on the DragonFlyBSD Kernel mailing list, pointing to the difficulty of detecting a failed node and then of rebuilding the B-Tree. HAMMER filesystem designer and author, Matt Dillon, explained, "if a cluster needs to be recovered, HAMMER will simply throw away the B-Tree and regenerate it from scratch using the cluster's record list. This way all B-Tree I/O operations can be asynchronous and do not have to be flushed on fsync. At the same time HAMMER will remove any records whose creation transaction id's are too large (i.e. not synchronized with the cluster header), and will zero out the delete transaction id for any records whos deletion transaction id's are too large." Matt then acknowledged:

"The real performance issue for HAMMER is going to be B-Tree insertions and rebalancing across clusters. I think most of the issues can be resolved with appropriate heuristics and by a background process to slowly rebalance clusters. This will require a lot of work, though, and only minimal rebalancing will be in [the end-of-the-year] release."