Matthew Dillon

Quote: Early-Alpha State

Submitted by Jeremy
on February 2, 2008 - 8:36pm

"I will do as much work on HAMMER as I can prior to release but it will definitely be in an early-alpha state as of the release."

HAMMER Balancing

Submitted by Jeremy
on January 13, 2008 - 11:09am
DragonFlyBSD

"HAMMER is progressing very well with only 3-4 big-ticket items left to do," noted DragonFlyBSD creator Matthew Dillon regarding the ongoing development of his highly available clustering filesystem, "I'm really happy with the progress I'm making". He listed "on-the-fly recovery, balancing, refactoring of the spike code, and retention policy scan" as the remaining items needing to be implemented. "everything else is now working and reasonably stable. Of the remaining items only the spike coding has any real algorithmic complexity. Recovery and balancing just require brute force and the physical record deletion. The retention policy scan needs is already coded and working (just not the scan itself)."

Matt then defined what he meant by 'spike', "basically, when a cluster (a 64MB block of the disk) fills up a 'spike' needs to be driven into that cluster's B-Tree in order to expand it into a new cluster. The spike basically forwards a portion of the B-Tree's key space to a new cluster." He added, "refactoring the spike code means doing a better job selecting the amount of key space the spike can represent." He noted that balancing refers to the act of balancing the B-Tree representation of the filesystem, "we want to slowly move physical data records from higher level clusters to lower level clusters, eventually winding up with a situation where the higher level clusters contain only spikes and lower level clusters are mostly full." Matt continued:

"Keep in mind that HAMMER is designed to handle very large filesystems... in particular, filesystems that are big enough that you don't actually fill them up under normal operation, or at least do not quickly fill them up and then quickly clean them out. The balancing code is expected to need upwards of a day (or longer) to slowly iron out storage inefficiencies. If a situation comes up where faster action is needed, then faster action can be taken. I intend to take advantage of the fact that most filesystems (and, really, any large filesystem), takes quite a while to actually become full."

HAMMER Filesystem Working

Submitted by Jeremy
on January 3, 2008 - 2:58pm
DragonFlyBSD

"HAMMER is progressing well. The filesystem basically works, but there are some major pieces missing such as, oh, the recovery code, and I still have a ton of issues to work through... the poor fs blows up when it runs out of space, for example, due to the horrible spike implementation I have right now," DragonFlyBSD creator Matthew Dillon stated. HAMMER is a new highly available clustering filesystem aimed to be of beta quality by the DragonFlyBSD 2.0 release later this month. Matt notes,

"It isn't stable yet but some major milestones have been achieved. I am able to cpdup, rm -rf, and perform historical queries on deleted data."

Matt went on to caution, "please note that HAMMER is *NOT* yet ready for wider testing. Please don't start reporting bugs yet, because there are still tons of things for me to work through."

HAMMER Filesystem Update

Submitted by Jeremy
on November 17, 2007 - 2:20am
DragonFlyBSD

"HAMMER work is still progressing well, I hope to have most of it working in a degenerate single-cluster (64MB filesystem) case by the end of next week. (cluster == 64MB block of the disk, not cluster as in clustering)," noted Matthew Dillon on the DragonFlyBSD mailing list. He continued, "gluing the per-cluster B-Tree's together for the multi-cluster case is turning out to be more of a headache and will probably take at least 2 weeks to get working. Some fairly sophisticated heuristics will be needed to avoid unnecessary copying between clusters." Matt went on to note that the next DragonFlyBSD release will likely be delayed a month:

"I may decide to move the 2.0 release to mid-January to give myself some more time. This is similar to what we did for 1.8. Also, I think a January release is better then a Christmas release because people get busy with christmas-like things. I want the filesystem to be at least beta quality as of the release and I don't think its possible to get it there by mid-December."

HAMMER B-Tree Recovery

Submitted by Jeremy
on November 7, 2007 - 5:46am
DragonFlyBSD

"Speaking of on-disk B-Trees, ReiserFS' biggest problems are all based on its use of flexible B-Trees," suggested a reader on the DragonFlyBSD Kernel mailing list, pointing to the difficulty of detecting a failed node and then of rebuilding the B-Tree. HAMMER filesystem designer and author, Matt Dillon, explained, "if a cluster needs to be recovered, HAMMER will simply throw away the B-Tree and regenerate it from scratch using the cluster's record list. This way all B-Tree I/O operations can be asynchronous and do not have to be flushed on fsync. At the same time HAMMER will remove any records whose creation transaction id's are too large (i.e. not synchronized with the cluster header), and will zero out the delete transaction id for any records whos deletion transaction id's are too large." Matt then acknowledged:

"The real performance issue for HAMMER is going to be B-Tree insertions and rebalancing across clusters. I think most of the issues can be resolved with appropriate heuristics and by a background process to slowly rebalance clusters. This will require a lot of work, though, and only minimal rebalancing will be in [the end-of-the-year] release."

HAMMER Filesystem Progress

Submitted by Jeremy
on November 1, 2007 - 7:32pm
DragonFlyBSD

"I will be continuing to commit bits and pieces of HAMMER, but note that it will probably not even begin to work for quite some time," Matthew Dillon reported on the new clustering filesystem he's developing for DragonFlyBSD. He noted, "I am still on track for it to make it into the end-of-year release." Matt continued:

"My B-Tree implementation also allows HAMMER to cache B-Tree nodes and start lookups from any internal node rather then having to start at the root. You can do this in a standard B-Tree too but it isn't necessarily efficient for certain boundary cases. In my implementation I store boundaries for the left AND right side which means a search starting in the middle of the tree knows exactly where to go and will never have to retrace its steps."

HAMMER Performance

Submitted by Jeremy
on October 14, 2007 - 3:07am
DragonFlyBSD

"I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'," Matt Dillon explained when asked to compare his new clustering HAMMER filesystem with ReiserFS, both of which utilize BTrees to organize objects and records. He continued, "I don't expect HAMMER to be slow. A B-Tree typically uses a fairly small radix in the 8-64 range (HAMMER uses 8 for now). A standard indirect block methodology typically uses a much larger radix, such as 512, but is only able to organize information in a very restricted, linear way." He continued to describe numerous plans he has for optimizing performance, "my expectation is that this will lead to a fairly fast filesystem. We will know in about a month :-)"

Among the optimizations planned, Matt explained, "the main thing you want to do is to issue large I/Os which cover multiple B-Tree nodes and then arrange the physical layout of the B-Tree such that a linear I/O will cover the most likely path(s), thus reducing the actual number of physical I/O's needed." He noted, "HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree operations, because it doesn't need an intact B-Tree for recovery of the filesystem." He went on to describe another potential optimization allowed by the filesystem's design, "HAMMER is designed to allow clusters-by-cluster reoptimization of the storage layout. Anything that isn't optimally layed-out at the time it was created can be re-layed-out at some later time, e.g. with a continuously running background process or a nightly cron job or something of that ilk. This will allow HAMMER to choose to use an expedient layout instead of an optimal one in its critical path and then 'fix' the layout later on to make re-accesses optimal."

HAMMER Filesystem Design

Submitted by Jeremy
on October 10, 2007 - 5:51pm
DragonFlyBSD

"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.

During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholely independant (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:

"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."

Interview: Matthew Dillon

Submitted by Jeremy
on August 6, 2007 - 1:56pm

Matthew Dillon created DragonFly BSD in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."

In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.

DragonFlyBSD: 1.10 Released

Submitted by Jeremy
on August 6, 2007 - 1:40pm
DragonFlyBSD

Matthew Dillon has announced the release of DragonFly BSD 1.10, the sixth major DragonFly release since the project's creation in 2003. The release notes say "we consider 1.10 to be more stable then 1.8," and summarize some of the new features:

"Several big-ticket items are present in this release. Our default ATA driver has been switched to NATA (ported from FreeBSD). NATAs big claim to fame is support for AHCI which is the native SATA protocol standard. It is far, far better then the old ATA/IDE protocol. DragonFly now has non-booting support for GPT partitioning and 64 bit disklabels. Non-booting means we don't have boot support for these formats yet. DragonFly's Light Weight Process abstraction is now finished and working via libthread_xu but the default threading library is not quite ready to be changed from libc_r yet. All threaded programs now link against an actual 'libpthread' which is a softlink to libc_r or libthread_xu, allowing the new threading library to be tested more fully."

DragonFlyBSD: 1.10 Release Coming Soon

Submitted by Jeremy
on July 31, 2007 - 7:02pm
DragonFlyBSD

"1.10 has been branched," DragonFlyBSD creator Matt Dillon announced, noting that the official release is expected soon, "no release date has been set yet but this coming weekend is looking real good now." Among the new features of DragonFly 1.10 are improved virtual kernel support, a new disk management infrastructure, improvements to wireless networking, and support for the new syslink protocol.

DragonFlyBSD has a stable release every six months. The current development branch is numbered 1.11, with the next stable release at the end of the year numbered 2.0. The 1.10 release has been delayed about a week while some final bugs were addressed. Matt noted:

"The 1.10 release is looking a lot better now. We are basically just waiting for a new pkgsrc bootstrap kit and a little more testing. All major issues except booting a machine with a USB root with EHCI loaded have been resolved."

DragonFlyBSD: Syslink Protocol

Submitted by Jeremy
on April 27, 2007 - 3:06am
DragonFlyBSD

DragonFlyBSD founder Matthew Dillon posted an update on his syslink protocol, which he defined as "a message based protocol that can devolve down into almost direct procedure calls when two localized resources talk to each other." The syslink API will be used to talk both to local resources on the same node and to remote resources on a different node. Earlier documentation further explained the networking nature of the protocol, "the Syslink protocol is used to glue the cluster mesh together. It is based on the concept of reliable packets and buffered streams. Adding a new node to the mesh is as simple as obtaining a stream connection to any node already in the mesh, or tying into a packet switch with UDP." In another email Matthew explained how various DragonFlyBSD nodes utilize Syslink to automatically establish the optimal physical route.

In his recent email, Matthew described the latest Syslink issue he has solved, "in order to transport requests across a machine boundary (that is, outside the domain of a direct memory access), it is necessary to assign a unique identifier to the resource." He detailed how he had originally planned to rework dozens of major system structures to use the syslink API, but instead will now "rework JUST the reference counting methodology used in these resource structures." The end result is "a common ref counting API and a little structure that includes a 64 bit unique sysid, red-black tree node, the ref count, and a pointer to a resource type structure (e.g. identifying it as a vnode, vm object, or whatever). When any of the above resources are allocated, they will be indexed in a Red-Black tree. In other words it will be possible to identify every single resource in the system by traversing the red-black tree". He goes on to summarize, "and that, folks, gives us the building blocks we need to represent resources in a cluster. This also means I don't have to rewrite the APIs. Instead I can simply write new RPC APIs for accesses made via syslink ids and, poof, now all of a system's resources will become accessible remotely, with only modest effort."
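
The "little structure" described above can be sketched in a few lines. Everything named here (`SysRef`, `Registry`, `allocate`, `hold`, `drop`, `walk`) is hypothetical and chosen for illustration. The real design indexes resources in a red-black tree inside the kernel; this sketch substitutes a plain dictionary and sorts on traversal, since the Python standard library has no red-black tree.

```python
import itertools


class SysRef:
    """Hypothetical per-resource header: sysid, ref count, resource type."""

    def __init__(self, sysid, rtype, resource):
        self.sysid = sysid        # 64-bit unique identifier
        self.rtype = rtype        # e.g. "vnode" or "vm_object"
        self.resource = resource  # the structure being tracked
        self.refs = 1             # the allocator holds the first reference


class Registry:
    def __init__(self):
        self._next = itertools.count(1)
        self._index = {}  # stand-in for the red-black tree keyed by sysid

    def allocate(self, rtype, resource):
        sysid = next(self._next) & 0xFFFFFFFFFFFFFFFF  # keep ids within 64 bits
        ref = SysRef(sysid, rtype, resource)
        self._index[sysid] = ref  # every allocated resource gets indexed
        return ref

    def lookup(self, sysid):
        return self._index.get(sysid)

    def hold(self, ref):
        ref.refs += 1

    def drop(self, ref):
        ref.refs -= 1
        if ref.refs == 0:
            del self._index[ref.sysid]  # last reference gone: remove from index

    def walk(self):
        # Ordered traversal, as a tree keyed by sysid would provide.
        return [self._index[s] for s in sorted(self._index)]


registry = Registry()
vnode = registry.allocate("vnode", "/etc/passwd")
```

A remote request would carry only `vnode.sysid`; the receiving side calls `lookup` to find the local structure, which is the sense in which a unique identifier lets a request cross a machine boundary.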

DragonFlyBSD: Designing a Highly Available Clustering Filesystem

Submitted by Jeremy
on February 27, 2007 - 7:43pm
DragonFlyBSD

Matt Dillon posted the design synopsis of a new highly available clustered filesystem he will soon begin writing for DragonFlyBSD. The feature summary at the beginning of his document included, "on-demand filesystem check and recovery; infinite snapshots; multi-master operation, including the ability to self-heal a corrupted filesystem by accessing replicated data; infinite logless replication, meaning that replication targets can be offline for 'days' without effecting performance or operation; 64 bit file space, 64 bit filesystem space, no space restrictions whatsoever; reliably handles data storage for huge multi-hundred-terrabyte filesystems without fear of unrecoverable corruption; cluster operation, provides the ability to commit data to locally replicated store independantly of other replication nodes, with access governed by cache coherency protocols; independant index, data is laid out in a highly recoverable fashion, independant of index generation, and indexes can be regenerated from scratch and thus indexes can be updated asynchronously." He then goes into detail on each of these points and many more, explaining how he intends to implement the new filesystem.

The new filesystem is currently unnamed, though Matt noted, "it doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-)" It was suggested that this could mean, "high-availability multi-master extra reliable file system", though Matt was not impressed with this. Another proposed idea that Matt liked was HACFS, or "High-Availability Clustered File System".

DragonFlyBSD: New Version Numbering Scheme

Submitted by Jeremy
on April 1, 2005 - 6:46am
DragonFlyBSD

Matt Dillon decided on an official version numbering scheme for DragonFlyBSD releases. After first ruling out the use of dates in release numbers, he settled on using odd numbers to denote a work in progress, and even numbers to denote releases. For example, 1.0, 1.2, 1.4, and so on would be considered releases, whereas 1.1, 1.3, 1.5, and so on would be considered works in progress.

Four tags will also be used, -CURRENT, -WORKING, -RELEASE and -STABLE. The -CURRENT tag indicates "a build based on the head of the CVS tree." The -WORKING tag indicates "a build based on our current stable tag". The -RELEASE tag indicates "a build based on a release branch." And the -STABLE tag indicates "a build based on a post-release branch." Matt adds, "you can probably see why I am also using odd/even numbering... so people can just glance at the number to get an idea of the relative time frame without necessarily understanding what all the keywords mean." Following this scheme, the next stable release will be DragonFly 1.2-RELEASE.
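
The odd/even rule is simple enough to state as a short function. The helper name `classify` is made up for this illustration, and it assumes a version string of the form "major.minor" with an optional tag such as "-RELEASE".

```python
def classify(version):
    """Classify a 'major.minor' version string by the odd/even rule."""
    number = version.split("-")[0]  # drop a tag such as -RELEASE or -CURRENT
    major, minor = (int(part) for part in number.split(".")[:2])
    return "release" if minor % 2 == 0 else "work in progress"


for v in ("1.2-RELEASE", "1.3", "1.10", "1.11", "2.0"):
    print(v, "->", classify(v))
```

Note that the minor number is compared as an integer, so 1.10 counts as an even-numbered release rather than being read as "1.1".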

DragonFly: I/O Consolidation and Direct-to-DMA Plans

Submitted by njc
on December 28, 2004 - 10:10am
DragonFlyBSD

Matt Dillon provides an interesting and detailed explanation of future development plans with regard to DragonFly's I/O subsystem. Originally inspired by the PIPE code improvements of FreeBSD's Alan Cox, and demonstrated in DragonFly's unique XIO and MSFBUF APIs, the goal of this work is to avoid KVA mappings for I/O requests and the resulting overhead of interprocessor interrupts in SMP systems. In theory, this yields high performance by combining efficient I/O with the ability of any subsystem layer to transfer data to busdma with zero memory-to-memory copies. Matt expands:

"What we are going to do is extend the msf_buf abstraction to cover these needs and provide a set of API calls that allows upper layers to supply data in any form and lower level layers to request data in any form, including with address restrictions. msf_buf's already have a page-list (XIO) and KVA mapping abstraction. We are going to add a bounce-buffer abstraction and then work on a bunch of new API calls for msf_bufs to cover the needs of various subsystems."

There appears to be a lot of interesting work going on in DragonFly; Matt's full post has the details.