HAMMER Approaches Alpha Status

Submitted by Jeremy
on March 25, 2008 - 6:49am

Matthew Dillon posted an update on his evolving HAMMER filesystem, noting that it "passes all standard filesystem stress tests and buildworld will run with a HAMMER /usr/obj". He also noted, "pruning and reblocking code is in and partially tested, but now needs more stringent testing; full historical access appears to be working but needs testing." He added, "there are two big-ticket and several little-ticket items left. HAMMER will officially go Alpha when the big-ticket items are done, and beta when we get a few of the little-ticket items done." The two "big-ticket" items left to be completed are UNDO crash recovery code, and handling for full filesystems. Matt summarized:

"I have no time frame for these items yet. It will depend on how quickly HAMMER moves to Alpha and Beta status. I will say, however, now that HAMMER's on-disk format has solidified, that I have a very precise understanding of the protocols that will be needed to accomplish fully cache coherent remote access for both replicated and non-replicated (remote mount style) access. And, as you know, fully coherent filesystem access across machines is going to be the basis for DragonFly's clustering across said machines. In summary, things are progressing very well."


From: Matthew Dillon
Subject: HAMMER update 23-Mar-08
Date: Mar 23, 7:57 pm 2008

Here's an update on the HAMMER work!

    Current status:

    * Passes all standard filesystem stress tests and buildworld will
      run with a HAMMER /usr/obj.  Large histories are able to accumulate
      without affecting performance.

    * Pruning and reblocking code is in and partially tested, but now needs
      more stringent testing.

    * Full historical access appears to be working but needs testing.
      Note that a sync is still needed to flush dirty cached data prior
      to acquiring a timestamp for the 'snapshot' to be set in stone.
      (dirty data cached in-memory has no historical tags and must be
      committed to physical disk before it can be accessed historically).


    Current bugs:

    * There is one known bug in the standard operations code paths
      that results in an assertion in HAMMER's I/O subsystem.

    * There are probably bugs in the reblocking and/or pruning code.  More
      likely in the reblocking code.


    There are two big-ticket and several little-ticket items left.  HAMMER
    will officially go Alpha when the big-ticket items are done, and beta
    when we get a few of the little-ticket items done.

    Big ticket items left:

    * UNDO (crash recovery) code.  Currently it writes out undo records but
      they are not yet sequenced, buffer writes are not yet ordered, and
      there is no mount-time recovery code yet.

      This is the last item needed before HAMMER can go operational.

    * Filesystem full handling.  Currently no space is reserved for dirty
      cached data so it is possible to create/write files and for HAMMER
      to not have sufficient space left on-disk to flush it.


    Little ticket items:

    * Automated reblocking (currently these functions are manually
      initialized via the hammer utility).

    * I/O clustering and preliminary BMAP op when writing out large files.

    * CRC checking (CRC fields are reserved but not entirely generated yet
      and not yet checked at all).

    * Disaster Recovery filesystem scan.

    * Boot support.

    I expect all of these items and more to be handled by the 2.0 release
    in July.

			    Additional HAMMER capabilities
				(no timeline yet)

    * Adding, removing, and resizing a HAMMER filesystem's backing store.

			Ultimate Goals and working towards them
				(no timeline yet)

    Our ultimate goal with HAMMER and DragonFly in general is to support
    fully cache coherent replication in a multi-machine environment.  This
    involves several steps and networking protocols.

    * Replication of synchronization streams based on the UNDO log.  If
      resynchronizing to a target which is too old a B-Tree scan will
      likely be required.

    * Cache coherency protocols for machine-machine coherency for both
      replicated and remote-HAMMER access.

    I have no time frame for these items yet.  It will depend on how quickly
    HAMMER moves to Alpha and Beta status.  I will say, however, now
    that HAMMER's on-disk format has solidified, that I have a very precise
    understanding of the protocols that will be needed to accomplish fully
    cache coherent remote access for both replicated and non-replicated
    (remote mount style) access.

    And, as you know, fully coherent filesystem access across machines is
    going to be the basis for DragonFly's clustering across said machines.

    In summary, things are progressing very well.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From: Thomas E. Spanjaard
Subject: Re: HAMMER update 23-Mar-08
Date: Mar 24, 6:19 pm 2008

Matthew Dillon wrote:
>     * Full historical access appears to be working but needs testing.
>       Note that a sync is still needed to flush dirty cached data prior
>       to acquiring a timestamp for the 'snapshot' to be set in stone.
>       (dirty data cached in-memory has no historical tags and must be
>       committed to physical disk before it can be accessed historically).

Wouldn't making timestamp queries (at least from userland) enforce a 
sync on the volume in question be useful here?
-- 
         Thomas E. Spanjaard
         tgen@netphreax.net

From: Matthew Dillon
Subject: Re: HAMMER update 23-Mar-08
Date: Mar 24, 8:02 pm 2008

:Wouldn't making timestamp queries (at least from userland) enforce a
:sync on the volume in question be useful here?
:--
:         Thomas E. Spanjaard
:         tgen@netphreax.net

    Making the 'hammer now' command do a sync() is a good idea.  I will
    make that change right now so it doesn't get lost.

    Here's a general overview of the issues involved with having historical
    access to the filesystem:

			    ----------------

    Recording the timestamps in the in-memory cache, for a finer-grained
    snapshot capability, is doable but has its own issues.  Here's an
    illustration:

	open()		create file
	write()		append 4K (file size now 4K)
	write()		append 4K (file size now 8K)
	write()		append 4K (file size now 12K)
	write()		append 4K (file size now 16K)

    Now NONE of this has gone to disk yet, it's entirely in the in-memory
    cache.  The inode is in the in-memory cache.  The data is stored in the
    buffer cache.  Even the directory entry for the file that we just
    created is still in the in-memory cache (HAMMER caches the raw records
    it intends to commit later on).

    If I wanted to be able to acquire a timestamp between each write and
    'see' a snapshot of the file as of any point in the above sequence,
    then every write would also have to allocate a copy of the inode
    (because it changes size on each write).

    The data has the same problem though with a slightly different example.
    Let's say each write() was a seek-write, overwriting the previous data.
    Now with every write() I would have to allocate a copy of the data
    being overwritten.  This is complicated by the fact that the buffer
    cache has no clue about 'historical' accesses, so I would not be able
    to use the buffer cache to cache the data.

    There's also another problem and that is with the efficiency of the
    topology on-disk.  Even if I maintained all the copies of the inode and
    all the copies of the data in-memory, I would still have to sync all
    those copies to disk in order for things to remain historically
    coherent (whether it be in-cache or on-disk).  This would result in
    hundreds or even thousands of copies of the inode on-disk, not to
    mention potentially many copies of the data.

    I just don't want to do that right now, at least not as a default.  A
    lot of performance would be lost.  Hence a sync() is needed if you want
    to create a demark which you can accurately snapshot.

			    -------------

    Here's a quick synopsis of how the cache would operate in a clustered
    filesystem:

    In order to properly integrate with in-memory caches, a wider cache
    coherency infrastructure is needed between machines such that
    modifications made on one machine proactively invalidate those portions
    of the cache(s) on other machines.

    At the same time, any 'dirty' cache data, for example when a file is
    created or written to, must lock the cache space in question on all
    other machines.  The cache space in this case is not just the file
    data, but also the related namespaces (for creations, deletions, and
    renames).

    Attempts to access locked spaces from other machines in the cluster
    would have to force a flush to the filesystem backing store and lower
    the cache states for the affected information on the original machine
    from dirty to shared-read-only.

    It will be easiest to integrate the cache coherency information into
    the buffer cache and namecache themselves.  Once a machine has dirtied
    an in-memory cache element... for example part of the namespace when
    creating a file or chunks of data written within a file, that machine
    must have a free hand to make further modifications to the cache spaces
    involved without further interaction with other machines.

			    -------------

    Now, if you think of those two major elements you can see that they
    actually fit together quite well.  If I were to attempt to maintain
    transactional coherency on a per-system-call basis then the cache
    granularity between machines would have to be much, much smaller than
    our current in-memory caching elements provide.  That would become a
    really nasty coding problem.

    So I don't even want to begin to contemplate transactional coherency at
    a finer grain than sync() or fsync() until long after we actually have
    clustering working.

					-Matt

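The clustered-cache synopsis in the message above can be read as a small state machine. The following Python sketch is only an illustration of that description, not HAMMER or DragonFly code: the `Cluster` class, the `State` names, and the single-element model are all assumptions made for the example. It shows a machine dirtying a cache element (which invalidates the other machines' copies), modifying it again without further interaction, and being forced to flush and drop to shared-read-only when another machine accesses the element.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared-read-only"
    DIRTY = "dirty"


class Cluster:
    """Hypothetical toy model of ONE cache element shared by several machines."""

    def __init__(self, machines):
        self.state = {m: State.INVALID for m in machines}
        self.cached = {}            # machine -> value held in its in-memory cache
        self.backing_store = None   # what has reached the filesystem backing store

    def write(self, machine, value):
        # If another machine holds the element dirty, it must flush first.
        self._flush_other_owner(machine)
        # Dirtying the element invalidates every other machine's copy.
        for other in self.state:
            if other != machine:
                self.state[other] = State.INVALID
                self.cached.pop(other, None)
        # The writer now owns the element and can keep modifying it locally.
        self.state[machine] = State.DIRTY
        self.cached[machine] = value

    def read(self, machine):
        if self.state[machine] is State.INVALID:
            # Accessing a space dirtied elsewhere forces a flush there.
            self._flush_other_owner(machine)
            self.cached[machine] = self.backing_store
            self.state[machine] = State.SHARED
        return self.cached[machine]

    def _flush_other_owner(self, requester):
        for owner, st in self.state.items():
            if owner != requester and st is State.DIRTY:
                self.backing_store = self.cached[owner]
                self.state[owner] = State.SHARED   # dirty -> shared-read-only


cluster = Cluster(["a", "b"])
cluster.write("a", "v1")
cluster.write("a", "v2")                  # no interaction with "b" required
on_disk_before = cluster.backing_store    # nothing has been flushed yet
seen_by_b = cluster.read("b")             # forces "a" to flush and downgrade
```

In this model the only cross-machine traffic happens on the first write and on the first foreign read, which is the coarse granularity the message argues for: coherency at the level of a flush rather than of each system call.
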
From: Petr Janda
Subject: Re: HAMMER update 23-Mar-08
Date: Mar 24, 9:16 pm 2008

So is it needed to run hammer now in order to "create" a snapshot? What would
I do in situation like this: got a hammer filesystem and couple of the files
change on day to day basis. Then a week later I needed to access one of the
files, in exactly the state they were 7 days ago.

Cheers,
Petr

From: Matthew Dillon
Subject: Re: HAMMER update 23-Mar-08
Date: Mar 24, 10:26 pm 2008

:So is it needed to run hammer now in order to "create" a snapshot? What would
:I do in situation like this: got a hammer filesystem and couple of the files
:change on day to day basis. Then a week later I needed to access one of the
:files, in exactly the state they were 7 days ago.
:
:Cheers,
:Petr

    No, you do not have to run 'hammer now' to create a snapshot.  The
    kernel syncs all filesystems every 30 seconds, so if you do nothing at
    all you get a snapshot granularity of 30 seconds.

    Where you would use 'hammer now' is if you wanted the most current
    snapshot possible for the purpose of, say, backing up your filesystem
    to another machine.  You might do something like this:

	set timestamp = `hammer now`
	cpdup /mountpoint/@@$timestamp targethost:/somepath

    But if you didn't care about that you could just go back far enough
    that you get a stable historical view... e.g. go back 1 minute and you
    would have a stable view into your filesystem.

	set timestamp = `hammer stamp 60s`	<------ doesn't sync
	cpdup /mountpoint/@@$timestamp targethost:/somepath

    Ultimately the idea of managing filesystems this way is to still do
    regular backups from your production machine to your backup machine
    (ultimately by way of replication), with both running HAMMER, but only
    retain a limited amount of history on the production box.

    You might desire to retain only one week's worth of history on the
    production box, retain one month's history on your local backup box,
    and retain a very granular one year's worth of history on your remote
    backup box.  Come to think of it, I should add some more directives to
    the 'hammer prune' command to make that easier to specify.

    Until I implement a live replication 'feed' the minimum granularity on
    the backup box will be how often you do your backups (e.g. once a day),
    and you can prune it into more granular forms from that starting point.
    Once we have a live replication feed the backup box will have the same
    30-second granularity that the production machine has.

    A major bullet point for this style of management is that the retention
    policy on the various boxes can be different even though they are all
    slaved off the same production filesystem.

					-Matt
					Matthew Dillon
					<dillon@backplane.com>
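The snapshot behaviour discussed in this thread (history exists only for what has been synced, so an as-of lookup returns the newest committed version at or before the requested timestamp) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HAMMER's on-disk format or API: the `ToyHistoryFile` class, its methods, and the integer timestamps are all hypothetical.

```python
import bisect


class ToyHistoryFile:
    """Hypothetical model: writes stay in memory until sync() commits them."""

    def __init__(self):
        self.dirty = None         # latest uncommitted contents, if any
        self.commit_times = []    # sorted timestamps of committed versions
        self.versions = []        # contents committed at the matching timestamp

    def write(self, data):
        # An overwrite replaces the in-memory copy; no history is kept here.
        self.dirty = data

    def sync(self, now):
        # Only what is dirty at sync time becomes a historical version.
        if self.dirty is not None:
            self.commit_times.append(now)
            self.versions.append(self.dirty)
            self.dirty = None

    def as_of(self, timestamp):
        # Newest committed version at or before the timestamp, else None.
        i = bisect.bisect_right(self.commit_times, timestamp)
        return self.versions[i - 1] if i else None


f = ToyHistoryFile()
f.write("A")                    # t=5
f.write("B")                    # t=10, replaces "A" in memory
f.sync(now=30)                  # periodic sync: only "B" reaches disk
f.write("C")                    # t=40, still dirty
view_at_35 = f.as_of(35)
view_at_45 = f.as_of(45)        # "C" is not visible until the next sync
f.sync(now=45)                  # explicit sync before taking the timestamp
view_after_sync = f.as_of(45)
```

The intermediate contents "A" never become part of the history, which mirrors the trade-off described in the thread: finer-than-sync granularity would require keeping a copy for every write. The explicit `sync` before the last lookup plays the role the thread proposes for 'hammer now', while looking far enough into the past (as with the 60-second example) needs no sync at all.
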

Icon

andy
on
March 25, 2008 - 11:49am

Shouldn't the icon for this article be the dragonfly and not Tux?

Fixed, thanks!

andy
on
March 25, 2008 - 5:10pm

Fixed, thanks!

Port?

Anonymous (not verified)
on
March 25, 2008 - 5:25pm

Port it to Linux?

Re: Port it to Linux?

Anonymous (not verified)
on
March 25, 2008 - 6:32pm

"Port it to Linux?"

First off, why? Linux already has clustering filesystems, and if I recall correctly, Hammer relies on quite a few DragonFly specific features (most of which have rough equivalents in Linux, tho I'd imagine that the differences are such that it'd be an absolute nightmare to port, if you could even convince Linus etc to incorporate it).

Let's not even get into the fact that Hammer isn't even remotely finished yet, making the idea of porting it at all for the moment a silly proposition.

IMHO, Hammer is something that's best left to DragonFly, as part of their ongoing efforts to create a native SSI clustering BSD. If you want to use Hammer, use DragonFly, if you want to use GFS, use Linux.

I have been following the

Anonymous (not verified)
on
March 26, 2008 - 8:45pm

I have been following the technical aspects loosely, but I assume that, when all is said and done, the on-disk structure should be fairly straightforward (compared to some Linux and vendor-hyped filesystems) and even a reimplementation would not be difficult. I don't think any extremely special DragonFly features are instrumental, but if they are, aren't some of those what Linux reimplements per-module anyway?

It's designed to make the SSI stuff doable, but that doesn't mean you'd need any of that to read and write it locally.

I heartily agree that it is extremely early to worry about porting, but once it's out of beta I don't see why it couldn't be a nice low-cruft general purpose FS.

"It's designed to make the

Anonymous (not verified)
on
March 26, 2008 - 9:07pm

"It's designed to make the SSI stuff doable, but that doesn't mean you'd need any of that to read and write it locally."

D'oh. I'm laughing at myself for not having this very simple fact in mind. Thanks ;^)

The story summary is a bit

Anonymous (not verified)
on
March 25, 2008 - 5:43pm

The story summary is a bit misleading:

The two "big-ticket" items left to be completed are UNDO crash recovery code, and handling for full filesystems. Matt summarized:
"I have no time frame for these items yet. It will depend on how quickly HAMMER moves to Alpha and Beta status. [...]"

The big-ticket items DO have a time frame (DFBSD 2.0, around June); it's the "Additional" and "Ultimate" goals that have no time frame yet and will be addressed only after reaching Alpha/Beta status.

Dumb question...

Anonymous (not verified)
on
March 26, 2008 - 6:34pm

I know it's a dumb*ss question, but is Hammer intended to replace UFS in DF?

I believe the intention is

Anonymous (not verified)
on
March 26, 2008 - 8:48pm

I believe the intention is for it to be good enough that there'd be no reason to use UFS.

Because these are BSD people, it won't be forced on users until there's a good chance it's as stable. ;)
