Re: [Tux3] Comparison to Hammer fs design

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: <kernel@...>
Date: Friday, July 25, 2008 - 6:54 pm

(resent after subscribing to the the kernel@crater)

On Friday 25 July 2008 11:53, Matthew Dillon wrote:

How about a cross post?
 

Yes, that is the main difference indeed, essentially "log everything" vs
"commit" style versioning.  The main similarity is the lifespan oriented
version control at the btree leaves.


Once I get down to the leaf level I binary search on logical address in
the case of a file index btree or on version in the case of an inode
table block, so this cost is still Log(N) with a small k.  For a
heavily versioned inode or file region this might sometimes result in
an overflow block or two that has to be linearly searched which is not
a big problem so long as it is a rare case, which it really ought to be
for the kinds of filesystem loads I have seen.  A common example of a
worst case is /var/log/messages, where the mtime and size are going to
change in pretty much every version, so if you have hourly snapshots
and hold them for three months it adds up to about 2200 16 byte inode
table attributes to record, about 8 inode table leaf blocks.  I really
do not see that as a problem.  If it goes up to 100 leaf blocks to
search then that could be a problem.

Usually, I will only be interested in the first and last of those
blocks, the first holding the rarely changing attributes including the
root of the file index btree and the last holding the most recent
incarnation of the mtime/size attribute.

The penultimate inode table index block tells me how many blocks a
given inode lives in because several blocks will have the same inum
key.  So the lookup algorithm for a massively versioned file becomes:
1) read the first inode table block holding that inum; 2) read the last
block with the same inum.  The latter operation only needs to consult
the immediate parent index block, which is locked in the page cache at
that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort.  For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a problem.


By far the most common case I would think.  But check out the versioned
pointer algorithms.  Surprisingly that just works, which is not exactly
obvious:

   Versioned pointers: a new method of representing snapshots
   http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is dumb
because there will never be any file data shared by a successor version
in that case.  So the thing to do is just create an entirely new
versioned file data attribute for each rewrite, bulking up the inode
table entry a little but greatly constraining the search for versions
to delete and reducing cache pressure by not loading unrelated version
data when traversing a file.


...which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone.  Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table).  I rely on two things to
make this nice: 1) Most dirent changes will be logically logged and
only rolled up into the versioned file blocks when there are enough to
be reasonably sure that each changed directory block will be hit
numerous times in each rollup episode.  (Storing the directory blocks
in dirent-create order as in PHTree makes this very likely for mass
deletes.)  2) When we care about this is usually during a mass delete,
where most or all dirents in each directory file block are removed
before moving on to the next block.


Again, check out the versioned pointer algorithms.  You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address.  Maybe the hang is
that you do not organize the btrees by logical address (or inum in the
case of the inode table tree).  I thought you did but have not read
closely enough to be sure.


Fair enough.  I have an entirely different approach to what you call
mirroring and what I call delta replication.  (I reserve the term
mirroring to mean mindless duplication of physical media writes.)  This
method proved out well in ddsnap:

   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149...

(Sorry about the massive URL, you can blame Linus for that;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical.  The same idea works for a
versioned filesystem: compute all the leaf data that differs between
two versions, per inum, and apply the resulting delta to the
corresponding inums in the remote replica.  The main difference vs a
ddsnap volume delta is that not all of the changes are physical blocks,
there are also changed inode attributes, so the delta stream format
has to be elaborated accordingly.


I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to perform
updating of the btree nodes in bulk.  I am pretty sure this will work
for you as well, and you might want to check out that forward logging
trick.

That reminds me, I was concerned about the idea of UNDO records vs
REDO.  I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond the
associated UNDO records.  Otherwise, if you acknowledge, crash, and
prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and be wrong
about that.  I have no doubt you did the right thing there, but it is
not obvious from your design documentation.


Yes, I mostly have 16 byte elements and am working on getting most of
them down to 12 or 8.


I use two-stage lookup, or four stage if you count searches within
btree blocks.  This makes the search depth smaller in the case of small
files, and in the case of really huge files it adds depth exactly as
appropriate.  The index blocks end up cached in the page cache (the
buffer cache is just a flavor of page cache in recent Linux) so there
is little repeated descending through the btree indices.  Instead, most
of the probing is through the scarily fast radix tree code, which has
been lovingly optimized over the years to avoid cache line misses and
SMP lock contention.

I also received a proposal for a "fat" btree index from a collaborator
(Maciej) that included the file offset but I really did not like the...
fattening.  A two level btree yields less metadata overall which in my
estimation is more important than saving some bree probes.  The way
things work these days, falling down from level 1 cache to level 2 or
from level 2 to level 3 costs much more than executing some extra CPU
instructions.  So optimization strategy inexorably shifts away from
minimizing instructions towards minimizing cache misses.


Fortunately, we get that for free on Linux, courtesy of the page cache
radix trees :-)

I might eventually add some explicit cursor caching, but various
artists over the years have noticed that it does not make as much
difference as you might think.


Indeed.  But Linux is braindamaged about large block size, so there is
very strong motivation to stay within physical page size for the
immediate future.  Perhaps if I get around to a certain hack that has
been perenially delayed, that situation will improve:

   "Variable sized page objects"
   http://lwn.net/Articles/37795/


Topological bloat?


Ah, that is a very nice benefit of Tux3-style forward logging I forgot
to mention in the original post: transaction size is limited only by
available free space on the volume, because log commit blocks can be
anywhere.  Last night I dreamed up a log commit block variant where the
associated transaction data blocks can be physically discontiguous, to
make this assertion come true, shortly after reading Stephen Tweedie's
horror stories about what you have to do if you must work around not
being able to put an entire, consistent transaction onto stable media
in one go:

   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if:

  a) File data is always committed atomically
  b) All commit transactions are full transactions

He did remind me (via time travel from year 2000) of some details I
should write into the design explicitly, for example, logging orphan
inodes that are unlinked while open, so they can be deleted on replay
after a crash.  Another nice application of forward logging, which
avoids the seek-happy linked list through the inode table that Ext3
does.


Good point.  I expect Tux3 will eventually have a reblocker (aka
defragger).  There are some really nice things you can do, like:

  1) Set a new version so logical changes cease for the parent
     version.

  2) We want to bud off a given directory tree into its own volume,
     so start by deleting the subtree in the current version.  If
     any link counts in the directory tree remain nonzero in the
     current version then there are hard links into the subtree, so
     fail now and drop the new version.

  3) Reblock a given region of the inode table tree and all the files
     in it into one physically contiguous region of blocks

  4) Add a free tree and other once-per volume structures to the new
     home region.

  5) The new region is now entirely self contained and even keeps its
     version history.  At the volume manager level, map it to a new,
     sparse volume that just has a superblock in the zeroth extent and
     the new, mini filesystem at some higher logical address.  Remap
     the region in the original volume to empty space and add the
     empty space to the free tree.

  6) Optionally reblock the newly budded filesystem to the base of the
     new volume so utilties that do not not play well with sparse
     volumes do not do silly things.


Cavalier was a poor choice of words, the post was full of typos as well
so I am not proud of it.  You are solving a somewhat different problem
and you have code out now, which is a huge achievement.  Still I think
you can iteratively improve your design using some of the techniques I
have stumbled upon.  There are probably some nice tricks I can get from
your code base too once I delve into it.


Thankyou very much.  I think you are on the right track too, which you
have a rather concrete way of proving.


The big ahas! that eliminated much of the complexity in the Tux3 design
were:

  * Forward logging - just say no to incomplete transactions
  * Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation of
Tux3 is that I can now background the LVM3 design process and see what
Tux3 really wants from it.  I am determined to match every checkbox
volume management feature of ZFS as efficiently or more efficiently,
without violating the traditional layering between filesystem and
block device, and without making LVM3 a Tux3-private invention.


Hopefully not too much later.  I find this dialog very fruitful.  I just
wish such dialog would occur more often at the design/development stage
in Linux and other open source work instead of each group obsessively
ignoring all "competing" designs and putting huge energy into chatting
away about the numerous bugs that arise from rushing their design or
ignoring the teachings of history.

Regards,

Daniel
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [Tux3] Comparison to Hammer fs design, Matthew Dillon, (Fri Jul 25, 2:53 pm)
Re: [Tux3] Comparison to Hammer fs design, Daniel Phillips, (Fri Jul 25, 6:54 pm)