On Sunday 27 July 2008 14:31, Matthew Dillon wrote:
So I do not want users to fixate on that detail. The mount option
allows them to choose between "fast but theoretically riskier" and
"warm n fuzzy zero risk but not quite so fast". If the example of Ext3
is anything to go by, almost everybody chooses the "ordered data" mode
over the "journal data" mode given the tradeoff that the latter is
about 30% slower but offers better data integrity for random file
rewrites.
The tradeoff for the option in Tux3 will be maybe 1 - 10% slower in
return for a miniscule reduction in the risk of a false positive on
replay. Along the lines of deciding to live underground to avoid the
risk of being hit by a meteorite. Anyway, Tux3 will offer the option
and everybody will be happy.
Actually implementing the option is pretty easy because the behavior
variation is well localized.
Yes, a continuous running space, not preallocated. The forward log
will insert itself into any free blocks that happen to be near the
the transaction goal location. Such coopted free space will be
implicitly unavailable for block allocation, slightly complicating the
block allocation code which has to take into consideration both the
free space map and all the outstanding log transactions.
Logical logging is not idempotent by nature because of the uncertainty
of the state of the object edited by a logical operation: has the
operation already been applied or not? If you know something about the
structure of the target you can usually tell. Suppose the operation is
a dirent create. It has been applied to the target if the direct
already exists, otherwise not. I do not like this style of special
case hacking to force the logical edit to be idempotent.
Instead, I choose to be sure about the state of the target object by
introducing the rule that after a logical log operation has been
generated, nothing is allowed to write to the object being edited. The
logical operation pins the state of the disk image of target object.
The object stays pinned until the logical log entry has been retired
by a physical commit to the object that updates the object and retires
the logical edit in one atomic log transaction.
I was working on a specific code example of this with pseudocode and
all disk transfers enumerated, but then your email arrived. Well after
this response the example will probably be better.
Tux3 never backs anything out, so there is some skew to clear up.
The logical and physical log operations are indeed bound together at
both the disk image and the processor level, but not in the way that
you might expect. The primary purpose of rolling up logical log
entries into physical updates is to control resource consumption;
the secondary purpose is to reduce the amount of replay required to
reconstruct "current" memory images of the btree blocks on reboot or
remount.
In the first case, resource exhaustion, a high level vfs transaction
may have to wait for a rollup to finish, cleaning a number of dirty
buffer cache blocks and releasing resources needed for the rollup,
such as bio structs and lists of transaction components.
Such blocking is nominally enforced by the VMM by sending the
requesting process off to scan for and try to free up some dirty
memory (a horribly bad design idea in Linux that causes recursive
calls into the filesystem, but that is the way it works) during which
time the process is effectively blocked. Tux3 will add a more
responsible level of resource provisioning by limiting the number of
transactions that can be in flight, much as Ext3 does. This allows
the filesystem to fullfill a promise like "never will use up all of
kernel reserve memory after having been given privileged access to
it in order to clean and recover some dirty cache blocks".
In the second case, keeping the replay path short, high level
operations can proceed in parallel with physical updates, because
a transaction is _logically_ completed as soon as committing the
associated logical edits to disk has completed.
So high level operations block on logical commit completions, not on
physical commit completions except in the case of resource exhaustion.
If the user writes:
rm a/b/c&
rm a/b&
rm a&
Then they deserve what they get, which is a probable complaint that a
directory does not exist. If the operations happen to execute in a
favorable order then they will all succeed. With Tux3 this is possible
because destroying the 1TB file can proceed asynchronously while the VFS
transaction completes and returns as soon as a commit containing
['unlink', inum_a/b/c, dnum_a/b]
['destroy', inum_a/b/c]
has been logged. So deleting the 1TB file can be very fast, although
it may be some time before all the occupied disk space becomes
available for reuse.
To make the deletion visible to high level filesystem operations, some
cached disk blocks have to be updated. For a huge deletion like this
it makes sense to invent a versioned "orphan" attribute, which is added
to the inode table leaf, meaning that any data lookups arriving from
still-open files should ignore the file index tree entirely.
True, but that only pins one commit block on disk.
I do not think I have that problem because recovering the space used
by the big file can proceed incrementally: free a few blocks from the
inode index leaves; commit the revised file index blocks and free tree
blocks; repeat as necessary. After a crash, find the logical inode
destroy log entry and resume the destroy.
Yes indeed. A matter of totalling all that up, which is fortunately
bounded to a nice low number. Then in grand Linux tradition, ignore
the totals and just rely on there being "megabytes" of reserve memory
to handle the worst case. Obviously a bad idea, but that is how we
have done things for many years.
It sounds like a good idea, I will ponder.
I hope that question is cleared up now.
The presence of a "link" record in the logical log implies a link count
in addition to whatever is recorded in the on-disk inode table leaf.
On the other hand, the "current" image of the inode table leaf in the
buffer cache has the correct link count when the sys_link returns.
It depends on how high the density of such holes is. I will try to keep
it down to a few per megabyte. The nice think about having some one
block holes strewn around the disk is, there is always a nearby place
for a commit block to camp out for a while.
I am hoping not to mess that up ;-)
The thing about physical logging in Tux3 is, the logged data blocks are
normally immediately linked into the index blocks as actual file data
or btree metadata blocks, so they are only written once, and they are
hopefully written near where the rest of the structure lives. It is
only in the optional "update in place" style of physical logging that
the commit data blocks are written twice, which will be no worse than a
traditional journal and probably a lot better because of not needing to
seek to the journal region.
Nice, now I understand. But do you not have to hold all filesystem
transactions that depend on the modification until the btree node has
completed writing to disk? With logical logging you only have to wait
on the logical commit to complete, into which may be packed a number of
other changes for efficiency.
As described above, it is not how I do it.
I think I see it, but I have my doubts because you have to block
transactions waiting for the up to date copy of the btree data to land
on disk. Either that, or you may give userspace the impression that
some transaction has gotten onto stable storage when that is not the
case.
If you do in fact block transactions until the current image of the
btree leaf has been written out then the blocking behavior is no
better than REDO style, and REDO can batch together many updates to
different logical blocks into a single commit block, which seems like
fewer transfers overall.
I hope that is cleared up now.
I am gravitating towards that style for Tux3's commit-related short
term allocations.
It is not Linux that perpetrates this outrage, it is NVFS v2. We can't
just tell everybody that their NFS v2 clients are now broken.
Which was my original proposal to solve the problem. Then Ted told me
about NFS v2 :-O
Actually, NFS hands you a 62 bit cookie with the high bits of both s32
parts unused. NFS v2 gives you a 31 bit cookie. Bleah.
Yes, I noticed that. Check out dx_hack_hash:
http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38
It distributed hashes of ascii strings very well for me, with few
funnels. It way outperformed some popular hashes like TEA. Ted's cut
down crytographic hash is yet more uniform but costs much more CPU.
It affects the number of probes you have to do for the lookup.
Binsearching hits one cache line per test while each node lookup hits a
lot more.
It is not just the cache footprint but the time it takes to get your
key tables into L1 cache. To be sure, we are talking about small
details here, but these small details can add up to a difference of a
factor of two in execution speed. Which maybe you don't care much about
now that you are orders of magnitude faster than what came before ;-)
Following my prescription above, the file will be full of zeros when
re-extended because the data pointers were removed. But I think you
are right, truncate really is a delete and I actually argued myself
into understanding that.
Yes, I have thought about it a little more, and I imagine something
like a small array of dirty bits that climb up the btree with the help
of logical logging, where each bit means that there is something
interesting for some corresponding replication target to look at,
somewhere down in the subtree.
The Linux dentry cache actually implements proper namespace semantics
all by itself without needing to flush anything, which is what Ramfs
is. Ramfs just lives entirely in cache without even being backed by
a ramdisk, until the power goes off.
Linux implements "negative dentries" to say "this entry is not here".
I have not looked at that part for a while now, and I did not look at
it just now, but Al Viro has been fiddling with it for years getting the
bugs out one at a time. The last major one I heard of was some time
ago. It works somehow, I should look more closely at it.
Modern Linux filesystems get close I think. Particularly in journalled
data mode, Ext3 marks all the buffers it deals with as "don't touch" to
the VFS and VMM, which have no idea how to obey the necessary ordering
constraints. There is also this thing called the "journalling block
device" that provides an abstract implementation of a physical journal,
which is actually used by more than one filesystem. (Three I think,
including Ext4, however I now hear noise about rewriting it.)
Either you forgot to expand or I missed it. I am interested.
I have written the code, mostly, and it is tight. A similar idea has
been part of ddsnap for 5 years now.
I remember it well. You were the one who put Rik up the reverse map
design that caused the main fireworks between 2.3 and 2.4.9. (I was
the one who finally got it work by social-engineering the merging of
the reverse map part with Andrea's more robust LRU design.)
I also remember the time when BSD buffers were far superior to Linux
ones, yes. In 2.0 the arrangement sucked enormously: sort-of coherency
between the buffer cache and the page cache was achieved by physically
copying changes from one to the other.
Today, the buffer cache and page cache are nearly fully unified. But
the unification will never be complete until the page size is variable,
so the remaining tasks done by buffers can be done by pages instead.
Anyway, this part of your OS is more similar than different, which will
help a lot with a port.
Linux caches physical translations by gluing one or more buffer heads
onto each page, which is one of the few remaining uses of buffer heads.
To finally get rid of buffer heads the physical cache needs to be done
some other way. I am sure there are lots of better ways.
This is similar to what I do when I write out a physical block and use
a logical log entry to make the on-disk parent point at it (actually,
it is a promise to make the parent point at it some time in the future.
But then I do not let those things live very long. Eventually it might
make sense to let them live a little longer, perhaps by generalizing
the idea of the single linear log to a forest of little logs, some of
which could be left around for a long time instead of being quickly
rolled up into the canonical structures.
Indeed, I need to handle that. Since I do want to stick with a simple
linear sequence of logical log commits for now, I do not want to leave
any of them sitting around for a log time. One easy thing to do is to
put any orphans aside at rollup time, in an unversioned orphan file, say.
Linux LVM is all about device mapper, and actually, device mapper does
not really have a reason to be a separate subsystem, it can just be the
way the block layer works. If interested, check out my posts on bio
stacking and rolling device mapper's "clone_and_map" idea up into the
generic block layer. I think all unixen should do this.
Other than that, device mapper is just about making device numbers
point at different virtual devices, transparently to userspace. Even
device stacking works like that: extract the "virtual" device (which
may or may not be a real device) from the device number you are going
to remap, store it in some other device number, create a new virtual
device on top of the other device number, finally store the new virtual
device in the old device number. The old device number has now been
transparently stacked. Nothing to it.
Only 3 hours so far here... hmm, make that four.
Daniel