Linux: Proposing Soft Updates In ext2

Submitted by Jeremy
on April 11, 2002 - 7:09am

Alexis Carvalho asked on the LKML, "Does anyone know of any implementation of soft-updates over ext2?" He went on to explain that he was intending to implement this as a project for grad school. The discussion that followed considered some of the pros and cons of adding soft-updates to ext2.

For further reading, see our earlier interview with Theo de Raadt in which there's a lengthy discussion of soft updates versus journaling. Theo also recommends this USENIX paper.

The thread also discusses the tux2 filesystem, which uses a phase tree algorithm. Find the original (Aug 2000) announcement here.

From: Alexis S. L. Carvalho
To: linux-kernel mailing list
Subject: implementing soft-updates
Date: Tue, 9 Apr 2002 18:46:05 -0300

Hi

Does anyone know of any implementation of soft-updates over ext2? I'm
starting a project on this for grad school, and I'd like to know of any
previous (current?) efforts.

Thanks.

Alexis

From: Albert D. Cahalan
Subject: Re: implementing soft-updates
Date: Tue, 9 Apr 2002 20:41:28 -0400 (EDT)

Alexis S. L. Carvalho writes:

> Does anyone know of any implementation of soft-updates
> over ext2? I'm starting a project on this for grad school,
> and I'd like to know of any previous (current?) efforts.

That's interesting. Some comments:

It is common for controllers, RAID arrays, and the disks to
mess up your ordering. Power failure during a write has been
known to scribble on random unrelated parts of the disk.
Power failure often creates bad sectors that can only be
fixed by a large write that covers the affected area.

Ext2 has deletion time stamps. These are not really good for
performance, but they help fsck to know what is going on.

While ext2 fsck doesn't guarantee anything, in practice it is far
more reliable than ufs fsck. If you change the algorithms to be
like those used by BSD, then you may lose some of the ability to
recover. Remember, fsck isn't just for power failures. It tries
to piece together a filesystem that has suffered disk corruption
caused by attackers, kernel bugs, fdisk screwups, MS-DOS writing
past the end of a partition, Windows NT Disk Manager, viruses,
disk head crashes, and every other cause you can imagine. If you
change fsck to make BSD-style assumptions about write ordering,
you weaken the ability to deal with disasters.
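(Editor's sketch.) The "BSD-style assumptions about write ordering" come from soft updates' core rule: never write a pointer to a structure before that structure has been initialized on disk. For example, an inode must reach the disk before the directory entry that names it. The following is a hypothetical illustration, not BSD's code. It records "must be written before" dependencies between dirty metadata blocks and derives a flush order with a topological sort. The block names and the dict representation are assumptions made for the example.

```python
from graphlib import TopologicalSorter

def flush_order(deps):
    """Return an order in which dirty metadata blocks can be written.

    `deps` maps a block name to the set of blocks that must reach the
    disk before it (hypothetical representation, for illustration)."""
    # static_order() yields every block after all of its prerequisites and
    # raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(deps).static_order())

# Creating a file with one data block: the data block must be initialized
# before the inode points to it, and the inode must be initialized before
# the directory entry that names it.
create_deps = {
    "dir_block": {"inode_block"},
    "inode_block": {"data_block"},
}
print(flush_order(create_deps))  # prints ['data_block', 'inode_block', 'dir_block']
```

Real soft updates track dependencies per update rather than per block. They break cycles by temporarily rolling back individual updates inside a buffer before writing it. This sketch simply raises an error on a cycle.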

I'm sure you are aware of ext3. You should also be aware of tux2.
Tux2 uses the phase-tree algorithm to perform atomic updates of
the whole filesystem. Tux2 looks horridly slow at first glance,
but is actually quite fast. The overhead drops to almost nothing
as the number of simultaneous operations goes to infinity.
(the overhead asymptotically approaches 0.1%) While the operations
tend to cause fragmentation, they also make defragmentation
really cheap -- you can defragment on-the-fly as part of normal
filesystem operations without any additional IO. There is a
neat trick you can do with the phase-tree algorithm for better
integrity: make every non-leaf node carry checksums for all
directly connected child nodes. (either plain or keyed crypto)
Filesystem-level snapshots are easy with the phase-tree algorithm.
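(Editor's sketch.) The checksum trick and the cheap snapshots both follow from copy-on-write trees. An update copies the path from the changed leaf up to the root. Each parent stores checksums of its direct children. Switching to the new root is the atomic commit, and the old root stays a consistent snapshot. This is a generic copy-on-write tree in the spirit of the phase-tree idea, not tux2's actual algorithm, which also batches many updates into one phase. SHA-256 stands in for the "plain or keyed crypto" checksum. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

@dataclass(frozen=True)
class Leaf:
    data: bytes
    def encode(self) -> bytes:
        return self.data

@dataclass(frozen=True)
class Node:
    children: tuple  # direct child nodes (Leaf or Node)
    sums: tuple      # checksum of each direct child, stored in the parent
    def encode(self) -> bytes:
        # A parent's checksum covers its children's checksums (Merkle-style).
        return b"".join(self.sums)

def make_node(children):
    return Node(tuple(children), tuple(digest(c.encode()) for c in children))

def update(node, path, new_leaf):
    """Copy-on-write: return a NEW root with the leaf at `path` replaced.
    Untouched subtrees are shared; the old root is left intact."""
    if not path:
        return new_leaf
    i, rest = path[0], path[1:]
    children = list(node.children)
    children[i] = update(children[i], rest, new_leaf)
    return make_node(children)

def verify(node) -> bool:
    """Check every stored child checksum from this node downwards."""
    if isinstance(node, Leaf):
        return True
    return all(digest(c.encode()) == s and verify(c)
               for c, s in zip(node.children, node.sums))

old_root = make_node([make_node([Leaf(b"a"), Leaf(b"b")]), Leaf(b"c")])
new_root = update(old_root, (0, 1), Leaf(b"B"))  # old_root is now a snapshot
```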

Soft-updates are mainly useful for OS wars. Lots of FUD comes
flying out of the BSD camp. Ext2 horror stories are rare
when you consider just how many millions of users ext2 has.
Soft-updates would make our worst problems even worse. The whole
point of soft-updates is to have fsck and the kernel trust the
metadata a bit more... which is terrible if your VIA motherboard
is mangling your metadata before it hits the disk. Not to say
that doing well in an OS war isn't a useful goal though!

In case you are still thinking about what to do, here are a
few filesystem ideas that you might like:

soft-updates for ext2
ext2 compression (e2compr)
delayed allocation (allocate space only when about to do IO)
while rw mounted: defrag, undelete (not trash bin), grow, shrink, fsck
get tux2 into production shape
use the phase-tree algorithm for FAT32 (hint: active FAT flags)
new phase-tree filesystem, perhaps with JFS or XFS structure
make ext2 extents work
make ext2 handle huge block sizes
mark idle filesystems clean; mark dirty before non-atomic updates
ACLs compatible with NFSv4, fast, and compact
secure deletion (stop root, not the NSA: zero the name, inode...)
tools for in-place filesystem conversion (ufs --> ext2)
HFS+ filesystem
Apple's UID hacks for Darwin (the BSD-like MacOS X kernel)
design a fast way to map from inode number to filename(s)
try larger inodes (example: 168-byte, 3 in 512 bytes, 0,1,2,x,4,5,6,x,8...)
provide real-time file IO (app buffers do not guarantee bandwidth)
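(Editor's sketch.) One reading of the larger-inode item is this: 3 inodes of 168 bytes fill 504 of a sector's 512 bytes. Inode numbers then advance by four per sector, so the sector and slot fall out of a shift and mask, and every fourth number (the "x") is never used. The constants come from the example in the list. The function and the interpretation are assumptions.

```python
SECTOR = 512
INODE_SIZE = 168
PER_SECTOR = 3   # 3 * 168 = 504 bytes used, 8 bytes spare per sector
STRIDE = 4       # numbering advances by 4 per sector, so slot 3 is skipped

def inode_location(ino: int):
    """Map an inode number to (sector within the inode table, byte offset
    within that sector) under the hypothetical 0,1,2,x,4,5,6,x numbering."""
    sector, slot = divmod(ino, STRIDE)  # a power-of-two stride: shift and mask
    if slot >= PER_SECTOR:
        raise ValueError(f"inode number {ino} is a skipped slot")
    return sector, slot * INODE_SIZE
```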

BTW, the unbalanced trees can be good. They provide quick access
to file magic (see "file" command) and other header information.
We have read-ahead to take care of the rest of the file.
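(Editor's sketch.) The unbalanced tree here is ext2's block map. The inode holds 12 direct block pointers (EXT2_NDIR_BLOCKS), followed by single-, double- and triple-indirect pointers. The first few blocks of every file, where the magic numbers and headers live, therefore need no extra reads. The helper below is hypothetical. It counts how many indirect blocks must be read to reach a given logical block, assuming 32-bit block pointers.

```python
NDIR = 12  # direct pointers in an ext2 inode (EXT2_NDIR_BLOCKS)

def lookup_depth(lblock: int, block_size: int = 1024) -> int:
    """Number of indirect blocks read to map logical block `lblock`
    (0 means the pointer sits directly in the inode)."""
    per = block_size // 4  # 32-bit block pointers per indirect block
    if lblock < NDIR:
        return 0
    lblock -= NDIR
    span = per
    for depth in (1, 2, 3):  # single, double, triple indirect
        if lblock < span:
            return depth
        lblock -= span
        span *= per
    raise ValueError("beyond the triple-indirect range")
```

With 1 KiB blocks, blocks 0-11 are direct, 12-267 need one indirect read, and block 268 onwards needs two or more.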

From: Alexis S. L. Carvalho
Subject: Re: implementing soft-updates
Date: Tue, 9 Apr 2002 22:58:54 -0300

First of all, thanks for your comments.

Thus spake Albert D. Cahalan:
> It is common for controllers, RAID arrays, and the disks to
> mess up your ordering. Power failure during a write has been
> known to scribble on random unrelated parts of the disk.
> Power failure often creates bad sectors that can only be
> fixed by a large write that covers the affected area.

OK, but if something scribbles on random unrelated parts of the disk
there's not much you can do besides praying that fsck will fix it.

> Ext2 has deletion time stamps. These are not really good for
> performance, but they help fsck to know what is going on.
>
> While ext2 fsck doesn't guarantee anything, in practice it is far
> more reliable than ufs fsck. If you change the algorithms to be
> like those used by BSD, then you may lose some of the ability to
> recover. Remember, fsck isn't just for power failures. It tries
> to piece together a filesystem that has suffered disk corruption
> caused by attackers, kernel bugs, fdisk screwups, MS-DOS writing
> past the end of a partition, Windows NT Disk Manager, viruses,
> disk head crashes, and every other cause you can imagine. If you
> change fsck to make BSD-style assumptions about write ordering,
> you weaken the ability to deal with disasters.

I haven't looked into e2fsck yet, but if/when I get to it, I'll probably
add a mode that makes some assumptions about the disk state. If you
don't explicitly ask for this mode, you get the current behavior.

Also, this mode would only be run during the boot sequence under a
specific situation (the system crashed while running with soft-updates).
Note that if you were running a journalling fs, fsck wouldn't be run at
all.

> I'm sure you are aware of ext3. You should also be aware of tux2.

I read some stuff about tux2 a couple of years ago, but I do have to
re-read it all...

> Soft-updates are mainly useful for OS wars. Lots of FUD comes
> flying out of the BSD camp. Ext2 horror stories are rare
> when you consider just how many millions of users ext2 has.

Well, I found soft-updates pretty interesting, and I want to play a bit
with it. Anyway, given my (lack of) experience with kernel programming I
don't believe I'll have anything useful for some time yet...

> In case you are still thinking about what to do, here are a
> few filesystem ideas that you might like:

hmm... I guess I find soft-updates sexy enough... :-)

Thanks

Alexis

From: Andreas Dilger
Subject: Re: implementing soft-updates
Date: Tue, 9 Apr 2002 21:46:56 -0600

On Apr 09, 2002 22:58 -0300, Alexis S. L. Carvalho wrote:
> OK, but if something scribbles on random unrelated parts of the disk
> there's not much you can do besides praying that fsck will fix it.

Well, the fact that ext2 uses fixed areas of the disk for specific
purposes (e.g. inode table) and it has backups of a lot of metadata
makes it very possible to recover from random data corruption.

> Note that if you were running a journalling fs, fsck wouldn't be run at
> all.

Note that this is incorrect. Even with ext3, e2fsck is run on each
boot. In the normal case all it does is journal recovery (which takes
a few seconds at most) and a superficial check of the superblock.
This is incredibly useful, however, if there was a filesystem error,
since e2fsck has a chance to check and cleanup the filesystem before
it is put into use.
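(Editor's sketch.) Journal recovery is fast because only the log needs examining. Transactions whose commit record reached the log are replayed onto their home blocks. A trailing transaction without a commit record is discarded. The record format below is a hypothetical simplification. The real ext3/JBD journal uses descriptor, commit and revoke blocks, and this sketch ignores revokes.

```python
def replay(journal, disk):
    """Replay committed transactions onto `disk` (a dict block -> bytes).

    `journal` is a list of records in log order: ("write", txid, block, data)
    or ("commit", txid). Writes from uncommitted transactions are skipped."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, data = rec
            disk[block] = data
    return disk
```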

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


From: Andreas Dilger
Subject: Re: implementing soft-updates
Date: Tue, 9 Apr 2002 20:55:04 -0600

On Apr 09, 2002 20:41 -0400, Albert D. Cahalan wrote:
> In case you are still thinking about what to do, here are a
> few filesystem ideas that you might like:
>
> ext2 compression (e2compr)
- project needs polishing, integration
> delayed allocation (allocate space only when about to do IO)
- Andrew Morton has done this for 2.5
> while rw mounted: defrag, undelete (not trash bin), grow, shrink, fsck
- Andrew Morton has implemented for ext3 (kernel space, needs user tool)
> make ext2 extents work
- yes, discussion ongoing on ext2-devel, no real progress yet
> make ext2 handle huge block sizes
- kernel issues w.r.t. buffers > PAGE_SIZE
> mark idle filesystems clean; mark dirty before non-atomic updates
- maybe marginally useful
> tools for in-place filesystem conversion (ufs --> ext2)
- existing project
> try larger inodes (example: 168-byte, 3 in 512 bytes, 0,1,2,x,4,5,6,x,8...)
- discussion ongoing on ext2-devel with some good progress

Cheers, Andreas


From: Dominik Kubla
Subject: Re: implementing soft-updates
Date: Wed, 10 Apr 2002 11:28:07 +0200

On Tue, Apr 09, 2002 at 08:41:28PM -0400, Albert D. Cahalan wrote:
...
> While ext2 fsck doesn't guarantee anything, in practice it is far
> more reliable than ufs fsck. If you change the algorithms to be
> like those used by BSD, then you may lose some of the ability to
> recover. Remember, fsck isn't just for power failures. It tries
> to piece together a filesystem that has suffered disk corruption
> caused by attackers, kernel bugs, fdisk screwups, MS-DOS writing
> past the end of a partition, Windows NT Disk Manager, viruses,
> disk head crashes, and every other cause you can imagine. If you
> change fsck to make BSD-style assumptions about write ordering,
> you weaken the ability to deal with disasters.

I disagree. In fact the current BSD softupdate code guarantees that all
that ever happens is that freed blocks are not entered into the free
block list. Something fsck can fix in the background on a live system. See
M. Kirk McKusick's BSDcon 02 paper 'Running fsck in the Background.'
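(Editor's sketch.) If the only inconsistency soft updates can leave is freed blocks still marked in use, reclaiming them is a set difference. It compares the blocks the allocation bitmap marks as used with the blocks reachable from live inodes. The opposite case, a block referenced but marked free, would mean real corruption, which is Albert's concern. The data layout below is hypothetical. McKusick's background fsck runs this kind of check against a filesystem snapshot.

```python
def check_block_accounting(allocated, inodes):
    """Compare allocation-bitmap state with blocks referenced by inodes.

    `allocated`: iterable of block numbers marked in use in the bitmap.
    `inodes`: dict inode number -> list of block numbers it references.
    Returns (leaked, missing): leaked blocks are safe to free; missing
    blocks (referenced but marked free) indicate genuine corruption."""
    referenced = set()
    for blocks in inodes.values():
        referenced.update(blocks)
    leaked = sorted(set(allocated) - referenced)
    missing = sorted(referenced - set(allocated))
    return leaked, missing
```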

Your argument that faulty hardware may wreak havoc on your on-disk
data structures applies to every file system unless it uses raw
read-after-write verification, something which definitely kills disk
performance.

The background fsck capability, just like journalling or logging, is
typically only needed in 24/7 systems (sure, they are nice to have in
your home system, but do you _REALLY_ need them? I don't!) and those
systems typically run on proven hardware which is operated well
within the specs. So please don't construct these kinds of arguments.

The fact that the BSD FFS in its currently released version (which does
not include snapshot and background fsck capability) is considered to be
one of the more reliable file systems around, even when softupdates are
enabled, speaks for itself. So please: just as you don't want horror
stories about Linux ext2 spread, don't spread them yourself.

Alexis, if you're looking for a rewarding Linux project, don't focus too
much on softupdates, the majority of linux users/developers couldn't
care less. I wonder sometimes if this is perhaps because BSD did it
first?

Read M. Kirk McKusick's paper on fsck and snapshots (it's in the
proceedings of this year's BSDcon, available from Usenix) and try to
implement the snapshot capability for ext2/ext3. Every one of us who has
to do live backups of production systems will thank you if you get that
development started. I found that Mr. McKusick is somebody who is very
helpful towards people trying to understand his work, so you might get
help from him if you get stuck. OTOH if you avoid the buzzword
'softupdates' many Linux file system hackers will be much more inclined
to help you out with the Linux part.

Yours,
Dominik Kubla

From: Albert D. Cahalan
Subject: Re: implementing soft-updates
Date: Wed, 10 Apr 2002 14:07:09 -0400 (EDT)

Dominik Kubla writes:
> I disagree. In fact the current BSD softupdate code guarantees that all
> that ever happens is that freed blocks are not entered into the free
> block list. Something fsck can fix in the background on a live system. See
> M. Kirk McKusick's BSDcon 02 paper 'Running fsck in the Background.'

Two cases:

a. proper shutdown -- somewhat OK to never fsck
b. unclean shutdown -- may involve kernel crashing

So with an unclean filesystem, _any_ avoidance of fsck is
suspect. I have a UPS; when my system boots on an unclean
filesystem it's because XFree86 thought it could run a
hardware driver in userspace.
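(Editor's sketch.) Albert's two cases correspond to the ext2 superblock's state field. EXT2_VALID_FS (0x0001) is set on a clean unmount and EXT2_ERROR_FS (0x0002) when the kernel detected errors. e2fsck also forces a check once the mount count reaches the configured maximum. The function below is a simplified, hypothetical model of that boot-time decision, not e2fsck's actual code.

```python
EXT2_VALID_FS = 0x0001  # cleanly unmounted
EXT2_ERROR_FS = 0x0002  # errors detected by the kernel

def needs_full_fsck(s_state: int, mnt_count: int, max_mnt_count: int) -> bool:
    """Decide whether a full check is warranted at boot (simplified)."""
    if not s_state & EXT2_VALID_FS:
        return True   # case b: unclean shutdown, possibly a kernel crash
    if s_state & EXT2_ERROR_FS:
        return True   # the kernel already flagged a problem
    # Case a: clean, but still check periodically; a negative maximum disables this.
    return max_mnt_count > 0 and mnt_count >= max_mnt_count
```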

Journalling gives you a nice list of recently-touched data
structures to examine. The phase-tree algorithm can support
low-cost incremental checksumming of the whole filesystem.
Soft-updates leave you with... well, is prayer any good?
You'd better run fsck at boot, which AFAIK is exactly what
is done; you even say "not include [...] background fsck".

> The fact that the BSD FFS in its currently released version (which does
> not include snapshot and background fsck capability) is considered to be
> one of the more reliable file systems around, even when softupdates are
> enabled, speaks for itself. So please: just as you don't want horror
> stories about Linux ext2 spread, don't spread them yourself.

I'm just tired of this: "Back when I used to use Linux 2.1.44 my
disks were trashed so bad that I lost everything! So use BSD."
Last time I checked, BSD fsck didn't have a set of regression tests
like ext2 fsck does. On the BSD mailing lists you can read about
fsck getting signal 11. So it's not God's Glorious Filesystem by
any means.


From: Andreas Dilger
Subject: Re: implementing soft-updates
Date: Wed, 10 Apr 2002 12:13:04 -0600

On Apr 10, 2002 11:28 +0200, Dominik Kubla wrote:
> try to implement the snapshot capability for ext2/ext3. Every one of us
> who has to do live backups of production systems will thank you if you
> get that development started.

LVM can already do snapshots at the device level. It integrates with
ext3/XFS/reiserfs via sync_super_lockfs/unlockfs so that what is in
the snapshot is a consistent, clean filesystem.

There might need to be a little touchup with ext2 to support these
calls, but even in the current state you get a usable filesystem
snapshot, with the exception that the filesystem has not been marked
clean.
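(Editor's sketch.) A device-level snapshot like LVM's is copy-on-write at the block level. The first time a block of the origin is overwritten after the snapshot, its old contents are saved, and the snapshot view reads saved blocks in preference to the origin. Freezing the filesystem via sync_super_lockfs/unlockfs is what makes the captured image clean. This toy model is not LVM's code; the class and method names are hypothetical.

```python
class CowSnapshot:
    """Toy block-level copy-on-write snapshot of `origin` (dict block -> bytes)."""

    def __init__(self, origin):
        self.origin = origin
        self.saved = {}  # pre-snapshot contents of blocks overwritten since

    def write_origin(self, block, data):
        # Only the first write after the snapshot copies the old contents.
        if block not in self.saved:
            self.saved[block] = self.origin.get(block)
        self.origin[block] = data

    def read_snapshot(self, block):
        if block in self.saved:
            return self.saved[block]
        return self.origin.get(block)
```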

As for a filesystem-level ext2/ext3 snapshot, this has also already
been done (sf.net/projects/snapfs). The people who took over that
project have removed all of the released files and CVS, but you can
still get the CVS from the sourceforge CVS backups. I also have a
version here, but don't have any time to work on it.

Cheers, Andreas
--
Andreas Dilger