"Any benchmark is going to be a benchmark of the OS as much as it is going to be a benchmark of the filesystem. It's pretty hard to separate the two. ZFS is best tested on Open Solaris. UFS is best tested on FreeBSD, EXT3 is best tested on Linux, and HAMMER of course is best tested on DragonFly."
"Since everybody seems to be having fun building new filesystems these days, I thought I should join the party, began Daniel Phillips, announcing the Tux3 versioning filesystem. He continued, "Tux3 is a write anywhere, atomic commit, btree based versioning filesystem. As part of this work, the venerable HTree design used in Ext3 and Lustre is getting a rev to better support NFS and possibly become more efficient." Daniel explained:
"The main purpose of Tux3 is to embody my new ideas on storage data versioning. The secondary goal is to provide a more efficient snapshotting and replication method for the Zumastor NAS project, and a tertiary goal is to be better than ZFS."
In his announcement email, Daniel noted that implementation work is underway, "much of the work consists of cutting and pasting bits of code I have developed over the years, for example, bits of HTree and ddsnap. The immediate goal is to produce a working prototype that cuts a lot of corners, for example block pointers instead of extents, allocation bitmap instead of free extent tree, linear search instead of indexed, and no atomic commit at all. Just enough to prove out the versioning algorithms and develop new user interfaces for version control."
"In the kerneloops.org stats, a new oops is rapidly climbing the charts, began Arjan van de Ven, referring to his website where he automatically collects kernel oops and warning reports from mailing lists, bugzillas, and a special client. Regarding the latest oops, he noted, "the oops is a page fault in the ext3 'do_slit' function, and the first report of it was with 2.6.26-rc6-git3." Linux creator Linus Torvalds took a quick interest in the issue, observing that all the oopses seemed to be on the i686 architecture, suggesting, "could this perhaps be an indication that it is specific to i686 some way (eg a compiler issue?)"
Shortly before Linus sent out his emails, Dave Airlie confirmed that this was indeed a known compiler bug affecting GCC 4.3.1. The bug report notes, "any ext* filesystem which enables the dir_index feature is likely susceptible". Linus caught up on his email and retorted, "gaah. I should read all my email instead of wasting my time trying to match up the code with what I can reproduce.." The Red Hat bug report wasn't automatically picked up by the kerneloops website because the oops was reported in a jpeg image, leading Arjan to quip, "maybe one day if I'm really bored I'll implement OCR into [kerneloops.org] ;)".
Chris Mason announced version 0.10 of his new Btrfs filesystem, listing the following new features, "explicit back references, online resizing (including shrinking), in place conversion from Ext3 to Btrfs, data=ordered support, mount options to disable data COW and checksumming, and barrier support for sata and IDE drives". He noted that the disk format in v0.10 has changed, and is not compatible with the v0.9 disk format. Regarding back reference support, Chris explained, "the core of this release is explicit back references for all metadata blocks, data extents, and directory items. These are a crucial building block for future features such as online fsck and migration between devices. The back references are verified during deletes, and the extent back references are checked by the existing offline fsck tool." He then detailed the new Ext3 to Btrfs conversion utility:
"The conversion program uses the copy on write nature of Btrfs to preserve the original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata. Btrfs metadata is created inside the free space of the Ext3 filesystem, and it is possible to either make the conversion permanent (reclaiming the space used by Ext3) or roll back the conversion to the original Ext3 filesystem."
"In [the first pass] of e2fsck, every inode table in the fileystem is scanned and checked, regardless of whether it is in use," Avantika Mathur began. "This is the most time consuming part of the filesystem check. The unintialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes." She went on to explain how it works, "with this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation." Avantika attached a graph illustrating the advantage of the patch which she summarized as follows:
"The patches have been stress tested with fsstress and fsx. In performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count. Since typical ext4 filesystems only use 1-10% of their inodes, this feature can greatly reduce e2fsck time for users. With performance improvement of 2-20 times, depending on how full the filesystem is."
With the release of the 2.6.19-rc1-mm1 kernel, the ext4 filesystem was merged into Andrew Morton's -mm tree for further testing. In the announcement Andrew notes that the new filesystem is compatible with ext3 until you add a file that has extents. He also notes, "when comparing performance with other filesystems, remember that ext3/4 by default offers higher data integrity guarantees than most. So when comparing with a metadata-only journalling filesystem, use `mount -o data=writeback'. (Although this doesn't seem to make much difference with ext3)" The goal is to stabilize the new filesystem within the next six to nine months, and ultimately to replace the ext3 filesystem.
Theodore Ts'o offered an insightful summary of issues affecting future development on the ext3 filesystem, "it is clear that many people feel they have a stake in the future development plans of the ext2/ext3 filesystem, as it [is] one of the most popular and commonly used filesystems, particular[ly] amongst the kernel development community. For this reason, the stakes are higher than it would be for other filesystems." He listed the three main concerns for future development as stability, compatibility confusion, and code complexity, "unfortunately, these various concerns were sometimes mixed together in the discussion two months ago, and so it was hard to make progress. Linus's concern seems to have been primarily the first point, with perhaps a minor consideration of the 3rd. Others dwelled very heavily on the second point."
Theodore went on to say, "to address these issues, after discussing the matter amongst ourselves, the ext2/3 developers would like to propose the following path forward." He listed a four-step plan beginning with the creation of a new ext4 filesystem registered with the kernel temporarily as 'ext3dev', "this will be explicitly marked as a CONFIG_EXPERIMENTAL filesystem, and will in effect be a 'development fork' of ext3. A similar split of the fs/jbd will be made in order to support 64-bit jbd, which will be used by fs/ext4 and future versions of ocfs2." Theodore explained that new features will go into the ext3dev tree, with only bugfixes making their way back to the stable ext3 tree. He noted that it will remain important that the ext4 code base can mount ext3 filesystems, "this is necessary to ensure a future smooth upgrade path from ext3 to ext4 users." Finally, "probably in 6-9 months when we are satisfied with the set of features that have been added to fs/ext4, and confident that the filesystem format has stabilized, we will submit a patch which causes the fs/ext4 code to register itself as the ext4 filesystem." He further noted that once ext4 is deemed fully stable, it may completely replace ext3 in the source tree.
With the release of 2.6.9-mm1, Andrew Morton offered a quick status update on a number of patches in his -mm tree that are 2.6-mainline hopefuls. For example, regarding the much debated reiser4 filesystem, Andrew said that he is still "not sure, really. The namespace extensions were disabled, although all the code for that is still present. Linus's filesystem criterion used to be 'once lots of people are using it, preferably when vendors are shipping it'. That's a bit of a chicken and egg thing though. Needs more discussion". And as for Ingo Molnar's preemption and low-latency fixups, Andrew offered, "I haven't really thought about it and haven't looked at the patches yet. Hopefully 2.6.10 material."
Other projects specifically mentioned include the sysfs backing store, the ext3 reservations code, the ext3 resize code, kexec and crashdump [story], perfctr, cachefs, cpusets, and the md updates. Read on for Andrew's comments and the complete -mm1 changelog.
Continuing the earlier discussion about low latency and Ingo Molnar's voluntary kernel preemption patch, the conversation moved on to the effect a filesystem can have on latency. Specifically, 2.6 maintainer Andrew Morton noted that ReiserFS was known to have some latency issues in both the 2.4 and 2.6 Linux kernels, "reiserfs: yes, it's a problem. I 'fixed' it multiple times in 2.4, but the fixes ended up breaking the fs in subtle ways and I eventually gave up." However, he did go on to note, "actually, the 2.4 low-latency patch does still have some reiserfs fixes, so it's probably better than reiserfs in 2.6."
When asked if ext3 was a better choice for low latency work, Andrew Morton replied, "ext3 is certainly better than [reiserfs], but still has a couple of potential problem spots. ext2 is probably the best at this time." Data is continuing to be collected and reviewed by a number of kernel developers, so the more noticeable latency issues in the 2.6 kernel will likely be addressed soon.
Mike Benoit recently posted a link to results from his new and improved file system shootout, using better hardware and running more tests. Using two benchmarks that are designed to measure hard drive and file system performance, Bonnie++ and IOZone, he's compared a number of journaling filesystems found in the 2.6 kernel. Included in the lineup are EXT2 (not journaling, but an effective baseline), JFS, XFS, ReiserFS, Reiser4, and EXT3, each compared head to head on both SCSI and IDE drives.
In Mike's summary he labels JFS and XFS as 'best bang for your buck' explaining, "While not the fastest file systems, both of them consistently perform close to EXT2, while using minimal CPU. XFS seems to be faster over a wider range of benchmarks, however it does use slightly more CPU than JFS. While JFS really starts to slow down with lots of files." As for pure speed, Mike points to Reiser4 which really shined in the Bonnie++ benchmarks, though not quite so much in the IOZone benchmarks. He suggests, "ReiserFS v4 will [definitely] be worth while keeping an eye on, especially considering some of the exciting new features it offers."
Grant Miner posted some interesting benchmark results to the lkml, comparing five journaling filesystems available with the current 2.6.0-test2 development kernel. The tests were conducted with a very simple shell script, mainly timing how long it takes to copy, tar, and remove directories, performing several syncs in between. He summarizes:
- ext3's syncs tended to take the longest [at] 10 seconds, except JFS took a whopping 38.18s on its final sync
- xfs used more CPU than ext3 but was slower than ext3
- reiser4 had highest throughput and most CPU usage
- jfs had lowest throughput and least CPU usage
Some interesting discussion follows, debating the results and offering further suggestions on making the tests more useful. For example, Andrew Morton proposed including ext2 in the tests as a baseline, and Hans Reiser noted that reiser4 continues to improve rapidly. Read on for the full test results and much of the following discussion.
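For readers who want to try something similar, the kind of test described above (time a copy, a tar, and a remove, with a sync after each step) can be sketched in Python. This is not Grant's script, which was a shell script and is not reproduced here; the function names, the step order, and the directory layout are assumptions made for illustration.

```python
import os
import shutil
import tarfile
import tempfile
import time

# os.sync() is Unix-only; fall back to a no-op elsewhere.
_sync = getattr(os, "sync", lambda: None)


def timed(label, step, results):
    """Run one step, flush dirty data to disk, and record the elapsed time."""
    start = time.perf_counter()
    step()
    _sync()  # include the sync so delayed writes are counted
    results[label] = time.perf_counter() - start


def run_benchmark(src_dir, work_dir):
    """Time copy, tar, and remove of a directory tree inside work_dir."""
    results = {}
    copy_dir = os.path.join(work_dir, "copy")
    tar_path = os.path.join(work_dir, "copy.tar")

    def make_tar():
        with tarfile.open(tar_path, "w") as tar:
            tar.add(copy_dir, arcname="copy")

    timed("copy", lambda: shutil.copytree(src_dir, copy_dir), results)
    timed("tar", make_tar, results)
    timed("remove", lambda: shutil.rmtree(copy_dir), results)
    return results


if __name__ == "__main__":
    # Tiny self-contained demo; point work_dir at the filesystem under test.
    src = tempfile.mkdtemp()
    with open(os.path.join(src, "sample.txt"), "w") as handle:
        handle.write("x" * 4096)
    for label, seconds in run_benchmark(src, tempfile.mkdtemp()).items():
        print(f"{label}: {seconds:.3f}s")
```

Wall-clock times from a script like this say nothing about CPU usage, which the original summary also compared, so a fuller comparison would need to record that separately.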
Andrew Morton posted on the lkml, "In 2.4.20-pre5 an optimisation was made to the ext3 fsync function which can very easily cause file data corruption at unmount time". This bug only affects ext3 filesystems mounted in the uncommon "data=journal" mode, or files operating under "chattr -j", and does not affect the 2.5 series of kernels.
Andrew went on to say that "The symptoms are that any file data which was written within the thirty seconds prior to the unmount may not make it to disk. A workaround is to run `sync' before unmounting". He also posted a patch to fix the problem. However, soon thereafter, he posted saying that "that 'fix' didn't fix it. Sorry about that". Until a proper fix can be developed, he recommends that people "please avoid ext3/data=journal". Since "data=journal" is not the default ext3 mode, it is unlikely most people running ext3 will be affected by this. However, it is a data corruption bug so you should double-check that you use either "data=ordered" or "data=writeback" as your ext3 mode of operation.
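One way to double-check is to look at the mount options of each ext3 filesystem. The following Python sketch scans mount-table text in the usual "device mountpoint type options" layout and reports ext3 mounts that use "data=journal". The function name and the sample table are hypothetical, and the sketch assumes that a mount with no explicit "data=" option is running in the default mode. It does not detect individual files that have data journaling enabled through chattr.

```python
def risky_ext3_mounts(mounts_text):
    """Return mount points of ext3 filesystems mounted with data=journal."""
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type != "ext3":
            continue
        # No data= option is assumed to mean the default mode, not data=journal.
        if "data=journal" in options.split(","):
            risky.append(mount_point)
    return risky


# Hypothetical sample in the layout used by /proc/mounts and /etc/mtab.
SAMPLE = """\
/dev/hda1 / ext3 rw,data=ordered 0 0
/dev/hda2 /home ext3 rw,data=journal 0 0
/dev/hda3 /var ext3 rw 0 0
/dev/hdb1 /mnt/scratch ext2 rw 0 0
"""

if __name__ == "__main__":
    for mount_point in risky_ext3_mounts(SAMPLE):
        print("avoid data=journal on", mount_point)
```

On a live system the same function could be fed the contents of the mount table instead of the sample text.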