Jaroslav Sykora posted a series of five patches to handle the kernel portion of what he described as "shadow directories", providing an example which utilized FUSE to access the contents of a compressed file from the command line. His first example was "cat hello.zip^/hello.c", about which he explained, "the '^' is an escape character and it tells the computer to treat the file as a directory. The kernel patch implements only a redirection of the request to another directory ('shadow directory') where a FUSE server must be mounted. The decompression of archives is entirely handled in the user space. More info can be found in the documentation patch in the series."
Numerous problems were pointed out. Jan Engelhardt noted, "too bad, since ^ is a valid character in a *file*name. Everything is, with the exception of '\0' and '/'. At the end of the day, there are no control characters you could use." Later in the thread an lwn.net article from a couple of years ago was quoted, "another branch, led by Al Viro, worries about the locking considerations of this whole scheme. Linux, like most Unix systems, has never allowed hard links to directories for a number of reasons;" The article had been discussing Reiser4, which treats files as directories. In the current discussion, Al Viro added, "as for the posted patch, AFAICS it's FUBAR in handling of .. in such directories. Moreover, how are you going to keep that shadow tree in sync with the main one if somebody starts doing renames in the latter? Or mount --move, or..."
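The mechanism itself is easy to picture from user space. Below is a minimal, purely illustrative C sketch (not part of Jaroslav's patches): everything before the '^' names the archive, everything after it is the member path that would be looked up under the FUSE-backed shadow directory. The function name split_at_escape is hypothetical.

    /* Illustrative only: split "hello.zip^/hello.c" into the archive path
     * and the member path that the kernel would redirect into the shadow
     * directory where the FUSE server is mounted. */
    #include <stdio.h>
    #include <string.h>

    static int split_at_escape(const char *path, char *archive, char *member,
                               size_t len)
    {
        const char *caret = strchr(path, '^');

        if (!caret)
            return -1;              /* no escape character: an ordinary file */

        snprintf(archive, len, "%.*s", (int)(caret - path), path);
        snprintf(member, len, "%s", caret + 1);     /* e.g. "/hello.c" */
        return 0;
    }

    int main(void)
    {
        char archive[256], member[256];

        if (!split_at_escape("hello.zip^/hello.c", archive, member,
                             sizeof(archive)))
            printf("archive: %s  member: %s\n", archive, member);
        return 0;
    }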
"I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based)," explained Chris Mason in a recent posting to the Linux Filesystem Development mailing list. He linked to some benchmark results and summarized, "the first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata intensive workloads, but overall read speeds are much better." Chris then noted, "Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads." David Chinner replied, "the basic conclusion is that different filesystems will benefit in different ways with large block sizes...." explaining:
"Btrfs linearises writes due to it's COW behaviour and this is trades off read speed. i.e. we take more seeks to read data so we can keep the write speed high. By using large blocks, you're reducing the number of seeks needed to find anything, and hence the read speed will increase. Write speed will be pretty much unchanged because btrfs does linear writes no matter the block size.
"XFS doesn't linearise writes and optimises it's layout for a large number of disks and a low number of seeks on reads - the opposite of btrfs. Hence large block sizes reduce the number of writes XFS needs to write a given set of data+metadata and hence write speed increases much more than the read speed (until you get to large tree traversals)."
"Here's a set of patches that remove all calls to iget() and all read_inode() functions," began David Howells describing a collection of 32 patches posted to the lkml. He went on to explain the reason for removing these functions, "they should be removed for two reasons: firstly they don't lend themselves to good error handling, and secondly their presence is a temptation for code outside a filesystem to call iget() to access inodes within that filesystem." He then suggested three benefits:
"(1) Error handling gets simpler as you can return an error code rather than having to call is_bad_inode(). (2) You can now tell the difference between ENOMEM and EIO occurring in the read_inode() path. (3) The code should get smaller. iget() is an inline function that is typically called 2-3 times per filesystem that uses it. By folding the iget code into the read_inode code for each filesystem, it eliminates some duplication."
"I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'," Matt Dillon explained when asked to compare his new clustering HAMMER filesystem with ReiserFS, both of which utilize BTrees to organize objects and records. He continued, "I don't expect HAMMER to be slow. A B-Tree typically uses a fairly small radix in the 8-64 range (HAMMER uses 8 for now). A standard indirect block methodology typically uses a much larger radix, such as 512, but is only able to organize information in a very restricted, linear way." He continued to describe numerous plans he has for optimizing performance, "my expectation is that this will lead to a fairly fast filesystem. We will know in about a month :-)"
Among the optimizations planned, Matt explained, "the main thing you want to do is to issue large I/Os which cover multiple B-Tree nodes and then arrange the physical layout of the B-Tree such that a linear I/O will cover the most likely path(s), thus reducing the actual number of physical I/O's needed." He noted, "HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree operations, because it doesn't need an intact B-Tree for recovery of the filesystem." He went on to describe another potential optimization allowed by the filesystem's design, "HAMMER is designed to allow clusters-by-cluster reoptimization of the storage layout. Anything that isn't optimally layed-out at the time it was created can be re-layed-out at some later time, e.g. with a continuously running background process or a nightly cron job or something of that ilk. This will allow HAMMER to choose to use an expedient layout instead of an optimal one in its critical path and then 'fix' the layout later on to make re-accesses optimal."
"This NTFS update fixes the deadlock at mount time reported by several people over the years but it was only recently that someone who reported it actually replied to my response and helped me track it down (I have never been able to reproduce the deadlock)," Anton Altaparmakov explained about a patch against the NTFS filesystem. He summarized the changes:
"The fix was to stop calling ntfs_attr_set() at mount time as that causes balance_dirty_pages_ratelimited() to be called which on systems with little memory actually tries to go and balance the dirty pages which tries to take the s_umount semaphore but because we are still in fill_super() across which the VFS holds s_umount for writing this results in a deadlock.
"We now do the dirty work by hand by submitting individual buffers. This has the annoying 'feature' that mounting can take a few seconds if the journal is large as we have clear it all. One day someone should improve on this by deferring the journal clearing to a helper kernel thread so it can be done in the background but I don't have time for this at the moment and the current solution works fine so I am leaving it like this for now."
"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.
During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholely independant (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:
"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."
Mark Weinem offered a summary of NetBSD's six 2007 Summer of Code development projects. The projects included: the Automated Testing Framework, "the goal of the ATF project was to develop a testing framework to easily define test cases and run them in a completely automated way"; porting ZFS, "the primary goal of this project was to port volume emulation (ZVOL) functionality in order to mount ZFS file systems"; QoS framework for NetBSD's virtual memory system, "for delay sensitive systems such as streaming multimedia servers and back-end database systems, servicing the reader processes in a timely fashion is more important than servicing the writers"; kernel file systems in userspace, as a result of the project, "almost all NetBSD kernel file systems can be compiled, mounted and run in userspace"; and hardware monitoring, "the aim of this project was to develop a kernel event notification framework to notify userland of hardware changes e.g. a new USB device being added". Mark added:
"NetBSD has been involved in the Google Summer of Code since its conception in 2005. This year we were glad to once again have the oppertunity to introduce six students to our operating system, to Open Source software development and get them sponsored by Google to work on projects defined by the NetBSD developers."
Trond Myklebust noted the NFS client updates for the upcoming 2.6.24 kernel:
"Aside from the usual updates from Chuck for NFS-over-IPv6 (still incomplete) and a number of bugfixes for the text-based mount code, the main news in the NFS tree is the merging of support for the NFS/RDMA client code from Tom Talpey and the NetApp New England (NANE) team."
He continued, "we also have the 64-bit inode support from RedHat/Peter Staubach. There is also the addition of a nfs_vm_page_mkwrite() method in order to clean up the mmap() write code. Finally, I've been working on a number of updates for the attribute revalidation, having pulled apart most of the dentry and attribute revalidation into separate variables. A number of fixes that address existing bugs fell out of that review, which should hopefully result in more efficient dcache behaviour..." Actual source changes can be browsed in the NFS client git repository.
"I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of the cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series," announced Theodore Ts'o. He also noted of the current ext4 git tree, "it also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series." In an earlier thread Theodore posted a series of patches specifically intended for inclusion in the upcoming 2.6.24 kernel. Included in the patch series was a patch for improving fsck performance, "in performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count." The patch included an explanation of how the feature works, enabled through a mkfs option:
"With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation."
UBIFS is described as, "a new flash file system which is designed to work on top of UBI." It has replaced the JFFS3 project, a choice explained on the project webpage, "we have realized that creating a scalable flash file system on top of bare flash is a difficult task, just because the flash media is so problematic (wear-leveling, bad eraseblocks). We have tried this way, and it turned out to be that we solved media problems, instead of concentrating on file system issues. So we decided to split one big and complex task into 2 sub-tasks: UBI solves the media problems, like bad eraseblocks and wear-leveling, and UBIFS implements the file system on top. And now finally, we may concentrate on file-system issues: implementing write-back caching, multi-headed journal, garbage collector, indexing information management and so on. There are a lot of FS problems to solve - orphaned files, deletions, recoverability after unclean reboots and so on."
In a recent posting to the lkml, Artem Bityutskiy noted that UBIFS has to take into account that there is a small amount of unused space at the end of each eraseblock, and that pages written to flash are smaller than they are in memory because the filesystem performs compression. "So, if our current liability is X, we do not know exactly how much flash space (Y) it will take. All we can do is to introduce some pessimistic, worst-case function Y = F(X). This pessimistic function assumes that pages won't be compressible, and it assumes worst-case wastage." The calculation is necessary because, even though data is not written immediately to the flash device, the application writing data must still be told when there is not enough space left. Artem then asked how the filesystem itself could relieve the pessimistic estimate: "So my question is: how can we flush _few_ oldest dirty pages/inodes while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), ->link(), etc)?"
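A hedged sketch of what such a pessimistic F(X) can look like follows, with illustrative page and eraseblock sizes rather than UBIFS's real budgeting code: every dirty byte is assumed to end up as a full, uncompressed page, and every eraseblock touched is assumed to waste its worst-case tail.

    #include <stdint.h>

    #define PAGE_SIZE       4096u
    #define ERASEBLOCK_SIZE (128u * 1024u)
    #define MAX_EB_WASTE    (PAGE_SIZE - 1)     /* worst-case unusable tail */

    /* Pessimistic Y = F(X): flash space that liability_bytes of dirty data
     * may require, assuming no compression and maximal per-eraseblock waste. */
    static uint64_t worst_case_flash_space(uint64_t liability_bytes)
    {
        uint64_t pages = (liability_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
        uint64_t data  = pages * PAGE_SIZE;
        uint64_t ebs   = (data + ERASEBLOCK_SIZE - 1) / ERASEBLOCK_SIZE;

        return data + ebs * MAX_EB_WASTE;
    }

Refusing writes once F(liability) exceeds the remaining flash space is what lets the application be told about lack of space early, at the cost of the over-pessimism Artem wants to relieve by flushing a few of the oldest dirty pages.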
"The attached patch adds a generic intermediary (FS-Cache) by which filesystems may call on local caching capabilities, and by which local caching backends may make caches available," explained David Howells describing his "generic filesystem caching facility" patch. In his patchset he also provided a patch to make NFS utilize the generic caching facility. David went on to detail thirteen facilities provided by the patch, including:
"(1) Caches can be added / removed at any time, even whilst in use; (2) Adds a facility by which tags can be used to refer to caches, even if they're not mounted yet; (3) More than one cache can be used at once. Caches can be selected explicitly by use of tags; (4) The netfs is provided with an interface that allows either party to withdraw caching facilities from a file (required for (1)); (5) A netfs may annotate cache objects that belongs to it; (6) Cache objects can be pinned and reservations made; (7) The interface to the netfs returns as few errors as possible, preferring rather to let the netfs remain oblivious."
In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the
noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way." He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads." The discussion then looked at using the
relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified." Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve
relatime, and noted about
"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
Matthew Dillon created DragonFly BSD in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."
In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.
Chris Mason announced an early alpha release of his new Btrfs filesystem, "after the last FS summit, I started working on a new filesystem that maintains checksums of all file data and metadata." He listed the following features as "mostly implemented": "extent based file storage (2^64 max file size), space efficient packing of small files, space efficient indexed directories, dynamic inode allocation, writable snapshots, subvolumes (separate internal filesystem roots), checksums on data and metadata (multiple algorithms available), very fast offline filesystem check". He listed the following features as yet to be implemented: "object level mirroring and striping, strong integration with device mapper for multiple device support, online filesystem check, efficient incremental backup and FS mirroring". Regarding the current state of the project, Chris said:
"The current status is a very early alpha state, and the kernel code weighs in at a sparsely commented 10,547 lines. I'm releasing now in hopes of finding people interested in testing, benchmarking, documenting, and contributing to the code. I've gotten this far pretty quickly, and plan on continuing to knock off the features as fast as I can. Hopefully I'll manage a release every few weeks or so. The disk format will probably change in some major way every couple of releases."