I have a problem I'm hoping someone can help me with...
I have an NFS server/client setup. On the server side, I export a romfs filesystem over NFS. The client computers boot via a PXE bootloader and mount their root filesystem via NFS, using tmpfs/unionfs to make parts of the filesystem read/write (e.g., /etc and /var).
"This is a high performance network filesystem with local coherent cache of data and metadata. Its main goal is distributed parallel processing of data. Network filesystem is a client transport. POHMELFS protocol was proven to be superior to NFS in lots (if not all, then it is in a roadmap) operations."
This latest release prompted Jeff Garzik to reply, "this continues to be a neat and interesting project :)" New features include fast transactions, round-robin failover, and near-wire-limit performance. These add to existing features, which include a local coherent data and metadata cache, asynchronous processing of most events, and a fast, scalable multithreaded user-space server. Planned features include a server extension to allow mirroring data across multiple devices, strong authentication, and possibly data encryption when transferring data over the network. Evgeniy linked to several benchmarks in his blog.
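As a rough illustration of what round-robin failover means in general, here is a minimal Python sketch. It is not POHMELFS code: the class name, the `flaky_send` helper, and the server names are all hypothetical, and the real client may well select and retry servers differently.

```python
class RoundRobinFailover:
    """Toy client that rotates to the next server when the current one fails."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self.servers = list(servers)
        self.current = 0  # index of the server tried first

    def call(self, request, send):
        """Try each server at most once, starting with the current one."""
        last_error = None
        for _ in range(len(self.servers)):
            server = self.servers[self.current]
            try:
                return send(server, request)
            except ConnectionError as err:
                last_error = err
                # Rotate to the next server and retry the same request.
                self.current = (self.current + 1) % len(self.servers)
        raise ConnectionError("all servers failed") from last_error


def flaky_send(server, request):
    """Hypothetical transport: server 'a' is down, the others answer."""
    if server == "a":
        raise ConnectionError("server a is unreachable")
    return (server, request)


client = RoundRobinFailover(["a", "b", "c"])
reply = client.call("read", flaky_send)  # fails on 'a', succeeds on 'b'
```

After a failure the client keeps using the server that last answered, so later requests do not pay the cost of retrying the dead one first.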
"These patches add local caching for network filesystems such as NFS," began David Howells describing an updated set of thirty-seven patches to introduce FS-Cache. When asked how the patches affect performance, he noted that this was dependent on the use case, highlighting issues when dealing with lots of metadata, "getting metadata from the local disk fs is slower than pulling it across an unshared gigabit ethernet from a server that already has it in memory."
David continued "these points don't mean that fscache is no use, just that you have to consider carefully whether it's of use to *you* given your particular situation, and that depends on various factors," adding, "note that currently FS-Caching is disabled for individual NFS files opened for writing as there's no way to handle the coherency problems thereby introduced." He concluded with a number of simple performance benchmarks.
"The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer," explained Peter Zijlstra. "Swap over network has the problem that the network subsystem does not use fixed sized allocations, but heavily relies on kmalloc(). This makes mempools unusable."
The first fifteen patches set up a generic framework for reserving memory. Patches 16-23 actually put the framework to use on the network stack. Peter noted, "a network write back completion [involves] receiving packets, which when there is no memory, is rather hard. And even when there is memory there is no guarantee that the required packet comes in in the window that that memory buys us." He went on to explain, "the solution to this problem is found in the fact that network is to be assumed lossy. Even now, when there is no memory to receive packets the network card will have to discard packets. What we do is move this into the network stack." Patches 24-26 set up an infrastructure for swapping to a filesystem instead of a block device, which is then utilized by the final patches, "finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS." When the usefulness of these patches was questioned, Peter noted, "There is a large corporate demand for this, which is why I'm doing this. The typical usage scenarios are: 1) cluster/blades, where having local disks is a cost issue (maintenance of failures, heat, etc) 2) virtualisation, where dumping the storage on a networked storage unit makes for trivial migration and what not.."
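To make the mempool idea Peter refers to more concrete, here is a small user-space sketch in Python. It is only an analogy, not the kernel's mempool implementation: the `ReservePool` class and the `failing_allocator` helper are hypothetical names. The point it illustrates is that a reserve of preallocated buffers can guarantee forward progress only because every request has the same known size, which is exactly what arbitrary-sized kmalloc() calls in the network stack do not offer.

```python
class ReservePool:
    """Toy mempool: keeps min_nr preallocated fixed-size buffers in reserve."""

    def __init__(self, min_nr, buf_size, allocator):
        self.min_nr = min_nr
        self.buf_size = buf_size
        self.allocator = allocator  # callable(size) -> bytearray or None
        # Fill the reserve up front, while memory is still plentiful.
        self.reserve = [bytearray(buf_size) for _ in range(min_nr)]

    def alloc(self):
        # Try the normal allocator first; it may fail under memory pressure.
        buf = self.allocator(self.buf_size)
        if buf is not None:
            return buf
        # Fall back to the reserve so that writeback can still make progress.
        if self.reserve:
            return self.reserve.pop()
        # In this sketch we simply report failure; a blocking pool would
        # wait here until some buffer is freed.
        return None

    def free(self, buf):
        # Refill the reserve first; surplus buffers are simply dropped.
        if len(self.reserve) < self.min_nr:
            self.reserve.append(buf)


def failing_allocator(size):
    """Simulates a general-purpose allocator that has run out of memory."""
    return None


pool = ReservePool(min_nr=2, buf_size=512, allocator=failing_allocator)
first = pool.alloc()   # served from the reserve
second = pool.alloc()  # served from the reserve
third = pool.alloc()   # reserve exhausted, returns None
```

Freeing `first` puts it back in the reserve, so the next `alloc()` succeeds again even though the general allocator is still failing.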
A recent report on the lkml suggested improved IO/writeback performance in the recently released 2.6.24-rc1 kernel compared to two earlier kernels. Credit was given to some patches by Peter Zijlstra. Ingo Molnar replied, "wow, really nice results! Peter does know how to make stuff fast :) Now lets pick up some of Peter's other, previously discarded patches as well :-)" He pointed to several patches "as a starter", then quipped, "I think the MM should get out of deep-feature-freeze mode - there's tons of room to improve :-/"
Andrew Morton replied, "kidding. We merged about 265 MM patches in 2.6.24-rc1:
482 files changed, 8071 insertions(+), 5142 deletions(-)". He added, "a lot of that was new functionality. That's easier to add than things which change long-standing functionality." Of the patches Ingo pointed to, Peter noted he was currently working on polishing the swap-over-NFS patch, "will post that one again, soonish.... Esp. after Linus professed liking to have swap over NFS." Rik van Riel also replied regarding rewriting the page replacement code, "at the moment I only have the basic 'plumbing' of the split VM working and am fixing some bugs in that. Expect a patch series with that soon, so you guys can review that code and tell me where to beat it into shape some more :)"
Trond Myklebust noted the NFS client updates for the upcoming 2.6.24 kernel:
"Aside from the usual updates from Chuck for NFS-over-IPv6 (still incomplete) and a number of bugfixes for the text-based mount code, the main news in the NFS tree is the merging of support for the NFS/RDMA client code from Tom Talpey and the NetApp New England (NANE) team."
He continued, "we also have the 64-bit inode support from RedHat/Peter Staubach. There is also the addition of a nfs_vm_page_mkwrite() method in order to clean up the mmap() write code. Finally, I've been working on a number of updates for the attribute revalidation, having pulled apart most of the dentry and attribute revalidation into separate variables. A number of fixes that address existing bugs fell out of that review, which should hopefully result in more efficient dcache behaviour..." Actual source changes can be browsed in the NFS client git repository.
"Here's a new version of my credentials patch. It's still very basic, with only Ext3, (V)FAT, NFS, AFS, SELinux and keyrings compiled in on an x86_64 arch kernel," stated David Howells. He described the patch as, "introduce a copy on write credentials record (struct cred). The fsuid, fsgid, supplementary groups list move into it (DAC security). The session, process and thread keyrings are reflected in it, but don't primarily reside there as they aren't per-thread and occasionally need to be instantiated or replaced by other threads or processes."
Casey Schaufler asked, "what I don't really understand is what value is gained by this exercise. Are the savings sufficiently significant to justify the effort?" Trond Myklebust explained, "it is not about savings, but about new functionality. Basically, the existence of reference-counted credentials will allow AFS and NFS to cache that information and use it for deferred writes etc." David added, "and also make it easier for cachefiles and hopefully NFSd to override the active security. There's a comment somewhere in, I think, the SunRPC code in the Linux kernel bemoaning the lack of this very feature:-)"
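The benefit Trond describes can be sketched in a few lines of Python. This is an analogy under stated assumptions, not the kernel's struct cred: the `Cred`, `Task`, and `DeferredWrite` names are hypothetical, only the fsuid, fsgid, and groups fields are taken from the description above, and Python's garbage collector stands in for explicit reference counting.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Cred:
    """Immutable credentials record; changes produce a new record."""
    fsuid: int
    fsgid: int
    groups: Tuple[int, ...] = ()


class Task:
    def __init__(self, cred):
        self.cred = cred

    def set_fsuid(self, fsuid):
        # Copy on write: build a modified copy and swap the reference.
        # Anyone still holding the old record sees it unchanged.
        self.cred = replace(self.cred, fsuid=fsuid)


class DeferredWrite:
    """A queued write that remembers the credentials in force when issued."""

    def __init__(self, task, data):
        self.cred = task.cred  # share the record, no copy needed
        self.data = data


task = Task(Cred(fsuid=1000, fsgid=1000, groups=(1000, 4)))
pending = DeferredWrite(task, b"payload")
task.set_fsuid(0)  # the task changes identity after queuing the write
```

Because the record is never modified in place, the queued write still carries fsuid 1000 when it is eventually flushed, while the task itself now runs with fsuid 0.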
Hua Zhong reported an NFS regression in 2.6.23-rc4 as compared to 2.6.22, "[upgrading] causes several autofs mounts to fail silently - they just [do] not appear when they should." Trond Myklebust explained that the change to default behavior was intentional to prevent an NFS mount from being mounted with the wrong options. The patch also introduced a new mount option, "the new option is there in order to make it damned clear to sysadmins that this is a dangerous thing to do: mounts which don't share the same superblock also don't share the same data and attribute caches. Any file or directory which appears in both mounts had better only be used by one application at a time or be using an appropriate locking scheme." Jakob Oestergaard defended the change asserting, "what he 'broke' is, for example, a ro mount being mounted as rw. That *could* be a very serious security (etc.etc.) problem which he just fixed. Anything depending on read-only not being enforced will cease to work, of course, and that is what a few people complain about(!)."
Linus Torvalds disagreed strongly with the change, "that commit gets reverted or fixed. It's a regression, and your theories that it's 'better' that way are obviously broken." He added:
"The point being that you just disallowed people from doing things that are sane but _potentially_ dangerous. That's not how we work. The UNIX way is to give people rope - if you cannot *prove* that what they are doing is wrong, then you damn well better not disallow it."
In response to the concern that the changes to NFS were necessary to fix a security hole, Linus retorted, "this is *not* a security hole. In order to make it a security hole, you need to be root in the first place. So what you call a security hole is really no different from root installing a bad SUID binary. It's simply not the kernels place to then say 'SUID binaries will not work, because it's a potential security hole'."
"I've long hated the non-killability of tasks accessing a dead NFS server," Matthew Wilcox said along with a prototype patch to fix the issue based on a 2002 posting by Linus Torvalds. Matthew added, "I've only added one real user of the killable concept to this patch -- try_lock_page(). However, this is enough for 'cat */*/*' to be killable with a ^C when I unplug the ethernet cord between it and the nfs server."
Linus responded favorably to the patch, "hey, I obviously approve. And the patch looks simple." He went on to suggest that he was interested in merging the patch during the next merge window, "feel free to re-submit after 2.6.23 is out the door, I don't think anybody will really complain. Any NFS user will know why something like this can be really nice."
Linux creator Linus Torvalds announced the 2.6.20-rc6 release candidate kernel, "it's been more than a week since -rc5, but I blame everybody (including me) being away for Linux.conf.au and then me waiting for a few days afterwards to let everybody sync up." He asked that people test the regressions reported against earlier release candidates, "so that we can confirm whether they are still active and relevant." Linus noted that he hoped this would be the final release candidate before 2.6.20 is released, then went on to discuss what's new:
"As to -rc6 itself: the bulk of it are the MTD updates (including a few new drivers), and the POWER update (and the bulk of _that_ in terms of patch size being defconfig updates ;)
"But there's various random fixes in infiniband, DVB, network drivers, scsi, usb, some filesystems (cifs, jffs2, nfs, ntfs, ocfs2) as well as core networking too. Oh, and KVM, of course. And stuff I probably have already forgotten."
KernelTrap has spoken with Matthew Dillon, a well-known FreeBSD kernel hacker. He has recently been in the spotlight due to many impressive NFS-related bug fixes, as well as fixes to the TCP stack. In this interview he talks about these bug fixes as well as his history with computers, programming, and FreeBSD. He also discusses Linux, open source, embedded systems, the Amiga (and his DICE C compiler), and much more.