"The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer," explained Peter Zijlstra. "Swap over network has the problem that the network subsystem does not use fixed sized allocations, but heavily relies on kmalloc(). This makes mempools unusable."
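The mempool idea Peter refers to can be sketched as a pool that pre-allocates a reserve of fixed-size objects, so an allocation can still succeed when the general-purpose allocator fails. The sketch below is a conceptual model in Python, not the kernel's actual mempool API; the class and method names are illustrative only.

```python
# Toy model of a mempool: a pre-filled reserve of fixed-size objects
# guarantees forward progress when the normal allocator returns nothing.
# Names are illustrative, not the real kernel API.

class MemPool:
    def __init__(self, min_nr, alloc_fn):
        self.alloc_fn = alloc_fn                            # normal allocation path
        self.reserve = [alloc_fn() for _ in range(min_nr)]  # pre-filled reserve

    def alloc(self):
        obj = self.alloc_fn()        # try the normal allocator first
        if obj is not None:
            return obj
        if self.reserve:             # fall back to the emergency reserve
            return self.reserve.pop()
        return None                  # genuinely out of memory

    def free(self, obj):
        # freed objects refill the reserve before returning to the system
        self.reserve.append(obj)
```

Note that the reserve only works because every object is the same fixed size; kmalloc()'s variable-size allocations are exactly what defeats this scheme, as Peter points out.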
The first fifteen patches set up a generic framework for reserving memory. Patches 16-23 actually put the framework to use on the network stack. Peter noted, "a network write back completion [involves] receiving packets, which when there is no memory, is rather hard. And even when there is memory there is no guarantee that the required packet comes in in the window that that memory buys us." He went on to explain, "the solution to this problem is found in the fact that network is to be assumed lossy. Even now, when there is no memory to receive packets the network card will have to discard packets. What we do is move this into the network stack." Patches 24-26 set up an infrastructure for swapping to a filesystem instead of a block device, which the final patches then put to use: "finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS." When the usefulness of these patches was questioned, Peter noted, "There is a large corporate demand for this, which is why I'm doing this. The typical usage scenarios are: 1) cluster/blades, where having local disks is a cost issue (maintenance of failures, heat, etc) 2) virtualisation, where dumping the storage on a networked storage unit makes for trivial migration and what not.."
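The policy Peter describes, treating the network as lossy under memory pressure, can be summarized as: when the emergency reserve is in use, deliver only packets belonging to swap traffic and discard the rest, trusting the sender to retransmit. A hedged toy sketch of that triage, with entirely hypothetical names (this is not the patch set's code):

```python
# Illustrative sketch: under memory pressure, only packets belonging to a
# swap-related socket may consume the emergency reserve; all other traffic
# is dropped early, since the network is assumed lossy and will retransmit.

def receive_packet(pkt, swap_sockets, under_pressure):
    if not under_pressure:
        return "deliver"              # normal path: deliver everything
    if pkt["socket"] in swap_sockets:
        return "deliver"              # swap traffic may use the reserve
    return "drop"                     # everything else is discarded
```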
A recent report on the lkml suggested improved IO/writeback performance in the recently released 2.6.24-rc1 kernel compared to earlier kernel releases. Credit was given to some patches by Peter Zijlstra. Ingo Molnar replied, "wow, really nice results! Peter does know how to make stuff fast :) Now lets pick up some of Peter's other, previously discarded patches as well :-)" He pointed to several patches "as a starter", then quipped, "I think the MM should get out of deep-feature-freeze mode - there's tons of room to improve :-/"
Andrew Morton replied, "kidding. We merged about 265 MM patches in 2.6.24-rc1:
482 files changed, 8071 insertions(+), 5142 deletions(-)". He added, "a lot of that was new functionality. That's easier to add than things which change long-standing functionality." Of the patches Ingo pointed to, Peter noted he was currently working on polishing the swap-over-NFS patch, "will post that one again, soonish.... Esp. after Linus professed liking to have swap over NFS." Rik van Riel also replied regarding rewriting the page replacement code, "at the moment I only have the basic 'plumbing' of the split VM working and am fixing some bugs in that. Expect a patch series with that soon, so you guys can review that code and tell me where to beat it into shape some more :)"
"My experiments show that when there is not much free physical memory, swapoff moves pages out of swap at a rate of approximately 5mb/sec," Daniel Drake noted in a recent discussion about swapoff performance. He added, "I've read into the swap code and I have some understanding that this is an expensive operation (and has to be)." Hugh Dickins acknowledged, "Yes, it can be shamefully slow. But we've done nothing about it for years, simply because very few actually suffer from its worst cases. You're the first I've heard complain about it in a long time: perhaps you'll be joined by a chorus, and we can have fun looking at it again."
As a potential optimization Daniel proposed, "iterate through all process page tables, paging all swapped pages back into physical memory and updating PTEs". Hugh replied, "feasible yes, and very much less CPU-intensive than the present method. But... it would be reading in pages from swap in pretty much a random order, whereas the present method is reading them in sequentially, to minimize disk seek time. So I doubt your way would actually work out faster". He then added:
"The speedups I've imagined making, were a need demonstrated, have been more on the lines of batching (dealing with a range of pages in one go) and hashing (using the swapmap's ushort, so often 1 or 2 or 3, to hold an indicator of where to look for its references)."
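The "hashing" idea Hugh mentions can be illustrated with a toy model: the swap map keeps a ushort reference count per slot (usually 1, 2 or 3), and a small side index keyed by slot could record where those few references live, letting swapoff update the referencing PTEs directly instead of rescanning every process's page tables for every slot. The sketch below is a speculative Python model of that indirection, not a proposed design; all structures are hypothetical.

```python
# Toy model: a reverse index from swap slot to the (few) PTE locations that
# reference it, so swapoff can update exactly those PTEs without a full scan.

def build_reverse_index(page_tables):
    # page_tables: {pid: {virtual_addr: swap_slot}}
    index = {}                                # swap_slot -> [(pid, vaddr), ...]
    for pid, ptes in page_tables.items():
        for vaddr, slot in ptes.items():
            index.setdefault(slot, []).append((pid, vaddr))
    return index

def swapoff_slot(slot, index, page_tables, page):
    # point each referencing PTE at the in-memory page, no scanning needed
    for pid, vaddr in index.get(slot, []):
        page_tables[pid][vaddr] = page
```

This trades extra bookkeeping on every swap-out for a much cheaper swapoff, which matches Hugh's framing that such speedups would only be worth making "were a need demonstrated".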
Another thread discussed potentially merging the swap prefetch patch into the mainline Linux kernel. Con Kolivas started the thread saying "I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please." Replying to an off-list message, Andrew Morton asked users of the patch, "please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc?"
Nick Piggin noted that he's still interested in better understanding and possibly fixing what's happening with swap and reclaim on the systems reporting a benefit from the swap-prefetch patch. He went on to note, "regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some)." Ingo Molnar followed up to these comments, ACKing the patch, "I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already) We should include this in v2.6.23."
The question was asked on the lkml whether or not memory allocated by kmalloc and vmalloc is swappable. Rik van Riel offered a clear explanation as to why it is not, "unswappable kernel memory is simpler and faster," adding, "there really is no good reason for swapping kernel memory nowadays." He went on to explain:
"Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.
"The data structures that grow with memory (mostly the mem_map array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64."
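Rik's figure is easy to sanity-check: with 4 KiB pages, the mem_map overhead is simply sizeof(struct page) divided by the page size. The struct page sizes below are illustrative assumptions (the real size varies by kernel version and config), but they show the overhead is on the order of one percent.

```python
# Back-of-the-envelope check of the ~1% mem_map figure.  The struct page
# sizes used here are assumed, not measured; one struct page is kept per
# physical page frame.

PAGE_SIZE = 4096                       # 4 KiB pages on x86/x86-64

def mem_map_overhead(struct_page_size):
    return struct_page_size / PAGE_SIZE

print(f"{mem_map_overhead(32):.2%}")   # 32-byte struct page -> 0.78%
print(f"{mem_map_overhead(64):.2%}")   # 64-byte struct page -> 1.56%
```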