Jens Axboe detailed the changes in his linux-2.6-block.git tree that he plans to merge into the upcoming 2.6.24 kernel. Among the changes were the updates needed to enable SG chaining, which is used for large IO commands, "the goal of sg chaining is to allow support for very large sgtables, without requiring that they be allocated from one contiguous piece of memory." Andrew Morton asked for more information, "presumably sg chaining means more overhead on the IO submission paths? If so, has this been quantified?"
Jens explained that there is no overhead for existing logic which doesn't use sg chaining, "just cleanups to drivers to use for_each_sg() and so on." He continued:
"For actually using the sg chaining, there's some overhead of course. Say we support 256 entries without chaining, or 1mb with 4kb pages. A request with 1000 entries would require 4 trips to the allocator to allocate the chainable lists and 4 trips when freeing that list again. We don't loop the sg list on setup or freeing, just jump to the correct locations. So even for chaining, the cost isn't that big. It enables us to support much larger IO commands and potentially speed up some devices quite a lot, so CPU cost is less of a concern. And for small sglists, there isn't a noticeable overhead."
"I've long hated the non-killability of tasks accessing a dead NFS server," Matthew Wilcox said, introducing a prototype patch that fixes the issue and is based on a 2002 posting by Linus Torvalds. Matthew added, "I've only added one real user of the killable concept to this patch -- try_lock_page(). However, this is enough for 'cat */*/*' to be killable with a ^C when I unplug the ethernet cord between it and the nfs server."
Linus responded favorably to the patch, "hey, I obviously approve. And the patch looks simple." He went on to suggest that he was interested in merging the patch during the next merge window, "feel free to re-submit after 2.6.23 is out the door, I don't think anybody will really complain. Any NFS user will know why something like this can be really nice."
Following a recent merge request, Linus Torvalds stressed that he was serious about not wanting to merge any big changes after the merge window closes, "get the changes in before -rc1, or just *wait*. If they aren't ready before the merge window opens, they simply shouldn't be merged at all." Jeff Garzik reiterated, "once -rc1 is out there, that means the focus should be on stabilizing the existing codebase. Pushing a big driver update means that effort must restart from scratch. We just don't want to go down that road, which is a big reason for the merge window in general." Further, when it was noted that the recent changes had been heavily tested by the vendor, Jeff stressed the importance of community testing:
"Take a lesson from when I was on Linus's shit-list... twice: Twice, Intel submitted an e1000 update after the merge window closed. Twice, they claimed the driver passed their quite-exhaustive internal testing. And twice, the most popular network driver broke for large masses of users because I took a hardware vendor's word on testing rather than rely on the testing PROVEN to flush out problems: public linux kernel testing.
"I'm not singling out Intel, there are plenty of other hardware vendors that repeat the exact same pattern."
As expected, Linus Torvalds released the 2.6.23-rc1 kernel two weeks after the release of 2.6.22, ending the merge window, "and it has a *ton* of changes as usual for the merge window, way too much for me to be able to post even just the shortlog or diffstat on the mailing list". He noted, "I personally like how 'sendfile' is now totally gone internally, and the kernel now ends up doing all that with splice instead. Good riddance, although we'll obviously end up supporting the old user level interfaces for a long time." Linus went on to summarize the other changes:
"Lots of architecture updates (for just about all of them - x86[-64], arm, alpha, mips, ia64, powerpc, s390, sh, sparc, um..), lots of driver updates (again, all over - usb, net, dvb, ide, sata, scsi, isdn, infiniband, firewire, i2c, you name it).
"Filesystems, VM, networking, ACPI, it's all there. And virtualization all over the place (kvm, lguest, Xen).
"Notable new things might be the merge of the cfs scheduler, and the UIO driver infrastructure might interest some people."
The Xen virtual machine monitor was recently merged into the upcoming 2.6.23 Linux kernel in a series of patches from Jeremy Fitzhardinge. Xen began as a research project at the University of Cambridge, and it has been repeatedly discussed as a merge candidate for the mainline Linux kernel.
Xen is described in the project's FAQ as:
"Xen is a virtual machine monitor (VMM) for x86-compatible computers. Xen can securely execute multiple virtual machines, each running its own OS, on a single physical system with close-to-native performance."
Rusty Russell's lguest was recently merged into the upcoming 2.6.23 Linux kernel. The merge comment describes the project, "lguest is a simple hypervisor for Linux on Linux. Unlike kvm it doesn't need VT/SVM hardware. Unlike Xen it's simply 'modprobe and go'. Unlike both, it's 5000 lines and self-contained." The comment goes on to note:
"Performance is ok, but not great (-30% on kernel compile). But given its hackability, I expect this to improve, along with the paravirt_ops code which it supplies a complete example for. There's also a 64-bit version being worked on and other craziness.
"But most of all, lguest is awesome fun! Too much of the kernel is a big ball of hair. lguest is simple enough to dive into and hack, plus has some warts which scream 'fork me!'."
Two new documentation directories were merged into the upcoming 2.6.23 mainline kernel, containing translations of the HOWTO and stable_api_nonsense.txt documents in Japanese and Chinese. Greg KH explained, "here are some patches that add some translations of some procedural documentation files to the Documentation/ tree." Regarding some of the concerns that were expressed with merging translated documentation into the mainline kernel tarball, Greg noted, "these files change _very_ slowly over time, and are quite easy to keep up to date by the translators." He added:
"I know that kernel development is in English, but translations of a small subset of documentation files that go over procedures and how to get involved in the community is something that I feel is important and will bring in more developers in the end. Having these files in the kernel tree is a good way to keep a central location that all can see and easily find, instead of hiding them away on different web sites that might be harder to update by anyone who needs to do so."
In response to another merge request, Andrew Morton retorted, "argh. I have a backlog of maybe 300 patches here which I am cheerfully ignoring while concentrating on preventing 2.6.23 from being less of a disaster than it has already been." He noted that he was not planning to merge any new code into his -mm tree for 2.6.23 inclusion, "the door for new 2.6.23 material shut two weeks ago. Here, at least." He went on to note:
"Please, stop writing patches. Maybe do something to help get 2.6.23 off its back. Like, go review some of the code which people are cheerfully merging five minutes after having written it."
Recent merges into the upcoming 2.6.23 kernel can be found by browsing the gitweb interface to Linus' 2.6 kernel tree. The 2.6.23-rc1 kernel should be released on or shortly after Sunday the 22nd, two weeks after 2.6.22 was released, at which point the merge window closes.
A recently merged KVM patchset included support for guest SMP, various performance improvements, and suspend/resume fixes. KVM stands for Kernel-based Virtual Machine, "a full virtualization solution for Linux on x86 hardware containing virtualization extensions". In regards to the recently merged guest SMP support which will be part of the upcoming 2.6.23 kernel, Avi Kivity noted:
"Guest smp is fully operational. Kernel build on 2-way smp is 40% faster than on a up guest. Expect significant performance improvements from in-kernel apic and from further tuning."
H. Peter Anvin submitted a series of patches rewriting the x86 setup code, "this patch set replaces the x86 setup code, which is currently all in assembly, with a version written in C, using the '.code16gcc' feature of binutils (which has been present since at least 2001.)" He went on to explain why he did this, "the new code is vastly easier to read, and, I hope, debug. It should be noted that I found a fair number of minor bugs while going through this code, and have attempted to correct them."
Linus Torvalds reacted favorably, "I can't really argue against this on any sane grounds - not only is it removing more lines than it adds, but moving from mostly unreadable assembly to C seems a good idea." He went on to note, "so let's just get this merged. But the question is, do we put it in 2.6.23-rc1, or do we put it in -mm for a few weeks, which would imply waiting for the next merge window? Andrew?" Andrew Morton pointed out that the patches have been in -mm already for a couple of months, "this code has been in -mm since 11 May, as git-newsetup.patch. It has caused (for what it is) astonishingly few problems. Maybe a couple of build glitches and one runtime failure, all quickly fixed. I'd say it's ready." Linus agreed, "Ok. That makes it easy. I'll just merge it."
Another thread discussed potentially merging the swap prefetch patch into the mainline Linux kernel. Con Kolivas started the thread, saying, "I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please." Replying to an off-list message, Andrew Morton asked users of the patch, "please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc?"
Nick Piggin noted that he's still interested in better understanding and possibly fixing what's happening with swap and reclaim on the systems reporting a benefit from the swap-prefetch patch. He went on to note, "regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some)." Ingo Molnar followed up on these comments, acking the patch, "I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already) We should include this in v2.6.23."
Ingo Molnar's Completely Fair Scheduler has been merged into the Linux kernel for inclusion in the upcoming 2.6.23 release. A comment in the patch titled 'sched: cfs core code' noted, "apply the CFS core code. This change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies." Another patch included documentation which described the new scheduler, "80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware." It goes on to explain:
"CFS's task picking logic is based on this p->wait_runtime value and it is thus very simple: it always tries to run the task with the largest p->wait_runtime value. In other words, CFS tries to run the task with the 'gravest need' for more CPU time. So CFS always tries to split up CPU time between runnable tasks as close to 'ideal multitasking hardware' as possible.
"Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers."
Following the release of the 2.6.22 kernel, Andrew Morton posted a list of a wide range of patches that are in his -mm kernel, summarizing for each his plans as to whether or not they will be pushed upstream for inclusion in the upcoming 2.6.23 kernel. Comments included simply noting "merge" or "hold", as well as "these appear to need some work," "don't know, need to ping suitable developers over this work," and "sent to maintainer." Perhaps most entertaining was Andrew's response to the vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch, "this is scary. Will sit and admire it until it has been demonstrated to be a net gain." It is possible to track which patches are actually merged using the gitweb interface to Linus' kernel tree.
"Ok, the merge window has closed, and 2.6.22-rc1 is out there," Linus Torvalds announced on the Linux Kernel Mailing List. He noted that there were a large number of changes, "almost seven thousand files changed, and that's not double-counting the files that got moved around." As to what was changed, Linus summarized, "architecture updates, drivers, filesystems, networking, security, build scripts, reorganizations, cleanups.. You name it, it's there." He went on to add:
"You want a new firewire stack? We've got it. New wireless networking infrastructure? Check. New infiniband drivers? Digital video drivers? A totally new CPU architecture (blackfin)? Check, check, check.
"That said, I think (and certainly hope) that this will not be nearly as painful as the big fundamental timer changes for 2.6.21, and while there are some pretty core changes there (like the new SLUB allocator, which hopefully will end up replacing both SLAB and SLOB), it feels pretty solid, and not as scary as ripping the carpet from under the timer infrastructure."
Following up on feedback on his merge plans, Andrew Morton posted an updated summary of what he is pushing upstream for inclusion in the upcoming 2.6.22 kernel. His list included, "a few serial bits, a few pcmcia bits, one little security patch, the blackfin architecture, small h8300 update, small alpha update, swsusp updates, m68k bits, and lots of UML updates." He also noted that he'll push some of the memory management queue including, "an enhancement to /proc/pid/smaps to permit monitoring of a running program's working set. The SLUB allocator, it's pretty green but I do want to push ahead with this pretty aggressively with a view to replacing slab altogether. Generic pagetable quicklist management. We have x86_64 and ia64 and sparc64 implementations, but I'll only include David's sparc64 implementation here. I'll send the x86_64 and ia64 implementations through maintainers."