"The biggest user-visible change in -v19 is reworked sleeper fairness: it's similar in behavior to -v18 but works more consistently across nice levels. Fork-happy workloads (like kernel builds) should behave better as well. There are also a handful of speedups: unsigned math, 32-bit speedups, O(1) task pickup, debloating and other micro-optimizations."
Among the other changes found in this latest version of the scheduler patchset, Ingo noted, "merged the group-scheduling CFS-core changes from Srivatsa Vaddagiri [story]. This makes up for the bulk of the changes in -v19 but has no behavioral impact. The final group-fairness enabler patch is now a small and lean add-on patch to CFS."
Following a review of Ingo Molnar [interview]'s Completely Fair Scheduler [story], Srivatsa Vaddagiri posted a patch allowing the new scheduler to provide fairness at a per-group level rather than at a per-process level. He described the changes that he made and noted, "I have used 'uid' as the basis of grouping for the time being (since that grouping concept is already in mainline today). The patch can be adapted to a more generic process grouping mechanism later."
Ingo reacted to the patch favorably, "yeah, i like this alot." He went on to comment, "the 'struct sched_entity' abstraction looks very clean, and that's the main thing that matters: it allows for a design that will only cost us performance if group scheduling is desired." He went on to ask, "if you could do a -v14 port and at least add minimal SMP support: i.e. it shouldnt crash on SMP, but otherwise no extra load-balancing logic is needed for the first cut - then i could try to pick all these core changes up for -v15. (I'll let you know about any other thoughts/details when i do the integration.)"
"As I understand, fair_clock is a monotonously increasing clock which advances at a pace inversely proportional to the load on the runqueue," Srivatsa Vaddagiri explained in a review of Ingo Molnar [interview]'s CFS CPU scheduler [story], "if load = 1 (task), it will advance at same pace as wall clock, as load increases it advances slower than wall clock." He continued on to ask some questions about the choices made in CFS as compared to the EEVDF CPU scheduler [story]. In the resulting discussion, Ingo offered some insight into the design of the CFS. He began:
"80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware. 'Ideal multi-tasking CPU' is a (non-existent :-) CPU that has 100% physical power and which can run each task at precise equal speed, in parallel, each at 1/nr_running speed. For example: if there are 2 tasks running then it runs each at 50% physical power - totally in parallel.
"On real hardware, we can run only a single task at once, so while that one task runs the other tasks that are waiting for the CPU are at a disadvantage - the current task gets an unfair amount of CPU time. In CFS this fairness imbalance is expressed and tracked via the per-task p->wait_runtime (nanosec-unit) value. 'wait_runtime' is the amount of time the task should now run on the CPU for it become completely fair and balanced."
Ingo Molnar [interview] reviewed Con Kolivas [interview]'s swap-prefetching patches [story] suggesting that they were ready for inclusion in the mainline kernel, "I've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up." He went on to describe swap prefetch, "to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.)"
Nick Piggin [interview] expressed some concern that the impact of the code still wasn't understood well enough, "I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyze them. It is really simple, but I haven't got any since first asking." Ingo noted that the patch has generated a lot of positive feedback from users and it would be best to merge it into the kernel, going on to suggest that it would be good to get more people actively involved, "really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable."
Linux creator Linus Torvalds announced the release of the 2.6.21 kernel, "if the goal for 2.6.20 was to be a stable release (and it was), the goal for 2.6.21 is to have just survived the big timer-related changes and some of the other surprises (just as an example: we were apparently unlucky enough to hit what looks like a previously unknown hardware errata in one of the ethernet drivers that got updated etc)." Regarding the dynticks code which was merged in -rc1 [story] he said, "the big change during 2.6.21 is all the timer changes to support a tickless system (and even with ticks, more varied time sources). Thanks (when it no longer broke for lots of people ;) go to Thomas Gleixner and Ingo Molnar and a cadre of testers and coders." He went on to note, "of course, the timer stuff was just the most painful and core part (and thus the one that I remember most): there's a lot of changes all over. The appended changelog is just for the fixes since -rc7, so that doesn't look very impressive, the full changes since 2.6.20 are obviously a *lot* bigger (and you're better off reading the individual -rc changelogs)." Linus finished with a running joke about the many debates centered around current CPU scheduler efforts [story], quipping, "we now return you to your regular scheduler discussions".
2.4 kernel maintainer [story] Willy Tarreau ran some tests to compare Con Kolivas [interview]'s Staircase Deadline CPU scheduler [story] with Ingo Molnar [interview]'s new Completely Fair Scheduler [story]. He summarized his experiences:
"I think that CFS is based on a more promising concept but is less mature and is dangerous right now with certain workloads. SD shows some strange behaviours like not using all CPU available and a little jerkyness, but is more robust and may be the less risky solution for a first step towards a better scheduler in mainline, but it may also probably be the last O(1) scheduler, which may be replaced sometime later when CFS (or any other one) shows at the same time the smoothness of CFS and the robustness of SD."
Some debate arose over logic added in CFS version 3 to automatically renice the X process for better GUI responsiveness. The CFS changelog labels the change a usability fix, explaining, "automatic renicing of kernel threads such as keventd, OOM tasks and tasks doing privileged hardware access (such as Xorg)." Ingo posted a standalone patch demonstrating how these processes are detected and automatically niced, offering it for inclusion into Con's Staircase scheduler. Willy concurred that it was a good idea, "I think it could be a good idea since you recommend to renice X with SD. Most of the problem users are facing with renicing X is that they need to change their configs or scripts. If the kernel can reliably detect X and handle it differently, why not do it ?" Con was less convinced, "hmm well I have tried my very best to do all the changes without changing 'policy' as much as possible since that trips over so many emotive issues that noone can agree on, and I don't have a strong opinion on this as I thought it would be better for it to be a config option for X in userspace instead. Either way it needs to be turned on/off by admin and doing it by default in the kernel is... not universally accepted as good."
Ingo Molnar [interview] released a new patchset titled the "Modular Scheduler Core and Completely Fair Scheduler". He explained, "this project is a complete rewrite of the Linux task scheduler. My goal is to address various feature requests and to fix deficiencies in the vanilla scheduler that were suggested/found in the past few years, both for desktop scheduling and for server scheduling workloads." The patchset introduces Scheduling Classes, "an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming about them too much." It also includes sched_fair.c with an implementation of the CFS desktop scheduler, "a replacement for the vanilla scheduler's SCHED_OTHER interactivity code," about which Ingo noted, "I'd like to give credit to Con Kolivas [interview] for the general approach here: he has proven via RSDL/SD that 'fair scheduling' is possible and that it results in better desktop scheduling. Kudos Con!"
Regarding the actual implementation, Ingo explained, "CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected). CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of 'timeslices' and has no heuristics whatsoever. There is only one central tunable, /proc/sys/kernel/sched_granularity_ns, which can be used to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads." He went on to note, "due to its design, the CFS scheduler is not prone to any of the 'attacks' that exist today against the heuristics of the stock scheduler".
Announcing the third version of his syslets subsystem patches [story], Ingo Molnar [interview] noted that he has implemented many fundamental changes to the code including the introduction of threadlets, "'threadlets' are basically the user-space equivalent of syslets: small functions of execution that the kernel attempts to execute without scheduling. If the threadlet blocks, the kernel creates a real thread from it, and execution continues in that thread. The 'head' context (the context that never blocks) returns to the original function that called the threadlet." As threadlets are only moved into a separate thread context if they block, Ingo refers to them as 'optional threads'. He also describes them as 'on-demand parallelism', "user-space does not have to worry about setting up, sizing and feeding a thread pool - the kernel will execute the workload in a single-threaded manner as long as it makes sense, but once the context blocks, a parallel context is created. So parallelism inside applications is utilized in a natural way."
Ingo goes on to note that the syslet code and API has been significantly enhanced in this latest release, "the v3 code is ABI-incompatible with v2, due to these fundamental changes." He adds, "syslets (small, kernel-side, scripted 'syscall plugins') are still supported - they are (much...) harder to program than threadlets but they allow the highest performance. Core infrastructure libraries like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already includes support for v2 syslets, and the following patch updates FIO to the v3 API".
Ingo Molnar [interview] posted a second version of his syslets subsystem patch set, which offers asynchronous system call support [story]. He noted that the effort is a work in progress, and that there are still outstanding issues to be fixed, "the biggest conceptual change in v2 is the ability of cachemiss threads to be turned into user threads. This fixes signal handling, makes them ptrace-eable, etc," going on to list numerous fixes since the first release. He noted that prior to releasing a third version of the patch set he will add support for multiple completion rings, add logic to share the 'spare thread' between the rings to further reduce startup costs, and remove reliance on mlock().
Linus Torvalds commented, "I'm still not a huge fan of the user space interface, but at least the core code looks quite clean. No objections on that front." He referred to earlier comments in which he had reacted strongly to the syslets userland interface saying, "I dislike it intensely, because it's so _close_ to being usable. But the programming interface looks absolutely horrid for any 'casual' use, and while the loops etc look like fun, I think they are likely to be less than useful in practice. Yeah, you can do the 'setup and teardown' just once, but it ends up being 'once per user', and it ends up being a lot of stuff to do for somebody who wants to just do some simple async stuff." He later noted that he was in particular concerned with the "register" functionality, which Ingo then simplified.
Ingo Molnar [interview] posted a set of 11 patches introducing "the first release of the 'Syslet' kernel feature and kernel subsystem, which provides generic asynchronous system call support". Ingo explains:
"Syslets are small, simple, lightweight programs (consisting of system-calls, 'atoms') that the kernel can execute autonomously (and, not the least, asynchronously), without having to exit back into user-space. Syslets can be freely constructed and submitted by any unprivileged user-space context - and they have access to all the resources (and only those resources) that the original context has access to."
Ingo goes on in his email to explain in greater detail how syslets work, then adds, "as it might be obvious to some of you, the syslet subsystem takes many ideas and experience from my Tux in-kernel webserver :) The syslet code originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss infrastructure." He also offered some benchmark results, showing a 33.9% speedup comparing uncached synchronous IO to syslets, and a 19.2% speedup comparing cached synchronous IO to syslets, "so syslets, in this particular workload, are a nice speedup /both/ in the uncached and in the cached case. (note that i used only a single disk, so the level of parallelism in the hardware is quite limited.)"
A new feature that will first be available in the upcoming 2.6.20 kernel is KVM, a Kernel-based Virtual Machine. The project's webpage describes KVM as, "a full virtualization solution for Linux on x86 hardware. It consists of a loadable kernel module (kvm.ko) and a userspace component. Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc." The project's FAQ explains that the functionality requires "an x86 machine running a recent Linux kernel on an Intel processor with VT (virtualization technology) extensions, or an AMD processor with SVM extensions (also called AMD-V)." The userland aspect of KVM is a slightly modified version of qemu, used to instantiate the virtual machine.
Ingo Molnar [interview] announced a new patch introducing paravirtualization support for KVM, outdating the KVM FAQ which in comparing KVM to Xen notes, "Xen supports both full virtualization and a technique called paravirtualization, which allows better performance for modified guests. kvm does not at present support paravirtualization." In describing his patch which is against the 2.6.20-rc3 + KVM trunk kernel, Ingo said it, "includes support for the hardware cr3-cache feature of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)". He went on to add, "some aspects of the code are still a bit ad-hoc and incomplete, but the code is stable enough in my testing and i'd like to have some feedback." In a series of benchmarks, he found 2-task context switch performance to be improved by a factor of four, while "hackbench 1" showed twice as good performance, and "hackbench 5" showed a 30% improvement. His email goes on to detail how the paravirtualization works.
Linux creator Linus Torvalds announced the first release candidate for the upcoming 2.6.18 kernel, "the merge window for 2.6.18 is closed, and -rc1 is out there". He noted that the changes are extensive, "the changes are too big for the mailing list, even just the shortlog. As usual, lots of stuff happened. Most architectures got updated, ACPI updates, networking, SCSI and sound, IDE, infiniband, input, DVB etc etc etc." Find the shortlog here. Linus went on to describe additional changes:
"There's also a fair amount of basic infrastructure updates here, with things like the generic IRQ layer, the lockdep (oh, and priority-inheritance mutex support) stuff by Ingo &co, some generic timekeeping infrastructure ('clocksource') stuff, memory hotplug (and page migration) support, etc etc."
Thomas Gleixner and Ingo Molnar [interview] posted an update of their high-res timers kernel patches for the 2.6.17 kernel, "upon which we based a tickless kernel (dyntick) implementation and a 'dynamic HZ' feature as well". The patch currently works for x86, with ports to x86_64, PPC and ARM in the works. Thomas explains, "the high-res timers feature (CONFIG_HIGH_RES_TIMERS) enables POSIX timers and nanosleep() to be as accurate as the hardware allows (around 1usec on typical hardware). This feature is transparent - if enabled it just makes these timers much more accurate than the current HZ resolution." He goes on to describe the tickless kernel:
"The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer interrupts: if there is no timer to be expired for say 1.5 seconds when the system goes idle, then the system will stay totally idle for 1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ to 1-2 timer interrupts per second.
"This feature is implemented by driving 'low res timer wheel' processing via special per-CPU high-res timers, which timers are reprogrammed to the next-low-res-timer-expires interval. This tickless-kernel design is SMP-safe in a natural way and has been developed on SMP systems from the beginning."
Con Kolivas [interview] posted a set of patches to the lkml offering a pluggable cpu scheduler framework. He began, "with the recent interest in varying the cpu schedulers in linux, this set of patches provides a modular framework for adding multiple boot-time selectable cpu schedulers. William Lee Irwin III [interview] came up with the original design and I based my patchset on that." The architecture independent framework is designed to allow new schedulers to be added by only touching three files, without adding overhead, and allowing you to compile in only the CPU scheduler(s) you need. Con explained, "this allows, for example, embedded hardware to have a tiny new scheduler that takes up minimal code space."
Designer of the 2.6 kernel's O(1) scheduler, Ingo Molnar [interview], had some reservations. He said, "my main worry with this approach is not really overhead but the impact on scheduler development. Right now there is a Linux scheduler that every developer (small-workload and large-workload people) tries to make as good as possible." This includes all corner cases, from the smallest embedded hardware to the largest NUMA hardware, with all their improvements benefiting the core CPU scheduler. Ingo added, "we want to make it _harder_ for specialized workloads to be handled in some 'specialized' way, because those precise workloads do show up in other workloads too, in a different manner. A fix made for NUMA or real-time purposes can easily make a difference for desktop workloads. Often 'specialized' is an excuse for a 'fundamentally broken, limited hack', especially in the scheduler world."
With the release of 2.6.9-mm1, Andrew Morton [interview] offered a quick status update on a number of patches in his -mm tree [forum] that are 2.6-mainline hopefuls. For example, regarding the much debated reiser4 filesystem [story], Andrew said that he is still "not sure, really. The namespace extensions were disabled, although all the code for that is still present. Linus's filesystem criterion used to be 'once lots of people are using it, preferably when vendors are shipping it'. That's a bit of a chicken and egg thing though. Needs more discussion". And as for Ingo Molnar [interview]'s preemption and low-latency fixups [forum] Andrew offered, "I haven't really thought about it and haven't looked at the patches yet. Hopefully 2.6.10 material."
Other projects specifically mentioned include the sysfs backing store, the ext3 reservations code, the ext3 resize code, kexec and crashdump [story], perfctr, cachefs, cpusets, and the md updates. Read on for Andrew's comments and the complete -mm1 changelog.