"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler, and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged some slowdown in CFS compared to SD in one of the benchmarks, "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing he applied a low-resolution sched_clock() patch that resulted in better performance for CFS, leading him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-resolution sched_clock() patch was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
In a June 1992 posting to the linux-activists mailing list, Linus Torvalds described the original Linux scheduler, noting, "the scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes." A year later, Linus posted a more detailed description of the scheduler, noting, "the linux scheduling algorithm is one of the simplest ones possible". Comments in the original 254-line sched.c file read, "'schedule()' is the scheduler function. This is GOOD CODE! There probably won't be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc). The one thing you might take a look at is the signal-handler code here."
Comments in the current 6,709-line sched.c file show the first changes being made in 1996 by Dave Grothe, "to fix bugs in semaphores and make semaphores SMP safe". Two years later Andrea Arcangeli is credited with implementing "schedule_timeout() and related stuff". It was not until 2002, ten years after Linus' original code was written, that the scheduler received a complete rewrite, "new ultra-scalable O(1) scheduler by Ingo Molnar: hybrid priority-list and round-robin design with an array-switch method of distributing timeslices and per-CPU runqueues." Con Kolivas is credited with "interactivity tuning" in 2003, and Nick Piggin added "scheduler domains" in 2004. A more recent rewrite of the scheduler happened in April, again by Ingo Molnar, this time with his Completely Fair Scheduler.
During the recent debates about the Completely Fair Scheduler, Ingo Molnar explained why he rewrote the scheduler, "CFS started out as an experiment to simplify the scheduler, to clean up the after-effects of a better-desktop-scheduling patch Mike Galbraith sent me. Had anyone told me at that time that I'd end up writing a new scheduler I'd have laughed at the suggestion and I'd have pointed to the large number of pending patches of mine in forms of the -rt tree, the syslet/threadlet code and other stuff that needs fixing a lot more urgent than the task scheduler." Regarding the recent debate he added, "there was simply no code in existence before CFS which has proven the code simplicity/design virtues of 'fair scheduling' - SD was more of an argument _against_ it than for it. I think maybe even Con might have been surprised by that simplicity: in his first lkml reaction to CFS he also wrote that he finds the CFS code 'beautiful', and my reply to Con's mail still addresses a good number of points raised in this thread i think." Ingo also described his development style:
"I don't typically write code because I'm particularly 'convinced' about an idea or because I 'believe in' an idea, I mostly write code to _check_ whether an idea is worth advancing or not. Writing code is my form of 'thinking', and releasing patches is my form of telling others about my 'thoughts'. I might have guesses about how well something will work out in practice (and I'd certainly be a fool to go out coding blindly), but surprises happen almost always, both in positive and in negative direction, and even with relatively simple patches."
"People who think SD was 'perfect' were simply ignoring reality," Linus Torvalds began in a succinct explanation as to why he chose the CFS scheduler written by Ingo Molnar instead of the SD scheduler written by Con Kolivas. He continued, "sadly, that seemed to include Con too, which was one of the main reasons that I never [entertained] the notion of merging SD for very long at all: Con ended up arguing against people who reported problems, rather than trying to work with them." He went on to stress the importance of working toward a solution that is good for everyone, "that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own." He then offered some praise to Ingo, "as a long-term maintainer, trust me, I know what matters. And a person who can actually be bothered to follow up on problem reports is a *hell* of a lot more important than one who just argues with reporters." Linus went on to note a comparison between the two schedulers:
"I realize that this comes as a shock to some of the SD people, but I'm told that there was a university group that did some double-blind testing of the different schedulers - old, SD and CFS - and that everybody agreed that both SD and CFS were better than the old, but that there was no significant difference between SD and CFS."
Updating the pluggable scheduler patches for the 2.6.22 kernel, Peter Williams noted, "probably the last one now that CFS is in the main line". CFS author Ingo Molnar asked, "why is CFS in mainline a problem? The CFS merge should make the life of development/test patches like plugsched conceptually easier." Peter explained, "it means a major rewrite of the plugsched interface and I'm not sure that it's worth it (if CFS works well). However, note that I did say probably not definitely :-). I'll play with it and see what happens."
Ingo went on to suggest, "most of the schedulers in plugsched should be readily adaptable to the modular scheduling-policy scheme of the upstream scheduler. I'm sure there will be some minor issues as isolation of the modules is not enforced right now - and i'd be happy to review (and potentially apply) common-sense patches that improve the framework." Peter wasn't entirely convinced, but added, "I don't feel like converting staircase and nicksched as I have no real interest in them. Perhaps I'll just create the interface and some schedulers based on my own ideas and let others such as Con and Nick add schedulers if they're still that way inclined." In response to Ingo's offer to review the work he noted, "thanks for the offer of support (it may sway my decision)".
Another thread discussed potentially merging the swap prefetch patch into the mainline Linux kernel. Con Kolivas [story] started the thread saying, "I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please." Replying to an off-list message, Andrew Morton asked users of the patch, "please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc?"
Nick Piggin [interview] noted that he's still interested in better understanding and possibly fixing what's happening with swap and reclaim on the systems reporting a benefit from the swap-prefetch patch. He went on to note, "regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some)." Ingo Molnar followed up on these comments, acking the patch, "I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already [story]) We should include this in v2.6.23."
Ingo Molnar [interview]'s Completely Fair Scheduler [story] has been merged into the Linux kernel for inclusion in the upcoming 2.6.23 release. A comment in the patch titled 'sched: cfs core code' noted, "apply the CFS core code. This change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies." Another patch included documentation which described the new scheduler, "80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware." It goes on to explain:
"CFS's task picking logic is based on this p->wait_runtime value and it is thus very simple: it always tries to run the task with the largest p->wait_runtime value. In other words, CFS tries to run the task with the 'gravest need' for more CPU time. So CFS always tries to split up CPU time between runnable tasks as close to 'ideal multitasking hardware' as possible.
"Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers."
Con Kolivas [interview] continues to maintain the performance-oriented -ck patchset that he started in early 2004 [story], "this patchset is designed to improve system responsiveness and interactivity. It is configurable to any workload but the default -ck patch is aimed at the desktop and -cks is available with more emphasis on serverspace." In Con's latest release, 2.6.21-ck1, he notes that he has updated the patchset to include his improved SD cpu scheduler [story], "the staircase-deadline cpu scheduler has replaced the old staircase design in this version."
Con goes on to explain, "the staircase-deadline cpu scheduler can be set in either purely forward-looking mode for absolutely rigid fairness and cpu distribution according to nice level, or it can allow a small per-process history to smooth out cpu usage perturbations common in interactive tasks by enabling this sysctl. While small fairness issues can arise with this enabled, overall fairness is usually still strongly maintained and starvation is never possible. Enabling this can significantly smooth out 3d graphics and games." Swap prefetch [story] is also among the patches included in the -ck patchset.
Ingo Molnar [interview] reviewed Con Kolivas [interview]'s swap-prefetching patches [story] suggesting that they were ready for inclusion in the mainline kernel, "I've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up." He went on to describe swap prefetch, "to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.)"
Nick Piggin [interview] expressed some concern that the impact of the code still wasn't understood well enough, "I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyze them. It is really simple, but I haven't got any since first asking." Ingo noted that the patch has generated a lot of positive feedback from users and it would be best to merge it into the kernel, going on to suggest that it would be good to get more people actively involved, "really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable."
2.4 kernel maintainer [story] Willy Tarreau ran some tests to compare Con Kolivas [interview]'s Staircase Deadline CPU scheduler [story] with Ingo Molnar [interview]'s new Completely Fair Scheduler [story]. He summarized his experiences:
"I think that CFS is based on a more promising concept but is less mature and is dangerous right now with certain workloads. SD shows some strange behaviours like not using all CPU available and a little jerkyness, but is more robust and may be the less risky solution for a first step towards a better scheduler in mainline, but it may also probably be the last O(1) scheduler, which may be replaced sometime later when CFS (or any other one) shows at the same time the smoothness of CFS and the robustness of SD."
Some debate arose over logic added in CFS version 3 to automatically renice the X process for better GUI responsiveness. The CFS changelog comment labels the change as a usability fix explaining, "automatic renicing of kernel threads such as keventd, OOM tasks and tasks doing privileged hardware access (such as Xorg)." Ingo posted a standalone patch demonstrating how these processes are detected and automatically niced, offering it for inclusion into Con's Staircase scheduler. Willy concurred that it was a good idea, "I think it could be a good idea since you recommend to renice X with SD. Most of the problem users are facing with renicing X is that they need to change their configs or scripts. If the kernel can reliably detect X and handle it differently, why not do it ?" Con was less convinced, "hmm well I have tried my very best to do all the changes without changing 'policy' as much as possible since that trips over so many emotive issues that noone can agree on, and I don't have a strong opinion on this as I thought it would be better for it to be a config option for X in userspace instead. Either way it needs to be turned on/off by admin and doing it by default in the kernel is... not universally accepted as good."
Ingo Molnar [interview] released a new patchset titled the "Modular Scheduler Core and Completely Fair Scheduler". He explained, "this project is a complete rewrite of the Linux task scheduler. My goal is to address various feature requests and to fix deficiencies in the vanilla scheduler that were suggested/found in the past few years, both for desktop scheduling and for server scheduling workloads." The patchset introduces Scheduling Classes, "an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming about them too much." It also includes sched_fair.c with an implementation of the CFS desktop scheduler, "a replacement for the vanilla scheduler's SCHED_OTHER interactivity code," about which Ingo noted, "I'd like to give credit to Con Kolivas [interview] for the general approach here: he has proven via RSDL/SD that 'fair scheduling' is possible and that it results in better desktop scheduling. Kudos Con!"
Regarding the actual implementation, Ingo explained, "CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected). CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of 'timeslices' and has no heuristics whatsoever. There is only one central tunable, /proc/sys/kernel/sched_granularity_ns, which can be used to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads." He went on to note, "due to its design, the CFS scheduler is not prone to any of the 'attacks' that exist today against the heuristics of the stock scheduler".
Andrew Morton [interview] offered a list of patches in his mm tree, summarizing for each his plans as to whether or not they will be pushed to Linus for inclusion in the upcoming 2.6.17 kernel. Comments on the patches range from the simple "will merge" to pushing them to others for review. One of the more entertaining comments followed a set of 33 patches where Andrew noted, "This is Oleg's romp through the core kernel. There's a ton of material here. I'll probably send it all to Linus and ask him to review it. (aka blame-shifting)." Later in the thread he explained, "it's just a whole lot of code in areas which are tricky and in which few people work and in which reviewing resources are slight."
One set of patches refused with the comment, "still don't have a compelling argument for this, IMO" was Con Kolivas [interview]' swap prefetching efforts [story]. The feature was discussed in a couple of follow up threads. In response to some concerns raised by Jens Axboe, Con explained the implementation a little further, "If the system is idle it doesn't cost anything to bring those pages in (laptop mode disables any prefetching if you're thinking about power consumption on laptops). And if the system wants the ram that has been filled with prefetched pages wrongly, the prefetched pages are at the tail end of the inactive LRU list with a copy on backing store so if they're not accessed they'll be the first thing dropped in preference to anything else, without any I/O."
Con Kolivas [interview] posted a set of patches to the lkml offering a pluggable cpu scheduler framework. He began, "with the recent interest in varying the cpu schedulers in linux, this set of patches provides a modular framework for adding multiple boot-time selectable cpu schedulers. William Lee Irwin III [interview] came up with the original design and I based my patchset on that." The architecture independent framework is designed to allow new schedulers to be added by only touching three files, without adding overhead, and allowing you to compile in only the CPU scheduler(s) you need. Con explained, "this allows, for example, embedded hardware to have a tiny new scheduler that takes up minimal code space."
Designer of the 2.6 kernel's O(1) scheduler, Ingo Molnar [interview], had some reservations. He said, "my main worry with this approach is not really overhead but the impact on scheduler development. Right now there is a Linux scheduler that every developer (small-workload and large-workload people) tries to make as good as possible." This includes all corner cases, from the smallest embedded hardware to the largest NUMA hardware, with all their improvements benefiting the core CPU scheduler. Ingo added, "we want to make it _harder_ for specialized workloads to be handled in some 'specialized' way, because those precise workloads do show up in other workloads too, in a different manner. A fix made for NUMA or real-time purposes can easily make a difference for desktop workloads. Often 'specialized' is an excluse for a 'fundamentally broken, limited hack', especially in the scheduler world."
Con Kolivas, a practicing doctor in Australia, has written a benchmarking tool called ConTest which has proven to be tremendously useful to kernel developers, having been designed to compare the performance of different versions of the Linux kernel. He was kind enough to speak with us, explaining how he got started on this project, what makes his benchmark unique, and how to interpret the resulting output. Comparing the 2.5 development kernel to the 2.4 stable kernel, Con says, "a good 2.5 kernel (and that's not all of them) feels faster than 2.4 in most ways and this bodes well for 2.6." The interesting results from his frequent benchmarks back up this statement.
Con also describes his high performance patchset for the 2.4 stable kernel, currently at version 2.4.19-ck9. This patchset adds a number of performance boosting patches ideal for a desktop environment, such as the O(1) scheduler, kernel preemption, low latency and compressed caching. Read on for the full interview...