Ingo Molnar announced version 20 of his Completely Fair Scheduler patchset, offering further cleanups for the new scheduler code that will be part of the upcoming 2.6.23 kernel: "there have been lots of small regression fixes, speedups, debug enhancements and tidy-ups - many of which can be user-visible." Ingo went on to summarize:
"There are nearly 100 changes - they do add up to a significant total linecount change. There was no crash bug or hang bug found in the CFS code since v19 was released. (in fact the last crash/hang bug in CFS was found and fixed in v7, more than 3 months ago, and even that crash only happened in an uncommon sw-suspend setup, not during normal use. So CFS has turned out to be a pretty robust codebase.) Nevertheless, if you had any problems (performance or behavioral) with v19 it's worth checking v20 out - and if v19 worked great for you it's worth checking out that v20 still works great =B-)"
In a June 1992 posting to the linux-activists mailing list, Linus Torvalds described the original Linux scheduler, noting, "the scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes." A year later, Linus posted a more detailed description of the scheduler, noting, "the linux scheduling algorithm is one of the simplest ones possible". Comments in the original 254-line sched.c file read, "'schedule()' is the scheduler function. This is GOOD CODE! There probably won't be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc). The one thing you might take a look at is the signal-handler code here."
Comments in the current 6,709-line sched.c file show the first changes being made in 1996 by Dave Grothe, "to fix bugs in semaphores and make semaphores SMP safe". Two years later, Andrea Arcangeli is credited with implementing "schedule_timeout() and related stuff". It was not until 2002, ten years after Linus' original code was written, that the scheduler received a complete rewrite: "new ultra-scalable O(1) scheduler by Ingo Molnar: hybrid priority-list and round-robin design with an array-switch method of distributing timeslices and per-CPU runqueues." Con Kolivas is credited with "interactivity tuning" in 2003, and Nick Piggin added "scheduler domains" in 2004. A more recent rewrite of the scheduler happened in April, again by Ingo Molnar, this time with his Completely Fair Scheduler.
Ingo Molnar pushed a series of patches to his Completely Fair Scheduler code that were then merged into the mainline kernel. He explained the reason for so many small patches, "the main reason is the safe and gradual elimination of a widely used 64-bit function argument: the 64-bit 'now' timestamp." In addition to the many small patches removing the 64-bit "now" timestamp, he noted:
"These changes are necessary for 3 reasons: firstly they address the 'there's too much 64-bit stuff in the scheduler' observation. Secondly, it's not directly visible but these changes also act as a correctness fix for an obscure (and minor) but not-too-pretty-to-fix accounting bug: idle_balance() had its own internal notion of 'now', separate from that of schedule(). Thirdly, this debloats sched.o quite significantly."
In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way." He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads." The discussion then looked at using the relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified." Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve relatime, and observed of atime itself:
"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
Nick Piggin used 'git bisect' to track an lmbench regression to the main CFS commit, leading to an interesting discussion between Nick and Ingo Molnar. Ultimately the regression was tracked down to the temporary configurability of the scheduler while it is tuned for optimal performance, "one reason for the extra overhead is the current tunability of CFS, but that is not fundamental, it's caused by the many knobs that CFS has at the moment." The solution, already coded but not yet merged into the mainline kernel, "changes those knobs to constants, allowing the compiler to optimize the math better and reduce code size," and as a result, "CFS can be faster at micro-context-switching than 2.6.22."
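The "knobs to constants" idea can be pictured with a sketch like the following (the config symbol and value here are illustrative assumptions, not the real kernel names): during the tuning period a parameter is a runtime-tunable variable, and afterwards it becomes a compile-time constant the compiler can fold into the scheduler math:

    /* hypothetical example; CONFIG_SCHED_TUNABLE and the value are
     * illustrative, not actual kernel symbols */
    #ifdef CONFIG_SCHED_TUNABLE
    unsigned int sysctl_sched_granularity = 2000000; /* ns, set via sysctl */
    #define sched_granularity sysctl_sched_granularity
    #else
    #define sched_granularity 2000000U /* constant: foldable, smaller code */
    #endif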
Ingo described the lmbench configuration in question as a "micro-benchmark", and noted that the performance improvement was more pronounced with macro-benchmarks, "because with CFS the _quality_ of scheduling decisions has increased. So even if we had increased micro-costs (which we wont have once the current tuning period is over and we cast the CFS parameters into constants), the quality of macro-scheduling can offset that, and not only on the desktop!" He summarized, "that's why our main focus in CFS was on the macro-properties of scheduling _first_, and then the micro-properties are adjusted to the macro-constraints as a second layer."
Among the concerns expressed about the Completely Fair Scheduler were reports that it might not handle 3D games as well as the SD scheduler. In a recent thread, Ingo Molnar noted, "people are regularly testing 3D smoothness, and they find CFS good enough and that matches my experience as well (as limited as it may be). In general my impression is that CFS and SD are roughly on par when it comes to 3D smoothness." He noted that all known regressions had been reported against earlier versions of CFS and had long since been fixed, and that he was very interested in any new reports of regressions against the current version of the code, "what is more interesting (to me) is not the positive CFS feedback but negative CFS feedback (although positive feedback certain _feels_ good so don't hold it back intentionally ;-)," adding, "there are no open 3D related regressions for CFS at the moment." Ingo then offered benchmarks illustrating the 3D performance of CFS, with numbers showing it to perform as well as, and in some cases considerably better than, the SD scheduler.
Linus Torvalds noted, "I don't think _any_ scheduler is perfect, and almost all of the time, the RightAnswer(tm) ends up being not 'one or the other', but 'somewhere in between'." He noted that he was confident that he'd made the right decision in merging CFS, then added, "but at the same time, no technical decision is ever written in stone. It's all a balancing act. I've replaced the scheduler before, I'm 100% sure we'll replace it again. Schedulers are actually not at all that important in the end: they are a very very small detail in the kernel."
During the recent debates about the Completely Fair Scheduler, Ingo Molnar explained why he rewrote the scheduler, "CFS started out as an experiment to simplify the scheduler, to clean up the after-effects of a better-desktop-scheduling patch Mike Galbraith sent me. Had anyone told me at that time that I'd end up writing a new scheduler I'd have laughed at the suggestion and I'd have pointed to the large number of pending patches of mine in forms of the -rt tree, the syslet/threadlet code and other stuff that needs fixing a lot more urgent than the task scheduler." Regarding the recent debate he added, "there was simply no code in existence before CFS which has proven the code simplicity/design virtues of 'fair scheduling' - SD was more of an argument _against_ it than for it. I think maybe even Con might have been surprised by that simplicity: in his first lkml reaction to CFS he also wrote that he finds the CFS code 'beautiful', and my reply to Con's mail still addresses a good number of points raised in this thread i think." Ingo also described his development style:
"I don't typically write code because I'm particularly 'convinced' about an idea or because I 'believe in' an idea, I mostly write code to _check_ whether an idea is worth advancing or not. Writing code is my form of 'thinking', and releasing patches is my form of telling others about my 'thoughts'. I might have guesses about how well something will work out in practice (and I'd certainly be a fool to go out coding blindly), but surprises happen almost always, both in positive and in negative direction, and even with relatively simple patches."
"People who think SD was 'perfect' were simply ignoring reality," Linus Torvalds began in a succinct explanation as to why he chose the CFS scheduler written by Ingo Molnar instead of the SD scheduler written by Con Kolivas. He continued, "sadly, that seemed to include Con too, which was one of the main reasons that I never [entertained] the notion of merging SD for very long at all: Con ended up arguing against people who reported problems, rather than trying to work with them." He went on to stress the importance of working toward a solution that is good for everyone, "that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own." He then offered some praise to Ingo, "as a long-term maintainer, trust me, I know what matters. And a person who can actually be bothered to follow up on problem reports is a *hell* of a lot more important than one who just argues with reporters." Linus went on to note a comparison between the two schedulers:
"I realize that this comes as a shock to some of the SD people, but I'm told that there was a university group that did some double-blind testing of the different schedulers - old, SD and CFS - and that everybody agreed that both SD and CFS were better than the old, but that there was no significant difference between SD and CFS."
In a continued thread about how the recently merged Completely Fair Scheduler affects the nice command, Ingo Molnar offered a history of nice levels in the Linux kernel. He began by describing the three most frequent complaints he has received: the first was that "nice levels were always so weak under Linux that people continuously bugged me about making nice +19 tasks use up much less CPU time"; the second was "the fact that nice level behavior depended on the _absolute_ nice level as well, while the nice API itself is fundamentally 'relative'"; and the third was that "negative nice levels were not 'punchy enough', so lots of people had to resort to run audio (and other multimedia) apps under RT priorities such as SCHED_FIFO."
Ingo then noted, "CFS addresses all three types of complaints". For the first complaint he noted that he "decoupled the scheduler from 'time slice' and HZ concepts (and made granularity a separate concept from nice levels) and thus CFS was able to implement better and more consistent nice +19 support: now in CFS nice +19 tasks get a HZ-independent 1.5%, instead of the variable 3%-5%-9% range they got in the old scheduler." For the second type of complaint he "made nice(1) have the same CPU utilization effect on tasks, regardless of their absolute nice levels. So on CFS, running a nice +10 and a nice +11 task has the same CPU utilization 'split' between them as running a nice -5 and a nice -4 task. (one will get 55% of the CPU, the other 45%.)" And the third type of complaint "is addressed by CFS almost automatically: stronger negative nice levels are an automatic side-effect of the recalibrated dynamic range of nice levels."
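The "relative" behavior in the second point falls out naturally if each nice level scales a task's weight by a fixed ratio, because the CPU split between two tasks then depends only on the difference between their nice levels. A small sketch, assuming the roughly 1.25x-per-level weight ratio that Ingo's numbers imply (the kernel's actual weight table may differ slightly):

    #include <math.h>
    #include <stdio.h>

    /* CPU share of task a when competing against task b, with weight
     * falling by an assumed factor of 1.25 per nice level */
    static double share(int nice_a, int nice_b)
    {
            double wa = pow(1.25, -nice_a);
            double wb = pow(1.25, -nice_b);
            return wa / (wa + wb);
    }

    int main(void)
    {
            /* both pairs print the same split, roughly the 55%/45%
             * figure Ingo quotes */
            printf("+10 vs +11: %.1f%% / %.1f%%\n",
                   100 * share(10, 11), 100 * share(11, 10));
            printf(" -5 vs  -4: %.1f%% / %.1f%%\n",
                   100 * share(-5, -4), 100 * share(-4, -5));
            return 0;
    }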
Updating the pluggable scheduler patches for the 2.6.22 kernel, Peter Williams noted, "probably the last one now that CFS is in the main line". CFS author Ingo Molnar asked, "why is CFS in mainline a problem? The CFS merge should make the life of development/test patches like plugsched conceptually easier." Peter explained, "it means a major rewrite of the plugsched interface and I'm not sure that it's worth it (if CFS works well). However, note that I did say probably not definitely :-). I'll play with it and see what happens."
Ingo went on to suggest, "most of the schedulers in plugsched should be readily adaptable to the modular scheduling-policy scheme of the upstream scheduler. I'm sure there will be some minor issues as isolation of the modules is not enforced right now - and i'd be happy to review (and potentially apply) common-sense patches that improve the framework." Peter wasn't entirely convinced, but added, "I don't feel like converting staircase and nicksched as I have no real interest in them. Perhaps I'll just create the interface and some schedulers based on my own ideas and let others such as Con and Nick add schedulers if they're still that way inclined." In response to Ingo's offer to review the work he noted, "thanks for the offer of support (it may sway my decision)".
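The "modular scheduling-policy scheme" Ingo refers to can be pictured as a table of per-policy hooks that the core scheduler dispatches through. A bare sketch with illustrative names (the upstream interface has more operations than shown here):

    struct rq;
    struct task_struct;

    /* each scheduling policy (fair, rt, idle, ...) supplies its own
     * implementations, and the core scheduler calls through this table */
    struct sched_policy_ops {
            void (*enqueue_task)(struct rq *rq, struct task_struct *p);
            void (*dequeue_task)(struct rq *rq, struct task_struct *p);
            struct task_struct *(*pick_next_task)(struct rq *rq);
    };

Under such a scheme, a plugsched-style scheduler becomes one more set of hooks rather than a replacement for the scheduler core.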
The recently merged Completely Fair Scheduler changes how the Linux kernel handles scheduling priorities set with the nice command. Ingo Molnar explained that each level of nice adds or subtracts 10% of CPU utilization, "the '10% effect' is relative and cumulative: from _any_ nice level, if you go up 1 level, it's -10% CPU usage, if you go down 1 level it's +10% CPU usage." Ingo noted that with the earlier scheduler the nice level was tied to HZ, offering three examples in which HZ is set to 100, 250, and 300, "a nice +19 task (the most commonly used nice level in practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler, depending on the value of HZ. This is quite inconsistent and illogical. This HZ dependency of nice levels existed for many years, and the new scheduler solves that inconsistency - every nice level will get the same amount of time, regardless of HZ."
Ingo went on to offer a table comparing positive nice values to the percentage of the CPU that a process gets in relation to a 'nice 0' process:

    nice  0: 100.00%    nice  5: 32.78%    nice 10: 10.74%    nice 15: 3.51%
    nice  1:  80.00%    nice  6: 26.24%    nice 11:  8.59%    nice 16: 2.81%
    nice  2:  64.10%    nice  7: 21.00%    nice 12:  6.87%    nice 17: 2.25%
    nice  3:  51.28%    nice  8: 16.77%    nice 13:  5.50%    nice 18: 1.80%
    nice  4:  40.98%    nice  9: 13.42%    nice 14:  4.39%    nice 19: 1.44%

He offered another table comparing negative nice values to a 'nice -20' process:

    nice   0:  1.15%    nice  -7:  5.50%    nice -14:  26.24%
    nice  -1:  1.44%    nice  -8:  6.87%    nice -15:  32.78%
    nice  -2:  1.80%    nice  -9:  8.59%    nice -16:  40.98%
    nice  -3:  2.25%    nice -10: 10.74%    nice -17:  51.28%
    nice  -4:  2.81%    nice -11: 13.42%    nice -18:  64.10%
    nice  -5:  3.51%    nice -12: 16.77%    nice -19:  80.00%
    nice  -6:  4.39%    nice -13: 21.00%    nice -20: 100.00%
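Both tables are close to a simple geometric progression. A sketch that approximately reproduces them, again assuming a 1.25x weight ratio per nice level (an assumption inferred from the quoted figures; the kernel uses a precomputed weight table, so a few of its entries differ slightly):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            /* positive levels, relative to a nice 0 task */
            for (int nice = 0; nice <= 19; nice++)
                    printf("nice %2d: %6.2f%%\n",
                           nice, 100.0 / pow(1.25, nice));

            /* negative levels, relative to a nice -20 task */
            for (int nice = 0; nice >= -20; nice--)
                    printf("nice %3d: %6.2f%%\n",
                           nice, 100.0 / pow(1.25, nice + 20));
            return 0;
    }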
Ingo Molnar announced that the real time patchset [story] that he and Thomas Gleixner maintain is now available as a series of 374 broken-out patches, "from now on (as of 2.6.22.1-rt2) it will be part of every upstream -rt release and it is available from the -rt download site". Regarding the size of the patchset, he noted that it amounts to "698 files changed, 27920 insertions(+), 9603 deletions(-)", going on to note, "which is impressive as we moved a huge chunk of -rt into mainline already ;-) The series file is attached below." Ingo explained:
"the purpose of this finegrained splitup is to foster (even ;-) quicker upstream integration of various -rt features, and to see the full -rt tree integrated upstream. We also hope that this split-up queue helps various vendors standardize their (currently quite splintered) real-time implementations to the upstream -rt patchset. The queue is not (yet) bisectable at every point, and many of the splits are thematic, to allow the simpler future handling of updates."
Included in Andrew Morton's potential 2.6.23 merge list [story] was a series of patches to make the x86-64 architecture tickless. Andi Kleen, the x86-64 maintainer, replied, "I'm sceptical about the dynticks code. It just rips out the x86-64 timing code completely, which needs a lot more review and testing. Probably not .23." Linus Torvalds agreed, "we are *not* going to do another 'rip everything out, and replace it with new code' again. Over my dead body. We're going to do this thing gradually, or not at all." He went on to explain, "the patch-set itself actually looks fine, as far as I'm concerned. But it does seem to have that 'enable everything in one go' problem. I'd much rather see one time source at a time being converted, and enabled then and there, so that when people report problems and do a bisection, if it was HPET that broke, you get the commit that changed HPET."
In response to the pains caused by the original dyntick merge in 2.6.21, Ingo Molnar acknowledged, "we had 12 -hrt/dynticks merge related regressions between 2.6.21-rc1 and -final, and 4 after final." He went on to point out, "it's all pretty quiet today on the dynticks regressions front. (there are no open regressions in either the upstream i386 code or in the devel patches we are aware of)." As to the source of the bugs, he explained, "the majority of the above bugs were in the infrastructure code. (the worst was the generic resume/suspend one fixed in 2.6.21.1) And sadly, a fair number of the infrastructure bugs we introduced during the frentic clockevents/dynticks rewrites/redesigns we did between .20 and .21. That was a royally stupid mistake for us to do - instead of patiently waiting for the bugs to be shaken out we destabilized the infrastructure. (it was a 'lets make this thing so nice that it's impossible to reject' instintic gut reaction.)" Linus replied, "one thing I'll happily talk about is that while 2.6.21 was painful, you and Thomas in particular were both very responsible about the thing, so no, I'm not at all complaining or worried about it in that sense! I just really _really_ wish we could have two fairly stable releases in a row. I think 2.6.22 has the potential to be a pretty good setup, and I'd really like to avoid having another 2.6.21 immediately afterwards."
Another thread discussed potentially merging the swap prefetch patch into the mainline Linux kernel. Con Kolivas [story] started the thread saying "I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please." Replying to an off-list message, Andrew Morton asked users of the patch, "please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc?"
Nick Piggin [interview] noted that he's still interested in better understanding and possibly fixing what's happening with swap and reclaim on the systems reporting a benefit from the swap-prefetch patch. He went on to note, "regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some)." Ingo Molnar followed up to these comments, acking the patch, "I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already [story]) We should include this in v2.6.23."
Ingo Molnar's [interview] Completely Fair Scheduler [story] has been merged into the Linux kernel for inclusion in the upcoming 2.6.23 release. A comment in the patch titled 'sched: cfs core code' noted, "apply the CFS core code. This change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies." Another patch included documentation which described the new scheduler, "80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware." It goes on to explain:
"CFS's task picking logic is based on this p->wait_runtime value and it is thus very simple: it always tries to run the task with the largest p->wait_runtime value. In other words, CFS tries to run the task with the 'gravest need' for more CPU time. So CFS always tries to split up CPU time between runnable tasks as close to 'ideal multitasking hardware' as possible.
"Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers."