"Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice," stated Ingo Molnar during a continuing discussion about the Completely Fair Scheduler. He went on to ask if anyone could point to specific code that illustrates the proper usage of
sched_yield(). In response to a theory of how it could potentially optimize userland locking, Ingo challenged, "these are generic statements, but I'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in excess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and 'right' use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)" Ingo went on to explain:
"sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
"For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be 'difficult' because it's hard to determine 'smart' use is not tenable i believe and does not adequately refute my pretty plain-meaning 'it does not exist' assertion."
"I said I was hoping that -rc8 was the last -rc, and I hate doing this, but we've had more changes since -rc8 than we had in -rc8. And while most of them are pretty trivial, I really couldn't face doing a 2.6.23 release and take the risk of some really stupid brown-paper-bag thing," Linus Torvalds said, announcing the 2.6.23-rc9 kernel. He added, "so there's a final -rc out there, and right now my plan is to make this series really short, and release 2.6.23 in a few days. So please do give it a last good testing, and holler about any issues you find!" He went on to warn developers that the first thing planned for 2.6.24 was to merge the unified x86 architecture:
"This is also a good time to warn about the fact that we're doing the x86 merge very soon (as in the next day or two) after 2.6.23 is out, so if you have pending patches for the next series that touch arch/i386 or x86-64, you should get in touch with Thomas Gleixner and Ingo Molnar, who are the keepers of the merge scripts, and will help you prepare..
"Doing it as early as possible in the 2.6.24-rc4 series (basically I'll do it first thing) will mean that we'll have the maximum amount of time to sort out any issues, and the thing is, Thomas and Ingo already have a tree ready to go, so people can check their work against that, and don't need to think that they have to do any fixups after it his *my* tree. It would be much better if everybody was just ready for it, and not taken by surprise."
"This version brings a number of new checks, and a number of bug fixes," Andy Whitcroft noted in his announcement for version 0.10 of checkpatch.pl, used by Linux kernel developers to scan their code for common mistakes. Ingo Molnar expressed concern, "your checkpatch patch itself produces 22 warnings." He pointed out that there were numerous bogus warnings generated by the script, "ever since v8 the quality of checkpatch.pl has been getting worse and worse as there are way too many false positives. I'm still stuck on v8 for my own use, v9 and v10 is unusable." Ingo continued, "what matters is that only items should be displayed that i _can_ fix. With v8 i was able to make kernel/sched*.c almost noise-free, but with v9 and v10 that's not possible anymore." He noted that he was fine with there being a flag that would cause the script to generate additional questionable warnings, "but these default false positives are _lethal_ for a tool like this. (and i made this point before.) This is a _fundamental_ thing".
Andy added a new option to make it possible to disable some of the more subjective tests, noting that he preferred the tests to be enabled by default, "fundamentally I am not trying to help the people who are careful but those who do not know better. As for the false positives, those I am always interested in and always striving to remove, as they annoy me as much as the next man." Andrew Morton disagreed with the option being enabled by default, suggesting, "off, I'd say. That way people are more likely to use it. Or, more accurately, will have less excuses to not use it." Andy acquiesced, "off it is." He added, "I will also review the tests which are warnings and checks (subjective) and see if any are now miss-categorised." He pointed out that since the script is not a C parser, instead detecting style violations with regular expressions, it will never be 100% accurate and is only intended as a useful guide.
A potential bug reported against the Completely Fair Scheduler suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed changes in how CFS handles sched_yield(). When it was suggested that this was a bug in the new process scheduler, Ingo explained:
"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."
He then submitted a small patch to fix the bug and remove the call to sched_yield(), reporting, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had been submitted to the Iperf project last month, as it caused an identical problem with FreeBSD's scheduler.
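The shape of the fix Ingo described can be sketched as follows (a hypothetical Python illustration; the `ReportQueue` class and its methods are invented for this example and are not taken from the iperf source): instead of a reporter thread polling shared state and calling sched_yield() while it waits, the thread blocks on a condition variable and is woken only when there is actually work to do.

```python
import threading

# Hypothetical sketch of the kind of fix Ingo described: the waiting
# thread blocks on a condition variable rather than yield-polling, so it
# consumes no CPU while idle and wakes exactly when work arrives.
class ReportQueue:
    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()          # wake the blocked reporter

    def get(self):
        with self._cond:
            while not self._items:       # no yield loop: sleep until notified
                self._cond.wait()
            return self._items.pop(0)

q = ReportQueue()
results = []

def reporter():
    for _ in range(3):
        results.append(q.get())          # blocks without consuming CPU

t = threading.Thread(target=reporter)
t.start()
for i in range(3):
    q.put(i)
t.join()
print(results)  # [0, 1, 2]
```

This is the pattern that let iperf saturate the link at a fraction of the CPU cost: the reporter's scheduling no longer depends on how the kernel happens to interpret a yield.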
"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v22.214.171.124, v126.96.36.199 and v188.8.131.52," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:
"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)
"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."
"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler, "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), "there will be lots of regression reports about lost interactivity during load."
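The two semantics under debate can be modeled with a sorted run queue keyed by virtual runtime, standing in for the CFS rbtree (a toy Python sketch; the task values are invented for this example):

```python
import bisect

# Toy model (not kernel code) of the two yield() semantics discussed.
# Tasks are (vruntime, name) pairs kept sorted; lower vruntime runs first.
def yield_to_fair_position(runqueue, task):
    """Reinsert at the position the task's vruntime dictates (CFS behavior)."""
    runqueue.remove(task)
    bisect.insort(runqueue, task)
    return runqueue

def yield_dead_last(runqueue, task):
    """Move the task behind everything else, as Linus argued yield should."""
    runqueue.remove(task)
    max_vruntime = runqueue[-1][0] if runqueue else task[0]
    moved = (max_vruntime + 1, task[1])  # push past the rightmost task
    bisect.insort(runqueue, moved)
    return runqueue

rq = [(5, "yielder"), (10, "hog"), (20, "nice19")]
print(yield_to_fair_position(list(rq), (5, "yielder")))
# fair-position yield: the yielder keeps its place at the head of the queue
print(yield_dead_last(list(rq), (5, "yielder")))
# dead-last yield: the yielder lands even behind the nice +19 task
```

The second function illustrates exactly the consequence Ingo warned about: with only one tree, "dead last" means behind even nice +19 tasks, which is what he predicted would trigger interactivity complaints.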
"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler, and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged some slowdown in CFS compared to SD in one of the benchmarks, "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing he applied a low-res-sched-clock that resulted in better performance for CFS leading him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-res-sched-clock was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
"In the patch you really remove _a_lot_ of stuff," commented Roman Zippel in his reaction to Ingo Molnar's latest updates to the Completely Fair Scheduler. Roman has been consistently critical of Ingo's efforts, asking questions and criticizing Ingo's feedback. He continued, "you also removed a lot of things I tried to get you to explain them to me. On the one hand I could be happy that these things are gone, as they were the major road block to splitting up my own patch. On the other hand it still leaves me somewhat unsatisfied, as I still don't know what that stuff was good for."
Ingo replied to Roman's technical concerns, and pointed out that he'd been traveling for the recent kernel summit, adding, "I bent backwards trying to somehow get you to cooperate with us (and I still haven't given up on that!) - instead of you disparaging CFS and me frequently :-(". Willy Tarreau took a more critical stance, calling into question Roman's motives. He noted that he had been impressed by Roman's original review of the scheduler, but disappointed as the discussion seemed to degenerate, "it's the way you're trying to prove Ingo is a bastard and that you're a victim. But if we just re-read a few pick-ups of your mails since Aug 1st, its getting pretty obvious that you completely made up this situation." Kyle Moffett added, "I get the impression that Ingo re-implemented some ideas that you had because you refused to do so in a way that was acceptable for the upstream kernel. How exactly is this a bad thing?"
"The aim of these four patches is to introduce Virtual Machine time accounting," began Laurent Vivier. He described the first two patches as:
"1) As recent CPUs introduce a third running state, after 'user' and 'system', we need a new field, 'guest', in cpustat to store the time used by the CPU to run virtual CPU. Modify /proc/stat to display this new field.
"2) Like for cpustat, introduce the 'gtime' (guest time of the task) and 'cgtime' (guest time of the task children) fields for the tasks. Modify signal_struct and task_struct. Modify /proc/<pid>/stat to display these new fields."
Both Ingo Molnar and Rik van Riel responded favorably to the patch. Ingo replied, "the concept certainly looks sane to me," adding, "I'd suggest inclusion into 2.6.24." Regarding concerns that the new information at the end of the line could break utilities such as
ps, Rik assured that it would not, "we have added numbers to the cpu lines in /proc/stat since early 2.6. All the programs parsing /proc/stat should just scan for a number of numbers from the start of the line, without trying to scan for the terminating newline."
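Rik's advice translates into a simple parsing rule, sketched here in Python (the sample lines and field values are invented for illustration): take however many numbers follow the "cpu" prefix rather than assuming a fixed count, so newly appended fields such as 'guest' don't break the parser.

```python
# Parse the cpu line of /proc/stat the way Rik recommends: scan the
# numbers from the start of the line, without assuming how many there
# are, so kernels that append new fields (like 'guest') still parse.
def parse_cpu_line(line):
    fields = line.split()
    assert fields[0].startswith("cpu")
    return [int(v) for v in fields[1:]]

# Illustrative lines (values invented): an older kernel's cpu line, and
# one with the proposed trailing 'guest' field (plus 'steal') appended.
old_line = "cpu  2255 34 2290 22625563 6290 127 456"
new_line = "cpu  2255 34 2290 22625563 6290 127 456 0 3104"

print(len(parse_cpu_line(old_line)))  # 7 numbers on the older line
print(len(parse_cpu_line(new_line)))  # 9 - the same parser handles new fields
```

A parser written this way is exactly why appending 'guest' at the end of the line is backward compatible, while inserting it in the middle would not be.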
Having recently returned from the Linux kernel summit, Ingo Molnar and Peter Zijlstra sent out some performance updates to the Completely Fair Scheduler:
"Our main focus has been on simplifications and performance - and as part of that we've also picked up some ideas from Roman Zippel's 'Really Fair Scheduler' patch as well and integrated them into CFS. We'd like to ask people go give these patches a good workout, especially with an eye on any interactivity regressions."
He noted that some of the changes included removing features that had proved unnecessary, "while keeping the things that worked out fine, like sleeper fairness." Ingo posted some results from the lmbench benchmark noting around a 16% speedup on both the 32-bit and 64-bit x86 architectures. He added, "we are now a bit faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The main speedup comes from the avoidance of divisions (or shifts) in the wakeup and context-switch fastpaths."
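One common way to take a division out of a fastpath, and to my understanding the approach used in the CFS weight math, is to precompute a fixed-point inverse of the weight once, then replace the divide with a multiply and a shift. The following toy Python model sketches the idea (the constant and function names are illustrative, not the kernel's):

```python
# Sketch of the division-avoidance trick: instead of computing
# delta / weight in the fastpath, precompute inv_weight = 2^32 / weight
# once (off the fastpath), then use a multiply and shift each time.
# This is a toy model of the idea; real kernel code must also worry
# about overflow and rounding.
WMULT_SHIFT = 32

def make_inv_weight(weight):
    return (1 << WMULT_SHIFT) // weight          # one divide, done up front

def weighted_delta(delta, inv_weight):
    return (delta * inv_weight) >> WMULT_SHIFT   # fastpath: multiply + shift

inv = make_inv_weight(1024)            # e.g. the nice-0 task weight in CFS
print(weighted_delta(4096, inv))       # 4 == 4096 / 1024, with no divide
```

Since the inverse is computed once per weight (and nice levels map to a small, fixed set of weights), the per-wakeup and per-context-switch cost drops to a multiply and a shift.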
"This patch implements a new version of RCU which allows its read-side critical sections to be preempted," Paul McKenney said in a posting to the Linux Kernel mailing list. He described the RCU code contained in his 9 part patchset, "it uses a set of counter pairs to keep track of the read-side critical sections and flips them when all tasks exit [the] read-side critical section." He continued:
"This patch was developed as a part of the -rt kernel development and meant to provide better latencies when read-side critical sections of RCU don't disable preemption. As a consequence of keeping track of RCU readers, the readers have a slight overhead (optimizations in the paper). This implementation co-exists with the 'classic' RCU implementations and can be switched to at compiler."
Ingo Molnar responded very favorably to the patch, "cool! Now after 2 years of development and testing I think this is one of the most mature patchsets on lkml - so i'd like to see this designated for potential upstream inclusion. I.e. everyone who can see some bug, please holler now."
In an effort to fully understand the math proposed by Roman Zippel in his Really Fair Scheduler, Ingo Molnar implemented a simplified version of the logic on top of his Completely Fair Scheduler code which he then humorously labeled the Really Simple Really Fair Scheduler, "could you please confirm whether the math algorithm you are suggesting is implemented by this patch roughly correctly?" Ingo explained:
"As an addendum to my review, please find below a prototype patch I've just written that implements RSRFS (Really Simple Really Fair Scheduler) on top of CFS. It is intended to demonstrate the essence of the math you have presented via your patch. (it has no nice levels support yet, to make the math really apparent to everyone interested)"
Roman pointed out that things were oversimplified, "it simplifies the math too much, the nice level weighting is an essential part of the math and without it one can't really understand the problem I'm trying to solve." Ingo acknowledged this and explained, "in the first level review I'm mainly interested in your patch's behavior [with regards to] simple nice-0 tasks, how they run and how they sleep and wake up. Nothing else. Why? That's what 99% of the Linux tasks do after all." In another thread he explained that he would continue to review Roman's proposal, however also noting that he would be offline this week for the kernel summit and thus temporarily unresponsive to email.
Ingo Molnar reviewed Roman Zippel's Really Fair Scheduler code, suggesting that much of the work was similar to that which was being done by Peter Zijlstra, "all in one, we don't disagree, this is an incremental improvement we are thinking about for 2.6.24. We do disagree with this being positioned as something fundamentally different though - it's just the same thing mathematically, expressed without a "/weight" divisor, resulting in no change in scheduling behavior. (except for a small shift of CPU utilization for a synthetic corner-case)"
Roman was not impressed with Ingo's review, asking, "did you even try to understand what I wrote?" He continued, "while Peter's patches are interesting, they are only a small step to what I'm trying to achieve." Regarding performance and code-quality concerns with his patch he added, "I explicitly said that my patch is only a prototype, so I haven't done any cleanups and tuning in this direction yet, so making any conclusions based on code size comparisons is quite ridiculous at this point. The whole point of this patch was to demonstrate the algorithmic changes, not to demonstrate a final and perfectly tuned scheduler."
During the many threads discussing Ingo Molnar's recently merged Completely Fair Scheduler, Roman Zippel has repeatedly questioned the complexity of the new process scheduler. In a recent posting to the Linux Kernel mailing list he offered a simpler scheduler named the 'Really Fair Scheduler' saying, "as I already tried to explain previously CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer, why I could so easily skip over Ingo's long explanation of all the tricks CFS uses to keep the computational overhead low - I simply don't need them." He offered a mathematical overview of how his new scheduler works, included some benchmarks, and reflected back to earlier discussions on the lkml asking, "Ingo, from this point now I need your help, you have to explain to me, what is missing now relative to CFS. I tried to ask questions, but that wasn't very successful..." He also noted:
"The basic idea of this scheduler is somewhat different than CFS. Where the old scheduler maintains fixed time slices, CFS still maintains a dynamic per task time slice. This model does away with it completely, instead it puts the task on a virtual (normalized) time line, where only the relative distance between any two task is relevant."