I'm using RHEL 5.4 with kernel 2.6.18-164. Can I add CFS to this kernel by applying an existing patch, or do I have to create the patch myself?
Is there any available patch for this kernel?
Ingo Molnar posted a merge request for the latest git scheduler tree summarizing, "it contains various enhancements to the scheduler - find the full shortlog is below. 96 commits from 19 authors - scheduler developers have been busy again. :-/" He added, "the scheduling behavior of the kernel to normal users should not change over v2.6.24, but there are a good number of new features and enhancements under the hood." Ingo went on to list a number of these new features, including:
"Various instrumentation and debugging enhancements from Arjan van de Ven; Peter Zijlstra's RT time limit and RT throttling code for the RT scheduling class; Paul E. McKenney's preemptible RCU code; refcount based CPU-hotplug rework by Gautham R Shenoy; there's serious interest in running RT tasks on enterprise-class hardware, so Steven Rostedt and Gregory Haskins wrote a large number of enhancements to the RT scheduling class and load-balancer; Peter Zijlstra's high-resolution scheduler tick code; [...] and a good number of other, smaller enhancements."
Ingo Molnar announced that version 24 of his Completely Fair Scheduler patch is now available backported to the 2.6.24-rc3, 2.6.23.8, 2.6.22.13, and 2.6.21.7 kernels. He noted that there have been significant changes since the last backport, "36 files changed, 2359 insertions(+), 1082 deletions(-). That's 187 individual commits from 32 authors." Ingo noted, "99% of these changes are already upstream in Linus's git tree and they will be released as part of v2.6.24. (there are 4 pending commits that are in the small 2.6.24-rc3-v24 patch.)" He also highlighted some of the more significant improvements:
"Improved interactivity via Peter Ziljstra's 'virtual slices' feature. As load increases, the scheduler shortens the virtual timeslices that tasks get, so that applications observe the same constant latency for getting on the CPU. (This goes on until the slices reach a minimum granularity value).
"CONFIG_FAIR_USER_SCHED is now available across all backported kernels and the per user weights are configurable via /sys/kernel/uids/. Group scheduling got refined all around."
Ingo Molnar sent a merge request to Linus Torvalds for the latest CFS fixes. CFS, the Completely Fair Scheduler, was merged into the mainline Linux kernel in July of 2007. It was first included in the 2.6.23 kernel, released in October of 2007. The scheduler appears to be quickly stabilizing, visible in the minimal assortment of fixes contained in the latest source code push. Ingo Molnar summarized the changes:
"There are two cross-subsystem groups of fixes: three commits that resolve a KVM build fix on !SMP - acked by Avi to go in via the scheduler git tree because it changes a central include file. The other one is a powerpc CPU time accounting regression fix from Paul Mackerras.
"The remaining 14 commits: one crash fix (only triggerable via the new control-groups filesystem), a delay-accounting regression fix, two performance regression fixes, a latency fix, two small scheduling-behavior regression fixes and seven cleanups."
Ken Chen submitted a patch to reduce the memory footprint of schedstat in a thread titled, "schedstat needs a diet". He explained, "schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint." His patch converted numerous unsigned long variables to unsigned int, "most of the fields probably don't need to be [a] full 64-bits on 64-bit [architectures]. Rolling over 4 billion events will most likely take a long time and user space tools can be made to accommodate that."
Ingo Molnar merged the patch into his scheduler development tree suggesting there were further conversions that could be made, "note that current -git has a whole bunch of new schedstats fields in /proc//sched which can be used to track the exact balancing behavior of tasks. It can be cleared via echoing 0 to the file - so overflow is not an issue. Most of those new fields should probably be unsigned int too. (they are u64 right now.)"
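As a rough illustration of the trade-off discussed above, the sketch below estimates how long a 32-bit counter takes to wrap and shows one way a user-space tool could tolerate a wrap between two samples. The helper names and the rate of 1,000 events per second are assumptions made for the example, not figures from the thread.

```python
COUNTER_BITS = 32  # width of an unsigned int counter


def seconds_until_wrap(events_per_second, bits=COUNTER_BITS):
    """Seconds until a counter of the given width rolls over at a steady rate."""
    return (2 ** bits) / events_per_second


def wrapped_delta(current, previous, bits=COUNTER_BITS):
    """Events counted between two samples, correct across a single wrap."""
    return (current - previous) % (2 ** bits)


if __name__ == "__main__":
    # Hypothetical rate: 1,000 events per second.
    days = seconds_until_wrap(1000) / 86400
    print(f"32-bit counter wraps after about {days:.1f} days")
    # The counter wrapped between the two samples; the delta is still right.
    print(wrapped_delta(5, 2 ** 32 - 10))
```

The modular subtraction only works if the tool samples at least once per wrap interval, which is easy to satisfy when a wrap takes days.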
"It contains lots of scheduler updates from lots of people - hopefully the last big one for quite some time," began Ingo Molnar, describing his merge request for the linux-2.6-sched git tree. He continued, "most of the focus was on performance (both micro-performance and scalability/balancing), but there's the fair-scheduling feature now Kconfig selectable too. Find the shortlog below." Ingo noted, "code that is touched outside of the scheduler: the KVM bits were acked by Avi, the net/unix change is trivial and only affects sync wakeups, ditto the fs/pipe.c changes - but i can push those separately if it needs an ack from David first." He then concluded:
"Testing status: the changes are chronological and all the interactivity-impacting changes are near the head of the queue and most of them were done weeks ago, and were thus part of the CFS-v22 backport series - which was tested by many people. There are no known regressions at the moment. It's all fully bisectable."
"As far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench," explained Ingo Molnar in response to a posting showing the opposite results. He referred to his own testing results and explained:
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant)."
Ingo noted that he was unable to find information as to how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
"Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice," stated Ingo Molnar during a continuing discussion about the Completely Fair Scheduler. He went on to ask if anyone could point to specific code that illustrates the proper usage of sched_yield(). In response to a theory of how it could potentially optimize userland locking, Ingo challenged, "these are generic statements, but I'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in excess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and 'right' use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)" Ingo went on to explain:
"sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
"For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be 'difficult' because it's hard to determine 'smart' use is not tenable i believe and does not adequately refute my pretty plain-meaning 'it does not exist' assertion."
A potential bug reported against the Completely Fair Scheduler suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed changes in how CFS handles sched_yield(). When it was suggested that this was a bug in the new process scheduler, Ingo explained:
"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."
He then submitted a small patch to fix the bug and remove the call to sched_yield(), resulting in, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had been submitted to the Iperf project the previous month, as the same code caused an identical problem with FreeBSD's scheduler.
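The general alternative to a yield-based user-space lock is to block on a condition variable until there is work to do. The sketch below shows that pattern with Python's threading module. It is not iperf's code or Ingo's patch: the ReportQueue class and run_demo function are hypothetical names used only to illustrate the idea.

```python
import threading
from collections import deque


class ReportQueue:
    """Hand-off queue where the consumer sleeps instead of spinning on yield."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self._closed = False

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one sleeping consumer

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def get(self):
        """Block until an item arrives; return None once closed and drained."""
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()  # releases the lock and sleeps, using no CPU
            if self._items:
                return self._items.popleft()
            return None


def run_demo(count=5):
    """Send count integers to a reporter thread and return what it received."""
    queue = ReportQueue()
    seen = []

    def reporter():
        while True:
            item = queue.get()
            if item is None:
                break
            seen.append(item)

    worker = threading.Thread(target=reporter)
    worker.start()
    for i in range(count):
        queue.put(i)
    queue.close()
    worker.join()
    return seen


if __name__ == "__main__":
    print(run_demo())
```

Because the waiting thread sleeps inside wait() rather than repeatedly yielding, its CPU use does not depend on how the scheduler treats sched_yield().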
"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:
"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)
"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."
"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler, "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), "there will be lots of regression reports about lost interactivity during load."
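The two yield policies under debate can be contrasted with a toy run queue. This is a deliberately simplified model, not the kernel's rbtree: the queue is a sorted list of (key, name) pairs where the smallest key runs next, and the function names and key values are invented for illustration.

```python
import bisect


def requeue_by_key(waiting, task):
    """Reinsert the yielding task at the position its own key dictates."""
    queue = list(waiting)
    bisect.insort(queue, task)
    return queue


def requeue_last(waiting, task):
    """Place the yielding task behind every other waiting task."""
    key, name = task
    largest = max((k for k, _ in waiting), default=key)
    return list(waiting) + [(max(key, largest) + 1, name)]


if __name__ == "__main__":
    # Hypothetical keys: the yielding task has the smallest key of all.
    waiting = [(10, "editor"), (30, "nice19-hog")]
    yielder = (5, "yielder")
    print([name for _, name in requeue_by_key(waiting, yielder)])
    print([name for _, name in requeue_last(waiting, yielder)])
```

In the first policy a task with a small key can be picked again right after it yields. In the second it waits behind everything else in the queue, including the lowest-priority work, which is the trade-off Ingo expected people to complain about.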
"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler, and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged some slowdown in CFS compared to SD in one of the benchmarks, "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing he applied a low-res-sched-clock that resulted in better performance for CFS leading him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-res-sched-clock was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
"Looking at these graphs (and the fixed one from your second email), it sure looks a lot like CFS is doing at *least* as well as the old scheduler in every single test, and doing much better in most of them (in addition it's much more consistent between runs)," Kyle Moffett noted regarding recent benchmarks run against the Completely Fair Scheduler by Rob Hussey. Kyle continued:
"This seems to jive with all the other benchmarks and overall empirical testing that everyone has been doing. Overall I have to say a job well done for Ingo, Peter, Con, and all the other major contributors to this impressive endeavor."
"In the patch you really remove _a_lot_ of stuff," commented Roman Zippel in his reaction to Ingo Molnar's latest updates to the Completely Fair Scheduler. Roman has been consistently critical of Ingo's efforts, asking questions and criticizing Ingo's feedback. He continued, "you also removed a lot of things I tried to get you to explain them to me. On the one hand I could be happy that these things are gone, as they were the major road block to splitting up my own patch. On the other hand it still leaves me somewhat unsatisfied, as I still don't know what that stuff was good for."
Ingo replied to Roman's technical concerns, and pointed out that he'd been traveling for the recent kernel summit, adding, "I bent backwards trying to somehow get you to cooperate with us (and I still haven't given up on that!) - instead of you disparaging CFS and me frequently :-(". Willy Tarreau took a more critical stance, calling into question Roman's motives. He noted that he had been impressed by Roman's original review of the scheduler, but disappointed as the discussion seemed to degenerate, "it's the way you're trying to prove Ingo is a bastard and that you're a victim. But if we just re-read a few pick-ups of your mails since Aug 1st, its getting pretty obvious that you completely made up this situation." Kyle Moffett added, "I get the impression that Ingo re-implemented some ideas that you had because you refused to do so in a way that was acceptable for the upstream kernel. How exactly is this a bad thing?"