"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:
"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)
"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."
"The cfs core has been enhanced since quite sometime now to understand task-groups and [to] provide fairness to such task-groups," began Srivatsa Vaddagiri, "what was needed was an interface for the administrator to define task-groups and specify group 'importance' in terms of its cpu share. The patch below adds such an interface."
Srivatsa requested that his patch be merged into Andrew Morton's -mm tree to receive more testing, "note that the load balancer needs more work, esp to deal with cases like 2-groups on 4-cpus, one group has 3 tasks and other having 4 tasks. We are working on some ideas, but nothing to share in the form of a patch yet. I felt sending this patch out would help folks start testing the feature and also improve it."
Ingo Molnar [interview]'s Completely Fair Scheduler [story] has been merged into the Linux kernel for inclusion in the upcoming 2.6.23 release. A comment in the patch titled 'sched: cfs core code' noted, "apply the CFS core code. This change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies." Another patch included documentation which described the new scheduler, "80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware." It goes on to explain:
"CFS's task picking logic is based on this p->wait_runtime value and it is thus very simple: it always tries to run the task with the largest p->wait_runtime value. In other words, CFS tries to run the task with the 'gravest need' for more CPU time. So CFS always tries to split up CPU time between runnable tasks as close to 'ideal multitasking hardware' as possible.
"Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers."
Ingo Molnar [interview] released version 19 of his CFS scheduler patchset [story], first released back in April [story]. He noted:
"The biggest user-visible change in -v19 is reworked sleeper fairness: it's similar in behavior to -v18 but works more consistently across nice levels. Fork-happy workloads (like kernel builds) should behave better as well. There are also a handful of speedups: unsigned math, 32-bit speedups, O(1) task pickup, debloating and other micro-optimizations."
Among the other changes found in this latest version of the scheduler patchset, Ingo noted, "merged the group-scheduling CFS-core changes from Srivatsa Vaddagiri [story]. This makes up for the bulk of the changes in -v19 but has no behavioral impact. The final group-fairness enabler patch is now a small and lean add-on patch to CFS."
Following a review of Ingo Molnar [interview]'s Completely Fair Scheduler [story], Srivatsa Vaddagiri posted a patch allowing the new scheduler to provide fairness at a per-group level rather than at a per-process level. He described the changes that he made and noted, "I have used 'uid' as the basis of grouping for timebeing (since that grouping concept is already in mainline today). The patch can be adapted to a more generic process grouping mechanism later."
Ingo reacted to the patch favorably, "yeah, i like this alot." He went on to comment, "the 'struct sched_entity' abstraction looks very clean, and that's the main thing that matters: it allows for a design that will only cost us performance if group scheduling is desired." He went on to ask, "if you could do a -v14 port and at least add minimal SMP support: i.e. it shouldnt crash on SMP, but otherwise no extra load-balancing logic is needed for the first cut - then i could try to pick all these core changes up for -v15. (I'll let you know about any other thoughts/details when i do the integration.)"
"As I understand, fair_clock is a monotonously increasing clock which advances at a pace inversely proportional to the load on the runqueue," Srivatsa Vaddagiri explained in a review of Ingo Molnar [interview]'s CFS CPU scheduler [story], "if load = 1 (task), it will advance at same pace as wall clock, as load increases it advances slower than wall clock." He continued on to ask some questions about the choices made in CFS as compared to the EEVDF CPU scheduler [story]. In the resulting discussion, Ingo offered some insight into the design of the CFS. He began:
"80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware. 'Ideal multi-tasking CPU' is a (non-existent :-) CPU that has 100% physical power and which can run each task at precise equal speed, in parallel, each at 1/nr_running speed. For example: if there are 2 tasks running then it runs each at 50% physical power - totally in parallel.
"On real hardware, we can run only a single task at once, so while that one task runs the other tasks that are waiting for the CPU are at a disadvantage - the current task gets an unfair amount of CPU time. In CFS this fairness imbalance is expressed and tracked via the per-task p->wait_runtime (nanosec-unit) value. 'wait_runtime' is the amount of time the task should now run on the CPU for it become completely fair and balanced."