"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:
"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)
"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."
From: Ingo Molnar
Subject: [patch/backport] CFS scheduler, -v22, for v2.6.23-rc8, v2.6.22.8, v2.6.21.7, v2.6.20.20
Date: Sep 26, 4:13 am 2007
By popular demand, here is release -v22 of the CFS scheduler. It is a
full backport of the latest & greatest sched-devel.git code to
v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20. The patches can be
downloaded from the usual place:
http://people.redhat.com/mingo/cfs-scheduler/
This is the first time the development version of the scheduler has been
fed back into the stable backport series, so there's many changes since
v20.5:
15 files changed, 1103 insertions(+), 840 deletions(-)
Even if CFS v20.5 worked well for you, please try this release too, with
a good focus on interactivity testing - because, unless some major
showstopper is found, this codebase is intended for a v2.6.24 upstream
merge.
( Even a quick, subjective report of: "checked this patch, it didn't
crash and it feels like v20.5" or "laggier than v20.5" or "feels
better than v20.5" is useful to us and enables us to judge the general
direction of interactivity. )
The changes in v22 consist of lots of mostly small enhancements,
speedups, interactivity improvements, debug enhancements and tidy-ups -
many of which can be user-visible. (These enhancements have been
contributed by many people - see the changelog below and the git tree
for detailed credits.)
The biggest individual new feature is per UID group scheduling, written
by Srivatsa Vaddagiri, which can be enabled via the
CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each
user gets a fair share of the CPU time, regardless of how many tasks
each user is running.
For example, it took me 0.1 seconds to log in over ssh as root on a
testbox that was running a kernel with per UID group scheduling enabled:
$ time ssh root@testbox /bin/true
real 0m0.125s
user 0m0.013s
sys 0m0.011s
Which testbox had a system load of 1000.17 at this time, due to a rogue
runaway workload of one thousand (!) non-reniced infinite loops:
top - 14:34:05 up 30 min, 3 users, load average: 1000.17, 839.23, 444.57
Tasks: 1131 total, 1002 running, 129 sleeping, 0 stopped, 0 zombie
Cpu(s): 30.8%us, 0.2%sy, 0.0%ni, 68.2%id, 0.8%wa, 0.0%hi, 0.0%si
Mem: 2048992k total, 157688k used, 1891304k free, 18308k buffers
Swap: 4096564k total, 0k used, 4096564k free, 25464k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3633 root 20 0 2892 1576 724 R 7 0.1 0:00.06 top
2427 mingo 20 0 1576 244 196 R 2 0.0 0:01.14 loop
2429 mingo 20 0 1576 244 196 R 2 0.0 0:01.14 loop
To the root user, the box was fully usable and interactivity was
excellent - i was easily able to kill off those runaway tasks.
( The /proc/root_user_cpu_share tunable also allows the root uid to have
higher weight than other uids. Unit of the tunable is 0.1%, a weight
of 100% is 1024, the default weight of the root uid is 200%. )
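As a rough illustration of the arithmetic in the note above: assuming the tunable is a plain integer weight in which 1024 corresponds to 100%, one step is 1/1024, or roughly 0.1%. The helper names below are hypothetical and are not part of the patch.

```python
# Hypothetical helpers, not taken from the CFS patch.
# Assumption (from the note above): a weight of 1024 means 100%.
FULL_WEIGHT = 1024

def weight_to_percent(weight):
    """Convert a raw weight value to a percentage of one full share."""
    return 100.0 * weight / FULL_WEIGHT

def percent_to_weight(percent):
    """Convert a percentage to the nearest raw weight value."""
    return round(percent * FULL_WEIGHT / 100.0)

print(weight_to_percent(2048))  # 200.0, the stated default for the root uid
print(percent_to_weight(200))   # 2048
print(weight_to_percent(1))     # 0.09765625, i.e. roughly 0.1% per unit
```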
See the detailed shortlog below for a description of the other changes,
or pull the sched-devel.git tree for all the 83 commits:
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
Also, as usual, any sort of feedback, bugreport, fix and suggestion is
more than welcome!
Ingo
------------------>
Dmitry Adamushko (9):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
sched: yield fix
Hiroshi Shimamoto (1):
sched: clean up sched_fork()
Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()
Mike Galbraith (2):
sched: fix SMP migration latencies
sched: fix formatting of /proc/sched_debug
Peter Zijlstra (12):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime overflow
sched: better min_vruntime tracking
sched: add vslice
sched debug: check spread
sched: max_vruntime() simplification
sched: clean up min_vruntime use
sched: speed up and simplify vslice calculations
S.Caglar Onur (1):
sched debug: BKL usage statistics, fix
Srivatsa Vaddagiri (12):
sched: group-scheduler core
sched: revert recent removal of set_curr_task()
sched: fix minor bug in yield
sched: print nr_running and load in /proc/sched_debug
sched: print &rq->cfs stats
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
sched: add fair-user scheduler
sched: group scheduler wakeup latency fix
sched: group scheduler SMP migration fix
sched: group scheduler, fix coding style issues
sched: group scheduler, fix bloat
sched: group scheduler, fix latency
Ingo Molnar (44):
sched: fix new-task method
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: fix delay accounting performance regression
sched: prettify /proc/sched_debug output
sched: enhance debug output
sched: kernel/sched_fair.c whitespace cleanups
sched: fair-group sched, cleanups
sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
sched debug: BKL usage statistics
sched: remove unneeded tunables
sched debug: print settings
sched debug: more width for parameter printouts
sched: entity_key() fix
sched: remove condition from set_task_cpu()
sched: remove last_min_vruntime effect
sched: undo some of the recent changes
sched: fix place_entity()
sched: fix sched_fork()
sched: remove set_leftmost()
sched: clean up schedstats, cnt -> count
sched: cleanup, remove stale comment
arch/i386/Kconfig | 11
fs/proc/base.c | 2
include/linux/sched.h | 56 ++-
init/Kconfig | 21 +
kernel/delayacct.c | 2
kernel/sched.c | 577 ++++++++++++++++++++++++-------------
kernel/sched_debug.c | 246 ++++++++++------
kernel/sched_fair.c | 733 ++++++++++++++++++------------------------------
kernel/sched_idletask.c | 5
kernel/sched_rt.c | 12
kernel/sched_stats.h | 28 -
kernel/sysctl.c | 31 --
kernel/user.c | 43 ++
13 files changed, 963 insertions(+), 804 deletions(-)
good job
load avg 1000 and the system is still usable? holy cow, that's crazy!
Looking forward to using this scheduler on my systems.
Looks nifty. The user running 1000 tasks won't have good interactivity, but other users
and root are not punished.
But that idle of 60% does look somewhat bad.
That is odd...
...although idle time isn't calculated by the kernel, and could be subject to rounding error on such a (relatively) short sample interval as top's.
Could it be rounding error in top? Top computes idle time by adding up all the CPU-seconds taken since the last refresh. With 1000 tasks all getting darn near identical amounts of CPU time--slightly less than 0.1% of the CPU--it's not hard to imagine many of the usage %ages getting rounded down, thereby compromising the computation.
The real way to see if the idle's at 60% would be to look at total CPU seconds reported by ps after, say, letting things run for 20 minutes or so (1200 seconds), so that each of the 1000 tasks gets at least a full CPU-second. (1.2 CPU seconds, ideally.)
--
Program Intellivision and play Space Patrol!
Nonsense
The kernel does calculate idle time. Please read Documentation/filesystems/proc.txt wrt /proc/stat, which has a per-CPU field for this.
Well whaddaya know!
It sure does, apparently in units of HZ. This certainly seems to be a rounding error issue at some point in the system, unless the CPU really is 60% idle (which seems unlikely). I wonder if this file is up to date on the situation. Back in February, I guess it was, but a lot's changed in the last 7 months.
As for "top" computing the idle: I guess I was thinking more along the lines of how old school top (which read /dev/kmem) seemed to work. That was a looooong time ago, and I could even be misremembering. It looks like there is definitely some amusing code in the procps-top to handle older Linux ("SMP kernels (as of pre-2.4 era) can report idle time going backwards"), so who knows.
Edit: This is interesting. According to this file, idle time is reported as the sum of kernel and user space time that "init" was given. (Makes sense, since that's what "runs" when everything else sleeps.) Hmmm.... maybe there is something to this?
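For anyone who wants to check the idle figure without relying on top, here is a minimal sketch that compares two /proc/stat samples. It assumes the aggregate "cpu" line lists cumulative ticks in the order user, nice, system, idle, and so on, as proc.txt describes. The function names and the sample numbers are made up for illustration.

```python
# Sketch: estimate the idle percentage between two /proc/stat samples.
# Assumption: the aggregate "cpu" line holds cumulative tick counters in
# the order user, nice, system, idle, ... (idle is the fourth counter).

def parse_cpu_line(stat_text):
    """Return the tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate cpu line found")

def idle_percent(before, after):
    """Idle share of all ticks that elapsed between the two samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[3] / total  # index 3 is the idle counter

# Made-up sample values, for illustration only.
sample_1 = "cpu  1000 0 200 5000 50 0 0\ncpu0 1000 0 200 5000 50 0 0\n"
sample_2 = "cpu  1300 0 210 5680 60 0 0\ncpu0 1300 0 210 5680 60 0 0\n"
print(idle_percent(sample_1, sample_2))  # 68.0
```

On a live system the two strings would come from reading /proc/stat twice with a pause in between; the longer the pause, the less any per-tick rounding matters.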
what's the default for CONFIG_FAIR_USER_SCHED ?
But will CONFIG_FAIR_USER_SCHED=y be the default? If it isn't the default, it wouldn't be all that useful, imo, because it's a real obscure flag deep in the bowels of the scheduler, and the vast majority of users will run without it.
if it really works as advertised, i think it should be the default.
just my ¢ 1.75
It doesn't really matter, as it's what the dists decide to use that'll be the norm.
There were (are) problems with this setting and SMP. I'm sure it will be ironed out by 2.6.24.
It is enabled by default on the development branch. Ultimately, the default will be decided by the different distros.
There were (are) problems
I believe those problems were fixed prior to the CFS-v22 release, and the bug reporter confirmed it too. See this post:
I piddled around with fair users this morning, and it worked well. With Xorg and Gforce as one user (X and Gforce are synchronous ATM), and a make -j30 as another, I could barely tell the make was running. Watching a dvd, I couldn't tell. Latencies were pretty darn good throughout three hours of testing this and that.
Somehow, I think /proc/root_user_cpu_share is a special case. Allowing @staff to have more CPU than @students (or @guests) would be much better.
And why add more stuff to /proc?
Where should it be instead?
why not /sys?
Allowing @staff to have more
Just wait for the process containers (now "control groups"?) patch; this will let you do that.
Just tried out the patch with 2.6.23-rc8.
My system feels really quite snappy now, possibly faster than the ck series now.
I did have an issue with tearing with geforce driver but that seems to be fixed now.
Can't wait for this to be in the kernel as default.
can you also "nice" users?
Giving each user a fair share of CPU time is a fine addition, but can you also "nice" a user up or down, i.e., give him/her a larger or smaller share than other users?
Yes with process groups due in 2.6.24
Yes you can with the process groups feature due in 2.6.24
The idea is that you can group processes together, and each group gets a fair share. By default each user's processes are a group, but the sysadmin can set up other groups. For example in a university environment, you might put the students in a different group from the staff. Each group can be configured to get an unequal share.
The main driver for this is to prevent users from abusing the scheduler by running lots of processes in parallel, as otherwise one user could run a compile with five concurrent threads and get five times as much CPU time as someone who is running a single threaded application.
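To put numbers on that, here is a toy calculation, not scheduler code. It assumes every task is CPU-bound with equal weight, and the user names and function names are invented for the example. It compares how CPU time divides when fairness is applied per task versus per user.

```python
from collections import Counter

def per_task_shares(task_owners):
    """CPU fraction per user when every task gets an equal slice."""
    counts = Counter(task_owners)
    total = len(task_owners)
    return {user: n / total for user, n in counts.items()}

def per_user_shares(task_owners):
    """CPU fraction per user when every user (group) gets an equal slice."""
    users = set(task_owners)
    return {user: 1 / len(users) for user in users}

# One user runs a five-way parallel compile, the other a single task.
tasks = ["alice"] * 5 + ["bob"]
print(per_task_shares(tasks))  # alice gets 5/6 of the CPU, bob 1/6
print(per_user_shares(tasks))  # each user gets 0.5
```

With per-task fairness alice ends up with five times bob's CPU time; with per-user grouping the split is even no matter how many tasks either one starts.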