CFS Scheduler -v24 Backports

Submitted by Jeremy
on November 19, 2007 - 1:00pm

Ingo Molnar announced that version 24 of his Completely Fair Scheduler patch is now available backported to the 2.6.24-rc3, 2.6.23.8, 2.6.22.13, and 2.6.21.7 kernels. He noted that there have been significant changes since the last backport, "36 files changed, 2359 insertions(+), 1082 deletions(-). That's 187 individual commits from 32 authors." Ingo noted, "99% of these changes are already upstream in Linus's git tree and they will be released as part of v2.6.24. (there are 4 pending commits that are in the small 2.6.24-rc3-v24 patch.)" He also highlighted some of the more significant improvements:

"Improved interactivity via Peter Zijlstra's 'virtual slices' feature. As load increases, the scheduler shortens the virtual timeslices that tasks get, so that applications observe the same constant latency for getting on the CPU. (This goes on until the slices reach a minimum granularity value).

"CONFIG_FAIR_USER_SCHED is now available across all backported kernels and the per user weights are configurable via /sys/kernel/uids/. Group scheduling got refined all around."
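As a rough illustration of the "virtual slices" idea quoted above, the following is a minimal sketch and not the kernel's actual code. It assumes equally weighted tasks. The function names and the two tunable values (a 20 ms latency target and a 4 ms minimum granularity) are hypothetical, chosen only to make the arithmetic easy to follow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunables for illustration; not the kernel's defaults. */
#define TARGET_LATENCY_NS   20000000ULL  /* period in which every task should run once */
#define MIN_GRANULARITY_NS   4000000ULL  /* a slice is never made shorter than this */

/*
 * Slice handed to each of nr_running equally weighted tasks.
 * The slice shrinks as load grows, so the time to cycle through all
 * tasks stays at TARGET_LATENCY_NS, until the minimum granularity
 * is reached.
 */
static uint64_t virtual_slice_ns(unsigned int nr_running)
{
    uint64_t slice;

    assert(nr_running > 0);

    slice = TARGET_LATENCY_NS / nr_running;
    if (slice < MIN_GRANULARITY_NS)
        slice = MIN_GRANULARITY_NS;  /* stop shrinking here */

    return slice;
}

/* Time needed for every runnable task to get one slice. */
static uint64_t scheduling_period_ns(unsigned int nr_running)
{
    return virtual_slice_ns(nr_running) * nr_running;
}
```

With these assumed values, one to five runnable tasks all see the same 20 ms period. From the sixth task onward the slice stays pinned at 4 ms and the period grows with the number of tasks, which corresponds to the "minimum granularity" caveat in the announcement.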


From: Ingo Molnar
Subject: [patch/backport] CFS scheduler, -v24, for v2.6.24-rc3, v2.6.23.8, v2.6.22.13, v2.6.21.7
Date: Nov 19, 8:17 am 2007

By popular demand, here is release -v24 of the CFS scheduler patch.

It is a full backport of the latest & greatest scheduler code to 
v2.6.24-rc3, v2.6.23.8, v2.6.22.13, v2.6.21.7. The patches can be 
downloaded from the usual place:

    http://people.redhat.com/mingo/cfs-scheduler/

There's tons of changes since v22 was released:

    36 files changed, 2359 insertions(+), 1082 deletions(-)

that's 187 individual commits from 32 authors.

So even if CFS v22 worked well for you, please try this release too and 
report regressions (if any).

There are countless improvements in -v24 (see the shortlog further below 
for details), but here are a few highlights:

 - improved interactivity via Peter Zijlstra's "virtual slices" feature.
   As load increases, the scheduler shortens the virtual timeslices that 
   tasks get, so that applications observe the same constant latency for 
   getting on the CPU. (This goes on until the slices reach a minimum 
   granularity value)

 - CONFIG_FAIR_USER_SCHED is now available across all backported 
   kernels and the per user weights are configurable via 
   /sys/kernel/uids/. Group scheduling got refined all around.

 - performance improvements

 - bugfixes

99% of these changes are already upstream in Linus's git tree and they 
will be released as part of v2.6.24. (there are 4 pending commits that 
are in the small 2.6.24-rc3-v24 patch.)

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome!

	Ingo

------------------>
Adrian Bunk (3):
      sched: make kernel/sched.c:account_guest_time() static
      sched: proper prototype for kernel/sched.c:migration_init()
      sched: make sched_nr_latency static

Alexey Dobriyan (1):
      sched: uninline scheduler

Andi Kleen (5):
      sched: cleanup: remove unnecessary gotos
      sched: cleanup: refactor common code of sleep_on / wait_for_completion
      sched: cleanup: refactor normalize_rt_tasks
      sched: remove stale comment from sched_group_set_shares()
      sched: fix return value of wait_for_completion_interruptible()

Arjan van de Ven (1):
      Make scheduler debug file operations const

Balbir Singh (1):
      sched: fix delay accounting regression

Christian Borntraeger (1):
      sched: fix accounting of interrupts during guest execution on s390

Cliff Wickman (1):
      hotplug cpu: migrate a task within its cpuset

Dhaval Giani (1):
      sched: group scheduling, sysfs tunables

Dmitry Adamushko (16):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: sched_setscheduler() fix
      sched: add set_curr_task() calls
      sched: do not keep current in the tree and get rid of sched_entity::fair_key
      sched: optimize task_new_fair()
      sched: simplify sched_class::yield_task()
      sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
      sched: yield fix
      sched: fix __pick_next_entity()
      sched: tidy up SCHED_RR
      sched: cleanup, remove calc_weighted()
      sched: cleanup, make dequeue_entity() and update_stats_wait_end() similar
      sched: fix group scheduling for SCHED_BATCH
      sched: fix __set_task_cpu() SMP race
      sched: remove activate_idle_task()

Eric Dumazet (1):
      sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC

Eugene Teo (1):
      Fix tsk->exit_state usage

Gautham R Shenoy (1):
      sched: fix rt ptracer monopolizing CPU

Hiroshi Shimamoto (1):
      sched: clean up sched_fork()

Ingo Molnar (80):
      sched: fix sysctl_sched_child_runs_first flag
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity increase
      sched: add se->vruntime debugging
      sched: remove SCHED_FEAT_SKIP_INITIAL
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: fix delay accounting performance regression
      sched: prettify /proc/sched_debug output
      sched: enhance debug output
      sched: kernel/sched_fair.c whitespace cleanups
      sched debug: BKL usage statistics
      sched: remove unneeded tunables
      sched debug: print settings
      sched debug: more width for parameter printouts
      sched: entity_key() fix
      sched: remove condition from set_task_cpu()
      sched: remove last_min_vruntime effect
      sched: undo some of the recent changes
      sched: fix sign check error in place_entity()
      sched: fix sched_fork()
      sched: remove set_leftmost()
      sched: clean up schedstats, cnt -> count
      sched: cleanup, remove stale comment
      sched: mark scheduling classes as const
      sched: whitespace cleanups
      sched: vslice fixups for non-0 nice levels
      sched: optimize schedule() a bit on SMP
      sched: tweak wakeup granularity
      sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y
      sched: break out if printing a warning in sched_domain_debug()
      sched: style cleanup
      sched: kfree(NULL) is valid
      sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG
      sched: cleanup: rename task_grp to task_group
      sched: cleanup: function prototype cleanups
      sched: fix: move the CPU check into ->task_new_fair()
      sched: update comment
      sched: clean up is_migration_thread()
      sched: do not normalize kernel threads via SysRq-N
      sched: do not wakeup-preempt with SCHED_BATCH tasks
      sched: speed up context-switches a bit
      sched: reintroduce cache-hot affinity
      sched: debug: increase width of debug line
      sched: debug, improve migration statistics
      sched: allow the immediate migration of cache-cold tasks
      sched: affine sync wakeups
      sched: sync wakeups preempt too
      sched: cleanup, fix spacing
      sched: cleanup, make struct rq comments more consistent
      sched: add KERN_CONT annotation
      sched: fix fastcall mismatch in completion APIs
      sched: clean up sched_domain_debug()
      sched: fix style of swap() macro in kernel/sched_fair.c
      sched: fix style in kernel/sched.c
      sched: reintroduce SMP tunings again
      sched: turn off PREEMPT_RESTRICT
      sched: remove PREEMPT_RESTRICT
      sched: wakeup preemption fix
      sched: clean up the wakeup preempt check
      sched: clean up the wakeup preempt check, #2
      sched: reorder SCHED_FEAT_ bits

James Bottomley (1):
      sched: fix incorrect assumption that cpu 0 exists

Ken Chen (2):
      sched: fix improper load balance across sched domain
      sched: reduce schedstat variable overhead a bit

Laurent Vivier (2):
      sched: guest CPU accounting: maintain stats in account_system_time()
      sched: don't clear PF_VCPU in scheduler

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Michael Neuling (2):
      Add scaled time to taskstats based process accounting
      kernel/sched.c: remove bogus comment from account_user_time

Mike Galbraith (3):
      sched: fix SMP migration latencies
      sched: fix formatting of /proc/sched_debug
      sched: prevent wakeup over-scheduling

Milton Miller (7):
      sched: domain sysctl fixes: use kcalloc()
      sched: domain sysctl fixes: use for_each_online_cpu()
      sched: domain sysctl fixes: unregister the sysctl table before domains
      sched: domain sysctl fixes: do not crash on allocation failure
      sched: domain sysctl fixes: add terminator comment
      sched: more robust sd-sysctl entry freeing
      sched: fix sched_domain sysctl registration again

Oleg Nesterov (3):
      do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist)
      migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock()
      sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED

Paul E. McKenney (1):
      sched: export cpu_clock()

Paul Jackson (2):
      cpuset: remove sched domain hooks from cpusets
      cpuset sched_load_balance flag

Paul Menage (4):
      Task Control Groups: example CPU accounting subsystem
      Fix cpusets update_cpumask
      sched: clean up some control group code
      sched: report CPU usage in CFS cgroup directories

Pavel Emelyanov (3):
      pid namespaces: changes to show virtual ids to user
      Uninline find_task_by_xxx set of functions
      Use helpers to obtain task pid in printks

Peter Williams (2):
      sched: reduce balance-tasks overhead
      sched: isolate SMP balancing code a bit more

Peter Zijlstra (21):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime 64-bit overflow
      sched: better min_vruntime tracking
      sched: add vslice
      sched debug: check spread
      sched: max_vruntime() simplification
      sched: clean up min_vruntime use
      sched: speed up and simplify vslice calculations
      sched: another wakeup_granularity fix
      sched: disable sleeper_fairness on SCHED_BATCH
      sched: disable forced preemption by default
      sched: activate task_hot() only on fair-scheduled tasks
      sched: fix unconditional irq lock
      sched: fix vslice
      sched: documentation: place_entity() comments
      sched: reintroduce the sched_min_granularity tunable
      sched: avoid large irq-latencies in smp-balancing

S.Caglar Onur (1):
      sched debug: BKL usage statistics, fix

Satyam Sharma (1):
      sched: use show_regs() to improve __schedule_bug() output

Srivatsa Vaddagiri (16):
      sched: group-scheduler core
      sched: revert recent removal of set_curr_task()
      sched: fix minor bug in yield
      sched: print nr_running and load in /proc/sched_debug
      sched: print &rq->cfs stats
      sched: clean up code under CONFIG_FAIR_GROUP_SCHED
      sched: add fair-user scheduler
      sched: group scheduler wakeup latency fix
      sched: group scheduler SMP migration fix
      sched: group scheduler, fix coding style issues
      sched: group scheduler, fix bloat
      sched: group scheduler, fix latency
      sched: fix new task startup crash
      Hook up group scheduler with control groups
      sched: move rcu_head to task_group struct
      sched: fix copy_namespace() <-> sched_fork() dependency in do_fork

Zou Nan hai (1):
      sched: some proc entries are missed in sched_domain sys_ctl debug code

-

Thanks!!

Anonymous (not verified)
on
November 19, 2007 - 3:24pm

That is precisely what I was waiting for to roll out the updates on the kernel I use (the 2.6.22 series). Many thanks to all kernel developers for the hard work.

smp-only

Anonymous (not verified)
on
November 19, 2007 - 10:42pm

Make sure you compile with CONFIG_SMP=y even if you have only one core. The -v24 backport patch (at least the 2.6.23.8 variant) doesn't work for uniprocessor kernels.

If it doesn't work then it's

intgr
on
November 20, 2007 - 2:37am

If it doesn't work then it's a bug and should be reported to Ingo.

Works fine here on UP.

Anonymous (not verified)
on
November 24, 2007 - 1:03am

Works fine here on UP.

Sound familiar

Anonymous (not verified)
on
November 20, 2007 - 1:49pm

As load increases, the scheduler shortens the virtual timeslices that
tasks get, so that applications observe the same constant latency for
getting on the CPU.

Just like Roman's scheduler did months ago. Imagine that.

And your point is?

intgr
on
November 20, 2007 - 5:00pm

And your point is?

I really don't see a point in your statement, but you seem to be implying that Roman's scheduler was better just because it had one additional feature that is useful? Have you forgotten that, at that point, CFS had already had several more features compared to Roman's, such as group scheduling and instrumentation?

My point

Anonymous (not verified)
on
November 21, 2007 - 6:11am

My point is that the scheduler mafia routinely receives valuable contributions and ignores them. Then they deviously reimplement the ideas without giving proper attribution. For an outside contributor this is the worst place in the kernel to work. And, not surprisingly, this is technically the worst part of the kernel.

My POV

intgr
on
November 21, 2007 - 8:44am

I'm seeing a completely inverse picture here.

the scheduler mafia routinely receives valuable contributions and ignores them. Then they deviously reimplement the ideas

In order to take advantage of Roman's contributions, the kernel team would have had to replace the whole CFS. That wouldn't have made much sense as I explained in my previous post.

Ironically, CFS was already fully functional when Roman ignored it and started writing his own reimplementation of CFS. Roman decided not to cooperate with other developers and add to CFS.

I personally found his exchanges with Ingo evasive, as if he didn't even want other developers to understand his scheduler. For instance, he was unwilling to break his work into a set of smaller patches, and this is absolutely essential to getting your code reviewed and accepted in the first place (even if it had made sense to throw out CFS completely at that point). Ingo even offered to do this work for him in order to learn from his scheduler, with the "RSDL".

Obviously nobody could force Roman to port his improvements over to CFS, so there was no other choice than to wait for someone else to do it, such as Peter Zijlstra.

For an outside contributor this is the worst place in the kernel to work

I can't agree with this. Quoting Ingo's announcement: "That's 187 individual commits from 32 authors." Only 80 of these commits came from Ingo. None of these contributions were "ignored by the scheduler mafia".

Just because some people fail to get along with kernel developers and make a huge fuss about it, doesn't mean that this is the case in general.

My POV

Anonymous (not verified)
on
November 21, 2007 - 1:05pm

Nice POV, but why bother? Everyone should know all this by now. Yet some guys keep telling the same old story over and over again. It's something like football to them; they don't really care about arguments, facts and reality.

pfff... :)

Pot calling the kettle black

Anonymous (not verified)
on
November 21, 2007 - 1:07pm

Ironically, CFS was already fully functional when Roman ignored it and started writing his own reimplementation of CFS.

You mean like the way Con's SD scheduler was already fully functional when Molnar wrote his own reimplementation, CFS? The patches thing was a ruse. Molnar is known to stonewall contributors in this way, never honestly intending to merge their code. The question came down to this: was Molnar able or willing to understand Roman's work? The answer is a definitive no. A lot of us think that today's CFS scheduler is joe code.

Ahhh... When the first

Anonymous (not verified)
on
November 21, 2007 - 1:43pm

Ahhh... When the first attack is refuted, try another one. Then another. Then another.

Oh, and only answer the paragraph where you think you have an edge.

You'd do fine in politics.

*Re*implementation

intgr
on
November 21, 2007 - 2:10pm

CFS was not a "reimplementation" of the SD, because the design of the two schedulers is nothing alike.

Roman's scheduler pretty much re-used the same approach as CFS, with various tweaks (many of which had already been implemented in CFS by Peter Zijlstra by the time Roman posted his scheduler).

Wrong

Anonymous (not verified)
on
November 22, 2007 - 8:49am

Just like Roman's scheduler did months ago.

No, the RFS patch did not do that at all.

Take a look at the check_preempt_curr_fair() function in kernel/sched_norm.c that Roman wrote, it's using the same static timeslices that CFS is using: "gran_norm" is not load-dependent at all, it's static. (it's a modified version of the original CFS code and it did not change CFS's time-slicing logic.)

So your argument does not even pass the sniff test.

Just like Roman's scheduler

Anonymous (not verified)
on
November 27, 2007 - 8:19am

Just like Roman's scheduler did months ago. Imagine that.

Much like something I did as an exercise, seven years ago or so--I don't think it's a revolutionary idea, but it's nice to see it in the kernel.

How to tune the scheduling on 2.6

rlx (not verified)
on
November 21, 2007 - 7:35pm

Hi,

I have been using kernel 2.4 for a long time and I installed 2.6.22.12 and
2.6.23.8 last week. I find that when the CPU usage is 100%, kernel 2.6
becomes non responsive (sluggish). Currently, I am running kernel 2.4.35,
the CPU usage is 100% and I don't even notice.

I pointed my browser on kerneltrap and the first thing I see is Ingo's
message.

Is there a simple explanation as to why scheduling on kernel 2.6 is not
as good as on kernel 2.4?

Or are there parameters that I can set to improve interactivity under high load?

Thanks

Richard

Just to point out here, CFS

scottharmon
on
November 21, 2007 - 9:36pm

Just to point out here, CFS was merged into mainline for the 2.6.23 release, so you might want to check that kernel out. Other than that, there are quite a few reasons you could be having perceived sluggishness. One that seems common to me is not having proper DMA support in your kernel (IO slowness seems to make everything sluggish).

Lack of responsiveness at 100% CPU on 2.6 kernel

rlx (not verified)
on
November 22, 2007 - 7:05am

Thanks for the info. I am trying to get up to date.

What I am referring to by 'non responsiveness' is the lag
between the cursor movement on the screen and the mouse movement,
the time between typing a letter and seeing the character
on the screen, and general window operations such as getting
the focus on a window. All at 100% CPU.

Anyway I need to get used to 2.6. I was just surprised by
the difference in behavior between 2.4 and 2.6 on an otherwise
identical system and with mostly the same kernel parameters
(kernel 2.6 inherited most of the parameters from kernel 2.4
in my installation).

I noted already that DMA activation works differently on 2.6 than on 2.4.
I believe that DMA is activated, but I still have to make sure.

Many of the options' names

scottharmon
on
November 24, 2007 - 10:28am

Many of the options' names have changed between 2.4 and 2.6, so it is probably just as easy to start from scratch when configuring a 2.6 kernel if you are coming from 2.4.

The responsiveness of my pc is back to normal

rlx (not verified)
on
November 24, 2007 - 9:24pm

Thanks for the nice comments. I replaced the hard disk IDE cable
with a 40 wire cable so my computer can now use dma5.

At last I think I have got the setup of kernel 2.6 right. My CPU is
currently running at 100% use (preparing a live dvd) and the response
is very good.

So I apologize for raising this issue.

But now, when I switch to a virtual console the screen becomes
dim. I tried 'setterm -half-bright off' with no effect.
After I boot, the brightness is normal, but after I switch
to another virtual console the text becomes almost unreadable.

I have searched the net and the kernel documentation without luck, yet.

Otherwise I feel comfortable running 2.6.
