Hi Ingo, The following patches apply to linux-tip/sched/devel and enhance the performance of the kernel (specifically in PREEMPT_RT, though they do not regress mainline performance as far as I can tell). They offer somewhere between 50-100% speedups in netperf performance, depending on the test. Should you be interested in these patches, you can optionally pull them from git at the following URL: git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git sched-devel Comments/feedback/bugfixes welcome Regards, -Greg --
Oprofile data shows that the system may spend a significant amount of time (60%+) in find_busiest_groups as a result of newidle balancing. This entire operation is a critical section since it occurs inline with a schedule(). Since we do find_busiest_groups() et. al. without locks held for normal balancing, lets do it for newidle as well. It will at least allow other cpus and interrupts to make forward progress (against our RQ) while we try to balance. Additionally, we abort the newidle processing if we are preempted. This patch should both improve latency response, as well as increase throughput. It has shown to significantly contribute to a 6-12% increase in network peformance. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/sched.c | 40 +++++++++++++++++++++++++++++++++------- 1 files changed, 33 insertions(+), 7 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index c51d9fa..56722b1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3426,6 +3426,15 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd, cpus_setall(*cpus); + schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]); + + /* + * We are in a preempt-disabled section, so dropping the lock/irq + * here simply means that other cores may acquire the lock, + * and interrupts may occur. + */ + spin_unlock_irq(&this_rq->lock); + /* * When power savings policy is enabled for the parent domain, idle * sibling can pick up load irrespective of busy siblings. In this case, @@ -3436,7 +3445,6 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd, !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE)) sd_idle = 1; - schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]); redo: group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE, &sd_idle, cpus, NULL); @@ -3456,22 +3464,37 @@ redo: schedstat_add(sd, lb_imbalance[CPU_NEWLY_IDLE], imbalance); ld_moved = 0; - if (busiest->nr_running > 1) ...
Inspired by Peter Zijlstra. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/sched.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 56722b1..0b9f90e 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2863,6 +2863,10 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, max_load_move - total_load_moved, sd, idle, all_pinned, &this_best_prio); class = class->next; + + if (idle == CPU_NEWLY_IDLE && this_rq->nr_running) + break; + } while (class && max_load_move > total_load_moved); return total_load_moved > 0; --
What happened to the feedback I sent about this? --
>>> On Tue, Jul 8, 2008 at 1:00 AM, in message <200807081500.18245.nickpiggin@yahoo.com.au>, Nick Piggin Ah yes. Slipped through the cracks...sorry about that. What if we did "if (idle == CPU_NEWLY_IDLE && need_resched())" instead? --
Isn't that exactly the same thing because any task will preempt the idle thread? --
>>> On Wed, Jul 9, 2008 at 4:09 AM, in message <200807091809.52293.nickpiggin@yahoo.com.au>, Nick Piggin Not quite. The former version would break on *any* succesful enqueue (as a result of a local move_task as well as a remote wake-up/migration). The latter version will only break on the the remote variety. You were concerned about stopping a move_task operation early because it would reduce efficiency, and I do not entirely disagree. However, this really only concerns the local type (which have now been removed). Remote preemptions should (IMO) always break immediately because it would have been likely to invalidate the f_b_g() calculation anyway, and low-latency requirements dictate During NEWIDLE this is a preempt-disabled section because we are still in the middle of a schedule(). Therefore there will be no involuntary preemption and that is why we are concerned with making sure we check for voluntary preemption. The move_tasks() will run to completion without this patch. With this patch it will break the operation if someone tries to preempt us. I'll keep an open mind but I am pretty sure this is something we should be doing. As far as I can tell, there should be no downside with this second version. As a compromise we could put an #ifdef CONFIG_PREEMPT around this new logic, but I don't think it is strictly necessary. Regards, -Greg --
I thought this was about newidle balancing? Tasks are always going to Firstly, won't the act of pulling tasks set the need_resched condition? Secondly, even if it does what you say, what exactly would be the difference between blocking a newly moved task from running and blocking a newly woken task from running? Either way you introduce the same worst case latency I don't think it has really been thought through that well. So I'm still That's not very nice. It's reasonable to run with CONFIG_PREEMPT but not blindly want to trade latency for throughput. --
>>> On Wed, Jul 9, 2008 at 7:17 AM, in message <200807092117.01669.nickpiggin@yahoo.com.au>, Nick Piggin Yes, but you misunderstand me. I am referring to "push" (remote moves to us) verses "pull" (we move from remote). During a move_task() we sometimes have to drop the RQ lock in the double_lock balance. This gives a remote CPU a chance to grab the lock and potentially move tasks to us as part of either a migration operation, or a wake-up. When this happens, several things should be noted: 1) it will change the load "landscape" such that any previous computation in f_b_g() is potentially invalid. 2) The task that was moved may be higher priority and therefore should not have to wait for move_tasks() to finish moving some arbitrary number of lower-priority tasks (and note that "lower prio" is a high-probability since NEWIDLE only does CFS tasks, and only RT tasks typically migrate like this). Therefore, IMO it doesnt make sense to continue moving more load. Just stop and let the Hmm.. Indeed. You are probably right about that and I need some other way to indicate Tasks that are pushed to us have a good chance to be RT (since RT is a heavy user of "push" methods, while CFS is mostly pull). Conversely, tasks that are pulled to us by newidle are guaranteed *not* to be RT (since newidle balancing will only pull CFS tasks). Perhaps that is the answer: terminate on s/need_resched()/rq->rt.nr_running. Its not exactly scalable to an arbitrary arrangement of future sched_classes, but that could be addresses when those sched_classes become available. To be fair, I think this is what Peter was trying to do with his more elaborate version of How do you come to this conclusion? Continuing to perform a move under these circumstances (or at least, my intended circumstances) is against stale data and could just as well hurt throughput as much as help it. Since the move is essentially arbitrary once this threshold is crossed, even the throughput will become ...
We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
for cfs to cfs interactions. There is no reason why an rt to cfs
interaction cannot share in establishing a relationhip in a similar
manner.
Because PREEMPT_RT runs many kernel threads as FIFO priority, we often
times have heavy interaction between RT threads waking CFS applications.
This patch offers a substantial boost (50-60%+) in perfomance under those
circumstances.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 14 ++++++++++++++
kernel/sched_fair.c | 21 ++-------------------
2 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0b9f90e..af9a68b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1544,6 +1544,12 @@ static void set_load_weight(struct task_struct *p)
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}
+static void update_avg(u64 *avg, u64 sample)
+{
+ s64 diff = sample - *avg;
+ *avg += diff >> 3;
+}
+
static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
sched_info_queued(p);
@@ -1553,6 +1559,12 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ if (sleep && p->se.last_wakeup) {
+ update_avg(&p->se.avg_overlap,
+ p->se.sum_exec_runtime - p->se.last_wakeup);
+ p->se.last_wakeup = 0;
+ }
+
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
}
@@ -2157,6 +2169,8 @@ out_running:
p->sched_class->task_wake_up(rq, p);
#endif
out:
+ current->se.last_wakeup = current->se.sum_exec_runtime;
+
task_rq_unlock(rq, &flags);
return success;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8e77d67..3c06027 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -639,21 +639,6 @@ enqueue_entity(struct cfs_rq ...Seem reasonable to me, Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --
applied to tip/sched/devel - thanks guys. I have also merged tip/sched/devel.smp-group-balance into tip/sched/devel, those changes have held up fine in testing, with little problems. Ingo --
-tip testing found this boot hang: Linux version 2.6.26-rc8-tip (mingo@dione) (gcc version 4.2.3) #12917 SMP Mon Jun 30 15:06:32 CEST 2008 [...] CPU 1/1 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02 Brought up 2 CPUs Total of 2 processors activated (8041.15 BogoMIPS). [ hard hang ] with this config: http://redhat.com/~mingo/misc/config-Mon_Jun_30_14_54_19_CEST_2008.bad full bootlog: http://redhat.com/~mingo/misc/hang-Mon_Jun_30_14_54_19_CEST_2008.bad it should continue with this bootup sequence: calling net_ns_init+0x0/0x1a0 net_namespace: 376 bytes initcall net_ns_init+0x0/0x1a0 returned 0 after 0 msecs calling init_smp_flush+0x0/0x60 i've bisected it down to: -------------- | commit cc8160c56843201891766660e3816d2e546c1b17 | Author: Gregory Haskins <ghaskins@novell.com> | Date: Fri Jun 27 14:29:50 2008 -0600 | | sched: enable interrupts and drop rq-lock during newidle balancing -------------- so i've reverted that change for now. Ingo --
>>> On Mon, Jun 30, 2008 at 9:15 AM, in message <20080630131511.GA7506@elte.hu>, Ill take a look, Ingo. Sorry for the inconvenience and thanks for the bisect. --
>>> On Mon, Jun 30, 2008 at 9:15 AM, in message <20080630131511.GA7506@elte.hu>, Ok, I dug in a little bit here. I haven't verified this out yet, but I think the problem is that your config is PREEMPT_VOLUNTARY which NOPs the preempt_disable() in schedule() that I rely on to allow the lock to be dropped. (Doh!) One way I can fix this is to fixup the newidle() code to only play the irq dropping tricks ifdef CONFIG_PREEMPT == TRUE. Does this sound reasonable, or is there a better way to address this? --
This means that there's a location somewhere that does a cond_resched? Perhaps we should look to see what does that. And perhaps the preempt_disable should not be allowing PREEMPT_VOLUNTARY to do scheduling with cond_resched. --
>>> On Mon, Jun 30, 2008 at 9:15 AM, in message <20080630131511.GA7506@elte.hu>, I may have found the issue: It looks like the hunk that initially disables interrupts in load_balance_newidle() was inadvertently applied to load_balance() instead during the merge to linux-tip. If you fold the following patch into my original patch, it should set things right again. ----- sched: fix merge problem with newidle enhancement patch From: Gregory Haskins <ghaskins@novell.com> commit cc8160c56843201891766660e3816d2e546c1b17 introduces a locking enhancement for newidle. However, one hunk misapplied to load_balance instead of load_balance_newidle. This fixes the issue. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/sched.c | 18 +++++++++--------- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index f35d73c..f36406f 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3459,15 +3459,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, cpus_setall(*cpus); - schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]); - - /* - * We are in a preempt-disabled section, so dropping the lock/irq - * here simply means that other cores may acquire the lock, - * and interrupts may occur. - */ - spin_unlock_irq(&this_rq->lock); - /* * When power savings policy is enabled for the parent domain, idle * sibling can pick up load irrespective of busy siblings. In this case, @@ -3630,6 +3621,15 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd, cpus_setall(*cpus); + schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]); + + /* + * We are in a preempt-disabled section, so dropping the lock/irq + * here simply means that other cores may acquire the lock, + * and interrupts may occur. + */ + spin_unlock_irq(&this_rq->lock); + /* * When power savings policy is enabled for the parent domain, idle * sibling can pick up load irrespective of busy siblings. In ...
ah, that might have indeed happened. Good catch, i'll test it tomorrow. Ingo --
ah, sorry - indeed! I've reactivated your patch and i'm testing it now. Ingo --
-tip testing found that it still hangs at: Intel machine check architecture supported. Intel machine check reporting enabled on CPU#1. x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106 CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02 Brought up 2 CPUs Total of 2 processors activated (8044.97 BogoMIPS). [...] with this config: http://redhat.com/~mingo/misc/config-Thu_Jul__3_16_49_53_CEST_2008.bad on 3 separate test-systems. I've again reverted both your fixup and the original commit. I've pushed out the failing tree to the tmp.sched/devel.fe6149f5e82 -tip branch, you should be able to reproduce it. Ingo --
>>> On Thu, Jul 3, 2008 at 11:12 AM, in message <20080703151209.GA17059@elte.hu>, Hmm..my fix + your config booted for me at the time I submitted it. I will try to repro this issue with the new info. Regards, --
I was not able to reproduce the hang by taking sched/devel/HEAD, my patch, and your config (or any config for that matter). We will have to back-burner this patch for now until I can think of a way to figure out what is wrong here. Sorry for the troubles. -Greg --
