[PATCH 3/3] sched: add avg-overlap support to RT tasks

Previous thread: [PATCH 1/2] [OLPC] sdhci: add quirk for the Marvell CaFe's vdd/powerup issue by Andres Salomon on Friday, June 27, 2008 - 1:24 pm. (2 messages)

Next thread: [GIT PULL] one fix for -rc by Sam Ravnborg on Friday, June 27, 2008 - 2:19 pm. (1 message)
From: Gregory Haskins
Date: Friday, June 27, 2008 - 1:29 pm

Hi Ingo,
  The following patches apply to linux-tip/sched/devel and enhance the
performance of the kernel (specifically in PREEMPT_RT, though they do
not regress mainline performance as far as I can tell).  They offer
somewhere between 50-100% speedups in netperf performance, depending
on the test.

Should you be interested in these patches, you can optionally pull them
from git at the following URL:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git sched-devel

Comments/feedback/bugfixes welcome

Regards,
-Greg
--

From: Gregory Haskins
Date: Friday, June 27, 2008 - 1:29 pm

Oprofile data shows that the system may spend a significant amount of
time (60%+) in find_busiest_groups as a result of newidle balancing.  This
entire operation is a critical section since it occurs inline with
a schedule().  Since we do find_busiest_groups() et. al. without locks
held for normal balancing, lets do it for newidle as well.  It will
at least allow other cpus and interrupts to make forward progress
(against our RQ) while we try to balance.

Additionally, we abort the newidle processing if we are preempted.

This patch should both improve latency response, as well as increase
throughput.  It has shown to significantly contribute to a 6-12%
increase in network peformance.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   40 +++++++++++++++++++++++++++++++++-------
 1 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c51d9fa..56722b1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3426,6 +3426,15 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd,
 
 	cpus_setall(*cpus);
 
+	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
+
+	/*
+	 * We are in a preempt-disabled section, so dropping the lock/irq
+	 * here simply means that other cores may acquire the lock,
+	 * and interrupts may occur.
+	 */
+	spin_unlock_irq(&this_rq->lock);
+
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In this case,
@@ -3436,7 +3445,6 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd,
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 		sd_idle = 1;
 
-	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
 redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
 				   &sd_idle, cpus, NULL);
@@ -3456,22 +3464,37 @@ redo:
 	schedstat_add(sd, lb_imbalance[CPU_NEWLY_IDLE], imbalance);
 
 	ld_moved = 0;
-	if (busiest->nr_running > 1) ...
From: Gregory Haskins
Date: Friday, June 27, 2008 - 1:29 pm

Inspired by Peter Zijlstra.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 56722b1..0b9f90e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2863,6 +2863,10 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 				max_load_move - total_load_moved,
 				sd, idle, all_pinned, &this_best_prio);
 		class = class->next;
+
+		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
+			break;
+
 	} while (class && max_load_move > total_load_moved);
 
 	return total_load_moved > 0;

--

From: Nick Piggin
Date: Monday, July 7, 2008 - 10:00 pm

What happened to the feedback I sent about this?

--

From: Gregory Haskins
Date: Tuesday, July 8, 2008 - 5:37 am

>>> On Tue, Jul 8, 2008 at  1:00 AM, in message
<200807081500.18245.nickpiggin@yahoo.com.au>, Nick Piggin

Ah yes.  Slipped through the cracks...sorry about that.

What if we did "if (idle == CPU_NEWLY_IDLE && need_resched())" instead?




--

From: Nick Piggin
Date: Wednesday, July 9, 2008 - 1:09 am

Isn't that exactly the same thing because any task will preempt the idle
thread?
--

From: Gregory Haskins
Date: Wednesday, July 9, 2008 - 3:53 am

>>> On Wed, Jul 9, 2008 at  4:09 AM, in message
<200807091809.52293.nickpiggin@yahoo.com.au>, Nick Piggin

Not quite.  The former version would break on *any* succesful enqueue (as a result of a
local move_task as well as a remote wake-up/migration).  The latter version will only
break on the the remote variety.  You were concerned about stopping a move_task
operation early because it would reduce efficiency, and I do not entirely disagree.
However, this really only concerns the local type (which have now been removed).

Remote preemptions should (IMO) always break immediately because it would have been likely
to invalidate the f_b_g() calculation anyway, and low-latency requirements dictate

During NEWIDLE this is a preempt-disabled section because we are still in the middle
of a schedule(). Therefore there will be no involuntary preemption and that is why we
are concerned with making sure we check for voluntary preemption.  The move_tasks()
will run to completion without this patch.  With this patch it will break the operation if
someone tries to preempt us.

I'll keep an open mind but I am pretty sure this is something we should be doing. As far
as I can tell, there should be no downside with this second version.  As a compromise
we could put an #ifdef CONFIG_PREEMPT around this new logic, but I don't think it
is strictly necessary.

Regards,
-Greg
--

From: Nick Piggin
Date: Wednesday, July 9, 2008 - 4:17 am

I thought this was about newidle balancing? Tasks are always going to

Firstly, won't the act of pulling tasks set the need_resched condition?

Secondly, even if it does what you say, what exactly would be the difference
between blocking a newly moved task from running and blocking a newly woken
task from running? Either way you introduce the same worst case latency

I don't think it has really been thought through that well. So I'm still

That's not very nice. It's reasonable to run with CONFIG_PREEMPT but not
blindly want to trade latency for throughput.
--

From: Gregory Haskins
Date: Wednesday, July 9, 2008 - 4:53 am

>>> On Wed, Jul 9, 2008 at  7:17 AM, in message
<200807092117.01669.nickpiggin@yahoo.com.au>, Nick Piggin

Yes, but you misunderstand me.  I am referring to "push" (remote moves to us) verses
"pull" (we move from remote).  During a move_task() we sometimes have to drop the
RQ lock in the double_lock balance.  This gives a remote CPU a chance to grab the lock
and potentially move tasks to us as part of either a migration operation, or a wake-up.

When this happens, several things should be noted:  1) it will change the load "landscape"
such that any previous computation in f_b_g() is potentially invalid.  2) The task that was
moved may be higher priority and therefore should not have to wait for move_tasks() to
finish moving some arbitrary number of lower-priority tasks (and note that "lower prio"
is a high-probability since NEWIDLE only does CFS tasks, and only RT tasks typically
migrate like this).

Therefore, IMO it doesnt make sense to continue moving more load.  Just stop and let the

Hmm.. Indeed.   You are probably right about that and I need some other way to indicate

Tasks that are pushed to us have a good chance to be RT (since RT is a heavy user of 
"push" methods, while CFS is mostly pull).  Conversely, tasks that are pulled to us by
newidle are guaranteed *not* to be RT (since newidle balancing will only pull CFS tasks).

Perhaps that is the answer: terminate on s/need_resched()/rq->rt.nr_running.  Its not
exactly scalable to an arbitrary arrangement of future sched_classes, but that could be
addresses when those sched_classes become available.

To be fair, I think this is what Peter was trying to do with his more elaborate version of

How do you come to this conclusion?  Continuing to perform a move under these
circumstances (or at least, my intended circumstances) is against stale data and
could just as well hurt throughput as much as help it.  Since the move is
essentially arbitrary once this threshold is crossed, even the throughput will
become ...
From: Gregory Haskins
Date: Friday, June 27, 2008 - 1:30 pm

We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
for cfs to cfs interactions.  There is no reason why an rt to cfs
interaction cannot share in establishing a relationhip in a similar
manner.

Because PREEMPT_RT runs many kernel threads as FIFO priority, we often
times have heavy interaction between RT threads waking CFS applications.
This patch offers a substantial boost (50-60%+) in perfomance under those
circumstances.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c      |   14 ++++++++++++++
 kernel/sched_fair.c |   21 ++-------------------
 2 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0b9f90e..af9a68b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1544,6 +1544,12 @@ static void set_load_weight(struct task_struct *p)
 	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
 }
 
+static void update_avg(u64 *avg, u64 sample)
+{
+	s64 diff = sample - *avg;
+	*avg += diff >> 3;
+}
+
 static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	sched_info_queued(p);
@@ -1553,6 +1559,12 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+	if (sleep && p->se.last_wakeup) {
+		update_avg(&p->se.avg_overlap,
+			   p->se.sum_exec_runtime - p->se.last_wakeup);
+		p->se.last_wakeup = 0;
+	}
+
 	p->sched_class->dequeue_task(rq, p, sleep);
 	p->se.on_rq = 0;
 }
@@ -2157,6 +2169,8 @@ out_running:
 		p->sched_class->task_wake_up(rq, p);
 #endif
 out:
+	current->se.last_wakeup = current->se.sum_exec_runtime;
+
 	task_rq_unlock(rq, &flags);
 
 	return success;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8e77d67..3c06027 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -639,21 +639,6 @@ enqueue_entity(struct cfs_rq ...
From: Peter Zijlstra
Date: Friday, June 27, 2008 - 1:51 pm

Seem reasonable to me,

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

--

From: Ingo Molnar
Date: Monday, June 30, 2008 - 5:56 am

applied to tip/sched/devel - thanks guys.

I have also merged tip/sched/devel.smp-group-balance into 
tip/sched/devel, those changes have held up fine in testing, with little 
problems.

	Ingo
--

From: Ingo Molnar
Date: Monday, June 30, 2008 - 6:15 am

-tip testing found this boot hang:

 Linux version 2.6.26-rc8-tip (mingo@dione) (gcc version 4.2.3) #12917 
 SMP Mon Jun 30 15:06:32 CEST 2008
 [...]
 CPU 1/1 -> Node 0
 CPU: Physical Processor ID: 0
 CPU: Processor Core ID: 1
 CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02
 Brought up 2 CPUs
 Total of 2 processors activated (8041.15 BogoMIPS).
 [ hard hang ]

with this config:

  http://redhat.com/~mingo/misc/config-Mon_Jun_30_14_54_19_CEST_2008.bad

full bootlog:

  http://redhat.com/~mingo/misc/hang-Mon_Jun_30_14_54_19_CEST_2008.bad

it should continue with this bootup sequence:

  calling  net_ns_init+0x0/0x1a0
  net_namespace: 376 bytes
  initcall net_ns_init+0x0/0x1a0 returned 0 after 0 msecs
  calling  init_smp_flush+0x0/0x60

i've bisected it down to:

--------------
| commit cc8160c56843201891766660e3816d2e546c1b17
| Author: Gregory Haskins <ghaskins@novell.com>
| Date:   Fri Jun 27 14:29:50 2008 -0600
|
|     sched: enable interrupts and drop rq-lock during newidle balancing
--------------

so i've reverted that change for now.

	Ingo
--

From: Gregory Haskins
Date: Monday, June 30, 2008 - 4:20 am

>>> On Mon, Jun 30, 2008 at  9:15 AM, in message <20080630131511.GA7506@elte.hu>,

Ill take a look, Ingo.  Sorry for the inconvenience and thanks for the bisect.



--

From: Gregory Haskins
Date: Monday, June 30, 2008 - 7:41 am

>>> On Mon, Jun 30, 2008 at  9:15 AM, in message <20080630131511.GA7506@elte.hu>,

Ok, I dug in a little bit here.  I haven't verified this out yet, but I think the problem is that
your config is PREEMPT_VOLUNTARY which NOPs the preempt_disable() in schedule() that
I rely on to allow the lock to be dropped.  (Doh!)

One way I can fix this is to fixup the newidle() code to only play the irq dropping tricks
ifdef CONFIG_PREEMPT == TRUE.  Does this sound reasonable, or is there a better way
to address this?



--

From: Steven Rostedt
Date: Monday, June 30, 2008 - 8:01 am

This means that there's a location somewhere that does a cond_resched?
Perhaps we should look to see what does that. And perhaps the
preempt_disable should not be allowing PREEMPT_VOLUNTARY to do scheduling
with cond_resched.


--

From: Gregory Haskins
Date: Monday, June 30, 2008 - 10:16 am

>>> On Mon, Jun 30, 2008 at  9:15 AM, in message <20080630131511.GA7506@elte.hu>,

I may have found the issue:  It looks like the hunk that initially disables interrupts in load_balance_newidle() was inadvertently applied to load_balance() instead during the
merge to linux-tip.  If you fold the following patch into my original patch, it should set
things right again.

-----

sched: fix merge problem with newidle enhancement patch

From: Gregory Haskins <ghaskins@novell.com>

commit cc8160c56843201891766660e3816d2e546c1b17 introduces a locking
enhancement for newidle.  However, one hunk misapplied to load_balance
instead of load_balance_newidle.  This fixes the issue.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f35d73c..f36406f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3459,15 +3459,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	cpus_setall(*cpus);
 
-	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
-
-	/*
-	 * We are in a preempt-disabled section, so dropping the lock/irq
-	 * here simply means that other cores may acquire the lock,
-	 * and interrupts may occur.
-	 */
-	spin_unlock_irq(&this_rq->lock);
-
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In this case,
@@ -3630,6 +3621,15 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd,
 
 	cpus_setall(*cpus);
 
+	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
+
+	/*
+	 * We are in a preempt-disabled section, so dropping the lock/irq
+	 * here simply means that other cores may acquire the lock,
+	 * and interrupts may occur.
+	 */
+	spin_unlock_irq(&this_rq->lock);
+
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In ...
From: Ingo Molnar
Date: Monday, June 30, 2008 - 11:10 am

ah, that might have indeed happened. Good catch, i'll test it tomorrow.

	Ingo
--

From: Ingo Molnar
Date: Thursday, July 3, 2008 - 7:41 am

ah, sorry - indeed! I've reactivated your patch and i'm testing it now.

	Ingo
--

From: Ingo Molnar
Date: Thursday, July 3, 2008 - 8:12 am

-tip testing found that it still hangs at:

 Intel machine check architecture supported.
 Intel machine check reporting enabled on CPU#1.
 x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
 CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02
 Brought up 2 CPUs
 Total of 2 processors activated (8044.97 BogoMIPS).
 [...]

with this config:

  http://redhat.com/~mingo/misc/config-Thu_Jul__3_16_49_53_CEST_2008.bad

on 3 separate test-systems.

I've again reverted both your fixup and the original commit. I've pushed 
out the failing tree to the tmp.sched/devel.fe6149f5e82 -tip branch, you 
should be able to reproduce it.

	Ingo
--

From: Gregory Haskins
Date: Tuesday, July 8, 2008 - 5:38 am

>>> On Thu, Jul 3, 2008 at 11:12 AM, in message <20080703151209.GA17059@elte.hu>,

Hmm..my fix + your config booted for me at the time I submitted it.  I will try to repro
this issue with the new info.

Regards,


--

From: Gregory Haskins
Date: Tuesday, July 8, 2008 - 9:45 am

I was not able to reproduce the hang by taking sched/devel/HEAD, my patch, and your
config (or any config for that matter).  We will have to back-burner this patch for now
until I can think of a way to figure out what is wrong here.  Sorry for the troubles.

-Greg



--

Previous thread: [PATCH 1/2] [OLPC] sdhci: add quirk for the Marvell CaFe's vdd/powerup issue by Andres Salomon on Friday, June 27, 2008 - 1:24 pm. (2 messages)

Next thread: [GIT PULL] one fix for -rc by Sam Ravnborg on Friday, June 27, 2008 - 2:19 pm. (1 message)