This is the latest incarnation of my scary ttwu patch, its split up into tiny little pieces because at some point while working on the feedback from the last posting it all stopped working and tiny changes was the only way to find out wth I done wrong (I never did find out though, when I completely rewrote everything in tiny chunks it all magically worked again -- so I guess tglx is chuckling now). Anyway, it still need a few more loose ends fixed, esp on 32bit there is the whole unlocked remote min_vruntime access where we can read partial updates and screw the timeline like problem. And I suppose there's a few more XXXs in there where I know things are needing TLC. Also, I bet the eagle eyed lot Cc'ed will spot more holes, as all this is scary stuff :-) Anyway, lots of thanks to Frank, Oleg, Ingo and others who commented on the last posting. I guess most everybody is out having x-mas celebrations (and I will too not too long from now), so I expect replies to this won't happen until the new year, but who knows, maybe in between eating and drinking too much we all get an urge to stare at code. Happy Holidays! --
Currently p->pi_lock already serializes p->sched_class, also put
p->cpus_allowed and try_to_wake_up() under it, this prepares the way
to do the first part of ttwu() without holding rq->lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 37 ++++++++++++++++---------------------
1 file changed, 16 insertions(+), 21 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2301,7 +2301,7 @@ void task_oncpu_function_call(struct tas
#ifdef CONFIG_SMP
/*
- * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
+ * ->cpus_allowed is protected by both rq->lock and p->pi_lock
*/
static int select_fallback_rq(int cpu, struct task_struct *p)
{
@@ -2334,7 +2334,7 @@ static int select_fallback_rq(int cpu, s
}
/*
- * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
+ * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
@@ -2450,7 +2450,8 @@ static int try_to_wake_up(struct task_st
this_cpu = get_cpu();
smp_wmb();
- rq = task_rq_lock(p, &flags);
+ raw_spin_lock_irqsave(&p->pi_lock, flags);
+ rq = __task_rq_lock(p);
if (!(p->state & state))
goto out;
@@ -2508,7 +2509,8 @@ static int try_to_wake_up(struct task_st
ttwu_stat(rq, p, cpu, wake_flags);
success = 1;
out:
- task_rq_unlock(rq, &flags);
+ __task_rq_unlock(rq);
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
put_cpu();
return success;
@@ -4543,6 +4545,8 @@ void rt_mutex_setprio(struct task_struct
BUG_ON(prio < 0 || prio > MAX_PRIO);
+ lockdep_assert_held(&p->pi_lock);
+
rq = task_rq_lock(p, &flags);
trace_sched_pi_setprio(p, prio);
@@ -5150,7 +5154,6 @@ long sched_getaffinity(pid_t pid, struct
{
struct task_struct *p;
unsigned long ...Yes for wakeup, but not true for fork. I don't see protection in wake_up_new_task(). Or am I missing something? Thanks, Yong --
Ah, true, wake_up_new_task() holds task_rq_lock() which is sufficient, but yes, we could also make that pi_lock. --
Currently ttwu() does two rq->lock acquisitions, once on the task's
old rq, holding it over the p->state fiddling and load-balance pass.
Then it drops the old rq->lock to acquire the new rq->lock.
By having serialized ttwu(), p->sched_class, p->cpus_allowed with
p->pi_lock, we can now drop the whole first rq->lock acquisition.
The p->pi_lock serializing concurrent ttwu() calls protects p->state,
which we will set to TASK_WAKING to bridge possible p->pi_lock to
rq->lock gaps and serialize set_task_cpu() calls against
task_rq_lock().
The p->pi_lock serialization of p->sched_class allows us to call
scheduling class methods without holding the rq->lock, and the
serialization of p->cpus_allowed allows us to do the load-balancing
bits without races.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 57 ++++++++++++++++++++++++++--------------------------
kernel/sched_fair.c | 3 --
2 files changed, 30 insertions(+), 30 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2440,69 +2440,70 @@ ttwu_post_activation(struct task_struct
* Returns %true if @p was woken up, %false if it was already running
* or @state didn't match @p's state.
*/
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
- int cpu, orig_cpu, this_cpu, success = 0;
+ int cpu, this_cpu, success = 0;
unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
struct rq *rq;
this_cpu = get_cpu();
smp_wmb();
raw_spin_lock_irqsave(&p->pi_lock, flags);
- rq = __task_rq_lock(p);
if (!(p->state & state))
goto out;
cpu = task_cpu(p);
- if (p->on_rq)
- goto out_running;
+ if (p->on_rq) {
+ rq = __task_rq_lock(p);
+ if (p->on_rq)
+ goto ...Collect all ttwu stat code into a single function and ensure its
always called for an actual wakeup (changing p->state to
TASK_RUNNING).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 67 +++++++++++++++++++++++++++------------------------------
1 file changed, 32 insertions(+), 35 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2366,21 +2366,36 @@ static void update_avg(u64 *avg, u64 sam
}
#endif
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
{
+#ifdef CONFIG_SCHEDSTATS
+ int this_cpu = smp_processor_id();
+
+ schedstat_inc(rq, ttwu_count);
schedstat_inc(p, se.statistics.nr_wakeups);
- if (is_sync)
+
+ if (wake_flags & WF_SYNC)
schedstat_inc(p, se.statistics.nr_wakeups_sync);
- if (is_migrate)
+
+ if (cpu != task_cpu(p))
schedstat_inc(p, se.statistics.nr_wakeups_migrate);
- if (is_local)
+
+ if (cpu == this_cpu) {
+ schedstat_inc(rq, ttwu_local);
schedstat_inc(p, se.statistics.nr_wakeups_local);
- else
- schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ } else {
+ struct sched_domain *sd;
- activate_task(rq, p, en_flags);
+ schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ for_each_domain(this_cpu, sd) {
+ if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+ schedstat_inc(sd, ttwu_wake_remote);
+ break;
+ }
+ }
+ }
+#endif /* CONFIG_SCHEDSTATS */
}
static void
@@ -2440,12 +2455,12 @@ static int try_to_wake_up(struct task_st
if (!(p->state & state))
goto out;
+ cpu = task_cpu(p);
+
if (p->se.on_rq)
goto out_running;
- cpu = task_cpu(p);
orig_cpu = cpu;
-
#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto ...try_to_wake_up() would only return a success when it would have to
place a task on a rq, change that to every time we change p->state to
TASK_RUNNING, because that's the real measure of wakeups.
This results in that success is always true for the tracepoints.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2383,10 +2383,10 @@ static inline void ttwu_activate(struct
activate_task(rq, p, en_flags);
}
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
- int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
{
- trace_sched_wakeup(p, success);
+ trace_sched_wakeup(p, true);
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
@@ -2406,7 +2406,7 @@ static inline void ttwu_post_activation(
}
#endif
/* if a worker is waking up, notify workqueue */
- if ((p->flags & PF_WQ_WORKER) && success)
+ if (p->flags & PF_WQ_WORKER)
wq_worker_waking_up(p, cpu_of(rq));
}
@@ -2505,9 +2505,9 @@ static int try_to_wake_up(struct task_st
#endif /* CONFIG_SMP */
ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
cpu == this_cpu, en_flags);
- success = 1;
out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
+ ttwu_post_activation(p, rq, wake_flags);
+ success = 1;
out:
task_rq_unlock(rq, &flags);
put_cpu();
@@ -2526,7 +2526,6 @@ static int try_to_wake_up(struct task_st
static void try_to_wake_up_local(struct task_struct *p)
{
struct rq *rq = task_rq(p);
- bool success = false;
BUG_ON(rq != this_rq());
BUG_ON(p == current);
@@ -2541,9 +2540,8 @@ static void try_to_wake_up_local(struct
schedstat_inc(rq, ttwu_local);
}
...Only wait for the current holder to release the lock.
spin_unlock_wait() can only be about the current holder, since
completion of this function is inherently racy with new contenders.
Therefore, there is no reason to wait until the lock is completely
unlocked.
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/x86/include/asm/spinlock.h | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
Index: linux-2.6/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6/arch/x86/include/asm/spinlock.h
@@ -158,18 +158,32 @@ static __always_inline void __ticket_spi
}
#endif
+#define TICKET_MASK ((1 << TICKET_SHIFT) - 1)
+
static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
int tmp = ACCESS_ONCE(lock->slock);
- return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+ return !!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK);
}
static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
{
int tmp = ACCESS_ONCE(lock->slock);
- return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+ return (((tmp >> TICKET_SHIFT) - tmp) & TICKET_MASK) > 1;
+}
+
+static inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
+{
+ int tmp = ACCESS_ONCE(lock->slock);
+
+ if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
+ return; /* not locked */
+
+ /* wait until the current lock holder goes away */
+ while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))
+ cpu_relax();
}
#ifndef CONFIG_PARAVIRT_SPINLOCKS
@@ -206,7 +220,11 @@ static __always_inline void arch_spin_lo
arch_spin_lock(lock);
}
-#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+static inline void arch_spin_unlock_wait(arch_spinlock_t ...Is there really any reason for this patch? I'd rather keep the simpler
Also, the above is just ugly. You've lost the ACCESS_ONCE() on the
lock access, and it's using another model of masking than the regular
one. Both of which may be intentional (maybe you are _trying_ to get
the compiler to just load the low bytes and avoid the 'and'), but the
whole open-coding of the logic - twice, and with different looking
masking - just makes my skin itch.
Linus
--
No numbers, the testcase I use for this series is too unstable to really
give that fine results. Its more a result of seeing the code an going:
"oohh that can wait a long time when the lock is severely contended".
But I think I can get rid of the need for calling this primitive
I'm not sure I fully understand the complaint here. The ACCESS_ONCE is
for the tmp variable, which we use several times and needs to contain a
single load of the lock variable and should not be optimized away into
multiple loads.
The first conditional:
if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
Is exactly like the regular __ticket_spin_is_contended, and while that
is a somewhat overly clever way of writing head != tail, I don't see a
problem with that.
The second conditional:
while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))
Is indeed different, it waits for the lock tail (new load) to change
from the first observed (first load) tail. Once we observe the tail
index changing we know the previous owner completed and we can drop out.
Anyway, if I can indeed get rid of my unlock_wait usage its all moot
anyway, there aren't many users of this primitive.
--
It would be kind of nice to fix, with ticket locks, dumb spin_unlock_wait
can infinitely starve if the lock queue is never empty, wheras at least
the simple spinlocks it would have a statistical chance of being given
I always hated it because it seems hard to use right and verify result
is correct, particularly because it has no memory ordering guarantees.
assert(active == 1);
spin_lock(&blah);
if (should_die)
active = 0;
counter++;
spin_unlock(&blah);
if (active) {
spin_lock(&blah);
/* do something */
spin_unlock(&blah);
} else {
/* wait for last to go away */
spin_unlock_wait(&blah);
counter++;
}
I don't know, stupid example but I can't really think of good ways to
use it off the top of my head.
Anyway this has a lost update problem even on x86 because counter can
be speculatively loaded out of order from the load of the lock word.
So the nice simple lock APIs which supposedly don't require any thought
of barriers have tricked us!
So I agree, taking it out the back and shooting it in the head would make
the world a better place.
--
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2470,15 +2470,15 @@ static int ttwu_remote(struct task_struc
return ret;
}
-void sched_ttwu_pending(void)
+static void __sched_ttwu_pending(struct rq *rq)
{
#ifdef CONFIG_SMP
- struct rq *rq = this_rq();
struct task_struct *list = xchg(&rq->wake_list, NULL);
if (!list)
return;
+ rq = this_rq(); /* always enqueue locally */
raw_spin_lock(&rq->lock);
while (list) {
@@ -2491,6 +2491,11 @@ void sched_ttwu_pending(void)
#endif
}
+void sched_ttwu_pending(void)
+{
+ __sched_ttwu_pending(this_rq());
+}
+
#ifdef CONFIG_SMP
static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
@@ -6162,6 +6167,17 @@ migration_call(struct notifier_block *nf
migrate_nr_uninterruptible(rq);
calc_global_load_remove(rq);
break;
+
+ case CPU_DEAD:
+ /*
+ * Queue any possible remaining pending wakeups on this cpu.
+ * Load-balancing will sort it out eventually.
+ */
+ local_irq_save(flags);
+ __sched_ttwu_pending(cpu_rq(cpu));
+ local_irq_restore(flags);
+ break;
+
#endif
}
return NOTIFY_OK;
--
But it's possible that p is not allowed to run on this cpu, right? Thanks, --
Very good point, yes I totally forgot about that.. stupid ->cpus_allowed. OK, I guess we need a new select_task_rq() call in the offline case.. --
In prepratation of having to call task_contributes_to_load() without
holding rq->lock, we need to store the result until we do and can
update the rq accounting accordingly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 1 +
kernel/sched.c | 16 ++++------------
2 files changed, 5 insertions(+), 12 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1264,6 +1264,7 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
+ unsigned sched_contributes_to_load:1;
pid_t pid;
pid_t tgid;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2467,18 +2467,7 @@ static int try_to_wake_up(struct task_st
if (unlikely(task_running(rq, p)))
goto out_activate;
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
+ p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
if (p->sched_class->task_waking) {
@@ -2503,6 +2492,9 @@ static int try_to_wake_up(struct task_st
WARN_ON(task_cpu(p) != cpu);
WARN_ON(p->state != TASK_WAKING);
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible--;
+
out_activate:
#endif /* CONFIG_SMP */
activate_task(rq, p, en_flags);
--
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2367,10 +2367,11 @@ static void update_avg(u64 *avg, u64 sam
#endif
static void
-ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
+ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
{
#ifdef CONFIG_SCHEDSTATS
int this_cpu = smp_processor_id();
+ struct rq *rq = this_rq();
schedstat_inc(rq, ttwu_count);
schedstat_inc(p, se.statistics.nr_wakeups);
@@ -2491,9 +2492,10 @@ try_to_wake_up(struct task_struct *p, un
activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
out_running:
ttwu_post_activation(p, rq, wake_flags);
- ttwu_stat(rq, p, cpu, wake_flags);
success = 1;
__task_rq_unlock(rq);
+
+ ttwu_stat(p, cpu, wake_flags);
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
put_cpu();
@@ -2527,7 +2529,7 @@ static void try_to_wake_up_local(struct
activate_task(rq, p, ENQUEUE_WAKEUP);
ttwu_post_activation(p, rq, 0);
- ttwu_stat(rq, p, smp_processor_id(), 0);
+ ttwu_stat(p, smp_processor_id(), 0);
out:
raw_spin_unlock(&p->pi_lock);
}
--
Typo? You just put it out of rq_lock. Thanks, --
No we cannot change stats on task_rq() since we don't hold any locks there, but since we have IRQs disabled we can change stats on the local rq. It changes the accounting slightly (instead of accounting the wakeup on the task's rq we account it on the waking cpu's rq) but that shouldn't That's the purpose of this patch, we need to be able to do ttwu_stat() without holding rq->lock. I should have written a changelog but it was either finish all fancy changelogs on an rfc series and post after the holidays or post before ;-) --
The ttwu_post_actication() does the core wakeup, it sets TASK_RUNNING
and performs wakeup-preemption, so give is a more descriptive name.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2399,8 +2399,11 @@ ttwu_stat(struct task_struct *p, int cpu
#endif /* CONFIG_SCHEDSTATS */
}
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
static void
-ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
trace_sched_wakeup(p, true);
check_preempt_curr(rq, p, wake_flags);
@@ -2492,7 +2495,7 @@ try_to_wake_up(struct task_struct *p, un
activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
out_running:
- ttwu_post_activation(p, rq, wake_flags);
+ ttwu_do_wakeup(rq, p, wake_flags);
success = 1;
__task_rq_unlock(rq);
@@ -2529,7 +2532,7 @@ static void try_to_wake_up_local(struct
if (!p->on_rq)
activate_task(rq, p, ENQUEUE_WAKEUP);
- ttwu_post_activation(p, rq, 0);
+ ttwu_do_wakeup(rq, p, 0);
ttwu_stat(p, smp_processor_id(), 0);
out:
raw_spin_unlock(&p->pi_lock);
--
Now that we've removed the rq->lock requirement from the first part of ttwu() and can compute placement without holding any rq->lock, ensure we execute the second half of ttwu() on the actual cpu we want the task to run on. This avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. As measured using: http://oss.oracle.com/~mason/sembench.c $ echo 4096 32000 64 128 > /proc/sys/kernel/sem $ ./sembench -t 2048 -w 1900 -o 0 unpatched: run time 30 seconds 537953 worker burns per second patched: run time 30 seconds 657336 worker burns per second Still need to sort out all the races marked XXX (non-trivial), and its x86 only for the moment. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- arch/x86/kernel/smp.c | 1 include/linux/sched.h | 2 kernel/sched.c | 143 ++++++++++++++++++++++++++++++++++++------------ kernel/sched_features.h | 2 4 files changed, 114 insertions(+), 34 deletions(-) Index: linux-2.6/arch/x86/kernel/smp.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/smp.c +++ linux-2.6/arch/x86/kernel/smp.c @@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_ /* * KVM uses this interrupt to force a cpu out of guest mode */ + sched_ttwu_pending(); } void smp_call_function_interrupt(struct pt_regs *regs) Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1020,6 +1020,7 @@ partition_sched_domains(int ndoms_new, c } #endif /* !CONFIG_SMP */ +void sched_ttwu_pending(void); struct io_context; /* See blkdev.h */ @@ -1201,6 +1202,7 @@ struct task_struct { int lock_depth; /* BKL lock depth */ #ifdef CONFIG_SMP + struct task_struct *wake_entry; int on_cpu; #endif int on_rq; Index: ...
---
Adds stats to see how stable the sembench results are, it does slow the
burn rate down, I guess because we're now touching the FPU and the
context switch overhead increases dramatically, but it does show its
relatively stable.
New output looks like:
# ./sembench -t 2048 -w 1900 -o 0 -r 30
main loop going
all done
2048 threads, waking 1900 at a time
using ipc sem operations
main thread burns: 2374
worker burn count total 4510600 min 1187 max 2374 avg 2202.441 +- 0.415%
run time 30 seconds 150353 worker burns per second
---
--- sembench.c 2010-04-12 20:45:50.000000000 +0200
+++ sembench3.c 2011-01-03 15:29:42.000000000 +0100
@@ -1,6 +1,6 @@
/*
* copyright Oracle 2007. Licensed under GPLv2
- * To compile: gcc -Wall -o sembench sembench.c -lpthread
+ * To compile: gcc -Wall -o sembench sembench.c -lpthread -lm
*
* usage: sembench -t thread count -w wakenum -r runtime -o op
* op can be: 0 (ipc sem) 1 (nanosleep) 2 (futexes)
@@ -28,8 +28,9 @@
#include <sys/time.h>
#include <sys/syscall.h>
#include <errno.h>
+#include <math.h>
-#define VERSION "0.2"
+#define VERSION "0.3"
/* futexes have been around since 2.5.something, but it still seems I
* need to make my own syscall. Sigh.
@@ -78,10 +79,55 @@
int *semid_lookup = NULL;
+struct stats
+{
+ double n, mean, M2;
+};
+
+static void update_stats(struct stats *stats, long long val)
+{
+ double delta;
+
+ stats->n++;
+ delta = val - stats->mean;
+ stats->mean += delta / stats->n;
+ stats->M2 += delta*(val - stats->mean);
+}
+
+static double avg_stats(struct stats *stats)
+{
+ return stats->mean;
+}
+
+/*
+ * http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+ *
+ * (\Sum n_i^2) - ((\Sum n_i)^2)/n
+ * s^2 = -------------------------------
+ * n - 1
+ *
+ * http://en.wikipedia.org/wiki/Stddev
+ *
+ * The std dev of the mean is related to the std dev by:
+ *
+ * s
+ * s_mean = -------
+ * sqrt(n)
+ *
+ ...This looks a bit suspicious. If this is called by sched_ttwu_pending() we are holding rq->lock, not task_rq_lock(). It seems, we can race with, say, migration thread running on task_cpu(). OK, p->state = TASK_WAKING protects us against, say, set_cpus_allowed_ptr() which does task_rq_lock(p) and thus checks task_is_waking(). what if that cpu does set_cpus_allowed_ptr(p) ? It spins with irq disabled. Once the caller, try_to_wake_up(), drops ->pi_lock it will wait for !task_is_waking() forever, no? Oleg. --
Ah, it appears I've already fixed that, let me clean up my current series and repost. --
I am not sure... Suppose that p was TASK_INTERRUPTIBLE and p->on_rq == 1 before, when set_cpus_allowed_ptr() was called. To simplify, suppose that the caller is preempted right after it drops p->pi_lock and before it does stop_one_cpu(migration_cpu_stop). After that p can complete chedule() and deactivate itself. Now, try_to_wake_up() can set TASK_WAKING, choose another CPU, and do ttwu_queue_remote(). Finally, the caller of set_cpus_allowed_ptr() resumes and schedules migration_cpu_stop. It is very possible I missed something, but what is the new locking rules for set_task_cpu() anyway? I mean, which rq->lock it needs? Oleg. --
But __migrate_task() will then find !p->on_rq and not actually do In the patches I just posted: set_task_cpu(p, cpu) callers need to either hold task_rq(p)->lock (the old rq) or p->pi_lock. --
But if it races with ttwu_do_activate() (which can hold another rq->lock != rq_src/rq_dest), it can check p->on_rq after activate_task() was already called. Yes, thanks, I already noticed v4, and at first glance 11/18 plus set_task_cpu() in try_to_wake_up() should close the problems... Oleg. --
In preparation of calling select_task_rq() without rq->lock held, drop
the dependency on the rq argument.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 3 +--
kernel/sched.c | 40 ++++++++++++++--------------------------
kernel/sched_fair.c | 2 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 10 +++++++++-
kernel/sched_stoptask.c | 3 +--
6 files changed, 27 insertions(+), 33 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1063,8 +1063,7 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2138,13 +2138,14 @@ static int migration_cpu_stop(void *data
* The task's runqueue lock must be held.
* Returns true if you have to wait for migration thread.
*/
-static bool migrate_task(struct task_struct *p, struct rq *rq)
+static bool need_migrate_task(struct task_struct *p)
{
/*
* If the task is not on a runqueue (and not running), then
* the next wake-up will properly place the task.
*/
- return p->on_rq || task_running(rq, p);
+ smp_rmb(); /* finish_lock_switch() */
+ return p->on_rq || p->on_cpu;
}
/*
@@ -2337,9 +2338,9 @@ static int select_fallback_rq(int cpu, s
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct ...If we drop rq_lock, need_migrate_task() maybe return true but p is already running on other cpu. Thus we do a wrong migration call. Thanks, --
Ah, the reason its here is that this patch removes the rq argument and thus we no longer need rq->lock. So this part relies on the property Yeah, too bad.. ;-) exec load balancing is more an optimistic thing anyway, if it got rebalanced under out feet we don't care. --
I don't understand this need_migrate_task() at all (with or without the patch). This task is current/running, it should always return T. I guess, migrate_task() was needed before to initialize migration_req. Oleg. --
But afaict you can call set_cpus_allowed_ptr() on !self. --
Ah, sorry for the confusion, I only meant sched_exec() case.
set_cpus_allowed_ptr() does need need_migrate_task(), of course.
As for set_cpus_allowed_ptr()->need_migrate_task() path, I have another
question,
static bool need_migrate_task(struct task_struct *p)
{
/*
* If the task is not on a runqueue (and not running), then
* the next wake-up will properly place the task.
*/
smp_rmb(); /* finish_lock_switch() */
return p->on_rq || p->on_cpu;
}
I don't understand this smp_rmb(). Yes, finish_lock_switch() does
wmb() before it clears ->on_cpu, but how these 2 barriers can pair?
In fact, I am completely confused. I do not understand why do we
check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
then this task is going to clear its on_cpu soon, once it finishes
context_switch().
Probably, this check was needed before, try_to_wake_up() could
activate the task_running() task without migrating. But, at first
glance, this is no longer possible after this series?
Oleg.
--
You mean the fact that I fouled up and didn't cross them (both are
It can still do that, I think the problem is us dropping rq->lock in the
middle of schedule(), when the freshly woken migration thread comes in
between there and moves the task away, you can get into the situation
that two cpus reference the same task_struct at the same time, which
usually leads to 'interesting' situations.
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2144,8 +2144,9 @@ static bool need_migrate_task(struct tas
* If the task is not on a runqueue (and not running), then
* the next wake-up will properly place the task.
*/
+ bool running = p->on_rq || p->on_cpu;
smp_rmb(); /* finish_lock_switch() */
- return p->on_rq || p->on_cpu;
+ return running;
}
/*
@@ -3416,7 +3417,7 @@ void sched_exec(void)
if (dest_cpu == smp_processor_id())
goto unlock;
- if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
+ if (likely(cpu_active(dest_cpu))) {
struct migration_arg arg = { p, dest_cpu };
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
--
Frr, brainfart, you cannot schedule the migration thread without completely finishing prev. /me goes ponder more --
Yeah, task_running() is not needed after patch 13 which may be the suitable place to poke :) Thanks, Yong -- Only stand for myself --
That and patch 6, which removes the false negatives from ->on_rq. --
What I mean is we could firstly add pi_lock in patch 7 and then remove the rq argument in this patch. :) Thanks, Yong -- Only stand for myself --
No, that's the wrong way around.. anyway, I've pulled this into its own patch and the next posting should hopefully be clearer. --
We need preempt_disable() to protect us against CPU hotplug. This task was never activated, it won't be found/migrated if that CPU goes away before we take task_rq_lock(). Oleg. --
Ah, indeed. I've replaced it by keeping IRQs disabled over the whole function, no need to restore and save them in the middle. --
Since we now have p->on_cpu unconditionally available, use it to
re-implement mutex_spin_on_owner.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mutex.h | 2 -
include/linux/sched.h | 2 -
kernel/mutex-debug.c | 2 -
kernel/mutex-debug.h | 2 -
kernel/mutex.c | 2 -
kernel/mutex.h | 2 -
kernel/sched.c | 83 +++++++++++++++++++-------------------------------
7 files changed, 39 insertions(+), 56 deletions(-)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
spinlock_t wait_lock;
struct list_head wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
- struct thread_info *owner;
+ struct task_struct *owner;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
return;
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
- DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+ DEBUG_LOCKS_WARN_ON(lock->owner != current);
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
mutex_clear_owner(lock);
}
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
static inline void mutex_set_owner(struct mutex *lock)
{
- lock->owner = current_thread_info();
+ lock->owner = current;
}
static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- ...Provide a generic p->on_rq because the p->se.on_rq semantics are
unfavourable for lockless wakeups but needed for sched_fair.
In particular, p->on_rq is only cleared when we actually dequeue the
task in schedule() and not on any random dequeue as done by things
like __migrate_task() and __sched_setscheduler().
This also allows us to remove p->se usage from !sched_fair code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 1 +
kernel/sched.c | 36 ++++++++++++++++++------------------
kernel/sched_debug.c | 2 +-
kernel/sched_rt.c | 10 +++++-----
kernel/sched_stoptask.c | 2 +-
5 files changed, 26 insertions(+), 25 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1200,6 +1200,7 @@ struct task_struct {
#ifdef CONFIG_SMP
int on_cpu;
#endif
+ int on_rq;
int prio, static_prio, normal_prio;
unsigned int rt_priority;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1759,7 +1759,6 @@ static void enqueue_task(struct rq *rq,
update_rq_clock(rq);
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, flags);
- p->se.on_rq = 1;
}
static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -1767,7 +1766,6 @@ static void dequeue_task(struct rq *rq,
update_rq_clock(rq);
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, flags);
- p->se.on_rq = 0;
}
/*
@@ -1780,6 +1778,7 @@ static void activate_task(struct rq *rq,
enqueue_task(rq, p, flags);
inc_nr_running(rq);
+ p->on_rq = 1;
}
/*
@@ -2070,7 +2069,7 @@ static void check_preempt_curr(struct rq
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back ...Seems we need on_rt_rq(&p->rt) here, otherwise we enqueue the task to pushable list when called from rt_mutex_setprio()/ __sched_setscheduler() etc. Thus add a little overhead. Though we call dequeue_pushable_task() in set_curr_task_rt() unconditionally. Thanks, Yong --
Since we now serialize ttwu() using p->pi_lock, we also need to
serialize ttwu_local() using that, otherwise, once we drop the
rq->lock from ttwu() it can race with ttwu_local().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2513,9 +2513,9 @@ static int try_to_wake_up(struct task_st
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
*
- * Put @p on the run-queue if it's not alredy there. The caller must
+ * Put @p on the run-queue if it's not alredy there. The caller must
* ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task. this_rq() stays locked over invocation.
+ * the current task.
*/
static void try_to_wake_up_local(struct task_struct *p)
{
@@ -2523,16 +2523,21 @@ static void try_to_wake_up_local(struct
BUG_ON(rq != this_rq());
BUG_ON(p == current);
- lockdep_assert_held(&rq->lock);
+
+ raw_spin_unlock(&rq->lock);
+ raw_spin_lock(&p->pi_lock);
+ raw_spin_lock(&rq->lock);
if (!(p->state & TASK_NORMAL))
- return;
+ goto out;
if (!p->on_rq)
activate_task(rq, p, ENQUEUE_WAKEUP);
ttwu_post_activation(p, rq, 0);
ttwu_stat(rq, p, smp_processor_id(), 0);
+out:
+ raw_spin_unlock(&p->pi_lock);
}
/**
@@ -3925,6 +3930,7 @@ pick_next_task(struct rq *rq)
*/
asmlinkage void __sched schedule(void)
{
+ struct task_struct *to_wakeup = NULL;
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq *rq;
@@ -3958,21 +3964,21 @@ asmlinkage void __sched schedule(void)
* task to maintain concurrency. If so, wake
* up the task.
*/
- if (prev->flags & PF_WQ_WORKER) {
- struct task_struct *to_wakeup;
-
+ if ...(add Tejun) I _think_ this is safe, this worker can't change cpu afaics. But --
Also available from (once the mirror catches up): git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-sched.git --- arch/x86/include/asm/spinlock.h | 26 ++- arch/x86/kernel/smp.c | 1 + include/linux/mutex.h | 2 +- include/linux/sched.h | 23 +- kernel/mutex-debug.c | 2 +- kernel/mutex-debug.h | 2 +- kernel/mutex.c | 2 +- kernel/mutex.h | 2 +- kernel/sched.c | 542 ++++++++++++++++++++++----------------- kernel/sched_debug.c | 2 +- kernel/sched_fair.c | 5 +- kernel/sched_features.h | 2 + kernel/sched_idletask.c | 2 +- kernel/sched_rt.c | 20 +- kernel/sched_stoptask.c | 5 +- 15 files changed, 371 insertions(+), 267 deletions(-) --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
| Michael Breuer |
