This is still a little buggy. I've been hitting a hang when running
dbench 100 in a loop. But I also hit that hang with vanilla -rt.
But I'm sure this still has some things that need to be straighten out.
Anyway, you can either apply the following patches or pull the code
from the listed git repo.
This still needs to be worked out, but since I'm taking off for the
rest of the year, I might as well show you where I ended up at.
The following patches are in:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt.git
branch: rt/tip/rt/devel
Lai Jiangshan (1):
rtmutex: Ensure only the top waiter or higher priority task can take the lock
Steven Rostedt (3):
rtmutex: Only save lock depth once in spin_slowlock
rtmutex: Try to take lock early in rt_spin_lock_slowlock()
rtmutex: Revert Optimize rt lock wakeup
----
kernel/futex.c | 22 +--
kernel/rt.c | 10 +-
kernel/rtmutex-debug.c | 1 -
kernel/rtmutex.c | 458 ++++++++++++++++++-----------------------------
kernel/rtmutex_common.h | 18 +--
5 files changed, 188 insertions(+), 321 deletions(-)
--
From: Steven Rostedt <srostedt@redhat.com>
The commit: rtmutex: Optimize rt lock wakeup
Does not do what it was suppose to do.
This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
before going into the loop. Thus, the test in wakeup_next_waiter()
will always fail on an adaptive waiter, as it only tests to see if
the pending waiter never has its state set ot TASK_RUNNING unless
something else had woke it up.
The smp_mb() added to make this test work is just as expensive as
just calling wakeup. And since we we fail to wake up anyway, we are
doing both a smp_mb() and wakeup as well.
I tested this with dbench and we run faster without this patch.
I also tried a variant that instead fixed the loop, to change the state
only if the spinner was to go to sleep, and that still did not show
any improvement.
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Morreale <pmorreale@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/rtmutex.c | 29 ++---------------------------
1 files changed, 2 insertions(+), 27 deletions(-)
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..e218873 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
*/
if (!savestate)
wake_up_process(pendowner);
- else {
- /*
- * We can skip the actual (expensive) wakeup if the
- * waiter is already running, but we have to be careful
- * of race conditions because they may be about to sleep.
- *
- * The waiter-side protocol has the following pattern:
- * 1: Set state != RUNNING
- * 2: Conditionally sleep if waiter->task != NULL;
- *
- * And the owner-side has the following:
- * A: Set waiter->task = NULL
- * B: Conditionally wake if the state != RUNNING
- *
- * As long as we ensure 1->2 order, and A->B order, we
- * will never miss a wakeup.
- *
- * Therefore, this barrier ensures that waiter->task = ...Just a quick note to say I am a bit skeptical of this patch. I know you are offline next week, so lets plan on hashing it out after the new year before I ack it. Happy holidays! --
Sure, but as I said, it is mostly broken anyway. I could even insert some tracepoints to show that this is always missed (heck I'll add an unlikely and do the branch profiler ;-) The reason is that adaptive spinners spin in some other state than TASK_RUNNING, thus it does not help adaptive spinners at all. I first tried to fix that, but it made dbench run even slower. But I only did a few tests, and only on a 4 CPU box, so it was a rather small sample. The removal of the code had to deal with more that it was already broken than anything else. But yeah, we can hash this out in the new year. This is one of the You too! -- Steve --
>>> On 12/23/2010 at 11:54 PM, in message <1293166464.22802.415.camel@gandalf.stny.rr.com>, Steven Rostedt This is why I am skeptical. You are essentially asserting there are two issues here, IIUC: 1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb() plus the wakeup() anyway. 2) mb() is apparently slower than wakeup(). I agree (1) is plausible, though I would like to see the traces to confirm. Its been a long time since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was inadvertently broken in the mean time or the other cpu would have transitioned to RUNNING on its own when we flipped the owner before the release-side check was performed. Or perhaps we just plain screwed this up and it was racy ;) I'm not sure. But as Peter (M) stated, it seems like a shame to walk away from the concept without further investigation. I think everyone can agree that at the very least, if it is in fact taking a double whammy we should fix that. For (2), I am skeptical in two parts ;). You stated you thought mb() was just as expensive as a wakeup which seems suspect to me, given a wakeup needs to be a superset of a barrier II[R|U]C. Lets call this "2a". In addition, your results when you removed the logic and went straight to a wakeup() and found dbench actually was faster than the "fixed mb()" path would imply wakeup() is actually _faster_ than mb(). Lets call this "2b". For (2a), I would like to see some traces that compare mb() to wakeup() (of a presumably already running task that happens in the INTERRUPTIBLE state) to be convinced that wakeup() is equal/faster. I suspect it isn't For (2b), I would suggest that we don't rely on dbench alone in evaluating the merit of the change. In some ways, its a great test for this type of change since it leans heavily on the coarse VFS locks. However, dbench is also pretty odd and thrives on somewhat chaotic behavior. For instance, it loves the "lateral ...
OK, here it is, after running a simple "dbench 10":
correct incorrect % Function File Line
------- --------- - -------- ---- ----
150 726979 99 wakeup_next_waiter rtmutex.c 581
And the patch I added:
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..dacdcb6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -578,7 +578,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
smp_mb();
/* If !RUNNING && !RUNNING_MUTEX */
- if (pendowner->state & ~TASK_RUNNING_MUTEX)
+ if (unlikely(pendowner->state & ~TASK_RUNNING_MUTEX))
wake_up_process_mutex(pendowner);
}
Interesting that we hit 150 times that the new owner was already awake.
Perhaps it was because the new owner was about to spin on a spinlock
after it had put itself into the TASK_INTERRUPTIBLE state, and it was
Forget this point, as #1 seems to be the main issue. Also, I'm not sure
[ cut out all the 2's since I'm not worried about that now, even if it
is not a problem. ]
Now, the way I was going to fix this is to have the adaptive wait be in
TASK_RUNNING state, and then change the state iff we are going to sleep.
But this can be a bit more race. Especially with Lai's new changes. With
the new changes, the waiter->task does not get nulled anymore.
The test to take the lock, now needs to be under the lock->wait_lock
spinlock. We have to grab that lock, and see if the owner is null, and
that we are the next waiter in the queue. Thus, on taking the lock we
need to have something like this:
if (adaptive_wait(&waiter, orig_owner))
sleep = 1;
else
sleep = 0;
if (sleep)
raw_spin_lock(&lock->wait_lock);
saved_state = rt_set_current_block_state(saved_state);
if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
sleep = 0;
raw_spin_unlock(&lock->wait_lock);
if (sleep)
schedule_rt_mutex(lock);
saved_state ...I may be able to remove the above locks and replace it with: saved_state = rt_set_current_blocked_state(saved_state); if (orig_owner == rt_mutex_owner(lock)) schedule_rt_mutex(lock); --
Isn't it possible to miss a wakeup here if the waiter becomes preempted? Recall that adaptive wait is a preemptive wait. Hence the (I believe) original reason we did the adaptive spin in a (transitioning) sleep state. --
Yes it is a preemptive wait, I would not have accepted the patches if it was anything else. But preemption is not affected by the state of the task. A task could be TASK_RUNNING or TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, and that would not affect how it acts when it is preempted. -- Steve --
My bad. I thought preemption did change task state. This still requires the owner to run through try_to_wake_up() and all its associated overhead only to find out that the waiter is running. The assumption I made when I suggested the original concept to Greg was that if the new owner is running, there is *nothing* to do wrt scheduling. If that was a wrong assumption, then, yes, drop the patch and clean things up. If that was a good assumption, then we are leaving 'cycles on the table' as waking up a running process is a non-zero-overhead path and that is a bad thing considering how many times spin_unlock() is called on an rt system. Bear in mind that this savings scales directly as the number of CPUs (assuming all are vectored on the lock). We can only have nr_cpus-1 spinning waiters at any given time, regardless of the number of tasks in contention. Perhaps this is too little to worry about on a 4way system, but I suspect that it could be substantial on larger systems. I'll be quiet now as I know little about the intricacies of preemption/scheduling (obviously) and like Greg, have been removed from RT kernel work for several years. <sigh> -PWM --
No need to be quiet ;-) I'm working on making it spin in TASK_RUNNING state if possible, but it is making the code a bit more complex, as it seems that there is an assumption with the wakeup and the changing of the current->state in the rt_spin_lock_slowlock code all being under the lock->wait_lock. I think I'll scrap this idea. That said, I think your wakeup patch may be worth while with Lai's new code. His changes causes the owner to wake up the pending owner several times, because the pending owner is never removed from the lock wait_list. If a high prio task grabs and releases the same lock over and over, if there is a waiter it will try to wake up that waiter each time. Thus, having your patch may prevent that unnecessary wakeup. I'll look more into it. Thanks! -- Steve --
Oooohhhh I wonder if that would enable 'lock-stealing' for FIFO? IIRC, you came up with a ping-pong scenario that prevented its use there. --
I guess you mean literal FIFO order and not SCHED_FIFO. But, if I understand you, it would allow for this. If a process has a lock and two processes are blocked on it. When the first releases the lock it gives it to the higher of the waiters. But if a high prio task comes a long, it will steal that lock away, and this woken task will need to race to get it from the other lower prio task that comes along. And yes, if that other task is the same prio, we reverse the FIFO order here. Lai's patch prevents this. Either the woken task gets the lock or a higher prio task does. No mixing around the order of taking the lock here. The better side effect of Lai's patch is that it cleans up the code quite a bit, which is what I really want, as we start to introduce Probably. -- Steve --
I was referring to one of the original patches that allowing an incoming task of similar priority to 'steal' the lock over a pending waiter. If the owner had been cleared and the pending waiter not yet given ownership, the incoming task could steal the lock. IIRC, you had found a case wrt SCHED_FIFO tasks that could negatively impact performance/throughput in some way and it was restricted to non-SCHED_FIFO tasks. This was just an idle question, it sounds like RT is moving towards (and perhaps already there) synchronized lock access. --
My bad. I thought preemption did change task state. This still requires the owner to run through try_to_wake_up() and all its associated overhead only to find out that the waiter is running. The assumption I made when I suggested the original concept to Greg was that if the new owner is running, there is *nothing* to do wrt scheduling. If that was a wrong assumption, then, yes, drop the patch and clean things up. If that was a good assumption, then we are leaving 'cycles on the table' as waking up a running process is a non-zero-overhead path and that is a bad thing considering how many times spin_unlock() is called on an rt system. Bear in mind that this savings scales directly as the number of CPUs (assuming all are vectored on the lock). We can only have nr_cpus-1 spinning waiters at any given time, regardless of the number of tasks in contention. Perhaps this is too little to worry about on a 4way system, but I suspect that it could be substantial on larger systems. I'll be quiet now as I know little about the intricacies of preemption/scheduling (obviously) and like Greg, have been removed from RT kernel work for several years. <sigh> -PWM --
We shouldn't be too quick to merely rip this out without a little thinking. Clearly this is broken, however the intent was clear. That being that if a waiter is spinning, we don't need to wake it up. The wake up path is substantially more than a barrier; it includes a barrier as well as grabbing the task_rq_lock only to find that the task is oncpu. Then various accounting is updated, etc. We know definitively that a waiter can only spin if the owner is oncpu, by definition of adaptive spinning. We also know that only the owner can release the lock to a waiter (spinning or not). So it seems clear that avoiding unnecessary contention on the rq lock would be a Good Thing(tm). Perhaps this cannot be done safely, but if you saw an improvement in dbench by merely removing a barrier, imagine the improvement by removing the contention on the lock. Happy Holidays to all! --
From: Steven Rostedt <srostedt@redhat.com>
The task that enters the rt_spin_lock_slowlock() will not take or
release the kernel lock again until it leaves the fuction.
There's no reason that we need to save and restore the lock depth
for each iteration of the try_lock loop.
Just save and reset the lock depth before entering the loop
and restore it upon exit.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/rtmutex.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 348b1e7..f0ce334 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -821,6 +821,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct rt_mutex_waiter waiter;
unsigned long saved_state, flags;
struct task_struct *orig_owner;
+ int saved_lock_depth;
debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -843,8 +844,14 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
*/
saved_state = rt_set_current_blocked_state(current->state);
+ /*
+ * Prevent schedule() to drop BKL, while waiting for
+ * the lock ! We restore lock_depth when we come back.
+ */
+ saved_lock_depth = current->lock_depth;
+ current->lock_depth = -1;
+
for (;;) {
- int saved_lock_depth = current->lock_depth;
/* Try to acquire the lock */
if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL))
@@ -863,11 +870,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
- /*
- * Prevent schedule() to drop BKL, while waiting for
- * the lock ! We restore lock_depth when we come back.
- */
- current->lock_depth = -1;
orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
@@ -883,9 +885,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
put_task_struct(orig_owner);
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- current->lock_depth = saved_lock_depth;
saved_state = ...From: Lai Jiangshan <laijs@cn.fujitsu.com> In current rtmutex, the pending owner may be boosted by the tasks in the rtmutex's waitlist when the pending owner is deboosted or a task in the waitlist is boosted. This boosting is unrelated, because the pending owner does not really take the rtmutex. It is not reasonable. Example. time1: A(high prio) onwers the rtmutex. B(mid prio) and C (low prio) in the waitlist. time2 A release the lock, B becomes the pending owner A(or other high prio task) continues to run. B's prio is lower than A, so B is just queued at the runqueue. time3 A or other high prio task sleeps, but we have passed some time The B and C's prio are changed in the period (time2 ~ time3) due to boosting or deboosting. Now C has the priority higher than B. ***Is it reasonable that C has to boost B and help B to get the rtmutex? NO!! I think, it is unrelated/unneed boosting before B really owns the rtmutex. We should give C a chance to beat B and win the rtmutex. This is the motivation of this patch. This patch *ensures* only the top waiter or higher priority task can take the lock. How? 1) we don't dequeue the top waiter when unlock, if the top waiter is changed, the old top waiter will fail and go to sleep again. 2) when requiring lock, it will get the lock when the lock is not taken and: there is no waiter OR higher priority than waiters OR it is top waiter. 3) In any time, the top waiter is changed, the top waiter will be woken up. The algorithm is much simpler than before, no pending owner, no boosting for pending owner. Other advantage of this patch: 1) The states of a rtmutex are reduced a half, easier to read the code. 2) the codes become shorter. 3) top waiter is not dequeued until it really take the lock: they will retain FIFO when it is stolen. Not advantage nor disadvantage 1) Even we may wakeup multiple waiters(any time when top waiter changed), we hardly cause "thundering herd", the number of wokenup task is ...
I'm still hitting some bugs with the port to -rt, but I also noticed something that doesn't look too good. There's several places in the kernel where a task may release and acquire the same lock multiple times in a row. The old way of removing the pending owner from the lists and waking it up once, would have the high prio task wake it up once, and then it can grab the locks multiple times without modifying the list, since the pending owner is already awake and not in the pi list anymore. The new way has the owner remove the woken task from its pi list and wakes it up, but when it steals the lock again, it adds this owner back to its pi list. When it releases the lock, it wakes it up again and removes it from its pi list again. This happens over and over again. Running dbench, I see this happen (probably on one of the dcache locks) literally a 100 times in a row. I'm a bit nervous to add this overhead. I'm still working in debug mode so I'm not at a point to do real benchmarks yet. -- Steve --
From: Steven Rostedt <srostedt@redhat.com>
Try to take the lock again as soon as we go into the
rt_spin_lock_slowlock() code before doing the setup and wait loop.
This makes the code closer to what the rt_mutex_slowlock() does,
which will help in simplifying this code in later commits.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/rtmutex.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index f0ce334..318d7ed 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -829,6 +829,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
raw_spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
+ if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) {
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ return;
+ }
+
BUG_ON(rt_mutex_owner(lock) == current);
/*
--
1.7.2.3
--
