Re: [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock

Previous thread: [GIT PULL] ext4 regression fix by Theodore Ts'o on Thursday, December 23, 2010 - 3:44 pm. (1 message)

Next thread: Re: 2.6.37-rc7: no more shutdown on DELL E6400 by Alan Stern on Thursday, December 23, 2010 - 3:57 pm. (5 messages)
From: Steven Rostedt
Date: Thursday, December 23, 2010 - 3:47 pm

This is still a little buggy. I've been hitting a hang when running
dbench 100 in a loop. But I also hit that hang with vanilla -rt.

But I'm sure this still has some things that need to be straighten out.
Anyway, you can either apply the following patches or pull the code
from the listed git repo.

This still needs to be worked out, but since I'm taking off for the
rest of the year, I might as well show you where I ended up at.

The following patches are in:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt.git

    branch: rt/tip/rt/devel


Lai Jiangshan (1):
      rtmutex: Ensure only the top waiter or higher priority task can take the lock

Steven Rostedt (3):
      rtmutex: Only save lock depth once in spin_slowlock
      rtmutex: Try to take lock early in rt_spin_lock_slowlock()
      rtmutex: Revert Optimize rt lock wakeup

----
 kernel/futex.c          |   22 +--
 kernel/rt.c             |   10 +-
 kernel/rtmutex-debug.c  |    1 -
 kernel/rtmutex.c        |  458 ++++++++++++++++++-----------------------------
 kernel/rtmutex_common.h |   18 +--
 5 files changed, 188 insertions(+), 321 deletions(-)
--

From: Steven Rostedt
Date: Thursday, December 23, 2010 - 3:47 pm

From: Steven Rostedt <srostedt@redhat.com>

The commit: rtmutex: Optimize rt lock wakeup

Does not do what it was suppose to do.
This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
before going into the loop. Thus, the test in wakeup_next_waiter()
will always fail on an adaptive waiter, as it only tests to see if
the pending waiter never has its state set ot TASK_RUNNING unless
something else had woke it up.

The smp_mb() added to make this test work is just as expensive as
just calling wakeup. And since we we fail to wake up anyway, we are
doing both a smp_mb() and wakeup as well.

I tested this with dbench and we run faster without this patch.
I also tried a variant that instead fixed the loop, to change the state
only if the spinner was to go to sleep, and that still did not show
any improvement.

Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Morreale <pmorreale@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |   29 ++---------------------------
 1 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..e218873 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 */
 	if (!savestate)
 		wake_up_process(pendowner);
-	else {
-		/*
-		 * We can skip the actual (expensive) wakeup if the
-		 * waiter is already running, but we have to be careful
-		 * of race conditions because they may be about to sleep.
-		 *
-		 * The waiter-side protocol has the following pattern:
-		 * 1: Set state != RUNNING
-		 * 2: Conditionally sleep if waiter->task != NULL;
-		 *
-		 * And the owner-side has the following:
-		 * A: Set waiter->task = NULL
-		 * B: Conditionally wake if the state != RUNNING
-		 *
-		 * As long as we ensure 1->2 order, and A->B order, we
-		 * will never miss a wakeup.
-		 *
-		 * Therefore, this barrier ensures that waiter->task = ...
From: Gregory Haskins
Date: Thursday, December 23, 2010 - 9:45 pm

Just a quick note to say I am a bit skeptical of this patch.  I know you are offline next week, so lets plan on hashing it out after the new year before I ack it.

Happy holidays!




--

From: Steven Rostedt
Date: Thursday, December 23, 2010 - 9:54 pm

Sure, but as I said, it is mostly broken anyway. I could even insert
some tracepoints to show that this is always missed (heck I'll add an
unlikely and do the branch profiler ;-)

The reason is that adaptive spinners spin in some other state than
TASK_RUNNING, thus it does not help adaptive spinners at all. I first
tried to fix that, but it made dbench run even slower. But I only did a
few tests, and only on a 4 CPU box, so it was a rather small sample. The
removal of the code had to deal with more that it was already broken
than anything else.

But yeah, we can hash this out in the new year. This is one of the

You too!

-- Steve


--

From: Gregory Haskins
Date: Tuesday, December 28, 2010 - 7:06 am

>>> On 12/23/2010 at 11:54 PM, in message
<1293166464.22802.415.camel@gandalf.stny.rr.com>, Steven Rostedt


This is why I am skeptical.  You are essentially asserting there are two issues here, IIUC:

1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
plus the wakeup() anyway.

2) mb() is apparently slower than wakeup().

I agree (1) is plausible, though I would like to see the traces to confirm.  Its been a long time
since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
inadvertently broken in the mean time or the other cpu would have transitioned to RUNNING on
its own when we flipped the owner before the release-side check was performed.  Or perhaps
we just plain screwed this up and it was racy ;)  I'm not sure.  But as Peter (M) stated, it seems
like a shame to walk away from the concept without further investigation.  I think everyone can
agree that at the very least, if it is in fact taking a double whammy we should fix that.

For (2), I am skeptical in two parts ;).  You stated you thought mb() was just as expensive as a
wakeup which seems suspect to me, given a wakeup needs to be a superset of a barrier
II[R|U]C.  Lets call this "2a".  In addition, your results when you removed the logic and went 
straight to a wakeup() and found dbench actually was faster than the "fixed mb()" path would 
imply wakeup() is actually _faster_ than mb().  Lets call this "2b".

For (2a), I would like to see some traces that compare mb() to wakeup() (of a presumably 
already running task that happens in the INTERRUPTIBLE state) to be convinced that wakeup() is 
equal/faster.  I suspect it isn't

For (2b), I would suggest that we don't rely on dbench alone in evaluating the merit of the 
change.  In some ways, its a great test for this type of change since it leans heavily on the coarse 
VFS locks.  However, dbench is also pretty odd and thrives on somewhat chaotic behavior.  For 
instance, it loves the "lateral ...
From: Steven Rostedt
Date: Monday, January 3, 2011 - 12:06 pm

OK, here it is, after running a simple "dbench 10":

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
     150   726979  99 wakeup_next_waiter             rtmutex.c            581

And the patch I added:

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..dacdcb6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -578,7 +578,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 		smp_mb();
 
 		/* If !RUNNING && !RUNNING_MUTEX */
-		if (pendowner->state & ~TASK_RUNNING_MUTEX)
+		if (unlikely(pendowner->state & ~TASK_RUNNING_MUTEX))
 			wake_up_process_mutex(pendowner);
 	}
 

Interesting that we hit 150 times that the new owner was already awake.
Perhaps it was because the new owner was about to spin on a spinlock
after it had put itself into the TASK_INTERRUPTIBLE state, and it was


Forget this point, as #1 seems to be the main issue. Also, I'm not sure

[ cut out all the 2's since I'm not worried about that now, even if it 
  is not a problem. ]


Now, the way I was going to fix this is to have the adaptive wait be in
TASK_RUNNING state, and then change the state iff we are going to sleep.

But this can be a bit more race. Especially with Lai's new changes. With
the new changes, the waiter->task does not get nulled anymore.

The test to take the lock, now needs to be under the lock->wait_lock
spinlock. We have to grab that lock, and see if the owner is null, and
that we are the next waiter in the queue. Thus, on taking the lock we
need to have something like this:


	if (adaptive_wait(&waiter, orig_owner))
		sleep = 1;
	else
		sleep = 0;

	if (sleep)
		raw_spin_lock(&lock->wait_lock);
		saved_state = rt_set_current_block_state(saved_state);
		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
			sleep = 0;
		raw_spin_unlock(&lock->wait_lock);
		if (sleep)
			schedule_rt_mutex(lock);
		saved_state ...
From: Steven Rostedt
Date: Monday, January 3, 2011 - 1:22 pm

I may be able to remove the above locks and replace it with:

	saved_state = rt_set_current_blocked_state(saved_state);
	if (orig_owner == rt_mutex_owner(lock))
		schedule_rt_mutex(lock);



--

From: Peter W. Morreale
Date: Tuesday, January 4, 2011 - 8:19 am

Isn't it possible to miss a wakeup here if the waiter becomes preempted?

Recall that adaptive wait is a preemptive wait.  Hence the (I believe)
original reason we did the adaptive spin in a (transitioning)  sleep
state.



--

From: Steven Rostedt
Date: Tuesday, January 4, 2011 - 8:47 am

Yes it is a preemptive wait, I would not have accepted the patches if it
was anything else. But preemption is not affected by the state of the
task. A task could be TASK_RUNNING or TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE, and that would not affect how it acts when it is
preempted.

-- Steve


--

From: Peter W. Morreale
Date: Tuesday, January 4, 2011 - 10:15 am

My bad.  I thought preemption did change task state.

This still requires the owner to run through try_to_wake_up() and all
its associated overhead only to find out that the waiter is running.  

The assumption I made when I suggested the original concept to Greg was
that if the new owner is running, there is *nothing* to do wrt
scheduling.  If that was a wrong assumption, then, yes, drop the patch
and clean things up.  

If that was a good assumption, then we are leaving 'cycles on the table'
as waking up a running process is a non-zero-overhead path and that is a
bad thing considering how many times spin_unlock() is called on an rt
system.

Bear in mind that this savings scales directly as the number of CPUs
(assuming all are vectored on the lock).  We can only have nr_cpus-1
spinning waiters at any given time, regardless of the number of tasks in
contention.  Perhaps this is too little to worry about on a 4way system,
but I suspect that it could be substantial on larger systems.  

I'll be quiet now as I know little about the intricacies of
preemption/scheduling (obviously) and like Greg, have been removed from
RT kernel work for several years. <sigh>

-PWM




--

From: Steven Rostedt
Date: Tuesday, January 4, 2011 - 10:27 am

No need to be quiet ;-)

I'm working on making it spin in TASK_RUNNING state if possible, but it
is making the code a bit more complex, as it seems that there is an
assumption with the wakeup and the changing of the current->state in the
rt_spin_lock_slowlock code all being under the lock->wait_lock. I think
I'll scrap this idea.

That said, I think your wakeup patch may be worth while with Lai's new
code. His changes causes the owner to wake up the pending owner several
times, because the pending owner is never removed from the lock
wait_list. If a high prio task grabs and releases the same lock over and
over, if there is a waiter it will try to wake up that waiter each time.

Thus, having your patch may prevent that unnecessary wakeup.

I'll look more into it. Thanks!

-- Steve



--

From: Peter W. Morreale
Date: Tuesday, January 4, 2011 - 10:45 am

Oooohhhh  I wonder if that would enable 'lock-stealing' for FIFO?

IIRC, you came up with a ping-pong scenario that prevented its use
there.



--

From: Steven Rostedt
Date: Tuesday, January 4, 2011 - 11:06 am

I guess you mean literal FIFO order and not SCHED_FIFO.

But, if I understand you, it would allow for this. If a process has a
lock and two processes are blocked on it. When the first releases the
lock it gives it to the higher of the waiters. But if a high prio task
comes a long, it will steal that lock away, and this woken task will
need to race to get it from the other lower prio task that comes along.
And yes, if that other task is the same prio, we reverse the FIFO order
here.

Lai's patch prevents this. Either the woken task gets the lock or a
higher prio task does. No mixing around the order of taking the lock
here. The better side effect of Lai's patch is that it cleans up the
code quite a bit, which is what I really want, as we start to introduce

Probably.

-- Steve


--

From: Peter W. Morreale
Date: Tuesday, January 4, 2011 - 1:48 pm

I was referring to one of the original patches that allowing an incoming
task of similar priority to 'steal' the lock over a pending waiter.  If
the owner had been cleared and the pending waiter not yet given
ownership, the incoming task could steal the lock.  

IIRC, you had found a case wrt SCHED_FIFO tasks that could negatively
impact performance/throughput in some way and it was restricted to
non-SCHED_FIFO tasks. 

This was just an idle question, it sounds like RT is moving towards (and
perhaps already there) synchronized lock access.  



--

From: Peter W. Morreale
Date: Tuesday, January 4, 2011 - 10:24 am

My bad.  I thought preemption did change task state.

This still requires the owner to run through try_to_wake_up() and all
its associated overhead only to find out that the waiter is running.  

The assumption I made when I suggested the original concept to Greg was
that if the new owner is running, there is *nothing* to do wrt
scheduling.  If that was a wrong assumption, then, yes, drop the patch
and clean things up.  

If that was a good assumption, then we are leaving 'cycles on the table'
as waking up a running process is a non-zero-overhead path and that is a
bad thing considering how many times spin_unlock() is called on an rt
system.

Bear in mind that this savings scales directly as the number of CPUs
(assuming all are vectored on the lock).  We can only have nr_cpus-1
spinning waiters at any given time, regardless of the number of tasks in
contention.  Perhaps this is too little to worry about on a 4way system,
but I suspect that it could be substantial on larger systems.  

I'll be quiet now as I know little about the intricacies of
preemption/scheduling (obviously) and like Greg, have been removed from
RT kernel work for several years. <sigh>

-PWM





--

From: Peter W. Morreale
Date: Friday, December 24, 2010 - 8:47 am

We shouldn't be too quick to merely rip this out without a little
thinking.  Clearly this is broken, however the intent was clear.  

That being that if a waiter is spinning, we don't need to wake it up. 

The wake up path is substantially more than a barrier; it includes a
barrier as well as grabbing the task_rq_lock only to find that the task
is oncpu. Then various accounting is updated, etc. 

We know definitively that a waiter can only spin if the owner is oncpu,
by definition of adaptive spinning.  We also know that only the owner
can release the lock to a waiter (spinning or not).   So it seems clear
that avoiding unnecessary contention on the rq lock would be a Good
Thing(tm). 

Perhaps this cannot be done safely, but if you saw an improvement in
dbench by merely removing a barrier, imagine the improvement by removing
the contention on the lock.

Happy Holidays to all!



--

From: Steven Rostedt
Date: Thursday, December 23, 2010 - 3:47 pm

From: Steven Rostedt <srostedt@redhat.com>

The task that enters the rt_spin_lock_slowlock() will not take or
release the kernel lock again until it leaves the fuction.
There's no reason that we need to save and restore the lock depth
for each iteration of the try_lock loop.

Just save and reset the lock depth before entering the loop
and restore it upon exit.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 348b1e7..f0ce334 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -821,6 +821,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	struct rt_mutex_waiter waiter;
 	unsigned long saved_state, flags;
 	struct task_struct *orig_owner;
+	int saved_lock_depth;
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;
@@ -843,8 +844,14 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 */
 	saved_state = rt_set_current_blocked_state(current->state);
 
+	/*
+	 * Prevent schedule() to drop BKL, while waiting for
+	 * the lock ! We restore lock_depth when we come back.
+	 */
+	saved_lock_depth = current->lock_depth;
+	current->lock_depth = -1;
+
 	for (;;) {
-		int saved_lock_depth = current->lock_depth;
 
 		/* Try to acquire the lock */
 		if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL))
@@ -863,11 +870,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 				continue;
 		}
 
-		/*
-		 * Prevent schedule() to drop BKL, while waiting for
-		 * the lock ! We restore lock_depth when we come back.
-		 */
-		current->lock_depth = -1;
 		orig_owner = rt_mutex_owner(lock);
 		get_task_struct(orig_owner);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
@@ -883,9 +885,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 			put_task_struct(orig_owner);
 
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
-		current->lock_depth = saved_lock_depth;
 		saved_state = ...
From: Steven Rostedt
Date: Thursday, December 23, 2010 - 3:47 pm

From: Lai Jiangshan <laijs@cn.fujitsu.com>

In current rtmutex, the pending owner may be boosted by the tasks
in the rtmutex's waitlist when the pending owner is deboosted
or a task in the waitlist is boosted. This boosting is unrelated,
because the pending owner does not really take the rtmutex.
It is not reasonable.

Example.

time1:
A(high prio) onwers the rtmutex.
B(mid prio) and C (low prio) in the waitlist.

time2
A release the lock, B becomes the pending owner
A(or other high prio task) continues to run. B's prio is lower
than A, so B is just queued at the runqueue.

time3
A or other high prio task sleeps, but we have passed some time
The B and C's prio are changed in the period (time2 ~ time3)
due to boosting or deboosting. Now C has the priority higher
than B. ***Is it reasonable that C has to boost B and help B to
get the rtmutex?

NO!! I think, it is unrelated/unneed boosting before B really
owns the rtmutex. We should give C a chance to beat B and
win the rtmutex.

This is the motivation of this patch. This patch *ensures*
only the top waiter or higher priority task can take the lock.

How?
1) we don't dequeue the top waiter when unlock, if the top waiter
   is changed, the old top waiter will fail and go to sleep again.
2) when requiring lock, it will get the lock when the lock is not taken and:
   there is no waiter OR higher priority than waiters OR it is top waiter.
3) In any time, the top waiter is changed, the top waiter will be woken up.

The algorithm is much simpler than before, no pending owner, no
boosting for pending owner.

Other advantage of this patch:
1) The states of a rtmutex are reduced a half, easier to read the code.
2) the codes become shorter.
3) top waiter is not dequeued until it really take the lock:
   they will retain FIFO when it is stolen.

Not advantage nor disadvantage
1) Even we may wakeup multiple waiters(any time when top waiter changed),
   we hardly cause "thundering herd",
   the number of wokenup task is ...
From: Steven Rostedt
Date: Monday, January 3, 2011 - 9:02 pm

I'm still hitting some bugs with the port to -rt, but I also noticed
something that doesn't look too good.

There's several places in the kernel where a task may release and
acquire the same lock multiple times in a row.

The old way of removing the pending owner from the lists and waking it
up once, would have the high prio task wake it up once, and then it can
grab the locks multiple times without modifying the list, since the
pending owner is already awake and not in the pi list anymore.

The new way has the owner remove the woken task from its pi list and
wakes it up, but when it steals the lock again, it adds this owner back
to its pi list. When it releases the lock, it wakes it up again and
removes it from its pi list again. This happens over and over again.

Running dbench, I see this happen (probably on one of the dcache locks)
literally a 100 times in a row.

I'm a bit nervous to add this overhead. I'm still working in debug mode
so I'm not at a point to do real benchmarks yet.

-- Steve


--

From: Steven Rostedt
Date: Thursday, December 23, 2010 - 3:47 pm

From: Steven Rostedt <srostedt@redhat.com>

Try to take the lock again as soon as we go into the
rt_spin_lock_slowlock() code before doing the setup and wait loop.
This makes the code closer to what the rt_mutex_slowlock() does,
which will help in simplifying this code in later commits.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index f0ce334..318d7ed 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -829,6 +829,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
+	if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) {
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+		return;
+	}
+
 	BUG_ON(rt_mutex_owner(lock) == current);
 
 	/*
-- 
1.7.2.3


--

Previous thread: [GIT PULL] ext4 regression fix by Theodore Ts'o on Thursday, December 23, 2010 - 3:44 pm. (1 message)

Next thread: Re: 2.6.37-rc7: no more shutdown on DELL E6400 by Alan Stern on Thursday, December 23, 2010 - 3:57 pm. (5 messages)