Re: [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()

Previous thread: [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers by Michel Lespinasse on Monday, May 17, 2010 - 3:25 pm. (1 message)

Next thread: [GIT PULL] Please pull NFS client changes by Trond Myklebust on Monday, May 17, 2010 - 3:28 pm. (2 messages)
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

This is version 3 of my rwsem changes. Patches 7 and 10 were modified to
address Linus's comments about the API. Please consider for merging.

Changes since V2:

- Rebased to 2.6.34

- Changed patch 07 to address Linus's comments about the API.
  down_read_critical() and up_read_critical() now work as a pair; threads
  using this are allowed to skip over blocked threads when acquiring the
  read lock; however they must make sure to quickly release that lock and
  in particular, they are forbidden to block.

- Changed patch 10 to make use of the down_read_critical()/up_read_critical()
  API when accessing /proc/<pid>/exe and /proc/<pid>/maps files.
  I excluded smaps and numa_maps files, which can actually block while
  being generated (smaps blocks in smaps_pte_range() doing a cond_resched(),
  which seems legitimate as it's a potentially heavy operation. numa_maps
  blocks in show_numa_map() doing a bzalloc of struct numa_maps, which
  should probably get done in do_maps_open() instead).

The motivation for this change was some cluster monitoring software we
use at google; which reads /proc/<pid>/maps files for all running
processes. When the machines are under load, the mmap_sem is often
acquire for reads for long periods of time since do_page_fault() holds
it while doing disk accesses; and fair queueing behavior often ends up
in the monitoring software making little progress. By introducing
unfair behavior in a few selected places, are are able to let the
monitoring software make progress without impacting performance for
the rest of the system. I've made sure not to change the rwsem fast
paths in implementing this proposal.

Michel Lespinasse (10):
  x86 rwsem: minor cleanups
  rwsem: fully separate code pathes to wake writers vs readers
  rwsem: lighter active count checks when waking up readers
  rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  rwsem: wake queued readers when writer blocks on active read lock
  rwsem: smaller wrappers around ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

In __rwsem_do_wake(), we can skip the active count check unless we come
there from up_xxxx(). Also when checking the active count, it is not
actually necessary to increment it; this allows us to get rid of the
read side undo code and simplify the calculation of the final rwsem count
adjustment once we've counted the reader threads to wake.

The basic observation is the following. When there are waiter threads
on a rwsem and the spinlock is held, other threads can only increment the
active count by trying to grab the rwsem in up_xxxx(). However up_xxxx()
will notice there are waiter threads and take the down_failed path,
blocking to acquire the spinlock on the way there. Therefore, a thread
observing an active count of zero with waiters queued and the spinlock held,
is protected against other threads acquiring the rwsem until it wakes the
last waiter or releases the spinlock.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   60 ++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 874c0c1..ab0d306 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -36,6 +36,14 @@ struct rwsem_waiter {
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
 };
 
+/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
+ * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
+ * since the rwsem value was observed.
+ */
+#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
+#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
+#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -46,8 +54,8 @@ struct rwsem_waiter {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
  ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

Add down_read_critical() / up_read_critical() API.

down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_unfair() callers
  are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_unfair() callers get queued in
  front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
  and preemption is disabled for that time.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   12 ++++++++++++
 kernel/rwsem.c                 |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
 
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1); }
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..76fd8f4 ...
From: Linus Torvalds
Date: Monday, May 17, 2010 - 3:44 pm

.. or here. In this case, it really is more about "unfairness", but I'm 
not convinced it should be so in the naming anyway, even if internally it 

But you did here (because it's a new thing).

Anyway, the series looks mostly acceptable to me in this form. I think it 
conceptually works out, and I think that the non-preemption guarantee 
should mean that starvation of writers is not likely an issue. However, 
I'd definitely like some second opinions on it. I'm not going to apply 
this series without acks from people. So you should try to convince DavidH 
too that this actually really does matter and makes sense.

		Linus
--

From: Michel Lespinasse
Date: Monday, May 17, 2010 - 4:13 pm

Gah! Sorry for missing the comment updates. I agree with you on the naming,
I just didn't remember about the comment.


I'll see what I can do here. Thanks !

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Michel Lespinasse
Date: Monday, May 17, 2010 - 4:20 pm

Add down_read_critical() / up_read_critical() API.

down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_critical()
  callers are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_critical() callers get queued in
  front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
  and preemption is disabled for that time.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   12 ++++++++++++
 kernel/rwsem.c                 |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
 
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1); }
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index ...
From: David Howells
Date: Wednesday, May 19, 2010 - 6:21 am

Shouldn't preemption really be disabled before __down_read_unfair() is called?
Otherwise you can get an unfair read on a sem and immediately get taken off
the CPU.  Of course, this means __down_read_unfair() would have to deal with
that in the slow path:-/

Oh, and something else that occurs to me:  Do unfair readers have to go at the
front of the wakeup queue?  Can they be slightly less unfair and go either
before the first reader in the queue or at the back of the queue instead?

David
--

From: Michel Lespinasse
Date: Wednesday, May 19, 2010 - 4:47 pm

I've asked myself the same question; it is true that we don't fully
prevent ourselves getting preempted while holding the rwsem here.

My understanding is that Linus wants the preempt_disable() mainly to
prevent threads doing voluntary preemption (anything that might_sleep)
while holding the unfairly acquired rwsem;  and also to have people
think twice before adding more down_read_critical() calls.

It wouldn't be difficult to move the preempt_disable() ahead of the
lock acquire fast path. However I don't think I can do it for the
blocking path, where thread A tries to acquire the lock on behalf of
thread B and then wakes B if it succeeded - I don't think we have an
API for A to say 'I want to disable preemption in thread B', is there

Going before the first reader would be fine for our use, as we're
really only using this for mmap_sem and the write holders there don't
keep it very long. I'm not sure what this buys us though.

Going at the back of the queue would mean the critical readers would
still get occasionally blocked behind other readers doing disk
accesses - we'd like to avoid that.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 8:35 pm

I think it's not that bad - I mean, the unfairness does not come into
factor here. If you tried to do a regular down_read(), you could also
get preempted on your way to the blocking path. Being preempted on the
way to your critical section after a successful (if unfair) acquire
really is no worse.

The critical section prevents you from blocking on long-latency events
such as disk accesses; being preempted but still runnable is not
nearly as bad.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

The basic properties that have been true so far still hold:
- RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
- RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
- RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
- RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0

A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer
will set the rwsem count to be < RWSEM_WAITING_BIAS and down_read_critical()
can make use of this to determine if there are any active writers.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   53 ++++++++++++++++++++++++++++++++---------
 arch/x86/lib/rwsem_64.S      |   14 +++++++++-
 arch/x86/lib/semaphore_32.S  |   21 +++++++++++++++-
 3 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 86e0473..fbb068c 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -16,11 +16,10 @@
  * if there are writers (and maybe) readers waiting (in which case it goes to
  * sleep).
  *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
+ * The WRITE_BIAS value supports up to 32767 processes simultaneously
+ * trying to acquire a write lock.
  *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ * The value of ACTIVE_BIAS supports up to 32767 active processes.
  *
  * This should be totally fair - if anything is waiting, a process that wants a
  * lock will go to the back of the queue. When the currently active lock is
@@ -48,6 +47,8 @@ struct rwsem_waiter;
 extern asmregparm struct rw_semaphore *
  rwsem_down_read_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
+ rwsem_down_read_unfair_failed(struct rw_semaphore *sem);
+extern asmregparm struct rw_semaphore *
  ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

This change addresses the following situation:

- Thread A acquires the rwsem for read
- Thread B tries to acquire the rwsem for write, notices there is already
  an active owner for the rwsem.
- Thread C tries to acquire the rwsem for read, notices that thread B already
  tried to acquire it.
- Thread C grabs the spinlock and queues itself on the wait queue.
- Thread B grabs the spinlock and queues itself behind C. At this point A is
  the only remaining active owner on the rwsem.

In this situation thread B could notice that it was the last active writer
on the rwsem, and decide to wake C to let it proceed in parallel with A
since they both only want the rwsem for read.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   21 ++++++++++++++++-----
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index b2dde5a..89cd8b3 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -67,6 +67,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
+		/* Another active reader was observed, so wakeup is not
+		 * likely to succeed. Save the atomic op.
+		 */
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -98,7 +101,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  readers_only:
 	/* If we come here from up_xxxx(), another thread might have reached
 	 * rwsem_down_failed_common() before we acquired the spinlock and
-	 * woken up an active locker.  We prefer to check for this first in
+	 * woken up an active writer.  We prefer to check for this first in
 	 * order to not spend too much time with the spinlock held if we're
 	 * not going to be able to wake up readers in the end.
 	 *
@@ -111,8 +114,8 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * count adjustment pretty soon.
 	 */
 	if (wake_type == RWSEM_WAKE_ANY &&
-	    rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK)
-		/* ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With this
change, the bias is added only once when the wait list is non-empty.

This has a few nice properties which will be used in following changes:
- when the spinlock is held and the waiter list is known to be non-empty,
  count < RWSEM_WAITING_BIAS  <=>  there is an active writer on that sem
- count == RWSEM_WAITING_BIAS  <=>  there are waiting threads and no
                                     active readers/writers on that sem

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index ab0d306..b2dde5a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -60,7 +60,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
@@ -73,9 +73,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * write lock.  However, we only wake this writer if we can transition
 	 * the active part of the count from 0 -> 1
 	 */
+	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
+	if (waiter->list.next == &sem->wait_list)
+		adjustment -= RWSEM_WAITING_BIAS;
+
  try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-						- RWSEM_ACTIVE_BIAS;
+	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
 	if (oldcount & RWSEM_ACTIVE_MASK)
 		/* Someone grabbed the sem already */
 		goto undo;
@@ -128,13 +131,15 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
+	if (waiter->flags & ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

This helps in the following situation:
- Thread A takes a page fault while reading or writing memory.
  do_page_fault() acquires the mmap_sem for read and blocks on disk
  (either reading the page from file, or hitting swap) for a long time.
- Thread B does an mmap call and blocks trying to acquire the mmap_sem
  for write
- Thread C is a monitoring process trying to read every /proc/pid/maps
  in the system. This requires acquiring the mmap_sem for read. Thread C
  blocks behind B, waiting for A to release the rwsem.  If thread C
  could be allowed to run in parallel with A, it would probably get done
  long before thread A's disk access completes, thus not actually slowing
  down thread B.

Test results with down_read_critical_test (10 seconds):

2.6.33.3:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~600 /proc/pid/maps reads

2.6.33.3 + down_read_critical:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~160000 /proc/pid/maps reads

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/proc/base.c          |    4 ++--
 fs/proc/task_mmu.c      |   24 +++++++++++++++++-------
 include/linux/proc_fs.h |    1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8418fcc..9201762 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,11 +1367,11 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 
 	/* We need mmap_sem to protect against races with removal of
 	 * VM_EXECUTABLE vmas */
-	down_read(&mm->mmap_sem);
+	down_read_critical(&mm->mmap_sem);
 	exe_file = mm->exe_file;
 	if (exe_file)
 		get_file(exe_file);
-	up_read(&mm->mmap_sem);
+	up_read_critical(&mm->mmap_sem);
 	return exe_file;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47f5b14..83f9353 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,7 +89,10 @@ static void vma_stop(struct proc_maps_private *priv, struct ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

More code can be pushed from rwsem_down_read_failed and
rwsem_down_write_failed into rwsem_down_failed_common.

Following change adding down_read_unfair infrastructure support also
enjoys having flags available in a register rather than having to
fish it out in the struct rwsem_waiter...

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 89cd8b3..d8764e0 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -171,8 +171,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			struct rwsem_waiter *waiter, signed long adjustment)
+			 unsigned int flags, signed long adjustment)
 {
+	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	signed long count;
 
@@ -180,12 +181,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
-	waiter->task = tsk;
+	waiter.task = tsk;
+	waiter.flags = flags;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter->list, &sem->wait_list);
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -206,7 +208,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->task)
+		if (!waiter.task)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -223,11 +225,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 asmregparm struct rw_semaphore __sched *
 rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter, ...
From: Michel Lespinasse
Date: Monday, May 17, 2010 - 3:25 pm

Add rwsem_down_read_unfair_failed() function in the non-generic rwsem
library code, similar to  rwsem_down_read_failed() except that blocked
threads are placed at the head of the queue.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index d8764e0..b3fe179 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -34,6 +34,7 @@ struct rwsem_waiter {
 	unsigned int flags;
 #define RWSEM_WAITING_FOR_READ	0x00000001
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
+#define RWSEM_UNFAIR            0x00000004
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -187,7 +188,11 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter.list, &sem->wait_list);
+
+	if (flags & RWSEM_UNFAIR)
+		list_add(&waiter.list, &sem->wait_list);
+	else
+		list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -229,6 +234,21 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 					-RWSEM_ACTIVE_READ_BIAS);
 }
 
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * wait for the read lock to be granted - skip waiting threads
+ */
+asmregparm struct rw_semaphore __sched *
+rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
+{
+	return rwsem_down_failed_common(sem,
+					RWSEM_WAITING_FOR_READ | RWSEM_UNFAIR,
+					-RWSEM_ACTIVE_READ_BIAS);
+}
+
+#endif
+
 /*
  * wait for the write lock to be granted
  */
-- 
1.7.0.1

--

From: David Howells
Date: Wednesday, May 19, 2010 - 4:47 am

Given that you describe this first, this would suggest that the subject of the
patch should be this.  I'm not sure I'd count this as a minor cleanup.  I think

Mostly okay, except where you said "expects old value in %edx" - that's only
true on i386, not x86_64.  On the latter it would be %rdi.  However, I can live

If you're going to put the initialisation of EDX/RDI on tmp (which isn't really
necessary), rather than directly on the asm statement, you could change the
'"=d" (tmp)' output constraint to be '"+d" (tmp)' and drop the '"1" (tmp)'
constraint entirely.

However, apart from that, feel free to add my Acked-by to this patch or its
split resultant patches.

David
--

From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 2:37 pm

It's actually still %edx in x86_64 - we're calling into

I agree it'd be nicer, but I wondered if all gcc versions would handle
the constraints change fine and then I chickened out. Instead I moved
the initialization on the constraints list as was already done in
__up_write(). All I was really shooting for here is consistency

Thanks. I'll send a V4 series soon integrating your feedback & mark
the two splitted patches resulting from this one as Acked-by.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Wednesday, May 19, 2010 - 5:04 am

I hate code that jumps into braced blocks with goto.  Would it be possible for
you to do:

	readers_only:
		if (downgrading)
			goto wake_readers;
	...
	wake_readers:
		/* Grant an infinite number of read locks to the readers at the front
	...

instead, please?

Also, since the labels 'undo' and 'try_again' are now specific to the writer
path, can you label them 'undo_write' and 'try_again_write' just to make it
obvious?

Other than that, no particular objections to this patch.

David
--

From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 2:48 pm

Applied. Note that next patch in the series removes that code anyway :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Wednesday, May 19, 2010 - 5:25 am

That's not true.  A thread attempting to get an rwsem by issuing a down_read()
or down_write() will also unconditionally increment the active count before it
considers calling out to the slow path.

Maybe what you mean is that other threads wanting to do a wake up can only
increase the active count for the processes being woken up whilst holding the

Do you mean a waiter rather than an active locker?  If a process is still
registering activity on the rwsem, then it can't be woken yet.


I wonder if I should've called it the 'activity part' of the count rather than
the 'active part'.

Apart from that, the patch looks fine.  That's all comment/description fixes.

David
--

From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 3:33 pm

Damn, this was a pretty bad commit comment - I wrote up_xxxx instead
of down_xxxx here.

down_xxxx can increase the active count but it will then take the
down_failed path and wait to acquire the spinlock - so we don't need
to worry about it. Other code paths can't even increase the active
count, since they don't hold the spinlock.

(I got mixed up due to a legitimate mention of the up_xxxx path higher
up. Semaphore operation names confuse me with the active count going
up in down() and down in up() - I suppose it made more sense in the


Thanks. Updated the comments. And I thought I could speak english... :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Friday, May 21, 2010 - 1:06 am

Not so much can, as does...  Perhaps you need to qualify that.

David
--

From: David Howells
Date: Wednesday, May 19, 2010 - 5:33 am

You might want to change "when the wait list" to "to indicate that the wait
list".  But apart from that:

Acked-by: David Howells <dhowells@redhat.com>
--

From: David Howells
Date: Wednesday, May 19, 2010 - 5:44 am

The comma you've added after 'us' is wrong.  That suggests that the implicit
'then' comes there.  I take it you're a proponent of the Oxford/Harvard/serial
comma?

Apart from that miscellaneous grammatical difference, the patch is fine:-)

David
--

From: David Howells
Date: Wednesday, May 19, 2010 - 5:51 am

Acked-by: David Howells <dhowells@redhat.com>
--

From: David Howells
Date: Wednesday, May 19, 2010 - 6:34 am

Can I suggest you change this to:

	enum rwsem_waiter_type {
		RWSEM_WAITING_FOR_WRITE,
		RWSEM_WAITING_FOR_READ,
		RWSEM_WAITING_FOR_UNFAIR_READ
	};


to:

	enum rwsem_waiter_type type;

and use this throughout.  It simplifies some of the code too.  See attached
patch.

David
---
rwsem: Make waiter type an enum in the slow path of the asm-optimised version

Make the waiter type in the asm-optimised version of the slow path an
enumeration rather than a bitmask.

Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/lib/rwsem.c b/lib/rwsem.c
index b3fe179..8aa2238 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -28,13 +28,16 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 
 EXPORT_SYMBOL(__init_rwsem);
 
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ,
+	RWSEM_WAITING_FOR_UNFAIR_READ
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-#define RWSEM_UNFAIR            0x00000004
+	enum rwsem_waiter_type type;
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -64,7 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
@@ -133,10 +136,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(waiter->list.next,
 					struct rwsem_waiter, list);
 
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+	} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
-	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		/* hit ...
From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 4:30 pm

Makes perfect sense assuming we don't plan to add more flags in the future.

if (type == RWSEM_WAITING_FOR_UNFAIR_READ)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Friday, May 21, 2010 - 1:03 am

Sorry, yes.
--

From: David Howells
Date: Wednesday, May 19, 2010 - 7:36 am

Acked-by: David Howells <dhowells@redhat.com>
--

From: David Howells
Date: Wednesday, May 19, 2010 - 8:21 am

That seems to work.  I have some rwsem benchmarks for you using my
synchronisation primitive test (see attached).  I run it like so:

	insmod /tmp/synchro-test.ko rd=10 wr=2 dg=1 do_sched=1

with 10 readers, 2 writers and a downgrader on the same rwsem, and permitting
scheduling between attempts to grab and release the semaphore.  The test runs
for 5 seconds on a box that isn't doing anything else, and has had almost
everything that might normally be running (including syslog) disabled or
removed.

Without your patches:

      THREADS       S INTVL             LOCKS TAKEN AND RELEASED
-------------------   /LOAD -------------------------------------------------
MTX SEM RD  WR  DG          MTX TAKEN SEM TAKEN RD TAKEN  WR TAKEN  DG TAKEN
=== === === === === = ===== ========= ========= ========= ========= =========
  0   0  10   2   1 s   2/2         0         0   2087911    590464    194394
  0   0  10   2   1 s   2/2         0         0   2030497    591511    196851
  0   0  10   2   1 s   2/2         0         0   2078523    605317    199645

With your patches

  0   0  10   2   1 s   2/2         0         0   2054720    605691    198472
  0   0  10   2   1 s   2/2         0         0   2051376    597976    196962
  0   0  10   2   1 s   2/2         0         0   2011909    595706    197482

So there doesn't seem to be much in the way of degradation.  In fact, there
seems to have been more degradation somewhere in the general kernel since I
played with version 1 of your patches a week ago.  At that time, the vanilla
kernel gave this without your patches applied:

  0   0  10   2   1 s   2/2         0         0   2217200    636227    210255
  0   0  10   2   1 s   2/2         0         0   2217903    629175    212926
  0   0  10   2   1 s   2/2         0         0   2224136    647912    212506

Feel free to have a play.

BTW, the code also works on FRV in NOMMU mode (which uses the spinlock-based
rwsem version).


However...  I'm not certain that giving a process ...
From: Michel Lespinasse
Date: Thursday, May 20, 2010 - 7:44 pm

Yes, I made sure to keep the generic (and non-x86 asm) versions

One thing to keep in mind if you're concerned about the higher
priority is that these /proc files are still not open to everyone - a
user can't read the /proc/<pid>/exe and /proc/<pid>/maps files for
another user's processes unless he's got CAP_SYS_PTRACE priviledges.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Michel Lespinasse
Date: Friday, May 21, 2010 - 6:49 pm

This is relatively large software; I don't actually have a complete
view of it myself.
However some of the things it does include the following:

- The cluster management software tracks ownership of sysv shm
segments, potentially deleting them if they have been abandoned.
/proc/<pid>/maps files are used as an input to these decisions.

- There is a separate module making numa aware decisions based on
/proc/<pid>/numa_maps information. This drives decisions such as
moving data between nodes (we have not been pushing the corresponding
kernel patches yet). Using down_read_critical() here would be
desirable too, but not as critical as for other parts of the system
(numa awareness is not the biggest worry when machines get under
enough memory pressure that reading numa_maps starts blocking).

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Tuesday, May 25, 2010 - 2:42 am

ipcs manages to give you that information in its nattch column, I believe,
without touching /proc/pid/maps.

David
--

Previous thread: [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers by Michel Lespinasse on Monday, May 17, 2010 - 3:25 pm. (1 message)

Next thread: [GIT PULL] Please pull NFS client changes by Trond Myklebust on Monday, May 17, 2010 - 3:28 pm. (2 messages)