This is version 3 of my rwsem changes. Patches 7 and 10 were modified to address Linus's comments about the API. Please consider for merging. Changes since V2: - Rebased to 2.6.34 - Changed patch 07 to address Linus's comments about the API. down_read_critical() and up_read_critical() now work as a pair; threads using this are allowed to skip over blocked threads when acquiring the read lock; however they must make sure to quickly release that lock and in particular, they are forbidden to block. - Changed patch 10 to make use of the down_read_critical()/up_read_critical() API when accessing /proc/<pid>/exe and /proc/<pid>/maps files. I excluded smaps and numa_maps files, which can actually block while being generated (smaps blocks in smaps_pte_range() doing a cond_resched(), which seems legitimate as it's a potentially heavy operation. numa_maps blocks in show_numa_map() doing a bzalloc of struct numa_maps, which should probably get done in do_maps_open() instead). The motivation for this change was some cluster monitoring software we use at google; which reads /proc/<pid>/maps files for all running processes. When the machines are under load, the mmap_sem is often acquire for reads for long periods of time since do_page_fault() holds it while doing disk accesses; and fair queueing behavior often ends up in the monitoring software making little progress. By introducing unfair behavior in a few selected places, are are able to let the monitoring software make progress without impacting performance for the rest of the system. I've made sure not to change the rwsem fast paths in implementing this proposal. Michel Lespinasse (10): x86 rwsem: minor cleanups rwsem: fully separate code pathes to wake writers vs readers rwsem: lighter active count checks when waking up readers rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads rwsem: wake queued readers when writer blocks on active read lock rwsem: smaller wrappers around ...
In __rwsem_do_wake(), we can skip the active count check unless we come
there from up_xxxx(). Also when checking the active count, it is not
actually necessary to increment it; this allows us to get rid of the
read side undo code and simplify the calculation of the final rwsem count
adjustment once we've counted the reader threads to wake.
The basic observation is the following. When there are waiter threads
on a rwsem and the spinlock is held, other threads can only increment the
active count by trying to grab the rwsem in up_xxxx(). However up_xxxx()
will notice there are waiter threads and take the down_failed path,
blocking to acquire the spinlock on the way there. Therefore, a thread
observing an active count of zero with waiters queued and the spinlock held,
is protected against other threads acquiring the rwsem until it wakes the
last waiter or releases the spinlock.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
lib/rwsem.c | 60 ++++++++++++++++++++++++++++++++--------------------------
1 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 874c0c1..ab0d306 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -36,6 +36,14 @@ struct rwsem_waiter {
#define RWSEM_WAITING_FOR_WRITE 0x00000002
};
+/* Wake types for __rwsem_do_wake(). Note that RWSEM_WAKE_NO_ACTIVE and
+ * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
+ * since the rwsem value was observed.
+ */
+#define RWSEM_WAKE_ANY 0 /* Wake whatever's at head of wait list */
+#define RWSEM_WAKE_NO_ACTIVE 1 /* rwsem was observed with no active thread */
+#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then:
@@ -46,8 +54,8 @@ struct rwsem_waiter {
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if downgrading is false
...Add down_read_critical() / up_read_critical() API.
down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_unfair() callers
are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_unfair() callers get queued in
front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
and preemption is disabled for that time.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
include/linux/rwsem-spinlock.h | 10 +++++++++-
include/linux/rwsem.h | 12 ++++++++++++
kernel/rwsem.c | 35 +++++++++++++++++++++++++++++++++++
lib/rwsem-spinlock.c | 10 +++++++---
4 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do { \
__init_rwsem((sem), #sem, &__key); \
} while (0)
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
extern int __down_read_trylock(struct rw_semaphore *sem);
extern void __down_write(struct rw_semaphore *sem);
extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
extern void __downgrade_write(struct rw_semaphore *sem);
extern int rwsem_is_locked(struct rw_semaphore *sem);
+static inline void __down_read(struct rw_semaphore *sem)
+ { __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+ { __down_read_internal(sem, 1); }
+
#endif /* __KERNEL__ */
#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..76fd8f4 ..... or here. In this case, it really is more about "unfairness", but I'm not convinced it should be so in the naming anyway, even if internally it But you did here (because it's a new thing). Anyway, the series looks mostly acceptable to me in this form. I think it conceptually works out, and I think that the non-preemption guarantee should mean that starvation of writers is not likely an issue. However, I'd definitely like some second opinions on it. I'm not going to apply this series without acks from people. So you should try to convince DavidH too that this actually really does matter and makes sense. Linus --
Gah! Sorry for missing the comment updates. I agree with you on the naming, I just didn't remember about the comment. I'll see what I can do here. Thanks ! -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Add down_read_critical() / up_read_critical() API.
down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_critical()
callers are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_critical() callers get queued in
front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
and preemption is disabled for that time.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
include/linux/rwsem-spinlock.h | 10 +++++++++-
include/linux/rwsem.h | 12 ++++++++++++
kernel/rwsem.c | 35 +++++++++++++++++++++++++++++++++++
lib/rwsem-spinlock.c | 10 +++++++---
4 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do { \
__init_rwsem((sem), #sem, &__key); \
} while (0)
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
extern int __down_read_trylock(struct rw_semaphore *sem);
extern void __down_write(struct rw_semaphore *sem);
extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
extern void __downgrade_write(struct rw_semaphore *sem);
extern int rwsem_is_locked(struct rw_semaphore *sem);
+static inline void __down_read(struct rw_semaphore *sem)
+ { __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+ { __down_read_internal(sem, 1); }
+
#endif /* __KERNEL__ */
#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index ...Shouldn't preemption really be disabled before __down_read_unfair() is called? Otherwise you can get an unfair read on a sem and immediately get taken off the CPU. Of course, this means __down_read_unfair() would have to deal with that in the slow path:-/ Oh, and something else that occurs to me: Do unfair readers have to go at the front of the wakeup queue? Can they be slightly less unfair and go either before the first reader in the queue or at the back of the queue instead? David --
I've asked myself the same question; it is true that we don't fully prevent ourselves getting preempted while holding the rwsem here. My understanding is that Linus wants the preempt_disable() mainly to prevent threads doing voluntary preemption (anything that might_sleep) while holding the unfairly acquired rwsem; and also to have people think twice before adding more down_read_critical() calls. It wouldn't be difficult to move the preempt_disable() ahead of the lock acquire fast path. However I don't think I can do it for the blocking path, where thread A tries to acquire the lock on behalf of thread B and then wakes B if it succeeded - I don't think we have an API for A to say 'I want to disable preemption in thread B', is there Going before the first reader would be fine for our use, as we're really only using this for mmap_sem and the write holders there don't keep it very long. I'm not sure what this buys us though. Going at the back of the queue would mean the critical readers would still get occasionally blocked behind other readers doing disk accesses - we'd like to avoid that. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
I think it's not that bad - I mean, the unfairness does not come into factor here. If you tried to do a regular down_read(), you could also get preempted on your way to the blocking path. Being preempted on the way to your critical section after a successful (if unfair) acquire really is no worse. The critical section prevents you from blocking on long-latency events such as disk accesses; being preempted but still runnable is not nearly as bad. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
The basic properties that have been true so far still hold: - RWSEM_ACTIVE_READ_BIAS & RWSEM_ACTIVE_MASK == 1 - RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1 - RWSEM_WAITING_BIAS & RWSEM_ACTIVE_MASK == 0 - RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0 A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be 'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer will set the rwsem count to be < RWSEM_WAITING_BIAS and down_read_critical() can make use of this to determine if there are any active writers. Signed-off-by: Michel Lespinasse <walken@google.com> --- arch/x86/include/asm/rwsem.h | 53 ++++++++++++++++++++++++++++++++--------- arch/x86/lib/rwsem_64.S | 14 +++++++++- arch/x86/lib/semaphore_32.S | 21 +++++++++++++++- 3 files changed, 72 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h index 86e0473..fbb068c 100644 --- a/arch/x86/include/asm/rwsem.h +++ b/arch/x86/include/asm/rwsem.h @@ -16,11 +16,10 @@ * if there are writers (and maybe) readers waiting (in which case it goes to * sleep). * - * The value of WAITING_BIAS supports up to 32766 waiting processes. This can - * be extended to 65534 by manually checking the whole MSW rather than relying - * on the S flag. + * The WRITE_BIAS value supports up to 32767 processes simultaneously + * trying to acquire a write lock. * - * The value of ACTIVE_BIAS supports up to 65535 active processes. + * The value of ACTIVE_BIAS supports up to 32767 active processes. * * This should be totally fair - if anything is waiting, a process that wants a * lock will go to the back of the queue. When the currently active lock is @@ -48,6 +47,8 @@ struct rwsem_waiter; extern asmregparm struct rw_semaphore * rwsem_down_read_failed(struct rw_semaphore *sem); extern asmregparm struct rw_semaphore * + rwsem_down_read_unfair_failed(struct rw_semaphore *sem); +extern asmregparm struct rw_semaphore * ...
This change addresses the following situation: - Thread A acquires the rwsem for read - Thread B tries to acquire the rwsem for write, notices there is already an active owner for the rwsem. - Thread C tries to acquire the rwsem for read, notices that thread B already tried to acquire it. - Thread C grabs the spinlock and queues itself on the wait queue. - Thread B grabs the spinlock and queues itself behind C. At this point A is the only remaining active owner on the rwsem. In this situation thread B could notice that it was the last active writer on the rwsem, and decide to wake C to let it proceed in parallel with A since they both only want the rwsem for read. Signed-off-by: Michel Lespinasse <walken@google.com> --- lib/rwsem.c | 21 ++++++++++++++++----- 1 files changed, 16 insertions(+), 5 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index b2dde5a..89cd8b3 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -67,6 +67,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type) goto readers_only; if (wake_type == RWSEM_WAKE_READ_OWNED) + /* Another active reader was observed, so wakeup is not + * likely to succeed. Save the atomic op. + */ goto out; /* There's a writer at the front of the queue - try to grant it the @@ -98,7 +101,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type) readers_only: /* If we come here from up_xxxx(), another thread might have reached * rwsem_down_failed_common() before we acquired the spinlock and - * woken up an active locker. We prefer to check for this first in + * woken up an active writer. We prefer to check for this first in * order to not spend too much time with the spinlock held if we're * not going to be able to wake up readers in the end. * @@ -111,8 +114,8 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type) * count adjustment pretty soon. */ if (wake_type == RWSEM_WAKE_ANY && - rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK) - /* ...
Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With this
change, the bias is added only once when the wait list is non-empty.
This has a few nice properties which will be used in following changes:
- when the spinlock is held and the waiter list is known to be non-empty,
count < RWSEM_WAITING_BIAS <=> there is an active writer on that sem
- count == RWSEM_WAITING_BIAS <=> there are waiting threads and no
active readers/writers on that sem
Signed-off-by: Michel Lespinasse <walken@google.com>
---
lib/rwsem.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/lib/rwsem.c b/lib/rwsem.c
index ab0d306..b2dde5a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -60,7 +60,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
- signed long oldcount, woken, loop;
+ signed long oldcount, woken, loop, adjustment;
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
@@ -73,9 +73,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
* write lock. However, we only wake this writer if we can transition
* the active part of the count from 0 -> 1
*/
+ adjustment = RWSEM_ACTIVE_WRITE_BIAS;
+ if (waiter->list.next == &sem->wait_list)
+ adjustment -= RWSEM_WAITING_BIAS;
+
try_again:
- oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
- - RWSEM_ACTIVE_BIAS;
+ oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
if (oldcount & RWSEM_ACTIVE_MASK)
/* Someone grabbed the sem already */
goto undo;
@@ -128,13 +131,15 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
} while (waiter->flags & RWSEM_WAITING_FOR_READ);
- loop = woken;
- woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+ adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
+ if (waiter->flags & ...This helps in the following situation: - Thread A takes a page fault while reading or writing memory. do_page_fault() acquires the mmap_sem for read and blocks on disk (either reading the page from file, or hitting swap) for a long time. - Thread B does an mmap call and blocks trying to acquire the mmap_sem for write - Thread C is a monitoring process trying to read every /proc/pid/maps in the system. This requires acquiring the mmap_sem for read. Thread C blocks behind B, waiting for A to release the rwsem. If thread C could be allowed to run in parallel with A, it would probably get done long before thread A's disk access completes, thus not actually slowing down thread B. Test results with down_read_critical_test (10 seconds): 2.6.33.3: threadA completes ~600 faults threadB completes ~300 mmap/munmap cycles threadC completes ~600 /proc/pid/maps reads 2.6.33.3 + down_read_critical: threadA completes ~600 faults threadB completes ~300 mmap/munmap cycles threadC completes ~160000 /proc/pid/maps reads Signed-off-by: Michel Lespinasse <walken@google.com> --- fs/proc/base.c | 4 ++-- fs/proc/task_mmu.c | 24 +++++++++++++++++------- include/linux/proc_fs.h | 1 + 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 8418fcc..9201762 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1367,11 +1367,11 @@ struct file *get_mm_exe_file(struct mm_struct *mm) /* We need mmap_sem to protect against races with removal of * VM_EXECUTABLE vmas */ - down_read(&mm->mmap_sem); + down_read_critical(&mm->mmap_sem); exe_file = mm->exe_file; if (exe_file) get_file(exe_file); - up_read(&mm->mmap_sem); + up_read_critical(&mm->mmap_sem); return exe_file; } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 47f5b14..83f9353 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -89,7 +89,10 @@ static void vma_stop(struct proc_maps_private *priv, struct ...
More code can be pushed from rwsem_down_read_failed and
rwsem_down_write_failed into rwsem_down_failed_common.
Following change adding down_read_unfair infrastructure support also
enjoys having flags available in a register rather than having to
fish it out in the struct rwsem_waiter...
Signed-off-by: Michel Lespinasse <walken@google.com>
---
lib/rwsem.c | 25 ++++++++++---------------
1 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 89cd8b3..d8764e0 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -171,8 +171,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
*/
static struct rw_semaphore __sched *
rwsem_down_failed_common(struct rw_semaphore *sem,
- struct rwsem_waiter *waiter, signed long adjustment)
+ unsigned int flags, signed long adjustment)
{
+ struct rwsem_waiter waiter;
struct task_struct *tsk = current;
signed long count;
@@ -180,12 +181,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
/* set up my own style of waitqueue */
spin_lock_irq(&sem->wait_lock);
- waiter->task = tsk;
+ waiter.task = tsk;
+ waiter.flags = flags;
get_task_struct(tsk);
if (list_empty(&sem->wait_list))
adjustment += RWSEM_WAITING_BIAS;
- list_add_tail(&waiter->list, &sem->wait_list);
+ list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock, but no longer actively locking */
count = rwsem_atomic_update(adjustment, sem);
@@ -206,7 +208,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
/* wait to be given the lock */
for (;;) {
- if (!waiter->task)
+ if (!waiter.task)
break;
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -223,11 +225,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
asmregparm struct rw_semaphore __sched *
rwsem_down_read_failed(struct rw_semaphore *sem)
{
- struct rwsem_waiter waiter;
-
- waiter.flags = RWSEM_WAITING_FOR_READ;
- rwsem_down_failed_common(sem, &waiter, ...Add rwsem_down_read_unfair_failed() function in the non-generic rwsem
library code, similar to rwsem_down_read_failed() except that blocked
threads are placed at the head of the queue.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
lib/rwsem.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/lib/rwsem.c b/lib/rwsem.c
index d8764e0..b3fe179 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -34,6 +34,7 @@ struct rwsem_waiter {
unsigned int flags;
#define RWSEM_WAITING_FOR_READ 0x00000001
#define RWSEM_WAITING_FOR_WRITE 0x00000002
+#define RWSEM_UNFAIR 0x00000004
};
/* Wake types for __rwsem_do_wake(). Note that RWSEM_WAKE_NO_ACTIVE and
@@ -187,7 +188,11 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
if (list_empty(&sem->wait_list))
adjustment += RWSEM_WAITING_BIAS;
- list_add_tail(&waiter.list, &sem->wait_list);
+
+ if (flags & RWSEM_UNFAIR)
+ list_add(&waiter.list, &sem->wait_list);
+ else
+ list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock, but no longer actively locking */
count = rwsem_atomic_update(adjustment, sem);
@@ -229,6 +234,21 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
-RWSEM_ACTIVE_READ_BIAS);
}
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * wait for the read lock to be granted - skip waiting threads
+ */
+asmregparm struct rw_semaphore __sched *
+rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
+{
+ return rwsem_down_failed_common(sem,
+ RWSEM_WAITING_FOR_READ | RWSEM_UNFAIR,
+ -RWSEM_ACTIVE_READ_BIAS);
+}
+
+#endif
+
/*
* wait for the write lock to be granted
*/
--
1.7.0.1
--
Given that you describe this first, this would suggest that the subject of the patch should be this. I'm not sure I'd count this as a minor cleanup. I think Mostly okay, except where you said "expects old value in %edx" - that's only true on i386, not x86_64. On the latter it would be %rdi. However, I can live If you're going to put the initialisation of EDX/RDI on tmp (which isn't really necessary), rather than directly on the asm statement, you could change the '"=d" (tmp)' output constraint to be '"+d" (tmp)' and drop the '"1" (tmp)' constraint entirely. However, apart from that, feel free to add my Acked-by to this patch or its split resultant patches. David --
It's actually still %edx in x86_64 - we're calling into I agree it'd be nicer, but I wondered if all gcc versions would handle the constraints change fine and then I chickened out. Instead I moved the initialization on the constraints list as was already done in __up_write(). All I was really shooting for here is consistency Thanks. I'll send a V4 series soon integrating your feedback & mark the two splitted patches resulting from this one as Acked-by. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
I hate code that jumps into braced blocks with goto. Would it be possible for you to do: readers_only: if (downgrading) goto wake_readers; ... wake_readers: /* Grant an infinite number of read locks to the readers at the front ... instead, please? Also, since the labels 'undo' and 'try_again' are now specific to the writer path, can you label them 'undo_write' and 'try_again_write' just to make it obvious? Other than that, no particular objections to this patch. David --
Applied. Note that next patch in the series removes that code anyway :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
That's not true. A thread attempting to get an rwsem by issuing a down_read() or down_write() will also unconditionally increment the active count before it considers calling out to the slow path. Maybe what you mean is that other threads wanting to do a wake up can only increase the active count for the processes being woken up whilst holding the Do you mean a waiter rather than an active locker? If a process is still registering activity on the rwsem, then it can't be woken yet. I wonder if I should've called it the 'activity part' of the count rather than the 'active part'. Apart from that, the patch looks fine. That's all comment/description fixes. David --
Damn, this was a pretty bad commit comment - I wrote up_xxxx instead of down_xxxx here. down_xxxx can increase the active count but it will then take the down_failed path and wait to acquire the spinlock - so we don't need to worry about it. Other code paths can't even increase the active count, since they don't hold the spinlock. (I got mixed up due to a legitimate mention of the up_xxxx path higher up. Semaphore operation names confuse me with the active count going up in down() and down in up() - I suppose it made more sense in the Thanks. Updated the comments. And I thought I could speak english... :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Not so much can, as does... Perhaps you need to qualify that. David --
You might want to change "when the wait list" to "to indicate that the wait list". But apart from that: Acked-by: David Howells <dhowells@redhat.com> --
The comma you've added after 'us' is wrong. That suggests that the implicit 'then' comes there. I take it you're a proponent of the Oxford/Harvard/serial comma? Apart from that miscellaneous grammatical difference, the patch is fine:-) David --
Acked-by: David Howells <dhowells@redhat.com> --
Can I suggest you change this to:
enum rwsem_waiter_type {
RWSEM_WAITING_FOR_WRITE,
RWSEM_WAITING_FOR_READ,
RWSEM_WAITING_FOR_UNFAIR_READ
};
to:
enum rwsem_waiter_type type;
and use this throughout. It simplifies some of the code too. See attached
patch.
David
---
rwsem: Make waiter type an enum in the slow path of the asm-optimised version
Make the waiter type in the asm-optimised version of the slow path an
enumeration rather than a bitmask.
Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/lib/rwsem.c b/lib/rwsem.c
index b3fe179..8aa2238 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -28,13 +28,16 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
EXPORT_SYMBOL(__init_rwsem);
+enum rwsem_waiter_type {
+ RWSEM_WAITING_FOR_WRITE,
+ RWSEM_WAITING_FOR_READ,
+ RWSEM_WAITING_FOR_UNFAIR_READ
+};
+
struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
- unsigned int flags;
-#define RWSEM_WAITING_FOR_READ 0x00000001
-#define RWSEM_WAITING_FOR_WRITE 0x00000002
-#define RWSEM_UNFAIR 0x00000004
+ enum rwsem_waiter_type type;
};
/* Wake types for __rwsem_do_wake(). Note that RWSEM_WAKE_NO_ACTIVE and
@@ -64,7 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
signed long oldcount, woken, loop, adjustment;
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
- if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+ if (waiter->type != RWSEM_WAITING_FOR_WRITE)
goto readers_only;
if (wake_type == RWSEM_WAKE_READ_OWNED)
@@ -133,10 +136,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
waiter = list_entry(waiter->list.next,
struct rwsem_waiter, list);
- } while (waiter->flags & RWSEM_WAITING_FOR_READ);
+ } while (waiter->type != RWSEM_WAITING_FOR_WRITE);
adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
- if (waiter->flags & RWSEM_WAITING_FOR_READ)
+ if (waiter->type != RWSEM_WAITING_FOR_WRITE)
/* hit ...Makes perfect sense assuming we don't plan to add more flags in the future. if (type == RWSEM_WAITING_FOR_UNFAIR_READ) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Acked-by: David Howells <dhowells@redhat.com> --
That seems to work. I have some rwsem benchmarks for you using my
synchronisation primitive test (see attached). I run it like so:
insmod /tmp/synchro-test.ko rd=10 wr=2 dg=1 do_sched=1
with 10 readers, 2 writers and a downgrader on the same rwsem, and permitting
scheduling between attempts to grab and release the semaphore. The test runs
for 5 seconds on a box that isn't doing anything else, and has had almost
everything that might normally be running (including syslog) disabled or
removed.
Without your patches:
THREADS S INTVL LOCKS TAKEN AND RELEASED
------------------- /LOAD -------------------------------------------------
MTX SEM RD WR DG MTX TAKEN SEM TAKEN RD TAKEN WR TAKEN DG TAKEN
=== === === === === = ===== ========= ========= ========= ========= =========
0 0 10 2 1 s 2/2 0 0 2087911 590464 194394
0 0 10 2 1 s 2/2 0 0 2030497 591511 196851
0 0 10 2 1 s 2/2 0 0 2078523 605317 199645
With your patches
0 0 10 2 1 s 2/2 0 0 2054720 605691 198472
0 0 10 2 1 s 2/2 0 0 2051376 597976 196962
0 0 10 2 1 s 2/2 0 0 2011909 595706 197482
So there doesn't seem to be much in the way of degradation. In fact, there
seems to have been more degradation somewhere in the general kernel since I
played with version 1 of your patches a week ago. At that time, the vanilla
kernel gave this without your patches applied:
0 0 10 2 1 s 2/2 0 0 2217200 636227 210255
0 0 10 2 1 s 2/2 0 0 2217903 629175 212926
0 0 10 2 1 s 2/2 0 0 2224136 647912 212506
Feel free to have a play.
BTW, the code also works on FRV in NOMMU mode (which uses the spinlock-based
rwsem version).
However... I'm not certain that giving a process ...Yes, I made sure to keep the generic (and non-x86 asm) versions One thing to keep in mind if you're concerned about the higher priority is that these /proc files are still not open to everyone - a user can't read the /proc/<pid>/exe and /proc/<pid>/maps files for another user's processes unless he's got CAP_SYS_PTRACE priviledges. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
This is relatively large software; I don't actually have a complete view of it myself. However some of the things it does include the following: - The cluster management software tracks ownership of sysv shm segments, potentially deleting them if they have been abandoned. /proc/<pid>/maps files are used as an input to these decisions. - There is a separate module making numa aware decisions based on /proc/<pid>/numa_maps information. This drives decisions such as moving data between nodes (we have not been pushing the corresponding kernel patches yet). Using down_read_critical() here would be desirable too, but not as critical as for other parts of the system (numa awareness is not the biggest worry when machines get under enough memory pressure that reading numa_maps starts blocking). -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
ipcs manages to give you that information in its nattch column, I believe, without touching /proc/pid/maps. David --
