this cpu operations can be used to slightly optimize the function. The
changes will avoid some address calculations and replace them with the
use of the percpu segment register.
If one would have this_cpu_inc_return and this_cpu_dec_return then it
would be possible to optimize inc_zone_page_state and dec_zone_page_state even
more.
V1->V2:
- Fix __dec_zone_state overflow handling
- Use s8 variables for temporary storage.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/vmstat.c | 56 ++++++++++++++++++++++++++++++++------------------------
1 file changed, 32 insertions(+), 24 deletions(-)
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2010-11-24 13:38:34.000000000 -0600
+++ linux-2.6/mm/vmstat.c 2010-11-24 15:03:08.000000000 -0600
@@ -167,18 +167,20 @@ static void refresh_zone_stat_thresholds
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
{
- struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-
- s8 *p = pcp->vm_stat_diff + item;
+ struct per_cpu_pageset * __percpu pcp = zone->pageset;
+ s8 * __percpu p = pcp->vm_stat_diff + item;
long x;
+ long t;
+
+ x = delta + __this_cpu_read(*p);
- x = delta + *p;
+ t = __this_cpu_read(pcp->stat_threshold);
- if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
+ if (unlikely(x > t || x < -t)) {
zone_page_state_add(x, zone, item);
x = 0;
}
- *p = x;
+ __this_cpu_write(*p, x);
}
EXPORT_SYMBOL(__mod_zone_page_state);
@@ -221,16 +223,19 @@ EXPORT_SYMBOL(mod_zone_page_state);
*/
void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
- s8 *p = pcp->vm_stat_diff + item;
-
- (*p)++;
+ struct per_cpu_pageset * __percpu pcp = zone->pageset;
+ s8 * __percpu p = pcp->vm_stat_diff + item;
+ s8 v, t;
+
+ __this_cpu_inc(*p);
+
+ v = __this_cpu_read(*p);
+ t = ...Reviewed-by: Pekka Enberg <penberg@kernel.org> --
Then we might want to directly target the implementation with this_cpu_add_return/this_cpu_sub_return (you implement these in patch 03), which would not need to disable preemption on the fast path. I think we already discussed this in the past. The reason eludes me at the moment, but I remember discussing that changing the increment/decrement delta to the nearest powers of two would let us deal with overflow cleanly. But it's probably too early in the morning for me to wrap my head around the issue at the moment. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
We still would need to disable preemption because multiple this_cpu_ops are needed. The only way to avoid preemption would be if the modifications to the differential and the adding to the per zone counters would be per cpu atomic. --
We could do this with local cmpxchgs like in the following patch. This
would avoid preemption disable and interrupt disable (at least on x86).
Trouble is how do we make this fit for architectures that do not have
cmpxchg?
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2010-11-29 10:58:52.000000000 -0600
+++ linux-2.6/mm/vmstat.c 2010-11-29 11:11:34.000000000 -0600
@@ -169,18 +169,23 @@ void __mod_zone_page_state(struct zone *
{
struct per_cpu_pageset __percpu *pcp = zone->pageset;
s8 __percpu *p = pcp->vm_stat_diff + item;
- long x;
- long t;
+ long o, n, t, z;
- x = delta + __this_cpu_read(*p);
+ do {
+ z = 0;
+ t = this_cpu_read(pcp->stat_threshold);
+ o = this_cpu_read(*p);
+ n = delta + o;
+
+ if (n > t || n < -t) {
+ /* Overflow must be added to zone counters */
+ z = n;
+ n = 0;
+ }
+ } while (o != n && this_cpu_cmpxchg(*p, o, n) != o);
- t = __this_cpu_read(pcp->stat_threshold);
-
- if (unlikely(x > t || x < -t)) {
- zone_page_state_add(x, zone, item);
- x = 0;
- }
- __this_cpu_write(*p, x);
+ if (z)
+ zone_page_state_add(z, zone, item);
}
EXPORT_SYMBOL(__mod_zone_page_state);
@@ -190,11 +195,7 @@ EXPORT_SYMBOL(__mod_zone_page_state);
void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
{
- unsigned long flags;
-
- local_irq_save(flags);
__mod_zone_page_state(zone, item, delta);
- local_irq_restore(flags);
}
EXPORT_SYMBOL(mod_zone_page_state);
--
All architectures should have a fallback nowadays, no ? This might involve disabling interrupts around a cmpxchg emulation, which would make the slow path disable/enable interrupts twice. Is it what you are concerned about ? Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
We are adding new per cpu atomic functionality here and we are establishing the fallbacks as we go. Fallbacks are to an cmpxchg emulation using interrupt disable etc of course but the performance may be lower than the current version. I need to get some numbers to assess the impact. In the meantime I have a full cmpxchg based vmstat implementation here that seems to work without a problem: Subject: vmstat: User per cpu atomics to avoid interrupt and preempt disable Currently the operations to increment vm counters must disable preempt and/or interrupts in order to not mess up their housekeeping of counters. Both measures have disadvantages. Interrupt disable causes a lot of overhead. Disabling preemption is something that the RT folks do not like. So use this_cpu_cmpxchg() to avoid any of those things. The fetching of the counter thresholds is racy. A threshold from another cpu may be applied if we happen to be rescheduled on another cpu. However, the following vmstat operation will then bring the counter again under the threshold limit. Signed-off-by: Christoph Lameter <cl@linux.com> --- mm/vmstat.c | 145 +++++++++++++++++++++++++++++++++++------------------------- 1 file changed, 86 insertions(+), 59 deletions(-) Index: linux-2.6/mm/vmstat.c =================================================================== --- linux-2.6.orig/mm/vmstat.c 2010-11-29 12:31:20.000000000 -0600 +++ linux-2.6/mm/vmstat.c 2010-11-29 12:36:42.000000000 -0600 @@ -162,27 +162,87 @@ static void refresh_zone_stat_thresholds } /* - * For use when we know that interrupts are disabled. + * mod_state() modifies the zone counter state through atomic per cpu + * operations. + * + * Overstep mode specifies how overstep should handled: + * 0 No overstepping + * 1 Overstepping half of threshold + * -1 Overstepping minus half of threshold */ -void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, - int delta) +static inline void mod_state(struct zone ...
Some numbers for the cmpxchg implementation (cycles) Function Orig Cmpxchg Fallback -------------------------------------------------- inc_zone_page_state 170 32 196 __inc_zone_page_s 23 31 26 inc/dec pairs 347 69 379 So the fallback is always worse. cmpxchg is only better for the versions of zone counters where we need to disable and enable interrupts. This would suggest to only use the cmpxchg for arches that have cmpxchg local and only in the case of the full versions. There are 12 cases of inc_zone_page_state and 13 of dec_zone_page_state as well as 7 cases of mod_zone_page_state. --
