Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations

Previous thread: [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return by Christoph Lameter on Friday, November 26, 2010 - 2:09 pm. (3 messages)

Next thread: [thisops uV2 10/10] slub: Lockless fastpaths by Christoph Lameter on Friday, November 26, 2010 - 2:09 pm. (1 message)
From: Christoph Lameter
Date: Friday, November 26, 2010 - 2:09 pm

this cpu operations can be used to slightly optimize the function. The
changes will avoid some address calculations and replace them with the
use of the percpu segment register.

If one would have this_cpu_inc_return and this_cpu_dec_return then it
would be possible to optimize inc_zone_page_state and dec_zone_page_state even
more.

V1->V2:
	- Fix __dec_zone_state overflow handling
	- Use s8 variables for temporary storage.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/vmstat.c |   56 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-24 13:38:34.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-24 15:03:08.000000000 -0600
@@ -167,18 +167,20 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-
-	s8 *p = pcp->vm_stat_diff + item;
+	struct per_cpu_pageset * __percpu pcp = zone->pageset;
+	s8 * __percpu p = pcp->vm_stat_diff + item;
 	long x;
+	long t;
+
+	x = delta + __this_cpu_read(*p);
 
-	x = delta + *p;
+	t = __this_cpu_read(pcp->stat_threshold);
 
-	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
+	if (unlikely(x > t || x < -t)) {
 		zone_page_state_add(x, zone, item);
 		x = 0;
 	}
-	*p = x;
+	__this_cpu_write(*p, x);
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -221,16 +223,19 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-	s8 *p = pcp->vm_stat_diff + item;
-
-	(*p)++;
+	struct per_cpu_pageset * __percpu pcp = zone->pageset;
+	s8 * __percpu p = pcp->vm_stat_diff + item;
+	s8 v, t;
+
+	__this_cpu_inc(*p);
+
+	v = __this_cpu_read(*p);
+	t = ...
From: Pekka Enberg
Date: Saturday, November 27, 2010 - 1:00 am

Reviewed-by: Pekka Enberg <penberg@kernel.org>
--

From: Mathieu Desnoyers
Date: Saturday, November 27, 2010 - 7:49 am

Then we might want to directly target the implementation with
this_cpu_add_return/this_cpu_sub_return (you implement these in patch 03), which
would not need to disable preemption on the fast path. I think we already
discussed this in the past. The reason eludes me at the moment, but I remember
discussing that changing the increment/decrement delta to the nearest powers of
two would let us deal with overflow cleanly. But it's probably too early in the
morning for me to wrap my head around the issue at the moment.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Christoph Lameter
Date: Monday, November 29, 2010 - 9:16 am

We still would need to disable preemption because multiple this_cpu_ops
are needed. The only way to avoid preemption would be if the modifications
to the differential and the adding to the per zone counters would be
per cpu atomic.


--

From: Christoph Lameter
Date: Monday, November 29, 2010 - 10:13 am

We could do this with local cmpxchgs like in the following patch. This
would avoid preemption disable and interrupt disable (at least on x86).
Trouble is how do we make this fit for architectures that do not have
cmpxchg?


Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-29 10:58:52.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-29 11:11:34.000000000 -0600
@@ -169,18 +169,23 @@ void __mod_zone_page_state(struct zone *
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long x;
-	long t;
+	long o, n, t, z;

-	x = delta + __this_cpu_read(*p);
+	do {
+		z = 0;
+		t = this_cpu_read(pcp->stat_threshold);
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			/* Overflow must be added to zone counters */
+			z = n;
+			n = 0;
+		}
+	} while (o != n && this_cpu_cmpxchg(*p, o, n) != o);

-	t = __this_cpu_read(pcp->stat_threshold);
-
-	if (unlikely(x > t || x < -t)) {
-		zone_page_state_add(x, zone, item);
-		x = 0;
-	}
-	__this_cpu_write(*p, x);
+	if (z)
+		zone_page_state_add(z, zone, item);
 }
 EXPORT_SYMBOL(__mod_zone_page_state);

@@ -190,11 +195,7 @@ EXPORT_SYMBOL(__mod_zone_page_state);
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 					int delta)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
 	__mod_zone_page_state(zone, item, delta);
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(mod_zone_page_state);


--

From: Mathieu Desnoyers
Date: Monday, November 29, 2010 - 12:28 pm

All architectures should have a fallback nowadays, no ? This might involve
disabling interrupts around a cmpxchg emulation, which would make the slow path
disable/enable interrupts twice. Is it what you are concerned about ?

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Christoph Lameter
Date: Monday, November 29, 2010 - 1:07 pm

We are adding new per cpu atomic functionality here and we are
establishing the fallbacks as we go.

Fallbacks are to an cmpxchg emulation using interrupt disable etc of
course but the performance may be lower than the current version. I need
to get some numbers to assess the impact. In the meantime I have a full
cmpxchg based vmstat implementation here that seems to work without a
problem:

Subject: vmstat: User per cpu atomics to avoid interrupt and preempt disable

Currently the operations to increment vm counters must disable preempt and/or
interrupts in order to not mess up their housekeeping of counters. Both measures
have disadvantages. Interrupt disable causes a lot of overhead. Disabling
preemption is something that the RT folks do not like.

So use this_cpu_cmpxchg() to avoid any of those things. The fetching of the
counter thresholds is racy. A threshold from another cpu may be applied
if we happen to be rescheduled on another cpu.  However, the following
vmstat operation will then bring the counter again under the
threshold limit.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/vmstat.c |  145 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 86 insertions(+), 59 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-29 12:31:20.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-29 12:36:42.000000000 -0600
@@ -162,27 +162,87 @@ static void refresh_zone_stat_thresholds
 }

 /*
- * For use when we know that interrupts are disabled.
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *	0	No overstepping
+ *	1	Overstepping half of threshold
+ *	-1	Overstepping minus half of threshold
  */
-void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-				int delta)
+static inline void mod_state(struct zone ...
From: Christoph Lameter
Date: Monday, November 29, 2010 - 1:59 pm

Some numbers for the cmpxchg implementation (cycles)

Function		Orig	Cmpxchg Fallback
--------------------------------------------------
inc_zone_page_state	170	32	196
__inc_zone_page_s	23	31	26
inc/dec pairs		347	69	379

So the fallback is always worse. cmpxchg is only better for the versions
of zone counters where we need to disable and enable interrupts.

This would suggest to only use the cmpxchg for arches that have cmpxchg
local and only in the case of the full versions.

There are 12 cases of inc_zone_page_state and 13 of dec_zone_page_state
as well as 7 cases of mod_zone_page_state.


--