Re: [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic

From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today

The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of

	per_cpu_ptr(x, smp_processor_id())

The patchset introduces a way to calculate the address using the offset
that is available in arch specific ways (register or special memory
locations) using

	this_cpu_ptr(x)

In addition macros are provided that can operate on per cpu
variables in a per cpu atomic way. With that scalars in structures
allocated with the new percpu allocator can be modified without disabling
preempt or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.

For example,

	this_cpu_add(x->var, 20)

can be used to generate an instruction that uses a segment register for the
relocation of the per cpu address into the per cpu area of the ...
From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   28 ----------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 28 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 18:19:20.000000000 -0500
@@ -906,6 +906,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 18:19:21.000000000 -0500
@@ -3108,27 +3108,6 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3154,13 +3133,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 ...
From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

V4->V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  187 ++++++++++++++++++-------------------------------
 mm/vmstat.c            |   14 ++-
 4 files changed, 81 insertions(+), 136 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-07 14:48:09.000000000 -0500
@@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: ...
From: Tejun Heo
Date: Thursday, October 8, 2009 - 3:38 am

Hello, Christoph.


This looks much better but I'm not sure whether it's safe.  On most
archs percpu offsets are not set up until setup_per_cpu_areas() is
complete.  But if all that's necessary is getting the page allocator
up and running as soon as static per cpu areas and offsets are set up
(which basically means as soon as cpu init is complete on ia64 and
setup_per_cpu_areas() is complete on all other archs), this should be
correct.  Is this what you're expecting?

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Thursday, October 8, 2009 - 3:40 am

Also, as I'm not very familiar with the code, I'd really appreciate
Mel Gorman's Acked-by or Reviewed-by.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Thursday, October 8, 2009 - 9:15 am

paging_init() is called after the per cpu areas have been initialized. So
I thought this would be safe. Tested it on x86.

zone_pcp_init() only sets up the per cpu pointers to the pagesets. That
works regardless of the boot stage. Then build_all_zonelists()
initializes the actual contents of the per cpu variables.

Finally the per cpu pagesets are allocated from the percpu allocator when
all allocators are up and the pagesets are sized.


--

From: Mel Gorman
Date: Thursday, October 8, 2009 - 3:53 am

I haven't tested the patch series but it now looks good to my eyes at
least. Thanks


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These could get quite big if we have to support systems
with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configuration shrinks since we will only
   need 1 pointer instead of NR_CPUS. The same pointer can be used by all
   processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
   system meaning less memory overhead for configurations that may potentially
   support up to 1k NUMA nodes / 4k cpus.

3. We can remove the diddle widdle with allocating and releasing of
   kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
   alloc logic will do it all for us. Removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  207 ++++++++++-------------------------------------
 2 files changed, 49 insertions(+), 164 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-17 17:51:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 09:02:05.000000000 -0500
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct ...
From: Tejun Heo
Date: Monday, October 12, 2009 - 3:19 am

Hello,


Shouldn't this be this_cpu_ptr() without the double underscore?

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, October 12, 2009 - 3:21 am

Oh... there are similar conversions in slab_alloc() and slab_free() too.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Monday, October 12, 2009 - 7:54 am

Interrupts are disabled so no concurrent fast path can occur.


--

From: Tejun Heo
Date: Monday, October 12, 2009 - 7:13 pm

The only difference between this_cpu_ptr() and __this_cpu_ptr() is the
usage of my_cpu_offset and __my_cpu_offset which in turn are only
different in whether they check preemption status to make sure the cpu
is pinned down when called.

The only places where the underbar prefixed versions should be used
are places where cpu locality is nice but not critical and preemption
debug check wouldn't work properly for whatever reason.  The above is
none of the two and the conversion is buried in a patch which is
supposed to do something else.  Am I missing something?

-- 
tejun
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 7:41 am

I used __this_cpu_* whenever the context is already providing enough
safety that preempt disable or irq disable would not matter. The use of
__this_cpu_ptr was entirely for consistent usage here. this_cpu_ptr would
be safer because it has additional checks that preemption really is
disabled. So if someone gets confused about logic flow later it can be
detected.
--

From: Tejun Heo
Date: Tuesday, October 13, 2009 - 7:56 am

Yeah, widespread use of underscored versions isn't very desirable.
The underscored versions should notify certain specific exceptional
conditions instead of being used as general optimization (which
doesn't make much sense after all as the optimization is only
meaningful with debug option turned on).  Are you interested in doing
a sweeping patch to drop underscores from __this_cpu_*() conversions?

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 8:20 am

Nope. __this_cpu_add/dec cannot be converted.

__this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are
useful there too to show that we are in a preempt section.

The calls to raw_smp_processor_id and smp_processor_id() are only useful
in the fallback case. There is no need for those if the arch has a way to
provide the current percpu offset. So we in effect have two meanings of __
right now.

1. We do not care about the preempt state (thus we call
raw_smp_processor_id so that the preempt state does not trigger)

2. We do not need to disable preempt before the operation.

__this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2.

--

From: Tejun Heo
Date: Tuesday, October 13, 2009 - 6:57 pm

Hello, Christoph.



That doesn't make much sense.  __ for this_cpu_ptr() means "bypass
sanity check, we're knowingly violating the required conditions" not

Yeah, we need to clean it up.  The naming is too confusing.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Wednesday, October 14, 2009 - 7:14 am

It's consistent if __ means both 1 and 2. If we want to distinguish it then
we may want to create raw_this_cpu_xx which means that we do not call
smp_processor_id() on fallback but raw_smp_processor_id(). Does not
matter if the arch provides a per cpu offset.

This would mean duplicating all the macros. The use of raw_this_cpu_xx
should be rare so maybe the best approach is to say that __ means only
that the macro does not need to disable preempt but it still checks for
preemption being off. Then audit the __this_cpu_xx uses and see if there
are any that require a raw_ variant.

The vm event counters require both no check and no preempt since they can
be implemented in a racy way.




--

From: Tejun Heo
Date: Thursday, October 15, 2009 - 12:47 am

I was basically stating the difference between raw_smp_processor_id()
and smp_processor_id() which I thought applied the same to

The biggest grief I have is that the meaning of __ is different among
different accessors.  If that can be cleared up, we would be in much
better shape without adding any extra macros.  Can we just remove all
__'s and use meaningful pre or suffixes like raw or irq or whatever?

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Friday, October 16, 2009 - 9:44 am

It does apply. __this_cpu_ptr does not use smp_processor_id() but
raw_smp_processor_id(). this_cpu_ptr does not need to disable preempt so

It currently means that we do not deal with preempt and do not check for
preemption. That is consistent.

Sure we could change the API to have even more macros than the large
amount it already has so that we can check for proper preempt disablement.

I guess that would mean adding

raw_nopreempt_this_cpu_xx  and nopreempt_this_cpu_xx variants? The thing
gets huge. I think we could just leave it. __ suggests that serialization
and checking is not performed like in the full versions and that is true.
--

From: Tejun Heo
Date: Saturday, October 17, 2009 - 8:11 pm

Hello, Christoph.



I don't think we'll need to add new variants.  Just renaming existing
ones so that they have more specific pre/suffix should make things
clearer.  I'll give a shot at that once the sparse annotation patchset
is merged.

Thanks.

-- 
tejun
--

From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

Use this_cpu_* operations in the hotpath to avoid calculations of
kmem_cache_cpu pointer addresses.

On x86 there is a trade off: multiple uses of segment prefixes against an
address calculation and more register pressure. Code size is also reduced,
which is an advantage icache wise.

The use of prefixes is necessary if we want to use a scheme
for fastpaths that do not require disabling interrupts.

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   80 ++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 39 insertions(+), 41 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-07 14:52:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-07 14:58:50.000000000 -0500
@@ -1512,10 +1512,10 @@ static void flush_all(struct kmem_cache 
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int node_match(struct kmem_cache *s, int node)
 {
 #ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
+	if (node != -1 && __this_cpu_read(s->cpu_slab->node) != node)
 		return 0;
 #endif
 	return 1;
@@ -1603,46 +1603,46 @@ slab_out_of_memory(struct kmem_cache *s,
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void **object;
-	struct page *new;
+	struct page *page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
-	if (!c->page)
+	if (!page)
 		goto new_slab;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	slab_lock(page);
+	if (unlikely(!node_match(s, node)))
 		goto ...
From: Tejun Heo
Date: Monday, October 12, 2009 - 3:40 am

The rest of the patches look good to me but I'm no expert in this area
of code.  But you're the maintainer of the allocator and the changes
definitely are percpu related, so if you're comfortable with it, I can
happily carry the patches through percpu tree.

Thanks.

-- 
tejun
--

From: Pekka Enberg
Date: Monday, October 12, 2009 - 6:14 am

The patch looks sane to me but the changelog contains no relevant
numbers on performance. I am fine with the patch going in -percpu but
the patch probably needs some more beating performance-wise before it
can go into .33. I'm CC'ing some more people who are known to do SLAB
performance testing just in case they're interested in looking at the
patch. In any case,

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

                        Pekka
--

From: Christoph Lameter
Date: Monday, October 12, 2009 - 7:55 am

I am warming up my synthetic in kernel tests right now. Hope I have
something by tomorrow.

--

From: David Rientjes
Date: Tuesday, October 13, 2009 - 2:45 am

I ran 60-second netperf TCP_RR benchmarks with various thread counts over 
two machines, both four quad-core Opterons.  I ran the trials ten times 
each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this 
patchset.  The transfer rates were virtually identical showing no 
improvement or regression with this patchset in this benchmark.

 [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472, 
   this benchmark continues to be the most significant regression slub has 
   compared to slab. ]
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 7:43 am

Hmmm... Last time I ran the in kernel benchmarks this showed a reduction
in cycle counts. Did not get to run my tests yet.

Can you also try the irqless hotpath?

--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 12:14 pm

Here are some cycle numbers w/o the slub patches and with. I will post the
full test results and the patches to do these in kernel tests in a new
thread. The regression may be due to caching behavior of SLUB that will
not change with these patches.

Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is
being used. The first test does 10000 kmallocs and then frees them all.
The second test allocs one and frees one, and does that 10000 times.

no this_cpu ops

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles


with this_cpu_xx

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles
10000 times kmalloc(32) ...
From: Pekka Enberg
Date: Tuesday, October 13, 2009 - 12:44 pm

Hi Christoph,


I wonder how reliable these numbers are. We did similar testing a while 
back because we thought kmalloc-96 caches had weird cache behavior but 
finally figured out the anomaly was explained by the order of the tests 
run, not cache size.


Notice the jump from 32 to 64 and then back to 64. One would expect we 
see linear increase as object size grows as we hit the page allocator 



If there's 50% improvement in the kmalloc() path, why does the 
this_cpu() version seem to be roughly as fast as the mainline version?

			Pekka
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 12:48 pm

Well you need to look behind these numbers to see when the allocator uses

64 is the cacheline size for the machine. At that point you have the
advantage of no overlapping data between different allocations and the

It's not that the kmalloc() is faster. The instructions used for the
fastpath take fewer cycles. Other components figure into the total
latency as well.

16k allocations for example are not handled by slub anymore. Fastpath has
no effect. The win there is just the improved percpu handling in the page
allocator.

I have some numbers here for irqless which drops another half of the
fastpath latency (and it adds some code to the slow path, sigh):

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 251 cycles
10000 times kmalloc(16) -> 201 cycles kfree -> 261 cycles
10000 times kmalloc(32) -> 220 cycles kfree -> 261 cycles
10000 times kmalloc(64) -> 186 cycles kfree -> 224 cycles
10000 times kmalloc(128) -> 205 cycles kfree -> 125 cycles
10000 times kmalloc(256) -> 351 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 330 cycles kfree -> 310 cycles
10000 times kmalloc(1024) -> 416 cycles kfree -> 419 cycles
10000 times kmalloc(2048) -> 537 cycles kfree -> 439 cycles
10000 times kmalloc(4096) -> 458 cycles kfree -> 594 cycles
10000 times kmalloc(8192) -> 810 cycles kfree -> 678 cycles
10000 times kmalloc(16384) -> 879 cycles kfree -> 746 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 66 cycles
10000 times kmalloc(16)/kfree -> 187 cycles
10000 times kmalloc(32)/kfree -> 116 cycles
10000 times kmalloc(64)/kfree -> 107 cycles
10000 times kmalloc(128)/kfree -> 115 cycles
10000 times kmalloc(256)/kfree -> 65 cycles
10000 times kmalloc(512)/kfree -> 66 cycles
10000 times kmalloc(1024)/kfree -> 206 cycles
10000 times kmalloc(2048)/kfree -> 65 cycles
10000 times kmalloc(4096)/kfree -> 193 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 976 cycles




--

From: David Rientjes
Date: Tuesday, October 13, 2009 - 1:15 pm

With the netperf -t TCP_RR -l 60 benchmark I ran, CONFIG_SLUB_STATS shows 
the allocation fastpath is utilized quite a bit for a couple of key 
caches:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

For an optimized fastpath, I'd expect such a workload would result in at 
least a slightly higher transfer rate.

I'll try the irqless patch, but this particular benchmark may not 
appropriately demonstrate any performance gain because of the added code 
in the also significantly-used slowpath.
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 1:28 pm

There will be no improvements if the load is dominated by the
instructions in the network layer or caching issues. None of that is
changed by the patch. It only reduces the cycle count in the fastpath.


--

From: David Rientjes
Date: Tuesday, October 13, 2009 - 3:53 pm

Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the 
same workload so it shows that the slab allocator does have an impact in 
transfer rate.  I understand that the performance gain with this patchset, 
however, may not be representative with the benchmark since it also 
frequently uses the slowpath for kmalloc-256 about 25% of the time and the 
added code of the irqless patch may mask the fastpath gain.
--

From: Mel Gorman
Date: Wednesday, October 14, 2009 - 6:34 am

I have a bit more detailed results based on the following machine

CPU type:	AMD Phenom 9950
CPU counts:	1 CPU (4 cores)
CPU Speed:	1.3GHz
Motherboard:	Gigabyte GA-MA78GM-S2H
Memory:		8GB

The reference kernel used is mmotm-2009-10-09-01-07. The patches applied
are the patches in this thread. The headings are a bit munged but it's

SLUB-vanilla	where vanilla is mmotm-2009-10-09-01-07
SLUB-this-cpu	mmotm-2009-10-09-01-07 + patches in this thread
SLAB-*		same as above but SLAB configured instead of SLUB.
		I know it wasn't necessary to run SLAB-this-cpu but
		it gives an idea to what degree results can vary
		between reboots even if results are stable once the
		machine is running.

The benchmarks run were kernbench, netperf UDP_STREAM and TCP_STREAM and
sysbench with postgres.

Kernbench is 5 kernel compiles and an average taken. One kernel compile
is done at the start to warm the benchmark up and this result is
discarded.

Netperf is the _STREAM tests as opposed to the _RR tests reported
elsewhere. No special effort is made to bind processes to any particular
CPU. The reported results aim for 99% confidence that the estimated
mean is within 1% of the true mean. Results where netperf failed to
achieve the necessary confidence are marked with a * and the line after
such a result states how close, as a percentage, the estimated mean is
to the true mean. The test is run with different packet sizes.

Sysbench is a read-only test (to avoid IO) and is the "complex"
workload. The test is run with varying numbers of threads.

In all the results, SLUB-vanilla is the reference baseline. This allows
a comparison between SLUB-vanilla and SLAB-vanilla as well as with the
patches applied.

                  SLUB-vanilla     SLUB-this-cpu      SLAB-vanilla     SLAB-this-cpu
Elapsed min       92.95 ( 0.00%)    92.62 ( 0.36%)    92.93 ( 0.02%)    92.62 ( 0.36%)
Elapsed mean      ...
From: Christoph Lameter
Date: Wednesday, October 14, 2009 - 7:08 am

The test did not include the irqless patch I hope?


The queuing in SLAB allows a better cache hot behavior. Without a queue
SLUB has a difficult time improvising cache hot behavior based on objects
restricted to a slab page. Therefore the size of the slab page will

SLUB offloads allocations > 8k to the page allocator.
SLAB does create large slabs.

--

From: Mel Gorman
Date: Wednesday, October 14, 2009 - 8:49 am

Allocations >8k might then explain why UDP_STREAM performance suffers
for 8K and 16K packets. That can be marked as possible future work to
sort out within the allocator.

However, does it explain why TCP_STREAM suffers so badly even for packet
sizes like 2K? It's also important to note in some cases, SLAB was far
slower even when the packet sizes were greater than 8k so I don't think
the page allocator is an adequate explanation for TCP_STREAM.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Pekka Enberg
Date: Wednesday, October 14, 2009 - 8:53 am

Hi Mel,


SLAB is able to queue lots of large objects but SLUB can't do that 
because it has no queues. In SLUB, each CPU gets a page assigned to it 
that serves as a "queue" but the size of the queue gets smaller as 
object size approaches page size.

We try to offset that with higher order allocations but IIRC we don't 
increase the order linearly with object size and cap it to some 
reasonable maximum.

			Pekka
--

From: Christoph Lameter
Date: Wednesday, October 14, 2009 - 8:56 am

You can test to see if larger pages have an influence by passing

slub_max_order=6

or so on the kernel command line.

You can force a large page use in slub by setting

slub_min_order=3

for example.

Or you can force a minimum number of objects in slub through, for example,

slub_min_objects=50



slub_max_order=6 slub_min_objects=50

should result in pretty large slabs with lots of in page objects that
allow slub to queue better.




--

From: Pekka Enberg
Date: Wednesday, October 14, 2009 - 9:14 am

Hi Christoph,


On Wed, Oct 14, 2009 at 6:56 PM, Christoph Lameter

Yeah, that should help but it's probably not something we can do for
mainline. I'm not sure how we can fix SLUB to support large objects
out-of-the-box as efficiently as SLAB does.

                        Pekka
--

From: Christoph Lameter
Date: Wednesday, October 14, 2009 - 11:19 am

We could add a per cpu "queue" through a pointer array in kmem_cache_cpu.
Which is more SLQB than SLUB.

--

From: Mel Gorman
Date: Friday, October 16, 2009 - 3:50 am

Here are the results of that suggestion. They are side-by-side with the
other results so the columns are

SLUB-vanilla		No other patches applied, SLUB configured
vanilla-highorder	No other patches + slub_max_order=6 slub_min_objects=50
SLUB-this-cpu		The patches in this set applied
this-cpu-highorder	These patches + slub_max_order=6 slub_min_objects=50
SLAB-vanilla		No other patches, SLAB configured
SLAB-this-cpu		These patches, SLAB configured

                  SLUB-vanilla   vanilla-highorder     SLUB-this-cpu  this-cpu-highorder    SLAB-vanilla     SLAB-this-cpu
Elapsed min       92.95 ( 0.00%)    92.64 ( 0.33%)    92.62 ( 0.36%)    92.77 ( 0.19%)    92.93 ( 0.02%)    92.62 ( 0.36%)
Elapsed mean      93.11 ( 0.00%)    92.89 ( 0.24%)    92.74 ( 0.40%)    92.82 ( 0.31%)    93.00 ( 0.13%)    92.82 ( 0.32%)
Elapsed stddev     0.10 ( 0.00%)     0.15 (-58.74%)     0.14 (-40.55%)     0.09 ( 7.73%)     0.04 (55.47%)     0.18 (-84.33%)
Elapsed max       93.20 ( 0.00%)    93.04 ( 0.17%)    92.95 ( 0.27%)    92.98 ( 0.24%)    93.05 ( 0.16%)    93.09 ( 0.12%)
User    min      323.21 ( 0.00%)   323.38 (-0.05%)   322.60 ( 0.19%)   323.26 (-0.02%)   322.50 ( 0.22%)   323.26 (-0.02%)
User    mean     323.81 ( 0.00%)   323.64 ( 0.05%)   323.20 ( 0.19%)   323.56 ( 0.08%)   323.16 ( 0.20%)   323.54 ( 0.08%)
User    stddev     0.40 ( 0.00%)     0.38 ( 4.24%)     0.46 (-15.30%)     0.27 (33.20%)     0.48 (-20.92%)     0.29 (26.07%)
User    max      324.32 ( 0.00%)   324.30 ( 0.01%)   323.72 ( 0.19%)   323.96 ( 0.11%)   323.86 ( 0.14%)   323.98 ( 0.10%)
System  min       35.95 ( 0.00%)    35.33 ( 1.72%)    35.50 ( 1.25%)    35.95 ( 0.00%)    35.35 ( 1.67%)    36.01 (-0.17%)
System  mean      36.30 ( 0.00%)    35.99 ( 0.87%)    35.96 ( 0.96%)    36.20 ( 0.28%)    36.17 ( 0.36%)    36.23 ( 0.21%)
System  stddev     0.25 ( 0.00%)     0.41 (-59.25%)     0.45 (-75.60%)     0.15 (41.61%)     0.56 (-121.14%)     0.14 (46.14%)
System  max       36.65 ( 0.00%)    36.44 ( 0.57%)    36.67 (-0.05%)    36.32 ( 0.90%)   ...
From: David Rientjes
Date: Friday, October 16, 2009 - 11:40 am

This is understandable considering the statistics that I posted for this 
workload on my machine: higher order cpu slabs will naturally be freed to 
more often from the fastpath, which also causes the allocation fastpath to 
be used more often (where we can see the optimization of this patchset), 
in addition to avoiding partial list handling.

The pain with the smaller packet sizes is probably the overhead from the 
page allocator more than slub, a characteristic that also caused the 
TCP_RR benchmark to suffer.  It can be mitigated somewhat with slab 
preallocation or a higher min_partial setting, but that's probably not an 
optimal solution.
--

From: David Rientjes
Date: Thursday, October 15, 2009 - 2:03 am

TCP_STREAM stresses a few specific caches:

		ALLOC_FASTPATH	ALLOC_SLOWPATH	FREE_FASTPATH	FREE_SLOWPATH
kmalloc-256	3868530		3450592		95628		7223491
kmalloc-1024	2440434		429		2430825		10034
kmalloc-4096	3860625		1036723		85571		4811779

This demonstrates that freeing to full (or partial) slabs causes a lot of 
pain since the fastpath normally can't be utilized. Fixing that is probably 
beyond the scope of this patchset.
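
To put numbers on this, the share of operations served by the fastpath can
be computed from the counters above. A minimal userspace sketch; struct
slub_stats, tcp_stream[] and fast_share() are hypothetical names used for
illustration only, not kernel interfaces:

```c
#include <assert.h>

/* One row of the statistics table above. */
struct slub_stats {
	const char *cache;
	unsigned long alloc_fastpath, alloc_slowpath;
	unsigned long free_fastpath, free_slowpath;
};

/* Counter values copied from the TCP_STREAM table in this mail. */
static const struct slub_stats tcp_stream[] = {
	{ "kmalloc-256",  3868530, 3450592,   95628, 7223491 },
	{ "kmalloc-1024", 2440434,     429, 2430825,   10034 },
	{ "kmalloc-4096", 3860625, 1036723,   85571, 4811779 },
};

/* Percentage of operations that took the fastpath. */
static double fast_share(unsigned long fast, unsigned long slow)
{
	assert(fast + slow > 0);	/* need at least one operation */
	return 100.0 * (double)fast / (double)(fast + slow);
}
```

For kmalloc-256 and kmalloc-4096 fewer than 2% of frees take the fastpath,
while for kmalloc-1024 more than 99% of them do.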

It's also different from the cpu slab thrashing issue I identified with 
the TCP_RR benchmark and had a patchset to somewhat improve.  The 
criticism was the addition of an increment to a fastpath counter in struct 
kmem_cache_cpu which could probably now be much cheaper with these 
optimizations.
--

From: Christoph Lameter
Date: Friday, October 16, 2009 - 9:45 am

Can you redo the patch?

--

From: David Rientjes
Date: Friday, October 16, 2009 - 11:43 am

Sure, but it would be even more inexpensive if we can figure out why the 
irqless patch is hanging my netserver machine within the first 60 seconds 
on the TCP_RR benchmark.  I guess nobody else has reproduced that yet.
--

From: Christoph Lameter
Date: Friday, October 16, 2009 - 11:50 am

Nope. Sorry. I have tried running some tests but so far nothing.

--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 1:25 pm

The tests were all run directly after booting the respective kernel.
--

From: David Rientjes
Date: Tuesday, October 13, 2009 - 6:33 pm

v6 of your patchset applied to percpu#for-next now at dec54bf "this_cpu: 
Use this_cpu_xx in trace_functions_graph.c" works fine, but when I apply 
the irqless patch from http://marc.info/?l=linux-kernel&m=125503037213262 
it hangs my netserver machine within the first 60 seconds when running 
this benchmark.  These kernels both include the fixes to kmem_cache_open() 
and dma_kmalloc_cache() you posted earlier.  I'll have to debug why that's 
happening before collecting results.
--

From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc_cpu structures instead.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:43:18.000000000 -0500
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:43:18.000000000 -0500
@@ -2090,7 +2090,7 @@ static inline int ...
From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 11:48 am

Slight bug when creating kmalloc dma caches on the fly. When searching for
an unused statically allocated kmem_cache structure we need to check for
size == 0 not the other way around.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-13 13:31:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-13 13:31:36.000000000 -0500
@@ -2650,7 +2650,7 @@ static noinline struct kmem_cache *dma_k

 	s = NULL;
 	for (i = 0; i < KMALLOC_CACHES; i++)
-		if (kmalloc_caches[i].size)
+		if (!kmalloc_caches[i].size)
 			break;

 	BUG_ON(i >= KMALLOC_CACHES);
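
The corrected search can be illustrated outside the kernel. This is a
minimal userspace sketch, assuming only that size == 0 marks an unused
entry as described above; struct slot and find_unused_slot() are
hypothetical stand-ins for struct kmem_cache and the loop in the hunk:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct kmem_cache: size == 0 means the slot is unused. */
struct slot {
	size_t size;
};

/*
 * Return the index of the first unused slot, or -1 if every slot is
 * taken. The kernel code uses BUG_ON() for the "none free" case.
 */
static int find_unused_slot(const struct slot *slots, int n)
{
	int i;

	assert(slots != NULL);
	for (i = 0; i < n; i++)
		if (!slots[i].size)	/* the fix: stop at size == 0 */
			break;

	return i < n ? i : -1;
}
```

With the original test (stopping when size is non-zero) the loop would
stop at the first slot that is already in use, so a live cache would be
picked for reinitialization.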

--

From: cl
Date: Wednesday, October 7, 2009 - 2:10 pm

this_cpu_inc() translates into a single instruction on x86 and does not
need any register. So use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer. So pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, ...
From: Mel Gorman
Date: Tuesday, October 13, 2009 - 8:40 am

FWIW, this fails to boot on latest mmotm on x86-64 even though the patches
apply. It fails to create basic slab caches like kmalloc-64.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 8:45 am

There was a fixup patch for one of the slub patches. Was that merged?

--

From: Mel Gorman
Date: Tuesday, October 13, 2009 - 9:09 am

No. I missed it without the change in subject line and had just exported
the thread series itself. Sorry.

I might have something useful on this in the morning assuming no other
PEBKAC-related messes.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, October 13, 2009 - 10:17 am

I am stuck too. Sysfs is screwed up somehow and triggers the
hangcheck timer.

--
