V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.
V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.
V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.
V2->V3:
- Available via git tree against latest upstream from
  git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.
V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today
The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of
  per_cpu_ptr(x, smp_processor_id())
The patchset introduces a way to calculate the address using the offset
that is available in arch specific ways (register or special memory
locations) using
  this_cpu_ptr(x)
In addition macros are provided that can operate on per cpu
variables in a per cpu atomic way. With that scalars in structures
allocated with the new percpu allocator can be modified without disabling
preempt or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.
F.e.
  this_cpu_add(x->var, 20)
can be used to generate an instruction that uses a segment register for the
relocation of the per cpu address into the per cpu area of the ...
Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/page_alloc.c | 28 ----------------------------
mm/vmstat.c | 1 +
2 files changed, 1 insertion(+), 28 deletions(-)
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/vmstat.c 2009-10-06 18:19:20.000000000 -0500
@@ -906,6 +906,7 @@ static int __cpuinit vmstat_cpuup_callba
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
start_cpu_timer(cpu);
+ node_set_state(cpu_to_node(cpu), N_CPU);
break;
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/page_alloc.c 2009-10-06 18:19:21.000000000 -0500
@@ -3108,27 +3108,6 @@ static void setup_pagelist_highmark(stru
pcp->batch = PAGE_SHIFT * 8;
}
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
-{
- int cpu = (long)hcpu;
-
- switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- node_set_state(cpu_to_node(cpu), N_CPU);
- break;
- default:
- break;
- }
- return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
- { &pageset_cpuup_callback, NULL, 0 };
-
/*
* Allocate per cpu pagesets and initialize them.
* Before this call only boot pagesets were available.
@@ -3154,13 +3133,6 @@ void __init setup_per_cpu_pageset(void)
percpu_pagelist_fraction));
}
}
-
- /*
- * The boot cpu is always the first active.
- * The boot node has a processor
- */
- node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
- register_cpu_notifier(&pageset_notifier);
}
...
Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.
Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.
Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.
Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/mm.h | 4 -
include/linux/mmzone.h | 12 ---
mm/page_alloc.c | 187 ++++++++++++++++++-------------------------------
mm/vmstat.c | 14 ++-
4 files changed, 81 insertions(+), 136 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/include/linux/mm.h 2009-10-07 14:48:09.000000000 -0500
@@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo *
extern void si_meminfo_node(struct sysinfo *val, int nid);
extern int after_bootmem;
-#ifdef CONFIG_NUMA
extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
extern void zone_pcp_update(struct zone *zone);
Index: ...
Hello, Christoph. This looks much better but I'm not sure whether it's safe. percpu offsets have not been set up before setup_per_cpu_areas() is complete on most archs, but if all that's necessary is getting the page allocator up and running as soon as static per cpu areas and offsets are set up (which basically means as soon as cpu init is complete on ia64 and setup_per_cpu_areas() is complete on all other archs), this should be correct. Is this what you're expecting? Thanks. -- tejun --
Also, as I'm not very familiar with the code, I'd really appreciate Mel Gorman's acked or reviewed-by. Thanks. -- tejun --
paging_init() is called after the per cpu areas have been initialized. So I thought this would be safe. Tested it on x86. zone_pcp_init() only sets up the per cpu pointers to the pagesets. That works regardless of the boot stage. Then build_all_zonelists() initializes the actual contents of the per cpu variables. Finally the per cpu pagesets are allocated from the percpu allocator when all allocators are up and the pagesets are sized. --
I haven't tested the patch series but it now looks good to my eyes at least. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These could get quite big if we have to support systems
with thousands of cpus. The use of this_cpu_xx operations results in:
1. The size of kmem_cache for SMP configuration shrinks since we will only
need 1 pointer instead of NR_CPUS. The same pointer can be used by all
processors. Reduces cache footprint of the allocator.
2. We can dynamically size kmem_cache according to the actual nodes in the
system meaning less memory overhead for configurations that may potentially
support up to 1k NUMA nodes / 4k cpus.
3. We can remove the diddle widdle with allocating and releasing of
kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
alloc logic will do it all for us. Removes some portions of the cpu hotplug
functionality.
4. Fastpath performance increases since per cpu pointer lookups and
address calculations are avoided.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slub_def.h | 6 -
mm/slub.c | 207 ++++++++++-------------------------------------
2 files changed, 49 insertions(+), 164 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2009-09-17 17:51:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2009-09-29 09:02:05.000000000 -0500
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
* Slab cache management.
*/
struct kmem_cache {
+ struct kmem_cache_cpu *cpu_slab;
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
int remote_node_defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
-#ifdef CONFIG_SMP
- struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
- struct ...
Hello, Shouldn't this be this_cpu_ptr() without the double underscore? Thanks. -- tejun --
Oh... another similar conversions in slab_alloc() and slab_free() too. Thanks. -- tejun --
Interrupts are disabled so no concurrent fast path can occur. --
The only difference between this_cpu_ptr() and __this_cpu_ptr() is the usage of my_cpu_offset and __my_cpu_offset which in turn are only different in whether they check preemption status to make sure the cpu is pinned down when called. The only places where the underbar prefixed versions should be used are places where cpu locality is nice but not critical and preemption debug check wouldn't work properly for whatever reason. The above is none of the two and the conversion is buried in a patch which is supposed to do something else. Am I missing something? -- tejun --
I used __this_cpu_* whenever the context is already providing enough safety that preempt disable or irq disable would not matter. The use of __this_cpu_ptr was entirely for consistent usage here. this_cpu_ptr would be safer because it has additional checks that preemption really is disabled. So if someone gets confused about logic flow later it can be detected. --
Yeah, widespread use of underscored versions isn't very desirable. The underscored versions should notify certain specific exceptional conditions instead of being used as general optimization (which doesn't make much sense after all as the optimization is only meaningful with debug option turned on). Are you interested in doing a sweeping patch to drop underscores from __this_cpu_*() conversions? Thanks. -- tejun --
Nope. __this_cpu_add/dec cannot be converted. __this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are useful there too to show that we are in a preempt section. The calls to raw_smp_processor_id and smp_processor_id() are only useful in the fallback case. There is no need for those if the arch has a way to provide the current percpu offset. So we in effect have two meanings of __ right now. 1. We do not care about the preempt state (thus we call raw_smp_processor_id so that the preempt state does not trigger) 2. We do not need to disable preempt before the operation. __this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2. --
Hello, Christoph. That doesn't make much sense. __ for this_cpu_ptr() means "bypass sanity check, we're knowingly violating the required conditions" not Yeah, we need to clean it up. The naming is too confusing. Thanks. -- tejun --
It's consistent if __ means both 1 and 2. If we want to distinguish it then we may want to create raw_this_cpu_xx which means that we do not call smp_processor_id() on fallback but raw_smp_processor_id(). Does not matter if the arch provides a per cpu offset. This would mean duplicating all the macros. The use of raw_this_cpu_xx should be rare so maybe the best approach is to say that __ means only that the macro does not need to disable preempt but it still checks for preemption being off. Then audit the __this_cpu_xx uses and see if there are any that require a raw_ variant. The vm event counters require both no check and no preempt since they can be implemented in a racy way. --
I was basically stating the difference between raw_smp_processor_id() and smp_processor_id() which I thought applied the same to The biggest grief I have is that the meaning of __ is different among different accessors. If that can be cleared up, we would be in much better shape without adding any extra macros. Can we just remove all __'s and use meaningful pre or suffixes like raw or irq or whatever? Thanks. -- tejun --
It does apply. __this_cpu_ptr does not use smp_processor_id() but raw_smp_processor_id(). this_cpu_ptr does not need to disable preempt so It currently means that we do not deal with preempt and do not check for preemption. That is consistent. Sure we could change the API to have even more macros than the large amount it already has so that we can check for proper preempt disablement. I guess that would mean adding raw_nopreempt_this_cpu_xx and nopreempt_this_cpu_xx variants? The thing gets huge. I think we could just leave it. __ suggests that serialization and checking is not performed like in the full versions and that is true. --
Hello, Christoph. I don't think we'll need to add new variants. Just renaming existing ones so that they have more specific pre/suffix should make things clearer. I'll give a shot at that once the sparse annotation patchset is merged. Thanks. -- tejun --
Use this_cpu_* operations in the hotpath to avoid calculations of
kmem_cache_cpu pointer addresses.
On x86 there is a trade off: Multiple uses of segment prefixes against an
address calculation and more register pressure. Code size is also reduced,
which is an advantage icache wise.
The use of prefixes is necessary if we want to use a scheme
for fastpaths that do not require disabling interrupts.
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 80 ++++++++++++++++++++++++++++++--------------------------------
1 file changed, 39 insertions(+), 41 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2009-10-07 14:52:05.000000000 -0500
+++ linux-2.6/mm/slub.c 2009-10-07 14:58:50.000000000 -0500
@@ -1512,10 +1512,10 @@ static void flush_all(struct kmem_cache
* Check if the objects in a per cpu structure fit numa
* locality expectations.
*/
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int node_match(struct kmem_cache *s, int node)
{
#ifdef CONFIG_NUMA
- if (node != -1 && c->node != node)
+ if (node != -1 && __this_cpu_read(s->cpu_slab->node) != node)
return 0;
#endif
return 1;
@@ -1603,46 +1603,46 @@ slab_out_of_memory(struct kmem_cache *s,
* a call to the page allocator and the setup of a new slab.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr)
{
void **object;
- struct page *new;
+ struct page *page = __this_cpu_read(s->cpu_slab->page);
/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
- if (!c->page)
+ if (!page)
goto new_slab;
- slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
+ slab_lock(page);
+ if (unlikely(!node_match(s, node)))
goto ...
The rest of the patches look good to me but I'm no expert in this area of code. But you're the maintainer of the allocator and the changes definitely are percpu related, so if you're comfortable with it, I can happily carry the patches through percpu tree. Thanks. -- tejun --
The patch looks sane to me but the changelog contains no relevant
numbers on performance. I am fine with the patch going in -percpu but
the patch probably needs some more beating performance-wise before it
can go into .33. I'm CC'ing some more people who are known to do SLAB
performance testing just in case they're interested in looking at the
patch. In any case,
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka
--
I am warming up my synthetic in kernel tests right now. Hope I have something by tomorrow. --
I ran 60-second netperf TCP_RR benchmarks with various thread counts over two machines, both four quad-core Opterons. I ran the trials ten times each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this patchset. The transfer rates were virtually identical showing no improvement or regression with this patchset in this benchmark. [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472, this benchmark continues to be the most significant regression slub has compared to slab. ] --
Hmmm... Last time I ran the in kernel benchmarks this showed a reduction in cycle counts. Did not get to run my tests yet. Can you also try the irqless hotpath? --
Here are some cycle numbers w/o the slub patches and with. I will post the full test results and the patches to do these in kernel tests in a new thread. The regression may be due to caching behavior of SLUB that will not change with these patches.
Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is being used. First test does 10000 kmallocs and then frees them all. Second test alloc one and free one and does that 10000 times.
no this_cpu ops
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles
with this_cpu_xx
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles
10000 times kmalloc(32) ...
Hi Christoph, I wonder how reliable these numbers are. We did similar testing a while back because we thought kmalloc-96 caches had weird cache behavior but finally figured out the anomaly was explained by the order of the tests run, not cache size. Notice the jump from 32 to 64 and then back to 64. One would expect we see linear increase as object size grows as we hit the page allocator If there's 50% improvement in the kmalloc() path, why does the this_cpu() version seem to be roughly as fast as the mainline version? Pekka --
Well you need to look behind these numbers to see when the allocator uses
64 is the cacheline size for the machine. At that point you have the advantage of no overlapping data between different allocations and the
It's not that the kmalloc() is faster. The instructions used for the fastpath generate less cycles. Other components figure into the total latency as well.
16k allocations for example are not handled by slub anymore. Fastpath has no effect. The win there is just the improved percpu handling in the page allocator.
I have some numbers here for irqless which drops another half of the fastpath latency (and it adds some code to the slow path, sigh):
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 251 cycles
10000 times kmalloc(16) -> 201 cycles kfree -> 261 cycles
10000 times kmalloc(32) -> 220 cycles kfree -> 261 cycles
10000 times kmalloc(64) -> 186 cycles kfree -> 224 cycles
10000 times kmalloc(128) -> 205 cycles kfree -> 125 cycles
10000 times kmalloc(256) -> 351 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 330 cycles kfree -> 310 cycles
10000 times kmalloc(1024) -> 416 cycles kfree -> 419 cycles
10000 times kmalloc(2048) -> 537 cycles kfree -> 439 cycles
10000 times kmalloc(4096) -> 458 cycles kfree -> 594 cycles
10000 times kmalloc(8192) -> 810 cycles kfree -> 678 cycles
10000 times kmalloc(16384) -> 879 cycles kfree -> 746 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 66 cycles
10000 times kmalloc(16)/kfree -> 187 cycles
10000 times kmalloc(32)/kfree -> 116 cycles
10000 times kmalloc(64)/kfree -> 107 cycles
10000 times kmalloc(128)/kfree -> 115 cycles
10000 times kmalloc(256)/kfree -> 65 cycles
10000 times kmalloc(512)/kfree -> 66 cycles
10000 times kmalloc(1024)/kfree -> 206 cycles
10000 times kmalloc(2048)/kfree -> 65 cycles
10000 times kmalloc(4096)/kfree -> 193 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 976 cycles
--
With the netperf -t TCP_RR -l 60 benchmark I ran, CONFIG_SLUB_STATS shows the allocation fastpath is utilized quite a bit for a couple of key caches:
cache           ALLOC_FASTPATH  ALLOC_SLOWPATH
kmalloc-256     98125871        31585955
kmalloc-2048    77243698        52347453
For an optimized fastpath, I'd expect such a workload would result in at least a slightly higher transfer rate. I'll try the irqless patch, but this particular benchmark may not appropriately demonstrate any performance gain because of the added code in the also significantly-used slowpath. --
There will be no improvements if the load is dominated by the instructions in the network layer or caching issues. None of that is changed by the patch. It only reduces the cycle count in the fastpath. --
Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the same workload so it shows that the slab allocator does have an impact in transfer rate. I understand that the performance gain with this patchset, however, may not be representative with the benchmark since it also frequently uses the slowpath for kmalloc-256 about 25% of the time and the added code of the irqless patch may mask the fastpath gain. --
I have a bit more detailed results based on the following machine
CPU type: AMD Phenom 9950
CPU counts: 1 CPU (4 cores)
CPU Speed: 1.3GHz
Motherboard: Gigabyte GA-MA78GM-S2H
Memory: 8GB
The reference kernel used is mmotm-2009-10-09-01-07. The patches applied
are the patches in this thread. The headings are a bit munged but it's
SLUB-vanilla where vanilla is mmotm-2009-10-09-01-07
SLUB-this-cpu mmotm-2009-10-09-01-07 + patches in this thread
SLAB-* same as above but SLAB configured instead of SLUB.
I know it wasn't necessary to run SLAB-this-cpu but
it gives an idea to what degree results can vary
between reboots even if results are stable once the
machine is running.
The benchmarks run were kernbench, netperf UDP_STREAM and TCP_STREAM and
sysbench with postgres.
Kernbench is 5 kernel compiles and an average taken. One kernel compile
is done at the start to warm the benchmark up and this result is
discarded.
Netperf is the _STREAM tests as opposed to the _RR tests reported
elsewhere. No special effort is made to bind processes to any particular
CPU. The reported results aim for 99% confidence that the estimated
mean is within 1% of the true mean. Results where netperf failed to
achieve the necessary confidence are marked with a * and the line after
such a result states within what percentage of the true mean the
estimated mean lies. The test is run with different packet sizes.
Sysbench is a read-only test (to avoid IO) and is the "complex"
workload. The test is run with varying numbers of threads.
In all the results, SLUB-vanilla is the reference baseline. This allows
a comparison between SLUB-vanilla and SLAB-vanilla as well with the
patches applied.
kernbench-SLUB-vanilla-kernbench kernbench-SLUBkernbench-SLAB-vanilla-kernbench kernbench-SLAB
SLUB-vanilla this-cpu SLAB-vanilla this-cpu
Elapsed min 92.95 ( 0.00%) 92.62 ( 0.36%) 92.93 ( 0.02%) 92.62 ( 0.36%)
Elapsed mean ...
The test did not include the irqless patch I hope? The queuing in SLAB allows a better cache hot behavior. Without a queue SLUB has a difficult time improvising cache hot behavior based on objects restricted to a slab page. Therefore the size of the slab page will SLUB offloads allocations > 8k to the page allocator. SLAB does create large slabs. --
Allocations >8k might explain then why 8K and 16K packets for UDP_STREAM performance suffers. That can be marked as future possible work to sort out within the allocator. However, does it explain why TCP_STREAM suffers so badly even for packet sizes like 2K? It's also important to note in some cases, SLAB was far slower even when the packet sizes were greater than 8k so I don't think the page allocator is an adequate explanation for TCP_STREAM. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hi Mel, SLAB is able to queue lots of large objects but SLUB can't do that because it has no queues. In SLUB, each CPU gets a page assigned to it that serves as a "queue" but the size of the queue gets smaller as object size approaches page size. We try to offset that with higher order allocations but IIRC we don't increase the order linearly with object size and cap it to some reasonable maximum. Pekka --
You can test to see if larger pages have an influence by passing
slub_max_order=6
or so on the kernel command line. You can force a large page use in slub by setting
slub_min_order=3
f.e. Or you can force a minimum number of objects in slub through f.e.
slub_min_objects=50
slub_max_order=6 slub_min_objects=50
should result in pretty large slabs with lots of in page objects that allow slub to queue better. --
Hi Christoph,
On Wed, Oct 14, 2009 at 6:56 PM, Christoph Lameter
Yeah, that should help but it's probably not something we can do for
mainline. I'm not sure how we can fix SLUB to support large objects
out-of-the-box as efficiently as SLAB does.
Pekka
--
We could add a per cpu "queue" through a pointer array in kmem_cache_cpu. Which is more SLQB than SLUB. --
Here are the results of that suggestion. They are side-by-side with the
other results so the columns are
SLUB-vanilla No other patches applied, SLUB configured
vanilla-highorder No other patches + slub_max_order=6 slub_min_objects=50
SLUB-this-cpu The patches in this set applied
this-cpu-highorder These patches + slub_max_order=6 slub_min_objects=50
SLAB-vanilla No other patches, SLAB configured
SLAB-this-cpu These patches, SLAB configured
SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
Elapsed min 92.95 ( 0.00%) 92.64 ( 0.33%) 92.62 ( 0.36%) 92.77 ( 0.19%) 92.93 ( 0.02%) 92.62 ( 0.36%)
Elapsed mean 93.11 ( 0.00%) 92.89 ( 0.24%) 92.74 ( 0.40%) 92.82 ( 0.31%) 93.00 ( 0.13%) 92.82 ( 0.32%)
Elapsed stddev 0.10 ( 0.00%) 0.15 (-58.74%) 0.14 (-40.55%) 0.09 ( 7.73%) 0.04 (55.47%) 0.18 (-84.33%)
Elapsed max 93.20 ( 0.00%) 93.04 ( 0.17%) 92.95 ( 0.27%) 92.98 ( 0.24%) 93.05 ( 0.16%) 93.09 ( 0.12%)
User min 323.21 ( 0.00%) 323.38 (-0.05%) 322.60 ( 0.19%) 323.26 (-0.02%) 322.50 ( 0.22%) 323.26 (-0.02%)
User mean 323.81 ( 0.00%) 323.64 ( 0.05%) 323.20 ( 0.19%) 323.56 ( 0.08%) 323.16 ( 0.20%) 323.54 ( 0.08%)
User stddev 0.40 ( 0.00%) 0.38 ( 4.24%) 0.46 (-15.30%) 0.27 (33.20%) 0.48 (-20.92%) 0.29 (26.07%)
User max 324.32 ( 0.00%) 324.30 ( 0.01%) 323.72 ( 0.19%) 323.96 ( 0.11%) 323.86 ( 0.14%) 323.98 ( 0.10%)
System min 35.95 ( 0.00%) 35.33 ( 1.72%) 35.50 ( 1.25%) 35.95 ( 0.00%) 35.35 ( 1.67%) 36.01 (-0.17%)
System mean 36.30 ( 0.00%) 35.99 ( 0.87%) 35.96 ( 0.96%) 36.20 ( 0.28%) 36.17 ( 0.36%) 36.23 ( 0.21%)
System stddev 0.25 ( 0.00%) 0.41 (-59.25%) 0.45 (-75.60%) 0.15 (41.61%) 0.56 (-121.14%) 0.14 (46.14%)
System max 36.65 ( 0.00%) 36.44 ( 0.57%) 36.67 (-0.05%) 36.32 ( 0.90%) ...
This is understandable considering the statistics that I posted for this workload on my machine, higher order cpu slabs will naturally get freed to more often from the fastpath, which also causes it to utilize the allocation fastpath more often (and we can see the optimization of this patchset), in addition to avoiding partial list handling. The pain with the smaller packet sizes is probably the overhead from the page allocator more than slub, a characteristic that also caused the TCP_RR benchmark to suffer. It can be mitigated somewhat with slab preallocation or a higher min_partial setting, but that's probably not an optimal solution. --
TCP_STREAM stresses a few specific caches:
                ALLOC_FASTPATH  ALLOC_SLOWPATH  FREE_FASTPATH  FREE_SLOWPATH
kmalloc-256     3868530         3450592         95628          7223491
kmalloc-1024    2440434         429             2430825        10034
kmalloc-4096    3860625         1036723         85571          4811779
This demonstrates that freeing to full (or partial) slabs causes a lot of pain since the fastpath normally can't be utilized and that's probably beyond the scope of this patchset. It's also different from the cpu slab thrashing issue I identified with the TCP_RR benchmark and had a patchset to somewhat improve. The criticism was the addition of an increment to a fastpath counter in struct kmem_cache_cpu which could probably now be much cheaper with these optimizations. --
Can you redo the patch? --
Sure, but it would be even more inexpensive if we can figure out why the irqless patch is hanging my netserver machine within the first 60 seconds on the TCP_RR benchmark. I guess nobody else has reproduced that yet. --
Nope. Sorry. I have tried running some tests but so far nothing. --
The tests were all run directly after booting the respective kernel. --
v6 of your patchset applied to percpu#for-next now at dec54bf "this_cpu: Use this_cpu_xx in trace_functions_graph.c" works fine, but when I apply the irqless patch from http://marc.info/?l=linux-kernel&m=125503037213262 it hangs my netserver machine within the first 60 seconds when running this benchmark. These kernels both include the fixes to kmem_cache_open() and dma_kmalloc_cache() you posted earlier. I'll have to debug why that's happening before collecting results. --
Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc_cpu structures instead.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slub_def.h | 19 +++++++++++--------
mm/slub.c | 24 ++++++++++--------------
2 files changed, 21 insertions(+), 22 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2009-09-29 11:43:18.000000000 -0500
@@ -131,11 +131,21 @@ struct kmem_cache {
#define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
/*
* We keep the general caches in an array of slab caches that are used for
* 2^x bytes of allocations.
*/
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
/*
* Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
return &kmalloc_caches[index];
}
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
void *__kmalloc(size_t size, gfp_t flags);
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/mm/slub.c 2009-09-29 11:43:18.000000000 -0500
@@ -2090,7 +2090,7 @@ static inline int ...
Slight bug when creating kmalloc dma caches on the fly. When searching for
an unused statically allocated kmem_cache structure we need to check for
size == 0 not the other way around.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2009-10-13 13:31:05.000000000 -0500
+++ linux-2.6/mm/slub.c 2009-10-13 13:31:36.000000000 -0500
@@ -2650,7 +2650,7 @@ static noinline struct kmem_cache *dma_k
s = NULL;
for (i = 0; i < KMALLOC_CACHES; i++)
- if (kmalloc_caches[i].size)
+ if (!kmalloc_caches[i].size)
break;
BUG_ON(i >= KMALLOC_CACHES);
--
this_cpu_inc() translates into a single instruction on x86 and does not
need any register. So use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer. So pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 43 ++++++++++++++++++++-----------------------
1 file changed, 20 insertions(+), 23 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c 2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
#endif
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
- c->stat[si]++;
+ __this_cpu_inc(s->cpu_slab->stat[si]);
#endif
}
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
if (!page)
return NULL;
- stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+ stat(s, ORDER_FALLBACK);
}
if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
{
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
- struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
__ClearPageSlubFrozen(page);
if (page->inuse) {
if (page->freelist) {
add_partial(n, page, tail);
- stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+ stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
} else {
- stat(c, DEACTIVATE_FULL);
+ stat(s, DEACTIVATE_FULL);
if (SLABDEBUG && PageSlubDebug(page) &&
(s->flags & SLAB_STORE_USER))
add_full(n, page);
}
slab_unlock(page);
} else {
- stat(c, DEACTIVATE_EMPTY);
+ stat(s, ...
FWIW, this fails to boot on latest mmotm on x86-64 even though the patches apply. It fails to create basic slab caches like kmalloc-64. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
There was a fixup patch for one of the slub patches. Was that merged? --
No. I missed it without the change in subject line and had just exported the thread series itself. Sorry. I might have something useful on this in the morning assuming no other PEBKAC-related messes. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I am stuck too. Sysfs is screwed up somehow and triggers the hangcheck timer. --
