IIRC profiling showed that the problem was the page lock
serialization (in particular the slab_lock() in __slab_free). That
was on 2.6.24.2.
Well, in the benchmark it is slower.
Ignoring NUMA is not an option, unfortunately. And with integrated memory
controllers many of the remote CPU frees are off node.
I think the problem is that this atomic operation thrashes cache lines
around. Counting instruction cycles is really not that interesting,
but minimizing the cache thrashing is. And for that it looks like slub
is worse.
What is the big problem with having a batched free queue? If the expiry
is done at a well-bounded time (e.g. on interrupt exit or similar)
locally on the CPU, it shouldn't be a big issue, should it?
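
To make the batched free queue idea concrete, here is a minimal userspace
sketch, not kernel code and not how slub or slab actually does it. All
names (toy_slab, free_queue, queue_free, queue_flush, BATCH_SIZE) are made
up for illustration, and the shared lock is only simulated by a counter so
the number of contended accesses can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH_SIZE 16   /* hypothetical bound on queued objects per CPU */

/* Stand-in for a slab protected by a shared lock. lock_acquisitions
 * counts how often the contended cache line would be touched. */
struct toy_slab {
    unsigned long lock_acquisitions;
    unsigned long objects_freed;
};

/* Per-CPU queue of objects waiting to be freed. */
struct free_queue {
    void *objs[BATCH_SIZE];
    size_t count;
};

/* Drain the queue, taking the (simulated) lock once for the whole batch.
 * In the scheme discussed above this would also run at a bounded point
 * such as interrupt exit. */
static void queue_flush(struct free_queue *q, struct toy_slab *s)
{
    if (q->count == 0)
        return;
    s->lock_acquisitions++;     /* one shared-line access per batch */
    for (size_t i = 0; i < q->count; i++) {
        free(q->objs[i]);
        s->objects_freed++;
    }
    q->count = 0;
}

/* Queue an object locally; flush first if the batch is already full. */
static void queue_free(struct free_queue *q, struct toy_slab *s, void *obj)
{
    assert(q->count <= BATCH_SIZE);
    if (q->count == BATCH_SIZE)
        queue_flush(q, s);
    q->objs[q->count++] = obj;
}
```

With this layout the shared state is touched once per BATCH_SIZE frees
instead of once per free; the cost is that up to BATCH_SIZE objects per
CPU stay unavailable until the next flush.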
-Andi