I think a realistic benchmark would be to run a real kernel, profile the input values of the bitmap functions, and then test those cases. I actually started on that when I complained last time, by writing a systemtap script that generates a histogram, but for some reason systemtap couldn't tap all the bitmap functions in my kernel and missed some completely, and I ran out of time tracking that down.

My gut feeling is that the only interesting cases are cpumask/nodemask sized (which can be one word or two words, but now up to 64 words, i.e. 4096 bits at 64 bits per long, on an NR_CPUS=4096 x86 kernel), and then 4k sized ext3/reiser/etc. block bitmaps.

Ok.

The generic version is out-of-line,

Yes, it probably should.

cpumask walks are relatively common. I remember profiling mysql some time ago, which did bad overscheduling due to dumb locking. The funny thing was that the mask walking in the scheduler actually stood out. No, I don't claim extreme overscheduling is an interesting case to optimize for, but there are more realistic workloads which also do a lot of context switching.

BTW, if you do generic work on this: one reason the generated code for for_each_cpu etc. is so ugly is that the code has checks for find_next_bit returning >= max size. If you can generalize the code enough to make sure no arch does that anymore, these checks could be eliminated.

-Andi
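To illustrate the point about find_next_bit returning >= max size, here is a minimal user-space sketch. It is not the kernel code: strict_find_next_bit, sloppy_find_next_bit and walk are hypothetical names invented for this example, and the "sloppy" variant only assumes one plausible way an arch version could overshoot (by scanning to the end of the last word). The idea is that once every implementation returns exactly size when nothing is found, the per-step clamp in the mask walk can go away.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

typedef unsigned long (*find_fn)(const unsigned long *addr,
                                 unsigned long size, unsigned long offset);

static int test_bit(const unsigned long *addr, unsigned long nr)
{
    return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Hypothetical "strict" finder: returns exactly `size` when no bit is
 * set in [offset, size), so callers can compare against `size`. */
static unsigned long strict_find_next_bit(const unsigned long *addr,
                                          unsigned long size,
                                          unsigned long offset)
{
    for (; offset < size; offset++)
        if (test_bit(addr, offset))
            return offset;
    return size;
}

/* Hypothetical "sloppy" finder (an assumption for illustration): scans
 * to the end of the last word, so it may return a value > size. */
static unsigned long sloppy_find_next_bit(const unsigned long *addr,
                                          unsigned long size,
                                          unsigned long offset)
{
    unsigned long end =
        (size + BITS_PER_LONG - 1) / BITS_PER_LONG * BITS_PER_LONG;

    for (; offset < end; offset++)
        if (test_bit(addr, offset))
            return offset;
    return end;
}

/* Mask walk that stops when the finder returns the sentinel `size`.
 * With clamp != 0 every step pays for an extra compare, which is only
 * needed if the finder can overshoot. Returns the number of bits stored
 * in out[]. */
static size_t walk(find_fn find, int clamp, const unsigned long *mask,
                   unsigned long size, unsigned long *out, size_t cap)
{
    size_t n = 0;
    unsigned long bit = find(mask, size, 0);

    for (;;) {
        if (clamp && bit > size)
            bit = size;          /* the check that could be eliminated */
        if (bit == size)
            break;
        assert(n < cap);         /* trips if an unclamped sloppy finder runs away */
        out[n++] = bit;
        bit = find(mask, size, bit + 1);
    }
    return n;
}
```

With a 5-bit mask that has bits 1 and 3 set plus a stray bit 7 beyond the end, walk(sloppy_find_next_bit, 1, ...) and walk(strict_find_next_bit, 0, ...) both visit bits 1 and 3, but only the strict version gets there without the clamp.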
