I suspect the extra cost might be worth it, for two reasons: 1) we could
optimize the cross-call implementation further, and 2) on systems where
TLB flushes actually matter, the ability to overlap multiple TLB flushes
to the same CPU might improve workload performance.

FYI, I've created a new -tip topic for your patches, tip/x86/tlbflush.
It's based on tip/irq/sparseirq (there are a good many dependencies
on that topic).

It would be nice to see some numbers on sufficiently large SMP systems,
using some mmap/munmap-intensive workload.
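
For what it's worth, a minimal sketch of the kind of test that could
produce such numbers: several threads sharing one address space, each
mapping, touching and unmapping an anonymous page in a loop, so that
every munmap() may need to flush the TLB on the other CPUs running the
process. This is only an illustration, not an established benchmark;
run_mmap_bench(), struct worker_arg and the BENCH_MAIN switch are
made-up names, and the thread count, iteration count and mapping size
are arbitrary assumptions.

```c
/* Hypothetical mmap/munmap stress sketch; all names here are illustrative. */
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

struct worker_arg {
    int iters;      /* iterations requested */
    size_t len;     /* bytes per mapping */
    long done;      /* iterations completed successfully */
};

static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (int i = 0; i < a->iters; i++) {
        char *m = mmap(NULL, a->len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            break;
        m[0] = 1;   /* fault the page in so munmap tears down a live mapping */
        if (munmap(m, a->len) != 0)
            break;
        a->done++;
    }
    return NULL;
}

/* Runs nthreads workers in one process; returns total completed iterations,
 * or -1 if the bookkeeping arrays could not be allocated. */
long run_mmap_bench(int nthreads, int iters, size_t len)
{
    assert(nthreads > 0 && iters >= 0 && len > 0);

    pthread_t *tids = calloc(nthreads, sizeof(*tids));
    struct worker_arg *args = calloc(nthreads, sizeof(*args));
    long total = 0;
    int started = 0;

    if (!tids || !args) {
        free(tids);
        free(args);
        return -1;
    }
    for (int i = 0; i < nthreads; i++) {
        args[i].iters = iters;
        args[i].len = len;
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += args[i].done;
    }
    free(tids);
    free(args);
    return total;
}

#ifdef BENCH_MAIN   /* build with -DBENCH_MAIN for a standalone driver */
int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;      /* arbitrary default */
    int iters = argc > 2 ? atoi(argv[2]) : 100000;    /* arbitrary default */
    struct timespec t0, t1;

    if (nthreads <= 0 || iters < 0)
        return 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long total = run_mmap_bench(nthreads, iters, 4096);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld map/unmap cycles, %d threads, %.3f s\n", total, nthreads, secs);
    return 0;
}
#endif
```

Something like "gcc -O2 -pthread -DBENCH_MAIN bench.c" should build it.
Comparing wall time across thread counts, before and after the patches,
would be the interesting part; a real workload would of course be a
better test than this loop.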