Only if you believe that 4K stack pages are a worthy goal.
As far as I can figure out they are not. They might have been
a worthy goal on crappy 2.4 VMs, but those times are long gone.
The "saving memory on embedded" argument also does not
quite convince me. It is unclear whether that is really
a significant amount of memory on these systems, and whether it
couldn't be addressed better (e.g. by generally running
fewer kernel threads). I don't have numbers on this,
but then the people who made this argument didn't have any
either :)
If anybody has concrete statistics on this
(including other kernel memory users in realistic situations),
please feel free to post them.

The problem with his suggestion is that the lower 4K of the stack
is accessed in normal operation too, because it contains the thread_info.
That could be changed, but it would be a relatively large change
because you would need to audit/change a lot of code that assumes
thread_info and stack are contiguous.
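
To illustrate why the low end of the stack gets touched all the time,
here is a rough user-space sketch of the layout being discussed. It is
not the kernel's code: every name below (sketch_thread_info,
sketch_thread_union, sketch_info_from_sp, SKETCH_STACK_SIZE) is made up
for illustration, and the 8K size is only an assumption. The point is
that the per-thread structure is found by masking the stack pointer,
which only works while structure and stack share one aligned block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed size for illustration: two 4K pages. */
#define SKETCH_STACK_SIZE 8192u

/* Hypothetical stand-in for the per-thread bookkeeping structure. */
struct sketch_thread_info {
    int cpu;
    unsigned long flags;
};

/* Bookkeeping and stack share one aligned block. The structure sits at
 * the lowest address and the stack grows down toward it. */
union sketch_thread_union {
    struct sketch_thread_info info;
    unsigned char stack[SKETCH_STACK_SIZE];
};

/* Recover the structure from any address inside the stack by masking
 * off the low bits. This relies on the block being size-aligned and on
 * the structure living at its lowest address. */
struct sketch_thread_info *sketch_info_from_sp(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~((uintptr_t)SKETCH_STACK_SIZE - 1);
    return (struct sketch_thread_info *)base;
}

/* Returns 1 if a pointer near the top of the stack maps back to the
 * structure at the bottom, 0 otherwise (including allocation failure). */
int sketch_lookup_works(void)
{
    union sketch_thread_union *tu =
        aligned_alloc(SKETCH_STACK_SIZE, sizeof(*tu));
    if (tu == NULL)
        return 0;

    tu->info.cpu = 3;

    /* Pretend the stack pointer is somewhere in the upper 4K. */
    const void *fake_sp = &tu->stack[SKETCH_STACK_SIZE - 64];
    struct sketch_thread_info *ti = sketch_info_from_sp(fake_sp);

    int ok = (ti == &tu->info) && (ti->cpu == 3);
    free(tu);
    return ok;
}
```

Every such lookup reads from the lowest 4K of the block, so that page
cannot simply be unmapped, and any code that relies on this masking
would have to be audited if the structure moved elsewhere.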
If that was changed, implementing Willy's suggestion would not be that
difficult using cpa(), at the cost of some general slowdown from
increased TLB misses, much higher thread creation/teardown cost, etc.
Using the alternative vmalloc approach also has other issues.

But still, the fundamental problem is that it would likely only
hit the interesting cases in real production setups, and I don't
think the production users would be very happy to slow down
their kernels and handle strange backtraces just to act as guinea pigs
for something dubious.
-Andi