Right, you can use a 4k filesystem. The 4k blocks are then buffers within
a larger page.
I would think that your approach would be slower, since you always have to
populate 1 << N ptes when mmapping a file? Plus a lot of memory is wasted,
because even a file with one character needs an order N page? So there are
fewer pages available for the same workload.
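
To put rough numbers on that, here is a minimal sketch of the arithmetic.
It assumes a 4k base page and that a file is always backed by whole
order N pages. The names BASE_PAGE_SHIFT, ptes_per_page() and
wasted_bytes() are made up for illustration and are not kernel APIs.

```c
#include <assert.h>

#define BASE_PAGE_SHIFT 12  /* assumed 4k base page */

/* Number of 4k ptes that must be populated to map one order N page. */
static unsigned long ptes_per_page(unsigned int order)
{
    assert(order < 20);  /* keep the shift within a 32-bit unsigned long */
    return 1UL << order;
}

/*
 * Bytes left unused when a file of file_size bytes is backed by whole
 * order N pages (assumption: no sub-page sharing between files).
 */
static unsigned long wasted_bytes(unsigned long file_size, unsigned int order)
{
    assert(order < 20);
    unsigned long page_size = 1UL << (BASE_PAGE_SHIFT + order);
    unsigned long pages = (file_size + page_size - 1) / page_size;

    return pages * page_size - file_size;
}
```

With order 4 (64k pages), ptes_per_page(4) is 16 and wasted_bytes(1, 4)
is 65535, against 4095 bytes for the same one character file at order 0.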
Then you are breaking the mmap assumptions of applications, because the
order N kernel will no longer be able to map 4k pages. You likely need a
new binary format that has pages correctly aligned. I know that we would
need one on IA64 if we go beyond the established page sizes.
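
A rough sketch of the alignment problem, assuming the order N kernel
would only accept mappings whose file offset is a multiple of the larger
page size. The helper offset_is_mappable() is hypothetical and only
illustrates the check, it is not an existing interface.

```c
#include <assert.h>
#include <stdbool.h>

#define BASE_PAGE_SHIFT 12  /* assumed 4k base page */

/*
 * Returns true if file_offset is aligned to the order N page size.
 * Assumption: the kernel can only map at order N page granularity.
 */
static bool offset_is_mappable(unsigned long file_offset, unsigned int order)
{
    assert(order < 20);  /* keep the shift within a 32-bit unsigned long */
    unsigned long page_size = 1UL << (BASE_PAGE_SHIFT + order);

    return (file_offset & (page_size - 1)) == 0;
}
```

A binary segment placed at file offset 0x1000 passes this check at order
0 but fails it at order 4, where only multiples of 0x10000 pass. That is
the case where existing binaries would need to be relinked or the format
changed.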