I'm looking at unifying asm-x86/pgalloc*.h, and so I'm trying to make
things as similar as possible between 32 and 64-bit.
One difference is that 64-bit incrementally allocates all levels of the
pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
the pgd. What's the rationale for this? What pitfalls would there be
in making them incrementally allocated?
Preallocation makes sense from the perspective that they will all be
allocated almost immediately in a typical process. But it is a somewhat
arbitrary difference from 64-bit, and since 64-bit can't reasonably
preallocate any pagetable levels, it seems sensible to change 32-bit to
match.
Thanks,
J
-
IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
have to be present.
What earlier CPUs did was to basically load all four values into the
CPU when you loaded %cr3. There was no "three-level page table walker"
at all: it was still a two-level page table walker, there were just four
magic internal page tables that were indexed off the two high bits.
Linus
-
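As an illustration of "indexed off the two high bits", here is a minimal user-space sketch of how a 32-bit PAE linear address splits into its indices for 4-KByte pages. The helper names (pae_pgd_index and friends) are made up for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Bits 31:30 select one of the four top-level (PDPT) entries. */
static inline unsigned pae_pgd_index(uint32_t va) { return va >> 30; }

/* Bits 29:21 select one of 512 page-directory (pmd) entries. */
static inline unsigned pae_pmd_index(uint32_t va) { return (va >> 21) & 0x1ff; }

/* Bits 20:12 select one of 512 page-table entries. */
static inline unsigned pae_pte_index(uint32_t va) { return (va >> 12) & 0x1ff; }

/* Bits 11:0 are the byte offset within the 4 KiB page. */
static inline unsigned pae_page_offset(uint32_t va) { return va & 0xfff; }

/* Reassemble an address from its parts; the inverse of the helpers above. */
static inline uint32_t pae_compose(unsigned pgd, unsigned pmd,
                                   unsigned pte, unsigned off)
{
    assert(pgd < 4 && pmd < 512 && pte < 512 && off < 4096);
    return ((uint32_t)pgd << 30) | ((uint32_t)pmd << 21) |
           ((uint32_t)pte << 12) | off;
}
```

With the usual 3:1 split, any address at or above 0xC0000000 lands in slot 3, and everything below it falls in slots 0 to 2.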
This is true, although you could point a PGD to an all-zero page if you
really wanted to. You have to re-load CR3 after modifying the top-level
They still are. Loading CR3 in PAE really loads four registers from
memory. x86-64 is different, of course.
-hpa
-
There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address some
of those is that there is mutual antagonism between compactness and
expansibility in the process address space layout, so you'll end up
instantiating a lot more than you want barring some sort of provision
for a compact address space layout.
Pagetable sharing is a far more powerful resource scalability method,
though it also needs cooperation in user address space layout to reap
its gains. There are other overheads, of course, though they're more
typically per-something besides processes.
-- wli
-
I think Jeremy's question was due to trying to reduce the 32/64-bit
differences.
Performance-wise, it might add a small amount to user setup time (a
typical 32-bit process will need all four, for the main binary,
libraries, stack and kernel, respectively) but it is probably not
significant (although I'd like to see numbers just in case).
-hpa
-
With the new top down mmap layout and standard 3:1 split it should
typically only need two.
-Andi
-
I didn't count kernel because it is always fixed anyways and about zero
overhead for the normal setup case.
-Andi
-
Of course, but it was in the original list so...
-hpa
-
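To make the "four" versus "two" counts above concrete, here is a small sketch that counts how many user-space top-level slots a set of addresses touches under a 3:1 split. The constants mirror the usual PAE values but are defined locally, user_pmds_needed is a made-up helper, and the sample addresses in the note below are only plausible placements, not measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGDIR_SHIFT        30             /* each top-level PAE entry maps 1 GiB */
#define PAGE_OFFSET        0xC0000000u    /* standard 3:1 user/kernel split */
#define USER_PTRS_PER_PGD  (PAGE_OFFSET >> PGDIR_SHIFT)   /* 3 user slots */

/* Count the distinct user-space top-level slots touched by the addresses. */
static unsigned user_pmds_needed(const uint32_t *addrs, size_t n)
{
    unsigned mask = 0, count = 0;

    for (size_t i = 0; i < n; i++) {
        assert(addrs[i] < PAGE_OFFSET);   /* user addresses only */
        mask |= 1u << (addrs[i] >> PGDIR_SHIFT);
    }
    for (unsigned slot = 0; slot < USER_PTRS_PER_PGD; slot++)
        count += (mask >> slot) & 1;
    return count;
}
```

For example, a binary at 0x08048000 with libraries and stack both near the top of user space (say 0xb7e00000 and 0xbffff000) touches slots 0 and 2 only, so two pmds. Moving the libraries to a bottom-up base such as 0x40000000 brings slot 1 into use, giving three user pmds plus the kernel one.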
Hm, do you recall what processors that might affect? As far as I know,
current processors will ignore non-present top-level entries. Anyway,
we can point them not present to empty_zero_page, so testing the present
bit will still be sufficient to tell if we need to allocate a new pmd,
but if the hardware decides to follow the page reference there's no harm
done. (Hm, unless the hardware decides it wants to set A or D bits in
That just means we need to reload cr3 after populating the pgd with a
new pmd, right?
J
-
Are you sure?
Anyway, this is not worth making a distinction for. Just pre-allocate all
of them. There really are just 4 PGD entries, and it really *is* different
from having a full three-level page table, and of the four PGD entries:
- one is used for the kernel mapping (assuming the regular 1:3 layout)
- AT LEAST two are required by user space anyway
so pre-allocating is never going to waste more than one page.
And you may feel that pre-allocating is a special case, but it's an
*easier* special case than the one that you are apparently thinking about
(which is to special-case according to CPU version).
So don't do it. Just preallocate for the magic 4-entry PGD. You can make
the special case just be something like
    /* Preallocate for small PGD's */
    #if PTRS_PER_PGD == 4
    for (i = 0; i < USER_PTRS_PER_PGD; i++) {
        pmd_t *pmd = pmd_alloc();
        set_pgd(pgd+i, __pgd(_PAGE_PRESENT | __pa(pmd)));
    }
    #endif
or similar.
There is absolutely *zero* reason not to do this, and there is also zero
reason to make this be a "32-bit vs 64-bit" issue. The code can be there
in both, and the #if could even be all in C code (ie there may be reasons
to prefer writing it as
    /* The old-style PAE PGD needs to be preallocated */
    if (USER_PTRS_PER_PGD <= 4) {
        ...
    }
and the compiler should even compile it away entirely for all practical
x86 page table walking never sets A/D bits on non-present entries.
That said, there's still a huge difference.
For "real" page table walking, you can always just insert entries without
flushing the cache if those entries weren't there before (because the TLB
is supposed to not cache negative entries).
Again, because of the way the magic 4-entry PGD works, that isn't true for
it. It caches the entries regardless, so if you change it from non-present
to present, you have to flush the TLB (well, "reload %cr3", which is the
BUT ONLY FOR THIS CASE!
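The caching behaviour described here can be modelled in a few lines of user-space C. This is a toy model only: struct toy_cpu, toy_load_cr3 and toy_top_level_present are invented names, and the model assumes the simplest reading of the thread, namely that the four top-level entries are copied when %cr3 is loaded and not looked at again until the next load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PDPTE_PRESENT 0x1ull

/* Toy CPU: four internal registers that are filled only on a CR3 load. */
struct toy_cpu {
    const uint64_t *pdpt;   /* the in-memory 4-entry top-level table */
    uint64_t cached[4];     /* internal copies of those four entries */
};

/* Model of "mov %reg, %cr3": snapshot all four entries from memory. */
static void toy_load_cr3(struct toy_cpu *cpu, const uint64_t *pdpt)
{
    assert(pdpt);
    cpu->pdpt = pdpt;
    for (int i = 0; i < 4; i++)
        cpu->cached[i] = pdpt[i];
}

/* The walker consults the cached copy, never the in-memory entry. */
static bool toy_top_level_present(const struct toy_cpu *cpu, uint32_t va)
{
    return (cpu->cached[va >> 30] & PDPTE_PRESENT) != 0;
}
```

In this model, setting the present bit in the in-memory table after toy_load_cr3() has no visible effect until toy_load_cr3() is called again, which is the extra reload that preallocating the pmds avoids.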
And if you preallocate it, you make *that* special case go ...
...3.8.5 in vol 3a "Page-Directory and Page-Table Entries With Extended
Addressing Enabled":
The present flag (bit 0) in the page-directory-pointer-table entries
can be set to 0 or 1. If the present flag is clear, the remaining
bits in the page-directory-pointer-table entry are available to the
operating system. If the present flag is set, the fields of the
page-directory-pointer-table entry are defined in Figures 3-20 for
4-KByte pages and Figures 3-21 for 2-MByte pages.
So I would assume this works on all current CPUs, but I can imagine that
Yeah, I'm not so concerned about memory saving; I don't think there
I'm hoping to avoid special-casing anything, if I can help it, aside
from the normal 32/64-bit 2/3/4-level parameterising of the various
Perhaps. And there's the corresponding difference between 32 and 64 bit
on freeing a pagetable; 32-bit assumes the pgd destructor will free the
pmd, whereas 64-bit does it separately. Even in the current 32-bit
code, there's separate handling for PAE and non-PAE. I think it can all
Yes, that is a bit awkward; it means that 32-bit PAE would need a
separate pgd_populate. But that seems like a smaller change than 1)
making 32-bit PAE pgd-alloc preallocate the pmd, and 2) making pmd_free
noop on 32-bit PAE, and 3) making pgd_free free the preallocated pmd.
Perhaps 2 & 3 aren't necessary and can be the same as 64-bit.
Yep, absolutely.
J
-
Yes, OK, it makes sense. Conceptually they would be dynamically
allocated and freed, but they'd just happen to start allocated, to avoid
the tlb flush of populating the pgd of an active pagetable. If you
happened to do a 1G munmap, it may end up freeing and reallocating them,
but that's going to be very rare. Either way, the other special cases
are avoided (though pgd_populate would still need to be correct, on the
offchance it gets invoked).
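The "start out allocated" idea can be sketched in user space as follows. This is not kernel code: toy_pgd_t, toy_pmd_t, toy_pgd_alloc and toy_pgd_free are hypothetical names, calloc() stands in for page allocation, and the kernel slot is simply left empty here rather than pointing at a shared kernel pmd.

```c
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD       4     /* PAE top level */
#define USER_PTRS_PER_PGD  3     /* 3:1 split: slots 0..2 belong to user space */
#define PTRS_PER_PMD       512

typedef struct { uint64_t entry[PTRS_PER_PMD]; } toy_pmd_t;
typedef struct { toy_pmd_t *pmd[PTRS_PER_PGD]; } toy_pgd_t;

/* Free the user pmds and the pgd itself; free(NULL) is a no-op. */
static void toy_pgd_free(toy_pgd_t *pgd)
{
    if (!pgd)
        return;
    for (int i = 0; i < USER_PTRS_PER_PGD; i++)
        free(pgd->pmd[i]);
    free(pgd);
}

/* Allocate a pgd whose user slots are populated before it is ever used. */
static toy_pgd_t *toy_pgd_alloc(void)
{
    toy_pgd_t *pgd = malloc(sizeof(*pgd));

    if (!pgd)
        return NULL;
    for (int i = 0; i < PTRS_PER_PGD; i++)
        pgd->pmd[i] = NULL;              /* start with every slot empty */
    for (int i = 0; i < USER_PTRS_PER_PGD; i++) {
        pgd->pmd[i] = calloc(1, sizeof(toy_pmd_t));
        if (!pgd->pmd[i]) {
            toy_pgd_free(pgd);           /* undo the partial allocation */
            return NULL;
        }
    }
    return pgd;
}
```

Because every user slot is filled before the table could be loaded, nothing in this sketch ever flips a top-level entry from non-present to present on a live pagetable.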
J
-
I don't think we ever free the pmds now, do we? (Except for the *final*
free, of course, when we release the whole VM).
Linus
-
PDPTR is documented to have P bits but none of the other control bits,
unlike other levels of the hierarchy.
The hardware never sets A or D bits on non-present pages, since all the
bits except P are reserved for the operating system (and, besides, they
Yes. And as Linus said, it would be a new special case.
-hpa
-
