Re: Why preallocate pmd in x86 32-bit PAE?

Previous thread: [PATCH RESEND] xen: mask _PAGE_PCD from ptes by Jeremy Fitzhardinge on Thursday, November 15, 2007 - 2:49 pm. (1 message)

Next thread: [PATCH] x86: clean up nmi_32/64.c by Hiroshi Shimamoto on Thursday, November 15, 2007 - 3:14 pm. (1 message)
From: Jeremy Fitzhardinge
Date: Thursday, November 15, 2007 - 2:57 pm

I'm looking at unifying asm-x86/pgalloc*.h, and so I'm trying to make
things as similar as possible between 32 and 64-bit.

One difference is that 64-bit incrementally allocates all levels of the
pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
the pgd.  What's the rationale for this?  What pitfalls would there be
in making them incrementally allocated?

Preallocation makes sense from the perspective that they will all be
allocated almost immediately in a typical process.  But it is a somewhat
arbitrary difference from 64-bit, and since 64-bit can't reasonably
preallocate any pagetable levels, it seems sensible to change 32-bit to
match.

Thanks,
    J
-

From: Linus Torvalds
Date: Thursday, November 15, 2007 - 3:12 pm

IIRC, the present bit is ignored in the magic 4-entry PGD.  All entries 
have to be present.

What earlier CPU's did was to basically load all four values into the CPU 
when you loaded %cr3. There was no "three-level page table walker" at all: 
it was still a two-level page table walker, there were just four magic 
internal page tables that were indexed off the two high bits.
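
As a rough illustration of that indexing (a userspace sketch, not kernel code; the names pae_index and pae_split are made up for this example), a 32-bit linear address under PAE with 4 KiB pages splits into a 2-bit top-level index, two 9-bit indices and a 12-bit offset:

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in a 32-bit linear address under PAE with 4 KiB pages. */
#define PAE_PGD_SHIFT 30	/* bits 31:30 pick one of the 4 top-level entries */
#define PAE_PMD_SHIFT 21	/* bits 29:21 index a 512-entry page directory */
#define PAE_PTE_SHIFT 12	/* bits 20:12 index a 512-entry page table */

struct pae_index {
	unsigned pgd;		/* 0..3 */
	unsigned pmd;		/* 0..511 */
	unsigned pte;		/* 0..511 */
	unsigned offset;	/* 0..4095 */
};

/* Hypothetical helper: split a linear address into its PAE indices. */
static struct pae_index pae_split(uint32_t addr)
{
	struct pae_index ix;

	ix.pgd    = addr >> PAE_PGD_SHIFT;
	ix.pmd    = (addr >> PAE_PMD_SHIFT) & 0x1ff;
	ix.pte    = (addr >> PAE_PTE_SHIFT) & 0x1ff;
	ix.offset = addr & 0xfff;
	assert(ix.pgd < 4);	/* only two bits are left for the top level */
	return ix;
}
```

With the usual 3:1 split, any address at or above 0xC0000000 lands in top-level slot 3, and everything below it falls in slots 0 to 2.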

		Linus
-

From: H. Peter Anvin
Date: Thursday, November 15, 2007 - 3:42 pm

This is true, although you could point a PGD to an all-zero page if you 
really wanted to.  You have to re-load CR3 after modifying the top-level 

They still are.  Loading CR3 in PAE really loads four registers from 
memory.  x86-64 is different, of course.
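
A toy model of that behaviour, assuming only what is described in this thread (the toy_cpu structure and the helper names are invented for the sketch, and real hardware details differ): the four entries are snapshotted when CR3 is loaded, so a later change to the in-memory table is not seen until CR3 is loaded again.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PDPTE_PRESENT 0x1ull

/* Toy model of a PAE CPU: four internal copies of the top-level entries. */
struct toy_cpu {
	uint64_t pdpte[4];
};

/* Model of loading CR3: snapshot all four in-memory entries into the CPU. */
static void toy_load_cr3(struct toy_cpu *cpu, const uint64_t pgd[4])
{
	assert(cpu && pgd);
	memcpy(cpu->pdpte, pgd, sizeof(cpu->pdpte));
}

/* Only the snapshot is consulted, never the table in memory. */
static int toy_slot_present(const struct toy_cpu *cpu, uint32_t addr)
{
	assert(cpu);
	return (cpu->pdpte[addr >> 30] & PDPTE_PRESENT) != 0;
}
```

In this model, setting the present bit in pgd[1] after toy_load_cr3() leaves toy_slot_present() returning 0 for addresses in that slot until toy_load_cr3() is called again.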

	-hpa
-

From: William Lee Irwin III
Date: Thursday, November 15, 2007 - 5:40 pm

There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of provision
for a compact address space layout. Pagetable sharing is a far more
powerful resource scalability method, though it also needs cooperation
in user address space layout to reap its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.


-- wli
-

From: H. Peter Anvin
Date: Thursday, November 15, 2007 - 5:41 pm

I think Jeremy's question was due to trying to reduce the 32/64-bit 
differences.  Performance-wise, it might add a small amount to user 
setup time (a typical 32-bit process will need all four, for the main 
binary, libraries, stack and kernel, respectively) but it is probably 
not significant (although I'd like to see numbers just in case).

	-hpa

-

From: Andi Kleen
Date: Friday, November 16, 2007 - 4:16 am

With the new top down mmap layout and standard 3:1 split it should typically 
only need two.

-Andi
-

From: H. Peter Anvin
Date: Friday, November 16, 2007 - 8:45 am

Well, three with the kernel.

	-hpa
-

From: Andi Kleen
Date: Friday, November 16, 2007 - 8:53 am

I didn't count kernel because it is always fixed anyways and about zero
overhead for the normal setup case.

-Andi

-

From: H. Peter Anvin
Date: Friday, November 16, 2007 - 9:10 am

Of course, but it was in the original list so...

	-hpa
-

From: Jeremy Fitzhardinge
Date: Friday, November 16, 2007 - 10:12 am

Hm, do you recall what processors that might affect?  As far as I know,
current processors will ignore non-present top-level entries.  Anyway,
we can point them not present to empty_zero_page, so testing the present
bit will still be sufficient to tell if we need to allocate a new pmd,
but if the hardware decides to follow the page reference there's no harm
done.  (Hm, unless the hardware decides it wants to set A or D bits in

That just means we need to reload cr3 after populating the pgd with a
new pmd, right?

    J
-

From: Linus Torvalds
Date: Friday, November 16, 2007 - 10:35 am

Are you sure?

Anyway, this is not worth making a distinction for. Just pre-allocate all 
of them. There really are just 4 PGD entries, and it really *is* different 
from having a full three-level page table, and of the four PGD entries:

 - one is used for the kernel mapping (assuming the regular 1:3 layout)
 - AT LEAST two are required by user space anyway

so pre-allocating is never going to waste more than one page.
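
The arithmetic can be checked with a small userspace sketch. The addresses used in the note below are illustrative guesses at a typical layout, not measurements, and slots_used is a hypothetical helper that counts how many of the four 1 GiB slots a set of addresses touches:

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD 4
#define PGDIR_SHIFT  30		/* each top-level entry covers 1 GiB */

/* Hypothetical helper: count the distinct 1 GiB slots touched by n addresses. */
static unsigned slots_used(const uint32_t *addrs, unsigned n)
{
	unsigned char used[PTRS_PER_PGD] = { 0 };
	unsigned i, count = 0;

	assert(addrs || n == 0);
	for (i = 0; i < n; i++)
		used[addrs[i] >> PGDIR_SHIFT] = 1;
	for (i = 0; i < PTRS_PER_PGD; i++)
		count += used[i];
	return count;
}
```

For example, a binary at 0x08048000, a top-down mmap region near 0xB7F00000, a stack near 0xBFFFF000 and the kernel at 0xC0000000 touch three slots, so preallocating all four would leave one pmd page unused in that layout. If libraries are mapped at 0x40000000 instead, all four slots are in use.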

And you may feel that pre-allocating is a special case, but it's an 
*easier* special case than the one that you are apparently thinking about 
(which is to special-case according to CPU version).

So don't do it. Just preallocate for the magic 4-entry PGD. You can make 
the special case just be something like

	/* Preallocate for small PGD's */
	#if PTRS_PER_PGD == 4
		for (i = 0; i < USER_PTRS_PER_PGD; i++) {
			pmd_t *pmd = pmd_alloc();
			set_pgd(pgd+i, __pgd(_PAGE_PRESENT | __pa(pmd)));
		}
	#endif

or similar. 
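
The same idea can be tried out in userspace. This is only a model under stated assumptions: toy_pgd, toy_pgd_alloc and toy_pgd_free are invented names, malloc/calloc stand in for the kernel's page allocator, and USER_PTRS_PER_PGD is taken to be 3 as with the regular 3:1 split.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD      4
#define USER_PTRS_PER_PGD 3	/* assumption: 3:1 user/kernel split */
#define PTRS_PER_PMD      512
#define TOY_PRESENT       0x1u

struct toy_pgd {
	uint64_t *pmd[PTRS_PER_PGD];	/* stand-in for physical addresses */
	unsigned  flags[PTRS_PER_PGD];
};

/* The final free releases every pmd that was preallocated. */
static void toy_pgd_free(struct toy_pgd *pgd)
{
	int i;

	if (!pgd)
		return;
	for (i = 0; i < PTRS_PER_PGD; i++)
		free(pgd->pmd[i]);	/* free(NULL) is a no-op */
	free(pgd);
}

/* Allocate the pgd and populate the user slots up front. */
static struct toy_pgd *toy_pgd_alloc(void)
{
	struct toy_pgd *pgd = malloc(sizeof(*pgd));
	int i;

	if (!pgd)
		return NULL;
	for (i = 0; i < PTRS_PER_PGD; i++) {
		pgd->pmd[i] = NULL;
		pgd->flags[i] = 0;
	}
	for (i = 0; i < USER_PTRS_PER_PGD; i++) {
		pgd->pmd[i] = calloc(PTRS_PER_PMD, sizeof(uint64_t));
		if (!pgd->pmd[i]) {
			toy_pgd_free(pgd);
			return NULL;
		}
		pgd->flags[i] = TOY_PRESENT;
	}
	return pgd;
}
```

Because every user slot is present from the start, nothing in this model ever flips a top-level entry from non-present to present while the table is in use.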

There is absolutely *zero* reason not to do this, and there is also zero 
reason to make this be a "32-bit vs 64-bit" issue. The code can be there 
in both, and the #if could even be all in C code (ie there may be reasons 
to prefer writing it as

	/* The old-style PAE PGD needs to be preallocated */
	if (USER_PTRS_PER_PGD <= 4) {
		...
	}

and the compiler should even compile it away entirely for all practical 

x86 page table walking never sets A/D bits on non-present entries.

That said, there's still a huge difference. 

For "real" page table walking, you can always just insert entries without 
flushing the cache if those entries weren't there before (because the TLB 
is supposed to not cache negative entries). 

Again, because of the way the magic 4-entry PGD works, that isn't true for 
it. It caches the entries regardless, so if you change it from non-present 
to present, you have to flush the TLB (well, "reload %cr3", which is the 

BUT ONLY FOR THIS CASE!

And if you preallocate it, you make *that* special case go ...
-

From: Jeremy Fitzhardinge
Date: Friday, November 16, 2007 - 11:30 am

3.8.5 in vol 3a "Page-Directory and Page-Table Entries With Extended
Addressing Enabled":

    The present flag (bit 0) in the page-directory-pointer-table entries
    can be set to 0 or 1. If the present flag is clear, the remaining
    bits in the page-directory-pointer-table entry are available to the
    operating system. If the present flag is set, the fields of the
    page-directory-pointer-table entry are defined in Figures 3-20 for
    4-KByte pages and Figures 3-21 for 2-MByte pages.

So I would assume this works on all current CPUs, but I can imagine that

Yeah, I'm not so concerned about memory saving; I don't think there

I'm hoping to avoid special-casing anything, if I can help it, aside
from the normal 32/64-bit 2/3/4-level parameterising of the various

Perhaps.  And there's the corresponding difference between 32 and 64 bit
on freeing a pagetable; 32-bit assumes the pgd destructor will free the
pmd, whereas 64-bit does it separately.  Even in the current 32-bit
code, there's separate handling for PAE and non-PAE.  I think it can all

Yes, that is a bit awkward; it means that 32-bit PAE would need a
separate pgd_populate.  But that seems like a smaller change than 1)
making 32-bit PAE pgd-alloc preallocate the pmd, and 2) making pmd_free
noop on 32-bit PAE, and 3) making pgd_free free the preallocated pmd. 
Perhaps 2 & 3 aren't necessary and can be the same as 64-bit.


Yep, absolutely.

    J
-

From: Jeremy Fitzhardinge
Date: Friday, November 16, 2007 - 12:14 pm

Yes, OK, it makes sense.  Conceptually they would be dynamically
allocated and freed, but they'd just happen to start allocated, to avoid
the tlb flush of populating the pgd of an active pagetable.  If you
happened to do a 1G munmap, it may end up freeing and reallocating them,
but that's going to be very rare.  Either way, the other special cases
are avoided (though pgd_populate would still need to be correct, on the
offchance it gets invoked).

    J
-

From: Linus Torvalds
Date: Friday, November 16, 2007 - 12:22 pm

I don't think we ever free the pmd's now, do we?

(Except for the *final* free, of course, when we release the whole VM).

		Linus
-

From: H. Peter Anvin
Date: Friday, November 16, 2007 - 10:45 am

PDPTR is documented to have P bits but none of the other control bits, 
unlike other levels of the hierarchy.

The hardware never sets A or D bits on non-present pages, since all the 
bits except P are reserved for the operating systems (and, besides, they 

Yes.  And as Linus said, it would be a new special case.

	-hpa

-
