Our P2M tree structure is a three-level. On the leaf nodes we set the Machine Frame Number (MFN) of the PFN. What this means is that when one does: pfn_to_mfn(pfn), which is used when creating PTE entries, you get the real MFN of the hardware. When Xen sets up a guest it initially populates a array which has descending MFN values, as so: idx: 0, 1, 2 [0x290F, 0x290E, 0x290D, ..] so pfn_to_mfn(2)==0x290D. If you start, restart many guests that list starts looking quite random. We graft this structure on our P2M tree structure and stick in those MFN in the leafs. But for all other leaf entries, or for the top root, or middle one, for which there is a void entry, we assume it is "missing". So pfn_to_mfn(0xc0000)=INVALID_P2M_ENTRY. We add the possibility of setting 1-1 mappings on certain regions, so that: pfn_to_mfn(0xc0000)=0xc0000 The benefit of this is, that we can assume for non-RAM regions (think PCI BARs, or ACPI spaces), we can create mappings easily b/c we get the PFN value to match the MFN. For this to work efficiently we introduce are two new pages: p2m_identity and p2m_mid_identity. All entries in p2m_identity are set to INVALID_P2M_ENTRY type, and all entries in p2m_mid_identity point to p2m_identity. Whenever we are told to set the MFN which is equal to PFN for void regions, we swap over from p2m_missing to p2m_identity or p2m_mid_missing to p2m_mid_identity. See this diagram for details: https://docs.google.com/drawings/edit?id=1smqIRPYq2mSxmvpabuk_Ap6bQAW_aaZnOnjzqlcmxKc&... Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- arch/x86/xen/mmu.c | 67 +++++++++++++++++++++++++++++++++++++++++++-------- 1 files changed, 56 insertions(+), 11 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index 8e9c669..667901f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -217,6 +217,9 @@ static RESERVE_BRK_ARRAY(unsigned long **, p2m_top, P2M_TOP_PER_PAGE); static ...
Should patch 7/8 (add the IDENTITY_FRAME_BIT) be folded back into this patch? This statement confused me at first. I think the piece of information which is missing in the rest of the paragraph is that on lookup we spot that the entry points to p2m_identity and return the identity value instead of dereferencing and returning INVALID_P2M_ENTRY. If the value is never actually (deliberately) completely dereferecned then perhaps for sanity/debugging the page should contain some other invalid/poison value so we can spot in stack traces etc cases where it has been erroneously dereferenced? (e.g. 0xFFFFFFFE?) Or does that If p2m_top[topidx] == p2m_mid_identity the also p2m_top[topidx][mididx] == p2m_identity. Therefore I'm not sure if the first check is worthwhile Should be "return true" throughout for a function returning bool. I I don't think I quite understand this bit. If I do "__set_phys_to_machine(X, X)" where X happens to currently correspond to p2m_mid_missing won't that cause all pfn entries in the range covered by p2m_top[topidx] (i.e. quite a few, like 512^2 pages or something) to switch from missing to identity? Similarly for p2m_top[topidx][mididx]? Perhaps ranges of identity bits are often well aligned with the boundaries in the p2m 3-level tree but wouldn't that just be coincidence? Don't we need to completely populate the tree at this point and setup the leaf nodes as appropriate? Which we can't do since this is __set_phys_to_machine so we need to fail and let set_phys_to_machine to its thing? Or maybe this hole bit needs to be hoisted up to set_phys_to_machine? --
I wonder if it would make sense to have a debugfs file which exports the p2m, for the purposes of eye-balling it for this sort of issue? comparing to the e820 etc... Maybe the whole thing in raw form would be overkill but spitting out a list of ranges of identity, invalid, normal pages, plus annotation regarding whether those come from a p2m_mid_identity style page or a leaf node, might be occasionally useful. Ian. -- Ian Campbell Current Noise: Place of Skulls - The Watchers Call toll free number before digging. --
I should have read patch 5/8 first... --
Does not have too. This can work independently (it does not introduce Yup. The tools have checks for either valid PFNs or 0xFFFFFFF, but nothing else. They flip out if you stick other values in .. oh and sticking in the same MFN (in case you were thinking of them pointing to a dummy page) Yup. Keep in mind that this patchset explores the more "conservative" approach. All 'void' entries are pointing to the p2m_*MISSING instead of p2m_*IDENTITY pages. I thought so, but for all the machines I've run this on, those boundaries Not exactly. At this point (xen/setup, parsing the E820), the P2M has been populated up to nr_pages. So, say we have 2G to the guest, that is p2m[0, 1] point to two pages "created" by extend_brk. The contents of the pages points to mfn_list[..]. The rest of the p2m (so, p2m[2]..p2m[511]) point to p2m_mid_missing. When we start the identity mapping, we end up right (for example) from 3GB to 4GB with identity mapping, so p2m[3] is now p2m_mid_identity. The p2m[4] and later are still in p2m_mid_missing. Thought I've just thought of nasty condition. Say those PCI regions start at 2.5GB, and the user provided dom0_mem=max:3GB. Oh, and actually that still works. The p2m[1] will be allocated by reserv_brk, and we just end up sticking from p2m[1][473]..p2m[1][511] with the PFN | IDENTITY_FRAME_BIT, which is OK as that contents used to have INVALID_P2M_IDENTITY (it would have been created by extend_brk in xen_build_dynamic_phys_to_machine and filled It can, but set_phys_to_machine would have to use reserved_brk. The danger I am seeing is if the non-RAM regions (and gaps) in the E820 are not 1GB (or 512MB under 32-bit) boundary aligned (2.36GB?)_and_ the dom0_mem is employed as to make the top-level P2M right up to the non-boundary aligned E820 void (so, dom0_mem=max:2GB for example). That would create a proper p2m[0] and p2m[1], but p2m[3] would point to p2m_mid_missing, and when we try to make p2m[3][231] identity (so ~2.36GB area) we would ...
