Re: [PATCH] mmu notifiers #v5

Previous thread: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges by Christoph Lameter on Wednesday, January 30, 2008 - 9:57 pm. (18 messages)

Next thread: [patch 1/3] mmu_notifier: Core code by Christoph Lameter on Wednesday, January 30, 2008 - 9:57 pm. (16 messages)
From: Christoph Lameter
Date: Wednesday, January 30, 2008 - 9:57 pm

I hope this is finally a release that covers all the requirements. Locking
description is at the top of the core patch.

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced from something
else than tracked by the rmaps of the kernel. The known immediate users are

KVM (establishes a refcount to the page. External references called spte)
	(Refcount seems to be not necessary)

GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb)

XPmem (uses its own reverse mappings. Remote ptes, Needs
	to sleep when sending messages)
	XPmem could defer freeing pages if a callback with atomic=1 occurs.

Pending:

- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU
  (Early tests with the GRU were successful).

Known issues:

- RCU quiescent periods are required on registering
  notifiers to guarantee visibility to other processors.

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a the flag to invalidate_range to indicate if a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
  and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release ...
From: Andrea Arcangeli
Date: Thursday, January 31, 2008 - 10:18 am

GRU should implement at least invalidate_page and invalidate_pages,
and it should be mostly covered with that.

My suggestion is to add the invalidate_range_start/end incrementally
with this, and to keep all the xpmem mmu notifiers in a separate
incremental patch (those are going to require many more changes to
perfect). They've very different things. GRU is simpler, will require
less changes and it should be taken care of sooner than XPMEM. KVM
requirements are a subset of GRU thanks to the page pin so I can
ignore KVM requirements as a whole and I only focus on GRU for the
time being.

There's room for optimization by moving some invalidate_page to
invalidate_pages but that would complicate all tlb flushing code a
lot, it'd require tons of changes and we should concentrate on getting
the functionality included in the simplest, obviously safe, least
intrusive form, in a way that can be optimized later if that turns out
to be an issue. And I think for KVM this is already quite optimal. And
there's zero slowdown when the notifiers are armed, for all tasks but
the one tracked by the notifiers.

Invalidates inside PT lock will avoid the page faults to happen in
parallel of my invalidates, no dependency on the page pin, mremap
covered, mprotect covered etc... I verified with the below patch KVM
swapping is rock solid on top of 2.6.24 as with my #v4 on a older host
kernel. I think it'd be much less efficient to add yet another
non-per-pte-scalar lock to serialize the GRU interrupt or KVM
pagefault against the main linux page fault, given we already have all
needed serialization out of the PT lock. XPMEM is forced to do that
because it has to sleep in sending the invalidate. So it has to take
such subsystem mutex, send the invalidate, ptep_clear_flush, and
finally unlock the mutex (hence the need of a _end callback, to unlock
the page faults). KVM could also add a kvm internal lock to do
something like that but there's no point given PT lock is more
scalar. GRU is totally forbidden ...
From: Christoph Lameter
Date: Thursday, January 31, 2008 - 1:18 pm

You are aware that the pt lock is split for systems with >4 CPUS? You can 

pt lock cannot serialize with invalidate_range since it is split. A range 

That may be okay. Have you checked all the arches that can provide their 
own implementation of this macro? This is only going to work on arches 
that use the generic implementation.


This will require a callback on every(!) removal of a pte. A range 
invalidate does not do any good since the callbacks are performed anyways. 
Probably needlessly.

In addition you have the same issues with arches providing their own macro 

Ahh you found an additional arch. How about x86 code? There is one 

What is the point of invalidate_pages? It cannot be serialized properly 
and you do the invalidate_page() calles regardless. Is is some sort of 

Macros. We want functions there to be able to validate the parameters even 


You are under the wrong impression that you can use the pte lock to 
serialize general access to ptes! Nope. ptelock only serialize access to 


Again broken serialization.
--

From: Christoph Lameter
Date: Thursday, January 31, 2008 - 4:09 pm

Hmmm.. May be okay after all. I see that you are only doing it on the pte 
level. This means the range callbacks are taking down a max of 512 
entries. So you have a callback for each pmd. A callback for 2M of memory?




--

From: Andrea Arcangeli
Date: Thursday, January 31, 2008 - 4:41 pm

Exactly. The point of _pages is to reduce of an order of magnitude
(512, or 1024 times) the number of needed invalidate_page calls in a
few places where it's a strightforward optimization for both KVM and
GRU. Thanks to the PT lock this remains a totally obviously safe
design and it requires zero additional locking anywhere (nor linux VM,
nor in the mmu notifier methods, nor in the KVM/GRU page fault).

Sure you can do invalidate_range_start/end for more than 2M(/4M on
32bit) max virtual ranges. But my approach that averages the fixed
mmu_lock cost already over 512(/1024) ptes will make any larger
"range" improvement not strongly measurable anymore given to do that
you have to add locking as well and _surely_ decrease the GRU
scalability with tons of threads and tons of cpus potentially making
GRU a lot slower _especially_ on your numa systems.
--

From: Christoph Lameter
Date: Thursday, January 31, 2008 - 6:44 pm

The trouble is that the invalidates are much more expensive if you have to 
send theses to remote partitions (XPmem). And its really great if you can 
simple tear down everything. Certainly this is a significant improvement 
over the earlier approach but you still have the invalidate_page calls in 
ptep_clear_flush. So they fire needlessly?

Serializing access in the device driver makes sense and comes with 
additional possiblity of not having to increment page counts all the time. 
So you trade one cacheline dirtying for many that are necessary if you 
always increment the page count.

How does KVM insure the consistency of the shadow page tables? Atomic ops?

The GRU has no page table on its own. It populates TLB entries on demand 
using the linux page table. There is no way it can figure out when to 
drop page counts again. The invalidate calls are turned directly into tlb 
flushes.

 
--

From: Andrea Arcangeli
Date: Friday, February 1, 2008 - 5:09 am

Dunno, they certainly fire more frequently than yours, even _pages
fires more frequently than range_start,end but don't forget why!
That's because I've a different spinlock for every 512
ptes/4k-grub-tlbs that are being invalidated... So it pays off in
scalability. I'm unsure if gru could play tricks with your patch, to
still allow faults to still happen in parallel if they're on virtual

Note that my #v5 doesn't require to increase the page count all the
time, so GRU will work fine with #v5.

See this comment in my patch:

    /*
     * invalidate_page[s] is called in atomic context
     * after any pte has been updated and before
     * dropping the PT lock required to update any Linux pte.
     * Once the PT lock will be released the pte will have its
     * final value to export through the secondary MMU.
     * Before this is invoked any secondary MMU is still ok
     * to read/write to the page previously pointed by the
     * Linux pte because the old page hasn't been freed yet.
     * If required set_page_dirty has to be called internally
     * to this method.
     */


invalidate_page[s] is always called before the page is freed. This
will require modifications to the tlb flushing code logic to take

A per-VM mmu_lock spinlock is taken to serialize the access, plus

Yes, this is why it can't serialize follow_page with only the PT lock
with your patch. KVM may do it once you add start,end to range_end
only thanks to the additional pin on the page.
--

From: Christoph Lameter
Date: Friday, February 1, 2008 - 12:23 pm

But that comes with the cost of firing invalidate_page for every page 
being evicted. In order to make your single invalidate_range work without 

Yes so your invalidate_range is still some sort of dysfunctional 
optimization? Gazillions of invalidate_page's will have to be executed 

And that would not be enough to hold of new references? With small tweaks 
this should work with a common scheme. We could also redefine the role 
of _start and _end slightly to just require that the refs are removed when 
_end completes. That would allow the KVM page count ref to work as is now 

Right but that pin requires taking a refcount which we cannot do.

Frankly this looks as if this is a solution that would work only for KVM.
 
--

From: Andrea Arcangeli
Date: Saturday, February 2, 2008 - 7:17 pm

I don't know if gru can flush the external TLB references with a
single instruction like the cpu can do by overwriting %cr3. As far as
vmx/svm are concerned, you got to do some fixed amount of work on the
rmap structures and each spte, for each 4k invalidated regardless of
page/pages/range/ranges. You can only share the cost of the lock and
of the memslot lookup, so you lookup and take the lock once and you
drop 512 sptes instead of just 1. Similar to what we do in the main
linux VM by taking the PT lock for every 512 sptes.

So you worry about gazillions of invalidate_page without taking into
account that even if you call invalidate_range_end(0, -1ULL) KVM will
have to mangle over 4503599627370496 rmap entries anyway. Yes calling
1 invalidate_range_end instead of 4503599627370496 invalidate_page,
will be faster, but not much faster. For KVM it remains an O(N)
operation, where N is the number of pages. I'm not so sure we need to
worry about my invalidate_pages being limited to 512 entries.

Perhaps GRU is different, I'm just describing the KVM situation here.

As far as KVM is concerned it's more sensible to be able to scale when
there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
a different guest physical page (especially when new chunks of memory
are allocated for the first time) that won't collide the whole time on

I can already only use _end and ignore _start, only remaining problem
is this won't be 100% coherent (my patch is 100% coherent thanks to PT
lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
secondary MMU faults). My approach give a better peace of mind with
full scalability and no lock recursion when it's the KVM page fault
that invokes get_user_pages that invokes the linux page fault routines
that will try to execute _start taking the lock to block the page

GRU can use my patch without the pin. XPMEM obviously can't use my
patch as my invalidate_page[s] are under the PT lock (a feature to fit
GRU/KVM in the simplest ...
From: Jack Steiner
Date: Saturday, February 2, 2008 - 8:14 pm

The mechanism obviously is a little different, but the GRU can flush an
external TLB in a single read/write/flush of a cacheline (should take a few
100 nsec). Typically, an application uses a single GRU. Threaded
applications, however, could use one GRU per thread, so it is possible that
multiple TLBs must be flushed. In some cases, the external flush can be
avoided if the GRU is not currently in use by the thread.

Also, most (but not all) applications that use the GRU do not usually do
anything that requires frequent flushing (fortunately). The GRU is intended
for HPC-like applications. These don't usually do frequent map/unmap
operations or anything else that requires a lot of flushes.

I expect that KVM is a lot different.

I have most of the GRU code working with the latest mmuops patch. I still
have a list of loose ends that I'll get to next week. The most important is
the exact handling of the range invalidates. The code that I currently have
works (mostly) but has a few endcases that will cause problems. Once I
finish, I'll be glad to send you snippets of the code (or all of it) if you

--- jack
--

From: Andrea Arcangeli
Date: Saturday, February 2, 2008 - 8:33 pm

I don't think so. invalidate_page/pages/range_start,end is a slow and
unfrequent path for KVM (or alternatively the ranges are very small in
which case _range_start/end won't payoff compared to _pages). Whenever
invalidate_page[s] become a fast path, we're generally I/O
bound. get_user_pages is always the fast path instead. I thought it
was much more important that get_user_pages scale as well as it does
now and that the KVM page fault isn't serialized with a mutex, than
whatever invalidate side optimization. get_user_pages may run
frequently from all vcpus even if there are no invalidates and no

Sure.
--

From: Christoph Lameter
Date: Monday, February 4, 2008 - 12:09 pm

Doesnt the kernel in some situations release the page before releasing the 
pte lock? Then there will be an external pte pointing to a page that may 
now have a different use. Its really bad if that pte does allow writes.


--

From: Andrea Arcangeli
Date: Monday, February 4, 2008 - 10:25 pm

Sure the kernel does that most of the time, which is for example why I
had to use invalidate_page instead of invalidate_pages inside
zap_pte_range. Zero problems with that (this is also the exact reason
why I mentioned the tlb flushing code would need changes to convert
some page in pages).
--

From: Christoph Lameter
Date: Monday, February 4, 2008 - 11:11 pm

Zero problems only if you find having a single callout for every page 
acceptable. So the invalidate_range in your patch is only working 
sometimes. And even if it works then it has to be used on 2M range. Seems 
to be a bit fragile and needlessly complex.

"conversion of some page in pages"? A proposal to defer the freeing of the 
pages until after the pte_unlock?

--

From: Andrea Arcangeli
Date: Tuesday, February 5, 2008 - 11:08 am

invalidate_pages is only a further optimization that was
strightforward in some places where the page isn't freed but only the

The patch as a whole isn't fragile nor complex. Pretending to use
invalidate_pages anywhere would be complex (and in turn more fragile
than my current patch, complexity brings fragility).

Overall you can only argue against performance issues (my patch is
simpler for GRU/KVM, and it sure isn't fragile, quite the opposite,
given I never allow a coherency-loss between two threads that will
read/write to two different physical pages for the same virtual
adddress in remap_file_pages).

In performance terms with your patch before GRU can run follow_page it
has to take a mm-wide global mutex where each thread in all cpus will
have to take it. That will trash on >4-way when the tlb misses start
to occour. There's nothing like that in my patch. Your approach will
micro-optimize certain large pte-mangling calls, or do_exit, but those
aren't interesting paths nor for GRU nor for KVM. You're optimizing for

There can be many tricks to optimize page in pages, but again munmap
and do_exit aren't the interesting path to optimzie, nor for GRU nor
for KVM so it doesn't matter right now.
--

From: Christoph Lameter
Date: Tuesday, February 5, 2008 - 11:17 am

The other approach will not have any remote ptes at that point. Why would 

No. It only has to lock the affected range. Remote page faults can occur 
while another part of the address space is being invalidated. The 
complexity of locking is up to the user of the mmu notifier. A simple 
implementation is satisfactory for the GRU right now. Should it become a 

Still not sure what we are talking about here.
 
--

From: Andrea Arcangeli
Date: Tuesday, February 5, 2008 - 1:55 pm

It never happens that two threads writes to two different physical
pages by working on the same process virtual address. This is an issue
only for KVM which is probably ok with it but certainly you can't
consider the dependency on the page-pin less fragile or less complex

That will make the follow_page fast path even slower if it has to
lookup a rbtree or a list of locked ranges. Still not comparable to
the PT lock that 1) it's zero cost and 2) it'll provide an even more

The apps using GRU/KVM never trigger large
munmap/mremap/do_exit. You're optimizing for the irrelevant workload,
by requiring unnecessary new locking in the GRU fast path.
--

From: Christoph Lameter
Date: Tuesday, February 5, 2008 - 3:06 pm

You can avoid the page-pin and the pt lock completely by zapping the 

As I said the implementation is up to the caller. Not sure what 
XPmem is using there but then XPmem is not using follow_page. The GRU 

Maybe that is true for KVM but certainly not true for the GRU. The GRU is 
designed to manage several petabytes of memory that may be mapped by a 
series of Linux instances. If a process only maps a small chunk of 4 
Gigabytes then we already have to deal with 1 mio callbacks.


--

From: Robin Holt
Date: Tuesday, February 5, 2008 - 3:12 pm

XPMEM is doing this by putting our equivalent structure of the mm into a
recalling state which will cause all future faulters to back off, it then
marks any currently active faults in the range as invalid (we have a very
small number of possible concurrent faulters for a different reason),
proceeds to start remote shoot-downs, waits for those shoot-downs to
complete, then returns from the _begin callout with the mm-equiv still in
the recalling state.  Additional recalls may occur, but no new faults can.
The _end callout reduces the number of active recalls until there are
none left at which point the faulters are allowed to proceed again.

Thanks,
Robin
--

From: Andrea Arcangeli
Date: Tuesday, February 5, 2008 - 3:26 pm

Avoid the PT lock? The PT lock has to be taken anyway by the linux
VM.

"holding off new references until _end" = per-range mutex less scalar

"lightway way of locking" = mm-wide-mutex (not necessary at all if we
take advantage of the per-pte-scalar PT lock that has to be taken

KVM is also going to map a lot of stuff, but mapping involves mmap,
munmap/mremap/mprotect not. The size of mmap is irrelevant in both
approaches. optimizing do_exit by making the tlb-miss runtime slower
doesn't sound great to me and that's your patch does if you force GRU
to use it.
--

From: Christoph Lameter
Date: Tuesday, February 5, 2008 - 4:10 pm

You can of course setup a 2M granularity lock to get the same granularity 
as the pte lock. That would even work for the cases where you have to page 

The size of the mmap is relevant if you have to perform callbacks on 
every mapped page that involved take mmu specific locks. That seems to be 
the case with this approach.

Optimizing do_exit by taking a single lock to zap all external references 
instead of 1 mio callbacks somehow leads to slowdown?

--

From: Andrea Arcangeli
Date: Tuesday, February 5, 2008 - 4:47 pm

If you set a 2M granularity lock, the _start callback would need to
do:

	for_each_2m_lock()
		mutex_lock()

so you'd run zillon of mutex_lock in a row, you're the one with the

mmap should never trigger any range_start/_end callback unless it's
overwriting an older mapping which is definitely not the interesting

It can if the application runs for more than a couple of seconds,
i.e. not a fork flood in which you care about do_exit speed. Keep in
mind if you had 1mio invalidate_pages callback it means you previously
called follow_page 1 mio of times too...
--

From: Christoph Lameter
Date: Tuesday, February 5, 2008 - 5:04 pm

There is no requirement to do a linear search. No one in his right mind 

There is still at least the need for teardown on exit. And you need to 
consider the boundless creativity of user land programmers. You would not 

That is another problem were we are also in need of solutions. I believe 
we have discussed that elsewhere.
--

From: Andrea Arcangeli
Date: Thursday, January 31, 2008 - 4:28 pm

The lock I take already protects up to 512 ptes yes. I call
invalidate_pages only across a _single_ 4k pagetable filled at max
with 512 pte_t mapping virutal addresses in a 2M naturally aligned
virtual range.

There's no smp race even for >4 CPUS. Check it again! I never call
invalidate_pages _outside_ the PT lock specific for that 4k pagetable
(a single PT lock protects 512 pte_t, not only a single one!).

Perhaps if you called it PMD lock there would be no misunderstanding:

    spinlock_t *__ptl = pte_lockptr(mm, pmd);	 \


Obviously I checked them all yes, and this was much faster than hand


There's no ptep_clear_flush override on x86 or I would have patched it
like I did to s390.



If you want I can turn this into a static inline, it would already
work fine. Certainly this isn't a blocker for merging given most
people will have MMU_NOTIFIER=y and this speedup compilation a tiny

This is primarly to turn off the compiler warning, not to check the

You are under the wrong impression that invalidate_page could be
called on a range that spawns over a region that isn't 2M naturally

This is a mess I've to say and I'm almost willing to return to a
_safe_rcu + and removing the autodisarming feature. KVM itself isn't
using mm_users but mm_count. then if somebody need to release the mn
inside ->relase they should mmu_notifier_unregister _inside_
->release and then call_rcu to release the mn (synchronize_rcu isn't
allowed inside ->release because of the rcu_spin_lock() that can't
schedule or it'll reach a quiescent point itself allowing the other

The above is perfectly fine too. If you have doubts and your
misunderstanding of my 100% SMP safe locking isn't immediately clear,
think about this, how could it be safe to modify 512 ptes with a
single lock, regardless of the mmu notifiers?

I appreciate the review! I hope my entirely bug free and
strightforward #v5 will strongly increase the probability of getting
this in sooner than later. If something else it shows ...
From: Christoph Lameter
Date: Thursday, January 31, 2008 - 6:37 pm

Mutex locking? Could you be more specific?

I hope you will continue to do reviews of the other mmu notifier patchset?


--

From: Robin Holt
Date: Thursday, January 31, 2008 - 7:23 pm

I think he is talking about the external locking that xpmem will need
to do to ensure we are not able to refault pages inside of regions that
are undergoing recall/page table clearing.  At least that has been my
understanding to this point.

Thanks,
Robin
--

From: Christoph Lameter
Date: Thursday, January 31, 2008 - 7:26 pm

Right this has to be something like rw spinlock. Its needed for both 
GRU/XPmem. Not sure about KVM.

Take the read lock for invalidate operations. These can occur 
concurrently. (Or a simpler implementation for the GRU may just use a 
spinlock).

The write lock must be held for populate operations.

Lock can be refined as needed by the notifier driver. F.e. locking could 
be restricted to certain ranges.

--

From: Andrea Arcangeli
Date: Friday, February 1, 2008 - 5:00 am

Robin understanding was correct on this point. Also I think if you add
start,end to range_end (like suggested a few times by me and once
by Robin) I may get away without a lock thanks to the page pin. That's
one strong reason why start,end are needed in range_end.

The only one that will definitely be forced to add a new lock around

Sure. I will also create a new KVM patch to plug on top of your
versions (I already did for V2/V3 even if it never worked, but that
might have a build problem, see kvm-devel for details). To be clear,
as long as anything is merged that avoids me to use kprobes to make
KVM swap work, I'm satisfied. I thought my approach had several
advantages in simplicity, and being more obviously safe (in not
relaying entirely on the page pin and creating a window of time where
the linux pte writes to a page and the sptes writes to another one,
remember populate_range), and it avoids introducing external locks in
the GRU follow_page path. My approach was supposed to be in addition
of the range_start/end needed by xpmem that can't possibly take
advanage of the scalar PT lock and it definitely requires external
lock to serialize xpmem fault against linux pagefault (not the case
for GRU/KVM).
--

Previous thread: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges by Christoph Lameter on Wednesday, January 30, 2008 - 9:57 pm. (18 messages)

Next thread: [patch 1/3] mmu_notifier: Core code by Christoph Lameter on Wednesday, January 30, 2008 - 9:57 pm. (16 messages)