I hope this is finally a release that covers all the requirements. Locking description is at the top of the core patch. This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel. The known immediate users are KVM (establishes a refcount to the page. External references called spte) (Refcount seems to be not necessary) GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb) XPmem (uses its own reverse mappings. Remote ptes, Needs to sleep when sending messages) XPmem could defer freeing pages if a callback with atomic=1 occurs. Pending: - Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU (Early tests with the GRU were successful). Known issues: - RCU quiescent periods are required on registering notifiers to guarantee visibility to other processors. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. V3->V4: - Drop locking and synchronize_rcu() on ->release since we know on release ...
GRU should implement at least invalidate_page and invalidate_pages, and it should be mostly covered with that. My suggestion is to add the invalidate_range_start/end incrementally with this, and to keep all the xpmem mmu notifiers in a separate incremental patch (those are going to require many more changes to perfect). They've very different things. GRU is simpler, will require less changes and it should be taken care of sooner than XPMEM. KVM requirements are a subset of GRU thanks to the page pin so I can ignore KVM requirements as a whole and I only focus on GRU for the time being. There's room for optimization by moving some invalidate_page to invalidate_pages but that would complicate all tlb flushing code a lot, it'd require tons of changes and we should concentrate on getting the functionality included in the simplest, obviously safe, least intrusive form, in a way that can be optimized later if that turns out to be an issue. And I think for KVM this is already quite optimal. And there's zero slowdown when the notifiers are armed, for all tasks but the one tracked by the notifiers. Invalidates inside PT lock will avoid the page faults to happen in parallel of my invalidates, no dependency on the page pin, mremap covered, mprotect covered etc... I verified with the below patch KVM swapping is rock solid on top of 2.6.24 as with my #v4 on a older host kernel. I think it'd be much less efficient to add yet another non-per-pte-scalar lock to serialize the GRU interrupt or KVM pagefault against the main linux page fault, given we already have all needed serialization out of the PT lock. XPMEM is forced to do that because it has to sleep in sending the invalidate. So it has to take such subsystem mutex, send the invalidate, ptep_clear_flush, and finally unlock the mutex (hence the need of a _end callback, to unlock the page faults). KVM could also add a kvm internal lock to do something like that but there's no point given PT lock is more scalar. GRU is totally forbidden ...
You are aware that the pt lock is split for systems with >4 CPUS? You can pt lock cannot serialize with invalidate_range since it is split. A range That may be okay. Have you checked all the arches that can provide their own implementation of this macro? This is only going to work on arches that use the generic implementation. This will require a callback on every(!) removal of a pte. A range invalidate does not do any good since the callbacks are performed anyways. Probably needlessly. In addition you have the same issues with arches providing their own macro Ahh you found an additional arch. How about x86 code? There is one What is the point of invalidate_pages? It cannot be serialized properly and you do the invalidate_page() calles regardless. Is is some sort of Macros. We want functions there to be able to validate the parameters even You are under the wrong impression that you can use the pte lock to serialize general access to ptes! Nope. ptelock only serialize access to Again broken serialization. --
Hmmm.. May be okay after all. I see that you are only doing it on the pte level. This means the range callbacks are taking down a max of 512 entries. So you have a callback for each pmd. A callback for 2M of memory? --
Exactly. The point of _pages is to reduce of an order of magnitude (512, or 1024 times) the number of needed invalidate_page calls in a few places where it's a strightforward optimization for both KVM and GRU. Thanks to the PT lock this remains a totally obviously safe design and it requires zero additional locking anywhere (nor linux VM, nor in the mmu notifier methods, nor in the KVM/GRU page fault). Sure you can do invalidate_range_start/end for more than 2M(/4M on 32bit) max virtual ranges. But my approach that averages the fixed mmu_lock cost already over 512(/1024) ptes will make any larger "range" improvement not strongly measurable anymore given to do that you have to add locking as well and _surely_ decrease the GRU scalability with tons of threads and tons of cpus potentially making GRU a lot slower _especially_ on your numa systems. --
The trouble is that the invalidates are much more expensive if you have to send theses to remote partitions (XPmem). And its really great if you can simple tear down everything. Certainly this is a significant improvement over the earlier approach but you still have the invalidate_page calls in ptep_clear_flush. So they fire needlessly? Serializing access in the device driver makes sense and comes with additional possiblity of not having to increment page counts all the time. So you trade one cacheline dirtying for many that are necessary if you always increment the page count. How does KVM insure the consistency of the shadow page tables? Atomic ops? The GRU has no page table on its own. It populates TLB entries on demand using the linux page table. There is no way it can figure out when to drop page counts again. The invalidate calls are turned directly into tlb flushes. --
Dunno, they certainly fire more frequently than yours, even _pages
fires more frequently than range_start,end but don't forget why!
That's because I've a different spinlock for every 512
ptes/4k-grub-tlbs that are being invalidated... So it pays off in
scalability. I'm unsure if gru could play tricks with your patch, to
still allow faults to still happen in parallel if they're on virtual
Note that my #v5 doesn't require to increase the page count all the
time, so GRU will work fine with #v5.
See this comment in my patch:
/*
* invalidate_page[s] is called in atomic context
* after any pte has been updated and before
* dropping the PT lock required to update any Linux pte.
* Once the PT lock will be released the pte will have its
* final value to export through the secondary MMU.
* Before this is invoked any secondary MMU is still ok
* to read/write to the page previously pointed by the
* Linux pte because the old page hasn't been freed yet.
* If required set_page_dirty has to be called internally
* to this method.
*/
invalidate_page[s] is always called before the page is freed. This
will require modifications to the tlb flushing code logic to take
A per-VM mmu_lock spinlock is taken to serialize the access, plus
Yes, this is why it can't serialize follow_page with only the PT lock
with your patch. KVM may do it once you add start,end to range_end
only thanks to the additional pin on the page.
--
But that comes with the cost of firing invalidate_page for every page being evicted. In order to make your single invalidate_range work without Yes so your invalidate_range is still some sort of dysfunctional optimization? Gazillions of invalidate_page's will have to be executed And that would not be enough to hold of new references? With small tweaks this should work with a common scheme. We could also redefine the role of _start and _end slightly to just require that the refs are removed when _end completes. That would allow the KVM page count ref to work as is now Right but that pin requires taking a refcount which we cannot do. Frankly this looks as if this is a solution that would work only for KVM. --
I don't know if gru can flush the external TLB references with a single instruction like the cpu can do by overwriting %cr3. As far as vmx/svm are concerned, you got to do some fixed amount of work on the rmap structures and each spte, for each 4k invalidated regardless of page/pages/range/ranges. You can only share the cost of the lock and of the memslot lookup, so you lookup and take the lock once and you drop 512 sptes instead of just 1. Similar to what we do in the main linux VM by taking the PT lock for every 512 sptes. So you worry about gazillions of invalidate_page without taking into account that even if you call invalidate_range_end(0, -1ULL) KVM will have to mangle over 4503599627370496 rmap entries anyway. Yes calling 1 invalidate_range_end instead of 4503599627370496 invalidate_page, will be faster, but not much faster. For KVM it remains an O(N) operation, where N is the number of pages. I'm not so sure we need to worry about my invalidate_pages being limited to 512 entries. Perhaps GRU is different, I'm just describing the KVM situation here. As far as KVM is concerned it's more sensible to be able to scale when there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing a different guest physical page (especially when new chunks of memory are allocated for the first time) that won't collide the whole time on I can already only use _end and ignore _start, only remaining problem is this won't be 100% coherent (my patch is 100% coherent thanks to PT lock implicitly serializing follow_page/get_user_pages of the KVM/GRU secondary MMU faults). My approach give a better peace of mind with full scalability and no lock recursion when it's the KVM page fault that invokes get_user_pages that invokes the linux page fault routines that will try to execute _start taking the lock to block the page GRU can use my patch without the pin. XPMEM obviously can't use my patch as my invalidate_page[s] are under the PT lock (a feature to fit GRU/KVM in the simplest ...
The mechanism obviously is a little different, but the GRU can flush an external TLB in a single read/write/flush of a cacheline (should take a few 100 nsec). Typically, an application uses a single GRU. Threaded applications, however, could use one GRU per thread, so it is possible that multiple TLBs must be flushed. In some cases, the external flush can be avoided if the GRU is not currently in use by the thread. Also, most (but not all) applications that use the GRU do not usually do anything that requires frequent flushing (fortunately). The GRU is intended for HPC-like applications. These don't usually do frequent map/unmap operations or anything else that requires a lot of flushes. I expect that KVM is a lot different. I have most of the GRU code working with the latest mmuops patch. I still have a list of loose ends that I'll get to next week. The most important is the exact handling of the range invalidates. The code that I currently have works (mostly) but has a few endcases that will cause problems. Once I finish, I'll be glad to send you snippets of the code (or all of it) if you --- jack --
I don't think so. invalidate_page/pages/range_start,end is a slow and unfrequent path for KVM (or alternatively the ranges are very small in which case _range_start/end won't payoff compared to _pages). Whenever invalidate_page[s] become a fast path, we're generally I/O bound. get_user_pages is always the fast path instead. I thought it was much more important that get_user_pages scale as well as it does now and that the KVM page fault isn't serialized with a mutex, than whatever invalidate side optimization. get_user_pages may run frequently from all vcpus even if there are no invalidates and no Sure. --
Doesnt the kernel in some situations release the page before releasing the pte lock? Then there will be an external pte pointing to a page that may now have a different use. Its really bad if that pte does allow writes. --
Sure the kernel does that most of the time, which is for example why I had to use invalidate_page instead of invalidate_pages inside zap_pte_range. Zero problems with that (this is also the exact reason why I mentioned the tlb flushing code would need changes to convert some page in pages). --
Zero problems only if you find having a single callout for every page acceptable. So the invalidate_range in your patch is only working sometimes. And even if it works then it has to be used on 2M range. Seems to be a bit fragile and needlessly complex. "conversion of some page in pages"? A proposal to defer the freeing of the pages until after the pte_unlock? --
invalidate_pages is only a further optimization that was strightforward in some places where the page isn't freed but only the The patch as a whole isn't fragile nor complex. Pretending to use invalidate_pages anywhere would be complex (and in turn more fragile than my current patch, complexity brings fragility). Overall you can only argue against performance issues (my patch is simpler for GRU/KVM, and it sure isn't fragile, quite the opposite, given I never allow a coherency-loss between two threads that will read/write to two different physical pages for the same virtual adddress in remap_file_pages). In performance terms with your patch before GRU can run follow_page it has to take a mm-wide global mutex where each thread in all cpus will have to take it. That will trash on >4-way when the tlb misses start to occour. There's nothing like that in my patch. Your approach will micro-optimize certain large pte-mangling calls, or do_exit, but those aren't interesting paths nor for GRU nor for KVM. You're optimizing for There can be many tricks to optimize page in pages, but again munmap and do_exit aren't the interesting path to optimzie, nor for GRU nor for KVM so it doesn't matter right now. --
The other approach will not have any remote ptes at that point. Why would No. It only has to lock the affected range. Remote page faults can occur while another part of the address space is being invalidated. The complexity of locking is up to the user of the mmu notifier. A simple implementation is satisfactory for the GRU right now. Should it become a Still not sure what we are talking about here. --
It never happens that two threads writes to two different physical pages by working on the same process virtual address. This is an issue only for KVM which is probably ok with it but certainly you can't consider the dependency on the page-pin less fragile or less complex That will make the follow_page fast path even slower if it has to lookup a rbtree or a list of locked ranges. Still not comparable to the PT lock that 1) it's zero cost and 2) it'll provide an even more The apps using GRU/KVM never trigger large munmap/mremap/do_exit. You're optimizing for the irrelevant workload, by requiring unnecessary new locking in the GRU fast path. --
You can avoid the page-pin and the pt lock completely by zapping the As I said the implementation is up to the caller. Not sure what XPmem is using there but then XPmem is not using follow_page. The GRU Maybe that is true for KVM but certainly not true for the GRU. The GRU is designed to manage several petabytes of memory that may be mapped by a series of Linux instances. If a process only maps a small chunk of 4 Gigabytes then we already have to deal with 1 mio callbacks. --
XPMEM is doing this by putting our equivalent structure of the mm into a recalling state which will cause all future faulters to back off, it then marks any currently active faults in the range as invalid (we have a very small number of possible concurrent faulters for a different reason), proceeds to start remote shoot-downs, waits for those shoot-downs to complete, then returns from the _begin callout with the mm-equiv still in the recalling state. Additional recalls may occur, but no new faults can. The _end callout reduces the number of active recalls until there are none left at which point the faulters are allowed to proceed again. Thanks, Robin --
Avoid the PT lock? The PT lock has to be taken anyway by the linux VM. "holding off new references until _end" = per-range mutex less scalar "lightway way of locking" = mm-wide-mutex (not necessary at all if we take advantage of the per-pte-scalar PT lock that has to be taken KVM is also going to map a lot of stuff, but mapping involves mmap, munmap/mremap/mprotect not. The size of mmap is irrelevant in both approaches. optimizing do_exit by making the tlb-miss runtime slower doesn't sound great to me and that's your patch does if you force GRU to use it. --
You can of course setup a 2M granularity lock to get the same granularity as the pte lock. That would even work for the cases where you have to page The size of the mmap is relevant if you have to perform callbacks on every mapped page that involved take mmu specific locks. That seems to be the case with this approach. Optimizing do_exit by taking a single lock to zap all external references instead of 1 mio callbacks somehow leads to slowdown? --
If you set a 2M granularity lock, the _start callback would need to do: for_each_2m_lock() mutex_lock() so you'd run zillon of mutex_lock in a row, you're the one with the mmap should never trigger any range_start/_end callback unless it's overwriting an older mapping which is definitely not the interesting It can if the application runs for more than a couple of seconds, i.e. not a fork flood in which you care about do_exit speed. Keep in mind if you had 1mio invalidate_pages callback it means you previously called follow_page 1 mio of times too... --
There is no requirement to do a linear search. No one in his right mind There is still at least the need for teardown on exit. And you need to consider the boundless creativity of user land programmers. You would not That is another problem were we are also in need of solutions. I believe we have discussed that elsewhere. --
The lock I take already protects up to 512 ptes yes. I call
invalidate_pages only across a _single_ 4k pagetable filled at max
with 512 pte_t mapping virutal addresses in a 2M naturally aligned
virtual range.
There's no smp race even for >4 CPUS. Check it again! I never call
invalidate_pages _outside_ the PT lock specific for that 4k pagetable
(a single PT lock protects 512 pte_t, not only a single one!).
Perhaps if you called it PMD lock there would be no misunderstanding:
spinlock_t *__ptl = pte_lockptr(mm, pmd); \
Obviously I checked them all yes, and this was much faster than hand
There's no ptep_clear_flush override on x86 or I would have patched it
like I did to s390.
If you want I can turn this into a static inline, it would already
work fine. Certainly this isn't a blocker for merging given most
people will have MMU_NOTIFIER=y and this speedup compilation a tiny
This is primarly to turn off the compiler warning, not to check the
You are under the wrong impression that invalidate_page could be
called on a range that spawns over a region that isn't 2M naturally
This is a mess I've to say and I'm almost willing to return to a
_safe_rcu + and removing the autodisarming feature. KVM itself isn't
using mm_users but mm_count. then if somebody need to release the mn
inside ->relase they should mmu_notifier_unregister _inside_
->release and then call_rcu to release the mn (synchronize_rcu isn't
allowed inside ->release because of the rcu_spin_lock() that can't
schedule or it'll reach a quiescent point itself allowing the other
The above is perfectly fine too. If you have doubts and your
misunderstanding of my 100% SMP safe locking isn't immediately clear,
think about this, how could it be safe to modify 512 ptes with a
single lock, regardless of the mmu notifiers?
I appreciate the review! I hope my entirely bug free and
strightforward #v5 will strongly increase the probability of getting
this in sooner than later. If something else it shows ...Mutex locking? Could you be more specific? I hope you will continue to do reviews of the other mmu notifier patchset? --
I think he is talking about the external locking that xpmem will need to do to ensure we are not able to refault pages inside of regions that are undergoing recall/page table clearing. At least that has been my understanding to this point. Thanks, Robin --
Right this has to be something like rw spinlock. Its needed for both GRU/XPmem. Not sure about KVM. Take the read lock for invalidate operations. These can occur concurrently. (Or a simpler implementation for the GRU may just use a spinlock). The write lock must be held for populate operations. Lock can be refined as needed by the notifier driver. F.e. locking could be restricted to certain ranges. --
Robin understanding was correct on this point. Also I think if you add start,end to range_end (like suggested a few times by me and once by Robin) I may get away without a lock thanks to the page pin. That's one strong reason why start,end are needed in range_end. The only one that will definitely be forced to add a new lock around Sure. I will also create a new KVM patch to plug on top of your versions (I already did for V2/V3 even if it never worked, but that might have a build problem, see kvm-devel for details). To be clear, as long as anything is merged that avoids me to use kprobes to make KVM swap work, I'm satisfied. I thought my approach had several advantages in simplicity, and being more obviously safe (in not relaying entirely on the page pin and creating a window of time where the linux pte writes to a page and the sptes writes to another one, remember populate_range), and it avoids introducing external locks in the GRU follow_page path. My approach was supposed to be in addition of the range_start/end needed by xpmem that can't possibly take advanage of the scalar PT lock and it definitely requires external lock to serialize xpmem fault against linux pagefault (not the case for GRU/KVM). --
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
| Linux Kernel Mailing List |
