From: Rik van Riel <riel@redhat.com>
Both the page migration code and the transparent hugepage patches expect
100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
races with mmap, munmap, expand_stack, etc.
Specifically, try_to_unmap indirectly calls vma_address, which uses the
difference between vma->vm_start and vma->vm_pgoff, which can race when a
stack is expanded downwards. VMA splitting and merging present similar
issues.
With the new anon_vma code, one VMA can be attached to multiple anon_vmas,
however mmap, munmap, expand_stack and others only took one anon_vma->lock.
This patch changes things so we take the anon_vma locks for all of the
anon_vmas attached to a VMA in the code that try_to_unmap would otherwise
race against: mmap, munmap, expand_stack, etc.
Unfortunately, this leads to a lock ordering conflict with the page_table_lock,
which protected the "same_vma" list in the anon_vma_chain. Replacing that
lock with a new lock (mm->anon_vma_chain_lock), which is taken higher up in
the mm locking hierarchy, solves that issue. This changes the locking rules
for the "same_vma" list to be either mm->mmap_sem for write, or mm->mmap_sem
for read plus the new mm->anon_vma_chain lock. This limits the place where
the new lock is taken to 2 locations - anon_vma_prepare and expand_downwards.
Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and ...Pretty much same comments as for the other one. Why are we pandering to the case that is/should be unusual? Linus --
In this case, because the fix from the migration side is difficult and fragile, while fixing things from the mmap side is straightforward. I believe the overhead of patch 1/2 should be minimal as well, because the locks we take are the _depth_ of the process tree (truncated every exec), not the width. As for patch 2/2, Mel has an alternative approach for that: http://lkml.org/lkml/2010/4/30/198 Does Mel's patch seem more reasonable to you? -- All rights reversed --
Quite frankly, I think it's totally insane to walk a list of all anon_vma's that are associated with one vma, just to lock them all. Tell me why you just don't put the lock in the vma itself then? Walking a list in order to lock multiple things is something we should _never_ do under any normal circumstances. I can see why you'd want to do this in theory (the "other side" of the locker might have to lock just the _one_ anon_vma), but if your argument is that the list is usually not very deep ("one"?), then there is no advantage, because putting the lock in the anon_vma vs the vma will get the same kind of contention. And if the list _is_ deep, then walking the list to lock them all is a crime against humanity. Well, I certainly think that seems to be a lot more targeted, and not add new allocations in a path that I think is already one of the more expensive ones. Yes, you can argue that execve() is already so expensive that a few more allocations don't matter, and you migth be right, but that's how things get to be too expensive to begin with. That said, I do still wonder why we shouldn't just say that the person who wants the safety is the one that should do the extra work. In particular, why don't we just make rmap_walk() be the one that locks all the anon_vma's? Instead of locking just one? THAT is the function that cares. THAT is the function that should do proper locking and not expect others to do extra unnecessary locking. So again, my gut feel is that if the lock just were in the vma itself, then the "normal" users would have just one natural lock, while the special case users (rmap_walk_anon) would have to lock each vma it traverses. That would seem to be the more natural way to lock things. I dunno. There may well be reasons why it doesn't work, like just the allocation lifetime rules for vma's vs anon_vma's. I'm not claiming I've thought this true. I just get a feeling of "that isn't right" when I look at the original 2/2 ...
One problem is that we cannot find the VMAs (multiple) from the page, except by walking the anon_vma_chain.same_anon_vma list. At the very least, that list requires locking, done by the anon_vma.lock. When already you have that lock, also taking per-VMA locks would be superfluous. It could also be difficult, especially since the rmap side code and the mmap side code approach the data structures from the other side, potentially leading to In a freshly exec()d process, there will be one anon_vma. In workloads with forking daemons, like apache, postgresql or sendmail, the child process will end up with two anon_vmas A forkbomb could definately end up getting slowed down by this patch. Is there any real workload out there that just forks deeper and deeper from the parent process, without However ... there's still the issue of page_lock_anon_vma in try_to_unmap_anon. I guess it protects against any of the VMAs going away, because anon_vma_unlink also takes the anon_vma->lock. I have no good answer to this question. Mel? Andrea? -- All rights reversed --
If try_to_unmap is allowed to establish the migration pte, then such pte has to remain reachable through rmap_walk at all times after that, or migration_entry_wait will crash because it notices the page has been migrated already (PG_lock not set) but there is still a migration pte established. (remove_migration_pte like split_huge_page isn't allowed to fail finding all migration ptes mapping a page during the rmap walk) It's not false positive BUG_ON if that's what you mean, removing the BUG_ON would still lead to infinite hang waiting on a migration pte that shouldn't be there anymore. --
But that's exactly what we do in rmap_walk() anyway. But yes, I can well imagine that in other cases we only do the one anon_vma. I didn't check who used the lock. So if we do want to keep the lock in the anon_vma, I would just suggest that instead of making "normal" users do lots of locking, make the Heh. AIM7. Wasn't that why we merged the multiple anon_vma's in the first Do we care? We've not locked them all there, and we've historically not cares about the rmap list being "perfect", have we? So I _think_ it's just the migration case (and apparently potentially the hugepage case) that wants _exact_ information. Which is why I suggest the onus of the extra locking should be on _them_, not on the regular code. I dunno. Again, my objections to the patches are really based more on a gut feel of "that can't be the right thing to do" than anything else. We have _extremely_ few places that walk lists to lock things. And they are never "normal" code. Things like that magic "mm_take_all_locks()", for example. That is why I then react with "that can't be right" to patches like this. Linus --
Mel's original patch adds trylock & retry all code to rmap_walk and a few other places: http://lkml.org/lkml/2010/4/26/321 I submitted my patch 1/2 as an alternative, because these repeated trylocks are pretty complex and easy to accidentally break when Well, try_to_unmap_anon walks just one page, and has the anon_vma for that page locked. Having said that, for pageout we do indeed not care about getting It's a matter of cost vs complexity. IMHO the locking changes in the lowest overhead patches (Mel's) are quite complex and could end up being hard to maintain in the future. I wanted to introduce something a little simpler, with hopefully minimal overhead. But hey, that's just my opinion - what matters is that the bug gets fixed somehow. If you prefer the more complex but slightly lower overhead patches from Mel, that's fine too. -- All rights reversed --
The wording could have been better. The problem is that once a migration PTE is established, it is expected that rmap can find it. In the specific case of exec, this can fail because of how the temporary stack is moved. As migration colliding with exec is rare, the approach taken by the patch was to not create migration PTEs that rmap could not find. On the plus side, exec (the common case) is unaffected. On the negative side, it's avoiding the exec vs migration problem instead of fixing it. The BUG_ON is not a buggy check. While migration is taking place, the page lock is held and not unreleased until all the migration PTEs have been removed. If a migration entry exists and the page is unlocked, it means that rmap failed to find all the entries. If the BUG_ON was not made, do_swap_page() would either end up looking up a semi-random entry in swap cache and inserting it (memory corruption), inserting a random page from swap (memory corruption) or returning VM_FAULT_OOM to the fault handler (general carnage). It was considered to lazily clean up the migration PTEs (http://lkml.org/lkml/2010/4/27/458) but there is no guarantee that the page the migration PTE pointed to is still the correct one. If it had been freed and re-used, the results would probably be memory corruption. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This does leave me worrying about concurrent faults poking at vma->vm_end without synchronization. --
I didn't check this patch in detail yet. I agree it can be removed and I think it can be safely replaced with the page_table_lock. --
Sure, it could probably be replaced with the ptl, but a single anon_vma->lock would I think be better since there's more of them. --
ptl not enough, or it'd break if stack grows fast more than the size of one pmd, page_table_lock enough instead. Keeping anon_vma lock is sure fine with me ;), I was informally asked if it was a must have, and I couldn't foresee any problem in _replacing_ it (not removing) with page_table_lock (which I hope I mentioned in my answer ;). But I never had an interest to remove it, just I couldn't find any good reason to keep it either other than "paranoid just in case", which is good enough justification to me ;) considering these archs are uncommon and by definition gets less testing. --
