Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock

Previous thread: [PATCH 2/2] mm: fix race between shift_arg_pages and rmap_walk by Rik van Riel on Monday, May 3, 2010 - 9:19 am. (3 messages)

Next thread: [PATCH 0/2] Fix migration races in rmap_walk() V4 by Rik van Riel on Monday, May 3, 2010 - 9:17 am. (1 message)
From: Rik van Riel
Date: Monday, May 3, 2010 - 9:18 am

From: Rik van Riel <riel@redhat.com>

Both the page migration code and the transparent hugepage patches expect
100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
races with mmap, munmap, expand_stack, etc.

Specifically, try_to_unmap indirectly calls vma_address, which uses the
difference between vma->vm_start and vma->vm_pgoff, which can race when a
stack is expanded downwards.  VMA splitting and merging present similar
issues.

With the new anon_vma code, one VMA can be attached to multiple anon_vmas,
however mmap, munmap, expand_stack and others only took one anon_vma->lock.
This patch changes things so we take the anon_vma locks for all of the
anon_vmas attached to a VMA in the code that try_to_unmap would otherwise
race against: mmap, munmap, expand_stack, etc. 

Unfortunately, this leads to a lock ordering conflict with the page_table_lock,
which protected the "same_vma" list in the anon_vma_chain.  Replacing that
lock with a new lock (mm->anon_vma_chain_lock), which is taken higher up in
the mm locking hierarchy, solves that issue.  This changes the locking rules
for the "same_vma" list to be either mm->mmap_sem for write, or mm->mmap_sem
for read plus the new mm->anon_vma_chain lock.  This limits the place where
the new lock is taken to 2 locations - anon_vma_prepare and expand_downwards.

Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and ...
From: Linus Torvalds
Date: Monday, May 3, 2010 - 9:41 am

Pretty much same comments as for the other one. Why are we pandering to 
the case that is/should be unusual?

		Linus
--

From: Rik van Riel
Date: Monday, May 3, 2010 - 9:53 am

In this case, because the fix from the migration side is
difficult and fragile, while fixing things from the mmap
side is straightforward.

I believe the overhead of patch 1/2 should be minimal
as well, because the locks we take are the _depth_ of
the process tree (truncated every exec), not the width.


As for patch 2/2, Mel has an alternative approach for that:

http://lkml.org/lkml/2010/4/30/198

Does Mel's patch seem more reasonable to you?

-- 
All rights reversed
--

From: Linus Torvalds
Date: Monday, May 3, 2010 - 10:17 am

Quite frankly, I think it's totally insane to walk a list of all 
anon_vma's that are associated with one vma, just to lock them all.

Tell me why you just don't put the lock in the vma itself then? Walking a 
list in order to lock multiple things is something we should _never_ do 
under any normal circumstances.

I can see why you'd want to do this in theory (the "other side" of the 
locker might have to lock just the _one_ anon_vma), but if your argument 
is that the list is usually not very deep ("one"?), then there is no 
advantage, because putting the lock in the anon_vma vs the vma will get 
the same kind of contention.

And if the list _is_ deep, then walking the list to lock them all is a 
crime against humanity.


Well, I certainly think that seems to be a lot more targeted, and not add 
new allocations in a path that I think is already one of the more 
expensive ones. Yes, you can argue that execve() is already so expensive 
that a few more allocations don't matter, and you migth be right, but 
that's how things get to be too expensive to begin with.

That said, I do still wonder why we shouldn't just say that the person who 
wants the safety is the one that should do the extra work.

In particular, why don't we just make rmap_walk() be the one that locks 
all the anon_vma's? Instead of locking just one? THAT is the function that 
cares. THAT is the function that should do proper locking and not expect 
others to do extra unnecessary locking.

So again, my gut feel is that if the lock just were in the vma itself, 
then the "normal" users would have just one natural lock, while the 
special case users (rmap_walk_anon) would have to lock each vma it 
traverses. That would seem to be the more natural way to lock things.

I dunno. There may well be reasons why it doesn't work, like just the 
allocation lifetime rules for vma's vs anon_vma's. I'm not claiming I've 
thought this true. I just get a feeling of "that isn't right" when I look 
at the original 2/2 ...
From: Rik van Riel
Date: Monday, May 3, 2010 - 10:58 am

One problem is that we cannot find the VMAs (multiple) from
the page, except by walking the anon_vma_chain.same_anon_vma
list.  At the very least, that list requires locking, done
by the anon_vma.lock.

When already you have that lock, also taking per-VMA locks
would be superfluous. It could also be difficult, especially
since the rmap side code and the mmap side code approach the
data structures from the other side, potentially leading to

In a freshly exec()d process, there will be one anon_vma.

In workloads with forking daemons, like apache, postgresql
or sendmail, the child process will end up with two anon_vmas

A forkbomb could definately end up getting slowed down by
this patch.  Is there any real workload out there that just
forks deeper and deeper from the parent process, without


However ... there's still the issue of page_lock_anon_vma
in try_to_unmap_anon.

I guess it protects against any of the VMAs going away,
because anon_vma_unlink also takes the anon_vma->lock.


I have no good answer to this question.

Mel?  Andrea?

-- 
All rights reversed
--

From: Andrea Arcangeli
Date: Monday, May 3, 2010 - 11:13 am

If try_to_unmap is allowed to establish the migration pte, then such
pte has to remain reachable through rmap_walk at all times after that,
or migration_entry_wait will crash because it notices the page has
been migrated already (PG_lock not set) but there is still a migration
pte established. (remove_migration_pte like split_huge_page isn't
allowed to fail finding all migration ptes mapping a page during the
rmap walk)

It's not false positive BUG_ON if that's what you mean, removing the
BUG_ON would still lead to infinite hang waiting on a migration pte
that shouldn't be there anymore.
--

From: Linus Torvalds
Date: Monday, May 3, 2010 - 11:19 am

But that's exactly what we do in rmap_walk() anyway.

But yes, I can well imagine that in other cases we only do the one 
anon_vma. I didn't check who used the lock.

So if we do want to keep the lock in the anon_vma, I would just suggest 
that instead of making "normal" users do lots of locking, make the 

Heh. AIM7. Wasn't that why we merged the multiple anon_vma's in the first 

Do we care?

We've not locked them all there, and we've historically not cares about 
the rmap list being "perfect", have we? 

So I _think_ it's just the migration case (and apparently potentially the 
hugepage case) that wants _exact_ information. Which is why I suggest the 
onus of the extra locking should be on _them_, not on the regular code.

I dunno. Again, my objections to the patches are really based more on a 
gut feel of "that can't be the right thing to do" than anything else.

We have _extremely_ few places that walk lists to lock things. And they 
are never "normal" code. Things like that magic "mm_take_all_locks()", for 
example. That is why I then react with "that can't be right" to patches 
like this.

			Linus
--

From: Rik van Riel
Date: Monday, May 3, 2010 - 11:38 am

Mel's original patch adds trylock & retry all code to rmap_walk
and a few other places:

http://lkml.org/lkml/2010/4/26/321

I submitted my patch 1/2 as an alternative, because these repeated
trylocks are pretty complex and easy to accidentally break when


Well, try_to_unmap_anon walks just one page, and has the anon_vma
for that page locked.

Having said that, for pageout we do indeed not care about getting

It's a matter of cost vs complexity.  IMHO the locking changes in
the lowest overhead patches (Mel's) are quite complex and could end
up being hard to maintain in the future.  I wanted to introduce
something a little simpler, with hopefully minimal overhead.

But hey, that's just my opinion - what matters is that the bug gets
fixed somehow.  If you prefer the more complex but slightly lower
overhead patches from Mel, that's fine too.

-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, May 4, 2010 - 6:12 am

The wording could have been better.

The problem is that once a migration PTE is established, it is expected that
rmap can find it. In the specific case of exec, this can fail because of
how the temporary stack is moved. As migration colliding with exec is rare,
the approach taken by the patch was to not create migration PTEs that rmap
could not find. On the plus side, exec (the common case) is unaffected. On
the negative side, it's avoiding the exec vs migration problem instead of
fixing it.

The BUG_ON is not a buggy check. While migration is taking place, the page lock
is held and not unreleased until all the migration PTEs have been removed. If
a migration entry exists and the page is unlocked, it means that rmap failed
to find all the entries. If the BUG_ON was not made, do_swap_page() would
either end up looking up a semi-random entry in swap cache and inserting it
(memory corruption), inserting a random page from swap (memory corruption)
or returning VM_FAULT_OOM to the fault handler (general carnage).

It was considered to lazily clean up the migration PTEs
(http://lkml.org/lkml/2010/4/27/458) but there is no guarantee that the page
the migration PTE pointed to is still the correct one. If it had been freed
and re-used, the results would probably be memory corruption.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Peter Zijlstra
Date: Monday, May 3, 2010 - 9:55 am

This does leave me worrying about concurrent faults poking at
vma->vm_end without synchronization.

--

From: Andrea Arcangeli
Date: Monday, May 3, 2010 - 10:02 am

I didn't check this patch in detail yet. I agree it can be removed and
I think it can be safely replaced with the page_table_lock.
--

From: Peter Zijlstra
Date: Monday, May 3, 2010 - 10:11 am

Sure, it could probably be replaced with the ptl, but a single
anon_vma->lock would I think be better since there's more of them.



--

From: Andrea Arcangeli
Date: Monday, May 3, 2010 - 10:18 am

ptl not enough, or it'd break if stack grows fast more than the size
of one pmd, page_table_lock enough instead.

Keeping anon_vma lock is sure fine with me ;), I was informally asked
if it was a must have, and I couldn't foresee any problem in
_replacing_ it (not removing) with page_table_lock (which I hope I
mentioned in my answer ;). But I never had an interest to remove it,
just I couldn't find any good reason to keep it either other than
"paranoid just in case", which is good enough justification to me ;)
considering these archs are uncommon and by definition gets less
testing.
--

Previous thread: [PATCH 2/2] mm: fix race between shift_arg_pages and rmap_walk by Rik van Riel on Monday, May 3, 2010 - 9:19 am. (3 messages)

Next thread: [PATCH 0/2] Fix migration races in rmap_walk() V4 by Rik van Riel on Monday, May 3, 2010 - 9:17 am. (1 message)