Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally

Previous thread: Re: usb-scanner-cameras kernel-2.6.22 and udev-095 problem by Greg KH on Tuesday, June 5, 2007 - 2:46 pm. (1 message)

Next thread: Re: 2.6.22-rc4: known regressions [section mismatch] by Jeff Chua on Tuesday, June 5, 2007 - 4:15 pm. (1 message)
From: David Rientjes
Date: Tuesday, June 5, 2007 - 3:39 pm

Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.

OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
memory outside its cpuset because it could potentially cause other 
exclusive cpusets to OOM themselves.

Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 kernel/cpuset.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
 	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
 	if (node_isset(node, current->mems_allowed))
 		return 1;
-	/*
-	 * Allow tasks that have access to memory reserves because they have
-	 * been OOM killed to get memory anywhere.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
-		return 1;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return 0;
 
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 3:40 pm

Those terminating tasks need that memory to terminate properly right?
-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 3:42 pm

No, they need access only to the memory reserves provided by the 
difference in their low and no watermarks on a per-cpuset basis.
-

From: Paul Jackson
Date: Tuesday, June 5, 2007 - 4:01 pm

I'm a little surprised at this suggested change -- I'd have thought
that it was a good idea to let tasks marked for extinction get memory
anywhere, as they were going to use that memory to exit, and free up
lots more memory.

I'm pretty sure we have this same policy in other places in the
kernel, besides cpusets.  Did you intend to change them too?

If a MEMDIE task is taking enough memory to OOM other tasks anywhere
in the system, then doesn't that mean your entire system was in deep
yogurt, and we're just haggling over who to blame for the upcoming
crash?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 4:16 pm

The intended purpose of TIF_MEMDIE was to allocate pages without being 
bound by the watermarks so that they have access to memory reserves on the 
per-zone level.  If the cpuset doesn't have access to a zone, whether it's 


No, it means that it can allocate anywhere based on the zonelist ordering 
and then can OOM a very small exclusive cpuset that would never have had 
any memory pressure if it wasn't violated.

		David
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 4:20 pm

But the alternative is that the exiting process does not save its 
data.

What is this very small exclusive cpuset?


-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 4:25 pm

The same condition that occurs when there is a system-wide OOM, yes.  
Exclusive cpusets cannot be violated for such allocations outside of the 

That's arbitrary.  The idea is that an exclusive cpuset should not 
encounter memory pressure because another exclusive cpuset encountered an 
OOM condition because its zones happened to be higher on the zonelist.  
Notice how, without this change, it's possible to allocate on a node 
outside our mems_allowed before we use our own memory reserves.

		David
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 4:32 pm

But with the patch the process would be able to terminate. There is no 
global OOM situation. If there would be a global OOM situation then 

So its seems that the patch is addressing an imagined situation?

I think the allocation outside of our mems_allowed is fine when it serves 
to terminate the process and thereby release resources. It is certainly 
better than having the process corrupt data by only partially writing back 
its data.


-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 4:44 pm

Sure it would, it would have access to memory reserves because of the 
change in watermarks through get_page_from_freelist().

If that fails, we can't allocate elsewhere because then we have taken 
exclusive memory from other applications and is contrary to the definition 
of mem_exclusive.  You need to construct your cpuset hierarchy with these 
scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come 
with a disclaimer that says "if another cpuset that is also exclusive 
happens to OOM, we'll steal your memory anyway and it's not our problem if 
the dying task gets stuck in D state and doesn't exit synchronously or 

No, it's returning us to the previous logic where an exclusive cpuset was 
actually exclusive.

And, again, without this change it is possible to allocate in other 
exclusive cpusets without first exhausting your own memory reserves.  
That's wrong.

		David
-

From: Paul Jackson
Date: Tuesday, June 5, 2007 - 4:55 pm

Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority,
but I can speak to the meaning of cpuset flags.

The mem_exclusive flag doesn't mean this.

It means that you cannot overlap the memory of a sibling cpuset.

You will, necessarily, still overlap the memory of your ancestor cpusets.

Whether or not you make any use of the mem_exclusive flag, you still
get the same (limited) guarantees of memory usage -- namely that your
memory won't be used by tasks in non-overlapping cpusets, with some
exceptions, such as:
 1) memory handed out to interrupt code,
 2) memory handed out for GFP_ATOMIC requests, and
 3) tasks marked PF_EXITING -- will soon free up memory

Tasks in cpusets ancestor to your tasks cpuset can always, easily,
use memory on the same nodes your task is on.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 6:17 pm

Which, with this patch, we will respect for tasks marked TIF_MEMDIE as 

And that's what nearest_exclusive_ancestor() determines later on if we're 

This is precisely the point: we already respect PF_EXITING tasks with 
their ability to allocate outside their own cpuset.  That gets set in 
do_exit() when a task is in receipt of the SIGKILL from the OOM killer 
during the exit path.  Between these time periods (the time when we issue 
the OOM SIGKILL and the time we're marked PF_EXITING in do_exit()), we 
should not allow allocations outside of our cpuset because we do not yet 

Sure, that behavior is unchanged.  We're relying on 
nearest_exclusive_ancestor() to determine if such nodes overlap.

		David
-

From: Paul Jackson
Date: Tuesday, June 5, 2007 - 6:20 pm

Ok ... my points on cpusets semantics having been heard,
I stand back down on the matter of memory semantics, where
I am not the master.

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 4:57 pm

Exclusive is not as absolute as you may think. There is also the 
GFP_KERNEL exception.


That is already occurring with GFP_KERNEL. So your patch really does not 
have the purifying effect on exclusivity that you expect. This looks all 
more like hunting for elusive idealistic cpuset behavior to me.

-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 6:23 pm

Memory exclusivity with respect to cpusets should guarantee that memory 
nodes do not overlap with siblings if they are marked with mems_exclusive.  
The patch simply preserves that behavior through the time period between 
when the OOM killer issues a SIGKILL and the task is exiting and marked 
with PF_EXITING.

Obviously GFP_KERNEL allocations can allocate regardless of our memory 
exclusivity, but the point is that a job in one exclusive cpuset should 
not have the ability to effect the performance (in terms of reclaim and 
swap), memory usage, or survival of jobs in other exclusive cpusets 

But it's a reality that we need to respect.  It happens and when it does
it has the potential to hamper other cpusets that we setup to be exclusive 
themselves.

		David
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 6:32 pm

Right but the process is terminating thus only requiring limited resources
to get finished. The process does not have the ability to affect the other 
cpusets. I.e. it cannot directly allocate outside of the cpuset. The 
system has that capability and the system is handling the termination of 

The fact that process hang because of some software deficiency is an 
exceptional failure scenario. The system in general is not fully 
operational anymore anyways. Typically one or the other lock is held which 
makes certain kernel functionality inaccessible.

-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 6:40 pm

And those limited resources should be available in the difference between 
low and no watermarks as defined by each zone in the cpuset's 
mems_allowed.  Regardless, we should not allow allocations outside of the 
cpuset because we have marked it TIF_MEMDIE and we don't have any explicit 
guarantee that it is exiting yet and not mlock'ing an excessive amount of 
memory that exceeds the capacity of all cpuset nodes.

		David
-

From: Christoph Lameter
Date: Tuesday, June 5, 2007 - 6:54 pm

Hmmmm... But we have sent it a SIGKILL. If the process is following 
conventions then it is exiting. Of course the process could be abusing the 
system and attempting to OOM the whole system as an act of revenge for 
being killed but isnt this a bit far fetched?



-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 8:29 pm

It's not abusing the system because it was killed, it was killed because 
it was abusing the system and attempted to mlock more memory than allowed 
in its exclusive mems.  So we OOM the task but it can continue mlock'ing 
memory and intrude on the memory of other cpusets because of the 
TIF_MEMDIE exception and our SIGKILL hasn't been handled yet.
-

From: Paul Jackson
Date: Tuesday, June 5, 2007 - 4:19 pm

> The intended purpose of TIF_MEMDIE was to allocate pages without being 

Ok then ... you probably right.  I'll stand down.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Peter Zijlstra
Date: Tuesday, June 5, 2007 - 11:20 pm

This seems a little pointless, since cpuset_zone_allowed_softwall() is
only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS
allocations opened up by TIF_MEMDIE don't use that.

Also I agree with Christoph's reasoning; this is not for the application
but for the system. Hence the cpuset does not get violated for the
application [ something I tried to argue before, glad Christoph now
agrees with me :-) ].

-

From: David Rientjes
Date: Tuesday, June 5, 2007 - 11:42 pm

That's the change.  Memory reserves on a per-zone level would now be used 
with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer 
have an explicit bypass for them.  If the node is not in the task's 
mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither
PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, 
then we move on to the next zone in get_page_from_freelist().

The problem the patch addresses is that, as you mentioned, tasks with 
TIF_MEMDIE have an explicit bypass over using any memory reserves and can 
thus, depending on zonelist ordering, allocate first on nodes outside its 
mems_allowed even in the case of an exclusive cpuset before exhausting its 
own memory.  That behavior is wrong.

		David
-

From: Peter Zijlstra
Date: Wednesday, June 6, 2007 - 12:09 am

Right, I see your point; however considering that its a system
allocation, and all these constraints get violated by interrupts anyway,
its more of an application container than a strict allocation container.

Are you actually seeing the described behaviour, or just being pedantic
(nothing wrong with that per-se)?

I would actually be bold and make it worse by proposing something like
this, which has the benefit of making the reserve threshold system wide.

---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1eef614..870a791 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1630,6 +1630,21 @@ rebalance:
 			&& !in_interrupt()) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
+			/*
+			 * break out of mempolicy boundaries
+			 */
+			zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+				gfp_zone(gfp_mask);
+
+			/*
+			 * Before going bare metal, try to get a page above the
+			 * critical threshold - ignoring CPU sets.
+			 */
+			page = get_page_from_freelist(gfp_mask, order, zonelist,
+					ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER);
+			if (page)
+				goto got_pg;
+
 			/* go through the zonelist yet again, ignoring mins */
 			page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);


-

From: David Rientjes
Date: Wednesday, June 6, 2007 - 12:18 am

It is not necessarily system allocations at all.  It is quite possible, as 
described earlier, to mlock a large quantity of memory that exceeds the 
capacity of all nodes in a task's mems_allowed and then when this task 
OOM's to continue allocating memory in get_user_pages() but with the 
TIF_MEMDIE exception that allows it to allocate anywhere.  Then other 
cpusets get their memory infringed upon and they can OOM themselves even 
though there was no preexisting memory pressure.  This happens before we 
return to userspace in the mlock() and thus we cannot properly handle the 

Yes, as described above.  An exclusive cpuset containing only root, 
system-critical tasks was OOM'd when it was not under any memory pressure 
at all because a separate exclusive cpuset mlock'd a gigantic amount of 
memory and it could not reliably exit because the mlock continued to 
allocate outside its own cpuset and eventually OOM'd system-critical tasks 
or depleated all system memory.  True story.

At the least, if this patch is not agreed upon, I would suggest adding 
sanity checks against gfp_mask in cpuset_zone_allowed() to ensure we 
should allow the allocation in TIF_MEMDIE circumstances to avoid the above 
scenario.

		David
-

From: Paul Jackson
Date: Wednesday, June 6, 2007 - 12:34 am

Seems like that mlock code is able then to get great globs of memory
without returning to user space ... perhaps that's where the fix
should be ... that code should quit chewing up memory if it's
marked MEMDIE or some such?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Andrew Morton
Date: Wednesday, June 6, 2007 - 12:39 am

yup.  A fix for that is in the pipeline: bale from get_user_pages()
if TIF_MEMDIE is set.
-

From: David Rientjes
Date: Wednesday, June 6, 2007 - 12:48 am

That's one case.  Are there others?

The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this 
problem in mlock().  If it had not been allowed to allocate anywhere 
based simply on the zonelist ordering, the mlock iteration would break 
because it could not handle the fault.

Thus, at the least, we should make sure that memory is not allocated 
outside of a task's mems_allowed unless we do sanity checks against 
gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make 
sure a rouge application doesn't cause the same trouble.  That is, unless 
you can guarantee this type of problem will not happen again through any 
other means.  The logic needs to be with the TIF_MEMDIE exception to grant 
access to memory outside the cpuset only when it is relevant to the OOM 
killed task's prompt exit.

		David
-

From: Peter Zijlstra
Date: Wednesday, June 6, 2007 - 12:56 am

I don't think your patch alone would have been sufficient. With it it
would have depleted the local reserves and then jumped onwards to other
nodes (since the ALLOC_NO_WATERMARKS allocation doesn't have
ALLOC_CPUSET).

Unless there was a mem-policy restricting the zonelist (not sure if
cpusets and mem-policies are independent like that)

But your point stands.

-

From: Paul Jackson
Date: Wednesday, June 6, 2007 - 12:56 am

Well, I certainly cannot guarantee that.

Heck, I can't even guarantee isn't happening right now, somewhere else.

But I'm no memory guru.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Andrew Morton
Date: Wednesday, June 6, 2007 - 1:00 am

The only place I can think of where the kernel will sit there allocating
huge amounts of memory like this is in readahead.
-

From: Peter Zijlstra
Date: Wednesday, June 6, 2007 - 1:03 am

We just created one more, the execve string copy. However that uses
get_user_pages(), so if that gets fixed we're save there too.

-

Previous thread: Re: usb-scanner-cameras kernel-2.6.22 and udev-095 problem by Greg KH on Tuesday, June 5, 2007 - 2:46 pm. (1 message)

Next thread: Re: 2.6.22-rc4: known regressions [section mismatch] by Jeff Chua on Tuesday, June 5, 2007 - 4:15 pm. (1 message)