Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5. OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access memory outside its cpuset because it could potentially cause other exclusive cpusets to OOM themselves. Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> --- kernel/cpuset.c | 6 ------ 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/kernel/cpuset.c b/kernel/cpuset.c --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask) might_sleep_if(!(gfp_mask & __GFP_HARDWALL)); if (node_isset(node, current->mems_allowed)) return 1; - /* - * Allow tasks that have access to memory reserves because they have - * been OOM killed to get memory anywhere. - */ - if (unlikely(test_thread_flag(TIF_MEMDIE))) - return 1; if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ return 0; -
Those terminating tasks need that memory to terminate properly right? -
No, they need access only to the memory reserves provided by the difference in their low and no watermarks on a per-cpuset basis. -
I'm a little surprised at this suggested change -- I'd have thought
that it was a good idea to let tasks marked for extinction get memory
anywhere, as they were going to use that memory to exit, and free up
lots more memory.
I'm pretty sure we have this same policy in other places in the
kernel, besides cpusets. Did you intend to change them too?
If a MEMDIE task is taking enough memory to OOM other tasks anywhere
in the system, then doesn't that mean your entire system was in deep
yogurt, and we're just haggling over who to blame for the upcoming
crash?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
The intended purpose of TIF_MEMDIE was to allocate pages without being bound by the watermarks so that they have access to memory reserves on the per-zone level. If the cpuset doesn't have access to a zone, whether it's No, it means that it can allocate anywhere based on the zonelist ordering and then can OOM a very small exclusive cpuset that would never have had any memory pressure if it wasn't violated. David -
But the alternative is that the exiting process does not save its data. What is this very small exclusive cpuset? -
The same condition that occurs when there is a system-wide OOM, yes. Exclusive cpusets cannot be violated for such allocations outside of the That's arbitrary. The idea is that an exclusive cpuset should not encounter memory pressure because another exclusive cpuset encountered an OOM condition because its zones happened to be higher on the zonelist. Notice how, without this change, it's possible to allocate on a node outside our mems_allowed before we use our own memory reserves. David -
But with the patch the process would be able to terminate. There is no global OOM situation. If there would be a global OOM situation then So its seems that the patch is addressing an imagined situation? I think the allocation outside of our mems_allowed is fine when it serves to terminate the process and thereby release resources. It is certainly better than having the process corrupt data by only partially writing back its data. -
Sure it would, it would have access to memory reserves because of the change in watermarks through get_page_from_freelist(). If that fails, we can't allocate elsewhere because then we have taken exclusive memory from other applications and is contrary to the definition of mem_exclusive. You need to construct your cpuset hierarchy with these scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come with a disclaimer that says "if another cpuset that is also exclusive happens to OOM, we'll steal your memory anyway and it's not our problem if the dying task gets stuck in D state and doesn't exit synchronously or No, it's returning us to the previous logic where an exclusive cpuset was actually exclusive. And, again, without this change it is possible to allocate in other exclusive cpusets without first exhausting your own memory reserves. That's wrong. David -
Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority,
but I can speak to the meaning of cpuset flags.
The mem_exclusive flag doesn't mean this.
It means that you cannot overlap the memory of a sibling cpuset.
You will, necessarily, still overlap the memory of your ancestor cpusets.
Whether or not you make any use of the mem_exclusive flag, you still
get the same (limited) guarantees of memory usage -- namely that your
memory won't be used by tasks in non-overlapping cpusets, with some
exceptions, such as:
1) memory handed out to interrupt code,
2) memory handed out for GFP_ATOMIC requests, and
3) tasks marked PF_EXITING -- will soon free up memory
Tasks in cpusets ancestor to your tasks cpuset can always, easily,
use memory on the same nodes your task is on.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Which, with this patch, we will respect for tasks marked TIF_MEMDIE as And that's what nearest_exclusive_ancestor() determines later on if we're This is precisely the point: we already respect PF_EXITING tasks with their ability to allocate outside their own cpuset. That gets set in do_exit() when a task is in receipt of the SIGKILL from the OOM killer during the exit path. Between these time periods (the time when we issue the OOM SIGKILL and the time we're marked PF_EXITING in do_exit()), we should not allow allocations outside of our cpuset because we do not yet Sure, that behavior is unchanged. We're relying on nearest_exclusive_ancestor() to determine if such nodes overlap. David -
Ok ... my points on cpusets semantics having been heard,
I stand back down on the matter of memory semantics, where
I am not the master.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Exclusive is not as absolute as you may think. There is also the GFP_KERNEL exception. That is already occurring with GFP_KERNEL. So your patch really does not have the purifying effect on exclusivity that you expect. This looks all more like hunting for elusive idealistic cpuset behavior to me. -
Memory exclusivity with respect to cpusets should guarantee that memory nodes do not overlap with siblings if they are marked with mems_exclusive. The patch simply preserves that behavior through the time period between when the OOM killer issues a SIGKILL and the task is exiting and marked with PF_EXITING. Obviously GFP_KERNEL allocations can allocate regardless of our memory exclusivity, but the point is that a job in one exclusive cpuset should not have the ability to effect the performance (in terms of reclaim and swap), memory usage, or survival of jobs in other exclusive cpusets But it's a reality that we need to respect. It happens and when it does it has the potential to hamper other cpusets that we setup to be exclusive themselves. David -
Right but the process is terminating thus only requiring limited resources to get finished. The process does not have the ability to affect the other cpusets. I.e. it cannot directly allocate outside of the cpuset. The system has that capability and the system is handling the termination of The fact that process hang because of some software deficiency is an exceptional failure scenario. The system in general is not fully operational anymore anyways. Typically one or the other lock is held which makes certain kernel functionality inaccessible. -
And those limited resources should be available in the difference between low and no watermarks as defined by each zone in the cpuset's mems_allowed. Regardless, we should not allow allocations outside of the cpuset because we have marked it TIF_MEMDIE and we don't have any explicit guarantee that it is exiting yet and not mlock'ing an excessive amount of memory that exceeds the capacity of all cpuset nodes. David -
Hmmmm... But we have sent it a SIGKILL. If the process is following conventions then it is exiting. Of course the process could be abusing the system and attempting to OOM the whole system as an act of revenge for being killed but isnt this a bit far fetched? -
It's not abusing the system because it was killed, it was killed because it was abusing the system and attempted to mlock more memory than allowed in its exclusive mems. So we OOM the task but it can continue mlock'ing memory and intrude on the memory of other cpusets because of the TIF_MEMDIE exception and our SIGKILL hasn't been handled yet. -
> The intended purpose of TIF_MEMDIE was to allocate pages without being
Ok then ... you probably right. I'll stand down.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
This seems a little pointless, since cpuset_zone_allowed_softwall() is only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS allocations opened up by TIF_MEMDIE don't use that. Also I agree with Christoph's reasoning; this is not for the application but for the system. Hence the cpuset does not get violated for the application [ something I tried to argue before, glad Christoph now agrees with me :-) ]. -
That's the change. Memory reserves on a per-zone level would now be used with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer have an explicit bypass for them. If the node is not in the task's mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, then we move on to the next zone in get_page_from_freelist(). The problem the patch addresses is that, as you mentioned, tasks with TIF_MEMDIE have an explicit bypass over using any memory reserves and can thus, depending on zonelist ordering, allocate first on nodes outside its mems_allowed even in the case of an exclusive cpuset before exhausting its own memory. That behavior is wrong. David -
Right, I see your point; however considering that its a system
allocation, and all these constraints get violated by interrupts anyway,
its more of an application container than a strict allocation container.
Are you actually seeing the described behaviour, or just being pedantic
(nothing wrong with that per-se)?
I would actually be bold and make it worse by proposing something like
this, which has the benefit of making the reserve threshold system wide.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1eef614..870a791 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1630,6 +1630,21 @@ rebalance:
&& !in_interrupt()) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
+ /*
+ * break out of mempolicy boundaries
+ */
+ zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+ gfp_zone(gfp_mask);
+
+ /*
+ * Before going bare metal, try to get a page above the
+ * critical threshold - ignoring CPU sets.
+ */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER);
+ if (page)
+ goto got_pg;
+
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
-
It is not necessarily system allocations at all. It is quite possible, as described earlier, to mlock a large quantity of memory that exceeds the capacity of all nodes in a task's mems_allowed and then when this task OOM's to continue allocating memory in get_user_pages() but with the TIF_MEMDIE exception that allows it to allocate anywhere. Then other cpusets get their memory infringed upon and they can OOM themselves even though there was no preexisting memory pressure. This happens before we return to userspace in the mlock() and thus we cannot properly handle the Yes, as described above. An exclusive cpuset containing only root, system-critical tasks was OOM'd when it was not under any memory pressure at all because a separate exclusive cpuset mlock'd a gigantic amount of memory and it could not reliably exit because the mlock continued to allocate outside its own cpuset and eventually OOM'd system-critical tasks or depleated all system memory. True story. At the least, if this patch is not agreed upon, I would suggest adding sanity checks against gfp_mask in cpuset_zone_allowed() to ensure we should allow the allocation in TIF_MEMDIE circumstances to avoid the above scenario. David -
Seems like that mlock code is able then to get great globs of memory
without returning to user space ... perhaps that's where the fix
should be ... that code should quit chewing up memory if it's
marked MEMDIE or some such?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
yup. A fix for that is in the pipeline: bale from get_user_pages() if TIF_MEMDIE is set. -
That's one case. Are there others? The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this problem in mlock(). If it had not been allowed to allocate anywhere based simply on the zonelist ordering, the mlock iteration would break because it could not handle the fault. Thus, at the least, we should make sure that memory is not allocated outside of a task's mems_allowed unless we do sanity checks against gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make sure a rouge application doesn't cause the same trouble. That is, unless you can guarantee this type of problem will not happen again through any other means. The logic needs to be with the TIF_MEMDIE exception to grant access to memory outside the cpuset only when it is relevant to the OOM killed task's prompt exit. David -
I don't think your patch alone would have been sufficient. With it it would have depleted the local reserves and then jumped onwards to other nodes (since the ALLOC_NO_WATERMARKS allocation doesn't have ALLOC_CPUSET). Unless there was a mem-policy restricting the zonelist (not sure if cpusets and mem-policies are independent like that) But your point stands. -
Well, I certainly cannot guarantee that.
Heck, I can't even guarantee isn't happening right now, somewhere else.
But I'm no memory guru.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
The only place I can think of where the kernel will sit there allocating huge amounts of memory like this is in readahead. -
We just created one more, the execve string copy. However that uses get_user_pages(), so if that gets fixed we're save there too. -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfri |
