A recent thread on lkml discussed the "OOM killer". The OOM (out-of-memory) killer chooses which process(es) to kill when the VM runs out of memory. Rik van Riel has a full explanation of the OOM killer here.
Andrew Morton, who's been working on dividing the latest -aa VM into smaller pieces for mainline inclusion, submitted a patch about which he says:
"I have incorporated the oom killer into try_to_free_pages(), along with a tunable which defines how hard we work before killing something. It is *extremely* conservative. As it should be. The VM will spin madly for five or ten seconds before giving up and calling the oom killer. And then another five seconds elapses before the oom killer decides to actually kill something. It works."
The thread goes on to compare the mainline VM with the -aa VM. It also looks at ways to further tune the OOM killer.
From: Andrew Morton
CC: lkml
Subject: the oom killer
Date: Fri, 05 Apr 2002 01:18:26 -0800

Andrea,

Marcelo would prefer that the VM retain the oom killer. The thinking is that if try_to_free_pages fails, then we're better off making a deliberate selection of the process to kill rather than the random(ish) selection which we make by failing the allocation.

One example is at http://marc.theaimsgroup.com/?l=linux-kernel&m=101405688319160&w=2

That failure was with vm-24, which I think had the less aggressive i/dcache shrink code. We do need to robustly handle the no-swap-left situation.

So I have resurrected the oom killer. The patch is below.

During testing of this, a problem cropped up. The machine has 64 megs of memory, no swap. The workload consisted of running `make -j0 bzImage' in parallel with `usemem 40'. usemem will malloc a 40 megabyte chunk, memset it and exit.

The kernel livelocked. What appeared to be happening was that ZONE_DMA was short on free pages, but ZONE_NORMAL was not. So this check:

	if (!check_classzone_need_balance(classzone))
		break;

in try_to_free_pages() was seeing that ZONE_NORMAL had some headroom
and was causing a return to __alloc_pages().

__alloc_pages has this logic:

	min = 1UL << order;
	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;
		min += z->pages_min;
		if (z->free_pages > min) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}

On the first pass through this loop, `min' gets the value
zone_dma.pages_min + 1. On the second pass through the loop it gets
the value zone_dma.pages_min + 1 + zone_normal.pages_min. And this is
greater than zone_normal.free_pages! So alloc_pages() gets stuck in an
infinite loop.

This logic surrounding `min' is pretty weird - it's repeated several
times in __alloc_pages. Is it correct? What is it intended to do?

Anyway. "fixing" the `min' logic in that part of __alloc_pages()
prevented the lockup. The page allocator successfully takes a page
from ZONE_NORMAL (as check_classzone_need_balance() said it could) and
returns.

I have incorporated the oom killer into try_to_free_pages(), along with
a tunable which defines how hard we work before killing something. It
is *extremely* conservative. As it should be. The VM will spin madly
for five or ten seconds before giving up and calling the oom killer.
And then another five seconds elapses before the oom killer decides to
actually kill something. It works.

Some adjustments have been made to the oom killer to make really sure
that it doesn't kill init, and to not bother looking at processes which
have called daemonize().

Your comments would be appreciated, thanks.
The entire patch series is at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre6/aa1/
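Editor's note, not part of the message above: the watermark arithmetic Andrew describes can be reproduced in a small user-space model. This is only a sketch under stated assumptions. `struct zone_model`, `pick_zone_cumulative()` and `pick_zone_per_zone()` are hypothetical names, the page counts used with them are invented, and the per-zone variant is an illustration of comparing each zone against its own `pages_min`, not the actual code from the -aa tree or from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's zone_t: only the two fields
 * that the quoted watermark check reads are modeled. */
struct zone_model {
	unsigned long free_pages;
	unsigned long pages_min;
};

/* Mirrors the quoted loop: `min' keeps accumulating pages_min as the
 * zonelist is walked. Returns the index of the first zone that passes
 * the check, or -1 if no zone does. */
int pick_zone_cumulative(const struct zone_model *zones, size_t nzones,
			 unsigned int order)
{
	unsigned long min = 1UL << order;

	assert(zones != NULL);
	for (size_t i = 0; i < nzones; i++) {
		min += zones[i].pages_min;
		if (zones[i].free_pages > min)
			return (int)i;
	}
	return -1;
}

/* Hypothetical variant for comparison: every zone is checked against
 * its own pages_min only, so earlier zones do not raise the bar. */
int pick_zone_per_zone(const struct zone_model *zones, size_t nzones,
		       unsigned int order)
{
	assert(zones != NULL);
	for (size_t i = 0; i < nzones; i++) {
		unsigned long min = (1UL << order) + zones[i].pages_min;

		if (zones[i].free_pages > min)
			return (int)i;
	}
	return -1;
}
```

With an invented two-entry zonelist of { free 20, min 32 } followed by { free 270, min 255 } and order 0, the cumulative check rejects both zones (270 is not greater than 1 + 32 + 255 = 288), while the per-zone check accepts the second zone (270 is greater than 1 + 255 = 256). That gap is the situation the message describes: one code path believes the second zone has headroom while the allocator loop refuses to take a page from it.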
From: Andrea Arcangeli
Subject: Re: the oom killer
Date: Fri, 5 Apr 2002 16:43:48 +0200

On Fri, Apr 05, 2002 at 01:18:26AM -0800, Andrew Morton wrote:
> That failure was with vm-24, which I think had the less aggressive

vm-24 had a problem yes, that is fixed in the latest releases.

> On the first pass through this loop, `min' gets the value
> zone_dma.pages_min + 1. On the second pass through the loop it gets
> the value zone_dma.pages_min + 1 + zone_normal.pages_min. And this is
> greater than zone_normal.free_pages! So alloc_pages() gets stuck in an
> infinite loop.

This is a bug I fixed in the -rest patch, that's also broken on numa.
The deadlock cannot happen if you apply all my patches.

As for your patch it reintroduces a deadlock by looping in GFP relying
on the oom killer (that will also go and kill the
bigger task most of the time), the oom killer can select a task in D
state, or it can a sigterm, and secondly you broke google DB (the right
fix for that min thing are the point-of-view watermarks in the -rest
patch in my collection). the worst thing is that with the oom killer
we've to keep looping, so if the task is for whatever reason hung in R
state in kernel the machine will deadlock, while current way it will
make progress either in the do_exit, or in the -ENOMEM fail path (modulo
getblk that's not too bad anyways). the current memory balancing is now
been good enough to kill in function of probability, so I didn't feel
the need of risking (at the very least theorical) deadlocks there, this
is why I left it disabled.

Andrea
From: Marcelo Tosatti
Subject: Re: the oom killer
Date: Fri, 5 Apr 2002 18:45:08 -0300 (BRT)

On Fri, 5 Apr 2002, Andrea Arcangeli wrote:
> This is a bug I fixed in the -rest patch, that's also broken on numa.
> The deadlock cannot happen if you apply all my patches.

How did you fix this specific problem?
From: Andrea Arcangeli
Subject: Re: the oom killer
Date: Thu, 11 Apr 2002 15:13:53 +0200

I didn't really fix it, it's just that the problem never existed in my
tree. I don't do "min += z->pages_min" and so the
check_classzone_need_balance path sees exactly the same state of the VM
as the main allocator, so if it breaks the loop the main allocator will
go ahead just fine.

Andrea
From: Andrew Morton
Subject: Re: the oom killer
Date: Thu, 11 Apr 2002 12:41:24 -0700

Yup, we need to pull that fix into 2.4.

wrt the oom-killer, I think we can keep everyone happy by
implementing both solutions ;) If the aa approach reaches
the point where it will fail a page allocation we run the
oom-killer, yield and then have another go at the allocation.
Do that a couple of times and *then* fail the page allocation.

This fixes the problem where the VM will (effectively) kill
a randomly chosen process rather than a deliberately chosen
one, and fixes the lockup problem which Andrea identifies,
where the victim process is stuck somewhere in-kernel
ignoring signals.

It'd be nice if the second and subsequent passes of the oom
killer were able to note that a kill was already outstanding,
so they don't just kill the same process all the time. Or
perhaps the oom killer should just skip over processes which
are in TASK_UNINTERRUPTIBLE. Probably this is getting a
little too elaborate. Generally, the oom killer works OK
as-is (that is, it kills stuff and the machine recovers.
I won't vouch for the accuracy of its targetting).
From: Christoph Hellwig
Subject: Re: the oom killer
Date: Thu, 11 Apr 2002 22:15:33 +0100

On Thu, Apr 11, 2002 at 12:41:24PM -0700, Andrew Morton wrote:
> It'd be nice if the second and subsequent passes of the oom
> killer were able to note that a kill was already outstanding,
> so they don't just kill the same process all the time.

-rmap uses a simple timeout for that. And I've just send the
rmap oom_killer tweaks to Marcelo so they hopefully will appear
in mainline soon.
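Editor's note, not part of the thread: the combined approach Andrew proposes (try the allocation, run the oom killer on failure, retry a couple of times, and only then fail) can be sketched as a user-space model. Everything here is an assumption made for illustration: `struct task_model`, `select_victim()` and `alloc_with_oom_retries()` are hypothetical names, "largest resident set" is a placeholder selection rule rather than the real badness heuristic, a killed task is modeled as releasing its memory at once, and a `kill_pending` flag stands in for the simple timeout that -rmap is said to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified view of a task. */
struct task_model {
	int pid;
	bool has_mm;            /* false models a kernel thread */
	bool uninterruptible;   /* models TASK_UNINTERRUPTIBLE (D state) */
	bool kill_pending;      /* a kill is already outstanding */
	unsigned long rss_pages;
};

/* Pick the largest task that is not init, not a kernel thread, not in
 * D state and not already being killed. Returns NULL if none qualify. */
struct task_model *select_victim(struct task_model *tasks, size_t n)
{
	struct task_model *best = NULL;

	for (size_t i = 0; i < n; i++) {
		struct task_model *t = &tasks[i];

		if (t->pid == 1 || !t->has_mm ||
		    t->uninterruptible || t->kill_pending)
			continue;
		if (!best || t->rss_pages > best->rss_pages)
			best = t;
	}
	return best;
}

/* Try to take `need' pages. After each failed attempt run the model
 * oom killer, up to max_kills times, and only then fail the request.
 * The loop is bounded, so a victim that never frees memory cannot
 * hang the caller. */
bool alloc_with_oom_retries(unsigned long *free_pages, unsigned long need,
			    struct task_model *tasks, size_t n, int max_kills)
{
	assert(free_pages != NULL && tasks != NULL);
	for (int kills = 0; ; kills++) {
		if (*free_pages >= need) {
			*free_pages -= need;
			return true;
		}
		if (kills == max_kills)
			return false;   /* give up: fail the allocation */

		struct task_model *victim = select_victim(tasks, n);
		if (victim) {
			victim->kill_pending = true;
			/* Model assumption: the victim exits immediately. */
			*free_pages += victim->rss_pages;
			victim->rss_pages = 0;
		}
		/* A real implementation would yield here before retrying. */
	}
}
```

The bound on retries is what addresses the lockup Andrea raises: if no eligible victim exists, or the kill frees nothing, the caller still falls through to the allocation-failure path instead of looping forever.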
Interesting
Funny how every discussion of anything memory-related turns into a VM war... not that that's bad, choice is good, and according to Linus, Linux's success is because of 'massive, parallel, unpredicted development'.
Re: Interesting
Where is the VM-war in this article?
I don't see any VM-war.
--
I used to have a sig until the great Kahuna of FOOness
told me to dump it and use /dev/urandom instead.
actually
so what you're saying is, it is predictable