But all that is fully serialized by the rq->lock.. so I'm really not seeing how this can happen. --
I wonder is there any chance set_next_entity() can get NULL for parameter se if so? And will you please give me some instructions on where rq->lock is required? -- Thanks and Best Regards, shenghui --
Well, if your machine crashes that way, maybe, but I haven't seen that Pretty much everywhere, if you look at sched.c the only sched_class method not called with rq->lock held is ::task_fork(). The interesting bits are that schedule()->pre_schedule()/idle_balance() can drop rq->lock as well as ->select_task_rq(). --
Hi, Here is a panic I got today: http://frugalware.org/~vmiklos/pics/bug/2.6.37-rc6.png More details: I get this sometimes on boot or shutdown when testing systemd. I did not get it with sysvinit, so I guess it may be related to systemd's heavy cgroups usage, but I'm not sure. Sadly it isn't 100% reproducible but I usually hit it at least once a day. The config is here: http://frugalware.org/~vmiklos/logs/2.6.37-rc6.config (I just did a yes "" | make config to update it to 2.6.37-rc6.) I got something similar with 2.6.36.1 as well: http://frugalware.org/~vmiklos/pics/bug/2.6.36.1.png Ah, and this is on i686 in VMware - though given that I never had this problem with systemd, I guess it won't be an emulator bug. :) I'm not familiar with the sched code, is it possible that this is related? Thanks, Miklos
A few more info: - I reproduced the issue on my Lenovo S12 netbook as well, so it's not a virtual machine-related bug. Here is a poor shot: http://frugalware.org/~vmiklos/pics/bug/p1040730.jpg - I'm also attaching the full dmesg in the VM that is finally a text using serial console. - On "how to reproduce": sometimes I just get this during boot. If it does not occur, then it does when using "systemctl emergency". Please let me know if you need more info. Thanks, Miklos
What distro are you using? it looks like systemd is involved and I'm actively avoiding anything using that crap. It now triggering a bug means I need to get myself a 32bit image using it :/ If you happen to have a qemu image that reproduces.. :-) --
Removed the systemd list from CC, please don't cross post with lists that send bounces for non-members, that's considered rude. --
Heh, gearing up to hunt a bug it triggers, just doing all I had to do to get the thing built and installed blew my box _all up_. I still have smoldering bits lying about a week later. -Mike --
There's a reason I was asking for a qemu image ;-), I'm not letting that stuff anywhere near my regular machines. --
First, sorry about the systemd list, I did not know it's subscriber-only. Sure - here is a qcow image (871M): http://frugalware.org/~vmiklos/files/systemd.img Here is how I started it: qemu-kvm -vnc :0 -m 1G -hda systemd.img $ qemu-kvm -version QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard There are two menu entries in grub config - first is sysvinit, that boots up correctly here, you can login using root / root. The second is systemd, that just panics here almost all the time. In case it would not, use 'systemctl emergency' as root and that will trigger the bug. :) If you need more packages in the virtual machine and the pacman package manager sounds exotic, here is a quick help: http://frugalware.org/docs/pacman-g2#_apt_pacman_g2_cross_reference Thanks, Miklos
awesome, thanks! I started it with something like: qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G Where foo-build/ contains a kernel build using your .config. I'll have a poke at it.. --
Hrm,. its not really wanting to start properly.. --- EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) VFS: Mounted root (ext4 filesystem) readonly on device 8:1. Freeing unused kernel memory: 520k freed EXT4-fs (sda1): re-mounted. Opts: (null) md: stopping all md devices. Restarting system. machine restart --- Does it _require_ initrd like muck? Or are there some modules that need to get built-in in order for the thing to boot? This is an utter lack of error reporting here, no idea what's wrong. --
I tried to do something really similar to your commandline: qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G -vnc :0 This boots up properly here, I can login using root/root from vnc. qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G -vnc :0 ^ Only init=/bin/systemd added, and this results in a panic in most cases. I'm attaching the stdout of qemu, showing the fail in put_prev_task_fair. $ qemu -version QEMU emulator version 0.13.0, Copyright (c) 2003-2008 Fabrice Bellard kernel-build is a git build using the config I already sent and after a 'git checkout v2.6.36'. I can try to build master as well, so far what I saw is that the bug occurs there as well, but less frequently, so maybe that's a bit harder to debug. I also want to note that - at least on my machine - if I drop -enable-kvm the bug is hard to reproduce, maybe that's because that way it does not trigger a race condition or my machine is just too slow without kvm and it triggers some timeout, changing the behaviour; I'm not exactly sure. (But again, I can reproduce the bug on real hardware, so I don't think we have a kvm bug here.) Thanks, Miklos
Right, so I've got v2.6.36 exploding, current -git and -tip won't explode for me, tried both about 20 times. So I'll try and figure out wth makes .36 explode and then see if further kernel still suffer that problem. --
Does it explode if you change some of the CONFIG_CC flags, such as CONFIG_CC_OPTIMIZE_FOR_SIZE? I.e. is the crash dependent on the precise kernel image layout perhaps? If it's sensitive to that then this might keep you from chasing shadows ... Thanks, Ingo --
I reported the issue as I tested 2.6.36 and 2.6.37-rc6 and both failed. Now I tried current -git and indeed I can't reproduce the issue there. I'm now bisecting between master and rc6 to see which commit fixes the problem - just to make sure it's not something irrelevant and the bug is still hidden somewhere.
I spoke too early. After some trying, it seems I can reproduce with current -git as well. (But as Peter points out it's harder to reproduce.) Just in case it's interesting, I attached the -git dmesg. From a quick look the trace is the same as with v2.6.36, which is easy to reproduce.
the load == f69e65a0 == address of se, odd Thanks, -- Only stand for myself. --
This appears to be consistently true, I've also found that in between these two prints, there is a free_sched_group() freeing that exact entry. So post-print is a use-after-free artifact. What's interesting is that its freeing a cfs_rq struct with nr_running=1, that should not be possible... /me goes stare at the whole cgroup task attach vs cgroup destruction muck. --
systemd-1 0d..1. 2070793us : sched_destroy_group: se: f69e43c0, load: 1024 systemd-1 0d..1. 2070794us : sched_destroy_group: cfs_rq: f69e4720, nr: 1, load: 1024 systemd-1 0d..1. 2070794us : __print_runqueue: cfs_rq: f69e4720, nr: 1, load: 1024 systemd-1 0d..1. 2070795us : __print_runqueue: curr: (null) systemd-1 0d..1. 2070796us : __print_runqueue: se: f6a8eb4c, comm: systemd-tmpfile/1243, load: 1024 systemd-1 0d..1. 2070796us : _raw_spin_unlock_irqrestore <-sched_destroy_group So somehow it manages to destroy a group with a task attached. --
Its even weirder:
systemd-1 0d..1. 1663489us : sched_destroy_group: se: f69e7360, load: 1024
systemd-1 0d..1. 1663489us : sched_destroy_group: cfs_rq: f69e72a0, nr: 1, load: 1024
systemd-1 0d..1. 1663491us : __print_runqueue: cfs_rq: f69e72a0, nr: 1, load: 1024, cgroup: /system/systemd-sysctl.service
systemd-1 0d..1. 1663491us : __print_runqueue: curr: (null)
systemd-1 0d..1. 1663493us : __print_runqueue: se: f69d95bc, comm: systemd-sysctl/1209, load: 1024, cgroup: /
systemd-1 0d..1. 1663496us : do_invalid_op <-error_code
The task enqueued to the cfs_rq doesn't match the cgroup, the thing is,
I don't see a cpu_cgroup_attach/sched_move_task call in the log, nor
does a BUG_ON() validating the task's cgroup against the cfs_rq's cgroup
on account_entity_enqueue() trigger.
So it looks like a task changes cgroup without passing through the
cgroup_subsys::attach method, which afaict isn't supposed to happen.
---
kernel/sched.c | 26 ++++++++++++++-
kernel/sched_fair.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..bc30efb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8033,10 +8033,20 @@ static void free_fair_sched_group(struct task_group *tg)
int i;
for_each_possible_cpu(i) {
- if (tg->cfs_rq)
+ if (tg->cfs_rq) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ if (cfs_rq) {
+ trace_printk("cfs_rq: %p, nr: %ld, load: %ld\n",
+ cfs_rq, cfs_rq->nr_running, cfs_rq->load.weight);
+ }
kfree(tg->cfs_rq[i]);
- if (tg->se)
+ }
+ if (tg->se) {
+ struct sched_entity *se = tg->se[i];
+ if (se)
+ trace_printk("se: %p, load: %ld\n", se, se->load.weight);
kfree(tg->se[i]);
+ }
}
kfree(tg->cfs_rq);
@@ -8092,6 +8102,18 @@ static inline void register_fair_sched_group(struct task_group *tg, int cpu)
static inline void unregister_fair_sched_group(struct ...This straw appears true: $ grep -e cpu_cgroup\\\|f491447c log9 ... kworker/-1196 0d..2. 1601180us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service kworker/-1196 0d..2. 1601186us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service kworker/-1196 0d..2. 1601188us : __dequeue_entity: f491447c from f492a480, 1 left kworker/-1196 0d..2. 1601188us : pick_next_task_fair: picked: f491447c, modprobe/1210 kworker/-1196 0d..2. 1601192us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service modprobe-1210 0d..5. 1601802us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / modprobe-1210 0d..5. 1601807us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / modprobe-1210 0d..2. 1601817us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / modprobe-1210 0d..2. 1601819us : __enqueue_entity: f491447c to f492a480, 1 tasks modprobe-1210 0d..2. 1601826us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / modprobe-1210 0d..2. 1601832us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / modprobe-1210 0d..2. 1601839us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / kworker/-1196 0d..2. 1601848us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / kworker/-1196 0d..2. 1601854us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / kworker/-1196 0d..2. 1601860us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: / kworker/-1196 0d..2. 1601865us : __print_runqueue: se: ...
Thanks! :) Reported-and-tested-by: Miklos Vajna <vmiklos@frugalware.org>
Could you do the move in cgroup_exit() in the CONFIG_PREEMPT case? -Mike --
I'm not really comfortable relying on that.. voluntary might just grow a cond_resched() somewhere in the exit path and lead us down the same path, also I think that !preempt smp might have the same race. --
Very good catch! Looks very reasonable and correct to me Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Three Cheers, Balbir --
While this patch is likely fine for solving the problem, it does add extra work into the task exit path. Could you instead just use the pre_destroy callback to return -EBUSY if there are still any tasks on the cfs_rq? That way there'd only be a penalty on cgroup destruction, which is a much rarer operation. Paul --
No, that's broken! That would mean userspace gets the cgroup-empty notification thing and then encounters -EBUSY, which then forces it to go poll the state, rendering the whole notification thing pointless. --
I tried this patch, but it causes a boot crash: udevd[109]: delete_path: rmdir(/dev/.udev/failed) failed: Read-only file sgeneral protection fault: 0000 [#1] SMP last sysfs file: /sys/bus/mdio_bus/drivers/Generic PHY/uevent CPU 0 Modules linked in: ystem udevd[10Pid: 109, comm: udevd Not tainted 2.6.37-rc8-tip-01815-g4e7b4f4-dirty #76296 A8N-E/System Product Name RIP: 0010:[<ffffffff810276ff>] 9]: delete_path: [<ffffffff810276ff>] task_group+0x4d/0x53 RSP: 0018:ffff88003d33bd50 EFLAGS: 00010046 RAX: 6b6b6b6b6b6b6b6b RBX: ffff88003d06f460 RCX: ffffffff00000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88003d06f460 rmdir(/dev/.udeRBP: ffff88003d33bd50 R08: ffff88003e40ad68 R09: 000000000000005a R10: ffff88003f7d62c8 R11: ffff88003d086d80 R12: 0000000000000000 v/failed) failedR13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 CR2: 00007f0b222cb000 CR3: 000000003d36d000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 system udevd[DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process udevd (pid: 109, threadinfo ffff88003d33a000, task ffff88003e63ee40) Stack: ffff88003d33bd70 ffffffff8102771c ffff88003d06f460 0000000000000000 ffff88003d33bd90 ffffffff81027c9c ffff88003d06f460 ffff88003fc11000 ffff88003d33bdd0 ffffffff81034d6d109]: delete_pat ffff88003d06f460 0000000000000246 Call Trace: [<ffffffff8102771c>] set_task_rq+0x17/0x73 h: rmdir(/dev/.u [<ffffffff81027c9c>] task_move_group_fair+0x37/0x53 dev/failed) fail [<ffffffff81034d6d>] sched_move_task+0x71/0xbc ed: Read-only fi [<ffffffff81034dc9>] cpu_cgroup_exit+0x11/0x13 [<ffffffff81070e88>] cgroup_exit+0x35/0xd4 le system udev [<ffffffff8103706a>] copy_process+0xf3a/0xfc7 d[109]: delete_p [<ffffffff81037283>] do_fork+0x163/0x2d7 [<ffffffff810b456a>] ? do_munmap+0x2f2/0x30b [<ffffffff8100a0c2>] sys_clone+0x28/0x2a ath: rmdir(/dev/ [<ffffffff81002cf3>] stub_clone+0x13/0x20 [<ffffffff81002a8b>] ? ...
Hm, indeed. (I get a crash in qemu, but not on the host machine.) qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G -vnc :0 does not crash here, but qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G -vnc :0 does. I'm attaching the config (what I already sent earlier in this thread) and the output of the above two commands just in case that helps Peter.
Your crash seems to be completely independent of Peter's patch. It crashes here with your image/config/2.6.36 here regardless, and not 100% repeatably. Sometimes it boots, sometimes not. -Mike --
The below should fix it.
sched: fix autogroup reference leak and cpu_cgroup_exit() explosion
In the event of a fork failure, the new cpu_cgroup_exit() method tries to
move an unhashed task. Since PF_EXITING isn't set in that case, autogroup
will dig aground in a freed signal_struct. Neither cgroups nor autogroup
has anything it needs to do with this shade, so don't go there.
This also uncovered a struct autogroup reference leak. copy_process() was
simply freeing vs putting the signal_struct, stranding a reference.
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
kernel/fork.c | 2 +-
kernel/sched.c | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -1318,7 +1318,7 @@ bad_fork_cleanup_mm:
}
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
- free_signal_struct(p->signal);
+ put_signal_struct(p->signal);
bad_fork_cleanup_sighand:
__cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -9193,6 +9193,16 @@ cpu_cgroup_attach(struct cgroup_subsys *
static void
cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
{
+ /*
+ * cgroup_exit() is called in the copy_process failure path.
+ * The task isn't hashed, and we don't want to make autogroup
+ * dig into a freed signal_struct, so just go away.
+ *
+ * XXX: why are cgroup methods diddling unattached tasks?
+ */
+ if (!(task->flags & PF_EXITING))
+ return;
+
sched_move_task(task);
}
--
Ah, that looks plausible. I've folded this chunk into my patch and kept your fork-fail mod in a separate patch. --
Commit-ID: 101e5f77bf35679809586e250b6c62193d2ed179 Gitweb: http://git.kernel.org/tip/101e5f77bf35679809586e250b6c62193d2ed179 Author: Mike Galbraith <efault@gmx.de> AuthorDate: Fri, 31 Dec 2010 09:32:30 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 4 Jan 2011 15:10:36 +0100 sched, autogroup: Fix reference leak The cgroup exit mess also uncovered a struct autogroup reference leak. copy_process() was simply freeing vs putting the signal_struct, stranding a reference. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <1293784350.6839.2.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/fork.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index b6f2475..0672444 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1317,7 +1317,7 @@ bad_fork_cleanup_mm: } bad_fork_cleanup_signal: if (!(clone_flags & CLONE_THREAD)) - free_signal_struct(p->signal); + put_signal_struct(p->signal); bad_fork_cleanup_sighand: __cleanup_sighand(p->sighand); bad_fork_cleanup_fs: --
Well, free_signal_struct() was correct. Without CLONE_THREAD sig->sigcnt must be equal to 1. But yes, autogroup puts sched_autogroup_exit() into put_signal_struct(), so this patch looks fine. Although I must admit, to me it would be more clean to simply move sched_autogroup_exit() from put_signal_struct() into free_signal_struct() instead. Oleg. --
