"Sysrq-p is pretty useless unless you can force the keyboard interrupt and the spinning process onto the same CPU," noted Chuck Ebbert during a discussion about debugging tasks stuck in a running state. The <Alt><SysRq><p> key combination is a debugging aid that dumps to the console the registers and flags of the CPU that handles the keypress interrupt, so it reveals nothing about a process spinning on a different CPU. UltraSPARC maintainer David Miller replied, "yes, I find this a painful limitation too," adding:
"Sparc64 used to dump the registers on all active cpus for show_regs() via a cross-call, and this was incredibly useful. But I disabled that as soon as I started playing with Niagara because at 32 cpus and larger the output is just too voluminous to be useful."
David then suggested, "what might be appropriate is just to get a one-line program counter dump on every cpu via some new sysrq keystroke." Chuck recalled that the -mm kernel had once carried similar functionality, but that it did not work reliably: "IIRC -mm had something like this but it was buggy because we were sending IPIs to each processor asking them to print their state. Maybe it would work if we had a way of making them dump their state to a memory location and then collected and printed it from the CPU that's handling the sysrq."
From: Jeff Garzik
Subject: [2.6.23] tasks stuck in running state?
Date: Oct 19, 2:39 pm 2007
On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
certain behavior at least once a day. I'll start a kernel build (make
-sj5 on this box), and it will "hang" in the following way:
> 31003 ? S 0:04 sshd: jgarzik@pts/0
> 31004 pts/0 Ss 0:02 \_ -bash
> 8280 pts/0 S+ 0:00 \_ make ARCH=i386 -sj4
> 8690 pts/0 Z+ 0:00 \_ [rm] <defunct>
> 8691 pts/0 S+ 0:00 \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> 8692 pts/0 R+ 6:12 \_ cat include/config/kernel.release
Specifically, the symptom is a process, often a simple one like cat(1)
or rm(1) or somewhere in check-headers, will stay in the running state,
accumulating CPU time.
If I Ctrl-C the build, and start over, the build will normally -not- get
stuck at the same point, but proceed to chew through one of a bazillion
allmodconfig builds.
I also see this occasionally on my main workstation (also
2.6.23/x86-64/Fedora-7), though not as frequently.
This is a new behavior since the new scheduler was merged... I think.
Nothing more concrete to report at this time. I cannot easily reproduce
the behavior, as it happens [apparently] randomly sometime during the
day. Generally, the files these programs are dealing with are -always-
in the pagecache, if that makes any difference.
Jeff
-
From: Chuck Ebbert
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 2:53 pm 2007
On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day. I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>
Can you try to strace the hanging task?
-
From: Jeff Garzik
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 3:03 pm 2007
Chuck Ebbert wrote:
> On 10/19/2007 05:39 PM, Jeff Garzik wrote:
>> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
>> certain behavior at least once a day. I'll start a kernel build (make
>> -sj5 on this box), and it will "hang" in the following way:
>>
>
> Can you try to strace the hanging task?
Well, to the system it's running, so that doesn't do much of anything...
>
> 8482 pts/0 S+ 0:00 \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> 8484 pts/0 R+ 3:10 \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> 8486 pts/0 S+ 0:00 \_ cut -f2 -d<
> 8487 pts/0 S+ 0:00 \_ cut -f1 -d>
> 8488 pts/0 S+ 0:00 \_ egrep ^linux|^asm
> [jgarzik@pretzel misc-2.6]$ strace -p8484
> Process 8484 attached - interrupt to quit
[sits there, chewing up CPU grepping a 47-line header file]
-
From: Chuck Ebbert
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 3:18 pm 2007
On 10/19/2007 06:03 PM, Jeff Garzik wrote:
>> [jgarzik@pretzel misc-2.6]$ strace -p8484
>> Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
>
And sysrq-p is pretty useless unless you can force the keyboard
interrupt and the spinning process onto the same CPU.
-
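[Editor's note] Chuck's remark implies a workaround: pin both the spinning task and the keyboard interrupt to one CPU before pressing sysrq-p. The sketch below is not from the thread. It assumes a Linux system, root privileges, and that the keyboard is on IRQ 1 (common for the legacy i8042 controller on x86 PCs, but check /proc/interrupts first). The helper names irq_affinity_mask and pin_task_and_irq are hypothetical; os.sched_setaffinity and the /proc/irq/<n>/smp_affinity file are the standard interfaces.

```python
import os


def irq_affinity_mask(cpu: int) -> str:
    """Return the hex bitmask that selects only `cpu`, in the
    comma-separated 32-bit groups used by /proc/irq/<n>/smp_affinity."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    digits = format(1 << cpu, "x")
    chunks = []
    # Split into groups of 8 hex digits (32 bits), starting from the right.
    while digits:
        chunks.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(chunks))


def pin_task_and_irq(pid: int, cpu: int, irq: int = 1) -> None:
    """Restrict task `pid` and interrupt `irq` to the same CPU.

    Assumption: irq=1 is the keyboard; this varies by machine.
    Requires root, and a running irqbalance daemon may later undo
    the IRQ setting.
    """
    os.sched_setaffinity(pid, {cpu})
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(irq_affinity_mask(cpu))
```

For the grep process in Jeff's listing, a call such as pin_task_and_irq(8484, 0) would be the idea, after which sysrq-p should report the CPU the task is spinning on. Some interrupt controllers do not allow affinity changes, so the write can fail.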
From: David Miller
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 5:01 pm 2007
From: Chuck Ebbert <cebbert@redhat.com>
Date: Fri, 19 Oct 2007 18:18:08 -0400
> On 10/19/2007 06:03 PM, Jeff Garzik wrote:
> >> [jgarzik@pretzel misc-2.6]$ strace -p8484
> >> Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
> >
>
> And sysrq-p is pretty useless unless you can force the keyboard
> interrupt and the spinning process onto the same CPU.
Yes, I find this a painful limitation too.
Sparc64 used to dump the registers on all active cpus for show_regs()
via a cross-call, and this was incredibly useful. But I disabled that
as soon as I started playing with Niagara because at 32 cpus and
larger the output is just too voluminous to be useful.
What might be appropriate is just to get a one-line program counter
dump on every cpu via some new sysrq keystroke.
-
From: Chuck Ebbert
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 21, 8:59 am 2007
On 10/19/2007 08:01 PM, David Miller wrote:
>
> What might be appropriate is just to get a one-line program counter
> dump on every cpu via some new sysrq keystroke.
>
IIRC -mm had something like this but it was buggy because we were
sending IPIs to each processor asking them to print their state.
Maybe it would work if we had a way of making them dump their
state to a memory location and then collected and printed it from
the CPU that's handling the sysrq.
-
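[Editor's note] Chuck's closing suggestion is to have each CPU store its state in memory and let the CPU handling the sysrq do all the printing. The following is a userspace analogy only, not kernel code: threads stand in for CPUs, and read_state is a hypothetical callback that stands in for reading a program counter. It illustrates why the collector alone touches the console, and how a CPU that never answers can be reported instead of blocking the dump.

```python
import threading


def collect_cpu_states(num_cpus, read_state, timeout=1.0):
    """Ask every simulated CPU to store its state, then format one
    line per CPU from the single collecting thread."""
    slots = [None] * num_cpus                      # one memory slot per CPU
    done = [threading.Event() for _ in range(num_cpus)]

    def responder(cpu):
        # The responder only stores its state; it never prints.
        slots[cpu] = read_state(cpu)
        done[cpu].set()

    for cpu in range(num_cpus):
        threading.Thread(target=responder, args=(cpu,), daemon=True).start()

    lines = []
    for cpu in range(num_cpus):
        # Wait a bounded time so a wedged CPU cannot hang the collector.
        if done[cpu].wait(timeout):
            lines.append(f"CPU{cpu}: pc={slots[cpu]:#018x}")
        else:
            lines.append(f"CPU{cpu}: no response")
    return lines


if __name__ == "__main__":
    # Made-up program counter values, for illustration only.
    for line in collect_cpu_states(4, lambda cpu: 0xFFFFFFFF80000000 + cpu):
        print(line)
```

Keeping the output to one line per CPU also addresses David's concern that full register dumps become too voluminous at 32 CPUs and beyond.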