Re: [bug, 2.6.26-rc4/rc5] sporadic bootup crashes in blk_lookup_devt()/prepare_namespace()

Previous thread: [PATCH 0/8] au1xmmc updates #4 by Manuel Lauss on Sunday, June 8, 2008 - 11:35 pm. (23 messages)

Next thread: lockdep internal warning in linux-next by Andrew Morton on Monday, June 9, 2008 - 1:09 am. (2 messages)
From: Ingo Molnar
Date: Monday, June 9, 2008 - 1:03 am

-tip testing has started triggering a new type of sporadic bootup crash 
a few days ago. Find below a collection of 14 crashes i've managed to 
capture so far, which are all similar to this crash pattern:

 BUG: unable to handle kernel paging request at ffff81003b984fb8
 IP: [<ffffffff803fafd4>] blk_lookup_devt+0x42/0xa0
 PGD 8063 PUD 9063 PMD 3be2d163 PTE 800000003b984160
 Oops: 0000 [1] SMP DEBUG_PAGEALLOC

 Call Trace:
  [<ffffffff80bac17b>] ? ip_auto_config+0x0/0xd94
  [<ffffffff80209259>] name_to_dev_t+0x145/0xeec
  [<ffffffff803ff2be>] ? __next_cpu_nr+0x22/0x2b
  [<ffffffff80b7f372>] prepare_namespace+0x91/0x14c
  [<ffffffff80b7eb70>] kernel_init+0x2fe/0x314
  [<ffffffff80251f3d>] ? trace_hardirqs_on_caller+0xca/0xee
  [<ffffffff80741bbb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff80251f3d>] ? trace_hardirqs_on_caller+0xca/0xee
  [<ffffffff8020d3f8>] child_rip+0xa/0x12
  [<ffffffff8020c90c>] ? restore_args+0x0/0x30
  [<ffffffff8025068d>] ? trace_hardirqs_off+0xd/0xf
  [<ffffffff80b7e872>] ? kernel_init+0x0/0x314
  [<ffffffff8020d3ee>] ? child_rip+0x0/0x12

I have reproduced it on different 32-bit and 64-bit systems as well, so 
it's not a hardware issue and it seems to affect the generic kernel.

Unfortunately the crashes are very sensitive to kernel layout changes so 
my efforts to bisect this failed. Sometimes repeat bootups do not 
reproduce the bug (making bisection even harder), and they all occured 
with DEBUG_PAGEALLOC kernels which suggests it's a timing sensitive data 
structure lifetime problem.

The latest crash's config and full serial console capture is:

  http://redhat.com/~mingo/misc/config-Mon_Jun__9_07_52_56_CEST_2008.bad
  http://redhat.com/~mingo/misc/crash-Mon_Jun__9_07_52_56_CEST_2008.log

i havent had time to look into the bug in detail yet, but it looks 
worrying and i think the bug is in -git too.

The first git-tagged crash was 2.6.26-rc4-00006-g1419a3b-dirty, which 
corresponds to tip-history-2008-06-04_09.44_Wed-6-g1419a3b, ...
From: Andrew Morton
Date: Monday, June 9, 2008 - 2:06 am

Did you work out where it's dying?  Deref of `dev' I assume?

--

From: Vegard Nossum
Date: Monday, June 9, 2008 - 2:09 am

On Mon, Jun 9, 2008 at 11:06 AM, Andrew Morton

                        struct gendisk *disk = dev_to_disk(dev);


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Monday, June 9, 2008 - 2:34 am

wild shot in the dark: if i were to finger a random commit in this 
general code area where it crashed it would be:

| commit 30f2f0eb4bd2c43d10a8b0d872c6e5ad8f31c9a0
| Author: Kay Sievers <kay.sievers@vrfy.org>
| Date:   Tue May 6 22:31:33 2008 +0200
|
|     block: do_mounts - accept root=<non-existant partition>

i could try a revert of this commit although given the extremely 
sporadic nature of this bug i doubt i could do any conclusive testing.

	Ingo
--

From: Vegard Nossum
Date: Monday, June 9, 2008 - 3:35 am

I'm sorry, this is slightly misleading. The dev_to_disk() doesn't contain any dereferences, so therefore that can obviously not be the source of the page fault. It is just simple pointer arithmetic.

The actual dereference happens on the next line, but it appears that this dereference and the pointer magic above is collapsed by gcc into a single instruction, cmp -0x44(%ebx), %esi. I assume the -0x44 would be = 0 - offsetof(device in gendisk) + offsetof(minors in gendisk).

So the error seems to be in dereferencing disk->minors, not dev.

And the fact that this causes a page fault seems to be pure luck; if the struct device object is placed higher than 0x44 in a page, it won't give the page fault (but simply access some valid, random memory). There seems to be a pretty good chance of an address being offset more than 0x44 bytes within a page given that a whole page is 0x1000 bytes :-)

The other condition that must be present for this fault to trigger is that the previous page must not have been mapped. Ouch. That sounds like two rare conditions!


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Adrian Bunk
Date: Monday, June 9, 2008 - 6:34 am

Mariusz already ran into this.

Neil already did some analysis of what could cause such problems [1],
but since Mariusz was no longer able to reproduce it with more recent 
kernels it became somehow forgotten.

Oh, and commit 30f2f0eb4bd2c43d10a8b0d872c6e5ad8f31c9a0 is currently 
sent for -stable review. I'll send an email that it shouldn't be merged 

cu
Adrian

[1] http://lkml.org/lkml/2008/5/25/257

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Vegard Nossum
Date: Monday, June 9, 2008 - 6:58 am

Hi,

Thanks, that matches exactly my findings too. And I agree very much
that it's strange how something which is not a gendisk can sneak
itself onto this list. So I have a feeling that it's something more
subtle than that.

It seems that Ingo is able to reproduce this "quite often", given the
number of reports he had (even though it was several thousand
bootups). We might simply add a printk() in there to determine which
device it is that is failing -- and look up the corresponding code to
see if it's doing anything weird.

But it seems more likely to be some kind of corruption.

I'm by no means familiar with this area, so please excuse me if what
I'm writing seems very obvious or stupid :-)

It seems that this list (block_class.devices) is protected by
block_class_lock in block/genhd.c. This list is only ever modified by
device_add() and device_del() in drivers/base/core.c. Both of those
are (only) protected by dev->class->sem, however. Is there a locking
mismatch here? But none of the locking code here seems to be changed
in years...


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Vegard Nossum
Date: Monday, June 9, 2008 - 7:28 am

I think this seems correct.

Everywhere else where we traverse the struct class->devices list, they
have down(&class->sem); first and up(&class->sem); afterwards.

Commit fd04897bb20be29d60f7e426a053545aebeaa61a even has this hunk:
@@ -177,8 +177,7 @@ struct class {
        struct list_head        devices;
        struct list_head        interfaces;
        struct kset             class_dirs;
-       struct semaphore        sem;    /* locks both the children and interface
-
+       struct semaphore        sem; /* locks children, devices, interfaces */
        struct class_attribute          * class_attrs;
        struct class_device_attribute   * class_dev_attrs;
        struct device_attribute         * dev_attrs;


So why doesn't block/genhd.c do this too? It seems to me that the
mutex locking here is simply a remnant of old code that happened to
not crash in most cases by chance.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Cornelia Huck
Date: Monday, June 9, 2008 - 7:57 am

On Mon, 9 Jun 2008 16:28:09 +0200,

Does this crash happen with the conversion to the class iterator
functions (should be in linux-next) as well? They take the class
mutex...
--

From: Vegard Nossum
Date: Monday, June 9, 2008 - 8:09 am

Ah, you mean this:

commit bb7ee70edb8745021c17ab604f2f4c897004e1c5
Author: Greg Kroah-Hartman <gregkh@suse.de>
Date:   Thu May 22 17:21:08 2008 -0400

    block: make blk_lookup_devt use the class iterator function

    Use the proper class iterator function instead of mucking around in the
    internals of the class structures.

    Cc: Kay Sievers <kay.sievers@vrfy.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

So it should already be fixed, then. But I guess we'll have to wait
for Ingo to run another couple of thousand tests to know the answer
;-)

Thanks!

Hm. Bugs already fixed elsewhere seems to be a recurring theme... I'll
look harder for changes/fixes in other trees the next time :-(


Vegard

PS: But what about printk_all_partitions()? There are more than just
this instance of the device list traversal code that don't use the
class semaphore.

PPS: Was that patch ever posted to LKML? I couldn't seem to find it.

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Linus Torvalds
Date: Monday, June 9, 2008 - 8:29 am

I really don't think it's the locking, although I do agree that the 
locking looks bogus _too_.

I suspect that the problem is even simpler than that. On the 
"block_class.devices" list we can have two types of devices: the ones that 
have been added by the block/genhd.c code (disks: dev->type "disk_type"), 
and the ones that are added by the class layer for partitions (partitions: 
dev.type "part_type").

And *all* the block/genhd.c loops over that device list look like this:

	list_for_each_entry(dev, &block_class.devices, node) {
		if (dev->type != &disk_type)
			continue;
		sgp = dev_to_disk(dev);
		...

because you cannot do that "dev_to_disk()" on a partition entry (it won't 
have a container of type gendisk, it will be of type hd_struct).

Well, all except one. Guess which one..

So I suspect that (a) yes, we need to fix the locking, but (b) the fix for 
this particular bug is probably the trivial one appended.

And yes, this bug was introduced by commit 30f2f0eb4b ("block: do_mounts - 
accept root=<non-existant partition>"), so the alternative is to revert it 
entirely. Kay? 

		Linus

---
 block/genhd.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 129ad93..b922d48 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -660,6 +660,8 @@ dev_t blk_lookup_devt(const char *name, int part)
 
 	mutex_lock(&block_class_lock);
 	list_for_each_entry(dev, &block_class.devices, node) {
+		if (dev->type != &disk_type)
+			continue;
 		if (strcmp(dev->bus_id, name) == 0) {
 			struct gendisk *disk = dev_to_disk(dev);
 
--

From: Ingo Molnar
Date: Monday, June 9, 2008 - 8:38 am

ah. I suspect that explains the sporadic nature as well: normally there 
is 'some' object at the list address, just with an invalid type.

The invalid type only gets visible as a hard crash if due to PAGEALLOC 
the structure sizes and kmalloc/slab details cause the invalid access to 
go to a not yet allocated page. (and then it crashes there)

And that in itself is a rather unlikely and fragile condition (it might 
even depend on timings of various allocations), that's why the bug wasnt 
really reproducible deterministically.

	Ingo
--

From: Linus Torvalds
Date: Monday, June 9, 2008 - 9:15 am

Yes. It could cause two kinds of problems:

 - it might end up returning the wrong 'dev_t'. This is unlikely, since we 
   only have two cases: the working whole-disk case, and the case where we 
   find a partition.

   But if we find a partition, we'd still get the right dev_t *most* of 
   the time, because we'd first get called with "part=0", and then we have 

	if (part < disk->minors)
		devt = MKDEV(MAJOR(dev->devt),
			MINOR(dev->devt) + part);
	break;

   where we would only fail if that conditional statement would be untrue 
   (and then we'd incorrectly return MKDEV(0,0)). Otherwise, 'devt' ends 
   up being correct anyway.

   So one effect of this bug would be that it would use the random 
   "disk->minors" value to either return the right devt, or return one 
   that is all zeroes. But if we return the all-zeroes case, then 
   init/do_mounts.c will just try again, this time with the numbers 
   removed, and now it wouldn't hit the "strcmp()" on any partition, and 
   the next time around it would find a disk and work again.

   So this is a bug, but it's one that essentially is hidden by the 
   caller.

 - The other alternative is that the bogus "disk->minors" thing would 
   cause a page fault. This would only happen if the partition allocation 
   was the first thing in a page, and the previous page was unused, and 
   you had DEBUG_PAGEALLOC enabled.

   This is obviously the case you saw.

My trivial fix makes it ignore partitions entirely.

We *could* (and perhaps should) do something slightly more involved 
instead, which actually uses a partition if it's there). Like this. That 
would avoid my one nagging worry (that some clever usage makes partitions 
with a different numbering or without a base block device).

And this is all still ignoring the locking issue, of course. It would be 
trivial to just remove the block_class_lock, and change

	mutex_[un]lock(&block_class_lock);

into

	down|up(&block_class.sem);

except for _one_ ...
From: Cornelia Huck
Date: Monday, June 9, 2008 - 10:15 am

On Mon, 9 Jun 2008 09:15:40 -0700 (PDT),

The driver core changes in -next convert class->sem to
class->p->class_mutex, which makes it non-accessible to drivers.
Most of the locking is easily done through converting to the class
iterator functions, but there are some cases where this is not going to
work:

- The {register,unregister}_blkdev() functions, which don't directly
involve the class.
- The iterators for /proc/partitions, which take the lock in
part_start() and give it up again in part_stop().

Maybe we need a possibilty for a driver to lock a class from outside?
--

From: Cornelia Huck
Date: Monday, June 9, 2008 - 11:03 am

On Mon, 9 Jun 2008 19:15:21 +0200,

Argh. I was just trying to hack up a patch when I realized that we had
to get a reference on the dynamic private structure when we take the
lock - which made the patch so ugly that I dare not post it. I'll see
if I have a better idea tomorrow (or someone beats me to it :)
--

From: Greg KH
Date: Monday, June 9, 2008 - 8:11 pm

Why would that be needed?  We protect walking the class lists internally
with the lock, that should be sufficient, right?

thanks,

greg k-h
--

From: Cornelia Huck
Date: Tuesday, June 10, 2008 - 12:51 am

On Mon, 9 Jun 2008 20:11:09 -0700,

What about the two functions I cited above? Don't we want them
protected by the same lock?
--

From: Greg KH
Date: Tuesday, June 10, 2008 - 2:52 pm

For register/unregister, yes, we still need to protect the list of
major/minor numbers.

But for the proc/partitions list, I don't think we still need it
anymore, but I might be missing something here.  :)

thanks,

greg k-h
--

From: Greg KH
Date: Monday, June 9, 2008 - 8:09 pm

The locking for struct class has turned into a mutex in the -next tree
already, but I have left the block_class_lock alone for the moment.

Now that I have also cleaned up the places in the /proc files where we
grabbed it, I think it might be safe to remove, I'll poke at that
tomorrow.

thanks,

greg k-h
--

From: Kay Sievers
Date: Monday, June 9, 2008 - 8:46 am

Yeah, the patch looks fine. That could be the reason.

I think we should keep the patch, as it fixed a different issue, and it
seems the bug was there even before the patch - the function was just
not called 3 times, so even more unlikely to trigger it.

Thanks,
Kay

--

From: Linus Torvalds
Date: Monday, June 9, 2008 - 8:58 am

No, before the patch we never did a "dev_to_disk()" on the device. We just 
did

	if (strcmp(dev->bus_id, name) == 0) {
		devt = dev->devt;
		break;
	}

and we simply didn't care if it was a disk or a partition - it would work 
correctly for both.

Your patch made it simply not work for partitions at all (by dereferencing 
an illegal address off them). My fix makes it ignore partitions entirely, 
but I'm a bit nervous that there might be some setup that sets up *only* 
partitions, not any base device at all. I guess that is unlikely, but it 
worries me a bit.

		Linus
--

From: Greg KH
Date: Monday, June 9, 2008 - 8:07 pm

Thanks for finding this, it looks correct to me.

greg k-h
--

Previous thread: [PATCH 0/8] au1xmmc updates #4 by Manuel Lauss on Sunday, June 8, 2008 - 11:35 pm. (23 messages)

Next thread: