Re: usb hdd problems with 2.6.27.2

Previous thread: Possible bug in SCSI Kconfig by Fabio Comolli on Wednesday, October 22, 2008 - 9:29 am. (14 messages)

Next thread: [PATCH][RESEND] tracing/ftrace: Introduce the big kernel lock tracer by Frederic Weisbecker on Wednesday, October 22, 2008 - 10:09 am. (5 messages)
From: Luciano Rocha
Date: Wednesday, October 22, 2008 - 9:22 am

Hello,

An external HDD, usb-encased, works fine under 2.6.26.5, but under
2.6.27.2 I get hundreds of errors per second, of 'No Sense [current]'.

On connection, under 2.6.26.5, I get the following:

[   84.503949] usb 1-2.1.1: new high speed USB device using ehci_hcd and address 8
[   84.696861] usb 1-2.1.1: configuration #1 chosen from 1 choice
[   84.740856] usbcore: registered new interface driver libusual
[   84.817625] Initializing USB Mass Storage driver...
[   84.817625] scsi2 : SCSI emulation for USB Mass Storage devices
[   84.818820] usbcore: registered new interface driver usb-storage
[   84.818823] USB Mass Storage support registered.
[   89.819567] scsi 2:0:0:0: Direct-Access     SAMSUNG  HD400LD          WQ10 PQ: 0 ANSI: 0
[   89.857627] Driver 'sd' needs updating - please use bus_type methods
[   89.857837] sd 2:0:0:0: [sda] 781422769 512-byte hardware sectors (400088 MB)
[   89.858754] sd 2:0:0:0: [sda] Write Protect is off
[   89.858754] sd 2:0:0:0: [sda] Assuming drive cache: write through
[   89.858869] sd 2:0:0:0: [sda] 781422769 512-byte hardware sectors (400088 MB)
[   89.858869] sd 2:0:0:0: [sda] Write Protect is off
[   89.858869] sd 2:0:0:0: [sda] Assuming drive cache: write through
[   89.858869]  sda: sda1
[   89.879851] sd 2:0:0:0: [sda] Attached SCSI disk
[   89.901630] sd 2:0:0:0: Attached scsi generic sg0 type 0
[   89.963632] sd 2:0:0:0: [sda] Sense Key : No Sense [current] 
[   89.963632] sd 2:0:0:0: [sda] Add. Sense: No additional sense information
[  106.913629] kjournald starting.  Commit interval 5 seconds
[  106.913629] EXT3 FS on dm-5, internal journal

On 2.6.27.2, I get:

[  551.484341] usb 1-2.1.1: new high speed USB device using ehci_hcd and address 8
[  551.677467] usb 1-2.1.1: configuration #1 chosen from 1 choice
[  551.718201] usbcore: registered new interface driver libusual
[  551.799287] Initializing USB Mass Storage driver...
[  551.800330] scsi2 : SCSI emulation for USB Mass Storage devices
[  551.800694] usbcore: ...
From: Rafael J. Wysocki
Date: Saturday, October 25, 2008 - 12:25 pm

[Adding CCs]



--

From: Alan Stern
Date: Saturday, October 25, 2008 - 12:50 pm

You can use usbmon to capture the details of what happens when you plug
in the drive.  Instructions are in the kernel source file
Documentation/usb/usbmon.txt.

Alan Stern

--

From: James Bottomley
Date: Saturday, October 25, 2008 - 1:11 pm

Try reversing this patch (apply with patch -p1 -R)

James

---

commit 10dab22664914505dcb804d9ad09cad6bc94d349
Author: Jamie Wellnitz <Jamie.Wellnitz@emulex.com>
Date:   Thu Sep 11 21:39:36 2008 -0400

    [SCSI] sd: Fix handling of NO_SENSE check condition
    
    The current handling of NO_SENSE check condition is the same as
    RECOVERED_ERROR, and assumes that in both cases, the I/O was fully
    transferred.
    
    We have seen cases of arrays returning with NO_SENSE (no error), but
    the I/O was not completely transferred, thus residual set.  Thus,
    rather than return good_bytes as the entire transfer, set good_bytes
    to 0, so that the midlayer then applies the residual in calculating
    the transfer, and for sd, will fail the I/O and fall into a retry
    path.
    
    Signed-off-by: Jamie Wellnitz <Jamie.Wellnitz@emulex.com>
    Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7c4d2e6..55e6ed4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1054,7 +1054,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 		good_bytes = sd_completed_bytes(SCpnt);
 		break;
 	case RECOVERED_ERROR:
-	case NO_SENSE:
 		/* Inform the user, but make sure that it's not treated
 		 * as a hard error.
 		 */
@@ -1063,6 +1062,15 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 		memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
 		good_bytes = scsi_bufflen(SCpnt);
 		break;
+	case NO_SENSE:
+		/* This indicates a false check condition, so ignore it.  An
+		 * unknown amount of data was transferred so treat it as an
+		 * error.
+		 */
+		scsi_print_sense("sd", SCpnt);
+		SCpnt->result = 0;
+		memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
+		break;
 	case ABORTED_COMMAND:
 		if (sshdr.asc == 0x10) { /* DIF: Disk detected corruption */
 			scsi_print_result(SCpnt);


--

From: Luciano Rocha
Date: Sunday, October 26, 2008 - 7:05 am

I'll try it tomorrow.

Thanks,
Luciano Rocha

--=20
Luciano Rocha <luciano@eurotux.com>
Eurotux Inform=E1tica, S.A. <http://www.eurotux.com/>
From: Luciano Rocha
Date: Monday, October 27, 2008 - 4:14 am

<snip>

2.6.27.4 comes with it reverted. But I still get the same error.
Reapplied it as one more test, with the same error.

Regards,
Luciano Rocha

--=20
Luciano Rocha <luciano@eurotux.com>
Eurotux Inform=E1tica, S.A. <http://www.eurotux.com/>
From: Luciano Rocha
Date: Monday, October 27, 2008 - 4:28 am

Now in 2.6.27.4, same problem. The usb traffic is:

f7161ec0 3564884672 C Ii:1:005:1 0:2048 1 = 02
f7161ec0 3564884690 S Ii:1:005:1 -115:2048 1 <
f20138c0 3564884706 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564885023 C Ci:1:005:0 0 4 = 01010100
f20138c0 3564885039 S Co:1:005:0 s 23 01 0010 0001 0000 0
f20138c0 3564885273 C Co:1:005:0 0 0
f20138c0 3564885291 S Co:1:005:0 s 23 03 0016 0001 0000 0
f20138c0 3564885522 C Co:1:005:0 0 0
f20138c0 3564885534 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564885773 C Ci:1:005:0 0 4 = 01010000
f20138c0 3564911773 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564912022 C Ci:1:005:0 0 4 = 01010000
f20138c0 3564937773 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564938022 C Ci:1:005:0 0 4 = 01010000
f20138c0 3564963772 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564964020 C Ci:1:005:0 0 4 = 01010000
f20138c0 3564989773 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3564990020 C Ci:1:005:0 0 4 = 01010000
f20138c0 3564990043 S Co:1:005:0 s 23 03 0004 0001 0000 0
f20138c0 3564990270 C Co:1:005:0 0 0
f20138c0 3565000771 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3565001020 C Ci:1:005:0 0 4 = 11050000
f20138c0 3565011770 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3565012019 C Ci:1:005:0 0 4 = 03051000
f20138c0 3565062770 S Co:1:005:0 s 23 01 0014 0001 0000 0
f20138c0 3565063018 C Co:1:005:0 0 0
f20138c0 3565063102 S Ci:1:000:0 s 80 06 0100 0000 0040 64 <
f20138c0 3565072896 C Ci:1:000:0 0 18 = 12010002 00000040 40088200 01000102 0301
f20138c0 3565072920 S Co:1:005:0 s 23 03 0004 0001 0000 0
f20138c0 3565073268 C Co:1:005:0 0 0
f20138c0 3565083782 S Ci:1:005:0 s a3 00 0000 0001 0004 4 <
f20138c0 3565084018 C Ci:1:005:0 0 4 = 03051000
f20138c0 3565134773 S Co:1:005:0 s 23 01 0014 0001 0000 0
f20138c0 3565135017 C Co:1:005:0 0 0
f20138c0 3565135045 S Co:1:000:0 s 00 05 0008 0000 0000 0
f20138c0 3565135266 C Co:1:000:0 0 0
f20138c0 3565146771 S Ci:1:008:0 s 80 06 0100 0000 0012 18 <
f20138c0 ...
From: Alan Stern
Date: Monday, October 27, 2008 - 7:24 am

This looks exactly like the "infinite retry" problem I warned about 
earlier.  Here are the important parts of the log.  For people who 
don't know how to interpret these messages, the CDB starts in the 16th 
byte of the 31-byte messages.  For example, the first command here 

The response is 0x2e9390b0.  In typical broken fashion, that is 
undoubtedly the total number of sectors rather than the highest sector 
number.

Later on the system tries to read the contents of what it thinks is the 

There's no data in the response, and the 01 on the line above 

The automatically-generated REQUEST SENSE gets the 18-byte response you 
see above.  It is entirely empty (No Sense).

The remainder of the trace shows the same command being repeated over 
etc...

There's a patch which might help resolve this problem:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8bfa24...

It is already present in 2.6.28-rc1, so it's worth a try.  If it does
fix things, let me know so I can submit it for a future 2.6.27.stable
release.

Alan Stern

--

From: Douglas Gilbert
Date: Monday, October 27, 2008 - 7:56 am

Since the READ CAPACITY "off by one" error is so common,
perhaps drivers such as usb-storage could have a hook to
do a pseudo READ CAPACITY. Then if the capacity value
looked odd (in both senses) the driver could do an IO to
the suspect block and if that failed decrement the capacity
value passed back to the mid level.
Put another way, why don't these defective devices trip up
another OS?

BTW a single disk in RAID 0 (seen on a HP E200 controller)
has a shortened capacity value seen in the midlevel on the
corresponding logical drive. That missing chunk is probably
where the RAID controller puts its control information.
Anyway, playing with the capacity value returned by READ

I know that happens but it seems strange that upper levels
are reading a block that has never been written to. Read ahead?


--

From: Boaz Harrosh
Date: Monday, October 27, 2008 - 8:08 am

Window$ never reads the last sector unless it is actually written
to. I had such a device it got stuck when I wrote to the

That would be udev or hald, I can't remember which. It is a special Linux

Boaz
--

From: Kay Sievers
Date: Monday, October 27, 2008 - 9:26 am

It's vol_id from udev. To auto-detect raid setups, LVM, MD metadata, ...

Such devices carry their magic sequence at the "end" of the device,
usually the last sector or a few sectors before the last sector. The
"last" sector is obviously determined by the capacity the device
reports.

There is no way around looking at all block devices' last sector on a
Linux box, as long as people expect auto-setup, auto-mounting,
anything that is not put in /etc/fstab to work, as long as any "meta
device" format puts a magic number at the end of the device.

Kay
--

From: Alan Stern
Date: Monday, October 27, 2008 - 8:25 am

We thought of that years ago.  Unfortunately there is no reliable way
of telling when a capacity value is wrong.  There definitely do exist
disks with an odd number of sectors.

Furthermore, doing I/O to a suspect block is not a good idea.  There
are plenty of devices which simply crash when you try to access a

I imagine they do.  However Linux has partition code that stores
information in the last sector of a partition (EFI GUID and md, for
example).  Other OS's apparently do not try to access the medium's last

usb-storage isn't in the business of altering the data it gets from a 
device.  It's just a transport.  That's why the sdev->fix_capacity flag 
exists; we tell the upper layer that the data it gets is going to be 

No, partition scanning.  Also maybe /lib/udev/vol_id, which seems to 
read an inordinate number of irrelevant sectors.

Alan Stern

--

From: James Bottomley
Date: Monday, October 27, 2008 - 8:33 am

And you do this by setting US_FL_FIX_CAPACITY in unusual_devs.h

James


--

From: Luciano Rocha
Date: Monday, October 27, 2008 - 8:18 am

Hm, where in the code can I change it to return less that what it thinks
is the last sector?


No luck. I'm going to try 2.6.28-rc2 now.

Regards,
Luciano Rocha

-- 
Luciano Rocha <luciano@eurotux.com>
Eurotux Informática, S.A. <http://www.eurotux.com/>
--

From: Alan Stern
Date: Monday, October 27, 2008 - 8:38 am

I can write a patch to do that for you.  However I'm more interested in 
solving the "infinite-retry" problem first.

I did write some patches which might help.  They weren't meant as bug 
fixes exactly, more as infrastructure cleanup.  But you ought to try 
them out, because they do affect the logic in this area.

The patches are here:

	http://marc.info/?l=linux-scsi&m=122443015406309&w=2

They are based on 2.6.27, not 2.6.28-rc.

Alan Stern

--

From: Luciano Rocha
Date: Monday, October 27, 2008 - 9:53 am

I've tested them (on 2.6.27.4), but I still get the bug. That email mentions:
"Neither patch addresses the infinite-retry problem; I wanted
to keep the issues separate."

So am I missing some other patch?

2.6.28-rc2 is also buggy.

Thanks,
Luciano Rocha

-- 
Luciano Rocha <luciano@eurotux.com>
Eurotux Informática, S.A. <http://www.eurotux.com/>
--

From: Alan Stern
Date: Monday, October 27, 2008 - 1:10 pm

Okay, I tried the same sort of experiment in 2.6.27.4 and got the same
result as you did.  I was able to fix it by applying the 8bfa24727
commit mentioned earlier together with the patch below.

Now, I'm pretty sure this is not the ideal solution.  Fixing the block 
core would be better.  Still, it does indicate for sure that there's a 
real problem.

It's worth noting that the test added to scsi_requeue_command() 
succeeds every time, that is, req->retries is always 0.  Clearly more 
work is needed.

Alan Stern



Index: 2.6.27.4/drivers/scsi/scsi_lib.c
===================================================================
--- 2.6.27.4.orig/drivers/scsi/scsi_lib.c
+++ 2.6.27.4/drivers/scsi/scsi_lib.c
@@ -611,6 +611,11 @@ static void scsi_requeue_command(struct 
 	struct request *req = cmd->request;
 	unsigned long flags;
 
+	if (--req->retries <= 0) {
+		blk_end_request(req, -EIO, blk_rq_bytes(req));
+		scsi_next_command(cmd);
+		return;
+	}
 	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, req);

--

From: Luciano Rocha
Date: Tuesday, October 28, 2008 - 9:37 am

I tried it, but I get some I/O errors with it.

With 8bfa24727 and this patch, I get:

[   94.289819] usb 1-2.1.1: new high speed USB device using ehci_hcd and address 8
[   94.493891] usb 1-2.1.1: configuration #1 chosen from 1 choice
[   94.537884] usbcore: registered new interface driver libusual
[   94.611888] Initializing USB Mass Storage driver...
[   94.612704] scsi2 : SCSI emulation for USB Mass Storage devices
[   94.619835] usbcore: registered new interface driver usb-storage
[   94.619891] USB Mass Storage support registered.
[   94.622190] usb-storage: device found at 8
[   94.622232] usb-storage: waiting for device to settle before scanning
[   99.622933] scsi 2:0:0:0: Direct-Access     SAMSUNG  HD400LD          WQ10 PQ: 0 ANSI: 0
[   99.627569] usb-storage: device scan complete
[   99.657746] Driver 'sd' needs updating - please use bus_type methods
[   99.661428] sd 2:0:0:0: [sda] 781422769 512-byte hardware sectors (400088 MB)
[   99.662551] sd 2:0:0:0: [sda] Write Protect is off
[   99.662662] sd 2:0:0:0: [sda] Mode Sense: 03 00 00 00
[   99.662702] sd 2:0:0:0: [sda] Assuming drive cache: write through
[   99.666582] sd 2:0:0:0: [sda] 781422769 512-byte hardware sectors (400088 MB)
[   99.672877] sd 2:0:0:0: [sda] Write Protect is off
[   99.672884] sd 2:0:0:0: [sda] Mode Sense: 03 00 00 00
[   99.672886] sd 2:0:0:0: [sda] Assuming drive cache: write through
[   99.672890]  sda: sda1
[   99.683592] sd 2:0:0:0: [sda] Attached SCSI disk
[   99.704160] sd 2:0:0:0: Attached scsi generic sg0 type 0
[   99.808029] sd 2:0:0:0: [sda] Sense Key : No Sense [current] 
[   99.808040] sd 2:0:0:0: [sda] Add. Sense: No additional sense information
[   99.808051] end_request: I/O error, dev sda, sector 781422768
[   99.808056] Buffer I/O error on device sda, logical block 781422768
[   99.837771] sd 2:0:0:0: [sda] Sense Key : No Sense [current] 
[   99.837780] sd 2:0:0:0: [sda] Add. Sense: No additional sense information
[   99.837790] end_request: I/O error, dev sda, sector ...
From: Alan Stern
Date: Tuesday, October 28, 2008 - 10:38 am

That's to be expected.  These patches aren't supposed to prevent I/O 
errors, they are merely supposed to prevent the errors from continuing 



Now that is truly bizarre.  It means that the drive responded 
differently even though it received the same sequence of commands as 

Yes.  Please post a usbmon trace for each of the last two tests.  That 
is, with the unusual_devs update only, and with the unusual_devs update 
plus the invoke_transport update and the scsi_requeue_command update.

Alan Stern

--

From: Luciano Rocha
Date: Monday, November 3, 2008 - 8:52 am

I've updated the bugzilla entry with traces for some combinations of the
discussed patches:
http://bugzilla.kernel.org/show_bug.cgi?id=11843

-- 
Luciano Rocha <luciano@eurotux.com>
Eurotux Informática, S.A. <http://www.eurotux.com/>
--

From: Alan Stern
Date: Monday, November 3, 2008 - 12:46 pm

Okay, thanks.  I see the problem.  This whole area is a mess, and it's 
going to be different in 2.6.28.  But for now...

Please rerun the A+B and A+B+C tests, but this time using the B' patch
below instead of B.

Alan Stern



Index: 2.6.27.4/drivers/scsi/scsi_lib.c
===================================================================
--- 2.6.27.4.orig/drivers/scsi/scsi_lib.c
+++ 2.6.27.4/drivers/scsi/scsi_lib.c
@@ -611,6 +611,11 @@ static void scsi_requeue_command(struct 
 	struct request *req = cmd->request;
 	unsigned long flags;
 
+	if (--req->retries <= 0) {
+		blk_end_request(req, -EIO, blk_rq_bytes(req));
+		scsi_next_command(cmd);
+		return;
+	}
 	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, req);
@@ -690,6 +695,8 @@ static struct scsi_cmnd *scsi_end_reques
 				 * leftovers in the front of the
 				 * queue, and goose the queue again.
 				 */
+				if (bytes > 0)		/* Made progress */
+					++req->retries;
 				scsi_requeue_command(q, cmd);
 				cmd = NULL;
 			}

--

From: Luciano Rocha
Date: Wednesday, November 5, 2008 - 3:26 am

New test data in the bugzilla.

Regards,
Luciano Rocha

-- 
Luciano Rocha <luciano@eurotux.com>
Eurotux Informática, S.A. <http://www.eurotux.com/>
--

From: Alan Stern
Date: Wednesday, November 5, 2008 - 9:51 am

Sorry, dumb mistake on my part.  B'' is below.  It is the same as B' 
except for one character.

Alan Stern



Index: 2.6.27.4/drivers/scsi/scsi_lib.c
===================================================================
--- 2.6.27.4.orig/drivers/scsi/scsi_lib.c
+++ 2.6.27.4/drivers/scsi/scsi_lib.c
@@ -611,6 +611,11 @@ static void scsi_requeue_command(struct 
 	struct request *req = cmd->request;
 	unsigned long flags;
 
+	if (--req->retries < 0) {
+		blk_end_request(req, -EIO, blk_rq_bytes(req));
+		scsi_next_command(cmd);
+		return;
+	}
 	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, req);
@@ -690,6 +695,8 @@ static struct scsi_cmnd *scsi_end_reques
 				 * leftovers in the front of the
 				 * queue, and goose the queue again.
 				 */
+				if (bytes > 0)		/* Made progress */
+					++req->retries;
 				scsi_requeue_command(q, cmd);
 				cmd = NULL;
 			}

--

From: Luciano Rocha
Date: Thursday, November 13, 2008 - 10:10 am

A+B''+C seems to have done it. The bugzilla entry was updated with the
kernel messages and usb traffic.

  http://bugzilla.kernel.org/show_bug.cgi?id=3D11843

Regards,
Luciano Rocha

--=20
Luciano Rocha <luciano@eurotux.com>
Eurotux Inform=E1tica, S.A. <http://www.eurotux.com/>
From: Alan Stern
Date: Monday, October 27, 2008 - 1:36 pm

Here's the promised patch.

Alan Stern



Index: usb-2.6/drivers/usb/storage/unusual_devs.h
===================================================================
--- usb-2.6.orig/drivers/usb/storage/unusual_devs.h
+++ usb-2.6/drivers/usb/storage/unusual_devs.h
@@ -1251,6 +1251,13 @@ UNUSUAL_DEV( 0x0839, 0x000a, 0x0001, 0x0
 		US_SC_DEVICE, US_PR_DEVICE, NULL,
 		US_FL_FIX_INQUIRY),
 
+/* Reported by Luciano Rocha <luciano@eurotux.com> */
+UNUSUAL_DEV( 0x0840, 0x0082, 0x0001, 0x0001,
+		"Argosy",
+		"Storage",
+		US_SC_DEVICE, US_PR_DEVICE, NULL,
+		US_FL_FIX_CAPACITY),
+
 /* Entry and supporting patch by Theodore Kilgore <kilgota@auburn.edu>.
  * Flag will support Bulk devices which use a standards-violating 32-byte
  * Command Block Wrapper. Here, the "DC2MEGA" cameras (several brands) with

--

Previous thread: Possible bug in SCSI Kconfig by Fabio Comolli on Wednesday, October 22, 2008 - 9:29 am. (14 messages)

Next thread: [PATCH][RESEND] tracing/ftrace: Introduce the big kernel lock tracer by Frederic Weisbecker on Wednesday, October 22, 2008 - 10:09 am. (5 messages)