Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Previous thread: Re: [Ksummit-2007-discuss] Re: [Ksummit-2006-discuss] 2007 Linux Kernel Summit by Matt Domsch on Tuesday, January 30, 2007 - 5:52 pm. (9 messages)

Next thread: [patch] some scripts: replace gawk, head, bc with shell, update by Oleg Verych on Tuesday, January 30, 2007 - 7:59 pm. (1 message)
From: Mark Lord
Date: Tuesday, January 30, 2007 - 6:41 pm

It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

Cheers
-

From: Ric Wheeler
Date: Tuesday, January 30, 2007 - 8:20 pm

I think that drives retry enough on their own; we should leave retries at 
zero for normal (non-removable) drives. Should this be a policy we can set 
via /sys, like we do with NCQ queue depth?

We need to be able to layer things like MD on top of normal drive errors 
in a way that will produce a system that provides reasonable response 
time despite any possible IO error on a single component.  Another case 
that we end up doing on a regular basis is drive recovery. Errors need 
to be limited in scope to just the impacted area and dispatched up to 
the application layer as quickly as we can so that you don't spend days 
watching a copy of a huge drive (think 750GB or more) ;-)

ric

-

From: James Bottomley
Date: Tuesday, January 30, 2007 - 9:21 pm

I don't disagree that it should be settable.  However, retries occur for
other reasons than failures inside the device.  The most standard ones
are unit attentions generated because of other activity (target reset
etc).  The key to the problem is retrying only operations that are

For the MD case, this is what REQ_FAILFAST is for.

James


-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 8:13 am

I cannot find where SCSI honours that flag.  James?

And for that matter, even when I patch SCSI so that it *does* honour it,
I don't actually see the flag making it into the SCSI layer from above.

And I don't see where/how the block layer takes care when merging
FAILFAST/READA requests with non-FAILFAST/READA requests.
To me, it looks perfectly happy to add non-FAILFAST/READA bios
to a FAILFAST request, risking data loss if a lower layer decides
to honour the FAILFAST/READA flags.

So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)
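
To make the merge concern concrete, here is a minimal sketch of the kind of check that would avoid it. This is not the block layer's actual code: the struct names, flag bits and sketch_merge_ok() are all hypothetical, invented only to illustrate refusing a merge when the no-retry flags differ.

```c
#include <stdbool.h>

/* Hypothetical flag bits, named after the flags discussed in this thread. */
#define SKETCH_FAILFAST     (1u << 0)
#define SKETCH_READA        (1u << 1)
#define SKETCH_NORETRY_MASK (SKETCH_FAILFAST | SKETCH_READA)

/* Stand-ins for a request and a bio; only the flags matter here. */
struct sketch_request { unsigned int flags; };
struct sketch_bio     { unsigned int flags; };

/*
 * Allow a merge only when the bio and the request agree on the
 * no-retry flags, so that a normal bio never inherits fail-fast
 * handling from the request it was merged into.
 */
static bool sketch_merge_ok(const struct sketch_request *rq,
                            const struct sketch_bio *bio)
{
	return (rq->flags & SKETCH_NORETRY_MASK) ==
	       (bio->flags & SKETCH_NORETRY_MASK);
}
```

With a check like this, a plain bio would start its own request instead of joining a FAILFAST one, at the cost of somewhat fewer merges.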
 

-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 8:22 am

Scratch that thought.. SCSI honours it in scsi_end_request().

But I'm not certain that the block layer handles it correctly,
at least not in the 2.6.16/2.6.18 kernel that I'm working with today.

Cheers
-

From: James Bottomley
Date: Wednesday, January 31, 2007 - 8:24 am

Er, it's in scsi_error.c:scsi_decide_disposition():

      maybe_retry:

	/* we requeue for retry because the error was retryable, and
	 * the request was not marked fast fail.  Note that above,
	 * even if the request is marked fast fail, we still requeue
	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
	if ((++scmd->retries) <= scmd->allowed
	    && !blk_noretry_request(scmd->request)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		return NEEDS_RETRY;
	} else {
		/*
		 * no more retries - report this one back to upper level.
		 */
		return SUCCESS;
	}

James
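
For readers without the kernel tree at hand, the decision quoted above can be modelled in isolation. The sketch below is a user-space stand-in, not kernel code: struct sketch_cmd, its fields and sketch_maybe_retry() are hypothetical names that only mirror scmd->retries, scmd->allowed and blk_noretry_request() from the excerpt.

```c
#include <stdbool.h>

enum sketch_disposition { SKETCH_SUCCESS, SKETCH_NEEDS_RETRY };

/* Hypothetical stand-in for the fields the excerpt looks at. */
struct sketch_cmd {
	int  retries;  /* retries attempted so far */
	int  allowed;  /* retry budget (SD_MAX_RETRIES for disks) */
	bool noretry;  /* request was marked fast fail */
};

/*
 * Same shape as the quoted maybe_retry logic: retry only while the
 * budget lasts and the request is not marked fast fail; otherwise
 * report the error back to the upper level.
 */
static enum sketch_disposition sketch_maybe_retry(struct sketch_cmd *cmd)
{
	if (++cmd->retries <= cmd->allowed && !cmd->noretry)
		return SKETCH_NEEDS_RETRY;
	return SKETCH_SUCCESS;
}
```

In this model a budget of 5 means the command is issued up to six times before the error is reported, while a budget of 0 or a fast-fail request reports it after the first failure.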


-

From: Douglas Gilbert
Date: Tuesday, January 30, 2007 - 10:09 pm

The transport might also want a say. I see ABORTED COMMAND
errors often enough with SAS (e.g. due to expander congestion)
to warrant at least one retry (which works in my testing).
SATA disks behind SAS infrastructure would also be
susceptible to the same "random" failures.

Transport Layer Retries (TLR) in SAS should remove this class
of transport errors but only SAS tape drives support TLR as
far as I know.



-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 8:08 am

(note: libata does *not* generate retries for medium errors;

Or perhaps we could have the mid-layer always "early-exit"
without retries for "MEDIUM_ERROR", and still do retries for the rest.

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)
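
As a rough illustration of the behaviour being argued for (not the patch itself), the sketch below completes each piece of a merged request on its own merits rather than aborting everything after the first bad sector. The struct and function names are hypothetical, and a real request is of course far more involved than an array of segments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one bio-sized piece of a merged request. */
struct sketch_segment {
	bool media_ok;  /* false if this piece covers a bad sector */
	int  status;    /* filled in: 0 on success, -1 on I/O error */
};

/*
 * Complete every segment individually and keep going after a
 * failure, so unrelated I/O merged behind a bad sector still
 * completes. Returns the number of segments that failed.
 */
static size_t sketch_complete_merged(struct sketch_segment *seg, size_t n)
{
	size_t failed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (seg[i].media_ok) {
			seg[i].status = 0;
		} else {
			seg[i].status = -1;
			failed++;
		}
	}
	return failed;
}
```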

Cheers
-

From: Alan
Date: Wednesday, January 31, 2007 - 8:23 am

This depends on the firmware. Some of the "raid firmware" drives don't

Agreed
-

From: Ric Wheeler
Date: Wednesday, January 31, 2007 - 9:35 am

I think that even for these devices, the actual drives behind the controller 
will do retries. In any case, it would be reasonable to be able to set this 


-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 10:57 am

One way to tell if this is true is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming the drive was already spun up).

If the failure takes more than a few hundred milliseconds to be reported,
or in this case 4-7 seconds typically, then we know the drive was doing
retries before it reported back.
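
A minimal user-space sketch of that timing test follows, assuming a POSIX system. The function names and the 300 ms threshold are assumptions chosen to match "a few hundred milliseconds" above. A real test would also need to bypass the page cache (for example with O_DIRECT and aligned buffers), which is left out to keep the sketch short.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Assumed cut-off for "a few hundred milliseconds". */
#define SKETCH_RETRY_THRESHOLD_MS 300.0

/*
 * Time a single read of len bytes at offset. Returns 0 if the read
 * succeeded and -1 if the open or the read failed. *elapsed_ms is
 * set whenever the open succeeded, including when the read fails.
 */
static int sketch_time_read(const char *path, off_t offset,
                            void *buf, size_t len, double *elapsed_ms)
{
	struct timespec t0, t1;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	n = pread(fd, buf, len, offset);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	*elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
	              (t1.tv_nsec - t0.tv_nsec) / 1e6;
	return n < 0 ? -1 : 0;
}

/* A failure slower than the threshold suggests internal drive retries. */
static bool sketch_drive_retried(double elapsed_ms)
{
	return elapsed_ms > SKETCH_RETRY_THRESHOLD_MS;
}
```

Pointing sketch_time_read() at the byte offset of a known bad sector and feeding the elapsed time of the failed read to sketch_drive_retried() gives the yes/no answer described above.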

I haven't seen any drive fail instantly yet.
Can anyone with those newfangled "RAID edition" drives try it
and report back?  Oh.. you'll need a way to create a bad sector.
I've got patches and a command-line utility for the job.

If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm),
-

From: James Bottomley
Date: Wednesday, January 31, 2007 - 11:13 am

Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.
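
As a sketch of what such a check might look like (not the patch referenced below), the function here classifies fixed-format sense data. The byte offsets follow the SCSI fixed sense format. The choice of which additional sense codes count as final is an assumption made for illustration, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MEDIUM_ERROR 0x03  /* sense key: MEDIUM ERROR */

/*
 * Return true if fixed-format sense data describes a medium error
 * that a retry is unlikely to fix. The ASC list is an assumption:
 *   0x11 unrecovered read error
 *   0x13 address mark not found for data field
 *   0x14 recorded entity not found
 * Descriptor-format sense data (0x72/0x73) is not handled here.
 */
static bool sketch_medium_error_is_final(const uint8_t *sense, size_t len)
{
	uint8_t key, asc;

	/* Need bytes 0..13 and a current-error fixed-format response code. */
	if (len < 14 || (sense[0] & 0x7f) != 0x70)
		return false;

	key = sense[2] & 0x0f;  /* sense key */
	asc = sense[12];        /* additional sense code */

	if (key != SKETCH_MEDIUM_ERROR)
		return false;

	return asc == 0x11 || asc == 0x13 || asc == 0x14;
}
```

Anything this function rejects, such as a unit attention after a target reset, would stay on the normal retry path.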

I seem to have dropped the ball on this one in that the scsi_error.c
pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

I thought I'd applied.  Apparently I didn't, so I'll go back and put
them in.

James


-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 11:37 am

My suggestion above was not for a kernel fix,
but rather just as a way of determining if drives

Good.  That would be a useful supplement to the patch I posted here.

Cheers
-

From: Jeff Garzik
Date: Wednesday, January 31, 2007 - 2:30 am

FWIW -- speaking generally -- I think there are inevitable areas where 
libata error handling combined with SCSI error handling results in 
suboptimal error handling.

Just creating a list of "<this behavior> should be handled <this way>, 
but in reality is handled in <this silly way>" would be very helpful.

Error handling is tough to get right, because the code is exercised so 
infrequently.  Tejun has actually done an above-average job here, by 
making device probe, hotplug and other "exceptions" go through the 
libata EH code, thereby exercising the EH code more than one might 
normally assume.

Some errors in libata probably should not be retried more than once, 
when we have a definitive diagnosis.  Suggestions for improvements are 
welcome.

	Jeff


-

From: Ric Wheeler
Date: Wednesday, January 31, 2007 - 7:36 am

I agree - Tejun has done a great job at giving us a great base. Next step is to 
get clarity on what the types of errors are and how to differentiate between 

One thing that we find really useful is to inject real errors into devices. Mark 
has some patches that let us inject media errors. We also bring back failed 
drives and run them through testing, and occasionally get to use analyzers, etc. 
to inject oddball errors.

Hopefully, we will get some time to brainstorm about this at the workshop,

ric
-

From: Douglas Gilbert
Date: Wednesday, January 31, 2007 - 8:28 am

Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as "uncorrectable". There
is no need to send bad "long" data to it and suppress the
disk's automatic re-allocation logic.

In the case of ATA it is the WRITE UNCORRECTABLE command.
In the case of SCSI it is the WR_UNCOR bit in the WRITE
LONG command.

It seems that due to SAT any useful capability in the ATA
command set will soon appear in the corresponding SCSI
command set, if it is not already there.

Doug Gilbert

-

From: Mark Lord
Date: Wednesday, January 31, 2007 - 8:38 am

That'll be useful in a couple of years, once drives that have it
become more common.  For now, though, we're hacking current drives
using READ/WRITE LONG commands, with a corresponding patch to libata
to allow for the longer "sector size" involved.

Having real bad sectors, exactly where we want them on the media,
sure does make testing / fixing the EH mechanisms a lot more feasible.

Cheers

-
