It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed, etc. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers -
I think that drives retry enough; we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set, like we do with NCQ queue depth via /sys? We need to be able to layer things like MD on top of normal drive errors in a way that will produce a system that provides reasonable response time despite any possible IO error on a single component. Another case that we end up doing on a regular basis is drive recovery. Errors need to be limited in scope to just the impacted area and dispatched up to the application layer as quickly as we can, so that you don't spend days watching a copy of a huge drive (think 750GB or more) ;-) ric -
I don't disagree that it should be settable. However, retries occur for other reasons than failures inside the device. The most standard ones are unit attentions generated because of other activity (target reset etc). The key to the problem is retrying only operations that are For the MD case, this is what REQ_FAILFAST is for. James -
I cannot find where SCSI honours that flag. James? And for that matter, even when I patch SCSI so that it *does* honour it, I don't actually see the flag making it into the SCSI layer from above. And I don't see where/how the block layer takes care when considering merging FAILFAST/READA requests with non-FAILFAST/READA requests. To me, it looks perfectly happy to add non-FAILFAST/READA bios to a FAILFAST request, risking data loss if a lower layer decides to honour the FAILFAST/READA flags. So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) -
Scratch that thought.. SCSI honours it in scsi_end_request(). But I'm not certain that the block layer handles it correctly, at least not in the 2.6.16/2.6.18 kernel that I'm working with today. Cheers -
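The merge concern above can be illustrated with a small user-space model. This is only a sketch: `struct req_model`, `REQ_F_FAILFAST`, `REQ_F_READA` and `merge_flags_compatible()` are hypothetical names invented here, not the block layer's definitions. The idea is to refuse a merge whenever the two sides disagree on either flag, so a bio that expects normal retries never inherits fail-fast treatment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this model; not the kernel's definitions. */
#define REQ_F_FAILFAST (1u << 0)  /* fail quickly, do not retry */
#define REQ_F_READA    (1u << 1)  /* read-ahead, may be dropped */

/* Flags whose value must be identical on both sides of a merge. */
#define REQ_F_NOMERGE_MIX (REQ_F_FAILFAST | REQ_F_READA)

struct req_model {
	unsigned int flags;
};

/* Return true only if merging cannot change either side's error policy. */
static bool merge_flags_compatible(const struct req_model *rq,
				   const struct req_model *bio)
{
	assert(rq != NULL && bio != NULL);
	/* XOR exposes the bits that differ; mask keeps only the policy bits. */
	return ((rq->flags ^ bio->flags) & REQ_F_NOMERGE_MIX) == 0;
}
```

With this check, a FAILFAST request and a plain bio are kept apart, while two requests carrying the same flags can still be merged.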
Er, it's in scsi_error.c:scsi_decide_disposition():
      maybe_retry:

        /* we requeue for retry because the error was retryable, and
         * the request was not marked fast fail.  Note that above,
         * even if the request is marked fast fail, we still requeue
         * for queue congestion conditions (QUEUE_FULL or BUSY) */
        if ((++scmd->retries) <= scmd->allowed
            && !blk_noretry_request(scmd->request)) {
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                return NEEDS_RETRY;
        } else {
                /*
                 * no more retries - report this one back to upper level.
                 */
                return SUCCESS;
        }
James
-
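To see how the retry budget and the fast-fail flag interact in the snippet quoted above, here is a user-space model of just that branch. It is a sketch, not kernel code: `struct cmd_model`, `enum disposition` and the `failfast` field (standing in for blk_noretry_request()) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum disposition { DISP_SUCCESS, DISP_NEEDS_RETRY };

struct cmd_model {
	int retries;   /* retries consumed so far */
	int allowed;   /* retry budget, e.g. the value of SD_MAX_RETRIES */
	bool failfast; /* stands in for blk_noretry_request() */
};

/* Mirrors the quoted branch: retry while budget remains and not fast-fail. */
static enum disposition maybe_retry(struct cmd_model *cmd)
{
	assert(cmd != NULL);
	if (++cmd->retries <= cmd->allowed && !cmd->failfast)
		return DISP_NEEDS_RETRY;
	return DISP_SUCCESS; /* no more retries: report to the upper level */
}
```

In this model a budget of 5 yields five NEEDS_RETRY results before the error is reported, a budget of 0 reports it on the first failure, and a fast-fail command is reported on the first failure regardless of the budget.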
The transport might also want a say. I see ABORTED COMMAND errors often enough with SAS (e.g. due to expander congestion) to warrant at least one retry (which works in my testing). SATA disks behind SAS infrastructure would also be susceptible to the same "random" failures. Transport Layer Retries (TLR) in SAS should remove this class of transport errors but only SAS tape drives support TLR as far as I know. -
(note: libata does *not* generate retries for medium errors; Or perhaps we could have the mid-layer always "early-exit" without retries for "MEDIUM_ERROR", and still do retries for the rest. When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the "with retry" ATA opcodes for this). But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Cheers -
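The "early-exit for MEDIUM_ERROR, still retry the rest" idea can be sketched as a per-sense-key retry budget. This is a user-space illustration under assumptions: `retry_budget()` is a hypothetical helper, not a mid-layer function; only the sense key values themselves come from the SCSI standard (SPC).

```c
#include <assert.h>
#include <stdint.h>

/* Sense key values as defined by SPC. */
#define SK_MEDIUM_ERROR    0x03
#define SK_UNIT_ATTENTION  0x06
#define SK_ABORTED_COMMAND 0x0b

/*
 * Hypothetical policy helper: give MEDIUM ERROR no retries (the drive has
 * already retried internally) and leave every other key its normal budget.
 */
static int retry_budget(uint8_t sense_key, int default_retries)
{
	assert(default_retries >= 0);
	if ((sense_key & 0x0f) == SK_MEDIUM_ERROR)
		return 0;
	return default_retries;
}
```

Under this policy a transport-level ABORTED COMMAND or a UNIT ATTENTION keeps its retries, while a medium error is passed up immediately.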
This depends on the firmware. Some of the "raid firmware" drives don't Agreed -
I think that even for these devices, the actual drives behind the controller will do retries. In any case, it would be reasonable to be able to set this -
One way to tell if this is true is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming the drive was already spun up). If the failure takes more than a few hundred milliseconds to be reported, or in this case 4-7 seconds typically, then we know the drive was doing retries before it reported back. I haven't seen any drive fail instantly yet. Can anyone with those newfangled "RAID edition" drives try it and report back? Oh.. you'll need a way to create a bad sector. I've got patches and a command-line utility for the job. If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm), -
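The timing test described above could be sketched in user space roughly as follows. The helper names and the 300 ms cut-off are assumptions made for this example ("a few hundred milliseconds" in the text); only clock_gettime() and pread() are standard POSIX calls.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define RETRY_THRESHOLD_MS 300.0 /* assumed cut-off, not a measured value */

static double ts_diff_ms(const struct timespec *start,
			 const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) * 1000.0 +
	       (end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Time a single pread(); store 0 or errno in *err_out, return elapsed ms. */
static double timed_pread_ms(int fd, void *buf, size_t len, off_t off,
			     int *err_out)
{
	struct timespec t0, t1;
	ssize_t n;

	assert(buf != NULL && err_out != NULL);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	n = pread(fd, buf, len, off);
	*err_out = (n < 0) ? errno : 0;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ts_diff_ms(&t0, &t1);
}

/* A failure that took longer than the cut-off suggests internal retries. */
static bool drive_probably_retried(double failed_read_ms)
{
	return failed_read_ms > RETRY_THRESHOLD_MS;
}
```

For a real measurement the read has to reach the device, so the descriptor would need to bypass the page cache (for example, opened with O_DIRECT and an aligned buffer), and any retries added by the kernel above the drive would have to be ruled out separately.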
Well, the simpler way (and one we have a hope of implementing) is to examine the ASC/ASCQ codes to see if the error is genuinely unretryable. I seem to have dropped the ball on this one, in that I thought I'd applied the scsi_error.c pieces of this patch: http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885 Apparently I didn't, so I'll go back and put them in. James -
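As an illustration of examining the additional sense code, here is a user-space sketch. The choice of which codes count as unretryable is an assumption made for this example and may not match the referenced patch; `struct sense_model` and `medium_error_is_final()` are hypothetical names. The ASC values and their meanings are taken from the SPC additional sense code list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MEDIUM_ERROR 0x03 /* sense key value from SPC */

/* Minimal decoded sense data; ASCQ is carried but not examined here. */
struct sense_model {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/* Return true if a retry is assumed to be pointless for this error. */
static bool medium_error_is_final(const struct sense_model *s)
{
	assert(s != NULL);
	if (s->sense_key != SK_MEDIUM_ERROR)
		return false;
	switch (s->asc) {
	case 0x11: /* UNRECOVERED READ ERROR */
	case 0x13: /* ADDRESS MARK NOT FOUND FOR DATA FIELD */
	case 0x14: /* RECORDED ENTITY NOT FOUND */
		return true;
	default:
		return false; /* other medium errors keep their retries */
	}
}
```

A caller would report the command upward immediately when this returns true and fall back to the normal retry path otherwise.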
My suggestion above was not for a kernel fix, but rather just as a way of determining if drives Good. That would be a useful supplement to the patch I posted here. Cheers -
FWIW -- speaking generally -- I think there are inevitable areas where libata error handling combined with SCSI error handling results in suboptimal error handling. Just creating a list of "<this behavior> should be handled <this way>, but in reality is handled in <this silly way>" would be very helpful. Error handling is tough to get right, because the code is exercised so infrequently. Tejun has actually done an above-average job here, by making device probe, hotplug and other "exceptions" go through the libata EH code, thereby exercising the EH code more than one might normally assume. Some errors in libata probably should not be retried more than once, when we have a definitive diagnosis. Suggestions for improvements are welcome. Jeff -
I agree - Tejun has done a great job at giving us a great base. Next step is to get clarity on what the types of errors are and how to differentiate between One thing that we find really useful is to inject real errors into devices. Mark has some patches that let us inject media errors; we also bring back failed drives and run them through testing, and occasionally get to use analyzers, etc., to inject oddball errors. Hopefully, we will get some time to brainstorm about this at the workshop, ric -
Ric, Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added command support to flag a block as "uncorrectable". There is no need to send bad "long" data to it and suppress the disk's automatic re-allocation logic. In the case of ATA it is the WRITE UNCORRECTABLE command. In the case of SCSI it is the WR_UNCOR bit in the WRITE LONG command. It seems that due to SAT any useful capability in the ATA command set will soon appear in the corresponding SCSI command set, if it is not already there. Doug Gilbert -
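For the SCSI side, a CDB carrying the WR_UNCOR bit might be assembled as in the sketch below. The layout is an assumption based on my reading of the SBC-3 WRITE LONG (10) command (opcode 3Fh, WR_UNCOR as bit 6 of byte 1, byte transfer length left at zero) and should be checked against the standard before anything is sent to a device. The function name is hypothetical, and the sketch only fills a buffer; it does not issue the command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WRITE_LONG_10_OPCODE 0x3f
#define WRITE_LONG_WR_UNCOR  0x40 /* byte 1, bit 6 (assumed per SBC-3) */
#define WRITE_LONG_10_LEN    10

/* Fill cdb[] with a WRITE LONG (10) that marks one LBA uncorrectable. */
static void build_write_long10_wr_uncor(uint8_t cdb[WRITE_LONG_10_LEN],
					uint32_t lba)
{
	assert(cdb != NULL);
	memset(cdb, 0, WRITE_LONG_10_LEN);
	cdb[0] = WRITE_LONG_10_OPCODE;
	cdb[1] = WRITE_LONG_WR_UNCOR;
	cdb[2] = (uint8_t)(lba >> 24); /* LBA, big-endian */
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	/* bytes 7-8 (byte transfer length) and byte 9 (control) stay zero */
}
```

Because no "long" data is transferred in this form, there is no need to know the drive's ECC length, which is the convenience the message above points out.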
That'll be useful in a couple of years, once drives that have it become more common. For now, though, we're hacking current drives using READ/WRITE LONG commands, with a corresponding patch to libata to allow for the longer "sector size" involved. Having real bad sectors, exactly where we want them on the media, sure does make testing / fixing the EH mechanisms a lot more feasible. Cheers -
