Your basic assumption that "RAID = minimal downtime" is flawed.
The most horrible downtimes and "events" you will ever see
often involve RAID. That goes for hardware RAID and software
RAID alike.
Almost every RAID system out there handles the sudden removal
of a disk from the system pretty well. Why? Because it's EASY
to create that "failure mode". Problem is, in 25 years in this
business, I don't recall having seen a hard disk fall out of a
computer as a mode of actual failure (I did see a SCSI HBA fall
out of a machine once, but that's a different story).
My preferred way to test RAID systems involves a powder-actuated
nail gun driving a nail through the platter. Not overly
realistic either, but arguably more so than having the drive
suddenly pulled out. The disks get expensive, though.
Back to your situation...
The drive reports a failure, but not one so horrible that the
OS doesn't attempt a retry. So, at what point does the OS just
shut down the drive and say, "not worth the trouble"? If you
are running a single drive, you generally want to keep trying
as long as there is the slightest hope (another digression: back
in the MSDOS v2 days, I had a machine blow a disk such that if I
kept hitting "Retry" enough times, each sector would ultimately
be read successfully. Wedged a pen in between the 'R' key and
the monitor, went to dinner, and when I came back, I had all my
data successfully copied to another drive). In your case,
however, you have a drive saying, "I'm getting better" when you
are saying, "It'll be stone dead in a moment". You want the
OS to whack the drive and toss it on the cart..er..remove it
from the RAID set at the first sign of trouble, but that's not
a universal answer.
Curiously, I've had servers that caused problems BOTH ways. One
kept a drive on-line even though it was having serious problems
and should have been declared dead. In several other cases,
the drives reported minor errors and were popped off-line ...
I really appreciate your detailed report about your experiences
with RAID systems. That was cool. I certainly don't expect any
miracles from RAID anymore.
The current plan is to move to a ramdisk-based system to get rid
of disk access as far as possible, and to use CARP to set up a
fallback host. Logging is done (non-blocking, hopefully) via the
network.
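For reference, a minimal sketch of what such an OpenBSD setup might look
like (the interface, addresses, password, mount size, and loghost name
below are illustrative assumptions, not details from this thread):

```shell
# /etc/hostname.carp0 -- shared service address for failover; the
# backup host uses the same vhid and password with "advskew 100"
inet 192.0.2.10 255.255.255.0 192.0.2.255 vhid 1 pass examplepw

# /etc/fstab -- keep scratch space in a memory filesystem so routine
# writes never touch the disk (64 MB is an arbitrary example size)
swap /tmp mfs rw,nodev,nosuid,-s=64m 0 0

# /etc/syslog.conf -- forward everything to a remote loghost over UDP
# (field separator must be a tab)
*.*	@loghost.example.net
```

With CARP, the backup host takes over the shared address automatically
when the master stops advertising, and with logs shipped off-box there
is little reason for the firewall itself to block on local disk I/O.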
I had seen that disk-suddenly-out-of-computer failure once. Coincidentally
enough, it was an OpenBSD system configured only for NAT, about 6 years ago.
The IDE hard disk failed sometime at night. When we arrived at the
office the next day, everything was still working flawlessly, until
someone ssh'ed to that machine. My guess is that something went awry
when syslog tried to write that new connection and the OS suddenly
discovered there was no HD present.
Surprisingly enough, the onboard IDE controller survived, but after
installing the new disk, we found the parallel IDE cable faulty and
it had to be replaced.
It was not a RAID system though...