"These patches allow data integrity information (checksum and more) to be attached to I/Os at the block/filesystem layers and transferred through the entire I/O stack all the way to the physical storage device," began Martin Petersen. He went on to explain, "the integrity metadata can be generated in close proximity to the original data. Capable host adapters, RAID arrays and physical disks can verify the data integrity and abort I/Os in case of a mismatch." He noted that support currently only exists for SCSI disks, but that work is underway to also add support for SATA drives and SCSI tapes, "with a few minor nits due to protocol limitations the proposed SATA format is identical to the SCSI". Explaining how this works, Martin continued:
"SCSI drives can usually be reformatted to 520-byte sectors, yielding 8 extra bytes per sector. These 8 bytes have traditionally been used by RAID controllers to store internal protection information. DIF (Data Integrity Field) is an extension to the SCSI Block Commands that standardizes the format of the 8 extra bytes and defines ways to interact with the contents at the protocol level. [...] When writing, the HBA (Host Bus Adapter) will DMA 512-byte sectors from host memory, generate the matching integrity metadata and send out 520-byte sectors on the wire. The disk will verify the integrity of the data before committing it to stable storage. When reading, the drive will send 520-byte sectors to the HBA. The HBA will verify the data integrity and DMA 512-byte sectors to host memory."
"Btrfs v0.14 is now available for download," Chris Mason announced, adding, "please note the disk format has changed, and it is not compatible with older versions of Btrfs." The project has gained a new wiki home page on the kernel.org domain, where it is explained, "Btrfs is a new copy on write filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Initially developed by Oracle, Btrfs is licensed under the GPL and open for contribution from anyone." Regarding the latest release, Chris explained:
"v0.14 has a few performance fixes and closes some races that could have allowed corrupted metadata in v0.13. The major new feature is the ability to manage multiple devices under a single Btrfs mount. Raid0, raid1 and raid10 are supported. Even for single device filesystems, metadata is now duplicated by default. Checksums are verified after reads finish and duplicate copies are used if the checksums don't match."
Chris offered links to multi-device benchmarks, summarizing, "in general these numbers show that Btrfs does a good job at scaling to this storage configuration, and that it is on par with both HW raid and MD." Looking forward, he concluded, "next up on the Btrfs todo list is finishing off the device removal and IO error handling code. After that I'll add more fine grained locking to the btrees."
"An ongoing study on datasets of several Petabytes have shown that there can be 'silent data corruption' at rates much larger than one might naively expect from the expected error rates in RAID arrays and the expected probability of single bit uncorrected errors in hard disks," began a recent query on the Linux kernel mailing list asking where the errors might be introduced. Alan Cox replied, "its almost entirely device specific at every level." He then continued on with some general information, tracing the path of the data from the drive, through the cable and bus, into main memory and the CPU cache, as well as over the network, "once its crossing the PCI bus and main memory and CPU cache its entirely down to the system you are running what is protected and how much. Note that a lot of systems won't report ECC errors unless you ask." Alan continued:
"The next usual mess is network transfers. The TCP checksum strength is questionable for such workloads but the ethernet one is pretty good. Unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness."
Regarding the specific study in question, Alan noted, "for drivers/ide there are *lots* of problems with error handling so that might be implicated (would want to do old [versus] new ide tests on the same h/w which would be very intriguing)."
Lars Ellenberg started an effort to get DRBD, the Distributed Replicated Block Device, merged into the Linux kernel. When asked for clarification as to what it was, Lars explained, "think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system." Earlier in the thread he described it as "a stacked block device driver".
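Conceptually, "RAID1 over TCP" means a write completes only after it has reached both the local backing device and the peer over the network. A very rough sketch of that write path, with hypothetical helpers standing in for DRBD's actual machinery:

    /*
     * Hypothetical sketch of a replicated write; send_to_peer(),
     * submit_local() and wait_peer_ack() are stand-ins, not DRBD code.
     */
    int replicated_write(struct bio *bio)
    {
            int err;

            send_to_peer(bio);              /* ship the data to the Secondary */
            err = submit_local(bio);        /* write the local backing device */
            if (err)
                    return err;
            return wait_peer_ack(bio);      /* done when both copies are safe */
    }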
Much of the initial review focused on the need to comply with kernel coding style guidelines. Kyle Moffett offered a much lengthier review, noting at one point in the code, "how about fixing this to actually use proper workqueues or something instead of this open-coded mess?" Lars replied, "unlikely to happen 'right now'. But it is on our todo list..." Jens Axboe added, "but stuff like that is definitely a merge show stopper, jfyi".
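For context, the "proper workqueues" Kyle points to are the kernel's standard mechanism for deferred work; converting an open-coded thread-and-list scheme generally boils down to something like the following generic example (not DRBD's code; names are hypothetical):

    #include <linux/workqueue.h>

    struct my_ctx {                         /* hypothetical per-device state */
            struct work_struct work;
            /* ... */
    };

    static void my_work_fn(struct work_struct *work)
    {
            struct my_ctx *ctx = container_of(work, struct my_ctx, work);
            /* ... deferred I/O bookkeeping for ctx ... */
    }

    static struct my_ctx ctx;

    static void my_setup_and_kick(void)
    {
            INIT_WORK(&ctx.work, my_work_fn);  /* once, at init */
            schedule_work(&ctx.work);          /* where the old code woke its thread */
    }

The appeal for reviewers is that locking, CPU scheduling and flush-on-teardown semantics then come from shared, well-tested infrastructure rather than each driver's private thread loop.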
Jeff Garzik noted that the hardware documentation for the Promise SX4 chipset is being opened up and that the sata_sx4 driver is therefore a good candidate for improvements: "I would like to take this opportunity to point hackers looking for a project at this hardware. The Promise SX4 is pretty neat, and it needs more attention than I can give, to reach its full potential." He added that it is an older chipset that's probably not sold anymore, that the ATA programming interface is similar to that in the sata_promise driver, and that it contains a fully programmable on-board DIMM and an on-board RAID5 XOR engine. Jeff went on to explain:
"A key problem is that, under Linux, sata_sx4 cannot fully exploit the RAID-centric power of this hardware by driving the hardware in 'dumb ATA mode' as it does. A better driver would notice when a RAID1 or RAID5 array contains multiple components attached to the SX4, and send only a single copy of the data to the card (saving PCI bus bandwidth tremendously). Similarly, a better driver would take advantage of the RAID5 XOR offload capabilities, to offload the entire RAID5 read or write transaction to the card.
"All this is difficult within either the MD or DM RAID frameworks, because optimizing each RAID transaction requires intimate knowledge of the hardware. We have the knowledge... but I don't have good ideas -- aside from an SX4-specific RAID 0/1/5/6 driver -- on how to exploit this knowledge."