Pawel Dawidek first ported ZFS to FreeBSD from OpenSolaris in April 2007. He continues to actively port new ZFS features from OpenSolaris, focusing on improving overall ZFS stability. During the introduction to his talk at BSDCan, he explained that his goal was to offer an accessible view of ZFS internals. His discussion was broken into three sections: a review of the layers ZFS is built from and how they work together, a look at unique features found in ZFS and how they work internally, and a report on the current status of ZFS in FreeBSD.
The BSDCan website notes that Pawel is a FreeBSD committer, adding:
"In the FreeBSD project, he works mostly in the storage subsystems area (GEOM, file systems), security (disk encryption, opencrypto framework, IPsec, jails), but his code is also in many other parts of the system. Pawel currently lives in Warsaw, Poland, running his small company."
This article is derived from notes taken at a one-hour BSDCan talk by Pawel Dawidek titled "A closer look at the ZFS file system: simple administration, transactional semantics, end-to-end data integrity."
In a series of slides titled "ZFS, the internals", Pawel started with a diagram illustrating the many layers of ZFS, offering a quick overview of how it all fits together, and how it fits into FreeBSD. He then quickly moved from layer to layer.
zpool status -v, which shows all errors and lists all files affected by them. As an example use, Pawel pointed out that this makes it easy to quickly determine exactly which files need to be restored from a backup.
A feature found in the latest ZFS release, which Pawel is actively porting to FreeBSD, is the ability to use an entire device for caching, which he noted was similar to an L2 cache.
Thanks to its copy-on-write design, ZFS has no need for fsck. When data is modified, the changed version is written to a new place on the disk rather than overwriting the old copy of the data. Once written, the pointers are updated to point to the new data. If there's a crash in the middle of an operation, the old pointers still lead to the old data, which remains consistent and unmodified.
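The copy-on-write update sequence described above can be sketched with a toy block store in Python. All names here are hypothetical and purely illustrative; real ZFS operates on disk blocks, not a Python dictionary:

```python
# Toy illustration of copy-on-write: an update never overwrites live
# data. New data is written first, then the root pointer is flipped.

class ToyCowStore:
    def __init__(self):
        self.blocks = {}      # block address -> data
        self.next_addr = 0
        self.root = None      # pointer to the current live block

    def write(self, data):
        """Write data to a fresh address, then flip the root pointer.

        If a crash happens before the pointer flip, self.root still
        references the old, consistent block.
        """
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data       # step 1: write to a new place
        self.root = addr               # step 2: update the pointer
        return addr

store = ToyCowStore()
old = store.write(b"old data")
new = store.write(b"new data")
assert store.blocks[old] == b"old data"        # old copy untouched
assert store.blocks[store.root] == b"new data" # pointer leads to new data
```

The key property is the ordering: because the old block is never touched, an interrupted update simply leaves the old pointer pointing at consistent data.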
Traversing a live filesystem is not easy when multiple datasets are mounted, and this layer provides that capability. Traversal is what makes it possible to synchronize mirrors, and it is used when verifying all the checksums in a pool.
/dev/zfs, the communication gate between ZFS in the kernel and userland tools such as zfs(8) and zpool(8), used to configure and modify ZFS pools.
Pawel described RAID-Z as "similar to RAID-5, and yet so much different". RAID-Z benefits from the fact that ZFS uses copy-on-write and never overwrites data, avoiding the limitations of traditional RAID-5.
RAID-Z is also self-healing: a checksum is written whenever data is written with RAID-Z, and the checksum is validated every time data is read. If the checksum doesn't validate, ZFS automatically attempts to reconstruct the data from the parity information, then validates the reconstructed data; if valid, it writes the corrected data back to the disk.
Another advantage of RAID-Z is that when a disk is replaced, ZFS doesn't blindly copy the entire disk. Instead, it copies only actual data, so if a pool is almost empty, synchronization can happen very quickly.
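The self-healing read path described above can be sketched with a toy single-parity stripe in Python. This is a simplification under stated assumptions: real RAID-Z uses variable-width stripes and ZFS's own checksum machinery, and all names here are invented for illustration:

```python
# Toy single-parity stripe in the spirit of RAID-Z: data blocks plus an
# XOR parity block, with a checksum validated on every read.
import hashlib

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def checksum(data):
    return hashlib.sha256(data).digest()

def write_stripe(blocks):
    """Write equal-sized data blocks plus XOR parity and per-block checksums."""
    parity = blocks[0]
    for b in blocks[1:]:
        parity = xor(parity, b)
    return {"blocks": list(blocks), "parity": parity,
            "sums": [checksum(b) for b in blocks]}

def read_block(stripe, i):
    """Validate on read; reconstruct from parity and heal if corrupt."""
    data = stripe["blocks"][i]
    if checksum(data) != stripe["sums"][i]:
        # Rebuild block i from parity XOR all other data blocks,
        # re-validate the result, then write the corrected copy back.
        data = stripe["parity"]
        for j, b in enumerate(stripe["blocks"]):
            if j != i:
                data = xor(data, b)
        assert checksum(data) == stripe["sums"][i]
        stripe["blocks"][i] = data  # self-heal the bad copy
    return data
```

For example, corrupting one block and then reading it returns the reconstructed original and repairs the stored copy in place, mirroring the "validate, reconstruct, write back" sequence Pawel described.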
He then discussed hardware that does checksumming in the controller. For example, disks might be formatted with 520-byte sectors rather than 512-byte sectors, with the extra 8 bytes used to store checksum data. Pawel pointed out that this still does not provide end-to-end integrity: data can still be corrupted by a bad cable, by bad memory, or even by a buggy driver. Returning to the mail carrier analogy, he suggested they'd be saying something like: "We can only guarantee that when the package left our office, it was okay."
Other filesystems offer checksums that provide block-consistency verification, checking the block itself but not guaranteeing that the block is in the right place. Thus, a controller bug could mistakenly send writes to the wrong place, or phantom writes can happen where you think you wrote data but you didn't. Continuing the mail carrier analogy, he offered: "Here is a package. It's not broken, but it may not be yours."
Finally, he looked at how every block in ZFS is verified against an independent checksum. The pointer to each block is stored in a parent block along with the block's checksum, so when data is read, ZFS can verify both that the data is intact and that it really is the block that was asked for. Stepping back, he noted that since data is stored in a tree, the checksums propagate all the way up to the topmost block, which offers a single checksum of all blocks in the filesystem. He described this global checksum as a cryptographically strong signature of the entire pool.
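The structure he described behaves like a Merkle tree: each parent holds its children's checksums, so the root checksum covers everything beneath it. A minimal Python sketch (illustrative only; not the actual ZFS block-pointer format):

```python
# Toy Merkle tree: leaves are data blocks, interior nodes are lists of
# children. The top-level checksum changes if any block anywhere changes.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def tree_checksum(node):
    if isinstance(node, bytes):          # leaf: a data block
        return h(node)
    # interior node: hash over the concatenated child checksums
    return h(b"".join(tree_checksum(child) for child in node))

pool = [[b"block A", b"block B"], [b"block C"]]
top = tree_checksum(pool)                # "signature" of the whole pool

pool[0][1] = b"block B (bit flipped)"    # corrupt one leaf block
assert tree_checksum(pool) != top        # root checksum no longer matches
```

Because verification walks down from the root, a corrupt block is caught before its contents are trusted, and a single root checksum vouches for the entire tree.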
To maintain snapshots, ZFS tracks when each block was stored using a counter that is incremented each time a transaction is written to disk, along with a pointer to the block and a checksum. Every snapshot maintains its own dead block list, which is reviewed when a snapshot is destroyed, freeing blocks that meet all of the following conditions: they were born after the previous snapshot, born before the destroyed snapshot, died after the destroyed snapshot was created, and died before the next snapshot was created.
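The birth/death bookkeeping can be sketched as a simple predicate over that transaction counter. This is a toy model under stated assumptions (a monotonically increasing counter, one snapshot per counter value); the function name and parameters are invented for illustration:

```python
# Toy model of snapshot block lifetimes. A block records the transaction
# counter at which it was born and at which it died (was rewritten or
# deleted). Destroying a snapshot may free a block only if no remaining
# snapshot still references it.

def freeable(block_birth, block_death, prev_snap, destroyed_snap, next_snap):
    """True if the block was visible only to the destroyed snapshot."""
    return (prev_snap < block_birth <= destroyed_snap and
            destroyed_snap < block_death <= next_snap)

# Snapshots taken at transactions 10, 20 and 30; destroy the one at 20.
# Born at 15, died at 25: only snapshot 20 ever saw it -> freeable.
assert freeable(15, 25, prev_snap=10, destroyed_snap=20, next_snap=30)

# Born at 5: the previous snapshot (10) still references it -> keep.
assert not freeable(5, 25, 10, 20, 30)

# Died at 35: the next snapshot (30) still references it -> keep.
assert not freeable(15, 35, 10, 20, 30)
```

All four conditions from the talk map directly onto the two inequality chains: the birth bounds and the death bounds.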
This synchronization happens from the top of the tree and works its way down, so if it is stopped mid-process by a crash, it is possible to pick up where it left off, or to obtain at least some of the data from the partially synchronized disk.
ZFS Status in FreeBSD
Pawel explained that he has already ported the most recent version of ZFS from OpenSolaris, and that it currently lives in his private Perforce source code repository. He noted that the port is complete code-wise and everything works, but that he's still writing regression tests. He has already written 2,000 tests, yet these cover only half of ZFS's functionality -- an illustration of just how many features ZFS has. The new code will not be committed until the regression tests are finished, so he suggests, "be patient".
Cool New Features in the Latest Port
When Will ZFS Be Production Ready?
Pawel noted that he's heard this question a lot. "The experimental status is very inconvenient," he commented, to lots of laughter from the crowded room. He noted that he's currently the only maintainer, and suggested that until someone comes along to co-maintain the code and help debug things as the filesystem gains more users, he won't be marking the code as production ready. He also commented that nobody has stepped up yet to co-maintain the code, so he expects it will be a while yet.
He went on to note that he has personally used ZFS on FreeBSD in production for two years, and on his laptop for more than a year: "It just works, and it doesn't lose data. It doesn't corrupt data, and you don't have to wait for fsck."
Questions and Answers
With this, Pawel opened the floor to questions.
A: Not yet. The regression tests are being written first, then the patch will be published, then it will go into CVS.
Q: Will the new version of ZFS be able to talk to partitions created with the old version of ZFS?
A: Yes, but you will need to use a command to update the volume if you want to access the new ZFS features.
Q: How does ZFS handle bad sectors on the disk?
A: This can be handled by mirror disks or using RAID-Z. In addition, ZFS always replicates its metadata, and it's possible to configure it to also replicate data on a single disk.
Q: Does it support ACLs?
A: The new version does. In OpenSolaris they use filesystem attributes; in FreeBSD we use extended attributes. In the new version the two can be translated. It would also be possible to implement POSIX ACLs, but this isn't likely to happen, as it would make ZFS on FreeBSD incompatible with ZFS on OpenSolaris. There's also a Google Summer of Code project related to this.
Q: How does ZFS work with 64-bit architectures?
A: Another nice ZFS feature is that it has no endian dependencies. ZFS always writes in the writing architecture's native byte order, so writes are never slowed down by translation. When reading, it simply checks the order in which the data was stored and byte-swaps if necessary.
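That adaptive-endianness scheme can be sketched in Python, assuming a one-byte flag recording the writer's byte order; this is not the actual on-disk format, and the helper names are invented for illustration:

```python
# Toy adaptive endianness: write integers in the writer's native byte
# order plus a flag saying which order was used; the reader consults the
# flag and byte-swaps only when its own order differs.
import struct
import sys

NATIVE_BIG = sys.byteorder == "big"

def write_u64(value):
    # One flag byte (1 = big-endian writer) followed by the value in
    # the writer's native order -- no translation cost on the write path.
    flag = b"\x01" if NATIVE_BIG else b"\x00"
    return flag + struct.pack("=Q", value)

def read_u64(buf):
    # The reader picks the unpack order from the stored flag, so the
    # swap happens only when reader and writer orders differ.
    stored_big = buf[0] == 1
    fmt = ">Q" if stored_big else "<Q"
    return struct.unpack(fmt, buf[1:9])[0]

assert read_u64(write_u64(0x1122334455667788)) == 0x1122334455667788
```

The design choice this illustrates is paying the (possible) swap cost on read rather than on every write, which is exactly the trade-off described in the answer above.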
Q: Can you dynamically expand filesystems?
Pawel then popped up a terminal and offered a live demonstration of how it works.
Q: How much space is allocated for snapshots?
A: No space is allocated for a snapshot until you start modifying the filesystem; space is then consumed as the filesystem diverges from the snapshot.