ZFS gets deduplication – the right way

ZFS now has data deduplication – with the right configuration options for safety and performance in a compare-by-hash based storage system. From Jeff Bonwick’s ZFS deduplication blog entry:

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the ‘verify’ option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the ‘verify’ option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.
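The weak-hash-plus-verify scheme the quote describes is easy to sketch. Below is a toy, in-memory illustration (not ZFS's implementation): a weak checksum is used only as a hint, and every hash match is confirmed by a byte-for-byte comparison, so a collision can never merge two different blocks. `zlib.adler32` stands in for fletcher4 here purely for convenience.

```python
import zlib

class DedupTable:
    """Toy block store: weak hash as a dedup hint, plus a 'verify' step.

    zlib.adler32 stands in for a fast, weak checksum like fletcher4.
    On a hash match the candidate is compared byte-for-byte, so a
    collision can never cause two different blocks to be shared.
    """

    def __init__(self):
        self.blocks = {}  # weak hash -> list of stored blocks

    def write(self, block: bytes) -> bytes:
        h = zlib.adler32(block)
        for stored in self.blocks.setdefault(h, []):
            if stored == block:        # the 'verify' step
                return stored          # duplicate: share the existing copy
        self.blocks[h].append(block)   # new data (or a verified collision)
        return block

table = DedupTable()
a = table.write(b"hello world")
b = table.write(b"hello world")  # dedup hit: the existing copy comes back
assert a is b
```

Without the verify step, the correctness of the whole store would rest on the weak hash never colliding, which fletcher4 cannot promise; with it, the weak hash only affects performance.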

What I like is that (1) the user chooses the hash function based on their security and performance needs, (2) the system can optionally check for hash collisions, and (3) the ZFS storage pool design makes it easy to migrate data to a new hash function if necessary. ZFS is the first deduplicating storage system I know of with these features. (Do let me know if there are others out there!)
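For reference, these choices are all one-line property settings per dataset. A sketch, assuming a pool named tank as in the quoted example:

```shell
# Dedup with the default secure hash (SHA256), no byte comparison:
zfs set dedup=on tank

# SHA256, but also byte-compare blocks on every hash match:
zfs set dedup=verify tank

# Fast weak hash; only safe in combination with verify:
zfs set dedup=fletcher4,verify tank

# Stop deduplicating new writes (already-shared blocks stay shared):
zfs set dedup=off tank
```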

3 thoughts on “ZFS gets deduplication – the right way”

  1. Dedup in ZFS works at the level of blocks – not at the level of files – so this tweak cannot be made straightforwardly, AFAIU.

  2. Thanks, I fail at reading comprehension when I skim.

    Assuming 4 KB blocks in a 1 TB filesystem with SHA256, that’s 2^28 blocks to put in a sparse index mapping uniformly distributed 256-bit hash prefixes to 28-bit block offsets (averaging 2×28 index bits per block, or more key bytes for faster index traversal), so at least 2^31 bytes = 2 GB of space for the index. Such is the cost of such granularity.
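The commenter's back-of-the-envelope sizing can be checked directly (assumptions as stated: 1 TB of data, 4 KB blocks, roughly 2×28 index bits per entry):

```python
# Reproduce the index-sizing estimate from the comment above.
data_bytes  = 2**40          # 1 TB filesystem
block_bytes = 2**12          # 4 KB blocks
blocks      = data_bytes // block_bytes   # 2^28 blocks to index

offset_bits    = blocks.bit_length() - 1  # 28-bit block offsets
bits_per_entry = 2 * offset_bits          # ~28-bit hash prefix + 28-bit offset
index_bytes    = blocks * bits_per_entry // 8

print(index_bytes)  # 1879048192, i.e. ~1.75 GB, rounded up to "at least 2 GB"
```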

    Limiting the index to only a sample of block hashes could be more efficient, assuming there’s some metadata to go from block to extent or node so that blocks that are “near” a collided node get indexed on the fly. That metadata can be done away with by referring to extents or nodes directly. So we’re almost back to using node-level hashes, but we can index files with a sample of their block hashes, and do more fine-grained sharing in the disk image use case.
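The sampled-index idea above can be sketched as follows. This is a toy with made-up names, not any real system's design: only a deterministic sample of block hashes is indexed, each entry points back at its containing extent, and once a sampled "anchor" block matches, neighbouring blocks are compared directly instead of being looked up.

```python
import hashlib

SAMPLE_MOD = 16  # index roughly 1 in 16 block hashes (illustrative choice)

def block_hash(block: bytes) -> int:
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")

def sampled(h: int) -> bool:
    # Deterministic in the hash, so writers and readers agree on anchors.
    return h % SAMPLE_MOD == 0

class SampledIndex:
    def __init__(self):
        self.index = {}    # sampled block hash -> (extent id, block offset)
        self.extents = {}  # extent id -> list of blocks

    def add_extent(self, extent_id, blocks):
        """Store an extent, indexing only its sampled block hashes."""
        self.extents[extent_id] = blocks
        for off, blk in enumerate(blocks):
            h = block_hash(blk)
            if sampled(h):
                self.index[h] = (extent_id, off)

    def find_duplicate_run(self, blocks):
        """If an incoming block hits a sampled anchor, widen the match by
        direct comparison against the anchor's neighbours in the extent.
        Returns (extent id, incoming offset, run length) or None."""
        for off, blk in enumerate(blocks):
            hit = self.index.get(block_hash(blk))
            if hit is None:
                continue
            extent_id, s_off = hit
            stored = self.extents[extent_id]
            n = 0
            while (off + n < len(blocks) and s_off + n < len(stored)
                   and blocks[off + n] == stored[s_off + n]):
                n += 1
            if n:
                return extent_id, off, n
        return None

# Demo: find a payload whose hash happens to be sampled, so it anchors a run.
probe = next(i.to_bytes(8, "big") * 512 for i in range(10**6)
             if sampled(block_hash(i.to_bytes(8, "big") * 512)))
idx = SampledIndex()
idx.add_extent("file-A", [probe, b"x" * 4096, b"y" * 4096])
print(idx.find_duplicate_run([probe, b"x" * 4096]))  # ('file-A', 0, 2)
```

The index holds only a fraction of the block hashes, yet whole runs of duplicate blocks around an anchor are still shared, which suits the disk-image use case the comment mentions.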
