ZFS now has data deduplication – with the right configuration options for safety and performance in a compare-by-hash storage system. From Jeff Bonwick’s ZFS deduplication blog entry:
Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the ‘verify’ option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank
The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the ‘verify’ option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.
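The mechanism Bonwick describes can be sketched in a few lines: a fast, weak checksum nominates candidate duplicates, and an optional byte-for-byte comparison settles whether a match is real. This is a hypothetical Python illustration (using adler32 as a stand-in for fletcher4), not the actual ZFS implementation:

```python
import zlib

class DedupStore:
    """Toy compare-by-hash block store with an optional 'verify' mode."""

    def __init__(self, verify=True):
        self.verify = verify
        self.blocks = {}  # weak hash -> list of stored blocks

    def weak_hash(self, block):
        # Stand-in for fletcher4: fast, but not collision-resistant,
        # so a hash match alone does not prove the blocks are equal.
        return zlib.adler32(block)

    def write(self, block):
        h = self.weak_hash(block)
        for stored in self.blocks.setdefault(h, []):
            # Without verify, we trust the hash; with verify, we
            # compare the actual bytes to resolve collisions.
            if not self.verify or stored == block:
                return stored  # deduplicated: reuse the existing block
        self.blocks[h].append(block)
        return block  # new data: store a fresh copy
```

With `verify=True`, a collision between two different blocks costs one extra read and compare but never corrupts data; with `verify=False` and a weak hash, a collision would silently alias two different blocks, which is why ZFS only permits fletcher4 for dedup together with ‘verify’.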
What I like is (1) the user chooses the hash function based on their security and performance needs, (2) the system can optionally check for hash collisions, and (3) the ZFS storage pool design makes it easy to migrate data to a new hash function if necessary. ZFS is the first deduplicating storage system I know of with these features. (Do let me know if there are others out there!)