The code monkey’s guide to cryptographic hashes for content-based addressing

At long last, I’ve written and published the “compare-by-hash for programmers” article everyone’s always been asking for. You can read it chopped into 17 pieces and partially obscured by floating ads here:

(My editor says: Please please complain about this! No one believes me when I say this is bad!) Or you can read it in one piece, with full-size tables, etc., here:

I’m always looking for new article suggestions, especially for the Kernel Hacker’s Bookshelf (search down the page for the entry). Writing is fun!

The part of the article that Edward Tufte would be most proud of is the pair of tables about hash function life cycles:

Stages in the life cycle of cryptographic hash functions
(Reactions by group: experts, programmers, and non-experts — "slashdotters")

Initial proposal
  Experts: Skepticism, don't recommend use in practice
  Programmers: Wait to hear from the experts before adding to OpenSSL
  Non-experts: SHA-what?

Peer review
  Experts: Moderate effort to find holes and garner an easy publication
  Programmers: Used by particularly adventurous developers for specific purposes
  Non-experts: Name-drop the hash at cocktail parties to impress other geeks

General acceptance
  Experts: Top-level researchers begin serious work on finding a weakness (and international fame)
  Programmers: Even Microsoft is using the hash function now
  Non-experts: Flame anyone who suggests the function may be broken in our lifetime

Minor weakness discovered
  Experts: Massive downloads of turgid pre-prints from arXiv, calls for new hash functions
  Programmers: Start reviewing other hash functions for replacement
  Non-experts: Long semi-mathematical posts comparing the complexity of the attack to the number of protons in the universe

Serious weakness discovered
  Experts: Tension-filled CRYPTO rump sessions! A full break is considered inevitable
  Programmers: Migrate to new hash functions immediately, where necessary
  Non-experts: Point out that no actual collisions have been found

First collision found
  Experts: Uncork the champagne! Interest in the details of the construction, but no surprise
  Programmers: Gather around a co-worker's computer, comparing the colliding inputs and running the hash function on them
  Non-experts: Explain why a simple collision attack is still useless; it's really the second pre-image attack that counts

Meaningful collisions generated on home computer
  Experts: How adorable! I'm busy trying to break this new hash function, though
  Programmers: Send each other colliding X.509 certificates as pranks
  Non-experts: Tell people at parties that you always knew it would be broken

Collisions generated by hand
  Experts: Memorize as fun party trick for next faculty mixer
  Programmers: Boggle
  Non-experts: Try to remember how to do long division by hand
[Chart: Life cycles of popular cryptographic hashes (the "Breakout" chart) — one row per hash function, one column per year from 1990 through 2007, each cell keyed Unbroken, Weakened, or Broken. The SHA-2 family row is unbroken across the whole period.]
The Hash Function Lounge has an excellent list of references for the attacks on these functions.

12 thoughts on “The code monkey’s guide to cryptographic hashes for content-based addressing”

  1. 2004 was a pretty astounding year; I have some vague conspiracy theories about that not worth the electrical potential I use to think them. The real reason is that some brilliant PRC researchers found techniques that exploited common design elements of several of the top hash functions – and published them all in the same paper. Another suspicion I have is that most other researchers were not interested in bothering to fully break hash functions other than SHA-*, since little glory would accrue to them for proving what “everyone” (i.e., cryptographers) already knew. For example, I suspect Joux could have broken them too, but he was busy finding a collision in SHA-0 (also reported in 2004).

  2. So the case where only trusted users can put data in the system is usually pretty annoying. When the hash isn’t broken, they tell you that the system works because no one can generate collisions. When the hash is broken, they tell you that it doesn’t matter if you can generate collisions because the users won’t do that deliberately. To which my reply is, “Then why don’t you use a cheaper hash?” The only one that gets this right is rsync – the hashes are pretty wimpy, but exactly as good as they need to be (and much faster to calculate).

    So it’s fine that git uses SHA-1 and will continue to use SHA-1. Git uses SHA-1 as a cheap way of creating UUIDs for objects, not for any security reason. But there’s absolutely no point in using the latest greatest cryptographically secure hash if you don’t think there’s a reason to upgrade it when it’s broken. Git would also work just fine if it was using MD5, or even MD4 – and it would go lots faster.
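The point about git using SHA-1 "as a cheap way of creating UUIDs" can be made concrete: a blob's object name is just the SHA-1 of a short header plus the file contents, so the hash serves as a content-derived identifier, not a security boundary. A minimal sketch (git_blob_id is an illustrative helper name, not git's own code):

```python
import hashlib

def git_blob_id(contents: bytes) -> str:
    """Compute the SHA-1 object name git assigns to a blob:
    SHA-1 over the header b"blob <size>\\0" followed by the contents."""
    header = b"blob %d\x00" % len(contents)
    return hashlib.sha1(header + contents).hexdigest()

# Matches `echo hello | git hash-object --stdin`
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Swapping in MD5 or MD4 here would change nothing about git's data model, which is exactly the comment's point.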

  3. I guess I worry a bit about accidental collisions. I know they should be vanishingly unlikely if the objects I’m storing in git are all of similar types, but it seems like that might break down once I start putting a lot of audio and photo captures in there — for instance, if I used git to keep a complete revision history of my home directory. It’s not a case of deliberately causing collision, it’s more that I worry I might accidentally find out my backup is missing some day because that crucial .wav file hashed the same as the .png from a friend’s amateur astronomy CCD exercise. I realize the probability of a random collision is very small, but who knows when one widely-used file format and another widely-used file format will happen to fall afoul of whatever hash function was popular when the backup system was designed…
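For the accidental-collision worry, the birthday bound puts a number on "vanishingly unlikely": assuming the hash behaves like a random function over b bits, the chance of any collision among n objects is roughly 1 − exp(−n²/2^(b+1)). A quick back-of-the-envelope sketch (collision_probability is a hypothetical helper name):

```python
import math

def collision_probability(n_objects: int, hash_bits: int = 160) -> float:
    """Birthday-bound approximation for the chance of at least one
    collision among n_objects random inputs under a hash_bits-bit hash,
    assuming the hash acts like a random function."""
    return -math.expm1(-n_objects**2 / 2**(hash_bits + 1))

print(collision_probability(10**9))   # ~3.4e-31: a billion files, still negligible
print(collision_probability(2**80))   # ~0.39: collisions only get likely near 2^80 objects
```

So for random inputs the fear is unfounded at any realistic scale; the real risk, as the earlier comments note, is deliberate collision construction once the function is broken.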

  4. Oh, and exactly these sorts of problems did arise when using CRC32, of course. Otherwise I’d be completely blissful in my ignorance, rather than needlessly fearful ;)

  5. You’re preaching to the choir. Compare-by-hash is somewhat defensible when it’s used for a specific purpose by educated persons who trust each other. When it becomes invisible systems software, those qualifications go away.

    One way I think about it is that we see certain sequences of bits far more often than expected by probability alone (and others far less often).
    Some of the sequences of bits that will appear more often than by chance within a few years are sequences that have colliding SHA-1 hashes. These sequences will be replicated and stored and generally spread through computer systems because they are so interesting and unique. And then there’s someone quite reasonably using git to maintain their home directory who downloads the pair of sequences with the same SHA-1 hash. Yeah, probably git hashes in a lot of other information as well as the data, but it’s just not that hard to construct something that collides in git once you have the original vanilla collisions.

  6. ouch.

    you probably saw this already, but and friends broke ssl (also on /.) using weaknesses in md5. as noted there, combined with dns hijacking, this is basically the theoretical end of web security as we know it. if only the browser hackers had heeded your article when you published, we might be in a better position to work around this problem today.
