First major public exploitation of a compare-by-hash based system

In December, a team of hackers took the time to implement a full exploit of SSL certificates signed with a broken hash function (MD5). The paper is entitled MD5 Considered Harmful Today (a charming reference to the classic MD5 Considered Harmful Someday).

This piece of news has been in my “to-blog” queue for the better part of a month but I’ve been too depressed to write about it. For some years, I’ve tried – and failed – to convince systems programmers that they need to plan for the obsolescence of any cryptographic hash function they have incorporated into their software. Why?

Cryptographic hash functions don’t actually produce universally unique signatures/addresses/identifiers from their inputs; they simply do so with very high probability for the few years between their definition and the inevitable break of that hash function. When I wrote my first paper on the subject back in 2003, MD5 was already considered broken by cryptographers, but since none of them had bothered to find two colliding inputs, most programmers believed MD5 was still safe to use. When MD5 was completely broken the next year, I thought the debate was over, but programmer practice remained unchanged, in part because most people had already migrated to SHA-1. Then SHA-0, a close cousin to SHA-1, was completely broken. Again, nothing changed. Then SHA-1 was significantly weakened, to the point that Bruce Schneier put it this way:

SHA-1 has been broken. Not a reduced-round version. Not a simplified version. The real thing.

Still nothing. Of course, programmers are taking exactly the same attitude towards the weakening of SHA-1 that they took towards MD5 – it ain’t broken until you show me the collision, so I’m gonna keep using it – but this time we have a significant body of software using SHA-1. And now we know that even supposedly security-centered organizations continue to use known-broken cryptographic hashes right up until someone demonstrates a collision on their exact system and creates a media storm around it.

I still recommend that designers of software in which the cryptographic “address space” is shared by untrusted users plan for upgrading their cryptographic hash functions to the current state of the art every few years. I just have no expectation that this will happen.
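
Concretely, here is a minimal sketch of what that planning can look like (the name:digest address format and the tiny registry below are purely illustrative, not any particular system's scheme): record which hash function produced each digest, so that adding a stronger function later is an additive change rather than a flag-day migration.

    import hashlib

    # Illustrative registry of acceptable hash functions. Adding a stronger
    # function later is a one-line change, and old addresses stay verifiable
    # because each address records which function produced it.
    HASHES = {
        "sha256": hashlib.sha256,
    }
    CURRENT = "sha256"

    def make_address(data: bytes) -> str:
        """Return a self-describing address such as 'sha256:<hexdigest>'."""
        return CURRENT + ":" + HASHES[CURRENT](data).hexdigest()

    def verify_address(data: bytes, addr: str) -> bool:
        """Recompute the digest with whichever function the address names."""
        name, _, digest = addr.partition(":")
        return name in HASHES and HASHES[name](data).hexdigest() == digest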

I have some pretty color graphics showing the lifetime of cryptographic hash functions in an earlier post.

Finally, in a vain attempt to forestall the inevitable flame wars, I will point out that my objections do not apply to systems in which the hash address space is shared only with trusted users. In other words, hash-based source control is for the most part fine sticking with SHA-1 and could indeed use a cheaper hash like MD5 without any practical trouble. My hatred of git is based entirely on the user interface.

Update:

I should have read my Google News Alerts before posting this. Apparently a new document archive system from U Washington was just announced that does take into account migration to new hash functions. From the New York Times article by John Markoff:

The University of Washington researchers now use a modern hash algorithm called SHA-2, but they have designed the system so that it can be easily replaced with a more advanced algorithm.

But I can’t find any primary sources and the article has a technical error elsewhere: “At the heart of the system is an algorithm that is used to compute a 128-character number known as a cryptographic hash from the digital information in a particular document.” Probably a garbled reference to 128-bit MD5 checksums. Nonetheless, encouraging news.

22 thoughts on “First major public exploitation of a compare-by-hash based system”

  1. I agree with nearly everything you’ve said. Thanks for writing this up, it’s quite nice.

    However, I’m curious: how is hash-based source control fine? What happens if you have a second preimage attack that works against MD5?

  2. What’s the attack you’re worried about? I’ve no real experience of svn, but git checks for collisions with an existing object and will abort the commit. In the worst case that sounds like a denial of service, but if you have an untrusted user who can repeatedly make commits to your source repository then I think you have other problems.

  3. OK, this is orthogonal to your blog post (and my cryptographic idiocy is showing), but is there a hash function that you recommend right now for new projects? MD5 and SHA-1 have been floating around the programmer mindset for a long time, and I know virtually nothing about SHA-2. Skein looks like an interesting (and fast) new hash from Schneier, but if there’s an industry standard that’s generally accepted over SHA-1, I don’t know what it is.

  4. Hm. So the attacker modifies an object in the repository in such a way that it doesn’t appear as a new commit, but also doesn’t trigger errors when clients discover that the repository’s history has changed? Git doesn’t actually test that the contents of an object match the hash, so you can already just replace an object with an arbitrary one without anything complaining.

  5. compare-by-hash

    I disagree with your claim that the breaking of any and all hashes is INEVITABLE — I consider it possible that we’ll some day develop hashes which are provably hard. (like today we consider factorisation of large numbers *likely* hard, but we haven’t yet got a proof that the problem is *definitely* hard.)

    I do consider it *likely* that most or all commonly used hashes today will be broken; we ain’t got anything even approaching a proof of their hardness, and so it’d be just plain luck if any of them do turn out to be genuinely hard. A good first step would be if we could manage to produce a proof that one-way functions exist.

    I think their existence would imply that P != NP, and we’ve spent a lot of effort without much progress in proving that, so perhaps I’m optimistic. Even though we’ve thus far been unable to prove it, it does seem to be the consensus that very likely P != NP.

  6. In my experience, any system using compare-by-hash for content addressable storage will throw in “cryptographic integrity checking” as a feature, even when it’s useless, unused, or unusable. If someone manages to put an “evil” commit into Linus’ git repo, we’ll notice through some agency other than the checksum.

    Indeed, there was a case a few years back when someone added a reasonably clever commit to the CVS tree automatically created from the BitKeeper original:

    http://lkml.indiana.edu/hypermail/linux/kernel/0311.0/0621.html

    The conversion scripts found the problem.

  7. I don’t know exactly what git does and I’m not going to read the source code to find out. I will comment on the two properties Matthew lists. First, checking if a new commit collides with other commits in the same local repo is reasonable and nice, but such a system won’t detect collisions between objects in separate clones of the repo. Second, checking that every commit’s associated hash matches the content of the commit on every operation would have noticeable computational cost.
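
    Roughly the kind of local check I mean, as a toy sketch (this is not how git implements anything, just an illustration of compare-by-hash with a collision check on insert):

        import hashlib

        class ToyStore:
            """Toy compare-by-hash store: on a hash hit, compare full contents."""
            def __init__(self):
                self.objects = {}  # hexdigest -> content

            def put(self, content: bytes) -> str:
                key = hashlib.sha1(content).hexdigest()
                existing = self.objects.get(key)
                if existing is not None and existing != content:
                    # Two different inputs mapped to the same key: a collision
                    # within this store. Separate clones can never see this.
                    raise ValueError("hash collision on " + key)
                self.objects[key] = content
                return key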

  8. One of the SHA-2 family should be used at this point, probably SHA-256. The key is to view both your choice of hash function and its output as ephemeral – cryptographically strong for the next few years but at some point in the near future easily forgeable by bored hackers with a laptop.

  9. Right, the collision-by-commit problem should only happen if you’re trying to merge something from a third party (or you’re very unlucky) – and if they make a habit of it then you probably have a good excuse for distrusting them in future. I don’t think there’s a significant problem with git’s security model – even if you did subvert the repository, it’s likely that somebody with an existing checkout would come up with an obscure merge error before too long. I think we’ve always assumed that anyone with direct filesystem access to your repository is probably going to win, and while hashes could be used to reduce that probability, the fact that they currently aren’t means that using broken hashes isn’t a real problem.

  10. Re: compare-by-hash

    If we do prove that one-way functions exist, I’ll be too busy trying to survive the accompanying singularity to worry about anything else. :) (See “Antibodies” and the Laundry series by Charles Stross for his fun take on the implications of proving P = NP. P != NP may be less interesting but still have radical effects on society.)

  11. Re: compare-by-hash

    Now you’ve gotten me curious.

    It’s my impression that today, the general consensus is that very likely P != NP.

    What huge consequences would it have for society if we managed to prove what we’re already suspecting? I do get that it’d be pretty huge for mathematics.

    My disagreement is on the theoretical level anyway; I do agree that when one currently designs a system that depends on the cryptographic strength of a hash, one should assume from the get-go that that hash will very likely be broken at a time when your system is still in use. (You may get lucky and avoid that, but if so, that’s just plain luck.)

  12. Re: How about Ripemd-160?

    RIPEMD-160 just hasn’t been studied much. It may be a strong hash function, but not many people have tried hard to break it so its strength is not well tested.

  13. Re: compare-by-hash

    Any function that outputs a string of symbols shorter than the input will have collisions, because of the pigeon-hole principle. An example of the pigeon-hole principle is you having 27 Christmas cards to put in a set of A-Z pigeon-holes, with the obvious outcome that at least one pigeon-hole has more than one card in it. Consequently collisions exist and all hashes will be broken.

    K3n.

  14. Re: compare-by-hash

    The issue isn’t that you can have collisions. The issue is whether you can generate collisions in less time than it would take you to brute force the problem. The strength of existing hashes lies in the fact that attacking a specific object is computationally expensive – once that’s no longer true, they’re no longer strong hashes. There’s a difference between a hash being broken because its information content is small enough to allow brute force attacks and a hash being broken because there’s a fundamental weakness that allows you to infer information about the hash from the original content (or vice versa).

    Barring unexpected developments in computational theory, there will be unbroken hashes as long as you keep adding more bits to the hash.
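
    To put rough numbers on the generic attacks (back-of-the-envelope only, ignoring any structural weakness in a specific function): a brute-force preimage costs about 2^n hash evaluations for an n-bit digest, and a birthday-style collision about 2^(n/2).

        # Generic attack costs for an n-bit digest, ignoring structural breaks.
        for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
            print(f"{name}: preimage ~2**{bits}, birthday collision ~2**{bits // 2}")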

  15. Re: Tiger

    Tiger seems promising but hasn’t had enough exposure that I’d be comfortable depending on it.

    Right now, you should use SHA-256 for a signature that will be used for several years and is susceptible to external attack. If you don’t need something that strong, you should probably be using MD5. At present, SHA-1 is not the right answer for most software in my opinion. Either you need to protect against attacks over the next several years, and you should upgrade to SHA-256, or you don’t have to worry about attacks, and you should use a cheaper, faster hash like MD5.

  16. Re: Tiger

    There are (near-)collision attacks against reduced-round versions of Tiger [1] & [2].

    If a high level of confidence is needed, providing signatures with two or even more hash functions is also a way to reduce possible attacks, since all of the hash functions used would need to be attacked/broken at the same time.
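
    A rough illustration of the combined approach (SHA-256 and SHA-512 are used here only because both ship with Python's hashlib; the particular pair is an arbitrary choice): a forgery would have to defeat both functions on the same input at the same time.

        import hashlib

        def combined_digest(data: bytes) -> str:
            """Concatenate digests from two independent hash functions."""
            return (hashlib.sha256(data).hexdigest() + ":" +
                    hashlib.sha512(data).hexdigest())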

    Adrian

    [1] – http://th.informatik.uni-mannheim.de/people/lucks/papers/Tiger_FSE_v10.pdf
    [2] – http://www.springerlink.com/content/u762587644802p38/
