SHA-1 collision expected within a year

Bruce Schneier predicts we’ll see the first SHA-1 hash collision within a year, based on recent cryptanalytic results:

Bruce Schneier: Ever Better Cryptanalytic Results Against SHA-1

In other words, systems relying on the lack of collisions in SHA-1 (such as BitTorrent) for correct operation will start having interesting bugs in the next year, as I predicted in 2003.

I updated The code monkey’s guide to cryptographic hash functions (and moved it to my own web site). I also created a summary page of my writings on cryptographic hashes, including the most up-to-date version of the “Breakout Chart” of cryptographic hash life cycles.

[Humorous hyperbole deleted since humor + intertubes = fail.] I don’t think it made sense to write this paper, since I don’t think anyone changed their software as a result of reading it, and it didn’t have a positive effect for me personally either.

[Added so the comments don’t fill up with git-related flamewars.] Git and rsync are fine, as are any hash-based systems which only allow trusted users to add data to the system. (Trusted not to deliberately add colliding inputs, that is.) BitTorrent, Venti, CAS-based shared caches, and anything else which allows potentially malicious users to add data to the system is another story.

29 thoughts on “SHA-1 collision expected within a year”

  1. Strangely, I’ve never actually managed to have a conversation with you about this topic, despite the feeling I got way back when that your observations were indirectly … directed … at the program I was writing. And the several generations of related tools that have followed thereafter.

    I did not take issue with your general belief that “security breaks in hash algorithms occur”; that much is easy to extrapolate from the literature in cryptography. Security protocols need upgrading over time, everyone knows this, life goes on. Word-size on computers needs upgrading sometimes as well :)

    I think the only critique I had of your early paper was that it didn’t only talk about expiring cryptographic algorithms / advances in attacks. It also presented a completely incorrect characterization of CHFs, and you used this as the basis of an argument that such functions are therefore unsuitable in benign (or authenticated) contexts. From an engineering error-tolerance perspective, not just a security perspective. It is this argument that I took issue with, and I believe field use has borne out the safety of CHF naming. Nobody’s accidentally colliding them. Rsync still does fine with MD4 for block verification! It just usually uses it over an RSA-authenticated channel with a trusted party.

    I’ve never seen a satisfactory response to this critique, yet as far as I know it’s the only one I or anyone else really ever directed at your paper(s) on the topic. Attacks are a completely different kettle of fish.

  2. Every hash function has collisions. But what really matters is whether a specific program can be deliberately attacked or inadvertently break as a result of those collisions.

    Absence of known collisions is the first line of defense, but it’s not only the line of defense. The software should be able to resist the attempts to exploit known collisions by verifying the file contents or by adding random (or even not quite random) data before the hash is calculated. We know that git computes SHA-1 on bare blobs, so git doesn’t have that line of defense.

    It’s entirely possible that an existing hash function will be broken without an early warning that we have in case of SHA-1. Therefore, the software should incorporate protections against possible collisions.

    If git switches to another hash function, it should also start adding some metadata to the blobs. It could increase the storage. It’s also possible that a blob would be stored twice with different metadata. But “git gc” will be able to find it and turn the second blob into a link to the first blob.

  3. I did manage to have a conversation with Val about this topic, pointing out roughly what you pointed out above (it’s possible I might have also said stupid things that have since been erased from my memory, like skepticism about the fragility of SHA-1, although I think that Wang’s results should have cured that skepticism by the time this conversation happened; but I don’t have a chat log) and that was the last time I saw her on AIM for years. Coincidence? Dunno. Val, did I offend you during that conversation? I’ve never asked.

  4. Git blob

    Git hashes its objects, which are not bare blobs, but rather hash over a type string, a size string(?) and the bytestream, for a blob that is:

    sha1(“blob ” + filesize + “\0” + data)

    What this what you meant or does it not matter/not count as “random” data? At least if you find two colliding bytestrings A and B, it’s not certain that checking them into git will lead to a *git* collision.

  5. I agree, it would be awfully dumb to think that accidental collisions in the output of a strong 160-bit hash function are likely to be a major cause of errors. However, I have never believed this. I apologize for the monumental lack of clarity in my writing that resulted in this misperception, and would appreciate a pointer to the paragraph in question so I can correct it.

  6. I’ve had it happen to me exactly once, and there it was caught by the fact that we distribute multiple hashes (presently SHA1 and MD5, but going to be added RMD160 and SHA512 shortly) with Gentoo release media. Specifically, I had a corrupted copy of an ISO that happened to match the correct MD5, but did not match the SHA1.

  7. That is cool!!!

    I’ve heard of accidental MD5 collisions in the wild a few times, but no one ever saves the colliding inputs, they just rerun with a different hash function and call it a day. It’s certainly plausible, but I’d want verifications before I wandered around telling people about it. Any chance you saved yours?

  8. Sure. I’m sorry if this seems touchy, but I too have endured quite a lot of verbal abuse over my willingness to engineer a system around CHF naming — including a near-miss on project funding! — and the original HOTOS’03 paper presented a number of misinterpretations that made for verbal ammunition and caused me a lot of grief. It came across as pretty directed misinformation. There’s a section on the paper in the monotone manual as a result. I take issue with almost everything presented in section 4, the meat of the analysis.

    1. Section 4.1 implies that the assumption of randomness-of-input is important in the analysis of a CHF. It absolutely is not, and no such assumptions drive CHF design. The input space is infinite, and the function is defined to cascade strongly if there’s even a single bit difference. That section also presents an example program that is so totally beyond reason (storing every possible 2kb block, in order, on a disk? there’s not enough space in the universe) that it leads readers seriously off track.
    2. Section 4.2 describes the vulnerabilities of CHFs to security-research attacks and juxtaposes those vulnerabilities against the needs of “systems with different characteristics” (i.e. non-security contexts), and announces that “computer scientists should choose their algorithms based on how long they want to keep their data, period”. That is misleading. A security-research CHF break has absolutely no effect on the ability to keep, or retrieve, benign data indexed by that CHF. The section continues, saying “If anyone had built a distributed file system using compare-by-hash and MD4, it would already be unusable today”. As I said, this is incorrect: many mirror networks still use rsync and function absolutely fine atop MD4, despite it being “broken” in the cryptographic sense. Place in authenticated channel, system still functions fine.
    3. Section 4.3 carries this mixing-of-analysis on by flipping back and forth between security terminology and error-resistance terminology, making no distinction. For example, it says “To understand why silent errors are so bad, think about silent disk corruption”. This is a totally misleading analogy. The only “silent error” of even the remotest possibility is when you are under attack. And in that case all cryptosystems are “silent” when successfully attacked. That’s the point of calling the attack a success. In the non-attack scenarios, SHA-1 functions so well that we received advance warning of impending drive failures on several occasions due to the database entities being hashed with it.
    4. Section 4.4 is the most troubling of the bunch, as it presents a function that demonstrably doesn’t make a good CHF (you can collide it with 1 unit of work) and then suggests that, since it has a similar collision probability to SHA-1 when approximated by a function that does not describe its collision probability, it would be accepted as an adequate CHF. Here you’ve quite seriously reversed cause and effect: people do not argue that SHA-1 is a CHF because it has a particular analytic collision probability; you can’t observe its collision probability unless you happen to have all possible inputs and outputs kicking around. Rather, people argue that a particular collision probability is an acceptable analytic approximation of SHA-1 because they have convinced themselves that SHA-1 is a CHF, under much stronger evaluation criteria. Notably: cryptanalysis of the work costs of collision using all known techniques. Deciding whether or not a function is a CHF is a full-time job for a group of cryptographers, but they’d reject the proposed function instantly. It has collision probability of 1 by definition.
  9. To put this in the starkest terms: suppose someone published a paper that managed, by some leaps of logic, to convince a portion of the audience that you could not store data reliably in a filesystem, because if a user has a rootkit the data might be corrupted. And moreover to abandon work on filesystems as a way of storing data, as a result of the existence of rootkits. You’d find this specious, no? It’s true that a successful attack can corrupt the data. But an attack is an attack. It is totally orthogonal to the engineering considerations in non-attack scenarios.

  10. accidental md5-collision

    Indeed.

    Generating md5-collisions deliberately is not only plausible, but indeed perfectly straightforward.

    But I’m not aware of any weaknesses in md5 that would increase the chance of an *accidental* collision above the theorethical chance given the size of the hash.

    And creating a collision in a 128 bit hash accidentally sounds rather unlikely to me. I’ve heard 2-3 anecdotes too, but offcourse, conveniently, these are -never- accompanied by the colliding files/strings.

  11. This very post has one:

    […] systems relying on the lack of collisions in SHA-1 (such as BitTorrent) for correct operation will start having interesting bugs in the next year […]

    What would be more accurate is to say that, since we now have a SHA-1 collision finder that takes a year to halt, systems using SHA-1 have had their attack cost reduced by a factor of 2^11 or about two million. This will only be a problem if the attack cost now makes sense for someone economically.

    If you’re a movie studio, and you want to corrupt a bittorrent stream of your movie, you probably need to be able to find those torrents and disrupt them in about, say, an hour. If the collision finder described in the paper takes a year to run on whatever hardware they’ve got, you’d need to scale up the hardware you throw at it by a factor of about 9000 to find a collision in under an hour. This will allow you to corrupt one block, in one stream, for the subset of peers who get the block from you (and those who they pass it on to). You still have a linear scaling problem in number of blocks in the stream, and in number of streams you want to corrupt.

    Shorter: 9000 of any machine is not especially cheap. You will need some multiple of 9000 of some (probably quite beefy) reference machine to effect systemic disruption in bittorrent. If you had that kind of money and energy budget, your time would certainly be more effectively spent by automating joining torrents and sending C&Ds to the ISPs of peers you don’t like. So assuming more or less economically rational attackers, it’s not true that bittorrent will start seeing interesting bugs in the next year.

    (Even if participants were not rational, they still wouldn’t be interesting bugs, at least not for bittorrent. It’d just be a few bad blocks. Nothing a suitable rooted web of trust can’t fix.)

  12. Quite so. I believe the Greater Internet Fuckwad Theory has driven people to expect rather little in the way of quality from torrents anyways. They’re already full of corrupt data, malware, fakes inserted by record companies, etc. They have “the correct” SHA-1 but no authentication channel, so all they tell you is that you correctly — not accidentally — got duped into downloading junk.

    Colliding CHFs is only of interest to people designing authenticator systems. It’s relevant to TLS-library folks that you can construct a rogue MD5 CA. It’s not relevant to pretty much everyone else.

  13. So, I see where you are coming from and why you might not give this paper the most charitable interpretation. I’m not going to try to convince you of anything, but I will make some brief comments to clarify some points.

    In general, the conclusions you come to above certainly aren’t ones I would make or agree with. I wrote a follow-on article which might be useful to understand my position better:

    http://valerieaurora.org/monkey.html

    This is my response to the criticisms of the original paper. I don’t really see the point of writing this article either, since no one reads it at all. :) But, if you were to read it, you’d find that I’ve relaxed my stance on systems in which the users can be trusted – that is, rsync, git, monotone, etc. are fine as long as only trusted users are adding data to the system. (This caveat emphasized because because people keep proposing shared caches, distributed file systems, etc. based on rsync or git or whatever.)

    In general, the flipping back and forth between the cryptanalytic terminology/viewpoint and the systems terminology/viewpoint and comparison of “collision probability” is the attitude I was criticizing, not one I was supporting. Go read the Venti paper – it certainly wasn’t my idea. But once people have brought together the worlds of cryptanalysis and systems engineering, it’s disingenuous to complain when systems design philosophy is brought to bear on the new application of the cryptosystem. And this is the way a systems designer thinks:

    1. If it can happen, it will. Probably at the most inconvenient time, and repeatably.

    2. Users can’t be trusted. If there’s a hole, someone will try to exploit it, no matter how tiny.

    3. Whatever I build will unfortunately be in use for about a decade. Maybe longer if I’m unlucky.

    Etc. I understand that this seems paranoid and laughable to most people. But if you look at compare-by-hash from that viewpoint, you’ll see that CHFs begin to look entirely different. I think we could have a very enjoyable face-to-face discussion about mathematics, information theory, politics, and psychology as they intersect in the application of compare-by-hash.

    Now I’m going to follow some simple rules Mary Gardiner taught me about online discussions: Make your statement, reply once to clarify any misunderstandings, and then shut up. :)

  14. I noticed a pattern in the reactions of people to new cryptanalytic results and summarized it in the table entitled “Reactions to stages in the life cycle of cryptographic hash functions” on this page:

    http://valerieaurora.org/hash.html

    It’s kinda like the stages of grief. :) It’s been nice to see Bruce Schneier behaving exactly as described in the “Expert reaction” column (I wrote this table in 2007).

  15. Maybe what we need is the humor equivalent of a captcha: You have to read a sarcastic comment and then answer a question based on it. E.g.:

    Pam said, “Steve, your unparalleled insight into the interior lives of women utterly overwhelms me.”

    Question: Does Pam think Steve understands women?

    Answer: No -> Redirect to the humorous post
    Answer: Yes -> Redirect to the Linux kernel mailing list

  16. I appreciate the forthright response, apologize for the grumpy sound in things I’ve said, and look forward to a face-to-face talk about such things, someday.

    I think there might be a caveat in the rules of online discussions for “throw in a positive final note”: if it cheers you up at all to know, we didn’t just add a section to the monotone manual in response to that paper. We also added a mechanism for having flag days due to hash expiry and isolating a communication subgroup on an expired hash from the others. Despite also authenticating everything. I’m down with systems-designer paranoia :)

    (sadly it seems git hasn’t included this sort of thing. but then, it also didn’t include mandatory archival of authentication records. so it’s open to a whole raft of attacks just based on its trust model…)

  17. I think there might be a caveat in the rules of online discussions for “throw in a positive final note”: if it cheers you up at all to know, we didn’t just add a section to the monotone manual in response to that paper. We also added a mechanism for having flag days due to hash expiry and isolating a communication subgroup on an expired hash from the others. Despite also authenticating everything. I’m down with systems-designer paranoia :)

    Awwww! You made my day. :)

  18. I think you’re saying I’m in one of those columns? I’m not sure which.

    I still think the “interesting bugs” statement is not what you meant. I would be very much surprised if anyone tried to silently corrupt bittorrent streams by finding SHA1 collisions within the next year. Should the protocol switch hashes in the next year? Probably. Will anything unpleasant happen because of the old hash before then? Well, it could, but I’ll wager a Benjamin that it won’t. In fact, extrapolate that to any extant use of SHA1 and I’ll still make that bet.

    But I suppose I was attempting to correct the wrong thing; you were asking where you had said accidental collisions would be a problem, and I was trying to draw attention to a point where you suggested intentional collisions would be a problem. Communication is hard.

  19. See, that’s kind of what I thought. Was it just a coincidence that you disappeared from AIM at the end of that conversation and I didn’t see you again on AIM for years (if ever?)

  20. I don’t agree. Resistance to deliberate attack is a strict superset of the engineering considerations required for reliability in non-attack scenarios. If it’s infeasible for a deliberate attack to introduce undetected corruption into your data, it’s even more infeasible for a random disk error to do so.

  21. 2¹¹ is about two thousand, not two million.

    If I understand correctly, the problem is worse than you describe; it’s not a preimage attack but a collision attack. So you have to generate the original data being shared in the first place using the collision finder in such a way that most of it is, say, valid MP4 data. Then you upload your torrent to whatever people are using next year instead of ThePirateBay, and then after it takes off, you start propagating the invalid block, causing people to get slight glitches in playback when they watch your movie, even though the original that you uploaded played flawlessly.

    There are worse failure modes in other protocols. Git, say, might allow you to silently alter an old version of a source file to introduce a security hole, assuming you created that version of the file in the first place.

  22. Fair enough; “totally orthogonal” is an exaggeration. But you can easily consider the subset independent of your considerations of the superset. And many people do. The same way, as I said by analogy, you can design a filesystem without worrying so much about rootkits.

Comments are closed.