[HN Gopher] SHA-1 'fully and practically broken' by new collisio...
       ___________________________________________________________________
        
       SHA-1 'fully and practically broken' by new collision (2020)
        
       Author : ofou
       Score  : 140 points
       Date   : 2021-10-12 19:41 UTC (3 hours ago)
        
 (HTM) web link (duo.com)
 (TXT) w3m dump (duo.com)
        
       | dang wrote:
       | Some past threads:
       | 
       |  _SHA-1 collisions now cost $45k [pdf]_ -
       | https://news.ycombinator.com/item?id=23350223 - May 2020 (62
       | comments)
       | 
       |  _The first chosen-prefix collision for SHA-1_ -
       | https://news.ycombinator.com/item?id=21979333 - Jan 2020 (352
       | comments)
       | 
       |  _Abusing SHA-1 collisions for Chromium updates_ -
       | https://news.ycombinator.com/item?id=20114809 - June 2019 (36
       | comments)
       | 
       |  _A SHA-1 chosen-prefix collision attack_ -
       | https://news.ycombinator.com/item?id=19907127 - May 2019 (71
       | comments)
       | 
       |  _From Collisions to Chosen-Prefix Collisions Application to Full
       | SHA-1 [pdf]_ - https://news.ycombinator.com/item?id=19878917 -
       | May 2019 (18 comments)
       | 
       |  _SHA-1 Collision Detection on GitHub.com_ -
       | https://news.ycombinator.com/item?id=13917990 - March 2017 (90
       | comments)
       | 
        |  _Linus' reply on Git and SHA-1 collision_ -
       | https://news.ycombinator.com/item?id=13719368 - Feb 2017 (262
       | comments)
       | 
       |  _When Will We See Collisions for SHA-1? (2012)_ -
       | https://news.ycombinator.com/item?id=13719079 - Feb 2017 (1
       | comment)
       | 
       |  _Announcing the first SHA-1 collision_ -
       | https://news.ycombinator.com/item?id=13713480 - Feb 2017 (485
       | comments)
       | 
       |  _How would Git handle a SHA-1 collision on a blob?_ -
       | https://news.ycombinator.com/item?id=13547348 - Feb 2017 (5
       | comments)
       | 
       |  _Why it's harder to forge a SHA-1 certificate than to find a
       | SHA-1 collision_ - https://news.ycombinator.com/item?id=10778773
       | - Dec 2015 (43 comments)
       | 
       |  _The Cost of Creating Collisions Using SHA-1_ -
       | https://news.ycombinator.com/item?id=8629906 - Nov 2014 (9
       | comments)
       | 
       |  _When Will We See Collisions for SHA-1?_ -
       | https://news.ycombinator.com/item?id=4618069 - Oct 2012 (51
       | comments)
        
       | cletus wrote:
       | Git was created 16 years ago. The impending breakage of SHA-1 was
       | known even at that time, just like how MD5 had been broken before
       | it.
       | 
       | I'm honestly still shocked that updating the hashing algorithm
       | wasn't built into Git from day one. I really wonder why. Did
       | people think this wouldn't happen? Were they so in love with the
       | performance of C/C++ being able to pass around 20 byte hashes on
       | the stack without worrying about a more complicated structure (eg
       | a collection of variable length hashes)?
       | 
       | It just seems like such a massive and foreseeable oversight
       | that's going to cause pain for some time to come for really no
       | reason at all.
        
         | tantalor wrote:
         | > going to cause pain
         | 
         | What kind of attack on a git repo are you worried about?
        
           | snovv_crash wrote:
           | Subrepo hash substitution could enable a supply-chain attack
        
         | dec0dedab0de wrote:
         | Linus posted about it on google plus in 2017. I haven't re-read
         | it yet, but I remember one of the ideas floating around hn at
         | the time was to just have two hashes per commit. That is, two
         | insecure hashes may be secure together for git's purposes. Even
         | though we can generate collisions for md5, and sha1, it would
         | be much more difficult to have a file generate an arbitrary
         | collision for both at the same time.
         | 
         | Here is a link to Linus's post on the internet archive
         | 
         | https://web.archive.org/web/20170717192607/https://plus.goog...
         | 
         | And here are some hn posts around the same time
         | 
         | https://news.ycombinator.com/item?id=13733481
         | 
         | https://news.ycombinator.com/item?id=13719368
        
           | bawolff wrote:
           | > That is, two insecure hashes may be secure together for
           | git's purposes.
           | 
           | Generally that's not true for Merkle-Damgard (e.g. sha1,
           | sha2) hashes - afaik the difficulty of finding a collision in
           | the more difficult hash is the same (up to big-oh) as finding
           | it in both hashes.
        
         | umvi wrote:
         | "We note that classical collisions and chosen-prefix collisions
         | do not threaten all usages of SHA-1."
        
         | ori_b wrote:
         | For someone to be able to break your repo using sha1
         | collisions, they need to be able to commit to it.
         | 
         | If you don't trust someone, don't let them commit to your repo.
         | 
         | > _Were they so in love with the performance of C /C++ being
         | able to pass around 20 byte hashes on the stack without
         | worrying about a more complicated structure (eg a collection of
         | variable length hashes)?_
         | 
         | The hashes show up everywhere. They're how every object in git
         | is identified. They make it onto disk. They're not just commit
         | ids.
         | 
         | Changing hashes changes the disk format.
        
           | lamontcg wrote:
           | Yeah I don't think any of these proposed attacks work without
            | write access to the repo, at which point the game is pretty
            | much over already.
        
         | [deleted]
        
         | alerighi wrote:
          | SHA-1 will still work fine for the purpose of git. It is just
          | no longer considered secure for cryptographic operations, such
          | as digital signatures; that doesn't mean that you can't use it
          | for other purposes, like git does. Using it is still fine and
          | will always be fine.
          | 
          | Making the hashing algorithm exchangeable would have introduced
          | complexity in a software that is already complex, and made it
          | less efficient (one of the reasons git was created was speed
          | for large projects like the Linux kernel), for no real purpose.
          | If you want to change the algorithm, given that you will break
          | compatibility with all existing repositories, tools, and
          | clients, you might as well fork git, because you are changing
          | too much.
          | 
          | I don't see the point of migrating to SHA-256. Collisions are
          | still very unlikely to occur accidentally, and sure, if you
          | want to damage a repository you can create one on purpose, but
          | you could just as well commit some hooks that contain malware,
          | or alter the git history in any way you want, so what's the
          | point?
        
           | simias wrote:
           | Honestly I find these rationalizations around the use of
           | SHA-1 annoying and counter-productive. The rule is simple:
           | don't use SHA-1. If you already use SHA-1 migrate away from
            | it. You know that plenty of software out there interfaces
            | with git expecting the commit hash to be unique. Is it a
            | security risk? Maybe, maybe not. I don't care
           | to find out.
           | 
           | It doesn't matter until it starts mattering. If the Git devs
           | had done the right thing over a decade ago we wouldn't be
           | having this discussion. The longer they wait the more painful
           | the migration will be.
           | 
           | SHA-2 was published in 2001, git was released in 2005 and now
           | we're in 2021 and we're having this discussion again. The
           | first concrete attack was released in "early 2005" according
           | to wikipedia, so there's really no excuse.
           | 
           | Just do it, make a major change where you replace SHA-1 with
           | SHA-256 and call it a day. It's going to be painful for a few
           | months and then we'll move on.
           | 
           | For me these discussions demonstrate the immaturity of
           | software engineering. In other industries regulators would've
           | banned the use of SHA-1 and you couldn't get certified if you
           | used it.
           | 
           | Do electronic engineers regularly try to argue "well ok RoHS
           | says we can't have lead solder in this product but frankly
           | for this one it's fine the casing is completely waterproof
           | and there are no risks for the customer"? No, they don't. If
           | the spec says no lead, then either it's no lead or you can't
           | sell your product. End of story.
           | 
           | SHA-1 is the lead solder of software engineering. Only
           | acceptable for military use.
        
             | umvi wrote:
              | > You know that plenty of software out there interfaces
              | with git expecting the commit hash to be unique
             | 
             | You also know that plenty of software out there that
             | interfaces with git has hardcoded assumptions (like, for
             | example, the assumption that the commit hash will be
             | exactly 40 characters long). Some tools parse the output of
             | git log and other commit-bearing commands to make
             | decisions. Will changing git to SHA-256 create new
             | unforeseen security risks due to breakage of those tools
             | (for example, by only grabbing the first 40 characters of a
             | SHA-256 digest instead of all 64 or by just outright
             | crashing)? Maybe, maybe not.
             | 
              | IMO you would create _more_ security risks with the git
              | integration breakage that would accompany migrating to
              | sha256 than by staying with sha1.
             | 
             | At this point it's almost like you want a new tool/new
             | command. `git` vs. `git2`. New projects use git2, existing
             | projects use git (or something like that). Otherwise
             | confusion and backwards-compatibility breakage will abound.
        
               | musicale wrote:
               | > New projects use git2, existing projects use git (or
               | something like that). Otherwise confusion and backwards-
               | compatibility breakage will abound.
               | 
               | Like uh, python2 and python3? ;-p
        
           | nybble41 wrote:
           | > It is just no longer considered secure for cryptographic
            | operations, such as digital signatures; that doesn't mean
            | that you can't use it for other purposes, like git does.
           | 
           | Git effectively uses its hashes as a component of a digital
           | signature scheme, in the form of signed tags and signed
           | commits. The GPG signature covers the commit object, which
           | identifies the content (tree) by its hash.
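            | 
            | For illustration, a signed commit object looks roughly like
            | this (the output of `git cat-file commit HEAD`; hashes
            | abbreviated, names made up):
            | 
            |     tree 9c8e3f...      <- SHA-1 of the content being signed
            |     parent 1a2b3c...
            |     author A U Thor <a@example.org> 1578400000 +0000
            |     committer A U Thor <a@example.org> 1578400000 +0000
            |     gpgsig -----BEGIN PGP SIGNATURE-----
            |      ...
            |      -----END PGP SIGNATURE-----
            | 
            |     Commit message
            | 
            | The signature only pins the tree and parents by their SHA-1,
            | so a colliding object would inherit the signature.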
        
           | madars wrote:
           | Collisions definitely do matter for git security: many people
            | pin explicit git hashes for their dependencies, and thus they
            | can be tricked into running malicious forks. This requires
            | placing a chosen commit in the git repo (so unlike a second-
            | preimage break it does not mean that you could attack repos
            | you have no control over), but that's not an unrealistic
           | threat model overall.
        
             | y4mi wrote:
              | What is the threat model though?
             | 
             | I don't think it's possible to create a collision that's
              | also executable code which adds a security hole or
             | anything.
             | 
             | So what exactly would they achieve with the collision?
             | 
             | And how do they push these gigantic files that have the
             | hash collisions to a server? The upload time would be
             | significant.
        
               | noway421 wrote:
               | The possible attack is to prepare 2 versions of a commit,
               | both resulting in the same commit id. Then later on,
               | after the project is successful/etc, swap out the commit
               | with the second version, while keeping the other commits
               | intact.
               | 
               | Granted, the file that the commit touches would need to
                | be left untouched by other commits. That's not out of the
                | question in a typical software project - maybe a file in
               | the utils folder which is only written once and never
               | changed?
               | 
               | > I don't think it's possible to create a collision
                | that's also executable code
               | 
               | You can include an unreadable binary blob in the commit.
               | Tweak the blob to find the collision while keeping the
                | code the way the attack requires.
        
               | [deleted]
        
               | DSingularity wrote:
               | That one is nasty.
        
               | OJFord wrote:
               | A denial of service of sorts? (Something broken and
               | unusable is delivered instead, as distinct from something
               | usable but maliciously so.)
               | 
               | I agree that the chances of ever getting a second pre-
               | image that not only makes sense, but does so in some
               | malicious way may as well be zero, surely?
        
               | lvh wrote:
               | The part about the 2nd pre-image chances being
               | effectively zero may be true, but the nasty cases
               | described upthread don't need a 2nd pre-image. (You could
               | do a lot worse with one, granted!)
        
               | [deleted]
        
               | madars wrote:
               | The files don't need to be gigantic. You could, for
               | example, have a binary config file which in one colliding
               | version encodes a potentially dangerous debugging
               | setting, e.g., "allow_unauthenticated_rpc = false" but in
                | the other has it set to "true".
        
               | lvh wrote:
               | 1) People systematically underestimate the possibility of
               | creating collisions that still do something
               | "interesting", like being polyglots (files that can be
               | interpreted in multiple formats, executable or
               | otherwise). See PoC||GTFO, specifically anything by Ange
               | Albertini, for examples; grep
               | https://github.com/angea/pocorgtfo/blob/master/README.md
               | for "MD5". I specifically recommend this writeup: https:/
               | /github.com/angea/pocorgtfo/blob/master/writeups/19/R...
               | .
               | 
               | 1bis) You can use an existing collision to create new
               | collisions. People seem to think you need to generate all
               | the work again from scratch; this is not true. See
               | PoC||GTFO for proof by example.
               | 
               | 1cis) The files do not need to be gigantic. See PoC||GTFO
               | for proof by example.
               | 
               | 2) You can do the collision in advance, and publish the
               | malicious version later. What it accomplishes is that the
               | concept of "this Git hash unambiguously specifies a
               | revision" no longer works, and one of them can be
               | malicious.
               | 
               | 3) The standard should be "obviously safe beyond a
               | reasonable doubt", not "not obviously unsafe to a non-
               | expert". By the latter standard, pretty much any random
               | encryption construction is fine. (The examples I gave use
               | MD5, not SHA-1, but that's a matter of degrees.)
               | 
               | 4) SHA-256 was published years before git first was.
        
               | whoisburbansky wrote:
               | What do you mean by the `bis` and `cis` suffixes to your
               | entry labels?
        
               | lvh wrote:
               | It's just a subdivision; it might as well have said 1a,
               | 1b... -- but "bis" and "cis/tris" (and possibly tetrakis)
               | tend to emphasize that they're addenda, not equal points.
        
               | whoisburbansky wrote:
               | Ah, gotcha. I thought I recognized the prefixes from
               | Organic, but I don't think I've seen it used like this
               | here. Neat!
        
               | [deleted]
        
               | occamrazor wrote:
               | In addition to the attack described in a sibling comment,
               | when a hashing algorithm has been broken in some way, it
               | is safe to assume that other more advanced collision
                | attacks will soon be discovered.
        
           | cogman10 wrote:
           | And, to that point, I'm not really convinced that a
           | cryptographic hashing algorithm is really a great choice for
           | git.
           | 
           | It is nice that it checks off the boxes for even distribution
           | of hashes, but there's a bunch of other hashing algorithms
           | that can do that without the performance penalty inherent in
           | crypto hashes. For example, FNV seems like a good fit for
           | something like git.
        
             | simias wrote:
             | Is hashing a significant bottleneck in any git deployment?
             | I'd expect that the diffing would be vastly more expensive
             | for instance.
             | 
              | Besides, don't many modern CPUs support things like SHA-256
             | in hardware?
        
             | TonyTrapp wrote:
             | FNV is really cool and has a reasonable quality given its
             | simplicity. But it does have issues (sticky state, and I
             | think the avalanche characteristics also weren't great)
             | that are solved by only slightly more complex hashing
             | algorithms.
        
           | lvh wrote:
           | > SHA-1 will still work fine for the purpose of git. It is
           | just no longer considered secure for cryptographic
            | operations, such as digital signatures; that doesn't mean
            | that you can't use it for other purposes, like git does.
            | Using it is still fine and will always be fine.
           | 
           | This suggests git does not rely on its hash for security
           | properties, which seems false? What is the purpose of pinning
           | revisions or signing git tags?
        
         | elfchief wrote:
         | Git was not intended (AFAIK) to be cryptographically secure.
         | Being unsuitable for crypto != being unsuitable for other uses.
        
           | wcoenen wrote:
           | Surely signed tags and signed commits in git are supposed to
           | be cryptographically secure? [0]
           | 
           | Doesn't the security of those signatures depend on the
           | security of the SHA-1 hashes that are being signed?
           | 
           | [0] https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-
           | Work
        
             | danachow wrote:
             | No - after all, the most common use of digital signatures
              | is to sign documents that can be easily tampered with. All
              | the security is in the signature, not the content being
              | signed.
        
               | lvh wrote:
               | This makes no sense. Sure: you can generate an X.509
               | certificate that says whatever you want, but the point is
               | that you can validate the signature and see that it's a
               | forgery. In the case of a hash-addressed system like git,
               | the problem is that the signature is over a collision, so
               | it no longer certifies the thing its supposed to certify.
               | Git uses the hash as a shorthand for a revision,
               | including its entire history--so yes, it is using the
               | hash that way.
               | 
               | By that logic, would MD5 be fine? MD4? CRC32?
        
               | wongarsu wrote:
               | A digital signature usually signs (~encrypts) a hash of
               | the content. So asking what exactly signing a commit or
               | tag entails is a very valid question. I would expect that
               | signing a tag is only as strong as SHA-1, since a tag is
               | essentially a label for a commit hash. For signing
               | commits I have no clue, but would be quite interested as
               | well.
        
         | formerly_proven wrote:
         | Git is designed around content-addressed storage and while you
         | _can_ cater for hash migrations, it gets pretty messy design-
          | wise. Git's core data structures were also designed really
         | quickly to patch a very urgent need of the kernel hackers. I
          | doubt it has anything to do with being in love with passing 20
          | byte strings on the stack. The fine git folks have produced a
          | rather detailed document about the hash migration (https://git-
          | scm.com/docs/hash-function-transition/) and it is not
          | particularly simple to do this.
        
       | john_alan wrote:
       | > Our work show that SHA-1 is now fully and practically broken
       | for use in digital signatures
       | 
       | Yet 2nd preimage resistance is intact. Zzz.
        
       | r0f1 wrote:
       | Obligatory link: https://shattered.io/
        
         | nikeee wrote:
         | The article states that the collision found is something
          | different from the one found by Google et al.
         | 
         | It's a chosen-prefix attack.
        
         | TravisHusky wrote:
         | Of course it has its own website ;)
        
         | fanf2 wrote:
         | That was an earlier break; this article is about https://sha-
         | mbles.github.io/
        
       | ofou wrote:
       | For the ones interested, you can use SHA-256 today. Warning:
        | Still in experimental mode.
        | 
        |     # .gitconfig file
        |     [extensions]
        |         objectFormat = sha256
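        | 
        | If I remember correctly, the object format has to be picked
        | when the repository is created (git 2.29 or newer), e.g.:
        | 
        |     git init --object-format=sha256 my-repo
        | 
        | As far as I know such a repository can't yet interoperate with
        | SHA-1 remotes, so it's really only for experimenting.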
        
       | musicale wrote:
       | Old collision (2020). ;-)
        
       | bjarneh wrote:
       | > The technique that the researchers developed is quite complex
       | and required two months of computations on 900 individual GPUs.
        
         | motohagiography wrote:
          | I only half joke when I ask: when you break a cryptographic
         | hash, does it mean that now it's just a really amazing
         | compression algorithm, but with a very heavy compute
         | requirement? The non-joking half is speculating what data
         | compression in a post-quantum compute world looks like.
        
           | fwip wrote:
           | No, because reversing the hash has infinite possible answers
           | (well, "very many" for bounded input size).
           | 
           | You can't decompress 32 bytes into 1GB because you don't know
           | which of the 1GB-sized answers is the intended one.
        
             | ganzuul wrote:
             | Natural data is easily identifiable.
        
               | dTal wrote:
                | No it's not. You can't distinguish the complete works of
               | Shakespeare from the complete works of Shakespeare with
               | the words changed a bit, or switching to Harry Potter for
               | a bit in the middle, or a completely dazzling and
               | original work of fiction. Any hash might generate all
               | three of these, and essentially boundless other works.
               | It's a library of Babel.
        
               | ganzuul wrote:
               | Your examples really fall out of the scope of the
               | premise.
        
               | fwip wrote:
               | No, they're 100% correct.
        
               | contravariant wrote:
               | Only if you count all natural looking data of which there
               | is way more than you could possibly imagine.
               | 
               | It is of course likely that humanity hasn't yet created
               | more than 2^100 (~10^30) files, so in _theory_ given a
               | registry of all files in existence you might be able to
                | identify it by its hash. However, while this is simple
                | it's definitely not easy.
        
               | ganzuul wrote:
               | I can imagine the output of program space. That's pretty
               | big. All possible universes in fact.
               | 
               | It's an issue of probability and bins. While natural data
               | is infinite, there is vastly more unnatural data. At some
               | point you have enough metadata (e.g. natural vs. random)
               | to know that the original data came from Earth to pick
               | out the right needle from the needle stack.
               | 
               | Unless the data is from a completely alien source, we are
               | close enough to that source for our heuristics to be of a
               | manageable size.
        
               | contravariant wrote:
                | It's somewhat irrelevant how much unnatural data there
                | is. The point is that there's way more plausible data (or
                | sentences, if we're restricting ourselves to natural
                | language) than there are possible 160-, 256- or 512-bit
               | hashes.
               | 
               | Unless of course you enumerate all of human
               | communication, but like I said that doesn't count as
               | 'easy'.
        
               | fwip wrote:
               | Nope, not on these scales.
               | 
                | Say you have a 32 byte hash, that's 2^(32*8) possible
                | values, right? Now say your input is 128 bytes, that
                | means that (on average) each 32 byte hash maps to
                | 2^(96*8) values.
               | 
               | "Natural looking" data will occur an astronomical number
               | of times in a search space that large - which is, again,
               | only 96 bytes larger than your hash.
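                | 
                | Spelled out (a trivial Python check of the same
                | counting argument):
                | 
                |     inputs = 2 ** (128 * 8)   # 128-byte messages
                |     outputs = 2 ** (32 * 8)   # 32-byte hashes
                |     print(inputs // outputs == 2 ** (96 * 8))  # True
                |     print(len(str(2 ** (96 * 8))))  # 232 digits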
        
               | __s wrote:
               | That's because natural data has low entropy
               | 
               | But let's say every paragraph only offers 1 bit of
               | entropy. Then a 160 bit hash gives you fuzzy accuracy up
               | to 160 paragraphs. After that you'll have to extend the
               | hash with hints to guide which sequence of paragraphs
               | you're looking for, & hints for where the typos are
               | 
               | ofc, 100x compression of English text doesn't require
               | this amount of compute to decompress:
               | https://en.wikipedia.org/wiki/Hutter_Prize
               | 
               | It's also impractical since most compression use cases
               | want to put the work in compression, & have decompression
               | be straight forward
               | 
               | edit: 100x was misreading, it's currently 8.6x
               | http://prize.hutter1.net/#prev
        
               | ganzuul wrote:
               | This is very abstract, but I believe that since both
               | input and output are very close in program space compared
               | to the total size of PS, the heuristics to map between
               | them are going to be of a manageable size. This is based
               | on an intuition about a single source for all activity in
               | this universe.
        
               | AnimalMuppet wrote:
               | "Paragraph" is highly optimistic. IIRC, each English
               | _character_ has about 1 bit of entropy.
        
               | __s wrote:
               | Shannon estimated 0.6 to 1.3:
               | https://cs.fit.edu/~mmahoney/dissertation/entropy1.html
               | 
               | For the sake of argument I figured I'd be highly
               | optimistic. The linked prize shows practical evidence of
               | 1GB of Wikipedia being compressed below 1 bit per
                | character _(& that's with a limit of 50 hours cpu time,
               | 10GB memory, & 100GB disk)_
        
             | motohagiography wrote:
             | This is what made it a funny thought experiement to me. I
             | was doing some space related stuff a while back and when
             | you're dealing with something as small as 100mb, and you
             | add the constraint of up to a light-year or more of
             | latency, which makes error correction immensely costly,
             | having a quantum processor reciever find the collision of a
             | unique hash could be faster than transmitting the data with
             | errors.
             | 
             | There's probably a distance where the latency in light
             | years is longer than it takes for a quantum processor with
             | some shortcuts to exhaust the search space for the
             | collision on the hash - and potential hashing algorithms
             | that contain information and hints about a hyperplane that
             | generates the original data over the field of
              | possibilities. To a quantum computer with sufficient qubits
             | at a long enough latency distance away, the 32-bit hash of
             | a 2048bit RSA key may "arrive" faster than transmitting the
             | whole key.
             | 
             | It feels like a fundamental misunderstanding of something
             | in communication theory, but this was the thinking that
             | prompted it.
        
               | mypalmike wrote:
               | Using the pigeonhole principle, consider how many 100
               | megabit values must, on average, hash to the same 32 bit
               | value. The answer is a huge number (I could be wrong but
               | it seems to me it might be 2 ^ (100,000,000 - 32)?). This
               | is a bit of a paradox in cryptographic hashing, where
               | finding a collision is difficult, but the number of
               | collisions is enormous.
        
             | akomtu wrote:
             | Not so simple. A hash corresponds to infinitely many
             | messages, but how many of them are in ASCII charset under
             | 1KB long? It may happen that each hash has a unique message
             | within these constraints.
        
               | hyperpape wrote:
               | Ignoring punctuation, capitalization and spaces, you have
               | 
               | (26 ^ 1024) / (2 ^ 160) ascii texts, which is too large
               | for python to do the division.
               | 
               | Thinking harder, that's:
               | 
               | >>> (13 * (160)) * (26 ** (1024 - 160)) 58601538220582696
               | 072767254440146387121221583753570902541264313546485610004
               | 750780866792754920589068132679848156271786567137191486170
               | 703327101040110575724309837442235432328033598945697712388
               | 381451978878967640960102861159559384620162207350857311340
               | 050840762653291815079540852172190063832289697996458447828
               | 737045986675145162241398406780211500684400272119809812611
               | 929904403730528597187389968557044373171358434578669341421
               | 589594031073341308966343982813670005358851607748477530217
               | 997935791824321484740631022470362180786257680155724965742
               | 143671853261978124815484529745096216129616265428595187946
               | 218158257908697520717067407972750772493927919233876373156
               | 233168124018474649093065762508852701249697457996541217284
               | 731525708907010521446628919717790946317667295418177373934
               | 569079380730731418728467142214924963037213836468723407639
               | 463340501537517771097304361708288408926134607971881944851
               | 903293709136443956469032501471787137653780375970169583703
               | 004025580731889262167424195839803018053029467192083418365
               | 771596772896857995516305314630974667718236900515720973033
               | 360569178506327940782971768109891923840004914279569619570
               | 809054106685577577395046255601816801966032983011243478952
               | 485788667076648479153854024730784994930592675717244288051
               | 727831378594097970916753614065993732935733615600050932039
               | 565604760497376113470099235874052260244552013678784657526
               | 956911961147614534279815393143753046456632251016766323098
               | 5347176517861376
               | 
               | If you restrict it to words, assume that there's only
               | 1000 words in the English language, averaging 5 letters
               | each, you get:
               | 
               | (1000 ^ 200) / (2 ^ 160) which becomes:
               | 
               | >>> (500 ** 160) * (1000 ** 40) 6842277657836020854119773
               | 355907793609766904013068924666782559979930620520927053718
               | 196475529111921787261962890625000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 000000000000000000000000000000000000000000000000000000000
               | 00000000000000
               | 
               | There will be massive numbers of realistic texts that
               | match any given hash.
        
               | dTal wrote:
               | The pigeonhole principle forbids it, since the hash
               | itself is an ASCII message under 1KB long. Even within
                | your message constraints, the message can contain
               | any possible hash, and more besides. Therefore there are
               | more valid messages than hashes.
        
           | FartyMcFarter wrote:
           | Compression algorithms must encode data in a reversible way,
           | which hash functions can't do in general.
           | 
           | For example, SHA-1 outputs a 160 bit hash, which means some
           | input of 161 or more bits will definitely have the same hash
           | as some other input of the same size. Even smaller inputs may
           | have collisions too.
        
           | [deleted]
        
           | tomsmeding wrote:
           | These hashes, regardless of their actual cryptographic
           | strength, are intended to be indistinguishable from a random
           | generator keyed with the input, kind of. Assuming that they
           | succeed in that, hashes should be quite well distributed.
            | That means that for each n-bit hash output, there
           | will be about two (n+1)-bit input strings that produce that
           | hash, about four (n+2)-bit input strings, eight (n+3)-bit
           | input strings, etc. So while there are probably few _ASCII_
           | input strings that map to one particular hash, for example,
           | there will be loads and loads of arbitrary bitstrings that
           | map to that hash.
           | 
           | Hashes are very bad compressors. :)
        
           | kazinator wrote:
           | The compression joke evokes a misconception (as STEM jokes
           | often do) which, in this case, is that breaking a hash means
           | finding a way to recover the full text of a multi-megabyte
           | document, from a 100-something byte hash.
           | 
           | Whereas all it means is that it's become feasible for someone
           | to produce a fake document with the same hash as the genuine
           | one; the attack _depends_ on the existence of collisions,
            | which exist because the hash is many-to-one, such that an
            | infinite set of documents could be "recovered" from the hash.
           | 
           | Most of the documents you could "decompress" from a hash are
           | going to be gibberish, but some of them are going to be
           | legit. If a hash is "broken", that means it's somehow unduly
           | easy to search that ocean of gibberish collisions to find a
           | chosen text which goes to a certain desired hash.
           | 
           | Nothing quantum can redefine a function that isn't one-to-one
           | and onto into one that is. The hashing function remains ever
           | the mathematical object that it is: a one-way function. But
           | calculations of or searches through that function can be
           | faster.
           | 
           | In a post-quantum world, pre-quantum crypto and hashing could
           | well turn out to be "for shit" as such, but not due to ruses
           | like suddenly reversing irreversible functions.
           | 
           | Not saying that the joke isn't funny! We can imagine some
           | hard-core computer science or math sitcom in which the fall
           | characters say things like this, and only the very narrow
           | target audience gets it.
        
         | lallysingh wrote:
         | Or roughly 1.3M GPU hours. Less as GPUs get upgraded. Not sure
         | how many AWS F1 (FPGA) hours it would take.
        
           | wongarsu wrote:
           | So somewhere between $300,000 (running consumer GPUs cheaply)
           | and $30 million (paying AWS on-demand prices for Tesla V100).
           | 
           | Even the upper bound for that seems well worth it for well-
           | funded attackers if the target is juicy enough.
        
         | asdfman123 wrote:
         | I wonder how much money in Bitcoin they would have made if they
         | had used those hashes to mine instead.
         | 
         | Speaking of which, it would be funny if they were pretending to
         | do security research but added a secret backdoor to mine
         | bitcoin instead, somehow exporting those hashes or using the
         | partial results of SHA-1 calculations (BTC isn't SHA-1).
         | 
         | I'm just joking, but I wonder if that's possible. If anyone is
         | Machiavellian enough for that, it's security researchers.
        
       | rgovostes wrote:
       | (Jan 7, 2020)
        
       | ImJasonH wrote:
       | Reminder that GitHub has blocked Git commit collisions since
       | 2017, and as far as anybody is aware hasn't seen one in the wild.
       | 
       | https://github.blog/2017-03-20-sha-1-collision-detection-on-...
        
         | tantalor wrote:
         | > A higher probability exists that every member of your
         | programming team will be attacked and killed by wolves in
         | unrelated incidents on the same night.
         | 
         | - Scott Chacon
        
           | asdfman123 wrote:
           | That's even true if you make your developers commute into
           | Yellowstone National Park by ski every day
        
           | bawolff wrote:
           | This is stupid. The probability of any attack happening at
           | random is close to zero. E.g. the probability of a buffer
           | overflowing just right to give you a shell through random
           | chance is less than being eaten by wolves.
           | 
           | The question is about the risk of someone intentionally
           | performing the attack, not the probability it will
           | accidentally happen at random.
        
         | api wrote:
         | Random collisions in 160-bit space are incredibly unlikely.
         | This is talking about intentional collision, and means that
         | it's entirely feasible for someone with significant compute
         | power to create a git commit that has the exact same hash as
         | another git commit. This could allow someone to silently modify
         | a git commit history to e.g. inject malware or a known "bug"
         | into a piece of software. The modified repository would be
         | indistinguishable if you're only using git hashes.
         | 
          | Git uses SHA-1 for unique identifiers, which is technically
          | okay as long as they are not treated as secure. If git were
         | designed today it would probably use SHA2 or SHA3 but it's
         | probably not going to change due to the massive install base.
         | 
         | Edit: anyone know if git's PGP signing feature creates a larger
         | hash of the data in the repo? If not maybe git should add a
         | feature where signing is done after the computation of a larger
         | hash such as SHA-512 over all commits since the previous
         | signature.
        
           | asdfman123 wrote:
           | > Random collisions in 160-bit space are incredibly unlikely
           | 
           | Just for fun: to get a 5% chance of a hash collision between
           | ANY two numbers in an 160 bit space, you'd have to generate
           | 3.9e23 hashes.
           | 
            | So you'd have to generate 1000 hashes per second for _12
            | trillion years_.
           | 
           | Formula:
           | 
            | n = sqrt(2 * 2^160 * ln(1/(1-0.05)))
           | 
           | https://en.wikipedia.org/wiki/Birthday_problem#Probability_o.
           | ..
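            | 
            | A quick sanity check of those numbers in Python (a rough
            | sketch; the 1000 hashes/sec rate is of course arbitrary):
            | 
            |     import math
            |     d = 2 ** 160   # size of the SHA-1 output space
            |     p = 0.05       # target collision probability
            |     n = math.sqrt(2 * d * math.log(1 / (1 - p)))
            |     print(f"{n:.2e}")       # ~3.9e23 hashes
            |     years = n / 1000 / (60 * 60 * 24 * 365)
            |     print(f"{years:.2e}")   # ~1.2e13 (~12 trillion)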
        
           | tialaramex wrote:
           | The defence used by GitHub specifically defends _against_
           | these intentional collisions, not some mirage of random
           | collisions.
           | 
           | Basically you collide a hash like SHA-1 or MD5 by getting it
           | into a state where transitions don't twiddle as many bits,
           | and then smashing the remaining bits by brute force trial.
           | But, such states are _weird_ so from inside the hash
           | algorithm you can notice  "Huh, this is that weird state I
           | care about" and flag that at a cost of making the algorithm a
           | little slower. The tweaked SHA1 code is publicly available.
           | 
           | If you're thinking "Oh! I should rip out our safe SHA256 code
           | and use this unsafe but then retro-actively safer SHA1" No.
           | Don't do that. SHA-256 is safer and faster. This is an
           | emergency patch for people for whom apparently 20 years
           | notice wasn't enough warning.
           | 
           | In theory the known way to do this isn't the only way, but,
           | we have re-assuring evidence for MD5 that independent forces
           | (probably the NSA) who have every reason to choose a
           | different way to attack the hash to avoid detection _do_
            | trigger the same weird states even though they're spending
            | eye-watering sums of money to break hashes themselves, not
           | just copy-pasting a result from a published paper.
        
           | [deleted]
        
           | noway421 wrote:
           | From a practical point of view, how would injecting malware
           | happen? If you're trying to insert a malicious diff somewhere
           | deep in the git history, you would need to recompute all the
           | other commits after the injected commit - which would most
           | certainly change their commit ids too if they are touching
           | the same file. When other commit ids change, the malicious
           | change becomes detectable.
           | 
           | There's also the case for auditing: force pushing into an
           | existing repo triggers an event in GitHub and is logged.
           | While the logging event can be missed, it leaves a paper
           | trail.
           | 
            | With things like reproducible builds, this also becomes
           | harder. Distributing (through a means of a fork, or putting
           | it up on a website mytotallylegitgitrepos.com) source code
           | which builds into a binary which doesn't match upstream hash
           | is suspicious.
        
           | AnimalMuppet wrote:
           | Can you create an intentional collision, and _also_ have the
           | counterfeit have compilable code? (Or, in other
           | circumstances, intelligible English text?)
        
             | jfrunyon wrote:
             | Yes, just like you can create an intentional collision, and
             | _also_ have the counterfeit be a PDF.
             | [https://shattered.io/]
        
           | [deleted]
        
           | throw_away wrote:
           | Looks like there's a migration plan for git to SHA-256:
           | 
           | https://git-scm.com/docs/hash-function-transition/
        
           | jfrunyon wrote:
           | https://stackoverflow.com/questions/23584990/what-data-is-
           | be... says git signing just signs the commit hash (plus
           | metadata: committer, commit message, etc).
        
           | a_t48 wrote:
           | As far as I know it only signs individual commits.
        
           | oconnore wrote:
            | I was surprised that no one suggested truncating SHA-256 to
           | 160 bits (same as for SHA2-256/224, or SHA2-512/256). The
           | attacks on SHA-1 are not directly based on the length of the
           | hash, they are based on weaknesses in the algorithm.
           | 
           | Even attacking SHA2-256/128 would be quite difficult as I
           | understand it, even though it's the same length as MD5.
           | 
           | Truncated hashes also of course have the great property that
            | they mitigate length-extension attacks on Merkle-Damgard
            | constructions.
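            | 
            | A toy sketch of plain truncation with hashlib (note this is
            | not the dedicated SHA-512/256 construction, which also uses
            | different initial values):
            | 
            |     import hashlib
            | 
            |     def sha256_160(data: bytes) -> bytes:
            |         # SHA-256 cut to 160 bits, same length as SHA-1
            |         return hashlib.sha256(data).digest()[:20]
            | 
            |     print(sha256_160(b"hello").hex())  # 40 hex chars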
        
           | adrian_b wrote:
            | It is not yet possible "to create a git commit that has the
            | exact same hash as another git commit" in the sense that if
            | someone else has already made a commit you can make another
            | commit with the same hash.
            | 
            | What is possible now is something much easier: if you have
            | enough money and time, you can create 2 commits with the
            | same hash. They start with different parts, which may be
            | chosen arbitrarily, and then each has to include some parts
            | that are computed so that the hashes match - gibberish that
            | might be disguised in some constants or a binary blob, if
            | possible.
           | 
           | Then you can commit one of them and presumably you can later
           | substitute the initial commit with the other one without
           | anybody being able to detect the substitution.
        
             | outworlder wrote:
             | > presumably you can later substitute the initial commit
             | with the other one without anybody being able to detect the
             | substitution.
             | 
             | How? Which operation would be involved? Will it not show
             | anywhere else(reflog)?
        
               | adrian_b wrote:
               | Because such an operation is unlikely to succeed even if
               | the hashes match, replacing SHA-1 was not a high priority
               | for Git.
        
               | noway421 wrote:
               | `git push --force` I assume. In GitHub, this will leave a
               | trail of events though - Webhook events and auditing.
        
               | DSMan195276 wrote:
               | I would think it's not that simple because `git push
               | --force` doesn't do anything if it thinks the histories
               | are the same, and in this case we've created a history
               | that appears to be the same. You'd likely need a custom
               | git client (which isn't a problem) but I don't know
               | enough about the internals of a git push to know if the
               | server would even accept objects if they match hashes of
               | objects it already has (it may just go "I don't need
               | that" and ignore it, because what's the point in doing
               | anything with it?). Presumably it would depend on the
               | exact server implementation whether you could get it to
               | replace an existing object with a "new" one which is
               | actually the same hash, but frankly I think that's
               | unlikely because it would be pointless work from the view
               | of the server. If it does happen I'm not sure what
               | auditing you would actually be able to see, webhooks and
               | such might not be triggered because the history didn't
               | actually "change".
               | 
               | What you could do however is just host it yourself
               | somewhere else, say put it on a fork. Or if you have
               | access to the actual repository hosting the original
               | version, you could just manually replace it yourself. git
               | clients aren't going to just automatically pull the "new"
               | version though so you'll have some combo of people with
               | the old and people with the new, and it gets a little
               | messier from there.
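                | 
                | A toy sketch of that "I don't need that" behaviour
                | (not git's actual code, just the content-addressed
                | store assumption):
                | 
                |     import hashlib
                | 
                |     store = {}
                | 
                |     def put(obj, oid=None):
                |         # oid override only simulates a collision
                |         oid = oid or hashlib.sha1(obj).hexdigest()
                |         store.setdefault(oid, obj)  # first copy wins
                |         return oid
                | 
                |     oid = put(b"benign blob")
                |     put(b"malicious blob", oid)  # pretend same SHA-1
                |     print(store[oid])  # still b'benign blob'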
        
               | sverhagen wrote:
               | If you can force push, why would you then not push a
               | different commit before pushing your updated commit with
               | the same, original hash, or does that also not work?
        
               | formerly_proven wrote:
               | The reflog shows changes to references. References are
               | hashes. Git is CAS, and uses the CAS assumption: H(m1) =
                | H(m2) <=> m1 = m2.
        
           | outworlder wrote:
           | > This could allow someone to silently modify a git commit
           | history to e.g. inject malware or a known "bug" into a piece
           | of software.
           | 
           | You need a collision. You also need it to be syntactically
           | correct. You need it to not raise any red flags if you are
           | contributing a patch. And ultimately you need it to do what
           | you want.
           | 
           | That's a pretty tall order.
        
             | api wrote:
             | All you need is one variable sufficiently large that you
             | can change until you find a collision, such as a comment
              | (which nearly all languages allow) or a meaningless number.
             | 
             | You could even vary whitespace until it fits, like spaces
             | at the end of lines.
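              | 
              | A toy version of that search, against only the first 16
              | bits of SHA-1 so it finishes in a moment (real attacks
              | are collision searches and need the GPU-months discussed
              | in the article, not a loop like this):
              | 
              |     import hashlib, itertools
              | 
              |     def h16(data):
              |         # 16-bit truncation: toy example only
              |         return hashlib.sha1(data).hexdigest()[:4]
              | 
              |     target = h16(b"size = 4096\n")
              |     for n in itertools.count():
              |         cand = b"size = 4096  # tuned (%d)\n" % n
              |         if h16(cand) == target:
              |             print(n, cand)
              |             break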
        
           | ImJasonH wrote:
           | Yep. GitHub is saying they would block an object that looked
           | like it was crafted to produce a collision using SHAttered,
           | and hasn't seen it, intentionally or otherwise, in the wild.
        
       | lifthrasiir wrote:
       | The chosen-prefix collision with a complexity of 2^63.4 (2020).
        
         | akomtu wrote:
          | Looks like they needed 1.5 million GPU hours to find a
         | collision with a random hash.
        
       | smegcicle wrote:
       | thats it, it's over, git is finished
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-10-12 23:00 UTC)