https://lwn.net/SubscriberLink/811068/cfeb6a67b8dfbe47/

A new hash algorithm for Git

By Jonathan Corbet
February 3, 2020

The Git source-code management system is famously built on the SHA-1 hashing algorithm, which has become an increasingly weak foundation over the years. SHA-1 is now considered to be broken and, despite the fact that it does not yet seem to be so broken that it could be used to compromise Git repositories, users are increasingly worried about its security. The good news is that work on moving Git past SHA-1 has been underway for some time, and is slowly coming to fruition; there is a version of the code that can be looked at now.

How Git works, simplified

To understand why SHA-1 matters to Git, it helps to have an idea of how the underlying Git database works. What follows is an oversimplified view of how Git manages objects that can be skipped by readers who are already familiar with this material.

Git is often described as being built on a content-addressable filesystem -- one where you can look up an object if you know that object's contents.
That may not seem particularly useful, but there's more than one way to "know" those contents. In particular, you can substitute a cryptographic hash for the contents themselves; that hash is rather easier to work with and has some other useful properties. Git stores a number of object types, using SHA-1 hashes to identify them. So, for example, the SHA-1 hash of drivers/block/floppy.c in a 5.6-merge-window kernel, as calculated by Git, is 485865fd0412e40d041e861506bb3ac11a3a91e3. Conceptually, at least, Git will store that version of floppy.c in a file, using that hash as its name; early versions of Git actually did that. If somebody makes a change to floppy.c, even just removing an extra space from the end of a line, the result will have a completely different SHA-1 hash and will be stored under a different name. A Git repository is thus full of objects (often called "blobs") with SHA-1 names; since a new one is created for each revision of a file, they tend to proliferate. Your editor's kernel repository currently contains 8,647,655 objects. But blobs are not the only types of objects stored in a Git repository. An individual file object holds a particular set of contents, but it has no information about where that file appears in the repository hierarchy. If floppy.c is moved to drivers/staging someday, its hash will remain the same, so its representation in the Git object database will not change. Keeping track of how files are organized into a directory hierarchy is the job of a "tree" object. Any given tree object can be thought of as a collection of blobs (each identified by its SHA-1 hash, of course) associated with their location in the directory tree. As one might expect, a tree object has an SHA-1 hash of its own that is used to store it in the repository. Finally, a "commit" object records the state of the repository at a particular point in time. A commit contains some metadata (committer, date, etc.) 
along with the SHA-1 hash of a tree object reflecting the current state of the repository. With that information, Git can check out the repository at a given commit, reproducing the state of the files in the repository at that point. Importantly, a commit also contains the hash of the previous commit (or multiple commits in the case of a merge); it thus records not just the state of the repository, but the previous state, making it possible to determine exactly what changed. Commits, too, have SHA-1 hashes, and the hash of the previous commit (or commits) is included in that calculation. If two chains of development end up with the same file contents, the resulting commits will still have different hashes. Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. It thus forms a sort of blockchain, with each block containing the state of the repository at a given commit. Why hash security matters The compromise of kernel.org in 2011 created a fair amount of concern about the security of the kernel source repository. If an attacker were able to put a backdoor into the kernel code, the result could be the eventual compromise of vast numbers of deployed systems. Malicious code placed into the kernel's build system could be run behind any number of corporate and government firewalls. It was not a pleasant scenario but, thanks to the use of Git, it was also not a particularly likely one. Let us imagine that some attacker has gained control of kernel.org and wants to place some evil code into floppy.c -- something unspeakable like a change that replaces random sectors with segments from Rick Astley videos, say. Somehow this change would have to be incorporated into the repository so that it would be included in subsequent pulls. 
But the change to floppy.c changes its SHA-1 hash; that, in turn, will change every tree object containing the evil floppy.c and every commit that includes it as well. The head commit for the repository would certainly change, as would older ones if the attacker tried to make the change appear to have happened in the distant past. Somewhere out there is certainly some developer who actually memorizes SHA-1 hashes and would immediately notice a change like that. The rest of us probably would not, but Git will. The distributed nature of Git means that there are many copies of the repository out there; as soon as a developer tries to pull from or push to the corrupted repository, the operation will fail due to the mismatched hashes between the two repositories and the corruption will come to light. Repository integrity is also protected by signed tags, which include the hash for a specific commit and a cryptographic signature. The chain of hashes leading up to a given tag cannot be changed without invalidating the tag itself. The use of signed tags is not universal in the kernel community (and rare to nonexistent in many other projects), but mainline kernel releases are signed that way. When one sees Linus Torvalds's signature on a tag, one knows that the repository is in the state he intended when the tag was applied. All of this depends on the strength of the hash used, though. If our attacker is able to modify floppy.c in such a way that its SHA-1 hash does not change, that modification could well go undetected. That is why the news of SHA-1 hash collisions creates concern; if SHA-1 cannot be trusted to detect hostile changes, then it is no longer assuring the integrity of the repository. The world has not ended yet, fortunately. It is still reasonably expensive to create any sort of SHA-1 hash collision at all. Creating any new version of floppy.c with the same hash would be hard. 
An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry (at least not more than it already does). Creating such a beast is probably still unfeasible. But the writing is clearly on the wall; the time when SHA-1 is too weak for Git is rapidly approaching. Moving to a stronger hash Back in the early days of Git, Torvalds was unconcerned about the possibility of SHA-1 being broken; as a result, he never designed in the ability to switch to a different hash; SHA-1 is fundamental to how Git operates. As of 2017, the Git code was full of declarations like: unsigned char sha1[20]; In other words, the type of the hash was deeply wired into the code, and it was assumed that hashes would fit into a 20-byte array. At that time, Git developer brian m. carlson was already at work to separate the Git core from the specific hash being used; indeed, he had been working on it since 2014. It was unclear what hash might eventually replace SHA-1, but it was possible to create an abstract type for object hashes that would hide that detail. At this point, that work is done and merged. The decision on a replacement hash algorithm was made in 2018. A number of possibilities were considered, but the Git community settled on SHA-256 as the next-generation Git hash. The commit enshrining that choice cites its relatively long history, wide support, and good performance. The community has also decided on (and mostly implemented) a transition plan that is well documented; most of what follows is shamelessly cribbed from that file. With the hash algorithm abstracted out of the core Git code, the transition is, on the surface, relatively easy. A new version of Git can be made with a different hash algorithm, along with a tool that will convert a repository from the old hash to the new. 
With a simple command like:

    git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
        --liability-waiver=none --use-shovels --carbon-offsets

a user can leave SHA-1 behind (note that the specific command-line options may differ). There is only one problem with this plan, though: most Git repositories do not operate in a vacuum. This sort of flag-day conversion might work for a tiny project, but it's not going to work well for a project like the kernel. So Git needs to be able to work with both SHA-1 and SHA-256 hashes for the foreseeable future. There are a number of implications to this requirement that make themselves felt throughout the system.

One of the transition design goals is that SHA-256 repositories should be able to interoperate with SHA-1 repositories managed by older versions of Git. If kernel.org updates to the new format, developers running older versions should still be able to pull from (and push to) that site. That will only happen if Git continues to track the SHA-1 hashes for each object indefinitely.

For blobs, this tracking will happen through the maintenance of a set of translation tables; given a hash generated with one algorithm, Git will be able to look up the corresponding hash from the other. Needless to say, this lookup will only succeed for objects that are actually in the repository. These translation tables will be maintained in the "pack files" that hold most objects in a contemporary Git repository. There will be a separate table for "loose objects" that are stored as separate files rather than in packs; the cost of lookups in that table is seen as being high enough that measures need to be taken to minimize the number of loose objects in any given repository.

The handling of other object types is a bit more complicated. An SHA-1 tree object, for example, must contain SHA-1 hashes for the objects in the tree.
So if such a tree object is requested, Git will have to locate the SHA-256 version, then translate all the object hashes contained within it before returning it. Similar translations will be required for commits. Signed tags will contain both hashes. With this machinery in place, Git installations will be interoperable during the transition. Eventually, all users will have upgraded to SHA-256-capable versions of Git, at which point repository owners could begin turning off the SHA-1 capability and removing the translation tables. The transition will, at that point, be complete. Some inconvenient details There are likely to be some glitches along the way, naturally. One of them is a simple human-factors problem: when a user supplies a hash value, should it be interpreted as SHA-1 or SHA-256? In some cases, it's unambiguous; SHA-1 hashes are 160 bits wide, so a 256-bit hash must be SHA-256, for example. But a shorter hash could be either, since hashes can be (and often are) abbreviated. The transition document describes a multi-phase process during which the interpretation of hash values would change, but most users are unlikely to go through that process. There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document: git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} For a Git user interface this is relatively straightforward and concise, but one can still imagine that users might tire of it relatively quickly. The obvious solution to this sort of bracket fatigue is to fully transition a project to SHA-256 as quickly as possible. There is another issue out there, though: there are a lot of SHA-1 hash values in the wild. The kernel repository currently contains over 40,000 commits with a Fixes: tag; each one of those includes an SHA-1 hash. 
These hash values also can be found in bug-tracker histories, release announcements, vulnerability disclosures, and more. In a repository without SHA-1 compatibility, all of those hashes will become meaningless. To address this issue, one can imagine that the Git developers may eventually add a mode where translations for old SHA-1 hashes remain in the repository, but no SHA-1 hashes for new objects are added.

Current state

Much of the work to implement the SHA-256 transition has been done, but it remains in a relatively unstable state and most of it is not even being actively tested yet. In mid-January, carlson posted the first part of this transition code, which clearly only solves part of the problem:

    First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA-256 repository, but will be unable to read it.

The value of write-only repositories is generally agreed to be relatively low; not even SCCS was so limited. Carlson's purpose in posting the code at this stage is to try to reveal any core issues that will be harder to change as the work progresses. Developers who are interested in where Git is going may well want to take a close look at this code; converting their working repositories over is not recommended, though.

As it turns out, carlson's work goes well beyond what has been put out for testing now; he will post it when he is ready, but really curious people can see it now in his GitHub repository. This work is unlikely to land on the systems of most Git users for some time yet, but it is good to know that it is getting close to ready. The Git developers (carlson in particular) have quietly been working on this project for years; we will all benefit from it.
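The object naming described under "How Git works, simplified" can be made concrete with a small sketch. It assumes only the on-disk layout Git uses for hashing (a "<type> <size>" header, a NUL byte, then the object body); the helper names, the file contents, and the author identity are invented for illustration, and the tree serialization is simplified to the single-file case. It is not code from Git itself.

```python
import hashlib


def object_id(obj_type, body, algo="sha1"):
    """Name an object: hash '<type> <size>\\0' followed by the body."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.new(algo, header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries; IDs are stored in binary.

    Simplification: sorting by plain name is only correct for a tree
    that contains no subdirectories.
    """
    return b"".join(
        f"{mode} {name}\0".encode() + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=lambda e: e[1])
    )


def commit_body(tree_id, parent_ids, message):
    """Build a commit: tree, parents, metadata, blank line, message."""
    ident = "A U Thor <author@example.com> 1580688000 +0000"  # made-up identity
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return ("\n".join(lines) + "\n").encode()


def snapshot(content, parent_ids, algo="sha1"):
    """Hash one file, the tree holding it, and a commit pointing at that tree."""
    blob = object_id("blob", content, algo)
    tree = object_id("tree", tree_body([("100644", "floppy.c", blob)]), algo)
    commit = object_id("commit", commit_body(tree, parent_ids, "update floppy.c"), algo)
    return blob, tree, commit


original = snapshot(b"int floppy;\n", [])
tampered = snapshot(b"int floppy; \n", [])  # one extra space before the newline

# The change to the file propagates to the tree and to the commit.
for kind, a, b in zip(("blob", "tree", "commit"), original, tampered):
    print(kind, a[:12], b[:12], "changed" if a != b else "same")

# The same content named with SHA-256 gets longer names.
sha256_names = snapshot(b"int floppy;\n", [], algo="sha256")
print(len(original[0]), len(sha256_names[0]))  # 40 64
```

Because a tree embeds the names of its blobs and a commit embeds the name of its tree (and of its parents), the SHA-1 and SHA-256 versions of a tree or commit differ in their contents as well as in their names, which is why those object types need translation rather than a simple table lookup.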
-----------------------------------------

A new hash algorithm for Git
Posted Feb 3, 2020 18:15 UTC (Mon) by IanKelling (subscriber, #89418)

Great article. I'd love to see a similar one about GPG and SHA-1.

A new hash algorithm for Git
Posted Feb 3, 2020 18:34 UTC (Mon) by zdavatz (subscriber, #70954)

Great article, thank you!

A new hash algorithm for Git
Posted Feb 3, 2020 18:37 UTC (Mon) by Cyberax (supporter, #52523)

> There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document

One trick that worked for me in a similar case was to switch the encoding. SHA-1 is encoded as hex numbers, we can simply switch SHA-256 to be encoded as letters "g" to "v", so they will be immediately recognizable.

A new hash algorithm for Git
Posted Feb 3, 2020 18:41 UTC (Mon) by juliank (subscriber, #45896)

This sounds like a totally reasonable thing to do.

A new hash algorithm for Git
Posted Feb 3, 2020 23:48 UTC (Mon) by dsommers (subscriber, #55274)

No, not really. What josh suggests, prefixing the string, makes more sense.

* performance: Doing char replacing in strings is more CPU intensive than just skipping one single byte and continuing to use standard functions/libraries. This gets more evident when considering large repositories like the Linux kernel.
* future compatibility: Shifting a-f chars to another set of 6 other letters will only work 3 more times if only considering lower case letters - 6 letters (a-f) * 4 shifts = 24. So at the 5th change, something new must be done to avoid breaking compatibility.
Of course the counter argument is "how often will such new algorithms occur in reality?"; but none of us really knows that for sure - just as we don't know how long a git repository will live and be accessed. From this article (I've not paid attention to discussions in the git community), it seems like they account for the possibility of changing it again later on.

So having a prefix possibility with just one prefix or suffix letter makes it possible to change algorithms 26 times, with no performance loss (except the "skip one byte" operation when evaluating the hash). If that is too little, 3 letters gives the possibility for 17576 changes; which is probably enough for most of us alive today - but using 4 letters increases that once again to an even more insane number.

But say you then settle for a 4-letter prefix (456,976 possibilities) ... then you're not that far away from {sha256} which is 8 letters, with basically an unlimited amount of algorithm changes. What is inside the {} can be any length while containing a good description of what kind of algorithm is in use, without needing to look up that "AAAC" means SHA512.

A new hash algorithm for Git
Posted Feb 4, 2020 0:10 UTC (Tue) by Cyberax (supporter, #52523)

A prefix would also work, but let's limit it to 1 letter. This would realistically give more than enough coding space to last until git is no longer useful. It can also be extended to two characters later if needed.

A new hash algorithm for Git
Posted Feb 4, 2020 5:56 UTC (Tue) by eru (subscriber, #2753)

One prefix letter would allow signifying only 20 possible hash algorithms, because you should avoid [a-f] that can start a valid hash value in the current scheme. But probably that would be enough, it is unlikely the hash changes more than once in a decade...
A new hash algorithm for Git
Posted Feb 4, 2020 6:06 UTC (Tue) by Cyberax (supporter, #52523)

You can then use the UTF-8-like encoding, reserving 1 bit for "next byte continues the encoding ID" flag. So you can extend it indefinitely.

A new hash algorithm for Git
Posted Feb 4, 2020 16:07 UTC (Tue) by mathstuf (subscriber, #69389)

There are also uppercase letters if we really get desperate :) .

A new hash algorithm for Git
Posted Feb 4, 2020 19:40 UTC (Tue) by quotemstr (subscriber, #45331)

> Doing char replacing in strings is more CPU intensive than just skipping one single byte and continue using standard functions/libraries

Hash functions don't operate on the hex encoding of the hash digest. If you need to parse base-16 to binary anyway, there's no penalty arising from choosing an alternate set of characters to represent that base-16 value.

A new hash algorithm for Git
Posted Feb 3, 2020 21:44 UTC (Mon) by josh (subscriber, #17465)

Or just add a single special character at the beginning, like a capital H. (Using a letter will make sure that people's "select by word" mechanisms pick it up.)

A new hash algorithm for Git
Posted Feb 4, 2020 17:13 UTC (Tue) by excors (subscriber, #95769)

Gerrit already uses a SHA-1 prefixed with "I" for its Change-Id (a persistent identifier of a patch). Are there any other popular Git-related tools that use a similar pattern? If Git started adding its own prefix letters, it would be nice to avoid ambiguity with them.

A new hash algorithm for Git
Posted Feb 4, 2020 14:58 UTC (Tue) by ballombe (subscriber, #9523)

Does not work in general: 12345678 is perfectly valid in both notations.
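For concreteness, the letter-based encoding proposed earlier in this thread might look like the sketch below. It assumes each hex digit 0-f is mapped, in order, onto the letters g-v; the function names are made up for illustration, and nothing like this is known to exist in Git. The last lines simply redo the prefix arithmetic from the comments above.

```python
HEX_DIGITS = "0123456789abcdef"
ALT_DIGITS = "ghijklmnopqrstuv"  # 16 letters, none of which is a hex digit

TO_ALT = str.maketrans(HEX_DIGITS, ALT_DIGITS)
FROM_ALT = str.maketrans(ALT_DIGITS, HEX_DIGITS)


def encode_alt(hex_id):
    """Rewrite a (possibly abbreviated) hex ID using the letters g-v."""
    return hex_id.lower().translate(TO_ALT)


def decode_alt(alt_id):
    """Recover the ordinary hex form of a g-v encoded ID."""
    return alt_id.translate(FROM_ALT)


def notation(name):
    """Say which alphabet an ID uses; the two alphabets never overlap."""
    if name and set(name) <= set(HEX_DIGITS):
        return "hex"
    if name and set(name) <= set(ALT_DIGITS):
        return "g-v"
    return "unknown"


print(encode_alt("12345678"))            # hijklmno
print(notation("12345678"))              # hex
print(notation(encode_alt("12345678")))  # g-v

# Prefix arithmetic from the thread: lowercase letters usable as a
# one-letter prefix, with and without a-f, and 3- and 4-letter prefixes.
print(26, 26 - 6, 26 ** 3, 26 ** 4)      # 26 20 17576 456976
```

Under this mapping an abbreviated ID such as 12345678 would be written hijklmno in the second notation, so a string made only of decimal digits could only be read as hex.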
A new hash algorithm for Git
Posted Feb 4, 2020 15:51 UTC (Tue) by willy (subscriber, #9762)

He didn't say "use the digits 0123456789ghijkl". He said "use the digits ghijklmnopqrstuv".

A new hash algorithm for Git
Posted Feb 3, 2020 18:52 UTC (Mon) by meyert (subscriber, #32097)

I wonder if the much increased complexity is really worth the value given a very theoretical hash collision.

A new hash algorithm for Git
Posted Feb 3, 2020 19:03 UTC (Mon) by martin.langhoff (subscriber, #61417)

At some point, it will be worthwhile. We don't know exactly when that'll be, but the trick is to do it _before_ that inflection point.

A new hash algorithm for Git
Posted Feb 3, 2020 19:58 UTC (Mon) by mirabilos (subscriber, #84359)

By then, SHA-256 will be broken as well. SHA-2 uses the same underlying structure as SHA-1 and is almost only more secure due to its length. Anything new deployed now should use SHA-3 (Keccak) right from the start.

The comparison with OpenPGP also lags, people can choose the hash algorithm there (even though a gpg2 --version shows there's no SHA-3 yet).

Also, I wonder, will I be able to verify old signed commits and tags after the transition is complete? Doesn't seem so...

A new hash algorithm for Git
Posted Feb 3, 2020 22:02 UTC (Mon) by Cyberax (supporter, #52523)

The best attacks on SHA-1 reduce complexity from 2^80 (still unfeasible to brute-force) to 2^68 (just barely feasible). That's about 2^12 times speedup.

SHA-256 has 2^128 collision probability to start with, any realistic attacks won't lower the complexity below 2^100 (WAY outside of possible attacks).

A new hash algorithm for Git
Posted Feb 4, 2020 2:47 UTC (Tue) by wahern (subscriber, #37304)

The recently published SHAttered attack (https://shattered.it/) took ~2^63 computations.
That said, AFAIU none of the recent SHA-1 attacks carry over to SHA-256. And other than length extension attacks, I don't think the Merkle-Damgard construction is considered fundamentally broken; it's just well analyzed. [Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 11:32 UTC (Tue) by heftig (subscriber, #73632) [ Link] Since Git prefixes an object with its length before hashing it, does length extension still apply? [Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 2:16 UTC (Tue) by KaiRo (subscriber, #1987) [Link] For signed commits or other signatures, the other question is quantum-safety of the signatures themselves, which is also probably not ensured right now. I'm actually a bit more worried about switching to quantum-safe async crypto than about those hash collisions, but both are somewhat worrisome. [Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 2:22 UTC (Tue) by mirabilos (subscriber, #84359) [ Link] Quantum what? I was just wondering because the signature is over the SHA-1 hash. [Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 2:39 UTC (Tue) by KaiRo (subscriber, #1987) [Link] "Signature" usually means that you sign some arbitrary data (in this case a SHA-1 hash) using some async crypto key material (in this case usually some RSA variant). RSA and other async crypto algorithms used commonly nowadays are not safe from being cracked by quantum computers once we have some with enough capacity. That puts all signatures, identification and encryption based on those algorithms at risk once we have those kinds of quantum computers, so where we use those we will need to find solutions for that (quantum-safe algorithms are in development or testing right now but not finalized AFAIK). The common hash algorithms have no big issues with that, so it doesn't affect git itself directly, but it certainly does or will affect the signatures of signed commits. 
[Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 19:57 UTC (Tue) by mirabilos (subscriber, #84359) [Link] I know what a signature is and all that, but I absolutely don't get where you are going with this. When I currently have a signed commit... -----BEGIN cutting here may damage your screen surface----- $ git cat-file -p HEAD tree 937122472a792ada03309a60b7a31e02a29aa764 parent 53861b4a1544c7c8825f1414c37c9694c84c5d92 author mirabilos 1580771045 +0100 committer mirabilos 1580771470 +0100 gpgsig -----BEGIN PGP SIGNATURE----- Comment: TsTF--8 [?] iQIcBAABCQAGBQJeOKiPAAoJEIlQwYleuNOzSzYP/3xowIYpxJwuHfdP8oRekbSZ eVI9mO5g8KC+SUe5oGCbocH478pBUp5AOYlFGL0awetklijRmF+EeYp+a1IluCww GD2pSPFCpxSjScERlED5YYpfaaw1XEutoGHYQNMAUQhlRMzS8NwhGJjTuoIbvE4X hMntoMtDM7sPJ3CIADIoYzXIcdaqsELvqptuvNdo9S/PIyR6OFWhpF68Qn+SILqk N+fOA/KpgQLsRmMEVy3YtqmMdToYXoP3m4ec0/QSoN90QVrO9ZnVG2+0f9yeEiVn xEWiaSSsz5vtniBLzOvQ6FeE0h08ZsQi9dcTj8aq3tDtUJb2sQi6q79Gl5StmfHI 8HN9q8ZQP/Vh8kIT5z3lcuNnb3y7sc90ZzY5i7Q2YwfKNbJ5mAEMvSgzBxcrDflR /kjUJcXJg98IzJsWbE3k9gRc9yatqKQii0GiaxID13fCfl++4klJrFMEyoTdhta4 5a7vGa6OuHr+MWsT+35yQsR6Mt1DnMY2oNArTgWG3DfNQK8zb7rIExPbuV6pLP2O X67ZCVSHwRTrLWnDHjSuQH4Hfoibq96Ga9wJwEjw0+sWKzg4CgvQH6L+UiXIZO0/ 2+hhF507WUCKh8Uit2nrRsGhVnXJrI5QZsD857oAifcBFslbTLwTCkj+3gccHxwH A/BAeG4zN0JrdvMzx0pN =9w0P -----END PGP SIGNATURE----- erm yes, the symlink... -----END cutting here may damage your screen surface----- ... or tag... -----BEGIN cutting here may damage your screen surface----- $ git cat-file -p mksh-57-6 object 3ece4d6c67f32b8e2b9b00900d05cc06c658fc87 type commit tag mksh-57-6 tagger mirabilos 1580771932 +0100 mksh (57-6) unstable; urgency=low -----BEGIN PGP SIGNATURE----- Comment: TsTF--8 [?] 
iQIcBAABCQAGBQJeOKpdAAoJEIlQwYleuNOzE3EP/1Qu6w3ZnelCbTcR0/lR1QaH qisRANlIKYq0MVDOmhzGZ4m6/ri9b2njI16x0R3otaIT2QfG2ldj8U/Sq7Vpm6Xb uTpMluMzFj6sungPYOCvgbDVcVqt4+qCAwtFL5Lt2gpfN45KwYO0RdrSCY8wFD3N TO3Wq7M3DXt99F9mMY/L+XfvbpDAMzjCEK0tgTAal4QWnnb7V2Y1bVnZjos5XZTV hWW4kJMqBp2Hf99KLqnjijfPgZkqbSMYKy14Nsqo1cSujwPpOH2MgDbyuun1SuSA K6U0JT1iyIsL/ixkCx8vi6ejIGGQXXpGEq4K4RA3Wc4ALB/FWC9Y2MrCEExG0wEV tDkto90sbD6Nymnii1apG2Q7aSyDNDjsiRT2tzYN2S5EzItYtV0V8ZXoxiYk/c/Z ttAcdXxh8R4+5p3yNYwAjTSzZe8ohvgHFXoAUGVpk7g9oArlNiJmqkrW3BGdFrCb gH0h4UpiXr3pgnlPi247alGT18Xly5cBX3CbjORGDNsUDZoGPLlVuyW46PaRel3V P8BODtOoFkoK7JyFCRP70Z97vQig+L9nbN5tf50haYlxhO7oOSU7RzQJxgv2tLza AT0bg6Wfs4I9VV/MjocIirwrbihZY1gMgURgad5PdoNjoyNy+vd6OKMFQm1i/eUF hGIwKngrue1A9RMKPaCG =JPiZ -----END PGP SIGNATURE----- -----END cutting here may damage your screen surface----- ... these hardcode the SHA-1 hashes. These are, thus, needed to verify the signature. This also cannot be rewritten. As a user, I'd expect that, after full git conversion to a new hash, I'll still be able to verify these. That was the question. [Reply to this comment] Maybe Skip SHA-3 Posted Feb 4, 2020 8:19 UTC (Tue) by tialaramex (subscriber, #21167) [Link] Adam Langley suggests sticking with the SHA-2 family while things shake out in the relatively new frontier that is Keccak-style algorithms. https://www.imperialviolet.org/2017/05/31/skipsha3.html SHA-3 is significantly slower than SHA-2 which is already very slow for a hash (if we didn't need a crypto hash there are lots of very very fast hashes used elsewhere) so it's a big penalty when you aren't buying say, future proofing, which you aren't because SHA-3 was agreed way before the dust settled on how to do this style of hash, there are currently half a dozen like it, all seemingly secure, most faster, none standardised. 
This isn't like AES where the rough direction is understood and now you're buying hardware that accelerates it, so that not doing AES ends up slower because you lose hardware assist. Langley recommends SHA-512/256 (note for those unfamiliar this is literally the name of the hash, not two different hashes you can pick from) if you care about length extension attacks and otherwise SHA-256 is fine. The reason for SHA-512/256 is that the output isn't the entire internal state, it's only half the state, meaning a length extension fails, and it only needs the same size structure to store the hash as SHA-256 (but it is slower). [Reply to this comment] A new hash algorithm for Git Posted Feb 4, 2020 10:41 UTC (Tue) by epa (subscriber, #39769) [Link] Also, I wonder, will I be able to verify old signed commits and tags after the transition is complete? Perhaps you would be able to verify them slowly by recomputing the SHA-1 hashes of each object from scratch, even if they aren't stored in the repository. [Reply to this comment] A new hash algorithm for Git Posted Feb 3, 2020 20:00 UTC (Mon) by chfisher (subscriber, #106449) [Link] Since it is generally conceded that the question is not "if SHA-1 will be compromised" but "when will SHA-1 be compromised", it behooves us as developers to move to a more secure option BEFORE that compromise occurs, since an exploit that successfully infects the kernel would have such wide ranging (and expensive) implications. [Reply to this comment] A new hash algorithm for Git Posted Feb 3, 2020 20:15 UTC (Mon) by dkg (subscriber, #55359) [Link] A hash collision for SHA-1 is not theoretical at all. Rather, it is within reach of moderately funded attacker, on the order of $100K, and has been practically demonstrated by a university+corporate team. The price is expected only to fall. 
The authors of the recent "SHA-ttered" collision have this to say about git: GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits. It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one. An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision. Note that this weakness in git means that even git signatures made with strong modern crypto are vulnerable, because they are signing objects that refer to other objects only by their SHA-1 digest. For instance, when signing tags, the signed tag itself cannot be replaced, but the thing that the tag points to can be replaced without invalidating the signature. Kudos to carlson for having been working on this; it's a shame that this kind of maintenance work never seems to get prioritized by projects until there is a fire that needs putting out. It would have been better if we had already completed this transition years ago. [Reply to this comment] A new hash algorithm for Git Posted Feb 3, 2020 20:27 UTC (Mon) by walters (subscriber, #7396) [ Link] See also https://github.com/cgwalters/git-evtag for a stronger signed tag. [Reply to this comment] A new hash algorithm for Git Posted Feb 3, 2020 21:11 UTC (Mon) by martin.langhoff (subscriber, # 61417) [Link] It's been demonstrated on a pair of PDF files. The format is pretty opaque to the typical end user, and the "good" file was pre-doctored. These are very artificial conditions. As many have pointed out, including this article, current attacks match the SHA-1 of an existing file that wasn't built to facilitate the attack in the first place... have to add a bunch of "random" data to get to a collision. For a code file, which is the typical content of git, that's pretty "visible". 

A new hash algorithm for Git Posted Feb 3, 2020 21:49 UTC (Mon) by dkg (subscriber, #55359)
I recommend reading Joey Hess's discussion from 2011 (in particular the discussion in the comments) for why the legibility of the commit messages and code objects typically covered by git is not necessarily sufficient: other stuff can be included in the hashes that won't be visible to normal end users. (maybe this was fixed in the last decade? i haven't tested recently) Even if it were somehow true that git hashes only cover the things that are directly exposed to the user, "git history is cryptographically strong for repositories that contain only human-readable code" is a significant reduction in scope from "git history is cryptographically strong". I don't think we want to make that reduction, and i know of no repositories (and no tooling) that would deliberately enforce that kind of limitation for the sake of retaining cryptographic strength of the git history. Also, many "code only" repositories contain the occasional binary graphic file (screenshot, logo etc), firmware, test corpus, etc, all of which could be used to hide the "tumor" needed for this kind of collision attack. This needs fixing, and we've known it needed fixing for nearly as long as git has been around. Why advance an argument that seems like it would only help to delay getting a fix deployed?

A new hash algorithm for Git Posted Feb 3, 2020 23:39 UTC (Mon) by martin.langhoff (subscriber, #61417)
To be clear, this is progress, and progress is needed. At some point, SHA-1 will be truly broken in a "useful" way, with real life impact, and we better have made the transition by then. It's not known to be broken today in a useful, usable way, for the typical uses of git.

A new hash algorithm for Git Posted Feb 4, 2020 15:25 UTC (Tue) by joey (subscriber, #328)
It has not been fixed.

A new hash algorithm for Git Posted Feb 3, 2020 22:03 UTC (Mon) by khim (subscriber, #9252)
I don't know where you get the notion that the problem of "creating an existing file that wasn't built to facilitate the attack" is even remotely possible. Not even MD-4 is broken for preimage attack in practice. MD-4 was "broken" with a preimage attack of complexity 2^102, which is really worrying: maybe in a few more years with some ASICs... maybe... Very unlikely though: very few entities could spend literally trillions of dollars to show that an old, almost completely forgotten hash is no longer useful. There exists a theoretical attack on MD-5 of 2^123.4 complexity, but if you recall that there are 2^128 MD-5 hashes in total... that's a pretty trivial improvement. SHA-1 doesn't even have a theoretical preimage attack currently (but there are a few for "reduced" versions, which means we'll soon see something for the full one). So no, don't expect a preimage attack on SHA-1 to happen in your lifetime... unless you plan to live for 300 years. Now, collision attacks are pretty easy for MD-4, MD-5, and relatively easy for SHA-1 (tens of thousands of dollars), but they all require the attacker to "plant a bomb" in the "good repo". These are still nasty enough to worry about, but as you could guess urgency is quite low: I still think it's cheaper to just submit a dozen patches with subtle buffer overflows and get one of them accepted than to generate such a collision. But the price goes down each year...

A new hash algorithm for Git Posted Feb 3, 2020 21:56 UTC (Mon) by dkg (subscriber, #55359)
I should also mention that the "shambles" attack published in January 2020 claims costs of $11K (USD) for an arbitrary collision and $45K (USD) for a chosen-prefix collision.
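
To put the preimage figures in khim's comment above into perspective (2^102 work for MD-4, and 2^123.4 against a 2^128 brute-force search for MD-5), here is a small back-of-the-envelope sketch. The hash rate is an invented assumption chosen only for illustration, not a measurement of any real hardware, and it treats one unit of "complexity" as one hash evaluation, which is itself a simplification.

```python
# Back-of-the-envelope arithmetic for the preimage figures quoted above.
# ASSUMED_HASHES_PER_SECOND is a made-up rate, not a measurement.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
ASSUMED_HASHES_PER_SECOND = 10**18  # hypothetical, very well funded attacker


def years_of_work(log2_operations, rate=ASSUMED_HASHES_PER_SECOND):
    """Years needed to perform 2**log2_operations hash evaluations at `rate`."""
    return 2.0 ** log2_operations / rate / SECONDS_PER_YEAR


def speedup_over_brute_force(log2_attack, log2_brute_force):
    """How many times cheaper an attack is than exhaustive search."""
    return 2.0 ** (log2_brute_force - log2_attack)


if __name__ == "__main__":
    print(f"MD-4 preimage, 2^102 work: {years_of_work(102):.3g} years")
    print(f"MD-5 attack vs. brute force: "
          f"{speedup_over_brute_force(123.4, 128):.1f}x cheaper")
```

Even at that assumed rate, 2^102 operations take on the order of 10^5 years, and the MD-5 result is only about 24 times cheaper than trying every hash, which is the sense in which the improvement is "pretty trivial".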

A new hash algorithm for Git Posted Feb 3, 2020 22:02 UTC (Mon) by josh (subscriber, #17465)
The decision on SHA256 as the successor was made back in 2018. I wonder if the rationale still holds as strongly now as it did then? There are several new candidates that have substantially higher performance than SHA256, and in particular, a couple that have the advantage of supporting parallel hashing for large blocks of data, notably BLAKE3. (I *don't* want to bikeshed the hash selection here. But I wonder if that hash selection might be worth benchmarking and re-evaluating now that the infrastructure is ready.)

A new hash algorithm for Git Posted Feb 4, 2020 2:07 UTC (Tue) by KaiRo (subscriber, #1987)
I've wondered about that as well. SHA256 has good hardware support right now, but SHA3/keccak or even the very new blake3 would technically be better, though it will take some time until esp. the latter is supported in hardware; probably before SHA1 collisions become a practical problem in git repos, though. How flexible is the code in that patch to go right to an even newer hash algorithm?

A new hash algorithm for Git Posted Feb 4, 2020 9:07 UTC (Tue) by jwilk (subscriber, #63328)
https://www.imperialviolet.org/2017/05/31/skipsha3.html

A new hash algorithm for Git Posted Feb 4, 2020 14:40 UTC (Tue) by cesarb (subscriber, #6266)
BLAKE3 might have another potential advantage for Git: due to its tree structure, it could allow breaking large blobs into small pieces which can be hashed independently, without changing the final hash. This might help with some of the issues Git has with large files in a repository.
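
The tree-structure property mentioned above can be illustrated with a generic Merkle-tree sketch. This is not BLAKE3 and not anything Git does today: it uses SHA-256 from Python's standard library as a stand-in, and the chunk size, the domain-separation bytes, and all function names are assumptions made up for this example. It only shows the general idea that chunks can be hashed independently (in any order, or in parallel) and still combine into the same final hash.

```python
import hashlib

CHUNK_SIZE = 1024  # arbitrary chunk size chosen for this sketch


def split_chunks(data: bytes) -> list:
    """Cut the input into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def leaf_hash(chunk: bytes) -> bytes:
    """Hash one chunk; the 0x00 prefix marks it as a leaf."""
    return hashlib.sha256(b"\x00" + chunk).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    """Combine two child hashes; the 0x01 prefix marks an interior node."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def root_from_leaves(leaves: list) -> bytes:
    """Fold an ordered list of leaf hashes into a single root hash."""
    if not leaves:
        return leaf_hash(b"")
    level = list(leaves)
    while len(level) > 1:
        nxt = [parent_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired node is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def tree_hash(data: bytes) -> bytes:
    """Hash a whole blob by hashing its chunks and folding the results."""
    return root_from_leaves([leaf_hash(c) for c in split_chunks(data)])


if __name__ == "__main__":
    blob = bytes(range(256)) * 20
    # Hash the chunks out of order, as independent workers might ...
    out_of_order = {i: leaf_hash(c)
                    for i, c in reversed(list(enumerate(split_chunks(blob))))}
    # ... then put the leaf hashes back in position before folding.
    leaves = [out_of_order[i] for i in range(len(out_of_order))]
    print(root_from_leaves(leaves) == tree_hash(blob))
```

Because only the leaf hashes depend on the chunk contents, a change to one chunk requires rehashing that chunk and the nodes on its path to the root, rather than the whole blob.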

A new hash algorithm for Git Posted Feb 3, 2020 22:10 UTC (Mon) by newren (subscriber, #5160)
It's worth noting that Git already transitioned away from SHA1 to SHA1DC (SHA-1 with detection of collisions), using https://github.com/cr-marcstevens/sha1collisiondetection. This was done about 3 years ago, and prevents a lot of the existing sha1 attacks, including even the recent sha-mbles stuff (see e.g. https://lore.kernel.org/git/20200107203147.r33c5plp5g7pmx...) All that said, I'm glad Brian is doing such great work in transitioning the codebase over to a newer hash algorithm. It's a huge pile of work, and I'm glad he's been tackling it.

A new hash algorithm for Git Posted Feb 4, 2020 5:21 UTC (Tue) by pabs (subscriber, #43278)
I wonder what other changes should be added when changing the format of git repositories. For example: I would like to see restic/borg style rolling chunking, for more efficient storage of large files.

A new hash algorithm for Git Posted Feb 4, 2020 11:45 UTC (Tue) by keeperofdakeys (subscriber, #82635)
It's worth pointing out that the current collisions rely on inserting or appending an arbitrary amount of data to create the collision. Git stores both a type and size in a git commit, so it's much harder to successfully create a malicious object with the same hash as another compared to existing attacks. https://marc.info/?l=git&m=148787047422954

A new hash algorithm for Git Posted Feb 4, 2020 20:40 UTC (Tue) by tialaramex (subscriber, #21167)
Basically, no. Collisions are not a second pre-image attack. The bad guys create two blobs, which are the same size, and have the same hash but are different. They get to show you either blob and trick you by substituting the other one which you'll believe is the same because it has the same SHA-1.
An attacker would need to target git specifically, yes, but it isn't particularly more difficult as a result of tracking size and type.

A new hash algorithm for Git Posted Feb 4, 2020 15:04 UTC (Tue) by osma (subscriber, #6912)
I wonder if it would make sense to use a combination of SHA-1 and SHA-256 for the new hash, just concatenating them together. I know this is not much more secure than either hash alone in terms of cryptography, but then the shortened commit IDs would still remain the same and existing references in, say, commit messages and other sources would still be valid prefixes of the new hashes.

A new hash algorithm for Git Posted Feb 4, 2020 20:18 UTC (Tue) by Hattifnattar (subscriber, #93737)
Wow! This is a brilliant idea! I am not sure it will be adopted, though...

A new hash algorithm for Git Posted Feb 4, 2020 21:20 UTC (Tue) by meuh (subscriber, #22042)
+1, I like that. I've not found this suggestion being rejected in https://github.com/git/git/blob/v2.25.0/Documentation/tec... but I would assume there's a catch!

Copyright (c) 2020, Eklektix, Inc. Comments and public postings are copyrighted by their creators. Linux is a registered trademark of Linus Torvalds