https://lwn.net/SubscriberLink/811068/cfeb6a67b8dfbe47/

A new hash algorithm for Git

By Jonathan Corbet
February 3, 2020
The Git source-code management system is famously built on the SHA-1
hashing algorithm, which has become an increasingly weak foundation
over the years. SHA-1 is now considered to be broken and, despite the
fact that it does not yet seem to be so broken that it could be used
to compromise Git repositories, users are increasingly worried about
its security. The good news is that work on moving Git past SHA-1 has
been underway for some time, and is slowly coming to fruition; there
is a version of the code that can be looked at now.

How Git works, simplified

To understand why SHA-1 matters to Git, it helps to have an idea of
how the underlying Git database works. What follows is an
oversimplified view of how Git manages objects that can be skipped by
readers who are already familiar with this material.

Git is often described as being built on a content-addressable
filesystem -- one where you can look up an object if you know that
object's contents. That may not seem particularly useful, but there's
more than one way to "know" those contents. In particular, you can
substitute a cryptographic hash for the contents themselves; that
hash is rather easier to work with and has some other useful
properties.

Git stores a number of object types, using SHA-1 hashes to identify
them. So, for example, the SHA-1 hash of drivers/block/floppy.c in a
5.6-merge-window kernel, as calculated by Git, is
485865fd0412e40d041e861506bb3ac11a3a91e3. Conceptually, at least, Git
will store that version of floppy.c in a file, using that hash as its
name; early versions of Git actually did that. If somebody makes a
change to floppy.c, even just removing an extra space from the end of
a line, the result will have a completely different SHA-1 hash and
will be stored under a different name.

A Git repository is thus full of objects (often called "blobs") with
SHA-1 names; since a new one is created for each revision of a file,
they tend to proliferate. Your editor's kernel repository currently
contains 8,647,655 objects. But blobs are not the only types of
objects stored in a Git repository.

An individual file object holds a particular set of contents, but it
has no information about where that file appears in the repository
hierarchy. If floppy.c is moved to drivers/staging someday, its hash
will remain the same, so its representation in the Git object
database will not change. Keeping track of how files are organized
into a directory hierarchy is the job of a "tree" object. Any given
tree object can be thought of as a collection of blobs (each
identified by its SHA-1 hash, of course) associated with their
location in the directory tree. As one might expect, a tree object
has an SHA-1 hash of its own that is used to store it in the
repository.
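
The relationship between blobs and trees can be sketched roughly as follows. The helper names are hypothetical and the entry sorting is simplified; the point is only that a tree is a list of (mode, name, hash) entries that is itself hashed.

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a Git object: header plus body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_body(entries: dict) -> bytes:
    """Serialize {name: (mode, raw_hash)} as a tree body.

    Each entry is "<mode> <name>\\0" followed by the raw (binary) hash.
    Simplification: real Git sorts directory names as if they ended in "/".
    """
    out = b""
    for name in sorted(entries):
        mode, raw_hash = entries[name]
        out += mode + b" " + name + b"\0" + raw_hash
    return out


blob = git_object_hash(b"blob", b"/* floppy driver */\n")

# The same blob filed under two different names: the blob hash is
# untouched, but the trees that reference it hash differently.
tree_a = git_object_hash(b"tree", tree_body({b"floppy.c": (b"100644", blob)}))
tree_b = git_object_hash(b"tree", tree_body({b"floppy-old.c": (b"100644", blob)}))

print(blob.hex())
print(tree_a.hex(), tree_b.hex())
```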

Finally, a "commit" object records the state of the repository at a
particular point in time. A commit contains some metadata (committer,
date, etc.) along with the SHA-1 hash of a tree object reflecting the
current state of the repository. With that information, Git can check
out the repository at a given commit, reproducing the state of the
files in the repository at that point. Importantly, a commit also
contains the hash of the previous commit (or multiple commits in the
case of a merge); it thus records not just the state of the
repository, but the previous state, making it possible to determine
exactly what changed.

Commits, too, have SHA-1 hashes, and the hash of the previous commit
(or commits) is included in that calculation. If two chains of
development end up with the same file contents, the resulting commits
will still have different hashes. Thus, unlike some other source-code
management systems, Git does not (conceptually, at least) record
"deltas" from one revision to the next. It thus forms a sort of
blockchain, with each block containing the state of the repository at
a given commit.
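
To illustrate why two histories that end in the same file contents still produce different commit hashes, here is a simplified sketch. The author data is made up and fixed, and commit_hash is a hypothetical helper rather than anything from Git itself.

```python
import hashlib


def commit_hash(tree: str, parents: list, message: str) -> str:
    """SHA-1 of a simplified commit object (fixed, made-up author data)."""
    who = "A U Thor <author@example.com> 1580688000 +0000"
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + who, "committer " + who, "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

root_a = commit_hash(empty_tree, [], "start A")
root_b = commit_hash(empty_tree, [], "start B")

# Same tree, same message, same metadata, but different parents.
tip_a = commit_hash(empty_tree, [root_a], "same end state")
tip_b = commit_hash(empty_tree, [root_b], "same end state")

print(tip_a)
print(tip_b)
```

Because each parent hash is part of the hashed text, a change anywhere in the history propagates to every later commit.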

Why hash security matters

The compromise of kernel.org in 2011 created a fair amount of concern
about the security of the kernel source repository. If an attacker
were able to put a backdoor into the kernel code, the result could be
the eventual compromise of vast numbers of deployed systems.
Malicious code placed into the kernel's build system could be run
behind any number of corporate and government firewalls. It was not a
pleasant scenario but, thanks to the use of Git, it was also not a
particularly likely one.

Let us imagine that some attacker has gained control of kernel.org
and wants to place some evil code into floppy.c -- something
unspeakable like a change that replaces random sectors with segments
from Rick Astley videos, say. Somehow this change would have to be
incorporated into the repository so that it would be included in
subsequent pulls. But the change to floppy.c changes its SHA-1 hash;
that, in turn, will change every tree object containing the evil
floppy.c and every commit that includes it as well. The head commit
for the repository would certainly change, as would older ones if the
attacker tried to make the change appear to have happened in the
distant past.

Somewhere out there is certainly some developer who actually
memorizes SHA-1 hashes and would immediately notice a change like
that. The rest of us probably would not, but Git will. The
distributed nature of Git means that there are many copies of the
repository out there; as soon as a developer tries to pull from or
push to the corrupted repository, the operation will fail due to the
mismatched hashes between the two repositories and the corruption
will come to light.
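
The detection described above boils down to recomputing each object's hash and comparing it with the name it is stored under. The following is a toy sketch of that check, not how Git's transfer code is written; the store is a plain dictionary and the object header is omitted for brevity.

```python
import hashlib


def object_name(data: bytes) -> str:
    """Name an object by the SHA-1 of its contents (header omitted for brevity)."""
    return hashlib.sha1(data).hexdigest()


def find_corruption(store: dict) -> list:
    """Return the names whose stored contents no longer hash to that name."""
    return [name for name, data in store.items() if object_name(data) != name]


good = b"static int floppy_open(void);\n"
store = {object_name(good): good}
assert find_corruption(store) == []

# An attacker edits the contents in place but cannot change the name that
# every other copy of the repository already refers to.
store[object_name(good)] = b"static int never_gonna_give_you_up(void);\n"
print(find_corruption(store))
```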

Repository integrity is also protected by signed tags, which include
the hash for a specific commit and a cryptographic signature. The
chain of hashes leading up to a given tag cannot be changed without
invalidating the tag itself. The use of signed tags is not universal
in the kernel community (and rare to nonexistent in many other
projects), but mainline kernel releases are signed that way. When one
sees Linus Torvalds's signature on a tag, one knows that the
repository is in the state he intended when the tag was applied.

All of this depends on the strength of the hash used, though. If our
attacker is able to modify floppy.c in such a way that its SHA-1 hash
does not change, that modification could well go undetected. That is
why the news of SHA-1 hash collisions creates concern; if SHA-1
cannot be trusted to detect hostile changes, then it is no longer
assuring the integrity of the repository.

The world has not ended yet, fortunately. It is still reasonably
expensive to create any sort of SHA-1 hash collision at all. Creating
any new version of floppy.c with the same hash would be hard. An
attacker would not just have to do that, though; this new version
would have to contain the desired hostile code, still function as a
working floppy driver, and not look like an obfuscated C code contest
entry (at least not more than it already does). Creating such a beast
is probably still unfeasible. But the writing is clearly on the wall;
the time when SHA-1 is too weak for Git is rapidly approaching.

Moving to a stronger hash

Back in the early days of Git, Torvalds was unconcerned about the
possibility of SHA-1 being broken; as a result, he never designed in
the ability to switch to a different hash; SHA-1 is fundamental to
how Git operates. As of 2017, the Git code was full of declarations
like:

    unsigned char sha1[20];

In other words, the type of the hash was deeply wired into the code,
and it was assumed that hashes would fit into a 20-byte array.

At that time, Git developer brian m. carlson was already at work to
separate the Git core from the specific hash being used; indeed, he
had been working on it since 2014. It was unclear what hash might
eventually replace SHA-1, but it was possible to create an abstract
type for object hashes that would hide that detail. At this point,
that work is done and merged.
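
The idea of hiding the hash behind an abstract type can be sketched as follows. This is an illustration in Python rather than Git's C code, and the names HashAlgo, rawsz, hexsz, and blob_id are chosen for this example only.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HashAlgo:
    """Everything the rest of the code needs to know about a hash."""
    name: str
    rawsz: int              # digest size in bytes
    new: Callable           # constructor for a hashlib-style hasher

    @property
    def hexsz(self) -> int:
        return 2 * self.rawsz


SHA1 = HashAlgo("sha1", 20, hashlib.sha1)
SHA256 = HashAlgo("sha256", 32, hashlib.sha256)


def blob_id(algo: HashAlgo, content: bytes) -> str:
    """Name a blob without assuming the digest is 20 bytes long."""
    hasher = algo.new()
    hasher.update(b"blob " + str(len(content)).encode("ascii") + b"\0")
    hasher.update(content)
    return hasher.hexdigest()


for algo in (SHA1, SHA256):
    name = blob_id(algo, b"hello\n")
    print(algo.name, algo.hexsz, name)
```

Code written against such a descriptor never mentions the number 20, so adding a third algorithm later would mean adding one more descriptor.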

The decision on a replacement hash algorithm was made in 2018. A
number of possibilities were considered, but the Git community
settled on SHA-256 as the next-generation Git hash. The commit
enshrining that choice cites its relatively long history, wide
support, and good performance. The community has also decided on (and
mostly implemented) a transition plan that is well documented; most
of what follows is shamelessly cribbed from that file.

With the hash algorithm abstracted out of the core Git code, the
transition is, on the surface, relatively easy. A new version of Git
can be made with a different hash algorithm, along with a tool that
will convert a repository from the old hash to the new. With a simple
command like:

   git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
        --liability-waiver=none --use-shovels --carbon-offsets

a user can leave SHA-1 behind (note that the specific command-line
options may differ). There is only one problem with this plan,
though: most Git repositories do not operate in a vacuum. This sort
of flag-day conversion might work for a tiny project, but it's not
going to work well for a project like the kernel. So Git needs to be
able to work with both SHA-1 and SHA-256 hashes for the foreseeable
future. There are a number of implications to this requirement that
make themselves felt throughout the system.

One of the transition design goals is that SHA-256 repositories
should be able to interoperate with SHA-1 repositories managed by
older versions of Git. If kernel.org updates to the new format,
developers running older versions should still be able to pull from
(and push to) that site. That will only happen if Git continues to
track the SHA-1 hashes for each object indefinitely.

For blobs, this tracking will happen through the maintenance of a set
of translation tables; given a hash generated with one algorithm, Git
will be able to look up the corresponding hash from the other.
Needless to say, this lookup will only succeed for objects that are
actually in the repository. These translation tables will be
maintained in the "pack files" that hold most objects in a
contemporary Git repository. There will be a separate table for
"loose objects" that are stored as separate files rather than in
packs; the cost of lookups in that table is seen as being high enough
that measures need to be taken to minimize the number of loose
objects in any given repository.
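
A translation table of this kind is, conceptually, a two-way map. The sketch below uses in-memory dictionaries and an invented TranslationTable class; the on-disk format used in pack files is not shown here.

```python
import hashlib


class TranslationTable:
    """Two-way map between the SHA-1 and SHA-256 names of stored objects."""

    def __init__(self):
        self._to_sha256 = {}
        self._to_sha1 = {}

    def add(self, kind: bytes, body: bytes) -> None:
        data = kind + b" " + str(len(body)).encode("ascii") + b"\0" + body
        old = hashlib.sha1(data).hexdigest()
        new = hashlib.sha256(data).hexdigest()
        self._to_sha256[old] = new
        self._to_sha1[new] = old

    def to_sha256(self, sha1_name: str):
        # Returns None for objects that were never added to the repository.
        return self._to_sha256.get(sha1_name)

    def to_sha1(self, sha256_name: str):
        return self._to_sha1.get(sha256_name)


table = TranslationTable()
table.add(b"blob", b"hello\n")
new_name = table.to_sha256("ce013625030ba8dba906f756967f9e9ca394464a")
print(new_name)
print(table.to_sha1(new_name))
print(table.to_sha256("0" * 40))   # unknown object
```

This direct approach only works for blobs, whose bytes are identical under both hashes; trees and commits embed other hashes and need rewriting first.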

The handling of other object types is a bit more complicated. An
SHA-1 tree object, for example, must contain SHA-1 hashes for the
objects in the tree. So if such a tree object is requested, Git will
have to locate the SHA-256 version, then translate all the object
hashes contained within it before returning it. Similar translations
will be required for commits. Signed tags will contain both hashes.
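
Translating a tree might look roughly like this sketch, which assumes the simple "<mode> <name>NUL<raw hash>" entry layout and takes the translation table as a plain dictionary. The function names and the placeholder digests are hypothetical.

```python
def parse_tree(body: bytes, rawsz: int) -> list:
    """Split a tree body into (mode, name, raw_hash) entries."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode, name = body[pos:space], body[space + 1:nul]
        raw_hash = body[nul + 1:nul + 1 + rawsz]
        entries.append((mode, name, raw_hash))
        pos = nul + 1 + rawsz
    return entries


def translate_tree(body: bytes, sha256_to_sha1: dict) -> bytes:
    """Rewrite a SHA-256 tree body so that it refers to SHA-1 names."""
    out = b""
    for mode, name, raw_hash in parse_tree(body, rawsz=32):
        # Raises KeyError if the referenced object is not in the table.
        out += mode + b" " + name + b"\0" + sha256_to_sha1[raw_hash]
    return out


# Placeholder digests standing in for one object's two names.
new_hash, old_hash = b"\x22" * 32, b"\x11" * 20
sha256_tree = b"100644 floppy.c\0" + new_hash
sha1_tree = translate_tree(sha256_tree, {new_hash: old_hash})
print(parse_tree(sha1_tree, rawsz=20))
```

Only after this rewrite can the SHA-1 name of the tree itself be computed, which is why trees and commits cost more to translate than blobs.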

With this machinery in place, Git installations will be interoperable
during the transition. Eventually, all users will have upgraded to
SHA-256-capable versions of Git, at which point repository owners
could begin turning off the SHA-1 capability and removing the
translation tables. The transition will, at that point, be complete.

Some inconvenient details

There are likely to be some glitches along the way, naturally. One of
them is a simple human-factors problem: when a user supplies a hash
value, should it be interpreted as SHA-1 or SHA-256? In some cases,
it's unambiguous; SHA-1 hashes are 160 bits wide, so a 256-bit hash
must be SHA-256, for example. But a shorter hash could be either,
since hashes can be (and often are) abbreviated. The transition
document describes a multi-phase process during which the
interpretation of hash values would change, but most users are
unlikely to go through that process.
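
The ambiguity can be made concrete with a small sketch. The function below is hypothetical, not part of Git, and simply reports when length alone settles the question.

```python
import string

SHA1_HEXSZ = 40
SHA256_HEXSZ = 64


def guess_algorithm(hex_name: str) -> str:
    """Guess which hash a (possibly abbreviated) hex object name belongs to."""
    if not hex_name or any(c not in string.hexdigits for c in hex_name):
        raise ValueError("not a hex object name: %r" % hex_name)
    if len(hex_name) > SHA256_HEXSZ:
        raise ValueError("too long for either hash")
    if len(hex_name) > SHA1_HEXSZ:
        return "sha256"        # too long to be any SHA-1 name
    return "ambiguous"         # could be SHA-1 or an abbreviated SHA-256


print(guess_algorithm("f787cac"))
print(guess_algorithm("a" * 64))
```

Anything of 40 hex digits or fewer, which covers nearly everything users type, falls into the ambiguous case.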

There is, of course, a way to unambiguously give a hash value in the
new Git code, and they can even be mixed on the command line; this
example comes from the transition document:

    git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

For a Git user interface this is relatively straightforward and
concise, but one can still imagine that users might tire of it
relatively quickly. The obvious solution to this sort of bracket
fatigue is to fully transition a project to SHA-256 as quickly as
possible.
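
For readers curious how such suffixed names could be picked apart, here is a small sketch. It handles only the ^{sha1} and ^{sha256} suffixes shown in the example above, and the function names are invented; Git's real revision parser accepts far more.

```python
import re

# An abbreviated or full hex name, optionally followed by ^{sha1} or ^{sha256}.
_NAME = re.compile(r"^([0-9a-f]+)(?:\^\{(sha1|sha256)\})?$")


def parse_object_name(text: str):
    """Return (hex_name, algorithm); algorithm is None if not specified."""
    match = _NAME.match(text)
    if match is None:
        raise ValueError("unrecognized object name: %r" % text)
    return match.group(1), match.group(2)


def parse_range(text: str):
    """Split "A..B" into two parsed object names."""
    left, right = text.split("..")
    return parse_object_name(left), parse_object_name(right)


print(parse_range("abac87a^{sha1}..f787cac^{sha256}"))
```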

There is another issue out there, though: there are a lot of SHA-1
hash values in the wild. The kernel repository currently contains
over 40,000 commits with a Fixes: tag; each one of those includes an
SHA-1 hash. These hash values also can be found in bug-tracker
histories, release announcements, vulnerability disclosures, and
more. In a repository without SHA-1 compatibility, all of those
hashes will become meaningless. To address this issue, one can
imagine that the Git developers may eventually add a mode where
translations for old SHA-1 hashes remain in the repository, but no
SHA-1 hashes for new objects are added.
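
To get a feel for how such references are embedded in commit messages, the sketch below pulls the hashes out of Fixes: trailers. The regular expression and the sample message are illustrative assumptions, and the hash in the sample is a placeholder.

```python
import re

# Matches trailers such as:  Fixes: 0123456789ab ("subject line")
FIXES = re.compile(r"^Fixes:\s+([0-9a-f]{7,40})\b", re.MULTILINE)


def fixes_hashes(commit_message: str) -> list:
    """Return the abbreviated SHA-1 names cited in Fixes: trailers."""
    return FIXES.findall(commit_message)


message = """floppy: stop playing videos

Restore the sectors.

Fixes: 0123456789ab ("floppy: add video support")
Signed-off-by: A U Thor <author@example.com>
"""
print(fixes_hashes(message))
```

Every hash found this way is frozen into history as an abbreviated SHA-1 name, so it can only be resolved while a translation for it still exists.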

Current state

Much of the work to implement the SHA-256 transition has been done,
but it remains in a relatively unstable state and most of it is not
even being actively tested yet. In mid-January, carlson posted the
first part of this transition code, which clearly only solves part of
the problem:

    First, it contains the pieces necessary to set up repositories
    and write _but not read_ extensions.objectFormat. In other words,
    you can create a SHA-256 repository, but will be unable to read
    it.

The value of write-only repositories is generally agreed to be
relatively low; not even SCCS was so limited. Carlson's purpose in
posting the code at this stage is to try to reveal any core issues
that will be harder to change as the work progresses. Developers who
are interested in where Git is going may well want to take a close
look at this code; converting their working repositories over is not
recommended, though.

As it turns out, carlson's work goes well beyond what has been put
out for testing now; he will post it when he is ready, but really
curious people can see it now in his GitHub repository. This work is
unlikely to land on the systems of most Git users for some time yet,
but it is good to know that it is getting close to ready. The Git
developers (carlson in particular) have quietly been working on this
project for years; we will all benefit from it.

-----------------------------------------

A new hash algorithm for Git

Posted Feb 3, 2020 18:15 UTC (Mon) by IanKelling (subscriber, #89418)

Great article. I'd love to see a similar one about GPG and SHA-1.

A new hash algorithm for Git

Posted Feb 3, 2020 18:34 UTC (Mon) by zdavatz (subscriber, #70954)

Great article, thank you!

A new hash algorithm for Git

Posted Feb 3, 2020 18:37 UTC (Mon) by Cyberax (supporter, #52523)

> There is, of course, a way to unambiguously give a hash value in
> the new Git code, and they can even be mixed on the command line;
> this example comes from the transition document

One trick that worked for me in a similar case was to switch the
encoding. SHA-1 is encoded as hex digits; we can simply switch
SHA-256 to be encoded as the letters "g" to "v", so those hashes will
be immediately recognizable.
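
A minimal sketch of the re-encoding idea in this comment, assuming a straight digit-for-letter substitution (0 becomes "g", f becomes "v"). The function names are invented, and nothing like this exists in Git.

```python
HEX_DIGITS = "0123456789abcdef"
LETTERS = "ghijklmnopqrstuv"

TO_LETTERS = str.maketrans(HEX_DIGITS, LETTERS)
TO_HEX = str.maketrans(LETTERS, HEX_DIGITS)


def encode_sha256_name(hex_name: str) -> str:
    """Spell a hex digest using only the letters g through v."""
    return hex_name.lower().translate(TO_LETTERS)


def decode_name(name: str):
    """Return (algorithm, hex_name) based on which alphabet was used."""
    if name and all(c in LETTERS for c in name):
        return "sha256", name.translate(TO_HEX)
    if name and all(c in HEX_DIGITS for c in name):
        return "sha1", name
    raise ValueError("mixed or unknown alphabet: %r" % name)


print(encode_sha256_name("12345678"))
print(decode_name("hijklmno"))
print(decode_name("12345678"))
```

Because the two alphabets share no characters, even a one-character abbreviation identifies its algorithm.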

A new hash algorithm for Git

Posted Feb 3, 2020 18:41 UTC (Mon) by juliank (subscriber, #45896)

This sounds like a totally reasonable thing to do.

A new hash algorithm for Git

Posted Feb 3, 2020 23:48 UTC (Mon) by dsommers (subscriber, #55274)

No, not really. What josh suggests, prefixing the string, makes more
sense.

* performance: Doing char replacing in strings is more CPU intensive
than just skipping one single byte and continue using standard
functions/libraries. This gets more evident when considering
large repositories like the Linux kernel.

* future compatibility: Shifting a-f chars to another set of 6 other
letters will only work 3 more times if only considering lower case
letters - 6 letters (a-f) * 4 shifts = 24. So at the fifth change,
something new must be done to avoid breaking compatibility. Of course
the counter argument is "how often will such new algorithms occur in
reality?"; but none of us really knows that for sure - just as we
don't know how long a git repository will live and be accessed.

From this article (I've not paid attention to discussions in the git
community), it seems like they account for the possibility of
changing it again later on. So having a prefix possibility with just
one prefix or suffix letter makes it possible to change algorithms 26
times, with no performance loss (except the "skip one byte" operation
when evaluating the hash). If that is too little, 3 letters gives the
possibility for 17576 changes; which is probably enough for most of
us alive today - but using 4 letters increases that once again to an
even more insane number.

But say you then settle for a 4-letter prefix (456,976 possibilities)
... then you're not that far away from {sha256} which is 8 letters,
with basically an unlimited amount of algorithm changes. What is
inside the {} can be any length while containing a good description
of what kind of algorithm is in use, without needing to look up that
"AAAC" means SHA512.

A new hash algorithm for Git

Posted Feb 4, 2020 0:10 UTC (Tue) by Cyberax (supporter, #52523)

A prefix would also work, but let's limit it to 1 letter. This would
realistically give more than enough coding space to last until git is
no longer useful. It can also be extended to two characters later if
needed.

A new hash algorithm for Git

Posted Feb 4, 2020 5:56 UTC (Tue) by eru (subscriber, #2753)

One prefix letter would allow signifying only 20 possible hash
algorithms, because you should avoid [a-f] that can start a valid hash
value in the current scheme.
But probably that would be enough; it is unlikely the hash changes
more than once in a decade...

A new hash algorithm for Git

Posted Feb 4, 2020 6:06 UTC (Tue) by Cyberax (supporter, #52523)

You can then use the UTF-8-like encoding, reserving 1 bit for "next
byte continues the encoding ID" flag. So you can extend it
indefinitely.

A new hash algorithm for Git

Posted Feb 4, 2020 16:07 UTC (Tue) by mathstuf (subscriber, #69389)

There are also uppercase letters if we really get desperate :) .

A new hash algorithm for Git

Posted Feb 4, 2020 19:40 UTC (Tue) by quotemstr (subscriber, #45331)

> Doing char replacing in strings is more CPU intensive than just
> skipping one single byte and continue using standard
> functions/libraries

Hash functions don't operate on the hex encoding of the hash digest.
If you need to parse base-16 to binary anyway, there's no penalty
arising from choosing an alternate set of characters to represent
that base-16 value.

A new hash algorithm for Git

Posted Feb 3, 2020 21:44 UTC (Mon) by josh (subscriber, #17465)

Or just add a single special character at the beginning, like a
capital H. (Using a letter will make sure that people's "select by
word" mechanisms pick it up.)

A new hash algorithm for Git

Posted Feb 4, 2020 17:13 UTC (Tue) by excors (subscriber, #95769)

Gerrit already uses a SHA-1 prefixed with "I" for its Change-Id (a
persistent identifier of a patch). Are there any other popular
Git-related tools that use a similar pattern? If Git started adding
its own prefix letters, it would be nice to avoid ambiguity with
them.

A new hash algorithm for Git

Posted Feb 4, 2020 14:58 UTC (Tue) by ballombe (subscriber, #9523)

Does not work in general:
12345678 is perfectly valid in both notations.

A new hash algorithm for Git

Posted Feb 4, 2020 15:51 UTC (Tue) by willy (subscriber, #9762)

He didn't say "use the digits 0123456789ghijkl". He said "use the
digits ghijklmnopqrstuv".

A new hash algorithm for Git

Posted Feb 3, 2020 18:52 UTC (Mon) by meyert (subscriber, #32097)

I wonder if the much increased complexity is really worth the value
given a very theoretical hash collision.

A new hash algorithm for Git

Posted Feb 3, 2020 19:03 UTC (Mon) by martin.langhoff (subscriber, #61417)

At some point, it will be worthwhile. We don't know exactly when
that'll be, but the trick is to do it _before_ that inflection point.

A new hash algorithm for Git

Posted Feb 3, 2020 19:58 UTC (Mon) by mirabilos (subscriber, #84359)

By then, SHA-256 will be broken as well. SHA-2 uses the same
underlying structure as SHA-1 and is almost only more secure due to
its length. Anything new deployed now should use SHA-3 (Keccak) right
from the start. The comparison with OpenPGP also falls short: people
can choose the hash algorithm there (even though a gpg2 --version
shows there's no SHA-3 yet).

Also, I wonder, will I be able to verify old signed commits and tags
after the transition is complete? Doesn't seem so...

A new hash algorithm for Git

Posted Feb 3, 2020 22:02 UTC (Mon) by Cyberax (supporter, #52523)

The best attacks on SHA-1 reduce complexity from 2^80 (still
unfeasible to brute-force) to 2^68 (just barely feasible). That's
a speedup of about 2^12.

SHA-256 starts with a collision complexity of 2^128; any realistic
attack won't lower the complexity below 2^100 (WAY outside of
possible attacks).

A new hash algorithm for Git

Posted Feb 4, 2020 2:47 UTC (Tue) by wahern (subscriber, #37304)

The recently published SHAttered attack (https://shattered.it/) took
~2^63 computations. That said, AFAIU none of the recent SHA-1 attacks
carry over to SHA-256. And other than length extension attacks, I
don't think the Merkle-Damgard construction is considered
fundamentally broken; it's just well analyzed.

A new hash algorithm for Git

Posted Feb 4, 2020 11:32 UTC (Tue) by heftig (subscriber, #73632)

Since Git prefixes an object with its length before hashing it, does
length extension still apply?

A new hash algorithm for Git

Posted Feb 4, 2020 2:16 UTC (Tue) by KaiRo (subscriber, #1987)

For signed commits or other signatures, the other question is
quantum-safety of the signatures themselves, which is also probably
not ensured right now. I'm actually a bit more worried about
switching to quantum-safe asymmetric crypto than about those hash
collisions, but both are somewhat worrisome.

A new hash algorithm for Git

Posted Feb 4, 2020 2:22 UTC (Tue) by mirabilos (subscriber, #84359)

Quantum what?

I was just wondering because the signature is over the SHA-1 hash.

A new hash algorithm for Git

Posted Feb 4, 2020 2:39 UTC (Tue) by KaiRo (subscriber, #1987)

"Signature" usually means that you sign some arbitrary data (in this
case a SHA-1 hash) using some asymmetric crypto key material (in this
case usually some RSA variant). RSA and other asymmetric crypto
algorithms commonly used nowadays are not safe from being cracked by
quantum computers once we have some with enough capacity. That puts all
signatures, identification and encryption based on those algorithms
at risk once we have those kinds of quantum computers, so where we
use those we will need to find solutions for that (quantum-safe
algorithms are in development or testing right now but not finalized
AFAIK). The common hash algorithms have no big issues with that, so
it doesn't affect git itself directly, but it certainly does or will
affect the signatures of signed commits.

A new hash algorithm for Git

Posted Feb 4, 2020 19:57 UTC (Tue) by mirabilos (subscriber, #84359)

I know what a signature is and all that, but I absolutely don't get
where you are going with this.

When I currently have a signed commit...

-----BEGIN cutting here may damage your screen surface-----
$ git cat-file -p HEAD
tree 937122472a792ada03309a60b7a31e02a29aa764
parent 53861b4a1544c7c8825f1414c37c9694c84c5d92
author mirabilos <m@mirbsd.org> 1580771045 +0100
committer mirabilos <mirabilos@evolvis.org> 1580771470 +0100
gpgsig -----BEGIN PGP SIGNATURE-----
Comment:  TsTF--8 [?]

iQIcBAABCQAGBQJeOKiPAAoJEIlQwYleuNOzSzYP/3xowIYpxJwuHfdP8oRekbSZ
eVI9mO5g8KC+SUe5oGCbocH478pBUp5AOYlFGL0awetklijRmF+EeYp+a1IluCww
GD2pSPFCpxSjScERlED5YYpfaaw1XEutoGHYQNMAUQhlRMzS8NwhGJjTuoIbvE4X
hMntoMtDM7sPJ3CIADIoYzXIcdaqsELvqptuvNdo9S/PIyR6OFWhpF68Qn+SILqk
N+fOA/KpgQLsRmMEVy3YtqmMdToYXoP3m4ec0/QSoN90QVrO9ZnVG2+0f9yeEiVn
xEWiaSSsz5vtniBLzOvQ6FeE0h08ZsQi9dcTj8aq3tDtUJb2sQi6q79Gl5StmfHI
8HN9q8ZQP/Vh8kIT5z3lcuNnb3y7sc90ZzY5i7Q2YwfKNbJ5mAEMvSgzBxcrDflR
/kjUJcXJg98IzJsWbE3k9gRc9yatqKQii0GiaxID13fCfl++4klJrFMEyoTdhta4
5a7vGa6OuHr+MWsT+35yQsR6Mt1DnMY2oNArTgWG3DfNQK8zb7rIExPbuV6pLP2O
X67ZCVSHwRTrLWnDHjSuQH4Hfoibq96Ga9wJwEjw0+sWKzg4CgvQH6L+UiXIZO0/
2+hhF507WUCKh8Uit2nrRsGhVnXJrI5QZsD857oAifcBFslbTLwTCkj+3gccHxwH
A/BAeG4zN0JrdvMzx0pN
=9w0P
-----END PGP SIGNATURE-----

erm yes, the symlink...
-----END cutting here may damage your screen surface-----

... or tag...

-----BEGIN cutting here may damage your screen surface-----
$ git cat-file -p mksh-57-6
object 3ece4d6c67f32b8e2b9b00900d05cc06c658fc87
type commit
tag mksh-57-6
tagger mirabilos <mirabilos@evolvis.org> 1580771932 +0100

mksh (57-6) unstable; urgency=low
-----BEGIN PGP SIGNATURE-----
Comment:  TsTF--8 [?]

iQIcBAABCQAGBQJeOKpdAAoJEIlQwYleuNOzE3EP/1Qu6w3ZnelCbTcR0/lR1QaH
qisRANlIKYq0MVDOmhzGZ4m6/ri9b2njI16x0R3otaIT2QfG2ldj8U/Sq7Vpm6Xb
uTpMluMzFj6sungPYOCvgbDVcVqt4+qCAwtFL5Lt2gpfN45KwYO0RdrSCY8wFD3N
TO3Wq7M3DXt99F9mMY/L+XfvbpDAMzjCEK0tgTAal4QWnnb7V2Y1bVnZjos5XZTV
hWW4kJMqBp2Hf99KLqnjijfPgZkqbSMYKy14Nsqo1cSujwPpOH2MgDbyuun1SuSA
K6U0JT1iyIsL/ixkCx8vi6ejIGGQXXpGEq4K4RA3Wc4ALB/FWC9Y2MrCEExG0wEV
tDkto90sbD6Nymnii1apG2Q7aSyDNDjsiRT2tzYN2S5EzItYtV0V8ZXoxiYk/c/Z
ttAcdXxh8R4+5p3yNYwAjTSzZe8ohvgHFXoAUGVpk7g9oArlNiJmqkrW3BGdFrCb
gH0h4UpiXr3pgnlPi247alGT18Xly5cBX3CbjORGDNsUDZoGPLlVuyW46PaRel3V
P8BODtOoFkoK7JyFCRP70Z97vQig+L9nbN5tf50haYlxhO7oOSU7RzQJxgv2tLza
AT0bg6Wfs4I9VV/MjocIirwrbihZY1gMgURgad5PdoNjoyNy+vd6OKMFQm1i/eUF
hGIwKngrue1A9RMKPaCG
=JPiZ
-----END PGP SIGNATURE-----
-----END cutting here may damage your screen surface-----

... these hardcode the SHA-1 hashes. These are, thus, needed to verify
the signature. This also cannot be rewritten.

As a user, I'd expect that, after full git conversion to a new hash,
I'll still be able to verify these. That was the question.

Maybe Skip SHA-3

Posted Feb 4, 2020 8:19 UTC (Tue) by tialaramex (subscriber, #21167)

Adam Langley suggests sticking with the SHA-2 family while things
shake out in the relatively new frontier that is Keccak-style
algorithms.

https://www.imperialviolet.org/2017/05/31/skipsha3.html

SHA-3 is significantly slower than SHA-2, which is already very slow
for a hash (if we didn't need a crypto hash, there are lots of very,
very fast hashes used elsewhere). That is a big penalty when you
aren't buying, say, future proofing, and you aren't: SHA-3 was agreed
on well before the dust settled on how to do this style of hash.
There are currently half a dozen like it, all seemingly secure, most
faster, none standardised. This isn't like AES, where the rough
direction is understood and you're now buying hardware that
accelerates it, so that not doing AES ends up slower because you lose
hardware assist.

Langley recommends SHA-512/256 (note for those unfamiliar this is
literally the name of the hash, not two different hashes you can pick
from) if you care about length extension attacks and otherwise
SHA-256 is fine. The reason for SHA-512/256 is that the output isn't
the entire internal state, it's only half the state, meaning a length
extension fails, and it only needs the same size structure to store
the hash as SHA-256 (but it is slower).

A new hash algorithm for Git

Posted Feb 4, 2020 10:41 UTC (Tue) by epa (subscriber, #39769)

    Also, I wonder, will I be able to verify old signed commits and
    tags after the transition is complete?

Perhaps you would be able to verify them slowly by recomputing the
SHA-1 hashes of each object from scratch, even if they aren't stored
in the repository.

A new hash algorithm for Git

Posted Feb 3, 2020 20:00 UTC (Mon) by chfisher (subscriber, #106449)

Since it is generally conceded that the question is not "if SHA-1
will be compromised" but "when will SHA-1 be compromised", it
behooves us as developers to move to a more secure option BEFORE that
compromise occurs, since an exploit that successfully infects the
kernel would have such wide ranging (and expensive) implications.

A new hash algorithm for Git

Posted Feb 3, 2020 20:15 UTC (Mon) by dkg (subscriber, #55359)

A hash collision for SHA-1 is not theoretical at all. Rather, it is
within reach of a moderately funded attacker, on the order of $100K,
and has been practically demonstrated by a university+corporate team.
The price is expected only to fall.

The authors of the recent "SHA-ttered" collision have this to say
about git:

    GIT strongly relies on SHA-1 for the identification and integrity
    checking of all file objects and commits. It is essentially
    possible to create two GIT repositories with the same head commit
    hash and different contents, say a benign source code and a
    backdoored one. An attacker could potentially selectively serve
    either repository to targeted users. This will require attackers
    to compute their own collision.

Note that this weakness in git means that even git signatures made
with strong modern crypto are vulnerable, because they are signing
objects that refer to other objects only by their SHA-1 digest.

For instance, when signing tags, the signed tag itself cannot be
replaced, but the thing that the tag points to can be replaced
without invalidating the signature.

Kudos to carlson for having been working on this; it's a shame that
this kind of maintenance work never seems to get prioritized by
projects until there is a fire that needs putting out. It would have
been better if we had already completed this transition years ago.

A new hash algorithm for Git

Posted Feb 3, 2020 20:27 UTC (Mon) by walters (subscriber, #7396) [
Link]

See also https://github.com/cgwalters/git-evtag for a stronger signed
tag.

A new hash algorithm for Git

Posted Feb 3, 2020 21:11 UTC (Mon) by martin.langhoff (subscriber, #
61417) [Link]

It's been demonstrated on a pair of PDF files. The format is pretty
opaque to the typical end user, and the "good" file was pre-doctored.
These are very artificial conditions.

As many have pointed out, including this article, current attacks
that match the SHA-1 of an existing file that wasn't built to
facilitate the attack in the first place have to add a bunch of
"random" data to get to a collision. For a code file, which is the
typical content of git, that's pretty "visible".

A new hash algorithm for Git

Posted Feb 3, 2020 21:49 UTC (Mon) by dkg (subscriber, #55359) [Link]

I recommend reading Joey Hess's discussion from 2011 (in particular
the discussion in the comments) for why the legibility of the commit
messages and code objects typically covered by git is not necessarily
sufficient: other stuff can be included in the hashes that won't be
visible to normal end users. (maybe this was fixed in the last
decade? i haven't tested recently)

Even if it were somehow true that git hashes only cover the things
that are directly exposed to the user, "git history is
cryptographically strong for repositories that contain only
human-readable code" is a significant reduction in scope from "git
history is cryptographically strong". I don't think we want to make
that reduction, and i know of no repositories (and no tooling) that
would deliberately enforce that kind of limitation for the sake of
retaining cryptographic strength of the git history.

Also, many "code only" repositories contain the occasional binary
graphic file (screenshot, logo etc), firmware, test corpus, etc, all
of which could be used to hide the "tumor" needed for this kind of
collision attack.

This needs fixing, and we've known it needed fixing for nearly as
long as git has been around. Why advance an argument that seems like
it would only help to delay getting a fix deployed?

A new hash algorithm for Git

Posted Feb 3, 2020 23:39 UTC (Mon) by martin.langhoff (subscriber, #
61417) [Link]

To be clear, this is progress, and progress is needed. At some point,
SHA-1 will be truly broken in a "useful" way, with real-life impact,
and we had better have made the transition by then.

It's not known to be broken today in a useful, usable way, for the
typical uses of git.

A new hash algorithm for Git

Posted Feb 4, 2020 15:25 UTC (Tue) by joey (subscriber, #328) [Link]

It has not been fixed.

A new hash algorithm for Git

Posted Feb 3, 2020 22:03 UTC (Mon) by khim (subscriber, #9252) [Link]

I don't know where you get the notion that matching "an existing
file that wasn't built to facilitate the attack" is even remotely
possible.

Not even MD-4 is broken for preimage attack in practice. MD-4 was
"broken" with a preimage attack of complexity 2^102 - which is really
worrying: maybe in a few more years with some ASICs... maybe... Very
unlikely though: very few entities could spend literally trillions of
dollars to show that an old, almost completely forgotten hash is no
longer useful. There exists a theoretical attack on MD-5 of 2^123.4
complexity, but if you recall that there are 2^128 MD-5 hashes in
total... that's a pretty trivial improvement. SHA-1 doesn't even have
a theoretical preimage attack currently (but there are a few for
"reduced" versions, which means soon we'll see something for the full
one).

So no, don't expect preimage attack on SHA-1 to happen in your
lifetime... unless you plan to live for 300 years.

Now, collision attacks are pretty easy for MD-4, MD-5, and relatively
easy for SHA-1 (tens of thousands of dollars) - but they all require
the attacker to "plant a bomb" in the "good repo". These are still
nasty enough to worry about, but as you could guess urgency is quite
low: I still think it's cheaper to just submit a dozen patches with
subtle buffer overflows and get one of them accepted than to generate
such a collision. But the price goes down each year...

A new hash algorithm for Git

Posted Feb 3, 2020 21:56 UTC (Mon) by dkg (subscriber, #55359) [Link]

I should also mention that the "shambles" attack published in January
2020 claims costs of $11K (USD) for an arbitrary collision and $45K
(USD) for a chosen-prefix collision.

A new hash algorithm for Git

Posted Feb 3, 2020 22:02 UTC (Mon) by josh (subscriber, #17465) [Link
]

The decision on SHA256 as the successor was made back in 2018. I
wonder if the rationale still holds as strongly now as it did then?
There are several new candidates that have substantially higher
performance than SHA256, and in particular, a couple that have the
advantage of supporting parallel hashing for large blocks of data,
notably BLAKE3.

(I *don't* want to bikeshed the hash selection here. But I wonder if
that hash selection might be worth benchmarking and re-evaluating now
that the infrastructure is ready.)

A new hash algorithm for Git

Posted Feb 4, 2020 2:07 UTC (Tue) by KaiRo (subscriber, #1987) [Link]

I've wondered about that as well - SHA256 has good hardware support
right now but SHA3/keccak or even the very new blake3 would
technically be better, though it will take some time until esp. the
latter will be supported in hardware - probably before SHA1
collisions will be a practical problem in git repos though. How
flexible is the code in that patch to go right to an even newer hash
algorithm?

A new hash algorithm for Git

Posted Feb 4, 2020 9:07 UTC (Tue) by jwilk (subscriber, #63328) [Link
]

https://www.imperialviolet.org/2017/05/31/skipsha3.html

A new hash algorithm for Git

Posted Feb 4, 2020 14:40 UTC (Tue) by cesarb (subscriber, #6266) [
Link]

BLAKE3 might have another potential advantage for Git: due to its
tree structure, it could allow breaking large blobs into small pieces
which can be hashed independently, without changing the final hash.
This might help with some of the issues Git has with large files in a
repository.

A new hash algorithm for Git

Posted Feb 3, 2020 22:10 UTC (Mon) by newren (subscriber, #5160) [
Link]

It's worth noting that Git already transitioned away from SHA1 to
SHA1DC (SHA-1 with detection of collisions), using
https://github.com/cr-marcstevens/sha1collisiondetection. This was
done about 3 years ago, and prevents a lot of the existing sha1
attacks, including even the recent sha-mbles stuff (see e.g.
https://lore.kernel.org/git/20200107203147.r33c5plp5g7pmx...)

All that said, I'm glad Brian is doing such great work in
transitioning the codebase over to a newer hash algorithm. It's a
huge pile of work, and I'm glad he's been tackling it.

A new hash algorithm for Git

Posted Feb 4, 2020 5:21 UTC (Tue) by pabs (subscriber, #43278) [Link]

I wonder what other changes should be added when changing the format
of git repositories.

For example: I would like to see restic/borg style rolling chunking,
for more efficient storage of large files.

A new hash algorithm for Git

Posted Feb 4, 2020 11:45 UTC (Tue) by keeperofdakeys (subscriber, #
82635) [Link]

It's worth pointing out that the current collisions rely on inserting
or appending an arbitrary amount of data to create the collision. Git
stores both a type and size in the header of each object, so it's much
harder to successfully create a malicious object with the same hash
as another compared to existing attacks.

https://marc.info/?l=git&m=148787047422954
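
For readers who want to see where the type and size enter the hash,
here is a minimal sketch in Python of how Git derives an object name:
a short header of the form "<type> <size>" plus a NUL byte is
prepended to the content before hashing. The function name
git_object_id and its algo parameter are made up for this
illustration; they are not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Return the hex object name Git would assign to this content.

    Hypothetical helper for illustration only. Git hashes the header
    "<type> <size in bytes>\\0" followed by the raw content.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: echo hello | git hash-object --stdin
    print(git_object_id("blob", b"hello\n"))
```

Because the size is part of the hashed data, two colliding objects of
the same type must also agree on length, which constrains an attacker
but does not by itself rule out a collision.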

A new hash algorithm for Git

Posted Feb 4, 2020 20:40 UTC (Tue) by tialaramex (subscriber, #21167)
[Link]

Basically, no.

Collisions are not a second pre-image attack. The bad guys create two
blobs, which are the same size, and have the same hash but are
different. They get to show you either blob and trick you by
substituting the other one which you'll believe is the same because
it has the same SHA-1.

An attacker would need to target git specifically, yes, but it isn't
particularly more difficult as a result of tracking size and type.
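
The difference between a collision and a second preimage can be
illustrated with a toy example. The sketch below is not an attack on
SHA-1: it truncates SHA-1 to three bytes (an assumption made purely
so the search finishes quickly) and uses a plain birthday search. The
names toy_hash and find_collision are invented for this illustration.
The point is only that the searcher chooses both messages, and that
both can have the same size.

```python
import hashlib
from itertools import count
from typing import Dict, Tuple


def toy_hash(data: bytes, nbytes: int = 3) -> bytes:
    """First nbytes of SHA-1: a deliberately weak stand-in hash."""
    return hashlib.sha1(data).digest()[:nbytes]


def find_collision(nbytes: int = 3) -> Tuple[bytes, bytes]:
    """Birthday search in which the searcher picks BOTH messages."""
    seen: Dict[bytes, bytes] = {}
    for i in count():
        msg = b"blob-%08d" % i  # fixed width, so all messages have equal size
        digest = toy_hash(msg, nbytes)
        if digest in seen:
            # Two distinct messages with the same truncated digest.
            return seen[digest], msg
        seen[digest] = msg


if __name__ == "__main__":
    a, b = find_collision()
    print(a, b, toy_hash(a).hex())
```

Finding a second message that matches one fixed, pre-existing message
would instead require searching on the order of the whole 24-bit
output space of this toy hash, rather than roughly its square root.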

A new hash algorithm for Git

Posted Feb 4, 2020 15:04 UTC (Tue) by osma (subscriber, #6912) [Link]

I wonder if it would make sense to use a combination of SHA-1 and
SHA-256 for the new hash, just concatenating them together. I know
this is not much more secure than either hash alone in terms of
cryptography, but then the shortened commit IDs would still remain
the same and existing references in, say, commit messages and other
sources would still be valid prefixes of the new hashes.
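
A rough sketch of the concatenation idea, in Python. This is only an
illustration of the prefix property, not a proposal for how Git would
implement it: the function name combined_id is hypothetical, and it
hashes raw bytes rather than real Git objects with their headers and
cross-references.

```python
import hashlib


def combined_id(data: bytes) -> str:
    """Concatenate the SHA-1 and SHA-256 hex digests of the same data.

    Hypothetical helper for illustration only.
    """
    sha1_hex = hashlib.sha1(data).hexdigest()      # 40 hex characters
    sha256_hex = hashlib.sha256(data).hexdigest()  # 64 hex characters
    return sha1_hex + sha256_hex


if __name__ == "__main__":
    data = b"example content\n"
    new_id = combined_id(data)
    old_short_id = hashlib.sha1(data).hexdigest()[:12]
    # An abbreviated SHA-1 ID is still a prefix of the combined ID.
    print(new_id.startswith(old_short_id))
```

In a real repository the SHA-1 half would have to be computed over
objects that refer to each other by their old SHA-1 names, which this
sketch does not attempt to model.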

A new hash algorithm for Git

Posted Feb 4, 2020 20:18 UTC (Tue) by Hattifnattar (subscriber, #
93737) [Link]

Wow! This is a brilliant idea! I am not sure it will be adopted,
though...

A new hash algorithm for Git

Posted Feb 4, 2020 21:20 UTC (Tue) by meuh (subscriber, #22042) [Link
]

+1, I like that.

I've not found this suggestion being rejected in
https://github.com/git/git/blob/v2.25.0/Documentation/tec... but I
would assume there's a catch!


                  Copyright (c) 2020, Eklektix, Inc.
   Comments and public postings are copyrighted by their creators.
          Linux is a registered trademark of Linus Torvalds