[HN Gopher] Git's database internals I: packed object store
       ___________________________________________________________________
        
       Git's database internals I: packed object store
        
       Author : todsacerdoti
       Score  : 158 points
       Date   : 2022-08-29 13:01 UTC (9 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | legulere wrote:
       | > What if Git had a long-running daemon that could satisfy
       | queries on-demand, but also keep that in-memory representation of
       | data instead of needing to parse objects from disk every time
       | 
       | I guess that would be performance-wise the same as switching from
       | running the compiler in a one-off mode to a LSP analyzer.
        
       | srathi wrote:
       | Shameless plug: I gave a tech talk on this at my last company,
       | and implemented some of these in Golang as a learning exercise
        | [0]. While it is not mandatory to know the internals, doing so
        | helps a lot when you occasionally encounter a git-gotcha!
       | 
       | [0] https://github.com/ssrathi/gogit
        
       | masklinn wrote:
       | This article gives me whiplash, it starts at a high level, dives,
       | stops _just_ before it touches upon the implementation details
       | (and all the weird stuff which makes you realise the developers
       | were human after all) then shoots back up again.
       | 
       | > Git does not currently have the capability to update a packfile
       | in real time without shutting down concurrent reads from that
       | file. Such a change could be possible, but it would require
       | updating Git's storage significantly. I think this is one area
       | where a database expert could contribute to the Git project in
       | really interesting ways.
       | 
       | I'm not sure it would be super useful, because of the delta-
       | ification process. Adding "literal" (non-delta) files to a pack
       | isn't much of a gain (file contents are zlib-compressed either
       | way).
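A quick sketch of the point above (illustrative only, not Git's code): a loose object and a non-delta pack entry both hold a zlib-compressed copy of the content, so packing a "literal" object saves little beyond filesystem overhead.

```python
# Sketch: loose objects and non-delta pack entries both store
# zlib-compressed content, so the sizes end up comparable.
import zlib

content = b"hello world\n" * 1000

# Loose object: "blob <size>\0" header + content, all DEFLATEd together.
loose = zlib.compress(b"blob %d\x00" % len(content) + content)

# Non-delta pack entry: the content is zlib-compressed the same way
# (the type and size live in a small uncompressed binary header instead).
packed = zlib.compress(content)

print(len(content), len(loose), len(packed))
```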
        
         | kevincox wrote:
         | Also presumably "shutting down concurrent reads from that file"
         | is not much of a problem because the file can just be unlinked
         | and it will be GCed when the last reader finishes naturally.
         | Other than hung processes the only downside is extra disk space
         | usage and some cache.
        
       | josephg wrote:
       | One of the things that surprises me about Git is that it doesn't
       | store diffs. It just stores the full content of every file every
       | time it's modified and committed.
       | 
       | I work on concurrent editing systems (using CRDTs). People always
       | ask about how we deal with the problem that if we store changes
       | forever, the filesize grows indefinitely. Modern CRDTs are
       | massively more space efficient than git but nobody seems to
       | notice or care about the size of our git repositories. I think
       | git supports shallow clones but I don't know anyone who bothers
       | with them.
        
         | mandeepj wrote:
         | > It just stores the full content of every file every time it's
         | modified and committed.
         | 
         | That's not true; otherwise each git branch folder would be
         | huge, but it's not. It stores the snapshot of changes in branch
         | folders
        
           | thro388 wrote:
           | Maybe OP stores binary files?
        
             | sekao wrote:
             | Binary files are also delta'd in git.
        
           | _ikke_ wrote:
           | > It stores the snapshot of changes in branch folders
           | 
           | No, gp is right. There is no snapshot of changes. Each unique
            | version of a file is an object in the object store. That does
            | mean each commit only introduces new objects for files that
            | have changed (these are already added to the object store
            | when you use `git add`).
            | 
            | But the pack files, as the article explains (which are
            | created regularly during gc/maintenance runs), are why the
            | repo size can often be smaller than the working tree.
           | 
           | > Not only is each object compressed individually, they can
           | also be compressed against each other to take advantage of
           | common data.
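The content addressing mentioned above can be illustrated in a few lines (a sketch, not Git's implementation; it assumes a SHA-1 repository):

```python
# Sketch of Git's content addressing: a blob's object ID is the SHA-1
# of "blob <size>\0" + contents, so identical file versions dedupe to
# a single object, and `git add` can create it before any commit.
import hashlib

def blob_oid(content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty blob has a well-known ID in SHA-1 repositories.
print(blob_oid(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```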
        
           | CorrectHorseBat wrote:
           | Branches are just pointers to commits in Git. How those
           | commits are stored doesn't affect their size.
        
         | tomstuart wrote:
         | I know what you mean, but -- as the article explains [0] --
         | Git's packfile format uses delta compression, which is
         | essentially "storing diffs".
         | 
         | [0] https://github.blog/2022-08-29-gits-database-internals-i-
         | pac...
        
           | masklinn wrote:
           | > Git's packfile format uses delta compression, which is
           | essentially "storing diffs".
           | 
           | FWIW it's closer to lossless compression e.g. LZ77's
           | behaviour: git's delta compression has two commands which are
           | LITERAL and COPY, and technically it doesn't need to copy
           | linearly from the base object, it would be unlikely[0] but a
           | delta could start by COPY-ing the end of the base object,
           | then add literal data, then COPY the start of the base object
           | (or COPY the end again, or whatever, you get the point).
           | 
           | [0] outside of handcrafting, as the packing process would
           | need to find such a situation which it would estimate useful
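A minimal sketch of the LITERAL/COPY semantics described above, with hypothetical command tuples standing in for the real compact binary encoding:

```python
# Simplified sketch of git-style delta application: a delta is a list
# of commands, each either ("literal", data) or ("copy", offset, size)
# against the base object. Nothing forces copies to read the base in
# order, which is the point made above.
def apply_delta(base: bytes, commands) -> bytes:
    out = bytearray()
    for cmd in commands:
        if cmd[0] == "literal":
            out += cmd[1]
        else:  # ("copy", offset, size): read from anywhere in the base
            _, off, size = cmd
            out += base[off:off + size]
    return bytes(out)

base = b"the quick brown fox"
# Out-of-order copies: end of the base first, a literal, then the start.
delta = [("copy", 10, 9), ("literal", b" and "), ("copy", 0, 9)]
print(apply_delta(base, delta))  # b'brown fox and the quick'
```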
        
             | dunham wrote:
              | Many years ago, I wrote some code that made pack files
              | with out-of-order copy commands. (Just messing around in Go
              | trying to learn how pack files worked.)
             | 
             | I was too lazy to implement the diffing algorithm, so I
             | used the suffixarray from the go standard library and
             | greedily grabbed the longest match at each point in the
             | input. It's been a while, but I think I got a 12%
             | improvement on my test case. (Probably wouldn't do that
             | well in general.)
             | 
              | I assumed the git code just runs off of a diff (longest
              | common substring), but I never checked.
        
           | srvmshr wrote:
           | In my mental model of git internals: I envision the initial
           | commit stores a full copy in object store - and thereafter
           | the states are captured as diffs. If a new file is added at a
           | later commit stage, of course the "blob" will be stored, but
           | subsequent commits will add the deltas to it - not a full
           | copy.
        
             | masklinn wrote:
             | That's not the case.
             | 
             | The actual model of git internals is that every commit
             | always stores full copies (of modified entries, unmodified
             | entries are dedup'd through the magic of content
             | addressing).
             | 
             | Then _as an optimisation of the packing process_ git will
             | perform delta compression between objects based on a bunch
              | of heuristics. Until you pack you get full copies, and
              | depending how "hard" you pack you may or may not get huge
              | savings.
             | 
             | Furthermore, Git's pack files are ordered, so depending on
             | the selection it makes it can diff later files from earlier
             | ones (your model) or the reverse, or even both at the same
             | time.
             | 
             | You can see some of the early packing heuristics at https:/
             | /github.com/git/git/blob/master/Documentation/technic...
        
               | cryptonector wrote:
               | > That's not the case.
               | 
                | As far as the actual internals are concerned, yes. But as
                | a _model_ it's not _wrong_. It's like epicycles aren't
                | wrong as a model either. GP's mental model of Git is mine
                | as well, and it works very well for me.
        
               | masklinn wrote:
               | > But as a model it's not wrong.
               | 
               | As a model it's wrong, and useless, and possibly harmful.
               | 
               | Because the "states are captured as diffs" model does not
               | tell you anything true that's useful. And at worst it
               | gives you incorrect ideas of the performance
               | characteristics of the system e.g. that small changes are
               | bad, because lots of changes means lots of diffs to apply
               | which is slow.
               | 
                | By forgetting that model, you're strictly less wrong, and
                | no worse off at reasoning about git.
               | 
               | > It's like epicycles aren't wrong as a model either.
               | 
               | But they were. "Adding epicycles" has literally become a
               | saying about trying to pile on more crap to try and (fail
               | to) make an incorrect model work.
               | 
               | And epicycles were at least trying to keep a simple model
               | working, the "snapshot model" is more complicated than
               | git's actual behaviour to start with.
        
               | robertlagrant wrote:
               | Calling things harmful considered harmful.
        
               | cryptonector wrote:
                | Especially without an explanation.
                | 
                |     them: harmful
                |     me: how?
                |     them: really harmful
                |     me: sure, but, can you explain it?
                |     them: like a bajillion harmful
        
               | cryptonector wrote:
               | > As a model it's wrong, and useless, and possibly
               | harmful.
               | 
                | It's no more wrong than epicycles. It's not useless
                | either: it works very well for me. For _users_ (as
                | opposed to Git developers), it's hard to see how it
                | could be harmful.
        
               | johannes1234321 wrote:
               | The harmful part comes from assumptions based on it.
               | 
                | Systems like CVS or Subversion indeed store diffs of
                | files, conceptually and storage-wise. This has notable
                | consequences: when doing operations on a large span of
                | revisions, everything takes long because a bunch of diffs
                | have to be applied in sequence. This then leads to a
                | reluctance to make small commits, for the wrong reasons.
               | 
               | Wrong assumptions about the workings leading to wrong
               | decisions is harmful.
        
               | cryptonector wrote:
                | > This has notable consequences: when doing operations on
                | a large span of revisions, everything takes long because
                | a bunch of diffs have to be applied in sequence. This
                | then leads to a reluctance to make small commits, for the
                | wrong reasons.
               | 
               | This happens in Git as well though. If you have to rebase
               | your commits across a few thousand upstream commits,
               | you'll be in for some pain. (For example, I've a commit
               | to PostgreSQL to add ALWAYS DEFERRED constraints, and
               | I've had to rebase it across thousands of upstream
               | commits because I've lacked the time to see that
               | contribution across the finish line.)
               | 
               | You seem to be indirectly arguing for merge-based
               | workflows. (At least that's what I'm reading between the
               | lines. Perhaps my reading is wrong, but I'll run with it
               | until you tell me otherwise.) However, a) merging is not
               | always accepted by upstream, and b) merge-based workflows
               | suck.
               | 
               | I grant that merging is very simple with Git's internals:
               | Git does the three-way merge thing, you manually merge
               | conflicts, you create a commit for the conflict-resolved
               | checked out state, recording the two parents in that
               | commit, and you're done. This works because Git simply
               | stores the objects as they are, internally using deltas
               | -not quite diffs- to compress the object store. This is
               | very simple, yes, but it yields a workflow that doesn't
               | work for many projects. Rebasing is much better, and it
               | always works, especially with [0].
               | 
               | So now you might wonder how to deal with the rebase-
               | across-thousands-of-commits problem. Well, I've a
               | script[0] (not originally mine, but I've hacked on it)
               | that does something like bisection to find upstream
               | commits causing conflicts so as to greatly speed up the
               | process.
               | 
               | [0] https://gist.github.com/nicowilliams/ea2fa2b445c2db50
               | d2ee650...
        
               | johannes1234321 wrote:
                | If you have to rebase over many revisions you are likely
                | in trouble anyway. There, small commits can be helpful to
                | ease dealing with merge conflicts.
                | 
                | However, rebase is a single special operation. The diff-
                | based storage affects all operations: updating a local
                | repository from upstream, diffing across some distance,
                | ...
        
               | cryptonector wrote:
               | > The diff-based storage affects all.
               | 
               | No one is saying Git does or should do that.
               | 
               | > If you got to rebase over many decisions you are likely
               | in trouble anyways.
               | 
               | That script I linked made it easy.
               | 
               | > There small commits can be helpful to ease dealing with
               | merge conflicts.
               | 
               | They were indeed mostly small commits. Mine wasn't that
               | small -- it's fairly invasive and not possible to split
               | up into more than two commits.
        
             | cryptonector wrote:
             | This is a good mental model. The internals are the
             | internals. But the model users see is that a repository is
             | a bag of commits (deltas) that organize into a tree, with
             | branches and tags being names for specific commits. With
             | this model in mind anyone can become a Git power user.
        
               | stu2b50 wrote:
               | From a user point of view, you should not see commits as
               | diffs or deltas, but as a full snapshots of the contents
               | of the repository.
               | 
               | Delta compression is just an implementation detail for
               | storage optimization. Conceptually each commit is an
               | immutable snapshot in its own right.
        
               | cryptonector wrote:
               | > From a user point of view, you should not see commits
               | as diffs or deltas, but as a full snapshots of the
               | contents of the repository.
               | 
               | That tends to lead to a merge-centric workflow.
               | 
               | The alternative makes it easy to understand rebase. Ergo
               | it's good.
        
               | stu2b50 wrote:
               | I don't see any particular reason that'd be the case.
               | Even with rebase, the diff model causes some of git's
               | behavior to no longer make sense. For instance, the fact
               | that rebase causes an entirely new series of commits to
               | be made - this is because rebase is just a layer over
               | cherry-picking commits, and that cherry-picking commits
               | calculates the diffs between two commits then applies the
               | diff as its own, new commit.
               | 
                | If commits were diffs, then that process doesn't make any
                | sense. It only makes sense when you think of commits as
                | snapshots, and cherry-picking as taking a commit,
                | calculating the diff with the current commit, and then
                | applying that diff to make a brand new commit.
        
               | cryptonector wrote:
               | Yes, rebase is a series of cherry-picks. Each commit
               | picked is treated as a set of diffs that are applied with
               | conflict resolution, then committed, then off to the
               | next. The mental model of commits as diffs works,
               | obviously.
               | 
                | > If commits were diffs, then that process doesn't make
                | any sense. It only makes sense when you think of commits
                | as snapshots, and cherry-picking as taking a commit,
                | calculating the diff with the current commit, and then
                | applying that diff to make a brand new commit.
               | 
                | Sure it does. It works for me. You can use whatever
                | mental model of Git works for you.
        
               | josephg wrote:
               | > But the model users see is that a repository is a bag
               | of commits (deltas)
               | 
               | Commits in your .git directory aren't stored using deltas
               | from a previous commit. Each different version of every
               | file is stored in full.
        
               | cryptonector wrote:
               | Surely you must know what the word "model" means.
        
               | juped wrote:
               | It doesn't mean "get out of jail free card for
               | cryptonector being wrong", no.
               | 
               | The simplified model that leaves out tricky
               | implementation details is just "every file that's
               | different is a different blob object, stored in full on
               | its own, under the hash of its contents". If you work
               | with this model, you're basically correct in all your
               | reasoning, except that in reality a bit less storage is
               | used because some objects get packed. If you work with
               | the wrong (wrong! not simplified, wrong!) model "every
               | blob object is stored as base-and-deltas", you won't
               | reason correctly about either how the object store
               | presents itself in plumbing commands or how it's stored.
        
               | cryptonector wrote:
               | > It doesn't mean "get out of jail free card for
               | cryptonector being wrong", no.
               | 
               | I never said that Git commits _are_ diffs. Keep it civil.
               | 
               | > The simplified model that leaves out tricky
               | implementation details [...]
               | 
               | Models do that.
               | 
               | > If you work with this model, you're basically correct
               | in all your reasoning, except that in reality a bit less
               | storage is used because some objects get packed.
               | 
               | That's it? That's the one detail that should stop me
               | thinking correctly based on this model?
        
               | MaxBarraclough wrote:
               | From https://git-scm.com/book/en/v2/Git-Internals-
               | Packfiles :
               | 
               | > _You have two nearly identical 22K objects on your disk
               | (each compressed to approximately 7K). Wouldn't it be
               | nice if Git could store one of them in full but then the
               | second object only as the delta between it and the
               | first?_
               | 
               | > _It turns out that it can. The initial format in which
               | Git saves objects on disk is called a "loose" object
               | format. However, occasionally Git packs up several of
               | these objects into a single binary file called a
               | "packfile" in order to save space and be more efficient._
               | 
               | ...
               | 
               | > _What is also interesting is that the second version of
               | the file is the one that is stored intact, whereas the
               | original version is stored as a delta -- this is because
               | you're most likely to need faster access to the most
               | recent version of the file._
        
               | masklinn wrote:
                | That is what's covered by the article, but as various
                | commenters have noted it's not part of git's _model_,
                | it's an optimisation of git's _storage_.
               | 
               | So thinking in those terms for modelling / visualising
               | git's behaviour is nonsensical.
        
               | MaxBarraclough wrote:
               | It's not hard to make correct statements about the git
               | user interface without being misleading about its
               | internals.
               | 
               | josephg said:
               | 
               | > Each different version of every file is stored in full.
               | 
               | which is an untrue statement about how git internally
               | manages its data. It would be fine had they instead said:
               | 
               | > From the user's point of view, it's as if each
               | different version of every file is stored in full.
               | Internally, git may perform compression, including
               | differential compression.
               | 
               | This isn't the first time I've seen this confusion. This
               | 2014 StackOverflow thread [0] does the same, as does an
               | old HN thread. [1] The solution is clear communication.
               | 
               | [0] https://stackoverflow.com/q/8198105/
               | 
               | [1] https://news.ycombinator.com/item?id=25459757
        
               | juped wrote:
               | There are two ways commits can be treated in Git. Most of
               | the time, a commit is a specific tree and some metadata.
               | There are, however, some rebase-like operations that use
               | the commit metadata to treat a commit as a diff against
               | one or more parents which it points to. You're wrong in
               | this discussion because it's about storage, but you do
               | need to flip into this dual view of the commit graph to
               | use rebase-like operations correctly.
        
               | cryptonector wrote:
               | Exactly. Thinking of commits as diffs helps think about
               | rebasing.
        
             | eminence32 wrote:
              | When you rename a file in git, you can get a glimpse of its
              | internal model -- renames are not explicitly tracked. Since
             | git only stores full copies of each object, it has to
             | employ heuristics to figure out if a (deleted file, created
             | file) pair should be considered a rename. This is why you
             | sometimes see a percentage next to a rename in a "git log"
             | -- this is git's internal metric about how
             | similar/dissimilar the files are.
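A rough illustration of the heuristic described above (Git's real algorithm hashes content chunks rather than using `difflib`; this stand-in just shows the "no stored rename, only a post-hoc similarity score" idea):

```python
# Sketch of rename detection as a similarity heuristic: compare a
# deleted blob's contents against an added blob's contents and report
# a percentage, like the one "git log" shows next to a rename.
import difflib

def similarity(deleted: bytes, added: bytes) -> int:
    ratio = difflib.SequenceMatcher(None, deleted, added).ratio()
    return round(ratio * 100)

old = b"def main():\n    print('hello')\n"
new = b"def main():\n    print('hello, world')\n"
print(f"rename candidate: {similarity(old, new)}% similar")
```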
        
               | rossmohax wrote:
                | Is there any value in first-class renames other than some
                | performance boost?
        
             | JoshTriplett wrote:
              | Git actually arranges deltas to run from the current
              | version back to older versions, so that it has less work to
              | do to show you recent versions, and more work as it goes
              | backwards in time.
        
         | moreira wrote:
         | It's a reminder that sometimes it's OK to ignore hypothetical
         | problems.
         | 
         | "How does git deal with the problem that if we store changes
         | forever, the file size grows indefinitely?"
         | 
          | Even though Git uses delta compression to reduce space usage,
          | the repository still technically grows indefinitely, same as
          | CRDTs. So the answer is: it doesn't, and it doesn't matter.
         | 
         | If the size of something actually becomes a problem, -then- it
         | can be handled (e.g. Git LFS).
         | 
         | With CRDTs, depending on what you're doing, it might never
         | become a problem. Keeping all versions of a document is
         | entirely inconsequential, as an example. The size might grow
         | indefinitely but how many edits is a document going to have,
         | realistically?
        
         | Existenceblinks wrote:
          | It kinda stores diffs in `rr-cache/`: the preimage / postimage
          | / thisimage combined is a kind of diff.
        
         | Nzen wrote:
         | Ah, thanks. You've reminded me about the 2016 debacle [0]
         | wherein github rate limited the cocoapods repository because
         | their practice of using a shallow clone and later fetches
         | impacted github's performance. I'm going to warc that now, so I
         | don't forget again.
         | 
         | [0]
         | https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...
         | 
         | [+] https://github.com/orgs/Homebrew/discussions/225
        
         | MonkeyMalarky wrote:
         | I have used shallow clones to speed up and shrink what gets
         | pulled into a docker image when building it. Of course it was a
         | hack to make up for past mistakes but I was glad the option was
         | available.
        
           | masklinn wrote:
           | Did you need to update the docker image in-place afterwards?
           | Unless you did, why not just export?
        
         | giancarlostoro wrote:
          | It's easy to forget that code is... so small, but I guess if
          | your codebase on its own was hundreds of megabytes big, then
          | that would be noticed more quickly. I never realized it stored
          | a copy every time; I assumed it used diffs. So git uses the
          | same solution some of us use when we're about to do something
          | on git we're unfamiliar with: you copy the entire directory
          | housing your code + git repo into a new location just in case
          | it fails miserably.
        
           | NavinF wrote:
           | > I never realized it stored a copy every time, I assumed it
           | used diffs
           | 
           | git does store diffs. People are confusing the logical
           | representation with the on-disk layout.
        
         | masklinn wrote:
         | > Modern CRDTs are massively more space efficient than git but
         | nobody seems to notice or care about the size of our git
         | repositories.
         | 
         | The packing process explained in the article is why: rather
         | than make compression an intrinsic part of the system's model
         | (which most VCS do / have done), git moves it to a later
         | "optimisation" phase. This is an additional and somewhat odd
         | costs, but on the other hand it can be extremely efficient (at
         | $DAYJOB we've got files whose compression ratio is 99%).
         | 
          | Another interesting bit (which makes it very annoying to
          | inspect the raw contents of a git repo) is that git DEFLATEs
          | object contents when it stores them on disk. But while that's
          | clearly documented for packfiles (although the lack of an end-
          | of-data marker is implicit and extremely annoying), it's not
          | really spelled out for loose objects; you get to find out when
          | you can't make heads or tails of loose objects, and git barfs
          | at your hand-crafted loose objects.
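To illustrate the loose-object point: the on-disk bytes are a single zlib stream wrapping a `<type> <size>\0` header plus the content. This sketch fabricates a loose blob in memory rather than reading a real `.git` directory:

```python
# Loose objects are raw zlib streams, which is why a text editor shows
# gibberish; one zlib.decompress call reveals the header and body.
import zlib

content = b"hello\n"
stored = zlib.compress(b"blob %d\x00" % len(content) + content)

raw = zlib.decompress(stored)
header, _, body = raw.partition(b"\x00")
print(header)  # b'blob 6'
print(body)    # b'hello\n'
```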
        
       | spapas82 wrote:
       | I have written a small python library to read info directly from
       | the .git folder (without using the git executable):
       | https://github.com/spapas/python-git-info/
       | 
        | Decoding the pack files was notoriously difficult; there are
        | probably still cases that I don't handle properly. The packfile
        | is a really complex/messy format and on top of that it lacks any
        | proper documentation!
        | 
        | This article explains a lot of the concepts very well but
        | doesn't go into implementation details (where the big snakes
        | are).
       | Another great resource for unpacking packfiles is this post:
       | https://codewords.recurse.com/issues/three/unpacking-git-pac...
       | and of course reading the source code of other git clients.
       | 
       | Unfortunately it seems that the official packfile format
       | documentation is the git source code :|
        
         | masklinn wrote:
          | > The packfile is a really complex/messy format and on top of
          | that it lacks any proper documentation!
          | 
          | Maybe it's changed since you checked, but I found the pack
          | files to be documented, though nowhere near as well as the
          | index files (now that was a pleasure: the index file format is
          | straightforward, pretty logical, and very well documented).
         | 
         | The problem I had with the pack files documentation is that
         | it's non-linear, so if you read it linearly as you implement it
         | you hit areas which turn out to be documented a few screens
         | later. Furthermore it doesn't necessarily define or spell out
         | what it's talking about, so it can take a while to realise that
         | it has 3 (.5?) different formats of varints, or that the size
         | it provides is for decompressed content, and that it relies on
         | zlib to discover the end of the compressed stream (and good
         | luck to you if your implementation doesn't expose that).
         | 
         | But in my experience, it's really nothing compared to the
         | documentation of the wire format. That leaves even more details
         | out, some of the explanations are outright misleading (I spent
         | hours convinced the v2 protocol was using an HTTP connection as
         | a half-duplex socket and wondering how I would manage), and
         | with TODOs covering half the protocol.
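For a taste of the varint confusion mentioned above, here is a sketch of one flavour: the little-endian base-128 encoding used for delta base/result sizes (the object-header and offset-delta encodings differ again):

```python
# One of the varint flavours the pack documentation glosses over.
# Each byte carries 7 payload bits; the high bit means "more follows",
# and later bytes supply higher-order bits (little-endian).
def read_size_varint(data: bytes, pos: int = 0):
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

# 300 = 0b1_0010_1100 -> low 7 bits 0x2C with the continuation bit set
# (0xAC), then the remaining high bits 0b10 (0x02).
print(read_size_varint(bytes([0xAC, 0x02])))  # (300, 2)
```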
        
           | spapas82 wrote:
            | Can you please share the documentation you used to understand
            | the pack file implementation? It will be very useful to me!
        
             | masklinn wrote:
             | Git's own: https://github.com/git/git/blob/master/Documenta
             | tion/gitform...
             | 
             | I also used gitoxide's source when I got unsure / lost
             | about the nomenclature / varints.
             | 
             | I'd suggest giving it a pass unless you have to or are
             | really interested though.
        
           | ori_b wrote:
           | Yes. The wire format docs, especially for HTTP, are
           | atrocious. The pack format is well documented enough. (https:
           | //github.com/git/git/tree/maint/Documentation/technica...)
           | 
           | I wrote the 9front implementation (https://git.9front.org/pla
           | n9front/plan9front/HEAD/sys/src/cm...) more or less entirely
           | from the official documentation in the git repo. For
           | debugging, GIT_TRACE_PACKET=1 is very useful for comparison.
        
       | tex0 wrote:
        | Packfiles give me PTSD. They are notoriously hard to maintain at
        | scale on the server side.
        | 
        | I really wish git wasn't using them.
        
         | gbrown_ wrote:
         | Could you elaborate on "maintain" please?
        
       ___________________________________________________________________
       (page generated 2022-08-29 23:00 UTC)