[HN Gopher] Git's database internals I: packed object store
___________________________________________________________________

Git's database internals I: packed object store

Author : todsacerdoti
Score : 158 points
Date : 2022-08-29 13:01 UTC (9 hours ago)

(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)

| legulere wrote:
| > What if Git had a long-running daemon that could satisfy queries on-demand, but also keep that in-memory representation of data instead of needing to parse objects from disk every time
|
| I guess that would be performance-wise the same as switching from running the compiler in a one-off mode to an LSP analyzer.
| srathi wrote:
| Shameless plug: I gave a tech talk on this at my last company, and implemented some of these in Golang as a learning exercise [0]. While it is not mandatory to know the internals, doing so helps a lot when you occasionally encounter a git-gotcha!
|
| [0] https://github.com/ssrathi/gogit
| masklinn wrote:
| This article gives me whiplash: it starts at a high level, dives, stops _just_ before it touches upon the implementation details (and all the weird stuff which makes you realise the developers were human after all), then shoots back up again.
|
| > Git does not currently have the capability to update a packfile in real time without shutting down concurrent reads from that file. Such a change could be possible, but it would require updating Git's storage significantly. I think this is one area where a database expert could contribute to the Git project in really interesting ways.
|
| I'm not sure it would be super useful, because of the delta-ification process. Adding "literal" (non-delta) files to a pack isn't much of a gain (file contents are zlib-compressed either way).
| kevincox wrote:
| Also presumably "shutting down concurrent reads from that file" is not much of a problem because the file can just be unlinked and it will be GCed when the last reader finishes naturally. Other than hung processes, the only downside is extra disk space usage and some cache.
| josephg wrote:
| One of the things that surprises me about Git is that it doesn't store diffs. It just stores the full content of every file every time it's modified and committed.
|
| I work on concurrent editing systems (using CRDTs). People always ask how we deal with the problem that if we store changes forever, the filesize grows indefinitely. Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories. I think git supports shallow clones but I don't know anyone who bothers with them.
| mandeepj wrote:
| > It just stores the full content of every file every time it's modified and committed.
|
| That's not true; otherwise each git branch folder would be huge, but it's not. It stores the snapshot of changes in branch folders.
| thro388 wrote:
| Maybe OP stores binary files?
| sekao wrote:
| Binary files are also delta'd in git.
| _ikke_ wrote:
| > It stores the snapshot of changes in branch folders
|
| No, gp is right. There is no snapshot of changes. Each unique version of a file is an object in the object store. That does mean each commit would only introduce new objects for files that have changed (these are already added to the object store when you use `git add`).
|
| But the pack files, as the article explains (these are created regularly during gc/maintenance runs), are why the repo size can often be smaller than the working tree.
|
| > Not only is each object compressed individually, they can also be compressed against each other to take advantage of common data.
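To make that concrete, here is a minimal Python sketch of how a blob gets its identity in the default SHA-1 object format (the sample content is made up; the printed path is just where a loose object with that id would live):

    import hashlib

    def blob_id(content: bytes) -> str:
        # Loose blob objects are hashed as "blob <size>\0" followed by the
        # raw contents. Identical contents always produce the same id, so an
        # unchanged file is stored once no matter how many commits reference it.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    oid = blob_id(b"hello world\n")
    # A loose object with this id would live at .git/objects/<first 2 hex chars>/<rest>
    print(oid, "->", ".git/objects/%s/%s" % (oid[:2], oid[2:]))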
| CorrectHorseBat wrote:
| Branches are just pointers to commits in Git. How those commits are stored doesn't affect their size.
| tomstuart wrote:
| I know what you mean, but -- as the article explains [0] -- Git's packfile format uses delta compression, which is essentially "storing diffs".
|
| [0] https://github.blog/2022-08-29-gits-database-internals-i-pac...
| masklinn wrote:
| > Git's packfile format uses delta compression, which is essentially "storing diffs".
|
| FWIW it's closer to lossless compression, e.g. LZ77's behaviour: git's delta compression has two commands, LITERAL and COPY, and technically it doesn't need to copy linearly from the base object. It would be unlikely[0], but a delta could start by COPY-ing the end of the base object, then add literal data, then COPY the start of the base object (or COPY the end again, or whatever, you get the point).
|
| [0] outside of handcrafting, as the packing process would need to find such a situation which it would estimate useful
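A toy illustration of those two commands, assuming a delta represented as a plain list of tuples (this ignores the real on-disk encoding of offsets and lengths; it only shows the copy/literal mechanics, including an out-of-order copy):

    def apply_delta(base: bytes, commands) -> bytes:
        # Each command is either ("copy", offset, size) -- take bytes from
        # anywhere in the base object, in any order -- or ("literal", data) --
        # append new bytes that do not come from the base at all.
        out = bytearray()
        for cmd in commands:
            if cmd[0] == "copy":
                _, offset, size = cmd
                out += base[offset:offset + size]
            else:
                out += cmd[1]
        return bytes(out)

    base = b"the quick brown fox"
    # Copies need not be linear: take the end of the base first, then the start.
    delta = [("copy", 10, 9), ("literal", b" / "), ("copy", 0, 9)]
    print(apply_delta(base, delta))  # b'brown fox / the quick'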
And at worst it | gives you incorrect ideas of the performance | characteristics of the system e.g. that small changes are | bad, because lots of changes means lots of diffs to apply | which is slow. | | By forgetting that model, you're strictly less wrong, and | helped no less about reasoning about git. | | > It's like epicycles aren't wrong as a model either. | | But they were. "Adding epicycles" has literally become a | saying about trying to pile on more crap to try and (fail | to) make an incorrect model work. | | And epicycles were at least trying to keep a simple model | working, the "snapshot model" is more complicated than | git's actual behaviour to start with. | robertlagrant wrote: | Calling things harmful considered harmful. | cryptonector wrote: | Especially without an explanation. them: | harmful me: how? them: really harmful | me: sure, but, can you explain it? them: like a | bajillion harmful | cryptonector wrote: | > As a model it's wrong, and useless, and possibly | harmful. | | It's no more wrong than epicycles. It's not useless | either: it works very well for me. For _users_ (as | opposed to Git developers), it 's hard to see how it | could be harmful. | johannes1234321 wrote: | The harmful part comes from assumptions based on it. | | Systems like CVS or Subversion indeed store diffs of file | conceptually and storage wise. This has notable co | sequences: when doing operations on a large span of | revisions all takes long as a bunch of diffs have to be | applied in sequence. This then leads to reluctance of | small commits for the wrong reasons. | | Wrong assumptions about the workings leading to wrong | decisions is harmful. | cryptonector wrote: | > This has notable co sequences: when doing operations on | a large span of revisions all takes long as a bunch of | diffs have to be applied in sequence. This then leads to | reluctance of small commits for the wrong reasons. | | This happens in Git as well though. If you have to rebase | your commits across a few thousand upstream commits, | you'll be in for some pain. (For example, I've a commit | to PostgreSQL to add ALWAYS DEFERRED constraints, and | I've had to rebase it across thousands of upstream | commits because I've lacked the time to see that | contribution across the finish line.) | | You seem to be indirectly arguing for merge-based | workflows. (At least that's what I'm reading between the | lines. Perhaps my reading is wrong, but I'll run with it | until you tell me otherwise.) However, a) merging is not | always accepted by upstream, and b) merge-based workflows | suck. | | I grant that merging is very simple with Git's internals: | Git does the three-way merge thing, you manually merge | conflicts, you create a commit for the conflict-resolved | checked out state, recording the two parents in that | commit, and you're done. This works because Git simply | stores the objects as they are, internally using deltas | -not quite diffs- to compress the object store. This is | very simple, yes, but it yields a workflow that doesn't | work for many projects. Rebasing is much better, and it | always works, especially with [0]. | | So now you might wonder how to deal with the rebase- | across-thousands-of-commits problem. Well, I've a | script[0] (not originally mine, but I've hacked on it) | that does something like bisection to find upstream | commits causing conflicts so as to greatly speed up the | process. | | [0] https://gist.github.com/nicowilliams/ea2fa2b445c2db50 | d2ee650... 
| johannes1234321 wrote:
| If you got to rebase over many revisions you are likely in trouble anyways. There, small commits can be helpful to ease dealing with merge conflicts.
|
| However rebase is a single special operation. The diff-based storage affects everything: updating a local repository from upstream, diffing across some distance, ...
| cryptonector wrote:
| > The diff-based storage affects everything.
|
| No one is saying Git does or should do that.
|
| > If you got to rebase over many revisions you are likely in trouble anyways.
|
| That script I linked made it easy.
|
| > There, small commits can be helpful to ease dealing with merge conflicts.
|
| They were indeed mostly small commits. Mine wasn't that small -- it's fairly invasive and not possible to split up into more than two commits.
| cryptonector wrote:
| This is a good mental model. The internals are the internals. But the model users see is that a repository is a bag of commits (deltas) that organize into a tree, with branches and tags being names for specific commits. With this model in mind anyone can become a Git power user.
| stu2b50 wrote:
| From a user point of view, you should not see commits as diffs or deltas, but as full snapshots of the contents of the repository.
|
| Delta compression is just an implementation detail for storage optimization. Conceptually each commit is an immutable snapshot in its own right.
| cryptonector wrote:
| > From a user point of view, you should not see commits as diffs or deltas, but as full snapshots of the contents of the repository.
|
| That tends to lead to a merge-centric workflow.
|
| The alternative makes it easy to understand rebase. Ergo it's good.
| stu2b50 wrote:
| I don't see any particular reason that'd be the case. Even with rebase, the diff model causes some of git's behavior to no longer make sense. For instance, the fact that rebase causes an entirely new series of commits to be made - this is because rebase is just a layer over cherry-picking commits, and cherry-picking a commit calculates the diff between two commits, then applies that diff as its own, new commit.
|
| If commits were diffs, then that process wouldn't make any sense. It only makes sense when you think of commits as snapshots, and cherry-picking as taking a commit, calculating the diff with the current commit, and then applying that diff to make a brand new commit.
| cryptonector wrote:
| Yes, rebase is a series of cherry-picks. Each commit picked is treated as a set of diffs that are applied with conflict resolution, then committed, then off to the next. The mental model of commits as diffs works, obviously.
|
| > If commits were diffs, then that process wouldn't make any sense. It only makes sense when you think of commits as snapshots, and cherry-picking as taking a commit, calculating the diff with the current commit, and then applying that diff to make a brand new commit.
|
| Sure it does. It works for me. You can use whatever mental model of Git that works for you.
| josephg wrote:
| > But the model users see is that a repository is a bag of commits (deltas)
|
| Commits in your .git directory aren't stored using deltas from a previous commit. Each different version of every file is stored in full.
| cryptonector wrote:
| Surely you must know what the word "model" means.
| juped wrote:
| It doesn't mean "get-out-of-jail-free card for cryptonector being wrong", no.
|
| The simplified model that leaves out tricky implementation details is just "every file that's different is a different blob object, stored in full on its own, under the hash of its contents". If you work with this model, you're basically correct in all your reasoning, except that in reality a bit less storage is used because some objects get packed. If you work with the wrong (wrong! not simplified, wrong!) model "every blob object is stored as base-and-deltas", you won't reason correctly about either how the object store presents itself in plumbing commands or how it's stored.
| cryptonector wrote:
| > It doesn't mean "get-out-of-jail-free card for cryptonector being wrong", no.
|
| I never said that Git commits _are_ diffs. Keep it civil.
|
| > The simplified model that leaves out tricky implementation details [...]
|
| Models do that.
|
| > If you work with this model, you're basically correct in all your reasoning, except that in reality a bit less storage is used because some objects get packed.
|
| That's it? That's the one detail that should stop me thinking correctly based on this model?
| MaxBarraclough wrote:
| From https://git-scm.com/book/en/v2/Git-Internals-Packfiles :
|
| > _You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn't it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?_
|
| > _It turns out that it can. The initial format in which Git saves objects on disk is called a "loose" object format. However, occasionally Git packs up several of these objects into a single binary file called a "packfile" in order to save space and be more efficient._
|
| ...
|
| > _What is also interesting is that the second version of the file is the one that is stored intact, whereas the original version is stored as a delta -- this is because you're most likely to need faster access to the most recent version of the file._
| masklinn wrote:
| That is what's covered by the article, but as various commenters have noted it's not part of git's _model_, it's an optimisation of git's _storage_.
|
| So thinking in those terms for modelling / visualising git's behaviour is nonsensical.
| MaxBarraclough wrote:
| It's not hard to make correct statements about the git user interface without being misleading about its internals.
|
| josephg said:
|
| > Each different version of every file is stored in full.
|
| which is an untrue statement about how git internally manages its data. It would be fine had they instead said:
|
| > From the user's point of view, it's as if each different version of every file is stored in full. Internally, git may perform compression, including differential compression.
|
| This isn't the first time I've seen this confusion. This 2014 StackOverflow thread [0] does the same, as does an old HN thread [1]. The solution is clear communication.
|
| [0] https://stackoverflow.com/q/8198105/
|
| [1] https://news.ycombinator.com/item?id=25459757
| juped wrote:
| There are two ways commits can be treated in Git. Most of the time, a commit is a specific tree and some metadata. There are, however, some rebase-like operations that use the commit metadata to treat a commit as a diff against one or more parents which it points to. You're wrong in this discussion because it's about storage, but you do need to flip into this dual view of the commit graph to use rebase-like operations correctly.
| cryptonector wrote:
| Exactly. Thinking of commits as diffs helps when thinking about rebasing.
| eminence32 wrote:
| When you rename a file in git, you can get a glimpse of its internal model -- renames are not explicitly tracked. Since git only stores full copies of each object, it has to employ heuristics to figure out if a (deleted file, created file) pair should be considered a rename. This is why you sometimes see a percentage next to a rename in a "git log" -- this is git's internal metric of how similar/dissimilar the files are.
| rossmohax wrote:
| Is there any value in first-class renames other than some performance boost?
| JoshTriplett wrote:
| Git actually arranges to delta starting from the current version towards older versions, so that it has less work to do to show you recent versions, and does more work as it works its way backwards in time.
| moreira wrote:
| It's a reminder that sometimes it's OK to ignore hypothetical problems.
|
| "How does git deal with the problem that if we store changes forever, the file size grows indefinitely?"
|
| Even though Git uses deltas for diffs to reduce space usage, it still technically grows indefinitely, same as CRDTs. So the answer is it doesn't, and it doesn't matter.
|
| If the size of something actually becomes a problem, -then- it can be handled (e.g. Git LFS).
|
| With CRDTs, depending on what you're doing, it might never become a problem. Keeping all versions of a document is entirely inconsequential, as an example. The size might grow indefinitely, but how many edits is a document going to have, realistically?
| Existenceblinks wrote:
| It kinda stores diffs in `rr-cache/`: the preimage, postimage and thisimage combined are a kind of diff.
| Nzen wrote:
| Ah, thanks. You've reminded me about the 2016 debacle [0] wherein github rate-limited the cocoapods repository because their practice of using a shallow clone and later fetches impacted github's performance. I'm going to warc that now, so I don't forget again.
|
| [0] https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...
|
| [+] https://github.com/orgs/Homebrew/discussions/225
| MonkeyMalarky wrote:
| I have used shallow clones to speed up and shrink what gets pulled into a docker image when building it. Of course it was a hack to make up for past mistakes, but I was glad the option was available.
| masklinn wrote:
| Did you need to update the docker image in-place afterwards? Unless you did, why not just export?
| giancarlostoro wrote:
| It's easy to forget that code is... so small, but I guess if your codebase on its own was hundreds of megabytes big, then you would notice it more quickly. I never realized it stored a copy every time, I assumed it used diffs. So git uses the same solution some of us use when we're about to do something on git we're unfamiliar with... you copy the entire directory housing your code + git repo into a new location just in case it fails miserably.
| NavinF wrote:
| > I never realized it stored a copy every time, I assumed it used diffs
|
| git does store diffs. People are confusing the logical representation with the on-disk layout.
| masklinn wrote:
| > Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories.
|
| The packing process explained in the article is why: rather than make compression an intrinsic part of the system's model (which most VCS do / have done), git moves it to a later "optimisation" phase. This is an additional and somewhat odd cost, but on the other hand it can be extremely efficient (at $DAYJOB we've got files whose compression ratio is 99%).
|
| Another interesting bit (which makes it very annoying to inspect the raw contents of a git repo) is that git DEFLATEs object contents when it stores them on disk. But while that's clearly documented for packfiles (although the lack of an end-of-data marker is implicit and extremely annoying), it's not really spelled out for loose objects; you get to find out when you can't make heads or tails of loose objects, and git barfs at your hand-crafted loose objects.
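A small sketch of inflating a loose object by hand, which is roughly why the raw files under .git/objects look like binary noise until they go through zlib (pass in whatever loose object id your repository happens to have):

    import zlib

    def read_loose_object(git_dir: str, oid: str):
        # Loose objects are zlib-deflated; once inflated, the payload is
        # "<type> <size>\0" followed by the raw object contents.
        path = "%s/objects/%s/%s" % (git_dir, oid[:2], oid[2:])
        with open(path, "rb") as f:
            raw = zlib.decompress(f.read())
        header, _, body = raw.partition(b"\x00")
        obj_type, size = header.split(b" ")
        assert int(size) == len(body)
        return obj_type.decode(), body

    # Example (any loose object id that exists in your repo will do):
    # obj_type, body = read_loose_object(".git", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")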
| spapas82 wrote:
| I have written a small python library to read info directly from the .git folder (without using the git executable): https://github.com/spapas/python-git-info/
|
| Decoding the pack files was notoriously difficult; probably there are still cases that I don't handle properly. The packfile is a really complex/messy format and on top of that it lacks any proper documentation!
|
| This article explains a lot of the concepts very well but doesn't go into implementation details (where the big snakes are). Another great resource for unpacking packfiles is this post: https://codewords.recurse.com/issues/three/unpacking-git-pac... and of course reading the source code of other git clients.
|
| Unfortunately it seems that the official packfile format documentation is the git source code :|
| masklinn wrote:
| > The packfile is a really complex/messy format and on top of that it lacks any proper documentation!
|
| Maybe it's changed since you checked it, but I found the pack files to be documented, though nowhere near as well as index files (now that was a pleasure; the index file format is straightforward, pretty logical, and very well documented).
|
| The problem I had with the pack files documentation is that it's non-linear, so if you read it linearly as you implement it you hit areas which turn out to be documented a few screens later. Furthermore it doesn't necessarily define or spell out what it's talking about, so it can take a while to realise that it has 3 (.5?) different formats of varints, or that the size it provides is for decompressed content, and that it relies on zlib to discover the end of the compressed stream (and good luck to you if your implementation doesn't expose that).
|
| But in my experience, it's really nothing compared to the documentation of the wire format. That leaves even more details out, some of the explanations are outright misleading (I spent hours convinced the v2 protocol was using an HTTP connection as a half-duplex socket and wondering how I would manage), and it has TODOs covering half the protocol.
| spapas82 wrote:
| Can you please share the documentation you used to understand the pack file implementation? It will be very useful to me!
| masklinn wrote:
| Git's own: https://github.com/git/git/blob/master/Documentation/gitform...
|
| I also used gitoxide's source when I got unsure / lost about the nomenclature / varints.
|
| I'd suggest giving it a pass unless you have to or are really interested though.
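For a flavour of what that documentation covers, a sketch of the first things a pack reader has to parse: the fixed pack header, then each object's type/size header in one of the varint encodings mentioned above (this stops before the zlib-deflated payload, and ignores the extra base reference that delta entries carry):

    import struct

    def read_pack_header(data: bytes):
        # A packfile starts with the 4-byte magic "PACK", a 4-byte version
        # (2 or 3) and a 4-byte object count, all big-endian.
        magic, version, count = struct.unpack(">4sII", data[:12])
        assert magic == b"PACK"
        return version, count

    def read_object_header(data: bytes, offset: int):
        # Each entry begins with a varint: bits 6-4 of the first byte hold the
        # object type, bits 3-0 the low bits of the *inflated* size, and the
        # high bit means "another size byte follows" (7 more bits per byte).
        byte = data[offset]
        obj_type = (byte >> 4) & 0x7
        size = byte & 0x0F
        shift = 4
        while byte & 0x80:
            offset += 1
            byte = data[offset]
            size |= (byte & 0x7F) << shift
            shift += 7
        # The compressed length is not recorded anywhere: the reader has to
        # let zlib find the end of the deflated stream that follows.
        return obj_type, size, offset + 1

    # Illustrative usage (the pack file name is a placeholder):
    # with open(".git/objects/pack/pack-<hash>.pack", "rb") as f:
    #     version, count = read_pack_header(f.read())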
| ori_b wrote:
| Yes. The wire format docs, especially for HTTP, are atrocious. The pack format is well documented enough. (https://github.com/git/git/tree/maint/Documentation/technica...)
|
| I wrote the 9front implementation (https://git.9front.org/plan9front/plan9front/HEAD/sys/src/cm...) more or less entirely from the official documentation in the git repo. For debugging, GIT_TRACE_PACKET=1 is very useful for comparison.
| tex0 wrote:
| Packfiles give me PTSD. They are so notoriously hard to maintain at scale on the server side.
|
| I really wish git wasn't using them.
| gbrown_ wrote:
| Could you elaborate on "maintain" please?
___________________________________________________________________
(page generated 2022-08-29 23:00 UTC)