[HN Gopher] Scaling monorepo maintenance
___________________________________________________________________

Scaling monorepo maintenance

Author : pimterry
Score  : 250 points
Date   : 2021-04-30 09:52 UTC (1 day ago)

(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)

| whymauri wrote:
| I love when people use Git in ways I haven't thought about before. Reminds me of the first time I played around with 'blobs.'
| debarshri wrote:
| It is a great writeup. I wonder how GitLab solves this problem.
| lbotos wrote:
| GL packs refs at various times and frequencies depending on usage: https://docs.gitlab.com/ee/administration/housekeeping.html
|
| It works well for most repos, but as you start to get out to the edges of lots of commits it can cause slowness. GL admins can repack reasonably safely at various times to get access speedups, but the solution presented in the blog would def speed packing up.
|
| (I work as a Support Engineering Leader at GL but I'm reading HN for fun <3)
| masklinn wrote:
| They might have yet to encounter it. GitHub is hosting some really big repos.
| lbotos wrote:
| Oh, we def have. I've seen some large repos (50+GB) in some GL installations.
| the_duke wrote:
| Very well written post, and upstream work is always appreciated.
|
| I also really like monorepos, but Git and GitHub really don't work well at all for them.
|
| On the Git side there is no way to clone only parts of a repo or to limit access by user. All the Git tooling out there, from the CLI to the various IDE integrations, is very ill-adjusted to a huge repo with lots of unrelated commits.
|
| On the GitHub side there is no separation between the different parts of a monorepo in the UI (issues, PRs, CI), the workflows, or the permission system. Sure, you can hack something together with labels and custom bots, but it always feels like a hack.
|
| Using Git(Hub) for monorepos is really painful in my experience.
|
| There is a reason why Google, Facebook et al. have heaps of custom tooling.
| krasin wrote:
| I really like monorepos. But I find that it's almost never a good idea to hide parts of the source code from developers. And if there's some secret sauce that's so sensitive that only a single-digit number of developers in the whole company can access it, then it's probably okay to have a separate repository just for it.
|
| Working in environments where different people have partial access to different parts of the code never felt productive to me -- often, figuring out who can take on a task and how to grant all the access takes longer than the task itself.
| jayd16 wrote:
| I wouldn't call it painful exactly, but I'll be happy when shallow and sparse cloning become rock solid and boring.
| Beowolve wrote:
| On this note, GitHub does reach out to customers with monorepos and is aware of its shortcomings. I think over time we will see them change to have better support. It's only a matter of time.
| jeffbee wrote:
| It's funny that you mention this as if monorepos of course require custom tooling. Google started with off-the-shelf Perforce and that was fine for many years, long after their repo became truly huge. Only when it became _monstrously_ huge did they need custom tools, and even then they basically just re-implemented Perforce instead of adopting git concepts. You, too, can just use Perforce. It's even free for up to five users.
| You won't outgrow its limits until you get about a million engineer-years under your belt.
|
| The reason git doesn't have partial repo cloning is that it was written by people without regard to the past experience of software development organizations. It is suited to the radically decentralized group of Linux maintainers. It is likely that your organization much more closely resembles Google or Facebook than Linux. Perforce has had partial checkout since ~always, because that's a pretty obvious requirement when you stop and think about what software development _companies_ do all day.
| forrestthewoods wrote:
| It's somewhat mind-boggling that no one has made a better Perforce. It has numerous issues and warts. But it's much closer to what the majority of projects need than Git, imho. And for bonus points I can teach an artist/designer how to safely and correctly use Perforce in about 10 minutes.
|
| I've been using Git/Hg for years and I still run into the occasional Gitastrophe where I have to Google how to unbreak myself.
| Chyzwar wrote:
| Git has recently added sparse checkout, and there is also the Virtual File System for Git from Microsoft.
|
| In my experience, git/VCS is not the issue with a monorepo. Build, test, automation, deployments, CI/CD are way harder. You will end up with a bunch of shell scripts, Makefiles, Grunt, and a combination of ugly hacks. If you are smart you will adopt something like Bazel and have a dedicated tooling team. If you see everything as a nail, you will split the monorepo into an unmaintainable mess of small repos that slowly rot away.
| throwaway894345 wrote:
| I've always found that the biggest issue with monorepos is the build tooling. I can't get my head around Bazel and other Blaze derivatives enough to extend them to support any interesting case, and Nix has too many usability issues to use productively (and I've been in an organization that gave it an earnest shot).
| krasin wrote:
| Can you please give an example of such an interesting case? I am genuinely curious.
|
| And I agree with the general point that monorepos require great build tooling as a match.
| csnweb wrote:
| With GitHub Actions you can quite easily specify a workflow for parts of your repo (simple file path filter: https://docs.github.com/en/actions/reference/workflow-syntax...). So you can basically just write one workflow for each project in the monorepo and have only those run where changes occurred.
| infogulch wrote:
| This was great! My summary:
|
| A git packfile is an aggregated and indexed collection of historical git objects which reduces the time it takes to serve requests for those objects, implemented as two files: .pack and .idx. GitHub was having issues maintaining packfiles for very large repos in particular, because regular repacking always has to repack the entire history into a single new packfile every time -- an expensive quadratic algorithm. GitHub's engineering team ameliorated this problem in two steps: 1. Enable repos to be served from multiple packfiles at once; 2. Design a packfile maintenance strategy that uses multiple packfiles sustainably.
|
| Multi-pack indexes are a new git feature, but the initial implementation was missing performance-critical reachability bitmaps for multi-pack indexes. In general, index files store object names in lexicographic order and point to the named object's position in the associated packfile.
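| A toy model of that forward lookup in Python (illustrative names and data only -- real .idx files are a binary format with fanout tables, checksums, etc.):
|
|       import bisect
|
|       # Sorted object IDs, with a parallel array of byte offsets
|       # into the .pack file (toy stand-ins for the .idx contents).
|       names = ["1a2b", "9c0d", "f00d"]
|       offsets = [12, 3045, 77810]
|
|       def pack_offset(oid):
|           """Binary-search the sorted names for the pack offset."""
|           i = bisect.bisect_left(names, oid)
|           if i == len(names) or names[i] != oid:
|               raise KeyError(oid)
|           return offsets[i]
|
|       # Going the other way -- "which name does the object at pack
|       # position i have?" -- needs a linear scan unless an inverse
|       # table is precomputed, which is what the .rev file described
|       # next provides.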
| As a first step to implement reachability bitmaps for multi-pack indexes, they introduced a reverse index file (.rev) which maps packfile object positions back to index file name offsets. This alone yielded a big performance improvement, and it also filled in the missing piece needed to implement multi-pack bitmaps.
|
| With the issues of serving repos from multiple packs solved, they needed a way to efficiently utilize multiple packs to reduce maintenance overhead. They chose to maintain historical packfiles in geometrically increasing sizes. I.e., during the maintenance job, consider the N most recent packfiles: if the sum of the sizes of packfiles [1, N] is less than the size of packfile N+1, then packfiles [1, N] are repacked into a single packfile and we're done; if their summed size is greater than the size of packfile N+1, then iterate, considering packfiles [1, N+1] against packfile N+2, and so on. This results in a set of packfiles where, ordered by age, each file is roughly double the size of the previous one, which has a number of beneficial properties for both serving and the average-case maintenance run. Funny enough, this selection procedure struck me as similar to the game "2048".
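| A minimal sketch of that rollup rule in Python (my own illustration of the procedure as described above, not GitHub's actual code; the real feature landed as git repack --geometric=<factor>):
|
|       def rollup_count(sizes):
|           """sizes[0] is the newest/smallest pack, sizes[-1] the
|           oldest/largest. Grow the candidate prefix until its
|           combined size is smaller than the next pack; everything
|           in the prefix gets repacked into one new packfile."""
|           n = 1
|           while n < len(sizes) and sum(sizes[:n]) >= sizes[n]:
|               n += 1
|           return n
|
|       # 3 >= 2 grows the prefix; 3+2 >= 4 grows it again;
|       # 3+2+4 < 10 stops it, so the three newest packs are rolled
|       # up into a single pack of size 9.
|       assert rollup_count([3, 2, 4, 10, 40]) == 3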
| underdeserver wrote:
| 30-minute read + the Git object model = mind boggled.
|
| I'd have appreciated a series of articles instead of one; for me it's way too much info to take in in one sitting.
| iudqnolq wrote:
| I'm currently working through the book Building Git. Best $30 I've spent in a while. It's about 700 pages, but 200 pages in and I can stage files to/from the index, make commits, and see the current status (although not on a repo with packfiles).
|
| I'm thinking about writing a blog post where I write a git commit with hexdump, zlib, and vim.
| georgyo wrote:
| It was a lot to digest, but it was also all one continuous thought.
|
| If it were broken up, I don't think it would have been nearly as good. And I don't think I would have been able to keep all the context to understand smaller chunks.
|
| I really enjoyed the whole thing.
| swiley wrote:
| Monorepos are like having a flat directory structure.
|
| Sure it's simple, but it makes it hard to find anything if you have a lot of stuff/people. Submodules and package managers exist for a reason.
| no_wizard wrote:
| Note: for the sake of discussion I'm assuming when we say monorepo we mean _monorepo and the associated tools used to manage it_.
|
| The trade-off is simplified management of dependencies. With a monorepo, I can control every version of a given dependency so they're uniform across packages. If I update one package it is always going to be linked to the others in its latest version. I can simplify releases and the management of my infrastructure in the long term, though there is a trade-off in initial complexity for certain things if you want to do something like, say, only run tests in CI for packages that have changed (useful in some cases).
|
| It's all trade-offs, but on average the quality of code has been higher for our org in a monorepo.
| mr_tristan wrote:
| I've found that many developers do not pay attention to dependency management, so this approach of "it's either in the repo or it doesn't exist" is actually a nice guard rail.
|
| I'm reading between the lines here, but I'm assuming you've set up your tooling to enforce this. As in: the various projects in the repo don't just optionally decide to have external references, e.g., Maven Central, npm, etc.
|
| This puts quite a lot of "stuff" in the repo, but improvements like the ones this article mentions make monorepos in git much easier to use.
|
| I'd have to think you could pretty easily generate a lot of automation and reports triggered off commits, too. I'd say that would make the monorepo even easier to observe, with a modicum of the tooling required to maintain independent repositories.
| no_wizard wrote:
| That is accurate. I wouldn't use a monorepo without tooling, and in the JavaScript/TypeScript ecosystem you really can't do much without tooling (though npm supports workspaces now, it doesn't support much else yet, like plugins or hooks, etc.).
|
| I have tried in the past to achieve the same goals without one, particularly around the dependency graph and not duplicating functionality found in shared libraries (though this concern goes hand in hand with another concern of mine, which is documentation enforcement). It just wasn't possible in a way I could automate with a high degree of accuracy and confidence without even more complexity, like having to use some kind of CI integration to pull dependency files across packages and compare them. In a monorepo I have a single tool that does this for _all_ dependencies whenever any package.json file or the lock file is updated (roughly the kind of check sketched below).
|
| If you care at all about your dependency graph, and in my not so humble opinion every developer should have some high-level awareness here in their given domain, I haven't found a better solution that is less complex than leveraging a monorepo.
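| A minimal sketch of that kind of uniformity check in Python (the packages/*/package.json layout and the rule itself are assumptions for illustration, not no_wizard's actual tool):
|
|       import json
|       from collections import defaultdict
|       from pathlib import Path
|
|       # Map each dependency to the version every package requests.
|       versions = defaultdict(dict)  # dep -> {package: version}
|       for manifest in Path("packages").glob("*/package.json"):
|           data = json.loads(manifest.read_text())
|           for section in ("dependencies", "devDependencies"):
|               for dep, ver in data.get(section, {}).items():
|                   versions[dep][manifest.parent.name] = ver
|
|       # Flag any dependency pinned to different versions.
|       for dep, users in sorted(versions.items()):
|           if len(set(users.values())) > 1:
|               print(f"{dep} is not uniform: {users}")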
| Denvercoder9 wrote:
| _> Sure it's simple but it makes it hard to find anything if you have a lot of stuff/people._
|
| I think this is a bad analogy. Looking up a file or directory in a monorepo isn't harder than looking up a repository. In fact, I'd argue it's easier, as we've developed decades of tooling for searching through filesystems, while for searching through remotely hosted repositories you're dependent on the search function of the repository host, which is often worse.
| cryptica wrote:
| To scale a monorepo, you need to split it up into multiple repos; that way each repo can be maintained independently by a separate team...
|
| We can call it a multi-monorepo; that way our brainwashed managers will agree to it.
| Orphis wrote:
| And that way, you can't have atomic updates across the repositories and need to synchronize them all the time. Great.
| iudqnolq wrote:
| What do atomic source updates get you if you don't have atomic deploys? I'm just a student, but my impression is that literally no one serious has atomic deploys, not even Google, because the only way to do it is scheduled downtime.
|
| If you need to handle different versions talking to each other in production, it doesn't seem any harder to also deal with different versions in source, and I'd worry atomic updates to source would give a false sense of security in deployment.
| status_quo69 wrote:
| > If you need to handle different versions talking to each other in production it doesn't seem any harder to also deal with different versions in source
|
| It's much more annoying to deal with multi-repo setups and it can be a real productivity killer. Additionally, if you have a shared dependency, now you have to juggle managing that shared dep. For example, repo A needs shared lib Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on team A didn't update their dependencies often enough to keep up with version bumps from the Foo team. Now there's a really weird situation going on in your company where not all teams are on the same page. A naive monorepo forces that shared-dep change to be applied across the board at once.
|
| Edit: Regarding your "old code talking to new version" problem, that's a culture problem IMO. At work we must always consider the fact that a deployment rollout takes time, so our changes in sensitive areas (controllers, jobs, etc.) should be as backwards-compatible as possible for that one deploy, barring a rollback of some kind. We have linting rules and a very stupid bot that posts a message reminding us of that fact if we're trying to change something sensitive to version changes, but the main thing that keeps it all sane is that it's collectively drilled into our heads, from the first time we deploy to production, that we support N versions backwards. Since we're in a monorepo, the backwards-compat check is usually ripped out in a single PR immediately after a deployment is verified as good. In a multi-repo setup, ripping that compat check out would require _another_ version bump and N PRs to make sure that everyone is on the same page. It really sucks.
| slver wrote:
| We have repository systems built for centralized atomic updates and giant monorepos, like SVN. The question is why we are trying to have Git do this, when it was explicitly designed with the exact opposite goal. Is this an attempt to do SVN in Git, so we get to keep the benefits of the former and the cool buzzword factor of the latter? I don't know.
|
| Also, when I try to think about reasons to have atomic cross-project changes, my mind keeps drawing negative examples, such as another team changing the code on your project. Is that a good practice? Not really. Yet unless all projects are owned by the same team, that's what will happen in a monorepo.
|
| Atomic updates not scaling beyond a certain technical level is often a good thing, because they also don't scale on the human and organizational level.
| alexhutcheson wrote:
| In the monorepo world:
|
| 1. You determine that a library used by a sizable fraction of the code in your entire org has a problem that's critical to fix (maybe a security issue, or maybe the change could just save millions of dollars in compute resources, etc.), but the fix requires updating the use of that library in ~30 call sites spread across the codebases of ~10 different teams.
|
| 2. You create a PR that fixes the code and the problematic call sites in a single commit. It gets merged and you're done.
|
| In the multi-repo world, you need to instead:
|
| 1. Add conditional branching in your library so that it supports both the old behavior and the new behavior. This could be an experiment flag, a new method DoSomethingV2, a new constructor arg, etc. (a sketch of such a shim follows this comment). Depending on how you do this, you might dramatically increase the number of call sites that need to be modified.
|
| 2. Either wait for all the problematic clients to update to the new version of your library, or create PRs to manually bump their version. Whoops - turns out a couple of them were on a very old version, and the upgrade is non-trivial. Now that's your problem to resolve before you proceed.
|
| 3. Create PRs to modify the calling code in every repo that includes problematic calls, and follow up with 10 different reviewers to get them merged.
|
| 4. If you still have the stamina, go through steps 1-3 again to clean up the conditional logic you added to your library in step 1.
|
| Basically, if code calls libraries that exist in different repos, then making backwards-incompatible changes to those libraries becomes extremely expensive. This is bad, because sometimes backwards-incompatible changes would have very high value.
|
| If the numbers from my example were higher (e.g. 1000 call sites across 100 teams), then the library maintainer in a monorepo would probably still want to use a feature flag or similar to avoid trying to merge a commit that affects 1000 files in one go. However, the library maintainer's job is still dramatically easier, because they don't have to deal with 100 individual repos, and they don't need to do anything to ensure that everyone is using the latest version of their library.
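| A sketch of the kind of compatibility shim step 1 describes, in Python (DoSomethingV2 is the name from the comment, rendered here as do_something_v2; everything else is illustrative):
|
|       import warnings
|
|       class Widget:
|           def do_something(self, data):
|               """Old, problematic behavior. Kept around until every
|               client repo has migrated, which may take a while."""
|               warnings.warn(
|                   "do_something is deprecated; use do_something_v2",
|                   DeprecationWarning, stacklevel=2)
|               return data            # stand-in for the old logic
|
|           def do_something_v2(self, data):
|               """New, fixed behavior; call sites migrate repo by repo."""
|               return sorted(data)    # stand-in for the corrected logic
|
| Only once every downstream repo has moved over can the old method be deleted -- the extra round trip that step 4 complains about.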
| slver wrote:
| Your monorepo scenario makes the following unlikely assumptions:
|
| 1. A critical security/performance fix has no other recourse than breaking the interface compatibility of a library. The far more common scenario is that this can be fixed in the implementation without BC breaks (otherwise systems like semver wouldn't make sense).
|
| 2. The person maintaining the library knows the codebases of 10 teams better than those 10 teams do, so that person can patch their projects better and faster than the actual teams.
|
| As a library maintainer, you know the interface of your library. But that's merely the "how" on the other end of those 30 call sites. You don't know the "why". You can easily break their projects even though your code compiles just fine. So that approach would be reckless.
|
| Also, your multi-repo scenario is artificially contrived. No, you don't need conditional branching and all this nonsense.
|
| In the common scenario, you just push a patch that maintains BC, tell the teams to update, and that's it.
|
| And if you do have BC breaks, then:
|
| 1. Push a major version with the BC breaks and the fix.
|
| 2. Push a patch version deprecating the old release and telling developers to update.
|
| That's it. You don't need all this nonsense you listed.
| hamandcheese wrote:
| I've lived both lives. It absolutely is an ordeal making changes across repos. The model you are highlighting opens up substantial risk that folks don't update in a timely manner. What you are describing is basically just throwing code over the wall and hoping for the best.
| howinteresting wrote:
| Semver is a second-rate coping mechanism for when better coordination systems don't exist.
| slver wrote:
| Patching the code of 10 projects you don't maintain isn't an example of a "coordination system". It's an example of avoiding having one.
|
| In multithreading this would basically be mutable shared state with no coordination. Every thread sees everything, and is free to mutate any of it at any point. Which, as we all know, is a best practice in multithreading /s
| howinteresting wrote:
| The same code can have multiple overlapping sets of maintainers. For example, one team can be responsible for business logic while another team manages core abstractions shared by many product teams. Yet another team may be responsible for upgrading to newer toolchains and language features.
| They'll all want to touch the same code but make different, roughly orthogonal changes to it.
|
| Semver provides just a few bits of information, not nearly enough to cover the whole gamut of shared and distributed responsibility.
|
| The comparison with multithreading is not really valid, since monorepos typically linearize history.
| slver wrote:
| Semver was enough for me to resolve, very simply, a scenario above that was presented as some kind of insurmountable nightmare. So I think semver is just fine. It's an example of a simple, well-designed abstraction. Having "more bits" is not a virtue here.
|
| I could have some comments on your "overlapping responsibilities" as well, but your description is too abstract and vague to address, so I'll pass on that. But you literally described the concept of a library at one point. There's nothing overlapping about it.
| iudqnolq wrote:
| > You create a PR that fixes the code and the problematic call sites in a single commit. It gets merged and you're done.
|
| What happens when you roll this out and partway through the rollout an old version talks to a new version? I thought you still needed backwards compat? I'm a student and I've never worked on a project with no-downtime deploys, so I'm interested in how this can be possible.
| howinteresting wrote:
| Of course I want people who care about modernizing code to come in and modernize my code (such as upgrades to newer language versions). Why should the burden be distributed when it can be concentrated among experts?
|
| I leverage type systems and write tests to catch any mistakes they might make.
| swiley wrote:
| Yes you can, it happens when you bump the submodule reference. This is how reasonable people use git.
| Denvercoder9 wrote:
| Submodules often provide a terrible user experience _because_ they are locked to a single version. To propagate a single commit, you need to update every single dependent repository. In some contexts that can be helpful, but in my experience it's mostly an enormous hassle.
|
| Also, it's awful that a simple git pull doesn't actually pull updated submodules; you need to run git submodule update (or sync or whatever it is) as well.
|
| I don't want to work with git submodules ever again. The idea is nice, but the user experience is really terrible.
| fpoling wrote:
| Looking back, I just do not understand why git came up with this awkward mess of submodules. Instead it should have a way to say that a particular directory is self-contained, so that any commit affecting it becomes two objects. The first is the commit object for the directory, using only relative paths. The second is a commit for the rest of the code, with a reference to the first. Then one could just pull any repository into the main repository and use it normally.
|
| git subtree tries to emulate that, but it does not scale to huge repositories, as it needs to change all commits in the subtree to use new nested paths.
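| A toy model of that proposal in Python (entirely hypothetical -- this is not how git's object model actually works):
|
|       from dataclasses import dataclass, field
|
|       @dataclass
|       class InnerCommit:
|           """Commit for a self-contained directory; its tree uses
|           only paths relative to that directory."""
|           tree: dict                 # relative path -> blob id
|           message: str
|           parents: list = field(default_factory=list)
|
|       @dataclass
|       class OuterCommit:
|           """Commit for the rest of the repo, referencing the
|           inner commit instead of duplicating its paths."""
|           tree: dict                 # paths outside the nested dir
|           nested: dict               # mount point -> InnerCommit
|           message: str
|           parents: list = field(default_factory=list)
|
|       # The inner history never mentions its mount point, so a
|       # sub-project could be pulled in under any directory without
|       # rewriting its commits -- the git subtree pain point above.
|       lib = InnerCommit(tree={"src/lib.c": "b1"}, message="fix lib")
|       repo = OuterCommit(tree={"README": "b2"},
|                          nested={"vendor/lib": lib},
|                          message="import lib fix")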
| mdaniel wrote:
| And woe unto junior developers who change into the submodule directory and do a git commit, made infinitely worse if it's followed by a git push, because now there's a SHA hanging out in the repo which works on one machine but that no one else's submodule update will see without surgery.
|
| I'm not at my computer to see if modern git prohibits that behavior, but it is indicative of the "watch out" that comes with advanced git usage: it is a very sharp knife.
| dylan-m wrote:
| Or define your interfaces properly, version them, and publish libraries (precompiled, ideally) somewhere outside of your source repo. Your associated projects depend on those rather than on random chunks of code that happen to be in the same file structure. It's more work, but it encourages better organization in general and saves an incredible amount of time later on for any complex project.
| throwaway894345 wrote:
| I don't like this because it assumes that all of those repositories are accessible all of the time to everyone who might want to build something. If one repo for some core artifact becomes unreachable, everyone is dead in the water.
|
| Ideally "cached on the network" could be a sort of optional side effect, like with Nix, while you can still reproducibly build from source. That said, I can't recommend Nix, not for philosophical reasons, but because of lots of implementation details.
| cryptica wrote:
| If the project has good separation of concerns, you don't need atomic updates. Good separation of concerns yields many benefits beyond ease of project management. It requires a bit more thought, but if done correctly, it's worth many times the effort.
|
| Good separation of concerns is like earning compound interest on your code.
|
| Just keep the dependencies generic and tailor the higher-level logic to the business domain. Then you rarely need to update the dependencies.
|
| I've been doing this on commercial projects (to much success) for decades, since before most of the downvoters on here even wrote their first hello-world programs.
| [deleted]
| WayToDoor wrote:
| The article is really impressive. It's nice to see GitHub contribute changes back to the git project, and to know that the two work closely together.
| slver wrote:
| It's in their mutual interest. Imagine what happens to GIThub if GIT goes out of fashion.
| infogulch wrote:
| Yes, isn't it nice when the interests of multiple parties are aligned such that they help each other make progress towards their shared goals?
| slver wrote:
| Well, it's nice to see they're rational, indeed.
| jackbravo wrote:
| Other rational companies could try to fix this without contributing upstream. Doing it upstream benefits competitors like GitLab. So yeah! It's nice seeing this kind of behavior.
| slver wrote:
| First, they not only contributed upstream, upstream developers contributed to this patch. I.e., they got help outside GitHub to make this patch possible.
|
| Second, if they had decided to fork Git, they'd have to maintain that fork forever.
|
| Third, this fork could over time become visibly, or even worse subtly, incompatible with stock Git, which is still the Git running on GitHub users' machines, and the two have to interact with each other in a 100% compatible manner.
|
| So, in this case, not contributing upstream was literally a no-go. The only rational choice was to not fork Git.
___________________________________________________________________
(page generated 2021-05-01 23:00 UTC)