[HN Gopher] Scaling monorepo maintenance
       ___________________________________________________________________
        
       Scaling monorepo maintenance
        
       Author : pimterry
       Score  : 250 points
       Date   : 2021-04-30 09:52 UTC (1 day ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | whymauri wrote:
       | I love when people use Git in ways I haven't thought about
       | before. Reminds me of the first time I played around with
       | 'blobs.'
        
       | debarshri wrote:
        | It is a great write-up. I wonder how GitLab solves this
        | problem.
        
         | lbotos wrote:
         | GL packs refs at various times and frequencies depending on
         | usage:
         | https://docs.gitlab.com/ee/administration/housekeeping.html
         | 
          | It works well for most repos, but as you get out to the edge
          | cases with lots of commits it can cause slowness. GL admins
          | can repack reasonably safely at various times to speed up
          | access, but the solution presented in the blog would
          | definitely speed up packing.
         | 
         | (I work as a Support Engineering Leader at GL but I'm reading
         | HN for fun <3)
        
         | masklinn wrote:
         | They might have yet to encounter it. Git is hosting some really
         | big repos.
        
           | lbotos wrote:
           | Oh, we def have. I've seen some large repos (50+GB) in some
           | GL installations.
        
       | the_duke wrote:
       | Very well written post and upstream work is always appreciated.
       | 
       | I also really like monorepos, but Git and GitHub really don't
       | work well at all for them.
       | 
        | On the Git side there is no way to clone only parts of a repo or
        | to limit access by user. All the Git tooling out there, from the
        | CLI to the various IDE integrations, is ill-adjusted to a huge
        | repo with lots of unrelated commits.
       | 
       | On the Github side there is no separation between the different
       | parts of a monorepo in the UI (issues, prs, CI), the workflows,
       | or the permission system. Sure, you can hack something together
       | with labels and custom bots, but it always feels like a hack.
       | 
       | Using Git(hub) for monorepos is really painful in my experience.
       | 
        | There is a reason why Google, Facebook et al. have heaps of
        | custom tooling.
        
         | krasin wrote:
          | I really like monorepos. But I find that it's almost never a
          | good idea to hide parts of the source code from developers.
          | And if there's some secret sauce that's so sensitive that only
          | a single-digit number of developers in the whole company can
          | access it, then it's probably okay to have a separate
          | repository just for it.
         | 
          | Working in environments where different people have partial
          | access to different parts of the code never felt productive to
          | me -- often, figuring out who can take on a task and how to
          | grant all the access takes longer than the task itself.
        
         | jayd16 wrote:
         | I wouldn't call it painful exactly but I'll be happy when
         | shallow and sparse cloning become rock solid and boring.
        
         | Beowolve wrote:
          | On this note, GitHub does reach out to customers with
          | monorepos and is aware of their shortcomings. I think over
          | time we will see them change to have better support. It's
          | only a matter of time.
        
         | jeffbee wrote:
         | It's funny that you mention this as if monorepos of course
         | require custom tooling. Google started with off-the-shelf
         | Perforce and that was fine for many years, long after their
         | repo became truly huge. Only when it became _monstrously_ huge
         | did they need custom tools and even then they basically just
         | re-implemented Perforce instead of adopting git concepts. You,
          | too, can just use Perforce. It's even free for up to five
         | users. You won't outgrow its limits until you get about a
         | million engineer-years under your belt.
         | 
          | The reason git doesn't have partial repo cloning is that it
          | was written by people without regard to the past experience of
         | software development organizations. It is suited to the
         | radically decentralized group of Linux maintainers. It is
         | likely that your organization much more closely resembles
         | Google or Facebook than Linux. Perforce has had partial
         | checkout since ~always, because that's a pretty obvious
         | requirement when you stop and think about what software
         | development _companies_ do all day.
        
           | forrestthewoods wrote:
           | It's somewhat mind boggling that no one has made a better
            | Perforce. It has numerous issues and warts. But it's much
            | closer to what the majority of projects need than Git, imho.
           | And for bonus points I can teach an artist/designer how to
           | safely and correctly use Perforce in about 10 minutes.
           | 
           | I've been using Git/Hg for years and I still run into the
           | occasional Gitastrophe where I have to Google how to unbreak
           | myself.
        
           | Chyzwar wrote:
            | Git recently added sparse checkout, and there is also the
            | Virtual File System for Git from Microsoft.
           | 
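            | (E.g., roughly, with illustrative paths:
            | 
            |     git clone --filter=blob:none --sparse <repo-url>
            |     git sparse-checkout set services/api
            | 
            | which gives you a checkout of just that directory.)
            | 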
            | In my experience, git/the VCS is not the issue for a
            | monorepo. Build, test, automation, deployments, CI/CD are
            | way harder. You will end up with a bunch of shell scripts,
            | makefiles, Grunt, and a combination of ugly hacks. If you
            | are smart you will adopt something like Bazel and have a
            | dedicated tooling team. If you see everything as a nail, you
            | will split the monorepo into an unmaintainable mess of small
            | repos that slowly rot away.
        
         | throwaway894345 wrote:
         | I've always found that the biggest issue with monorepos is the
         | build tooling. I can't get my head around Bazel and other Blaze
         | derivatives enough to extend them to support any interesting
         | case, and Nix has too many usability issues to use productively
         | (and I've been in an organization that gave it an earnest
         | shot).
        
           | krasin wrote:
           | Can you please give an example of such an interesting case? I
           | am genuinely curious.
           | 
            | And I agree with the general point that monorepos require
            | great build tooling to match.
        
           | csnweb wrote:
            | With GitHub Actions you can quite easily specify a workflow
            | for parts of your repo (simple file path filter
            | https://docs.github.com/en/actions/reference/workflow-
            | syntax...). So you can basically just write one workflow for
            | each project in the monorepo and have only those run where
            | changes occurred.
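            | 
            | For illustration, a minimal trigger of that kind might look
            | like this (paths and job contents are hypothetical):
            | 
            |     # .github/workflows/api.yml
            |     on:
            |       push:
            |         paths:
            |           - 'services/api/**'
            |     jobs:
            |       test:
            |         runs-on: ubuntu-latest
            |         steps:
            |           - uses: actions/checkout@v2
            |           - run: make -C services/api test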
        
       | infogulch wrote:
       | This was great! My summary:
       | 
        | A git packfile is an aggregated and indexed collection of
        | historical git objects which reduces the time it takes to serve
        | requests for those objects, implemented as two files: .pack and
       | .idx. GitHub was having issues maintaining packfiles for very
       | large repos in particular because regular repacking always has to
       | repack the entire history into a single new packfile every time
       | -- which is an expensive quadratic algorithm. GitHub's
       | engineering team ameliorated this problem in two steps: 1. Enable
       | repos to be served from multiple packfiles at once, 2. Design a
       | packfile maintenance strategy that uses multiple packfiles
       | sustainably.
       | 
       | Multi-pack indexes are a new git feature, but the initial
        | implementation was missing performance-critical reachability
       | bitmaps for multi-pack indexes. In general, index files store
       | object names in lexicographic order and point to the named
       | object's position in the associated packfile. As a first step to
       | implement reachability bitmaps for multi-pack indexes, they
       | introduced a reverse index file (.rev) which maps packfile object
        | positions back to index file name offsets. This alone yielded a
        | big performance improvement, and it also filled in the missing
        | piece needed to implement multi-pack bitmaps.
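        | 
        | A toy sketch of that idx/rev relationship in Python (purely
        | illustrative; not git's on-disk format):
        | 
        |     # .idx order: object names sorted lexicographically
        |     names_sorted = ["aaa111", "bbb222", "ccc333"]
        |     # idx entry i lives at this position in the .pack
        |     pack_pos = [2, 0, 1]
        | 
        |     # The .rev file is essentially the inverse permutation,
        |     # mapping pack position -> idx entry:
        |     rev = [0] * len(pack_pos)
        |     for idx_entry, pos in enumerate(pack_pos):
        |         rev[pos] = idx_entry
        | 
        |     # Given an object's position in the pack, recover its
        |     # name (and thus its idx data) without a linear scan:
        |     assert names_sorted[rev[0]] == "bbb222"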
       | 
        | With the issue of serving repos from multiple packs solved, they
        | needed to utilize multiple packs efficiently to reduce
        | maintenance overhead. They chose to maintain historical
        | packfiles in geometrically increasing sizes. I.e., during the
        | maintenance job, consider the N most recent packfiles: if the
        | summed size of packfiles [1, N] is less than the size of
        | packfile N+1, then packfiles [1, N] are repacked into a single
        | packfile and we're done; however, if their summed size is
        | greater than the size of packfile N+1, then iterate and compare
        | packfiles [1, N+1] to packfile N+2, etc. This results in a set
        | of packfiles where each file is roughly double the size of the
        | previous when ordered by age, which has a number of beneficial
        | properties for both serving and the average-case maintenance
        | run. Funny enough, this selection procedure struck me as similar
        | to the game "2048".
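        | 
        | A rough sketch of that selection loop in Python (my reading of
        | the rule; git's actual implementation works on object counts
        | with a configurable factor, where a factor of 2 gives the
        | "roughly double" shape):
        | 
        |     def packs_to_combine(sizes, factor=2):
        |         # sizes: packfile sizes, newest/smallest first.
        |         # Returns how many leading packs to roll up into one.
        |         total, n = 0, 0
        |         for i, size in enumerate(sizes[:-1]):
        |             total += size
        |             n = i + 1
        |             if factor * total <= sizes[i + 1]:
        |                 break  # the rest already grow geometrically
        |         return n
        | 
        |     # e.g. [1, 1, 3, 12, 40] -> roll up the first 3 into a
        |     # pack of 5, leaving [5, 12, 40]: each ~double the last.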
        
       | underdeserver wrote:
       | 30 minute read + the Git object model = mind boggled.
       | 
        | I'd have appreciated a series of articles instead of one; for
        | me it's way too much info to take in in one sitting.
        
         | iudqnolq wrote:
         | I'm currently working through the book Building Git. Best $30
         | I've spent in a while. It's about 700 pages, but 200 pages in
         | and I can stage files to/from the index, make commits, and see
         | the current status (although not on a repo with packfiles).
         | 
         | I'm thinking about writing a blog post where I write a git
         | commit with hexdump, zlib, and vim.
        
         | georgyo wrote:
         | It was a lot to digest, but it was also all one continuous
         | thought.
         | 
         | If it was broken up, I don't think it would have been nearly as
         | good. And I don't think I would have been able to keep all the
         | context to understand smaller chunks.
         | 
         | I really enjoyed the whole thing.
        
       | swiley wrote:
       | Mono-repos are like having a flat directory structure.
       | 
       | Sure it's simple but it makes it hard to find anything if you
        | have a lot of stuff/people. Submodules and package managers
        | exist for a reason.
        
         | no_wizard wrote:
         | Note: for the sake of discussion I'm assuming when we say
         | monorepo we mean _monorepo and associated tools used to manage
         | them_
         | 
         | The trade off is simplified management of dependencies. With a
         | monorepo, I can control every version of a given dependency so
         | they're uniform across packages. If I update one package it is
         | always going to be linked to the other in its latest version. I
         | can simplify releases and managing my infrastructure in the
         | long term, though there is a trade off in initial complexity
          | for certain things if you want to do something like, say,
          | only run tests for packages that have changed in CI (useful
          | in some cases).
         | 
          | It's all trade offs, but the quality of code has been higher
          | for our org in a monorepo on average.
        
           | mr_tristan wrote:
           | I've found that many developers do not pay attention to
           | dependency management, so this approach of "it's either in
           | the repo or it doesn't exist" is actually a nice guard rail.
           | 
            | I'm reading between the lines here, but I'm assuming you've
            | set up your tooling to enforce this. As in: the various
            | projects in the repo don't just optionally decide to have
            | external references, e.g., Maven Central, npm, etc.
           | 
           | This puts quite a lot of "stuff" in the repo, but with
           | improvements like this article mentioned, makes monorepos in
           | git much easier to use.
           | 
            | I'd have to think you could generate a lot of automation and
            | reports triggered by commits pretty easily, too. I'd say
            | that would make the monorepo even easier to observe with a
            | modicum of the tooling required to maintain independent
            | repositories.
        
             | no_wizard wrote:
             | That is accurate, I wouldn't use a monorepo without
             | tooling, and in the JavaScript / TypeScript ecosystem, you
             | really can't do much without tooling (though npm supports
             | workspaces now, it doesn't support much else yet, like
             | plugins or hooks etc).
             | 
              | I have tried in the past to achieve the same goals without
              | one, particularly around the dependency graph and not
              | duplicating functionality found in shared libraries (a
              | concern that goes hand in hand with another concern of
              | mine, documentation enforcement). It just wasn't possible
              | in a way I could automate with a high degree of accuracy
              | and confidence without even more complexity, like having
              | to use some kind of CI integration to pull dependency
              | files across packages and compare them. In a monorepo I
              | have a single tool that does this for _all_ dependencies
              | whenever any package.json file or the lock file is
              | updated.
             | 
              | If you care at all about your dependency graph -- and in
              | my not so humble opinion every developer should have some
              | high-level awareness here in their given domain -- I
              | haven't found a better solution that is less complex than
              | leveraging a monorepo.
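              | 
              | (For reference, the npm workspaces support mentioned above
              | is just a root package.json field; the glob is
              | illustrative:
              | 
              |     {
              |       "name": "my-monorepo",
              |       "private": true,
              |       "workspaces": ["packages/*"]
              |     }
              | 
              | anything beyond that -- hooks, plugins, graph-aware tasks
              | -- still needs extra tooling.)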
        
         | Denvercoder9 wrote:
          | _> Sure it's simple but it makes it hard to find anything if
         | you have a lot of stuff/people._
         | 
         | I think this is a bad analogy. Looking up a file or directory
         | in a monorepo isn't harder than looking up a repository. In
         | fact, I'd argue it's easier, as we've developed decades of
         | tooling for searching through filesystems, while for searching
         | through remotely hosted repositories you're dependent on the
         | search function of the repository host, which is often worse.
        
       | cryptica wrote:
       | To scale a monorepo, you need to split it up into multiple repos;
       | that way each repo can be maintained independently by a separate
       | team...
       | 
        | We can call it a multi-monorepo; that way our brainwashed
        | managers will agree to it.
        
         | Orphis wrote:
         | And that way, you can't have atomic updates across the
         | repositories and need to synchronize them all the time, great.
        
           | iudqnolq wrote:
           | What do atomic source updates get you if you don't have
           | atomic deploys? I'm just a student but my impression is that
           | literally no one serious has atomic deploys, not even Google,
           | because the only way to do it is scheduled downtime.
           | 
           | If you need to handle different versions talking to each
           | other in production it doesn't seem any harder to also deal
           | with different versions in source, and I'd worry atomic
           | updates to source would give a false sense of security in
           | deployment.
        
             | status_quo69 wrote:
             | > If you need to handle different versions talking to each
             | other in production it doesn't seem any harder to also deal
             | with different versions in source
             | 
             | It's much more annoying to deal with multi-repo setups and
             | it can be a real productivity killer. Additionally, if you
             | have a shared dependency, now you have to juggle managing
             | that shared dep. For example, repo A needs shared lib
             | Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on
             | team A didn't update their dependencies often enough to
             | keep up with version bumps from the Foo team. Now there's a
             | really weird situation going on in your company where not
              | all teams are on the same page. A naive monorepo forces
             | that shared dep change to be applied across the board at
             | once.
             | 
             | Edit: In regards to your "old code talking to new version"
             | problem, that's a culture problem IMO. At work we must
             | always consider the fact that a deployment rollout takes
             | time, so our changes in sensitive areas (controllers, jobs,
             | etc) should be as backwards compatible as possible for that
             | one deploy barring a rollback of some kind. We have linting
             | rules and a very stupid bot that posts a message reminding
             | us of that fact if we're trying to change something
             | sensitive to version changes, but the main thing that keeps
             | it all sane is we have it all collectively drilled in our
             | heads from the first time that we deploy to production that
             | we support N number of versions backwards. Since we're in a
              | monorepo, the backwards compat check is usually ripped out
              | in a single PR immediately after a deployment is verified
              | as good. In a multi-repo setup, ripping that
             | compat check out would require _another_ version bump and N
             | number of PRs to make sure that everyone is on the same
             | page. It really sucks.
        
           | slver wrote:
           | We have repository systems built for centralized atomic
           | updates, and giant monorepos, like SVN. Question is why are
           | we trying to have Git do this, which was explicitly designed
           | with the exact opposite goal? Is this an attempt to do SVN in
           | Git so we get to keep the benefits of the former, and the
           | cool buzzword-factor of the latter? I don't know.
           | 
           | Also when I try to think about reasons to have atomic cross-
           | project changes, my mind keeps drawing negative examples,
           | such as another team changing the code on your project, is
           | that a good practice? Not really. Well unless all projects
           | are owned by the same team, it'll happen in a monorepo.
           | 
           | Atomic updates not scaling beyond certain technical level is
           | often a good thing, because they also don't scale on human
           | and organizational level.
        
             | alexhutcheson wrote:
             | 1. You determine that a library used by a sizable fraction
             | of the code in your entire org has a problem that's
             | critical to fix (maybe a security issue, or maybe the
             | change could just save millions of dollars in compute
             | resources, etc.), but the fix requires updating the use of
             | that library in ~30 call sites spread across the codebases
             | of ~10 different teams.
             | 
             | 2. You create a PR that fixes the code and the problematic
             | call sites in a single commit. It gets merged and you're
             | done.
             | 
             | In the multi-repo world, you need to instead:
             | 
             | 1. Add conditional branching in your library so that it
             | supports both the old behavior and new behavior. This could
             | be an experiment flag, a new method DoSomethingV2, a new
             | constructor arg, etc. Depending on how you do this, you
             | might dramatically increase the number of call sites that
             | need to be modified.
             | 
             | 2. Either wait for all the problematic clients to update to
             | the new version of your library, or create PRs to manually
             | bump their version. Whoops - turns out a couple of them
             | were on a very old version, and the upgrade is non-trivial.
             | Now that's your problem to resolve before you proceed.
             | 
             | 3. Create PRs to modify the calling code in every repo that
             | includes problematic calls, and follow up with 10 different
             | reviewers to get them merged.
             | 
             | 4. If you still have the stamina, go through steps 1-3
             | again to clean up the conditional logic you added to your
             | library in step 1.
             | 
             | Basically, if code calls libraries that exist in different
             | repos, then making backwards-incompatible changes to those
             | libraries becomes extremely expensive. This is bad, because
             | sometimes backwards-incompatible changes would have very
             | high value.
             | 
             | If the numbers from my example were higher (e.g. 1000 call
             | sites across 100 teams), then the library maintainer in a
             | monorepo would probably still want to use a feature flag or
             | similar to avoid trying to merge a commit that affects 1000
             | files in one go. However, the library maintainer's job is
             | still dramatically easier, because they don't have to deal
             | with 100 individual repos, and they don't need to do
             | anything to ensure that everyone is using the latest
             | version of their library.
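              | 
              | To make step 1 concrete, the "conditional branching" is
              | typically a shim along these lines (a sketch; all names
              | invented):
              | 
              |     def _process_legacy(items):
              |         return [i * 2 for i in items]  # old behavior
              | 
              |     def _process_fixed(items):
              |         # new behavior containing the critical fix
              |         return [i * 2 for i in sorted(items)]
              | 
              |     def process(items, use_fix=False):
              |         # Callers opt in repo by repo; the flag and the
              |         # legacy path are deleted in step 4 once every
              |         # repo has migrated.
              |         if use_fix:
              |             return _process_fixed(items)
              |         return _process_legacy(items)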
        
               | slver wrote:
               | Your monorepo scenario makes the following unlikely
               | assumptions:
               | 
               | 1. A critical security/performance fix has no other
               | recourse than breaking the interface compatibility of a
               | library. Far more common scenario is this can be fixed in
               | the implementation without BC breaks (otherwise systems
               | like semver wouldn't make sense).
               | 
               | 2. The person maintaining the library knows the codebases
                | of 10 teams better than those 10 teams do, so that
               | person can patch their projects better and faster than
               | the actual teams.
               | 
               | As a library maintainer, you know the interface of your
               | library. But that's merely the "how" on the other end of
               | those 30 call sites. You don't know the "why". You can
                | easily break their projects even though your code
                | compiles just fine. So that'd be a reckless approach.
               | 
               | Also your multi-repo scenario is artificially contrived.
               | No, you don't need conditional branching and all this
               | nonsense.
               | 
               | In the common scenario, you just push a patch that
               | maintains BC and tell the teams to update and that's it.
               | 
               | And if you do have BC breaks, then:
               | 
               | 1. Push a major version with the BC breaks and the fix.
               | 
               | 2. Push a patch version deprecating that release and
               | telling developers to update.
               | 
               | That's it. You don't need all this nonsense you listed.
        
               | hamandcheese wrote:
               | I've lived both lives. It absolutely is an ordeal making
               | changes across repos. The model you are highlighting
               | opens up substantial risk that folks don't update in a
               | timely manner. What you are describing is basically just
               | throwing code over the wall and hoping for the best.
        
               | howinteresting wrote:
               | Semver is a second-rate coping mechanism for when better
               | coordination systems don't exist.
        
               | slver wrote:
               | Patching the code of 10 projects you don't maintain isn't
               | an example of a "coordination system". It's an example of
               | avoiding having one.
               | 
               | In multithreading this would be basically mutable shared
               | state with no coordination. Every thread sees everything,
               | and is free to mutate any of it at any point. Which as we
               | all know is a best practice in multithreading /s
        
               | howinteresting wrote:
               | The same code can have multiple overlapping sets of
               | maintainers. For example, one team can be responsible for
               | business logic while another team can manage core
               | abstractions shared by many product teams. Yet another
               | team may be responsible for upgrading to newer toolchains
               | and language features. They'll all want to touch the same
               | code but make different, roughly orthogonal changes to
               | it.
               | 
               | Semver provides just a few bits of information, not
               | nearly enough to cover the whole gamut of shared and
               | distributed responsibility.
               | 
               | The comparison with multithreading is not really valid,
               | since monorepos typically linearize history.
        
               | slver wrote:
               | Semver was enough for me to resolve very simply a
               | scenario above that was presented as some kind of
               | unsurmountable nightmare. So I think Semver is just fine.
               | It's an example of a simple, well designed abstraction.
               | Having "more bits" is not a virtue here.
               | 
               | I could have some comments on your "overlapping
               | responsibilities" as well, but your description is too
               | abstract and vague to address, so I'm pass on that. But
               | you literally described the concept of library at one
               | point. There's nothing overlapping about it.
        
               | iudqnolq wrote:
               | > You create a PR that fixes the code and the problematic
               | call sites in a single commit. It gets merged and you're
               | done.
               | 
               | What happens when you roll this out and partway through
               | the rollout an old version talks to a new version? I
               | thought you still needed backwards compat? I'm a student
               | and I've never worked on a project with no-downtime
               | deploys, so I'm interested in how this can be possible.
        
             | howinteresting wrote:
             | Of course I want people who care about modernizing code to
             | come in and modernize my code (such as upgrades to newer
             | language versions). Why should the burden be distributed
             | when it can be concentrated among experts?
             | 
             | I leverage type systems and write tests to catch any
             | mistakes they might make.
        
           | swiley wrote:
            | Yes you can; it happens when you bump the submodule
            | reference. This is how reasonable people use git.
        
             | Denvercoder9 wrote:
             | Submodules often provide a terrible user experience
             | _because_ they are locked to a single version. To propagate
             | a single commit, you need to update every single dependent
             | repository. In some contexts that can be helpful, but in my
              | experience it's mostly an enormous hassle.
             | 
              | Also it's awful that a simple git pull doesn't actually
              | pull updated submodules; you need to run git submodule
              | update (or sync or whatever it is) as well.
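              | 
              | (For what it's worth, git can be told to do this
              | automatically -- something like:
              | 
              |     git pull --recurse-submodules
              |     # or, once per clone:
              |     git config submodule.recurse true
              | 
              | though the defaults remain a footgun.)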
             | 
             | I don't want to work with git submodules ever again. The
             | idea is nice, but the user experience is really terrible.
        
               | fpoling wrote:
                | Looking back I just do not understand why git came up
                | with this awkward mess of submodules. Instead it should
                | have a way to say that a particular directory is self-
                | contained and any commit affecting it should be two
                | objects. The first is the commit object for the
                | directory, using only relative paths. The second is a
                | commit for the rest of the code with a reference to the
                | first. Then one could just pull any repository into the
                | main repository and use it normally.
               | 
               | git subtree tries to emulate that, but it does not scale
               | to huge repositories as it needs to change all commits in
               | the subtree to use new nested paths.
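                | 
                | (For reference, the emulation looks something like:
                | 
                |     git subtree add --prefix=vendor/lib <repo-url> main
                | 
                | and the expensive part is the reverse, git subtree
                | split, which synthesizes a rewritten commit for every
                | commit that touches the prefix.)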
        
               | mdaniel wrote:
               | And woe unto junior developers who change into the
               | submodule directory and do a git commit, then made
               | infinitely worse if it's followed by git push because now
               | there's a sha hanging out in the repo which works on one
               | machine but that no one else's submodule update will see
                | without surgery.
               | 
               | I'm not at my computer to see if modern git prohibits
               | that behavior, but it is indicative of the "watch out"
               | that comes with advanced git usage: it is a very sharp
                | knife.
        
             | dylan-m wrote:
             | Or define your interfaces properly, version them, and
             | publish libraries (precompiled, ideally) somewhere outside
             | of your source repo. Your associated projects depend on
             | those rather than random chunks of code that happen to be
             | in the same file structure. It's more work, but it
             | encourages better organization in general and saves an
             | incredible amount of time later on for any complex project.
        
               | throwaway894345 wrote:
               | I don't like this because it assumes that all of those
               | repositories are accessible all of the time to everyone
               | who might want to build something. If one repo for some
               | core artifact becomes unreachable, everyone is dead in
               | the water.
               | 
               | Ideally "cached on the network" could be a sort of
               | optional side effect, like with Nix, but you can still
               | reproducibly build from source. That said, I can't
               | recommend Nix, not for philosophical reasons, but for
               | lots of implementation details.
        
           | cryptica wrote:
           | If the project has good separation of concerns, you don't
           | need atomic updates. Good separation of concerns yields many
           | benefits beyond ease of project management. It requires a bit
           | more thought, but if done correctly, it's worth many times
           | the effort.
           | 
           | Good separation of concerns is like earning compound interest
           | on your code.
           | 
           | Just keep the dependencies generic and tailor the higher
           | level logic to the business domain. Then you rarely need to
           | update the dependencies.
           | 
            | I've been doing this on commercial projects (to much
            | success) for decades, since before most of the down-voters
            | on here even wrote their first hello world programs.
        
           | [deleted]
        
       | WayToDoor wrote:
       | The article is really impressive. It's nice to see GitHub
       | contribute changes back to the git project, and to know that the
       | two work closely together.
        
         | slver wrote:
         | It's in their mutual interest. Imagine what happens to GIThub
         | if GIT goes out of fashion.
        
           | infogulch wrote:
           | Yes, isn't it nice when interests of multiple parties are
           | aligned such that they help each other make progress towards
           | their shared goals?
        
             | slver wrote:
             | Well, it's nice to see they're rational, indeed.
        
               | jackbravo wrote:
                | Other rational companies could try to fix this without
                | contributing upstream. Doing it upstream benefits
                | competitors like GitLab. So yeah! It's nice seeing this
                | kind of behavior.
        
               | slver wrote:
                | First, not only did they contribute upstream; upstream
                | developers contributed to this patch. I.e., they got
                | help from outside GitHub to make this patch possible.
               | 
               | Second, if they had decided to fork Git, then they'd have
               | to maintain this fork forever.
               | 
                | Third, this fork could over time become visibly or, even
                | worse, subtly incompatible with stock Git, which is
                | still the Git running on GitHub users' machines, and the
                | two have to interact with each other in a 100%
                | compatible manner.
                | 
                | So, in this case, not contributing upstream was literally
                | a no-go. The only rational choice was to not fork Git.
        
       ___________________________________________________________________
       (page generated 2021-05-01 23:00 UTC)