[HN Gopher] Make your monorepo feel small with Git's sparse index
       ___________________________________________________________________
        
       Make your monorepo feel small with Git's sparse index
        
       Author : CRConrad
       Score  : 141 points
       Date   : 2021-11-11 15:27 UTC (7 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | harvie wrote:
       | I hope one day all of this will be as easy as in SVN. eg.:
       | 
       | I have repository https://example.com/myrepo
       | 
       | And i can simply do:
       | 
       | svn co https://example.com/myrepo/some/directory/
       | 
       | And i can work with that subdirectory as if it was actual repo.
       | Completely transparently.
       | 
       | This i really miss in git.
        
         | Gigachad wrote:
         | I'm working in hell right now. The current company has the site
         | frontend, backend, and tests in separate repos and it's
         | basically impossible to do anything without force merging
         | because the build is broken without a chicken and egg situation
         | between the 3 pull requests.
        
           | laurent123456 wrote:
           | I worked at a company that not only did that, but also
           | decided to split the main web app into multiple repos, one
           | per country. It was so much fun to do anything in this
           | project.
        
             | xorcist wrote:
              | Now, _that's_ a microservice if there ever was one!
        
         | williamvds wrote:
         | With shallow checkouts cloning is much quicker. You could try
         | combining it with sparse checkouts too. You can even have Git
         | fetch the full history in the background, and from a quick test
         | you can do stuff like commit while it's fetching. Obviously the
         | limited history means commands like log and blame will be
          | inaccurate until it's done.
          | 
          |     $ git clone --depth=1 <url>
          |     $ cd repo
          |     $ git fetch --unshallow &
          |     $ <do work>
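A rough sketch of the "shallow plus sparse" combination suggested above, written as a small Python helper that only builds the Git command lines rather than running them. The helper name `sparse_clone_commands` and the `repo` destination are made up for illustration; the flags assume a reasonably recent Git (the `sparse-checkout` command and `clone --sparse` arrived around Git 2.25). Unlike `svn co` of a subdirectory, the working tree still starts at the repository root; only the listed directory is populated.

```python
import shlex


def sparse_clone_commands(url, directory, dest="repo"):
    """Return the git invocations for a shallow, sparse clone (hypothetical helper)."""
    return [
        # Shallow clone that initially checks out only top-level files.
        ["git", "clone", "--depth=1", "--sparse", url, dest],
        # Widen the checkout to the one directory we care about.
        ["git", "-C", dest, "sparse-checkout", "set", directory],
    ]


# Print the commands so they can be reviewed before running them by hand.
for cmd in sparse_clone_commands("https://example.com/myrepo", "some/directory"):
    print(shlex.join(cmd))
```

To actually execute them, each list could be passed to `subprocess.run(cmd, check=True)`.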
        
         | zwieback wrote:
         | Yeah, that's really the one thing I miss from my SVN days. I'm
         | also still using Perforce, which can do even crazier things
         | with workspace mappings.
        
       | haberman wrote:
       | > The index file stores a list of every file at HEAD, along with
       | the object ID for its blob and some metadata. This list of files
       | is stored as a flat list and Git parses the index into an array.
       | 
       | I'm surprised that the index is not hierarchical, like tree
       | objects in Git's object storage.
       | 
       | With tree objects (https://git-scm.com/book/en/v2/Git-Internals-
       | Git-Objects#_tr...), each level of the hierarchy is a separate
       | object. So you would only need to load directories that are
       | interesting to you. You could use a single hash compare to
       | determine that two directories are identical without actually
       | recursing into them.
       | 
       | In particular, I can't understand why you would need a full list
       | of all files to create a commit. If your commit is known not to
       | touch certain directories, it should be able to simply refer to
       | the existing tree object without loading or expanding it.
       | 
       | I guess that's what this sparse-index work is doing. I'm just
       | surprised it didn't already work that way.
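To make the "single hash compare" point concrete, here is a minimal sketch of comparing two hierarchical snapshots. It is not Git's actual object format: `Blob`, `Tree`, `make_blob`, `make_tree`, and `changed_files` are illustrative names, and the hashing is simplified. The idea it shows is that when two directories have the same ID, the comparison never descends into them.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Blob:
    oid: str


@dataclass
class Tree:
    oid: str
    entries: dict  # name -> Blob or Tree


def make_blob(data: bytes) -> Blob:
    return Blob(hashlib.sha1(data).hexdigest())


def make_tree(entries: dict) -> Tree:
    # A tree's ID is derived from its children's names and IDs,
    # so equal IDs imply equal contents all the way down.
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(f"{name}\0{entries[name].oid}\n".encode())
    return Tree(h.hexdigest(), dict(entries))


def changed_files(old: Tree, new: Tree, prefix: str = "") -> list:
    if old.oid == new.oid:
        return []  # one hash compare prunes the whole subtree
    changed = []
    for name in sorted(old.entries.keys() | new.entries.keys()):
        a, b = old.entries.get(name), new.entries.get(name)
        path = prefix + name
        if a is not None and b is not None and a.oid == b.oid:
            continue  # identical entry, never expanded
        if isinstance(a, Tree) and isinstance(b, Tree):
            changed += changed_files(a, b, path + "/")
        else:
            changed.append(path)  # modified, added, removed, or type-changed
    return changed


docs = make_tree({"readme.md": make_blob(b"hello")})
src_v1 = make_tree({"main.c": make_blob(b"int main(){}")})
src_v2 = make_tree({"main.c": make_blob(b"int main(){return 0;}")})
root_v1 = make_tree({"docs": docs, "src": src_v1})
root_v2 = make_tree({"docs": docs, "src": src_v2})
print(changed_files(root_v1, root_v2))  # ['src/main.c']
```

The `docs` subtree is skipped after comparing one ID, however many files it holds.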
        
         | arxanas wrote:
         | It makes more sense if you think of the index as a structure
         | meant specifically to speed up `git status` operations. (It was
         | originally called the "dircache"! See https://github.com/git/gi
         | t/commit/5adf317b31729707fad4967c1a...) We desperately want to
         | reduce the number of file accesses we have to make, so directly
         | using the object database and a tree object (or similar
         | structures) would more than double file accesses.
         | 
         | There's performance-related metadata in the index which isn't
         | in tree objects. For example, the modified-time of a given file
         | exists in its index entry, which can be used to avoid reading
         | the file from disk if it seems to be unmodified. If you have to
         | do a disk lookup to decide whether to read a file from disk,
         | then the overhead is potentially as much as the operation
         | itself.
         | 
         | There's also semantic metadata, such as which stage the file is
         | in (for merge conflict resolution).
         | 
         | It's worth noting that you can turn on the cache tree extension
         | (https://git-scm.com/docs/index-format#_cache_tree) in order to
         | speed up commit operations. It doesn't replace objects in the
         | index with trees, but it does keep ranges of the index cached,
         | if they're known to correspond to a tree.
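The modified-time shortcut described above can be sketched roughly as follows. This is a toy model, not Git's index: `StatCache` is a made-up class, it tracks only mtime and size, and it ignores the edge cases real Git has to handle (for example, files changed within the timestamp's resolution).

```python
import hashlib
import os
import tempfile


class StatCache:
    """Toy stand-in for the index: path -> (mtime_ns, size, content hash)."""

    def __init__(self):
        self.entries = {}
        self.reads = 0  # how many times file contents were actually read

    def _hash(self, path):
        self.reads += 1
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    def add(self, path):
        st = os.stat(path)
        self.entries[path] = (st.st_mtime_ns, st.st_size, self._hash(path))

    def is_modified(self, path):
        mtime_ns, size, oid = self.entries[path]
        st = os.stat(path)
        if (st.st_mtime_ns, st.st_size) == (mtime_ns, size):
            return False  # stat data matches: assume unchanged, skip the read
        return self._hash(path) != oid  # stat data differs: fall back to hashing


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "notes.txt")
    with open(path, "wb") as f:
        f.write(b"first draft")
    cache = StatCache()
    cache.add(path)
    print(cache.is_modified(path), cache.reads)  # False 1 (stat-only check)
    with open(path, "wb") as f:
        f.write(b"second, longer draft")
    print(cache.is_modified(path), cache.reads)  # True 2 (size changed, contents re-read)
```

The unchanged case costs one `stat` call and no file read, which is the saving the comment describes.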
        
       | junon wrote:
       | What I'd really like to see is Git have the ability to
       | consolidate repeat submodules down into a single set of objects
       | in the super repository. Currently cloning the same submodule
       | results in a copy of the repository for each path, which is
       | absurd.
       | 
       | It's been something on my list to address on the mailing lists
       | for a while, just haven't had time.
        
       | arxanas wrote:
       | The index as a data structure is really starting to show its age,
       | especially as developers adapt Git to monorepo scale. It's really
       | fast for repositories up to a certain size, but big tech
       | organizations grow exponentially, and start to suffer performance
       | issues. At some point, you can't afford to use a data structure
       | that scales with the size of the repo, and have to switch to one
       | that scales with the size of the user's change.
       | 
       | I spent a good chunk of time working around the lack of sparse
       | indexes in libgit2, which produced speedups on the order of 500x
       | for certain operations, because reading and writing the entire
       | index is unnecessary for most users of a monorepo:
       | https://github.com/libgit2/libgit2/issues/6036. I'm excited to
       | see sparse indexes make their way into Git proper.
       | 
       | Shameless plug: I'm working on improving monorepo-scale Git
       | tooling at https://github.com/arxanas/git-branchless, such as
       | with in-memory rebases: https://blog.waleedkhan.name/in-memory-
       | rebases/. Try it out if you work in a Git monorepo.
        
         | stormbrew wrote:
         | > I'm working on improving monorepo-scale Git tooling at
         | https://github.com/arxanas/git-branchless
         | 
         | I'm intrigued by this but the readme could maybe use some work
         | to describe how you envision it being used day-to-day? All the
         | examples seem to be about using it to fix things but I'm not at
         | all clear how it helps enable a new workflow.
         | 
         | Even if it was just a link to a similar tool?
        
           | arxanas wrote:
           | Thanks for the feedback. I also received this request today
           | to document a relevant workflow:
           | https://github.com/arxanas/git-branchless/issues/210. If you
           | want to be notified when I write the documentation (hopefully
           | today?), then you can watch that issue.
           | 
           | There's a decent discussion here on "stacked changes":
           | https://docs.graphite.dev/getting-started/why-use-stacked-
           | ch..., with references to other articles. This workflow is
           | sometimes called development via "patch stack" or "stacked
           | diffs". But that's just a part of the workflow which git-
           | branchless enables.
           | 
           | The most similar tool would be Mercurial as used at large
           | companies (and in fact, `git-branchless` is, for now, just
           | trying to get to feature parity with it). But I don't know if
           | the feature set which engineers rely on is documented
           | anywhere publicly.
           | 
           | I use git-branchless 1) simply to scale to a monorepo,
           | because `git move` is a lot faster than `git rebase`, and 2)
           | to do highly speculative work and jump between many different
           | approaches to the same problem (a kind of breadth-first
           | search). I always had this problem with Git where I wanted to
           | make many speculative changes, but branch and stash
           | management got in the way. (For example, it's hard to update
           | a commit which is a common ancestor of two or more branches.
           | `git move` solves this.) The branchless workflow lets me be
           | more nimble and update the commit graph more deftly, so that
           | I can do experimental work much more easily.
        
       | rq1 wrote:
       | What I was looking for recently is a way to make "sparse push".
       | And trigger a chain reaction with hooks.
       | 
       | Didn't find anything interesting.
        
       | speedgoose wrote:
       | "One of the biggest traps for smart engineers is optimizing
       | something that shouldn't exist."
       | 
       | Elon Musk.
        
         | joconde wrote:
         | Was he talking about something specific?
        
           | junon wrote:
           | He was talking about the battery vibrator plates or something
           | in Tesla cars.
        
           | plopz wrote:
           | I remember him saying something like that during a
           | walkthrough of the base building starship and if I recall it
           | was in reference to overengineering something about the grid
           | fins.
        
         | solarmist wrote:
         | Let's make the snarky comment into a helpful comment. Why do
         | you think it shouldn't exist?
        
           | speedgoose wrote:
           | Monorepos create more issues than what they solve.
        
             | solarmist wrote:
             | Such as? That's just parroting "common opinion" otherwise.
        
               | speedgoose wrote:
                | The second paragraph of the article we are discussing,
                | for example.
                | 
                | But you can find a list on Wikipedia and form your own
                | opinion: https://en.m.wikipedia.org/wiki/Monorepo
        
             | jeremyjh wrote:
             | This is no different from saying "monorepo bad". Aside from
             | performance issues in git, why would a monorepo be bad? It
             | seems very natural to me to have a whole system referenced
             | with a single branch/tag that must all pass CI together.
             | Otherwise supporting projects can introduce breaking
             | changes downstream that are not apparent before they hit
             | master.
        
             | tambourine_man wrote:
             | Supporting evidence?
        
               | speedgoose wrote:
               | The Wikipedia article about monorepos has a good summary
               | 
               | https://en.m.wikipedia.org/wiki/Monorepo
               | 
                | Then you can form your own opinion; I'm sharing mine.
        
               | jeremyjh wrote:
               | Apart from performance issues that article offers more
               | (and more significant) advantages than it does drawbacks,
               | so it really does not support your statement.
        
             | ratww wrote:
             | That's completely false. Monorepos don't really create
             | issues at all when done properly, and when used in
             | situations where they make sense.
             | 
             | In smaller scales, for example, they're fantastic for
             | productivity, and my company is not looking back.
        
             | tsimionescu wrote:
             | So the Linux devs have no idea how to properly use Git?
        
               | speedgoose wrote:
               | I'm not sure whether the Linux kernel git repository
               | qualifies as a Monorepo.
        
       | anon9001 wrote:
       | This is well written and deserves my upvote, because sparse-
       | checkout is part of git and knowing how it works is useful.
       | 
       | That said, there's absolutely no reason to structure your code in
       | a monorepo.
       | 
       | Here's what I think GitHub is doing:
       | 
       | 1) Encourage monorepo adoption
       | 
       | 2) Build tooling for monorepos
       | 
       | 3) Sell tooling to developers stranded in monorepos
       | 
       | Microsoft, which owns GitHub, created the microsoft/git fork
       | linked in the article, and they explain their justification here:
       | https://github.com/microsoft/git#why-is-this-fork-needed
       | 
       | > Well, because Git is a distributed version control system, each
       | Git repository has a copy of all files in the entire history. As
       | large repositories, aka monorepos grow, Git can struggle to
       | manage all that data. As Git commands like status and fetch get
       | slower, developers stop waiting and start switching context. And
       | context switches harm developer productivity.
       | 
       | I believe that Google's brand is so big that it led to this mass
       | cognitive dissonance, which is being exploited by GitHub.
       | 
       | To be clear, here are the two ideas in conflict:
       | 
       | * Git is decentralized and fast, and Google famously doesn't use
       | it.
       | 
       | * Companies want to use "industry standard" tech, and Google is
       | the standard for success.
       | 
       | Now apply those observations to a world where your engineers only
       | use "git".
       | 
       | The result is market demand to misuse git for monorepos, which
       | Microsoft is pouring huge amounts of resources into enabling via
       | GitHub.
       | 
       | It makes great sense that GitHub wants to lean into this. More
       | centralization and being more reliant on GitHub's custom tooling
       | is obviously better for GitHub.
       | 
       | It just so happens that GitHub is building tools to enable
       | monorepos, essentially normalizing their usage.
       | 
       | Then GitHub can sell tools to deal with your enormous monorepo,
       | because your traditional tools will feel slow and worse than
       | GitHub's tools.
       | 
       | In other words, GitHub is propping up the failed monorepo idea as
       | a strategy to get people in the pipeline for things like
       | CodeSpaces: https://github.com/features/codespaces
       | 
       | Because if you have 100 projects and they're all separate, you
       | can do development locally for each and it's fast and sensible.
       | But if all your projects are in one repo, the tools grind to a
       | halt, and suddenly you need to buy a solution that just works to
       | meet your business goals.
        
         | jeffbee wrote:
         | > Git is ... fast, and Google ... doesn't use it.
         | 
         | Everything about git is orders of magnitude slower than the
         | monorepo in use at Google. Git is not fast, and its slowness
         | scales with the size of your repo.
        
         | tsimionescu wrote:
         | Monorepos are much easier for everyone to use, and are the only
         | natural way to manage code for any project. You keep talking
         | about Google, but a much more famous monorepo is Linux itself.
         | Perhaps Linus Torvalds has fallen into Google's hype?
         | 
         | The fact that git is very poor at scaling monorepos might mean
         | that it's a bad idea to use git for larger organizations, not
         | that it's a bad idea to use monorepos. If git can be improved
         | to work with monorepos, all the better.
        
           | anon9001 wrote:
           | > Monorepos are much easier for everyone to use, and are the
           | only natural way to manage code for any project.
           | 
           | I strongly disagree with that, but I'll let this blog post
           | explain it better than I can:
           | https://medium.com/@mattklein123/monorepos-please-
           | dont-e9a27...
           | 
           | > You keep talking about Google, but a much more famous
           | monorepo is Linux itself.
           | 
           | I thought it was fairly well known that monorepos came
           | directly from Google as part of their SRE strategy. It didn't
           | even come into common usage until around 2017 (according to
           | wikipedia). If I'm remembering correctly, the SRE book
           | recommends it, and that's why it gained popularity.
           | 
           | Also, I don't believe that Linux is a valid interpretation of
           | "monorepo". Linux is a singular product. You can't build the
           | kernel without all of the parts.
           | 
           | A better example would be if there was a "Linus" repo that
           | contained both git and linux. There isn't, and for good
           | reason.
           | 
           | > The fact that git is very poor at scaling monorepos might
           | mean that it's a bad idea to use git for larger
           | organizations, not that it's a bad idea to use monorepos. If
           | git can be improved to work with monorepos, all the better.
           | 
           | Any performance improvement in git is welcome, but anything
           | that sacrifices a full clone of the entire repository is
           | antithetical to decentralization.
           | 
           | The whole point of git is decentralized source code.
        
             | solarmist wrote:
             | Monorepos (up to a certain size where git starts getting
             | too slow) are easier to use unless you have sufficient
             | investment into dev tooling.
             | 
             | I think "monorepo" here is a shorthand for large, complex
             | repos with long histories which git does not scale well to
             | whether or not it is all of the repos for an organization.
             | For example I'd call the Windows OS a monorepo for all of
             | the important reasons.
        
             | howinteresting wrote:
             | > The whole point of git is decentralized source code.
             | 
             | The "whole point of git" is to provide value to its users.
             | Full decentralization is not necessary for that.
        
             | dataangel wrote:
             | > Also, I don't believe that Linux is a valid
             | interpretation of "monorepo". Linux is a singular product.
             | You can't build the kernel without all of the parts.
             | 
             | But it's also larger scale than the vast majority of
             | startups will ever reach. My work has had the same monorepo
             | for 8 years with over 100 employees now and git has had few
             | problems.
        
             | cdcarter wrote:
             | I think it's at least somewhat fair to call Linux a
             | monorepo. There are a lot of drivers included in the main
             | tree. They don't need to be, (we know this because there
             | are also lots of drivers not in the source tree). But by
             | including them, the kernel devs can make large changes to
             | the API and all the drivers in one go. This is a classic
             | "why use a monorepo".
        
         | ajkjk wrote:
         | Very much doubt that's their corporate strategy. More likely
         | it's as simple as: lots of people have monorepos; they have
         | lots of issues with Git and Github; Github wants their
         | business.
        
         | ratww wrote:
         | _> That said, there's absolutely no reason to structure your
         | code in a monorepo._
         | 
         | Bullshit. There are very good reasons to use it in some
         | situations. My company is using it and it's a tremendous
         | productivity boon. And Git works _perfectly fine_ for smaller
         | scales.
         | 
         | Obviously, "because Google does it" is a terrible reason. But
         | it's disingenuous to say that's the only reason people are
         | doing it. Not everyone is a moron.
        
           | anon9001 wrote:
           | I'm glad you're having a good experience now, and git as a
           | monorepo will work fine at smaller scales, but you will
           | outgrow it at some point.
           | 
           | When you do, you have two choices. You can either commit to
           | the monorepo direction and start using non-standard tooling
           | that sacrifices decentralization, or you can break up your
           | repo into smaller manageable repos.
           | 
           | I don't have any problem with small organizations throwing
           | everything into one git repo because it's convenient.
           | 
           | My objection is that when you eventually do hit the limits of
           | git, will you choose to break the fundamentals of git
           | decentralization as a workaround? Or will you break up the
           | repo into a couple of other repos with specific purposes?
           | 
           | I don't like that GitHub makes money by encouraging people to
           | make the wrong choice at that juncture.
        
             | ratww wrote:
             | When I hit the limits of git then I will worry about it.
             | 
             | One of our tasks when building the monorepo was proving it
             | was possible to split it again. It was trivial and we have
             | tools to help us avoid complexity.
             | 
             | We're not using Github so that part doesn't apply to me.
             | 
             | Also, nice of you to assume we'll get to Google scale, but
             | thanks to the monorepo, I was able to make a few pull
             | requests reducing duplication and reducing line count of
             | app by thousands ever since. So I really don't see us
             | getting into Google scale anytime soon. We're downsizing.
             | 
             | I also find it ironic that you're accusing people of
             | "copying Google" in a parent post but you're the one
             | assuming that everyone will hit Google limits...
        
               | anon9001 wrote:
               | If you ever do hit a git limit where it's no longer
               | comfortable to keep the whole repo on each developer
               | machine, I would encourage you to split up the repo into
               | separate project-based repos rather than switching to
               | Microsoft's git fork.
               | 
               | As a best practice, there's a reason that Linus started
               | git in a separate repo, rather than as part of the Linux
               | project. The reason is that if you put too many projects
               | into one git repo, and it gets too large, you do
               | eventually hit a scale where it becomes a problem.
               | 
               | A very simple way to mitigate that is to keep each
               | project in its own repo, which you can easily do once you
               | start hitting git scale problems.
               | 
               | Thankfully, one of the original git use cases was to
               | decompose huge svn repos into smaller git repos, so the
               | tooling required is already built in.
               | 
               | > I honestly find it ironic that you're accusing people
               | of "copying google" in a parent post but you're the one
               | assuming that everyone will hit Google limits...
               | 
               | I think you got the wrong take there. I'm saying that
               | Google's monorepo approach is only valid because they
               | invested so heavily into building custom tooling to
               | handle it. We don't have access to those tools and
               | therefore shouldn't use their monorepo approach.
               | 
               | If you're going to use git, you're going to have the most
               | success using it as intended, which is some logical
               | separation of "one repo per project" where "project"
               | doesn't grow too out of hand. The Linux kernel could be
               | thought of as a large project that git still handles just
               | fine.
               | 
               | Tragically, I think if Google did opensource their
               | internal vcs and monorepo tooling, they would immediately
               | displace git as the dominant vcs and we would regress
               | back to trunk-based development.
        
             | rsj_hn wrote:
             | > I'm glad you're having a good experience now, and git as
             | a monorepo will work fine at smaller scales, but you will
             | outgrow it at some point.
             | 
             | I would say the opposite. A lot of companies are fine with
             | independent teams using their own versions of dependencies
             | and their own versions of core code, but at some point that
             | becomes unmanageable and you need to start using a common
             | set of dependencies and the same version of base frameworks
             | to reduce the complexity. That means pushing a patch to a
             | framework means all the teams are upgraded. Monorepos are
             | the most common solution to enforce that behavior.
             | 
             | Look, this is all dealing with the problem of coordination
             | in large teams. Different organizations have different
             | capacities for coordination, and so it's like squeezing a
             | balloon -- yes, you want more agility to pick your own deps
             | but then the cost of that is dealing with so much
             | complexity when you need to push a fix to a commonly used
             | framework or when a CVE is found in a widely used dep and
             | needs to be updated by 1000 different teams all manually.
             | 
             | There is no "right" way. It's just something organizations
             | have to struggle with because it's going to be costly no
             | matter what, and all that matters is what type of cost your
             | org is most easily able to bear. That will decide whether
             | you use a monorepo or a bunch of independent repos, whether
             | you go for microservices or a monolith, and most companies
             | will do some mix of all of the above.
        
               | anon9001 wrote:
               | > Monorepos are the most common solution to enforce that
               | behavior.
               | 
               | Yes. This is very accurate and also the problem.
               | Monorepos are being used as a political tool to change
               | behavior, but the problem is that it has severe technical
               | implications.
               | 
               | > There is no "right" way.
               | 
               | With git, there is a "wrong" way, and that's not
               | separating your project into different repos. It causes
               | real world technical problems, otherwise we wouldn't have
               | this article posted in the first place.
               | 
                | > It's just something organizations have to struggle with
               | because it's going to be costly no matter what, and all
               | that matters is what type of cost your org is most easily
               | able to bear.
               | 
               | It's not a coin toss whether monorepos will have better
               | or worse support from all standard git tooling. It will
               | be worse every time.
               | 
               | The amount of tooling required to enforce dependency
               | upgrades, code styles, security checks, etc across many
               | repos is significantly less than the amount of tooling
               | required to successfully use a monorepo.
        
               | philosopher1234 wrote:
                | If you want to play right and wrong, I will say that now
               | it's the right way since there is support for sparse
               | checkouts in git.
               | 
               | This isn't a useful game to play.
        
             | eximius wrote:
              | If you are in an enterprise setting, you _don't need
             | decentralized version control_.
             | 
             | So, yea, for companies, monorepos are a no brainer in a lot
             | of ways.
             | 
             | For open source, separate repos makes more sense.
             | 
             | To expand on corporate monorepos, if you can still set up
             | access control (e.g., code owners to review additions by
             | domain) and code visibility (so there isn't _unlimited_
             | code sharing), then I can't think of a reason to not use
             | monorepos.
        
             | IshKebab wrote:
             | > you will outgrow it at some point
             | 
             | Given that Google and Microsoft use monorepos that seems
             | unlikely!
        
               | anon9001 wrote:
               | Google had to build an internal version control system as
               | an alternative to git and perforce to support their
               | monorepo.
               | 
               | Microsoft forked git and layered their own file system on
               | top of it to support a centralized git workflow so that
               | they could have a monorepo.
        
               | dlp211 wrote:
                | Having used both, Google's implementation is IMO the
               | superior version of monorepo. Really, Google's
               | Engineering Systems are just better than anything that I
               | have ever used anywhere else.
        
               | anon9001 wrote:
               | This is exactly as I'd expect.
               | 
               | If you want a centralized, trunk-based version control,
               | don't use git.
               | 
               | It's funny how each company decides to solve these
               | problems.
               | 
               | Google called in the computer scientists and designed a
               | better centralized vcs for their purposes. Good on them.
               | It'd be great if they open sourced it. So typical of
               | Google to invent their own thing and keep it private.
               | 
               | Microsoft took the most popular vcs (git), and inserted a
               | shim layer to make it compatible with their use case. How
               | expected that Microsoft would build a compatibility shim
               | that attempts to hide complexity from the end user.
               | 
               | Meanwhile, Linux and Git are plugging along just fine, in
               | their own separate repos, even though many people work on
               | both projects.
        
               | IshKebab wrote:
               | > So typical of Google to invent their own thing and keep
               | it private.
               | 
               | Yeah like their build system... Bazel, that's completely
               | closed source.
        
               | jayd16 wrote:
               | Your logic is circular.
               | 
               |     No one should work on monorepos because...
               |     monorepos are bad because...
               |     git can't easily handle them and we shouldn't fix that because...
               |     No one should work on monorepos...
               | 
               | Clearly there are reasons people like monorepos and it
               | makes sense to update git to support the workflow.
        
               | anon9001 wrote:
               | That isn't circular. The conclusion should be that git, a
               | decentralized vcs, should not take on changes to make it
               | a centralized vcs.
               | 
               | If you think that git needs to be "fixed" or "updated" to
               | support a centralized vcs server to do partial updates
               | over the network, then I think you've missed the point of
               | git.
        
           | dboreham wrote:
           | > it's a tremendous productivity boon
           | 
           | Curious to hear more specifics on this. Did you migrate from
           | separate repos to a monorepo and subsequently measure
           | improved productivity as a result?
        
             | ratww wrote:
             | Correct. We measured how long it took to integrate changes
             | in the core libraries into the consumers (multiple PRs)
             | versus doing it on a monorepo (single PR per change). We
             | ran them together for a couple weeks and the difference was
             | big.
             | 
             | The biggest differences were in changes that would break
             | the consumers. For these cases we had to go back and patch
             | the original library, or revert and start from scratch. But
             | even in the easy changes, just the "bureaucracy" of opening
             | tens of pull-requests, watching a few CI pipelines and
             | getting them approved by different code owners was also
             | large.
             | 
             | Now, whenever we have changes in one of the core libraries,
             | we also run full tests in the library consumers. With tests
             | running in parallel, sometimes it takes 20 minutes (instead
             | of 4, 5 hours) to get a patch affecting all frontends
             | tested, approved and merged into the main branch.
             | 
             | Also, everyone agreed that having multiple PRs open is
             | quite stressful.
        
         | solarmist wrote:
         | From my understanding Microsoft is doing it because they want
         | to use git for developing Windows, which is (was?) a large
         | monorepo.
        
         | omegalulw wrote:
         | Your take is extremely biased. You only discuss why
         | monorepos are bad.
         | 
         | Here's some of the many reasons why monorepos are excellent:
         | 
         | - Continuous integration. Every project is almost always using
         | the latest code from other projects and libraries it depends
         | on.
         | 
         | - Builds from scratch are very easy and don't need extravagant
         | tooling.
         | 
         | - Problems due to build versions in dependency management are
         | reduced (everyone is expected to use HEAD).
         | 
         | - The whole organization settles on common build patterns -
         | so if you want to add a new dependency you wouldn't need to
         | struggle with their build system. Conversely, you need to write
         | less documentation on how to build your code - because that's
         | now standard.
        
           | anon9001 wrote:
           | Heh, the major problems that I've run into using monorepos in
           | the real world at scale are:
           | 
           | - CI breaks all the time. Even one temperamental test from
           | anywhere else in the organization can cause your CI run to
           | fail.
           | 
           | - Building the monorepo locally becomes very complicated,
           | even to just get your little section running. Now all
           | developers need all the tools used in the monorepo.
           | 
           | - Dependencies get upgraded unexpectedly. Tests aren't
           | perfect, so people upgrade dependencies and your code
           | inevitably breaks.
           | 
           | It's cool that everyone is on the same coding style, but
           | that's very much achievable with a shared linter
           | configuration.
        
             | dlp211 wrote:
             | Your problem isn't monorepo, it's bad tooling. Tests should
             | only execute against code that changed. Builds should only
             | build the thing you want to build, not the whole
             | repository.
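             | 
             | A minimal sketch of that kind of change-based selection,
             | assuming a hand-written dependency map. `DEPS`,
             | `owning_target` and `affected_targets` are hypothetical names
             | for illustration, not any real build tool's API.

```python
from collections import deque

# Hypothetical dependency graph: each target maps to the targets it depends on.
DEPS = {
    "libs/auth": [],
    "libs/ui": [],
    "apps/web": ["libs/auth", "libs/ui"],
    "apps/admin": ["libs/auth"],
    "apps/docs": [],
}


def owning_target(path, targets):
    """Return the longest target whose directory contains path, or None."""
    matches = [t for t in targets if path == t or path.startswith(t + "/")]
    return max(matches, key=len) if matches else None


def affected_targets(changed_files, deps):
    """Return every target that owns a changed file or transitively depends on one."""
    # Invert the graph so we can walk from a dependency to its consumers.
    consumers = {t: set() for t in deps}
    for target, target_deps in deps.items():
        for dep in target_deps:
            consumers[dep].add(target)

    # Files outside every target (owner is None) select nothing.
    seeds = {owning_target(f, deps) for f in changed_files} - {None}
    affected = set(seeds)
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected
```

             | With this map, a change to `libs/ui/button.js` selects only
             | `libs/ui` and `apps/web`; `apps/admin` and `apps/docs` are
             | left alone.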
        
               | anon9001 wrote:
               | Yes!
               | 
               | The problem is choosing a monorepo _because_ the tooling
               | isn't suited for monorepos.
               | 
               | Trying to build a monorepo with git is like trying to
               | build your CRUD web app frontend in C++.
               | 
               | Sure, you can do it. WebAssembly exists and clang can
               | compile to it. I wouldn't recommend it because the
               | tooling doesn't match your actual problem.
               | 
               | Or maybe a better example is that it's like deciding the
               | browser widgets aren't very good, so we'll re-render our
               | own custom widgets with WebGL. Yes, this is all quite
               | possible, and your result might get to some definition of
               | "better", but you're not really solving the problem you
               | had of building a CRUD web app.
               | 
               | Can Microsoft successfully shim git so that it appears
               | like a centralized trunk-based monorepo, the way you'd
               | find at an old cvs/svn/perforce shop? Yes, they did, but
               | they shouldn't have.
               | 
               | My thesis is they're only pushing monorepos because it
               | helps GitHub monetize, and I stand by that.
               | 
               | > Tests should only execute against code that changed.
               | Builds should only build the thing you want to build, not
               | the whole repository.
               | 
               | How do you run your JS monorepo? Did you somehow get
               | Bazel to remote cache a webpack build into individual
               | objects, so you're only building the changes? Can this
               | even be done with a modern minification tool in the
               | pipeline? Is there another web packager that does take a
               | remotely cacheable object-based approach?
               | 
               | I don't know enough about JS build systems to make a
               | monorepo work in any sensible way that utilizes caching
               | and minimizes build times. If anything good comes out of
               | the monorepo movement, it will be a forcing function that
               | makes JS transpilers more cacheable.
               | 
               | And all this for what? Trunk-based development? So we can
               | get surprise dependency updates? So that some manager
               | feels good that all the code is in one directory?
               | 
               | The reason Linus invented git in the first place was
               | because decentralized is the best way to build software.
               | He literally stopped work on the kernel for 2 weeks to
               | build the first version of git because the scale by which
               | he could merge code was the superpower that grew Linux.
               | 
               | If you YouTube search for "git linus" you can listen to
               | the original author explain the intent from 14 years ago:
               | https://www.youtube.com/watch?v=4XpnKHJAok8
               | 
               | If this is a topic you're passionate about, I'd encourage
               | you to watch that video, as he addresses why
               | decentralizing is so important and how it makes for
               | healthy software projects. It's also fun to watch old
               | Googlers not "get it".
               | 
               | He was right then and he's right now. It's disappointing
               | to see so much of HN not get it.
        
         | Orphis wrote:
         | > Git is decentralized and fast, and Google famously doesn't
         | use it.
         | 
         | Most (all?) of Google's OSS is hosted on either Gerrit
         | or GitHub. Git is not used by the "google3" monorepo, but it's
         | used by quite a few major projects.
        
       | nightpool wrote:
       | Is there a point in having a monorepo if you're all in on the
       | microservices approach? I'm a big microservices skeptic, but as
       | far as I understand it, the benefit of microservices is
       | independence of change & deployment enforced by solid API
       | contracts--don't you give that all up when you use a monorepo?
       | What does "Monorepo with microservices" give you that a normal
       | monolithic backend doesn't?
       | 
       | (Obviously e.g. an image resizer or something else completely
       | decoupled from your business logic should be a separate service /
       | repo _anyway_--my point is more along the lines of "If something
       | shares code, shouldn't it share a deployment strategy?")
        
         | rsj_hn wrote:
         | Yeah, I've seen it used to allow teams to use consistent
         | frameworks and libraries across many different microservices.
         | Think of authentication, DB clients, logging, webservers,
         | grpc/http service front ends, uptime oracle -- there's lots of
         | cross cutting concerns that are shared code among many
         | microservices.
         | 
         | So the next thing you decide to do is create some microservice
         | framework that bundles all that stuff in and allow your
         | microservice team to write some business logic on top. But now
         | 99% of your executables are in this microservice framework that
         | everyone is using, and that's the point where a lot of
         | companies go the monorepo route.
         | 
         | Actually most companies do some mix -- have a lot of stuff in a
         | big repo and then other smaller repos alongside that.
        
         | johnmaguire wrote:
         | Monorepo with microservices gives you the ability to scale and
         | perform SRE-type maintenance at a granular level. Teams
         | maintain responsibility for their service, but are more easily
         | able to refactor, share code, pull dependencies like GraphQL
         | schemas into the frontend, etc. across many services.
        
           | nightpool wrote:
           | So basically each team has to reinvent devops from the ground
           | up, and staff their own on call rotation, instead of having a
           | centralized devops function that provides a stable platform?
           | That sounds horrendous.
           | 
           | Although, that said, I can at least _see_ the benefits of the
           | "1 service per team" methodology, where you have a dedicated
           | team that's independently responsible for updating their
           | service. I'm more used to associating "microservices" with
           | the model where a single team is managing 5 or 6 interacting
           | services, and the benefits there seem much smaller.
        
             | johnmaguire wrote:
             | > That sounds horrendous.
             | 
             | Different teams can make their own decisions, but as a
             | developer on a team that ran our own SRE, I found it came
             | with many advantages. Specifically, we saw very little
             | downtime, and when outages did occur we were well prepared to
             | fix them, as we knew the exact state of our services (code,
             | infrastructure, recent changes and deploys.) Additionally,
             | we had very good logging and metrics because we knew what
             | we'd want to have in the event of a problem.
             | 
             | And I'm not sure what you mean by "from the ground up." We
             | were able to share a lot of Ansible playbooks, frameworks,
             | and our entire observability stack across all teams.
             | 
             | But I think you may also be missing the rest of my post.
             | This is only one possible advantage. Even if the team
             | doesn't perform their own SRE, these services can be scaled
             | independently - both in terms of infrastructure and
             | codebase - even while sharing code (including things like
             | protocol data structures, auth schemes, etc.)
             | 
             | A service that receives 1 SAML Response from an IdP
             | (identity provider) per day per user may not need the same
             | resources as a dashboard that exposes all SPs (service
             | providers) to a user many times a day. And an
             | administration panel for this has different needs still.
             | 
             | Yet, all of these services may communicate with each other.
        
         | echelon wrote:
         | > Is there a point in having a monorepo if you're all in on the
         | microservices approach?
         | 
         | Monorepos are excellent for microservices.
         | 
         | - You can update the protobuf service graph (all strongly
         | typed) easily and make sure all the code changes are
         | compatible. You still have to release in a sensible order to
         | make sure the APIs are talking in an expected way, but this at
         | least ensures that the code agrees.
         | 
         | - You can address library vulns and upgrades all at once for
         | everything. Everything can get the new gRPC release at the same
         | time. Instead of having app owners be on the hook for this, a
         | central team can manage these important upgrades and provide
         | assistance / pairing for only the most complex situations.
         | 
         | - If you're the one working on a very large library migration,
         | you can rebase daily against the entire fleet of microservices
         | and not manage N-many code changes in N-many repos. This makes
         | huge efforts much easier. Bonus: you can land incrementally
         | across everything.
         | 
         | - If you're the one scoping out one of these "big changes", you
         | can statically find all of the code you'll impact or need to
         | understand. This is such an amazing win. No more hunting for
         | repos and grepping for code in hundreds of undiscovered places.
         | 
         | - Once a vuln is fixed, you can tell all apps to deploy after
         | SHA X to fix VULN Y. This is such an easy thing for app owners
         | to do.
         | 
         | - You can collect common service library code in a central
         | place (e.g. internal Guava, auth tools, i18n, etc). Such
         | packages are easy to share and reuse. All of your internal code
         | is "vendored" essentially, but you can choose to depend on only
         | the things you need. A monorepo only feels heavy if you depend
         | on all the things (or your tooling doesn't support git or build
         | operations - you seriously have to staff a monorepo team).
         | 
         | - Other teams can easily discover and read your code.
         | 
         | Monorepos are the best possible way to go as long as you have
         | the tooling to support it. They fall over and become a burden
         | if they're not seriously staffed. When they work, they really
         | work.
        
           | nightpool wrote:
           | None of this addresses my question--what benefits do you get
           | from having monorepo-with-microservices over a monolithic
           | backend? All of the things you mentioned would be even
           | _easier_ with a monolithic backend.
        
             | staticassertion wrote:
             | They solve different problems, some of which may overlap I
             | suppose.
             | 
             | For one thing you get clear ownership of deployed code.
             | There isn't one monolithic service that everyone is
             | responsible for babying, every team can baby their own,
             | even if they all share libraries and whatnot.
             | 
             | You also get things like fault isolation and fine grained
             | scaling too.
        
           | echelon wrote:
           | Should have also mentioned: all those changes to cross
           | cutting library code will trigger builds and tests of the
           | dependent services. You can find out at once what breaks.
           | It's a superpower.
        
       | aranchelk wrote:
       | The more I've come to rely on techniques like canary testing and
       | opt-in version upgrades the more firmly I believe one of the main
       | motivations for monorepos is flawed: at any given time there may
       | not be a fact of the matter as to which single version of an app
       | or service is running in an environment.
       | 
       | At places I've worked when we thought about canary testing we
       | ignored the fact that there were multiple versions of the
       | software running in parallel, we classified it as part of an
       | orchestration process and not a reality about the code or env,
       | but we really did have multiple versions of a service running at
       | once, sometimes for days.
       | 
       | Similarly if you've got a setup where you can upgrade
       | users/regions/etc piecemeal (opt-in or by other selection
       | criteria) I don't know how you reflect this in a monorepo. I'm
       | curious how Google actually does this as I recall they have
       | offered user opt-in upgrades for Gmail. My suspicion is this gets
       | solved with something like directories ./v2/ and ./v3/ -- but
       | that's far from ideal.
        
         | eximius wrote:
         | that doesn't seem like a problem with monorepos (or otherwise).
         | 
         | you'd just need to tag your releases, right?
        
           | aranchelk wrote:
           | I don't think it's a problem, rather I'm challenging a touted
           | benefit.
           | 
           | In large monorepos the supposition is you've got a class of
           | compatible apps and services bundled together. Version
           | dependencies are somewhat implicit: the correct version for
           | each project to interoperate is whatever was checked-in
           | together in your commit.
           | 
           | I don't know how it works in practice at different orgs, but
           | there's certainly an idea I've heard repeated that you can
           | essentially test, build, and deploy your monorepo atomically,
           | but the reality in my experience is you can't escape the need
           | to think about compatibility across multiple versions of
           | services once you use techniques like canary testing or
           | targeted upgrades.
        
             | sroussey wrote:
             | This is still true, but it's a matter of degree. Even in the
             | feature flagged deploys mixed with canary the permutations
             | are all evident, and ideally tested.
             | 
             | Also, you wouldn't expect a schema change to occur with
             | code that requires it. Those will need to happen earlier.
             | 
             | Real systems are complex. A monorepo is one attempt at
             | capping the complexity to known permutations. For smaller
             | teams, it might collapse to a single one.
        
             | staticassertion wrote:
             | You still have to think about compatibility across versions
             | - that does not go away in a monorepo, and you should use
             | protocols that enforce compatible changes. The monorepo
             | just tells you that all tests pass across your entire
             | codebase given a change you made to some other part.
        
               | eximius wrote:
               | that's fair. you require reasonable deployment intervals
               | and may need to wait to merge based on deployment.
               | Workflow actions that can check whether a commit is
               | deployed in a given environment are invaluable.
        
               | jeffbee wrote:
               | > may need to wait to merge based on deployment
               | 
               | Again, this fundamentally misunderstands the purpose of
               | the source code repo and how it relates to the build
               | artifacts deployed in production. If you are waiting for
               | something to happen in production before landing some
               | change, that tells me right there you have committed some
               | kind of serious error.
        
               | joshuamorton wrote:
               | I'd caveat this with _code_ change.
               | 
               | It's very common to need to wait for some version of a
               | binary to be live before updating some associated
               | configuration to enable a feature in that binary (since
               | the _dynamic_ configuration isn't usually versioned with
               | the binary). It's possible that some systems exist that
               | fail quiet, with a non-existent configuration option
               | being silently ignored, but the ones I know of don't do
               | that.
        
         | coryrc wrote:
         | Only binaries are released. Binaries are timestamped and linked
         | to a changelist.
         | 
         | The "opt-in upgrades" are all live code. I know more than a few
         | "foo" "foo2" directories, but I wouldn't want an actively-
         | delivered, long-running service to be living in a feature
         | branch so I would still expect anyone to be using a similar
         | naming scheme.
        
         | ASinclair wrote:
         | > I'm curious how Google actually does this
         | 
         | Branches are cut for releases. Binaries are versioned.
        
         | [deleted]
        
         | jeffbee wrote:
         | > the main motivations for monorepos is flawed: at any given
         | time there may not be a fact of the matter as to which single
         | version of an app or service is running in an environment.
         | 
         | Your understanding of the motivations for monorepo is flawed.
         | I've never heard anyone even advocate for this as a reason for
         | monorepos. For some actual reasons people use monorepos, see
         | https://danluu.com/monorepo/
         | 
         | Regarding your question, which I re-emphasize has got nothing
         | to do with the arrangement of the source code, the solution is
         | to simply treat your protocol as an API and follow these rules:
         | 
         | 1) Follow "Postel's Law", accepting to the best of your
         | abilities unknown messages and fields.
         | 
         | 2) Never change the meaning of anything in an existing
         | protocol. Do not change between incompatible types, or change
         | an optional item to required.
         | 
         | 3) Never re-use a retired aspect of the protocol with a
         | different meaning.
         | 
         | 4) Generally do not make any incompatible change to an existing
         | protocol. If you must change the semantics, then you are making
         | a new protocol, not changing the old one.
         | 
         | > I don't know how you reflect this in a monorepo
         | 
         | You don't. Why would the deployed version of some application
         | be coded in your repo? It's simply a fact on the ground and
         | there's no reason for that to appear in source control.
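         | 
         | Rules 1 and 3 can be sketched as a tolerant decoder. This is
         | only an illustration under assumptions: the field names
         | (`user_id`, `locale`, `session_token`) and the dict-shaped
         | message are hypothetical, not part of any real protocol.

```python
# Hypothetical message schema; all field names here are illustrative only.
KNOWN_FIELDS = {"user_id": int, "locale": str}
OPTIONAL_DEFAULTS = {"locale": "en"}
RETIRED_FIELDS = {"session_token"}  # never reassigned a new meaning (rule 3)


def decode(message):
    """Decode a dict-shaped message, tolerating fields this version does not know."""
    decoded = dict(OPTIONAL_DEFAULTS)
    for name, value in message.items():
        if name in RETIRED_FIELDS:
            continue  # retired: ignore rather than reinterpret
        expected = KNOWN_FIELDS.get(name)
        if expected is None:
            continue  # unknown field from a newer sender: skip it (rule 1)
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be {expected.__name__}")
        decoded[name] = value
    if "user_id" not in decoded:
        raise ValueError("user_id is required")
    return decoded
```

         | A newer sender can add a `theme` field and this older receiver
         | keeps working; the retired `session_token` is skipped rather
         | than given a new meaning.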
        
           | aranchelk wrote:
           | We may just be talking past each other, but in the link you
           | provided, sections "Simplified dependencies" and (to a lesser
           | extent) "Cross-project changes" are pretty much exactly what
           | I'm talking about.
        
             | joshuamorton wrote:
             | They aren't, because those discussions are all related to
             | link-time stuff (if I update foo.h and bar.c that depends
             | on foo.h, I can do so atomically, because those are built
             | into the same artifact).
             | 
             | As soon as you discuss network traffic (or really anything
             | that crosses an RPC boundary), things get more complicated,
             | but none of that has anything to do with a monorepo, and
             | monorepos still sometimes simplify things.
             | 
             | So there's a few tools that are common: feature flags,
             | 3-stage rollouts, and probably more that are relevant, but
             | let's dive into those first two.
             | 
             | Feature "flags" are often dynamically scoped and runtime-
             | modifiable. You can change a feature flag via an RPC,
             | without restarting the binary running. This is done by
             | having something along the lines of
             | 
             |     if (condition_that_enables_feature()) {
             |       do_feature_thing()
             |     } else {
             |       do_old_thing()
             |     }
             | 
             | A/B testing tools like Optimizely and co. provide this, and
             | there are generic frameworks too.
             | `condition_that_enables_feature()` here is a dynamic
             | function that may change value based on the time of day,
             | the user, etc. (think something like
             | `hash(user.username).startswith(b'00') and user.locale ==
             | 'EN'`). The tools allow you to modify these conditions and
             | push the changes, all without restarts. That's
             | how you get per-user opt-in to certain behaviors.
             | Fundamentally, you might have an app that is capable of
             | serving two completely different UIs for the same user
             | journey.
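             | 
             | As a rough sketch of such a condition, assuming a
             | hypothetical in-memory `RULES` table standing in for whatever
             | the flag service pushes at runtime (the `User` class, flag
             | name and helper functions are made up for illustration):

```python
import hashlib
from dataclasses import dataclass


@dataclass
class User:
    username: str
    locale: str


# Hypothetical rule store; a real flag service would update this at runtime.
RULES = {"new_checkout": {"percent": 0, "locales": {"EN"}}}


def bucket(flag, username):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{username}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user):
    """Evaluate the current rule for this flag against one user."""
    rule = RULES.get(flag)
    if rule is None:
        return False  # unknown flags fall back to the old behaviour
    in_locale = user.locale in rule["locales"]
    return in_locale and bucket(flag, user.username) < rule["percent"]


def checkout(user):
    # Same binary, two code paths; the rule decides per request.
    if is_enabled("new_checkout", user):
        return "new flow"
    return "old flow"
```

             | Raising `percent` from 0 to 100 in `RULES` flips EN users to
             | the new flow with no restart, and a given user always lands
             | in the same bucket.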
             | 
             | Then you have "3-phase" updates. In this process, you have
             | a client and server. You want to update them to use "v2" of
             | some api, that's totally incompatible with v1. You start by
             | updating the server to accept requests in either v1 or v2
             | format. That's stage one. Then you update the clients to
             | send requests in v2 format. That's stage two. Then you
             | remove all support for v1. That's stage three.
             | 
             | When you canary a new version of a binary, you'll have the
             | old version that only supports v1, and the canary version
             | that supports v1 and v2. If it's the server, none of the
             | clients use v2 yet, so this is fine. If it's the client,
             | you've already updated the server to support v2, so it
             | works fine.
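             | 
             | The three stages might look like this in miniature. The
             | request shapes (`name` versus `first`/`last`) and handler
             | names are assumptions for illustration only:

```python
# Hypothetical wire formats: v1 sends "name", v2 splits it into "first" and "last".


def handle_v1_only(request):
    """Server before stage one: understands only the v1 shape."""
    return f"hello {request['name']}"


def handle_v1_and_v2(request):
    """Server after stage one: accepts either shape, so old and new clients work."""
    if request.get("version") == 2:
        return f"hello {request['first']} {request['last']}"
    return handle_v1_only(request)


def handle_v2_only(request):
    """Server after stage three: v1 support removed once no client sends it."""
    if request.get("version") != 2:
        raise ValueError("v1 requests are no longer supported")
    return f"hello {request['first']} {request['last']}"


def client_v1(first, last):
    """Client before stage two: still sends the v1 shape."""
    return {"name": f"{first} {last}"}


def client_v2(first, last):
    """Client after stage two: sends the v2 shape."""
    return {"version": 2, "first": first, "last": last}
```

             | During a server canary, `handle_v1_only` and
             | `handle_v1_and_v2` run side by side and every client still
             | sends v1, which both accept; `client_v2` only ships once no
             | `handle_v1_only` server remains.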
             | 
             | Note again that all of this happens whether or not you use
             | a monorepo.
        
             | howinteresting wrote:
             | In general, it is a good practice to try and maximize
             | compile-time resolution of dependencies and minimize
             | network resolution of them. Services are great when the
             | working set doesn't fit in RAM or the different parts have
             | different hardware needs, but trying to make every little
             | thing its own service is foolish.
             | 
             | Doing so makes this a less pertinent problem.
        
         | [deleted]
        
       | rurban wrote:
       | Even better looks the new ORT merge strategy, which benefits
       | everyone, not only the tiny monorepo userbase.
        
       ___________________________________________________________________
       (page generated 2021-11-11 23:00 UTC)