[HN Gopher] Make your monorepo feel small with Git's sparse index ___________________________________________________________________ Make your monorepo feel small with Git's sparse index Author : CRConrad Score : 141 points Date : 2021-11-11 15:27 UTC (7 hours ago) (HTM) web link (github.blog) (TXT) w3m dump (github.blog)
| harvie wrote:
| I hope one day all of this will be as easy as in SVN, e.g.:
| 
| I have repository https://example.com/myrepo
| 
| And I can simply do:
| 
| svn co https://example.com/myrepo/some/directory/
| 
| And I can work with that subdirectory as if it were an actual repo. Completely transparently.
| 
| This I really miss in git.
| Gigachad wrote:
| I'm working in hell right now. The current company has the site frontend, backend, and tests in separate repos, and it's basically impossible to do anything without force merging, because the build stays broken in a chicken-and-egg situation between the 3 pull requests.
| laurent123456 wrote:
| I worked at a company that not only did that, but also decided to split the main web app into multiple repos, one per country. It was so much fun to do anything in this project.
| xorcist wrote:
| Now, _that's_ a microservice if there ever was one!
| williamvds wrote:
| With shallow checkouts cloning is much quicker. You could try combining it with sparse checkouts too. You can even have Git fetch the full history in the background, and from a quick test you can do stuff like commit while it's fetching. Obviously the limited history means commands like log and blame will be inaccurate until it's done.
| 
|     $ git clone --depth=1 <url>
|     $ cd repo
|     $ git fetch --unshallow &
|     $ <do work>
| 
| zwieback wrote:
| Yeah, that's really the one thing I miss from my SVN days. I'm also still using Perforce, which can do even crazier things with workspace mappings.
| haberman wrote:
| > The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata.
| This list of files is stored as a flat list and Git parses the index into an array.
| 
| I'm surprised that the index is not hierarchical, like tree objects in Git's object storage.
| 
| With tree objects (https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#_tr...), each level of the hierarchy is a separate object. So you would only need to load directories that are interesting to you. You could use a single hash compare to determine that two directories are identical without actually recursing into them.
| 
| In particular, I can't understand why you would need a full list of all files to create a commit. If your commit is known not to touch certain directories, it should be able to simply refer to the existing tree object without loading or expanding it.
| 
| I guess that's what this sparse-index work is doing. I'm just surprised it didn't already work that way.
| arxanas wrote:
| It makes more sense if you think of the index as a structure meant specifically to speed up `git status` operations. (It was originally called the "dircache"! See https://github.com/git/git/commit/5adf317b31729707fad4967c1a...) We desperately want to reduce the number of file accesses we have to make, so directly using the object database and a tree object (or similar structures) would more than double file accesses.
| 
| There's performance-related metadata in the index which isn't in tree objects. For example, the modified-time of a given file exists in its index entry, which can be used to avoid reading the file from disk if it seems to be unmodified. If you have to do a disk lookup to decide whether to read a file from disk, then the overhead is potentially as much as the operation itself.
| 
| There's also semantic metadata, such as which stage the file is in (for merge conflict resolution).
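The per-entry metadata described above can be inspected with stock `git ls-files` (a self-contained sketch; the throwaway repo and file name are invented for illustration):

```shell
set -eu
# Build a throwaway repo so the commands below are self-contained.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email dev@example.com
git config user.name dev
echo hello > README
git add README

# --stage shows the semantic part of each index entry:
# file mode, blob object ID, and merge stage number.
git ls-files --stage

# --debug appends the cached stat() data (ctime, mtime, dev/ino,
# uid/gid, size) that lets `git status` skip re-reading files
# that look unmodified.
git ls-files --debug
```

The stage number is 0 for a normally tracked file and 1-3 for the ancestor/ours/theirs versions of a path during a merge conflict.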
| 
| It's worth noting that you can turn on the cache tree extension (https://git-scm.com/docs/index-format#_cache_tree) in order to speed up commit operations. It doesn't replace objects in the index with trees, but it does keep ranges of the index cached, if they're known to correspond to a tree.
| junon wrote:
| What I'd really like to see is for Git to have the ability to consolidate repeat submodules down into a single set of objects in the super repository. Currently, cloning the same submodule results in a copy of the repository for each path, which is absurd.
| 
| It's been something on my list to address on the mailing lists for a while, I just haven't had time.
| arxanas wrote:
| The index as a data structure is really starting to show its age, especially as developers adapt Git to monorepo scale. It's really fast for repositories up to a certain size, but big tech organizations grow exponentially, and start to suffer performance issues. At some point, you can't afford to use a data structure that scales with the size of the repo, and have to switch to one that scales with the size of the user's change.
| 
| I spent a good chunk of time working around the lack of sparse indexes in libgit2, which produced speedups on the order of 500x for certain operations, because reading and writing the entire index is unnecessary for most users of a monorepo: https://github.com/libgit2/libgit2/issues/6036. I'm excited to see sparse indexes make their way into Git proper.
| 
| Shameless plug: I'm working on improving monorepo-scale Git tooling at https://github.com/arxanas/git-branchless, such as with in-memory rebases: https://blog.waleedkhan.name/in-memory-rebases/. Try it out if you work in a Git monorepo.
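For readers who want to try the sparse-checkout and sparse-index features the thread is discussing, here is a minimal local sketch. The directory names are invented, and the demo clones from a local path; against a real server you would typically add `--filter=blob:none` for a partial clone as well:

```shell
set -eu
# Set up a tiny "monorepo" with two top-level projects.
tmp=$(mktemp -d) && cd "$tmp"
git init -q mono && cd mono
git config user.email dev@example.com
git config user.name dev
mkdir -p client/web server/api
echo 'console.log("hi")' > client/web/app.js
echo 'package main' > server/api/main.go
git add . && git commit -qm 'initial layout'
cd ..

# Clone with a sparse working tree from the start...
git clone -q --sparse mono work && cd work
# ...then materialize only the directory you work on.
git sparse-checkout set client/web
# Ask Git to keep the *index* sparse too, so out-of-cone
# directories collapse into single tree entries (Git >= 2.34).
git config index.sparse true

ls    # server/ is gone from the working tree
```

After this, commands like `git status` only need to consider the `client/web` entries rather than every file in the repository.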
| stormbrew wrote: | > I'm working on improving monorepo-scale Git tooling at | https://github.com/arxanas/git-branchless | | I'm intrigued by this but the readme could maybe use some work | to describe how you envision it being used day-to-day? All the | examples seem to be about using it to fix things but I'm not at | all clear how it helps enable a new workflow. | | Even if it was just a link to a similar tool? | arxanas wrote: | Thanks for the feedback. I also received this request today | to document a relevant workflow: | https://github.com/arxanas/git-branchless/issues/210. If you | want to be notified when I write the documentation (hopefully | today?), then you can watch that issue. | | There's a decent discussion here on "stacked changes": | https://docs.graphite.dev/getting-started/why-use-stacked- | ch..., with references to other articles. This workflow is | sometimes called development via "patch stack" or "stacked | diffs". But that's just a part of the workflow which git- | branchless enables. | | The most similar tool would be Mercurial as used at large | companies (and in fact, `git-branchless` is, for now, just | trying to get to feature parity with it). But I don't know if | the feature set which engineers rely on is documented | anywhere publicly. | | I use git-branchless 1) simply to scale to a monorepo, | because `git move` is a lot faster than `git rebase`, and 2) | to do highly speculative work and jump between many different | approaches to the same problem (a kind of breadth-first | search). I always had this problem with Git where I wanted to | make many speculative changes, but branch and stash | management got in the way. (For example, it's hard to update | a commit which is a common ancestor of two or more branches. | `git move` solves this.) The branchless workflow lets me be | more nimble and update the commit graph more deftly, so that | I can do experimental work much more easily. 
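The "commit that is a common ancestor of two branches" problem mentioned above is concrete enough to script. A sketch of what stock git requires, with invented branch and file names: the shared commit is fixed by stacking a follow-up commit on it, and then each descendant branch has to be replayed with its own `rebase --onto`. The comment above describes git-branchless's `git move` as collapsing this into one step:

```shell
set -eu
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email dev@example.com
git config user.name dev

# A shared commit with two branches built on top of it.
echo base > base.txt && git add . && git commit -qm base
echo shared > shared.txt && git add . && git commit -qm shared
git checkout -q -b b1 && echo one > one.txt && git add . && git commit -qm one
git checkout -q -b b2 b1~1 && echo two > two.txt && git add . && git commit -qm two

# Fix the shared commit by stacking a follow-up on it...
old=$(git rev-parse b1~1)
git checkout -q -b shared-fix "$old"
echo fixed >> shared.txt && git add . && git commit -qm 'fix shared'

# ...then replay EACH descendant branch onto the fixed base by hand.
git rebase -q --onto shared-fix "$old" b1
git rebase -q --onto shared-fix "$old" b2

# Both branches now contain the fix; their merge base is shared-fix.
git merge-base b1 b2
```

With more branches stacked on the same ancestor, the per-branch bookkeeping grows accordingly, which is the pain point being described.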
| rq1 wrote:
| What I was looking for recently is a way to make a "sparse push", and trigger a chain reaction with hooks.
| 
| Didn't find anything interesting.
| speedgoose wrote:
| "One of the biggest traps for smart engineers is optimizing something that shouldn't exist."
| 
| Elon Musk.
| joconde wrote:
| Was he talking about something specific?
| junon wrote:
| He was talking about the battery vibrator plates or something in Tesla cars.
| plopz wrote:
| I remember him saying something like that during a walkthrough of the base building Starship, and if I recall it was in reference to overengineering something about the grid fins.
| solarmist wrote:
| Let's make the snarky comment into a helpful comment. Why do you think it shouldn't exist?
| speedgoose wrote:
| Monorepos create more issues than they solve.
| solarmist wrote:
| Such as? That's just parroting "common opinion" otherwise.
| speedgoose wrote:
| The second paragraph of the article we are discussing, for example.
| 
| But you can find a list on Wikipedia and form your own opinion: https://en.m.wikipedia.org/wiki/Monorepo
| jeremyjh wrote:
| This is no different from saying "monorepo bad". Aside from performance issues in git, why would a monorepo be bad? It seems very natural to me to have a whole system referenced with a single branch/tag that must all pass CI together. Otherwise supporting projects can introduce breaking changes downstream that are not apparent before they hit master.
| tambourine_man wrote:
| Supporting evidence?
| speedgoose wrote:
| The Wikipedia article about monorepos has a good summary:
| 
| https://en.m.wikipedia.org/wiki/Monorepo
| 
| Then you can form your own opinion; I'm sharing mine.
| jeremyjh wrote:
| Apart from performance issues, that article offers more (and more significant) advantages than it does drawbacks, so it really does not support your statement.
| ratww wrote:
| That's completely false.
Monorepos don't really create issues at all when done properly, and when used in situations where they make sense.
| 
| At smaller scales, for example, they're fantastic for productivity, and my company is not looking back.
| tsimionescu wrote:
| So the Linux devs have no idea how to properly use Git?
| speedgoose wrote:
| I'm not sure whether the Linux kernel git repository qualifies as a monorepo.
| anon9001 wrote:
| This is well written and deserves my upvote, because sparse-checkout is part of git and knowing how it works is useful.
| 
| That said, there's absolutely no reason to structure your code in a monorepo.
| 
| Here's what I think GitHub is doing:
| 
| 1) Encourage monorepo adoption
| 
| 2) Build tooling for monorepos
| 
| 3) Sell tooling to developers stranded in monorepos
| 
| Microsoft, which owns GitHub, created the microsoft/git fork linked in the article, and they explain their justification here: https://github.com/microsoft/git#why-is-this-fork-needed
| 
| > Well, because Git is a distributed version control system, each Git repository has a copy of all files in the entire history. As large repositories, aka monorepos grow, Git can struggle to manage all that data. As Git commands like status and fetch get slower, developers stop waiting and start switching context. And context switches harm developer productivity.
| 
| I believe that Google's brand is so big that it led to this mass cognitive dissonance, which is being exploited by GitHub.
| 
| To be clear, here are the two ideas in conflict:
| 
| * Git is decentralized and fast, and Google famously doesn't use it.
| 
| * Companies want to use "industry standard" tech, and Google is the standard for success.
| 
| Now apply those observations to a world where your engineers only use "git".
| 
| The result is market demand to misuse git for monorepos, which Microsoft is pouring huge amounts of resources into enabling via GitHub.
| | It makes great sense that GitHub wants to lean into this. More | centralization and being more reliant on GitHub's custom tooling | is obviously better for GitHub. | | It just so happens that GitHub is building tools to enable | monorepos, essentially normalizing their usage. | | Then GitHub can sell tools to deal with your enormous monorepo, | because your traditional tools will feel slow and worse than | GitHub's tools. | | In other words, GitHub is propping up the failed monorepo idea as | a strategy to get people in the pipeline for things like | CodeSpaces: https://github.com/features/codespaces | | Because if you have 100 projects and they're all separate, you | can do development locally for each and it's fast and sensible. | But if all your projects are in one repo, the tools grind to a | halt, and suddenly you need to buy a solution that just works to | meet your business goals. | jeffbee wrote: | > Git is ... fast, and Google ... doesn't use it. | | Everything about git is orders of magnitude slower than the | monorepo in use at Google. Git is not fast, and its slowness | scales with the size of your repo. | tsimionescu wrote: | Monorepos are much easier for everyone to use, and are the only | natural way to manage code for any project. You keep talking | about Google, but a much more famous monorepo is Linux itself. | Perhaps Linus Torvalds has fallen into Google's hype? | | The fact that git is very poor at scaling monorepos might mean | that it's a bad idea to use git for larger organizations, not | that it's a bad idea to use monorepos. If git can be improved | to work with monorepos, all the better. | anon9001 wrote: | > Monorepos are much easier for everyone to use, and are the | only natural way to manage code for any project. | | I strongly disagree with that, but I'll let this blog post | explain it better than I can: | https://medium.com/@mattklein123/monorepos-please- | dont-e9a27... 
| > You keep talking about Google, but a much more famous monorepo is Linux itself.
| 
| I thought it was fairly well known that monorepos came directly from Google as part of their SRE strategy. It didn't even come into common usage until around 2017 (according to Wikipedia). If I'm remembering correctly, the SRE book recommends it, and that's why it gained popularity.
| 
| Also, I don't believe that Linux is a valid interpretation of "monorepo". Linux is a singular product. You can't build the kernel without all of the parts.
| 
| A better example would be if there were a "Linus" repo that contained both git and linux. There isn't, and for good reason.
| 
| > The fact that git is very poor at scaling monorepos might mean that it's a bad idea to use git for larger organizations, not that it's a bad idea to use monorepos. If git can be improved to work with monorepos, all the better.
| 
| Any performance improvement in git is welcome, but anything that sacrifices a full clone of the entire repository is antithetical to decentralization.
| 
| The whole point of git is decentralized source code.
| solarmist wrote:
| Monorepos (up to a certain size, where git starts getting too slow) are easier to use unless you have sufficient investment in dev tooling.
| 
| I think "monorepo" here is shorthand for large, complex repos with long histories, which git does not scale well to, whether or not the repo holds everything an organization has. For example, I'd call the Windows OS a monorepo for all of the important reasons.
| howinteresting wrote:
| > The whole point of git is decentralized source code.
| 
| The "whole point of git" is to provide value to its users. Full decentralization is not necessary for that.
| dataangel wrote:
| > Also, I don't believe that Linux is a valid interpretation of "monorepo". Linux is a singular product. You can't build the kernel without all of the parts.
| | But it's also larger scale than the vast majority of | startups will ever reach. My work has had the same monorepo | for 8 years with over 100 employees now and git has had few | problems. | cdcarter wrote: | I think it's at least somewhat fair to call Linux a | monorepo. There are a lot of drivers included in the main | tree. They don't need to be, (we know this because there | are also lots of drivers not in the source tree). But by | including them, the kernel devs can make large changes to | the API and all the drivers in one go. This is a classic | "why use a monorepo". | ajkjk wrote: | Very much doubt that's their corporate strategy. More likely | it's as simple as: lots of people have monorepos; they have | lots of issues with Git and Github; Github wants their | business. | ratww wrote: | _> That said, there 's absolutely no reason to structure your | code in a monorepo._ | | Bullshit. There are very good reasons to use it in some | situations. My company is using it and it's a tremendous | productivity boon. And Git works _perfectly fine_ for smaller | scales. | | Obviously, "because Google does it" is a terrible reason. But | it's disingenuous to say that's the only reason people are | doing it. Not everyone is a moron. | anon9001 wrote: | I'm glad you're having a good experience now, and git as a | monorepo will work fine at smaller scales, but you will | outgrow it at some point. | | When you do, you have two choices. You can either commit to | the monorepo direction and start using non-standard tooling | that sacrifices decentralization, or you can break up your | repo into smaller manageable repos. | | I don't have any problem with small organizations throwing | everything into one git repo because it's convenient. | | My objection is that when you eventually do hit the limits of | git, will you choose to break the fundamentals of git | decentralization as a workaround? 
Or will you break up the repo into a couple of other repos with specific purposes?
| 
| I don't like that GitHub makes money by encouraging people to make the wrong choice at that juncture.
| ratww wrote:
| When I hit the limits of git, then I will worry about it.
| 
| One of our tasks when building the monorepo was proving it was possible to split it again. It was trivial, and we have tools to help us avoid complexity.
| 
| We're not using GitHub, so that part doesn't apply to me.
| 
| Also, nice of you to assume we'll get to Google scale, but thanks to the monorepo, I was able to make a few pull requests reducing duplication, cutting the line count of the app by thousands ever since. So I really don't see us getting to Google scale anytime soon. We're downsizing.
| 
| I also find it ironic that you're accusing people of "copying Google" in a parent post, but you're the one assuming that everyone will hit Google limits...
| anon9001 wrote:
| If you ever do hit a git limit where it's no longer comfortable to keep the whole repo on each developer machine, I would encourage you to split up the repo into separate project-based repos rather than switching to Microsoft's git fork.
| 
| As a best practice, there's a reason that Linus started git in a separate repo, rather than as part of the Linux project. The reason is that if you put too many projects into one git repo, and it gets too large, you do eventually hit a scale where it becomes a problem.
| 
| A very simple way to mitigate that is to keep each project in its own repo, which you can easily do once you start hitting git scale problems.
| 
| Thankfully, one of the original git use cases was to decompose huge svn repos into smaller git repos, so the tooling required is already built in.
| 
| > I also find it ironic that you're accusing people of "copying Google" in a parent post but you're the one assuming that everyone will hit Google limits...
| | I think you got the wrong take there. I'm saying that | Google's monorepo approach is only valid because they | invested so heavily into building custom tooling to | handle it. We don't have access to those tools and | therefore shouldn't use their monorepo approach. | | If you're going to use git, you're going to have the most | success using it as intended, which is some logical | separation of "one repo per project" where "project" | doesn't grow too out of hand. The Linux kernel could be | thought of as a large project that git still handles just | fine. | | Tragically, I think if Google did opensource their | internal vcs and monorepo tooling, they would immediately | displace git as the dominant vcs and we would regress | back to trunk-based development. | rsj_hn wrote: | > I'm glad you're having a good experience now, and git as | a monorepo will work fine at smaller scales, but you will | outgrow it at some point. | | I would say the opposite. A lot of companies are fine with | independent teams using their own versions of dependencies | and their own versions of core code, but at some point that | becomes unmanageable and you need to start using a common | set of dependencies and the same version of base frameworks | to reduce the complexity. That means pushing a patch to a | framework means all the teams are upgraded. Monorepos are | the most common solution to enforce that behavior. | | Look, this is all dealing with the problem of coordination | in large teams. Different organizations have different | capacities for coordination, and so it's like squeezing a | balloon -- yes, you want more agility to pick your own deps | but then the cost of that is dealing with so much | complexity when you need to push a fix to a commonly used | framework or when a CVE is found in a widely used dep and | needs to be updated by 1000 different teams all manually. | | There is no "right" way. 
It's just something organizations have to struggle with because it's going to be costly no matter what, and all that matters is what type of cost your org is most easily able to bear. That will decide whether you use a monorepo or a bunch of independent repos, whether you go for microservices or a monolith, and most companies will do some mix of all of the above.
| anon9001 wrote:
| > Monorepos are the most common solution to enforce that behavior.
| 
| Yes. This is very accurate and also the problem. Monorepos are being used as a political tool to change behavior, but the problem is that this has severe technical implications.
| 
| > There is no "right" way.
| 
| With git, there is a "wrong" way, and that's not separating your project into different repos. It causes real-world technical problems, otherwise we wouldn't have this article posted in the first place.
| 
| > It's just something organizations have to struggle with because it's going to be costly no matter what, and all that matters is what type of cost your org is most easily able to bear.
| 
| It's not a coin toss whether monorepos will have better or worse support from all standard git tooling. It will be worse every time.
| 
| The amount of tooling required to enforce dependency upgrades, code styles, security checks, etc. across many repos is significantly less than the amount of tooling required to successfully use a monorepo.
| philosopher1234 wrote:
| If you want to play right and wrong, I will say that now it's the right way, since there is support for sparse checkouts in git.
| 
| This isn't a useful game to play.
| eximius wrote:
| If you are in an enterprise setting, you _don't need decentralized version control_.
| 
| So, yea, for companies, monorepos are a no-brainer in a lot of ways.
| 
| For open source, separate repos make more sense.
| 
| To expand on corporate monorepos: if you can still set up access control (e.g., code owners to review additions by domain) and code visibility (so there isn't _unlimited_ code sharing), then I can't think of a reason not to use monorepos.
| IshKebab wrote:
| > you will outgrow it at some point
| 
| Given that Google and Microsoft use monorepos, that seems unlikely!
| anon9001 wrote:
| Google had to build an internal version control system as an alternative to git and perforce to support their monorepo.
| 
| Microsoft forked git and layered their own file system on top of it to support a centralized git workflow so that they could have a monorepo.
| dlp211 wrote:
| Having used both, Google's implementation is IMO the superior version of monorepo. Really, Google's Engineering Systems are just better than anything that I have ever used anywhere else.
| anon9001 wrote:
| This is exactly as I'd expect.
| 
| If you want centralized, trunk-based version control, don't use git.
| 
| It's funny how each company decides to solve these problems.
| 
| Google called in the computer scientists and designed a better centralized vcs for their purposes. Good on them. It'd be great if they open sourced it. So typical of Google to invent their own thing and keep it private.
| 
| Microsoft took the most popular vcs (git), and inserted a shim layer to make it compatible with their use case. How expected that Microsoft would build a compatibility shim that attempts to hide complexity from the end user.
| 
| Meanwhile, Linux and Git are plugging along just fine, in their own separate repos, even though many people work on both projects.
| IshKebab wrote:
| > So typical of Google to invent their own thing and keep it private.
| 
| Yeah, like their build system... Bazel, that's completely closed source.
| jayd16 wrote:
| Your logic is circular. No one should work on monorepos because... monorepos are bad because...
git can't easily handle them and we shouldn't fix that because... No one should work on monorepos...
| 
| Clearly there are reasons people like monorepos, and it makes sense to update git to support the workflow.
| anon9001 wrote:
| That isn't circular. The conclusion should be that git, a decentralized vcs, should not take on changes to make it a centralized vcs.
| 
| If you think that git needs to be "fixed" or "updated" to support a centralized vcs server to do partial updates over the network, then I think you've missed the point of git.
| dboreham wrote:
| > it's a tremendous productivity boon
| 
| Curious to hear more specifics on this. Did you migrate from separate repos to a monorepo and subsequently measure improved productivity as a result?
| ratww wrote:
| Correct. We measured how long it took to integrate changes in the core libraries into the consumers (multiple PRs) versus doing it in a monorepo (a single PR for the change). We ran them together for a couple of weeks and the difference was big.
| 
| The biggest differences were in changes that would break the consumers. For these cases we had to go back and patch the original library, or revert and start from scratch. But even for the easy changes, just the "bureaucracy" of opening tens of pull requests, watching a few CI pipelines and getting them approved by different code owners was also large.
| 
| Now, whenever we have changes in one of the core libraries, we also run full tests in the library consumers. With tests running in parallel, sometimes it takes 20 minutes (instead of 4-5 hours) to get a patch affecting all frontends tested, approved and merged into the main branch.
| 
| Also, everyone agreed that having multiple PRs open is quite stressful.
| solarmist wrote:
| From my understanding, Microsoft is doing it because they want to use git for developing Windows, which is (was?) a large monorepo.
| omegalulw wrote:
| Your take is extremely biased.
You only discuss why monorepos are bad.
| 
| Here are some of the many reasons why monorepos are excellent:
| 
| - Continuous integration. Every project is almost always using the latest code from other projects and libraries it depends on.
| 
| - Builds from scratch are very easy and don't need extravagant tooling.
| 
| - Problems due to build versions in dependency management are reduced (everyone is expected to use HEAD).
| 
| - The whole organization settles on common build patterns - so if you want to add a new dependency you won't need to struggle with its build system. Conversely, you need to write less documentation on how to build your code - because that's now standard.
| anon9001 wrote:
| Heh, the major problems that I've run into using monorepos in the real world at scale are:
| 
| - CI breaks all the time. Even one temperamental test from anywhere else in the organization can cause your CI run to fail.
| 
| - Building the monorepo locally becomes very complicated, even to just get your little section running. Now all developers need all the tools used in the monorepo.
| 
| - Dependencies get upgraded unexpectedly. Tests aren't perfect, so people upgrade dependencies and your code inevitably breaks.
| 
| It's cool that everyone is on the same coding style, but that's very much achievable with a shared linter configuration.
| dlp211 wrote:
| Your problem isn't the monorepo, it's bad tooling. Tests should only execute against code that changed. Builds should only build the thing you want to build, not the whole repository.
| anon9001 wrote:
| Yes!
| 
| The problem is choosing a monorepo _because_ the tooling isn't suited for monorepos.
| 
| Trying to build a monorepo with git is like trying to build your CRUD web app frontend in C++.
| 
| Sure, you can do it. WebAssembly exists and clang can compile to it. I wouldn't recommend it because the tooling doesn't match your actual problem.
| | Or maybe a better example is that it's like deciding the | browser widgets aren't very good, so we'll re-render our | own custom widgets with WebGL. Yes, this is all quite | possible, and your result might get to some definition of | "better", but you're not really solving the problem you | had of building a CRUD web app. | | Can Microsoft successfully shim git so that it appears | like a centralized trunk-based monorepo, the way you'd | find at an old cvs/svn/perforce shop? Yes, they did, but | they shouldn't have. | | My thesis is they're only pushing monorepos because it | helps GitHub monetize, and I stand by that. | | > Tests should only execute against code that changed. | Builds should only build the thing you want to build, not | the whole repository. | | How do you run your JS monorepo? Did you somehow get | bazel to remote cache a webpack build into individual | objects, so you're only building the changes? Can this | even be done with a modern minimization tool in the | pipeline? Is there another web packager that does take a | remotely cachable object-based approach? | | I don't know enough about JS build systems to make a | monorepo work in any sensible way that utilizes caching | and minimizes build times. If anything good comes out of | the monorepo movement, it will be a forcing function that | makes JS transpilers more cacheable. | | And all this for what? Trunk-based development? So we can | get surprise dependency updates? So that some manager | feels good that all the code is in one directory? | | The reason Linus invented git in the first place was | because decentralized is the best way to build software. | He literally stopped work on the kernel for 2 weeks to | build the first version of git because the scale by which | he could merge code was the superpower that grew Linux. 
| | If you YouTube search for "git linus" you can listen to | the original author explain the intent from 14 years ago: | https://www.youtube.com/watch?v=4XpnKHJAok8 | | If this is a topic you're passionate about, I'd encourage | you to watch that video, as he addresses why | decentralizing is so important and how it makes for | healthy software projects. It's also fun to watch old | Googlers not "get it". | | He was right then and he's right now. It's disappointing | to see so much of HN not get it. | Orphis wrote: | > Git is decentralized and fast, and Google famously doesn't | use it. | | Most (all?) of Google OSS software is hosted on either Gerrit | or Github. Git is not used by the "google3" monorepo, but it's | used by quite a few major projects. | nightpool wrote: | Is there a point in having a monorepo if you're all in on the | microservices approach? I'm a big microservices skeptic, but as | far as I understand it, the benefit of microservices is | independence of change & deployment enforced by solid API | contracts--don't you give that all up when you use a monorepo? | What does "Monorepo with microservices" give you that a normal | monolithic backend doesn't? | | (Obviously e.g. an image resizer or something else completely | decoupled from your business logic should be a separate service / | repo _anyway_ --my point is more along the lines of "If something | shares code, shouldn't it share a deployment strategy?") | rsj_hn wrote: | Yeah, I've seen it used to allow teams to use consistent | frameworks and libraries across many different microservices. | Think of authentication, DB clients, logging, webservers, | grpc/http service front ends, uptime oracle -- there's lots of | cross cutting concerns that are shared code among many | microservices. | | So the next thing you decide to do is create some microservice | framework that bundles all that stuff in and allow your | microservice team to write some business logic on top. 
But now | 99% of your executables are in this microservice framework that | everyone is using, and that's the point where a lot of | companies go the monorepo route. | | Actually most companies do some mix -- have a lot of stuff in a | big repo and then other smaller repos alongside that. | johnmaguire wrote: | Monorepo with microservices gives you the ability to scale and | perform SRE-type maintenance at a granular level. Teams | maintain responsibility for their service, but are more easily | able to refactor, share code, pull dependencies like GraphQL | schemas into the frontend, etc. across many services. | nightpool wrote: | So basically each team has to reinvent devops from the ground | up, and staff their own on call rotation, instead of having a | centralized devops function that provides a stable platform? | That sounds horrendous. | | Although, that said, I can at least _see_ the benefits of the | "1 service per team" methodology, where you have a dedicated | team that's independently responsible for updating their | service. I'm more used to associating "microservices" with | the model where a single team is managing 5 or 6 interacting | services, and the benefits there seem much smaller. | johnmaguire wrote: | > That sounds horrendous. | | Different teams can make their own decisions, but as a | developer on a team that ran our own SRE, I found it came | with many advantages. Specifically, we saw very little | downtime, and when outages did occur were very prepared to | fix it as we knew the exact state of our services (code, | infrastructure, recent changes and deploys.) Additionally, | we had very good logging and metrics because we knew what | we'd want to have in the event of a problem. | | And I'm not sure what you mean "from the ground up." We | were able to share a lot of Ansible playbooks, frameworks, | and our entire observability stack across all teams. | | But I think you may also be missing the rest of my post. 
| This is only one possible advantage. Even if the team | doesn't perform their own SRE, these services can be scaled | independently - both in terms of infrastructure and | codebase - even while sharing code (including things like | protocol data structures, auth schemes, etc.) | | A service that receives 1 SAML Response from an IdP | (identity provider) per day per user may not need the same | resources as a dashboard that exposes all SP (service | providers) to a user many times a day. And an | administration panel for this has still different needs. | | Yet, all of these services may communicate with each other. | echelon wrote: | > Is there a point in having a monorepo if you're all in on the | microservices approach? | | Monorepos are excellent for microservices. | | - You can update the protobuf service graph (all strongly | typed) easily and make sure all the code changes are | compatible. You still have to release in a sensible order to | make sure the APIs are talking in an expected way, but this at | least ensures that the code agrees. | | - You can address library vulns and upgrades all at once for | everything. Everything can get the new gRPC release at the same | time. Instead of having app owners be on the hook for this, a | central team can manage these important upgrades and provide | assistance / pairing for only the most complex situations. | | - If you're the one working on a very large library migration, | you can rebase daily against the entire fleet of microservices | and not manage N-many code changes in N-many repos. This makes | huge efforts much easier. Bonus: you can land incrementally | across everything. | | - If you're the one scoping out one of these "big changes", you | can statically find all of the code you'll impact or need to | understand. This is such an amazing win. No more hunting for | repos and grepping for code in hundreds of undiscovered places. 
| | - Once a vuln is fixed, you can tell all apps to deploy after | SHA X to fix VULN Y. This is such an easy thing for app owners | to do. | | - You can collect common service library code in a central | place (eg. internal Guava, auth tools, i18n, etc). Such | packages are easy to share and reuse. All of your internal code | is "vendored" essentially, but you can choose to depend on only | the things you need. A monorepo only feels heavy if you depend | on all the things (or your tooling doesn't support git or build | operations - you seriously have to staff a monorepo team). | | - Other teams can easily discover and read your code. | | Monorepos are the best possible way to go as long as you have | the tooling to support it. They fall over and become a burden | if they're not seriously staffed. When they work, they really | work. | nightpool wrote: | None of this addresses my question--what benefits do you get | from having monorepo-with-microservices over a monolithic | backend? All of the things you mentioned would be even | _easier_ with a monolithic backend. | staticassertion wrote: | They solve different problems, some of which may overlap I | suppose. | | For one thing you get clear ownership of deployed code. | There isn't one monolithic service that everyone is | responsible for babying, every team can baby their own, | even if they all share libraries and whatnot. | | You also get things like fault isolation and fine grained | scaling too. | echelon wrote: | Should have also mentioned: all those changes to cross | cutting library code will trigger builds and tests of the | dependent services. You can find out at once what breaks. | It's a superpower. 
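A toy sketch of the reverse-dependency query that monorepo build tooling performs to "find out at once what breaks" — all target names here are invented for illustration: given a changed library, list every service whose builds and tests should run.

```python
# Map each target to the targets it depends on (a toy monorepo build graph;
# real tooling like Bazel derives this from BUILD files).
deps = {
    "svc/auth": ["lib/guava-internal", "lib/grpc-helpers"],
    "svc/billing": ["lib/grpc-helpers"],
    "svc/images": [],
}

def affected_by(changed: str) -> list[str]:
    """Return the services to rebuild and retest after `changed` is edited."""
    return sorted(svc for svc, ds in deps.items() if changed in ds)

# Editing the shared gRPC helpers triggers both of its consumers.
print(affected_by("lib/grpc-helpers"))
```

Real build systems compute this transitively over a much larger graph, but the principle is the same: one cross-cutting library change immediately identifies every dependent service.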
| aranchelk wrote: | The more I've come to rely on techniques like canary testing and | opt-in version upgrades, the more firmly I believe one of the main | motivations for monorepos is flawed: at any given time there may | not be a fact of the matter as to which single version of an app | or service is running in an environment. | | At places I've worked, when we thought about canary testing we | ignored the fact that there were multiple versions of the | software running in parallel; we classified it as part of an | orchestration process and not a reality about the code or env, | but we really did have multiple versions of a service running at | once, sometimes for days. | | Similarly, if you've got a setup where you can upgrade | users/regions/etc. piecemeal (opt-in or by other selection | criteria), I don't know how you reflect this in a monorepo. I'm | curious how Google actually does this, as I recall they have | offered user opt-in upgrades for Gmail. My suspicion is this gets | solved with something like directories ./v2/ and ./v3/ -- but | that's far from ideal. | eximius wrote: | That doesn't seem like a problem with monorepos (or otherwise). | | You'd just need to tag your releases, right? | aranchelk wrote: | I don't think it's a problem; rather, I'm challenging a touted | benefit. | | In large monorepos the supposition is that you've got a class of | compatible apps and services bundled together. Version | dependencies are somewhat implicit: the correct version for | each project to interoperate is whatever was checked in | together in your commit. | | I don't know how it works in practice at different orgs, but | there's certainly an idea I've heard repeated that you can | essentially test, build, and deploy your monorepo atomically, | but the reality in my experience is you can't escape the need | to think about compatibility across multiple versions of | services once you use techniques like canary testing or | targeted upgrades.
| sroussey wrote: | This is still true, but it's a matter of degree. Even with | feature-flagged deploys mixed with canarying, the permutations | are all evident, and ideally tested. | | Also, you wouldn't expect a schema change to land together with | the code that requires it. Those will need to happen earlier. | | Real systems are complex. A monorepo is one attempt at | capping the complexity to known permutations. For smaller | teams, it might collapse to a single one. | staticassertion wrote: | You still have to think about compatibility across versions | - that does not go away in a monorepo, and you should use | protocols that enforce compatible changes. The monorepo | just tells you that all tests pass across your entire | codebase given a change you made to some other part. | eximius wrote: | That's fair. You require reasonable deployment intervals | and may need to wait to merge based on deployment. | Workflow actions that can check whether a commit is | deployed in a given environment are invaluable. | jeffbee wrote: | > may need to wait to merge based on deployment | | Again, this fundamentally misunderstands the purpose of | the source code repo and how it relates to the build | artifacts deployed in production. If you are waiting for | something to happen in production before landing some | change, that tells me right there you have committed some | kind of serious error. | joshuamorton wrote: | I'd caveat this with _code_ change. | | It's very common to need to wait for some version of a | binary to be live before updating some associated | configuration to enable a feature in that binary (since | the _dynamic_ configuration isn't usually versioned with | the binary). It's possible that some systems exist that | fail quietly, with a non-existent configuration option | being silently ignored, but the ones I know of don't do | that. | coryrc wrote: | Only binaries are released. Binaries are timestamped and linked | to a changelist.
| | The "opt-in upgrades" are all live code. I know more than a few | "foo" "foo2" directories, but I wouldn't want an actively- | delivered, long-running service to be living in a feature | branch so I would still expect anyone to be using a similar | naming scheme. | ASinclair wrote: | > I'm curious how Google actually does this | | Branches are cut for releases. Binaries are versioned. | [deleted] | jeffbee wrote: | > the main motivations for monorepos is flawed: at any given | time there may not be a fact of the matter as to which single | version of an app or service is running in an environment. | | Your understanding of the motivations for monorepo is flawed. | I've never heard anyone even advocate for this as a reason for | monorepos. For some actual reasons people use monorepos, see | https://danluu.com/monorepo/ | | Regarding your question, which I re-emphasize has got nothing | to do with the arrangement of the source code, the solution is | to simply treat your protocol as API and follow these rules: 1) | Follow "Postel's Law", accepting to the best of your abilities | unknown messages and fields; 2) never change the meaning of | anything in an existing protocol. Do not change between | incompatible types, or change an optional item to required; 3) | Never re-use a retired aspect of the protocol with a different | meaning; 4) Generally do not make any incompatible change to an | existing protocol. If you must change the semantics, then you | are making a new protocol, not changing the old one, | | > I don't know how you reflect this in a monorepo | | You don't. Why would the deployed version of some application | be coded in your repo? It's simply a fact on the ground and | there's no reason for that to appear in source control. | aranchelk wrote: | We may just be talking past each other, but in the link you | provided, sections "Simplified dependencies" and (to a lesser | extent) "Cross-project changes" are pretty much exactly what | I'm talking about. 
| joshuamorton wrote: | They aren't, because those discussions are all related to | link-time stuff (if I update foo.h and bar.c, which depends | on foo.h, I can do so atomically, because those are built | into the same artifact). | | As soon as you discuss network traffic (or really anything | that crosses an RPC boundary), things get more complicated, | but none of that has anything to do with a monorepo, and | monorepos still sometimes simplify things. | | So there are a few tools that are common: feature flags, | 3-stage rollouts, and probably more that are relevant, but | let's dive into those first two. | | Feature "flags" are often dynamically scoped and runtime- | modifiable. You can change a feature flag via an RPC, | without restarting the running binary. This is done by | having something along the lines of: if | (condition_that_enables_feature()) { do_feature_thing() } | else { do_old_thing() } | | A/B testing tools like Optimizely and co. provide this, and | there are generic frameworks too. | `condition_that_enables_feature()` here is a dynamic | function that may change value based on the time of day, | the user, etc. (think something like | `hash(user.username).startswith(b'00') and user.locale == | 'EN'`). The tools allow you to modify and push changes to | these conditions, all without restarts. That's | how you get per-user opt-in to certain behaviors. | Fundamentally, you might have an app that is capable of | serving two completely different UIs for the same user | journey. | | Then you have "3-phase" updates. In this process, you have | a client and a server. You want to update them to use "v2" of | some API that's totally incompatible with v1. You start by | updating the server to accept requests in either v1 or v2 | format. That's stage one. Then you update the clients to | send requests in v2 format. That's stage two. Then you | remove all support for v1. That's stage three.
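A minimal sketch of the stage-one server change described above, with invented payload shapes: the server accepts both the old v1 and the new v2 request format, so clients on either version keep working during the rollout.

```python
def handle_request(payload: dict) -> dict:
    """Stage one of a 3-phase update: accept both request formats."""
    if payload.get("version") == 2:
        name = payload["user"]["name"]  # hypothetical v2 nests the user info
    else:
        name = payload["user_name"]     # hypothetical v1 keeps it flat
    return {"greeting": f"hello, {name}"}

# Old clients still send v1; updated (or canaried) clients send v2.
v1_resp = handle_request({"user_name": "ada"})
v2_resp = handle_request({"version": 2, "user": {"name": "ada"}})
```

Once every client sends v2 (stage two), the v1 branch can be deleted (stage three) — and none of this depends on whether the client and server live in one repo or two.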
| | When you canary a new version of a binary, you'll have the | old version that only supports v1, and the canary version | that supports v1 and v2. If it's the server, none of the | clients use v2 yet, so this is fine. If it's the client, | you've already updated the server to support v2, so it | works fine. | | Note again that all of this happens whether or not you use | a monorepo. | howinteresting wrote: | In general, it is good practice to try to maximize | compile-time resolution of dependencies and minimize | network resolution of them. Services are great when the | working set doesn't fit in RAM or the different parts have | different hardware needs, but trying to make every little | thing its own service is foolish. | | Doing so makes this a less pertinent problem. | [deleted] | rurban wrote: | Even better is the new ORT merge strategy, which benefits | everyone, not only the tiny monorepo userbase. ___________________________________________________________________ (page generated 2021-11-11 23:00 UTC)