[HN Gopher] Improving large monorepo performance on GitHub
       ___________________________________________________________________
        
       Improving large monorepo performance on GitHub
        
       Author : todsacerdoti
       Score  : 224 points
       Date   : 2021-03-16 16:37 UTC (6 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | seattle_spring wrote:
       | What geographic feature is pictured in the hero shot of this
       | blogpost? At first I thought it was the Golden Throne in Capitol
       | Reef but I now think it's something else. I'm 90% sure it's
       | either in Capitol Reef or Grand Staircase.
        
         | david_allison wrote:
         | https://www.flickr.com/photos/23155134@N06/7132776459
         | 
         | Forest Mountains of Zion National Park, Utah
        
           | seattle_spring wrote:
           | Close but so far! Thanks so much.
        
       | chris_wot wrote:
       | Did they contribute the repack optimizations upstream?
        
       | iliekcomputers wrote:
       | Nice work, really interesting blog post!
       | 
       | On a sidenote, git itself can also get painfully slow with large
       | monorepos. Hope GitHub can push some changes there as well.
       | 
       | I know FB moved off git to mercurial because of performance
       | issues.
        
         | klodolph wrote:
         | My understanding is that neither Git nor Mercurial can do this
         | well out of the box, and FB and Google both have their own
         | extensions to Mercurial to make this possible (because even
          | though Mercurial is often slower than Git, it's extensible).
         | 
         | e.g. https://facebook.github.io/watchman/ - used as part of
         | Facebook's Mercurial solution, I think.
        
           | vtbassmatt wrote:
           | Git also has a file system monitor interface which can use
           | Watchman. We (GitHub) are working on a native file system
           | monitor implementation in addition -
           | https://github.com/gitgitgadget/git/pull/900.
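            | 
            | For anyone who wants to try the hook-based interface today,
            | wiring it to Watchman looks roughly like this (a sketch
            | using the sample hook shipped with recent git versions;
            | check your version's docs for core.fsmonitor):
            | 
            |     # install the Watchman sample hook as the fsmonitor backend
            |     cp .git/hooks/fsmonitor-watchman.sample \
            |        .git/hooks/fsmonitor-watchman
            |     git config core.fsmonitor .git/hooks/fsmonitor-watchman
            |     # pairs well with the untracked cache
            |     git config core.untrackedCache true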
        
           | jauer wrote:
           | And then from mercurial extensions to our own server,
           | mononoke, which apparently has been moved under the Eden
           | umbrella: https://github.com/facebookexperimental/eden
        
           | jayd16 wrote:
           | I thought Google used some custom fork of perforce.
        
             | klodolph wrote:
             | From what I understand, Piper is not a fork of Perforce,
             | but instead a completely different system with the same
             | interface. You know, built on top of a BigTable or Spanner
             | cluster instead of whatever Perforce uses.
             | 
             | The Mercurial extensions are then an _alternative client_
             | for Piper.
        
         | pitaj wrote:
         | You might be interested in scalar [1] developed by Microsoft
         | for handling large repos.
         | 
         | [1]: https://github.com/microsoft/scalar
        
           | WorldMaker wrote:
           | It's also interesting to note how much of Microsoft's work
           | for handling large repos in git has merged upstream directly
           | into git itself.
           | 
           | One very interesting part of that is the effort that has gone
           | into the git commit-graph: https://git-scm.com/docs/commit-
           | graph.
           | 
           | It's part of what makes scalar interesting compared to some
            | of the projects you hear mentioned in use inside the FB and
           | Google gates: not only is scalar itself open source, but a
           | lot of what scalar does is tune configuration flags to turn
           | on optional git features such as the commit-graph, sparse
            | checkout "cones", etc. that are all themselves directly
           | supported by the git client. Even if you aren't at the scale
           | where it makes sense to use all of the tools that scalar
           | provides, you can get some interesting baby steps by
           | following scalar's "advice" on git configuration.
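            | 
            | A rough sketch of a few of those opt-in knobs (flag names
            | per the git docs; how much each helps depends on the repo
            | and git version):
            | 
            |     # maintain a commit-graph file for fast history traversal
            |     git config fetch.writeCommitGraph true
            |     git commit-graph write --reachable
            |     # umbrella setting for large-working-tree defaults
            |     # (index v4, untracked cache)
            |     git config feature.manyFiles true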
        
         | rmasters wrote:
         | If you are willing to adapt to a different structure and
         | workflow, you can filter the scope of git down dramatically
         | with sparse checkouts (as @WorldMaker also mentioned).
         | 
         | https://github.blog/2020-01-17-bring-your-monorepo-down-to-s...
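          | 
          | Condensed from that post (needs git 2.25 or so; the paths are
          | placeholders):
          | 
          |     git clone --filter=blob:none --sparse <url> big-repo
          |     cd big-repo
          |     git sparse-checkout init --cone
          |     # only the directories you name get populated
          |     git sparse-checkout set services/my-service shared/libs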
        
           | chgibb wrote:
           | Sparse-checkouts are amazing. I wrote some small tools that
           | use dependency information in Flutter packages to drive a
           | sparse-checkout. We use it at $dayjob now.
        
       | tuyiown wrote:
       | Offtopic, but I often wonder if there are people using `git
       | worktree` to have several related code trees within the same
       | repo.
       | 
        | Technically it works mostly the same as multiple repos, but in
        | theory it allows something like a bootstrap script with
        | everything self-contained in the same repo. It looks like an
        | alternative tradeoff between a monorepo with shared history and
        | multiple repositories.
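        | 
        | Concretely, I mean something like this (branch and path names
        | made up):
        | 
        |     # one clone, several checked-out trees sharing one object store
        |     git worktree add ../app-v1 release-1.x
        |     git worktree add ../experiments my-experiment
        |     git worktree list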
        
         | numbsafari wrote:
         | I'm not sure worktree works exactly as you think.
         | 
         | I use worktree locally so that, for example, I can have my
         | working copy that I am doing development in and then a separate
         | working copy where I can do code review for someone else,
         | without having to interrupt what I am doing in my own worktree.
         | 
         | My own experience is that if you are using branches with
         | radically different content for different purposes in the same
         | tree, it's going to end up a mess at some point. Worktrees, as
         | far as I am aware, do not help with that in any special way.
        
         | pjc50 wrote:
         | I tried that, and discovered that it won't let you have two
         | worktrees with the same branch checked out?
        
           | hiq wrote:
           | Offtopic (I think), but I just learned that "For instance,
           | git worktree add -d <path> creates a new working tree with a
           | detached HEAD at the same commit as the current branch."
           | (from the manpage).
           | 
           | It's offtopic because the second worktree has a detached
           | HEAD, so that doesn't help in the case you mention.
        
           | ivanbakel wrote:
           | Which is very sane - what should git do if you modify the
           | branch in one tree but not in the other? The least painful
           | solution would require something like multiple-refs-per-
           | remote-branch, which would be (to my understanding) a re-
           | architecture.
        
             | pjc50 wrote:
             | Well, I wanted the same effect as two separate checkouts
             | (ie entirely separate branch structures) but with a bit of
             | the disk space shared between them, but that's not how it
             | works.
        
         | oftenwrong wrote:
         | Are you thinking of git subtree?
         | 
         | https://www.atlassian.com/git/tutorials/git-subtree
        
         | hiq wrote:
         | Do you have an example of this setup?
         | 
         | As far as I know git worktree is just to have different
         | branches of the same repo checked out in different locations.
         | At least, that's the only way I use it (and it's great!). Are
         | you suggesting to have different projects on different
         | branches? So an empty "master", then "project1", "project2"
         | etc. as branches?
        
       | adeltoso wrote:
       | Just 10 years too late, I remember when Facebook switched to
       | Mercurial because the Git community wouldn't care about big
       | monorepos. Mercurial is great!
        
       | kevincox wrote:
       | I'm slightly surprised that GitHub is still basically storing a
       | git repo on a regular filesystem using the Git CLI. I would have
       | expected that the repos were broken up into individual objects
       | and stored in an object store. This should make pushes much
        | faster, as you'd have basically infinitely scalable writes. It
        | does make pulls more difficult, but packfiles could still be
        | computed (asynchronously), and with some attention to
        | data-locality it should be possible.
       | 
       | This would be a huge rewrite of some internals but seems like it
        | would be a lot easier to manage. It would also provide some
        | benefits, as objects could be shared between repos (although some
       | care would probably be necessary for hash collisions) and it
       | would remove some of the oddness about forks (as IIUC they
       | effectively share the repo with the "parent" repo).
       | 
       | I would love to know if something like this has been considered
        | and why they decided against it.
        
         | hobofan wrote:
         | > I'm slightly surprised that GitHub is still basically storing
         | a git repo on a regular filesystem using the Git CLI.
         | 
         | Maybe I'm a bit dense, but how did you get that from the
         | article? I'm fairly certain that in other pieces of writing
         | they showed that they are using an object store, and I'm
         | guessing that's what the "file servers" in the article are.
        
           | stuhood wrote:
           | `git repack` is an operation that is fairly specific to git's
           | default file format: if they were storing objects in any
           | (other) database, it is very unlikely that they would
           | experience blocking repack operations, as that is an area
           | where databases are highly optimized to execute
           | incrementally.
        
         | lumost wrote:
         | I am not a github employee, but my 2 cents.
         | 
         | An object store lacks an index which your typical FS will
         | provide with a relatively high degree of efficiency. FS's can
          | be scaled to arbitrary write velocity given an
         | appropriately distributed block storage solution ( which will
         | provide the k/v API of an object store that you're looking for
         | ). Distributed FS's are conveniently compatible with most POSIX
         | operations rather than requiring bespoke integration. Most
         | object stores are optimized for largish objects and lack the
         | ability to condense records into an individual write (via the
         | block API) or pre-emptively pre-fetch the most likely next set
         | of requested blocks.
         | 
          | In GitHub's case, the choice of diverging from Git CLI/FS-
          | based storage APIs could lead to long-term support issues and
         | an implicit "github" flavor of git rather than improving the
         | core git toolchain.
         | 
         | Object Stores are great, but if you need some form of index
         | they get slow and painful really fast.
        
           | ddorian43 wrote:
           | You should be able to split the object store into 2 systems:
            | a metadata service (think rdbms/nosql/etc) and a blob-data
            | service keeping large files, think 10KB+. Both systems should
            | be able to be more efficient than the current method.
           | 
           | Example: you can add erasure coding to the blob-data service
           | for better efficiency. You can add fancy indexing to your
           | metadata store. etc etc.
           | 
           | But somebody has to create it, that's the issue.
        
             | lumost wrote:
             | That's exactly how distributed filesystems are built.
             | 
             | Systems such as HDFS use the NameNode for this task, but
              | depending on the exact characteristics of the filesystem a
             | multi-master setup is often used. I know of at least one
             | NFS implementation which uses postgres as its metadata
             | layer.
        
         | Denvercoder9 wrote:
         | > This should make pushes much faster as you have basically
         | infinitely scalable writes. However it does make pulls more
         | difficult.
         | 
         | I bet GitHub has much more read traffic than write traffic, so
         | this trade-off does not make sense.
        
           | random5634 wrote:
            | Seriously, imagine the compute and request costs to assemble
           | a large git pull.
        
             | cordite wrote:
             | Sounds a lot like this here
             | https://github.com/Homebrew/brew/pull/9383
        
           | kevincox wrote:
            | I said "difficult", not expensive. Once you've assembled the
           | packfiles (much like they do today) it should be roughly the
           | same cost.
        
           | WorldMaker wrote:
           | Yes and I would also imagine that the trade-offs between
           | writing a proprietary object storage and reusing the battle-
           | tested object storage that everyone else uses would have been
           | considered as well.
           | 
           | It seems like the sort of thing that would be an interesting
           | open source research topic if you could build an object
           | database for git that performs better than its packed in
           | filesystem object store. But it's probably not something you
           | want to do as a proprietary project with fewer eyeballs on
           | its performance trade-offs and more engineering work every
            | time git slightly changes its object storage behavior, which
            | would remain tuned for the filesystem object store because
            | upstream was entirely unaware of your efforts.
        
         | oconnor663 wrote:
         | I think the GitHub folks have written more than one article
         | about this. I'm not sure I can find the one I'm thinking of,
         | but here's another one:
         | https://github.blog/2016-04-05-introducing-dgit/
         | 
         | > Perhaps it's surprising that GitHub's repository-storage
         | tier, DGit, is built using the same technologies. Why not a
         | SAN? A distributed file system? Some other magical cloud
         | technology that abstracts away the problem of storing bits
         | durably? The answer is simple: it's fast and it's robust.
        
         | parhamn wrote:
         | Have you ever tried it? It's not remotely performant and
         | wouldn't make sense since GH is read heavy. Plus I'm sure they
         | spend a lot of time thinking about this stuff, no?
         | 
         | If you want to get your feet wet, check out go-git[1]. It's a
         | native golang implementation of git. They have a storage layer
         | abstracted over a lean interface that you quickly create
         | alternative drivers for in golang. You'll be effectively
         | implementing poorly sharded file system on a database, then it
         | becomes obvious why scaling the FS is just easier.
         | 
         | [1] https://github.com/go-git/go-git/tree/master/storage
        
           | brown9-2 wrote:
           | > Plus I'm sure they spend a lot of time thinking about this
           | stuff, no?
           | 
           | I think this is unfair - the author was not insinuating that
           | the people who designed this system at Github are stupid in
           | some way, but just asking if other architectures have been
           | considered.
        
             | parhamn wrote:
             | > ...Github are stupid in some way, but just asking if
             | other architectures have been considered.
             | 
             | To me, asking an engineering org if they've considered
             | alternative architectures for their main engineering
             | problem is silly at best, overconfident at worst.
        
               | ben0x539 wrote:
               | I think the main point of the comment was asking _why_
               | they decided against it. At one point the wording mildly
               | suggests the possibility that no one at github has
               | thought about it:
               | 
               | > I would love to know if something like this has been
                | considered and why they decided against it.
               | 
               | ... but that still sounds more like a grammatical hedge
               | than an actual suggestion that github didn't think it
               | through.
               | 
               | imo it's fair to lay out why you're surprised about some
               | decision in the hopes that someone will enlighten you,
               | even if it can be tricky to phrase that without coming
               | off like a "why didn't you just..." comment.
        
             | tkiolp4 wrote:
             | > but just asking if other architectures have been
             | considered.
             | 
              | That's the polite way of calling them stupid ;)
        
               | JeremyBanks wrote:
               | They're just expressing curiosity. Jeeze.
        
               | ben0x539 wrote:
               | The problem with recognizing that a lot of phrasings are
               | just the polite way to call someone stupid/tell someone
               | to fuck off/etc is that you start seeing assholes
               | whenever someone is just trying to be polite. :(
        
         | mvzvm wrote:
         | This kind of comment is why every single project needs to be
         | justified with "What problems are you solving?" and "What
         | usecase are you supporting?". Because I could 150% imagine
         | somebody getting excited about this and then:
         | 
         | 1) Framing it as such with poor justification "a lot easier to
         | manage"
         | 
         | 2) "This would be a huge rewrite of some internals" Becoming a
         | multi-year migration quagmire
         | 
         | 3) The dawning realization that you have used a write-heavy
         | architecture in a read-heavy system
        
         | Ericson2314 wrote:
         | I always had the impression GitHub was not preemptively
         | investing in the fundamentals like that. So yeah, agree it's
          | a bummer, but I'm also not surprised.
         | 
         | And hey, at least that means a post GitHub FOSS world won't be
         | leaving fundamental improvements behind!
        
       | lamontcg wrote:
        | In addition to everything else in this thread, it'd be nice to
        | see better support for monorepos in the GitHub UI as well.
       | 
       | Something like the ability to have
       | github.com/<org>/<repo>/<subproject>/issues be a shard of all the
       | issues for a subproject.
       | 
        | You can do that with tagging, but that's a bit of a PITA because
        | the tagging UI is fairly bad and doesn't scale.
        
       | masklinn wrote:
       | > Improving repository maintenance
       | 
       | There's one thing I'd really like to see there: the ability to
       | lock out the repository and perform a _really_ aggressive repack.
        | I'm talking `-AdF --window=500` or somesuch. On $dayjob's
       | repository, the base checkout is several gigs. Aggressively
       | repacking it reduces its size by 60%.
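        | 
        | (Spelled out, with window/depth values that are just where I'd
        | start:
        | 
        |     git repack -A -d -F --window=500 --depth=50
        | 
        | -F recomputes every delta from scratch, which is what makes it
        | slow and also where the size win comes from.)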
       | 
       | There's also a git-level thing which would greatly benefit large
       | repositories: for packs to be easier to craft and more reliably
       | kept, so it's easier to e.g. segregate assets into packs and not
       | bother compressing that, or segregate l10n files separately from
       | the code and run a more expensive compression scheme on _that_.
        
         | tasuki wrote:
         | > On $dayjob's repository, the base checkout is several gigs.
         | 
         | Why is it several gigs? Is that really necessary?
        
           | chrisseaton wrote:
           | > Why is it several gigs? Is that really necessary?
           | 
           | A lot of code written by a lot of engineers over a lot of
           | years.
           | 
           | I'm not sure what other answer you're expecting?
           | 
            | I work on a compiler that has had a team of tens working on
            | it for a decade or so, and even that's 5 GB. No binary
            | assets. I really don't think it's that unusual.
        
       | wikibob wrote:
       | When is GitHub going to finally add support for Microsoft's
       | VFSforGit?
       | 
       | https://github.com/microsoft/VFSForGit
       | 
       | https://vfsforgit.org/
        
         | hyperrail wrote:
         | I'm not sure that will ever happen [1] as Microsoft itself is
         | limiting active development of VFS for Git in favor of Scalar
         | [2] by the same team, which aims to improve client-side big
         | repo performance _without_ having to use OS-level file system
         | virtualization hooks.
         | 
         | I don't believe VFS for Git will ever be abandoned by
         | Microsoft, but I'm doubtful it will ever get any more major
         | improvements from them.
         | 
         | Scalar does use the VFS for Git client-server protocol, and
         | both Scalar and VFS for Git rely on the same improvements to
         | the git app itself, so I could imagine that GitHub would adopt
         | the GVFS protocol and support Scalar without formally
         | supporting GVFS itself.
         | 
         | [1] GitHub did announce future GVFS support in 2017 -
         | https://venturebeat.com/2017/11/15/github-adopts-microsofts-...
         | - but if anything came out of that I don't see it in GitHub
         | help today.
         | 
         | [2] https://github.com/microsoft/scalar
        
           | vitorgrs wrote:
            | You know Microsoft runs the Windows repo with VFS, right?
        
             | hyperrail wrote:
              | Yes, I do know the Windows OS git repo uses GVFS. In fact,
             | I shared my personal experience with git in the os repo
             | some time ago:
             | https://news.ycombinator.com/item?id=20748778
             | 
             | When I left Microsoft about half a year ago, GVFS and
             | Scalar were both in heavy use there.
        
           | hyperrail wrote:
           | I should clarify that Scalar does not _require_ a VFS for Git
           | server to work correctly, even though it can get significant
           | benefits if a VFS server is available. This means you can use
           | Scalar today with GitHub, but not VFS.
           | 
           | Scalar also supports Windows and macOS, while VFS only
           | supports Windows: https://github.com/microsoft/VFSForGit/blob
           | /v1.0.21014.1/doc...
        
         | vtbassmatt wrote:
         | Hey, I'm the product manager for Git Systems at GitHub. Can you
         | share more about how you'd use VFS for Git / GVFS protocol if
         | we had it on GitHub?
         | 
         | Right now we don't plan on supporting it; most of our work is
         | focused on upstreamable changes and opinionated defaults. But
         | that could change if we're missing some important use cases.
         | 
         | Feel free to email me - my HN alias @github.com - if you prefer
         | to discuss privately.
        
       | jmull wrote:
       | I'm curious what counts as a "large" monorepo?
        
         | bob1029 wrote:
         | This is a very subjective evaluation. You could look at # of
         | files versioned, total bytes of the repository on disk, # of
         | logical business apps contained within, total # of commits,
         | etc.
         | 
          | For me, it's any repository where I would think "damnit, I'm
          | going to have to do a fresh clone" if the situation comes up.
          | There isn't a hard line in the sand, but there is certainly some
         | abstract sensation of "largeness" around git repos when things
         | start to slow down a bit.
        
       | crecker wrote:
        | I'd bet whatever you want that they did this improvement for the
        | microsoft/windows repo.
        
         | noahl wrote:
         | Microsoft/windows is hosted on Azure DevOps, and they have also
         | blogged about what they've done to improve its performance!
         | 
         | Here's a recent post:
         | https://devblogs.microsoft.com/devops/introducing-scalar/
        
           | WorldMaker wrote:
           | Rumors are Azure DevOps and GitHub are converging "soon", and
           | maybe "Project Cyclops" wasn't specifically to improve
           | Microsoft/Windows repo performance, but it seems reasonable
           | given the convergence rumors it could be a step in the
           | direction of preparing for/migrating the repo to GitHub. Of
           | course Microsoft doesn't want to panic Enterprise developers
           | on Azure DevOps just yet so they are extremely quiet right
           | now about any convergence efforts, so I take the rumors with
           | a grain of salt. It is something that I wish Microsoft would
           | properly announce sooner rather than later as it might
           | provide momentum towards GitHub in capital-E Enterprise
            | development world (even if it will panic those that are still
           | afraid of GitHub for whatever reasons).
        
       | endisneigh wrote:
       | kind of an aside, but what's the best practice for pushing and
       | building separate projects in a monorepo?
       | 
        | say you have a structure like:
        | 
        |     projectA
        |     projectB
        |     sharedUtils
       | 
        | Each push to master currently builds both projectA and
        | projectB, even if only one of them changed. Ideally you could
        | use Git to see if anything in projectA or sharedUtils changed
        | to trigger projectA's build, and likewise for projectB, but
        | I'm curious what others are doing.
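        | 
        | The low-tech approach I've seen is a guard in CI along these
        | lines (a sketch: assumes linear history on master, and build.sh
        | stands in for the real build step):
        | 
        |     # rebuild projectA only when it or sharedUtils changed
        |     if ! git diff --quiet HEAD~1 HEAD -- projectA sharedUtils; then
        |         ./build.sh projectA
        |     fi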
        
         | numbsafari wrote:
         | Perhaps check out a tool like please[1]. There are other tools
         | in this space, but that one has worked well for me without the
         | complexity of some other, similar tools.
         | 
         | [1] https://please.build
        
           | oftenwrong wrote:
           | I can't speak for using it in a massive monorepo, but I
           | started using https://please.build for some of my personal
           | projects recently just as an alternative to the dominant Java
           | build systems (Ant/Maven/Gradle). It's far more
           | straightforward to use, and incremental builds actually work
           | reliably.
        
         | zdw wrote:
         | Monorepos require much more care to be put into the
         | integration/CI side of the process.
         | 
         | This is worth a read: https://yosefk.com/blog/dont-ask-if-a-
         | monorepo-is-good-for-y...
        
         | alfalfasprout wrote:
         | As a few others have mentioned this is something that build
         | systems handle since they understand the dependency graph. For
         | example, Bazel is often used to this end.
         | 
         | However... I would _strongly_ advise not going for a monorepo.
         | No, I don 't mean something like tensorflow where you have a
         | bunch of related tools and projects in a single repo. I mean
         | one repo for the entire org where totally unrelated projects
         | live.
         | 
         | Every company I've been at that used a monorepo found
         | themselves struggling to make it work since you need a ton of
         | full time engineers just to keep things working and scaling.
         | Many of the problems that monorepos try to solve (simplifying
         | dependency and version management) are traded for 10x as many
          | problems, many of them hard (incremental builds,
         | dependency resolution).
         | 
         | Google has a huge team in charge of helping their monorepo
         | scale and work efficiently. You are not google... don't be
         | tempted.
        
           | jschwartzi wrote:
            | Well, sure, if you have a pile of totally unrelated things
           | that never need to change in lock-step, then you don't need a
           | "monorepo." But on the other hand if you're building an
           | entire software system such as a collection of API services,
           | a database schema, embedded device firmware, and a website,
           | and all of these things are interdependent and incompatible
           | across versions then please for the love of god use a
           | monorepo.
           | 
           | At my job our cloud team uses multiple separate repositories
           | which makes sense, but it also moves the burden of versioning
           | to run-time. This is because they have to interface with
           | multiple different versions of the device firmware. So they
           | deploy different run-time versions of the APIs to support
           | legacy and current production firmware versions. But our
           | firmware repository is a monorepo in that the sources and
           | build system builds the artifacts for multiple devices from
           | the same source tree.
           | 
           | So it's not so cut and dried as "never use a monorepo" or
           | "always use a monorepo." It involves engineering tradeoffs
           | and decisions that are made in a context, and you can't
           | extract your advice from the context in which it exists. What
           | works for our cloud team would be a terrible mess on the
           | embedded side simply because of how the software is deployed
           | and managed.
        
           | jayd16 wrote:
           | "You are not google" is also an argument for why you don't
           | have to worry about scaling a monorepo.
        
           | benreesman wrote:
           | I'll try to tread at least a little lightly here because this
           | topic does tend to be a bit flammable, but caveat emptor.
           | 
           | My contrasting anecdotal experience is that whether at BigCo
           | or on a small team monorepo is almost always the right answer
           | until your requirements get exotic enough that you're in
           | special-case land anyways (like a separate repo for machine-
           | initiated commits, or something that's security-sensitive
           | enough to wall off some contributors).
           | 
            | Both `git` and `hg` scale easily to really big projects if
           | you're storing text in them (at FB our C++ code was in a
           | `git` monorepo on stock software until like 2014 or something
           | before it started bogging down, I'll gloss over the numbers
           | but, big): the monorepo-scaling argument is brought out a lot
           | but rarely quantified.
           | 
           | The multi-repo problem that gets you is dependency
           | management, which in the general case requires a SAT-solver
           | (https://research.swtch.com/version-sat), but of course you
           | don't have a SAT-solver in your build script for your small-
           | to-medium organization, so you get some half-assed thing like
           | what `pip` and `npm` do.
           | 
           | Again purely anecdotal, but in my personal experience multi-
           | repo too often gets pushed by folks who want to make their
            | own rules for part of the codebase ("the braces go _here_"),
           | push an agenda around unnecessary RPCs, or both. That's not
           | true of all cases of course, but it's a common enough
           | antipattern to be memorable.
        
           | adsfoiu1 wrote:
           | I personally have seen the opposite problem - the friction of
           | making small changes to "utility" libraries becomes a huge
           | pain point for developers when you have to make changes, test
           | locally, push to package manager, update all consumers to use
           | the new version... It's much easier, in my experience, to
           | just consume a class that's already in the same project /
           | repo.
        
             | agency wrote:
             | I have also experienced this pain where a company I worked
              | for went too hard on splitting everything into separate
             | repos, such that updating something deep in the dependency
             | tree becomes very painful and involves a protracted
             | "version bump dance" on dependent repos. There's no silver
             | bullet here.
        
           | TechBro8615 wrote:
           | > Google has a huge team in charge of helping their monorepo
           | scale and work efficiently. You are not google... don't be
           | tempted.
           | 
           | It's funny, I've heard this exact same argument for why you
           | should not use micro services.
        
           | brown9-2 wrote:
           | The other half of needing to use a build system that
           | understands the dependency graph like Bazel is that Bazel
           | _keeps state_, so that it knows which part of the graph does
           | not need to be re-built when you push commit B because it was
           | already built in commit A.
        
         | simias wrote:
         | If separate projects have independent builds maybe a monorepo
         | was not a great idea to begin with?
         | 
         | I have a big monorepo at work but whenever anything changes I
         | want to rebuild everything to generate a new firmware image. I
         | have ccache setup to speedup the process given that obviously
         | only a tiny fraction of the code actually needs to be rebuilt.
         | 
         | It's a bit wasteful, sure, but if I were to optimize it I'd be
         | worried about ending up with buggy, non-reproducible builds.
         | Easier to just recompile everything every time and make sure
         | everything still works the way you expect.
         | 
         | So basically my approach is KISS, even if it means longer build
         | times.
        
         | jrockway wrote:
         | That's what build systems aim to do, and there are many of
         | them. In general, I've found all the tooling required around
         | monorepos to be a job for a full-time team. Shortcuts (as
         | suggested in other replies) or full builds on every commit tend
         | to stop scaling relatively quickly. If you take shortcuts, you
         | will find that it becomes "tribal knowledge" to do a full build
         | every time you edit a single line of code, and people who were
         | once making multiple changes a day start making one change a
         | week, or they start committing code without ever having run it.
         | (It happened to me on a 4 person team. We had so many things
         | that needed to be running to test your app, that people just
         | started committing and pushing to production without ever
         | having run their changes locally! That is the kind of thing
         | that happens if you stop caring about tooling, and it happens
         | fast. I addressed it by taking a couple of days to start up the
         | exact environment locally in a few seconds, without a docker
         | build, and people started running their code again.) Be very,
         | very careful.
         | 
         | If you do a full build on every commit, it gets slow much
         | sooner than you'd expect, and people are going to do less work
         | while they context switch to posting to HN while waiting for
         | their 15 minute build for a 1 line code change.
         | 
         | I worked at Google and we had a monorepo, and there were
         | hundreds if not thousands of engineers working on build speed
         | and developer productivity, and it was still significantly
         | slower to "bazel run my-go-binary" versus "go run cmd/my-go-
         | binary". In many cases, it was worth it, but in very isolated
         | applications, it was definitely not worth it. (And people did
         | work around it, by just setting up Git somewhere and using
         | Makefiles or whatever, and that ended up being even worse. But
         | it gets worse incrementally over time, and you're kind of the
         | frog getting boiled alive.)
         | 
          | Where I'm going with this is to advise you to be very careful.
          | The
         | tools to support real productivity in a monorepo are expensive
         | in terms of your org's time. If you can get by with a repo per
         | app and a common modules repo, and just update the app to refer
         | to a version of the modules repo as though it's some random
         | open source project you depend on, you're going to get much
         | farther with much less tooling work than you would with a
         | monorepo. But, the modules repo is going to break apps without
         | knowing, and that's going to be a pain. Monorepos do exist for
         | a good reason.
         | 
         | (The other thing I like about monorepos is that you do less
         | per-project setup work. Want to make some new app? You can just
         | start writing it, and you get the build, deploy, framework,
         | etc. for free. It can be very productive if you're finding
         | yourself starting new projects regularly. In my spare time, I
          | write software, and I really regret splitting it up into
         | multiple projects. But, it's kind of necessary for open source
         | stuff -- people don't want to download ekglue if they want to
         | just run jlog. So I split them, but it costs me my valuable
         | free time to do something I've already done ;)
         | 
         | My TL;DR is that you will be tempted to take shortcuts and the
         | shortcuts will suck. If your project has the resources to have
         | someone set up Bazel, distribute the right version of Bazel and
          | the JRE to developer workstations, set up CI that is aware of
         | Bazel artifact caching, and SREs to be around 24/7 to support
         | your now-custom build environment, you will have a good
         | experience. Be aware that a monorepo is that level of
         | investment.
         | 
         | Meanwhile, if you just have a frontend and a backend in the
         | same repo, you can probably get away with a full build for
         | every commit. And you don't need that shadow team of tooling
         | engineers to make it work, you just need a docker build, and a
         | script that runs "go test ./... && npm test" or whatever ;)
        
         | dgellow wrote:
         | IIRC you can specify filters "paths" and "paths-ignore" when
         | you define a github action that should only be triggered when a
         | subdirectory changes.
         | 
         | See this documentation page:
         | https://docs.github.com/en/actions/reference/workflow-syntax...
         | 
          | Their example is:
          | 
          |     on:
          |       push:
          |         paths:
          |           - '**.js'
         | 
         | but I believe you can also specify the subdirectory you care
         | about.
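          | 
          | For subdirectories that would presumably look something like
          | this (using projectA/sharedUtils from the comment above):
          | 
          |     on:
          |       push:
          |         paths:
          |           - 'projectA/**'
          |           - 'sharedUtils/**'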
        
       | kroolik wrote:
       | > We made a change to compute these replica checksums prior to
       | taking the lock. By precomputing the checksums, we've been able
       | to reduce the lock to under 1 second, allowing more write
        | operations to succeed immediately.
       | 
        | Isn't this change introducing a race condition? One of the
        | replicas' checksums could change between when the checksum is
        | computed and when the lock is taken. Otherwise, there would be
        | no need for the lock at all.
        
         | jacoblambda wrote:
         | You can compute the checksums outside of the lock. You just
         | need to compare them inside the lock.
         | 
          | The key thing here is that, prior to the lock, if data changes
          | you recompute the checksums. As long as any change outside the
         | lock triggers a recompute of the corresponding checksums and no
         | changes can occur during the lock, there is no race condition.
         | 
          | I imagine this may result in data getting de-synced/failing
          | the checksum comparisons more often; however, it's still a net
          | performance increase as long as the aggregate time spent
          | re-syncing the data is less than the extra time previously
          | spent waiting for checksums inside the lock.
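          | 
          | In shell-flavored pseudocode, the shape of the pattern (my
          | sketch, not GitHub's actual code):
          | 
          |     # after every write, outside any lock (the slow part):
          |     sha256sum objects.pack > objects.pack.sum
          | 
          |     # critical section: only compare the precomputed sums
          |     exec 9> /var/lock/repo.lock
          |     flock 9
          |     cmp -s replicaA/objects.pack.sum replicaB/objects.pack.sum \
          |         || echo "replica drifted; resync and retry" >&2
          |     flock -u 9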
        
         | alexhutcheson wrote:
         | No, it's just switching from a pessimistic locking approach to
         | an optimistic one:
         | https://en.wikipedia.org/wiki/Optimistic_concurrency_control
        
       ___________________________________________________________________
       (page generated 2021-03-16 23:00 UTC)