[HN Gopher] Mozilla sccache: ccache with cloud storage
       ___________________________________________________________________
        
       Mozilla sccache: ccache with cloud storage
        
       Author : thunderbong
       Score  : 151 points
       Date   : 2023-12-22 10:02 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Sytten wrote:
       | Can you combine that with cross rs? I positively hate the
       | official GHA cache action so anything to replace that would be
       | nice. But we cross compile with cross rs.
        
         | Scarjit wrote:
         | It should work if you modify the cross docker image, so that it
         | uses the sccache executable as wrapper
        
         | sgammon wrote:
         | Another alternative is Buildless with the Actions setup step.
         | This sets up a remote and local endpoint for sccache inside
         | actions, with a connection up to remote caching by Dragonfly
         | 
         | https://github.com/buildless/setup
        
       | fifteen1506 wrote:
       | IDK what happened at Mozilla but keep going!
        
         | CaptainOfCoit wrote:
         | Worth noting that the first commit in sccache git repository
         | was in 2014
         | (https://github.com/mozilla/sccache/commit/115016e0a83b290dc2...).
         | So I suppose that what "happened" happened waay back.
         | 
         | Then in 2016 it seems like sccache was re-implemented in Rust
         | (https://github.com/mozilla/sccache/commit/3da89195ce91a576cc...),
         | from the initial Python implementation.
        
           | altairprime wrote:
           | Also around then, taskcluster happened.
           | https://github.com/taskcluster/taskcluster/commit/54ffef79db...
        
       | jandeboevrie wrote:
       | Regular ccache supports remote storage as well:
       | https://ccache.dev/manual/latest.html#_remote_storage_backen...
        
         | CaptainOfCoit wrote:
         | I guess the difference is that sccache supports cloud storage
         | (S3, R2, Google Cloud Storage, et al.) out of the box while
         | ccache doesn't, as far as I can tell.
        
           | slavik81 wrote:
           | The sccache description was written before remote storage was
           | added to ccache. Remote storage is a relatively new feature
           | for ccache, and it didn't exist when sccache was created.
        
           | kinkaid wrote:
           | The ability to use the local and remote caches in tandem is
           | the most important feature to me. sccache will use one or the
           | other exclusively, so successive builds will have to pay the
           | latency and bandwidth costs every time an artifact is built
           | rather than just unpacking from the local cache. ccache
           | stages remote artifacts locally by default, so it only pays
           | the network costs once per artifact. In CI builds they are
           | more or less the same, but the local build experience for
           | ccache is much nicer imo.
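A minimal sketch of the "local and remote in tandem" lookup described above. It is not taken from ccache or sccache; the class name `TieredCache` and the use of plain dicts as stand-ins for a disk directory and a network backend are assumptions made for illustration.

```python
from typing import Dict, Optional


class TieredCache:
    """Check a local store first, then a remote one, staging remote hits locally."""

    def __init__(self, local: Dict[str, bytes], remote: Dict[str, bytes]) -> None:
        self.local = local        # stand-in for an on-disk cache directory
        self.remote = remote      # stand-in for S3, Redis, HTTP, ...
        self.remote_fetches = 0   # counts how often the "network" was used

    def get(self, key: str) -> Optional[bytes]:
        if key in self.local:
            return self.local[key]
        if key in self.remote:
            self.remote_fetches += 1
            blob = self.remote[key]
            self.local[key] = blob  # stage locally so the network cost is paid once
            return blob
        return None

    def put(self, key: str, blob: bytes) -> None:
        self.local[key] = blob
        self.remote[key] = blob


cache = TieredCache(local={}, remote={"abc123": b"object code"})
first = cache.get("abc123")   # fetched from the remote tier, then staged locally
second = cache.get("abc123")  # served from the local tier
```

With a single-tier remote cache, the second lookup would go over the network again; here `remote_fetches` stays at 1.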
        
         | sgammon wrote:
         | both are supported by buildless as well https://less.build
        
         | sgammon wrote:
         | (ccache is only about caching, not distributing builds,
         | also...)
        
       | wongarsu wrote:
       | sccache is also delightfully simple to set up if you just want
       | local storage. It's my go-to solution for sharing build artifacts
       | between rust projects
        
       | IshKebab wrote:
       | In my experience the time this saves is generally outweighed by
       | the effort of setting it up and the many hours lost when it goes
       | wrong and you don't think to try a clean build. It's certainly
       | better than ccache in that regard but you really need something
       | like Bazel if you're going to be aggressively caching C++ builds
       | like this (or impure Rust builds).
       | 
       | (And if you're using Bazel or one of its brethren then they
       | generally have native remote caching and execution support.)
        
         | goku12 wrote:
         | Setting it up took hardly 5 minutes for me (for Rust). And it
         | hasn't caused any issues so far. Forcing a fresh build of
         | artifacts is also very easy.
         | 
         | Meanwhile, the only legitimate problem you mentioned is if it
         | causes a build error and we don't immediately consider it as
         | the source of the issue. But I use the check command so often
         | that it is easy to suspect the cache if check succeeds and
         | build fails.
        
           | IshKebab wrote:
           | When these things go wrong it often doesn't cause a
           | compilation error; it can just cause inexplicable runtime
           | behaviour.
           | 
           | You may be lucky and be building a pure or mostly pure Rust
           | program, in which case it works pretty well. Throw in some
           | C/C++ and it starts to degrade (though it's still better than
           | with an actual C/C++ program because you aren't actually
           | editing the C/C++ code generally).
           | 
           | And are you actually using remote caching/compilation?
           | Because there's no way you can set that up in 5 minutes.
        
             | goku12 wrote:
             | > And are you actually using remote caching/compilation?
             | Because there's no way you can set that up in 5 minutes.
             | 
             | Who said anything about remote caching? You don't need it
             | to benefit from it. It's useful if you build a lot of Rust
             | code. A lot of packages turn up repeatedly as dependencies
             | among several projects.
             | 
             | > You may be lucky and be building a pure or mostly pure
             | Rust program, in which case it works pretty well. Throw in
             | some C/C++ and it starts to degrade
             | 
             | You're making assumptions again. What is the issue with
             | pure Rust code? Rust isn't like Python needing C or C++
             | support for process-intensive parts. And much of the C/C++
             | dependencies are dynamically-linked, with Rust wrappers. I
             | haven't seen many projects that require Rust and C/C++ code
             | to be built together and statically linked.
             | 
             | Besides, I haven't heard anyone complain about sccache that
             | much. How prevalent is the degradation anyway?
        
             | sgammon wrote:
             | After using sccache and Gradle with Buildless for months,
             | years, I have literally never seen these tools mix up or use
             | the wrong binary objects.
             | 
             | Knowing the internals of some of them, I've found that
             | build cache clients are way more likely to miss with a
             | cache key misalignment than they are to mix up two objects.
             | I'm sure it's possible, I've just never seen it happen in
             | the wild, after extensive usage.
             | 
             | Generally speaking these tools are very conservative about
             | two inputs matching: an identical file at a different path
             | will cause a cache key change in sccache.
        
           | saghm wrote:
           | > Setting it up took hardly 5 minutes for me (for Rust)
           | 
           | You're underselling it, honestly. For those who haven't
           | looked into it, if you want to enable it for all Rust
           | projects on your system, all you need to do is install the
           | binary and then add this to ~/.cargo/config, and then it
           | will be enabled whenever you invoke `cargo` (or even rustc
           | directly, iirc) as the user whose config file you modified:
           | 
           |     [build]
           |     rustc-wrapper = "/path/to/sccache"
           | 
           | I'm sure there are people with legitimate reasons for not
           | wanting it enabled implicitly or who have multiple users they
           | might have to set this up for, but for 90% of people it won't
           | take any more time than it took to read this comment.
        
       | sgammon wrote:
       | If you're interested in a drop in remote cache for sccache, check
       | out Buildless
       | 
       | We just released S3 and Redis support. https://less.build
       | 
       | Buildless also supports Gradle, Maven, Bazel, CCache and Turbo
        
         | sgammon wrote:
         | Our beta is open, just shoot me an email at sam@less.build if
         | you want to try it out!
        
           | phamilton wrote:
           | I sent an email and it was blocked: "Recipient address
           | rejected: Access denied."
        
         | xjia wrote:
         | What is the benefit of using a remote cache instead of a local
         | ~/.cache directory? Is it only for sharing build results among
         | team members? How do you make sure the build results are not
         | spoofed?
        
           | sgammon wrote:
           | Sharing with team members, sharing with CI, and the ability
           | to pull from more than just what's on your machine (i.e. a
           | larger addressable cache than you are willing to keep on
           | disk). Cache objects also compound across projects, so it's
           | nice to ship them up somewhere and have them nearby when you
           | need them.
           | 
           | Re: spoofing, obviously it's all protected with API keys and
           | tokens, and we're working on mechanisms to perform end-to-end
           | encryption. In general, build cache objects are usually
           | addressed by a content-addressable-hash, so that also helps
           | because your build typically knows the content it's looking
           | for and can verify.
           | 
           | That isn't true for all tools, though, so we're working to
           | understand where the gaps are and fix them.
        
             | sgammon wrote:
             | (Fwiw, group conversation encryption tech like MLS is
             | somewhat applicable, and that's the sort of pattern we're
             | looking at, but it would be cool to know if that's moving
             | to you on the problem of safety w.r.t. builds.)
        
             | Thorrez wrote:
             | >In general, build cache objects are usually addressed by a
             | content-addressable-hash
             | 
             | How does that work? I would think the simplest case of a
             | build object that needs to be cached is a .o file created
             | from a .c file. The compiler sees the .c file and can
             | determine its hash, but how can the compiler determine the
             | hash of the .o file to know what to look up in the cache? I
             | think the compiler would need to perform the lookup using
             | the hash of the .c file, which isn't a hash of the data in
             | the cache.
        
               | sgammon wrote:
               | In Bazel's case and other cases, build cache objects are
               | held in CAS and then referenced from other keys. I
               | believe BuildXL from Microsoft also works this way.
               | 
               | Of course one other advantage to build caches is they are
               | verifiable: the intent is to produce the exact same
               | output as a normal call, and that's easily checked on the
               | client side.
               | 
               | No question that build caching poses inherent supply
               | chain risks though and that's part of what we want to
               | solve. I think people are hesitant to trust build caching
               | for good reason until there are safer mechanisms and
               | better cryptographic patterns applied.
        
               | krupan wrote:
               | When a .o is stored in the cache it is associated with
               | the hash of the .c file
        
               | aseipp wrote:
               | In the case of the Remote Execution/Cache API used by
               | Bazel among others[1] at least, it's a bit more detailed.
               | There's an "ActionCache" and an actual content-addressed
               | cache that just stores blobs
               | ("ContentAddressableStorage"). When you run a `gcc -O2
               | foo.c -o foo.o` command (locally or remotely; doesn't
               | matter), you upload an "Action" into the action cache,
               | which basically says "This command was run. As a result
               | it had this stderr, stdout, error code, and these input
               | files read and output files written." The input and
               | output files are referenced by the hash of their
               | contents, in this case, and they get uploaded into the
               | CAS system.
               | 
               | Most importantly you can look up an action in the
               | ActionCache without actually running it, provided you
               | have the inputs at hand. So now when another person comes
               | by and runs the same build command, they say "Has this
               | Action, with these inputs, been run before?" and the
               | server can say "Yes, and the output is a file identified
               | by hash XYZ" where XYZ is the hash of foo.o, so you can
               | just instantly download it from the CAS.
               | 
               | So there are a few more moving parts to make it all work.
               | But the system really is ultimately content-addressed,
               | for the most part.
               | 
               | [1] https://github.com/bazelbuild/remote-apis/blob/main/build/ba...
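The two-store design described above can be sketched roughly as follows. This is an illustration only: the real Remote Execution API is defined in protobuf and carries more metadata, and the names `Cas`, `ActionResult`, and `action_key` are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content hash used as the address of a blob."""
    return hashlib.sha256(data).hexdigest()


class Cas:
    """Content-addressed store: the key is always the hash of the value."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = digest(data)
        self.blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self.blobs[key]
        if digest(data) != key:  # a client can verify what it downloaded
            raise ValueError("blob does not match its digest")
        return data


@dataclass
class ActionResult:
    exit_code: int
    stdout: str
    stderr: str
    outputs: Dict[str, str]  # output file name -> digest of its contents


def action_key(command: List[str], inputs: Dict[str, str]) -> str:
    """Key derived only from the command line and the input digests."""
    payload = json.dumps({"command": command, "inputs": inputs}, sort_keys=True)
    return digest(payload.encode())


cas = Cas()
action_cache: Dict[str, ActionResult] = {}

command = ["gcc", "-O2", "foo.c", "-o", "foo.o"]
inputs = {"foo.c": cas.put(b"int answer(void) { return 42; }\n")}

# Builder 1 runs the compiler (faked here) and records what happened.
object_digest = cas.put(b"fake object code")
action_cache[action_key(command, inputs)] = ActionResult(
    exit_code=0, stdout="", stderr="", outputs={"foo.o": object_digest})

# Builder 2 has the same command and inputs, so it computes the same key
# and can fetch foo.o by content hash without compiling.
hit = action_cache.get(action_key(command, inputs))
foo_o = cas.get(hit.outputs["foo.o"]) if hit else None
```

The action key is not a hash of the cached output, which is the point raised in the question; the output blob itself is still content-addressed, so the downloaded bytes can be checked against the digest recorded in the action result.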
        
               | sgammon wrote:
               | Yep, aseipp, and we support the full gRPC interface for
               | remote caching offered by Bazel, including the newer
               | APIs.
               | 
               | Explained better than I could for sure. I find it very
               | interesting how BuildXL and Bazel ended up at similar
               | models for this problem. I don't yet know the history of
               | which informed which.
               | 
               | (As compared to, say, Gradle, which works based on input
               | hashes instead.)
        
             | xjia wrote:
             | IIUC the actual computation (e.g. compiling, linking, ...)
             | happens on client (CI or developer) machines and the
             | results are written to the server-side cache.
             | 
             | By spoofing I meant to say that an authenticated but
             | malicious client (intentionally or not, e.g. a clueless
             | intern) may be able to write malicious contents to the
             | cache. For example, their build toolchain could be
             | contaminated and the resulting build outputs are
             | contaminated. The "action" per se and its hash is still
             | legit, but the hash is only used as the lookup key -- their
             | corresponding value is "spoofed."
             | 
             | The only safe way I can imagine to use such a remote cache
             | is for CI to publish its build results so that they could
             | be reused by developers. The direction from developers to
             | developers or even to CI seems difficult to handle and has
             | less value. But I might be missing some important insights
             | here so my conclusion could be wrong.
             | 
             | But if that's the case, is the most valuable use case to
             | just configure the CI to read from / write to the remote
             | cache, and developers to only read from the remote cache?
             | And given such an assumption, is it much easier to
             | design/implement a remote cache product?
        
               | sgammon wrote:
               | All great points but in practice, tools like Bazel and
               | sccache are incredibly conservative about hashes
               | matching, to include file path on disk and even env var
               | state.
               | 
               | One goal of these tools is to guarantee that such
               | misconfiguration results in a cache key mismatch, rather
               | than a hit and a bug.
               | 
               | There are tons of challenges designing a remote build
               | cache product, like anything, but that one has turned out
               | to be a reliable truth.
               | 
               | Some other interesting insights:
               | 
               | - transmitting large objects is often not profitable, so
               | we found that setting reasonable caps on what's shared
               | with the cache can be really effective for keeping
               | transmissions small and hits fast
               | 
               | - deferring uploads is important because you can't
               | penalize individual devs for contributing to the cache,
               | and not everybody has a fast upload link. making this
               | part smooth is important so that everyone can benefit
               | from every compile.
               | 
               | - build caching is ancient, Make does its own simple form
               | of build caching, but the protocols for it vary in
               | robustness greatly, from WebDAV in ccache to Bazel's gRPC
               | interface
               | 
               | - most GitHub Actions builds occur in a small physical
               | area, so accelerating build artifacts is an easier
               | problem than, say, full blown CDN serving
               | 
               | The assumptions that definitely help:
               | 
               | - it's a cache, not a database; things can be missing, it
               | doesn't need strong consistency
               | 
               | - replication lag is okay because a build cache entry is
               | typically not requested multiple times in a short window
               | of time; the client that created it has it locally
               | 
               | - it's much better to give a fast miss than a slow hit,
               | since the compiler is quite fast
               | 
               | - it's much better to give a fast miss than an error. You
               | can NEVER break a build; at worst it should just not be
               | accelerated.
               | 
               | It's an interesting problem to work on for sure.
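The "fast miss beats a slow hit or an error" rule above could look something like this on the client side. This is a sketch under assumptions: `lookup_or_miss` and the three fake backends are hypothetical names, and the timeout values are arbitrary.

```python
import concurrent.futures
import time
from typing import Callable, Optional


def lookup_or_miss(fetch: Callable[[str], Optional[bytes]], key: str,
                   timeout_s: float) -> Optional[bytes]:
    """Return the cached blob, or None on a miss, a timeout, or any backend error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, key)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeouts and backend failures are both downgraded to a miss,
        # so the caller just compiles as if the cache were empty.
        return None
    finally:
        pool.shutdown(wait=False)  # do not block the build on a slow fetch


def healthy(key: str) -> Optional[bytes]:
    return b"blob"


def broken(key: str) -> Optional[bytes]:
    raise ConnectionError("backend down")


def slow(key: str) -> Optional[bytes]:
    time.sleep(0.6)
    return b"late"


hit = lookup_or_miss(healthy, "k", timeout_s=0.2)
err = lookup_or_miss(broken, "k", timeout_s=0.2)
late = lookup_or_miss(slow, "k", timeout_s=0.2)
```

In all three cases the build can proceed; only the first one is accelerated.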
        
           | aseipp wrote:
           | Not just team members; if you make your cache publicly
           | readable, contributors to e.g. your GitHub/GitLab/Whatever
           | project can also use them and get really fast builds, the
           | first time they try to contribute. So a remote cache is nice
           | to have, if it's seamless.
           | 
           | Nix works this way by default (and much of the community
           | operates caches like this) and it can be a massive, massive
           | time saver.
           | 
           | > How do you make sure the build results are not spoofed?
           | 
           | What do you mean "spoofed?" As in, someone put an evil
           | artifact in the cache? Or overwrote an existing artifact with
           | a new one? Or someone just stole your developer's access and
           | started shoving shit in there? There's a whole bunch of small
           | details here that really matter to understand what
           | security/integrity properties you want the cache to uphold.
           | 
           | FWIW, I've been looking into this in Buck2/Bazel land, and my
           | understanding is that most large orgs just use some kind of
           | terminating auth proxy that the underlying
           | connection/flow/build artifacts can be correlated back to. So
           | you know this cache artifact was first inserted by build B,
           | done by user X, who authenticated with their key K, etc etc.
        
             | sgammon wrote:
             | Exactly -- just like Git, everything is ultimately
             | identified with a key which can tie back to a stable
             | identity thru OIDC or similar mechanisms. At least that's
             | how we did it.
        
             | yjftsjthsd-h wrote:
             | Nix only caches at the package level, doesn't it?
        
               | sgammon wrote:
               | Nix is different, yeah, and it won't wire together a
               | build cache for you. Nix is great for many things of
               | course, it's just not a replacement for sccache per se
               | 
               | Nix + sccache would probably be pretty great for
               | preserving paths and environment, which is really healthy
               | for build caching in general.
        
           | mgaunard wrote:
           | I built my own build system that does something similar.
           | 
           | I've set it up at work with two S3 buckets: trusted and
           | untrusted. CI/CD read/write from trusted only. Developers
           | read/write from untrusted, and read-only from trusted.
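A rough sketch of the two-bucket policy described above, with in-memory dicts standing in for S3 buckets. The class name `TwoBucketCache`, the role names, and the choice to have developers check the trusted bucket first are assumptions; the comment does not specify a read order.

```python
from typing import Dict, List, Optional

# Which buckets each role may read, in lookup order (order is an assumption).
READ_ORDER: Dict[str, List[str]] = {
    "ci": ["trusted"],
    "developer": ["trusted", "untrusted"],
}
# The single bucket each role is allowed to write to.
WRITE_TARGET: Dict[str, str] = {"ci": "trusted", "developer": "untrusted"}


class TwoBucketCache:
    def __init__(self) -> None:
        self.buckets: Dict[str, Dict[str, bytes]] = {"trusted": {}, "untrusted": {}}

    def get(self, role: str, key: str) -> Optional[bytes]:
        for name in READ_ORDER[role]:
            blob = self.buckets[name].get(key)
            if blob is not None:
                return blob
        return None

    def put(self, role: str, key: str, blob: bytes) -> None:
        self.buckets[WRITE_TARGET[role]][key] = blob


cache = TwoBucketCache()
cache.put("developer", "k1", b"dev build")
cache.put("ci", "k2", b"ci build")

ci_sees_dev = cache.get("ci", "k1")          # developer output is invisible to CI
dev_sees_dev = cache.get("developer", "k1")
dev_sees_ci = cache.get("developer", "k2")   # developers benefit from CI output
```

In a real deployment the policy would be enforced by bucket permissions rather than by client code; the sketch only shows which direction artifacts can flow.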
        
             | sgammon wrote:
             | We decided to back our main cache with in-memory storage
             | for spicier performance. I'm curious how well S3 has worked
             | for you here? Is it fast enough?
             | 
             | Or, maybe the blobs you're dealing with are on the bigger
             | end? That would also make sense
        
               | mgaunard wrote:
               | Each object file (.o) has a unique hash and is stored as
               | thehash.o.
               | 
               | It's certainly much faster to download the .o than it is
               | to build it. Once it's downloaded it stays on the local
               | filesystem until it's garbage-collected.
        
               | sgammon wrote:
               | Hm, interesting. Our free tier is planned to be this plus
               | R2, so I'm happy to hear S3-style data exchange is
               | working for people. Thanks for sharing
        
               | mgaunard wrote:
               | The whole point of S3 is that it is inexpensive. You
               | don't want to pay premium money for terabytes of data
               | that are usually invalidated every time someone makes a
               | significant change.
        
           | throwawaaarrgh wrote:
           | It's for sharing and aggregating. Ccache is useful locally,
           | but really shines when combined with Distcc, a distributed
           | compiler. Every host contributes a cache object that other
           | hosts can use, and every host can use the cache object
           | contributed by other hosts. So you don't even have to build
           | it once yourself to benefit from the cache of everyone else.
           | It therefore speeds up multiple hosts/users builds,
           | distributed builds and the dev experience of individuals.
        
         | __float wrote:
         | Is this only a remote _cache_ for Bazel, but it does not
         | support the remote execution API at all? It's a little
         | worrisome to trust all user outputs when you do not also
         | control the execution of them. (In the "best" case this could
         | mean caching non-reproducible ("works on my machine") build
         | results, in the worst case this could be actively dangerous if
         | a malicious user poisons the build cache.)
        
           | sgammon wrote:
           | It's only a remote cache and that's deliberate. We see it as
           | much safer to only offer a cache that the user can control
           | and use however they want
           | 
           | We would see taking over execution of your build as much more
           | dangerous.
           | 
           | No question though that build caching in shared form, in SaaS
           | form, needs extra special attention paid to security. Our
           | product doesn't introspect cache blobs and in fact doesn't
           | really want to. Once we figure out how to make the crypto
           | work, we shouldn't be able to see any of that data at all.
           | 
           | Access can be made public for reads (OSS) but is always
           | identified for writes.
        
           | sgammon wrote:
           | (Also, speaking as a Bazel user now, the Remote Execution
           | APIs have always been a bit brittle and hard to setup, use,
           | and maintain; certainly harder than just setting a cache
           | endpoint.
           | 
           | I've found that remote execution ends up returning much less
           | benefit than remote caching, but that's just me and it's
           | entirely possible I Did It Wrong the whole time)
        
         | sgammon wrote:
         | Yes!! Glad people are thinking about this. We just added Cache
         | Projects which will be launching soon, it should allow this
         | style of public cache sharing.
         | 
         | The intent with Buildless is to release a free-first toolchain
         | that helps with build caching in earnest and makes the whole
         | problem much less error prone. Then the Cloud stuff on top is
         | for groups who need more gas. Cloudflare is generously
         | supporting our upcoming free tier.
        
       | satvikpendem wrote:
       | I use this with Rust, works great. Simply add a
       | ~/.cargo/config.toml with
       | 
       |     [build]
       |     rustc-wrapper = "/path/to/sccache"
       | 
       | And it will work everywhere with cargo. I also like to combine it
       | with the mold linker.
        
         | throwup238 wrote:
         | You can also set the RUSTC_WRAPPER environment variable to make
         | it system wide.
        
           | pie_flavor wrote:
           | The parent comment is also system wide.
        
           | sodality2 wrote:
           | It should be system wide already because it's in ~/.cargo
        
       | heads wrote:
       | We used this for a while at Speechmatics but our LM researchers
       | have a well established workflow based on git working copies on
       | NFS /home and we had a lot of instability between sccache and
       | NFS.
       | 
       | Is sccache susceptible to cache misses when using full paths as
       | cache keys? It would be very helpful if compiling
       | "/home/heads/project/foo.c" could use the cached result of
       | compiling "/home/thunderbong/project/foo.c".
        
         | phlip9 wrote:
         | sccache only caches if builds are run from the same absolute
         | path, so indeed different home dirs won't work
        
           | tentacleuno wrote:
           | As a bystander, what would the reasoning be for doing this? I
           | would have assumed that they'd hash each file and use that as
           | a key in a lookup table.
        
             | sgammon wrote:
             | In some languages, symbols are provided which evaluate to a
             | file's path or directory parent, so program behavior can
             | vary even for the same content hash. That's just one way
             | paths can bleed in to violate hermeticity/correctness.
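A small demonstration of how a path can leak into the result even when the source bytes are identical. Python is used here as a stand-in; C's `__FILE__` macro is the analogous mechanism in a compiled language. The paths and the helper name `run_at` are made up for the example.

```python
import hashlib

# The same source text, "compiled" as if it lived at two different paths.
SOURCE = "def where():\n    pass\nresult = where.__code__.co_filename\n"


def run_at(path: str) -> str:
    """Compile SOURCE as though it were the file at `path`, then run it."""
    namespace: dict = {}
    exec(compile(SOURCE, path, "exec"), namespace)
    return namespace["result"]


content_hash = hashlib.sha256(SOURCE.encode()).hexdigest()
out_a = run_at("/home/heads/project/foo.py")
out_b = run_at("/home/thunderbong/project/foo.py")
```

Both runs share one `content_hash`, yet `out_a` and `out_b` differ, so a cache keyed on content alone would return a wrong result for one of them.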
        
             | MereInterest wrote:
             | It sounds like somebody assuming a docker build, where
             | everybody's build will use the same file path. It's still a
             | very silly restriction, because not everything occurs
             | within docker.
        
               | sgammon wrote:
               | It's an unfortunate safety tradeoff to guarantee
               | consistency. Better visibility into program behavior
               | could fix it.
        
         | slavik81 wrote:
         | With ccache that would be a cache miss by default. However,
         | that could be made a cache hit by configuring the
         | CCACHE_BASEDIR option. There doesn't seem to be an exact
         | equivalent in sccache.
         | https://github.com/mozilla/sccache/issues/35
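To make the base-directory idea concrete, here is a minimal sketch of a cache key that includes the source path, plus an optional rewrite of that path relative to a base directory. It is not the hashing scheme of ccache or sccache (real tools hash considerably more, such as compiler identity and preprocessor output); `cache_key` and its `base_dir` parameter are hypothetical.

```python
import hashlib
import posixpath
from typing import Optional, Sequence


def cache_key(source_path: str, content: bytes, args: Sequence[str],
              base_dir: Optional[str] = None) -> str:
    """Hash compiler args, source path, and source content into one key.

    If base_dir is given and the path lies under it, the path is rewritten
    relative to base_dir before hashing.
    """
    path = source_path
    if base_dir is not None and posixpath.commonpath([base_dir, path]) == base_dir:
        path = posixpath.relpath(path, base_dir)
    h = hashlib.sha256()
    for part in (*args, path):
        h.update(part.encode())
        h.update(b"\0")  # separator so adjacent fields cannot run together
    h.update(content)
    return h.hexdigest()


src = b"int main(void) { return 0; }\n"
args = ["cc", "-O2", "-c"]

# Absolute paths in the key: two home directories give two different keys.
a = cache_key("/home/heads/project/foo.c", src, args)
b = cache_key("/home/thunderbong/project/foo.c", src, args)

# Paths rewritten relative to each user's base directory: the keys agree.
a_rel = cache_key("/home/heads/project/foo.c", src, args, base_dir="/home/heads")
b_rel = cache_key("/home/thunderbong/project/foo.c", src, args,
                  base_dir="/home/thunderbong")
```

The rewrite is only safe when nothing in the build output depends on the absolute path, which is the tradeoff discussed in the replies above.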
        
       | goodpoint wrote:
       | Pity they are not using GPL or at least MPL.
        
       | szundi wrote:
       | My first thought is how to prevent bad guys injecting rootkit
       | binaries into these systems.
        
       | throwawaaarrgh wrote:
       | It would be nice if every application in the world didn't have to
       | hack on support for X number of different almost-identical
       | storage vendors who all decided it would be better to have
       | completely incompatible interfaces.
       | 
       | NFS isn't horrible, though it's limited. OTOH object storage is
       | limited but has some other advantages. So there needs to either
       | be the latter as a standard that can be adopted by all vendors,
       | apps and OSes, or a new standard that fills the gaps of what apps
       | want.
        
         | sgammon wrote:
         | sccache uses OpenDAL... https://opendal.apache.org/
        
       | tedunangst wrote:
       | Next step: store the object files in the blockchain to make a
       | global cache so everyone can compile the browser at the speed of,
       | uh...
        
         | sgammon wrote:
         | I know you're joking, but Unison built this without the
         | blockchain cruft on top. Very cool project.
         | 
         | When any unique piece of Unison is compiled by anyone, it no
         | longer needs to be compiled by everyone.
         | 
         | https://www.unison-lang.org/
        
       ___________________________________________________________________
       (page generated 2023-12-24 23:00 UTC)