[HN Gopher] Mozilla sccache: ccache with cloud storage
___________________________________________________________________
Author : thunderbong
Score  : 151 points
Date   : 2023-12-22 10:02 UTC (2 days ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| Sytten wrote:
| Can you combine that with cross rs? I positively hate the
| official GHA cache action so anything to replace that would be
| nice. But we cross compile with cross rs.
| Scarjit wrote:
| It should work if you modify the cross docker image, so that it
| uses the sccache executable as a wrapper
| sgammon wrote:
| Another alternative is Buildless with the Actions setup step.
| This sets up a remote and local endpoint for sccache inside
| actions, with a connection up to remote caching by Dragonfly
|
| https://github.com/buildless/setup
| fifteen1506 wrote:
| IDK what happened at Mozilla but keep going!
| CaptainOfCoit wrote:
| Worth noting that the first commit in sccache git repository
| was in 2014
| (https://github.com/mozilla/sccache/commit/115016e0a83b290dc2...).
| So I suppose that what "happened" happened waay back.
|
| Then in 2016 it seems like sccache was re-implemented in Rust
| (https://github.com/mozilla/sccache/commit/3da89195ce91a576cc...),
| from the initial Python implementation.
| altairprime wrote:
| Also around then, taskcluster happened.
| https://github.com/taskcluster/taskcluster/commit/54ffef79db...
| jandeboevrie wrote:
| Regular ccache supports remote storage as well:
| https://ccache.dev/manual/latest.html#_remote_storage_backen...
| CaptainOfCoit wrote:
| I guess the difference is that sccache supports cloud storage
| (S3, R2, Google Cloud Storage, et al.) out of the box while
| ccache doesn't, as far as I can tell.
| slavik81 wrote:
| The sccache description was written before remote storage was
| added to ccache. Remote storage is a relatively new feature
| for ccache, and it didn't exist when sccache was created.
| kinkaid wrote:
| The ability to use the local and remote caches in tandem is
| the most important feature to me. sccache will use one or the
| other exclusively, so successive builds will have to pay the
| latency and bandwidth costs every time an artifact is built
| rather than just unpacking from the local cache. ccache
| stages remote artifacts locally by default, so it only pays
| the network costs once per artifact. In CI builds they are
| more or less the same, but the local build experience for
| ccache is much nicer imo.
| sgammon wrote:
| both are supported by buildless as well https://less.build
| sgammon wrote:
| (ccache is only about caching, not distributing builds,
| also...)
| wongarsu wrote:
| sccache is also delightfully simple to set up if you just want
| local storage. It's my go-to solution for sharing build artifacts
| between rust projects
| IshKebab wrote:
| In my experience the time this saves is generally outweighed by
| the effort of setting it up and the many hours lost when it goes
| wrong and you don't think to try a clean build. It's certainly
| better than ccache in that regard but you really need something
| like Bazel if you're going to be aggressively caching C++ builds
| like this (or impure Rust builds).
|
| (And if you're using Bazel or one of its brethren then they
| generally have native remote caching and execution support.)
| goku12 wrote:
| Setting it up took hardly 5 minutes for me (for Rust). And it
| hasn't caused any issues so far. Forcing a fresh build of
| artifacts is also very easy.
|
| Meanwhile, the only legitimate problem you mentioned is if it
| causes a build error and we don't immediately consider it as
| the source of the issue. But I use the check command so often
| that it is easy to suspect the cache if check succeeds and
| build fails.
| IshKebab wrote:
| When these things go wrong it often doesn't cause a
| compilation error; it can just cause inexplicable runtime
| behaviour.
|
| You may be lucky and be building a pure or mostly pure Rust
| program, in which case it works pretty well. Throw in some
| C/C++ and it starts to degrade (though it's still better than
| with an actual C/C++ program because you aren't actually
| editing the C/C++ code generally).
|
| And are you actually using remote caching/compilation?
| Because there's no way you can set that up in 5 minutes.
| goku12 wrote:
| > And are you actually using remote caching/compilation?
| > Because there's no way you can set that up in 5 minutes.
|
| Who said anything about remote caching? You don't need it
| to benefit from it. It's useful if you build a lot of Rust
| code. A lot of packages turn up repeatedly as dependencies
| among several projects.
|
| > You may be lucky and be building a pure or mostly pure
| > Rust program, in which case it works pretty well. Throw in
| > some C/C++ and it starts to degrade
|
| You're making assumptions again. What is the issue with
| pure Rust code? Rust isn't like Python needing C or C++
| support for process-intensive parts. And many of the C/C++
| dependencies are dynamically linked, with Rust wrappers. I
| haven't seen many projects that require Rust and C/C++ code
| to be built together and statically linked.
|
| Besides, I haven't heard anyone complain about sccache that
| much. How prevalent is the degradation anyway?
| sgammon wrote:
| After using sccache and Gradle with Buildless for months,
| years, I have literally never seen these tools mix up or use
| the wrong binary objects.
|
| Knowing the internals of some of them, I've found that
| build cache clients are way more likely to miss with a
| cache key misalignment than they are to mix up two objects.
| I'm sure it's possible, I've just never seen it happen in
| the wild, after extensive usage.
|
| Generally speaking these tools are very conservative about
| two inputs matching: an identical file at a different path
| will cause a cache key change in sccache.
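To make the "conservative cache key" point concrete, here is a minimal Python sketch. It is not sccache's actual key derivation: the `cache_key` function, the choice of inputs, and the example paths are all assumptions made for illustration. The idea is only that a key which digests the arguments, environment, absolute path, and file contents changes whenever any one of them changes, so a mismatch shows up as a miss instead of a wrong hit.

```python
import hashlib


def cache_key(args, env, abs_path, source_bytes):
    """Hypothetical cache key: a digest over every input that could affect the output."""
    h = hashlib.sha256()

    def feed(data: bytes):
        # Length-prefix each field so adjacent fields cannot run together.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)

    for arg in args:
        feed(arg.encode())
    for name in sorted(env):
        feed(f"{name}={env[name]}".encode())
    feed(abs_path.encode())
    feed(source_bytes)
    return h.hexdigest()


src = b"int main(void) { return 0; }\n"
args = ["cc", "-O2", "-c"]
env = {"LANG": "C"}

# Hypothetical paths: the same file checked out by two different users.
key_a = cache_key(args, env, "/home/alice/project/foo.c", src)
key_b = cache_key(args, env, "/home/bob/project/foo.c", src)
key_a_again = cache_key(args, env, "/home/alice/project/foo.c", src)

print(key_a == key_a_again)  # True: same inputs, same key
print(key_a == key_b)        # False: same content, different path
```

The cost of this design is visible in the second comparison: two users building identical sources from different directories do not share entries.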
| saghm wrote:
| > Setting it up took hardly 5 minutes for me (for Rust)
|
| You're underselling it, honestly. For those who haven't
| looked into it, if you want to enable it for all Rust
| projects on your system, literally all you need is to install
| the binary and then add this to ~/.cargo/config, and then it
| will be enabled whenever you invoke `cargo` (or even rustc
| directly iirc) as the user whose config file you modified:
|
|     [build]
|     rustc-wrapper = "/path/to/sccache"
|
| I'm sure there are people with legitimate reasons for not
| wanting it enabled implicitly or who have multiple users they
| might have to set this up for, but for 90% of people it won't
| take any more time than it took to read this comment.
| sgammon wrote:
| If you're interested in a drop-in remote cache for sccache, check
| out Buildless
|
| We just released S3 and Redis support. https://less.build
|
| Buildless also supports Gradle, Maven, Bazel, CCache and Turbo
| sgammon wrote:
| Our beta is open, just shoot me an email at sam@less.build if
| you want to try it out!
| phamilton wrote:
| I sent an email and it was blocked: "Recipient address
| rejected: Access denied."
| xjia wrote:
| What is the benefit of using a remote cache instead of a local
| ~/.cache directory? Is it only for sharing build results among
| team members? How do you make sure the build results are not
| spoofed?
| sgammon wrote:
| Sharing with team members, sharing with CI, and the ability
| to pull from more than just what's on your machine (i.e. a
| larger addressable cache than you are willing to keep on
| disk). Cache objects also compound across projects, so it's
| nice to ship them up somewhere and have them nearby when you
| need them.
|
| Re/spoofing, obviously it's all protected with API keys and
| tokens, and we're working on mechanisms to perform end-to-end
| encryption.
| In general, build cache objects are usually
| addressed by a content-addressable-hash, so that also helps
| because your build typically knows the content it's looking
| for and can verify.
|
| That isn't true for all tools, though, so we're working to
| understand where the gaps are and fix them.
| sgammon wrote:
| (Fwiw, group conversation encryption tech like MLS is
| somewhat applicable, and that's the sort of pattern we're
| looking at, but it would be cool to know if that's moving
| to you on the problem of safety w.r.t. builds.)
| Thorrez wrote:
| > In general, build cache objects are usually addressed by a
| > content-addressable-hash
|
| How does that work? I would think the simplest case of a
| build object that needs to be cached is a .o file created
| from a .c file. The compiler sees the .c file and can
| determine its hash, but how can the compiler determine the
| hash of the .o file to know what to look up in the cache? I
| think the compiler would need to perform the lookup using
| the hash of the .c file, which isn't a hash of the data in
| the cache.
| sgammon wrote:
| In Bazel's case and other cases, build cache objects are
| held in CAS and then referenced from other keys. I
| believe BuildXL from Microsoft also works this way.
|
| Of course one other advantage to build caches is they are
| verifiable: the intent is to produce the exact same
| output as a normal call, and that's easily checked on the
| client side.
|
| No question that build caching poses inherent supply
| chain risks though and that's part of what we want to
| solve. I think people are hesitant to trust build caching
| for good reason until there are safer mechanisms and
| better cryptographic patterns applied.
| krupan wrote:
| When a .o is stored in the cache it is associated with
| the hash of the .c file
| aseipp wrote:
| In the case of the Remote Execution/Cache API used by
| Bazel among others[1] at least, it's a bit more detailed.
| There's an "ActionCache" and an actual content-addressed
| cache that just stores blobs
| ("ContentAddressableStorage"). When you run a
| `gcc -O2 -c foo.c -o foo.o` command (locally or remotely;
| doesn't matter), you upload an "Action" into the action cache,
| which basically says "This command was run. As a result
| it had this stderr, stdout, error code, and these input
| files read and output files written." The input and
| output files are referenced by the hash of their
| contents, in this case, and they get uploaded into the
| CAS system.
|
| Most importantly you can look up an action in the
| ActionCache without actually running it, provided you
| have the inputs at hand. So now when another person comes
| by and runs the same build command, they say "Has this
| Action, with these inputs, been run before?" and the
| server can say "Yes, and the output is a file identified
| by hash XYZ" where XYZ is the hash of foo.o, so you can
| just instantly download it from the CAS.
|
| So there are a few more moving parts to make it all work.
| But the system really is ultimately content-addressed,
| for the most part.
|
| [1] https://github.com/bazelbuild/remote-apis/blob/main/build/ba...
| sgammon wrote:
| Yep, aseipp, and we support the full gRPC interface for
| remote caching offered by Bazel, including the newer
| APIs.
|
| Explained better than I could for sure. I find it very
| interesting how BuildXL and Bazel ended up at similar
| models for this problem. I don't yet know the history of
| which informed which.
|
| (As compared to, say, Gradle, which works based on input
| hashes instead.)
| xjia wrote:
| IIUC the actual computation (e.g. compiling, linking, ...)
| happens on client (CI or developer) machines and the
| results are written to the server-side cache.
|
| By spoofing I meant to say that an authenticated but
| malicious client (intentionally or not, e.g. a clueless
| intern) may be able to write malicious contents to the
| cache.
| For example, their build toolchain could be
| contaminated and the resulting build outputs are
| contaminated. The "action" per se and its hash are still
| legit, but the hash is only used as the lookup key -- the
| corresponding value is "spoofed."
|
| The only safe way I can imagine to use such a remote cache
| is for CI to publish its build results so that they could
| be reused by developers. The direction from developers to
| developers or even to CI seems difficult to handle and has
| less value. But I might be missing some important insights
| here so my conclusion could be wrong.
|
| But if that's the case, is the most valuable use case to
| just configure the CI to read from / write to the remote
| cache, and developers to only read from the remote cache?
| And given such an assumption, is it much easier to
| design/implement a remote cache product?
| sgammon wrote:
| All great points but in practice, tools like Bazel and
| sccache are incredibly conservative about hashes
| matching, to include file path on disk and even env var
| state.
|
| One goal of these tools is to guarantee that such
| misconfiguration results in a cache key mismatch, rather
| than a hit and a bug.
|
| There are tons of challenges designing a remote build
| cache product, like anything, but that one has turned out
| to be a reliable truth.
|
| Some other interesting insights:
|
| - transmitting large objects is often not profitable, so
| we found that setting reasonable caps on what's shared
| with the cache can be really effective for keeping
| transmissions small and hits fast
|
| - deferring uploads is important because you can't
| penalize individual devs for contributing to the cache,
| and not everybody has a fast upload link. making this
| part smooth is important so that everyone can benefit
| from every compile.
|
| - build caching is ancient, Make does its own simple form
| of build caching, but the protocols for it vary in
| robustness greatly, from WebDAV in ccache to Bazel's gRPC
| interface
|
| - most GitHub Actions builds occur in a small physical
| area, so accelerating build artifacts is an easier
| problem than, say, full-blown CDN serving
|
| The assumptions that definitely help:
|
| - it's a cache, not a database; things can be missing, it
| doesn't need strong consistency
|
| - replication lag is okay because a build cache entry is
| typically not requested multiple times in a short window
| of time; the client that created it has it locally
|
| - it's much better to give a fast miss than a slow hit,
| since the compiler is quite fast
|
| - it's much better to give a fast miss than an error. You
| can NEVER break a build; at worst it should just not be
| accelerated.
|
| It's an interesting problem to work on for sure.
| aseipp wrote:
| Not just team members; if you make your cache publicly
| readable, contributors to e.g. your GitHub/GitLab/Whatever
| project can also use it and get really fast builds, the
| first time they try to contribute. So a remote cache is nice
| to have, if it's seamless.
|
| Nix works this way by default (and much of the community
| operates caches like this) and it can be a massive, massive
| time saver.
|
| > How do you make sure the build results are not spoofed?
|
| What do you mean "spoofed?" As in, someone put an evil
| artifact in the cache? Or overwrote an existing artifact with
| a new one? Or someone just stole your developer's access and
| started shoving shit in there? There's a whole bunch of small
| details here that really matter to understand what
| security/integrity properties you want the cache to uphold.
|
| FWIW, I've been looking into this in Buck2/Bazel land, and my
| understanding is that most large orgs just use some kind of
| terminating auth proxy that the underlying
| connection/flow/build artifacts can be correlated back to. So
| you know this cache artifact was first inserted by build B,
| done by user X, who authenticated with their key K, etc etc.
| sgammon wrote:
| Exactly -- just like Git, everything is ultimately
| identified with a key which can tie back to a stable
| identity thru OIDC or similar mechanisms. At least that's
| how we did it.
| yjftsjthsd-h wrote:
| Nix only caches at the package level, doesn't it?
| sgammon wrote:
| Nix is different, yeah, and it won't wire together a
| build cache for you. Nix is great for many things of
| course, it's just not a replacement for sccache per se
|
| Nix + sccache would probably be pretty great for
| preserving paths and environment, which is really healthy
| for build caching in general.
| mgaunard wrote:
| I built my own build system that does something similar.
|
| I've set it up at work with two S3 buckets: trusted and
| untrusted. CI/CD read/write from trusted only. Developers
| read/write from untrusted, and read-only from trusted.
| sgammon wrote:
| We decided to back our main cache with in-memory storage
| for spicier performance. I'm curious how well S3 has worked
| for you here? Is it fast enough?
|
| Or, maybe the blobs you're dealing with are on the bigger
| end? That would also make sense
| mgaunard wrote:
| Each object file (.o) has a unique hash and is stored as
| thehash.o.
|
| It's certainly much faster to download the .o than it is
| to build it. Once it's downloaded it stays on the local
| filesystem until it's garbage-collected.
| sgammon wrote:
| Hm, interesting. Our free tier is planned to be this plus
| R2, so I'm happy to hear S3-style data exchange is
| working for people. Thanks for sharing
| mgaunard wrote:
| The whole point of S3 is that it is inexpensive.
| You don't want to pay premium money for terabytes of data
| that are usually invalidated every time someone makes a
| significant change.
| throwawaaarrgh wrote:
| It's for sharing and aggregating. Ccache is useful locally,
| but really shines when combined with Distcc, a distributed
| compiler. Every host contributes a cache object that other
| hosts can use, and every host can use the cache object
| contributed by other hosts. So you don't even have to build
| it once yourself to benefit from the cache of everyone else.
| It therefore speeds up multiple hosts'/users' builds,
| distributed builds and the dev experience of individuals.
| __float wrote:
| Is this only a remote _cache_ for Bazel, but it does not
| support the remote execution API at all? It's a little
| worrisome to trust all user outputs when you do not also
| control the execution of them. (In the "best" case this could
| mean caching non-reproducible ("works on my machine") build
| results, in the worst case this could be actively dangerous if
| a malicious user poisons the build cache.)
| sgammon wrote:
| It's only a remote cache and that's deliberate. We see it as
| much safer to only offer a cache that the user can control
| and use however they want
|
| We would see taking over execution of your build as much more
| dangerous.
|
| No question though that build caching in shared form, in SaaS
| form, needs extra special attention paid to security. Our
| product doesn't introspect cache blobs and in fact doesn't
| really want to. Once we figure out how to make the crypto
| work, we shouldn't be able to see any of that data at all.
|
| Access can be made public for reads (OSS) but is always
| identified for writes.
| sgammon wrote:
| (Also, speaking as a Bazel user now, the Remote Execution
| APIs have always been a bit brittle and hard to set up, use,
| and maintain; certainly harder than just setting a cache
| endpoint.
|
| I've found that remote execution ends up returning much less
| benefit than remote caching, but that's just me and it's
| entirely possible I Did It Wrong the whole time)
| sgammon wrote:
| Yes!! Glad people are thinking about this. We just added Cache
| Projects which will be launching soon, it should allow this
| style of public cache sharing.
|
| The intent with Buildless is to release a free-first toolchain
| that helps with build caching in earnest and makes the whole
| problem much less error prone. Then the Cloud stuff on top is
| for groups who need more gas. Cloudflare is generously
| supporting our upcoming free tier.
| satvikpendem wrote:
| I use this with Rust, works great. Simply add a
| ~/.cargo/config.toml with
|
|     [build]
|     rustc-wrapper = "/path/to/sccache"
|
| And it will work everywhere with cargo. I also like to combine it
| with the mold linker.
| throwup238 wrote:
| You can also set the RUSTC_WRAPPER environment variable to make
| it system wide.
| pie_flavor wrote:
| The parent comment is also system wide.
| sodality2 wrote:
| It should be system wide already because it's in ~/.cargo
| heads wrote:
| We used this for a while at Speechmatics but our LM researchers
| have a well established workflow based on git working copies on
| NFS /home and we had a lot of instability between sccache and
| NFS.
|
| Is sccache susceptible to cache misses when using full paths as
| cache keys? It would be very helpful if compiling
| "/home/heads/project/foo.c" could use the cached result of
| compiling "/home/thunderbong/project/foo.c".
| phlip9 wrote:
| sccache only caches if builds are run from the same absolute
| path, so indeed different home dirs won't work
| tentacleuno wrote:
| As a bystander, what would the reasoning be for doing this? I
| would have assumed that they'd hash each file and use that as
| a key in a lookup table.
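One reason a content hash alone can be an unsafe key is that compiled output may embed the source path, for example through `__FILE__` in C, `file!()` in Rust, or paths recorded in debug info. The Python sketch below uses a toy `fake_compile` function (hypothetical, standing in for a real compiler) and a naive `content_only_key` (also hypothetical) to show identical source text producing different output at different paths. It says nothing about how sccache itself is implemented.

```python
import hashlib


def fake_compile(abs_path: str, source: str) -> str:
    """Toy stand-in for a compiler: expands a __FILE__-style placeholder."""
    return source.replace("__FILE__", f'"{abs_path}"')


def content_only_key(source: str) -> str:
    """Naive key that ignores where the file lives."""
    return hashlib.sha256(source.encode()).hexdigest()


source = 'puts("built from " __FILE__);'

out_heads = fake_compile("/home/heads/project/foo.c", source)
out_other = fake_compile("/home/thunderbong/project/foo.c", source)

# Both checkouts map to the same naive key ...
same_key = content_only_key(source) == content_only_key(source)
# ... yet the correct outputs differ, so sharing one entry would be wrong.
same_output = out_heads == out_other

print(same_key)     # True
print(same_output)  # False
```

A cache keyed only on content would hand the second user an artifact that still names the first user's directory.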
| sgammon wrote:
| In some languages, symbols are provided which evaluate to a
| file's path or directory parent, so program behavior can
| vary even for the same content hash. That's just one way
| paths can bleed in to violate hermeticity/correctness.
| MereInterest wrote:
| It sounds like somebody assuming a docker build, where
| everybody's build will use the same file path. It's still a
| very silly restriction, because not everything occurs
| within docker.
| sgammon wrote:
| It's an unfortunate safety tradeoff to guarantee
| consistency. Better visibility into program behavior
| could fix it.
| slavik81 wrote:
| With ccache that would be a cache miss by default. However,
| that could be made a cache hit by configuring the
| CCACHE_BASEDIR option. There doesn't seem to be an exact
| equivalent in sccache.
| https://github.com/mozilla/sccache/issues/35
| goodpoint wrote:
| Pity they are not using GPL or at least MPL.
| szundi wrote:
| My first thought is how to prevent bad guys injecting rootkit
| binaries into these systems.
| throwawaaarrgh wrote:
| It would be nice if every application in the world didn't have to
| hack on support for X number of different almost-identical
| storage vendors who all decided it would be better to have
| completely incompatible interfaces.
|
| NFS isn't horrible, though it's limited. OTOH object storage is
| limited but has some other advantages. So there needs to either
| be the latter as a standard that can be adopted by all vendors,
| apps and OSes, or a new standard that fills the gaps of what apps
| want.
| sgammon wrote:
| sccache uses OpenDAL... https://opendal.apache.org/
| tedunangst wrote:
| Next step: store the object files in the blockchain to make a
| global cache so everyone can compile the browser at the speed of,
| uh...
| sgammon wrote:
| I know you're joking, but Unison built this without the
| blockchain cruft on top. Very cool project.
|
| When any unique piece of Unison is compiled by anyone, it no
| longer needs to be compiled by everyone.
|
| https://www.unison-lang.org/
___________________________________________________________________
(page generated 2023-12-24 23:00 UTC)