[HN Gopher] Deterministic Linux for controlled testing and softw... ___________________________________________________________________ Deterministic Linux for controlled testing and software bug-finding Author : rrnewton Score : 82 points Date : 2022-11-22 18:12 UTC (4 hours ago) (HTM) web link (developers.facebook.com) (TXT) w3m dump (developers.facebook.com) | jasonwhite wrote: | TL;DR: This is a Rust project that forces deterministic execution | of arbitrary programs and acts like a reproducible container. | That is, it _hermetically_ isolates the program from sources of | non-determinism such as time, thread interleavings, random number | generation, etc. Guaranteed determinism is a powerful tool and it | serves as a basis for a number of applications, including | concurrency stress testing, record/replay, reproducible builds, | automatic diagnosis of concurrency bugs, and more. | | I've been on the team working on this project over the past ~2 | years. AMA! | | Here is the GitHub repository: | https://github.com/facebookexperimental/hermit | mtlmtlmtlmtl wrote: | This is exactly the type of thing I've been wanting for testing | my chess engine. Parallelism is based on emergent pseudorandom | effects of things like interleaving causing searcher threads to | mostly end up in non-overlapping parts of the search tree. | | One question: How do you avoid the program being affected by | things like overall system load and memory pressure? | jasonwhite wrote: | Since CPU is a "compressible" resource, system load doesn't | affect the determinism. It'll make it slower, of course. Since | memory is a non-compressible resource, things can start | getting killed by the OOM-killer and there's nothing we can | do about it. There are also certainly things like external | network communication and file system access that are | non-deterministic and must be handled at a higher level (e.g., | with a reproducible FS image or by recording & replaying | network traffic).
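The thread-interleaving nondeterminism discussed above is easy to demonstrate. Below is a minimal, hypothetical Rust sketch (not from the Hermit repo): each worker appends its id to a shared log, and only the *order* of entries is scheduler-dependent from run to run -- exactly the kind of variation a deterministic sandbox like Hermit would pin down to one reproducible interleaving.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Each spawned worker appends its id to a shared log. The multiset of
// entries is fixed, but the order depends on how the OS scheduler
// interleaves the threads -- nondeterministic under plain Linux,
// reproducible under a deterministic sandbox.
fn racy_log(n_threads: usize) -> Vec<usize> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for id in 0..n_threads {
        let log = Arc::clone(&log);
        handles.push(thread::spawn(move || {
            log.lock().unwrap().push(id);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(log).unwrap().into_inner().unwrap()
}

fn main() {
    // Run this twice natively and the printed order may differ;
    // under a deterministic container it would be identical every time.
    println!("{:?}", racy_log(4));
}
```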
| mtlmtlmtlmtl wrote: | The idea of CPU being compressible is very insightful, | thanks. | | I'm curious about what happens with time control though. | Engines can be given a time constraint and will use various | heuristics to allocate that time. How does Hermit intercept | gettimeofday()? | rrnewton wrote: | Well, gettimeofday is a syscall, and we do intercept it | (along with clock_gettime, clock_getres, time, | nanosleep, and the rdtsc instruction, even though that | last one is not a syscall). When we intercept it, we | report virtual time back to the guest. We make sure that | virtual time is deterministic, across all threads in the | container, irrespective of what the wall clock time is on | the host machine. | | So for instance, if there are multiple threads in a chess | engine, and they are racing to write search results to a | data structure, these threads will interleave in a | reproducible order under Hermit, and the races will | resolve consistently. | | But the downside is that Hermit does sequentialize | execution onto one core. So in the current version, a | multithreaded program doesn't get actual wall-clock | speedup from its parallelism. (The earlier dettrace did | allow some limited guest parallelism, and we plan to | bring that back.) For this reason, Hermit's good for | consistently testing multithreaded software, but you | wouldn't want to run parallel software under it outside | of testing. | comex wrote: | Nice. Some questions: | | - How does this compare with rr? | | - Out of curiosity, have you ever looked at libTAS? (I realize | it has a very different intended use case.) | | - Have you had any issues with behavior differences between | CPUs? I know there is architecture-level undefined behavior | that can differ between CPUs; on the other hand, it sounds like | you're primarily interested in running well-behaved executables | that wouldn't intentionally invoke such behavior.
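The virtual-time interception rrnewton describes above can be caricatured with a toy model in Rust. This is an illustration only, not Hermit's actual mechanism: time starts at a fixed epoch and ticks forward by a fixed quantum per intercepted time query, so every run of the guest observes the identical sequence of timestamps regardless of the host's wall clock.

```rust
// Toy model of a deterministic virtual clock (NOT Hermit's algorithm):
// a fixed epoch plus a fixed advance per intercepted time-query.
struct VirtualClock {
    now_ns: u64,
    quantum_ns: u64,
}

impl VirtualClock {
    fn new(epoch_ns: u64, quantum_ns: u64) -> Self {
        Self { now_ns: epoch_ns, quantum_ns }
    }

    // What a sandboxed `gettimeofday`/`clock_gettime` would return:
    // the current virtual time, after which the clock advances
    // deterministically.
    fn gettime(&mut self) -> u64 {
        let t = self.now_ns;
        self.now_ns += self.quantum_ns;
        t
    }
}

fn main() {
    let mut clock = VirtualClock::new(1_000_000_000, 100);
    // These values are identical on every run, on every host.
    println!("{} {}", clock.gettime(), clock.gettime());
}
```

A real implementation also has to order queries from multiple threads deterministically, which is where the sequentialized scheduler mentioned above comes in.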
| jasonwhite wrote: | - We've taken a lot of inspiration from RR and it is very | good at what it does (record/replay + randomizing thread | schedules). This project diverges a bit from RR in that we | focus more on the concurrency testing application of | determinism. I wrote the toy record/replay system that sits | on top of the determinism layer (so that we don't have to | record as much stuff), but it's a long way from being | complete. | | - I hadn't seen libTAS until now, so I can't comment on it | much. At first glance, it does look similar! | | - See @rrnewton's reply on this one. | rrnewton wrote: | (1) rr [formerly Mozilla rr] | | We're big fans of rr! | | Hermit is different in that creating a deterministic OS | semantics is different than recording whatever | nondeterministic behavior occurs under normal Linux. BUT, | there's a lot of overlap. And indeed `hermit record` is | straight up RnR (record & replay). | | But hermit's RnR is not nearly as developed as rr's. We | integrate with gdb/lldb as an (RSP) debugger backend, just | like rr. Any failing execution you can create with hermit, | you can attach a debugger to. But our support is very | preliminary, and you'll probably find rough edges. Also, we | don't support backwards stepping yet (except by running | again). | | If we invest more in using Hermit as a debugger (rather than | for finding and analyzing concurrency bugs), then there | should be _some_ advantages over traditional RnR. These would | relate to the fact that deterministic execution is | different than recording. For example, process and thread | IDs, and memory addresses all stay the same across multiple | runs of the program, even as you begin adding printfs and | modifying the program to fix the bug. With traditional RnR, | you can play the same recording as many times as you like, | but as soon as you take a second recording all bets are off | wrt what is the same or different compared to the prior | recording.
(That includes losing the "mental state" of | things like tids & memory addresses, which is a good point | Robert O'Callahan makes about the benefits of RnR when | accessing the same recording multiple times.) | | (2) libTAS - no we haven't! Checking it out now. | | (3) Yes, definitely issues with CPU portability. | | In general, we are interested in not just determinism on the | same machine, but portability between machines in our fleet. | As with any tech company that uses the cloud, at Meta people | are usually trying to debug an issue on a different machine | than where the problem occurred. I.e. taking a crash from a | production or CI machine to a local dev machine. | | The way we do this is that we mostly report a fairly old CPU | to the guest, which disables certain features IF the guest is | well behaved. | | With the current processor tech, I don't think there's any | way we can stop an adversarial program which, for example, | would execute CPUID, find that RDRAND is not supported on the | processor, but then execute RDRAND anyway. We could build a | much more invasive binary-instrumentation-based emulator that | _would_ be able to enforce these kinds of rules at the | instruction granularity, but it would have higher overhead, | especially startup overhead. The nice thing about Reverie, | though, is that we (or others) can add different | instrumentation backends while keeping the same program | instrumentation API. So we could have a "hardened" backend | that was more about sandboxing and reverse-engineering | adversarial software, making a different tradeoff with | respect to performance overhead. | CJefferson wrote: | This looks cool. | | Personally I'd also be interested in this for academic work -- | anything which makes it easier to be sure an experiment can be | reproduced later (a week, year, or decade later, in increasing | order of difficulty), is good to have. | rrnewton wrote: | Yes, I'm very interested in that as well.
I've been involved | with the ACM Artifact Evaluation process, which has been going | on in several conferences for a while. | | https://www.acm.org/publications/policies/artifact-review-ba... | | But it's been pretty frustrating. As an author, my PLDI 2014 | artifact stopped working less than 5 years later (Docker | image binary incompatibility). And when I was co-chair of an | Artifact Evaluation Committee in 2017, there was not great | reproducibility of the artifacts that were submitted either. | | If you package a VM (freezing the Linux kernel), and are | pretty sure that VM will run in 10 years, PLUS you | determinize the execution itself... that should allow durable | bitwise reproducibility. Maybe Hermit could be one ingredient | of that. | | For scientific reproducibility, there is a lot of other | tooling to build too, and I know some folks have been working | in that area: | | https://ctuning.org/ | gmartres wrote: | Thanks for open-sourcing this! Roughly, what's the performance | overhead from running code under hermit? I'm wondering if this | could be used for doing benchmarking with less variance on | non-deterministic platforms such as the JVM (I assume hermit is | "deterministic enough" that the JIT and GC threads of the JVM | will run the same code on every execution?) | rrnewton wrote: | Alas the performance overhead in _realtime_ is not great yet. | It still uses ptrace currently, which often results in a | multiple-X slowdown (but at least it doesn't "subscribe" to | every syscall like strace does, because some are naturally | deterministic). Reverie's whole design is to make it support | swappable backends, and this ptrace backend is just the | reference implementation. The `experimental/reverie-sabre` | directory in the Reverie repo contains our high performance | backend, but it's still work-in-progress. It uses binary | instrumentation and in our early experiments is 10X faster | than our current backend in the worst case (i.e.
strace is | >10X faster when rewritten with reverie-sabre and run on a | program that does nothing but syscalls). | | But to the second part of your question about deterministic | benchmarking, that is really a separate question. Hermit | defines a deterministic notion of virtual time, which is | based on the branches retired and system calls executed by | all threads. When you run hermit with `--summary`, it reports | a total "Elapsed virtual global time", which is completely | deterministic: | | $ hermit run --summary /bin/date | | ... | | Elapsed virtual global (cpu) time: 5_039_700ns | | Therefore, any program that runs under hermit can get this | deterministic notion of performance. We figured that could be | useful for setting performance regression tests with very | small regression margins (<1%), which you can't do on normal | noisy systems. Compilers are one place I've worked where we | wanted smaller performance regression alarms (for generated | code) than we could achieve in practice. We haven't actually | explored this application yet though. There's a whole small | field of people studying performance modeling and prediction, | and if one wanted to try this deterministic benchmarking | approach, they might want to take some of that knowledge and | build a performance model that is more accurate (better | correlated with wall time) than Hermit's current virtual | time. | dekhn wrote: | Does running /bin/date under hermit always return the same | time? Or does it just follow the same codepath to retrieve | the actual time? | [deleted] | wyldfire wrote: | > AMA! | | Eager to try it but encountering the build error here - | https://github.com/facebookexperimental/hermit/issues/11 | | Do you have a reference build log / environment you can share? | Last known good commit sha and/or output from "rustup show"? | jasonwhite wrote: | We're working on it! Should be fixed soon.
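As a rough illustration of the virtual-time idea described above, here is a toy cost model in Rust. The coefficients are invented for the example, not Hermit's actual accounting: each retired branch and each syscall is charged a fixed deterministic cost, so two runs with identical event counts report identical "durations", which is what makes sub-1% regression thresholds feasible.

```rust
// Toy deterministic "virtual time" model (illustrative coefficients,
// NOT Hermit's real formula): charge fixed per-event costs for
// retired branches and executed syscalls.
const NS_PER_BRANCH: u64 = 1;
const NS_PER_SYSCALL: u64 = 1_000;

fn virtual_time_ns(branches_retired: u64, syscalls: u64) -> u64 {
    branches_retired * NS_PER_BRANCH + syscalls * NS_PER_SYSCALL
}

fn main() {
    // Unlike wall-clock time, this depends only on what the program
    // did, never on host load, so it is bit-identical across runs.
    println!("{}ns", virtual_time_ns(5_000_000, 40));
}
```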
| rrnewton wrote: | Reverted a badly-timed breaking change that came through | the sync system. Will fix it properly shortly (and add a | Dockerfile and release tag). But for now you may have | better luck on the main branch after that reversion, which | yielded 6cb5575ffd287289769144ec82e2900cbf6cd1ad. | | Let's discuss further on that issue #11. | rrnewton wrote: | Note that this is a follow-on project from the earlier Dettrace | system, which was applied mainly to reproducible builds (as in | the academic paper, | https://dl.acm.org/doi/10.1145/3373376.3378519, and presented | to the Debian Reproducible Builds summit): | | - https://github.com/dettrace/dettrace | | And one cool part of it is this Rust program instrumentation | layer: | | - https://github.com/facebookexperimental/reverie | | It's good for building OS-emulator style projects or tracing | tools. | oulipo wrote: | Would something like this work for embedded code, like ESP32? | rrnewton wrote: | Well, probably not _on device_ ;-). | | The underlying Reverie instrumentation layer works on ARM, | but Hermit isn't ported yet, and we haven't touched RISC-V | yet at all. (Contributions welcome!) | | One thing we haven't tried yet is just putting a whole | emulator (qemu etc) underneath Hermit. That would address any | sources of irreproducibility that the emulator lets through | from the host (threads, RNG, etc). | wyldfire wrote: | > thread interleavings | | I was going to ask if it could do the flip side - instead of | stabilizing the scheduler, make it less predictable. | | AFAICT, it can! Awesome, looking forward to giving it a try. | hermit run --chaos --seed-from=SystemRandom ./target/debug/hello_race | rrnewton wrote: | Yes indeed. | | That concurrency testing capability is a pretty well-studied | area and we implement a couple of existing algorithms. The first | is our adaptation of the PCT algorithm (ASPLOS'10 | https://www.microsoft.com/en-us/research/wp-content/uploads/...).
That's what you get by default with | `--chaos`. | | But we also have variations on a straight-up randomized | scheduler (random thread selection at each time step). | | rr chaos mode has its own take on this: | https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo... | | This study compares a few approaches - | http://www.doc.ic.ac.uk/~afd/homepages/papers/pdfs/2016/TOPC.... | daniel-levin wrote: | Neat! This is the direction I'd hoped to see gvisor go in. What's | the reasoning for building from scratch and not piggybacking off | gvisor? | jasonwhite wrote: | We certainly looked into gVisor and Firecracker when we started | this project a few years ago. These systems use KVM, and gVisor | in particular uses Model Specific Registers (MSRs) to | intercept system calls before forwarding them to the host | kernel. Intercepting syscalls this way has less overhead than | ptrace and we would have complete control over the system | environment. I think it's a good approach and worth exploring | more, but ultimately the deal breaker was that KVM requires | root privileges to run and it wouldn't run on our | already-virtualized dev machines. We also wanted to allow the guest | program to interact with the host's file system. So, we went | with good ol' ptrace. Last I checked, gVisor also had a ptrace | backend, but it wasn't very far along at the time. When going | the ptrace route, there is less reason to depend on another | project. Another reason of course is that we'd be beholden to a | Google project. ;) | srosenberg wrote: | Great work and thanks for making it OSS! I was familiar with the | prior (academic) work and its limitations, specifically TCP/IP. | Could you elaborate on how you solved that problem? | rrnewton wrote: | Sure! So it really breaks down into two cases: internal and | external networking relative to the container Hermit creates.
| | (1) internal networking | | If you run a test like `rust/network_hello_world.rs` under | Hermit, then the communication between threads is part of the | "deterministic bubble" that we're running inside of. When one | thread blocks on a network call, the Hermit scheduler takes the | thread out of the run pool, and it has to deterministically | decide when it is ready to rejoin the run pool by waking up. | The scheduler proceeds in linear turns (labeled "COMMIT" in the | logs), and if thread 5 unblocks from a network read at turn 100 | in one run, it must unblock at that same point in time in all | other runs. | | Sometimes we use a precise model of the blocking operation | (like with futexes) and other times we depend on sending Linux | a non-blocking version of the syscall as a way to poll the IO | and see if it is ready to complete (given the history of every | operation that has committed on turns 1..N-1). | | (2) external networking | | This is impossible to determinize, of course, unless you suck | the whole network, including both hosts, into the deterministic | bubble, as the DDOS fork of Linux experimented with in ~2013. | That was kind of a negative result IMO because performance was | pretty bad, but the paper is here: | https://www.dcc.fc.up.pt/~ines/aulas/1314/SDM/papers/DDOS.pdf | | That's where record-replay comes in. `hermit record` can record | network calls, but is in a pretty early state and doesn't | support many programs. `hermit run` can just allow networking | through and hope for the best, but in the future we plan to add | features to record _just_ network calls (and no other | syscalls), so that you can mix and match different | external-network-responses with different thread schedules. That is, you | could "pin" the network responses with network-only recording, | and then mess around with other parameters or even modify the | program. | teknopaul wrote: | Can you explain how making flakey tests not flakey helps find | bugs?
I would have thought these differences are essentially free | fuzzing and desirable? | rrnewton wrote: | Sure! I think underpinning your question is a really subtle | point. And I think the answer is in the different | purposes of regression testing and bug finding. In regression | testing (CI), you're testing whether the code introduced new | problems. At that point in time, you don't really want to know | that someone else's test downstream from your component fails | when given a new thread schedule that it has not previously | seen. Whereas if you're stress testing (including fuzzing and | concurrency testing) you probably want to torture the program | overnight to see if you can turn up new failures. | | The Coyote project at Microsoft is a concurrency testing | project with some similarities to Hermit. For the reasons | above, they say in their docs to use a constant seed for CI | regression testing, but use random exploration for bug finding: | https://www.microsoft.com/en-us/research/project/coyote/ | | Still, it does feel like wasted resources to test the same | points in the (exponentially large) schedule space again and | again. Kind of like some exploration/exploitation tradeoff. | | We don't do it yet, but I would consider doing a randomized | exploration during CI, while making the fixed-seed run the | observable semantics. If the randomized run fails, send that over to | the "bug finding" component for further study, while quickly | retrying with the known-good seed for the CI-visible regression | test results. | | I don't think there's one right policy here. But having control | over these knobs lets us be intentional about it. | | P.S. Taking the random schedules the OS gives us is kind of | "free fuzzing", but it is very BAD free fuzzing. It | over-samples the probable, boring schedules and under-samples the | more extreme corner cases. Hence concurrency bugs lurk until | the machine is under load in production and edge cases emerge.
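The constant-seed-for-CI versus random-seed-for-exploration policy discussed above can be sketched in a few lines of Rust. This is a simplified stand-in using a hand-rolled LCG, not Hermit's actual `--chaos` scheduler: the seed alone fully determines which "thread" runs at each turn, so a fixed seed replays the same interleaving and fresh seeds explore new ones.

```rust
// Tiny seeded PRNG (Knuth's MMIX LCG constants) standing in for the
// scheduler's randomness source. No external crates needed.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Produce an interleaving: which thread id runs at each of `steps`
// scheduling turns. The whole schedule is a pure function of the seed.
fn schedule(seed: u64, n_threads: u64, steps: usize) -> Vec<u64> {
    let mut rng = Lcg(seed);
    (0..steps).map(|_| rng.next() % n_threads).collect()
}

fn main() {
    // CI regression mode: a pinned seed replays one known schedule.
    assert_eq!(schedule(42, 3, 8), schedule(42, 3, 8));
    // Bug-finding mode would instead draw a fresh seed per run.
    println!("{:?}", schedule(42, 3, 8));
}
```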
| jasonwhite wrote: | Once we have complete control over the determinism of a test, | we can start to play with tweaking the non-deterministic inputs | in a controlled way. For example, we can tweak the RNG seed | used for thread scheduling to explore schedules that wouldn't | normally happen under the Linux scheduler. | stonemetal12 wrote: | How do you know if a flakey test has been fixed? A | deterministic environment can turn a flakey failure into a | repeatable one, which can then be known to be fixed. | rrnewton wrote: | This has been the culmination of several years of work | intercepting and sanitizing the Linux system call API. It's now | open source. | aseipp wrote: | Hey Ryan, just wanted to say I hope you're doing great these | days! Really glad to see the work originally done by you | and Joe at Cloudseal evolving and becoming more readily | available to all of us. I still thought back every once in a | while about what y'all were up to. :) Super excited that it's | kept on chuggin' and now is something we can all enjoy! | hoosieree wrote: | I guess FB poaching you from IU in the middle of teaching your | compiler course has a silver lining! Nice to see this being | shared with the open source community. | alex_suzuki wrote: | Is it just me or are we experiencing an uptick in high-quality, | sophisticated software projects being open-sourced by FAANG | companies? | crmd wrote: | It's great how much open source infrastructure software has | come out of FAANG in the past decade. Between Kubernetes, | React, Kafka, etc., it's wild how much of my tech stack is open | source software developed by people at large tech companies. | It's also great PR for the companies. | Enginerrrd wrote: | Facebook/Meta in particular seems to be hitting the front page | with nerd-bait research. I suspect it is indeed part of a | public relations strategy. | | And I hate to say it, but it's working on me. I've seen some | really cool projects come out of Facebook lately!
| rrnewton wrote: | Heh, I implore you to consider the engineer's-eye view. We're | tech geeks inside or outside of FAANG, with all the usual | incentives. I've been the tech lead on this project for 5 | years, through 2 companies, and of course I hope someone | finds it useful and chooses to contribute. | | I'd like to think we could help someone somehow with public | relations, but I don't think we can ;-). Actually, I don't | think any of the big techs are leaning all that much into | _recruiting_ right now though.... | crmd wrote: | Hey, I totally rewrote my comment after reading your | response. Thanks for engaging - hearing from the project | author made me realize I was acting like a cynical asshole. | Congrats on Hermit and thanks for supporting free software. | umanwizard wrote: | Are we? PyTorch, React, etc. have been around for ages. | tobyhinloopen wrote: | React? He mentioned high quality. React is a mess, especially | since hooks became dominant. | topazas wrote: | Maybe symbolic execution can also be included here? | rrnewton wrote: | It's a good question. We would like to make it usable as a | platform for dynamic analysis. The idea being that you can | control all these external factors (like thread scheduling), | find a crashing run, and then ask introspective questions about | what the code is doing in a crashing run. | | In practice, one challenge we have is bridging between the | runtime view of the software (as a debugger would see it) -- raw | machine instructions and system calls, and the static view that | you would get from analyzing the source code. | | Sanitizers, for example (ASAN, TSAN, etc.), are typically | implemented in the compiler as program instrumentations.
If we | integrated binary instrumentation tools like Intel Pin or | DynamoRio, we could perform additional dynamic analysis, but | still at the machine code rather than source code level, which | is a big gap from how symbolic execution normally happens, at | the source code / AST level. ___________________________________________________________________ (page generated 2022-11-22 23:00 UTC)