[HN Gopher] Deterministic Linux for controlled testing and softw...
       ___________________________________________________________________
        
       Deterministic Linux for controlled testing and software bug-finding
        
       Author : rrnewton
       Score  : 82 points
       Date   : 2022-11-22 18:12 UTC (4 hours ago)
        
 (HTM) web link (developers.facebook.com)
 (TXT) w3m dump (developers.facebook.com)
        
       | jasonwhite wrote:
       | TL;DR: This is a Rust project that forces deterministic execution
       | of arbitrary programs and acts like a reproducible container.
       | That is, it _hermetically_ isolates the program from sources of
       | non-determinism such as time, thread interleavings, random number
       | generation, etc. Guaranteed determinism is a powerful tool and it
       | serves as a basis for a number of applications, including
       | concurrency stress testing, record/replay, reproducible builds,
       | automatic diagnosis of concurrency bugs, and more.
       | 
       | I've been on the team working on this project over the past ~2
       | years. AMA!
       | 
       | Here is the GitHub repository:
       | https://github.com/facebookexperimental/hermit
        
         | mtlmtlmtlmtl wrote:
         | This is exactly the type of thing I've been wanting for testing
         | my chess engine. Parallelism in the engine is based on
         | emergent pseudorandom effects: thread interleaving causes
         | searcher threads to mostly end up in non-overlapping parts
         | of the search tree.
         | 
         | One question: How do you avoid the program being affected by
         | things like overall system load and memory pressure?
        
           | jasonwhite wrote:
           | Since CPU is a "compressible" resource, system load doesn't
           | affect the determinism. It'll make it slower of course. Since
           | memory is a non-compressible resource, things can start
           | getting killed by the OOM-killer and there's nothing we can
           | do about it. There are also certainly things like external
           | network communication and file system access that are non-
           | deterministic and must be handled at a higher level (e.g.,
           | with a reproducible FS image or by recording & replaying
           | network traffic).
        
             | mtlmtlmtlmtl wrote:
             | The idea of CPU being compressible is very insightful,
             | thanks.
             | 
             | I'm curious about what happens with time control though.
             | Engines can be given a time constraint and will use various
             | heuristics to allocate that time. How does Hermit intercept
             | gettimeofday()?
        
               | rrnewton wrote:
               | Well, gettimeofday is a syscall, and we do intercept it
               | (along with clock_gettime, clock_getres, time,
               | nanosleep, and the rdtsc instruction, even though that
               | last one is not a syscall). When we intercept it, we
               | report virtual time back to the guest. We make sure that
               | virtual time is deterministic, across all threads in the
               | container, irrespective of what the wall clock time is on
               | the host machine.
               | 
               | So for instance, if there are multiple threads in a chess
               | engine, and they are racing to write search results to a
               | data structure, these threads will interleave in a
               | reproducible order under Hermit, and the races will
               | resolve consistently.
               | 
               | But the downside is that Hermit does sequentialize
               | execution onto one core. So in the current version, a
               | multithreaded program doesn't get actual wall-clock
               | speedup from its parallelism. (The earlier dettrace did
               | allow some limited guest parallelism, and we plan to
               | bring that back.) For this reason, Hermit's good for
               | consistent testing of multithreaded software, but you
               | wouldn't want to run parallel software under it outside
               | of testing.
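               | 
               | To make the virtual-clock idea concrete, here is a tiny
               | sketch (not Hermit's actual code; every name here is
               | made up). One logical clock is shared by the whole
               | container and advances only on deterministic events, so
               | every run that executes the same turns observes the
               | same timestamps:
               | 
               | use std::sync::atomic::{AtomicU64, Ordering};
               | 
               | // Fixed, fake "boot" time and a made-up cost per turn.
               | const EPOCH_NS: u64 = 1_600_000_000_000_000_000;
               | const NS_PER_TURN: u64 = 10_000;
               | 
               | static TURNS: AtomicU64 = AtomicU64::new(0);
               | 
               | // Called by the sequential scheduler once per turn.
               | fn commit_turn() {
               |     TURNS.fetch_add(1, Ordering::SeqCst);
               | }
               | 
               | // What an intercepted gettimeofday()/clock_gettime()
               | // would report back to the guest.
               | fn virtual_now_ns() -> u64 {
               |     EPOCH_NS + TURNS.load(Ordering::SeqCst) * NS_PER_TURN
               | }
               | 
               | fn main() {
               |     for _ in 0..3 {
               |         commit_turn();
               |         println!("guest sees t = {} ns", virtual_now_ns());
               |     }
               |     // Same three timestamps on every run, regardless
               |     // of the host's real clock.
               | }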
        
         | comex wrote:
         | Nice. Some questions:
         | 
         | - How does this compare with rr?
         | 
         | - Out of curiosity, have you ever looked at libTAS? (I realize
         | it has a very different intended use case.)
         | 
         | - Have you had any issues with behavior differences between
         | CPUs? I know there is architecture-level undefined behavior
         | that can differ between CPUs; on the other hand, it sounds like
         | you're primarily interested in running well-behaved executables
         | that wouldn't intentionally invoke such behavior.
        
           | jasonwhite wrote:
           | - We've taken a lot of inspiration from RR and it is very
           | good at what it does (record/replay + randomizing thread
           | schedules). This project diverges a bit from RR in that we
           | focus more on the concurrency testing application of
           | determinism. I wrote the toy record/replay system that sits
           | on top of the determinism layer (so that we don't have to
           | record as much stuff), but it's a long way from being
           | complete.
           | 
           | - I haven't seen libTAS until now, so I can't comment on it
           | much. At first glance, it does look similar!
           | 
           | - See @rrnewton's reply on this one.
        
           | rrnewton wrote:
           | (1) rr [formerly Mozilla rr]
           | 
           | We're big fans of rr!
           | 
           | Hermit is different in that it creates a deterministic OS
           | semantics, rather than recording whatever nondeterministic
           | behavior occurs under normal Linux. BUT,
           | there's a lot of overlap. And indeed `hermit record` is
           | straight up RnR (record & replay).
           | 
           | But hermit's RnR is not nearly as developed as rr's. We
           | integrate with gdb/lldb as an (RSP) debugger backend, just
           | like rr. Any failing execution you can create with hermit,
           | you can attach a debugger to. But our support is very
           | preliminary, and you'll probably find rough edges. Also, we
           | don't support backwards stepping yet (except by running
           | again).
           | 
           | If we invest more in using Hermit as a debugger (rather than
           | for finding and analyzing concurrency bugs), then there
           | should be _some_ advantages over traditional RnR. These would
           | relate to the fact that deterministically executing is
           | different than recording. For example, process and thread
           | IDs, and memory addresses all stay the same across multiple
           | runs of the program, even as you begin adding printfs and
           | modifying the program to fix the bug. With traditional RnR,
           | you can play the same recording as many times as you like,
           | but as soon as you take a second recording all bets are off
           | wrt what is the same or different compared to the prior
           | recording. (That includes losing the "mental state" of
           | things like tids & memory addresses, which is a good point
           | Robert O'Callahan makes about the benefits of RnR when
           | accessing the same recording multiple times.)
           | 
           | (2) libTAS - no we haven't! Checking it out now.
           | 
           | (3) Yes, definitely issues with CPU portability.
           | 
           | In general, we are interested in not just determinism on the
           | same machine, but portability between machines in our fleet.
           | As with any tech company that uses the cloud, at Meta people
           | are usually trying to debug an issue on a different machine
           | than where the problem occurred. I.e. taking a crash from a
           | production or CI machine to a local dev machine.
           | 
           | The way we do this is that we mostly report a fairly old CPU
           | to the guest, which disables certain features IF the guest is
           | well behaved.
           | 
           | With the current processor tech, I don't think there's any
           | way we can stop an adversarial program, which, for example,
           | would execute CPUID, find that RDRAND is not supported on the
           | processor, but then execute RDRAND anyway. We could build a
           | much more invasive binary-instrumentation based emulator that
           | _would_ be able to enforce these kinds of rules at the
           | instruction granularity, but it would have higher overhead,
           | especially startup overhead. The nice thing about Reverie
           | though is that we (or others) can add different
           | instrumentation backends while keeping the same
           | instrumentation API. So we could have a "hardened" backend
           | that was more about sandboxing and reverse-engineering
           | adversarial software, making a different tradeoff with
           | respect to performance overhead.
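           | 
           | For the well-behaved case, reporting an old CPU mostly
           | comes down to masking feature bits in the CPUID results we
           | hand back to the guest. A rough sketch of just that
           | bit-masking step (hypothetical; not our real code, and not
           | the Reverie API):
           | 
           | // RDRAND support is advertised in CPUID leaf 1, ECX bit 30.
           | const RDRAND_BIT: u32 = 1 << 30;
           | 
           | #[allow(dead_code)]
           | struct CpuidResult { eax: u32, ebx: u32, ecx: u32, edx: u32 }
           | 
           | // Called when the tracer intercepts a CPUID from the guest.
           | fn filter_cpuid(leaf: u32, mut host: CpuidResult)
           |     -> CpuidResult {
           |     if leaf == 1 {
           |         host.ecx &= !RDRAND_BIT; // hide RDRAND from the guest
           |     }
           |     host
           | }
           | 
           | fn main() {
           |     let host = CpuidResult {
           |         eax: 0, ebx: 0, ecx: RDRAND_BIT, edx: 0,
           |     };
           |     let guest = filter_cpuid(1, host);
           |     assert_eq!(guest.ecx & RDRAND_BIT, 0);
           |     println!("guest is told RDRAND is unsupported");
           | }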
        
         | CJefferson wrote:
         | This looks cool.
         | 
         | Personally I'd also be interested in this for academic work --
         | anything which makes it easier to be sure an experiment can be
         | reproduced later (a week, year, or decade later, in increasing
         | order of difficulty), is good to have.
        
           | rrnewton wrote:
           | Yes, I'm very interested in that as well. I've been involved
           | with the ACM Artifact Evaluation process which has been going
           | on in several conferences for a while.
           | 
           | https://www.acm.org/publications/policies/artifact-review-
           | ba...
           | 
           | But it's been pretty frustrating. As an author, my PLDI 2014
           | artifact stopped working less than 5 years later (Docker
           | image binary incompatibility). And when I was co-chair of an
           | Artifact Evaluation Committee in 2017, there was not great
           | reproducibility of the artifacts that were submitted either.
           | 
           | If you package a VM (freezing the Linux kernel), and are
           | pretty sure that VM will run in 10 years, PLUS you
           | determinize the execution itself... that should allow durable
           | bitwise reproducibility. Maybe Hermit could be one ingredient
           | of that.
           | 
           | For scientific reproducibility, there is a lot of other
           | tooling to build too, and I know some folks have been working
           | in that area:
           | 
           | https://ctuning.org/
        
         | gmartres wrote:
         | Thanks for open-sourcing this! Roughly, what's the performance
         | overhead from running code under hermit? I'm wondering if this
         | could be used for doing benchmarking with less variance on non-
         | deterministic platforms such as the JVM (I assume hermit is
         | "deterministic enough" that the JIT and GC threads of the JVM
         | will run the same code on every execution?)
        
           | rrnewton wrote:
           | Alas the performance overhead in _realtime_ is not great yet.
           | It still uses ptrace currently, which often results in a
           | multiple-X slowdown (but at least it doesn't "subscribe" to
           | every syscall like strace does, because some are naturally
           | deterministic). Reverie's whole design is to make it support
           | swappable backends, and this ptrace backend is just the
           | reference implementation. The `experimental/reverie-sabre`
           | directory in the Reverie repo contains our high performance
           | backend, but it's still work-in-progress. It uses binary
           | instrumentation and in our early experiments is 10X faster
           | than our current backend in the worst case (i.e. strace is
           | >10X faster when rewritten with reverie-sabre and run on a
           | program that does nothing but syscalls).
           | 
           | But to the second part of your question about deterministic
           | benchmarking, that is really a separate question. Hermit
           | defines a deterministic notion of virtual time, which is
           | based on the branches retired and system calls executed by
           | all threads. When you run hermit with `--summary`, it reports
           | a total "Elasped virtual global time", which is completely
           | deterministic:
           | 
           | $ hermit run --summary /bin/date
           | 
           | ...
           | 
           | Elapsed virtual global (cpu) time: 5_039_700ns
           | 
           | Therefore, any program that runs under hermit can get this
           | deterministic notion of performance. We figured that could be
           | useful for setting performance regression tests with very
           | small regression margins (<1%), which you can't do on normal
           | noisy systems. Compilers are one place I've worked where we
           | wanted smaller performance regression alarms (for generated
           | code) than we could achieve in practice. We haven't actually
           | explored this application yet though. There's a whole small
           | field of people studying performance modeling and prediction,
           | and if one wanted to try this deterministic benchmarking
           | approach, they might want to take some of that knowledge
           | and build a performance model that is more accurate (better
           | correlated with wall time) than Hermit's current virtual
           | time.
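           | 
           | As a rough illustration of how a tight regression gate
           | could sit on top of that number, here is a sketch that
           | shells out to the command above and compares against a
           | stored baseline (only the command and the output line are
           | as shown earlier; the 1% margin and the baseline handling
           | are invented for illustration):
           | 
           | use std::process::Command;
           | 
           | // Run `hermit run --summary <prog>` and pull out the
           | // "Elapsed virtual global (cpu) time" in nanoseconds.
           | fn virtual_time_ns(prog: &str) -> u64 {
           |     let out = Command::new("hermit")
           |         .args(["run", "--summary", prog])
           |         .output()
           |         .expect("failed to launch hermit");
           |     let mut text =
           |         String::from_utf8_lossy(&out.stdout).into_owned();
           |     text.push_str(&String::from_utf8_lossy(&out.stderr));
           |     for line in text.lines() {
           |         if !line.contains("Elapsed virtual global") {
           |             continue;
           |         }
           |         let val = line.split(':').last().unwrap();
           |         let digits: String = val
           |             .chars()
           |             .filter(|c| c.is_ascii_digit())
           |             .collect();
           |         return digits.parse().expect("unparsable time");
           |     }
           |     panic!("no virtual time found in hermit output");
           | }
           | 
           | fn main() {
           |     let baseline_ns: u64 = 5_039_700; // from a prior run
           |     let now = virtual_time_ns("/bin/date");
           |     // Determinism is what makes a <1% margin workable.
           |     let limit = baseline_ns + baseline_ns / 100;
           |     assert!(now <= limit, "regression: {now}ns > {limit}ns");
           |     println!("ok: {now}ns is within 1% of baseline");
           | }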
        
             | dekhn wrote:
             | Does running /bin/date under hermit always return the same
             | time? Or does it just follow the same codepath to retrieve
             | the actual time?
        
           | [deleted]
        
         | wyldfire wrote:
         | > AMA!
         | 
         | Eager to try it but encountering the build error here -
         | https://github.com/facebookexperimental/hermit/issues/11
         | 
         | Do you have a reference build log / environment you can share?
         | Last known good commit sha and/or output from "rustup show"?
        
           | jasonwhite wrote:
           | We're working on it! Should be fixed soon.
        
             | rrnewton wrote:
             | Reverted a badly-timed breaking change that came through
             | the sync system. Will fix it properly shortly (and add a
             | Dockerfile and release tag). But for now you may have
             | better luck on the main branch after that reversion, which
             | yielded 6cb5575ffd287289769144ec82e2900cbf6cd1ad.
             | 
             | Let's discuss further on that issue #11.
        
         | rrnewton wrote:
         | Note that this is a follow-on project from the earlier Dettrace
         | system, which was applied mainly to reproducible builds (as in
         | the academic paper,
         | https://dl.acm.org/doi/10.1145/3373376.3378519, and presented
         | to the Debian Reproducible Builds summit):
         | 
         | - https://github.com/dettrace/dettrace
         | 
         | And one cool part of it is this Rust program instrumentation
         | layer:
         | 
         | - https://github.com/facebookexperimental/reverie
         | 
         | It's good for building OS-emulator style projects or tracing
         | tools.
        
         | oulipo wrote:
         | Would something like this work for embedded code, like ESP32?
        
           | rrnewton wrote:
           | Well, probably not _on device_ ;-).
           | 
           | The underlying Reverie instrumentation layer works on ARM,
           | but Hermit isn't ported yet, and we haven't touched RISC-V
           | yet at all. (Contributions welcome!)
           | 
           | One thing we haven't tried yet is just putting a whole
           | emulator (qemu etc) underneath Hermit. That would address any
           | sources of irreproducibility that the emulator lets through
           | from the host (threads, RNG, etc).
        
         | wyldfire wrote:
         | > thread interleavings
         | 
         | I was going to ask if it could do the flip side - instead of
         | stabilizing the scheduler, make it less predictable.
         | 
         | AFAICT, it can! Awesome, looking forward to giving it a try.
         | hermit run --chaos --seed-from=SystemRandom
         | ./target/debug/hello_race;
        
           | rrnewton wrote:
           | Yes indeed.
           | 
           | That concurrency testing capability is a pretty well-studied
           | area and we implement a couple existing algorithms. The first
           | is our adaptation of the PCT algorithm (ASPLOS'10
           | https://www.microsoft.com/en-us/research/wp-
           | content/uploads/...). That's what you get by default with
           | `--chaos`.
           | 
           | But we also have variations on a straight-up randomized
           | scheduler (random thread selection at each time step).
           | 
           | rr chaos mode has its own take on this:
           | https://robert.ocallahan.org/2016/02/introducing-rr-chaos-
           | mo...
           | 
           | This study compares a few approaches - http://www.doc.ic.ac.u
           | k/~afd/homepages/papers/pdfs/2016/TOPC....
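           | 
           | For the straight-up randomized scheduler, the core loop is
           | tiny, and seeding it is what makes an entire schedule
           | replayable. A minimal sketch (made up for illustration, not
           | our implementation):
           | 
           | // Same seed => same sequence of picks => same interleaving.
           | struct Rng(u64);
           | 
           | impl Rng {
           |     // xorshift64*: a small deterministic PRNG, enough
           |     // for a demo.
           |     fn next(&mut self) -> u64 {
           |         let mut x = self.0;
           |         x ^= x >> 12;
           |         x ^= x << 25;
           |         x ^= x >> 27;
           |         self.0 = x;
           |         x.wrapping_mul(0x2545F4914F6CDD1D)
           |     }
           | }
           | 
           | fn main() {
           |     let runnable = ["t1", "t2", "t3"]; // stand-in run pool
           |     let mut rng = Rng(42); // the seed, in effect
           |     for turn in 0..5 {
           |         let i = (rng.next() % runnable.len() as u64) as usize;
           |         println!("turn {turn}: run {}", runnable[i]);
           |         // A real scheduler would let that thread run to its
           |         // next syscall or preemption point, then recompute
           |         // the run pool.
           |     }
           | }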
        
       | daniel-levin wrote:
       | Neat! This is the direction I'd hoped to see gvisor go in. What's
       | the reasoning for building from scratch and not piggybacking off
       | gvisor?
        
         | jasonwhite wrote:
         | We certainly looked into gVisor and Firecracker when we started
         | this project a few years ago. These systems use KVM and gVisor
         | in particular uses the Model Specific Registers (MSRs) to
         | intercept system calls before forwarding them to the host
         | kernel. Intercepting syscalls this way has less overhead than
         | ptrace and we would have complete control over the system
         | environment. I think it's a good approach and worth exploring
         | more, but ultimately the deal breaker was that KVM requires
         | root privileges to run and it wouldn't run on our already-
         | virtualized dev machines. We also wanted to allow the guest
         | program to interact with the host's file system. So, we went
         | with good ol' ptrace. Last I checked gVisor also has a ptrace
         | backend, but it wasn't very far along at the time. When going
         | the ptrace route, there is less reason to depend on another
         | project. Another reason of course is that we'd be beholden to a
         | Google project. ;)
        
       | srosenberg wrote:
       | Great work and thanks for making it OSS! I was familiar with the
       | prior (academic) work and its limitations, specifically TCP/IP.
       | Could you elaborate on how you solved that problem?
        
         | rrnewton wrote:
         | Sure! So it really breaks down into two cases: internal and
         | external networking relative to the container Hermit creates.
         | 
         | (1) internal networking
         | 
         | If you run a test like `rust/network_hello_world.rs` under
         | Hermit, then the communication between threads is part of the
         | "deterministic bubble" that we're running inside of. When one
         | thread blocks on a network call, the Hermit scheduler takes the
         | thread out of the run pool, and it has to deterministically
         | decide when it is ready to rejoin the run-pool by waking up.
         | The scheduler proceeds in linear turns (labeled "COMMIT" in the
         | logs), and if thread 5 unblocks from a network read at turn 100
         | in one run, it must unblock at that same point in time in all
         | other runs.
         | 
         | Sometimes we use a precise model of the blocking operation
         | (like with futexes) and other times we depend on sending Linux
         | a non-blocking version of the syscall as a way to poll the IO
         | and see if it is ready to complete (given the history of every
         | operation that has committed on turns 1..N-1).
         | 
         | (2) external networking
         | 
         | This is impossible to determinize, of course. Unless you suck
         | the whole network including both hosts into the deterministic
         | bubble, as the DDOS fork of Linux experimented with in ~2013.
         | That was kind of a negative result IMO because performance was
         | pretty bad, but the paper is here:
         | https://www.dcc.fc.up.pt/~ines/aulas/1314/SDM/papers/DDOS.pdf
         | 
         | That's where record-replay comes in. `hermit record` can record
         | network calls, but is in a pretty early state and doesn't
         | support many programs. `hermit run` can just allow networking
         | through and hope for the best, but in the future we plan to add
         | features to record _just_ network calls (and no other
         | syscalls), so that you can mix and match different external-
         | network-responses with different thread schedules. That is, you
         | could "pin" the network responses with network-only recording,
         | and then mess around with other parameters or even modify the
         | program.
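         | 
         | To illustrate the polling trick from (1) above: when a guest
         | thread makes a blocking network read, the tracer can issue a
         | non-blocking attempt on its behalf each turn, and only wake
         | the thread once data has actually arrived. Here is a sketch
         | of that idea using std::net rather than raw syscalls
         | (simplified and hypothetical; not Hermit's code):
         | 
         | use std::io::{ErrorKind, Read, Write};
         | use std::net::{TcpListener, TcpStream};
         | 
         | // One poll attempt per scheduler turn: Some(n) if the read
         | // completed, None if it would still block.
         | fn try_read(sock: &mut TcpStream, buf: &mut [u8])
         |     -> Option<usize> {
         |     sock.set_nonblocking(true).expect("set_nonblocking");
         |     match sock.read(buf) {
         |         Ok(n) => Some(n),
         |         Err(e) if e.kind() == ErrorKind::WouldBlock => None,
         |         Err(e) => panic!("read failed: {e}"),
         |     }
         | }
         | 
         | fn main() -> std::io::Result<()> {
         |     let listener = TcpListener::bind("127.0.0.1:0")?;
         |     let mut client = TcpStream::connect(listener.local_addr()?)?;
         |     let (mut server, _) = listener.accept()?;
         | 
         |     let mut buf = [0u8; 8];
         |     // Turn N: nothing sent yet, so the "thread" stays blocked.
         |     assert!(try_read(&mut server, &mut buf).is_none());
         | 
         |     client.write_all(b"hello")?;
         |     // Some later turn: data is there, so the thread rejoins
         |     // the run pool with the read's result.
         |     loop {
         |         if let Some(n) = try_read(&mut server, &mut buf) {
         |             println!("woke up with {} bytes", n);
         |             break;
         |         }
         |     }
         |     Ok(())
         | }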
        
       | teknopaul wrote:
       | Can you explain how making flakey tests not flakey helps find
       | bugs? I would have thought these differences are essentially free
       | fuzzing and desirable?
        
         | rrnewton wrote:
         | Sure! I think there's a really subtle point underpinning your
         | question. And I think the answer lies in the different
         | purposes of regression testing and bug finding. In regression
         | testing (CI), you're testing if the code introduced new
         | problems. You don't at that point in time really want to know
         | that someone else's test downstream from your component fails
         | when given a new thread schedule that it has not previously
         | seen. Whereas if you're stress testing (including fuzzing and
         | concurrency testing) you probably want to torture the program
         | overnight to see if you can turn up new failures.
         | 
         | The Coyote project at Microsoft is a concurrency testing
         | project with some similarities to Hermit. For the reasons
         | above, they say in their docs to use a constant seed for CI
         | regression testing, but use random exploration for bug finding:
         | https://www.microsoft.com/en-us/research/project/coyote/
         | 
         | Still, it does feel like wasted resources to test the same
         | points in the (exponentially large) schedule space again and
         | again. Kind of like some exploration/exploitation tradeoff.
         | 
         | We don't do it yet, but I would consider doing randomized
         | exploration during CI while making the observable semantics
         | the fixed version. If the randomized one fails, send that over to
         | the "bug finding" component for further study, while quickly
         | retrying with the known-good seed for the CI visible regression
         | test results.
         | 
         | I don't think there's one right policy here. But having control
         | over these knobs lets us be intentional about it.
         | 
         | P.S. Taking the random schedules the OS gives us is kind of
         | "free fuzzing", but it is very BAD free fuzzing. It over-
         | samples the probable, boring schedules and under-samples the
         | more extreme corner cases. Hence concurrency bugs lurk until
         | the machine is under load in production and edge cases emerge.
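         | 
         | Going back to the randomized-exploration-in-CI idea above,
         | here's a sketch of that policy as a small wrapper (the
         | --chaos and --seed-from=SystemRandom flags appear elsewhere
         | in this thread; the --seed=<N> spelling, the test path, and
         | the rest are assumptions for illustration):
         | 
         | use std::process::Command;
         | 
         | fn passes(extra: &[&str], test: &str) -> bool {
         |     Command::new("hermit")
         |         .arg("run")
         |         .args(extra)
         |         .arg(test)
         |         .status()
         |         .expect("failed to launch hermit")
         |         .success()
         | }
         | 
         | fn main() {
         |     let test = "./target/debug/my_test"; // hypothetical binary
         | 
         |     // Exploration leg: a fresh random schedule each CI run.
         |     if !passes(&["--chaos", "--seed-from=SystemRandom"], test) {
         |         eprintln!("random schedule failed; hand to bug-finding");
         |         // (A real setup would capture the seed so the
         |         // failure replays exactly.)
         |     }
         | 
         |     // Gating leg: the CI-visible result uses a fixed,
         |     // known-good seed (--seed=<N> is an assumed spelling).
         |     let ok = passes(&["--chaos", "--seed=12345"], test);
         |     std::process::exit(if ok { 0 } else { 1 });
         | }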
        
         | jasonwhite wrote:
         | Once we have complete control over the determinism of a test,
         | we can start to play with tweaking the non-deterministic inputs
         | in a controlled way. For example, we can tweak the RNG seed
         | used for thread scheduling to explore schedules that wouldn't
         | normally happen under the Linux scheduler.
        
         | stonemetal12 wrote:
         | How do you know if a flakey test has been fixed? A
         | deterministic environment can turn flakiness into a repeatable
         | failure, which can then be shown to be fixed.
        
       | rrnewton wrote:
       | This has been the culmination of several years of work
       | intercepting and sanitizing the Linux system call API. It's now
       | open source.
        
         | aseipp wrote:
         | Hey Ryan, just wanted to say I hope you're doing great these
         | days! Really glad to see the work originally done by you
         | and Joe at Cloudseal evolving and becoming more readily
         | available to all of us. I still thought back every once in a
         | while about what y'all were up to. :) Super excited that it's
         | kept on chuggin' and now is something we can all enjoy!
        
         | hoosieree wrote:
         | I guess FB poaching you from IU in the middle of teaching your
         | compiler course has a silver lining! Nice to see this being
         | shared with the open source community.
        
       | alex_suzuki wrote:
       | Is it just me or are we experiencing an uptick in high-quality,
       | sophisticated software projects being open-sourced by FAANG
       | companies?
        
         | crmd wrote:
         | It's great how much open source infrastructure software has
         | come out of FAANG in the past decade. Between Kubernetes,
         | react, Kafka, etc it's wild how much of my tech stack is open
         | source software developed by people at large tech companies.
         | It's also great PR for the companies.
        
           | Enginerrrd wrote:
           | Facebook/Meta in particular seem to be hitting the front page
           | with nerd-bait research. I suspect it is indeed part of a
           | public relations strategy.
           | 
           | And I hate to say it, but it's working on me. I've seen some
           | really cool projects come out of Facebook lately!
        
           | rrnewton wrote:
           | Heh, I implore you to consider the engineer's-eye view. We're
           | tech geeks inside or outside of FAANG, with all the usual
           | incentives. I've been the tech lead on this project for 5
           | years, through 2 companies, and of course I hope someone
           | finds it useful and chooses to contribute.
           | 
           | I'd like to think we could help someone somehow with public
           | relations, but I don't think we can ;-). Actually, I don't
           | think any of the big techs are leaning all that much into
           | _recruiting_ right now though....
        
             | crmd wrote:
             | Hey, I totally rewrote my comment after reading your
             | response. Thanks for engaging - hearing from the project
             | author made me realize I was acting like a cynical asshole.
             | Congrats on Hermit and thanks for supporting free software.
        
         | umanwizard wrote:
         | Are we? PyTorch, React, etc. have been around for ages.
        
           | tobyhinloopen wrote:
           | React? He mentioned high quality. React is a mess, especially
           | when hooks became dominant.
        
       | topazas wrote:
       | Maybe symbolic execution can also be included here?
        
         | rrnewton wrote:
         | It's a good question. We would like to make it usable as a
         | platform for dynamic analysis. The idea being that you can
         | control all these external factors (like thread scheduling),
         | find a crashing run, and then ask introspective questions of
         | what the code is doing in a crashing run.
         | 
         | In practice, one challenge we have is bridging between the
         | runtime view of the software (as a debugger would see it) -- raw
         | machine instructions and system calls, and the static view that
         | you would get from analyzing the source code.
         | 
         | Sanitizers, for example (ASAN, TSAN, etc), are typically
         | implemented in the compiler as program instrumentation. If we
         | integrated binary instrumentation tools like Intel Pin or
         | DynamoRio, we could perform additional dynamic analysis, but
         | still at the machine code rather than source code level, which
         | is a big gap from how symbolic execution normally happens, at
         | the source code / AST level.
        
       ___________________________________________________________________
       (page generated 2022-11-22 23:00 UTC)