[HN Gopher] Running e2e tests 10x faster using firecracker VMs
       ___________________________________________________________________
        
       Running e2e tests 10x faster using firecracker VMs
        
       Author : samanthachai
       Score  : 114 points
       Date   : 2022-04-17 17:00 UTC (6 hours ago)
        
 (HTM) web link (webapp.io)
 (TXT) w3m dump (webapp.io)
        
       | fideloper wrote:
       | What do y'all run firecracker on? The metal servers on aws (the
       | only servers you can run firecracker on in aws) are pretty
       | expensive!
        
       | neatze wrote:
       | What do these e2e tests do in webapp?
       | 
       | I don't understand why you need to rebuild the Docker image on
       | every app build; this seems really wasteful.
        
         | n8ta wrote:
         | If the app itself is part of the image you need to rebuild the
         | image every time a dev wants to test their change.
        
           | mtoddsmith wrote:
           | Is that the same as redeploying your app to an existing
           | container?
        
       | goodpoint wrote:
       | This has been done successfully using VMs for two decades.
        
       | yewenjie wrote:
       | Interesting. What other cool things are people doing with
       | Firecracker?
        
         | cpach wrote:
         | Fly built a whole platform with Firecracker VMs:
         | https://fly.io/
        
         | sjosh003 wrote:
         | I have been using Weave Ignite [1] recently to run Firecracker
         | micro vm(s) instead of containers for a multitude of tasks!
         | 
         | 1. https://github.com/weaveworks/ignite
        
       | kaivalyagandhi wrote:
       | interesting, I wonder if you can use this with GitHub self hosted
       | runners?
        
       | StreamBright wrote:
       | Great article. Firecracker has been an amazing addition to my
       | toolkit, and it is good to see it succeeding in solving
       | real-world problems.
        
       | rossmohax wrote:
       | They seem to be comparing CI runner starting from scratch to
       | always on VM with firecracker preconfigured.
        
         | nicoburns wrote:
         | Firecracker _is_ a CI runner starting a VM for each run in this
         | case, just a more optimised one, no?
        
       | greatgib wrote:
       | It always amazes me to see the new trend of DevOps people who
       | will happily follow such a tutorial, wget-ing and running
       | random code from the internet in production...
        
         | jrockway wrote:
         | I don't think this is production; this is for running your
         | tests. Code in the "tests haven't run yet" state could leak
         | all the secrets it has access to and destroy the machine it's
         | running on, so you don't give it any secrets and you create a
         | new machine each time. "curl | bash" here just injects
         | potential flakiness (as does "npm install" when npm dies,
         | etc.)
         | 
         | Obviously a lot of people treat their CI system as their CD
         | system, and do things like letting tests have highly privileged
         | access to their production k8s cluster. That's a terrible idea
         | even if you aren't installing software with "curl | bash".
         | 
         | So overall, I don't think this is worth a HN comment to
         | complain about. People are going to install software in non-
         | auditable non-reproducible ways.
        
       | CraigJPerry wrote:
       | There's an even faster strategy than this and it's easier to
       | set up.
       | 
       | You're going to deploy 4 CI pipelines (so make sure you're not
       | manually putting together CI pipeline configs; use automation):
       | 
       | Pipeline 1: A conveyor belt of environments. All this pipeline
       | does is spin up fresh environments then run a short automated
       | smoke test. Hydrate the env with the most recent mask from prod.
       | The trigger condition is that there are fewer than <Threshold>
       | environments available. I did 8 on a whim and never saw a need to
       | change it.
       | 
       | Pipeline 2: Normal garden-variety CI pipeline triggered on
       | merges to main. It persists two artifacts: a built package and
       | your unit test evidence.
       | 
       | Pipeline 3: Test your automated deployment by deploying the
       | package built in #2 into the first of the queue of free envs
       | from #1, then trigger your end-to-end, integration, and
       | contract tests. Don't run your security or operability tests
       | here.
       | 
       | Pipeline 4: Async pipeline triggered on a 6hr schedule. Do your
       | long-running stuff here, like fuzz testing and security tests,
       | outside of the dev cycle.
       | 
       | Release candidates can only be signed after a successful run
       | through 2, 3 & 4. That means prod deploys are on a predictable
       | cadence, which users and ops usually appreciate more than "we
       | drop it in when it's ready".
       | 
       | The DevEx is pretty sweet - you don't see pipeline 1 or 4 in your
       | build loop. Only the runtime of 3 would be comparable to the
       | article - slightly faster than the article because no firecracker
       | bringup overhead, no matter how small that is.
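
A minimal sketch of the "conveyor belt" from pipeline 1 and the claim step from pipeline 3, written in Python for illustration. The `EnvPool` class, the `provision` stand-in, and the `env-N` names are assumptions, not anything the comment specifies; only the threshold of 8 comes from the comment.

```python
from collections import deque
from itertools import count

THRESHOLD = 8  # the value the comment says was picked "on a whim"


class EnvPool:
    """Queue of pre-warmed, smoke-tested environments (pipeline 1)."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.ready = deque()
        self._ids = count(1)

    def provision(self):
        # Hypothetical stand-in: a real pipeline would create the
        # environment, hydrate it with data, and smoke test it here.
        return f"env-{next(self._ids)}"

    def refill(self):
        """Top up the queue while fewer than `threshold` envs are ready."""
        created = []
        while len(self.ready) < self.threshold:
            env = self.provision()
            self.ready.append(env)
            created.append(env)
        return created

    def claim(self):
        """Pipeline 3 takes the first free env, so it never waits for bring-up."""
        if not self.ready:
            raise RuntimeError("no pre-warmed environment available")
        return self.ready.popleft()


pool = EnvPool()
pool.refill()       # pipeline 1 fills the queue ahead of time
env = pool.claim()  # pipeline 3 deploys into the oldest ready env
```

The point of the design is that provisioning cost is paid before a merge happens, so only the deploy and test time shows up in the developer's build loop.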
        
         | drjasonharrison wrote:
         | There are times when some corner of software development speaks
         | a specialized language and this is an example.
         | 
         | 1. Conveyor belt(?) of environments. Hydrate(?) the
         | env(ironment). Mask(?) from prod(uction)
         | 
         | 2. I think I got this. Typical "merge to main pipeline" with
         | built product and test results as outputs.
         | 
         | DevEx(?). And not sure why I wouldn't see pipeline #4 in my
         | build loop because I can't deploy unless 2, 3 and 4 pass....
         | Maybe you mean I don't wait to see it.
         | 
         | Also not sure how it's faster because environments still need
         | to be brought up. Unless you are trying to say that the
         | environment is already running when the merge to master
         | pipeline succeeds.
        
       | forgotusername6 wrote:
       | Used to do something similar with vSphere a while back. The
       | servers took ages to get into the right state to test, so it
       | was much easier to just revert to a snapshot to get a clean
       | state.
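
The revert-to-snapshot pattern can be sketched with plain directories standing in for VM state. This is only an analogy, assuming a filesystem copy in place of a real vSphere or Firecracker snapshot; the function names and file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def take_snapshot(env_dir: Path, snapshot_dir: Path) -> None:
    """Save a copy of the prepared environment (done once, after slow setup)."""
    if snapshot_dir.exists():
        shutil.rmtree(snapshot_dir)
    shutil.copytree(env_dir, snapshot_dir)


def revert_to_snapshot(env_dir: Path, snapshot_dir: Path) -> None:
    """Discard whatever the last test run left behind and restore the copy."""
    shutil.rmtree(env_dir, ignore_errors=True)
    shutil.copytree(snapshot_dir, env_dir)


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        env, snap = Path(tmp) / "env", Path(tmp) / "snapshot"
        env.mkdir()
        (env / "seed.sql").write_text("-- seeded data\n")
        take_snapshot(env, snap)

        # A test run dirties the environment.
        (env / "leftover.tmp").write_text("junk")
        (env / "seed.sql").unlink()

        revert_to_snapshot(env, snap)
        return sorted(p.name for p in env.iterdir())
```

After `demo()` the environment contains only the seeded file again, which is the "clean state" each test run starts from.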
        
       | wyldfire wrote:
       | Gee, why not just go straight to step 3 via fork/exec? Bound to
       | shave off a few milliseconds beyond that 10x. And no firecracker
       | required.
        
         | melony wrote:
         | If you are a cloud host, you need a way to sandbox hostile
         | code. Firecracker allows you to do that (it builds on the
         | traditional KVM virtualization system but is lighter and
         | faster: instead of booting a VPS, which can take minutes, you
         | can now spawn one in under a second).
        
           | FooBarWidget wrote:
           | Not just sandboxing, but just ensuring that each test runs in
           | a clean environment, without interference from
           | files/processes left behind by a previous or even concurrent
           | test.
        
           | wyldfire wrote:
           | To clarify my post: I see the reason for Firecracker to exist
           | in general, it's great. But does "e2e tests" include
           | untrusted code? I think it really shouldn't.
           | 
           | So why use firecracker here? Invoking your tests in a bare VM
           | or container is great for making sure that you are
           | controlling the environment and enumerating your system
           | dependencies. But this post proposes discarding those things
           | and instead using some saved state as the entry point into
           | your Firecracker. So now you are booting from Your Image
           | instead of a { Official Distro Image + Dependency Recipe }.
           | It seems like a step backward.
        
             | chrisseaton wrote:
             | > But does "e2e tests" include untrusted code?
             | 
             | What other possible way could CI work?
        
             | melony wrote:
             | The company that wrote the article is an e2e testing _cloud
             | hosting_ company that runs your code in _their cloud_.
        
             | colinchartier wrote:
             | Author here!
             | 
             | I think there's always been a push/pull of "fat base
             | images" versus "install everything every time" - It's
             | obviously subjective, but I think it's more important to
             | run the tests on every commit than it is to start the
             | environment from scratch.
             | 
             | It's also not necessarily mutually exclusive, you could
             | have a "staging branch" where you make something that looks
             | a lot like production and then re-run end-to-end tests
             | there, while running the per-branch tests with this method
             | to avoid slowing down developers.
        
         | legulere wrote:
         | Because process isolation under Unix is pretty lax. Processes
         | by default have all the rights of the user. And you might end
         | up with a system different from the initial state.
       | ithkuil wrote:
       | Firecracker is great and all, but the core idea described here
       | also works with plain Docker; i.e., there is nothing inherently
       | Firecracker-specific about the basic technique.
        
         | colinchartier wrote:
         | Author here!
         | 
         | The three big differences are:
         | 
         | 1. Docker doesn't deal with running processes (like postgres or
         | redis), only the filesystem state
         | 
         | 2. Docker doesn't have enough isolation, so you'd probably need
         | to run it within qemu or firecracker for compliance in bigger
         | teams
         | 
         | 3. Docker-in-docker is still pretty painful, if you need to do
         | anything nonstandard like change the size of /dev/shm, access
         | /dev/kvm, or load kernel drivers, it'll take custom
         | configuration.
        
           | ignoramous wrote:
           | Hi, offtopic but: is webapp.io a pivot from layerci, or just
           | a rebranding?
           | 
           | Interesting that you folks now use Firecracker. I assume it
           | now fills in adequately for the previously homegrown tech at
           | layerci [0]?
           | 
           | [0] https://news.ycombinator.com/item?id=25979941
        
             | colinchartier wrote:
             | Just a rebranding! (The technology's gotten better as well,
             | of course - we didn't used to use firecracker at all)
             | 
             | https://webapp.io/blog/layerci-has-rebranded-to-webapp-io/
        
           | throwaway894345 wrote:
           | I'm confused. Why do you need to snapshot live processes? Are
           | we concerned about startup time of Postgres or whatever?
           | Also, why is isolation needed for e2e tests? Lastly, why is
           | docker-in-docker a requirement, and how is that easier than
           | qemu in qemu or qemu in docker or whatever?
        
             | colinchartier wrote:
             | > Why do you need to snapshot live processes?
             | 
             | Often times there are long-living processes which rarely
             | change but take a long time to warm up. The Bazel [1] agent
             | for C++ projects, the buildkit [2] state for docker, or the
             | running Postgres or Redis server for a cloud native app for
             | example.
             | 
             | It's why running "docker build" twice on your laptop is so
             | fast, but running "docker build" in CI seems glacially
             | slow.
             | 
             | > why is docker-in-docker a requirement, and how is that
             | easier than qemu in qemu or qemu in docker or whatever?
             | 
             | The example given was running "docker-compose build", so
             | you'd need either docker-in-firecracker (this post),
             | docker-in-docker, or docker-in-qemu. You'd almost never run
             | docker-compose build on bare metal in practice, because
             | you'd immediately need to send the images you built
             | somewhere else in order to use them.
             | 
             | [1] https://bazel.build/ [2]
             | https://docs.docker.com/develop/develop-
             | images/build_enhance...
        
               | cpuguy83 wrote:
               | But that's state on disk, not process state. It should
               | not affect startup time in buildkit.
               | 
               | I'm not experienced enough with Bazel to comment on that.
        
           | cpuguy83 wrote:
           | Docker does handle snapshots of running processes. It's
           | called checkpoint/restore, it utilizes the CRIU tooling to do
           | this.
           | 
           | In terms of doing this in a CI env like actions where you may
           | have different types of machines serving you, it may be
           | problematic as the machine specs need to pretty closely
           | match.
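
For reference, a small Python sketch that builds the two Docker CLI invocations behind checkpoint/restore. It only constructs the command lines and does not run them. Docker's checkpoint support is an experimental feature and needs CRIU on the host; the container name `pg-test` and checkpoint name `warm` are hypothetical.

```python
import shlex


def checkpoint_cmd(container: str, checkpoint: str) -> list:
    # `docker checkpoint create` stops the container by default;
    # the CLI has a --leave-running flag to keep it alive instead.
    return ["docker", "checkpoint", "create", container, checkpoint]


def restore_cmd(container: str, checkpoint: str) -> list:
    # Restoring is done by starting the container from a named checkpoint.
    return ["docker", "start", "--checkpoint", checkpoint, container]


for cmd in (checkpoint_cmd("pg-test", "warm"), restore_cmd("pg-test", "warm")):
    print(shlex.join(cmd))
```

As the comment notes, restoring on a different machine may fail unless the hosts are closely matched, so treat this as a same-host technique unless you have tested otherwise.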
        
         | jitl wrote:
         | Yeah, I don't like that the article itself treats building the
         | DB seed data, etc, into the Firecracker VM image like this is
         | impossible to do in Docker. The techniques are good things to
         | do -- but it's very tenuous how the techniques are connected to
         | Firecracker.
         | 
         | I've done all of the above using multi-layered Dockerfiles and
         | a cron CI job to rebuild the base integration test image every
         | 6 hours. Sure, if you need the isolation, Firecracker is the
         | way to go. But if you invest primarily in container
         | shenanigans to speed up CI with Docker, it's not too much
         | extra work to wrap it in a Firecracker VM, plain QEMU, or
         | whatever once you start wanting more isolation.
         | 
         | Also, maybe I'm holding it wrong, but Docker in Docker has not
         | bitten us yet on our GitHub Actions runners.
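
The 6-hour rebuild of the base test image could also be gated by a staleness check rather than a fixed cron schedule. A minimal Python sketch follows; the function name and the idea of storing a build timestamp are assumptions, and only the 6-hour interval comes from the comment.

```python
from datetime import datetime, timedelta, timezone

REBUILD_INTERVAL = timedelta(hours=6)  # interval mentioned in the comment


def base_image_is_stale(built_at, now=None, max_age=REBUILD_INTERVAL):
    """Return True when the base image is at least `max_age` old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - built_at >= max_age


built = datetime(2022, 4, 17, 12, 0, tzinfo=timezone.utc)
print(base_image_is_stale(built, now=datetime(2022, 4, 17, 17, 0, tzinfo=timezone.utc)))
print(base_image_is_stale(built, now=datetime(2022, 4, 17, 18, 0, tzinfo=timezone.utc)))
```

Per-commit test jobs would then layer only the application on top of the most recent base image instead of rebuilding everything.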
        
         | lgierth wrote:
         | You don't need a management daemon running though, and get a
         | complete virtualized kernel that can be customized if needed.
        
           | bornfreddy wrote:
           | Ok, so IIUC, the main difference with firecracker versus
           | docker is that processes are better separated from each other
           | ("micro VM" instead of namespaces) and that one can run a
           | customized kernel. But for e2e tests I've written, neither of
           | these advantages mattered.
           | 
           | I do love the idea of taking a snapshot of a prebuilt
           | database image and can see where this would really speed up
           | the tests.
        
       | tedunangst wrote:
       | But why does it require firecracker and not qemu?
        
         | colinchartier wrote:
         | QEMU takes much longer to save/restore snapshots, and it's much
         | harder to do via the API
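
To give a sense of the API side, here is a Python sketch of the HTTP requests involved in saving and loading a Firecracker snapshot. It only builds the request bodies and sends nothing. Firecracker serves its API over a Unix domain socket; the endpoint and field names below are as recalled from Firecracker's snapshot documentation and have changed between releases (older versions took `mem_file_path` on load), so check them against your version. The file paths are hypothetical.

```python
import json


def snapshot_create_requests(snapshot_path, mem_path):
    """Pause the microVM, write a full snapshot, then resume it."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,
            "mem_file_path": mem_path,
        }),
        ("PATCH", "/vm", {"state": "Resumed"}),
    ]


def snapshot_load_request(snapshot_path, mem_path):
    """Load a snapshot into a fresh Firecracker process and resume it."""
    return ("PUT", "/snapshot/load", {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
        "resume_vm": True,
    })


for method, path, body in snapshot_create_requests("/tmp/vm.snap", "/tmp/vm.mem"):
    print(method, path, json.dumps(body))
```

A CI runner built on this would load the snapshot into a new Firecracker process for each run, which is what keeps warmed-up processes such as a database server available without a fresh boot.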
        
           | [deleted]
        
       | n8ta wrote:
       | Sounds like having an actual non-ephemeral computer with extra
       | steps...
        
       ___________________________________________________________________
       (page generated 2022-04-17 23:00 UTC)