[HN Gopher] We replaced Firecracker with QEMU ___________________________________________________________________ We replaced Firecracker with QEMU Author : hugodutka Score : 308 points Date : 2023-07-10 14:15 UTC (8 hours ago) (HTM) web link (hocus.dev) (TXT) w3m dump (hocus.dev) | amarshall wrote: | No mention of Cloud Hypervisor [1]...perhaps they don't know | about it? It's based in part on Firecracker and supports free | page reporting, virtio-blk-pci, PCI passthrough, and (I believe) | discard in virtio-blk. | | [1]: https://www.cloudhypervisor.org/ | hugodutka wrote: | We do, and we'd love to use it in the future. We've found that | it's not ready for prime time yet and it's missing some | features. The biggest problem was that it does not support | discard operations yet. Here's a short writeup we did about | VMMs that we considered: https://github.com/hocus- | dev/hocus/blob/main/rfd/0002-worksp... | amarshall wrote: | Thanks for the link to the elaboration! FYI footnotes 3 and 4 | seem to be swapped. | CompuIves wrote: | At CodeSandbox we use Firecracker for hosting development | environments, and I agree with the points. Though I don't think | that means you should not use Firecracker for running long-lived | workloads. | | We reclaim memory with a memory balloon device, for the disk | trimming we discard (& compress) the disk, and for i/o speed we | use io_uring (which we only use for scratch disks, the project | disks are network disks). | | It's a tradeoff. It's more work and does require custom | implementations. For us that made sense, because in return we get | a lightweight VMM that we can more easily extend with | functionality like memory snapshotting and live VM cloning | [1][2]. | | [1]: https://codesandbox.io/blog/how-we-clone-a-running-vm- | in-2-s... | | [2]: https://codesandbox.io/blog/cloning-microvms-using- | userfault... | wiradikusuma wrote: | Fly uses Firecracker, and they host long-running processes. 
I | wonder what's their opinion about it. | tptacek wrote: | We like Firecracker. People should use whatever makes sense for | them. | gorbak25 wrote: | I think their use case makes a lot of sense as their workloads | consume a predefined amount of RAM. As a customer you rent a VM | with a specified amount of memory, so fly.io does not care about | reclaiming it from a running VM. | no_wizard wrote: | Depends on whether they're using smart memory allocation to keep | costs lower, i.e., if they can predict that certain workloads | only need N amount of memory at Y time, they can effectively | borrow memory from one VM for usage in another that has an | opposite statistical likelihood of needing that memory. | | This is why paying for dedicated memory is often more | expensive than its counterpart, because that dedicated | memory is not considered as part of pooling. | adql wrote: | I think it would be easier to just enable KSM (KVM can use | it to deduplicate memory and save some RAM on duplicate | blocks, like people loading the same libraries into memory) | nerpderp82 wrote: | Shit like this is how you burn customers, and then they find | out you aren't changing the sheets, they change hotels. | | Just because you can doesn't mean you should. | klabb3 wrote: | Almost everything is oversubscribed in our world. | | Malloc will happily "return" the 15 TiB you asked for. | | If 10000 people called 911 at the same time, only a tiny | fraction would get through (and even fewer would get | help). | | Evacuating a large city by road would result in giant | traffic jams. | | There are 5-8x as many parking spots as there are cars | (and we still can't find a goddamn spot). | | And of course... the great toilet paper shortage of 2020. | no_wizard wrote: | Super common though. Everyone from Hetzner to AWS does | this, unless you specifically pay for full allocation, | which is typically ~2x the price. | | Chances are you are using systems that do this and you | haven't even noticed.
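For concreteness, the KSM feature adql mentions is controlled through sysfs. A minimal sketch of turning it on and checking what it deduplicates (requires root; note that KSM only scans memory that applications have opted into via madvise(MADV_MERGEABLE), which QEMU does for guest RAM by default):

```shell
# Enable the KSM scanning daemon (ksmd). The knobs are documented in
# the kernel's Documentation/admin-guide/mm/ksm.rst.
echo 1 > /sys/kernel/mm/ksm/run
# Optionally scan more pages per wakeup than the conservative default.
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
# pages_sharing = deduplicated pages; pages_shared = unique pages
# backing them. A high sharing/shared ratio means KSM is paying off.
grep . /sys/kernel/mm/ksm/pages_sharing /sys/kernel/mm/ksm/pages_shared
```

The trade-off is CPU time spent scanning and, in multi-tenant settings, the risk of cross-VM memory-deduplication side channels, which is why some hosts deliberately leave KSM off.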
| yokaze wrote: | Werner Vogels seems to disagree: | https://twitter.com/Werner/status/25137574680 | rob-olmos wrote: | Tweeted Sep 21, 2010, might still be true for EC2 VMs | though. | mulmen wrote: | I don't see the problem. Is this not the value | proposition of the cloud? At scale, things like block- | level de-duplication and over-provisioning of compute are | more efficient, to say nothing of power and cooling. This | efficiency drives down cost. As long as your data isn't | leaking and you get the hardware you pay for when you | need it, what's the problem? | geodel wrote: | >.. they change hotels | | And then they learn all hotels are doing exactly the same | thing. One hotel doing it is a risk; all hotels doing it is | industry standard. | | Airlines, hotels, restaurants, doctors and so on | oversubscribe all the time. Whoever complains is free to | move and add to their further disappointments. | imtringued wrote: | Memory (DDR4) is like 1.50EUR per GB nowadays. There is | no need to skimp on it. The most expensive component is | still the CPU if you actually want to do work and not | just idle your webserver 99% of the time. | no_wizard wrote: | This ignores just how much competition there is in this | marketplace. Anything that allows a company to balance | user experience with cost of hardware will be evaluated. | The fact is this is super common because even though RAM | is super cheap, at scale, these things add up. If you as | a provider don't have to spend 120K on RAM and can use | that on more CPUs or something else, it can mean having a | competitive edge over the provider that _does_ have to | spend that 120K. | ushakov wrote: | It's a common technique though. I believe it's called | oversubscription, where you rent the same hardware to | more tenants hoping they won't use it all at once.
| | Fly.io themselves admitted they're oversubscribed, and AWS | has been doing the same for _years_ now | | Source: https://fly.io/blog/the-serverless-server/ | heavyset_go wrote: | Someone posted this and then immediately deleted their comment: | https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht... | | I didn't know it existed until they posted, but QEMU has a | Firecracker-inspired target: | | > _microvm is a machine type inspired by Firecracker and | constructed after its machine model._ | | > _It's a minimalist machine type without PCI nor ACPI support, | designed for short-lived guests. microvm also establishes a | baseline for benchmarking and optimizing both QEMU and guest | operating systems, since it is optimized for both boot time and | footprint._ | lessaligned wrote: | "the fork was very very bad for eating soup - this is a story | about how we migrated to a spoon" | | ...firecracker does fine what it was designed for - short-running, | fast-start workloads. | | (oh, and the article starts by slightly misusing a bunch of | technical terms, firecracker's not technically a hypervisor per | se) | sheepscreek wrote: | ..so is it more to support directly deploying functions to the | cloud? Like, what AWS Lambda and CloudFront Functions might be | built on? | yjftsjthsd-h wrote: | I'm pretty sure firecracker was literally created to underlie | AWS Lambda. | | EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact | correct. | bkettle wrote: | As does the paper [1] with details in section 4.1.
| | [1]: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf | zokier wrote: | That being said, firecracker also runs long-running tasks | on AWS in the form of Fargate | dathinab wrote: | yes, it was created originally for AWS Lambda | | mainly it's optimized to run code only shortly (init time max | 10s, max usage is 15min, and default max request time 130s | AFAIK) | | also it's focused on thin serverless functions, like e.g. | deserialize some request, run some thin simple business logic | and then delegate to other lambdas based on it. These kinds of | functions often have similar memory usage per call, and if a | call is an outlier it can just discard the VM instance soon | after (i.e. at most after starting up a new instance, i.e. at | most 10s later) | dathinab wrote: | it's not that simple: many other companies running longer | jobs, including their competition, use Firecracker | | so while Firecracker was designed for things running just a few | seconds, there are many places running it with jobs running way | longer than that | | the problem is, if you want to make it work with long-running | general-purpose images you don't control, you have to put a ton | of work into making it work nicely on all levels of your | infrastructure and code ... which is costly ... which a startup | competing on an online dev environment, compared to e.g. a VM | hosting service, probably shouldn't waste time on | | So AFAIK the decision in the article makes sense, but the reasons | listed for the decision are oversimplified to the point you could | say they aren't quite right. Idk why; could be anything from | the engineer believing that to them avoiding issues with some | shareholder/project lead who is obsessed with "we need to do | Firecracker because the competition does so too". | [deleted] | mgaunard wrote: | I know a good way to make a process make the most of the hardware | and play cooperatively with other processes: don't use | virtualization.
| | I will never understand the whole virtual machine and cloud | craze. Your operating system is better than any hypervisor at | sharing resources efficiently. | kdmytro wrote: | I want to segregate, not to share. | akira2501 wrote: | > cloud craze. | | Automatic scaling is great. Cloud parallelization (a.k.a. fork) | is absolutely wild once you get it rolling. Code deployments | are incredibly simple. Never having to worry about physical | machines or variable traffic loads is worth the small overhead | they charge me for the wrapper. The generic system-wide | permissions model is an absolute joy once you get over the | learning curve. | gwd wrote: | Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs | right on the hardware. KVM is a hypervisor. Firecracker is a | process that controls KVM. If you want to call firecracker (and | QEMU, when used in conjunction with KVM) a VMM ("virtual machine | monitor") I won't complain. But please please please, we need a | word for what KVM and Xen are, and "hypervisor" is the best fit. | Stop using that word for a user-level process like Firecracker. | jstarks wrote: | Keep fighting the good fight, friend. | | Although I'll note that the line between a VMM and a hypervisor | is not always clear. E.g., KVM includes some things that other | hypervisors delegate to the VMM (such as instruction | completion). And macOS's hypervisor.framework is almost a pass-through | to the CPU's raw capabilities. | foundry27 wrote: | Nitpick: it's not accurate to say that a hypervisor, by | definition, runs right on the hardware. Xen (as a type-1 | hypervisor) has this property; KVM (as a type-2 hypervisor) | does not. It's important to remember that the single core | responsibility of a hypervisor is to divide hardware resources | and time between VMs, and this decision-making doesn't require | bare metal.
| | For those unfamiliar, the informal distinction between type-1 | and type-2 is that type-1 hypervisors are in direct control of | the allocation of all resources of the physical computer, while | type-2 hypervisors operate as some combination of being "part | of" / "running on" a host operating system, which owns and | allocates the resources. KVM (for example) gives privileged | directions to the Linux kernel and its virtualization kernel | module for how to manage VMs, and the kernel then schedules and | allocates the appropriate system resources. Yes, the type-2 | hypervisor needs kernel-mode primitives for managing VMs, and | the kernel runs right on the hardware, but those primitives | aren't making management decisions for the division of hardware | resources and time between VMs. The type-2 hypervisor is making | those decisions, and the hypervisor is scheduled by the OS like | any other user-mode process. | veber-alex wrote: | KVM is a type-1 hypervisor [1] | | [1]: https://www.redhat.com/en/topics/virtualization/what-is- | KVM | compscigirl wrote: | Not really, calling KVM a type-1 is a misunderstanding of | what the "bare-metal" distinction is referring to. The real | difference between the two types is whether the hypervisor | owns the hardware or not. In the case of a type-1, the | hypervisor runs below the kernel and controls access to the | hardware, even for the kernel. In type-2, the hypervisor | runs on the kernel, which owns the hardware, and must go | through the kernel to use hardware resources. | cthalupa wrote: | But that's not how that works. KVM is as "bare-metal" in | access to the system as ESXi is. The hypervisor code runs | in ring 0 in both cases. | monocasa wrote: | There's arguments in both directions for something like | kvm. Wiki states it pretty well: | | > The distinction between these two types is not always | clear. 
For instance, KVM and bhyve are kernel modules[6] | that effectively convert the host operating system to a | type-1 hypervisor.[7] At the same time, since Linux | distributions and FreeBSD are still general-purpose | operating systems, with applications competing with each | other for VM resources, KVM and bhyve can also be | categorized as type-2 hypervisors.[8] | | https://en.wikipedia.org/wiki/Hypervisor#Classification | cthalupa wrote: | Type-1 and type-2 hypervisor is terminology that should at | this point be relegated to the past. | | It was never popularly used in a way accurate to the origin | of the classification - in the original paper, Popek and | Goldberg talked about formal proofs for the two types, and | those have very little to do with how the terms began | being used in the 90s and 00s. Things have changed a lot with | computers since the 70s, when the paper was written and the | terminology was coined. | | So, language evolves, and Type-1 and Type-2 came to mean | something else in common usage. And this might have made | sense to differentiate something like ESX from VMware | Workstation in their capabilities, but it's lost that utility | in trying to differentiate Xen from KVM for the overwhelming | majority of use cases. | | Why would I say it's useless in trying to differentiate, say, | Xen and KVM? A couple of reasons: | | 1) There's no performance benefit to type-1 - a lot of | performance sits on the device emulation side, and both are | going to default to qemu there. Other parts are based heavily | on CPU extensions, and Xen and KVM have equal access there. | Both can pass through hardware, support sr-iov, etc., as | well. | | 2) There's no overhead benefit in Xen - you still need a dom0 | VM, which is arguably going to be even more overhead than a | stripped-down KVM setup. There's been work on dom0less Xen, | but it's frankly in a rough state and the related drawbacks | make it challenging to use in a production environment.
| | Neither term provides any real advantage or benefit in | reasoning between modern hypervisors. | gwd wrote: | > Type-1 and type-2 hypervisor is terminology that should | at this point be relegated to the past. | | Maybe it's because of the time I grew up in, but in my mind | the prototypical Type-I hypervisor is VMware ESX Server; | and the prototypical Type-II hypervisor is VMware | Workstation. | | It should be noted that VMware Workstation always required | a kernel module (either on Windows or Linux) to run; so the | core "hypervisor-y" bit runs in kernel mode either way. So | what's the difference? | | The key difference between those two, to me, is: Is the | thing at the bottom designed exclusively to run VMs, such | that every other factor gives way? Or does the thing at the | bottom have to "play nice" with random other processes? | | The scheduler for ESX Server is written explicitly to | schedule VMs. The scheduler for Workstation is the Windows | scheduler. Under ESX, your VMs are the star of the show; | under Workstation, your VMs are competing with the random | updater from the printer driver. | | Xen is like ESX Server: VMs are the star of the show. KVM is | like Workstation: VMs are "just" processes, and are | competing with whatever random bash script was created at | startup. | | KVM gets loads of benefits from being in Linux; like, it | had hypervisor swap from day one, and as soon as anyone | implements something new (like, say, NUMA balancing) for | Linux, KVM gets it "for free". But it's not _really_ for | free, because the cost is that KVM has to make | accommodations for all the other use cases out there. | | > There's no performance benefit to type-1 - a lot of | performance sits on the device emulation side, and both are | going to default to qemu there. | | Er, both KVM and Xen try to switch to paravirtualized | interfaces as fast as possible, to minimize the emulation | that QEMU has to do.
| cthalupa wrote: | >Maybe it's because of the time I grew up in, but in my | mind the prototypical Type-I hypervisor is VMWare ESX | Server; and the prototypical Type-II hypervisor is VMWare | Workstation. | | My point is that these are largely appropriated terms - | neither would fit the definitions of type 1 or type 2 | from the early days when Popek and Goldberg were writing | about them. | | > Or does the thing at the bottom have to "play nice" | with random other processes? | | From this perspective, Xen doesn't count. You can have | all sorts of issues from the dom0 side and competing with | resources - you mention PV drivers later, and you can | 100% run into issues with VMs because of how dom0 | schedules blkback and netback when competing with other | processes. | | ESXi can also run plenty of unmodified linux binaries - | go back in time 15 years and it's basically a fully | featured OS. There's a lot running on it, too. Meanwhile, | you can build a linux kernel with plenty of things | switched off and a root filesystem with just the bare | essentials for managing kvm and qemu that is even less | useful for general purpose computing than esxi. | | >Er, both KVM and Xen try to switch to paravirtualized | interfaces as fast as possible, to minimize the emulation | that QEMU has to do. | | There are more things being emulated than there are PV | drivers for, but this is a bit outside of my point. | | For KVM, the vast majority of implementations are using | qemu for managing their VirtIO devices as well - | https://developer.ibm.com/articles/l-virtio/ - you'll | notice that IBM even discusses these paravirtual drivers | directly in context of "emulating" the device. Perhaps a | better way to get the intent across here would be saying | qemu handles the device model. | | From a performance perspective, ideally you'd want to | avoid PV here too and go with sr-iov devices or | passthrough. 
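The "pass through hardware" route mentioned above can be made concrete with VFIO. This is illustrative only: the PCI address 0000:01:00.0, the vendor/device ID pair "10de 2204", and guest.img are placeholder values, and it assumes the IOMMU is enabled (intel_iommu=on or amd_iommu=on on the host kernel command line) and root privileges:

```shell
# Detach the device from its host driver and hand it to vfio-pci.
modprobe vfio-pci
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
echo 10de 2204 > /sys/bus/pci/drivers/vfio-pci/new_id

# Give the guest the real device instead of an emulated or PV one.
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=guest.img,format=qcow2 \
    -device vfio-pci,host=0000:01:00.0 \
    -nographic
```

With SR-IOV-capable hardware you would instead create virtual functions on the host and pass one VF per guest through the same -device vfio-pci path.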
| bonzini wrote: | According to the actual paper that introduced the | distinction, and adjusting for change of terminology in the | last 50 years, a type-1 hypervisor runs in kernel space and a | type-2 hypervisor runs in user space. x86 is not | virtualizable by a type-2 hypervisor, except by software | emulation of the processor. | | What actually can change is the amount of work that the | kernel-mode hypervisor leaves to a less privileged (user | space) component. | | For more detail see | https://www.spinics.net/lists/kvm/msg150882.html | rdtsc wrote: | I think people just pick the coolest sounding term. Imagine | someone is sharing what they are working on, what's cooler | sounding "I am working on a virtual machine monitor" or "I am | working on a hypervisor". Hypervisor just sounds futuristic and | awesome. | | It's like with "isomorphic" code. That just sounds much cooler | than "js that runs on the client and the server". | arun-mani-j wrote: | I think you could help me answer the question that has been in | my mind for a month :) | | Is there any article that tells the difference and relationship | between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc. | with their typical use cases? | jacobr1 wrote: | [flagged] | doliveira wrote: | Honestly, when we're discussing with humans we want the | humans' response. Anyone of us can ask ChatGPT, I don't | know why people keep pasting its answers | solarpunk wrote: | people keep doing it because it lets them farm upvotes. | theossuary wrote: | Damn, outdone by ChatGPT | bonzini wrote: | Pretty good I must say. AWS however is using KVM only these | days! | orangepurple wrote: | What is AWS Nitro then? | bonzini wrote: | Nitro is the custom hardware that runs in AWS data | centers (or in outposts). You can use it bare metal, or | if you use VM instances it will run Linux (plus KVM) and | a custom user space component that replaces QEMU. 
| theossuary wrote: | KVM is a Linux kernel implementation of the CPU virtualization | extensions that accelerate VMs to near bare-metal speeds. | | Qemu is a user space system emulator. It can emulate in | software different architectures like ARM, x86, etc. It can | also emulate drivers, networking, disks, etc. It's invoked via | the command line. | | The reason you'll see Qemu/KVM a lot is because Qemu is the | emulator, the thing actually running the VM. And it utilizes | KVM (on Linux; macOS has HVF, for example) to accelerate the VM | when the host architecture matches the VM's. | | Libvirt is an XML-based API on top of Qemu (and others). It | allows you to define networks, VMs (it calls them domains), | and much more with a unified XML schema through libvirtd. | | Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI | to do the same. | | Proxmox is Debian under the hood with Qemu/KVM running VMs. | It provides a robust web UI and easy clustering capabilities, | along with nice-to-haves like easy management of disks, Ceph, | etc. You can also manage Ceph through an API with Terraform. | | Xen is an alternative hypervisor (like ESXi). Instead of | running on top of Linux, Xen has its own microkernel. This | means less flexibility (there's no Linux body running | things), but also simpler to manage and less attack surface. | I haven't played much with Xen though; KVM is kind of the de | facto standard, but IIRC AWS used to use a modified Xen before KVM | came along and ate Xen's lunch. | cthalupa wrote: | > Xen is an alternative hypervisor (like esxi). Instead of | running on top of Linux, Xen has it's own microkernel. This | means less flexibility (there's no Linux body running | things), but also simpler to manage and less attack | surface. | | You're going to need dom0 (a "control domain") on any Xen | host. Gotta have something running xl and the rest of the | toolstack for managing it.
dom0less technically exists but | the drawbacks mean it's not really usable by most people in | a production situation. | adql wrote: | >Xen is an alternative hypervisor (like esxi). Instead of | running on top of Linux, Xen has it's own microkernel. This | means less flexibility (there's no Linux body running | things), but also simpler to manage and less attack | surface. I haven't played much with xen though, KVM is kind | of the defacto, but iirc AWS uses a modified Xen. | | If you actually played with Xen you'd know it's not | actually easier to manage. And the increased security claims | are dubious at best, as the same thing that would be attacked | (dom0, managing the whole thing and running Linux) has | direct unfettered access to the Xen microkernel. There is a | reason many sites migrated away from Xen to KVM. Also, many | Xen drivers de facto run partly in the Linux dom0 instance, so you | don't even get that isolation. | | We ran Xen for a few years, as KVM at first was still not as | refined and Xen was first to market, and it was just a | million little annoying things. | | KVM offers far simpler and more straightforward management. A VM | is just a process. You can look at its CPU usage via normal | tools. No magic. No driver problems. | monocasa wrote: | > I haven't played much with xen though, KVM is kind of the | defacto, but iirc AWS uses a modified Xen. | | AWS switched to KVM, and even a lot of AWS systems that | report themselves as Xen are running as KVM with a compat | shim to say it's Xen. | gabereiser wrote: | KVM is the kernel-based virtual machine, with libvirt being its | API abstraction over all of it. QEMU is a virtual machine | host that leverages KVM or software virtualization to spin | up machines on the host. virt-manager does the same. Xen is | another virtual machine host, like KVM. Proxmox is a virtual | machine manager (like QEMU, virt-manager) but is web-based.
| Libvirt will provide abstraction for KVM, QEMU, and Xen | | Use cases: the Proxmox web interface exposed on your local | network on a KVM Linux box that uses QEMU to manage VMs. | Proxmox will allow you to do that from the web. QEMU is great | for a single machine or a small fleet but should be automated | for any heavy lifting. Proxmox will do that. | bonzini wrote: | This is almost entirely wrong, especially as far as QEMU, | Libvirt and virt-manager are concerned. | | QEMU is a low-level process that represents the virtual | machine. It has no equivalent in Xen. Using QEMU directly | is not a good idea unless your needs for VM configurations | change all the time and you hardly reuse VMs. | | Libvirt is at a higher level than QEMU. It manages the QEMU | processes and gives them access to system resources (image | files, network interfaces, pass-through PCI devices). It | also makes it easy to manage the configuration of your | virtual machines and the resources they use. | | Higher still is virt-manager, which is a GUI interface for | libvirt. Proxmox sits at roughly the same level as virt-manager. | gabereiser wrote: | How? KVM and Xen are kernel level. QEMU uses KVM but also | has a software virtualization capability. Libvirt is an | API abstraction over it all. virt-manager is a GUI app to | manage libvirt machines. Proxmox as well. Proxmox VE | talks to the VM host via libvirt. | bonzini wrote: | Libvirt does not use KVM. Libvirt uses either QEMU (which | in turn might or might not use KVM) or Xen or other | hypervisors. So it's incorrect to say that Libvirt | abstracts over KVM. | | And virt-manager indeed manages Libvirt machines, so it's | not at the level of QEMU as you wrote in the parent | comment: | | > Proxmox is a virtual machine manager (like QEMU, virt-manager) | gabereiser wrote: | Semantics, libvirt abstracts over KVM via QEMU because | QEMU/KVM/HVT is all one driver. | bonzini wrote: | KVM is not enough to create a virtual machine.
KVM only | virtualizes the processor, not a whole machine. | gabereiser wrote: | >Using KVM, one can run multiple virtual machines running | unmodified Linux or Windows images. Each virtual machine | has private virtualized hardware: a network card, disk, | graphics adapter, etc. | | Straight from their site. QEMU is the user space | interface, KVM the kernel space driver. It's enough to | run whatever OS. That's the point. | | For libvirt: https://libvirt.org/drivers.html | | They support a bunch as well. | cthalupa wrote: | I don't want to necessarily make this an argument to/from | authority, but for some context here - you are discussing | this with Paolo Bonzini, maintainer of KVM, contributor | to QEMU. In the list of people that best understand the | difference and demarcation points between KVM and QEMU, | he's pretty far up there. | bonzini wrote: | Exactly, it's QEMU that abstracts over the processor | virtualization APIs of KVM, Microsoft WHPX, Apple | Hypervisor.framework etc. Not Libvirt. | Izmaki wrote: | I don't know if _one_ such article exists, but here is a | piece of tech doc from oVirt (yet another tool) that shows | how - or that - VDSM is used by oVirt to communicate with | QEMU through libvirt: https://www.ovirt.org/develop/architect | ure/architecture.html... | | In really simple terms, so simple that I'm not 100% sure they | are correct: | | * KVM is a hypervisor, or rather it lets you turn linux into | a hypervisor [1], which will let you run VMs on your machine. | I've heard KVM is rather hard to work with (steep learning | curve). (Xen is also a hypervisor.) | | * QEMU is a wrapper-of-a-sorts (a "machine emulator and | virtualizer" [2]) which can be used on top of KVM (or Xen). | "When used as a virtualizer, QEMU achieves near native | performance by executing the guest code directly on the host | CPU. QEMU supports virtualization when executing under the | Xen hypervisor or using the KVM kernel module in Linux." 
[2] | | * libvirt "is a toolkit to manage virtualization platforms" | [3] and is used, e.g., by VDSM to communicate with QEMU. | | * virt-manager is "a desktop user interface for managing | virtual machines through libvirt" [4]. The screenshots on the | project page should give an idea of what its typical use-case | is - think VirtualBox and similar solutions. | | * Proxmox is the above toolstack (-ish) but as one product. | | --- | | [1] https://www.redhat.com/en/topics/virtualization/what-is-KVM | | [2] https://wiki.qemu.org/Main_Page | | [3] https://libvirt.org/ | | [4] https://virt-manager.org/ | Izmaki wrote: | PS: typically if you want to run VMs you are faced with two | paths only: 1) you want one or more VMs on your personal | computer, or 2) you want one or more VMs in an office | environment. | | On the first path you are likely going to be just fine with | VirtualBox, VMware Workstation, or Hyper-V (Windows only) / | Parallels (Mac intended). Which one you should pick depends | on your desired use of the machines. | | On the second path you would go with a solution that deals | with the nitty-gritty details, such as Proxmox, oVirt, | Hyper-V, ESXi, or any of the many other available options - | granted you are not going full cloud-based, which opens up | a whole lot of different options too. | | You would generally never need to worry about which | components are needed where and why. I've had to worry | about it once or twice before, because I've had to debug | why an oVirt solution was not behaving like I wanted it to | behave. Knowing the inner workings helps in that case. | crabbone wrote: | > On the first path you are likely going to be just fine | with VirtualBox, VMware Workstation, or Hyper-V (Windows | only) / Parallels (Mac intended). | | As a Linux user, why would you want to use VirtualBox or | VMware Workstation? They are not so well integrated with | the system, and, frankly, VirtualBox is more of a toy VM | player...
just go for virt-manager. It gives a | conceptually similar interface to VirtualBox, but better | integration with the rest of the system, especially when | it comes to stuff like sending different key | combinations. | | I honestly cannot think of a single benefit to using | VirtualBox (and I'm less familiar with VMware Player) | compared to virt-manager. My guess is that it's more | often used because it's also a common choice on MS | Windows, so you get more hits if you are going to search | the Web for questions associated with VMs / you'd get | tutorials for how to set up a VM that use VirtualBox. | But, if you apply yourself to learning how either one of | these works, you'd see no reason to choose it. | mkesper wrote: | The only reason for using it would be that the GUI is a | little bit more capable. But the feature crippling without | the PUEL addons is horrible. | [deleted] | MuffinFlavored wrote: | > virtual machine monitor | | Is it good to think of libvirt as a virtual machine monitor, or | is that more "virtual machine management"? | zbrozek wrote: | I'd love to get a clear explanation of what libvirt actually | does. As far as I can tell it's a qemu argument assembler and | launcher. For my own use case, I just launch qemu from | systemd unit files: | | https://wiki.archlinux.org/title/QEMU#With_systemd_service | bonzini wrote: | The main important point is that Libvirt takes care of | privilege separation. | | It makes sure that if your VM and/or QEMU are broken out | of, there are extra layers to prevent getting access to the | whole physical machine. For example, it runs QEMU as a very | limited user and, if you're using SELinux, the QEMU process | can hardly read any file other than the VM image file. | | By contrast, the method in the Arch wiki runs QEMU as root. | QEMU is exposed to all sorts of untrusted input, so you | really don't want it to run as root.
| | Libvirt also handles cross-machine operations such as live | migration, and makes it easier to query a bunch of things | from QEMU. | | For more info see https://www.redhat.com/en/blog/all-you- | need-know-about-kvm-u... | zbrozek wrote: | Thank you, this is the best explanation of the utility | that I've seen. I appreciate it. | MuffinFlavored wrote: | > As far as I can tell it's a qemu argument assembler | | Raises a few questions for me: | | Can you use KVM/do KVM stuff without QEMU? | | Can you do libvirt stuff without QEMU? | | Hoping the answers to both aren't useless/"technically, but | why would you want to?" | zokier wrote: | There is a cute article from LWN demoing using KVM directly | without anything else: https://lwn.net/Articles/658511/ | mrweasel wrote: | That is really neat, it never occurred to me that you | could just use KVM as a sort of sandbox. | simcop2387 wrote: | > Can you use KVM/do KVM stuff without QEMU? | | Yes, there are a few things out there like Firecracker that | use KVM without using QEMU. I'm not completely aware of | all of them but they do exist. | | > Can you do libvirt stuff without QEMU? | | Yes, it can also manage LXC containers and a few other | types like Xen and Bhyve and Virtuozzo, much like QEMU without | KVM. The "without KVM" part is important for letting you run | VMs that are emulating other architectures than the | native one. | | For a good bit of this, it is "why would you want to" but | there are definitely real cases where you'd want to be | able to do this. Like the LXC or Virtuozzo support means | that you can run lighter-weight containers (same | underlying tech as Docker essentially) through the same | orchestration/management that you use for virtual | machines. And the Bhyve support lets you do the same | thing for running things on top of FreeBSD (though I've | never used it this way) so that a heterogeneous mix of | hosts is managed through the same interfaces.
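The point about libvirt driving more than QEMU is easy to poke at from a shell: the connection URI selects the backend driver. A minimal sketch (the URIs below are standard libvirt connection URIs, but each one only answers if the corresponding driver and hypervisor are actually installed on the host):

```shell
#!/bin/sh
# libvirt multiplexes several hypervisor drivers behind one API;
# the -c connection URI picks which backend virsh talks to.
if command -v virsh >/dev/null 2>&1; then
  virsh -c qemu:///system list --all || true  # QEMU/KVM domains
  virsh -c lxc:/// list --all || true         # LXC containers, same tooling
  backend="virsh"
else
  backend="none"  # libvirt client not installed on this host
fi
echo "libvirt client: $backend"
```

Xen (`xen:///`) and VirtualBox (`vbox:///session`) URIs work the same way when those drivers are present.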
| jononor wrote: | libvirt supports other hypervisors than QEMU, such as | VirtualBox, Xen and LXC. | MuffinFlavored wrote: | I know ChatGPT responses get downvoted here but I fed it | this to help me better understand this: | | > is KVM a hypervisor? is it type 1 or type 2? is QEMU a | hypervisor, is it type 1 or type 2? if QEMU is using KVM, | is QEMU then not a hypervisor in that use case? | | Yes, KVM (Kernel-Based Virtual Machine) is indeed a | hypervisor. It's a type 1 hypervisor, also known as a | "bare metal" hypervisor. This is because KVM directly | runs on the host's hardware to control the hardware and | to manage guest operating systems. The fact that it's a | Linux kernel module that allows the Linux kernel to | function as a hypervisor makes it very efficient. | | QEMU (Quick Emulator) is a bit more complex. By itself, | it is technically a type 2 or "hosted" hypervisor, | meaning it runs within a conventional operating system | environment. QEMU is a generic, open-source machine | emulator and virtualizer that can emulate a variety of | hardware types and host a range of guest operating | systems. | | However, when QEMU is used with KVM, the picture changes | somewhat. In this case, KVM provides the hardware | virtualization where it allows the host machine to call | CPU instructions of full virtualization. Then, QEMU | emulates the hardware resources and provides the user | interface for the VM, thus allowing for better | performance and usability. It's this combination of KVM's | hardware acceleration capabilities and QEMU's emulation | capabilities that makes them often used together. | | In this case, QEMU is not acting purely as a hypervisor; | it's providing hardware emulation and user interface for | the VMs, while KVM is the part providing the hypervisor | functionality. However, we often refer to the combination | of "QEMU/KVM" as a unit when talking about this mode of | operation. 
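The type-1/type-2 hand-wringing above largely boils down to a single flag: the same QEMU binary can run a guest under TCG (pure emulation, QEMU translates every guest instruction) or hand guest code to the KVM kernel module and keep only device emulation for itself. A sketch that merely probes which accelerators the local build supports; the `guest.img` name in the comments is a placeholder, not a real image:

```shell
#!/bin/sh
# -accel help lists the accelerators this QEMU build was compiled with
# (typically kvm and tcg on Linux).
if command -v qemu-system-x86_64 >/dev/null 2>&1; then
  accels=$(qemu-system-x86_64 -accel help)
else
  accels="qemu-system-x86_64 not installed"
fi
echo "$accels"
# The same guest, booted two different ways:
#   qemu-system-x86_64 -accel tcg -m 1024 -drive file=guest.img,format=qcow2
#   qemu-system-x86_64 -accel kvm -m 1024 -drive file=guest.img,format=qcow2
```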
| adql wrote: | > Can you use KVM/do KVM stuff without QEMU? | | There are projects doing that, although qemu is the de | facto standard and best bet if you don't need to boot | your machines in 20ms | | > Can you do libvirt stuff without QEMU? | | Libvirt has _many_ backends, so yes. IIRC it can even | manage VirtualBox, although I'm not sure why anyone | would want to | | > Hoping the answers to both aren't useless/"technically, | but why would you want to?" | | ...why? Is there a problem kvm+qemu+libvirt doesn't solve | for you? | adql wrote: | It's a lot of glue to present a consistent interface but it | also does the management part. | | "API to a virtualization system" would probably be the closest | approximation, but it also does some more advanced stuff like | coordinating cross-host VM migration | datadeft wrote: | Tl;dr: We tried to misuse technology and we failed. If | Firecracker was developed for a single binary executed for a | short period of time, why do you try to use it for multiple | executables running for a long time? Does it make any sense to | even try? | zokier wrote: | AWS uses Firecracker to execute long-running Fargate tasks; it's | hardly misuse | mike_hock wrote: | Where in the "sales" pitch on the fancy-CSS website as well as | the README does it say only to use it for single-shot | workloads? | | I think the complaints are perfectly valid. | favflam wrote: | I know that Firecracker does not let you bind mount volumes, but | QEMU does. So, we changed to QEMU from Firecracker. If you run | the workloads in Kubernetes, you just have to change a single | value in a yaml file to change the runtime. | | I would be scared to let unknown persons use QEMU that bind | mounts volumes as that is a huge security risk. Firecracker, I | think, was designed from the start to run un-sanitized workloads, | hence, no bind mounting. | brunoborges wrote: | > you just have to change a single value in a yaml file | | Most dangerous 12-word sentence.
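For context on that "single value in a yaml file": with a sandboxed runtime such as Kata Containers installed, Kubernetes exposes VMM choices as RuntimeClass handlers (Kata registers handlers like `kata-qemu` and `kata-fc` for Firecracker), and the pod-side switch is one field. A hypothetical sketch; the handler names depend on what the cluster actually has installed:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata-qemu              # registered by the Kata installation
---
apiVersion: v1
kind: Pod
metadata:
  name: workspace
spec:
  runtimeClassName: kata-qemu   # the "single value": swap to kata-fc
  containers:
  - name: main
    image: ubuntu:22.04
```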
| Havoc wrote: | I toyed with it a bit and was delighted to get it running. Only | to discover getting even basic networking going is another | mission in itself. | | Light is cool but for many tasks that level of Spartan is | overkill | | If I'm investing time in light it might as well be wasm tech | k__ wrote: | _" Firecracker...'s excellent for running short-lived | workloads...A little-known fact about Firecracker is its lack of | support... for long-lived workloads."_ | | Okay. | rgbrenner wrote: | _" Firecracker's RAM footprint starts low, but once a workload | inside allocates RAM, Firecracker will never return it to the | host system."_ | | Firecracker has a balloon device you can inflate (ie: acquire as | much memory inside the VM as possible) and then deflate... | returning the memory to the host. You can do this while the VM is | running. | | https://github.com/firecracker-microvm/firecracker/blob/main... | 0xbadcafebee wrote: | That and the fact that this was after "several weeks of | testing" tells me this team doesn't have much virtualization | experience. Firecracker is designed to quickly virtualize 1 | headless stateless app (like a container), not run hundreds of | different programs in a developer environment. | CompuIves wrote: | Yes, we use this at CodeSandbox for reclaiming memory to the | host (and to reduce snapshot size when we hibernate the VM). | paulv wrote: | The first footnote says _If you squint hard enough, you 'll | find that Firecracker does support dynamic memory management | with a technique called ballooning. However, in practice, it's | not usable. 
To reclaim memory, you need to make sure that the | guest OS isn't using it, which, for a general-purpose workload, | is nearly impossible_ | adql wrote: | Yeah, it's a pretty hard problem, as you'd need to defragment | physical memory (while fixing all the virtual-to-physical | mappings) to make a contiguous block to free | [deleted] | dspillett wrote: | _> which, for a general-purpose workload, is nearly | impossible_ | | That depends on the workload and the maximum memory allocated | to the guest OS. | | A lot of workloads rely on the OS cache/buffers to manage IO, | so unless RAM is quite restricted you can get it to release | that pretty easily prior to having the balloon driver do its | thing. In fact I'd not be surprised to be told the balloon | process does this automatically itself. | | If the workload does its own IO management and memory | allocation (something like SQL Server, which will eat what RAM | it can and does its own IO caching) or the VM's memory | allocation is too small for OS caching to be a significant | use after the rest of the workload (you might pare memory | down to the bare minimum like this for a "fairly static | content" server that doesn't see much variation in memory | needs and can be allowed to swap a little if things grow | temporarily), then I'd believe it is more difficult. That is | hardly the use case for firecracker, though, so if that is the | sort of workload being run, perhaps reassessing the tool used | for the job was the right call. | | Having said that, my use of VMs is generally such that I can | give them a good static amount of RAM for their needs and | don't need to worry about dynamic allocation, so I'm far from | a subject expert here.
| | And, isn't firecracker more geared towards short-lived VMs, | quick to spin up, do a job, spin down immediately (or after | only a short idle timeout if the VM might answer another | request if one comes in immediately or is already queued), so | you are better off cycling VMs, which is probably happening | anyway, than messing around with memory balloons? Again, I'm | not talking from a position of personal experience here, so | corrections/details welcome! | dathinab wrote: | > is nearly impossible | | for many mostly "general purpose" use cases it's quite | viable, or else ~fly.io~ AWS Fargate wouldn't be able to use | it | | this doesn't mean it's easy to implement the necessary | automated tooling etc. | | so depending on your dev resources and priorities it | might be a bad choice | | still I feel the article was quite a bit subtly | judgemental, while moving some quite relevant parts of the | content of the article into a footnote and also omitting that | this "supposedly unusable tool" is used successfully by | various other companies... | | like it was written by an engineer being overly | defensive about their decision due to having to defend it for the | 100th time because shareholders, customers, and higher-level | management just wouldn't shut up about "but that uses | Firecracker" | tedunangst wrote: | I'm struggling to understand how qemu with free page | reporting isn't exactly the same as a firecracker balloon. | londons_explore wrote: | I really want VMs to integrate 'smarter' with the host. | | For example, if I'm running 5 VMs, there is a good chance that | many of the pages are identical. Not only do I want those pages | to be deduplicated, but I want them to be zero-copy (ie. not | deduplicated after-the-fact by some daemon).
| | To do that, the guest block cache needs to be integrated with the | host block cache, so that whenever some guest application tries | to map data from disk, the host notices that another virtual | machine has already caused this data to be loaded, so we can just | map the same page of already-loaded data into the VM that is | asking. | jarym wrote: | Well that's all nice, but that would also need to be | compute-efficient for it to be worthwhile, and near-real-time dedupe of | memory pages would be a REALLY tough challenge. | londons_explore wrote: | Pretty straightforward for disk blocks. Many VM disks are | already de-duped, either through snapshotting or through | copy-on-write host filesystems. | | The host block cache will end up deduplicating it | automatically because all the 'copies' lead back to the same | block on disk. | gorbak25 wrote: | I've tried to use virtio-pmem + DAX for the page cache to not | be duplicated between the guest and the host. In practice the | RAM overhead of virtio-pmem is unacceptable and it doesn't | support discard operations at all. So yes, a better solution | would be needed. | adql wrote: | https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht... | | zero-copy is harder as one system upgrade on one of them will | trash it, but KSM is overall pretty effective at saving some | memory on similar VMs | re-thc wrote: | KVM has had KSM (kernel samepage merging) for a long time, | which de-duplicates pages. | londons_explore wrote: | But that makes a copy first, and only later notices that the | pages are the same and merges them again. | | Better to not make copies in the first place. | adql wrote: | it's not really possible in the hypervisor as it doesn't know | what the guest will be putting in its memory beforehand | Liquid_Fire wrote: | How are you going to know in advance that the pages are | going to be the same? | | e.g.
your guest kernel is loading an application into | memory, by reading some parts of an ELF file from disk. | Presumably each VM has its own unique disk, so the | hypervisor can't know that this is "the same" page of data | as another VM has without actually reading it into memory | first and calculating a hash or something. | | If the VMs share a disk image (e.g. the image is copy-on- | write), then I could see it being feasible - e.g. with KVM, | even if your VMs are instantiated by distinct userspace | processes, they would probably share the pages as they mmap | the same disk image. You would still need your virtualised | disk device to support copy-on-write, which may or may not | be possible depending on your use case. | | But your copy-on-write disk images will probably quickly | diverge in a way that makes most pages not shareable, | unless you use some sort of filesystem optimised for that. | | Lastly, since you mentioned Chromium or Slack in another | comment - I'm sure you'll find nearly all of the loading | time there is not spent loading the executable from disk, | but actually executing it (and all its | startup/initialisation code). So this probably won't be the | speedup you're imagining. It would just save memory. | londons_explore wrote: | Chromium load time with an empty profile on my system is | 4.5 seconds with a cleared disk block cache, and 2.0 | seconds with a warm disk cache. | | So startup time could be better than halved. Seems worth | it. | londons_explore wrote: | > pages not shareable, unless you use some sort of | filesystem optimised for that. | | btrfs on the host would have support for deduplication of | identical pages in the disk images. It's true that a CPU- | costly scan would be needed to identify new shared pages, | if for example, two VM's are both updated to the latest | distro release. 
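The "deduplication through CoW host filesystems" effect discussed above is observable with plain coreutils: a reflink copy on btrfs or XFS shares every data block with the original until one side writes, which is also what lets the host page cache hold a shared block only once. A small sketch using GNU `cp`; `--reflink=auto` degrades to a normal byte copy on filesystems without reflink support:

```shell
#!/bin/sh
# Clone a fake VM image via reflink where the filesystem supports it.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/base.img" bs=1M count=4 2>/dev/null
cp --reflink=auto "$tmp/base.img" "$tmp/clone.img" 2>/dev/null \
  || cp "$tmp/base.img" "$tmp/clone.img"   # fallback for non-GNU cp
if cmp -s "$tmp/base.img" "$tmp/clone.img"; then
  result="clone matches base"
else
  result="clone differs"
fi
echo "$result"
rm -r "$tmp"
```

Whether the clone actually shares blocks (rather than being a full copy) depends on the filesystem backing `$tmp`; the comparison succeeds either way.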
| ec109685 wrote: | It has side-channel attacks, so be careful when enabling: | https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM) | scarface_74 wrote: | This seems like a security issue waiting to happen when you're | running code from different users. | hamandcheese wrote: | If you already know so much about your application(s), are you | sure you need virtualization? | drbawb wrote: | The second I read "shared block cache" my brain went to | containers. | | If you want data colocated on the same filesystem, then put | it on the same filesystem. VMs suck, nobody spins up a whole | fabricated IBM-compatible PC and gaslights their executable | _because they want to._ [1] They do it because their OS (a) | doesn't have containers, (b) doesn't provide strong enough | isolation between containers, or (c) the host kernel can't | run their workload. (Different ISA, different syscalls, | different executable format, etc.) | | Anyone who has ever tried to run heavyweight VMs atop a | snapshotting volume already knows the idea of "shared blocks" | is a fantasy; as soon as you do one large update inside the | guest, the delta between your volume clones and the base | snapshot grows immensely. That's why Docker et al. have a | concept of layers and you describe your desired state as a | series of idempotent instructions applied to those layers. | That's possible _because_ Docker operates semantically on a | filesystem; much harder to do at the level of a block device. | | Is a block containing b"hello, world" part of a program's | text section, or part of a user's document? You don't know, | because the guest is asking you for an LBA, not a path, not | modes, not an ACL, etc. - If you don't know that, the host | kernel has no idea how the page should be mapped into memory.
| Furthermore storing the information to dedup common blocks is | non-trivial: go look at the manpage for ZFS' deduplication | and it is littered w/ warnings about the performance, memory, | and storage implications of dealing with the dedup table. | | [1]: https://www.youtube.com/watch?v=coFIEH3vXPw | tjoff wrote: | Doubt it is worth the hassle. How many do you really expect to | be identical? | | An OS isn't large. Your spotify/slack/browser instance is of | comparable size. Says more about browser based apps but still. | londons_explore wrote: | Basically all code pages should be the same if some other VM | has the same version of ubuntu and running the same version | of spotify/slack. | | And remember that as well as RAM savings, you also get | 'instant loading' because there is no need to do slow SSD | accesses to load hundreds of megabytes of a chromium binary | to get slack running... | drbawb wrote: | > An OS isn't large. Your spotify/slack/browser instance is | of comparable size. | | A fairly recent Windows 11 Pro image is ~26GB unpacked and | 141k dirents. After finishing OOBE it's already running like | >100 processes, >1000 threads, and >100k handles. My Chrome | install is ~600MB and 115 dirents. (Not including UserData.) | It runs ~1 process per tab. Comparable in scope and | complexity? That's debatable, but I tend to agree that modern | browsers are pretty similar in scope to what an OS should be. | (The other day my "web browser" flashed the firmware on the | microcontroller for my keyboard.) | | They're not even close to "being comparable in size," | although I guess that says more about Windows. | cthalupa wrote: | My reading was that the "comparable in size" was more about | memory footprint and less about usage of storage | shrubble wrote: | OpenVZ does this. If you have 5 VMs each loading the same | library then memory is conserved, as I understand it. | anthk wrote: | kvm does the same with KSM. 
| shrubble wrote: | Not precisely, in that KSM does it after the fact while | OpenVZ has it occur as a consequence of its design, on the | loading of the program. | | See (OpenVZ) "Containers share dynamic libraries, which | greatly saves memory." It's just 1 Linux kernel when you | are running OpenVZ containers. | | https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_ | c... | | See (KVM/KSM): "KSM enables the kernel to examine two or | more already running programs and compare their memory. If | any memory regions or pages are identical, KSM reduces | multiple identical memory pages to a single page. This page | is then marked copy on write." | | https://access.redhat.com/documentation/en- | us/red_hat_enterp... | | In KVM's defense, it supports a much wider range of OSes; | OpenVZ only really does different versions of Linux, while | KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in | addition to Linux. | andrewflnr wrote: | Are you sure you're not thinking "copy on write" rather than | "zero copy"? The latter implies you can predict in advance | which pages will be the same forever... | londons_explore wrote: | The pages would be copy-on-write, but since this would mostly | be for code pages, they would never be written, and therefore | never copied. | | By 'zero copy', I mean that when a guest tries to read a | page, if another guest has that page in RAM, then no copy | operation is done to get it into the memory space of the 2nd | guest. | ComputerGuru wrote: | The article did an ok job of explaining the firecracker | limitations they ran into but it was extremely skimpy when it | came to qemu and just rushed to the conclusion "we did a lot of | work so try our product." | naikrovek wrote: | yeah I was reading so I could find out what they did. | | I understand that they need to sell their product but jeez. 
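On the question upthread about QEMU's free page reporting: it is a virtio-balloon device property (added in QEMU 5.1) which, together with a guest kernel built with CONFIG_PAGE_REPORTING, hands freed guest pages back to the host continuously, without an explicit balloon inflate/deflate cycle. A sketch that only probes whether the local QEMU build exposes the property; `guest.img` in the comments is a placeholder:

```shell
#!/bin/sh
# List virtio-balloon's device properties and look for free-page-reporting.
if command -v qemu-system-x86_64 >/dev/null 2>&1; then
  props=$(qemu-system-x86_64 -device virtio-balloon,help 2>&1)
else
  props="qemu-system-x86_64 not installed"
fi
echo "$props" | grep -i free-page || echo "free-page-reporting not listed"
# Enabling it at boot:
#   qemu-system-x86_64 -accel kvm -m 4096 \
#     -device virtio-balloon,free-page-reporting=on \
#     -drive file=guest.img,format=qcow2
```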
| don't leave us hanging like that | hugodutka wrote: | I didn't want to go into all the technical details, but we | have another write-up that goes into details about RAM | management: https://github.com/hocus- | dev/hocus/blob/main/rfd/0003-worksp... | | Other than making sure we release unused memory to the host, | we didn't customize QEMU that much. Although we do have a | cool layered storage solution - basically a faster | alternative to QCOW2 that's also VMM-independent. It's called | overlaybd, and was created and implemented at Alibaba. That | will probably be another blog post. | https://github.com/containerd/overlaybd | ComputerGuru wrote: | > I didn't want to go into all the technical details | | HN is here for the technical details ;) | gunapologist99 wrote: | that should be the HN motto! | xen2xen1 wrote: | Thirded.. Ed. | blinkingled wrote: | > The main issue we've had with QEMU is that it has too many | options you need to configure. For instance, enabling your VM to | return unused RAM to the host requires at least three challenging | tasks | | This just works on Hyper-V Linux guests btw. For all the crap MS | gets they do some things very right. | freeone3000 wrote: | It _kind of_ just works. It's actually broken under Debian 13 | for some reason; memory usage infinitely balloons if the | feature is enabled. ___________________________________________________________________ (page generated 2023-07-10 23:00 UTC)