[HN Gopher] We replaced Firecracker with QEMU
       ___________________________________________________________________
        
       We replaced Firecracker with QEMU
        
       Author : hugodutka
       Score  : 308 points
       Date   : 2023-07-10 14:15 UTC (8 hours ago)
        
 (HTM) web link (hocus.dev)
 (TXT) w3m dump (hocus.dev)
        
       | amarshall wrote:
       | No mention of Cloud Hypervisor [1]...perhaps they don't know
       | about it? It's based in part on Firecracker and supports free
       | page reporting, virtio-blk-pci, PCI passthrough, and (I believe)
       | discard in virtio-blk.
       | 
       | [1]: https://www.cloudhypervisor.org/
        
         | hugodutka wrote:
         | We do, and we'd love to use it in the future. We've found that
         | it's not ready for prime time yet and it's missing some
         | features. The biggest problem was that it does not support
         | discard operations yet. Here's a short writeup we did about
         | VMMs that we considered: https://github.com/hocus-
         | dev/hocus/blob/main/rfd/0002-worksp...
        
           | amarshall wrote:
           | Thanks for the link to the elaboration! FYI footnotes 3 and 4
           | seem to be swapped.
        
       | CompuIves wrote:
       | At CodeSandbox we use Firecracker for hosting development
       | environments, and I agree with the points. Though I don't think
       | that means you should not use Firecracker for running long-lived
       | workloads.
       | 
       | We reclaim memory with a memory balloon device, for the disk
       | trimming we discard (& compress) the disk, and for i/o speed we
       | use io_uring (which we only use for scratch disks, the project
       | disks are network disks).
       | 
       | It's a tradeoff. It's more work and does require custom
       | implementations. For us that made sense, because in return we get
       | a lightweight VMM that we can more easily extend with
       | functionality like memory snapshotting and live VM cloning
       | [1][2].
       | 
       | [1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-
       | in-2-s...
       | 
       | [2]: https://codesandbox.io/blog/cloning-microvms-using-
       | userfault...
        
       | wiradikusuma wrote:
        | Fly uses Firecracker, and they host long-running processes. I
        | wonder what their opinion of it is.
        
         | tptacek wrote:
         | We like Firecracker. People should use whatever makes sense for
         | them.
        
         | gorbak25 wrote:
          | I think their use case makes a lot of sense, as their
          | workloads consume a predefined amount of RAM. As a customer
          | you rent a VM with a specified amount of memory, so fly.io
          | does not care about reclaiming it from a running VM.
        
           | no_wizard wrote:
            | Depends on whether they're using smart memory allocation to
            | keep costs lower, i.e., if they can observe that certain
            | workloads only need N amount of memory at Y time, they can
            | effectively borrow memory from one VM for use in another
            | that has the opposite statistical likelihood of needing that
            | memory.
            | 
            | This is why paying for dedicated memory is often more
            | expensive than its counterpart, because that dedicated
            | memory is not considered part of the pool.
        
             | adql wrote:
              | I think it would be easier to just enable KSM (KVM can use
              | it to deduplicate memory and save some RAM on duplicate
              | blocks, e.g. people loading the same libraries into
              | memory).
        
             | nerpderp82 wrote:
              | Shit like this is how you burn customers, and when they
              | find out you aren't changing the sheets, they change
              | hotels.
             | 
             | Just because you can doesn't mean you should.
        
               | klabb3 wrote:
               | Almost everything is oversubscribed in our world.
               | 
               | Malloc will happily "return" the 15 TiB you asked for.
               | 
               | If 10000 people called 911 at the same time, only a tiny
               | fraction would get through (and even fewer would get
               | help).
               | 
               | Evacuating a large city by road would result in giant
               | traffic jams.
               | 
               | There are 5-8x as many parking spots as there are cars
               | (and we still can't find a goddamn spot).
               | 
               | And of course... the great toilet paper shortage of 2020.
        
               | no_wizard wrote:
               | Super common though. Everyone from Hetzner to AWS does
               | this, unless you specifically pay for full allocation,
               | which is typically ~2x the price.
               | 
               | Chances are you are using systems that do this and you
               | haven't even noticed.
        
               | yokaze wrote:
               | Werner Vogels seems to disagree:
               | https://twitter.com/Werner/status/25137574680
        
               | rob-olmos wrote:
               | Tweeted Sep 21, 2010, might still be true for EC2 VMs
               | though.
        
               | mulmen wrote:
               | I don't see the problem. Is this not the value
               | proposition of the cloud? At scale things like block
               | level de-duplication and over-provisioning of compute are
               | more efficient to say nothing of power and cooling. This
               | efficiency drives down cost. As long as your data isn't
                | leaking and you get the hardware you pay for when you
                | need it, what's the problem?
        
               | geodel wrote:
               | >.. they change hotels
               | 
                | And then they learn all hotels are doing exactly the
                | same thing. One hotel doing it is a risk; all hotels
                | doing it is the industry standard.
                | 
                | Airlines, hotels, restaurants, doctors and so on
                | oversubscribe all the time. Whoever complains is free to
                | move on and add to their disappointments.
        
               | imtringued wrote:
               | Memory (DDR4) is like 1.50EUR per GB nowadays. There is
               | no need to skimp on it. The most expensive component is
               | still the CPU if you actually want to do work and not
               | just idle your webserver 99% of the time.
        
               | no_wizard wrote:
               | this ignores just how competitive these services have
               | become and how much competition there is in this
               | marketplace. Anything that allows a company to balance
               | user experience with cost of hardware will be evaluated.
               | The fact is this is super common because even though RAM
               | is super cheap, at scale, these things add up. If you as
               | a provider don't have to spend 120K on RAM and can use
               | that on more CPUs or something else, it can mean having a
               | competitive edge over the provider that _does_ have to
               | spend that 120K.
        
               | ushakov wrote:
               | It's a common technique though. I believe it's called
               | oversubscription, where you rent the same hardware to
               | more tenants hoping they won't use it all at once.
               | 
                | Fly.io themselves admitted they're oversubscribed, and
                | AWS has been doing the same for _years_ now
               | 
               | Source: https://fly.io/blog/the-serverless-server/
        
       | heavyset_go wrote:
       | Someone posted this and then immediately deleted their comment:
       | https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht...
       | 
       | I didn't know it existed until they posted, but QEMU has a
       | Firecracker-inspired target:
       | 
       | > _microvm is a machine type inspired by Firecracker and
       | constructed after its machine model._
       | 
       | > _It's a minimalist machine type without PCI nor ACPI support,
       | designed for short-lived guests. microvm also establishes a
       | baseline for benchmarking and optimizing both QEMU and guest
       | operating systems, since it is optimized for both boot time and
       | footprint._
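        | 
        | For illustration, a minimal sketch of booting that machine type
        | from Python, loosely based on the example in the linked docs
        | (the kernel image, rootfs path and sizing are placeholders):
        | 
        |     import subprocess
        | 
        |     # Boot QEMU's Firecracker-inspired "microvm" machine model,
        |     # which uses virtio-mmio devices instead of PCI. Assumes a
        |     # guest kernel with virtio-mmio support and a raw rootfs.
        |     cmd = [
        |         "qemu-system-x86_64",
        |         "-M", "microvm",
        |         "-enable-kvm", "-cpu", "host",
        |         "-m", "512m", "-smp", "2",
        |         "-nodefaults", "-no-user-config", "-nographic",
        |         "-kernel", "vmlinux",
        |         "-append", "console=ttyS0 root=/dev/vda rw",
        |         "-serial", "stdio",
        |         "-drive", "id=root,file=rootfs.img,format=raw,if=none",
        |         "-device", "virtio-blk-device,drive=root",
        |     ]
        |     subprocess.run(cmd, check=True)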
        
       | lessaligned wrote:
       | "the fork was very very bad for eating soup - this is a story
       | about how we migrated to a spoon"
       | 
        | ...Firecracker does fine at what it was designed for:
        | short-running, fast-start workloads.
        | 
        | (oh, and the article starts by slightly misusing a bunch of
        | technical terms; Firecracker's not technically a hypervisor per
        | se)
        
         | sheepscreek wrote:
         | ..so is it more to support directly deploying functions to the
         | cloud? Like, what AWS Lambda and CloudFront Functions might be
         | built on?
        
           | yjftsjthsd-h wrote:
           | I'm pretty sure firecracker was literally created to underlie
           | AWS Lambda.
           | 
           | EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-
           | web-service... says my "pretty sure" memory is in fact
           | correct.
        
             | bkettle wrote:
             | As does the paper [1] with details in section 4.1.
             | 
             | [1]: https://www.usenix.org/system/files/nsdi20-paper-
             | agache.pdf
        
             | zokier wrote:
             | That being said, firecracker also runs long-running tasks
             | on AWS in the form of Fargate
        
           | dathinab wrote:
            | yes, it was created originally for AWS Lambda
            | 
            | mainly it's optimized to run code only briefly (init time
            | max 10s, max usage 15min, and default max request time 130s
            | AFAIK)
            | 
            | also it's focused on thin serverless functions, e.g.
            | deserialize some request, run some thin, simple business
            | logic, and then delegate to other lambdas based on it. This
            | kind of function often has similar memory usage per call,
            | and if a call is an outlier it can just discard the VM
            | instance soon after (i.e. at most after starting up a new
            | instance, i.e. at most 10s later)
        
         | dathinab wrote:
          | it's not that simple; many other companies running
          | longer-running jobs, including their competition, use
          | Firecracker
          | 
          | so while Firecracker was designed for things running just a
          | few seconds, there are many places running it with jobs that
          | run way longer than that
          | 
          | the problem is that if you want to make it work with
          | long-running general-purpose images you don't control, you
          | have to put a ton of work into making it work nicely at all
          | levels of your infrastructure and code ... which is costly ...
          | which a startup competing on an online dev environment
          | (compared to e.g. a VM hosting service) probably shouldn't
          | waste time on
          | 
          | So AFAIK the decision in the article makes sense, but the
          | reasons listed for the decision are oversimplified to the
          | point that you could say they aren't quite right. Idk why;
          | could be anything from the engineer believing that, to them
          | avoiding issues with some shareholder/project lead who is
          | obsessed with "we need to do Firecracker because the
          | competition does so too".
        
       | [deleted]
        
       | mgaunard wrote:
       | I know a good way to make a process make the most of the hardware
       | and play cooperatively with other processes: don't use
       | virtualization.
       | 
       | I will never understand the whole virtual machine and cloud
       | craze. Your operating system is better than any hypervisor at
       | sharing resources efficiently.
        
         | kdmytro wrote:
         | I want to segregate, not to share.
        
         | akira2501 wrote:
         | > cloud craze.
         | 
         | Automatic scaling is great. Cloud parallelization (a.k.a fork)
         | is absolutely wild once you get it rolling. Code deployments
         | are incredibly simple. Never having to worry about physical
         | machines or variable traffic loads is worth the small overhead
         | they charge me for the wrapper. The generic system wide
         | permissions model is an absolute joy once you get over the
         | learning curve.
        
       | gwd wrote:
       | Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs
       | right on the hardware. KVM is a hypervisor. Firecracker is a
       | process that controls KVM. If you want to call firecracker (and
       | QEMU, when used in conjunction with KVM) a VMM ("virtual machine
       | monitor") I won't complain. But please please please, we need a
       | word for what KVM and Xen are, and "hypervisor" is the best fit.
       | Stop using that word for a user-level process like Firecracker.
        
         | jstarks wrote:
         | Keep fighting the good fight, friend.
         | 
          | Although I'll note that the line between a VMM and hypervisor
          | is not always clear. E.g., KVM includes some things that other
         | hypervisors delegate to the VMM (such as instruction
         | completion). And macOS's hypervisor.framework is almost a pass
         | through to the CPU's raw capabilities.
        
         | foundry27 wrote:
         | Nitpick: it's not accurate to say that a hypervisor, by
         | definition, runs right on the hardware. Xen (as a type-1
         | hypervisor) has this property; KVM (as a type-2 hypervisor)
         | does not. It's important to remember that the single core
         | responsibility of a hypervisor is to divide hardware resources
         | and time between VMs, and this decision-making doesn't require
         | bare-metal.
         | 
         | For those unfamiliar, the informal distinction between type-1
         | and type-2 is that type-1 hypervisors are in direct control of
         | the allocation of all resources of the physical computer, while
         | type-2 hypervisors operate as some combination of being "part
         | of" / "running on" a host operating system, which owns and
         | allocates the resources. KVM (for example) gives privileged
         | directions to the Linux kernel and its virtualization kernel
         | module for how to manage VMs, and the kernel then schedules and
         | allocates the appropriate system resources. Yes, the type-2
         | hypervisor needs kernel-mode primitives for managing VMs, and
         | the kernel runs right on the hardware, but those primitives
         | aren't making management decisions for the division of hardware
         | resources and time between VMs. The type-2 hypervisor is making
         | those decisions, and the hypervisor is scheduled by the OS like
         | any other user-mode process.
        
           | veber-alex wrote:
           | KVM is a type-1 hypervisor [1]
           | 
           | [1]: https://www.redhat.com/en/topics/virtualization/what-is-
           | KVM
        
             | compscigirl wrote:
             | Not really, calling KVM a type-1 is a misunderstanding of
             | what the "bare-metal" distinction is referring to. The real
             | difference between the two types is whether the hypervisor
             | owns the hardware or not. In the case of a type-1, the
             | hypervisor runs below the kernel and controls access to the
             | hardware, even for the kernel. In type-2, the hypervisor
             | runs on the kernel, which owns the hardware, and must go
             | through the kernel to use hardware resources.
        
               | cthalupa wrote:
               | But that's not how that works. KVM is as "bare-metal" in
               | access to the system as ESXi is. The hypervisor code runs
               | in ring 0 in both cases.
        
             | monocasa wrote:
             | There's arguments in both directions for something like
             | kvm. Wiki states it pretty well:
             | 
             | > The distinction between these two types is not always
             | clear. For instance, KVM and bhyve are kernel modules[6]
             | that effectively convert the host operating system to a
             | type-1 hypervisor.[7] At the same time, since Linux
             | distributions and FreeBSD are still general-purpose
             | operating systems, with applications competing with each
             | other for VM resources, KVM and bhyve can also be
             | categorized as type-2 hypervisors.[8]
             | 
             | https://en.wikipedia.org/wiki/Hypervisor#Classification
        
           | cthalupa wrote:
           | Type-1 and type-2 hypervisor is terminology that should at
           | this point be relegated to the past.
           | 
            | It was never popularly used in a way accurate to the origin
            | of the classification - the original paper by Popek and
            | Goldberg talked about formal proofs for the two types, and
            | those have very little to do with how the terms began being
            | used in the 90s and 00s. Things have changed a lot with
            | computers since the 70s, when the paper was written and the
            | terminology was coined.
           | 
           | So, language evolves, and Type-1 and Type-2 came to mean
           | something else in common usage. And this might have made
           | sense to differentiate something like esx from vmware
           | workstation in their capabilities, but it's lost that utility
           | in trying to differentiate Xen from KVM for the overwhelming
           | majority of use cases.
           | 
           | Why would I say it's useless in trying to differentiate, say,
           | Xen and KVM? Couple of reasons:
           | 
           | 1) There's no performance benefit to type-1 - a lot of
           | performance sits on the device emulation side, and both are
           | going to default to qemu there. Other parts are based heavily
           | on CPU extensions, and Xen and KVM have equal access there.
           | Both can pass through hardware, support sr-iov, etc., as
           | well.
           | 
           | 2) There's no overhead benefit in Xen - you still need a dom0
           | VM, which is going to arguably be even more overhead than a
           | stripped down KVM setup. There's been work on dom0less Xen,
           | but it's frankly in a rough state and the related drawbacks
           | make it challenging to use in a production environment.
           | 
           | Neither term provides any real advantage or benefit in
           | reasoning between modern hypervisors.
        
             | gwd wrote:
             | > Type-1 and type-2 hypervisor is terminology that should
             | at this point be relegated to the past.
             | 
             | Maybe it's because of the time I grew up in, but in my mind
             | the prototypical Type-I hypervisor is VMWare ESX Server;
             | and the prototypical Type-II hypervisor is VMWare
             | Workstation.
             | 
             | It should be noted that VMWare Workstation always required
             | a kernel module (either on Windows or Linux) to run; so the
             | core "hypervisor-y" bit runs in kernel mode either way. So
             | what's the difference?
             | 
             | The key difference between those two, to me is: Is the
             | thing at the bottom designed exclusively to run VMs, such
             | that every other factor gives way? Or does the thing at the
             | bottom have to "play nice" with random other processes?
             | 
             | The scheduler for ESX Server is written explicitly to
             | schedule VMs. The scheduler for Workstation is the Windows
             | scheduler. Under ESX, your VMs are the star of the show;
             | under Workstation, your VMs are competing with the random
             | updater from the printer driver.
             | 
              | Xen is like ESX Server: VMs are the star of the show. KVM
              | is
             | like Workstation: VMs are "just" processes, and are
             | competing with whatever random bash script was created at
             | startup.
             | 
             | KVM gets loads of benefits from being in Linux; like, it
             | had hypervisor swap from day one, and as soon as anyone
             | implements something new (like say, NUMA balancing) for
             | Linux, KVM gets it "for free". But it's not _really_ for
             | free, because the cost is that KVM has to make
             | accommodations to all the other use cases out there.
             | 
             | > There's no performance benefit to type-1 - a lot of
             | performance sits on the device emulation side, and both are
             | going to default to qemu there.
             | 
             | Er, both KVM and Xen try to switch to paravirtualized
             | interfaces as fast as possible, to minimize the emulation
             | that QEMU has to do.
        
               | cthalupa wrote:
               | >Maybe it's because of the time I grew up in, but in my
               | mind the prototypical Type-I hypervisor is VMWare ESX
               | Server; and the prototypical Type-II hypervisor is VMWare
               | Workstation.
               | 
               | My point is that these are largely appropriated terms -
               | neither would fit the definitions of type 1 or type 2
               | from the early days when Popek and Goldberg were writing
               | about them.
               | 
               | > Or does the thing at the bottom have to "play nice"
               | with random other processes?
               | 
               | From this perspective, Xen doesn't count. You can have
               | all sorts of issues from the dom0 side and competing with
               | resources - you mention PV drivers later, and you can
               | 100% run into issues with VMs because of how dom0
               | schedules blkback and netback when competing with other
               | processes.
               | 
               | ESXi can also run plenty of unmodified linux binaries -
               | go back in time 15 years and it's basically a fully
               | featured OS. There's a lot running on it, too. Meanwhile,
               | you can build a linux kernel with plenty of things
               | switched off and a root filesystem with just the bare
               | essentials for managing kvm and qemu that is even less
               | useful for general purpose computing than esxi.
               | 
               | >Er, both KVM and Xen try to switch to paravirtualized
               | interfaces as fast as possible, to minimize the emulation
               | that QEMU has to do.
               | 
               | There are more things being emulated than there are PV
               | drivers for, but this is a bit outside of my point.
               | 
               | For KVM, the vast majority of implementations are using
               | qemu for managing their VirtIO devices as well -
               | https://developer.ibm.com/articles/l-virtio/ - you'll
               | notice that IBM even discusses these paravirtual drivers
               | directly in context of "emulating" the device. Perhaps a
               | better way to get the intent across here would be saying
               | qemu handles the device model.
               | 
               | From a performance perspective, ideally you'd want to
               | avoid PV here too and go with sr-iov devices or
               | passthrough.
        
           | bonzini wrote:
           | According to the actual paper that introduced the
           | distinction, and adjusting for change of terminology in the
           | last 50 years, a type-1 hypervisor runs in kernel space and a
           | type-2 hypervisor runs in user space. x86 is not
           | virtualizable by a type-2 hypervisor, except by software
           | emulation of the processor.
           | 
           | What actually can change is the amount of work that the
           | kernel-mode hypervisor leaves to a less privileged (user
           | space) component.
           | 
           | For more detail see
           | https://www.spinics.net/lists/kvm/msg150882.html
        
         | rdtsc wrote:
          | I think people just pick the coolest-sounding term. Imagine
          | someone is sharing what they are working on; what sounds
          | cooler, "I am working on a virtual machine monitor" or "I am
          | working on a hypervisor"? Hypervisor just sounds futuristic
          | and awesome.
         | 
         | It's like with "isomorphic" code. That just sounds much cooler
         | than "js that runs on the client and the server".
        
         | arun-mani-j wrote:
         | I think you could help me answer the question that has been in
         | my mind for a month :)
         | 
         | Is there any article that tells the difference and relationship
         | between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc.
         | with their typical use cases?
        
           | jacobr1 wrote:
           | [flagged]
        
             | doliveira wrote:
             | Honestly, when we're discussing with humans we want the
             | humans' response. Anyone of us can ask ChatGPT, I don't
             | know why people keep pasting its answers
        
               | solarpunk wrote:
               | people keep doing it because it lets them farm upvotes.
        
             | theossuary wrote:
             | Damn, outdone by ChatGPT
        
             | bonzini wrote:
             | Pretty good I must say. AWS however is using KVM only these
             | days!
        
               | orangepurple wrote:
               | What is AWS Nitro then?
        
               | bonzini wrote:
               | Nitro is the custom hardware that runs in AWS data
               | centers (or in outposts). You can use it bare metal, or
               | if you use VM instances it will run Linux (plus KVM) and
               | a custom user space component that replaces QEMU.
        
           | theossuary wrote:
            | KVM is the Linux kernel's implementation of the CPU
            | virtualization extensions, used to accelerate VMs to near
            | bare-metal speeds.
            | 
            | QEMU is a user space system emulator. It can emulate
            | different architectures like ARM, x86, etc. in software. It
            | can also emulate drivers, networking, disks, etc. It is
            | invoked via the command line.
            | 
            | The reason you'll see QEMU/KVM a lot is because QEMU is the
            | emulator, the thing actually running the VM. And it utilizes
            | KVM (on Linux; macOS has HVF, for example) to accelerate the
            | VM when the host architecture matches the VM's.
           | 
           | Libvirt is an XML based API on top of Qemu (and others). It
           | allows you to define networks, VMs (it calls them domains),
           | and much more with a unified XML schema through libvirtd.
           | 
           | Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI
           | to do the same.
           | 
           | Proxmox is Debian under the hood with Qemu/KVM running VMs.
           | It provides a robust web UI and easy clustering capabilities.
           | Along with nice to haves like easy management of disks, ceph,
           | etc. You can also manage Ceph through an API with Terraform.
           | 
           | Xen is an alternative hypervisor (like esxi). Instead of
            | running on top of Linux, Xen has its own microkernel. This
           | means less flexibility (there's no Linux body running
           | things), but also simpler to manage and less attack surface.
           | I haven't played much with xen though, KVM is kind of the
           | defacto, but IIRC AWS used to use a modified Xen before KVM
           | came along and ate Xen's lunch.
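            | 
            | To make the layering concrete, a tiny sketch using the
            | libvirt Python bindings (assumes the libvirt-python package
            | and a local qemu:///system libvirtd; purely illustrative):
            | 
            |     import libvirt
            | 
            |     # libvirt manages the QEMU processes; QEMU in turn uses
            |     # KVM for CPU acceleration when it's available.
            |     conn = libvirt.open("qemu:///system")
            |     for dom in conn.listAllDomains():
            |         # info() -> [state, maxMem, mem, nrVirtCpu, cpuTime]
            |         print(dom.name(), dom.info())
            |     conn.close()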
        
             | cthalupa wrote:
             | > Xen is an alternative hypervisor (like esxi). Instead of
             | running on top of Linux, Xen has it's own microkernel. This
             | means less flexibility (there's no Linux body running
             | things), but also simpler to manage and less attack
             | surface.
             | 
             | You're going to need dom0 (a "control domain") on any Xen
             | host. Gotta have something running xl and the rest of the
             | toolstack for managing it. dom0less technically exists but
             | the drawbacks mean it's not really usable by most people in
             | a production situation.
        
             | adql wrote:
             | >Xen is an alternative hypervisor (like esxi). Instead of
             | running on top of Linux, Xen has it's own microkernel. This
             | means less flexibility (there's no Linux body running
             | things), but also simpler to manage and less attack
             | surface. I haven't played much with xen though, KVM is kind
             | of the defacto, but iirc AWS uses a modified Xen.
             | 
              | If you had actually played with Xen you'd know it's not
              | actually easier to manage. And the increased security
              | claims are dubious at best, as the same thing that would
              | be attacked (dom0 managing the whole thing and running
              | Linux) has direct, unfettered access to the Xen
              | microkernel. There is a reason many sites migrated away
              | from Xen to KVM. Also, many Xen drivers de facto run
              | partly in the Linux dom0 instance, so you don't even get
              | that isolation.
              | 
              | We ran Xen for a few years, as KVM at first was still not
              | as refined and Xen matured earlier, and it was just a
              | million little annoying things.
              | 
              | KVM offers far simpler and more straightforward
              | management. A VM is just a process. You can look at its
              | CPU usage via normal tools. No magic. No driver problems.
        
             | monocasa wrote:
             | > I haven't played much with xen though, KVM is kind of the
             | defacto, but iirc AWS uses a modified Xen.
             | 
             | AWS switched to KVM, and even a lot of AWS systems that
             | report themselves as Xen are running as KVM with a compat
             | shim to say it's Xen.
        
           | gabereiser wrote:
           | KVM is kernel-based virtual machine, with libvirt being its
           | API abstraction over all of it. QEMU is a virtual machine
           | host that leverages kvm or software virtualization to spin up
           | machines on the host. virt-manager does the same. Xen is
           | another virtual machine host, like KVM. Proxmox is a virtual
           | machine manager (like QEMU, virt-manager) but is web based.
           | Libvirt will provide abstraction for kvm,qemu,xen
           | 
           | Use cases: proxmox web interface exposed on your local
           | network on a KVM Linux box that uses QEMU to manage VM's.
           | Proxmox will allow you to do that from the web. QEMU is great
           | for single or small fleet of machines but should be automated
           | for any heavy lifting. Proxmox will do that.
        
             | bonzini wrote:
             | This is almost entirely wrong especially as far as QEMU,
             | Libvirt and virt-manager are concerned.
             | 
             | QEMU is a low level process that represents the virtual
             | machine. It has no equivalent in Xen. Using QEMU directly
             | is not a good idea unless your needs for VM configurations
             | change all the time and you hardly reuse VMs.
             | 
             | Libvirt is at a higher level than QEMU. It manages the QEMU
             | processes and gives them access to system resources (image
             | files, network interfaces, pass-through PCI devices). It
             | also makes it easy to manage the configuration of your
             | virtual machines and the resources they use.
             | 
             | Higher still is virt-manager, which is a GUI interface for
             | libvirt. Proxmox sits at roughly the same level as virt-
             | manager.
        
               | gabereiser wrote:
               | How? KVM and Xen are kernel level. QEMU uses KVM but also
               | has a software virtualization capability. Libvirt is an
               | API abstraction over it all. virt-manager is a gui app to
               | manage libvirt machines. Proxmox as well. Proxmox VE
               | talks to VMHost via libvirt.
        
               | bonzini wrote:
               | Libvirt does not use KVM. Libvirt uses either QEMU (which
               | in turn might or might not use KVM) or Xen or other
               | hypervisors. So it's incorrect to say that Libvirt
               | abstracts over KVM.
               | 
               | And virt-manager indeed manages Libvirt machines so it's
               | not at the level of QEMU as you wrote in the parent
               | comment:
               | 
               | > Proxmox is a virtual machine manager (like QEMU, virt-
               | manager)
        
               | gabereiser wrote:
               | Semantics, libvirt abstracts over KVM via QEMU because
               | QEMU/KVM/HVT is all one driver.
        
               | bonzini wrote:
               | KVM is not enough to create a virtual machine. KVM only
               | virtualizes the processor, not a whole machine.
        
               | gabereiser wrote:
               | >Using KVM, one can run multiple virtual machines running
               | unmodified Linux or Windows images. Each virtual machine
               | has private virtualized hardware: a network card, disk,
               | graphics adapter, etc.
               | 
               | Straight from their site. QEMU is the user space
               | interface, KVM the kernel space driver. It's enough to
               | run whatever OS. That's the point.
               | 
               | For libvirt: https://libvirt.org/drivers.html
               | 
               | They support a bunch as well.
        
               | cthalupa wrote:
               | I don't want to necessarily make this an argument to/from
               | authority, but for some context here - you are discussing
               | this with Paolo Bonzini, maintainer of KVM, contributor
               | to QEMU. In the list of people that best understand the
               | difference and demarcation points between KVM and QEMU,
               | he's pretty far up there.
        
               | bonzini wrote:
               | Exactly, it's QEMU that abstracts over the processor
               | virtualization APIs of KVM, Microsoft WHPX, Apple
               | Hypervisor.framework etc. Not Libvirt.
        
           | Izmaki wrote:
           | I don't know if _one_ such article exists, but here is a
           | piece of tech doc from oVirt (yet another tool) that shows
           | how - or that - VDSM is used by oVirt to communicate with
           | QEMU through libvirt: https://www.ovirt.org/develop/architect
           | ure/architecture.html...
           | 
           | In really simple terms, so simple that I'm not 100% sure they
           | are correct:
           | 
           | * KVM is a hypervisor, or rather it lets you turn linux into
           | a hypervisor [1], which will let you run VMs on your machine.
           | I've heard KVM is rather hard to work with (steep learning
           | curve). (Xen is also a hypervisor.)
           | 
           | * QEMU is a wrapper-of-a-sorts (a "machine emulator and
           | virtualizer" [2]) which can be used on top of KVM (or Xen).
           | "When used as a virtualizer, QEMU achieves near native
           | performance by executing the guest code directly on the host
           | CPU. QEMU supports virtualization when executing under the
           | Xen hypervisor or using the KVM kernel module in Linux." [2]
           | 
           | * libvirt "is a toolkit to manage virtualization platforms"
           | [3] and is used, e.g., by VDSM to communicate with QEMU.
           | 
           | * virt-manager is "a desktop user interface for managing
           | virtual machines through libvirt" [4]. The screenshots on the
           | project page should give an idea of what its typical use-case
           | is - think VirtualBox and similar solutions.
           | 
           | * Proxmox is the above toolstack (-ish) but as one product.
           | 
           | ---
           | 
           | [1] https://www.redhat.com/en/topics/virtualization/what-is-
           | KVM
           | 
           | [2] https://wiki.qemu.org/Main_Page
           | 
           | [3] https://libvirt.org/
           | 
           | [4] https://virt-manager.org/
        
             | Izmaki wrote:
             | Ps: typically if you want to run VMs you are faced with two
             | paths only: 1) you want one or more VMs on your personal
             | computer or 2) you want one or more VMs in an office
             | environment.
             | 
             | On the first path you are likely going to be just fine with
             | VirtualBox, VMWare Workstation or Hyper-V (Windows only) /
             | Parallels (Mac intended). Which one you should pick depends
             | on your desired use of the machines.
             | 
             | On the second path you would go with a solution that deals
             | with the nitty-gritty details, such as Proxmox, oVirt,
             | Hyper-V, ESXi, or any of the other many available options -
             | granted you are not going full cloud-based, which opens up
             | a whole lot of different options too.
             | 
             | You would generally never need to worry about which
             | components are needed where and why. I've had to worry
             | about it once or twice before, because I've had to debug
             | why an oVirt solution was not behaving like I wanted it to
             | behave. Knowing the inner workings helps in that case.
        
               | crabbone wrote:
               | > On the first path you are likely going to be just fine
               | with VirtualBox, VMWare Workstation or Hyper-V (Windows
               | only) / Parallels (Mac intended).
               | 
               | As a Linux user, why would you want to use VirtualBox or
               | VMWare Workstation? They are not so well integrated with
               | the system, and, frankly, VirtualBox is more of a toy VM
               | player... just go for virt-manager. It gives a
               | conceptually similar interface to VirtualBox, but better
               | integration with the rest of the system. Especially, when
               | it comes to stuff like sending different key
               | combinations.
               | 
               | I honestly cannot think of a single benefit to using
               | VirtualBox (and I'm less familiar with VMWare player)
               | compared to virt-manager. My guess is that it's more
                | often used because it's also a common choice on MS
                | Windows, so you get more hits if you search the Web for
                | questions associated with VMs, and you'd get tutorials
                | for how to set up a VM that use VirtualBox. But if you
                | apply yourself to learning how either one of these
                | works, you'd see no reason to choose it.
        
               | mkesper wrote:
               | The only reason for using it would be that the GUI is a
               | little bit more capable. But feature crippling without
               | the PUEL addons is horrible.
        
         | [deleted]
        
         | MuffinFlavored wrote:
         | > virtual machine monitor
         | 
          | Is it good to think of libvirt as a virtual machine monitor,
          | or
         | is that more "virtual machine management"?
        
           | zbrozek wrote:
           | I'd love to get a clear explanation of what libvirt actually
           | does. As far as I can tell it's a qemu argument assembler and
           | launcher. For my own use-case, I just launch qemu from
           | systemd unit files:
           | 
           | https://wiki.archlinux.org/title/QEMU#With_systemd_service
        
             | bonzini wrote:
             | The main important point is that Libvirt takes care of
             | privilege separation.
             | 
             | It makes sure that if your VM and/or QEMU are broken out
             | of, there are extra layers to prevent getting access to the
             | whole physical machine. For example it runs QEMU as a very
             | limited user and, if you're using SELinux, the QEMU process
             | can hardly read any file other than the vm image file.
             | 
             | By contrast the method in the arch wiki runs QEMU as root.
             | QEMU is exposed to all sort of untrusted input, so you
             | really don't want it to run as root.
             | 
             | Libvirt also handles cross machine operations such as live
             | migration, and makes it easier to query a bunch of things
             | from QEMU.
             | 
             | For more info see https://www.redhat.com/en/blog/all-you-
             | need-know-about-kvm-u...
        
               | zbrozek wrote:
               | Thank you, this is the best explanation of the utility
               | that I've seen. I appreciate it.
        
             | MuffinFlavored wrote:
             | > As far as I can tell it's a qemu argument assembler
             | 
             | Raises a few questions to me:
             | 
             | Can you use KVM/do KVM stuff without QEMU?
             | 
             | Can you do libvirt stuff without QEMU?
             | 
             | Hoping the answers to both aren't useless/"technically, but
             | why would you want to?"
        
               | zokier wrote:
                | There is a cute article from LWN demoing using KVM
                | directly without anything else:
                | https://lwn.net/Articles/658511/
        
               | mrweasel wrote:
               | That is really neat, it never occurred to me that you
               | could just use KVM as a sort of sandbox.
        
               | simcop2387 wrote:
               | > Can you use KVM/do KVM stuff without QEMU?
               | 
                | Yes, there are a few things out there like Firecracker
                | that use KVM without using QEMU. I'm not completely
                | aware of all of them, but they do exist.
               | 
               | > Can you do libvirt stuff without QEMU?
               | 
                | Yes, it can also manage LXC containers and a few other
                | types like Xen, Bhyve, and Virtuozzo, as well as QEMU
                | without KVM. The "without KVM" part is important because
                | it lets you run VMs that emulate architectures other
                | than the native one.
               | 
               | For a good bit of this, it is "why would you want to" but
               | there are definitely real cases where you'd want to be
               | able to do this. Like the LXC or Virtuozzo support means
               | that you can run lighter weight containers (same
               | underlying tech as Docker essentially) through the same
               | orchestration/management that you use for virtual
               | machines. And the Bhyve support lets you do the same
               | thing for running things on top of FreeBSD (though I've
               | never used it this way) so that a heterogeneous mix of
               | hosts is managed through the same interfaces.
        
               | jononor wrote:
               | libvirt supports other hypervisors than QEMU, such as
               | VirtualBox, Xen and LXC.
        
               | MuffinFlavored wrote:
               | I know ChatGPT responses get downvoted here but I fed it
               | this to help me better understand this:
               | 
               | > is KVM a hypervisor? is it type 1 or type 2? is QEMU a
               | hypervisor, is it type 1 or type 2? if QEMU is using KVM,
               | is QEMU then not a hypervisor in that use case?
               | 
               | Yes, KVM (Kernel-Based Virtual Machine) is indeed a
               | hypervisor. It's a type 1 hypervisor, also known as a
               | "bare metal" hypervisor. This is because KVM directly
               | runs on the host's hardware to control the hardware and
               | to manage guest operating systems. The fact that it's a
               | Linux kernel module that allows the Linux kernel to
               | function as a hypervisor makes it very efficient.
               | 
               | QEMU (Quick Emulator) is a bit more complex. By itself,
               | it is technically a type 2 or "hosted" hypervisor,
               | meaning it runs within a conventional operating system
               | environment. QEMU is a generic, open-source machine
               | emulator and virtualizer that can emulate a variety of
               | hardware types and host a range of guest operating
               | systems.
               | 
               | However, when QEMU is used with KVM, the picture changes
               | somewhat. In this case, KVM provides the hardware
               | virtualization where it allows the host machine to call
               | CPU instructions of full virtualization. Then, QEMU
               | emulates the hardware resources and provides the user
               | interface for the VM, thus allowing for better
               | performance and usability. It's this combination of KVM's
               | hardware acceleration capabilities and QEMU's emulation
               | capabilities that makes them often used together.
               | 
               | In this case, QEMU is not acting purely as a hypervisor;
               | it's providing hardware emulation and user interface for
               | the VMs, while KVM is the part providing the hypervisor
               | functionality. However, we often refer to the combination
               | of "QEMU/KVM" as a unit when talking about this mode of
               | operation.
        
               | adql wrote:
               | > Can you use KVM/do KVM stuff without QEMU?
               | 
                | There are projects doing that, although qemu is the de
                | facto standard and the best bet if you don't need to
                | boot your machines in 20ms
               | 
               | > Can you do libvirt stuff without QEMU?
               | 
                | Libvirt has _many_ backends, so yes. IIRC it can even
                | manage VirtualBox, although I'm not sure why anyone
                | would want to
               | 
               | > Hoping the answers to both aren't useless/"technically,
               | but why would you want to?"
               | 
               | ...why? Is there a problem kvm+qemu+libvirt doesn't solve
               | for you?
        
           | adql wrote:
           | It's a lot of glue to present a consistent interface but it
           | also does the management part.
           | 
           | "API to virtualization system" would probably be closest
           | approximation but it also does some more advanced stuff like
           | coordinating cross-host VM migration
        
       | datadeft wrote:
       | Tl;dr: We tried to misuse technology and we failed. If
        | Firecracker was developed for a single binary executed for a
        | short period of time, why do you try to use it for multiple
       | executables running for a long time? Does it make any sense to
       | even try?
        
         | zokier wrote:
          | AWS uses Firecracker to execute long-running Fargate tasks;
          | it's hardly misuse
        
         | mike_hock wrote:
         | Where in the "sales" pitch on the fancy-CSS website as well as
         | the README does it say only to use it for single-shot
         | workloads?
         | 
         | I think the complaints are perfectly valid.
        
       | favflam wrote:
       | I know that Firecracker does not let you bind mount volumes, but
       | QEMU does. So, we changed to QEMU from Firecracker. If you run
       | the workloads in Kubernetes, you just have to change a single
       | value in a yaml file to change the runtime.
       | 
       | I would be scared to let unknown persons use QEMU that bind
       | mounts volumes as that is a huge security risk. Firecracker, I
       | think, was designed from the start to run un-sanitized workloads,
        | hence no bind mounting.
        
         | brunoborges wrote:
         | > you just have to change a single value in a yaml file
         | 
          | Most dangerous 12-word sentence.
        
       | Havoc wrote:
        | I toyed with it a bit and was delighted to get it running, only
        | to discover that getting even basic networking going is another
        | mission in itself.
        | 
        | Lightweight is cool, but for many tasks that level of Spartan
        | minimalism is overkill.
        | 
        | If I'm investing time in going lightweight, it might as well be
        | in wasm tech.
        
       | k__ wrote:
       | _" Firecracker...'s excellent for running short-lived
       | workloads...A little-known fact about Firecracker is its lack of
       | support... for long-lived workloads."_
       | 
       | Okay.
        
       | rgbrenner wrote:
       | _" Firecracker's RAM footprint starts low, but once a workload
       | inside allocates RAM, Firecracker will never return it to the
       | host system."_
       | 
        | Firecracker has a balloon device you can inflate (i.e. acquire
        | as
       | much memory inside the VM as possible) and then deflate...
       | returning the memory to the host. You can do this while the VM is
       | running.
       | 
       | https://github.com/firecracker-microvm/firecracker/blob/main...
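        | 
        | A rough sketch of driving that balloon over Firecracker's API
        | socket from Python (the socket path is a placeholder, and the
        | balloon device has to have been configured before boot):
        | 
        |     import json, socket
        | 
        |     def patch_balloon(api_sock, amount_mib):
        |         # PATCH /balloon resizes an already-configured balloon:
        |         # a larger amount_mib pulls memory out of the guest,
        |         # 0 hands it all back to the guest.
        |         body = json.dumps({"amount_mib": amount_mib}).encode()
        |         hdr = ("PATCH /balloon HTTP/1.1\r\n"
        |                "Host: localhost\r\n"
        |                "Content-Type: application/json\r\n"
        |                "Content-Length: {}\r\n\r\n").format(len(body))
        |         with socket.socket(socket.AF_UNIX) as s:
        |             s.connect(api_sock)
        |             s.sendall(hdr.encode() + body)
        |             return s.recv(4096).decode()
        | 
        |     # patch_balloon("/tmp/firecracker.socket", 512)  # reclaim
        |     # patch_balloon("/tmp/firecracker.socket", 0)    # hand back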
        
         | 0xbadcafebee wrote:
         | That and the fact that this was after "several weeks of
         | testing" tells me this team doesn't have much virtualization
         | experience. Firecracker is designed to quickly virtualize 1
         | headless stateless app (like a container), not run hundreds of
         | different programs in a developer environment.
        
         | CompuIves wrote:
         | Yes, we use this at CodeSandbox for reclaiming memory to the
         | host (and to reduce snapshot size when we hibernate the VM).
        
         | paulv wrote:
          | The first footnote says _If you squint hard enough, you'll
         | find that Firecracker does support dynamic memory management
         | with a technique called ballooning. However, in practice, it's
         | not usable. To reclaim memory, you need to make sure that the
         | guest OS isn't using it, which, for a general-purpose workload,
         | is nearly impossible_
        
           | adql wrote:
            | Yeah, it's a pretty hard problem, as you'd need to
            | defragment physical memory (while fixing all the
            | virtual-to-physical mappings) to make a contiguous block to
            | free
        
           | [deleted]
        
           | dspillett wrote:
           | _> which, for a general-purpose workload, is nearly
           | impossible_
           | 
           | That depends on the workload and the maximum memory allocated
           | to the guest OS.
           | 
            | A lot of workloads rely on the OS cache/buffers to manage
            | IO, so unless RAM is quite restricted you can ask the guest
            | to release that pretty easily prior to having the balloon
            | driver do its thing. In fact I'd not be surprised to be told
            | the balloon process does this automatically itself.
            | 
            | If the workload does its own IO management and memory
            | allocation (something like SQL Server, which will eat what
            | RAM it can and does its own IO caching), or the VM's memory
            | allocation is too small for OS caching to be a significant
            | use after the rest of the workload (you might pare memory
            | down to the bare minimum like this for a "fairly static
            | content" server that doesn't see much variation in memory
            | needs and can be allowed to swap a little if things grow
            | temporarily), then I'd believe it is more difficult. That is
           | hardly the use case for firecracker though so if that is the
           | sort of workload being run perhaps reassessing the tool used
           | for the job was the right call.
           | 
           | Having said that my use of VMs is generally such that I can
           | give them a good static amount of RAM for their needs and
           | don't need to worry about dynamic allocation, so I'm far from
           | a subject expert here.
           | 
            | And, isn't firecracker more geared towards short-lived VMs,
           | quick to spin up, do a job, spin down immediately (or after
           | only a short idle timeout if the VM might answer another
           | request if one comes in immediately or is already queued), so
           | you are better off cycling VMs, which is probably happening
           | anyway, than messing around with memory balloons? Again, I'm
           | not talking from a position of personal experience here so
           | corrections/details welcome!
        
           | dathinab wrote:
           | > is nearly impossible
           | 
           | for many mostly "general purpose" use cases it's quite
           | viable, or else ~fly.io~ AWS Fargate wouldn't be able to use
           | it
           | 
            | this doesn't mean it's easy to implement the necessary
            | automated tooling etc.
            | 
            | so depending on your dev resources and priorities it might
            | be a bad choice
            | 
            | still, I feel the article was quite a bit subtly judgemental
            | while moving some parts quite relevant to the content of the
            | article into a footnote, and also omitting that this
            | "supposedly unusable tool" is used successfully by various
            | other companies...
            | 
            | like it was written by an engineer being overly defensive
            | about their decision due to having to defend it for the
            | 100th time because shareholders, customers, and higher-level
            | management just wouldn't shut up about "but that uses
            | Firecracker"
        
           | tedunangst wrote:
           | I'm struggling to understand how qemu with free page
           | reporting isn't exactly the same as a firecracker balloon.
        
       | londons_explore wrote:
        | I really want VMs to integrate 'smarter' with the host.
       | 
        | For example, if I'm running 5 VMs, there is a good chance that
       | many of the pages are identical. Not only do I want those pages
       | to be deduplicated, but I want them to be zero-copy (ie. not
       | deduplicated after-the-fact by some daemon).
       | 
       | To do that, the guest block cache needs to be integrated with the
       | host block-cache, so that whenever some guest application tries
       | to map data from disk, the host notices that another virtual
       | machine has already caused this data to be loaded, so we can just
       | map the same page of already loaded data into the VM that is
       | asking.
        
         | jarym wrote:
         | Well, that's all nice, but it would also need to be compute-
         | efficient to be worthwhile, and near-real-time dedupe of
         | memory pages would be a REALLY tough challenge.
        
           | londons_explore wrote:
           | Pretty straightforward for disk blocks. Many VM disks are
           | already de-duped, either through snapshotting or through
           | copy-on-write host filesystems.
           | 
           | The host block cache will end up deduplicating it
           | automatically because all the 'copies' lead back to the same
           | block on disk.
        
         | gorbak25 wrote:
         | I've tried to use virtio-pmem + DAX for the page cache to not
         | be duplicated between the guest and the host. In practice the
         | RAM overhead of virtio-pmem is unacceptable and it doesn't
         | support discard operations at all. So yes a better solution
         | would be needed.
        
         | adql wrote:
         | https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...
         | 
         | zero-copy is harder, as a system upgrade on one of the VMs
         | will trash the sharing, but KSM is overall pretty effective at
         | saving some memory on similar VMs
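         | 
         | For anyone curious how pages get opted in: KSM only scans
         | regions a process has explicitly marked with
         | madvise(MADV_MERGEABLE) (QEMU does this for guest RAM when
         | memory merging is enabled). A minimal sketch, assuming a Linux
         | kernel built with CONFIG_KSM and ksmd switched on via
         | /sys/kernel/mm/ksm/run:
         | 
         |   /* Mark an anonymous region as a KSM merge candidate. */
         |   #include <stdio.h>
         |   #include <string.h>
         |   #include <sys/mman.h>
         | 
         |   int main(void) {
         |       size_t len = 16UL << 20;  /* 16 MiB, private anon */
         |       void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
         |                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         |       if (p == MAP_FAILED) { perror("mmap"); return 1; }
         | 
         |       memset(p, 0, len);  /* fault the pages in */
         | 
         |       /* Opt the region into ksmd scanning; identical pages
         |        * are later collapsed into one copy-on-write page. */
         |       if (madvise(p, len, MADV_MERGEABLE) != 0)
         |           perror("madvise(MADV_MERGEABLE)");
         |       return 0;
         |   }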
        
         | re-thc wrote:
         | KVM has had KSM (kernel samepage merging) for a long time,
         | which de-duplicates pages.
        
           | londons_explore wrote:
           | But that makes a copy first, and only later notices that the
           | pages are the same and merges them again.
           | 
           | Better to not make copies in the first place.
        
             | adql wrote:
             | it's not really possible in the hypervisor, as it doesn't
             | know what the guest will be putting in its memory
             | beforehand
        
             | Liquid_Fire wrote:
             | How are you going to know in advance that the pages are
             | going to be the same?
             | 
             | e.g. your guest kernel is loading an application into
             | memory, by reading some parts of an ELF file from disk.
             | Presumably each VM has its own unique disk, so the
             | hypervisor can't know that this is "the same" page of data
             | as another VM has without actually reading it into memory
             | first and calculating a hash or something.
             | 
             | If the VMs share a disk image (e.g. the image is copy-on-
             | write), then I could see it being feasible - e.g. with KVM,
             | even if your VMs are instantiated by distinct userspace
             | processes, they would probably share the pages as they mmap
             | the same disk image. You would still need your virtualised
             | disk device to support copy-on-write, which may or may not
             | be possible depending on your use case.
             | 
             | But your copy-on-write disk images will probably quickly
             | diverge in a way that makes most pages not shareable,
             | unless you use some sort of filesystem optimised for that.
             | 
             | Lastly, since you mentioned Chromium or Slack in another
             | comment - I'm sure you'll find nearly all of the loading
             | time there is not spent loading the executable from disk,
             | but actually executing it (and all its
             | startup/initialisation code). So this probably won't be the
             | speedup you're imagining. It would just save memory.
        
               | londons_explore wrote:
               | Chromium load time with an empty profile on my system is
               | 4.5 seconds with a cleared disk block cache, and 2.0
               | seconds with a warm disk cache.
               | 
               | So startup time could be better than halved. Seems worth
               | it.
        
               | londons_explore wrote:
               | > pages not shareable, unless you use some sort of
               | filesystem optimised for that.
               | 
               | btrfs on the host would have support for deduplication
               | of identical blocks in the disk images. It's true that a
               | CPU-costly scan would be needed to identify new shared
               | pages if, for example, two VMs are both updated to the
               | latest distro release.
        
           | ec109685 wrote:
           | It is vulnerable to side-channel attacks, so be careful when
           | enabling it:
           | https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
        
         | scarface_74 wrote:
         | This seems like a security issue waiting to happen when you're
         | running code from different users.
        
         | hamandcheese wrote:
         | If you already know so much about your application(s), are you
         | sure you need virtualization?
        
           | drbawb wrote:
           | The second I read "shared block cache" my brain went to
           | containers.
           | 
           | If you want data colocated on the same filesystem, then put
           | it on the same filesystem. VMs suck, nobody spins up a whole
           | fabricated IBM-compatible PC and gaslights their executable
           | _because they want to._ [1] They do it because their OS (a)
           | doesn't have containers, (b) doesn't provide strong enough
           | isolation between containers, or (c) the host kernel can't
           | run their workload. (Different ISA, different syscalls,
           | different executable format, etc.)
           | 
           | Anyone who has ever tried to run heavyweight VMs atop a
           | snapshotting volume already knows the idea of "shared blocks"
           | is a fantasy; as soon as you do one large update inside the
           | guest the delta between your volume clones and the base
           | snapshot grows immensely. That's why Docker et al. have a
           | concept of layers and you describe your desired state as a
           | series of idempotent instructions applied to those layers.
           | That's possible _because_ Docker operates semantically on a
           | filesystem; much harder to do at the level of a block device.
           | 
           | Is a block containing b"hello, world" part of a program's
           | text section, or part of a user's document? You don't know,
           | because the guest is asking you for an LBA, not a path, not
           | modes, not an ACL, etc. - If you don't know that, the host
           | kernel has no idea how the page should be mapped into memory.
           | Furthermore storing the information to dedup common blocks is
           | non-trivial: go look at the manpage for ZFS' deduplication
           | and it is littered w/ warnings about the performance, memory,
           | and storage implications of dealing with the dedup table.
           | 
           | [1]: https://www.youtube.com/watch?v=coFIEH3vXPw
        
         | tjoff wrote:
         | Doubt it is worth the hassle. How many pages do you really
         | expect to be identical?
         | 
         | An OS isn't large. Your spotify/slack/browser instance is of
         | comparable size. Says more about browser-based apps but still.
        
           | londons_explore wrote:
           | Basically all code pages should be the same if some other VM
           | has the same version of ubuntu and is running the same
           | version of spotify/slack.
           | 
           | And remember that as well as RAM savings, you also get
           | 'instant loading' because there is no need to do slow SSD
           | accesses to load hundreds of megabytes of a chromium binary
           | to get slack running...
        
           | drbawb wrote:
           | > An OS isn't large. Your spotify/slack/browser instance is
           | of comparable size.
           | 
           | A fairly recent Windows 11 Pro image is ~26GB unpacked and
           | 141k dirents. After finishing OOBE it's already running like
           | >100 processes, >1000 threads, and >100k handles. My Chrome
           | install is ~600MB and 115 dirents. (Not including UserData.)
           | It runs ~1 process per tab. Comparable in scope and
           | complexity? That's debatable, but I tend to agree that modern
           | browsers are pretty similar in scope to what an OS should be.
           | (The other day my "web browser" flashed the firmware on the
           | microcontroller for my keyboard.)
           | 
           | They're not even close to "being comparable in size,"
           | although I guess that says more about Windows.
        
             | cthalupa wrote:
             | My reading was that the "comparable in size" was more about
             | memory footprint and less about usage of storage
        
         | shrubble wrote:
         | OpenVZ does this. If you have 5 VMs each loading the same
         | library then memory is conserved, as I understand it.
        
           | anthk wrote:
           | kvm does the same with KSM.
        
             | shrubble wrote:
             | Not precisely: KSM does it after the fact, while OpenVZ
             | has it occur as a consequence of its design, at the point
             | the program is loaded.
             | 
             | See (OpenVZ) "Containers share dynamic libraries, which
             | greatly saves memory." It's just 1 Linux kernel when you
             | are running OpenVZ containers.
             | 
             | https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_
             | c...
             | 
             | See (KVM/KSM): "KSM enables the kernel to examine two or
             | more already running programs and compare their memory. If
             | any memory regions or pages are identical, KSM reduces
             | multiple identical memory pages to a single page. This page
             | is then marked copy on write."
             | 
             | https://access.redhat.com/documentation/en-
             | us/red_hat_enterp...
             | 
             | In KVM's defense, it supports a much wider range of OSes;
             | OpenVZ only really does different versions of Linux, while
             | KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in
             | addition to Linux.
        
         | andrewflnr wrote:
         | Are you sure you're not thinking "copy on write" rather than
         | "zero copy"? The latter implies you can predict in advance
         | which pages will be the same forever...
        
           | londons_explore wrote:
           | The pages would be copy-on-write, but since this would mostly
           | be for code pages, they would never be written, and therefore
           | never copied.
           | 
           | By 'zero copy', I mean that when a guest tries to read a
           | page, if another guest has that page in RAM, then no copy
           | operation is done to get it into the memory space of the 2nd
           | guest.
        
       | ComputerGuru wrote:
       | The article did an OK job of explaining the Firecracker
       | limitations they ran into, but it was extremely skimpy when it
       | came to QEMU and just rushed to the conclusion "we did a lot of
       | work, so try our product."
        
         | naikrovek wrote:
         | yeah I was reading so I could find out what they did.
         | 
         | I understand that they need to sell their product but jeez.
         | don't leave us hanging like that
        
           | hugodutka wrote:
           | I didn't want to go into all the technical details, but we
           | have another write-up that goes into details about RAM
           | management: https://github.com/hocus-
           | dev/hocus/blob/main/rfd/0003-worksp...
           | 
           | Other than making sure we release unused memory to the host,
           | we didn't customize QEMU that much. We do, however, have a
           | cool layered storage solution - basically a faster
           | alternative to QCOW2 that's also VMM-independent. It's
           | called overlaybd, and was created and implemented at
           | Alibaba. That will probably be another blog post.
           | https://github.com/containerd/overlaybd
        
             | ComputerGuru wrote:
             | > I didn't want to go into all the technical details
             | 
             | HN is here for the technical details ;)
        
               | gunapologist99 wrote:
               | that should be the HN motto!
        
               | xen2xen1 wrote:
               | Thirded.. Ed.
        
       | blinkingled wrote:
       | > The main issue we've had with QEMU is that it has too many
       | options you need to configure. For instance, enabling your VM to
       | return unused RAM to the host requires at least three challenging
       | tasks
       | 
       | This just works on Hyper-V Linux guests btw. For all the crap MS
       | gets they do some things very right.
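       | 
       | For reference, the QEMU side of that quote boils down to wiring
       | up a balloon device with free page reporting plus a
       | discard-capable virtio-blk disk (and a guest kernel that
       | cooperates). A rough sketch of the kind of invocation involved,
       | wrapped in a tiny C launcher just to show the flags; these are
       | assumed options, not the article's actual configuration:
       | 
       |   /* Launch QEMU so the guest can hand unused RAM and trimmed
       |    * blocks back to the host. */
       |   #include <stdio.h>
       |   #include <unistd.h>
       | 
       |   int main(void) {
       |       char *argv[] = {
       |           "qemu-system-x86_64", "-enable-kvm", "-m", "4G",
       |           /* free page reporting: QEMU 5.1+ */
       |           "-device", "virtio-balloon,free-page-reporting=on",
       |           /* pass guest discards through to the backing file */
       |           "-drive", "file=root.qcow2,if=none,id=root,"
       |                     "discard=unmap,detect-zeroes=unmap",
       |           "-device", "virtio-blk-pci,drive=root,discard=on",
       |           NULL,
       |       };
       |       execvp(argv[0], argv);
       |       perror("execvp");  /* reached only if QEMU didn't start */
       |       return 1;
       |   }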
        
         | freeone3000 wrote:
         | It _kind of_ just works. It's actually broken under Debian 13
         | for some reason; memory usage infinitely balloons if the
         | feature is enabled.
        
       ___________________________________________________________________
       (page generated 2023-07-10 23:00 UTC)