Continuing our 8-part blog smackdown between containerization and virtualization, today we are going to dive deeper into overhead. We’ll discuss how much virtual machine (VM) overhead can be reduced with the right know-how and tools and how that matches up with containerization overhead.
As a reminder, our “wrestling match” is actually a friendly contest between Taras Shapovalov (team container) and Piotr Wachowicz (team VM).
Piotr – As I mentioned at the end of our Round 1 discussion on overhead, CPU virtualization is relatively cheap whereas virtualizing disk (block) and network I/O is a different story. (Fear not, by the end of this post I’ll discuss how it can be done).
So, what are the stumbling blocks? There are two: issues resulting from having other tenants on the same hypervisor, and the overhead of virtualization itself.
Let me begin with the first one. If you run multiple VMs on the same host, any VMs with particularly “noisy” input/output (I/O) can have a big impact on the other VMs. We’ve seen this ourselves many times in our own OpenStack cloud. But, there’s a pretty easy workaround – either through imposing limits on the number of disk access operations per second, or instructing OpenStack’s nova-scheduler not to mix VMs from different tenants/projects on the same host. You can even restrict scheduling to only a single VM per host.
However, the second stumbling block, actual virtualization overhead, is a tricky beast. There are many ways to do disk and net I/O virtualization, and multiple metrics to measure and compare them, but you cannot simply state that disk I/O virtualization overhead in a kernel-based virtual machine (KVM) is 30 percent, for example. That would be a gross over-simplification. The actual number depends on many different factors, and can easily vary as much as 25 percent on either side.
For instance, it depends on whether we are doing full virtualization, para-virtualization, or device pass-through. What are our I/O patterns? How many input/output operations per second (IOPS) do the VMs generate? Are we measuring the number of IOPS, the average/maximum/minimum operation latency, or maybe bandwidth? How much overhead on the host CPU is acceptable? Does our NIC support VXLAN hardware offloading? How many VMs are doing intensive I/O per host? I could go on, but you get the idea.
So yes, it is easy to get high overhead with virtualized disk/net I/O, but it’s not impossible to get reasonably close to bare-metal performance. To get the most out of your virtualized I/O you need a solid monitoring tool that you can use to benchmark your workloads under various configurations. That way you can actually see which configurations work, and which do not.
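To make the idea of benchmarking under various configurations concrete, here is a minimal Python sketch that times small synchronous writes and reports the metrics mentioned above (IOPS, bandwidth, and minimum/average/maximum latency). It is not a replacement for a real monitoring tool; the function name `measure_write_latency`, its parameters, and the defaults are illustrative assumptions.

```python
import os
import statistics
import tempfile
import time


def measure_write_latency(directory, block_size=4096, operations=200):
    """Time small synchronous writes and return basic I/O statistics.

    The name, parameters, and defaults are illustrative, not from any tool.
    """
    latencies = []
    payload = os.urandom(block_size)
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(operations):
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to the device
            latencies.append(time.perf_counter() - start)
    total = max(sum(latencies), 1e-9)  # guard against a zero elapsed time
    return {
        "iops": operations / total,
        "bandwidth_mb_s": operations * block_size / total / 1e6,
        "latency_min_ms": min(latencies) * 1000,
        "latency_avg_ms": statistics.mean(latencies) * 1000,
        "latency_max_ms": max(latencies) * 1000,
    }


if __name__ == "__main__":
    for name, value in measure_write_latency(tempfile.gettempdir()).items():
        print(f"{name}: {value:.2f}")
```

Running the same script on the bare-metal host and inside a VM, against the same kind of storage, gives a rough first comparison. The numbers only mean something for this one I/O pattern (small sequential synchronous writes), so a real evaluation would repeat it with patterns that match your workload.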
When it comes to the in-memory footprint of a running VM, if the VMs you’re running on your hypervisor are similar (for example, they all run a similar underlying operating system), many of the memory pages occupied by the VMs in the hypervisor’s memory will be identical. It would be a shame not to make use of that.
And, in fact, Linux does just that, by using kernel same-page merging (KSM). KSM allows the operating system and an application, like the hypervisor, to de-duplicate memory pages, and use the same memory page for multiple VMs. For example, KVM/QEMU, the most popular OpenStack hypervisor according to the OpenStack user survey, makes excellent use of that feature.
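If you want to see what KSM is doing on a host, the kernel exposes its counters as plain files under `/sys/kernel/mm/ksm`. The sketch below reads a few of them and estimates how much memory is being saved. The helper names `read_ksm_stats` and `estimate_saved_bytes` are made up for this example, and the estimate is rough: it treats every entry counted in `pages_sharing` as one saved page.

```python
import os
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")  # standard sysfs location on Linux


def read_ksm_stats(ksm_dir=KSM_DIR):
    """Return the integer counters KSM exposes, keyed by file name."""
    stats = {}
    for name in ("run", "pages_shared", "pages_sharing", "pages_unshared"):
        path = Path(ksm_dir) / name
        if path.exists():  # skip counters missing on this kernel
            stats[name] = int(path.read_text().strip())
    return stats


def estimate_saved_bytes(stats, page_size=None):
    """Rough estimate of memory saved by KSM.

    pages_sharing counts the additional mappings that reuse an already
    shared page, so each one is treated here as one page saved.
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return stats.get("pages_sharing", 0) * page_size


if __name__ == "__main__":
    stats = read_ksm_stats()
    print(stats)
    print(f"approx. saved: {estimate_saved_bytes(stats) / 2**20:.1f} MiB")
```

A `run` value of 0 means KSM is not currently scanning, in which case the other counters will not grow no matter how similar your VMs are.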
So now we need to ask, “How well does it work?” And the answer, as usual, is, “It depends!” In this case, it depends upon how many memory pages are identical among the different VMs. But just to sprinkle some numbers around: in one article I read, users were able to run 52 VMs, each with 1 GB of system memory, on a hypervisor host having only 16 GB of memory. I think even you will agree that’s not too bad.
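To put those figures in perspective, here is the arithmetic as a short Python sketch. The function name `overcommit_ratio` is just for illustration, and the calculation ignores the memory used by the hypervisor host itself.

```python
def overcommit_ratio(vm_count, vm_memory_gb, host_memory_gb):
    """Ratio of memory promised to guests to memory physically present."""
    return vm_count * vm_memory_gb / host_memory_gb


if __name__ == "__main__":
    promised_gb = 52 * 1
    ratio = overcommit_ratio(52, 1, 16)
    uncovered = 1 - 16 / promised_gb
    print(f"promised: {promised_gb} GB, overcommit ratio: {ratio}")
    print(f"share of guest memory without its own host page: {uncovered:.0%}")
```

In other words, the guests were promised 3.25 times the physical memory of the host. At least that uncovered share of guest memory has to be deduplicated, swapped out, or simply never touched by the guests, and the article's numbers alone do not tell us how it splits between those three.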
So, yes, VMs are not as expensive as one might think. Not as efficient as containers, but not far off.
Taras – I see what you mean. It sounds like you can reduce VM overhead dramatically – if you measure the performance and tune parameters from scratch for each setup. Let me contrast that with the containerization world, where you don’t need to be an expert to achieve very low overhead. This makes containerization more accessible for beginners. I predict that because of this accessibility, we’ll see the size of the containerization community continue to grow compared to the virtualization community over the short term.
Piotr – So now I have a question for you – Can Docker use KSM when running multiple similar containers on the same system?
Taras – KSM is indeed a very useful technology that works for containers – but only in some cases. For example, if you use Docker as a container orchestration tool, then you can select which storage backend will be used. If you select AUFS or OverlayFS as a backend, then the appropriate driver will make KSM usage possible and the memory will be saved when loading files from identical container images. But if you select, for example, DeviceMapper, then KSM will not work, because the memory page cache is keyed by device ID, so it cannot be shared. It’s the same story if you use the BTRFS backend. So it seems that backends like DeviceMapper and BTRFS may not be the best choice for platform as a service (PaaS) and other high density container use cases.
Piotr – One more question – How are network interfaces exposed to containers in multi-tenant environments? Isn’t that also done with a form of virtualization, which would also imply some overhead?
Taras – At the moment, there is no native support for multi-tenancy in container orchestration tools. But multi-tenancy is a very hot topic of discussion in different containerization-related communities. Everyone agrees multi-tenancy will be useful when integrated with tools like OpenStack, where it’s already supported. The good news is that there is a project called Hypernetes, which brings multi-tenancy to Kubernetes. I hope this will be part of the standard Kubernetes distribution in the future.
But if we consider containerization networking in general, there can be some overhead, depending on how the containers are configured. You can share all or some of the network interfaces among the host and containers that run on the host. This is probably the best choice from a performance point of view, because processes inside the containers use the network in the same way as processes that run outside of the containers. But this sharing breaks the isolation that is one of containerization’s most useful properties, because you can’t share a network among different users in a PaaS setup or even on an HPC cluster.
The opposite of sharing network interfaces among containers and host is creating dedicated networks for each container, or group of containers, running on one or multiple hosts. There are several tools to virtualize networking among containers (like OVS or Flannel), but each one brings some overhead with it. This is because each packet sent by a container will go through more layers than it would if it used a host network interface directly. Each additional layer costs something in terms of overhead. On the other hand, you can say the same thing about VMs, because you need to virtualize networking for VMs in the same way as you do for containers.
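One part of that per-layer cost is easy to quantify: the extra headers each packet carries. The sketch below assumes the overlay uses VXLAN over IPv4 without a VLAN tag, which both OVS and Flannel can be configured to use; other encapsulations have different header sizes. The function name `encapsulation_overhead` is illustrative, and the calculation covers bytes on the wire only, not the CPU time spent wrapping and unwrapping packets.

```python
# Assumption: VXLAN over IPv4, no VLAN tag on the outer frame.
# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN header (8).
VXLAN_OVERHEAD_BYTES = 14 + 20 + 8 + 8


def encapsulation_overhead(inner_frame_bytes, overhead_bytes=VXLAN_OVERHEAD_BYTES):
    """Fraction of bytes on the wire that are encapsulation headers."""
    return overhead_bytes / (inner_frame_bytes + overhead_bytes)


if __name__ == "__main__":
    for size in (64, 512, 1464):
        share = encapsulation_overhead(size)
        print(f"inner frame {size:>4} bytes: {share:.1%} of wire bytes are overlay headers")
```

The fixed header cost matters most for small packets, which is one reason latency-sensitive traffic with many small messages tends to feel overlay networking more than bulk transfers do.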
From this point of view, VMs and containers are similar, and both can even use the same tools, like OVS. There are a few projects (Calico comes to mind) that should solve the problem of virtualized network performance, but I think it is still too early to use them in production.
Now, when it comes to HPC, the need for network virtualization restricts container usage on HPC clusters. For example, when you have a distributed MPI application, you expect that communication among MPI processes will not go through additional layers of virtualization, which can slow the whole application down. HPC clusters are expensive systems, so any additional overhead makes them more expensive for organizations to run. Therefore, at the moment container use in HPC is not as widespread as I would like to see.
I, for one, hope that the software-defined networking community will solve this issue soon, so that containerization for parallel HPC jobs becomes common. That said, if network isolation is not a requirement, containers can be used in HPC now.
I have to mention here a comparatively young project called Singularity, which allows you to create self-executing container images and run them as a regular application in a workload manager job script. You just ask Singularity to put a list of files, or even the content of an RPM, inside a single bundle that can be moved to another cluster and executed as a regular binary. All the processes started by the bundle will automatically be put inside a single container. Say you have an MPI application and you don’t need any network isolation: Singularity will let you get the benefits of a containerized application easily. Bright Cluster Manager 7.3 will provide the Singularity RPM and documentation so that every customer can try it out on their HPC jobs.
Phew – that was an exhausting round. Feels like a double knockout! We’ll be back next time with Round 3 on Start Time and Setup Orchestration.