RunX: next generation secured containers
In this article, you'll discover a way to give all your containers a level of isolation and security they can't have by default, all of it transparently within your existing container workflows!
Introduction to RunX
Before talking about RunX, let's take a look at how your containers work currently. These examples are with Docker, but it works the same way with all systems relying on the standardized container format (OCI).
Until now, to create a container, one would rely on a container component named RunC. RunC is a low-level container runtime, introduced by Docker in 2015, that executes containers packaged in the OCI format. It is invoked by containerd, Docker's higher-level container runtime, which handles image push/pull, the transport layer, more complex APIs and so on.
This is a very straightforward and convenient way to run containers, but it's far from secure: the isolation mechanisms are pretty thin. For local development on your own machine, this is great! But running with this low level of isolation in production can be dangerous.
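For context, any OCI-compliant runtime (RunC today, RunX tomorrow) consumes a bundle described by a `config.json` file. A trimmed sketch of such a file, reduced to a few representative fields from the OCI runtime spec, could look like this:

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": true,
    "args": ["/bin/sh"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  }
}
```

Because RunX speaks the same format, nothing above changes when the runtime underneath is swapped.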
The future: RunX
So what is RunX? The short answer: RunX is a set of scripts that replaces RunC. In other words, it's an OCI-runtime-spec-compliant container runtime that runs containers as virtual machines.
Instead of the default isolation APIs, Xen APIs are used, more specifically libxl.
A similar example is KataContainers, which uses KVM-based virtual machines for your containers. However, RunX differs beyond the obvious fact that it's Xen-based: unlike KataContainers, there's no attempt to communicate with the host via a side channel, and there's no agent inside the VM. Also, RunX uses a very small busybox-based ramdisk to boot the VM (if your container comes with its own ramdisk, RunX will use that instead).
RunX provides strong isolation, with few dependencies and a very small footprint. It's also fully compatible with Docker and Podman, and even with container orchestration platforms like Kubernetes, because it replaces a layer transparently.
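To illustrate that "replacing a layer transparently" claim: Docker lets you register alternative OCI runtimes in `/etc/docker/daemon.json` via its standard `runtimes` key. A hypothetical registration of RunX (the binary path is an assumption, and the exact integration on XCP-ng may differ) would look like:

```json
{
  "runtimes": {
    "runx": {
      "path": "/usr/local/bin/runx"
    }
  }
}
```

After restarting the Docker daemon, a container could then be launched with `docker run --runtime=runx ubuntu`, with everything above the runtime layer unchanged.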
If you want to learn more about RunX itself, you should take a look at this presentation from Stefano Stabellini during the latest Xen Summit:
Now, let's see how RunX was integrated into XCP-ng.
Since libxl is used by RunX, several projects were patched to provide better integration with XCP-ng:

- `xenopsd`: our goal is to avoid direct usage of libxl and to use `xe` commands instead (remember, `xe` is just a XAPI client).
- `SMAPIv3`: to be able to access the container image filesystem. A generic driver allows access to a folder instead of a disk image. Simple and efficient.
xenopsd and SMAPIv3
The easiest way to integrate 9pfs support is not to heavily patch `xenopsd`, but rather `SMAPIv3`. Quite simply because the filesystem provided by the Docker overlay can be seen as a VDI, and then it fits correctly into the toolstack. Instead of RAW, we use a plugin to support another "format": a folder path.
It is enough to add a new datapath plugin to support host folders, plus a volume plugin. Thanks to this, `xenopsd` continues to negotiate with the guest over XenBus as usual, simply because the 9pfs PV driver in the guest is similar to the classic storage PV driver. The only things to change in `xenopsd` are writing new params into xenstore for the driver, and supporting a new 9pfs VBD backend type.
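The folder-as-VDI idea can be sketched in a few lines of Python. This is purely illustrative: it is NOT the real SMAPIv3 plugin interface, and the function name, URI scheme and key names below are all hypothetical. It only shows the shape of the translation, from a "folder" volume to the params the toolstack would write for the 9pfs PV driver:

```python
# Illustrative sketch only: NOT the real SMAPIv3 API. It mimics the idea
# described above: instead of exposing a RAW block device, the datapath
# plugin hands the toolstack a host folder path, which xenopsd then
# advertises to the guest as a 9pfs VBD.

def attach_folder_datapath(volume_uri):
    """Turn a 'folder' volume URI into the params xenopsd would write
    into xenstore for the 9pfs PV driver (key names are hypothetical)."""
    scheme, _, path = volume_uri.partition("://")
    if scheme != "folder":
        raise ValueError("expected a folder:// volume URI")
    return {
        "backend-type": "9pfs",  # the new VBD backend type
        "path": path,            # the Docker overlay directory
        "tag": "share_dir",      # mount tag seen by the guest
    }


params = attach_folder_datapath("folder:///var/lib/docker/overlay2/abc/merged")
print(params["backend-type"], params["path"])
# → 9pfs /var/lib/docker/overlay2/abc/merged
```

The guest-side 9pfs driver then mounts that shared directory as its root filesystem, which is why no disk image conversion is needed.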
The benefits of patching SMAPIv3 rather than xenopsd:
- Better for maintenance and architecture
- The ability to share any folder, not just the Docker layer
With these changes, when a container starts, RunX can create a VM with the right boot params using a VM template, create a 9pfs VDI pointing at the Docker path, and finally start the VM.
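That three-step start sequence can be sketched as the `xe` calls the integration could issue. `xe vm-install` and `xe vm-start` are standard XAPI client commands, but the template name, the VDI creation step and its parameters are assumptions for illustration only:

```python
# Hedged sketch of the start sequence described above. Only vm-install
# and vm-start are standard xe commands; everything else is hypothetical.

def plan_container_start(name, overlay_path, template="runx-template"):
    """Return the ordered xe commands to boot a container as a VM."""
    return [
        # 1. create a VM with the right boot params from a template
        f"xe vm-install template={template} new-name-label={name}",
        # 2. create a 9pfs VDI pointing at the Docker overlay path
        #    (SR uuid placeholder and other-config key are hypothetical)
        f"xe vdi-create sr-uuid=<9pfs-sr> name-label={name}-rootfs "
        f"other-config:path={overlay_path}",
        # 3. start the VM
        f"xe vm-start vm={name}",
    ]


for cmd in plan_container_start("ubuntu", "/var/lib/docker/overlay2/abc/merged"):
    print(cmd)
```

The point is that the whole flow stays inside the XAPI toolstack, with no direct libxl calls.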
qemu-dp is a lighter version of QEMU, able to manipulate QCOW2 (and also VHD) images. We patched qemu-dp to make it compatible with the 9pfs driver.
To see this in action, let's SSH into a dom0 with RunX enabled:
```
olivier@mycomputer# ssh root@xcp-ng-host
root@xcp-ng-host's password:
Last login: Tue Sep 14 08:55:17 2021 from mycomputer
root@xcp-ng-host#
```
Good, now let's pull an Ubuntu Docker image:
```
root@xcp-ng-host# docker pull ubuntu
Using default tag: latest
Trying to pull repository docker.io/library/ubuntu ...
sha256:9d6a8699fb5c9c39cf08a0871bd6219f0400981c570894cd8cbea30d3424a31f: Pulling from docker.io/library/ubuntu
35807b77a593: Pull complete
Digest: sha256:9d6a8699fb5c9c39cf08a0871bd6219f0400981c570894cd8cbea30d3424a31f
Status: Downloaded newer image for docker.io/ubuntu:latest
```
Now that we have our image, we can use it:
```
root@xcp-ng-host# docker start ubuntu
ubuntu
root@xcp-ng-host# docker ps
CONTAINER ID   IMAGE              COMMAND           CREATED         STATUS         PORTS   NAMES
3afea5743a60   docker.io/ubuntu   "/usr/bin/bash"   4 minutes ago   Up 3 minutes
root@xcp-ng-host# docker pause ubuntu
ubuntu
root@xcp-ng-host# docker stop ubuntu
ubuntu
```
So, what happened? When we launched `docker start`, instead of starting a "normal" container, a light VM was created using the Ubuntu image, exactly as if you were using Docker with the usual RunC!
Note: all running Docker containers are visible as VMs, and are therefore displayed in Xen Orchestra, `xe` or XCP-ng Center.
Neat, right? All the usual container workflows, with the extra security layer provided by Xen.
We are working on making the whole setup as simple as possible. There are some limitations at this stage (see below), but nothing we can't solve in the long run.
We'd like to thank Stefano Stabellini for the original idea and prototype on Xen, and for taking the time to listen to our idea of "porting" it to XCP-ng.
- RRD disk stats can't be fetched, due to the architecture we use with RunX: since qemu-dp is used instead of tapdisk, XAPI doesn't generate data for those disks.
- Possible improvements on R/W performance need to be explored.
- Attaching additional devices with persistent data at startup isn't supported yet. However, it's still possible to add disks using `xe` or Xen Orchestra while the VM is running.
- Snapshots and migrations are not possible, because the SMAPIv3 driver is very simple.
- Although container commands work, we can't reboot or shut down from inside the VM itself. Only an init script modification is needed to make this possible.
- Configuring RAM/CPU via Docker or another container environment isn't supported yet (this can easily be improved by parsing the CPU and memory limits and passing them to the VM configuration).
- We need to find a way to install the Xen guest tools, to fetch more guest metrics and allow live migration in the future (this can be done in the init part).
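The RAM/CPU point above is straightforward to sketch. The OCI runtime spec already carries the limits (`linux.resources.memory.limit` in bytes, and `linux.resources.cpu.quota`/`period`), so a small translation step could derive VM settings from them. The VM-side key names below are hypothetical:

```python
# Sketch of the suggested improvement: read CPU and memory limits from
# an OCI config dict and translate them into VM settings. The OCI field
# names are from the runtime spec; the output keys are hypothetical.

import math


def vm_config_from_oci(oci_config, default_vcpus=1, default_mem_mib=512):
    res = oci_config.get("linux", {}).get("resources", {})

    mem_limit = res.get("memory", {}).get("limit")  # bytes
    mem_mib = mem_limit // (1024 * 1024) if mem_limit else default_mem_mib

    cpu = res.get("cpu", {})
    quota, period = cpu.get("quota"), cpu.get("period")
    # quota/period is the number of "full CPUs" the container may use
    vcpus = math.ceil(quota / period) if quota and period else default_vcpus

    return {"VCPUs-max": vcpus, "memory-static-max-mib": mem_mib}


cfg = vm_config_from_oci({
    "linux": {"resources": {
        "memory": {"limit": 2147483648},            # 2 GiB
        "cpu": {"quota": 200000, "period": 100000}  # 2 CPUs worth
    }}
})
print(cfg)  # → {'VCPUs-max': 2, 'memory-static-max-mib': 2048}
```

Containers with no limits set would simply fall back to the defaults, matching today's behavior.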