RunX: next generation secured containers
In this article, you'll discover a way to give all your containers a level of isolation and security they can't have by default, all of it transparently within your existing container workflows!
Introduction to RunX
Before talking about RunX, let's take a look at how your containers work currently. These examples are with Docker, but it works the same way with all systems relying on the standardized container format (OCI).
Until now, to create a container, one would rely on a container component named RunC. RunC is a low-level container runtime, introduced by Docker in 2015, that executes containers packaged in the OCI format. It is invoked by containerd, Docker's higher-level container runtime, which handles image push/pull, the transport layer, more complex APIs and so on.
This is a very straightforward and convenient way to run containers, but it's far from secure: the isolation mechanisms are pretty thin. For local development on your own machine, this is great! But running with this low level of isolation in production can be dangerous.
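For context, any OCI-compliant runtime (RunC today, RunX tomorrow) consumes a bundle described by a `config.json` file. A trimmed sketch of such a file, reduced to a few representative fields from the OCI runtime spec, could look like this:

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": true,
    "args": ["/bin/sh"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  }
}
```

Because RunX speaks the same format, nothing above changes when the runtime underneath is swapped.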
The future: RunX
So what is RunX? The short answer: RunX is a set of scripts that replaces RunC. In other words, it's an OCI-runtime-spec-compliant container runtime that runs containers as virtual machines.
Instead of the default isolation APIs, Xen APIs are used, more specifically libxl.
A similar example is KataContainers, which uses KVM-based virtual machines for your containers. However, RunX differs beyond the obvious fact that it's Xen-based: unlike KataContainers, there's no attempt to communicate with the host via a side channel, and there's no agent inside the VM. Also, RunX uses a very small busybox-based ramdisk to boot the VM (if your container comes with its own ramdisk, RunX will use that instead).
RunX provides strong isolation, with few dependencies and a very small footprint. It's also fully compatible with Docker and Podman, and even with container orchestration platforms like Kubernetes, because it replaces a layer transparently.
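To illustrate that "replacing a layer transparently" claim: Docker lets you register alternative OCI runtimes in `/etc/docker/daemon.json` via its standard `runtimes` key. A hypothetical registration of RunX (the binary path is an assumption, and the exact integration on XCP-ng may differ) would look like:

```json
{
  "runtimes": {
    "runx": {
      "path": "/usr/local/bin/runx"
    }
  }
}
```

After restarting the Docker daemon, a container could then be launched with `docker run --runtime=runx ubuntu`, with everything above the runtime layer unchanged.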
If you want to learn more about RunX itself, you should take a look at this presentation from Stefano Stabellini during the latest Xen Summit:
Now, let's see how RunX was integrated into XCP-ng.
Since libxl is used by RunX, several projects were patched to provide better integration with XCP-ng:

- `xenopsd`: our goal is to avoid direct usage of libxl and to use `xe` commands instead (remember, `xe` is just a XAPI client).
- `SMAPIv3`: to be able to access the container image filesystem. A generic driver allows access to a folder instead of a disk image. Simple and efficient.
xenopsd and SMAPIv3
The easiest way to integrate 9pfs support is not to heavily patch `xenopsd`, but rather `SMAPIv3`. Quite simply because the filesystem provided by the Docker overlay can be seen as a VDI, and then it fits correctly into the toolstack. Instead of RAW, we use a plugin to support another "format": a folder path.
It is enough to add a new datapath plugin to support host folders, plus a volume plugin. Thanks to this, `xenopsd` continues to negotiate with the guest over XenBus as usual, simply because the 9pfs PV driver in the guest is similar to the classic storage PV driver. The only things to change in `xenopsd` are writing new params into xenstore for the driver, and supporting a new 9pfs VBD backend type.
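The folder-as-VDI idea can be sketched in a few lines of Python. This is purely illustrative: it is NOT the real SMAPIv3 plugin interface, and the function name, URI scheme and key names below are all hypothetical. It only shows the shape of the translation, from a "folder" volume to the params the toolstack would write for the 9pfs PV driver:

```python
# Illustrative sketch only: NOT the real SMAPIv3 API. It mimics the idea
# described above: instead of exposing a RAW block device, the datapath
# plugin hands the toolstack a host folder path, which xenopsd then
# advertises to the guest as a 9pfs VBD.

def attach_folder_datapath(volume_uri):
    """Turn a 'folder' volume URI into the params xenopsd would write
    into xenstore for the 9pfs PV driver (key names are hypothetical)."""
    scheme, _, path = volume_uri.partition("://")
    if scheme != "folder":
        raise ValueError("expected a folder:// volume URI")
    return {
        "backend-type": "9pfs",  # the new VBD backend type
        "path": path,            # the Docker overlay directory
        "tag": "share_dir",      # mount tag seen by the guest
    }


params = attach_folder_datapath("folder:///var/lib/docker/overlay2/abc/merged")
print(params["backend-type"], params["path"])
# → 9pfs /var/lib/docker/overlay2/abc/merged
```

The guest-side 9pfs driver then mounts that shared directory as its root filesystem, which is why no disk image conversion is needed.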
The benefits of patching SMAPIv3 rather than xenopsd:
- Better for maintenance and architecture
- The ability to share any folder, not just the Docker layer
With these changes, when a container starts, RunX can create a VM with the right boot params using a VM template, create a 9pfs VDI pointing at the Docker path, and finally start the VM.
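That three-step start sequence can be sketched as the `xe` calls the integration could issue. `xe vm-install` and `xe vm-start` are standard XAPI client commands, but the template name, the VDI creation step and its parameters are assumptions for illustration only:

```python
# Hedged sketch of the start sequence described above. Only vm-install
# and vm-start are standard xe commands; everything else is hypothetical.

def plan_container_start(name, overlay_path, template="runx-template"):
    """Return the ordered xe commands to boot a container as a VM."""
    return [
        # 1. create a VM with the right boot params from a template
        f"xe vm-install template={template} new-name-label={name}",
        # 2. create a 9pfs VDI pointing at the Docker overlay path
        #    (SR uuid placeholder and other-config key are hypothetical)
        f"xe vdi-create sr-uuid=<9pfs-sr> name-label={name}-rootfs "
        f"other-config:path={overlay_path}",
        # 3. start the VM
        f"xe vm-start vm={name}",
    ]


for cmd in plan_container_start("ubuntu", "/var/lib/docker/overlay2/abc/merged"):
    print(cmd)
```

The point is that the whole flow stays inside the XAPI toolstack, with no direct libxl calls.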
qemu-dp is a lighter version of QEMU, able to manipulate QCOW2 (and also VHD) images. We patched qemu-dp to make it compatible with the 9pfs driver.
To see this in action, let's SSH into a dom0 with RunX enabled:
```
olivier@mycomputer# ssh root@xcp-ng-host
root@xcp-ng-host's password:
Last login: Tue Sep 14 08:55:17 2021 from mycomputer
root@xcp-ng-host#
```
Good, now let's pull an Ubuntu Docker image:
```
root@xcp-ng-host# docker pull ubuntu
Using default tag: latest
Trying to pull repository docker.io/library/ubuntu ...
sha256:9d6a8699fb5c9c39cf08a0871bd6219f0400981c570894cd8cbea30d3424a31f: Pulling from docker.io/library/ubuntu
35807b77a593: Pull complete
Digest: sha256:9d6a8699fb5c9c39cf08a0871bd6219f0400981c570894cd8cbea30d3424a31f
Status: Downloaded newer image for docker.io/ubuntu:latest
```
Now that we have our image, we can use it:
```
root@xcp-ng-host# docker start ubuntu
ubuntu
root@xcp-ng-host# docker ps
CONTAINER ID   IMAGE              COMMAND           CREATED         STATUS         PORTS   NAMES
3afea5743a60   docker.io/ubuntu   "/usr/bin/bash"   4 minutes ago   Up 3 minutes
root@xcp-ng-host# docker pause ubuntu
ubuntu
root@xcp-ng-host# docker stop ubuntu
ubuntu
```
So, what happened? When we launched `docker start`, instead of starting a "normal" container, a light VM was created using the Ubuntu image, exactly as if you were using Docker with the usual RunC!
Note: all running Docker containers are visible as VMs, and are therefore displayed in Xen Orchestra, `xe` or XCP-ng Center.
Neat, right? All the usual container workflows, with the extra security layer provided by Xen.
We are working on making the whole setup as simple as possible. There are some limitations at this stage (see below), but nothing we can't solve in the long run.
We'd like to thank Stefano Stabellini for the original idea and prototype on Xen, and for taking the time to listen to our idea of "porting" it to XCP-ng.
- RRD disk stats can't be fetched, due to the architecture we use with RunX: since qemu-dp is used instead of tapdisk, XAPI doesn't generate data for those disks.
- Possible improvements on R/W performance need to be explored.
- Attaching additional devices with persistent data at startup isn't supported yet. However, it's still possible to add disks using `xe` or Xen Orchestra while the VM is running.
- Snapshots and migrations are not possible, because the SMAPIv3 driver is very simple.
- Although container commands work, we can't reboot or shut down from inside the VM itself. Only an init script modification is needed to make this possible.
- Configuring RAM/CPU via Docker or another container environment isn't supported yet (this can easily be improved by parsing the CPU and memory limits and passing them to the VM configuration).
- We need to find a way to install the Xen guest tools, to fetch more guest metrics and allow live migration in the future (this can be done in the init part).
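The RAM/CPU point above is straightforward to sketch. The OCI runtime spec already carries the limits (`linux.resources.memory.limit` in bytes, and `linux.resources.cpu.quota`/`period`), so a small translation step could derive VM settings from them. The VM-side key names below are hypothetical:

```python
# Sketch of the suggested improvement: read CPU and memory limits from
# an OCI config dict and translate them into VM settings. The OCI field
# names are from the runtime spec; the output keys are hypothetical.

import math


def vm_config_from_oci(oci_config, default_vcpus=1, default_mem_mib=512):
    res = oci_config.get("linux", {}).get("resources", {})

    mem_limit = res.get("memory", {}).get("limit")  # bytes
    mem_mib = mem_limit // (1024 * 1024) if mem_limit else default_mem_mib

    cpu = res.get("cpu", {})
    quota, period = cpu.get("quota"), cpu.get("period")
    # quota/period is the number of "full CPUs" the container may use
    vcpus = math.ceil(quota / period) if quota and period else default_vcpus

    return {"VCPUs-max": vcpus, "memory-static-max-mib": mem_mib}


cfg = vm_config_from_oci({
    "linux": {"resources": {
        "memory": {"limit": 2147483648},            # 2 GiB
        "cpu": {"quota": 200000, "period": 100000}  # 2 CPUs worth
    }}
})
print(cfg)  # → {'VCPUs-max': 2, 'memory-static-max-mib': 2048}
```

Containers with no limits set would simply fall back to the defaults, matching today's behavior.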