Photo by Adi Goldstein / Unsplash

IOMMU paravirtualization for Xen

Xen Apr 18, 2024

Hello! I am Teddy, an R&D intern at Vates working on the IOMMU stack of Xen. In this first blog post, I will introduce a new feature I am working on: IOMMU paravirtualization. It will allow Dom0 to use a paravirtualized IOMMU, which can serve many purposes, such as Dom0 DMA protection or Linux VFIO support.

This project is part of an ongoing effort to support SPDK with Xen, DPUs, and much more.

Using SPDK with Xen
Discover the impact on a new fast datapath for your VM storage, using SPDK. And see how this might be the future of XCP-ng!
DPUs and the future of virtualization
Take a look at the future of virtualization: DPUs, and how XCP-ng will leverage them.

⚙️ IOMMU Introduction

The IOMMU is a special device implemented by the platform. It goes by various names, such as VT-d (Intel), AMD-Vi (AMD), or SMMU (ARM), among others. Its role is to translate or filter DMA requests from devices to machine physical memory. In virtualization, it is usually used for PCI passthrough, to make DMA requests consistent with the guest's memory context, but operating systems can also use it to protect their memory from devices.

It can also be used to let userspace programs interact directly with devices, as Linux does with the VFIO framework, which is used by SPDK among others.
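To make "translate or filter" more concrete, here is a minimal sketch of what an IOMMU conceptually does with a DMA address. It is a toy model, not how any real IOMMU or Xen is implemented: the `toy_` names, the flat lookup table, and the 4 KiB page size are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* 4 KiB pages (assumption for this sketch) */
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

/* One translation entry: device frame number -> machine frame number. */
struct toy_mapping {
    uint64_t dfn;      /* frame number as seen by the device */
    uint64_t mfn;      /* frame number in machine physical memory */
    bool readable;
    bool writeable;
};

/*
 * Translate a DMA access. Returns true and stores the machine address in
 * *maddr when a mapping with sufficient permission exists. Returns false
 * (the access is blocked) otherwise.
 */
static bool toy_iommu_translate(const struct toy_mapping *table, size_t count,
                                uint64_t dma_addr, bool is_write,
                                uint64_t *maddr)
{
    uint64_t dfn = dma_addr >> TOY_PAGE_SHIFT;

    for (size_t i = 0; i < count; i++) {
        if (table[i].dfn != dfn)
            continue;
        /* Filter: the mapping exists but lacks the needed permission. */
        if (is_write ? !table[i].writeable : !table[i].readable)
            return false;
        /* Translate: swap the frame number, keep the offset in the page. */
        *maddr = (table[i].mfn << TOY_PAGE_SHIFT) | (dma_addr & TOY_PAGE_MASK);
        return true;
    }
    return false; /* Filter: no mapping at all for this frame. */
}
```

Real hardware walks multi-level page tables instead of scanning an array, but the outcome is the same: a DMA request is either rewritten to a machine address or rejected.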

🐼 Xen and IOMMU

Xen already leverages the IOMMU for PCI passthrough and to restrict the memory that devices can access. For stability and security reasons, guests (including Dom0) therefore can't directly access the IOMMU hardware available on the machine. All hope is not lost, though: we can still expose an interface that gives the guest access to an IOMMU, for instance a simplified one relying on the paravirtualized infrastructure of Xen.

✨ Introducing PV-IOMMU

We introduce a new paravirtualized IOMMU, simply named PV-IOMMU. It implements the features a guest expects from an IOMMU while abstracting away the hardware details. In Xen, we add a new hypercall, HYPERVISOR_iommu_op, that provides several IOMMU operations the guest can use (if allowed).

One of the main features a guest expects from an IOMMU is the ability to create and modify "IOMMU domains": sets of translations that form a memory context and can be applied to one or more devices. These domains are named "IOMMU contexts" in Xen to avoid confusion with Xen domains, which are virtual machines.

The operations on the PV-IOMMU are exposed as sub-operations of the HYPERVISOR_iommu_op hypercall, abstracted in a way that is practical for the guest.

The guest first needs to allocate an IOMMU context using the alloc_context sub-operation, which returns a "context number" on success. This context number is a handle to the created context and is required for further operations on it.

struct pv_iommu_op op;
uint16_t ctx_no;

op.ctx_no = 0;
op.flags = 0;
op.subop_id = IOMMUOP_alloc_context;

HYPERVISOR_iommu_op(&op, 1);

// Get the context number of the context we just created.
ctx_no = op.ctx_no;

Creating an IOMMU context from guest

This context is initially empty (no mappings, thus blocking all DMA). The guest then modifies the mappings of the context using the map_page/unmap_page sub-operations, with its own physical addresses.

/*
 * Map device-visible frame number ffffh into
 * guest-visible frame number ddddh read/write.
 */
op.ctx_no = ctx_no;
op.flags = IOMMU_OP_readable | IOMMU_OP_writeable;
op.subop_id = IOMMUOP_map_page;

op.map_page.gfn = 0xdddd;
op.map_page.dfn = 0xffff;

HYPERVISOR_iommu_op(&op, 1);

Modifying an IOMMU context by creating a mapping
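The life cycle of a context (empty at first, then populated with map_page and trimmed with unmap_page) can be sketched with a small standalone model. This is only an illustration of the semantics described above, not the Xen implementation: every `toy_` and `TOY_` name, the fixed capacity, and the choice to reject a second mapping of the same frame are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CTX_MAX_PAGES 16          /* arbitrary capacity for the sketch */
#define TOY_READABLE  (1u << 0)       /* hypothetical permission flags */
#define TOY_WRITEABLE (1u << 1)

struct toy_ctx_entry {
    uint64_t dfn;        /* device-visible frame number */
    uint64_t gfn;        /* guest-visible frame number */
    unsigned int flags;  /* TOY_READABLE and/or TOY_WRITEABLE */
    bool used;
};

struct toy_context {
    struct toy_ctx_entry pages[TOY_CTX_MAX_PAGES];
};

/* Find the entry for dfn, or NULL when the frame is not mapped. */
static struct toy_ctx_entry *toy_ctx_find(struct toy_context *ctx, uint64_t dfn)
{
    for (size_t i = 0; i < TOY_CTX_MAX_PAGES; i++)
        if (ctx->pages[i].used && ctx->pages[i].dfn == dfn)
            return &ctx->pages[i];
    return NULL;
}

/* Map dfn -> gfn. Returns 0 on success, -1 if already mapped or full. */
static int toy_ctx_map_page(struct toy_context *ctx, uint64_t dfn,
                            uint64_t gfn, unsigned int flags)
{
    if (toy_ctx_find(ctx, dfn))
        return -1;
    for (size_t i = 0; i < TOY_CTX_MAX_PAGES; i++) {
        if (!ctx->pages[i].used) {
            ctx->pages[i] = (struct toy_ctx_entry){ dfn, gfn, flags, true };
            return 0;
        }
    }
    return -1;
}

/* Remove the mapping for dfn. Returns 0 on success, -1 if not mapped. */
static int toy_ctx_unmap_page(struct toy_context *ctx, uint64_t dfn)
{
    struct toy_ctx_entry *e = toy_ctx_find(ctx, dfn);

    if (!e)
        return -1;
    e->used = false;
    return 0;
}

/* A DMA access is allowed only if dfn is mapped with the needed flags. */
static bool toy_ctx_allows(struct toy_context *ctx, uint64_t dfn,
                           unsigned int needed)
{
    struct toy_ctx_entry *e = toy_ctx_find(ctx, dfn);

    return e != NULL && (e->flags & needed) == needed;
}
```

A zero-initialized `struct toy_context` blocks everything, which mirrors a freshly allocated context: `toy_ctx_allows()` only starts returning true for a frame once `toy_ctx_map_page()` has succeeded for it, and stops again after `toy_ctx_unmap_page()`.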

To make the context we created actually useful, we need to attach it to a device. By default, all devices are bound to the "default context" (ctx_no = 0), which maps the entire guest memory (the usual passthrough behavior). The reattach_device sub-operation changes the context applied to a device. A device can be attached to only one context at a time, but a single context can be used by several devices.

/* Put the device 0000:01:00.0 in context 'ctx_no'. */

op.ctx_no = ctx_no;
op.flags = 0;
op.subop_id = IOMMUOP_reattach_device; = 0x0000; = 0x01; = 0x00;

HYPERVISOR_iommu_op(&op, 1);

Reattaching a device to another context
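The attachment rule (each device belongs to exactly one context, while a context can hold several devices) can be modelled with a simple table. The sketch below is standalone and hypothetical: the `toy_` structures and functions are not the PV-IOMMU interface. The only part taken from the PCI world is the devfn encoding (slot in the upper 5 bits, function in the lower 3), under which device 0000:01:00.0 is segment 0x0000, bus 0x01, devfn 0x00.

```c
#include <stddef.h>
#include <stdint.h>

/* PCI address of a device: segment, bus, and devfn. */
struct toy_pci_dev {
    uint16_t seg;
    uint8_t bus;
    uint8_t devfn;
};

/* Standard PCI encoding: 5-bit slot number and 3-bit function number. */
static uint8_t toy_pci_devfn(unsigned int slot, unsigned int func)
{
    return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/* One row per device: storing a single ctx_no enforces "one context". */
struct toy_attachment {
    struct toy_pci_dev dev;
    uint16_t ctx_no;   /* 0 is the default context */
};

static int toy_same_dev(struct toy_pci_dev a, struct toy_pci_dev b)
{
    return a.seg == b.seg && a.bus == b.bus && a.devfn == b.devfn;
}

/* Move a device to another context. Returns 0, or -1 if it is unknown. */
static int toy_reattach_device(struct toy_attachment *table, size_t count,
                               struct toy_pci_dev dev, uint16_t ctx_no)
{
    for (size_t i = 0; i < count; i++) {
        if (toy_same_dev(table[i].dev, dev)) {
            table[i].ctx_no = ctx_no;  /* implicitly leaves the old context */
            return 0;
        }
    }
    return -1;
}

/* Count the devices currently attached to a context. */
static size_t toy_devices_in_context(const struct toy_attachment *table,
                                     size_t count, uint16_t ctx_no)
{
    size_t n = 0;

    for (size_t i = 0; i < count; i++)
        if (table[i].ctx_no == ctx_no)
            n++;
    return n;
}
```

Because the context number lives in the per-device row, reattaching a device never leaves it in two contexts at once, and nothing prevents several rows from sharing the same number.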

Using these operations, we can implement an IOMMU driver for Linux, which can then be used by the DMA API (allowing DMA protection) or by VFIO.

linux/drivers/iommu/xen-iommu.c at 5bdba188f1dc92974c9246f13552d2c5834c82d5 · TSnake41/linux
Xen IOMMU contexts work. Contribute to TSnake41/linux development by creating an account on GitHub.

These contexts need to be managed on the Xen side as well, which is another story.

🧙‍♂️ Modifying the Xen IOMMU subsystem

In Xen, the IOMMU subsystem doesn't allow multiple IOMMU contexts in a single Xen domain. Currently, only a single IOMMU context exists per Xen domain. Its role is to translate DMA so that it is consistent with guest memory addresses, and that context shouldn't be modified (for various reasons).

Two approaches can be taken to allow several IOMMU contexts to exist in a single Xen domain:

  • add IOMMU contexts while keeping the existing logic almost intact: this approach was taken for the initial PoC, but it has limitations and is complex in practice
  • redesign the IOMMU subsystem around IOMMU contexts and rework existing features to use them: this is the approach currently considered, but it needs work and upstream feedback

The first approach may seem practical, but the current IOMMU design, built with a single context per Xen domain in mind, cannot be extended in practice: there are several corner cases to manage (with security implications) and potential conflicts with existing features (such as device quarantine, which uses a per-device IOMMU domain).

Moreover, the current IOMMU design is not practical to port to other platforms, especially for features such as device quarantine, and may cause issues for future porting efforts, for instance to RISC-V or Ampere ARM platforms.

To simplify the current Xen IOMMU infrastructure with new usages in mind (quarantine, IOMMU contexts, ...), a redesign is being considered.

🚀 Booting with PV-IOMMU

A working PoC implementing the first approach is available in the iommu-contexts-polytech-prd branch for the adventurous. Keep in mind that this implementation has several limitations (e.g. it doesn't properly support multiple domains, phantom devices are not properly handled, only VT-d is supported, ...) and is not well tested. It implements the PV-IOMMU hypercall interface and some form of IOMMU context support in the Xen IOMMU subsystem.

GitHub - TSnake41/xen at iommu-contexts-polytech-prd
Xen IOMMU contexts work. Contribute to TSnake41/xen development by creating an account on GitHub.

🔮 Future work

The new design for the IOMMU subsystem is not complete yet (nor vetted upstream), and work is still needed on both the design and the final implementation. The current implementation only considers VT-d, and more work is needed to support AMD-Vi and SMMUv3 as well. You can follow my work on my Git repo.

GitHub - TSnake41/xen at iommu-context-wip
Xen IOMMU contexts work. Contribute to TSnake41/xen development by creating an account on GitHub.

There will be more to come on this subject in the future, so stay tuned!


Teddy Astie

Xen hypervisor R&D intern in the XCP-ng team, open-source enthusiast and high performance enjoyer. Rust fan. Focusing on improving the IOMMU stack of Xen.