IOMMU paravirtualization for Xen
Hello! I am Teddy, an R&D intern at Vates working on the IOMMU stack of Xen. For this first blog post, I will introduce a new feature I am working on: IOMMU paravirtualization. This feature will allow Dom0 to use a paravirtualized IOMMU, which can serve numerous purposes such as Dom0 DMA protection or Linux VFIO support.
This project is part of an ongoing effort to support SPDK with Xen, support DPUs, and much more.
IOMMU Introduction
The IOMMU is a special device implemented by the platform, and it goes by varying names such as VT-d (Intel), AMD-Vi (AMD), and SMMU (Arm), among others. Its role is to translate or filter DMA requests from devices to machine physical memory. In virtualization it is usually used for PCI passthrough, to make DMA requests coherent with the guest's memory context, but it can also be used by operating systems to protect their memory from devices.
It can also be used to let userspace programs interact directly with devices, as Linux does with the VFIO framework, which is used by SPDK among others.
Xen and IOMMU
Xen already leverages the IOMMU for PCI passthrough and to restrict the memory that devices can access. Therefore, for stability and security reasons, guests (including Dom0) can't directly access the IOMMU hardware available on the machine. All hope is not lost, though: we can still expose an interface that gives the guest access to an IOMMU, for instance a simplified one relying on Xen's paravirtualization infrastructure.
Introducing PV-IOMMU
We introduce a new paravirtualized IOMMU, simply named PV-IOMMU. It implements the features a guest expects from an IOMMU while abstracting away all the internal hardware details. In Xen, we add a new hypercall for such operations, HYPERVISOR_iommu_op, which provides several IOMMU operations that the guest can use (if allowed).
One of the main features a guest expects from an IOMMU is the ability to create and modify "IOMMU domains": sets of translations that form a memory context and that can be applied to one or more devices. These domains are named "IOMMU contexts" in Xen to avoid confusion with Xen domains, which are virtual machines.
The operations on the PV-IOMMU are exposed as sub-operations of the HYPERVISOR_iommu_op hypercall, and are abstracted in a way that is practical for the guest.
The guest first allocates an IOMMU context using the alloc_context sub-operation, which returns a "context number" on success. This context number is a handle to the created context and is required for all further operations on IOMMU contexts.
This context is initially empty (no mappings, thus blocking all DMA); the guest populates it with its own physical addresses using the map_page and unmap_page sub-operations.
To make the newly created context actually useful, we need to attach it to a device. By default, all devices are bound to the "default context" (ctx_no = 0), which maps the entire guest memory (the usual passthrough behavior). The reattach_device sub-operation changes the context applied to a device. A device can be attached to only one context at a time, but a single context can be used by several devices.
Using these operations, we can implement an IOMMU driver for Linux, which can then be used by the DMA API (enabling DMA protection) or by VFIO.
These contexts need to be managed on the Xen side as well, which is another story.
Modifying the Xen IOMMU subsystem
In Xen, the IOMMU subsystem doesn't allow multiple IOMMU contexts in a single Xen domain. Currently, only one IOMMU context exists per Xen domain; its role is to translate DMA so that it is coherent with guest memory addresses, and that context shouldn't be modified (for various reasons).
Two approaches can be taken to allow several IOMMU contexts to exist in a single Xen domain:
- add IOMMU contexts while keeping the existing logic almost intact: this approach was taken for the initial PoC, but it has limitations and is complex in practice
- redesign the IOMMU subsystem around IOMMU contexts and rework existing features to use them: this approach is currently considered, but it needs more work and upstream feedback
The first approach may seem practical, but the current IOMMU design, built with a single context per Xen domain in mind, cannot be expanded in practice due to several corner cases to manage (with security implications) and potential conflicts with existing features (such as device quarantine, which uses a per-device IOMMU domain).
Moreover, the current IOMMU design is not easy to port to other platforms, especially for features such as device quarantine, and may cause issues for future porting efforts, for instance to RISC-V or Ampere Arm platforms.
In order to simplify the current Xen IOMMU infrastructure with new usages in mind (quarantine, IOMMU contexts, ...), a redesign is being considered.
Booting with PV-IOMMU
A working PoC implementing the first approach is available in the iommu-contexts-polytech-prd branch for the adventurous. Keep in mind that this implementation has several limitations (e.g. it doesn't properly support multiple domains, phantom devices are not properly handled, only VT-d is supported, ...) and is not well-tested. It implements the PV-IOMMU hypercall interface and a form of IOMMU context support in the Xen IOMMU subsystem.
Future work
The new design for the IOMMU subsystem is not complete yet (nor vetted upstream), and some work is still needed on it and on the final implementation. The current implementation only considers VT-d; more work is needed to cover AMD-Vi and SMMUv3 as well. You can follow my work on my Git repo.
There will be more to come on this subject in the future, so stay tuned!