TL;DR:
On the same Dell R720 with dual MI100s passed through, FlashAttention kernels in llama.cpp and ollama are rock-solid on Proxmox (KVM) and on bare-metal Ubuntu 24.04. On XCP-ng 8.3 VMs they are hit-or-miss or stall outright. Without FlashAttention some models run and others don't; with FlashAttention the GPU often pegs at 100% but no tokens are produced. This forced me to run Proxmox on my inference host even though I'd prefer XCP-ng. Looking for guidance on IOMMU/PASID/ATS/XNACK support and recommended settings for AMD Instinct (MI100) under passthrough.
Hardware & Setup (same physical box)
- Server: Dell R720
- GPUs: 2 x AMD Instinct MI100 (gfx908)
- Use case: Local inference server for LLMs
Behavior across environments (same hardware, different host)
- Bare-metal Ubuntu 24.04 → llama.cpp and ollama with FlashAttention ON = stable, fast
- Proxmox (KVM) host → Ubuntu guest with FlashAttention ON = stable, fast
- XCP-ng 8.3 (Xen) → guests with FlashAttention OFF = sometimes works, model dependent
- XCP-ng 8.3 (Xen) → guests with FlashAttention ON = GPU stuck at 100%, no tokens, stall
All tests were done on the same hardware. Only the hypervisor/host changed.
What I observe on XCP-ng 8.3
- With FlashAttention enabled, GPU ramps to 100% then nothing — the decode loop never returns tokens.
- Disabling FlashAttention lets some models complete, but reliability is inconsistent compared to Proxmox or bare-metal.
- Same binaries, kernels, and models behave correctly on Proxmox and bare-metal.
Why FlashAttention stresses the platform (short primer)
FlashAttention fuses the attention pipeline into large tiled kernels that minimize HBM traffic by keeping working sets in on-chip memory. It does not materialize the full N x N attention matrix, but it produces heavy, sustained compute + memory traffic and can amplify page-fault / on-demand paging behavior (UVM). In practice, FlashAttention is a high-stress test for:
- device / IOMMU / host page-fault plumbing,
- PASID / ATS / PRI behavior,
- XNACK (GPU page-fault retry) correctness and latency.
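One concrete check on a Linux guest for the page-fault-retry plumbing above: the in-tree amdgpu module exposes its retry-fault knob in sysfs. A sketch (parameter semantics can vary between kernel versions, so treat the interpretation as a starting point):

```shell
# Check the amdgpu retry-fault knob that backs XNACK.
# noretry=0 -> GPU page-fault retry enabled, 1 -> disabled, -1 -> per-ASIC default.
p=/sys/module/amdgpu/parameters/noretry
if [ -r "$p" ]; then
    status="amdgpu.noretry=$(cat "$p")"
else
    status="amdgpu parameter not available (module not loaded?)"
fi
echo "$status"
```

Comparing this value between the Proxmox guest and the XCP-ng guest is a cheap first differential.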
Hypothesis: where the XCP-ng path may break down
- IOMMU / PASID / ATS / PRI under Xen passthrough. MI100 + ROCm rely on GPU page-fault retry (XNACK) and on PASID/ATS features. If those are missing, disabled, or serviced with high latency in a VM, large fused kernels can spin or stall: GPU busy but no forward progress. Proxmox/KVM may expose a friendlier path.
- XNACK mode mismatches. MI100 (gfx908) is sensitive to the XNACK mode. If userland/tooling (binaries / ROCm build) and the driver/kernel disagree about xnack+ vs xnack-, FlashAttention kernels can hang. Bare-metal/Proxmox have a known-good combo; XCP-ng guests may not.
- Event channels / IRQ vector limits. Xen guests get a finite pool of interrupt vectors. If amdgpu/ROCm requests more MSI/MSI-X vectors than allocated, MSI-X enablement can fail, and a fallback to legacy IRQs can produce pathological behavior. The XCP-ng docs recommend `extra_guest_irqs` as a knob to try; less likely to matter for an MI100 than for NVMe arrays, but easy to test if logs show MSI/MSI-X failures.
- BAR sizing / MMIO quirks on older platforms. The Dell R720 is PCIe Gen3-era hardware. If Xen/dom0 maps the large MI100 BARs differently than KVM/bare-metal (windowing, segmentation), heavy kernels could hit suboptimal paths.
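To make the XNACK-mismatch hypothesis testable: on gfx9 the runtime encodes the XNACK mode in the ISA target string reported by rocminfo, and `HSA_XNACK` is the userland override. A sketch of the comparison, run here against a canned one-line sample (on a real guest, pipe live `rocminfo` output in instead):

```shell
# Canned sample of a rocminfo ISA name line (LLVM target-id format for gfx908).
sample='Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-'
# On a real guest:  isa_xnack=$(rocminfo | grep -o 'xnack[+-]' | head -n 1)
isa_xnack=$(printf '%s\n' "$sample" | grep -o 'xnack[+-]' | head -n 1)
echo "ISA reports:         ${isa_xnack:-unknown}"
echo "HSA_XNACK requested: ${HSA_XNACK:-unset}"
```

If the two disagree (e.g. `HSA_XNACK=1` against an `xnack-` ISA), that combination is a known hang candidate for large fused kernels.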
What I’ve validated
- Same models/builds of llama.cpp and ollama run with FlashAttention on Proxmox and bare-metal.
- On XCP-ng 8.3, disabling FlashAttention lets some models run, but unreliably.
- Enabling FlashAttention on XCP-ng consistently reproduces the GPU 100% / no tokens stall.
Requests for XCP-ng/Xen team & community
- Feature support: What is the current support status for PASID / PRI / ATS for PCIe devices exposed to guests? Any caveats for MI100 + ROCm under Xen passthrough? Recommended dom0 / Xen boot params for low-latency GPU page-fault handling?
- Guidance on XNACK: Are there constraints for enabling XNACK in guests for gfx908? Best practices for aligning `HSA_XNACK` / guest kernel / amdgpu settings?
- IRQ provisioning: Should we proactively increase `extra_guest_irqs` for GPU-heavy guests even if no MSI-X errors appear? What log lines distinguish IRQ exhaustion from IOMMU / page-fault stalls?
- Known-good recipe: If anyone has MI100 + FlashAttention stable on XCP-ng 8.3, please share Xen version, dom0 kernel, guest kernel, ROCm version, GPU firmware, and any special Xen/guest kernel params.
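On the IRQ-provisioning point, this is the change I'd try first. A sketch, assuming the XenServer-style `xen-cmdline` helper that XCP-ng dom0 ships at `/opt/xensource/libexec/xen-cmdline` (verify the path and the current `extra_guest_irqs` syntax against the XCP-ng docs before applying; a host reboot is required):

```shell
# dom0: raise the per-guest interrupt vector pool, then reboot the host.
/opt/xensource/libexec/xen-cmdline --set-xen extra_guest_irqs=128
# After reboot, confirm the option made it onto the Xen command line:
xl info | grep xen_commandline
```

The value 128 is an illustrative guess, not a tuned recommendation.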
Diagnostics I can provide
- `xl dmesg` (dom0) and dom0 `dmesg`
- Guest `dmesg` filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, and fault messages
- Guest `lspci -vv` for the GPU (MSI/MSI-X state, BARs)
- `rocminfo` from the guest
- Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)
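For the filtered `dmesg`, this is the filter I use, shown here against two canned sample lines (illustration only, not real driver messages) so the behavior is visible without hardware:

```shell
# Keep only lines relevant to the GPU / IOMMU / page-fault path.
pattern='amdgpu|kfd|rocm|hsa|xnack|pasid|iommu|page.?fault|fence'
sample='[0.000001] amdgpu 0000:00:05.0: sample driver line (illustration only)
[0.000002] usb 1-1: unrelated sample line'
filtered=$(printf '%s\n' "$sample" | grep -iE "$pattern")
echo "$filtered"
# On the guest:   dmesg | grep -iE "$pattern"
# Same idea for PCI capabilities: lspci -vv -s <BDF> | grep -iE 'ats|pasid|page request'
```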
Repro steps (guest VM on XCP-ng 8.3)
- Pass through both MI100s to a Linux guest.
- Install ROCm stack that matches the working bare-metal/Proxmox setup.
- Build llama.cpp with FlashAttention enabled; install ollama and enable FlashAttention.
- Run the same model + prompt that succeeds elsewhere.
Result: GPU hits 100% and stalls (no tokens). Disable FlashAttention: some models run.
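For steps 3 and 4, roughly the commands I use. Build-flag and option names here (`GGML_HIP`, `AMDGPU_TARGETS`, `-fa`, `OLLAMA_FLASH_ATTENTION`) are those of recent llama.cpp / ollama releases; check `--help` and the ollama docs for your versions:

```shell
# Build llama.cpp with the ROCm/HIP backend targeting the MI100 (gfx908).
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Same model, same prompt: FlashAttention on (-fa) vs off.
./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "test prompt"
./build/bin/llama-cli -m model.gguf -ngl 99 -p "test prompt"

# ollama: FlashAttention is toggled by an environment variable on the server.
OLLAMA_FLASH_ATTENTION=1 ollama serve
```

On XCP-ng, the first `-fa` run is the one that pegs the GPU at 100% with no tokens.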
Why this matters
XCP-ng is a great platform and I’d prefer to consolidate inference there. Right now there’s a reliability gap for modern AI workloads (FlashAttention / ROCm) compared to Proxmox and bare-metal on identical hardware, forcing me to run Proxmox for this node. If this is a config/feature gap (IOMMU / PASID / ATS / XNACK, IRQ provisioning, etc.), I’m happy to validate fixes or test previews. If it’s a known limitation, documenting it will save others time.
Thanks — happy to provide logs, repros, and run more tests. If anyone has a working recipe for MI100 + ROCm + FlashAttention on XCP-ng, please share.