
    🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal

    • emuchogu 0

      TL;DR:
      On the same Dell R720 with dual MI100s passed through, FlashAttention kernels in llama.cpp and ollama are rock-solid on Proxmox (KVM) and on bare-metal Ubuntu 24.04. On XCP-ng 8.3 VMs they are hit-or-miss or stall outright. Without FlashAttention some models run and others don't; with FlashAttention the GPU often pegs at 100% but produces no tokens. This forced me to run Proxmox on my inference host even though I'd prefer XCP-ng. I'm looking for guidance on IOMMU/PASID/ATS/XNACK support and recommended settings for AMD Instinct (MI100) under passthrough.


      Hardware & Setup (same physical box)
      • Server: Dell R720
      • GPUs: 2× AMD Instinct MI100 (gfx908)
      • Use case: Local inference server for LLMs

      Behavior across environments (same hardware, different host)
      • Bare-metal Ubuntu 24.04 → llama.cpp and ollama with FlashAttention ON = ✅ Stable, fast
      • Proxmox (KVM) host → Ubuntu guest with FlashAttention ON = ✅ Stable, fast
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention OFF = ⚠️ Sometimes works, model dependent
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention ON = ❌ GPU stuck at 100%, no tokens, stall

      All tests were done on the same hardware. Only the hypervisor/host changed.
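
      For clarity, "FlashAttention ON/OFF" above means the runtime toggles, roughly like this (a sketch; flag names vary across llama.cpp and ollama versions, and model.gguf is a placeholder):

        # llama.cpp: -fa enables FlashAttention; omit it for the OFF case
        ./llama-cli -m model.gguf -ngl 99 -fa -p "test prompt"
        ./llama-cli -m model.gguf -ngl 99 -p "test prompt"

        # ollama: the server reads an environment variable
        OLLAMA_FLASH_ATTENTION=1 ollama serve   # ON
        OLLAMA_FLASH_ATTENTION=0 ollama serve   # OFF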


      What I observe on XCP-ng 8.3
      • With FlashAttention enabled, the GPU ramps to 100% and then nothing happens; the decode loop never returns tokens.
      • Disabling FlashAttention lets some models complete, but reliability is inconsistent compared to Proxmox or bare-metal.
      • Same binaries, kernels, and models behave correctly on Proxmox and bare-metal.

      Why FlashAttention stresses the platform (short primer)

      FlashAttention fuses the attention pipeline into large tiled kernels that minimize HBM traffic by keeping working sets in on-chip memory. It does not materialize the full N×N attention matrix, but it produces heavy, sustained compute + memory traffic and can amplify page-fault / on-demand paging behavior (UVM). In practice, FlashAttention is a high-stress test for:

      • device ↔ IOMMU ↔ host page-fault plumbing,
      • PASID / ATS / PRI behavior,
      • XNACK (GPU page-fault retry) correctness and latency.

      Hypothesis: where the XCP-ng path may break down
      1. IOMMU / PASID / ATS / PRI under Xen passthrough.
        MI100 + ROCm rely on GPU page-fault retry (XNACK) and PASID/ATS features. If those are missing, disabled, or serviced with high latency in a VM, large fused kernels can spin or stall: GPU busy, but no forward progress. Proxmox/KVM may expose a friendlier path.

      2. XNACK mode mismatches.
        MI100 (gfx908) is sensitive to the XNACK mode. If userland/tooling (binaries / ROCm build) and the driver/kernel disagree about xnack+ vs xnack-, FlashAttention kernels can hang. Bare-metal/Proxmox have a known-good combo; XCP-ng guests may not. (A quick consistency check is sketched after this list.)

      3. Event channels / IRQ vector limits.
        Xen guests get a finite pool of interrupt vectors. If amdgpu/ROCm requests more MSI/MSI-X vectors than are allocated, MSI-X enablement can fail, and a fallback to legacy IRQs can produce pathological behavior. The XCP-ng docs recommend extra_guest_irqs as a knob to try; it's less likely to matter for an MI100 than for NVMe, but it's easy to try if logs show MSI/MSI-X failures.

      4. BAR sizing / MMIO quirks on older platforms.
        Dell R720 is PCIe Gen3-era hardware. If Xen/dom0 maps large MI100 BARs differently than KVM/bare-metal (windowing, segmentation), heavy kernels could hit suboptimal paths.
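
      If hypotheses 1 and 2 are on the right track, a quick consistency check inside the guest would look roughly like this (a sketch; rocminfo output formats vary across ROCm releases, and model.gguf is a placeholder):

        # the gfx908 ISA name carries the active XNACK mode as a suffix
        rocminfo | grep -i gfx908
        # e.g. amdgcn-amd-amdhsa--gfx908:xnack-  means page-fault retry is off

        # force one mode per run and compare behavior
        HSA_XNACK=0 ./llama-cli -m model.gguf -ngl 99 -fa -p "test"
        HSA_XNACK=1 ./llama-cli -m model.gguf -ngl 99 -fa -p "test"

        # amdgpu retry setting in the guest kernel (1 = retry disabled)
        cat /sys/module/amdgpu/parameters/noretry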


      What I’ve validated
      • Same models/builds of llama.cpp and ollama run with FlashAttention on Proxmox and bare-metal.
      • On XCP-ng 8.3, disabling FlashAttention lets some models run, but unreliably.
      • Enabling FlashAttention on XCP-ng consistently reproduces the GPU 100% / no tokens stall.

      Requests for XCP-ng/Xen team & community
      • Feature support: What is the current support status for PASID / PRI / ATS for PCIe devices exposed to guests? Any caveats for MI100 + ROCm under Xen passthrough? Recommended dom0 / Xen boot params for low-latency GPU page-fault handling?
      • Guidance on XNACK: Are there constraints for enabling XNACK in guests for gfx908? Best practices for aligning HSA_XNACK / guest kernel / amdgpu settings?
      • IRQ provisioning: Should we proactively increase extra_guest_irqs for GPU-heavy guests even if no MSI-X errors appear? What log lines distinguish IRQ exhaustion from IOMMU / page-fault stalls?
      • Known-good recipe: If anyone has MI100 + FlashAttention stable on XCP-ng 8.3, please share Xen version, dom0 kernel, guest kernel, ROCm version, GPU firmware, and any special Xen/guest kernel params.
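
      For the IRQ provisioning bullet, my understanding of how the knob is applied on an XCP-ng host, as a sketch based on the XCP-ng passthrough docs (64 is an arbitrary example value):

        # dom0: add extra_guest_irqs to the Xen command line, then reboot the host
        /opt/xensource/libexec/xen-cmdline --set-xen extra_guest_irqs=64
        # after reboot, confirm it took effect
        xl info | grep xen_commandline

        # in the guest, check whether MSI-X actually came up on the GPU
        lspci -vv -d 1002: | grep -A 2 MSI-X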

      Diagnostics I can provide
      • xl dmesg (dom0) and dom0 dmesg
      • Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
      • Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
      • rocminfo from the guest
      • Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)
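
      Concretely, I'd collect those roughly like this (a sketch; 1002: filters on the AMD PCI vendor ID):

        # dom0
        xl dmesg > xl-dmesg.txt
        dmesg > dom0-dmesg.txt

        # guest
        dmesg | grep -iE 'amdgpu|rocm|hsa|xnack|pasid|iommu|fault' > guest-dmesg.txt
        lspci -vv -d 1002: > gpu-lspci.txt    # MSI/MSI-X state, BARs
        rocminfo > rocminfo.txt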

      Repro steps (guest VM on XCP-ng 8.3)
      1. Pass through both MI100s to a Linux guest.
      2. Install ROCm stack that matches the working bare-metal/Proxmox setup.
      3. Build llama.cpp with FlashAttention enabled; install ollama with FlashAttention.
      4. Run the same model + prompt that succeeds elsewhere.
        Result: GPU hits 100% and stalls (no tokens). Disable FlashAttention: some models run.
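
      A minimal command-line version of the repro, as a sketch (the ROCm build flag has changed names across llama.cpp releases, and model.gguf stands in for the model that works elsewhere):

        # build llama.cpp against ROCm (GGML_HIP on current trees, LLAMA_HIPBLAS on older ones)
        cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 && cmake --build build -j

        # run the exact model + prompt that succeeds on Proxmox / bare-metal
        ./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "hello"
        # on XCP-ng 8.3: GPU pegs at 100%, no tokens; without -fa, some models complete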

      Why this matters
      XCP-ng is a great platform and I’d prefer to consolidate inference there. Right now there’s a reliability gap for modern AI workloads (FlashAttention / ROCm) compared to Proxmox and bare-metal on identical hardware, forcing me to run Proxmox for this node. If this is a config/feature gap (IOMMU / PASID / ATS / XNACK, IRQ provisioning, etc.), I’m happy to validate fixes or test previews. If it’s a known limitation, documenting it will save others time.

      Thanks — happy to provide logs, repros, and run more tests. If anyone has a working recipe for MI100 + ROCm + FlashAttention on XCP-ng, please share.

      • TeddyAstie (Vates 🪐 XCP-ng Team, Xen Guru)

        @emuchogu-0 said in 🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal:

        xl dmesg (dom0) and dom0 dmesg
        Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
        Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
        rocminfo from the guest
        Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)

        Of course you need to provide this information; we can't blindly guess where something is failing.

        • olivierlambert (Vates 🪐 Co-Founder & CEO)

          Hi,

          Small side note: lately I’ve noticed that some posts look like they were generated by LLMs. This can actually make it harder for the community to help, because the text is often long, unclear, or missing the basic details we really need to assist.

          I’d really encourage everyone to write posts in their own words and share as much relevant information as possible. The real value of this community is people helping each other directly 🙂
