    🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal

      TL;DR:
      On the same Dell R720 with dual MI100s passed through, FlashAttention kernels in llama.cpp and ollama are rock-solid on Proxmox (KVM) and on bare-metal Ubuntu 24.04. On XCP-ng 8.3 VMs they are hit-or-miss or stall outright. Without FlashAttention some models run, others don’t; with FlashAttention the GPU often pegs at 100% but no tokens are produced. This forced me to run Proxmox for my inference host even though I’d prefer XCP-ng. Looking for guidance on IOMMU/PASID/ATS/XNACK support and recommended settings for AMD Instinct (MI100) under passthrough.


      Hardware & Setup (same physical box)
      • Server: Dell R720
      • GPUs: 2 x AMD Instinct MI100 (gfx908)
      • Use case: Local inference server for LLMs

      Behavior across environments (same hardware, different host)
      • Bare-metal Ubuntu 24.04 → llama.cpp and ollama with FlashAttention ON = ✅ Stable, fast
      • Proxmox (KVM) host → Ubuntu guest with FlashAttention ON = ✅ Stable, fast
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention OFF = ⚠️ Sometimes works, model dependent
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention ON = ❌ GPU stuck at 100%, no tokens, stall

      All tests were done on the same hardware. Only the hypervisor/host changed.


      What I observe on XCP-ng 8.3
      • With FlashAttention enabled, GPU ramps to 100% then nothing — the decode loop never returns tokens.
      • Disabling FlashAttention lets some models complete, but reliability is inconsistent compared to Proxmox or bare-metal.
      • Same binaries, kernels, and models behave correctly on Proxmox and bare-metal.

      Why FlashAttention stresses the platform (short primer)

      FlashAttention fuses the attention pipeline into large tiled kernels that minimize HBM traffic by keeping working sets in on-chip memory. It never materializes the full N × N attention matrix (the streaming-softmax recurrence behind this is sketched after the list below), but it produces heavy, sustained compute and memory traffic and can amplify page-fault / on-demand paging (UVM-style) behavior. In practice, FlashAttention is a high-stress test for:

      • device ↔ IOMMU ↔ host page-fault plumbing,
      • PASID / ATS / PRI behavior,
      • XNACK (GPU page-fault retry) correctness and latency.
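
      To make the "high-stress" claim concrete: the tiling trick is a streaming ("online") softmax. For one query block Q, each key/value tile (K_j, V_j) updates a running row-max m, a normalizer ℓ, and an output accumulator O. In standard notation (mine, not lifted from any particular paper):

      $$
      \begin{aligned}
      S_j &= Q K_j^{\top} / \sqrt{d} \\
      m_j &= \max\!\big(m_{j-1},\ \operatorname{rowmax}(S_j)\big) \\
      \ell_j &= e^{\,m_{j-1}-m_j}\,\ell_{j-1} + \operatorname{rowsum}\!\big(e^{\,S_j - m_j}\big) \\
      O_j &= e^{\,m_{j-1}-m_j}\,O_{j-1} + e^{\,S_j - m_j}\,V_j
      \end{aligned}
      $$

      with m₀ = −∞, ℓ₀ = 0, O₀ = 0 and final output O_T / ℓ_T. Nothing N × N is ever stored, but every tile is a fresh burst through the device’s memory and fault path, which is why slow page-fault plumbing shows up here first.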

      Hypotheses: where the XCP-ng path may break down
      1. IOMMU / PASID / ATS / PRI under Xen passthrough.
        MI100 + ROCm rely on GPU page-fault retry (XNACK) and the PASID/ATS/PRI features. If those are missing, disabled, or serviced with high latency in a VM, large fused kernels can spin or stall: GPU busy, no forward progress. Proxmox/KVM may expose a friendlier path. (A quick-check script covering all four hypotheses follows this list.)

      2. XNACK mode mismatches.
        MI100 (gfx908) is sensitive to the XNACK mode. If userland/tooling (binaries / ROCm build) and the driver/kernel disagree about xnack+ vs xnack-, FlashAttention kernels can hang. Bare-metal/Proxmox have a known-good combo; XCP-ng guests may not.

      3. Event channels / IRQ vector limits.
        Xen guests get a finite pool of interrupt vectors. If amdgpu/ROCm requests more MSI/MSI-X vectors than are available, MSI-X enablement can fail, and a fallback to legacy IRQs can produce pathological behavior. The XCP-ng docs mention extra_guest_irqs as a knob to try; this is less likely to matter for an MI100 than for NVMe, but it is easy to rule out if logs show MSI/MSI-X failures.

      4. BAR sizing / MMIO quirks on older platforms.
        Dell R720 is PCIe Gen3-era hardware. If Xen/dom0 maps large MI100 BARs differently than KVM/bare-metal (windowing, segmentation), heavy kernels could hit suboptimal paths.
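
      Here is the quick-check script I mean, run from inside the guest. It is a minimal, hypothetical helper (the device-match string and the lspci capability spellings are assumptions from typical output, so adjust for your system):

      ```bash
      #!/usr/bin/env bash
      # quick-checks.sh (hypothetical name): inspect the guest-visible GPU
      # for the four hypotheses above.
      BDF=$(lspci -D | awk '/Instinct|MI100/ {print $1; exit}')
      echo "GPU at: ${BDF:?no MI100 found}"

      # 1) IOMMU / PASID / ATS / PRI: are the capabilities exposed, and enabled?
      sudo lspci -vvv -s "$BDF" | grep -E 'Address Translation|Page Request|Process Address Space|ATSCtl|PASIDCtl|PRICtl'

      # 2) XNACK: ISA string (xnack+ / xnack-) vs the kernel's retry-fault setting.
      rocminfo | grep -oE 'gfx908[^ ]*' | sort -u
      cat /sys/module/amdgpu/parameters/noretry   # -1 auto, 0 = retry (XNACK) on, 1 = off

      # 3) IRQ vectors: did MSI-X actually come up, and with how many vectors?
      sudo lspci -vvv -s "$BDF" | grep 'MSI'
      grep -c amdgpu /proc/interrupts

      # 4) BAR sizing: compare the mapped regions against the Proxmox guest.
      sudo lspci -vvv -s "$BDF" | grep 'Region'
      ```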


      What I’ve validated
      • Same models/builds of llama.cpp and ollama run with FlashAttention on Proxmox and bare-metal.
      • On XCP-ng 8.3, disabling FlashAttention lets some models run, but unreliably.
      • Enabling FlashAttention on XCP-ng consistently reproduces the GPU 100% / no tokens stall.

      Requests for XCP-ng/Xen team & community
      • Feature support: What is the current support status for PASID / PRI / ATS for PCIe devices exposed to guests? Any caveats for MI100 + ROCm under Xen passthrough? Recommended dom0 / Xen boot params for low-latency GPU page-fault handling?
      • Guidance on XNACK: Are there constraints for enabling XNACK in guests for gfx908? Best practices for aligning HSA_XNACK / guest kernel / amdgpu settings?
      • IRQ provisioning: Should we proactively increase extra_guest_irqs for GPU-heavy guests even if no MSI-X errors appear? (A sketch of the knob follows this list.) What log lines distinguish IRQ exhaustion from IOMMU / page-fault stalls?
      • Known-good recipe: If anyone has MI100 + FlashAttention stable on XCP-ng 8.3, please share Xen version, dom0 kernel, guest kernel, ROCm version, GPU firmware, and any special Xen/guest kernel params.
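
      On the IRQ point, this is the knob I understand the docs to mean. Hypothetical sketch: the tool path is from a stock XCP-ng dom0 and the value 64 is an arbitrary example, so please correct me if the syntax differs:

      ```bash
      # In dom0: query and extend the per-guest IRQ pool, then reboot the host.
      /opt/xensource/libexec/xen-cmdline --get-xen extra_guest_irqs
      /opt/xensource/libexec/xen-cmdline --set-xen extra_guest_irqs=64
      ```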

      Diagnostics I can provide
      • xl dmesg (dom0) and dom0 dmesg
      • Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
      • Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
      • rocminfo from the guest
      • Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)
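
      To make that turnkey, I’d collect everything with something like the following (gather-diag.sh is just my name for it; the grep filter mirrors the list above):

      ```bash
      #!/usr/bin/env bash
      # gather-diag.sh (hypothetical): bundle the diagnostics above into a tarball.
      # Run once in dom0 and once in the guest; it picks the right half itself.
      set -u
      OUT="diag-$(hostname)-$(date +%Y%m%d-%H%M%S)"
      mkdir -p "$OUT"

      if command -v xl >/dev/null 2>&1; then   # dom0 half
          xl dmesg > "$OUT/xl-dmesg.txt"
          dmesg    > "$OUT/dom0-dmesg.txt"
      else                                     # guest half
          dmesg | grep -iE 'amdgpu|rocm|hsa|xnack|pasid|iommu|fault' \
              > "$OUT/guest-dmesg-filtered.txt"
          lspci -vv -d 1002: > "$OUT/gpu-lspci.txt"   # 1002: = AMD vendor ID
          rocminfo > "$OUT/rocminfo.txt" 2>&1
      fi
      tar czf "$OUT.tar.gz" "$OUT" && echo "wrote $OUT.tar.gz"
      ```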

      Repro steps (guest VM on XCP-ng 8.3)
      1. Pass through both MI100s to a Linux guest.
      2. Install ROCm stack that matches the working bare-metal/Proxmox setup.
      3. Build llama.cpp with FlashAttention enabled; install ollama with FlashAttention (a build/run sketch follows these steps).
      4. Run the same model + prompt that succeeds elsewhere.
        Result: GPU hits 100% and stalls (no tokens). Disable FlashAttention: some models run.
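
      For step 3, this is roughly the build and run I use. Flag spellings (GGML_HIP, AMDGPU_TARGETS, -fa, OLLAMA_FLASH_ATTENTION) follow the llama.cpp and ollama docs I built against; verify them against the versions you use:

      ```bash
      # Build llama.cpp with ROCm/HIP for gfx908 (MI100).
      git clone https://github.com/ggml-org/llama.cpp
      cd llama.cpp
      HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
          cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release
      cmake --build build -j

      # FlashAttention ON: on XCP-ng the GPU pegs at 100% and no tokens appear.
      ./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "Hello"
      # FlashAttention OFF (default): some models complete.
      ./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"

      # ollama: FlashAttention is toggled through the server environment.
      OLLAMA_FLASH_ATTENTION=1 ollama serve
      ```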

      Why this matters
      XCP-ng is a great platform and I’d prefer to consolidate inference there. Right now there’s a reliability gap for modern AI workloads (FlashAttention / ROCm) compared to Proxmox and bare-metal on identical hardware, forcing me to run Proxmox for this node. If this is a config/feature gap (IOMMU / PASID / ATS / XNACK, IRQ provisioning, etc.), I’m happy to validate fixes or test previews. If it’s a known limitation, documenting it will save others time.

      Thanks — happy to provide logs, repros, and run more tests. If anyone has a working recipe for MI100 + ROCm + FlashAttention on XCP-ng, please share.
