
    🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal

    • emuchogu 0

      TL;DR:
      On the same Dell R720 with dual MI100s passed through, FlashAttention kernels in llama.cpp and ollama are rock-solid on Proxmox (KVM) and on bare-metal Ubuntu 24.04. On XCP-ng 8.3 VMs they are hit-or-miss or stall outright. Without FlashAttention some models run and others don't; with FlashAttention the GPU often pegs at 100% but produces no tokens. This forced me to run Proxmox on my inference host even though I'd prefer XCP-ng. I'm looking for guidance on IOMMU/PASID/ATS/XNACK support and recommended settings for AMD Instinct (MI100) under passthrough.


      Hardware & Setup (same physical box)
      • Server: Dell R720
      • GPUs: 2× AMD Instinct MI100 (gfx908)
      • Use case: Local inference server for LLMs

      Behavior across environments (same hardware, different host)
      • Bare-metal Ubuntu 24.04 → llama.cpp and ollama with FlashAttention ON = ✅ Stable, fast
      • Proxmox (KVM) host → Ubuntu guest with FlashAttention ON = ✅ Stable, fast
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention OFF = ⚠️ Sometimes works, model dependent
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention ON = ❌ GPU stuck at 100%, no tokens, stall

      All tests were done on the same hardware. Only the hypervisor/host changed.
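
      For clarity, "FlashAttention ON/OFF" above means the runtime toggles, roughly like this (a sketch; flag names vary across llama.cpp and ollama versions, and model.gguf is a placeholder):

        # llama.cpp: -fa enables FlashAttention; omit it for the OFF case
        ./llama-cli -m model.gguf -ngl 99 -fa -p "test prompt"
        ./llama-cli -m model.gguf -ngl 99 -p "test prompt"

        # ollama: the server reads an environment variable
        OLLAMA_FLASH_ATTENTION=1 ollama serve   # ON
        OLLAMA_FLASH_ATTENTION=0 ollama serve   # OFF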


      What I observe on XCP-ng 8.3
      • With FlashAttention enabled, the GPU ramps to 100% and then nothing happens; the decode loop never returns tokens.
      • Disabling FlashAttention lets some models complete, but reliability is inconsistent compared to Proxmox or bare-metal.
      • Same binaries, kernels, and models behave correctly on Proxmox and bare-metal.

      Why FlashAttention stresses the platform (short primer)

      FlashAttention fuses the attention pipeline into large tiled kernels that minimize HBM traffic by keeping working sets in on-chip memory. It does not materialize the full N×N attention matrix, but it produces heavy, sustained compute + memory traffic and can amplify page-fault / on-demand paging behavior (UVM). In practice, FlashAttention is a high-stress test for:

      • device ↔ IOMMU ↔ host page-fault plumbing,
      • PASID / ATS / PRI behavior,
      • XNACK (GPU page-fault retry) correctness and latency.

      Hypothesis: where the XCP-ng path may break down
      1. IOMMU / PASID / ATS / PRI under Xen passthrough.
        MI100 + ROCm rely on GPU page-fault retry (XNACK) and PASID/ATS features. If those are missing, disabled, or serviced with high latency in a VM, large fused kernels can spin or stall: GPU busy, but no forward progress. Proxmox/KVM may expose a friendlier path.

      2. XNACK mode mismatches.
        MI100 (gfx908) is sensitive to the XNACK mode. If userland/tooling (binaries / ROCm build) and the driver/kernel disagree about xnack+ vs xnack-, FlashAttention kernels can hang. Bare-metal/Proxmox have a known-good combo; XCP-ng guests may not. (A quick consistency check is sketched after this list.)

      3. Event channels / IRQ vector limits.
        Xen guests get a finite pool of interrupt vectors. If amdgpu/ROCm requests more MSI/MSI-X vectors than are allocated, MSI-X enablement can fail, and a fallback to legacy IRQs can produce pathological behavior. The XCP-ng docs recommend extra_guest_irqs as a knob to try; it's less likely to matter for an MI100 than for NVMe, but it's easy to try if logs show MSI/MSI-X failures.

      4. BAR sizing / MMIO quirks on older platforms.
        Dell R720 is PCIe Gen3-era hardware. If Xen/dom0 maps large MI100 BARs differently than KVM/bare-metal (windowing, segmentation), heavy kernels could hit suboptimal paths.
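
      If hypotheses 1 and 2 are on the right track, a quick consistency check inside the guest would look roughly like this (a sketch; rocminfo output formats vary across ROCm releases, and model.gguf is a placeholder):

        # the gfx908 ISA name carries the active XNACK mode as a suffix
        rocminfo | grep -i gfx908
        # e.g. amdgcn-amd-amdhsa--gfx908:xnack-  means page-fault retry is off

        # force one mode per run and compare behavior
        HSA_XNACK=0 ./llama-cli -m model.gguf -ngl 99 -fa -p "test"
        HSA_XNACK=1 ./llama-cli -m model.gguf -ngl 99 -fa -p "test"

        # amdgpu retry setting in the guest kernel (1 = retry disabled)
        cat /sys/module/amdgpu/parameters/noretry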


      What I’ve validated
      • Same models/builds of llama.cpp and ollama run with FlashAttention on Proxmox and bare-metal.
      • On XCP-ng 8.3, disabling FlashAttention lets some models run, but unreliably.
      • Enabling FlashAttention on XCP-ng consistently reproduces the GPU 100% / no tokens stall.

      Requests for XCP-ng/Xen team & community
      • Feature support: What is the current support status for PASID / PRI / ATS for PCIe devices exposed to guests? Any caveats for MI100 + ROCm under Xen passthrough? Recommended dom0 / Xen boot params for low-latency GPU page-fault handling?
      • Guidance on XNACK: Are there constraints for enabling XNACK in guests for gfx908? Best practices for aligning HSA_XNACK / guest kernel / amdgpu settings?
      • IRQ provisioning: Should we proactively increase extra_guest_irqs for GPU-heavy guests even if no MSI-X errors appear? What log lines distinguish IRQ exhaustion from IOMMU / page-fault stalls?
      • Known-good recipe: If anyone has MI100 + FlashAttention stable on XCP-ng 8.3, please share Xen version, dom0 kernel, guest kernel, ROCm version, GPU firmware, and any special Xen/guest kernel params.
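
      For the IRQ provisioning bullet, my understanding of how the knob is applied on an XCP-ng host, as a sketch based on the XCP-ng passthrough docs (64 is an arbitrary example value):

        # dom0: add extra_guest_irqs to the Xen command line, then reboot the host
        /opt/xensource/libexec/xen-cmdline --set-xen extra_guest_irqs=64
        # after reboot, confirm it took effect
        xl info | grep xen_commandline

        # in the guest, check whether MSI-X actually came up on the GPU
        lspci -vv -d 1002: | grep -A 2 MSI-X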

      Diagnostics I can provide
      • xl dmesg (dom0) and dom0 dmesg
      • Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
      • Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
      • rocminfo from the guest
      • Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)
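
      Concretely, I'd collect those roughly like this (a sketch; 1002: filters on the AMD PCI vendor ID):

        # dom0
        xl dmesg > xl-dmesg.txt
        dmesg > dom0-dmesg.txt

        # guest
        dmesg | grep -iE 'amdgpu|rocm|hsa|xnack|pasid|iommu|fault' > guest-dmesg.txt
        lspci -vv -d 1002: > gpu-lspci.txt    # MSI/MSI-X state, BARs
        rocminfo > rocminfo.txt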

      Repro steps (guest VM on XCP-ng 8.3)
      1. Pass through both MI100s to a Linux guest.
      2. Install ROCm stack that matches the working bare-metal/Proxmox setup.
      3. Build llama.cpp with FlashAttention enabled; install ollama with FlashAttention.
      4. Run the same model + prompt that succeeds elsewhere.
        Result: GPU hits 100% and stalls (no tokens). Disable FlashAttention: some models run.
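
      A minimal command-line version of the repro, as a sketch (the ROCm build flag has changed names across llama.cpp releases, and model.gguf stands in for the model that works elsewhere):

        # build llama.cpp against ROCm (GGML_HIP on current trees, LLAMA_HIPBLAS on older ones)
        cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 && cmake --build build -j

        # run the exact model + prompt that succeeds on Proxmox / bare-metal
        ./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "hello"
        # on XCP-ng 8.3: GPU pegs at 100%, no tokens; without -fa, some models complete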

      Why this matters
      XCP-ng is a great platform and I’d prefer to consolidate inference there. Right now there’s a reliability gap for modern AI workloads (FlashAttention / ROCm) compared to Proxmox and bare-metal on identical hardware, forcing me to run Proxmox for this node. If this is a config/feature gap (IOMMU / PASID / ATS / XNACK, IRQ provisioning, etc.), I’m happy to validate fixes or test previews. If it’s a known limitation, documenting it will save others time.

      Thanks — happy to provide logs, repros, and run more tests. If anyone has a working recipe for MI100 + ROCm + FlashAttention on XCP-ng, please share.

      • TeddyAstie (Vates 🪐 XCP-ng Team, Xen Guru)

        @emuchogu-0 said in 🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal:

        xl dmesg (dom0) and dom0 dmesg
        Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
        Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
        rocminfo from the guest
        Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)

        Of course you need to provide this information; we can't blindly guess where something is failing.

        • olivierlambert (Vates 🪐 Co-Founder & CEO)

          Hi,

          Small side note: lately I’ve noticed that some posts look like they were generated by LLMs. This can actually make it harder for the community to help, because the text is often long, unclear, or missing the basic details we really need to assist.

          I’d really encourage everyone to write posts in their own words and share as much relevant information as possible. The real value of this community is people helping each other directly 🙂
