    🚨 AI on XCP‑ng 8.3: Not Ready for Prime Time? FlashAttention/ROCm Passthrough Stalls vs Proxmox & Bare‑Metal

      TL;DR:
      On the same Dell R720 with dual MI100s passed through, FlashAttention kernels in llama.cpp and ollama are rock-solid on Proxmox (KVM) and on bare-metal Ubuntu 24.04. On XCP-ng 8.3 VMs they are hit-or-miss or stall outright. Without FlashAttention some models run, others don’t; with FlashAttention the GPU often pegs at 100% but no tokens are produced. This forced me to run Proxmox for my inference host even though I’d prefer XCP-ng. Looking for guidance on IOMMU/PASID/ATS/XNACK support and recommended settings for AMD Instinct (MI100) under passthrough.


      Hardware & Setup (same physical box)
      • Server: Dell R720
      • GPUs: 2 x AMD Instinct MI100 (gfx908)
      • Use case: Local inference server for LLMs

      Behavior across environments (same hardware, different host)
      • Bare-metal Ubuntu 24.04 → llama.cpp and ollama with FlashAttention ON = ✅ Stable, fast
      • Proxmox (KVM) host → Ubuntu guest with FlashAttention ON = ✅ Stable, fast
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention OFF = ⚠️ Sometimes works, model dependent
      • XCP-ng 8.3 (Xen) → Guests with FlashAttention ON = ❌ GPU stuck at 100%, no tokens, stall

      All tests were done on the same hardware. Only the hypervisor/host changed.


      What I observe on XCP-ng 8.3
      • With FlashAttention enabled, GPU ramps to 100% then nothing — the decode loop never returns tokens.
      • Disabling FlashAttention lets some models complete, but reliability is inconsistent compared to Proxmox or bare-metal.
      • Same binaries, kernels, and models behave correctly on Proxmox and bare-metal.

      Why FlashAttention stresses the platform (short primer)

      FlashAttention fuses the attention pipeline into large tiled kernels that minimize HBM traffic by keeping working sets in on-chip memory. It never materializes the full N × N attention matrix (the streaming-softmax recurrence behind this is sketched after the list below), but it produces heavy, sustained compute and memory traffic and can amplify page-fault / on-demand paging (UVM-style) behavior. In practice, FlashAttention is a high-stress test for:

      • device ↔ IOMMU ↔ host page-fault plumbing,
      • PASID / ATS / PRI behavior,
      • XNACK (GPU page-fault retry) correctness and latency.
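
      To make the "high-stress" claim concrete: the tiling trick is a streaming ("online") softmax. For one query block Q, each key/value tile (K_j, V_j) updates a running row-max m, a normalizer ℓ, and an output accumulator O. In standard notation (mine, not lifted from any particular paper):

      $$
      \begin{aligned}
      S_j &= Q K_j^{\top} / \sqrt{d} \\
      m_j &= \max\!\big(m_{j-1},\ \operatorname{rowmax}(S_j)\big) \\
      \ell_j &= e^{\,m_{j-1}-m_j}\,\ell_{j-1} + \operatorname{rowsum}\!\big(e^{\,S_j - m_j}\big) \\
      O_j &= e^{\,m_{j-1}-m_j}\,O_{j-1} + e^{\,S_j - m_j}\,V_j
      \end{aligned}
      $$

      with m₀ = −∞, ℓ₀ = 0, O₀ = 0 and final output O_T / ℓ_T. Nothing N × N is ever stored, but every tile is a fresh burst through the device’s memory and fault path, which is why slow page-fault plumbing shows up here first.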

      Hypotheses: where the XCP-ng path may break down
      1. IOMMU / PASID / ATS / PRI under Xen passthrough.
        MI100 + ROCm rely on GPU page-fault retry (XNACK) and the PASID/ATS/PRI features. If those are missing, disabled, or serviced with high latency in a VM, large fused kernels can spin or stall: GPU busy, no forward progress. Proxmox/KVM may expose a friendlier path. (A quick-check script covering all four hypotheses follows this list.)

      2. XNACK mode mismatches.
        MI100 (gfx908) is sensitive to the XNACK mode. If userland/tooling (binaries / ROCm build) and the driver/kernel disagree about xnack+ vs xnack-, FlashAttention kernels can hang. Bare-metal/Proxmox have a known-good combo; XCP-ng guests may not.

      3. Event channels / IRQ vector limits.
        Xen guests get a finite pool of interrupt vectors. If amdgpu/ROCm requests more MSI/MSI-X vectors than are available, MSI-X enablement can fail, and a fallback to legacy IRQs can produce pathological behavior. The XCP-ng docs mention extra_guest_irqs as a knob to try; this is less likely to matter for an MI100 than for NVMe, but it is easy to rule out if logs show MSI/MSI-X failures.

      4. BAR sizing / MMIO quirks on older platforms.
        Dell R720 is PCIe Gen3-era hardware. If Xen/dom0 maps large MI100 BARs differently than KVM/bare-metal (windowing, segmentation), heavy kernels could hit suboptimal paths.
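
      Here is the quick-check script I mean, run from inside the guest. It is a minimal, hypothetical helper (the device-match string and the lspci capability spellings are assumptions from typical output, so adjust for your system):

      ```bash
      #!/usr/bin/env bash
      # quick-checks.sh (hypothetical name): inspect the guest-visible GPU
      # for the four hypotheses above.
      BDF=$(lspci -D | awk '/Instinct|MI100/ {print $1; exit}')
      echo "GPU at: ${BDF:?no MI100 found}"

      # 1) IOMMU / PASID / ATS / PRI: are the capabilities exposed, and enabled?
      sudo lspci -vvv -s "$BDF" | grep -E 'Address Translation|Page Request|Process Address Space|ATSCtl|PASIDCtl|PRICtl'

      # 2) XNACK: ISA string (xnack+ / xnack-) vs the kernel's retry-fault setting.
      rocminfo | grep -oE 'gfx908[^ ]*' | sort -u
      cat /sys/module/amdgpu/parameters/noretry   # -1 auto, 0 = retry (XNACK) on, 1 = off

      # 3) IRQ vectors: did MSI-X actually come up, and with how many vectors?
      sudo lspci -vvv -s "$BDF" | grep 'MSI'
      grep -c amdgpu /proc/interrupts

      # 4) BAR sizing: compare the mapped regions against the Proxmox guest.
      sudo lspci -vvv -s "$BDF" | grep 'Region'
      ```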


      What I’ve validated
      • Same models/builds of llama.cpp and ollama run with FlashAttention on Proxmox and bare-metal.
      • On XCP-ng 8.3, disabling FlashAttention lets some models run, but unreliably.
      • Enabling FlashAttention on XCP-ng consistently reproduces the GPU 100% / no tokens stall.

      Requests for XCP-ng/Xen team & community
      • Feature support: What is the current support status for PASID / PRI / ATS for PCIe devices exposed to guests? Any caveats for MI100 + ROCm under Xen passthrough? Recommended dom0 / Xen boot params for low-latency GPU page-fault handling?
      • Guidance on XNACK: Are there constraints for enabling XNACK in guests for gfx908? Best practices for aligning HSA_XNACK / guest kernel / amdgpu settings?
      • IRQ provisioning: Should we proactively increase extra_guest_irqs for GPU-heavy guests even if no MSI-X errors appear? (A sketch of the knob follows this list.) What log lines distinguish IRQ exhaustion from IOMMU / page-fault stalls?
      • Known-good recipe: If anyone has MI100 + FlashAttention stable on XCP-ng 8.3, please share Xen version, dom0 kernel, guest kernel, ROCm version, GPU firmware, and any special Xen/guest kernel params.
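
      On the IRQ point, this is the knob I understand the docs to mean. Hypothetical sketch: the tool path is from a stock XCP-ng dom0 and the value 64 is an arbitrary example, so please correct me if the syntax differs:

      ```bash
      # In dom0: query and extend the per-guest IRQ pool, then reboot the host.
      /opt/xensource/libexec/xen-cmdline --get-xen extra_guest_irqs
      /opt/xensource/libexec/xen-cmdline --set-xen extra_guest_irqs=64
      ```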

      Diagnostics I can provide
      • xl dmesg (dom0) and dom0 dmesg
      • Guest dmesg filtered for amdgpu, rocm, hsa, xnack, pasid, iommu, fault messages
      • Guest lspci -vv for the GPU (MSI/MSI-X state, BARs)
      • rocminfo from the guest
      • Minimal reproducer scripts for llama.cpp and ollama (FlashAttention on/off)
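
      To make that turnkey, I’d collect everything with something like the following (gather-diag.sh is just my name for it; the grep filter mirrors the list above):

      ```bash
      #!/usr/bin/env bash
      # gather-diag.sh (hypothetical): bundle the diagnostics above into a tarball.
      # Run once in dom0 and once in the guest; it picks the right half itself.
      set -u
      OUT="diag-$(hostname)-$(date +%Y%m%d-%H%M%S)"
      mkdir -p "$OUT"

      if command -v xl >/dev/null 2>&1; then   # dom0 half
          xl dmesg > "$OUT/xl-dmesg.txt"
          dmesg    > "$OUT/dom0-dmesg.txt"
      else                                     # guest half
          dmesg | grep -iE 'amdgpu|rocm|hsa|xnack|pasid|iommu|fault' \
              > "$OUT/guest-dmesg-filtered.txt"
          lspci -vv -d 1002: > "$OUT/gpu-lspci.txt"   # 1002: = AMD vendor ID
          rocminfo > "$OUT/rocminfo.txt" 2>&1
      fi
      tar czf "$OUT.tar.gz" "$OUT" && echo "wrote $OUT.tar.gz"
      ```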

      Repro steps (guest VM on XCP-ng 8.3)
      1. Pass through both MI100s to a Linux guest.
      2. Install ROCm stack that matches the working bare-metal/Proxmox setup.
      3. Build llama.cpp with FlashAttention enabled; install ollama with FlashAttention (a build/run sketch follows these steps).
      4. Run the same model + prompt that succeeds elsewhere.
        Result: GPU hits 100% and stalls (no tokens). Disable FlashAttention: some models run.
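
      For step 3, this is roughly the build and run I use. Flag spellings (GGML_HIP, AMDGPU_TARGETS, -fa, OLLAMA_FLASH_ATTENTION) follow the llama.cpp and ollama docs I built against; verify them against the versions you use:

      ```bash
      # Build llama.cpp with ROCm/HIP for gfx908 (MI100).
      git clone https://github.com/ggml-org/llama.cpp
      cd llama.cpp
      HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
          cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release
      cmake --build build -j

      # FlashAttention ON: on XCP-ng the GPU pegs at 100% and no tokens appear.
      ./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "Hello"
      # FlashAttention OFF (default): some models complete.
      ./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"

      # ollama: FlashAttention is toggled through the server environment.
      OLLAMA_FLASH_ATTENTION=1 ollama serve
      ```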

      Why this matters
      XCP-ng is a great platform and I’d prefer to consolidate inference there. Right now there’s a reliability gap for modern AI workloads (FlashAttention / ROCm) compared to Proxmox and bare-metal on identical hardware, forcing me to run Proxmox for this node. If this is a config/feature gap (IOMMU / PASID / ATS / XNACK, IRQ provisioning, etc.), I’m happy to validate fixes or test previews. If it’s a known limitation, documenting it will save others time.

      Thanks — happy to provide logs, repros, and run more tests. If anyone has a working recipe for MI100 + ROCm + FlashAttention on XCP-ng, please share.
