XCP-ng

    How to control AMD GPU?

    Development
    • splastunov

      Hello!

      Is there any tool to control and monitor AMD GPUs?

      I'm experimenting with a FirePro S7150x2 installed in a Supermicro AS -2024US-TRT.

      I have installed XCP-ng 8.2.1 with the latest updates, plus the AMD MxGPU extension.
      Everything was fine until today.

      On previous days I could create vGPUs and attach them to VMs, and it worked.
      There was only one issue: VMs got stuck on shutdown with an error (screenshot: image_2023-04-10_19-46-39.png).
      In these situations I used a hard shutdown.
      I have been working on fixing this issue in my templates.

      After a few days of experiments the vGPU stopped working.
      I can create and attach a new vGPU, but when I try to start a VM with a vGPU, it hangs for about 60 seconds and then starts without problems. However, I can no longer find the vGPU in the guest OS.

      In dmesg I can see a lot of error messages from gim:

      [692523.372640] <1>Uncorrectable error found 0xffffffff
      [692523.372642] <1> Can't clear the error
      [692523.372646] PF1    gim info:(check_base_addrs:1974) CP_MQD_BASE_ADDR = 0xffffffff:ffffffff
      [692523.372660]        gim warning:(dump_gpu_status:2029)  mmGRBM_STATUS = 0xffffffff
      [692523.372662]        gim warning:(dump_gpu_status:2032)  mmGRBM_STATUS2 = 0xffffffff
      [692523.372663]        gim warning:(dump_gpu_status:2035)  mmSRBM_STATUS = 0xffffffff
      [692523.372665]        gim warning:(dump_gpu_status:2038)  mmSRBM_STATUS2 = 0xffffffff
      [692523.372667]        gim warning:(dump_gpu_status:2041)  mmSDMA0_STATUS_REG = 0xffffffff
      [692523.372668]        gim warning:(dump_gpu_status:2044)  mmSDMA1_STATUS_REG = 0xffffffff
      [692523.372670]        gim warning:(dump_gpu_status:2056) GFX busy
      [692523.372671]        gim warning:(dump_gpu_status:2062) CP busy
      [692523.372672]        gim warning:(dump_gpu_status:2070) RLC busy
      [692523.372674]        gim warning:(dump_gpu_status:2074) RLC_STAT = 0xffffffff
      [692523.372675]        gim warning:(dump_gpu_status:2076)     RLC busy processing a context switch
      [692523.372677]        gim warning:(dump_gpu_status:2078)     RLC Graphics Power Management unit is busy
      [692523.372677]        gim warning:(dump_gpu_status:2080)     RLC Streaming Performance Monitor block is busy
      [692523.372679]        gim warning:(dump_gpu_status:2085) RLC_GPM_STAT = 0xffffffff - RLC GPM module is busy
      [692523.372680]        gim warning:(dump_gpu_status:2092) CP busy
      [692523.372681]        gim warning:(dump_gpu_status:2102) SDMA busy
      [692523.372682]        gim warning:(dump_gpu_status:2108) SDMA1 busy
      [692523.372683]        gim warning:(dump_gpu_status:2114) XDMA busy
      [692523.372686]        gim warning:(dump_gpu_status:2138) DRM busy
      [692523.372687]        gim warning:(dump_gpu_status:2146) SEM busy
      [692523.372688]        gim warning:(dump_gpu_status:2159) GRBM busy
      [692523.372688]        gim warning:(dump_gpu_status:2172) VMC busy
      [692523.372690]        gim warning:(dump_gpu_status:2185) CP_CPF_STATUS = 0xffffffff
      [692523.372691]        gim warning:(dump_gpu_status:2188)     The write pointer has been updated and the initiated work is still being processed by the GFX pipe
      [692523.372692]        gim warning:(dump_gpu_status:2192)     The HQD is busy for any of the following reasons: sending a message, fetching data, or reorder queues not empty
      [692523.372693]        gim warning:(dump_gpu_status:2196)     The Compute part of CPF is Busy.
      [692523.372695] PF1    gim info:(check_ME_CNTL:1945) CP_ME_CNTL = 0xffffffff GPU dump
      [692523.372696]        gim error:(check_ME_CNTL:1948)   ME HALTED!
      [692523.372701]        gim error:(check_ME_CNTL:1952)   PFP HALTED!
      [692523.372706]        gim error:(check_ME_CNTL:1956)   CE HALTED!
      [692523.372711]        gim warning:(dump_gpu_status:2203) CP_CPF_BUSY_STAT = 0xffffffff
      [692523.372712]        gim warning:(dump_gpu_status:2206)     The HQD has a pending Wait semaphore
      [692523.372713]        gim warning:(dump_gpu_status:2209) **** dump gpu status end
      [692523.372714]        gim error:(switch_to_pf:2665) Failed to LOAD PF
      [692523.372721] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - HDP_NONSURFACE_BASE
      [692523.372723] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_LOCATION
      [692523.372725] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_OFFSET
      [692523.372727] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_HI
      [692523.372729] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_LO
      [692523.372730] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_DEF
      [692523.372732] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_MX_L1_TLB_CNTL
      [692523.372734] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - RLC_GPU_IOV_ACTIVE_FCN_ID
      [692523.372736] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - SMU_ACTIVE_FCN_ID
      [692523.372739] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - IH_ACTIVE_FCN_ID
      [692523.372740] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_SHARED_ACTIVE_FCN_ID
      [692523.372743] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA0_ACTIVE_FCN_ID
      [692523.372744] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA1_ACTIVE_FCN_ID
      [692523.372746] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - SEM_ACTIVE_FCN_ID
      [692523.372748] PF1    gim info:(dump_pf_vm_regs:207) 0xffffffff - VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDRESS
      [692523.372750]        gim warning:(clear_vf_fb:3357) Check out switch_vfs returning -1.  This is unexpected
      [692523.372751]        gim warning:(free_vf:3618) Clear of VF1-0 FB failed
      [692523.372753] VF1-0  gim info:(free_vf:3661) VF1-0 is in the Undefined state while trying to FREE it
      [692523.372754] VF1-0  gim info:(free_vf:3666) VF1-0 can be freed from the Undefined state
      [692523.372761]        gim warning:(free_vf:3701) PF is not present at the end of VF_FREE
      [692525.015347] pciback 0000:85:02.0: timed out waiting for pending transaction; performing function level reset anyway
      [692526.263277] pciback 0000:85:02.0: not ready 1023ms after FLR; waiting
      [692527.319289] pciback 0000:85:02.0: not ready 2047ms after FLR; waiting
      [692529.399246] pciback 0000:85:02.0: not ready 4095ms after FLR; waiting
      [692533.751187] pciback 0000:85:02.0: not ready 8191ms after FLR; waiting
      [692542.199002] pciback 0000:85:02.0: not ready 16383ms after FLR; waiting
      [692559.350673] pciback 0000:85:02.0: not ready 32767ms after FLR; waiting
      [692594.166251] pciback 0000:85:02.0: not ready 65535ms after FLR; giving up
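
A note on reading the dump above: a PCI read from a device that no longer responds returns all ones, so a log in which every register reads 0xffffffff likely points at the device having dropped off the bus, not at each block really being busy. The following is a minimal sketch for counting such lines in saved kernel log text. The function name `count_all_ones_reads` is hypothetical, and the pattern is an assumption based only on the gim line format shown in this thread.

```shell
# Hypothetical helper: count gim log lines that report an all-ones value.
# Reads kernel log text on stdin and prints the number of matching lines.
count_all_ones_reads() {
    # grep -c exits non-zero when nothing matches; the count (0) is still printed.
    grep -c 'gim .*0xffffffff' || true
}
```

On the host this could be used as `dmesg | count_all_ones_reads`. A large count right after a failed VM start suggests checking the hardware path before looking at driver state.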
      

      Maybe the VM hard shutdown "killed" it?

      The question is: can this be fixed without a host reboot, or is there any other way to fix it?

      For NVIDIA GPUs there is a CLI tool, nvidia-smi.
      You can "reset" an NVIDIA GPU with this tool.

      On the Internet, I found that there is a tool for AMD GPUs as well.
      It is called GRU and can be built from source: https://github.com/GPUOpen-LibrariesAndSDKs/MxGPU-Virtualization/tree/master/utils/gru
      Is it possible to install it from the official XCP-ng repo?
      Or how can I build it correctly for XCP-ng?

      A toolstack restart did not fix it.

      • splastunov

        I compiled gru on the test host and copied it to the host with the GPU, but it does not work.
        When I try to start gru, I get this error:

        cat: /sys/bus/pci/drivers/gim/gpuinfo: No such file or directory
                Can't detect a device
        
                Detect device failed
        

        I have tried to reset the GPU manually with the command echo "1" > /sys/bus/pci/devices/0000\:85:00.0/reset, but that didn't work either.

        modprobe -r gim
        modprobe gim
        xe-toolstack-restart
        

        Reloading the gim module and restarting the toolstack (the commands above) didn't work either.
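
Before trying further resets, it may be worth checking whether the card still answers on the bus at all, since a reset cannot help a device that is no longer reachable. The sketch below assumes `setpci` from pciutils is available in dom0 and reuses the bus address 85:00.0 from the reset command above. The function name `vendor_id_is_dead` is hypothetical. A device that does not answer a config-space read returns ffff in the vendor ID field.

```shell
# Hypothetical helper: succeed (exit 0) when a vendor ID string indicates that
# the device did not answer the config-space read (all ones) or gave no answer.
vendor_id_is_dead() {
    case "$1" in
        ""|ffff|FFFF) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here): read the live vendor ID of the GPU at 85:00.0.
# if vendor_id_is_dead "$(setpci -s 85:00.0 VENDOR_ID 2>/dev/null)"; then
#     echo "85:00.0 does not respond; a sysfs reset is unlikely to help"
# fi
```

The helper takes the ID as an argument instead of calling `setpci` itself, so it can be tried with sample values on a machine without the GPU.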

        It seems that the VM hard shutdown left something locked in the pGPU (for example GPU RAM), and now it can't allocate a new area of memory.
        We need some tool to control AMD GPUs in XCP-ng.

        Any ideas?

        • SudoOracle

          I am having the same issue with my setup, except that I never had the GPU working in the guest. My guests take about 60 s to start and to shut down, and there are lots of gim errors on my host. This is a fresh install on my server.

          I finally got it working (kind of). It only works if my XCP-ng server has just been rebooted and it's the first time I'm starting the VM that has the GPU attached (so the gim module is not loaded yet and hasn't virtualized my video card). The VM starts right up, there are no errors in dmesg, and I see the card in my guest. However, if I restart that VM, it takes ~60 seconds to come back up, I get those errors in dmesg, and the card no longer shows up in the VM. This definitely seems like an issue with getting it to work after gim has been loaded. I do get some errors after it has successfully loaded, though. lspci still shows the card even after the last error in dmesg. I'm still not quite sure what is going on here.

          [  412.511208] VF1-0  gim info:(handle_rel_gpu_init_access:1373) Using the IOCTL/sysfs interface
          [  412.511215]        gim info:(gim_sysfs_send:113) Sending "BA,U0x5320/0x28,U0x540c/4,U0x5480/4" message for VF1-0 to QEMU pid 6533
          [  412.511338]        gim info:(gim_sys_read:295) gim_sys_read() - BA,U0x5320/0x28,U0x540c/4,U0x5480/4 in command buffer for VF1-0
          [  412.511343]        gim info:(gim_sys_read:297) gim_sys_read, ret = 35
          [  412.525023]        gim info:(print_gim_version:62) GPU IOV MODULE (GIM) - version 2.00.0000
          [  412.525025]        gim info:(gim_ioctl:450) GIM_IOCTL_MMIO_IS_BLOCKED
          [  412.525027]        gim info:(gim_ioctl:454) MMIO is BLOCKED for VF1-0
          [  412.530507] VF1-0  gim info:(wait_for_mmio_blocked:260) block_mmio() took 0.19282868 seconds to complete
          [  412.530510] VF1-0  gim info:(wait_for_mmio_blocked:285) QEMU successfully blocked MMIO in 0.19282868 seconds
          [  412.530512] VF1-0  gim info:(handle_rel_gpu_init_access:1395) VF0 is indicated as the current running vf
          [  412.530515] PF1    gim info:(resume_scheduler:381) Restart the Scheduler for 500 msec
          [  412.530518] PF1    gim info:(mailbox_work_handler:4251) mailbox_work_handler completed
          [  648.371177]        gim error:(wait_cmd_complete:2387)  wait_cmd_complete -- time out after 0.003006804 sec
          [  648.371219]        gim error:(wait_cmd_complete:2390)   Cmd = 0xff, Status = 0xff, cmd_Complete=0
          [  648.371239] Current function =
          [  648.371242] PF1    gim warning:(dump_function_state:248) VF0
          [  648.371243] PF1    gim warning:(dump_function_state:254) Last known states:
          [  648.371246] PF1    gim warning:(dump_function_state:255) PF = Save
          [  648.371249] VF1-0  gim warning:(dump_function_state:259) Run      , Marked as Runable
          [  648.371251]        gim warning:(dump_gpu_status:1987) **** dump gpu status begin for Adapter 135:00.00
          [  648.371254] <1>Uncorrectable error found 0xffffffff
          [  648.371259] <1> Can't clear the error
          [  648.371263] VF1-0  gim info:(check_base_addrs:1974) CP_MQD_BASE_ADDR = 0xffffffff:ffffffff
          [  648.371269]        gim warning:(dump_gpu_status:2029)  mmGRBM_STATUS = 0xffffffff
          [  648.371271]        gim warning:(dump_gpu_status:2032)  mmGRBM_STATUS2 = 0xffffffff
          [  648.371273]        gim warning:(dump_gpu_status:2035)  mmSRBM_STATUS = 0xffffffff
          [  648.371275]        gim warning:(dump_gpu_status:2038)  mmSRBM_STATUS2 = 0xffffffff
          [  648.371277]        gim warning:(dump_gpu_status:2041)  mmSDMA0_STATUS_REG = 0xffffffff
          [  648.371279]        gim warning:(dump_gpu_status:2044)  mmSDMA1_STATUS_REG = 0xffffffff
          [  648.371280]        gim warning:(dump_gpu_status:2056) GFX busy
          [  648.371282]        gim warning:(dump_gpu_status:2062) CP busy
          [  648.371283]        gim warning:(dump_gpu_status:2070) RLC busy
          [  648.371285]        gim warning:(dump_gpu_status:2074) RLC_STAT = 0xffffffff
          [  648.371286]        gim warning:(dump_gpu_status:2076)     RLC busy processing a context switch
          [  648.371288]        gim warning:(dump_gpu_status:2078)     RLC Graphics Power Management unit is busy
          [  648.371289]        gim warning:(dump_gpu_status:2080)     RLC Streaming Performance Monitor block is busy
          [  648.371291]        gim warning:(dump_gpu_status:2085) RLC_GPM_STAT = 0xffffffff - RLC GPM module is busy
          [  648.371292]        gim warning:(dump_gpu_status:2092) CP busy
          [  648.371294]        gim warning:(dump_gpu_status:2102) SDMA busy
          [  648.371295]        gim warning:(dump_gpu_status:2108) SDMA1 busy
          [  648.371296]        gim warning:(dump_gpu_status:2114) XDMA busy
          [  648.371298]        gim warning:(dump_gpu_status:2138) DRM busy
          [  648.371299]        gim warning:(dump_gpu_status:2146) SEM busy
          [  648.371300]        gim warning:(dump_gpu_status:2159) GRBM busy
          [  648.371302]        gim warning:(dump_gpu_status:2172) VMC busy
          [  648.371303]        gim warning:(dump_gpu_status:2185) CP_CPF_STATUS = 0xffffffff
          [  648.371305]        gim warning:(dump_gpu_status:2188)     The write pointer has been updated and the initiated work is still being processed by the GFX pipe
          [  648.371306]        gim warning:(dump_gpu_status:2192)     The HQD is busy for any of the following reasons: sending a message, fetching data, or reorder queues not empty
          [  648.371307]        gim warning:(dump_gpu_status:2196)     The Compute part of CPF is Busy.
          [  648.371311] PF1    gim info:(check_ME_CNTL:1945) CP_ME_CNTL = 0xffffffff GPU dump
          [  648.371312]        gim error:(check_ME_CNTL:1948)   ME HALTED!
          [  648.371326]        gim error:(check_ME_CNTL:1952)   PFP HALTED!
          [  648.371339]        gim error:(check_ME_CNTL:1956)   CE HALTED!
          [  648.371352]        gim warning:(dump_gpu_status:2203) CP_CPF_BUSY_STAT = 0xffffffff
          [  648.371354]        gim warning:(dump_gpu_status:2206)     The HQD has a pending Wait semaphore
          [  648.371355]        gim warning:(dump_gpu_status:2209) **** dump gpu status end
          [  648.371359] VF1-0  gim info:(idle_vf:2566) Wait returned rc 1 for VF1-0. Idle latency = 0.000000. Idle count = 0, run count = 0
          [  648.371361]        gim warning:(switch_away_from_function:2831) Failed to idle VF0
          [  648.371363]        gim warning:(schedule_next_function:3802) Failed to switch from VF0 to VF0
          [  648.371367]        gim warning:(world_switch:4042) schedule VF0 to VF0 failed, failure reason is FLR_REASON_FAILED_IDLE (3), try to reset
          [  648.371369]        gim warning:(world_switch:4045) VF1-0.runable = 1, VF1-0.runable = 1
          [  648.371386]        gim error:(gim_sched_reset:5230) Cannot proceed with FLR as GPU has disappeared from the bus
          [  648.371406]        gim error:(world_switch:4059) FLR failed, quick exit out of scheduler.  Make sure scheduler is paused
          
          

          EDIT:
          It looks like it doesn't stay working in the guest. I went to do a GPU test and it doesn't show the card anymore. Ugh this is frustrating.

          • splastunov

            In my case the problem was a bad connection.
            I reassembled the server and cleaned the PCIe contacts of the GPU, and now it is stable.

            But it would be nice to have some tool to control and monitor AMD GPUs.

            • splastunov

              Some time later I faced a problem where VMs (Linux and Windows) couldn't start the GPU (AMD MxGPU) correctly.
              In Windows Device Manager there was error #43.
              I solved this without a host reboot by reloading the gim module.

              Hope this helps somebody else.

              rmmod gim gim_api
              modprobe gim gim_api
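
The reload above can be wrapped in a small function with a dry-run mode, which makes it easier to review what will run on a production host. This is only a sketch: the function name `reload_modules` and the `DRY_RUN` variable are hypothetical, and loading in the reverse of the unload order is an assumption about module dependencies. Note that `modprobe` treats any arguments after the first module name as parameters for that module, so the sketch loads one module per call.

```shell
# Hypothetical helper: unload the given modules in the order listed, then load
# them again in reverse order. With DRY_RUN=1 the commands are only printed.
reload_modules() {
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    reversed=""
    for m in "$@"; do
        $run rmmod "$m" || return 1
        reversed="$m $reversed"   # build the reverse order for loading
    done
    for m in $reversed; do
        $run modprobe "$m" || return 1
    done
}
```

With `DRY_RUN=1 reload_modules gim gim_api` the function only prints the four commands. Without `DRY_RUN`, they are executed and need to be run as root in dom0.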
              