Hello!
Is there any tool to control and monitor AMD GPUs?
I'm experimenting with a FirePro S7150x2 installed in a Supermicro AS-2024US-TRT.
I have installed XCP-ng 8.2.1 + AMD MxGPU extension.
XCP-ng with latest updates.
Everything was fine until today.
Over the previous days I could create vGPUs and attach them to VMs, and they worked.
There was only one issue: VMs got stuck on shutdown with an error.
In these situations I used a hard shutdown.
I have been working on fixing this issue in my templates.
After a few days of experiments, the vGPU stopped working.
I can create and attach a new vGPU, but when I try to start a VM with the vGPU, it hangs for about 60 seconds and then starts without problems. However, inside the guest OS I can no longer find the vGPU.
In dmesg I can see a lot of error messages from gim:
[692523.372640] <1>Uncorrectable error found 0xffffffff
[692523.372642] <1> Can't clear the error
[692523.372646] PF1 gim info:(check_base_addrs:1974) CP_MQD_BASE_ADDR = 0xffffffff:ffffffff
[692523.372660] gim warning:(dump_gpu_status:2029) mmGRBM_STATUS = 0xffffffff
[692523.372662] gim warning:(dump_gpu_status:2032) mmGRBM_STATUS2 = 0xffffffff
[692523.372663] gim warning:(dump_gpu_status:2035) mmSRBM_STATUS = 0xffffffff
[692523.372665] gim warning:(dump_gpu_status:2038) mmSRBM_STATUS2 = 0xffffffff
[692523.372667] gim warning:(dump_gpu_status:2041) mmSDMA0_STATUS_REG = 0xffffffff
[692523.372668] gim warning:(dump_gpu_status:2044) mmSDMA1_STATUS_REG = 0xffffffff
[692523.372670] gim warning:(dump_gpu_status:2056) GFX busy
[692523.372671] gim warning:(dump_gpu_status:2062) CP busy
[692523.372672] gim warning:(dump_gpu_status:2070) RLC busy
[692523.372674] gim warning:(dump_gpu_status:2074) RLC_STAT = 0xffffffff
[692523.372675] gim warning:(dump_gpu_status:2076) RLC busy processing a context switch
[692523.372677] gim warning:(dump_gpu_status:2078) RLC Graphics Power Management unit is busy
[692523.372677] gim warning:(dump_gpu_status:2080) RLC Streaming Performance Monitor block is busy
[692523.372679] gim warning:(dump_gpu_status:2085) RLC_GPM_STAT = 0xffffffff - RLC GPM module is busy
[692523.372680] gim warning:(dump_gpu_status:2092) CP busy
[692523.372681] gim warning:(dump_gpu_status:2102) SDMA busy
[692523.372682] gim warning:(dump_gpu_status:2108) SDMA1 busy
[692523.372683] gim warning:(dump_gpu_status:2114) XDMA busy
[692523.372686] gim warning:(dump_gpu_status:2138) DRM busy
[692523.372687] gim warning:(dump_gpu_status:2146) SEM busy
[692523.372688] gim warning:(dump_gpu_status:2159) GRBM busy
[692523.372688] gim warning:(dump_gpu_status:2172) VMC busy
[692523.372690] gim warning:(dump_gpu_status:2185) CP_CPF_STATUS = 0xffffffff
[692523.372691] gim warning:(dump_gpu_status:2188) The write pointer has been updated and the initiated work is still being processed by the GFX pipe
[692523.372692] gim warning:(dump_gpu_status:2192) The HQD is busy for any of the following reasons: sending a message, fetching data, or reorder queues not empty
[692523.372693] gim warning:(dump_gpu_status:2196) The Compute part of CPF is Busy.
[692523.372695] PF1 gim info:(check_ME_CNTL:1945) CP_ME_CNTL = 0xffffffff GPU dump
[692523.372696] gim error:(check_ME_CNTL:1948) ME HALTED!
[692523.372701] gim error:(check_ME_CNTL:1952) PFP HALTED!
[692523.372706] gim error:(check_ME_CNTL:1956) CE HALTED!
[692523.372711] gim warning:(dump_gpu_status:2203) CP_CPF_BUSY_STAT = 0xffffffff
[692523.372712] gim warning:(dump_gpu_status:2206) The HQD has a pending Wait semaphore
[692523.372713] gim warning:(dump_gpu_status:2209) **** dump gpu status end
[692523.372714] gim error:(switch_to_pf:2665) Failed to LOAD PF
[692523.372721] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - HDP_NONSURFACE_BASE
[692523.372723] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_LOCATION
[692523.372725] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_OFFSET
[692523.372727] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_HI
[692523.372729] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_LO
[692523.372730] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_DEF
[692523.372732] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_MX_L1_TLB_CNTL
[692523.372734] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - RLC_GPU_IOV_ACTIVE_FCN_ID
[692523.372736] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SMU_ACTIVE_FCN_ID
[692523.372739] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - IH_ACTIVE_FCN_ID
[692523.372740] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_SHARED_ACTIVE_FCN_ID
[692523.372743] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA0_ACTIVE_FCN_ID
[692523.372744] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA1_ACTIVE_FCN_ID
[692523.372746] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SEM_ACTIVE_FCN_ID
[692523.372748] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDRESS
[692523.372750] gim warning:(clear_vf_fb:3357) Check out switch_vfs returning -1. This is unexpected
[692523.372751] gim warning:(free_vf:3618) Clear of VF1-0 FB failed
[692523.372753] VF1-0 gim info:(free_vf:3661) VF1-0 is in the Undefined state while trying to FREE it
[692523.372754] VF1-0 gim info:(free_vf:3666) VF1-0 can be freed from the Undefined state
[692523.372761] gim warning:(free_vf:3701) PF is not present at the end of VF_FREE
[692525.015347] pciback 0000:85:02.0: timed out waiting for pending transaction; performing function level reset anyway
[692526.263277] pciback 0000:85:02.0: not ready 1023ms after FLR; waiting
[692527.319289] pciback 0000:85:02.0: not ready 2047ms after FLR; waiting
[692529.399246] pciback 0000:85:02.0: not ready 4095ms after FLR; waiting
[692533.751187] pciback 0000:85:02.0: not ready 8191ms after FLR; waiting
[692542.199002] pciback 0000:85:02.0: not ready 16383ms after FLR; waiting
[692559.350673] pciback 0000:85:02.0: not ready 32767ms after FLR; waiting
[692594.166251] pciback 0000:85:02.0: not ready 65535ms after FLR; giving up
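To make the log above easier to compare between attempts, here is a minimal sketch of a script that summarizes it. It counts gim messages by severity, lines containing a register value of 0xffffffff, and pciback FLR wait messages. The function name `summarize_gim_log` is hypothetical (not part of gim or XCP-ng), and the patterns are based only on the lines pasted above. As far as I understand, every register reading as all ones usually means the device is no longer answering reads at all, but that is my assumption.

```python
import re
from collections import Counter

# Kernel timestamp prefix, e.g. "[692523.372640] "
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# gim severity tags as they appear in the pasted log
GIM_LEVEL = re.compile(r"gim (error|warning|info):")


def summarize_gim_log(text):
    """Summarize dmesg text: gim severities, all-ones values, FLR waits."""
    levels = Counter()
    all_ones = 0
    flr_waits = 0
    for raw in text.splitlines():
        line = TIMESTAMP.sub("", raw)
        match = GIM_LEVEL.search(line)
        if match:
            levels[match.group(1)] += 1
        # A value of 0xffffffff on a line is counted once per line
        if "0xffffffff" in line:
            all_ones += 1
        # pciback retry messages after a function level reset
        if "pciback" in line and "after FLR" in line:
            flr_waits += 1
    return {"levels": dict(levels), "all_ones": all_ones, "flr_waits": flr_waits}
```

It can be fed the text of `dmesg` saved to a file, for example `summarize_gim_log(open("dmesg.txt").read())`, where `dmesg.txt` is just an example file name.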
Maybe the VM hard shutdown "killed" it?
The question is: is it possible to fix this without a reboot, or is there any other way to fix it?
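One thing that can be checked without rebooting is whether the device still answers PCI config-space reads. Below is a minimal sketch, assuming the standard Linux sysfs layout is available in dom0 (for example `/sys/bus/pci/devices/0000:85:02.0/config` for the function named in the pciback messages). The helper names are hypothetical. The first two bytes of config space hold the vendor ID, and a value of 0xffff conventionally means the read got no response.

```python
def vendor_id_from_config(config_path):
    """Return the 16-bit vendor ID stored in the first two config-space bytes."""
    with open(config_path, "rb") as handle:
        data = handle.read(2)
    if len(data) < 2:
        raise ValueError("config space read returned fewer than 2 bytes")
    # PCI config space is little-endian
    return int.from_bytes(data, "little")


def device_responds(config_path):
    """Return False when the vendor ID reads back as all ones (0xffff)."""
    return vendor_id_from_config(config_path) != 0xFFFF
```

For example, `device_responds("/sys/bus/pci/devices/0000:85:02.0/config")` returning False would suggest the function is not responding, which would match the all-ones register values in the log. This only diagnoses the state; it does not reset anything.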
For NVIDIA GPUs there is a CLI tool, nvidia-smi.
You can "reset" an NVIDIA GPU with this tool.
On the Internet, I found that there is a tool for AMD GPUs as well.
It is called the GRU tool and can be built from source: https://github.com/GPUOpen-LibrariesAndSDKs/MxGPU-Virtualization/tree/master/utils/gru
Is it possible to install it from the official XCP-ng repo?
Or, if not, how do I build it correctly for XCP-ng?
A toolstack restart did not fix it.