@hani it began asking for a license after one day, but without throttling.
I have switched to AMD GPUs
@Dani Strange. Is it executable? Did you try to follow my instructions step by step to make the vGPU work?
Why is it not working?
Did you read my post about how to run Nvidia vGPU on XCP-ng?
https://xcp-ng.org/forum/post/55774
But in any case you will need an Nvidia license server.
That's a good sign.
On the host you can check which PCIe controller the drives are connected to with this command:
ls -al /sys/block/sd*
In the output, before the ataN part of each device path, you will find the PCIe address.
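For illustration only (example output from a hypothetical host, so your device names and PCIe addresses will differ):
lrwxrwxrwx 1 root root 0 Jan  1 00:00 /sys/block/sda -> ../devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0/block/sda
Here 0000:00:17.0, right before ata1, is the PCIe address of the controller that sda hangs off.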
Check if you passed through the correct SATA controller.
I think you have 2 SATA controllers because your MB has an M.2 connector.
Can you see the drives on the host if you disable passthrough?
Recently I had some trouble with OCuLink-to-NVMe cables.
That was a Supermicro server, and they have different cables for SATA/SAS and NVMe.
The difference is in the wire impedance: for SATA/SAS you should use cables with 100 Ohm impedance and for NVMe with 85 Ohm.
That is how the motherboard detects what type of drives are connected to the hybrid port.
This is just my idea of what is happening.
First you should check whether the host can see the drives without passthrough.
Live migration is not possible in this case because of the lack of free space.
You have to shut down the VM, then click this button and choose another host.
Hello!
Do all VMs on this host belong to you, and do you know for certain what processes are running on them?
I had the same issue with a Dell R630.
The solution was to update to the latest BIOS.
I think that some clients ran software that triggered a bug and the host rebooted.
XCP-ng security updates did not help.
In my case only a BIOS update fixed the sudden crashes.
So a workaround would be to move the VMs one by one to another host and check if that solves the problem.
I compiled gru on the test host and copied it to the host with the GPU, but it does not work.
When I try to start gru, an error occurs:
cat: /sys/bus/pci/drivers/gim/gpuinfo: No such file or directory
Can't detect a device
Detect device failed
I have tried to reset the GPU manually with the command
echo "1" > /sys/bus/pci/devices/0000\:85:00.0/reset
but it didn't work either.
modprobe -r gim
modprobe gim
xe-toolstack-restart
That didn't work either.
It seems that the VM hard shutdown blocked something in the pGPU (for example GPU RAM) and now it can't allocate a new memory area.
We need some tool to control AMD GPUs in XCP-ng.
Any ideas?
You can try to "attach" the broken VIF to some network and then delete it.
Try to use this command:
xe vif-move
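A minimal usage sketch, assuming you substitute your own VIF and target network UUIDs:
xe vif-move uuid=[vif_uuid] network-uuid=[network_uuid]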
If that doesn't work, you can shut down the XO VM, detach the disk, deploy a new VM without a disk, and attach the XO disk to the new VM.
Hello!
Is there any tool to control and monitor AMD GPUs?
I'm experimenting with a FirePro S7150x2 installed in a Supermicro AS-2024US-TRT.
I have installed XCP-ng 8.2.1 + AMD MxGPU extension.
XCP-ng with latest updates.
Everything was fine until today.
On previous days I could create and plug vGPUs into VMs and it worked.
There was only one issue - VMs got stuck on shutdown with an error.
In those situations I used hard shutdown.
I have been working on fixing this issue in my templates.
After a few days of experiments the vGPU stopped working.
I can create and attach a new vGPU, but when I try to start a VM with a vGPU, it hangs for about 60 seconds and then starts without problems. But in the OS I can't find the vGPU any more.
In dmesg I can see a lot of error messages from gim:
[692523.372640] <1>Uncorrectable error found 0xffffffff
[692523.372642] <1> Can't clear the error
[692523.372646] PF1 gim info:(check_base_addrs:1974) CP_MQD_BASE_ADDR = 0xffffffff:ffffffff
[692523.372660] gim warning:(dump_gpu_status:2029) mmGRBM_STATUS = 0xffffffff
[692523.372662] gim warning:(dump_gpu_status:2032) mmGRBM_STATUS2 = 0xffffffff
[692523.372663] gim warning:(dump_gpu_status:2035) mmSRBM_STATUS = 0xffffffff
[692523.372665] gim warning:(dump_gpu_status:2038) mmSRBM_STATUS2 = 0xffffffff
[692523.372667] gim warning:(dump_gpu_status:2041) mmSDMA0_STATUS_REG = 0xffffffff
[692523.372668] gim warning:(dump_gpu_status:2044) mmSDMA1_STATUS_REG = 0xffffffff
[692523.372670] gim warning:(dump_gpu_status:2056) GFX busy
[692523.372671] gim warning:(dump_gpu_status:2062) CP busy
[692523.372672] gim warning:(dump_gpu_status:2070) RLC busy
[692523.372674] gim warning:(dump_gpu_status:2074) RLC_STAT = 0xffffffff
[692523.372675] gim warning:(dump_gpu_status:2076) RLC busy processing a context switch
[692523.372677] gim warning:(dump_gpu_status:2078) RLC Graphics Power Management unit is busy
[692523.372677] gim warning:(dump_gpu_status:2080) RLC Streaming Performance Monitor block is busy
[692523.372679] gim warning:(dump_gpu_status:2085) RLC_GPM_STAT = 0xffffffff - RLC GPM module is busy
[692523.372680] gim warning:(dump_gpu_status:2092) CP busy
[692523.372681] gim warning:(dump_gpu_status:2102) SDMA busy
[692523.372682] gim warning:(dump_gpu_status:2108) SDMA1 busy
[692523.372683] gim warning:(dump_gpu_status:2114) XDMA busy
[692523.372686] gim warning:(dump_gpu_status:2138) DRM busy
[692523.372687] gim warning:(dump_gpu_status:2146) SEM busy
[692523.372688] gim warning:(dump_gpu_status:2159) GRBM busy
[692523.372688] gim warning:(dump_gpu_status:2172) VMC busy
[692523.372690] gim warning:(dump_gpu_status:2185) CP_CPF_STATUS = 0xffffffff
[692523.372691] gim warning:(dump_gpu_status:2188) The write pointer has been updated and the initiated work is still being processed by the GFX pipe
[692523.372692] gim warning:(dump_gpu_status:2192) The HQD is busy for any of the following reasons: sending a message, fetching data, or reorder queues not empty
[692523.372693] gim warning:(dump_gpu_status:2196) The Compute part of CPF is Busy.
[692523.372695] PF1 gim info:(check_ME_CNTL:1945) CP_ME_CNTL = 0xffffffff GPU dump
[692523.372696] gim error:(check_ME_CNTL:1948) ME HALTED!
[692523.372701] gim error:(check_ME_CNTL:1952) PFP HALTED!
[692523.372706] gim error:(check_ME_CNTL:1956) CE HALTED!
[692523.372711] gim warning:(dump_gpu_status:2203) CP_CPF_BUSY_STAT = 0xffffffff
[692523.372712] gim warning:(dump_gpu_status:2206) The HQD has a pending Wait semaphore
[692523.372713] gim warning:(dump_gpu_status:2209) **** dump gpu status end
[692523.372714] gim error:(switch_to_pf:2665) Failed to LOAD PF
[692523.372721] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - HDP_NONSURFACE_BASE
[692523.372723] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_LOCATION
[692523.372725] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_FB_OFFSET
[692523.372727] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_HI
[692523.372729] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_LO
[692523.372730] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_SYSTEM_APERTURE_DEF
[692523.372732] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_VM_MX_L1_TLB_CNTL
[692523.372734] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - RLC_GPU_IOV_ACTIVE_FCN_ID
[692523.372736] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SMU_ACTIVE_FCN_ID
[692523.372739] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - IH_ACTIVE_FCN_ID
[692523.372740] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - MC_SHARED_ACTIVE_FCN_ID
[692523.372743] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA0_ACTIVE_FCN_ID
[692523.372744] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SDMA1_ACTIVE_FCN_ID
[692523.372746] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - SEM_ACTIVE_FCN_ID
[692523.372748] PF1 gim info:(dump_pf_vm_regs:207) 0xffffffff - VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDRESS
[692523.372750] gim warning:(clear_vf_fb:3357) Check out switch_vfs returning -1. This is unexpected
[692523.372751] gim warning:(free_vf:3618) Clear of VF1-0 FB failed
[692523.372753] VF1-0 gim info:(free_vf:3661) VF1-0 is in the Undefined state while trying to FREE it
[692523.372754] VF1-0 gim info:(free_vf:3666) VF1-0 can be freed from the Undefined state
[692523.372761] gim warning:(free_vf:3701) PF is not present at the end of VF_FREE
[692525.015347] pciback 0000:85:02.0: timed out waiting for pending transaction; performing function level reset anyway
[692526.263277] pciback 0000:85:02.0: not ready 1023ms after FLR; waiting
[692527.319289] pciback 0000:85:02.0: not ready 2047ms after FLR; waiting
[692529.399246] pciback 0000:85:02.0: not ready 4095ms after FLR; waiting
[692533.751187] pciback 0000:85:02.0: not ready 8191ms after FLR; waiting
[692542.199002] pciback 0000:85:02.0: not ready 16383ms after FLR; waiting
[692559.350673] pciback 0000:85:02.0: not ready 32767ms after FLR; waiting
[692594.166251] pciback 0000:85:02.0: not ready 65535ms after FLR; giving up
Maybe the VM hard shutdown "killed" it?
The question is: is it possible to fix it without a reboot, or is there any other way to fix it?
For NVIDIA GPUs there is a CLI tool, nvidia-smi.
You can "reset" an NVIDIA GPU with this tool.
On the Internet, I found that there is a tool for AMD GPUs as well.
It is called the GRU tool and can be built from source: https://github.com/GPUOpen-LibrariesAndSDKs/MxGPU-Virtualization/tree/master/utils/gru
Is it possible to install it from official xcp-ng repo?
Or how do I build it correctly for xcp-ng?
Toolstack restart did not fix it.
I did the same trick with rpm -qa to join a new host to an existing pool that is not up to date.
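A minimal sketch of that comparison, assuming the idea is simply to diff the installed package lists between the pool master and the new host (the file paths are just examples):
rpm -qa | sort > /root/pool-master-packages.txt   # run on the pool master
rpm -qa | sort > /root/new-host-packages.txt      # run on the new host
diff /root/pool-master-packages.txt /root/new-host-packages.txt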
If nothing helps and you can't import the xva with Import VM, I have a solution.
1. On an Ubuntu machine, install gcc-7:
add-apt-repository ppa:ubuntu-toolchain-r/test && apt-get update && apt-get install -y gcc-7
2. Get the xva-img sources.
3. cd to the xva-img dir and run the next commands:
cmake ./
make install
4. Extract the xva file (create the my-virtual-machine directory first, since tar -C will not create it):
tar -xf my-virtual-machine.xva -C my-virtual-machine
chmod -R 755 my-virtual-machine
5. In my-virtual-machine you will find some directory like Ref:1, but maybe with another number. Remember this number.
6. Export the raw disk from the extracted files. Replace 1 with the number from the previous step:
xva-img -p disk-export my-virtual-machine/Ref\:1/ disk.raw
7. Install qemu-utils:
apt install qemu-utils
8. Convert the raw image to vhd:
qemu-img convert -f raw -O vpc disk.raw [vhd-name-you-like].vhd
9. Copy the vhd to some SR. I'm using local SRs, so in my case the path is /var/run/sr-mount/[sr-uuid]
10. cd to the SR folder, get the VHD uuid, and rename the VHD:
cd /var/run/sr-mount/[sr-uuid]
vhd-util read -p -n [vhd-name-you-like_from-step-8].vhd
mv [vhd-name-you-like_from-step-8].vhd [uuid].vhd
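After that, you can rescan the SR so the renamed VHD shows up, assuming [sr-uuid] is the SR you copied the VHD to:
xe sr-scan uuid=[sr-uuid]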
@damjank it seems you can use Dynamic Memory Control (DMC) or RAM ballooning to hot-plug more RAM into a VM.
It was deprecated in XenServer around 2020.
I haven't used it for a long time.
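A minimal sketch with xe, assuming DMC is still available in your version and [vm_uuid] is the VM in question (the memory values are just examples):
xe vm-memory-dynamic-range-set uuid=[vm_uuid] min=4GiB max=8GiB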
You can use 7-Zip to open it and extract only the necessary files.
You can check the task list on the XCP-ng host with this command:
xe task-list status=pending
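If a pending task is stuck, you can try to cancel it, where [task_uuid] is a UUID from the output above:
xe task-cancel uuid=[task_uuid]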
@goreandor if it is a real VHD compatible with Xen, then you should be able to read its metadata with this command:
vhd-util read -p -n [vhd-name].vhd
You will see the UUID of this VHD.
Then you should rename it to this UUID and rescan the SR.
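For example, assuming the VHD sits in a local SR mount and [uuid] is the value reported by vhd-util (all names here are placeholders):
mv [vhd-name].vhd [uuid].vhd
xe sr-scan uuid=[sr-uuid]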
In the XOA GUI you will see a VHD without a name.
Give it a name and attach it to a VM.
If memtest does not find any errors, try to replug the SATA cable on both ends or replace it.
Sometimes such a simple trick helps.
xe vif-unplug uuid=[vif_uuid]
or use XenCenter: on the Network tab, press the Deactivate button.
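If you need to find the VIF UUID first, assuming [vm_uuid] is the VM in question:
xe vif-list vm-uuid=[vm_uuid]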