Issue with SR-IOV mxGPU after changing CPU
-
I have two XCP-NG servers, each with 2 AMD FirePro S7150x2 cards (2 GPUs per card), and have been happily using vGPU for a few months now. Recently I changed the CPU in one of them so that I could create a pool for easier management, and now for some reason I can only use 1 GPU on each of the cards in the server. Whenever I try to boot a VM using more than those 2, I get the following error.
INTERNAL_ERROR(xenopsd internal error: Cannot_add(0000:0d:02.0, Device_common.QMP_Error(22, "{\"error\":{\"class\":\"GenericError\",\"desc\":\"Mapping machine irq 0 to pirq -1 failed: Operation not permitted\",\"data\":{}},\"id\":\"qmp-000019-22\"}")))
So far I have tried unassigning the PCI devices and restarting the toolstack, but there is no difference. I do not think I will be able to restart the hosts until the weekend.
Has anyone who has worked with vGPU in XCP-NG seen this before? Thanks!
-
@spunky_surveyor Random question, if you put the old CPU back, does it suddenly work?
-
@thenorthernlight I have not tried that yet, but I was able to do a host reboot and now the host doesn't recognize one of the cards. I have a hunch that the GPU or its riser card was not properly seated after the service. I will update again after reseating the risers and GPUs to see if that works. If not, I will try putting the old ones back in.
-
@spunky_surveyor Definitely sounds like an install issue since other items are being affected. Don't forget to check your BIOS for voltage settings. Wrong voltage settings cause ALL SORTS of problems if you don't do a reset and re-learn in your BIOS. I don't know what hardware you have, but most modern Dells have an option to re-run performance testing on boot when hardware changes. This fixes issues like wrong voltage settings, etc.
-
@thenorthernlight So I was able to open up the host and remove all PCIe cards and riser cards. At a glance everything appeared to have made proper contact; however, after reseating the cards and booting again, sure enough, it appears to work again. I did not have to run any testing in the BIOS, although the server does do a pre-boot inventory check each time for changes. The host is an HPE ProLiant DL380 G9.
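For anyone hitting the same thing: a quick sanity check after a reboot is to confirm that the host still enumerates every GPU on the PCI bus. Below is a minimal Python sketch of that check, not something taken from XCP-NG itself. It assumes `lspci` is available on the host and parses the text of `lspci -Dnn`, keeping devices with the AMD/ATI vendor ID (1002) and a display-class code (03xx). The function names are illustrative only. Note that with SR-IOV enabled the virtual functions may also show up in this list, so compare the addresses against what you expect rather than relying on the count alone.

```python
import re
import subprocess

# Matches lines of `lspci -Dnn` output, which look roughly like:
#   0000:0d:00.0 <class name> [0300]: <device name> [1002:xxxx]
LINE_RE = re.compile(
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r".*?\[(?P<cls>[0-9a-f]{4})\]:.*\[(?P<vendor>[0-9a-f]{4}):[0-9a-f]{4}\]",
    re.IGNORECASE,
)

AMD_VENDOR_ID = "1002"  # PCI vendor ID for AMD/ATI


def amd_display_devices(lspci_text):
    """Return PCI addresses of AMD display-class devices found in lspci text."""
    found = []
    for line in lspci_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        is_amd = m.group("vendor").lower() == AMD_VENDOR_ID
        is_display = m.group("cls").startswith("03")  # 03xx = display controller
        if is_amd and is_display:
            found.append(m.group("addr"))
    return found


def read_lspci():
    """Run `lspci -Dnn` (full domain addresses, numeric IDs) and return its output."""
    result = subprocess.run(
        ["lspci", "-Dnn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the host, `amd_display_devices(read_lspci())` gives the list of addresses to compare against the cards that are physically installed. If an address such as the one in the error message is missing after a hardware change, that points toward a seating or riser problem rather than a toolstack one.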
-