FirePro S7150x2 SR-IOV Errors
-
[ 1123.874023] AMD Virt GIM API
[ 1123.877477] gim: module license 'Proprietary' taints kernel.
[ 1123.877478] Disabling lock debugging due to kernel taint
[ 1123.880580] gim info:(gim_init:197) *******AMD GIM init
[ 1123.880582] gim info:(print_gim_version:62) GPU IOV MODULE (GIM) - version 2.00.0000
[ 1123.880582] gim info:(gim_init:200) Copyright (c) 2014-2016 AMD Corporation.
[ 1123.880822] gim info:(parse_config_file:295) AMD GIM fb_option = 0
[ 1123.880823] gim info:(parse_config_file:295) AMD GIM sched_option = 0
[ 1123.880824] gim info:(parse_config_file:295) AMD GIM vf_num = 0
[ 1123.880825] gim info:(parse_config_file:295) AMD GIM pf_fb = 0
[ 1123.880826] gim info:(parse_config_file:295) AMD GIM vf_fb = 0
[ 1123.880827] gim info:(parse_config_file:295) AMD GIM sched_interval = 7
[ 1123.880828] gim info:(parse_config_file:295) AMD GIM fb_clear = 1
[ 1123.880829] gim info:(parse_config_file:295) AMD GIM hang_detect_timeout = 100
[ 1123.880831] gim info:(parse_config_file:295) AMD GIM max_quanta = 1000
[ 1123.880832] gim info:(parse_config_file:295) AMD GIM self_switch = 500
[ 1123.880833] gim info:(parse_config_file:295) AMD GIM exclusive = 1600
[ 1123.880834] gim info:(parse_config_file:295) AMD GIM fair_scheduling = 0
[ 1123.880835] gim info:(parse_config_file:295) AMD GIM debug_level = 3
[ 1123.880837] gim info:(parse_config_file:295) AMD GIM clear_fb_on_flr = 0
[ 1123.880838] gim info:(parse_config_file:295) AMD GIM clear_fb_on_free_vf = 1
[ 1123.880839] gim info:(init_config:445) INIT CONFIG
[ 1123.890809] gim error:(gim_probe:123) gim_probe(06:00.0)
[ 1123.890822] gim info:(alloc_adapter:454) allocate adapter for PF 0x0600
[ 1123.890823] gim info:(alloc_adapter:457) Found free adapter at index 0
[ 1123.890829] PF0 gim info:(SetNewAdapter:1096) curr allocated at 00000000d08994a5
[ 1123.890830] PF0 gim info:(SetNewAdapter:1102) Can't disable ATS --> Not enabled in the first place
[ 1123.890831] PF0 gim info:(SetNewAdapter:1113) SRIOV is supported
[ 1123.890831] PF0 gim info:(SetNewAdapter:1121) found PCI bridge device
[ 1123.890832] PF0 gim info:(SetNewAdapter:1124) found: 05:8.0
[ 1123.890979] PF0 gim info:(SetNewAdapter:1147) mmio_base = 000000007da02c16
[ 1123.891597] PF0 gim info:(SetNewAdapter:1149) doorbell = 00000000c74e7207
[ 1123.981443] PF0 gim info:(SetNewAdapter:1151) pf.fb_va = 000000009a0d3e40
[ 1123.981469] gim info:(sriov_is_ari_enabled:180) PCI_SRIOV_CAP = 0x00000002
[ 1123.981470] gim info:(sriov_is_ari_enabled:190) PCI_SRIOV_CTRL = 0x00000010
[ 1123.981471] gim info:(sriov_is_ari_enabled:194) PCI_SRIOV_CTRL_ARI is set --> ARI is supported
[ 1123.981474] PF0 gim info:(program_ari_mode:957) Read bif_strap8 = 0x00200004
[ 1123.981474] PF0 gim info:(program_ari_mode:963) program_ari_mode - Set ARI_Mode = PF_BUS
[ 1123.981475] PF0 gim info:(program_ari_mode:978) Write bif_strap8 = 0x00000004
[ 1123.981476] PF0 gim info:(gim_read_rom_from_reg:634) Reading VBios from ROM
[ 1123.981594] PF0 gim info:(gim_read_VBIOS:695) VBIOS starts: 0x55, 0xaa
[ 1123.981595] PF0 gim info:(gim_read_VBIOS:698) VBios size is 0x10000
[ 1123.981601] PF0 gim info:(gim_read_VBIOS:708) pVBIOS allocated at 00000000e96a70db for size of 0x80000
[ 1123.981602] PF0 gim info:(gim_read_rom_from_reg:634) Reading VBios from ROM
[ 1125.088134] PF0 gim info:(gim_read_VBIOS:718) BIOS Version Major 0xF Minor 0x31
[ 1125.088210] PF0 gim info:(gim_read_VBIOS:729) VBios Checksum = 0x541c00
[ 1125.088211] PF0 gim info:(gim_read_VBIOS:738) Valid video BIOS image, size = 0x10000, check sum is 0x541c00
[ 1125.088212] PF0 gim info:(gim_read_VBIOS:739) Read in full Vbios image of size = 0x80000
[ 1125.088266] PF0 gim info:(gim_post_VBIOS:776) Init Parser passed!, continue
[ 1125.088269] <1>ATOM_CheckAsicStatus - BIOS_SCRATCH_7 = 0x00000000
[ 1125.088269] <1> Isolate ATOM_S7_ASIC_INIT_COMPLETE_MASK bit(s) = 0x00000000
[ 1125.088271] <1> RLC_CNTL = 0x00000000
[ 1125.088271] <1> Isolate RLC_CNTL__RLC_ENABLE_F32_MASK = 0x00000000
[ 1125.088272] <1>ATOM_ASIC_NEED_POST
[ 1125.088273] PF0 gim info:(gim_post_VBIOS:795) Asic needs a VBios post
[ 1125.088274] gim info:(ATOM_PostVBIOS:215) ATOM_PostVBIOS: FirmwareInfo passed
[ 1125.088275] gim info:(ATOM_PostVBIOS:261) ATOM_PostVBIOS: ASIC_Init before, engine clock = 7530, memory clock =1e848
[ 1125.412963] gim info:(ATOM_PostVBIOS:263) ATOM_PostVBIOS: ASIC_Init after
[ 1125.412964] gim info:(ATOM_PostVBIOS:273) ATOM_PostVBIOS: ATOM_InitFanCntl before
[ 1125.412965] gim info:(ATOM_PostVBIOS:275) ATOM_PostVBIOS: ATOM_InitFanCntl after
[ 1125.412965] PF0 gim info:(gim_post_VBIOS:801) Post INIT_ASIC successfully!
[ 1125.412977] gim warning:(firmware_requires_update:473) SMU option ROM version 0x111700 versus patch version 0x111a00
[ 1125.412989] gim warning:(firmware_requires_update:486) RLCV option ROM version 113. Patch version 1
[ 1125.412989] gim info:(firmware_requires_update:495) TOC found, update it
[ 1125.412990] gim info:(patch_firmware:549) Update SMC_Init table
[ 1125.414871] gim warning:(patch_firmware:574) Update smu firmware
[ 1125.416014] gim warning:(patch_firmware:582) Update RLCV firmware
[ 1125.416083] gim warning:(patch_firmware:590) Update TOC
[ 1125.416520] gim info:(func_recalc_checksum:518) func_recalc_checksum original= 56
[ 1125.416550] gim info:(func_recalc_checksum:522) func_recalc_checksum new= 89
[ 1125.416551] PF0 gim info:(gim_post_VBIOS:811) Asic needs firmware loaded
[ 1125.416551] gim info:(ATOM_PostVBIOS:215) ATOM_PostVBIOS: FirmwareInfo passed
[ 1125.416552] gim info:(ATOM_PostVBIOS:250) just load uCode
[ 1125.416553] gim info:(ATOM_PostVBIOS:261) ATOM_PostVBIOS: ASIC_Init before, engine clock = 7530, memory clock =1e848
[ 1127.081802] gim info:(ATOM_PostVBIOS:263) ATOM_PostVBIOS: ASIC_Init after
[ 1127.081803] gim info:(ATOM_PostVBIOS:273) ATOM_PostVBIOS: ATOM_InitFanCntl before
[ 1127.081803] gim info:(ATOM_PostVBIOS:275) ATOM_PostVBIOS: ATOM_InitFanCntl after
[ 1127.081805] PF0 gim info:(gim_post_VBIOS:817) Post LOAD_FW successfully!
[ 1127.081805] PF0 gim info:(gim_post_VBIOS:818) Post VBIOS successfully!
[ 1127.082653] gim info:(enable_thermal_control:643) Thermal Control Enable
[ 1127.082655] PF0 gim info:(SetNewAdapter:1207) gim_post_VBIOS done
[ 1127.082656] PF0 gim info:(SetNewAdapter:1248) Scheduler Time interval set to 7 msec
[ 1127.082659] gim info:(EnableSriov:398) Enable SRIOV
[ 1127.082659] gim info:(EnableSriov:399) Enable SRIOV vfs count = 16
[ 1127.082689] gim 0000:06:00.0: not enough MMIO resources for SR-IOV
[ 1127.082702] gim error:(EnableSriov:410) Fail to enable sriov, status = fffffff4
[ 1127.082711] gim error:(SetNewAdapter:1263) Failed to properly enable SRIOV(map_image) !!!!
[ 1127.186274] gim error:(gim_probe:126) Failed to create new adapter
[ 1127.186297] gim: probe of 0000:06:00.0 failed with error -1
[ 1127.186312] gim error:(gim_probe:123) gim_probe(07:00.0)
[ 1127.186319] gim info:(alloc_adapter:454) allocate adapter for PF 0x0700
[ 1127.186320] gim info:(alloc_adapter:457) Found free adapter at index 0
[ 1127.186325] PF0 gim info:(SetNewAdapter:1096) curr allocated at 00000000d08994a5
[ 1127.186326] PF0 gim info:(SetNewAdapter:1102) Can't disable ATS --> Not enabled in the first place
[ 1127.186327] PF0 gim info:(SetNewAdapter:1113) SRIOV is supported
[ 1127.186327] PF0 gim info:(SetNewAdapter:1121) found PCI bridge device
[ 1127.186328] PF0 gim info:(SetNewAdapter:1124) found: 05:10.0
[ 1127.186544] PF0 gim info:(SetNewAdapter:1147) mmio_base = 00000000de310274
[ 1127.187275] PF0 gim info:(SetNewAdapter:1149) doorbell = 00000000c74e7207
[ 1127.266496] PF0 gim info:(SetNewAdapter:1151) pf.fb_va = 000000009a0d3e40
[ 1127.266520] gim info:(sriov_is_ari_enabled:180) PCI_SRIOV_CAP = 0x00000002
[ 1127.266521] gim info:(sriov_is_ari_enabled:190) PCI_SRIOV_CTRL = 0x00000010
[ 1127.266522] gim info:(sriov_is_ari_enabled:194) PCI_SRIOV_CTRL_ARI is set --> ARI is supported
[ 1127.266524] PF0 gim info:(program_ari_mode:957) Read bif_strap8 = 0x00200004
[ 1127.266525] PF0 gim info:(program_ari_mode:963) program_ari_mode - Set ARI_Mode = PF_BUS
[ 1127.266526] PF0 gim info:(program_ari_mode:978) Write bif_strap8 = 0x00000004
[ 1127.266526] PF0 gim info:(gim_read_rom_from_reg:634) Reading VBios from ROM
[ 1127.266655] PF0 gim info:(gim_read_VBIOS:695) VBIOS starts: 0x55, 0xaa
[ 1127.266658] PF0 gim info:(gim_read_VBIOS:698) VBios size is 0x10000
[ 1127.266729] PF0 gim info:(gim_read_VBIOS:708) pVBIOS allocated at 00000000e96a70db for size of 0x80000
[ 1127.266730] PF0 gim info:(gim_read_rom_from_reg:634) Reading VBios from ROM
[ 1128.371441] PF0 gim info:(gim_read_VBIOS:718) BIOS Version Major 0xF Minor 0x31
[ 1128.371519] PF0 gim info:(gim_read_VBIOS:729) VBios Checksum = 0x541c00
[ 1128.371520] PF0 gim info:(gim_read_VBIOS:738) Valid video BIOS image, size = 0x10000, check sum is 0x541c00
[ 1128.371520] PF0 gim info:(gim_read_VBIOS:739) Read in full Vbios image of size = 0x80000
[ 1128.371575] PF0 gim info:(gim_post_VBIOS:776) Init Parser passed!, continue
[ 1128.371579] <1>ATOM_CheckAsicStatus - BIOS_SCRATCH_7 = 0x00000000
[ 1128.371579] <1> Isolate ATOM_S7_ASIC_INIT_COMPLETE_MASK bit(s) = 0x00000000
[ 1128.371581] <1> RLC_CNTL = 0x00000000
[ 1128.371581] <1> Isolate RLC_CNTL__RLC_ENABLE_F32_MASK = 0x00000000
[ 1128.371582] <1>ATOM_ASIC_NEED_POST
[ 1128.371582] PF0 gim info:(gim_post_VBIOS:795) Asic needs a VBios post
[ 1128.371583] gim info:(ATOM_PostVBIOS:215) ATOM_PostVBIOS: FirmwareInfo passed
[ 1128.371584] gim info:(ATOM_PostVBIOS:261) ATOM_PostVBIOS: ASIC_Init before, engine clock = 7530, memory clock =1e848
[ 1128.696397] gim info:(ATOM_PostVBIOS:263) ATOM_PostVBIOS: ASIC_Init after
[ 1128.696398] gim info:(ATOM_PostVBIOS:273) ATOM_PostVBIOS: ATOM_InitFanCntl before
[ 1128.696399] gim info:(ATOM_PostVBIOS:275) ATOM_PostVBIOS: ATOM_InitFanCntl after
[ 1128.696400] PF0 gim info:(gim_post_VBIOS:801) Post INIT_ASIC successfully!
[ 1128.696412] gim warning:(firmware_requires_update:473) SMU option ROM version 0x111700 versus patch version 0x111a00
[ 1128.696424] gim warning:(firmware_requires_update:486) RLCV option ROM version 113. Patch version 1
[ 1128.696424] gim info:(firmware_requires_update:495) TOC found, update it
[ 1128.696425] gim info:(patch_firmware:549) Update SMC_Init table
[ 1128.698220] gim warning:(patch_firmware:574) Update smu firmware
[ 1128.699381] gim warning:(patch_firmware:582) Update RLCV firmware
[ 1128.699450] gim warning:(patch_firmware:590) Update TOC
[ 1128.699882] gim info:(func_recalc_checksum:518) func_recalc_checksum original= 56
[ 1128.699911] gim info:(func_recalc_checksum:522) func_recalc_checksum new= 89
[ 1128.699912] PF0 gim info:(gim_post_VBIOS:811) Asic needs firmware loaded
[ 1128.699912] gim info:(ATOM_PostVBIOS:215) ATOM_PostVBIOS: FirmwareInfo passed
[ 1128.699913] gim info:(ATOM_PostVBIOS:250) just load uCode
[ 1128.699914] gim info:(ATOM_PostVBIOS:261) ATOM_PostVBIOS: ASIC_Init before, engine clock = 7530, memory clock =1e848
[ 1130.370214] gim info:(ATOM_PostVBIOS:263) ATOM_PostVBIOS: ASIC_Init after
[ 1130.370215] gim info:(ATOM_PostVBIOS:273) ATOM_PostVBIOS: ATOM_InitFanCntl before
[ 1130.370216] gim info:(ATOM_PostVBIOS:275) ATOM_PostVBIOS: ATOM_InitFanCntl after
[ 1130.370217] PF0 gim info:(gim_post_VBIOS:817) Post LOAD_FW successfully!
[ 1130.370217] PF0 gim info:(gim_post_VBIOS:818) Post VBIOS successfully!
[ 1130.370976] gim info:(enable_thermal_control:643) Thermal Control Enable
[ 1130.370977] PF0 gim info:(SetNewAdapter:1207) gim_post_VBIOS done
[ 1130.370978] PF0 gim info:(SetNewAdapter:1248) Scheduler Time interval set to 7 msec
[ 1130.370980] gim info:(EnableSriov:398) Enable SRIOV
[ 1130.370981] gim info:(EnableSriov:399) Enable SRIOV vfs count = 16
[ 1130.370987] gim 0000:07:00.0: not enough MMIO resources for SR-IOV
[ 1130.370997] gim error:(EnableSriov:410) Fail to enable sriov, status = fffffff4
[ 1130.371005] gim error:(SetNewAdapter:1263) Failed to properly enable SRIOV(map_image) !!!!
[ 1130.474928] gim error:(gim_probe:126) Failed to create new adapter
[ 1130.474958] gim: probe of 0000:07:00.0 failed with error -1
[ 1130.475210] gim info:(gim_ioctl_init:567) IOCTL device created and ready for use
[ 1130.475211] Running Kaveri version of GIM
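Reading the log: both PFs (06:00.0 and 07:00.0) post the VBIOS and load firmware successfully, and both then fail in EnableSriov right after the kernel reports "not enough MMIO resources for SR-IOV". The status = fffffff4 is a 32-bit two's-complement value. The small bash sketch below (decode_status is a hypothetical helper name, not part of GIM) converts it to a signed integer. It comes out as -12, which is -ENOMEM on Linux, suggesting the VF BARs were not given MMIO space rather than a fault in the GPU itself. That is the kind of situation pci=realloc is meant to address.

```shell
#!/usr/bin/env bash
# decode_status: hypothetical helper that interprets a 32-bit hex status
# (as printed by gim, e.g. "fffffff4") as a signed integer.
decode_status() {
  local val=$(( 16#$1 ))
  # With the top bit set, the 32-bit value is negative in two's complement.
  if (( val >= 0x80000000 )); then
    val=$(( val - 0x100000000 ))
  fi
  echo "$val"
}

decode_status fffffff4   # prints -12 (ENOMEM is 12 on Linux)
```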
-
I posted only part of the dmesg output, as the entire output was quite long; I left in everything related to GIM. Let me know if you need anything else. Thank you both again for your help!
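For a shorter excerpt, the interesting lines can be filtered out of dmesg directly. A minimal sketch, assuming a standard grep -E and the message formats seen in the log above (gim_problems is a hypothetical name): it keeps GIM error and warning lines plus the kernel's SR-IOV message. Not every match is a failure; in this log GIM prints its gim_probe(06:00.0) banner at error level, and the firmware-patch warnings come before a successful VBIOS post.

```shell
#!/usr/bin/env bash
# gim_problems: hypothetical filter that keeps GIM error/warning lines and
# the kernel's "not enough MMIO resources for SR-IOV" message from stdin.
gim_problems() {
  grep -E 'gim (error|warning)|SR-IOV'
}

# On the XCP-ng host (dom0), for example:
#   dmesg | gim_problems
```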
-
Could a BIOS update potentially fix this issue?
https://access.redhat.com/solutions/37376
-
Update: I updated the BIOS to the most recent version (2.9). Still having the same issue.
-
@tbluml did you try the pci=realloc workaround, as stated in the RHEL link?
# /opt/xensource/libexec/xen-cmdline --set-dom0 pci=realloc
Edit: reboot the host after applying the change.
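To confirm the option is actually active after that reboot, you can check dom0's own kernel command line in /proc/cmdline. A small sketch, with cmdline_has_opts as a hypothetical helper: it only matches whole space-separated words, so it would not recognise the equivalent comma-joined form pci=realloc,assign-busses.

```shell
#!/usr/bin/env bash
# cmdline_has_opts: hypothetical helper. Succeeds only if every option given
# after the first argument appears as a whole word in the command line
# passed as the first argument.
cmdline_has_opts() {
  local cmdline=" $1 "   # pad so the first and last words also match
  shift
  local opt
  for opt in "$@"; do
    case "$cmdline" in
      *" $opt "*) ;;                                   # option present
      *) echo "missing: $opt" >&2; return 1 ;;
    esac
  done
  return 0
}

# After rebooting the host, in dom0:
#   cmdline_has_opts "$(cat /proc/cmdline)" pci=realloc && echo "pci=realloc is active"
```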
-
@tuxen I just tried it (from the terminal) and rebooted, but unfortunately got the same result. Does the command need to be appended to a file, or should it work just from the terminal?
-
It's from the terminal/CLI. Alternatively, you can verify/change the boot options in /boot/grub/grub.cfg (for the dom0 boot, see the module2 /boot/vmlinuz entries).
Found this Citrix KB that adds one more pci option, take a look:
https://support.citrix.com/article/CTX250121
-
For the moment, I took the S7150x2 out of the R720 and put it in a Supermicro X10DRH-CT-O with E5-2620v3s for testing. After everything was set up (BIOS, OS, and driver), I found that MxGPU did work. (Good to know that if all else fails, I have a machine that will work for what I need!)
I will take a look at that, @tuxen! Thank you!
-
@tbluml Would you like to try the open-source GIM driver on your Dell machine? It may tell us more.
-
I had the chance to try the rest of the commands linked by @tuxen today, and now I can successfully run a VM with MxGPU enabled and started! It looks like adding "pci=assign-busses" to this command did it.
/opt/xensource/libexec/xen-cmdline --set-dom0 "pci=realloc pci=assign-busses"
Thank you all for your assistance!
-
That's interesting! Maybe you can add this to the documentation?
-
@olivierlambert I would be happy to. Is there a post or link to posting guidelines? (So I can make sure that what I write is in line with what has already been written.)
-
Here: https://xcp-ng.org/docs/compute.html#mxgpu-amd-vgpu
There's a link at the bottom of the page (called "Help us to improve this page!") where you can contribute and add what you did.
-
@tbluml I'm trying to build an MxGPU setup with similar hardware (Dell R720, 2x E5-2650).
I got the same SR-IOV errors as you. I added the pci=realloc pci=assign-busses params.
Unfortunately, the system fails to boot when pci=assign-busses is added.
The root disk is not discovered and a dracut shell is started.
Did you run into the same issue, and if so, how did you fix it?
Edit:
If anyone else stumbles upon this:
I reinstalled on a (USB) disk that is not connected to the RAID controller, and it seems to work now.
I speculate that since the controller is a PCI device, and pci=assign-busses allows the kernel to override PCI bus numbers, the RAID device cannot be found using the predetermined data in the initramfs. But that might be complete nonsense (I'm no expert in these matters).
-
I have the same problem with XCP-ng 8.2. I'm trying to get started with MxGPU on an HPE ML380p Gen8 with E5-2620V2 CPUs. With pci=realloc pci=assign-busses added, the server cannot boot. Below is the point in the boot where it crashes.
The log in the images seems to point to a known issue: "choose an explicit smt=(bool) setting. See XSA-297"
It's pci=assign-busses that prevents booting, but without it, "modprobe gim" fails to insert the module. Even when using a USB disk to avoid the PCI disk controller, the system crashes during startup. The BIOS firmware is quite recent (2019), the latest available. Has anyone resolved this issue?
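One way to test the speculation earlier in the thread (that pci=assign-busses renumbers PCI buses, so the initramfs no longer finds the disk controller) is to save lspci -D output from a boot without the option and from one with it, then compare the two. The sketch below uses a hypothetical helper, pci_moves. It matches devices by their lspci description text and, for identical descriptions (such as the two GPUs on an S7150x2), by order of appearance. That is a heuristic, not a guaranteed device identity.

```shell
#!/usr/bin/env bash
# pci_moves: hypothetical helper. Given two saved "lspci -D" listings
# (before and after adding pci=assign-busses), print
# "old -> new  description" for each device whose address changed.
pci_moves() {
  awk '
    {
      addr = $1
      desc = $0
      sub(/^[^ ]+ /, "", desc)          # drop the leading address
      # Number repeated descriptions per file so identical devices
      # pair up in order of appearance.
      n = ++seen[FILENAME, desc]
      key = desc SUBSEP n
      if (FNR == NR) {
        old[key] = addr                 # first file: remember the address
      } else if ((key in old) && old[key] != addr) {
        printf "%s -> %s  %s\n", old[key], addr, desc
      }
    }
  ' "$1" "$2"
}

# Example: lspci -D > before.txt, reboot with the option, lspci -D > after.txt,
# then: pci_moves before.txt after.txt
```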