@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:
@jefftee I prefer Alpine Linux.
+1
Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor
@jshiells Did you also check /var/log/kern.log for hardware errors? I'm seeing the qemu process crashing with a bad RIP (instruction pointer) value, which screams hardware issue, IMO. A single bit flip in memory is enough to cause unpleasant surprises. I hope the servers are using ECC memory. I'd run a memtest and some CPU stress test on that server.
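Just as an illustration (not a specific recommendation): memtest86+ can be booted from a USB stick, and if stress-ng happens to be available on the box (it's not shipped in dom0 by default), a simple CPU burn-in could look like this, where --cpu 0 means "use all CPUs":
stress-ng --cpu 0 --timeout 1h --metrics-brief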
Some years ago, I had a two-socket Dell server with one bad core (no errors reported at boot). When the Xen scheduler ran a task on that core... Boom. Host crash.
@cunrun @jorge-gbs Any init errors in dom0's /var/log/kern.log regarding the GIM driver? Also, if you search some topics here covering this specific GPU, there were mixed results booting dom0 with pci=realloc,assign-busses. Maybe it's worth a try.
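For reference, a sketch of how that dom0 boot parameter could be set, using the same xen-cmdline helper that appears elsewhere in this thread (remove it again if it doesn't help):
/opt/xensource/libexec/xen-cmdline --set-dom0 "pci=realloc,assign-busses"
reboot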
I liked it as well. Easy to find the topics, and a good layout.
@olivierlambert congrats to the team and also to this great community!
@sasha It's worth noting that the BIOS (from 2019) is relatively old/outdated. I'd recommend updating the BIOS to a more recent version.
@fred974 Yep, see the docs about NUMA/core affinity (soft/hard pinning):
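Just to illustrate hard pinning (hypothetical UUID and mask, see the docs for the details and the soft-pinning variant):
xe vm-param-set uuid=<VM-UUID> VCPUs-params:mask=0,1,2,3
That pins the VM's vCPUs to physical CPUs 0-3, taking effect at the next VM start.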
@Forza Take a look:
https://xcp-ng.org/forum/post/49400
At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 VM with more than 64 vCPUs on a 48C/96T Xeon server. The VM started normally, but it didn't recognize the vCPUs beyond 64.
I've not tested the VM param platform:acpi=0 as a possible solution, nor its trade-offs. In the past, some old RHEL 5.x VMs without ACPI support would simply power off (like pulling the power cord) instead of doing a clean shutdown on a vm-shutdown command.
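If someone wants to experiment with it anyway, a sketch (hypothetical UUID; the VM needs a shutdown/start to pick it up):
xe vm-param-set uuid=<VM-UUID> platform:acpi=0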
Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not an HPC expert, but considering the EPYC MCM architecture, instead of one big VM, spreading the workload across many workers pinned to each CCD (or each NUMA node on an NPS4 config) may be interesting.
Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.
@steff22 Wow, great news! Kudos to the Xen & XCP-ng dev teams
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
The BIOS disables the internal IPMI video when an external GPU card is connected, even though the internal GPU is selected as the primary GPU in the BIOS. So I only see the XCP-ng startup on screen, no xsconsole. Have tried without a screen connected to the ext GPU, same error then.
I suggest calling ASRock support and explaining this behavior.
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
No. 2: I have tried pressing Detect, only to be told that there is no more screen. I have only tried a reboot.
Could you try a full shutdown/start cycle after the driver installation?
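From dom0, that full power cycle would be something like (hypothetical UUID):
xe vm-shutdown uuid=<VM-UUID>
xe vm-start uuid=<VM-UUID>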
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
At first I thought there was something wrong with the BIOS. But this works with VMware ESXi and Proxmox.
Considering it worked with the same XCP-ng version but on different hardware, I'm more inclined to suspect a Xen incompatibility with the combo of Nvidia GPUs and some AMD motherboards. If you search the forum, there are mixed results about that.
@steff22 I have a question: have you tried pressing the [Detect] button in the display settings window?
Nonetheless, if the same dGPU card works normally on another XCP-ng host, a possible Xen passthrough incompatibility with that AM5 board should be considered. For example:
@steff22 After reading this Blue Iris topic, I wonder if it's related. As of Xen 4.15, there was a change in MSR handling that can cause a guest crash when it tries to access those registers. XCP-ng 8.3 ships Xen 4.17. The issue seems to be CPU vendor/model dependent too.
https://xcp-ng.org/forum/topic/8873/windows-blue-iris-xcp-ng-8-3
It's worth testing the solution provided there (a VM shutdown/start cycle is required for it to take effect):
xe vm-param-add uuid=<VM-UUID> param-name=platform msr-relaxed=true
Replace <VM-UUID> with your W10 VM's UUID.
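If the UUID isn't at hand, it can be listed with:
xe vm-list params=uuid,name-label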
@steff22 Weird bug. Is that W10 VM a fresh install on Xen? It seems that the driver or the dGPU is timing out somehow. Could be related to PCI power management (ASPM), but I'm not sure. You could try booting dom0 with pcie_aspm=off just for testing:
/opt/xensource/libexec/xen-cmdline --set-dom0 "pcie_aspm=off"
reboot
Another option that comes to mind is to compare the VM attributes on Proxmox and try to spot any VM config differences by setting/unsetting the PCI Express option.
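On the Proxmox side, assuming a hypothetical VMID of 100, the passthrough entries (including the pcie flag) can be dumped with:
qm config 100 | grep hostpci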
@steff22 Ah, you should try to reproduce the BSOD and then run xl dmesg. I was wondering why there's no error in the log this time.
@steff22 OK, let's boot Xen with all log levels enabled:
/opt/xensource/libexec/xen-cmdline --set-xen "loglvl=all guest_loglvl=all"
reboot
After the VM BSODs, post the xl dmesg output.
@Teddy-Astie @steff22 For the Windows VM, Xen is indeed triggering a guest crash:
(XEN) [ 1022.240112] d1v2 VIRIDIAN GUEST_CRASH: 0x116 0xffffdb8ffaf76010 0xfffff8077938e9f0 0xffffffffc0000001 0x4
I also noticed that dom0 memory was autoset to only 2.6G (out of 32G total), which might be low for a more resource-hungry dGPU. Before booting Xen in debug mode, could we rule this out by testing a non-persistent boot change to a higher value (e.g. 8G)?
At the boot menu, press <e> to edit the boot line and append dom0_mem=8192M,max:8192M to the Xen entry.
Press <F10> to boot.
After boot, check the new dom0 memory (free -m). It must be within the 7000-8000 range.
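If the higher value helps, the change could then be made persistent with the same xen-cmdline helper (a sketch, assuming 8G is the target):
/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=8192M,max:8192M
reboot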
@steff22 What's the output of lspci -k and xl pci-assignable-list?
Also, the output of the system logs regarding GPU and IOMMU initialization would be very useful:
egrep -i '(nvidia|vga|video|pciback)' /var/log/kern.log
xl dmesg
@steff22 Assuming xen-pciback.hide was previously set, could you try this workaround? (No guarantee it'll work, since each motherboard and BIOS has its quirks.)
/opt/xensource/libexec/xen-cmdline --set-dom0 pci=realloc
reboot
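For reference, the hide entry itself is usually set like this (hypothetical PCI addresses, check yours with lspci):
/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:01:00.0)(0000:01:00.1)"
reboot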