@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:
@jefftee I prefer Alpine Linux.
+1
Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor
@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:
@jefftee I prefer Alpine Linux.
+1
Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor
@jshiells Did you also check /var/log/kern.log
for hardware errors? I'm seeing qemu process crashing with bad RIP (Instruction pointer) value which screams for a hardware issue, IMO. Just a 1-bit flipping in memory is enough to cause unpleasant surprises. I hope the servers are using ECC memory. I'd run a memtest and some CPU stress test on that server.
Some years ago, I had a two-socket Dell server with one bad core (no errors reported at boot). When the Xen scheduler ran a task on that core... Boom. Host crash.
@steff22 Wow, great news! Kudos to the Xen & XCP-ng dev teams
@cunrun @jorge-gbs any init errors in dom0 /var/log/kern.log
re. GIM driver? Also, if you search some topics here covering this specific GPU, there were mixed results booting dom0 with pci=realloc,assign-busses
. Maybe it worth a try.
I liked as well. Easy to find the topics and good layout
@olivierlambert congrats to the team and also to this great community!
@sasha It's worth notice that the BIOS (from 2019) is relatively old/outdated. It's recommended to update the BIOS to a more recent version.
@fred974 Yep, see the docs about NUMA/core affinity (soft/hard pinning):
@Forza Take a look:
https://xcp-ng.org/forum/post/49400
At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 with more than 64 vcpus on a 48C/96T Xeon server. The VM started normally, but it didn't recognizes the vcpus > 64.
I've not tested that VM param platform:acpi=0
as a possible solution and the trade-offs. In the past, some old RHEL 5.x VMs without acpi support would simply power off (like pulling the power cord) instead of a clean shutdown on a vm-shutdown command.
Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not a HPC expert but considering the EPYC MCM architecture, instead of a big VM, spreading the workload across many workers pinned to each CCD (or each numa nodes on a NPS4 confg) may be interesting.
Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.
@erfant after seeing your uploaded dmesg
, the steps 2 & 3 boot options can be put aside for while because the error isn't the same as the other topics.
The log is showing MxGPU driver probe/initialization errors. After some digging, could be the case of a GPU firmware being incompatible with UEFI. Do you have any spare server for testing XCP-ng boot in legacy/BIOS with this GPU?
[ 119.418930] gim error:(gim_probe:123) gim_probe(08:00.0)
[ 121.145663] gim error:(wait_cmd_complete:2387) wait_cmd_complete -- time out after 0.003044131 sec
[ 121.145719] gim error:(wait_cmd_complete:2390) Cmd = 0x17, Status = 0x0, cmd_Complete=0
[ 121.145984] gim error:(init_register_init_state:4643) Failed to INIT PF for initial register 'init-state'
Edited for clarification.
@Appollonius said in Strange issue with booting XCP-NG:
Its only when I install the GPU and dont connect it to a monitor that it will not boot properly.
Maybe because, when there's a GPU installed but no monitor attached, the motherboard POST fails at EDID probe? As stated, some boards/BIOS require an explicit configuration in order to boot without a monitor/keyboard/mouse plugged, eg.:
The incorrect clock results mean that Xen isn't in charge of frequency scaling management. Set the CPU Power Management to Performance Per Watt (OS)
and run the previous xenpm
, this time with a watch
for real-time:
# watch 'xenpm start 1 | grep -i "avg freq"'
Start a VM boot storm (or a stress test inside one or more VMs) in order to generate some CPU load.