Best posts made by tuxen | XCP-ng and XO forum

tuxen

@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:

@jefftee I prefer Alpine Linux.

+1

Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor

tuxen

@jshiells Did you also check /var/log/kern.log for hardware errors? I'm seeing qemu process crashing with bad RIP (Instruction pointer) value which screams for a hardware issue, IMO. Just a 1-bit flipping in memory is enough to cause unpleasant surprises. I hope the servers are using ECC memory. I'd run a memtest and some CPU stress test on that server.

Some years ago, I had a two-socket Dell server with one bad core (no errors reported at boot). When the Xen scheduler ran a task on that core... Boom. Host crash.

tuxen

@steff22 Wow, great news! Kudos to the Xen & XCP-ng dev teams

tuxen

@cunrun @jorge-gbs any init errors in dom0 /var/log/kern.log re. GIM driver? Also, if you search some topics here covering this specific GPU, there were mixed results booting dom0 with pci=realloc,assign-busses. Maybe it worth a try.

tuxen

I liked as well. Easy to find the topics and good layout

tuxen

@olivierlambert congrats to the team and also to this great community!

tuxen

Kudos to XCP-ng team!

tuxen

@sasha It's worth notice that the BIOS (from 2019) is relatively old/outdated. It's recommended to update the BIOS to a more recent version.

tuxen

@fred974 Yep, see the docs about NUMA/core affinity (soft/hard pinning):

https://docs.xcp-ng.org/compute/#advanced-xen

tuxen

@Forza Take a look:

https://xcp-ng.org/forum/post/49400

At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 with more than 64 vcpus on a 48C/96T Xeon server. The VM started normally, but it didn't recognizes the vcpus > 64.

I've not tested that VM param platform:acpi=0 as a possible solution and the trade-offs. In the past, some old RHEL 5.x VMs without acpi support would simply power off (like pulling the power cord) instead of a clean shutdown on a vm-shutdown command.

Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not a HPC expert but considering the EPYC MCM architecture, instead of a big VM, spreading the workload across many workers pinned to each CCD (or each numa nodes on a NPS4 confg) may be interesting.

Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.

tuxen

@erfant after seeing your uploaded dmesg, the steps 2 & 3 boot options can be put aside for while because the error isn't the same as the other topics.

The log is showing MxGPU driver probe/initialization errors. After some digging, could be the case of a GPU firmware being incompatible with UEFI. Do you have any spare server for testing XCP-ng boot in legacy/BIOS with this GPU?

[  119.418930]        gim error:(gim_probe:123) gim_probe(08:00.0)
[  121.145663]        gim error:(wait_cmd_complete:2387)  wait_cmd_complete -- time out after 0.003044131 sec
[  121.145719]        gim error:(wait_cmd_complete:2390)   Cmd = 0x17, Status = 0x0, cmd_Complete=0
[  121.145984]        gim error:(init_register_init_state:4643) Failed to INIT PF for initial register 'init-state'

Edited for clarification.

tuxen

@Appollonius said in Strange issue with booting XCP-NG:

Its only when I install the GPU and dont connect it to a monitor that it will not boot properly.

Maybe because, when there's a GPU installed but no monitor attached, the motherboard POST fails at EDID probe? As stated, some boards/BIOS require an explicit configuration in order to boot without a monitor/keyboard/mouse plugged, eg.:

https://www.supermicro.com/support/faqs/faq.cfm?faq=11902

tuxen

The incorrect clock results mean that Xen isn't in charge of frequency scaling management. Set the CPU Power Management to Performance Per Watt (OS) and run the previous xenpm, this time with a watch for real-time:

# watch 'xenpm start 1 | grep -i "avg freq"'

Start a VM boot storm (or a stress test inside one or more VMs) in order to generate some CPU load.

Posts