@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:
@jefftee I prefer Alpine Linux.
+1
Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor
@cunrun @jorge-gbs any init errors in dom0's /var/log/kern.log regarding the GIM driver? Also, if you search the topics here covering this specific GPU, there were mixed results booting dom0 with pci=realloc,assign-busses. Maybe it's worth a try.
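A quick sketch of both checks from dom0 (xen-cmdline ships with XCP-ng, but verify the path on your install before relying on it):

```
# look for GIM driver init errors in dom0's kernel log
grep -i 'gim' /var/log/kern.log

# append the PCI options to dom0's kernel command line, then reboot
/opt/xensource/libexec/xen-cmdline --set-dom0 pci=realloc,assign-busses
reboot
```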
I liked it as well. Easy to find the topics, and a good layout.
@olivierlambert congrats to the team and also to this great community!
@sasha It's worth noting that the BIOS (from 2019) is relatively old/outdated. I'd recommend updating it to a more recent version.
@fred974 Yep, see the docs about NUMA/core affinity (soft/hard pinning).
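A minimal example via xe/xl (uuid and CPU numbers are placeholders):

```
# soft-pin all of a VM's vCPUs to pCPUs 0-3 (applies at next VM boot)
xe vm-param-set uuid=<vm-uuid> VCPUs-params:mask=0,1,2,3

# hard-pin vCPU 0 of a running domain to pCPU 2 (not persistent)
xl vcpu-pin <domain-id> 0 2
```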
@Forza Take a look:
https://xcp-ng.org/forum/post/49400
At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 VM with more than 64 vCPUs on a 48C/96T Xeon server. The VM started normally, but it didn't recognize the vCPUs beyond 64.
I haven't tested that VM param platform:acpi=0 as a possible solution, nor its trade-offs. In the past, some old RHEL 5.x VMs without ACPI support would simply power off (like pulling the power cord) instead of doing a clean shutdown on a vm-shutdown command.
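If someone wants to try it anyway, it's a one-liner (do it on a test VM first, given the shutdown behavior above):

```
# disable ACPI for the (halted) VM; revert with platform:acpi=1
xe vm-param-set uuid=<vm-uuid> platform:acpi=0
```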
Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not an HPC expert, but considering the EPYC MCM architecture, instead of one big VM, spreading the workload across many workers pinned to each CCD (or each NUMA node in an NPS4 config) may be interesting.
Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.
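To map out the NUMA nodes/CCDs before any pinning experiments, dom0's xl tool already gives a decent picture:

```
# host NUMA topology: nodes, memory per node, CPU-to-node mapping
xl info -n

# current vCPU placement and affinities of all running domains
xl vcpu-list
```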
@erfant after seeing your uploaded dmesg, the boot options from steps 2 & 3 can be put aside for a while, because the error isn't the same as in the other topics.
The log is showing MxGPU (GIM) driver probe/initialization errors. After some digging, it could be a case of the GPU firmware being incompatible with UEFI. Do you have any spare server to test booting XCP-ng in legacy/BIOS mode with this GPU?
[ 119.418930] gim error:(gim_probe:123) gim_probe(08:00.0)
[ 121.145663] gim error:(wait_cmd_complete:2387) wait_cmd_complete -- time out after 0.003044131 sec
[ 121.145719] gim error:(wait_cmd_complete:2390) Cmd = 0x17, Status = 0x0, cmd_Complete=0
[ 121.145984] gim error:(init_register_init_state:4643) Failed to INIT PF for initial register 'init-state'
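To confirm which mode dom0 actually booted in, there's a standard check (nothing XCP-ng specific):

```
# an efi directory in sysfs means UEFI; its absence means legacy/BIOS
[ -d /sys/firmware/efi ] && echo "UEFI" || echo "legacy/BIOS"
```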
Edited for clarification.
@Appollonius said in Strange issue with booting XCP-NG:
It's only when I install the GPU and don't connect it to a monitor that it will not boot properly.
Maybe because, when there's a GPU installed but no monitor attached, the motherboard POST fails at the EDID probe? As stated, some boards/BIOSes require an explicit setting in order to boot without a monitor/keyboard/mouse plugged in.
@KPS one thing is clear to me: the reboot is triggering a VM shutdown due to a system crash (the kernel errors and memory dump files being the lead). Without a detailed stack trace (like a Linux kernel panic), and given the difficulty of reproducing the issue, troubleshooting is a very hard task. One last thing I'd check is /var/log/daemon.log in the VM shutdown time window.
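Something along these lines, assuming the crash window is known (the timestamp pattern is just an example):

```
# pull daemon.log entries around the crash window, e.g. 02:00-02:59
grep 'Jan 10 02:' /var/log/daemon.log

# include rotated logs in case the window was already rotated out
zgrep -h 'Jan 10 02:' /var/log/daemon.log*
```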
@KPS I was thinking exactly about an after-hours task doing heavy storage I/O (e.g. data replication or ETL-like workloads). Under this scenario, a forced reboot might cause some sort of file system corruption due to uncommitted data being lost.
Now, another possible source of the issue comes to mind: automatic Windows Update. Is this service active? I'm not a Windows expert, but a forced reboot during a system update might also cause unexpected behavior.
Seeing all those errors, it seems that some system file or DLL got corrupted and needs a repair. It's strongly recommended to take a snapshot before running a system repair.
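Taking the snapshot from dom0 is a one-liner (the label is arbitrary):

```
# snapshot the VM before running the in-guest repair tools
xe vm-snapshot uuid=<vm-uuid> new-name-label=pre-repair
```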
@KPS When that forced-reboot command is issued, the VM:
@cunrun Are the XCP-ng host and the Windows Server 2019 VM booting in legacy/BIOS or UEFI? Since the FirePros were launched when legacy/BIOS was still the standard, I'd try that mode (if you haven't yet).
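On the VM side, the firmware mode can be read with a sketch like this (the exact output format may vary by release):

```
# shows e.g. firmware: uefi (or bios) for HVM guests
xe vm-param-get uuid=<vm-uuid> param-name=HVM-boot-params
```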
@dhiraj26683 seeing the htop output, there are some HA-LIZARD PIDs running. So, yes, there's "extra stuff" installed on dom0.
HA-LIZARD uses the TGT iSCSI driver, which in turn has an implicit write-cache option enabled by default if not set [1][2]. Is this option disabled in /etc/tgt/targets.conf?
[1] https://www.halizard.com/ha-iscsi
[2] https://manpages.debian.org/testing/tgt/targets.conf.5.en.html
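If it isn't, the per-target stanza would look something like this (IQN and backing store are placeholders, following the targets.conf man page):

```
<target iqn.2024-01.com.example:lun0>
    backing-store /dev/vg_ha/lv_iscsi
    # disable the implicit write cache for this target
    write-cache off
</target>
```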
@gecant I don't have any Supermicro server to test (mostly Dell/HPE here), but checking the mobo manual [1], sadly there's no profile for tweaking the power settings. After checking the available options, try this config:
Advanced >> CPU Configuration >> Advanced Power Management Configuration
> CPU P State Control
SpeedStep (PStates) [Enable]
EIST PSD Function [HW_ALL]
Turbo Mode [Enabled]
> Hardware PM State Control
Hardware P-States [Native Mode]
> CPU C State Control
Autonomous Core C-State [Disable]
CPU C6 Report [Enable]
Enhanced Halt State (C1E) [Disable (performance)] or [Enable (powersave)]
> Package C State Control
Package C State [C0/C1 (performance)] or [C6(Retention) state (powersave)]
[1] https://www.supermicro.com/manuals/motherboard/C420/MNL-2005.pdf
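After rebooting with the new settings, xenpm (shipped with Xen) can confirm what the hypervisor actually sees:

```
# current cpufreq driver, governor and available P-states per CPU
xenpm get-cpufreq-para

# exposed C-states and their residency counters
xenpm get-cpuidle-states
```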
@darabontors some additional tests that I could think of:
- lowering the WireGuard MTU to the minimum (MTU=1280);
- e1000 interfaces (bypass the PV driver but not OVS). It'll keep the VM 'agile' (hot-migrate) but with a big cost in performance;
- upgrading to 23.7.5.
As for the last version, I found this important info posted by the devs about a change in the MTU code [1]:
Today introduces a change in MTU handling for parent interfaces mostly
noticed by PPPoE use where the respective MTU values need to fit the
parent plus the additional header of the VLAN or PPPoE. Should the
MTU already be misconfigured to a smaller value it will be used as
configured so check your configuration and clear the MTU value if you
want the system to decide about the effective parent MTU size.
(...)
Hope it helps.
@darabontors said in Very scary host reboot issue:
Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?
I'm not into firewall/tunneling stuff, but shouldn't the WireGuard MTU be lower than the PPPoE one in order to fit the WG protocol overhead? I read that the default is 1420 and the minimum is 1280. I'd first reset the WG MTU to the default, and also test lower values within this range if the crash still persists.
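With wg-quick, that's just the MTU key in the interface section (names/addresses below are placeholders):

```
# /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <private-key>
Address = 10.0.0.2/24
MTU = 1280   # start at the minimum; raise toward the 1420 default once stable
```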
Regardless of the tests, there's indeed a bug somewhere, because a malformed packet/frame should be handled, not trigger a crash.