I have an ASRockRack W480D4U (BIOS version L2.23, BMC version 1.02) with an Intel W-1290P in it, 4x Samsung M391A4G43AB1-CVF RAM modules, 2x Solidigm P44 Pro NVMe drives, 2x Samsung 860 Pro SATA drives, a Sparkle A310 Eco, and an Intel X710-T2L NIC (firmware version 9.50).
XCP-ng is new for me, as I left ESXi when Broadcom killed the free version, and I haven't been entirely happy with Proxmox. Both of those solutions worked for me on this hardware, but I'm hoping that XCP-ng might be something I could use longer-term without upgrade issues.
However, with version 8.3 LTS and all available update packages installed, I'm seeing erroneous CPU thermal shutdowns with XCP-ng. I saw this previously when running FreeBSD 13.2-RELEASE on this same hardware (though before I added the A310 card), but ESXi and Proxmox never had any issues. Additionally, I never see temperatures get above the mid-60s, and I can run the CPU at 100% indefinitely under ESXi or Proxmox without any problems. I also never see any temperature warnings. The host just spontaneously shuts down, and I see a CPU_THERMTRIP event in the BMC event history.
Unfortunately, ASRockRack's support can't help me with this, so I was hoping that there might be a fix in kernel-alt that would resolve my issue. What I'm running into with that, though, is that the kernel boots just fine and picks up my NIC without issue, but never actually passes traffic on the NIC.
I have a manual/static IP setup for the management IP in XCP-ng, and with the normal kernel, I can reach it as soon as the console indicates it's up and ready. However, with kernel-alt, it never works at all. The console indicates it's up, but I can't ping it or SSH to it, and if I try to ping the gateway from the console, it just times out (though pinging localhost works). VMs using the other port on the same card can't reach the network either.
Is there something I need to do differently for the i40e drivers in kernel-alt to work with my NIC so I can see whether the erroneous thermal shutdown issues are resolved?