Wearing my best Lazarus cosplay outfit, I'll apologise for the resurrection.
Today I had an issue with my UPS which caused me to reboot XCP a few times. During those reboots the problem recurred at least 2, maybe 3, times: while TrueNAS was booting, XCP would lock up. Most of the time, after a power cycle of the server, the next boot would start TrueNAS cleanly. One time it took 2 power cycles before success.
Unfortunately only one of the crashes resulted in a /var/crash report, but that did have the same symptoms as my original report:
(XEN) [ 81.101362] Non-responding CPUs: {24-47}
(XEN) [ 81.101363]
(XEN) [ 81.101364] ****************************************
(XEN) [ 81.101365] Panic on CPU 5:
(XEN) [ 81.101366] FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
(XEN) [ 81.101366] ****************************************
(XEN) [ 81.101367]
(XEN) [ 81.101368] Reboot in five seconds...
(XEN) [ 81.101369] Executing kexec image on cpu5
(XEN) [ 82.101441] Failed to shoot down CPUs {24-47}
Between my original report and today, I have rebooted at other times, following updates, without this issue surfacing.
Does anyone think this could be hardware related, despite all the memory testing and stress testing I did when I built the server and again after the original issue, all with no faults? Or have I just got an unlucky set of circumstances with some sort of race condition?