TrueNAS VM failing to start
-
Thinking this could be down to the PCIe card moves, as that did change the IDs for some of the passthrough devices, I removed all the passthroughs, via the command line, and then reinstated them.
Now when I try to start TrueNAS, the whole system locks up. I can't enter anything via PuTTY, XOA, or the Supermicro IPMI.
I have no idea where to go from here.
Cheers.
-
Are you sure your OS (TrueNAS) isn't waiting for the PCI device that's not passed through anymore?
-
@olivierlambert The same devices are passed through, just as different IDs.
But that shouldn't "kill" XCP to the point that XOA, PuTTY, etc. no longer respond.
Cheers.
-
No, it shouldn't. Have you removed the passthrough from the VM too? Without logs it's hard to tell; take a look at them and see if you can spot something.
-
@olivierlambert
Yes, I removed the passthroughs from the VM before I removed them at the DOM level. At the moment, the system is booted directly into TrueNAS and is re-silvering the replaced NVMe. Once this finishes, I can reboot XCP and take a look. Is there any particular log you think will give the most clues?
Cheers.
-
I think the usual stuff: https://docs.xcp-ng.org/troubleshooting/log-files/
-
@olivierlambert
Sorry about the delay, got a lot going on. Anyway, I was able to pick this up again, and here's what happened this time. I booted XCP, noticed there were a bunch more updates, and ran the update so that I'm collecting information from the very latest and greatest.
Rebooted XCP and started the TrueNAS VM with NO passthrough devices. As expected, that started up fine. Stopped TrueNAS, added all the devices, and started TrueNAS again. This immediately caused the server to reboot itself. Hmmmmm.
After XCP-ng restarted, I collected the output from "xen-bugtool --yestoall" and also the /var/crash directory (how do I upload a tgz?), which will hopefully give a clue as to what's going on.
I also have the output from "xl pci-assignable-list" and "xe vm-list params=other-config uuid=<uuid>" showing the passthrough devices if needed.
Cheers.
-
This is indeed not normal; it could be a BIOS problem or even a hardware problem (or a pretty bad bug in Xen?). What kind of hardware is it? I would be surprised if it's server grade (otherwise this points to a software bug).
-
@olivierlambert As I pointed out earlier, everything was working perfectly until I shut down to replace an NVMe stick, which involved moving around a couple of PCIe cards, hence changing their IDs for passthrough.
It's a Supermicro X11DPH-T running a pair of Xeon Gold 5118. The BIOS was up to date as of the middle of last year, with a date of 3/5/24.
Cheers.
-
Not sure if this helps, from the bottom of the xen.log:
(XEN) [ 919.901833] Watchdog timer detects that CPU23 is stuck!
(XEN) [ 919.901837] ----[ Xen-4.17.5-23 x86_64 debug=n Not tainted ]----
(XEN) [ 919.901838] CPU: 23
(XEN) [ 919.901839] RIP: e008:[<ffff82d04032ca4a>] arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901843] RFLAGS: 0000000000000012 CONTEXT: hypervisor
(XEN) [ 919.901845] rax: 0000000000000030 rbx: ffff83103fff7cf8 rcx: 0000000000000017
(XEN) [ 919.901846] rdx: ffff83103fff7df8 rsi: 0000000000000000 rdi: ffff83103fff7cf8
(XEN) [ 919.901847] rbp: 0000000000000017 rsp: ffff831033b87d00 r8: 0000000000000030
(XEN) [ 919.901849] r9: ffff83103fff7cf8 r10: 0000000000000000 r11: 0000000000000000
(XEN) [ 919.901850] r12: 0000000000000000 r13: ffff82d040987680 r14: 00000000000000fb
(XEN) [ 919.901851] r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000007526e0
(XEN) [ 919.901852] cr3: 000000006162f000 cr2: 00007f233881e010
(XEN) [ 919.901853] fsb: 0000000000000000 gsb: 0000000000000000 gss: ffff9ee10f280000
(XEN) [ 919.901854] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [ 919.901857] Xen code around <ffff82d04032ca4a> (arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0):
(XEN) [ 919.901858] 1f 80 00 00 00 00 f3 90 <8b> 0a 39 c8 75 f8 eb 97 66 0f 1f 44 00 00 31 ff
(XEN) [ 919.901862] Xen stack trace from rsp=ffff831033b87d00:
(XEN) [ 919.901863] ffff82d040987680 ffff82d04023201c ffff831033b87d98 00000000000000fb
(XEN) [ 919.901865] ffff82d04031166c 0000000000000202 0000000000000000 0000000080000000
(XEN) [ 919.901867] 0000000000000000 0000000000000000 ffff831033b87fff 0000000000000000
(XEN) [ 919.901869] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [ 919.901870] ffff831033b87fff 0000000000000000 ffff82d040201916 000000d5308191d5
(XEN) [ 919.901872] 000000d529ac0122 0000000000000017 ffff831033b916a0 ffff831033b91738
(XEN) [ 919.901873] 0000000000000060 0000000000000001 ffff82d040987680 ffff831033b87ef8
(XEN) [ 919.901875] ffff82d040988200 ffff831033b8d06c 000000d5308187aa 0000000000000000
(XEN) [ 919.901877] 000000d5308191d5 ffff831033b916d0 000000fb00000000 ffff82d0402931f4
(XEN) [ 919.901879] 000000000000e008 0000000000000246 ffff831033b87e48 0000000000000000
(XEN) [ 919.901880] ffff82d0402931ed 0000000000000000 0000000000000000 0000000000000000
(XEN) [ 919.901882] ffff82d0409875e0 0000000000000017 ffff82d0409d5340 0000000000000017
(XEN) [ 919.901884] 0000000000000017 0000000000007fff ffff82d040820c00 ffff82d040987680
(XEN) [ 919.901885] ffff82d0409d5340 ffff82d0403001bb ffff82d040988200 ffff82d0409803b0
(XEN) [ 919.901887] ffff82d0403000e0 ffff831033b92000 ffff83132018e000 ffff83103ffc9000
(XEN) [ 919.901889] 0000000000000017 ffff8323a572e000 ffff82d040301f5e 000000000000003b
(XEN) [ 919.901891] 00007f2339a6a948 0000000000000003 00007f2338828840 00007f232b42a840
(XEN) [ 919.901893] 0000000000000002 00007f2339a6a8d8 00007f2339a6a950 0000000000000001
(XEN) [ 919.901894] 00000000004a2950 00007f2338813740 0000000000000000 0000000000000003
(XEN) [ 919.901896] 00000000009465e0 00007f233881dff0 000000fa00000000 00000000004a9499
(XEN) [ 919.901898] Xen call trace:
(XEN) [ 919.901899] [<ffff82d04032ca4a>] R arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901902] [<ffff82d04023201c>] S smp_call_function_interrupt+0x4c/0x90
(XEN) [ 919.901905] [<ffff82d04031166c>] S do_IRQ+0x2bc/0x710
(XEN) [ 919.901907] [<ffff82d040201916>] S common_interrupt+0x136/0x150
(XEN) [ 919.901911] [<ffff82d0402931f4>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x204/0x3c0
(XEN) [ 919.901913] [<ffff82d0402931ed>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x1fd/0x3c0
(XEN) [ 919.901916] [<ffff82d0403001bb>] S arch/x86/domain.c#idle_loop+0xdb/0xf0
(XEN) [ 919.901918] [<ffff82d0403000e0>] S arch/x86/domain.c#idle_loop+0/0xf0
(XEN) [ 919.901919] [<ffff82d040301f5e>] S context_switch+0x1ee/0x900
(XEN) [ 919.901920]
(XEN) [ 919.901927] CPU3 d[IDLE]v3 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901930] CPU2 d[IDLE]v2 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901934] CPU1 d[IDLE]v1 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901937] CPU0 d[IDLE]v0 e008:ffff82d04032c9d2 in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0x42/0xe0
(XEN) [ 919.901941] CPU4 d[IDLE]v4 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901945] CPU5 d[IDLE]v5 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901949] CPU6 d[IDLE]v6 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901952] CPU7 d[IDLE]v7 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901956] CPU8 d[IDLE]v8 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901960] CPU9 d[IDLE]v9 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901964] CPU10 d[IDLE]v10 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901969] CPU16 d[IDLE]v16 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901972] CPU17 d[IDLE]v17 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901975] CPU11 d[IDLE]v11 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901978] CPU22 d[IDLE]v22 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901983] CPU20 d0v11 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901986] CPU21 d[IDLE]v21 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901991] CPU14 d[IDLE]v14 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901994] CPU15 d[IDLE]v15 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901998] CPU18 d[IDLE]v18 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902002] CPU19 d[IDLE]v19 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902006] CPU13 d[IDLE]v13 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902009] CPU12 d[IDLE]v12 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.912921] Non-responding CPUs: {24-47}
(XEN) [ 919.912922]
(XEN) [ 919.912923] ****************************************
(XEN) [ 919.912923] Panic on CPU 23:
(XEN) [ 919.912924] FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
(XEN) [ 919.912925] ****************************************
(XEN) [ 919.912926]
(XEN) [ 919.912926] Reboot in five seconds...
(XEN) [ 919.912928] Executing kexec image on cpu23
(XEN) [ 920.912554] Failed to shoot down CPUs {24-47}
Cheers.
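For anyone reading a dump like this, a rough way to boil it down is to pull out three things: the CPU the watchdog flagged, the CPUs that printed a state line, and the set Xen lists as non-responding. The sketch below is only an illustration, not an official tool. The function names `expand_cpu_set` and `summarize_watchdog` are made up here, and the regular expressions assume the exact message wording seen in the excerpt above (including a CPU set written as `{24-47}` or a comma-separated mix of numbers and ranges).

```python
import re


def expand_cpu_set(spec):
    """Expand a CPU set such as '24-47' or '1,3-5' into a list of ints.

    Hypothetical helper: assumes the set is comma-separated numbers and
    inclusive ranges, as in the 'Non-responding CPUs: {24-47}' line.
    """
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def summarize_watchdog(log_text):
    """Summarize a Xen watchdog dump passed in as one string.

    Works whether or not the log still has its original line breaks,
    because each pattern is searched across the whole text.
    """
    stuck = re.search(r"Watchdog timer detects that CPU(\d+) is stuck", log_text)
    silent = re.search(r"Non-responding CPUs: \{([^}]*)\}", log_text)
    # Per-CPU state lines look like 'CPU3 d[IDLE]v3 ...' or 'CPU20 d0v11 ...'.
    responded = re.findall(r"\bCPU(\d+) d(?:\[IDLE\]|\d+)v\d+ ", log_text)
    return {
        "stuck_cpu": int(stuck.group(1)) if stuck else None,
        "responded": sorted({int(n) for n in responded}),
        "non_responding": expand_cpu_set(silent.group(1)) if silent else [],
    }
```

Calling `summarize_watchdog` on the text of the saved xen.log returns a small dictionary, which makes it easier to see at a glance which CPUs answered the dump request and which block of CPUs never did.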