TrueNAS VM failing to start
-
This is indeed not normal. It could be a BIOS problem or even a hardware problem (or a pretty bad bug in Xen?). What kind of hardware is it? I would be surprised if it is server grade (and if it is, that points to a software bug).
-
@olivierlambert As I pointed out earlier, everything was working perfectly until I shut down to replace an NVMe stick, which involved moving around a couple of PCIe cards, hence changing their IDs for passthrough.
It's a Supermicro X11DPH-T running a pair of Xeon Gold 5118. The BIOS was up to date as of the middle of last year, with a date of 3/5/24.
Cheers.
-
Not sure if this helps, but this is from the bottom of xen.log:
(XEN) [ 919.901833] Watchdog timer detects that CPU23 is stuck!
(XEN) [ 919.901837] ----[ Xen-4.17.5-23 x86_64 debug=n Not tainted ]----
(XEN) [ 919.901838] CPU: 23
(XEN) [ 919.901839] RIP: e008:[<ffff82d04032ca4a>] arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901843] RFLAGS: 0000000000000012 CONTEXT: hypervisor
(XEN) [ 919.901845] rax: 0000000000000030 rbx: ffff83103fff7cf8 rcx: 0000000000000017
(XEN) [ 919.901846] rdx: ffff83103fff7df8 rsi: 0000000000000000 rdi: ffff83103fff7cf8
(XEN) [ 919.901847] rbp: 0000000000000017 rsp: ffff831033b87d00 r8: 0000000000000030
(XEN) [ 919.901849] r9: ffff83103fff7cf8 r10: 0000000000000000 r11: 0000000000000000
(XEN) [ 919.901850] r12: 0000000000000000 r13: ffff82d040987680 r14: 00000000000000fb
(XEN) [ 919.901851] r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000007526e0
(XEN) [ 919.901852] cr3: 000000006162f000 cr2: 00007f233881e010
(XEN) [ 919.901853] fsb: 0000000000000000 gsb: 0000000000000000 gss: ffff9ee10f280000
(XEN) [ 919.901854] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [ 919.901857] Xen code around <ffff82d04032ca4a> (arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0):
(XEN) [ 919.901858] 1f 80 00 00 00 00 f3 90 <8b> 0a 39 c8 75 f8 eb 97 66 0f 1f 44 00 00 31 ff
(XEN) [ 919.901862] Xen stack trace from rsp=ffff831033b87d00:
(XEN) [ 919.901863] ffff82d040987680 ffff82d04023201c ffff831033b87d98 00000000000000fb
(XEN) [ 919.901865] ffff82d04031166c 0000000000000202 0000000000000000 0000000080000000
(XEN) [ 919.901867] 0000000000000000 0000000000000000 ffff831033b87fff 0000000000000000
(XEN) [ 919.901869] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [ 919.901870] ffff831033b87fff 0000000000000000 ffff82d040201916 000000d5308191d5
(XEN) [ 919.901872] 000000d529ac0122 0000000000000017 ffff831033b916a0 ffff831033b91738
(XEN) [ 919.901873] 0000000000000060 0000000000000001 ffff82d040987680 ffff831033b87ef8
(XEN) [ 919.901875] ffff82d040988200 ffff831033b8d06c 000000d5308187aa 0000000000000000
(XEN) [ 919.901877] 000000d5308191d5 ffff831033b916d0 000000fb00000000 ffff82d0402931f4
(XEN) [ 919.901879] 000000000000e008 0000000000000246 ffff831033b87e48 0000000000000000
(XEN) [ 919.901880] ffff82d0402931ed 0000000000000000 0000000000000000 0000000000000000
(XEN) [ 919.901882] ffff82d0409875e0 0000000000000017 ffff82d0409d5340 0000000000000017
(XEN) [ 919.901884] 0000000000000017 0000000000007fff ffff82d040820c00 ffff82d040987680
(XEN) [ 919.901885] ffff82d0409d5340 ffff82d0403001bb ffff82d040988200 ffff82d0409803b0
(XEN) [ 919.901887] ffff82d0403000e0 ffff831033b92000 ffff83132018e000 ffff83103ffc9000
(XEN) [ 919.901889] 0000000000000017 ffff8323a572e000 ffff82d040301f5e 000000000000003b
(XEN) [ 919.901891] 00007f2339a6a948 0000000000000003 00007f2338828840 00007f232b42a840
(XEN) [ 919.901893] 0000000000000002 00007f2339a6a8d8 00007f2339a6a950 0000000000000001
(XEN) [ 919.901894] 00000000004a2950 00007f2338813740 0000000000000000 0000000000000003
(XEN) [ 919.901896] 00000000009465e0 00007f233881dff0 000000fa00000000 00000000004a9499
(XEN) [ 919.901898] Xen call trace:
(XEN) [ 919.901899] [<ffff82d04032ca4a>] R arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901902] [<ffff82d04023201c>] S smp_call_function_interrupt+0x4c/0x90
(XEN) [ 919.901905] [<ffff82d04031166c>] S do_IRQ+0x2bc/0x710
(XEN) [ 919.901907] [<ffff82d040201916>] S common_interrupt+0x136/0x150
(XEN) [ 919.901911] [<ffff82d0402931f4>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x204/0x3c0
(XEN) [ 919.901913] [<ffff82d0402931ed>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x1fd/0x3c0
(XEN) [ 919.901916] [<ffff82d0403001bb>] S arch/x86/domain.c#idle_loop+0xdb/0xf0
(XEN) [ 919.901918] [<ffff82d0403000e0>] S arch/x86/domain.c#idle_loop+0/0xf0
(XEN) [ 919.901919] [<ffff82d040301f5e>] S context_switch+0x1ee/0x900
(XEN) [ 919.901920]
(XEN) [ 919.901927] CPU3 d[IDLE]v3 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901930] CPU2 d[IDLE]v2 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901934] CPU1 d[IDLE]v1 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901937] CPU0 d[IDLE]v0 e008:ffff82d04032c9d2 in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0x42/0xe0
(XEN) [ 919.901941] CPU4 d[IDLE]v4 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901945] CPU5 d[IDLE]v5 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901949] CPU6 d[IDLE]v6 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901952] CPU7 d[IDLE]v7 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901956] CPU8 d[IDLE]v8 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901960] CPU9 d[IDLE]v9 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901964] CPU10 d[IDLE]v10 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901969] CPU16 d[IDLE]v16 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901972] CPU17 d[IDLE]v17 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901975] CPU11 d[IDLE]v11 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901978] CPU22 d[IDLE]v22 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901983] CPU20 d0v11 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901986] CPU21 d[IDLE]v21 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901991] CPU14 d[IDLE]v14 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901994] CPU15 d[IDLE]v15 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901998] CPU18 d[IDLE]v18 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902002] CPU19 d[IDLE]v19 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902006] CPU13 d[IDLE]v13 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.902009] CPU12 d[IDLE]v12 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.912921] Non-responding CPUs: {24-47}
(XEN) [ 919.912922]
(XEN) [ 919.912923] ****************************************
(XEN) [ 919.912923] Panic on CPU 23:
(XEN) [ 919.912924] FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
(XEN) [ 919.912925] ****************************************
(XEN) [ 919.912926]
(XEN) [ 919.912926] Reboot in five seconds...
(XEN) [ 919.912928] Executing kexec image on cpu23
(XEN) [ 920.912554] Failed to shoot down CPUs {24-47}
Cheers.
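For anyone reading along, a dump like this is easier to scan once it is condensed. The following is a minimal Python sketch, not an official Xen tool: the function name `summarize` and the regular expressions are assumptions based only on the line shapes visible in this log. It reports which CPU the watchdog flagged, where each reporting CPU was executing, and which CPUs never answered.

```python
import re
from collections import Counter

# A few lines copied from the log above; in practice, read the whole xen.log.
SAMPLE = """\
(XEN) [ 919.901833] Watchdog timer detects that CPU23 is stuck!
(XEN) [ 919.901927] CPU3 d[IDLE]v3 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.901937] CPU0 d[IDLE]v0 e008:ffff82d04032c9d2 in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0x42/0xe0
(XEN) [ 919.901983] CPU20 d0v11 e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
(XEN) [ 919.912921] Non-responding CPUs: {24-47}
"""

# Patterns inferred from this log only; other Xen versions may print differently.
STUCK_RE = re.compile(r"Watchdog timer detects that CPU(\d+) is stuck")
STATE_RE = re.compile(r"CPU(\d+) (\S+) \S+ in Xen: (\S+)")
MISSING_RE = re.compile(r"Non-responding CPUs: \{([\d,\-]+)\}")


def expand_cpu_list(spec):
    """Expand a CPU list such as '24-47' or '1,3-5' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        low, _, high = part.partition("-")
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus


def summarize(log_text):
    """Return the stuck CPUs, a count of code locations, and the silent CPUs."""
    stuck = [int(m.group(1)) for m in STUCK_RE.finditer(log_text)]
    locations = Counter(m.group(3) for m in STATE_RE.finditer(log_text))
    missing = []
    for m in MISSING_RE.finditer(log_text):
        missing.extend(expand_cpu_list(m.group(1)))
    return {"stuck": stuck, "locations": locations, "non_responding": missing}


summary = summarize(SAMPLE)
print("stuck:", summary["stuck"])
print("non-responding count:", len(summary["non_responding"]))
for location, count in summary["locations"].most_common():
    print(count, location)
```

Run against the full log, this would show CPUs 0-22 all sitting in `time_calibration_std_rendezvous` while CPUs 24-47 never respond.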
-
Ouch. @andyhhp in case that trace rings a bell.
-
@olivierlambert Any further thoughts or suggestions (move the PCIe cards around again?)?
Cheers.
-
No, but maybe @Team-Hypervisor-Kernel does.
-
Hello @eddiea
I've sent you a link in private so that you can upload all your log files.
Thanks
Regards,
Yann
-
@yannsionneau Uploaded contents of /var/crash together with the output of "xen-bugtool --yestoall".
Cheers.
-
@EddieA Can you try different combinations of passed-through hardware in this VM?
e.g. try each device on its own, one at a time, at least in this VM.
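To keep track of which combinations have been tried, a small Python sketch like the one below can list them in order: no devices first, then each device alone, then pairs. The PCI addresses are hypothetical placeholders, not the devices in this host, and the helper name `passthrough_test_plan` is made up for illustration.

```python
from itertools import combinations


def passthrough_test_plan(devices, max_group=2):
    """List device groups to test: baseline (none), singles, then larger groups."""
    plan = [()]  # baseline: start the VM with no passthrough devices
    for size in range(1, max_group + 1):
        plan.extend(combinations(devices, size))
    return plan


# Hypothetical PCI addresses; substitute the real ones from the host.
devices = ["0000:3b:00.0", "0000:5e:00.0", "0000:af:00.0"]

for step, group in enumerate(passthrough_test_plan(devices)):
    print(step, ", ".join(group) or "(none)")
```

Recording the result of each step (VM starts, VM fails, host panics) next to its line should make it clear whether one device or a specific pairing triggers the crash.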
-
Give me a couple of days to try. It is (obviously) down to the combination of devices passed through, as I reported earlier:
said in TrueNAS VM failing to start:
Re-boot XCP and start the TrueNAS VM with NO passthrough devices. As expected, that started up fine.
Cheers.