XCP-ng

    TrueNAS VM failing to start

    • olivierlambert Vates 🪐 Co-Founder CEO

      This is indeed not normal. It could be a BIOS problem or even a hardware problem (or a pretty bad bug in Xen?). What kind of hardware is it? I would be surprised if it were server grade (or, if it is, this points to a software bug).

      • EddieA @olivierlambert

        @olivierlambert As I pointed out earlier, everything was working perfectly until I shut down to replace an NVMe stick, which involved moving around a couple of PCIe cards, hence changing their IDs for passthrough.

        It's a Supermicro X11DPH-T running a pair of Xeon Gold 5118. The BIOS was up to date as of the middle of last year, with a date of 3/5/24.

        Cheers.

        • EddieA @EddieA

          Not sure if this helps, from the bottom of the xen.log:

          (XEN) [  919.901833] Watchdog timer detects that CPU23 is stuck!
          (XEN) [  919.901837] ----[ Xen-4.17.5-23  x86_64  debug=n  Not tainted ]----
          (XEN) [  919.901838] CPU:    23
          (XEN) [  919.901839] RIP:    e008:[<ffff82d04032ca4a>] arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901843] RFLAGS: 0000000000000012   CONTEXT: hypervisor
          (XEN) [  919.901845] rax: 0000000000000030   rbx: ffff83103fff7cf8   rcx: 0000000000000017
          (XEN) [  919.901846] rdx: ffff83103fff7df8   rsi: 0000000000000000   rdi: ffff83103fff7cf8
          (XEN) [  919.901847] rbp: 0000000000000017   rsp: ffff831033b87d00   r8:  0000000000000030
          (XEN) [  919.901849] r9:  ffff83103fff7cf8   r10: 0000000000000000   r11: 0000000000000000
          (XEN) [  919.901850] r12: 0000000000000000   r13: ffff82d040987680   r14: 00000000000000fb
          (XEN) [  919.901851] r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000007526e0
          (XEN) [  919.901852] cr3: 000000006162f000   cr2: 00007f233881e010
          (XEN) [  919.901853] fsb: 0000000000000000   gsb: 0000000000000000   gss: ffff9ee10f280000
          (XEN) [  919.901854] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
          (XEN) [  919.901857] Xen code around <ffff82d04032ca4a> (arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0):
          (XEN) [  919.901858]  1f 80 00 00 00 00 f3 90 <8b> 0a 39 c8 75 f8 eb 97 66 0f 1f 44 00 00 31 ff
          (XEN) [  919.901862] Xen stack trace from rsp=ffff831033b87d00:
          (XEN) [  919.901863]    ffff82d040987680 ffff82d04023201c ffff831033b87d98 00000000000000fb
          (XEN) [  919.901865]    ffff82d04031166c 0000000000000202 0000000000000000 0000000080000000
          (XEN) [  919.901867]    0000000000000000 0000000000000000 ffff831033b87fff 0000000000000000
          (XEN) [  919.901869]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
          (XEN) [  919.901870]    ffff831033b87fff 0000000000000000 ffff82d040201916 000000d5308191d5
          (XEN) [  919.901872]    000000d529ac0122 0000000000000017 ffff831033b916a0 ffff831033b91738
          (XEN) [  919.901873]    0000000000000060 0000000000000001 ffff82d040987680 ffff831033b87ef8
          (XEN) [  919.901875]    ffff82d040988200 ffff831033b8d06c 000000d5308187aa 0000000000000000
          (XEN) [  919.901877]    000000d5308191d5 ffff831033b916d0 000000fb00000000 ffff82d0402931f4
          (XEN) [  919.901879]    000000000000e008 0000000000000246 ffff831033b87e48 0000000000000000
          (XEN) [  919.901880]    ffff82d0402931ed 0000000000000000 0000000000000000 0000000000000000
          (XEN) [  919.901882]    ffff82d0409875e0 0000000000000017 ffff82d0409d5340 0000000000000017
          (XEN) [  919.901884]    0000000000000017 0000000000007fff ffff82d040820c00 ffff82d040987680
          (XEN) [  919.901885]    ffff82d0409d5340 ffff82d0403001bb ffff82d040988200 ffff82d0409803b0
          (XEN) [  919.901887]    ffff82d0403000e0 ffff831033b92000 ffff83132018e000 ffff83103ffc9000
          (XEN) [  919.901889]    0000000000000017 ffff8323a572e000 ffff82d040301f5e 000000000000003b
          (XEN) [  919.901891]    00007f2339a6a948 0000000000000003 00007f2338828840 00007f232b42a840
          (XEN) [  919.901893]    0000000000000002 00007f2339a6a8d8 00007f2339a6a950 0000000000000001
          (XEN) [  919.901894]    00000000004a2950 00007f2338813740 0000000000000000 0000000000000003
          (XEN) [  919.901896]    00000000009465e0 00007f233881dff0 000000fa00000000 00000000004a9499
          (XEN) [  919.901898] Xen call trace:
          (XEN) [  919.901899]    [<ffff82d04032ca4a>] R arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901902]    [<ffff82d04023201c>] S smp_call_function_interrupt+0x4c/0x90
          (XEN) [  919.901905]    [<ffff82d04031166c>] S do_IRQ+0x2bc/0x710
          (XEN) [  919.901907]    [<ffff82d040201916>] S common_interrupt+0x136/0x150
          (XEN) [  919.901911]    [<ffff82d0402931f4>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x204/0x3c0
          (XEN) [  919.901913]    [<ffff82d0402931ed>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x1fd/0x3c0
          (XEN) [  919.901916]    [<ffff82d0403001bb>] S arch/x86/domain.c#idle_loop+0xdb/0xf0
          (XEN) [  919.901918]    [<ffff82d0403000e0>] S arch/x86/domain.c#idle_loop+0/0xf0
          (XEN) [  919.901919]    [<ffff82d040301f5e>] S context_switch+0x1ee/0x900
          (XEN) [  919.901920] 
          (XEN) [  919.901927] CPU3	d[IDLE]v3	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901930] CPU2	d[IDLE]v2	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901934] CPU1	d[IDLE]v1	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901937] CPU0	d[IDLE]v0	e008:ffff82d04032c9d2 in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0x42/0xe0
          (XEN) [  919.901941] CPU4	d[IDLE]v4	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901945] CPU5	d[IDLE]v5	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901949] CPU6	d[IDLE]v6	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901952] CPU7	d[IDLE]v7	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901956] CPU8	d[IDLE]v8	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901960] CPU9	d[IDLE]v9	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901964] CPU10	d[IDLE]v10	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901969] CPU16	d[IDLE]v16	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901972] CPU17	d[IDLE]v17	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901975] CPU11	d[IDLE]v11	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901978] CPU22	d[IDLE]v22	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901983] CPU20	d0v11	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901986] CPU21	d[IDLE]v21	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901991] CPU14	d[IDLE]v14	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901994] CPU15	d[IDLE]v15	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.901998] CPU18	d[IDLE]v18	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.902002] CPU19	d[IDLE]v19	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.902006] CPU13	d[IDLE]v13	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.902009] CPU12	d[IDLE]v12	e008:ffff82d04032ca4a in Xen: arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0
          (XEN) [  919.912921] Non-responding CPUs: {24-47}
          (XEN) [  919.912922] 
          (XEN) [  919.912923] ****************************************
          (XEN) [  919.912923] Panic on CPU 23:
          (XEN) [  919.912924] FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
          (XEN) [  919.912925] ****************************************
          (XEN) [  919.912926] 
          (XEN) [  919.912926] Reboot in five seconds...
          (XEN) [  919.912928] Executing kexec image on cpu23
          (XEN) [  920.912554] Failed to shoot down CPUs {24-47}
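
For readers skimming the trace above: the watchdog fires on CPU23, CPUs 0-23 all report sitting in `time_calibration_std_rendezvous`, and CPUs 24-47 never answer. A pair of Xeon Gold 5118 (12 cores / 24 threads each) gives 48 logical CPUs, so the silent range may correspond to one whole socket, but that CPU-to-socket mapping is an assumption the thread does not confirm. The Python sketch below pulls those three facts out of a log excerpt. The function names (`expand_cpu_set`, `summarize_watchdog`) are made up for illustration and are not part of any Xen or XCP-ng tool.

```python
import re

# A few lines copied from the xen.log excerpt in this thread.
SAMPLE = (
    "(XEN) [  919.901833] Watchdog timer detects that CPU23 is stuck!\n"
    "(XEN) [  919.901927] CPU3\td[IDLE]v3\te008:ffff82d04032ca4a in Xen: "
    "arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0\n"
    "(XEN) [  919.901983] CPU20\td0v11\te008:ffff82d04032ca4a in Xen: "
    "arch/x86/time.c#time_calibration_std_rendezvous+0xba/0xe0\n"
    "(XEN) [  919.912921] Non-responding CPUs: {24-47}\n"
)


def expand_cpu_set(spec):
    """Expand a CPU set such as '24-47' or '0,2-3' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def summarize_watchdog(log_text):
    """Return the stuck CPU, the CPUs that reported state, and the silent CPUs."""
    stuck = re.search(r"Watchdog timer detects that CPU(\d+) is stuck", log_text)
    silent = re.search(r"Non-responding CPUs: \{([^}]*)\}", log_text)
    # Per-CPU state lines look like "CPU3 d[IDLE]v3 ..." or "CPU20 d0v11 ...".
    reported = re.findall(r"CPU(\d+)\s+d(?:\[IDLE\]|\d+)v\d+", log_text)
    return {
        "stuck_cpu": int(stuck.group(1)) if stuck else None,
        "reported": sorted({int(n) for n in reported}),
        "non_responding": expand_cpu_set(silent.group(1)) if silent else [],
    }


if __name__ == "__main__":
    print(summarize_watchdog(SAMPLE))
```

Running the same function over the full log shows at a glance which logical CPUs never responded, which can then be compared with the host's actual topology.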
          
          

          Cheers.

          • olivierlambert Vates 🪐 Co-Founder CEO

            Ouch. @andyhhp in case that trace rings a bell.

            • EddieA @olivierlambert

              @olivierlambert Any further thoughts or suggestions? (Move the PCIe cards around again?)

              Cheers.

              • olivierlambert Vates 🪐 Co-Founder CEO

                No, but maybe @Team-Hypervisor-Kernel does.

                • yannsionneau Vates 🪐 XCP-ng Team

                  Hello @eddiea

                  I've sent you a link in private so that you can upload all your log files.

                  Thanks

                  Regards,

                  Yann

                  • EddieA @yannsionneau

                    @yannsionneau Uploaded contents of /var/crash together with the output of "xen-bugtool --yestoall".

                    Cheers.

                    • TeddyAstie Vates 🪐 XCP-ng Team Xen Guru @EddieA

                      @EddieA Can you try different combinations of passed-through hardware in this VM?

                      For example, try passing through each device one at a time, at least in this VM.
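
One way to organize the one-device-at-a-time testing suggested above is to list the runs up front. The Python sketch below is only a planning aid: it does not touch the host or call any XCP-ng command, and the PCI addresses in `DEVICES` are hypothetical placeholders, not the devices from this thread. The function name `passthrough_test_plan` is likewise made up for illustration.

```python
from itertools import combinations

# Hypothetical PCI addresses; replace with the devices actually passed through.
DEVICES = ["0000:18:00.0", "0000:3b:00.0", "0000:af:00.0"]


def passthrough_test_plan(devices, max_group_size=None):
    """List device groups to test: singles first, then pairs, and so on."""
    if max_group_size is None:
        max_group_size = len(devices)
    plan = []
    for size in range(1, max_group_size + 1):
        plan.extend(combinations(devices, size))
    return plan


if __name__ == "__main__":
    for step, group in enumerate(passthrough_test_plan(DEVICES), start=1):
        print(f"run {step}: pass through {', '.join(group)}")
```

Testing singles before larger groups means the first failing run points either at one device or at the smallest combination that triggers the crash.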

                      • EddieA @EddieA

                        Give me a couple of days to try. It is (obviously) down to the combination of devices passed through, as I reported this earlier:

                        said in TrueNAS VM failing to start:

                        Re-boot XCP and start the TrueNAS VM with NO passthrough devices. As expected, that started up fine.

                        Cheers.
