XCP-ng

    TrueNAS VM failing to start

    • olivierlambert Vates 🪐 Co-Founder CEO

      No but maybe @Team-Hypervisor-Kernel does

      • yannsionneau Vates 🪐 XCP-ng Team

        Hello @eddiea

        I've sent you a link in private so that you can upload all your log files.

        Thanks

        Regards,

        Yann

        • EddieA @yannsionneau

          @yannsionneau Uploaded contents of /var/crash together with the output of "xen-bugtool --yestoall".

          Cheers.

          • TeddyAstie Vates 🪐 XCP-ng Team Xen Guru @EddieA

            @EddieA Can you try different combinations of passed-through hardware in this VM?

            E.g. try each device on its own, one at a time, at least in this VM.
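
            A minimal sketch of how that one-at-a-time test could be scripted as a dry run. It only prints the `xe vm-param-set ... other-config:pci=0/<address>` commands rather than running them. The PCI addresses and the `<vm-uuid>` placeholder are hypothetical, not taken from this thread, and the helper name is made up for illustration. It assumes the devices are already hidden from dom0 and that the VM is shut down before each change.

            ```shell
            #!/usr/bin/env bash
            # Dry run: print the xe command for testing each passthrough device alone.
            # The PCI addresses and VM UUID below are placeholders (assumptions).

            # Print one "xe vm-param-set" command per device; each command attaches
            # only that single device to the VM.
            print_single_device_tests() {
                local vm_uuid="$1"
                shift
                local bdf
                for bdf in "$@"; do
                    echo "xe vm-param-set uuid=${vm_uuid} other-config:pci=0/${bdf}"
                done
            }

            print_single_device_tests "<vm-uuid>" 0000:04:00.0 0000:81:00.0
            ```

            Run each printed command in turn, boot the VM, and note which device (if any) reproduces the host lock-up before moving on to the next one.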

            • EddieA @EddieA

              Give me a couple of days to try. It is (obviously) down to the combination of devices passed through, as I reported earlier:

              said in TrueNAS VM failing to start:

              Re-boot XCP and start the TrueNAS VM with NO passthrough devices. As expected, that started up fine.

              Cheers.

              • EddieA @EddieA

                OK, not really sure what's going on. I fired XCP back up to try what @teddyastie suggested.

                Looking at the specs for the TrueNAS VM before booting it, I found it now had zero passthrough devices attached, which (from memory) wasn't the state it was in the last time I tried. So I re-added all but one passthrough device, a GPU. I booted TrueNAS and this time it came up.

                Bingo, I thought, the GPU is the issue, but based on my background, I had to try again with the GPU included to prove it was the culprit. Well, what do you know, after adding it back in, TrueNAS now starts perfectly. One theory destroyed.

                All I can think is that the passthrough definitions in the VM config were somehow corrupted, and that finding them all gone and re-adding them fixed this. Who knows.

                But all appears to be good again (for now).

                • yannsionneau Vates 🪐 XCP-ng Team @EddieA

                  @EddieA OK, good that your setup is now fully operational!

                  Let's classify this as a self-resolved problem for now.
                  Don't hesitate to ping us again if the issue comes back.

                  • AshleyDe @yannsionneau

                    That is a frustrating loop to be in, especially with TrueNAS. Usually, when the VM fails to start after a change, it’s because XCP-ng is trying to pass through a PCI device (like an HBA) that isn't being released properly by the host.
                    Have you checked if the "hide" parameters in your grub config are still correct? Sometimes an update can reset those, and the host grabs the controller before the VM can. Another thing to try is toggling the BIOS/UEFI mode in the VM settings - TrueNAS can be picky about that depending on which version you’re running.
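
                    A rough sketch of that check, assuming the devices were hidden from dom0 with the `xen-pciback.hide` boot parameter set through `/opt/xensource/libexec/xen-cmdline`. The helper name `is_hidden` and the PCI addresses are made up for illustration. The host query is left commented out so the snippet has no side effects.

                    ```shell
                    #!/usr/bin/env bash
                    # Return success if the given PCI address appears in a xen-pciback.hide
                    # value of the form "(0000:04:00.0)(0000:81:00.0)".
                    # The addresses used here are placeholders (assumptions).
                    is_hidden() {
                        local hide_value="$1" bdf="$2"
                        case "$hide_value" in
                            *"(${bdf})"*) return 0 ;;
                            *) return 1 ;;
                        esac
                    }

                    # On the host, the current value could be read like this (commented out):
                    # current="$(/opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide)"
                    # is_hidden "$current" 0000:04:00.0 && echo "hidden" || echo "NOT hidden"

                    # Self-contained demonstration with a sample value:
                    is_hidden "xen-pciback.hide=(0000:04:00.0)(0000:81:00.0)" 0000:81:00.0 && echo "hidden"
                    ```

                    If a device that should be passed through no longer shows up in the value, dom0 may be claiming it at boot before the VM can.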

                    • EddieA @AshleyDe

                      Wearing my best Lazarus cosplay outfit, I'll apologise for the resurrection.

                      Today I had an issue with my UPS which caused me to reboot XCP a few times. During those reboots I had at least 2, maybe 3, recurrences of this, where XCP would lock up while TrueNAS was booting. Most of the time, after a power cycle of the server, the next boot would start TrueNAS cleanly. One time it took 2 power cycles before success.

                      Unfortunately only one of the crashes resulted in a /var/crash report, but that did have the same symptoms as my original report:

                      (XEN) [   81.101362] Non-responding CPUs: {24-47}
                      (XEN) [   81.101363]
                      (XEN) [   81.101364] ****************************************
                      (XEN) [   81.101365] Panic on CPU 5:
                      (XEN) [   81.101366] FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
                      (XEN) [   81.101366] ****************************************
                      (XEN) [   81.101367]
                      (XEN) [   81.101368] Reboot in five seconds...
                      (XEN) [   81.101369] Executing kexec image on cpu5
                      (XEN) [   82.101441] Failed to shoot down CPUs {24-47}
                      

                      Between my original report and today, I have rebooted other times, following updates, when this issue has not surfaced.

                      Does anyone think this could be hardware related, despite all the memory testing and stress testing I did when I built the server and again after the original issue, all with no faults? Or have I just hit an unlucky set of circumstances with some sort of race condition?

                      • tuxen Top contributor @EddieA

                        @EddieA Looking at the original crash report, it could be the MWAIT instruction bug that some Intel CPUs have. For troubleshooting purposes, apply this Xen boot option:

                        /opt/xensource/libexec/xen-cmdline --set-xen mwait-idle=false
                        

                        After that, try some reboots/power cycles and let's see if you can reproduce the issue.
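
                        A small sketch for confirming, after a reboot, that the option actually reached the running hypervisor. It assumes the booted Xen command line is visible in the `xen_commandline` field of `xl info`. The helper name and the sample command line are made up for illustration, and the `--delete-xen` call for undoing the change is assumed to mirror `--set-xen`.

                        ```shell
                        #!/usr/bin/env bash
                        # Return success if a Xen command-line string contains mwait-idle=false
                        # as a whole, space-separated option.
                        has_mwait_idle_disabled() {
                            case " $1 " in
                                *" mwait-idle=false "*) return 0 ;;
                                *) return 1 ;;
                            esac
                        }

                        # On the host, after rebooting (commented out here):
                        # has_mwait_idle_disabled "$(xl info | grep xen_commandline)" && echo "disabled"
                        #
                        # To undo the change once troubleshooting is done (assumed counterpart):
                        # /opt/xensource/libexec/xen-cmdline --delete-xen mwait-idle

                        # Self-contained demonstration with a made-up command line:
                        has_mwait_idle_disabled "dom0_mem=8192M mwait-idle=false watchdog" && echo "disabled"
                        ```

                        Since this is a boot option, it only takes effect from the next host boot onwards.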
