XCP-ng

    Async.VM.pool_migrate stuck at 57%

    • olivierlambert (Vates 🪐 Co-Founder CEO)

      Hi,

      It's likely not an XO problem, but an issue with XCP-ng.

      1. Check that your OS has static RAM settings and enough RAM
      2. Do you have the guest tools installed in your OS?
      3. Is time synced between the hosts?

      (A quick CLI sketch for checking all three follows below.)
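
      A rough way to verify all three from the CLI; the xe field names below are standard VM fields, but xe vm-param-list uuid=<vm-uuid> will confirm what your XCP-ng version exposes:

          # 1. Static memory: dynamic-min = dynamic-max = static-max
          xe vm-param-get uuid=<vm-uuid> param-name=memory-static-max
          xe vm-param-get uuid=<vm-uuid> param-name=memory-dynamic-min
          xe vm-param-get uuid=<vm-uuid> param-name=memory-dynamic-max

          # 2. Guest tools: a populated version string means the agent reported in
          xe vm-param-get uuid=<vm-uuid> param-name=PV-drivers-version

          # 3. Time sync: run on each host (XCP-ng 8.x ships chrony)
          chronyc tracking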
      • wmazren @olivierlambert

        @olivierlambert

        1. Check your OS is having static RAM settings and enough RAM

          Yes
          (screenshot attached)

        2. Do you have tools installed in your OS?

          Yes
          (screenshot attached)

        3. Time sync between the hosts?

          Yes

        Anything else I can check?

        (screenshot attached)

        Best regards,
        Azren

        • olivierlambert (Vates 🪐 Co-Founder CEO)

          Do you have the issue with all guests or just this VM?

          • MajorP93 @wmazren

            @wmazren I had a similar issue which cost me many hours to troubleshoot.

            I'd advise you to check the "dmesg" output within the VM that cannot be live-migrated.

            XCP-ng / Xen behaves differently from VMware regarding live migration.

            XCP-ng interacts with the Linux kernel upon live migration, and the kernel will try to freeze all processes before performing the migration.

            In my case a "fuse" process blocked the graceful freezing of all processes, and my live migration task was likewise stuck in the task view, similar to your case.

            After solving the fuse process issue, and thereby making the system able to live-migrate, the problem was gone.

            All of this can be viewed in dmesg, as the kernel will tell you what is being done during live migration via XCP-ng.

            //EDIT: another thing you might want to try is toggling "migration compression" in the pool settings, as well as making sure you have a dedicated connection / VLAN configured for live migration. Those two things also made my live migrations faster and more robust.
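
            A minimal sketch of the compression toggle from the CLI, assuming an XCP-ng 8.3 pool; the migration-compression pool field appeared around that release, so check xe pool-param-list on your version first:

                # Run on the pool master
                POOL_UUID=$(xe pool-list --minimal)

                # Show the current setting
                xe pool-param-get uuid=$POOL_UUID param-name=migration-compression

                # Enable compressed live-migration streams
                xe pool-param-set uuid=$POOL_UUID migration-compression=true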

            • sid @MajorP93

              I also went troubleshooting and found the same as @MajorP93. Specifically, I saw this in the kernel logs (viewable either in dmesg or with journalctl -k):

              Freezing of tasks failed after 20.005 seconds (1 task refusing to freeze, wq_busy=1)

              Quoting askubuntu.com:

              Before going into suspend (or hibernate, for that matter), user space processes and (some) kernel threads get frozen. If the freezing fails, it will be due to either a user space process or a kernel thread failing to freeze.

              To freeze a user space process, the kernel sends it a signal that is handled automatically and, once received, cannot be ignored. If, however, the process is in the uninterruptible sleep state (e.g. waiting for I/O that cannot complete because the device is unavailable), it will not receive the signal straight away. If this delay lasts longer than 20 seconds (the default freeze timeout; see /sys/power/pm_freeze_timeout, in milliseconds), the freezing will fail.

              NFS, CIFS and FUSE, amongst others, have historically been known for causing issues like that.

              Also from that post:

              You can grep for the problematic task like this: # dmesg | grep "task.*pid"

              In my case it was Prometheus Docker containers.
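
              A short sketch of those checks, run inside the guest; /sys/power/pm_freeze_timeout is a standard Linux sysfs knob, and raising it is a workaround rather than a fix:

                  # Find the freeze failure and the offending task
                  dmesg | grep -i "refusing to freeze"
                  dmesg | grep "task.*pid"

                  # Current freeze timeout in milliseconds (default 20000)
                  cat /sys/power/pm_freeze_timeout

                  # Optional workaround: allow slow tasks up to 60 s to freeze (as root)
                  echo 60000 > /sys/power/pm_freeze_timeout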

              • wmazren @olivierlambert

                @olivierlambert

                This happens to other VMs as well.

                Best regards,
                Azren

                • olivierlambert (Vates 🪐 Co-Founder CEO)

                  I would check the XCP-ng logs to see what's going on with the migration, and also make sure you are fully up to date on your 8.3.
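
                  For reference, a way to follow that from the host's CLI, assuming a stock XCP-ng install where XAPI logs to /var/log/xensource.log:

                      # Follow XAPI's view of the migration (run on the host, e.g. over SSH)
                      tail -f /var/log/xensource.log | grep -i migrate

                      # List pending updates (XCP-ng 8.x is yum-based)
                      yum check-update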

                  What kind of hardware do you have?

                  • wmazren @sid

                    @sid

                    My dmesg...

                    (screenshot attached)

                    This is the XO VM that I'm trying to migrate, but the issue also happens to other VMs running MS Windows.

                    Best regards,
                    Azren

                    • wmazren @olivierlambert

                      @olivierlambert

                      Both hosts are dual-processor Dell PowerEdge R760s with 512 GB of memory. They are missing this month's patches; I'm trying to live-migrate the VMs onto one host so that I can start installing patches and reboot.

                      Host #1:

                      (screenshot attached)

                      Host #2:

                      (screenshot attached)

                      Host #1 dmesg:

                      (screenshot attached)

                      Host #2 dmesg:

                      (screenshot attached)

                      • wmazren @wmazren

                        It appears that the issue is related to Host #1. Any migration into or out of Host #1 tends to cause problems. Occasionally, virtual machines (VMs) lose network connectivity during migration and become unresponsive — they cannot be shut down, powered off (even forcefully), or restarted, often getting stuck in the process.

                        I’ve added Host #3 to the pool. Migration between Host #2 and Host #3 works smoothly in both directions.

                        Any idea how I can kill the stuck VM?

                        xe vm-reset-powerstate force=true vm=MYVM03
                        This operation cannot be completed because the server is still live.
                        host: cb8311e8-d0fd-4d53-be99-fe3fea2c9351 (HOST01)
                        

                        Best regards,
                        Azren

                        • olivierlambert (Vates 🪐 Co-Founder CEO)

                          Is it the pool master?

                          • wmazren @olivierlambert

                            @olivierlambert

                            I've already moved the pool master from host #1 to host #2.

                            Best regards,
                            Azren

                            • olivierlambert (Vates 🪐 Co-Founder CEO)

                              Then reboot that broken host, and in the meantime re-issue the power reset command from the master.
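
                              A sketch of that sequence from the pool master, reusing the VM name from the error above; host-declare-dead is destructive and only for a host that stays unreachable, so use it only once the host is confirmed powered off:

                                  # From the pool master, once the broken host has rebooted (or is fenced):
                                  xe vm-reset-powerstate vm=MYVM03 force=true

                                  # If the host never comes back, mark it dead first (destructive!)
                                  xe host-declare-dead uuid=<host-uuid>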
