XCP-ng

    Async.VM.pool_migrate stuck at 57%

    • wmazren

      Hi,

      I'm having an issue with live migration between XCP-ng hosts in a pool. The migration itself looks OK: the VM migrated from Host #2 to Host #1, but the task is stuck at 57% (Async.VM.pool_migrate stuck at 57%). I have to restart the toolstack to make the tasks go away. Any idea?

      I'm using XO from source and on the latest commit.
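
      For reference, this is how I've been inspecting the stuck task and clearing it from the pool master CLI (a sketch; the task UUID is a placeholder):

        # list pending tasks with their progress
        xe task-list params=uuid,name-label,status,progress
        # try cancelling the stuck migration task first
        xe task-cancel uuid=<task-uuid>
        # last resort: restart the toolstack on the host (running VMs are unaffected)
        xe-toolstack-restart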

      [screenshots: migration task stuck at 57% in the task view]

      Thank you.

      Best regards,
      Azren

      • olivierlambert (Vates 🪐 Co-Founder & CEO)

        Hi,

        It's likely not an XO problem, but an issue with XCP-ng.

        1. Check that your VM has static RAM settings and enough RAM
        2. Do you have the guest tools installed in your OS?
        3. Is time synced between the hosts?
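
        A quick way to verify all three from the CLI (a sketch; the VM UUID is a placeholder, and the time check assumes XCP-ng 8.x where chrony is used):

          # 1. memory is static when dynamic-min = dynamic-max = static-max
          xe vm-param-get uuid=<vm-uuid> param-name=memory-static-max
          xe vm-param-get uuid=<vm-uuid> param-name=memory-dynamic-min
          xe vm-param-get uuid=<vm-uuid> param-name=memory-dynamic-max
          # 2. guest tools: empty output means no tools detected
          xe vm-param-get uuid=<vm-uuid> param-name=PV-drivers-version
          # 3. time sync: run on each host and compare
          chronyc tracking
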
        • wmazren @olivierlambert

          @olivierlambert

          1. Check that your VM has static RAM settings and enough RAM

            Yes
            [screenshot: VM memory settings]

          2. Do you have the guest tools installed in your OS?

            Yes
            [screenshot: guest tools detected]

          3. Is time synced between the hosts?

            Yes

          Anything else I can check?

          [screenshot]

          Best regards,
          Azren

          • olivierlambert (Vates 🪐 Co-Founder & CEO)

            Do you have the issue with all guests or just this VM?

            • MajorP93 @wmazren

              @wmazren I had a similar issue which cost me many hours to troubleshoot.

              I'd advise you to check the "dmesg" output within the VM that cannot be live migrated.

              XCP-ng / Xen behaves differently from VMware regarding live migration.

              XCP-ng interacts with the Linux kernel upon live migration, and the kernel will try to freeze all processes before performing the migration.

              In my case a "fuse" process blocked the graceful freezing of all processes, and my live migration task was also stuck in the task view, similar to your case.

              After solving the fuse process issue, and thereby making the system able to live migrate, the issue was gone.

              All of this can be viewed in dmesg, as the kernel will tell you what is being done during a live migration via XCP-ng.
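
              For example, inside the guest (a sketch; the exact messages vary by kernel version):

                # look for freeze/suspend activity around the migration attempt
                dmesg | grep -i -E "freez|suspend"
                # or, with timestamps, from the kernel journal
                journalctl -k | grep -i freez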

              //EDIT: another thing you might want to try is toggling "migration compression" in the pool settings, as well as making sure you have a dedicated connection / VLAN configured for live migration. Those 2 things also helped make my live migrations faster and more robust.
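
              If you prefer the CLI over XO for that toggle, something like this should work (a sketch; assumes XCP-ng 8.3, where the pool-level migration-compression flag exists):

                # find the pool UUID
                xe pool-list --minimal
                # enable compression for live migrations in the pool
                xe pool-param-set uuid=<pool-uuid> migration-compression=true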

              • sid @MajorP93

                I also went troubleshooting and found the same as @MajorP93. Specifically, I saw this in the kernel logs (viewable either in dmesg or using journalctl -k):

                Freezing of tasks failed after 20.005 seconds (1 task refusing to freeze, wq_busy=1)
                

                Quoting askubuntu.com:

                Before going into suspend (or hibernate for that matter), user space processes and (some) kernel threads get frozen. If the freezing fails, it will either be due to a user space process or a kernel thread failing to freeze.

                To freeze a user space process, the kernel sends it a signal that is handled automatically and, once received, cannot be ignored. If, however, the process is in the uninterruptible sleep state (e.g. waiting for I/O that cannot complete due to the device being unavailable), it will not receive the signal straight away. If this delay lasts longer than 20s (=default freeze timeout, see /sys/power/pm_freeze_timeout (in milliseconds)), the freezing will fail.

                NFS, CIFS and FUSE amongst others have been historically known for causing issues like that.
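
                Based on that, the freeze timeout can be checked and raised from inside the guest (a sketch; 60000 ms is an arbitrary example, and the value does not persist across reboots):

                  # current freeze timeout in milliseconds (default 20000)
                  cat /sys/power/pm_freeze_timeout
                  # temporarily raise it to 60 seconds
                  echo 60000 > /sys/power/pm_freeze_timeout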

                Also from that post:

                You can grep the problematic task like this: # dmesg | grep "task.*pid"

                In my case it was Prometheus Docker containers.

                • wmazren @olivierlambert

                  @olivierlambert

                  This happens to other VMs as well.

                  Best regards,
                  Azren

                  • olivierlambert (Vates 🪐 Co-Founder & CEO)

                    I would check the XCP-ng logs to see what's going on regarding the migration, and also make sure you are fully up to date on your 8.3.
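
                    For example, on each host (a sketch; xensource.log is the main XAPI log on XCP-ng):

                      # follow the XAPI log while retrying the migration
                      tail -f /var/log/xensource.log
                      # check for and install pending updates
                      yum update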

                    What kind of hardware do you have?

                    • wmazren @sid

                      @sid

                      My dmesg...

                      [screenshot: dmesg output]

                      This is the XO VM that I'm trying to migrate, but the issue also happens to other VMs running MS Windows.

                      Best regards,
                      Azren

                      • wmazren @olivierlambert

                        @olivierlambert

                        Both hosts are dual-processor Dell PowerEdge R760s with 512 GB of memory. They are missing this month's patches; I'm trying to live migrate all VMs to one host so that I can start installing patches and reboot.

                        Host #1

                        [screenshot: host summary]

                        Host #2

                        [screenshot: host summary]

                        Host #1: dmesg

                        [screenshot]

                        Host #2: dmesg

                        [screenshot]

                        • wmazren @wmazren

                          It appears that the issue is related to Host #1. Any migration into or out of Host #1 tends to cause problems. Occasionally, virtual machines (VMs) lose network connectivity during migration and become unresponsive — they cannot be shut down, powered off (even forcefully), or restarted, often getting stuck in the process.

                          I’ve added Host #3 to the pool. Migration between Host #2 and Host #3 works smoothly in both directions.

                          Any idea how I can kill the stuck VM?

                          xe vm-reset-powerstate force=true vm=MYVM03
                          This operation cannot be completed because the server is still live.
                          host: cb8311e8-d0fd-4d53-be99-fe3fea2c9351 (HOST01)
                          

                          Best regards,
                          Azren

                          • olivierlambert (Vates 🪐 Co-Founder & CEO)

                            Is it the pool master?

                            • wmazren @olivierlambert

                              @olivierlambert

                              I've already moved the pool master from Host #1 to Host #2.

                              Best regards,
                              Azren

                              • olivierlambert (Vates 🪐 Co-Founder & CEO)

                                Then reboot that broken host and, in the meantime, re-issue the power reset command from the master.
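
                                A possible sequence from the pool master, reusing the names from your output (a sketch; substitute your own host and VM names):

                                  # keep new VMs off the broken host
                                  xe host-disable host=HOST01
                                  # reboot HOST01 out-of-band (e.g. via iDRAC), since it is unresponsive,
                                  # then re-issue the reset from the master:
                                  xe vm-reset-powerstate vm=MYVM03 force=true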
