    Need Help Understanding the VM Suspend Process

      kagbasi-wgsdac

      Good-day Folks,

      I need help understanding the VM Suspend process: why does it take so long for XO to suspend a VM, triggered by a Smart Reboot of a Host?

      My Environment:

      HOSTs: XCP-ng 2-node pool at v8.3.0 on HP (ProLiant DL360p Gen8)
      XO: Community Edition at commit c5ba7
      NETWORKING: 1Gbps Management only
      STORAGE: Shared NFS Storage Repository (hosted on a separate TrueNAS Server)

      Today, while applying the latest host patches, I wanted to try doing a Smart Reboot. After pressing the button and acknowledging the prompt that VMs will be suspended and then un-suspended after the reboot, I quickly navigated to the Tasks page to monitor the progress. I immediately saw one VM quickly show progress from 0%, 10%, 30%.......100% and boom, it was suspended. The others, however, not so much. As you can see from the two screenshots below, they are still pending suspension and the Estimated End keeps shifting to the right.

      I don't have a 10Gb Storage Network between the hosts yet (working on it). However, I didn't think that should have such an impact. Anyway, I don't think I have a proper understanding of how the suspend operation should be working, so if anyone cares to educate me, I would really appreciate it. Thank you.

      Screenshot Taken After Pressing Smart Reboot on the Master Host
      Screenshot 2025-07-22 091604.png

      Screenshot Taken After Pressing Smart Reboot on the Master Host (10 minutes later)
      Screenshot 2025-07-22 092641.png

        olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        Suspend means all the VM memory (RAM) has to be written to the Suspend SR. More RAM == more time to suspend.
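
As a rough illustration of that relationship, here is a minimal sketch that estimates how long writing a VM's RAM would take at a given sustained throughput. The function name, the example RAM sizes and the 1 Gbps best case are assumptions for illustration only; a real suspend adds protocol overhead, sync latency and contention that this ignores.

```python
def estimate_suspend_seconds(ram_gib: float, throughput_mbit_s: float) -> float:
    """Estimate the seconds needed to write ram_gib of memory at throughput_mbit_s.

    Hypothetical helper: ignores protocol overhead, sync latency and contention.
    """
    if throughput_mbit_s <= 0:
        raise ValueError("throughput must be positive")
    ram_bits = ram_gib * 1024**3 * 8                     # GiB -> bits
    return ram_bits / (throughput_mbit_s * 1_000_000)    # Mb/s -> bit/s


if __name__ == "__main__":
    for ram in (4, 16, 64):  # example VM sizes, not taken from the thread
        secs = estimate_suspend_seconds(ram, 1000)  # 1 Gbps link, best case
        print(f"{ram:>3} GiB RAM -> about {secs / 60:.1f} min")
```

The estimate scales linearly: doubling the RAM, or halving the usable throughput, doubles the time.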

          kagbasi-wgsdac @olivierlambert

          @olivierlambert Aaah, I've always wondered what the Suspend SR on the Advanced tab of the Pool meant......now I know.

          So, the NIC capacity/pipe between the hosts and the SR does really matter here.

            kagbasi-wgsdac

            Sharing this so others might benefit from what I'm learning.

            So, I looked at the network performance on TrueNAS during the Smart Reboot of the second XCP-ng host (screenshot below). What I saw seems to suggest that I'm getting near wire speed during READ operations. However, WRITE operations seem to be hitting a ceiling and I have a feeling it might be due to me having SYNC enabled on the dataset.

            Screenshot 2025-07-22 122432.png

              olivierlambert Vates 🪐 Co-Founder CEO

              Indeed, sync is likely the cause of that. And since you have to write the RAM, it can be a lot of GiB for big VMs.

              You can experiment by disabling sync temporarily and testing again.

              Note that a future improvement will be to save the VM RAM while the VM is running (instead of pausing it), reducing the "downtime". But this won't change the fact you must write the RAM somewhere, and this takes time.

                kagbasi-wgsdac @olivierlambert

                @olivierlambert Yes, I do plan on testing with SYNC disabled and then again with several permutations of dataset changes on the TrueNAS side (like compression on/off, etc.).

                Do you guys have a best practices document for setting up an NFS SR using TrueNAS? I browsed through the published XCP-ng documentation site but didn't find anything specific to TrueNAS or maybe I missed it.

                  olivierlambert Vates 🪐 Co-Founder CEO

                  Nothing very specific. Sync vs. async should be the biggest change.

                    kagbasi-wgsdac @olivierlambert

                    @olivierlambert Oh okay, thanks for responding.

                    So I turned off SYNC and COMPRESSION on the dataset and retested (by suspending 11 VMs). I immediately noticed a whopping performance improvement (essentially sustained wire speeds AND a 50% faster completion time):

                    • Roughly 984 Mb/s sustained WRITE speeds (during VM suspension)
                    • Roughly 984 Mb/s sustained READ speeds (during VM resumption)
                    • Transfer time for both READ and WRITE is about 20 minutes (down from 40-45 mins)
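
A back-of-the-envelope check on those numbers can be sketched as follows. It assumes the 984 Mb/s rate held for the full 20 minutes and that the earlier sync-enabled run moved the same amount of data in 40-45 minutes; both are assumptions, since the thread doesn't state the total RAM across the 11 VMs. The helper names are hypothetical.

```python
MBIT = 1_000_000  # bits per megabit


def data_moved_gib(rate_mbit_s: float, minutes: float) -> float:
    """Data transferred, in GiB, at a sustained rate over the given minutes."""
    return rate_mbit_s * MBIT * minutes * 60 / 8 / 1024**3


def implied_rate_mbit_s(gib: float, minutes: float) -> float:
    """Average rate in Mb/s needed to move gib of data in the given minutes."""
    return gib * 1024**3 * 8 / (minutes * 60) / MBIT


if __name__ == "__main__":
    total = data_moved_gib(984, 20)
    print(f"~{total:.0f} GiB moved at 984 Mb/s over 20 min")
    for mins in (40, 45):
        rate = implied_rate_mbit_s(total, mins)
        print(f"same data in {mins} min -> ~{rate:.0f} Mb/s average")
```

Under those assumptions, this works out to roughly 137 GiB of RAM written, and an average of somewhere around 440 to 490 Mb/s for the earlier sync-enabled run.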

                    Screenshot 2025-07-22 170106.png

                    Gonna retest with SYNC disabled and COMPRESSION re-enabled and see if it degrades performance; stand by for another report.

                      kagbasi-wgsdac @kagbasi-wgsdac

                      Here's the latest, and probably last, test. With SYNC still disabled, re-enabling compression had no appreciable impact on performance. I am now fully convinced that SYNC is the major player here.

                      Screenshot 2025-07-23 012851.png

                        olivierlambert Vates 🪐 Co-Founder CEO

                        Yes, it is, and it's pretty logical: with sync disabled, you don't have to wait for confirmation that each block has actually been written to the drive.
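
To get a feel for that effect, here is a minimal local-file sketch that writes the same data twice, once letting the OS buffer the writes and once forcing every block to stable storage before continuing. It is only an analogy and not what NFS or ZFS does internally; the block size, block count and file name are arbitrary assumptions, and the timings will vary with the disk and filesystem.

```python
import os
import tempfile
import time

BLOCK = b"\0" * (1024 * 1024)  # 1 MiB block of zeros
N_BLOCKS = 32                  # small on purpose; real RAM dumps are far larger


def write_blocks(path: str, sync_each_block: bool) -> float:
    """Write N_BLOCKS blocks to path and return the elapsed seconds.

    With sync_each_block=True, os.fsync() is called after every block, so the
    next write only starts once the OS reports the previous one as on disk.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(N_BLOCKS):
            f.write(BLOCK)
            if sync_each_block:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # block until the OS confirms the write
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for sync in (False, True):
            secs = write_blocks(os.path.join(d, "ram.img"), sync)
            print(f"sync_each_block={sync}: {secs:.3f} s")
```

On most systems the run with per-block syncing takes noticeably longer, because each write pays the round trip to storage before the next one can begin.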

                          kagbasi-wgsdac @olivierlambert

                          @olivierlambert Yes sir, it is, and I'm glad I confirmed this for myself. Thanks also for helping me understand how the VM Suspend process works. Hopefully this post helps other newbies reach the same understanding in the future.

                          • olivierlambertO olivierlambert marked this topic as a question
                          • olivierlambertO olivierlambert has marked this topic as solved