XCP-ng
    Potential bug with Windows VM backup: "Body Timeout Error"

    • Hex

      We are facing a Xen Orchestra backup issue: full VM backups fail with "Body Timeout Error".
      It happens under specific conditions. After some experimentation, we can reproduce it consistently on several hosts.
      Findings below.

      thumbnail_image.png

      Here are log entries from:
      Xen Orchestra (/var/log/syslog):
      image (1).png

      XCP-ng host (/var/log/xensource.log):
      Outlook-rasnzyjg.png

      (Note that Xen Orchestra registers the error and sends its report much sooner, almost at the beginning of the VM backup task, while the XCP-ng host only logs it at the real end of the task. I was monitoring the tasks in Xen Orchestra.)

      This happens when both of the following conditions are met:

      1. Compression is on (Zstd or GZIP, it does not matter which).
      2. The VM's virtual disk has a lot of free space (about 150GB of free space is enough for backups to start failing sometimes; with 1TB of free space they fail 99% of the time) OR the VM disk is uninitialized/unformatted (e.g. a freshly created and attached virtual disk).

      Additional things we noticed:

      • Windows VMs with large virtual disks (around 1TB) fail every time (tested several dozen times). Virtual disks with about 150GB of free space fail only sometimes, usually when backed up in parallel with other VMs. The OS version or type (we tested Windows Server 2022, Windows Server 2012 R2 and Windows 11), the partition table (MBR or GPT) and the filesystem (NTFS, ReFS, exFAT) do not matter.

      • Backing up Linux VMs with a lot of free space usually does NOT fail (roughly 99% of backups succeed), but backups run in parallel with other VMs sometimes fail. We tested on Debian 12.9 with a virtual disk that has about 1TB of free space.

      • The behavior does not change when switching XCP-ng host or storage provisioning type (thick or thin). The XCP-ng hosts (running LTS 8.2.1) and Xen Orchestra are up to date.

      • olivierlambert Vates 🪐 Co-Founder CEO

        Ping @lsouai-vates

        • archw @Hex

          @Hex
          Ditto
          https://xcp-ng.org/forum/topic/10532/backup-failed-with-body-timeout-error/8

          I have it happen on almost every large backup. I had to give up. For the VMs that would not back up, I moved to delta backups.

          FWIW, here were my results:

          Regular backup to TrueNAS, “compression” set to “Zstd”: backup fails.
          Regular backup to TrueNAS, “compression” set to “disabled”: backup is successful.
          Regular backup to vanilla Ubuntu test VM, “compression” set to “Zstd”: backup is successful.
          Delta backup to TrueNAS: backup is successful.
          
          • lsouai-vates Vates 🪐 XO Team @Hex

            @Hex Hello, thanks for your bug report. I am informing the Xen Orchestra team and will keep you posted when I have answers that can help you.
            Have a good day.

            • lsouai-vates Vates 🪐 XO Team @Hex

              @Hex I have some answers from XO Team that I hope will be able to help you:

              "During full backup, XO downloads an XVA file from the XCP-ng/XenServer host.
              This error means that it did not get an answer after a configured delay (5 mins by default).
              There can be an XO issue somewhere but the problem lies most likely on the host's side..."

              "The timeout is not the bug, only the failsafe to not lock the process."

              • olivierlambert Vates 🪐 Co-Founder CEO

                This means we probably need to check with the XCP-ng team. But first, @Hex, can you try to reproduce the issue with the xe CLI? If you can, then it's clearly not XO's fault.

                • archw @olivierlambert

                  @olivierlambert
                  Since I'm having the same issue, can I give that suggestion a shot? If so, how do you do it from the command line (with xe CLI)?

                  • olivierlambert Vates 🪐 Co-Founder CEO

                    xe vm-export filename=export_filename compress=true (compress=true selects gzip; use compress=zstd for Zstd compression)

                    Note: make sure to mount a share with enough space and do not export directly to the dom0 root filesystem, otherwise you'll fill it.
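
A minimal sketch of the full invocation, assuming a VM called `test-vm` and a share mounted at `/mnt/export` (both hypothetical). `xe vm-export` also needs a VM selector (`vm=` or `uuid=`), which the one-liner above leaves out. The helper only prints the command so it can be reviewed before it is run on the host; the file extensions are an arbitrary naming choice.

```shell
#!/bin/sh
# Build (but do not run) an xe vm-export command line.
# Arguments: VM name-label or UUID, target directory, compression (gzip|zstd|none).
# The target directory is assumed to be a mounted share, not the dom0 root.
build_export_cmd() {
    vm="$1"; dir="$2"; comp="$3"
    case "$comp" in
        gzip) opt=" compress=true"; ext="xva.gz" ;;   # compress=true means gzip
        zstd) opt=" compress=zstd"; ext="xva.zst" ;;
        none) opt=""; ext="xva" ;;
        *) echo "unknown compression: $comp" >&2; return 1 ;;
    esac
    printf 'xe vm-export vm=%s filename=%s/%s.%s%s\n' "$vm" "$dir" "$vm" "$ext" "$opt"
}
```

Once the printed line looks right, it could be run with `eval "$(build_export_cmd test-vm /mnt/export zstd)"`. A name-label containing spaces would break this, so passing the VM UUID is safer.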

                    • Hex @olivierlambert

                      @olivierlambert
                      I have tested several scenarios on a 1TB test VM.
                      I mounted a share from the same storage where the XO backups are stored.

                      1. Export of shutdown VM with zstd - succeeded
                      2. Export of snapshot while VM is running with zstd - succeeded
                      3. Export of snapshot while VM is running with gzip - succeeded

                      a188f8b2-469e-4ad2-9778-547a29576910-image.png

                      • olivierlambert Vates 🪐 Co-Founder CEO

                        That's a very interesting result 🙂 It means the problem is either an interaction between XO and XAPI, or on XO's side, but not simply an XCP-ng issue as we could have thought initially 🤔

                        Can you check whether the XVA file works when importing it (in case xe fails silently)? Use xe vm-import.

                        • Hex @olivierlambert

                          @olivierlambert
                          I tested one of the previous exports: the import with "xe vm-import" was successful, and the Windows VM starts normally.

                          • olivierlambert Vates 🪐 Co-Founder CEO

                            So maybe there's a timeout that's too long for XO. Adding @florent and/or @julien-f in the loop.

                            • Greg_E @olivierlambert

                              @olivierlambert

                              Boosting this because it looks like I have a Windows Server 2022 VM that is going to keep failing. It also has more than 150GB of free space and I was thinking of shrinking it down (if only I could pull a good backup in case it breaks). I no longer need that much space.

                              That said, a Linux VM with more free space went zooming right along, way faster than the Server 2022 that I was also backing up at the same time. This other Server 2022 succeeded, but I'll want to try a second on all my Windows backups to make sure they work before starting them on a schedule.

                              I saw a Delta style mentioned above, mine fails with a Delta too. The snapshot is created, then the file compression and file copy starts, and this is where things fail.

                              Writing out to an NFS share, but I might try backing up across my router to my lab which has an SMB share for backup testing.

                              I'm using XCP-ng 8.2.x and XO from sources with commit d7e64.

                              I'm migrating that VM from one storage device to another to see if that might be part of the issue, once it is done I'll give this backup another try.

                              • Greg_E @Greg_E

                                @Greg_E

                                Not sure if this helps, but I was able to get this VM to back up using no compression. Now I'm going to make the drive smaller to remove most of the free space and see if compression works.

                                This VM had almost 400GB of free space, and I no longer need this much since Microsoft deprecated a feature I was using after win10, all my clients have been moved to win11.

                                I have one more "big" Windows VM that probably has a bunch of space I can reclaim, or I'll just go without compression for that one.

                                And this is only a Windows issue: my biggest Linux VM also has a lot of free space to hold disk images for deployment, and it was FAST compared to a Windows backup.

                                [late edit] I forgot that this is a process. The Recovery partition sits at the end of disk space, so shrinking the main partition will leave uncommitted space between them, and not shrink anything at all. What I've done in the past was to boot to a Linux disk and use Gparted to move the Recovery where it needed to be. This machine is one of my domain controllers and will need to wait until I have "idle" time on the system to shut it down and do this, maybe tomorrow if I'm lucky. Since I have more than one, I generally shouldn't need to worry, but I still try to work around other users.

                                partition.png

                                • olivierlambert Vates 🪐 Co-Founder CEO

                                  My previous ping didn't work so I will try my luck with @lsouai-vates 😛

                                  • lsouai-vates Vates 🪐 XO Team @olivierlambert

                                    @olivierlambert transferred 😉

                                    • Greg_E @lsouai-vates

                                      @lsouai-vates

                                      I backed up another Windows Server 2022 VM that had a lot of free space; setting no compression is the workaround right now. I'll have to get both of these shrunk down to a reasonable size and see if compression starts working. That's an after-lunch task for the second "big" VM. I'll report back after performing the shrink steps on the one I can reboot today.

                                      I agree with the working theory way up at the top... The process is still going, counting each empty "block" and "compressing" it, but with no data moving for over 5 minutes, it errors out. And 120-150GB worth of empty space in a Windows VM is enough to hit that timer.
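
One way to test this theory would be to sample the size of the export file while a compressed `xe vm-export` runs and look for the longest stretch without growth. A rough sketch, assuming GNU `stat` and a hypothetical file path; the function names are made up for illustration. It watches the file that the host writes, which is only a proxy for the HTTP stream that XO reads.

```shell
#!/bin/sh
# sample_size FILE INTERVAL COUNT
# Print COUNT lines of "epoch_seconds size_bytes", INTERVAL seconds apart.
# A missing file is reported as size 0.
sample_size() {
    i=0
    while [ "$i" -lt "$3" ]; do
        printf '%s %s\n' "$(date +%s)" "$(stat -c %s "$1" 2>/dev/null || echo 0)"
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then sleep "$2"; fi
    done
}

# longest_stall
# Read "epoch_seconds size_bytes" lines on stdin and print the longest
# number of seconds during which the size did not grow.
longest_stall() {
    awk 'NR == 1     { last_t = $1; last_s = $2; next }
         $2 > last_s { last_t = $1; last_s = $2; next }
                     { gap = $1 - last_t; if (gap > max) max = gap }
         END         { print max + 0 }'
}
```

For example, `sample_size /mnt/export/test-vm.xva.zst 10 360 | longest_stall` would watch the file for about an hour. A result above 300 seconds would be consistent with the 5-minute timeout being hit while empty blocks are processed.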

                                      Why don't the Linux machines do this? It might be because all of mine are done in less than 10 minutes total, which doesn't leave a lot of time for that timer to run out. Three of my Linux VMs with "large" disks went just fine; a couple took only 3 minutes to compress and copy to the remote share.

                                      [edit] After shrinking and moving the partitions, I'm finding that XO is not allowed to decrease the size of a "disk", so I might just be stuck with no compression on these two VMs.

                                      • lsouai-vates Vates 🪐 XO Team @Greg_E

                                        @florent can you help him?
