XCP-ng

    Failed offline DR backup to NFS caused some issues (paused / offline VMs)

    • k11maris

      Hi,

      We ran into a problem this weekend that took all of our VMs offline and required a reboot of the pool master to get things working again.

      We have two XCP-ng servers, a SAN, and a TrueNAS for backups. Because snapshots take too much space and coalesce does not work while the VMs are running ("error: unexpected bump in size"), I use the offline backup feature to do a DR backup to the NAS (NFS) every weekend. The same NAS is also an iSCSI target, but that is not in use.

      Now the NAS ran into a hardware issue. Ping still worked, but the TrueNAS web interface didn't, and NFS hung. iSCSI still showed "ok", but I doubt it was working. This happened before or during the backup.

      The first backup job (Friday night, 1 larger VM) failed after 6 hours with the following error:
      Global status: failure, retry the VM Backup due to an error. Error: no opaque ref found.

      The second job (Saturday night, several small VMs) failed after 42 minutes with one of those errors:
      Retry the VM backup due to an error. Error: 408 request timeout.
      Retry the VM backup due to an error. Error: unexpected 500

      Xen Orchestra is running as a VM, and I have a script that backs it up as well (since I assume it cannot back up itself). This gave the following errors because the TrueNAS (an SMB share in this case) was unavailable.

      • Shutting down Xen VM xov001 on 22.04.2024 at 5:28:20,21 The request was asynchronously canceled.
      • Exporting Xen VM romhmxov001 on 22.04.2024 at 6:29:46,13
      • Error: Received exception: Could not find a part of the path 'N:\XEN_Backup\XenVMs\xov001.xva'.
      • Error: Unable to write output file: N:\XEN_Backup\XenVMs\xov001.xva
      • Starting Xen VM xov001 on 22.04.2024 at 6:29:46,80 The request was asynchronously canceled.

      On Monday morning, all VMs were shut down. The VMs that had been running on the pool master were in a paused state; XCP-ng Center showed a light green dot for these. I could not start, stop, force-reboot, or unpause them. Only a reboot of the pool master helped (a restart of the toolstack did not). The reboot took forever and in the end I had to reset the server, probably because of a hung NFS session. Before I did this, I started the VMs on the second server (they showed a red dot there) and those worked fine.

      I am wondering if this could be improved with better error handling. Maybe some kind of pre-flight check before starting the backups? And what about the paused state of the VMs?
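
      To illustrate what I mean by a pre-flight check: something along these lines, run right before the job, might have caught this. It's only a rough sketch -- 192.168.9.25 is our NAS, but the export path, mount options and timeouts are placeholders I'd still have to adapt.

      #!/bin/bash
      # Rough pre-flight check to run right before the offline backup job starts.
      # 192.168.9.25 is the NAS; the export path below is a placeholder.
      NFS_SERVER=192.168.9.25
      NFS_EXPORT=/mnt/backup

      # Fail fast if the server does not answer an export query within 10 seconds.
      if ! timeout 10 showmount -e "$NFS_SERVER" | grep -q "$NFS_EXPORT"; then
          echo "NFS export $NFS_SERVER:$NFS_EXPORT not reachable - skipping backup" >&2
          exit 1
      fi

      # Make sure the share can actually be mounted and written to, not just listed.
      MNT=$(mktemp -d)
      if ! timeout 30 mount -t nfs -o soft,timeo=100,retrans=2 "$NFS_SERVER:$NFS_EXPORT" "$MNT"; then
          echo "Mounting $NFS_SERVER:$NFS_EXPORT failed - skipping backup" >&2
          rmdir "$MNT"
          exit 1
      fi
      if ! touch "$MNT/.preflight" || ! rm -f "$MNT/.preflight"; then
          echo "Export is not writable - skipping backup" >&2
          umount "$MNT"; rmdir "$MNT"
          exit 1
      fi
      umount "$MNT"; rmdir "$MNT"
      echo "NFS target looks healthy - safe to start the backup"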

      • Danp (Pro Support Team)

        Hi,

        A lot going on here...

        Because snapshots take too much space

        I'm guessing that you are thick provisioned. What storage type is being used on the SAN?

        coalesce does not work while the VMs are running ("error: unexpected bump in size")

        This isn't normal AFAIK, so it sounds like you have some type of issue with your configuration.

        Xen Orchestra is running as a VM, and I have a script that backs it up as well (since I assume it cannot back up itself).

        This is incorrect. XO is capable of backing up itself along with other VMs in a single backup job.

        On Monday morning, all VMs were shut down. The VMs that had been running on the pool master were in a paused state.

        I recommend that you check your logs on the pool master.

        I am wondering if this could be improved with better error handling. Maybe some kind of pre-flight check before starting the backups? And what about the paused state of the VMs?

        Like I stated at the beginning, there is a lot going on here. We won't know the cause until you investigate further, but I can't see how offline backups would have caused this much failure to occur. 🤔

        • k11maris

          Yes, it is an iSCSI SAN, so thick provisioned. I am looking into getting thin / NFS storage, but I keep getting offers for iSCSI devices from our suppliers.

          Coalesce seems to work for Linux VMs but not for Windows.

          I'll have to try the "self backup" of XO then, once the NAS is up and running again.

          /var/log/daemon.log:
          Nothing apart from tapdisk errors related to the failed NAS (io errors, timeouts etc)

          xensource.log:
          Looks like the export was preventing the VMs from starting. Lots of messages like
          |Async.VM.start R:4d0799eed5c0|helpers] VM.start locking failed: caught transient failure OTHER_OPERATION_IN_PROGRESS: [ VM.{export,export}; OpaqueRef:c9a84569-de10-d94d-b503-c3052e042c5f ]

          SMLOG:
          As expected, lots of errors related to the NAS.

          kern.log:

          kernel: [1578966.659309] vif vif-26-0 vif26.0: Guest Rx stalled
          kernel: [1578966.859332] vif vif-26-0 vif26.0: Guest Rx ready
          kernel: [1578966.915324] nfs: server 192.168.9.25 not responding, timed out
          kernel: [1578966.915330] nfs: server 192.168.9.25 not responding, timed out
          kernel: [1578972.939040] vif vif-23-0 vif23.0: Guest Rx stalled
          kernel: [1578975.587196] vif vif-25-1 vif25.1: Guest Rx ready
          kernel: [1578975.819690] vif vif-27-0 vif27.0: Guest Rx stalled
          kernel: [1578983.043066] vif vif-23-0 vif23.0: Guest Rx ready
          kernel: [1578986.559032] vif vif-27-0 vif27.0: Guest Rx ready
          kernel: [1578988.035104] nfs: server 192.168.9.25 not responding, timed out
          kernel: [1578988.035139] nfs: server 192.168.9.25 not responding, timed out
          

          /var/crash: no files
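
          If this happens again, I'll at least try to find and cancel the stuck export task before rebooting anything. No idea whether that would have released the lock in this case, but roughly (the UUIDs are whatever xe reports at the time):

          # List tasks on the pool; a hung offline backup should show up as a pending VM export.
          xe task-list

          # Try to cancel the stuck export task (UUID taken from the output above).
          xe task-cancel uuid=<task-uuid>

          # If a VM is still stuck in a paused/odd state afterwards and is definitely not
          # running anywhere, its power state can be forced back to halted.
          xe vm-reset-powerstate uuid=<vm-uuid> force=true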

          • Danp @k11maris

            @k11maris Do you have the guest tools installed on the Windows VMs?

            Running grep -B 5 -A 5 -i exception /var/log/SMlog on your pool master will likely point out the source of the coalesce issues.

            • k11maris @Danp

              @Danp
              Thanks, I'll check this the next time I delete a snapshot. I looked at SMlog in the past and it always came back with "unexpected bump in size".
              Guest tools are installed.
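
              As a note to myself: to also catch the exception context in the rotated logs next time, something like this should do (exact file names depend on the logrotate setup on the host):

              # Context around the error in the current SMlog...
              grep -B 5 -A 5 -i "bump in size" /var/log/SMlog
              # ...and in the compressed, rotated copies.
              zgrep -B 5 -A 5 -i "bump in size" /var/log/SMlog.*.gz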

              • k11maris @Danp

                @Danp said in Failed offline DR backup to NFS caused some issues (paused / offline VMs):

                This is incorrect. XO is capable of backing up itself along with other VMs in a single backup job.

                I guess this does not work with offline backups as it would simply shut down the XO VM.

                • Danp @k11maris

                  @k11maris Yes, that's common sense. 😉

                  • k11maris @Danp

                    @Danp
                    The failed backup from last weekend left behind an orphaned disk and a disk connected to the control domain, both of which I removed (a rough CLI sketch for finding those is at the end of this post).

                    I tried a couple of backups today and all worked fine, including XO. However, while 4 Linux VMs coalesced after a few minutes, one failed, so it is not limited to Windows VMs.
                    Usually I get "Exception unexpected bump in size" for the Windows VMs, so it might be a different issue here.
                    There is nothing special about the affected VM: Ubuntu 22.04 LTS, almost no CPU or IO load.

                    Apr 26 11:31:33 xen002 SMGC: [18314] Removed leaf-coalesce from 37d94ab0[VHD](25.000G/479.051M/25.055G|a)
                    Apr 26 11:31:33 xen002 SMGC: [18314] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
                    Apr 26 11:31:33 xen002 SMGC: [18314]          ***********************
                    Apr 26 11:31:33 xen002 SMGC: [18314]          *  E X C E P T I O N  *
                    Apr 26 11:31:33 xen002 SMGC: [18314]          ***********************
                    Apr 26 11:31:33 xen002 SMGC: [18314] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI 37d94ab0-9722-4447-b459-814afa8ba24a could not be coalesced
                    Apr 26 11:31:33 xen002 SMGC: [18314]   File "/opt/xensource/sm/cleanup.py", line 1774, in coalesceLeaf
                    Apr 26 11:31:33 xen002 SMGC: [18314]     self._coalesceLeaf(vdi)
                    Apr 26 11:31:33 xen002 SMGC: [18314]   File "/opt/xensource/sm/cleanup.py", line 2053, in _coalesceLeaf
                    Apr 26 11:31:33 xen002 SMGC: [18314]     .format(uuid=vdi.uuid))
                    Apr 26 11:31:33 xen002 SMGC: [18314]
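
                    In case someone else finds the same leftovers after a failed export, they can roughly be tracked down and removed from the CLI like this (all UUIDs are placeholders; double-check what a VBD belongs to before destroying anything):

                    # The control domains (one per host) and the VBDs attached to them; a leftover
                    # from a failed export usually shows up as an extra VBD on dom0.
                    xe vm-list is-control-domain=true params=uuid,name-label
                    xe vbd-list vm-uuid=<dom0-uuid>

                    # Once identified: unplug and remove the stale VBD, then the orphaned VDI itself.
                    xe vbd-unplug uuid=<vbd-uuid>
                    xe vbd-destroy uuid=<vbd-uuid>
                    xe vdi-destroy uuid=<vdi-uuid>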
                    
                    • Danp

                      Have you tried running vhd-util check on the affected VHD file?
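
                      On an LVM-over-iSCSI SR the VHD sits inside a logical volume rather than a file, so it roughly looks like this (the UUIDs are placeholders, and the volume naming follows the usual VG_XenStorage-<sr-uuid> convention):

                      # Activate the LV that holds the VDI, check the VHD, then deactivate it again.
                      lvchange -ay /dev/VG_XenStorage-<sr-uuid>/VHD-<vdi-uuid>
                      vhd-util check -n /dev/VG_XenStorage-<sr-uuid>/VHD-<vdi-uuid>
                      lvchange -an /dev/VG_XenStorage-<sr-uuid>/VHD-<vdi-uuid>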

                      • k11maris @Danp

                        @Danp
                        No, I am not familiar with many of the "manual" commands; I'll have to try something like that on our LVM over iSCSI setup.
                        Meanwhile, I stopped the VM, did an SR scan, and it coalesced successfully. Offline always works fine...
