Failed offline DR backup to NFS caused some issues (paused / offline VMs)
-
Hi,
we ran into a problem this weekend which took all of our VMs offline and required a reboot of the pool master to get things working again.
We have two XCP-ng servers, a SAN, and a TrueNAS for backups. Because snapshots take too much space and coalesce does not work while the VMs are running ("error: unexpected bump in size"), I use the offline backup feature to do a DR backup to the NAS (NFS) each weekend. The same NAS is also an iSCSI target, but this is not in use.
Now the NAS ran into a hardware issue. Ping still worked, but the TrueNAS web interface didn't and NFS hung. iSCSI still showed "ok", but I doubt it was working. This happened before or during the backup.
The first backup job (Friday night, one larger VM) failed after 6 hours with the following error:
Global status: failure, retry the VM Backup due to an error. Error: no opaque ref found.
The second job (Saturday night, several small VMs) failed after 42 minutes with one of these errors:
Retry the VM backup due to an error. Error: 408 request timeout.
Retry the VM backup due to an error. Error: unexpected 500
Xen Orchestra is running as a VM and I have a script which also backs this up (since I assume it cannot back up itself). This gave the following errors because the TrueNAS (SMB share in this case) was unavailable:
- Shutting down Xen VM xov001 on 22.04.2024 at 5:28:20,21 The request was asynchronously canceled.
- Exporting Xen VM romhmxov001 on 22.04.2024 at 6:29:46,13
- Error: Received exception: Could not find a part of the path 'N:\XEN_Backup\XenVMs\xov001.xva'.
- Error: Unable to write output file: N:\XEN_Backup\XenVMs\xov001.xva
- Starting Xen VM xov001 on 22.04.2024 at 6:29:46,80 The request was asynchronously canceled.
On Monday morning, all VMs were shut down. The VMs which had been running on the pool master were in a paused state; XCP-ng Center showed a light green dot for these. I could not start, stop, force-reboot or unpause them. Only a reboot of the pool master helped (a restart of the toolstack did not). The reboot took forever and in the end I had to reset the server, probably because of a hung NFS session. Before I did this, I started the VMs on the second server (which showed a red dot) and those worked fine.
I am wondering if this could be improved with better error handling. Maybe some kind of pre-flight check before starting the backups? And what about the paused state of the VMs?
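In the meantime I am thinking about a simple pre-flight check of my own that runs before the backup window. A rough, untested sketch (it assumes the NAS still answers NFSv3 mountd/rpc queries and that 192.168.9.25 is its backup address):
#!/bin/bash
# Rough pre-flight probe of the backup NAS before starting the DR job.
# Assumes showmount and rpcinfo are available and the NAS exports over NFSv3.
NAS=192.168.9.25
if ! timeout 15 showmount -e "$NAS" >/dev/null 2>&1; then
    echo "Export listing on $NAS failed, skipping backup run" >&2
    exit 1
fi
if ! timeout 15 rpcinfo -t "$NAS" nfs >/dev/null 2>&1; then
    echo "NFS service on $NAS not responding, skipping backup run" >&2
    exit 1
fi
echo "NAS looks reachable, proceeding with the backup"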
-
Hi,
A lot going on here...
Because snapshots take too much space
I'm guessing that you are thick provisioned. What storage type is being used on the SAN?
coalesce does not work while the VMs are running ("error: unexpected bump in size")
This isn't normal AFAIK, so it sounds like you have some type of issue with your configuration.
Xen Orchestra is running as a VM and I have a script which also backs this up (since I assume it cannot back up itself).
This is incorrect. XO is capable of backing up itself along with other VMs in a single backup job.
On Monday morning, all VMs were shut down. The VMs which had been running on the pool master were in a paused state.
I recommend that you check your logs on the pool master.
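For example (these are the usual locations on an XCP-ng host; adjust as needed):
less /var/log/xensource.log   # XAPI tasks, VM lifecycle, export operations
less /var/log/SMlog           # storage manager: SR, VDI and coalesce activity
less /var/log/daemon.log      # tapdisk and other dom0 daemons
less /var/log/kern.log        # kernel messages, e.g. hung NFS mounts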
I am wondering if this could be improved with better error handling. Maybe some kind of pre-flight check before starting the backups? And what about the paused state of the VMs?
Like I stated at the beginning, there is a lot going on here. We won't know the cause until you investigate further, but I can't see how offline backups would have caused this much failure to occur.
-
Yes, it is an iSCSI SAN, so thick provisioned. I am looking into getting thin-provisioned / NFS storage, but I keep getting offers for iSCSI devices from our suppliers.
Coalesce seems to work for Linux VMs but not for Windows VMs.
I will have to try the "self backup" of XO once the NAS is up and running again.
/var/log/daemon.log:
Nothing apart from tapdisk errors related to the failed NAS (IO errors, timeouts, etc.).
xensource.log:
Looks like the export was preventing the VMs from starting. Lots of messages like:
|Async.VM.start R:4d0799eed5c0|helpers] VM.start locking failed: caught transient failure OTHER_OPERATION_IN_PROGRESS: [ VM.{export,export}; OpaqueRef:c9a84569-de10-d94d-b503-c3052e042c5f ]
SMlog:
As expected, lots of errors related to the NAS.
kern.log:
kernel: [1578966.659309] vif vif-26-0 vif26.0: Guest Rx stalled
kernel: [1578966.859332] vif vif-26-0 vif26.0: Guest Rx ready
kernel: [1578966.915324] nfs: server 192.168.9.25 not responding, timed out
kernel: [1578966.915330] nfs: server 192.168.9.25 not responding, timed out
kernel: [1578972.939040] vif vif-23-0 vif23.0: Guest Rx stalled
kernel: [1578975.587196] vif vif-25-1 vif25.1: Guest Rx ready
kernel: [1578975.819690] vif vif-27-0 vif27.0: Guest Rx stalled
kernel: [1578983.043066] vif vif-23-0 vif23.0: Guest Rx ready
kernel: [1578986.559032] vif vif-27-0 vif27.0: Guest Rx ready
kernel: [1578988.035104] nfs: server 192.168.9.25 not responding, timed out
kernel: [1578988.035139] nfs: server 192.168.9.25 not responding, timed out
/var/crash: no files
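Given those OTHER_OPERATION_IN_PROGRESS messages, next time I will first try to cancel the stuck export task instead of rebooting, something like this (untested on my side; the UUID is a placeholder):
xe task-list                      # find the hanging VM.export task
xe task-cancel uuid=<task-uuid>   # try to cancel it before resorting to a reboot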
-
@k11maris Do you have the guest tools installed on the Windows VMs?
Running
grep -B 5 -A 5 -i exception /var/log/SMlog
on your pool master will likely point out the source of the coalesce issues.
-
@Danp
Thanks, I'll check this the next time I delete a snapshot. When I looked at SMlog in the past, it always came up with "unexpected bump in size".
Guest tools are installed.
-
@Danp said in Failed offline DR backup to NFS caused some issues (paused / offline VMs):
This is incorrect. XO is capable of backing up itself along with other VMs in a single backup job.
I guess this does not work with offline backups as it would simply shut down the XO VM.
-
@k11maris Yes, that's common sense.
-
@Danp
The failed backup from last weekend left behind an orphaned disk and a disk connected to the control domain, which I removed. I tried a couple of backups today and all worked fine, including XO. However, while four Linux VMs coalesced after a few minutes, one failed. So it is not limited to Windows VMs.
Usually I get "Exception unexpected bump in size" for the Windows VMs, so it might be a different issue here.
There is nothing special about the affected VM: Ubuntu 22.04 LTS, almost no CPU or IO load.
Apr 26 11:31:33 xen002 SMGC: [18314] Removed leaf-coalesce from 37d94ab0[VHD](25.000G/479.051M/25.055G|a)
Apr 26 11:31:33 xen002 SMGC: [18314] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Apr 26 11:31:33 xen002 SMGC: [18314] ***********************
Apr 26 11:31:33 xen002 SMGC: [18314] * E X C E P T I O N *
Apr 26 11:31:33 xen002 SMGC: [18314] ***********************
Apr 26 11:31:33 xen002 SMGC: [18314] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI 37d94ab0-9722-4447-b459-814afa8ba24a could not be coalesced
Apr 26 11:31:33 xen002 SMGC: [18314] File "/opt/xensource/sm/cleanup.py", line 1774, in coalesceLeaf
Apr 26 11:31:33 xen002 SMGC: [18314] self._coalesceLeaf(vdi)
Apr 26 11:31:33 xen002 SMGC: [18314] File "/opt/xensource/sm/cleanup.py", line 2053, in _coalesceLeaf
Apr 26 11:31:33 xen002 SMGC: [18314] .format(uuid=vdi.uuid))
Apr 26 11:31:33 xen002 SMGC: [18314]
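For now I'll just kick off another garbage collection run with a rescan and see if it picks the chain up again, i.e. something like (SR name and UUID are placeholders):
xe sr-list name-label=<SR name> params=uuid   # look up the SR UUID
xe sr-scan uuid=<SR-UUID>                     # rescan and retrigger the GC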
-
Have you tried running
vhd-util check
on the affected VHD file?
-
@Danp
No, I am not familiar with many of the "manual" commands. I have to figure out how to use that with LVM over iSCSI.
Meanwhile, I stopped the VM, did an SR scan and it coalesced successfully. Offline always works fine...
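For future reference, if I understand the docs correctly, on an LVM-over-iSCSI SR the check should look roughly like this (SR/VDI UUIDs are placeholders and the VDI must not be attached while checking; I have not tried it yet):
lvchange -ay /dev/VG_XenStorage-<SR-UUID>/VHD-<VDI-UUID>      # activate the LV that holds the VHD
vhd-util check -n /dev/VG_XenStorage-<SR-UUID>/VHD-<VDI-UUID>
lvchange -an /dev/VG_XenStorage-<SR-UUID>/VHD-<VDI-UUID>      # deactivate it again afterwards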