Hi,
We ran into a problem this weekend that took all of our VMs offline and required a reboot of the pool master to get things working again.
We have two XCP-ng servers, a SAN, and a TrueNAS for backups. Because snapshots take too much space and coalesce does not work while the VMs are running ("error: unexpected bump in size"), I use the offline backup feature to do a DR backup to the NAS (NFS) each weekend. The same NAS is also an iSCSI target, but that is not in use.
Then the NAS ran into a hardware issue. Ping still worked, but the TrueNAS web interface didn't, and NFS hung. iSCSI still showed "ok", but I doubt it was working. This happened before or during the backup.
The first backup job (Friday night, one larger VM) failed after 6 hours with the following error:
Global status: failure, retry the VM Backup due to an error. Error: no opaque ref found.
The second job (Saturday night, several small VMs) failed after 42 minutes with one of these errors:
Retry the VM backup due to an error. Error: 408 request timeout.
Retry the VM backup due to an error. Error: unexpected 500
Xen Orchestra runs as a VM, and I have a script which also backs it up (since I assume it cannot back up itself). This script produced the following errors because the TrueNAS (an SMB share in this case) was unavailable:
- Shutting down Xen VM xov001 on 22.04.2024 at 5:28:20,21 The request was asynchronously canceled.
- Exporting Xen VM romhmxov001 on 22.04.2024 at 6:29:46,13
- Error: Received exception: Could not find a part of the path 'N:\XEN_Backup\XenVMs\xov001.xva'.
- Error: Unable to write output file: N:\XEN_Backup\XenVMs\xov001.xva
- Starting Xen VM xov001 on 22.04.2024 at 6:29:46,80 The request was asynchronously canceled.
On Monday morning, all VMs were shut down. The VMs that had been running on the pool master were stuck in a paused state; XCP-NG Center showed a light green dot for these. I could not start, stop, force-reboot, or unpause them. Only a reboot of the pool master helped (a restart of the tool stack did not). The reboot took forever, and in the end I had to reset the server, probably because of a hung NFS session. Before doing this, I started the VMs on the second server (which showed a red dot), and those worked fine.
I am wondering whether this could be improved with better error handling. Maybe some kind of pre-flight check before starting the backups? And what about the paused state of the VMs?
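For the pre-flight idea, here is a minimal sketch of what I have in mind. The tricky part is that a hung NFS mount blocks filesystem calls indefinitely, so the check has to enforce its own timeout. The mount path and timeout below are made-up placeholders, not part of any existing tooling:

```python
# Hypothetical pre-flight check for a backup target. A hung NFS mount
# makes os.listdir() block forever, so the call runs in a worker thread
# and we give up if it does not finish within the timeout.
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def target_is_responsive(path: str, timeout: float = 10.0) -> bool:
    """Return True if listing `path` completes within `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.listdir, path)
    try:
        future.result(timeout=timeout)
        return True
    except (FutureTimeout, OSError):
        # Timed out (hung mount) or the path is missing/unreadable.
        return False
    finally:
        # Don't wait for a possibly hung worker thread.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # "/mnt/nas_backup" is a placeholder for the actual NFS mount point.
    if not target_is_responsive("/mnt/nas_backup"):
        raise SystemExit("Backup target not responding - aborting backup.")
```

A job runner could call this right before each backup and skip (and alert) instead of letting the job hang for hours against a dead NAS.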