One VM backup was stuck, now backups for that VM are failing with "parent VHD is missing"

CodeMercenary

Delta backups had been working fine on all my VMs. Then on Monday when I came into the office I noticed that the delta backup job from Sunday was still running. That's not normal: it started at 5am on Sunday, and the job normally takes 2 hours for the weekly full (which doesn't land on Sunday anyway) and 20 minutes for a delta.

7 of the 8 VMs being backed up worked fine; it was seemingly stuck on just one. By the next day that backup was listed as Interrupted, and when I look at the details it's just that one VM backup saying it was interrupted. That makes sense.

The backup for Monday failed for all VMs because it said the backup was already running, which makes sense since the prior day's job was still stuck.

Now today, Tuesday, the backup this morning failed for that one VM out of the eight. The "Clean VM directory" steps failed with "VHD check error" and two "parent VHD is missing" errors. There is also an overall error message for that VM within the backup job that reads "stream has ended with not enough data (actual: 474, expected: 512)". The same errors are repeated for both remotes.

The details on the VHD check error are:

"/xo-vm-backups/<guid 1>/vdis/<guid 2>/<guid A>/.20240707T120231Z.vhd"
error
{"generatedMessage":false,"code":"ERR_ASSERTION","actual":false,"expected":true,"operator":"=="}

The first parent missing is:

parent
"/var/run/sr-mount/<guid 4>/<guid B>.vhd"
child
"/xo-vm-backups/<guid 1>/vdis/<guid 2>/<guid C>/20240707T120231Z.vhd"

The second parent missing is:

parent
"/var/run/sr-mount/<guid 4>/<guid D>.vhd"
child
"/xo-vm-backups/<guid 1>/vdis/<guid 2>/<guid E>/20240707T120231Z.vhd"

Note GUIDs with numbers are repeated, letters are unique, in case those give any clues.
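
For anyone who wants to check a backup file by hand, here is a minimal Python sketch that tests whether a VHD on a mounted remote ends with a complete footer. It assumes the "expected: 512" in the stream error refers to a 512-byte VHD structure such as the footer, which is only a guess; the error message itself does not say so. `inspect_vhd_footer` is a hypothetical helper written for this post, not part of Xen Orchestra, and it relies only on the published VHD layout (a 512-byte footer at the end of the file, starting with the cookie `conectix`, with the disk type stored big-endian at byte offset 60).

```python
import struct

VHD_FOOTER_SIZE = 512          # footer length in the VHD format
VHD_COOKIE = b"conectix"       # first 8 bytes of a valid footer
DISK_TYPES = {2: "fixed", 3: "dynamic", 4: "differencing"}


def inspect_vhd_footer(path):
    """Hypothetical helper: describe the footer at the end of a VHD file."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end to learn the file size
        size = f.tell()
        if size < VHD_FOOTER_SIZE:
            return {"ok": False, "reason": f"file is only {size} bytes"}
        f.seek(size - VHD_FOOTER_SIZE)    # footer occupies the last 512 bytes
        footer = f.read(VHD_FOOTER_SIZE)

    if footer[:8] != VHD_COOKIE:
        return {"ok": False, "reason": "footer cookie not found"}

    # Disk type is a 4-byte big-endian integer at offset 60 of the footer.
    (disk_type,) = struct.unpack(">I", footer[60:64])
    return {"ok": True, "disk_type": DISK_TYPES.get(disk_type, f"unknown ({disk_type})")}
```

A delta child such as the `20240707T120231Z.vhd` files above would be expected to report `differencing` if it is intact; a result with `ok` set to `False` would suggest the file was cut short, for example by the interrupted job. This only looks at the footer and does not verify that the parent the file points to actually exists.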

Why might this have happened? How do I fix it? Anything I may have done to cause it or ways I can prevent it from happening again?

olivierlambert

Hi,

You need to provide a bit more detail. First, XO from the sources or XOA?

CodeMercenary

XO from source. The backup that is failing is a Delta with NBD enabled in the backup settings and in the pool on the network port.

Currently running commit e1dd5, which is older. A few weeks ago my backups started failing, and it turned out to be caused by something introduced in a commit I had updated to, which produced the error "VDI must be free or attached to exactly one VM". I ran across this thread talking about the issue (https://xcp-ng.org/forum/topic/9215/backups-started-failing-error-vdi-must-be-free-or-attached-to-exactly-one-vm/) and the suggested solution at the time was to roll back, which fixed it for me.

Those previous backup failures started on June 21. I rolled XO back to the prior version I had been using, from June 17, and backups started working again. I have not updated this XO instance since.

Now I see that you created a consolidated CBT issue tracking thread so I'll look through that to see if this issue has been resolved and I'll update my XO if so.

I did have to do an emergency reset of the network settings on the server that hosts the XO VM. The physical console said it had no network adapters and that it couldn't find the pool, yet XO showed it was in the pool and I was able to run VMs. That issue has been going on since last week and this backup problem only started on Sunday, so I don't think they are related. Maybe they are, and perhaps the backup will work again tomorrow morning. I suspect it's not the cause, though, since 7 of the 8 backups worked fine through all that other stuff.

olivierlambert

As stated here: https://xen-orchestra.com/docs/community.html#report-a-bug

We do not provide assistance when running on older commits, exactly because we might have fixed your issues recently. Please update and report back, thanks!

CodeMercenary

@olivierlambert Well, last night the backup completed just fine despite me taking no action.

I updated XO to the latest commit when I got in this morning, so hopefully the issues I had back in June don't come back.