XCP-ng

    Orphan VDI snapshot after CR backup

    • Andrew (Top contributor)

      With XO (from source), I'm using Continuous Replication every hour. After some backups (not every time), an orphaned VDI snapshot is left (not the same VDI). The backups are successful but a detached snapshot is left until I remove it. These snapshots don't show up on the VM disks but show up on the Dashboard Health report (not the detached backup report).

      For example, yesterday it left six snapshots at different times, of different VDIs. Today, it left none. The VMs are running on different hosts (in the same pool). Everything is up to date (XCP-ng 8.2.1, XO from source commit df07d). It's been going on for a while and I just delete them occasionally (without harm). Storage is thin-provisioned on NFS, so it's not eating much space or taking much time as long as I take care of them.

      • Danp (Pro Support Team)

        Hi Andrew,

        Anything unusual in the logs?

        Dan

        • Andrew (Top contributor) @Danp

          @Danp Nope... XO left another one this morning but there is nothing in the logs at that time (no errors, no warnings, no messages). Backup worked successfully but left a new orphaned snapshot.

          There are other warnings at other times about other stuff, but it seems to be fine...

          • olivierlambert (Vates 🪐 Co-Founder CEO)

            The xo-server output would be really interesting. In theory, XO is REALLY really really careful and keeps trying to remove a disk for 20 minutes when XAPI refuses to do so (and you should have a trace of that in the logs, I mean the console output of xo-server).

            My gut feeling is that XAPI is saying it's OK when it isn't, and knowing why might help us find a storage race condition somewhere.
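
            To make the idea concrete, the retry behaviour is roughly like this (just a sketch, not the actual xo-server code; destroyVdiWithRetry, the bare XAPI call and the timings are my own simplification):

            async function destroyVdiWithRetry(
              xapi: { call: (method: string, ...args: unknown[]) => Promise<unknown> },
              vdiRef: string,
              timeoutMs = 20 * 60 * 1000, // keep retrying for ~20 minutes
              retryDelayMs = 30_000
            ): Promise<void> {
              const deadline = Date.now() + timeoutMs
              for (;;) {
                try {
                  await xapi.call('VDI.destroy', vdiRef) // plain XAPI call
                  return
                } catch (error) {
                  if (Date.now() + retryDelayMs > deadline) {
                    // give up: this is the error that should end up traced in the xo-server output
                    throw error
                  }
                  await new Promise(resolve => setTimeout(resolve, retryDelayMs))
                }
              }
            }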

            • Andrew (Top contributor) @olivierlambert

              @olivierlambert It's good that it is really really really careful. The rule is: Primum non nocere (First, do no harm). The backup job completes without logging an error about failing to remove the snapshot.

              I'll have to increase the XO logging and see if there is more output about it.

              Which XAPI log file should I look at?

              • Andrew (Top contributor) @olivierlambert

                @olivierlambert Or is it actually a coalesce problem? The VM/VDI is not listed under "VDIs to coalesce" after the backup finishes.

                • olivierlambert (Vates 🪐 Co-Founder CEO)

                  It's hard to know exactly: is it something we can see on XO's side or not? I can't tell. Maybe SMlog has more info from the time the VM snapshot is removed.

                  • Andrew (Top contributor) @olivierlambert

                    @olivierlambert It's still an ongoing issue (XO community commit f1ab6).

                    Here is the error XO logs when it fails to remove the old snapshot:

                    Sep 21 16:00:59 xo1 xo-server[613294]: 2022-09-21T20:00:59.229Z xo:xapi:vm WARN VM_destroy: failed to destroy VDI {
                    Sep 21 16:00:59 xo1 xo-server[613294]:   error: XapiError: HANDLE_INVALID(VBD, OpaqueRef:6b28b472-e82e-4117-a0c0-b61ee894e3b5)
                    Sep 21 16:00:59 xo1 xo-server[613294]:       at XapiError.wrap (/opt/xo/xo-builds/xen-orchestra-202209211219/packages/xen-api/dist/_XapiError.js:26:12)
                    Sep 21 16:00:59 xo1 xo-server[613294]:       at /opt/xo/xo-builds/xen-orchestra-202209211219/packages/xen-api/dist/transports/json-rpc.js:46:30
                    Sep 21 16:00:59 xo1 xo-server[613294]:       at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
                    Sep 21 16:00:59 xo1 xo-server[613294]:     code: 'HANDLE_INVALID',
                    Sep 21 16:00:59 xo1 xo-server[613294]:     params: [ 'VBD', 'OpaqueRef:6b28b472-e82e-4117-a0c0-b61ee894e3b5' ],
                    Sep 21 16:00:59 xo1 xo-server[613294]:     call: { method: 'VBD.get_VM', params: [Array] },
                    Sep 21 16:00:59 xo1 xo-server[613294]:     url: undefined,
                    Sep 21 16:00:59 xo1 xo-server[613294]:     task: undefined
                    Sep 21 16:00:59 xo1 xo-server[613294]:   },
                    Sep 21 16:00:59 xo1 xo-server[613294]:   vdiRef: 'OpaqueRef:56e6071e-eb67-4e02-b6d1-b814ea43eeeb',
                    Sep 21 16:00:59 xo1 xo-server[613294]:   vmRef: 'OpaqueRef:31957bf1-2f2b-474d-a496-e2a2460f533f'
                    Sep 21 16:00:59 xo1 xo-server[613294]: }
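
                    If I read that right, the cleanup trips over a VBD reference that no longer exists (HANDLE_INVALID on VBD.get_VM) and gives up before destroying the snapshot VDI. Just to illustrate the failure mode (this is not XO's code, only my own sketch with plain XAPI calls; destroySnapshotVdi and the Xapi type are made up, and a stale VBD ref is simply treated as already detached):

                    type Xapi = { call: (method: string, ...args: unknown[]) => Promise<unknown> }

                    async function destroySnapshotVdi(xapi: Xapi, vdiRef: string): Promise<void> {
                      // detach any VBDs still referencing the snapshot VDI
                      const vbdRefs = (await xapi.call('VDI.get_VBDs', vdiRef)) as string[]
                      for (const vbdRef of vbdRefs) {
                        try {
                          await xapi.call('VBD.destroy', vbdRef)
                        } catch (error) {
                          // HANDLE_INVALID just means the VBD is already gone, so skip it
                          if ((error as any)?.code === 'HANDLE_INVALID') continue
                          throw error
                        }
                      }
                      await xapi.call('VDI.destroy', vdiRef)
                    }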
                    
                    • olivierlambert (Vates 🪐 Co-Founder CEO)

                      We got an exception from XAPI, but let's see if it's "because" of XO. Pinging @julien-f

                      • Andrew (Top contributor) @olivierlambert

                        @olivierlambert This issue still continues... using the current XO from source and current XCP-ng 8.2.1.

                        • olivierlambert (Vates 🪐 Co-Founder CEO)

                          I don't know why XAPI refuses to destroy the VDI… I don't think it's an XO issue.

                          • Andrew (Top contributor) @olivierlambert

                            @olivierlambert @julien-f Enabling "Use NBD protocol to transfer disk if available" (and actually using NBD) for the job in XO from source (commit 3abbc) seems to resolve this issue. If I disable NBD, the random problem comes back within about a day. With NBD enabled I have not seen the problem for weeks.

                            • olivierlambert (Vates 🪐 Co-Founder CEO)

                              Good news then 🙂

                              • olivierlambert moved this topic from Xen Orchestra