Orphan VDI snapshot after CR backup

olivierlambert

The xo-server output would be really interesting. In theory, XO is REALLY really really careful and tries to remove a disk for 20 minutes when XAPI refuses to do so (and you should see a trace of that in the logs, I mean the console output of xo-server).

My gut feeling is that XAPI says it's OK when it isn't, and knowing why might help us find a storage race condition somewhere.
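A minimal sketch of the "keep trying to remove the disk for 20 minutes" behaviour, assuming a generic `destroy()` callback. The function name `destroyWithRetry`, the 5-second delay and the overall structure are illustrative assumptions, not xo-server's actual code. It retries every failure until the time budget runs out, logging a warning each time, which is the kind of trace you would expect in the xo-server console.

```typescript
// Hypothetical sketch: destroyWithRetry and its defaults are assumptions,
// not xo-server's real implementation.

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls destroy() until it succeeds or the time budget is used up.
// Resolves with the number of attempts; rethrows the last error on timeout.
async function destroyWithRetry(
  destroy: () => Promise<void>,
  timeoutMs: number = 20 * 60 * 1000,
  delayMs: number = 5_000,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  for (let attempt = 1; ; attempt++) {
    try {
      await destroy();
      return attempt;
    } catch (error) {
      // Give up if waiting once more would overrun the deadline.
      if (Date.now() + delayMs > deadline) {
        throw error;
      }
      // This warning is the kind of trace that would appear in the console output.
      console.warn(`destroy attempt ${attempt} failed, retrying in ${delayMs} ms`);
      await sleep(delayMs);
    }
  }
}
```

If no such warning ever shows up while a snapshot is left behind, that would suggest XAPI reported success rather than refusing.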

Andrew

@olivierlambert It's good that it is really really really careful. The rule is: Primum non nocere (First, do no harm). The backup job completes without logging an error about failing to remove the snapshot.

I'll have to increase the XO logging and see if there is more output about it.

Which XAPI log file should I look at?

Andrew

@olivierlambert Or is it actually a coalesce problem? The VM/VDIs are not listed under "VDIs to coalesce" after the jobs finish.

olivierlambert

It's hard to know exactly: is it something we can see on XO's side or not? I can't tell. Maybe SMlog has more info from the time the VM snapshot was removed.

Andrew

@olivierlambert It's still an ongoing issue (XO community commit f1ab6).

Here is the error XO logs when it fails to remove the old snapshot:

Sep 21 16:00:59 xo1 xo-server[613294]: 2022-09-21T20:00:59.229Z xo:xapi:vm WARN VM_destroy: failed to destroy VDI {
Sep 21 16:00:59 xo1 xo-server[613294]:   error: XapiError: HANDLE_INVALID(VBD, OpaqueRef:6b28b472-e82e-4117-a0c0-b61ee894e3b5)
Sep 21 16:00:59 xo1 xo-server[613294]:       at XapiError.wrap (/opt/xo/xo-builds/xen-orchestra-202209211219/packages/xen-api/dist/_XapiError.js:26:12)
Sep 21 16:00:59 xo1 xo-server[613294]:       at /opt/xo/xo-builds/xen-orchestra-202209211219/packages/xen-api/dist/transports/json-rpc.js:46:30
Sep 21 16:00:59 xo1 xo-server[613294]:       at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
Sep 21 16:00:59 xo1 xo-server[613294]:     code: 'HANDLE_INVALID',
Sep 21 16:00:59 xo1 xo-server[613294]:     params: [ 'VBD', 'OpaqueRef:6b28b472-e82e-4117-a0c0-b61ee894e3b5' ],
Sep 21 16:00:59 xo1 xo-server[613294]:     call: { method: 'VBD.get_VM', params: [Array] },
Sep 21 16:00:59 xo1 xo-server[613294]:     url: undefined,
Sep 21 16:00:59 xo1 xo-server[613294]:     task: undefined
Sep 21 16:00:59 xo1 xo-server[613294]:   },
Sep 21 16:00:59 xo1 xo-server[613294]:   vdiRef: 'OpaqueRef:56e6071e-eb67-4e02-b6d1-b814ea43eeeb',
Sep 21 16:00:59 xo1 xo-server[613294]:   vmRef: 'OpaqueRef:31957bf1-2f2b-474d-a496-e2a2460f533f'
Sep 21 16:00:59 xo1 xo-server[613294]: }
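In this log, `HANDLE_INVALID` with `params: ['VBD', ...]` means the VBD reference passed to `VBD.get_VM` no longer existed when the call was made, so the VBD seems to have disappeared between being listed and being queried, which would be consistent with the race-condition theory. The sketch below shows one hypothetical way a client could tolerate that case. `vmsOfVbds`, `getVm` and `XapiErrorLike` are made-up names that only mirror the `code`/`params` fields visible in the log; this is not how XO actually handles it.

```typescript
// Hypothetical sketch: all names here are illustrative, not XO or XAPI APIs.
// XapiErrorLike mirrors the `code` and `params` fields shown in the log.
interface XapiErrorLike {
  code: string;
  params: string[];
}

// True when XAPI reports that a reference of the given class no longer exists.
function isHandleInvalid(error: unknown, objectClass: string): boolean {
  const e = error as Partial<XapiErrorLike> | null;
  return (
    e !== null &&
    typeof e === "object" &&
    e.code === "HANDLE_INVALID" &&
    Array.isArray(e.params) &&
    e.params[0] === objectClass
  );
}

// Resolves the VM of each VBD; getVm stands in for a VBD.get_VM call.
async function vmsOfVbds(
  vbdRefs: string[],
  getVm: (vbdRef: string) => Promise<string>,
): Promise<string[]> {
  const vmRefs: string[] = [];
  for (const vbdRef of vbdRefs) {
    try {
      vmRefs.push(await getVm(vbdRef));
    } catch (error) {
      // The VBD was destroyed after it was listed: nothing left to inspect.
      if (isHandleInvalid(error, "VBD")) continue;
      // Any other failure is a real problem and is passed on.
      throw error;
    }
  }
  return vmRefs;
}
```

Skipping the vanished VBD only avoids aborting the lookup; it does not explain why the VBD went away, which is the part that matters here.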

olivierlambert

We got an exception from XAPI, but let's see if it's "because" of XO. Pinging @julien-f

Andrew

@olivierlambert This issue is still happening... I'm using the current XO from sources and current XCP 8.2.1.

olivierlambert

I don't know why XAPI refuses to destroy the VDI… I don't think it's an XO issue.

Andrew

@olivierlambert @julien-f Enabling "Use NBD protocol to transfer disk if available" (and actually using NBD) for the job in XO source (commit 3abbc) seems to resolve this issue. If I disable NBD, the random problem comes back within about a day. With NBD enabled I have not seen the problem for weeks.

olivierlambert

Good news then