@florent
After some digging this is what I have come up with. Please double check everything...
I can PM you the whole chat session if you like.
Bug Report: XO Backup Intermittent Failure — RequestAbortedError During NBD Stream Init
Environment:
XCP-ng: 8.3.0 (build 20260408, xapi 26.1.3)
xapi-nbd: 26.1.3-1.6.xcpng8.3
xo-server: community edition (xen-orchestra from source)
Pool: 2-node pool (host1 10.100.2.10, host2 10.100.2.11)
Backup NFS target: 10.100.2.23:/volume1/backup
Symptom:
Scheduled backup jobs intermittently fail with "RequestAbortedError: Request aborted", raised during NBD stream initialization. The failure is transient: the same VMs back up successfully on subsequent runs.
xo:backups:worker ERROR unhandled error event
error: RequestAbortedError [AbortError]: Request aborted
    at BodyReadable.destroy (undici/lib/api/readable.js:51:13)
    at QcowStream.close (@xen-orchestra/qcow2/dist/disk/QcowStream.mjs:40:22)
    at XapiQcow2StreamSource.close (@xen-orchestra/disk-transform/dist/DiskPassthrough.mjs:86:28)
    at XapiQcow2StreamSource.close (@xen-orchestra/xapi/disks/XapiQcow2StreamSource.mjs:61:18)
    at DiskLargerBlock.close (@xen-orchestra/disk-transform/dist/DiskLargerBlock.mjs:87:28)
    at TimeoutDisk.close (@xen-orchestra/disk-transform/dist/DiskPassthrough.mjs:34:29)
    at XapiStreamNbdSource.close (@xen-orchestra/disk-transform/dist/DiskPassthrough.mjs:34:29)
    at XapiStreamNbdSource.init (@xen-orchestra/xapi/disks/XapiStreamNbd.mjs:66:17)
    at async #openNbdStream (@xen-orchestra/xapi/disks/Xapi.mjs:108:7)
Root Cause Analysis:
The error chain is misleading — QcowStream.close and BodyReadable.destroy are cleanup, not the cause. The actual failure is inside connectNbdClientIfPossible() called at XapiStreamNbd.mjs:66.
The sequence in #openNbdStream (Xapi.mjs) is:
1. #openExportStream() opens a qcow2/VHD HTTP stream from XAPI (succeeds).
2. new XapiStreamNbdSource(streamSource, ...) wraps it.
3. await source.init() calls super.init(), then connectNbdClientIfPossible().
4. If connectNbdClientIfPossible() throws for any reason other than NO_NBD_AVAILABLE, execution goes to the catch block in #openNbdStream, which calls source?.close(). This closes the already-open qcow2 HTTP stream, producing the BodyReadable.destroy → AbortError cascade.
The underlying NBD connection failure: MultiNbdClient.connect() opens nbdConcurrency (default 2) connections sequentially. When an NbdClient.connect() call fails, that candidate host is removed and the connection is retried with another candidate. With only 2 hosts in the pool and nbdConcurrency=2, a single transient TLS or TCP failure on one host during NBD option negotiation can exhaust all candidates, causing MultiNbdClient to throw NO_NBD_AVAILABLE. That error IS caught, however, and falls back to stream export. So the failure here is something else: a connection that partially succeeds and then aborts, throwing a non-NO_NBD_AVAILABLE error that propagates to the catch block in #openNbdStream.
Specific issue: When nbdClient.connect() throws with UND_ERR_ABORTED (an undici abort), the error code is not NO_NBD_AVAILABLE, so #openNbdStream re-throws it instead of falling back to stream export. The backup then fails entirely rather than gracefully degrading.
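To make the distinction concrete, here is a minimal, synchronous model of the decision described above. It is only a sketch: makeError and currentDecision are hypothetical names invented for illustration, not functions from Xapi.mjs, and the real code is asynchronous.

```javascript
// Hypothetical model of the catch logic described above.
// None of these names come from the xen-orchestra code base.
function makeError(code, message) {
  const err = new Error(message)
  err.code = code
  return err
}

// As described in this report: only NO_NBD_AVAILABLE leads to the fallback,
// every other error code is re-thrown and fails the backup.
function currentDecision(err) {
  return err.code === 'NO_NBD_AVAILABLE' ? 'fallback' : 'rethrow'
}

const exhausted = makeError('NO_NBD_AVAILABLE', 'no NBD host reachable')
const aborted = makeError('UND_ERR_ABORTED', 'Request aborted')

console.log(currentDecision(exhausted)) // fallback
console.log(currentDecision(aborted)) // rethrow
```

The second case is the one seen in the log: the abort carries a different code, so the existing fallback branch is never taken.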
Proposed Fix:
In Xapi.mjs, the catch block in #openNbdStream should treat any NBD connection failure as fallback-eligible, not just NO_NBD_AVAILABLE:
} catch (err) {
  if (err.code === 'NO_NBD_AVAILABLE' || err.code === 'UND_ERR_ABORTED') {
    warn("can't connect through NBD, fall back to stream export", { err })
    if (streamSource === undefined) {
      throw new Error("Can't open stream source")
    }
    return streamSource
  }
  await source?.close().catch(warn)
  throw err
}
Or more robustly, treat any NBD connection error as fallback-eligible rather than hardcoding error codes:
} catch (err) {
  warn("can't connect through NBD, fall back to stream export", { err })
  if (streamSource === undefined) {
    throw new Error("Can't open stream source")
  }
  return streamSource
}
This matches the intent of the existing NO_NBD_AVAILABLE fallback — NBD is opportunistic, and any failure to establish it should degrade gracefully to HTTP stream export rather than failing the entire backup job.
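One point that may be worth double-checking before relying on either variant: in the stack trace above, XapiStreamNbdSource.close is reached from XapiStreamNbdSource.init, so the underlying export stream may already be destroyed by the time the catch block runs. The sketch below models the "fall back on any NBD error" idea with that guard added. It is synchronous for brevity, and openWithFallback and the closed flag are assumptions made up for this example, not part of the xen-orchestra API.

```javascript
// Hypothetical sketch of "any NBD connection error is fallback-eligible".
// openWithFallback and the `closed` flag are illustrative assumptions only.
function openWithFallback(connectNbd, streamSource, warn = console.warn) {
  try {
    return { via: 'nbd', source: connectNbd() }
  } catch (err) {
    warn("can't connect through NBD, fall back to stream export", err.code)
    // Assumption: there is some way to tell that the stream was already closed.
    // Returning a destroyed stream would only move the failure further downstream.
    if (streamSource === undefined || streamSource.closed) {
      throw new Error("Can't open stream source", { cause: err })
    }
    return { via: 'stream', source: streamSource }
  }
}

// Simulate the abort seen in the log.
const abortError = Object.assign(new Error('Request aborted'), { code: 'UND_ERR_ABORTED' })
const failingConnect = () => {
  throw abortError
}
const silent = () => {}

console.log(openWithFallback(failingConnect, { closed: false }, silent).via) // stream
```

If the stream does turn out to be closed at that point, the fallback would presumably need to re-open the export stream rather than return the original one.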
Observed Timeline:
02:22:11 — xo-server opens VHD + qcow2 export streams
02:22:12–15 — NBD connections attempted, fail mid-handshake
02:22:15 — backup fails with UND_ERR_ABORTED, no fallback
02:33:51 — retry attempt also fails in 5 seconds
23:03 — same VMs back up successfully (transient condition resolved)
Impact: Backup jobs fail entirely on transient NBD connectivity issues instead of falling back to HTTP stream export, which is already implemented and working.
The fix is straightforward and low-risk: the fallback path already exists and works, it's just not being reached for UND_ERR_ABORTED errors.