Potential bug with Windows VM backup: "Body Timeout Error"

MajorP93

Hey,

I am experiencing the same issue using XO from sources (commit 4d77b79ce920925691d84b55169ea3b70f7a52f6), Node version 22, Debian 13.

I have multiple backup jobs and only one which is a full backup job is giving me issues.

Most VMs can be backed up by this full backup job just fine but some error out with "body timeout error", e.g.:

            {
              "id": "1762017810483",
              "message": "transfer",
              "start": 1762017810483,
              "status": "failure",
              "end": 1762018134258,
              "result": {
                "name": "BodyTimeoutError",
                "code": "UND_ERR_BODY_TIMEOUT",
                "message": "Body Timeout Error",
                "stack": "BodyTimeoutError: Body Timeout Error\n    at FastTimer.onParserTimeout [as _onTimeout] (/etc/xen-orchestra/node_modules/undici/lib/dispatcher/client-h1.js:646:28)\n    at Timeout.onTick [as _onTimeout] (/etc/xen-orchestra/node_modules/undici/lib/util/timers.js:162:13)\n    at listOnTimeout (node:internal/timers:588:17)\n    at process.processTimers (node:internal/timers:523:7)"
              }
            }

XO from sources VM has 8 vCPU and 8GB RAM.
Link speed of the XCP-ng hosts is 50 Gbit/s.
XO VM can reach 20 Gbit/s to the NAS in iperf.

Zstd is enabled for this backup job.
It appears that only big VMs (as in disk size) have this issue.
The VMs that have this issue on the full backup job can be backed up just fine via delta backup job.

I read in another thread that this issue can be caused by dom0 hardware constrains but dom0 has 16 vCPU and is at ~40% CPU usage while backups are running.
RAM usage sits at 2GB out of 8GB used.

I changed my full backup job to GZIP compression and will see if this helps.
Will report back.
I really need compression due to the large virtual disks of some VMs...

Best regards
MajorP

nikade

@MajorP93 im seeing this as well, I think the issue is related to communication between XO and XCP-NG.
I noticed that it doesn't seem to depend on the vdi size in our case, but rather latency between XO and XCP-NG, which are on different sites and connected via IPSEC VPN.

MajorP93

@nikade Hmm I have a hard time understanding what might cause this issue in my case since all of our 5 XCP-ng hosts are on the same site. They can talk on layer 2 with each other and have 2x 50 Gbit/s LACP bond each...
The XO VM is running on the pool master itself.
Some of the VMs that threw this error are also even running on the pool master itself.
So I would expect that the traffic does not even have to exit the physical host in this case...
Latency should be perfectly fine in this case...

All XCP-ng hosts, XO VM and NAS (backup remote) can ping each other at below 1ms latency...

Really weird.

If anyone has an idea regarding what could possibly cause this I would be grateful.
As I said before I want to test Gzip instead of Zstd but I have to wait until this backup job finished.
It has ~40TB of data to backup in total

olivierlambert

I think we have a lead, I've seen a discussion between @florent and @dinhngtu recently about that topic

MajorP93

@olivierlambert said in Potential bug with Windows VM backup: "Body Timeout Error":

I think we have a lead, I've seen a discussion between @florent and @dinhngtu recently about that topic

Sounds good!
So there is a fix currently being worked on?

olivierlambert

I think we have a lead to explore, we'll keep you posted when we have a branch to test

MajorP93

@olivierlambert said in Potential bug with Windows VM backup: "Body Timeout Error":

I think we have a lead to explore, we'll keep you posted when we have a branch to test

Sure! Thank you very much.
When there is a branch available I will be happy to compile, test and provide any information / log needed.

MajorP93

I did 2 more tests.

using full backup with encryption disabled on the remote (had it enabled before) --> same issue
switching from zstd to gzip --> same issue

MajorP93

Hey, I did some more digging.
I found some log entries that were written at the moment the backup job threw the "body timeout error". (Those blocks of log entries appear for all VMs that show this problem).

grep -i "VM.export" xensource.log | grep -i "error"

Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 HTTPS 192.168.60.30->:::80|[XO] VM export R:edfb08f9b55b|xapi_compression] nice failed to compress: exit code 70
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] [XO] VM export R:edfb08f9b55b failed with exception Server_error(CLIENT_ERROR, [ INTERNAL_ERROR: [ Unix.Unix_error(Unix.EPIPE, "write", "") ] ])
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] Raised Server_error(CLIENT_ERROR, [ INTERNAL_ERROR: [ Unix.Unix_error(Unix.EPIPE, "write", "") ] ])
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 1/16 xapi Raised at file ocaml/xapi/stream_vdi.ml, line 127
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 2/16 xapi Called from file ocaml/xapi/stream_vdi.ml, line 307
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 3/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 4/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 5/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 6/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 7/16 xapi Called from file ocaml/xapi/stream_vdi.ml, line 263
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 8/16 xapi Called from file list.ml, line 110
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 9/16 xapi Called from file ocaml/xapi/export.ml, line 707
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 10/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 11/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 12/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 13/16 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 14/16 xapi Called from file ocaml/xapi/server_helpers.ml, line 75
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 15/16 xapi Called from file ocaml/xapi/server_helpers.ml, line 97
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace] 16/16 xapi Called from file ocaml/libs/log/debug.ml, line 250
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|VM.export D:6f2ee2bc66b8|backtrace]
Nov  4 08:02:33 dat-xcpng01 xapi: [error||141766 :::80|handler:http/get_export D:b71f8095b88f|backtrace] VM.export D:6f2ee2bc66b8 failed with exception Server_error(CLIENT_ERROR, [ INTERNAL_ERROR: [ Unix.Unix_error(Unix.EPIPE, "write", "") ] ])

daemon.log shows:

Nov  4 08:02:31 dat-xcpng01 forkexecd: [error||0 ||forkexecd] 135416 (/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/zstd) exited with code 70

Maybe this gives some insights
To me it appears that the issue is caused by the compression (I had zstd enabled during the backup run).

Best regards

//EDIT: This is what ChatGPT thinks about this (I know AI responses have to be taken with a grain of salt):
Why would there be long “no data” gaps? With full exports, XAPI compresses on the host side. When it encounters very large runs of zeros / sparse/unused regions, compression may yield almost nothing for long stretches. If the stream goes quiet longer than Undici’s bodyTimeout (default ~5 minutes), XO aborts. This explains why it only hits some big VMs and why delta (NBD) backups are fine.

olivierlambert

@MajorP93 said in Potential bug with Windows VM backup: "Body Timeout Error":

Why would there be long “no data” gaps? With full exports, XAPI compresses on the host side. When it encounters very large runs of zeros / sparse/unused regions, compression may yield almost nothing for long stretches. If the stream goes quiet longer than Undici’s bodyTimeout (default ~5 minutes), XO aborts. This explains why it only hits some big VMs and why delta (NBD) backups are fine.

I think that's exactly the issue in here (if I remember our internal discussion).

MajorP93

@olivierlambert said in Potential bug with Windows VM backup: "Body Timeout Error":

@MajorP93 said in Potential bug with Windows VM backup: "Body Timeout Error":

Why would there be long “no data” gaps? With full exports, XAPI compresses on the host side. When it encounters very large runs of zeros / sparse/unused regions, compression may yield almost nothing for long stretches. If the stream goes quiet longer than Undici’s bodyTimeout (default ~5 minutes), XO aborts. This explains why it only hits some big VMs and why delta (NBD) backups are fine.

I think that's exactly the issue in here (if I remember our internal discussion).

Do you think it could be a good idea to make the timeout configurable via Xen Orchestra?

olivierlambert

I can't answer for my devs, just let them come here to provide an answer

Greg_E

@olivierlambert

Going back up to the top of the thread, removing compression was what fixed this for me, and my backups are still saying they are successful (maybe I better check).

I wish there was an easier way to "right size" the storage on these Windows VMs, the suggestion is to copy the VM to a new and smaller "disk" but I haven't looked into what is involved to get this done.

olivierlambert

We will fix it with compression, now we are pretty sure to know where the issue is

nikade

Happy to hear that there's a potential lead, im also happy I found this thread so I can kick back and wait for Vates to fix it

MajorP93

I can imagine that a fix could be to send "keepalive" packets in addition to the XCP-ng export-VM-data-stream so that the timeout on XO side does not occur

MajorP93

I worked around this issue by changing my full backup job to "delta backup" and enabling "force full backup" in the schedule options.

Delta backup seems more reliable as of now.

Looking forward to a fix as Zstd compression is an appealing feature of the full backup method.

andriy.sultanov

I've got a fix for this issue in the pipeline and would really appreciate if anyone here could test to confirm if it resolves the problem for you.

You'd need to install a custom build (not production-ready, be careful, all the usual warnings apply here) according to these steps:

Before updating, preferably check that exporting a particular VM with compression on results in the timeout error consistently.
Write the following file to /etc/yum.repos.d/xvatest.repo:

[xva-test-repo]
name=xva-test-repo
baseurl=https://koji.xcp-ng.org/repos/user/8/8.3/asultanov1/x86_64/
enabled=0
metadata_expire=1m
gpgcheck=1
repo_gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-xcpng

Make sure you're up to date on the rest of your packages with yum upgrade --enablerepo=testing (reboot if necessary)
Install the fix (this should install *.xvafix.1 packages):

yum upgrade xapi-core qcow-stream-tool vhd-tool --enablerepo=xva-test-repo

Restart the toolstack:

xe-toolstack-restart

MajorP93

@andriy.sultanov Thank you very much for working on this!
I can test your proposed fix in our lab during next week.
Best regards

MajorP93

@andriy.sultanov I created a small test setup in our lab. I created a WIndows VM with a lot of free disk space (2 virtual disks, 2.5 TB free space in total). Hopefully that way I will be able to replicate the issue with full backup timeout for VMs with a lot of free space that occurred in our production environment.
The backup job is currently running. I will report back once it failed and once I had a chance to test if your fix solves the issue.