So I already have a ticket open for this, but I thought I'd post here to draw on the knowledge of the community as well.
We have a 5-host pool with several backup and replication jobs configured in XOA. Randomly, a job will stall with some VMs only partially backed up (the estimated end time shoots up to 2-3 days and progress stops). The only way I can cancel it is to restart the toolstack on the pool master and restart the xo-server service on the XOA appliance, then let GC run and retry the failed VMs. The retry usually results in a full backup of the failed VM. This can happen to both CR jobs and regular backup jobs.
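For context, the recovery I go through each time is roughly this (a sketch of what I run; exact service names may vary depending on your setup):

# On the pool master (dom0): restart the toolstack
xe-toolstack-restart

# On the XOA appliance: restart the xo-server service
systemctl restart xo-server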
The error I see in the XOA log for the failed job is:
"message": "HTTP connection has timed out",
"name": "Error",
"stack": "Error: HTTP connection has timed out\n at ClientRequest.<anonymous> (/usr/local/lib/node_modules/xo-server/node_modules/http-request-plus/index.js:61:25)\n at ClientRequest.emit (node:events:518:28)\n at ClientRequest.patchedEmit [as emit] (/usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/log/configure.js:52:17)\n at TLSSocket.emitRequestTimeout (node:_http_client:849:9)\n at Object.onceWrapper (node:events:632:28)\n at TLSSocket.emit (node:events:530:35)\n at TLSSocket.patchedEmit [as emit] (/usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/log/configure.js:52:17)\n at Socket._onTimeout (node:net:595:8)\n at listOnTimeout (node:internal/timers:581:17)\n at process.processTimers (node:internal/timers:519:7)"
}
Now, I understand I can increase the timeout value, and that is what support has suggested, as does the XO documentation here:
https://docs.xen-orchestra.com/backup_troubleshooting#error-http-connection-has-timed-out
What I can't figure out is why this timeout occurs in the first place.
I have ensured the XOA appliance runs on the pool master at all times, and I have also ensured the pool master is the least loaded of the 5 hosts in our production pool, in an attempt to mitigate this, with no success. The pool master is only running 17 VMs and has 32 cores and 384 GB of RAM, with 8 GB assigned to dom0.
Each host has 4x 10GbE SFP+ ports: two are in a multi-chassis LAG (LACP) dedicated to storage (NFSv3), and the other two are in another multi-chassis LAG (LACP) dedicated to VM/management/backup traffic.
So I don't think anything is overloaded enough to cause this. Any suggestions, or something to look for? I have increased the timeout value for now (roughly as shown below), but the docs seem to imply I need to get to the bottom of what is causing it for a longer-term solution.
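For reference, this is roughly the workaround I applied on the XOA appliance, following the linked doc (the file name and config key are from memory, so double-check them against the page above before copying):

# Add a config override with a longer inactivity timeout for HTTP connections to XAPI
cat > /etc/xo-server/config.httpInactivityTimeout.toml <<'EOF'
[xapiOptions]
# timeout in milliseconds (here: 30 minutes)
httpInactivityTimeout = 1800000
EOF

# Restart xo-server so the new config is picked up
systemctl restart xo-server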