Backup Job HTTP connection abruptly closed
-
We are getting a backup error on only one server in our pool. We've already swapped the NFS storage and done a FULL CLONE of the VM for testing, but it still fails (all other servers back up fine to the same NFS server).
I have not found anything related to this error, and snapshot operations are working correctly. Any tips to solve this problem?
transfer
Start: Jul 27, 2021, 08:50:02 AM
End: Jul 27, 2021, 09:39:49 AM
Duration: an hour
Error: HTTP connection abruptly closed

Start: Jul 27, 2021, 08:50:02 AM
End: Jul 27, 2021, 09:39:49 AM
Duration: an hour
Error: HTTP connection abruptly closed

Start: Jul 27, 2021, 08:49:33 AM
End: Jul 27, 2021, 09:44:45 AM
Duration: an hour
Error: all targets have failed, step: writer.run()
Type: full
-
@_danielgurgel Here is the complete log of the operation.
vm.copy
{
  "vm": "54676579-2328-d137-1002-0f32920eab23",
  "sr": "50c59b18-5b5c-2eed-8c82-b8f7fdc8e9b5",
  "name": "VM_NAME"
}
{
  "call": {
    "method": "VM.destroy",
    "params": ["OpaqueRef:fc032b38-d8d7-43ab-983c-f54bc9dc6f85"]
  },
  "message": "operation timed out",
  "name": "TimeoutError",
  "stack": "TimeoutError: operation timed out
    at Promise.call (/opt/xen-orchestra/node_modules/promise-toolbox/timeout.js:13:16)
    at Xapi._call (/opt/xen-orchestra/packages/xen-api/src/index.js:644:37)
    at /opt/xen-orchestra/packages/xen-api/src/index.js:722:21
    at loopResolver (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:94:23)
    at Promise._execute (/opt/xen-orchestra/node_modules/bluebird/js/release/debuggability.js:384:9)
    at Promise._resolveFromExecutor (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:518:18)
    at new Promise (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:103:10)
    at loop (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:98:12)
    at retry (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:101:10)
    at Xapi._sessionCall (/opt/xen-orchestra/packages/xen-api/src/index.js:713:20)
    at Xapi.call (/opt/xen-orchestra/packages/xen-api/src/index.js:247:14)
    at loopResolver (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:94:23)
    at Promise._execute (/opt/xen-orchestra/node_modules/bluebird/js/release/debuggability.js:384:9)
    at Promise._resolveFromExecutor (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:518:18)
    at new Promise (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:103:10)
    at loop (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:98:12)
    at Xapi.retry (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:101:10)
    at Xapi.call (/opt/xen-orchestra/node_modules/promise-toolbox/retry.js:119:18)
    at Xapi.destroy (/opt/xen-orchestra/@xen-orchestra/xapi/src/vm.js:324:16)
    at Xapi._copyVm (file:///opt/xen-orchestra/packages/xo-server/src/xapi/index.mjs:322:9)
    at Xapi.copyVm (file:///opt/xen-orchestra/packages/xo-server/src/xapi/index.mjs:337:7)
    at Api.callApiMethod (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/api.mjs:304:20)"
}
-
It means XO sent a command to the XAPI of your pool and it never answered, at least not before the timeout.
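The `TimeoutError` in the stack above comes from a timeout wrapped around the XAPI call. A minimal sketch of that pattern (generic `Promise.race`, not the actual promise-toolbox implementation; the hanging call is a hypothetical stand-in for an unresponsive host):

```javascript
// Sketch: race the RPC promise against a timer. If the host never
// answers, the timer wins and a TimeoutError surfaces in the log.
class TimeoutError extends Error {
  constructor() {
    super('operation timed out');
    this.name = 'TimeoutError';
  }
}

function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError()), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-in for a XAPI call that never answers.
const hangingCall = new Promise(() => {});

withTimeout(hangingCall, 100).catch(err => {
  console.log(`${err.name}: ${err.message}`);
  // prints "TimeoutError: operation timed out"
});
```

Note that the timeout only bounds how long XO waits; the underlying operation may still be running (or stuck) on the host, which is why the investigation has to happen host-side.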
-
@olivierlambert But is there any reason why this issue only occurs for this VM? Even after cloning the VM, the problem happens with the clone... even after changing the NFS server, the problem happens... Let's try moving it to a new cluster.
-
I can't guess without taking more time to investigate, ideally on the host directly.
My guess is the issue is related to the host/pool connection with XO, not the storage.
-
@olivierlambert Is there any difference between a "traditional" backup and the Export VM operation performed by Xen Orchestra?
Even after changing the cluster's virtual server, the problem still occurs. However, the Export operation works normally.
-
@_danielgurgel said in Backup Job HTTP connection abruptly closed:
Error: all targets have failed, step: writer.run()
I had a similar issue today too, but restarting the backup worked. Weird. I had another similar case a little while ago that I opened a ticket for as well.
-
@_danielgurgel if you mean basic backup, it's an XVA export in both cases. The only difference is that in the backup case, you are writing the file to a remote instead of sending it to your browser.
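For reference, a sketch of the request both paths boil down to, assuming the standard XAPI HTTP `/export` handler (HOST, VM_UUID and SESSION below are placeholders, not values from this thread):

```shell
# Both "backup" and "Export VM" fetch an XVA stream from the pool master;
# the only difference is where XO writes the resulting bytes.
HOST="xcp-host.example"
VM_UUID="00000000-0000-0000-0000-000000000000"
SESSION="OpaqueRef:placeholder"
URL="https://${HOST}/export?uuid=${VM_UUID}&session_id=${SESSION}"
echo "$URL"
# To actually stream the XVA (requires a real host and a valid session):
# curl --insecure "$URL" -o vm.xva
```

Since the same handler serves both paths, a failure that hits only backups points at the remote-writing side or at timing (backups run longer and concurrently), not at the export itself.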
-
We've been having the same problem with our Delta backups for several weeks now. The job runs every day and fails like this roughly one day in three. It seems to affect random VMs, but one or two seem to be affected more often than the rest.
We tried increasing the ring buffers on the physical network interfaces, but it didn't help. Next we're going to try pausing GC during the backups to see if that helps.
We looked at SMlog and daemon.log and could not find any obvious problems on the host occurring at the time of the error. If it's a problem with networking, how could we verify this?
-
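One low-level way to check the networking hypothesis is to sample the kernel's TCP counters across a backup window and look for a high ratio of retransmitted segments. A minimal sketch (Linux-only, reads `/proc/net/snmp`; nothing here is specific to XCP-ng or Xen Orchestra):

```python
#!/usr/bin/env python3
import time

def tcp_counters(path="/proc/net/snmp"):
    """Return the kernel's TCP counters as a {name: value} dict."""
    with open(path) as f:
        # /proc/net/snmp holds a "Tcp:" header line followed by a
        # "Tcp:" value line; strip the prefix and pair them up.
        tcp = [line.split()[1:] for line in f if line.startswith("Tcp:")]
    names, values = tcp[0], tcp[1]
    return dict(zip(names, map(int, values)))

if __name__ == "__main__":
    before = tcp_counters()
    time.sleep(1)  # in practice, sample across the whole backup window
    after = tcp_counters()
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    print(f"{retrans} retransmissions out of {sent} segments sent")
```

A retransmission rate well above a fraction of a percent while the backup streams would support a flaky network path; a clean counter points back at the software stack.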
@lavamind please triple check you are using XOA on latest or, if XO from the sources, on master.
-
@olivierlambert Yeah, that's definitely the next thing we'll try. For now we're using sources on release 5.59. If the problem persists we'll upgrade to 5.63 next week. Not too keen on following master, since we've had issues with it in the past (including bad backups)...
-
"FYI, we do our best to ensure master is not broken but we only do the complete QA process just before an XOA release"
Is that still the case?
(From https://github.com/vatesfr/xen-orchestra/issues/3784#issuecomment-447797895)
-
It's always the case
-
@olivierlambert Even after updating the host from 8.0 to 8.2 (with the latest update level), and after the cluster and NFS migration, the problem persists.
We updated the virtualization agent on the virtual server to the latest version available from Citrix and were able to back it up for a few weeks... but the problem reoccurred, again only for the same server.
Are there any logs I can paste to help identify this failure?
-
This is not an easy question. It would require investigation on the host, I'm afraid.
-
For the record, since upgrading to 5.63 the issue hasn't re-occurred at all.