Help with stuck delta backups.



  • I'm looking for some suggestions.
    I've been running xcp-ng with XO Community edition installed in an Ubuntu VM on that same xcp-ng host, installed and updated via the jarli scripts. I'm using XO to create delta backups of two virtual machines running on xcp-ng -- pfSense and the Ubuntu VM that runs XO itself. These backups are sent to a FreeNAS box on a local home network over a wired Cat5e connection.

    Once or twice a week, the backups get stuck -- by stuck I mean they hang in either the transfer or the merge stage of the delta backup process. There may possibly be a connection interruption, however when I see, for example, that a job is stuck during the transfer, I ssh into the boxes and I don't see a problem. I try killing the jobs with the XO GUI and this doesn't exactly work. I'm not sure I know enough to figure out which process the backup is running under when I use htop on the XO host. Each job, when it completes successfully, should take under 5 min.

    So my questions:

    1. How best to identify and kill the backup jobs on the xo or xcp-ng host? What should I be looking for at the command line?
    2. Future suggestion -- If XO detects that a job is stuck (I'm not even sure this logic is available), couldn't there be a feature for XO to kill the job automatically? For example, it sees a backup has been running for 6 hours. I understand each user's requirements may be different and a backup might actually take 6 hours, so I guess this could be an optional parameter.

    I can post any logs. To rectify the situation, I've tried forcibly killing the Ubuntu VM that XO runs on and rebooting it -- this doesn't always work and I actually end up having to reboot the entire xcp-ng installation -- perhaps this is overkill (probably is), however when the command xe vm-list on the xcp-ng host says something like connection lost and lists no vms --- I'm not sure what to do.


  • Admin

    Please upgrade to the latest version, there are some XAPI connection improvements that should avoid stuck jobs.

    It's not that easy to kill a job, because there are multiple things involved: XAPI itself, streams, XO, etc. A job doesn't really exist per se, it's a collection of various processes in various places.

    So first, please try again with latest version for a while, and report back 🙂



  • I upgraded via the jarli script however I don't think it made a difference. I ran the scripts manually after the upgrade and things worked, however when the backup scripts were re-run by the cron daemon at 2am -- the entire xo web interface froze. I tried killing the service (stopping didn't work) but had to manually kill the node process. Upon restarting the xo-server.service -- it won't accept connections. The Ubuntu machine itself didn't freeze, so I examined the journalctl logs. Here are some tidbits:

    Apr 14 13:33:03 ubuntu_xo xo-server[650]: 2019-04-14T13:33:03.804Z - xo:plugin - [INFO] Cannot find module '/usr/local/lib/node_modules//xo-server-auth-github'
    Apr 14 13:33:03 ubuntu_xo xo-server[650]: { error:
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:    { Error: Cannot find module '/usr/local/lib/node_modules//xo-server-auth-github'
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Function.Module._resolveFilename (module.js:548:15)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Function.Module._load (module.js:475:25)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Module.require (module.js:597:17)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at require (internal/module.js:11:18)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo.<anonymous> (/opt/xen-orchestra/packages/xo-server/src/index.js:266:17)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Generator.next (<anonymous>)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at asyncGeneratorStep (/opt/xen-orchestra/packages/xo-server/dist/index.js:104:103)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at _next (/opt/xen-orchestra/packages/xo-server/dist/index.js:106:194)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at /opt/xen-orchestra/packages/xo-server/dist/index.js:106:364
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._execute (/opt/xen-orchestra/node_modules/bluebird/js/release/debuggability.js:313:9)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._resolveFromExecutor (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:483:18)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at new Promise (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:79:10)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo.<anonymous> (/opt/xen-orchestra/packages/xo-server/dist/index.js:106:97)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo._registerPlugin (/opt/xen-orchestra/packages/xo-server/dist/index.js:352:26)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo.registerPlugin (/opt/xen-orchestra/packages/xo-server/dist/index.js:320:26)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo.registerPluginWrapper (/opt/xen-orchestra/packages/xo-server/src/index.js:311:24)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise.all.name (/opt/xen-orchestra/packages/xo-server/src/index.js:336:37)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at arrayMap (/opt/xen-orchestra/node_modules/lodash/_arrayMap.js:16:21)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at map (/opt/xen-orchestra/node_modules/lodash/map.js:50:10)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Xo.<anonymous> (/opt/xen-orchestra/packages/xo-server/src/index.js:334:15)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Generator.next (<anonymous>)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at asyncGeneratorStep (/opt/xen-orchestra/packages/xo-server/dist/index.js:104:103)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at _next (/opt/xen-orchestra/packages/xo-server/dist/index.js:106:194)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at tryCatcher (/opt/xen-orchestra/node_modules/bluebird/js/release/util.js:16:23)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._settlePromiseFromHandler (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:512:31)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._settlePromise (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:569:18)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._settlePromise0 (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:614:10)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Promise._settlePromises (/opt/xen-orchestra/node_modules/bluebird/js/release/promise.js:694:18)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at _drainQueueStep (/opt/xen-orchestra/node_modules/bluebird/js/release/async.js:138:12)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at _drainQueue (/opt/xen-orchestra/node_modules/bluebird/js/release/async.js:131:9)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Async._drainQueues (/opt/xen-orchestra/node_modules/bluebird/js/release/async.js:147:5)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at Immediate.Async.drainQueues (/opt/xen-orchestra/node_modules/bluebird/js/release/async.js:17:14)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at runCallback (timers.js:810:20)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at tryOnImmediate (timers.js:768:5)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:     at processImmediate [as _immediateCallback] (timers.js:745:5)
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:      code: 'MODULE_NOT_FOUND',
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:      [Symbol(originalCallSite)]:
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:       [ CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {} ],
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:      [Symbol(mutatedCallSite)]:
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:       [ CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         [Object],
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         [Object],
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         [Object],
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {},
    Apr 14 13:33:03 ubuntu_xo xo-server[650]:         CallSite {} ] } }
    

    This error keeps repeating -- which is super annoying

    Apr 14 13:37:51 ubuntu_xo xo-server[650]: xo-server-cloud: fail to connect to updater { ConnectionError: connect ECONNREFUSED 127.0.0.1:9001
    

    And then I found these errors:

    Apr 15 07:00:00 ubuntu_xo xo-server[6582]: 2019-04-15T07:00:00.036Z - xo:xapi - [DEBUG] Snapshotting VM Ubuntu Xen Orchestra as [XO Backup pfSense/XO Backup] Ubu
    Apr 15 07:00:00 ubuntu_xo systemd[1774]: Failed to canonicalize path /home/user/.config/systemd/user/run-xo\x2dserver-mounts-c2d300fe\x2d8305\x2d4bd0\x2dbdc3\x
    Apr 15 07:00:00 ubuntu_xo systemd[1774]: Failed to canonicalize path /home/user/.config/systemd/user/run-xo\x2dserver-mounts-c2d300fe\x2d8305\x2d4bd0\x2dbdc3\x
    Apr 15 07:00:00 ubuntu_xo systemd[1774]: Failed to canonicalize path /home/user/.config/systemd/user/run-xo\x2dserver-mounts-c2d300fe\x2d8305\x2d4bd0\x2dbdc3\x
    

    The entire web interface doesn't respond

    htop doesn't show memory to be full (only 351M used) although there are a bunch of node processes.

    I'm not sure what to do at this point, since I'm basically back to having to kill the actual VM: restarting the xo-server service doesn't seem to clear everything out.


  • Admin

    1. Please use Markdown syntax when pasting a bunch of logs, otherwise they're hard to read (I edited your post, feel free to edit it yourself to see what I did).
    2. We don't do support for 3rd party script installs of XO (because we don't know what's done in them), you should probably ask on the GitHub repo where you found the script, or stick to our official doc to install XO from the sources.
    3. The error message keeps repeating because the 3rd party script is not doing a proper installation, but there's nothing we can do. Please report it to the script author.


  • @kevdog I've fixed the scripts on Github. You can issue the following commands
    from your XO console to remove the incorrect symlink --

    cd /usr/local/lib/node_modules/
    sudo unlink xo-server-auth-github
    

    The symlink will be properly recreated the next time you update the VM.

    P.S. I believe your issue with the hanging backups is unrelated, so you may want to dig further in the logs to see if you can identify the cause.
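
    For digging through the journal, a minimal sketch of a filter that hides the two unrelated messages quoted earlier in this thread (the missing xo-server-auth-github module and the updater connection error). The function name drop_known_noise and the time window are only illustrative assumptions; the stack-trace lines that follow the module error are not removed.

```shell
#!/usr/bin/env bash
# Drop the two known, unrelated messages quoted earlier in this thread so
# that whatever remains around the backup window is easier to read.
# Reads log lines on stdin, writes the remaining lines to stdout.
drop_known_noise() {
  grep -v \
    -e 'xo-server-cloud: fail to connect to updater' \
    -e "Cannot find module '/usr/local/lib/node_modules//xo-server-auth-github'"
}

# Assumed usage on the XO VM (adjust the window to your backup schedule):
#   journalctl -u xo-server --since "02:00" --until "03:00" | drop_known_noise
```

    Whatever is left around the time the job starts is more likely to be related to the hang.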



  • Try scheduling the job at some other time, when you can watch it, and then run a ping from the xo-vm against the mounted target.
    Do you see any packet loss?
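
    A rough sketch of that check, assuming the usual summary line that ping prints ("... N% packet loss ..."). The function names and the NAS address placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the packet-loss percentage from ping's summary line.
# Reads ping output on stdin and prints just the number (e.g. "0" or "12.5").
packet_loss_percent() {
  sed -n 's/.*[ ,]\([0-9.]*\)% packet loss.*/\1/p'
}

# Ping a host a number of times (default 100) and print the loss percentage.
check_target() {
  local host="$1" count="${2:-100}"
  ping -c "$count" "$host" | packet_loss_percent
}

# Assumed usage from the XO VM while a backup is running
# (replace the placeholder with the address of the FreeNAS box):
#   check_target <nas-address> 200
```

    Anything other than 0 while the job is transferring would point towards the network rather than XO.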



  • Hey I made the changes as suggested above however --- I really hate how fragile this delta backup mechanism is. I do nightly backups of 3 VMs to a local FreeNAS. The backups will run without a hitch for about 3 days and then everything just blows up. My latest log file states:

    Async.VDI.snapshot (on xcp-ng) 0%  <---two entries listed like this that perpetually stay at 0%
    

    and then I get:

     Snapshot 
    Start: Apr 21, 2019, 2:30:00 AM
    End: Apr 21, 2019, 4:30:04 AM
    Error: OTHER_OPERATION_IN_PROGRESS(SR, OpaqueRef:525c4561-4986-4fc4-be7e-e275f96dfce3, VDI.snapshot)
    

    This doesn't look like a transfer problem to me -- in this case. I know I keep asking the same questions, however with this process stuck -- what do I kill? I usually end up having to not just kill the VM but reboot the entire xcp-ng server --- well, actually not just reboot, I have to physically unplug the unit and plug it back in. xcp-ng becomes unstable over time and just freezes after a few days with these stuck processes.



  • @kevdog this sounds like some kind of network-connectivity issue.
    Are XOA and xcp-ng within the same /24 network (broadcast domain) or is there a firewall between them?

    To get rid of the stuck tasks you can try restarting the toolstack by executing "xe-toolstack-restart" in the CLI of the xcp-ng machine.
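
    Before restarting the toolstack it may help to see which tasks are actually pending. A minimal sketch, assuming the usual "key ( RO): value" layout of xe task-list output with records separated by blank lines; the function name pending_tasks is hypothetical, and not every task can be cancelled.

```shell
#!/usr/bin/env bash
# Print "uuid name-label progress" for every task whose status is "pending".
# Reads `xe task-list` output on stdin.
pending_tasks() {
  awk -F': ' '
    $1 ~ /^uuid /      { uuid = $2 }
    $1 ~ /name-label/  { name = $2 }
    $1 ~ /progress/    { progress = $2 }
    $1 ~ /status/ && $2 == "pending" { pending = 1 }
    # A blank line ends a record: report it if it was pending, then reset.
    /^[[:space:]]*$/   { if (pending) print uuid, name, progress; pending = 0 }
    END                { if (pending) print uuid, name, progress }
  '
}

# Assumed usage on the xcp-ng host:
#   xe task-list params=uuid,name-label,status,progress | pending_tasks
#   xe task-cancel uuid=<uuid-from-the-list>
#   xe-toolstack-restart   # if the task refuses to go away
```

    A task such as Async.VDI.snapshot sitting at 0 for hours would show up here.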


  • Admin

    You have a problem on your XCP-ng host, this is not an XO issue. It's stuck doing a snapshot, which is really a basic operation that shouldn't fail at all.

    Double check your SMlog on the host.
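
    One possible way to narrow the SMlog down, as a sketch: keep only the lines from the time the snapshot started that mention an error. It assumes the log is at /var/log/SMlog on the host and that its lines start with a syslog-style timestamp; the function name sm_errors and the keyword list are illustrative guesses, not an official recipe.

```shell
#!/usr/bin/env bash
# Keep lines that contain the given timestamp prefix and look like failures.
# $1: prefix exactly as it appears in the log, e.g. "Apr 21 02:3"
# Reads the log on stdin.
sm_errors() {
  grep -F -- "$1" | grep -iE 'exception|error|fail'
}

# Assumed usage on the xcp-ng host, for the 2:30 AM job mentioned above:
#   sm_errors "Apr 21 02:3" < /var/log/SMlog
```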

