Host disconnected midway during backup, now unable to start/restart/cancel

justjosh

My pool master kept disconnecting intermittently throughout the backup process.

Job status is still "started". Some VMs have already failed with errors:

Error: HANDLE_INVALID(VBD, OpaqueRef:8fe3c750-5277-454c-9cad-23481645cd1e)
Error: task has been destroyed before completion

Some are stuck as "started"

Nothing is left under "Tasks"

Cannot restart/force restart them

the job (e8c6772b-0bab-489d-a24e-b41e07f9298b) is already running

Pressing cancel on the entire job does nothing

Out of ideas!

Edit: I just want a way to stop the jobs that are stuck in purgatory. I don't mind restarting the backup from the beginning.

olivierlambert

You can restart xo-server on XOA side, and restart the toolstack on XCP-ng side to be entirely sure there's nothing left.

However, the root cause should be investigated.

justjosh

@olivierlambert

Restarting xo-server allowed me to restart all the jobs except for one particular VM. How do I unblock this?

Start: Dec 5, 2020, 06:53:28 PM
End: Dec 5, 2020, 06:53:41 PM
Duration: a few seconds
Error: SR_BACKEND_FAILURE_82(, Failed to snapshot VDI [opterr=['MAP_DUPLICATE_KEY', 'VDI', 'sm_config', 'OpaqueRef:92d82ecf-f03c-4f6d-9f2f-6d4f8beced23', 'paused']], )

Danp

@justjosh I seem to recall running into this issue once before. IIRC, I used this method to resolve it:

https://discussions.citrix.com/topic/399028-paused-vdi/#comment-2025338

justjosh

@danp Did you just remove the VDI, or did you take other steps?

Danp

@justjosh I ran the commands from the link I posted to remove the "paused" flag from sm-config

justjosh

@danp Which UUID did you use as the reference? I've used the UUID provided under OpaqueRef but it's saying that UUID is invalid.

>>> vdi_ref = session.xenapi.VDI.get_by_uuid('92d82ecf-f03c-4f6d-9f2f-6d4f8beced23')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 264, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 160, in xenapi_request
    result = _parse_result(getattr(self, methodname)(*full_params))
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 238, in _parse_result
    raise Failure(result['ErrorDescription'])
XenAPI.Failure: ['UUID_INVALID', 'VDI', '92d82ecf-f03c-4f6d-9f2f-6d4f8beced23']

Danp

@justjosh You should be able to get the correct UUID by going to the VM's Disks tab in XO and clicking the copy icon for the desired disk.
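
For context on the UUID_INVALID error above: the value after "OpaqueRef:" in the backup error is an internal XAPI reference, not a VDI UUID, which is why VDI.get_by_uuid rejects it. A minimal Python sketch of the distinction follows. The helper name vdi_uuid_from_error_ref is hypothetical, session is assumed to be an already logged-in XenAPI session, and the sketch assumes the reference from the error message still resolves on the pool. Copying the UUID from XO remains the simpler route.

```python
def vdi_uuid_from_error_ref(session, ref):
    """Translate an OpaqueRef taken from an error message into a VDI UUID.

    `session` is assumed to be a logged-in XenAPI.Session (or any object
    exposing the same `xenapi.VDI` calls). Hypothetical helper, for
    illustration only.
    """
    # XAPI references carry an "OpaqueRef:" prefix. If only the hex part
    # was copied from the error, put the prefix back before the lookup.
    if not ref.startswith("OpaqueRef:"):
        ref = "OpaqueRef:" + ref
    # VDI.get_uuid takes a reference and returns the UUID string,
    # whereas VDI.get_by_uuid goes the other way (UUID -> reference).
    return session.xenapi.VDI.get_uuid(ref)
```

The returned UUID is the one to pass to VDI.get_by_uuid or to xe commands.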

justjosh

@danp I was under the impression I would need to remove the VDI for the snapshot and not the VDI for the VM's disk. Is the remove_from_sm_config command supposed to be run on the VM disk?

Danp

@justjosh Yes, the goal is to 'unpause' the VDI for the VM's disk.

If you look back at the thread I posted, they showed the output from xe vdi-list uuid=d50a85ca-eda2-4cbd-a348-80c7d6808ac1 params=all, where the UUID was that of the VM's disk VDI. You could run the same command on your VDI to confirm that the issue is present in its sm-config entry.
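
The same check can also be scripted against the Python XenAPI session. This is only a rough sketch and is not taken from the linked post: read_sm_config and is_marked_paused are made-up helper names, and session is assumed to be an already logged-in XenAPI session.

```python
def read_sm_config(session, vdi_uuid):
    """Return the sm_config map of a VDI as a plain dict."""
    # Look the VDI up by the UUID of the VM's disk (not an OpaqueRef).
    vdi_ref = session.xenapi.VDI.get_by_uuid(vdi_uuid)
    # VDI.get_sm_config returns the key/value map that xe shows as sm-config.
    return dict(session.xenapi.VDI.get_sm_config(vdi_ref))


def is_marked_paused(session, vdi_uuid):
    """Return True if the VDI's sm_config contains a "paused" key."""
    return "paused" in read_sm_config(session, vdi_uuid)
```

If is_marked_paused returns True for the disk named in the backup error, the stale flag is present.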

justjosh

@danp I've cleared the pause but I'm still encountering errors regarding pause. Any ideas?

I still have a snapshot from 4th Dec when the error first started. Is it safe to delete that? Will the delta backup be able to merge a full snapshot with the existing chain?

 Snapshot 
Start: Dec 8, 2020, 09:15:07 AM
End: Dec 8, 2020, 09:15:08 AM
Duration: a few seconds
Error: SR_BACKEND_FAILURE_82(, Failed to snapshot VDI [opterr=failed to pause VDI d11dc884-b91d-4ea0-87ef-6b96ce5b0ad4], )

Danp

@justjosh Yes, it should be fine to delete the snapshot and then rerun the backup job.

nicolas

@Danp Do you remember what command you used to remove the "paused" flag? The Citrix link is no longer working. Thanks!

nicolas

As I needed it urgently, I wrote a Python script that removes the flag from the database with these calls:

vdi_ref = session.xenapi.VDI.get_by_uuid(vdi_uuid)
session.xenapi.VDI.remove_from_sm_config(vdi_ref, "paused")

It worked and my VMs are back in business!
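
For anyone who needs the surrounding script, here is a slightly fuller sketch built around the two calls above. It is not the original script: clear_paused_flag and run_on_local_host are hypothetical names, and it assumes it runs in dom0 on the pool master, where XenAPI.xapi_local() can open a session over the local socket with an empty root password.

```python
def clear_paused_flag(session, vdi_uuid):
    """Remove "paused" from a VDI's sm_config; return True if it was removed."""
    vdi_ref = session.xenapi.VDI.get_by_uuid(vdi_uuid)
    # Only touch the map when the key is actually there.
    if "paused" not in session.xenapi.VDI.get_sm_config(vdi_ref):
        return False
    session.xenapi.VDI.remove_from_sm_config(vdi_ref, "paused")
    return True


def run_on_local_host(vdi_uuid):
    """Open a local XAPI session (assumption: run in dom0) and clear the flag."""
    # Imported here because the XenAPI module is only present on the host.
    import XenAPI

    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")
    try:
        return clear_paused_flag(session, vdi_uuid)
    finally:
        # Always release the session, even if the lookup fails.
        session.xenapi.session.logout()
```

Calling run_on_local_host with the UUID of the VM's disk returns False when there was nothing to clear, which helps tell a stale flag apart from a different problem.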

Danp

@nicolas Yes, that is essentially what was shown in the link. Note: the developers have warned to be careful using this technique because the disk is in a paused state for a reason, so simply clearing the flag could lead to unintended consequences.

Alternatively, you can try running this command --

/opt/xensource/sm/resetvdis.py single <VDI UUID>

madrianr

Hello, I have the same problem - see also here:
https://community.citrix.com/topic/253636-disable-cbt-sr_backend_failure_202-map_duplicate_key/#comment-86996

When I run the script posted here, I get the following result:

[root@wkkctxhy01 ~]# /opt/xensource/sm/resetvdis.py single 1e1c1ed3-9eb9-4b95-aabd-91b95111cc70
VDI 1e1c1ed3-9eb9-4b95-aabd-91b95111cc70 is not marked as attached anywhere, nothing to do

Any help is welcome
robert

madrianr

After using "Forget" on the disks and then Rescan/Reattach, it works now:
xe vdi-forget uuid=7a3b69fb-08d3-4e10-8b07-c05ee876eabe
xe vdi-forget uuid=433d1e56-80af-4691-93f6-84af4c411565
xe vdi-forget uuid=78a38bc3-bba3-4be6-8bdb-cfd20eaf8b44
xe vdi-forget uuid=1e1c1ed3-9eb9-4b95-aabd-91b95111cc70
xe vdi-forget uuid=671fc875-cb61-462d-98f2-62912b217570
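
The forget-then-rescan sequence above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: forget_and_rescan is a hypothetical helper, session is assumed to be a logged-in XenAPI session, and it assumes XAPI allows the VDIs to be forgotten in their current state. VDI.forget only drops the database record and leaves the data on the SR alone; reattaching the rediscovered disks to the VM is a separate step not shown here.

```python
def forget_and_rescan(session, vdi_uuids):
    """Forget each VDI record, then rescan every SR that held one of them.

    Returns the set of SR references that were rescanned.
    """
    sr_refs = set()
    for vdi_uuid in vdi_uuids:
        vdi_ref = session.xenapi.VDI.get_by_uuid(vdi_uuid)
        # Record the SR first: the VDI record is gone once forget() returns.
        sr_refs.add(session.xenapi.VDI.get_SR(vdi_ref))
        # Removes the VDI record from the XAPI database only.
        session.xenapi.VDI.forget(vdi_ref)
    for sr_ref in sr_refs:
        # Refreshes the list of VDIs XAPI associates with this SR.
        session.xenapi.SR.scan(sr_ref)
    return sr_refs
```

Noting down each disk's UUID and name before forgetting it makes it easier to find the disk again after the rescan.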

jaayb

@madrianr I had a similar issue. I tried xe vdi-forget uuid=<VDI UUID> and rescanned after that, but now I cannot find the VDI.

Any ideas?

jaayb

Never mind, I was able to recover...