CBT: the thread to centralize your feedback
-
I'm not sure if this happened after my initial manual run of my backup job or the scheduled one that ran afterwards, but one of the VMs is now showing again as attached to the control domain.
Is this something I need to keep checking or should it resolve itself? The backup job completed hours ago.
-
@CJ Normally this should not happen; I don't see this at our end. Mostly it happens after an incomplete backup job.
-
@CJ said in CBT: the thread to centralize your feedback:
@rtjdamen The VMs were running on the master so I had rebooted it since I don't recall how match the UUIDs. I'll try rebooting the other node and see if that works.
EDIT: That worked. Even though they were running on the master they were attached to the other node.
If anyone has this issue and does not want to reboot hosts, you can migrate the VM to a different SR to partly fix it: the VDI will still be orphaned and attached until the next host reboot, but the backup will run without issues.
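For reference, this is roughly how you can check which VDIs are attached to the control domain from the CLI. A minimal sketch, assuming an XCP-ng/XenServer host with the xe CLI; the UUIDs below are placeholders:

```
# find the control domain (dom0) UUID
xe vm-list is-control-domain=true params=uuid --minimal

# list VBDs currently attached to that control domain and the VDIs behind them
xe vbd-list vm-uuid=<dom0-uuid> currently-attached=true params=uuid,vdi-uuid,device

# resolve a VDI to a name label and SR, to spot leftover backup snapshots
xe vdi-list uuid=<vdi-uuid> params=name-label,sr-name-label,cbt-enabled
```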
-
@rtjdamen Unfortunately it's still happening to me and getting worse.
Yesterday, I had one VM with the issue. When the backup ran I got the report stating that VM failed to backup but all others succeeded. When I just checked the dashboard health, I see that I now have three VMs with control domain attached VDIs. The backup job only lists the one VM as having failed.
One unusual thing is that these three are three of the four VMs that had problems originally. So I'm not sure if there's an issue with the VMs themselves or something else causing those to error.
-
The VM refused to launch until I disabled CBT.
vm.start { "id": "59f0ba04-5814-7154-22d2-51ae24ecf146", "bypassMacAddressesCheck": false, "force": false } { "code": "FAILED_TO_START_EMULATOR", "params": [ "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624", "domid 29", "QMP failure at File \"xc/device.ml\", line 3491, characters 71-78" ], "call": { "method": "VM.start", "params": [ "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624", false, false ] }, "message": "FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78)", "name": "XapiError", "stack": "XapiError: FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78) at Function.wrap (file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/_XapiError.mjs:16:12) at file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/transports/json-rpc.mjs:38:21 at runNextTicks (node:internal/process/task_queues:60:5) at processImmediate (node:internal/timers:454:9) at process.callbackTrampoline (node:internal/async_hooks:130:17)" }
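In case it helps anyone hitting the same thing, this is roughly how CBT can be toggled per disk from the CLI. A sketch assuming the xe CLI; the VM name and UUID are placeholders, and XO normally manages CBT itself:

```
# list the disk VDIs of the affected VM
xe vbd-list vm-name-label="<vm-name>" type=Disk params=vdi-uuid --minimal

# disable CBT on one of those disks (it can be re-enabled later with vdi-enable-cbt)
xe vdi-disable-cbt uuid=<vdi-uuid>
```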
-
@CJ Do you run XOA or XO from the sources? Are you on the latest build? We do not experience this issue like you do. I had some issues with hanging jobs that caused VDIs to stay attached to the Control Domain. Normally this should be handled by the backup job itself. However, we have seen a case where this happens when a speed limit is set on the backup job; could it be that you have set one? Maybe you can try disabling it and see what it brings.
I believe this issue in general is one that should be resolved: having disks that stay attached to the control domain causes problems, and it's not doable to restart hosts every time this happens. There needs to be a good mechanism to recover from this kind of issue. @olivierlambert is this something we can expect in the near future?
-
Well, it's hard to get at the problem if we cannot reproduce it on our side. The QMP failures really look like something else.
-
@olivierlambert I agree, but there should be a better way to recover from them.
-
You mean cleaning those VDIs attached to the control domain? I think there are some planned tasks to do a regular cleaning before/after each job. Still, I'm under the impression that @Tristis-Oris's issue is different.
-
@olivierlambert Yeah, I agree, they should not occur in the first place, but if they do, it is a mess that you need to reboot all hosts to get rid of them ;-).
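For what it's worth, in some cases a leftover attachment can be cleared without a reboot by removing the stale VBD on the control domain. A cautious sketch with placeholder UUIDs, only for snapshot VDIs a backup job left behind, and it may fail if the VBD is genuinely stuck:

```
# double-check what the stale VBD points at before touching it
xe vbd-list uuid=<vbd-uuid> params=vdi-uuid,vm-name-label,currently-attached

# detach it from the control domain, then remove the VBD record
xe vbd-unplug uuid=<vbd-uuid>
xe vbd-destroy uuid=<vbd-uuid>
```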
-
I followed the advice above and restarted all my hosts one by one, but I still have errors:
{ "data": { "mode": "delta", "reportWhen": "failure" }, "id": "1725368860234", "jobId": "9170c82b-ede4-491d-b792-f421b3e4b525", "jobName": "BACKUP_AND_REPLICATE_DEFAULT_BACKUP_JOB", "message": "backup", "scheduleId": "3e87844e-b48a-453f-87b4-e2a2cc3fa2c8", "start": 1725368860234, "status": "failure", "infos": [ { "data": { "vms": [ "cc59f885-c138-675d-fc81-054713586bc1", "96cfde06-61c0-0f3e-cf6d-f637d41cc8c6", "e1da68e8-ef42-44a9-386b-aceb7a920463", "565c724d-020f-baa8-8e7d-cf54d8b57a28" ] }, "message": "vms" } ], "tasks": [ { "data": { "type": "VM", "id": "cc59f885-c138-675d-fc81-054713586bc1", "name_label": "REC-APP-BCTI" }, "id": "1725368863035", "message": "backup VM", "start": 1725368863035, "status": "failure", "tasks": [ { "id": "1725368863044:0", "message": "clean-vm", "start": 1725368863044, "status": "success", "end": 1725368863493, "result": { "merge": false } }, { "id": "1725368863818", "message": "snapshot", "start": 1725368863818, "status": "success", "end": 1725368945306, "result": "7f7f7d83-4ac6-0cbd-0bd4-34f9fb94c4a5" }, { "data": { "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9", "isFull": false, "type": "remote" }, "id": "1725368945306:0", "message": "export", "start": 1725368945306, "status": "success", "tasks": [ { "id": "1725369159363", "message": "clean-vm", "start": 1725369159363, "status": "success", "end": 1725369159662, "result": { "merge": false } } ], "end": 1725369159662 }, { "data": { "id": "beee944b-e502-61d7-e03b-e1408f01db8c", "isFull": false, "name_label": "iSCSI-STORE-CES-01_HDD-01", "type": "SR" }, "id": "1725368945306:1", "message": "export", "start": 1725368945306, "status": "interrupted" } ], "end": 1725369159662, "result": { "code": "VDI_IN_USE", "params": [ "OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868", "destroy" ], "task": { "uuid": "e2318576-41e0-7c97-bef3-b66fb17dc12f", "name_label": "Async.VDI.destroy", "name_description": "", "allowed_operations": [], "current_operations": {}, "created": "20240903T13:12:39Z", "finished": "20240903T13:12:39Z", "status": "failure", "resident_on": "OpaqueRef:c329c60f-e5d8-4797-9019-2dcb1083227c", "progress": 1, "type": "<none/>", "result": "", "error_info": [ "VDI_IN_USE", "OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868", "destroy" ], "other_config": {}, "subtask_of": "OpaqueRef:NULL", "subtasks": [], "backtrace": "(((process xapi)(filename ocaml/xapi/message_forwarding.ml)(line 4711))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))" }, "message": "VDI_IN_USE(OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868, destroy)", "name": "XapiError", "stack": "XapiError: VDI_IN_USE(OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868, destroy)\n at XapiError.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1076:24)\n at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1110:14\n at Array.forEach (<anonymous>)\n at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1100:12)\n at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1273:14)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)" } 
}, { "data": { "type": "VM", "id": "e1da68e8-ef42-44a9-386b-aceb7a920463", "name_label": "PROD-BDD-LIC01" }, "id": "1725368863036", "message": "backup VM", "start": 1725368863036, "status": "failure", "tasks": [ { "id": "1725368863043", "message": "clean-vm", "start": 1725368863043, "status": "success", "end": 1725368863472, "result": { "merge": false } }, { "id": "1725368863842", "message": "snapshot", "start": 1725368863842, "status": "success", "end": 1725368876837, "result": "882ebd40-e16a-e286-f0ac-580728a3ec19" }, { "data": { "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9", "isFull": false, "type": "remote" }, "id": "1725368876838", "message": "export", "start": 1725368876838, "status": "success", "tasks": [ { "id": "1725368983568", "message": "clean-vm", "start": 1725368983568, "status": "success", "end": 1725368983673, "result": { "merge": false } } ], "end": 1725368983675 }, { "data": { "id": "beee944b-e502-61d7-e03b-e1408f01db8c", "isFull": false, "name_label": "iSCSI-STORE-CES-01_HDD-01", "type": "SR" }, "id": "1725368876839", "message": "export", "start": 1725368876839, "status": "interrupted" } ], "infos": [ { "message": "will delete snapshot data" }, { "data": { "vdiRef": "OpaqueRef:f10a5b53-170b-4c2d-9975-d419197cbf2a" }, "message": "Snapshot data has been deleted" } ], "end": 1725368983676, "result": { "message": "can't create a stream from a metadata VDI, fall back to a base ", "name": "Error", "stack": "Error: can't create a stream from a metadata VDI, fall back to a base \n at Xapi.exportContent (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/xapi/vdi.mjs:202:15)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:57:32\n at async Promise.all (index 0)\n at async cancelableMap (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_cancelableMap.mjs:11:12)\n at async exportIncrementalVm (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:26:3)\n at async IncrementalXapiVmBackupRunner._copy (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/IncrementalXapi.mjs:44:25)\n at async IncrementalXapiVmBackupRunner.run (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/_AbstractXapi.mjs:379:9)\n at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/VmsXapi.mjs:166:38" } }, { "data": { "type": "VM", "id": "565c724d-020f-baa8-8e7d-cf54d8b57a28", "name_label": "ADMIN-SEC-ADM01" }, "id": "1725368863036:0", "message": "backup VM", "start": 1725368863036, "status": "failure", "tasks": [ { "id": "1725368863044", "message": "clean-vm", "start": 1725368863044, "status": "success", "end": 1725368863755, "result": { "merge": false } }, { "id": "1725368864026", "message": "snapshot", "start": 1725368864026, "status": "success", "end": 1725369030903, "result": "3c37e2e9-9e08-6fa7-a1e4-c16d51b4106a" }, { "data": { "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9", "isFull": false, "type": "remote" }, "id": "1725369030903:0", "message": "export", "start": 1725369030903, "status": "success", "tasks": [ { "id": "1725369054746", "message": "clean-vm", "start": 1725369054746, "status": "success", "end": 1725369055027, "result": { "merge": false } } ], "end": 1725369055031 }, { "data": { "id": 
"beee944b-e502-61d7-e03b-e1408f01db8c", "isFull": false, "name_label": "iSCSI-STORE-CES-01_HDD-01", "type": "SR" }, "id": "1725369030903:1", "message": "export", "start": 1725369030903, "status": "interrupted" } ], "infos": [ { "message": "will delete snapshot data" }, { "data": { "vdiRef": "OpaqueRef:7a307227-d0bc-4810-92a8-6be38d0eecbb" }, "message": "Snapshot data has been deleted" }, { "data": { "vdiRef": "OpaqueRef:0048f302-86c2-4d69-81bc-feae8ef3cc15" }, "message": "Snapshot data has been deleted" } ], "end": 1725369055031, "result": { "message": "can't create a stream from a metadata VDI, fall back to a base ", "name": "Error", "stack": "Error: can't create a stream from a metadata VDI, fall back to a base \n at Xapi.exportContent (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/xapi/vdi.mjs:202:15)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:57:32\n at async Promise.all (index 0)\n at async cancelableMap (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_cancelableMap.mjs:11:12)\n at async exportIncrementalVm (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:26:3)\n at async IncrementalXapiVmBackupRunner._copy (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/IncrementalXapi.mjs:44:25)\n at async IncrementalXapiVmBackupRunner.run (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/_AbstractXapi.mjs:379:9)\n at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/VmsXapi.mjs:166:38" } }, { "data": { "type": "VM", "id": "96cfde06-61c0-0f3e-cf6d-f637d41cc8c6", "name_label": "SRV-SQL" }, "id": "1725368863038", "message": "backup VM", "start": 1725368863038, "status": "failure", "tasks": [ { "id": "1725368863045", "message": "clean-vm", "start": 1725368863045, "status": "success", "end": 1725368863747, "result": { "merge": false } }, { "id": "1725368864026:0", "message": "snapshot", "start": 1725368864026, "status": "success", "end": 1725368911729, "result": "4db8f7fb-d7d0-cd20-24d1-9223180f0c09" }, { "data": { "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9", "isFull": false, "type": "remote" }, "id": "1725368911729:0", "message": "export", "start": 1725368911729, "status": "success", "tasks": [ { "id": "1725369104743", "message": "clean-vm", "start": 1725369104743, "status": "success", "end": 1725369105118, "result": { "merge": false } } ], "end": 1725369105118 }, { "data": { "id": "beee944b-e502-61d7-e03b-e1408f01db8c", "isFull": false, "name_label": "iSCSI-STORE-CES-01_HDD-01", "type": "SR" }, "id": "1725368911729:1", "message": "export", "start": 1725368911729, "status": "interrupted" } ], "end": 1725369105118, "result": { "code": "VDI_IN_USE", "params": [ "OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd", "destroy" ], "task": { "uuid": "d3978e94-c447-987c-c15d-21adcb7a8805", "name_label": "Async.VDI.destroy", "name_description": "", "allowed_operations": [], "current_operations": {}, "created": "20240903T13:11:44Z", "finished": "20240903T13:11:44Z", "status": "failure", "resident_on": "OpaqueRef:c329c60f-e5d8-4797-9019-2dcb1083227c", "progress": 1, "type": "<none/>", "result": "", "error_info": [ "VDI_IN_USE", "OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd", "destroy" ], "other_config": {}, 
"subtask_of": "OpaqueRef:NULL", "subtasks": [], "backtrace": "(((process xapi)(filename ocaml/xapi/message_forwarding.ml)(line 4711))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))" }, "message": "VDI_IN_USE(OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd, destroy)", "name": "XapiError", "stack": "XapiError: VDI_IN_USE(OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd, destroy)\n at XapiError.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1076:24)\n at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1110:14\n at Array.forEach (<anonymous>)\n at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1100:12)\n at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1273:14)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)" } } ], "end": 1725369159662 }
What do I have to do?
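One thing that may help narrow down the VDI_IN_USE: checking what still references the VDI that Async.VDI.destroy refuses to remove. A sketch assuming the xe CLI and that you have the VDI's UUID (placeholder below); the OpaqueRef from the log would need to be resolved to a UUID first, e.g. in XO:

```
# list every VBD that still references the VDI which fails to destroy
xe vbd-list vdi-uuid=<vdi-uuid> params=uuid,vm-name-label,currently-attached
```

If one of those VBDs belongs to the control domain, it is the same leftover-attachment situation discussed earlier in the thread.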
-
@still_at_work This VDI_IN_USE error is something we have seen as well; when I retry, the jobs process fine at our end. So it seems more like a timing issue; Vates is looking into this. Not sure if this is the same issue as you are seeing.
-
In my case, even if I restart the failed jobs, the VDI_IN_USE error persists.
-
@still_at_work Let's see what Vates discovers in the logging; I bet it is some sort of timing issue.
-
See https://github.com/vatesfr/xen-orchestra/pull/7960
Might be helpful
-
Thanks @olivierlambert
I'm available to install the patch on my installation if you want, and if it's possible.
-
If you have a ticket number, we could take a look tomorrow.
-
@olivierlambert Would this maybe also affect the other VDI_IN_USE error?
-
@olivierlambert The VDI_IN_USE error we saw regarding destroy has not been occurring again. The VDI_IN_USE error regarding data_destroy is still occurring. Not sure whether the fix did anything or we were just lucky not to see issues on it last night. I believe @florent is right and this is caused by a timing issue, but maybe we need more than 4 seconds.