CBT: the thread to centralize your feedback

rtjdamen

@CJ normally this should not happen, i don’t see this at out end. Mostly this happens after an incomplete backupjob.

rtjdamen

@CJ said in CBT: the thread to centralize your feedback:

@rtjdamen The VMs were running on the master so I had rebooted it since I don't recall how match the UUIDs. I'll try rebooting the other node and see if that works.

EDIT: That worked. Even though they were running on the master they were attached to the other node.

If anyone has this issue and they do not want to reboot hosts, u can migrate the vm to a different sr to fix the issue partly, the vdi will still be orphan and attached till the next host reboot but the backup will run without issues.

CJ

@rtjdamen Unfortunately it's still happening to me and getting worse.

Yesterday, I had one VM with the issue. When the backup ran I got the report stating that VM failed to backup but all others succeeded. When I just checked the dashboard health, I see that I now have three VMs with control domain attached VDIs. The backup job only lists the one VM as having failed.

One unusual thing is that these three are three of the four VMs that had problems originally. So I'm not sure if there's an issue with the VMs themselves or something else causing those to error.

Tristis Oris

VM refused to launch untill i disable CBT.

vm.start
{
  "id": "59f0ba04-5814-7154-22d2-51ae24ecf146",
  "bypassMacAddressesCheck": false,
  "force": false
}
{
  "code": "FAILED_TO_START_EMULATOR",
  "params": [
    "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624",
    "domid 29",
    "QMP failure at File \"xc/device.ml\", line 3491, characters 71-78"
  ],
  "call": {
    "method": "VM.start",
    "params": [
      "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624",
      false,
      false
    ]
  },
  "message": "FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78)",
  "name": "XapiError",
  "stack": "XapiError: FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78)
    at Function.wrap (file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/_XapiError.mjs:16:12)
    at file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/transports/json-rpc.mjs:38:21
    at runNextTicks (node:internal/process/task_queues:60:5)
    at processImmediate (node:internal/timers:454:9)
    at process.callbackTrampoline (node:internal/async_hooks:130:17)"
}

rtjdamen

@CJ do you run XOA of XO from sources? are u on the latest build? we do not experience this issue like you do. i had some issues with hanging jobs that caused VDI's to stay attached to the Control Domain. Normally this should be handled by the backup job itself. However we have seen a case where this happens when a speed limit is set to the backup job, could it be that u have set one? Maybe u can try disabling it and see what it brings.

I believe that this issue in general is one that should be resolved, having disks that stay attached to the control domain is causing issues and it's not doable to restart hosts everytime this happens, there needs to be a good mechanism to recover from this kind of issues. @olivierlambert is this something that we can expect in the near future?

olivierlambert

Well, it's hard to get the problem if we cannot reproduce it either on our side. The QMP failures really looks like something else

rtjdamen

@olivierlambert i agree, but there should be a better way to recover from them.

olivierlambert

You mean cleaning those VDI attached to the control domain? I think there's some planned tasks to do a regular cleaning before/after each job. Still, I'm under the impression that @Tristis-Oris issue is different

rtjdamen

@olivierlambert yeah i agree, they should not occur in the first place but if they do it is a mess that you need to reboot all hosts to get rid of them ;-).

SylvainB

I followed the advice above, restarted all my hosts one by one, I still have errors

{
  "data": {
    "mode": "delta",
    "reportWhen": "failure"
  },
  "id": "1725368860234",
  "jobId": "9170c82b-ede4-491d-b792-f421b3e4b525",
  "jobName": "BACKUP_AND_REPLICATE_DEFAULT_BACKUP_JOB",
  "message": "backup",
  "scheduleId": "3e87844e-b48a-453f-87b4-e2a2cc3fa2c8",
  "start": 1725368860234,
  "status": "failure",
  "infos": [
    {
      "data": {
        "vms": [
          "cc59f885-c138-675d-fc81-054713586bc1",
          "96cfde06-61c0-0f3e-cf6d-f637d41cc8c6",
          "e1da68e8-ef42-44a9-386b-aceb7a920463",
          "565c724d-020f-baa8-8e7d-cf54d8b57a28"
        ]
      },
      "message": "vms"
    }
  ],
  "tasks": [
    {
      "data": {
        "type": "VM",
        "id": "cc59f885-c138-675d-fc81-054713586bc1",
        "name_label": "REC-APP-BCTI"
      },
      "id": "1725368863035",
      "message": "backup VM",
      "start": 1725368863035,
      "status": "failure",
      "tasks": [
        {
          "id": "1725368863044:0",
          "message": "clean-vm",
          "start": 1725368863044,
          "status": "success",
          "end": 1725368863493,
          "result": {
            "merge": false
          }
        },
        {
          "id": "1725368863818",
          "message": "snapshot",
          "start": 1725368863818,
          "status": "success",
          "end": 1725368945306,
          "result": "7f7f7d83-4ac6-0cbd-0bd4-34f9fb94c4a5"
        },
        {
          "data": {
            "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9",
            "isFull": false,
            "type": "remote"
          },
          "id": "1725368945306:0",
          "message": "export",
          "start": 1725368945306,
          "status": "success",
          "tasks": [
            {
              "id": "1725369159363",
              "message": "clean-vm",
              "start": 1725369159363,
              "status": "success",
              "end": 1725369159662,
              "result": {
                "merge": false
              }
            }
          ],
          "end": 1725369159662
        },
        {
          "data": {
            "id": "beee944b-e502-61d7-e03b-e1408f01db8c",
            "isFull": false,
            "name_label": "iSCSI-STORE-CES-01_HDD-01",
            "type": "SR"
          },
          "id": "1725368945306:1",
          "message": "export",
          "start": 1725368945306,
          "status": "interrupted"
        }
      ],
      "end": 1725369159662,
      "result": {
        "code": "VDI_IN_USE",
        "params": [
          "OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868",
          "destroy"
        ],
        "task": {
          "uuid": "e2318576-41e0-7c97-bef3-b66fb17dc12f",
          "name_label": "Async.VDI.destroy",
          "name_description": "",
          "allowed_operations": [],
          "current_operations": {},
          "created": "20240903T13:12:39Z",
          "finished": "20240903T13:12:39Z",
          "status": "failure",
          "resident_on": "OpaqueRef:c329c60f-e5d8-4797-9019-2dcb1083227c",
          "progress": 1,
          "type": "<none/>",
          "result": "",
          "error_info": [
            "VDI_IN_USE",
            "OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868",
            "destroy"
          ],
          "other_config": {},
          "subtask_of": "OpaqueRef:NULL",
          "subtasks": [],
          "backtrace": "(((process xapi)(filename ocaml/xapi/message_forwarding.ml)(line 4711))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))"
        },
        "message": "VDI_IN_USE(OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868, destroy)",
        "name": "XapiError",
        "stack": "XapiError: VDI_IN_USE(OpaqueRef:d1b7eaca-1bb8-41da-9a96-11103411b868, destroy)\n    at XapiError.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n    at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n    at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1076:24)\n    at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1110:14\n    at Array.forEach (<anonymous>)\n    at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1100:12)\n    at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1273:14)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
      }
    },
    {
      "data": {
        "type": "VM",
        "id": "e1da68e8-ef42-44a9-386b-aceb7a920463",
        "name_label": "PROD-BDD-LIC01"
      },
      "id": "1725368863036",
      "message": "backup VM",
      "start": 1725368863036,
      "status": "failure",
      "tasks": [
        {
          "id": "1725368863043",
          "message": "clean-vm",
          "start": 1725368863043,
          "status": "success",
          "end": 1725368863472,
          "result": {
            "merge": false
          }
        },
        {
          "id": "1725368863842",
          "message": "snapshot",
          "start": 1725368863842,
          "status": "success",
          "end": 1725368876837,
          "result": "882ebd40-e16a-e286-f0ac-580728a3ec19"
        },
        {
          "data": {
            "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9",
            "isFull": false,
            "type": "remote"
          },
          "id": "1725368876838",
          "message": "export",
          "start": 1725368876838,
          "status": "success",
          "tasks": [
            {
              "id": "1725368983568",
              "message": "clean-vm",
              "start": 1725368983568,
              "status": "success",
              "end": 1725368983673,
              "result": {
                "merge": false
              }
            }
          ],
          "end": 1725368983675
        },
        {
          "data": {
            "id": "beee944b-e502-61d7-e03b-e1408f01db8c",
            "isFull": false,
            "name_label": "iSCSI-STORE-CES-01_HDD-01",
            "type": "SR"
          },
          "id": "1725368876839",
          "message": "export",
          "start": 1725368876839,
          "status": "interrupted"
        }
      ],
      "infos": [
        {
          "message": "will delete snapshot data"
        },
        {
          "data": {
            "vdiRef": "OpaqueRef:f10a5b53-170b-4c2d-9975-d419197cbf2a"
          },
          "message": "Snapshot data has been deleted"
        }
      ],
      "end": 1725368983676,
      "result": {
        "message": "can't create a stream from a metadata VDI, fall back to a base ",
        "name": "Error",
        "stack": "Error: can't create a stream from a metadata VDI, fall back to a base \n    at Xapi.exportContent (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/xapi/vdi.mjs:202:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:57:32\n    at async Promise.all (index 0)\n    at async cancelableMap (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_cancelableMap.mjs:11:12)\n    at async exportIncrementalVm (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:26:3)\n    at async IncrementalXapiVmBackupRunner._copy (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/IncrementalXapi.mjs:44:25)\n    at async IncrementalXapiVmBackupRunner.run (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/_AbstractXapi.mjs:379:9)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/VmsXapi.mjs:166:38"
      }
    },
    {
      "data": {
        "type": "VM",
        "id": "565c724d-020f-baa8-8e7d-cf54d8b57a28",
        "name_label": "ADMIN-SEC-ADM01"
      },
      "id": "1725368863036:0",
      "message": "backup VM",
      "start": 1725368863036,
      "status": "failure",
      "tasks": [
        {
          "id": "1725368863044",
          "message": "clean-vm",
          "start": 1725368863044,
          "status": "success",
          "end": 1725368863755,
          "result": {
            "merge": false
          }
        },
        {
          "id": "1725368864026",
          "message": "snapshot",
          "start": 1725368864026,
          "status": "success",
          "end": 1725369030903,
          "result": "3c37e2e9-9e08-6fa7-a1e4-c16d51b4106a"
        },
        {
          "data": {
            "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9",
            "isFull": false,
            "type": "remote"
          },
          "id": "1725369030903:0",
          "message": "export",
          "start": 1725369030903,
          "status": "success",
          "tasks": [
            {
              "id": "1725369054746",
              "message": "clean-vm",
              "start": 1725369054746,
              "status": "success",
              "end": 1725369055027,
              "result": {
                "merge": false
              }
            }
          ],
          "end": 1725369055031
        },
        {
          "data": {
            "id": "beee944b-e502-61d7-e03b-e1408f01db8c",
            "isFull": false,
            "name_label": "iSCSI-STORE-CES-01_HDD-01",
            "type": "SR"
          },
          "id": "1725369030903:1",
          "message": "export",
          "start": 1725369030903,
          "status": "interrupted"
        }
      ],
      "infos": [
        {
          "message": "will delete snapshot data"
        },
        {
          "data": {
            "vdiRef": "OpaqueRef:7a307227-d0bc-4810-92a8-6be38d0eecbb"
          },
          "message": "Snapshot data has been deleted"
        },
        {
          "data": {
            "vdiRef": "OpaqueRef:0048f302-86c2-4d69-81bc-feae8ef3cc15"
          },
          "message": "Snapshot data has been deleted"
        }
      ],
      "end": 1725369055031,
      "result": {
        "message": "can't create a stream from a metadata VDI, fall back to a base ",
        "name": "Error",
        "stack": "Error: can't create a stream from a metadata VDI, fall back to a base \n    at Xapi.exportContent (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/xapi/vdi.mjs:202:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:57:32\n    at async Promise.all (index 0)\n    at async cancelableMap (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_cancelableMap.mjs:11:12)\n    at async exportIncrementalVm (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:26:3)\n    at async IncrementalXapiVmBackupRunner._copy (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/IncrementalXapi.mjs:44:25)\n    at async IncrementalXapiVmBackupRunner.run (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/_AbstractXapi.mjs:379:9)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/VmsXapi.mjs:166:38"
      }
    },
    {
      "data": {
        "type": "VM",
        "id": "96cfde06-61c0-0f3e-cf6d-f637d41cc8c6",
        "name_label": "SRV-SQL"
      },
      "id": "1725368863038",
      "message": "backup VM",
      "start": 1725368863038,
      "status": "failure",
      "tasks": [
        {
          "id": "1725368863045",
          "message": "clean-vm",
          "start": 1725368863045,
          "status": "success",
          "end": 1725368863747,
          "result": {
            "merge": false
          }
        },
        {
          "id": "1725368864026:0",
          "message": "snapshot",
          "start": 1725368864026,
          "status": "success",
          "end": 1725368911729,
          "result": "4db8f7fb-d7d0-cd20-24d1-9223180f0c09"
        },
        {
          "data": {
            "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9",
            "isFull": false,
            "type": "remote"
          },
          "id": "1725368911729:0",
          "message": "export",
          "start": 1725368911729,
          "status": "success",
          "tasks": [
            {
              "id": "1725369104743",
              "message": "clean-vm",
              "start": 1725369104743,
              "status": "success",
              "end": 1725369105118,
              "result": {
                "merge": false
              }
            }
          ],
          "end": 1725369105118
        },
        {
          "data": {
            "id": "beee944b-e502-61d7-e03b-e1408f01db8c",
            "isFull": false,
            "name_label": "iSCSI-STORE-CES-01_HDD-01",
            "type": "SR"
          },
          "id": "1725368911729:1",
          "message": "export",
          "start": 1725368911729,
          "status": "interrupted"
        }
      ],
      "end": 1725369105118,
      "result": {
        "code": "VDI_IN_USE",
        "params": [
          "OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd",
          "destroy"
        ],
        "task": {
          "uuid": "d3978e94-c447-987c-c15d-21adcb7a8805",
          "name_label": "Async.VDI.destroy",
          "name_description": "",
          "allowed_operations": [],
          "current_operations": {},
          "created": "20240903T13:11:44Z",
          "finished": "20240903T13:11:44Z",
          "status": "failure",
          "resident_on": "OpaqueRef:c329c60f-e5d8-4797-9019-2dcb1083227c",
          "progress": 1,
          "type": "<none/>",
          "result": "",
          "error_info": [
            "VDI_IN_USE",
            "OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd",
            "destroy"
          ],
          "other_config": {},
          "subtask_of": "OpaqueRef:NULL",
          "subtasks": [],
          "backtrace": "(((process xapi)(filename ocaml/xapi/message_forwarding.ml)(line 4711))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))"
        },
        "message": "VDI_IN_USE(OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd, destroy)",
        "name": "XapiError",
        "stack": "XapiError: VDI_IN_USE(OpaqueRef:7b2ab98c-bfa5-4403-8890-73b88ce4b1dd, destroy)\n    at XapiError.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n    at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n    at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1076:24)\n    at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1110:14\n    at Array.forEach (<anonymous>)\n    at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1100:12)\n    at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1273:14)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
      }
    }
  ],
  "end": 1725369159662
}

What I have to do ?

rtjdamen

@still_at_work this vdi_in_use error is something we have seen as well, when i do a retry the jobs are processing well at our end. So it seems more like a timing issue, vates is looking into this. Not shure if this is the same issue as u are seeing.

SylvainB

in my case, even if I restart the jobs in error, the error vdi_in_use persists

rtjdamen

@still_at_work let’s see what vates does discover on the logging, i bet it is some sort of timing issue.

olivierlambert

See https://github.com/vatesfr/xen-orchestra/pull/7960

Might be helpful

fbeauchamp opened this pull request in vatesfr/xen-orchestra

closed feat(backups/CBT): retry data_destroy when error is VDI_IN USE #7960

SylvainB

Thanks @olivierlambert

I'm available to install patch on my installation if you want, and if it's possible.

olivierlambert

If you have a ticket number we could take a look tomorrow

SylvainB

@olivierlambert nice !

Ticket#7727235

Thanks !

rtjdamen

@olivierlambert would this maybe also affect the other vdi_in_use error?

https://github.com/vatesfr/xen-orchestra/issues/7956

rtjdamen created this issue in vatesfr/xen-orchestra

closed VDI_IN_USE(OpaqueRef:d525416f-65c7-48b6-ad1c-200e8b315014, destroy) #7956

rtjdamen

@olivierlambert the vdi_in_use error we saw in regarding to destroy has not been occuring again. The vdi_in_use error regarding data_destroy is still occuring. Not shure if the fix did anything or that we were lucky not to see issues last night on it. I believe @florent is right and this is caused by a timing issue but maybe we need more time then 4 seconds.

CJ

@rtjdamen I do run from sources and I was on the latest build when I posted. One of my first steps in regards to any issue is to update XO. I do not have any speed limits configured for the job.

@olivierlambert It seems to be happening fairly regularly for me. At least every couple of days. Let me know what I can do to help troubleshoot.

It appears that the same thing has happened again. One VM got attached to the Control Domain yesterday and therefore failed to backup. What is odd is that now I have another VM listed twice as being attached to the Control Domain. Oddly, they show the same attached UUID but different VBD UUIDs.

I see that there are patches available for the pool, so I will apply those as well as update XO and remove the Control attached VMs.