CBT: the thread to centralize your feedback

SylvainB

Hi,

Same here. I updated XOA to 5.98 and now I get this error on some VMs:

"can't create a stream from a metadata VDI, fall back to a base"

I have an active support contract.

Here is the detailed log:

{
      "data": {
        "type": "VM",
        "id": "96cfde06-61c0-0f3e-cf6d-f637d41cc8c6",
        "name_label": "blabla_VM"
      },
      "id": "1725081943938",
      "message": "backup VM",
      "start": 1725081943938,
      "status": "failure",
      "tasks": [
        {
          "id": "1725081943938:0",
          "message": "clean-vm",
          "start": 1725081943938,
          "status": "success",
          "end": 1725081944676,
          "result": {
            "merge": false
          }
        },
        {
          "id": "1725081944876",
          "message": "snapshot",
          "start": 1725081944876,
          "status": "success",
          "end": 1725081978972,
          "result": "46334bc0-cb3c-23f7-18e1-f25320a6c4b4"
        },
        {
          "data": {
            "id": "122ddf1f-090d-4c23-8c5e-fe095321f8b9",
            "isFull": false,
            "type": "remote"
          },
          "id": "1725081978972:0",
          "message": "export",
          "start": 1725081978972,
          "status": "success",
          "tasks": [
            {
              "id": "1725082089246",
              "message": "clean-vm",
              "start": 1725082089246,
              "status": "success",
              "end": 1725082089709,
              "result": {
                "merge": false
              }
            }
          ],
          "end": 1725082089719
        },
        {
          "data": {
            "id": "beee944b-e502-61d7-e03b-e1408f01db8c",
            "isFull": false,
            "name_label": "BLABLA_SR_HDD-01",
            "type": "SR"
          },
          "id": "1725081978972:1",
          "message": "export",
          "start": 1725081978972,
          "status": "pending"
        }
      ],
      "infos": [
        {
          "message": "will delete snapshot data"
        },
        {
          "data": {
            "vdiRef": "OpaqueRef:1b614f6b-0f69-47a1-a0cd-eee64007441d"
          },
          "message": "Snapshot data has been deleted"
        }
      ],
      "warnings": [
        {
          "data": {
            "error": {
              "code": "VDI_IN_USE",
              "params": [
                "OpaqueRef:989f7dd8-0b73-4a87-b249-6cfc660a90bb",
                "data_destroy"
              ],
              "call": {
                "method": "VDI.data_destroy",
                "params": [
                  "OpaqueRef:989f7dd8-0b73-4a87-b249-6cfc660a90bb"
                ]
              }
            },
            "vdiRef": "OpaqueRef:989f7dd8-0b73-4a87-b249-6cfc660a90bb"
          },
          "message": "Couldn't deleted snapshot data"
        }
      ],
      "end": 1725082089719,
      "result": {
        "message": "can't create a stream from a metadata VDI, fall back to a base ",
        "name": "Error",
        "stack": "Error: can't create a stream from a metadata VDI, fall back to a base \n    at Xapi.exportContent (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/xapi/vdi.mjs:202:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:57:32\n    at async Promise.all (index 0)\n    at async cancelableMap (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_cancelableMap.mjs:11:12)\n    at async exportIncrementalVm (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_incrementalVm.mjs:26:3)\n    at async IncrementalXapiVmBackupRunner._copy (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/IncrementalXapi.mjs:44:25)\n    at async IncrementalXapiVmBackupRunner.run (file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/_vmRunners/_AbstractXapi.mjs:379:9)\n    at async file:///usr/local/lib/node_modules/xo-server/node_modules/@xen-orchestra/backups/_runners/VmsXapi.mjs:166:38"
      }
    },
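For anyone reading logs like this one, a small script can pull out the parts that matter: which tasks did not finish, which warnings were raised, and the final error. This is only a sketch that assumes the log has the shape shown above (nested "tasks", optional "warnings", "result" as an object on failure). The function names `iter_tasks` and `summarize` are made up for illustration and are not part of Xen Orchestra, and `SAMPLE_LOG` is a trimmed copy of the log above.

```python
import json
from typing import Any, Iterator

# Trimmed copy of the log above; only the fields the sketch reads are kept.
SAMPLE_LOG = json.loads("""
{
  "id": "1725081943938", "message": "backup VM", "status": "failure",
  "tasks": [
    {"id": "1725081943938:0", "message": "clean-vm", "status": "success"},
    {"id": "1725081944876", "message": "snapshot", "status": "success",
     "result": "46334bc0-cb3c-23f7-18e1-f25320a6c4b4"},
    {"id": "1725081978972:0", "message": "export", "status": "success",
     "tasks": [{"id": "1725082089246", "message": "clean-vm", "status": "success"}]},
    {"id": "1725081978972:1", "message": "export", "status": "pending"}
  ],
  "warnings": [
    {"data": {"error": {"code": "VDI_IN_USE"}},
     "message": "Couldn't deleted snapshot data"}
  ],
  "result": {"message": "can't create a stream from a metadata VDI, fall back to a base "}
}
""")


def iter_tasks(task: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a task and then all of its nested sub-tasks, depth first."""
    yield task
    for child in task.get("tasks", []):
        yield from iter_tasks(child)


def summarize(log: dict[str, Any]) -> dict[str, Any]:
    """Collect unfinished tasks, warnings and the final error from a log."""
    unfinished = [
        (t["id"], t["message"], t.get("status"))
        for t in iter_tasks(log)
        if t.get("status") != "success"
    ]
    warnings = [
        (w["message"], w.get("data", {}).get("error", {}).get("code"))
        for t in iter_tasks(log)
        for w in t.get("warnings", [])
    ]
    # "result" is a plain string for some tasks (e.g. the snapshot UUID),
    # so only read "message" when it is an object.
    result = log.get("result")
    error = result.get("message") if isinstance(result, dict) else None
    return {"unfinished": unfinished, "warnings": warnings, "error": error}


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE_LOG), indent=2))
```

On the log above, this reports the top-level "backup VM" task as failed, the export to the SR as still pending, and the VDI_IN_USE warning, which is the combination the rest of this thread is about.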

rtjdamen

So far I have seen this fall back to base error only once, and it looks like it finishes correctly on the retry. I will keep an eye on this.

CJ

I was having a backup fail due to the VDI must be free error, but updating XOA to the latest commit fixed that.

Now I'm getting VDI IN USE errors when backing up. Going to the Health tab of the Dashboard lists the VDIs still attached to the Control Domain. However, when I try to forget the VDI, I get the OPERATION NOT PERMITTED VBD still attached error.

I've enabled maintenance mode on the node, which migrated the VMs to my other node, but that didn't fix the issue. I assume that's because I'm using shared storage. I tried coalescing the leaf, but there were none.

Any suggestions for the next step?

rtjdamen

@CJ You need to check which host the VDI is attached to and reboot that host. That will release the VDI.

CJ

@rtjdamen The VMs were running on the master, so I had rebooted it since I don't recall how to match the UUIDs. I'll try rebooting the other node and see if that works.

EDIT: That worked. Even though they were running on the master they were attached to the other node.

flakpyro

@rtjdamen I notice this too. On retry it does run, but it seems to take much longer than a normal incremental backup would, so I'm not entirely sure what's going on there. For me it ONLY happens if I migrate a VM from one host to another (on shared NFS storage).

rtjdamen

@CJ NBD picks a random host for the transfer, so it is not necessarily the pool master. You should be able to determine the host holding this VDI from the error message.
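If the error message is not enough, the same answer can be worked out from the pool's records: a VDI is held by a host when one of its VBDs is plugged into that host's control domain. The sketch below is a minimal illustration, assuming dictionaries keyed by opaque reference with the XAPI field names `VM`, `VDI`, `currently_attached`, `is_control_domain`, `resident_on` and `name_label`. How the records are fetched is left out, the function name `control_domain_attachments` is made up, and all references and host names in the sample are invented.

```python
from typing import Any

Records = dict[str, dict[str, Any]]


def control_domain_attachments(
    vbds: Records, vms: Records, hosts: Records
) -> list[tuple[str, str]]:
    """Return (vdi_ref, host_name) for every VBD plugged into a control domain."""
    found = []
    for vbd in vbds.values():
        vm = vms.get(vbd["VM"])
        # Only VBDs belonging to a control domain (dom0) are of interest.
        if vm is None or not vm["is_control_domain"]:
            continue
        # Skip VBDs that exist but are not currently plugged.
        if not vbd["currently_attached"]:
            continue
        host_ref = vm["resident_on"]
        host_name = hosts.get(host_ref, {}).get("name_label", host_ref)
        found.append((vbd["VDI"], host_name))
    return found


# Invented sample data: one VDI stuck on the dom0 of "host-2",
# one normal disk attached to a regular VM on "host-1".
SAMPLE_HOSTS = {
    "OpaqueRef:host-1": {"name_label": "host-1"},
    "OpaqueRef:host-2": {"name_label": "host-2"},
}
SAMPLE_VMS = {
    "OpaqueRef:dom0-1": {"is_control_domain": True, "resident_on": "OpaqueRef:host-1"},
    "OpaqueRef:dom0-2": {"is_control_domain": True, "resident_on": "OpaqueRef:host-2"},
    "OpaqueRef:vm-a": {"is_control_domain": False, "resident_on": "OpaqueRef:host-1"},
}
SAMPLE_VBDS = {
    "OpaqueRef:vbd-1": {"VM": "OpaqueRef:vm-a", "VDI": "OpaqueRef:vdi-a",
                        "currently_attached": True},
    "OpaqueRef:vbd-2": {"VM": "OpaqueRef:dom0-2", "VDI": "OpaqueRef:vdi-stuck",
                        "currently_attached": True},
}

if __name__ == "__main__":
    for vdi, host in control_domain_attachments(SAMPLE_VBDS, SAMPLE_VMS, SAMPLE_HOSTS):
        print(f"{vdi} is attached to the control domain of {host}")
```

In the sample, the VM runs on host-1 while the stuck VDI is held by host-2, which matches the situation described above where the disks were attached to a different node than the one the VMs were running on.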

rtjdamen

@flakpyro This is because it creates a new full. I think the CBT is being found invalid, which causes it to run a new full.

flakpyro

@rtjdamen Correct, it does appear to run a full, even though the backup report afterwards says it ran a "delta". After that initial run it will run backups error-free again, unless I migrate the VM to another host, in which case the same error occurs once again.

rtjdamen

@flakpyro Yes. We do not see this error in relation to a migration, but it does sometimes just occur. In the prior version it also failed on the retry; with the latest version it resolves itself, so it has improved a bit.

We also still see the VDI in use errors; it would be nice if those were improved.

CJ

I'm not sure if this happened after my initial manual run of my backup job or the scheduled one that ran afterwards, but one of the VMs is now showing again as attached to the control domain.

Is this something I need to keep checking or should it resolve itself? The backup job completed hours ago.

rtjdamen

@CJ Normally this should not happen; I don't see this on our end. Mostly this happens after an incomplete backup job.

rtjdamen

@CJ said in CBT: the thread to centralize your feedback:

@rtjdamen The VMs were running on the master, so I had rebooted it since I don't recall how to match the UUIDs. I'll try rebooting the other node and see if that works.

EDIT: That worked. Even though they were running on the master they were attached to the other node.

If anyone has this issue and does not want to reboot hosts, you can migrate the VM to a different SR to partly fix it. The VDI will still be orphaned and attached until the next host reboot, but the backup will run without issues.

CJ

@rtjdamen Unfortunately it's still happening to me and getting worse.

Yesterday, I had one VM with the issue. When the backup ran, I got the report stating that this VM failed to back up but all others succeeded. When I just checked the dashboard health, I saw that I now have three VMs with VDIs attached to the control domain. The backup job only lists the one VM as having failed.

One unusual thing is that these three are three of the four VMs that had problems originally. So I'm not sure if there's an issue with the VMs themselves or something else causing those to error.

Tristis Oris

The VM refused to start until I disabled CBT.

vm.start
{
  "id": "59f0ba04-5814-7154-22d2-51ae24ecf146",
  "bypassMacAddressesCheck": false,
  "force": false
}
{
  "code": "FAILED_TO_START_EMULATOR",
  "params": [
    "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624",
    "domid 29",
    "QMP failure at File \"xc/device.ml\", line 3491, characters 71-78"
  ],
  "call": {
    "method": "VM.start",
    "params": [
      "OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624",
      false,
      false
    ]
  },
  "message": "FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78)",
  "name": "XapiError",
  "stack": "XapiError: FAILED_TO_START_EMULATOR(OpaqueRef:064cdc5a-49c4-4c58-8bdf-5fe4f04b2624, domid 29, QMP failure at File \"xc/device.ml\", line 3491, characters 71-78)
    at Function.wrap (file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/_XapiError.mjs:16:12)
    at file:///opt/xo/xo-builds/xen-orchestra-202408301255/packages/xen-api/transports/json-rpc.mjs:38:21
    at runNextTicks (node:internal/process/task_queues:60:5)
    at processImmediate (node:internal/timers:454:9)
    at process.callbackTrampoline (node:internal/async_hooks:130:17)"
}

rtjdamen

@CJ Do you run XOA or XO from sources? Are you on the latest build? We do not experience this issue like you do. I had some issues with hanging jobs that caused VDIs to stay attached to the control domain. Normally this should be handled by the backup job itself. However, we have seen a case where this happens when a speed limit is set on the backup job; could it be that you have set one? Maybe you can try disabling it and see what happens.

I believe this issue in general is one that should be resolved. Disks that stay attached to the control domain cause problems, and it's not feasible to restart hosts every time this happens; there needs to be a good mechanism to recover from this kind of issue. @olivierlambert is this something that we can expect in the near future?

olivierlambert

Well, it's hard to understand the problem if we cannot reproduce it on our side. The QMP failures really look like something else.

rtjdamen

@olivierlambert I agree, but there should be a better way to recover from them.

olivierlambert

You mean cleaning up those VDIs attached to the control domain? I think there are some planned tasks to do a regular cleanup before/after each job. Still, I'm under the impression that @Tristis-Oris's issue is different.

rtjdamen

@olivierlambert Yeah, I agree. They should not occur in the first place, but if they do, it is a mess that you need to reboot all hosts to get rid of them ;-).