Pool Tasks Don't Complete? Major Issues...
-
I've run into a huge problem this morning and I don't know where to go from here.
A couple of VMs rebooted but didn't come back online. I couldn't see their consoles in XO, and any action was met with the error "INTERNAL_ERROR(Object with type VM and id 1dc90439-f3ef-32f7-9274-131654b850fc/config does not exist in xenopsd)". I tried restarting the toolstack, but it didn't help.
I realized that both VMs were on the same host (xcp02), and tried migrating the other VMs on that host to other hosts. Those tasks failed with a similar error. I rebooted the host, knowing that all 4 VMs would go offline, but when the host came back up, the VMs were still shown in XO as running. The host's console (via IPMI) showed no VMs running. I tried running "xe vm-reset-powerstate", but as long as the host was online, the CLI refused to execute the command.
I shut down the affected host completely, ran "xe vm-reset-powerstate" on those 4 VMs, and tried starting them on another host. All 4 VM start tasks got to 57% and stalled. I waited about 45 minutes before giving up.
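For reference, the sequence above boils down to something like this (a sketch with placeholder UUID and host name; "xe vm-reset-powerstate" is only safe once you're certain the VM isn't actually running anywhere):

```shell
VM_UUID="00000000-0000-0000-0000-000000000000"  # placeholder, substitute your VM's uuid
TARGET_HOST="xcp03"                              # placeholder, a surviving host
if command -v xe >/dev/null 2>&1; then
    # See which VMs xapi still believes are running
    xe vm-list power-state=running params=uuid,name-label
    # Force xapi's record back to halted (affected host must really be down)
    xe vm-reset-powerstate uuid="$VM_UUID" force=true
    # Try starting the VM on a surviving host
    xe vm-start uuid="$VM_UUID" on="$TARGET_HOST"
    status=ran
else
    echo "xe not found; run this on an XCP-ng host" >&2
    status=skipped
fi
```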
I decided then that maybe XO was to blame, so I restarted the XO server. It's a VM living inside XCP, but on a different host (xcp04). It never came back. And since XO is down now, I can't view the console of the XO server or of any other VM that's having problems.
I then figured that the pool itself was to blame, and the pool master (xcp01) needed a good rebooting, despite not being able to migrate the 2 VMs off it. I installed the 9 pending updates, then tried rebooting it nicely, but when the tasks sat for half an hour without progressing, I eventually did a hard power reset. It booted correctly, but the same problems remain. I'm now down 7 VMs (2 initially, 2 more from rebooting xcp02, 2 from rebooting xcp01, and the XO VM). Networking and storage both seem fine, but I can only verify that from the storage/network side, since verifying storage health via the XCP CLI seems to be lacking a bit.
At this point, I have a new VM built in VMware to take over XO duties, but since all/most pool tasks simply don't complete, including the "server_init" task from rebooting the pool master, I suspect it won't help.
I'm all out of ideas.
-
Do you know of any changes that led up to this? It sounds like it could be some kind of hardware issue that cascaded into a larger problem, but that's a bit of a shot in the dark.
The VMs that rebooted initially, did you trigger the reboot, or did they crash and then auto-boot back up?
I've managed a solid number of XCP-ng installs and can't say I've ever seen anything quite like this.
I did have a host completely freeze up once; it wouldn't respond to VM boot requests or anything else, but the VMs on it kept running, and a reboot resolved it. That was a single-host pool, though, so I have no idea how the rest of a pool would have behaved if it had been the pool master.
That was consumer-grade hardware, though, so I strongly suspect it was a hardware issue (possibly a RAM error, and I don't have ECC on that box).
So right now, you basically can't boot or manage anything in your pool? If anything does seem to work right, please describe exactly what is working.
Also, if you have pro support, I'd recommend putting a ticket in with Vates; their support is actually quite great.
-
@planedrop I'm not sure I can relate everything else I've done to try to fix the issues, but I think you're correct that it was a hardware issue. I can't identify anything that has failed, but it seems like it may have corresponded with a power bump. So maybe something rebooted and didn't boot correctly, I'm not sure.
At this point, the XO VM has started, although I don't know what made it work again. I clicked the button to start the VM, then started working on a brand new XO server and rebuilding xcp01, and when I looked back about an hour later or so, the original XO was on.
I managed to start another server as well, so we're down 5 VMs currently. The majority of tasks fail after 45-60 minutes, but some succeed, with no obvious logic to which work and which don't. The errors for the ones that fail look mostly similar, but I haven't found much online to help diagnose them:
{
  "id": "0lzd1nqf6",
  "properties": {
    "method": "vm.start",
    "params": {
      "id": "aa040c66-18a8-be17-6c52-658cf9082da4",
      "bypassMacAddressesCheck": false,
      "force": false
    },
    "name": "API call: vm.start",
    "userId": "31081b8d-b3c6-425b-83ff-e3ec68612beb",
    "type": "api.call"
  },
  "start": 1722623675010,
  "status": "failure",
  "updatedAt": 1722623975005,
  "end": 1722623975005,
  "result": {
    "name": "HeadersTimeoutError",
    "code": "UND_ERR_HEADERS_TIMEOUT",
    "message": "Headers Timeout Error",
    "call": {
      "method": "VM.start",
      "params": [
        "OpaqueRef:75b20bc4-f4c0-45d6-90e3-546a5eff7c88",
        false,
        false
      ]
    },
    "stack": "HeadersTimeoutError: Headers Timeout Error\n at Timeout.onParserTimeout [as callback] (/data/xo/xo-builds/xen-orchestra-202407151328/node_modules/undici/lib/dispatcher/client-h1.js:622:28)\n at Timeout.onTimeout [as _onTimeout] (/data/xo/xo-builds/xen-orchestra-202407151328/node_modules/undici/lib/util/timers.js:22:13)\n at listOnTimeout (node:internal/timers:581:17)\n at processTimers (node:internal/timers:519:7)"
  }
}
-
I think I found the root error, but I can't find anything online about how to fix it. The error is:
Aug 2 18:43:37 xcp04 xapi: [error||625 ||backtrace] Async.VM.hard_shutdown R:6878bad62512 failed with exception Server_error(INTERNAL_ERROR, [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ])
Aug 2 18:43:37 xcp04 xapi: [error||625 ||backtrace] Raised Server_error(INTERNAL_ERROR, [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ])
Aug 2 18:43:37 xcp04 xapi: [error||625 ||backtrace] 1/1 xapi Raised at file (Thread 625 has no backtrace table. Was with_backtraces called?, line 0
Aug 2 18:43:37 xcp04 xapi: [error||624 :::80||cli] Converting exception INTERNAL_ERROR: [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ] into a CLI response
In particular, the "...does not exist in xenopsd" seems like the root cause.
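In case it helps anyone searching later, here's a sketch of how to pull the affected VM UUIDs out of these errors to see whether it's one VM or many that are stuck (the sample line is copied from the log above; on a real host you'd grep /var/log/xensource.log instead of a here-string):

```shell
# Extract VM UUIDs from "does not exist in xenopsd" errors.
# The sample line is taken from the xensource.log output above.
sample='Aug 2 18:43:37 xcp04 xapi: [error||625 ||backtrace] Async.VM.hard_shutdown R:6878bad62512 failed with exception Server_error(INTERNAL_ERROR, [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ])'
vm_uuid=$(printf '%s\n' "$sample" \
  | grep 'does not exist in xenopsd' \
  | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
  | sort -u)
echo "$vm_uuid"
```

On a host, replacing the here-string with `grep 'does not exist in xenopsd' /var/log/xensource.log` gives the full list of stuck objects.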
-
@omatsei Have you tried restarting the toolstack? This has cleared up the "not exist in xenopsd" for me in the past.
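Roughly this, on the host that owns the stuck VM (a sketch; a toolstack restart interrupts xapi briefly but doesn't touch running VMs):

```shell
if command -v xe-toolstack-restart >/dev/null 2>&1; then
    # Restart xapi/xenopsd on this host; running VMs are not affected
    xe-toolstack-restart
    # Afterwards, list any PBDs that are NOT attached (i.e. unplugged SRs)
    xe pbd-list currently-attached=false params=uuid,host-name-label,sr-name-label
    status=ran
else
    echo "xe-toolstack-restart not found; run this on an XCP-ng host" >&2
    status=skipped
fi
```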
Is all of your storage mounted correctly? Have you checked dmesg on the pool master for obvious errors?
-
@Danp Yes to both. I've probably restarted the toolstack at least a dozen times, mostly to clear hung tasks. I did notice some weird issues with a secondary SR being disconnected on xcp02 (one of 10 hosts, 9 after I ejected and forgot xcp01), but there are no disks on it. It wasn't being used for anything at all (yet), and it's fine on all the rest.
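For what it's worth, hung tasks can sometimes be cancelled individually before resorting to a full toolstack restart (a sketch with a placeholder UUID; xapi only honours cancellation for some task types):

```shell
TASK_UUID="00000000-0000-0000-0000-000000000000"  # placeholder, substitute a real task uuid
if command -v xe >/dev/null 2>&1; then
    # Show pending tasks with their progress, to spot the stalled ones
    xe task-list params=uuid,name-label,progress,created
    # Ask xapi to cancel one; not every task type supports this
    xe task-cancel uuid="$TASK_UUID"
    status=ran
else
    echo "xe not found; run this on a pool host" >&2
    status=skipped
fi
```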
That does lead me to think maybe it was a power bump that rebooted a switch or something, though. Maybe it caused some kind of hangup with xcp01 and xcp02, and since xcp01 was the pool master, it cascaded into the other issues I've seen? Could that explain why the VMs originally running on xcp02 died and couldn't easily be recovered?