omatsei

@omatsei I figured out the problem. There appears to be a bug in XO that requires you to check "Check Certificate" and/or "Start TLS", save the configuration, then uncheck them, then save again. Then it should work. The bug is that they're unchecked by default, but apparently they're enabled in the background.

omatsei

I'm trying to set up a backup XO server, and part of it is the authentication. The primary XO has been set up for a few months, including auth, but the secondary doesn't seem to work, even with the same settings. The error from the syslog is:

Aug  8 00:02:28 i4-as-xo2 xo-server[40549]: 2024-08-08T00:02:28.612Z xo:api WARN xoadmin | plugin.test(...) [19ms] =!> Error: unable to get local issuer certificate

I'm sure it's something simple, I just don't know what to look at. Any suggestions?

omatsei

@Danp Yes to both. I've probably restarted the toolstack at least a dozen times, mostly to clear hung tasks. I did notice some weird issues with a secondary SR being disconnected on xcp02 (one of 10 hosts, 9 after I ejected and forgot xcp01), but there are no disks on it. It wasn't being used for anything at all (yet), and it's fine on all the rest.

That does lead me to think maybe it was a power bump that rebooted a switch or something, though. Maybe it caused some kind of hangup with xcp01 and xcp02, and since xcp01 was the pool master, it cascaded to the other issues I've seen? Could that cause the VMs that were originally running on xcp02 to die and not be easily recoverable?

omatsei

I think I found the root error, but I can't find anything online about how to fix it. The error is:

Aug  2 18:43:37 xcp04 xapi: [error||625 ||backtrace] Async.VM.hard_shutdown R:6878bad62512 failed with exception Server_error(INTERNAL_ERROR, [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ])
Aug  2 18:43:37 xcp04 xapi: [error||625 ||backtrace] Raised Server_error(INTERNAL_ERROR, [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ])
Aug  2 18:43:37 xcp04 xapi: [error||625 ||backtrace] 1/1 xapi Raised at file (Thread 625 has no backtrace table. Was with_backtraces called?, line 0
Aug  2 18:43:37 xcp04 xapi: [error||624 :::80||cli] Converting exception INTERNAL_ERROR: [ Object with type VM and id a773b91b-9f95-89dd-ccc6-6b8146154f37/vbd.xvdd does not exist in xenopsd ] into a CLI response

In particular, the "...does not exist in xenopsd" seems like the root cause.
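To see how many VMs are affected by the same "does not exist in xenopsd" error, the UUIDs can be pulled out of the xapi log. This is only a sketch: the function name is made up for illustration, and it assumes the log lines have the same shape as the ones quoted above.

```shell
#!/bin/sh
# List the unique VM UUIDs that xapi reports as missing from xenopsd.
# Reads xapi log lines on stdin (hypothetical helper, not part of xe or XO).
xenopsd_missing_uuids() {
  grep 'does not exist in xenopsd' |
    grep -oE 'id [0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' |
    cut -d' ' -f2 |
    sort -u
}
```

Usage would be something like `xenopsd_missing_uuids < /var/log/xensource.log` on the affected host, assuming that is where xapi writes its log on your install.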

omatsei

@planedrop I'm not sure I can relate everything else I've done to try to fix the issues, but I think you're correct that it was a hardware issue. I can't identify anything that has failed, but it seems like it may have corresponded with a power bump. So maybe something rebooted and didn't boot correctly, I'm not sure.

At this point, the XO VM has started, although I don't know what made it work again. I clicked the button to start the VM, then started working on a brand new XO server and rebuilding xcp01, and when I looked back about an hour later or so, the original XO was on.

I also managed to start another server, so we're currently down 5 VMs. The majority of tasks fail after 45-60 minutes, but some succeed, with no obvious pattern to which work and which don't. The errors for the ones that fail look mostly similar, but I haven't found much on the interwebs to help diagnose them:

{
  "id": "0lzd1nqf6",
  "properties": {
    "method": "vm.start",
    "params": {
      "id": "aa040c66-18a8-be17-6c52-658cf9082da4",
      "bypassMacAddressesCheck": false,
      "force": false
    },
    "name": "API call: vm.start",
    "userId": "31081b8d-b3c6-425b-83ff-e3ec68612beb",
    "type": "api.call"
  },
  "start": 1722623675010,
  "status": "failure",
  "updatedAt": 1722623975005,
  "end": 1722623975005,
  "result": {
    "name": "HeadersTimeoutError",
    "code": "UND_ERR_HEADERS_TIMEOUT",
    "message": "Headers Timeout Error",
    "call": {
      "method": "VM.start",
      "params": [
        "OpaqueRef:75b20bc4-f4c0-45d6-90e3-546a5eff7c88",
        false,
        false
      ]
    },
    "stack": "HeadersTimeoutError: Headers Timeout Error\n    at Timeout.onParserTimeout [as callback] (/data/xo/xo-builds/xen-orchestra-202407151328/node_modules/undici/lib/dispatcher/client-h1.js:622:28)\n    at Timeout.onTimeout [as _onTimeout] (/data/xo/xo-builds/xen-orchestra-202407151328/node_modules/undici/lib/util/timers.js:22:13)\n    at listOnTimeout (node:internal/timers:581:17)\n    at processTimers (node:internal/timers:519:7)"
  }
}
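As a side note, the timestamps in this particular task record put the failure at roughly five minutes rather than 45-60. A quick sketch of the arithmetic, using only the `start` and `end` values shown above (milliseconds since the epoch):

```shell
#!/bin/sh
start_ms=1722623675010   # "start" from the task record above
end_ms=1722623975005     # "end" from the task record above

# Integer division: milliseconds to whole seconds.
elapsed_s=$(( (end_ms - start_ms) / 1000 ))
echo "vm.start ran for about ${elapsed_s}s before failing"
```

That comes out just under 300 seconds. As far as I know, 300 seconds is the default headers timeout in undici (the HTTP client named in the stack trace), so this may be XO giving up on waiting for a reply rather than XAPI itself reporting a failure. Treat that default as an assumption and check it against the undici documentation.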

omatsei

I've run into a huge problem this morning and I don't know where to go from here.

A couple of VMs rebooted, but didn't come back online. I couldn't see the console in XO, and any action was met with an error, "INTERNAL_ERROR(Object with type VM and id 1dc90439-f3ef-32f7-9274-131654b850fc/config does not exist in xenopsd)". I tried restarting the toolstack, but it didn't help.

I realized that both VMs were on the same host (xcp02), and tried migrating the other VMs on that host to other hosts. Those tasks failed with a similar error. I rebooted the host, knowing that all 4 VMs would go offline, but when the host came back up, the VMs were still shown in XO as running. The host's console (via IPMI) said no VMs were running. I tried running "xe vm-reset-powerstate", but as long as the host was online, the CLI refused to execute the command.

I shut down the affected host completely, ran "xe vm-reset-powerstate" on those 4 VMs, and tried starting them on another host. All 4 VM start tasks got to 57% and stalled. I waited about 45 minutes before giving up.
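For anyone repeating the reset step, here is a minimal sketch that only prints the commands so they can be reviewed before anything is run. The function name is hypothetical, and the `--force` flag is what the command normally requires as far as I know; it is not shown in the post above, so treat it as an assumption.

```shell
#!/bin/sh
# Print (do not run) a power-state reset command for each VM UUID argument.
# Hypothetical helper; review the output before executing any of it.
reset_powerstate_cmds() {
  for uuid in "$@"; do
    printf 'xe vm-reset-powerstate uuid=%s --force\n' "$uuid"
  done
}
```

Call it with the UUIDs of the stuck VMs, check the printed lines, then paste them into a shell on the pool master once the host the VMs were resident on is confirmed to be down.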

I decided then that maybe XO was to blame, so I restarted the XO server. It's a VM living inside XCP, but on a different host (xcp04). It never came back. Since XO is down now, I can't try to view the console of the XO server, or any other VM that's having problems.

I then figured that the pool itself was to blame, and the pool master (xcp01) needed a good rebooting, despite not being able to migrate the 2 VMs off it. I installed the 9 pending updates, then tried rebooting it nicely, but when the tasks sat for half an hour without progressing, I eventually did a hard power reset. It booted correctly, but the same problems remain. I'm now down 7 VMs (2 initially, 2 more from rebooting xcp02, 2 from rebooting xcp01, and the XO VM). All the networking and storage seem to be fine, but I can only check them from the opposite side, since the XCP CLI doesn't seem to offer much for verifying that the storage is healthy.

At this point, I have a new VM built in VMware to take over duties of XO, but I suspect that since all/most pool tasks simply don't complete, including the "server_init" from rebooting the pool master, it won't help.

I'm all out of ideas.

omatsei

@omatsei I found the following error on the source host, if it helps. I rebooted it and restarted iscsid on both the source and destination hosts, just to make sure nothing was pending or hung.

May 28 10:15:32 xcp09 xapi: [error||2507 ||backtrace] SR.scan D:9f4f3c05cc88 failed with exception Storage_error ([S(Redirect);[S(192.168.1.201)]])
May 28 10:15:32 xcp09 xapi: [error||2507 ||backtrace] Raised Storage_error ([S(Redirect);[S(192.168.1.201)]])
May 28 10:15:32 xcp09 xapi: [error||2507 ||backtrace] 1/1 xapi Raised at file (Thread 2507 has no backtrace table. Was with_backtraces called?, line 0
May 28 10:15:32 xcp09 xapi: [error||2507 ||backtrace]
May 28 10:15:32 xcp09 xapi: [error||2507 ||storage_interface] Storage_error ([S(Redirect);[S(192.168.1.201)]]) (File "storage/storage_interface.ml", line 436, characters 51-58)
May 28 10:15:32 xcp09 xapi: [error||2506 HTTP 127.0.0.1->:::80|Querying services D:6b15aa4c5bcd|storage_interface] Storage_error ([S(Redirect);[S(192.168.1.201)]]) (File "storage/storage_interface.ml", line 431, characters 49-56)
May 28 10:15:32 xcp09 xapi: [error||2506 HTTP 127.0.0.1->:::80|Querying services D:6b15aa4c5bcd|storage_interface] Storage_error ([S(Redirect);[S(192.168.1.201)]]) (File "storage/storage_interface.ml", line 436, characters 51-58)

Note that 192.168.1.201 is the pool master. I ended up rebooting the pool master after manually migrating the VMs off it, and that seems to have fixed the issue. No idea why, but whatever.

omatsei

@olivierlambert Sorry, same error. I made sure there were no VMs on 2 different hosts, then restarted iscsid on both, then (via CLI) moved one VM back on. Then I tried migrating it from XO, and got the same error. I also made sure XO was updated to the latest stable release.

Random question, does XO need to be on the same subnet (or broadcast network) as the XCP hosts?

omatsei

@olivierlambert Do you mean restart the iscsid service on the XCP host?

omatsei

I'm having an issue with XO (built from source) where live migrations are failing. In XO, when I select a test VM and the target host, it fails after just a second or two. However, it works perfectly if I do it from the command line while SSH'd into the pool master. The command I'm using is:

[22:06 xcp01 ~]# xe vm-migrate uuid=406bc5e7-e814-dc16-780e-adfc2635dfbe host-uuid=2133772d-f69e-4930-980e-583e81e0afb8
[22:07 xcp01 ~]#

The full details of the error are:

vm.migrate
{
  "vm": "406bc5e7-e814-dc16-780e-adfc2635dfbe",
  "migrationNetwork": "76cfdb59-4a35-9d50-6d86-99d68317d61c",
  "targetHost": "2133772d-f69e-4930-980e-583e81e0afb8"
}
{
  "code": "SR_BACKEND_FAILURE_202",
  "params": [
    "",
    "General backend error [opterr=rc: 21, stdout: , stderr: iscsiadm: No records found
]",
    ""
  ],
  "task": {
    "uuid": "f3e2ae4b-890b-4d1b-ee11-36d151482a0a",
    "name_label": "Async.VM.migrate_send",
    "name_description": "",
    "allowed_operations": [],
    "current_operations": {},
    "created": "20240528T02:01:51Z",
    "finished": "20240528T02:01:55Z",
    "status": "failure",
    "resident_on": "OpaqueRef:cbbc463f-6d3d-4693-b5fe-333944df6766",
    "progress": 1,
    "type": "<none/>",
    "result": "",
    "error_info": [
      "SR_BACKEND_FAILURE_202",
      "",
      "General backend error [opterr=rc: 21, stdout: , stderr: iscsiadm: No records found
]",
      ""
    ],
    "other_config": {},
    "subtask_of": "OpaqueRef:NULL",
    "subtasks": [],
    "backtrace": "(((process xapi)(filename ocaml/xapi/helpers.ml)(line 1690))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 35))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 35))((process xapi)(filename ocaml/xapi/message_forwarding.ml)(line 134))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 35))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))"
  },
  "message": "SR_BACKEND_FAILURE_202(, General backend error [opterr=rc: 21, stdout: , stderr: iscsiadm: No records found
], )",
  "name": "XapiError",
  "stack": "XapiError: SR_BACKEND_FAILURE_202(, General backend error [opterr=rc: 21, stdout: , stderr: iscsiadm: No records found
], )
    at Function.wrap (file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/_XapiError.mjs:16:12)
    at default (file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/_getTaskResult.mjs:11:29)
    at Xapi._addRecordToCache (file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/index.mjs:1035:24)
    at file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/index.mjs:1069:14
    at Array.forEach (<anonymous>)
    at Xapi._processEvents (file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/index.mjs:1059:12)
    at Xapi._watchEvents (file:///data/xo/xo-builds/xen-orchestra-202405272127/packages/xen-api/index.mjs:1232:14)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at processImmediate (node:internal/timers:447:9)
    at process.callbackTrampoline (node:internal/async_hooks:128:17)"
}

The VM is Ubuntu 22, does have XenTools installed, and has been rebooted recently (earlier this afternoon).

Any ideas?
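One detail in the dump that may matter: the failed task is named `Async.VM.migrate_send`, and the XO request includes a `migrationNetwork`. My understanding, which is an assumption and not something the logs confirm, is that `xe vm-migrate` with only `uuid` and `host-uuid` takes the simpler in-pool migration path, while `migrate_send` also goes through storage code, which could be why only the XO attempt hits the iscsiadm error. A small sketch for pulling the task name and error code out of a saved error dump (the function name is made up for illustration):

```shell
#!/bin/sh
# Extract the "code" and "name_label" fields from an XO error dump on stdin.
# Uses grep instead of a JSON parser because the dump contains raw newlines
# inside strings, which strict JSON tools may reject.
xo_error_summary() {
  grep -oE '"(name_label|code)": "[^"]*"' | sort -u
}
```

Saving the dump to a file and running `xo_error_summary < dump.txt` (the file name is just an example) makes it easy to compare which XAPI call was used across several failed attempts.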
