TrueNAS VM failing to start
-
Hi,
I had to shut down my XCP-ng system to add a replacement for a previously failed NVMe that was attached to my TrueNAS SCALE VM, which also involved moving around a couple of the PCIe cards. I thought this would be a good opportunity to catch up with the upgrades, so I also applied those during the shutdown/reboot.
Following the upgrade and installing the replacement NVMe, I can no longer boot my TrueNAS SCALE VM. It fails with:
vm.start { "id": "81e6cde8-baba-5f2e-0a08-a4d9f3e0a41e", "bypassMacAddressesCheck": false, "force": false } { "code": "INTERNAL_ERROR", "params": [ "xenopsd internal error: Cannot_add(0000:af:00.0, Device_common.QMP_Error(2, \"{\\\"error\\\":{\\\"class\\\":\\\"GenericError\\\",\\\"desc\\\":\\\"Failed to initialize 11/15, type = 0x1, rc: -1\\\",\\\"data\\\":{}},\\\"id\\\":\\\"qmp-000012-2\\\"}\"))" ], "call": { "duration": 8798, "method": "VM.start", "params": [ "* session id *", "OpaqueRef:63502630-5729-b5d4-4ef2-49d6c14e07bd", false, false ] }, "message": "INTERNAL_ERROR(xenopsd internal error: Cannot_add(0000:af:00.0, Device_common.QMP_Error(2, \"{\\\"error\\\":{\\\"class\\\":\\\"GenericError\\\",\\\"desc\\\":\\\"Failed to initialize 11/15, type = 0x1, rc: -1\\\",\\\"data\\\":{}},\\\"id\\\":\\\"qmp-000012-2\\\"}\")))", "name": "XapiError", "stack": "XapiError: INTERNAL_ERROR(xenopsd internal error: Cannot_add(0000:af:00.0, Device_common.QMP_Error(2, \"{\\\"error\\\":{\\\"class\\\":\\\"GenericError\\\",\\\"desc\\\":\\\"Failed to initialize 11/15, type = 0x1, rc: -1\\\",\\\"data\\\":{}},\\\"id\\\":\\\"qmp-000012-2\\\"}\"))) at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12) at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/transports/json-rpc.mjs:38:21 at runNextTicks (node:internal/process/task_queues:60:5) at processImmediate (node:internal/timers:454:9) at process.callbackTrampoline (node:internal/async_hooks:130:17)" }
As far as I can see, all the passthrough devices are correctly specified.
Is the device referenced in the error this one?
af:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01)
Could this be caused by my switching a couple of the PCIe cards around? One of them is an NVMe expansion card that is passed through to TrueNAS (but this does show up correctly under its own ID).
This is XCP-ng 8.3 running on a Supermicro X11DPH-T.
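For anyone hitting the same error, here is a minimal sketch of how the address inside Cannot_add(...) can be pulled out of the message and looked up on the host. The helper name extract_bdf is made up for illustration, and the sample message is a shortened copy of the error above; lspci is only called if it is installed.

```shell
#!/bin/sh
# Pull the first PCI address (domain:bus:device.function) out of a
# xenopsd "Cannot_add(...)" error message. extract_bdf is a hypothetical
# helper name, not part of any XCP-ng tool.
extract_bdf() {
    printf '%s\n' "$1" |
        grep -oE 'Cannot_add\([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]' |
        head -n 1 |
        sed 's/^Cannot_add(//'
}

# Shortened copy of the error text from this post.
msg='xenopsd internal error: Cannot_add(0000:af:00.0, Device_common.QMP_Error(2, ...'
bdf=$(extract_bdf "$msg")
echo "Device named in the error: $bdf"

# On the XCP-ng host, show what sits at that address and where it hangs
# in the PCIe tree. Skipped when lspci is not available.
if command -v lspci >/dev/null 2>&1; then
    lspci -nn -s "$bdf" || true
    lspci -tv || true
fi
```

The tree view is the useful part here: if the address turns out to be a bridge rather than an endpoint, the devices behind it are listed underneath it.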
Cheers.
-
Thinking this could be down to the PCIe card moves, as that did change the IDs for some of the passthrough devices, I removed all the passthroughs, via the command line, and then reinstated them.
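The post doesn't list the exact commands used, so the following is only a sketch, assuming the classic command-line procedure from the XCP-ng documentation (hiding devices from dom0 with xen-pciback.hide, then setting other-config:pci on the VM). The PCI addresses and the helper names pciback_hide and vm_pci_value are placeholders for illustration. The script only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Build the two strings the classic passthrough procedure needs from a
# list of PCI addresses. Helper names are hypothetical.

# "(0000:01:00.0)(0000:02:00.0)" for the dom0 xen-pciback.hide option
pciback_hide() {
    out=''
    for bdf in "$@"; do
        out="${out}(${bdf})"
    done
    printf '%s\n' "$out"
}

# "0/0000:01:00.0,0/0000:02:00.0" for the VM's other-config:pci key
vm_pci_value() {
    out=''
    for bdf in "$@"; do
        out="${out:+$out,}0/${bdf}"
    done
    printf '%s\n' "$out"
}

VM_UUID='<uuid>'                  # placeholder, not a real UUID
set -- 0000:01:00.0 0000:02:00.0  # placeholder addresses

# Print the commands instead of running them.
echo "/opt/xensource/libexec/xen-cmdline --set-dom0 \"xen-pciback.hide=$(pciback_hide "$@")\""
echo "xe vm-param-set other-config:pci=$(vm_pci_value "$@") uuid=$VM_UUID"
```

Generating both strings from one list keeps the dom0 side and the VM side in step, which matters after a card move has changed the addresses.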
Now when I try to start TrueNAS, the whole system locks up. I can't enter anything via PuTTY, XOA, or the Supermicro IPMI.
I have no idea where to go from here.
Cheers.
-
Are you sure your OS (TrueNAS) isn't waiting for the PCI device that's not passed through anymore?
-
@olivierlambert The same devices are passed through, just as different IDs.
But that shouldn't "kill" XCP so that XOA, PuTTY, etc. no longer respond.
Cheers.
-
No, it shouldn't. Have you removed the passthrough from the VM too? Without logs it's hard to tell; take a look at them and see if you can spot something.
-
@olivierlambert
Yes, I removed the passthroughs from the VM before I removed them at the dom0 level.
At the moment, the system is booted directly into TrueNAS and is resilvering the replaced NVMe. Once this finishes, I can reboot XCP and take a look. Is there any particular log you think will give the most clues?
Cheers.
-
I think the usual stuff: https://docs.xcp-ng.org/troubleshooting/log-files/
-
@olivierlambert
Sorry about the delay, got a lot going on.
Anyway, I was able to pick this up again, and here's what happened this time. Booted XCP, noticed there were a bunch more updates, so ran the update so that I'm collecting information from the very latest and greatest.
Rebooted XCP and started the TrueNAS VM with NO passthrough devices. As expected, that started up fine. Stopped TrueNAS, added all the devices, and started TrueNAS again. This immediately caused the server to reboot itself. Hmmmmm.
On the restart of XCP-ng I collected the output from "xen-bugtool --yestoall" and also the /var/crash directory (how do I upload a tgz?), which hopefully will give a clue as to what's going on.
I also have the output from "xl pci-assignable-list" and "xe vm-list params=other-config uuid=<uuid>" showing the passthrough devices if needed.
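Since the card moves changed some addresses, one quick sanity check is whether every address in the VM's config is still in the assignable list. This is a rough sketch: the helper names bdfs and missing_from_assignable are made up, and the sample data is invented rather than taken from my system. It just scans both outputs for anything shaped like a PCI address.

```shell
#!/bin/sh
# Extract every PCI address (domain:bus:device.function) from free text,
# one per line, de-duplicated. Helper names are hypothetical.
bdfs() {
    printf '%s\n' "$1" |
        grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]' |
        sort -u
}

# Print addresses found in the VM config text ($2) that do not appear in
# the assignable-list text ($1).
missing_from_assignable() {
    for bdf in $(bdfs "$2"); do
        bdfs "$1" | grep -qxF "$bdf" || printf '%s\n' "$bdf"
    done
}

# Invented sample data, for illustration only.
assignable='0000:01:00.0
0000:02:00.0'
vm_config='pci: 0/0000:01:00.0,0/0000:03:00.0'

missing_from_assignable "$assignable" "$vm_config"
```

On the host, the two arguments would be "$(xl pci-assignable-list)" and "$(xe vm-list params=other-config uuid=<uuid>)". Any address it prints is configured on the VM but not currently assignable.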
Cheers.