Issue after latest host update
-
Hello community, I hope someone can help me with my home lab, because it has been offline for two days already. I tried to find help on Reddit, but was encouraged to use the XCP-ng forums. So, here I am.
Reddit (for reference): https://www.reddit.com/r/xcpng/comments/1btzjgw/issue_after_latest_host_update/
The hosts have XCP-ng 8.2.1 installed.
The issue:
I updated three XCP-ng 8.2.1 hosts, including the pool master, to the latest version. Now no VM can be started anymore, not even XOA. All VMs are stuck at starting up. The VMs can neither be pinged nor connected to via SSH. `xsconsole` is unable to provide information about the VMs, other than that each one seems to permanently consume one vCore. Only a forced shutdown via the CLI can be used to shut the VMs down. I even deleted (uninstalled) XOA and redeployed it. It's not able to start up.
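To illustrate the stuck state (a minimal sketch; `<vm-uuid>` is a placeholder):

```
xe vm-list params=name-label,power-state   # every VM is reported as "running"
xe vm-shutdown uuid=<vm-uuid> --force      # soft shutdown is ignored; only this works
```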
What I did:
- Live migrated all VMs from the pool master (Host 1) to Host 2.
- Afterward, `ssh`'ed into the pool master and used `xe task-list` to make sure nothing was going on.
- Then used `yum update` to update the pool master.
- Then, after making sure nothing was going on (`xe task-list`), used `xe-toolstack-restart`.
- Then I live migrated all VMs back from Host 2 to Host 1 (pool master). Checked whether anything was going on, then updated Host 2 the same way via CLI/SSH. Then restarted the toolstack.
- Then I remembered that I had always restarted every host after updating it. So I tried to live migrate the VMs back from Host 1 (pool master) to Host 2 (both were updated at this point). Everything went fine.
- I restarted the pool master. When it was back online, I tried to live migrate the VMs from Host 2 back to Host 1 (pool master), in order to restart Host 2 as well. This is when the issues started. Migration stalled and got stuck for all VMs. A look at `xe task-list` showed live migration tasks with a progress of 1.000 (100% done), but they weren't really done. The VMs didn't respond to anything. XOA already broke at this point, because it's a VM, too. I decided to force shutdown all VMs and delete their autostart parameter via the CLI.
- Since nothing worked anymore, I decided to update Host 3 without caring about anything; all VMs were shut down anyway. All hosts were now updated. I shut down all three hosts and cut the power of Host 2 and Host 3.
- I then cold started Host 1 (pool master) again and tried to start XOA and some other VMs. They always got stuck at booting and didn't respond to ping or SSH.
- After a few tries and a few more restarts of the pool master, I decided to delete XOA: `xe vm-uninstall uuid=<uuid of XOA VM>`. I also tried `xe vm-delete uuid=<uuid of XOA VM>`, a few times, because redeployment via the XCP-ng web interface always timed out. The VM got deployed, but at the point where it would start up, it stalled like all the other VMs. `xe vm-list power-state=running` shows the VM as running, but a look at `xsconsole` shows 50% usage of its 2 vCores. My other VMs always show the same pattern: one vCore at 100% (e.g. 4 vCores = 25% CPU load).
- Since I had always stored all VMs on an NFS share, my last attempt was to deploy XOA on the pool master's local storage, with the same result: the VM got stuck at starting up. I even let it sit for a few hours. Nothing. (The exact commands are sketched below.)
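For reference, the per-host sequence above boils down to the following commands (a sketch; `<vm-uuid>` is a placeholder, and I'm assuming the usual `other-config:auto_poweron` flag is what controls autostart):

```
# Per-host update sequence, run via SSH
xe task-list           # make sure no tasks are running
yum update             # apply the latest updates
xe-toolstack-restart   # restart the toolstack

# Recovery steps once the VMs hung
xe vm-shutdown uuid=<vm-uuid> --force                            # only a forced shutdown works
xe vm-param-set uuid=<vm-uuid> other-config:auto_poweron=false   # disable autostart
```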
As suggested on Reddit, I took a look at xensource.log. It's filled with info and debug messages, but nothing useful, at least as far as I can tell.
This summarizes the issue and its history. I hope someone can help me get it working again. I would also be fine with starting from scratch. In that case, I would need advice on how to back up my VMs. The VMs also have snapshots, which would need to be consolidated, no?
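My rough idea for a CLI-only backup while XOA is down would be something like this (a sketch, untested; the target path is a placeholder, and as far as I know `xe vm-export` captures the current disk state without the snapshots):

```
xe vm-shutdown uuid=<vm-uuid>                                # export requires a halted VM
xe vm-export uuid=<vm-uuid> filename=/mnt/backup/myvm.xva    # full copy as an XVA file
```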
Greets
-
@RealTehreal I'm not able to help, BUT what I guess caused this is your point 4. You must restart/reboot the host after updates, not just restart the toolstack...
-
@manilx
The documentation states that the toolstack should be restarted after updates. That's why I always did it that way: https://docs.xcp-ng.org/management/updates/#from-command-line
But anyway, the issues started after restarting the host.
-
@RealTehreal I stand corrected. We always do the rolling pool update and let XOA take care of all of this.
-
Have you done a simple `dmesg` and checked the output?
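If the output is long, filtering for problems helps (assuming the util-linux `dmesg` that ships with XCP-ng 8.2):

```
dmesg --level=err,warn   # show only errors and warnings
```
-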
@olivierlambert I just did, but it looks fine to me (dmesg.txt).
I just tried to designate one of the slaves as the new master. Still cannot start VMs. I will now eject all slaves, reinstall XCP-ng on one of them, add it to the pool again and make it the new master. Then I'll try again. If that doesn't work either, I'll reinstall on the third device, create a new pool for it and try again.
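For reference, the re-election and ejection steps are just these two commands (a sketch; UUIDs are placeholders):

```
xe pool-designate-new-master host-uuid=<slave-uuid>   # promote a slave to pool master
xe pool-eject host-uuid=<slave-uuid>                  # remove a slave before reinstalling it
```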
-
Could you do a mem test on the current master?
What kind of storage are you using?
-
`dmesg` looks fine, so it's probably something else that borked here.
You wrote in the Reddit thread that you were able to start VMs but they never actually started and the task was stuck at 1.000 progress. Is that still the case after electing a new master? If yes, check `xentop` on the host where the VM was started to see if it's consuming resources.
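If the interactive view is awkward to read, a one-shot batch snapshot works too (a sketch):

```
xentop -b -i 2 -d 1   # non-interactive dump; two iterations so CPU % has a delta to compute
```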
-
Yeah, I'm baffled, because this is not something we've seen before on a "normal" setup. I really wonder where the problem lies.
-
@olivierlambert The issue started on all three hosts after the latest update via `yum update`. I can't think of three devices having faulty memory, one right after another. Before the issue, I used an NFS share as VM storage, but I have already deployed XOA on local storage (LVM). Same issue on all three hosts.
@nikade First, I'll redeploy XOA on the pool master and take a look at `xentop`. Regarding `xsconsole`, every VM runs with one vCore at 100% all the time and doesn't respond to anything. `xe vm-list` always lists them as running, though. In this state, the only way to shut down VMs is a forced shutdown, since they won't react to the soft shutdown command. I never had such issues either. I've been running my setup for about a year now and did several updates via the CLI. Likewise, I'm baffled that everything suddenly went down the drain.
-
`xentop` shows XOA consuming 100.0 CPU (%), meaning one core. But the quick deployment is stuck at "almost there" until it times out. The VM is still consuming one CPU core while not being accessible.
-
I can't really understand what happened, to be honest, I've done this many times without issues.
What can you see in the console tab of the VM when you start it? Or in the stats tab?
-
@RealTehreal What's the state of the network stack? Is it up, and what's the activity percentage?
-
@nikade said in Issue after latest host update:

> I can't really understand what happened, to be honest, I've done this many times without issues.
> What can you see in the console tab of the VM when you start it? Or in the stats tab?

I can't see anything, because XOA itself is inaccessible, since it's a VM. And VMs won't start into a usable state.
-
@RealTehreal said in Issue after latest host update:

> @nikade said in Issue after latest host update:
>
> > I can't really understand what happened, to be honest, I've done this many times without issues.
> > What can you see in the console tab of the VM when you start it? Or in the stats tab?
>
> I can't see anything, because XOA itself is inaccessible, since it's a VM. And VMs won't start into a usable state.
Anything in the XCP-ng 8.2.1 host logs from it attempting to start the VM, and in general? They may hold clues about any underlying issues.
Also, any appropriate logs from the NFS storage server would help, as they may reveal anything that could be causing issues on its end.
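On the host, these would be the usual places to watch while starting a VM (assuming XCP-ng 8.2's default log locations):

```
tail -f /var/log/xensource.log   # toolstack (xapi) log
tail -f /var/log/daemon.log      # other host daemons
grep -i error /var/log/SMlog     # storage manager log, relevant for SR/NFS issues
```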
-
Any specific MTU settings?
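A quick way to check that from the host CLI (a sketch):

```
xe pif-list params=device,MTU,currently-attached   # per-NIC MTU as xapi sees it
xe network-list params=name-label,MTU              # per-network MTU
```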
-
A way to check whether it's network related would be to use a local SR to boot a VM and see if it works.
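Something like this would take the NFS storage out of the equation (a sketch; names and UUIDs are placeholders, and the VM must be halted for the copy):

```
xe sr-list type=lvm                                                        # find the local SR's uuid
xe vm-copy vm=<vm-name> sr-uuid=<local-sr-uuid> new-name-label=local-test  # copy onto local storage
xe vm-start vm=local-test
```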
-
@john-c I already took a look at dmesg and /var/log/xensource.log (I crawled through more than 1,000 log lines) and couldn't find anything revealing. The NFS server is unrelated because, as stated before, I currently only use the hosts' local storage, to eliminate possible external issues.
-
@olivierlambert That's what I'm doing, to make sure it's not a network-related issue.