Team - Security group | XCP-ng and XO forum

bleader

I think whatever solution suits you will work.

Personally, if I know there are issues with it, I would tend to disable it in the BIOS, to be sure nobody tries to use it later and wastes their time. In an enterprise setting, that can be important.

One thing to keep in mind if you keep it: if you want to add other hosts to the pool, they will need to have a similar network topology. So if you end up having eth0 and eth1 with your current management network on eth1, any new host should be able to have its management on eth1 as well. You may work around it with interface renaming, but that tends to get messy over time.

That being said, I'm not sure that even disabling the Realtek NIC in the BIOS will change the interface numbering, now that eth1 already exists and is configured.

If you don't plan to add hosts to the pool, and don't have a team with people who may act on these machines in the future without being aware of this setup history, leaving it connected and disabling the port on the switch should not be an issue.
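To illustrate the topology constraint above, here is a minimal sketch of the check you would do by hand before adding a host. The host names, the device lists, and the `check_management_device` helper are all hypothetical; this is plain Python on made-up data and does not call any XCP-ng API.

```python
def check_management_device(pool_hosts, candidate_name, candidate_devices):
    """Return a list of problems; an empty list means the layout matches.

    pool_hosts: dict of host name -> device carrying the management network
    candidate_devices: device names present on the host you want to add
    (All inputs are illustrative data you would collect yourself.)
    """
    problems = []
    mgmt_devices = set(pool_hosts.values())
    if len(mgmt_devices) > 1:
        problems.append(
            "pool hosts disagree on management device: "
            + ", ".join(sorted(mgmt_devices))
        )
    for device in sorted(mgmt_devices):
        if device not in candidate_devices:
            problems.append(f"{candidate_name} has no {device}")
    return problems


# Hypothetical pool where management lives on eth1
pool_hosts = {"host-a": "eth1", "host-b": "eth1"}
print(check_management_device(pool_hosts, "host-c", ["eth0"]))
print(check_management_device(pool_hosts, "host-d", ["eth0", "eth1"]))
```

A host with a single NIC showing up as eth0 would be flagged here, which is the situation where you would have to resort to interface renaming.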

bleader

@dnikola said in [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached:

Has anyone experienced similar page faults in the dom0 TCP stack on 4.19 kernels or XCP-ng 4.17.5?

Not that I know of.

Are there any known issues with network drivers on this kernel/hypervisor combo?

No known issues, but there can be problems with some drivers, so you should specify which NICs and drivers you are using.
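If it helps to gather that information, here is a small sketch that lists the kernel driver bound to each network interface by reading the standard Linux sysfs layout (`/sys/class/net/<iface>/device/driver`). The `nic_drivers` helper name is my own, and I am assuming a Python 3 interpreter is available where you run it.

```python
import os


def nic_drivers(sysfs_net="/sys/class/net"):
    """Map each interface name to its kernel driver (None if it has none)."""
    drivers = {}
    if not os.path.isdir(sysfs_net):
        return drivers
    for iface in sorted(os.listdir(sysfs_net)):
        iface_dir = os.path.join(sysfs_net, iface)
        # Skip plain files such as bonding_masters; interfaces are directories.
        if not os.path.isdir(iface_dir):
            continue
        driver_link = os.path.join(iface_dir, "device", "driver")
        if os.path.islink(driver_link):
            # The symlink target ends with the driver name.
            drivers[iface] = os.path.basename(os.readlink(driver_link))
        else:
            # Virtual interfaces (lo, bridges, ...) have no device driver.
            drivers[iface] = None
    return drivers


if __name__ == "__main__":
    for iface, driver in nic_drivers().items():
        print(iface, driver or "(no device driver, likely virtual)")
```

Posting that output along with the NIC models gives enough to tell which driver is involved.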

Would you recommend moving to a newer dom0 kernel or hypervisor build?

On XCP-ng, the latest version is 8.3. You didn't specify your version in your post, but since you're running the latest version of Xen, I assume it is an up-to-date 8.3, in which case there is no newer build.

Could a memory issue cause this specific kind of page table inconsistency during a kernel panic?

Yes. It can be a bug in the code, but it absolutely could be a hardware issue.

Any advice on additional debug steps or log files I should collect next time?

I would start by running a memtest on that host to make sure the memory is not having issues.

Do you know if a specific VM was doing something particular at that time? We had some issues in the past with FreeBSD VMs using WireGuard, but this does not look similar, and that should be fixed now.
What kind of guests were running on that host? Linux, Windows, some BSD-based?
If you are running Windows guests, please be sure to read this blog post and follow the guidelines there.

From a quick look, I don't see anything obvious. Follow Olivier's suggestion first. If you still have issues after that, you can share an additional report using xen-bugtool -y. But please be sure to update your BIOS and check your memory before doing that.

bleader

ping @Team-Hypervisor-Kernel

bleader

@JBlessing since the boot does start, it looks like the networking side is working, at least at first.

Just for debugging purposes, you could try switching that VM to BIOS instead of UEFI if possible. Maybe the problem is related to what the PXE is starting in the VM.

You could also try switching the VM between the Realtek and e1000 NICs. At this stage, PV drivers are not there, so it is using an emulated NIC. Maybe the image your PXE starts doesn't like the one you're using and it gets stuck somehow.

As you're already using it with VMware, I assume you know how to size your VM. But if you went for a tight RAM value for this VM, you could try giving it more RAM to see if that is related. Everything has to fit in RAM at some point, and we may be using more at startup than VMware…

Hope one of these helps.
