XCP-NG server crashes/reboots unexpectedly
-
@TeddyAstie Thanks. Unfortunately my machine doesn't have IPMI. So can I just connect a serial cable between this machine and another machine and monitor the serial output on that other machine (say, a Windows machine running PuTTY)? Is there anything special to consider? I have never done this before, but I'm happy to read up if you have any pointers.
-
@nvs said in XCP-NG server crashes/reboots unexpectedly:
Thanks. Unfortunately my machine doesn't have IPMI. So can I just connect a serial cable between this machine and another machine
Yes, though you would still need to boot using the "XCP-ng (Serial)" grub entry.
(You can also add some serial console options to the Xen cmdline.)
-
Update: I replaced the PSU and the server has now been running stably for a few weeks. It appears this was a PSU-related issue in the end.
-
Ahh excellent news! Thanks for keeping us posted

-
Good morning,
So I unfortunately had to re-open this issue. I thought it was fixed with the new PSU, but the reboots showed up again some time later. Meanwhile I upgraded XCP-NG to the latest stable 8.3.0 and applied all the patches, curious whether it would still show this behaviour afterwards. I also tested running different VMs on that machine, and after some time I am now fairly confident that running one particular VM causes these sudden reboots of the host:
- It's a VM with 2 CPUs and 8 GiB of RAM, running Ubuntu 22.04 with MongoDB. It has a 20 GiB OS partition and, possibly the relevant clue, a 20 TB raw disk passed through into the VM (15 TiB partition).
Over the last weeks I did a lot of testing without that VM (MongoDB) writing to the raw disk, and had no reboots. Yesterday I configured the MongoDB VM to continuously add data to that raw disk again, and since then I've had 2 reboots in a single day.
The good news is that I should now be able to reproduce the issue. Could someone give me pointers on what else I could try to figure out what exactly is going wrong and how to fix it?
Thanks!
-
To give an impression of what some of the log files look like: it seems the hard crash glitches the log writing, leaving a lot of NUL bytes, and the next entry is from when the machine starts booting up again:
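For reading logs damaged this way, a minimal sketch that drops the NUL bytes so the last entries before the crash are easier to find. The function name `strip_nuls` and the example log path are assumptions, not something from this thread.

```shell
# Print a log with NUL bytes removed. NUL runs are typically left behind
# when a write was cut short by an abrupt reset or power loss.
# Reads the file given as $1, or stdin when no argument is passed.
strip_nuls() {
    if [ "$#" -gt 0 ]; then
        tr -d '\000' < "$1"
    else
        tr -d '\000'
    fi
}

# Example (the path is an assumption; use whichever log you are inspecting):
#   strip_nuls /var/log/kern.log | tail -n 50
```

Note that this only makes the file readable; whatever the host was about to log at the moment of the crash is still missing.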

-
@nvs Can you try with a serial console and something listening on it, so that when it crashes we get the crash reason?
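One way to have "something listening" is sketched below, assuming the listening machine runs Linux with GNU `stty` and that the adapter shows up as `/dev/ttyUSB0` (both are assumptions; on Windows a terminal program with logging enabled serves the same purpose). The 115200 baud, 8N1 settings match the `com1=115200,8n1` option in the serial grub entry. The function names are hypothetical.

```shell
# Prefix every line read from stdin with a UTC timestamp, so the last
# messages before a crash can be matched against other logs.
stamp_lines() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done
}

# Configure the serial port for 115200 baud, 8 data bits, no parity,
# 1 stop bit, then append everything it emits to a log file.
# $1: serial device (e.g. /dev/ttyUSB0, an assumption)  $2: output file
log_serial() {
    stty -F "$1" 115200 cs8 -cstopb -parenb raw -echo
    stamp_lines < "$1" >> "$2"
}

# Example: log_serial /dev/ttyUSB0 xen-serial.log
```

Leaving this running until the next reboot should capture whatever the hypervisor prints on its way down, provided the serial console itself is working.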
-
@TeddyAstie I have to get my hands on a null modem cable first. I will try to get one next week and then test, and will get back when I have some results.
I have never tried this before, so just to be sure:
The idea is to connect the serial port of the xcp-ng server, via a null modem cable, to the serial port of another PC, right? And then just run e.g. PuTTY on the other machine, and the xcp-ng server should send its console text out via serial.
-
@nvs Hi there, we had a similar issue on our 5900x boxes. We have about twenty of these and they have generally been great hosts for xcp loads. However, one of them was constantly rebooting. While testing the RAM, the temp jumped to 90/90. They are running in 1U boxes so I expect heat, but this seemed silly. We ramped the fans to max but this did not help.
We then used the BIOS to dial down various CPU boost profiles and other settings to lower the heat, and re-applied paste to the CPU. So far so good.
-
@TeddyAstie I now have a serial port breakout panel and connected it to the 10-pin COM port connector on the mainboard of my xcp-ng machine (it's a consumer ASUS Prime x570 PRO). Connected to that is a null-modem cable, which connects to a USB-to-serial adapter on a Windows machine, where I run Termite monitoring the respective COM port for any traffic (115200 baud, 8N1).
I start up the xcp-ng server, select the 2nd grub entry called "XCP-ng (Serial)" and let it boot. However, I am unfortunately not seeing any messages via serial. I've checked in the BIOS of the xcp-ng machine and confirmed the serial port is enabled and at IRQ 4.
To troubleshoot, I've checked the 2nd grub entry, XCP-ng (Serial), in /boot/efi/EFI/xenserver/grub.cfg:
serial --unit=0 --speed=115200
terminal_input serial console
terminal_output serial console
set default=0
set timeout=5
menuentry 'XCP-ng' {
    search --label --set root root-gxdhmd
    multiboot2 /boot/xen.gz dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311
    module2 /boot/vmlinuz-4.19-xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=hvc0 console=tty0 quiet vga=785 splash ply$
    module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng (Serial)' {
    search --label --set root root-gxdhmd
    multiboot2 /boot/xen.gz com1=115200,8n1 console=com1,vga dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
    module2 /boot/vmlinuz-4.19-xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
    module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng in Safe Mode' {
    search --label --set root root-gxdhmd
    multiboot2 /boot/xen.gz nosmp noreboot noirqbalance no-mce no-bootscrub no-numa no-hap no-mmcfg max_cstate=0 nmi=ignore allow_unsafe dom0_mem=7584M,max:7584M com1=115200,8n1 console=com1,vga
    module2 /boot/vmlinuz-4.19-xen earlyprintk=xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
    module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng (Xen 4.17.5 / Linux 4.19.0+1)' {
    search --label --set root root-gxdhmd
    multiboot2 /boot/xen-fallback.gz dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
    module2 /boot/vmlinuz-fallback root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=hvc0 console=tty0
    module2 /boot/initrd-fallback.img
}
menuentry 'XCP-ng (Serial, Xen 4.17.5 / Linux 4.19.0+1)' {
    search --label --set root root-gxdhmd
    multiboot2 /boot/xen-fallback.gz com1=115200,8n1 console=com1,vga dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
    module2 /boot/vmlinuz-fallback root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
    module2 /boot/initrd-fallback.img
}
I am not sure, but shouldn't the 2nd grub entry read "console=ttyS0 console=hvc0" instead of the current "console=tty0 console=hvc0"?
Running
dmesg | grep ttyS
results in nothing, which leads me to think that the serial port isn't actually seen by xcp-ng at all (I might be wrong though)?
Any tips on what could be the issue and how to fix it?
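A possible first check, sketched under assumptions: when Xen is booted with `com1=...`, the hypervisor is expected to drive the UART itself and dom0 reaches it through `hvc0`, so an empty `dmesg | grep ttyS` in dom0 may not by itself mean the port is missing. It can be more telling to confirm that the running hypervisor was actually started with the serial options. The helper name `xen_cmdline_has_serial` is hypothetical; `xl info` and `xl dmesg` are the standard Xen tools.

```shell
# Succeed if a Xen command line (passed as one string in $1) both defines
# the com1 port (com1=...) and routes console output to it (console=...com1...).
xen_cmdline_has_serial() {
    has_port=no
    has_console=no
    for arg in $1; do
        case "$arg" in
            com1=*) has_port=yes ;;
            console=*com1*) has_console=yes ;;
        esac
    done
    [ "$has_port" = yes ] && [ "$has_console" = yes ]
}

# On the host, feed it the command line of the running hypervisor:
#   xen_cmdline_has_serial "$(xl info | sed -n 's/^xen_commandline *: *//p')" \
#       && echo "serial console requested" || echo "booted without serial options"
# The hypervisor's own boot log also records the command line it received:
#   xl dmesg | grep -i 'command line'
```

If the options are present and nothing arrives on the other end, the remaining suspects would be on the wiring side: the header pinout of the breakout panel, the null-modem cable, or the COM port settings on the listener.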
-
@magicker Thanks, I was also thinking it would be a good idea to monitor CPU temps, just to make sure it's nothing related to that. As I have a consumer motherboard, I don't have IPMI to monitor them. I believe I could install some Linux temperature monitoring packages on the xcp-ng host, but I haven't dared to do that yet, not wanting to risk breaking anything on the host. Or would you guys say it's in principle fine to install such a temp monitoring tool on the host?
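A sketch that reads temperatures without installing anything, assuming a hardware monitoring driver (for example `k10temp` on AMD CPUs) is already loaded in dom0 and exposes sensors under `/sys/class/hwmon`; whether that is the case on a given XCP-ng host is an assumption to verify. The function name `print_temps` is hypothetical.

```shell
# Print one line per temperature sensor found under $1 (default:
# /sys/class/hwmon). The hwmon interface reports millidegrees Celsius,
# so 45500 is printed as 45.5 C.
print_temps() {
    base="${1:-/sys/class/hwmon}"
    for input in "$base"/hwmon*/temp*_input; do
        [ -r "$input" ] || continue
        dir=$(dirname "$input")
        name=$(cat "$dir/name" 2>/dev/null || echo unknown)
        milli=$(cat "$input" 2>/dev/null) || continue
        printf '%s %s: %d.%d C\n' "$name" "$(basename "$input")" \
            "$((milli / 1000))" "$(( (milli % 1000) / 100 ))"
    done
}

# Example: sample every 10 seconds while the suspect VM is writing
#   while sleep 10; do date; print_temps; done >> /root/temps.log
```

If the function prints nothing, no readable sensor is exposed and a package-based tool or a BIOS-level check would be needed instead.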