nvs

Update: I replaced the PSU and the server has been running stable now for a few weeks. It appears this was a PSU related issue in the end.

nvs

Yeah, this definitely was a nightmare. I am taking a few days off after this.

nvs

Hi,
I stumbled across exactly the same issue: a GPU and PCIe USB cards would crash the VM if passed through together. An earlier reply already mentioned that updating would fix the issue, and I just wanted to confirm that this works for me as well. After running the following commands on my XCP-ng 8.2:

yum update
yum upgrade

everything works nicely now! Thanks!

nvs

Hi, does anyone have any tips for getting the serial port to show up in XCP-ng? I imagine this should normally just work. I am not sure why it doesn't show up in XCP-ng while it shows as enabled in the BIOS.

nvs

@magicker Thanks, I was also thinking it would be a good idea to monitor CPU temps, just to make sure it's nothing related to that. As I have a consumer motherboard, I don't have IPMI for monitoring. I believe I could install some Linux temperature monitoring packages on the XCP-ng host, but I haven't dared to do that yet, as I don't want to risk breaking anything on the host. Or would you guys say it's fine in principle to install such a temperature monitoring tool on the host?
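
One way to look at temperatures without installing any packages might be to read the kernel's hwmon entries under /sys/class/hwmon directly. Below is a minimal sketch of that idea. It assumes the dom0 kernel exposes a temperature sensor driver there at all (under Xen a driver such as k10temp may not be loaded, in which case nothing is printed), and the function name `read_hwmon_temps` is only illustrative:

```python
import glob
import os


def _read(path):
    """Return the stripped contents of a small sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"<chip>/<label>": degrees_celsius} for every hwmon temperature input."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        prefix = os.path.basename(path)[: -len("_input")]  # e.g. "temp1"
        chip = _read(os.path.join(chip_dir, "name")) or os.path.basename(chip_dir)
        label = _read(os.path.join(chip_dir, prefix + "_label")) or prefix
        raw = _read(path)
        if raw is None:
            continue
        try:
            # sysfs reports temperatures in millidegrees Celsius
            temps["{}/{}".format(chip, label)] = int(raw) / 1000.0
        except ValueError:
            continue
    return temps


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print("{}: {:.1f} C".format(sensor, celsius))
```

Since this only reads files, it should not change anything on the host. An empty result would suggest that no sensor driver is active in dom0 rather than that the hardware has no sensors.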

nvs

@TeddyAstie I now got a serial port breakout panel and connected it to the 10-pin COM port header on the mainboard of my XCP-ng machine (it's a consumer ASUS Prime X570-PRO). Attached to that is a null-modem cable, which goes to a USB-to-serial adapter on a Windows machine, where I run Termite to monitor the respective COM port for any traffic (115200 baud, 8N1).

I start up the XCP-ng server, select the 2nd entry in GRUB, called "XCP-ng (Serial)", and let it boot. Unfortunately, I am not seeing any messages via serial. I've checked in the BIOS of the XCP-ng machine and confirmed that the serial port is enabled and set to IRQ 4.

To troubleshoot, I've checked the 2nd grub entry XCP-ng (Serial) in /boot/efi/EFI/xenserver/grub.cfg:


serial --unit=0 --speed=115200
terminal_input serial console
terminal_output serial console
set default=0
set timeout=5
menuentry 'XCP-ng' {
        search --label --set root root-gxdhmd
        multiboot2 /boot/xen.gz dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311
        module2 /boot/vmlinuz-4.19-xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=hvc0 console=tty0 quiet vga=785 splash ply$
        module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng (Serial)' {
        search --label --set root root-gxdhmd
        multiboot2 /boot/xen.gz com1=115200,8n1 console=com1,vga dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
        module2 /boot/vmlinuz-4.19-xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
        module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng in Safe Mode' {
        search --label --set root root-gxdhmd
        multiboot2 /boot/xen.gz nosmp noreboot noirqbalance no-mce no-bootscrub no-numa no-hap no-mmcfg max_cstate=0 nmi=ignore allow_unsafe dom0_mem=7584M,max:7584M com1=115200,8n1 console=com1,vga
        module2 /boot/vmlinuz-4.19-xen earlyprintk=xen root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
        module2 /boot/initrd-4.19-xen.img
}
menuentry 'XCP-ng (Xen 4.17.5 / Linux 4.19.0+1)' {
        search --label --set root root-gxdhmd
        multiboot2 /boot/xen-fallback.gz dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
        module2 /boot/vmlinuz-fallback root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=hvc0 console=tty0
        module2 /boot/initrd-fallback.img
}
menuentry 'XCP-ng (Serial, Xen 4.17.5 / Linux 4.19.0+1)' {
        search --label --set root root-gxdhmd
        multiboot2 /boot/xen-fallback.gz com1=115200,8n1 console=com1,vga dom0_mem=7584M,max:7584M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G
        module2 /boot/vmlinuz-fallback root=LABEL=root-gxdhmd ro nolvm hpet=disable xen-pciback.hide=(0000:05:00.0)(0000:06:00.0)(0000:07:00.0)(0000:09:00.0)(0000:0d:00.0)(0000:0d:00.1) console=tty0 console=hvc0
        module2 /boot/initrd-fallback.img
}

I am not sure, but shouldn't the kernel line of the 2nd GRUB entry read "console=ttyS0 console=hvc0" instead of the current "console=tty0 console=hvc0"?

Running dmesg | grep ttyS results in nothing, which leads me to think that the serial port isn't actually seen by xcp-ng at all (I might be wrong though)?

Any tips on what the issue could be or how to fix it?
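
To make the entries easier to compare, the console-related options of each menuentry can be pulled out of the grub.cfg shown above. The following is a rough sketch, not an official tool, and the function name `console_options` is made up for illustration. As far as I understand the Xen documentation, `com1=` and `console=com1` on the `multiboot2` line are read by Xen itself, while `console=hvc0` on the first `module2` line points dom0 at Xen's virtual console, so a missing `ttyS0` there may be expected. Treat that as an assumption to verify:

```python
import re


def console_options(grub_cfg_text):
    """Map each menuentry title to its Xen and dom0 console-related options."""
    entries = {}
    title = None
    for line in grub_cfg_text.splitlines():
        stripped = line.strip()
        match = re.search(r"menuentry\s+'([^']+)'", stripped)
        if match:
            title = match.group(1)
            entries[title] = {"xen": [], "dom0": []}
        elif title and stripped.startswith("multiboot2 "):
            # Hypervisor command line: com1=... and console=... are Xen's options.
            entries[title]["xen"] = [
                tok for tok in stripped.split()[2:]
                if tok.startswith(("com1=", "console="))
            ]
        elif title and stripped.startswith("module2 ") and "vmlinuz" in stripped:
            # dom0 kernel command line.
            entries[title]["dom0"] = [
                tok for tok in stripped.split()[2:] if tok.startswith("console=")
            ]
        elif stripped == "}":
            title = None
    return entries
```

Calling it on the contents of /boot/efi/EFI/xenserver/grub.cfg should show that only the entries with "Serial" or "Safe Mode" in their title pass `com1=115200,8n1` and `console=com1,vga` to Xen.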

nvs

@TeddyAstie I have to get my hands on a null-modem cable first. I will try to get one next week and then test. I will get back to you when I have some results.

I have never tried that before, so just to be sure:
The idea is to connect the serial port of the XCP-ng server < null-modem cable > serial port of another PC, right? And then just run e.g. PuTTY on the other machine, and the XCP-ng server should send out its console text via serial.

nvs

To give an impression of what some of the log files look like: it seems the hard crash corrupts the log writing with a long run of NUL bytes, and the next entry is from when the machine starts booting up again.
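
To locate such gaps in a log file, one could scan it for long runs of NUL bytes and print the last line before and the first line after each run. This is a minimal sketch under the assumption that the crash shows up as at least 16 consecutive NUL bytes. The function name `find_nul_runs` and the threshold are arbitrary choices, not anything XCP-ng provides:

```python
import re


def find_nul_runs(data, min_len=16):
    """Return (offset, length, last_line_before, first_line_after) per NUL run."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    results = []
    for match in pattern.finditer(data):
        # Last complete log line written before the run of NUL bytes.
        before = data[: match.start()].rstrip(b"\n").rsplit(b"\n", 1)[-1]
        # First log line after the run, typically from the next boot.
        after = data[match.end():].lstrip(b"\n").split(b"\n", 1)[0]
        results.append((
            match.start(),
            match.end() - match.start(),
            before.decode("utf-8", "replace"),
            after.decode("utf-8", "replace"),
        ))
    return results
```

It takes the raw bytes of a log file, for example one read with `open(path, "rb").read()`. The timestamp of the last line before each run gives a rough upper bound on how long before the crash the log was last written.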

nvs

Good morning,

Unfortunately, I had to re-open this issue. I thought it was fixed with the new PSU, but the reboots showed up again some time later. Meanwhile I upgraded XCP-ng to the latest stable 8.3.0 and applied all the patches, as I was curious whether it would still show this behaviour afterwards. I also tested running different VMs on that machine, and after some time I am now fairly confident that running one particular VM causes these sudden reboots of the host:

It's a VM with 2 CPUs and 8 GiB of RAM, running Ubuntu 22.04 with MongoDB. It has a 20 GiB OS partition and, possibly the relevant clue, a 20 TB raw disk passed through into the VM (15 TiB partition).

I did a lot of testing over the last weeks without that VM (MongoDB) writing to the raw disk and had no reboots. Yesterday I configured the MongoDB VM to continuously add data to the raw disk again, and since then I've had 2 reboots in just a single day.

The good news is that I should be able to reproduce the issue now. Could someone give me pointers on what else I could try, to figure out what exactly is going wrong and how to go about fixing it?

Thanks!

nvs

@andriy.sultanov Hi, I am having the same issue. Many PCIe USB cards aren't showing up in XO. Are there any updates on if/when this might be fixed? Being able to configure PCIe passthrough from the web UI is a really useful new feature, and it would be really cool to be able to make full use of it.

Below, as requested, is my list of items that don't show up:

00:00.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Root Complex [1480]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
00:00.2 "IOMMU [0806]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse IOMMU [1481]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
00:01.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:01.1 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse GPP Bridge [1483]" "" ""
00:01.2 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse GPP Bridge [1483]" "" ""
00:02.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:03.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:03.1 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse GPP Bridge [1483]" "" ""
00:03.2 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse GPP Bridge [1483]" "" ""
00:04.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:05.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:07.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:07.1 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1484]" "" ""
00:08.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Host Bridge [1482]" "" ""
00:08.1 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1484]" "" ""
00:14.0 "SMBus [0c05]" "Advanced Micro Devices, Inc. [AMD] [1022]" "FCH SMBus Controller [790b]" -r61 "ASUSTeK Computer Inc. [1043]" "Device [87c0]"
00:14.3 "ISA bridge [0601]" "Advanced Micro Devices, Inc. [AMD] [1022]" "FCH LPC Bridge [790e]" -r51 "ASUSTeK Computer Inc. [1043]" "Device [87c0]"
00:18.0 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 0 [1440]" "" ""
00:18.1 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 1 [1441]" "" ""
00:18.2 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 2 [1442]" "" ""
00:18.3 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 3 [1443]" "" ""
00:18.4 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 4 [1444]" "" ""
00:18.5 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 5 [1445]" "" ""
00:18.6 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 6 [1446]" "" ""
00:18.7 "Host bridge [0600]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse/Vermeer Data Fabric: Device 18h; Function 7 [1447]" "" ""
02:00.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse Switch Upstream [57ad]" "" ""
03:01.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:02.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:03.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:04.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:05.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:06.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a3]" "" ""
03:08.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a4]" "" ""
03:09.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a4]" "" ""
03:0a.0 "PCI bridge [0604]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse PCIe GPP Bridge [57a4]" "" ""
05:00.0 "USB controller [0c03]" "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]" -r03 -p30 "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]"
06:00.0 "USB controller [0c03]" "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]" -r03 -p30 "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]"
07:00.0 "USB controller [0c03]" "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]" -r03 -p30 "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]"
09:00.0 "USB controller [0c03]" "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]" -r03 -p30 "Renesas Technology Corp. [1912]" "uPD720201 USB 3.0 Host Controller [0014]"
0a:00.0 "Non-Essential Instrumentation [1300]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Reserved SPP [1485]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
0a:00.1 "USB controller [0c03]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse USB 3.0 Host Controller [149c]" -p30 "ASUSTeK Computer Inc. [1043]" "Device [8808]"
0a:00.3 "USB controller [0c03]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse USB 3.0 Host Controller [149c]" -p30 "Advanced Micro Devices, Inc. [AMD] [1022]" "Device [148c]"
10:00.0 "Non-Essential Instrumentation [1300]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse PCIe Dummy Function [148a]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
11:00.0 "Non-Essential Instrumentation [1300]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Reserved SPP [1485]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
11:00.1 "Encryption controller [1080]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse Cryptographic Coprocessor PSPCPP [1486]" "ASUSTeK Computer Inc. [1043]" "Device [8808]"
11:00.3 "USB controller [0c03]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Matisse USB 3.0 Host Controller [149c]" -p30 "ASUSTeK Computer Inc. [1043]" "Device [87c0]"
11:00.4 "Audio device [0403]" "Advanced Micro Devices, Inc. [AMD] [1022]" "Starship/Matisse HD Audio Controller [1487]" "ASUSTeK Computer Inc. [1043]" "Device [8733]"

Most important ones being the PCIe USB cards: 05:00.0, 06:00.0, 07:00.0, 09:00.0.

Thanks for looking into fixing this.
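
To pick the USB controllers out of a list like the one above, the lines can be filtered by PCI class code. This is a small sketch assuming the list is in the format that `lspci -mm -nn` prints (quoted fields with the numeric IDs in square brackets). The function name `devices_of_class` is only for illustration:

```python
import re
import shlex


def devices_of_class(lspci_mm_nn_output, class_id="0c03"):
    """Return [(address, vendor, device)] for lines whose class code is class_id."""
    found = []
    for line in lspci_mm_nn_output.splitlines():
        fields = shlex.split(line)  # honours the double-quoted fields
        if len(fields) < 4:
            continue
        address, dev_class, vendor, device = fields[:4]
        # The class field ends in its code, e.g. "USB controller [0c03]".
        if re.search(r"\[%s\]$" % re.escape(class_id), dev_class):
            found.append((address, vendor, device))
    return found
```

With the default class `0c03` (USB controller), this returns the four Renesas cards at 05:00.0, 06:00.0, 07:00.0 and 09:00.0 together with the onboard AMD controllers from the list above.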

nvs

@TeddyAstie Thanks. Unfortunately my machine doesn't have IPMI. So can I just connect a serial cable between this machine and another machine and monitor the serial output on that other machine, say a Windows machine running PuTTY? Is there anything special to consider? I have never done this before, but I'm happy to read up on it if you have any pointers.

nvs

@nvs The machine crashed/restarted itself again this morning. I didn't even have all of the usual VMs running this time. Nothing was logged in kern.log when it crashed. In the hours before the crash I checked xl dmesg a few times, but nothing stood out to me (same log as I posted above). Any suggestions are highly welcome, as I'm not sure how to proceed with troubleshooting this. My next step would be to replace the PSU and see if anything changes, but it's a long shot.