XCP-ng locks up after 1-3 days - need help in finding out why
-
Hi all,
My XCP-ng server has the tendency to lock up every 1 to 3 days, with no visible clue as to why this is happening.
Memtest is passing with flying colours, so I can already rule out the RAM. Of all the logs I've checked, the last message was in daemon.log, but there was nothing critical there. Its timestamp was Sep 23 17:23:09, so I know the server was still up at that point.
The server has the following configuration:
- CPU: AMD RYZEN 9 3900XT
- MOBO: Gigabyte AORUS X570 ELITE
- RAM: HyperX 128 GB DDR4-3600 Non-ECC
- GPU: Nvidia GTX1060 6GB
- SSD: Samsung 980 PRO 1TB NVMe
- HBA: LSI 9400-16i
I have added both the GPU & the HBA as PCIe passthrough devices:
- the HBA to a VM running TrueNAS Core
- the GPU to a VM running Plex
Is it possible that, because the CPU does not have integrated graphics, passing the GPU through to the VM is causing the issue, since the XCP-ng host then no longer has a graphical output?
I've added the report from the latest xen-bugtool execution here: https://drive.google.com/file/d/1DZ7etZp4viTSXecEn7Y1MeV_9-NKjvOK/view?usp=sharing
-
Can you give more details about what you mean by "lock up"? Does it stop answering SSH? Are all VMs down?
I don't know if the fact that the GPU is passed through may trigger a bug somewhere, but you could try to not pass it through for some time and check whether you still get a lock-up. And if you do, check whether you can still access the console using the server's display.
If there's a crash, you may find clues in /var/crash.
Other logs to check are listed at https://xcp-ng.org/docs/troubleshooting.html
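If it helps to compare where each log stops before a lock-up (as was done above with daemon.log), here is a minimal Python sketch that returns the last timestamped entry from a set of syslog-style lines. It assumes the "Mon DD HH:MM:SS" prefix seen in the excerpts in this thread and an English locale for month names. Syslog lines carry no year, so `year` is a parameter you supply yourself, and the function names are made up for illustration.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Returns None when the line does not start with such a stamp.
    `year` is supplied by the caller because syslog omits it (assumption).
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def last_entry(lines, year):
    """Return (timestamp, line) for the latest timestamped line, or None."""
    best = None
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is not None and (best is None or ts >= best[0]):
            best = (ts, line.rstrip("\n"))
    return best
```

Running `last_entry(open(path), year)` over each log file in turn shows which log kept writing the longest, which narrows down when the host actually froze. A hard freeze often leaves nothing in the logs at all, so an empty-handed result is still consistent with a hardware or firmware hang.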
-
So, I don't know if it's the cause of your issues, but I see various error messages in kern.log.1:
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.000493] ACPI BIOS Warning (bug): Incorrect checksum in table [BGRT] - 0xF6, should be 0x71 (20180810/tbprint-177)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251588] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251638] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251734] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251828] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251914] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.251999] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252094] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252208] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252310] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252410] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252509] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252610] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252729] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252839] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.252938] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.253034] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 1.253085] Warning: Processor Platform Limit not supported.
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 3.269667] ACPI: Invalid passive threshold
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 5.898741] ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20180810/utaddress-213)
Sep 22 15:03:35 K2I-XCPNG-01 kernel: [ 5.898750] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
And then a lot of:
Sep 22 15:05:49 K2I-XCPNG-01 kernel: [ 140.797275] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:05:49 K2I-XCPNG-01 kernel: [ 140.797283] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:05:49 K2I-XCPNG-01 kernel: [ 140.797293] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:05:49 K2I-XCPNG-01 kernel: [ 140.797298] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:06 K2I-XCPNG-01 kernel: [ 158.551038] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:06:06 K2I-XCPNG-01 kernel: [ 158.551046] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:06 K2I-XCPNG-01 kernel: [ 158.551056] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:06 K2I-XCPNG-01 kernel: [ 158.551061] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:27 K2I-XCPNG-01 kernel: [ 178.913723] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:06:27 K2I-XCPNG-01 kernel: [ 178.913731] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:27 K2I-XCPNG-01 kernel: [ 178.913741] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:27 K2I-XCPNG-01 kernel: [ 178.913746] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:30 K2I-XCPNG-01 kernel: [ 182.406273] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:06:30 K2I-XCPNG-01 kernel: [ 182.406282] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:30 K2I-XCPNG-01 kernel: [ 182.406291] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:30 K2I-XCPNG-01 kernel: [ 182.406297] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.258259] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:03.1
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.258268] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.258278] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.258284] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.685746] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:03.1
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.685754] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.685762] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:35 K2I-XCPNG-01 kernel: [ 187.685767] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:36 K2I-XCPNG-01 kernel: [ 188.474937] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:06:36 K2I-XCPNG-01 kernel: [ 188.474945] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:36 K2I-XCPNG-01 kernel: [ 188.474955] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:36 K2I-XCPNG-01 kernel: [ 188.474960] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:37 K2I-XCPNG-01 kernel: [ 189.760365] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:03.1
Sep 22 15:06:37 K2I-XCPNG-01 kernel: [ 189.760375] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:37 K2I-XCPNG-01 kernel: [ 189.760387] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:37 K2I-XCPNG-01 kernel: [ 189.760392] pcieport 0000:00:03.1: [ 6] BadTLP
Sep 22 15:06:43 K2I-XCPNG-01 kernel: [ 195.289634] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:03.1
Sep 22 15:06:43 K2I-XCPNG-01 kernel: [ 195.289642] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 22 15:06:43 K2I-XCPNG-01 kernel: [ 195.289652] pcieport 0000:00:03.1: device [1022:1483] error status/mask=00000040/00004000
Sep 22 15:06:43 K2I-XCPNG-01 kernel: [ 195.289658] pcieport 0000:00:03.1: [ 6] BadTLP
- Your firmware seems really buggy. I'd check for updates.
- I'd try disabling C-states in the BIOS.
- I'd look into those pcieport errors and check the devices connected to the PCIe ports.
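To get a feel for how frequent those pcieport errors are and which port they come from, something like the following Python sketch could tally them. It only matches the "error status/mask" line format shown in the excerpt above and decodes the status value using the correctable-error bit names the Linux kernel prints (bit 6 is the BadTLP seen here). The function names are illustrative, and the log path in the usage note is an assumption.

```python
import re
from collections import Counter

# Matches the "error status/mask" line the kernel prints for each AER event.
STATUS_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"device \[(?P<ids>[0-9a-f]{4}:[0-9a-f]{4})\] "
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)

# Bit positions in the PCIe AER Correctable Error Status register,
# with the short names Linux uses in its log output.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_status(status):
    """Return the names of all correctable-error bits set in `status`."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def count_aer_errors(lines):
    """Count corrected AER errors per (port, error name) pair."""
    counts = Counter()
    for line in lines:
        # finditer copes with several log entries pasted onto one line.
        for m in STATUS_RE.finditer(line):
            for name in decode_status(int(m.group("status"), 16)):
                counts[(m.group("port"), name)] += 1
    return counts
```

For example, `count_aer_errors(open("/var/log/kern.log.1"))` (path assumed) would give one count per port and error type. If everything lands on 0000:00:03.1, as in the excerpt, then the card or slot behind that port is the one to reseat or swap first; `lspci -t` shows which device sits behind a given port.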
-
@stormi
Alright... That's clear, and thanks for helping me analyse this. I'm very new to the whole XCP-ng environment (and Linux in general).
/var/crash does not have anything in it.
What I mean by lock up, is that it indeed stops responding to anything.
No more SSH, router doesn't show system in devices anymore, none of the VMs respond, no display on the screen attached to the machine, no input recognised from the keyboard.
The power LED is still on and the fans are whirring, so the machine stays powered on in any case.
I'll check in the BIOS for the C-states at least. The BIOS firmware itself is the latest version, so if that doesn't help I guess I'm out of luck for now...
-
@janvanhumbeek how long did you run Memtest for? Can you run it for the same length of time (1 to 3 days) to see if it locks up? Check for BIOS updates. The logs show C-state errors, so it looks like power management is not working properly; you may need a different kernel.