Host Crash Once In A Long While
-
I have a host that crashes occasionally, roughly twice a year, and I'm trying to diagnose it.
My logs are below, truncated since they are way too long to actually post here.
xen.log:
0x8
(XEN) [270483.449769] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [270598.526533] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [270685.029195] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271110.558952] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271309.499968] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271340.748832] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271349.428606] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [272600.756776] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [273163.116684] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [277052.852717] Uhhuh. NMI received for unknown reason 31.
(XEN) [277052.852719] Do you have a strange power saving mode enabled?
(XEN) [277052.852722] ----[ Xen-4.13.4-9.21.2 x86_64 debug=n Not tainted ]----
(XEN) [277052.852723] CPU: 0
(XEN) [277052.852725] RIP: e008:[<ffff82d0802d9d98>] arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0
(XEN) [277052.852731] RFLAGS: 0000000000000246 CONTEXT: hypervisor
(XEN) [277052.852734] rax: 0000000000000000 rbx: ffff83107bcafc78 rcx: 0000000000000048
(XEN) [277052.852736] rdx: 0000000000000000 rsi: ffff83007be8ffff rdi: ffff83107bcafc78
(XEN) [277052.852737] rbp: ffff83107bcafc00 rsp: ffff83007be8fe68 r8: ffff83007be8fef8
(XEN) [277052.852739] r9: 0000000000000002 r10: 0000fbfaa06ecc95 r11: 0000fbfa820743ef
(XEN) [277052.852740] r12: 0000fbfa64d405b7 r13: ffff83107bcafc30 r14: ffff82d080597270
(XEN) [277052.852742] r15: ffff82d0805bc300 cr0: 000000008005003b cr4: 00000000003506e0
(XEN) [277052.852743] cr3: 00000010448f3000 cr2: 00007ffcfeacafc8
(XEN) [277052.852744] fsb: 0000000000000000 gsb: ffff8aa47c440000 gss: 0000000000000000
(XEN) [277052.852746] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [277052.852749] Xen code around <ffff82d0802d9d98> (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0):
(XEN) [277052.852750] 66 90 0f 1f 40 00 fb f4 <0f> b6 46 f5 41 80 a0 fe 00 00 00 fe 66 90 fa c3
(XEN) [277052.852754] Xen stack trace from rsp=ffff83007be8fe68:
(XEN) [277052.852755] ffff82d0802da28a 0000000000000000 0000000000000000 0000000000000000
(XEN) [277052.852757] ffff82d080597270 ffff82d0805bc300 ffff82d08059db00 ffff8310447ca000
(XEN) [277052.852759] ffff82d08059db00 ffff8310447ca000 0000000000000000 0000000000000000
(XEN) [277052.852761] ffff82d080278b0c ffff82d080278a40 ffff8310447ca000 ffff83107bcb1000
(XEN) [277052.852763] 00000000ffffffff ffff831044926000 0000000000000000 0000000000000000
(XEN) [277052.852764] 0000000000000000 0000000000000000 0000000000000001 0000000000000001
(XEN) [277052.852766] 0000fbeb7bad0fb6 0000000000000000 0000000000000000 000000003359c3d1
(XEN) [277052.852767] ffffffff92d3a0f0 ffffffff9364f310 0000000006568a76 ffffffff9364af78
(XEN) [277052.852769] 0000000000000001 0000000000000000 ffffffff92d3a4be 0000000000000000
(XEN) [277052.852770] 0000000000000246 ffffb50d80393ea8 0000000000000000 7bdcdc407be8ffe0
(XEN) [277052.852772] 7bdcdcc30009bf75 7bdcddb700000000 7bdcd9667be8ffe0 0000e01000000000
(XEN) [277052.852774] ffff83107bcb0000 0000000000000000 00000000003506e0 0000000000000000
(XEN) [277052.852776] 0000000000000000 7b01d30000000000 7bdce8300009bf00
(XEN) [277052.852777] Xen call trace:
(XEN) [277052.852779] [<ffff82d0802d9d98>] R arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0
(XEN) [277052.852782] [<ffff82d0802da28a>] S arch/x86/acpi/cpu_idle.c#acpi_processor_idle+0x36a/0x630
(XEN) [277052.852785] [<ffff82d080278b0c>] S arch/x86/domain.c#idle_loop+0xcc/0xf0
(XEN) [277052.852786] [<ffff82d080278a40>] S arch/x86/domain.c#idle_loop+0/0xf0
(XEN) [277052.852787]
(XEN) [277052.852789]
(XEN) [277052.852789] ****************************************
(XEN) [277052.852790] Panic on CPU 0:
(XEN) [277052.852791] FATAL TRAP: vector = 2 (nmi)
(XEN) [277052.852792] [error_code=0000]
(XEN) [277052.852793] ****************************************
(XEN) [277052.852793]
(XEN) [277052.852794] Reboot in five seconds...
(XEN) [277052.852796] Executing kexec image on cpu0
(XEN) [277052.853813] Shot down all CPUs
dom0.log:
[ 57.411835] ERR: CIFS VFS: Send error in SessSetup = -13
[ 57.411857] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 58.620725] INFO: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[ 59.866512] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 59.866518] ERR: CIFS VFS: Send error in SessSetup = -13
[ 59.866528] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 61.063207] INFO: block tda: sector-size: 512/512 capacity: 41943040
[ 61.611203] INFO: device vif1.0 entered promiscuous mode
[ 61.673280] INFO: tun: Universal TUN/TAP device driver, 1.6
[ 61.882361] INFO: device tap1.0 entered promiscuous mode
[ 74.108493] INFO: device tap1.0 left promiscuous mode
[ 75.298831] INFO: vif vif-1-0 vif1.0: Guest Rx ready
[ 1016.041203] INFO: device xapi0 entered promiscuous mode
[ 1016.965567] INFO: block tdb: sector-size: 512/512 capacity: 419430400
[ 1017.489143] INFO: device vif2.0 entered promiscuous mode
[ 1017.757125] INFO: device tap2.0 entered promiscuous mode
[ 1023.605631] INFO: device tap2.0 left promiscuous mode
[ 1025.834352] INFO: vif vif-2-0 vif2.0: Guest Rx ready
[ 31340.135234] INFO: md: data-check of RAID array md127
[ 39191.770925] INFO: md: md127: data-check done.
[ 54858.770670] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 131654.041594] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 141852.456853] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 159158.312670] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 159158.312677] ERR: CIFS VFS: Send error in SessSetup = -13
[ 159158.312687] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 159207.204011] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 159207.204018] ERR: CIFS VFS: Send error in SessSetup = -13
[ 159207.204027] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 159357.625040] ERR: CIFS VFS: Error connecting to socket. Aborting operation.
[ 159357.625051] ERR: CIFS VFS: cifs_mount failed w/return code = -111
[ 159357.658035] ERR: CIFS VFS: Error connecting to socket. Aborting operation.
[ 159357.658044] ERR: CIFS VFS: cifs_mount failed w/return code = -111
[ 161050.954560] INFO: device xapi5 entered promiscuous mode
[ 161082.918747] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161083.284405] INFO: block tdd: sector-size: 512/512 capacity: 10869244
[ 161083.802559] INFO: device vif3.0 entered promiscuous mode
[ 161084.088086] INFO: device tap3.0 entered promiscuous mode
[ 161109.344679] INFO: device tap3.0 left promiscuous mode
[ 161109.986838] INFO: device vif3.0 left promiscuous mode
[ 161124.631349] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161124.987941] INFO: block tdd: sector-size: 512/512 capacity: 10869244
[ 161125.506087] INFO: device vif4.0 entered promiscuous mode
[ 161125.787692] INFO: device tap4.0 entered promiscuous mode
[ 161233.296903] INFO: device tap4.0 left promiscuous mode
[ 161234.069009] INFO: device vif4.0 left promiscuous mode
[ 161250.012788] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161250.433289] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 161250.957429] INFO: device vif5.0 entered promiscuous mode
[ 161251.233068] INFO: device tap5.0 entered promiscuous mode
[ 162259.178729] INFO: device tap5.0 left promiscuous mode
[ 162259.920375] INFO: device vif5.0 left promiscuous mode
[ 162261.427391] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162261.829462] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162262.339599] INFO: device vif6.0 entered promiscuous mode
[ 162262.616402] INFO: device tap6.0 entered promiscuous mode
[ 162407.932290] INFO: device tap6.0 left promiscuous mode
[ 162408.623844] INFO: device vif6.0 left promiscuous mode
[ 162410.130951] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162410.523163] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162411.028504] INFO: device vif7.0 entered promiscuous mode
[ 162411.308947] INFO: device tap7.0 entered promiscuous mode
[ 162821.183555] INFO: device tap7.0 left promiscuous mode
[ 162821.967356] INFO: device vif7.0 left promiscuous mode
[ 162823.469144] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162823.881822] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162824.366571] INFO: device vif8.0 entered promiscuous mode
[ 162824.638473] INFO: device tap8.0 entered promiscuous mode
[ 163337.236773] INFO: device tap8.0 left promiscuous mode
[ 163337.925312] INFO: device vif8.0 left promiscuous mode
[ 163339.313977] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163339.869661] INFO: device vif9.0 entered promiscuous mode
[ 163340.143727] INFO: device tap9.0 entered promiscuous mode
[ 163345.656310] INFO: device tap9.0 left promiscuous mode
[ 163347.318717] INFO: vif vif-9-0 vif9.0: Guest Rx ready
[ 163356.558725] INFO: vif vif-9-0 vif9.0: Guest Rx ready
[ 163386.326178] INFO: device vif9.0 left promiscuous mode
[ 163407.871601] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163408.424979] INFO: device vif10.0 entered promiscuous mode
[ 163408.705660] INFO: device tap10.0 entered promiscuous mode
[ 163417.631824] INFO: device tap10.0 left promiscuous mode
[ 163419.199561] INFO: vif vif-10-0 vif10.0: Guest Rx ready
[ 163624.199308] INFO: device vif10.0 left promiscuous mode
[ 163625.565311] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163626.129824] INFO: device vif11.0 entered promiscuous mode
[ 163626.403920] INFO: device tap11.0 entered promiscuous mode
[ 163638.713201] INFO: vif vif-11-0 vif11.0: Guest Rx ready
[ 164717.561320] INFO: device tap11.0 left promiscuous mode
[ 164718.459194] INFO: device vif11.0 left promiscuous mode
[ 164719.892533] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 164720.509181] INFO: device vif12.0 entered promiscuous mode
[ 164720.783963] INFO: device tap12.0 entered promiscuous mode
[ 164730.684546] INFO: device tap12.0 left promiscuous mode
[ 164732.743130] INFO: vif vif-12-0 vif12.0: Guest Rx ready
[ 169150.464914] INFO: device vif12.0 left promiscuous mode
[ 169151.874585] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 169152.465030] INFO: device vif13.0 entered promiscuous mode
[ 169152.742607] INFO: device tap13.0 entered promiscuous mode
[ 169159.235091] INFO: device tap13.0 left promiscuous mode
[ 169161.039490] INFO: vif vif-13-0 vif13.0: Guest Rx ready
[ 203049.047874] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 213010.741405] INFO: pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
[ 213010.741413] ERR: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 213010.741425] ERR: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 213010.741431] ERR: pcieport 0000:00:01.1: [ 6] BadTLP
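One way to check whether the IO_PAGE_FAULT entries in xen.log all come from the same device and address is to tally them. This is a minimal sketch, assuming the fault lines keep exactly the format shown in the excerpt above; `FAULT_RE` and `summarize_faults` are hypothetical names for illustration, not part of any Xen tooling.

```python
import re
from collections import Counter

# Matches AMD-Vi fault lines in the format seen in the xen.log excerpt.
FAULT_RE = re.compile(
    r"\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] AMD-Vi: IO_PAGE_FAULT: "
    r"domain = (?P<domain>\d+), device id = (?P<dev>0x[0-9a-fA-F]+), "
    r"fault address = (?P<addr>0x[0-9a-fA-F]+), flags = (?P<flags>0x[0-9a-fA-F]+)"
)

def summarize_faults(log_text):
    """Return (Counter keyed by (device id, fault address), sorted timestamps)."""
    counts = Counter()
    stamps = []
    for m in FAULT_RE.finditer(log_text):
        counts[(m.group("dev"), m.group("addr"))] += 1
        stamps.append(float(m.group("ts")))
    return counts, sorted(stamps)

# Two entries copied from the excerpt; in practice, read the whole log file.
sample = (
    "(XEN) [270483.449769] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, "
    "fault address = 0xfffffffdf8000000, flags = 0x8 "
    "(XEN) [270598.526533] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, "
    "fault address = 0xfffffffdf8000000, flags = 0x8"
)
counts, stamps = summarize_faults(sample)
for (dev, addr), n in counts.items():
    print(f"device id {dev}, fault address {addr}: {n} fault(s)")
```

Because it uses `finditer` rather than splitting on newlines, the same function works whether the log is one entry per line or pasted as a single run of text.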
-
-
@planedrop
Take a look at this thread. Lots of reading there. There are many other results on Google if you search for it. Bottom line, it looks like a hardware issue.
-
@planedrop If you boot Xen with nmi=dom0, NMIs will be forwarded to dom0 rather than being treated as fatal.
Could you also get lspci -tv for this system? The IO_PAGE_FAULT is for a different device to the one reporting an AER BadTLP in dom0 and has a wildly bogus address, so we need to figure out if the two errors are related or independent.
BadTLP is a problem, usually indicative of an electrical contact issue in the slot. Whatever is downstream of 00:01.1 wants unplugging, dusting out thoroughly, then confirming that it's adequately reseated.
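To cross-reference the IO_PAGE_FAULT with lspci output, the device id can be turned into a bus:device.function address. A minimal sketch, assuming Xen prints the raw 16-bit PCI requester ID (bus in the top 8 bits, device in the next 5, function in the low 3); `decode_device_id` is a hypothetical helper name, not part of any Xen tooling.

```python
def decode_device_id(device_id):
    """Split a 16-bit PCI requester ID into a bus:device.function string."""
    bus = (device_id >> 8) & 0xFF   # bits 15..8
    dev = (device_id >> 3) & 0x1F   # bits 7..3
    fn = device_id & 0x07           # bits 2..0
    return f"{bus:02x}:{dev:02x}.{fn:x}"

# Device id taken from the IO_PAGE_FAULT lines in xen.log.
print(decode_device_id(0x101))  # 01:00.1
```

Under that assumption the faulting device is 01:00.1, which can then be looked up in the lspci -tv tree.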
-
I like to use "CRC 05103 QD Electronic Cleaner"
-
@andyhhp Sorry again for slow replies, thanks for the help here!
This is the output of lspci -tv:
[13:23 xcp-ng ~]# lspci -tv
-+-[0000:40]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
 |           +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
 |           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.1-[41]--+-00.0 NVIDIA Corporation GM206 [GeForce GTX 960]
 |           |            \-00.1 NVIDIA Corporation GM206 High Definition Audio Controller
 |           +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.1-[42]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
 |           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
 |           |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
 |           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           \-08.1-[43]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
 |                        \-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
             +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
             +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-01.1-[01-07]--+-00.0 Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller
             |               +-00.1 Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller
             |               \-00.2-[02-07]--+-00.0-[03]----00.0 Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
             |                               +-04.0-[04]----00.0 Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0 Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak]
             |                               +-06.0-[06]----00.0 Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-03.1-[08]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
             |            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
             +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-07.1-[09]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
             |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
             |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
             +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-08.1-[0a]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
             |            +-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
             +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
             +-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
             +-19.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-19.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-19.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-19.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-19.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-19.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-19.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
             \-19.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
I would also note that this system was known to be 100% functional before being set up with XCP-ng, so it's possible something needs reseating, but I think an actual hardware failure is unlikely. It's not impossible, since I did transplant it to a new case (it was originally my desktop), but I'm quite careful and very experienced with that sort of thing.
This upcoming weekend, assuming I have time, I will give it a thorough cleaning and reseat everything, then attempt GPU passthrough again.
-
@Andrew Good suggestion, might as well pick some up, thanks!