Host Crash Once In A Long While
-
I have a host that crashes occasionally, roughly twice a year, and I'm trying to diagnose why.
My logs are below, truncated since they are way too long to post here in full.
xen.log:
0x8
(XEN) [270483.449769] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [270598.526533] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [270685.029195] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271110.558952] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271309.499968] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271340.748832] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [271349.428606] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [272600.756776] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [273163.116684] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, fault address = 0xfffffffdf8000000, flags = 0x8
(XEN) [277052.852717] Uhhuh. NMI received for unknown reason 31.
(XEN) [277052.852719] Do you have a strange power saving mode enabled?
(XEN) [277052.852722] ----[ Xen-4.13.4-9.21.2 x86_64 debug=n Not tainted ]----
(XEN) [277052.852723] CPU: 0
(XEN) [277052.852725] RIP: e008:[<ffff82d0802d9d98>] arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0
(XEN) [277052.852731] RFLAGS: 0000000000000246 CONTEXT: hypervisor
(XEN) [277052.852734] rax: 0000000000000000 rbx: ffff83107bcafc78 rcx: 0000000000000048
(XEN) [277052.852736] rdx: 0000000000000000 rsi: ffff83007be8ffff rdi: ffff83107bcafc78
(XEN) [277052.852737] rbp: ffff83107bcafc00 rsp: ffff83007be8fe68 r8: ffff83007be8fef8
(XEN) [277052.852739] r9: 0000000000000002 r10: 0000fbfaa06ecc95 r11: 0000fbfa820743ef
(XEN) [277052.852740] r12: 0000fbfa64d405b7 r13: ffff83107bcafc30 r14: ffff82d080597270
(XEN) [277052.852742] r15: ffff82d0805bc300 cr0: 000000008005003b cr4: 00000000003506e0
(XEN) [277052.852743] cr3: 00000010448f3000 cr2: 00007ffcfeacafc8
(XEN) [277052.852744] fsb: 0000000000000000 gsb: ffff8aa47c440000 gss: 0000000000000000
(XEN) [277052.852746] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [277052.852749] Xen code around <ffff82d0802d9d98> (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0):
(XEN) [277052.852750] 66 90 0f 1f 40 00 fb f4 <0f> b6 46 f5 41 80 a0 fe 00 00 00 fe 66 90 fa c3
(XEN) [277052.852754] Xen stack trace from rsp=ffff83007be8fe68:
(XEN) [277052.852755] ffff82d0802da28a 0000000000000000 0000000000000000 0000000000000000
(XEN) [277052.852757] ffff82d080597270 ffff82d0805bc300 ffff82d08059db00 ffff8310447ca000
(XEN) [277052.852759] ffff82d08059db00 ffff8310447ca000 0000000000000000 0000000000000000
(XEN) [277052.852761] ffff82d080278b0c ffff82d080278a40 ffff8310447ca000 ffff83107bcb1000
(XEN) [277052.852763] 00000000ffffffff ffff831044926000 0000000000000000 0000000000000000
(XEN) [277052.852764] 0000000000000000 0000000000000000 0000000000000001 0000000000000001
(XEN) [277052.852766] 0000fbeb7bad0fb6 0000000000000000 0000000000000000 000000003359c3d1
(XEN) [277052.852767] ffffffff92d3a0f0 ffffffff9364f310 0000000006568a76 ffffffff9364af78
(XEN) [277052.852769] 0000000000000001 0000000000000000 ffffffff92d3a4be 0000000000000000
(XEN) [277052.852770] 0000000000000246 ffffb50d80393ea8 0000000000000000 7bdcdc407be8ffe0
(XEN) [277052.852772] 7bdcdcc30009bf75 7bdcddb700000000 7bdcd9667be8ffe0 0000e01000000000
(XEN) [277052.852774] ffff83107bcb0000 0000000000000000 00000000003506e0 0000000000000000
(XEN) [277052.852776] 0000000000000000 7b01d30000000000 7bdce8300009bf00
(XEN) [277052.852777] Xen call trace:
(XEN) [277052.852779] [<ffff82d0802d9d98>] R arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0
(XEN) [277052.852782] [<ffff82d0802da28a>] S arch/x86/acpi/cpu_idle.c#acpi_processor_idle+0x36a/0x630
(XEN) [277052.852785] [<ffff82d080278b0c>] S arch/x86/domain.c#idle_loop+0xcc/0xf0
(XEN) [277052.852786] [<ffff82d080278a40>] S arch/x86/domain.c#idle_loop+0/0xf0
(XEN) [277052.852787]
(XEN) [277052.852789]
(XEN) [277052.852789] ****************************************
(XEN) [277052.852790] Panic on CPU 0:
(XEN) [277052.852791] FATAL TRAP: vector = 2 (nmi)
(XEN) [277052.852792] [error_code=0000]
(XEN) [277052.852793] ****************************************
(XEN) [277052.852793]
(XEN) [277052.852794] Reboot in five seconds...
(XEN) [277052.852796] Executing kexec image on cpu0
(XEN) [277052.853813] Shot down all CPUs
dom0.log:
[ 57.411835] ERR: CIFS VFS: Send error in SessSetup = -13
[ 57.411857] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 58.620725] INFO: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[ 59.866512] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 59.866518] ERR: CIFS VFS: Send error in SessSetup = -13
[ 59.866528] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 61.063207] INFO: block tda: sector-size: 512/512 capacity: 41943040
[ 61.611203] INFO: device vif1.0 entered promiscuous mode
[ 61.673280] INFO: tun: Universal TUN/TAP device driver, 1.6
[ 61.882361] INFO: device tap1.0 entered promiscuous mode
[ 74.108493] INFO: device tap1.0 left promiscuous mode
[ 75.298831] INFO: vif vif-1-0 vif1.0: Guest Rx ready
[ 1016.041203] INFO: device xapi0 entered promiscuous mode
[ 1016.965567] INFO: block tdb: sector-size: 512/512 capacity: 419430400
[ 1017.489143] INFO: device vif2.0 entered promiscuous mode
[ 1017.757125] INFO: device tap2.0 entered promiscuous mode
[ 1023.605631] INFO: device tap2.0 left promiscuous mode
[ 1025.834352] INFO: vif vif-2-0 vif2.0: Guest Rx ready
[ 31340.135234] INFO: md: data-check of RAID array md127
[ 39191.770925] INFO: md: md127: data-check done.
[ 54858.770670] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 131654.041594] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 141852.456853] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 159158.312670] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 159158.312677] ERR: CIFS VFS: Send error in SessSetup = -13
[ 159158.312687] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 159207.204011] NOTICE: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[ 159207.204018] ERR: CIFS VFS: Send error in SessSetup = -13
[ 159207.204027] ERR: CIFS VFS: cifs_mount failed w/return code = -13
[ 159357.625040] ERR: CIFS VFS: Error connecting to socket. Aborting operation.
[ 159357.625051] ERR: CIFS VFS: cifs_mount failed w/return code = -111
[ 159357.658035] ERR: CIFS VFS: Error connecting to socket. Aborting operation.
[ 159357.658044] ERR: CIFS VFS: cifs_mount failed w/return code = -111
[ 161050.954560] INFO: device xapi5 entered promiscuous mode
[ 161082.918747] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161083.284405] INFO: block tdd: sector-size: 512/512 capacity: 10869244
[ 161083.802559] INFO: device vif3.0 entered promiscuous mode
[ 161084.088086] INFO: device tap3.0 entered promiscuous mode
[ 161109.344679] INFO: device tap3.0 left promiscuous mode
[ 161109.986838] INFO: device vif3.0 left promiscuous mode
[ 161124.631349] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161124.987941] INFO: block tdd: sector-size: 512/512 capacity: 10869244
[ 161125.506087] INFO: device vif4.0 entered promiscuous mode
[ 161125.787692] INFO: device tap4.0 entered promiscuous mode
[ 161233.296903] INFO: device tap4.0 left promiscuous mode
[ 161234.069009] INFO: device vif4.0 left promiscuous mode
[ 161250.012788] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 161250.433289] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 161250.957429] INFO: device vif5.0 entered promiscuous mode
[ 161251.233068] INFO: device tap5.0 entered promiscuous mode
[ 162259.178729] INFO: device tap5.0 left promiscuous mode
[ 162259.920375] INFO: device vif5.0 left promiscuous mode
[ 162261.427391] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162261.829462] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162262.339599] INFO: device vif6.0 entered promiscuous mode
[ 162262.616402] INFO: device tap6.0 entered promiscuous mode
[ 162407.932290] INFO: device tap6.0 left promiscuous mode
[ 162408.623844] INFO: device vif6.0 left promiscuous mode
[ 162410.130951] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162410.523163] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162411.028504] INFO: device vif7.0 entered promiscuous mode
[ 162411.308947] INFO: device tap7.0 entered promiscuous mode
[ 162821.183555] INFO: device tap7.0 left promiscuous mode
[ 162821.967356] INFO: device vif7.0 left promiscuous mode
[ 162823.469144] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 162823.881822] INFO: block tdd: sector-size: 512/512 capacity: 9568512
[ 162824.366571] INFO: device vif8.0 entered promiscuous mode
[ 162824.638473] INFO: device tap8.0 entered promiscuous mode
[ 163337.236773] INFO: device tap8.0 left promiscuous mode
[ 163337.925312] INFO: device vif8.0 left promiscuous mode
[ 163339.313977] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163339.869661] INFO: device vif9.0 entered promiscuous mode
[ 163340.143727] INFO: device tap9.0 entered promiscuous mode
[ 163345.656310] INFO: device tap9.0 left promiscuous mode
[ 163347.318717] INFO: vif vif-9-0 vif9.0: Guest Rx ready
[ 163356.558725] INFO: vif vif-9-0 vif9.0: Guest Rx ready
[ 163386.326178] INFO: device vif9.0 left promiscuous mode
[ 163407.871601] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163408.424979] INFO: device vif10.0 entered promiscuous mode
[ 163408.705660] INFO: device tap10.0 entered promiscuous mode
[ 163417.631824] INFO: device tap10.0 left promiscuous mode
[ 163419.199561] INFO: vif vif-10-0 vif10.0: Guest Rx ready
[ 163624.199308] INFO: device vif10.0 left promiscuous mode
[ 163625.565311] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 163626.129824] INFO: device vif11.0 entered promiscuous mode
[ 163626.403920] INFO: device tap11.0 entered promiscuous mode
[ 163638.713201] INFO: vif vif-11-0 vif11.0: Guest Rx ready
[ 164717.561320] INFO: device tap11.0 left promiscuous mode
[ 164718.459194] INFO: device vif11.0 left promiscuous mode
[ 164719.892533] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 164720.509181] INFO: device vif12.0 entered promiscuous mode
[ 164720.783963] INFO: device tap12.0 entered promiscuous mode
[ 164730.684546] INFO: device tap12.0 left promiscuous mode
[ 164732.743130] INFO: vif vif-12-0 vif12.0: Guest Rx ready
[ 169150.464914] INFO: device vif12.0 left promiscuous mode
[ 169151.874585] INFO: block tdc: sector-size: 512/512 capacity: 67108864
[ 169152.465030] INFO: device vif13.0 entered promiscuous mode
[ 169152.742607] INFO: device tap13.0 entered promiscuous mode
[ 169159.235091] INFO: device tap13.0 left promiscuous mode
[ 169161.039490] INFO: vif vif-13-0 vif13.0: Guest Rx ready
[ 203049.047874] ERR: CIFS VFS: Server 10.5.10.50 has not responded in 120 seconds. Reconnecting...
[ 213010.741405] INFO: pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
[ 213010.741413] ERR: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 213010.741425] ERR: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 213010.741431] ERR: pcieport 0000:00:01.1: [ 6] BadTLP
-
-
@planedrop
Take a look at this thread. Lots of reading there. There are many other results on Google if you search for it.
Bottom line, looks like a hardware issue.
-
@planedrop If you boot Xen with nmi=dom0, NMIs will be forwarded to dom0 rather than being treated as fatal.
Could you also get lspci -tv for this system? The IO_PAGE_FAULT is for a different device to the one reporting an AER BadTLP in dom0 and has a wildly bogus address, so we need to figure out if the two errors are related or independent.
BadTLP is a problem, usually indicative of an electrical contact issue in the slot. Whatever is downstream of 00:01.1 wants unplugging, dusting out thoroughly, then confirming that it's adequately reseated.
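For anyone following along, here is a rough Python sketch of how the two identifiers in the logs can be decoded by hand. It assumes the usual PCI layout for the 16-bit device id (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and the standard bit positions of the PCIe AER correctable error status register, using the short names Linux prints. The function names and the regular expression are made up for illustration and are not part of Xen or any tool.

```python
import re

# Hypothetical helper names; only the bit layouts are standard.
FAULT_RE = re.compile(r"IO_PAGE_FAULT: domain = \d+, device id = (0x[0-9a-fA-F]+)")

# Correctable error status bits, with the short names Linux uses in dmesg.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_device_id(device_id: int) -> str:
    """Turn a 16-bit PCI requester id into bus:device.function notation."""
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    fn = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn:x}"


def faulting_devices(log_text: str) -> set:
    """Collect the distinct devices named in AMD-Vi IO_PAGE_FAULT lines."""
    return {decode_device_id(int(m.group(1), 16)) for m in FAULT_RE.finditer(log_text)}


def decode_aer_correctable(value: int) -> list:
    """List the names of the bits set in an AER correctable status or mask value."""
    return [name for bit, name in sorted(AER_CORRECTABLE_BITS.items()) if value & (1 << bit)]


sample = (
    "(XEN) [270483.449769] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x101, "
    "fault address = 0xfffffffdf8000000, flags = 0x8"
)
print(faulting_devices(sample))        # {'01:00.1'}
print(decode_aer_correctable(0x0040))  # status from dom0.log: ['BadTLP']
print(decode_aer_correctable(0x6000))  # mask from dom0.log: ['NonFatalErr', 'CorrIntErr']
```

So device id 0x101 should correspond to 01:00.1, which is a different function from the 00:01.1 port that logged the BadTLP. Whether the two are related still depends on the bus topology, which is what lspci -tv is for.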
-
I like to use "CRC 05103 QD Electronic Cleaner"
-
@andyhhp Sorry again for slow replies, thanks for the help here!
This is the output of lspci -tv:
[13:23 xcp-ng ~]# lspci -tv
-+-[0000:40]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
 |           +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
 |           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.1-[41]--+-00.0 NVIDIA Corporation GM206 [GeForce GTX 960]
 |           |            \-00.1 NVIDIA Corporation GM206 High Definition Audio Controller
 |           +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.1-[42]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
 |           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
 |           |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
 |           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           \-08.1-[43]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
 |                        \-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
             +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
             +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-01.1-[01-07]--+-00.0 Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller
             |               +-00.1 Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller
             |               \-00.2-[02-07]--+-00.0-[03]----00.0 Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
             |                               +-04.0-[04]----00.0 Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0 Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak]
             |                               +-06.0-[06]----00.0 Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-03.1-[08]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
             |            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
             +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-07.1-[09]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
             |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
             |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
             +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
             +-08.1-[0a]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
             |            +-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
             +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
             +-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
             +-19.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
             +-19.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
             +-19.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
             +-19.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
             +-19.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
             +-19.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
             +-19.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
             \-19.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
I would also note that this system was known to be 100% functional before being set up with XCP-ng, so it's possible something needs reseating, but I think an actual hardware failure is unlikely. It's not impossible, since I did transplant it into a new case (it was originally my desktop), but I'm quite careful and very experienced with that sort of thing.
This upcoming weekend, assuming I have time, maybe I will give it a real good cleaning and reseat everything, then attempt GPU passthrough again.
-
@Andrew Good suggestion, might as well pick some up, thanks!