Diagnosing frequent crashes on host
-
I'm running XCP-NG 8.2.1 on a Minisforum-01. It's been generally solid for some time (with the exception of a single VM, which I still haven't figured out), but in the last few weeks, the entire host has been crashing very frequently, sometimes once a day or more. It restarts itself smoothly, but this is getting ridiculous.
How do I diagnose what's going on? I've looked at /var/crash, but there's so much stuff there I don't know where to start, and randomly looking through some of those logs doesn't show anything that I can make sense of, at least. And /var/log/kern.log doesn't show anything that is helpful to me.
Where do I go to figure this out?
-
Hi,
First things first: memtest. Always check the hardware before anything else.
-
@olivierlambert
Thanks. I ran two full passes of memtest, and it's totally clean. I don't know what I'm looking for, but I looked through some random folders in /var/crash, and the last bit of the most recent xen.log was:
(XEN) [19617.315490] CPU4: Temperature/speed normal
(XEN) [19617.315492] CPU5: Temperature/speed normal
(XEN) [19622.518692] CPU5: Temperature above threshold
(XEN) [19622.518693] CPU4: Temperature above threshold
(XEN) [19622.518694] CPU5: Running in modulated clock mode
(XEN) [19622.518695] CPU4: Running in modulated clock mode
(XEN) [19707.409611] Executing kexec image on cpu5
(XEN) [19707.410615] Shot down all CPUs
There's no reason why the CPU should be running hot; this was a largely unloaded system, but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
-
- Is this system on the older side?
- Is there adequate ventilation/fans on this system?
These small units often have little to no ventilation, so even mild usage could cause the CPU to overheat.
If the unit is an older model, it's possible the CPU thermal paste is old and cruddy and needs to be replaced.
HTH
-
@DustinB
It's less than a year old. I don't know how to evaluate whether the ventilation is "adequate"; the existing ventilation is exposed (i.e. not covered up), and it's inside a cabinet that is itself ventilated, but it's not in a freezingly air-conditioned space. I have tried to install lm_sensors, but it doesn't detect any sensors, so I'm not sure how to tell if it's regularly overheating.
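Since lm_sensors finds nothing, one rough way to judge whether overheating is a regular event is to count the thermal messages Xen itself logged. The following is only a sketch: it assumes the log lines look like the xen.log excerpt quoted earlier in this thread, and the function names are made up for illustration. It only reports what Xen printed, not actual temperatures.

```python
import re
import sys
from collections import Counter

# Matches Xen console lines such as
# "(XEN) [19622.518692] CPU5: Temperature above threshold"
# (message wording taken from the excerpt posted in this thread).
THERMAL_RE = re.compile(
    r"\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] CPU(?P<cpu>\d+): "
    r"(?P<event>Temperature above threshold"
    r"|Temperature/speed normal"
    r"|Running in modulated clock mode)"
)


def thermal_events(text):
    """Return (timestamp, cpu, event) tuples for every thermal line in the text."""
    return [
        (float(m["ts"]), int(m["cpu"]), m["event"])
        for m in THERMAL_RE.finditer(text)
    ]


def over_threshold_counts(text):
    """Count, per CPU number, how often 'Temperature above threshold' was logged."""
    counts = Counter()
    for _, cpu, event in thermal_events(text):
        if event == "Temperature above threshold":
            counts[cpu] += 1
    return dict(counts)


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path is supplied by the user, e.g. a xen.log from a /var/crash folder.
    with open(sys.argv[1], errors="replace") as fh:
        print(over_threshold_counts(fh.read()))
```

Running it over the xen.log from several crash folders would show whether the threshold warnings appear before every crash or only occasionally.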
-
@the_jest If the unit is in a passively ventilated cabinet, that may not be enough for it.
As a silly ask, do you happen to have any large CPU coolers that you could just set on top of this unit to act as a larger heatsink?
Opening the cabinet door may be enough to fix the issue as well with passive cooling.
-
@DustinB
Sorry, I should have been more clear. The cabinet is not sealed, i.e. it's got a loose grating on the entire front, and in addition to that, it is actively ventilated, with internal fans in the back of the cabinet pulling air through. And the box itself is not covered up in any way. I do not have any additional coolers to use.
-
@the_jest Ah, my misunderstanding. That sounds like the cabinet should be fine for day-to-day operation, then. You might have to look at the components within this unit (the CPU and the cooler itself).
I'm assuming this was a system that was "bought off a shelf" and not assembled by you. CPU heat issues can be annoying to deal with, but they are pretty simple to fix.
Open the case, remove the heatsink, clean it up with some isopropyl alcohol (80% or higher), apply some new thermal paste, and go from there.
-
@the_jest I agree that sudden crashes probably mean hardware failure. Could you post the crash logs?
-
@the_jest said in Diagnosing frequent crashes on host:
but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.
@the_jest said in Diagnosing frequent crashes on host:
I've looked at /var/crash, but there's so much stuff there I don't know where to start,
The snippet of xen.log you've posted suggests it's a linux kernel crash, so look at dom0.log, and right at the end.
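To avoid digging through every folder by hand, a small script can pick the newest crash folder and print the end of its dom0.log. This is a sketch under assumptions: it assumes each crash is stored in its own subdirectory of /var/crash containing a file named dom0.log (as the posts here suggest), and the helper names are hypothetical.

```python
from pathlib import Path


def newest_crash_dir(root="/var/crash"):
    """Return the most recently modified subdirectory of root, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    dirs = [p for p in root.iterdir() if p.is_dir()]
    return max(dirs, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, n=60):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, errors="replace") as fh:
        return fh.readlines()[-n:]


if __name__ == "__main__":
    crash_dir = newest_crash_dir()
    # Assumption: the dom0 log sits directly inside the crash folder.
    log = crash_dir / "dom0.log" if crash_dir else None
    if log and log.is_file():
        print(f"== last lines of {log} ==")
        print("".join(tail(log)), end="")
    else:
        print("no dom0.log found in the newest crash directory")
```

The interesting part is normally the final block of lines sharing roughly the same timestamp, ending in the panic message.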
-
@andyhhp said in Diagnosing frequent crashes on host:
@the_jest said in Diagnosing frequent crashes on host:
but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.
My apologies!
@andyhhp said in Diagnosing frequent crashes on host:
The snippet of xen.log you've posted suggests it's a linux kernel crash, so look at dom0.log, and right at the end.
The last consecutive block of messages (timewise, i.e. the part of this log from the same millisecond to the end of this log) is:
[ 19701.650235] WARN: Call Trace:
[ 19701.650238] WARN: <IRQ>
[ 19701.650241] WARN: xen_evtchn_do_upcall+0x27/0x50
[ 19701.650245] WARN: xen_do_hypervisor_callback+0x29/0x40
[ 19701.650248] WARN: </IRQ>
[ 19701.650250] WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[ 19701.650253] WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 19701.650259] WARN: RSP: e02b:ffffc900400e7eb0 EFLAGS: 00000246
[ 19701.650261] WARN: RAX: 0000000000000000 RBX: ffff8881db639d00 RCX: ffffffff810013aa
[ 19701.650264] WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[ 19701.650267] WARN: RBP: 0000000000000004 R08: 0000000000000008 R09: 000011eef4cd8cc2
[ 19701.650269] WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
[ 19701.650272] WARN: R13: 0000000000000000 R14: ffff8881db639d00 R15: ffff8881db639d00
[ 19701.650276] WARN: ? xen_hypercall_sched_op+0xa/0x20
[ 19701.650279] WARN: ? xen_safe_halt+0xc/0x20
[ 19701.650282] WARN: ? default_idle+0x1a/0x140
[ 19701.650284] WARN: ? do_idle+0x1ea/0x260
[ 19701.650287] WARN: ? cpu_startup_entry+0x6f/0x80
[ 19701.650289] WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc nls_iso8859_1 nls_cp437 intel_powerclamp crct10dif_pclmul vfat crc32_pclmul ghash_clmulni_intel fat pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper video backlight ip_tables x_tables hid_generic usbhid hid xhci_pci nvme igc(O) xhci_hcd i40e(O) nvme_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[ 19701.650330] WARN: ---[ end trace 79b40169d24b8e01 ]---
[ 19701.650333] WARN: RIP: e030:__xen_evtchn_do_upcall+0x82/0x90
[ 19701.650335] WARN: Code: 66 90 f6 c4 02 75 23 80 3b 00 75 d7 65 ff 05 85 89 ba 7e 48 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 09 48 83 c4 18 5b 5d c3 <0f> 0b e8 77 aa bf ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e9 66 ff
[ 19701.650341] WARN: RSP: e02b:ffff8881dc503fb8 EFLAGS: 00010002
[ 19701.650344] WARN: RAX: 0000000000000000 RBX: ffff8881dc514100 RCX: 000000008518dd93
[ 19701.650346] WARN: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881db6aa800
[ 19701.650349] WARN: RBP: 0000000000000004 R08: 00000000000035c6 R09: ffff8881db003210
[ 19701.650352] WARN: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 19701.650355] WARN: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 19701.650361] WARN: FS: 0000000000000000(0000) GS:ffff8881dc500000(0000) knlGS:0000000000000000
[ 19701.650364] WARN: CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 19701.650367] WARN: CR2: 00007fffc3789c78 CR3: 00000001d81ca000 CR4: 0000000000040660
[ 19701.650371] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
Thank you for looking this over.
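When comparing several crashes, it can help to reduce each dom0.log to its panic reason and the symbols named on the RIP lines, so it is easy to see whether every crash ends in the same place. A minimal sketch, assuming the log uses the same "RIP: segment:symbol+offset/size" and "Kernel panic - not syncing: ..." wording as the trace above; `summarize_panic` is a hypothetical helper name.

```python
import re

# "Kernel panic - not syncing: <reason>"; stop at a newline or at the next
# "[ timestamp]" in case the log lines have been run together.
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>[^\[\n]+)")

# "RIP: e030:__xen_evtchn_do_upcall+0x82/0x90" -> "__xen_evtchn_do_upcall"
RIP_RE = re.compile(
    r"RIP: [0-9a-f]+:(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_panic(text):
    """Return the panic reason (or None) and all RIP symbols, in log order."""
    panic = PANIC_RE.search(text)
    return {
        "panic": panic["reason"].strip() if panic else None,
        "rip_symbols": [m["symbol"] for m in RIP_RE.finditer(text)],
    }
```

Feeding it the tail of each crash folder's dom0.log and comparing the results would show whether the same function is involved every time, which is useful information to include in a bug report.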
-
@the_jest Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?
-
@andyhhp said in Diagnosing frequent crashes on host:
Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?
I haven't updated the kernel, or the installation in general, since I first installed it, almost a year ago. It's never been rock-solid, but in recent weeks it's been very much worse; but there's been no system change in that time.
If I were going to make any kind of change, I assume the sensible thing to do would be to upgrade to 8.3?
I've been getting the impression that people have a low opinion of this particular system, esp. with regard to thermal management. I'm not sure what to do about this; I don't really need this exact hardware but I don't know what the correct replacement should be, if I want to just ditch it.
-
Maybe the usage is slightly different from when it was "more solid", and now the bug is triggered more easily. Is your XCP-ng fully up to date?
As Andy said, there's a bug in the kernel: the Dom0 crashes, then Xen detects it and restarts the machine.
-
@olivierlambert said in Diagnosing frequent crashes on host:
Maybe the usage is slightly different from when it was "more solid", and now the bug is triggered more easily. Is your XCP-ng fully up to date?
No; as said originally, I'm still on 8.2.1. I have been concerned about moving to 8.3 because it's a new installation, and I don't want to screw it up, but I'm willing to accept that it's the right thing to do.