the_jest

@olivierlambert said in Diagnosing frequent crashes on host:

Maybe the usage is slightly different from when it was "more solid", and now it's triggered more easily. Is your XCP-ng fully up to date?

No; as said originally, I'm still on 8.2.1. I have been concerned about moving to 8.3 because it's a new installation, and I don't want to screw it up, but I'm willing to accept that it's the right thing to do.

the_jest

@andyhhp

@andyhhp said in Diagnosing frequent crashes on host:

Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?

I haven't updated the kernel, or the installation in general, since I first installed it almost a year ago. It's never been rock-solid, but in recent weeks it's been much worse, even though there's been no system change in that time.

If I were going to make any kind of change, I assume the sensible thing to do would be to upgrade to 8.3?

I've been getting the impression that people have a low opinion of this particular system, esp. with regard to thermal management. I'm not sure what to do about this; I don't really need this exact hardware but I don't know what the correct replacement should be, if I want to just ditch it.

the_jest

@andyhhp

@andyhhp said in Diagnosing frequent crashes on host:

@the_jest said in Diagnosing frequent crashes on host:

but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)

Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.

My apologies!

@andyhhp said in Diagnosing frequent crashes on host:

The snippet of xen.log you've posted suggests it's a linux kernel crash, so look at dom0.log, and right at the end.

The last consecutive block of messages (timewise, i.e. the part of this log from the same millisecond to the end of this log) is:

[  19701.650235]   WARN: Call Trace:
[  19701.650238]   WARN:  <IRQ>
[  19701.650241]   WARN:  xen_evtchn_do_upcall+0x27/0x50
[  19701.650245]   WARN:  xen_do_hypervisor_callback+0x29/0x40
[  19701.650248]   WARN:  </IRQ>
[  19701.650250]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[  19701.650253]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[  19701.650259]   WARN: RSP: e02b:ffffc900400e7eb0 EFLAGS: 00000246
[  19701.650261]   WARN: RAX: 0000000000000000 RBX: ffff8881db639d00 RCX: ffffffff810013aa
[  19701.650264]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[  19701.650267]   WARN: RBP: 0000000000000004 R08: 0000000000000008 R09: 000011eef4cd8cc2
[  19701.650269]   WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
[  19701.650272]   WARN: R13: 0000000000000000 R14: ffff8881db639d00 R15: ffff8881db639d00
[  19701.650276]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
[  19701.650279]   WARN:  ? xen_safe_halt+0xc/0x20
[  19701.650282]   WARN:  ? default_idle+0x1a/0x140
[  19701.650284]   WARN:  ? do_idle+0x1ea/0x260
[  19701.650287]   WARN:  ? cpu_startup_entry+0x6f/0x80
[  19701.650289]   WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc nls_iso8859_1 nls_cp437 intel_powerclamp crct10dif_pclmul vfat crc32_pclmul ghash_clmulni_intel fat pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper video backlight ip_tables x_tables hid_generic usbhid hid xhci_pci nvme igc(O) xhci_hcd i40e(O) nvme_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[  19701.650330]   WARN: ---[ end trace 79b40169d24b8e01 ]---
[  19701.650333]   WARN: RIP: e030:__xen_evtchn_do_upcall+0x82/0x90
[  19701.650335]   WARN: Code: 66 90 f6 c4 02 75 23 80 3b 00 75 d7 65 ff 05 85 89 ba 7e 48 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 09 48 83 c4 18 5b 5d c3 <0f> 0b e8 77 aa bf ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e9 66 ff
[  19701.650341]   WARN: RSP: e02b:ffff8881dc503fb8 EFLAGS: 00010002
[  19701.650344]   WARN: RAX: 0000000000000000 RBX: ffff8881dc514100 RCX: 000000008518dd93
[  19701.650346]   WARN: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881db6aa800
[  19701.650349]   WARN: RBP: 0000000000000004 R08: 00000000000035c6 R09: ffff8881db003210
[  19701.650352]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  19701.650355]   WARN: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  19701.650361]   WARN: FS:  0000000000000000(0000) GS:ffff8881dc500000(0000) knlGS:0000000000000000
[  19701.650364]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
[  19701.650367]   WARN: CR2: 00007fffc3789c78 CR3: 00000001d81ca000 CR4: 0000000000040660
[  19701.650371]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt
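
In case it helps anyone else pull that final block out of dom0.log without scrolling, here is a minimal Python sketch. It assumes every relevant line starts with a bracketed timestamp like the ones above, and it treats "consecutive" as "no more than 1 ms between neighbouring lines"; both the function name `last_block` and the `max_gap` threshold are my own choices for illustration, not anything XCP-ng provides.

```python
import re

# Matches the leading "[  19701.650235]" style timestamp seen in the excerpt.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def last_block(lines, max_gap=0.001):
    """Return the trailing run of timestamped lines whose neighbouring
    timestamps are at most max_gap seconds apart.

    Lines without a leading timestamp are skipped (assumption: they are
    not part of the crash output).
    """
    stamped = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            stamped.append((float(match.group(1)), line.rstrip("\n")))
    if not stamped:
        return []
    # Walk backwards from the last line until a gap larger than max_gap appears.
    start = len(stamped) - 1
    while start > 0 and stamped[start][0] - stamped[start - 1][0] <= max_gap:
        start -= 1
    return [text for _, text in stamped[start:]]
```

To use it, read the dom0.log from the crash folder into a list of lines (for example with `open(path).readlines()`) and print whatever `last_block` returns.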

Thank you for looking this over.

the_jest

@DustinB
Sorry, I should have been more clear. The cabinet is not sealed, i.e. it's got a loose grating on the entire front, and in addition to that, it is actively ventilated, with internal fans in the back of the cabinet pulling air through. And the box itself is not covered up in any way.

I do not have any additional coolers to use.

the_jest

@DustinB
It's less than a year old. I don't know how to evaluate whether the ventilation is "adequate"; the existing ventilation is exposed (i.e. not covered up), and it's inside a cabinet that is itself ventilated, but it's not in a freezingly air-conditioned space.

I have tried to install lm_sensors, but it doesn't detect any sensors, so I'm not sure how to tell if it's regularly overheating.

the_jest

@olivierlambert
Thanks. I ran two full passes of memtest, and it's totally clean.

I don't know what I'm looking for, but I looked through some random folders in /var/crash, and the last bit of the most recent xen.log was:

(XEN) [19617.315490] CPU4: Temperature/speed normal
(XEN) [19617.315492] CPU5: Temperature/speed normal
(XEN) [19622.518692] CPU5: Temperature above threshold
(XEN) [19622.518693] CPU4: Temperature above threshold
(XEN) [19622.518694] CPU5: Running in modulated clock mode
(XEN) [19622.518695] CPU4: Running in modulated clock mode
(XEN) [19707.409611] Executing kexec image on cpu5
(XEN) [19707.410615] Shot down all CPUs

There's no reason why the CPU should be running hot; this was a largely unloaded system, but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
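
To make the thermal lines easier to compare across crashes, something like the following Python sketch could tally them. It assumes the `(XEN) [timestamp] message` layout shown above and only looks for the two message texts that appear in this excerpt; the function name `thermal_summary` and the returned dictionary keys are made up for illustration.

```python
import re
from collections import Counter

# "(XEN) [19622.518692] CPU5: Temperature above threshold"
XEN_LINE = re.compile(r"^\(XEN\) \[\s*(\d+\.\d+)\] (.*)$")
CPU_MSG = re.compile(r"^CPU(\d+): (.*)$")


def thermal_summary(lines):
    """Count 'Temperature above threshold' events per CPU and record the
    timestamps of the last such event and of the kexec (crash) line."""
    above = Counter()
    last_above = None
    kexec_at = None
    for line in lines:
        match = XEN_LINE.match(line.strip())
        if not match:
            continue  # not a Xen console line in the expected layout
        timestamp, message = float(match.group(1)), match.group(2)
        cpu_match = CPU_MSG.match(message)
        if cpu_match and cpu_match.group(2) == "Temperature above threshold":
            above[int(cpu_match.group(1))] += 1
            last_above = timestamp
        elif message.startswith("Executing kexec image"):
            kexec_at = timestamp
    return {
        "above_threshold": dict(above),
        "last_above": last_above,
        "kexec_at": kexec_at,
    }
```

Comparing `kexec_at` with `last_above` shows how long after the last thermal warning the crash happened, which may or may not mean the two are related.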

the_jest

I'm running XCP-ng 8.2.1 on a Minisforum-01. It's been generally solid for some time (with the exception of a single VM, which I still haven't figured out), but in the last few weeks, the entire host has been crashing very frequently, sometimes once a day or more. It restarts itself smoothly, but this is getting ridiculous.

How do I diagnose what's going on? I've looked at /var/crash, but there's so much stuff there I don't know where to start, and randomly looking through some of those logs doesn't show anything that I can make sense of, at least. And /var/log/kern.log doesn't show anything that is helpful to me.

Where do I go to figure this out?
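
One way to cut /var/crash down to size is to look only at the most recent crash folder and the tail of its logs. The Python sketch below assumes /var/crash holds one subfolder per crash and that the newest one is the one with the latest modification time; the helper names `newest_crash_dir` and `tail` are hypothetical, and the file names inside each folder may differ on your host.

```python
from pathlib import Path


def newest_crash_dir(root="/var/crash"):
    """Return the most recently modified subdirectory of root, or None
    if root contains no subdirectories."""
    dirs = [entry for entry in Path(root).iterdir() if entry.is_dir()]
    return max(dirs, key=lambda entry: entry.stat().st_mtime, default=None)


def tail(path, count=40):
    """Return the last `count` lines of a text file, replacing any bytes
    that cannot be decoded rather than failing on them."""
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()[-count:]
```

For example, `tail(newest_crash_dir() / "xen.log")` would give the end of the hypervisor log for the latest crash, provided a file with that name exists in the folder.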

the_jest

@olivierlambert
No, the only things I get with xl dmesg on the host, going back some time, are random brief reports of individual CPUs running above the temperature threshold, being clocked down, and then recovering. Nothing else.

the_jest

I'm not sure if this is specifically an XCP-ng issue, so forgive me if this isn't the right place for it. I'm also not sure what details would be most useful to share, but I can describe my setup further as necessary.

I've been running XCP-ng for a few weeks, and for the most part everything has been going well. I have about 6 VMs running, all based on vanilla installs of Debian Bookworm. One of these VMs serves as a Docker host, with about a dozen containers running in it; this VM has more memory and CPUs dedicated to it than the others, but still runs comfortably (i.e. it's not always at 100% CPU or running out of memory or anything). Every couple of days, this VM has crashed; this doesn't appear to be associated with anything that's actively happening (i.e. it doesn't seem to happen when I do something specific on a container, it just happens). I'm attaching the dmesg output of a crash, which starts with a "general protection fault".

What can I do to figure out what's going on?

Thank you.

Latest posts made by the_jest