Posts made by the_jest | XCP-ng and XO forum

the_jest

I have not seen warnings to reboot after previous patchings (nor this one, that I recall).

In any case, I did reboot, and everything is fine, and I'm sorry to have wasted time and attention on this! Thank you.

the_jest

@Danp
Grr. OK I'll reboot later today (after some other tasks are finished) and report back.

I am worried about nuking the currently-working VMs. It would obviously be good to have another host for exactly such purposes. Thank you.

the_jest

@Danp
Yes, I did restart the toolstack; sorry for not saying this. It successfully restarted, but did not resolve the problem. Meanwhile the existing VMs do continue to work.

I also updated all the patches (which I do semi-regularly as well).

I have tried to look at the log files, but there's nothing obvious there that makes sense to me. I'm happy to share this here or elsewhere as appropriate; also happy to try anything else.

Thank you for the attention.

the_jest

I have an XCP-ng 8.3.0 installation, on just a single host, that has been extremely stable for some time, usually running 5–6 VMs. I needed to restart one VM to update some packages; it shut down cleanly but did not restart. When I tried to start it manually, I got:

[19:38 xcpng2 ~]# xe vm-start uuid=[FOO]
The server failed to handle your request, due to an internal error. The given message may give details useful for debugging the problem.
message: xenopsd internal error: Unix.Unix_error(Unix.ENOENT, "open", "/dev/net/tun")

Subsequently, it seems that another VM stopped on its own, and trying to restart that gives the same message. I tried to create a new VM out of curiosity, and that also won't start, with the same message.

Meanwhile there are three existing VMs that are running just fine; two of them are very active and are dealing with a lot of load and network traffic without a problem. I'd prefer not to restart the host, since I don't want to interrupt these VMs just for the hope that a power-cycle will do anything.

How can I evaluate this problem? Searching for this error message hasn't led me to anything useful.
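
A minimal sketch of a first check for the error above, assuming a standard Linux dom0: it tests whether the device node named in the error message exists and whether a `tun` module appears in `/proc/modules`. The function name `tun_status` and its parameters are hypothetical, and the script only reports state; it changes nothing on the host.

```python
import os


def tun_status(dev_path="/dev/net/tun", modules_path="/proc/modules"):
    """Report whether the TUN device node exists and the tun module is loaded.

    Both paths are parameters so the check can be pointed at test files.
    """
    device_present = os.path.exists(dev_path)
    module_loaded = False
    try:
        with open(modules_path) as fh:
            # Each line of /proc/modules starts with the module name.
            module_loaded = any(line.split()[:1] == ["tun"] for line in fh)
    except OSError:
        # /proc/modules is unreadable or absent (e.g. not on Linux).
        pass
    return {"device_present": device_present, "module_loaded": module_loaded}


if __name__ == "__main__":
    print(tun_status())
```

If both values come back false, the running kernel has no TUN device available to hand to new VMs, which would be consistent with already-running VMs being unaffected. A kernel with TUN built in would not list it in `/proc/modules`, so a false `module_loaded` alone is not conclusive.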

the_jest

@olivierlambert

@olivierlambert said in Diagnosing frequent crashes on host:

Maybe the usage is slightly different from when it was "more solid", and now it's triggered more easily. Is your XCP-ng fully up to date?

No; as said originally, I'm still on 8.2.1. I have been concerned about moving to 8.3 because it's a new installation, and I don't want to screw it up, but I'm willing to accept that it's the right thing to do.

the_jest

@andyhhp

@andyhhp said in Diagnosing frequent crashes on host:

Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?

I haven't updated the kernel, or the installation in general, since I first installed it, almost a year ago. It's never been rock-solid, but in recent weeks it's been much worse, even though there's been no system change in that time.

If I were going to make any kind of change, I assume the sensible thing to do would be to upgrade to 8.3?

I've been getting the impression that people have a low opinion of this particular system, esp. with regard to thermal management. I'm not sure what to do about this; I don't really need this exact hardware but I don't know what the correct replacement should be, if I want to just ditch it.

the_jest

@andyhhp

@andyhhp said in Diagnosing frequent crashes on host:

@the_jest said in Diagnosing frequent crashes on host:

but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)

Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.

My apologies!

@andyhhp said in Diagnosing frequent crashes on host:

The snippet of xen.log you've posted suggests it's a linux kernel crash, so look at dom0.log, and right at the end.

The last consecutive block of messages (timewise, i.e. the part of this log from the same millisecond to the end of the log) is

[  19701.650235]   WARN: Call Trace:
[  19701.650238]   WARN:  <IRQ>
[  19701.650241]   WARN:  xen_evtchn_do_upcall+0x27/0x50
[  19701.650245]   WARN:  xen_do_hypervisor_callback+0x29/0x40
[  19701.650248]   WARN:  </IRQ>
[  19701.650250]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[  19701.650253]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[  19701.650259]   WARN: RSP: e02b:ffffc900400e7eb0 EFLAGS: 00000246
[  19701.650261]   WARN: RAX: 0000000000000000 RBX: ffff8881db639d00 RCX: ffffffff810013aa
[  19701.650264]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[  19701.650267]   WARN: RBP: 0000000000000004 R08: 0000000000000008 R09: 000011eef4cd8cc2
[  19701.650269]   WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
[  19701.650272]   WARN: R13: 0000000000000000 R14: ffff8881db639d00 R15: ffff8881db639d00
[  19701.650276]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
[  19701.650279]   WARN:  ? xen_safe_halt+0xc/0x20
[  19701.650282]   WARN:  ? default_idle+0x1a/0x140
[  19701.650284]   WARN:  ? do_idle+0x1ea/0x260
[  19701.650287]   WARN:  ? cpu_startup_entry+0x6f/0x80
[  19701.650289]   WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc nls_iso8859_1 nls_cp437 intel_powerclamp crct10dif_pclmul vfat crc32_pclmul ghash_clmulni_intel fat pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper video backlight ip_tables x_tables hid_generic usbhid hid xhci_pci nvme igc(O) xhci_hcd i40e(O) nvme_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[  19701.650330]   WARN: ---[ end trace 79b40169d24b8e01 ]---
[  19701.650333]   WARN: RIP: e030:__xen_evtchn_do_upcall+0x82/0x90
[  19701.650335]   WARN: Code: 66 90 f6 c4 02 75 23 80 3b 00 75 d7 65 ff 05 85 89 ba 7e 48 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 09 48 83 c4 18 5b 5d c3 <0f> 0b e8 77 aa bf ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e9 66 ff
[  19701.650341]   WARN: RSP: e02b:ffff8881dc503fb8 EFLAGS: 00010002
[  19701.650344]   WARN: RAX: 0000000000000000 RBX: ffff8881dc514100 RCX: 000000008518dd93
[  19701.650346]   WARN: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881db6aa800
[  19701.650349]   WARN: RBP: 0000000000000004 R08: 00000000000035c6 R09: ffff8881db003210
[  19701.650352]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  19701.650355]   WARN: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  19701.650361]   WARN: FS:  0000000000000000(0000) GS:ffff8881dc500000(0000) knlGS:0000000000000000
[  19701.650364]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
[  19701.650367]   WARN: CR2: 00007fffc3789c78 CR3: 00000001d81ca000 CR4: 0000000000040660
[  19701.650371]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt

Thank you for looking this over.
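
A rough sketch of how the key lines could be pulled out of a long dom0.log, assuming every line has the `[timestamp] LEVEL: message` shape shown in the excerpt above. The function name `summarize_crash` is hypothetical; it only collects the `RIP:` symbols and the final panic message, so the full trace is still what anyone diagnosing the crash would want to see.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
# "[  19701.650371]  EMERG: Kernel panic - not syncing: ..."
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<level>[A-Z]+):\s?(?P<msg>.*)$")


def summarize_crash(lines):
    """Return the panic message and the RIP symbols seen in a dom0.log excerpt."""
    panic = None
    rips = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        msg = m.group("msg").strip()
        if msg.startswith("Kernel panic"):
            panic = msg
        elif msg.startswith("RIP:"):
            # "RIP: e030:symbol+0x82/0x90" -> keep the part after the segment.
            parts = msg.split(":", 2)
            if len(parts) == 3:
                rips.append(parts[2])
    return {"panic": panic, "rips": rips}
```

Calling it on `open("dom0.log")` (the path is an assumption) returns the panic text plus the function names where the kernel reported its instruction pointer, which is usually enough to quote in a forum post alongside the full trace.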

the_jest

@DustinB
Sorry, I should have been more clear. The cabinet is not sealed, i.e. it's got a loose grating on the entire front, and in addition to that, it is actively ventilated, with internal fans in the back of the cabinet pulling air through. And the box itself is not covered up in any way.

I do not have any additional coolers to use.

the_jest

@DustinB
It's less than a year old. I don't know how to evaluate whether the ventilation is "adequate"; the existing ventilation is exposed (i.e. not covered up), and it's inside a cabinet that is itself ventilated, but it's not in a freezingly air-conditioned space.

I have tried to install lm_sensors, but it doesn't detect any sensors, so I'm not sure how to tell if it's regularly overheating.

the_jest

@olivierlambert
Thanks. I ran two full passes of memtest, and it's totally clean.

I don't know what I'm looking for, but I looked through some random folders in /var/crash, and the last bit of the most recent xen.log was:

(XEN) [19617.315490] CPU4: Temperature/speed normal
(XEN) [19617.315492] CPU5: Temperature/speed normal
(XEN) [19622.518692] CPU5: Temperature above threshold
(XEN) [19622.518693] CPU4: Temperature above threshold
(XEN) [19622.518694] CPU5: Running in modulated clock mode
(XEN) [19622.518695] CPU4: Running in modulated clock mode
(XEN) [19707.409611] Executing kexec image on cpu5
(XEN) [19707.410615] Shot down all CPUs

There's no reason why the CPU should be running hot; this was a largely unloaded system, but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
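
To see how often these temperature reports occur and how close they fall to a crash, something like the following could be run over each xen.log. It is a sketch that assumes the `(XEN) [timestamp] CPUn: message` line shape shown above; the names `thermal_events` and `seconds_before` are made up for illustration.

```python
import re

# Matches lines shaped like "(XEN) [19622.518692] CPU5: Temperature above threshold"
THERMAL_RE = re.compile(
    r"^\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] CPU(?P<cpu>\d+): (?P<msg>.+)$"
)


def thermal_events(lines):
    """Return (timestamp, cpu, message) tuples for over-temperature reports."""
    events = []
    for line in lines:
        m = THERMAL_RE.match(line.strip())
        if m and "above threshold" in m.group("msg"):
            events.append((float(m.group("ts")), int(m.group("cpu")), m.group("msg")))
    return events


def seconds_before(events, crash_ts):
    """Seconds between the last over-temperature report and crash_ts, or None."""
    if not events:
        return None
    return crash_ts - max(ts for ts, _, _ in events)
```

In the excerpt above, the kexec line comes roughly 85 seconds after the last over-temperature report. A gap like that does not prove the heat caused the crash, but comparing it across several crash folders would show whether the pattern repeats.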

the_jest

I'm running XCP-ng 8.2.1 on a Minisforum-01. It's been generally solid for some time (with the exception of a single VM, which I still haven't figured out), but in the last few weeks, the entire host has been crashing very frequently, sometimes once a day or more. It restarts itself smoothly, but this is getting ridiculous.

How do I diagnose what's going on? I've looked at /var/crash, but there's so much stuff there I don't know where to start, and randomly looking through some of those logs doesn't show anything that I can make sense of, at least. And /var/log/kern.log doesn't show anything that is helpful to me.

Where do I go to figure this out?
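
One way to narrow down a crowded /var/crash is to start with the most recent crash folder only. The sketch below assumes each crash is stored in its own subdirectory containing `*.log` files (such as the xen.log and dom0.log mentioned in this thread); the helper names `newest_crash_dir` and `list_logs` are hypothetical.

```python
import os


def newest_crash_dir(root="/var/crash"):
    """Return the path of the most recently modified subdirectory of root, or None."""
    try:
        with os.scandir(root) as it:
            entries = [e for e in it if e.is_dir()]
    except OSError:
        # root does not exist or cannot be read.
        return None
    if not entries:
        return None
    return max(entries, key=lambda e: e.stat().st_mtime).path


def list_logs(crash_dir):
    """Return the sorted names of *.log files in one crash directory."""
    return sorted(n for n in os.listdir(crash_dir) if n.endswith(".log"))


if __name__ == "__main__":
    latest = newest_crash_dir()
    if latest:
        print(latest, list_logs(latest))
```

Reading the end of each log in that one folder is usually a more manageable starting point than browsing every crash at once.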

the_jest

@olivierlambert
No, the only things I get with xl dmesg on the host, going back some time, are brief, sporadic reports of individual CPUs running above the temperature threshold, being clocked down, and then recovering. Nothing else.

the_jest

I'm not sure if this is specifically an XCP-ng issue, so forgive me if this isn't the right place for it. I'm also not sure what details would be most useful to share, but I can describe my setup further as necessary.

I've been running XCP-ng for a few weeks, and for the most part everything has been going well. I have about 6 VMs running, all based on vanilla installs of Debian Bookworm. One of these VMs serves as a Docker host, with about a dozen containers running in it; this VM has more memory and CPUs dedicated to it than the others, but still runs comfortably (i.e. it's not always at 100% CPU or running out of memory or anything). Every couple of days, this VM has crashed; this doesn't appear to be associated with anything that's actively happening (i.e. it doesn't seem to happen when I do something specific on a container, it just happens). I'm attaching the dmesg output of a crash; it starts with a "general protection fault".

What can I do to figure out what's going on?

Thank you.