Posts made by tuxen
-
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS One thing is clear to me: the reboot is triggering a VM shutdown due to a system crash (the kernel errors and memory dump files are a lead). Without a detailed stack trace (like a Linux kernel panic) and given the difficulty of reproducing the issue, troubleshooting is a very hard task. One last thing I'd check is /var/log/daemon.log at the VM shutdown time window.
-
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS I was thinking exactly about an after-hours task doing heavy storage I/O (e.g. data replication or ETL-like workloads). Under this scenario, a forced reboot might cause some sort of file system corruption due to uncommitted data being lost.
Now, another possible source of the issue comes to mind: automatic Windows Update. Is this service active? I'm not a Windows expert, but a forced reboot during a system update might also cause unexpected behavior.
Seeing all those errors, it seems that some system file or DLL got corrupted and needs a repair. I strongly recommend taking a snapshot before running a system repair.
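For instance, the snapshot from dom0 would be a one-liner (assuming the standard xe syntax; the UUID is a placeholder) before running the usual in-guest tools like sfc /scannow or DISM /Online /Cleanup-Image /RestoreHealth:
xe vm-snapshot uuid=<VM UUID> new-name-label=pre-repair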
-
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS When that forced reboot command is issued, is the VM:
- Under intensive I/O?
- Running a backup job (started or in progress)?
-
RE: AMD Radeon S7150x2 - Not being seen by VMs
@cunrun Are the XCP-ng host and the Windows Server 2019 VM booting in legacy/BIOS or UEFI mode? Since the FirePro line was launched when legacy/BIOS was still the standard, I'd try that mode (if you haven't yet).
-
RE: Memory Consumption goes higher day by day
@dhiraj26683 Seeing the htop output, there are some HA-LIZARD PIDs running. So, yes, there's "extra stuff" installed on dom0.
HA-LIZARD uses the TGT iSCSI driver, which in turn has an implicit write-cache option enabled by default if not set [1][2]. Is this option disabled in /etc/tgt/targets.conf?
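If it isn't, a snippet like this inside the target definition should turn it off (based on the man page [2]; the IQN and backing-store below are just illustrative placeholders):
<target iqn.2020-01.com.example:lun0>
    backing-store /dev/drbd0
    write-cache off
</target>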
[1] https://www.halizard.com/ha-iscsi
[2] https://manpages.debian.org/testing/tgt/targets.conf.5.en.html
-
RE: Intel Xeon W-2145 CPU on SuperMicro & failing xenpm get-cpufreq-para
@gecant I don't have any Supermicro server to test on (mostly Dell/HPE here), but checking the mobo manual [1], sadly, there's no ready-made profile for tweaking the power settings. After going through the available options, try this config:
Advanced >> CPU Configuration >> Advanced Power Management Configuration
  > CPU P State Control
      SpeedStep (PStates) [Enable]
      EIST PSD Function [HW_ALL]
      Turbo Mode [Enabled]
  > Hardware PM State Control
      Hardware P-States [Native Mode]
  > CPU C State Control
      Autonomous Core C-State [Disable]
      CPU C6 Report [Enable]
      Enhanced Halt State (C1E) [Disable (performance)] or [Enable (powersave)]
  > Package C State Control
      Package C State [C0/C1 (performance)] or [C6(Retention) state (powersave)]
[1] https://www.supermicro.com/manuals/motherboard/C420/MNL-2005.pdf
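After applying it and rebooting, I'd re-check whether Xen now reports the P-states (I can't test on that board, so no guarantees):
xenpm get-cpufreq-para
xenpm start 1 | grep "Avg freq"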
-
RE: Very scary host reboot issue
@darabontors some additional tests that I could think of:
- Minimum WG MTU on the client side (MTU=1280);
- OPNsense with emulated e1000 interfaces (bypasses the PV driver but not OVS). It'll keep the VM 'agile' (hot-migratable), but at a big cost in performance;
- The latest OPNsense version, 23.7.5.
As for the latest version, I found this important info posted by the devs about a change in the MTU code [1]:
Today introduces a change in MTU handling for parent interfaces mostly
noticed by PPPoE use where the respective MTU values need to fit the
parent plus the additional header of the VLAN or PPPoE. Should the
MTU already be misconfigured to a smaller value it will be used as
configured so check your configuration and clear the MTU value if you
want the system to decide about the effective parent MTU size.
(...)
Hope it helps.
-
RE: Very scary host reboot issue
@darabontors said in Very scary host reboot issue:
Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?
I'm not into firewall/tunneling stuff, but shouldn't the WireGuard MTU be lower than the PPPoE one in order to fit the WG protocol overhead? I read that the default is 1420 and the minimum is 1280. I'd first reset the WG MTU to the default and then also test lower values within this range if the crash still persists.
Regardless of the tests, there's indeed a bug somewhere, because a malformed packet/frame should be handled, not trigger a crash.
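For reference, forcing the MTU on the client side should be a single line in the [Interface] section of the WG config (keys/peers omitted; 1280 is the protocol minimum):
[Interface]
MTU = 1280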
-
RE: Kernel panic on fresh install
@sasha It's worth noticing that the BIOS (from 2019) is relatively old/outdated. I'd recommend updating it to a more recent version.
-
RE: Dedicated CPU topology
@fred974 Yep, see the docs about NUMA/core affinity (soft/hard pinning).
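For a quick test, soft pinning can be done through the VCPUs-params:mask attribute (the mask values below are just an example; UUID is a placeholder):
xe vm-param-set uuid=<VM UUID> VCPUs-params:mask=0,1,2,3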
-
RE: error -104
@ptunstall when the GPU was pushed back to dom0, did you also remove the PCI address from the VM config?
What's the output of the following?
xe vm-param-get uuid=<...> param-name=other-config
-
RE: Proper way to set default CPU Governor?
@sluflyer06 In order to persist across reboots, you must set the cpufreq boot option. There's no need to rebuild grub, because the change happens at the Xen level (instead of dom0):
/opt/xensource/libexec/xen-cmdline --set-xen cpufreq=xen:ondemand
After that, change the System power profile to Performance Per Watt (OS) in the BIOS.
Verifying the config:
- Check if the attribute current_governor is set to ondemand:
xenpm get-cpufreq-para
- Check the clock scaling:
xenpm start 1 | grep "Avg freq"
-
RE: HPC with 2x64core (256 threads) possible with XCP-ng?
@Forza Take a look:
https://xcp-ng.org/forum/post/49400
At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 VM with more than 64 vCPUs on a 48C/96T Xeon server. The VM started normally, but it didn't recognize the vCPUs beyond 64.
I've not tested that VM param platform:acpi=0 as a possible solution, nor evaluated its trade-offs. In the past, some old RHEL 5.x VMs without ACPI support would simply power off (like pulling the power cord) instead of doing a clean shutdown on a vm-shutdown command.
Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not an HPC expert, but considering the EPYC MCM architecture, instead of one big VM, spreading the workload across many workers pinned to each CCD (or to each NUMA node on an NPS4 config) may be interesting.
Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.
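Back to that acpi param: if someone wants to test it anyway, the setting should be a one-liner (untested, as said; UUID is a placeholder):
xe vm-param-set uuid=<VM UUID> platform:acpi=0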
-
RE: Accedentally set up a pool on an xcp-ng server
It could be. From a user's point of view, a single-host pool wouldn't make any sense, so they created the "implicit/explicit" concept and treated everything as a pool internally.
-
RE: Accedentally set up a pool on an xcp-ng server
That's a question for the Citrix dev team
-
RE: Accedentally set up a pool on an xcp-ng server
Just FYI guys, XenCenter/XCP-ng Center have the menu option Pool > Make into standalone server. As pointed out by other members, every standalone host is in a pool, but that option reverts to an "implicit" one.
Hope this helps.
-
RE: XCP 8.2 VCPUs-max settings
@jeff In order to create a virtual NUMA topology and expose it to the guest, the vNUMA feature needs to be implemented at the hypervisor level and accessible through the XAPI. I'm not sure if that feature is fully supported at the moment. Maybe @olivierlambert can confirm this?
You could try adding the cores-per-socket attribute following the physical NUMA topology (96 / 4 nodes = 24):
xe vm-param-set platform:cores-per-socket=24 uuid=<VM UUID>
Let me know if it works.
-
RE: Centos 8 is EOL in 2021, what will xcp-ng do?
@indyj said in Centos 8 is EOL in 2021, what will xcp-ng do?:
@jefftee I prefer Alpine Linux.
+1
Low resource footprint, no bloatware... They even have a pre-built Xen Hypervisor ISO flavor
-
RE: VDI_IO_ERROR(Device I/O errors) when you run scheduled backup
This got my attention:
Jan 15 19:17:40 xcp-ng-xen12-lon2 xapi: [error||623653 INET :::80||import] Caught exception in import handler: VDI_IO_ERROR: [ Device I/O errors ]
Jan 15 19:17:40 xcp-ng-xen12-lon2 xapi: [error||623653 INET :::80||backtrace] VDI.import D:378e6880299b failed with exception Unix.Unix_error(Unix.EPIPE, "single_write", "")
Jan 15 19:17:40 xcp-ng-xen12-lon2 xapi: [error||623653 INET :::80||backtrace] Raised Unix.Unix_error(Unix.EPIPE, "single_write", "")
This Unix.EPIPE error on the remote target means that the pipe stream is being closed before VDI.import receives all the data. The outcome is a VDI I/O error due to a broken, partially sent/received VDI.
Since a remote over-the-internet link can be more prone to latency/intermittency issues, it might be necessary to adjust the remote NFS soft-mount timeout/retries or to mount the target with the hard option, as in the example below.
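Something along these lines on the backup remote (hostname/paths are placeholders; timeo is in tenths of a second):
mount -t nfs -o hard,timeo=600,retrans=5 nfs-server:/backups /mnt/backup-remote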
I would also check whether the remote target is running out of space during the backup process.
-
RE: XCP-ng 8.1 host loses network when running gateway/firewall VMs
Could the fcoe driver be causing the issue?
dmesg:
[ 42.363389] bnx2fc: QLogic FCoE Driver bnx2fc v2.12.5 (November 16, 2018)
[ 42.371336] bnx2fc: FCoE initialized for eth1.
[ 42.371641] bnx2fc: [04]: FCOE_INIT passed
[ 42.387017] bnx2fc: FCoE initialized for eth0.
[ 42.387305] bnx2fc: [04]: FCOE_INIT passed
lsmod:
fcoe                    32768  0
libfcoe                 77824  2 fcoe,bnx2fc
libfc                  147456  3 fcoe,bnx2fc,libfcoe
scsi_transport_fc       69632  3 fcoe,libfc,bnx2fc