XCP-ng 8.3 with VM crashing
-
I have an Asus B550M board with a Ryzen 3600 CPU and 32GB of RAM. The VMs (five in total) reboot randomly on a daily basis, and on some days the host itself reboots without leaving anything in dmesg or /var/crash. Below is the host dmesg from a period in which two of the VMs rebooted at different times of the day. Three of the VMs run Kubernetes with an average CPU load of 12% and very low IO and network activity. Average host CPU is 12%.
I have run memtest86 and Prime95 on the host for 24 hours each. Both were booted from a USB thumb drive, without XCP-ng running.
What could be the issue?
(XEN) [ 1.221328] Dom0 has maximum 8 VCPUs
(XEN) [ 1.240452] Initial low memory virq threshold set at 0x4000 pages.
(XEN) [ 1.240453] Scrubbing Free RAM in background
(XEN) [ 1.240454] Std. Loglevel: Errors, warnings and info
(XEN) [ 1.240455] Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) [ 1.240456] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) [ 1.240646] Freed 2048kB init memory
(XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
(XEN) [52089.602174] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 7756387, emul_len 1
(XEN) [52089.602177] SVM Insn len emulation failed (1): d2v1 64bit @ 0010:ffffffff980c99a5 -> 55 48 89 e5 41 56 41 55 41 89 fd 41 54 53 48 83
(XEN) [131585.344364] d3v2 NRip reported inst_len 18251923
(XEN) [131585.344370] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 18251923, emul_len 2
(XEN) [131585.344372] SVM Insn len emulation failed (1): d3v2 64bit @ 0010:ffffffff8f2c7275 -> 88 0f 01 48 d3 e8 eb 8a 80 63 1d fd 48 89 45 d8
-
Hi,
My first gut feeling is a buggy BIOS, can you check if you are on the latest version?
-
@olivierlambert The BIOS is the latest version. XCP-ng has the latest patches, and the VMs (Ubuntu 22.04 and 24.04) are fully patched as well.
-
@AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?
-
It is a default install of Ubuntu. I tried the commands below and the results are negative.
lsmod | grep kvm
egrep -c '(vmx|svm)' /proc/cpuinfo
0
-
@AlbertK None of those commands are relevant in a Xen system. You want
xe vm-param-list uuid=$VM
-
@andyhhp
This is the param list of one of the VMs that is rebooting by itself.
uuid ( RO) : 1199c4b4-6072-7086-7286-7d7d1cad2c33
name-label ( RW): K8s-node1
name-description ( RW):
user-version ( RW): 1
is-a-template ( RW): false
is-default-template ( RW): false
is-a-snapshot ( RO): false
snapshot-of ( RO): <not in database>
snapshots ( RO):
snapshot-time ( RO): 19700101T00:00:00Z
snapshot-info ( RO):
parent ( RO): <not in database>
children ( RO):
is-control-domain ( RO): false
power-state ( RO): running
memory-actual ( RO): 4297039872
memory-target ( RO): 4294967296
memory-overhead ( RO): 39845888
memory-static-max ( RW): 4294967296
memory-dynamic-max ( RW): 4294967296
memory-dynamic-min ( RW): 4294967296
memory-static-min ( RW): 1073741824
suspend-VDI-uuid ( RW): <not in database>
suspend-SR-uuid ( RW): <not in database>
VCPUs-params (MRW):
VCPUs-max ( RW): 4
VCPUs-at-startup ( RW): 4
actions-after-shutdown ( RW): Destroy
actions-after-softreboot ( RW): Soft reboot
actions-after-reboot ( RW): Restart
actions-after-crash ( RW): Restart
console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
hvm ( RO): true
platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
current-operations (SRO):
blocked-operations (MRW):
allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
domain-type ( RW): hvm
current-domain-type ( RO): hvm
HVM-boot-policy ( RW): BIOS order
HVM-boot-params (MRW): order: cdn; firmware: uefi
HVM-shadow-multiplier ( RW): 1.000
PV-kernel ( RW):
PV-ramdisk ( RW):
PV-args ( RW):
PV-legacy-args ( RW):
PV-bootloader ( RW):
PV-bootloader-args ( RW):
last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
affinity ( RW): <not in database>
other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
dom-id ( RO): 2
recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
ha-always-run ( RW) [DEPRECATED]: false
ha-restart-priority ( RW):
blobs ( RO):
start-time ( RO): 20250320T19:19:07Z
install-time ( RO): 20241028T05:11:47Z
VCPUs-number ( RO): 4
VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
PV-drivers-up-to-date ( RO) [DEPRECATED]: true
memory (MRO):
disks (MRO):
VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
PV-drivers-detected ( RO): true
other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
live ( RO): true
guest-metrics-last-updated ( RO): 20250320T19:19:18Z
can-use-hotplug-vbd ( RO): unspecified
can-use-hotplug-vif ( RO): unspecified
cooperative ( RO) [DEPRECATED]: true
tags (SRW):
appliance ( RW): <not in database>
groups ( RW):
snapshot-schedule ( RW): <not in database>
is-vmss-snapshot ( RO): false
start-delay ( RW): 0
shutdown-delay ( RW): 0
order ( RW): 0
version ( RO): 0
generation-id ( RO):
hardware-platform-version ( RO): 0
has-vendor-device ( RW): false
requires-reboot ( RO): false
reference-label ( RO): ubuntu-22.04
bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
pending-guidances ( RO):
vtpms ( RO):
pending-guidances-recommended ( RO):
pending-guidances-full ( RO):
-
@AlbertK Thanks. There's no nested-virt configured there.
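For anyone repeating this check on their own VM: the relevant field in the output above is `nested_virt` inside `last-boot-record`. Below is a minimal Python sketch that pulls it out, assuming the record has been saved as a string (for example from `xe vm-param-get uuid=<vm-uuid> param-name=last-boot-record`). The function name `nested_virt_enabled` and the abbreviated sample record are illustrative only.

```python
import json


def nested_virt_enabled(last_boot_record):
    """Return the nested_virt flag from a last-boot-record JSON string.

    Returns None if the record carries no such field.
    """
    # The record appears wrapped in single quotes in the param list;
    # strip surrounding whitespace and quotes before parsing.
    record = json.loads(last_boot_record.strip().strip("'"))
    return record.get("nested_virt")


# Abbreviated sample: only a handful of the fields from the real record.
sample = (
    '\'{"platformdata":{"viridian":"false"},'
    '"nested_virt":false,"nomigrate":false,"version":2}\''
)
print(nested_virt_enabled(sample))
```

A result of `False` matches what the param list shows for this VM.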
I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.
Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?
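One way to see why the field looks corrupt: an x86 instruction can be at most 15 bytes long, while the log in the first post reports instruction lengths in the millions. Here is a minimal Python sketch, assuming the hypervisor log has been saved as text, that flags every "NRip reported inst_len" entry above that limit. The names `NRIP_RE` and `find_bad_inst_len` are made up for this example.

```python
import re

# Architectural upper bound on the length of one x86 instruction, in bytes.
MAX_X86_INSN_LEN = 15

# Matches entries such as:
# (XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
NRIP_RE = re.compile(
    r"\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] d(?P<dom>\d+)v(?P<vcpu>\d+) "
    r"NRip reported inst_len (?P<len>\d+)"
)


def find_bad_inst_len(log_text):
    """Return (timestamp, domain, vcpu, inst_len) for each implausible length."""
    bad = []
    for m in NRIP_RE.finditer(log_text):
        inst_len = int(m.group("len"))
        if inst_len > MAX_X86_INSN_LEN:
            bad.append(
                (float(m.group("ts")), int(m.group("dom")),
                 int(m.group("vcpu")), inst_len)
            )
    return bad


# The two entries from the log in the first post.
sample = (
    "(XEN) [52089.602170] d2v1 NRip reported inst_len 7756387 "
    "(XEN) [131585.344364] d3v2 NRip reported inst_len 18251923"
)
for ts, dom, vcpu, length in find_bad_inst_len(sample):
    print(f"t={ts:.6f}s d{dom}v{vcpu}: inst_len={length} "
          f"(limit is {MAX_X86_INSN_LEN})")
```

The regex uses `finditer` over the whole text, so it works whether the log is one entry per line or pasted as a single run.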
-
@andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB RAM to test whether it is a RAM issue.
Will report back.
-
@AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be an option in the BIOS called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problem since.
-
@joebeasley In my case, one or more VMs reboot by themselves, and sometimes one VM becomes inaccessible (no SSH and no console from XO; XO shows CPU at 99% with no network or disk activity, and the VM needs a forced reboot). A few hours after that, the host reboots. This is happening every day now.
I am seeing a lot of this in the host dmesg.
[105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
[105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
[105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
[105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
[105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
[105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
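
To put numbers on these messages, here is a minimal Python sketch that pairs each "Guest Rx stalled" entry with the next "Guest Rx ready" entry on the same interface and reports how long the stall lasted. It assumes the dmesg output has been saved as text; the names `EVENT_RE` and `stall_durations` are made up for this example.

```python
import re

# Matches entries such as:
# [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
EVENT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] vif \S+ (vif\d+\.\d+): Guest Rx (stalled|ready)"
)


def stall_durations(log_text):
    """Pair each 'stalled' with the next 'ready' on the same interface.

    Returns a list of (interface, stall_start, seconds_stalled).
    """
    open_stalls = {}  # interface -> timestamp of the unmatched 'stalled'
    result = []
    for ts, iface, state in EVENT_RE.findall(log_text):
        t = float(ts)
        if state == "stalled":
            open_stalls[iface] = t
        elif iface in open_stalls:
            start = open_stalls.pop(iface)
            result.append((iface, start, t - start))
    return result


# The six entries quoted above.
sample = (
    "[105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled "
    "[105689.395996] vif vif-6-0 vif6.0: Guest Rx ready "
    "[105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled "
    "[105717.555832] vif vif-6-0 vif6.0: Guest Rx ready "
    "[105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled "
    "[105754.163666] vif vif-6-0 vif6.0: Guest Rx ready"
)
for iface, start, secs in stall_durations(sample):
    print(f"{iface}: stalled at {start:.1f}s for {secs:.1f}s")
```

On the six entries quoted above this finds three stalls on vif6.0, each lasting roughly 10 seconds.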