
    XCP-ng 8.3 with VM crashing

      AlbertK

      I have an Asus B550M with a Ryzen 3600 CPU and 32GB of RAM. My VMs (five in total) reboot randomly on a daily basis. Some days the host itself reboots without leaving anything in dmesg or /var/crash. Below is the host dmesg from when two of the VMs rebooted at different times of the day. Three of the VMs run Kubernetes with an average CPU usage of 12% and very low IO and network load. Average host CPU usage is 12%.

      I have run memtest86 and Prime95 on the host for 24 hours each. Both were booted from a USB thumb drive, without XCP-ng running.

      What could be the issue?

      (XEN) [    1.221328] Dom0 has maximum 8 VCPUs
      (XEN) [    1.240452] Initial low memory virq threshold set at 0x4000 pages.
      (XEN) [    1.240453] Scrubbing Free RAM in background
      (XEN) [    1.240454] Std. Loglevel: Errors, warnings and info
      (XEN) [    1.240455] Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
      (XEN) [    1.240456] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
      (XEN) [    1.240646] Freed 2048kB init memory
      (XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
      (XEN) [52089.602174] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 7756387, emul_len 1
      (XEN) [52089.602177] SVM Insn len emulation failed (1): d2v1 64bit @ 0010:ffffffff980c99a5 -> 55 48 89 e5 41 56 41 55 41 89 fd 41 54 53 48 83
      (XEN) [131585.344364] d3v2 NRip reported inst_len 18251923
      (XEN) [131585.344370] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 18251923, emul_len 2
      (XEN) [131585.344372] SVM Insn len emulation failed (1): d3v2 64bit @ 0010:ffffffff8f2c7275 -> 88 0f 01 48 d3 e8 eb 8a 80 63 1d fd 48 89 45 d8
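For anyone wanting to pull the affected domain and vCPU out of a longer log, here is a minimal Python sketch. It assumes the line format matches the excerpt above; the names `NRIP_RE`, `parse_nrip_events` and `SAMPLE` are made up for illustration. In `d2v1`, `d2` is the domain ID and `v1` the guest vCPU, so these lines do not name the physical CPU. An x86 instruction is at most 15 bytes long, so the sketch flags any larger reported length as implausible.

```python
import re

# Longest legal x86 instruction, in bytes.
MAX_X86_INSN_LEN = 15

# Matches lines such as:
# (XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
NRIP_RE = re.compile(
    r"\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] d(?P<dom>\d+)v(?P<vcpu>\d+) "
    r"NRip reported inst_len (?P<len>\d+)"
)

# Two lines copied from the log excerpt in this post.
SAMPLE = """\
(XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
(XEN) [131585.344364] d3v2 NRip reported inst_len 18251923
"""


def parse_nrip_events(log_text):
    """Return (timestamp, domid, vcpu, inst_len) for each matching line."""
    events = []
    for line in log_text.splitlines():
        m = NRIP_RE.search(line)
        if m:
            events.append(
                (float(m["ts"]), int(m["dom"]), int(m["vcpu"]), int(m["len"]))
            )
    return events


def is_implausible(inst_len):
    """True when the reported length cannot be a real x86 instruction."""
    return inst_len > MAX_X86_INSN_LEN


if __name__ == "__main__":
    for ts, dom, vcpu, inst_len in parse_nrip_events(SAMPLE):
        flag = "implausible" if is_implausible(inst_len) else "ok"
        print(f"t={ts:.3f}s dom={dom} vcpu={vcpu} inst_len={inst_len} ({flag})")
```

Both reported lengths in the excerpt are far beyond 15 bytes, so both would be flagged.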
      
        olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        My first gut feeling is a buggy BIOS, can you check if you are on the latest version?

          AlbertK @olivierlambert

          olivierlambert The BIOS is the latest version. XCP-ng is fully patched, and the VMs (Ubuntu 22.04 and 24.04) also have the latest patches.

            andyhhp Xen Guru @AlbertK

            AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

              AlbertK @andyhhp

              It is a default install of Ubuntu. I tried the commands below inside the VM and the result is negative.

              lsmod | grep kvm
              
              egrep -c '(vmx|svm)'  /proc/cpuinfo
              

              0

                andyhhp Xen Guru @AlbertK

                AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM

                  AlbertK @andyhhp

                  andyhhp
                  This is the param list of one of the VMs that reboots by itself.

                  uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                              name-label ( RW): K8s-node1
                                        name-description ( RW):
                                            user-version ( RW): 1
                                           is-a-template ( RW): false
                                     is-default-template ( RW): false
                                           is-a-snapshot ( RO): false
                                             snapshot-of ( RO): <not in database>
                                               snapshots ( RO):
                                           snapshot-time ( RO): 19700101T00:00:00Z
                                           snapshot-info ( RO):
                                                  parent ( RO): <not in database>
                                                children ( RO):
                                       is-control-domain ( RO): false
                                             power-state ( RO): running
                                           memory-actual ( RO): 4297039872
                                           memory-target ( RO): 4294967296
                                         memory-overhead ( RO): 39845888
                                       memory-static-max ( RW): 4294967296
                                      memory-dynamic-max ( RW): 4294967296
                                      memory-dynamic-min ( RW): 4294967296
                                       memory-static-min ( RW): 1073741824
                                        suspend-VDI-uuid ( RW): <not in database>
                                         suspend-SR-uuid ( RW): <not in database>
                                            VCPUs-params (MRW):
                                               VCPUs-max ( RW): 4
                                        VCPUs-at-startup ( RW): 4
                                  actions-after-shutdown ( RW): Destroy
                                actions-after-softreboot ( RW): Soft reboot
                                    actions-after-reboot ( RW): Restart
                                     actions-after-crash ( RW): Restart
                                           console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                                     hvm ( RO): true
                                                platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                                      allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                                      current-operations (SRO):
                                      blocked-operations (MRW):
                                     allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                                     allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                          possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                             domain-type ( RW): hvm
                                     current-domain-type ( RO): hvm
                                         HVM-boot-policy ( RW): BIOS order
                                         HVM-boot-params (MRW): order: cdn; firmware: uefi
                                   HVM-shadow-multiplier ( RW): 1.000
                                               PV-kernel ( RW):
                                              PV-ramdisk ( RW):
                                                 PV-args ( RW):
                                          PV-legacy-args ( RW):
                                           PV-bootloader ( RW):
                                      PV-bootloader-args ( RW):
                                     last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                        last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                             resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                                affinity ( RW): <not in database>
                                            other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                                  dom-id ( RO): 2
                                         recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                           xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                              ha-always-run ( RW) [DEPRECATED]: false
                                     ha-restart-priority ( RW):
                                                   blobs ( RO):
                                              start-time ( RO): 20250320T19:19:07Z
                                            install-time ( RO): 20241028T05:11:47Z
                                            VCPUs-number ( RO): 4
                                       VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                              os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                                      PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
                      PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                                  memory (MRO):
                                                   disks (MRO):
                                                    VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                                networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                                     PV-drivers-detected ( RO): true
                                                   other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                                    live ( RO): true
                              guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                                     can-use-hotplug-vbd ( RO): unspecified
                                     can-use-hotplug-vif ( RO): unspecified
                                cooperative ( RO) [DEPRECATED]: true
                                                    tags (SRW):
                                               appliance ( RW): <not in database>
                                                  groups ( RW):
                                       snapshot-schedule ( RW): <not in database>
                                        is-vmss-snapshot ( RO): false
                                             start-delay ( RW): 0
                                          shutdown-delay ( RW): 0
                                                   order ( RW): 0
                                                 version ( RO): 0
                                           generation-id ( RO):
                               hardware-platform-version ( RO): 0
                                       has-vendor-device ( RW): false
                                         requires-reboot ( RO): false
                                         reference-label ( RO): ubuntu-22.04
                                            bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                                       pending-guidances ( RO):
                                                   vtpms ( RO):
                           pending-guidances-recommended ( RO):
                                  pending-guidances-full ( RO):
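A minimal sketch for reading `xe vm-param-list` output programmatically, assuming the `key ( RW): value` layout shown above. The helper names are hypothetical and the sample is a heavily trimmed excerpt of the output above. Treating the `nested_virt` field of `last-boot-record` as the indicator is an assumption based on that field appearing in this output, not a documented rule.

```python
import json
import re

# "key ( RW): value", optionally with "[DEPRECATED]" before the colon.
PARAM_RE = re.compile(
    r"^\s*(?P<key>[\w-]+)\s*\(\s*(?P<mode>[A-Z]+)\s*\)"
    r"\s*(?:\[DEPRECATED\])?\s*:\s?(?P<value>.*)$"
)

# Trimmed excerpt; the real last-boot-record is much longer.
SAMPLE = """\
uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                            name-label ( RW): K8s-node1
                                   hvm ( RO): true
             ha-always-run ( RW) [DEPRECATED]: false
                      last-boot-record ( RO): '{"nested_virt":false,"nomigrate":false,"version":2}'
"""


def parse_param_list(text):
    """Map each parameter name to its raw string value."""
    params = {}
    for line in text.splitlines():
        m = PARAM_RE.match(line)
        if m:
            params[m["key"]] = m["value"].strip()
    return params


def nested_virt_from_boot_record(params):
    """Return the nested_virt flag from last-boot-record, or None if absent."""
    record = params.get("last-boot-record", "").strip("'")
    if not record:
        return None
    return json.loads(record).get("nested_virt")


if __name__ == "__main__":
    params = parse_param_list(SAMPLE)
    print(params["name-label"], "nested_virt =", nested_virt_from_boot_record(params))
```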
                  
                    andyhhp Xen Guru @AlbertK

                    AlbertK Thanks. There's no nested-virt configured there.

                    I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

                    Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

                      AlbertK @andyhhp

                      andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB of RAM to test whether it is a RAM issue.

                      Will report back.

                        joebeasley @AlbertK

                        AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be a BIOS option called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problem since.

                          AlbertK @joebeasley

                          joebeasley My case is a bit different: one or more VMs reboot by themselves, and sometimes one VM becomes inaccessible (no ssh and no console from XO; XO shows CPU at 99% with no network or disk activity, and the VM needs a forced reboot). A few hours after that, the host reboots. This is happening every day now.

                          I am seeing a lot of this in the host dmesg.

                          [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                          [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                          [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
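To see how long each stall lasts, the stalled/ready pairs can be matched up. This is a minimal Python sketch assuming the dmesg format shown above; `VIF_RE`, `stall_durations` and `SAMPLE` are illustrative names. In `vif6.0`, 6 is the domain ID and 0 the interface index.

```python
import re

# Matches lines such as:
# [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
VIF_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] vif vif-(?P<dom>\d+)-(?P<dev>\d+) \S+: "
    r"Guest Rx (?P<state>stalled|ready)"
)

# Lines copied from the dmesg excerpt in this post.
SAMPLE = """\
[105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
[105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
[105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
[105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
[105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
[105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
"""


def stall_durations(log_text):
    """Pair each 'stalled' line with the next 'ready' line for the same vif.

    Returns a list of ((domid, dev), start_timestamp, duration_seconds).
    """
    started = {}
    durations = []
    for line in log_text.splitlines():
        m = VIF_RE.search(line)
        if not m:
            continue
        key = (int(m["dom"]), int(m["dev"]))
        ts = float(m["ts"])
        if m["state"] == "stalled":
            started[key] = ts
        elif key in started:
            start = started.pop(key)
            durations.append((key, start, ts - start))
    return durations


if __name__ == "__main__":
    for (dom, dev), start, length in stall_durations(SAMPLE):
        print(f"vif{dom}.{dev}: stalled at {start:.3f}s for {length:.2f}s")
```

In the excerpt above, each of the three stalls lasts roughly ten seconds.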
                          
                            AlbertK @AlbertK

                            I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but some do not.

                            This happens on a daily basis: either a VM reboots or the host does. I noticed that the crash logs show a consistent pattern of SVM errors on CPU8, and once on CPU11.

                            I then removed CPU8 and CPU11 from the CPU pool. There has been no VM or host reboot for the last 7 days. Any ideas why?

                            xl cpupool-cpu-remove 8,11
                            
                              andyhhp Xen Guru @AlbertK

                              As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incident if CPU8 is excluded.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                Clearly, there's one or 2 damaged core(s). Likely faulty CPU I'm afraid 😞

                                  Riven

                                  If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs) then it's unlikely to be a physical problem.

                                  The Ryzen 3600 is only a 6-core CPU; "cores" 8 & 11 are the SMT (Hyperthreaded) versions of cores 2 & 5.

                                  You could also try turning SMT off.

                                    AlbertK @Riven

                                    Riven,

                                    What I am not sure about is how Xen numbers the CPUs: is it all the real cores first, followed by the SMT/Hyper-Thread siblings, or do they alternate, i.e. real core, Hyper-Thread sibling?
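The two numbering schemes in question give different answers for CPU 8 and CPU 11 on a 6-core, 12-thread part. The sketch below is plain arithmetic for both layouts, not a statement of which one Xen actually uses; the function names are made up for illustration. On the host, `xl info -n` prints a `cpu_topology` table that should show the real CPU-to-core mapping.

```python
def cores_first(cpu, cores=6):
    """Layout A: CPUs 0..cores-1 are the first thread of each core,
    CPUs cores..2*cores-1 are the SMT siblings in the same order.

    Returns (core, thread).
    """
    return cpu % cores, cpu // cores


def interleaved(cpu, threads=2):
    """Layout B: the threads of one core are numbered consecutively,
    so CPUs 0 and 1 share core 0, CPUs 2 and 3 share core 1, and so on.

    Returns (core, thread).
    """
    return cpu // threads, cpu % threads


if __name__ == "__main__":
    for cpu in (8, 11):
        a_core, a_thread = cores_first(cpu)
        b_core, b_thread = interleaved(cpu)
        print(
            f"CPU{cpu}: cores-first -> core {a_core} thread {a_thread}; "
            f"interleaved -> core {b_core} thread {b_thread}"
        )
```

Under the cores-first layout, CPUs 8 and 11 map to cores 2 and 5, which matches Riven's description. Under the interleaved layout they map to cores 4 and 5, so which physical cores are suspect depends on the numbering actually in use.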
