    XCP-ng 8.3 with VM crashing

    16 Posts 5 Posters 441 Views 4 Watching
      AlbertK

      I have an Asus B550M with a Ryzen 3600 CPU and 32 GB of RAM. My VMs (five in total) reboot randomly on a daily basis. Some days the host itself reboots without leaving anything in dmesg or /var/crash. Below is the host dmesg from when two of the VMs rebooted at different times of the day. Three of the VMs run Kubernetes with an average CPU usage of 12% and very low IO and network load. Average host CPU usage is 12%.

      I have run memtest86 and Prime95 on the host for 24 hours each. Both were booted from a USB thumb drive, without XCP-ng running.

      What could be the issue?

      (XEN) [    1.221328] Dom0 has maximum 8 VCPUs
      (XEN) [    1.240452] Initial low memory virq threshold set at 0x4000 pages.
      (XEN) [    1.240453] Scrubbing Free RAM in background
      (XEN) [    1.240454] Std. Loglevel: Errors, warnings and info
      (XEN) [    1.240455] Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
      (XEN) [    1.240456] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
      (XEN) [    1.240646] Freed 2048kB init memory
      (XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
      (XEN) [52089.602174] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 7756387, emul_len 1
      (XEN) [52089.602177] SVM Insn len emulation failed (1): d2v1 64bit @ 0010:ffffffff980c99a5 -> 55 48 89 e5 41 56 41 55 41 89 fd 41 54 53 48 83
      (XEN) [131585.344364] d3v2 NRip reported inst_len 18251923
      (XEN) [131585.344370] Insn mismatch: Expected opcode 0xf0001, modrm 0xd9, got nrip_len 18251923, emul_len 2
      (XEN) [131585.344372] SVM Insn len emulation failed (1): d3v2 64bit @ 0010:ffffffff8f2c7275 -> 88 0f 01 48 d3 e8 eb 8a 80 63 1d fd 48 89 45 d8
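A note for readers of the log above: a single x86 instruction can be at most 15 bytes long, so the two `inst_len` values reported here cannot be genuine instruction lengths. The following is a minimal Python sketch for pulling such entries out of a saved hypervisor log. The function name, the regular expression, and the `SAMPLE` excerpt are illustrative assumptions, not part of any Xen tooling.

```python
import re

# Architectural upper bound on the length of one x86 instruction, in bytes.
MAX_X86_INSN_LEN = 15

# Matches lines such as:
#   (XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
# where d2v1 means domain 2, vCPU 1.
NRIP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+d(?P<dom>\d+)v(?P<vcpu>\d+)\s+"
    r"NRip reported inst_len (?P<len>\d+)"
)


def find_bad_inst_len(log_text):
    """Return (timestamp, domid, vcpu, inst_len) for each implausible inst_len."""
    hits = []
    for m in NRIP_RE.finditer(log_text):
        inst_len = int(m.group("len"))
        if inst_len > MAX_X86_INSN_LEN:
            hits.append(
                (float(m.group("ts")), int(m.group("dom")),
                 int(m.group("vcpu")), inst_len)
            )
    return hits


# Two lines copied from the log in this post.
SAMPLE = """\
(XEN) [52089.602170] d2v1 NRip reported inst_len 7756387
(XEN) [131585.344364] d3v2 NRip reported inst_len 18251923
"""

if __name__ == "__main__":
    for ts, dom, vcpu, inst_len in find_bad_inst_len(SAMPLE):
        print(f"t={ts}s d{dom}v{vcpu}: inst_len={inst_len} exceeds {MAX_X86_INSN_LEN}")
```

Both sample entries are flagged, one for domain 2 and one for domain 3, which matches the two VM reboots described in the post.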
      
        olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        My first gut feeling is a buggy BIOS. Can you check whether you are on the latest version?

          AlbertK @olivierlambert

          @olivierlambert The BIOS is the latest version. XCP-ng is fully patched, and the VMs (Ubuntu 22.04 and 24.04) have the latest patches as well.

            andyhhp Xen Guru @AlbertK

            @AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

              AlbertK @andyhhp

              It is a default install of Ubuntu. I tried the commands below inside the VM and both came back negative.

              lsmod | grep kvm
              
              egrep -c '(vmx|svm)'  /proc/cpuinfo
              

              0

                andyhhp Xen Guru @AlbertK

                @AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM

                  AlbertK @andyhhp

                  @andyhhp
                  This is the param list of one of the VMs that reboots by itself.

                  uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                              name-label ( RW): K8s-node1
                                        name-description ( RW):
                                            user-version ( RW): 1
                                           is-a-template ( RW): false
                                     is-default-template ( RW): false
                                           is-a-snapshot ( RO): false
                                             snapshot-of ( RO): <not in database>
                                               snapshots ( RO):
                                           snapshot-time ( RO): 19700101T00:00:00Z
                                           snapshot-info ( RO):
                                                  parent ( RO): <not in database>
                                                children ( RO):
                                       is-control-domain ( RO): false
                                             power-state ( RO): running
                                           memory-actual ( RO): 4297039872
                                           memory-target ( RO): 4294967296
                                         memory-overhead ( RO): 39845888
                                       memory-static-max ( RW): 4294967296
                                      memory-dynamic-max ( RW): 4294967296
                                      memory-dynamic-min ( RW): 4294967296
                                       memory-static-min ( RW): 1073741824
                                        suspend-VDI-uuid ( RW): <not in database>
                                         suspend-SR-uuid ( RW): <not in database>
                                            VCPUs-params (MRW):
                                               VCPUs-max ( RW): 4
                                        VCPUs-at-startup ( RW): 4
                                  actions-after-shutdown ( RW): Destroy
                                actions-after-softreboot ( RW): Soft reboot
                                    actions-after-reboot ( RW): Restart
                                     actions-after-crash ( RW): Restart
                                           console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                                     hvm ( RO): true
                                                platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                                      allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                                      current-operations (SRO):
                                      blocked-operations (MRW):
                                     allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                                     allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                          possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                             domain-type ( RW): hvm
                                     current-domain-type ( RO): hvm
                                         HVM-boot-policy ( RW): BIOS order
                                         HVM-boot-params (MRW): order: cdn; firmware: uefi
                                   HVM-shadow-multiplier ( RW): 1.000
                                               PV-kernel ( RW):
                                              PV-ramdisk ( RW):
                                                 PV-args ( RW):
                                          PV-legacy-args ( RW):
                                           PV-bootloader ( RW):
                                      PV-bootloader-args ( RW):
                                     last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                        last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                             resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                                affinity ( RW): <not in database>
                                            other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                                  dom-id ( RO): 2
                                         recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                           xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                              ha-always-run ( RW) [DEPRECATED]: false
                                     ha-restart-priority ( RW):
                                                   blobs ( RO):
                                              start-time ( RO): 20250320T19:19:07Z
                                            install-time ( RO): 20241028T05:11:47Z
                                            VCPUs-number ( RO): 4
                                       VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                              os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                                      PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
                      PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                                  memory (MRO):
                                                   disks (MRO):
                                                    VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                                networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                                     PV-drivers-detected ( RO): true
                                                   other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                                    live ( RO): true
                              guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                                     can-use-hotplug-vbd ( RO): unspecified
                                     can-use-hotplug-vif ( RO): unspecified
                                cooperative ( RO) [DEPRECATED]: true
                                                    tags (SRW):
                                               appliance ( RW): <not in database>
                                                  groups ( RW):
                                       snapshot-schedule ( RW): <not in database>
                                        is-vmss-snapshot ( RO): false
                                             start-delay ( RW): 0
                                          shutdown-delay ( RW): 0
                                                   order ( RW): 0
                                                 version ( RO): 0
                                           generation-id ( RO):
                               hardware-platform-version ( RO): 0
                                       has-vendor-device ( RW): false
                                         requires-reboot ( RO): false
                                         reference-label ( RO): ubuntu-22.04
                                            bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                                       pending-guidances ( RO):
                                                   vtpms ( RO):
                           pending-guidances-recommended ( RO):
                                  pending-guidances-full ( RO):
                  
                    andyhhp Xen Guru @AlbertK

                    @AlbertK Thanks. There's no nested-virt configured there.

                    I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

                    Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?
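For anyone repeating the nested-virt check on their own VM: one place the `xe vm-param-list` output above mentions nested virt is the `nested_virt` key inside the `last-boot-record` JSON. The thread does not say which field was actually inspected, so treat the following as a hedged Python sketch. It assumes the command output has been saved as text; the function name and the shortened `SAMPLE` record are illustrative only. The record reflects the state at the VM's last boot.

```python
import json
import re


def nested_virt_from_param_list(text):
    """Return the nested_virt flag from a last-boot-record line, or None if absent."""
    # The record is printed on one line as a single-quoted JSON object.
    m = re.search(r"last-boot-record\s*\(\s*RO\):\s*'(\{.*\})'", text)
    if m is None:
        return None
    record = json.loads(m.group(1))
    return record.get("nested_virt")


# Shortened stand-in for the real output; only a few keys of the record are kept.
SAMPLE = """\
                        hvm ( RO): true
           last-boot-record ( RO): '{"profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"version":2}'
                resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
"""

if __name__ == "__main__":
    print("nested_virt:", nested_virt_from_param_list(SAMPLE))
```

A return value of `None` means the line or the key was not found, which is different from an explicit `false`.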

                      AlbertK @andyhhp

                      @andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB of RAM to test whether it is a RAM issue.

                      Will report back.

                        joebeasley @AlbertK

                        @AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be a BIOS option called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problem since.

                          AlbertK @joebeasley

                          @joebeasley In my case it is more that one or more VMs reboot by themselves, and sometimes one VM becomes inaccessible (no SSH or console from XO; XO shows CPU at 99% with no network or disk activity, and the VM needs a forced reboot). A few hours after that, the host reboots. This is happening every day now.

                          I am seeing a lot of this in the host dmesg.

                          [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                          [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                          [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                          [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
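To put numbers on messages like these, each "Guest Rx stalled" line can be paired with the next "Guest Rx ready" line for the same interface. This is a minimal Python sketch under the assumption that the dmesg text has been saved to a string; the function name and regular expression are illustrative and not part of any XCP-ng tool.

```python
import re

# Matches lines such as:
#   [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
VIF_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+vif\s+\S+\s+(?P<vif>vif\d+\.\d+):\s+"
    r"Guest Rx (?P<state>stalled|ready)"
)


def stall_durations(log_text):
    """Return (vif, seconds) for each stalled -> ready pair, in log order."""
    pending = {}  # vif name -> timestamp of the last unmatched "stalled" line
    durations = []
    for m in VIF_RE.finditer(log_text):
        ts = float(m.group("ts"))
        vif = m.group("vif")
        if m.group("state") == "stalled":
            pending[vif] = ts
        elif vif in pending:
            durations.append((vif, round(ts - pending.pop(vif), 6)))
    return durations


# The six lines quoted in this post.
SAMPLE = """\
[105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
[105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
[105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
[105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
[105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
[105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
"""

if __name__ == "__main__":
    for vif, seconds in stall_durations(SAMPLE):
        print(f"{vif}: stalled for {seconds:.3f} s")
```

For the excerpt above, every stall lasts roughly 10 seconds before the interface reports ready again.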
                          
                            AlbertK @AlbertK

                            I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but others do not.

                            This happens on a daily basis: either a VM reboots or the host does. In the crash logs I notice a consistent pattern of SVM errors on CPU8, and once on CPU11.

                            I then removed CPU8 and CPU11 from the CPU pool. There have been no VM or host reboots for the last 7 days. Any ideas why?

                            xl cpupool-cpu-remove Pool-0 8,11
                            
                              andyhhp Xen Guru @AlbertK

                              As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incidents once CPU8 was excluded.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                Clearly, there are one or two damaged cores. Likely a faulty CPU, I'm afraid 😞

                                  Riven

                                  If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs), then it's unlikely to be a physical problem.

                                  The Ryzen 3600 is only a 6-core CPU; "cores" 8 & 11 are the SMT (hyperthreaded) siblings of cores 2 & 5.

                                  You could also try turning SMT off.

                                    AlbertK @Riven

                                    @Riven,

                                    What I am not sure about is how Xen numbers the CPUs: is it all the real cores first, followed by the SMT/hyperthread siblings, or does it alternate, i.e. real core, hyperthread sibling?
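The two numbering schemes in this question give different answers for CPU8 and CPU11 on a 6-core, 12-thread part. The Python sketch below only illustrates the arithmetic. Which scheme applies on this host is not established in the thread, and the layout names and functions are hypothetical. On a Xen host, `xl info -n` prints the CPU-to-core topology and is one way to check the real mapping.

```python
def core_of(cpu, n_cores, threads_per_core=2, layout="cores_first"):
    """Map a logical CPU number to a physical core index under an assumed layout."""
    n_cpus = n_cores * threads_per_core
    if not 0 <= cpu < n_cpus:
        raise ValueError(f"cpu {cpu} out of range 0..{n_cpus - 1}")
    if layout == "cores_first":
        # CPUs 0..n_cores-1 are first threads, the rest are their SMT siblings.
        return cpu % n_cores
    if layout == "interleaved":
        # SMT siblings are numbered next to each other: (0,1), (2,3), ...
        return cpu // threads_per_core
    raise ValueError(f"unknown layout {layout!r}")


def siblings_of(cpu, n_cores, threads_per_core=2, layout="cores_first"):
    """Return all logical CPUs sharing a physical core with `cpu`."""
    core = core_of(cpu, n_cores, threads_per_core, layout)
    n_cpus = n_cores * threads_per_core
    return [c for c in range(n_cpus)
            if core_of(c, n_cores, threads_per_core, layout) == core]


if __name__ == "__main__":
    for layout in ("cores_first", "interleaved"):
        for cpu in (8, 11):
            print(layout, "CPU", cpu,
                  "-> core", core_of(cpu, 6, layout=layout),
                  "siblings", siblings_of(cpu, 6, layout=layout))
```

Under the cores-first assumption, CPU8 and CPU11 map to cores 2 and 5, as suggested earlier in the thread. Under the interleaved assumption they map to cores 4 and 5. In both cases they sit on two different physical cores.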
