XCP-ng

    XCP-ng 8.3 with VM crashing

    • A Offline
      andyhhp Xen Guru @AlbertK
      last edited by

      @AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

      • A Offline
        AlbertK @andyhhp
        last edited by

        It is a default install of the Ubuntu OS. I tried the commands below and both came back negative.

        lsmod | grep kvm
        
        egrep -c '(vmx|svm)'  /proc/cpuinfo
        

        0

        • A Offline
          andyhhp Xen Guru @AlbertK
          last edited by

          @AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM

          • A Offline
            AlbertK @andyhhp
            last edited by

            @andyhhp
            This is the param list of one of the VMs that reboots by itself.

            uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                        name-label ( RW): K8s-node1
                                  name-description ( RW):
                                      user-version ( RW): 1
                                     is-a-template ( RW): false
                               is-default-template ( RW): false
                                     is-a-snapshot ( RO): false
                                       snapshot-of ( RO): <not in database>
                                         snapshots ( RO):
                                     snapshot-time ( RO): 19700101T00:00:00Z
                                     snapshot-info ( RO):
                                            parent ( RO): <not in database>
                                          children ( RO):
                                 is-control-domain ( RO): false
                                       power-state ( RO): running
                                     memory-actual ( RO): 4297039872
                                     memory-target ( RO): 4294967296
                                   memory-overhead ( RO): 39845888
                                 memory-static-max ( RW): 4294967296
                                memory-dynamic-max ( RW): 4294967296
                                memory-dynamic-min ( RW): 4294967296
                                 memory-static-min ( RW): 1073741824
                                  suspend-VDI-uuid ( RW): <not in database>
                                   suspend-SR-uuid ( RW): <not in database>
                                      VCPUs-params (MRW):
                                         VCPUs-max ( RW): 4
                                  VCPUs-at-startup ( RW): 4
                            actions-after-shutdown ( RW): Destroy
                          actions-after-softreboot ( RW): Soft reboot
                              actions-after-reboot ( RW): Restart
                               actions-after-crash ( RW): Restart
                                     console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                               hvm ( RO): true
                                          platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                                allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                                current-operations (SRO):
                                blocked-operations (MRW):
                               allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                               allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                    possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                       domain-type ( RW): hvm
                               current-domain-type ( RO): hvm
                                   HVM-boot-policy ( RW): BIOS order
                                   HVM-boot-params (MRW): order: cdn; firmware: uefi
                             HVM-shadow-multiplier ( RW): 1.000
                                         PV-kernel ( RW):
                                        PV-ramdisk ( RW):
                                           PV-args ( RW):
                                    PV-legacy-args ( RW):
                                     PV-bootloader ( RW):
                                PV-bootloader-args ( RW):
                               last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                  last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                       resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                          affinity ( RW): <not in database>
                                      other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                            dom-id ( RO): 2
                                   recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                     xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                        ha-always-run ( RW) [DEPRECATED]: false
                               ha-restart-priority ( RW):
                                             blobs ( RO):
                                        start-time ( RO): 20250320T19:19:07Z
                                      install-time ( RO): 20241028T05:11:47Z
                                      VCPUs-number ( RO): 4
                                 VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                        os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                                PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
                PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                            memory (MRO):
                                             disks (MRO):
                                              VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                          networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                               PV-drivers-detected ( RO): true
                                             other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                              live ( RO): true
                        guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                               can-use-hotplug-vbd ( RO): unspecified
                               can-use-hotplug-vif ( RO): unspecified
                          cooperative ( RO) [DEPRECATED]: true
                                              tags (SRW):
                                         appliance ( RW): <not in database>
                                            groups ( RW):
                                 snapshot-schedule ( RW): <not in database>
                                  is-vmss-snapshot ( RO): false
                                       start-delay ( RW): 0
                                    shutdown-delay ( RW): 0
                                             order ( RW): 0
                                           version ( RO): 0
                                     generation-id ( RO):
                         hardware-platform-version ( RO): 0
                                 has-vendor-device ( RW): false
                                   requires-reboot ( RO): false
                                   reference-label ( RO): ubuntu-22.04
                                      bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                                 pending-guidances ( RO):
                                             vtpms ( RO):
                     pending-guidances-recommended ( RO):
                            pending-guidances-full ( RO):
            
            • A Offline
              andyhhp Xen Guru @AlbertK
              last edited by

              @AlbertK Thanks. There's no nested-virt configured there.
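The `last-boot-record` field in the listing above carries a `"nested_virt":false` entry, which is one place the setting shows up. A minimal sketch for pulling that flag out of a long `xe vm-param-list` dump, assuming the JSON key is spelled exactly as in the output above (the helper name `nested_virt_flag` is made up for illustration):

```shell
# Print the value of the nested_virt flag (true/false) found in
# `xe vm-param-list` style output supplied on stdin.
# Prints nothing if the key is absent.
nested_virt_flag() {
    grep -o '"nested_virt":[a-z]*' | head -n 1 | cut -d: -f2
}
```

On the host this would be used as `xe vm-param-list uuid=$VM | nested_virt_flag`. Note that it reflects the record from the VM's last boot, not any change made since.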

              I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

              Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

              • A Offline
                AlbertK @andyhhp
                last edited by

                @andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB of RAM to test whether it is a RAM issue.

                Will report back.

                • J Offline
                  joebeasley @AlbertK
                  last edited by

                  @AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be an option in the BIOS called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problems since.

                  • A Offline
                    AlbertK @joebeasley
                    last edited by AlbertK

                    @joebeasley Mine is more that one or more VMs will reboot on their own, and sometimes one VM becomes inaccessible (no SSH or console from XO; XO shows CPU at 99% with no network or disk activity, and the VM needs a forced reboot). A few hours after that, the host itself reboots. This is happening every day now.

                    I am seeing a lot of this in the host dmesg.

                    [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                    [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                    [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
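To see which guest interface these messages cluster on, the stalls can be tallied per vif. A rough sketch, assuming the dmesg line format shown above (the function name `count_rx_stalls` is hypothetical):

```shell
# Count "Guest Rx stalled" events per vif from dmesg text on stdin.
# The interface name is the fourth field from the end, e.g. "vif6.0:".
count_rx_stalls() {
    awk '/Guest Rx stalled/ {
             name = $(NF - 3)
             sub(/:$/, "", name)      # drop the trailing colon
             stalls[name]++
         }
         END { for (name in stalls) print name, stalls[name] }' | sort
}
```

On the host this would be run as `dmesg | count_rx_stalls`. Since netback names interfaces `vif<domid>.<devid>`, `vif6.0` is device 0 of domain 6, and `xl list` shows which VM currently has that domain ID.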
                    
                    • A Offline
                      AlbertK @AlbertK
                      last edited by

                      I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but some do not.

                      This happens daily: either a VM reboots or the host does. I noticed that the crash logs show a consistent pattern of SVM errors on CPU8, and once on CPU11.

                      I then tried removing CPU8 and CPU11 from the CPU pool. There has been no VM or host reboot for the last 7 days. Any ideas why?

                      xl cpupool-cpu-remove 8,11
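As far as the `xl` man page documents it, `cpupool-cpu-remove` takes the pool name as its first argument, followed by the CPUs. The sketch below only prints the commands so they can be reviewed before running. The pool name `Pool-0` used in the example is an assumption (it is Xen's usual default pool name), and the helper name `cpupool_remove_cmds` is made up:

```shell
# Print (do not run) one `xl cpupool-cpu-remove` command per CPU.
# $1 is the pool name; the remaining arguments are CPU numbers.
cpupool_remove_cmds() {
    pool=$1
    shift
    for cpu in "$@"; do
        printf 'xl cpupool-cpu-remove %s %s\n' "$pool" "$cpu"
    done
}
```

For example, `cpupool_remove_cmds Pool-0 8 11` prints the two commands for CPU 8 and CPU 11. `xl cpupool-list -c` lists the pool names and the CPUs each pool currently holds, which is worth checking before and after.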
                      
                      • A Offline
                        andyhhp Xen Guru @AlbertK
                        last edited by

                        As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incident if CPU8 is excluded.

                        • olivierlambertO Offline
                          olivierlambert Vates 🪐 Co-Founder CEO
                          last edited by

                          Clearly, there are one or two damaged cores. Likely a faulty CPU, I'm afraid 😞

                          • R Offline
                            Riven
                            last edited by

                            If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs), then it's unlikely to be a physical problem.

                            The Ryzen 3600 is only a 6-core CPU; "cores" 8 & 11 are the SMT (hyperthreaded) siblings of cores 2 & 5.

                            You could also try turning SMT off.

                            • A Offline
                              AlbertK @Riven
                              last edited by

                              @Riven,

                              What I am not sure about is how Xen arranges the CPUs: is it all the real cores first, followed by the SMT/hyperthread siblings, or does it alternate, i.e. real core, then hyperthread?
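One way to check rather than guess is to read the mapping from the host itself: `xl info -n` prints a `cpu_topology` table with one row per logical CPU (columns: cpu, core, socket, node). A sketch that groups logical CPUs by physical core, assuming that table layout (the function name `smt_siblings` is hypothetical):

```shell
# Group logical CPUs by (socket, core) from `xl info -n` text on stdin.
# Only rows inside the cpu_topology section are considered.
smt_siblings() {
    awk '
        /^cpu_topology/ { in_topo = 1; next }
        /^numa_info/    { in_topo = 0 }
        in_topo && $1 ~ /^[0-9]+:$/ && NF >= 4 {
            cpu = $1
            sub(/:$/, "", cpu)
            key = "socket " $3 " core " $2
            if (!(key in cpus)) {
                order[++n] = key        # remember first-seen order
                cpus[key] = cpu
            } else {
                cpus[key] = cpus[key] " " cpu
            }
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " cpus[order[i]] }
    '
}
```

Running `xl info -n | smt_siblings` on the host prints one line per physical core with the logical CPU numbers that share it, so CPU 8 and CPU 11 can be matched to their actual cores and siblings.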
