XCP-ng

    XCP-ng 8.3 with VM crashing

    Category: Hardware

      AlbertK @andyhhp

      It is a default install of Ubuntu. I tried the commands below and both came back negative:

      lsmod | grep kvm
      egrep -c '(vmx|svm)' /proc/cpuinfo
      0

        andyhhp Xen Guru @AlbertK

        @AlbertK None of those commands are relevant on a Xen system. You want to run xe vm-param-list uuid=$VM in dom0.
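For a narrower check than the full dump, the platform map (where, as far as I know, per-VM flags such as nested virtualization are set) can be read on its own and searched for a single key. A minimal sketch, assuming bash and awk in dom0; platform_key is a hypothetical helper, and the key names nested-virt and exp-nested-hvm are assumptions about how nested virtualization is flagged on this version. An empty result means the key is not set.

```shell
#!/usr/bin/env bash
# platform_key (hypothetical helper): print the value of KEY from an xe map
# string such as "timeoffset: 0; nic_type: e1000; nx: true".
# Prints nothing if KEY is absent.
platform_key() {
  local map=$1 key=$2
  printf '%s\n' "$map" | tr ';' '\n' | awk -v k="$key" '
    {
      sub(/^[ \t]+/, "")                 # drop the space that follows each ";"
      i = index($0, ": ")                # keys and values are separated by ": "
      if (i > 0 && substr($0, 1, i - 1) == k) { print substr($0, i + 2); exit }
    }'
}

# Usage in dom0, with VM set to the VM's UUID:
#   map=$(xe vm-param-get uuid="$VM" param-name=platform)
#   platform_key "$map" nested-virt      # assumed key name
#   platform_key "$map" exp-nested-hvm   # older key name, also assumed
```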

          AlbertK @andyhhp

          @andyhhp
          This is the param list of one of the VMs that keep rebooting themselves.

          uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                      name-label ( RW): K8s-node1
                                name-description ( RW):
                                    user-version ( RW): 1
                                   is-a-template ( RW): false
                             is-default-template ( RW): false
                                   is-a-snapshot ( RO): false
                                     snapshot-of ( RO): <not in database>
                                       snapshots ( RO):
                                   snapshot-time ( RO): 19700101T00:00:00Z
                                   snapshot-info ( RO):
                                          parent ( RO): <not in database>
                                        children ( RO):
                               is-control-domain ( RO): false
                                     power-state ( RO): running
                                   memory-actual ( RO): 4297039872
                                   memory-target ( RO): 4294967296
                                 memory-overhead ( RO): 39845888
                               memory-static-max ( RW): 4294967296
                              memory-dynamic-max ( RW): 4294967296
                              memory-dynamic-min ( RW): 4294967296
                               memory-static-min ( RW): 1073741824
                                suspend-VDI-uuid ( RW): <not in database>
                                 suspend-SR-uuid ( RW): <not in database>
                                    VCPUs-params (MRW):
                                       VCPUs-max ( RW): 4
                                VCPUs-at-startup ( RW): 4
                          actions-after-shutdown ( RW): Destroy
                        actions-after-softreboot ( RW): Soft reboot
                            actions-after-reboot ( RW): Restart
                             actions-after-crash ( RW): Restart
                                   console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                             hvm ( RO): true
                                        platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                              allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                              current-operations (SRO):
                              blocked-operations (MRW):
                             allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                             allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                  possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                     domain-type ( RW): hvm
                             current-domain-type ( RO): hvm
                                 HVM-boot-policy ( RW): BIOS order
                                 HVM-boot-params (MRW): order: cdn; firmware: uefi
                           HVM-shadow-multiplier ( RW): 1.000
                                       PV-kernel ( RW):
                                      PV-ramdisk ( RW):
                                         PV-args ( RW):
                                  PV-legacy-args ( RW):
                                   PV-bootloader ( RW):
                              PV-bootloader-args ( RW):
                             last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                     resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                        affinity ( RW): <not in database>
                                    other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                          dom-id ( RO): 2
                                 recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                   xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                      ha-always-run ( RW) [DEPRECATED]: false
                             ha-restart-priority ( RW):
                                           blobs ( RO):
                                      start-time ( RO): 20250320T19:19:07Z
                                    install-time ( RO): 20241028T05:11:47Z
                                    VCPUs-number ( RO): 4
                               VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                      os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                              PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
              PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                          memory (MRO):
                                           disks (MRO):
                                            VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                        networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                             PV-drivers-detected ( RO): true
                                           other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                            live ( RO): true
                      guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                             can-use-hotplug-vbd ( RO): unspecified
                             can-use-hotplug-vif ( RO): unspecified
                        cooperative ( RO) [DEPRECATED]: true
                                            tags (SRW):
                                       appliance ( RW): <not in database>
                                          groups ( RW):
                               snapshot-schedule ( RW): <not in database>
                                is-vmss-snapshot ( RO): false
                                     start-delay ( RW): 0
                                  shutdown-delay ( RW): 0
                                           order ( RW): 0
                                         version ( RO): 0
                                   generation-id ( RO):
                       hardware-platform-version ( RO): 0
                               has-vendor-device ( RW): false
                                 requires-reboot ( RO): false
                                 reference-label ( RO): ubuntu-22.04
                                    bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                               pending-guidances ( RO):
                                           vtpms ( RO):
                   pending-guidances-recommended ( RO):
                          pending-guidances-full ( RO):
          
            andyhhp Xen Guru @AlbertK

            @AlbertK Thanks. There's no nested-virt configured there.

            I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but what we see is one clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB (the AMD SVM Virtual Machine Control Block).

            Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

              AlbertK @andyhhp

              @andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another 2x16GB set of RAM to test whether it is a RAM issue.

              Will report back.

                joebeasley @AlbertK

                @AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be a BIOS option called "cstates", which has to do with processor power saving. I disabled every C-state option in the BIOS and have not had the reboot problems since.

                  AlbertK @joebeasley

                  @joebeasley Mine is more that one or more VMs reboot on their own, and sometimes a VM becomes inaccessible (no SSH, no console from XO): it shows 99% CPU with no network or disk activity in XO and needs a forced reboot. A few hours after that, the host itself reboots. This is happening every day now.

                  I am seeing a lot of this in the host's dmesg:

                  [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                  [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                  [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                  [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                  [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                  [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
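To see how often each vif stalls and for how long, rather than scrolling through dmesg, the "stalled" and "ready" lines can be paired up. A minimal sketch, assuming the plain [seconds.microseconds] dmesg timestamps shown above; summarize_rx_stalls is a hypothetical name. On the six lines above it reports three stalls on vif6.0, the longest lasting about 10.2 seconds.

```shell
#!/usr/bin/env bash
# summarize_rx_stalls (hypothetical helper): read dmesg lines on stdin and,
# for each vif, print how many "Guest Rx stalled" events were logged and the
# longest gap in seconds between a "stalled" line and the next "ready" line
# for the same vif (0.0 if it never logged "ready").
summarize_rx_stalls() {
  awk '
    # Only consider lines that start with a "[seconds.micro]" timestamp.
    match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
      ts = substr($0, RSTART + 1, RLENGTH - 2)
      gsub(/ /, "", ts)
      # The vif name looks like "vif6.0:"; skip lines without one.
      if (!match($0, /vif[0-9]+\.[0-9]+:/)) next
      vif = substr($0, RSTART, RLENGTH - 1)
      if ($0 ~ /Guest Rx stalled/) {
        count[vif]++
        start[vif] = ts
      } else if ($0 ~ /Guest Rx ready/ && (vif in start)) {
        d = ts - start[vif]
        if (!(vif in maxd) || d > maxd[vif]) maxd[vif] = d
        delete start[vif]
      }
    }
    END {
      for (v in count)
        printf "%s stalls=%d max_stall=%.1fs\n", v, count[v], maxd[v]
    }'
}

# Usage in dom0:
#   dmesg | summarize_rx_stalls
```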
                  
                    AlbertK @AlbertK

                    I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but some do not.

                    This happens daily: either a VM reboots or the host does. In the crashes that do leave a log, I see a consistent pattern of SVM errors on CPU8, and once on CPU11.

                    I then removed CPU8 and CPU11 from the CPU pool. There has been no VM or host reboot for the last 7 days. Any ideas why?

                    xl cpupool-cpu-remove Pool-0 8,11
                    
                      andyhhp Xen Guru @AlbertK

                      As I said before, this looks like a buggy CPU, and you've proved it: a week with no incidents once CPU8 is excluded.

                        olivierlambert Vates 🪐 Co-Founder CEO

                        Clearly there are one or two damaged cores. It's likely a faulty CPU, I'm afraid 😞

                          Riven

                          If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs), then it's unlikely to be a physical problem.

                          The Ryzen 3600 is only a 6-core CPU; "cores" 8 and 11 are the SMT (Hyper-Threading) siblings of cores 2 and 5.

                          You could also try turning SMT off.
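Whether CPU8 and CPU11 really are the siblings of cores 2 and 5 depends on how the logical CPUs are numbered. The arithmetic for the two common schemes, sketched for an assumed 6 cores with 2 SMT threads each; the helper names are hypothetical, and neither scheme is claimed here to be the one Xen uses on this host.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: map a logical CPU number to a physical core index,
# assuming 6 cores with 2 SMT threads per core (Ryzen 3600).
NCORES=6
THREADS=2

# Sequential numbering: CPUs 0-5 are the first thread of each core and
# CPUs 6-11 are their SMT siblings, so core = cpu mod NCORES.
core_sequential() { echo $(( $1 % NCORES )); }

# Interleaved numbering: sibling threads are adjacent, so (0,1) share
# core 0, (2,3) share core 1, and so on, giving core = cpu / THREADS.
core_interleaved() { echo $(( $1 / THREADS )); }

for cpu in 8 11; do
  echo "CPU$cpu: sequential -> core $(core_sequential "$cpu"), interleaved -> core $(core_interleaved "$cpu")"
done
```

Under sequential numbering the mapping is as described in the post; under interleaved numbering, CPU8 and CPU11 would sit on cores 4 and 5, with CPU9 and CPU10 as their siblings.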

                            AlbertK @Riven

                            @Riven,

                            What I am not sure about is how Xen numbers the CPUs: is it all the physical cores first, followed by the SMT (Hyper-Threading) siblings, or does it alternate, i.e. physical core, SMT sibling, physical core, and so on?
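Rather than guess, the numbering can be read on the host: xl info -n prints a cpu_topology table giving the core, socket and node of every logical CPU, and CPUs that share a (socket, core) pair are SMT siblings. A minimal sketch that groups them, assuming the table's header reads "cpu: core socket node" with one numeric row per CPU; smt_siblings is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# smt_siblings (hypothetical helper): read `xl info -n` output on stdin and
# print, for each (socket, core) pair, the logical CPUs that share it.
# Assumes a "cpu: core socket node" header followed by rows such as
# "  0:       0        0        0".
smt_siblings() {
  awk '
    /^cpu:[ \t]+core[ \t]+socket[ \t]+node/ { in_topo = 1; next }
    in_topo && /^[ \t]*[0-9]+:[ \t]+[0-9]+[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*$/ {
      cpu = $1
      sub(/:$/, "", cpu)
      key = "socket " $3 " core " $2
      if (key in cpus) cpus[key] = cpus[key] " " cpu
      else cpus[key] = cpu
      next
    }
    # Any other line ends the topology table (e.g. the numa_info section).
    in_topo { in_topo = 0 }
    END { for (k in cpus) print k ": CPUs " cpus[k] }
  ' | sort -k2,2n -k4,4n
}

# Usage in dom0:
#   xl info -n | smt_siblings
```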
