XCP-ng

    XCP-ng 8.3 with VM crashing

      andyhhp Xen Guru @AlbertK

      @AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

        AlbertK @andyhhp

        It is a default install of Ubuntu. I ran the commands below and both came back negative.

        lsmod | grep kvm
        
        egrep -c '(vmx|svm)'  /proc/cpuinfo
        

        0

          andyhhp Xen Guru @AlbertK

          @AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM

            AlbertK @andyhhp

            @andyhhp
            This is the param list of one of the VMs that keeps rebooting by itself.

            uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                        name-label ( RW): K8s-node1
                                  name-description ( RW):
                                      user-version ( RW): 1
                                     is-a-template ( RW): false
                               is-default-template ( RW): false
                                     is-a-snapshot ( RO): false
                                       snapshot-of ( RO): <not in database>
                                         snapshots ( RO):
                                     snapshot-time ( RO): 19700101T00:00:00Z
                                     snapshot-info ( RO):
                                            parent ( RO): <not in database>
                                          children ( RO):
                                 is-control-domain ( RO): false
                                       power-state ( RO): running
                                     memory-actual ( RO): 4297039872
                                     memory-target ( RO): 4294967296
                                   memory-overhead ( RO): 39845888
                                 memory-static-max ( RW): 4294967296
                                memory-dynamic-max ( RW): 4294967296
                                memory-dynamic-min ( RW): 4294967296
                                 memory-static-min ( RW): 1073741824
                                  suspend-VDI-uuid ( RW): <not in database>
                                   suspend-SR-uuid ( RW): <not in database>
                                      VCPUs-params (MRW):
                                         VCPUs-max ( RW): 4
                                  VCPUs-at-startup ( RW): 4
                            actions-after-shutdown ( RW): Destroy
                          actions-after-softreboot ( RW): Soft reboot
                              actions-after-reboot ( RW): Restart
                               actions-after-crash ( RW): Restart
                                     console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                               hvm ( RO): true
                                          platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                                allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                                current-operations (SRO):
                                blocked-operations (MRW):
                               allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                               allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                    possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                       domain-type ( RW): hvm
                               current-domain-type ( RO): hvm
                                   HVM-boot-policy ( RW): BIOS order
                                   HVM-boot-params (MRW): order: cdn; firmware: uefi
                             HVM-shadow-multiplier ( RW): 1.000
                                         PV-kernel ( RW):
                                        PV-ramdisk ( RW):
                                           PV-args ( RW):
                                    PV-legacy-args ( RW):
                                     PV-bootloader ( RW):
                                PV-bootloader-args ( RW):
                               last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                  last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                       resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                          affinity ( RW): <not in database>
                                      other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                            dom-id ( RO): 2
                                   recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                     xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                        ha-always-run ( RW) [DEPRECATED]: false
                               ha-restart-priority ( RW):
                                             blobs ( RO):
                                        start-time ( RO): 20250320T19:19:07Z
                                      install-time ( RO): 20241028T05:11:47Z
                                      VCPUs-number ( RO): 4
                                 VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                        os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                                PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
                PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                            memory (MRO):
                                             disks (MRO):
                                              VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                          networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                               PV-drivers-detected ( RO): true
                                             other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                              live ( RO): true
                        guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                               can-use-hotplug-vbd ( RO): unspecified
                               can-use-hotplug-vif ( RO): unspecified
                          cooperative ( RO) [DEPRECATED]: true
                                              tags (SRW):
                                         appliance ( RW): <not in database>
                                            groups ( RW):
                                 snapshot-schedule ( RW): <not in database>
                                  is-vmss-snapshot ( RO): false
                                       start-delay ( RW): 0
                                    shutdown-delay ( RW): 0
                                             order ( RW): 0
                                           version ( RO): 0
                                     generation-id ( RO):
                         hardware-platform-version ( RO): 0
                                 has-vendor-device ( RW): false
                                   requires-reboot ( RO): false
                                   reference-label ( RO): ubuntu-22.04
                                      bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                                 pending-guidances ( RO):
                                             vtpms ( RO):
                     pending-guidances-recommended ( RO):
                            pending-guidances-full ( RO):
            
              andyhhp Xen Guru @AlbertK

              @AlbertK Thanks. There's no nested-virt configured there.
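
For readers checking the same thing: the `last-boot-record` field in the param list pasted above carries a `nested_virt` flag. The following is a minimal Python sketch, assuming the output format shown in this thread, that pulls that flag out of saved `xe vm-param-list` output. The function name and the trimmed sample record are illustrative only, and this is not claimed to be the only place nested virt can show up.

```python
import json


def nested_virt_from_param_list(param_list_text):
    """Return the nested_virt flag found in `xe vm-param-list` output.

    Returns None when there is no last-boot-record line, when the record
    is empty, or when the record has no nested_virt key.
    """
    for line in param_list_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip().startswith("last-boot-record"):
            continue
        # The record is printed as a single-quoted JSON string.
        raw = value.strip().strip("'")
        if not raw:
            return None
        return json.loads(raw).get("nested_virt")
    return None


# Trimmed-down stand-in for the record pasted in the thread (assumption:
# only the keys needed for the illustration are kept).
sample = """\
                               power-state ( RO): running
                          last-boot-record ( RO): '{"nested_virt":false,"nomigrate":false,"version":2}'
"""

print(nested_virt_from_param_list(sample))
```

On the sample above this prints `False`, matching the `"nested_virt":false` entry in the real record.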

              I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

              Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

                AlbertK @andyhhp

                @andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB of RAM to test whether it is a RAM issue.

                Will report back.

                  joebeasley @AlbertK

                  @AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be an option in the BIOS called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problems since.

                    AlbertK @joebeasley

                    @joebeasley Mine is more a case of one or more VMs rebooting on their own. Sometimes one VM also becomes inaccessible: I cannot SSH in or open the console from XO, XO shows CPU at 99% with no network or disk activity, and I have to force a reboot. A few hours after that, the host itself reboots. This is happening every day now.

                    I am seeing a lot of this in the host dmesg.

                    [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                    [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                    [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                    [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
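
To put numbers on the excerpt above, here is a small Python sketch, assuming the dmesg line format shown in this post, that pairs each "Guest Rx stalled" line with the next "Guest Rx ready" line on the same interface and reports how long the stall lasted. The function name is made up for illustration, and the sketch only measures the gaps; it does not try to explain their cause.

```python
import re

# Matches lines such as:
# [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
LINE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+vif\s+\S+\s+(\S+):\s+Guest Rx (stalled|ready)"
)


def stall_durations(dmesg_text):
    """Return (interface, stall_start, seconds_stalled) per stalled/ready pair."""
    pending = {}  # interface -> timestamp of the unmatched "stalled" line
    result = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        iface, state = match.group(2), match.group(3)
        if state == "stalled":
            pending[iface] = timestamp
        elif iface in pending:
            start = pending.pop(iface)
            result.append((iface, start, round(timestamp - start, 3)))
    return result


excerpt = """\
[105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
[105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
[105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
[105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
[105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
[105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
"""

for iface, start, seconds in stall_durations(excerpt):
    print(f"{iface}: stalled at {start:.6f} for {seconds:.3f} s")
```

Run against the six lines above, it reports three stalls on vif6.0 of roughly 10 seconds each.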
                    
                      AlbertK @AlbertK

                      I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but some do not.

                      This happens on a daily basis: either a VM reboots or the host does. I noticed that the crash logs show a consistent pattern of SVM errors on CPU8, and once on CPU11.

                      I then tried removing CPU8 and CPU11 from the CPU pool. There has been no VM or host reboot for the last 7 days. Any ideas why?

                      xl cpupool-cpu-remove Pool-0 8,11
                      
                        andyhhp Xen Guru @AlbertK

                        As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incident if CPU8 is excluded.

                          olivierlambert Vates 🪐 Co-Founder CEO

                          Clearly, there are one or two damaged cores. Likely a faulty CPU, I'm afraid 😞

                            Riven

                            If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs), then it's unlikely to be a physical problem.

                            The Ryzen 3600 is only a 6-core CPU; "cores" 8 and 11 are the SMT (hyperthreaded) siblings of cores 2 and 5.

                            You could also try turning SMT off.

                              AlbertK @Riven

                              @Riven,

                              What I am not sure about is how Xen numbers the CPUs. Is it all the real cores first, followed by their SMT/hyperthread siblings? Or does it alternate, i.e. real core, hyperthread sibling, real core, and so on?
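
The two numbering schemes in this question can be compared with a short Python sketch. It is purely illustrative: the layout names "split" and "interleaved" and the function name are made up here, the 6-core, 2-thread figures come from the Ryzen 3600 mentioned earlier in the thread, and nothing in this thread confirms which scheme the host actually uses. Running `xl info -n` in dom0 should print the real CPU-to-core topology.

```python
def locate(cpu, cores=6, threads=2, layout="split"):
    """Map a logical CPU number to (core, thread) under a hypothetical layout."""
    if not 0 <= cpu < cores * threads:
        raise ValueError("cpu out of range")
    if layout == "split":
        # CPUs 0..cores-1 are the first threads; the siblings follow as a block.
        return cpu % cores, cpu // cores
    if layout == "interleaved":
        # The sibling threads of one core are numbered consecutively.
        return cpu // threads, cpu % threads
    raise ValueError("unknown layout")


for cpu in (8, 11):
    print(cpu, locate(cpu, layout="split"), locate(cpu, layout="interleaved"))
```

Under the "split" layout, CPU8 and CPU11 land on cores 2 and 5 as second threads, which is the reading given earlier in the thread. Under the "interleaved" layout, CPU8 is the first thread of core 4 and CPU11 is the second thread of core 5, so the two readings point at different physical cores.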
