    XCP-ng 8.3 with VM crashing

      olivierlambert Vates 🪐 Co-Founder CEO

      Hi,

      My first gut feeling is a buggy BIOS. Can you check whether you are on the latest version?

        AlbertK @olivierlambert

        @olivierlambert The BIOS is the latest version. XCP-ng has the latest patches, and the VMs (Ubuntu 22.04 and 24.04) are fully patched as well.

          andyhhp Xen Guru @AlbertK

          @AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

            AlbertK @andyhhp

            It is a default install of Ubuntu. I tried the commands below and the results are negative.

            lsmod | grep kvm
            
            egrep -c '(vmx|svm)'  /proc/cpuinfo
            

            0

              andyhhp Xen Guru @AlbertK

              @AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM
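
The full parameter list is long, and only a couple of its fields bear on nested virtualisation. A minimal sketch, assuming the output layout that `xe vm-param-list` prints in this thread: it keeps the `platform` map (where a per-VM nested-virt key would be set; the exact key name depends on the release, so none is assumed here) and the `nested_virt` flag recorded in `last-boot-record`. The helper name `nested_fields` is made up for illustration and is not part of the xe CLI.

```shell
# Filter `xe vm-param-list` output (read from stdin) down to the fields
# that show whether nested virtualisation is configured.
# nested_fields is a hypothetical helper name, not an xe subcommand.
nested_fields() {
    awk '
        # The platform map is where a per-VM nested-virt key would appear.
        /^ *platform \(/ { sub(/^ +/, ""); print }
        # last-boot-record is JSON; report its nested_virt flag if present.
        /^ *last-boot-record \(/ {
            if (match($0, /"nested_virt":(true|false)/))
                print "last-boot-record " substr($0, RSTART, RLENGTH)
        }
    '
}

# Usage on the host, with VM set to the UUID of the VM:
#   xe vm-param-list uuid="$VM" | nested_fields
```

If the platform line shows no nested-virt key and the boot record reports `"nested_virt":false`, the VM was not started with nested virtualisation.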

                AlbertK @andyhhp

                @andyhhp
                This is the parameter list of one of the VMs that keeps rebooting by itself.

                uuid ( RO)                                  : 1199c4b4-6072-7086-7286-7d7d1cad2c33
                                            name-label ( RW): K8s-node1
                                      name-description ( RW):
                                          user-version ( RW): 1
                                         is-a-template ( RW): false
                                   is-default-template ( RW): false
                                         is-a-snapshot ( RO): false
                                           snapshot-of ( RO): <not in database>
                                             snapshots ( RO):
                                         snapshot-time ( RO): 19700101T00:00:00Z
                                         snapshot-info ( RO):
                                                parent ( RO): <not in database>
                                              children ( RO):
                                     is-control-domain ( RO): false
                                           power-state ( RO): running
                                         memory-actual ( RO): 4297039872
                                         memory-target ( RO): 4294967296
                                       memory-overhead ( RO): 39845888
                                     memory-static-max ( RW): 4294967296
                                    memory-dynamic-max ( RW): 4294967296
                                    memory-dynamic-min ( RW): 4294967296
                                     memory-static-min ( RW): 1073741824
                                      suspend-VDI-uuid ( RW): <not in database>
                                       suspend-SR-uuid ( RW): <not in database>
                                          VCPUs-params (MRW):
                                             VCPUs-max ( RW): 4
                                      VCPUs-at-startup ( RW): 4
                                actions-after-shutdown ( RW): Destroy
                              actions-after-softreboot ( RW): Soft reboot
                                  actions-after-reboot ( RW): Restart
                                   actions-after-crash ( RW): Restart
                                         console-uuids (SRO): 7c1c7058-8b18-06ca-60f5-9cbfedec2d11
                                                   hvm ( RO): true
                                              platform (MRW): timeoffset: 0; nic_type: e1000; device-model: qemu-upstream-uefi; secureboot: false; vga: std; videoram: 8; viridian: false; device_id: 0001; nx: true; acpi: 1; apic: true; pae: true; hpet: true
                                    allowed-operations (SRO): metadata_export; changing_VCPUs_live; changing_dynamic_range; migrate_send; pool_migrate; suspend; hard_reboot; hard_shutdown; clean_reboot; clean_shutdown; pause; checkpoint; snapshot
                                    current-operations (SRO):
                                    blocked-operations (MRW):
                                   allowed-VBD-devices (SRO): 1; 2; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72; 73; 74; 75; 76; 77; 78; 79; 80; 81; 82; 83; 84; 85; 86; 87; 88; 89; 90; 91; 92; 93; 94; 95; 96; 97; 98; 99; 100; 101; 102; 103; 104; 105; 106; 107; 108; 109; 110; 111; 112; 113; 114; 115; 116; 117; 118; 119; 120; 121; 122; 123; 124; 125; 126; 127; 128; 129; 130; 131; 132; 133; 134; 135; 136; 137; 138; 139; 140; 141; 142; 143; 144; 145; 146; 147; 148; 149; 150; 151; 152; 153; 154; 155; 156; 157; 158; 159; 160; 161; 162; 163; 164; 165; 166; 167; 168; 169; 170; 171; 172; 173; 174; 175; 176; 177; 178; 179; 180; 181; 182; 183; 184; 185; 186; 187; 188; 189; 190; 191; 192; 193; 194; 195; 196; 197; 198; 199; 200; 201; 202; 203; 204; 205; 206; 207; 208; 209; 210; 211; 212; 213; 214; 215; 216; 217; 218; 219; 220; 221; 222; 223; 224; 225; 226; 227; 228; 229; 230; 231; 232; 233; 234; 235; 236; 237; 238; 239; 240; 241; 242; 243; 244; 245; 246; 247; 248; 249; 250; 251; 252; 253; 254
                                   allowed-VIF-devices (SRO): 1; 2; 3; 4; 5; 6
                                        possible-hosts ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                           domain-type ( RW): hvm
                                   current-domain-type ( RO): hvm
                                       HVM-boot-policy ( RW): BIOS order
                                       HVM-boot-params (MRW): order: cdn; firmware: uefi
                                 HVM-shadow-multiplier ( RW): 1.000
                                             PV-kernel ( RW):
                                            PV-ramdisk ( RW):
                                               PV-args ( RW):
                                        PV-legacy-args ( RW):
                                         PV-bootloader ( RW):
                                    PV-bootloader-args ( RW):
                                   last-boot-CPU-flags ( RO): vendor: AuthenticAMD; features: 178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000
                                      last-boot-record ( RO): '{"platformdata":{"timeoffset":"0","featureset":"178bfbff-f6f83203-2e500800-040001f3-0000000f-219c01a9-00400004-00000000-00101005-00000000-00000000-10000044-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000-00000000","usb":"true","usb_tablet":"true","device-model":"qemu-upstream-uefi","secureboot":"false","vga":"std","videoram":"8","viridian":"false","device_id":"0001","nx":"true","acpi":"1","apic":"true","pae":"true","hpet":"true"},"xen_platform":[1,2],"pv_drivers_detected":true,"pci_power_mgmt":false,"pci_msitranslate":true,"qemu_vifs":[],"qemu_vbds":[],"suspend_memory_bytes":2149556224,"original_profile":"Qemu_upstream_uefi","profile":"Qemu_upstream_uefi","nested_virt":false,"nomigrate":false,"domain_config":["X86",{"misc_flags":[],"emulation_flags":["X86_EMU_LAPIC","X86_EMU_HPET","X86_EMU_PM","X86_EMU_RTC","X86_EMU_IOAPIC","X86_EMU_PIC","X86_EMU_VGA","X86_EMU_IOMMU","X86_EMU_PIT","X86_EMU_USE_PIRQ"]}],"last_start_time":1730178556.316762,"ty":["HVM",{"firmware":["Uefi",{"backend":"xapidb","on_boot":"Persist"}],"qemu_stubdom":false,"qemu_disk_cmdline":false,"boot_order":"cdn","pci_passthrough":false,"pci_emulations":[],"serial":"pty","acpi":true,"video":"Standard_VGA","video_mib":8,"timeoffset":"0","shadow_multiplier":1.0,"hap":true}],"build_info":{"has_hard_affinity":false,"priv":["BuildHVM",{"video_mib":8,"shadow_multiplier":1.0}],"vcpus":2,"kernel":"/usr/libexec/xen/boot/hvmloader","memory_target":2097152,"memory_max":2097152},"version":2}'
                                           resident-on ( RO): f8cc6a6c-8ff4-4e3b-9f92-b5f62bef04ed
                                              affinity ( RW): <not in database>
                                          other-config (MRW): auto_poweron: true; xo:1199c4b4: {"creation":{"date":"2024-10-28T05:11:47.183Z","template":"df1a0e64-3799-482b-aa9f-1ed713c7dac5","user":"98707372-26e6-4877-8a14-85064b5f853a"}}; base_template_name: Ubuntu Jammy Jellyfish 22.04; import_task: OpaqueRef:807d2f23-5607-4fc9-2e3f-a3e9f055e800; mac_seed: 38c38661-6a24-4b1b-63e2-86c3ff2035d3; linux_template: true; install-methods: cdrom,nfs,http,ftp
                                                dom-id ( RO): 2
                                       recommendations ( RO): <restrictions><restriction field="memory-static-max" max="1649267441664"/><restriction field="vcpus-max" max="64"/><restriction field="has-vendor-device" value="false"/><restriction field="allow-gpu-passthrough" value="1"/><restriction field="allow-vgpu" value="1"/><restriction field="allow-network-sriov" value="1"/><restriction field="supports-bios" value="yes"/><restriction field="supports-uefi" value="yes"/><restriction field="supports-secure-boot" value="yes"/><restriction max="255" property="number-of-vbds"/><restriction max="7" property="number-of-vifs"/></restrictions>
                                         xenstore-data (MRW): vm-data/mmio-hole-size: 268435456; vm-data:
                            ha-always-run ( RW) [DEPRECATED]: false
                                   ha-restart-priority ( RW):
                                                 blobs ( RO):
                                            start-time ( RO): 20250320T19:19:07Z
                                          install-time ( RO): 20241028T05:11:47Z
                                          VCPUs-number ( RO): 4
                                     VCPUs-utilisation (MRO): 0: 0.115; 1: 0.106; 2: 0.111; 3: 0.107
                                            os-version (MRO): name: Ubuntu 24.04; uname: 6.8.0-54-generic; distro: Ubuntu
                                    PV-drivers-version (MRO): major: 1; minor: 0; micro: 0; build: proto-0.4.0
                    PV-drivers-up-to-date ( RO) [DEPRECATED]: true
                                                memory (MRO):
                                                 disks (MRO):
                                                  VBDs (SRO): f0602bf4-1f5f-12f1-957b-f6c99669d98c; de3047f9-b097-e345-3a8a-77094f5f8de7
                                              networks (MRO): 0/ip: 192.168.8.86; 0/ipv4/0: 192.168.8.86; 0/ipv6/0: fe80::dc3b:cff:fef0:d3ed
                                   PV-drivers-detected ( RO): true
                                                 other (MRO): platform-feature-xs_reset_watches: 1; platform-feature-multiprocessor-suspend: 1; has-vendor-device: 0; feature-vcpu-hotplug: 1; feature-suspend: 1; feature-reboot: 1; feature-poweroff: 1; feature-balloon: 1
                                                  live ( RO): true
                            guest-metrics-last-updated ( RO): 20250320T19:19:18Z
                                   can-use-hotplug-vbd ( RO): unspecified
                                   can-use-hotplug-vif ( RO): unspecified
                              cooperative ( RO) [DEPRECATED]: true
                                                  tags (SRW):
                                             appliance ( RW): <not in database>
                                                groups ( RW):
                                     snapshot-schedule ( RW): <not in database>
                                      is-vmss-snapshot ( RO): false
                                           start-delay ( RW): 0
                                        shutdown-delay ( RW): 0
                                                 order ( RW): 0
                                               version ( RO): 0
                                         generation-id ( RO):
                             hardware-platform-version ( RO): 0
                                     has-vendor-device ( RW): false
                                       requires-reboot ( RO): false
                                       reference-label ( RO): ubuntu-22.04
                                          bios-strings (MRO): bios-vendor: Xen; bios-version: ; system-manufacturer: Xen; system-product-name: HVM domU; system-version: ; system-serial-number: ; baseboard-manufacturer: ; baseboard-product-name: ; baseboard-version: ; baseboard-serial-number: ; baseboard-asset-tag: ; baseboard-location-in-chassis: ; enclosure-asset-tag: ; hp-rombios: ; oem-1: Xen; oem-2: MS_VM_CERT/SHA1/bdbeb6e0a816d43fa6d3fe8aaef04c2bad9d3e3d
                                     pending-guidances ( RO):
                                                 vtpms ( RO):
                         pending-guidances-recommended ( RO):
                                pending-guidances-full ( RO):
                
                  andyhhp Xen Guru @AlbertK

                  @AlbertK Thanks. There's no nested-virt configured there.

                  I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

                  Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

                    AlbertK @andyhhp

                    @andyhhp Unfortunately no, I do not have another machine to test the CPU in. I have ordered another set of 2x16GB of RAM to test whether it is a RAM issue.

                    Will report back.

                      joebeasley @AlbertK

                      @AlbertK I had a similar issue where the whole server would just reboot randomly. It turned out to be a BIOS option called "cstates", which has something to do with processor power saving. I disabled every mention of cstates and have not had the reboot problems since.

                        AlbertK @joebeasley

                        @joebeasley Mine is more a case of one or more VMs rebooting by themselves, and sometimes one VM becomes inaccessible (cannot SSH in or open the console from XO; CPU at 99%, no network or disk activity as seen in XO, and it needs a forced reboot). A few hours after that, the host reboots. This is happening every day now.

                        I am seeing a lot of this in the host dmesg.

                        [105679.203854] vif vif-6-0 vif6.0: Guest Rx stalled
                        [105689.395996] vif vif-6-0 vif6.0: Guest Rx ready
                        [105707.532509] vif vif-6-0 vif6.0: Guest Rx stalled
                        [105717.555832] vif vif-6-0 vif6.0: Guest Rx ready
                        [105744.154415] vif vif-6-0 vif6.0: Guest Rx stalled
                        [105754.163666] vif vif-6-0 vif6.0: Guest Rx ready
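
To see which guests these messages belong to, it can help to tally them per interface. A minimal sketch, assuming dmesg lines shaped like the ones above: a backend interface named `vifD.N` is device N of the domain with ID D, so `vif6.0` is the first network device of domain 6. The helper name `rx_stall_counts` is made up for illustration.

```shell
# Count "Guest Rx stalled" events per virtual interface in dmesg-style
# text read from stdin. vifD.N is device N of the domain with ID D.
# rx_stall_counts is a hypothetical helper name.
rx_stall_counts() {
    awk '
        /Guest Rx stalled/ {
            # The interface name is the field ending in a colon, e.g. "vif6.0:".
            for (i = 1; i <= NF; i++)
                if ($i ~ /^vif[0-9]+\.[0-9]+:$/) {
                    name = $i
                    sub(/:$/, "", name)
                    count[name]++
                }
        }
        END {
            for (name in count) {
                # Splitting on non-digits gives part[2] = domain ID, part[3] = device.
                split(name, part, /[^0-9]+/)
                printf "%s domid=%s dev=%s stalls=%d\n", name, part[2], part[3], count[name]
            }
        }
    ' | sort
}

# Usage on the host:
#   dmesg | rx_stall_counts
```

The domain ID can then be matched against the ID column of `xl list` to find the VM that stopped draining its receive ring.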
                        
                          AlbertK @AlbertK

                          I have installed a fresh set of RAM and the system still crashes randomly. Some of the crashes leave a crash log, but some do not.

                          This happens on a daily basis: either a VM reboots or the host does. I noticed that the crash logs show a consistent pattern of SVM errors on CPU8, and once on CPU11.

                          I then removed CPU8 and CPU11 from the CPU pool. There have been no VM or host reboots for the last 7 days. Any ideas why?

                          xl cpupool-cpu-remove Pool-0 8,11
                          
                            andyhhp Xen Guru @AlbertK

                            As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incident if CPU8 is excluded.

                              olivierlambert Vates 🪐 Co-Founder CEO

                              Clearly, there are one or two damaged cores. Likely a faulty CPU, I'm afraid 😞

                                Riven

                                If you are not getting crashes on cores 0-5 (assuming they are in use by your VMs), then it's unlikely to be a physical problem.

                                The Ryzen 3600 is only a 6-core CPU; "cores" 8 & 11 are the SMT (hyperthreaded) siblings of cores 2 & 5.

                                You could also try turning SMT off

                                  AlbertK @Riven

                                  @Riven,

                                  What I am not sure about is how Xen numbers the CPUs: is it all the physical cores first, followed by the SMT/Hyper-Threading siblings, or does it alternate, i.e. real core, then its SMT sibling?
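
The numbering can be read from the host rather than guessed. A minimal sketch, assuming `xenpm get-cpu-topology` prints one row per logical CPU with the columns CPU, core, socket and node (check the header row on your own host; `xl info -n` reports similar information). It groups logical CPUs that share a socket and core, so SMT siblings end up on the same line. The helper name `smt_siblings` is made up for illustration.

```shell
# Group logical CPUs by (socket, core) from `xenpm get-cpu-topology`
# style output on stdin, so SMT siblings appear on one line.
# Assumed column order: CPU, core, socket, node.
# smt_siblings is a hypothetical helper name.
smt_siblings() {
    awk '
        # Data rows start with CPU<number>; the header row is skipped.
        $1 ~ /^CPU[0-9]+$/ {
            key = "socket " $3 " core " $2
            if (key in cpus) {
                cpus[key] = cpus[key] "," $1
            } else {
                order[++n] = key      # remember first-seen order
                cpus[key] = $1
            }
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " cpus[order[i]] }
    '
}

# Usage on the host:
#   xenpm get-cpu-topology | smt_siblings
```

Whichever line lists CPU8 and whichever lists CPU11 shows the physical cores they belong to, which tells you whether the two failing CPUs share a core with each other or with CPUs that have been running cleanly.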
